Can't scrape all the company names from a webpage

UPDATE 01-04-2021

After reviewing the "fine print" in the Algolia API documentation, I discovered that the paginationLimitedTo parameter CANNOT BE USED in a query. This parameter can only be used during indexing by the data's owner.

It seems that you can combine query with offset and length this way:

payload = {"requests":[{"indexName":"YCCompany_production",
                        "params": "query=&offset=1000&length=500&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit"
                                 "%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}

Unfortunately, the paginationLimitedTo setting configured on the index by its owner will not let you retrieve more than 1000 records via the API. The response to the request above comes back empty:

    "hits": [],
    "nbHits": 2432,
    "offset": 1000,
    "length": 500,
    "message": "you can only fetch the 1000 hits for this query. You can extend the number of hits returned via the paginationLimitedTo index parameter or use the browse method. You can read our FAQ for more details about browsing: https://www.algolia.com/doc/faq/index-configuration/how-can-i-retrieve-all-the-records-in-my-index",

The browse method mentioned in that message requires an ApplicationID and the AdminAPIKey, so it is not an option for an outside scraper.
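
If you want your scraper to notice this situation instead of silently stopping at 1000 records, you could check each result for it. This is only a sketch: the helper name hit_cap_reached is made up, and it assumes the result carries the hits, nbHits and offset fields shown in the response above.

```python
def hit_cap_reached(result):
    """Return True if a result looks truncated by the pagination limit."""
    hits = result.get("hits", [])
    nb_hits = result.get("nbHits", 0)
    offset = result.get("offset", 0)
    # An empty page at an offset below nbHits means more records exist
    # than the API is willing to serve for this query.
    return not hits and offset < nb_hits


# Trimmed copy of the response shown above.
sample = {"hits": [], "nbHits": 2432, "offset": 1000, "length": 500}
print(hit_cap_reached(sample))  # True
```

In the real response the dictionary to pass in would be r.json()['results'][0].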


ORIGINAL POST

According to the Algolia API documentation, a query can return at most 1000 hits.

The documentation lists several ways to override or bypass this limit.

Part of the API is paginationLimitedTo, which by default is set to 1000 for performance and "scraping protection."

The syntax is:

'paginationLimitedTo': number_of_records

Another method mentioned in the documentation is setting the parameters offset and length.

offset specifies the starting hit (or record), counted from 0.

length sets the number of records returned.
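
A minimal sketch of how the two parameters could be put into the params string, using urllib.parse.urlencode from the standard library. The helper name build_params is made up for illustration, and the facets and tagFilters parts of the real request are left out to keep it short.

```python
from urllib.parse import urlencode


def build_params(offset, length, query=""):
    """Build the query-string style 'params' value for one request."""
    return urlencode({"query": query, "offset": offset, "length": length})


payload = {"requests": [{"indexName": "YCCompany_production",
                         "params": build_params(offset=500, length=500)}]}
print(payload["requests"][0]["params"])  # query=&offset=500&length=500
```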

You could use these parameters to walk through the records in blocks, which should keep each request small and avoid hurting your scraping performance.

For instance you could scrape in blocks of 500.

  • records 1-500 (offset=0 and length=500)
  • records 501-1000 (offset=500 and length=500)
  • records 1001-1500 (offset=1000 and length=500)
  • etc...

or

  • records 1-500 (offset=0 and length=500)
  • records 500-999 (offset=499 and length=500)
  • records 999-1498 (offset=998 and length=500)
  • etc...

The latter would produce a few duplicates (one record per block boundary), which can easily be removed when adding the records to your in-memory storage (list, dictionary, dataframe).
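
The block arithmetic for both variants can be sketched like this. The function names block_offsets and merge_hits are made up for illustration, and the de-duplication assumes every record has an 'id' field. The sketch only computes the windows and merges results; it does not change what the index allows you to fetch.

```python
def block_offsets(total, length=500, overlap=0):
    """Yield (offset, length) pairs that together cover `total` records."""
    if not 0 <= overlap < length:
        raise ValueError("overlap must be >= 0 and smaller than length")
    if total <= 0:
        return
    offset = 0
    while True:
        yield offset, length
        if offset + length >= total:
            break  # this block already reaches the last record
        offset += length - overlap


def merge_hits(blocks):
    """Merge lists of hits into one dict keyed by 'id', dropping duplicates."""
    merged = {}
    for hits in blocks:
        for hit in hits:
            merged[hit["id"]] = hit
    return merged


print(list(block_offsets(1000)))             # [(0, 500), (500, 500)]
print(list(block_offsets(1000, overlap=1)))  # [(0, 500), (499, 500), (998, 500)]
```

Each (offset, length) pair would go into one request, and the hits lists from the responses would be passed to merge_hits.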

----------------------------------------
My system information
----------------------------------------
Platform:    macOS
Python:      3.8.0
Requests:    2.25.1
----------------------------------------

As a workaround, you can simulate a search by using each letter of the alphabet as the search pattern. With the code below you will get all 2431 companies as a dictionary, with the ID as the key and the full company data dictionary as the value.

import requests
import string

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
                         'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
                         'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()  # company ID -> full company record
for letter in string.ascii_lowercase:
    print(letter)  # progress indicator

    payload = {
        "requests": [{
            "indexName": "YCCompany_production",
            "params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
        }]
    }

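    r = requests.post(url, params=params, json=payload)
    r.raise_for_status()  # stop on an HTTP error instead of parsing an error body
    # Keying by ID drops companies that match more than one letter.
    result.update({h['id']: h for h in r.json()['results'][0]['hits']})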

print(len(result))

Try an explicit limit value in the payload to override the API default. For instance, insert limit=2500 into your request string.