How to make a simple, fast request with the "requests" module in Python?

The bottleneck is the server responding slowly to simple requests.

Try doing the requests in parallel.

You can also use threads instead of asyncio. Here is a previous question explaining how to parallelise tasks in Python:

Executing tasks in parallel in python
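
For example, here is a minimal sketch of doing the requests in parallel with a thread pool (the URLs and worker count are placeholders, not from your code):

from concurrent.futures import ThreadPoolExecutor
import requests

# placeholder URLs -- substitute the pages you actually need
urls = [f'https://example.com/page/{i}' for i in range(1, 11)]

def fetch(url):
    # each worker thread spends most of its time waiting on the network
    return requests.get(url, timeout=10).text

with ThreadPoolExecutor(max_workers=10) as executor:
    # map keeps the results in the same order as the input URLs
    pages = list(executor.map(fetch, urls))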

Please note that a smartly configured server will still slow down your requests or ban you if you are scraping without permission.


Learning Python through projects such as web scraping is awesome. That is how I was introduced to Python. That said, to increase the speed of your scraping, you can do three things:

  1. Change the HTML parser to something faster. 'lxml' is the fastest parser BeautifulSoup supports, while 'html.parser' is noticeably slower. (read https://www.crummy.com/software/BeautifulSoup/bs4/doc/, and see the short snippet after this list)


  2. Drop the loops and regex, as they slow your script down. Just use the BeautifulSoup tools, .text and .strip(), and find the right tags (see my script below).

  3. Since the bottleneck in web scraping is usually IO (waiting to receive data from a webpage), using async or multithreading will boost speed. In the script below I use multithreading; the aim is to pull data from multiple pages at the same time.

So if we know the maximum number of pages, we can chunk our requests into different ranges and pull them in batches :)
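
As a quick illustration of point 1 (just a sketch; 'lxml' has to be installed separately with pip install lxml, and the URL is only an example):

import requests
from bs4 import BeautifulSoup

html = requests.get('https://www.example.com').text

# built-in parser: no extra install, but slower
soup_slow = BeautifulSoup(html, 'html.parser')

# lxml parser: usually the fastest option BeautifulSoup supports
soup_fast = BeautifulSoup(html, 'lxml')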

Code example:

from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

import requests
from bs4 import BeautifulSoup as bs

data = defaultdict(list)

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.3; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'}

def get_data(data, headers, page=1):

    # Get start time
    start_time = datetime.now()
    url = f'https://www.jobstreet.co.id/en/job-search/job-vacancy/{page}/?src=20&srcr=2000&ojs=6'
    r = requests.get(url, headers=headers)

    # If the request is fine, proceed
    if r.ok:
        jobs = bs(r.content,'lxml').find('div',{'id':'job_listing_panel'})
        data['title'].extend([i.text.strip() for i in jobs.find_all('div',{'class':'position-title header-text'})])
        data['company'].extend([i.text.strip() for i in jobs.find_all('h3',{'class':'company-name'})])
        data['location'].extend([i['title'] for i in jobs.find_all('li',{'class':'job-location'})] )
        data['desc'].extend([i.text.strip() for i in jobs.find_all('ul',{'class':'list-unstyled hidden-xs '})])
    else:
        print('connection issues')
    print(f'Page: {page} | Time taken {datetime.now()-start_time}')
    return data
    

def multi_get_data(data,headers,start_page=1,end_page=20,workers=20):
    start_time = datetime.now()
    # Execute our get_data in multiple threads each having a different page number
    with ThreadPoolExecutor(max_workers=workers) as executor:
        for i in range(start_page, end_page + 1):
            executor.submit(get_data, data=data, headers=headers, page=i)

    print(f'Pages {start_page}-{end_page} | Time taken {datetime.now() - start_time}')
    return data


# Test page 10-15
k = multi_get_data(data,headers,start_page=10,end_page=15)

Results: (screenshot of the per-page timings omitted)

Explaining the multi_get_data function:

This function calls the get_data function in different threads, passing each one the desired arguments. At the moment, each thread gets a different page number to call. The maximum number of workers is set to 20, meaning 20 threads. You can increase or decrease it accordingly.

We have created the variable data, a defaultdict whose values are lists. All threads will populate this dictionary. It can then be converted to JSON or a pandas DataFrame :)
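
For example, a minimal sketch of that conversion (assuming pandas is installed and all the lists ended up with the same length):

import json
import pandas as pd

# `data` is the defaultdict filled by the threads above
df = pd.DataFrame(data)   # columns: title, company, location, desc
as_json = json.dumps(data, ensure_ascii=False, indent=2)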

As you can see, the requests each take a little under 2 seconds, and yet the total is still under 2 seconds ;)

Enjoy web scraping.

Update: 22/12/2019

We could also gain some speed by using a Session with a single headers update, so we don't have to open a new connection and resend the headers with each call.

from requests import Session

s = Session()
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '\
                         'AppleWebKit/537.36 (KHTML, like Gecko) '\
                         'Chrome/75.0.3770.80 Safari/537.36'}
# Add headers
s.headers.update(headers)

# we can use s as we do requests
# s.get(...)
...
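
A small illustration of the point (the URL is just an example): the headers set on the session are sent automatically with every request made through it.

r = s.get('https://www.example.com')
# the session-level User-Agent was attached to the outgoing request
print(r.request.headers['User-Agent'])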

My suggestion is to write your code with good architecture: divide it into functions and write less code. Here is one example using requests:

from requests import get
from requests.exceptions import RequestException
from contextlib import closing
from bs4 import BeautifulSoup

def simple_get(url):
    """
    Attempts to get the content at `url` by making an HTTP GET request.
    If the content-type of response is some kind of HTML/XML, return the
    text content, otherwise return None.
    """
    try:
        with closing(get(url, stream=True)) as resp:
            if is_good_response(resp):
                return resp.content
            else:
                return None

    except RequestException as e:
        log_error('Error during requests to {0} : {1}'.format(url, str(e)))
        return None


def is_good_response(resp):
    """
    Returns True if the response seems to be HTML, False otherwise.
    """
    content_type = resp.headers.get('Content-Type', '').lower()
    return (resp.status_code == 200
            and 'html' in content_type)


def log_error(e):
    """
    It is always a good idea to log errors. 
    This function just prints them, but you can
    make it do anything.
    """
    print(e)
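
A short sketch of how these helpers could be used together (the URL and the tag being extracted are only examples):

raw_html = simple_get('https://www.example.com')

if raw_html is not None:
    soup = BeautifulSoup(raw_html, 'html.parser')
    # print every <h1> heading on the page
    for h1 in soup.find_all('h1'):
        print(h1.text.strip())
else:
    log_error('Could not retrieve the page')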

Debug your code at the points that take the most time, figure them out, and discuss them here if needed. That will help you a lot in solving your problem.
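
For example, a crude way to see where the time goes (just a sketch, reusing simple_get from above with an example URL):

import time

start = time.perf_counter()
raw_html = simple_get('https://www.example.com')
print(f'request took {time.perf_counter() - start:.2f}s')

# assumes the request succeeded (raw_html is not None)
start = time.perf_counter()
soup = BeautifulSoup(raw_html, 'html.parser')
print(f'parsing took {time.perf_counter() - start:.2f}s')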