How to solve 403 error in scrapy

Add the following snippet to your settings.py file. This works well if you are combining Selenium with Scrapy.

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
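Note that a bare `headers` dict in settings.py is not read by Scrapy on its own; it is just an unused variable. One way to make it take effect project-wide (a minimal sketch, using Scrapy's `DEFAULT_REQUEST_HEADERS` setting, which is merged into every request the spider sends) would be:

```python
# settings.py
# DEFAULT_REQUEST_HEADERS is the setting Scrapy actually applies to outgoing
# requests; a plain `headers` variable here would be ignored.
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
}
```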

As Avihoo Mamka mentioned in the comments, you need to provide some extra request headers to avoid being rejected by this website.

In this case it seems to be just the User-Agent header. By default Scrapy identifies itself with the user agent "Scrapy/{version} (+http://scrapy.org)". Some websites reject this for one reason or another.

To avoid this, just set the headers parameter of your Request to a common user agent string:

from scrapy import Request

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
yield Request(url, headers=headers)

You can find a huge list of user agents here, though you should stick with popular web browser ones such as Firefox, Chrome, etc., for the best results.

You can implement it to work with your spider's start_urls too:

import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'http://scrapy.org',
    )

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

I just needed to get my shell working to run some quick tests, so Granitosaurus's solution was a bit of overkill for me.

I just went to settings.py, where nearly everything is commented out. Near the top you'll find something like this:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'exercise01part01 (+http://www.yourdomain.com)'

You just need to uncomment it and replace the value with a common user agent string such as 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'.

You can find a list of them here: https://www.useragentstring.com/pages/useragentstring.php

So it'll look something like this...

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'

You'll definitely want to rotate user agents if you're building a large-scale crawler. But I just needed to get my Scrapy shell working and run some quick tests without that pesky 403 error, so this one-liner sufficed. It was nice because I didn't need to write a fancy function or anything.
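Rotation doesn't have to be fancy either. A minimal sketch (the pool of user agent strings and the `random_headers` helper below are illustrative, not part of Scrapy's API) could pick a random agent per request:

```python
import random

# Illustrative pool of common desktop browser user agents (assumed strings;
# in practice you'd pull these from a maintained list).
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/601.7.7 '
    '(KHTML, like Gecko) Version/9.1.2 Safari/601.7.7',
]


def random_headers():
    """Return request headers with a user agent picked at random."""
    return {'User-Agent': random.choice(USER_AGENTS)}


# In a spider you could then do:
#     yield Request(url, headers=random_headers())
```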

Happy scrapy-ing

Note: PLEASE make sure you run scrapy shell from the same directory as settings.py so it picks up the changes you just made. It does not work if you are in a parent directory.