How to scrape a website that requires login, using Python and BeautifulSoup?
You can use mechanize:

import mechanize
import http.cookiejar as cookielib  # "cookielib" on Python 2

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)

br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)  # the login form is the first form on the page
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

print(br.response().read())
Or urllib - Login to website using urllib2
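The urllib approach can be sketched with the standard library alone. This is a minimal, illustrative sketch: the login URL and the form field names below are placeholders, so inspect the real login form to find the actual values.

```python
import urllib.request
import urllib.parse
import http.cookiejar

# Cookie jar so the session cookie set by the login response is
# sent automatically on every later request through this opener.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Hypothetical login endpoint and form field names.
login_url = "https://example.com/login"
form_data = urllib.parse.urlencode({
    "username": "username",
    "password": "password",
}).encode()

# opener.open(login_url, form_data) would POST the form; afterwards
# opener.open("https://example.com/private") reuses the stored cookies.
```

The key point is that the same opener (and therefore the same cookie jar) must be used for the login POST and for every request that follows it.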
There is a simpler way, in my view, that gets you there without Selenium or mechanize or other third-party tools, albeit semi-automated.
Basically, when you log in to a site in the normal way, you identify yourself uniquely using your credentials, and the same identity is then used for every other interaction; it is stored in cookies and headers for a brief period of time.
What you need to do is send those same cookies and headers with your HTTP requests, and you're in.
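As a minimal sketch of reusing a captured identity with only the standard library (the cookie and header values here are invented placeholders, not real credentials):

```python
import urllib.request

# Values copied from the browser's developer tools after logging in.
# These are made-up examples; use the ones from your own session.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "sessionid=abc123; csrftoken=xyz789",
}

req = urllib.request.Request("https://example.com/private", headers=headers)
# urllib.request.urlopen(req) would now be treated as the logged-in user,
# as long as the session cookie has not expired.
```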
To replicate that, follow these steps:
- In your browser, open the developer tools
- Go to the site and log in
- After logging in, go to the Network tab and refresh the page
- At this point, you should see a list of requests, the top one being the actual site; that is our focus, because it contains the identity data we can use with Python and BeautifulSoup to scrape the page
- Right-click the site request (the top one), hover over "Copy", and then choose "Copy as cURL"
- Then go to this site, which converts cURL into Python requests code: https://curl.trillworks.com/
- Take the generated Python code and use its cookies and headers to proceed with the scraping
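The converter's output will look roughly like the sketch below (the cookie and header values are placeholders; paste the dicts generated from your own browser session). Using a requests.Session keeps the identity attached to every request:

```python
import requests

# Placeholders for the dicts produced by the cURL converter.
cookies = {"sessionid": "abc123"}
headers = {"User-Agent": "Mozilla/5.0"}

session = requests.Session()
session.cookies.update(cookies)
session.headers.update(headers)

# session.get("https://example.com/private") now sends the same identity
# the browser used; parse the result with BeautifulSoup, e.g.:
#   soup = BeautifulSoup(response.text, "html.parser")
```

Note that the captured cookies eventually expire, at which point you repeat the copy-as-cURL step to refresh them.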