How to scrape a website that requires login, using Python and BeautifulSoup?
You can use mechanize:

import mechanize
import http.cookiejar as cookielib  # "cookielib" on Python 2

cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)

br.open("https://id.arduino.cc/auth/login/")
br.select_form(nr=0)  # the login form is the first form on the page
br.form['username'] = 'username'
br.form['password'] = 'password'
br.submit()

print(br.response().read())
Or urllib - Login to website using urllib2
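The urllib approach can be sketched with the standard library alone. This is a minimal, illustrative sketch: the login URL and the form field names below are placeholders, so inspect the real login form to find the actual values.

```python
import urllib.request
import urllib.parse
import http.cookiejar

# Cookie jar so the session cookie set by the login response is
# sent automatically on every later request through this opener.
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Hypothetical login endpoint and form field names.
login_url = "https://example.com/login"
form_data = urllib.parse.urlencode({
    "username": "username",
    "password": "password",
}).encode()

# opener.open(login_url, form_data) would POST the form; afterwards
# opener.open("https://example.com/private") reuses the stored cookies.
```

The key point is that the same opener (and therefore the same cookie jar) must be used for the login POST and for every request that follows it.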
There is a simpler way, in my view, that gets you there without Selenium or mechanize or other third-party tools, albeit semi-automated.
Basically, when you log in to a site in the normal way, you identify yourself uniquely using your credentials, and the same identity is then used for every other interaction; it is stored in cookies and headers for a brief period of time.
What you need to do is send those same cookies and headers with your HTTP requests, and you're in.
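As a minimal sketch of reusing a captured identity with only the standard library (the cookie and header values here are invented placeholders, not real credentials):

```python
import urllib.request

# Values copied from the browser's developer tools after logging in.
# These are made-up examples; use the ones from your own session.
headers = {
    "User-Agent": "Mozilla/5.0",
    "Cookie": "sessionid=abc123; csrftoken=xyz789",
}

req = urllib.request.Request("https://example.com/private", headers=headers)
# urllib.request.urlopen(req) would now be treated as the logged-in user,
# as long as the session cookie has not expired.
```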
To replicate that, follow these steps:
- In your browser, open the developer tools
- Go to the site and log in
- After logging in, go to the Network tab and refresh the page
- At this point, you should see a list of requests, the top one being the actual site; that is our focus, because it contains the identity data we can use with Python and BeautifulSoup to scrape the page
- Right-click the site request (the top one), hover over "Copy", and then choose "Copy as cURL"
- Then go to this site, which converts cURL into Python requests code: https://curl.trillworks.com/
- Take the generated Python code and use its cookies and headers to proceed with the scraping
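The converter's output will look roughly like the sketch below (the cookie and header values are placeholders; paste the dicts generated from your own browser session). Using a requests.Session keeps the identity attached to every request:

```python
import requests

# Placeholders for the dicts produced by the cURL converter.
cookies = {"sessionid": "abc123"}
headers = {"User-Agent": "Mozilla/5.0"}

session = requests.Session()
session.cookies.update(cookies)
session.headers.update(headers)

# session.get("https://example.com/private") now sends the same identity
# the browser used; parse the result with BeautifulSoup, e.g.:
#   soup = BeautifulSoup(response.text, "html.parser")
```

Note that the captured cookies eventually expire, at which point you repeat the copy-as-cURL step to refresh them.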