Download HTTPS website available only through username and password with wget?

The session information is probably saved in a cookie to allow you to navigate to other pages after you have logged in.

If this is the case, you can do it in two steps:

Use wget's --save-cookies mycookies.txt and --keep-session-cookies options on the login page of the website, along with the --user and --password options (wget has no --username option).
Use wget's --load-cookies mycookies.txt option on the subsequent pages you are trying to retrieve.
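
As a rough sketch of those two steps, assuming the site accepts HTTP authentication: the example.com addresses and both function names are placeholders, not something taken from the question.

```shell
# Step 1: request the login page and keep every cookie, including session
# cookies, which wget does not save unless --keep-session-cookies is given.
# Hypothetical helper. Arguments: login URL, user name, password.
login_and_save_cookies() {
  wget --save-cookies mycookies.txt --keep-session-cookies \
       --user="$2" --password="$3" \
       -O /dev/null "$1"
}

# Step 2: send the saved cookies back when fetching a protected page.
# Hypothetical helper. Argument: URL of the page to retrieve.
fetch_with_cookies() {
  wget --load-cookies mycookies.txt "$1"
}
```

You would then call login_and_save_cookies 'https://example.com/login' myuser mypassword once, followed by fetch_with_cookies 'https://example.com/private/page.html' for each page you need.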

EDIT

If the --user and --password options don't work (they cover HTTP authentication, not HTML login forms), you must find out what the login page sends to the server and mimic it:

For a GET request, you can add the GET parameters directly to the URL wget fetches (quote the URL so the shell does not interpret &, = and other special characters). The URL would probably look something like https://the_url?user=foo&pass=bar.
For a POST request, you can use wget's --post-data=the_needed_info option to send the login info with the POST method.
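
Both variants might look roughly like this. The URL and the user/pass field names are placeholders (the real names come from the login form's HTML), and urlencode is a hypothetical helper, not part of wget. It percent-encodes every byte, which is more than strictly necessary but keeps a & or = inside a value from being read as a separator.

```shell
# Hypothetical helper: percent-encode every byte of its argument.
urlencode() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n' | sed 's/../%&/g'
}

# GET: the parameters are part of the URL. The double quotes stop the shell
# from treating & as "run in the background".
login_get() {
  wget "https://example.com/login?user=$(urlencode "$1")&pass=$(urlencode "$2")"
}

# POST: the same key=value pairs travel in the request body instead.
login_post() {
  wget --post-data="user=$(urlencode "$1")&pass=$(urlencode "$2")" \
       'https://example.com/login'
}
```

Combine either function with --save-cookies and --keep-session-cookies if the session is tracked through a cookie.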

EDIT 2

It seems that you do indeed need the POST method with j_username and j_password set. Try adding the --post-data='j_username=yourusername&j_password=yourpassword' option to wget.

EDIT 3

With the page of origin, I was able to understand a little more of what is happening. However, I cannot verify that it works because I don't have (nor do I want) valid credentials.

Here is what's happening:

The page https://progtest.fit.cvut.cz/ sets a PHPSESSID cookie and presents you with login options.
Clicking the login button sends a request to https://progtest.fit.cvut.cz/shibboleth-fit.php, which takes the PHPSESSID cookie (not sure if it uses it) and redirects you to the SSO engine with a specially crafted URL just for you, which looks like this: https://idp2.civ.cvut.cz/idp/profile/SAML2/Redirect/SSO?SAMLRequest=SOME_VERY_LONG_AND_UNIQUE_ID
The SSO response sets a new cookie named _idp_authn_lc_key and redirects you to the page https://idp2.civ.cvut.cz:443/idp/AuthnEngine which redirects you again to https://idp2.civ.cvut.cz:443/idp/Authn/UserPassword (the real login page)
You enter your credentials and send the post data j_username and j_password along with the cookie from the SSO response
???

The first four steps can be done with wget like this :

origin='https://progtest.fit.cvut.cz/'

# Get the PHPSESSID cookie
wget --save-cookies phpsid.cki --keep-session-cookies "$origin"

# Get the _idp_authn_lc_key cookie
wget --load-cookies phpsid.cki  --save-cookies sso.cki --keep-session-cookies --header="Referer: $origin" 'https://progtest.fit.cvut.cz/shibboleth-fit.php'

# Send your credentials
wget --load-cookies sso.cki --save-cookies auth.cki --keep-session-cookies --post-data='j_username=usr&j_password=pwd' 'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'

Note that wget follows redirections by itself, which helps quite a bit in this case.
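
Since none of this could be tested with real credentials, it may help to check after each step that the expected cookie was actually stored. A minimal sketch, assuming the cookie files use the Netscape cookies.txt layout that wget writes (tab-separated, with the cookie name in the sixth field). has_cookie is a hypothetical helper, not a wget feature.

```shell
# Hypothetical helper: succeed if a cookie with the given name is present
# in a Netscape-format cookie file. Arguments: cookie file, cookie name.
has_cookie() {
  # Fields: domain, subdomain flag, path, secure flag, expiry, name, value.
  # Lines starting with "#" are comments and are skipped.
  awk -F '\t' -v n="$2" '!/^#/ && $6 == n { found = 1 } END { exit !found }' "$1"
}
```

For example, has_cookie sso.cki _idp_authn_lc_key || echo 'SSO cookie missing' >&2 could run before the step that sends the credentials.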

Why are you playing around with wget? It would be better to use a headless browser to automate this task.

What is a headless browser, you ask?

A headless browser is a web browser without a graphical user interface. Headless browsers provide automated control of a web page in an environment similar to popular web browsers, but are driven from a command line interface or over network communication.

Two popular headless browsers are PhantomJS (JavaScript) and Ghost.py (Python).

Solution using phantomjs

First you will need to install phantomjs. On Ubuntu-based systems, you can install it using the package manager, or you can build it from the source available on its home page.

sudo apt-get install phantomjs

After this, write a JavaScript script and run it using phantomjs:

phantomjs script.js

That's it.

Now, to learn how to implement it for your case, head over to its quickstart guide. As an example, to log in to Facebook automatically and take a snapshot, one could use the following gist:

// This script logs in to your Facebook account and takes a snapshot of the page.
var page = require('webpage').create();

// Runs inside the page: fill in the login form and submit it.
var fillLoginInfo = function () {
    var frm = document.getElementById("login_form");
    frm.elements["email"].value = 'your fb email/username';
    frm.elements["pass"].value = 'password';
    frm.submit();
};

page.onLoadFinished = function () {
    // On the login page, submit the credentials and wait for the next load.
    if (page.title == "Welcome to Facebook - Log In, Sign Up or Learn More") {
        page.evaluate(fillLoginInfo);
        return;
    }
    // On any other page, take the snapshot and quit.
    page.render('./screens/some.png');
    console.log("completed");
    phantom.exit();
};

page.open('https://www.facebook.com/');

Look through the documentation to implement it for your specific case. If you run into SSL errors with your HTTPS website, run your script like this:

phantomjs --ssl-protocol=any script.js

Solution using Ghost.py

To install Ghost.py, you will need pip:

sudo apt-get install python-pip   # On a Debian-based system
sudo pip install Ghost.py

Ghost.py is now installed. To use it inside a Python script, follow the documentation on its home page. I tried using Ghost.py on an HTTPS website, but it somehow didn't work for me. Do try it and see if it works for you.

UPDATE : GUI based solution

You can also use tools like Selenium to automate the login process and retrieve the information. It is pretty easy to use: you just need to install the Selenium plugin for your browser, and then you can record your process and replay it later on.

Try using 'curl'

curl --data "j_username=value1&j_password=value2" https://idp2.civ.cvut.cz/idp/Authn/UserPassword

You may need to look at the response type and set the 'Content-Type' header to match, e.g. XML or JSON.
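
If the login depends on cookies and redirects, as described in the first answer, the curl call probably needs a cookie jar as well. A hedged sketch, untested against the real site: curl_login and the cookies.txt file name are made up for illustration, while the form field names and the URL are the ones from above.

```shell
# Hypothetical wrapper around curl. Arguments: user name, password.
curl_login() {
  # -L follows redirects, -c writes received cookies to the jar,
  # -b sends the cookies already in the jar, and --data-urlencode
  # percent-encodes each value and makes the request a POST.
  curl -L -c cookies.txt -b cookies.txt \
       --data-urlencode "j_username=$1" \
       --data-urlencode "j_password=$2" \
       'https://idp2.civ.cvut.cz/idp/Authn/UserPassword'
}
```

Later requests can reuse the session by passing -b cookies.txt again.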
