How to: Download a page from the Wayback Machine over a specified interval

Wayback Machine URLs are formatted as follows:

http://$BASEURL/$TIMESTAMP/$TARGET

Here BASEURL is usually http://web.archive.org/web (I say "usually" because I'm not sure it's the only base URL)

TARGET is self-explanatory (in your case http://nature.com, or some similar URL)

TIMESTAMP is YYYYmmddHHMMss, the time the capture was made (in UTC):

  • YYYY: Year
  • mm: Month (2 digit - 01 to 12)
  • dd: Day of month (2 digit - 01 to 31)
  • HH: Hour (2 digit - 00 to 23)
  • MM: Minute (2 digit - 00 to 59)
  • ss: Second (2 digit - 00 to 59)
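Putting the pieces together, building a capture URL is simple string concatenation. A minimal sketch (the target and timestamp below are example values):

```shell
# Build a Wayback Machine capture URL from its three parts.
BASEURL='http://web.archive.org/web'
TARGET='http://nature.com'              # the page you want
TIMESTAMP='20120101120000'              # noon UTC, 1 January 2012
URL="$BASEURL/$TIMESTAMP/$TARGET"
echo "$URL"
```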

If you request a capture time that doesn't exist, the Wayback Machine redirects to the closest capture of that URL, whether in the past or the future.

You can use that feature to get the URL closest to noon on each day: issue a curl -I (HTTP HEAD) request for each timestamp and read off the redirect:

#!/bin/bash
BASEURL='http://web.archive.org/web'
TARGET="SET_THIS"
START=1325419200 # Sun Jan  1 12:00:00 UTC 2012 (noon)
END=1356998400   # Tue Jan  1 00:00:00 UTC 2013
if uname -s | grep -q 'Darwin'; then
    DATECMD="date -u +%Y%m%d%H%M%S -r "  # BSD date: -r takes an epoch value
elif uname -s | grep -q 'Linux'; then
    DATECMD="date -u +%Y%m%d%H%M%S -d @" # GNU date: -d @ takes an epoch value
fi

while [[ $START -lt $END ]]; do
    TIMESTAMP=$(${DATECMD}$START)
    # HEAD request; strip the trailing CR from the Location header value
    REDIRECT="$(curl -sI "$BASEURL/$TIMESTAMP/$TARGET" \
        | awk 'tolower($1) == "location:" {print $2}' | tr -d '\r')"
    if [[ -z "$REDIRECT" ]]; then
        echo "$BASEURL/$TIMESTAMP/$TARGET"
    else
        echo "$REDIRECT"
    fi
    START=$((START + 86400)) # advance 24 hours
done

This gets you the URLs closest to noon on each day of 2012. Just remove the duplicates and download the pages.
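Since nearby days often resolve to the same capture, the output will contain duplicates. A sketch of the dedup step, using awk to keep the first occurrence of each URL in order (the three sample lines stand in for the loop's real output):

```shell
# Deduplicate capture URLs while preserving order; sample input below.
# In practice, pipe the output of the loop above through `awk '!seen[$0]++'`.
UNIQUE=$(printf '%s\n' \
    'http://web.archive.org/web/20120101080015/http://nature.com/' \
    'http://web.archive.org/web/20120101080015/http://nature.com/' \
    'http://web.archive.org/web/20120103110242/http://nature.com/' \
    | awk '!seen[$0]++')
echo "$UNIQUE"
```

Each remaining URL can then be fetched with something like curl -s "$url" -o "$outfile" in a read loop.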

Note: the script above could probably be improved to jump forward whenever REDIRECT points more than one day into the future, but that requires parsing the timestamp out of the returned URL and adjusting START to the matching epoch value.
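The deconstruction step might look like this: pull the 14-digit timestamp out of the redirect URL with sed, then convert it back to an epoch value so it can be compared with START. This sketch uses GNU date syntax; the redirect URL is an example value:

```shell
# Example redirect URL returned by the Wayback Machine.
REDIRECT='http://web.archive.org/web/20120105093012/http://nature.com/'

# Extract the 14-digit YYYYmmddHHMMss timestamp.
TS=$(printf '%s\n' "$REDIRECT" | sed -E 's|^.*/web/([0-9]{14})/.*$|\1|')

# Reformat as "YYYY-mm-dd HH:MM:SS" and convert to epoch (GNU date).
EPOCH=$(date -u -d "$(printf '%s\n' "$TS" \
    | sed -E 's/(.{4})(.{2})(.{2})(.{2})(.{2})(.{2})/\1-\2-\3 \4:\5:\6/')" +%s)
echo "$TS $EPOCH"
```

On macOS the last step would instead be date -j -u -f '%Y%m%d%H%M%S' "$TS" +%s, since BSD date has no -d option.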


There is also a Ruby gem on GitHub that automates this: https://github.com/hartator/wayback-machine-downloader