How to use grep and cut in script to obtain website URLs from an HTML file

As I said in my comment, it's generally not a good idea to parse HTML with Regular Expressions, but you can sometimes get away with it if the HTML you're parsing is well-behaved.
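If you have a real HTML parser available, it's worth using even from a shell pipeline. As a rough sketch, assuming xmllint (from libxml2) is installed, you can let it do the actual HTML parsing and only run grep over its structured output:

# xmllint parses the HTML and selects the href attributes of <a> elements;
# stderr is silenced because xmllint warns loudly about real-world HTML.
xmllint --html --xpath '//a/@href' source.html 2>/dev/null |
grep -Eo 'https?://[^"]+'

If you want to stick with plain grep, read on.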

To get only the URLs that appear in the href attribute of <a> elements, I find it easiest to work in multiple stages. From your comments, it looks like you only want the base URL (scheme and hostname), not the full URL. In that case you can use something like this:

grep -Eoi '<a [^>]+>' source.html |   # isolate the <a> tags
grep -Eo 'href="[^"]+"' |             # keep only the href="..." attributes
grep -Eo '(http|https)://[^/"]+'      # keep only the scheme and hostname

where source.html is the file containing the HTML code to parse.

This will print the scheme and hostname of every URL that occurs in the href attribute of an <a> element. The -i option to the first grep ensures that it works on both <a> and <A> elements. I guess you could also give -i to the 2nd grep to capture upper-case HREF attributes; on the other hand, I'd prefer to ignore such broken HTML. :)
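For completeness, that case-insensitive variant would look like this:

grep -Eoi '<a [^>]+>' source.html |
grep -Eoi 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+'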

To process the contents of http://google.com/ directly:

wget -qO- http://google.com/ |
grep -Eoi '<a [^>]+>' |
grep -Eo 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+'

Output:

http://www.google.com.au
http://maps.google.com.au
https://play.google.com
http://www.youtube.com
http://news.google.com.au
https://mail.google.com
https://drive.google.com
http://www.google.com.au
http://www.google.com.au
https://accounts.google.com
http://www.google.com.au
https://www.google.com
https://plus.google.com
http://www.google.com.au

My output is a little different from the other examples as I get redirected to the Australian Google page.
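If you'd rather not see those repeats, append sort -u to the pipeline to de-duplicate the results:

wget -qO- http://google.com/ |
grep -Eoi '<a [^>]+>' |
grep -Eo 'href="[^"]+"' |
grep -Eo '(http|https)://[^/"]+' |
sort -u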


I'm not sure if you are limited in which tools you can use. As mentioned, regex might not be the best way to go, but here is an example I put together:

grep -Eo "(http|https)://[a-zA-Z0-9./?=_%:-]*" urls.html | sort -u
  • grep -E : the same as egrep (extended regular expressions)
  • grep -o : print only the part of the line that matches
  • (http|https) : match either http or https
  • a-z : all lower-case letters
  • A-Z : all upper-case letters
  • 0-9 : all digits
  • . : a literal dot (inside a bracket expression, . is not special)
  • / : a literal slash
  • ? : a literal question mark
  • = : a literal equals sign
  • _ : a literal underscore
  • % : a literal percent sign
  • : : a literal colon
  • - : a literal dash (placed last in the bracket expression so it isn't read as a range)
  • * : repeat the preceding [...] bracket expression zero or more times
  • sort -u : sort the results and remove duplicates

Output:

bob@bob-NE722:~$ wget -qO- https://stackoverflow.com/ | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | sort -u
https://stackauth.com
https://meta.stackoverflow.com
https://cdn.sstatic.net/Img/svg-icons
https://stackoverflow.com
https://www.stackoverflowbusiness.com/talent
https://www.stackoverflowbusiness.com/advertising
https://stackoverflow.com/users/login?ssrc=head
https://stackoverflow.com/users/signup?ssrc=head
https://stackoverflow.com
https://stackoverflow.com/help
https://chat.stackoverflow.com
https://meta.stackoverflow.com
...

You could also use \d to match digits, but note that \d is a Perl (PCRE) escape, not part of grep -E's extended regex syntax; with -E, use [0-9] (which the bracket expression above already includes).
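For example, with grep -P (available when GNU grep is built with PCRE support) you can use \d for digits, or \w for letters, digits, and underscores. A rough sketch:

wget -qO- https://stackoverflow.com/ | grep -Po 'https?://[\w.\-/?=%:]+' | sort -u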


If your grep supports Perl regexes:

grep -Po '(?<=href=")[^"]*(?=")'
  • (?<=href=") and (?=") are lookaround expressions for the href attribute. This needs the -P option.
  • -o prints the matching text.

For example:

$ curl -sL https://www.google.com | grep -Po '(?<=href=")[^"]*(?=")'
/search?
https://www.google.co.in/imghp?hl=en&tab=wi
https://maps.google.co.in/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=IN&tab=w1
https://news.google.co.in/nwshp?hl=en&tab=wn
...
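
This also captures relative URLs such as /search?. If you only want absolute URLs, one variant of the same lookaround approach anchors the match on the scheme:

curl -sL https://www.google.com | grep -Po '(?<=href=")https?://[^"]*(?=")'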

As usual, there's no guarantee that these are valid URIs, or that the HTML you're parsing will be valid.