Regular expression for parsing links from a webpage?

From the RegexBuddy library:

URL: Find in full text

The final character class makes sure that if a URL is part of some text, punctuation such as a comma or full stop after the URL is not interpreted as part of the URL.

\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]
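
As a minimal sketch of how this pattern might be applied in .NET: the sample text and variable names below are illustrative assumptions, and RegexOptions.IgnoreCase is assumed to be needed because the character classes list only uppercase letters.

```csharp
using System;
using System.Text.RegularExpressions;

// The pattern quoted above. IgnoreCase is an assumption: the classes only list A-Z,
// so lowercase URLs would not match without it.
var urlPattern = new Regex(
    @"\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]",
    RegexOptions.IgnoreCase);

// Hypothetical sample text with punctuation directly after each URL.
string sampleText = "See http://example.com/page?id=1, or ftp://files.example.org/pub.";

foreach (Match match in urlPattern.Matches(sampleText))
{
    // The final character class excludes the trailing comma and full stop.
    Console.WriteLine(match.Value);
}
// prints:
// http://example.com/page?id=1
// ftp://files.example.org/pub
```

The greedy middle class first swallows the trailing punctuation, then the engine backtracks one character so that the final class can match, which is what drops the comma and the full stop.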

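With Html Agility Pack, you can use:

using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// Select every <a> element that has an href attribute.
// SelectNodes returns null when nothing matches.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        Response.Write(link.Attributes["href"].Value);
    }
}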

((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)

I took this from regexlib.com.

[editor's note: the {1} has no real function in this regex; see this post]
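
A small sketch that checks the editor's note, under the assumption of a made-up sample string: the pattern is run once with {1} and once without, and both produce the same matches. Variable names are illustrative only.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The regexlib.com pattern as quoted, and the same pattern with {1} removed.
string withQuantifier = @"((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)";
string withoutQuantifier = @"((mailto\:|(news|(ht|f)tp(s?))\://)\S+)";

// Hypothetical sample text.
string sample = "Mail mailto:someone@example.com or visit https://example.com/docs.";

var first = Regex.Matches(sample, withQuantifier).Cast<Match>().Select(m => m.Value).ToList();
var second = Regex.Matches(sample, withoutQuantifier).Cast<Match>().Select(m => m.Value).ToList();

foreach (string value in first)
{
    Console.WriteLine(value);
}
// prints:
// mailto:someone@example.com
// https://example.com/docs.

Console.WriteLine(first.SequenceEqual(second)); // prints: True
```

Because this pattern ends in \S+, it runs up to the next whitespace, so the full stop after the last URL is included in the match.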

Tags:

Html

.Net

Regex
