Get title of a webpage only?

As far as I am aware, HTTP doesn't support requesting the <title> element specifically. A client can either GET the entire page or use HEAD to request only the HTTP response headers (which notably do not include the HTML <head> element, where <title> is contained). It is also sometimes possible, as explained in WReach's answer, to request only specific byte ranges of the page. For more information, see Mozilla's page on HTTP methods.

Since the relevant standards provide no mechanism for this, I would assume that there is no speed benefit to be had in trying to Import the title directly. Some simple testing on my machine agrees with this hypothesis:

AbsoluteTiming[Import["https://wolfram.com", "Title"]]

{1.51025, "Wolfram: Computation Meets Knowledge"}

AbsoluteTiming[URLRead["https://wolfram.com"];]

{1.47355, Null}

The difference in lookup times is negligible. It likely falls within the run-to-run variation of the internet connection, plus the cost of a little string manipulation.

That said, kglr's suggestion of Import[...,"Title"] is an exceedingly useful syntax, and there's equally little reason not to use it and save yourself some string parsing.

The HTTP protocol defines the Range header that can be used to limit the number of bytes returned in a response. Web sites are not required to support this header -- but if the target web site does then we may be able to use it to good effect.

We can tell if a web site claims to support the feature by looking for the Accept-Ranges response header from a HEAD request.

acceptRanges[url_] :=
  HTTPRequest[url, <|"Method" -> "HEAD"|>] //
  URLRead[#, "Headers"]& //
  Lookup[#, "accept-ranges", "none"] &

Here are the results for some web sites:

{"www.imgur.com", "en.wikipedia.org", "www.wolfram.com"} //
AssociationMap[acceptRanges] //
Dataset

(Image: table showing acceptRanges results)

We can define a function to retrieve the first bytes from a URL:

fetchFirstBytes[bytes_][url_] :=
  URLRead[
    HTTPRequest[url, <|"Headers" -> <|"Range" -> "bytes=0-" <> ToString[bytes - 1]|>|>]
  , "Body"
  ]

For example:

$prefix = "https://www.imgur.com" // fetchFirstBytes[1024];

StringLength[$prefix]
(* 1024 *)

There is a good chance that the HTML title element will fall within the first kilobyte of the response (but adjust this guess to suit your taste). Since the content has been unceremoniously truncated, we cannot rely upon it containing well-formed HTML. We will have to extract the title using a string search:

$prefix // StringCases["<title>" ~~ title__ ~~ "</title>" :> title] // First // StringTrim
(* "Imgur: The magic of the Internet" *)
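The string search above assumes a lowercase <title> tag without attributes, and First will complain if the closing tag falls outside the truncated prefix. As a minimal sketch of a more forgiving variant, the helper below (extractTitle is a hypothetical name, not part of the original answer) ignores case, tolerates attributes on the tag, and returns Missing["NotFound"] when no complete title element is present:

```wolfram
(* Hypothetical helper: extract the title from a possibly truncated HTML prefix. *)
(* Assumes the title, if present, is a single <title ...>...</title> element.   *)
extractTitle[html_String] :=
  StringCases[
    html,
    (* Shortest stops the match at the first ">" and the first closing tag *)
    "<title" ~~ Shortest[___] ~~ ">" ~~ Shortest[title___] ~~ "</title>" :>
      StringTrim[title],
    1,                      (* only the first match is needed *)
    IgnoreCase -> True      (* accept <TITLE>, <Title>, ... *)
  ] // First[#, Missing["NotFound"]] &

extractTitle["<html><head><TITLE lang=\"en\"> Example </TITLE>"]
(* "Example" *)

extractTitle["<html><head><title>Trunc"]
(* Missing["NotFound"] *)
```

A Missing result is a hint that the prefix was too short, in which case one option is to retry with a larger byte count.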

There is generally no harm in supplying the Range header unconditionally. If a web site does not support it, we will still get a response -- albeit the full content and no optimization benefit:

"www.wolfram.com" // fetchFirstBytes[1024] //
  StringCases["<title>"~~title__~~"</title>" :> title] // First // StringTrim

(* "Wolfram: Computation Meets Knowledge" *)

Beware that some sites will claim to support ranges but still return the entire response despite the presence of a Range header. There are a number of possible reasons for this, such as content delivery networks, intermediate proxies or web application implementation details. So the bottom line is that we will need good luck to get the optimization. But we might as well try it and see.
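One way to see whether a given server actually applied the range is to look at the response status code: HTTP specifies 206 (Partial Content) for a satisfied range request, while a server that ignores the header normally answers 200 with the full body. The following is only a sketch; rangeHonored is a hypothetical name, and it assumes the URL responds without redirects or errors getting in the way:

```wolfram
(* Hypothetical helper: True if the server answered the Range request with 206. *)
rangeHonored[url_String, bytes_Integer?Positive] :=
  Module[{resp},
    resp = URLRead[
      HTTPRequest[url,
        <|"Headers" -> <|"Range" -> "bytes=0-" <> ToString[bytes - 1]|>|>]
    ];
    (* 206 Partial Content: range applied; 200 OK: the full content was sent *)
    resp["StatusCode"] === 206
  ]

rangeHonored["https://www.imgur.com", 1024]
```

The result depends on the site and on whatever proxies sit in between, so it may change from one day to the next.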

Import["http://wolfram.com", "Title"]

"Wolfram: Computation Meets Knowledge"

Import["https://mathematica.stackexchange.com/questions", "Title"]

"Newest Questions - Mathematica Stack Exchange"
