How can I do web scraping in Julia?

Yes.

For the purpose of web-scraping, Julia has three libraries:

  • HTTP.jl to download the frontend source code of the website (comparable to Python's requests library),
  • Gumbo.jl to parse the downloaded source code into a hierarchically structured object,
  • and Cascadia.jl to finally scrape it using a CSS selector API.
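
As a rough sketch of how the three steps above fit together (not a definitive implementation): the helper names fetch_html and select_text and the sample HTML are made up for illustration, and the selector is only an example.

```julia
using HTTP, Gumbo, Cascadia

# Step 1 (HTTP.jl): download the page source as a String.
# Hypothetical helper; it is defined here but not called, so the sketch runs offline.
fetch_html(url::AbstractString) = String(HTTP.get(url).body)

# Steps 2 and 3 (Gumbo.jl + Cascadia.jl): parse the HTML into a tree,
# then collect the text of every node matching a CSS selector.
function select_text(html::AbstractString, css::AbstractString)
    doc = parsehtml(html)
    nodes = eachmatch(Selector(css), doc.root)
    return [strip(nodeText(n)) for n in nodes]
end

# Made-up HTML standing in for a downloaded page.
sample = """
<html><body>
  <h3 class="college">Example University</h3>
  <p class="school-location">Example City</p>
</body></html>
"""

names = select_text(sample, "h3.college")
locations = select_text(sample, ".school-location")
```

For a live page, something like select_text(fetch_html(url), "h3.college") should work the same way, provided the page really uses those class names.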

I saw from your profile that you're young (16), and your Python implementation is also correct.

Therefore, I'd suggest you try a web-scraping task with these three libraries to better understand how they work.

The task that you wish to do, unfortunately, cannot yet be accomplished with Cascadia, since the h3 is inside a <span>, which is not currently an implemented SelectorType in Cascadia.jl.
Source


Your Python code doesn't quite work. I guess the website has been updated recently, since they have removed the links as far as I can tell. Here is a similar example using Gumbo.jl and Cascadia.jl.

I am using the built-in download command to download the webpage, which writes it to disk as a temp file that I then read into a String. It might be cleaner to use HTTP.jl, which could read it straight into a String, but for this simple example it's fine.

using Gumbo
using Cascadia

url = "https://thebestschools.org/features/best-computer-science-programs-in-the-world/"

# download writes the page to a temp file and returns its path; read it back and parse it
page = parsehtml(read(download(url), String))


college_name = String[]
college_location = String[]


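# Each college appears to live in its own <section> element
sections = eachmatch(sel"section", page.root)
for section in sections
    # Skip sections that do not contain a college heading
    maybe_col_heading = eachmatch(sel"h3.college", section)
    if isempty(maybe_col_heading)
        continue
    end
    col_heading = first(maybe_col_heading)

    # The college name is the last child node of the heading
    name = strip(text(last(col_heading.children)))
    push!(college_name, name)

    # The location is the first child (a text node) of the .school-location element
    loc = first(eachmatch(sel".school-location", section))
    push!(college_location, text(loc[1]))
end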


[college_name college_location]

Outputs

julia> [college_name college_location]
51×2 Array{String,2}:
 "Massachusetts Institute of Technology (MIT)"  "Cambridge, Massachusetts"
 "Massachusetts Institute of Technology (MIT)"  "Cambridge, Massachusetts"
 "Stanford University"                          "Stanford, California"
 "Carnegie Mellon University"                   "Pittsburgh, Pennsylvania"
 ⋮

 "Shanghai Jiao Tong University"                "Shanghai, China"
 "Lomonosov Moscow State University"            "Moscow, Russia"
 "City University of Hong Kong"                 "Hong Kong"

Seems like it listed MIT twice. Probably the filtering code in my demo isn't quite right. But :shrug: MIT is a great university, I hear. Julia was invented there :joy:
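
One possible workaround for the duplicate row, sketched under the assumption that the same heading is simply being matched more than once (the actual cause isn't established here): skip any college name that has already been recorded. The scraped vector below is hypothetical stand-in data taken from the output above, and seen is a helper introduced only for this sketch.

```julia
# Stand-in for the (name, location) pairs produced by the scraping loop,
# with the first entry repeated as in the output above.
scraped = [
    ("Massachusetts Institute of Technology (MIT)", "Cambridge, Massachusetts"),
    ("Massachusetts Institute of Technology (MIT)", "Cambridge, Massachusetts"),
    ("Stanford University", "Stanford, California"),
]

college_name = String[]
college_location = String[]
seen = Set{String}()   # names recorded so far

for (name, loc) in scraped
    name in seen && continue   # skip a college that was already recorded
    push!(seen, name)
    push!(college_name, name)
    push!(college_location, loc)
end

result = [college_name college_location]
```

In the real loop, the same check could go right before push!(college_name, name). This only hides the symptom; if two different colleges ever shared a name, one of them would be dropped.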