List and description of all packages in CRAN from within R

Edit of an almost ten-year-old accepted answer. What you likely want is not to scrape (unless you want to practice scraping) but to use an existing interface: tools::CRAN_package_db(). Example:

> db <- tools::CRAN_package_db()[, c("Package", "Description")]
> dim(db)
[1] 18978     2
> 

The function (currently) returns 66 columns, of which the two of interest here are only a small part.
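Because "Description" can span several lines, it may help to collapse its whitespace before printing. Here is a minimal sketch, assuming the input has the columns "Package", "Title" and "Description". The helper name `tidy_pkg_db` and the two-row toy data frame are made up for illustration, so the snippet runs without network access:

```r
## Hypothetical helper (not part of 'tools'): keep a few columns and
## collapse runs of whitespace, including newlines, into single spaces.
tidy_pkg_db <- function(db, cols = c("Package", "Title")) {
    out <- as.data.frame(db[, cols, drop = FALSE], stringsAsFactors = FALSE)
    out[] <- lapply(out, function(x) trimws(gsub("\\s+", " ", x)))
    rownames(out) <- NULL
    out
}

## Toy stand-in for tools::CRAN_package_db(); the rows are invented.
toy <- data.frame(Package     = c("pkgA", "pkgB"),
                  Title       = c("First  Toy Package", "Second Toy Package"),
                  Description = c("Spans\n  two lines.", "One line."),
                  stringsAsFactors = FALSE)

res <- tidy_pkg_db(toy, c("Package", "Description"))
print(res)

## With network access, the real call would be:
## res <- tidy_pkg_db(tools::CRAN_package_db())
```

The default `cols` gives the "Package" and "Title" pair; pass "Description" explicitly when the longer text is wanted.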


I actually think you want "Package" and "Title", as the "Description" can run to several lines. So here is code for the former; just put "Description" in the final subset if you really want the description:

R> ## from http://developer.r-project.org/CRAN/Scripts/depends.R and adapted
R>
R> require("tools")
R>
R> getPackagesWithTitle <- function() {
+     description <- sprintf("%s/web/packages/packages.rds", 
+                            getOption("repos")["CRAN"])
+     con <- if(substring(description, 1L, 7L) == "file://") {
+         file(description, "rb")
+     } else {
+         url(description, "rb")
+     }
+     on.exit(close(con))
+     db <- readRDS(gzcon(con))
+     rownames(db) <- NULL
+
+     db[, c("Package", "Title")]
+ }
R>
R>
R> head(getPackagesWithTitle())               # I shortened one Title here...
     Package              Title
[1,] "abc"                "Tools for Approximate Bayesian Computation (ABC)"
[2,] "abcdeFBA"           "ABCDE_FBA: A-Biologist-Can-Do-Everything of Flux ..."
[3,] "abd"                "The Analysis of Biological Data"
[4,] "abind"              "Combine multi-dimensional arrays"
[5,] "abn"                "Data Modelling with Additive Bayesian Networks"
[6,] "AcceptanceSampling" "Creation and evaluation of Acceptance Sampling Plans"
R>

Dirk has provided a terrific answer. After finishing my solution and then seeing his, I debated for some time whether to post mine, for fear of looking silly. But I decided to post it anyway for two reasons:

  1. it is informative to beginning scrapers like myself
  2. it took me a while to do and so why not :)

I approached this thinking I'd need to do some web scraping and chose crantastic as the site to scrape from. First I'll provide the code and then two scraping resources that have been very helpful to me as I learn:

library(RCurl)
library(XML)

URL <- "http://cran.r-project.org/web/checks/check_summary.html#summary_by_package"
packs <- na.omit(XML::readHTMLTable(doc = URL, which = 2, header = TRUE, 
    strip.white = TRUE, as.is = FALSE, sep = ",", na.strings = c("999", 
        "NA", " "))[, 1])
Trim <- function(x) {
    gsub("^\\s+|\\s+$", "", x)
}
packs <- unique(Trim(packs))
u1 <- "http://crantastic.org/packages/"
len.samps <- 10 #for demo purpose; use:
#len.samps <- length(packs) # for all of them
URL2 <- paste0(u1, packs[seq_len(len.samps)]) 
scraper <- function(urls){ #function to grab description
    doc   <- htmlTreeParse(urls, useInternalNodes=TRUE)
    nodes <- getNodeSet(doc, "//p")[[3]]
    return(nodes)
}
info <- sapply(seq_along(URL2), function(i) try(scraper(URL2[i]), silent = TRUE))
info2 <- sapply(info, function(x) { #replace errors with NA
    if (!inherits(x, "XMLInternalElementNode")) {
        NA
    } else {
        Trim(gsub("\\s+", " ", xmlValue(x)))
    }
})
pack_n_desc <- data.frame(package=packs[seq_len(len.samps)], 
    description=info2) #make a dataframe of it all
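
The "replace errors with NA" step above can also be folded into the download loop with tryCatch(). Here is a minimal sketch of that pattern. `safe_value` and `fake_scraper` are hypothetical names, and `fake_scraper` only simulates a page that fails to parse, so the example runs without network access:

```r
## Hypothetical wrapper: call f(x) and return NA_character_ if it errors.
safe_value <- function(x, f) {
    tryCatch(f(x), error = function(e) NA_character_)
}

## Stand-in for scraper() plus xmlValue(): fails for one made-up URL.
fake_scraper <- function(url) {
    if (grepl("broken", url, fixed = TRUE)) stop("could not parse page")
    paste("description for", basename(url))
}

urls <- c("http://example.org/packages/abc",
          "http://example.org/packages/broken",
          "http://example.org/packages/abind")

## vapply() guarantees a character vector, one element per URL.
descs <- vapply(urls, function(u) safe_value(u, fake_scraper),
                character(1), USE.NAMES = FALSE)
print(descs)
```

In the real code, a function that calls scraper() and then xmlValue() would take the place of `fake_scraper`.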

Resources:

  1. talkstats.com thread on web scraping (great beginner examples)
  2. w3schools.com site on html stuff (very helpful)

Tags:

R