Getting a dump of arXiv metadata

My main confusion was not realizing that the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a separate protocol, not a subset of the arXiv API.

In this case, the relevant queries are ListIdentifiers (10k items per query) and ListRecords (1k items per query). To get just identifiers we need to write:

http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc

It results in 10k identifiers in the following form:

<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2015-02-16T19:28:22Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc" set="math">http://export.arxiv.org/oai2</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:arXiv.org:0704.0002</identifier>
      <datestamp>2008-12-13</datestamp>
      <setSpec>math</setSpec>
    </header>
    ...
    <header>
      <identifier>oai:arXiv.org:0712.1769</identifier>
      <datestamp>2011-06-23</datestamp>
      <setSpec>math</setSpec>
    </header>
    <resumptionToken cursor="0" completeListSize="249546">760571|10001</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>

As there are more results, to get the next batch we need to pass the resumptionToken from the previous response, in this case:

http://export.arxiv.org/oai2?verb=ListIdentifiers&resumptionToken=760571|10001

and so on.
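
The token-following loop can be sketched in Python roughly as follows. This is only a sketch under a few assumptions: the function names (fetch, parse_page, harvest_identifiers) are mine, an empty or missing resumptionToken element is taken to mean the last page (as the OAI-PMH spec describes), and the 25-second default pause is just the delay mentioned in this answer, not an official limit.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "http://export.arxiv.org/oai2"
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def fetch(params):
    """Perform one OAI-PMH request and return the raw response body."""
    url = BASE_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return response.read()


def parse_page(xml_text):
    """Return (identifiers, resumption_token) for one ListIdentifiers page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    has_token = token_el is not None and token_el.text and token_el.text.strip()
    return ids, (token_el.text.strip() if has_token else None)


def harvest_identifiers(first_params, fetch_page=fetch, delay_seconds=25):
    """Yield identifiers page by page, following resumption tokens."""
    params = dict(first_params, verb="ListIdentifiers")
    while True:
        ids, token = parse_page(fetch_page(params))
        yield from ids
        if not token:
            return  # no token left: this was the last page
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": token}
        time.sleep(delay_seconds)  # be polite to the server
```

Usage would be something like `for ident in harvest_identifiers({"set": "math", "metadataPrefix": "oai_dc"}): print(ident)`. The fetch_page parameter exists only so the loop can be tried against canned responses without touching the network.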

Other useful parameters are from and until, e.g. as in

http://export.arxiv.org/oai2?verb=ListIdentifiers&set=math&metadataPrefix=oai_dc&from=2015-01-14&until=2015-01-14
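
If you build such URLs from a script, something like the sketch below should do. The helper name list_identifiers_url is my own invention; the parameters go into a dict because `from` is a reserved word in Python and cannot be used as a keyword argument.

```python
from urllib.parse import urlencode

BASE_URL = "http://export.arxiv.org/oai2"


def list_identifiers_url(set_spec, from_date, until_date, metadata_prefix="oai_dc"):
    """Build a ListIdentifiers URL restricted to a date range (YYYY-MM-DD)."""
    params = {
        "verb": "ListIdentifiers",
        "set": set_spec,
        "metadataPrefix": metadata_prefix,
        "from": from_date,
        "until": until_date,
    }
    return BASE_URL + "?" + urlencode(params)


print(list_identifiers_url("math", "2015-01-14", "2015-01-14"))
```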

To directly get categories (bear in mind that set=math specifies mathematics, but there are no smaller subsets), one can write:

http://export.arxiv.org/oai2?verb=ListRecords&set=math&from=2015-01-01&until=2015-01-31&metadataPrefix=arXiv

It's important to set metadataPrefix=arXiv, so that subdisciplines will be listed:

<categories>
  math-ph cond-mat.other math.MP nlin.CD physics.class-ph
</categories>
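
To pull those category lists out of a ListRecords response, a minimal Python sketch could look like this. The function names are mine, and matching on the local tag name (ignoring the XML namespace) is a shortcut I chose so the code does not depend on the exact namespace URI of the arXiv metadata format.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_categories(xml_text):
    """Return one list of category strings per <categories> element found."""
    root = ET.fromstring(xml_text)
    result = []
    for el in root.iter():
        if local_name(el.tag) == "categories" and el.text:
            # Categories are whitespace-separated inside the element.
            result.append(el.text.split())
    return result
```

From there, filtering for a subdiscipline is a plain list operation, e.g. keeping only records whose list contains a category starting with `math.`.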

EDIT:

I used a delay between requests, as Nate Eldredge suggested; in my case, 25 s. Still, while trying to get all of math (250k items, so 250 queries), it returned an error at query 70. I continued (with an even higher delay), but somewhere around query 110 the query was no longer available.

So the way to go is to fetch smaller chunks, e.g. by month (or, for mathematics, at most by year).
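
A sketch of how the month-sized from/until windows could be generated in Python (the name monthly_windows is mine; each pair is meant to be plugged into the from and until parameters of a separate query):

```python
import calendar
from datetime import date


def monthly_windows(start, end):
    """Yield (from_date, until_date) ISO strings covering start..end by month."""
    if start > end:
        return
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        # Clip the first and last windows to the requested range.
        first = max(date(year, month, 1), start)
        last = min(date(year, month, last_day), end)
        yield first.isoformat(), last.isoformat()
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


for window in monthly_windows(date(2014, 12, 15), date(2015, 2, 10)):
    print(window)
```

Each window then starts its own short chain of resumption tokens, so a failure only costs one month of work rather than the whole harvest.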


Shameless plug: I wrote a generic OAI harvesting tool that will harvest arXiv just fine. It's called metha and consists of a few commands:

$ metha-sync http://export.arxiv.org/oai2

This will download all data up to the last full day (it will take a couple of days). The XML API responses are compressed and placed under the ~/.metha directory. Metha uses monthly windows and a resilient HTTP client to ensure downloads succeed without stressing the server. It has been tested in the wild on hundreds of OAI endpoints.

After (and during) download, you can inspect (already downloaded) records with:

$ metha-cat http://export.arxiv.org/oai2

For any further processing you will have to use your favorite XML tools.


Update: In addition to the (incremental) metha harvester, I wrote a small tool called oaicrawl, which does no caching and just fetches records off an OAI endpoint one by one. This creates more overhead, as there's an HTTP request for each record, but it can be useful if the OAI endpoint does not support selective harvesting (e.g. by date) or is otherwise broken, and you are OK with a best-effort data set harvested from the service.

Syntax would be similarly simple:

$ oaicrawl http://export.arxiv.org/oai2 > arxiv.data

Note that this will concatenate the raw responses from the API, so the output won't be valid XML out of the box.


A torrent for a metadata dump "collected from the OAI-PMH API endpoint using the 'metha-sync' tool" is available at: https://archive.org/details/arxiv-bulk-metadata

NB: This dataset contains metadata also for non-math articles.