Bulk download of arXiv (or other publication data set) with metadata AND citations

arXiv's metadata does not include any citation information.

If you are interested in citations in arXiv documents, your best bet is probably to extract them yourself, either from the PDF files using a dedicated tool (CERMINE, pdfx or pdfextract, for instance) or from the LaTeX sources, by inspecting the bibliography files shipped with them (.bbl, or .aux where present).
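
As a rough illustration of the LaTeX-source route, here is a minimal Python sketch that pulls citation keys out of the text of an .aux or .bbl file. It assumes the classic BibTeX conventions (`\citation{...}` lines in .aux, `\bibitem{...}` entries in .bbl); sources built with biblatex use a different .bbl layout and would need another pattern. The function name `cited_keys` and the sample strings are made up for the example.

```python
import re

# \citation{key1,key2} lines are written to .aux files by \cite commands.
AUX_CITATION = re.compile(r"\\citation\{([^}]*)\}")
# \bibitem[label]{key} opens each reference in a classic BibTeX .bbl file.
BBL_BIBITEM = re.compile(r"\\bibitem\s*(?:\[[^\]]*\])?\s*\{([^}]*)\}")


def cited_keys(text):
    """Return the citation keys found in .aux or .bbl content, in first-seen order."""
    keys = []
    for match in AUX_CITATION.finditer(text):
        # A single \citation may list several comma-separated keys.
        keys.extend(k.strip() for k in match.group(1).split(","))
    keys.extend(m.group(1).strip() for m in BBL_BIBITEM.finditer(text))

    # Drop duplicates and empty keys while preserving order.
    seen = set()
    unique = []
    for key in keys:
        if key and key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical file contents, for illustration only.
aux_sample = "\\citation{knuth84,lamport94}\n\\citation{knuth84}\n"
bbl_sample = "\\bibitem[Knuth(1984)]{knuth84}\nD. Knuth. The TeXbook.\n"

print(cited_keys(aux_sample))
print(cited_keys(bbl_sample))
```

Note that this only gives you the keys the authors used locally; mapping them to actual publications still means parsing the reference text that follows each `\bibitem`.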

You can download PDF or source files using the dedicated bulk data interface. Using the OAI interface for this purpose is not recommended as it generates a lot of traffic on the main arXiv site.

You could extract a similar dataset from PubMed Central using their API. If you are looking for something broader (not restricted to a particular topic), you could use CORE (COnnecting REpositories), which aggregates full texts from a variety of sources and offers an API.

You can also use a pre-curated dataset such as the Microsoft Academic Graph.


More recently, arXiv citation information has become available from Semantic Scholar; this is the data displayed on the arXiv website.

Semantic Scholar apparently provides a dump of their data.
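
If you do get hold of such a dump, turning it into a citation graph could look something like the sketch below. It assumes the dump is a JSON-lines file (one paper record per line) and that each record has an identifier field and a list of outgoing citations; the field names `id` and `outCitations` are assumptions here, so check them against the schema of the release you actually download. The sample records are invented.

```python
import json
from collections import defaultdict


def citation_edges(lines, id_field="id", refs_field="outCitations"):
    """Yield (citing, cited) pairs from JSON-lines records.

    The default field names are assumptions; adjust them to the dump's schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for cited in record.get(refs_field, []):
            yield record[id_field], cited


def incoming_counts(edges):
    """Count how many times each paper is cited."""
    counts = defaultdict(int)
    for _citing, cited in edges:
        counts[cited] += 1
    return dict(counts)


# Invented records, for illustration only.
sample = [
    '{"id": "p1", "outCitations": ["p2", "p3"]}',
    '{"id": "p2", "outCitations": ["p3"]}',
    '{"id": "p3", "outCitations": []}',
]

edges = list(citation_edges(sample))
print(edges)
print(incoming_counts(edges))
```

Since `citation_edges` accepts any iterable of lines, a compressed file can be streamed without loading it into memory, e.g. by passing `gzip.open(path, "rt")` instead of the sample list.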