A project that I have been working on involves retrieving the metadata for a large number of DOIs, and the public data export has been very useful - thanks very much for making it available (and for CrossRef in general)!
To be able to readily access the metadata by DOI, I have experimented with converting the public data export into a Lightning memory-mapped database (LMDB), in which the DOIs are the database keys and the associated metadata are the database values. This has worked well for my use-case, so I have created a Python package (crossref-lmdb) that provides a command-line application and Python library for creating, updating, and reading a Lightning database.
Note that this database is mainly useful for projects that need only a relatively small portion of the total data - creating and updating the database is likely to be prohibitively slow otherwise.
Here is the repository and the documentation.
It has the following features:
- Create a Lightning database from the CrossRef public data export, with optional filtering of DOI items based on custom Python code.
- Update the database with items from the CrossRef web API that have been added or modified since a given date.
- Read from the database in Python via a dict-like data structure.
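To give a rough idea of the concept, here is a minimal sketch of DOI-keyed storage with a filter and a dict-like reader. This is not the package's actual code or API: the names `build_store`, `DOIReader`, and `is_journal_article` are made up for illustration, a plain Python dict stands in for the Lightning database, the sample items are invented, and lower-casing the DOI keys is my assumption rather than something the package is known to do.

```python
import json
from collections.abc import Mapping


def build_store(items, keep=lambda item: True):
    """Build a DOI-keyed byte store from an iterable of metadata items.

    A plain dict stands in for the Lightning database here (assumption);
    keys and values are bytes, as they would be in a key-value store.
    """
    store = {}
    for item in items:
        if keep(item):
            # Assumption: normalise DOIs to lower case before using them as keys.
            key = item["DOI"].lower().encode("utf-8")
            store[key] = json.dumps(item).encode("utf-8")
    return store


class DOIReader(Mapping):
    """Read-only, dict-like view that decodes metadata on access."""

    def __init__(self, store):
        self._store = store

    def __getitem__(self, doi):
        # Raises KeyError for unknown DOIs, like a normal dict.
        return json.loads(self._store[doi.lower().encode("utf-8")])

    def __iter__(self):
        return (key.decode("utf-8") for key in self._store)

    def __len__(self):
        return len(self._store)


def is_journal_article(item):
    """Hypothetical custom filter: keep only journal articles."""
    return item.get("type") == "journal-article"


# Invented sample items, shaped loosely like export records.
items = [
    {"DOI": "10.5555/example.1", "type": "journal-article", "title": ["First"]},
    {"DOI": "10.5555/example.2", "type": "book-chapter", "title": ["Second"]},
]

reader = DOIReader(build_store(items, keep=is_journal_article))
```

With this sketch, `reader["10.5555/example.1"]` returns the decoded metadata for that DOI, and the filtered-out item is simply absent. Filtering at creation time is what keeps the database small enough to be practical.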
It also has the following limitations, though:
- The Lightning database format is not very efficient with disk space for this data (see the LMDB documentation for more details).
- Creating the database is very slow: building it from the full 2024 public data export takes multiple days.
- Updating the database is even slower.
Hopefully someone might find it useful!