Using the public data file via a Lightning key:value (DOI:metadata) database (`crossref-lmdb`)

A project that I have been working on involves retrieving the metadata for a large number of DOIs, and the public data export has been very useful - thanks very much for making it available (and for CrossRef in general)!

To be able to readily access the metadata by DOI, I have experimented with converting the public data export into a Lightning memory-mapped database (LMDB), in which the DOIs are the database keys and the associated metadata are the database values. This has worked well for my use-case, so I have created a Python package (crossref-lmdb) that provides a command-line application and Python library for creating, updating, and reading a Lightning database.
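As a rough illustration of the DOI:metadata layout, here is a minimal sketch of how a record could be turned into a key and a value. This is an assumption for illustration only, not necessarily the storage format that crossref-lmdb uses: the helper names (`encode_key`, `encode_value`, `decode_value`), the lowercasing of DOIs, the zlib compression, and the example DOI are all hypothetical, and a plain dict stands in for the Lightning database.

```python
import json
import zlib


def encode_key(doi):
    """Turn a DOI into a bytes key (hypothetical helper).

    DOIs are case-insensitive, so the key is lowercased to make lookups
    independent of how the DOI was typed.
    """
    return doi.strip().lower().encode("utf-8")


def encode_value(item):
    """Serialise a metadata record to compact JSON and compress it (assumed format)."""
    payload = json.dumps(item, separators=(",", ":")).encode("utf-8")
    return zlib.compress(payload)


def decode_value(raw):
    """Reverse encode_value: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(raw).decode("utf-8"))


# A plain dict stands in for the Lightning database in this sketch.
store = {}

# Made-up record, shaped loosely like a Crossref work item.
item = {
    "DOI": "10.1234/Example.5678",
    "type": "journal-article",
    "title": ["An example"],
}
store[encode_key(item["DOI"])] = encode_value(item)

# Look the record up again using a differently-cased form of the same DOI.
looked_up = decode_value(store[encode_key("10.1234/EXAMPLE.5678")])
```

With the `lmdb` Python bindings, the same key and value bytes could instead be written with `txn.put(key, value)` inside a write transaction and read back with `txn.get(key)`.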

Note that this database is mainly useful for projects that need only a relatively small portion of the total data; otherwise, creating and updating the database is likely to be prohibitively slow.

Here is the repository and the documentation.

It has the following features:

  • Create a Lightning database from the CrossRef public data export, with optional filtering of DOI items based on custom Python code.
  • Update the database with items from the CrossRef web API that have been added or modified since a given date.
  • Read from the database in Python via a dict-like data structure.
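To give a feel for the filtering step, here is a minimal sketch of a custom filter. It is not the interface that crossref-lmdb actually expects (see the package documentation for that). The function names `iter_items` and `is_journal_article_since` are hypothetical, and the sketch assumes that each file in the public data export is gzipped JSON with a top-level `"items"` list, and that items carry `"type"` and `"issued"` fields as in the web API.

```python
import gzip
import json


def iter_items(path):
    """Yield metadata items from one export file.

    Assumes the file is gzipped JSON with a top-level "items" list.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from json.load(handle)["items"]


def is_journal_article_since(year):
    """Build a predicate that keeps journal articles issued in `year` or later."""

    def keep(item):
        if item.get("type") != "journal-article":
            return False
        # "issued" is expected to look like {"date-parts": [[2021, 3, 14]]};
        # the year can be missing or None, in which case the item is skipped.
        date_parts = (item.get("issued") or {}).get("date-parts") or [[None]]
        first = date_parts[0][0] if date_parts[0] else None
        return first is not None and first >= year

    return keep


keep = is_journal_article_since(2020)

# Made-up items for illustration.
samples = [
    {"DOI": "10.1234/a", "type": "journal-article", "issued": {"date-parts": [[2021, 3]]}},
    {"DOI": "10.1234/b", "type": "book-chapter", "issued": {"date-parts": [[2022]]}},
    {"DOI": "10.1234/c", "type": "journal-article", "issued": {"date-parts": [[None]]}},
]
kept_dois = [item["DOI"] for item in samples if keep(item)]
```

Only the items for which the predicate returns `True` would then be written to the database, which is what keeps the database small enough to build in a reasonable time.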

It also has the following limitations:

  • The Lightning database format is not very efficient with disk space for this data (see the LMDB documentation for more details).
  • Creating the database is very slow: building it from the full 2024 public data export takes multiple days.
  • Updating the database is even slower.

Hopefully someone might find it useful!

Thanks so much for sharing that here!

We’re always grateful when we can share tools like this with other metadata users who could use them, and learn from your experience.