Using the public data file via a Lightning key:value (DOI:metadata) database (`crossref-lmdb`)

djmannion · 5 November 2024 22:15

A project that I have been working on involves retrieving the metadata for a large number of DOIs, and the public data export has been very useful - thanks very much for making it available (and for CrossRef in general)!

To be able to readily access the metadata by DOI, I have experimented with converting the public data export into a Lightning memory-mapped database (LMDB), in which the DOIs are the database keys and the associated metadata are the database values. This has worked well for my use-case, so I have created a Python package (crossref-lmdb) that provides a command-line application and Python library for creating, updating, and reading a Lightning database.
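As a rough illustration of the DOI:metadata layout (a sketch only, not how crossref-lmdb actually stores its data): LMDB keys and values are plain bytes, so each DOI and its metadata record need to be encoded before being written. The helper names below are hypothetical, the compressed-JSON value format is an assumption, and a plain `dict` stands in for the database so that the example runs without LMDB installed.

```python
import json
import zlib


def encode_key(doi: str) -> bytes:
    # DOIs are case-insensitive, so normalise to lower case before encoding.
    return doi.strip().lower().encode("utf-8")


def encode_value(item: dict) -> bytes:
    # Assumed value format: compact JSON, compressed with zlib.
    text = json.dumps(item, separators=(",", ":"))
    return zlib.compress(text.encode("utf-8"))


def decode_value(raw: bytes) -> dict:
    # Reverse of encode_value.
    return json.loads(zlib.decompress(raw).decode("utf-8"))


# A made-up record, using field names from the Crossref JSON format.
item = {"DOI": "10.5555/12345678", "type": "journal-article", "title": ["Example"]}

store = {}  # stand-in for the LMDB database (bytes key -> bytes value)
store[encode_key(item["DOI"])] = encode_value(item)

# Lookup by DOI; the case of the query does not matter after normalisation.
found = decode_value(store[encode_key("10.5555/12345678")])
```

With a real LMDB environment, the same key and value bytes would be written and read inside a transaction rather than through a `dict`.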

Note that this database is mainly useful for projects that need a relatively small portion of the total data; creating and updating the database is likely to be prohibitively slow otherwise.

Here is the repository and the documentation.

It has the following features:

Create a Lightning database from the CrossRef public data export, with optional filtering of DOI items based on custom Python code.
Update the database with items from the CrossRef web API that have been added or modified since a given date.
Read from the database in Python via a dict-like data structure.
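To make the first two features more concrete, here is a minimal sketch of what a filter and an update query could look like. The function names `keep_item` and `updates_url` are hypothetical and are not taken from the crossref-lmdb interface (see its documentation for the real one). The field names (`type`, `issued`, `date-parts`) follow the Crossref JSON format, and `from-update-date` with cursor-based paging is the REST API's mechanism for fetching records changed since a given date.

```python
from urllib.parse import urlencode


def keep_item(item: dict) -> bool:
    """Hypothetical filter: keep journal articles issued in 2020 or later."""
    if item.get("type") != "journal-article":
        return False
    date_parts = item.get("issued", {}).get("date-parts", [[None]])
    # The year can be missing or null in some records.
    year = date_parts[0][0] if date_parts and date_parts[0] else None
    return year is not None and year >= 2020


def updates_url(from_date: str, rows: int = 500, cursor: str = "*") -> str:
    """Build a Crossref REST API URL for works updated since from_date."""
    params = {
        "filter": f"from-update-date:{from_date}",
        "rows": rows,
        "cursor": cursor,  # "*" starts a new deep-paging session
    }
    return "https://api.crossref.org/works?" + urlencode(params)


# Made-up records, for illustration only.
recent_article = {"type": "journal-article", "issued": {"date-parts": [[2021, 3]]}}
old_article = {"type": "journal-article", "issued": {"date-parts": [[1999]]}}
book = {"type": "book", "issued": {"date-parts": [[2022]]}}

kept = [i for i in (recent_article, old_article, book) if keep_item(i)]
url = updates_url("2024-11-05")
```

Filtering at creation time keeps the database restricted to the records a project needs, which matters given how slow creation and updating are for the full export.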

It also has the following limitations:

The Lightning database format is not very efficient with disk space for this data (see the LMDB documentation for more details).
The creation of the database is very slow, with database creation from the full 2024 public data export taking multiple days.
Updating the database is even slower.

Hopefully someone might find it useful!

Shayn · 5 November 2024 23:16

Thanks so much for sharing that here!

We’re always grateful when we can share tools like this with other metadata users who could use them, and learn from your experience.
