New public data file: 120+ million metadata records - Crossref

Crossref · 23 February 2021 17:56

2020 wasn’t all bad. In April of last year, we released our first public data file. Though Crossref metadata is always openly available, and our board recently cemented this by voting to adopt the Principles of Open Scholarly Infrastructure (POSI), we’ve decided to release an updated file. This provides a more efficient way to get such a large volume of records. The file (JSON records, 102.6GB) is now available, with thanks once again to Academic Torrents.

This is a companion discussion topic for the original entry at https://0-www-crossref-org.libus.csd.mu.edu/blog/new-public-data-file-120-million-metadata-records/

Ceber · 12 January 2022 08:42

This public data file is indeed very useful. I would like to use the API to get the metadata records that are not included in the public data file, while avoiding duplicates. According to this post’s description, I guess I should pull those registered after “January 7, 2021”? And which date field should I use (e.g. indexed, created, deposited)?

ppolischuk · 13 January 2022 18:17

For incremental updates we recommend using the from-index-date filter. The timestamp that from-index-date filters on is guaranteed to be updated every time there is a change to metadata requiring a reindex. This way you’ll pick up updated records in addition to new records.
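As a rough illustration of the incremental approach, here is a minimal Python sketch that pages through the REST API `/works` route using the `from-index-date` filter with cursor-based deep paging. The function names (`build_url`, `fetch_json`, `iter_updated_records`) are made up for this example, and the cut-off date `2021-01-07` is only taken from the question above; adjust it to match the snapshot you actually hold.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.crossref.org/works"


def build_url(from_date, cursor="*", rows=1000):
    """Build a /works URL selecting records (re)indexed on or after from_date."""
    params = {
        "filter": f"from-index-date:{from_date}",  # date as YYYY-MM-DD
        "rows": rows,        # page size
        "cursor": cursor,    # "*" starts a new deep-paging session
    }
    return f"{API}?{urlencode(params)}"


def fetch_json(url):
    """Fetch one page from the API and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def iter_updated_records(from_date, fetch=fetch_json, rows=1000):
    """Yield every record indexed since from_date, one page at a time.

    `fetch` is injectable so the paging logic can be exercised offline.
    """
    cursor = "*"
    while True:
        message = fetch(build_url(from_date, cursor, rows))["message"]
        items = message.get("items", [])
        if not items:
            break  # an empty page means the cursor is exhausted
        yield from items
        cursor = message["next-cursor"]
```

Because an updated record is returned again with the same DOI, keying your local store on `DOI` and overwriting on each run should avoid duplicates, for example `for rec in iter_updated_records("2021-01-07"): store[rec["DOI"]] = rec`, where `store` is whatever mapping or database you use.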

I’m glad to hear you find the public data file useful! We’re preparing the 2022 public data file for release soon.

Ceber · 18 January 2022 10:16

Thanks for the reply, it is all clear now.
One last question regarding the dates. If I look up a DOI with the search engine in XML format, for instance DOI 10.1039/d0se01062f, I can see a publication_date field (29 September 2020). However, in the public data file, the JSON for that DOI includes several date fields, such as indexed, created, published-online, issued, and deposited. Some of them, like issued or published-online, have only the year part. Which field in the public data file JSON corresponds to the publication date?

AaronNGray · 3 February 2022 19:58

Hi, how do I import the Crossref data dump (the *.json.gz files)?
Should I import it into Elasticsearch, and if so, how?

Jens · 22 March 2022 09:47

Thank you for creating and offering this huge amount of valuable data as a downloadable set of files. Is there an update on the release date for the 2022 public data file?

ppolischuk · 23 March 2022 20:34

Hi Jens,

We’re close. Expect an update in the next few days. We’re planning to perform a reindex to finish up a few fixes and enhancements, after which we’ll generate and publish the 2022 public data file.

ppolischuk · 13 May 2022 21:43

The 2022 public data file is now live. Please see the blog post announcement: 2022 public data file of more than 134 million metadata records now available - Crossref

Garry · 17 December 2022 17:32

Hello, I need help reading the .json.gz files from the torrent using Python. Also, when I try to unzip a .gz file, it says the file is corrupted. Why is this?
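For anyone with the same question, here is a minimal sketch using only the Python standard library. It assumes each `.json.gz` file holds a single JSON object whose `"items"` key is the list of records; that layout is an assumption, so check one file before relying on it. The helper names are made up for this example. A "corrupted" message can have several causes; one possibility is a torrent download that has not finished, and `is_readable_gzip` only tells you whether a file decompresses cleanly, not why it fails.

```python
import gzip
import json
import zlib


def iter_records(path):
    """Yield metadata records from one gzipped JSON file.

    Assumes the file is a JSON object with an "items" list (unverified layout).
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        data = json.load(fh)
    yield from data["items"]


def is_readable_gzip(path, chunk_size=1024 * 1024):
    """Return True if the whole file decompresses without error."""
    try:
        with gzip.open(path, "rb") as fh:
            while fh.read(chunk_size):
                pass  # read to the end so truncation is detected
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers gzip.BadGzipFile; EOFError signals a truncated file
        return False
```

A typical loop would check `is_readable_gzip(path)` first and then run `for record in iter_records(path): print(record.get("DOI"))`. Files that fail the check are probably best re-downloaded or re-verified in your torrent client.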
