Unable to retrieve the complete set of 'journal-article' records

Greetings!

I'm trying to fetch the complete dataset of 'type:journal-article' works so it can be available as a cache on our side. I'm doing this via the GET /works method of the Crossref API.

I'm using a cursor and requesting rows=1000, sending a request every 4-8 seconds (I hope this is not considered too spammy). I have also set the mailto parameter so you can identify my workload.
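For reference, the fetch loop I'm running looks roughly like this (a simplified Python sketch using the requests package; store() and the mailto address are placeholders, not my real code):

```python
import time
import requests

BASE = "https://api.crossref.org/works"

def store(items):
    # Placeholder for our persistence step; we write each batch to our own cache.
    print(f"got {len(items)} items")

params = {
    "filter": "type:journal-article",
    "rows": 1000,
    "cursor": "*",                 # "*" starts a new deep-paging cursor
    "mailto": "ops@example.org",   # placeholder address
}

while True:
    resp = requests.get(BASE, params=params, timeout=60)
    resp.raise_for_status()        # the 400s I mentioned surface here
    msg = resp.json()["message"]
    items = msg["items"]
    if not items:
        break                      # an empty page means we are done
    store(items)
    params["cursor"] = msg["next-cursor"]  # continue from the returned cursor
    time.sleep(5)                  # pause between requests (4-8 s in practice)
```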

The whole type:journal-article dataset consists of 106+ million entries, as I see in the API response.

However, I was not able to fetch the full set after two attempts: the first time I retrieved ~50M entries, the second time a bit less than 10M. In both cases I eventually received a "400 Bad Request", and the message looked as if the cursor had expired (but I'm fairly sure that is not the case, because requests were sent on a regular basis without long gaps).

Is there any way I can figure out why this is happening? Are you applying some additional throttling? Or maybe I should implement some sort of retry mechanism on my side for 400 errors?

Thank you!

Hello, and thanks for your question.

We may be applying additional rate limiting based on the frequency of your queries. That would be reflected in the response HTTP headers, in the X-Rate-Limit-Limit and X-Rate-Limit-Interval values. There are more details on this in our API documentation at api.crossref.org.
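For example, a quick way to see what limit the API is currently advertising is to read those headers off any response (a rough Python sketch; the mailto address is a placeholder):

```python
import requests

resp = requests.get(
    "https://api.crossref.org/works",
    params={"rows": 0, "mailto": "ops@example.org"},  # rows=0 keeps the response small
    timeout=60,
)

# The current rate limit is advertised in the response headers.
limit = resp.headers.get("X-Rate-Limit-Limit")
interval = resp.headers.get("X-Rate-Limit-Interval")
print(f"Allowed {limit} requests per {interval}")
```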

That said, if you want to cache a large quantity of metadata, I’d strongly suggest starting with our Public Data File, and then retrieving from the API only the items that have been added or updated since the date that file is current (March 31st, 2023). That will require significantly fewer requests to the API.

Additionally, rather than requesting the entirety of /works with filter=type:journal-article, it’s best to split that into many smaller queries. You can narrow it down by date ranges (using the from-update-date and until-update-date filters would probably be best, if you’re working from a base of what’s in the Public Data File), by ISSN, by member ID, etc.
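To illustrate, an incremental pass could walk through paired date filters one month at a time, along these lines (a sketch only, in Python; the chunk boundaries and the mailto address are placeholders to adapt to your own setup):

```python
import requests

BASE = "https://api.crossref.org/works"

# Monthly chunks covering the period after the 2023-03-31 Public Data File.
months = [
    ("2023-04-01", "2023-04-30"),
    ("2023-05-01", "2023-05-31"),
    # ... and so on, up to the present
]

for start, end in months:
    filters = (
        "type:journal-article,"
        f"from-update-date:{start},until-update-date:{end}"
    )
    params = {"filter": filters, "rows": 0, "mailto": "ops@example.org"}
    msg = requests.get(BASE, params=params, timeout=60).json()["message"]
    print(f"{start} to {end}: {msg['total-results']} works to page through")
    # Each chunk can then be paged with rows=1000 and a cursor, exactly as for
    # a full crawl, but a failed chunk is cheap to retry from scratch.
```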

Best,
Shayn


Thank you @Shayn for your comments!
I was able to import the 2023-03-31 data from the Public Data File that you posted.
Now I'm looking into implementing an incremental, periodic fetch to retrieve data changed after 2023-03-31, and I've stumbled upon this recommendation in the API Swagger documentation:

Notes on incremental metadata updates
When using time filters to retrieve periodic, incremental metadata updates, the from-index-date filter should be used over from-update-date, from-deposit-date, from-created-date and from-pub-date. The timestamp that from-index-date filters on is guaranteed to be updated every time there is a change to metadata requiring a reindex.

For the type:journal-article,from-index-date:2023-03-31 query I'm getting back "total-results": 61751291. This is more than half of the entries I imported from the 2023-03-31 Public Data File, so since then some entries have been added while others have been updated/reindexed.
I will certainly split the import into smaller (monthly/quarterly) chunks, but the issue I'm currently facing is how to update entries that have been reindexed after 2023-03-31. I was not able to find an entry ID/primary key in the Public Data File records or in the API response, so I tried generating an ID for each entry by hashing the contents of several fields, but that led to some collisions and entries being overwritten.
Could you recommend a reliable way of defining a primary key that can be used for update operations during the range-based import applied after the initial Public Data File import? Or is there some other way of solving this problem that I'm not seeing?

Thank you in advance!

The last public data file is over a year old at this point, so it makes sense that there have been a lot of updates since then. In addition to the metadata updates supplied by the publishers, there are instances where we have to reindex content for technical or bug-fix reasons, which is why using the indexed date is suggested over the update date. But that also increases the number of updated items in any time period.

In terms of primary key/id for each record, the DOIs themselves serve that function. Treating the DOI as the primary key should take care of the collision problem, if I’m understanding you correctly.
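For instance, an upsert keyed on the DOI (the "DOI" field in both the public data file records and the API items) would look roughly like this; a minimal sketch using SQLite, lowercasing the DOI as a normalization step since DOIs are case-insensitive:

```python
import json
import sqlite3

conn = sqlite3.connect("crossref_cache.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS works (doi TEXT PRIMARY KEY, record TEXT, indexed TEXT)"
)

def upsert(items):
    """Insert new records and replace existing ones, keyed on the DOI."""
    rows = [
        (
            item["DOI"].lower(),                       # DOIs are case-insensitive
            json.dumps(item),
            item.get("indexed", {}).get("date-time"),  # handy for sanity checks
        )
        for item in items
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO works (doi, record, indexed) VALUES (?, ?, ?)", rows
    )
    conn.commit()
```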

We are also gearing up to release a new public data file soon, hopefully within the next few weeks. So, if you want to hold off for that to be available, you may not have to go through this update process, or at least you’ll be starting from a more current set of data.