Issues with pagination using API

We are trying to pull the entire set of records for a particular journal, 1000 records at a time. Everything is returned without errors, but some DOIs are duplicated across pages, which means we are missing records: the script assumes the total row count and stops there. For instance, rows 1-1000 contain some of the same records as rows 1001-2000. If 5 records are duplicated, we miss the last 5 records that would have been returned had the duplicates not been there.

Here is an example:
https://0-api-crossref-org.libus.csd.mu.edu/journals/1522-1466/works?filter=from-deposit-date:1900-01-01,until-deposit-date:2019-09-30&offset=1000&rows=1000

Hello @mkurowski ,

Welcome to the Community Forum, and thank you for your message!

Using large offset values can result in extremely long response times. An alternative to paging through very large result sets with offset is to use cursors. Any combination of query, filters and facets may be used with cursors. While rows may be specified along with cursor, offset and sample cannot be used. To use a cursor, include the cursor parameter in the query with a value of *.

https://0-api-crossref-org.libus.csd.mu.edu/journals/1522-1466/works?filter=from-deposit-date:1900-01-01,until-deposit-date:2019-09-30&cursor=*

A next-cursor field will be provided in the JSON response. To get the next page of results, pass the value of next-cursor as the cursor parameter. For example:

https://0-api-crossref-org.libus.csd.mu.edu/journals/1522-1466/works?filter=from-deposit-date:1900-01-01,until-deposit-date:2019-09-30&rows=500&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAP5b9LFlo0ZUxVcFdrUnZHX0U1VHJLM2wyRmcAAAAACSi6ThZzdDctS25FTlRseXVSMmFKM2x1Z2pnAAAAAAHlYE0WOThNaG5lOXJRbFNxcUY4TmhDanV6UQAAAAAkIyAZFjNsQ3YwX25kUnFtNjhDTVFFZmVCNHcAAAAADQGubBZhVVJRRXBxZ1NQS1dfbkl6bk5JZDZ3AAAAAA_3xPUWbXNDZkMzblNRQldDelVoSm5BNXNSdw==

Note that the actual cursor value will be different from this illustration.

For each set of results, you should check the number of returned items. If the number of returned items is fewer than the number of expected rows then the end of the result set has been reached. Using next-cursor beyond this point will result in responses with an empty items list.
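In case it helps, here is a minimal Python sketch of the loop described above. It is an illustration rather than an official client, and it rests on a few assumptions: it calls the public host api.crossref.org instead of the proxied host in the URLs above, it reads the items and next-cursor fields from the "message" object of the JSON response, and the function names (fetch_page, iter_works, collect_dois) are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Crossref host, not the proxied host used in the URLs above.
BASE_URL = "https://api.crossref.org"


def fetch_page(issn, cursor, rows, filters):
    """Fetch one page of works for a journal and return the 'message' object."""
    query = urllib.parse.urlencode(
        {"filter": filters, "rows": rows, "cursor": cursor}
    )
    url = f"{BASE_URL}/journals/{issn}/works?{query}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)["message"]


def iter_works(get_page, rows=1000):
    """Yield every item, following next-cursor until a short page comes back.

    get_page is any callable that takes a cursor string and returns a dict
    with an "items" list and a "next-cursor" value.
    """
    cursor = "*"  # the first request always uses cursor=*
    while True:
        message = get_page(cursor)
        items = message.get("items", [])
        yield from items
        # Fewer items than requested means the end of the result set.
        if len(items) < rows or "next-cursor" not in message:
            break
        cursor = message["next-cursor"]


def collect_dois(issn, filters, rows=1000):
    """Return all DOIs for a journal, with a check for duplicates."""
    dois = [
        work["DOI"]
        for work in iter_works(lambda c: fetch_page(issn, c, rows, filters), rows)
    ]
    if len(dois) != len(set(dois)):
        print("Warning: duplicate DOIs were returned")
    return dois
```

For the journal in question, a call might look like collect_dois("1522-1466", "from-deposit-date:1900-01-01,until-deposit-date:2019-09-30"). Because the loop stops on a short page rather than on a precomputed row count, it does not depend on an assumed total.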

I’m also attaching a .csv file with all of the 7000+ results. Perhaps that is helpful?

-Isaac

user_url_query_2025-01-17-22-57.csv.zip (6.8 MB)