We are trying to pull the entire file for a particular journal, 1000 records at a time. Everything comes back fine, but some of the returned DOIs are duplicated across pages, which means we are missing others: the script assumes the total row count and stops there. For instance, rows 1-1000 contain some of the same records as rows 1001-2000. If 5 records are duplicated, we miss the last 5 that would have been returned had the duplicates not been there.
Here is an example:
https://0-api-crossref-org.libus.csd.mu.edu/journals/1522-1466/works?filter=from-deposit-date:1900-01-01,until-deposit-date:2019-09-30&offset=1000&rows=1000
Hello @mkurowski,
Welcome to the Community Forum, and thank you for your message!
Using large `offset` values can result in extremely long response times. An alternative to paging through very large result sets is to use cursors. Any combination of query, filters and facets may be used with cursors. While `rows` may be specified along with `cursor`, `offset` and `sample` cannot be used. To make use of this in a query, include the `cursor` parameter with a value of `*`.
https://0-api-crossref-org.libus.csd.mu.edu/journals/1522-1466/works?filter=from-deposit-date:1900-01-01,until-deposit-date:2019-09-30&cursor=*
A `next-cursor` field will be provided in the JSON response. To get the next page of results, pass the value of `next-cursor` as the `cursor` parameter. For example:
https://0-api-crossref-org.libus.csd.mu.edu/journals/1522-1466/works?filter=from-deposit-date:1900-01-01,until-deposit-date:2019-09-30&rows=500&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAP5b9LFlo0ZUxVcFdrUnZHX0U1VHJLM2wyRmcAAAAACSi6ThZzdDctS25FTlRseXVSMmFKM2x1Z2pnAAAAAAHlYE0WOThNaG5lOXJRbFNxcUY4TmhDanV6UQAAAAAkIyAZFjNsQ3YwX25kUnFtNjhDTVFFZmVCNHcAAAAADQGubBZhVVJRRXBxZ1NQS1dfbkl6bk5JZDZ3AAAAAA_3xPUWbXNDZkMzblNRQldDelVoSm5BNXNSdw==
Note that the actual cursor value will be different from this illustration.
For each set of results, you should check the number of returned items. If the number of returned items is fewer than the number of expected rows, then the end of the result set has been reached. Using `next-cursor` beyond this point will result in responses with an empty items list.
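The steps above can be sketched in Python roughly as follows. This is only an illustration, not an official client: the `BASE_URL` host and the helper names `fetch_json`, `iter_works` and `unique_dois` are assumptions made for this example, and the response is assumed to carry `items` and `next-cursor` under a top-level `message` object. Substitute whichever host you normally use to reach the API.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Crossref REST API host; replace with the host you use.
BASE_URL = "https://api.crossref.org"


def fetch_json(path, params):
    """GET `path` with the given query parameters and return the decoded JSON."""
    url = f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp)


def iter_works(path, filters, rows=1000, fetch=fetch_json):
    """Yield every work under `path`, following next-cursor until a short page."""
    cursor = "*"  # "*" starts a new cursor; offset and sample must not be sent
    while True:
        params = {"filter": filters, "rows": rows, "cursor": cursor}
        message = fetch(path, params)["message"]
        items = message.get("items", [])
        yield from items
        # Fewer items than requested rows means the result set is exhausted.
        if len(items) < rows or "next-cursor" not in message:
            break
        cursor = message["next-cursor"]


def unique_dois(works):
    """Return DOIs in first-seen order, dropping case-insensitive repeats."""
    seen = {}
    for work in works:
        doi = work.get("DOI")
        if doi:
            seen.setdefault(doi.lower(), doi)
    return list(seen.values())


# Example call (performs real HTTP requests, so it is left commented out):
# works = iter_works(
#     "/journals/1522-1466/works",
#     "from-deposit-date:1900-01-01,until-deposit-date:2019-09-30",
# )
# print(len(unique_dois(works)))
```

The loop stops on the first short page rather than relying on a precomputed row count, so it never needs to request the empty page that follows. The `unique_dois` step is just a safety net for the duplicate check described in the question.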
I’m also attaching a .csv file with all of the 7000+ results. Perhaps that is helpful?
-Isaac
user_url_query_2025-01-17-22-57.csv.zip (6.8 MB)