Cursor Based Pagination

tobiasschweizer · 31 January 2023 18:16

I am just starting to use the Crossref REST API.

I read your post on cursor based paging: https:// community.crossref .org/t/ticket-of-the-month-march-2022-getting-started-with-rest-api-queries/2587/5#deep-paging-with-cursors-2

I want to harvest all the works that have at least one author / contributor identified with an ORCID.

Initial request: https:// api .crossref .org/works?filter=has-orcid:1&rows=1000&cursor=*

Then I obtain the cursor from .message.next-cursor in the JSON response (DnF1ZXJ5VGhlbkZldGNoBgAAAAAVkxo0FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAEugiXhY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAAZWz6oWMUZXWVdYT3hUZTZtQzNpVGM3NzZoUQAAAAAWv4Z_Fk5qWk1fWm1iUVV1V3Nwd3MxN3FqQ3cAAAAAFjFcdhZDNTBUdXhPeVM0bUg3UnZzcl9lR2N3AAAAABPuERgWSTFVWlpBeGRTWi1nRllxOU9nQUYydw==) to make the subsequent request(s), e.g., https:// api .crossref .org/works?filter=has-orcid:1&rows=1000&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAVkxo0FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAEugiXhY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAAZWz6oWMUZXWVdYT3hUZTZtQzNpVGM3NzZoUQAAAAAWv4Z_Fk5qWk1fWm1iUVV1V3Nwd3MxN3FqQ3cAAAAAFjFcdhZDNTBUdXhPeVM0bUg3UnZzcl9lR2N3AAAAABPuERgWSTFVWlpBeGRTWi1nRllxOU9nQUYydw%3D%3D

I URL-encode the cursor for this purpose (the “==” at the end becomes “%3D%3D”).
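For illustration, the request URLs above could be built along these lines. This is only a minimal sketch: the helper name `build_works_url` and its defaults are made up for this example, and it relies only on Python's standard `urllib.parse.urlencode` to percent-encode the cursor.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.crossref.org/works"


def build_works_url(cursor="*", rows=1000, filter_="has-orcid:1"):
    """Build a /works URL for one page (hypothetical helper, not part of any Crossref client)."""
    params = {"filter": filter_, "rows": rows, "cursor": cursor}
    # Keep "*" and ":" readable; everything else unsafe (such as "=") is percent-encoded.
    return f"{BASE_URL}?{urlencode(params, safe='*:')}"


# Initial request uses the literal "*" cursor.
print(build_works_url())
# A cursor ending in "==" is sent with the padding encoded as "%3D%3D".
print(build_works_url("abc=="))
```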

The thing that I struggle to understand is the way cursors are handled. Normally, a cursor is a base64 encoded string that can be decoded, resulting in some kind of pointer such as an id etc. With each request, the pointer changes and so does the next cursor.

Here, however, a cursor remains the same once obtained (at least, this is what I experienced). Does this mean that the server holds some sort of state for a given cursor that changes each time a request is made?

So in other words, aside from the first request, all subsequent requests use the same URL, but each request returns different results?

I found this:

The problem is this- if you are doing a long sequence of cursor requests, and the API (or your script) becomes unstable in the middle of the sequence, and you get an error, you will have to start from scratch with a new cursor.

https:// www. crossref. org/documentation/retrieve-metadata/rest-api/tips-for-using-the-crossref-rest-api/

What I also noticed is that "query":{"start-index":0} increments with offset based pagination but not when using a cursor.

Thanks for the clarification and kind regards,

Tobias

ppolischuk · 31 January 2023 23:49

Hi Tobias,

Thanks for writing, and for reading our docs.

It seems like you’ve sorted out how cursors work in the REST API. The cursor remains the same once obtained, and the server treats it as a stateful artifact. For each request with the same cursor, different results should be returned. Our cursor configuration is what we get out of the box with Elasticsearch.
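To make the behaviour concrete, here is a rough client-side sketch of such a loop. It is an illustration under assumptions rather than our implementation: `fetch_page` is a hypothetical callable that returns the parsed JSON for one request, the records are assumed to sit under `message.items`, and an empty page is taken as the signal to stop. `FakeStatefulServer` is a toy stand-in that mimics a server keeping its position behind an unchanging cursor.

```python
def harvest(fetch_page, cursor="*"):
    """Yield every item, re-sending whatever cursor the last response returned."""
    while True:
        message = fetch_page(cursor)["message"]
        items = message["items"]
        if not items:  # assumption: an empty page marks the end of the result set
            break
        yield from items
        cursor = message["next-cursor"]  # may be the same string on every page


class FakeStatefulServer:
    """Toy stand-in: the cursor string never changes, the server tracks the position."""

    def __init__(self, records, rows):
        self._records = records
        self._rows = rows
        self._position = 0

    def fetch_page(self, cursor):
        if cursor == "*":  # a fresh cursor starts from the beginning
            self._position = 0
        page = self._records[self._position:self._position + self._rows]
        self._position += len(page)
        return {"message": {"items": page, "next-cursor": "SAME-CURSOR"}}


server = FakeStatefulServer(list(range(7)), rows=3)
print(list(harvest(server.fetch_page)))
```

In a real script, `fetch_page` would wrap an HTTP GET against the API; it is passed in here so the loop itself can be tried without network access.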

When we migrated from Solr to Elasticsearch, some other users had to adjust to this cursor behavior. You might find the comment thread in this old issue informative: “As a Metadata Plus user, I'd like the cursor timeout to be increased (a 5-minute expiration is too short)” (#649, crossref / DEPRECATED User stories, GitLab).

I hope this clarifies, but let us know if you have any further questions.

Thanks,
Patrick

tobiasschweizer · 1 February 2023 07:12

Hi Patrick,

Thanks a lot for your quick answer. Yes, it’s now clear to me how it works.

As for my use case, there are a lot of results (9’998’490 items and I can fetch 1’000 per request). As you write on the tips page, it is quite likely that a request could go wrong. This would require me to start over. Is the only other option to download the public data file and apply some filtering myself?
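To put a number on it, here is a quick back-of-the-envelope calculation (the function name `requests_needed` is just for illustration):

```python
import math


def requests_needed(total_items, rows_per_request):
    """Number of page requests needed to cover all items."""
    return math.ceil(total_items / rows_per_request)


print(requests_needed(9_998_490, 1_000))  # 9999
```

So roughly ten thousand consecutive requests would all have to succeed within one cursor session.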

I’ve recently worked on a prototypical GraphQL interface backed by Elasticsearch. I’ve implemented cursor-based pagination using Elasticsearch’s search_after (Paginate search results | Elasticsearch Guide [8.6] | Elastic). It required deterministic sorting by id (sort) and the cursor was just the base64 encoded id of the last result of the page fetched. So with each request, the client would get the next page of results and obtain a new cursor.
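Roughly, the idea looks like this. The sketch below does not call Elasticsearch at all: it uses an in-memory list as a stand-in, and the names `encode_cursor`, `decode_cursor` and `fetch_page` are invented for the example. It only shows the principle that the cursor carries the id of the last result and the server stays stateless.

```python
import base64


def encode_cursor(last_id):
    """The cursor is simply the base64-encoded id of the last result on the page."""
    return base64.b64encode(last_id.encode("utf-8")).decode("ascii")


def decode_cursor(cursor):
    return base64.b64decode(cursor).decode("utf-8")


def fetch_page(records, page_size, cursor=None):
    """Return (page, next_cursor); the in-memory filter stands in for search_after."""
    ordered = sorted(records, key=lambda r: r["id"])  # deterministic sort by id
    if cursor is not None:
        last_id = decode_cursor(cursor)
        ordered = [r for r in ordered if r["id"] > last_id]
    page = ordered[:page_size]
    next_cursor = encode_cursor(page[-1]["id"]) if page else None
    return page, next_cursor


records = [{"id": i} for i in "edcba"]
cursor = None
while True:
    page, cursor = fetch_page(records, 2, cursor)
    if not page:
        break
    print([r["id"] for r in page], cursor)
```

Because each cursor can be decoded on its own, a failed request can simply be retried with the last cursor that worked.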

Could this be an option, too? I am aware that given your large amounts of data this is just an untested idea but let me know if I should provide more details etc.

Kind regards,

Tobias

ppolischuk · 2 February 2023 00:21

Hi again,

Yes, your two best options for fetching a large volume of records are to iterate over a result set with cursors or to start with the public data file.

We’re aware that there is room for improvement in our cursor implementation and will consider the feature you’ve suggested for future enhancements. However, we’re not likely to work on developing this feature in the near term due to limited development resources.

Thanks for your suggestion, and for offering to provide additional details. The Elasticsearch documentation should be sufficient for us, but we’ll be sure to reach out if we need to better understand your use case.

Take care,
Patrick
