Hello @slnm. Thanks for your message. Welcome to the community forum!
One alternative here is to use the from-update-date filter instead of the from-index-date filter. The major difference between the two is that from-index-date also captures updated citation counts (in addition to the changes included in from-update-date). If that information isn't of concern to you, the from-update-date filter will return far fewer results. Index dates cover changes that we ourselves make to the record, so very occasionally that will include work we've done on bugs, plus those citation count updates I mentioned. The from-update-date filter covers all metadata changes made by our members to their records.
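If it helps to see the difference in volume, a rows=0 query against each filter returns just the total count. A minimal sketch in Python (the date and mailto address here are placeholders):

```python
# Compare result counts for the two filters using rows=0 queries.
# The date and mailto address are placeholders.
import requests

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"

for flt in ("from-index-date", "from-update-date"):
    params = {"filter": f"{flt}:2020-11-15", "rows": 0, "mailto": "you@example.org"}
    message = requests.get(BASE, params=params).json()["message"]
    print(flt, message["total-results"])
```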
You're right, 385,706 records is a lot to update in one day, but we're always updating citation counts by matching references against existing, cited DOIs, so counts from the from-index-date filter are going to run high.
My aim is to maintain a relatively current dataset. Should I create a daily job to fetch new records, using your deep-paging cursor? My concern with that approach is that a query takes roughly 30 seconds to return. At a rate of 2 queries per minute, with 100,000 updates per day and 1,000 results per page, it will take 50 minutes per day to fetch the incremental changes. I have tried using the mailto parameter (and HTTPS) to get into the preferred query pool, but that doesn't seem to speed up queries.
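For reference, my queries look roughly like this, timed to show the ~30 second response (a sketch; the filter date and email address are placeholders):

```python
# Rough sketch of a single polite-pool query, with timing.
# The filter date and email address are placeholders.
import time
import requests

start = time.monotonic()
resp = requests.get(
    "https://0-api-crossref-org.libus.csd.mu.edu/works",
    params={
        "filter": "from-update-date:2020-11-15",
        "rows": 1000,
        "cursor": "*",
        "mailto": "me@example.org",  # identifies the caller for the polite pool
    },
)
print(f"status {resp.status_code} in {time.monotonic() - start:.1f}s")
```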
I think you'll find that the 112,000 updates per day number is a little higher than the average, which should help with the overall time estimate for fetching these incremental changes. And there's no reason you can't send us more than two queries per minute. You should be able to perform up to 50 per second and still be below our rate limits, as discussed here: https://github.com/CrossRef/rest-api-doc#rate-limits
I'd suggest using the Polite pool over the Public pool, as the Polite pool is the more performant of the two over the longer term.
If you need a higher rate limit or a more performant pool, our Plus pool, with its SLAs, is an option as well. You can learn more here: https://0-www-crossref-org.libus.csd.mu.edu/services/metadata-retrieval/metadata-plus/. If you're interested in learning more about the Plus service, I'd be happy to answer your questions or connect you with Jennifer Kemp, our Head of Partnerships.
Thanks again, @ifarley, for your help. I'm still not clear.
According to the REST API docs, I should use a cursor if I'm fetching a large number of rows, and offset can't be used with cursors. So I do an initial query with the parameter cursor=* to get the first cursor, then take next-cursor from the first set of results and use that cursor for the next query, and so on. Given that the cursor changes with every subsequent query, I can't parallelize those queries; I need to get the next cursor before making the next query. So, to get 112,000 updates with a maximum of 1,000 rows per query, I'll need to make 112 queries, and I don't see how I can do anything but wait for one query to complete before issuing the next.
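In code, that sequential walk looks something like this (a sketch; the filter dates, mailto address, and the process() handler are placeholders):

```python
# Sketch of the sequential cursor walk: each next-cursor is only known after
# the previous page arrives, so the pages cannot be fetched in parallel.
# Filter dates, mailto address, and process() are placeholders.
import requests

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"
params = {
    "filter": "from-update-date:2020-11-15,until-update-date:2020-11-15",
    "rows": 1000,
    "cursor": "*",               # initial deep-paging cursor
    "mailto": "me@example.org",  # placeholder
}

while True:
    message = requests.get(BASE, params=params).json()["message"]
    if not message["items"]:
        break
    process(message["items"])                  # hypothetical per-page handler
    params["cursor"] = message["next-cursor"]  # known only after this page returns
```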
Back to the original question of how to efficiently retrieve all updates since April 1.
Let's say I did want to parallelize them by fetching records for April through July in one set of queries and from August on in another set of queries.
So, I can run those two date-range query sets in parallel and roughly halve the time needed to fetch the records since April. And I can use more granular date ranges and parallelize those too, but I'll hit duplicate records (i.e., records that were updated in more than one date range).
But I think I'm still missing something, because you say that I can make up to 50 queries per second to fetch those 112,000 updates for one day.
You're right, my suggestion wasn't well thought out. Sorry about that. You do need to wait for the cursor from each query before making the next.
I'm not sure of a way around your dilemma, outside of becoming a Plus subscriber and being able to regularly pull the monthly snapshots. That said, I've asked our technical team for any suggestions they may have. I'll follow up as soon as I know more.
My colleagues on the technical team have some suggestions:
You could divide the set you need to download by date of creation, and download the various creation-date ranges in parallel. Creation date should be safe to partition on because it does not change, and every DOI has exactly one creation date, so a DOI will belong to exactly one creation-date range, assuming all possible ranges are downloaded. The full range to cover is from 2002-07-25 (inclusive; this is the oldest creation date in our data) to the current date.
For example, I can download DOIs updated since April and created in 2020, in parallel download DOIs updated since April and created in 2019, …, and in parallel download DOIs updated since April and created in 2002, using parallel requests https://0-api-crossref-org.libus.csd.mu.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2020,until-created-date:2020&cursor=… and https://0-api-crossref-org.libus.csd.mu.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2019,until-created-date:2019&cursor=… and so on.
Or, I could use smaller ranges and separately download DOIs updated since April and created in 2020-11, DOIs updated since April and created in 2020-10, and so on down to 2002-07. Or use just a few days as the range. The smallest range is one day long, as that is the resolution of the creation-date filter.
Those subsets may not be well balanced in terms of the number of DOIs, but this should allow you to speed the whole thing up a bit. A sketch of the approach follows below.
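To make the suggestion concrete, here is a rough sketch of the creation-year partitioning, with one independent cursor walk per year run in parallel (the dates, mailto address, and worker count are placeholders):

```python
# Rough sketch: one independent cursor walk per creation year, run in parallel.
# Dates, mailto address, and worker count are placeholders.
import requests
from concurrent.futures import ThreadPoolExecutor

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"
UPDATE_RANGE = "from-update-date:2020-04-01,until-update-date:2020-11-16"

def fetch_year(year):
    """Fetch all DOIs in UPDATE_RANGE that were created in the given year."""
    records = []
    params = {
        "filter": f"{UPDATE_RANGE},from-created-date:{year},until-created-date:{year}",
        "rows": 1000,
        "cursor": "*",
        "mailto": "me@example.org",  # placeholder
    }
    while True:
        message = requests.get(BASE, params=params).json()["message"]
        if not message["items"]:
            return records
        records.extend(message["items"])
        params["cursor"] = message["next-cursor"]

# Each DOI has exactly one creation year, so the partitions are disjoint
# (2002 is the earliest creation year in the data, per the post above).
with ThreadPoolExecutor(max_workers=4) as pool:
    per_year = list(pool.map(fetch_year, range(2002, 2021)))
```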
@ifarley Yes, this all makes sense. Thank you! I'll run some queries to get counts, estimate the volume of searches and the time needed, and then parallelize the whole process. Again, I appreciate your willingness to dig into this issue.
@ifarley Thanks for this great article! It looks like something is going wrong with the cursor. With &cursor=* the API returns a good next-cursor, but with &cursor=the_second_cursor_encoded the API returns the same next-cursor as the &cursor= value (while I'm not at the end of the list).
For example:
for me, your example (right now): /works?filter=from-update-date:2020-04-01,until-update-date:2020-11-15,from-created-date:2020,until-created-date:2020&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAADT3Z4FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAAj_MxBY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAALaP_IWR25aQ25BM09Rb0ctLThhTmdORDh1ZwAAAAACFDYDFkM1MFR1eE95UzRtSDdSdnNyX2VHY3cAAAAAAq9T1BZOalpNX1ptYlFVdVdzcHdzMTdxakN3AAAAAANPdncWSFRCMlg0WVJUVU9PWTltVmZUNDNBdw%3D%3D
Hi @ps80. Thanks for your message and welcome to the community forum.
We have migrated our backend to Elasticsearch since I wrote the information above. Cursors in Elasticsearch work a little differently from cursors in Solr, our previous backend, so some of the information in my post is out of date. Sorry about that.
In Elasticsearch, which we moved to in August 2021, you get a cursor and the server remembers your position in the dataset you are iterating over. The cursor stays the same, but you should get different result pages on subsequent calls.
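In practice that means the stop condition is an empty page, not a changed token. Roughly (the filter date, mailto address, and the handle() function are placeholders):

```python
# With the Elasticsearch backend the cursor token may repeat between pages,
# so stop on an empty page rather than on a repeated token.
# Filter date and mailto address are placeholders.
import requests

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"
params = {"filter": "from-update-date:2021-08-01", "rows": 1000,
          "cursor": "*", "mailto": "me@example.org"}

while True:
    message = requests.get(BASE, params=params).json()["message"]
    if not message["items"]:                   # empty page = end of results
        break
    handle(message["items"])                   # hypothetical per-page handler
    params["cursor"] = message["next-cursor"]
```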
Hi @ifarley Thanks for your fast response! I'm sorry, I'm used to Solr, so I expected a new cursor, but everything works as you're describing. Works perfectly!