Hello @slnm. Thanks for your message. Welcome to the community forum!
One alternative here is to use the from-update-date filter instead of the from-index-date filter. The major difference between the two is that from-index-date also captures updated citation counts (in addition to the changes included in from-update-date). If that information isn't of concern to you, the from-update-date filter will return far fewer results. Index dates cover changes that we ourselves make to the record, so very occasionally that will include work we've done on bugs, plus those citation count updates I mentioned. The from-update-date filter covers all metadata changes made by our members to their records.
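If it helps to see the difference in volume, a rows=0 query against each filter returns just the total count. A minimal sketch in Python (the date and mailto address here are placeholders):

```python
# Compare result counts for the two filters using rows=0 queries.
# The date and mailto address are placeholders.
import requests

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"

for flt in ("from-index-date", "from-update-date"):
    params = {"filter": f"{flt}:2020-11-15", "rows": 0, "mailto": "you@example.org"}
    message = requests.get(BASE, params=params).json()["message"]
    print(flt, message["total-results"])
```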
You're right, 385,706 records is a lot to update in one day, but we're always updating citation counts by matching references against existing, cited DOIs, so counts from the from-index-date filter are going to run high.
My aim is to maintain a relatively current dataset. Should I create a daily job to fetch new records, using your deep-paging cursor? My concern with that approach is that a query takes roughly 30 seconds to return. At a rate of 2 queries per minute, with 100,000 updates per day and 1,000 results per page, it will take 50 minutes per day to fetch the incremental changes. I have tried using the mailto parameter (and HTTPS) to get into the preferred query pool, but that doesn't seem to speed up queries.
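For reference, my queries look roughly like this, timed to show the ~30 second response (a sketch; the filter date and email address are placeholders):

```python
# Rough sketch of a single polite-pool query, with timing.
# The filter date and email address are placeholders.
import time
import requests

start = time.monotonic()
resp = requests.get(
    "https://0-api-crossref-org.libus.csd.mu.edu/works",
    params={
        "filter": "from-update-date:2020-11-15",
        "rows": 1000,
        "cursor": "*",
        "mailto": "me@example.org",  # identifies the caller for the polite pool
    },
)
print(f"status {resp.status_code} in {time.monotonic() - start:.1f}s")
```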
I think you'll find that the 112,000 updates per day number is a little higher than the average, which should help with the overall time estimate for fetching these incremental changes. And there's no reason you can't send us more than two queries per minute. You should be able to perform up to 50 per second and still be below our rate limits, as discussed here: https://github.com/CrossRef/rest-api-doc#rate-limits
I'd suggest using the Polite pool over the Public pool, as the Polite pool is the more performant of the two over the longer term.
If you need a higher rate limit or a more performant pool, our Plus pool, with its SLAs, is an option as well. You can learn more here: https://0-www-crossref-org.libus.csd.mu.edu/services/metadata-retrieval/metadata-plus/. If you're interested in learning more about the Plus service, I'd be happy to answer your questions or connect you with Jennifer Kemp, our Head of Partnerships.
Thanks again, @ifarley, for your help. I'm still not clear.
According to the REST API docs, I should use a cursor if I'm fetching a large number of rows, and offset can't be used with cursors. So I do an initial query with the parameter cursor=* to get the first cursor, then take next-cursor from the first set of results and use that cursor for the next query, and so on. Given that the cursor changes with every subsequent query, I can't parallelize those queries; I need to get the next cursor before making the next query. So, to get 112,000 updates with a maximum of 1,000 rows per query, I'll need to make 112 queries, and I don't see how I can do anything but wait for one query to complete before issuing the next.
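In code, that sequential walk looks something like this (a sketch; the filter dates, mailto address, and the process() handler are placeholders):

```python
# Sketch of the sequential cursor walk: each next-cursor is only known after
# the previous page arrives, so the pages cannot be fetched in parallel.
# Filter dates, mailto address, and process() are placeholders.
import requests

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"
params = {
    "filter": "from-update-date:2020-11-15,until-update-date:2020-11-15",
    "rows": 1000,
    "cursor": "*",               # initial deep-paging cursor
    "mailto": "me@example.org",  # placeholder
}

while True:
    message = requests.get(BASE, params=params).json()["message"]
    if not message["items"]:
        break
    process(message["items"])                  # hypothetical per-page handler
    params["cursor"] = message["next-cursor"]  # known only after this page returns
```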
Back to the original question of how to efficiently retrieve all updates since April 1.
Let's say I did want to parallelize them by fetching records for April through July in one set of queries and from August on in another set of queries.
So, I can run those two date-range query sets in parallel and roughly halve the time needed to fetch the records since April. And I can use more granular date ranges and parallelize those too, but I'll hit duplicate records (i.e., records that were updated in more than one date range).
But I think I'm still missing something, because you say that I can make up to 50 queries per second to fetch those 112,000 updates for one day.
You're right, my suggestion wasn't well thought out. Sorry about that. You do need to wait for the cursor from each query before making the next.
I'm not sure of a way around your dilemma, outside of becoming a Plus subscriber and being able to regularly pull the monthly snapshots. That said, I've asked our technical team for any suggestions they may have. I'll follow up as soon as I know more.
My colleagues on the technical team have some suggestions:
You could divide the set you need to download by date of creation, and download the various creation-date ranges in parallel. Creation date should be safe to partition on because it does not change, and every DOI has exactly one creation date, so a DOI will belong to exactly one creation-date range, assuming all possible ranges are downloaded. The full range to cover is from 2002-07-25 (inclusive; this is the oldest creation date in our data) to the current date.
For example, I can download DOIs updated since April and created in 2020, in parallel download DOIs updated since April and created in 2019, …, and in parallel download DOIs updated since April and created in 2002, using parallel requests https://0-api-crossref-org.libus.csd.mu.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2020,until-created-date:2020&cursor=… and https://0-api-crossref-org.libus.csd.mu.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2019,until-created-date:2019&cursor=… and so on.
Or, I could use smaller ranges and separately download DOIs updated since April and created in 2020-11, DOIs updated since April and created in 2020-10, and so on down to 2002-07. Or use just a few days as the range. The smallest range is one day long, as that is the resolution of the creation-date filter.
Those subsets may not be well balanced in terms of the number of DOIs, but this should allow you to speed the whole thing up a bit. A sketch of the approach follows below.
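To make the suggestion concrete, here is a rough sketch of the creation-year partitioning, with one independent cursor walk per year run in parallel (the dates, mailto address, and worker count are placeholders):

```python
# Rough sketch: one independent cursor walk per creation year, run in parallel.
# Dates, mailto address, and worker count are placeholders.
import requests
from concurrent.futures import ThreadPoolExecutor

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"
UPDATE_RANGE = "from-update-date:2020-04-01,until-update-date:2020-11-16"

def fetch_year(year):
    """Fetch all DOIs in UPDATE_RANGE that were created in the given year."""
    records = []
    params = {
        "filter": f"{UPDATE_RANGE},from-created-date:{year},until-created-date:{year}",
        "rows": 1000,
        "cursor": "*",
        "mailto": "me@example.org",  # placeholder
    }
    while True:
        message = requests.get(BASE, params=params).json()["message"]
        if not message["items"]:
            return records
        records.extend(message["items"])
        params["cursor"] = message["next-cursor"]

# Each DOI has exactly one creation year, so the partitions are disjoint
# (2002 is the earliest creation year in the data, per the post above).
with ThreadPoolExecutor(max_workers=4) as pool:
    per_year = list(pool.map(fetch_year, range(2002, 2021)))
```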
@ifarley Yes, this all makes sense. Thank you! I'll run some queries to get counts, estimate the volume of searches and the time needed, and then parallelize the whole process. Again, I appreciate your willingness to dig into this issue.
@ifarley Thanks for this great article! It looks like something is going wrong with the cursor. With &cursor=* the API returns a good next-cursor, but with &cursor=the_second_cursor_encoded the API returns the same next-cursor as the &cursor= value (while I'm not at the end of the list).
For example:
for me, your example (right now): /works?filter=from-update-date:2020-04-01,until-update-date:2020-11-15,from-created-date:2020,until-created-date:2020&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAADT3Z4FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAAj_MxBY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAALaP_IWR25aQ25BM09Rb0ctLThhTmdORDh1ZwAAAAACFDYDFkM1MFR1eE95UzRtSDdSdnNyX2VHY3cAAAAAAq9T1BZOalpNX1ptYlFVdVdzcHdzMTdxakN3AAAAAANPdncWSFRCMlg0WVJUVU9PWTltVmZUNDNBdw%3D%3D
Hi @ps80. Thanks for your message and welcome to the community forum.
We have migrated our backend to Elasticsearch since I wrote the information above. Cursors in Elasticsearch work a little differently from cursors in Solr, our previous backend, so some of the information in my post is out of date. Sorry about that.
In Elasticsearch, which we moved to in August 2021, you get a cursor and the server remembers your position in the dataset you are iterating over. The cursor stays the same, but you should get different result pages on subsequent calls.
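In practice that means the stop condition is an empty page, not a changed token. Roughly (the filter date, mailto address, and the handle() function are placeholders):

```python
# With the Elasticsearch backend the cursor token may repeat between pages,
# so stop on an empty page rather than on a repeated token.
# Filter date and mailto address are placeholders.
import requests

BASE = "https://0-api-crossref-org.libus.csd.mu.edu/works"
params = {"filter": "from-update-date:2021-08-01", "rows": 1000,
          "cursor": "*", "mailto": "me@example.org"}

while True:
    message = requests.get(BASE, params=params).json()["message"]
    if not message["items"]:                   # empty page = end of results
        break
    handle(message["items"])                   # hypothetical per-page handler
    params["cursor"] = message["next-cursor"]
```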
Hi @ifarley Thanks for your fast response! I'm sorry, I'm used to Solr, so I expected a new cursor, but everything works as you're describing. Works perfectly!