Keeping Cited-by counts current across a large corpus

Wondering what others are finding most effective for keeping cited-by counts up to date over, say, 500k DOIs.
The main options I see, given that you cannot check too often on most sites that receive a large amount of bot traffic, are:

  1. Register all DOIs with the cited-by callback service - seems ideal, but registering them programmatically one by one for the service is slow. Any issues with this approach?
  2. Store a last-checked date, then make an async works API call when it has expired (say, after 5 days) whenever an article is displayed - easiest, but a large number of calls return no update.
  3. Once a week (say), make a large number of lightweight batch calls against the works API (using an API cursor), requesting just the DOI and is-referenced-by-count fields (thus keeping the payload small), with the sort order being is-referenced-by-count DESC - so I can stop making requests as soon as a record comes back with a zero is-referenced-by-count.
    e.g.
api.crossref.org/works?filter=prefix:10.XXXX&rows=1000&sort=is-referenced-by-count&order=desc&mailto=x@y.org&select=DOI,is-referenced-by-count
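A minimal sketch of option 3 along those lines (the prefix and mailto are placeholders; polite rate-limiting and error handling are omitted):

```python
import json
import urllib.parse
import urllib.request

def take_nonzero_counts(items):
    """Return ({doi: count}, hit_zero) for one page of results.

    Results are sorted by is-referenced-by-count descending, so the
    first zero means everything after it is zero too.
    """
    counts = {}
    for item in items:
        if item["is-referenced-by-count"] == 0:
            return counts, True
        counts[item["DOI"]] = item["is-referenced-by-count"]
    return counts, False

def fetch_all_counts(prefix, mailto):
    # Deep-page through the works API with a cursor, requesting only
    # the two fields needed, and stop once the counts reach zero.
    params = {
        "filter": f"prefix:{prefix}",
        "rows": "1000",
        "sort": "is-referenced-by-count",
        "order": "desc",
        "select": "DOI,is-referenced-by-count",
        "mailto": mailto,
        "cursor": "*",  # each response returns a next-cursor to continue from
    }
    counts = {}
    while True:
        url = "https://api.crossref.org/works?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as resp:
            message = json.load(resp)["message"]
        page, hit_zero = take_nonzero_counts(message["items"])
        counts.update(page)
        if hit_zero or not message["items"]:
            return counts
        params["cursor"] = message["next-cursor"]
```

With 1000 rows per page, a 500k-DOI prefix is at most ~500 requests per pass, and usually far fewer once the zero-count tail is skipped.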

Thanks

Thanks for the question. It’s tedious, but we’d recommend solution 1: registering each DOI separately for the callback service.

That's the simple answer; however, it does mask a flaw in the notification service, especially at the scale you're working at. Occasionally, references are removed as well as added, but we don't notify for these removals. You might want to consider something along the lines of option 3: rechecking the counts on something like a monthly basis. If you do this, I'd recommend using the XML API, as at the time of writing it's in better sync with the notification service. There are delays and mismatches between the two APIs that we hope to address in updates to our technology later this year.


Thanks!

I am trying to optimize the giant DOI registration process.
I can see 2 ways to approach this.

  1. a GET request, encoding the XML query to set up the FL alert - this returns the result set to my calling process directly, or
  2. a POST sending the same XML query in a file to your query queue - this emails the ‘query submission result’ back to me when processed in your queue.

Can I assume it is not possible to register for alerts with a GET without actually getting the full dataset of forward links back? (This matters most for articles that have, say, 800-4000 FLs, as they cause timeouts, which really slows down the giant batch job for registering all the articles.) I could just send the GET call and then ignore the response if it doesn't come back within, say, 10 seconds - which it won't for a highly cited article. The issue with that is that I would not know whether my registration for notifications was successful.

Obviously the POST approach makes it easier to ignore the giant payload for highly cited articles, but it does mean I need to write code that checks for the existence of the 'query submission result' email for each DOI to make sure my registration for the cited-by service was successful.

Is there any means to check whether my 'email' (even though I am setting these up for notification callbacks, I think the email address in the XML query is how you tie it to my account/endpoint) has been registered for callbacks for a given DOI? IOW, an endpoint that I can poll to ask whether alerts are set up for a given DOI/email combination?

I should add that I made about 3,000 [polite] programmatic GET XML query requests attempting to set up the notifications last night. I did get a sub-second 200 response back for each, but only a body of 'true' - which I initially thought meant they worked, but in hindsight I doubt they were successful, and I don't think I can validate whether they were. The reason I doubt it is that I just tried with a GUI (ThunderClient in VS Code), essentially replicating my programmatic GET calls, and when that ran it took a while and returned the entire set of forward_links for the DOI - IOW the same response I could get from calling the getForwardLinks servlet, except that it included a msg element for each forward_link:
Forward linking query alerts enabled in Crossref

It won’t make a difference whether you use POST or GET, both will enable the forward link alerts. If they are successful, you should have a confirmation sent to your notification endpoint.

We can see a total of around 13,000 items set up with alerts to you at the moment. Unfortunately we don’t log the dates the alerts were set up. Does that match the number you expected?

Thanks Martyn.
Actually, that doesn't match the number that I have (18k) for distinct historical DOIs that were set up.

When you said 'If they are successful, you should have a confirmation sent to your notification endpoint' - does notification endpoint mean the email mailbox, or our registered cited-by notification API endpoint?
The only confirmations I have seen lately are in the mailbox, even though we have theoretically switched over to the notification-callback approach.

Sorry, it's not a confirmation exactly, but you will see updates being pushed to the endpoint immediately. Have you seen anything since you set up the new alerts yesterday? Did you get anything from your ThunderClient/VS Code method?

You have just over 17K items with forward link notifications set up if we include an additional email address you've used in the past - were you including that as well?


I am hoping that I just solved my problems, primarily using information from the crossref_query_input2.0.xsd.

The primary issue I was facing is that I have ~500k DOIs to set up cited-by notification alerts for, but the HTTP response includes all the forward_link data (which I don't need for this process) - meaning that most requests for articles with more than, say, 200 citations (of which there are many - some are over 4,000) can take 30-120 seconds to return, sometimes resulting in 508 errors etc. By the time I added sleep calls to make my batches somewhat polite, plus retries for all the timeouts, the whole thing was looking like a pain.

Anyway, after adding start_date and end_date attributes to the fl_query, like this:

<body><fl_query alert="true" start_date="2024-01-01" end_date="2024-01-01">
<doi>xx/xx.xxx</doi></fl_query></body>

I now get pretty much nothing back except the response below (in milliseconds), indicating that alerts are enabled for the DOI:

<query_result>
...
  <body>
   <forward_link doi="xx/xx.xxx">
    <msg>Forward linking query alerts enabled in Crossref</msg>
   </forward_link>
  </body>
 </query_result>
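For reference, a rough sketch of how I'm building the date-bounded query and checking the response. The query_batch envelope, email address, and batch id are my reconstruction from crossref_query_input2.0.xsd rather than a copy of a verified working request, so adjust to whatever your own queries use:

```python
# Build a date-bounded fl_query (alert="true" registers the DOI for
# cited-by notifications; the matching start/end dates suppress the
# forward_link payload) and check the response for the confirmation msg.
# Envelope, email, and batch id below are reconstructed assumptions.
QUERY_TMPL = """<?xml version="1.0" encoding="UTF-8"?>
<query_batch version="2.0" xmlns="http://www.crossref.org/qschema/2.0">
 <head>
  <email_address>{email}</email_address>
  <doi_batch_id>{batch_id}</doi_batch_id>
 </head>
 <body>
  <fl_query alert="true" start_date="{date}" end_date="{date}">
   <doi>{doi}</doi>
  </fl_query>
 </body>
</query_batch>"""

def build_fl_query(doi, email, batch_id, date="2024-01-01"):
    return QUERY_TMPL.format(doi=doi, email=email, batch_id=batch_id, date=date)

def alert_enabled(response_xml):
    # The lightweight query_result above carries this msg per forward_link.
    return "Forward linking query alerts enabled" in response_xml
```

Checking for the msg text gives the batch job a cheap per-DOI success signal, which was the missing piece with the bare 'true' responses.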

The other issue I was facing is that while ThunderClient (a REST client in VS Code) allowed me to send this type of XML query over an HTTPS GET, my code was not working the same way. The issue was the URL encoding of the qdata in the query string. My code was URL-encoding the entire XML query, which did not work, though I still got a 200 back. Not URL-encoding at all seems to work fine. I found this page a bit confusing on what needs to be encoded when: Using HTTPS to query - Crossref
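For anyone hitting the same thing, the difference was roughly this (a sketch; the servlet URL, credentials, and DOI are placeholders, and whether unencoded qdata is strictly required may depend on your client):

```python
from urllib.parse import quote

# Placeholder XML body and endpoint, for illustration only.
xml = '<body><fl_query alert="true"><doi>10.5555/example</doi></fl_query></body>'
base = "https://doi.crossref.org/servlet/query?usr=USER&pwd=PASS&qdata="

# What my code was doing: percent-encoding the whole XML.
# Returned a 200 with a body of 'true', but the alert was not registered.
url_encoded = base + quote(xml, safe="")

# What actually worked for me: qdata appended as-is, unencoded.
url_raw = base + xml
```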

Thanks!!


Thanks for sharing the update - adding dates is a good option!
