Hi,
I am doing a thesis for study/research, and I want to get a list of articles related to a word (COVID). The only required metadata would be the title and year, although a summary would be nice.
I managed to write some code in Python to request such data through the API, but I fear I might be requesting too much, from what I read in the starters' guide, as I am not too knowledgeable in coding either.
So what would be the best way to extract a list of articles related to COVID? The biggest list feasibly possible (something like 10k+ articles across 2019-2024). The big shared data file isn't attractive either, since I would probably want only a small subset of those 200 GB, and only across 2019-2024.
Thank you very much!
Hello @CesarK,
Thank you for your forum post.
So this is definitely possible through the API, and the fact that you mention using a Python script will be beneficial for extracting all of the data, as the maximum number of rows/items you can return per page is 1,000. So if the total number of results is more than 1,000, you are going to need a script to collate those results.
What are you looking to have returned between those dates? Are you looking to return works that were published between 2019 and 2024, or registered between those dates? I imagine the totals would not be too dissimilar, but I just wanted to check.
If you were looking for results by published date, then you would use a query URL like:
https://0-api-crossref-org.libus.csd.mu.edu/works?query.bibliographic=COVID&filter=from-pub-date:2019,until-pub-date:2024&mailto=add_your_email_address_here
This, however, returns around 750k results, which I imagine is not the size of result set you were hoping for.
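If you want to sanity-check a total like that from Python before downloading anything, a quick sketch along these lines should work (setting rows=0 asks the API for just the summary, with no items; swap in your own email address):

import requests

# Ask for the summary only (rows=0) to see the size of the result set.
res = requests.get(
    "https://0-api-crossref-org.libus.csd.mu.edu/works",
    headers={"User-Agent": "covid-query-script; mailto:your_email_here@email.org"},
    params={
        "query.bibliographic": "COVID",
        "filter": "from-pub-date:2019,until-pub-date:2024",
        "rows": 0,  # no items, just the totals
    },
)
print(res.json()["message"]["total-results"])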
The issue is that the search tool is going to return all results that contain “COVID” in any of the bibliographic metadata.
So you would need to think about how you might want to filter the list down further to get a smaller set of results, though that could be much harder to do, as the filters would need to be very specific to get it down to 10k results.
Are there any other specific filters you would like to add to make the results set smaller?
If I just return journal articles using the query URL:
https://0-api-crossref-org.libus.csd.mu.edu/works?query.bibliographic=COVID&filter=from-pub-date:2019,until-pub-date:2024,type:journal-article&mailto=add_your_email_address_here
That reduces the result list to ~590k, but nowhere near 10k. Is there a specific year that you would be more interested in?
If I add a facet to the query URL, you can see the number of articles published per year:
https://0-api-crossref-org.libus.csd.mu.edu/works?query.bibliographic=COVID&filter=from-pub-date:2019,until-pub-date:2024,type:journal-article&mailto=add_your_email_address_here&facet=published:*
- 2019: 191,
- 2020: 112096,
- 2021: 179247,
- 2022: 164123,
- 2023: 107940,
- 2024: 29323
These numbers should give you more of an idea of how difficult it might be to get the totals down to 10k.
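You can also read those facet counts programmatically rather than from the browser; roughly like this (if I remember the response layout correctly, the counts sit under message -> facets -> published -> values):

import requests

# Pull the per-year counts from the published facet.
res = requests.get(
    "https://0-api-crossref-org.libus.csd.mu.edu/works",
    headers={"User-Agent": "covid-query-script; mailto:your_email_here@email.org"},
    params={
        "query.bibliographic": "COVID",
        "filter": "from-pub-date:2019,until-pub-date:2024,type:journal-article",
        "rows": 0,
        "facet": "published:*",
    },
)
counts = res.json()["message"]["facets"]["published"]["values"]
for year in sorted(counts):
    print(year, counts[year])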
I am happy to try and work through this with you if you can send over some more specific queries you might want to investigate, and we can then work through a Python script if needed as well.
Many thanks,
Paul
Thank you so much Paul!
This is very helpful!
The 10k number was just a number I thought would not be a "heavy" request; it seems I was undershooting by a lot. If I can extract without limits, that would be perfect.
What I ideally want is to get an idea of how "COVID" and some other related terms have been researched, that is, how they have appeared in the metadata of articles across 2019-2024.
So I want to extract from the API a dataset (as big as possible) with metadata such as: title, year of publication, tags, author (optional) and summary (optional), as long as it is related to COVID. With this dataset, I would search the titles, tags and summaries (optional) for the most common words, with some data manipulation.
Hi @CesarK,
Thanks for the reply and I am glad that it was helpful.
There is not strictly a limit on the number of results you can retrieve from the API, but when querying in these sorts of numbers you will definitely want to use the "polite" method, so that you are able to access the "polite" pool of the API.
You can see some great tips on using the API efficiently here: Tips for using the Crossref REST API - Crossref
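In practice the "polite" method just means sending your email address with every request, either as a mailto parameter on the URL or inside the User-Agent header. In Python that looks something like this (placeholder address, of course; one of the two is enough):

import requests

# Identify yourself to the polite pool via the User-Agent header
# and/or the mailto parameter.
res = requests.get(
    "https://0-api-crossref-org.libus.csd.mu.edu/works",
    headers={"User-Agent": "covid-query-script; mailto:your_email_here@email.org"},
    params={"query.bibliographic": "COVID", "rows": 1, "mailto": "your_email_here@email.org"},
)
print(res.status_code)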
You also might want to think about paginating through the results one year at a time and collating those, so that if the connection with the API fails you can continue from the last completed year. If you were to try to return all of the ~750k results in one go, you would need to paginate through 750 pages of the API query. That is easily doable, but the year-by-year approach may be the safer option.
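To give you an idea of the year-by-year slicing, here is a rough sketch that just sizes up each yearly slice first (again with rows=0, so nothing is downloaded yet):

import requests

# Size up each yearly slice before committing to a full download.
headers = {"User-Agent": "covid-query-script; mailto:your_email_here@email.org"}
for year in range(2019, 2025):
    res = requests.get(
        "https://0-api-crossref-org.libus.csd.mu.edu/works",
        headers=headers,
        params={
            "query.bibliographic": "COVID",
            "filter": (f"from-pub-date:{year}-01-01,"
                       f"until-pub-date:{year}-12-31,type:journal-article"),
            "rows": 0,  # totals only for now
        },
    )
    print(year, res.json()["message"]["total-results"])

You could then save each year's collected items to its own file, so that a failed run only costs you the year that was in progress.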
Once you have the results stored, you can then iterate over them to extract the metadata that you are looking for, like the title, abstract, etc.
Feel free to reply back with any issues you are having, or if you need more advice or help with the querying of the API or the Python script.
Many thanks,
Paul
Thank you again Paul!
I unfortunately haven't had the time to dig into the code for doing what you recommend. Could you give me a tip on the best way to write the code to paginate through the results 1,000 at a time using the Crossref API? I managed to write one that pulls 1,000 results (the limit) once, but I'm not sure how to proceed to iterate.
Thanks in advance!
Hello @CesarK,
So you can find the documentation about working with the API here: https://0-api-crossref-org.libus.csd.mu.edu/
That explains a little about deep paging with cursors, but what you would need to do in your script is add the parameter "cursor": "*" if you are using parameters, or add cursor=* to the query URL. This will return an extra cursor field in the response.
e.g. querying using the URL https://0-api-crossref-org.libus.csd.mu.edu/works?query.bibliographic=COVID&filter=from-pub-date:2019,until-pub-date:2024,type:journal-article&mailto=add_your_email_address_here&facet=published:*&cursor=* would return the extra element next-cursor.
You would then substitute the * for the next-cursor value and iterate through the pages until all the data has been collected. You might want to test on a small subset first, so that you can make sure you have it right before running through ~750 pages of 1,000 rows; maybe start off with the 2019 return of 191 results, with rows set to 100.
In my scripts I use a while loop in Python to continue through the pages until there are no items left in the response. This is a quick example which you would need to build on, but something like:
import requests

# Collect every page of results for the small 2019 test query.
item_list = []

# Use the "polite" pool by identifying yourself in the User-Agent header.
headers = {"User-Agent": "covid-query-script; mailto:your_email_here@email.org"}

def params(cursor):
    return {
        "query.bibliographic": "COVID",
        "filter": "from-pub-date:2019,until-pub-date:2019,type:journal-article",
        "rows": "100",
        "cursor": cursor,
    }

cursor = "*"
while True:
    res = requests.get(
        "https://0-api-crossref-org.libus.csd.mu.edu/works",
        headers=headers,
        params=params(cursor),
    )
    try:
        body = res.json()["message"]
        items = body["items"]
        if not items:
            # An empty page means every result has been collected.
            break
        # Carry the next-cursor value over into the next request.
        cursor = str(body["next-cursor"])
        item_list.extend(items)
    except (KeyError, TypeError):
        # If the query is invalid, "message" holds an error rather than
        # a results object and the lookups above fail.
        print("The API query is not valid or no results available")
        break

print(len(item_list))
This will just collate the items into item_list, and you can manipulate that data in whichever way you like. I am just printing the length of the list at the end to check that I got them all.
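For example, if you wanted the title, year and abstract (where one has been deposited) in a CSV, something along these lines would work on top of item_list; here I am reading the year from the issued date-parts field:

import csv

# Write title, year and abstract (when present) for each item to a CSV.
with open("covid_works.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "year", "abstract"])
    for item in item_list:
        titles = item.get("title") or [""]
        date_parts = item.get("issued", {}).get("date-parts", [[None]])
        writer.writerow([titles[0], date_parts[0][0], item.get("abstract", "")])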
I would still recommend iterating over the years one at a time, and adding some time.sleep calls if error messages start showing up. I have also not added any error handling, so you might want to add that as well so that a failing script does not hammer the API.
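As a starting point for that, a small retry wrapper (my own rough helper, nothing official) could look like the below, called in place of requests.get in the loop above:

import time
import requests

# Retry a request a few times with a pause between attempts, so that one
# transient failure does not kill a long-running download.
def polite_get(url, headers, params, retries=3, pause=5):
    for attempt in range(retries):
        try:
            res = requests.get(url, headers=headers, params=params, timeout=60)
            if res.status_code == 200:
                return res
            print(f"Got HTTP {res.status_code}, retrying in {pause}s")
        except requests.RequestException as exc:
            print(f"Request failed ({exc}), retrying in {pause}s")
        time.sleep(pause)
    raise RuntimeError("Giving up after repeated failures")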
I hope this helps as a starter for you and if you need any other help then please do let me know.
Many thanks,
Paul
Thank you very much Paul,
I’ll try following your tips!
Appreciate the help, truly