How to query works that match a given AND family name?

I’ve been using the REST API for a few days now and it’s been great for some research I’m doing. However, from what I can see in the documentation on field queries, if multiple queries are provided, the API always performs an OR search not an AND search. This winds up being problematic for my purposes when searching for a specific author by name since it matches on the given and family name separately. For authors with unique names it isn’t an issue because relatively few results show up, but for common names (eg. John Smith) it returns hundreds of thousands of results matching either the given or family name. Is there a way to perform a query search where results must match all parts of the query?

Thank you in advance!


Hello @theonlydvr . Thanks for your message, and welcome to the community forum!

No, our REST API does not support boolean operators, so you’re right that you are performing more of an OR search here. But I think there are ways to get the information you seek.

Narrow queries with few values in your query are going to return imprecise results. We know that. We have over 163 million DOIs in our corpus, so sending us more information in the query is better. I found a real-life example here that we can use to help clarify this:

Let’s say we’re looking for works written by author Steven Hoang. Even if we allowed for boolean operators, Steven AND Hoang is going to give us a lot of noise. So, what else do we know about the author? Let’s say that we know this author has long been affiliated with the John Peter Smith Health Network in Fort Worth, Texas, USA. Well, let’s just expand our search so we can get the most relevant results. Thus, we’d query for:

https://0-api-crossref-org.libus.csd.mu.edu/works?query.affiliation=Steven+Hoang+John+Peter+Smith+Health&select=DOI,title,type,published,author,score&rows=750

In this example query, the API returns any content registered with us that has any of these words in the author name or affiliation elements of its record: Steven, Hoang, John, Peter, Smith, or Health. A ton of fairly common names/terms, right?

As you can see, because we are including six search terms, we’re going to get many results - over 2 million. Yikes! But the results are ordered by relevance score, so works (DOIs) with all six words in their affiliation metadata will be ranked higher than works with only one of the search terms in their metadata. I’ve requested the top 750 results and also limited the metadata returned to each work’s DOI, title, work type (e.g., journal article), publication date, author name, and relevance score.
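For anyone scripting this, here is a minimal sketch of how the query above could be assembled in Python. It assumes the public endpoint https://api.crossref.org/works rather than the proxied host shown in the link, and the helper name build_works_url is just for illustration. It only builds the URL; it does not send the request.

```python
from urllib.parse import urlencode

# Assumption: the public Crossref endpoint, not the proxied host in the link above.
BASE_URL = "https://api.crossref.org/works"


def build_works_url(terms, fields, rows, base_url=BASE_URL):
    """Build a /works URL for a query.affiliation search (hypothetical helper)."""
    params = {
        "query.affiliation": " ".join(terms),  # spaces are encoded as '+'
        "select": ",".join(fields),            # limit the metadata returned
        "rows": rows,                          # number of results requested
    }
    # safe="," keeps the commas in the select list readable.
    return f"{base_url}?{urlencode(params, safe=',')}"


url = build_works_url(
    ["Steven", "Hoang", "John", "Peter", "Smith", "Health"],
    ["DOI", "title", "type", "published", "author", "score"],
    750,
)
print(url)
```

Adding or removing search terms is then a matter of changing the first list.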

As you can see from the limited metadata that I have requested, the first two results look much more relevant to your query than the 750th result:

The second result includes an author named Steven Hoang, and that DOI’s relevance score is 41.6.

The 750th result is for a work that appears to include only one of these six words - Hoang - and its relevance score is much lower: 16.4.

Thus, increasing the number of search parameters could be a good strategy for you.
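Since the API itself won’t do the AND for you, one option is to apply it on your side after the results come back. The sketch below is not a Crossref feature, just a client-side filter: it assumes items is the list found under message.items in the JSON response, that each author entry carries given and family keys, and that a simple case-insensitive comparison is good enough for your data. The function names, the sample records, and the min_score cutoff are all hypothetical.

```python
def has_author(item, given, family):
    """True if a single author entry matches BOTH the given and family name."""
    given, family = given.casefold(), family.casefold()
    for author in item.get("author", []):
        author_given = author.get("given", "").casefold()
        author_family = author.get("family", "").casefold()
        # Family name must match exactly; given name may be one of several
        # tokens (e.g. "Steven A.").
        if author_family == family and given in author_given.split():
            return True
    return False


def filter_items(items, given, family, min_score=0.0):
    """Keep items at or above min_score that list the requested author."""
    return [
        item
        for item in items
        if item.get("score", 0.0) >= min_score and has_author(item, given, family)
    ]


# Made-up records shaped like entries in message.items.
sample_items = [
    {"DOI": "10.0000/example-1", "score": 41.6,
     "author": [{"given": "Steven", "family": "Hoang"}]},
    {"DOI": "10.0000/example-2", "score": 16.4,
     "author": [{"given": "Linh", "family": "Hoang"}]},
    {"DOI": "10.0000/example-3", "score": 30.0,
     "author": [{"given": "Steven", "family": "Smith"},
                {"given": "Anna", "family": "Hoang"}]},
]

matches = filter_items(sample_items, "Steven", "Hoang", min_score=20.0)
print([item["DOI"] for item in matches])
```

Note that the third record is dropped even though both words appear in it, because they belong to two different authors. That is the behaviour an OR-style search cannot give you.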

A couple of other things to note: if the author you are searching for has an ORCID iD, here’s a great opportunity to search for the author’s works using the ORCID iD.

For example: https://0-api-crossref-org.libus.csd.mu.edu/works?filter=orcid:0000-0002-9117-4510

Also, we use the default scoring in Elasticsearch: Practical BM25 - Part 2: The BM25 Algorithm and its Variables | Elastic Blog

Please let me know if you have any additional questions,
Isaac

Hi Isaac, thank you for the write up, it’s very helpful! I’m going to look into that score term some more and see if I can get any sort of automatic thresholding to work to consistently get at the desired results. On the topic of ORCID IDs, I assume many publications for a particular author might not show up under their ORCID ID if they didn’t explicitly add them to their profile or the automated system did not add them?

Hi @theonlydvr ,

On the topic of ORCID IDs, I assume many publications for a particular author might not show up under their ORCID ID if they didn’t explicitly add them to their profile or the automated system did not add them?

In order for the work to appear in the API results, the publisher member would have had to register the ORCID iD in the metadata they sent to us, Crossref.

For example, the query above is for my own ORCID iD: https://0-api-crossref-org.libus.csd.mu.edu/works?filter=orcid:0000-0002-9117-4510

As you can see, I only get five results. Those are the five works for which members included my ORCID iD in the metadata they registered.

Compare that to the works in my ORCID profile - https://orcid.org/0000-0002-9117-4510:

Be prepared for a delta.

-Isaac

An update on something that I said in this thread: I incorrectly said that our REST API performs an OR search when given multiple terms in a query or query.bibliographic search. I was wrong about that.

We don’t quite have an OR query at play within our REST API. Instead, query and query.bibliographic require that at least 20% of query words match. We added this requirement a few years ago for performance reasons. So, with 10 input query words, 20% becomes 2 words, and the query results start to drop works that match only one word at that threshold. This is why a query containing poverty+low+middle+income+country+LIC+MIC has more results than a query containing poverty+low+middle+income+country+LIC+MIC+climate+urban+migration+gender . We’ve simply dropped results from the latter of the two queries because of that 20% matching rule that we implemented.
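To make the arithmetic concrete, here is a small sketch of the 20% rule. The rounding is an assumption on my part: the post only gives the 10-word case, and this sketch rounds down, which is consistent with the two example queries above (7 words versus 11 words) but is not confirmed. The function name is hypothetical.

```python
def min_matching_words(query_word_count, percent=20):
    """Words that must match under a percent threshold.

    Assumption: fractional results are rounded down, with a floor of one word.
    """
    return max(1, (query_word_count * percent) // 100)


short_query = "poverty low middle income country LIC MIC".split()
long_query = short_query + ["climate", "urban", "migration", "gender"]

for words in (short_query, long_query):
    print(len(words), "words ->", min_matching_words(len(words)), "must match")
```

Under that assumption the 7-word query still accepts single-word matches, while the 11-word query needs two, which is why the longer query returns fewer results.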

For more context on this see this thread.

-Isaac