Encoding when querying DOI

When I query some DOIs to retrieve the publication text (API of Crossref URL + DOI + /transform/text/plain), sometimes I get strange characters. E.g., this happens with DOIs 10.1145%2F3544548.3580875 and 10.1016%2Fj.inffus.2006.10.007.

It seems to be an encoding problem. Can this be solved? Any tips are welcome.

Hello, and thanks for your post.

Yes, it is due to encoding. â€“ is how the UTF-8 encoded character – (an en dash) appears when its bytes are read as ISO-8859-1. You’ll see similar garbling for other characters, such as apostrophes. So it’s likely that whatever context you’re accessing that text in is not equipped to handle UTF-8, or at least isn’t doing so by default.
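To make the mechanism concrete, here is a minimal Python sketch that reproduces the garbling and reverses it. It assumes the text was decoded as Windows-1252 (Python's `cp1252`), which is how browsers generally treat content labelled ISO-8859-1. The helper names `garble` and `repair` are made up for illustration.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then decode with the wrong charset (assumed: cp1252)."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the wrong decoding by reversing the two steps."""
    return garbled.encode("cp1252").decode("utf-8")


en_dash = "\u2013"      # the en dash
apostrophe = "\u2019"   # the typographic apostrophe

print(garble(en_dash))                     # three characters instead of one dash
print(garble(apostrophe))                  # same pattern for the apostrophe
print(repair(garble("pp. 28\u201337")))    # round-trips back to the original text
```

This is only a sketch: a few characters (a closing curly quote, for instance) produce a byte that `cp1252` leaves undefined, so they will not round-trip this way.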

> When I query some DOIs to retrieve the publication text (API of Crossref URL + DOI + /transform/text/plain),

Just to clarify, that query will not retrieve the publication text. It will only retrieve the bibliographic metadata, of the sort that you would find in a formatted citation.

Can you tell me more about the method (script, program, client?) you’re using to make these queries and how you’re processing the results? Once I know that, I can suggest possible solutions or ask my colleagues on our technical team to do so.

Thanks,
Shayn

I am just accessing it through a URL in the browser. In my workflow, it is very useful to do it like that. Any suggestion is welcome!

The API isn’t really meant to be accessed via a browser. It’s intended for programmatic use.
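As a rough idea of what programmatic use could look like, here is a minimal Python sketch using only the standard library. The base URL `https://api.crossref.org/works/` is an assumption (this thread only describes the query as "API of Crossref URL + DOI + /transform/text/plain"), and the function names and the User-Agent string are made up for illustration. The key point is that the response bytes are decoded explicitly as UTF-8 rather than left to a default.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL; adjust it to whatever "API of Crossref URL" is in your setup.
API_BASE = "https://api.crossref.org/works/"


def build_url(doi: str) -> str:
    """Build the query URL, percent-encoding the slash in the DOI."""
    return f"{API_BASE}{quote(doi, safe='')}/transform/text/plain"


def decode_citation(raw: bytes) -> str:
    """Decode the response body explicitly as UTF-8."""
    return raw.decode("utf-8").strip()


def fetch_citation(doi: str) -> str:
    """Fetch the formatted citation for one DOI (performs a network request)."""
    request = Request(
        build_url(doi),
        headers={"User-Agent": "doi-citation-example/0.1"},  # hypothetical client name
    )
    with urlopen(request, timeout=30) as response:
        return decode_citation(response.read())
```

Calling `fetch_citation("10.1145/3544548.3580875")` would then return the citation as a Python string with the dashes and apostrophes intact, assuming the service responds with UTF-8 bytes as described above.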

If you want to query for DOI citations manually, you can use our Metadata Search Interface. Each result will have a link beneath it that says “Actions”. If you click “Actions” then “Cite”, and then select any citation style, you’ll get the same results as you do using the /transform/text/plain API query, but without the character encoding problems.

Alternatively, the Citation Formatter at https://citation.crosscite.org/ will do the same thing, but for DataCite and mEDRA DOIs, in addition to Crossref DOIs.

If you prefer to use the API in the browser, the only workaround that I’m aware of to get those characters encoded properly is:

  1. save the page as an html file (you will be prompted to save it as .txt, but replace the .txt extension with .html manually)
  2. open that html file in a text editor, and add <meta charset="utf-8"> at the top of the file, before the citation text.
  3. save that change and then re-open the edited file in your browser.
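If you end up doing this often, the same steps could be scripted. The Python sketch below is one possible way, not an official tool: it wraps citation text you already have in a small HTML page that declares UTF-8. The function names are made up, and the doctype line and `<pre>` wrapper are my additions beyond the manual steps above.

```python
import html
from pathlib import Path


def wrap_as_html(citation_text: str) -> str:
    """Put the charset declaration at the top of the page, before the citation."""
    return (
        "<!DOCTYPE html>\n"
        '<meta charset="utf-8">\n'
        f"<pre>{html.escape(citation_text)}</pre>\n"
    )


def save_citation_page(citation_text: str, path: str) -> Path:
    """Write the page as UTF-8 so the bytes match the declared charset."""
    target = Path(path)
    target.write_text(wrap_as_html(citation_text), encoding="utf-8")
    return target
```

For example, `save_citation_page(text, "citation.html")` writes a file you can open directly in the browser.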

And, finally, if you’d like to familiarize yourself with programmatic API querying in a relatively user-friendly environment, I’d recommend this webinar recording showing how to use the Crossref API in the Postman client.
