Is that normal to get <html> tags in title metadata?

fmerceur · 2 June 2022 11:40

Hello,

In some records, we get tags in the title metadata. Is that an error? E.g.:

api.crossref.org/v1/works/10.1002/ajpa.24488

“title”: [“A population history of indigenous\n Bahamian\n islanders: Insights from ancient\n DNA”],

Is there a way to get only raw text with the CrossRef API?

Thanks,
Fred

Shayn · 2 June 2022 13:28

Hi Fred,

It’s not especially common, but it is allowed.

We support certain face markup within the metadata that publishers supply for their registered content.

It’s up to each publisher when and whether they opt to supply those markup tags.

There’s not a way to query the API such that you’ll get back only the text, without tags. You’d have to clean up the data after the fact to strip them out, if that’s what you wanted.
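As a rough illustration of that after-the-fact cleanup, here is a minimal Python sketch that drops every tag from a title string and collapses the leftover whitespace. The function name `strip_title_markup` is made up for this example, and the sample title is the one from the GitHub issue quoted further down this thread. It only uses the standard library `html.parser` module and is not an official Crossref tool.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True also turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_title_markup(title: str) -> str:
    """Return the title without tags, with runs of whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(title)
    parser.close()  # flush any text still buffered by the parser
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


if __name__ == "__main__":
    sample = ("NeisseriaBase: a specialised <i>Neisseria</i> "
              "genomic resource and analysis platform")
    print(strip_title_markup(sample))
```

Because the tags are removed without touching the text around them, the spaces on either side of the italicised word survive.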

Best,
Shayn

adigitoleo · 22 August 2022 08:16

Hi Shayn,

I had the same question. Thanks for the link and confirmation. I wonder if a full specification is available of what kind of face markup is permitted.

For example, a BibTeX query for the paper with DOI 10.1002/2015gl067329 gives me the title:

	title = {An automatically updated
		            $\less$i$\greater$S$\less$/i$\greater$
		            -wave model of the upper mantle and the depth extent of azimuthal anisotropy},

Notice that in order to process this I would first need to decode the LaTeX and then decode the HTML tags, in that specific order. The face markup docs only mention HTML entities and MathML, not arbitrary LaTeX on top of that. I wonder if the face markup could be better constrained.
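The two-step decoding described here could be sketched in Python roughly as follows. This assumes the only LaTeX escapes involved are the `$\less$` and `$\greater$` sequences visible in the example; other escapes may well occur in practice. The function name `decode_bibtex_title` is hypothetical.

```python
import re

# Assumption: these are the only LaTeX escapes that need undoing,
# since they are the only ones visible in the example title.
_LATEX_ESCAPES = {
    r"$\less$": "<",
    r"$\greater$": ">",
}

# Matches an opening or closing tag such as <i> or </i>.
_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def decode_bibtex_title(title: str) -> str:
    """Undo the LaTeX escaping first, then drop the tags it was hiding."""
    # Step 1: LaTeX escapes back to angle brackets.
    for escaped, char in _LATEX_ESCAPES.items():
        title = title.replace(escaped, char)
    # Step 2: remove the tags that are now visible.
    title = _TAG.sub("", title)
    # Collapse the line breaks and indentation of the BibTeX output.
    return re.sub(r"\s+", " ", title).strip()


if __name__ == "__main__":
    raw = r"An automatically updated $\less$i$\greater$S$\less$/i$\greater$-wave model"
    print(decode_bibtex_title(raw))
```

The order matters: running the tag removal first would find nothing, because the angle brackets are still hidden behind the LaTeX escapes. Note that the line breaks in the BibTeX output around the markup turn into single spaces, so the multi-line title above would come out as "S -wave" rather than "S-wave" unless that is handled separately.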

Could a specification for the permitted face markup perhaps even be used to implement content negotiation in a way that “application/x-bibtex” queries would always return metadata in LaTeX? I suppose this would require a translation layer between the html (?) based face markup and an equivalent LaTeX representation. I appreciate that this would not be trivial, but it would greatly improve the quality of automated bibliography generation.

Shayn · 22 August 2022 18:51

Hi, and thanks for your feedback.

The permitted markup, and the metadata elements where it’s allowed, are described in our documentation.

It’s relatively minimal, just bold (b), italic (i), underline (u), over-line (ovl), superscript (sup), subscript (sub), small caps (scp), and typewriter text (tt). And they can only occur in titles and citations.
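With a tag list this short, the translation layer suggested earlier in the thread is easy to prototype. The Python sketch below maps the eight face markup tags to LaTeX commands. The choice of LaTeX commands is my own assumption, not anything Crossref specifies: `\textsubscript` needs a reasonably recent LaTeX, and the over-line is rendered through math mode. Tags outside the list are dropped and only their text is kept.

```python
from html.parser import HTMLParser

# Assumed mapping from face markup tags to (opening, closing) LaTeX snippets.
FACE_TO_LATEX = {
    "b": (r"\textbf{", "}"),
    "i": (r"\textit{", "}"),
    "u": (r"\underline{", "}"),
    "sup": (r"\textsuperscript{", "}"),
    "sub": (r"\textsubscript{", "}"),
    "scp": (r"\textsc{", "}"),
    "tt": (r"\texttt{", "}"),
    # \overline only exists in math mode, so wrap the text in \mbox
    "ovl": (r"$\overline{\mbox{", "}}$"),
}


class _FaceToLatex(HTMLParser):
    """Rewrites known face markup tags and passes text through unchanged."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in FACE_TO_LATEX:
            self.out.append(FACE_TO_LATEX[tag][0])

    def handle_endtag(self, tag):
        if tag in FACE_TO_LATEX:
            self.out.append(FACE_TO_LATEX[tag][1])

    def handle_data(self, data):
        self.out.append(data)


def face_markup_to_latex(title: str) -> str:
    parser = _FaceToLatex()
    parser.feed(title)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(face_markup_to_latex(
        "NeisseriaBase: a specialised <i>Neisseria</i> genomic resource"))
```

This sketch does not escape LaTeX special characters such as `&` or `%` in the text, and it assumes the tags are properly nested and closed; unbalanced input would produce unbalanced braces. MathML is not handled at all.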

I’m not sure, but I can pass the suggestion along to our technical team and the API product manager. Content negotiation is an especially complicated tool to make updates or improvements to, because it’s a collaboration between three organizations.

Dave · 28 June 2023 18:29

In some cases, the markup can be extensive; see the title of DOI 10.1103/PhysRevB.56.6100, which has lots of MathML tags. (I can’t post the XML.)

I have added scrubbers in my script to remove the tags as I need metadata, not markup.
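A scrubber along these lines might look like the Python sketch below. It is not Dave's actual script. It takes an already-parsed work record (the `message` object of a REST API response) and cleans its title-like fields. The sample record, its DOI and its MathML title are invented for illustration, and the list of fields to scrub is an assumption about which ones you care about.

```python
import html
import re

# Title-like fields of a work record, each holding a list of strings.
# Assumption: these are the fields worth scrubbing; adjust as needed.
TITLE_FIELDS = ("title", "subtitle", "short-title",
                "original-title", "container-title")

# Matches any opening or closing tag, including prefixed MathML
# tags such as <mml:mi> and tags that carry attributes.
_TAG = re.compile(r"</?[A-Za-z][^<>]*>")


def scrub(text: str) -> str:
    """Remove tags, decode entities, and collapse whitespace."""
    text = _TAG.sub("", text)
    # Entities are decoded after tag removal, so an escaped "&lt;" stays text.
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def scrub_work(work: dict) -> dict:
    """Return a shallow copy of the record with its title fields scrubbed."""
    cleaned = dict(work)
    for field in TITLE_FIELDS:
        if field in cleaned:
            cleaned[field] = [scrub(value) for value in cleaned[field]]
    return cleaned


if __name__ == "__main__":
    # Hypothetical record, made up for this example.
    work = {
        "DOI": "10.0000/example",
        "title": [
            'Gap in <mml:math xmlns:mml="http://www.w3.org/1998/Math/MathML">'
            "<mml:msub><mml:mi>X</mml:mi><mml:mn>2</mml:mn></mml:msub>"
            "</mml:math> films"
        ],
    }
    print(scrub_work(work)["title"])
```

Stripping MathML this way keeps the characters but flattens the structure, so a subscripted X₂ ends up as plain "X2". That is usually acceptable for search and matching, less so for display.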

ifarley · 3 July 2023 15:24

Hi Dave,

Thanks for following up. I’ve updated your privileges so you can post code and links within the forum.

We preserve deposited markup, so that’s why you’re seeing lots of MathML tags in the metadata record.

We’ve talked about this a lot, and it goes back a bit:

github.com/CrossRef/rest-api-doc

Whitespace is lost from titles that contain formatting markup

opened 05:23PM - 17 Mar 16 UTC

closed 08:06AM - 29 Jun 18 UTC

hubgit

doi metadata

e.g. https://0-api-crossref-org.libus.csd.mu.edu/works/10.7717/peerj.1698 The XML deposited was `<…title>NeisseriaBase: a specialised <i>Neisseria</i> genomic resource and analysis platform</title>`, but the title in the JSON is `NeisseriaBase: a specialisedNeisseriagenomic resource and analysis platform`. Even if the formatting gets lost in the JSON version (it would be nice to have it preserved in some way, perhaps in a `title-html` or `title-markdown` field), the whitespace around words should be preserved.

But, we do convert that markup to readable text in our products/services that are meant to be human readable, like this example:
https://0-search-crossref-org.libus.csd.mu.edu/?q=10.1103%2Fphysrevb.56.6100&from_ui=yes

My best,
Isaac

Dave · 3 July 2023 21:17

Thank you for updating my privileges and the explanation, Isaac.

From the perspective of the CrossRef front-end search, it makes sense to keep the markup, since it connects to the same data source.
