Ticket of the month - February 2024 - "Error: cvc-pattern-valid" and other troubles with pattern matching

When you submit a metadata deposit to Crossref in order to register (or update) a DOI, you’ll get an email with a submission log in response. This is sent to whichever email address you included in the metadata deposit itself - inside the <email_address> tags, or, for Web Deposit Form users, the email address you submit at the very end of the form. Most of the time, deposits are successful, and the log will indicate “successfully added” or “successfully updated”. But, occasionally, you’ll see that your deposit has failed with some kind of error.

We’ve covered many types of metadata submission errors in this forum, such as timestamp problems, title discrepancies, invalid characters, and metadata that’s missing or out of order. In this ‘ticket of the month’ post, I’m going to cover another common type of metadata submission problem: when some piece of metadata you submitted doesn’t match the pattern that our schema requires for it.

An XML schema describes the structure of an XML file. It designates what metadata tags you can include, which hierarchy and order they appear in, and what type of data can go in each. For example, if you supply a publication date for an article, like this:

<publication_date media_type="print">
<month>02</month>
<day>1</day>
<year>2024</year>
</publication_date>

That structure is laid out in the schema. You could not, for example, change the <publication_date> tag to <published_date> or just <date> and expect your XML to be valid.

The options for that media_type attribute are just “online”, “print”, and “other”. You couldn’t say media_type="audio" or media_type="acceptance", or replace “media_type” with some other descriptor.

Likewise, the order of the tags within <publication_date> has to be <month> followed by <day> followed by <year>. Only <year> is required. You can omit the others. But, you can’t order them differently. If you do any of those things your XML file will be rejected with an error that indicates something is invalid.

In that same example snippet of XML, there’s one other requirement that’s a bit subtler: a restriction on the type of data that’s allowed in each tag. In this case, <month>, <day>, and <year> can only accept numerals, not text. So, for example, if you use <month>February</month> instead of <month>02</month>, you’ll get an error that says:

[Error] :27:27: cvc-datatype-valid.1.2.1: 'February' is not a valid value for 'integer'.
[Error] :27:27: cvc-type.3.1.3: The value 'February' of element 'month' is not valid.

Similarly, there are upper and lower limits for these values to ensure some basic sensibility. E.g. <year>3000</year>, <month>55</month>, or <day>32</day> would all result in an ‘invalid’ error. Those limitations are pretty easy to understand. The only time you’ll bump up against them is if you make a typo somewhere in the process of getting your metadata submitted.
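
If you'd like to catch these problems before you submit, here's a minimal Python sketch of the checks described above for a <publication_date> snippet: the media_type values, the month/day/year order, numerals only, and a range check. The function name and the numeric limits in ASSUMED_LIMITS are my own placeholders for illustration, not the schema's actual bounds, and the snippet ignores the XML namespace that a real deposit file would carry. Validating against the schema itself is still the authoritative test.

```python
import xml.etree.ElementTree as ET

EXPECTED_ORDER = ["month", "day", "year"]
ALLOWED_MEDIA_TYPES = {"online", "print", "other"}
# Assumed bounds for illustration only; the real limits live in the schema.
ASSUMED_LIMITS = {"month": (1, 12), "day": (1, 31), "year": (1400, 2200)}


def check_publication_date(xml_text):
    """Return a list of problems found in a <publication_date> snippet."""
    elem = ET.fromstring(xml_text)
    if elem.tag != "publication_date":
        return [f"unexpected element <{elem.tag}>"]

    problems = []
    media_type = elem.get("media_type")
    if media_type is not None and media_type not in ALLOWED_MEDIA_TYPES:
        problems.append(f"media_type {media_type!r} is not allowed")

    tags = [child.tag for child in elem]
    if "year" not in tags:
        problems.append("<year> is required")
    for tag in tags:
        if tag not in EXPECTED_ORDER:
            problems.append(f"<{tag}> is not allowed here")

    # Children may be omitted, but the ones present must keep this order.
    known = [tag for tag in tags if tag in EXPECTED_ORDER]
    if known != sorted(known, key=EXPECTED_ORDER.index):
        problems.append("children must appear as month, day, year")

    for child in elem:
        if child.tag not in ASSUMED_LIMITS:
            continue
        text = (child.text or "").strip()
        if not (text.isascii() and text.isdigit()):
            problems.append(f"<{child.tag}> must be an integer, got {text!r}")
            continue
        low, high = ASSUMED_LIMITS[child.tag]
        if not low <= int(text) <= high:
            problems.append(f"<{child.tag}> value {text} is out of range")
    return problems
```

An empty list means the snippet passed these particular checks; anything else is a human-readable hint about what to fix before depositing.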

There are others that are a bit less intuitive. Recently, we had a support ticket come in with a failed submission where the error looked like this (URLs changed for anonymity):

<msg>Error: cvc-pattern-valid: Value 'www.example.org/journal/stuff-things-6/article/a-00055-55' is not facet-valid with respect to pattern '[A-Za-z0-9_]+([-.][A-Za-z0-9_]+)*\.[A-Za-z0-9_]+([-.][A-Za-z0-9_]+)*' for type 'cm_domain'.
Error: cvc-type.3.1.3: The value 'www.example.org/journal/stuff-things-6/article/a-00055-55' of element 'domain' is not valid.
</msg>

So, what’s <domain>, and why doesn’t it accept a URL?

<domain> is used within the <crossmark> metadata, and its intent is to ensure that Crossmark can only be displayed and accessed from an approved site. In practice, this only matters if the <crossmark_domain_exclusive> value is set to ‘true’, and <domain> itself is completely optional. But, if you include it, it can only accept a bare domain name, not a full URL: no protocol, no slashes, and no path. So, that would have to be just www.example.org rather than https://www.example.org/journal…

In context that looks like this:

<crossmark>
<crossmark_version>1</crossmark_version>
<crossmark_policy>10.5555/crossmark.example</crossmark_policy>
<crossmark_domains>
<crossmark_domain>
<domain>www.example.org</domain>
</crossmark_domain>
</crossmark_domains>
<crossmark_domain_exclusive>false</crossmark_domain_exclusive>
</crossmark>
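
For anyone generating this XML programmatically, here's a small Python sketch of how you might test a <domain> value against the pattern quoted in the error message above, and pull the bare domain out of a URL if that's what you have. The function names are hypothetical. I'm using fullmatch because schema patterns have to match the entire value, not just part of it.

```python
import re
from urllib.parse import urlsplit

# Pattern copied from the cvc-pattern-valid error message for type 'cm_domain'.
DOMAIN_PATTERN = re.compile(
    r"[A-Za-z0-9_]+([-.][A-Za-z0-9_]+)*\.[A-Za-z0-9_]+([-.][A-Za-z0-9_]+)*"
)


def is_valid_domain(value):
    """True if the whole value matches the cm_domain pattern."""
    return DOMAIN_PATTERN.fullmatch(value) is not None


def host_from_url(value):
    """Reduce a URL (with or without a protocol) to its host name."""
    # urlsplit only recognises the host when a "//" precedes it.
    if "//" not in value:
        value = "//" + value
    return urlsplit(value).hostname or ""
```

With the value from the failed submission, is_valid_domain returns False, while host_from_url reduces it to www.example.org, which does match the pattern.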

You can find all the requirements and restrictions in the schema and schema documentation. The required patterns, when they’re specified, are in the form of Regular Expressions.

Here’s a quick rundown of a few other elements with character restrictions that sometimes cause problems for members submitting them in metadata records.

| Metadata element | Error | Requirements |
|---|---|---|
| ISSN | cvc-pattern-valid: Value … is not facet-valid with respect to pattern ‘\d{4}-?\d{3}[\dX]’ for type ‘issn_t’. | 8 digits, with or without a hyphen* in the middle |
| ISSN | ISSN “…” is invalid | The final digit must be a valid check digit† |
| ISBN | cvc-pattern-valid: Value 'ISBN: … is not facet-valid with respect to pattern ‘(978-)?\d[\d -]+[\dX]’ for type ‘isbn_t’ | 10 or 13 digits, with or without hyphens* |
| ISBN | ISBN “…” is invalid | The final digit must be a valid check digit† |
| ORCID | cvc-pattern-valid: Value ‘…’ is not facet-valid with respect to pattern ‘https?://orcid.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[X0-9]{1}’ for type ‘orcid_t’ | The ORCID has to be supplied as a URL beginning https://orcid.org/ followed by 16 digits in 4 groups of 4, separated by hyphens* |
| Clinical Trial Number | Clinical Trial data does not match format of registry. data: … format: … | This varies depending on which Clinical Trial Registry the identifier is from. |
| DOI | DOI: … contains invalid characters | Just the DOI itself‡, in the format prefix/suffix. Suffix characters are restricted§ to the following: a-z; A-Z; 0-9; and - . _ ; ( ) / |

* Just hyphens, no other dashes

† A check digit, also called a checksum, is mathematically derived from the other digits in the identifier. Our schema doesn’t enforce the check digit requirement in order for the XML to be valid, but it’s checked at a later point, when the XML is being processed after submission. This is an effort to ensure that the ISSNs and ISBNs are legitimate and correct.

‡ Some common mistakes here are prepending the DOI proxy URL “https://doi.org/” or “http://dx.doi.org/” to the DOI; prepending “DOI: ” to the DOI; and submitting just a prefix, with no suffix (especially for journal-level DOIs).

§ These character restrictions weren’t implemented until 2008, so you may see older DOIs with characters like brackets, colons, and question marks, but you can’t register any new DOIs with characters like that. Other common issues are accidental spaces within a DOI suffix and letters with diacritic marks like é, ñ, or ö.
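
To make a few of those rows concrete, here's a rough Python sketch of pre-submission checks for ISSNs, ORCID iDs, and DOIs. The ISSN and ORCID patterns are copied from the error messages in the table. The check digit calculation is the standard ISSN one (weights 8 down to 2, modulo 11), not necessarily the exact code our system runs. The DOI check is my own approximation built from the allowed suffix characters listed above, plus the assumption that a prefix looks like "10." followed by digits. All of the function names are made up for this example.

```python
import re

# Patterns copied from the error messages in the table above.
ISSN_PATTERN = re.compile(r"\d{4}-?\d{3}[\dX]")
ORCID_PATTERN = re.compile(
    r"https?://orcid.org/[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[X0-9]{1}"
)
# Assumption: a DOI prefix is "10." followed by digits (optionally dotted).
DOI_PREFIX_PATTERN = re.compile(r"10\.\d+(\.\d+)*")
# Allowed suffix characters as listed in the table: a-z A-Z 0-9 - . _ ; ( ) /
DOI_SUFFIX_PATTERN = re.compile(r"[A-Za-z0-9\-._;()/]+")


def issn_check_digit(first_seven):
    """Standard ISSN check digit for the first seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def is_valid_issn(issn):
    """Pattern check first, then the check digit."""
    if ISSN_PATTERN.fullmatch(issn) is None:
        return False
    digits = issn.replace("-", "")
    return issn_check_digit(digits[:7]) == digits[7]


def is_valid_orcid(orcid):
    """True only for the full URL form of an ORCID iD."""
    return ORCID_PATTERN.fullmatch(orcid) is not None


def looks_like_new_doi(doi):
    """Approximate check: prefix/suffix with only the allowed suffix characters."""
    prefix, _, suffix = doi.partition("/")
    return (
        DOI_PREFIX_PATTERN.fullmatch(prefix) is not None
        and DOI_SUFFIX_PATTERN.fullmatch(suffix) is not None
    )
```

Note that the pattern check and the check digit test are separate steps here, which mirrors the two different ISSN errors in the table: an ISSN can have the right shape and still fail because its final character doesn't agree with the other seven digits.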

There are a variety of other metadata restrictions you might bump up against as you’re submitting metadata to Crossref to register your content. So, as always, please let us know if you have any questions or need assistance by making a new post here in the Community Forum. Or, if you’d prefer to contact us privately, send an email to support@crossref.org.

Thanks!
-Shayn
