Our attempts at automating journal subject classification

I thought I’d share @Esha’s write-up of her latest R&D project to investigate a better way to assert subject categories at the container level (i.e., journal titles). Below are excerpts from the introduction and from the summary of the results, and here is the DOI: https://doi.org/10.54900/g0qks-tcz98.

Tl;dr: machine learning is super interesting for this, but we’d need more metadata (especially abstracts) to do anything groundbreaking in this area. The challenge remains!

Intro excerpt:

Traditionally, journal subject classification was done manually at varying levels of granularity, depending on the use case for the institution. Subject classification helps collate resources by subject, enabling users to discover publications at different levels of subject specificity. It can also help authors decide where to publish and, by tracking where a particular author's work appears, show the direction their research is taking. Currently, most subject classification is done manually, as it is a speciality that requires a lot of training. However, this effort can be siloed by institution or hampered by inter-institutional agreements that prevent other resources from being classified. A standardized approach to classifying items is also harder to achieve when publications in separate institutions use different taxonomies and classification systems. Automating classification work surfaces questions about the relevance of the taxonomy used, the potential bias that might exist, and the texts being classified. Currently, journals are classified using various taxonomies and are siloed in many systems, such as library databases or software for publishers. Providing a service that can automatically classify a text (and provide a measure of accuracy!) outside of a specific system can democratize access to this information across all systems. Crossref infrastructure enables a range of services for the research community; we have a wealth of metadata created by a very large global community. We wondered how we could contribute in this area.

Summary excerpt:

So, sad trombone: the F1 score is similar across all three methods we tried. Essentially, we needed more data for more accurate predictions. Crossref has abstracts for only a subset of the deposited publication metadata, so this data could not be used for comparison at this time. Having that data could possibly yield better results, but the only way to find out is to run a similar method on it. We do not have that data currently, and so, for now, it becomes a chicken-and-egg thought exercise. Getting even more data, such as full text, could also produce interesting results, but we do not have the data for that either. For now, Crossref decided to remove the existing subject classifications that were present in some of our metadata. We could revisit the problem later, if we have more data.
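
For readers less familiar with the F1 score mentioned in the summary, here is a minimal sketch of how a macro-averaged F1 can be computed when comparing classifiers. It is not the code from the write-up: the function names `f1_per_label` and `macro_f1` are hypothetical, the subject labels are made up, and it assumes each journal gets exactly one subject (the real task may well be multi-label).

```python
from collections import Counter


def f1_per_label(y_true, y_pred):
    """Return {label: F1} for single-label predictions (an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1      # correct prediction for this label
        else:
            fp[pred] += 1       # predicted label was wrong
            fn[truth] += 1      # true label was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = f1_per_label(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Toy, made-up data: four journals, three hypothetical subjects.
y_true = ["Chemistry", "Chemistry", "Physics", "Medicine"]
y_pred = ["Chemistry", "Physics", "Physics", "Medicine"]

print(f1_per_label(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Running the same scoring over each method's predictions on a shared held-out set is one way to see whether the methods really do perform alike. Macro averaging weights every subject equally, which matters when some subjects have far fewer journals than others.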