Flattening the curve of COVID-19 has forced campuses, research labs, and businesses to shut their doors and academic conferences to postpone all events for the foreseeable future. The collaboration and knowledge sharing that take place on campus and at conferences are vital to the field of physics. But we don’t think that the disruption will slow things down for long: physicists are resourceful and inventive and will adapt to the circumstances.
Please read on for a look at an important, behind-the-scenes development that AIP Publishing has been working on over the past few years. Since our thesaurus launched in 2017, it has greatly improved our semantic tagging capabilities. Semantic tagging is akin to the process of indexing in the pre-digital world, but it enables so much more functionality.
Semantic Tagging as a Superpower
Always an important part of the scholarly publishing workflow, classifying and describing content to enable referencing traditionally depended on skilled subject matter experts to analyze the text, distinguish between significant information and passing mentions, and describe and define the relationships between terms and topics. In online publishing, assigning descriptive terms and ontological metadata to content is more important than ever. The process enables many of the value-added functions we have come to depend on: enhanced search, discovery, navigation, and content repurposing. While much of the function is now automated, ongoing human involvement is still needed: Fine-tuning by authors and editors allows the machine-generated tags to take on these superpowers.
The Back Story
Getting the thesaurus to where it is now hasn’t been a straight path, but what we learned along the way has helped to improve it. For many years, AIP Publishing used a taxonomy called PACS (Physics and Astronomy Classification Scheme) codes as a guide for classifying content before it was published. Developed in-house by AIP, PACS was used for decades across physics and astronomy as the primary knowledge organization tool in the field.
AIP Publishing Technical Lead Matt Stratton recalls why they decided to create a thesaurus to replace PACS: “The PACS taxonomy was a big step forward when it was developed, providing a groundbreaking standard for classification and categorization. After years of successful use, we realized that we needed to go further to optimize it for discoverability; the codes were more like comments with room for interpretation. As a result, we made the decision to create a thesaurus with more defined terms that would work better with online search.”
The initial attempt at the thesaurus was not a total success. “We started working with a vendor to make sure our content was getting auto-indexed on an ongoing basis. The program that assigned terms worked well to a point, but there were some basic problems; notably, there was no sustainable process for updating or maintaining the terms. You couldn’t revise the terms attached to an article once they were assigned. PACS had been updated biannually, but there was only ever a version one of the first thesaurus,” explained Matt. Three years later, he recounted, the editors and authors evaluated the thesaurus program and concluded that it didn’t meet their standards and that there was no recourse for correcting it. “The tipping point was when editors and authors lost confidence in the indexing process. So we turned to Molecular Connections to help us build an optimal tool.”
Thesaurus 2.0 and Beyond
Since the new thesaurus launched in 2017, it has already been updated three times with enthusiastic buy-in from the editorial team. “This is in sharp contrast to the previous implementation, which was viewed primarily as a technology project. I’m happy to say that ongoing maintenance of the new thesaurus is a shared responsibility of the editorial and technology staff,” noted Matt.
Another change is that terms are now assigned much earlier in the publishing process, before an article goes out for peer review, instead of at publication. That means that AIP Publishing indexes all submissions whether or not they are accepted. “We reach out to authors and ask them to review the assigned terms, giving them the opportunity to say, ‘Nope, you missed the boat on this one, you should delete this one, add another,’ and then we make those changes.” Not only does this improve the assignment of terms for that individual article, it also trains the artificial intelligence behind the thesaurus for future articles.
Melissa Patterson, Director, Editorial Development, described how this works: “The initial step is an automated indexing workflow that streams content into an algorithm that uses certain subsections of the article to extract the most relevant terms. The title and abstract are particularly important here as they typically contain the main points of the paper. We then ask our authors to review the resulting terms for fine-tuning, updating, and correcting.” Because the machine can only go so far without human intervention, AIP Publishing works closely with Molecular Connections to manage the evaluation process and algorithm on an ongoing basis. “Molecular Connections uses the feedback to train the algorithm by adding or modifying conditions for more precise application of indexing terms. This feedback mechanism means that authors won’t have to do as much editing going forward,” explained Melissa.
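For readers who like to see the idea in code, here is a rough sketch of what a section-weighted indexing pass followed by author review could look like. It is not the algorithm that AIP Publishing or Molecular Connections uses; the tiny thesaurus, the section weights, and the function names `suggest_terms` and `apply_author_feedback` are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-thesaurus: preferred term -> synonyms (invented for this sketch).
THESAURUS = {
    "Superconductivity": ["superconductivity", "superconducting", "superconductor"],
    "Thin films": ["thin film", "thin films"],
    "Magnetic fields": ["magnetic field", "magnetic fields"],
}

# Assumed weights: a match in the title counts more than one in the abstract.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 1.0}


def suggest_terms(sections, top_n=5):
    """Score each preferred term by weighted synonym matches; return the top terms."""
    scores = Counter()
    for name, text in sections.items():
        weight = SECTION_WEIGHTS.get(name, 0.0)
        lowered = text.lower()
        for term, synonyms in THESAURUS.items():
            for synonym in synonyms:
                pattern = r"\b" + re.escape(synonym) + r"\b"
                scores[term] += weight * len(re.findall(pattern, lowered))
    ranked = [term for term, score in scores.most_common() if score > 0]
    return ranked[:top_n]


def apply_author_feedback(suggested, remove=(), add=()):
    """Drop terms the author rejects and append accepted thesaurus terms."""
    rejected = set(remove)
    kept = [term for term in suggested if term not in rejected]
    for term in add:
        if term in THESAURUS and term not in kept:
            kept.append(term)
    return kept


paper = {
    "title": "Superconducting thin films in high magnetic fields",
    "abstract": "We study thin films of a superconductor. "
                "The magnetic field suppresses superconductivity.",
}
suggested = suggest_terms(paper)
final_terms = apply_author_feedback(suggested, remove=["Magnetic fields"])
```

In a real system, the author's corrections would also be fed back to refine the matching rules, which is the part of the workflow this sketch leaves out.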
PACS was a good knowledge organization tool, and was widely accepted by researchers as accurate and comprehensive for their discipline. But it didn’t have the direct effect on search that the thesaurus does. By establishing an ontology that maps out content relationships and links to related content via metadata, the thesaurus enables readers interested in specific topics to find related papers. And because the thesaurus classifications are available on Scitation, researchers frequently use its topics for searching.
That’s not all. The thesaurus is also an engine for contextual advertising: it drives ad placement on our site. So a company that makes instrumentation, for example, can purchase related keywords, and its ads will preferentially display when a reader is accessing any articles containing those words, thereby targeting just the right audience.
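As a simple illustration of keyword-driven ad placement, the sketch below ranks advertisers by how many of their purchased keywords match an article's terms. The campaign names, keywords, and the `rank_ads` function are hypothetical, and the actual ad-serving logic on our site is not described here.

```python
def rank_ads(article_terms, campaigns):
    """Order campaigns by the number of purchased keywords matching the article."""
    article = {term.lower() for term in article_terms}
    scored = []
    for name, keywords in campaigns.items():
        overlap = len(article & {keyword.lower() for keyword in keywords})
        if overlap:
            scored.append((overlap, name))
    # Most matches first; ties broken alphabetically by campaign name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored]


# Hypothetical campaigns and the keywords each advertiser purchased.
campaigns = {
    "Acme Cryostats": ["Superconductivity", "Cryogenics"],
    "Beta Optics": ["Lasers"],
    "Gamma Magnets": ["Magnetic fields", "Superconductivity"],
}
ads = rank_ads(["Superconductivity", "Magnetic fields"], campaigns)
```

Here the campaign with two matching keywords is ranked ahead of the one with a single match, and a campaign with no match is not shown at all.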
We are currently developing a way to use the thesaurus to identify new peer reviewers, creating a scientific fingerprint to find a person who is an expert on a specific topic. While the topics are not maintained within the XML, our publishing infrastructure relates the thesaurus to the broader scholarly communications community metadata. Looking ahead, we can see many more potential uses through association with industry-wide persistent identifiers for authors (ORCID), institutions, datasets, and more.