Extending the Environment Ontology with Text-mined Habitat Mentions
Ontologies, i.e., formal specifications of the concepts and relations relevant to a specialised domain of interest, are information resources that play a crucial role in knowledge representation, management and discovery. Knowledge acquisition, the process of curating and updating them, is typically carried out manually, requiring human effort that is tedious, time-consuming and expensive. This holds especially for ontologies that are continuously expanded with new terms in order to support a growing number of use cases. One such ontology is the Environment Ontology (ENVO). Initially developed to support the annotation of metagenomic data, ENVO has more recently realigned its goals in support of the Sustainable Development Agenda for 2030 and is thus now much broader in scope, covering the domains of biodiversity and ecology. As a result, the number of ENVO classes has increased dramatically, and the process of curating and updating the ontology can benefit from automated support. In this work, we aim to help expand ENVO more efficiently by automatically discovering new habitat mentions. To this end, we developed a text mining-based approach underpinned by the following pipeline: (1) automatic extraction of habitat mentions from text using named entity recognition methods; (2) normalisation of every extracted mention, i.e., identification of the most relevant ENVO term based on the lexical similarity between the mention and candidate terms; (3) application of a filter to retain only habitat mentions that do not yet appear to exist in ENVO; and (4) construction of clusters over the remaining mentions.
The pipeline yields clusters of potential synonyms and lexical variants of existing terms, as well as semantically related expressions; each cluster can then be evaluated for integration into an existing ENVO class or, on occasion, may indicate a new class that could be added to the ontology. Applying our approach to a corpus pertaining to the Dipterocarpaceae family of forest trees (based on documents from the Biodiversity Heritage Library and grey literature), we generated more than 1,000 new habitat terms for potential incorporation into ENVO.
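To make steps (2) to (4) of the pipeline concrete, the following is a minimal Python sketch, not the implementation used in this work. It assumes that habitat mentions have already been extracted by a named entity recogniser (step 1 is omitted), and it stands in a simple character-based ratio from Python's standard `difflib` module for the lexical similarity measure, which is not specified here. The function names (`similarity`, `normalise`, `filter_novel`, `cluster`), the thresholds and the greedy clustering strategy are all illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive lexical similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def normalise(mention: str, envo_labels: list[str]) -> tuple[str, float]:
    """Step 2: return the most similar ENVO label and its similarity score."""
    best = max(envo_labels, key=lambda label: similarity(mention, label))
    return best, similarity(mention, best)


def filter_novel(mentions: list[str], envo_labels: list[str],
                 threshold: float = 0.9) -> list[str]:
    """Step 3: keep mentions whose best ENVO match scores below the
    (hypothetical) threshold, i.e. mentions that appear to be new."""
    return [m for m in mentions if normalise(m, envo_labels)[1] < threshold]


def cluster(mentions: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Step 4: greedy single-pass clustering. Each mention joins the first
    cluster whose first member is similar enough, else starts a new one."""
    clusters: list[list[str]] = []
    for mention in mentions:
        for members in clusters:
            if similarity(mention, members[0]) >= threshold:
                members.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


if __name__ == "__main__":
    # Toy labels and mentions, invented purely for illustration.
    envo_labels = ["forest", "peat swamp", "river"]
    mentions = ["Forest", "hill forest", "hill forests", "limestone hill"]
    novel = filter_novel(mentions, envo_labels)
    for members in cluster(novel):
        print(members)
```

In this sketch, a mention that matches an ENVO label exactly (ignoring case) is discarded in step 3, while the remaining mentions are grouped so that close lexical variants fall into the same cluster for a curator to review. A production system would likely use a more robust similarity measure and clustering method than the ones assumed here.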