Extracting granular information on habitats and reproductive conditions of Dipterocarps through pattern-based literature analysis
Lowland tropical rainforests in Southeast Asia, composed primarily of dipterocarp species, are among the most threatened ecosystems in the world. Belonging to the family Dipterocarpaceae, dipterocarps are economically and ecologically important due to their timber value and their contribution to wildlife habitat. The challenge in the restoration and rehabilitation of these dipterocarp forests lies in their complex reproduction patterns, i.e., supra-annual mass flowering events that may occur at irregular intervals of two to ten years, possibly synchronously across Asia. Understanding their regeneration in order to plan effective reforestation can be aided by access to a comprehensive database that contains long-term and wide-scale data on dipterocarps. The content of such a database can be enriched with literature-derived information on the habitats and reproductive conditions of dipterocarps. We aim to develop literature mining methods that automatically extract information relevant to the distribution and reproductive cycle of dipterocarps, in order to help predict the likelihood of their regeneration and subsequently support informed decisions on which species to use for reforestation. In previous work, we developed a machine learning-based named entity recognition (NER) model that automatically annotates entities relevant to species’ distribution, e.g., taxon names, geographic locations, temporal expressions, habitats, authorities, and names of herbaria. The model also automatically annotates a species’ reproductive condition, e.g., whether it is sterile or in the state of producing fruit ("in fruit") or flower ("in flower"), to enable the derivation of phenological patterns. The model was trained on a manually annotated corpus of documents, e.g., scholarly articles and government agency reports.
In this work, we focus specifically on the extraction of relationships between habitats and their locations, and between reproductive conditions and temporal expressions. To this end, we have developed a syntactic pattern-based matching approach that builds upon Grew (http://grew.fr/), a graph rewriting system for manipulating linguistic representations. For our purposes, we designed patterns that make use of syntactic dependencies, part-of-speech tags and named entity types (derived from NER results). When fed into Grew, these patterns analyse sentences in scholarly articles, associating habitats with their geographic locations and determining a species’ reproductive condition at a specific point in time. The resulting relationships are then used to enrich the information contained in a database of dipterocarp occurrences. Such a resource will provide more comprehensive ecological data that could form the basis of more informed reforestation decisions.
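To make the idea of a pattern over dependencies and entity types concrete, the following is a minimal Python sketch. It does not use Grew or reproduce its pattern syntax; it only mimics the kind of constraint such a pattern might express. The example sentences, their hand-written parses, the entity labels (HABITAT, LOCATION, REPRODUCTIVE_CONDITION, TEMPORAL) and the function name `match_relation` are all assumptions made for illustration, and the dependency labels follow Universal Dependencies naming.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Token:
    idx: int                      # 1-based position in the sentence
    form: str                     # surface form
    upos: str                     # part-of-speech tag
    head: int                     # index of the syntactic head (0 = root)
    deprel: str                   # dependency relation to the head
    entity: Optional[str] = None  # hypothetical NER label, if any


def match_relation(tokens: List[Token], head_entity: str, dep_entity: str,
                   deprels: Set[str]) -> List[Tuple[str, str]]:
    """Return (head, dependent) pairs where a token labelled `dep_entity`
    attaches, via one of `deprels`, to a token labelled `head_entity`."""
    by_idx = {t.idx: t for t in tokens}
    pairs = []
    for tok in tokens:
        if tok.entity != dep_entity or tok.deprel not in deprels:
            continue
        head = by_idx.get(tok.head)
        if head is not None and head.entity == head_entity:
            pairs.append((head.form, tok.form))
    return pairs


# Hand-annotated parse (an assumption) of:
# "The species grows in swamp forest of Sarawak"
habitat_sentence = [
    Token(1, "The", "DET", 2, "det"),
    Token(2, "species", "NOUN", 3, "nsubj"),
    Token(3, "grows", "VERB", 0, "root"),
    Token(4, "in", "ADP", 6, "case"),
    Token(5, "swamp", "NOUN", 6, "compound"),
    Token(6, "forest", "NOUN", 3, "obl", "HABITAT"),
    Token(7, "of", "ADP", 8, "case"),
    Token(8, "Sarawak", "PROPN", 6, "nmod", "LOCATION"),
]

# Hand-annotated parse (an assumption) of: "The tree flowered in March"
phenology_sentence = [
    Token(1, "The", "DET", 2, "det"),
    Token(2, "tree", "NOUN", 3, "nsubj"),
    Token(3, "flowered", "VERB", 0, "root", "REPRODUCTIVE_CONDITION"),
    Token(4, "in", "ADP", 5, "case"),
    Token(5, "March", "PROPN", 3, "obl", "TEMPORAL"),
]

habitat_pairs = match_relation(
    habitat_sentence, "HABITAT", "LOCATION", {"nmod"})
phenology_pairs = match_relation(
    phenology_sentence, "REPRODUCTIVE_CONDITION", "TEMPORAL", {"obl"})

print(habitat_pairs)    # [('forest', 'Sarawak')]
print(phenology_pairs)  # [('flowered', 'March')]
```

The sketch links only single tokens that are directly attached; a realistic pattern set would also need to handle multi-word entities, longer dependency paths and coordination.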