Towards the Automatic Extraction of Plant Traits from Textual Descriptions
Many ecological restoration programmes are informed by evidence from empirical research. Specifically, such programmes analyse species traits in order to distinguish species that are suitable for restoration from those that are not. Indeed, understanding plant traits (and their relationships with each other) informs research into vegetation modelling and environmental change prediction, which in turn help in answering many ecological questions. In 2006, the Center for Tropical Forest Science (CTFS) formulated recommendations in support of its research programme, the foremost of which is the creation of trait databases that build upon published information catalogued by existing herbaria. In this work, we aim to enrich World Flora Online (WFO), a web-based inventory of known plant species, by integrating trait information contained in data sets from botanical institutions all over the world. This poses a few challenges, as trait information tends to be buried within verbose textual descriptions that do not follow the conventions of ordinary prose. Specifically, these descriptions typically do not come in the form of full sentences and instead read like long-winded enumerations of various types of plant attributes or characteristics. Such descriptions are difficult to search and understand unless decomposed into meaningful units. In order to decompose textual descriptions of plant species into spans pertaining to specific types of attributes, we have developed a machine learning-based approach to automatic text segmentation. Casting the problem as a sequence labelling task, we have investigated a number of probabilistic classifiers including conditional random fields (CRFs), hidden Markov models (HMMs) and naïve Bayes (NB). To train our models, we utilised data contributed by the South African National Biodiversity Institute (SANBI), in which traits are labelled with one of the following categories: morphology, habitat and distribution.
To help the models discriminate between these categories, we designed features capturing word characteristics (e.g., n-grams at the character and word level), context (i.e., surrounding words within a predefined window), as well as domain knowledge (i.e., words that match terms in plant-related ontologies). In this way, we can automatically identify which parts of the original descriptions pertain to plant traits such as morphology, habitat or distribution. By applying the resulting models to textual descriptions from several botanical institutes, we can facilitate the automatic population of WFO with plant traits for a number of species.
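To make the three feature families concrete, the following is a minimal Python sketch of per-token feature extraction for a sequence labeller. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the `ONTOLOGY_TERMS` set, the trigram size and the window width of 2 are all hypothetical choices.

```python
from typing import Dict, List, Set

# Hypothetical stand-in for terms drawn from a plant-related ontology.
ONTOLOGY_TERMS: Set[str] = {"leaf", "leaves", "petal", "stem", "corolla"}


def char_ngrams(word: str, n: int = 3) -> List[str]:
    """Return character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def token_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    word = tokens[i].lower()
    feats: Dict[str, object] = {
        "word": word,
        # Domain knowledge: does the token match an ontology term?
        "in_ontology": word in ONTOLOGY_TERMS,
    }
    # Word characteristics: character trigrams of the token itself.
    for gram in char_ngrams(word):
        feats[f"char3={gram}"] = True
    # Word-level bigram formed with the following token, if there is one.
    if i + 1 < len(tokens):
        feats["bigram"] = f"{word} {tokens[i + 1].lower()}"
    # Context: surrounding words within a fixed window (assumed width 2).
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        in_range = 0 <= j < len(tokens)
        feats[f"ctx[{offset:+d}]"] = tokens[j].lower() if in_range else "<PAD>"
    return feats


def sequence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Featurise every token of one description."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Calling `sequence_features(["Leaves", "ovate", ",", "grassland"])` yields one feature dictionary per token. A list of such dictionaries, paired with per-token labels such as morphology, habitat or distribution, is the kind of input that CRF toolkits commonly accept.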