Extraction of terms highly associated with named rivers
EcoLexicon is an electronic, multilingual terminological knowledge base on environmental sciences whose flexible design permits the contextualization of data so that they are more relevant to specific subdomains and geographic areas. However, to facilitate the geographic contextualization of concepts such as those belonging to the semantic category of LANDFORM, it is necessary to know which terms are semantically related to each type of landform according to the research papers published by experts, and how those terms are related to each other. This paper describes a semi-automatic method for extracting knowledge about terms related to rivers, as a type of landform, from a specialized corpus of English Environmental Science texts. A GeoNames database dump was first used to automatically identify the word sequences in the corpus that are proper names of rivers. For all the named rivers recognized in the corpus, their geographic coordinates, i.e. longitude and latitude, were retrieved from the GeoNames database dump and plotted on a static map. This visualization showed how representative the corpus is with regard to the location of rivers and the number of times that they are mentioned. Moreover, a hierarchical clustering technique was applied to group the named rivers by latitude and longitude, which allowed us to automatically annotate each river with the geographical area to which it belongs. For each river, the contexts in which it appeared were automatically retrieved in such a way that every context consisted of complete sentences. Subsequently, the subcorpus of contexts was lemmatized. The multi-word terms collected from EcoLexicon were automatically matched in the corpus and joined with underscores. Then, a document-term matrix of co-occurrences was obtained, and the terms in its columns were transformed into binary variables.
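The matching and matrix-building steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the `GAZETTEER` dictionary is a hypothetical stand-in for the GeoNames dump (its coordinates are rough, illustrative values), `MULTIWORD_TERMS` is a hypothetical stand-in for the term list taken from EcoLexicon, all function names are invented for the example, and lemmatization is omitted for brevity.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the GeoNames dump: name -> (latitude, longitude).
# The coordinates are rough, illustrative values only.
GAZETTEER = {
    "Mississippi River": (29.15, -89.25),
    "Ebro": (40.72, 0.87),
}

# Hypothetical stand-in for the multi-word terms collected from EcoLexicon.
MULTIWORD_TERMS = ["sediment transport", "delta erosion"]


def join_multiword_terms(sentence, terms):
    """Replace each multi-word term with its underscore-joined form."""
    # Longer terms first, so that a shorter term never splits a longer one.
    for term in sorted(terms, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        sentence = pattern.sub(term.replace(" ", "_"), sentence)
    return sentence


def find_rivers(sentence, gazetteer):
    """Return {name: coordinates} for gazetteer names found as whole words."""
    return {
        name: coords
        for name, coords in gazetteer.items()
        if re.search(r"\b" + re.escape(name) + r"\b", sentence)
    }


def binary_term_matrix(sentences, gazetteer, terms):
    """Map each river to the set of terms that co-occur with it.

    Storing a set per river is equivalent to a binary row of the matrix:
    a term is either present (1) or absent (0), whatever its frequency.
    """
    vocabulary = {term.replace(" ", "_") for term in terms}
    matrix = defaultdict(set)
    for sentence in sentences:
        joined = join_multiword_terms(sentence, terms)
        tokens = set(re.findall(r"\w+", joined.lower()))
        for river in find_rivers(sentence, gazetteer):
            matrix[river] |= tokens & vocabulary
    return dict(matrix)
```

Representing each row as a set keeps only the presence or absence of a term, which is the information that a clustering method for categorical variables needs.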
Finally, the ROCK clustering technique for categorical variables was used to group the named rivers according to the terms related to them, as reflected in the corpus data. The preliminary results show that there is a slight association between the geographical areas of the named rivers and the processes that researchers mention as affecting them. Once these experimental results have been validated by Coastal Engineering experts, the knowledge extracted with this method will facilitate the geographical contextualization of EcoLexicon with regard to rivers, in the sense that a specific named river can be linked to the terms most strongly associated with it in the corpus data.
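To make the link-based idea behind ROCK more concrete, the following is a simplified sketch and not the implementation used in the study. It assumes that each river is represented by its set of associated terms, that two rivers are neighbours when their Jaccard similarity reaches a threshold `theta`, and that a river is not counted as its own neighbour. The function names and the default values of `theta` and `k` are hypothetical choices made for illustration, and the naive merging loop ignores the efficiency measures of a full implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rock_clusters(term_sets, theta=0.5, k=2):
    """Greedy link-based agglomerative clustering in the spirit of ROCK.

    term_sets: dict mapping each item (e.g. a river name) to a set of terms.
    theta:     neighbour threshold on Jaccard similarity, 0 <= theta < 1.
    k:         desired number of clusters.
    """
    if not 0 <= theta < 1:
        raise ValueError("theta must satisfy 0 <= theta < 1")
    names = list(term_sets)

    # Neighbours: other items whose similarity reaches the threshold.
    neighbours = {
        n: {m for m in names
            if m != n and jaccard(term_sets[n], term_sets[m]) >= theta}
        for n in names
    }

    def links(p, q):
        # Links: number of neighbours that p and q have in common.
        return len(neighbours[p] & neighbours[q])

    # Exponent 1 + 2 f(theta), with f(theta) = (1 - theta) / (1 + theta).
    exponent = 1 + 2 * (1 - theta) / (1 + theta)

    def goodness(c1, c2):
        # Cross-cluster links, normalised by the expected number of links.
        cross = sum(links(p, q) for p in c1 for q in c2)
        n1, n2 = len(c1), len(c2)
        return cross / ((n1 + n2) ** exponent - n1 ** exponent - n2 ** exponent)

    clusters = [frozenset([n]) for n in names]
    while len(clusters) > k:
        best = max(combinations(clusters, 2), key=lambda pair: goodness(*pair))
        if goodness(*best) == 0:
            break  # no remaining pair of clusters shares any link
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters
```

Because merging is driven by shared neighbours rather than by direct similarity alone, two rivers end up in the same cluster when they are similar to the same set of other rivers, which suits sparse binary term data.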