Collocation and Term Extraction Using Linguistically Enhanced Statistical Methods
The research presented in this thesis substantiates, defines and evaluates two new linguistically motivated statistical association measures in a language- and domain-independent manner: limited syntagmatic modifiability (LSM) for collocation extraction and limited paradigmatic modifiability (LPM) for term extraction. The task they are designed for – computing lexical association scores to determine the degree of collocativity and termhood of collocation and term candidates – is the crucial backbone of any approach to collocation and term extraction. In this respect, LSM and LPM resemble the wide variety of standard frequency-based, statistical and information-theoretic association measures put forth in the computational linguistics research literature. What distinguishes LSM and LPM is that their defining parameters are based on actual linguistic properties of the targeted linguistic constructions, viz. collocations and terms. The central linguistic property that the linguistic research literature identifies as shared by collocations and terms is limited modifiability. This property is parameterized so as to account for the obvious linguistic differences between the two: collocations are typically manifested in general language and surface in a variety of syntactic constructions, while terms are typically confined to noun phrases in domain-specific sub-language. Limited modifiability is embedded within an appropriate linguistic frame of reference – the lexical-collocational layer of Firth's (1957) contextualist model of language description. With the help of this model, the linguistic differences are realized as limited syntagmatic modifiability in the case of collocations and as limited paradigmatic modifiability in the case of terms.
The respective linguistically enhanced lexical association measures exploit these properties as observable and quantifiable parameters in their statistical computations: LSM incorporates the tendency of collocations to limit the number of potential syntagmatic attachments, whereas LPM incorporates the tendency of terms to limit the number of potential paradigmatic substitutions. Frequency of co-occurrence is another prominent linguistic property incorporated into both measures, and it is the only linguistic property also exploited by the standard frequency-based, statistical and information-theoretic association measures for collocation and term extraction. In order to compare LSM and LPM against their standard competitors, a comprehensive performance evaluation setting is established – for collocation extraction on German-language preposition-noun-verb collocation candidates and for term extraction on English-language noun phrase term candidates from a biomedical subdomain. In this setting, a wide array of standard quantitative performance metrics is applied, together with a new qualitative performance evaluation metric which compares the output rankings of an association measure with the challenging baseline of frequency of co-occurrence. All experimental results show that LSM and LPM outperform the other frequency-based, statistical and information-theoretic lexical association measures by large margins in every aspect of performance evaluation considered. Thus, lexical association measures which base their statistical computations on linguistic parameters instead of standard statistical ones exhibit not only conceptual but also empirical superiority.