Student Research Workshop Abstracts
Automatic Prediction of Cognate Orthography Using Support Vector Machines
This paper describes an algorithm that uses Support Vector Machines to automatically generate a list of cognates in a target language. While Levenshtein distance was used to align the training file, no knowledge repository other than an initial list of cognates used for training was input into the algorithm. Evaluation was set up in a cognate production scenario that mimicked a real-life situation in which no word lists were available in the target language, providing an ideal environment to test the feasibility of a more ambitious project that will involve language portability. An overall improvement of 50.58% over the baseline is a promising result.
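As an illustration of the Levenshtein distance mentioned above, the following is a minimal sketch of the standard dynamic-programming computation. The abstract does not say how the distance was turned into an alignment of the training file, so the function name and the example word pairs are assumptions made for illustration only.

```python
def levenshtein(source: str, target: str) -> int:
    """Return the minimum number of single-character edits between two strings."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


# Hypothetical word pairs, not taken from the paper's data.
print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("colour", "color"))    # 1
```

Keeping only two rows of the table is enough for the distance; recovering the character alignment itself would require the full table and a backtrace.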
Kinds of Features for Chinese Opinionated Information Retrieval
This paper presents the results of experiments in which we tested different kinds of features for retrieval of Chinese opinionated texts. We assume that the task of retrieval of opinionated texts (OIR) can be regarded as a subtask of general IR, but with some distinct features. The experiments showed that the best results were obtained from the combination of character-based processing, dictionary lookup (maximum matching) and a negation check.
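The dictionary lookup by maximum matching and the negation check named above can be sketched as follows. This is a minimal illustration under stated assumptions: the greedy forward direction, the tiny polarity lexicon, the negator set, and the rule "flip the polarity when the preceding token is a negator" are all hypothetical choices, since the abstract does not specify them.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position take the longest lexicon entry."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def score_opinion(text, polarity_lexicon, negators, max_len=4):
    """Sum the polarity of matched words, flipping the sign after a negator (assumed rule)."""
    tokens = max_match(text, set(polarity_lexicon) | set(negators), max_len)
    score = 0
    for idx, token in enumerate(tokens):
        if token in polarity_lexicon:
            value = polarity_lexicon[token]
            if idx > 0 and tokens[idx - 1] in negators:
                value = -value
            score += value
    return score


# Hypothetical lexicon entries chosen for illustration.
polarity = {"喜欢": 1, "讨厌": -1}
negators = {"不"}
print(max_match("我不喜欢这本书", set(polarity) | negators))
print(score_opinion("我不喜欢这本书", polarity, negators))
```

Unmatched characters are emitted one at a time, which is how the sketch combines dictionary lookup with character-based processing.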
Identifying Linguistic Structure in a Quantitative Analysis of Dialect Pronunciation
Computational dialectometry is a multidisciplinary field that uses quantitative methods to measure linguistic differences between dialects. The distances between dialects are measured at different levels (phonetic, lexical, syntactic) by aggregating over entire data sets. These aggregate analyses do not expose the underlying linguistic structure, i.e. the specific linguistic elements that contributed to the differences between the dialects. The aim of this paper is to present a new method for identifying linguistic structure in the aggregate analysis of language variation. The method is based on the identification of regular sound correspondences and is applied here for the first time in dialectometry to extract linguistic structure from the aggregate analysis. Regular sound correspondences are automatically extracted from the aligned transcriptions of words and then quantified in order to characterize each site by the frequency of a certain sound extracted from the pool of the site's pronunciations. All the analyses are based on the transcriptions of 117 words collected from 84 sites equally distributed over the Bulgarian dialect area. The results show that the identification of regular sound correspondences can be successfully applied to the task of identifying linguistic structure in the aggregate analysis of dialects based on word pronunciation.
Computing Lexical Chains with Graph Clustering
Text understanding tasks such as topic detection, automatic summarization, discourse analysis and question answering require deep understanding of the text’s meaning. The first step in determining this meaning is the analysis of the text’s concepts and their inter-relations. Lexical chains provide a framework for such an analysis. They combine semantically related words across sentences into meaningful sequences that reflect the cohesive structure of the text.
Introduced by Morris and Hirst (1991), lexical chains have been studied extensively in the last decade, since large lexical databases became available in digital form. Most approaches use WordNet or Roget’s thesaurus for computing the chains and apply the results to text summarization.
We present a new approach for computing lexical chains by treating them as graphs, where nodes are document terms and edges reflect semantic relations between them. In contrast to previous methods, we analyze the cohesive strength within a chain by computing the diameter of the chain graph. Weakly cohesive lexical chains have a high graph diameter and are decomposed by a graph clustering algorithm into several highly cohesive chains. Instead of a general lexical database like WordNet, we use the domain-specific thesaurus Agrovoc. The lexical chains produced with this method consist of highly cohesive domain-specific terms and reflect individual topics discussed in a document.
We first give an overview of existing methods for computing lexical chains and of related areas. Then we discuss the motivation behind the new approach and describe the algorithm in detail. Experimental data shows the quality of the extracted lexical chains for the task of topic and keyphrase indexing. The results are compared to keyphrases assigned by humans and by Kea, an existing statistical keyphrase extraction algorithm.
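The cohesion measure described in this abstract, the diameter of the chain graph, can be sketched with a breadth-first search from every node. This is a minimal sketch assuming an unweighted, undirected graph; the term names `t1` to `t4` are placeholders rather than Agrovoc entries, and the diameter threshold and clustering step used in the paper are not given in the abstract, so they are left out.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (term, term) relation pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def eccentricity(graph, start):
    """Longest shortest-path distance (in edges) from start, found by breadth-first search."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    if len(distance) < len(graph):
        raise ValueError("chain graph is not connected")
    return max(distance.values())


def diameter(graph):
    """Largest eccentricity over all nodes of a connected graph."""
    return max(eccentricity(graph, node) for node in graph)


# A loosely connected chain (a path) versus a tightly connected one (a clique).
loose = build_graph([("t1", "t2"), ("t2", "t3"), ("t3", "t4")])
tight = build_graph([("t1", "t2"), ("t1", "t3"), ("t1", "t4"),
                     ("t2", "t3"), ("t2", "t4"), ("t3", "t4")])
print(diameter(loose), diameter(tight))  # 3 1
```

The path-shaped chain has the higher diameter, so under the abstract's criterion it would be the candidate for decomposition into smaller chains.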
Logistic Online Learning Methods and Their Application to Incremental Dependency Parsing
We investigate a family of update methods for online machine learning algorithms for cost-sensitive multiclass and structured classification problems. The update rules are based on multinomial logistic models. The most interesting question for such an approach is how to integrate the cost function into the learning paradigm. We propose a number of solutions to this problem.
To demonstrate the applicability of the algorithms, we evaluated them on a number of classification tasks related to incremental dependency parsing. These tasks were conventional multiclass classification, hierarchical classification, and a structured classification task: complete dependency tree prediction. The performance figures of the logistic algorithms range from slightly lower to slightly higher than those of margin-based online algorithms.
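For readers unfamiliar with online updates based on multinomial logistic models, the sketch below shows one plain stochastic gradient step on the log-loss for a single example. It is an assumption-laden illustration, not the update rules proposed in the paper: the cost function, whose integration is the paper's main question, is deliberately left out, and the function names and learning rate are hypothetical.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities in a numerically stable way."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logistic_update(weights, features, gold, learning_rate=0.1):
    """Apply one online gradient step for a multinomial logistic model, in place.

    weights  -- list of per-class weight vectors
    features -- feature vector of the current example
    gold     -- index of the correct class
    Returns the class probabilities computed before the update.
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    probs = softmax(scores)
    for label, row in enumerate(weights):
        # Gradient of the log-likelihood: indicator of the gold class minus its probability.
        scale = (1.0 if label == gold else 0.0) - probs[label]
        for k, x in enumerate(features):
            row[k] += learning_rate * scale * x
    return probs


weights = [[0.0, 0.0] for _ in range(3)]
before = logistic_update(weights, [1.0, 2.0], gold=0)
after = logistic_update(weights, [1.0, 2.0], gold=0)
print(before[0], after[0])  # the gold-class probability increases after one step
```

A cost-sensitive variant would have to change how `scale` is computed; how best to do that is exactly what the abstract leaves to the paper.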
Inducing Combinatory Categorial Grammars with Genetic Algorithms
This paper proposes a novel approach to the induction of Combinatory Categorial Grammars (CCGs), motivated by their potential affinity with Genetic Algorithms (GAs). Specifically, CCGs utilize a rich yet compact notation for lexical categories, which combine with relatively few grammatical rules, presumed universal. Thus, the search for a CCG consists in large part of a search for the appropriate categories for the data-set's lexical items. We present and evaluate a system utilizing a simple GA to successively search and improve on such assignments. The fitness of categorial assignments is approximated by the coverage of the resulting grammar on the data-set itself, and candidate solutions are updated via the standard GA techniques of reproduction, crossover and mutation.
Towards a Computational Treatment of Superlatives
I propose a computational treatment of superlatives, starting with superlative constructions and the main challenges in automatically recognising and extracting their components. Initial experimental evidence is provided for the value of the proposed work for Question Answering. I also briefly discuss its potential value for Sentiment Detection and Opinion Extraction.
Measuring Syntactic Difference in British English
Nathan C. Sanders
Recent work by Nerbonne and Wiersma (2006) has provided a foundation for measuring syntactic differences between corpora. It uses part-of-speech trigrams as an approximation to syntactic structure, comparing the trigrams of two corpora for statistically significant differences.
This paper extends the method and its application. It extends the method by replacing trigrams with the leaf-path ancestors of Sampson (2000), which capture internal syntactic structure: every leaf in a parse tree records the path back to the root.
The corpus used for testing is the International Corpus of English, Great Britain (Nelson et al., 2002), which contains syntactically annotated speech from Great Britain. The speakers are grouped into geographical regions based on place of birth. This differs in both nature and number from previous experiments, which found differences between two groups of Norwegian L2 learners of English. Comparison of the twelve British regions from the ICE-GB should show whether dialectal variation is detectable by this algorithm.
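The two kinds of features contrasted in this abstract, part-of-speech trigrams and leaf-to-root paths, can be sketched on a toy parse tree. The nested-tuple tree encoding, the Penn-style tags and the example sentence are assumptions made for illustration (ICE-GB uses its own annotation scheme), the path representation is a simplification of Sampson's, and the statistical comparison between corpora is not shown.

```python
from collections import Counter


def leaf_paths(tree, ancestors=()):
    """Yield, for each leaf, the labels from its POS tag up to the root.

    A tree is a tuple (label, child, ...); a preterminal is (tag, word).
    """
    label, *children = tree
    path = (label,) + ancestors
    if len(children) == 1 and isinstance(children[0], str):
        yield path  # reached a (tag, word) pair
    else:
        for child in children:
            yield from leaf_paths(child, path)


def pos_trigrams(tags):
    """Return all consecutive triples of part-of-speech tags."""
    return list(zip(tags, tags[1:], tags[2:]))


# Hypothetical parse of "the dog barked".
tree = ("S",
        ("NP", ("DT", "the"), ("NN", "dog")),
        ("VP", ("VBD", "barked")))

paths = list(leaf_paths(tree))
tags = [path[0] for path in paths]
print(paths)
print(pos_trigrams(tags))

# Feature counts of this kind would be collected per region and then compared.
path_counts = Counter(paths)
```

The trigram sees only the flat tag sequence, while each path also records which constituents dominate the word, which is the additional structure the extended method relies on.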
A Practical Classification of Multiword Expressions
The present paper formulates a proposal of a taxonomy for multiword expressions, useful for the purposes of natural language processing. The taxonomy is based on the stages in the NLP workflow in which the individual classes of units can be processed successfully. We also suggest the tools that can be used for processing the units in each of the classes.
The first section contains the description of the proposed classification, which at this stage of work consists of two groups of multiword expressions. The first one contains units that should be processed before syntactic analysis, and the other one units whose processing should be combined with parsing.
The second section offers some rationale for the classification, and shows how some formalisms fail to describe multiword expressions properly because they try to cover all of them with the same methods, which we believe to be the wrong approach.
The third section is an overview of some previous classifications of multiword expressions and shows their inadequacy for the purposes of natural language processing.
An Implementation of Combined Partial Parser and Morphosyntactic Disambiguator
The aim of this paper is to present a formalism for simultaneous rule-based morphosyntactic tagging and partial parsing, together with a simple yet efficient implementation of a tool for it. The parser is currently used for creating a partial treebank in a valency acquisition project over the IPI PAN Corpus of Polish.
Usually tagging and partial parsing are done separately, with the input to a parser assumed to be a morphosyntactically fully disambiguated text. Rules used in rule-based tagging often implicitly identify syntactic constructs, but do not mark such constructs in texts. Although morphosyntactic disambiguation rules and partial parsing rules often encode the same linguistic knowledge, we are not aware of any partial (or shallow) parsing systems accepting morphosyntactically ambiguous input and disambiguating it with the same rules that are used for parsing. This paper presents a formalism for such simultaneous tagging and parsing, as well as a simple implementation of a parser understanding it.
The input to the parser is a tagset definition, a set of parsing rules and a collection of morphosyntactically analysed texts. The output contains disambiguation annotation and two new levels of constructions: syntactic words (named entities, analytical forms, or any other sequences which, from the syntactic point of view, behave as single words) and syntactic groups (with syntactic and semantic heads identified).
Adaptive String Distance Measures for Bilingual Dialect Lexicon Induction
This paper compares different measures of graphemic similarity applied to the task of bilingual lexicon induction between a Swiss German dialect and Standard German. The metrics have been adapted to this particular language pair by training with the Expectation-Maximisation algorithm or by using hand-made transduction rules. These adaptive metrics show significant improvement over a static metric like Levenshtein distance.
Semantic Classification of Noun Phrases Using Web Counts and Learning Algorithms
This paper investigates using machine learning algorithms to label modifier-noun compounds with a semantic relation. The attributes used as input to the learning algorithms are the web frequencies for phrases containing the modifier, noun, and a prepositional joining term. We compare and evaluate different algorithms and different joining phrases on Nastase and Szpakowicz’s (2003) dataset of 600 modifier-noun compounds.
Modifier-noun phrases are often used interchangeably with paraphrases which contain the modifier and the noun joined by a preposition or simple verb. For example, the query “morning exercise” returns 133,000 results from the Yahoo search engine, and a query for the phrase “exercise in the morning” returns 47,500 results. Sometimes people choose to use a modifier-noun compound phrase to describe a concept, and sometimes they choose to use a paraphrase which includes a preposition or simple verb joining head noun and the modifier. One method for deducing semantic relations between words in compounds involves gathering n-gram frequencies of these paraphrases, containing a noun, a modifier and a “joining term” that links them. Some algorithm can then be used to map from joining term frequencies to semantic relations and so find the correct relation for the compound in question. This is the approach we use in our experiments. We choose two sets of joining terms, based on the frequency with which they occur in between nouns in the British National Corpus (BNC). We experiment with three different learning algorithms: Nearest Neighbor, Multi-Layer Perceptron and Support Vector Machines (SVM). We find that by using a Support Vector Machine classifier we can obtain better performance on this dataset than a current state-of-the-art system, even with a relatively small set of prepositional joining terms.
Exploiting Structure for Event Discovery Using the MDI Algorithm
Effectively identifying events in unstructured text is a very difficult task, largely because an individual event can be expressed by several sentences. In this paper, we investigate the use of clustering methods for the task of grouping the text spans in a news article that refer to the same event. The key idea is to cluster the sentences using a novel distance metric that exploits regularities in the sequential structure of events within a document. When this approach is compared to a simple bag-of-words baseline, more accurate clustering solutions are observed.
Annotating and Learning Compound Noun Semantics
Diarmuid Ó Séaghdha
There is little consensus on a standard experimental design for the compound interpretation task. This paper introduces well-motivated general desiderata for semantic annotation schemes, and describes such a scheme for in-context compound annotation accompanied by detailed publicly available guidelines. Classification experiments on an open-text dataset compare favourably with previously reported results and provide a solid baseline for future research.
Limitations of Current Grammar Induction Algorithms
I review a number of grammar induction algorithms (ABL, Emile, Adios) and test them on the Eindhoven corpus, with disappointing results compared to those on the usually tested corpora (ATIS, OVIS). I also show that using neither POS-tags induced from Biemann's unsupervised POS-tagging algorithm nor hand-corrected POS-tags as input improves this situation. Finally, I argue for the development of entirely incremental grammar induction algorithms instead of the approaches of the systems discussed before.
Clustering Hungarian Verbs on the Basis of Complementation Patterns
Kata Gábor and Enikő Héja
Our paper reports an attempt to apply an unsupervised clustering algorithm to a Hungarian treebank in order to obtain semantic verb classes. Starting from the hypothesis that semantic metapredicates underlie verbs' syntactic realization, we investigate how one can obtain semantically motivated verb classes by automatic means. The 150 most frequent Hungarian verbs were clustered on the basis of their complementation patterns, yielding a set of basic classes and hints about the features that determine verbal subcategorization. The resulting classes serve as a basis for the subsequent analysis of their alternation behavior.