Impact of MWE Resources on Multiword Recognition

In this paper, we demonstrate the impact of Multiword Expression (MWE) resources on the task of MWE recognition in text. We present results on the Wiki50 corpus for MWE resources generated from raw text using unsupervised methods and for resources extracted using manual text markup and lexical resources. We show that resources acquired from manual annotation yield the best MWE tagging performance. However, a more fine-grained analysis that differentiates MWEs according to their part of speech (POS) reveals that automatically acquired MWE lists outperform the resources generated from human knowledge for three out of four classes.


Introduction
Identifying MWEs in text is related to the task of Named Entity Recognition (NER). However, the task of MWE recognition mostly considers the detection of word sequences that form MWEs and are not Named Entities (NEs). For both tasks, sequence tagging algorithms such as Hidden Markov Models (HMM) or Conditional Random Fields (CRF) are usually trained and then applied to previously unseen text. In order to tackle the recognition of MWEs, most approaches (e.g. Schneider et al., 2014; Constant and Sigogne, 2011) use resources containing MWEs. These are mostly extracted from lexical resources (e.g. WordNet) or from markup in text (e.g. Wikipedia, Wiktionary). While these approaches work well, they require such resources and markup, which might not be available for special domains or under-resourced languages.
In contrast, methods have been developed that automatically rank word sequences according to their multiwordness using information from corpora, mostly relying on frequencies. Many of these methods (e.g. C/NC-Value (Frantzi et al., 1998), GM-MF (Nakagawa and Mori, 2002)) require a preceding filtering step based on Part-of-Speech (POS) sequences. Such sequences (e.g. those of Frantzi et al. (1998)) need to be defined in advance and mostly do not cover all POS types of MWEs.
In this work, we do not want to restrict ourselves to specific MWE types and thus use DRUID (Riedl and Biemann, 2015) and the Student's t-test as multiword ranking methods, neither of which requires any previous filtering. This paper focuses on the following research question: how do such lists generated from raw text compete against manually generated resources? Furthermore, we want to examine whether a combination of resources yields better performance.

Related Work
There is a considerable amount of research on the recognition of word sequences, be they NEs or MWEs. The field of NER can be considered a subtask of MWE recognition. However, NER additionally requires recognizing single-word names.
The experiments proposed in our paper are related to the ones performed by Nagy T. et al. (2011). Their paper focuses on the introduction of the Wiki50 dataset and demonstrates how the performance of the system can be improved by combining classifiers for NE and MWE. Here, we focus on the impact of different MWE resources.
Extensive evaluations of different measures for ranking word sequences by their multiwordness have been carried out before. Korkontzelos (2010) performs a comparative evaluation of MWE measures that all rely on POS filtering. Riedl and Biemann (2015), in contrast, introduced a measure based on distributional similarities that does not require pre-filtering candidate words by their POS tag. It is shown to compare favorably to an adaptation of the t-test, which relies only on filtering of frequent words.

Datasets
For the evaluation we use the Wikipedia-based Wiki50 dataset (Nagy T. et al., 2011). The dataset primarily consists of annotations for NEs, especially for the person label. The annotated MWEs are dominated by noun compounds, followed by verb-particle constructions, light-verb constructions and adjective compounds. Idioms and other MWEs occur only rarely.

Method
For detecting MWEs and NEs we use the CRF sequence-labeling algorithm (Lafferty et al., 2001). As basic features, we use a mixture of features used in previous work (Schneider et al., 2014; Constant and Sigogne, 2011). The variable i indicates the current token position:
• word shape of token i, as used by Constant and Sigogne (2011)
• whether token i contains digits
• whether token i contains alphanumeric characters
• suffix of token i with length l ∈ {1, 2, 3, 4}
• prefix of token i with length l ∈ {1, 2, 3, 4}
• lemma of token i
• lemma of token j and lemma of token j+1 with j ∈ {i − 1, i}
To show the impact of an MWE resource mr, we featurize the resource as follows:
• number of times token i occurs in mr
• token bigram: whether token j token j+1 is contained in mr, with j ∈ {i − 1, i}
• token trigram: whether token j token j+1 token j+2 occurs in mr, with j ∈ {i − 2, i − 1, i}
• token 4-gram: whether token j token j+1 token j+2 token j+3 occurs in mr, with j ∈ {i − 3, i − 2, i − 1, i}
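The resource features listed above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the function name `resource_features`, the feature names, and the representation of mr as a set of token tuples are hypothetical, and the count feature follows one possible reading of "number of times token i occurs in mr".

```python
from typing import Dict, List, Set, Tuple

Gram = Tuple[str, ...]


def resource_features(tokens: List[str], i: int, mr: Set[Gram]) -> Dict[str, float]:
    """Resource-based features for the token at position i.

    mr is assumed to be a set of MWEs, each stored as a tuple of tokens.
    """
    feats: Dict[str, float] = {}
    tok = tokens[i]

    # Assumed reading: how often token i appears across all entries of mr.
    feats["mr_count"] = float(sum(entry.count(tok) for entry in mr))

    # For n = 2, 3, 4: test every window of length n that covers position i,
    # i.e. start positions j in {i - n + 1, ..., i}, as in the feature list.
    for n in (2, 3, 4):
        for j in range(i - n + 1, i + 1):
            if j < 0 or j + n > len(tokens):
                continue  # window would cross the sentence boundary
            if tuple(tokens[j:j + n]) in mr:
                feats[f"mr_{n}gram_offset{j - i}"] = 1.0
    return feats
```

Encoding the window offset in the feature name lets the CRF distinguish whether token i starts, continues or ends a resource entry, which is a design choice of this sketch rather than something stated in the text.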

Multiword Expression Resources
For generating features from MWE resources, we distinguish between resources that are extracted from manually generated/annotated content and resources that can be automatically computed based on raw text. First, we describe the resources extracted from manually annotated corpora or resources.
• WordNet: The WordNet resource is a list of 64,188 MWEs that are extracted from WordNet (Miller, 1995).
• WikiMe: WikiMe (Hartmann et al., 2012) is a resource extracted from Wikipedia that consists of 356,467 MWEs from length two to four that have been extracted using markup information.
• SemCor: This dataset consists of 16,512 MWEs and was generated from the Semantic Concordance corpus (Miller et al., 1993).
Additionally, we select the best-performing measures for ranking word sequences according to their multiwordness, as described by Riedl and Biemann (2015), that do not require any POS filtering:
• DRUID: We use the DRUID implementation, which is based on a distributional thesaurus (DT) and does not rely on any linguistic processing (e.g. POS tagging).
• t-test: The Student's t-test is a statistical test that can be used to compute the significance of the co-occurrence of tokens. For this it relies on the frequency of the single terms as well as of the word sequence. As this measure tends to rank word sequences that begin and end with stopwords highest, we remove word sequences that begin and end with stopwords. As stopwords, we select the 100 most frequent words from the Wikipedia corpus.
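To make the t-test ranking concrete, the sketch below scores n-grams with the textbook collocation form of the t-test, t = (x̄ − μ) / sqrt(s² / N), where x̄ is the observed relative frequency of the word sequence, μ is the frequency expected if its tokens were independent, and s² is approximated by x̄. The exact adaptation used in the experiments is not spelled out in the text, so this formulation, the function name `t_scores` and the tiny toy corpus are assumptions for illustration only.

```python
import math
from collections import Counter
from typing import Dict, List, Set, Tuple

Gram = Tuple[str, ...]


def t_scores(tokens: List[str], n: int, stopwords: Set[str]) -> Dict[Gram, float]:
    """Score all n-grams of a token stream with a collocation-style t-test."""
    total = len(tokens)
    unigrams = Counter(tokens)
    ngrams = Counter(tuple(tokens[k:k + n]) for k in range(total - n + 1))

    scores: Dict[Gram, float] = {}
    for gram, freq in ngrams.items():
        # Stopword filter as described in the text: drop sequences that
        # begin and end with a stopword.
        if gram[0] in stopwords and gram[-1] in stopwords:
            continue
        observed = freq / total  # sample mean
        # Expected relative frequency under token independence.
        expected = math.prod(unigrams[w] / total for w in gram)
        # Variance approximated by the sample mean (valid for small probabilities).
        scores[gram] = (observed - expected) / math.sqrt(observed / total)
    return scores
```

A stopword list in the spirit of the text could be derived as `{w for w, _ in Counter(tokens).most_common(100)}`; sorting the returned dictionary by score in descending order then gives the ranked candidate list.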

Experimental Setting
We perform the evaluation using 10-fold cross-validation and use the crfsuite implementation of CRF as the classifier. For retrieving POS tags, we apply the OpenNLP POS tagger. Lemmatization is performed using the WordNetLemmatizer contained in nltk (Loper and Bird, 2002). For the computation of automatically generated MWE lists, we use the raw text from an English Wikipedia dump, without considering any markup or annotations. When applying them as resources, we only consider word sequences in the resource that also occur in the Wiki50 dataset, in either the training or the test data. From these candidates, we select the n highest-ranked MWE candidates. This pre-filtering does not influence the performance of the algorithm but makes the cutoff parameter n easier to choose.
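The candidate restriction and top-n selection described above can be sketched as follows. The helper names `ngrams_in_corpus` and `top_n_resource` are hypothetical, and the maximum n-gram length of four is an assumption taken from the longest resource feature; the sketch only illustrates the selection logic.

```python
from typing import Dict, Iterable, List, Set, Tuple

Gram = Tuple[str, ...]


def ngrams_in_corpus(sentences: Iterable[List[str]], max_len: int = 4) -> Set[Gram]:
    """Collect every word sequence of length 2..max_len occurring in the dataset."""
    seen: Set[Gram] = set()
    for sent in sentences:
        for n in range(2, max_len + 1):
            for k in range(len(sent) - n + 1):
                seen.add(tuple(sent[k:k + n]))
    return seen


def top_n_resource(scores: Dict[Gram, float], corpus_grams: Set[Gram], n: int) -> Set[Gram]:
    """Keep only ranked candidates that occur in the dataset, then take the n best."""
    candidates = [g for g in scores if g in corpus_grams]
    candidates.sort(key=lambda g: scores[g], reverse=True)
    return set(candidates[:n])
```

Because the CRF features only ever look up word sequences that occur in the dataset, dropping all other candidates before the cutoff cannot change any feature value; it only means that n counts candidates that can actually match.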

Results
First, we show the overall performance on the Wiki50 dataset for recognizing labeled MWE and NE spans. We report the performance of classifiers trained to predict solely NEs, solely MWEs, and the combination of both, without using any MWE resource. As can be observed (see Table 2), the detection of NEs reaches higher scores than the prediction of MWEs. Comparing the performance between classifying solely NEs and solely MWEs, we observe low recall for predicting MWEs. Next, we conduct experiments on predicting MWEs with the use of MWE resources.
In Table 3 we present results for the overall labeled performance for MWEs in the Wiki50 dataset. Using MWE resources, we observe consistent improvements over the baseline approach, which does not rely on any MWE resource (None). For manually constructed MWE resources, improvements of up to 3 points F1-measure on MWE labeling are observed, the most useful resource being WikiMe. The combination of manual resources does not yield improvements.
Table 4: Detailed performance in terms of precision (P), recall (R) and F1-measure (F1) for the different MWE types. The experiments have been performed only on the MWE annotations.
ing measures. Whereas we observe improvements of around 1 point F1 for the t-test, we gain improvements of almost 2 points for DRUID. When extracting the top 10,000 MWEs, additional improvements can be obtained, which are close to the performance achieved using the markup-based MWE resources. Here, using DRUID with the 10,000 highest-ranked MWEs achieves the third-best improvement among all resources. Using more than the top 10,000 ranked word sequences does not result in any further performance improvement. Surprisingly, using MWE resources as features for MWE recognition improves the performance only marginally.
We assume that each resource focuses on different kinds of MWEs. Thus, we also show results for the four most frequent MWE types in Table 4. Inspecting the results using MWE lists that are generated using human knowledge, we obtain the best performance for noun compounds using WikiMe. Verb-particle constructions seem to be better covered by the WordNet-based resource. For light-verb constructions the highest F1 measures are observed using EnWikt and WikiMe, and for adjective compounds EnWikt achieves the highest improvements. We omit results for the MWE classes other and idiom, as only few annotations are available in the Wiki50 dataset.
Inspecting results for the t-test and DRUID, we obtain slightly higher F1 measures for noun compounds using DRUID. For verb-particle constructions, the t-test achieves the overall highest precision, whereas recall and F1 measure are higher for DRUID. However, the t-test achieves better results for light-verb constructions, and using DRUID yields the highest F1 measure for adjective compounds.
Overall, only for noun compounds are the best results obtained using MWE lists that are generated from lexical resources or text annotations. For all remaining labels, the best performance is obtained using MWE lists that can be generated in an unsupervised fashion. However, as noun compounds constitute the largest class, using unsupervised lists does not result in the best overall performance.
In addition, we performed the classification task of MWEs without labels, as shown in Table 5. In contrast to the overall labeled results (see Table 3), the performance drops. Whereas one might expect higher results for the unlabeled dataset, the labels help the classifier to use features according to the label. This is in accordance with the previous findings shown in Table 4. Furthermore, in this evaluation the highest improvements are achieved with EnWikt. Using MWE lists that are generated in an unsupervised fashion results in scores comparable to EnWikt. Again, these resources have the third-highest performance of all lists and outperform SemCor and WordNet.
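The difference between labeled and unlabeled scoring can be illustrated with a small sketch. It assumes exact-match span evaluation, with spans given as (start, end, label) triples; the evaluation procedure behind the reported tables is not described in that detail, so the span format and the function names `prf` and `strip_labels` are hypothetical.

```python
from typing import Hashable, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index, MWE label)


def prf(gold: Set[Hashable], pred: Set[Hashable]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over exactly matching items."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def strip_labels(spans: Set[Span]) -> Set[Tuple[int, int]]:
    """Project labeled spans onto their boundaries for unlabeled scoring."""
    return {(start, end) for start, end, _ in spans}


gold = {(0, 2, "NC"), (5, 7, "VPC"), (9, 11, "LVC")}
pred = {(0, 2, "NC"), (5, 7, "LVC")}
labeled = prf(gold, pred)
unlabeled = prf(strip_labels(gold), strip_labels(pred))
```

For a fixed set of predictions, unlabeled scores can never be lower than labeled ones, which is why one might expect higher results in the unlabeled setting. The drop reported above therefore has to come from the classifier itself being trained without label information, not from the scoring.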

Conclusion
In this paper, we have investigated whether MWE resources acquired in an unsupervised fashion are comparable with knowledge-based or manual-annotation-based MWE resources for the task of MWE tagging in context. The highest overall performance, both for the labeled and the unlabeled tagging task, is achieved using lists extracted from Wikipedia (WikiMe) and Wiktionary (EnWikt). However, for three out of four MWE types, resources that are extracted using unsupervised methods achieve the highest scores. In summary, using MWE lists as features for MWE recognition with sequence tagging adds a few points in F-measure. If high-quality MWE resources exist, they should be used. If not, it is possible to replace them with unsupervised extraction methods such as the t-test or DRUID.