Never-Ending Multiword Expressions Learning



Introduction
Multiword expressions (MWEs) are combinations of two or more lexemes that present some lexical, syntactic, semantic, pragmatic or statistical idiosyncrasy with respect to regular combinations (Baldwin and Kim, 2010). Examples include idioms (saw logs, meaning to snore), phrasal verbs (pull over, give up), noun compounds (machine learning, support vector machine) and complex function words (as well as, with respect to).
In human languages, such constructions are so frequent that native speakers rarely realize how often they employ them (Sag et al., 2002; Jackendoff, 1997b). However, they are underrepresented in NLP resources such as lexicons and grammars, and this represents a bottleneck for building robust and accurate NLP applications.
Since the construction of such resources is onerous and demands highly qualified linguistic expertise, automatic MWE lexicon extraction is an attractive alternative which has been one of the most active topics in the MWE research community. Proposed methods are often based on supervised and unsupervised learning of MWE lists from textual corpora (Evert and Krenn, 2005; Pecina, 2008). In spite of the availability of very large corpora like the Gigaword or WaC (Baroni et al., 2009), these methods are still limited by the coverage of the texts in the source corpus. This paper presents NEMWEL, a machine learning system able to learn MWEs following the never-ending approach (Mitchell et al., 2015). NEMWEL automatically extracts MWE candidates from a corpus periodically crawled from a Brazilian online news portal. Then, based on supervised training, NEMWEL classifies the candidates and promotes some of them to the status of "true MWEs", which are then used to retrain the classifier. This process is repeated endlessly, taking into account the true MWEs learned in previous steps. In doing so, NEMWEL tries to resemble the way human beings learn.
We have developed a prototype that implements this idea. To the best of our knowledge, this is the first attempt to build MWE lexicons using a never-ending learning approach. We have manually evaluated the extracted MWEs and we show that the precision of the learner seems to increase with time.
The remainder of this paper is structured as follows: we discuss related work on MWE extraction (Section 2) and never-ending learning methods (Section 3). Then, we present the architecture and detail the modules in NEMWEL (Section 4). Finally, we present the results of automatic and manual evaluation in Brazilian Portuguese (Section 5) and ideas for future work (Section 6).
MWE Extraction

Supervised machine learning methods have also been used for MWE lexicon learning. Pecina (2008) proposes a logistic regression classifier which uses as features a set of 84 different lexical association measures. Ramisch et al. (2008) also rely on standard association measures, but add variation entropy and classify MWEs with decision trees. In terms of classifiers, many alternatives have been tested, such as Bayesian networks (Dubremetz and Nivre, 2014) and support vector machines (Farahmand and Martins, 2014). Zilio et al. (2011) use a stable set of features, but compare several classification algorithms implemented in Weka. Furthermore, in-context MWE tagging has been performed using sequence learning models like conditional random fields (Constant and Sigogne, 2011) and the structured perceptron (Schneider et al., 2014).

Many alternative sources and methods have been tested for MWE extraction, such as parallel texts (Caseli et al., 2010; Tsvetkov and Wintner, 2010), bilingual lexicons (Salehi and Cook, 2013), Wikipedia interlingual links (Attia et al., 2010), WordNet synonyms (Pearce, 2001) and distributional neighbors (Reddy et al., 2011). The web has also been considered as a source for MWE learning, often using page hit counts from search engines (Lapata and Keller, 2005; Kim and Nakov, 2011). However, in related work, candidates are not extracted from web texts, but from traditional corpora.
Unlike previous corpus-based or web-based learning approaches, our goal is not to build a single static MWE lexicon. Instead, we propose to build a system that continuously learns new expressions from the web, populating and enriching the lexicon with new MWEs every day. Our proposal is to employ bootstrapping in a traditional supervised machine learning setting, enriched with new features and dynamically crawled corpora. At any given time, a snapshot of the database contains the current MWE lexicon, which can be exported, evaluated and used to retrain the classifier. To the best of our knowledge, this is the first time never-ending learning is applied to MWE lexicon discovery.

Never-Ending Learning
In traditional machine learning, an algorithm is usually applied to learn a model from a fixed amount of labeled training data. Although effective in many applications, this way of learning is very limited and also far from the way that human beings learn. Never-ending learning is an approach that tries to resemble the way humans learn, taking into account different sources of information and using previous experience to guide subsequent learning (Mitchell et al., 2015). It can be seen as a bootstrapping algorithm: it requires a small set of annotated items to initialize the model, and then uses its own results to retrain the classifier in subsequent iterations.
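This loop can be sketched as follows. The sketch is a minimal illustration, not NEMWEL's actual code: the simple threshold learner `train` stands in for a real classifier, and candidates are reduced to single scores.

```python
def train(examples):
    """Toy stand-in classifier: learns a score threshold from labeled
    (score, label) pairs. A real system would train, e.g., an SVM here."""
    pos = [s for s, lab in examples if lab]
    neg = [s for s, lab in examples if not lab]
    threshold = (min(pos) + max(neg)) / 2 if pos and neg else 0.5
    return lambda score: score >= threshold

def never_ending_learning(seed, candidate_batches):
    """Bootstrapping sketch: a small labeled seed initializes the model;
    candidates classified as positive are fed back as new training data,
    so each iteration retrains on everything learned so far."""
    training_data = list(seed)
    beliefs = []
    for batch in candidate_batches:          # one batch per iteration
        classify = train(training_data)      # retrain on all data so far
        for score in batch:
            if classify(score):              # promote positive candidates
                beliefs.append(score)
                training_data.append((score, True))
    return beliefs
```

The key property is that the training set grows with the system's own (assumed-correct) decisions, which is exactly what makes error propagation a concern in bootstrapping systems.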
The main system developed following the never-ending learning approach is the Never-Ending Language Learner (NELL) of Carlson et al. (2010). NELL is the learning system of the Read the Web project and has been running 24 hours a day since 2010. NELL's goals are (1) to read the web, extracting beliefs (true facts) that populate a knowledge base, and (2) to learn better day by day. To do so, NELL performs different learning tasks (category classification, relation classification, etc.) and combines different learning functions to make decisions and improve its learning methods (Mitchell et al., 2015).
In this paper we describe the Never-Ending MultiWord Expressions Learner (NEMWEL). Unlike NELL, NEMWEL is in its first year of life and is intended only to learn MWEs. Following the main never-ending learning premise, however, NEMWEL uses its previously learned knowledge to better learn new MWEs.
According to Jackendoff (1997a), a lexicon contains as many MWEs as single words. For Sag et al. (2002), this is an underestimation, and the real number of MWEs grows as language evolves. These observations corroborate our idea that a never-ending learning system is a good fit for the MWE extraction problem.

The Never-Ending MWE Learner
NEMWEL was developed in Java and is divided into four modules - crawler, extractor, processor and promoter - explained in the next subsections. These four modules are applied in sequence, repeatedly, in each iteration of NEMWEL.

Crawler
The first module, the Crawler, is responsible for collecting texts from the web to build a corpus. In our current prototype, each iteration downloads 40 different articles from the G1 news portal at random, cleans them by removing HTML markup and boilerplate content, and concatenates them into a single file. Figure 1 shows an excerpt of a text from one iteration of the Crawler module.
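The cleaning step can be sketched with the Python standard library alone. This is only an illustration: `clean_article` and `build_corpus` are hypothetical names, page fetching is omitted, and real boilerplate removal (navigation, ads, comments) is far more involved than skipping script and style elements.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style content. A deliberately
    minimal stand-in for real HTML/boilerplate cleaning."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def clean_article(html):
    """Strip markup from one downloaded article, keeping its text."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def build_corpus(pages):
    """Concatenate cleaned articles into a single corpus string,
    mirroring the 'one unique file' produced by the Crawler."""
    return "\n".join(clean_article(p) for p in pages)
```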

Extractor
After collecting and cleaning the texts, the Extractor annotates each token with its surface form, part-of-speech tag and lemma. To do so, we used the TreeTagger (Schmid, 1994) with a model trained for Portuguese. Tagging the corpus is required because we evaluate our learner on nominal MWEs, so we need to be able to identify nouns and their complements. The TreeTagger was chosen because it is free, easy to use and fast, enabling us to quickly process large amounts of crawled text. The excerpt of Figure 1 as processed by the Extractor is shown in Figure 2.
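TreeTagger emits one token per line, with surface form, part-of-speech tag and lemma separated by tabs; a small parser for that format might look like this (the tags in the example come from the Portuguese model and are illustrative only):

```python
from collections import namedtuple

# One annotated token: surface form, POS tag, lemma.
Token = namedtuple("Token", "surface pos lemma")

def parse_treetagger(output):
    """Parse TreeTagger's one-token-per-line, tab-separated output
    into a list of Token triples."""
    tokens = []
    for line in output.strip().splitlines():
        surface, pos, lemma = line.split("\t")
        tokens.append(Token(surface, pos, lemma))
    return tokens
```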
The sequences of tagged tokens in the crawled texts are processed by the mwetoolkit (Ramisch, 2015), which is the core of our Extractor and Processor modules. In the Extractor, a list of MWE candidates is obtained by matching a multilevel regular-expression pattern (Figure 3) against the tagged corpus. Figure 4 shows an example of an MWE candidate extracted from our example sentence using the pattern of Figure 3. The pattern is based on intuitive noun phrase descriptions, but it also captures candidates that are not necessarily nominal compounds. Further filters must therefore be applied to remove regular noun phrases and keep only nominal MWEs.
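A procedural sketch of the same matching logic in Python, assuming tokens are (surface, POS, lemma) triples as produced by the Extractor; this illustrates the pattern (a noun followed by one to three complements, each an adjective or de plus a noun), not the mwetoolkit's actual implementation.

```python
def match_nominal_candidates(tokens):
    """Greedily match: NOM followed by 1-3 complements, where a complement
    is either ADJ or the preposition 'de' plus NOM. Tagset mirrors the
    TreeTagger Portuguese model used in the paper."""
    candidates = []
    i = 0
    while i < len(tokens):
        if tokens[i][1] == "NOM":
            j, complements = i + 1, 0
            while complements < 3:
                if j < len(tokens) and tokens[j][1] == "ADJ":
                    j += 1                       # adjective complement
                elif (j + 1 < len(tokens)
                      and tokens[j][1].startswith("PRP")
                      and tokens[j][2] == "de"
                      and tokens[j + 1][1] == "NOM"):
                    j += 2                       # 'de' + noun complement
                else:
                    break
                complements += 1
            if complements >= 1:                 # at least one complement
                candidates.append(" ".join(t[0] for t in tokens[i:j]))
                i = j
                continue
        i += 1
    return candidates
```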

Processor
In this module, the mwetoolkit calculates association measures that will be used by the Promoter in the next step. These measures are calculated based on the number of occurrences of the MWE candidate and of the words that form it.

<patterns>
  <pat>
    <w pos="NOM"/>
    <pat repeat="{1,3}">
      <either>
        <pat>
          <w pos="PRP*" lemma="de"/>
          <w pos="NOM"/>
        </pat>
        <pat>
          <w pos="ADJ"/>
        </pat>
      </either>
    </pat>
  </pat>
</patterns>

Figure 3: Pattern describing nominal multiword expressions in Brazilian Portuguese. It corresponds to a noun followed by 1 to 3 complements, each of which can be either an adjective or a prepositional phrase introduced by de.

Features
The next module, the Promoter, relies on supervised training using the 17 features defined below.
• Association measures - measures of the strength of the association between the frequency of an n-gram and the frequencies of the words that form it. In our experiments, four measures were used: normalized frequency, Student's t-score, pointwise mutual information and Dice's coefficient. All of these measures were calculated by the mwetoolkit on two corpora, G1 and PLN-BR, yielding eight features in total.
• Translatability - a measure based on the non-translatability property of true MWEs. First, we estimate the probability of a content word w being translated into a word x in English (en) and then back into Portuguese (pt), using a bilingual weighted lexicon. Two new features were proposed based on this round-trip probability. Figure 5 shows an example of these features for the candidate expression taxa de juros (interest rate).
• POS context - the parts of speech of the three tokens preceding and the three tokens following the MWE candidate, plus the concatenated parts of speech of the words that form the candidate itself. When more than one context is possible, the most frequent one is chosen. Thus, seven features are based on the POS context: three in each direction plus the POS sequence of the target candidate.
The new features proposed in this paper, based on translatability, are motivated by linguistic tests showing that MWEs have limited variability and thus, in most cases, cannot be translated word by word. Translatability is calculated using two probabilistic bilingual dictionaries generated by NATools from the FAPESP parallel corpus. In our experiments, content words are nouns and adjectives.
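For a two-word candidate, the four association measures above can be computed from corpus counts as sketched below. The mwetoolkit generalizes these formulas to longer n-grams, so this bigram version is only illustrative.

```python
import math

def association_measures(c12, c1, c2, n):
    """Bigram association measures from corpus counts:
    c12 = frequency of the candidate pair, c1/c2 = frequencies of the
    individual words, n = corpus size (number of tokens)."""
    expected = c1 * c2 / n                       # expected co-occurrence
    return {
        "norm_freq": c12 / n,                    # normalized frequency
        "t_score": (c12 - expected) / math.sqrt(c12),
        "pmi": math.log2(c12 * n / (c1 * c2)),   # pointwise mutual info
        "dice": 2 * c12 / (c1 + c2),             # Dice's coefficient
    }
```

Each measure grows when the pair co-occurs more often than its component frequencies would predict, which is why such scores help separate MWEs from free combinations.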

Promoter
The last module, the Promoter, analyses the MWE candidates and promotes those with the best scores to beliefs. Beliefs are candidates that were classified as true MWEs in a previous iteration of the learner.
The Promoter applies a classification model trained using Weka (Hall et al., 2009) as a wrapper and LibSVM (Chang and Lin, 2011) as its core. The result is a support vector machine that distinguishes true MWEs from ordinary noun phrases. As training data, it uses previously annotated instances: each Promoter is trained on examples that were already classified, manually for Promoter-0, and manually plus automatically for the Promoters built in subsequent iterations.
An SVM was chosen because it has shown good performance on diverse NLP tasks such as text categorization (Sassano, 2003), sentiment analysis (Mullen and Collier, 2004) and named entity recognition (Li et al., 2008), as well as on standard corpus-based MWE extraction (Farahmand and Martins, 2014).
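The promotion step itself can be sketched as follows; here `classify` stands in for the trained SVM, and the function shows only the decision logic, not the Weka/LibSVM machinery.

```python
def promote(classify, candidates):
    """Sketch of the Promoter's decision step: candidates the classifier
    labels as true MWEs become beliefs, and every labeled candidate
    (true or false) feeds the next generation's training set.

    classify:   any feature-vector -> bool predictor (an SVM in NEMWEL)
    candidates: dict mapping candidate expression -> feature vector
    """
    beliefs, new_training = [], []
    for expression, features in candidates.items():
        label = classify(features)
        new_training.append((features, label))
        if label:
            beliefs.append(expression)
    return beliefs, new_training
```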

Evaluation
An initial training corpus was generated from texts of the G1 news portal. From this corpus, NEMWEL extracted 1,100 candidate MWEs, which were manually annotated by two native speakers of Brazilian Portuguese: 600 candidates each, with an overlap of 100 candidates. The annotation interface showed the candidate and the sentences of the G1 corpus from which it was extracted (see Figure 6). The annotators made a binary choice as to whether the candidate was a true MWE ("Sim") or not ("Não"). Each annotator cross-checked the other's items. This cross-checking step was crucial because, even though some guidelines were provided, some cases were hard to decide and required discussion. In this first annotation, 19% of the candidates were evaluated as true MWEs. The kappa agreement (Cohen, 1960) was 0.85, which indicates very good agreement.
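The reported agreement is Cohen's kappa, which corrects the observed agreement for the agreement expected by chance; for binary labels it can be computed as:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' binary (0/1) judgments:
    (observed - expected) / (1 - expected), where 'expected' is the
    chance agreement implied by each annotator's rate of positives."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    p_a = sum(labels_a) / n          # annotator A's rate of "true MWE"
    p_b = sum(labels_b) / n          # annotator B's rate of "true MWE"
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)
```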
The annotated set was used to train our Promoter-0, as explained in Section 4.4. NEMWEL then ran for 15 iterations; every 5 iterations (a generation), a new Promoter was trained using the beliefs and false MWEs classified in the previous iterations. After these 15 iterations, a new sample of 1,200 MWE candidates was manually evaluated by the same two native speakers, this time with no overlap between the annotators. To allow the analysis of the learning curve over time, this sample contained 400 candidates extracted in each generation, of which each annotator judged half, that is, 600 candidates per annotator, 200 per generation.
From the 1,200 candidates, 15.6% were classified as true MWEs. The results are shown in Table 1 in terms of precision, recall, F-measure and accuracy, calculated from the true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN):

• Precision = TP / (TP + FP)
• Recall = TP / (TP + FN)
• F-measure = (2 × Precision × Recall) / (Precision + Recall)
• Accuracy = (TP + TN) / (TP + FP + TN + FN)

As we can notice from Table 1, precision rises 10 percentage points from the first to the last iteration, indicating that NEMWEL is capable of improving its learning performance, as expected of a never-ending learning system. The drop in recall from 65.5% to 52.3% between the second and third generations seems to be related to overfitting. Another possible explanation for this drop is that only the candidate MWEs annotated as true by both annotators were taken into account. Furthermore, since the dataset is unbalanced, the classifier may tend to always classify new candidates as non-MWEs. New experiments will be carried out to investigate this drop. Table 2 shows some examples of MWE candidates extracted by NEMWEL.

(The three Promoters were: (1) Promoter-0, trained with manually annotated data only, run from iteration 1 to 5; (2) Promoter-1, trained with manually annotated data and the true/false MWEs learned in the first generation, run from iteration 6 to 10; and (3) Promoter-2, trained with manually annotated data and the true/false MWEs learned in the first two generations, run from iteration 11 to 15.)
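The figures in Table 1 follow directly from the confusion-matrix counts; a small helper (illustrative only) is:

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Precision, recall, F-measure and accuracy from binary
    confusion-matrix counts, as reported in Table 1."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, f_measure, accuracy
```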

Conclusions
The results presented in this paper show that the never-ending learning approach can be applied to the automatic extraction of MWEs. Even after just a few iterations (15), it was already possible to see that NEMWEL improves its learning based on previously acquired knowledge, with an increase of 10 percentage points in precision.
The next steps of this work include running NEMWEL for a long period, ideally 24 hours a day, continuously. We also intend to expand NEMWEL to learn other types of MWEs, from other sources and in different languages, such as English, possibly following a multilingual extraction process. Finally, new features can be added, such as one testing the substitutability of an MWE candidate, i.e., whether the words that form the candidate resist replacement by synonyms. NEMWEL's source code and search interface will soon be available at: http://www.lalic.dc.ufscar.br/never-ending/.