Identification of Multiword Expressions for Latvian and Lithuanian: Hybrid Approach

We discuss an experiment on the automatic identification of bi-gram multi-word expressions (MWEs) in parallel Latvian and Lithuanian corpora. Because lexical resources and tools (e.g., POS tagger, parser) for these languages are scarce and of limited quality, we rely on raw corpora, lexical association measures (LAMs) and supervised machine learning (ML). Combining LAMs with ML has proved rather effective for other languages, and it yields good results for Lithuanian and Latvian as well: we achieved 92.4% precision and 52.2% recall for Latvian and 95.1% precision and 77.8% recall for Lithuanian.


Introduction
We explore the applicability of automatic detection of multi-word expressions (MWEs) in Latvian (LV) and Lithuanian (LT). Both languages belong to the Baltic language group and are synthetic (they favor morphologically complex words); thus simple statistical approaches to the identification of MWEs do not provide satisfactory results, as morphological richness leads to lexical sparseness. Representations such as bag of words ignore the variation of MWE components (Sharoff, 2004). The relatively free word order in both languages does not improve the situation. Lexical resources for complementing or replacing statistical approaches are limited. Exploiting the flexibility of MWEs and morpho-syntactic rules could make the detection of MWEs in Lithuanian easier. However, even most hybrid methods cannot be implemented in a straightforward manner due to the limited availability of lexical resources and tools, e.g., POS tagger, parser, etc.
Thus, detecting Latvian and Lithuanian MWEs by combining lexical association measures and machine learning could be a suitable approach in this situation. Machine learning allows various properties of text (lexical, morphological, syntactic, semantic, contextual, etc.) to be encoded in feature vectors associated with output classes, and it can identify complex non-linear relations. It thus permits capturing elaborate features in languages with complex morphology.

Combining LAMs and Supervised Machine Learning
The combination of lexical association measures (LAMs) and supervised machine learning algorithms has already been studied. Zilio et al. (2011) use it for the extraction and evaluation of MWEs from the English part of the Europarl Parallel Corpus, extracted from the proceedings of the European Parliament; Dubremetz and Nivre (2014) explore the extraction of nominal MWEs from the French part of the Europarl corpus using the same method. The performance of different combinations of LAMs is discussed in (Pecina and Schlesinger, 2006; Pecina, 2008a; Pecina, 2008b; Pecina, 2010). LAMs compute an association score for each collocation candidate, assessing the degree of connection between its components. Scores can be used for the extraction of collocation candidates, ranking and classification (rejecting candidates whose score falls below, or above, a threshold).
Different groups of collocations differ in their sensitivity to particular association measures, depending on their type. For collocations whose components co-occur more often than by chance, Log-likelihood ratio, χ² test, Odds ratio, Jaccard and Pointwise mutual information perform better, while for collocations occurring in different contexts than their components (the non-compositionality principle), J-S divergence, K-L divergence, Skew divergence and Cosine similarity in vector space are suggested (Pecina, 2008b). For discontinuous MWEs (with other words between the components of the MWE), Left context entropy and Right context entropy perform better (Pecina, 2008b).
Combining association measures, even a relatively small number of them, helps in the collocation extraction task (Pecina and Schlesinger, 2006; Pecina, 2008a; Pecina, 2010). However, there is no universally best combination of association measures, since the task of collocation extraction depends on the corpora, the language and the type/notion of MWEs.
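The use of a single association score for ranking and threshold-based classification can be illustrated with a minimal sketch. The function name, the candidate strings and the score values below are hypothetical and serve only as an illustration; they are not taken from the experiments.

```python
def rank_and_classify(scores, threshold):
    """Rank candidates by association score and label them by a threshold.

    scores: dict mapping a candidate to its association score.
    Returns (candidate, score, accepted) triples, best score first.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(cand, score, score >= threshold) for cand, score in ranked]


# Hypothetical scores for three placeholder candidates.
toy_scores = {"w1 w2": 7.5, "w3 w4": 1.2, "w5 w6": 4.0}
for candidate, score, accepted in rank_and_classify(toy_scores, threshold=3.0):
    print(candidate, score, accepted)
```

A supervised classifier replaces the single hand-set threshold with a decision learned over several scores at once.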

Experimental Setup
We use LAMs combined with supervised machine learning. LAMs are calculated using mwetoolkit (Ramisch, 2015), and WEKA (Hall et al., 2009) is used to train the selected classifiers on the LAM values.
In this paper we discuss experiments with bi-gram MWEs only. We plan to extend the definitions of LAMs to tri- and tetra-grams, which is not always straightforward, and to explore the LAMs+ML approach for longer MWEs in future research.
Candidate MWE bi-grams were extracted from the raw text with mwetoolkit: frequencies of individual words and bi-grams are counted, hapaxes are removed, and the values of 5 association measures (Maximum Likelihood Estimation, Dice's coefficient, Pointwise Mutual Information, Student's t score and Log-likelihood score) (Ramisch, 2015) are calculated. For each language, the results were evaluated against reference lists based on EuroVoc, the Multilingual Thesaurus of the European Union.
The results were evaluated against the reference list of bi-gram MWEs (converted to an ARFF file with the values true (MWE) and false (not MWE)) using WEKA. The selected algorithms (Naïve Bayes (John and Langley, 1995), OneR, a rule-based classifier (Holte, 1993), Bayesian Network (Su et al., 2008) and Random Forest (Breiman, 2001)) were applied for automatic identification of MWEs. Feature vectors were constructed from the LAM values of each MWE candidate and its presence in the reference list (true/false). The SMOTE filter, which re-samples a dataset by applying the Synthetic Minority Oversampling TEchnique (Chawla et al., 2002), and the Resample filter, which produces a random subsample of a dataset using sampling either with or without replacement (Hall et al., 2009), were used to deal with data sparseness.
Association measures and supervised machine learning algorithms were combined in 3 ways: (i) without any filter, (ii) with the SMOTE filter and (iii) with the Resample filter. All the models were tested using standard 10-fold cross-validation.
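For illustration, the five measures can be sketched from raw counts using their common textbook definitions. This is a sketch under stated assumptions, not the mwetoolkit implementation: the function name, the argument names, the choice of log base and the use of the total number of bi-gram tokens as the sample size are assumptions, and the exact formulas in the toolkit may differ.

```python
import math


def association_measures(c12, c1, c2, n):
    """Five association scores for a bi-gram (w1, w2).

    c12: bi-gram frequency (assumed >= 1; hapaxes are removed upstream),
    c1, c2: frequencies of w1 and w2,
    n: total number of bi-gram tokens (assumed sample size).
    """
    expected = c1 * c2 / n                  # expected count under independence
    mle = c12 / n                           # maximum likelihood estimate
    dice = 2 * c12 / (c1 + c2)              # Dice's coefficient
    pmi = math.log2(c12 / expected)         # pointwise mutual information
    t = (c12 - expected) / math.sqrt(c12)   # Student's t score

    # Log-likelihood ratio over the 2x2 contingency table:
    # cells are (w1, w2), (w1, not w2), (not w1, w2), (not w1, not w2).
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    exp = [c1 * c2 / n, c1 * (n - c2) / n,
           (n - c1) * c2 / n, (n - c1) * (n - c2) / n]
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(observed, exp) if o > 0)

    return {"mle": mle, "dice": dice, "pmi": pmi, "t": t, "ll": ll}
```

When the observed bi-gram count equals the expected count, PMI, the t score and the log-likelihood score are all zero, which gives a quick sanity check for such an implementation.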
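The conversion of feature vectors to an ARFF file can be sketched as follows. The attribute names, the relation name and the helper function are hypothetical; only the general ARFF layout (a header with attribute declarations followed by a data section) and the true/false class values follow the description above.

```python
def to_arff(rows, relation="mwe_candidates"):
    """Serialize feature vectors as ARFF text.

    rows: list of (lam_values, is_mwe) pairs, where lam_values is a dict
    holding the five association scores under the (assumed) keys below.
    """
    features = ["mle", "dice", "pmi", "t", "ll"]
    lines = [f"@RELATION {relation}", ""]
    lines += [f"@ATTRIBUTE {name} NUMERIC" for name in features]
    lines += ["@ATTRIBUTE class {true,false}", "", "@DATA"]
    for values, is_mwe in rows:
        cells = [repr(float(values[name])) for name in features]
        cells.append("true" if is_mwe else "false")  # presence in reference list
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"
```

Each data line then holds the five scores of one candidate followed by its class label.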
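The principle of 10-fold cross-validation can be sketched in a few lines. This is a plain, unstratified split written for illustration only; it is not WEKA's implementation, and the function name and the fixed random seed are assumptions.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal parts
    for i in range(k):
        test = folds[i]
        # All remaining folds form the training set.
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test
```

Every item is used for testing exactly once and for training in the remaining k - 1 rounds.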

One third of the Latvian and Lithuanian parts of the JRC-Acquis Multilingual Parallel Corpus (Steinberger et al., 2006), which contains the total body of European Union law applicable to its member states (selected texts written since the 1950s), was used, i.e., ∼9 million words for each language. Preprocessing consisted only of tokenization (one sentence per line) and lowercasing, because the goal is to get the best possible results without relying on special linguistic tools, e.g., POS tagger, parser.
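The preprocessing and candidate extraction steps can be sketched as follows. This is a simplified illustration, not the mwetoolkit procedure: whitespace tokenization, the function name and the toy sentences are assumptions.

```python
from collections import Counter


def extract_candidates(sentences, min_count=2):
    """Count words and adjacent bi-grams; drop hapax bi-grams.

    sentences: iterable of strings, one sentence per line.
    min_count=2 removes bi-grams that occur only once (hapaxes).
    """
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()        # lowercasing + naive tokenization
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    candidates = {bg: c for bg, c in bigrams.items() if c >= min_count}
    return unigrams, candidates


# Toy input for illustration only.
toy = ["Eiropas Savienība ir", "eiropas savienība un"]
words, cands = extract_candidates(toy)
print(cands)
```

The word and bi-gram counts obtained this way are the only input the association measures need.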

Results
We experimented with 736 (LT) and 772 (LV) MWEs from the reference list that are present in the corresponding corpus. See Figures 1 and 2 for results, and Table 1 for a summary of the experimental results (LAMs only, LAMs combined with supervised machine learning, and LAMs combined with supervised machine learning and filters). The reference list was based on EuroVoc, which mostly contains terms related to EU institutions; hence MWEs mostly fit into 3 categories: Noun + Noun, Adjective + Noun and Abbreviation or Acronym + Noun. However, as we used neither a POS tagger nor a parser (see the Introduction), detailed morpho-syntactic analysis is left for future work.
Using only the lexical association measures implemented in mwetoolkit and evaluating against the reference list, performance was low: R = 21.4% and 19.4%, P = 0.1% and 0.2%, and F1 = 0.3% and 0.2% for LV and LT, respectively. Almost any of the 558 772 (LV) and 587 406 (LT) candidate MWEs was identified as an MWE. Thus, association measures alone did not suffice for the successful extraction of MWEs for Latvian and Lithuanian.
Results show that combining LAMs with supervised ML improves extraction of MWEs for both languages.
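As a worked check, the F1 score is the harmonic mean of precision and recall. The short sketch below (the function name is ours) applies it to the precision and recall reported for the best configuration in each language.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


for language, p, r in [("Latvian", 92.4, 52.2), ("Lithuanian", 95.1, 77.8)]:
    print(f"{language}: F1 = {f1_score(p, r):.1f}%")
# Latvian: F1 = 66.7%
# Lithuanian: F1 = 85.6%
```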

Analysis of Misclassified MWE Candidates
The configuration LAMs + Random Forest + Resample performed best for both languages. However, some MWE candidates were misclassified; a more detailed analysis of the errors made by the Random Forest classifier follows.

False Positives
For Lithuanian, 22 unique items were misclassified as MWEs, and for Latvian, 31 (sampling was done with replacement, thus some items were repeated).

False Negatives
For Lithuanian, 132 unique items were misclassified as non-MWEs, and for Latvian, 336 (sampling was done with replacement, thus some items were repeated). False negatives belong to one of 2 groups of errors (see Table 3): (i) errors due to extremely low frequency (2-3); (ii) errors due to relatively low frequency (3-10). For most misclassified items in the extremely low frequency group, there were pairs of MWE candidates with the same LAM values (e.g., LT: vertikalusis susitarimas & valdybų susitarimas (vertical agreement & board agreement); LV: vispārējais budžets & vispārējais labums (general budget & overall benefit)). The low frequency group mostly had unique combinations of LAM values.
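Candidates that share identical LAM values are indistinguishable to any classifier trained on those values alone. A minimal sketch for finding such groups is given below; the function name, the rounding precision and the toy values are hypothetical and do not reproduce the experimental data.

```python
from collections import defaultdict


def group_by_lam_vector(candidates, digits=6):
    """Return groups of candidates whose LAM vectors coincide.

    candidates: dict mapping a candidate to a sequence of LAM values.
    Values are rounded to `digits` decimals before comparison.
    """
    groups = defaultdict(list)
    for candidate, values in candidates.items():
        key = tuple(round(v, digits) for v in values)
        groups[key].append(candidate)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical feature vectors for three placeholder candidates.
toy = {"a b": (0.1, 0.5, 3.2), "c d": (0.1, 0.5, 3.2), "e f": (0.2, 0.4, 1.0)}
print(group_by_lam_vector(toy))
```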
Results show that heavier filtering by frequency should be considered, e.g., filtering out candidates with < 20 occurrences (Evert, 2008). Besides frequency, other LAMs have to be taken into consideration as there is a possibility

Conclusions
We report our experiment on extracting bi-gram MWEs for Latvian and Lithuanian by combining lexical association measures and supervised machine learning. This method appears to be more effective for Lithuanian than for Latvian. All in all, using ML together with LAMs improved results: the best configuration, LAMs + Random Forest + Resample filter, achieved F1 = 66.7% for Latvian and F1 = 85.6% for Lithuanian. However, an exception was the second-best configuration, LAMs + OneR + SMOTE, where results for Latvian were slightly better (F1 = 23.4%) than for Lithuanian (F1 = 22.4%). Future plans include further analysis of low frequency MWEs, because low frequency was the reason for a significant number of errors. Exploration of other LAMs could help to deal with it and to correctly capture the complexities of Latvian and Lithuanian. Using EuroVoc is a poor man's solution: it resulted in a high number of False Positives, which seem to be good candidates for MWEs. Of course, it would also be interesting to move from bi-grams to tri- and tetra-grams.