Discovering Light Verb Constructions and their Translations from Parallel Corpora without Word Alignment

We propose a method for joint unsupervised discovery of multiword expressions (MWEs) and their translations from parallel corpora. First, we apply independent monolingual MWE extraction in source and target languages simultaneously. Then, we calculate translation probability, association score and distributional similarity of co-occurring pairs. Finally, we rank all translations of a given MWE using a linear combination of these features. Preliminary experiments on light verb constructions show promising results.


Introduction
The automatic discovery of multiword expressions (MWEs) has been a topic of interest in the computational linguistics community for a while (Choueka, 1988; Church and Hanks, 1990). In the last 20 years, multilingual discovery of MWEs has gained some popularity thanks to the widespread use of statistical machine translation (MT), automatic word alignment tools and freely available parallel corpora (Zarrieß and Kuhn, 2009; Attia et al., 2010; Caseli et al., 2010). MWEs tend to be non-compositional or show some kind of lexicosyntactic inflexibility, which is often reflected in translation asymmetries (Manning and Schütze, 1999). Therefore, parallel corpora are rich resources to mine for MWEs. Techniques adapted from machine translation can help to exploit translation information for the specific needs of MWE discovery.
Parallel corpora can be useful for MWE discovery in many ways. First, a second (target) language can be used to model features, which in turn help in the discovery of new MWEs in a single (source) language (Salehi and Cook, 2013; Caseli et al., 2010; Tsvetkov and Wintner, 2014). Second, one can also use parallel data to discover the translations of known multiword lexical units (Morin and Daille, 2010). Finally, it is possible to perform both simultaneously, generating a bilingual lexicon of MWEs and their potential translations from the parallel corpus, as proposed in this paper.
The goal of our paper is to propose a new method for unsupervised joint discovery of MWEs and their translations. It consists in discovering potential MWEs in source and target texts independently, and then trying to match them without using automatic word alignment. It is important to emphasize that we are not against the use of word alignment for this task, but we are interested in seeing how the automatic discovery of MWEs can be performed without relying on this information. Moreover, our experiments focus on light verb constructions such as to make a presentation and to take a walk, which generally contain non-adjacent tokens and thus would probably not be captured by standard word alignment methods. We study several features to rank automatically extracted candidates that could be translations of each other. We show preliminary results that indicate this approach is promising and point towards future improvements.

Related Work
Multilingual resources in general can be used for MWE discovery. Attia et al. (2010), for instance, do not rely on parallel texts but on short Wikipedia page titles, cross-linked across multiple languages. They consider that, if a page title has a cross-lingual link to a page whose title is a single word (in any available language), then the original page title is probably an MWE. Similarly, translation links in Wiktionary can be exploited, among other features, for predicting the compositionality of MWEs (Salehi et al., 2014a).
Another possibility to model non-translatability without resorting to parallel corpora consists in building artificial word-for-word MWE translations using bilingual single-word dictionaries. Afterwards, the existence of these automatically generated potential translations can be assessed in large monolingual corpora (Morin and Daille, 2010). This can be used as a feature, among other sources of information, in supervised or semi-supervised monolingual MWE discovery (Tsvetkov and Wintner, 2011; Rondon et al., 2015). Bilingual dictionaries can also be used to predict the compositionality of MWEs by estimating the string similarity (Salehi and Cook, 2013) or distributional similarity (Salehi et al., 2014b) between translations of an MWE and of the single words it contains.
Melamed (1997) describes one of the earliest attempts to extract MWEs from parallel corpora. The method is based on lexical alignment and mutual information. Statistical lexical alignment can provide straightforward MWE candidates, which can be further filtered using POS patterns and association scores. If two or more words in a source language are aligned to the same word on the target side, the source is likely an MWE (Caseli et al., 2010). Conversely, one can assume that some types of MWEs such as verb-noun combinations tend to be translated as MWEs with the same syntactic structure, using aligned dependency-parsed corpora for discovery (Zarrieß and Kuhn, 2009). Instead of focusing on 1-to-many alignments, Tsvetkov and Wintner (2010) propose a method which incrementally removes from parallel sentences word pairs that are surely not MWEs. To do so, they use bilingual dictionaries and alignment reliability scores. The remaining units are considered candidate MWEs.
Bilingual lexicons containing MWEs are important resources for MT systems. It has been shown that the presence of MWEs can harm the quality of both statistical (Ramisch et al., 2013) and rule-based (Barreiro et al., 2014) MT systems. Simple techniques for taking MWEs into account, such as binary features (Carpuat and Diab, 2010) and special token markers (Cap et al., 2015), can help improve translation quality. However, this may not suffice if the expressions are not correctly identified with the help of bilingual MWE lexicons.

Bilingual MWE Lexicon Creation
Most existing methods exploit parallel corpora to discover MWEs in a single language. They use translation information, among other sources, to confirm the idiosyncratic behaviour of the MWE in the source language, but do not output possible translations as a result of the discovery algorithm. In this section, we propose a method to create probabilistic bilingual MWE dictionaries using minimal supervision.
First, we extract MWE candidates from preprocessed (POS-tagged and lemmatized) source and target texts separately. In our experiments, the texts were preprocessed by TreeTagger (Schmid, 1994). We explicitly configured it not to segment sentences, since we need to preserve the alignment between source and target sentences in our input parallel corpus.
To allow the extraction of these monolingual MWE candidates, it is necessary to manually define POS patterns in both languages. This step requires some knowledge about the languages and about the syntactic patterns of the MWEs that we want to extract. These patterns were defined using the mwetoolkit corpus query language and candidate extraction tools (Ramisch, 2015). At this first stage, we focused on MWEs translated into MWEs, but we believe that the technique could be adapted to MWEs translated into single words. For instance, one could extract verbal MWEs from the source corpus and try to match them with single-word verbs in the target language. In theory, any monolingual MWE discovery approach could be used to obtain candidates on each side of the parallel corpus independently.
The process described above outputs two sets of candidates. The first set $S = \{s_1, s_2, \ldots, s_{|S|}\}$ contains MWE candidates $s_i$ extracted from the source corpus. The second set $T = \{t_1, t_2, \ldots, t_{|T|}\}$ contains MWE candidates $t_j$ extracted from the target corpus. Then, we try to map source MWEs $s_i$ to their target correspondences $t_j$. To do so, we calculate the conditional probability of each potential translation $t_j$ in $T$ given a source $s_i$:

$$p(t_j \mid s_i) = \frac{c(s_i, t_j)}{c(s_i)}$$

Here, $c(s_i, t_j)$ is the number of times a source candidate $s_i$ was found in a sentence whose translation contained $t_j$, and $c(s_i)$ is simply the number of occurrences of the candidate in the source corpus. Since candidates $s_i$ and $t_j$ can be discontinuous, their numbers of occurrences are not necessarily n-gram counts, but must be obtained during monolingual candidate discovery as output by the mwetoolkit.
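The conditional probability described above can be sketched in a few lines of Python. This is a minimal illustration, not the mwetoolkit implementation: the function name `translation_probabilities` and the input format (one pair of candidate lists per aligned sentence pair) are assumptions made for the example, as is the choice to count each occurrence of a source candidate once per distinct target candidate in the aligned sentence.

```python
from collections import Counter

def translation_probabilities(aligned_candidates):
    """Estimate p(t | s) = c(s, t) / c(s) from sentence-aligned candidates.

    aligned_candidates: iterable of (source_cands, target_cands), one pair per
    aligned sentence pair; each element is a list of candidate strings
    (hypothetical input format, assumed for this sketch).
    """
    source_counts = Counter()  # c(s): occurrences of s in the source corpus
    pair_counts = Counter()    # c(s, t): occurrences of s whose translation contains t
    for source_cands, target_cands in aligned_candidates:
        occurrences = Counter(source_cands)
        source_counts.update(occurrences)
        for s, n_occ in occurrences.items():
            # Each distinct target candidate is counted once per occurrence of s.
            for t in set(target_cands):
                pair_counts[(s, t)] += n_occ
    return {(s, t): c / source_counts[s] for (s, t), c in pair_counts.items()}

# Toy data, invented for illustration only.
toy_corpus = [
    (["ficar doente"], ["become sick"]),
    (["ficar doente"], ["be normal", "become sick"]),
    (["ficar pronto"], ["get ready"]),
]
probs = translation_probabilities(toy_corpus)
```

No word alignment is involved: only sentence-level co-occurrence of independently extracted candidates is used, which is why spurious pairs can receive high probabilities when a source candidate is rare.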
Another measure that we use to rank translations is the t-score. This association score estimates to what extent the co-occurrence of a group of words is outstanding compared to random chance co-occurrence. For each target candidate $t_j = w_1 \ldots w_n$, we estimate its expected count $E(t_j)$ by multiplying the relative frequencies $c(w_k)/N$ of its words and then scaling this joint probability by the total number of tokens in the target corpus $N$:

$$E(t_j) = N \prod_{k=1}^{n} \frac{c(w_k)}{N}$$

The t-score, also obtained using the mwetoolkit, is the difference between observed and expected counts normalized by an estimate of the standard deviation of the distribution:

$$ts(t_j) = \frac{c(t_j) - E(t_j)}{\sqrt{c(t_j)}}$$

Finally, we calculate the multilingual distributional similarity between pairs $s_i$ and $t_j$. This score is based on a pre-trained vector space model which uses sentence alignment information to ensure that words that are translations of each other end up being close in the resulting semantic space. Since each unit $s_i$ and $t_j$ is composed of $m$ and $n$ words, respectively, we use the average cosine similarity between all possible $m \times n$ source-target pairs present in the semantic space, where $v(w)$ denotes the vector of word $w$:

$$sim(s_i, t_j) = \frac{1}{m \times n} \sum_{k=1}^{m} \sum_{l=1}^{n} \cos\left(v(w^{s_i}_k), v(w^{t_j}_l)\right)$$

The normalization factor may be less than $m \times n$ when some pairs $w^{s_i}_k, w^{t_j}_l$ do not occur in the semantic space. The bilingual semantic space is obtained using MultiVec (Bérard et al., 2016), available at https://github.com/eske/multivec. Distributional similarity between source and target candidate words is obtained using the bag of words mode.

The three scores are normalized so that their values fall between 0 and 1. The final score F is simply a log-linear combination of these scores. The lower its value, the more likely a given pair of source and target MWEs is to be a valid translation pair.
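The three scores can be illustrated with a short Python sketch. Several points are assumptions rather than the authors' implementation: the t-score follows the standard definition (observed minus expected count, divided by the square root of the observed count); the `vectors` dictionary is a toy stand-in for a bilingual embedding space and does not reflect the MultiVec API; and `combined_score` assumes an unweighted negative sum of logarithms, which is one possible log-linear combination consistent with "lower is better", since the exact formula is not recoverable from the text. The normalization of the scores to the interval from 0 to 1 is left out.

```python
import math

def t_score(observed, word_counts, n_tokens):
    """t-score of a candidate from its observed count and its words' counts."""
    expected = n_tokens
    for count in word_counts:
        expected *= count / n_tokens  # independence assumption
    return (observed - expected) / math.sqrt(observed)

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def average_similarity(source_words, target_words, vectors):
    """Average cosine over all source-target word pairs found in `vectors`."""
    sims = [
        cosine(vectors[ws], vectors[wt])
        for ws in source_words
        for wt in target_words
        if ws in vectors and wt in vectors  # skip pairs missing from the space
    ]
    return sum(sims) / len(sims) if sims else 0.0

def combined_score(prob, tscore, sim, eps=1e-9):
    """ASSUMED combination: negative sum of logs of scores already in [0, 1]."""
    return -sum(math.log(max(x, eps)) for x in (prob, tscore, sim))

# Toy two-dimensional vectors, invented for illustration only.
vectors = {"ficar": [1.0, 0.0], "get": [1.0, 0.0], "ready": [0.0, 1.0]}
sim = average_similarity(["ficar", "pronto"], ["get", "ready"], vectors)
```

Because "pronto" is absent from the toy space, only two of the four word pairs contribute to the average, which mirrors the remark that the normalization factor may be smaller than the number of possible pairs. The `eps` floor is a practical guard against taking the logarithm of zero and is likewise an assumption.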

Experimental Setup
For this work, the pre-processed texts (POS-tagged source and target texts) were obtained from the FAPESP parallel corpus containing 166,719 aligned sentences of Brazilian Portuguese texts translated into English (Aziz and Specia, 2011). The source corpus contains 4,191,942 tokens and the target corpus contains 4,499,064 tokens. Our experiments employ manually defined patterns for the monolingual step. These patterns target light-verb constructions in Portuguese and some possible translations into English:
GET+ADJ The first pattern consists of the Portuguese verb ficar (to become) immediately followed by an adjective. This frequent construction often indicates a change of state (inchoative). On the target side (English), we build a similar pattern consisting of the verbs to be/become/get followed by an adjective, which we assume to be frequent translations of the source construction.
MAKE+N This pattern is formed by the verb realizar (to make) followed by a noun. Between the verb and the noun there can be any number of adjectives, adverbs or determiners, which are ignored in the extracted candidate. For the translation, we build an equivalent pattern with the verbs to make/carry, the latter included due to the high frequency of carry out in the target corpus.
TAKE+N This pattern is formed by the verbs fazer/tomar/dar (to make/take/give) followed by a noun. We allow intervening elements as for MAKE+N. In English, we use the verbs to make/do/take. Note that the verb to give was considered an unlikely translation and disregarded.
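To make the three patterns concrete, the following Python sketch matches a light verb followed by a head word over a lemmatized, POS-tagged sentence, optionally skipping intervening adjectives, adverbs and determiners. It is only an illustration: the patterns in the experiments were written in the mwetoolkit query language, whereas the function `extract_candidates`, the token format and the simplified tags (`V`, `N`, `ADJ`, `ADV`, `DET`) are hypothetical and do not correspond to the TreeTagger tagsets.

```python
def extract_candidates(tokens, verb_lemmas, head_tag="N",
                       skip_tags=("ADJ", "ADV", "DET")):
    """Return (verb, head) lemma pairs matching verb + skipped words + head.

    tokens: list of (lemma, pos) tuples for one sentence (assumed format).
    verb_lemmas: set of light-verb lemmas accepted by the pattern.
    head_tag: POS tag of the head word (noun or adjective).
    skip_tags: POS tags allowed between verb and head; they are ignored
    in the extracted candidate.
    """
    candidates = []
    for i, (lemma, pos) in enumerate(tokens):
        if pos != "V" or lemma not in verb_lemmas:
            continue
        j = i + 1
        while j < len(tokens) and tokens[j][1] in skip_tags:
            j += 1  # skip intervening material
        if j < len(tokens) and tokens[j][1] == head_tag:
            candidates.append((lemma, tokens[j][0]))
    return candidates

# MAKE+N on a toy Portuguese sentence (invented example).
pt = [("eles", "PRO"), ("realizar", "V"), ("um", "DET"),
      ("novo", "ADJ"), ("teste", "N")]
make_n = extract_candidates(pt, {"realizar"})

# GET+ADJ requires strict adjacency, so nothing may be skipped.
get_adj = extract_candidates([("ficar", "V"), ("doente", "ADJ")],
                             {"ficar"}, head_tag="ADJ", skip_tags=())
```

Running the same function with a different verb set on the English side yields the target candidates, so that source and target extraction remain fully independent, as in the approach described above.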

Preliminary Results
As mentioned in Section 3, we used the mwetoolkit to apply the patterns and calculate t-scores, and MultiVec for bilingual similarity. A quantitative evaluation has not yet been performed. Nonetheless, in this section, we present some examples of discovered MWEs along with their translations. We point out positive and negative results in this small sample that give us an idea of our approach's potential.
Table 1 shows ranked examples extracted from the source and target corpus for the first pattern. The entries are ranked by final score: more likely translations appear at the top of the table and the correct ones are in bold. According to these examples, the MWE pairs with the lowest scores are correctly aligned to a valid translation. In addition to the final score (F), target t-score (ts T) and similarity (Sim), the table also shows how many times the source MWE co-occurred with the target MWE (# T). This information allows us to calculate the conditional probability.
It is important to point out that our approach does not work for all cases, as some spurious pairs also occur. For example, in the first half of Table 1, become sick is indeed a possible translation for ficar doente, but it is ranked worse than be normal, which is not a possible translation. Beyond the conditional probability, distributional similarity and t-score seem to help in some cases. For instance, get ready appears only once as a translation of ficar pronto, but it still gets a better score than be capable, a wrong translation with higher conditional probability. In general, we have observed that the pattern GET+ADJ is quite "easy" to translate, as these constructions show a high degree of regularity.
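Table 2 shows the results of the extraction for MAKE+N. The results for realizar teste show that the best-ranked MWEs are the correct translations. The last row of this table shows a drawback of our approach: it is not possible to obtain reliable probability scores when the pattern appears just once.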
The results in Table 3 show the extraction for the last pattern, TAKE+N. Although the first half of this table presents good results for do comparison and make comparison, the second half shows that some patterns do not work for the target side. The verb dar in Portuguese is a productive light verb, especially when combined with participles (dar uma caminhada/corrida/passeada, lit. to give a walk/run/stroll). On the other hand, the translations usually involve a single verb and not a light-verb construction. This indicates that further error analysis is required, studying the three verbs in this pattern separately.

Conclusions and Future Work
This paper constitutes our first proposal towards automatic discovery of bilingual MWE lexicons. While preliminary results are promising, the obvious next step is to design an evaluation protocol and apply it. With this goal set, the idea is to first test the approach with other patterns and then carry out a robust evaluation. We would also like to extend this method to other language pairs and MWE categories, especially MWEs translated as single words. For this case, we are still investigating solutions; one of them consists in using monolingual word embeddings and similarity measures to decide whether the translation should be an MWE or a single word.
We believe that the method itself can be improved in many ways. For instance, we would like to design a distributional similarity measure able to focus on valid alignments. We would also like to experiment with different weights for the scores (e.g. similarity seems more important than t-score). Optimizing, that is, learning these weights from small amounts of supervised data, sounds appealing as well.
At the moment, the extraction patterns represent a bottleneck and bias the obtained results towards more plausible translations. We would like to find a way to get rid of them, especially on the target side. Finally, since we do not rule out using word alignment in the future, we would like to perform a systematic quantitative comparison with related work and methods based on word alignment.

Table 2: Pattern MAKE+N: realizar teste/substituição (make test/replacement). Correct pairs are in bold.