Weighted Set-Theoretic Alignment of Comparable Sentences

This article presents the STACCw system for the BUCC 2017 shared task on parallel sentence extraction from comparable corpora. The original STACC approach, based on set-theoretic operations over bags of words, had previously been shown to be efficient and portable across domains and alignment scenarios. We describe an extension of this approach with a new weighting scheme and show that it provides significant improvements on the datasets provided for the shared task.


Introduction
Parallel corpora are an essential resource for the development of multilingual natural language processing applications, in particular statistical and neural machine translation (Brown et al., 1990; Bahdanau et al., 2014). Since the professional translations that are necessary to build quality bitexts are expensive and time-consuming, the exploitation of monolingual corpora that address similar topics, known as comparable corpora, has been extensively explored in the last two decades (Munteanu and Marcu, 2005; Sharoff et al., 2016).
A critical part of the process when building parallel resources from comparable data is the alignment of sentences in monolingual corpora. Over the years, several methods have been developed and evaluated for this task, including maximum likelihood (Zhao and Vogel, 2002), suffix trees (Munteanu and Marcu, 2002), binary classification (Munteanu and Marcu, 2005), cosine similarity (Fung and Cheung, 2004), reference metrics over statistical machine translations (Abdul-Rauf and Schwenk, 2009; Sarikaya et al., 2009), and feature-based approaches (Stefȃnescu et al., 2012; Smith et al., 2010), among others.
For comparable sentence alignment, we followed the STACC approach, which is based on seed lexical translations, simple set expansion operations and the Jaccard similarity coefficient (Jaccard, 1901). This method has been shown to outperform state-of-the-art alternatives on a large range of alignment tasks and provides a simple yet effective procedure that can be applied across domains and corpora with minimal adaptation and deployment costs.
In this paper, we describe STACCw, an extension of the approach with a word weighting scheme, and show that it provides significant improvements on the datasets provided for the BUCC 2017 shared task, while maintaining the portability of the original approach.

STACC
STACC is an approach to sentence similarity based on expanded lexical sets and Jaccard similarity, whose main goal is to provide a portable and efficient alignment mechanism for comparable sentences. The similarity score is computed as follows.
Let s_i and s_j be two tokenised and truecased sentences in languages l_1 and l_2, respectively, S_i the set of tokens in s_i, S_j the set of tokens in s_j, T_ij the set of lexical translations into l_2 for all tokens in S_i, and T_ji the set of lexical translations into l_1 for all tokens in S_j.
Lexical translations are initially computed from sentences s_i and s_j by retaining the k-best translations for each word, if any, as determined by IBM models.1 Lexical translations are selected according to the ranking provided by the precomputed lexical probabilities, without using the actual probability values in the computation of similarity. The sets T_ij and T_ji that comprise the k-best lexical translations are then expanded by means of two operations:
1. For each element in the set difference T_ij − S_j (respectively T_ji − S_i), and each element in S_j (respectively S_i), if both elements share a common prefix of length greater than n characters, the prefix is added to both sets. This longest common prefix matching strategy is meant to capture morphological variation with minimal computation.
2. Numbers and capitalised truecased tokens not found in the translation tables are added to the expanded translation sets. This operation addresses named entities, which are strong indicators of potential alignment given their low relative frequency and are likely to be missing from translation tables trained on different domains.
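The two expansion operations above can be sketched as follows. This is an illustrative reimplementation, not the authors' code; the function names, the default threshold and the capitalisation/digit heuristics are assumptions based on the description.

```python
def longest_common_prefix(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def expand_sets(t_ij, s_j, min_prefix=4):
    """Operation 1: for tokens in T_ij - S_j, find tokens in S_j sharing a
    common prefix longer than min_prefix characters, and add that prefix
    to both sets (collected first, so no set is mutated while iterated)."""
    t_ij, s_j = set(t_ij), set(s_j)
    added = set()
    for t in t_ij - s_j:
        for s in s_j:
            n = longest_common_prefix(t, s)
            if n > min_prefix:
                added.add(t[:n])
    return t_ij | added, s_j | added

def add_named_entities(t_ij, source_tokens, vocab):
    """Operation 2: copy numbers and capitalised truecased tokens missing
    from the translation-table vocabulary into the expanded set."""
    t_ij = set(t_ij)
    for tok in source_tokens:
        if tok not in vocab and (tok[:1].isupper() or tok.isdigit()):
            t_ij.add(tok)
    return t_ij
```

With min_prefix=4, for example, "translations" and "translated" share the prefix "translat" (8 characters), which would be added to both sets.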
No additional operations are performed on the created sets, and in particular no filtering is applied: punctuation and function words are kept alongside content words in the final sets. With source and target sets as defined here, the STACC similarity score is then computed as in Equation 1:

sim(s_i, s_j) = ( |T_ij ∩ S_j| / |T_ij ∪ S_j| + |T_ji ∩ S_i| / |T_ji ∪ S_i| ) / 2    (1)

Similarity is thus defined as the average of the Jaccard similarity coefficients obtained between sentence token sets and expanded lexical translations in both directions.
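The bidirectional averaging of Jaccard coefficients can be written as a short sketch; function names here are illustrative, not taken from the original system.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two token sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def stacc_score(t_ij, s_j, t_ji, s_i):
    """STACC similarity: the average of the two directional Jaccard
    coefficients between expanded translation sets and token sets."""
    return (jaccard(t_ij, s_j) + jaccard(t_ji, s_i)) / 2
```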
For scenarios where the alignment space is large, target sentences are first indexed using the Lucene search engine2 and retrieved by building a query over the expanded translation sets created from each source sentence. This strategy drastically reduces the computational load, at the cost of missing some correct alignment pairs. In this mode, one of the two corpora is set as source and the other as target, retrieving n target alignment candidates for each source sentence. Similarity is computed over all candidates, and a final optimisation process is applied that enforces 1-1 alignments, which has been shown to improve the quality of the alignments.

2 https://lucene.apache.org
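The paper does not detail the 1-1 optimisation step; a simple greedy competitive-linking procedure over the scored candidates, sketched below, is one common stand-in and should not be taken as the system's actual method.

```python
def greedy_one_to_one(scored_pairs, threshold=0.0):
    """Greedy competitive linking: visit candidate pairs in descending
    score order, keeping each source and each target sentence at most
    once, and discarding pairs below the similarity threshold."""
    aligned, used_src, used_tgt = [], set(), set()
    for src, tgt, score in sorted(scored_pairs, key=lambda p: -p[2]):
        if score >= threshold and src not in used_src and tgt not in used_tgt:
            aligned.append((src, tgt, score))
            used_src.add(src)
            used_tgt.add(tgt)
    return aligned
```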

Weighted STACC
Although STACC has been shown to outperform competing state-of-the-art approaches on a variety of domains and scenarios, it ignores lexical weights and thus assigns equal importance to open-class and function words. While it makes intuitive sense to assign different weights according to the information provided by each word, adequate lexical weighting for a given task is not straightforward. Standard approaches such as TF-IDF often need to be complemented with stop word lists, which can be large and difficult to determine in agglutinative languages, for instance. Term-based approaches in general might assign weights that are too unbalanced for the task at hand, and termhood might depend on building accurate contrastive generic corpora (Gelbukh et al., 2010).
We follow the empirical approach of Mikolov et al. (2013), where the imbalance between frequent and rare words is controlled by a subsampling formula with two variables: an empirically determined threshold and word frequency. Experiments with their exact weighting scheme did not, however, provide optimal results for our alignment goals. We opted instead to compute lexical weights according to Equation 2, where f(w_i) is the relative frequency of word w_i and α is a parameter controlling the smoothness of the curve.
Among the methods we tested empirically, this function has properties that fit the original STACC approach rather well. First, since it is bounded between zero and one, it preserves the idea that set membership is a fruitful factor for computing similarity. Secondly, it assigns weights close to 1 for most open-class words while not completely discarding function words,3 a feature which has provided optimal results in our experiments.
Weighting is computed on each monolingual corpus to be aligned, thus removing any dependence on defining contrastive generic corpora. STACCw similarity is then computed according to the previously defined equation, except that set membership values of 1 in the original approach are replaced with lexical weights.
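Replacing set membership with lexical weights can be sketched as below. Note that the paper's actual weighting function (Equation 2) is not reproduced in this text; the exp(−α·f(w)) form used here is only an illustrative stand-in that shares the stated properties (bounded in (0, 1], close to 1 for rare open-class words, small but non-zero for frequent function words, smoothness controlled by α).

```python
import math

def lexical_weight(rel_freq, alpha=100.0):
    """Illustrative weighting function, NOT the paper's Equation 2:
    bounded in (0, 1], near 1 for rare words, small but non-zero
    for frequent words; alpha controls the smoothness of the curve."""
    return math.exp(-alpha * rel_freq)

def weighted_jaccard(a, b, weights):
    """Jaccard coefficient where each token contributes its lexical
    weight instead of a membership value of 1 (default weight 1.0
    for tokens without a precomputed weight)."""
    a, b = set(a), set(b)
    inter = sum(weights.get(t, 1.0) for t in a & b)
    union = sum(weights.get(t, 1.0) for t in a | b)
    return inter / union if union else 0.0

def staccw_score(t_ij, s_j, t_ji, s_i, weights):
    """STACCw similarity: bidirectional average of weighted Jaccard."""
    return (weighted_jaccard(t_ij, s_j, weights)
            + weighted_jaccard(t_ji, s_i, weights)) / 2
```

For instance, with a weight of 0.1 on "the", the pair {"the", "cat"} vs {"the", "chat"} scores 0.1/2.1 instead of the unweighted 1/3, reducing the influence of the shared function word.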

BUCC 2017 Shared Task
The BUCC 2017 shared task on parallel sentence extraction from comparable corpora4 consists in identifying translation pairs within two sentence-split monolingual corpora. It involves four language pairs, from which we selected French-English and German-English for our participation. The organisers provided three datasets for each language pair; their statistics for the two selected pairs are shown in Table 1, and gold reference pairs were provided for the training and sample sets.
Note that the statistics shown here differ slightly from those of the original data provided by the organisers, as we removed the bilingual duplicates that were found.5

Experimental Settings
Both STACC and STACCw require lexical translation tables to compute similarity; these are the only external source of information needed in this approach. In previous work, GIZA tables had been created from the JRC corpora only. In order to extend lexical coverage, we opted for a different approach and created generic translation tables from varied corpora.
In each corpus, parallel sentence pairs were first sorted by increasing perplexity scores according to language models trained on the monolingual side of each parallel corpus, where the score was taken to be the mean of source and target perplexities. A portion of each corpus was then selected to compose the final corpus, with an upper selection bound taken to be either the median of the averaged perplexity scores or the top n pairs if selecting up to the median would result in over-representing the corpus. Table 2 describes the number of sentence pairs selected for each language pair, the lexical translation tables being extracted from the GENERIC datasets.6

Regarding hyper-parameters, k-best lexical translations were limited to a maximum of 4 and the minimal prefix length for longest common prefix matching was set to 4. Lucene indexing was based on words with a length of 4 or more characters, and a maximum of 100 candidates was retrieved for each source sentence. For each language pair, English was set as the target language. We experimented with different values of α to control the smoothness of the weighting function and different values of the alignment threshold th used to discard low-confidence alignments.

4 https://comparable.limsi.fr/bucc2017/bucc2017task.html
5 There were 7 and 1 duplicates in the train and sample sets, respectively, for DE-EN, and 6 in the FR-EN train set.
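The perplexity-based corpus selection can be sketched as follows. This is a simplified illustration under stated assumptions: perplexity scoring with language models is presumed to happen elsewhere, and the function name and top_n fallback are hypothetical.

```python
def select_low_perplexity(pairs, ppl_src, ppl_tgt, top_n=None):
    """Sort parallel pairs by the mean of source and target perplexity
    and keep those up to the median score, or only the top_n best pairs
    when that bound would over-represent the corpus."""
    scored = sorted(zip(pairs, ppl_src, ppl_tgt),
                    key=lambda x: (x[1] + x[2]) / 2)
    means = [(s + t) / 2 for _, s, t in scored]
    median = means[len(means) // 2]
    kept = [p for p, s, t in scored if (s + t) / 2 <= median]
    if top_n is not None and top_n < len(kept):
        kept = [p for p, _, _ in scored[:top_n]]
    return kept
```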
Since up to three different runs could be submitted for the task, we prepared three variants of the system, where parameters α and th were set according to the best f-measure, precision and recall scores, respectively, obtained on the training set.7 Each of these variants was submitted to the task, in order to evaluate the behaviour of our system when targeting precision, recall and f-measure.
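Selecting one (α, th) pair per target metric from a grid of training-set evaluations amounts to a per-metric argmax; a minimal sketch, with a hypothetical tuple layout for the evaluation results:

```python
def pick_parameters(runs):
    """runs: list of (alpha, th, precision, recall, f1) results from a
    parameter sweep on the training set. Returns, for each metric, the
    (alpha, th) pair that maximises it, one per submitted variant."""
    return {
        "precision": max(runs, key=lambda r: r[2])[:2],
        "recall": max(runs, key=lambda r: r[3])[:2],
        "f-measure": max(runs, key=lambda r: r[4])[:2],
    }
```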
Although not submitted to the shared task, the original STACC method was also evaluated on the train and sample sets.

Results
Results on all datasets are shown in Tables 3 and 4, along with the parameters used for each dataset and the percentage of correct candidates retrieved via Lucene indexing and search. On the test sets, our system competed with four other systems in FR-EN, and our three submitted variants obtained the best results on all three metrics; for DE-EN, there were no other competing systems. Given the nature of the evaluation, where not all gold parallel sentences are known, pairs identified as false positives may actually be correct alignments.8 The results shown here are therefore minimum values, and the already high scores achieved by our approach are thus quite satisfactory.
Overall, STACCw improves significantly over its non-weighted variant, with gains of around 10 points in f-measure on the training and sample sets. On the smaller sample sets, the accuracy of the alignments was naturally higher, with minimum f-measure scores above the 90% mark.
As expected, each variant of the system performed best on the measure it was meant to optimise via its parameter settings.

Conclusion
We described STACCw, a weighted set-theoretic alignment method to extract parallel sentences from comparable corpora, which was the top-ranked system in the BUCC 2017 shared task on the datasets where it competed with other systems and achieved high minimum-value scores across the board. Our approach features generic lexical translation tables, Jaccard similarity over simple expanded translation sets and a generic word weighting scheme. This method improved significantly over the previous non-weighted approach on the provided training and sample datasets, while maintaining its main goals of portability, efficiency and ease of deployment.