Vecalign: Improved Sentence Alignment in Linear Time and Space

We introduce Vecalign, a novel bilingual sentence alignment method which is linear in time and space with respect to the number of sentences being aligned and which requires only bilingual sentence embeddings. On a standard German–French test set, Vecalign outperforms the previous state-of-the-art method (which has quadratic time complexity and requires a machine translation system) by 5 F1 points. It substantially outperforms the popular Hunalign toolkit at recovering Bible verse alignments in medium- to low-resource language pairs, and it improves downstream MT quality by 1.7 and 1.6 BLEU in Sinhala-English and Nepali-English, respectively, compared to the Hunalign-based Paracrawl pipeline.


Introduction
Sentence alignment is the task of taking parallel documents, which have been split into sentences, and finding a bipartite graph which matches minimal groups of sentences that are translations of each other (see Figure 1). Following prior work, we assume non-crossing alignments but allow local sentence reordering within an alignment.
Sentence-aligned bitext is used to train nearly all machine translation (MT) systems. Alignment errors have been noted to have a small effect on statistical MT performance (Goutte et al., 2012). However, misaligned sentences have been shown to be much more detrimental to neural MT (NMT) (Khayrallah and Koehn, 2018).
Sentence alignment was a popular research topic in the early days of statistical MT, but received less attention once standard sentence-aligned parallel corpora became available. Interest in low-resource MT has led to a resurgence in data gathering methods (Buck and Koehn, 2016; Zweigenbaum et al., 2018), but we find limited recent work on bilingual sentence alignment. Automatic sentence alignment can be roughly decomposed into two parts:

1. A scoring function which takes one or more adjacent source sentences and one or more adjacent target sentences and returns a score indicating the likelihood that they are translations of each other;
2. An alignment algorithm which, using the scoring function above, takes in two documents and returns a hypothesis alignment.

We improve both parts, presenting (1) a novel scoring function based on normalized cosine distance between multilingual sentence embeddings, in conjunction with (2) a novel application of a dynamic programming approximation (Salvador and Chan, 2007) which makes our algorithm linear in time and space complexity with respect to the number of sentences being aligned. We release a toolkit containing our implementation. Our method outperforms the previous state of the art, which has quadratic complexity, indicating both that our proposed scoring function outperforms prior work and that the approximations we make during alignment are sufficiently accurate.

Related Work
Early sentence aligners (Brown et al., 1991; Gale and Church, 1993) use scoring functions based only on the number of words or characters in each sentence, and alignment algorithms based on dynamic programming (DP; Bellman, 1953). DP has O(NM) time complexity, where N and M are the number of sentences in the source and target documents. Later work added lexical features and heuristics to speed up search, such as limiting the search space to be near the diagonal (Moore, 2002; Varga et al., 2007). More recent work introduced scoring methods that use MT to bring both documents into the same language (Bleualign; Sennrich and Volk, 2010) or that use pruned phrase tables from a statistical MT system (Coverage-Based; Gomes and Lopes, 2016). Both methods "anchor" high-probability 1-1 alignments in the search space and then fill in and refine the remaining alignments. Locating the anchors is O(NM) in time.

Method
We propose a novel sentence alignment scoring function based on the similarity of bilingual sentence embeddings. A distinct but non-obvious advantage of sentence embeddings is that a block of sentences can be represented as the average of its sentence embeddings. The size of the resulting vector does not depend on the number of embeddings being averaged, so the time/space cost of comparing the similarity of two blocks of sentences is independent of the number of sentences in each block. We show empirically (see § 4.2) that averaged embeddings for blocks of sentences are sufficient to produce approximate alignments, even in low-resource languages. This enables us to approximate DP in O(N + M) time and space.
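As a minimal sketch (not from the Vecalign toolkit), representing a block of sentences by the mean of its embeddings might look like the following; the array `embs` and the helper name `block_embedding` are illustrative:

```python
import numpy as np

def block_embedding(sent_embs: np.ndarray, start: int, end: int) -> np.ndarray:
    """Represent sentences [start, end) by the mean of their embeddings.
    The result's dimension is independent of the block size."""
    return sent_embs[start:end].mean(axis=0)

# Toy example: four "sentence embeddings" of dimension 3.
embs = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0],
                 [1.0, 1.0, 1.0]])

block = block_embedding(embs, 0, 2)   # average of the first two sentences
```

Whether the block covers 2 or 200 sentences, `block` has the same shape, which is what keeps block comparisons constant-cost.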

Bilingual Sentence Embeddings
We propose to use the similarity between sentence embeddings as the scoring function for sentence alignment. Sentence embedding similarity has been shown effective at filtering out non-parallel sentences (Hassan et al., 2018) and at locating parallel sentences in comparable corpora (Guo et al., 2018). We use the publicly available LASER multilingual sentence embedding method (Artetxe and Schwenk, 2018) and model, which is pretrained on 93 languages. However, our method is not specific to LASER.

Scoring Function
Cosine similarity is an obvious choice for comparing embeddings, but it has been noted to be globally inconsistent due to "hubness" (Radovanović et al., 2010; Lazaridou et al., 2015). Guo et al. (2018) proposed a supervised training approach for calibration, and Artetxe and Schwenk (2019) proposed normalization using nearest neighbors. We instead propose normalizing with randomly selected embeddings, as this has linear complexity. Sentence alignment seeks minimal parallel units, but we find that DP with cosine similarity favors many-to-many alignments (e.g., reporting a 3-3 alignment when it should report three 1-1 alignments). To remedy this issue, we scale the cost by the number of source and target sentences being considered in a given alignment. Our resulting cost function is:

$$c(x, y) = \frac{\bigl(1 - \cos(x, y)\bigr)\,\mathrm{nSents}(x)\,\mathrm{nSents}(y)}{\sum_{s=1}^{S}\bigl(1 - \cos(x, y_s)\bigr) + \sum_{s=1}^{S}\bigl(1 - \cos(x_s, y)\bigr)}$$

where x, y denote one or more sequential sentences from the source/target document; cos(x, y) is the cosine similarity between the embeddings of x and y; nSents(x), nSents(y) denote the number of sentences in x and y; and x_1, ..., x_S, y_1, ..., y_S are sampled uniformly from the given documents.
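A NumPy sketch of this normalized cost, under the simplifying assumption that each block is already embedded as a single vector; the function name `alignment_cost` and the toy embeddings are illustrative assumptions, not part of the released toolkit:

```python
import numpy as np

def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_cost(x_emb, y_emb, n_x, n_y, rand_x, rand_y):
    """Cost of aligning a source block (embedding x_emb, n_x sentences) with a
    target block (y_emb, n_y sentences). rand_x/rand_y hold embeddings of
    randomly sampled sentences used for normalization."""
    denom = sum(1.0 - cos_sim(x_emb, ys) for ys in rand_y) \
          + sum(1.0 - cos_sim(xs, y_emb) for xs in rand_x)
    return (1.0 - cos_sim(x_emb, y_emb)) * n_x * n_y / denom

# Toy data: target sentences are slight perturbations of the source sentences,
# standing in for translations embedded into a shared space.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8))
tgt = src + 0.01 * rng.normal(size=(5, 8))

cost_match = alignment_cost(src[0], tgt[0], 1, 1, src[1:], tgt[1:])     # true pair
cost_mismatch = alignment_cost(src[0], tgt[3], 1, 1, src[1:], tgt[1:])  # wrong pair
```

As expected, the true pair receives a much lower cost than the mismatched one, and the random-sample denominator keeps costs comparable across documents.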
Following standard practice, we model insertions and deletions in DP using a skip cost c_skip. The raw value of c_skip is only meaningful relative to other costs, so we do not expect it to generalize across languages, normalizations, or resolutions. We instead propose specifying a parameter β_skip which defines the skip cost in terms of the distribution of 1-1 alignment costs at alignment time: c_skip = CDF^(-1)(β_skip), where CDF is an estimate of the cumulative distribution function of 1-1 alignment costs, obtained by computing the costs of randomly selected source/target sentence pairs.
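The percentile-based skip cost can be sketched as follows. For brevity, the 1-1 cost here is simplified to 1 − cosine similarity (without the normalization above), and the function name and default β_skip value are illustrative assumptions:

```python
import numpy as np

def skip_cost(src_embs, tgt_embs, beta_skip=0.2, n_samples=1000, seed=0):
    """Estimate c_skip = CDF^{-1}(beta_skip), where the CDF is estimated from
    costs of randomly paired source/target sentences."""
    rng = np.random.default_rng(seed)
    i = rng.integers(len(src_embs), size=n_samples)
    j = rng.integers(len(tgt_embs), size=n_samples)
    a = src_embs[i] / np.linalg.norm(src_embs[i], axis=1, keepdims=True)
    b = tgt_embs[j] / np.linalg.norm(tgt_embs[j], axis=1, keepdims=True)
    costs = 1.0 - np.sum(a * b, axis=1)         # 1 - cosine similarity
    return float(np.quantile(costs, beta_skip))  # empirical inverse CDF

rng = np.random.default_rng(1)
src = rng.normal(size=(50, 8))   # stand-in source sentence embeddings
tgt = rng.normal(size=(40, 8))   # stand-in target sentence embeddings
c_skip = skip_cost(src, tgt, beta_skip=0.2)
```

Because the threshold is a quantile of the observed cost distribution rather than a raw cost, the same β_skip can be reused across language pairs and resolutions.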

Recursive DP Approximation
Instead of searching all possible sentence alignments via DP, consider first averaging adjacent pairs of sentence embeddings in both the source and target documents, halving the number of embeddings for each document. Aligning these vectors via DP (each of which is the average of 2 sentence embeddings) produces an approximate sentence alignment at a cost of (N/2)(M/2) = NM/4 comparisons. We can then refine this approximate alignment using the original sentence vectors, constraining ourselves to a small window of width w around the approximate alignment. At a minimum, w must be large enough to consider all paths covered by the lower-resolution alignment path, but it can also be increased to allow recovery from small errors in the approximate alignment; we use w = 10 for all experiments in this work. The refinement path has length at most N + M (all deletions/insertions), so refining the path requires at most (N + M)w comparisons. Thus the full NM comparisons can be approximated by (N + M)w + (N/2)(M/2) comparisons. Applied recursively, we can approximate the quadratic NM cost with a sum of linear costs:

$$NM \approx (N+M)w + \left(\tfrac{N}{2}+\tfrac{M}{2}\right)w + \left(\tfrac{N}{4}+\tfrac{M}{4}\right)w + \cdots < 2(N+M)w$$

See Figure 2 for an illustration of this method. We consider only insertions, deletions, and 1-1 alignments in all but the final search. In practice, we compute the full DP alignment once the downsampled sizes fall below an acceptably small constant, and because vectors for large blocks of sentences become correlated with each other, we center them around 0. Recursive downsampling and refinement of DP was proposed for dynamic time warping by Salvador and Chan (2007), but has not previously been applied to sentence alignment. We direct the reader to that work for a more formal analysis showing that the time/space complexity is linear.
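The downsampling step of the recursion can be sketched as follows (a hypothetical helper, not the toolkit's implementation); each application halves the sequence, so the total DP work over all resolutions forms a geometric series bounded by a constant multiple of N + M:

```python
import numpy as np

def downsample(embs: np.ndarray) -> np.ndarray:
    """Average adjacent pairs of embeddings, halving the sequence length.
    A trailing odd embedding is carried over unchanged."""
    n = len(embs) - len(embs) % 2
    pooled = (embs[0:n:2] + embs[1:n:2]) / 2.0
    if len(embs) % 2:
        pooled = np.vstack([pooled, embs[-1:]])
    return pooled

doc = np.arange(12.0).reshape(6, 2)   # 6 toy sentence embeddings of dim 2
half = downsample(doc)                # 3 embeddings, each covering 2 sentences
```

Recursing on the output of `downsample` until the documents are trivially small, then refining back up through the resolutions within a window w, yields the linear-time approximation described above.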

Text+Berg Alignment Accuracy
We evaluate sentence alignment accuracy using the development/test split released with Bleualign, consisting of manually aligned yearbook articles published in both German and French by the Swiss Alpine Club, from the Text+Berg corpus. Hyperparameters were chosen to optimize F1 on the development set. We consider alignments of up to 6 total sentences; that is, we allow alignments of size Q-R where Q + R ≤ 6.

Bible Alignment Accuracy
We are unaware of a multilingual, low-resource, parallel dataset with human sentence-level annotations. As a substitute for gold-standard sentence alignment, we use Bible verse alignment and sentence-split each verse. (There is no clear choice for sentence segmentation in low-resource languages; we use https://github.com/berkmancenter/mediacloud-sentence-splitter, falling back on English for unsupported languages.) The Bible has a number of properties which make it appealing for sentence alignment evaluation: it is much larger than existing sentence alignment test sets, and it is multi-way parallel in a large number of languages. Bibles are not aligned at the sentence level, but contain verse markings denoting segments typically on the scale of a partial sentence to a few sentences. This creates two potential issues for sentence alignment evaluation. First, a single sentence may span more than one verse. Inspecting the English Bible suggests that this is rare, and sentence aligners should be able to handle occasional over-segmentation, as in practice they are run on errorful automatic sentence segmentation. Second, a verse may contain more than one sentence. This is problematic when it happens in both languages being aligned, since the true sentence alignment cannot be determined (e.g., a verse which is two sentences in each language could be two 1-1 alignments or one 2-2 alignment). To evaluate with verse-level annotations, we propose converting the sentence alignment output into verse alignments by combining any consecutive sentence alignments for which all sentences in the alignments, on both the source and target side, came from the same verse. We report F1 compared to the gold-standard verse alignments, denoting it verse-level F1 to distinguish it from F1 computed at the sentence level.
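Once sentence alignments have been merged into verse alignments, verse-level F1 reduces to set overlap between hypothesized and gold alignment pairs. A minimal sketch, where the representation of alignments as pairs of verse-ID tuples is an illustrative assumption:

```python
def verse_f1(hyp, gold):
    """Verse-level F1 over alignment pairs. Each alignment is a
    (src_verse_ids, tgt_verse_ids) pair of tuples of verse IDs."""
    hyp, gold = set(hyp), set(gold)
    tp = len(hyp & gold)                      # exactly-matching alignments
    precision = tp / len(hyp) if hyp else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(("v1",), ("v1",)), (("v2",), ("v2",)), (("v3", "v4"), ("v3",))}
hyp = {(("v1",), ("v1",)), (("v2",), ("v2",)), (("v3",), ("v3",))}
```

Here the hypothesis recovers two of the three gold alignments and proposes one incorrect one, giving precision = recall = 2/3.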
We select six languages for which Christodouloupoulos and Steedman (2015) contains a full Bible (see Table 2). Languages were chosen to provide a range of amounts of training data used in LASER. From these six languages, we randomly select 10 language pairs for testing. All parameters are kept the same as in § 4.1, except we only consider alignments of up to 4 total sentences. We compare to Hunalign, run in bootstrap mode, as it is the only toolkit we tried which was robust enough to run on documents of this size. Results are shown in Table 3.
On average, we see an improvement of 28 verse-level F1 points over Hunalign. In manual analysis of the alignments, we find that for the language pairs with verse-level F1 < 0.35, Hunalign produces large stretches of alignments nowhere near the gold alignment. By contrast, errors made by the proposed method are predominantly local, indicating that Vecalign's recursive DP approximation succeeds even for very long documents in low-resource languages.

Improvements to Downstream MT
One of the primary applications of sentence alignment is creating bitext for training MT systems. To test Vecalign's impact on downstream MT quality, we re-align noisy, web-crawled data in two low-resource language pairs: Sinhala-English and Nepali-English. The data is collected via Paracrawl and is very similar to that released in the WMT 2019 sentence filtering task, but some new data has been collected and a small amount of data was lost due to a hard disk failure. Our baseline is the standard Paracrawl pipeline, using Hunalign in conjunction with a dictionary extracted from the clean data released in the shared task.
We filter the output of Vecalign and Hunalign using the best-performing method from the shared task, which includes removing sentences in the wrong languages and sentences with high token overlap. We train and evaluate NMT models following the procedure and hyperparameters from the shared task.
Results are shown in Figure 3. Using Vecalign, we see improvements of 1.7 and 1.6 BLEU for the best data sizes in Sinhala→English and Nepali→English, respectively, compared to the systems trained on Hunalign output.

The time required to align documents of various sizes is shown for Vecalign, Bleualign, Gargantua, and Hunalign in Figure 4. As expected, Vecalign has approximately linear runtime. We use truncated portions of Hu-En Bibles in order to use the dictionary provided with Hunalign. Bleualign is run on NMT output. Vecalign settings match § 4.2. Experiments are run on a Thinkpad T480 with 32 GB RAM. Times do not include translation (Bleualign), lexicon building (Hunalign), or sentence embedding (Vecalign); for reference, producing embeddings for 32k sentences, including overlaps, in each language took ~120 s on a GeForce RTX 2080 Ti GPU. Bleualign and Gargantua run out of memory on 32k sentences. Hunalign and Vecalign use ~1 GB and are both very fast, aligning 32k sentences in ~30 s.

Conclusions
We present Vecalign, a novel sentence alignment method based on the similarity of sentence embeddings and a DP approximation which is fast even for long documents. Our method achieves state-of-the-art accuracy in both high- and low-resource settings and improves downstream MT quality.