TransQuest: Translation Quality Estimation with Cross-lingual Transformers

Recent years have seen big advances in the field of sentence-level quality estimation (QE), largely as a result of using neural-based architectures. However, the majority of these methods work only on the language pair they are trained on and need retraining for new language pairs. This process can prove difficult from a technical point of view and is usually computationally expensive. In this paper we propose a simple QE framework based on cross-lingual transformers, and we use it to implement and evaluate two different neural architectures. Our evaluation shows that the proposed methods achieve state-of-the-art results outperforming current open-source quality estimation frameworks when trained on datasets from WMT. In addition, the framework proves very useful in transfer learning settings, especially when dealing with low-resourced languages, allowing us to obtain very competitive results.


Introduction
The goal of quality estimation (QE) is to evaluate the quality of a translation without having access to a reference translation. High-accuracy QE that can be easily deployed for a number of language pairs is the missing piece in many commercial translation workflows, where it has numerous potential uses. QE systems can be employed to select the best translation when several translation engines are available, or to inform the end user about the reliability of automatically translated content. In addition, QE systems can be used to decide whether a translation can be published as it is in a given context, or whether it requires human post-editing before publishing or even translation from scratch by a human (Kepler et al., 2019). The estimation of translation quality can be done at different levels: document level, sentence level and word/phrase level (Ive et al., 2018). In this research we focus on sentence-level quality estimation.
As we discuss in Section 2, at present neural-based QE methods constitute the state of the art in quality estimation. However, these approaches are based on complex neural networks and require resource-intensive training. The resource-intensive nature of these deep-learning-based frameworks makes it expensive to have QE systems that work for several languages at the same time. Furthermore, these architectures require a large number of annotated instances for training, making the quality estimation task very difficult for low-resource language pairs. In this paper we propose TransQuest, a framework for sentence-level machine translation quality estimation which solves the aforementioned problems, whilst obtaining competitive results. The motivation behind this research is to propose a simple architecture which can be easily trained with different types of inputs (i.e. different language pairs or language from different domains) and can be used for transfer learning in settings where there is not enough training data. We show that TransQuest outperforms current open-source quality estimation frameworks and compares favourably to winning solutions submitted to recent shared tasks on 15 different language pairs on different aspects of quality estimation. In fact, a tuned version of TransQuest was declared the winner for all 8 tasks of the direct assessment sentence-level QE shared task organised at WMT 2020 (for more details see Ranasinghe et al. (2020) and Section 5.1).
The main contributions of this paper are the following:
1. We introduce TransQuest, an open-source framework, and use it to implement two neural network architectures that outperform current state-of-the-art quality estimation methods in two different aspects of sentence-level quality estimation.
2. To the best of our knowledge this is the first neural-based method which develops a model capable of providing quality estimation for more than one language pair. In this way we address the problem of high costs required to maintain a multi-language-pair QE environment.
3. We tackle the problem of quality estimation in low-resource language pairs by showing that even with a small number of annotated training instances, TransQuest with transfer learning can outperform current state-of-the-art quality estimation methods in low-resource language pairs.
4. We provide important resources to the community: the code, released as an open-source framework, and the TransQuest model zoo, a collection of pre-trained quality estimation models trained on 15 different language pairs and different aspects of quality estimation. Both will be freely available to the community.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.
The remainder of the paper is structured as follows. We first present a brief overview of related work in order to define the context of our work. In Section 3 we present the TransQuest framework and the methodology employed to train it. The datasets used to train it are presented in Section 4, followed by the evaluation and discussion in Section 5. The paper finishes with conclusions and ideas for future research directions.

Related Work
During the past decade there has been tremendous progress in the field of quality estimation, largely as a result of the QE shared tasks organised annually since 2012 by the Workshops on Statistical Machine Translation (WMT), more recently called the Conferences on Machine Translation. The annotated datasets these shared tasks released each year have led to the development of many open-source QE systems like QuEst (Specia et al., 2013), QuEst++ (Specia et al., 2015), deepQuest (Ive et al., 2018), and OpenKiwi (Kepler et al., 2019). Before the neural network era, most of the quality estimation systems like QuEst (Specia et al., 2013) and QuEst++ (Specia et al., 2015) were heavily dependent on linguistic processing and feature engineering to train traditional machine-learning algorithms like support vector regression and randomised decision trees (Specia et al., 2013). Even though they provided good results, these traditional approaches are no longer the state of the art. In recent years, neural-based QE systems have consistently topped the leaderboards in WMT quality estimation shared tasks (Kepler et al., 2019).
For example, the best-performing system at the WMT 2017 shared task on QE was POSTECH, which is purely neural and does not rely on feature engineering at all (Kim et al., 2017). POSTECH revolves around an encoder-decoder Recurrent Neural Network (RNN) (referred to as the 'predictor'), stacked with a bidirectional RNN (the 'estimator') that produces quality estimates. In the predictor, an encoder-decoder RNN model predicts words based on their context representations, and in the estimator step a bidirectional RNN model produces quality estimates for words, phrases and sentences based on representations from the predictor. To be effective, POSTECH requires extensive predictor pre-training, which means it depends on large parallel data and is computationally intensive (Ive et al., 2018). The POSTECH architecture was later re-implemented in deepQuest (Ive et al., 2018).
OpenKiwi (Kepler et al., 2019) is another open-source QE framework, developed by Unbabel. It implements four different neural network architectures: QUETCH (Kreutzer et al., 2015), NUQE (Martins et al., 2016), Predictor-Estimator (Kim et al., 2017) and a stacked model of those architectures. Both the QUETCH and NUQE architectures have simple neural network models that do not rely on additional parallel data, but they do not perform as well. The Predictor-Estimator model is similar to the POSTECH architecture and relies on additional parallel data. In OpenKiwi, the best performance for sentence-level quality estimation was given by the stacked model that used the Predictor-Estimator model, meaning that the best model requires extensive predictor pre-training and relies on large parallel data and computational resources.
In order to remove the dependency on large parallel data, which also entails the need for powerful computational resources, we propose to use crosslingual embeddings that are already fine-tuned to reflect properties between languages. We assume that by using them we will ease the burden of having complex neural network architectures. Over the last few years there has been significant work done in the area of crosslingual embeddings (Ruder et al., 2019).
Since the introduction of BERT (Devlin et al., 2019), transformer models have been used successfully for various NLP tasks such as named entity recognition (Devlin et al., 2019), sentence classification (Sun et al., 2019), and question answering (Devlin et al., 2019), in many cases improving the state of the art. Most of these tasks focused on English because most of the pre-trained transformer models were trained on English data. Although there are several multilingual models like multilingual BERT (mBERT) (Devlin et al., 2019) and multilingual DistilBERT (mDistilBERT), researchers have expressed some reservations about their ability to represent all the languages (Pires et al., 2019). In addition, although mBERT and mDistilBERT showed some crosslingual characteristics, they do not perform well on crosslingual benchmarks (K et al., 2020).
XLM-RoBERTa (XLM-R) was released in November 2019 (Conneau et al., 2020) as an update to the XLM-100 model (Conneau and Lample, 2019). XLM-R takes a step back from XLM, eschewing XLM's Translation Language Modeling (TLM) objective since it requires a dataset of parallel sentences, which can be difficult to acquire. Instead, XLM-R trains RoBERTa (Liu et al., 2019) on a huge, multilingual dataset at an enormous scale: unlabelled text in 104 languages is extracted from CommonCrawl datasets, totalling 2.5TB of text. It is trained using only RoBERTa's (Liu et al., 2019) masked language modelling (MLM) objective. Surprisingly, this strategy provided better results in crosslingual tasks. XLM-R outperforms mBERT on a variety of crosslingual benchmarks such as crosslingual natural language inference and crosslingual question answering (Conneau et al., 2020).
Both architectures proposed in TransQuest have been successfully applied in monolingual semantic textual similarity tasks (Devlin et al., 2019; Reimers and Gurevych, 2019). When applied in monolingual experiments, both of them use monolingual transformer models like BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as the input. This inspired us to change the input in such a way that it can represent both the source and target sentences for which the quality of translation needs to be estimated, with the hope that the same architectures would also provide good results in the QE task. Our initial experiments showed that crosslingual embeddings like XLM-R provide better results than multilingual embeddings like mBERT. Therefore, in this research we explore the performance of crosslingual embeddings with simple neural network architectures for the sentence-level quality estimation task. To the best of our knowledge, state-of-the-art crosslingual contextual embeddings such as XLM-R have not been used in quality estimation before.

Methodology
This section presents the methodology used to develop our quality estimation methods. We first describe the neural network architectures we proposed, followed by the method used to train these architectures.

Neural Network Architectures
The TransQuest framework that is used to implement the two architectures described here relies on the XLM-R transformer model (Conneau et al., 2020) to derive the representations of the input sentences. The XLM-R transformer model takes a sequence of no more than 512 tokens as input and outputs the representation of the sequence. The first token of the sequence is always [CLS], which contains the special embedding to represent the whole sequence, followed by embeddings acquired for each word in the sequence. As shown below, our proposed neural network architectures can utilise both the embedding for the [CLS] token and the embeddings generated for each word. The output of the transformer (or transformers for SiameseTransQuest described below) is fed into a simple output layer which is used to estimate the quality of translation. We describe below the way the XLM-R transformer is used and the output layer, as they are different in the two instantiations of the framework. The fact that we do not rely on a complex output layer makes training our architectures much less computationally intensive than alternative solutions. The TransQuest framework is open-source, which means researchers can easily propose alternative architectures to the ones we present in this paper.
Both neural network architectures presented below use the pre-trained XLM-R-large model available in HuggingFace's Transformers library. The XLM-R-large model covers 104 languages (Conneau et al., 2020), making it potentially very useful to estimate the translation quality for a large number of language pairs.
TransQuest implements two different neural network architectures to perform sentence-level translation quality estimation which we describe below. The architectures are presented in Figure 1.
1. MonoTransQuest (MTransQuest): The first architecture proposed uses a single XLM-R transformer model and is shown in Figure 1a. The input of this model is a concatenation of the original sentence and its translation, separated by the [SEP] token. We experimented with three pooling strategies for the output of the transformer model: using the output of the [CLS] token (CLS-strategy); computing the mean of all output vectors of the input words (MEAN-strategy); and computing a max-over-time of the output vectors of the input words (MAX-strategy). The output of the pooling strategy is used as the input of a softmax layer that predicts the quality score of the translation. We used mean-squared-error loss as the objective function. Early experiments we carried out demonstrated that the CLS-strategy leads to better results than the other two strategies for this architecture. Therefore, we used the embedding of the [CLS] token as the input of a softmax layer.
2. SiameseTransQuest (STransQuest): The second approach proposed in this paper relies on the Siamese architecture depicted in Figure 1b, which has shown promising results in monolingual semantic textual similarity tasks (Reimers and Gurevych, 2019; Ranasinghe et al., 2019). In this case, we feed the original text and the translation into two separate XLM-R transformer models. As in the previous architecture, we experimented with the same three pooling strategies for the outputs of the transformer models. We then calculated the cosine similarity between the two outputs of the pooling strategy. We used mean-squared-error loss as the objective function. In initial experiments we carried out with this architecture, the MEAN-strategy showed better results than the other two strategies. For this reason, we used the MEAN-strategy for our experiments. Therefore, cosine similarity is calculated between the mean of all output vectors of the input words produced by each transformer.
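The three pooling strategies and the cosine similarity used by STransQuest can be illustrated with a small, dependency-free sketch. It is not the TransQuest implementation: the function names are hypothetical, token vectors are plain Python lists, and it assumes that position 0 holds the [CLS] vector and that the MEAN and MAX strategies pool over the remaining positions only.

```python
import math
from typing import List

Vector = List[float]


def cls_pool(token_vectors: List[Vector]) -> Vector:
    """CLS-strategy: take the vector at position 0 (assumed to be [CLS])."""
    return list(token_vectors[0])


def mean_pool(token_vectors: List[Vector]) -> Vector:
    """MEAN-strategy: average every dimension over the word vectors."""
    words = token_vectors[1:]  # assumption: skip the [CLS] position
    return [sum(dim) / len(words) for dim in zip(*words)]


def max_pool(token_vectors: List[Vector]) -> Vector:
    """MAX-strategy: max-over-time for every dimension of the word vectors."""
    words = token_vectors[1:]  # assumption: skip the [CLS] position
    return [max(dim) for dim in zip(*words)]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two pooled sentence vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_squared_error(predictions: List[float], gold: List[float]) -> float:
    """Objective used for both architectures in the text."""
    return sum((p - g) ** 2 for p, g in zip(predictions, gold)) / len(gold)


if __name__ == "__main__":
    # Toy 2-dimensional outputs: one [CLS] vector followed by two word vectors.
    source = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
    target = [[0.0, 0.0], [2.0, 2.0], [2.0, 2.0]]
    score = cosine_similarity(mean_pool(source), mean_pool(target))
    print(f"STransQuest-style score: {score:.3f}")
```

In MTransQuest the pooled vector of the single concatenated input would go to the output layer instead, while in STransQuest the cosine similarity of the two pooled vectors is itself the predicted score that the mean-squared-error loss compares with the gold label.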

Training Details
We used the same set of configurations for all the language pairs evaluated in this paper in order to ensure consistency between all the languages. This also provides a good starting configuration for researchers who intend to use TransQuest on a new language pair. In both architectures we used a batch-size of eight, Adam optimiser with learning rate 2e−5, and a linear learning rate warm-up over 10% of the training data. During the training process, the parameters of XLM-R model, as well as the parameters of the subsequent layers, were updated. The models were trained using only training data. Furthermore, they were evaluated while training using an evaluation set that had one fifth of the rows in training data. We performed early stopping if the evaluation loss did not improve over ten evaluation steps. All the models were trained for three epochs. For some of the experiments, we used an Nvidia Tesla K80 GPU, whilst for others we used an Nvidia Tesla T4 GPU. This was purely based on the availability of the hardware and it was not a methodological decision.
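The warm-up schedule and the early-stopping rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TransQuest training loop: the names are hypothetical, the warm-up is counted in optimiser steps, and the learning rate is held constant after the warm-up because the text does not say whether it is decayed afterwards.

```python
def warmup_learning_rate(step: int, total_steps: int,
                         base_lr: float = 2e-5,
                         warmup_fraction: float = 0.1) -> float:
    """Ramp the learning rate linearly to base_lr, then hold it (assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_fraction))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr


class EarlyStopping:
    """Signal a stop when the evaluation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best_loss = float("inf")
        self.evaluations_without_improvement = 0

    def should_stop(self, eval_loss: float) -> bool:
        if eval_loss < self.best_loss:
            self.best_loss = eval_loss
            self.evaluations_without_improvement = 0
        else:
            self.evaluations_without_improvement += 1
        return self.evaluations_without_improvement >= self.patience


if __name__ == "__main__":
    # Illustrative numbers: 7,000 instances, batch size 8, three epochs.
    total_steps = (7000 // 8) * 3
    for step in (0, 100, 261, 262, 2000):
        print(step, warmup_learning_rate(step, total_steps))
```

With a patience of ten, a run stops once ten consecutive evaluations fail to improve on the best evaluation loss seen so far.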

Dataset
We used the architectures described above to predict two standard measures that express the quality of a translation: Human-mediated Translation Edit Rate (HTER) and Direct Assessment (DA). All the datasets that we used are publicly available and were released in WMT quality estimation tasks in recent years (Fonseca et al., 2019). This was done to ensure replicability of our experiments and to allow us to compare our results with the state of the art. In the remainder of the section we provide more details about the datasets used.

Predicting HTER
The performance of QE systems has typically been assessed using the semiautomatic HTER (Human-mediated Translation Edit Rate). HTER is an edit-distance-based measure which captures the distance between the automatic translation and a reference translation in terms of the number of modifications required to transform one into another. In light of this, a QE system should be able to predict the percentage of edits required in the translation. We used several language pairs for which HTER information was available: English-Chinese (En-Zh), English-Czech (En-Cs), English-German (En-De), English-Russian (En-Ru), English-Latvian (En-Lv) and German-English (De-En). The texts are from a variety of domains and the translations were produced using both neural and statistical machine translation systems. More details about these datasets can be found in Table 1 and in (Fonseca et al., 2019).
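To make the idea of an edit-rate label concrete, the sketch below computes a simplified, HTER-like score. It is an approximation under stated assumptions rather than the procedure used to build the WMT datasets: it uses word-level Levenshtein distance (insertions, deletions, substitutions) and ignores the block shifts that TER also counts, it takes the human post-edited sentence as the reference, and the function names are hypothetical.

```python
from typing import List


def word_edit_distance(hypothesis: List[str], reference: List[str]) -> int:
    """Word-level Levenshtein distance computed row by row."""
    previous = list(range(len(reference) + 1))
    for i, hyp_word in enumerate(hypothesis, start=1):
        current = [i]
        for j, ref_word in enumerate(reference, start=1):
            substitution_cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,                      # delete hyp_word
                current[j - 1] + 1,                   # insert ref_word
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def approximate_hter(machine_translation: str, post_edit: str) -> float:
    """Edits needed to turn the MT output into its post-edit, per reference word."""
    hypothesis = machine_translation.split()
    reference = post_edit.split()
    if not reference:
        raise ValueError("the post-edited reference must not be empty")
    return word_edit_distance(hypothesis, reference) / len(reference)


if __name__ == "__main__":
    mt = "the cat sat on mat"
    pe = "the cat sat on the mat"
    print(f"approximate HTER: {approximate_hter(mt, pe):.3f}")
```

A translation that needs no edits scores 0, and the score grows with the share of words that have to be changed.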

Predicting DA
Even though HTER has been typically used to assess quality in machine translations, the reliability of this metric for assessing the performance of quality estimation systems has been questioned by researchers (Graham et al., 2016). The current practice in MT evaluation is the so-called Direct Assessment (DA) of MT quality (Graham et al., 2017), where raters evaluate the machine translation on a continuous 1-100 scale. This method has been shown to improve the reproducibility of manual evaluation and to provide a more reliable gold standard for automatic evaluation metrics (Graham et al., 2015). We used a recently created dataset to predict DA in machine translations which was released for the WMT 2020 quality estimation shared task 1. The dataset is composed of data extracted from Wikipedia for six language pairs, consisting of high-resource English-German (En-De) and English-Chinese (En-Zh), medium-resource Romanian-English (Ro-En) and Estonian-English (Et-En), and low-resource Sinhala-English (Si-En) and Nepalese-English (Ne-En), as well as a Russian-English (Ru-En) dataset which combines articles from Wikipedia and Reddit. These datasets have been collected by translating sentences sampled from source-language articles using state-of-the-art NMT models built using the fairseq toolkit and annotated with DA scores by professional translators. Each translation was rated with a score from 0-100 according to the perceived translation quality by at least three translators. The DA scores were standardised using the z-score. The quality estimation systems evaluated on these datasets have to predict the mean DA z-scores of test sentence pairs. Each language pair has 7,000 sentence pairs in the training set, 1,000 sentence pairs in the development set and another 1,000 sentence pairs in the testing set.
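The standardisation step can be illustrated with a short sketch. The text only states that DA scores were z-standardised and that systems predict the mean z-score; the sketch assumes that standardisation is done per annotator before averaging, and the data layout and function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict

Ratings = Dict[str, Dict[str, float]]  # ratings[annotator][sentence_id] = score


def z_normalise_by_annotator(ratings: Ratings) -> Ratings:
    """Standardise each annotator's raw DA scores with that annotator's mean and std."""
    normalised: Ratings = {}
    for annotator, scores in ratings.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        normalised[annotator] = {
            sentence_id: (score - mu) / sigma if sigma > 0 else 0.0
            for sentence_id, score in scores.items()
        }
    return normalised


def mean_z_scores(z_scores: Ratings) -> Dict[str, float]:
    """Average the z-scores that different annotators gave to the same sentence."""
    per_sentence: Dict[str, list] = {}
    for scores in z_scores.values():
        for sentence_id, value in scores.items():
            per_sentence.setdefault(sentence_id, []).append(value)
    return {sentence_id: mean(values) for sentence_id, values in per_sentence.items()}


if __name__ == "__main__":
    raw = {
        "annotator_1": {"s1": 40.0, "s2": 60.0, "s3": 80.0},
        "annotator_2": {"s1": 10.0, "s2": 20.0, "s3": 30.0},
        "annotator_3": {"s1": 70.0, "s2": 80.0, "s3": 90.0},
    }
    print(mean_z_scores(z_normalise_by_annotator(raw)))
```

In this toy example the three annotators use different parts of the 0-100 scale but rank the sentences identically, so after standardisation they agree on every sentence.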

Evaluation and discussion
This section presents the evaluation results of our architectures on the datasets described in the previous section in a variety of settings. We first evaluate them in a single language pair setting (Section 5.1), which is essentially the setting employed in the WMT shared tasks. We then evaluate in a setting where we combine datasets in several languages for training (Section 5.2). We conclude the section with an evaluation of a transfer learning setting (Section 5.3). In order to better understand the performance of our approach, we compare our results with the baselines reported by the WMT2018-2020 organisers. Two baselines were used: OpenKiwi (Kepler et al., 2019) and QuEst++ (Specia et al., 2015). Row IV of Tables 2 and 3 presents the results for these baselines. For some language pairs in the 2018 WMT quality estimation shared task the organisers did not report the scores obtained using OpenKiwi. These cases are marked with NR in Table 2. Row IV of both tables also includes the results of the best system from WMT2018 to WMT2020 for each setting. Additionally, Row IV of Table 3 shows the results of TransQuest's submission to WMT 2020 QE Task 1, which was the winning solution in all the languages (Ranasinghe et al., 2020).
The evaluation metric used was the Pearson correlation (r) between the predictions and the gold standard from the test set, which is the most commonly used evaluation metric in recent WMT quality estimation shared tasks (Fonseca et al., 2019).
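For reference, the Pearson correlation between a list of predictions and the gold scores can be computed as in the sketch below. It is a plain-Python illustration with a hypothetical function name; it is not the official WMT scoring script.

```python
import math
from typing import Sequence


def pearson_r(predictions: Sequence[float], gold: Sequence[float]) -> float:
    """Pearson correlation coefficient between predicted and gold quality scores."""
    n = len(predictions)
    if n != len(gold) or n < 2:
        raise ValueError("need two equally long sequences with at least two items")
    mean_p = sum(predictions) / n
    mean_g = sum(gold) / n
    covariance = sum((p - mean_p) * (g - mean_g) for p, g in zip(predictions, gold))
    variance_p = sum((p - mean_p) ** 2 for p in predictions)
    variance_g = sum((g - mean_g) ** 2 for g in gold)
    if variance_p == 0 or variance_g == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return covariance / math.sqrt(variance_p * variance_g)


if __name__ == "__main__":
    predicted = [1.0, 2.0, 3.0, 4.0]
    gold_scores = [1.0, 3.0, 2.0, 4.0]
    print(f"r = {pearson_r(predicted, gold_scores):.2f}")
```

Because r is invariant to shifting and rescaling the predictions, it rewards systems that rank and space translations like the gold standard, regardless of the absolute scale of their scores.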

Supervised Single Language Pair Quality Estimation
The first evaluation we carried out was the supervised single language pair evaluation, where we used the training set of each language pair to build a quality estimation model and evaluated it on a testing set from the same language pair. This replicates the standard QE evaluation carried out in the WMT shared tasks. The results for each language pair in supervised settings are shown in row I of Tables 2 and 3. The results indicate that both architectures proposed in TransQuest outperform the baselines in all the language pairs of both aspects in quality estimation, and also outperform the best systems from previous competitions. Of the two architectures, MTransQuest performs slightly better than STransQuest.
In the HTER aspect of quality estimation, as shown in Table 2, MTransQuest gains ≈ 0.1-0.2 Pearson correlation boost over OpenKiwi in most language pairs. However, OpenKiwi comes very close to MTransQuest in En-De SMT. In the language pairs where OpenKiwi results are not available MTransQuest gains ≈ 0.3-0.4 Pearson correlation boost over QuEst++ in all language pairs for both NMT and SMT. Table 2 also gives the results of the best system submitted for a particular language pair. It is worth noting that for the training setting described in this section, the TransQuest results surpass the best system in all the language pairs with the exception of the En-De SMT and En-Zh NMT datasets.
As shown in Table 3, in the DA aspect of quality estimation, MTransQuest gained ≈ 0.2-0.3 Pearson correlation boost over OpenKiwi in all the language pairs. Additionally, MTransQuest achieves ≈ 0.4 Pearson correlation boost over OpenKiwi in the low-resource language pair Ne-En. Furthermore, TransQuest participated in WMT 2020 quality estimation shared task 1 and it was the winning solution in all the language pairs. To achieve this result, TransQuest was fine-tuned with self-ensemble and data augmentation. We do not describe the fine-tuning approaches here since they are task specific, but more details can be found in (Ranasinghe et al., 2020).
Additionally, row V in both Tables 2 and 3 shows the results of multilingual BERT (mBERT) in the MonoTransQuest architecture. We used the same settings as for XLM-R. The results show that the XLM-R model outperforms the mBERT model in all the language pairs of both aspects in quality estimation, and we can safely assume that the crosslingual nature of the XLM-R transformers had a clear impact on the results.

Supervised Multi-Language Pair Quality Estimation
Most of the available open-source quality estimation frameworks require maintaining separate machine learning models for each language. This can be very challenging in a practical environment where the systems have to work with 10-20 language pairs. Furthermore, pre-trained neural quality estimation models are large. In a commercial environment, where the quality estimation systems need to do inference on several language pairs, loading all of the pre-trained models from all language pairs would require a lot of Random Access Memory (RAM) space and result in a huge cost.
Table note: Row V presents the results of the multilingual BERT (mBERT) model in the MonoTransQuest architecture. NS implies that the non-English language in the language pair is not supported by mBERT.
Therefore, with TransQuest we propose a single model that can perform quality estimation on several language pairs. We propose two training strategies for supervised multi-language pair settings:
1. We separate the language pairs into two groups. One group contains all the language pairs where the source language is English and in the other group the target is always English. We represent the former with En-*, and the latter with *-En. We train both architectures by concatenating training sets in all the language pairs in a particular group. To ease the comparison of the results, we evaluate the model separately on each language pair. We do this process for both aspects in quality estimation. The results are shown in row II in Tables 2 and 3.
2. We concatenate training data from all the language pairs, without considering the direction of the translation, and build a single model for all language pairs. We refer to these models by MTransQuest-m and STransQuest-m. Similarly to the first multi-language pair training strategy, we evaluate the model separately on each language pair and for both aspects of quality estimation. The results are shown in row III in Tables 2 and 3.
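The two data-pooling strategies can be sketched as below. The sketch only shows how training sets might be combined before training; the data layout (a dictionary keyed by language-pair codes such as "En-De", holding source, translation and score triples) and the function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str, float]  # (source sentence, translation, quality score)


def group_by_direction(datasets: Dict[str, List[Example]]) -> Dict[str, List[Example]]:
    """Strategy 1: one pooled training set for En-* pairs and one for *-En pairs."""
    groups: Dict[str, List[Example]] = {"En-*": [], "*-En": []}
    for pair, examples in datasets.items():
        source_lang, target_lang = pair.split("-")
        if source_lang == "En":
            groups["En-*"].extend(examples)
        elif target_lang == "En":
            groups["*-En"].extend(examples)
        else:
            raise ValueError(f"{pair} does not include English")
    return groups


def pool_all(datasets: Dict[str, List[Example]]) -> List[Example]:
    """Strategy 2: a single pooled training set that ignores translation direction."""
    return [example for examples in datasets.values() for example in examples]


if __name__ == "__main__":
    toy = {
        "En-De": [("Hello", "Hallo", 0.9)],
        "En-Zh": [("Hello", "你好", 0.8)],
        "Ro-En": [("Salut", "Hello", 0.7)],
    }
    groups = group_by_direction(toy)
    print({name: len(examples) for name, examples in groups.items()})
    print(len(pool_all(toy)))
```

In both strategies evaluation is still carried out separately on the test set of each language pair, so only the training data is pooled.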
As depicted in Tables 2 and 3, the multi-language pair experiments yielded very competitive results. In fact, for some language pairs the multi-language pair model performed better than the model that was trained solely on that particular pair. When predicting HTER, the multi-language pair model performed better in En-Lv for both SMT and NMT, and in En-De for SMT while performing on par in En-De for NMT. In predicting DA, the multi-language pair model performed better in Et-En, Ru-En and Si-En. In addition, with the exceptions of the En-De SMT and En-Zh NMT setting, the models that consider the direction of the language pairs are better than the best systems submitted to previous editions of WMT.
Throughout our experiments we noted that the TransQuest models built with the direction of the language pairs in mind performed slightly better than the TransQuest models trained without considering the language pair direction. It should be noted that none of the multi-language pair models' Pearson correlation decreased by more than 0.03% in any language pair for either TransQuest architecture. Similar to supervised single language pair experiments, MTransQuest architecture performed better than STransQuest architecture in all the languages in both aspects of quality estimation.
The size of the pre-trained TransQuest models on a single language pair was ≈ 2 GB. The pre-trained models for multiple language pairs in this section did not exceed 2.1 GB. Therefore, we present multi-language pair pre-trained models as a solution for environments that have limited resources and need to conduct quality estimation on multiple language pairs.

Transfer Learning based Quality Estimation
The biggest challenge in building supervised quality estimation models is not having enough annotated data, especially for low-resourced languages. We explore the possibility of performing transfer learning on low-resource languages using the models trained on better-resourced languages. As the low-resource language pairs were only available in the DA aspect, we conducted this experiment only in the DA aspect in quality estimation. All of the low-resource language pairs in the DA aspect had English as the target language. Considering the positive impact that the direction of the language pair had on Pearson correlation in the previous experiment, we decided to consider only the language pairs with English as the target language. This left us only with mid-resource language pairs for training. We were also aware that the TransQuest models had relatively low Pearson correlations with high-resource language pairs in DA.
1. We build a single model for each architecture using all the training data available for mid-resource language pairs: Et-En, Ro-En and Ru-En. We refer to this as TransQuest-Mid.
2. When we train a TransQuest model for a low-resource language pair, we initialise the model weights from TransQuest-Mid and start training. To see whether it is possible to get comparable results even with fewer training instances, we conduct the experiments for 0 (unsupervised), 100, 200, 300 and up to 1,000 training sentence pairs. We do this for Si-En and Ne-En. Depending on the architecture we use, we refer to this model as MTransQuest TL or STransQuest TL.
3. In order to evaluate the effect of transfer learning we conduct the same experiment in step 2, but we train the model from scratch. Depending on the architecture we use, we refer to this model as MTransQuest Scratch or STransQuest Scratch.
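The experimental grid described in the steps above can be enumerated with a short sketch. It only lists the runs and trains nothing; the dictionary keys and the function name are hypothetical, the sizes are assumed to increase in steps of 100, and "from scratch" is assumed to mean starting from the pre-trained XLM-R-large weights rather than from TransQuest-Mid.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

Run = Dict[str, Union[str, int]]


def transfer_learning_runs(
    language_pairs: Sequence[str] = ("Si-En", "Ne-En"),
    architectures: Sequence[str] = ("MTransQuest", "STransQuest"),
    max_instances: int = 1000,
    step: int = 100,
) -> List[Run]:
    """List every (language pair, architecture, training size, initialisation) run."""
    sizes = range(0, max_instances + 1, step)  # 0 is the unsupervised setting
    initialisations = (
        ("TL", "TransQuest-Mid"),    # step 2: start from the mid-resource model
        ("Scratch", "XLM-R-large"),  # step 3: assumed starting point
    )
    runs: List[Run] = []
    for pair, architecture, size in product(language_pairs, architectures, sizes):
        for suffix, initial_weights in initialisations:
            runs.append({
                "model": f"{architecture} {suffix}",
                "language_pair": pair,
                "initial_weights": initial_weights,
                "train_instances": size,
            })
    return runs


if __name__ == "__main__":
    runs = transfer_learning_runs()
    print(len(runs), "runs; first:", runs[0])
```

Comparing the TL and Scratch runs at the same training size isolates the contribution of the weights inherited from TransQuest-Mid.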
As shown in Figure 2, the transfer learning strategy significantly impacts the results. In Ne-En, with only 100 training instances, training MTransQuest scratch achieves only 0.1242 Pearson correlation between the predictions and gold labels of the test set. However, in Ne-En with only 100 training instances, training MTransQuest using the transfer learning strategy achieves 0.7417 Pearson correlation, which is close to the best result obtained with the MTransQuest architecture for Ne-En after training with 7,000 instances (0.7914). When the number of training instances grows, the results from the TransQuest models trained with the transfer learning strategy and the results from the TransQuest models trained from scratch converge. A similar pattern, but with a lower Pearson correlation, can be seen with STransQuest. Similar results can also be observed for the other low-resource language pair, Si-En. Therefore, it is safe to conclude that TransQuest with the transfer learning strategy can be hugely beneficial to low-resource language pairs in quality estimation where annotated training instances are scarce.

Conclusions
In this paper we introduced TransQuest, a new open-source framework for quality estimation based on cross-lingual transformers. TransQuest is implemented in PyTorch and supports training of sentence-level quality estimation systems on new data. It outperforms other open-source tools on both aspects of sentence-level quality estimation and yields new state-of-the-art quality estimation results. As far as we know, it is the first time that an open-source QE framework has been tested on both aspects of quality estimation. Furthermore, it is the first time that a QE system explores multi-language pair models and transfer learning on low-resource language pairs. Unlike many other open-source neural QE frameworks, TransQuest does not use parallel data and hence does not require similar computational resources.
We propose two architectures, MTransQuest and STransQuest, neither of which has been previously explored in QE tasks. The two architectures offer a trade-off between accuracy and efficiency. On an Nvidia Tesla K80 GPU, MTransQuest takes 4,480s on average to train on 7,000 instances, while STransQuest takes only 3,900s on average for the same experiment. On the same GPU, MTransQuest takes 35s on average to perform inference on 1,000 instances, whereas STransQuest takes only 16s. Therefore we recommend using MTransQuest where accuracy is valued over efficiency, and STransQuest where efficiency is prioritised above accuracy. Since there is a growing interest in the NLP community in energy-efficient machine learning models, we decided to support both architectures in the TransQuest framework.
In the future, we plan to expand TransQuest with more neural network architectures and more models for different levels of quality estimation such as word-level and document-level. At the sentence level, we plan to perform transfer learning on language pairs that do not include English at all. We also hope to conduct unsupervised experiments on low-resource language pairs.