QE BERT: Bilingual BERT Using Multi-task Learning for Neural Quality Estimation

This paper presents a novel approach to translation quality estimation at the word and sentence levels, based on BERT, which has recently achieved impressive results on various natural language processing tasks. Our proposed model re-purposes BERT for translation quality estimation and uses multi-task learning across the sentence-level task and the word-level subtasks (i.e., source word, target word, and target gap). Experimental results on the WMT19 Quality Estimation shared task show that our systems are competitive and provide significant improvements over the baseline.


Introduction
Translation quality estimation (QE), which estimates quality scores and categories for machine-translated sentences at various levels without reference translations (Specia et al., 2013), has become an important research topic in the field of machine translation (MT).
Recent approaches based on the Predictor-Estimator architecture (Kim and Lee, 2016a,b; Kim et al., 2017a,b, 2019) have significantly improved QE performance. The Predictor-Estimator is based on a modified neural encoder architecture consisting of two subsequent neural models: 1) a word prediction model, which predicts each target word given the source sentence and the left and right context of the target word, and 2) a quality estimation model, which estimates sentence-level scores and word-level labels from features produced by the predictor. The word prediction model is trained on additional large-scale parallel data, and the quality estimation model is trained on small-scale QE data.
Recently, BERT (Devlin et al., 2018) has led to impressive improvements on various natural language processing tasks. BERT is a language model bidirectionally trained on large-scale "monolingual" data to learn the "monolingual" context of a word based on all of its surroundings (to the left and right of the word).
BERT, which is based on the Transformer architecture (Vaswani et al., 2017), and the word prediction model in the Predictor-Estimator, which is based on the attention-based recurrent neural network (RNN) encoder-decoder architecture (Bahdanau et al., 2015; Cho et al., 2014), share common ground in that both rely on generative pre-training of a sentence encoder.
In this paper, we propose a "bilingual" BERT using multi-task learning for translation quality estimation (called QE BERT). We describe how we have applied BERT (Devlin et al., 2018) to the QE task to achieve substantial improvements. In addition, because the recent QE task consists of one sentence-level subtask to predict HTER scores and three word-level subtasks to detect errors for each source word, target (mt) word, and target (mt) gap, we have also applied multi-task learning (Kim et al., 2017b, 2019) to exploit the training data of the other QE subtasks 1 . The results of experiments conducted on the WMT19 QE datasets show that our proposed QE BERT using multi-task learning provides significant improvements over the baseline system.

QE BERT
In this section, we describe two training steps for QE BERT: pre-training and fine-tuning. Figure 1 shows the QE BERT architecture, which predicts HTER scores in the sentence-level subtask and detects errors in the word-level source word, mt word, and mt gap subtasks. The sentences are tokenized using WordPiece tokenization.

Pre-training
The original BERT (Devlin et al., 2018) focuses on "monolingual" natural language understanding using generative pre-training of a sentence encoder. QE BERT, which focuses on "bilingual" natural language understanding 2 , is pre-trained on parallel data to learn the bilingual context of a word based on all of its left and right surroundings.
In pre-training, a default [SEP] token is used to separate the source sentence and the target sentence of the parallel data. In addition, [GAP] tokens, newly introduced in this paper for the word-level target gap subtask, are inserted between the target words.
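The input layout described above can be sketched as follows. This is a minimal sketch: the helper name `build_qe_input` and the exact placement of [GAP] tokens (including one before the first and after the last target word, as the gap subtask labels N+1 gaps for N words) are assumptions based on the paper's description, not its released code.

```python
def build_qe_input(src_tokens, mt_tokens):
    """Build a bilingual input sequence: source and target separated by
    [SEP], with [GAP] tokens surrounding every target token.
    Hypothetical helper illustrating the layout described in the paper."""
    gapped_mt = ["[GAP]"]
    for tok in mt_tokens:
        gapped_mt.append(tok)
        gapped_mt.append("[GAP]")
    return ["[CLS]"] + src_tokens + ["[SEP]"] + gapped_mt + ["[SEP]"]

print(build_qe_input(["a", "cat"], ["eine", "Katze"]))
# -> ['[CLS]', 'a', 'cat', '[SEP]', '[GAP]', 'eine', '[GAP]', 'Katze', '[GAP]', '[SEP]']
```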
As the pre-training task of QE BERT, only the masked LM task over parallel sentences is conducted: 15% of the words are replaced with a [MASK] token, and the original values of the masked words are then predicted 3 . This pre-training allows large-scale parallel data to benefit the QE task. As the initial checkpoint for pre-training, we used the released multilingual model 4 .
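The masking step can be sketched as below. This is a simplified illustration, not the paper's implementation: the full BERT recipe also replaces some selected tokens with random tokens or keeps them unchanged, which is omitted here, and the special-token set and function name are assumptions.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace ~mask_prob of non-special tokens with [MASK] for the
    masked-LM objective; returns the masked sequence and a dict mapping
    masked positions to their original tokens (the prediction targets)."""
    rng = random.Random(seed)
    specials = {"[CLS]", "[SEP]", "[GAP]", "[MASK]"}
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok not in specials and rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets
```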

Fine-tuning
QE BERT is fine-tuned on QE data, starting from the above pre-trained model, for a specific target QE task.
As in the pre-training step, a [SEP] token is used to separate the source sentence and the machine translation sentence of the QE data.
[GAP] tokens are inserted between the words of the machine translation sentence.

Word-level QE
To compute word-level QE, the final hidden state h_t corresponding to each token embedding is used as follows:

P = softmax(W h_t)

where P is the label probabilities and W is the weight matrix used for word-level fine-tuning. Because the word-level QE task consists of the source word, mt word, and mt gap subtasks, three different weight matrices are used, one per subtask: W_src.word, W_mt.word, and W_mt.gap. Because each word of a sentence can be tokenized into several tokens, we first compute token-level labels as follows:

y_t = argmax_l P(l)

We then compute word-level labels from the token-level labels. In training, if a word is labeled 'BAD', all tokens within the word boundary receive 'BAD' labels. In inference, if any token within the word boundary is labeled 'BAD', the word-level QE output is 'BAD'.
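The inference-side token-to-word aggregation rule described above can be sketched as follows. The helper name and the [start, end) boundary representation are illustrative assumptions; only the "any BAD token makes the word BAD" rule comes from the paper.

```python
def tokens_to_word_labels(word_boundaries, token_labels):
    """Aggregate token-level OK/BAD labels to word level: a word is BAD
    if any of its WordPiece tokens is labeled BAD (the inference rule
    described in the paper)."""
    word_labels = []
    for start, end in word_boundaries:  # [start, end) token span per word
        span = token_labels[start:end]
        word_labels.append("BAD" if "BAD" in span else "OK")
    return word_labels

# e.g. word 0 -> tokens 0..1, word 1 -> token 2
print(tokens_to_word_labels([(0, 2), (2, 3)], ["OK", "BAD", "OK"]))
# -> ['BAD', 'OK']
```

The training direction is the mirror image: a word-level 'BAD' label is simply broadcast to every token inside the word boundary.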

Sentence-level QE
To compute sentence-level QE, the final hidden state h_s corresponding to the [CLS] token embedding, which is a fixed-dimensional pooled representation of the input sequence, is used as follows:

score = W_s h_s

where W_s is the weight matrix used for sentence-level fine-tuning.
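A minimal sketch of this sentence-level head is given below. The sigmoid squashing is an assumption on our part (HTER scores lie in [0, 1]; the paper does not state the output activation), and the function name and bias term are illustrative.

```python
import math

def sentence_score(h_cls, w_s, b_s=0.0):
    """Predict a sentence-level HTER score from the [CLS] hidden state:
    a linear projection, squashed by a sigmoid so the score lies in
    [0, 1] (the sigmoid is an assumption, not stated in the paper)."""
    z = sum(h * w for h, w in zip(h_cls, w_s)) + b_s
    return 1.0 / (1.0 + math.exp(-z))
```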

Multi-task learning
The QE subtasks at the word and sentence levels are highly related because their quality annotations are commonly based on the HTER measure. Quality-annotated data from other QE subtasks can therefore be helpful in training a QE model for a specific target QE task (Kim et al., 2019). To use the training data of other QE subtasks as a supplement to the target training data, we apply multi-task learning (Kim et al., 2017b, 2019). For multi-task learning of word-level QE, we use a linear summation of the word-level objective losses:

L_WORD = L_src.word + L_mt.word + L_mt.gap

where most QE BERT components are shared across the word-level source word, mt word, and mt gap subtasks, except for the output matrices W_src.word, W_mt.word, and W_mt.gap. Kim et al. (2019) showed that it is helpful to use word-level training examples when training a sentence-level QE model. For multi-task learning of sentence-level QE, we combine the sentence-level objective loss and the word-level objective losses by a simple linear summation of the per-task losses:

L_SENT = L_hter + L_src.word + L_mt.word + L_mt.gap

where most QE BERT components are shared across the sentence-level and word-level tasks, except for the output matrices of each task.

Experimental settings
The proposed learning methods were evaluated on the word-level and sentence-level English-Russian and English-German tasks of the WMT19 QE Shared Task 5 .
We used parallel data provided for the WMT19 news machine translation task 6 to pre-train QE BERT. The English-Russian parallel data set consisted of the ParaCrawl corpus, Common Crawl corpus, News Commentary corpus, and Yandex Corpus. The English-German parallel data set consisted of the Europarl corpus, ParaCrawl corpus, Common Crawl corpus, News Commentary corpus, and Document-split Rapid corpus.
In pre-training, we used the default hyperparameter setting of the released multilingual model. In fine-tuning, a sequence length of 512 was used to cover the length of the QE data.

Comparison of learning methods
Tables 1 and 2 show the experimental results obtained from the QE BERT using the different learning methods for the WMT19 word-level and sentence-level QE tasks. For both language pairs, using multi-task learning consistently improves the scores.
We made ensembles by combining five instances of QE BERT models. The word-level results of ensemble A are based on mixtures of the best-performing systems on each subtask (i.e., the source word, mt word, and mt gap tasks). On the other hand, the word-level results of ensemble B are based on an all-in-one system using a unified criterion 7 with the same model parameters for all word-level subtasks.
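One simple way to combine word-level predictions from several model instances is a per-position majority vote, sketched below. The paper does not specify its combination rule, so the voting scheme, the function name, and the tie-breaking behavior (first label seen wins) are all assumptions.

```python
from collections import Counter

def ensemble_word_labels(predictions):
    """Majority vote over word-level OK/BAD labels from several model
    instances. `predictions` is a list of per-model label sequences of
    equal length; ties fall back to the label encountered first."""
    ensembled = []
    for labels in zip(*predictions):  # labels at one position, across models
        ensembled.append(Counter(labels).most_common(1)[0][0])
    return ensembled

print(ensemble_word_labels([["OK", "BAD"], ["OK", "OK"], ["BAD", "BAD"]]))
# -> ['OK', 'BAD']
```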
Finally, Tables 3 and 4 show the results obtained on the WMT19 test set for our submitted systems and the official baseline systems.

Conclusion
In this paper, we explored an adaptation of BERT for translation quality estimation. Because the quality estimation task consists of one sentence-level subtask to predict HTER scores and three word-level subtasks to detect errors for each source word, target word, and target gap, we also applied multi-task learning to exploit the training data of the other subtasks. The results of experiments conducted on the WMT19 quality estimation datasets strongly confirmed that our proposed bilingual BERT using multi-task learning achieved significant improvements. Given this promising approach, we believe that BERT-based quality estimation models can be further advanced with more investigation.