MIPT System for Word-Level Quality Estimation

We explore different model architectures for the WMT 19 shared task on word-level quality estimation of automatic translation. We start with a model similar to Shef-bRNN, which we modify by using conditional random fields for sequence labelling. Additionally, we use a different approach for labelling gaps and source words. We further develop this model by including features from different sources such as BERT, baseline features for the task and transformer encoders. We evaluate the performance of our models on the English-German dataset for the corresponding shared task.


Introduction
Current methods of assessing the quality of machine translation, like BLEU (Papineni et al., 2002), are based on comparing the output of a machine translation system with several gold reference translations. The quality estimation tasks at the WMT 19 conference aim at detecting errors in automatic translation without a reference translation, at various levels (word level, sentence level and document level). In this work we predict word-level quality.
In the task the participants are given a source sentence and its automatic translation and are asked to label the words in the machine translation as OK or BAD. The machine translation system could have omitted some words in the translated sentence. To detect such errors participants are also asked to label the gaps in the automatic translation. A target sentence has a gap between every pair of neighboring words, one gap at the beginning of the sentence and one gap at the end of the sentence. We are also interested in detecting the words in the source sentence that led to errors in the translation. For this purpose participants are also asked to label the words in source sentences. The source labels were obtained based on the alignments between the source and the post-edited target sentences. If a target token is labeled as BAD in the translation, then all source tokens aligned to it are labeled as BAD as well.
In section 2 we introduce our base model, which is a modified version of phrase-level Shef-bRNN (Ive et al., 2018), and further develop it by using different methods of extracting features from the input alongside the bi-RNN features. In section 3 we describe our experimental setup, and in section 4 we present the scores achieved by our models. In section 5 we summarize our work and propose directions for further development.

Models
All of our models have two stages: feature extraction and tag prediction. The first stage uses different neural architectures, like a bi-LSTM encoder or BERT (Devlin et al., 2018), to extract features from the input sequences. Some models also use human-crafted features alongside the automatically generated ones. The second stage feeds the sequence of extracted features into a CRF (Lafferty et al., 2001) to obtain labels for words or gaps in the automatic translation.
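As an illustration of the second stage, the sketch below is a minimal plain-Python Viterbi decoder for a linear-chain CRF. The emission and transition scores here are hypothetical placeholders; the actual models use a learned CRF layer on top of the extracted features.

```python
def viterbi(emissions, transitions):
    """Most likely label sequence under a linear-chain CRF.

    emissions[t][y]   -- score of assigning label y at position t
    transitions[p][y] -- score of moving from label p to label y
    """
    n_labels = len(emissions[0])
    scores = list(emissions[0])  # best score ending in each label at position 0
    back = []                    # back-pointers for path recovery
    for em in emissions[1:]:
        new_scores, pointers = [], []
        for y in range(n_labels):
            best_p = max(range(n_labels),
                         key=lambda p: scores[p] + transitions[p][y])
            new_scores.append(scores[best_p] + transitions[best_p][y] + em[y])
            pointers.append(best_p)
        scores = new_scores
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    y = max(range(n_labels), key=lambda l: scores[l])
    path = [y]
    for pointers in reversed(back):
        y = pointers[y]
        path.append(y)
    return list(reversed(path))
```

With labels encoded as 0 = OK, 1 = BAD, a strong negative transition score discourages switching labels, which is exactly the kind of sequential constraint the CRF layer adds over per-token classification.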

RNN Features
Our base model is similar to phrase-level Shef-bRNN (Ive et al., 2018). We chose the phrase-level version of Shef-bRNN over the word-level version because we found it to be more understandable and intuitive.
The model is given a sequence of source tokens s 1 , . . . , s n and a sequence of target tokens t 1 , . . . , t m . The source sequence is fed into the source encoder, which is a bidirectional LSTM.
Thus, for every word s_j in the source a source vector h^src_j = [→h^src_j, ←h^src_j] is produced, where →h^src_j and ←h^src_j are the corresponding hidden states of the forward and backward LSTMs and [x, y] is the concatenation of vectors x and y. Similarly, the target sequence is fed into the target encoder, which is also a bidirectional LSTM, to obtain a target vector h^tgt_j for every word t_j in the target sequence. Global attention (Luong et al., 2015) is used to obtain a context vector c_j for every target vector h^tgt_j: c_j = Σ_i a_ji h^src_i, where the weights a_ji are obtained by applying softmax over the attention scores score(h^tgt_j, h^src_i). The vector c_j gives a summary of the source sentence, focusing on the parts which are most relevant to the target token. Using the same technique, we obtain a self-context vector sc_j for every target vector h^tgt_j by computing global attention for h^tgt_j over h^tgt_i, i ≠ j. The resulting feature vector is denoted as f^RNN_j = [h^tgt_j, c_j, sc_j] for every word t_j in the target sequence.
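The feature extraction above can be sketched in plain Python. This is a simplified illustration assuming the dot-product variant of the attention score and toy hidden-state vectors; the actual model uses learned bi-LSTM states and a learned attention layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def context_vector(query, states):
    """Attention-weighted sum of `states`, scored against `query` (dot score)."""
    weights = softmax([dot(query, h) for h in states])
    dim = len(states[0])
    return [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]

def rnn_features(tgt_states, src_states):
    """f_j = [h_tgt_j, c_j, sc_j]: target state, source context, self-context."""
    feats = []
    for j, h in enumerate(tgt_states):
        c = context_vector(h, src_states)
        # Self-context attends over all *other* target states (i != j).
        sc = context_vector(h, [s for i, s in enumerate(tgt_states) if i != j])
        feats.append(h + c + sc)
    return feats
```

Each feature vector is three times the encoder dimension, since the target state, the source context and the self-context are concatenated.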

Baseline Features
Specia et al. (2018) use a CRF (Lafferty et al., 2001) with a set of human-crafted features as the baseline model for the same task at WMT 18. The WMT 18 and WMT 19 tasks use the same English-German dataset, so we can use the baseline features provided with the WMT 18 dataset to further improve the performance of our model.
For every word t_j in the target sequence the baseline features form a sequence of 34 values b^1_j, ..., b^34_j, some of which are numerical, like the word count in the source and target sentences, and the others categorical, like the target token, the aligned source token and their part-of-speech (POS) tags. We represent categorical features using one-hot encoding. If a value of a categorical feature occurs fewer than min_occurs times in the train dataset, this value is ignored (i.e. it is represented by a zero vector). After the conversion all features are concatenated into a single feature vector f^Base_j.
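A minimal sketch of how the categorical baseline features might be encoded, assuming a simple frequency-thresholded vocabulary (the function names are illustrative, not part of the released code):

```python
from collections import Counter

def build_vocab(train_values, min_occurs=4):
    """Index only categorical values seen at least `min_occurs` times in training."""
    counts = Counter(train_values)
    kept = sorted(v for v, c in counts.items() if c >= min_occurs)
    return {v: i for i, v in enumerate(kept)}

def one_hot(value, vocab):
    """Rare or unseen values are represented by the all-zero vector."""
    vec = [0.0] * len(vocab)
    if value in vocab:
        vec[vocab[value]] = 1.0
    return vec
```

Numerical features can then be appended to the concatenated one-hot vectors to form f^Base_j.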

BERT Features
BERT is a model for language representation presented by Devlin et al. (2018) which demonstrated state-of-the-art performance on several NLP tasks. BERT is trained on a word prediction task and, as has been shown previously, word prediction can be helpful for the quality estimation task. Pretrained versions of BERT are publicly available, and we use one of them to generate features for our models.
To extract BERT features the target sequence is fed into a pretrained BERT model. It is important to note that we do not fine-tune BERT and just use its pretrained version as-is. BERT utilizes WordPiece tokenization (Wu et al., 2016), so for each target token t_j it produces k_j output vectors BERT^1_j, ..., BERT^{k_j}_j. However, we can only use a fixed-size feature vector for each target token. We noticed that about 83% of target tokens produce fewer than three BERT tokens, which means that by using only two of the produced vectors we preserve most of the information. To obtain the BERT feature vector, we concatenate the first and the last BERT outputs: f^BERT_j = [BERT^1_j, BERT^{k_j}_j]. We chose the first and the last outputs because this approach was the easiest to implement.
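The first-and-last concatenation can be illustrated with a small helper, where plain lists stand in for the BERT output tensors:

```python
def bert_token_feature(subword_outputs):
    """Fixed-size feature for one target token from its WordPiece outputs:
    the concatenation of the first and the last output vector.
    For a token that yields a single WordPiece, the same vector appears twice."""
    return subword_outputs[0] + subword_outputs[-1]
```

The resulting vector always has twice the BERT hidden size, regardless of how many WordPieces the token was split into.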

Transformer Encoder
We tried replacing bi-RNN encoders with transformer encoders (Vaswani et al., 2017) to include more contextual information in the encoder outputs.
The source transformer encoder produces outputs h^src_1, ..., h^src_n for the source sequence and the target transformer encoder produces outputs h^tgt_1, ..., h^tgt_m for the target sequence. After that, similarly to section 2.1, a context vector c_j is obtained for every word in the target sequence. For the transformer encoder we do not compute self-context vectors, as the transformer architecture itself utilizes the self-attention mechanism.
The resulting feature vector is denoted as f^Trf_j = [h^tgt_j, c_j].

Word Labelling
After the feature vectors for the target sequence have been obtained, they are fed into a CRF that labels the words in the translation. In this paper we explore architectures that use the following feature vectors:
• RNN: f^RNN_j;
• RNN+Baseline: [f^RNN_j, f^Base_j];
• RNN+BERT: [f^RNN_j, f^BERT_j];
• RNN+Baseline+BERT: [f^RNN_j, f^Base_j, f^BERT_j];
• Transformer: f^Trf_j;
• Transformer+Baseline+BERT: [f^Trf_j, f^Base_j, f^BERT_j].
To label words in the source sequence we use the alignments between the source sentence and the machine translation provided with the dataset. Specifically, if a source word s_j is aligned with a target word t_i which is labeled as BAD, then we label s_j as BAD as well. In case s_j is aligned with multiple target words, we label s_j as BAD if at least one of the aligned target words is labeled BAD.
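The source-labelling rule can be sketched as follows, assuming 0-based (src, tgt) alignment pairs (the function name is illustrative):

```python
def label_source(n_src, alignments, target_labels):
    """Project BAD labels from the translation onto the source sentence.

    alignments    -- list of (src_idx, tgt_idx) pairs (0-based)
    target_labels -- "OK"/"BAD" label for every target word
    A source word is BAD iff at least one aligned target word is BAD.
    """
    labels = ["OK"] * n_src
    for s, t in alignments:
        if target_labels[t] == "BAD":
            labels[s] = "BAD"
    return labels
```

Unaligned source words stay OK by default, since no target evidence marks them as erroneous.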

Gap Labelling
Unlike word-level Shef-bRNN, we refrain from using a dummy word to predict gap tags, because increasing the input sequence length might make it difficult for encoders to carry information between distant words. Instead, we train different models for word labelling and gap labelling.
To modify a word labelling architecture Arch, where Arch is either RNN+Baseline+BERT or Transformer+Baseline+BERT, to label gaps, we construct a new sequence of features g_j = [f^Arch_j, f^Arch_{j+1}] for j = 0, ..., m. Here we assume f^Arch_0 and f^Arch_{m+1} to be zero vectors.
After the new sequence has been constructed, we feed it into a CRF to label the gaps.
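The gap feature construction might look like this in plain Python (the helper name is illustrative):

```python
def gap_features(word_feats, dim):
    """Features for the m + 1 gaps of a sentence with m words: the feature of
    gap j concatenates the features of the two surrounding words, with zero
    vectors standing in for the missing words at the sentence boundaries."""
    zero = [0.0] * dim
    padded = [zero] + word_feats + [zero]
    return [padded[j] + padded[j + 1] for j in range(len(padded) - 1)]
```

A sentence with m words thus yields m + 1 gap features, matching the gap positions the task asks us to label.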

Experimental Setup
We train and evaluate our models on the English-German dataset of the WMT 19 Word-Level Quality Estimation task. In our experiments we did not utilize pre-training or multi-task learning, unlike some versions of Shef-bRNN. All our models were implemented in PyTorch; the code is available online. 1 For RNN feature extraction we use the OpenNMT (Klein et al., 2017) bi-LSTM encoder implementation with 300 hidden units in both the backward and forward LSTMs for models that label words and 150 hidden units for models that label gaps. We used FastText models (Grave et al., 2018) for English and German to produce word embeddings.
Baseline features were provided with the dataset.
In our experiments we used min_occurs = 4 when building the baseline feature vocabularies.
The pretrained BERT model was provided by the pytorch-pretrained-bert package. 2 In our experiments we used the bert-base-multilingual-cased version of BERT.
We trained our models using the PyTorch implementation of the ADADELTA algorithm (Zeiler, 2012) with all parameters, except the learning rate, set to their default values. For the training loss to converge we used a learning rate of 1 for the RNN and Transformer models, a learning rate of 0.3 for the RNN+Baseline model and a learning rate of 0.1 for the RNN+BERT, RNN+Baseline+BERT and Transformer+Baseline+BERT models. The inputs were fed into the model in mini-batches of 10 samples.

Results
We used the English-German dataset provided in the WMT 19 Shared Task on Word-Level Quality Estimation. The primary metric for each type of token (source words, target words and gaps) is F1-Mult, which is the product of the F1 scores for the BAD and OK labels.
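The F1-Mult metric can be computed with a short helper; this is a sketch of the definition above, not the official evaluation script:

```python
def f1(gold, pred, cls):
    """Per-class F1 score over parallel gold/predicted label sequences."""
    tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
    fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
    fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_mult(gold, pred):
    """F1-Mult: the product of the F1 scores of the BAD and OK classes."""
    return f1(gold, pred, "BAD") * f1(gold, pred, "OK")
```

Because the metric multiplies both class scores, a system that predicts only OK (or only BAD) receives an F1-Mult of zero.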
The scores for each system are presented in Table 1 (participation results), Table 2 (target words), Table 3 (source words) and Table 4 (gaps).
For the WMT 19 task we submitted the RNN+Baseline+BERT and Trans-former+Baseline+BERT models which correspond to the Neural CRF RNN and Neural CRF Transformer entries in the public leaderboard.
We do not have scores for the WMT 18 Baseline system and the Shef-bRNN system on the development dataset, so we can directly compare them with only two of our systems from Table 1.
Both of these systems perform on par with Shef-bRNN, and the Transformer+Baseline+BERT model was able to achieve a slightly better score for target classification. Word-level Shef-bRNN seems to outperform all of our other systems, most likely because it uses a more appropriate architecture for the task. All of our systems seem to outperform the WMT 18 baseline system.
The BERT features turned out to improve the performance slightly: an increase of 0.02 for target labelling and 0.01 for source labelling. The baseline features, on the other hand, have a greater impact on the model's performance, increasing the score by 0.05 for target labelling and by 0.04 for source labelling. Replacing the bi-RNN encoder with a transformer encoder also improved the score, by 0.03 relative to the RNN+Baseline+BERT configuration.

Conclusion
We applied different neural systems to the task of word-level quality estimation. We measured their performance in comparison to each other and the baseline system for the task. All of our systems outperformed the WMT 18 baseline on the development dataset and can be trained in a couple of hours on a single Tesla K80 GPU.
Our models can be further improved by fine-tuning BERT and by utilizing multi-task learning, as proposed in prior work.