Modeling Confidence in Sequence-to-Sequence Models

Recently, significant improvements have been achieved in various natural language processing tasks using neural sequence-to-sequence models. While aiming for the best generation quality is important, ultimately it is also necessary to develop models that can assess the quality of their output. In this work, we propose to use the similarity between training and test conditions as a measure for models’ confidence. We investigate methods solely using the similarity as well as methods combining it with the posterior probability. While traditionally only target tokens are annotated with confidence measures, we also investigate methods to annotate source tokens with confidence. By learning an internal alignment model, we can significantly improve confidence projection over using state-of-the-art external alignment tools. We evaluate the proposed methods on downstream confidence estimation for machine translation (MT). We show improvements on segment-level confidence estimation as well as on confidence estimation for source tokens. In addition, we show that the same methods can also be applied to other tasks using sequence-to-sequence models. On the automatic speech recognition (ASR) task, we are able to find 60% of the errors by looking at 20% of the data.


Introduction
Deep learning methods have significantly increased the quality of natural language generation tasks such as Machine Translation (MT). However, when deployed in a production environment, understanding the model's confidence and how well it correlates with output quality is as important as training the best models.
While humans are often capable of estimating whether their decisions are sensible or produced by random guesses, it is often not possible to know how confident deep learning models are with respect to their output (Gal, 2016). However, information regarding confidence can be essential in production scenarios. In cases with a human-in-the-loop, confidence can be used to identify the parts of the machine output that require human intervention, e.g. in post-editing for machine translation or to guide reformulation of the original input to simplify the task for sequence-to-sequence models.
Intuitively, models should have higher confidence towards data points that are similar to their training data. Motivated by this, our first contribution is an autoencoder network that is applied as an extension to the sequence-to-sequence models to measure the training-testing discrepancy. In contrast to methods that directly compare the test and training data to generate confidence scores, we do not need to store the whole training data, thereby enabling our method to scale to larger datasets and tasks.
Motivated by the successful application of posterior probabilities for confidence estimation in statistical machine translation (SMT) (Ueffing and Ney, 2007) and traditional ASR systems (Siu and Gish, 1999), our second contribution is a combination of our approach with this prior approach.
Traditionally, confidence estimation has been defined as the task of assessing the quality of the whole sequence of words in the target sentence. Especially when evaluating translations, there are also several cases in which it can be very beneficial to estimate how well the source words are translated beyond coverage. For example, a person who only speaks the source language might be able to reformulate the source sentence if they know that the system has difficulties with certain words. As our third contribution, we present a method to estimate the alignment between source and target tokens in complex sequence-to-sequence models. We show that this strongly outperforms external state-of-the-art alignment methods.
Our experiments show that in machine translation, the posterior probabilities can be competitive with automatic metrics in terms of correlation with human evaluation. For speech recognition, we are able to find 60% of the errors by looking at 20% of the data.

Confidence Estimation Task
Depending on the use case, there are different ways to define the task of confidence estimation. Furthermore, there is no clear separation between confidence estimation and quality estimation. A first important dimension is the granularity of the predictions. We investigate three different use cases in this work, described in the next three subsections in greater detail.
Previous methods differ in whether they predict continuous values or discrete labels. In this work, we will predict continuous values, but evaluate against gold standard labels. In Section 5, we describe in detail how we map continuous predictions to discrete labels.
In addition, previous methods differ in whether they can be trained on gold standard labels or whether no annotated training data is available. Training data is a particular challenge in confidence estimation since annotations are associated with the output of a particular model. This raises the question of whether the task is to estimate the quality of any model, or of a particular one. This has implications for whether we can use model-internal information or not. In this work, we focus on the situation where we want to estimate the confidence of a particular model, using internal information. Since in a realistic scenario we are not able to collect annotated data for each model we are interested in, we further do not use any labeled training data.

Granularity
First, the confidence of the whole output sequence can be estimated. Given an input sequence $X = x_1 \dots x_{I_w}$ and an output sequence $Y = y_1 \dots y_{J_w}$, the model estimates a quality $c$ for the whole sequence. We will present several methods that calculate a sequence of confidence estimations $c_1, \dots, c_L$. Therefore, we need an additional aggregation function for the sequence confidence estimates. In all our experiments, we use the minimum as the aggregation function.
In some use cases, it is important to get a more fine-grained quality estimation. To be specific, we aim at estimating the confidence of every target token $y_j$ instead of one single score for the sequence. Given an input sequence $X = x_1 \dots x_{I_w}$ and an output sequence $Y = y_1 \dots y_{J_w}$, the output will be a sequence of quality estimations $C = c_1 \dots c_{J_w}$. One additional challenge is that we might be interested in the confidence at a different granularity than the one predicted by the model, $c_1, \dots, c_L$ (with $L \neq J_w$). For example, the user is interested in word-based confidence, while the system uses subword units. In this case, we assume that we have a mapping $m$ between the positions $1 \dots J_w$ and $1 \dots L$. In the example of subwords, this is straightforward because the segmentation is recoverable. Then, we also need an additional aggregation function for the confidence estimates. We estimate the confidence $c_j$ by $\mathrm{agg}_{m(l)=j}(c_l)$. For this type of aggregation we also use the minimum.
In machine translation, it is not only the confidence at the output level that is of interest, but also how adequately each individual source token is translated. From an application point of view, when the machine translation is used in an interactive scenario, this feature for example enables the user to reformulate the source sentence in order to avoid phrases that the system is not able to handle.
Formally, given an input sequence $X = x_1 \dots x_{I_w}$ and an output sequence $Y = y_1 \dots y_{J_w}$, the model estimates a sequence of confidence measures $C = c_1 \dots c_{I_w}$. Therefore, in this case, given the estimation of the model $c_1, \dots, c_L$, we need a mapping $m$ between the positions $1 \dots I_w$ and $1 \dots L$.
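The minimum aggregation used throughout this section can be sketched as follows. This is only an illustration: the function names `aggregate_min` and `sequence_confidence`, the list-based mapping, and the toy numbers are assumptions for this example and are not taken from our implementation. The sketch also assumes that every coarse position receives at least one unit.

```python
from collections import defaultdict


def aggregate_min(confidences, mapping):
    """Aggregate unit-level confidences c_l into coarser confidences c_j.

    confidences: one float per model unit (e.g. per subword).
    mapping:     mapping[l] = j, the coarse position (e.g. word) of unit l.
    Returns a list whose j-th entry is the minimum over all c_l with m(l) = j.
    Assumption: every coarse position is hit by at least one unit.
    """
    groups = defaultdict(list)
    for c, j in zip(confidences, mapping):
        groups[j].append(c)
    return [min(groups[j]) for j in sorted(groups)]


def sequence_confidence(confidences):
    """Sequence-level confidence: the minimum over all unit confidences."""
    return min(confidences)


# Toy example: four subwords forming three words (values are made up).
subword_conf = [0.9, 0.4, 0.8, 0.95]
subword_to_word = [0, 0, 1, 2]
word_conf = aggregate_min(subword_conf, subword_to_word)
seq_conf = sequence_confidence(subword_conf)
```

The same helper can be reused for source token confidence by passing a mapping from model positions to source words instead.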

Posterior Probabilities
As a baseline for our experiments, we use the posterior probabilities. The intuition behind this technique is that the model will distribute the probability mass over several outputs in low-confidence situations. In contrast, if the model is confident about its prediction, it should assign a high probability to the prediction.
Formally, given an input sequence $X = x_1 \dots x_{I_w}$ and an output sequence $Y = y_1 \dots y_{J_w}$, we first segment them into the input tokens $X = x_1 \dots x_{I_t}$ and the output tokens $Y = y_1 \dots y_{J_t}$ (e.g. by using subwords). The encoder first calculates a sequence of hidden states $E = e_1, \dots, e_{I_t} = \mathrm{ENC}(X)$. Second, we predict the target hidden states $D = d_1, \dots, d_{J_t} = \mathrm{DEC}(E, Y)$. Finally, we can use the posterior probabilities $P = p_1, \dots, p_{J_t}$ calculated by $p_j = \mathrm{softmax}(FF(d_j))[y_j]$ (1), where $FF$ is a linear transformation and $[k]$ indicates the $k$-th element of the vector. By using $P$ for $C$ as described in Section 2.1, we can now calculate a sequence confidence or an output confidence.
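As a rough illustration of the posterior-based confidence, the following NumPy sketch applies a linear output layer and a softmax to decoder states and reads off the probability of each produced token. The weights, dimensions, and the name `token_posteriors` are hypothetical stand-ins; in a real system the decoder states and the output layer come from the trained model.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def token_posteriors(decoder_states, W_ff, b_ff, target_ids):
    """Return p_j = softmax(FF(d_j))[y_j] for every target position j.

    decoder_states: (J_t, hidden) array of decoder hidden states d_j.
    W_ff, b_ff:     parameters of the linear output layer FF (assumed shapes
                    (hidden, vocab) and (vocab,)).
    target_ids:     (J_t,) integer ids of the generated tokens y_j.
    """
    probs = softmax(decoder_states @ W_ff + b_ff)       # (J_t, vocab)
    return probs[np.arange(len(target_ids)), target_ids]


# Toy data with random values, purely to show the shapes involved.
rng = np.random.default_rng(0)
hidden, vocab, length = 8, 20, 5
D = rng.normal(size=(length, hidden))
W = rng.normal(size=(hidden, vocab))
b = np.zeros(vocab)
y = np.array([3, 7, 7, 0, 19])

P = token_posteriors(D, W, b, y)   # one confidence per target token
sequence_conf = P.min()            # minimum aggregation over the sequence
```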
Training similarity

Approaches to measure similarity
Two sentences can be similar in many ways. Therefore, there are also many ways to estimate the similarity between sentences. For our use case, it is important how similar the sentence representations generated by the translation system are. Hence, we use the internal representations of the neural machine translation model to measure the similarity of the sentences. In an NMT system, there are different representation levels which can be used to measure the similarity of the sentences. For example, we can use the final encoder hidden states, the final decoder hidden states, or the context vectors. As motivated in the introduction, one interesting use case for confidence is to find difficult source segments, so that the user can rewrite them. For this case, we concentrate on the encoder hidden states.
We measure the training-test similarity as follows. First, we run the encoder on the source side of the training data and store the encoder hidden representations (top layer) for every training sentence $k$. Second, we calculate the hidden representations of the test sentence ($E^{tst} = e^{tst}_1, \dots, e^{tst}_{I^{tst}}$) and use approximate k-nearest neighbor search (implemented in the Annoy toolkit) to find the closest stored training representations.
We investigated two methods to estimate the similarity, one on the sentence level and one on the token level. First, we use the distance to the overall most similar training sentence, computed between the average vectors of the encoder hidden states of the training and the test sentence. We denote this distance by $s$ and use it directly as the sequence confidence $c$ from Section 2.1.
The second method estimates the confidence for each source token. This is achieved by finding the nearest neighbor for each encoder hidden state $e^{tst}_i$, which yields one score $s_i$ per token.
By using $S = s_1 \dots s_{I^{tst}}$ as $C$ in Section 2.1, we can calculate a sequence confidence or a confidence for each input token.
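The two distance-based scores can be illustrated with a small NumPy sketch. Note the assumptions: it uses exact brute-force search instead of the approximate search from the Annoy toolkit, it uses the Euclidean distance, and the function names and toy vectors are invented for the example. A larger distance indicates that the test input is further from the training data.

```python
import numpy as np


def sentence_distance(test_states, train_sentence_means):
    """Distance from the averaged test states to the closest training sentence.

    test_states:          (I_tst, hidden) encoder states of one test sentence.
    train_sentence_means: (K, hidden) averaged encoder states, one row per
                          stored training sentence.
    """
    query = test_states.mean(axis=0)
    dists = np.linalg.norm(train_sentence_means - query, axis=1)
    return dists.min()


def token_distances(test_states, train_token_states):
    """Distance from each test encoder state to its nearest training state.

    Brute-force search (an assumption for readability); an approximate
    nearest-neighbor index would replace this step for real data sizes.
    """
    diffs = test_states[:, None, :] - train_token_states[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)      # (I_tst, N_train)
    return dists.min(axis=1)


# Toy 2-dimensional "hidden states" (made-up values).
train_tokens = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
train_means = np.array([[0.0, 0.0], [1.5, 2.0]])
test = np.array([[0.0, 0.0], [3.0, 4.0]])

token_dist = token_distances(test, train_tokens)
sent_dist = sentence_distance(test, train_means)
```

In this toy case the first test token coincides with a stored training state, while the second one is far from all of them and would therefore be flagged as the less reliable token.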

Similarity estimation
The main disadvantage of the aforementioned method is that we need to calculate and store the hidden representations of all training examples. Such storage consumption is non-trivial even for small datasets like the TED corpus, and it is infeasible for large-scale sequence-to-sequence models. Therefore, we also investigate methods to approximate the distance without storing the hidden states for the whole training data. Here we propose to approximate this distance by using autoencoders. The autoencoder will be able to reconstruct typical hidden states seen in the training data, while the reconstruction of unusual hidden states will be less exact.
As shown in Figure 1, we use an autoencoder with a single hidden layer. In our experiments, we investigate different hidden sizes of the autoencoder. After the hidden layer, we apply the sigmoid activation function before predicting the output.
We can then use the quality of the reconstruction as a measure of the model's confidence in its predictions. We found that a confidence estimate can be obtained by measuring the $L_2$ distance between the hidden representation and its reconstruction.
As for the direct measurement, we can use $S^e = s^e_1 \dots s^e_{I^{tst}}$ as $C$ to calculate the confidence of the sequence or of each input token. Furthermore, by using the decoder states $D$ instead of the encoder states $E$, we can calculate $S^d$ accordingly and use it to estimate the sequence or target token confidence.
Figure 1: Architecture of the autoencoder
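The reconstruction-based score can be sketched as below. This is a forward-pass illustration only: the class name `AutoEncoder`, the initialization, and the dimensions are assumptions, and the weights here are random. In practice the autoencoder would first be trained to reconstruct hidden states of the training data, which is omitted from this sketch.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class AutoEncoder:
    """Single-hidden-layer autoencoder with a sigmoid bottleneck.

    Weights are randomly initialized here (assumption for the sketch);
    training on the model's hidden states is not shown.
    """

    def __init__(self, input_size, bottleneck, rng):
        self.W_in = rng.normal(scale=input_size ** -0.5,
                               size=(input_size, bottleneck))
        self.b_in = np.zeros(bottleneck)
        self.W_out = rng.normal(scale=bottleneck ** -0.5,
                                size=(bottleneck, input_size))
        self.b_out = np.zeros(input_size)

    def reconstruct(self, states):
        hidden = sigmoid(states @ self.W_in + self.b_in)
        return hidden @ self.W_out + self.b_out


def reconstruction_distance(auto, states):
    """L2 distance between each hidden state and its reconstruction.

    A large distance means the state is unlike what the autoencoder has
    learned to reconstruct, which we read as low confidence.
    """
    return np.linalg.norm(states - auto.reconstruct(states), axis=-1)


rng = np.random.default_rng(0)
auto = AutoEncoder(input_size=16, bottleneck=8, rng=rng)  # half the input size
states = rng.normal(size=(5, 16))    # stand-in for encoder or decoder states
scores = reconstruction_distance(auto, states)
```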

Combining both approaches
While using similarity measurements is able to estimate the quality of the whole sequence as well as of parts of the sequence, it also has two drawbacks. First, in the $L_2$ norm all dimensions are equally important, while this might not be the case for the final prediction of the words. Second, we are only looking at the similarity between training and test condition, but ignoring that some outputs might be inherently more difficult to predict than others.
Therefore, we combine both techniques and thereby minimize their respective drawbacks. To do so, the hidden representation is first replaced by the reconstruction generated by the autoencoder. After that, we calculate the probabilities based on these reconstructed hidden representations. If we use the autoencoder on the decoder states, Equation 1 needs to be modified to $p^s_j = \mathrm{softmax}(FF(\mathrm{Auto}(d_j)))[y_j]$. By using $P^s$ for $C$ as described in Section 2.1, we can now calculate a sequence confidence or an output confidence. Similarly, we can replace $\mathrm{Auto}(d_j)$ by $d_j$ with $D = \mathrm{DEC}(\mathrm{Auto}(E), Y)$ to use the autoencoder on the encoder hidden states. It is important to note that the similarity approximated by the encoder hidden states can be used for source token confidence estimation, while the combination of autoencoders on the source hidden states and posterior probabilities can only be used for target token confidence estimation. One advantage of this combination is that no additional parameters are introduced.
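A minimal sketch of the combined score on the decoder side is given below: the decoder states are passed through a reconstruction function before the output layer and softmax are applied. The helper `make_autoencoder`, its random weights, and all dimensions are hypothetical; only the order of operations is meant to mirror the description above.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def make_autoencoder(rng, size, bottleneck):
    """Return a reconstruction function with random weights (assumption:
    a trained autoencoder would be used here instead)."""
    W_in = rng.normal(scale=size ** -0.5, size=(size, bottleneck))
    W_out = rng.normal(scale=bottleneck ** -0.5, size=(bottleneck, size))

    def reconstruct(states):
        hidden = 1.0 / (1.0 + np.exp(-(states @ W_in)))   # sigmoid bottleneck
        return hidden @ W_out

    return reconstruct


def combined_posteriors(decoder_states, reconstruct, W_ff, b_ff, target_ids):
    """p^s_j = softmax(FF(Auto(d_j)))[y_j]: posterior on reconstructed states."""
    reconstructed = reconstruct(decoder_states)
    probs = softmax(reconstructed @ W_ff + b_ff)
    return probs[np.arange(len(target_ids)), target_ids]


rng = np.random.default_rng(0)
hidden, vocab, length = 8, 20, 4
D = rng.normal(size=(length, hidden))
W = rng.normal(size=(hidden, vocab))
b = np.zeros(vocab)
y = np.array([1, 5, 5, 12])

auto = make_autoencoder(rng, hidden, hidden // 2)
P_s = combined_posteriors(D, auto, W, b, y)

# With an identity "autoencoder" the score reduces to the plain posterior.
P_plain = combined_posteriors(D, lambda s: s, W, b, y)
```

Because the existing output layer is reused on the reconstructed states, the combination itself adds no parameters beyond those of the autoencoder.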

Alignment
While the previously presented models are all able to generate confidence measures for each target token, only the distance-based similarity measures are able to also generate scores for the source tokens. To generate source token confidence for the other measures, a straightforward approach is to use word alignment to map the confidence scores from the target side to the source side. Our baseline for these experiments uses the IBM4 GIZA alignment model (Och and Ney, 2003) to map the posterior probabilities and the combined approach's confidence estimations from target to source tokens. If several target tokens align to the same source token, we again use the minimal confidence.
Motivated by our autoencoding approach to measure similarity between training and test data, we investigate similar approaches to model the alignment between source and target tokens. In this case, we use a model to predict a target hidden state $d_j$ given a source state $e_i$. If a source word aligns to a target word, it should be possible to predict this target word primarily based on this source word. Therefore, we choose the same architecture as for the autoencoder. We use the source hidden state to predict a target hidden state. Then, we compare the predicted hidden state to all decoder hidden states and describe the alignment strength between the source and target hidden state using the cosine similarity between the predicted hidden state and the target hidden state. Let $NN()$ be the neural network-based predictor. We then calculate the alignment by $a_{ij} = \cos(NN(e_i), d_j)$ (7). Based on the alignment scores, we create an alignment matrix by aligning each source word to the target word with the strongest link according to Equation 7.
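The following sketch shows how such cosine-based alignment scores could be turned into source token confidences. The predicted states are given directly as an array, standing in for the output of the predictor network, and the names `cosine_alignment` and `strongest_links` as well as the toy vectors are assumptions for this example. All vectors are assumed to be non-zero.

```python
import numpy as np


def cosine_alignment(predicted, decoder_states):
    """Alignment scores a_ij = cos(predicted_i, d_j).

    predicted:      (I, hidden) target states predicted from the source states.
    decoder_states: (J, hidden) actual decoder hidden states.
    Assumption: no row is the zero vector.
    """
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    d = decoder_states / np.linalg.norm(decoder_states, axis=1, keepdims=True)
    return p @ d.T                                  # (I, J)


def strongest_links(alignment):
    """Align each source word to the target word with the highest score."""
    return alignment.argmax(axis=1)


# Toy example: three source positions, two target positions.
predicted = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]])
decoder = np.array([[0.0, 2.0], [3.0, 0.0]])
target_conf = np.array([0.2, 0.9])      # e.g. posterior probabilities

alignment = cosine_alignment(predicted, decoder)
links = strongest_links(alignment)
source_conf = target_conf[links]        # project target confidence to source
```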
Since no confidence labels with aligned source and target words are available, we cannot simply train the neural network in a supervised way. Inspired by the GIZA model, we utilize the EM algorithm for training. Given an alignment $a^*$, we can train our model using the following MSE-based loss function: $\mathcal{L} = \sum_{i=1}^{I} \lVert NN(e_i) - d_{a^*_i} \rVert^2$. This can be extended for soft alignments $a$ to $\mathcal{L} = \sum_{i=1}^{I} \sum_{j=1}^{J} a_{ij} \lVert NN(e_i) - d_j \rVert^2$. This corresponds to the M-step in the EM algorithm. To be able to train the model using this loss function, we need to estimate an alignment $a$ in the E-step. Given the source representation $e_1, \dots, e_I$ of a sentence, we use the predictor to calculate the predictions $p_1, \dots, p_I$. Based on this, we calculate the alignment similarities $a_{ij}$ as the cosine similarities between $p_i$ and the decoder hidden states $d_j$. In order to prevent the model from collapsing into aligning all source words to the most obvious words, e.g. the period at the end of the sentence, we normalize them to probabilities for each target word: $a_{ij} = a_{ij} / \sum_{i'=1}^{I} a_{i'j}$.
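One EM iteration as described above might look like the following sketch. Several details are assumptions made for illustration: the clipping of negative cosine values before normalization, the unnormalized sum in the loss, and the function names. The predictor output is again represented by a plain array, and the gradient update of the M-step is left out.

```python
import numpy as np


def e_step(predicted, decoder_states, eps=1e-8):
    """Soft alignment: cosine similarities normalized per target word.

    Returns an (I, J) matrix whose columns sum to one, so every target word
    distributes its alignment mass over the source positions.
    Assumption: similarities are clipped to be positive before normalizing.
    """
    p = predicted / np.linalg.norm(predicted, axis=1, keepdims=True)
    d = decoder_states / np.linalg.norm(decoder_states, axis=1, keepdims=True)
    sim = np.clip(p @ d.T, eps, None)
    return sim / sim.sum(axis=0, keepdims=True)


def soft_mse_loss(predicted, decoder_states, alignment):
    """M-step objective: alignment-weighted squared error.

    loss = sum_i sum_j a_ij * ||predicted_i - d_j||^2
    """
    diffs = predicted[:, None, :] - decoder_states[None, :, :]   # (I, J, H)
    squared = (diffs ** 2).sum(axis=2)                           # (I, J)
    return float((alignment * squared).sum())


rng = np.random.default_rng(0)
predicted = rng.normal(size=(3, 4))        # stand-in for NN(e_1), ..., NN(e_I)
decoder_states = rng.normal(size=(4, 4))   # d_1, ..., d_J

alignment = e_step(predicted, decoder_states)
loss = soft_mse_loss(predicted, decoder_states, alignment)
```

In a full training loop, the loss would be minimized with respect to the predictor parameters while the alignment is held fixed, and the two steps would alternate.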

Evaluation
In this work, we evaluate the ability of sequence-to-sequence models to estimate their confidence in their own output on two different tasks: MT and ASR.
It is necessary to define a gold standard for the evaluation. For ASR, there is only one ground truth. Accordingly, we can label each output word from the model as correct/substitution/deletion/insertion. Our confidence measurement is then done on the word level (predicting whether the word is correct or not).
For machine translation, a single correct translation for each source sentence does not exist. To account for this, our experiments are carried out in the following way: We collected annotations with incorrectly translated source words for 1177 sentence pairs, resulting in 39.93% of the source sentences containing mistranslated words. We were not able to test our methods on existing quality estimation data sets, as we cannot access internal model information for this data.
Given the reference labels, the next step is to measure the quality of the confidence measures. In our experiments, we use four different measures. In the first scenario, we assume that the user has a fixed amount of time and wants to maximize the improvements. Therefore, we calculate the confidence score for all the test data and look at the 10% and 20% of the test data to which the model has assigned the lowest confidence. Then, we measure what percentage of the errors according to the reference are found in this part of the data.
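The "found errors" measure can be computed as in the following sketch. The function name `found_errors`, the rounding of the inspection budget, and the toy data are assumptions for illustration.

```python
import numpy as np


def found_errors(confidences, is_error, fraction):
    """Share of all errors contained in the lowest-confidence part of the data.

    confidences: one confidence score per item (sentence or word).
    is_error:    reference labels, True where the item is wrong.
    fraction:    inspected share of the data, e.g. 0.1 or 0.2.
    """
    confidences = np.asarray(confidences, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    n_inspect = int(round(fraction * len(confidences)))
    lowest = np.argsort(confidences)[:n_inspect]   # least confident items
    total_errors = is_error.sum()
    if total_errors == 0:
        return 0.0
    return float(is_error[lowest].sum() / total_errors)


# Ten toy items, three of which are errors (made-up values).
conf = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.95, 0.85, 0.6, 0.5]
errors = [1, 0, 1, 0, 0, 0, 0, 0, 1, 0]

found_at_10 = found_errors(conf, errors, 0.10)
found_at_20 = found_errors(conf, errors, 0.20)
```

In this toy case, inspecting the two least confident items already uncovers two of the three errors.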
In another scenario, we want to dynamically correct as many sentences as would be beneficial. This can be measured using the F-score. Since we need to map the confidence scores to labels, we have the additional challenge of finding a good threshold for when to assign the label "high confidence" or "low confidence" to an output sentence or word. Therefore, we report oracle F-scores using an optimal threshold found on the test data. Furthermore, we evaluate an approach to find this threshold in an unsupervised manner: While our baseline system uses beam search with a beam size of 8, we also perform greedy decoding. We assume that the model is not confident if the beam search leads to a different outcome from the greedy decoding, and create pseudo-labels where each segment or token is labeled as wrong if the results of beam search and greedy search differ. Then, we select the optimal threshold by comparing the predicted scores with these pseudo-labels and evaluate the approach on the real labels.
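The threshold selection can be sketched as follows. It is an assumption of this example that the F-score is computed for the "error" class and that candidate thresholds are taken from the observed confidence values; the pseudo-labels are given directly as a boolean array standing in for "beam search output differs from greedy output".

```python
import numpy as np


def f_score(predicted_error, is_error):
    """F1 score for the error class (assumed positive class)."""
    true_pos = np.sum(predicted_error & is_error)
    if true_pos == 0:
        return 0.0
    precision = true_pos / predicted_error.sum()
    recall = true_pos / is_error.sum()
    return float(2 * precision * recall / (precision + recall))


def best_threshold(confidences, labels):
    """Threshold t maximizing the F-score of 'confidence <= t means error'."""
    candidates = np.unique(confidences)
    scores = [f_score(confidences <= t, labels) for t in candidates]
    return float(candidates[int(np.argmax(scores))])


conf = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
true_error = np.array([True, True, False, True, False, False])
# Pseudo-labels: True where beam search and greedy decoding disagree.
pseudo_error = np.array([True, True, True, False, False, False])

# Oracle: threshold tuned on the real labels.
oracle_t = best_threshold(conf, true_error)
oracle_f = f_score(conf <= oracle_t, true_error)

# Unsupervised: threshold tuned on pseudo-labels, evaluated on real labels.
pseudo_t = best_threshold(conf, pseudo_error)
pseudo_f = f_score(conf <= pseudo_t, true_error)
```

The gap between `oracle_f` and `pseudo_f` indicates how much is lost by tuning the threshold without annotated data.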

Experiments
The sequence-to-sequence models in our work are based on the state-of-the-art Transformer architecture (Vaswani et al., 2017). We followed the model configuration with the learning rate schedule from the Base configuration in the original work. The number of layers is adapted for each task for the best performance possible and will be reported respectively. The autoencoders are implemented on top of the Transformer (with PyTorch (Paszke et al., 2017)) using one hidden layer with different sizes and a sigmoid activation function. The MT model is a 12-layer Transformer trained on the German-English TED corpus (Cettolo et al., 2012) with the development set and test set from the IWSLT 2017 evaluation campaign. The data is preprocessed with Moses tokenization and true-casing and segmented with byte-pair encoding (Sennrich et al., 2016) with 40K codes. The model achieves a BLEU score of 28.82 on the development set and 30.63 on the test data.
We conducted further ASR experiments on the Switchboard-1 Release 2 (LDC97S62) corpus, which contains over 300 hours of speech. The Hub5'00 evaluation data (LDC2002S09) was used as our test set. On this set, we are especially interested in the influence of the model performance on the quality estimation. Therefore, we trained 4 different models with 4, 8, 12 and 24 layers. These models achieve a WER of 20.8, 14.8, 13.0 and 12.1 on the Switchboard test set respectively, and 33.2, 25.5, 23.9 and 23.0 on the Callhome set.

Machine translation results
The first concern in the experiments is the performance on segment-level quality estimation for machine translation. The results are summarized in Table 1.
Two baseline systems are presented in this experiment. To measure the difficulty of the task, we use the BEER evaluation metric as comparison, which has been performing competitively in the WMT Metric evaluations (Stanojevic and Simaan, 2014). It is important to note that the metric has access to the reference translation, while the confidence measures do not. Even with this advantage, the metric does not clearly outperform a random baseline, showing the difficulty of the task. Using the model's posterior probability, we can improve on all four types of confidence measure. Among the 10% of the sentences with the lowest confidence, this method was able to find 17.66% of the sentences with errors. For this task, this is moreover the best performance. This confirms our hypothesis that the posterior probabilities can be reliable for modelling the system's confidence.
Proceeding to the experiments shown in the next two lines, we evaluate the ability of using the similarity between test and training data as a measure for confidence. Although not performing as well as the posterior probabilities, the data difference is a good estimator for the task difficulty and the confidence of the model. When comparing a single sentence representation (Enc Sent Distance) and the token representation (Enc Distance) in the next line, the second one outperforms the first one, except for the top 10%. Therefore, it seems to be important to measure the distance of each individual token and not only of the whole sentence.
Motivated by these results, we trained the autoencoders on the individual tokens and not on the whole sentence and used the autoencoder networks on the source hidden representation to estimate the performance. We analyze the influence of the size of the bottleneck of the autoencoder. The network with a bottleneck size of 256 (Enc Auto 256), which is half the input size, managed to get the best performance in all measures. While we see a drop in performance due to the approximation, e.g. from 32.77% to 29.57% when looking at 20% of the data, this is still better than BEER.
We performed the same experiment using the target hidden representations. Again, we investigated the influence of the bottleneck size and achieved the best performance with a bottleneck size of 256 (Dec Auto 256). Reasonably, the target hidden states contain more information about the sequence-to-be-generated than the source states.
Finally, when combining the output probability with the decoder hidden states (Dec Auto 256 + Prob), we are able to achieve the best performance. Again, it is better to use the autoencoder on the decoder hidden state than on the encoder hidden state. It is worth noting that the pseudo-labels perform very well when including the posterior probabilities. Interestingly, we see a clear drop in performance between oracle and pseudo-labels when not using the posterior probabilities.
Moreover, we evaluated methods to identify source words with low confidence. The results for these experiments are summarized in Table 2. In this case the baseline is to map the posterior probabilities to the source sentence using a GIZA (Och and Ney, 2003) alignment. Again, we evaluate the approach with the same four scores. As shown in the first two lines, the GIZA alignment from source to target performs clearly better than the one from target to source. Therefore, in the remaining experiments, we only evaluate approaches using the source to target alignment. By using the training-test distance approximated by the autoencoder on the encoder states (Enc Auto 256), we directly have an estimate on the source side and so do not need to map target estimates to the source side. In this case, we see improvements over using the posterior probabilities. Again, the pseudo-labels do not perform as well without using the posterior probabilities. Next, we map the other three measures, decoder hidden states and the combination of encoder or decoder states and output probabilities, to the source using the GIZA alignment. Interestingly, this time, solely using the approximation of the training-test similarity is even better than the combination with the output probabilities. [...] posterior probabilities when looking at 10% and 20% of the data.
Table 2: Source word confidence for MT. First two columns: Percentage of found errors when selecting 10% and 20% of the data; Final two columns: F-score when using oracle threshold and thresholds optimized on pseudo-labels
Finally, we tried to use an internal alignment instead of the GIZA alignment. Therefore, we predict the decoder hidden states based on the encoder hidden states as described in Section 4.
Again, we investigated different sizes for the hidden states used to map the posterior probabilities. As shown in Table 2, all the models perform better than the GIZA alignment. We can further improve the quality by using a larger hidden layer. Since we need to learn a very complex mapping from source to target hidden states, a larger layer is better. The best performance is achieved using a layer of 8192 hidden units.
In the end, we also used the same model to map the autoencoder predictions. The combination of all three methods leads to the best results (Dec Auto 256 + Prob, 8192). By looking only at 10% of the words, we are able to find more than 35% of the errors, and for 20% of the words we identify more than half of the errors.

Speech recognition results
The ASR results for this task are summarized in Table 3. We present the percentage of found errors when looking at 10 and 20 percent of the data. In each column, we estimate the quality of one output generated by the different models. Each row represents the results when using one model to estimate the quality of the different outputs. Here we evaluated four different models with increasing transcription quality. The only difference between the models is the number of hidden layers. We investigated models using 4, 8, 12 and 24 layers. In this task, the test set consists of two subsets. The best model achieves a word error rate of 12.1 and 23.0 on the two subsets, respectively.
Again, we use the output probabilities as well as the combination of the autoencoder and the output probabilities. We again use half the input size for the bottleneck size. Firstly, as shown in the MT experiments, we can improve the quality estimation by combining the posterior probabilities and the autoencoder approach. In all configurations, the combination performs better than or similar to the posterior probability.
Secondly, the better models are able to better estimate the confidence on the same output. In most cases, the performance can be improved by using a more complex model to estimate the confidence. One exception is the estimation of a model's own output.
Finally, the estimation of the distance between training and test data mainly helps when using stronger models, both for the generation of the output and for confidence estimation. Furthermore, this method also removes the effect of models performing worse on their own output.

Related Work
Prior work has investigated confidence measurement for speech recognition models (Siu and Gish, 1999), and statistical machine translation models using either word-level posteriors (Ueffing and Ney, 2007) or external models (Gandrabur and Foster, 2003). Uncertainty and confidence measurement for deep learning models has also received attention recently: Gal and Ghahramani (2016) formulate neural network models with dropout as Bayesian models to obtain uncertainty based on sampling methods. Specifically, for neural machine translation models or other sequence-to-sequence models, quality estimation has remained a topic of concern. While most prior research focused on developing confidence measures for a general system using external features (Specia et al., 2018), this work concentrates on estimating the confidence of a specific system by making use of the information available in the internal representation of the network.

Conclusion
In this work, we investigated the ability of sequence-to-sequence models to model their confidence in their decisions. We performed experiments using these models for two tasks: machine translation and speech recognition.
We analyzed the influence of train-test mismatch on quality estimation. By approximating this mismatch using an autoencoder and combining it with the posterior probabilities, we are able to improve confidence estimation over a strong baseline. We showed that it is better to measure the mismatch on the decoder hidden states than on the encoder hidden states.
Secondly, we also investigated methods to predict how well each individual source token is translated by a given model. In this case, measuring the train-test mismatch was even more important. Furthermore, we presented an approach to infer the internal alignment of complex sequence-to-sequence models. Using this alignment instead of a state-of-the-art external alignment for mapping target confidence measures to source tokens clearly improved the quality of the confidence measure for source words.

Table 1: Segment-level confidence estimation for MT. First two columns: Percentage of found errors when selecting 10% and 20% of the data; Final two columns: F-score when using oracle threshold and thresholds optimized on pseudo-labels. The best system is achieved by the autoencoder of the decoder states (Dec Auto 256).

Table 3: Confidence estimation on ASR using different ASR systems for output predictions and confidence estimation: Found errors when selecting 10% and 20% of the data