Confidence Modeling for Neural Semantic Parsing

In this work we focus on confidence modeling for neural semantic parsers which are built upon sequence-to-sequence models. We outline three major causes of uncertainty, and design various metrics to quantify these factors. These metrics are then used to estimate confidence scores that indicate whether model predictions are likely to be correct. Beyond confidence estimation, we identify which parts of the input contribute to uncertain predictions, allowing users to interpret their model and to verify or refine its input. Experimental results show that our confidence model significantly outperforms a widely used method that relies on posterior probability, and improves the quality of interpretation compared to simply relying on attention scores.


Introduction
Semantic parsing aims to map natural language text to a formal meaning representation (e.g., logical forms or SQL queries). The neural sequence-to-sequence architecture (Sutskever et al., 2014; Bahdanau et al., 2015) has been widely adopted in a variety of natural language processing tasks, and semantic parsing is no exception. However, despite achieving promising results (Dong and Lapata, 2016; Jia and Liang, 2016; Ling et al., 2016), neural semantic parsers remain difficult to interpret, acting in most cases as a black box, not providing any information about what made them arrive at a particular decision. In this work, we explore ways to estimate and interpret the model's confidence in its predictions, which we argue can provide users with immediate and meaningful feedback regarding uncertain outputs.
An explicit framework for confidence modeling would benefit the development cycle of neural semantic parsers which, contrary to more traditional methods, do not make use of lexicons or templates and, as a result, the sources of errors and inconsistencies are difficult to trace. Moreover, from the perspective of application, semantic parsing is often used to build natural language interfaces, such as dialogue systems. In this case it is important to know whether the system understands the input queries with high confidence in order to make decisions more reliably. For example, knowing that some of the predictions are uncertain would allow the system to generate clarification questions, prompting users to verify the results before triggering unwanted actions. In addition, the training data used for semantic parsing can be small and noisy, and as a result, models do indeed produce uncertain outputs, which we would like our framework to identify.
A widely-used confidence scoring method is based on posterior probabilities $p(y|x)$ where x is the input and y the model's prediction. For a linear model, this method makes sense: as more positive evidence is gathered, the score becomes larger. Neural models, in contrast, learn a complicated function that often overfits the training data. Posterior probability is effective when making decisions about model output, but is no longer a good indicator of confidence due in part to the nonlinearity of neural networks (Johansen and Socher, 2017). This observation motivates us to develop a confidence modeling framework for sequence-to-sequence models. We categorize the causes of uncertainty into three types, namely model uncertainty, data uncertainty, and input uncertainty, and design different metrics to characterize them.
We compute these confidence metrics for a given prediction and use them as features in a regression model which is trained on held-out data to fit prediction F1 scores. At test time, the regression model's outputs are used as confidence scores. Our approach does not interfere with the training of the model, and can thus be applied to various architectures without sacrificing test accuracy. Furthermore, we propose a method based on backpropagation which allows us to interpret model behavior by identifying which parts of the input contribute to uncertain predictions.
Experimental results on two semantic parsing datasets (IFTTT, Quirk et al. 2015; and DJANGO, Oda et al. 2015) show that our model is superior to a method based on posterior probability. We also demonstrate that thresholding confidence scores achieves a good trade-off between coverage and accuracy. Moreover, the proposed uncertainty backpropagation method yields results which are qualitatively more interpretable compared to those based on attention scores.

Related Work
Confidence Estimation. Confidence estimation has been studied in the context of a few NLP tasks, such as statistical machine translation (Blatz et al., 2004; Ueffing and Ney, 2005; Soricut and Echihabi, 2010), and question answering (Gondek et al., 2012). To the best of our knowledge, confidence modeling for semantic parsing remains largely unexplored. A common scheme for modeling uncertainty in neural networks is to place distributions over the network's weights (Denker and Lecun, 1991; MacKay, 1992; Neal, 1996; Blundell et al., 2015; Gan et al., 2017). But the resulting models often contain more parameters, and the training process has to be changed accordingly, which makes these approaches difficult to work with. Gal and Ghahramani (2016) develop a theoretical framework which shows that the use of dropout in neural networks can be interpreted as a Bayesian approximation of a Gaussian process. We adapt their framework so as to represent uncertainty in encoder-decoder architectures, and extend it by adding Gaussian noise to weights.
Semantic Parsing. Various methods have been developed to learn a semantic parser from natural language descriptions paired with meaning representations (Tang and Mooney, 2000; Zettlemoyer and Collins, 2007; Lu et al., 2008; Kwiatkowski et al., 2011; Andreas et al., 2013; Zhao and Huang, 2015). More recently, a few sequence-to-sequence models have been proposed for semantic parsing (Dong and Lapata, 2016; Jia and Liang, 2016; Ling et al., 2016) and shown to perform competitively whilst eschewing the use of templates or manually designed features. There have been several efforts to improve these models including the use of a tree decoder (Dong and Lapata, 2016), data augmentation (Jia and Liang, 2016; Kočiský et al., 2016), the use of a grammar model (Xiao et al., 2016; Rabinovich et al., 2017; Yin and Neubig, 2017; Krishnamurthy et al., 2017), coarse-to-fine decoding (Dong and Lapata, 2018), network sharing (Susanto and Lu, 2017; Herzig and Berant, 2017), user feedback (Iyer et al., 2017), and transfer learning (Fan et al., 2017). Current semantic parsers will by default generate some output for a given input even if this is just a random guess. System results can thus be somewhat unexpected, inadvertently affecting user experience. Our goal is to mitigate these issues with a confidence scoring model that can estimate how likely the prediction is to be correct.

Neural Semantic Parsing Model
In the following section we describe the neural semantic parsing model (Dong and Lapata, 2016; Jia and Liang, 2016; Ling et al., 2016) we assume throughout this paper. The model is built upon the sequence-to-sequence architecture and is illustrated in Figure 1. An encoder is used to encode natural language input $q = q_1 \cdots q_{|q|}$ into a vector representation, and a decoder learns to generate a logical form representation of its meaning $a = a_1 \cdots a_{|a|}$ conditioned on the encoding vectors. The encoder and decoder are two different recurrent neural networks with long short-term memory units (LSTMs; Hochreiter and Schmidhuber 1997) which process tokens sequentially. The probability of generating the whole sequence $p(a|q)$ is factorized as:

$$p(a|q) = \prod_{t=1}^{|a|} p(a_t \mid a_{<t}, q) \qquad (1)$$

where $a_{<t} = a_1 \cdots a_{t-1}$.
Let $e_t \in \mathbb{R}^n$ denote the hidden vector of the encoder at time step t. It is computed via $e_t = f_{\mathrm{LSTM}}(e_{t-1}, q_t)$, where $f_{\mathrm{LSTM}}$ refers to the LSTM unit, and $q_t \in \mathbb{R}^n$ is the word embedding of $q_t$. Once the tokens of the input sequence are encoded into vectors, $e_{|q|}$ is used to initialize the hidden states of the first time step in the decoder. Similarly, the hidden vector of the decoder at time step t is computed by $d_t = f_{\mathrm{LSTM}}(d_{t-1}, a_{t-1})$, where $a_{t-1} \in \mathbb{R}^n$ is the word vector of the previously predicted token. Additionally, we use an attention mechanism (Luong et al., 2015a) to utilize relevant encoder-side context. For the current time step t of the decoder, we compute its attention score with the k-th hidden state in the encoder as:

$$r_{t,k} = \frac{\exp\{d_t \cdot e_k\}}{\sum_{j=1}^{|q|} \exp\{d_t \cdot e_j\}} \qquad (2)$$

where $\sum_{j=1}^{|q|} r_{t,j} = 1$. The probability of generating $a_t$ is computed via:

$$c_t = \sum_{k=1}^{|q|} r_{t,k}\, e_k \qquad (3)$$
$$d_t^{att} = \tanh(W_1 d_t + W_2 c_t) \qquad (4)$$
$$p(a_t \mid a_{<t}, q) = \mathrm{softmax}_{a_t}(W_o\, d_t^{att}) \qquad (5)$$

where $W_1, W_2 \in \mathbb{R}^{n \times n}$ and $W_o \in \mathbb{R}^{|V_a| \times n}$ are three parameter matrices.

Figure 1: We use dropout as approximate Bayesian inference to obtain model uncertainty. The dropout layers are applied to i) token vectors; ii) the encoder's output vectors; iii) bridge vectors; and iv) decoding vectors.
The training objective is to maximize the likelihood of the generated meaning representation a given input q, i.e., maximize $\sum_{(q,a) \in D} \log p(a|q)$, where D represents training pairs. At test time, the model's prediction for input q is obtained via $\hat{a} = \arg\max_{a'} p(a'|q)$, where $a'$ represents candidate outputs. Because $p(a|q)$ is factorized as shown in Equation (1), we can use beam search to generate tokens one by one rather than iterating over all possible results.
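To make the decoding computation concrete, the following is a minimal pure-Python sketch of a single decoder step, assuming dot-product attention scores in the style of Luong et al. (2015a). The helper names (`decoder_step`, `sequence_log_prob`) and the toy weights are illustrative assumptions and are not taken from the released code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, v):
    return [dot(row, v) for row in W]

def decoder_step(d_t, encoder_states, W1, W2, Wo):
    """Attention weights, context vector, attentional vector, output distribution."""
    # Attention weights over encoder states (normalized to sum to 1).
    r_t = softmax([dot(d_t, e_k) for e_k in encoder_states])
    # Context vector: attention-weighted sum of encoder states.
    c_t = [sum(r * e[i] for r, e in zip(r_t, encoder_states))
           for i in range(len(d_t))]
    # Attentional decoding vector.
    d_att = [math.tanh(a + b)
             for a, b in zip(matvec(W1, d_t), matvec(W2, c_t))]
    # Distribution over the output vocabulary.
    p_t = softmax(matvec(Wo, d_att))
    return r_t, p_t

def sequence_log_prob(token_probs):
    """log p(a|q) as the sum of per-token log-probabilities."""
    return sum(math.log(p) for p in token_probs)

# Toy example: three encoder states, hidden size 2, output vocabulary of size 3.
encoder_states = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.4]]
d_t = [0.2, 0.1]
identity = [[1.0, 0.0], [0.0, 1.0]]
Wo = [[0.5, -0.5], [0.1, 0.2], [-0.3, 0.4]]
r_t, p_t = decoder_step(d_t, encoder_states, identity, identity, Wo)
```

Summing the log-probabilities of the chosen tokens with `sequence_log_prob` gives the score that beam search would rank candidates by.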

Confidence Estimation
Given input q and its predicted meaning representation a, the confidence model estimates a score $s(q, a) \in (0, 1)$. A large score indicates the model is confident that its prediction is correct.
In order to gauge confidence, we need to estimate "what we do not know". To this end, we identify three causes of uncertainty, and design various metrics characterizing each one of them. We then feed these metrics into a regression model in order to predict $s(q, a)$.

Algorithm 1: Dropout Perturbation. Input: q, a: input and its prediction; M: model parameters. Run forward pass and compute $\hat{p}(a|q; \hat{M}^i)$.

Model Uncertainty
The model's parameters or structures contain uncertainty, which makes the model less confident about the values of $p(a|q)$. For example, noise in the training data and the stochastic learning algorithm itself can result in model uncertainty. We describe metrics for capturing uncertainty below:

Dropout Perturbation. Our first metric uses dropout (Srivastava et al., 2014) as approximate Bayesian inference to estimate model uncertainty (Gal and Ghahramani, 2016). Dropout is a widely used regularization technique during training, which relieves overfitting by randomly masking some input neurons to zero according to a Bernoulli distribution. In our work, we use dropout at test time instead. As shown in Algorithm 1, we perform F forward passes through the network, and collect the results $\{\hat{p}(a|q; \hat{M}^i)\}_{i=1}^{F}$ where $\hat{M}^i$ represents the perturbed parameters. Then, the uncertainty metric is computed by the variance of results. We define the metric on the sequence level as:

$$\mathrm{var}\{\hat{p}(a|q; \hat{M}^i)\}_{i=1}^{F} \qquad (6)$$

In addition, we compute uncertainty $u_{a_t}$ at the token level via:

$$u_{a_t} = \mathrm{var}\{\hat{p}(a_t \mid a_{<t}, q; \hat{M}^i)\}_{i=1}^{F} \qquad (7)$$

where $\hat{p}(a_t \mid a_{<t}, q; \hat{M}^i)$ is the probability of generating token $a_t$ (Equation (5)) using perturbed model $\hat{M}^i$. We operationalize token-level uncertainty in two ways, as the average score $\mathrm{avg}\{u_{a_t}\}_{t=1}^{|a|}$ and the maximum score $\max\{u_{a_t}\}_{t=1}^{|a|}$ (since the uncertainty of a sequence is often determined by the most uncertain token). As shown in Figure 1, we add dropout layers in i) the word vectors of the encoder and decoder $q_t$, $a_t$; ii) the output vectors of the encoder $e_t$; iii) bridge vectors $e_{|q|}$ used to initialize the hidden states of the first time step in the decoder; and iv) decoding vectors $d_t^{att}$ (Equation (4)).
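As an illustration of the dropout perturbation metric, the sketch below runs several stochastic forward passes through a toy stand-in for the parser and takes the variance of the resulting probabilities. The function `toy_token_probs`, the use of inverted (rescaled) dropout, and the choice of population variance are assumptions made for this example only; the rate 0.1 and the 30 passes follow the experimental settings.

```python
import math
import random
import statistics

def dropout(vector, rate, rng):
    """Zero each unit with probability `rate` and rescale the survivors (assumption)."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in vector]

def dropout_perturbation(token_prob_fn, hidden, rate=0.1, passes=30, seed=0):
    """Return (sequence-level variance, list of token-level variances)."""
    rng = random.Random(seed)
    seq_probs, tok_probs = [], []
    for _ in range(passes):
        probs = token_prob_fn(dropout(hidden, rate, rng))
        tok_probs.append(probs)
        seq_probs.append(math.prod(probs))  # p(a|q) under this perturbation
    seq_var = statistics.pvariance(seq_probs)
    tok_var = [statistics.pvariance(column) for column in zip(*tok_probs)]
    return seq_var, tok_var

def toy_token_probs(hidden):
    """Hypothetical stand-in for the parser: probabilities of two predicted tokens."""
    s1 = hidden[0] + hidden[1]
    s2 = hidden[2] - hidden[3]
    return [1.0 / (1.0 + math.exp(-s1)), 1.0 / (1.0 + math.exp(-s2))]

hidden = [1.0, 2.0, 0.5, -0.5]
seq_var, tok_var = dropout_perturbation(toy_token_probs, hidden)
avg_u = sum(tok_var) / len(tok_var)  # average token-level uncertainty
max_u = max(tok_var)                 # most uncertain token
no_dropout_var, _ = dropout_perturbation(toy_token_probs, hidden, rate=0.0)
```

With the dropout rate set to zero every pass is identical, so the variance collapses to zero; this is a convenient sanity check for an implementation.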
Gaussian Noise. Standard dropout can be viewed as applying noise sampled from a Bernoulli distribution to the network parameters. We instead use Gaussian noise, and apply the metrics in the same way discussed above. Let $\hat{v}$ denote a vector v perturbed by noise, and g a vector sampled from the Gaussian distribution $\mathcal{N}(0, \sigma^2)$. We use $\hat{v} = v + g$ and $\hat{v} = v + v \odot g$ as two noise injection methods. Intuitively, if the model is more confident in an example, it should be more robust to perturbations.
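The two noise injection methods can be sketched as follows. This is a minimal example with hypothetical function names; the standard deviation 0.05 is the value reported in the experimental settings.

```python
import random

def additive_noise(v, sigma, rng):
    """v_hat = v + g, with each entry of g drawn from N(0, sigma^2)."""
    return [x + rng.gauss(0.0, sigma) for x in v]

def multiplicative_noise(v, sigma, rng):
    """v_hat = v + v * g (element-wise), so the noise scales with each entry."""
    return [x + x * rng.gauss(0.0, sigma) for x in v]

rng = random.Random(0)
v = [0.5, -1.0, 0.0, 2.0]
v_add = additive_noise(v, 0.05, rng)
v_mul = multiplicative_noise(v, 0.05, rng)
```

One visible difference between the two variants is that the multiplicative form leaves zero entries untouched, whereas the additive form perturbs every dimension equally.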
Posterior Probability. Our last class of metrics is based on posterior probability. We use the log probability $\log p(a|q)$ as a sequence-level metric. The token-level metric $\min\{p(a_t \mid a_{<t}, q)\}_{t=1}^{|a|}$ can identify the most uncertain predicted token. The perplexity per token $-\frac{1}{|a|} \sum_{t=1}^{|a|} \log p(a_t \mid a_{<t}, q)$ is also employed.
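The three posterior-based metrics can be computed directly from the per-token probabilities of the prediction. The following sketch uses a hypothetical helper name and toy probabilities.

```python
import math

def posterior_metrics(token_probs):
    """Sequence log-probability, minimum token probability, per-token perplexity."""
    log_prob = sum(math.log(p) for p in token_probs)
    return {
        "log_prob": log_prob,
        "min_token_prob": min(token_probs),
        # Average negative log-probability per token, as defined in the text.
        "perplexity": -log_prob / len(token_probs),
    }

metrics = posterior_metrics([0.9, 0.5, 0.8])
```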

Data Uncertainty
The coverage of training data also affects the uncertainty of predictions. If the input q does not match the training distribution or contains unknown words, it is difficult to predict $p(a|q)$ reliably. We define two metrics:

Probability of Input
We train a language model on the training data, and use it to estimate the probability of input p(q|D) where D represents the training data.

Number of Unknown Tokens
Tokens that do not appear in the training data harm robustness, and lead to uncertainty. So, we use the number of unknown tokens in the input q as a metric.
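Counting unknown tokens only requires the training vocabulary. The sketch below is a minimal illustration; the helper names and the `min_count` parameter (which mirrors the frequency filtering applied during preprocessing) are assumptions for this example.

```python
from collections import Counter

def build_vocab(training_sentences, min_count=1):
    """Collect tokens that occur at least `min_count` times in the training data."""
    counts = Counter(tok for sent in training_sentences for tok in sent)
    return {tok for tok, c in counts.items() if c >= min_count}

def count_unknown(tokens, vocab):
    """Number of input tokens that are not covered by the vocabulary."""
    return sum(1 for tok in tokens if tok not in vocab)

train = [["save", "photos", "to", "dropbox"], ["email", "me", "photos"]]
vocab = build_vocab(train)
num_unk = count_unknown(["save", "videos", "to", "dropbox"], vocab)
```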

Input Uncertainty
Even if the model can estimate $p(a|q)$ reliably, the input itself may be ambiguous. For instance, the input "the flight is at 9 o'clock" can be interpreted as either flight_time(9am) or flight_time(9pm). Selecting between these predictions is difficult, especially if they are both highly likely. We use the following metrics to measure uncertainty caused by ambiguous inputs.

Variance of Top Candidates
We use the variance of the probability of the top candidates to indicate whether these are similar. The sequence-level metric is computed by:

$$\mathrm{var}\{p(a^i|q)\}_{i=1}^{K}$$

where $a^1 \ldots a^K$ are the K-best predictions obtained by the beam search during inference (Section 3).
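A small numerical illustration of this metric is given below, assuming population variance (the text does not specify the estimator) and made-up candidate probabilities. When the two best candidates are nearly tied, the variance of the K-best probabilities is lower than when a single candidate dominates.

```python
import statistics

def kbest_variance(candidate_probs):
    """Variance of the probabilities of the K-best beam candidates."""
    return statistics.pvariance(candidate_probs)

# Two candidates compete for the top spot (ambiguous input).
ambiguous = kbest_variance([0.45, 0.44, 0.05, 0.03, 0.03])
# One candidate clearly dominates the beam.
peaked = kbest_variance([0.95, 0.02, 0.01, 0.01, 0.01])
```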

Entropy of Decoding
The sequence-level entropy of the decoding process is computed via:

$$H[a|q] = -\sum_{a'} p(a'|q) \log p(a'|q)$$

which we approximate by Monte Carlo sampling rather than iterating over all candidate predictions. The token-level metrics of decoding entropy are computed by $\mathrm{avg}\{H[a_t \mid a_{<t}, q]\}_{t=1}^{|a|}$ and $\max\{H[a_t \mid a_{<t}, q]\}_{t=1}^{|a|}$.
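The entropy metrics can be sketched as follows. The Monte Carlo estimator averages the negative log-probability of sequences sampled from the model; here `toy_sample` is a hypothetical sampler over three candidate sequences that stands in for sampling from the decoder.

```python
import math
import random

def token_entropy(dist):
    """Entropy of one decoding step's distribution over the output vocabulary."""
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def token_entropy_metrics(step_dists):
    """Average and maximum token-level entropy over the decoded sequence."""
    entropies = [token_entropy(d) for d in step_dists]
    return sum(entropies) / len(entropies), max(entropies)

def mc_sequence_entropy(sample_log_prob, num_samples, rng):
    """Estimate H[a|q] = -E[log p(a|q)] from sampled sequences."""
    total = sum(sample_log_prob(rng) for _ in range(num_samples))
    return -total / num_samples

SEQ_PROBS = [0.5, 0.25, 0.25]  # toy distribution over three candidate outputs

def toy_sample(rng):
    """Draw one candidate according to SEQ_PROBS and return its log-probability."""
    p = rng.choices(SEQ_PROBS, weights=SEQ_PROBS, k=1)[0]
    return math.log(p)

avg_h, max_h = token_entropy_metrics([[0.5, 0.5], [1.0, 0.0]])
exact_h = -sum(p * math.log(p) for p in SEQ_PROBS)
estimated_h = mc_sequence_entropy(toy_sample, 5000, random.Random(0))
```

On this toy distribution the exact entropy is available in closed form, which makes it easy to check that the sampled estimate is close.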

Confidence Scoring
The sentence- and token-level confidence metrics defined in Section 4 are fed into a gradient tree boosting model (Chen and Guestrin, 2016) in order to predict the overall confidence score $s(q, a)$. The model is wrapped with a logistic function so that confidence scores are in the range of (0, 1).
Because the confidence score indicates whether the prediction is likely to be correct, we can use the prediction's F1 (see Section 6.2) as target value.
The training loss is defined as:

$$\sum_{(q,a) \in D} \left[ y_{q,a} \ln\left(1 + e^{-\hat{s}(q,a)}\right) + (1 - y_{q,a}) \ln\left(1 + e^{\hat{s}(q,a)}\right) \right]$$

where D represents the data, $y_{q,a}$ is the target F1 score, and $\hat{s}(q, a)$ the predicted confidence score. We refer readers to Chen and Guestrin (2016) for mathematical details of how the gradient tree boosting model is trained. Notice that we learn the confidence scoring model on the held-out set (rather than on the training data of the semantic parser) to avoid overfitting.
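The loss can be evaluated with a few lines of Python. This sketch assumes that the score entering the loss is the raw output of the boosted trees, before the logistic function is applied; the function names are illustrative.

```python
import math

def confidence(raw_score):
    """Logistic wrapper mapping a raw model output to a confidence in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def confidence_loss(raw_scores, f1_targets):
    """Logistic loss with soft targets y in [0, 1] (the prediction's F1)."""
    total = 0.0
    for s, y in zip(raw_scores, f1_targets):
        total += y * math.log1p(math.exp(-s)) + (1.0 - y) * math.log1p(math.exp(s))
    return total

good = confidence_loss([3.0], [1.0])   # confident and correct
bad = confidence_loss([-3.0], [1.0])   # unconfident although correct
neutral = confidence_loss([0.0], [0.7])
```

For a fixed target y, the per-example loss is minimized when the logistic of the raw score equals y, so the fitted confidence tracks the expected F1.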

Uncertainty Interpretation
Confidence scores are useful insofar as they can be traced back to the inputs causing the uncertainty in the first place. For semantic parsing, identifying which input words contribute to uncertainty would be of value, e.g., these could be treated explicitly as special cases or refined if they represent noise.
In this section, we introduce an algorithm that backpropagates token-level uncertainty scores (see Equation (7)) from predictions to input tokens, following the ideas of Bach et al. (2015) and Zhang et al. (2016). Let $u_m$ denote neuron m's uncertainty score, which indicates the degree to which it contributes to uncertainty. As shown in Figure 2, $u_m$ is computed by the summation of the scores backpropagated from its child neurons:

$$u_m = \sum_{c \in \mathrm{Child}(m)} v_m^c\, u_c$$

where Child(m) is the set of m's child neurons, and the non-negative contribution ratio $v_m^c$ indicates how much we backpropagate $u_c$ to neuron m. Intuitively, if neuron m contributes more to c's value, ratio $v_m^c$ should be larger. After obtaining score $u_m$, we redistribute it to its parent neurons in the same way. Contribution ratios from m to its parent neurons are normalized to 1:

$$\sum_{p \in \mathrm{Parent}(m)} v_p^m = 1$$

where Parent(m) is the set of m's parent neurons.
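Given the above constraints, we now define different backpropagation rules for the operators used in neural networks. We first describe the rules used for fully-connected layers. Let x denote the input. The output is computed by $z = \sigma(Wx + b)$, where σ is a nonlinear function, $W \in \mathbb{R}^{|z| \times |x|}$ is the weight matrix, $b \in \mathbb{R}^{|z|}$ is the bias, and neuron $z_i$ is computed via $z_i = \sigma\left(\sum_{j=1}^{|x|} W_{i,j} x_j + b_i\right)$. Neuron $x_k$'s uncertainty score $u_{x_k}$ is gathered from the next layer:

$$u_{x_k} = \sum_{i=1}^{|z|} v_{x_k}^{z_i}\, u_{z_i} = \sum_{i=1}^{|z|} \frac{|W_{i,k} x_k|}{\sum_{j=1}^{|x|} |W_{i,j} x_j|}\, u_{z_i}$$

ignoring the nonlinear function σ and the bias b. The ratio $v_{x_k}^{z_i}$ is proportional to the contribution of $x_k$ to the value of $z_i$.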
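A minimal sketch of the fully-connected rule is given below. It assumes that the contribution of input $x_k$ to output $z_i$ is measured by $|W_{i,k} x_k|$ and normalized over all inputs of $z_i$, with the nonlinearity and bias ignored; the function name and the handling of all-zero rows are assumptions for this example.

```python
def backprop_fully_connected(W, x, u_z):
    """Redistribute output uncertainty scores u_z onto the inputs x of z = sigma(Wx + b)."""
    u_x = [0.0] * len(x)
    for i, row in enumerate(W):
        contributions = [abs(w * x_j) for w, x_j in zip(row, x)]
        total = sum(contributions)
        if total == 0.0:
            continue  # assumption: no input contributed, so nothing is passed back
        for k, c in enumerate(contributions):
            u_x[k] += c / total * u_z[i]
    return u_x

W = [[1.0, -1.0],
     [2.0, 0.0]]
x = [1.0, 3.0]
u_z = [0.4, 0.6]
u_x = backprop_fully_connected(W, x, u_z)
```

Because the ratios for each output neuron sum to one, the total amount of uncertainty is conserved as it moves from one layer to the previous one (here 0.4 + 0.6 is redistributed as 0.7 and 0.3).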
We define backpropagation rules for element-wise vector operators. For $z = x \pm y$, these are:

$$u_{x_k} = \frac{|x_k|}{|x_k| + |y_k|}\, u_{z_k} \qquad u_{y_k} = \frac{|y_k|}{|x_k| + |y_k|}\, u_{z_k}$$

where the contribution ratios $v_{x_k}^{z_k}$ and $v_{y_k}^{z_k}$ are determined by $|x_k|$ and $|y_k|$. For multiplication, the contribution of two elements in $\frac{1}{3} * 3$ should be the same. So, the propagation rules for $z = x \odot y$ are:

$$u_{x_k} = \frac{|\log |x_k||}{|\log |x_k|| + |\log |y_k||}\, u_{z_k} \qquad u_{y_k} = \frac{|\log |y_k||}{|\log |x_k|| + |\log |y_k||}\, u_{z_k}$$

where the contribution ratios are determined by $|\log |x_k||$ and $|\log |y_k||$. For scalar multiplication, $z = \lambda x$ where λ denotes a constant. We directly assign z's uncertainty scores to x and the backpropagation rule is $u_{x_k} = u_{z_k}$.
As shown in Algorithm 2, we first initialize uncertainty backpropagation in the decoder (lines 1-5). For each predicted token $a_t$, we compute its uncertainty score $u_{a_t}$ as in Equation (7). Next, we find the dimension of $a_t$ in the decoder's softmax classifier (Equation (5)), and initialize the neuron with the uncertainty score $u_{a_t}$. We then backpropagate these uncertainty scores through the network (lines 6-9), and finally into the neurons of the input words. We summarize them and compute the token-level scores for interpreting the results (lines 10-13). For input word vector $q_t$, we use the summation of its neuron-level scores as the token-level score:

$$\hat{u}_{q_t} \propto \sum_{c \in q_t} u_c$$

where $c \in q_t$ represents the neurons of word vector $q_t$, and $\sum_{t=1}^{|q|} \hat{u}_{q_t} = 1$. We use the normalized score $\hat{u}_{q_t}$ to indicate token $q_t$'s contribution to prediction uncertainty.
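The element-wise rules and the final token-level normalization can be sketched as follows. The sketch assumes ratios based on $|x_k|$ and $|y_k|$ for addition and on $|\log|x_k||$ and $|\log|y_k||$ for multiplication; the even split for degenerate cases and the clamping constant `eps` are additional assumptions introduced only to keep the example well defined.

```python
import math

def split(weight_x, weight_y, u):
    """Divide score u between two parents in proportion to their weights."""
    total = weight_x + weight_y
    if total == 0.0:
        return u / 2.0, u / 2.0  # assumption: split evenly when both weights vanish
    return weight_x / total * u, weight_y / total * u

def backprop_add(x, y, u_z):
    """Rule for z = x + y or z = x - y, using |x_k| and |y_k| as weights."""
    pairs = [split(abs(a), abs(b), u) for a, b, u in zip(x, y, u_z)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def backprop_mul(x, y, u_z, eps=1e-12):
    """Rule for z = x * y (element-wise), using |log|x_k|| and |log|y_k|| as weights."""
    def weight(v):
        return abs(math.log(max(abs(v), eps)))  # eps avoids log(0); an assumption
    pairs = [split(weight(a), weight(b), u) for a, b, u in zip(x, y, u_z)]
    return [p[0] for p in pairs], [p[1] for p in pairs]

def token_scores(neuron_scores_per_token):
    """Sum neuron-level scores per input token and normalize them to sum to 1."""
    sums = [sum(neurons) for neurons in neuron_scores_per_token]
    total = sum(sums)
    return [s / total for s in sums]

add_x, add_y = backprop_add([1.0], [3.0], [0.8])
mul_x, mul_y = backprop_mul([1.0 / 3.0], [3.0], [1.0])
scores = token_scores([[0.1, 0.1], [0.3, 0.5]])
```

The multiplication example reproduces the intuition from the text: the factors 1/3 and 3 receive the same share of the uncertainty.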

Experiments
In this section we describe the datasets used in our experiments and various details concerning our models. We present our experimental results and analysis of model behavior. Our code is publicly available at https://github.com/donglixp/confidence.

Datasets
We trained the neural semantic parser introduced in Section 3 on two datasets covering different domains and meaning representations. Examples are shown in Table 1.
IFTTT. This dataset (Quirk et al., 2015) contains a large number of if-this-then-that programs crawled from the IFTTT website. The programs are written for various applications, such as home security (e.g., "email me if the window opens"), and task automation (e.g., "save instagram photos to dropbox"). Whenever a program's trigger is satisfied, an action is performed. Triggers and actions represent functions with arguments; they are selected from different channels (160 in total) representing various services (e.g., Android). There are 552 trigger functions and 229 action functions. The original split contains 77,495 training, 5,171 development, and 4,294 test instances. The subset that removes non-English descriptions was used in our experiments.
DJANGO. This dataset (Oda et al., 2015) is built upon the code of the Django web framework. Each line of Python code has a manually annotated natural language description. Our goal is to map the English pseudo-code to Python statements. This dataset contains diverse use cases, such as iteration, exception handling, and string manipulation. The original split has 16,000 training, 1,000 development, and 1,805 test examples.

Settings
We followed the data preprocessing used in previous work (Dong and Lapata, 2016; Yin and Neubig, 2017). Input sentences were tokenized using NLTK (Bird et al., 2009) and lowercased. We filtered words that appeared fewer than four times in the training set. Numbers and URLs in IFTTT and quoted strings in DJANGO were replaced with placeholders. Hyperparameters of the semantic parsers were validated on the development set. The learning rate and the smoothing constant of RMSProp (Tieleman and Hinton, 2012) were 0.002 and 0.95, respectively. The dropout rate was 0.25. A two-layer LSTM was used for IFTTT, while a one-layer LSTM was employed for DJANGO. Dimensions for the word embedding and hidden vector were selected from {150, 250}. The beam size during decoding was 5.
For IFTTT, we view the predicted trees as a set of productions, and use balanced F1 as evaluation metric (Quirk et al., 2015). We do not measure accuracy because the dataset is very noisy and there rarely is an exact match between the predicted output and the gold standard. The F1 score of our neural semantic parser is 50.1%, which is comparable to Dong and Lapata (2016). For DJANGO, we measure the fraction of exact matches, where F1 score is equal to accuracy. Because there are unseen variable names at test time, we use attention scores as alignments to replace unknown tokens in the prediction with the input words they align to (Luong et al., 2015b). The accuracy of our parser is 53.7%, which is better than the result (45.1%) of the sequence-to-sequence model reported in Yin and Neubig (2017).
To estimate model uncertainty, we set dropout rate to 0.1, and performed 30 inference passes. The standard deviation of Gaussian noise was 0.05. The language model was estimated using KenLM (Heafield et al., 2013). For input uncertainty, we computed variance for the 10-best candidates. The confidence metrics were implemented in batch mode, to take full advantage of GPUs. Hyperparameters of the confidence scoring model were cross-validated. The number of boosted trees was selected from {20, 50}. The maximum tree depth was selected from {3, 4, 5}. We set the subsample ratio to 0.8. All other hyperparameters in XGBoost (Chen and Guestrin, 2016) were left with their default values.

Results
Confidence Estimation. We compare our approach (CONF) against confidence scores based on posterior probability $p(a|q)$ (POSTERIOR). We also report the results of three ablation variants (−MODEL, −DATA, −INPUT) by removing each group of confidence metrics described in Section 4. We measure the relationship between confidence scores and F1 using Spearman's ρ correlation coefficient which varies between −1 and 1 (0 implies there is no correlation). High ρ indicates that the confidence scores are high for correct predictions and low otherwise.
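For reference, Spearman's ρ is the Pearson correlation of the ranks of the two variables. The following self-contained sketch computes it with average ranks for ties; the function names and the toy confidence and F1 values are made up for illustration.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation between the rank vectors of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

confidences = [0.2, 0.9, 0.5, 0.7]
f1_scores = [0.0, 1.0, 0.5, 0.5]
rho = spearman_rho(confidences, f1_scores)
```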
As shown in Table 2, our method CONF outperforms POSTERIOR by a large margin. The ablation results indicate that model uncertainty plays the most important role among the confidence metrics. In contrast, removing the metrics of data uncertainty affects performance less, because most examples in the datasets are in-domain. Improvements for each group of metrics are significant with p < 0.05 according to bootstrap hypothesis testing (Efron and Tibshirani, 1994).
Tables 3 and 4 show the correlation matrix for F1 and individual confidence metrics on the IFTTT and DJANGO datasets, respectively. As can be seen, metrics representing model uncertainty and input uncertainty are more correlated to each other compared with metrics capturing data uncertainty. Perhaps unsurprisingly, metrics of the same group are highly inter-correlated since they model the same type of uncertainty. Table 5 shows the relative importance of individual metrics in the regression model. As importance score we use the average gain (i.e., loss reduction) brought by the confidence metric once added as feature to the branch of the decision tree (Chen and Guestrin, 2016). The results indicate that model uncertainty (Noise/Dropout/Posterior/Perplexity) plays the most important role. On IFTTT, the number of unknown tokens (#UNK) and the variance of top candidates (var(K-best)) are also very helpful because this dataset is relatively noisy and contains many ambiguous inputs.
Finally, in real-world applications, confidence scores are often thresholded to trade off coverage for precision. Figure 3 shows how F1 score varies as we increase the confidence threshold, i.e., reduce the proportion of examples that we return answers for. F1 score improves monotonically for both POSTERIOR and our method; our method, however, achieves better performance at the same coverage.
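The coverage analysis amounts to ranking test examples by confidence and averaging F1 over the retained proportion. A minimal sketch with a hypothetical function name and toy values:

```python
def coverage_curve(confidences, f1_scores, proportions):
    """Average F1 over the most confident examples, for each retained proportion."""
    ranked = sorted(zip(confidences, f1_scores), key=lambda pair: pair[0], reverse=True)
    curve = []
    for proportion in proportions:
        kept = max(1, round(proportion * len(ranked)))  # always keep at least one example
        average_f1 = sum(f1 for _, f1 in ranked[:kept]) / kept
        curve.append((proportion, average_f1))
    return curve

confidences = [0.9, 0.8, 0.3, 0.2]
f1_scores = [1.0, 1.0, 0.0, 0.5]
curve = coverage_curve(confidences, f1_scores, [1.0, 0.5])
```

In this toy case, answering only the more confident half of the examples raises the average F1 from 0.625 to 1.0.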

Uncertainty Interpretation
We next evaluate how our backpropagation method (see Section 5) allows us to identify input tokens contributing to uncertainty. We compare against a method that interprets uncertainty based on the attention mechanism (ATTENTION). As shown in Equation (2), attention scores $r_{t,k}$ can be used as soft alignments between the time step t of the decoder and the k-th input token. We compute the normalized uncertainty score $\hat{u}_{q_t}$ for a token $q_t$ via:

$$\hat{u}_{q_t} \propto \sum_{j=1}^{|a|} r_{j,t}\, u_{a_j}$$

where $u_{a_j}$ is the uncertainty score of the predicted token $a_j$ (Equation (7)), and $\sum_{t=1}^{|q|} \hat{u}_{q_t} = 1$.
Unfortunately, the evaluation of uncertainty interpretation methods is problematic. For our semantic parsing task, we do not know a priori which tokens in the natural language input contribute to uncertainty, and these may vary depending on the architecture used, model parameters, and so on. We work around this problem by creating a proxy gold standard. We inject noise to the vectors representing tokens in the encoder (see Section 4.1) and then estimate the uncertainty caused by each token $q_t$ (Equation (6)) under the assumption that addition of noise should only affect genuinely uncertain tokens. Notice that here we inject noise to one token at a time instead of all parameters (see Figure 1). Tokens identified as uncertain by the above procedure are considered gold standard and compared to those identified by our method. We use Gaussian noise to perturb vectors in our experiments (dropout obtained similar results).
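The ATTENTION baseline can be sketched as a weighted sum of the predicted tokens' uncertainty scores, using the attention weights as soft alignments. The function name, the matrix layout (`attention[j][t]` is the weight of decoder step j on input token t), and the toy values are assumptions for this example.

```python
def attention_interpretation(attention, token_uncertainty):
    """Distribute each predicted token's uncertainty over inputs via attention weights."""
    num_inputs = len(attention[0])
    raw = [
        sum(attention[j][t] * token_uncertainty[j] for j in range(len(attention)))
        for t in range(num_inputs)
    ]
    total = sum(raw)
    return [score / total for score in raw]  # normalized to sum to 1

attention = [[0.5, 0.5],   # decoder step 1 attends equally to both input tokens
             [1.0, 0.0]]   # decoder step 2 attends only to the first input token
token_uncertainty = [0.2, 0.6]
input_scores = attention_interpretation(attention, token_uncertainty)
```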

We define an evaluation metric based on the overlap (overlap@K) among tokens identified as uncertain by the model and the gold standard. Given an example, we first compute the interpretation scores of the input tokens according to our method, and obtain a list $\tau_1$ of K tokens with highest scores. We also obtain a list $\tau_2$ of K tokens with highest ground-truth scores and measure the degree of overlap between these two lists:

$$\mathrm{overlap@}K = \frac{|\tau_1 \cap \tau_2|}{K}$$

where K ∈ {2, 4} in our experiments. For example, the overlap@4 metric of the lists $\tau_1 = [q_7, q_8, q_2, q_3]$ and $\tau_2 = [q_7, q_8, q_3, q_4]$ is 3/4, because there are three overlapping tokens.

Method      IFTTT @2   IFTTT @4   DJANGO @2   DJANGO @4
ATTENTION   0.525      0.737      0.637       0.684
BACKPROP    0.608      0.791      0.770       0.788
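The overlap@K metric is straightforward to compute; the sketch below reproduces the worked example from the text. The helper names are illustrative.

```python
def top_k(scores, k):
    """Indices of the k highest-scoring input tokens."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

def overlap_at_k(predicted, gold, k):
    """Fraction of the top-k predicted tokens that also appear in the top-k gold tokens."""
    return len(set(predicted[:k]) & set(gold[:k])) / k

tau_1 = ["q7", "q8", "q2", "q3"]
tau_2 = ["q7", "q8", "q3", "q4"]
score = overlap_at_k(tau_1, tau_2, 4)
```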
Table 6 reports results with overlap@2 and overlap@4. Overall, BACKPROP achieves better interpretation quality than the attention mechanism. On both datasets, about 80% of the top-4 tokens identified as uncertain agree with the ground truth. Table 7 shows examples where our method has identified input tokens contributing to the uncertainty of the output. We highlight token $a_t$ if its uncertainty score $u_{a_t}$ is greater than $0.5 \cdot \mathrm{avg}\{u_{a_{t'}}\}_{t'=1}^{|a|}$. The results illustrate that the parser tends to be uncertain about tokens which are function arguments (e.g., URLs, and message content), and ambiguous inputs. The examples show that BACKPROP is qualitatively better compared to ATTENTION; attention scores often produce inaccurate alignments while BACKPROP can utilize information flowing through the LSTMs rather than only relying on the attention mechanism.

Conclusions
In this paper we presented a confidence estimation model and an uncertainty interpretation method for neural semantic parsing. Experimental results show that our method achieves better performance than competitive baselines on two datasets. Directions for future work are many and varied. The proposed framework could be applied to a variety of tasks (Bahdanau et al., 2015; Schmaltz et al., 2017) employing sequence-to-sequence architectures. We could also utilize the confidence estimation model within an active learning framework for neural semantic parsing.

Figure 2: Uncertainty backpropagation at the neuron level. Neuron m's score $u_m$ is collected from child neurons $c_1$ and $c_2$ by $u_m = v_m^{c_1} u_{c_1} + v_m^{c_2} u_{c_2}$. The score $u_m$ is then redistributed to its parent neurons $p_1$ and $p_2$, which satisfies $v_{p_1}^m + v_{p_2}^m = 1$.

Figure 3: Confidence scores are used as threshold to filter out uncertain test examples. As the threshold increases, performance improves. The horizontal axis shows the proportion of examples beyond the threshold.

Table 2: Spearman ρ correlation between confidence scores and F1. Best results are shown in bold. All correlations are significant at p < 0.01.

Table 5: Importance scores of confidence metrics (normalized by maximum value on each dataset). Best results are shown in bold. Same shorthands apply as in Table 3.

Table 6: Uncertainty interpretation against inferred ground truth; we compute the overlap between tokens identified as contributing to uncertainty by our method and those found in the gold standard. Overlap is shown for top 2 and 4 tokens. Best results are in bold.

Table 7: Uncertainty interpretation for ATTENTION (ATT) and BACKPROP (BP). The first line in each group is the model prediction. Predicted tokens and input words with large scores are shown in red and blue, respectively.

google calendar−any event starts THEN facebook−create a status message−(status message ({description}))
ATT  post calendar event to facebook
BP   post calendar event to facebook

feed−new feed item−(feed url( url sports.espn.go.com)) THEN ...
ATT  espn mlb headline to readability
BP   espn mlb headline to readability

weather−tomorrow's low drops below−(( temperature(0)) (degrees in(c))) THEN ...
ATT  warn me when it's going to be freezing tomorrow
BP   warn me when it's going to be freezing tomorrow

if str number[0] == ' STR ':
ATT  if first element of str number equals a string STR .
BP   if first element of str number equals a string STR .

start = 0
ATT  start is an integer 0 .
BP   start is an integer 0 .

if name.startswith('STR '):
ATT  if name starts with an string STR ,
BP   if name starts with an string STR ,