Learning to Rank for Plausible Plausibility

Researchers illustrate improvements in contextual encoding strategies via resultant performance on a battery of shared Natural Language Understanding (NLU) tasks. Many of these tasks are of a categorical prediction variety: given a conditioning context (e.g., an NLI premise), provide a label based on an associated prompt (e.g., an NLI hypothesis). The categorical nature of these tasks has led to common use of a cross entropy log-loss objective during training. We suggest this loss is intuitively wrong when applied to plausibility tasks, where the prompt by design is neither categorically entailed nor contradictory given the context. Log-loss naturally drives models to assign scores near 0.0 or 1.0, in contrast to our proposed use of a margin-based loss. Following a discussion of our intuition, we describe a confirmation study based on an extreme, synthetically curated task derived from MultiNLI. We find that a margin-based loss leads to a more plausible model of plausibility. Finally, we illustrate improvements on the Choice Of Plausible Alternative (COPA) task through this change in loss.


Introduction
Contextualized encoders such as GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) have led to improvements on various structurally similar Natural Language Understanding (NLU) tasks such as variants of Natural Language Inference (NLI). Such tasks model the conditional interpretation of a sentence (e.g., an NLI hypothesis) based on some other context (usually some other sentence, e.g., an NLI premise). The structural similarity of these tasks points to a structurally similar modeling approach: (1) concatenate the conditioning context (premise) to a sentence to be interpreted, (2) read this pair using a contextualized encoder, then (3) employ the resultant representation to support classification under the label set of the task. NLI datasets employ a categorical label scheme (Entailment, Neutral, Contradiction) which has led to the use of a cross-entropy log-loss objective at training time: learn to maximize the probability of the correct label, and thereby minimize the probability of the competing labels. We suggest that this approach is intuitively problematic when applied to a task such as COPA (Choice Of Plausible Alternative) by Roemmele et al. (2011), where one is provided with a premise and two or more alternatives, and the model must select the most sensible hypothesis, with respect to the premise and the other options. As compared to NLI datasets, COPA was designed to have alternatives that are neither strictly true nor false in context: a procedure that maximizes the probability of the correct item at training time, thereby minimizing the probability of the other alternative(s), will seemingly learn to misread future examples.
We argue that COPA-style tasks should intuitively be approached as learning to rank problems (Burges et al., 2005; Cao et al., 2007), where an encoder on competing items is trained to assign relatively higher or lower scores to candidates, rather than maximizing or minimizing probabilities.
In the following we investigate three datasets, beginning with a constructed COPA-style variant of MultiNLI (Williams et al., 2018, later MNLI), designed to be adversarial (see Figure 1). Results on this dataset support our intuition (see Figure 2). We then construct a second synthetic dataset based on JOCI (Zhang et al., 2017), which employs a finer label set than NLI; a margin-based approach strictly outperforms log-loss in this case. Finally, we demonstrate state-of-the-art results on COPA, showing that a BERT-based model trained with margin-loss significantly outperforms a log-loss alternative.

Background
A series of efforts have considered COPA: by causality estimation through pointwise mutual information (Gordon et al., 2011) or data-driven methods (Luo et al., 2016; Sasaki et al., 2017), or through a pre-trained language model (Radford et al., 2018, GPT).¹ Under the Johns Hopkins Ordinal Commonsense Inference (JOCI) dataset (Zhang et al., 2017), instead of selecting which hypothesis is the most plausible, a model is expected to directly assign ordinal 5-level Likert scale judgments (from impossible to very likely). If taking an ordinal interpretation of NLI, this can be viewed as a 5-way variant of the 3-way labels used in SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018).
In this paper, we recast MNLI and JOCI as COPA-style plausibility tasks by sampling and constructing (p, h, h′) triples from these two datasets. Each premise-hypothesis pair (p, h) is labeled with a plausibility level y_{p,h}.
¹ As reported in https://blog.openai.com/language-unsupervised/.

Models
In models based on GPT and BERT for plausibility or NLI, similar neural architectures have been employed. The premise p and hypothesis h are concatenated into a sequence with a special delimiter token, along with a special sentinel token CLS inserted as the token for feature extraction:
$$[\mathrm{CLS};\ p;\ \mathrm{SEP};\ h;\ \mathrm{SEP}]$$
The concatenated string is passed into the BERT or GPT encoder. One takes the encoded vector of the CLS state as the feature vector extracted from the (p, h) pair. Given the feature vector, a dense layer is stacked on top to produce the final score F(p, h), where F : P × H → R is the model.
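The pipeline described in this section can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: whitespace splitting stands in for the encoder's real tokenizer, the `[CLS]`/`[SEP]` strings follow the BERT convention as an assumption, and the helper names `build_input`, `dense_score`, and `pick_most_plausible` are hypothetical. The contextualized encoder itself is omitted; `dense_score` takes whatever CLS feature vector that encoder would return.

```python
from typing import List, Sequence


def build_input(premise: str, hypothesis: str) -> List[str]:
    """Concatenate premise and hypothesis with sentinel and delimiter tokens.

    Assumption: BERT-style markers; whitespace splitting is a stand-in
    for the encoder's actual subword tokenizer.
    """
    return ["[CLS]"] + premise.split() + ["[SEP]"] + hypothesis.split() + ["[SEP]"]


def dense_score(cls_vector: Sequence[float],
                weights: Sequence[float],
                bias: float) -> float:
    """Dense layer mapping the CLS feature vector to a scalar score F(p, h)."""
    return sum(w * x for w, x in zip(weights, cls_vector)) + bias


def pick_most_plausible(scores: Sequence[float]) -> int:
    """Return the index of the candidate hypothesis with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])
```

At prediction time only the relative order of the scores F(p, h) matters, which is why `pick_most_plausible` is a plain argmax and needs no normalization.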

Cross entropy loss
The model is trained to maximize the probability of the correct candidate, normalized over all candidates in the set (leading to a cross entropy log-loss between the posterior distribution of the scores and the true labels):
$$P(h_i \mid p) = \frac{\exp F(p, h_i)}{\sum_{j} \exp F(p, h_j)}. \tag{1}$$

Margin-based loss
As we have argued before, the cross entropy loss employed in Equation 1 is problematic. Instead we propose to use the following margin-based triplet loss (Weston and Watkins, 1999; Chechik et al., 2010; Li et al., 2018):
$$L = \frac{1}{N} \sum_{h > h'} \max\{0,\ \xi - F(p, h) + F(p, h')\}, \tag{2}$$
where N is the number of pairs of hypotheses where the first is more plausible than the second under the given premise p; h > h′ means that h ranks higher than (i.e., is more plausible than) h′ under premise p; and ξ is a margin hyperparameter denoting the desired score difference between these two hypotheses.
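To make the contrast between the two objectives concrete, the following is a minimal pure-Python sketch, assuming the cross entropy loss is the negative log of the softmax-normalized score of the correct candidate and the margin loss is the averaged hinge on score differences as described above. The function names are illustrative and do not come from the paper.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw scores F(p, h_j) over all candidates for one premise."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(scores: Sequence[float], gold_index: int) -> float:
    """Negative log-probability of the correct candidate."""
    return -math.log(softmax(scores)[gold_index])


def margin_loss(score_pairs: Sequence[Tuple[float, float]], margin: float) -> float:
    """Average hinge loss over (F(p, h), F(p, h')) pairs, h more plausible than h'."""
    total = sum(max(0.0, margin - better + worse) for better, worse in score_pairs)
    return total / len(score_pairs)
```

With a score gap of 0.5 and a margin of 0.2, `margin_loss([(0.5, 0.0)], 0.2)` is already zero, so the pair contributes no further gradient. `cross_entropy_loss([0.5, 0.0], 0)` is still positive and keeps shrinking as the gap widens, which is the mechanism that pushes normalized scores toward 0 and 1.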

Recasting Datasets
We consider three datasets: MNLI, JOCI, and COPA. These are all cast as plausibility datasets, into a format comprising (p, h, h′) triples, where h is more plausible than h′ under the context of premise p.
MNLI In MNLI, each premise p is paired with 3 hypotheses. We cast the label on each hypothesis as a relative plausibility judgment, where entailment > neutral > contradiction (we label them as 2, 1, and 0). We construct two 2-choice plausibility tasks from MNLI: MNLI_1 comprises all pairs labeled with 2/1, 2/0, or 1/0, whereas MNLI_2 removes the presumably easier 2/0 pairs. For MNLI_1, the training set is constructed from the original MNLI training dataset, and the dev set is derived from the original MNLI matched dev dataset. For MNLI_2, all of the examples in our training and dev sets are taken from the original MNLI training dataset, hence the same premise exists in both training and dev. This is by our adversarial design: each neutral hypothesis appears either as the preferred alternative (beating contradiction) or as the dispreferred alternative (beaten by entailment), and this role is flipped at evaluation time.
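The pair construction for MNLI described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the input is a list of (premise, hypothesis, label) tuples with labels 2, 1, 0 for entailment, neutral, contradiction, and the function name `build_triples` and its `drop_extreme_pairs` flag are hypothetical. It covers only the pair filtering; the adversarial train/dev split of the second variant is not modeled.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple

Example = Tuple[str, str, int]  # (premise, hypothesis, label in {0, 1, 2})
Triple = Tuple[str, str, str]   # (premise, more plausible h, less plausible h')


def build_triples(examples: Iterable[Example],
                  drop_extreme_pairs: bool = False) -> List[Triple]:
    """Turn labeled hypotheses into (p, h, h') triples, h more plausible than h'.

    With drop_extreme_pairs=True the entailment/contradiction (2/0) pairs
    are skipped, keeping only the 2/1 and 1/0 pairs.
    """
    by_premise: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for premise, hypothesis, label in examples:
        by_premise[premise].append((hypothesis, label))

    triples: List[Triple] = []
    for premise, items in by_premise.items():
        for (h_a, y_a), (h_b, y_b) in combinations(items, 2):
            if y_a == y_b:
                continue  # equal labels give no preference
            if drop_extreme_pairs and {y_a, y_b} == {0, 2}:
                continue  # presumably easier pair, removed in the harder variant
            better, worse = (h_a, h_b) if y_a > y_b else (h_b, h_a)
            triples.append((premise, better, worse))
    return triples
```

For a premise with one entailed, one neutral, and one contradictory hypothesis, this yields three triples by default and two when `drop_extreme_pairs=True`; in the latter case the neutral hypothesis is the dispreferred item in one triple and the preferred item in the other.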
JOCI In JOCI, every inference pair is labeled with its ordinal inference Likert-scale label: 5, 4, 3, 2, or 1. Similar to MNLI, we cast these to 2-choice problems under the following conditions: We ignore inference pairs with scores below 3, aiming for sets akin to COPA, where even the dispreferred option is still often semi-plausible.
COPA We label alternatives as 1 (the more plausible one) and 0 (otherwise).The original dev set in COPA is used as the training set.
Table 1 shows the statistics of these datasets.

Experiments and Analyses
Setup We fine-tune the BERT-BASE-UNCASED model (Devlin et al., 2019) with the margin-based loss, and perform a hyperparameter search on the margin parameter ξ.
For the recast MNLI and JOCI datasets, the margin hyperparameter is ξ = 0.2. Since COPA does not have a training set, we use the original dev set as the training set, and perform 10-fold cross validation to find the best hyperparameter ξ = 0.37. We employ the Adam optimizer (Kingma and Ba, 2014) with initial learning rate η = 3 × 10^{-5}, fine-tune for at most 3 epochs, and use early stopping to select the best model.
Results on Recast MNLI and JOCI Table 2 shows results on the recast MNLI and JOCI datasets. We find that for the two synthetic MNLI datasets, margin-loss performs similarly to cross entropy log-loss. Shifting to the JOCI datasets, with less extreme (contradiction / entailed) hypotheses, especially in the adversarial JOCI_2 variant, margin-loss outperforms log-loss.
Though log-loss and margin-loss give close quantitative results on predicting the more plausible (p, h) pairs, they do so in different ways, confirming our intuition. From Figure 3 we find that log-loss always assigns the more plausible (p, h) pair a very high probability close to 1, and the less plausible (p, h) pair a very low probability close to 0. The per-premise normalized score distribution from margin-loss, also shown in Figure 3, is more reasonable and explainable: hypotheses with different plausibility are distributed hierarchically between 0 and 1.
Table 4: Examples of premises and their corresponding hypotheses in various plausibility datasets, with gold labels and scores given by the log-loss and margin-loss trained models.
Results on COPA Table 3 shows our results on COPA. Compared with previous state-of-the-art knowledge-driven baseline methods, a BERT model trained with a log-loss achieves better performance. When training the BERT model with a margin-loss instead of a log-loss, our method achieves a new state-of-the-art result on the established COPA splits, with an accuracy of 75.4%.³
³ We exclude a blog-posted GPT result, which comes without experimental conditions and is not reproducible.
Analyses Table 4 shows some examples from the MNLI_1, JOCI_1 and COPA datasets, with scores normalized with respect to all hypotheses given a specific premise.
For premise (1) from MNLI_1, log-loss results in a very high score (0.919) for the entailment hypothesis (1a), while assigning a low score (0.0807) to the neutral hypothesis (1b), and an extremely low score (1.71 × 10^{-8}) to the contradiction hypothesis (1c). Though the log-loss can achieve high accuracy by making these extreme predictions, we argue these scores are unintuitive. For premise (2) from MNLI_1, log-loss again gives a very high score (0.505) for the hypothesis (2a).
These are the two ways for the log-loss approach to make predictions with high accuracy: it always gives a very high score to the entailment hypothesis and a low score to the contradiction hypothesis, but gives either a very high or a very low score to the neutral hypothesis. In contrast, the margin-loss gives more intuitive scores for these two examples. We make similar observations for the JOCI_1 examples (3) and (4).
Example (5) from COPA asks for the more plausible cause premise for the effect hypothesis. Here, each of the two candidate premises (5a) and (5b) is a possible answer. The log-loss gives very high (0.972) and very low (0.028) scores for the two candidate premises, which is unreasonable. In contrast, the margin-loss gives much more rational ranking scores for them (0.52 and 0.48). For example (6), which asks for the more likely effect hypothesis for the cause premise, margin-loss again yields more reasonable prediction scores than the log-loss.
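The per-premise normalized scores discussed here can be related back to raw model scores with a short sketch. The normalizer is not specified in the text, so this sketch assumes it is a softmax over the raw scores F(p, h) of the hypotheses that share a premise; the function names `normalize_per_premise` and `implied_raw_gap` are hypothetical.

```python
import math
from typing import Dict, List


def normalize_per_premise(raw_scores: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Softmax-normalize raw scores within each premise (assumed normalizer)."""
    normalized: Dict[str, List[float]] = {}
    for premise, scores in raw_scores.items():
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        normalized[premise] = [e / total for e in exps]
    return normalized


def implied_raw_gap(p_better: float, p_worse: float) -> float:
    """Raw score difference implied by two softmax-normalized scores."""
    return math.log(p_better / p_worse)
```

Under this assumption, the margin-loss scores of 0.52 and 0.48 for example (5) correspond to a raw score gap of roughly 0.08, while the log-loss scores of 0.972 and 0.028 correspond to a gap of roughly 3.5. The normalized view therefore compresses a large difference in how strongly the two models separate the candidates.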
Our qualitative analysis is related to the concept of calibration in statistics: are these resulting scores close to their class membership probabilities? Our intuitive qualitative results might be thought of as a type of calibration for the plausibility task (more "reliable" scores) instead of the more common multi-class classification (Zadrozny and Elkan, 2002; Hastie and Tibshirani, 1998; Niculescu-Mizil and Caruana, 2005).

Conclusion
In this paper, we propose that margin-loss, in contrast to log-loss, is a more plausible training objective for COPA-style plausibility tasks. Through adversarial construction we illustrated that a log-loss approach can be driven to encode plausible statements (Neutral hypotheses in NLI) as either extremely likely or unlikely, which was highlighted in contrasting figures of per-premise normalized hypothesis scores. This intuition was shown to lead to a new state-of-the-art in the original COPA task, based on a margin-based loss.
Figure 1: COPA-like pairs may be constructed from datasets such as MultiNLI, where a premise and two hypotheses are presented, and where the correct (most plausible) item depends on the competing hypothesis. Example: for the premise p "I just stopped where I was", the entailed hypothesis h_E "I stopped in my tracks" is preferred over the neutral hypothesis h_N "I stopped running right were I was", while the same h_N is preferred over the contradictory hypothesis h_C "I continued on my way".

Figure 3: Train and dev score distribution after training with a cross entropy log-loss and a margin-loss.

Table 1: Statistics of various plausibility datasets. All numbers are numbers of (p, h, h′) triplets.

Table 3: Experimental results on COPA test set.