Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets

Existing datasets for scoring text pairs in terms of semantic similarity contain instances that differ in how difficult they are to resolve. This paper proposes to distinguish obvious from non-obvious text pairs based on superficial lexical overlap and ground-truth labels. We characterise existing datasets in terms of the difficult cases they contain and find that recently proposed models struggle to capture the non-obvious cases of semantic similarity. We describe metrics that emphasise cases of similarity which require more complex inference and propose that these be used for evaluating systems for semantic similarity.


Introduction
Modelling semantic similarity between a pair of texts is a fundamental task in NLP with a wide range of applications (Baudiš et al., 2016). One area of active research is Community Question Answering (CQA) (Nakov et al., 2017; Bonadiman et al., 2017), which is concerned with the automatic answering of questions based on user-generated content from Q&A websites (e.g. StackExchange) and requires modelling the semantic similarity between question and answer pairs. Another well-studied task is paraphrase detection (Socher et al., 2011; He et al., 2015; Tomar et al., 2017), which models the semantic equivalence between a pair of sentences.
Evaluation for such tasks has primarily focused on metrics, such as mean average precision (MAP), F1 or accuracy, which give equal weights to all examples, regardless of their difficulty. However, as illustrated by the examples in Table 1, not all items within text pair similarity datasets are equally difficult to resolve.
Recent work has shown the need to better understand limitations of current models and datasets in natural language understanding (Wadhwa et al., 2018a; Rajpurkar et al., 2018). For example, Kaushik and Lipton (2018) showed that models sometimes exploit dataset properties to achieve high performance even when crucial task information is withheld, and Gururangan et al. (2018) demonstrated that model performance is inflated by annotation artefacts in natural language inference tasks.
In this paper, we analyse current datasets and recently proposed models by focusing on item difficulty based on shallow lexical overlap. Rodrigues et al. (2018) found declarative CQA sentence pairs to be more difficult to resolve than interrogative pairs as the latter contain more cases of superficial overlap. In addition, Wadhwa et al. (2018b) showed that competitive neural reading comprehension models are susceptible to shallow patterns (e.g. lexical overlap). Our study digs deeper into these findings to investigate the properties of current text pair similarity datasets with respect to different levels of difficulty and evaluates models based on how well they can resolve difficult cases.
We make the following contributions:
1. We propose a criterion to distinguish between obvious and non-obvious examples in text pair similarity datasets (section 4).
2. We characterise current datasets in terms of the extent to which they contain obvious vs. non-obvious items (section 4).

Datasets and Tasks
We selected well-known benchmark datasets differing in size (small vs. large), document length (single sentence vs. multi-sentence), document types (declarative vs. interrogative) and tasks (answer ranking vs. paraphrase detection vs. similarity scoring); see Table 2.
SemEval The SemEval Community Question Answering (CQA) dataset (Nakov et al., 2015, 2016, 2017) contains posts from the online forum Qatar Living. The task is to rank relevant posts above non-relevant ones. Each subtask involves an initial post and 10 possibly relevant posts with binary annotations. Task A contains questions and comments from the same thread, task B involves question paraphrases, and task C is similar to A but contains comments from an external thread.

Table 2: Selected text pair similarity data sets. Size as number of text pairs. rank=ranking task, class=classification task, regr=regression task.

MSRP
Quora The Quora duplicate questions dataset contains a large number of question pairs with binary labels.¹ The task is to predict whether two questions are paraphrases, similar to Task B of SemEval, but it is framed as a classification rather than a ranking problem. We use the same training / development / test set partition as Wang et al. (2017).

¹ https://engineering.quora.com/Semantic-Question-Matching-with-Deep-Learning

In this paper, we focus on predicting the semantic similarity between two text snippets in a binary classification scenario, as the ranking scenario is only applicable to some of the datasets. Binary labels are already provided for all tasks except for STS. In the case of STS, we convert the scores into binary labels. Based on the description of the relatedness scores in Cer et al. (2017), we assign a positive label if relatedness ≥ 4 and a negative one otherwise, so that the criterion is similar to the one used in the other datasets.
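The STS conversion described above can be sketched as follows. This is only an illustration: the function name `binarise_sts` and the example scores are hypothetical, and only the threshold of 4 comes from the text.

```python
def binarise_sts(score: float, threshold: float = 4.0) -> int:
    """Map a graded STS relatedness score to a binary label.

    Returns 1 (positive) if score >= threshold, else 0 (negative).
    The default threshold of 4 follows the rule stated in the text.
    """
    return 1 if score >= threshold else 0


# Hypothetical relatedness scores, for illustration only.
example_scores = [0.0, 2.5, 3.8, 4.0, 5.0]
example_labels = [binarise_sts(s) for s in example_scores]
print(example_labels)  # [0, 0, 0, 1, 1]
```

Note that the boundary value 4.0 itself is mapped to the positive class, matching the "≥ 4" criterion.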

Lexical divergence in current datasets
To characterise the datasets, we represent the text pairs as two distributions over words and measure their lexical divergence using Jensen-Shannon divergence (JSD) (Lin, 1991).² Figure 1 shows the entire JSD distribution by label for each dataset.
The datasets differ with respect to the degree of lexical divergence they contain: the three SemEval CQA datasets show a high degree of lexical divergence (majority > 0.5), especially in the external QA scenario (task C). Text pairs in MSRP tend to have low-medium JSD scores (majority < 0.6), while items in Quora and STS show the widest range of lexical divergence (see also Appendix A). Overall, pairs with negative labels tend to have higher JSD scores than pairs with positive labels. Especially in Quora, MSRP and STS, distinct distributions emerge for positive vs. negative labels, providing direct clues for label assignment.
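As a rough illustration of the divergence measure, the sketch below computes JSD between the word distributions of two texts. Several details are assumptions rather than the authors' setup: tokenisation is lowercased whitespace splitting, word probabilities are relative frequencies, and the logarithm is base 2 so that scores fall in [0, 1]. The function names are hypothetical.

```python
import math
from collections import Counter


def word_distribution(text):
    """Relative word frequencies (assumed: lowercased whitespace tokens)."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def jsd(text_a, text_b):
    """Jensen-Shannon divergence (base 2) between two non-empty texts."""
    p = word_distribution(text_a)
    q = word_distribution(text_b)
    divergence = 0.0
    for word in set(p) | set(q):
        p_w = p.get(word, 0.0)
        q_w = q.get(word, 0.0)
        m_w = 0.5 * (p_w + q_w)  # mixture distribution M = (P + Q) / 2
        # JSD = 0.5 * KL(P || M) + 0.5 * KL(Q || M); zero-probability
        # terms contribute nothing and are skipped.
        if p_w > 0:
            divergence += 0.5 * p_w * math.log2(p_w / m_w)
        if q_w > 0:
            divergence += 0.5 * q_w * math.log2(q_w / m_w)
    return divergence


print(jsd("how do I learn python", "how do I learn python"))  # identical texts
print(jsd("how do I learn python", "best pizza in town"))     # no shared words
```

Under these assumptions, identical texts score 0 and texts with no words in common score 1, with partial overlap falling in between.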

Distinguishing between obvious and non-obvious examples
As shown, pairs with high lexical divergence tend to have a negative label in the above datasets (e.g. N_o in Table 1), while low lexical divergence is associated with a positive label (e.g. P_o in Table 1). Intuitively, these are cases which should be relatively easy to identify. More difficult are text pairs with a positive label but high lexical divergence (e.g. P_n in Table 1), or a negative label despite low lexical divergence (e.g. N_n in Table 1). We use Table 3 to categorise cases in terms of their difficulty level.
|          | positive label        | negative label        |
| low div  | obvious pos (P_o)     | non-obvious neg (N_n) |
| high div | non-obvious pos (P_n) | obvious neg (N_o)     |

Table 3: Defining obvious and non-obvious similarity cases based on labels and lexical overlap.

² We also calculated set-based similarity metrics (Jaccard Index and Dice Coefficient) and found consistent results with JSD, but give preference to the distribution-based metric which is more natural for text. Due to space restrictions, we only report JSD in this paper.

Pairs are categorised into high and low lexical divergence categories by comparing their JSD score to the median of the entire JSD distribution in order to account for differences between datasets (> median: high div, ≤ median: low div). To verify whether this automatic difficulty distinction corresponds to real-world difficulty, the authors of the study annotated the semantic relatedness of 100 random pairs from the Quora development set and measured inter-annotator agreement based on Fleiss' Kappa. The agreement for non-obvious cases (P_n and N_n) is significantly lower (p < 0.01 with a permutation test) than for obvious cases (P_o and N_o), and the average annotation time per item is longer for non-obvious cases (Table 4), confirming the validity of this distinction.

Table 5 shows the number of instances in the four cases across datasets. In all of the analysed datasets, there are more obvious positives (P_o) than non-obvious positives (P_n) and more obvious negatives (N_o) than non-obvious negatives (N_n). All obvious cases combined (P_o + N_o) make up more than 50% of pairs across all datasets.
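The median split and the four-way categorisation can be sketched as below. The function name `categorise`, the category strings and the toy scores are hypothetical; the rule itself (> median is high divergence, ≤ median is low divergence, combined with the gold label) follows the description above.

```python
from statistics import median


def categorise(jsd_scores, labels):
    """Assign each pair to P_o, P_n, N_o or N_n.

    jsd_scores: lexical divergence per pair (one dataset at a time).
    labels: gold labels, 1 = positive, 0 = negative.
    """
    threshold = median(jsd_scores)  # dataset-specific median
    categories = []
    for score, label in zip(jsd_scores, labels):
        high_divergence = score > threshold  # <= median counts as low
        if label == 1:
            categories.append("P_n" if high_divergence else "P_o")
        else:
            categories.append("N_o" if high_divergence else "N_n")
    return categories


# Toy data for illustration only (median of the scores is 0.5).
toy_scores = [0.1, 0.2, 0.8, 0.9]
toy_labels = [1, 0, 1, 0]
print(categorise(toy_scores, toy_labels))  # ['P_o', 'N_n', 'P_n', 'N_o']
```

Because the median is taken per dataset, the same JSD score can count as high divergence in one dataset and low divergence in another.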

Evaluating model predictions based on difficulty
We now use this categorisation for the purpose of model evaluation (Tables 6-8).³ We calculate the

Conclusion
We present an automated criterion for distinguishing between easy and difficult items in text pair similarity prediction tasks. We find that more than 50% of cases in current datasets are relatively obvious. Recently proposed models perform significantly worse on non-obvious cases compared to obvious cases. In order to encourage the development of models that perform well on difficult items, we propose to use non-obvious F1 scores (F1_n) as a complementary ranking metric for model evaluation. We also recommend publishing prediction files along with models to facilitate error analysis.
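To make the proposed metric concrete, the sketch below computes an F1 score restricted to non-obvious items. The exact definition of F1_n is not spelled out in this section, so the code rests on an assumption: F1_n is taken to be the positive-class F1 computed over only the pairs categorised as P_n or N_n. The function name, category strings and toy predictions are hypothetical.

```python
def f1_on_subset(gold, pred, categories, keep=("P_n", "N_n")):
    """Positive-class F1 over the items whose category is in `keep`.

    Assumption: with keep=("P_n", "N_n") this corresponds to F1_n.
    """
    tp = fp = fn = 0
    for g, p, cat in zip(gold, pred, categories):
        if cat not in keep:
            continue  # skip items outside the chosen difficulty subset
        if p == 1 and g == 1:
            tp += 1
        elif p == 1 and g == 0:
            fp += 1
        elif p == 0 and g == 1:
            fn += 1
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example for illustration only.
gold = [1, 0, 1, 0, 1, 0]
pred = [1, 0, 1, 1, 0, 0]
cats = ["P_o", "N_o", "P_n", "N_n", "P_n", "N_n"]

f1_all = f1_on_subset(gold, pred, cats, keep=("P_o", "N_o", "P_n", "N_n"))
f1_n = f1_on_subset(gold, pred, cats)
print(round(f1_all, 3), round(f1_n, 3))  # 0.667 0.5
```

In this toy example, the score over all items hides the fact that every error falls on a non-obvious pair, which is the kind of gap a complementary F1_n ranking is meant to expose.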