The Devil is in the Details: Evaluating Limitations of Transformer-based Methods for Granular Tasks

Contextual embeddings derived from transformer-based neural language models have shown state-of-the-art performance for various tasks such as question answering, sentiment analysis, and textual similarity in recent years. Extensive work shows how accurately such models can represent abstract, semantic information present in text. In this expository work, we explore a tangential direction and analyze such models' performance on tasks that require a more granular level of representation. We focus on the problem of textual similarity from two perspectives: matching documents on a granular level (requiring embeddings to capture fine-grained attributes in the text), and on an abstract level (requiring embeddings to capture overall textual semantics). We empirically demonstrate, across two datasets from different domains, that despite high performance in abstract document matching as expected, contextual embeddings are consistently (and at times, vastly) outperformed by simple baselines like TF-IDF for more granular tasks. We then propose a simple but effective method to incorporate TF-IDF into models that use contextual embeddings, achieving relative improvements of up to 36% on granular tasks.


Introduction
In recent years, contextual embeddings (Peters et al., 2018; Devlin et al., 2018) have made immense progress in semantic understanding-based tasks. After being trained on large amounts of data, for example via a self-supervised task like masked language modeling, such models learn crucial elements of language, such as syntax and semantics (Jawahar et al., 2019; Goldberg, 2019; Wiedemann et al., 2019), from just raw text. The best-performing contextual embeddings are trained with Transformer-based methods (TBMs) (Vaswani et al., 2017; Devlin et al., 2018). These embeddings have been shown to frequently achieve state-of-the-art results in downstream tasks like question answering and sentiment analysis (van Aken et al., 2019; Sun et al., 2019). Contextual embeddings are also often used to capture the similarity between pairs of documents; for example, on the Semantic Textual Similarity (STS) task (Cer et al., 2017) included in the GLUE benchmark (Wang et al., 2018), TBMs have shown competitive performance, substantially outperforming embedding baselines like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). However, their performance on similarity tasks beyond abstract, semantic ones (Mickus et al., 2019) - for example, on granular news article matching - is less understood.
In this work, we study the performance of TBMs in textual similarity tasks with the following research question: Are transformer-based methods as performant for granular tasks as they are for abstract ones? Here, granular and abstract reflect varying degrees of coarseness in the concept of similarity. For example, consider the news domain: a granular notion of similarity might be whether a pair of articles report the exact same news event. Conversely, an abstract notion might be that the articles share the same topical category, like sports or finance. Figure 1 illustrates this with an example for clarity.
Firstly, we define separate tasks to explore these two notions of similarity on two datasets from different domains: News Articles and Bug Reports. Our analysis of both datasets reveals that contextual embeddings do not perform well on granular tasks and are outperformed by simple baselines like TF-IDF. Secondly, we demonstrate that TBM contextual embeddings do in fact contain important semantic information, and that a simple interpolation strategy between the two methods can boost the relative individual performance of TBMs (TF-IDF) by up to 36% (6%) on the granular task.

Related Work
We discuss related work in two areas: textual similarity, and TBMs.
Textual Similarity has been studied from various perspectives: comparing documents of different lengths in order to capture varying levels of detail (Gong et al., 2018), evaluating semantic similarity between a reference and a generated corpus (Clark et al., 2019), and measuring semantic similarity for long documents in a hierarchical fashion (Jiang et al., 2019). Sentence meta-embeddings (obtained by combining ensembles of sentence embeddings) have also been shown to outperform individual baselines on semantic similarity tasks (Poerner et al., 2019). For duplicate detection, which is a more granular task than semantic similarity, Rodier and Carter (2020) show that near-duplicate news articles can be identified by evaluating n-gram-level overlap between documents. In the news domain, Liu et al. (2018) show that article similarity can be improved by extracting common 'concepts' from the two articles using graph-based approaches.
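To make the n-gram overlap idea concrete, a simple near-duplicate signal can be computed as the Jaccard overlap of word n-grams. This is an illustrative sketch only, not the exact procedure of Rodier and Carter (2020):

```python
def ngram_overlap(text_a, text_b, n=3):
    """Jaccard overlap of word n-grams between two documents.

    A high overlap suggests near-duplicate content; the choice of
    n = 3 and whitespace tokenization are simplifying assumptions.
    """
    def ngrams(text):
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    a, b = ngrams(text_a), ngrams(text_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

In practice, such a score would be thresholded to flag candidate duplicates before any finer-grained comparison.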
TBMs (Liu et al., 2019; Devlin et al., 2018) have been shown to consistently perform better on semantic similarity tasks. Peinelt et al. (2020) also show that BERT-based architectures augmented with topic-related information from topic models achieve higher semantic similarity performance. However, few works have examined TBMs' ability to capture granular information. Khattab and Zaharia (2020) show that BERT can be used for document retrieval by matching embeddings of each word in the query and document, thereby capturing granular similarity.
Unlike previous approaches that focus on either a granular or an abstract similarity task, we compare the performance of TBMs with other baseline methods across the two tasks and, in addition, provide a simple method to improve the performance of TBMs on granular similarity tasks.

Method
In this section, we describe the methodology used to compare two documents from granular and abstract perspectives, and define the granular and abstract text similarity tasks in detail.

Problem Definition
We consider both granular and abstract tasks to be similarity classification tasks operating on a pair of documents. The task-specific labels are binary, indicating whether the pair is judged to be similar or not (one label for the abstract task, one for the granular task). From a corpus C, we consider a pair of documents d_1, d_2 and their task-specific similarity judgment y (without loss of generality). We define e_k = f(d_k), where e_k is d_k's embedding, produced by f(·). In practice, f could be a vector space method like TF-IDF, or the final layer of a TBM. Upon obtaining e_1, e_2, we generate a (symmetric) pair similarity score g(e_1, e_2) corresponding to the given task, and use it to arrive at a binary prediction ŷ. Performance is measured using standard metrics quantifying agreement between ŷ and y across pairs.
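A minimal sketch of this formulation, assuming cosine similarity for g and an illustrative fixed decision threshold (in our experiments, the mapping from score to prediction is instead learned by a downstream classifier):

```python
import math

def cosine_similarity(e1, e2):
    """Symmetric pair similarity score g(e1, e2) between two document embeddings."""
    dot = sum(a * b for a, b in zip(e1, e2))
    norm1 = math.sqrt(sum(a * a for a in e1))
    norm2 = math.sqrt(sum(b * b for b in e2))
    return dot / (norm1 * norm2)

def predict(e1, e2, threshold=0.5):
    """Binary prediction y_hat from the fixed score; the 0.5 threshold is illustrative."""
    return 1 if cosine_similarity(e1, e2) >= threshold else 0
```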

Experimental Setup
We consider two experimental settings: indirect and direct. These are illustrated in Figure 2. In the indirect setting, we indirectly learn to predict y via a fixed g. Specifically, we a priori define g as the cosine similarity function, and define a vector t, with each entry corresponding to g(e_1, e_2) for a given pair of document embeddings w.l.o.g. We then feed t as a feature, along with the task-specific label vector y, to XGBoost (Chen and Guestrin, 2016) to obtain the predictions ŷ. We evaluate using several embedding functions:
• TF-IDF: TF-IDF (Ramos and others, 2003) weights of the 1-gram tokens, inserted at their respective indices in a vector whose length equals the training set's vocabulary size.
• WME: Word Mover's Embedding (WME) (Wu et al., 2018), generated from static embeddings like word2vec (Mikolov et al., 2013) using the Word Mover's Distance metric (Kusner et al., 2015).
• SIF: The SIF (Arora et al., 2017) weighting scheme applied over pretrained GloVe embeddings.
• RT: The embedding corresponding to the CLS token in the final layer of a pretrained RoBERTa (Liu et al., 2019) model.
• LF: The embedding corresponding to the CLS token in the final layer of a pretrained Longformer (Beltagy et al., 2020) model.
• ST-RT: Sentence embeddings generated using Sentence Transformers (Reimers and Gurevych, 2019), i.e., a RoBERTa model fine-tuned on the STS benchmark.

In the direct setting, we directly learn to predict y from the embeddings e_1, e_2 in an end-to-end manner, rather than through a predefined similarity measure. We use the best performing embeddings from the previous setting as input to train the model g, which produces a score that is thresholded to derive ŷ.
• TF-IDF-E2E: We compute the absolute difference between the TF-IDF vectors of the article pairs and train a Logistic Regression classifier with the labels corresponding to the similarity task.
• RT-E2E: Since the ST-RT embedding uses a pre-trained RoBERTa model, we train RoBERTa end-to-end instead (as Reimers and Gurevych (2019) note, ST is not intended for end-to-end use). We provide the article pairs to a pre-trained RoBERTa model, separated by the SEP token, which is then directly fine-tuned on the task-specific labels.
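The TF-IDF-E2E featurization can be sketched as follows. This is a deliberately minimal 1-gram TF-IDF (raw term frequency times log IDF, with whitespace tokenization and no smoothing); our experiments use standard implementations, so the details here are assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Minimal TF-IDF: one weight per 1-gram token in the corpus vocabulary."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n = len(docs)
    # Document frequency and (unsmoothed) inverse document frequency per term.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    idf = {t: math.log(n / df[t]) for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[t] / len(toks) * idf[t] for t in vocab])
    return vectors, vocab

def abs_diff(v1, v2):
    """Feature vector for the TF-IDF-E2E classifier: element-wise |v1 - v2|."""
    return [abs(a - b) for a, b in zip(v1, v2)]
```

The `abs_diff` output would then serve as the input features to the Logistic Regression classifier.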

Datasets
We evaluate with datasets from the News Articles and Bug Reports domains to demonstrate generality. Each includes both abstract and granular labels for the same documents.

Table 3: Performance for granular and abstract tasks on the 2 datasets as we vary the value of w. Note that the best granular results are achieved by interpolating TF-IDF with TBM predictions (w = 0.7).
News Dedup dataset (ND) contains pairs of news articles from 243 different English news sources, collected between September and November 2019 from RSS feeds. For each pair, we assign a granular binary label indicating whether the pair reports the same news event, and an abstract binary label reflecting whether they share the same topic (politics, business, technology, entertainment, sports, science, or other - adapted from Google News). We source annotations from Amazon Mechanical Turk, relying on agreement among multiple annotators, with a Fleiss' κ coefficient of 0.68. See Appendix A for details.
Bug Repo dataset (BR) (Lamkanfi et al., 2013) contains bug reports from several open-source projects like Eclipse and Mozilla, and is used primarily for duplicate bug detection. Each report consists of a title, a description of the error, the broad category that the bug belongs to (e.g., UI or Scripting, among 21 others), and a set of duplicate reports that flag the same bug. We define granular similarity as pairs that flag the same bug, and abstract similarity as pairs that fall under the same category, using the title of the report as the textual input.

For each dataset, the documents in the train and test splits are disjoint sets, ensuring that the model does not memorize textual representations. For ND, the sets are also temporally disjoint, avoiding event overlap between the train and test splits. Further details about the splits are provided in Table 1.

Results
Table 2 summarizes the experiments on the two datasets using the methods described in Section 3. We observe that a simple TF-IDF based approach outperforms all embedding methods on the granular-level similarity tasks. However, for the abstract-level similarity task, training a RoBERTa model end-to-end achieves, as expected, the highest performance.
Despite its better results, the complete absence of semantic understanding is a disadvantage of TF-IDF. To mitigate this issue, we propose a simple approach that merges the best performing indirect methods. Let g_t and g_r be the similarity scores obtained from the TF-IDF and ST-RT approaches, respectively: we obtain a new, interpolated score g_i = w·g_t + (1−w)·g_r, which is then used as input to the classifier. As Table 3 shows, performance changes drastically as w varies. For both datasets, we observe the best results when w = 0.7, demonstrating that combining semantic and fine-grained information is helpful for granular tasks. Conversely, we achieve the best performance on the abstract level when using only ST-RT. We hypothesize that the noise introduced by the granular information causes the performance drop in cases prioritizing abstract, semantic relevance.
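The interpolation itself is a one-line computation; a sketch (the function name is ours):

```python
def interpolate(g_t, g_r, w=0.7):
    """Interpolated similarity score g_i = w * g_t + (1 - w) * g_r.

    g_t: fine-grained TF-IDF similarity score for the pair.
    g_r: semantic ST-RT similarity score for the pair.
    w:   interpolation weight; w = 0.7 gave the best granular
         results on both datasets, per Table 3.
    """
    return w * g_t + (1 - w) * g_r
```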

Conclusion
In this work, we study the use of contextual embeddings derived from transformer-based models (TBMs) for semantic similarity tasks of varying granularity. Through empirical analysis, we show that while TBMs achieve higher performance on abstract similarity tasks, simple methods like TF-IDF outperform these models on granular similarity tasks (like event matching). We then propose a simple but effective method to merge the two approaches, achieving relative improvements of 36% (6%) compared to using only TBMs (TF-IDF). In future work, we plan to investigate integrating granular information into TBM contextual embeddings to toggle the granularity that such embeddings inherently encode.

Figure 1: An example pair of articles from the News Dedup dataset. Both report the same news event and are thus similar on a granular level; the colored text indicates the fine-grained details associated with this determination. Both articles also belong to the "sports" topic and are thus similar on an abstract level.

Figure 2: Our experimental setups take n pairs, each consisting of two documents (d_1, d_2) and their similarity label (y), and yield a similarity prediction (ŷ).

Table 1: Dataset and evaluation split details. The dataset statistics report the number of unique text pairs. For the task-specific statistics, we report the number of similar and non-similar pairs (similar/not-similar) according to the task for each split. The class imbalance in the ND test set replicates the distribution found in real-world news event similarity detection problems.

Table 2: Granular and abstract similarity results. TF-IDF outperforms TBMs on granular tasks, while TBMs outperform TF-IDF on abstract tasks, in both settings.