tBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection

Semantic similarity detection is a fundamental task in natural language understanding. Adding topic information has been useful for previous feature-engineered semantic similarity models as well as neural models for other tasks. There is currently no standard way of combining topics with pretrained contextual representations such as BERT. We propose a novel topic-informed BERT-based architecture for pairwise semantic similarity detection and show that our model improves performance over strong neural baselines across a variety of English language datasets. We find that the addition of topics to BERT helps particularly with resolving domain-specific cases.


Introduction
Modelling the semantic similarity between a pair of texts is a crucial NLP task with applications ranging from question answering to plagiarism detection. A variety of models have been proposed for this problem, including traditional feature-engineered techniques (Filice et al., 2017), hybrid approaches (Wu et al., 2017; Feng et al., 2017; Koreeda et al., 2017) and purely neural architectures (Wang et al., 2017; Tan et al., 2018; Deriu and Cieliebak, 2017). Recent pretrained contextualised representations such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) have led to impressive performance gains across a variety of NLP tasks, including semantic similarity detection. These models leverage large amounts of data to pretrain text encoders (in contrast to just individual word embeddings as in previous work) and have established a new pretrain-finetune paradigm.
While large improvements have been achieved on paraphrase detection (Tomar et al., 2017; Gong et al., 2018), semantic similarity detection in Community Question Answering (CQA) remains a challenging problem. CQA leverages user-generated content from question answering websites (e.g. StackExchange) to answer complex real-world questions (Nakov et al., 2017). The task requires modelling the relatedness between question-answer pairs, which can be challenging due to the highly domain-specific language of certain online forums and low levels of direct text overlap between questions and answers.
Topic models may provide additional signals for semantic similarity, as earlier feature-engineered models for semantic similarity detection successfully incorporated topics (Qin et al., 2009; Tran et al., 2015; Mihaylov and Nakov, 2016; Wu et al., 2017). They could be especially useful for dealing with domain-specific language, since topic models have been exploited for domain adaptation (Hu et al., 2014; Guo et al., 2009). Moreover, recent work on neural architectures has shown that the integration of topics can yield improvements in other tasks such as language modelling (Ghosh et al., 2016), machine translation (Chen et al., 2016), and summarisation (Narayan et al., 2018). We therefore introduce a novel architecture for semantic similarity detection which incorporates topic models and BERT. More specifically, we make the following contributions:

1. We propose tBERT, a simple architecture combining topics with BERT for semantic similarity prediction (section 3).

2. We demonstrate that tBERT achieves improvements over a finetuned vanilla BERT and other neural models across multiple semantic similarity prediction datasets, in both F1 and stricter evaluation metrics (section 5).

3. We show in our error analysis that tBERT's gains are prominent on domain-specific cases, such as those encountered in CQA (section 5).

Datasets and Tasks
We select popular benchmark datasets featuring different sizes (small vs. large), tasks (QA vs. paraphrase detection) and sentence lengths (short vs. long) as summarised in Table 1. Examples for each dataset are provided in Appendix A.
MSRP The Microsoft Research Paraphrase dataset (MSRP) contains pairs of sentences from news websites with binary labels for paraphrase detection (Dolan and Brockett, 2005).
SemEval The SemEval CQA dataset (Nakov et al., 2015, 2016, 2017) comprises three subtasks based on threads and posts from the online expat forum Qatar Living. Each subtask contains an initial post as well as 10 possibly relevant posts with binary labels, and requires ranking relevant posts above non-relevant ones. In subtask A, the posts are questions and comments from the same thread, in an answer ranking scenario. Subtask B is question paraphrase ranking. Subtask C is similar to A, but comments were retrieved from an external thread, which increases the difficulty of the task.

Quora The Quora duplicate questions dataset contains more than 400k question pairs with binary labels and is by far the largest of the datasets. The task is to predict whether two questions are paraphrases. The setup is similar to SemEval subtask B, but framed as a classification rather than a ranking problem. We use Wang et al. (2017)'s train/dev/test set partition.
All of the above datasets provide two short texts (usually a sentence long, but in some cases consisting of multiple sentences). From here onward we use the term 'sentence' to refer to each short text. We frame the task as binary classification, predicting the semantic similarity between two sentences, since this setup is the most generic and applies to all of the above datasets.

Architecture
In this paper, we investigate whether topic models can further improve BERT's performance for semantic similarity detection. Our proposed topic-informed BERT-based model (tBERT) is shown in Figure 1. We encode the two sentences S_1 (with length N) and S_2 (with length M) with the uncased version of BERT_BASE (Devlin et al., 2019), using the C vector from BERT's final layer corresponding to the CLS token in the input as the sentence pair representation:

C = \mathrm{BERT}(S_1, S_2) \in \mathbb{R}^{d}

where d denotes the internal hidden size of BERT (768 for BERT_BASE). While other topic models can be used, we experiment with two popular topic models, LDA (Blei et al., 2003) and GSDMM (Yin and Wang, 2014); see section 3.2 for details. Based on previous research which successfully combined word- and document-level topics with neural architectures (Narayan et al., 2018), we further experiment with incorporating different topic types. For document topics D_1 and D_2, all tokens in a sentence are passed to the topic model to infer one topic distribution per sentence:

D_i = \mathrm{TopicModel}(S_i) \in \mathbb{R}^{t}, \quad i \in \{1, 2\}

where t indicates the number of topics. Alternatively, for word topics W_1 and W_2, one topic distribution w_j is inferred per token T_j before averaging them to obtain a fixed-length topic representation on the sentence level:

W_1 = \frac{1}{N} \sum_{j=1}^{N} w_j \in \mathbb{R}^{t}

and analogously for W_2 over M tokens. We combine the sentence pair vector with the sentence-level topic representations, similar to Ostendorff et al. (2019), as

F_D = [C; D_1; D_2] \in \mathbb{R}^{d + 2t}

for document topics and as

F_W = [C; W_1; W_2] \in \mathbb{R}^{d + 2t}

for word topics, where ; denotes concatenation. This is followed by a hidden layer and a softmax classification layer. We train the model for 3 epochs with early stopping and cross-entropy loss. Learning rates are tuned per dataset and random seed.
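To make the combination step concrete, the following is a minimal PyTorch sketch of the classification head over [C; topics]. It is an illustration under assumptions, not the authors' released implementation: the topic count, hidden size and activation function are placeholders, and the BERT C vector and topic distributions are assumed to be computed upstream.

```python
# Illustrative sketch of the tBERT classification head (not the authors' code).
# Assumptions: 768-dim BERT-base C vector, t topics per sentence, tanh hidden layer.
import torch
import torch.nn as nn

class TBertHead(nn.Module):
    def __init__(self, bert_dim=768, num_topics=80, hidden_dim=768, num_classes=2):
        super().__init__()
        # F = [C; topics(S1); topics(S2)] -> hidden layer -> class logits
        self.hidden = nn.Linear(bert_dim + 2 * num_topics, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, cls_vec, topics_s1, topics_s2):
        # cls_vec: (batch, bert_dim) C vector for the sentence pair
        # topics_s1, topics_s2: (batch, num_topics) document or averaged word topics
        features = torch.cat([cls_vec, topics_s1, topics_s2], dim=-1)
        hidden = torch.tanh(self.hidden(features))
        return self.classifier(hidden)  # softmax/cross-entropy applied in the loss
```

During training, these logits feed a cross-entropy loss while gradients also flow into the underlying BERT encoder, matching the joint finetuning described above.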

Choice of Topic Model
Topic number and alpha value The number of topics and the alpha value are important, dataset-dependent topic model hyper-parameters. We use the simple topic baseline (section 4) as a fast proxy (on average 12 seconds per experiment on CPU) to identify useful topic models for each dataset without expensive hyper-parameter tuning on the full tBERT model.

Table 2: F1 scores of BERT-based models with different topic settings on the development set. We report average performance over two different random seeds. Bold indicates the selected setting for our final model.
Topic model and topic type LDA (Blei et al., 2003) is the most popular and widely used topic model, but it has been reported to be less suitable for short text (Hong and Davison, 2010). Therefore, we also experiment with the popular short-text topic model GSDMM (Yin and Wang, 2014). To select the best setting for our final model (in Table 3), we evaluated different combinations of tBERT with LDA vs. GSDMM and word (W_1 and W_2) vs. document topics (D_1 and D_2) on the development partition of the datasets (Table 2). The tBERT settings generally scored higher than BERT, with word topics (W_1 and W_2) usually outperforming document topics.
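For illustration, the sketch below derives both topic types with gensim's LDA, an assumed tool choice (the paper does not prescribe a library). Document topics infer one distribution over the whole token sequence; word topics are approximated here by querying the model with each token as a one-word document and averaging, which is one simple way to realise the per-token distributions described above.

```python
# Sketch of document vs. word topics with gensim's LDA (assumed tooling).
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy training corpus standing in for the tokenised dataset sentences.
train = [["renew", "visa", "qatar"], ["visa", "renewal", "process"],
         ["best", "beach", "doha"], ["beach", "family", "weekend"]]
dictionary = Dictionary(train)
lda = LdaModel([dictionary.doc2bow(d) for d in train],
               num_topics=2, id2word=dictionary, alpha=0.1, random_state=0)

def document_topics(tokens):
    # D_i: one distribution inferred from all tokens of the sentence.
    dist = np.zeros(lda.num_topics)
    for topic_id, prob in lda.get_document_topics(dictionary.doc2bow(tokens),
                                                  minimum_probability=0.0):
        dist[topic_id] = prob
    return dist

def word_topics(tokens):
    # W_i: one distribution per token (via one-word queries), then averaged.
    return np.mean([document_topics([t]) for t in tokens], axis=0)
```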

Baselines
Topic baselines As a simple baseline, we train a topic model (LDA or GSDMM) on the training portion of each dataset (combining the training sets of the SemEval subtasks) and calculate the Jensen-Shannon divergence (JSD) (Lin, 1991) between the topic distributions of the two sentences. The model predicts a negative label if the JSD is larger than a threshold and a positive label otherwise (a minimal sketch of this decision rule is given below). We tune the threshold, the number of topics and the alpha value based on development set F1.

Previous systems For SemEval, we compare against the highest-performing system of earlier work based on F1 score. As these models rely on hand-crafted dataset-specific features (providing an advantage on the small datasets), we also include the only neural system without manual features (Deriu and Cieliebak, 2017). For MSRP, we show a neural matching architecture (Pang et al., 2016). For Quora, we compare against the Interactive Inference Network (Gong et al., 2018) using accuracy, as no F1 score has been reported.
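As referenced above, here is a minimal sketch of the topic baseline's decision rule, assuming SciPy for the divergence computation; the threshold shown is a placeholder for the tuned value.

```python
# Minimal sketch of the JSD topic baseline; the threshold is a placeholder
# for the value tuned on the development set.
from scipy.spatial.distance import jensenshannon

def jsd_baseline_predict(topics_s1, topics_s2, threshold=0.5):
    # SciPy returns the Jensen-Shannon *distance* (square root of the
    # divergence); squaring recovers the divergence of Lin (1991).
    jsd = jensenshannon(topics_s1, topics_s2, base=2) ** 2
    # Large divergence between topic distributions -> predict "not similar".
    return 0 if jsd > threshold else 1
```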
Siamese BiLSTM Siamese networks are a common neural baseline for sentence pair classification tasks (Yih et al., 2011; Wang et al., 2017). We embed both sentences with pretrained GloVe embeddings (concatenated with ELMo for the BiLSTM + ELMo variant) and encode them with two weight-sharing BiLSTMs, followed by max pooling and hidden layers.
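A hedged PyTorch sketch of this baseline follows; sizes are illustrative assumptions, and in practice the embedding layer would be initialised with the pretrained GloVe vectors.

```python
# Hedged sketch of the Siamese BiLSTM baseline: one weight-shared encoder
# applied to both sentences, max pooling over time, then hidden layers.
import torch
import torch.nn as nn

class SiameseBiLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)  # init from GloVe
        self.encoder = nn.LSTM(embed_dim, hidden_dim,
                               bidirectional=True, batch_first=True)
        self.hidden = nn.Linear(4 * hidden_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def encode(self, token_ids):
        outputs, _ = self.encoder(self.embedding(token_ids))  # (batch, seq, 2h)
        return outputs.max(dim=1).values                      # max pooling

    def forward(self, ids_s1, ids_s2):
        # Weight sharing: the same encoder processes both sentences.
        pair = torch.cat([self.encode(ids_s1), self.encode(ids_s2)], dim=-1)
        return self.classifier(torch.relu(self.hidden(pair)))
```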
BERT We encode the sentence pair with BERT's C vector (as in tBERT), followed by a softmax layer, and finetune all layers for 3 epochs with early stopping. Following Devlin et al. (2019), we tune learning rates on the development set of each dataset.
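As a rough sketch, this setup can be expressed with the Hugging Face transformers library (an assumed toolkit, not necessarily the one used in the paper); the example pair and hyper-parameters are illustrative.

```python
# Rough sketch of sentence pair classification with Hugging Face transformers
# (assumed tooling; example text is illustrative).
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# The pair is packed as [CLS] s1 [SEP] s2 [SEP]; the classifier sits on C.
batch = tokenizer(["is doha safe at night?"], ["how safe is doha after dark?"],
                  return_tensors="pt", padding=True, truncation=True)
loss = model(**batch, labels=torch.tensor([1])).loss
loss.backward()  # gradients flow through all BERT layers (full finetuning)
```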

Results
Evaluation We evaluate systems based on F1 scores (Table 3), as this is more reliable than accuracy for datasets with imbalanced labels (e.g. SemEval C). We further report performance on difficult cases with the non-obvious F1 score (Peinelt et al., 2019), which identifies challenging instances in a dataset based on lexical overlap and gold labels. Dodge et al. (2020) recently showed that early stopping and random seeds can have a considerable impact on the performance of finetuned BERT models. We therefore use early stopping during finetuning and report average model performance across two seeds for the BERT and tBERT models.
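As a rough operationalisation of the non-obvious subset (an assumption based on the description above, not the exact procedure of Peinelt et al., 2019): positive pairs with low lexical overlap and negative pairs with high overlap are the cases that defeat simple word-matching heuristics.

```python
# Hypothetical sketch of 'non-obvious' F1; the overlap measure (Jaccard)
# and the threshold are assumptions, not the cited paper's exact recipe.
from sklearn.metrics import f1_score

def jaccard(s1, s2):
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / max(len(a | b), 1)

def non_obvious_f1(pairs, gold, pred, threshold=0.5):
    # Keep positives with low overlap and negatives with high overlap.
    keep = [(g == 1 and jaccard(s1, s2) <= threshold) or
            (g == 0 and jaccard(s1, s2) > threshold)
            for (s1, s2), g in zip(pairs, gold)]
    y_true = [g for g, k in zip(gold, keep) if k]
    y_pred = [p for p, k in zip(pred, keep) if k]
    return f1_score(y_true, y_pred)
```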
Overall trends The BERT-based models outperform the other neural systems, while closely competing with the feature-engineered system on the relatively small SemEval A dataset. The simple topic baselines perform surprisingly well in comparison to much more sophisticated models, indicating the usefulness of topics for the tasks.

Do topics improve BERT's performance?
Adding LDA topics to BERT consistently improves F1 performance across all datasets. Moreover, it improves performance on non-obvious cases over BERT on all datasets (except for Quora, which contains many generic examples and few domain-specific cases, see Table 4). The addition of GSDMM topics to BERT is slightly less stable: it improves performance on MSRP and SemEval A and B, while dropping on SemEval C.

Table 3: Model performance on test set. The first 6 rows are taken from the cited papers. Bold font highlights the best system overall and our best implementation is underlined. Italics indicate that F1 and accuracy were identical.

The picture was less clear for named entities: based on manual inspection, BERT dealt better with common named entities likely to have occurred in pretraining (such as well-known brands), while tBERT improved on dataset-specific named entities. We reason that for domain-specific words which are unlikely to have occurred in pretraining (e.g. Fuwairit in Table 5), BERT may not have learned a good representation (even after finetuning) and hence cannot make a correct prediction. Here, topic models can serve as an additional source of dataset-specific information. The usefulness of topics for such cases is also supported by previous work, which successfully leveraged topics for domain adaptation in machine translation (Hu et al., 2014) and named entity recognition (Guo et al., 2009).
Could we just finetune BERT longer? Based on our observation that tBERT performs better on dataset-specific cases, one could assume that BERT may simply need to be finetuned for longer than the usual 3 epochs to pick up more domain-specific information. In an additional experiment, we finetuned BERT and tBERT (with LDA topics) for 9 epochs (see Figure 2 and Appendix G). While finetuning for more epochs achieved marginal gains on SemEval A and C, longer finetuning of BERT could not exceed tBERT's best performance from the first 3 epochs (dotted line) on any dataset. We conclude that longer finetuning does not considerably boost BERT's performance. Adding topics instead is more effective, while avoiding the burden of greatly increased training time (compare Appendix F).

Conclusion
In this work, we proposed a flexible framework for combining topic models with BERT. We demonstrated that adding LDA topics to BERT consistently improved performance across a range of semantic similarity prediction datasets. In our qualitative analysis, we showed that these improvements were mainly achieved on examples involving domain-specific words. Future work may focus on how to directly induce topic information into BERT without corrupting pretrained information, and on whether combining topics with other pretrained contextual models can lead to similar gains. Another research direction is to investigate whether introducing more sophisticated topic models, such as named entity promoting topic models (Krasnashchok and Jouili, 2018), can provide further improvements.

G Longer Finetuning Experiment
Longer BERT finetuning does not surpass tBERT's best performance from the first 3 epochs (dotted line) while considerably increasing training time (compare Appendix F).