A Cross-Topic Method for Supervised Relevance Classification

In relevance classification, we aim to judge whether utterances expressed on a topic are relevant or not. A common approach is to train a separate classifier for each topic. However, this easily causes an underfitting problem in the supervised learning model, since the annotated data can be insufficient for any single topic. In this paper, we explore the features common to different topics and propose our cross-topic relevance embedding aggregation methodology (CREAM), which expands the range of training data and applies what has been learned from source topics to a target topic. In our experiments, we show that our proposal can capture common features from a small amount of annotated data and improves the performance of relevance classification compared with the baselines.


Introduction
Relevance classification is the task of automatically identifying information that is relevant to a specific topic (Kimura et al., 2019). It can be regarded as a preprocessing task for stance detection, since potential stances should be narrowed down to relevant ones to improve accuracy and efficiency. In Table 1, we show a simple example of the relevance classification task. Here utterance1 is relevant to the topic, not only because it contains topic words but also because of its related semantics, so its features can be leveraged for further stance detection. In contrast, utterance2 is irrelevant to the topic, and performing stance detection on it is meaningless. Previously, the relevance task has been approached in an unsupervised way by calculating pairwise semantic distances between topic and utterance (Achananuparp et al., 2008; Kusner et al., 2015). However, in most instances, the performance of these methods is not as good as that of a supervised approach. In the supervised setting, a topic-specific classifier has traditionally been trained for prediction on a single topic (Hasan and Ng, 2013; Y Wang et al., 2017), but this method isolates topics from one another and wastes existing annotated data when making new predictions.
Cross-topic classification, which enables the classifier to adapt to different topics even across different domains, is an alternative supervised approach (Augenstein et al., 2016; Xu et al., 2018). It allows the model to assimilate the common features of existing topics and make inferences for a new topic. For example, in the NTCIR-14 relevance classification task, we could start with an existing classifier trained on a well-prepared set of ground-truth data from other topics, such as Tsukiji Market history or economics, to make predictions about the Tsukiji Market relocation topic.
In this paper, aiming to alleviate the problem of insufficient annotated data for a specific topic, we concentrate on cross-topic relevance classification with our novel CREAM proposal. The basic idea of the CREAM method is to capture the common pairwise features between existing topics and utterances, and then apply them to relevance prediction on a target topic. By analyzing the F1-scores in the experiment results, we find that CREAM performs better than the baselines on relevance classification for known topics. In addition, its value for relevance prediction on unknown topics has also been evaluated.

Related Work
To establish a cross-topic relevance classification model for supervised learning, we regard it as a two-step procedure consisting of pairwise text embedding and a binary text classifier. In addition, the literature on stance detection has also inspired our work.

Text Embedding
There are three well-known embedding methods for word-level representation: Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and fastText (Joulin et al., 2016). Although GloVe and fastText show higher performance in some specific respects, Word2Vec (CBOW, Skip-Gram) remains the most popular and is widely used across different languages.
As to sentence-level embeddings, the Word2Vec inventor Mikolov proposed doc2vec (Quoc et al., 2014), which, as its name implies, learns sentence or document embeddings. In addition, averaged word embeddings (Han and Baldwin, 2016) are also a common sentence-level embedding method.

Text Classifier
Several classical ML/DL models are used for text classification, such as the Support Vector Machine (SVM) (Vapnik, 1998; Vapnik, 2013) and the RNN variant Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). It is noteworthy that SVM has an advantage in processing low-resource data.
In addition, we can nowadays also use a pre-trained model such as BERT (Devlin et al., 2018) or ELMo (Matthew et al., 2018) as a contextual text classifier. However, note that because these models are pre-trained on a tremendous amount of open data (e.g., Wikipedia), we still need large-scale fine-tuning data for root domain recognition.

Stance Detection
Stance detection, which is the task of classifying the attitude expressed in a text towards a target, also provides us with valuable inspiration for text classification.
For example, Augenstein et al. (2016) used conditional LSTM encoding to build a representation for stance and target independently, and an end-to-end memory network (Mohtarami et al., 2018), which integrates a CNN with an LSTM, has also been presented to solve this classification task. Moreover, a simple but tough-to-beat baseline (Riedel et al., 2017) shows the potential of TF-IDF and cosine similarity for this pairwise classification task. Note that relevance classification can be regarded as a preprocessing step for stance detection, since irrelevant stances should be excluded before the rest are classified into a supporting, opposing or neutral stance.

Methodology
In this section, we give a comprehensive introduction to our proposed cross-topic method, CREAM, for supervised relevance classification. The overall architecture of CREAM is depicted in Figure 1. As described in the previous section, we divide the whole model into two parts: text embedding and text classifier. The text embedding part consists of the Word Embedding Layer and the Sentence Aggregation Layer, and the text classifier part consists of the SVM Layer and the Prediction Layer. The expected input is a pair made up of a topic text and a topic-oriented utterance in the same domain, and the output is the predicted binary relevance label. In the following, we describe the implementation details of each layer in CREAM.

Word Embedding Layer
Here we adopt pre-trained Word2Vec embeddings to represent each word of the two inputs (a topic text T containing n words and a topic-oriented utterance U, e.g., topic and utterance1 in Table 1). Note that an utterance can be much longer than the topic text, so we select from utterance U the same number of words as in topic T. For each selected word of T,

Sentence Aggregation Layer
The sentence aggregation layer is the key to our cross-topic method CREAM. Here we aggregate the topic and utterance vectors in two steps to represent common features.

Separated Aggregation:
In this step, we aim to provide a sentence-level embedding for T and U respectively. Here we separately aggregate the word vectors of the topic and of the utterance by averaged word embeddings.
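The averaging step can be illustrated with a minimal sketch. It assumes that each word has already been mapped to a fixed-length vector; the function name `average_embedding` and the toy 3-dimensional vectors are hypothetical and merely stand in for pre-trained Word2Vec embeddings.

```python
def average_embedding(word_vectors):
    """Average a list of equal-length word vectors into one sentence vector."""
    if not word_vectors:
        raise ValueError("need at least one word vector")
    dim = len(word_vectors[0])
    n = len(word_vectors)
    # Mean over words, computed independently for each dimension.
    return [sum(vec[j] for vec in word_vectors) / n for j in range(dim)]


# Toy vectors (assumption): two topic words and two selected utterance words.
topic_words = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]
utterance_words = [[0.0, 2.0, 4.0], [2.0, 2.0, 2.0]]

topic_vec = average_embedding(topic_words)          # [2.0, 3.0, 4.0]
utterance_vec = average_embedding(utterance_words)  # [1.0, 2.0, 3.0]
```

Because the same number of words is selected from the utterance as from the topic, both averages are taken over n vectors.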

Topic-Utterance Aggregation:
Here we further apply an aggregation between topic and utterance to represent the common features of relevance. A classical result from Word2Vec is that king − man + woman ≈ queen; from this we can infer that the word pairs (king, man) and (queen, woman) share some common features, since king − man ≈ queen − woman also holds. For sentence-level relevance classification, we likewise take a vector subtraction between the topic vector $\vec{T}$ and the utterance vector $\vec{U}$ to represent the relevance vector $\vec{R}$.
It is noteworthy that we normalize each dimension of the relevance vector $\vec{R}$ by dividing by T, to keep the subtraction result within the same range. Therefore, given relevance vectors $\vec{R}_1$ (for topic1) and $\vec{R}_2$ (for topic2), they are treated equally in the same cross-topic training if they denote the same relevance relationship (e.g., $\vec{R}_1$ represents an utterance that is relevant to topic1, and $\vec{R}_2$ represents another utterance that is relevant to topic2).
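The following is a sketch of one possible reading of the topic-utterance aggregation, not a confirmed formula. It assumes that the subtraction is topic minus utterance and that the normalization is an element-wise division by the topic vector; the function name `relevance_vector` and the small `eps` guard against division by zero are additions for illustration only.

```python
def relevance_vector(topic_vec, utterance_vec, eps=1e-8):
    """Element-wise (T - U) / T; an assumed form of the normalized subtraction."""
    if len(topic_vec) != len(utterance_vec):
        raise ValueError("topic and utterance vectors must have the same length")
    rel = []
    for t, u in zip(topic_vec, utterance_vec):
        # Guard against a zero denominator (assumption, not from the text).
        denom = t if abs(t) > eps else (eps if t >= 0 else -eps)
        rel.append((t - u) / denom)
    return rel


# Two topics whose embeddings live on different scales but relate to their
# utterances in the same way yield the same relevance vector.
r1 = relevance_vector([2.0, 4.0], [1.0, 3.0])      # [0.5, 0.25]
r2 = relevance_vector([20.0, 40.0], [10.0, 30.0])  # [0.5, 0.25]
```

Under this reading, the normalization removes the scale of the individual topic, which is what allows pairs from different topics to be pooled into a single training set.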

Cross-Topic SVM Layer
In this layer, we adopt a supervised learning model, SVM, for cross-topic binary classification, because of the low-resource data issue stated in the Text Classifier section. In our case, SVM can efficiently perform a non-linear classification using a kernel function (Mark et al., 1964) to fit the maximum-margin hyperplane in a transformed feature space. Here a sigmoid kernel function over relevance vectors $\vec{R}_1$ and $\vec{R}_2$ makes the SVM act like a multi-layer neural network, even when the vectors come from different topics.
After applying the kernel function, the target function of the maximum-margin hyperplane can be written in terms of optimal parameters (marked with *, including $h^*$) that determine the separating hyperplane, and t, the correct class label used for training.
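As an illustration, the sketch below evaluates the standard sigmoid (hyperbolic tangent) kernel, K(x, y) = tanh(gamma * x·y + coef0), on a few relevance vectors. The parameter names `gamma` and `coef0`, their default values and the toy vectors are assumptions; the text does not state which values or notation were used.

```python
import math


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """Standard sigmoid kernel: tanh(gamma * <x, y> + coef0)."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(gamma * dot + coef0)


def gram_matrix(vectors, gamma=1.0, coef0=0.0):
    """Pairwise kernel values between all relevance vectors."""
    return [[sigmoid_kernel(a, b, gamma, coef0) for b in vectors] for a in vectors]


# Toy relevance vectors (assumption); they may come from different topics.
relevance_vectors = [[0.5, 0.25], [0.4, 0.3], [-0.6, -0.2]]
K = gram_matrix(relevance_vectors)
```

In practice the hyperplane itself would be fitted by an SVM solver that consumes such kernel values; for instance, scikit-learn's `SVC` accepts `kernel="sigmoid"` with `gamma` and `coef0` parameters, although the text does not say which implementation was used.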

Prediction Layer
We predict the relevance label of each topic-utterance pair via the sigmoid-fitting method, where we apply the sigmoid operation to get the predicted probability for the relevant and irrelevant classes with parameters A and B.
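A common form of sigmoid fitting for SVM outputs (Platt scaling) maps a decision value f to P(relevant | f) = 1 / (1 + exp(A f + B)). The sketch below assumes that form; the function names, the 0.5 decision threshold and the example values of A and B are hypothetical, and the fitting of A and B on held-out data is not shown.

```python
import math


def platt_probability(decision_value, A, B):
    """P(relevant | f) = 1 / (1 + exp(A * f + B)), computed stably."""
    z = A * decision_value + B
    if z >= 0:
        e = math.exp(-z)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(z))


def predict_label(decision_value, A, B, threshold=0.5):
    """Return 1 (relevant) or 0 (irrelevant) from the fitted probability."""
    return 1 if platt_probability(decision_value, A, B) >= threshold else 0


# Hypothetical fitted parameters; A is typically negative so that larger
# decision values map to higher probabilities of the relevant class.
A, B = -2.0, 0.0
p_relevant = platt_probability(1.5, A, B)
p_irrelevant = 1.0 - p_relevant
label = predict_label(1.5, A, B)
```

The probability of the irrelevant class is simply the complement of the relevant one in this binary setting.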

Experiments
In this section, we present the evaluation results of our proposed methodology. We have evaluated CREAM on the NTCIR-14 QALab dataset (Kimura et al., 2019). Note that the NTCIR-14 QALab dataset may be the first dataset to focus on relevance classification in addition to fact-checking and stance detection. Besides our own method, we have also applied three baseline approaches to cross-topic relevance classification.
Word Mover's Distance (WMD): this classical unsupervised method is often used to calculate a word travel cost between two documents. Here we predict the relevance label based on a threshold on the travel cost from an utterance to its topic.
Bidirectional LSTM (BiLSTM): this approach receives encoded word sequences (topic and utterance) and concatenates them into one sequence. The concatenated vector is then fed into the prediction layer to give a relevance label prediction.
BERT: BERT is widely regarded as a state-of-the-art model for NLP tasks such as text classification. It is well known that BERT can receive pairwise texts as inputs and output a label for the pair. Therefore, BERT is in theory also applicable to relevance classification. Here we first feed the labelled topic and utterance as separate inputs into the pre-trained BERT for fine-tuning.

NTCIR-14 QALab:
This dataset is a Japanese collection for the relevance classification task, which contains around 10000 topic-oriented utterances on 14 different topics. Although the task organizers label the data manually through crowdsourcing, it is still difficult to provide a larger amount of labeled data for each topic. Therefore, a traditional method trained on such low-resource data would easily suffer from an underfitting problem.

Experiment Setup
Our initial word embeddings are obtained from the pre-trained Wikipedia word vectors (Suzuki et al., 2016).
In experiment 1, we divide our dataset into training data (1620) and test data (180) in a 9:1 ratio. Note that there is no new topic in the test data of experiment 1, since all topics have been included in training during the learning phase.
In experiment 2, we aim to verify the performance of our method compared with the others on relevance prediction for an unknown topic. Therefore, we use the data of 13 topics for training to predict the one remaining topic, in a cross-validation manner.
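The splitting scheme of experiment 2 can be sketched as a leave-one-topic-out loop. The record layout `(topic_id, utterance, label)`, the function name and the toy samples below are hypothetical; with the 14 topics of the dataset, the loop would produce 14 folds of 13 training topics each.

```python
def leave_one_topic_out(samples):
    """Yield (held_out_topic, train, test) with one topic held out per fold.

    `samples` is a list of (topic_id, utterance, label) tuples (assumed layout).
    """
    topics = sorted({s[0] for s in samples})
    for held_out in topics:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy data (assumption): three topics instead of the dataset's 14.
samples = [
    ("t1", "utterance a", 1),
    ("t1", "utterance b", 0),
    ("t2", "utterance c", 1),
    ("t3", "utterance d", 0),
]
folds = list(leave_one_topic_out(samples))
```

Averaging the per-fold scores then gives a single cross-validated figure for each method.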

Experiment Results
We mainly use the F1-score to evaluate classification performance. Figure 2 shows the F1-score, together with the averaged precision and recall, of the four methods in experiment 1, and the averaged cross-validation results of experiment 2 are summarized in Figure 3.
Furthermore, the relationship between the word mover's distance threshold and the F1-score is shown as an example in Figure 4. We simply go through all candidate thresholds, pick the one at the peak of the curve as the optimal threshold, and use it to make predictions on the test data.
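The threshold search for the WMD baseline can be sketched as follows. The sketch assumes that a smaller distance indicates relevance, that the candidate thresholds are the observed distances, and that the search is run on labelled data before the chosen threshold is applied to test data; the function names and toy values are hypothetical.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (relevant = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(distances, labels):
    """Try every observed distance as a threshold and keep the best F1."""
    best_t, best_f1 = None, -1.0
    for t in sorted(set(distances)):
        # Predict "relevant" when the utterance is close enough to its topic.
        preds = [1 if d <= t else 0 for d in distances]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy distances and labels (assumption).
train_distances = [0.2, 0.4, 0.5, 0.9, 1.1]
train_labels = [1, 1, 0, 0, 0]
threshold, score = best_threshold(train_distances, train_labels)
```

Ties are resolved in favour of the smallest threshold here, which is one arbitrary choice among several reasonable ones.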

Discussion
As shown in Figure 2, CREAM improves the performance of relevance classification in experiment 1, since its F1-score is higher than those of the other methods. The potential reasons for the improvement are listed below.
• The sentence aggregation layer can extract common features between topic-utterance pairs and express the degree of pairwise relevance through sentence aggregation.
• The cross-topic SVM layer shows high performance especially on low-resource data, even compared with the BiLSTM and BERT models. The BERT model, pre-trained on open data, is perhaps limited by its need for larger-scale fine-tuning data.
As to relevance prediction for an unknown topic in experiment 2, the result of our method is close to that of the unsupervised WMD method, which shows strong predictive power on new data. We believe that CREAM has some value for relevance prediction on unknown topics, since the impact of any specific topic is reduced by the topic-utterance aggregation across different topics.

Conclusion and Future Work
In this paper, we have proposed a novel cross-topic aggregation model named CREAM, which generalizes common features to address the low-resource data problem in relevance classification. Experiment results show that it outperforms the baselines in F1-score on relevance classification for known topics. We have also found that CREAM has some value for relevance prediction on unknown topics.
In the future, CREAM deserves further experiments on different datasets. For example, we could evaluate our methodology on multilingual datasets to make the evaluation more convincing. Moreover, we could also add external synonyms from a domain-based thesaurus to expand the topic texts. Finally, self-attention mechanisms could be a promising way to address the length imbalance between topic and utterance, instead of Word2Vec-style extraction.