Gated Convolutional Bidirectional Attention-based Model for Off-topic Spoken Response Detection

Off-topic spoken response detection, the task of predicting whether a response is off-topic for the corresponding prompt, is important for an automated speaking assessment system. In many real-world educational applications, off-topic spoken response detectors are required to achieve high recall for off-topic responses not only on seen prompts but also on prompts that are unseen during training. In this paper, we propose a novel approach for off-topic spoken response detection with high off-topic recall on both seen and unseen prompts. We introduce a new model, the Gated Convolutional Bidirectional Attention-based Model (GCBiA), which applies a bi-attention mechanism and convolutions to extract topic words of prompts and key phrases of responses, and introduces a gated unit and residual connections between major layers to better represent the relevance of responses and prompts. Moreover, a new negative sampling method is proposed to augment training data. Experimental results demonstrate that our approach achieves significant improvements in detecting off-topic responses with extremely high on-topic recall, for both seen and unseen prompts.


Introduction
Off-topic spoken response detection is a crucial task in an automated assessment system. The task is to predict whether the response is off-topic for the corresponding question prompt. Table 1 shows an example of on-topic and off-topic responses for a prompt.
Off-topic examples in human-rated data are often too sparse to train an automated scoring system to reject off-topic responses. Consequently, automated scoring systems tend to be more vulnerable than human raters to scoring inaccurately due to off-topic responses (Lochbaum et al., 2013; Higgins and Heilman, 2014). To ensure the validity of speaking assessment scores, it is necessary to have a mechanism to flag off-topic responses before scores are reported (Wang et al., 2019). In our educational application, we use the automated speaking assessment system to help L2 learners prepare for the IELTS speaking test. We see a higher rate of off-topic responses in freemium features, as some users just play with the system. In such a scenario, accurate off-topic detection is extremely important for building trust and converting trial users to paid customers.
Table 1: An example of on-topic and off-topic responses for a prompt.
Prompt: What kind of flowers do you like?
On-topic: I like iris and it has different meaning of it a wide is the white and um and the size of a as a ride is means the ride means love but I can not speak.
Off-topic: Sometimes I would like to invite my friends to my home and we can play the Chinese chess dishes this is my favorite games at what I was child.
Initially, many researchers used the vector space model (VSM) (Louis and Higgins, 2010; Yoon and Xie, 2014; Evanini and Wang, 2014) to assess the semantic similarity between responses and prompts. In recent years, with the rise of deep neural networks (DNN) in natural language processing (NLP), many DNN-based approaches have been applied to detect off-topic responses. Malinin et al. (2016) used a topic-adapted Recurrent Neural Network language model (RNN-LM) to rank the topic-conditional probabilities of a response sentence. A limitation of this approach is that the model cannot detect off-topic responses for a new question prompt that was not seen in the training data (an unseen prompt). Later, off-topic response detection was treated as a binary classification task using end-to-end DNN models. Malinin et al. (2017) proposed the first end-to-end DNN method for the off-topic response detection task, the attention-based RNN (Att-RNN) model. They used a Bi-LSTM embedding of the prompt combined with an attention mechanism that attends over the response to model the relevance. CNNs may perform better than RNNs in some NLP tasks that require key-phrase recognition, such as some sentiment detection and question-answer matching problems (Yin et al., 2017). Lee et al. (2017) proposed a siamese CNN to learn semantic differences between on-topic response-questions and off-topic response-questions. Wang et al. (2019) proposed an approach based on similarity grids and deep CNN.
However, the cold-start problem of off-topic response detection has not been handled well by the aforementioned approaches: good performance can be achieved only after enough training data for unseen prompts has been accumulated. Besides, these methods pay little attention to the on-topic false-alarm problem, which is vital for a production system. That is, extremely high recall of on-topic responses is also required to make real-user-facing systems applicable.
In this paper, to address the issues mentioned above, we propose a novel approach named Gated Convolutional Bidirectional Attention-based Model (GCBiA) and a negative sampling method to augment training data. The key motivation behind GCBiA is as follows: the convolution structure captures key information, such as salient n-gram features (Young et al., 2018) of the prompt and the response, while the bi-attention mechanism provides complementary interaction information between prompts and responses. Following R-Net in machine comprehension, we add the gated unit as a relevance layer to filter out the important part of a response regarding the prompt. These modules contribute to obtaining a better semantic matching representation between prompts and responses, which is beneficial for both seen and unseen prompts. Additionally, we add residual connections (He et al., 2016) in our model to keep the original information of each major layer. To alleviate the cold-start problem on unseen prompts, a new negative sampling data augmentation method is considered.
We compare our approach with the Att-RNN model and G-Att-RNN (our strong baseline model based on Att-RNN). Experimental results show that GCBiA outperforms these methods on both the seen and unseen prompt benchmarks, conditioned on extremely high on-topic response recall (0.999). Moreover, the model trained with negative sampling augmented data achieves 88.2 average off-topic recall on seen prompts and 69.1 average off-topic recall on unseen prompts.
In summary, the contributions of this paper are as follows:
• We propose an effective model framework of five major layers for the off-topic response detection task. The bi-attention mechanism and convolutions are applied to focus on both topic words in prompts and key phrases in responses. The gated unit as a relevance layer can enhance the relevance of prompts and responses. Besides, residual connections for each layer are widely used to learn additional feature mapping. These modules yield a good semantic matching representation on both seen and unseen prompts. The GCBiA model achieves significant improvements of +24.0 and +7.0 average off-topic recall on unseen and seen prompts respectively, compared to the baseline method.
• To explore the essence of our proposed model, we conduct visualization analysis from two perspectives: bi-attention visualization and semantic matching representation visualization to reveal important information on how our model works.
• To further improve our results on unseen prompts, we propose a novel negative sampling data augmentation method that enriches training data by shuffling the words of a negative sample in the off-topic response detection task. It allows the GCBiA model to achieve higher average off-topic recall on unseen prompts.

Task formulation
The off-topic response detection task is defined as follows in this paper. Given a question prompt with n words $X^P = \{x^P_t\}_{t=1}^{n}$ and a response sentence with m words $X^R = \{x^R_t\}_{t=1}^{m}$, output one class: o = 1 for on-topic or o = 0 for off-topic.

Model Overview
We propose a model framework, GCBiA (shown in Figure 1), for the off-topic response detection task. It consists of the following five major layers:
• Word Embedding Layer maps each word to a vector space using a pre-trained word embedding model.
• Contextual Encoder Layer utilizes contextual information from surrounding words to reinforce the embedding of the words. These first two layers are applied to both prompts and responses.
• Attention Layer uses the attention mechanism in both directions, prompt-to-response and response-to-prompt, which provides complementary information to each other.
• Relevance Layer captures the important parts of the response regarding a prompt via the gated unit.
• Output Layer predicts whether the response is off-topic given the prompt.
In detail, each layer is described as follows:
1. Word Embedding Layer. We first convert words to their respective trainable word embeddings, initialized with pre-trained GloVe (Pennington et al., 2014). The embeddings of prompts $W^P = \{w^P_t\}_{t=1}^{n}$ and responses $W^R = \{w^R_t\}_{t=1}^{m}$ are passed directly to the next contextual encoder layer.

2. Contextual Encoder Layer. A stack of convolutional layers is employed to extract salient n-gram features from prompts and responses, aiming at creating an informative latent semantic representation of prompts and responses for the next layer. The l-th convolutional layer with one filter is represented as $c^l_i$ in Equation (1), where $W \in \mathbb{R}^{k \times d}$ and $b \in \mathbb{R}^{d}$. We ensure that the output of each stack matches the input length by padding the input of each stack. The number of convolutional layers l is 7, the kernel size k is 7, and the number of filters in each convolutional layer is 128.
After the convolutional representations of prompts $U^P$ and responses $U^R$ in Equation (2-3) are obtained, a max-pooling layer is applied to extract a fixed-length vector, as seen in Equation (4-5). Max-pooling keeps the most salient n-gram features across the whole prompt or response.
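The padding and pooling steps described above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in the paper: the function names `conv1d_same` and `max_pool_over_time`, the ReLU activation, and the toy sizes (kernel size 3, two filters) are assumptions made only for illustration, and a single layer is shown instead of the full stack.

```python
def conv1d_same(seq, kernels, biases):
    """One convolutional layer over a sequence of d-dimensional vectors.

    seq:     list of T vectors, each of length d
    kernels: list of F filters, each a k x d list of weights
    biases:  list of F bias values
    Returns T vectors of length F (zero padding keeps the length at T).
    """
    k = len(kernels[0])
    d = len(seq[0])
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    zero = [0.0] * d
    padded = [zero] * pad_left + list(seq) + [zero] * pad_right
    out = []
    for i in range(len(seq)):
        window = padded[i:i + k]
        feats = []
        for w, b in zip(kernels, biases):
            s = b + sum(w[j][c] * window[j][c] for j in range(k) for c in range(d))
            feats.append(max(0.0, s))  # ReLU is an assumption, not stated in the text
        out.append(feats)
    return out


def max_pool_over_time(features):
    """Keep, for each filter, its largest activation across all positions."""
    return [max(column) for column in zip(*features)]


# Toy input: 5 "words" with 2-dimensional embeddings (hypothetical values).
response = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [2.0, 0.0]]
kernels = [
    [[1.0, 1.0]] * 3,    # filter that sums everything in its window
    [[-1.0, -1.0]] * 3,  # filter that is always suppressed by ReLU here
]
biases = [0.0, 0.0]

features = conv1d_same(response, kernels, biases)  # same length as the input
pooled = max_pool_over_time(features)              # fixed-length summary vector
```

The pooled vector has one entry per filter regardless of the input length, which is what allows a prompt or response of any length to be summarized into a fixed-size vector.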
3. Attention Layer. In this layer, the attention mechanism is used in both directions, prompt-to-response and response-to-prompt, which provides complementary information to each other. However, unlike bi-attention applied to question answering and machine comprehension, including QANet (Yu et al., 2018), BiDAF (Seo et al., 2016) and BiDAF++ (Choi et al., 2018), we use max-pooling of the CNN representation on the prompt/response to summarize the prompt/response into a fixed-size vector.
Figure 1: An overview of GCBiA. Residual connections are widely used between layers. The first two layers are applied to both the prompt and the response. Convolutions are used in the contextual encoder layer and the bi-attention mechanism is applied in the attention layer. After the relevance layer with the gated unit, the relevance vector is fed into the output layer, which consists of a normalization layer, dropout, two fully connected layers and softmax.
Prompt-to-Response Attention. Prompt-to-Response attention implicitly models which response words are more related to the whole prompt, which is crucial for assessing the relevance of responses and prompts. Given the max-pooling vector $v^P$ of the prompt and the CNN representation $U^R = \{u^R_t\}_{t=1}^{m}$ of the response, together with $W^P = \{w^P_t\}_{t=1}^{n}$ and $W^R = \{w^R_t\}_{t=1}^{m}$, Prompt-to-Response attention $c^R$ is calculated in Equation (6-10), where the similarity function is the trilinear function (Yu et al., 2018) and residual connections are used.
Response-to-Prompt Attention. Similarly, Response-to-Prompt attention implicitly models which prompt words are more related to the whole response. The calculation of Response-to-Prompt attention, seen in Equation (11-15), is close to that of Prompt-to-Response attention.
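As a rough illustration of attending from one pooled summary vector over the other side's sequence, the sketch below scores each position with a trilinear similarity, normalizes the scores with softmax, and returns the weighted sum. It assumes the common trilinear form w · [v; u; v ⊙ u] and omits the residual connections; the names `trilinear`, `attend`, and the weight vector `w` are hypothetical, and the exact form of Equations (6-15) may differ.

```python
import math


def trilinear(v, u, w):
    """Similarity w . [v; u; v * u] between a summary vector v and one position u."""
    feats = list(v) + list(u) + [a * b for a, b in zip(v, u)]
    return sum(wi * fi for wi, fi in zip(w, feats))


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attend(summary, sequence, w):
    """Attention weights over `sequence` and the resulting context vector."""
    scores = [trilinear(summary, u, w) for u in sequence]
    weights = softmax(scores)
    dim = len(sequence[0])
    context = [sum(a * u[c] for a, u in zip(weights, sequence)) for c in range(dim)]
    return weights, context


# Toy example (hypothetical values): a pooled prompt vector and three response positions.
v_prompt = [1.0, 0.0]
u_response = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]  # only the element-wise product term is active

weights, context = attend(v_prompt, u_response, w)
```

Calling `attend` with the pooled response vector and the prompt sequence gives the other direction, which is why the two attention computations are described as close to each other.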
4. Relevance Layer. To capture the important parts of responses and attend to the ones relevant to the prompts, we use one gated unit in this layer, seen in Equation (16-17). This gated unit focuses on the relation between the prompt and the response. Only the relevant parts of each side remain after the sigmoid operation. The input of this layer uses residual connections from the previous two layers.
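The filtering effect of a sigmoid gate can be sketched as follows. This assumes an R-Net-style gate, g = sigmoid(Wx + b) followed by an element-wise product g ⊙ x; the function name `gated_unit` and the toy weights are hypothetical, and the exact inputs of Equations (16-17) are not reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_unit(x, W, b):
    """Scale each component of x by a gate in (0, 1) computed from x itself."""
    gate = [
        sigmoid(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]
    return [g * x_i for g, x_i in zip(gate, x)]


x = [2.0, -1.0]                       # hypothetical relevance-layer input
W_zero = [[0.0, 0.0], [0.0, 0.0]]     # with zero weights the gate depends only on the bias

half_open = gated_unit(x, W_zero, [0.0, 0.0])       # every gate equals 0.5
second_closed = gated_unit(x, W_zero, [0.0, -50.0]) # second component is almost removed
```

A gate value near 1 passes a component through nearly unchanged, while a value near 0 suppresses it, which corresponds to keeping only the relevant parts after the sigmoid operation.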
5. Output Layer. The fixed-length semantic matching vector produced by the previous layer, together with the vector from the layer before it, is fed into the final output layer. It consists of one normalization layer, one dropout layer, two fully connected layers, and one softmax layer. The output distribution indicates the relevance of the prompt and the response. We classify the output into two categories, on-topic or off-topic, using a threshold. A different threshold is chosen for each prompt to make sure that the on-topic recall of the prompt meets the minimum requirement, such as 0.999 for the online product system in our study.
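One way such a per-prompt threshold could be selected is sketched below. It assumes that the model outputs an on-topic probability, that a response is flagged as off-topic when its score falls below the threshold, and that the threshold is picked from scores of known on-topic responses for that prompt; the function names and the toy scores are hypothetical.

```python
import math


def pick_threshold(on_topic_scores, min_recall=0.999):
    """Largest threshold that still keeps at least `min_recall` of on-topic responses."""
    ranked = sorted(on_topic_scores, reverse=True)
    # Number of on-topic responses that must stay at or above the threshold.
    # The small epsilon guards against floating-point overshoot before rounding up.
    keep = max(1, math.ceil(min_recall * len(ranked) - 1e-9))
    return ranked[keep - 1]


def off_topic_recall(off_topic_scores, threshold):
    """Fraction of off-topic responses whose score falls below the threshold."""
    flagged = sum(1 for s in off_topic_scores if s < threshold)
    return flagged / len(off_topic_scores)


# Toy example with a relaxed requirement of 0.8 so that five scores are enough.
threshold = pick_threshold([0.9, 0.8, 0.7, 0.6, 0.2], min_recall=0.8)
recall = off_topic_recall([0.1, 0.3, 0.65, 0.5], threshold)
```

With a requirement as strict as 0.999, the threshold is driven by the lowest-scoring on-topic responses of each prompt, so prompts whose on-topic responses are scored less consistently end up with lower thresholds and lower off-topic recall.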

Dataset
Data from our IELTS speaking test mobile app (https://www.liulishuo.com/ielts.html) was used for training and testing in this paper. There are three parts in the IELTS test (https://www.ielts.org/about-the-test/test-format): Part1 focuses on general questions about test-takers and a range of familiar topics, such as home, family, work, studies, and interests. In Part2, test-takers are asked to talk about a particular topic. Part3 is a discussion of more abstract ideas and issues related to Part2. An example from our IELTS speaking test mobile app is shown in Table 2.
All responses from test-takers were generated by our automatic speech recognition (ASR) system, which will be briefly introduced in Section 3.2. Responses for a target prompt collected in our paid service were used as its on-topic training examples, and responses from the other prompts were used as the off-topic training examples for the target prompt. This is a reasonable setup because most of the responses in our paid service are on-topic (we labeled about 5K responses collected under our paid service and found that only 1.3% of them are off-topic) and a certain level of "noise" in the training data is acceptable. The test data was produced in the same way as the training data, except that human validation was further introduced to ensure its validity. To further ensure the authenticity of our train and test data, we filter out short responses in each part: responses in Part1, Part2, and Part3 must contain more than 15, 50, and 15 words, respectively. Table 3 shows the details of our train and test datasets: 1.12M responses from 1356 prompts are used to train our model. The average number of responses to each prompt is 822. The numbers of on-topic and off-topic responses in the training data are 564.3K and 551.3K, respectively. We divide the test data into two parts: the seen benchmark and the unseen benchmark. Prompts of the seen benchmark may appear in the training data, while prompts of the unseen benchmark do not. The seen benchmark consists of 33.6K responses from 156 prompts, including 17.7K on-topic responses and 15.9K off-topic responses, and the average number of responses per prompt is 216. In the unseen benchmark, there are 10.1K responses from 50 prompts, including 5.0K on-topic responses and 5.1K off-topic responses, and the average number of responses per prompt is 202.

ASR System
A hybrid deep neural network DNN-HMM system is used for ASR. The acoustic model contains 17 sub-sampled time-delay neural network layers with low-rank matrix factorization (TDNNF) (Povey et al., 2018), and is trained on over 8000 hours of speech, using the lattice-free MMI (Povey et al., 2016) recipe in the Kaldi toolkit. A tri-gram LM with Kneser-Ney smoothing is trained using the SRILM toolkit and applied in first-pass decoding to generate word lattices. An RNN-LM (Mikolov et al., 2010) is applied to rescore the lattices to obtain the final recognition results. The ASR system achieves a word error rate of around 13% on our 50-hour ASR test set.

Metric
We use two assessment metrics in this paper: Average Off-topic Recall (AOR) and Prompt Ratio over Recall 0.3 (PRR3). AOR denotes the off-topic response recall averaged over all prompts (156 prompts on the seen benchmark and 50 prompts on the unseen benchmark). PRR3 denotes the ratio of prompts whose off-topic recall is over 0.3.
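Both metrics follow directly from the per-prompt off-topic recalls, as the short sketch below shows. The function names and the four example recall values are hypothetical and are not taken from the reported results.

```python
def average_off_topic_recall(recalls):
    """AOR: mean of the per-prompt off-topic recalls."""
    return sum(recalls) / len(recalls)


def prompt_ratio_over_recall(recalls, floor=0.3):
    """PRR3: share of prompts whose off-topic recall is strictly above `floor`."""
    return sum(1 for r in recalls if r > floor) / len(recalls)


# Hypothetical off-topic recalls for four prompts.
per_prompt_recall = [1.0, 0.5, 0.25, 0.75]

aor = average_off_topic_recall(per_prompt_recall)
prr3 = prompt_ratio_over_recall(per_prompt_recall)
```

AOR summarizes how many off-topic responses are caught on average, while PRR3 indicates how many prompts reach a usable level of detection, so a model can improve one without improving the other.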

Training settings
The model is implemented in Keras (https://keras.io/). We use pre-trained GloVe as the word embedding, with a dimension of 300. The train and dev batch sizes are 1024 and 512. The kernel size, filter number, and block size of the CNN are 7, 128, and 7, tuned on the dev set. The fixed lengths of prompts and responses are 40 and 280, chosen according to the length distribution of prompts and responses in the training data. Nadam (Dozat, 2016) is used as our optimizer with a learning rate of 0.002. The loss function is binary cross-entropy. The model is trained for at most 20 epochs, with early stopping when the dev loss has not improved for three epochs.
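The early-stopping rule can be made concrete with a small framework-independent sketch. It assumes that "not improving" means the dev loss failed to drop below the best value seen so far; the function name `stopping_epoch` and the loss values are hypothetical.

```python
def stopping_epoch(dev_losses, patience=3, max_epochs=20):
    """Return the 1-based epoch after which training would stop."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(dev_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return min(len(dev_losses), max_epochs)


# Hypothetical dev-loss curve: the best value is reached at epoch 3.
stop_at = stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.64, 0.63, 0.5])

# A curve that keeps improving runs until the epoch limit.
full_run = stopping_epoch([1.0 / (i + 1) for i in range(25)])
```

In the first curve, training stops at epoch 6 even though epoch 7 would have been better, which is the usual trade-off of a short patience window.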

Results
We carried out experiments on both the seen benchmark and the unseen benchmark mentioned in Section 3.1. As shown in Table 4, Att-RNN is our baseline model. To make the evaluation more convincing, we built a stronger baseline model, G-Att-RNN, based on Att-RNN by adding residual connections to each layer. Additionally, we add a gated unit as the relevance layer for G-Att-RNN. Compared with Att-RNN, G-Att-RNN achieved significant improvements on both the seen benchmark (by +3.2 PRR3 points and +4.6 AOR points) and the unseen benchmark (by +22.0 PRR3 points and +17.1 AOR points). From Table 4, we can see that, compared with the Att-RNN baseline, our approach GCBiA achieves impressive improvements of +36.0 PRR3 points and +24.0 AOR points on the unseen benchmark, as well as +9.0 PRR3 points and +7.0 AOR points on the seen benchmark. Meanwhile, our approach significantly outperforms G-Att-RNN by +14.0 PRR3 points and +6.9 AOR points on the unseen benchmark, as well as +5.8 PRR3 points and +2.4 AOR points on the seen benchmark.

Ablation Studies
As the gated unit and residual connections were shown to be useful in Section 4.1, we conducted an ablation analysis on the seen and unseen benchmarks, shown in Table 4, to further study how the other components contribute to the performance on top of G-Att-RNN.
Because it also focuses on the topic words of the prompt, the bi-attention mechanism, which adds response-to-prompt attention, is a beneficial replacement for uni-attention, with improvements of +2.0 PRR3 points and +1.6 AOR points on the unseen benchmark, as well as +2.6 PRR3 points and +1.5 AOR points on the seen benchmark. Besides, substituting the RNN with a CNN with average-pooling is also useful on the unseen benchmark, with improvements of +10.0 PRR3 and +4.0 AOR points. Although CNN with average-pooling caused a small drop in performance (-1.7% on seen AOR), CNN with max-pooling in turn achieves improvements on the seen benchmark of +2.6 PRR3 and +2.5 AOR. In general, CNN is more suitable than RNN for the contextual encoder layer in our model framework, for both seen and unseen prompts. Finally, we also benefit from the residual connections for the gated unit, with an improvement of +2.8 AOR points on the unseen benchmark.

Analysis
In this section, we analyze our model from two perspectives. One is the visualization of the bi-attention mechanism and the other is the dimension reduction analysis of the semantic matching representation. More details are given as follows:
Bi-Attention Visualization. Figure 2 gives the visualization of the bi-attention mechanism. The bi-attention mechanism can capture the interrogative "what" and the topic words "spare time" of the prompt "what do you do in your spare time", seen in subfigure 2a, the key phrases "usually watch movies" and "shopping" of the response seen in subfigure 2b, and the key phrases "change name" and "name" of the response seen in subfigure 2c. Due to the increased focus on the prompt, bi-attention is more beneficial for assessing the relevance of responses and prompts by matching the key phrases or words between them. The response in subfigure 2b is classified as on-topic, while the response in subfigure 2c is classified as off-topic.
Semantic Matching Representation Visualization. As the output vector of the relevance layer with the gated unit can better represent the relevance of prompts and responses, the semantic matching representation was obtained from the relevance layer. The result, visualized with t-SNE (Maaten and Hinton, 2008), is shown in Figure 3. Subfigure 3a shows the true response distribution of one prompt, "describe a special meal that you have had, what the meal was, who you had this meal with and explain why this meal was special", which has a clear semantic topic, "meal". Subfigure 3b shows the response distribution obtained with our semantic matching representation on the same prompt as subfigure 3a.
We can see that the semantic matching representation of our model maintains good performance on this kind of prompt, which has one clear semantic topic that limits the discussion to one scope. In contrast, some prompts are open-ended and their responses are divergent. Taking the prompt "what do you do in your spare time" as an example, we can observe its true response distribution in subfigure 3c. Compared with the true distribution in subfigure 3c, our model tends to predict responses as on-topic, as seen in subfigure 3d, because a high on-topic recall (0.999) is required.

Negative Sampling Augmentation Method
To investigate the impact of training data size, we conduct experiments with varying sizes of training data. In Figure 4, we find that the larger the training data size, the better the performance.
[Figure 3 panels: (a) True response distribution on a clear-semantic topic prompt. (b) Model's response distribution on a clear-semantic topic prompt. (c) True response distribution on a divergent prompt. (d) Model's response distribution on a divergent prompt.]
To augment training data and strengthen the generalization of the off-topic response detection model for unseen prompts, we propose a new and effective negative sampling method for the off-topic response detection task. Compared with the previous method of generating only one negative sample for each positive one, we generate two. The first one is chosen randomly as before, and the second one consists of the words of the first one in shuffled order. This method contributes to the diversity of negative samples in the training data. The size of our training data reaches 1.67M, compared with 1.12M with the previous negative sampling method. To keep the training data balanced, we assign weights of 1 and 0.5 to positive and negative samples, respectively. As shown in Table 5, a significant performance improvement (+9.0 seen AOR and +24.1 unseen AOR) is achieved by this negative sampling method. Our model GCBiA equipped with negative sampling augmentation achieves 88.2% and 69.1% average off-topic response recall on seen and unseen prompts, conditioned on 0.999 on-topic recall.
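The augmentation procedure can be sketched as below. This is a minimal illustration under stated assumptions rather than the pipeline used in the paper: the function name `build_examples`, the tuple layout (prompt, response, label, weight), the fixed seed, and the toy responses are all hypothetical, and at least two prompts are required so that a negative can be drawn from another prompt.

```python
import random


def build_examples(responses_by_prompt, seed=0):
    """Create one positive and two negatives per on-topic response.

    Returns tuples (prompt, response, label, sample_weight), where label 1 is
    on-topic and label 0 is off-topic.
    """
    rng = random.Random(seed)
    prompts = list(responses_by_prompt)
    examples = []
    for prompt in prompts:
        others = [p for p in prompts if p != prompt]  # needs at least two prompts
        for response in responses_by_prompt[prompt]:
            examples.append((prompt, response, 1, 1.0))
            # First negative: a random response given to a different prompt.
            donor = rng.choice(others)
            negative = rng.choice(responses_by_prompt[donor])
            examples.append((prompt, negative, 0, 0.5))
            # Second negative: the same words in shuffled order.
            words = negative.split()
            rng.shuffle(words)
            examples.append((prompt, " ".join(words), 0, 0.5))
    return examples


# Toy data (hypothetical responses).
data = {
    "What kind of flowers do you like?": [
        "i like iris a lot",
        "roses are my favourite flowers",
    ],
    "What do you do in your spare time?": [
        "i usually watch movies",
        "i go shopping with friends",
    ],
}
examples = build_examples(data)
```

Because each positive comes with two negatives weighted 0.5, the total weight of the two classes stays equal, which matches the stated goal of keeping the training data balanced.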

Conclusion
In this paper, we carried out a series of studies on the task of off-topic response detection. First, a model framework of five major layers was proposed, in which the bi-attention mechanism and convolutions were used to capture the topic words of prompts and the key phrases of responses, a gated unit was applied as the relevance layer to obtain a better semantic matching representation, and residual connections were added to each major layer. Moreover, a visualization analysis of the off-topic model was given to study how the model works. Finally, a novel negative sampling augmentation method was introduced to augment off-topic training data. We verified the effectiveness of our approach and achieved significant improvements on both seen and unseen test data.