Learning to Rank Question-Answer Pairs Using Hierarchical Recurrent Encoder with Latent Topic Clustering

In this paper, we propose a novel end-to-end neural architecture for ranking candidate answers that adapts a hierarchical recurrent neural network and a latent topic clustering module. With our proposed model, a text is encoded to a vector representation from a word-level to a chunk-level to effectively capture the entire meaning. In particular, by adapting the hierarchical structure, our model shows very small performance degradation in longer text comprehension, while other state-of-the-art recurrent neural network models suffer from it. Additionally, the latent topic clustering module extracts semantic information from target samples. This clustering module is useful for any text-related task because it allows each data sample to find its nearest topic cluster, thus helping the neural network model analyze the entire data. We evaluate our models on the Ubuntu Dialogue Corpus and on a consumer electronics domain question answering dataset, which is related to Samsung products. The proposed model shows state-of-the-art results for ranking question-answer pairs.


Introduction
Recently, neural network architectures have shown great success in many machine learning fields such as image classification, speech recognition, machine translation, chatbots, question answering, and other task-oriented areas. Among these, the automatic question answering (QA) task has long been considered a primary objective of artificial intelligence.
In the commercial sphere, the QA task is usually tackled by using pre-organized knowledge bases and/or information retrieval (IR) based methods, which are applied in popular intelligent voice agents such as Siri, Alexa, and Google Assistant (from Apple, Amazon, and Google, respectively). Another type of advanced QA system is IBM's Watson, which builds knowledge bases from unstructured data. These raw data are also indexed in search clusters to support user queries (Chu-Carroll et al., 2012).
In academic literature, researchers have intensely studied the sentence pair ranking task, which is a core technique in QA systems. The ranking task selects the best answer among candidates retrieved from knowledge bases or IR based modules. Many neural network architectures with end-to-end learning methods have been proposed to address this task (Yin et al., 2016; Wang and Jiang, 2016; Wang et al., 2017). These works focus on matching sentence-level text pairs (Wang et al., 2007; Yang et al., 2015; Bowman et al., 2015). Therefore, they have limitations in understanding longer texts such as multi-turn dialogues and explanatory documents, resulting in performance degradation on ranking as the length of the text becomes longer.
With the advent of the huge multi-turn dialogue corpus (Lowe et al., 2015), researchers have proposed neural network models to rank longer text pairs (Kadlec et al., 2015; Baudiš et al., 2016). These techniques are essential for capturing context information in multi-turn conversation or understanding multiple sentences in explanatory text.
In this paper, we focus on investigating a novel neural network architecture with an additional data clustering module to improve the performance in ranking answer candidates that are longer than a single sentence. This work can be used not only for the QA ranking task, but also to evaluate the relevance of the next utterance, generated by a dialogue model, to a given dialogue. The key contributions of our work are as follows:
First, we introduce a Hierarchical Recurrent Dual Encoder (HRDE) model to effectively calculate the affinity among question-answer pairs to determine the ranking. By encoding texts from a word-level to a chunk-level with a hierarchical architecture, the HRDE prevents performance degradation in understanding longer texts, while other state-of-the-art neural network models suffer from it.
Second, we propose a Latent Topic Clustering (LTC) module to extract latent information from the target dataset, and apply this additional information in end-to-end training. This module allows each data sample to find its nearest topic cluster, thus helping the neural network model analyze the entire data. The LTC module can be combined with any neural network as a source of additional information. This is a novel approach using latent topic cluster information for the QA task, especially by applying the combined model of HRDE and LTC to the QA pair ranking task.
Extensive experiments are conducted to investigate the efficacy and properties of the proposed model. Our proposed model outperforms previous state-of-the-art methods on the Ubuntu Dialogue Corpus, which is one of the largest text pair scoring datasets. We also evaluate the model on real-world QA data crawled from crowd-QA web pages and from Samsung's official web pages. Our model also shows the best results on the QA data when compared to previous neural network based models.

Related Work
Researchers have released question and answer datasets for research purposes and have proposed various models to solve these datasets. Wang et al. (2007), Yang et al. (2015), and Tan et al. (2015) introduced small datasets, such as WikiQA and InsuranceQA, for ranking sentences that have higher probabilities of answering questions. To alleviate the difficulty in aggregating datasets that are large and have no license restrictions, some researchers introduced new datasets for sentence similarity rankings (Baudiš et al., 2016; Lowe et al., 2015). As of now, the Ubuntu Dialogue dataset is one of the largest corpora openly available for text ranking.
To tackle the Ubuntu dataset, Lowe et al. (2015) adopted the "term frequency-inverse document frequency" approach to capture important words among context and next utterances (Ramos et al., 2003). Bordes et al. (2014) and Yu et al. (2014) proposed deep neural network architectures for embedding sentences and measuring similarities to select the answer sentence for a given question. Kadlec et al. (2015) used a convolutional neural network (CNN) architecture to embed the sentence, while a final output vector was compared to the target text to calculate the matching score. They also tried using long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), bidirectional LSTM, and an ensemble of all of those neural network architectures, and achieved the best results on the Ubuntu Dialogue Corpus dataset. Another type of neural architecture is the RNN-CNN model, which encodes each token with a recurrent neural network (RNN) and then feeds the encodings to the CNN (Baudiš et al., 2016). Researchers also introduced attention based models to improve the performance (Tan et al., 2015; Wang and Jiang, 2016; Wang et al., 2017).
Recently, the hierarchical recurrent encoder-decoder model was proposed to embed contextual information in user query prediction and dialogue generation tasks (Sordoni et al., 2015; Serban et al., 2016). This shows improvement in the dialogue generation model, where the context for the utterance is important. As another type of neural network architecture, the memory network was proposed by Sukhbaatar et al. (2015). Several researchers adopted this architecture for reading comprehension (RC) style QA tasks, because it can extract contextual information from each sentence and use it in finding the answer (Xiong et al., 2016; Kumar et al., 2016). However, none of this research has been applied to the QA pair ranking task directly.

Model
In this section, we describe a previously released neural text ranking model, and then introduce our proposed neural network model.

Recurrent Dual Encoder (RDE)
A sequence of input data is fed into the recurrent neural network (RNN), which forms the network's internal hidden state $h_t$ to model the time-series patterns. This internal hidden state is updated at each time step with the input data $w_t$ and the hidden state of the previous time step $h_{t-1}$ as follows:
$$h_t = f_\theta(h_{t-1}, w_t), \tag{1}$$
where $f_\theta$ is the RNN function with weight parameter $\theta$, $h_t$ is the hidden state at the $t$-th word input, and $w_t$ is the $t$-th word in a target question $w^Q = \{w^Q_{1:t_q}\}$ or an answer text $w^A = \{w^A_{1:t_a}\}$.
The previous RDE model uses two RNNs for encoding question text and answer text to calculate the affinity among texts (Lowe et al., 2015). After encoding each part of the data, the affinity among the text pairs is calculated by using the final hidden state value of each of the question and answer RNNs. The matching probability between question text $w^Q$ and answer text $w^A$, together with the training objective, is as follows:
$$p(\text{label} \mid w^Q, w^A) = \sigma\big((h^Q_{t_q})^\top M\, h^A_{t_a} + b\big), \qquad \mathcal{L} = -\log \prod_{n=1}^{N} p(\text{label}_n \mid w^Q_n, w^A_n), \tag{2}$$
where $h^Q_{t_q}$ and $h^A_{t_a}$ are the last hidden states of the question and answer RNNs, respectively, with dimensionality $h_t \in \mathbb{R}^{d}$. The matrix $M \in \mathbb{R}^{d \times d}$ and the bias $b$ are learned model parameters. $N$ is the total number of samples used in training and $\sigma$ is the sigmoid function.
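As an illustration of the scoring step described above, the following minimal sketch computes the bilinear matching probability from two already-encoded vectors and the corresponding binary cross-entropy loss. It is not the authors' implementation: the function names (`bilinear_score`, `cross_entropy`), the toy vectors, and the small constant `eps` are assumptions made only for this example, and the RNN encoders themselves are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bilinear_score(h_q, M, h_a, b):
    """Return sigmoid(h_q^T M h_a + b) for list vectors and a d x d matrix M."""
    # First compute the matrix-vector product M h_a.
    Mh = [sum(M[i][j] * h_a[j] for j in range(len(h_a))) for i in range(len(M))]
    logit = sum(h_q[i] * Mh[i] for i in range(len(h_q))) + b
    return sigmoid(logit)

def cross_entropy(probs, labels):
    """Negative log-likelihood of binary labels under predicted match probabilities."""
    eps = 1e-12  # assumption: small constant to avoid log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total

# Toy final hidden states with d = 2 (illustrative values, not real encodings).
h_q = [1.0, 0.0]
h_a_good = [1.0, 0.0]
h_a_bad = [-1.0, 0.0]
M = [[2.0, 0.0], [0.0, 2.0]]

p_good = bilinear_score(h_q, M, h_a_good, 0.0)
p_bad = bilinear_score(h_q, M, h_a_bad, 0.0)
loss = cross_entropy([p_good, p_bad], [1, 0])
```

In this toy setting the aligned answer vector receives a probability above 0.5 and the opposed one a probability below 0.5, so the loss is lower than it would be with the labels swapped.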

Hierarchical Recurrent Dual Encoder (HRDE)
We now explain our proposed model. The previous RDE model encodes the text in the question or in the answer with an RNN architecture. It becomes less effective as the length of the word sequences in the text increases, because of the RNN's natural tendency to forget information from long-ranging data. To address this forgetting phenomenon, Bahdanau et al. (2014) proposed an attention mechanism; however, we found that it still showed a limitation when we consider data with very long sequences, such as the average of 162 steps in the Ubuntu Dialogue Corpus dataset (see Table 1). To overcome this limitation, we designed the HRDE architecture. The HRDE model divides long sequential text data into small chunks, such as sentences, and encodes the whole text from word-level to chunk-level by using two hierarchical levels of RNN architecture.
Figure 1 shows a diagram of the HRDE model. The word-level RNN part is responsible for encoding the word sequence $w_c = \{w_{c,1:t}\}$ in each chunk. The chunks can be sentences in a paragraph, paragraphs in an essay, turns in a dialogue, or any other kind of smaller meaningful subset of the text. The final hidden state of each chunk is then fed into the chunk-level RNN with the original sequence order kept. Therefore, the chunk-level RNN can deal with pre-encoded chunk data in fewer sequential steps. The hidden states of the hierarchical RNNs are as follows:
$$h_{c,t} = f_\theta(h_{c,t-1}, w_{c,t}), \qquad u_c = g_\theta(u_{c-1}, h_c), \tag{3}$$
where $f_\theta$ and $g_\theta$ are the RNN functions in the hierarchical architecture with weight parameters $\theta$, and $h_{c,t}$ is the word-level RNN's hidden state at the $t$-th word in the $c$-th chunk. $w_{c,t}$ is the $t$-th word in the $c$-th chunk of the target question or answer text. $u_c$ is the chunk-level RNN's hidden state at the $c$-th chunk, and $h_c$ is the word-level RNN's last hidden state of each chunk, $h_c \in \{h_{1:c,t}\}$.
We use the same training objective as the RDE model, and the final matching probability between question and answer text is calculated using the chunk-level RNN as follows:
$$p(\text{label} \mid w^Q, w^A) = \sigma\big((u^Q_{c_q})^\top M\, u^A_{c_a} + b\big), \tag{4}$$
where $u^Q_{c_q}$ and $u^A_{c_a}$ are the chunk-level RNN's last hidden states for the question and answer text, respectively, with dimensionality $u_c \in \mathbb{R}^{d_u}$, so that $M \in \mathbb{R}^{d_u \times d_u}$.
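The two-level encoding can be sketched as follows. This is a simplified illustration rather than the model used in the experiments: a plain tanh recurrent cell stands in for the GRU, the weights are fixed toy values, and the names `rnn_step`, `encode_sequence`, and `hrde_encode` are hypothetical.

```python
import math

def rnn_step(h_prev, x, W_h, W_x):
    """One tanh recurrent step: tanh(W_h h_prev + W_x x). Stand-in for a GRU cell."""
    return [
        math.tanh(
            sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
            + sum(W_x[i][j] * x[j] for j in range(len(x)))
        )
        for i in range(len(h_prev))
    ]

def encode_sequence(inputs, W_h, W_x, dim):
    """Run the recurrent cell over a list of vectors and return the last hidden state."""
    h = [0.0] * dim
    for x in inputs:
        h = rnn_step(h, x, W_h, W_x)
    return h

def hrde_encode(chunks, word_params, chunk_params, dim):
    """chunks: list of chunks, each a list of word vectors. Returns the last chunk-level state."""
    W_h_word, W_x_word = word_params
    W_h_chunk, W_x_chunk = chunk_params
    # Word-level RNN: one final hidden state h_c per chunk.
    chunk_states = [encode_sequence(c, W_h_word, W_x_word, dim) for c in chunks]
    # Chunk-level RNN: consumes the chunk states in their original order.
    return encode_sequence(chunk_states, W_h_chunk, W_x_chunk, dim)

# Toy parameters, reused for both levels only to keep the example short.
params = ([[0.5, 0.0], [0.0, 0.5]], [[1.0, 0.0], [0.0, 1.0]])
chunks = [[[1.0, 0.0]], [[0.0, 1.0]]]  # two chunks with one word vector each

u = hrde_encode(chunks, params, params, 2)
u_swapped = hrde_encode(chunks[::-1], params, params, 2)
```

The chunk-level RNN takes one step per chunk rather than one step per word, and because the order of the chunks is preserved, reversing them changes the resulting encoding.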

Latent Topic Clustering (LTC)
To learn how to rank QA pairs, a neural network should be trained to find proper features that represent the information within the data, and to fit model parameters that can approximate the true hypothesis. For this type of problem, we propose the LTC module, which groups the target data to help the neural network find the true hypothesis with additional information from the topic clusters in end-to-end training.
The blue-dotted box on the right side of Figure 2 shows a diagram of the LTC structure. To assign topic information, we build an internal latent topic memory $m \in \mathbb{R}^{d_m \times K}$, which is the only model parameter to be learned, where $d_m$ is the vector dimension of each latent topic and $K$ is the number of latent topic clusters. For a given input sequence $x = \{x_{1:t}\}$ and these $K$ vectors, we construct the LTC process as follows:
$$p_k = \mathrm{softmax}(x^\top m_k), \qquad x_K = \sum_{k=1}^{K} p_k\, m_k, \qquad e = \mathrm{concat}\{x, x_K\}. \tag{5}$$
First, the similarity between $x$ and each latent topic vector is calculated by dot-product. Then the resulting $K$ values are normalized by the softmax function $\mathrm{softmax}(z_k) = e^{z_k} / \sum_i e^{z_i}$ to produce a similarity probability $p_k$. After calculating the latent topic probability $p_k$, $x_K$ is retrieved by summing over $m_k$ weighted by $p_k$. Then we concatenate this result with the original encoding vector to generate the final encoding vector $e$ with the LTC information added.
Figure 2: Diagram of the HRDE-LTC. The input vector is compared to each latent topic memory $m_k$ to calculate a vector containing cluster information. This vector is concatenated to the original input vector.
Note that the input of the LTC could come from any type of neural network based encoding function $x = f^{enc}_\theta(w)$, such as an RNN, a CNN, or a multilayer perceptron (MLP). In addition, if the dimension of $x$ is different from that of the memory vectors, an additional output projection layer should be placed after $x$ before applying the dot-product to the memory.
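The LTC steps described above (dot-product similarity, softmax normalization, weighted sum over the memory, and concatenation) can be sketched as below. The memory is stored here as a list of $K$ topic vectors, and the function name `ltc` and all toy values are assumptions for illustration; in the actual model the memory is learned during training.

```python
import math

def softmax(zs):
    """Numerically stable softmax over a list of scores."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def ltc(x, memory):
    """x: encoding vector; memory: list of K topic vectors with the same dimension as x.

    Returns (e, p), where e = concat(x, x_K) and p holds the K topic probabilities.
    """
    # Similarity between x and each latent topic vector by dot-product.
    sims = [sum(xi * mi for xi, mi in zip(x, m_k)) for m_k in memory]
    p = softmax(sims)
    # x_K: sum of the topic vectors weighted by their probabilities.
    dim = len(memory[0])
    x_K = [sum(p[k] * memory[k][i] for k in range(len(memory))) for i in range(dim)]
    return x + x_K, p

# Toy example with d_m = 2 and K = 3 (illustrative values only).
x = [1.0, 0.0]
memory = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
e, p = ltc(x, memory)
```

Here the first topic vector points in the same direction as `x`, so it receives most of the probability mass, and the returned vector `e` has dimension $d + d_m$.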

Combined Model of (H)RDE and LTC
As the LTC module extracts additional topic cluster information from the input data, we can combine this module with any neural network in their end-to-end training flow. In our experiments, we combine the LTC module with the RDE and HRDE models.

RDE with LTC
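The RDE model encodes question and answer texts to $h^Q_{t_q}$ and $h^A_{t_a}$, respectively. Hence, the LTC module can take these vectors as input to generate a vector $e$ to which latent topic cluster information has been added. With this vector, we calculate the affinity among question and answer texts while also using the additional cluster information. The following equation shows our RDE-LTC process:
$$p(\text{label} \mid w^Q, w^A) = \sigma\big((h^Q_{t_q})^\top M\, e^{A} + b\big), \tag{6}$$
where $e^{A}$ is the output of equation (5) with the LTC input $x = h^A_{t_a}$. Because $e^{A}$ is a concatenation, $M$ here maps between a $d$-dimensional question vector and a $(d + d_m)$-dimensional answer vector. In this case, we applied the LTC module only to the answer side, assuming that the answer text is longer than the question and therefore benefits from being clustered. To train the network, we use the same training objective, minimizing the cross-entropy loss, as in equation (2).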

HRDE with LTC
The LTC can be combined with the HRDE model in the same way it is applied in the RDE-LTC model, by modifying equation (6) as follows:
$$p(\text{label} \mid w^Q, w^A) = \sigma\big((u^Q_{c_q})^\top M\, e^{u,A} + b\big), \tag{7}$$
where $u^Q_{c_q}$ is the final hidden state vector of the chunk-level RNN for a question input sequence. $e^{u,A}$ is the vector with LTC information added from equation (5), where the LTC module takes the input $x = u^A_{c_a}$ from the HRDE model in equation (3). The HRDE-LTC model also uses the same training objective, minimizing the cross-entropy loss, as in equation (2). Figure 2 shows a diagram of the combined model with the HRDE and the LTC.

The Ubuntu Dialogue Corpus
The Ubuntu Dialogue Corpus was developed by Lowe et al. (2015) by expanding and preprocessing the Ubuntu Chat Logs, a collection of logs from Ubuntu-related chat rooms in which users solve problems in using the Ubuntu system.
Among the utterances in the dialogues, they consider each utterance, starting from the third one, as a potential {response}, while the previous utterances are considered as the {context}. In the later version of the dataset (Ubuntu-v2), the data are organized so that a model is trained on past data to predict future data, and the sampling procedure was changed to increase the average number of turns in the {context}. We consider this Ubuntu dataset to be one of the best datasets in terms of quality, quantity, and availability for evaluating the performance of a text ranking model.
To encode the text with the HRDE and HRDE-LTC models, a text needs to be divided into several chunk sequences with predefined criteria. For the Ubuntu-v1 dataset, we divide the {context} part by splitting on the end-of-sentence delimiter "__eos__", and we do not split the {response} part since it is normally short and does not contain "__eos__" information. For the Ubuntu-v2 dataset, we split the {context} part in the same way as for the Ubuntu-v1 dataset, but using only the end-of-turn delimiter "__eot__". Table 1 shows the properties of the Ubuntu dataset.
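The chunking step can be sketched as follows. The example assumes that the delimiters appear in the text as the literal tokens `__eos__` and `__eot__`, and that words are separated by whitespace; the function name `split_into_chunks` and the sample contexts are made up for illustration.

```python
def split_into_chunks(text, delimiter):
    """Split text on a delimiter token, then tokenize each non-empty chunk on whitespace."""
    return [chunk.split() for chunk in text.split(delimiter) if chunk.strip()]

# Ubuntu-v1 style: split the context on the end-of-sentence token.
context_v1 = "hello there __eos__ how do i install java __eos__ use apt-get __eos__"
chunks_v1 = split_into_chunks(context_v1, "__eos__")

# Ubuntu-v2 style: split the context on the end-of-turn token only,
# so end-of-sentence tokens stay inside their turn.
context_v2 = "hello __eos__ anyone here ? __eot__ yes , what is up __eot__"
chunks_v2 = split_into_chunks(context_v2, "__eot__")
```

Each inner list is one chunk for the word-level RNN, and the outer list gives the chunk order seen by the chunk-level RNN.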
Table 2: Example of a question-answer pair crawled from the web page.
Question: how do i set a timer of clock in applications and development for samsung galaxy s4 mini?
Answer: 1 from within the clock application, tap timer tab. 2 tap the hours, minutes, or seconds field and use the on-screen keypad to enter the hour, minute, or seconds. the timer plays an alarm at the end of the countdown. 3 tap start to start the timer. 4 tap stop to stop the timer or reset to reset the timer and start over. 5 tap restart to resume the timer counter.

Consumer Product QA Corpus
To test the robustness of the proposed model, we introduce an additional question and answer pair dataset based on actual users' interactions in the consumer electronics product domain. We crawled data from various sources, such as the Samsung Electronics official web site and crowd QA web sites, in a way similar to that used by Yoon et al. (2016) in building a QA system for consumer products. On the official web page, we can retrieve data consisting of user questions and matched answers, such as frequently asked questions and troubleshooting guides. On the crowd QA sites, there are many answers from various users for each question. Among these answers, we choose answers from company-certified users to keep the reliability of the answers high. If there are no such answers, we skip that question-answer pair. Table 2 shows an example of a question-answer pair crawled from the web page. In addition, we crawl hierarchical product category information related to the QA pairs. In particular, mobile, office, photo, tv/video, accessories, and home appliance are used as top-level categories, and specific categories such as galaxy s7, tablet, led tv, and others are used below them. We collected this meta-information for further use.
The total size of the Samsung QA data is over 100,000 pairs, and we split the data into approximately 80,000/10,000/10,000 samples to create train/valid/test sets, respectively. To create the train set, we use a QA pair sample as the ground truth and perform negative sampling for answers among the training set to create false-label samples. In this way, we generate ({question}, {answer}, flag) triples (see Table 1). We follow the same procedure to create the valid and test sets, differing only in that we draw more negative samples within each set, generating 9 false-label samples for each ground-truth sample. This is the same method by which the Ubuntu dataset is generated from the Ubuntu Dialogue Corpus, which keeps the two datasets consistent.
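The negative sampling procedure for the valid and test sets (one ground-truth answer plus 9 false-label answers per question) can be sketched as below. This is an illustration under stated assumptions, not the script used to build the dataset: the function name `build_ranking_samples`, the fixed seed, and the rule of excluding answers identical to the true answer are choices made only for this example.

```python
import random

def build_ranking_samples(qa_pairs, num_negatives, seed=0):
    """Return (question, answer, flag) triples: one positive and num_negatives negatives each."""
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    answers = [answer for _, answer in qa_pairs]
    samples = []
    for question, answer in qa_pairs:
        samples.append((question, answer, 1))
        # Candidate negatives are answers of other pairs within the same set.
        pool = [other for other in answers if other != answer]
        for negative in rng.sample(pool, num_negatives):
            samples.append((question, negative, 0))
    return samples

# Toy data: 12 distinct QA pairs, 9 false-label samples per question.
pairs = [("question %d" % i, "answer %d" % i) for i in range(12)]
samples = build_ranking_samples(pairs, num_negatives=9)
```

With 12 toy pairs this yields 120 triples, in groups of 10 candidates per question. `random.sample` draws without replacement, so the pool must contain at least `num_negatives` distinct answers.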
The Samsung QA dataset is available via web repository. We refer the readers to Appendix A for more examples of each dataset.

Ubuntu dataset case
To implement the RDE model, we use two single-layer Gated Recurrent Units (GRU) (Chung et al., 2014) with 300 hidden units. Each GRU is used to encode the {context} and the {response}, respectively. The weights of the two GRUs are shared. The hidden unit weight matrices of the GRU are initialized using orthogonal weights (Saxe et al., 2013), while the input embedding weight matrix is initialized using pre-trained GloVe embedding vectors (Pennington et al., 2014) with 300 dimensions. The vocabulary size is 144,953 and 183,045 for the Ubuntu-v1 and Ubuntu-v2 cases, respectively. We use the Adam optimizer (Kingma and Ba, 2014), with gradients clipped at norm value 1. The maximum time step for calculating the gradient of the RNN is determined according to the input data statistics in Table 1.
For the HRDE model, we use two single-layer GRUs with 300 hidden units for the word-level RNN part, and another two single-layer GRUs with 300 hidden units for the chunk-level RNN part. The weights of the GRUs are shared within the same hierarchical part (word-level and chunk-level). The other settings are the same as in the RDE model case. As for the combined models with the (H)RDE and the LTC, we set the latent topic memory dimension to 256 for both Ubuntu-v1 and Ubuntu-v2. The number of clusters in the LTC module is set to 3 for both the RDE-LTC and the HRDE-LTC cases. In the HRDE-LTC case, we applied the LTC module to the {context} part because it is longer and thus has enough information to be clustered. All of these hyper-parameters are selected from additional parameter searching experiments.
The dropout (Srivastava et al., 2014) is applied for the purpose of regularization with the ratio of: 0.2 for the RNN in the RDE and the RDE-LTC, 0.3 for the word-level RNN part in the HRDE and the HRDE-LTC, 0.8 for the latent topic memory in the RDE-LTC and the HRDE-LTC.
We note that our implementation of the RDE model has the same architecture as the LSTM model (Kadlec et al., 2015) in the Ubuntu-v1/v2 experiments. It also has the same architecture as the RNN model (Baudiš et al., 2016) in the Ubuntu-v2 experiment. We implement the same model ourselves because we need a baseline to compare with the other proposed models, namely the RDE-LTC, HRDE, and HRDE-LTC.
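The gradient clipping mentioned in the optimizer settings above can be illustrated with a short sketch. It assumes that clipping is applied to the global norm over all gradient vectors; the text only states that gradients are clipped with norm value 1, so this interpretation, along with the name `clip_by_global_norm`, is an assumption.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """grads: list of gradient vectors (lists of floats). Rescale so the global L2 norm <= max_norm."""
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        # Already within the limit: return an unchanged copy.
        return [list(vec) for vec in grads]
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads]

# Toy gradients with global norm 5, clipped to norm 1.
clipped = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
unchanged = clip_by_global_norm([[0.3], [0.4]], max_norm=1.0)
```

Clipping rescales all gradients by the same factor, so the update direction is preserved while its magnitude is bounded.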

Samsung QA dataset case
To test the Samsung QA dataset, we use the same implementations of the models (RDE, RDE-LTC, HRDE, and HRDE-LTC) used in testing the Ubuntu dataset. The only differences are that we use 100 hidden units for the RDE and the RDE-LTC, 300 hidden units for the HRDE, and 200 hidden units for the HRDE-LTC, and that the vocabulary size is 28,848. As for the combined models with the (H)RDE and the LTC, the dimension of the latent topic memory is 64 and the number of latent clusters is 4. We chose the best-performing hyper-parameters of each model through additional extensive hyper-parameter search experiments.
All of the code developed for the empirical results is available via a web repository.

Evaluation Metrics
We regard all the tasks as selecting the best answer among text candidates for the given question. Following previous work (Lowe et al., 2015), we report model performance as recall at k (R@k): the fraction of questions for which the relevant text is ranked within the top k among 2 or 10 given candidates (e.g., 1 in 2 R@1). Though this metric is designed for ranking tasks, the R@1 metric is also meaningful as a measure of classifying the single most relevant text.
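A minimal sketch of the R@k computation is given below. It assumes that each candidate group stores the score of the ground-truth text at index 0 and that ties are resolved in favor of the ground truth; both conventions, and the name `recall_at_k`, are assumptions for this example.

```python
def recall_at_k(score_groups, k):
    """score_groups: list of score lists, one per question; index 0 is the ground truth."""
    hits = 0
    for scores in score_groups:
        # Rank of the ground truth = number of other candidates with a strictly higher score.
        rank = sum(1 for s in scores[1:] if s > scores[0])
        if rank < k:
            hits += 1
    return hits / len(score_groups)

# Toy scores for three questions with three candidates each.
groups = [
    [0.9, 0.1, 0.2],  # ground truth ranked first
    [0.3, 0.8, 0.1],  # ground truth ranked second
    [0.2, 0.5, 0.9],  # ground truth ranked third
]
r1 = recall_at_k(groups, 1)
r2 = recall_at_k(groups, 2)
r3 = recall_at_k(groups, 3)
```

For "1 in 10 R@k", each group would hold the scores of the ground-truth text and 9 false-label candidates.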
Each model we implement is trained multiple times (10 and 15 times for the Ubuntu and the Samsung QA datasets, respectively) with random weight initialization, which largely influences the performance of a neural network model. Hence, we report model performance as mean and standard deviation values (Mean±Std).

Comparison with other methods
As Table 3 shows, our proposed HRDE and HRDE-LTC models achieve the best performance for the Ubuntu-v1 dataset. We also find that the RDE-LTC model shows improvements over the baseline model, RDE.
[Table caption fragment: the compared models are from (Lowe et al., 2015; Wang and Jiang, 2016; Wang et al., 2017; Baudiš et al., 2016; Tan et al., 2015), respectively.]
For the Ubuntu-v2 dataset, Table 4 reveals that the HRDE-LTC model is best for three cases (1 in 2 R@1, 1 in 10 R@2, and 1 in 10 R@5). Comparing the same model in our implementation (RDE) and in the implementation of Baudiš et al. (2016) (RNN), there is a large gap in accuracy (0.610 and 0.664 of 1 in 10 R@1 for RDE and RNN, respectively). We think this is largely influenced by the data preprocessing method, because the only difference between these models is the data preprocessing, which is a contribution of Baudiš et al. (2016). We are confident that our model would perform better on carefully prepared datasets that adopt such an extensive preprocessing method, because we see improvements from the RDE model to the HRDE model and additional improvements with the LTC module in all test cases (the Ubuntu-v1/v2 and the Samsung QA).
In the Samsung QA case, Table 5 indicates that the proposed RDE-LTC, HRDE, and HRDE-LTC models show performance improvements when compared to the baseline models, TF-IDF and RDE. The average accuracy statistics are higher in the Samsung QA case than in the Ubuntu case. We think this is due to the smaller vocabulary size and context variety: the Samsung QA dataset deals with narrower topics than the Ubuntu dataset. We are confident that our proposed model shows robustness across several datasets and different vocabulary size environments.

Degradation Comparison for Longer Texts
To verify the HRDE model's ability compared to the baseline model RDE, we split the test sets of the Ubuntu-v1/v2 datasets based on the "number of chunks" in the {context}. Then, we measured the top-1 recall (the same case as 1 in 10 R@1 in Tables 3 and 4) for each group. Figure 3 demonstrates that the HRDE models, in darker blue and red colors, show better performance than the RDE models, in lighter colors, for every "number of chunks" evaluation. In particular, the HRDE models remain consistent as the "number of chunks" increases, while the RDE models degrade as the "number of chunks" increases.
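The grouping used in this analysis can be sketched as follows. The record format (a pair of the number of chunks in the {context} and whether the ground truth was ranked first) and the function name `recall1_by_chunks` are assumptions, and the toy records are not experimental results.

```python
from collections import defaultdict

def recall1_by_chunks(records):
    """records: iterable of (num_chunks, top1_correct). Returns {num_chunks: top-1 recall}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for num_chunks, top1_correct in records:
        totals[num_chunks] += 1
        hits[num_chunks] += int(top1_correct)
    return {n: hits[n] / totals[n] for n in sorted(totals)}

# Toy records (illustrative only).
records = [(1, True), (1, False), (2, True), (3, False), (3, False)]
per_group = recall1_by_chunks(records)
```

Plotting these per-group values against the number of chunks for each model gives the kind of comparison shown in Figure 3.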

Effects of the LTC Numbers
We analyze the RDE-LTC model for different numbers of latent clusters. Table 6 indicates that model performance increases as the number of latent clusters increases (up to 3 for the Ubuntu case and 4 for the Samsung QA case). This is probably because the number of subjects differs between the datasets. The Samsung QA dataset has an internal category structure related to the type of consumer electronic product (6 top-level categories: mobile, office, photo, tv/video, accessories, and home appliance), so the LTC module can form clusters around these categories. The Ubuntu dataset, however, has diverse contents related to issues in using the Ubuntu system. Thus, the LTC module forms fewer clusters, with sparser topics, compared to the Samsung QA dataset.

Comprehensive Analysis of LTC
The Samsung QA dataset includes category information; hence, latent topic clustering results can be compared with real categories. We randomly choose 20k samples containing real category information and evaluate each sample with the HRDE-LTC model. The cluster with the highest similarity among the latent topic clusters is considered the representative cluster of each sample. Figure 4 shows the proportion of the four latent clusters among these samples according to real category information. Even though the HRDE-LTC model is trained without any ground-truth category labels, we observed that the latent clusters are formed in accordance with the categories. For instance, "cluster 2" is shown mostly in "Mobile" category samples, while "clusters 2 and 4" are rarely shown in "Home Appliance" category samples.
Additionally, we explore sentences with higher similarity scores from the HRDE-LTC module for each of the four clusters. As can be seen in Table 7, "cluster 1" contains "screen" related sentences (e.g., brightness, pixel, display type), while "cluster 2" contains sentences with information exclusive to the "Mobile" category (e.g., call rejection, voice level). This qualitative analysis explains why "cluster 2" is shown mostly in the "Mobile" category in Figure 4. We also discover that "cluster 3" has the largest portion of samples. As "cluster 3" contains "security" and "maintenance" related sentences (e.g., password, security, log-on, maintain), we assume that this is one of the frequently asked issues across all categories in the Samsung QA dataset. Table 7 shows example sentences with high scores from each cluster.
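The assignment of a representative cluster and the per-category proportions can be sketched as below. The input format (a category name paired with the list of latent topic probabilities for one sample), the 1-based cluster numbering, and the function names are assumptions for illustration; the toy values are not taken from the experiments.

```python
from collections import Counter, defaultdict

def representative_cluster(probs):
    """Return the 1-based index of the latent cluster with the highest similarity."""
    return max(range(len(probs)), key=lambda k: probs[k]) + 1

def cluster_proportions(samples):
    """samples: iterable of (category, probs). Returns {category: {cluster: proportion}}."""
    counts = defaultdict(Counter)
    for category, probs in samples:
        counts[category][representative_cluster(probs)] += 1
    result = {}
    for category, counter in counts.items():
        total = sum(counter.values())
        result[category] = {k: n / total for k, n in sorted(counter.items())}
    return result

# Toy samples with four latent clusters (illustrative only).
toy = [
    ("Mobile", [0.1, 0.7, 0.1, 0.1]),
    ("Mobile", [0.2, 0.5, 0.2, 0.1]),
    ("Mobile", [0.1, 0.1, 0.6, 0.2]),
    ("Home Appliance", [0.6, 0.1, 0.2, 0.1]),
]
proportions = cluster_proportions(toy)
```

Computing these proportions over the sampled data for each real category gives the kind of breakdown shown in Figure 4.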

Conclusion
In this paper, we proposed the HRDE model and the LTC module. The HRDE showed higher performance in ranking answer candidates and less performance degradation when dealing with longer texts compared to conventional models. The LTC module provided additional performance improvements when combined with both the RDE and HRDE models, as it added latent topic cluster information according to dataset properties. With this proposed model, we achieved state-of-the-art performance on the Ubuntu datasets. We also evaluated our model on a real-world question answering dataset, Samsung QA, where it achieved the best results, demonstrating the robustness of the proposed model.