Conversational Word Embedding for Retrieval-Based Dialog System

Human conversations contain many types of information, e.g., knowledge, common sense, and language habits. In this paper, we propose a conversational word embedding method named PR-Embedding, which utilizes conversation pairs to learn word embedding. Different from previous works, PR-Embedding uses vectors from two different semantic spaces to represent the words in the post and the reply. To catch the information among the pair, we first introduce the word alignment model from statistical machine translation to generate the cross-sentence window, then train the embedding at the word level and the sentence level. We evaluate the method on single-turn and multi-turn response selection tasks for retrieval-based dialog systems. The experimental results show that PR-Embedding can improve the quality of the selected response.


Introduction
Word embedding is one of the most fundamental tasks in NLP, in which low-dimensional word representations are learned from unlabeled corpora. The pre-trained embeddings can reflect the semantic and syntactic information of words and help various downstream tasks achieve better performance (Collobert et al., 2011; Kim, 2014).
Traditional word embedding methods train models based on co-occurrence statistics, such as Word2vec (Mikolov et al., 2013a,b) and GloVe (Pennington et al., 2014). These methods are widely used in dialog systems, not only in retrieval-based methods (Wang et al., 2015; Yan et al., 2016) but also in generation-based models (Serban et al., 2016; Zhang et al., 2018b). Retrieval-based methods predict the answer based on the similarity of the context and the candidate responses, and can be divided into single-turn models (Wang et al., 2015) and multi-turn models (Wu et al., 2017; Zhou et al., 2018; Ma et al., 2019) based on the number of turns in the context. These methods construct the representations of the context and the response within a single vector space. Consequently, the models tend to select responses containing the same words.
On the other hand, as such static embeddings cannot cope with polysemy, researchers have recently paid more attention to contextual representations. ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and XLNet (Yang et al., 2019) have achieved great success in many NLP tasks. However, it is difficult to apply them in industrial dialog systems due to their low computational efficiency.
In this paper, we focus on static embeddings, as they are flexible and efficient. Previous works learn embeddings from intra-sentence context within a single vector space, which is not enough for dialog systems. Specifically, the semantic correlation beyond a single sentence in the conversation pair is missing. For example, the words 'why' and 'because' usually come from different speakers, and we cannot catch their relationship with a context window within a sentence. Furthermore, when the words in the post and the reply are mapped into the same vector space, the model tends to select boring replies with repeated content, because repeated words can easily get a high similarity.
To tackle this problem, we propose PR-Embedding (Post-Reply Embedding) to learn representations from conversation pairs in different spaces. Firstly, we represent the post and the reply in two different spaces, similar to the source and target languages in machine translation. Then, the word alignment model is introduced to generate the cross-sentence window.

Notation
We consider two vocabularies for the post and the reply, V^p := {v^p_1, v^p_2, ..., v^p_s} and V^r := {v^r_1, v^r_2, ..., v^r_s}, together with two embedding matrices E^p, E^r ∈ R^{s×d}, where s is the size of the vocabulary and d is the embedding dimension. We need to learn the embedding from the conversation pair ⟨post, reply⟩, which can be formulated as P = (p_1, ..., p_m) and R = (r_1, ..., r_n), where m and n are the lengths of the post and the reply, respectively. For each pair in the conversation, we represent the post and the reply in the two spaces E^p and E^r, by which we can encode the relationship between the post and the reply into the word embeddings.
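The setup above can be sketched in a few lines of NumPy; the sizes, token ids, and matrix names here are illustrative and not from the paper.

```python
import numpy as np

# Hypothetical sizes: vocabulary size s, embedding dimension d.
s, d = 1000, 50
rng = np.random.default_rng(0)

# Two separate embedding matrices: E_p for words in posts, E_r for replies.
E_p = rng.normal(scale=0.1, size=(s, d))
E_r = rng.normal(scale=0.1, size=(s, d))

# A pair <post, reply> is represented by looking up each side in its own space.
post_ids = [3, 17, 42]   # token ids of the post, m = 3
reply_ids = [8, 99]      # token ids of the reply, n = 2
P = E_p[post_ids]        # shape (m, d)
R = E_r[reply_ids]       # shape (n, d)
```

The key design point is that the same surface word gets two independent vectors, one per side of the conversation.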

Conversational Word Alignment
Similar to previous works (Mikolov et al., 2013b; Pennington et al., 2014), we also learn the embeddings based on word co-occurrence. The difference is that we capture both intra-sentence and cross-sentence co-occurrence. Within a single sentence, adjacent words usually have a more explicit semantic relation, so we calculate the intra-sentence co-occurrence based on a context window of fixed size.
However, the relationship between cross-sentence words is no longer related to their distance. As shown in Figure 1, the last word of the post, 'from', is adjacent to the first word of the reply, 'i', but they have no apparent semantic relation. So for each word in the pair we need to find the most related word in the other sequence; in other words, we need to build a conversational word alignment between the post and the reply.
In this paper, we solve this with the word alignment model from statistical machine translation (Och and Ney, 2003). We treat the post as the source language and the reply as the target language. Then we align the words in the pair with the word alignment model and generate a cross-sentence window centered on the aligned word.
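The paper uses a statistical word alignment model in the spirit of Och and Ney (2003); as a rough, self-contained illustration, here is a minimal IBM Model 1 EM sketch that aligns each reply word to its most probable post word. The function names and the toy pairs are our own, and real systems would use a full toolkit (e.g., GIZA++) instead.

```python
from collections import defaultdict

def ibm_model1(pairs, iters=10):
    """Minimal IBM Model 1 EM over (post, reply) pairs, treating the
    post as the source language and the reply as the target."""
    t = defaultdict(lambda: 1.0)  # uniform init of t(reply_word | post_word)
    for _ in range(iters):
        count, total = defaultdict(float), defaultdict(float)
        for post, reply in pairs:
            for r in reply:
                norm = sum(t[(r, p)] for p in post)
                for p in post:
                    c = t[(r, p)] / norm  # expected alignment count
                    count[(r, p)] += c
                    total[p] += c
        for (r, p), c in count.items():
            t[(r, p)] = c / total[p]  # M-step: renormalize
    return t

def align(post, reply, t):
    """For each reply word, pick the post word with the highest t(r|p);
    a cross-sentence window would then be centered on that post word."""
    return {r: max(post, key=lambda p: t[(r, p)]) for r in reply}

pairs = [(["where", "are", "you", "from"], ["i", "am", "from", "china"]),
         (["where", "do", "you", "live"], ["i", "live", "in", "china"])]
t = ibm_model1(pairs)
alignment = align(*pairs[0], t)
```

With realistic data volumes, the learned `t` concentrates on semantically related pairs, giving each word a cross-sentence window around its aligned partner rather than around positionally adjacent words.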

Embedding Learning
We train the conversational word embedding at the word level and the sentence level.

Word-level. PR-Embedding first learns the word representations from word-level co-occurrence. Following the previous work (Pennington et al., 2014), we train the embedding with the global log-bilinear regression model

J = Σ_{i,k} f(X_ik) (w_i^T w̃_k + b_i + b̃_k − log X_ik)²,

where X_ik is the number of times word k occurs in the context of word i, w and w̃ are the word vector and the context word vector, b and b̃ are the biases, and f is the co-occurrence weighting function. We construct the final word representations as the summation of w and w̃.
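As a rough illustration, a single term of this GloVe-style objective can be computed as follows; the weighting function with `x_max = 100` and `alpha = 0.75` follows the standard GloVe defaults, and all variable names are our own.

```python
import numpy as np

def glove_loss(w_i, w_k, b_i, b_k, x_ik, x_max=100.0, alpha=0.75):
    """One weighted least-squares term of the GloVe objective for a
    (word i, context word k) pair with co-occurrence count x_ik."""
    weight = min((x_ik / x_max) ** alpha, 1.0)  # f(X_ik)
    diff = w_i @ w_k + b_i + b_k - np.log(x_ik)
    return weight * diff ** 2

rng = np.random.default_rng(0)
w_i, w_k = rng.normal(size=5), rng.normal(size=5)
loss = glove_loss(w_i, w_k, 0.0, 0.0, x_ik=10.0)
```

In PR-Embedding, `w_i` would come from the post space and `w_k` from the reply space when the pair falls in a cross-sentence window, which is what ties the two spaces together during word-level training.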
Sentence-level. To learn the relationship between the embeddings from the two spaces, we further train the embedding with a sentence-level classification task. We match the words in the post and the reply based on the embeddings from word-level learning, then encode the match features with a CNN (Kim, 2014) followed by max-pooling for prediction:

M̂_i = f(W_1 · M_{i:i+h−1} + b_1), M = maxpool(M̂_1, ..., M̂_{m−h+1}),

where W_1 and b_1 are trainable parameters, M_{i:i+h−1} refers to the concatenation of (M_i, ..., M_{i+h−1}), and h is the window size of the filter. Finally, we feed the vector M into a fully-connected layer with sigmoid output activation:
ŷ = σ(W_2 · M + b_2),

where W_2 and b_2 are trainable weights. We minimize the cross-entropy loss between the prediction and the ground truth for training.
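The sentence-level classifier can be sketched end-to-end in NumPy. This is a minimal reconstruction under stated assumptions: the match matrix is taken as a dot-product of the two embedding sets, the nonlinearity is assumed to be ReLU, and all shapes and names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sentence_score(P, R, W1, b1, W2, b2, h=3):
    """Sentence-level matching sketch: build a word-match matrix from
    the post/reply embeddings, apply a width-h 1-D convolution with
    ReLU, max-pool over positions, then a sigmoid output layer."""
    M = P @ R.T                       # (m, n) word-match matrix
    feats = []
    for i in range(M.shape[0] - h + 1):
        window = M[i:i + h].ravel()   # concatenation of M_i..M_{i+h-1}
        feats.append(np.maximum(0.0, W1 @ window + b1))  # ReLU filter
    pooled = np.max(np.stack(feats), axis=0)             # max-pooling
    return sigmoid(W2 @ pooled + b2)  # probability the reply matches

rng = np.random.default_rng(0)
m, n, d, h, f = 5, 4, 8, 3, 6        # toy sizes: lengths, dim, window, filters
P, R = rng.normal(size=(m, d)), rng.normal(size=(n, d))
W1, b1 = rng.normal(size=(f, h * n)), np.zeros(f)
W2, b2 = rng.normal(size=f), 0.0
p_match = sentence_score(P, R, W1, b1, W2, b2, h)
```

Training this binary classifier against random negative replies pushes the post-space and reply-space vectors of genuinely co-occurring pairs toward high match scores.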

Experiment

Datasets
To better evaluate the embeddings, we choose manually annotated conversation datasets. For the English dataset, we use the multi-turn conversation dataset PersonaChat (Zhang et al., 2018a). For the Chinese dataset, we use an in-house labeled test set of single-turn conversations, which contains 935 posts and 12,767 candidate replies. Each reply has one of three labels: bad, middle, and good. The training set comes from Baidu Zhidao³ and contains 1.07 million pairs after cleaning.

Evaluation
Baselines. We use GloVe as our main baseline and compare PR-Embedding with the embedding layer of BERT, which can also be used as a static word embedding. We also compare with the public embeddings of Fasttext (Joulin et al., 2017) and DSG (Song et al., 2018).
³ https://zhidao.baidu.com/

Tasks. We focus on the response selection task for retrieval-based dialog systems in both single-turn and multi-turn conversations. For the PersonaChat dataset, we use the current query for response selection in the single-turn task and conduct the experiments in the no-persona track, because we focus on the relationship between the post and the reply.

Models. For the single-turn task, we compare the embeddings based on BOW (bag-of-words, the average of all word embedding vectors) and select replies by cosine similarity. For the multi-turn task, we use a neural model called key-value (KV) memory network⁴ (Miller et al., 2016), which has proved to be a strong baseline in the ConvAI2 competition (Dinan et al., 2020).
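The single-turn BOW selection procedure can be sketched as follows; the shapes and data are synthetic, and in PR-Embedding the query would be looked up in the post space while candidates use the reply space.

```python
import numpy as np

def bow(vectors):
    """Bag-of-words sentence representation: average of word vectors."""
    return np.mean(vectors, axis=0)

def select_reply(query_vecs, candidate_vecs_list):
    """Pick the candidate reply whose BOW vector has the highest
    cosine similarity with the query's BOW vector."""
    q = bow(query_vecs)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = [cos(q, bow(c)) for c in candidate_vecs_list]
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
d = 16
query = rng.normal(size=(3, d))
cands = [rng.normal(size=(4, d)) for _ in range(5)]
cands[2] = query + 0.01 * rng.normal(size=(3, d))  # near-duplicate candidate
best, scores = select_reply(query, cands)
```

Note how the near-duplicate candidate wins under a single shared space; this is exactly the repeated-words bias that representing the two sides in separate spaces is meant to counteract.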
Metrics. Following previous work (Zhang et al., 2018a), we use the recall at position k from 20 candidates (hits@k; only one candidate reply is correct) as the metric on the PersonaChat dataset.
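For concreteness, hits@k over a set of examples can be computed as follows; by convention here the true reply's score sits at index 0 of each example, which is an assumption of this sketch rather than a detail from the paper.

```python
def hits_at_k(score_lists, k):
    """hits@k: fraction of examples whose true reply (index 0)
    ranks in the top k candidates by score."""
    hit = 0
    for scores in score_lists:
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        if 0 in order[:k]:
            hit += 1
    return hit / len(score_lists)

# Two toy examples: the true reply scores highest in the first
# example and third-highest in the second.
examples = [[0.9, 0.1, 0.2, 0.3], [0.5, 0.8, 0.6, 0.1]]
h1 = hits_at_k(examples, 1)  # 0.5: true reply ranked first in one of two
h5 = hits_at_k(examples, 5)  # 1.0: both true replies fall within the top 5
```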
For the Chinese dataset, we use NDCG and P@1 to evaluate the ranking quality of the candidate replies.
Setup. We train the model with Adagrad (Duchi et al., 2011) and implement it in Keras (Chollet et al., 2015) with the TensorFlow backend. For the PersonaChat dataset, we train the embeddings on the training set containing about 10k conversation pairs, use the validation set to select the best embeddings, and report the performance on the test set.

Results
The results on the PersonaChat dataset are shown in Table 1. The strongest baseline in the single-turn task is GloVe, but PR-Embedding outperforms it by 4.4%. For the multi-turn task, we concatenate PR-Embedding with the original embedding layer of the model. We find that the performance becomes much better when we concatenate PR-Embedding with the randomly initialized embedding. The KVMemnn model becomes much stronger when its embedding layer is initialized with GloVe embeddings; however, PR-Embedding still improves the performance significantly.
The results on the in-house dataset are in Table 2. Our method (PR-Emb) significantly exceeds all the baselines in all metrics. The improvement is greater than on the English dataset because the training corpus is much larger. Note that all the improvements on both datasets are statistically significant (p-value ≤ 0.01).

Ablation
We conduct the ablations on the Chinese dataset in consideration of its larger training corpus. The results are in the last part of Table 2. When we merge the two vector spaces into a single one (w/o PR), the model is similar to GloVe with sentence-level learning. The performance becomes much worse on all metrics, which shows the effect of the two vector spaces. Furthermore, all the scores drop significantly after sentence-level learning is removed (w/o SLL), which shows its necessity.

Nearest Tokens
We provide an analysis based on the nearest tokens to selected words in the whole vector space, including the word itself. For PR-Embedding, we select words from the post vocabulary and give the nearest words in both the post space and the reply space. Note that all the embeddings are trained on the training set of the PersonaChat dataset.
The results are in Table 3. In the GloVe and P-Emb columns, the words are identical (the first one) or similar to the selected ones, because within a single vector space the nearest token to any word is the word itself. This similarity makes the model tend to select replies with repeated words. In contrast, the words in the R-Emb column are relevant to the selected words, such as 'why' and 'because,' 'thanks' and 'welcome,' 'congratulations' and 'thank.' These pairs indicate that PR-Embedding catches the correlation among conversation pairs, which helps the model select relevant and content-rich replies.
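The within-space versus cross-space neighborhoods can be reproduced with a small cosine nearest-neighbor lookup; the matrices here are random placeholders, and with trained PR-Embedding matrices the cross-space query is what surfaces pairs like 'why' and 'because.'

```python
import numpy as np

def nearest(word_id, E_query, E_target, k=4):
    """Indices of the k nearest tokens to E_query[word_id], by cosine
    similarity against rows of E_target.  With E_target = E_query this
    is the usual within-space neighborhood; passing the other space's
    matrix retrieves cross-space neighbors instead."""
    q = E_query[word_id]
    E = E_target / np.linalg.norm(E_target, axis=1, keepdims=True)
    sims = E @ (q / np.linalg.norm(q))
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
E_p = rng.normal(size=(100, 16))  # placeholder post-space embeddings
E_r = rng.normal(size=(100, 16))  # placeholder reply-space embeddings
within = nearest(7, E_p, E_p)     # within one space, the word itself is nearest
cross = nearest(7, E_p, E_r)      # cross-space neighbors need not include it
```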

Visualization
To further explore how PR-Embedding represents words and the relation between the two spaces, we use t-SNE (Maaten and Hinton, 2008) to visualize the embeddings of the 40 most frequent words, excluding stop words, in both spaces.
The embeddings are visualized in Figure 2. Within the same space, words with similar semantic meanings are close to each other, for example 'hello' and 'hi,' 'good' and 'great,' 'not' and 'no,' indicating that PR-Embedding catches the similarity within a single space. For the same word in the two different spaces, most locations are close, especially for nouns and verbs such as 'work,' 'think,' and 'know,' perhaps because they play similar roles in the post and the reply. Some question words behave differently: 'how' in the post space is close to 'good' and 'great' in the reply space, and 'why' to 'because,' showing clear relations between the post and reply spaces that conform to the habits of human dialog. Furthermore, PR-Embedding also captures the correlation between pronouns, such as 'my, we' and 'your.' We conclude that our method encodes the correlation between the two spaces into the embeddings.

Conclusions
In this paper, we have proposed a conversational word embedding method named PR-Embedding, which is learned from conversation pairs for retrieval-based dialog systems. We use the word alignment model from machine translation to calculate the cross-sentence co-occurrence and train the embedding at the word level and the sentence level. We find that PR-Embedding helps models select better responses in both single-turn and multi-turn conversations by catching the information among the pairs. In the future, we will adapt the method to more neural models, especially generation-based methods for dialog systems.

Figure 2: The visualization of the 40 words with the highest frequency in PR-Embedding.

Table 1: Experimental results on the test set of the PersonaChat dataset. The upper part compares the embeddings in the single-turn task and the lower part in the multi-turn task. train: train the embedding with the training set; emb: use the public embedding directly; †: results taken from the paper of the dataset.

Table 3: Four nearest tokens for the selected words, trained by our PR-Embedding (P/R-Emb) and GloVe.