Vocabulary Pyramid Network: Multi-Pass Encoding and Decoding with Multi-Level Vocabularies for Response Generation

We study the task of response generation. Conventional methods employ a fixed vocabulary and one-pass decoding, which not only makes them prone to safe and generic responses but also offers no way to further refine the first generated raw sequence. To tackle these two problems, we present a Vocabulary Pyramid Network (VPN), which incorporates multi-pass encoding and decoding with multi-level vocabularies into response generation. Specifically, the dialogue input and output are represented by multi-level vocabularies obtained from hierarchical clustering of raw words. Multi-pass encoding and decoding are then conducted over these multi-level vocabularies. Since VPN can leverage rich encoding and decoding information across multi-level vocabularies, it has the potential to generate better responses. Experiments on English Twitter and Chinese Weibo datasets demonstrate that VPN remarkably outperforms strong baselines.


Introduction
As one of the long-term goals of AI and NLP, automatic conversation is devoted to building dialogue systems that communicate with humans (Turing, 1950). Benefiting from the large-scale human-human conversation data available on the Internet, data-driven dialogue systems have attracted increasing attention from both academia and industry (Ritter et al., 2011; Shang et al., 2015a; Vinyals and Le, 2015; Li et al., 2016a,c, 2017). Recently, a popular approach to building dialogue engines is to learn a response generation model within an encoder-decoder framework such as the sequence-to-sequence (Seq2Seq) model (Cho et al., 2014a). In such a framework, an encoder transforms the source sequence into hidden vectors, and a decoder generates the target sequence based on the encoded vectors and previously generated words. In this process, the encoder and decoder share a vocabulary (word list)1, and the target words are typically predicted by a softmax classifier over the vocabulary, word by word.

Figure 1: Vocabulary pyramid networks for response generation. The dialogue input (context) and output (response) are represented by multi-level vocabularies (e.g., raw words, low-level clusters and high-level clusters) and then processed by a multi-pass encoder and decoder.
However, such a typical Seq2Seq model is prone to generating safe and repeated responses, such as "Me too" and "I don't know". In addition to the exposure bias issue2, the main reasons for this problem are: 1) a fixed (single) vocabulary (word list) in decoding, which usually covers high-frequency words, so the model easily captures high-frequency patterns (e.g., "Me too") and loses a great deal of the content carried by middle- and low-frequency patterns; 2) one-pass decoding, where word-by-word generation from left to right is prone to error accumulation, since previously generated erroneous words greatly affect the words yet to be generated. More importantly, one-pass decoding can leverage only the previously generated words, not the future, un-generated ones.

1 The encoder and decoder may have different word lists. We find performance is close whether the vocabularies are shared or not.
2 A model generates the next word given the previous gold words during training, but conditions on previously predicted words at test time (Ranzato et al., 2016).
In fact, several lines of research on text generation tasks such as dialogue generation, machine translation and text summarization are dedicated to solving the above issues. To alleviate the fixed-vocabulary issue, Wu et al. (2018a) incorporated a dynamic vocabulary mechanism into Seq2Seq models, which dynamically allocates a vocabulary for each input via a vocabulary prediction model. Xing et al. (2017) presented topic-aware response generation by incorporating topic words obtained from a pre-trained LDA model (Blei et al., 2003). Several works have also attempted to resolve the dilemma of one-pass decoding. Xia et al. (2017) proposed the deliberation network for sequence generation, where a first-pass decoder generates a rough sequence and a second-pass decoder then refines it.
However, so far there has been no unified framework that solves both of the aforementioned problems. In this study, we present Vocabulary Pyramid Networks (VPN) to tackle the issues of a single fixed vocabulary and one-pass decoding simultaneously. Specifically, VPN incorporates multi-pass encoding and decoding with multi-level vocabularies into response generation. As depicted in Figure 1, the multi-level vocabularies contain raw words, low-level clusters and high-level clusters, where the low-level and high-level clusters are obtained from hierarchical clustering of the raw words. The multi-pass encoder (raw-word encoder, low-level encoder, and high-level encoder) gradually works on diminishing vocabularies, from raw words to low-level clusters and finally to high-level clusters, resembling a "pyramid" with respect to vocabulary size. Conversely, the multi-pass decoder gradually increases the size of the processed vocabulary, from high-level clusters to low-level clusters and finally to raw words.
From a theoretical point of view, in human-human conversations people usually associate raw input words with low-level or high-level abstractions such as semantic meanings and concepts. Based on these abstractions, people organize content and select expressive words as the response (Xing et al., 2017). From a practical perspective, VPN can capture much richer sequence information through multi-level vocabularies. As a result, VPN has the potential to generate better responses.
To verify the effectiveness of the proposed model, we conduct experiments on two public response generation datasets: English Twitter and Chinese Weibo. Both automatic and manual evaluations demonstrate that the proposed VPN is remarkably better than the state-of-the-art.

Sequence-to-Sequence Model
In Seq2Seq models (Cho et al., 2014a), an encoding RNN (recurrent neural network) transforms the source sequence X = {x_1, x_2, ..., x_{L_X}} into a sequence of hidden states:

h_t = f(x_t, h_{t-1}),

where x_t denotes the word embedding of x_t, and f is a non-linear transformation; GRU (Cho et al., 2014b) and LSTM (Hochreiter and Schmidhuber, 1997) units are widely used to capture long-term dependencies. A decoder then generates the target sequence Y = {y_1, y_2, ..., y_{L_Y}} by updating its state as

s_t = g(y_{t-1}, s_{t-1}, c),

where c = h_{L_X}, s_t is the decoding state at time step t, and g is a non-linear function. In the basic Seq2Seq model, each word is generated from the same context vector c. In order to capture a different context for each generated word, the attention mechanism (Bahdanau et al., 2015) extracts a dynamic context vector c_i at each decoding time step. Formally,

c_i = Σ_{j=1}^{L_X} α_{ij} h_j,   α_{ij} ∝ exp(η(s_{i-1}, h_j)),

where η is a non-linear function.
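To make the attention step concrete, here is a minimal NumPy sketch of the dynamic context vector. The bilinear score h_j·(W s) is a simplified stand-in for the unspecified non-linear function η, and all names and dimensions are illustrative rather than taken from the paper.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W):
    """Compute one attentive context vector c_i (Bahdanau-style, simplified).

    s_prev : previous decoder state, shape (d,)
    H      : encoder hidden states, shape (L_X, d)
    W      : score parameters, shape (d, d); a bilinear score standing in
             for the paper's non-linear eta(s_{i-1}, h_j).
    """
    scores = H @ (W @ s_prev)          # one score per source position j
    alpha = softmax(scores)            # attention weights, sum to 1
    return alpha @ H                   # c_i = sum_j alpha_{ij} * h_j

# toy run
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))            # 5 source positions, dim 4
s = rng.normal(size=4)
W = rng.normal(size=(4, 4))
c = attention_context(s, H, W)
assert c.shape == (4,)
```

In a full decoder this context vector would be fed, together with the previous state and the previous word embedding, into the recurrent update g.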

Deliberation Network
Conventional Seq2Seq models can leverage only the already-generated words, not the un-generated ones, during decoding, so they lack the global information needed to refine and polish the raw generated sequence. The deliberation network (Xia et al., 2017) was proposed to deal with this issue. A deliberation network has two decoders: the first-pass decoder generates a raw word sequence Y^1 = {y^1_1, y^1_2, ..., y^1_{L_{Y^1}}}, and the second-pass decoder polishes it. In the second-pass decoder, an extra attention model selectively reads the output vector sequence Y^1 from the first-pass decoder and then generates the refined output sequence Y^2 = {y^2_1, y^2_2, ..., y^2_{L_{Y^2}}}.

Vocabulary Pyramid Network (VPN)
Figure 2: Differences between our VPN and the typical Seq2Seq model and its variants, where different rectangles denote different vocabularies (details in "Legend"). Seq2Seq uses a single vocabulary (word list) in decoding. Dynamic-vocabulary Seq2Seq integrates a common vocabulary and a dynamic vocabulary in decoding. Topic-aware Seq2Seq incorporates topic words for each input. The deliberation network uses a first-pass and a second-pass decoder over the same vocabulary list. VPN employs a multi-pass encoder and a multi-pass decoder with multi-level vocabularies (raw words, low-level clusters and high-level clusters). Among these models, only VPN makes use of vocabularies beyond words; VPN can therefore capture rich encoding and decoding information with multi-level vocabularies.

Model Overview
As shown in Figure 2, VPN consists of three submodules: multi-level vocabularies (Section 3.2), a multi-pass encoder (Section 3.3) and a multi-pass decoder (Section 3.4). The multi-level vocabularies contain raw words, low-level clusters and high-level clusters (the black, blue and red solid rectangles in Figure 2). The multi-pass encoder starts from the raw words, proceeds to the low-level clusters, and finally reaches the high-level clusters. Conversely, the multi-pass decoder works from the high-level clusters through the low-level clusters down to the raw words. The details of each component follow.

Multi-Level Vocabularies

As illustrated in Figure 3, the multi-level vocabularies comprise three different vocabularies: raw words, low-level clusters and high-level clusters. The raw words are the original words in the training data, denoted as V_r = {w_1, w_2, ..., w_{|V_r|}}. The low-level clusters V_l = {w^l_1, ..., w^l_L} and the high-level clusters V_h = {w^h_1, ..., w^h_H} are obtained by "bottom-up" hierarchical clustering. In order to decide which clusters should be agglomerated, we use the implementation of hierarchical clustering in SciPy3. Specifically, we pre-train raw-word embeddings with the word2vec model4 as inputs, and then leverage the Ward (Ward, 1963) linkage and the maxclust5 criterion to automatically construct the hierarchical clustering.
In this way, we obtain three different vocabularies, V_r, V_l and V_h, whose sizes decrease as |V_r| > |V_l| > |V_h|, forming a "pyramid" with respect to vocabulary size. It should be emphasized that an original input sequence can be expanded into three input sequences through the three vocabulary lists, and likewise for the output sequence.
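The clustering procedure described above can be sketched with SciPy's hierarchical-clustering API. The toy embeddings and small cluster counts below are illustrative stand-ins for the paper's pre-trained word2vec vectors and its 34,000/3,400/340 vocabulary sizes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy "word embeddings" standing in for pre-trained word2vec vectors
# (the paper uses 300-d vectors over a 34,000-word vocabulary; small
# numbers keep the sketch instant).
rng = np.random.default_rng(0)
emb = rng.normal(size=(40, 8))         # 40 raw words, 8-d embeddings

# "Bottom-up" hierarchical clustering with Ward linkage, as in the paper.
Z = linkage(emb, method="ward")

# Cut the tree twice with the maxclust criterion to obtain the two
# coarser vocabularies of the pyramid (ratios mirror 34000/3400/340).
low_ids  = fcluster(Z, t=10, criterion="maxclust")   # low-level clusters
high_ids = fcluster(Z, t=4,  criterion="maxclust")   # high-level clusters

# Each raw word id now maps to a low-level and a high-level cluster id,
# so any word sequence expands into three parallel sequences.
sentence = [3, 17, 25]                 # raw word ids
low_seq  = [low_ids[w] for w in sentence]
high_seq = [high_ids[w] for w in sentence]
assert len(set(low_ids)) <= 10 and len(set(high_ids)) <= 4
```

Because the same linkage tree is cut twice, every high-level cluster is a union of low-level clusters, which gives the nested "pyramid" structure the paper relies on.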

Multi-Pass Encoder
The encoder aims to transform input sequences into distributed representations. In order to capture richer information from the different input sequences, VPN employs a multi-pass encoder comprising three encoders applied in order: the raw-word encoder, the low-level encoder and the high-level encoder. As a result, the multi-pass encoder encodes increasingly abstractive information, from words up to clusters. The details follow.

Raw-Word Encoder
The raw-word encoder accepts an input sequence of word ids from the raw-word vocabulary V_r. A bi-directional LSTM (Schuster and Paliwal, 1997) is leveraged to capture long-term dependencies in both the forward and backward directions. The concatenation of the bi-directional hidden states, h^r_t = [→h^r_t; ←h^r_t], is regarded as the encoded vector of each input word. The input sequence is thus transformed into a hidden state sequence H^r = {h^r_1, h^r_2, ..., h^r_{L_i}}. The initial hidden state is a zero vector, and the hidden state of the last word, h^r_{L_i}, is used to initialize the next encoder (the low-level encoder).

Low-Level Encoder
The low-level encoder is similar to the raw-word encoder. However, it takes a sequence of low-level cluster ids from V_l as input, and its hidden state is initialized with the last hidden state of the raw-word encoder, h^r_{L_i}. Similarly, we obtain the hidden state sequence of the low-level encoder: H^l = {h^l_1, h^l_2, ..., h^l_{L_i}}.

High-Level Encoder
The high-level encoder accepts a sequence of high-level cluster ids from V_h, and its initial hidden state is the final hidden state h^l_{L_i} of the low-level encoder. Finally, the hidden state sequence of the high-level encoder is denoted as H^h = {h^h_1, h^h_2, ..., h^h_{L_i}}.
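The chaining of the three encoder passes, each initialized with the previous pass's final state, can be sketched as follows. A plain tanh RNN cell stands in for the paper's bi-directional LSTMs, and all shapes and parameter names are illustrative.

```python
import numpy as np

def rnn_encode(seq_emb, h0, Wx, Wh):
    """Minimal uni-directional RNN encoder (tanh cell) -- a stand-in for
    the paper's bi-directional LSTMs, kept tiny for readability."""
    H, h = [], h0
    for x in seq_emb:
        h = np.tanh(Wx @ x + Wh @ h)   # state update per input position
        H.append(h)
    return np.stack(H), h              # all states, plus the last state

rng = np.random.default_rng(0)
d = 6
params = [(rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1)
          for _ in range(3)]           # separate parameters per pass

# One input utterance expanded into three parallel sequences
# (raw words, low-level clusters, high-level clusters), as embeddings.
raw_seq, low_seq, high_seq = (rng.normal(size=(5, d)) for _ in range(3))

# Chain the passes: each encoder starts from the previous one's last state.
H_r, h_last = rnn_encode(raw_seq,  np.zeros(d), *params[0])
H_l, h_last = rnn_encode(low_seq,  h_last,      *params[1])
H_h, h_last = rnn_encode(high_seq, h_last,      *params[2])
assert H_r.shape == H_l.shape == H_h.shape == (5, d)
```

The three state sequences H_r, H_l and H_h are exactly what the multi-pass decoder later attends over.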

Multi-Pass Decoder
The decoder is responsible for generating the target sequences. Inspired by the deliberation network (Xia et al., 2017), we present a multi-pass decoder consisting of three decoders applied in order: the high-level decoder, the low-level decoder and the raw-word decoder. The three decoders have their own target sequences drawn from different vocabulary lists, so the multi-pass decoder first generates the abstractive (high- and low-level) clusters and then generates the raw (specific) words. This differs from the deliberation network, where both the first-pass and second-pass decoders generate raw words from the same vocabulary. The details of our multi-pass decoder follow.

High-Level Decoder
The high-level decoder generates a high-level cluster sequence from V_h. Analogous to human-human conversations, where people usually associate an input message with high-level abstractions such as concepts before speaking, the high-level decoder generates the most abstractive cluster sequence before specific words are selected as the response.
The high-level decoder is based on another LSTM, initialized with the last hidden state h^h_{L_i} of the high-level encoder. In order to decide which parts of the source need more attention, an attention mechanism (Bahdanau et al., 2015) is introduced in the high-level decoder. Intuitively, the encoded hidden state sequence H^h of the high-level encoder contains the most relevant encoded information for the high-level decoder, because they share the same vocabulary V_h. Nevertheless, in order to capture more encoded information from the source sequences, the high-level decoder adopts three attention models to attentively read the encoded state sequences H^r, H^l and H^h, respectively. Taking H^r as an example, at each decoding time step j the high-level decoder dynamically computes the context vector c^{hr}_j from H^r = {h^r_1, h^r_2, ..., h^r_{L_i}} and the decoding state s^h_{j-1}:

c^{hr}_j = Σ_{i=1}^{L_i} α_{ji} h^r_i,   α_{ji} ∝ exp(ρ(s^h_{j-1}, h^r_i)),

where ρ is a non-linear function computing the attentive strength. The attentive context vectors c^{hl}_j and c^{hh}_j from the low-level and high-level encoders are computed analogously. Based on c^{hr}_j, c^{hl}_j and c^{hh}_j, the decoding state s^h_j is updated as

s^h_j = f^h(y^h_{j-1}, s^h_{j-1}, c^{hr}_j, c^{hl}_j, c^{hh}_j),

where y^h_{j-1} is the embedding vector of the cluster decoded at time step j-1, and f^h is the decoding LSTM unit. Finally, the target cluster is obtained by a softmax classifier over V_h based on embedding similarity. In this way, the high-level decoder generates the output sequence y^h = {y^h_1, y^h_2, ..., y^h_{L_o}}, which corresponds to the output embedding sequence Y^h.
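One decoding step of the high-level decoder, reading all three encoded sequences, might look like the following sketch. Dot-product attention and a tanh state update are simplified stand-ins for the paper's non-linear score ρ and LSTM unit f^h; all names and sizes are illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(s_prev, H):
    """Dot-product attention -- a simplified stand-in for the paper's
    non-linear score rho(s_{j-1}, h_i)."""
    alpha = softmax(H @ s_prev)
    return alpha @ H

def high_level_step(y_prev, s_prev, H_r, H_l, H_h, W):
    """One decoding step: attend over all three encoded sequences,
    then update the state (tanh cell stands in for the LSTM unit f^h)."""
    c_hr, c_hl, c_hh = attend(s_prev, H_r), attend(s_prev, H_l), attend(s_prev, H_h)
    inp = np.concatenate([y_prev, s_prev, c_hr, c_hl, c_hh])
    return np.tanh(W @ inp)

rng = np.random.default_rng(0)
d = 4
H_r, H_l, H_h = (rng.normal(size=(6, d)) for _ in range(3))
W = rng.normal(size=(d, 5 * d)) * 0.1
s = high_level_step(rng.normal(size=d), rng.normal(size=d), H_r, H_l, H_h, W)
assert s.shape == (d,)
```

In the full model, the new state s would be scored against the V_h cluster embeddings by a softmax classifier to produce the next cluster id.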

Low-Level Decoder
Once the high-level cluster sequence has been generated by the high-level decoder, it can be passed to the low-level decoder for further decoding into the low-level cluster sequence. Based on the three encoded state sequences (H^r, H^l, H^h) and the output embedding sequence Y^h of the high-level decoder, the low-level decoder generates another sequence over the low-level clusters V_l.

The low-level decoder is similar to the high-level decoder, with three differences: 1) its initial hidden state s^l_0 is set to the final decoding state s^h_{L_o} of the high-level decoder; 2) the attentive context vectors c^{lr}_j, c^{ll}_j and c^{lh}_j over the encoded state sequences are computed with parameters separate from those of the high-level decoder; 3) inspired by deliberation networks, the previously generated sequence Y^h of the high-level decoder is fed into the low-level decoder, so that high-level (global) information guides low-level generation. Another attention model, mathematically similar to the one above, captures this information:

o^{lh}_j = Σ_{i=1}^{L_o} β_{ji} y^h_i,

where the attentive weight β_{ji} is computed from the low-level decoding state s^l_{j-1} and the output embedding sequence Y^h of the high-level decoder. Thereafter, o^{lh}_j is concatenated in to update the decoding state:

s^l_j = f^l(y^l_{j-1}, s^l_{j-1}, c^{lr}_j, c^{ll}_j, c^{lh}_j, o^{lh}_j),

where f^l is another LSTM unit. Finally, the output y^l_j is generated by a softmax classifier over V_l based on embedding similarity.

Raw-Word Decoder
After obtaining the high-level and low-level cluster sequences, the next step is to produce the final raw word sequence from V_r with the raw-word decoder. The hidden state of the raw-word decoder, s^r_0, is initialized with the final decoding state s^l_{L_o} of the low-level decoder. The decoding state of the raw-word decoder is updated as

s^r_j = f^r(y^r_{j-1}, s^r_{j-1}, c^{rr}_j, c^{rl}_j, c^{rh}_j, o^{rh}_j, o^{rl}_j),

where c^{rr}_j, c^{rl}_j and c^{rh}_j are the attentive context vectors over the three encoded hidden state sequences, and o^{rh}_j and o^{rl}_j are the attention-weighted sums of the output embedding sequences of the high-level and low-level decoders, computed as in the low-level decoder. Similarly, the target word is predicted by a softmax classifier over V_r based on word embedding similarity. Eventually, the raw-word decoder iteratively generates the target word sequence y^r = {y^r_1, y^r_2, ..., y^r_{L_o}}.

Learning
The multi-level vocabularies from hierarchical clustering are obtained in advance in an unsupervised way, while the multi-pass encoder and decoder are optimized with supervised learning. The encoder and decoder are fully differentiable, so they can be optimized end-to-end by backpropagation. Given a source input and a target output, three input-output pairs are obtained from the different vocabulary lists: {x^n, y^n}, n ∈ {r, l, h}. Each output sequence corresponds to a training loss, and the total loss is

L = L_h + L_l + L_r,

where the three negative log-likelihoods L_h, L_l and L_r are the losses for the target outputs at the three levels; the low-level and raw-word losses additionally depend on Y^h and Y^l, the output embedding sequences of the high-level and low-level decoders, respectively. The sum of the losses of the three decoders is taken as the total loss L.
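The total training objective, a sum of per-level negative log-likelihoods, can be sketched as follows. The tiny vocabulary sizes and random logits are illustrative stand-ins for real model outputs.

```python
import numpy as np

def nll(logits, targets):
    """Negative log-likelihood of target ids under softmax(logits)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # stabilize
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].sum()

rng = np.random.default_rng(0)
T = 5                                   # target length
# Per-step decoder logits over each vocabulary of the pyramid
# (tiny sizes standing in for 340 / 3,400 / 34,000 in the paper).
logits = {"h": rng.normal(size=(T, 8)),
          "l": rng.normal(size=(T, 16)),
          "r": rng.normal(size=(T, 32))}
targets = {k: rng.integers(0, v.shape[1], size=T) for k, v in logits.items()}

# Total loss is the sum of the three per-level losses: L = L_h + L_l + L_r.
L = sum(nll(logits[k], targets[k]) for k in ("h", "l", "r"))
assert L > 0
```

Since every term is differentiable with respect to the shared parameters, the summed loss can be minimized end-to-end by backpropagation, exactly as the section describes.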

Datasets
Large-scale message-response pairs are available on social websites and consist of informational text on diverse topics. Our experimental data comes from two public corpora: English "Twitter"6 and Chinese "Weibo" (Shang et al., 2015b). In order to improve the quality of the datasets, noisy message-response pairs (e.g., those containing too many punctuation marks or emoticons) are filtered out, and the datasets are randomly split into Train/Dev/Test with a ratio of 9:0.5:0.5.

Implementation Details
In order to make our model comparable with typical existing methods, we keep the same experimental parameters for VPN and the comparison methods. We set the vocabulary size of raw words to 34,000, and the word vector dimension to 300. Source inputs are encoded into 600-dimensional vectors by bi-directional LSTMs, and responses are likewise decoded by an LSTM with 600 dimensions. The total loss is minimized by the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.0001. The sizes of the low-level and high-level cluster vocabularies are 3,400 and 340, respectively, significantly smaller than the raw-word vocabulary (34,000); these clusters are also represented by 300-dimensional vectors. Finally, we implemented all models in TensorFlow.
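For reference, the hyperparameters stated above can be collected into a single configuration sketch; the key names below are our own, and only the values come from the text.

```python
# Hyperparameters stated in the paper, collected as a config sketch
# (key names are illustrative; only the values come from the text).
config = {
    "vocab_size_raw":  34000,   # raw words
    "vocab_size_low":   3400,   # low-level clusters
    "vocab_size_high":   340,   # high-level clusters
    "embedding_dim":     300,   # shared by words and clusters
    "hidden_dim":        600,   # bi-LSTM encoder / LSTM decoder
    "optimizer":        "adam",
    "learning_rate":    1e-4,
}

# Each level of the pyramid shrinks the vocabulary by a factor of ten.
assert config["vocab_size_raw"] // 10 == config["vocab_size_low"]
assert config["vocab_size_low"] // 10 == config["vocab_size_high"]
```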

Evaluation Metrics
Evaluating generated responses is a challenging and under-researched problem (Novikova et al., 2017). Following Li et al. (2016b) and Gu et al. (2016), we borrow two well-established automatic evaluation metrics from machine translation and text summarization: BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004)7, which analyze the n-gram co-occurrence between generated responses and references. In addition to automatic evaluation, we also conduct manual evaluation. Following previous studies (He et al., 2017; Qian et al., 2018; Liu et al., 2018), we employ three metrics for manual evaluation: 1) Fluency (Flu.): the grammaticality and fluency of the generated responses, where overly short responses are regarded as lacking fluency; 2) Consistency (Con.): whether the generated responses are consistent with the inputs; 3) Informativeness (Inf.): whether a response provides informative (knowledgeable) content.
(3) S2STA: Seq2Seq with a topic-aware network, implemented following Xing et al. (2017). S2STA can be regarded as using a dynamic vocabulary, because the topic words change with the input.
(4) DelNet: the deliberation network, implemented following Xia et al. (2017). Different from the above methods, the deliberation network goes beyond one-pass decoding.
Comparison Results. We first report the overall performances in Table 1. These results support the following statements: (1) Our VPN achieves the highest performance on both the English Twitter and Chinese Weibo datasets on all metrics, which demonstrates that multi-pass encoding and decoding with multi-level vocabularies deliver better responses than the baselines.
(2) Among the one-pass decoding methods (the first three in Table 1), S2STA performs best: pre-trained topic words for each input make the generation more focused. Nevertheless, it is still worse than VPN.
(3) As for models beyond one-pass decoding (the last two lines in Table 1), VPN is remarkably better than the deliberation network (DelNet), which indicates the effectiveness of multi-pass encoder and decoder with multi-level vocabularies.

The Effectiveness of Multi-Level Vocabularies
Comparison Settings. To validate the effectiveness of the multi-level vocabularies obtained from hierarchical clustering, we design experiments with and without Multi-level Vocabularies (MVs). The comparison settings are shown in the first column of Table 2, where the numbers after "enc"/"dec" give the number of encoder/decoder passes, "SV" denotes a single vocabulary (raw words only), and "MVs" denotes the multi-level vocabularies obtained from hierarchical clustering.

Table 2: Performances with and without multi-level vocabularies, where "SV" represents a single vocabulary (raw words) and "MVs" the multi-level vocabularies obtained from hierarchical clustering. "enc" and "dec" denote encoder and decoder, respectively, and the numbers after them give the number of passes; for example, "enc1-dec3" means one encoding pass along with three decoding passes.

Comparison Results. Table 2 reports the performances with and without multi-level vocabularies. Incorporating multi-level vocabularies improves performance on almost all metrics. For example, "enc3-dec3 (MVs)" yields a relative improvement of up to 25.73% in BLEU over "enc3-dec3 (SV)" on the Weibo dataset. Only on the Twitter dataset is "enc1-dec3 (MVs)" slightly worse than "enc1-dec3 (SV)" in BLEU.

Table 3: Influence of multi-pass encoding and decoding, where "w/o" indicates without and "ED" represents encoder and decoder. For example, "w/o low-level ED" means removing the low-level encoder and low-level decoder.

The Effectiveness of Multi-Pass Encoding and Decoding
Comparison Settings. In order to demonstrate the effectiveness of the multi-pass encoder and multi-pass decoder, we design an ablation study as follows: 1) w/o low-level ED: without the low-level encoder and low-level decoder; 2) w/o high-level ED: without the high-level encoder and high-level decoder; 3) w/o low&high-level ED: without both the low-level and high-level encoders/decoders, which is identical to the Seq2Seq model with attention mechanisms.
Comparison Results. The results of the ablation study are shown in Table 3. We can clearly see that removing any encoder and decoder causes an obvious performance degradation. Specifically, "w/o high-level ED" performs worse than "w/o low-level ED". We conjecture that the high-level encoder and decoder are particularly well trained since they have the smallest vocabulary (the high-level cluster vocabulary has only 340 entries), so removing this well-trained component hurts the most (details in Section 4.8). Furthermore, "w/o low&high-level ED" performs worst. This further indicates that the multi-pass encoder and decoder contribute to generating better responses.

Table 4: Manual evaluations on fluency (Flu.), consistency (Con.) and informativeness (Inf.). Each score is the percentage of cases in which VPN wins over a baseline after removing "tie" pairs. VPN is clearly better than all baselines on the three metrics, and all results are at 99% confidence intervals.

Manual Evaluations (MEs)
Comparison Settings. Similar to the manual evaluations in Zhou et al. (2018), we conduct a pair-wise comparison between the response generated by VPN and the one generated for the same input by each of two typical baselines: S2STA and DelNet. We sample 100 responses from each system, and two annotators then judge each pair (win, tie or lose).
Comparison Results. The results of the manual evaluations are shown in Table 4, where each score is the percentage of cases in which VPN wins over a baseline after removing "tie" pairs. The Cohen's kappa inter-annotator statistics are 61.2, 62.1 and 70.8 for fluency, consistency and informativeness, respectively. Our VPN is significantly better (sign test, p-value < 0.01) than all baselines on the three metrics, which further demonstrates that VPN delivers fluent, consistent and informative responses.

The multi-pass decoder in VPN has three decoders. In order to investigate why the multi-pass decoder works, Table 5 reports the performance of each decoder. The high-level decoder obtains the best performance on all metrics, and the low-level decoder outperforms the raw-word decoder. It is intuitive that the high-level decoder performs best, since it has the smallest vocabulary (340), while the raw-word decoder performs worst, since it is equipped with the largest vocabulary (34,000). Seen through the performances of the individual decoders, the effectiveness of the multi-pass decoder can be explained by curriculum learning (Bengio et al., 2009), a learning strategy whose key idea is to start with the easier aspects of the target task and then gradually increase the complexity. Generating raw words directly is difficult in response generation. To alleviate this, the multi-pass decoder first generates the easier (high-level and low-level) clusters from the small vocabularies, and then generates the raw words from the large vocabulary under the guidance of the well-generated clusters. The multi-pass decoder therefore achieves strong performance.

Related Work
Research on response generation for human-machine conversation has achieved remarkable progress. Currently, the encoder-decoder framework, especially Seq2Seq learning (Cho et al., 2014a), is becoming the backbone of data-driven response generation and has been widely applied to response generation tasks. For example, Shang et al. (2015a) presented neural recurrent encoder-decoder frameworks for short-text response generation with attention mechanisms (Bahdanau et al., 2015). Li et al. (2016b) introduced persona-based neural response generation to obtain responses that are consistent across similar inputs to a speaker. Shao et al. (2017) added self-attention to generate long and diversified responses in Seq2Seq learning.
In this study, we focus on two important problems in response generation: a single fixed vocabulary and one-pass decoding. Our work is inspired by the following research on alleviating the fixed-vocabulary issue. Gu et al. (2016) proposed CopyNet, which can copy words from the source message. External knowledge bases have also been leveraged to extend the vocabulary (Qian et al., 2018; Zhou et al., 2018; Ghazvininejad et al., 2018). Moreover, Xing et al. (2017) incorporated topic words into Seq2Seq frameworks, where topic words are obtained from a pre-trained LDA model (Blei et al., 2003). Wu et al. (2018b) replaced the static vocabulary mechanism with a dynamic vocabulary, jointly learning vocabulary selection and response generation.
We also borrow ideas from studies beyond one-pass decoding. Mou et al. (2016) designed backward and forward sequence generators. Xia et al. (2017) proposed deliberation networks for sequence generation beyond one-pass decoding, where the first-pass decoder generates a raw word sequence and the second-pass decoder then delivers a refined word sequence based on it. Furthermore, Su et al. (2018) presented hierarchical decoding with linguistic patterns for data-to-text tasks.
However, there has been no unified framework that solves the issues of a fixed vocabulary and one-pass decoding together. In contrast, we propose multi-pass encoding and decoding with multi-level vocabularies to deal with both problems simultaneously.

Conclusion and Future Work
In this study, we tackle the issues of a single fixed vocabulary and one-pass decoding in response generation. To this end, we have introduced vocabulary pyramid networks, in which the dialogue input and output are represented by multi-level vocabularies, obtained from hierarchical clustering of raw words, and then processed by multi-pass encoding and decoding. Experiments on English Twitter and Chinese Weibo datasets demonstrate that the proposed method is remarkably better than strong baselines on both automatic and manual evaluations.
In the future, several directions for vocabulary pyramid networks look promising: 1) we will further study how to obtain multi-level vocabularies, for instance by employing other clustering methods or incorporating semantic lexicons such as WordNet; 2) we plan to design deeper encoding and decoding passes for VPN; 3) we will investigate how to apply VPN to other natural language generation tasks such as machine translation and generative text summarization.