A Syntactically Constrained Bidirectional-Asynchronous Approach for Emotional Conversation Generation

Traditional neural language models tend to generate generic replies with poor logic and no emotion. In this paper, a syntactically constrained bidirectional-asynchronous approach for emotional conversation generation (E-SCBA) is proposed to address this issue. In our model, pre-generated emotion keywords and topic keywords are asynchronously introduced into the decoding process. This differs from most existing methods, which generate replies from the first word to the last. Experimental results indicate that our approach not only improves the diversity of replies, but also improves both logic and emotion compared with the baselines.


Introduction
In recent years, as artificial intelligence has developed rapidly, researchers have been pursuing technologies that more closely resemble human intelligence. As a subjective factor, emotion marks a fundamental difference between humans and machines. In other words, machines that could understand emotion would be more responsive to human needs. For example, in education, positive emotions improve students' learning efficiency (Kort et al., 2002). In healthcare, mood prediction can be used in mental health counseling to help anticipate and prevent suicide or depression. To make machines more intelligent, we must resolve the conundrum of emotional interactions.
Conversation, an important channel for communication between humans, has been studied extensively. Much recent work on open-domain conversation has been devoted to generating meaningful replies (Vinyals and Le, 2015; Li et al., 2016). Unfortunately, these methods consider only topic, as in (Xing et al., 2017), and fail to take emotion into account. Unlike the former, the work in  first addressed the emotional factor in large-scale conversation generation, and it showed that emotional replies outperform baselines that do not consider emotion. However, two defects remain in the aforementioned models. First, all the methods above adopt only a single factor (i.e., topic or emotion), so the resulting information bias prevents them from comprehensively capturing human conversation and achieving favorable results. Second, generating replies from the first word to the last can reduce diversity, because replies tend to begin with high-frequency generic words (e.g., I and you), as argued in (Mou et al., 2016).
The deficiencies above inspire us to introduce a new approach called E-SCBA, which considers both emotion and topic. This paper makes three main contributions:
(1) It studies compound information, which constitutes the syntactic constraint in conversation generation.
(2) Different from the work in (Mou et al., 2016), a bidirectional-asynchronous decoder with a multi-stage strategy is proposed to utilize the syntactic constraint. It ensures unobstructed communication between different kinds of information and allows fine-grained control of the reply, addressing the problems of fluency and grammaticality argued in (Ghosh et al., 2017).
(3) Our experiments show that E-SCBA works better on emotion, logic and diversity than the general seq2seq model and other models that consider only a single factor during generation.

Overview
The whole process of emotional conversation generation consists of the following three steps:
Step I: Given a post, we first use two networks combined with category embeddings to predict, respectively, the emotion keyword and the topic keyword that should appear in the final reply (see Section 2.2).
Step II: After the prediction, a newly designed decoder is used to introduce both keywords into the content, as shown in Figure 1. It first produces a sequence of hidden states based on the emotion keyword (Step I), and then uses an emotional attention mechanism to affect the generation of the middle sequence, which is based on the topic keyword (Step II). The remaining two sides are ultimately generated from the combination of the middle part and the keywords (Step III). A detailed description is given in Section 2.3.
Step III: Finally, a direction selector arranges the generated reply in a logically correct order by selecting the better of the forward and backward forms of the reply generated in the previous step (see Section 2.4).
In this work, we assume that each reply contains at least one emotion keyword and one topic keyword, both of which are expected to appear in the dictionaries we use.

Keyword Predictor
The keywords to be selected are pre-stored in prepared dictionaries. The adopted emotion dictionary was proposed by (Xu et al., 2008) and contains 27,466 emotion words divided into 7 categories: Happy, Like, Surprise, Sad, Fear, Angry and Disgust. The adopted topic dictionary was obtained with the LDA model (Blei et al., 2003) and includes 10 categories with 100 words each. To avoid situations in which the emotion and topic keywords are predicted to be the same word, all words that appear in both dictionaries are treated as emotion keywords.
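The overlap rule above can be illustrated with a minimal Python sketch. The function name `resolve_overlap` and the toy dictionary entries are hypothetical and are used only for illustration; they are not taken from the dictionaries used in this work.

```python
def resolve_overlap(emotion_words, topic_words_by_category):
    """Remove from every topic category any word that is also an emotion word.

    Overlapping words are thereby treated as emotion keywords only.
    """
    emotion_set = set(emotion_words)
    return {
        category: [word for word in words if word not in emotion_set]
        for category, words in topic_words_by_category.items()
    }


# Toy dictionaries (hypothetical entries, for illustration only).
emotion_words = ["happy", "love", "afraid"]
topic_words = {
    "movies": ["film", "love", "actor"],
    "sports": ["match", "team"],
}

cleaned_topics = resolve_overlap(emotion_words, topic_words)
print(cleaned_topics)
```

In this toy case the word "love" occurs in both dictionaries, so it is removed from the topic side and remains available only as an emotion keyword.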
The prediction of emotion and topic keywords follows a similar path. We first derive the topic category and the emotion category from the post with two separate classifiers. More specifically, the pretrained LDA model is used for topic category inference, and the work in (Sun et al., 2018) is applied for emotion. The concrete model is an emo- where $w^{k}_{et}$ and $w^{k}_{tp}$ respectively represent the emotion keyword and the topic keyword that are expected to appear in the reply.

Bidirectional-Asynchronous Decoder
Following the decoder architecture shown in Figure 1, we suppose the reply in this section is $y = (y^{ct}, w^{k}_{tp}, y^{md}, w^{k}_{et}, y^{ce})$, where $y^{md}$ is the middle part between the two keywords and $y^{ct}$, $y^{ce}$ represent the remaining sides connected to the topic keyword and the emotion keyword, respectively. The generation of the middle part $y^{md} = (y^{md}_{1}, ..., y^{md}_{K})$ can be described as follows: where $w^{k} = \langle w^{k}_{et}, w^{k}_{tp} \rangle$ represents the set of keywords, and $s^{et}_{i}$ and $s^{tp}_{j}$ respectively represent the decoding states of the steps that introduce the emotion keyword and the topic keyword. $c^{et}_{j}$ is the emotional constraint unit at time $j$, computed by the emotion control function $f^{et}_{att}$ as follows: where $e^{et}_{j,i}$ represents the impact score of the emotion state $s^{et}_{i}$ on the topic state $s^{tp}_{j-1}$. After generating the middle part, we connect it with the keywords to form a new sequence. Two seq2seq models are used to encode the connected sequences and decode $y^{ce} = (y^{ce}_{1}, ..., y^{ce}_{M})$ and $y^{ct} = (y^{ct}_{1}, ..., y^{ct}_{N})$, as below: where $y^{md,f}$ and $y^{md,b}$ are the forward and backward situations of the middle part, respectively.
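To make the role of the emotional constraint unit more concrete, the following Python sketch computes a context vector over the emotion states for one decoding step. It is only an illustration under stated assumptions: the exact form of the emotion control function is not reproduced in this text, so the sketch assumes a standard softmax attention with a dot-product score between the previous topic state and each emotion state. The function names and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def emotional_context(prev_topic_state, emotion_states):
    """Return (weights, context) for one decoding step j.

    Assumption: the impact score of emotion state i on the previous topic
    state is a plain dot product; the paper's scoring function may differ.
    """
    scores = [dot(prev_topic_state, s) for s in emotion_states]
    weights = softmax(scores)
    dim = len(emotion_states[0])
    # The context is the attention-weighted sum of the emotion states.
    context = [
        sum(w * s[d] for w, s in zip(weights, emotion_states))
        for d in range(dim)
    ]
    return weights, context


# Toy states (hypothetical values, for illustration only).
emotion_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_topic_state = [0.5, -0.5]

weights, context = emotional_context(prev_topic_state, emotion_states)
print(weights, context)
```

Under this assumption, emotion states that align more closely with the previous topic state receive larger weights and therefore contribute more to the constraint applied at the current step.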

Direction Selector
To make the samples meet the requirements of the decoder, in training we place the topic keyword on the left as the first keyword and the emotion keyword on the right by default. However, in real situations the topic keyword does not always appear before the emotion keyword, so the model must determine the correct direction. By connecting the results in the preceding section, we get $y^{f} = (y^{ct,b}, w^{k}_{tp}, y^{md,f}, w^{k}_{et}, y^{ce,f})$ as the forward situation, while $y^{b}$ denotes the backward situation. GRU networks, which do not share parameters, are used as encoders to process the sequences in the two situations. The direction is predicted by: where $* \in \{f, b\}$ means forward or backward. After the operation completes, one of the sequences $y^{f}$ and $y^{b}$ should conform to our expectations.

Data
We trained and evaluated E-SCBA on the emotional conversation dataset NLPCC2017, which contains a total of 1,119,201 Chinese post-reply pairs. The dictionaries mentioned in Section 2.2 were used to mark the conversations. The cases whose replies contain both emotion keywords and topic keywords account for 42.6% (476,121) of the total and are suitable data for the bidirectional-asynchronous decoder. We randomly sampled 8,000 pairs for validation and 3,000 for testing, and used the rest for training. We also sampled another 60,000 pairs from the training set to train the LDA model mentioned in Section 2.2. In addition, an error analysis is presented based on a Chinese movie subtitle dataset collected from the Internet.
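The data preparation described above can be sketched in Python as follows. This is an illustrative sketch rather than the preprocessing code used in our experiments: it assumes that replies are already tokenized, that the validation and test sets are drawn from the pairs containing both keyword types, and that a fixed random seed is used. The function names and the toy pairs are hypothetical.

```python
import random


def has_both_keywords(reply_tokens, emotion_words, topic_words):
    """True if the reply contains at least one emotion and one topic keyword."""
    tokens = set(reply_tokens)
    return bool(tokens & emotion_words) and bool(tokens & topic_words)


def filter_and_split(pairs, emotion_words, topic_words, n_valid, n_test, seed=0):
    """Keep usable (post, reply) pairs, shuffle them, and split them.

    Assumption: validation and test pairs are sampled from the usable subset.
    """
    usable = [
        pair for pair in pairs
        if has_both_keywords(pair[1], emotion_words, topic_words)
    ]
    rng = random.Random(seed)
    rng.shuffle(usable)
    valid = usable[:n_valid]
    test = usable[n_valid:n_valid + n_test]
    train = usable[n_valid + n_test:]
    return train, valid, test


# Toy data (hypothetical, English tokens stand in for Chinese words).
emotion_words = {"happy", "sad"}
topic_words = {"movie", "music"}
pairs = [
    (["what", "now"], ["happy", "movie", "night"]),
    (["hi"], ["hello", "there"]),
    (["song"], ["sad", "music"]),
    (["film"], ["movie", "time"]),
    (["ok"], ["happy", "music", "day"]),
]

train, valid, test = filter_and_split(
    pairs, emotion_words, topic_words, n_valid=1, n_test=1
)
print(len(train), len(valid), len(test))
```

Only three of the five toy pairs contain both keyword types, so the remaining two are discarded before the split.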

Metrics
To evaluate our approach, we use the following metrics:
Embedding-based Metrics: We measure the similarity, computed by cosine distance, between a candidate reply and the target reply using sentence-level embeddings, following the work in (Liu et al., 2016; Serban et al., 2017).
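As an illustration of an embedding-based metric, the following Python sketch computes the cosine similarity between averaged word vectors of a candidate reply and a target reply. Averaging word vectors is only one way to obtain a sentence-level embedding and is chosen here as an assumption; the word vectors, tokens and function names are hypothetical.

```python
import math


def sentence_embedding(tokens, word_vectors):
    """Average the vectors of all tokens found in the vocabulary."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        raise ValueError("no token of the sentence has a word vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


# Toy word vectors (hypothetical values, for illustration only).
word_vectors = {"good": [1.0, 0.0], "nice": [1.0, 0.0], "movie": [0.0, 1.0]}

candidate = sentence_embedding(["good", "movie"], word_vectors)
target = sentence_embedding(["nice", "movie"], word_vectors)
score = cosine_similarity(candidate, target)
print(score)
```

Because "good" and "nice" share the same toy vector, the two replies receive a similarity close to 1 even though they have no exact word overlap on that position, which is the behaviour that motivates embedding-based metrics.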

Distinct Metrics: By computing the number of different unigrams (Distinct-1) and bigrams (Distinct-2), we measure information and diversity in the candidate replies, following the work in (Xing et al., 2017).
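The distinct metrics can be sketched in Python as follows. The sketch returns both the number of distinct n-grams and that number divided by the total number of n-grams; reporting the ratio is a common convention and is an assumption here, since the text above only mentions counting. The function name and the toy replies are hypothetical.

```python
def distinct_n(replies, n):
    """Return (number of distinct n-grams, distinct / total) over all replies."""
    ngrams = []
    for tokens in replies:
        # Replies shorter than n tokens contribute no n-grams.
        ngrams.extend(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    if not ngrams:
        return 0, 0.0
    unique = len(set(ngrams))
    return unique, unique / len(ngrams)


# Toy tokenized replies (hypothetical, for illustration only).
replies = [["i", "like", "it"], ["i", "like", "movies"]]

distinct_1 = distinct_n(replies, 1)
distinct_2 = distinct_n(replies, 2)
print(distinct_1, distinct_2)
```

In the toy example, the two replies share the bigram ("i", "like"), so only three of the four bigrams are distinct; replies that repeat generic openings therefore lower the score.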
Human Annotations: We asked four annotators to evaluate the replies generated by our approach and the baselines on Consistency, Logic and Emotion. Consistency measures the fluency and grammaticality of the reply on a three-point scale: 0, 1, 2. Logic measures the degree to which the post and the reply logically match, on the same three-point scale. Emotion judges whether the reply includes the right emotion: a score of 0 means the emotion is wrong or there is no emotion, and a score of 1 means the emotion is right.

Baselines
In the experiments, E-SCBA is compared with the following baselines:
S2S: the general seq2seq model with attention (Bahdanau et al., 2014).
S2S-STW: the model uses a synchronous method that starts generating its reply solely and directly from the topic keyword.
S2S-SEW: the model uses a synchronous method that starts generating its reply solely and directly from the emotion keyword.
S2S-AW: the model uses the same asynchronous method as (Mou et al., 2016).
The synchronous method in S2S-STW and S2S-SEW was mentioned in (Mou et al., 2015) and serves as a contrast to the asynchronous models.

Results and Discussion
The results of automatic evaluation are shown in Table 2. Compared with the best model that considers only a single factor (S2S-AW), E-SCBA makes significant improvements on the distinct metrics (+0.056 and +0.165), which verifies the effectiveness of taking both emotion and topic information into account to improve diversity. Likewise, our approach achieves gains of 0.042, 0.068 and 0.043 on G-M, E-A and V-E, respectively, benefiting from the compound information that captures the thrust of human conversation, so that E-SCBA is better able to learn the goal distribution. Furthermore, the scores of the asynchronous models are higher than those of the synchronous models on both kinds of metrics, showing that the asynchronous method is more suitable for content-introducing conversation generation.
Table 1 depicts the human annotations (t-test: p < 0.05 for C and L, p < 0.01 for E). Overall, E-SCBA outperforms S2S-AW on all three metrics, with the compound information contributing to the overall improvement. However, for Surprise and Angry, the scores of Consistency and Logic are not satisfactory, since far less data is available for them than for the other categories (Surprise 1.2% and Angry 0.7%). In addition, the Emotion score for Surprise differs considerably from the others. We think the reason is that the characteristics of Surprise overlap with those of other categories that have much more data, such as Happy, which interferes with the learning efficiency of the approach on Surprise. Meanwhile, it is also harder for annotators to determine which emotion is the right one.

Case Study and Error Analysis
In this section, we sampled some typical cases from a Chinese movie subtitle dataset for further error analysis. The cases are shown in Table 3. Weibo posts and movie subtitles come from different scenarios and follow different distributions, and the weaker correlation between the training set and the test set makes for a more reliable study. The first three conversations are positive samples, and the others are negative samples whose content is flawed. The problem with the reply in the antepenultimate line is its faint emotion: the emotion keyword in this sentence is a polysemic word, and here it expresses a meaning with no emotion. In different contexts, a polysemic word may have different meanings, emotional or neutral. For example, the word "like" is a generic word when it denotes similar, but it is an emotion word when it denotes enjoy. The same situation also occurs in Chinese. In addition, we notice that if the LDA model picks a meaningless topic keyword from the dictionary, our approach may have difficulty generating a diverse and long reply, as with the reply in the penultimate line. The lack of information leads to generic replies that consist of only a few words generated by the networks. The last line presents another limitation. The emotion keyword hooligan corresponds to the post and the topic keyword looking forward to is meaningful, but their combination, looking forward to a hooligan, does not conform to normal logic. This situation arises because the two kinds of keywords are generated independently before decoding, which may cause a mismatch. In the future, we will explore different network architectures to make the keywords interact with each other during generation.

Conclusion
In this paper, we proposed a novel conversation generation approach (E-SCBA) that introduces both emotion and topic knowledge into generation to improve reply quality more comprehensively. The newly designed decoder makes use of syntactic knowledge to constrain generation and ensures the fluency and grammaticality of replies. Experiments show that our approach can generate replies that are diverse and that feature both emotion and logic.