Read, Attend and Comment: A Deep Architecture for Automatic News Comment Generation

Automatic news comment generation is beneficial for real applications but has not attracted enough attention from the research community. In this paper, we propose a “read-attend-comment” procedure for news comment generation and formalize the procedure with a reading network and a generation network. The reading network comprehends a news article and distills some important points from it, then the generation network creates a comment by attending to the extracted discrete points and the news title. We optimize the model in an end-to-end manner by maximizing a variational lower bound of the true objective using the back-propagation algorithm. Experimental results on two public datasets indicate that our model can significantly outperform existing methods in terms of both automatic evaluation and human judgment.


Introduction
In this work, we study the problem of automatic news comment generation, which is a less explored task in the literature of natural language generation (Gatt and Krahmer, 2018). We are aware that numerous uses of these techniques can pose ethical issues and that best practices will be necessary for guiding applications. In particular, we note that people expect comments on news to be made by people. Thus, there is a risk that people and organizations could use these techniques at scale to feign comments coming from people for purposes of political manipulation or persuasion. In our intended target use, we explicitly disclose the generated comments on news as being formulated automatically by an entertaining and engaging chatbot (Shum et al., 2018). Also, we understand that the behaviors of deployed systems may need to be monitored and guided with methods, including post-processing techniques (van Aken et al.).

Existing work on news comment generation includes preliminary studies, where a comment is generated either from the title of a news article only (Zheng et al., 2018; Qin et al., 2018) or by feeding the entire article (title plus body) to a basic sequence-to-sequence (s2s) model with an attention mechanism (Qin et al., 2018). News titles are short and succinct, so using only news titles may lose a great deal of useful information for comment generation. On the other hand, a news article and a comment are not a pair of parallel texts: the article is much longer than the comment and contains much information that is irrelevant to the comment. Thus, directly applying the s2s model, which has proven effective in machine translation (Bahdanau et al., 2015), is unsuitable for news comment generation and may introduce a lot of noise. Both approaches oversimplify the problem of news comment generation and are far from how people behave on news websites.
In practice, people read a news article, draw attention to some points in the article, and then present their comments along with the points they are interested in. Table 1 illustrates news commenting with an example from Yahoo! News (https://www.yahoo.com/news/fifa-rankings-france-number-one-112047790.html). The article is about the new FIFA ranking, and we pick two of the many comments to explain how people behave in the comment section. First, both commenters have gone through the entire article, as their comments are built upon details in the body. Second, the article gives many details about the new ranking, but both commenters comment on only a few points. Third, the two commenters pay attention to different parts of the article: the first notices that the ranking is based on the results of the new World Cup and feels curious about the position of Brazil, while the second just feels excited about the new position of England. The example indicates a "read-attend-comment" behavior of humans and sheds light on how to construct a model.
We propose a reading network and a generation network that generate a comment from the entire news article. The reading network simulates how people digest a news article, and acts as an encoder of the article. The generation network then simulates how people comment on the article after reading it, and acts as a decoder of the comment. Specifically, from bottom to top, the reading network consists of a representation layer, a fusion layer, and a prediction layer. The first layer represents the title of the news with a recurrent neural network with gated recurrent units (RNN-GRUs) (Cho et al., 2014), and represents the body of the news through self-attention, which can model long-term dependencies among words. The second layer forms a representation of the entire news article by fusing the information of the title into the representation of the body with an attention mechanism and a gate mechanism. The attention mechanism selects useful information in the title, and the gate mechanism further controls how much of this information flows into the representation of the article. Finally, the third layer is built on top of the previous two layers and employs a multi-label classifier and a pointer network (Vinyals et al., 2015) to predict a set of salient spans (e.g., words, phrases, and sentences) from the article. With the reading network, our model comprehends the news article and boils it down to some key points (i.e., the salient spans). The generation network is an RNN language model that generates a comment word by word through an attention mechanism (Bahdanau et al., 2015) over the selected spans and the news title. In training, since salient spans are not explicitly available, we treat them as a latent variable, and jointly learn the two networks from article-comment pairs by optimizing a lower bound of the true objective through a Monte Carlo sampling method.
Thus, training errors in comment prediction can be back-propagated to span selection and used to supervise news reading comprehension.
We conduct experiments on two large-scale datasets. One is a Chinese dataset published recently in (Qin et al., 2018), and the other is an English dataset built by crawling news articles and comments from Yahoo! News. Evaluation results on the two datasets indicate that our model can significantly outperform existing methods on both automatic metrics and human judgment.
Our contributions are threefold: (1) proposal of a "read-attend-comment" procedure for news comment generation with a reading network and a generation network; (2) joint optimization of the two networks with an end-to-end learning approach; and (3) empirical verification of the effectiveness of the proposed model on two datasets.

Related Work
News comment generation is a sub-task of natural language generation (NLG). Among various NLG tasks, the task studied in this paper is most related to summarization (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017) and product review generation (Tang et al., 2016; Dong et al., 2017). However, there are stark differences between news comment generation and the other two tasks: the input of our task is an unstructured document, while the input of product review generation is structured attributes of a product; and the output of our task is a comment that often extends the content of the input with additional information, while the output of summarization is a condensed version of the input that contains the main information from the original. Very recently, some studies on news comment generation have emerged. For example, Zheng et al. (2018) propose a gated attention neural network model to generate news comments from news titles; the model is further improved by a generative adversarial net. Qin et al. (2018) publish a dataset with results of some basic models. Different from all the existing methods, we attempt to comprehend the entire news article before generation and perform end-to-end learning that can jointly optimize the comprehension model and the generation model.
Our model is partially inspired by the recent success of machine reading comprehension (MRC), whose prosperity can be attributed to an increase of publicly available large-scale annotated datasets, such as SQuAD (Rajpurkar et al., 2016, 2018) and MS MARCO (Nguyen et al., 2016). A great number of models have been proposed to tackle the MRC challenges, including BiDAF (Seo et al., 2016), r-net (Wang et al., 2017), DCN (Xiong et al., 2016), Document Reader (Chen et al., 2017), QANet (Yu et al., 2018), and s-net (Tan et al., 2018). Our work can be viewed as an application of MRC to a new NLG task. The task aims to generate a comment for a news article, which is different from existing MRC tasks whose goal is to answer a question. Our learning method is also different from those in the MRC works.

Problem Formalization
Suppose that we have a dataset D = {(T_i, B_i, C_i)}_{i=1}^{N}, where the i-th triple consists of a news title T_i, a news body B_i, and a comment C_i. Our goal is to estimate a probability distribution P(C|T, B) from D so that, given a new article (T, B) with T the news title and B the news body, we can generate a comment C following P(C|T, B). Figure 1 illustrates the architecture of our model. In a nutshell, the model consists of a reading network and a generation network. The reading network first represents a news title and a news body separately in a representation layer, then forms a representation of the entire article by fusing the title into the body through a fusion layer, and finally distills some salient spans from the article by a prediction layer. The salient spans and the news title are then fed to the generation network to synthesize a comment. With the two networks, we can factorize the generation probability P(C|T, B) as Σ_S P(S|T, B) · P(C|S, T), where S = (s_1, ..., s_w) refers to a set of spans in B, P(S|T, B) represents the reading network, and P(C|S, T) refers to the generation network.
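The role of the latent span sets in this factorization can be illustrated with a toy numerical sketch (all probabilities below are made-up illustrative values, not model outputs): the likelihood of a comment marginalizes over candidate span sets extracted from the body.

```python
# Toy illustration of P(C|T,B) = sum_S P(S|T,B) * P(C|S,T): the comment
# likelihood marginalizes over latent span sets S. Numbers are invented
# purely for illustration; the real model computes both factors neurally.
span_sets = ["spanA", "spanB", "spanC"]                  # candidate span sets S
p_spans = {"spanA": 0.5, "spanB": 0.3, "spanC": 0.2}     # P(S|T,B)
p_comment_given = {"spanA": 0.10, "spanB": 0.40, "spanC": 0.05}  # P(C|S,T)

p_comment = sum(p_spans[s] * p_comment_given[s] for s in span_sets)
print(round(p_comment, 3))  # 0.5*0.10 + 0.3*0.40 + 0.2*0.05 = 0.18
```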

Reading Network
In the representation layer, let T = (t_1, ..., t_n) be a news title with t_j the j-th word, and B = (b_1, ..., b_m) be the associated news body with b_k the k-th word. We first look up an embedding table and represent t_j and b_k as e_{T,j} ∈ R^{d_1} and e_{B,k} ∈ R^{d_1} respectively, where e_{T,j} and e_{B,k} are randomly initialized and jointly learned with other parameters. Different from the title, the body is long and consists of multiple sentences. Hence, to emphasize positional information of words in the body, we further expand e_{B,k} with o_{B,k} and s_{B,k}, where o_{B,k}, s_{B,k} ∈ R^{d_2} are positional embeddings, with the former indexing the position of b_k in its sentence and the latter indicating the position of that sentence in the entire body. The representation of b_k is then given by ê_{B,k} = MLP([e_{B,k}; o_{B,k}; s_{B,k}]), where MLP(·) refers to a multi-layer perceptron with two layers, and [·; ·; ·] means the concatenation of the three arguments.
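This representation step can be sketched as follows; the dimensions, random initialization, and tanh non-linearity are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

# Sketch of the body-token representation ê_{B,k} = MLP([e_{B,k}; o_{B,k}; s_{B,k}]):
# the word embedding is concatenated with two positional embeddings
# (word-in-sentence and sentence-in-body) and passed through a two-layer MLP.
d1, d2 = 8, 4
rng = np.random.default_rng(0)

def mlp2(x, w1, w2):
    # two-layer MLP with tanh activations, a stand-in for the paper's MLP(.)
    return np.tanh(np.tanh(x @ w1) @ w2)

e_bk = rng.normal(size=d1)      # word embedding of b_k
o_bk = rng.normal(size=d2)      # position of b_k within its sentence
s_bk = rng.normal(size=d2)      # position of the sentence within the body
w1 = rng.normal(size=(d1 + 2 * d2, 16))
w2 = rng.normal(size=(16, d1))

e_hat_bk = mlp2(np.concatenate([e_bk, o_bk, s_bk]), w1, w2)
print(e_hat_bk.shape)  # (8,) -- a d1-dimensional representation
```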
Starting from E_T = (e_{T,1}, ..., e_{T,n}) and Ê_B = (ê_{B,1}, ..., ê_{B,m}) as initial representations of T and B respectively, the reading network then transforms T into a sequence of hidden vectors H_T = (h_{T,1}, ..., h_{T,n}) with a recurrent neural network with gated recurrent units (RNN-GRUs) (Cho et al., 2014). In the meanwhile, B is transformed to H_B = (h_{B,1}, ..., h_{B,m}), with the k-th entry h_{B,k} ∈ R^{d_1} defined as MLP([ê_{B,k}; c_{B,k}]) (two layers), where c_{B,k} is an attention-pooling vector calculated by scaled dot-product attention (Vaswani et al., 2017):

c_{B,k} = Σ_{j=1}^{m} α_{k,j} ê_{B,j},   α_{k,j} = softmax_j(ê_{B,k}^⊤ ê_{B,j} / √d_1).   (1)

In Equation (1), a word is represented via all words in the body, weighted by their similarity. By this means, we try to capture long-distance dependencies among words.

The fusion layer takes H_T and H_B as inputs, and produces V = (v_1, ..., v_m) as a new representation of the entire news article by fusing the information of the title into the representation of the body. Specifically, each h_{B,k} attends to H_T to form a representation c_{T,k} = att(H_T, h_{B,k}), with att(·, ·) the attention-pooling operation defined in Equation (5) below. With this step, we aim to recognize useful information in the title. Then we combine c_{T,k} and h_{B,k} as v_k ∈ R^{d_1} with a gate g_k, which balances the impact of c_{T,k} and h_{B,k} and filters noise from the title:

g_k = σ(W_g [h_{B,k}; c_{T,k}]),   v_k = g_k ⊙ c_{T,k} + (1 − g_k) ⊙ h_{B,k}.   (2)

The top layer of the reading network extracts a bunch of salient spans based on V. Let S = ((a_1, e_1), ..., (a_w, e_w)) denote the salient spans, where a_i and e_i refer to the start position and the end position of the i-th span respectively. We propose detecting the spans with a multi-label classifier and a pointer network. Specifically, we formalize the recognition of (a_1, ..., a_w) as a multi-label classification problem with V as an input and L = (l_1, ..., l_m) as an output, where ∀k ∈ {1, ..., m}, l_k = 1 means that the k-th word is a start position of a span, and otherwise l_k = 0. Here we assume that (a_1, ..., a_w) are independent of each other, so the multi-label classifier can be defined as m binary classifiers, with the k-th classifier given by

ŷ_k = softmax(MLP_k(v_k)),   (3)

where MLP_k(·) refers to a two-layer MLP, and ŷ_k ∈ R^2 is a probability distribution with the first entry as P(l_k = 0) and the second entry as P(l_k = 1). The advantage of this approach is that it allows us to efficiently and flexibly detect a variable number of spans from a variable-length news article: as there is no dependency among the m classifiers, they can be computed in parallel. Given a_k, the end position e_k is recognized via a probability distribution (α_{a_k,1}, ..., α_{a_k,m}) defined by a pointer network:

h_{a_k} = GRU(h_0, v_{a_k}),   α_{a_k,j} = softmax_j(u^⊤ tanh(W_1 v_j + W_2 h_{a_k})),   (4)

where h_0 = att(V, r) is an attention-pooling vector based on a learned parameter r:

att(H, q) = Σ_j β_j h_j,   β_j = softmax_j(u_a^⊤ tanh(W_3 h_j + W_4 q)).   (5)

Denoting the set of start positions (a_1, ..., a_w) as "start" and P(l_i = 1) as p_i, P(S|T, B) can be formulated as

P(S|T, B) = Π_{i ∈ start} p_i α_{i,e_i} · Π_{i ∉ start} (1 − p_i).   (6)

In practice, we recognize the i-th word as a start position if p_i > 1 − p_i, and determine the associated end position by arg max_{1≤k≤m} α_{i,k}.
Note that we do not adopt a pointer network to detect the start positions: in that case, either we would have to set a threshold on the probability distribution, which is sensitive to the length of the news article and thus hard to tune, or we could only pick a fixed number of spans for every article by ranking the probabilities. Neither option is desirable.
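The decision rule described above can be sketched with toy probabilities (all numbers below are invented for illustration): a word starts a span if p_k > 1 − p_k, i.e., p_k > 0.5, and its end position is the argmax of the pointer-network distribution.

```python
import numpy as np

# Sketch of the span-extraction decision rule: the k-th word starts a
# span if p_k > 0.5, and the end of that span is argmax_j alpha[k, j].
# p and alpha here are hand-set toy values, not real model outputs.
m = 6
p = np.array([0.9, 0.2, 0.1, 0.7, 0.3, 0.1])   # P(l_k = 1) per position
alpha = np.full((m, m), 1.0 / m)               # end-position distributions
alpha[0, 2] = 0.9                              # span starting at 0 ends at 2
alpha[3, 4] = 0.8                              # span starting at 3 ends at 4

spans = [(k, int(np.argmax(alpha[k]))) for k in range(m) if p[k] > 0.5]
print(spans)  # [(0, 2), (3, 4)]
```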

Generation Network
With S = ((a_1, e_1), ..., (a_w, e_w)) the salient spans, V = (v_1, ..., v_m) the representation of the news article, and H_T the representation of the news title given by the three layers of the reading network respectively, we define a representation of S, denoted as H_S, by collecting the vectors in V that fall into the spans, i.e., H_S = (v_{a_1}, ..., v_{e_1}, ..., v_{a_w}, ..., v_{e_w}). The generation network takes H_T and H_S as inputs and decodes a comment word by word by attending to both H_T and H_S. At step t, the hidden state is computed as h_t = GRU(h_{t−1}, [e_{c_{t−1}}; C_{T,t−1}; C_{S,t−1}]) (7), where e_{c_{t−1}} is the embedding of the word generated at step t−1, and C_{T,t−1} = att(H_T, h_{t−1}) and C_{S,t−1} = att(H_S, h_{t−1}) are context vectors that represent attention on the title and the spans respectively. att(·, ·) is defined as Equation (5).
With h_t, we calculate C_{T,t} and C_{S,t} via att(H_T, h_t) and att(H_S, h_t) respectively, and obtain a probability distribution over the vocabulary, P_t = softmax(MLP([h_t; C_{T,t}; C_{S,t}])) (8), where P_t(c_t) refers to the c_t-th entry of P_t. In decoding, we define the initial state h_0 as an attention-pooling vector over the concatenation of H_T and H_S, given by att([H_T; H_S], q), where q is a parameter learned from training data.
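The attention-pooling operator att(·, ·) used throughout both networks can be sketched as follows; the dot-product scoring below is a simplifying assumption (the model may use a different scoring function), and all dimensions are illustrative.

```python
import numpy as np

# Minimal sketch of attention pooling att(H, q): score each hidden vector
# against the query, softmax the scores, and return the weighted sum.
def att(H, q):
    scores = H @ q                          # dot-product scores (assumption)
    weights = np.exp(scores - scores.max()) # numerically stable softmax
    weights /= weights.sum()
    return weights @ H                      # attention-pooling vector

rng = np.random.default_rng(1)
H_T = rng.normal(size=(5, 8))   # title representation (5 title words)
H_S = rng.normal(size=(7, 8))   # span representation (7 span words)
q = rng.normal(size=8)          # learned query parameter

# initial decoder state: attention pooling over the concatenation [H_T; H_S]
h0 = att(np.concatenate([H_T, H_S], axis=0), q)
print(h0.shape)  # (8,)
```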

Learning Method
We aim to learn P(S|T, B) and P(C|S, T) from D, but S is not explicitly available, which is a common case in practice. To address the problem, we treat S as a latent variable and consider the following objective:

J = Σ_{i=1}^{N} log Σ_{S_i ∈ S} P(S_i|T_i, B_i) P(C_i|S_i, T_i),   (9)

where S refers to the space of sets of spans, and S_i is a set of salient spans for (T_i, B_i). Objective J is difficult to optimize, as the logarithm is outside the summation. Hence, we turn to maximizing a lower bound of Objective J, obtained from Jensen's inequality:

L = Σ_{i=1}^{N} E_{S_i ∼ P(S_i|T_i, B_i)} [log P(C_i|S_i, T_i)] ≤ J.   (10)

Let Θ denote all parameters of our model and ∂L_i/∂Θ denote the gradient of L on an example (T_i, B_i, C_i):

∂L_i/∂Θ = E_{S_i} [∂ log P(C_i|S_i, T_i)/∂Θ + log P(C_i|S_i, T_i) · ∂ log P(S_i|T_i, B_i)/∂Θ].

To calculate this gradient exactly, we would have to enumerate all possible S_i's for (T_i, B_i), which is intractable. Thus, we employ a Monte Carlo sampling method to approximate ∂L_i/∂Θ. Suppose that there are J samples; then the approximation of ∂L_i/∂Θ is given by

∂L_i/∂Θ ≈ (1/J) Σ_{n=1}^{J} [∂ log P(C_i|S_{i,n}, T_i)/∂Θ + log P(C_i|S_{i,n}, T_i) · ∂ log P(S_{i,n}|T_i, B_i)/∂Θ],

where ∀n, S_{i,n} is sampled by first drawing a group of start positions according to Equation (3), and then picking the corresponding end positions by Equations (4)-(5). Although the Monte Carlo estimator is unbiased, it typically suffers from high variance. To reduce variance, we subtract baselines from log P(C_i|S_{i,n}, T_i). Inspired by (Mnih and Gregor, 2014), we introduce an observation-dependent baseline B_ψ(T_i, C_i) to capture the systematic difference in news-comment pairs during training. Besides, we also exploit a global baseline B̄ to further control the variance of the estimator. The approximation of ∂L_i/∂Θ is then re-written as

∂L_i/∂Θ ≈ (1/J) Σ_{n=1}^{J} [∂ log P(C_i|S_{i,n}, T_i)/∂Θ + (log P(C_i|S_{i,n}, T_i) − B_ψ(T_i, C_i) − B̄) · ∂ log P(S_{i,n}|T_i, B_i)/∂Θ].   (11)

To calculate B_ψ(T_i, C_i), we first encode the word sequences of T_i and C_i with GRUs respectively, and then feed the last hidden states of the GRUs to a three-layer MLP. B̄ is calculated as an average of log P(C_i|S_{i,n}, T_i) − B_ψ(T_i, C_i) over the current mini-batch. The parameters of the GRUs and the MLP are estimated by minimizing

Σ_{n=1}^{J} (log P(C_i|S_{i,n}, T_i) − B_ψ(T_i, C_i) − B̄)².

The learning algorithm is summarized in Algorithm 1. To speed up convergence, we initialize our model by pre-training the reading network and the generation network.
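The structure of the baseline-corrected Monte Carlo estimator can be sketched numerically (all gradients and log-probabilities below are made-up scalars standing in for the real quantities):

```python
import numpy as np

# Sketch of the REINFORCE-style estimator with baseline subtraction.
# grad_log_pc[n] stands in for d log P(C_i|S_{i,n}, T_i)/dTheta,
# grad_log_ps[n] for d log P(S_{i,n}|T_i, B_i)/dTheta, and log_pc[n]
# for log P(C_i|S_{i,n}, T_i); b_obs and b_global are the two baselines.
def estimate_grad(log_pc, grad_log_pc, grad_log_ps, b_obs, b_global):
    J = len(log_pc)
    grads = [
        grad_log_pc[n] + (log_pc[n] - b_obs - b_global) * grad_log_ps[n]
        for n in range(J)
    ]
    return sum(grads) / J

log_pc = [-2.0, -1.5]                       # sampled comment log-likelihoods
grad_log_pc = [np.array([0.1]), np.array([0.3])]
grad_log_ps = [np.array([1.0]), np.array([-1.0])]
g = estimate_grad(log_pc, grad_log_pc, grad_log_ps, b_obs=-1.7, b_global=0.0)
print(g)  # average of 0.1 + (-0.3)*1.0 and 0.3 + 0.2*(-1.0) -> [-0.05]
```

Subtracting the baselines leaves the estimator unbiased (the baseline term has zero expectation under the sampling distribution) while shrinking the magnitude of the reward signal that multiplies the score function, which is what reduces variance.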
Specifically, ∀(T_i, B_i, C_i) ∈ D, we construct an artificial span set S̄_i and learn the parameters of the two networks by maximizing the following objective:

Σ_{i=1}^{N} [log P(S̄_i|T_i, B_i) + log P(C_i|S̄_i, T_i)].   (12)

S̄_i is established in two steps. First, we collect all associated comments for (T_i, B_i), extract n-grams (1 ≤ n ≤ 6) from the comments, and recognize an n-gram in B_i as a salient span if it exactly matches one of the n-grams of the comments. Second, we break B_i into sentences and calculate a matching score for each sentence and each associated comment. Each sentence thus corresponds to a group of matching scores, and if any one of them exceeds 0.4, we recognize the sentence as a salient span. The matching model is pre-trained with C_i as a positive example and a randomly sampled comment from other news as a negative example. In the model, T_i and C_i are first processed by GRUs separately, and then the last hidden states of the GRUs are fed to a three-layer MLP to calculate a score.
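The first step of constructing S̄_i (exact n-gram overlap between body and comments) can be sketched as follows; the tokenized texts are toy examples, while the real pipeline runs on segmented news data.

```python
# Sketch of the n-gram matching step for building the artificial span set:
# collect all n-grams (1 <= n <= 6) from the comments and mark any body
# n-gram that exactly matches one of them as a salient (start, end) span.
def ngrams(tokens, max_n=6):
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

def matched_spans(body, comments, max_n=6):
    comment_grams = set()
    for c in comments:
        comment_grams |= ngrams(c, max_n)
    spans = []
    for n in range(max_n, 0, -1):          # prefer longer matches first
        for i in range(len(body) - n + 1):
            if tuple(body[i:i + n]) in comment_grams:
                spans.append((i, i + n - 1))
    return spans

body = "the new fifa ranking puts england fourth".split()
comments = ["england fourth is amazing".split()]
print(matched_spans(body, comments)[0])  # (5, 6) -> "england fourth"
```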

Algorithm 1: Optimization Algorithm
Input: training data D, initial learning rate lr, MaxStep, sample number n. Init: Θ.
1: Construct {S̄_i}_{i=1}^{N} and pre-train the model by maximizing Objective (12).
2: while step < MaxStep do
3:     Randomly sample a mini-batch from D.
4:     For each (T_i, B_i, C_i) in the mini-batch, sample n span sets S_{i,1}, ..., S_{i,n} from P(S|T_i, B_i).
5:     Compute the terms related to S_{i,n} in Eq. (11).
6:     Update the parameters of the model and B_ψ(C_i, T_i) with SGD.
7: end while
Output: Θ

Experiments
We test our model on two large-scale news commenting datasets.

Experimental Setup
The first dataset is a Chinese dataset built from Tencent News (news.qq.com) and published recently in (Qin et al., 2018). Each data point contains a news article made up of a title and a body, a group of comments, and some side information including upvotes and categories. Each test comment is labeled by two annotators according to the 5-scale labeling criteria presented in Table 3. All text in the data is tokenized by the Chinese word segmenter Jieba (https://github.com/fxsjy/jieba). The average lengths of news titles, news bodies, and comments are 15 words, 554 words, and 17 words respectively. In addition to the Chinese data, we build another dataset by crawling news articles and the associated comments from Yahoo! News. Besides upvotes and categories, side information in the Yahoo data also includes paragraph marks, WIKI-entities, downvotes, abusevotes, and sentiment tagged by Yahoo!. Text in the data is tokenized by the Stanford CoreNLP pipeline (Manning et al., 2014). As pre-processing, we filter out news articles shorter than 30 words in the body and comments shorter than 10 words or longer than 100 words. Then, we remove news articles with fewer than 5 comments. If the number of comments of an article exceeds 30, we only keep the top 30 comments with the most upvotes. On average, news titles, news bodies, and comments contain 12 words, 578 words, and 32 words respectively. More information about the Yahoo data can be found in Appendix A. After the pre-processing, we randomly sample a training set, a validation set, and a test set from the remaining data, and make sure that there is no overlap among the three sets. Table 2 summarizes the statistics of the two datasets. Note that we only utilize news titles, news bodies, and comments to learn a generation model in this work, but both datasets allow modeling news comment generation with side information, which could be our future work.
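The Yahoo pre-processing steps above can be sketched as follows; the article/comment data structures and field names are assumptions for illustration.

```python
# Sketch of the pre-processing: drop articles with bodies shorter than 30
# words, drop comments shorter than 10 or longer than 100 words, require
# at least 5 surviving comments, and keep at most the top 30 comments by
# upvotes. The dict schema ("body", "comments", "text", "upvotes") is assumed.
def preprocess(articles):
    kept = []
    for art in articles:
        if len(art["body"].split()) < 30:
            continue
        comments = [
            c for c in art["comments"]
            if 10 <= len(c["text"].split()) <= 100
        ]
        if len(comments) < 5:
            continue
        comments.sort(key=lambda c: c["upvotes"], reverse=True)
        kept.append(dict(art, comments=comments[:30]))
    return kept

toy = [{"body": "w " * 40, "comments": [
    {"text": "word " * 12, "upvotes": i} for i in range(6)]}]
print(len(preprocess(toy)[0]["comments"]))  # 6
```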
Following (Qin et al., 2018), we evaluate the performance of different models with both automatic metrics and human judgment. In terms of automatic evaluation, we employ BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004), and CIDEr (Vedantam et al., 2015) as metrics on both datasets. Besides these metrics, Qin et al. (2018) propose human-score-weighted metrics including W-BLEU, W-METEOR, W-ROUGE, and W-CIDEr. These metrics, however, require human judgment on each comment in the test set; thus, we only report them on the Tencent data. As Qin et al. (2018) do not publish their code for metric calculation, we employ a popular NLG evaluation project available at https://github.com/Maluuba/nlg-eval, and modify the scripts with the scores provided in the data according to the formulas in (Qin et al., 2018) to calculate all the metrics. In human evaluation, for each dataset, we randomly sample 500 articles from the test data and recruit three native speakers to judge the quality of the comments given by different models. For every article, comments from all models are pooled, randomly shuffled, and presented to the annotators. Each comment is judged by the three annotators under the criteria in Table 3.

Baselines
The following models are selected as baselines. Basic models: the retrieval models and the generation models used in (Qin et al., 2018), including (1) IR-T and IR-TC: both models retrieve a set of candidate articles with associated comments by the cosine similarity of TF-IDF vectors; the comments are then ranked by a convolutional neural network (CNN), and the top-ranked comment is returned. The difference is that IR-T only utilizes titles, while IR-TC leverages both titles and news bodies. (2) Seq2seq: the basic sequence-to-sequence model (Sutskever et al., 2014) that generates a comment from a title. (3) Att and Att-TC: sequence-to-sequence with attention (Bahdanau et al., 2015), in which the input is either a title (Att) or a concatenation of a title and a body (Att-TC). In Seq2seq, Att, and Att-TC, the top comment from beam search (beam size = 5) is returned.
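The first stage of the IR-T/IR-TC baselines, retrieval by cosine similarity of TF-IDF vectors, can be sketched as follows (the CNN re-ranking stage is omitted, and the toy corpus is an assumption):

```python
import math
from collections import Counter

# Sketch of TF-IDF cosine retrieval: build TF-IDF vectors over a corpus
# and return the candidate article most similar to the query article.
def tfidf_vectors(docs):
    df = Counter(w for d in docs for w in set(d))
    N = len(docs)
    idf = {w: math.log(N / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

corpus = [
    "fifa ranking france number one".split(),
    "stock market rises again today".split(),
]
vecs = tfidf_vectors(corpus + ["new fifa ranking is out".split()])
query, candidates = vecs[-1], vecs[:-1]
best = max(range(len(candidates)), key=lambda i: cosine(query, candidates[i]))
print(best)  # 0 -> the FIFA article is retrieved for the FIFA query
```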
GANN: the gated attention neural network proposed in (Zheng et al., 2018), further improved by a generative adversarial net. We denote our model as "DeepCom", standing for "deep commenter", as it features a deep reading-commenting architecture. All baselines are implemented according to the details in the related papers and tuned on the validation sets.

Implementation Details
For each dataset, we form a vocabulary with the top 30k frequent words in the entire data. We pad or truncate news titles, news bodies, and comments to lengths of 30, 600, and 50 respectively. The dimension of word embeddings and the size of hidden states of GRUs in all models are set to 256. In our model, we set d_1 to 256 and d_2 (i.e., the dimension of the position embeddings in the reading network) to 128. The size of hidden layers in all MLPs is 512. The number of samples in Monte Carlo sampling is 1. In pre-training, we initialize our model with a Gaussian distribution N(0, 0.01) and optimize Objective (12) using AdaGrad (Duchi et al., 2011) with an initial learning rate of 0.15 and an initial accumulator value of 0.1. Then, we optimize L using stochastic gradient descent with a learning rate of 0.01. In decoding, the top comment from beam search with a beam size of 5 is selected for evaluation. In IR-T and IR-TC, we use three types of filters with window sizes 1, 3, and 5 in the CNN-based matching model. The number of each type of filter is 128.

Evaluation Results

Table 4 reports evaluation results in terms of both automatic metrics and human annotations. On most automatic metrics, DeepCom outperforms the baseline methods, and the improvement is statistically significant (t-test with p-value < 0.01). The improvement on BLEU-1 and W-BLEU-1 is much larger than that on other metrics, because BLEU-1 only measures the proportion of matched unigrams out of the total number of unigrams in the generated comments. In human evaluation, although the absolute numbers differ from those reported in (Qin et al., 2018) owing to differences between human judgments, the overall trend is consistent. The values of Fleiss' kappa over all models are more than 0.6, indicating substantial agreement among the annotators.
Despite its sophisticated structure, GANN does not bring much improvement over the other baselines, which demonstrates that using only news titles is not enough for comment generation. IR-TC and Att-TC are the best retrieval model and the best generation model among the baselines on both datasets, implying that news bodies, even used in a simple way, can provide useful information for comment generation.

Discussions
Ablation study: We compare the full model of DeepCom with the following variants: (1) No Reading: the entire reading network is replaced by a TF-IDF based keyword extractor, and top 40 keywords (tuned on validation sets) are fed to the generation network; (2) No Prediction: the prediction layer of the reading network is removed, and thus the entire V is used in the generation network; and (3) No Sampling: we directly use the model pre-trained by maximizing Objective (12). Table 5 reports the results on automatic metrics.
We can see that all variants suffer from a performance drop, and No Reading is the worst among the three. Thus, we can conclude that (1) span prediction cannot simply be replaced by TF-IDF based keyword extraction, as the former is based on a deep comprehension of news articles and is calibrated in the end-to-end learning process; (2) even with sophisticated representations, one cannot directly feed the entire article to the generation network, as comment generation is vulnerable to the noise in the article; and (3) pre-training is useful, but optimizing the lower bound of the true objective is still beneficial.
To further understand why DeepCom is superior to its variants, we calculate BLEU-1 (denoted as BLEU_span) between the predicted spans and the ground-truth comments in the test sets of the two datasets, and compare it with a baseline BLEU-1 (denoted as BLEU_base) calculated between the entire news articles and the ground-truth comments. On Tencent data, BLEU_span and BLEU_base are 0.31 and 0.17 respectively, and the two numbers on Yahoo data are 0.29 and 0.16. This means that by extracting salient spans from news articles, we can filter out redundant information while keeping the points that people like to comment on, which explains why DeepCom is better than No Prediction. When comparing DeepCom with No Sampling, we find that spans in DeepCom are longer than those in No Sampling: in the test set of Tencent data, the average lengths of salient spans with and without sampling are 11.6 and 2.6 respectively, and the two numbers in Yahoo data are 14.7 and 2.3. Thus, DeepCom can leverage discourse-level information rather than a few single words or bi-grams in comment generation, which demonstrates the advantage of our learning method.

Table 6: Human score distributions
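The BLEU_span diagnostic above reduces to a clipped unigram precision, which can be sketched on toy texts (the brevity penalty is ignored, and the example tokens are assumptions):

```python
from collections import Counter

# Sketch of BLEU-1 as clipped unigram precision: the fraction of candidate
# unigrams that also appear in the reference, with counts clipped by the
# reference counts. Texts below are toy examples, not dataset content.
def unigram_precision(candidate, reference):
    cand, ref = Counter(candidate), Counter(reference)
    matched = sum(min(n, ref[w]) for w, n in cand.items())
    return matched / max(len(candidate), 1)

pred_spans = "england fourth in new ranking".split()   # predicted spans
comment = "england fourth is amazing news".split()     # ground-truth comment
print(round(unigram_precision(pred_spans, comment), 2))  # 0.4
```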

Analysis of human annotations:
We check the distributions of human labels for DeepCom, Att-TC, and IR-TC to gain insights into the problems these models suffer from. Table 6 shows the results. Most of the bad comments from IR-TC are labeled "2", meaning that although IR-TC can give attractive comments with rich and deep content, its comments easily diverge from the news articles and are thus judged "irrelevant".
In terms of Att-TC, many more comments are judged "1" than for the other two models, indicating that Att-TC often generates ill-formed sentences. This is because a news article and a comment are highly asymmetric in terms of both syntax and semantics, and thus the generation process cannot simply be modeled with an encoder-decoder structure. Bad cases from DeepCom concentrate on "3", reminding us that we need to further enrich the content of comments and improve their relevance in the future.
Case Study: Finally, to further understand our model, we visualize the predicted salient spans and the generated comment for a test example from the Tencent dataset in Table 7. Due to space limitations, we truncate the body and only show three of the selected spans in the truncated body. The full article with the full set of spans, together with another test example from the Yahoo! News dataset, is shown in Appendix B. Even so, we can see that the model finds some interesting points after "reading" the article and synthesizes a comment along one of the spans (i.e., "Chinese Paladin 3"). More interestingly, the model extends the content of the article in the comment with "Luo Jin", who is Tiffany Tang's partner but is not mentioned in the article. On the other hand, comments given by the baseline methods are either too general (Att-TC, the best generation baseline) or totally irrelevant to the article (IR-TC, the best retrieval baseline). The example demonstrates that our model can generate relevant and informative comments through analyzing and understanding news articles.

Title: 唐嫣为什么不演清宫剧？(Why does Tiffany Tang never play a role in a drama set in the Qing dynasty?)
Body (truncated): (... The numerous television series and movies Tiffany Tang has acted in have made her very popular, and her varied styling at many events has given her the image of "the queen of reinvention" in the hearts of the public. ... If her most beautiful role is Zi Xuan in "Chinese Paladin 3", then the ugliest one should be this! ...)
DeepCom: 唐嫣罗晋的演技真的很好，特别喜欢她演《仙剑奇侠传》。我觉得这部剧挺好看。(Tiffany Tang and Luo Jin are really good actors. I especially like her role in "Chinese Paladin 3". I think the TV drama is worth watching.)
Att-TC: 我也是醉了 (I have nothing to comment.)
IR-TC: 星爷和谁开撕过嘛, 都是别人去撕星爷！(Stephen Chow never fights with others. It is others that fight with Stephen Chow!)
Table 7: A case from the Tencent News dataset. The contents in the red boxes represent salient spans predicted by the reading network. The content in the blue box is a generated entity which is included in the salient spans.

Conclusions
We propose automatic news comment generation with a reading network and a generation network. Experimental results on two datasets indicate that the proposed model can significantly outperform baseline methods in terms of both automatic evaluation and human evaluation. Regarding applications, we are motivated to extend the capabilities of a popular chatbot. We are aware of potential ethical issues with the application of these methods to generate news commentary that is taken as human-authored. We hope to stimulate discussion about best practices and controls on these methods around responsible uses of the technology.

A Yahoo! News Dataset
More information about the Yahoo dataset is shown in Table 8. The side information associated with the news includes:

• Paragraph. After pre-processing, we retain the paragraph structure of news articles.
• Category. There are 31 news categories and the distribution is shown in Figure 2.
• Wiki-Entities. The Wikipedia entities mentioned in the news articles are extracted.
• Vote. Each comment has upvote, downvote and abusevote information from news readers.
• Sentiment. Each comment is annotated with POSITIVE, NEGATIVE or NEUTRAL by Yahoo!.

B Case Study
We demonstrate the advantage of DeepCom over IR-TC and Att-TC with examples from the test sets of the two datasets. Table 9 and Table 10 show the comments given by the three models. The contents in the red boxes represent salient spans predicted by the reading network, and the contents in the blue boxes are generated entities included in the salient spans. We can see that, compared with the two baselines, comments from DeepCom are more relevant to the content of the news articles. Comments from IR-TC are rich in content but irrelevant to the news articles, while Att-TC is prone to generating generic comments.