Neural News Recommendation with Topic-Aware News Representation

News recommendation can help users find news of interest and alleviate information overload. The topic information of news is critical for learning accurate news and user representations for news recommendation. However, many existing news recommendation methods do not take it into consideration. In this paper, we propose a neural news recommendation approach with topic-aware news representations. The core of our approach is a topic-aware news encoder and a user encoder. In the news encoder, we learn representations of news from their titles via CNN networks and apply attention networks to select important words. In addition, we propose to learn topic-aware news representations by jointly training the news encoder with an auxiliary topic classification task. In the user encoder, we learn the representations of users from their browsed news and use attention networks to select informative news for user representation learning. Extensive experiments on a real-world dataset validate the effectiveness of our approach.


Introduction
Online news platforms such as Google News and MSN News have attracted hundreds of millions of users to read news online (Das et al., 2007; Lavie et al., 2010). Massive amounts of news are generated every day, making it impossible for users to read all of them to find the content they are interested in (Phelan et al., 2011). Thus, personalized news recommendation is very important for online news platforms to help users find the news they are interested in and alleviate information overload (IJntema et al., 2010).
Learning accurate representations of news and users is critical for news recommendation (Wu et al., 2019b,a). Several deep learning based methods have been proposed for this task (Okura et al., 2017; Wang et al., 2018; Kumar et al., 2017; Khattar et al., 2018; Zheng et al., 2018). For example, Okura et al. (2017) proposed to learn news representations from news bodies via denoising autoencoders, and to learn user representations from the representations of their browsed news via a gated recurrent unit (GRU) network. Wang et al. (2018) proposed to learn news representations from news titles via a knowledge-aware convolutional neural network (CNN), and to learn user representations from news representations using the similarity between candidate news and browsed news. However, these methods do not take the topic information of news into consideration.

[Figure 1: Three example news articles. "James Harden's incredible heroics lift Rockets over Warriors" (Sports); "These Are Some of The Safest Airlines in the World" (Travel); "Weekend snowstorm forecast from Midwest to East Coast" (Unlabeled).]

Our work is motivated by the following observations. First, the topic information of news is useful for news recommendation. For example, if a user clicks on many news articles with the topic "sport", we can infer that she is probably interested in sports. Thus, exploiting the topic information of news has the potential to learn more accurate news and user representations. Second, not all news articles contain topic labels, since it is very expensive and time-consuming to manually annotate the massive number of news articles emerging every day. Thus, it is not suitable to directly incorporate the topic labels of news as model input. Third, different words in the same news may have different informativeness in representing news. For example, in Fig. 1 the word "Airlines" is more informative than "Some". Besides, different news may also have different importance for user representation. For instance, the first news in Fig. 1 is more informative than the third one in inferring the interests of users.
In this paper, we propose a neural news recommendation approach with topic-aware news representations (TANR), which exploits the useful topic information in news. The core of our approach is a topic-aware news encoder and a user encoder.
In the news encoder, we learn the representations of news from their titles by capturing the local contexts via CNNs. Since different words may have different informativeness for news representation, we apply an attention network to select important words for news representation learning. In addition, we propose to learn topic-aware news representations by jointly training the news encoder with an auxiliary topic classification task. In the user encoder, we learn representations of users from the representations of their browsed news. Since different news may have different informativeness for user representation, we apply an attention network to select informative news for user representation learning. Extensive experiments are conducted on a real-world dataset. The results show that our approach can effectively improve the performance of news recommendation.

Our Approach
In this section, we first introduce our basic neural news recommendation model. Then we introduce how to learn topic-aware news representations.

Neural News Recommendation Model
The architecture of our basic neural news recommendation model is shown in Fig. 2. It consists of three major modules, i.e., a news encoder, a user encoder and a click predictor.

News Encoder. The news encoder module is used to learn representations of news from their titles. It contains three layers. The first one is word embedding, which converts a news title from a word sequence into a vector sequence. Denote a news title as [w_1, w_2, ..., w_M], where M is the title length. It is converted into the word vector sequence [e_1, e_2, ..., e_M] via a word embedding matrix.
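As a concrete illustration, the embedding layer can be sketched as a lookup into an embedding matrix. The toy vocabulary, names and dimensions below are illustrative assumptions, not from the paper (where embeddings are 300-dimensional pretrained GloVe vectors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and randomly initialized embedding table.
vocab = {"<pad>": 0, "safest": 1, "airlines": 2, "in": 3, "the": 4, "world": 5}
embed_dim = 4
E = rng.normal(size=(len(vocab), embed_dim))  # word embedding matrix

def encode_title(words, max_len=6):
    """Convert a title (list of words) into a padded sequence of word vectors."""
    ids = [vocab.get(w, 0) for w in words][:max_len]
    ids += [0] * (max_len - len(ids))  # pad to the fixed title length M
    return E[ids]                      # shape (M, embed_dim)

seq = encode_title(["safest", "airlines", "in", "the", "world"])
print(seq.shape)  # (6, 4)
```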
The second layer is a CNN network (Kim, 2014). Local contexts are important for understanding news titles. For example, in the news title "90th Birthday of Mickey Mouse", local contexts of "mouse" such as "Mickey" are useful for inferring that it is the name of a comic character. Thus, we use a CNN to learn contextual word representations by capturing local contexts. The CNN layer takes the word vectors as input and outputs the contextual word representations [c_1, c_2, ..., c_M].

The third layer is an attention network. Different words in the same news title may have different importance in representing the news. For example, in the first news of Fig. 1, the word "Rockets" is more informative than "over" for news representation. Thus, we propose to use an attention mechanism to select important words in news titles to learn informative news representations. Denote the attention weight of the i-th word in a news title as α_i^t:

a_i^t = q_t^T tanh(V_t × c_i + v_t),
α_i^t = exp(a_i^t) / Σ_{j=1}^{M} exp(a_j^t),

where V_t and v_t are parameters and q_t is the attention query vector. The final representation of a news title is the summation of the contextual representations of its words weighted by their attention weights, i.e., r = Σ_{i=1}^{M} α_i^t c_i.

User Encoder. The user encoder module is used to learn the representations of users from the representations of their browsed news. Different news browsed by the same user may have different informativeness for representing this user. For example, the news "The best movies in 2018" is more informative than the news "Winter storms next week" in inferring user interests. Therefore, we apply a news attention network to select important news to learn more informative user representations. Denote the attention weight of the i-th browsed news as α_i^n:

a_i^n = q_n^T tanh(V_n × r_i + v_n),
α_i^n = exp(a_i^n) / Σ_{j=1}^{N} exp(a_j^n),

where q_n, V_n and v_n are parameters, and N is the number of browsed news. The final representation of a user is the summation of the representations of her browsed news weighted by their attention weights, i.e., u = Σ_{i=1}^{N} α_i^n r_i.

Click Predictor. The click predictor module is used to predict the probability of a user clicking a candidate news article based on their hidden representations. Denote the representation of a candidate news D_c as r_c. Following (Okura et al., 2017), the click probability score ŷ is calculated by the inner product of the representation vectors of the user and the candidate news, i.e., ŷ = u^T r_c.
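The additive attention pooling used by both encoders, together with the inner-product click score, can be sketched as follows. This is a minimal numpy illustration with made-up dimensions and randomly initialized parameters; in the actual model, the word-level and news-level attention networks have separate parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(H, V, v, q):
    """Additive attention: a_i = q^T tanh(V @ h_i + v), weights via softmax.

    H: (L, d) input vectors (contextual word vectors c_i, or news vectors r_i).
    Returns the attention-weighted sum of the rows of H.
    """
    a = np.tanh(H @ V.T + v) @ q        # (L,) attention scores
    a = a - a.max()                     # numerical stability
    alpha = np.exp(a) / np.exp(a).sum() # softmax attention weights
    return alpha @ H                    # weighted summation, shape (d,)

d, d_att = 4, 3                         # toy representation / attention sizes
V = rng.normal(size=(d_att, d))
v = rng.normal(size=d_att)
q = rng.normal(size=d_att)

C = rng.normal(size=(5, d))             # contextual word vectors of one title
r = attention_pool(C, V, v, q)          # news representation
R = rng.normal(size=(3, d))             # representations of browsed news
u = attention_pool(R, V, v, q)          # user representation
r_c = rng.normal(size=d)                # candidate news representation
y_hat = u @ r_c                         # click probability score = u^T r_c
print(r.shape, u.shape, float(y_hat))
```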
Motivated by (Huang et al., 2013), we propose to use negative sampling techniques for model training. For each news article clicked by a user (denoted as a positive sample), we randomly sample K news articles that were displayed in the same impression but not clicked by this user as negative samples. We then jointly predict the click probability scores of the positive news ŷ^+ and the K negative news [ŷ_1^-, ŷ_2^-, ..., ŷ_K^-]. In this way, we formulate the news click prediction problem as a pseudo (K+1)-way classification task. The posterior click probability of a positive sample is calculated as follows:

p_i = exp(ŷ_i^+) / (exp(ŷ_i^+) + Σ_{j=1}^{K} exp(ŷ_{i,j}^-)).
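The negative-sampling objective above amounts to a softmax over the K+1 scores with the positive sample as the target class. A small sketch with made-up score values:

```python
import numpy as np

def click_nll(pos_score, neg_scores):
    """Negative log-likelihood of one positive sample against K negatives.

    Equivalent to cross-entropy in a pseudo (K+1)-way classification
    where the positive sample is the target class.
    """
    scores = np.concatenate(([pos_score], neg_scores))
    scores = scores - scores.max()                   # numerical stability
    p_pos = np.exp(scores[0]) / np.exp(scores).sum() # posterior click prob.
    return -np.log(p_pos)

# One positive impression with K = 4 negatives (scores are made up).
loss = click_nll(2.0, np.array([0.5, -1.0, 0.3, 1.2]))
print(round(float(loss), 4))
```

The total training loss is then the sum of this quantity over all positive samples in the training set.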
The loss function for news recommendation is the negative log-likelihood of all positive samples:

L_NewsRec = -Σ_{i∈S} log(p_i),

where S is the set of positive training samples.

Topic-Aware News Encoder
The topic information of news is useful for news recommendation. For example, if a user browses many news articles with the topic "sport", then she may be interested in sports. Thus, exploiting the news topics has the potential to improve the representations of news and users. However, not all news in online news platforms contain topic labels, since it is very costly and time-consuming to annotate the massive number of news articles emerging every day. Thus, instead of incorporating news topics as model input, we propose to learn a topic-aware news encoder which can extract topic information from news titles by jointly training it with an auxiliary news topic classification task, as shown in Fig. 3. We propose a news topic classification model for this task, which consists of a news encoder module and a topic predictor module. The news encoder module is shared with the news recommendation model. The topic predictor is used to predict the topic probability distribution from the news representation as follows:

t̂ = softmax(W_t × r + b_t),

where W_t and b_t are parameters and t̂ is the predicted topic distribution. The loss function of the topic classification task is formulated as follows:

L_Topic = -Σ_{i=1}^{N_t} Σ_{k=1}^{K_c} t_{i,k} log(t̂_{i,k}),

where N_t is the number of news with topic labels, K_c is the number of topic categories, and t_{i,k} and t̂_{i,k} are the gold and predicted probabilities of the i-th news in the k-th topic category. We jointly train the news recommendation and topic classification tasks. The overall loss function is a weighted summation of the news recommendation and topic classification losses:

L = L_NewsRec + λ L_Topic,

where λ is a positive coefficient. Since the news recommendation and topic classification tasks share the same news encoder, via joint training the news recommendation model can capture the topic information to learn topic-aware news and user representations for news recommendation.

Experiments

In our experiments, word embeddings are 300-dimensional and were initialized with the pretrained GloVe embeddings (Pennington et al., 2014).
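The topic classification objective and the weighted joint loss of the topic-aware news encoder can be sketched as follows, with toy dimensions and randomly initialized parameters standing in for the learned ones:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

d, K_c = 4, 3                          # representation size, number of topics
W_t = rng.normal(size=(K_c, d))        # topic predictor parameters
b_t = rng.normal(size=K_c)

def topic_loss(r, t_onehot):
    """Cross-entropy between the gold topic and the predicted distribution."""
    t_hat = softmax(W_t @ r + b_t)
    return -np.sum(t_onehot * np.log(t_hat))

r = rng.normal(size=d)                 # news representation from shared encoder
l_topic = topic_loss(r, np.array([0.0, 1.0, 0.0]))

l_newsrec = 0.7                        # stand-in recommendation loss value
lam = 0.2                              # coefficient lambda used in the paper
l_total = l_newsrec + lam * l_topic    # overall joint training objective
print(float(l_total) > l_newsrec)      # topic loss is non-negative
```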
The CNN network has 400 filters, and their window size is 3. The negative sampling ratio K is 4 and the coefficient λ is 0.2. Adam (Kingma and Ba, 2014) is used as the optimization algorithm, and the batch size is 64. These hyperparameters were selected on the validation set. The metrics used for result evaluation in our experiments are AUC, MRR, nDCG@5 and nDCG@10. We repeated each experiment 10 times and report the average results.
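For reference, the ranking metrics MRR and nDCG@k can be computed per impression roughly as below; this is an illustrative sketch with binary click labels and made-up model scores, not the paper's evaluation code:

```python
import numpy as np

def mrr(labels, scores):
    """Reciprocal-rank score of the clicked items in one impression."""
    order = np.argsort(scores)[::-1]            # rank candidates by score
    y = np.asarray(labels, dtype=float)[order]
    rr = y / np.arange(1, len(y) + 1)           # 1/rank for clicked items
    return rr.sum() / y.sum()

def ndcg_at_k(labels, scores, k):
    """nDCG@k for one impression with binary click labels."""
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(scores)[::-1]
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (labels[order][:k] * discounts[: len(labels)]).sum()
    ideal = (np.sort(labels)[::-1][:k] * discounts[: len(labels)]).sum()
    return dcg / ideal

labels = [1, 0, 0, 1, 0]                        # which candidates were clicked
scores = [0.9, 0.8, 0.3, 0.7, 0.1]              # made-up model scores
print(round(mrr(labels, scores), 3), round(ndcg_at_k(labels, scores, 5), 3))
```

The reported numbers are then averaged over all impressions in the test set.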

Performance Evaluation
We evaluate the performance of our TANR approach by comparing it with several baseline methods, including: (1) LibFM (Rendle, 2012), a feature-based matrix factorization method for recommendation; (2) CNN (Kim, 2014), using Kim's CNN to learn news representations from news titles and building user representations via max pooling; (3) DSSM (Huang et al., 2013), using the deep structured semantic model by regarding the concatenation of browsed news titles as the query and candidate news as the documents; (4) DFM (Lian et al., 2018), a deep fusion model combining dense layers with different depths and using an attention mechanism to select important features; (5) GRU (Okura et al., 2017), using autoencoders to learn news representations and a GRU network to learn user representations; (6) DKN (Wang et al., 2018), a neural news recommendation method which utilizes entity information in knowledge graphs via a knowledge-aware CNN; (7) TANR-basic, our basic neural news recommendation model; (8) TANR, our approach with topic-aware news representations. The results of the different methods are summarized in Table 2, from which we have several observations. First, the methods based on neural networks (e.g., CNN, DSSM and TANR) outperform LibFM. This is because neural networks can learn better news and user representations than traditional matrix factorization methods. Second, both TANR-basic and TANR outperform many baseline methods. This is because our approaches can select important words and news for learning informative news and user representations via a hierarchical attention network, which is not considered in the baseline methods. Third, TANR consistently outperforms TANR-basic. This validates that news topics are useful for news recommendation and that our approach can effectively exploit the topic information.
Next, we evaluate the performance of our approach on topic classification. The F-score over each topic category is shown in Fig. 5. From Fig. 5, we find that the classification performance is satisfactory for most topic categories, except for "kids". This may be because the training data for this category is too scarce for it to be recognized reliably. These results show that our approach can capture useful topic information by training the news encoder with an auxiliary topic classification task to learn topic-aware news representations.

Effectiveness of Hierarchical Attention
We conducted experiments to explore the effectiveness of the hierarchical attention networks in our approach. The results are shown in Fig. 6. We find that the news-level attention network can effectively improve the performance of our approach. This is because different news usually have different informativeness in representing users, and selecting important news can help learn more informative user representations. In addition, the word-level attention network is also useful. This is because different words usually have different importance for representing news, and our approach can select important words to learn informative news representations. Moreover, combining both attention networks can further improve the performance of our approach. These results validate the effectiveness of the hierarchical attentions in our approach.

Influence of Hyperparameter
In this section, we explore the influence of the coefficient λ in the overall loss function on our approach. It controls the relative importance of the topic classification task. The results are shown in Fig. 7. We find that if λ is too small, the performance of our approach is not optimal, since the useful topic information is not fully exploited. Thus, the performance improves as λ increases from 0. However, when λ becomes too large, the performance of our approach declines. This is because the topic classification task is over-emphasized and the news recommendation task receives too little emphasis. A moderate value of λ (e.g., 0.2) is the most appropriate.

Conclusion
In this paper, we propose a neural news recommendation approach with topic-aware news representations. In our approach, we propose a news encoder to learn news representations from news titles and use an attention network to select important words. We further propose to learn a topic-aware news encoder by jointly training it with an auxiliary topic classification task to extract the topic information in news. Moreover, we propose a user encoder to learn representations of users from their browsed news, with an attention network to select important news. Extensive experiments on a real-world dataset validate the effectiveness of our approach.