Fine-grained Interest Matching for Neural News Recommendation

Personalized news recommendation is a critical technology for improving users' online news reading experience. The core of news recommendation is accurate matching between a user's interests and candidate news. The same user usually has diverse interests that are reflected in the different news articles she has browsed. Meanwhile, important semantic features of news are implied in text segments of different granularities. Existing studies generally represent each user as a single vector and then match it against the candidate news vector, which may lose fine-grained information for recommendation. In this paper, we propose FIM, a Fine-grained Interest Matching method for neural news recommendation. Instead of aggregating all of a user's historically browsed news into a unified vector, we hierarchically construct multi-level representations for each news article via stacked dilated convolutions. Then we perform fine-grained matching between segment pairs of each browsed news article and the candidate news at each semantic level. Higher-order salient signals are then identified, resembling the hierarchy of image recognition, for final click prediction. Extensive experiments on a real-world dataset from MSN news validate the effectiveness of our model on news recommendation.


Introduction
Recently, people's news reading habits have gradually shifted to digital content services. Many online news websites, such as Google News (https://news.google.com/) and MSN News (https://www.msn.com/news), aim to collect news from various sources and distribute it to users (Das et al., 2007; Lavie et al., 2010). However, the overwhelming number of newly published articles makes it difficult for users to find content of interest (Wu et al., 2019c). Therefore, personalized news recommendation has become an important technology to alleviate information overload and improve users' online reading experience (IJntema et al., 2010).
The key to news recommendation lies in the accurate matching of a user's interests and candidate news. The same user usually has diverse interests, which are reflected in the different news articles she has browsed. Meanwhile, important semantic features of news are implied in text segments of different granularities. Figure 1 illustrates these challenges with an example. As demonstrated, different historical browsed news can reveal the user's interests in different topics or events. The first and second historical news are about pet dogs and the issue of weight loss respectively. Naturally, they provide critical clues for selecting the candidate news C2 and C3, which contain relevant information. However, they are less informative for identifying the candidate news C1, which is about a National Football League (NFL) game. Besides, the matched segment pairs across browsed news and candidate news lie at different granularities, such as the words "Dog's"-"puppy" and the phrases "lost 245 pounds"-"Weight Loss". Moreover, different segments in news texts have different importance for selecting proper news candidates. For example, in the third historical browsed news D3, "Philip Rivers" and "Chiefs" are more important than words like "hilariously" and "after" for inferring that the user is an NFL fan, since they refer to a famous quarterback and team of this sport.
Existing work, however, usually learns a single representation for each user by integrating all the historical news the user has browsed; recommendations are then made by matching the final user vector against the candidate news vector (Okura et al., 2017; Wu et al., 2019e,b). For instance, Okura et al. (2017) encode news via denoising autoencoders and learn user representations from browsed news via a GRU network. Wu et al. (2019e) apply multi-head self-attention to learn news representations, then learn user representations by modeling the relatedness between browsed news. Wu et al. (2019b) enhance personalized news and user representations by exploiting the embedding of the user's ID to generate a query vector for attending to important words and news. Despite the improvements these methods bring to news recommendation performance, they are limited in capturing fine-grained user-news matching signals, since the user's various latent interests implied in distinct historical readings cannot be matched against the candidate news until the final step of click prediction.
In this paper, we propose a Fine-grained Interest Matching network (FIM), a new architecture for news recommendation that tackles the above challenges. The advantages of FIM lie in two core designs: multi-level news representation and fine-grained interest matching. Instead of representing each user as a single abstract vector, we employ hierarchical dilated convolutions in a unified module to construct multi-level representations of each news article based on its title and category annotations. By hierarchically stacking the dilated convolutions, the receptive field at each layer grows exponentially, while the number of parameters increases only linearly. Meanwhile, the outputs of each layer are preserved as feature maps over text segments of different lengths, with no loss in coverage, since no pooling or strided convolution is applied. In this way, we gradually obtain semantic features of news, from local correlations to long-term dependencies, at different granularities, including the word, phrase, and sentence levels.
Furthermore, to avoid information loss, FIM matches the text segments of the candidate news and each historical news article browsed by the user at each semantic granularity. In practice, for each pair of news, the model constructs segment-segment similarity matrices from the word level to the sentence level based on the hierarchical news representations. By this means, the user's reading interests implied in the browsing history can be recognized under the supervision of the candidate news and carried into matching with minimal loss, so as to provide sufficient clues about content relevance for recommending proper news. Afterwards, we merge the matching matrices of all news pairs at all granularities into a 3D image, whose channels indicate the degrees of different kinds of user-news matching patterns. By resembling the CNN-based hierarchy of image recognition, higher-order salient signals are identified to predict the probability of the user clicking the candidate news.
We conducted extensive experiments on a real-world dataset collected from MSN news. Experimental results validate that our approach can effectively improve the performance of news recommendation compared with state-of-the-art methods.

Related Work
With the explosive growth of digital news, building personalized news recommender systems has attracted increasing attention in both the natural language processing and data mining fields (Phelan et al., 2011; Zheng et al., 2018; Wu et al., 2019a). Conventional news recommendation methods focus on utilizing manual feature engineering to build news and user representations for matching (Phelan et al., 2009; Li et al., 2010; Liu et al., 2010; Son et al., 2013; Li et al., 2014; Bansal et al., 2015). For example, Liu et al. (2010) used topic categories and interest features generated by a Bayesian model to build news and user representations. Son et al. (2013) extracted topic and location features from Wikipedia pages to build news representations for location-based news recommendation.
In recent years, deep learning based models have achieved better performance than traditional methods for news recommendation, due to their capability of distilling implicit semantic features in news content (Okura et al., 2017; Wang et al., 2018; Wu et al., 2019e,d). For example, Okura et al. (2017) proposed to learn long-term user preferences from the embeddings of user IDs, and to learn short-term user interests from recently browsed news via a GRU network. Wu et al. (2019a) proposed an attentive multi-view learning model to learn unified news representations from titles, bodies and topic categories by regarding them as different views of news. Different from these existing methods, in FIM, the representations of a user's multiple browsed news are not fused into an abstract user vector before matching with the candidate news. Instead, we perform matching between each pair of segments in the news texts at multiple semantic levels. Therefore, more fine-grained information can be distilled for the final recommendation.
Our Approach

Problem Definition
The news recommendation problem can be formulated as follows. Given a user u, the set of historical news she has browsed on the online news platform is denoted as s_u = {d_1, ..., d_n}. For a candidate news article c_i, a binary label y_i ∈ {0, 1} indicates whether u will click c_i in later impressions. The aim is to build a prediction model g(·, ·).
For each pair of user and candidate news (u, c), we predict the probability that u would click c using the function g : (s_u, c) → ŷ. Recommendations are then made by ranking the candidate news according to their click scores.
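As a concrete illustration, the ranking step can be sketched as follows; the candidate ids and scores are hypothetical, not taken from the paper:

```python
def rank_by_click_score(candidates, scores):
    """Order candidate news ids by predicted click probability, descending."""
    return [c for c, _ in sorted(zip(candidates, scores), key=lambda p: -p[1])]

print(rank_by_click_score(["c1", "c2", "c3"], [0.2, 0.9, 0.5]))  # → ['c2', 'c3', 'c1']
```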

Model Overview
We present a Fine-grained Interest Matching network (FIM) to model g(·, ·). The architecture of FIM is illustrated in Figure 2, which contains three major components, i.e., a news representation module to construct hierarchical semantic features for news text segments, a cross interaction module to exploit and aggregate matching information from each pair of news at each level of granularity, and a prediction module to calculate the probability that the user will click the candidate news. Next, we introduce each component in detail.

News Representation Module
We design a hierarchical dilated convolution (HDC) encoder to learn representations of news from multiple semantic views. Besides titles, which reflect the central information of news, on many digital platforms such as MSN, news articles are usually labeled with a category annotation (e.g., "sports", "entertainment") and a subcategory annotation (e.g., "football nba", "movies celebrity") that help indicate news topics and target users' interests. HDC encodes each news article by concatenating its title, category and subcategory annotations into one sequence of words as input. Given the word sequence [x_1, ..., x_N], where N is the sequence length, the model first looks up an embedding table to map each word into a low-dimensional vector. Then hierarchical dilated convolution layers are applied to capture multi-grained semantic features in news texts. Different from standard convolution, which convolves a contiguous subsequence of the input at each step, dilated convolution (Yu and Koltun, 2016) has a wider receptive field by skipping over δ input elements at a time, where δ is the dilation rate. For a context around x_j and a convolution kernel W of size 2w + 1, the dilated convolution operation is:

x_j^l = ReLU(W [x_{j-w·δ} ⊕ ... ⊕ x_j ⊕ ... ⊕ x_{j+w·δ}] + b),

where ⊕ is vector concatenation, b is the bias and ReLU (Nair and Hinton, 2010) is the nonlinear activation function. As shown in Figure 3, the darker output of each convolution layer is a weighted combination of the lighter, regularly spaced inputs in the previous layer. We start with δ = 1 (equal to standard convolution) for the first layer to ensure that no element of the input sequence is excluded. Afterwards, by hierarchically stacking dilated convolutions with wider dilation rates, the length of the convolved text segments expands exponentially, and the semantic features of different n-grams can be covered using only a few layers and a modest number of parameters.
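The dilated convolution step above can be sketched in NumPy as follows. This is a minimal illustration, not the trained model: the filter values are random, the shapes are hypothetical, and zero padding is assumed so that the output length stays N.

```python
import numpy as np

def dilated_conv1d(x, W, b, delta):
    """One dilated 1D convolution layer over a word sequence.

    x: (N, f_in) input features; W: (2w+1, f_in, f_out); b: (f_out,).
    delta: dilation rate (delta=1 recovers standard convolution).
    Zero-padding keeps the output length equal to N, so no coverage is lost.
    """
    N, f_in = x.shape
    k, _, f_out = W.shape
    w = (k - 1) // 2
    pad = w * delta
    xp = np.pad(x, ((pad, pad), (0, 0)))           # zero-pad both ends
    out = np.zeros((N, f_out))
    for j in range(N):
        # gather the 2w+1 inputs spaced delta apart, centered at position j
        ctx = xp[j : j + 2 * pad + 1 : delta]       # (2w+1, f_in)
        out[j] = np.maximum(0.0, np.einsum('ki,kio->o', ctx, W) + b)  # ReLU
    return out

# stacking layers with dilation rates [1, 2, 3], as in the reported setting
x = np.random.randn(20, 8)
h1 = dilated_conv1d(x, np.random.randn(3, 8, 8) * 0.1, np.zeros(8), delta=1)
h2 = dilated_conv1d(h1, np.random.randn(3, 8, 8) * 0.1, np.zeros(8), delta=2)
h3 = dilated_conv1d(h2, np.random.randn(3, 8, 8) * 0.1, np.zeros(8), delta=3)
```

Each layer's output (h1, h2, h3) is kept as a separate feature map, matching the paper's multi-level representation.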
Moreover, to prevent vanishing or exploding gradients, we apply layer normalization (Ba et al., 2016) at the end of each convolution layer. Since irrelevant information may be introduced into semantic units at long distances, we choose the multi-level dilation rates based on validation performance. The output of each stacked layer l is preserved as a feature map of the news text at a specific level of granularity, formulated as d^l = [x_j^l]_{j=1}^{N} ∈ R^{N×f_s}, where f_s is the number of filters of each layer. Supposing L layers are stacked, the multi-grained news representations are defined as [d^0, d^1, ..., d^L]. By this means, HDC gradually harvests lexical and semantic features at the word and phrase levels with small dilation rates, and captures long-term dependencies at the sentence level with larger dilation rates. Meanwhile, the computational path is greatly shortened, and the information loss caused by down-sampling methods such as max-pooling is avoided. Our news encoder is superior to recurrent units in parallelism, and to fully attention-based approaches in avoiding the quadratic token-pair memory consumption.
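The per-layer normalization mentioned above (Ba et al., 2016) can be sketched as follows; the gain and bias are learned parameters in the real model, passed in here as plain arguments:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    """Normalize each feature vector to zero mean and unit variance,
    then apply the learned gain and bias (layer normalization)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias
```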

Cross Interaction Module
Given the representations of the k-th browsed news [d_k^l]_{l=0}^{L} and the candidate news [c^l]_{l=0}^{L}, a segment-segment matching matrix M_{k,c}^l ∈ R^{N_{d_k}×N_c} is constructed for each granularity, where l ∈ {0, 1, ..., L} is the semantic level, and N_{d_k} and N_c are the lengths of the news d_k and c. The (i, j)-th element of M_{k,c}^l is calculated by scaled dot product as:

M_{k,c}^l[i, j] = (d_k^l[i])ᵀ · c^l[j] / √f_s,

indicating the relevance between the i-th segment in d_k and the j-th segment in c under the l-th representation type. The L + 1 matching matrices for the news pair <d_k, c> can be viewed as different feature channels of their matching information.
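The scaled dot-product matching at one semantic level can be sketched as follows, assuming f_s-dimensional segment features (shapes are illustrative):

```python
import numpy as np

def matching_matrix(d_k, c):
    """Scaled dot-product similarity between every segment of a browsed news
    d_k (N_dk, f_s) and every segment of the candidate news c (N_c, f_s)."""
    f_s = d_k.shape[1]
    return d_k @ c.T / np.sqrt(f_s)   # (N_dk, N_c)
```

One such matrix is computed per semantic level l, giving the L + 1 channels described above.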
To summarize the information of the user's entire reading sequence, FIM fuses the interaction matrices of every browsed news article and the candidate news into a 3D matching image Q, formulated as:

Q = {Q_{k,i,j}}_{n × N_{d_k} × N_c},

where n denotes the total number of browsed news in the user history, and each pixel Q_{k,i,j} is defined as:

Q_{k,i,j} = [M_{k,c}^0[i, j], M_{k,c}^1[i, j], ..., M_{k,c}^L[i, j]].

Specifically, each pixel is a concatenated vector with L + 1 channels, indicating the matching degrees between a certain segment pair of the news content at different levels of granularity. As the user's click behaviors may be driven by personalized interests or by temporary demands and events, different historical browsed news has different usefulness and representativeness for matching and recommending the proper candidate news. Inspired by Zhou et al. (2018) in dialogue systems, we resemble the compositional hierarchy of image recognition and employ a layered 3D convolution & max-pooling neural network to identify salient matching signals from the whole image. 3D convolution is the extension of typical 2D convolution, whose filters and strides are 3D cubes. Formally, the higher-order pixel at (k, i, j) on the z-th feature map of the t-th layer is computed as:

Q_{k,i,j}^{(t,z)} = ReLU( Σ_{z'} K^{(t,z)} ⊗ Q^{(t-1,z')}_{[k,i,j]} + b^{(t)} ),

where z' ranges over the feature maps of the previous layer, K^{(t,z)} ∈ R^{W_t×H_t×R_t} is a 3D convolution kernel of size W_t × H_t × R_t, and b^{(t)} is the bias of the t-th layer. A max-pooling operation is then adopted to extract salient signals as follows:

Q̂_{k,i,j}^{(t,z)} = max_{0 ≤ w < P_w^{(t,z)}, 0 ≤ h < P_h^{(t,z)}, 0 ≤ r < P_r^{(t,z)}} Q_{k+w, i+h, j+r}^{(t,z)},

where P_w^{(t,z)}, P_h^{(t,z)} and P_r^{(t,z)} are the sizes of the 3D max-pooling window. The outputs of the final layer are concatenated into the integrated matching vector between the user and the candidate news, denoted as s_{u,c} ∈ R^v.
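The image construction and one 3D convolution & max-pooling stage can be sketched as follows. This is a naive, loop-based illustration with random values and hypothetical sizes (valid padding, stride 1), not the paper's optimized implementation:

```python
import numpy as np

L_plus_1, n, N = 4, 5, 20   # hypothetical: 4 semantic levels, 5 browsed news, length-20 texts

# stack the per-level matching matrices of every <d_k, c> pair into one 3D image
# Q has shape (n, N, N, L+1): browsed-news axis x two segment axes x level channels
Q = np.stack(
    [np.stack([np.random.randn(N, N) for _ in range(L_plus_1)], axis=-1)
     for _ in range(n)], axis=0)

def conv3d_valid(Q, K, b):
    """Naive 3D convolution (valid padding, stride 1) with ReLU.
    Q: (D, H, W, C_in); K: (kd, kh, kw, C_in, C_out); b: (C_out,)."""
    D, H, W, _ = Q.shape
    kd, kh, kw, _, Co = K.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1, Co))
    for d in range(out.shape[0]):
        for h in range(out.shape[1]):
            for w in range(out.shape[2]):
                patch = Q[d:d+kd, h:h+kh, w:w+kw, :]
                out[d, h, w] = np.maximum(0.0, np.einsum('dhwc,dhwco->o', patch, K) + b)
    return out

def maxpool3d(Q, p):
    """Non-overlapping 3D max pooling with cubic window and stride p."""
    D, H, W, C = Q.shape
    Qc = Q[:D - D % p, :H - H % p, :W - W % p]
    return Qc.reshape(D // p, p, H // p, p, W // p, p, C).max(axis=(1, 3, 5))

K1 = np.random.randn(3, 3, 3, L_plus_1, 8) * 0.1
h = maxpool3d(conv3d_valid(Q, K1, np.zeros(8)), 3)
s = h.reshape(-1)    # flattened matching vector s_{u,c}
```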

Click Prediction Module
In the recommendation scenario studied in this paper, recommendations are made by ranking the candidate news articles according to their probabilities of being clicked by a user in an impression. Given the integrated matching vector s_{u,c} of a user and candidate news pair, the final click probability is calculated as:

ŷ = W_oᵀ s_{u,c} + b_o,

where W_o and b_o are learned parameters. Motivated by Huang et al. (2013b) and Wu et al. (2019e), we leverage the negative sampling technique for model training. For each news article browsed by a user (regarded as a positive sample), we randomly sample K news articles that were shown in the same impression but not clicked by the user as negative samples. Besides, the order of these news articles is shuffled to avoid positional biases. FIM jointly predicts the click probability scores of the positive news and the K negative news during training. By this means, the news click prediction problem is reformulated as a (K + 1)-way classification task. The loss function minimizes the summed negative log-likelihood of all positive samples:

L = -Σ_{i=1}^{S} log( exp(ŷ_i⁺) / (exp(ŷ_i⁺) + Σ_{k=1}^{K} exp(ŷ_{i,k}⁻)) ),

where S is the number of positive training samples, and ŷ_{i,k}⁻ is the click score of c_{i,k}, the k-th negative sample in the same impression as the i-th positive sample.
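The (K + 1)-way negative-sampling loss can be sketched as follows; score values here are placeholders for the model's outputs:

```python
import numpy as np

def nll_loss(pos_scores, neg_scores):
    """Negative log-likelihood over (K+1)-way softmax groups.

    pos_scores: (S,) click score of each positive sample;
    neg_scores: (S, K) scores of the K negatives from the same impression.
    """
    all_scores = np.concatenate([pos_scores[:, None], neg_scores], axis=1)  # (S, K+1)
    all_scores = all_scores - all_scores.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = all_scores - np.log(np.exp(all_scores).sum(axis=1, keepdims=True))
    return -log_softmax[:, 0].sum()   # positives sit in column 0
```

With all scores equal, each group contributes log(K + 1), the loss of a uniform (K + 1)-way guess.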

Dataset and Experimental Settings
We conducted experiments on the Microsoft News dataset used in Wu et al. (2019b), which was built from the user click logs of Microsoft News. The detailed statistics are shown in Table 1. Logs from the last week were used for testing and the rest for model training. Besides, we randomly sampled 10% of the logs in the training data for validation.
In our experiments, the word embeddings are 300-dimensional and initialized with pre-trained GloVe vectors (Pennington et al., 2014). Due to GPU memory limitations, the maximum length of the concatenated word sequence of news title and categories is set to 20, and at most 50 browsed news articles are kept to represent the user's recent reading behaviors. We tested stacking 1-5 HDC layers with different dilation rates. The reported results use the [1-2-3] hierarchy (dilation rates of the successive convolution layers), as it achieves the best performance on the validation set. The window size and the number of convolution filters for news representation are 3 and 150 respectively. For the cross interaction module, we use a two-layer composition to distill higher-order salient features of the 3D matching image; the numbers and window sizes of the 3D convolution filters are 32-[3,3,3] for the first layer and 16-[3,3,3] for the second layer, with [1,1,1] stride. The subsequent max-pooling size is [3,3,3] with [3,3,3] stride. Meanwhile, the negative sampling ratio K is set to 4. Adam (Kingma and Ba, 2014) is used as the optimizer, the mini-batch size is 100, and the initial learning rate is 1e-3.
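For reference, the reported hyperparameters can be collected into a single configuration sketch; the key names below are our own shorthand, not identifiers from a released implementation:

```python
# Hyperparameters as reported in the experimental settings
FIM_CONFIG = {
    "word_embedding_dim": 300,          # pre-trained GloVe vectors
    "max_news_len": 20,                 # title + category word sequence
    "max_history_len": 50,              # browsed news kept per user
    "hdc_dilation_rates": [1, 2, 3],    # one rate per stacked HDC layer
    "conv_window_size": 3,
    "conv_filters": 150,
    "conv3d_layers": [(32, (3, 3, 3)), (16, (3, 3, 3))],  # (filters, window)
    "conv3d_stride": (1, 1, 1),
    "pool3d_size": (3, 3, 3),
    "pool3d_stride": (3, 3, 3),
    "negative_sampling_K": 4,
    "optimizer": "Adam",
    "batch_size": 100,
    "learning_rate": 1e-3,
}
```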
Following the settings of state-of-the-art methods (Okura et al., 2017; Wu et al., 2019e), we use popular ranking metrics to evaluate the performance of each model, including AUC (Area Under the ROC Curve), MRR (Mean Reciprocal Rank), and nDCG@5 and nDCG@10 (Järvelin and Kekäläinen, 2002). We independently repeated each experiment 10 times and report the average performance.

Comparison Methods
We compare FIM with the following methods:

Manual Feature-based Methods: Traditional recommendation methods that rely on manual feature engineering to build news and user representations, including (1) LibFM (Rendle, 2012), a feature-based matrix factorization model that is widely used in recommendation. We extract TF-IDF features from users' browsed news and candidate news, and concatenate them as the input for LibFM; (2) DeepFM (Guo et al., 2017), which combines factorization machines with deep neural networks, fed with the same TF-IDF features.

Neural Recommendation Methods: Neural networks specially designed for news recommendation, including (1) DFM (Lian et al., 2018), a deep fusion model combining dense layers of different depths and using an attention mechanism to select important features; (2) DKN (Wang et al., 2018), incorporating entity information from knowledge graphs with Kim CNN (Kim, 2014) to learn news representations, and using a news-level attention network to learn user representations; (3) GRU (Okura et al., 2017), using auto-encoders to represent news and a GRU network to represent users; (4) NRMS (Wu et al., 2019e), leveraging multi-head self-attention for news and user representation learning; (5) Hi-Fi Ark, summarizing user history into highly compact and complementary vectors as archives, and learning candidate-dependent user representations via attentive aggregation of such archives; (6) NPA (Wu et al., 2019b), using personalized attention with the user ID's embedding as the query vector to select important words and news.

Table 2: The performance of different methods on news recommendation. The best and second best results are highlighted in boldface and underlined respectively. The improvement over all baseline methods is significant at p-value < 0.05.
Ablation Variants: To verify the effects of multi-grained representation and hierarchical matching, we further set up two ablation models for comparison: (1) FIM_first, a variant that uses the feature maps of the first news representation layer for matching and recommendation; in this scenario, the HDC module degenerates into a one-layer standard CNN encoder. (2) FIM_last, a variant that uses the outputs of the last HDC layer (namely, the L-th embedding type) to represent each news article for matching. Due to the hierarchical representation architecture, higher-level features synthesize information from lower-level features, and can model more complex lexical and semantic clues.

Experimental Results

Table 2 shows the results of our model and all comparison methods. Several observations can be made. First, neural news recommendation methods (e.g., GRU, NRMS, Hi-Fi Ark, NPA) are generally better than traditional methods (e.g., LibFM, DeepFM) based on manual feature engineering. The reason might be that handcrafted features are usually not optimal, and deep neural networks have the advantage of extracting implicit semantic features and modeling latent relationships between user and news representations. Second, our model FIM consistently outperforms the other baselines in terms of all metrics, including the state-of-the-art deep learning based models. This validates the advantage of the pair-wise multi-level matching architecture in detecting fine-grained matching information from news segment pairs to predict the probability of a user clicking a candidate news article.

Third, both FIM_first and FIM_last show a decrease in performance compared to FIM. The latter is better than the former, indicating the effectiveness of constructing higher-level representations on the basis of lower levels via the hierarchical mechanism of HDC. Besides, compared with DKN, which utilizes knowledge-enhanced CNNs to learn news representations, FIM_first performs better, illustrating the advantage of the pair-wise matching fashion. Notably, while FIM_last underperforms FIM, it still outperforms all other competitors on all metrics; nevertheless, the benefit of interacting news pairs at multiple semantic levels remains significant.

Analysis
In this section, we further investigate the impact of different parameters and inputs on model performance, and discuss the contribution of the multi-grained representation and matching architecture.

Quantity & Input Analysis
We first study how FIM performs with different negative sampling ratios K. Figure 4(a) shows the experimental results. We find that the performance consistently improves while K is smaller than 5, and then begins to decline. The possible reason is that with too small a K, the useful information exploited from negative samples is limited. However, when too many negative samples are incorporated, they may become dominant and increase the imbalance of the training data, making it more difficult for the model to precisely recognize the positive samples, which also hurts recommendation performance. Overall, the optimal setting of K is moderate (e.g., K = 4).
We then explore the influence of the 3D convolution & max-pooling network for processing the matching image Q. Comparison results are illustrated in Figure 4(b), where the CNN hierarchy a-b means that the numbers of filters in the first and second layers are set to a and b respectively. As shown, given the filter number a for the first layer, the performance first increases with a larger filter number b for the second layer, since more high-order information can be extracted, and then begins to decrease, possibly because additional filters introduce redundant parameters and noise.

We further compare different combinations of the number of dilated convolution filters and stacked layers in the HDC news representation module. Figure 4(c) demonstrates the results, where darker areas represent larger values. We observe a consistent trend across settings with different numbers of filters per layer: there is a significant improvement over the first few stacked layers, and then the performance drops notably when the depth grows to 5. The results indicate that the depth of the representation layers indeed matters for matching and recommendation accuracy. The optimal numbers of stacked layers and convolution filters are 3 and 150 respectively. We think the reason might be that in this setting, the receptive field of the dilated convolution filters ranges over [3-7-13] across layers (with dilation rates [1-2-3]), which is sufficient for modeling multi-grained n-gram features through hierarchical composition of local interactions, given the average length of news word sequences.
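The [3-7-13] receptive fields quoted above follow directly from the stacking rule: with stride 1, a layer with window 2w + 1 and dilation δ widens the receptive field by 2wδ positions.

```python
def receptive_field(kernel_size, dilation_rates):
    """Receptive field after each stacked dilated convolution (stride 1).
    Each layer adds (kernel_size - 1) * delta positions."""
    rf = 1
    fields = []
    for delta in dilation_rates:
        rf += (kernel_size - 1) * delta
        fields.append(rf)
    return fields

# window size 3 with dilation rates [1, 2, 3], as in the reported setting
print(receptive_field(3, [1, 2, 3]))   # → [3, 7, 13]
```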
We also investigate the effectiveness of incorporating the two-level category annotations of news as inputs. The results are shown in Figure 4(d). We find that incorporating either categories or subcategories benefits the performance of our model. This is interpretable, since category annotations help reveal the user's interested aspects more explicitly. In addition, enhancing news representations with subcategories is better than with categories, probably because, compared to the general category labels, subcategories provide more concrete and detailed information about the core topic of the news content. Overall, jointly incorporating the two-level category annotations achieves the best performance.

Visualization
In this subsection, we further study the effectiveness of constructing hierarchical news representations and performing multi-grained interest matching. Figure 5 visualizes the multi-grained matching matrices (defined in formula 2) between a user's historical browsed news and candidate news, where M^l denotes the matching matrix of a news pair at the l-th representation level. We observe that the important matching information captured by the first-level matching matrix is mainly lexical relevance. For example, the words "football", "nfl", "playoff", "playoffs" and "quarterbacks" are more correlated and assigned higher matching values in M^1, which may be due to the similar co-occurrence information encoded in their word embeddings. In contrast, higher-level matching matrices are able to identify more sophisticated semantic structures and latent long-term dependencies. In Figure 5(b), the interaction areas between the segments "weight loss" in the candidate news and "lost pounds" in the browsed news gain significantly larger matching scores among the second-level semantic representations. In the matching matrix M^3 in Figure 5(c), the subsequences about "trump walks out" are distinguished, since the expressions have correlated meanings. Meanwhile, the results also indicate that our model can identify the important segments of a sentence and ignore the less informative parts, which helps capture the user's interested topics or events more accurately.

Conclusion and Future Work
In this paper, we propose a new architecture for neural news recommendation based on multi-grained representation and matching. Different from previous work that first integrates a user's reading history into a single representation vector and then matches it against the candidate news representation, our model captures more fine-grained interest matching signals by performing interactions between each pair of news at multiple levels of semantic granularity. Extensive experiments on a real-world dataset collected from MSN news show that our model significantly outperforms state-of-the-art methods. In the future, we plan to further evaluate improvements on business objectives such as user experience, user engagement and service revenue.