LRMM: Learning to Recommend with Missing Modalities

Multimodal learning has shown promising performance in content-based recommendation due to the auxiliary user and item information of multiple modalities such as text and images. However, the problem of incomplete and missing modalities is rarely explored, and most existing methods fail to learn a recommendation model with missing or corrupted modalities. In this paper, we propose LRMM, a novel framework that mitigates not only the problem of missing modalities but also, more generally, the cold-start problem of recommender systems. We propose modality dropout (m-drop) and a multimodal sequential autoencoder (m-auto) to learn multimodal representations for complementing and imputing missing modalities. Extensive experiments on real-world Amazon data show that LRMM achieves state-of-the-art performance on rating prediction tasks. More importantly, LRMM is more robust than previous methods in alleviating data sparsity and the cold-start problem.


Introduction
Recommender systems (RS) are useful filtering tools which aid customers in a personalized way to make better purchasing decisions and whose recommendations are based on the customer's preferences and purchasing histories. Recommender systems can be roughly divided into collaborative filtering (CF) (Koren et al., 2009) and content-based filtering (CBF) (Pazzani and Billsus, 2007) methods. CF-based methods predict the product preference of users based on their previous purchasing and reviewing behavior by computing latent representations of users and products. Standard matrix factorization (MF) and its variants are widely used in CF approaches (Koren et al., 2009). While CF-based approaches were demonstrated to perform well in many application domains (Ricci et al., 2015), these methods are based solely on the sparse user-item rating matrix and, therefore, suffer from the so-called cold-start problem (Schein et al., 2002; Huang and Lin, 2016), as shown in Figure 1(b)+(c). For new users without a rating history and newly added products with few or no ratings, the systems fail to generate high-quality personalized recommendations. Alternatively, CBF approaches incorporate auxiliary modalities/information such as product descriptions, images, and user reviews to alleviate the cold-start problem by leveraging the correlations between multiple data modalities. Unfortunately, a pure CBF method often has difficulty generating recommendations from incomplete and missing data (Sedhain et al., 2015; Wang et al., 2016b; Volkovs et al., 2017; García-Durán et al., 2018).
Figure 1: Examples of typical multimodal product data from online retailers: image, title, description, reviews, star ratings. The cold-start problem is present in cases (b) and (c) where neither review text nor ratings are available.
* Work done in NEC Laboratories Europe.
In this work, a multimodal imputation framework (LRMM) is proposed to make RS robust to incomplete and missing modalities. First, LRMM learns multimodal correlations (Ngiam et al., 2011; Srivastava and Salakhutdinov, 2012; Wang et al., 2016a, 2018) from product images, product metadata (title+description), and product reviews. We propose modality dropout (m-drop), which randomly drops representations of some data modalities. In combination with the modality dropout approach, a sequential autoencoder (m-auto) for multimodal data is trained to reconstruct missing modalities and, at test time, is used to impute missing modalities through its learned reconstruction function.
Multimodal imputation for recommender systems is a non-trivial issue. (1) Existing RS methods usually assume that all data modalities are available during training and inference. In practice, however, incomplete and missing data modalities are very common. (2) At its core it addresses the cold-start problem. In the context of missing modalities, cold-start can be viewed as missing user or item preference information.
With this paper we make the following contributions:
• For the first time, we introduce multimodal imputation in the context of recommender systems.
• We reformulate the data-sparsity and cold-start problem when data modalities are missing.
• We show that the proposed method achieves state-of-the-art results and is competitive with or outperforms existing methods on multiple data sets.
• We conduct additional extensive experiments to empirically verify that our approach alleviates the missing data modalities problem.
The rest of the paper is structured as follows: Section 2 introduces our proposed methods. Section 3 describes the experiments and reports on the empirical results. In Section 4 we discuss the method and its advantages and disadvantages, and in Section 5 we discuss related work. Section 6 concludes this work.

Proposed Methods
The general framework of LRMM is depicted in Figure 2. There are two objectives for LRMM: (1) learning multimodal embeddings that capture inter-modal correlations, complementing missing modalities (Sec. 2.1); (2) learning intra-modal distributions where missing modalities are reconstructed via a missing modality imputation mechanism (Sec. 2.2 and 2.3).
Figure 2: Overview of LRMM. It adopts a CNN for visual embeddings (pink part) and three LSTMs for textual embeddings of user review text (red part), item review text (green part) and item metadata (blue part), respectively. The generative (autoencoder) model is used to reconstruct modality-specific embeddings and impute missing modalities. Missing user and item review text lead to user- and item-based cold-start, respectively.

Learning Multimodal Embeddings
We denote a user $u$ having $k$ review texts as $r^u = (r^u_{o_1}, r^u_{o_2}, \ldots, r^u_{o_k})$, where $r^u_{o_i}$ represents the review text written by $u$ for item $o_i$. An item $o$ is denoted as $r^o = (r^o_{u_1}, r^o_{u_2}, \ldots, r^o_{u_p})$, where $r^o_{u_j}$ represents the review text written by user $u_j$ for item $o$. Following Zheng et al. (2017), to represent each user and item, the reviews of $u$ and $o$ are concatenated into one review history document:
$$D^u = r^u_{o_1} \oplus r^u_{o_2} \oplus \cdots \oplus r^u_{o_k}, \qquad D^o = r^o_{u_1} \oplus r^o_{u_2} \oplus \cdots \oplus r^o_{u_p},$$
where $\oplus$ is the concatenation operator. Similarly, the metadata of each item $o$ can be represented as $D^m$. For readability, we use $u$, $o$, $m$, $v$ to denote the user, item, metadata, and image modality, respectively.
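The concatenation step can be illustrated with a minimal Python sketch. The `(user_id, item_id, text)` tuple layout, the function name `build_review_documents`, and the use of a single space as the concatenation operator are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import defaultdict


def build_review_documents(reviews):
    """Concatenate reviews into one document per user and one per item.

    `reviews` is an iterable of (user_id, item_id, text) tuples; this
    layout is an assumption made for the sketch.
    """
    user_texts = defaultdict(list)
    item_texts = defaultdict(list)
    for user_id, item_id, text in reviews:
        user_texts[user_id].append(text)  # reviews written by this user
        item_texts[item_id].append(text)  # reviews written for this item
    # Joining with a space plays the role of the concatenation operator.
    d_u = {u: " ".join(texts) for u, texts in user_texts.items()}
    d_o = {o: " ".join(texts) for o, texts in item_texts.items()}
    return d_u, d_o


reviews = [
    ("u1", "o1", "great tent"),
    ("u1", "o2", "boots run small"),
    ("u2", "o1", "poles bent quickly"),
]
D_u, D_o = build_review_documents(reviews)
```

Here `D_u["u1"]` holds both reviews written by user `u1`, and `D_o["o1"]` holds both reviews received by item `o1`, in the order in which they were read.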
For text-based representation learning for users and items, unlike Zheng et al. (2017), in which CNNs (Convolutional Neural Networks) with Word2Vec (Mikolov et al., 2013) are employed, our method treats text as sequential data and learns embeddings over word sequences by maximizing the probability of each word sequence under a recurrent model $M_g$, $g \in \{u, o, m\}$. Here $(x^g_1, \ldots, x^g_{T_g})$ is the word sequence of either review or metadata text, each $x^g_t \in V$, and $V$ is a vocabulary set. $T_g$ is the length of the input and output sequence, and $e^g_t$ is the hidden state computed from the corresponding LSTM (Long Short-Term Memory) (Hochreiter and Schmidhuber, 1997), in which $i_t$, $f_t$, and $o_t$ are the input, forget, and output gates, respectively, $c_t$ is the memory cell, and $h_t$ is the hidden output that we use for computing the user or item embedding $e^g$, $g \in \{u, o, m\}$.
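For reference, the gate and cell names above match the standard LSTM formulation. A sketch of those updates is given below, assuming the usual parameterization; the weight names $W_\ast$, $U_\ast$, $b_\ast$, the candidate cell $\tilde{c}_t$, and the input $x_t$ (the word embedding at step $t$) are notational assumptions and are not taken from the paper.

```latex
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ denotes element-wise multiplication.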
As we treat each text document $D^g$ as a word sequence of length $T_g$, we adopt average pooling on word embeddings for each modality to obtain document-level representations:
$$e^g = \frac{1}{T_g} \sum_{t=1}^{T_g} e^g_t, \quad g \in \{u, o, m\}.$$
Visual embeddings $e^v$ are extracted with a pretrained CNN and transformed by a function $f$ with parameters $\Theta_f \in \mathbb{R}^{4096 \times d}$ to ensure that $e^v$ has the same dimension as the user embedding $e^u$, item embedding $e^o$, and metadata embedding $e^m$. The multimodal joint embedding can then be learned by a shared layer and used for making a prediction through a scoring function $f_s: \mathbb{R}^{4 \times d} \to \mathbb{R}^1$, parameterized with $W_s$ and $b_s$, which maps the multimodal joint embedding to a rating score.
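The pooling and scoring steps can be sketched in plain Python as follows. The linear form of the scoring layer, the concatenation of the four embeddings into one vector, and all numeric values are assumptions for illustration; the text only states that the scoring function is parameterized by weights and a bias.

```python
def average_pool(hidden_states):
    """Average T hidden vectors (each of length d) into one d-dimensional vector."""
    t = len(hidden_states)
    d = len(hidden_states[0])
    return [sum(h[j] for h in hidden_states) / t for j in range(d)]


def score(embeddings, w_s, b_s):
    """Concatenate modality embeddings and apply a linear scoring layer (assumed form)."""
    joint = [x for e in embeddings for x in e]
    return sum(w * x for w, x in zip(w_s, joint)) + b_s


# Toy example with d = 2; the hidden states stand in for LSTM outputs.
e_u = average_pool([[1.0, 2.0], [3.0, 4.0]])
e_o = average_pool([[0.0, 1.0], [2.0, 1.0]])
e_m = average_pool([[4.0, 0.0]])
e_v = [0.5, 0.5]  # stands in for a projected CNN feature

w_s = [0.1] * 8  # one weight per entry of the joint embedding
b_s = 1.0
rating = score([e_u, e_o, e_m, e_v], w_s, b_s)
```

With these toy values the joint embedding has eight entries that sum to 12, so the predicted score is approximately 2.2.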

Modality Dropout
Modality dropout (m-drop) is designed to remove a data modality during training according to some parametric distribution. This is motivated by dropout (Srivastava et al., 2014), which randomly masks hidden layer activations to zero to increase the generalization capability of the underlying model. More formally, m-drop changes the original feed-forward equation so that modalities can be dropped at random. Each sample is $X_1 = x_1, \ldots, x_{n_m}$, where $n_m$ is the number of modalities. $r^{(L)}$ is a vector of independent Bernoulli random variables, each of which has probability $p_m$ of being 1. $k^{(L)}$ is a vector of independent variables which indicate the dropout on a modality with a given probability. $\phi(\cdot)$ is an activation function. Figure 3 (a-b) shows how m-drop works. Note the differences between modality dropout (m-drop) and the original dropout: (1) m-drop specifically targets the multimodal scenario where some modalities are completely missing; and (2) m-drop is performed on the input layer ($L \equiv 0$).
Figure 3: Missing modality imputation. (a) Full training data, (b) m-drop randomly drops modalities, (c) m-auto learns to reconstruct missing data based on existing data. (d) Inference with missing modalities. Dropping the user and item views is equivalent to learning models that are able to address the cold-start problem.
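A minimal sketch of the idea is shown below: each modality embedding of a sample is kept or zeroed as a whole, using one Bernoulli draw per modality. The function name `modality_dropout`, the dictionary layout of a sample, and the choice to represent a dropped modality by an all-zero vector are assumptions for illustration and may differ from the authors' implementation.

```python
import random


def modality_dropout(sample, p_keep, rng):
    """Zero out whole modality embeddings.

    Each modality is kept with probability `p_keep` (one Bernoulli draw per
    modality, not per unit as in standard dropout). Returns the masked sample
    and the 0/1 mask that was drawn.
    """
    masked, mask = {}, {}
    for name, embedding in sample.items():
        keep = 1 if rng.random() < p_keep else 0
        mask[name] = keep
        masked[name] = list(embedding) if keep else [0.0] * len(embedding)
    return masked, mask


# Toy sample with the four modalities u, o, m, v and d = 2.
sample = {
    "u": [1.0, 2.0],
    "o": [3.0, 4.0],
    "m": [5.0, 6.0],
    "v": [7.0, 8.0],
}
masked_sample, drawn_mask = modality_dropout(sample, p_keep=0.5, rng=random.Random(0))
```

Unlike standard dropout, the mask has one entry per modality, so a dropped modality disappears entirely from the input, which mimics the missing-modality situation at inference time.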

Multimodal Sequential Autoencoder
The autoencoder has been used in prior work (Sedhain et al., 2015; Strub et al., 2016) to reconstruct missing elements (mostly ratings) in recommender systems. This is equivalent to the case of missing at random (MAR). For MAR, it is rare to have a large continuous block of missing entries (Tran et al., 2017). In contrast, when recommending with missing modalities, the missing entries typically occur in a large continuous block. For instance, an extreme case is the absence of all item reviews and ratings (data sparsity is 100%, leading to the so-called item cold-start problem). Existing methods (Lee and Seung, 2000; Koren, 2008; Marlin, 2003; Wang and Blei, 2011; McAuley and Leskovec, 2013; Zheng et al., 2017) have difficulties when entire data modalities are missing during the training and/or inference stages.
To address this limitation, we propose a multimodal sequence autoencoder (m-auto) to impute textual sequential embeddings and visual embeddings for the missing modalities. Modality-specific autoencoders are placed between the modality-specific encoders (i.e., CNN and LSTMs) and the shared layer (equation 11). The reconstruction layers, therefore, can capture the inter-modal and intra-modal correlations. More formally, for each data modality $g \in \{u, o, m, v\}$, a modality-specific encoder maps the input embedding $e^g_{in}$ to a hidden embedding $e^g_{hid}$ through a visible-to-hidden layer, and a modality-specific decoder maps $e^g_{hid}$ to the reconstruction $e^g_{recon}$ through a hidden-to-visible layer. $W_{vh} \in \mathbb{R}^{d \times d_h}$ and $W_{hv} \in \mathbb{R}^{d_h \times d}$ are the weights, and $b_{vh}$, $b_{hv}$ are the biases, of the visible-to-hidden and hidden-to-visible layers, respectively. $e^g_{in}$ and $e^g_{hid}$ represent the original and hidden word-level embeddings, and $e^g_{recon}$ is the reconstructed document-level embedding. $e^g$ is a modality-specific embedding.
m-auto is different from previous reconstruction models (Sedhain et al., 2015;Strub et al., 2016) in that its reconstructions are based on inter-modal and intra-modal correlations in the context of multimodal learning.
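The encoder and decoder of a single modality-specific autoencoder can be sketched as a forward pass in plain Python. The sigmoid activation on the hidden layer, the linear output layer, the representation of a dropped modality as a zero vector, and the hand-picked (untrained) weights are all assumptions for illustration; the paper's exact layer definitions may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(vec, weights, bias):
    """Row vector times matrix plus bias: out[j] = sum_i vec[i] * weights[i][j] + bias[j]."""
    return [
        sum(v * row[j] for v, row in zip(vec, weights)) + bias[j]
        for j in range(len(bias))
    ]


def encode(e_in, w_vh, b_vh):
    # Visible-to-hidden layer; the sigmoid activation is an assumption.
    return [sigmoid(z) for z in affine(e_in, w_vh, b_vh)]


def decode(e_hid, w_hv, b_hv):
    # Hidden-to-visible layer; a linear output is an assumption.
    return affine(e_hid, w_hv, b_hv)


def reconstruct(e_in, w_vh, b_vh, w_hv, b_hv):
    return decode(encode(e_in, w_vh, b_vh), w_hv, b_hv)


# Toy example with d = 2 and d_h = 2 (hand-picked weights, not trained).
w_vh = [[1.0, 0.0], [0.0, 1.0]]
b_vh = [0.0, 0.0]
w_hv = [[2.0, 0.0], [0.0, 2.0]]
b_hv = [0.0, 0.0]

missing = [0.0, 0.0]  # assumed representation of a dropped modality
imputed = reconstruct(missing, w_vh, b_vh, w_hv, b_hv)
```

In this toy setting the zero input is encoded to hidden activations of 0.5 and decoded to a non-zero vector, which is the value passed on to the shared layer in place of the missing embedding.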

Model Optimization
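The optimization of the network is formulated as a regression problem by minimizing the mean squared error (MSE) loss $L_{reg}$:
$$L_{reg} = \frac{1}{|D|} \sum_{i=1}^{|D|} (\hat{s}_i - s_i)^2 + \lambda \lVert \Theta_r \rVert_2^2,$$
where $\hat{s}$ and $s$ are the predicted and ground-truth rating scores, $|D|$ is the dataset size, $\lambda$ is the weight decay parameter, and $\Theta_r$ are the regression model parameters.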
To constrain the representations to be compact in reconstruction, a sparsity penalty term is utilized over the $h_n$ hidden units, where $\rho$ is the sparsity parameter and $\hat{\rho}_i$ is the average activation of hidden unit $i$. The reconstruction loss for each modality combines the reconstruction error with this penalty, weighted by the sparsity regularization parameter $\lambda_\rho$. The objective of the entire model then combines the regression loss with the reconstruction losses, where $\alpha$ and $\beta$ are learnable parameters. The model is learned in an end-to-end fashion through back-propagation (LeCun et al., 1989).
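The pieces of the objective can be sketched in plain Python under explicit assumptions: the sparsity penalty is taken to be the KL-divergence penalty commonly used in sparse autoencoders, weight decay is taken to be a squared L2 norm, and the total objective is taken to be a weighted sum of the regression loss and the per-modality reconstruction losses. These forms, the function names, and the toy numbers are illustrative and not confirmed by the source.

```python
import math


def regression_loss(pred, truth, params, lam):
    """Mean squared error plus L2 weight decay (assumed form)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)
    return mse + lam * sum(w * w for w in params)


def kl_sparsity(rho, rho_hat):
    """KL penalty between target sparsity rho and average activations rho_hat."""
    return sum(
        rho * math.log(rho / r) + (1 - rho) * math.log((1 - rho) / (1 - r))
        for r in rho_hat
    )


def reconstruction_loss(e, e_recon, rho, rho_hat, lam_rho):
    """Squared reconstruction error plus the weighted sparsity penalty."""
    error = sum((a - b) ** 2 for a, b in zip(e, e_recon))
    return error + lam_rho * kl_sparsity(rho, rho_hat)


def total_loss(l_reg, l_recon_per_modality, alpha, beta):
    """Weighted sum of the regression loss and all reconstruction losses."""
    return alpha * l_reg + beta * sum(l_recon_per_modality)


l_reg = regression_loss([4.0, 2.0], [5.0, 2.0], params=[1.0, 2.0], lam=0.1)
l_rec = reconstruction_loss([1.0, 0.0], [1.0, 0.5], rho=0.05, rho_hat=[0.05, 0.05], lam_rho=0.01)
objective = total_loss(l_reg, [l_rec, l_rec], alpha=1.0, beta=0.5)
```

The KL penalty is zero when every hidden unit's average activation equals the target sparsity and grows as activations move away from it, which is what pushes the hidden representation towards compactness.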

Experiments
This section evaluates LRMM on rating prediction tasks with real-world datasets. We first compare LRMM with recent methods (Sec. 3.4), then we empirically show the effectiveness of LRMM in alleviating the cold-start, the incomplete/missing data, and the data sparsity problem (Sec. 3.5-3.8).

Datasets and Evaluation Metrics
We conducted experiments on the Amazon dataset. The user and item documents are the concatenated reviews of users and items in the training data. $V$ is the vocabulary that was built from the reviews and metadata of the training data. Words with an absolute frequency of at least 20 are included in the vocabulary. To evaluate the proposed models on the task of rating prediction, we employed two metrics, namely, Root Mean Square Error (RMSE) and Mean Absolute Error (MAE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i,j} (\hat{s}_{i,j} - s_{i,j})^2}, \qquad \mathrm{MAE} = \frac{1}{N} \sum_{i,j} \lvert \hat{s}_{i,j} - s_{i,j} \rvert,$$
where $\hat{s}_{i,j}$ and $s_{i,j}$ represent the predicted rating score and the ground-truth rating score that user $i$ gave to item $j$, and $N$ is the number of evaluated ratings.
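Both metrics are straightforward to compute. The following plain-Python sketch uses their standard definitions on flat lists of predicted and ground-truth ratings; the toy ratings are made up for illustration.

```python
import math


def rmse(pred, truth):
    """Root mean square error over paired predictions and ground-truth ratings."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def mae(pred, truth):
    """Mean absolute error over paired predictions and ground-truth ratings."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


predicted = [4.0, 3.0, 5.0]
observed = [5.0, 3.0, 3.0]
print("RMSE:", rmse(predicted, observed))
print("MAE:", mae(predicted, observed))
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a fuller picture of how prediction errors are distributed.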

Baselines and Competing Methods
We compare our models with several baselines. The baselines can be categorized into three groups.
We also include a naive method, Offset (McAuley and Leskovec, 2013), which simply takes the average across all training ratings.
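As described, the Offset baseline predicts the same value for every user-item pair. A minimal sketch, with hypothetical function names and toy ratings:

```python
def fit_offset(train_ratings):
    """Return the global average of all training ratings."""
    return sum(train_ratings) / len(train_ratings)


def predict_offset(offset, n_pairs):
    """Predict the same global average for every user-item pair."""
    return [offset] * n_pairs


offset = fit_offset([5.0, 4.0, 3.0])
predictions = predict_offset(offset, n_pairs=2)
```

Because the prediction ignores both the user and the item, this baseline gives a floor against which personalized models can be judged.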

Implementation
We implemented LRMM with Theano. (Saxe et al., 2014). We used a batch size of 256, $\lambda = 0.0001$, sparsity parameter $\rho = 0.05$, $\lambda_\rho = 0.01$, an initial learning rate of 0.0001, and a dropout rate of 0.5 after the recurrent layer. The models were optimized with ADADELTA (Zeiler, 2012). The lengths of the user, item, and metadata documents $D^u$, $D^o$, and $D^m$ were fixed to $L = 100$; documents with more than 100 words were truncated. The image features were extracted from the first fully-connected layer of a CNN pre-trained on ImageNet (Russakovsky et al., 2015).
We implemented NMF and SVD++ with the SurPrise package. Offset and HFT were implemented by modifying the authors' implementation. For DeepCoNN, we adapted the implementation from Chen et al. (2018). The results for the other methods are taken from the literature.
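The text preprocessing described in this section (a vocabulary of words occurring at least 20 times, and documents fixed to a length of 100 words) can be sketched as follows. Whitespace tokenization, the padding token `<pad>` for short documents, and the function names are assumptions for illustration; the source only states the frequency threshold and the truncation.

```python
from collections import Counter


def build_vocabulary(documents, min_count=20):
    """Keep words whose absolute frequency over all documents is at least min_count."""
    counts = Counter(word for doc in documents for word in doc.split())
    return {word for word, count in counts.items() if count >= min_count}


def fix_length(document, max_len=100, pad="<pad>"):
    """Truncate a document to max_len words; pad shorter ones (padding is assumed)."""
    words = document.split()[:max_len]
    return words + [pad] * (max_len - len(words))


docs = ["good " * 25, "bad good"]
vocab = build_vocabulary(docs, min_count=20)
long_doc = fix_length("w " * 150)
short_doc = fix_length("a b", max_len=5)
```

In this toy corpus only `good` passes the frequency threshold, the 150-word document is cut to 100 words, and the two-word document is padded to the requested length.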

Comparison with State-of-the-Art
First, we compare LRMM with state-of-the-art methods listed in Sec. 3.2. In this setting, LRMM is trained with all data modalities and tested with different missing modality regimes. Table 2 lists the results on the four datasets. By leveraging multimodal correlations, LRMM significantly outperforms MF-based models (i.e. NMF, SVD++) and topic-based methods (i.e., URP, CTR, RMR, and HFT). LRMM also outperforms recent deep learning models (i.e., NRT, DeepCoNN) with respect to almost all metrics.
LRMM is the only method with a robust performance for the cold-start recommendation problem where user review or item review texts are removed. While cold-start recommendation is more challenging, LRMM(-U) and LRMM(-O) are still able to achieve a performance similar to that of the baselines in the standard recommendation setting. For example, LRMM(-O) obtains an RMSE of 1.101 versus 1.107 for NRT on Electronics, and an MAE of 0.680 versus 0.667 for DeepCoNN on S&O. We conjecture that the cross-modality dependencies (Srivastava and Salakhutdinov, 2012) make LRMM more robust when modalities are missing. Table 5 lists some randomly selected rating predictions. Similar to Table 2, missing user (-U) and item (-O) preference information significantly deteriorates the performance.

Cold-Start Recommendation
Prior work (McAuley and Leskovec, 2013) has considered users (items) with sparse preference information as the cold-start problem (e.g., Figure 1(d)), that is, where there is still some information available. In practice, preference information could be missing in larger quantities or even be entirely absent (e.g., Figure 1(b-c)). In this situation, the aforementioned methods are not applicable as they require some data to work with. In this experiment, we examine how LRMM leverages modality correlations to alleviate the data sparsity problem when training data becomes even sparser. To this end, we train models for the item cold-start problem by reducing the number of reviews (for LRMM) and ratings (for NMF and SVD++) of each item in the training set. Figure 4 demonstrates the robustness of LRMM when the training data becomes more sparse. Note that NMF and SVD++ fail to train models when there is no rating data available. In contrast, LRMM is trained by leveraging item images and metadata even if item reviews are completely missing for a product. The average number of reviews per item on this dataset is 16.7. Reducing the number of ratings to 5 severely degrades the performance of NMF, SVD++, and LRMM. However, LRMM remains rather stable in maintaining good performance when considering the performance degradation at 5, 1, and 0 reviews (ratings), respectively. One interesting observation is that, with a reduced number of reviews, the product metadata plays an increasingly important role in maintaining the performance: LRMM(-V) is close to LRMM(+F) in Figure 4, while the gap between LRMM(-M) and LRMM(+F) is large.
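The sparsification used in this experiment, keeping at most a fixed number of reviews per item in the training set, can be sketched as follows. The `(user_id, item_id, text)` tuple layout, the uniform random choice of which reviews to keep, and the function name are assumptions for illustration; the paper does not state how the retained reviews were selected.

```python
import random
from collections import defaultdict


def limit_reviews_per_item(reviews, k, seed=0):
    """Keep at most k reviews per item, chosen uniformly at random (assumed)."""
    rng = random.Random(seed)
    by_item = defaultdict(list)
    for review in reviews:
        by_item[review[1]].append(review)  # review = (user_id, item_id, text)
    kept = []
    for item_id in sorted(by_item):
        item_reviews = by_item[item_id]
        if len(item_reviews) > k:
            item_reviews = rng.sample(item_reviews, k)
        kept.extend(item_reviews)
    return kept


train = [
    ("u1", "o1", "great tent"),
    ("u2", "o1", "poles bent quickly"),
    ("u3", "o1", "packs small"),
    ("u1", "o2", "boots run small"),
    ("u2", "o2", "warm and light"),
]
sparse_train = limit_reviews_per_item(train, k=1)
cold_train = limit_reviews_per_item(train, k=0)
```

Setting `k` to 0 removes all reviews and corresponds to the fully cold-start setting in which only item images and metadata remain available.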

Missing Modality Imputation
The proposed m-drop and m-auto methods allow LRMM to be more robust to missing data modalities. Table 3 lists the results of training LRMM with missing data modalities for the modality dropout ratio $p_m = 0.5$ on the S&O and H&P datasets, respectively. Both RMSE and MAE of LRMM deteriorate but are still comparable to the MF-based approaches NMF and SVD++. However, the proposed method LRMM is robust to missing data in both the training and inference stages, a problem rarely addressed by existing approaches. In Figure 5, we visualize the modality-specific embeddings and their reconstructed embeddings for 100 randomly selected samples with t-SNE (van der Maaten and Hinton, 2008). The plots suggest that it is more challenging to reconstruct item metadata and image embeddings than user or item embeddings. One possible explanation is that some selected metadata contains noisy data (e.g., "ISBN -9780963235985", "size: 24 ×46" and "Dimensions: 15W × 22H") for which visual data is more diverse. This would increase the difficulty of incorporating visual data into the embeddings.

The Effect of Text Length
To alleviate the data sparsity problem, existing work (McAuley and Leskovec, 2013) concatenates review texts and utilizes topic modeling (e.g. HFT) or CNNs combined with Word2Vec (e.g. DeepCoNN) to learn user or item embeddings. In contrast, LRMM treats the concatenated reviews as sequential data and learns sequence embeddings with RNNs. In this experiment, we show that learning sequential embeddings is beneficial on sparse data because it is unnecessary to exploit all reviews in order to reach good performance. Figure 6 shows the performance of LRMM with varied word sequence lengths. In general, sequence embeddings learned with a larger length achieve better performance. Note that, by considering a certain number of words (e.g. L=50), LRMM is able to achieve a result as good as when accounting for more words (e.g. L=100 or 200). Although this is dataset-dependent to some degree, e.g., LRMM (L=200) improves RMSE and MAE by a certain margin as compared to L=100 on the H&P data, it demonstrates the superiority of sequential user or item embeddings as compared to topic and CNN+Word2Vec embeddings on more sparse data, as shown in Table 2.

Cross-Domain Adaptation
To consider an even more challenging situation, we explore cases where the full training set is missing. Inspired by the recent success of domain adaptation (DA) (Csurka, 2017), a special form of transfer learning (Pan and Yang, 2010; Weiss et al., 2016), we perform the recommendation task on the target domain test set $D^t_{test}$ (e.g., "Sport") but with the model $C$ trained on a different domain training set $D^s_{train}$ (e.g. "Movie"). This is achieved by extracting the multimodal embeddings on the source domain and by performing prediction on the target domain. Table 4 shows the performance of LRMM when performing adaptation from larger datasets to smaller datasets. Although the performance is not as good as on $D^s_{test}$, LRMM is still able to obtain decent results even without using training data $D^t_{train}$. Table 6 shows some example rating predictions on DA for different categories of products. It demonstrates the strong generalization capability of DA from one product category to another.

Discussion
Empirically, we have shown that multimodal learning (+F) plays an important role in mitigating the problems associated with missing data/modality and, in particular, those associated with the cold-start problem (-U and -O) of recommender systems. The proposed method LRMM is in line with and grounded in recent developments (e.g. DeepCoNN, NRT) to incorporate multimodal data. LRMM distinguishes itself from previous methods in two ways: (1) the cold-start problem is reformulated in the context of missing modalities; (2) a novel multimodal imputation method, which consists of m-drop and m-auto, is proposed to learn models that are more robust to missing data modalities in both the training and inference stages.

Related Work
Collaborative filtering (CF) is the most commonly used approach for recommender systems. CF methods generally utilize the item-user feedback matrix. Matrix factorization (MF) is the most popular CF method (Koren et al., 2009) due to its simplicity, performance, and high accuracy as demonstrated in previous work. Another strength of MF, making it widely used in recommender systems, is that side information other than existing ratings can easily be integrated into the model to further increase its accuracy. Such information includes social network data (Lagun and Agichtein, 2015; Zhao et al., 2016; Xiao et al., 2017), locations of users and items (Lu et al., 2017), and visual appearance (He and McAuley, 2016). Probabilistic Matrix Factorization (PMF) extends MF to a probabilistic linear model with Gaussian noise. Following PMF, there are many extensions (Chen et al., 2013; Zheng et al., 2016; He et al., 2016b, 2017) aiming to improve its accuracy.
Unfortunately, CF methods suffer from the cold-start problem when dealing with new items or users without rich information. Content-based filtering (CBF) (Pazzani and Billsus, 2007), on the other hand, is able to alleviate the cold-start problem by taking auxiliary product and user information (texts, images, videos, etc.) into consideration. Recently, several approaches (Almahairi et al., 2015; Xu et al., 2014; He et al., 2014; Tan et al., 2016) were proposed to consider the information of review text to address the data sparsity problem which leads to the cold-start problem. The topic model (e.g. LDA (Blei et al., 2003)) based approaches, including CTR (Wang and Blei, 2011), HFT (McAuley and Leskovec, 2013), RMR (Ling et al., 2014), TriRank (He et al., 2015), and sCVR (Ren et al., 2017), achieve significant improvements compared to previous work on recommender systems. Inspired by the recent success of deep learning techniques (Krizhevsky et al., 2012; He et al., 2016a), some deep network based recommendation approaches have been introduced (Sedhain et al., 2015; Wang et al., 2016b; Seo et al., 2017; Xue et al., 2017). Deep cooperative neural network (DeepCoNN) (Zheng et al., 2017) was introduced to learn a joint representation from items and users using two coupled networks for rating prediction. It is the first approach to represent users and items in a joint manner with review text. TransNets (Catherine and Cohen, 2017) extends DeepCoNN by introducing an additional latent layer representing the user-item pair. NRT is a method for rating prediction and abstractive tips generation. A four-layer neural network was used as the rating regression model. NRT outperforms the state-of-the-art methods on rating prediction. There is a large body of work on recommender systems, and we refer the reader to surveys of state-of-the-art CF-based approaches, CBF methods, and deep learning based methods (Shi et al., 2014; Lops et al., 2011).
Our work differs from previous work in that we simultaneously address various types of missing data together with the data-sparsity and cold-start problems.

Conclusion
We presented LRMM, a framework that improves the performance and robustness of recommender systems under missing data. LRMM makes novel contributions in two ways: multimodal imputation and jointly alleviating the missing modality, data sparsity, and cold-start problems for recommender systems. It learns to recommend when entire modalities are missing by leveraging inter- and intra-modal correlations from data through the proposed m-drop and m-auto methods. LRMM achieves state-of-the-art performance on multiple data sets. Empirically, we analyzed LRMM in different data sparsity regimes and demonstrated its effectiveness. In future work, we aim to explore a generalized domain adaptation approach for recommender systems with missing data modalities.