Using Images to Improve Machine-Translating E-Commerce Product Listings.

In this paper we study the impact of using images to machine-translate user-generated e-commerce product listings. We study how a multi-modal Neural Machine Translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attentional NMT and a Statistical Machine Translation (SMT) model. User-generated product listings often do not constitute grammatical or well-formed sentences. More often than not, they consist of the juxtaposition of short phrases or keywords. We train our models end-to-end as well as use text-only and multi-modal NMT models for re-ranking n-best lists generated by an SMT model. We qualitatively evaluate our user-generated training data and also analyse how adding synthetic data impacts the results. We evaluate our models quantitatively using BLEU and TER and find that (i) additional synthetic data has a generally positive impact on text-only and multi-modal NMT models, and that (ii) using a multi-modal NMT model for re-ranking n-best lists improves TER significantly across different n-best list sizes.


Introduction
In e-commerce, there is a strong requirement to make products accessible regardless of the customer's native language and home country, by leveraging the gains available from machine translation (MT). Among the challenges in automatic processing are the specialized language and grammar of listing titles, as well as a high percentage of user-generated content from non-business sellers, who often are not native speakers themselves.
We investigate the nature of user-generated auction listings' titles as listed on the eBay main site (http://www.ebay.com/). Product listings exhibit extremely high trigram perplexities even when the language model is trained (and applied) on in-domain data, which is a challenge not only for proper language models but also for automatic evaluation metrics such as the n-gram precision-based BLEU metric (Papineni et al., 2002). Nevertheless, when presenting humans with images of the product which come along with the auction titles, the listings are perceived as somewhere between "easy" and "neutral" to understand.
Images can bring useful complementary information to MT (Calixto et al., 2012; Hitschler et al., 2016; Huang et al., 2016). Therefore, we explore the potential of multi-modal, multilingual MT of auction listings' titles and product images from English into German. To that end, we compare eBay's production system, which due to service-level agreements is a classic phrase-based statistical MT (PBSMT) system, with two neural MT (NMT) systems. One of the NMT models is a text-only attentional NMT and the other is a multi-modal attentional NMT model trained using the product images as additional data.
PBSMT still outperforms both text-only and multimodal NMT models in the translation of product listings, contrary to recent findings (Bentivogli et al., 2016). Under the hypothesis that the amount of training data could be the culprit and since curated multilingual, multi-modal in-domain data is very expensive to obtain, we back-translate monolingual listings and incorporate them as additional synthetic training data. Utilising synthetic data leads to big gains in performance and ultimately brings NMT models closer to bridging the gap with an optimized PBSMT system. We also use multi-modal NMT models to rescore the output of a PBSMT system and show significant improvements in TER (Snover et al., 2006). This paper is structured as follows. In §2 we describe the text-only and multi-modal MT models we evaluate and in §3 the data sets we used, also introducing and discussing interesting findings. In §4 we discuss how we structure our quantitative evaluation, and in §5 we analyse and discuss our results. In §6 we discuss some relevant related work and in §7 we draw conclusions and devise future work.

Model
We first briefly introduce the two text-only baselines used in this work: a PBSMT model (§2.1) and a text-only attentive NMT model (§2.2). We then discuss the doubly-attentive multi-modal NMT model that we use in our experiments (§2.3), which is comparable to the model introduced by Calixto et al. (2016). The decoder of this model learns to independently attend to image patches and source-language words when generating translations.

Statistical Machine Translation (SMT)
We use a PBSMT model built with the Moses SMT Toolkit (Koehn et al., 2007). The language model (LM) is a 5-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995). We use minimum error rate training (Och, 2003) for tuning the model parameters for BLEU scores.

Text-only Neural Machine Translation (NMT_t)
We use the attentive NMT model introduced by Bahdanau et al. (2015) as our text-only NMT baseline. It is based on the encoder-decoder framework and it implements an attention mechanism over the source-sentence words. Let $X = (x_1, x_2, \cdots, x_N)$ and $Y = (y_1, y_2, \cdots, y_M)$ be one-hot representations of a sentence in a source language and its translation into a target language, respectively; the model is trained to maximise the log-likelihood of the target given the source.
The encoder is a bidirectional recurrent neural network (Schuster and Paliwal, 1997) with GRU units (Cho et al., 2014). The annotation vector for a given source word $x_i$, $i \in [1, N]$, is the concatenation of forward and backward vectors $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$ obtained with forward and backward RNNs, respectively, and $C = (h_1, h_2, \cdots, h_N)$ is the set of source annotation vectors.
The decoder is also a recurrent neural network, more specifically a neural LM (Bengio et al., 2003) conditioned upon its past predictions via its previous hidden state $s_{t-1}$ and the word emitted in the previous time step $y_{t-1}$, as well as the source sentence via an attention mechanism. The attention mechanism computes a context vector $c_t$ for each time step $t$ of the decoder, where this vector is a weighted sum of the source annotation vectors $C$: $c_t = \sum_{i=1}^{N} \alpha^{src}_{t,i} h_i$, where $\alpha^{src}_{t,i}$ is the normalised alignment matrix between each source annotation vector $h_i$ and the word to be emitted at time step $t$, and $v^{src}_a$, $U^{src}_a$ and $W^{src}_a$ are model parameters of the attention mechanism.
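For reference, the textual attention can be written out in the standard additive form of Bahdanau et al. (2015). This is a sketch under assumptions: the symbol $e^{src}_{t,i}$ for the unnormalised score is introduced here for illustration, and conditioning the score on $s_{t-1}$ follows the original formulation; the exact decoder state used in our implementation may differ.

```latex
\begin{align}
e^{src}_{t,i} &= (v^{src}_a)^{\top} \tanh\left(U^{src}_a s_{t-1} + W^{src}_a h_i\right), \\
\alpha^{src}_{t,i} &= \frac{\exp\left(e^{src}_{t,i}\right)}{\sum_{j=1}^{N} \exp\left(e^{src}_{t,j}\right)}, \\
c_t &= \sum_{i=1}^{N} \alpha^{src}_{t,i} h_i .
\end{align}
```

The softmax guarantees that the weights $\alpha^{src}_{t,i}$ are positive and sum to one over the $N$ source positions, so $c_t$ is a convex combination of the annotation vectors.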

Multi-modal Neural Machine Translation (NMT_m)
We use a publicly available pre-trained Convolutional Neural Network (CNN), namely the 50-layer Residual network (ResNet-50) of He et al. (2015), to extract convolutional image features $(a_1, \cdots, a_L)$ for all images in our dataset. These features are extracted from the res4f layer and consist of a 196 × 1024 matrix where each row (i.e., a 1024D vector) represents features from a specific area and therefore only encodes information about that specific area of the image. In our NMT experiments, the ResNet-50 network is fixed during training, and there is no fine-tuning done for the translation task.
The visual attention mechanism computes a context vector $i_t$ for each time step $t$ of the decoder similarly to the textual attention mechanism described in §2.2: $i_t = \sum_{l=1}^{L} \alpha^{img}_{t,l} a_l$, where $\alpha^{img}_{t,l}$ is the normalised alignment matrix between each image annotation vector $a_l$ and the word to be emitted at time step $t$, and $v^{img}_a$, $U^{img}_a$ and $W^{img}_a$ are model parameters of the attention mechanism.
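The visual attention step can be illustrated with a minimal pure-Python sketch. It assumes an additive (tanh) scoring function followed by a softmax, as in standard attentional NMT; the function name `additive_attention`, the toy dimensions and the random parameters are hypothetical and only serve the illustration. In our setting there would be L = 196 image annotation vectors of 1024 dimensions each.

```python
import math
import random


def additive_attention(query, annotations, v, U, W):
    """One attention step: return (context_vector, attention_weights).

    query       -- decoder state, list of d_s floats
    annotations -- list of L annotation vectors, each a list of d_a floats
    v           -- list of d_k floats
    U           -- d_k x d_s matrix (list of rows)
    W           -- d_k x d_a matrix (list of rows)
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_query = matvec(U, query)
    scores = []
    for a in annotations:
        projected_a = matvec(W, a)
        hidden = [math.tanh(p + q) for p, q in zip(projected_query, projected_a)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))

    # Numerically stable softmax over the L annotation positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: weighted sum of the annotation vectors.
    dim = len(annotations[0])
    context = [sum(w * a[j] for w, a in zip(weights, annotations))
               for j in range(dim)]
    return context, weights


# Toy example (assumed sizes): 4 image patches with 3-dimensional features.
rng = random.Random(0)
d_s, d_a, d_k, L = 2, 3, 5, 4
patches = [[rng.uniform(-1, 1) for _ in range(d_a)] for _ in range(L)]
state = [rng.uniform(-1, 1) for _ in range(d_s)]
v = [rng.uniform(-1, 1) for _ in range(d_k)]
U = [[rng.uniform(-1, 1) for _ in range(d_s)] for _ in range(d_k)]
W = [[rng.uniform(-1, 1) for _ in range(d_a)] for _ in range(d_k)]
image_context, alphas = additive_attention(state, patches, v, U, W)
print(alphas, image_context)
```

The same function applies unchanged to the textual attention of §2.2 when the annotations are the source annotation vectors rather than image patches; only the parameters differ between the two mechanisms.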

Data sets
The multi-modal NMT model we evaluate uses parallel sentences and an image as input. Thus, we use the data set of product listings and images produced by eBay. It consists of 23,697 triples of products, henceforth original, each containing (i) a listing in English, (ii) its translation into German and (iii) a product image. Validation and test sets used in our experiments consist of 480 and 444 triples, respectively. The curation of parallel product listings with an accompanying product image is costly and time-consuming, so the in-domain data is rather small. More easily accessible are monolingual German listings accompanied by the product image, where the source text input can be emulated by back-translating the target listing. For this set of experiments, we use 83,832 tuples, henceforth mono. Finally, we also use the publicly available Multi30k dataset, a multilingual expansion of the original Flickr30k (Young et al., 2014) with ∼30k pictures from Flickr, one description in English and one human translation of the English description into German.
Translating user-generated product listings has particular challenges; they are often ungrammatical and can be difficult to interpret in isolation even by a native speaker of the language, as can be seen in the examples in Table 1. To further demonstrate this issue, in Table 2 we show the number of running words as well as the perplexity scores obtained with LMs trained on three sets of different German corpora: the Multi30k, eBay's in-domain data and a concatenation of the WMT 2015 Europarl (Koehn, 2005), Common Crawl and News Commentary corpora (Bojar et al., 2015). We see that LM perplexities on eBay's test set are high even for an LM trained on eBay in-domain data. LMs trained on mixed-domain corpora such as the WMT 2015 corpora or the Multi30k have perplexities below 500 on the Multi30k test set, which is expected. However, when applied to eBay's test data, the perplexities computed can be over 60k. Conversely, an LM trained on eBay in-domain data, when applied to the Multi30k test set, also yields very high perplexity scores. These perplexity scores indicate that fluency might not be a good metric to use in our study, i.e. we should not expect fluent machine-translated output from a model trained on poorly fluent training data.
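As a reminder of what these perplexity figures measure, the following minimal sketch computes perplexity from per-token log-probabilities. It assumes natural-log probabilities and a plain per-token average; the function name `perplexity` is hypothetical, and LM toolkits may differ in details such as whether end-of-sentence tokens are counted.

```python
import math


def perplexity(log_probs):
    """Perplexity of a token sequence given its natural-log probabilities.

    Perplexity is the exponential of the average negative log-likelihood
    per token, so a lower value means the LM finds the text less surprising.
    """
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    avg_negative_log_likelihood = -sum(log_probs) / len(log_probs)
    return math.exp(avg_negative_log_likelihood)


# Toy example: an LM that assigns probability 1/4 to each of four tokens.
print(perplexity([math.log(0.25)] * 4))
```

Under this definition, a perplexity of 60k corresponds to the LM being, on average, as uncertain as a uniform choice among roughly 60,000 words at every position.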
Clearly, translating user-generated product listings is very challenging; for that reason, we decided to check with humans how they perceive that data with and without having the associated images available. We hypothesise that images bring additional understanding to their corresponding listings.

Source (target) product title-image assessment
A human evaluator is presented with the English (German) product listing. Half of the evaluators are also shown the product image, whereas the other half are not. For the first group, we ask two questions: (i) in the context of the product image, how easy is it to understand the English (German) product listing and (ii) how well does the English (German) product listing describe the product image. For the second group, we just ask (i) how easy it is to understand the English (German) product listing. In all cases humans must select from a five-level Likert scale where in (i) answers range from 1-Very easy to 5-Very difficult and in (ii) from 1-Very well to 5-Very poorly.
Table 3: Difficulty to understand product listings with and without images and adequacy of product listings and images. N is the number of raters.
Table 3 suggests that the intelligibility of both the English and German product listings is perceived to be somewhere between "easy" and "neutral" when images are also available. It is notable that, for German, there is a statistically significant difference between the group who had access to the image and the product listings (M=2.00, SD=.50) and the group who only viewed the listings (M=2.83, SD=.30), where F(1,13) = 6.72, p < 0.05. Furthermore, humans find that product listings describe the associated image somewhere between "well" and "neutral", with no statistically significant differences between the adequacy of product listings and images in different languages.
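The F statistic reported above comes from a one-way analysis of variance between the two rater groups. The sketch below shows how such a statistic and its degrees of freedom are obtained; the function name `one_way_anova_f` is hypothetical and the ratings in the example are made-up toy values, not the ratings collected in our study.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    groups -- list of lists of numeric ratings, one list per group.
    Raises ZeroDivisionError if there is no within-group variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation explained by group membership.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Residual variation inside each group.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy Likert ratings (illustrative only): with image vs. without image.
with_image = [2, 2, 1, 3, 2, 2, 2]
without_image = [3, 3, 2, 3, 3, 3, 3, 3]
print(one_way_anova_f([with_image, without_image]))
```

With two groups and 15 raters in total, the degrees of freedom are (1, 13), which matches the form F(1,13) used in the text.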
Altogether, we have a strong indication that images can indeed help an MT model translate product listings, especially for translations into German.

Experimental setup
The PBSMT model we use as a baseline is trained on 120k in-domain parallel sentences (§2.1).
To measure how well multi-modal and text-only NMT models perform when trained on exactly the same data with and without images, respectively, we trained them only on the original and the Multi30k data sets. We did not use the additional parallel but out-of-domain data that had been used to train eBay's PBSMT production system (see §5). Training our text-only NMT_t baseline on this large corpus would not help shed more light on how multi-modality helps MT, since it has no images available and thus cannot be used to train the multi-modal model NMT_m. Rather, we report results of re-ranking experiments using n-best lists generated by eBay's best-performing PBSMT production system.
In order to measure the impact of the training data size on MT quality, we follow Sennrich et al. (2016) and back-translate the mono German product listings using our baseline NMT_t model trained on the German→English direction of the original corpus of 23,697 listing pairs (without images). These additional synthetic data (including images) are added to the 23,697 original triples and used in our translation experiments. We do not include the back-translated data set when training NMT models for re-ranking n-best lists, so as to be able to evaluate these two scenarios independently. We evaluate our models quantitatively using BLEU4 (Papineni et al., 2002) and TER (Snover et al., 2006) and report statistical significance computed using approximate randomisation with the Multeval toolkit (Clark et al., 2011).
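The idea behind approximate randomisation can be sketched as follows. This is a simplified illustration and not the Multeval implementation: it assumes the metric is an average of per-sentence scores, whereas BLEU and TER are computed at corpus level from per-sentence statistics. The function name `approximate_randomisation` and the default number of trials are hypothetical.

```python
import random


def approximate_randomisation(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference between two systems.

    scores_a, scores_b -- per-sentence scores of systems A and B on the
                          same test sentences, in the same order.
    Under the null hypothesis the two systems are interchangeable, so the
    outputs for each sentence are randomly swapped and the difference in
    mean score is recomputed for every trial.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n

    at_least_as_extreme = 0
    for _ in range(trials):
        sum_a = sum_b = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:
                a, b = b, a  # swap the two systems for this sentence
            sum_a += a
            sum_b += b
        if abs(sum_a - sum_b) / n >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Toy example: system A is better on every one of 20 sentences.
print(approximate_randomisation([1.0] * 20, [0.0] * 20, trials=2000))
```

A small p-value means that random reassignment of outputs rarely produces a difference as large as the observed one, which is the sense in which differences are marked as significant in our tables.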

Results
In Table 4 we present quantitative results obtained with the two text-only baselines SMT and NMT_t and one multi-modal model NMT_m.
It is clear that the gains from adding more data are much more apparent for the multi-modal NMT_m model than for the two text-only ones. This can be attributed to the fact that this model has access to more data, i.e. image features, and consequently can learn better representations derived from them. The PBSMT model's improvements are inconsistent; its TER score even deteriorates by 0.5 with the additional data. The same does not happen with the NMT models, which both (text-only and multi-modal) benefit from the additional data. Model NMT_m's gains are more than 3× larger than those of models NMT_t and SMT, indicating that it can properly exploit the additional data. Nevertheless, even with the added back-translated data, model NMT_m still falls behind the PBSMT model both in terms of BLEU and TER, although it seems to be catching up as the data size increases.
In Table 5, we show results for re-ranking 10- and 100-best lists generated by eBay's PBSMT production system. This system was trained with additional data sampled from out-of-domain corpora and also includes extra features and optimizations. Its BLEU score on the eBay test set is 29.0. Nevertheless, we still observe improvements in rescoring of n-best lists from this system using our "weaker" NMT models. When n = 10, both models NMT_t and NMT_m significantly improve the baseline in terms of TER, with model NMT_m performing slightly better. With larger lists (n = 100), it seems that both neural models have more difficulty re-ranking. Nonetheless, in this scenario model NMT_m still significantly improves the MT quality in terms of TER, while model NMT_t shows differences in BLEU and TER which are not statistically significant at the p ≤ 0.05 level. We note that model NMT_m's improvements in TER are consistent across different n-best list sizes; model NMT_t's improvements are not. The best BLEU (= 29.4) and TER (= 52.1) scores were achieved by model NMT_m when applied to re-rank 10-best lists, although model NMT_m still improves in terms of TER when n = 100. This suggests that model NMT_m can efficiently exploit the additional multi-modal signals.
Table 5: Results for re-ranking n-best lists generated for eBay's test set with text-only and multi-modal NMT models. † Difference is statistically significant (p ≤ 0.05). Best individual results are underscored, best overall results in bold. We also show the translation length for re-ranked n-best lists.
In order to check whether improvements observed in TER could be due to a preference of text-only and multi-modal NMT models for shorter sentences (Table 5), we also computed the average length of translations for n-best lists re-ranked with each of our models, and note that there is no significant difference between the length of translations for the baseline and the reranked models.
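The re-ranking step itself can be illustrated with a short sketch. It assumes that each hypothesis in the n-best list carries a score from the SMT system and that this score is linearly interpolated with an NMT model score; the interpolation, the `weight` parameter and the function name `rerank_nbest` are assumptions made for illustration, not a description of the exact combination used in our experiments.

```python
def rerank_nbest(nbest, nmt_score, weight=0.5):
    """Re-rank an n-best list with an additional model score.

    nbest     -- list of (hypothesis, smt_score) pairs; higher is better
    nmt_score -- callable mapping a hypothesis to a score, e.g. the
                 log-probability an NMT model assigns to it given the
                 source (and, for a multi-modal model, the image)
    weight    -- interpolation weight given to the NMT score (assumed)
    """
    def combined(item):
        hypothesis, smt_score = item
        return (1.0 - weight) * smt_score + weight * nmt_score(hypothesis)

    return sorted(nbest, key=combined, reverse=True)


# Toy example with a stand-in scorer (hypothetical values).
toy_nbest = [("hypothesis a", -1.0), ("hypothesis b", -2.0)]
toy_scores = {"hypothesis a": -5.0, "hypothesis b": -1.0}
best, _ = rerank_nbest(toy_nbest, toy_scores.get)[0]
print(best)
```

Because the hypotheses themselves are never altered, a re-ranker can only select among the translations the SMT system already proposed, which is why a larger n gives it more room both to improve and to make mistakes.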

Related work
NMT has been successfully tackled by different groups using the sequence-to-sequence framework (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014). However, multi-modal MT has just recently been addressed by the MT community in a shared task. In NMT, Bahdanau et al. (2015) first proposed to use an attention mechanism in the decoder. Their decoder learns to attend to the relevant source-language words as it generates each word of the target sentence. Since then, many authors have proposed different ways to incorporate attention into MT (Luong et al., 2015; Firat et al., 2016; Tu et al., 2016).
In the context of image description generation (IDG), Vinyals et al. (2015) proposed an influential neural IDG model based on the sequence-to-sequence framework and trained end-to-end. Elliott et al. (2015) put forward a model to generate multilingual descriptions of images by learning and transferring features between two independent, non-attentive neural image description models. Finally, Xu et al. (2015) proposed an attention-based model that learns to attend to specific areas of an image representation as it generates its description in natural language with a soft-attention mechanism.
Although no purely neural multi-modal model to date has significantly improved on both text-only NMT and SMT models on the Multi30k data set, different research groups have proposed to include images in re-ranking n-best lists generated by an SMT system or directly in an NMT framework with some success (Caglayan et al., 2016; Calixto et al., 2016; Huang et al., 2016; Libovický et al., 2016; Shah et al., 2016).
To the best of our knowledge, we are the first to study multi-modal NMT applied to the translation of product listings, i.e. for the e-commerce domain.

Conclusions and Future work
In this paper, we investigate the potential impact of multi-modal NMT in the context of e-commerce product listings. With only a limited amount of multi-modal and multilingual training data available, both text-only and multi-modal NMT models still fail to outperform a production SMT system, contrary to recent findings (Bentivogli et al., 2016). However, the introduction of back-translated data leads to substantial improvements, especially for a multi-modal NMT model. This seems to be an interesting approach that we will continue to explore in future work.
We also found that NMT models trained on small in-domain data sets can still be successfully used to rescore a standard PBSMT system with significant improvements in TER. Since we know from our experiments that LM perplexities are very high for e-commerce data, i.e. fluency is quite low, it seems fitting that BLEU scores do not improve as much. In future work, we will also conduct a human evaluation of the translations generated by the various systems.