Human Evaluation of Multi-modal Neural Machine Translation: A Case-Study on E-Commerce Listing Titles

In this paper, we study how humans perceive the use of images as an additional knowledge source to machine-translate user-generated product listings in an e-commerce company. We conduct a human evaluation where we assess how a multi-modal neural machine translation (NMT) model compares to two text-only approaches: a conventional state-of-the-art attention-based NMT and a phrase-based statistical machine translation (PBSMT) model. We evaluate translations obtained with different systems and also discuss the data set of user-generated product listings, which in our case comprises both product listings and associated images. We found that humans preferred translations obtained with a PBSMT system to both text-only and multi-modal NMT over 56% of the time. Nonetheless, human evaluators ranked translations from a multi-modal NMT model as better than those of a text-only NMT over 88% of the time, which suggests that images do help NMT in this use-case.


Introduction
In e-commerce, leveraging Machine Translation (MT) to make products accessible regardless of the customer's native language or country of origin is a compelling use-case. In this work, we study how humans perceive the machine translation of user-generated auction listings' titles as listed on the eBay main site. Among the challenges for MT are the specialized language and grammar of listing titles, as well as a high percentage of user-generated content from non-business sellers, who are often not native speakers themselves. This is reflected in the data by extremely high trigram perplexities of product listings, which are in the thousands even for language models (LMs) trained on in-domain data, as we discuss in §3. This is a challenge not only for LMs but also for automatic evaluation metrics such as the n-gram precision-based BLEU metric (Papineni et al., 2002).
The majority of listings are accompanied by a product image, often (but not always) a user-generated shot. Moreover, images are known to bring useful complementary information to MT (Calixto et al., 2012; Hitschler et al., 2016; Huang et al., 2016; Calixto et al., 2017b). Therefore, in order to explore whether product images can benefit the machine translation of auction titles, we compare a multi-modal neural MT (NMT) system to eBay's production system, specifically a phrase-based statistical MT (PBSMT) one. We additionally train a text-only attention-based NMT baseline, so as to be able to measure any gains from the additional multi-modal data independently of the MT architecture.
According to a quantitative evaluation using a combination of four automatic MT evaluation metrics, a PBSMT system outperforms both text-only and multi-modal NMT models in the translation of product listings, contrary to recent findings (Bentivogli et al., 2016). We hypothesise that these automatic metrics were not designed to measure the impact an image brings to an MT model, so we conduct a human evaluation of translations generated by three different systems: a PBSMT, a text-only attention-based NMT and a multi-modal NMT system. With this human evaluation we wish to see whether human judgements corroborate the automatic scores or instead support results reported in recent papers in the literature.
The remainder of the paper is structured as follows. In §2 we briefly describe the text-only and multi-modal MT models we evaluate in this work and in §3 the data sets we used, together with a discussion of interesting findings. In §4 we discuss how we structure our evaluation and in §5 we analyse and discuss our results. In §6 we discuss important related work and finally in §7 we draw conclusions and suggest avenues for future work.

MT Models evaluated in this work
We first introduce the two text-only baselines used in this work: a PBSMT model (§2.1) and a text-only attention-based NMT model (§2.2). We then briefly discuss the doubly-attentive multi-modal NMT model we use in our experiments (§2.3), which is comparable to the model evaluated by Calixto et al. (2016) and further detailed and analysed in Calixto et al. (2017a).
Figure 1: Decoder RNN with attention over source sentence and image features. This decoder learns to independently attend to image patches and source-language words when generating translations.

Statistical Machine Translation (SMT)
We use a PBSMT model where the language model (LM) is a 5-gram LM with modified Kneser-Ney smoothing (Kneser and Ney, 1995). We use minimum error rate training (Och, 2003) for tuning the model parameters using BLEU as the objective function.

Text-only NMT (NMT t )
We use the attention-based NMT model introduced by Bahdanau et al. (2015) as our text-only NMT baseline. It is based on the encoder-decoder framework and implements an attention mechanism over the source-sentence words $X = (x_1, x_2, \cdots, x_N)$, where $Y = (y_1, y_2, \cdots, y_M)$ is its target-language translation. A model is trained to maximise the log-likelihood of the target given the source.
The encoder is a bidirectional recurrent neural network (RNN) with GRU units (Cho et al., 2014). The annotation vector for a given source word $x_i$ is the concatenation of forward and backward vectors $h_i = [\overrightarrow{h}_i ; \overleftarrow{h}_i]$ obtained with forward and backward RNNs, respectively, and $C = (h_1, h_2, \cdots, h_N)$ is the set of source annotation vectors.
The decoder is also an RNN, more specifically a neural LM (Bengio et al., 2003) conditioned upon its past predictions via its previous hidden state $s_{t-1}$ and the word emitted in the previous time step $y_{t-1}$, as well as the source sentence via an attention mechanism. The attention computes a context vector $c_t$ for each time step $t$ of the decoder, where this vector is a weighted sum of the source annotation vectors $C$:
$$e^{src}_{t,i} = (v^{src}_a)^{\top} \tanh(U^{src}_a s_{t-1} + W^{src}_a h_i),$$
$$\alpha^{src}_{t,i} = \frac{\exp(e^{src}_{t,i})}{\sum_{j=1}^{N} \exp(e^{src}_{t,j})},$$
$$c_t = \sum_{i=1}^{N} \alpha^{src}_{t,i} h_i,$$
where $\alpha^{src}_{t,i}$ is the normalised alignment matrix between each source annotation vector $h_i$ and the word to be emitted at time step $t$, and $v^{src}_a$, $U^{src}_a$ and $W^{src}_a$ are model parameters.
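To make the attention step concrete, the following is a minimal pure-Python sketch of one decoder time step: each annotation vector is scored against the previous decoder state, the scores are normalised with a softmax, and the context vector is the resulting weighted sum. It assumes the standard additive scoring form of Bahdanau et al. (2015); the function name `attention_context` and the toy parameter shapes are illustrative assumptions, not taken from the system evaluated here.

```python
import math


def attention_context(prev_state, annotations, U, W, v):
    """One additive-attention step (illustrative sketch, hypothetical names).

    prev_state:  previous decoder state s_{t-1}, a list of floats
    annotations: list of annotation vectors h_i (lists of floats)
    U, W:        parameter matrices given as lists of rows
    v:           parameter vector
    Returns (weights, context).
    """
    def matvec(matrix, vector):
        return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

    projected_state = matvec(U, prev_state)
    scores = []
    for h in annotations:
        projected_h = matvec(W, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_state, projected_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over the scores (shifted by the maximum for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: weighted sum of the annotation vectors.
    dim = len(annotations[0])
    context = [sum(w * h[d] for w, h in zip(weights, annotations))
               for d in range(dim)]
    return weights, context
```

With all-zero parameters every annotation receives the same score, so the weights are uniform and the context vector is simply the mean of the annotation vectors.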

Multi-modal NMT (NMT m )
We use a multi-modal NMT model similar to the one evaluated by Calixto et al. (2016) and further studied in Calixto et al. (2017a), illustrated in Figure 1. It can be seen as an expansion of the attentive NMT framework described in §2.2 with the addition of a visual component to incorporate local visual features.
We use a publicly available pre-trained Convolutional Neural Network (CNN), namely the 50-layer Residual Network (ResNet-50) of He et al. (2016), to extract convolutional image features $(a_1, \cdots, a_L)$ for all images in our dataset. These features are extracted from the res4f layer and consist of a 196 × 1024 matrix where each row, i.e. a 1024D vector, represents features from a specific area and so only encodes information about that specific area of the image.
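The visual attention mechanism computes a context vector $i_t$ for each time step $t$ of the decoder similarly to the textual attention mechanism described in §2.2:
$$e^{img}_{t,l} = (v^{img}_a)^{\top} \tanh(U^{img}_a s_{t-1} + W^{img}_a a_l),$$
$$\alpha^{img}_{t,l} = \frac{\exp(e^{img}_{t,l})}{\sum_{j=1}^{L} \exp(e^{img}_{t,j})},$$
$$i_t = \sum_{l=1}^{L} \alpha^{img}_{t,l} a_l,$$
where $\alpha^{img}_{t,l}$ is the normalised alignment matrix between each image annotation vector $a_l$ and the word to be emitted at time step $t$, and $v^{img}_a$, $U^{img}_a$ and $W^{img}_a$ are model parameters.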

Data sets
We use the eBay24k data set, which consists of 23,697 tuples, each containing (i) a product listing in English, (ii) a product listing in German and (iii) a product image. In ∼6k training tuples, the original user-generated product listing was given in English and was manually translated into German by in-house experts. The same holds for the validation and test sets, which contain 480 and 444 triples, respectively. In the remaining training tuples (∼18k), the original listing was given in German and manually translated into English. We also use the publicly available Multi30k dataset, a multilingual expansion of the original Flickr30k (Young et al., 2014) with ∼30k pictures from Flickr, each accompanied by one description in English and one human translation of the English description into German.
Although the curation of in-domain parallel product listings with an associated product image is costly and time-consuming, monolingual German listings with an image are far simpler to obtain. In order to increase the small amount of training data, we train the text-only model NMT t on the German-English eBay24k and Multi30k data sets (without images) and back-translate 83,832 German in-domain product listings into English. We use the synthetic English, original German and original image as additional training tuples, henceforth eBay80k.
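The back-translation step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in this work: `back_translate_triples` and the `translate_de_en` callable are hypothetical names, with the callable standing in for a trained German-to-English text-only model.

```python
from typing import Callable, List, Sequence, Tuple

# (synthetic English listing, original German listing, image identifier)
Triple = Tuple[str, str, str]


def back_translate_triples(
    german_listings: Sequence[str],
    image_ids: Sequence[str],
    translate_de_en: Callable[[str], str],
) -> List[Triple]:
    """Pair each German listing and its image with a synthetic English source.

    translate_de_en is a placeholder for a German-to-English MT system
    (hypothetical; any callable mapping a string to a string works here).
    """
    if len(german_listings) != len(image_ids):
        raise ValueError("each listing needs exactly one image")
    triples = []
    for german, image_id in zip(german_listings, image_ids):
        synthetic_english = translate_de_en(german)
        triples.append((synthetic_english, german, image_id))
    return triples
```

The German side and the image stay untouched; only the English side is machine-generated, so any noise introduced by back-translation ends up on the source side of the English-to-German training data.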
The translation of user-generated product titles raises particular challenges; they are often ungrammatical and can be difficult to interpret in isolation even by a native speaker of the language, as illustrated in Table 1. We note that the listings in both languages have many scattered keywords and/or phrases glued together, as well as a few typos (e.g., the English listing in the first example). Moreover, in the second example the product image has a white frame surrounding it. These are all complications that make the multi-modal MT of product listings a challenging task, with distinct difficulties arising from processing the listings and from processing the images.
To further demonstrate these issues, we compute perplexity scores with LMs trained on one general-domain and one in-domain German corpus: the Multi30k (∼29k sentences) and eBay's in-domain data (∼99k sentences), respectively. The LM trained on the Multi30k computes a perplexity of 25k on the eBay test set, and the LM trained on the in-domain eBay data produces a perplexity of 4.2k on the Multi30k test set. We note that the LM trained on eBay's in-domain data still computes a very high perplexity on eBay's test set (ppl = 1.8k). These perplexity scores indicate that fluency might not be a good metric to use in our study, i.e. we should not expect fluent machine-translated output from a model trained on poorly fluent training data.
Table 2: Difficulty to understand product listings with and without images and adequacy of listings and images. N is the number of raters (Calixto et al., 2017b).
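For reference, perplexity is the exponential of the average negative log-probability an LM assigns to each token of a test set. A minimal sketch, assuming natural-log token probabilities are already available from some LM (the function name `perplexity` is illustrative):

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities assigned by an LM."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    average_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(average_nll)
```

Under this definition, a perplexity of 1.8k means the LM is, on average, as uncertain as if it were choosing uniformly among roughly 1,800 candidate tokens at every position.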

English and German product listings
Clearly, user-generated product listings are not very fluent in terms of grammar or even predictable word order. To better understand whether this has an impact on semantic intelligibility, Calixto et al. (2017b) have recently conducted experiments using eBay data to assess how challenging listings are to understand for a human reader. Specifically, they asked users how they perceive product listings with and without having the associated images available, under the hypothesis that images bring additional understanding to their corresponding listings.
In Table 2, we show results which suggest that the intelligibility of both the English and German product listings is perceived to be somewhere between "easy" and "neutral" when images are also available. It is notable that, in the case of German, there is a statistically significant difference between the group who had access to the image and the product listing (M=2.00, SD=.50) and the group who only viewed the listing (M=2.83, SD=.30), where F(1,13) = 6.72, p < 0.05. Furthermore, humans find that product listings describe the associated image somewhere between "well" and "neutral", with no statistically significant differences between the adequacy of product listings and images in different languages (Calixto et al., 2017b).
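The F statistic reported above compares between-group and within-group variance. The sketch below computes a one-way ANOVA F value and its degrees of freedom in plain Python; the function name is illustrative and any ratings passed to it here would be hypothetical, since the individual ratings are not part of this paper.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of ratings."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more ratings than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the individual ratings around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    df_between, df_within = k - 1, n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within
```

With two groups, degrees of freedom of (1, 13) correspond to 15 ratings in total, for instance groups of 7 and 8 raters.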
Altogether, we have a strong indication that images can indeed help an MT model translate product listings, especially for translations into German.
Figure 2: Models PBSMT, NMT t and NMT m ranked by humans from best to worst.

Experimental set-up
We use the eBay24k, the additional back-translated eBay80k and the Multi30k data sets to train all our models. In our experiments, we wish to contrast the human assessments of the adequacy of translations obtained with two text-only baselines, PBSMT and NMT t, and one multi-modal model NMT m, with scores computed with four automatic MT metrics: BLEU4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), TER (Snover et al., 2006), and chrF3 (Popović, 2015). We report statistical significance with approximate randomisation for the first three metrics using the MultEval tool (Clark et al., 2011). For our qualitative human evaluation, we ask bilingual native German speakers:
1. to assess the multi-modal adequacy of translations (number of participants N = 18), described in §4.1;
2. to rank translations generated by different models from best to worst (number of participants N = 18), described in §4.2.
Our evaluators consisted of 72% women and 28% men. They were recruited from employees at eBay Inc., Aachen, Germany, as well as the student and staff body of Dublin City University, Dublin, Ireland.
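Of the four automatic metrics, chrF3 is the simplest to illustrate: it is an F-score over character n-grams with recall weighted three times as heavily as precision (β = 3). The following is a simplified sentence-level sketch, not the reference implementation; in particular, stripping all whitespace and averaging precision and recall uniformly over n = 1..6 are simplifying assumptions, and the function names are illustrative.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Simplified sentence-level chrF; beta=3 corresponds to chrF3."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to contain n-grams of this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```

Because it operates on characters rather than words, a metric of this kind gives partial credit to compound words and sub-word matches, which is relevant for German listing titles.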

Adequacy
Humans are presented with an English product listing, a product image and a translation generated by one of the models (without knowing which model). They are then asked how much of the meaning of the source is also expressed in the translation, taking the product image into consideration. They must then select from a four-level Likert scale where the answers range from 1 ("All of it") to 4 ("None of it").

Ranking
We present humans with a product image and three translations obtained from different models for a particular English product listing (without identifying the

Results
In Table 3, we contrast the human assessments of the adequacy of translations obtained with two text-only baselines, PBSMT and NMT t, and one multi-modal model NMT m, with scores obtained with four automatic MT metrics.
Both models NMT m and PBSMT improve on the translations of model NMT t according to the first three automatic metrics (p < 0.01), and we also observe improvements in chrF3. Although a one-way ANOVA did not show any statistically significant differences in adequacy between NMT m and NMT t (F(2, 18) = 1.29, p > 0.05), human evaluators ranked NMT m as better than NMT t over 88% of the time, a strong indication that images do help neural MT and bring important information that the multi-modal model NMT m can efficiently exploit.
If we compare models NMT m and PBSMT, the latter outperforms the former according to BLEU, METEOR and chrF3, but they are practically equal according to TER. Additionally, the adequacy scores for both these models are, on average, the same according to scores computed over N = 18 different human assessments. Nonetheless, even though both models NMT m and PBSMT are found to produce equally adequate output, translations obtained with PBSMT are ranked best by humans 56.3% of the time, while translations obtained with the multi-modal model NMT m are ranked best 24.8% of the time, as can be seen in Figure 2.
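The two kinds of ranking statistics used above, how often a system is ranked best and how often one system is ranked above another, can be derived from raw rankings as sketched below. The function names are illustrative assumptions, and the sketch assumes strict rankings without ties.

```python
def ranked_best_share(rankings):
    """Fraction of rankings in which each system appears in first place.

    rankings: list of lists of system names, ordered from best to worst.
    """
    counts = {}
    for ranking in rankings:
        best = ranking[0]
        counts[best] = counts.get(best, 0) + 1
    total = len(rankings)
    return {system: count / total for system, count in counts.items()}


def pairwise_preference(rankings, system_a, system_b):
    """Fraction of rankings in which system_a is placed above system_b."""
    wins = sum(1 for r in rankings if r.index(system_a) < r.index(system_b))
    return wins / len(rankings)
```

Note that the two statistics answer different questions: a system can be preferred over another in most pairwise comparisons while still rarely being ranked first overall.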
We stress that the multi-modal model NMT m consistently outperforms the text-only model NMT t according to all four automatic metrics used in this work. Translations generated by model NMT m contain many neologisms, possibly due to training these models using sub-word tokens rather than just words (Sennrich et al., 2016). Some examples are: "sammlerset", "garagenskateboard", "kampffaltschlocker", "schneidsattel" and "oberreceiver". We argue that this generative quality of the NMT models and the data sets evaluated in this work could have made translations more confusing for native German speakers to understand, hence the preference for the SMT translations.
We note that the pairwise inter-annotator agreement for the ranking task shows fair agreement among the annotators (κ = 0.30), computed using Cohen's kappa coefficient (Cohen, 1960). For all the other evaluations, according to Landis and Koch (1977) the pairwise inter-annotator agreement can be interpreted as slight (κ = 0.15 for the multi-modal translation adequacy). The lower agreement score seems plausible since our annotators were crowdsourced and so had more limited guidelines and less training for the tasks than would have been ideal.
For the first three metrics, results are significantly better than those of NMT t (†) or NMT m (‡) with p < 0.01.
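Cohen's kappa corrects the raw agreement between two annotators for the agreement expected by chance given each annotator's label distribution. A minimal sketch for a single pair of annotators follows (the function name is illustrative; averaging over all annotator pairs to obtain a pairwise figure is left out):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement from the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

On the Landis and Koch (1977) scale, values from 0.00 to 0.20 are read as slight agreement and values from 0.21 to 0.40 as fair agreement.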

Related work
Multi-modal MT has only recently been addressed by the MT community in a shared task, where many different groups proposed techniques for multi-modal translation using different combinations of NMT and SMT models (Caglayan et al., 2016; Calixto et al., 2016; Huang et al., 2016; Libovický et al., 2016; Shah et al., 2016). In the multi-modal translation task, participants are asked to train models to translate image descriptions from one natural language into another, while also taking the image itself into consideration. This effectively bridges the gap between two well-established tasks: image description generation (IDG) and MT.
There is an important body of research conducted in IDG. We highlight the work of Vinyals et al. (2015), who proposed an influential neural IDG model based on the sequence-to-sequence framework. They used global visual features to initialise an RNN LM decoder, used to generate the image descriptions in a target language, word by word. In contrast, Xu et al. (2015) were among the first to propose an attention-based model that learns to attend to specific areas of an image representation as it generates its description in natural language with a soft-attention mechanism. In their model, local visual features were used instead. In both cases, as well as in this work and in most of the state-of-the-art models in the field, models transferred learning from CNNs pre-trained for image classification on ImageNet (Russakovsky et al., 2015).
In NMT, Bahdanau et al. (2015) were the first to propose the use of an attention mechanism in the decoder. Their decoder learns to attend to the relevant source-language words as it generates a sentence in the target language, again word by word. Since then, many authors have proposed different ways to incorporate attention into MT. Luong et al. (2015) proposed, among other things, a local attention mechanism that was less costly than the original global attention; Firat et al. (2016) proposed a model to translate from many source languages and into many target languages, which involved a shared attention mechanism strategy; Tu et al. (2016) proposed an attention coverage strategy, so that the model has explicit information about which source words were used to generate previous target words, and therefore addressed the problems of over- and under-translation.
Calixto et al. (2017b) have recently reported n-best list re-ranking experiments of e-commerce product listings using multi-modal eBay data. Whereas their focus is on improving translation quality with n-best list re-ranking experiments, in this work our focus is on the human evaluation of translations generated with the different text-only and multi-modal models. To the best of our knowledge, along with Calixto et al. (2017b) we are the first to study multi-modal NMT applied to the translation of product listings, i.e. for the e-commerce domain.

Conclusions and Future Work
In this paper, we investigate the potential impact of multi-modal NMT in the context of e-commerce product listings. Images bring important information to NMT models in this context; in fact, translations obtained with a multi-modal NMT model are preferred to ones obtained with a text-only model over 88% of the time. Nevertheless, humans still prefer phrase-based SMT over NMT output in this use-case. We attribute this to the nature of the task: listing titles have little syntactic structure and yet many rare words, which can produce many confusing neologisms especially if using subword units.
The core neural MT models still have to be improved significantly to address these challenges. However, in contrast to SMT, they already provide an effective way of improving MT quality with information contained in images. As future work, we will study the impact that additional back-translated data have on multi-modal NMT models.