Lessons Learned in Multilingual Grounded Language Learning

Recent work has shown how to learn better visual-semantic embeddings by leveraging image descriptions in more than one language. Here, we investigate in detail which conditions affect the performance of this type of grounded language learning model. We show that multilingual training improves over bilingual training, and that low-resource languages benefit from training with higher-resource languages. We demonstrate that a multilingual model can be trained equally well on either translations or comparable sentence pairs, and that annotating the same set of images in multiple languages enables further improvements via an additional caption-caption ranking objective.


Introduction
Multimodal representation learning is largely motivated by evidence of perceptual grounding in human concept acquisition and representation (Barsalou et al., 2003). It has been shown that visually grounded word and sentence representations (Baroni, 2016; Elliott and Kádár, 2017; Kiela et al., 2017; Yoo et al., 2017) improve performance on the downstream tasks of paraphrase identification, semantic entailment, and multimodal machine translation (Dolan et al., 2004; Marelli et al., 2014; Specia et al., 2016). Multilingual sentence representations have also been successfully applied to many-languages-to-one character-level machine translation (Chung et al., 2016) and multilingual dependency parsing (Ammar et al., 2016).
Recently, Gella et al. (2017) proposed to learn both bilingual and multimodal sentence representations using images paired with captions independently collected in English and German. Their results show that bilingual training improves image-sentence ranking performance over a monolingual baseline, and it improves performance on semantic textual similarity benchmarks (Agirre et al., 2014, 2015). These findings suggest that it may be beneficial to consider another language as another modality in a monolingual grounded language learning model. In the grounded learning scenario, descriptions of an image in multiple languages can be considered as multiple views of the same or closely related data. These additional views can help overcome the problems of data sparsity, and have practical implications for efficiently collecting image-text datasets in different languages. In real-life applications, many tasks and domains can involve code switching (Barman et al., 2014), which is easier to deal with using a multilingual model. Furthermore, it is more convenient to maintain a single multilingual system than one system for each considered language. However, there is a need for a systematic exploration of the conditions under which it is useful to add additional views of the data. We investigate the impact of the following conditions on the performance of a multilingual grounded language learning model in sentence and image retrieval tasks:

Additional languages: Multilingual models have not yet been explored in a multimodal setting. We investigate the contribution of adding more than one language by performing bilingual experiments on English and German (Section 5), as well as adding French and Czech captioned images (Section 6).

Translated vs. independently collected captions, and overlapping vs. disjoint images: captions in different languages may be translations of each other or independently collected, and they may describe the same images or disjoint sets of images whose captions are collected in different languages.

* Work carried out at the University of Edinburgh.
Such disjoint settings have been explored in pivot-based multimodal representation learning (Funaki and Nakayama, 2015; Rajendran et al., 2015) and zero-shot multimodal machine translation (Nakayama and Nishida, 2017). We compare translated vs. independently collected captions in Sections 5.2 and 6.1, and overlapping vs. disjoint images in Section 5.3.
High-to-low resource transfer: In Section 6.2 we investigate whether low-resource languages benefit from jointly training on larger data sets from higher-resource languages. This type of transfer has previously been shown to be effective in machine translation (e.g., Zoph et al., 2016).
Training objective: In addition to learning to map images to sentences, we study the effect of also learning relationships between captions of the same image in different languages (Gella et al., 2017). We assess the contribution of such a caption-caption ranking objective throughout our experiments.
Our results show that multilingual joint training improves upon bilingual joint training, and that grounded sentence representations for a low-resource language can be substantially improved with data from different high-resource languages.
Our results suggest that independently collected captions are more useful than translated captions for the task of learning multilingual multimodal sentence embeddings. Finally, we recommend collecting captions for the same set of images in multiple languages, due to the benefits of the additional caption-caption ranking objective function.

Related work
Learning visually grounded word representations has been an active area of research in the fields of multimodal semantics and cross-situational word learning. Such perceptually grounded word representations have been shown to lead to higher correlation with human judgements on word-similarity benchmarks such as WordSim353 (Finkelstein et al., 2001) or SimLex999 (Hill et al., 2015) compared to uni-modal representations (Kádár et al., 2015; Bruni et al., 2014; Kiela and Bottou, 2014).
Grounded representations of sentences that are learned from image-caption data sets also improve performance on a number of sentence-level tasks (Kiela et al., 2017; Yoo et al., 2017) when used as additional features to skip-thought vectors (Kiros et al., 2015). The model architectures used for these studies have the same overall structure as our model and coincide with image-sentence retrieval systems (Kiros et al., 2014; Karpathy and Fei-Fei, 2015): a pre-trained CNN is fixed or fine-tuned as an image feature extractor, followed by a learned transformation, while sentence representations are learned by a randomly initialized recurrent neural network. These models are trained to push the true image-caption pairs closer together, and the false image-caption pairs further from each other, in a joint embedding space.
Our work is also closely related to multilingual joint representation learning, in which a single model is trained to solve a task across multiple languages. Ammar et al. (2016) train a multilingual dependency parser on the Universal Dependencies treebank (Nivre et al., 2015) and show that, on average, the single multilingual model outperforms the monolingual baselines. Johnson et al. (2016) present a zero-shot neural machine translation model that is jointly trained on language pairs A ↔ B and B ↔ C and show that the model is capable of performing well on the unseen language pair A ↔ C. Lee et al. (2017) find that jointly training a many-languages-to-one translation model on unsegmented character sequences improves BLEU scores compared to monolingual training. They also show evidence that the model can handle intra-sentence code-switching. Peters et al. (2017) train a multilingual sequence-to-sequence translation architecture on grapheme-to-phoneme conversion using more than 300 languages. They report better performance when adding multiple languages, even those which are not present in the test data. Finally, massively multilingual language representations trained on over 900 languages have been shown to resemble language families (Östling and Tiedemann, 2016) and can successfully predict linguistic typology features (Malaviya et al., 2017).
In the vision and language domain, multilingual multimodal sentence representation learning has so far been limited to two languages. The joint training of models on English and German data has been shown to outperform monolingual baselines on image-sentence ranking and semantic textual similarity tasks (Gella et al., 2017; Calixto et al., 2017). Recently, Harwath et al. (2018) also showed the benefit of joint bilingual training in the domain of speech-to-image and image-to-speech retrieval using English and Hindi data.

Multilingual grounded learning
We train a standard model of grounded language learning which projects images and their textual descriptions into the same space (Kiros et al., 2014; Karpathy and Fei-Fei, 2015). The training procedure is illustrated by the pseudo-code in Figure 2. Images i are encoded by a fixed pre-trained CNN followed by a learned affine transformation ψ(i, θ_ψ), and captions c are encoded by a randomly initialized RNN φ(c, θ_φ). The model learns to minimize the distance between pairs ⟨a, b⟩ using a max-of-hinges ranking loss (Faghri et al., 2017):

L(a, b) = max_b̂ [α + s(a, b̂) − s(a, b)]_+ + max_â [α + s(â, b) − s(a, b)]_+

where [x]_+ = max(x, 0), α is the margin, s is a similarity function, ⟨a, b⟩ is the true pair, and ⟨a, b̂⟩ and ⟨â, b⟩ are all possible contrastive pairs in the mini-batch. The pairs either consist of image-caption pairs ⟨i, c⟩, where the model solves a caption-image (c2i) ranking task, or pairs of captions in multiple languages belonging to the same image ⟨c_a, c_b⟩, where the model solves a caption-caption (c2c) ranking task (Gella et al., 2017). Our monolingual models are trained to minimize the caption-image ranking objective c2i on the training set. The multilingual models are trained to minimize the ranking loss for the set of all languages L in the collection: at each iteration the model is either updated for the c2i objective or the caption-caption c2c objective, given either a ⟨c^l, i⟩ or a ⟨c^k_a, c^m_b⟩ pair in languages l, k, m, … ∈ L. All models are trained by first selecting a task, either c2i or c2c. In the c2i case, a language is sampled at random, followed by sampling a random batch; in the c2c case, all possible ⟨c_a, c_b⟩ pairs across all languages are treated as a single data set. All of the model parameters are shared across all tasks and languages.
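As a concrete illustration, the loss above can be computed over a mini-batch from an n×n similarity matrix whose diagonal holds the true pairs. The following is our own minimal sketch in NumPy (the paper's implementation uses PyTorch); the function name and the margin value of 0.2 are assumptions, not taken from the paper:

```python
import numpy as np

def max_of_hinges_loss(scores, margin=0.2):
    """Max-of-hinges ranking loss (in the style of Faghri et al., 2017).

    scores: (n, n) similarity matrix for a mini-batch, where
    scores[k, k] is the score of the true pair <a_k, b_k> and the
    off-diagonal entries are the contrastive pairs.
    """
    n = scores.shape[0]
    diag = scores.diagonal().reshape(n, 1)
    # hinge terms for contrastive b-hat (rows) and a-hat (columns)
    cost_b = np.maximum(0.0, margin + scores - diag)    # [alpha + s(a, b^) - s(a, b)]_+
    cost_a = np.maximum(0.0, margin + scores - diag.T)  # [alpha + s(a^, b) - s(a, b)]_+
    # the true pairs on the diagonal must not act as negatives
    np.fill_diagonal(cost_b, 0.0)
    np.fill_diagonal(cost_a, 0.0)
    # "max of hinges": only the hardest negative per row/column contributes
    return cost_b.max(axis=1).sum() + cost_a.max(axis=0).sum()
```

In the actual model, `scores` would hold the cosine similarities between the normalized caption and image embeddings (or between caption embeddings in two languages for the c2c task) in the batch.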
Implementation. We build our model on the PyTorch implementation of the VSE++ model (Faghri et al., 2017). Images are represented by the 2048D average-pool features extracted from the ResNet50 architecture (He et al., 2016) trained on ImageNet (Deng et al., 2009); this is followed by a trained linear layer W_I ∈ R^(2048×1024). Other implementation details follow Faghri et al. (2017): sentences are represented as the final hidden state of a GRU (Chung et al., 2014) with 1024 units and 300-dimensional word embeddings trained from scratch. We use a single word embedding matrix containing the union of all words in all considered languages. The similarity function s in the ranking loss is cosine similarity. We ℓ2-normalize both the caption and image representations. The model is trained with the Adam optimizer (Kingma and Ba, 2014) using default parameters and a learning rate of 2e-4. We train the model with an early stopping criterion, which is to maximise the sum of the image-sentence recall scores R@1, R@5, and R@10 on the validation set with a patience of 10 evaluations. In the monolingual setting the stopping criterion is evaluated at the end of each epoch, whereas in the multilingual setup it is evaluated every 500 iterations. The probability of switching between the c2i and c2c tasks is set to 0.5. Batches from all data sets are sampled by shuffling the full dataset, going through each batch, and re-shuffling when exhausted. The sentence-pair dataset used to train the c2c ranking model is generated as follows. For a given image i, a set of languages 1, …, L, and the caption sets C_i^1, …, C_i^L associated with image i, we generate all combinations of size 2 over the caption sets and add the Cartesian product C_i^m × C_i^n of each resulting pair to the training set.
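The c2c pair generation described above can be sketched as follows; the function name and the per-image dictionary layout are our own assumptions:

```python
from itertools import combinations, product

def c2c_pairs(captions_by_lang):
    """Generate all cross-lingual caption pairs for one image.

    captions_by_lang: dict mapping a language code to the list of
    captions C_i^l of that image in language l. For every unordered
    pair of languages (m, n) we add the Cartesian product
    C_i^m x C_i^n to the training set.
    """
    pairs = []
    for lang_m, lang_n in combinations(sorted(captions_by_lang), 2):
        pairs.extend(product(captions_by_lang[lang_m],
                             captions_by_lang[lang_n]))
    return pairs
```

Note that only cross-lingual pairs are produced: for an image with five English and five German comparable captions this yields 25 pairs, and same-language caption pairs are never generated.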

Experimental setup
Datasets. We train and evaluate our models on the translation and comparable portions of the Multi30K dataset (Elliott et al., 2016). The translation portion (a low-resource dataset) contains 29K images, each described by one English caption with German, French, and Czech translations. The comparable portion (a higher-resource dataset) contains the same 29K images paired with five English and five German descriptions collected independently. Figure 1 presents an example of the translation and comparable portions of the data. We used the preprocessed version of the dataset, in which the text is lowercased, punctuation is normalized, and the text is tokenized. To reduce the vocabulary size of the joint models, we replace all words occurring fewer than four times with a special "UNK" symbol. Table 1 shows the overlap between the vocabularies of the translation portion of the Multi30K dataset. The total vocabulary size summed across all four languages is 17,571 tokens, and taking the union of the tokens in these four languages results in a vocabulary of 16,553 tokens, a 6% reduction in vocabulary size. On the comparable portion of the dataset, the total vocabulary between English and German contains 18,337 tokens, with a union of 17,667 tokens, a 4% reduction in vocabulary size.
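The shared-vocabulary construction with the frequency threshold of four can be sketched as below; the function names are ours, and we assume pre-tokenized sentences pooled over all languages as input:

```python
from collections import Counter

def build_joint_vocab(corpora, min_count=4, unk="UNK"):
    """Build a single vocabulary shared across all languages.

    corpora: iterable of tokenized sentences (lists of tokens),
    pooled across every language in the training data. Tokens
    occurring fewer than `min_count` times are dropped and will be
    mapped to the special UNK symbol.
    """
    counts = Counter(tok for sent in corpora for tok in sent)
    return {unk} | {tok for tok, c in counts.items() if c >= min_count}

def apply_vocab(sentence, vocab, unk="UNK"):
    """Replace out-of-vocabulary tokens with the UNK symbol."""
    return [tok if tok in vocab else unk for tok in sentence]
```

Pooling the languages before thresholding is what produces the vocabulary-size reductions reported above, since surface forms shared across languages (numbers, names, cognates) occupy a single entry in the joint embedding matrix.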
Evaluation. We evaluate our models on the 1K images of the 2016 test set of Multi30K, using either the 5K captions from the comparable data or the 1K translation pairs. We evaluate on image-to-text (I→T) and text-to-image (T→I) retrieval tasks.
For most experiments we report Recall at 1 (R@1), 5 (R@5) and 10 (R@10) scores averaged over 10 randomly initialised models. However, in Section 6 we only report R@10 due to space limitations and because it has less variance than R@1 or R@5.
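Given a query-by-candidate similarity matrix, the Recall@k scores can be computed as in the sketch below (ours). We assume one gold candidate per query, as in the translation-pair evaluation; the five-captions-per-image comparable setting would additionally need to count a retrieval as correct if any of the image's captions is returned:

```python
import numpy as np

def recall_at_k(scores, ks=(1, 5, 10)):
    """Recall@k for retrieval from a similarity matrix.

    scores: (n_queries, n_candidates) similarity matrix in which the
    gold candidate for query q sits at column index q (one true pair
    per query).
    """
    n = scores.shape[0]
    # rank of the gold candidate for each query (0 = retrieved first)
    order = np.argsort(-scores, axis=1)
    ranks = np.array([np.where(order[q] == q)[0][0] for q in range(n)])
    return {k: float(np.mean(ranks < k)) for k in ks}
```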

Reproducing Gella et al. (2017)
We start by attempting to reproduce the findings of Gella et al. (2017). In these experiments we train our multi-task learning model on the comparable portion of Multi30K. Our models reimplement the setups used for VSE (Monolingual) and the bilingual models Pivot-Sym (Bilingual) and Parallel-Sym (Bilingual + c2c). The OE, Pivot-Asym, and Parallel-Asym models are trained using the asymmetric similarity measure introduced for order-embeddings (Vendrov et al., 2015). The main differences between our models and Gella et al. (2017) are that they use VGG-19 image features, whereas we use ResNet50 features, and that we use the max-of-hinges loss instead of the more common sum-of-hinges loss. Table 2 shows the results on the English comparable 2016 test set. Overall our scores are higher than those of Gella et al. (2017), which is most likely due to the different image features (Faghri et al. (2017) also report a large performance gain when they use the ResNet instead of the VGG image features). Nevertheless, our results show a similar trend to the symmetric cosine similarity models from Gella et al. (2017): our best results are achieved with bilingual joint training with the added c2c objective. Their models trained with an asymmetric similarity measure show a different trend: the monolingual model is stronger than the bilingual model, and the c2c loss provides no clear improvement. Table 3 presents the German results. Once again, our implementation outperforms Gella et al. (2017), and this is likely due to the different visual features and the max-of-hinges loss. However, our Bilingual model with the additional c2c objective performs best for German, whereas Gella et al. (2017) report the overall best results for the monolingual baseline VSE. Their models that use the asymmetric similarity function are clearly better than the Monolingual OE model. In general, the results from Gella et al.
(2017) indicate the benefits of bilingual joint training; however, they do not find a clear pattern across model configurations and languages. In our implementation, we focused only on the symmetric cosine similarity function and found a systematic pattern across both languages: bilingual training improves results on all performance metrics for both languages, and the additional c2c objective always provides further improvements.

Translations vs. independent captions
We now study whether the model can be trained on either translation pairs or independently collected bilingual captions. Gella et al. (2017) only conducted experiments on independently collected captions. However, it is known that humans have an equally strong preference for translated and independently collected captions of images (Frank et al., 2018), which has implications for the difficulty and cost of collecting training data. Our baseline is a Monolingual model trained on the 29K single-captioned images in the translation portion of Multi30K. The Bi-translation model is trained on both German and English, with shared parameters. Table 4 shows that there is a substantial improvement in performance for both languages in the bilingual setting. However, the additional c2c loss degrades performance here. This could be because we only have one caption per image in each language, and it is comparatively easy to find the relationship between the two views of a translation pair.
In the Bi-comparable setting, we randomly select an English and a German sentence for each image in the comparable portion of Multi30K. We find only a minor difference in performance between the Bi-translation and Bi-comparable models for English, but the German results are improved. Crucially, it is still better than training on monolingual data. In the Bi-comparable setting, the c2c loss does not have a detrimental effect on model performance, unlike in the Bi-translation experiment.

Table 3: German image-to-text (I→T) and text-to-image (T→I) retrieval results on the comparable part of Multi30K, measured by Recall at 1, 5, and 10. Typewriter font shows the performance of the two sets of symmetric and asymmetric models from Gella et al. (2017).
Overall we find that the comparable data leads to larger improvements in retrieval performance.

Overlapping vs. non-overlapping images
In a bilingual setting, we can improve an image-sentence ranking model by collecting more data in a second language. This can be achieved in two ways: by collecting captions in a new language for the same overlapping set of images, or by using a disjoint set of images and captions in a new language. We compare these two settings here.
In the Bi-overlap condition, we collect captions for the existing images in a new language, i.e. we use all of the English and German captions paired with a random selection of 50% of the images in comparable Multi30K, so that both languages describe the same images. In the Bi-disjoint condition, we instead split the images into two disjoint halves, using only the English captions for one half and only the German captions for the other, so that no image is described in both languages.

Table 5 shows the results of this experiment. The upper bound is to train a Monolingual model on the full comparable corpus. For the lower bound, we train Half Monolingual models by randomly sampling half of the 29K images and their associated captions, giving 72.5K captions over 14.5K images. Unsurprisingly, the Half Monolingual models perform worse than the Full Monolingual models. In the Bi-overlap experiment, the German model is improved by collecting captions for the existing images in English. There is no difference in the performance of the English model, echoing the results from Section 5.1. The Bi-overlap model also benefits from the added c2c objective. Finally, the Bi-disjoint model performs as well as the Bi-overlap model without the c2c objective. (It was not possible to train the Bi-disjoint model with the additional c2c objective because there are no caption pairs for the same image.) Overall, these results suggest that it is best to collect additional captions in the original language; but when adding a second language, it is better to collect extra captions for existing images and exploit the additional c2c ranking objective.

Multilingual experiments
We now turn our attention to multilingual learning using the English, German, French, and Czech annotations in the translation portion of Multi30K. We only report the text-to-image (T→I) R@10 results due to space limitations. We did not repeat the overlapping vs. non-overlapping experiments from Section 5.3 in a multilingual setting because this would introduce too much data sparsity: in order to conduct this experiment, we would have to downsample the already low-resource French and Czech captions by 50%, or even further for multi-way experiments. Table 6 shows the results of repeating the translations vs. comparable captions experiment from Section 5.2 with data in four languages. The Multi-translation models are trained on 29K images paired with a single caption in each language. These models perform better than their Monolingual counterparts, and the German, French, and Czech models are further improved with the c2c objective. The Multi-comparable models are trained by randomly sampling one English and one German caption from the comparable dataset, alongside the French and Czech translation pairs. These models perform as well as the Multi-translation models, and the c2c objective brings further improvements for all languages in this setting. These results clearly demonstrate the advantage of jointly training on more than two languages: text-to-image retrieval performance increases by more than 11 R@10 points for each of the four languages in our experiment.

High-to-low resource transfer
We now examine whether the lower-resource French and Czech models benefit from training with the full complement of the higher-resource English and German comparable data. We therefore train a joint model on the translation as well as comparable portions of Multi30K, and examine the performance on French and Czech. Table 7 shows the results of this experiment. We find that the French and Czech models improve by 8.8 and 5.5 R@10 points respectively when they are only trained on the multilingual translation pairs (compared to the monolingual version), and by another 2.2 and 2.8 points if trained on the extra 155K English and German comparable descriptions. We also find that the additional c2c objective improves the Czech model by a further 4.8 R@10 points (this improvement is likely caused by training the model on the 46 possible caption pairs per image). Our results show the impact of jointly training with the larger English and German resources, which demonstrates the benefits of high-to-low resource transfer.

Bilingual vs. multilingual
Finally, we investigate how useful it is to train on four languages instead of two. Figure 3 presents the image-to-text and text-to-image retrieval results of training Monolingual, Bilingual, or Multilingual models. The Monolingual and Bilingual models are trained on a random single-caption-image subsample of the comparable dataset with the additional c2c objective, as this configuration provided the overall best results in Sections 5.2 and 6.1. The Multilingual models are trained with the additional French and Czech translation data. As can be seen in Figure 3, the performance on both tasks and for both languages improves as we move from using data from one to two to four languages.

Conclusions
We learn multilingual multimodal sentence embeddings and show that multilingual joint training improves over bilingual joint training. We also demonstrate that low-resource languages can benefit from the additional data available in high-resource languages. Our experiments suggest that either translation pairs or independently collected captions improve the performance of a multilingual model, and that the latter data setting provides further improvements through a caption-caption ranking objective. We also show that, when collecting data in an additional language, it is better to collect captions for the existing images because we can exploit the caption-caption objective. Our results lead to several directions for future work. We would like to pin down the mechanism by which multilingual training contributes to improved performance for image-sentence ranking. Additionally, we only consider four languages and show the gain of multilingual over bilingual training only for the English-German language pair. In future work we will incorporate more languages from data sets such as Chinese Flickr8K (Li et al., 2016) or Japanese COCO (Miyazaki and Shimizu, 2016).