Contextual Neural Model for Translating Bilingual Multi-Speaker Conversations

Recent works in neural machine translation have begun to explore document translation. However, translating online multi-speaker conversations is still an open problem. In this work, we propose the task of translating Bilingual Multi-Speaker Conversations, and explore neural architectures which exploit both source and target-side conversation histories for this task. To initiate an evaluation for this task, we introduce datasets extracted from Europarl v7 and OpenSubtitles2016. Our experiments on four language-pairs confirm the significance of leveraging conversation history, both in terms of BLEU and manual evaluation.


Introduction
Translating a conversation online is ubiquitous in real life, e.g. in the European Parliament, United Nations, and customer service chats. This scenario involves leveraging the conversation history in multiple languages. The goal of this paper is to propose and explore a simplified version of such a setting, referred to as Bilingual Multi-Speaker Machine Translation (Bi-MSMT), where speakers' turns in the conversation switch the source and target languages. We investigate neural architectures that exploit the bilingual conversation history for this scenario, which is a challenging problem as the history consists of utterances in both languages.
The ultimate aim of all machine translation systems for dialogue is to enable a multi-lingual conversation between multiple speakers. However, translation of such conversations is not well explored in the literature. Recently, there has been work on using the discourse or document context to improve NMT, either in an online setting, using the past context (Jean et al., 2017; Wang et al., 2017; Bawden et al., 2017; Voita et al., 2018), or in an offline setting, using both the past and future context (Maruf and Haffari, 2018). In this paper, we design and evaluate a conversational Bi-MSMT model, where we incorporate the source and target-side conversation histories into a sentence-based attentional model (Bahdanau et al., 2015). Here, the source history comprises the sentences in their original language, for both languages, and the target history consists of their corresponding translations. We experiment with different ways of computing the source context representation for this task. Furthermore, we present an effective approach to leverage the target-side context, and an intuitive approach for incorporating both contexts simultaneously. To evaluate this task, we introduce datasets extracted from Europarl v7 and OpenSubtitles2016, containing speaker information. Our experiments on English-French, English-Estonian, English-German and English-Russian language-pairs show improvements of +1.44, +1.16, +1.75 and +0.30 BLEU, respectively, for our best model over the context-free baseline. The results show the impact of conversation history on the translation of bilingual multi-speaker conversations and can be used as a benchmark for future work on this task.

Related Work
Our research builds upon prior work in the field of context-based language modelling and context-based machine translation.
Language Modelling There have been a few works on leveraging context information for language modelling. Ji et al. (2015) introduced the Document Context Language Model (DCLM), which incorporates inter- and intra-sentential contexts. Others make use of side information, e.g. metadata, and Tran et al. (2016) use inter-document context to boost the performance of RNN language models.
For conversational language modelling, Ji and Bilmes (2004) propose a statistical multi-speaker language model (MSLM) that considers words from other speakers when predicting words from the current one. By taking the inter-speaker dependency into account using a normal trigram context, they report significant reduction in perplexity.

Statistical Machine Translation
The few SMT-based attempts at document MT are either restrictive or do not lead to significant improvements under automatic evaluation. A few of these deal with specific discourse phenomena, such as resolving anaphoric pronouns (Hardmeier and Federico, 2010) or lexical consistency of translations (Garcia et al., 2017). Others are based on a two-pass approach, i.e., they improve the translations already obtained by a sentence-level model (Hardmeier et al., 2012; Garcia et al., 2014).
Neural Machine Translation Using context-based neural models to improve online and offline NMT has recently become a popular trend. Jean et al. (2017) extend the vanilla attention-based NMT model (Bahdanau et al., 2015) by conditioning the decoder on the previous source sentence via a separate encoder and attention component. Wang et al. (2017) generate a summary of three previous source sentences via a hierarchical RNN, which is then added as an auxiliary input to the decoder. Bawden et al. (2017) explore various ways to exploit context from the previous sentence on the source and target side by extending the models proposed by Jean et al. (2017) and Wang et al. (2017). Apart from their models being difficult to scale, they report deteriorated BLEU scores when using the target-side context. Tu et al. (2017) augment the vanilla NMT model with a continuous cache-like memory, along the same lines as the cache-based system for traditional document MT (Gong et al., 2011), which stores hidden representations of recently generated words as translation history. The proposed approach shows significant improvements over all baselines when translating subtitles and comparable performance for news and TED talks. Along similar lines, Kuang et al. (2018) propose dynamic and topic caches to capture contextual information either from recently translated sentences or from the entire document to model coherence for NMT. Voita et al. (2018) introduce a context-aware NMT model in which they control and analyse the flow of information from the extended context to the translation model. They show that, using the previous sentence as context, their model is able to implicitly capture anaphora.
For the offline setting, Maruf and Haffari (2018) incorporate the global source and target document contexts into the base NMT model via memory networks. They report significant improvements using BLEU and METEOR for the contextual model over the baseline. To the best of our knowledge, there has been no work on Multi-Speaker MT or its variation to date.

Problem Formulation
We are given a dataset that comprises parallel conversations, and each conversation consists of turns. Each turn is constituted by sentences spoken by a single speaker, denoted by x or y, if the sentence is in English or Foreign language, respectively. The goal is to learn a model that is able to leverage the mixed-language conversation history in order to produce high quality translations.
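To make the formulation concrete, the following is a minimal sketch of how such a conversation could be represented and how the history available when translating a given sentence could be collected. The class names, the field names and the `history` helper are illustrative assumptions, not part of the original work; `"fr"` stands for the Foreign language.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Sentence:
    source: str  # what the speaker actually said
    target: str  # its translation into the other language
    lang: str    # language of `source`: "en" (English) or "fr" (Foreign)


@dataclass
class Turn:
    speaker: str
    sentences: List[Sentence] = field(default_factory=list)


@dataclass
class Conversation:
    turns: List[Turn] = field(default_factory=list)


def history(conv: Conversation, turn_idx: int, sent_idx: int
            ) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Collect the (source, target) histories visible when translating
    sentence `sent_idx` of turn `turn_idx` (both 0-based).

    Each entry is (language the speaker used, text)."""
    src, tgt = [], []
    for t, turn in enumerate(conv.turns[: turn_idx + 1]):
        # Completed turns contribute all sentences; the ongoing turn
        # contributes only the sentences before the current one.
        upto = sent_idx if t == turn_idx else len(turn.sentences)
        for s in turn.sentences[:upto]:
            src.append((s.lang, s.source))
            tgt.append((s.lang, s.target))
    return src, tgt
```

Note that the source history mixes both languages, which is what distinguishes this setting from document-level translation in a single direction.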

Data
Standard machine translation datasets are inappropriate for the Bi-MSMT task since they are not composed of conversations or lack speaker annotations. In this section, we describe how we extract data from raw Europarl v7 (Koehn, 2005) and OpenSubtitles2016 1 (Lison and Tiedemann, 2016) for this task 2 .
Europarl The raw Europarl v7 corpus (Koehn, 2005) contains SPEAKER and LANGUAGE tags, where the latter indicates the language the speaker was actually using. The individual files are first split into conversations. The data is tokenised (using scripts by Koehn (2005)) and cleaned (headings and single-token sentences removed). Conversations are divided into smaller ones if the number of speakers is greater than 5. 3 The corpus is then randomly split into train/dev/test sets with respect to conversations in the ratio 100:2:3. The English side of the corpus is set as reference: if the language tag is absent, the source language is English, otherwise Foreign. The sentences on the source side of the corpus are kept or swapped with those on the target side based on this tag. We perform the aforementioned steps for English-French, English-Estonian and English-German, and obtain the bilingual multi-speaker corpora for the three language pairs. Before splitting into train/dev/test sets, we remove conversations with sentences having more than 100 tokens for English-French and English-German, and more than 80 tokens for English-Estonian 4 , to limit the sentence length for using subwords with BPE (Sennrich et al., 2016). The data statistics are given in Table 1 and Appendix A 5 .
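Two of the steps above, orienting each sentence pair by its LANGUAGE tag and splitting conversations in the ratio 100:2:3, can be sketched as follows. The function names, the use of `None` for an absent tag and the fixed random seed are assumptions made for illustration; the actual extraction scripts may differ.

```python
import random
from typing import List, Optional, Sequence, Tuple


def orient(english: str, foreign: str,
           language_tag: Optional[str]) -> Tuple[str, str]:
    """Return (source, target) for one sentence pair.

    An absent tag (None here) means the speaker used English, so the
    English side is the source; otherwise the two sides are swapped."""
    if language_tag is None:
        return english, foreign
    return foreign, english


def split_conversations(conversations: Sequence, ratio=(100, 2, 3),
                        seed: int = 0) -> Tuple[List, List, List]:
    """Randomly split whole conversations into train/dev/test sets."""
    items = list(conversations)
    random.Random(seed).shuffle(items)
    total = sum(ratio)
    n_dev = len(items) * ratio[1] // total
    n_test = len(items) * ratio[2] // total
    n_train = len(items) - n_dev - n_test
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])
```

Splitting at the level of conversations rather than sentences keeps each conversation history entirely inside one of the three sets.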
Subtitles There has been recent work to obtain speaker labels via automatic turn segmentation for the OpenSubtitles2016 corpus (Lison and Meena, 2016; van der Wees et al., 2016; Wang et al., 2016). We obtain the English side of the OpenSubtitles2016 corpus annotated with speaker information by Lison and Meena (2016). 6 To obtain the parallel corpus, we use the OpenSubtitles alignment links to align foreign subtitles to the annotated English ones. For each subtitle, we extract individual conversations with more than 5 sentences and at least two turns. Conversations with more than 30 turns are discarded. Finally, since subtitles are in a single language, we assign language tags such that the same language occurs in alternating turns. We thus obtain the Bi-MSMT corpus for English-Russian, which is then divided into training, development and test sets.
4 Sentence-lengths of 100 tokens result in longer sentences than what we get for the other two language-pairs.
5 Although the extracted dataset is small, we believe it to be a realistic setting for a real-world conversation task, where reference translations are usually not readily available and are expensive to obtain.
6 The majority of sentences still have missing annotations (Lison and Meena, 2016) due to changes between the original script and the actual movie or alignment problems between scripts and subtitles. As for Wang et al. (2016), their publicly released data is even smaller than our En-De dataset extracted from Europarl.
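The filtering and tagging rules for the subtitle conversations can be sketched as below. The function names are hypothetical, a conversation is assumed to be a list of turns with each turn a list of sentences, and starting the alternation with English is an assumption that the text does not specify.

```python
from typing import List, Sequence


def keep_conversation(turns: Sequence[Sequence[str]],
                      min_sentences: int = 6,
                      min_turns: int = 2,
                      max_turns: int = 30) -> bool:
    """Keep conversations with more than 5 sentences, at least two
    turns, and no more than 30 turns."""
    n_sentences = sum(len(turn) for turn in turns)
    return n_sentences >= min_sentences and min_turns <= len(turns) <= max_turns


def alternate_languages(n_turns: int, first: str = "en",
                        other: str = "ru") -> List[str]:
    """Assign language tags so that the same language occurs in
    alternating turns (which language starts is an assumption)."""
    return [first if k % 2 == 0 else other for k in range(n_turns)]
```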

Sentence-based attentional model
Our base model consists of two sentence-based NMT architectures (Bahdanau et al., 2015), one for each translation direction. Each of them contains an encoder to read the source sentence and an attentional decoder to generate the target translation one token at a time.
Encoder It maps each source word $x_m$ to a distributed representation $h_m$, which is the concatenation of the corresponding hidden states of two RNNs running in opposite directions over the source sentence. The forward and backward RNNs are taken to be GRUs (gated recurrent units; Cho et al., 2014) in this work.
Decoder The generation of each target word $y_n$ is conditioned on all the previously generated words $y_{<n}$ via the state $s_n$ of the decoder, and on the source sentence via a dynamic context vector $c_n$. The decoder takes as input the embedding $E_T[y_{n-1}]$ of the previous target word $y_{n-1}$, and $\{W_{(\cdot)}, b_y\}$ are its parameters. The fixed-length dynamic context representation of the source sentence, $c_n = \sum_m \alpha_{nm} h_m$, is generated by an attention mechanism, where $\alpha$ specifies the proportion of relevant information from each word in the source sentence.
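The context vector computation can be illustrated with a small NumPy sketch. The additive (MLP) scoring function follows the general form of Bahdanau et al. (2015); the parameter names `W_h`, `W_s` and `v`, and the use of the previous decoder state as the query, are assumptions for illustration rather than the exact parameterisation used here.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()


def attention_context(H_src: np.ndarray, s_prev: np.ndarray,
                      W_h: np.ndarray, W_s: np.ndarray, v: np.ndarray):
    """Compute attention weights alpha_n and context c_n = sum_m alpha_nm h_m.

    H_src:  (M, 2H) encoder states h_1..h_M, one row per source word
    s_prev: (H,)    decoder state used as the query
    W_h:    (A, 2H), W_s: (A, H), v: (A,) with A the alignment dimension
    """
    scores = np.tanh(H_src @ W_h.T + s_prev @ W_s.T) @ v  # (M,)
    alpha = softmax(scores)                               # weights sum to 1
    c = alpha @ H_src                                     # (2H,)
    return alpha, c
```

With all-zero attention parameters every source word receives the same weight, so the context vector reduces to the mean of the encoder states; training moves the weights away from this uniform case.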

Conversational Bi-MSMT Model
Before we delve into the details of how to leverage the conversation history, we identify the three types of context we may encounter in an ongoing bilingual multi-speaker conversation, as shown in Figure 1. These are: (i) the previously completed English turns, (ii) the previously completed Foreign turns, and (iii) the ongoing turn (English or Foreign).
We propose a conversational Bi-MSMT model that is able to incorporate all three types of context using source, target or dual conversation histories into the base model. The base model caters to the speaker's language transition by having one sentence-based NMT model (described previously) for each translation direction, English→Foreign and Foreign→English. We now describe our approach for extracting relevant information from the source and target bilingual conversation history.
Figure 1: Overview of an ongoing conversation while translating the $i$-th sentence in the $(2k+1)$-th turn. $X^{j}_{|t_j|}$ and $Y^{j}_{|t_j|}$ denote the sentences in the previous English and Foreign turns respectively, and $x^{j}_{i}$ denotes sentence $i$ in the ongoing turn $j$, where $i \in \{1, ..., |t_j|\}$. The shaded turns are observed, i.e., source (the speaker utterances), while the rest are unobserved, i.e., the target translations or the unuttered source sentences for the current turn.

Source-Side History
Suppose we are translating an ongoing conversation having alternating turns of English and Foreign. We are currently in the $(2k+1)$-th turn (in English) and want to translate its $i$-th sentence using the source-side conversation history represented by the context vector $o_{src}$ (of dimension $H$).
Let us assume that we already have the representations of the previous source sentences in the conversation. We pass the source sentence representations through Turn-RNNs, which are language-specific bidirectional RNNs applied irrespective of the speaker, as shown in Figure 2, and concatenate the last hidden states of the forward and backward Turn-RNNs to get the final turn representation $r_j$, where $j$ denotes the turn index. The individual turn representations are then combined, based on language 7 , to obtain context vectors $o_{en}$ and $o_{fr}$, computed in several possible ways (described below). These are further combined using a gating mechanism (Eq. 1) that gives differing importance to each element of the context vector; in this gate, $\sigma$ is the logistic sigmoid function, the $U$'s are matrices and $b_g$ is a vector. Finally, we perform a dimensionality reduction to obtain $o_{src}$. In the remainder of this section, $\{W, U, b\}$ are language-specific learned parameters. We propose five ways of computing the language-specific context representations, $o_{en}$ and $o_{fr}$.
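One common way to realise such an element-wise gate is a convex combination of the two context vectors, sketched below. This specific form, as well as the names `U_en` and `U_fr`, is an assumption chosen to be consistent with the symbols described above ($\sigma$, the $U$ matrices and $b_g$); it should not be read as the exact gating equation of this work.

```python
import numpy as np


def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))


def gate_contexts(o_en: np.ndarray, o_fr: np.ndarray,
                  U_en: np.ndarray, U_fr: np.ndarray,
                  b_g: np.ndarray) -> np.ndarray:
    """Element-wise gated mix of the English and Foreign context vectors.

    o_en, o_fr: (D,) context vectors; U_en, U_fr: (D, D); b_g: (D,)
    """
    g = sigmoid(U_en @ o_en + U_fr @ o_fr + b_g)  # one gate value per element
    return g * o_en + (1.0 - g) * o_fr
```

Because the gate is a vector rather than a scalar, each dimension of the combined context can lean towards the English or the Foreign history independently.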

Direct Transformation
The simplest approach is to combine the turn representations using a language-specific dimensionality reduction transformation, applied to the $r_j$'s concatenated row-wise.
Hierarchical Gating We propose a language-specific exponential decay gating based on the intuition that the farther the previous turns are from the current one, the smaller their impact may be on the translation of a sentence in the ongoing turn, similar in spirit to the caching mechanism of Tu et al. (2017).
Language-Specific Attention The English and Foreign turn representations are combined separately via attention, allowing the model to focus on relevant turns in the English and the Foreign context. Here the $r_j$'s are concatenated column-wise, $h_i$ is the concatenation of the last hidden states of the forward and backward RNNs in the encoder for the current sentence $i$ in turn $2k+1$ (of dimension $2H$), and $\{W_{en}, b_{en}\}$ transform the language space to that of the target language.
Combined Attention This is a language-independent attention that merges all turn representations into one. The aim here is to verify whether the model actually benefits from Language-Specific Attention or not.

Language-Specific Sentence-level Attention
All the previous approaches for computing $o_{en}$ and $o_{fr}$ use a single turn-level representation. We propose to use the sentence information explicitly via a sentence-level attention to evaluate the significance of more fine-grained context in contrast to Language-Specific Attention. We first concatenate the hidden states of the forward and backward Turn-RNNs for each sentence and get a matrix comprising the representations of all the previous source sentences, i.e., for English turns, we have $[r^{1}_{1}; \ldots; r^{1}_{|t_1|}; \ldots; r^{2k+1}_{1}; \ldots; r^{2k+1}_{i-1}]$, and similarly we have another matrix for all the previous Foreign sentences. Here, each $r^{j}_{i}$ is the representation of source sentence $i$ in turn $j$ computed by the bidirectional Turn-RNN. The remaining computations are the same as in Eq. 3.

Target-Side History
Using the target-side conversation history is as important as using that of the source side, since it helps in making the translation more faithful to the target language. This becomes crucial for translating conversations where the previous turns are all in the same language. For incorporating the target-side context, we use a sentence-level attention similar to the one described for the source-side context, i.e., for all previous English source sentences, we have a matrix $R_{en}$ comprising the corresponding target sentence representations in Foreign, and another matrix $R_{fr}$ of target sentence representations (in English) for previous Foreign turns. Here each target sentence representation has dimension $H$. Then,
$$
\begin{aligned}
p_{en} &= \mathrm{softmax}\big(R_{en}^{T} \times \tanh(W_{t,en} \times h_i + b_{t,en})\big) \\
p_{fr} &= \mathrm{softmax}\big(R_{fr}^{T} \times (W_{td,en} \times h_i + b_{td,en})\big) \\
o_{en} &= R_{en} \times p_{en} \\
o_{fr} &= \tanh\big(W_{t,en} \times (R_{fr} \times p_{fr}) + b_{t,en}\big)
\end{aligned}
$$
where $\{W_{t,en}, b_{t,en}\}$ are for dimensionality reduction and changing the language space of the query vector $h_i$ and the context vector, while $\{W_{td,en}, b_{td,en}\}$ are only for dimensionality reduction. $o_{en}$ and $o_{fr}$ are further combined using a gating mechanism as in Eq. 1 to obtain the final target context vector $o_{tgt}$ (of dimension $H$).
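The target-side attention can be sketched in NumPy as follows, with the columns of `R_en` and `R_fr` holding the stored target sentence representations. One deliberate deviation is labelled as an assumption: the text uses $\{W_{t,en}, b_{t,en}\}$ both on the $2H$-dimensional query and on the $H$-dimensional context vector, whereas this sketch uses a separate, hypothetical $H \times H$ pair `W_c`, `b_c` for the context vector so that all shapes are consistent.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()


def target_context_en(R_en, h_i, W_t, b_t):
    """Attend over Foreign translations of previous English sentences.

    R_en: (H, N) one column per stored target representation
    h_i:  (2H,)  encoder summary of the current source sentence
    W_t:  (H, 2H), b_t: (H,)
    """
    query = np.tanh(W_t @ h_i + b_t)  # reduce dimension and change language space
    p_en = softmax(R_en.T @ query)    # (N,) attention over stored sentences
    return R_en @ p_en, p_en          # o_en has shape (H,)


def target_context_fr(R_fr, h_i, W_td, b_td, W_c, b_c):
    """Attend over English translations of previous Foreign sentences.

    W_td: (H, 2H), b_td: (H,) only reduce the dimension of the query.
    W_c:  (H, H),  b_c:  (H,) are hypothetical parameters (see lead-in).
    """
    query = W_td @ h_i + b_td
    p_fr = softmax(R_fr.T @ query)
    return np.tanh(W_c @ (R_fr @ p_fr) + b_c), p_fr
```

The asymmetry between the two functions reflects where the change of language space is applied: on the query when the stored representations are in Foreign, and on the attended context when they are in English.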

Dual Conversation History
Now that we have explained how to leverage the source and target conversation histories separately, we explain how they can be utilised simultaneously. The simplest way to do this is to incorporate both context vectors $o_{src}$ and $o_{tgt}$ into the base model (explained in Sec 4.4); we refer to this as the Src-Tgt dual context.
Another intuitive approach, as evident from Figure 2, is to separately model English and Foreign sentences using two separate context vectors $o_{en,m}$ and $o_{fr,m}$, where each is constructed from a mixture of the original source or target translations, is language-specific and possibly contains less noise. We refer to this as the Src-Tgt-Mix dual context. Suppose $R_{en,m}$ contains the mixed source/target representations for English (the dimensions for source representations have been reduced to $H$) and $R_{fr,m}$ contains the same for Foreign. Then,
$$
\begin{aligned}
p_{en,m} &= \mathrm{softmax}\big(R_{en,m}^{T} \times (W_{td,en} \times h_i + b_{td,en})\big) \\
p_{fr,m} &= \mathrm{softmax}\big(R_{fr,m}^{T} \times \tanh(W_{tt,en} \times h_i + b_{tt,en})\big) \\
o_{en,m} &= \tanh\big(W_{tr,en} \times (R_{en,m} \times p_{en,m}) + b_{tr,en}\big) \\
o_{fr,m} &= R_{fr,m} \times p_{fr,m}
\end{aligned}
$$
where $W_{td,en}$, $W_{tr,en}$ and $W_{tt,en}$ are for dimensionality reduction, changing the language space and both, respectively.

Incorporating Context into Base Model
• InitDec+AddDec Combination of previous two approaches.

Training and Decoding
The model parameters are trained end-to-end by maximising the sum of the log-likelihoods of the bilingual conversations in the training set $D$. For example, for a conversation having alternating turns of English and Foreign language, the log-likelihood sums, over all turns, the log-probabilities of the target sentences given their source sentences and the conversation history. In this objective, $i$, $j$ denote sentences belonging to the $(2k+1)$-th or $(2k+2)$-th turn; $o_{(\cdot)}$ is a representation of the conversation history, and $|T|$ is the total number of turns (assumed to be even here).
The best output sequence for the $i$-th input sentence at test time, a.k.a. decoding, is the one that maximises the model probability of the target sentence given the source sentence and the conversation history.
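For concreteness, one way to write the training objective and the decoding rule that is consistent with the symbols defined above is given below. The exact indexing, the notation $P_\theta$ and the notation $|t_j|$ for the number of sentences in turn $j$ are assumptions; English sentences are denoted by $x$ and Foreign ones by $y$, with odd turns in English.

```latex
\begin{aligned}
\mathcal{L}(\theta) &= \sum_{k=0}^{|T|/2-1} \Bigg(
    \sum_{i=1}^{|t_{2k+1}|} \log P_\theta\big(y_i \mid x_i, o_{(\cdot)}\big)
  + \sum_{j=1}^{|t_{2k+2}|} \log P_\theta\big(x_j \mid y_j, o_{(\cdot)}\big)
\Bigg) \\
\hat{y}_i &= \arg\max_{y_i} \; P_\theta\big(y_i \mid x_i, o_{(\cdot)}\big)
\end{aligned}
```

The first inner sum covers the English→Foreign direction for an English turn and the second covers the Foreign→English direction for the following Foreign turn, so both translation models receive gradient from every conversation.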

Experiments
Implementation and Hyperparameters We implement our conversational Bi-MSMT model in C++ using the DyNet library (Neubig et al., 2017). The base model is built using mantis, which is an implementation of the generic sentence-level NMT model using DyNet.
The base model has single layer bidirectional GRUs in the encoder and 2-layer GRU in the decoder 8 . The hidden dimensions and word embedding sizes are set to 256, and the alignment dimension (for the attention mechanism in the decoder) is set to 128.

Models and Training
We do a stage-wise training for the base model, i.e., we first train the English→Foreign architecture and the Foreign→English architecture, using the sentence-level parallel corpus. Both architectures have the same vocabulary 9 but separate parameters to avoid biasing the embeddings towards the architecture trained last. The contextual model is pre-trained similar to training the base model. The best model is chosen based on minimum overall perplexity on the bilingual dev set.
For the source context representations, we use the sentence representations generated by two sentence-level bidirectional RNNLMs (one each for English and Foreign) trained offline. For the target sentence representations, we use the last hidden states of the decoder generated from the pre-trained base model 10 . At decoding time, however, we use the last hidden state of the decoder computed by our model (not the base) as the target sentence representations. Further training details are provided in Appendix B. Statistical significance is reported following Clark et al. (2011) with p < 0.05.

Results
Firstly, we evaluate the three strategies for incorporating context: InitDec, AddDec and InitDec+AddDec, and report the results for source context using Language-Specific Attention in Table 2. For the Europarl data, we see decent improvements with InitDec for En-Et (+1.11 BLEU) and En-De (+1.60 BLEU), and with InitDec+AddDec for En-Fr (+1.19 BLEU). We also observe that, for all language-pairs, both translation directions benefit from context, showing that our training methodology was indeed effective. On the other hand, for the Subtitles data, we see a maximum improvement of +0.30 BLEU for InitDec+AddDec. We narrow this down to three major reasons: (i) the data is noisier when compared to Europarl, (ii) the sentences are short and generic, with only 1% having more than 27 tokens, and finally (iii) the turns in OpenSubtitles2016 are short compared to those in Europarl (see Table 1), and we show later (Section 5.2) that the context from the current turn is the most important.
The next set of experiments evaluates the five different approaches for computing the source-side context. It is evident from Table 2 that for English-Estonian and English-German, our model indeed benefits from using the fine-grained sentence-level information (Language-Specific Sentence-level Attention) as opposed to just the turn-level one.
Finally, our results with source, target and dual contexts are reported. Interestingly, just using the source context is sufficient for English-Estonian and English-German. For English-French, on the other hand, we see significant improvements for the models using the target-side conversation history over using only the source-side. We attribute this to the base model being more efficient and able to generate better translations for En-Fr as it had been trained on a larger corpus as opposed to the other two language-pairs. Unlike Europarl, for Subtitles, we see improvements for our Src-Tgt-Mix dual context variant over the Src-Tgt one for En→Ru, showing this to be an effective approach when the target representations are noisier.
To summarise, for the majority of cases our Language-Specific Sentence-level Attention is the winner or a close second. Using the Target Context is useful when the base model generates reasonable-quality translations; otherwise, using the Source Context should suffice.
Local Source Context Model Most of the previous works on online context-based NMT consider only a single previous sentence as context (Jean et al., 2017; Bawden et al., 2017; Voita et al., 2018). Drawing inspiration from these works, we evaluate our model (trained with Language-Specific Sentence-level Attention) on the same test set but using only the previous source sentence as context. This evaluation allows us to hypothesise how much of the gain can be attributed to the previous sentence. From Table 3, it can be seen that our model surpasses the local-context baseline for Europarl, showing that the wider context is indeed beneficial if the turn lengths are longer. For En-Ru, it can be seen that using the previous sentence is sufficient due to short turns (see Table 1).

Analysis
Ablation Study We conduct an ablation study to validate our hypothesis of using the complete context versus using only one of the three types of context in a bilingual multi-speaker conversation: (i) the current turn, (ii) previous turns in the current language, and (iii) previous turns in the other language. The results for En-De are reported in Table 4. We see a decrease in BLEU for all types of context, with a significant decrease when considering only the current language from previous turns. The results show that the current turn has the most influence on translating a sentence.
Table 5: Top 20 tokens for which our model is better than the baseline (Europarl).
En→Fr: les; par; est; a; dans; le; en; j'; un; afin; question; entre; qu'; être; ces; également; y; depuis; c'; ou
Fr→En: this; of; we; issue; europe; by; up; make; united; does; what; regard; s; must; however; such; whose; share; like; been
En→Et: eest; vahel; üle; nimel; ja; aastal; aasta; neid; ainult; seepärast; nagu; kes; komisjoni; tehtud; küsimuses; sisserände; liikmesriigi; mulla; liibanoni; dawit
Et→En: for; this; of; is; political; important; culture; also; as; order; are; each; their; only; gender; were; its; economy; one; market
En→De: daß; auf; und; werden; nicht; müssen; aus; mehr; können; einem; rates; eines; insbesondere; wurden; habe; mitgliedstaaten; ist; sondern; europa; gemeinsamen
De→En: that; its; say; must; some; therefore; more; countries; an; favour; public; will; without; particularly; hankiss; much; increase; eu; them; parliamentary
Training base model with more data To analyse if the context is beneficial even when using more data, we perform an experiment for English-German where we train the base model with additional sentence-pairs from the full WMT'14 corpus 11 (excluding our dev/test sets and filtering sentences with more than 100 tokens). For training the contextual model, we still use the bilingual multi-speaker corpus.
We observe a significant improvement of +1.12 BLEU for the context-based model (Figure 3 II), showing the significance of conversation history in this experimental condition. 12 We perform another experiment where we use a larger base model, having almost double the number of parameters of our previous base model (hidden units and word embedding sizes set to 512, and alignment dimension set to 256), to test if the model parameters are being overestimated due to the additional context. We use the same WMT'14 corpus to train the base model and achieve a significant improvement of +1.48 BLEU for our context-based model over the larger baseline (Figure 3 III).
Table 6 (example translation):
Context: nous sommes également favorables au principe d'un système de collecte des miles commun pour le parlement européen, pour que celui-ci puisse bénéficier de billets d'avion moins chers, même si nous voyons difficilement comment ce système pourrait être déployé en pratique. enfin, nous ne sommes pas opposés à l'attribution de prix culturels par le parlement européen.
Source: néanmoins, nous sommes particulièrement critiques à l'égard du prix pour le journalisme du parlement européen et nous ne pensons pas que celui-ci puisse décerner des prix aux journalistes ayant pour mission de soumettre le parlement européen à un regard critique.
Target: however, we are highly critical of parliament's prize for journalism, and do not believe that it is appropriate for parliament to award prizes to journalists whose task it is to critically examine the european parliament.
Base Model: nevertheless, we are particularly critical of the price for the european union's european alism and we do not believe that it would be able to make a price to the journalists who have been made available to the european parliament to a critical view.
Our Model: however, we are particularly critical of the price for the european union's democratic alism and we do not believe that it can give rise to the prices for journalists who have been tabled to submit the european parliament to a critical view.
Table 7 (example translation):
Target: in this respect, it is necessary to highlight the central role of increased transparency in their activities.
Base Model: in this regard it must be emphasised in the major role of transparency in which these activities are to be strengthened.
Our Model: in this regard, it must be stressed in the key role of greater transparency in their activities.
How is the context helping? The underlying hypothesis for this work is that discourse phenomena in a conversation may depend on long-range dependencies, which may be ignored by sentence-based NMT models.
To analyse if our contextual model is able to accurately translate such linguistic phenomena, we come up with our own evaluation procedure. We aggregate the tokens correctly generated by our model and those correctly generated by the baseline over the entire test set. We then take the difference of these counts and sort them 13 . Table 5 reports the top 20 tokens where our model is better than the baseline for the Europarl dataset. Figure 4 gives the density of counts obtained using our evaluation for En→Fr 14 . Positive counts correspond to correct translations by our model, while negative counts correspond to cases where the base model was better. It can be seen that in the majority of cases our model outperforms the base model. We observed a similar trend for the other translation directions. In general, the tokens correctly generated by our model include pronouns (that, this, its, their, them), discourse connectives (e.g., 'however', 'therefore', 'also') and prepositions (of, for, by). Table 6 reports an example where our model is able to generate the correct discourse connective 'however' using the context. If we look at the context of the source sentence in French, we come to the conclusion that 'however' is indeed a perfect fit in this case, whereas the base model is at a disadvantage and completely changes the underlying meaning of the sentence by generating the inappropriate connective 'nevertheless'. Table 7 gives an instance where our model is able to generate the correct pronoun 'their'. It should be noted that in this case, the current source sentence does not contain the antecedent and thus the context-free baseline is unable to generate the appropriate pronoun. On the other hand, our contextual model is able to do so by giving the highest attention weights to sentences containing the antecedent (observed from the attention map in Figure 5) 15 .
Figure 5 also shows that for translating the majority of the sentences, the model attends to wide-range context rather than just the previous sentence, hence strengthening the premise of using the complete context.
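The token-level evaluation procedure described above can be sketched as follows. Here a token is counted as "correctly generated" when it appears in both the hypothesis and the reference of the same sentence, with counts clipped to the reference; this definition, the whitespace tokenisation and the function names are assumptions, since the text does not spell out the matching criterion.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def correct_token_counts(hypotheses: Sequence[str],
                         references: Sequence[str]) -> Counter:
    """Count tokens shared by each hypothesis and its reference,
    aggregated over the whole test set (clipped per sentence)."""
    total = Counter()
    for hyp, ref in zip(hypotheses, references):
        total += Counter(hyp.split()) & Counter(ref.split())
    return total


def token_gain(ours: Sequence[str], base: Sequence[str],
               references: Sequence[str],
               top_k: int = 20) -> List[Tuple[str, int]]:
    """Difference in correct-token counts (ours minus baseline),
    sorted from largest gain to largest loss."""
    diff = correct_token_counts(ours, references)
    diff.subtract(correct_token_counts(base, references))
    return sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

Positive differences then correspond to tokens the contextual model gets right more often, and negative ones to tokens where the baseline is better, matching the reading of the counts given above.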

Conclusion
This work investigates the challenges associated with translating multilingual multi-speaker conversations by exploring a simpler task referred to as Bilingual Multi-Speaker Conversation MT. We process Europarl v7 and OpenSubtitles2016 to obtain an introductory dataset for this task. Compared to models developed for similar tasks, our work is different in two aspects: (i) the history captured by our model contains multiple languages, and (ii) our model captures 'global' history as opposed to 'local' history captured in most previous works. Our experiments demonstrate the

B Experiments
Training For the base model, we make use of stochastic gradient descent (SGD) 16 with an initial learning rate of 0.1 and a decay factor of 0.5 after the fifth epoch, for a total of 15 epochs. For the contextual model, we use SGD with an initial learning rate of 0.08 and a decay factor of 0.9 after the first epoch, for a total of 30 epochs. To avoid overfitting, we employ dropout and set its rate to 0.2. To reduce the training time of our contextual model, we perform the computation one turn at a time. For instance, when using the source context, we run the Turn-RNNs for previous turns once and re-run the Turn-RNN only for sentences in the current turn.
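The two learning-rate schedules can be written as a small helper. This sketch assumes that the decay factor is applied multiplicatively once per epoch for every epoch after the stated one; the text does not say whether the decay is applied once or repeatedly, so this is only one plausible reading, and the function name is hypothetical.

```python
def learning_rate(epoch: int, initial: float, decay: float,
                  decay_after: int) -> float:
    """Learning rate for a 1-based `epoch`.

    Assumes the rate is multiplied by `decay` once for each epoch
    after `decay_after` (an assumption, see the lead-in)."""
    steps = max(0, epoch - decay_after)
    return initial * decay ** steps


# Base model: 0.1, halved after the fifth epoch, 15 epochs in total.
base_schedule = [learning_rate(e, 0.1, 0.5, 5) for e in range(1, 16)]
# Contextual model: 0.08, factor 0.9 after the first epoch, 30 epochs.
context_schedule = [learning_rate(e, 0.08, 0.9, 1) for e in range(1, 31)]
```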