Integration of Dubbing Constraints into Machine Translation

Translation systems aim to perform a meaning-preserving conversion of linguistic material (typically text, but also speech) from a source to a target language (and, to a lesser degree, between the corresponding socio-cultural contexts). Dubbing, i.e., the lip-synchronous translation and revoicing of speech, adds constraints on the close matching of phonetic (and resulting visemic) synchrony characteristics of source and target material. There is an inherent conflict between a translation's meaning preservation and its 'dubbability', and the resulting trade-off can be controlled by weighting the synchrony constraints. We introduce our work, which to the best of our knowledge is the first of its kind, on integrating synchrony constraints into the machine translation paradigm. We present first results for the integration of synchrony constraints into encoder-decoder-based neural machine translation and show that considerably more 'dubbable' translations can be achieved with only a small impact on BLEU score: dubbability improves more steeply than BLEU degrades.


Introduction
Dubbing, the lip-synchronous translation and revoicing of audio-visual media, is essential for the full-fledged reception of foreign movies, TV shows, instructional videos, advertisements, and short social media clips. Dubbing does not contend for the viewers' visual attention as subtitles (Díaz-Cintas and Remael, 2014) do, and unlike voice-over or asynchronous speech there is no (or only little) mismatch between visual and auditory impression, where the resulting cognitive dissonance would otherwise increase the viewers' cognitive load or even lead to understanding errors (McGurk and Macdonald, 1976). Dubbing is still primarily studied in audiovisual translation (Orero, 2004; Chaume, 2012) and performed manually, unlike textual translation, which is largely being automated or supported by computer-aided translation (Koehn, 2009). (This work was performed during an internship at Universität Hamburg, Germany.)
Recent breakthroughs in speech-to-speech translation (Jia et al., 2019) do not yield translations that systematically observe dubbing constraints, i.e., that phonetically (or rather: visemically) match the original source (we call this 'dubbability'). It is our goal to create MT systems in which the dubbability of the translation can be controlled so as to optimize the trade-off between translation quality and lip-synchrony of the dubbed speech. We hope that more widely available dubbing across languages will help to stimulate access to foreign media and foster inter-cultural exchange.
We argue that dubbable MT will not simply emerge implicitly from training on dubbed audio-visual corpora. Audio-visual corpora will always remain smaller than pure text-to-text translation corpora, so merely training a conventional MT system on large amounts of dubbing texts is bound to severely limit performance. What's more, the task of dubbing combines the constraints of several areas (meaning-preserving as well as prosodically similar translation) which have different properties. For example, for speech from the off or without the speaker's face visible, there are no limitations on prosodic similarity, while it may be critical in close-up scenes; the translation system would thus need to consider video as well (but only very selectively so). Thus, we are looking for a flexible weighting of these two aspects, which we achieve by introducing phonetic synchrony constraints that describe the 'dubbability' of a proposed translation, i.e., how well it is expected to allow for lip-synchronous revoicing.

source (en): No, no. Each individual's blood chemistry is unique, like fingerprints.
dubbed (es): No, no. La sangre de cada individuo es única, como una huella.
faithful: No, no. La química de la sangre de cada individuo es única, como las huellas dactilares.

Figure 1: Example utterance from the HEROes corpus (Öktem et al., 2018) in its English original and Spanish dubbed revoicing, as well as a meaning-preserving ('faithful') translation.

The faithful translation results in about 70 % too many syllables (32 vs. 19 in the source) and would be next to impossible to revoice in a lip-synchronous manner. The human translator (and dubbing expert) resolved the issue by sacrificing some detail in the translation: the two terms "blood chemistry" and "fingerprints" can easily be translated slightly differently (leaving out the "chemistry" and "finger" aspects, as well as singularizing "prints"), which reduces the syllable difference down to 20 % without sacrificing the overall meaning conveyed by the utterance.
We describe how synchrony constraints can be included in the MT process, in particular in the search/decoding process of neural MT, in the following section. We then describe our implemented system in Section 3, present results of our experiments in Section 5, and conclude in Section 6, where we also present our plans for future work.

Integration of Dubbing Constraints
Given a source language sentence S, both statistical MT and neural MT perform a search among many different possible candidate utterances C in the target language with respect to constraints that represent the faithfulness of the translation, score_t(C, S), with the best-scoring candidate picked as the result.
Given the source sentence and a candidate translation, we can compute a phonetic (or visemic) synchrony score_p(C, S). Then, for dubbing-optimized machine translation, we simply compute a dubbing-optimal score_d that combines both sub-scores using a weight α that indicates the relative importance of phonetic synchrony vs. translation faithfulness:

score_d^α(C, S) = (1 − α) · score_t(C, S) + α · score_p(C, S)

In application, α can be varied, e.g., according to whether the speaker's face is visible on screen.
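As a minimal sketch (assuming both sub-scores are normalized to comparable ranges, e.g. [0, 1]), the interpolation is:

```python
def score_d(score_t: float, score_p: float, alpha: float) -> float:
    """Interpolate translation faithfulness (score_t) and phonetic
    synchrony (score_p) with weight alpha in [0, 1].

    alpha = 0 recovers the plain MT score; alpha = 1 ranks candidates
    purely by synchrony.
    """
    return (1.0 - alpha) * score_t + alpha * score_p
```

When the speaker's face is off-screen, α can simply be set to 0, which reduces the system to a conventional MT decoder.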
MT systems gradually construct and prune the search space, since their scoring functions work well locally, i.e., already score partial translations well. In contrast, synchrony scoring requires a global perspective, in particular for a constraint such as the relative deviation in syllable count between a candidate and the source, i.e., score_p(C, S) = abs(syll(C) − syll(S)) / syll(S). This is hard to compute for only a prefix of C, as it is typically unclear which words in the source have already been accounted for, and as syllables can be shifted between words (only the total matters).
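For a complete utterance, the relative deviation can be computed directly. The sketch below uses a crude vowel-group heuristic as a stand-in for a proper hyphenation-based counter such as Pyphen; the `syll` helper is our own illustrative approximation, not the counter used in the paper:

```python
import re

def syll(text: str) -> int:
    """Rough syllable estimate: one syllable per vowel group per word."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return sum(max(1, len(re.findall(r"[aeiouyáéíóúü]+", w))) for w in words)

def score_p(candidate: str, source: str) -> float:
    """Relative syllable deviation of a candidate from the source
    (0 = perfect match; lower is better)."""
    s = syll(source)
    return abs(syll(candidate) - s) / s
```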
To integrate phonetic constraints into the search process, we propose a heuristic dubbing estimator that breaks down the task of phonetic similarity scoring into (a) the known phonetic score for the prefix that has already been generated, and (b) a heuristic estimate of score_p, based on the internal state of the decoder, for how well the yet-untranslated part of the utterance will score. Different prefixes correspond to different decoder states, and states are known to capture the remaining length of the translation (Shi et al., 2016). Our method extends that of Chatterjee et al. (2017), which scores constraints only once all necessary information is available in the decoded prefix. The resulting beam search then performs similarly to A* (Hart et al., 1968). Figure 2 depicts our method, without loss of generality, for NMT. In the example, the decoding of an utterance at decoding stage i is shown. At stage i, the decoder may consider adding a word to faithfully translate the phrase "blood chemistry", and as an alternative hypothesis consider translating just "blood" as a shorter way of conveying the same message. All alternatives are placed in the MT system's beam, which is then re-scored by the dubbing estimator: it takes each word sequence in the beam to compute the phonetic score of the prefix, as well as the decoder's hidden state h_i to estimate the score for what still has to be translated. In this case, we can imagine that "sangre" will be re-scored to a higher position, as its brevity is preferred (whereas the alternatives would still need to add "sangre" in a later decoding stage; their states will thus be estimated as containing more material to come, yielding an overall higher syllable estimate and a lower score).
The integration of synchrony constraints into the decoder enables a dubbing-optimal search with very little decoding overhead, albeit with some implementation effort. In addition, the heuristic estimate of score_p could turn out to be problematic given little training material or domain mismatches (see below). A similar result at low code complexity, but potentially longer run time, can be achieved by post-hoc rescoring of a relatively large beam from a standard NMT decoder. This approach is implemented in our first prototype, which is described in the next section.

Implemented System
We first describe our NMT model and training setup in detail, which yields an MT system that is competitive with the state of the art. Overall, our goal is not to create a heavily optimized system that gives us the highest possible performance in our domain but merely to yield a plausible baseline. We then describe our amendments for dubbing-optimal decoding.
We implement a convolutional encoder-decoder NMT model (Gehring et al., 2017). Given the relatively small amount of training data (see below), we use a smaller model than Gehring et al. (2017), inspired by Edunov et al. (2018), and adapt certain hyperparameter values as described in Table 1.
We pre-process textual data as follows: we perform tokenization using scripts from the open-source package Moses (Koehn et al., 2007), followed by byte-pair encoding to reduce the vocabulary size (Sennrich et al., 2016) using the open-source package subword-nmt (https://github.com/rsennrich/subword-nmt). Words not included in the vocabulary are denoted as <UNK>. We do not apply any lowercasing or stemming. We train our model with fairseq (https://github.com/pytorch/fairseq; Ott et al., 2019) for the default 34 epochs, with training objectives and search settings as found to be optimal by Edunov et al. (2018) for a similar MT task.
Our standard decoder uses a beam-size of 50 (which is larger than typically used, but see next section for results).
For dubbing-optimal decoding, we rescore the N-best list B_t from standard decoding by the method outlined in Section 2: We estimate the number of syllables in each candidate and in the source sentence, take the difference (sylldiff(C, S) = abs(syll(C) − syll(S))), and convert this to a score_p(C, S) = 1/(1 + sylldiff(C, S)) that is highest for identical syllable counts. We then reweight the sub-scores for translation and synchrony with a weight α, yielding a rescored beam B_d, of which we take the best-ranked translation as the dubbing-optimal translation. The full rescoring procedure is given in Algorithm 1. We use Pyphen (https://pyphen.org/) to estimate the syllable count for both English (source language) and Spanish (target language).

Algorithm 1: Dubbing-optimal rescoring
1: Input: beam B_t from standard decoding, source sentence S, weight α
2: B_d ← ∅
3: for each candidate C ∈ B_t do
4:   score_t(C) ← C.score
5:   score_p(C) ← 1/(1 + sylldiff(C, S))
6:   score_d(C) ← (1 − α) · score_t(C) + α · score_p(C)
7: Output: rescored beam B_d
8: Select: best-ranked candidate from B_d
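A minimal sketch of the rescoring step in Python (the `Candidate` container is an illustrative assumption; decoder scores and the synchrony score are assumed to lie in comparable ranges):

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str        # detokenized candidate translation
    score: float     # translation score from the NMT decoder (score_t)
    syllables: int   # precomputed syllable count of `text`

def rescore_beam(beam, source_syllables, alpha):
    """Rescore an N-best list with the combined dubbing score and
    return it sorted best-first (cf. Algorithm 1)."""
    def score_d(c):
        sylldiff = abs(c.syllables - source_syllables)
        score_p = 1.0 / (1.0 + sylldiff)
        return (1.0 - alpha) * c.score + alpha * score_p
    return sorted(beam, key=score_d, reverse=True)
```

With α = 0 the decoder's original ranking is kept; as α grows, shorter, syllable-matched candidates overtake more faithful but longer ones.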

Setup and Evaluation Method
Ideally, a dubbing-optimal translation system should be evaluated on dubbed material. We use the HEROes dubbing corpus (Öktem et al., 2018), a corpus of the TV show of the same name with the English source and its Spanish dubbing. The corpus contains a total of 7,000 manually aligned utterance pairs in 9.5 hours of speech and is based on forced alignment of video subtitles to the audio tracks. The audio material (in both English and Spanish) is not yet used in the experiments reported below.
We find that the HEROes corpus contains 85,767 (resp. 83,561) syllables for English (resp. dubbed Spanish), as computed with Pyphen. The average number of syllables per utterance is 12.25 for English and 11.94 for Spanish. We conclude that, on average, both languages use almost the same number of syllables, and hence our syllable-based phonetic similarity measure should be useful. (For other language pairs where the notion of syllable differs, e.g. when considering mora-driven Japanese, one could compute some sort of correction factor between the languages. In our case, we simply ignore the relative difference in syllables of < 3 % between the languages.)

Although large for a dubbing corpus, the 7,000 utterances are far too few to train an NMT model on. We hence use the English → Spanish parallel data in the Europarl corpus (Koehn, 2005) for training and evaluate on both the dubbing corpus and a test set based on the Europarl corpus. The genre of science-fiction TV shows may differ radically from parliament proceedings; however, this merely results in lower BLEU performance on the out-of-domain data. We believe that model adaptation (e.g. Chu and Wang, 2018) or relatively more in-domain training material (e.g. Lison and Tiedemann, 2016) would work orthogonally to the dubbing-specific improvements in our paper. Text pre-processing is identical for both corpora.
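These corpus-level figures can be verified directly from the reported totals:

```python
# Reported HEROes corpus totals (syllables computed with Pyphen).
EN_SYLLABLES = 85_767   # English source
ES_SYLLABLES = 83_561   # dubbed Spanish
UTTERANCES = 7_000

avg_en = EN_SYLLABLES / UTTERANCES                        # per-utterance average (en)
avg_es = ES_SYLLABLES / UTTERANCES                        # per-utterance average (es)
rel_diff = (EN_SYLLABLES - ES_SYLLABLES) / EN_SYLLABLES   # relative difference, < 3 %
```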
We measure translation performance in terms of BLEU (Papineni et al., 2002) as computed with the SacreBLEU software (https://github.com/mjpost/sacreBLEU; Post, 2018). Dubbing-optimality of translations in the test set T is determined by micro-averaging the syllable differences into a synchrony score for T, defined as:

synchrony-score(T) = Σ_{e ∈ T} abs(syll(NMT(e)) − syll(e)) / Σ_{e ∈ T} syll(e)

where NMT(e) is the target translation given by the NMT model P(y|x) (with or without dubbing constraints applied) for English source text e.
As is evident, the lower the synchrony score, the better the dubbing-optimality. We run our experiment to analyze the variation of BLEU vs. synchrony score for different rescoring factors α.
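The metric is easy to state in code; the sketch below takes the translation function and the syllable counter as parameters (both placeholders for the actual NMT model and Pyphen):

```python
def synchrony_score(test_set, translate, syll):
    """Micro-averaged synchrony score over a test set of source
    utterances (lower is better): total absolute syllable difference
    divided by the total number of source syllables."""
    abs_diff = sum(abs(syll(translate(e)) - syll(e)) for e in test_set)
    return abs_diff / sum(syll(e) for e in test_set)
```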
We use the trained NMT model as described in the above section. Our decoding algorithm is as described in Algorithm 1, which we use to compute the relation between translation performance and dubbing-optimality of translations.

Experiment and Results
It has previously been pointed out that NMT performance suffers from beam sizes beyond 5 or 10 (Koehn and Knowles, 2017; Tu et al., 2017), and numerous methods have been proposed to circumvent this (Huang et al., 2017; Yang et al., 2018). However, for our present approach of dubbing-optimization based on N-best rescoring, large beam sizes are essential so that the rescoring described in Algorithm 1 has material to work with. With only a few candidates to rescore, it would not necessarily find the most 'dubbable' result.
We experimented with various beam sizes and found no BLEU degradation for a beam size of 50. Larger beams may eventually lead to a degradation, and run time would become overly long, as it increases linearly with the beam size. To get the best of both worlds, we settle on a beam size of 50 for the experiments reported below.

Figure 3 shows BLEU scores (left scale, higher is better) and synchrony scores (right scale, lower is better) of our proposed system for a range of α between 0 and 1. Note that α = 0 corresponds to no rescoring, i.e., the baseline system.

Evaluation on Dubbing Material
The relatively low BLEU score of 13.67 for the baseline system reflects the domain mismatch between HEROes and Europarl. We find that the BLEU score is impacted only moderately for relatively low values of α, with a relative decrease of 2 % for α = .3. At the same time, we find that the synchrony score improves drastically already for small values of α: while the difference in syllables between source and target is almost one quarter in the baseline system, it is almost halved, down to 14 %, for α = .3. Figure 3 also contains the synchrony score of the proposed translations vs. the actual gold-standard dubbed texts (dotted line in the figure). As can be seen, the similarity increases up to about α = .3 and then flattens out. This is in line with our observation that, while source and target syllable counts correlate highly, there is no perfect match, indicating that our synchrony constraint has only limited value. It also points to the fact that a human dubbing expert needs to find a middle ground between faithful translation and perfect synchrony: given that two differing linguistic systems are involved, perfect synchrony is simply impossible if the meaning is to remain approximately correct.

In-Domain Evaluation
We also evaluate our method in-domain, on test data sampled from Europarl (excluded from training). In particular, we use those source sentences for which multiple reference translations are contained in the corpus (about 18k instances). Europarl translations, of course, are not transcripts of lip-synchronously dubbed speech. Thus, our expectations for synchrony constraints are somewhat lower. However, testing in-domain still helps greatly to validate our out-of-domain results above.
As can be seen in Figure 4, we see a similar decrease in BLEU scores (and only a very gradual one for small values of α) and an even stronger improvement in synchrony scores. This again points towards a useful trade-off when combining synchrony constraints with the requirement of meaning-preserving translation. There is a range of possible reasons why our method does not work as well for Europarl as for the HEROes corpus: Europarl is not transcribed speech and hence may be less 'dubbable' by nature; many phrases in Europarl may translate to phrases with a different number of syllables in the target language, yet the model is reluctant to give up these translations in the in-domain condition; and the proxy target of syllables may work less well for the longer, more specific words found in legal texts, where a focus on only accentuated syllables might be more useful.

Conclusion and Future Work
We have explored the task of dubbing-optimal machine translation, i.e., machine translation that unifies the constraint of faithfulness in translation with the constraint of lip-synchrony for revoicing of audio-visual media. We have, so far, limited our synchrony constraint to counting syllables (which acts as a proxy for jaw openings, a major factor in the visemic characteristics of speech).
We have outlined how one can integrate synchrony constraints into the search during decoding by estimating, from the hidden state of the encoder-decoder model, the number of syllables still remaining. We have implemented a simpler prototype system that instead rescores a conventional system's final N-best list.
Using the (as far as we know) largest corpus of dubbed speech available, the HEROes corpus (Öktem et al., 2018), we have shown our method to yield much more 'dubbable' translations than those that result from a standard MT system. In fact, while the manual dubbing for the sentence in Figure 1 abbreviates the phrase "blood chemistry" to plain "sangre", our model instead chooses "la química de cada persona es única", which is still a reasonable translation of "blood chemistry" and comes very close in terms of syllable count.
In the future, we intend to implement the fully integrated search as described in Section 2, as well as implement more powerful synchrony metrics that could also ground in the source audio (e. g. to find out what syllables were stressed) or the source video (e. g. to find out how well the face is visible), and could also consider detailed aspects of the target speech (e. g. via speech synthesis cost estimates for forcing the target text on the observed visemes).
One interesting and relevant aspect of teaching human interpreters is the task of rewording material in the target language (Gile, 2005). A model that can be trained to come up with alternate wordings for the same concept (but with different synchrony-related properties) would potentially yield much better candidates for 'dubbability' assessment.