Controlling Japanese Honorifics in English-to-Japanese Neural Machine Translation

In the Japanese language different levels of honorific speech are used to convey respect, deference, humility, formality and social distance. In this paper, we present a method for controlling the level of formality of Japanese output in English-to-Japanese neural machine translation (NMT). By using heuristics to identify honorific verb forms, we classify Japanese sentences as being one of three levels of informal, polite, or formal speech in parallel text. The English source side is marked with a feature that identifies the level of honorific speech present in the Japanese target side. We use this parallel text to train an English-Japanese NMT model capable of producing Japanese translations in different honorific speech styles for the same English input sentence.


Introduction
Languages differ in the way they express the same ideas depending on social context. In English different words or phrases are used in a more casual or familiar context compared to a more formal context. In languages such as Japanese or Korean formality distinctions are grammatically encoded using a system of honorifics. These honorifics are part of Japanese verbal morphology, which allows the same concept to be expressed in multiple levels of formality by altering the inflection of the main verb of the sentence. The examples in Table 1 show one sentence in three different levels of formality. In all three examples the meaning is the same, but the inflection of the main verb is different.
It is important to note that these formality distinctions in Japanese are not optional. All sentences must use one verb inflection or another, so speakers are always making a choice of what level of formality to use depending on social context.  For example, when speaking with family, close friends, or others of equal social status, the informal ある (aru "there are") is used. When speaking to superiors, strangers, or older individuals the polite expression あります (arimasu "there are") is used. When expressing deference or humility, the formal expression ございます (gozaimasu "there are") is used. In this paper we use the terms informal, polite, and formal to refer to these three levels of formality as shown in Table 1. Traditional Japanese grammars may make finer-grained, more nuanced distinctions than this. While there is this nuance to Japanese grammar, in English there is no such distinction, so when translating from English into Japanese, a translator must choose one level of formality or another. This poses a challenge for English-Japanese NMT, since for a translation to be adequate it needs to both capture the meaning of the source sentence and use the appropriate level of formality.
We propose a method to allow English-Japanese NMT to produce translations in a particular level of formality, using an additional feature on the source side marking the desired level of formality to be used in the translation. With this feature provided at both training and test time, a single NMT system can learn to distinguish these levels of formality and produce multiple translations for the same input sentence. We evaluate our approach on multiple data sets and show that it successfully produces sentences in the requested level of formality. Apart from yielding more consistent outputs, it improves general translation quality as measured by BLEU on all data sets. We see particularly strong gains on the polite and formal portions of the test sets.
We also release the following resources that were developed as part of our work towards formality-aware NMT:
• A set of manual formality labels for a portion of the Tanaka corpus
• Code for a rule-based formality converter which can be applied as a translation postprocessing step
We hope that these resources will spur further research on translation into Japanese.

English Source: The number at the bottom of the list drops off.
Modified English Source: <polite> The number at the bottom of the list drops off.
Japanese Target: リストの一番下にある番号がリストから削除されます。 (risuto-no ichiban shita-ni aru bangō-ga risuto kara sakujo saremasu)
Table 2: Attaching a single token to the beginning of an English training data source sentence, based on the predicted formality of the Japanese target side.

Formality-Aware NMT
This section describes our approach for creating a formality-aware English-Japanese NMT system.

Choosing Formality in Translation
Our proposed method starts with identifying the formality of every Japanese target sentence in our parallel training corpus. We can determine that the Japanese sentence is informal, polite, or formal based on the verb inflection of the main verb of the sentence, which is often the last word in the sentence. For example, in Table 2 the suffix ます (masu) at the end of the Japanese target sentence is a common politeness marker that identifies this as a polite sentence. This is particular to Japanese grammar, and from the English source sentence alone you cannot determine what level of formality the Japanese translation should have. So to inform our English-Japanese NMT system what formality level we are translating into, we attach the token <polite> to the beginning of the English source sentence. For every sentence pair in our training corpus, we must attach such a token to the beginning of the English source side, depending on the formality of the Japanese target side.
At test time the resulting English-Japanese NMT model will need to be provided the same kind of informal, polite, or formal tokens at the beginning of every English input sentence to be translated. This allows the user of the NMT system to choose which level of formality they would like their Japanese translation to use. There are applications where these labels could be determined automatically from the context; we leave this for future work as our current data sets do not have context beyond the sentence level.
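The source-side tagging described above can be sketched in a few lines of Python. This is only an illustration, not the code used in our experiments: the function name tag_source is hypothetical, and while the <polite> token appears in Table 2, the <informal> and <formal> token spellings are assumed by analogy.

```python
# Illustrative sketch. The <polite> token is taken from Table 2; the
# <informal> and <formal> spellings are assumed by analogy.
FORMALITY_LEVELS = ("informal", "polite", "formal")


def tag_source(english_sentence: str, level: str) -> str:
    """Prepend a formality token such as <polite> to an English source sentence."""
    if level not in FORMALITY_LEVELS:
        raise ValueError(f"unknown formality level: {level!r}")
    return f"<{level}> {english_sentence}"


source = "The number at the bottom of the list drops off."
for level in FORMALITY_LEVELS:
    # One tagged input per requested formality level for the same sentence.
    print(tag_source(source, level))
```

At training time the level would come from the Japanese target side; at test time it is chosen by the user.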

Automatic Identification of Honorifics
In order to label our training and test data with these formality tokens, we need to be able to identify the formality of a Japanese sentence automatically. To do this we look for the presence or absence of certain Japanese honorific verb forms as a heuristic. We created a set of common verbs and verbal inflections that correspond to each formality level, such as the informal expression じゃなかった (janakatta "was not"), the suffixes です (desu) and ます (masu), which attach to verb stems to express politeness, as well as several honorific and humble verbs such as なさいます (nasaimasu "to do" honorific) and 致します (itashimasu "to do" humble), which are used in formal social contexts to either show respect to the listener or show humility from the speaker, respectively. The full set of verb forms can be found in Table 3.
We apply our heuristics to a Japanese monolingual corpus of 21 million sentences, composed of web-crawled text from multiple domains. We categorize sentences into three classes, which we label informal, polite, or formal, by looking for the verb forms in Table 3. We start with the formal verb forms: if any of these are present, we consider the sentence to be formal. If not, we proceed to look for the polite verb forms, and then the informal verb forms. If none of the verb forms in Table 3 is present in the sentence, it is ignored. Of the original 21 million sentences, 1 million could not be categorized by our heuristics.
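The formal-first cascade can be illustrated with the following minimal Python sketch. It makes several assumptions: the verb lists are only the handful of forms mentioned in the running text rather than the full inventory of Table 3, matching is done by plain substring search rather than on tokenized text, and the name classify_formality is hypothetical.

```python
from typing import Optional

# Small illustrative subset of verb forms mentioned in the text; the full
# inventory is in Table 3 and is not reproduced here. Order matters: formal
# forms are checked first, then polite, then informal.
VERB_FORMS = [
    ("formal", ["ございます", "なさいます", "致します"]),
    ("polite", ["です", "ます"]),
    ("informal", ["じゃなかった", "ある", "ない"]),
]


def classify_formality(sentence: str) -> Optional[str]:
    """Return the first formality level whose verb forms occur in the sentence.

    Returns None when no listed form is found, in which case the sentence
    would be ignored. Substring matching is an assumption of this sketch.
    """
    for level, forms in VERB_FORMS:
        if any(form in sentence for form in forms):
            return level
    return None


# The Table 2 target contains both ある and ます; the polite check comes
# before the informal one, so the sentence is labeled polite.
print(classify_formality("リストの一番下にある番号がリストから削除されます。"))
```

Checking the formal forms first matters because a formal form such as ございます also contains the polite suffix ます.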
We hypothesize that a text classifier trained on the resulting 20 million sentences selected by our heuristics will learn more nuanced distinctions in word choice and style than using the heuristics alone, which only identify a small set of verb forms. We tokenize this data set with the KyTea morphological analyzer (Neubig, 2011b) and train a model on the tokenized monolingual data and labels with the text classification tools provided by the FastText (Joulin et al., 2017) toolkit, using word trigram features.
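As an illustration of how heuristically labeled, tokenized sentences could be prepared for supervised training with fastText, the sketch below writes one example per line using the `__label__` prefix that fastText expects. The helper name to_fasttext_line and the hand-segmented example tokens are assumptions for illustration, and the training call in the comment is not executed here.

```python
from typing import List


def to_fasttext_line(label: str, tokens: List[str]) -> str:
    """Format one labeled, pre-tokenized sentence for fastText supervised training."""
    return f"__label__{label} " + " ".join(tokens)


# Hand-segmented example tokens (assumed; in practice they would come from
# a morphological analyzer such as KyTea).
line = to_fasttext_line("polite", ["本", "が", "あり", "ます", "。"])
print(line)

# With the fasttext Python package installed and the lines written to a
# file, a classifier with word n-grams up to length 3 could be trained as:
#   import fasttext
#   model = fasttext.train_supervised(input="train.txt", wordNgrams=3)
```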
To evaluate our classifier's performance, we enlisted the help of a Japanese linguist to make formality judgments on a small test set of 150 Japanese sentences drawn from the publicly available Tanaka corpus (Tanaka, 2001). Out of these 150 total sentences, 68 were labeled informal, 45 were labeled polite and 37 were labeled formal by the annotator. These sentences and annotations will be made publicly available alongside the publication of this paper.

Table 4: Evaluation scores of labels produced by the formality classifier compared to gold test set labels for each formality category (n=150).
heuristic rules on this test set, but we hypothesize that it generalizes better to unseen text and therefore use it in our translation experiments. The results show that our classifier has higher precision on the informal category, but lower recall, and higher recall on the polite and formal categories, but lower precision.

Rule-Based Formality Conversion
We also compare our method of formality-aware NMT with a simple rule-based tool which converts a Japanese sentence from one level of formality to another. This is done by identifying the main verb in a Japanese sentence and either replacing the verb itself or just the verbal inflection with the inflection for the desired level of formality. The code will be made available open-source alongside this publication. Rule-based formality conversion is non-trivial since there are many conjugations to consider for a single verb, which differ based on the class the verb belongs to. For example, to convert the verb 歩きました (arukimashita "walked" polite) to an informal inflection, the polite suffix ました (mashita) is removed from the stem of the verb and a new suffix is appended to create 歩いた (aruita). The き (ki) at the end of the verb stem marks this as a verb in a particular verb class. All verbs with ki at the end of their stem belong to the same class and have the same conjugation pattern.
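The 歩きました example can be written as a single conversion rule. The Python sketch below is a hedged illustration, not the released tool: the function name and the exception table are hypothetical, it covers only the polite past of verbs in the ki class, and it assumes the caller already knows the verb's class (for instance from a dictionary lookup).

```python
# Illustrative single rule: polite past -> informal past for verbs whose
# stem ends in き. Assumes the verb class is already known; a verb such as
# 起きる, whose stem also ends in き but conjugates differently, would need
# a class lookup before this rule is applied.
POLITE_PAST_KI = "きました"
INFORMAL_PAST_KI = "いた"

# Known irregular form (assumed exception table, not exhaustive).
EXCEPTIONS = {"行きました": "行った"}


def polite_past_to_informal(verb: str) -> str:
    """Convert e.g. 歩きました to 歩いた; return the input unchanged otherwise."""
    if verb in EXCEPTIONS:
        return EXCEPTIONS[verb]
    if verb.endswith(POLITE_PAST_KI):
        # Drop the stem-final き together with ました, then append いた.
        return verb[: -len(POLITE_PAST_KI)] + INFORMAL_PAST_KI
    return verb


print(polite_past_to_informal("歩きました"))
```

A full converter would need one such rule per verb class and inflection, which is why coverage is the main difficulty.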
In order to use this rule-based method to compare to our English-Japanese formality-aware NMT, we can simply take our baseline NMT system, trained without the formality tokens described above in Section 2.1, and apply the rules to convert the NMT output into the desired level of formality. However, this rule-based method is imperfect and relies on tokenization and part-of-speech information from the KyTea morphological analyzer. Incorrect part-of-speech tags or tokenization that doesn't match our rule-based tool's dictionary will lead to errors in changing verbal inflection. In our evaluation, we show how using this rule-based method compares to our formality-aware NMT.

Evaluation
In this section, we evaluate the translation quality of our formality-aware NMT models as well as their ability to produce the desired formality level in the output.

Datasets
We use three publicly available parallel data sets for our NMT experiments: the Asian Scientific Paper Excerpt Corpus (ASPEC) (Nakazawa et al., 2016), a corpus of scientific paper abstracts; the Japanese-English Subtitle Corpus (JESC) (Pryzant et al., 2018), a corpus of sentence-aligned movie and television subtitles; and the Kyoto Free Translation Task (KFTT) (Neubig et al., 2011a), a corpus of Wikipedia data about the city of Kyoto. In our experiments we use the standard training and test sets for each parallel corpus. We also use a proprietary parallel training data set which contains web-crawled data from a mix of domains and a corresponding test set. Training and test set sizes are reported in Table 5.
For each parallel corpus, we train a formality-aware NMT model by classifying the formality of the Japanese target side and attaching a corresponding feature to the beginning of each English source segment, identifying the target as being informal, polite, or formal. For comparison, we also train a baseline NMT model without these formality annotations.

Experimental Results
To evaluate our formality-aware NMT models, we first need to choose the right level of formality for each sentence in the test sets. We do this by applying the formality classifier to the test reference and prepending the predicted labels to the source side of each test sentence. We then provide this input to our formality-aware NMT models and compare the output to test set translations from our baseline NMT using BLEU (Papineni et al., 2002). We tokenize the Japanese MT output and reference using KyTea before computing BLEU. We evaluate on the overall test set, as well as each separate portion of the test set where the test reference was classified as informal, polite or formal. Table 6 shows our results on the test set using BLEU.

Performance of Rule-Based Conversion
We first evaluate the performance of the rule-based conversion method described in Section 2.3. The rule-based tool currently has the capability to convert to informal or polite verbal inflections, lacking rules for formal verb inflections. Thus, we only report results on the informal and polite sections of our test sets.
As shown in Table 6, the rule-based method yields improvements on the informal test portion for all models except ASPEC where performance remains the same. On the polite portion, we only see gains for the JESC model but a notable decrease in performance on ASPEC and no changes for the proprietary and KFTT test sets. This shows that while it is possible to adjust the formality level through post-processing, it is a non-trivial task and will require more work to improve the coverage of the tool. However, the rule-based tool could also be used for other tasks such as creating additional synthetic training data.

Performance of formality-aware NMT
The BLEU scores in Table 6 show that on the overall test set, our formality-aware NMT models show an improvement over the baseline NMT models. This holds true for both the model trained on the proprietary training data set, and the models trained on the publicly-available training data sets. Out of the models trained on publicly-available data, the ASPEC model shows the smallest improvement (+0.3 BLEU), the KFTT model improves more (+0.9 BLEU), and the JESC model shows the highest improvement (+1.5 BLEU).
When looking at the individual portions of the test set, as identified by our classifier, we see a larger quality improvement for the model trained on proprietary data on the informal and formal sections of its test set, and a smaller improvement on the polite section. The ASPEC formality-aware NMT is not better on the informal section of its test set, but there are larger gains in quality on the polite and formal test sections. The JESC and KFTT models improve on all three sections, with the largest gains seen in the formal section. Finally, formality-aware NMT improves over the rule-based method for all models and test sections, indicating that the NMT model is more effective at producing the desired formality level in context.

Evaluating formality levels
Since choosing the appropriate formality level in Japanese is very important to conform with social norms, we want to show that our formality-aware NMT models can provide translations in the desired level of formality. As our test sets do not have gold labels from a human annotator for each reference, we use our formality classifier to predict the level of formality for both the MT output and the test reference and compute F1 scores using the predicted reference labels.
Our F1 comparison in Table 7 shows to what extent the formality-aware NMT output matches the predicted formality level of the reference translation when the system is provided with the correct input label. We can see that the F1 scores for the formality-aware NMT are high for all three levels of formality, above 0.9 in all cases. We also see a big improvement over the baseline NMT models for each test set, especially in the polite and formal categories. From this we conclude that our formality-aware NMT models can produce a translation in the desired level of formality.
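The per-category F1 computation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name per_class_f1 is hypothetical, both label lists are assumed to come from the formality classifier (one label per test sentence), and the toy labels at the end are invented for demonstration and are not results from our experiments.

```python
from typing import Dict, List


def per_class_f1(reference_labels: List[str], output_labels: List[str]) -> Dict[str, float]:
    """Compute F1 per formality level, treating predicted reference labels as truth."""
    if len(reference_labels) != len(output_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(reference_labels, output_labels))
    scores = {}
    for level in sorted(set(reference_labels) | set(output_labels)):
        tp = sum(1 for r, o in pairs if r == level and o == level)
        fp = sum(1 for r, o in pairs if r != level and o == level)
        fn = sum(1 for r, o in pairs if r == level and o != level)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores[level] = 2 * precision * recall / total if total else 0.0
    return scores


# Toy labels for demonstration only (not experimental results).
reference = ["informal", "polite", "polite", "formal"]
mt_output = ["informal", "polite", "informal", "formal"]
print(per_class_f1(reference, mt_output))
```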
An imbalance in the training data may partly explain the difference in quality improvement across the three formality sections of the test sets. Table 8 shows how much of the training data for each data set was classified as being informal, polite, or formal. The proprietary data set contains mostly polite and informal data. In contrast, the majority of the three publicly available data sets is informal data, with a much smaller portion of polite data. For all data sets there is very little formal data, leading to the weak baseline performance on that category. By modelling formality levels more explicitly, our models are better able to compensate for the inherent bias towards informal style.

Analysis and Examples
To show some concrete examples of our formality-aware translations, Table 9 contains an example of the MT output from the JESC formality-aware NMT model and the corresponding JESC NMT baseline trained without formality annotations. For this single English source sentence, there are multiple different MT outputs depending on which formality label is attached to the source before passing it to the NMT model for translation. The informal expression ない (nai "there is not") is used in the MT output by both the baseline model and the formality-aware NMT model when

Related Work
Sennrich et al. (2016a) showed that side constraints can be added to the source side of a parallel text to provide control over the politeness of translation output in an English-German translation task. Following this paper's suggestion, we take a similar approach towards Japanese honorifics. Niu et al. (2017) also use a similar approach, termed "Formality-Sensitive Machine Translation", in a French-English translation task. In (Niu et al., 2018) French-English parallel text with formality features is combined with English-English parallel text, where the source and target are of similar meaning but different formality, to create a multi-task model that performs both formality-sensitive MT and monolingual formality transfer.
In related work on Japanese-English NMT, Yamagishi et al. (2016) use a side-constraint approach to control the voice (active or passive) of an English translation. Takeno (2017) apply side constraints more broadly to control translation length, bidirectional decoding, domain adaptation, and unaligned target word generation.
Our paper follows the modeling approach introduced by Johnson et al. (2017), who showed that adding a token to the source side of parallel text allows for training a single NMT model on data for multiple language pairs. Their token specifies the desired target language, allowing the user control over the language of machine translation output, even for source-target language pairs that were not seen during training, which they call "zero-shot" translation. The same approach has been successfully used in other applications, such as in distinguishing standard versus back-translated parallel corpora (Caswell et al., 2019).

Conclusion
We have shown how the distinctions between levels of formality in the Japanese language can be learned by an NMT model, by identifying Japanese honorifics in parallel training data and labeling the source side with an additional feature. We find that this technique provides control over the honorifics present in the MT output and provides an improvement in translation quality, particularly in polite and formal sentences in each test set. This improvement holds for models trained on proprietary data as well as models trained on three widely-used publicly available Japanese data sets. In future work, we would like to explore augmenting the training data for each of the comparisons we showed. We would like to explore creating artificial English-Japanese data by doing a rule-based transformation of the Japanese side of the bitext into different formality levels. We would also like to do further human evaluation of our Japanese formality classifier and the NMT models we trained, and we may explore applying this technique to English-Korean NMT because Korean also has a similar system of honorifics.

Source
King Arthur's knights do battle with a killer rabbit.