Personalized Machine Translation: Predicting Translational Preferences

Machine Translation (MT) has advanced in recent years to produce better translations for clients’ speciﬁc domains, and sophisticated tools allow professional translators to obtain translations according to their prior edits. We suggest that MT should be further personalized to the end-user level – the receiver or the author of the text – as done in other applications. As a step in that direction, we propose a method based on a recommender systems approach where the user’s preferred translation is predicted based on preferences of similar users. In our experiments, this method outperforms a set of non-personalized methods, suggesting that user preference information can be employed to provide better-suited translations for each user.


Introduction
Technologies are increasingly personalized, accommodating their behavior for each user. Such personalization is done through user modeling where the goal is to "get to know" the user. To that end, personalization is based on users' attributes, such as demographics (gender, age etc.), personalities, and preferences. For example, in Information Retrieval, results are customized according to the user's information and search history (Speretta and Gauch, 2005), performance of Automatic Speech Recognition substantially improves when adapted to a specific speaker (Neumeyer et al., 1995), and Targeted Advertising makes use of the user's location and prior purchases (Kölmel and Alexakis, 2002).
Personalization in machine translation has a somewhat different nature. Providers of MT tools and services offer means to "customize" or "personalize" the translation engine for each client, mostly through domain adaptation techniques, and a great deal of effort is made to make the human-involved translation process more efficient (see Section 2.2). Most of the focus, though, goes to customization for companies or professional translators. We argue that Personalized Machine Translation (PMT below) should and can take the next step and directly address individual end-users.
The difficulty to objectively determine whether one (automatic) translation is better than another has been repeatedly revealed in the MT literature. Our conjecture is that one reason is individual preferences, to which we refer as Translational Preferences (TP). TP come into play both when the alternative translations are all correct, and when each of them is wrong in a different way. In the former case, a preference may be a stylistic choice, and in the latter, a matter of comprehension or a selection of the least intolerable error in one's opinion. For instance, one user may prefer shorter sentences than others; she may favor a more formal style, while another would rather have it casual. A user could be fine with some reordering errors but be more picky concerning punctuations. One user will not be bothered if some words are left untranslated (perhaps because the source language belongs to the same language family as the target language that he speaks), while another will find it utterly displeasing. Such differences may be the result of the type of translation system being employed (e.g. syntax-vs. phrased-based), the specific training data or many other factors. On the user's side, a preference may be attributed, for example, to her mother tongue, her age or her personality.
Two aspects of end-user PMT may be considered: (i) Personalized translation of texts written by a specific user, and (ii) PMT to provide better translations for a specific reader. In this work we address the second task, aiming to identify translations each user is more likely to prefer. 1 Specifically, we consider a setting where at least two MT systems are available, and the goal is to predict which of the translation systems the user would choose, assuming we have no knowledge about her preference between them. Benchmarking the systems in advance with respect to a reference set, or estimating the quality of the translations (Specia et al., 2009) are viable alternatives for translation selection; these, however, are not personalized to the target user. Instead, we employ a user-user Collaborative Filtering approach, common in Recommender Systems, which we map to the TP prediction task.
We assess this approach using a collection of user rankings of MT systems from a shared translation task (see Section 3). Our results show that the personalized method modestly, but consistently, outperforms several other approaches that rank the systems in general, disregarding the specific user. We consider this as an indication that user feedback can be employed towards a more personalized approach to machine translation.

Collaborative filtering
Collaborative filtering (CF) is a common approach employed by recommender systems for suggesting to users items, such as books or movies. A recommender system may simply suggest to all users the most popular items; often, however, the recommendations are personalized for each individual user to fit her taste or preferences. User-user CF relies on community preferences. The idea is to recommend to the user items that are liked by users similar to her, as manifested, for example, by high rating. Similar users are those that agree with the current user on previously-rated items. In k-nearest-neighbors CF, a user is typically represented by a vector of her preferences, where each entry of the vector is, e.g., a rating of a movie. k similar users are then identified by measuring the similarity between the users' vectors. Cosine similarity is a popular function for that purpose, and we also use it in our work (Resnick et al., 1994;Sarwar et al., 2001;Ricci et al., 2011). An alternative to cosine, Pearson's correlation coefficient (Pearson, 1895), allows addressing different rating patterns across users. In comparison to cosine, here vector entries are normalized with respect to the user's average rating. In our case, such normalization is not very meaningful since the entries of the users vectors represent comparisons rather than absolute ratings, as will be made clear in Section 4. Nevertheless, we have experimented with Pearson correlation as well, and found no advantage in using it instead of cosine.

Customization, personalization and adaptation in MT
Various means of customization and personalization are available, in both academic and commercial MT. Many of them target the company, rather than the individual user, and much of the effort is invested in designing tools for professional translators, aiming to improve their productivity, through intelligent Computer Aided Translation (CAT). Domain adaptation methods are commonly used to adapt to the topic, the genre and even the style of the translated material. Using the company's own corpora is one of the simplest techniques to do so, but many more approaches have been proposed, including dataselection (Axelrod et al., 2011;Gascó et al., 2012;Mirkin and Besacier, 2014), mixture models (Foster and Kuhn, 2007) and table fill-up (Bisazza et al., 2011). Clients can utilize their own glossaries (Federico et al., 2014), corpora (parallel or monolingual) and translation memories (TM), either shared or private ones (Caskey and Maskey, 2014;Federico et al., 2014). Through Adaptive and Interactive MT (Nepveu et al., 2004), the system learns from the translator's edits, in order to avoid repeating errors that have already been corrected. Post-editions can continuously be added to the translator's TM or be used as additional training material, for tighter adaptation to the domain of interest, through batch or incremental training.

User preferences in MT
Many tasks that require annotation by humans are affected by the annotator and not only by the item being judged. Metrics for inter-rater reliability or interannotator agreement, such as Cohen's Kappa (Cohen, 1960), help measuring the extent to which annotators disagree. Disagreement may be due to untrained or inattentive annotators, a result of a task that is not well defined, or when there is no obvious "truth". Such is the case with the evaluation of translation quality -it is not always straightforward to tell whether one translation is better than another. A single sentence can be translated in multiple correct ways. The decision becomes even harder when the translations are automatically produced and are imperfect: Is one error worse than another? The answer is in the eye of the beholder. MT papers regularly report rather low Kappa levels, even when measured on simpler tasks, such as short segments (Macháček and Bojar, 2015). Turchi et al. (2013) refer to the issue of "subjectivity" of human annotators. They address the task of binary classification of "good" vs. "bad" translations, and show that relying on human annotation for training a binary quality estimator is less effective than using automatically-generated labels. This subjectivity is exactly what we are after. We treat it as a preference, trying to identify the systems or specific translations which the user subjectively prefers. Kichhoff et al. (2012) analyze user preferences with respect to MT errors. They show that some types, e.g. word order errors, are the most dis-preferred by users, and that this is a more important factor than the number of errors. While very relevant for our research, their analysis is aggregated over all users participating in the study, and is not focusing on individuals' preferences.

Data
In this work we used the data provided for the MT Shared Task in the 2013 Workshop on Statistical Machine Translation (WMT) (Bojar et al., 2013). 2 This data was of a particularly large scale, with crowdsourced human judges, either volunteer researchers or paid Amazon Turkers. For each source sentence, a judge was presented with the source sentence itself, a reference translation, and the outputs of five machine translation systems. The five systems were randomly selected from the pool of participating systems, and were anonymized and randomly-ordered when presented to the judge. The judge had to rank the translations, with ties allowed (i.e. two system can receive the same ranking). Hence, each annotation point provided with 10 pairwise rankings between systems. Translations of 10 language pairs were assessed, with 11 to 19 systems for each pair. In total, over 900K non-tied pairwise rankings were collected. The Turkers' annotation included a control task for quality assurance, rejecting Turkers failing more than 50% of the control points. The inter-annotator score showed on average a fair to moderate level of agreement.

Translational preferences with collaborative filtering
Our method, denoted CTP (Collaborative Translational Preferences), is based on a k-nearest-neighbors approach for user-user CF. That is, we predict the translational preferences of a user based on those of similar users. In our setting, a user preference is the choice between two translation systems -which system's translations does the user prefer. Given two systems (or models of the same system) we wish to predict which one the user would prefer, without assuming the user has ever expressed her preference between these two specific systems. It is important to emphasize that the method presented here considers the users' overall preferences of systems, and does not regard the specific sentence that is being translated. In future work we intend to make use of this information as well.

Representation
As mentioned in Section 3, each annotation consists of a ranking of five systems. From that, we extract pairwise rankings for every pair of systems that were ranked for a given language pair. For each user u ∈ U (where U are all users who annotated the language pair), we create a user-preference vector, p u , that contains an entry for each pair of translation systems. Denoting the set of systems with S, we have |S|·(|S|−1) 2 system pairs. E.g., for Czech-English, with 11 participating systems, the user vector size is 55. Each entry (i, j) of the vector is assigned the following value: where w (i,j) u and l (i,j) u are the number of wins and loses of system s i vs. system s j as judged by user u. 3 With this representation, a user vector contains values between −1 (if s i always lost to s j ) and 1 (if s i always won). If the user always ranked the two systems identically, the value is 0, and if she has never evaluated the pair, the entry is regarded as a missing value (NA). Altogether, we have a matrix of users by system pairs, as depicted in Figure 1. 3 We have also considered including ties in the denominator of the equation; discarding them was found superior.

Finding similar users
Given a user preference to predict for a pair of systems (s i , s j ), we compute the similarity between p u and each one of p u for all other u ∈ U. In our experiments we used cosine as the similarity measure. The k most-similar-users (MSU ) are then selected. To be included in MSU (u), we require that u and u have judged at least 2 common system pairs.

Preference prediction
Given the similarity scores, to predict the user's preference for the target system pair, we compute a weighted average of the predictions of the users in MSU (u).
We include in the average only users with similarity scores above a certain positive threshold (0.05). We then require that a minimum number of users meet the above criteria of common annotations and minimum similarity (we used 5). If not enough such similar users are found, we turn to a fallback, where we use the non-weighted average preference across all users (AVPF presented in Section 5). 4 The prediction is then the sign of the weighted average. A positive value means s i is the preferred system; a negative one means it is s j , and a zero is a draw. In our evaluation we compare this prediction to the sign of the actual preference of the user, p u (i,j) . Formally, CTP computes the following prediction function f for a given user u and a system pair (s i , s j ): where u ∈ MSU (u) are the most similar users (the nearest neighbors) of u; p u (i,j) are the preferences of user u for (s i , s j ) and sim(u, u ) is the similarity score between the two users. 5

Evaluation methodology
In our experiments we try to predict which one of two translation systems would be preferred by a given user. 4 The fallback was used 0.1% of the times. 5 The denominator is not required as long as we predict only the sign since all used similarity scores are positive. We keep it in order to obtain a normalized score that can be used for other decisions, e.g. ranking multiple systems.
We evaluate our method, as well as several other prediction functions, when compared with the user's pairwise system preference according to the annotationp u (i,j) , shown in Equation 1. For each user this is an aggregated figure over all her pairwise rankings for the pair, determining the preferred system as the one chosen by the user (i.e. ranked higher) more times.
We conduct a leave-one-out experiment. For each language pair, we iterate over all non-NA entries in the user-preferences matrix, remove the entry and try to predict it. User similarity scores are re-computed for each evaluation point, to ensure they do not consider the target pair. The "gold" preference is positive when the user prefers s i , negative when she prefers s j and 0 when she has no preference between them. Hence, each of the assessed methods is measured by the accuracy of predicting the sign of the preference.

Non-personalized methods
We compare CTP to the following prediction methods: Always i (ALI) This is a naïve baseline showing the score when always predicting that system i wins. Note that the baseline is not simply 50% due to ties.
Average rank (RANK) Here, two systems are compared by the average of their rankings across all annotations (r ∈ {1, 2, 3, 4, 5}): r j and r i are the average ranks of s j and s i respectively. Since a smaller value of r corresponds to a higher rank, we subtract the rank of s i from s j and not the other way around. This way, if for instance, s i is ranked on averaged higher than s j , the prediction would be positive, as desired. Koehn (2012) and used by Bojar et al. (2013) in order to rank the participating systems in the WMT benchmark, compares the expected wins of the two systems. Its intuition is explained as follows: "If the system is compared against a randomly picked opposing system, on a randomly picked sentence, by a randomly picked judge, what is the probability that its translation is ranked higher?" The expected wins of s i , e(s i ), is the probability of s i to win when compared to another system, estimated as the total number of wins of s i relative to the total number of comparisons involving it, excluding ties, and normalized by the total number of systems excluding s i , |{s k }|:

Expected (EXPT) This metric, proposed by
where w (i,k) and l (i,k) are summed over all users. The preference prediction is therefore: RANK and EXPT predict preferences based on a system's performance in general, when compared to all other systems. We propose an additional prediction function for comparison which uses only the information concerning the system pair under consideration.
Average user preference (AVPF) This method takes into account only the specific system pair and averages the user preferences for the pair. Formally: where u = u, and {u } are all users except u. This method can be viewed as a non-personalized version of CTP, with two differences: (1) It considers all users, and not only similar ones.
(2) It does not weight the preferences of the other users by their similarity to the target user. Table 1 shows the results of an experiment comparing the performance of the various methods in terms of prediction accuracy. Figure 2 shows the micro-average scores, when giving each of the 97,412 test points an equal weight in the average. CTP outperforms all others for 9 out of 10 language pairs, and in the overall microaveraged results. The difference between CTP and each of the other metrics was found statistically significance with p < 5 · 10 −6 at worse, as measured with a paired Wilcoxon signed rank test (Wilcoxon, 1945) on the predictions of the two methods. The significance test captures in this case the fact that the methods disagreed in many more cases than is visible by the score difference. Our method was found superior to all others also when computing macro-average, taking the average of the scores of each language pair, as well as when the ties are included in the computation of p u .

Results
The parameters with which the above results were obtained are found within the method's description in Section 4. Yet, in our experiments, CTP turned out to be rather insensitive to their values. In this experiment we used a global set of parameters and did not tune them for each language pair separately. It is reasonable to assume that such tuning would improve results. For instance, choosing k, the number of users to include in the average, depends on the total number of users. E.g., for en-es, where there are only 57 users in total, reducing k's value from 50 to 25, improves results of CTP from 62.6% to 63.2%, higher than all other methods (whose scores are not affected).
Specifically in comparison to AVPF, weighting by the similarity scores was found to be a more significant factor than selecting a small subset of the users. This may not come as a surprise, since less similar users that are added to MSU (u) have a smaller impact on the final decision since their weight in the average is smaller.  Table 1: Results in accuracy percentage for the 10 language pairs, including the languages: Czech (cs), English (en), German (de), Spanish (es), French (fr) and Russian (ru). The best results is in bold. The difference between CTP and each of the other methods is highly statistically significant. Figure 2 shows a micro-average of these results.
One weakness of CTP, as well as of other methods, is that it poorly predict ties. In the above experiment, approximately 13.5% of the preferences were 0, none of them was correctly identified. Our analysis showed that numerical accuracy is not the main cause; setting any prediction that is smaller than some values of |ε| to 0 was not found helpful. Arguably, ties need not be predicted, since if the user has no preference between two systems, any choice is just as good. Still, we believe that better ties prediction could lead to general improvement of our method and we wish to address it in future work.

Discussion
We addressed the task of predicting user preference with respect to MT output via a collaborative filtering approach whose prediction is based on preferences of similar users. This method predicts TP better than a set of non-personalized methods. The gain is modest in absolute numbers, but the results are highly statistically significant and stable over parameter values. We consider this work as a step towards more personalized MT. This line of research can be extended in multiple ways. First and foremost, as mentioned, we did not consider the actual content of the sentences, but rather identified a general preference for one system over another. It is plausible, however, that one system is better -from the user's perspective -at translating one type of text, while another is preferred for other texts. Taking the actual texts into account seems therefore es-sential. Content-based methods for recommender systems may be useful for this purpose. Another factor that may be affecting preferences is translation quality: when compared translations are all poor, preferences play a less significant role. Hence, it may be informative to assess TP prediction separately across different levels of translation quality.
Large parallel corpora are typically required for training reasonable statistical translation models. Yet, parallel corpora, and even more so in-domain ones, are hard to gather. It is virtually impossible to find a user-specific parallel corpus, and methods for monolingual domain adaptation are easier to envisage if one wishes to address author-aware PMT (the first PMT task mentioned in Section 1). Collecting user feedback is another challenge, especially since most endusers do not speak the source language. For that and other reasons, it currently seems more feasible to collect preference information from professional translators, explicitly or implicitly.Yet, in this research we aim at end-users rather than translators whose preferences are often driven by the ease of correction more than anything else. We believe that one way to tackle this issue is to exploit other kinds of feedback, from which we can infer user preferences and similarity. Online MT providers are recently collecting end-user feedback for their proposed translations which may be useful for TP prediction. For instance, in early 2015 Facebook introduced a feature letting users rate (Bing) translations, and Google Translate asks for suggested improvements. We are hopeful that such data becomes publicly available. Nevertheless, it remains unlikely to obtain feedback from each and every user. A potential direction for both corpora and feedback collection is personalizing models and identifying preferences for groups of users based on socio-demographic traits, such as gender, age or mother tongue, or based on (e.g. Big 5) personality traits. These can even be inferred by automatically analyzing user texts.