How Much Information Does a Human Translator Add to the Original?

We ask how much information a human translator adds to an original text, and we provide a bound. We address this question in the context of bilingual text compression: given a source text, how many bits of additional information are required to specify the target text produced by a human translator? We develop new compression algorithms and establish a benchmark task.


Introduction
Text compression exploits redundancy in human language to store documents compactly and transmit them quickly. It is natural to think about compressing bilingual texts, which have even more redundancy: "From an information theoretic point of view, accurately translated copies of the original text would be expected to contain almost no extra information if the original text is available, so in principle it should be possible to store and transmit these texts with very little extra cost." (Nevill and Bell, 1992) Of course, if we look at actual translation data (Figure 1), we see that there is quite a bit of unpredictability. But the intuition is sound. If there were a million equally-likely translations of a short sentence, it would only take us log_2(1,000,000) ≈ 20 bits to specify which one.
By finding and exploiting patterns in bilingual data, we want to provide an upper bound for this question: How much information does a human translator add to the original? We do this in the context of building a practical compressor for bilingual text. We adopt the same scheme used in monolingual text compression benchmark evaluations, such as the Hutter Prize (Hutter, 2006), a competition to compress a 100MB extract of English Wikipedia. A valid entry is an executable, or self-extracting archive, that prints out Wikipedia, byte-for-byte. Decompression code, dictionaries, and other resources must be embedded in the executable; we cannot assume that the recipient of the compressed file has access to those resources. This view of compression goes by the name of algorithmic information theory (or Kolmogorov complexity).
Any executable is permitted. For example, if our job were to compress the first million digits of π, then we might submit a very short piece of code that prints those digits. The brevity of that compression would demonstrate our understanding of the sequence. Of course, in our application, we will find it useful to develop generic algorithms that can compress any text.
Our approach will be as follows. Given a bilingual text (file1 and file2), we develop this compression interface:

% compress file1 > file1.exe
% bicompress file2 file1 > file2.exe

The second command compresses file2 while looking at file1. We take the size of file1.exe as the information in the original text, and the size of file2.exe as the information the translator adds. Our decompression interface is:

% file1.exe > file1
% file2.exe file1 > file2

The second command decompresses file2 while looking at (uncompressed) file1. The contributions of this paper are:
1. We provide a new quantitative bound for how much information a translator adds to an original text.
2. We present practical software to compress bilingual text with compression rates that exceed the previous state of the art.
3. We set up a public benchmark bilingual text compression challenge to stimulate new researchers to find and exploit patterns in bilingual text.
Ultimately, we want to feed those ideas into practical machine translation systems.

Data
We propose the widely accessible Spanish/English Europarl corpus v7 (Koehn, 2005) as a benchmark for bilingual text compression (Figure 2). Portions of this large corpus have been used in previous compression work (Sánchez-Martínez et al., 2012). The Spanish side is in UTF-8. For English, we have removed accent marks and further eliminated all but the 95 printable ASCII characters (Brown et al., 1992), plus newline. Our task is to compress the data "as is": untokenized, but already segment-aligned. We also include a tokenized version with 334 manually word-aligned segment pairs (Lambert et al., 2005) distributed throughout the corpus.
For rapid development and testing, we have arranged a smaller corpus that is 10% the size of the full corpus (Figure 3).

Monolingual compression
Compression captures patterns in data. Language modeling also captures patterns, but at first blush, these two areas seem distinct. In compression, we seek a small executable that prints out a text, while in language modeling, we seek an executable that assigns low perplexity to held-out test data. Actually, the two areas have much more in common, as a review of compression algorithms reveals.
Huffman coding. A well-known compression technique is to create a binary Huffman tree whose leaves are characters in the text, and whose edges are labeled 0 or 1 (Huffman, 1952). The tree is arranged so that frequent characters have short binary codes (edge sequences). It is very important that the Huffman tree for a particular text be included at the beginning of the compressed file, so that decompression knows how to process the compressed bit string.
Adaptive Huffman. Actually, we can avoid shipping the Huffman tree inside the compressed file, by building the tree adaptively, as the compressor processes the input text. If we start with a uniform distribution, the first few characters may not compress very well, but soon we will converge onto a good tree and good compression. It is very important that the decompressor exactly recapitulate the same sequence of Huffman trees that the compressor made. It can do this by counting characters as it outputs them, just as the compressor counted characters as it consumed them.
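The lockstep property described above can be illustrated with a minimal sketch. A simple adaptive count model stands in for the full adaptive Huffman tree, and the add-one smoothing over 256 byte values is our assumption for the initial uniform distribution:

```python
from collections import Counter

def adaptive_counts(text):
    """Yield (char, model probability before seeing it), then update
    counts -- the same rule the compressor and decompressor both follow."""
    counts = Counter()
    total = 0
    for ch in text:
        # add-one smoothing over 256 byte values stands in for the
        # initial uniform distribution (an assumption for this sketch)
        prob = (counts[ch] + 1) / (total + 256)
        yield ch, prob
        counts[ch] += 1
        total += 1

# The decompressor replays the identical updates as it *emits* characters,
# so both sides always agree on the current model.
enc = list(adaptive_counts("abracadabra"))
dec = list(adaptive_counts("abracadabra"))
assert enc == dec                 # models stay in lockstep
assert enc[3][1] > enc[0][1]      # the second 'a' is already cheaper to code
```

The assertions show the two properties the text relies on: both sides compute identical distributions, and repeated characters quickly become cheap.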
Adaptive compression can also nicely accommodate shifting topics in text, if we give higher counts to recent events. By its single-pass nature, it is also good for streaming data.
Arithmetic coding. Huffman coding exploits a predictive unigram distribution over the next character. If we use more context, we can make sharper distributions. An n-gram table is one way to map contexts onto predictions.
How do we convert good predictions into good compression? The solution is called arithmetic coding (Rissanen and Langdon Jr., 1981; Witten et al., 1987). Figure 4 sketches the technique. We produce context-dependent probability intervals, and each time we observe a character, we move to its interval. Our working interval becomes smaller and smaller, but the better our predictions, the wider it stays. A document's compression is the shortest bit string that fits inside the final interval. In practice, we do the bit-coding as we navigate probability intervals.
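The interval-narrowing idea can be sketched in a few lines, using a fixed toy model rather than the context-dependent predictions described above:

```python
import math

def narrow(interval, cum_lo, cum_hi):
    """Shrink the working interval to the sub-interval assigned to the
    observed character."""
    lo, hi = interval
    width = hi - lo
    return (lo + width * cum_lo, lo + width * cum_hi)

# Toy fixed model: P(a) = 0.8, P(b) = 0.2, laid out on [0, 1) as
# a -> [0.0, 0.8) and b -> [0.8, 1.0).
cum = {"a": (0.0, 0.8), "b": (0.8, 1.0)}

interval = (0.0, 1.0)
for ch in "aaab":
    interval = narrow(interval, *cum[ch])

lo, hi = interval
# Any number in [lo, hi) identifies the text; naming one takes about
# -log2(width) bits, so wider intervals mean shorter outputs.
bits = math.ceil(-math.log2(hi - lo))
```

The final width is the product of the per-character probabilities (0.8³ · 0.2 ≈ 0.1024 here), so sharper predictions leave a wider interval and a shorter bit string.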
Arithmetic coding separates modeling and compression, making our job similar to language modeling, where we try to use context to predict the next symbol.

PPM
PPM is the most well-known adaptive, predictive compression technique (Cleary and Witten, 1984). PPM updates character n-gram tables (usually n=1..5) as it compresses. In a given context, an n-gram table may predict only a subset of characters, so PPM reserves some probability mass for an escape (ESC), after which it executes a hard backoff to the (n-1)-gram table. In PPMA, P(ESC) is 1/(1+D), where D is the number of times the context has been seen. PPMB uses q/D, where q is the number of distinct character types seen in the context. PPMC uses q/(q+D), aka Witten-Bell. PPMD uses q/2D. PPM* uses the shortest previously-seen deterministic context, which may be quite long. If there is no deterministic context, PPM* goes to the longest matching context and starts PPMD. Instead of the longest context, PPMZ rates all contexts between lengths 0 and 12 according to each context's most probable character. PPMZ also implements an adaptive P(ESC) that combines context length, number of previous ESC in the context, etc.
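The escape estimators above can be summarized in a few lines (a sketch; real PPM implementations also handle exclusions and the D = 0 corner case):

```python
def p_escape(method, D, q):
    """Escape probability for a context seen D times with q distinct
    following character types, under the classic PPM variants."""
    if method == "A":   # PPMA
        return 1 / (1 + D)
    if method == "B":   # PPMB
        return q / D
    if method == "C":   # PPMC (Witten-Bell)
        return q / (q + D)
    if method == "D":   # PPMD
        return q / (2 * D)
    raise ValueError(method)

# A context observed 10 times with 3 distinct continuations:
assert p_escape("A", 10, 3) == 1 / 11
assert p_escape("C", 10, 3) == 3 / 13
```

The variants differ only in how aggressively they reserve mass for unseen characters; PPMC's q/(q+D) grows with the diversity of the context.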
We use our own C++ implementation of PPMC for monolingual compression experiments in this paper. When we pass over a set of characters in favor of ESC, we remove those characters from the hard backoff.

PAQ
PAQ (Mahoney, 2005) is a family of state-of-the-art compression algorithms and a perennial Hutter Prize winner. PAQ combines hundreds of models with a logistic mixing unit when making a prediction. This is most efficient when predictions are at the bit level instead of the character level. The unit's model weights are adaptively updated by

w_i ← w_i + η · x_i · (y − P(1))

where η is a fixed learning rate, x_i is the stretched prediction of the ith model P_i(1), y is the bit actually observed, and P(1) is the mixer's combined prediction. PAQ models include a character n-gram model that adapts to recent text, a unigram word model (where a word is defined as a subsequence of characters with ASCII > 32), a bigram model, and a skip-bigram model.

Bilingual Compression: Prior Work
Nevill and Bell (1992) introduce the concept but actually carry out experiments on paraphrase corpora, such as different English versions of the Bible. Conley and Klein (2008) and Conley and Klein (2013) compress a target text that has been word-aligned to a source text, to which they add a lemmatizer and bilingual glossary. They obtain a 1%-6% improvement over monolingual compression, without counting the cost of auxiliary files needed for decompression. Martínez-Prieto et al. (2009) and Adiego et al. (2010) rewrite bilingual text by first interleaving source words with their translations, then compressing this sequence of biwords. Sánchez-Martínez et al. (2012) improve the interleaving scheme and include offsets so that decompression can reconstruct the original word order. They also compare several character-based and word-based compression schemes for biword sequences. On Spanish-English Europarl data, they reach an 18.7% compression rate on word-interleaved text, compared to 20.1% for concatenated texts, a 7.2% improvement. Al-Onaizan et al. (1999) study the perplexity of learned translation models, i.e., the probability assigned to the target corpus given the source corpus. They observed that iterative training improves training-set perplexity (as guaranteed) but degrades test-set perplexity. They hypothesized that an increasingly tight, unsmoothed translation dictionary might exclude word translations needed to explain test-set data. Subsequently, research moved to extrinsic evaluation of translation models, in the context of end-to-end machine translation. Foster et al. (2002) and others have used prediction to propose auto-completions to speed up human translation. As we have seen, prediction and compression are highly related.
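PAQ's logistic mixing, described earlier, can be sketched as a toy two-model mixer. PAQ's actual feature sets, learning rate, and fixed-point arithmetic differ; this only illustrates the update rule:

```python
import math

def stretch(p):          # logit: maps (0, 1) to the real line
    return math.log(p / (1.0 - p))

def squash(x):           # logistic: inverse of stretch
    return 1.0 / (1.0 + math.exp(-x))

def mix_and_update(weights, preds, y, eta=0.02):
    """One step of logistic mixing: combine per-model predictions of the
    next bit into P(1), then move each weight toward the models that
    were right.  Implements w_i <- w_i + eta * x_i * (y - P(1))."""
    xs = [stretch(p) for p in preds]
    p1 = squash(sum(w * x for w, x in zip(weights, xs)))
    new_w = [w + eta * x * (y - p1) for w, x in zip(weights, xs)]
    return p1, new_w

# Two models predict P(next bit = 1) as 0.9 and 0.3; the bit turns out 1.
weights = [0.0, 0.0]
p1, weights = mix_and_update(weights, preds=[0.9, 0.3], y=1)
# The confident, correct model gains weight; the wrong one loses weight.
```

With zero initial weights the mixer outputs 0.5; after observing the bit, the weight on the model that predicted 0.9 becomes positive and the other negative.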

Predictive Bilingual Compression
Our algorithm compresses target-language file2 while looking at source-language file1:

% bicompress file2 file1 > file2.exe

To make use of arithmetic coding, we consider the task of predicting the next target character, given the source sentence and the target string so far:

P(e_j | f_1 ... f_l, e_1 ... e_{j−1})

If we are able to accurately predict what a human translator will type next, then we should be able to build a good machine translator. Here is an example of the task:

Spanish: Pido que hagamos un minuto de silencio.
English so far: I should like to ob

Figure 5: Compressing a file of (unidirectional) automatic Viterbi word alignments computed from our large Spanish/English corpus (sentences shorter than 50 words).

Word alignment
Let us first work at the word level instead of the character level. If we are predicting the jth English word, and we know that it translates f_i ("aligns to f_i"), and if f_i has only a handful of translations, then we may be able to specify e_j with just a few bits. We may therefore suppose that a set of Viterbi word alignments may be useful for compression (Conley and Klein, 2008; Sánchez-Martínez et al., 2012). We consider unidirectional alignments that link each target position j to a single source position i (including the null word at i = 0). Such alignments can be computed automatically using EM (Brown et al., 1993), and stored in one of two formats:

Absolute: 1 2 5 5 7 0 3 6 ...
Relative: +1 +1 +3 0 +2 null -4 +3 ...

In order to interpret the bits produced by the compressor, our decompressor must also have access to the same Viterbi alignments. Therefore, we must include those alignments at the beginning of the compressed file. So let's compress them too.
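The relative format can be derived from the absolute one as follows. This is a sketch consistent with the example above; treating null links as transparent (each offset is computed from the last non-null position) is our assumption:

```python
def to_relative(absolute):
    """Convert absolute alignment positions to relative offsets.
    A null link (position 0) becomes the token 'null' and, by our
    assumption here, does not move the previous-position pointer."""
    offsets, prev = [], 0
    for i in absolute:
        if i == 0:
            offsets.append("null")
        else:
            d = i - prev
            offsets.append(f"{d:+d}" if d else "0")
            prev = i
    return offsets

# The example from the text:
assert to_relative([1, 2, 5, 5, 7, 0, 3, 6]) == \
    ["+1", "+1", "+3", "0", "+2", "null", "-4", "+3"]
```

Note how the -4 after the null link is measured from position 7, the last non-null alignment.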
How compressible are alignment sequences? Figure 5 gives results for Viterbi alignments derived from our large parallel Spanish/English corpus. First, some interesting facts:
• Huffman works better on relative offsets, because the common "+1" gets a short bit code.
• PPMC's use of context makes it impressively insensitive to alignment format.
• PPMC beats Huffman on relative offsets. This would not happen if relative offset integers were independent of one another, as assumed by Brown et al. (1993) and Vogel et al. (1996). Bigram statistics bear this out:

P(+1 | −2) = 0.20    P(+1 | +1) = 0.59
P(+1 | −1) = 0.20    P(+1 | +2) = 0.49
P(+1 | 0) = 0.52

So this small compression experiment already suggests that translation aligners might want to model more context than just P(offset).
However, the main point of Figure 5 is that the compressed alignment file requires 12.4 Mb! This is too large for us to prepend to our compressed file, for the sake of enabling decompression.

Translation dictionary
Another approach is to forget Viterbi alignments and instead exploit a probabilistic translation dictionary t(e|f). To predict the next target word e_j, we admit the possibility that e_j might be translating any of the source tokens:

Σ_{i=0}^{l} a(i|j, l) · t(e_j | f_i)

In compression, we must predict English words incrementally, before seeing the whole string. Furthermore, we must predict P(STOP) to end the English sentence. We can adapt IBM Model 2 to make incremental predictions:

P(STOP | f_1 ... f_l, e_1 ... e_{j−1}) ≈ P(STOP | j, l) = ε(j−1 | l) / Σ_{k≥j−1} ε(k | l)

P(e_j | f_1 ... f_l, e_1 ... e_{j−1}) ≈ P(e_j | f_1 ... f_l) = [1 − P(STOP | j, l)] · Σ_{i=0}^{l} a(i|j, l) · t(e_j | f_i)

where ε(k | l) is the probability of a k-word English sentence given an l-word source sentence. We can train t, a, and ε on our bilingual text using EM (Brown et al., 1993). However, the t-table is still too large to prepend to the compressed English file.
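The incremental prediction can be sketched as follows. The table layouts are hypothetical (plain dictionaries for t(e|f), a(i|j,l), and the length model ε), and a k_max cutoff stands in for the sum over all possible target lengths:

```python
def next_word_dist(f, j, t, a, eps, k_max=100):
    """Incremental IBM Model 2 prediction of the j-th English word
    (1-indexed), given source sentence f with f[0] = the null word.
    Hypothetical layouts: t[f_word] maps English words to t(e|f);
    a[(i, j, l)] holds alignment probabilities; eps[(k, l)] holds
    length probabilities."""
    l = len(f) - 1
    denom = sum(eps.get((k, l), 0.0) for k in range(j - 1, k_max))
    p_stop = eps.get((j - 1, l), 0.0) / denom if denom else 0.0
    dist = {}
    for i, f_word in enumerate(f):
        ai = a.get((i, j, l), 1.0 / (l + 1))   # back off to uniform
        for e, p in t.get(f_word, {}).items():
            dist[e] = dist.get(e, 0.0) + ai * p
    # scale the word mass by the probability of not stopping
    return p_stop, {e: (1.0 - p_stop) * p for e, p in dist.items()}

# Toy tables (hypothetical values, for illustration only):
t = {"la": {"the": 1.0}, "casa": {"house": 1.0}, "NULL": {}}
eps = {(0, 2): 0.1, (1, 2): 0.3, (2, 2): 0.6}
p_stop, dist = next_word_dist(["NULL", "la", "casa"], j=1, t=t, a={}, eps=eps)
```

With these toy tables, P(STOP) = 0.1 for the first position, and the remaining 0.9 of probability mass is split over candidate first words.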

Adaptive translation modeling
Instead, inspired by PPM, we build up translation tables in RAM, during a single pass of our compressor. Our decompressor then rebuilds these same tables, in the same way, in order to interpret the compressed bit string.
Neal and Hinton (1998) describe online EM, which updates probability tables after each training example. Liang and Klein (2009) and Levenberg et al. (2010) apply online EM to a number of language tasks, including word alignment. Here we concentrate on the single-pass case.
We initialize a uniform translation model, use it to collect fractional counts from the first segment pair, normalize those counts to probabilities, use those new probabilities to collect fractional counts from the second segment pair, and so on. Because we pass through the data only once, we hope to converge quickly to high-quality tables for compressing the bulk of the text.
Unlike in batch EM, we need not keep separate count and probability tables. We only need count tables, including summary counts for normalization groups, so memory savings are significant. Whenever we need a probability, we compute it on the fly. To avoid zeroes being immediately locked in, we invoke add-λ smoothing every time we compute a probability from counts:

t(e | f) = (count(e, f) + λ_t) / (count(f) + λ_t · |V_E|)

where |V_E| is the size of the English vocabulary. We determine |V_E| via a quick initial pass through the data, then include it at the top of our compressed file.
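A sketch of the count-only table. The class name and layout are ours; only the smoothed formula comes from the text:

```python
from collections import defaultdict

class OnlineTTable:
    """Count-only t-table: probabilities are never stored, only computed
    on demand with add-lambda smoothing, as in our single-pass setting."""
    def __init__(self, vocab_size_e, lam=1e-4):
        self.c_ef = defaultdict(float)   # count(e, f)
        self.c_f = defaultdict(float)    # count(f): normalization-group total
        self.V = vocab_size_e
        self.lam = lam

    def t(self, e, f):
        # t(e|f) = (count(e,f) + lam) / (count(f) + lam * |V_E|)
        return (self.c_ef[(e, f)] + self.lam) / (self.c_f[f] + self.lam * self.V)

    def add(self, e, f, gamma):
        """Accumulate a fractional count gamma from the current segment pair."""
        self.c_ef[(e, f)] += gamma
        self.c_f[f] += gamma

tbl = OnlineTTable(vocab_size_e=100000)
p0 = tbl.t("house", "casa")         # uniform before any counts: 1/|V_E|
tbl.add("house", "casa", 0.75)
assert tbl.t("house", "casa") > p0  # evidence sharpens the distribution
```

Before any counts arrive, every t(e|f) is 1/|V_E|, so no event is ever locked out; fractional counts then sharpen the distribution on the fly.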
In batch EM, we usually run IBM Model 1 for a few iterations before Model 2, gripped by an atavistic fear that the a probabilities will enforce rigid alignments before word co-occurrences have a chance to settle in. It turns out this fear is justified in online EM! Because the a table initially learns to align most words to null, we smooth it more heavily (λ_a = 10^2, λ_t = 10^−4).
We also implement a single-pass HMM alignment model (Vogel et al., 1996). In the IBM models, we can either collect fractional counts after we have compressed a whole sentence, or we can do it word-by-word. In the HMM model, alignment choices are no longer independent of one another.

Figure 6: Word alignment f-scores. Batch EM for IBM1 is run for 5 iterations; Batch IBM2 adds 5 further iterations of IBM2; Batch HMM adds a further 5 iterations of HMM. Online EM is single-pass. Against the silver standard, alignments are unidirectional; against gold, they are bidirectional and symmetrized with grow-diag-final (Koehn et al., 2003). First and last 50% report on different portions of the corpus. Reordered is on segment pairs ordered short to long. All runs exclude segment pairs with segments longer than 50 words.

We compare this with a standard schedule of 5 IBM1 iterations, 5 IBM2 iterations, then 5 HMM iterations. However, the HMM still learns a very high value for p_1, aligning most tokens to null, so we fix p_1 = 0.1 for the duration of training. Single-pass, online HMM suffers the same two problems, both solved when we smooth differentially (λ_o = 10^2, λ_t = 10^−4) and fix p_1 = 0.1.
Two quick asides before we examine the effectiveness of our online methods:
• Translation researchers often drop long segment pairs that slow down HMM model processing. In compression, we cannot drop any of the text. Therefore, if the source segment contains more than 50 words, we use only monolingual PPMC to compress the target. This affects 26.5% of our word tokens.
• We might assist an online aligner by permuting our n segment pairs to place shorter, less ambiguous ones at the top. However, we would have to communicate the permutation to the decompressor, at a prohibitive cost of log_2(n!)/(8 · 10^6) = 4.8 Mb.

We next look at alignment accuracy (f-score) on our large Spanish/English corpus (Figure 6). We evaluate against both a silver standard (Batch EM Viterbi alignments) and a gold standard of 334 human-aligned segment pairs distributed throughout the corpus. We confirm that our Batch HMM implementation gives f-scores (f=70.2, p=80.4, r=62.3) similar to GIZA++ (f=71.2, p=85.5, r=61.0) and its differently parameterized HMM. We see that online methods generate competitive translation dictionaries. Because single-pass alignment is significantly faster than traditional multi-pass alignment, we also investigate its impact on an overall Moses pipeline for phrase-based machine translation (Koehn et al., 2007). Figure 7 shows that we can achieve competitive translation accuracy using fast, single-pass alignment, speeding up the system development cycle. For this use case, we can get an additional +0.3 alignment f-score (just as fast) if we print Viterbi alignments in a second pass instead of during training.

Figure 7: Fast, single-pass HMM alignment yields competitive Spanish-English Moses phrase-based translation accuracy, as measured by Bleu (Papineni et al., 2002). In-domain (Europarl) and out-of-domain (WMT-07 News Commentary) tune/test sets each consist of approximately 1000 sentences, all longer than 50 words to avoid overlap with training.
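The 4.8 Mb permutation cost quoted in the aside above is easy to verify via the log-gamma function, assuming roughly 1.97 million segment pairs for the full corpus (our approximation of the Europarl v7 Spanish-English size):

```python
import math

def permutation_cost_mb(n):
    """Bits needed to name one of n! orderings, in megabytes.
    log2(n!) = lgamma(n+1) / ln 2, computed without Stirling's formula."""
    bits = math.lgamma(n + 1) / math.log(2)
    return bits / 8e6

# Assuming ~1.97 million segment pairs (our estimate of the corpus size),
# the cost lands near the 4.8 Mb quoted in the text.
cost = permutation_cost_mb(1_970_000)
```

Because log_2(n!) grows as n·log_2(n), the permutation alone would rival the size of the compressed text itself.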

Predicting target words
We now ask our translation model (TM) to give us a probability distribution over possible next words. The TM knows the entire source word sequence f_1 ... f_l and the target words e_1 ... e_{j−1} seen so far. As candidates, we consider target words that can be produced, via the current t-table, from any (non-null) source word with probability greater than 10^−4.
For HMM, we compute a prediction lattice that gives a distribution over possible source alignment positions for the current word we are predicting. Intuitively, the prediction lattice tells us "where we currently are" in translating the source string, and it prefers translations of source words in that vicinity. We efficiently reuse the lattice as we make predictions for each subsequent target word.
To make the TM's prediction more accurate, we weight its prediction for each word with a smoothed, adapted English bigram word language model (LM). This discourages the TM from trying to predict the first character of a word by simply using the most frequent source words. We found that exponentiating the LM's score by 0.2 before weighting keeps it from overpowering the HMM predictions.

Predicting target characters
To convert word predictions into character predictions, we combine scores for words that share the next character. For example, if the TM predicts "monkey 0.4, car 0.3, cat 0.2, dog 0.1", then we have "P(c) 0.5, P(m) 0.4, P(d) 0.1". Additionally, we restrict ourselves to words prefixed by the portion of e j already observed. The TM predicts the space character when a predicted word fully matches the observed prefix.
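A sketch of this word-to-character conversion (the function name and dictionary layout are ours):

```python
from collections import defaultdict

def char_distribution(word_scores, prefix):
    """Turn TM word predictions into a next-character distribution.
    Only words consistent with the observed prefix of e_j contribute;
    a word that exactly matches the prefix votes for the space character."""
    dist = defaultdict(float)
    for word, score in word_scores.items():
        if word == prefix:
            dist[" "] += score                   # word complete
        elif word.startswith(prefix):
            dist[word[len(prefix)]] += score     # next character of word
    return dict(dist)

# The example from the text, before any characters of e_j are seen:
d = char_distribution({"monkey": 0.4, "car": 0.3, "cat": 0.2, "dog": 0.1}, "")
# d is close to {"m": 0.4, "c": 0.5, "d": 0.1}
```

As the prefix grows, inconsistent words drop out; once a predicted word fully matches the prefix, its mass shifts to the space character, as described above.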
We also adjust PPM to produce a full distribution over the 96 possible next characters. PPM normally computes a distribution over only characters previously seen in the current context (plus ESC). We now back off to the lowest context for every prediction. We interpolate PPM and TM probabilities:

P(e_k | f_1 ... f_l, e_1 ... e_{k−1}) = μ · P_PPM(e_k | e_1 ... e_{k−1}) + (1 − μ) · P_TM(e_k | f_1 ... f_l, e_1 ... e_{k−1})

We adjust μ dynamically based on the relative confidence of the models:

μ = max(PPM)^2.5 / (max(PPM)^2.5 + max(HMM)^2.5)

Here, max(model) refers to the highest probability assigned to any character in the current context by the model. This yields better compression rates than simply setting μ to a constant. When the TM is unable to extend a word, we set μ = 1.

Figure 8 shows that monolingual PPM compresses the Spanish side of our corpus to 15.8% of the original. Figure 9 (Main results) shows results for the English side of the corpus. Monolingual PPM compresses to 16.5%, while our HMM-based bilingual compression compresses to 11.9%. We can say that a human translation is characterized by an additional 0.95 bits per byte on top of the original, rather than the 1.32 bits per byte we would need if the English were independent text. Assuming our Spanish compression is good, we can also say that the human translator produces at most 68.1% (35.0/51.4) of the information that the original Spanish author produced. Intuitively, we feel this bound is high and should be reduced with better translation modeling.

Figure 10: Compression of Spanish plus English. All methods are run on a single file of Spanish concatenated with English, except for "Bilingual (this paper)," which records the sum of (1) Spanish compression and (2) English-given-Spanish compression.

Figure 9 also reports our Shannon game experiments in which bilingual humans guessed subsequent characters of the English text.
As suggested by Shannon, we upper-bound bpb as the cross-entropy of a unigram model over a human guess sequence (e.g., 1 1 2 5 17 1 1 ...), which records how many guesses it took to identify each subsequent English character, given context. For a 502-character English sequence, a team of four bilinguals working together gave us an upper-bound bpb of 0.51. This team had access to the original Spanish, plus a Google translation. Monolinguals guessing on the same data (minus the Spanish and Google translation) yielded an upper-bound bpb of 1.61. These human-level models indicate that human translators are actually only adding ∼32% more information on top of the original, and that our current translation models are only capturing some fraction of this redundancy. Figure 10 shows compression of the entire bilingual corpus, allowing us to compare with the previous state-of-the-art (Sánchez-Martínez et al., 2012), which compresses a single, word-interleaved bilingual corpus. It also shows how PPMC does on a concatenated Spanish/English file.
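The guess-sequence bound can be computed as follows (a sketch: we fit the unigram model to the guess ranks themselves and treat one ASCII character as one byte):

```python
import math
from collections import Counter

def guess_sequence_bpb(guesses):
    """Upper-bound bits per byte as the cross-entropy of a unigram model
    fit to the guess-rank sequence itself: each character costs
    -log2 P(its rank)."""
    n = len(guesses)
    counts = Counter(guesses)
    bits = sum(c * -math.log2(c / n) for c in counts.values())
    return bits / n

# A guesser who always succeeds on the first try needs ~0 bits;
# occasional deep misses raise the bound.
perfect = guess_sequence_bpb([1] * 100)
bpb = guess_sequence_bpb([1, 1, 2, 5, 17, 1, 1])
```

A sequence of all 1s costs zero bits, while each rare, deep miss contributes its full surprisal, which is why better-informed guessers (e.g., with access to the Spanish) get a tighter bound.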

Conclusion
We have created a bilingual text compression challenge web site. This web site contains standard bilingual data, specifies what a valid compression is, and maintains benchmark results.
There are many future directions to pursue. First, we would like to develop and exploit better predictive translation modeling. So far, we have only adapted machine translation technology circa 1996. For example, the HMM alignment model cannot "cross off" a source word and stop trying to translate it. Also possible are phrase-based translation, neural nets, or as-yet-unanticipated pattern-finding algorithms. We only require an executable that prints the bilingual text.
Our current method requires segment-aligned input. To work with real-life bilingual corpora, the compressor should take care of segment alignment, in a way that allows decompression back to the original text. Similarly, we are currently restricted to texts written in the Latin alphabet, per our definition of "word." More broadly, we would also like to import more compression ideas into NLP. Compression has so far appeared sporadically in NLP tasks like native language ID (Bobicev, 2013), text input methods (Powers and Huang, 2004), word segmentation (Teahan et al., 2000;Sornil and Chaiwanarom, 2004;Hutchens and Alder, 1998), alignment (Liu et al., 2014), and text categorization (Caruana & Lang, unpub. 1995).
Translation researchers may also view bilingual compression as an alternate, reference-free evaluation metric for translation models. We anticipate that future ideas from bilingual compression can be brought back into translation. Like Brown et al. (1992), who threw down a similar gauntlet and stirred up a fury of competitive energy, we hope that cross-fertilizing compression and translation will bring fresh ideas to both areas.