CLIReval: Evaluating Machine Translation as a Cross-Lingual Information Retrieval Task

We present CLIReval, an easy-to-use toolkit for evaluating machine translation (MT) with the proxy task of cross-lingual information retrieval (CLIR). Contrary to what the project name might suggest, CLIReval does not actually require any annotated CLIR dataset. Instead, it automatically transforms translations and references used in MT evaluations into a synthetic CLIR dataset; it then sets up a standard search engine (Elasticsearch) and computes various information retrieval metrics (e.g., mean average precision) by treating the translations as documents to be retrieved. The idea is to gauge the quality of MT by its impact on the document translation approach to CLIR. As a case study, we run CLIReval on the “metrics shared task” of WMT2019; while this extrinsic metric is not intended to replace popular intrinsic metrics such as BLEU, results suggest CLIReval is competitive in many language pairs in terms of correlation to human judgments of quality. CLIReval is publicly available at https://github.com/ssun32/CLIReval.


Introduction
Machine translation (MT) is the task of automatically translating sentences from a source language to a target language. A natural question is how to determine whether an MT system translates sentences well. One answer is to engage human translators to evaluate the translated sentences manually. Unfortunately, evaluating translations is relatively time-consuming and, worse, the inherent subjectivity of translation quality can lead to variation among human translators. The desire for fast and consistent evaluation has led to the emergence of a plethora of automatic evaluation metrics such as BLEU (Papineni et al., 2002), TER (Snover et al., 2006), METEOR (Banerjee and Lavie, 2005) and BEER (Stanojević and Sima'an, 2014). Of these metrics, BLEU has become the de facto evaluation metric for machine translation. It calculates the weighted average of n-gram precision between a translated sentence and a reference sentence. Nevertheless, BLEU, too, has its problems. For example, Callison-Burch et al. (2006) showed that an improved BLEU score does not necessarily represent an actual improvement in translation quality.
There are also proposals to evaluate the quality of translations with the help of extrinsic proxy tasks. Berka et al. (2011) collected short English documents from various domains and created yes/no questions in Czech. They then translated the English documents into Czech and evaluated the quality of the MT systems based on human performance on the documents and questions in Czech. Scarton and Specia (2016) translated a dataset of German reading comprehension tests into English with various MT systems such as Google Translate and Bing Translate and judged the quality of translations based on human performance on the translated reading comprehension datasets. Unfortunately, these external tasks suffer from the same scalability and consistency issues as manual evaluation.
One downstream task that relies heavily on MT but has not been used as a method to evaluate MT systems is Cross-Lingual Information Retrieval (CLIR). CLIR is a task in which search queries are issued in one language, and the retrieved relevant documents are written in a different language. Two commonly used methods in CLIR are query translation, where queries are translated into the same language as the documents, and document translation, where documents are translated into the same language as the queries (Zhou et al., 2012; Oard, 1998; McCarley, 1999). A monolingual IR system is then used to obtain search results.
CLIR is an active field of research, and previous work suggests that the performance of CLIR correlates highly with the quality of the MT (Zhu and Wang, 2006; Nie, 2010; Yarmohammadi et al., 2019). Therefore, we expect IR metrics to be good indicators of the quality of translations. Unfortunately, there is currently no publicly available tool to facilitate research in this area, and this motivates us to design and implement CLIReval. CLIReval is a lightweight Python-based MT evaluation toolkit that consumes the same inputs as other automatic MT evaluation tools such as multi-bleu.perl and SacreBLEU (Post, 2018) and does not require any additional annotated CLIR data. Instead, it automatically transforms inputs into a synthetic CLIR dataset on the fly with the help of an Information Retrieval (IR) system. It implements the document translation approach to CLIR, where MT translations are viewed as documents and indexed using a commonly used search engine (Elasticsearch).
As a case study, we test CLIReval on the metrics shared task of WMT2019, which measures the Pearson correlations (r) between automatically generated MT metrics and human judgments. Results show that CLIReval consistently performs at the level of r ≥ 0.9 and is on par with or even outperforms popular metrics such as BLEU on multiple language directions. Further, this is achieved without using external data or doing domain-based parameter tuning. These promising results highlight the potential of CLIR as a proxy task for MT evaluation, and we hope CLIReval can facilitate future research in this area.
Our key contributions in this work can be summarized as follows:
1. We release CLIReval (https://github.com/ssun32/CLIReval), an open-source toolkit that evaluates the quality of MT outputs in the context of a CLIR system, without the need for any actual CLIR dataset. The only inputs the tool requires are the translations and the references. It is easy to use in that, with a single script, the tool will create a synthetic CLIR dataset, index the translations as documents, and report metrics such as mean average precision.
2. We demonstrate that CLIReval can perform as well as popular intrinsic MT metrics on a recent WMT metrics shared task, without supervision from external datasets and without domain-based parameter tuning. Results suggest that CLIR is a feasible proxy task for MT evaluation and is worth further exploration in future research.

Approach
Given a set of source documents S, an MT system φ converts S into a set of translated documents, T = φ(S). Intrinsic MT metrics directly calculate an aggregated score between the sentences in T and the sentences in R, where R is a set of reference documents. We propose an alternative way to evaluate φ by first converting the evaluation into a proxy CLIR task and then evaluating the MT system with extrinsic IR metrics. First, CLIReval extracts a set of synthetic search queries Q from R. Second, given a monolingual information retrieval (IR) engine ρ, we can run these queries Q over the document collection R to obtain a set of "relevant" documents for Q. We use the notation ρ(Q, R) to refer to this set of desired returned search results. Now, our goal is to evaluate the quality of the translation T = φ(S) under the same IR engine ρ. We index the documents T into the IR engine and submit the same queries to obtain the search results ρ(Q, T). Finally, we can measure the performance of the CLIR system by comparing ρ(Q, T) to ρ(Q, R), and calculating IR metrics such as mean average precision.
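The comparison of ρ(Q, T) against ρ(Q, R) can be illustrated with a minimal Python sketch. This is not CLIReval's implementation: the function `toy_engine` is a hypothetical stand-in for ρ that ranks documents by shared-term count rather than by a real search engine, and the documents in `R` and `T` are invented for illustration.

```python
from collections import Counter

def toy_engine(queries, docs, top_k=2):
    """Hypothetical stand-in for the IR engine rho: rank docs by shared-term count."""
    results = {}
    for q in queries:
        q_terms = Counter(q.lower().split())
        scored = []
        for doc_id, text in docs.items():
            d_terms = Counter(text.lower().split())
            # Multiset intersection counts the terms shared by query and document.
            overlap = sum((q_terms & d_terms).values())
            if overlap > 0:
                scored.append((overlap, doc_id))
        # Highest overlap first; ties are broken by document id for determinism.
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        results[q] = [doc_id for _, doc_id in scored[:top_k]]
    return results

# Invented references R and translations T = phi(S), keyed by document id.
R = {"d1": "the cat sat on the mat", "d2": "stocks fell sharply today"}
T = {"d1": "a cat sat on a mat", "d2": "shares dropped sharply today"}

Q = list(R.values())           # synthetic queries extracted from R
desired = toy_engine(Q, R)     # rho(Q, R): treated as the desired results
retrieved = toy_engine(Q, T)   # rho(Q, T): what the MT output retrieves

# Fraction of queries whose top-ranked document agrees between the two runs.
agreement = sum(desired[q][:1] == retrieved[q][:1] for q in Q) / len(Q)
```

In this toy setting the agreement score plays the role that IR metrics such as mean average precision play in the actual toolkit.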
This approach makes several assumptions. First, CLIReval implements the document translation approach to CLIR and evaluates MT quality in that context; additionally, we assume that ρ is a robust and reasonable IR engine that can be used across a wide range of situations. Second, we assume R contains the "correct" translations of S, and that ρ(Q, R) is a good approximation of the optimal search results. Third, we assume that automatically generated Q can mimic the actual information needs of manually crafted queries. If these caveats are acknowledged, then CLIReval is a reasonable tool for MT evaluation.
5. Finally, CLIReval evaluates the search results from MT-IR and relevance judgment labels from REF-IR with trec_eval (https://github.com/usnistgov/trec_eval), a standard evaluation toolkit used by the information retrieval community.

Design and Implementation Details
We emphasize that the above steps are achieved with a single easy-to-use script: running CLIReval is as simple as executing one command, where the inputs are standard text files that one might pass to multi-bleu.perl, or standard SGML files that one might pass to mteval-v13a.pl, both of which are common BLEU scripts for MT.

Input files
CLIReval ingests a system output translation (MT) file, which contains documents translated by an MT system, and a reference (REF) file, which contains reference translations of the same source documents. Our system supports two input file formats:
1. The SGML format commonly used by the news translation shared task from the annual conference on machine translation (Barrault et al., 2019). This is also the input format required by the NIST BLEU scoring tool. In an SGML file, every translated sentence segment is placed in a <seg> tag, and sentence segments belonging to the same document are placed in the same <doc> tag. Every <doc> tag must also contain a unique document id attribute used to identify the document.
2. A text file where each line contains a sentence. A user can supply an optional mapping file that maps a line number to a (document id, segment id) tuple. If a mapping file is not specified, CLIReval will create an artificial document boundary every N sentences.
For either format, the number of documents in the MT file must equal the number of documents in the REF file. Further, the number of sentence segments in a machine-translated document must match the number of sentence segments in the corresponding reference document.
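The plain-text case without a mapping file, together with the two consistency requirements, can be sketched as follows. The function names, the `doc1`, `doc2`, ... identifiers, and the choice of N = 2 are all assumptions made for illustration; the source does not state the value of N or how CLIReval names artificial documents.

```python
def group_into_documents(sentences, n):
    """Create an artificial document boundary every n sentences."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return {
        "doc{}".format(i // n + 1): sentences[i:i + n]
        for i in range(0, len(sentences), n)
    }

def check_alignment(mt_docs, ref_docs):
    """Verify that MT and REF agree on document and segment counts."""
    if len(mt_docs) != len(ref_docs):
        raise ValueError("MT and REF contain different numbers of documents")
    for doc_id, ref_segments in ref_docs.items():
        if len(mt_docs.get(doc_id, [])) != len(ref_segments):
            raise ValueError("segment count mismatch in " + doc_id)

# Invented one-sentence-per-line inputs.
ref_lines = ["ref 1", "ref 2", "ref 3", "ref 4", "ref 5"]
mt_lines = ["mt 1", "mt 2", "mt 3", "mt 4", "mt 5"]

ref_docs = group_into_documents(ref_lines, 2)
mt_docs = group_into_documents(mt_lines, 2)
check_alignment(mt_docs, ref_docs)  # raises ValueError on any mismatch
```

Because both files are split with the same N, the document and segment counts agree whenever the two files have the same number of lines.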

Query Generator
The query generator module ingests data in the REF file and automatically generates search queries. CLIReval has two modes for query generation, which can be specified with the query mode argument:
1. In sentences mode, the query generator extracts all reference sentences from the input REF file and treats every sentence as a search query string. This is inspired by Sasaki et al. (2018), who use the first sentences of documents as queries.
2. In unique terms mode, the query generator treats all unique terms as queries. For Elasticsearch, these terms can be obtained from the term vectors of all indexed documents.
We recognize that using sentences or unique terms as queries might be less ideal than using real search queries, but obtaining relevant human-generated queries can be time-consuming and expensive. Our query generation methods are cheap and fast, which enables quick experimentation. Examples of R and T are shown in Figure 2, and the resulting queries generated from R are shown in Figure 3. In the example, we have two documents D1 and D2, each with two sentences S1 and S2. In sentences mode, each of the four sentences in R is used as a query; in unique terms mode, the 6 vocabulary words are extracted as queries.
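The two query generation modes can be sketched in Python as below. This is an illustration under stated assumptions: the function names are hypothetical, the reference documents are invented (they are not the contents of Figure 2), and unique terms are obtained by lowercasing and whitespace splitting, whereas CLIReval reads them from Elasticsearch term vectors after language-specific analysis.

```python
def sentence_queries(ref_docs):
    """Sentences mode: every reference sentence becomes one query string."""
    return [sentence for segments in ref_docs.values() for sentence in segments]

def unique_term_queries(ref_docs):
    """Unique terms mode: every distinct term becomes one query string."""
    terms = set()
    for segments in ref_docs.values():
        for sentence in segments:
            # Assumption: simple lowercasing and whitespace tokenization.
            terms.update(sentence.lower().split())
    return sorted(terms)

# Invented reference collection: two documents with two sentences each.
ref_docs = {
    "D1": ["the cat sleeps", "the dog sleeps"],
    "D2": ["the cat eats", "the dog eats"],
}

by_sentence = sentence_queries(ref_docs)
by_term = unique_term_queries(ref_docs)
```

Sentences mode yields one query per reference sentence, so its size grows with the corpus, while unique terms mode grows only with the vocabulary.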

Information Retrieval (IR) System
To ensure consistent and reproducible results, we choose Elasticsearch as the default backend IR system for CLIReval and adopt well-tested search configurations. Elasticsearch is an open-source, lightweight, and fast search engine written in Java. We pick Elasticsearch for three reasons. First, Elasticsearch has built-in analyzers for a wide variety of languages, which allows CLIReval to support many translation tasks beyond those with English as the target language. Analyzers are Elasticsearch modules that preprocess and tokenize queries and documents according to language-specific rules. They also implement stopword removal and stemming. These are important operations that affect the quality of search results.
Second, Elasticsearch implements many competitive retrieval models used by IR researchers and practitioners. By default, CLIReval uses the Okapi BM25 (Robertson et al., 2009) score to measure the degree of similarity of documents to a given search query. Note that BM25 shows strong performance on many datasets (Chapelle and Chang, 2011; McDonald et al., 2018) and frequently outperforms newer "state of the art" methods (Guo et al., 2016). It is also fast to compute, allowing CLIReval to run in a highly efficient manner.
Third, Elasticsearch is a widely used search engine solution that is supported on various platforms. This increases the ease of installation for users of CLIReval.
CLIReval separately indexes the documents from the MT and REF files into two instances of Elasticsearch, referred to as MT-IR and REF-IR respectively. It then queries the Elasticsearch instances with the generated query strings. For every query, Elasticsearch returns the top 100 documents ranked by BM25 scores. Since trec_eval only accepts discrete relevance judgment labels, the relevance label converter module is used to convert search scores from REF-IR into discrete labels.
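To make the ranking step concrete, the following is a minimal pure-Python sketch of BM25 scoring with a top-k cutoff. It is not Elasticsearch's code path: tokenization is plain lowercasing and whitespace splitting (no stemming or stopword removal), repeated query terms are counted once, and the parameters k1 = 1.2 and b = 0.75 are commonly used defaults that are assumed here rather than taken from CLIReval's configuration. The function names and documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in a non-empty collection against the query."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query.lower().split()):
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores

def top_k(scores, k=100):
    """Return up to k matching documents, best score first."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0][:k]

# Invented document collection.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "stocks fell sharply today",
    "d3": "the cat chased the dog",
}
scores = bm25_scores("cat mat", docs)
ranking = top_k(scores, k=100)
```

Here `d1` matches both query terms and ranks first, `d3` matches only the more common term, and `d2` is excluded because it shares no term with the query.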

Relevance Label Converter
We implement three ways of converting raw BM25 scores of REF-IR into discrete relevance judgment labels: The query in document method (Schamoni et al., 2014; Sasaki et al., 2018) assigns 1 to a document if and only if the given search query is extracted from that document. Consequently, there will only be one relevant document per search query.
Figure 4: Given queries from the query generator and documents from R, we can obtain relevance scores from an IR system. The relevance label converter then converts those relevance scores into discrete relevance labels via different conversion modes.
The percentile method assigns 1 to documents with BM25 scores in the top 25 percentile of all document scores returned by the IR system and 0 otherwise. The cutoff percentile value can be adjusted with the n percentile argument.
The Jenks method uses Jenks natural breaks optimization (https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization) to automatically break a list of BM25 scores into different classes. This is achieved by minimizing the variance of BM25 scores within a class while maximizing the variance of average BM25 scores between classes (McMaster and McMaster, 2002). Following the conventions of publicly available IR datasets (Chapelle and Chang, 2011; Qin and Liu, 2013), we break the BM25 scores into 5 relevance judgment classes, where 4 indicates that a document is highly relevant to a given query and 0 indicates that a document is not relevant to a given query. For each query, CLIReval normalizes the BM25 scores of retrieved documents to the range [0, 1] and uses Jenks natural breaks optimization to convert the normalized scores into discrete relevance judgment labels. Users can specify the number of classes with the jenks nb class argument.
Figure 4 illustrates an example of how relevance labels are generated for each query-document pair using the generated queries Q (see Section 3.2) and the reference documents R provided by the user. First, raw BM25 scores are obtained by indexing R in an IR system and searching with Q. These scores are then converted to discrete labels in one of three ways.
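Two of the conversion modes, along with the [0, 1] normalization step, can be sketched in a few lines of Python. The function names are hypothetical, the scores are invented, and the percentile cutoff uses a nearest-rank rule that keeps at least one document; the source does not specify how CLIReval computes the cutoff, so this is one possible reading. The Jenks mode is omitted because it requires a natural-breaks optimizer.

```python
import math

def query_in_document_labels(source_doc_id, retrieved_doc_ids):
    """Label 1 only for the document the query was extracted from."""
    return {doc_id: int(doc_id == source_doc_id) for doc_id in retrieved_doc_ids}

def percentile_labels(scores, top_percent=25):
    """Label 1 for documents whose score lies in the top `top_percent` percent."""
    if not scores:
        return {}
    ranked = sorted(scores.values(), reverse=True)
    # Assumption: nearest-rank cutoff that always keeps at least one document.
    n_keep = max(1, math.ceil(len(ranked) * top_percent / 100))
    threshold = ranked[n_keep - 1]
    return {doc_id: int(score >= threshold) for doc_id, score in scores.items()}

def min_max_normalize(scores):
    """Rescale scores to [0, 1]; identical scores all map to 1.0 (assumption)."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

# Invented BM25 scores returned by REF-IR for one query extracted from d1.
ref_scores = {"d1": 8.0, "d2": 4.0, "d3": 2.0, "d4": 1.0}

qid_labels = query_in_document_labels("d1", ref_scores)
pct_labels = percentile_labels(ref_scores, top_percent=25)
normalized = min_max_normalize(ref_scores)
```

With four retrieved documents and a 25 percent cutoff, exactly one document is labeled relevant, which in this example coincides with the query in document labeling.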

IR Metrics
To summarize: after the queries and relevance labels are prepared (as in Sections 3.2 and 3.4), the MT output T (e.g. Figure 2, left) is indexed into another IR system. Finally, we run the queries Q through this MT-IR system to obtain document scores ρ(Q, T) (e.g. Figure 1, left branch), which can be evaluated with respect to the relevance labels. We do this final evaluation with the standard trec_eval toolkit.
The trec_eval toolkit returns a large number of IR metrics, but CLIReval is configured to return only two of the most popular IR metrics by default:
• Mean average precision (MAP) is the mean of the average precision scores for each query (Buckley and Voorhees, 2005).
• Normalized discounted cumulative gain (NDCG) is a metric that measures the usefulness of documents based on their ranks in the search results (Järvelin and Kekäläinen, 2002) and is normalized to [0, 1].
We choose MAP because it is a widely understood metric, and NDCG because it allows for multiple levels of relevance labels. We follow standard practice in IR benchmark datasets such as Chapelle and Chang (2011) and calculate both metrics at the cutoff threshold of 10 documents. We refer to these metrics as MAP@10 and NDCG@10.
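For readers unfamiliar with these metrics, the following Python sketch shows one common formulation of average precision and NDCG at a rank cutoff. It is not trec_eval's code, and trec_eval's exact conventions (for example, its normalization and tie handling) may differ. The function names, ranking, and relevance labels are invented for illustration.

```python
import math

def average_precision_at_k(ranking, relevant, k=10):
    """Average of precision values at the ranks of relevant documents in the top k."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Assumption: normalize by the total number of relevant documents.
    return total / len(relevant)

def ndcg_at_k(ranking, gains, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(values, start=1))
    actual = dcg([gains.get(doc_id, 0) for doc_id in ranking[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Invented MT-IR ranking for one query, with REF-IR derived labels.
ranking = ["d1", "d2", "d3"]
binary_relevant = {"d1", "d3"}   # binary labels, as used for MAP
graded_gains = {"d1": 3, "d3": 2}  # multi-level labels, as used for NDCG

ap = average_precision_at_k(ranking, binary_relevant, k=10)
ndcg = ndcg_at_k(ranking, graded_gains, k=10)
```

MAP@10 is then the mean of `ap` over all queries. In this example the irrelevant document at rank 2 lowers both scores below 1.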

Installation
CLIReval is written in Python 3 and works on Python 3.5 and later. Elasticsearch requires at least Java 8. We provide a shell script that automatically downloads and installs Elasticsearch 6.5.3 and the latest version of trec_eval. It also installs Elasticsearch plugins that support additional languages. In total, CLIReval has built-in support for 36 languages; for unsupported languages, it falls back to the default standard analyzer, which is based on the Unicode text segmentation algorithm. We tested CLIReval extensively in the Unix/Linux environment, but it should work in other environments with minimal modification.

WMT metrics shared task
To demonstrate the utility of CLIReval, we test it on the metrics shared task of WMT2019. The metrics task is designed to evaluate outputs from automatic MT metrics against actual human ratings of machine translation systems. The goal is to find evaluation methods that have high Pearson correlations with human judgments. For every system in every language direction, we compute multiple system-level scores (different IR metrics) with CLIReval.
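The system-level comparison reduces to a Pearson correlation between one metric score per MT system and one human score per MT system. The sketch below computes it from first principles; it is not the official WMT evaluation script, the function name is hypothetical, and the score lists are invented placeholders rather than results from the shared task.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Invented system-level scores for four hypothetical MT systems.
metric_scores = [0.41, 0.52, 0.60, 0.47]   # e.g. NDCG@10 per system
human_scores = [-0.20, 0.05, 0.31, -0.02]  # e.g. human rating per system

r = pearson_r(metric_scores, human_scores)
```

A value of r close to 1 indicates that the metric ranks and spaces the systems much as the human judgments do.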
In total, there are 18 language directions, and for every language direction, a reference file and 11 to 22 system-generated translation files are provided. Every reference file contains around 1000 to 2000 sentences in 70 to 140 documents. The only exceptions are French-German and German-French, where all sentences are placed in the same document. Since document boundaries are not clearly defined in these two language directions, we exclude them from this case study.

Run Time
We used an Intel Xeon E5 Linux server with 64GB RAM. For every language direction, CLIReval runs consistently at the rate of around 0.2 to 0.3 seconds per document and it takes less than a minute to get results.

Results
We use the official evaluation scripts to compute linear correlations between IR metrics and human judgments. Table 1 presents the results for 16 language directions, and the IR metrics perform well. In Jenks mode, NDCG@10 outperforms BLEU and NIST on 10 out of the 16 language directions. Further, the 4 IR metrics collectively hold the top scores for 6 language directions. BEER seems to be slightly better than the IR metrics, claiming the top spot for 7 language directions. Note that the participating BEER system is trained on provided in-domain data, while we obtain comparable results without any tuning. It is also worth pointing out that the intrinsic MT metrics work at the sentence level, whereas CLIReval works at the document level. Nonetheless, the results are encouraging and show the potential of CLIR as a proxy task for MT evaluation. To get a deeper comparison between CLIReval and the most popular MT metric, BLEU, we randomly select two systems (Baidu-system for zh-en and UEDIN for en-gu) and calculate sentence-level BLEU and sentence-level NDCG@10 scores on both systems. As we can see in Figure 5, there is no clear correlation between sentence-level NDCG@10 and sentence-level BLEU scores. To be more exact, the Pearson correlation between the two metrics is almost non-existent, at -0.021 and -0.032 for zh-en and en-gu respectively. This shows that the two metrics are qualitatively different and contribute different perspectives to MT evaluation.

Conclusions
We present CLIReval, an open-source Python-based evaluation toolkit for machine translation. Rather than directly evaluating translated sentences against reference sentences, CLIReval transforms the inputs into the closely related task of CLIR, without the need for an annotated CLIR dataset. The aim of this project is not to replace current automatic evaluation metrics or fix the limitations in those metrics, but to bridge the gap between machine translation and cross-lingual information retrieval and to show that CLIR is a feasible proxy task for MT evaluation.
Our case study on the WMT2019 metrics shared task further highlights the potential of CLIR as a proxy task for MT evaluation, and we hope that CLIReval can facilitate future research in this area.