SIMULEVAL: An Evaluation Toolkit for Simultaneous Translation

Simultaneous translation on both text and speech focuses on a real-time and low-latency scenario where the model starts translating before reading the complete source input. Evaluating simultaneous translation models is more complex than evaluating offline models, because latency is another factor to consider in addition to translation quality. The research community, despite its growing focus on novel modeling approaches to simultaneous translation, currently lacks a universal evaluation procedure. Therefore, we present SimulEval, an easy-to-use and general evaluation toolkit for both simultaneous text and speech translation. A server-client scheme is introduced to create a simultaneous translation scenario, where the server sends source input and receives predictions for evaluation, and the client executes customized policies. Given a policy, SimulEval automatically performs simultaneous decoding and collectively reports several popular latency metrics. We also adapt latency metrics from simultaneous text translation to the speech task. Additionally, SimulEval is equipped with a visualization interface to provide a better understanding of the simultaneous decoding process of a system. SimulEval has already been extensively used for the IWSLT 2020 shared task on simultaneous speech translation. Code will be released upon publication.


Introduction
Simultaneous translation, the task of generating translations before reading the entire text or speech source input, has become an increasingly popular topic for both text and speech translation (Grissom II et al., 2014; Cho and Esipova, 2016; Gu et al., 2017; Alinejad et al., 2018; Arivazhagan et al., 2019; Ren et al., 2020). Simultaneous models are typically evaluated from the quality and latency perspectives. Note that the term latency is overloaded and sometimes refers to the actual system speed. In this paper, latency refers to a model's simultaneous ability: how much partial source information is needed to start the translation process.
While the translation quality is usually measured by BLEU (Papineni et al., 2002; Post, 2018), a wide variety of latency measurements have been introduced, such as Average Proportion (AP) (Cho and Esipova, 2016), Consecutive Wait length (CW) (Gu et al., 2017), Average Lagging (AL) (Ma et al., 2019), Differentiable Average Lagging (DAL) (Cherry and Foster, 2019), and so on. Unfortunately, the latency evaluation processes across different works are not consistent: 1) the latency metric definitions are not precise enough with respect to text segmentation; 2) the definitions are also not precise enough with respect to speech segmentation, for example some models are evaluated on speech segments (Ren et al., 2020) while others are evaluated on time duration (Ansari et al., 2020); 3) little prior work has released implementations of the decoding process and latency measurement. The lack of clarity and consistency in the latency evaluation process makes it challenging to compare different works and prevents tracking the scientific progress of this field.
In order to provide researchers in the community with a standard, open and easy-to-use method to evaluate simultaneous speech and text translation systems, we introduce SIMULEVAL, an open source evaluation toolkit which automatically simulates a real-time scenario and evaluates both latency and translation quality. The design of this toolkit follows a server-client scheme, which has the advantage of creating a fully simultaneous translation environment and is suitable for shared tasks such as the IWSLT 2020 shared task on simultaneous speech translation or the 1st Workshop on Automatic Simultaneous Translation at ACL 2020. The server provides source input (text or audio) upon request from the client, receives predictions from the client, and returns different evaluation metrics when the translation process is complete. The client contains two components, an agent and a state: the former executes the system's policy, while the latter keeps track of the information necessary to execute the policy and generate a translation. SIMULEVAL has built-in support for quality metrics such as BLEU (Papineni et al., 2002; Post, 2018), TER (Snover et al., 2006) and METEOR (Banerjee and Lavie, 2005), and latency metrics such as AP, AL and DAL; users can also define their own customized metrics. While all latency metrics have been defined for text translation, we discuss issues and solutions when adapting them to the task of simultaneous speech translation. SIMULEVAL also provides an interactive visualization interface to illustrate the agent's policy and the simultaneous decoding process. The initial version of SIMULEVAL was used to evaluate submissions to the first shared task on simultaneous speech translation at IWSLT 2020 (Ansari et al., 2020).
In the remainder of the paper, we first formally define the task of simultaneous translation. Next, latency metrics and their adaptation to the speech task are introduced. After that, we provide a high-level overview of the server-client design of SIMULEVAL. Finally, usage instructions and a case study are provided before concluding.

Task Formalization
An evaluation corpus for a translation task contains one or several instances, each of which consists of a source sequence X = [x_1, ..., x_{|X|}] and a reference sequence Y* = [y*_1, ..., y*_{|Y*|}]. The system to be evaluated takes X as input and generates Y = [y_1, ..., y_{|Y|}]. We refer to the elements of X, Y and Y* as segments. For text translation, each x_j is an individual word, while for speech translation, x_j is a raw audio segment of duration T_j. In the simultaneous translation task, a system starts generating a hypothesis with only partial input. It then either reads a new source segment or writes a new target segment. Assuming X_{1:j} = [x_1, ..., x_j], j < |X|, has been read when generating y_i, we define the delay of y_i as d_i = j for text translation (the number of source words read), or d_i = Σ_{k=1}^{j} T_k for speech translation (the duration of source audio read). As with an offline model, quality is measured by comparing the hypothesis Y to the reference Y* after the translation process is complete. Latency measurement, on the other hand, involves partial hypotheses: the latency metrics are calculated by a function which takes the sequence of delays D = [d_1, ..., d_{|Y|}] as input.

Existing Text Latency Metrics
First, we review three latency metrics previously introduced for the text translation task.
Average Proportion (AP) (Cho and Esipova, 2016), defined in Eq. (2), measures the average proportion of source input read when generating a target prediction:

AP = 1 / (|X| |Y|) Σ_{i=1}^{|Y|} d_i    (2)
Despite AP's simplicity, several concerns have been raised. Specifically, AP is not length invariant, i.e. the value of the metric depends on the input and output lengths. For instance, AP for a wait-3 model (Ma et al., 2019) is 0.72 when |X| = |Y| = 10 but 0.52 when |X| = |Y| = 100. Moreover, AP is not evenly distributed on the [0, 1] interval, i.e. values below 0.5 represent models that have lower latency than an ideal policy, and an improvement of 0.1 from 0.7 to 0.6 is much more difficult to obtain than the same absolute improvement from 0.9 to 0.8.
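To make the computation concrete, here is a minimal sketch of AP for text input, together with a helper that generates the delays of a wait-k policy; this is an illustrative implementation, not SimulEval's own code.

```python
def average_proportion(delays, src_len):
    """AP (Eq. 2): average proportion of source read per target token."""
    return sum(delays) / (src_len * len(delays))

def waitk_delays(k, src_len, tgt_len):
    """Delays of a wait-k policy: read k tokens, then alternate read/write."""
    return [min(i + k - 1, src_len) for i in range(1, tgt_len + 1)]
```

For a wait-3 model, `average_proportion(waitk_delays(3, 10, 10), 10)` reproduces the 0.72 value quoted above, and lengthening both sides to 100 drops it toward 0.52, illustrating the length-variance concern.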
Average Lagging (AL) first defines an ideal policy, which is equivalent to a wait-0 policy that makes the same predictions as the system being evaluated. Ma et al. (2019) define AL as in Eq. (3):

AL = 1 / τ(|X|) Σ_{i=1}^{τ(|X|)} ( d_i − (i − 1) / γ )    (3)

where τ(|X|) = min{i | d_i = |X|} is the index of the target token at which the policy first reaches the end of the source sentence, and γ = |Y| / |X|. The (i − 1)/γ term represents the delays of the ideal policy against which the system is compared. AL has good properties such as being length-invariant and intuitive: its value directly describes the lagging behind the ideal policy.
Differentiable Average Lagging (DAL) (Cherry and Foster, 2019) introduces a minimum delay of 1/γ after each operation. Unlike AL, it also considers the tokens for which i > τ(|X|). It is defined in Eq. (4):

DAL = 1 / |Y| Σ_{i=1}^{|Y|} ( d'_i − (i − 1) / γ )    (4)

where

d'_i = d_i if i = 1, and d'_i = max(d_i, d'_{i−1} + 1/γ) otherwise.    (5)

The minimum delay prevents DAL from recovering from lagging once it has been incurred.
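The two lagging metrics can be sketched directly from the delay sequence D; the functions below are illustrative implementations, not SimulEval's own code.

```python
def average_lagging(delays, src_len, tgt_len):
    """AL (Eq. 3): average lag behind the wait-0 ideal policy, up to tau."""
    gamma = tgt_len / src_len
    # tau: index (1-based) of the first prediction made after the full source is read
    tau = next(i for i, d in enumerate(delays, 1) if d >= src_len)
    return sum(d - (i - 1) / gamma for i, d in enumerate(delays[:tau], 1)) / tau

def differentiable_average_lagging(delays, src_len, tgt_len):
    """DAL (Eq. 4-5): like AL, but with a minimum delay of 1/gamma per step
    and averaged over all target tokens, not just the first tau."""
    gamma = tgt_len / src_len
    total, d_prev = 0.0, None
    for i, d in enumerate(delays, 1):
        d_prime = d if d_prev is None else max(d, d_prev + 1 / gamma)
        total += d_prime - (i - 1) / gamma
        d_prev = d_prime
    return total / len(delays)
```

For a wait-k policy with equal source and target lengths, both metrics evaluate to k, matching the intuition that the system lags k tokens behind the ideal policy.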

Adapting Metrics to the Speech Task
In this section, we adapt the three latency metrics introduced in Section 3.1 to the simultaneous speech translation task.
Average Proportion is straightforward to adapt to the speech task, as follows:

AP = 1 / (|Y| Σ_{j=1}^{|X|} T_j) Σ_{i=1}^{|Y|} d_i

Average Lagging is adapted as follows:

AL = 1 / τ(|X|) Σ_{i=1}^{τ(|X|)} ( d_i − d*_i )

where τ(|X|) = min{i | d_i = Σ_{j=1}^{|X|} T_j} and the d*_i are the delays of an ideal policy, of which the straightforward adaptation is

d*_i = (i − 1) Σ_{j=1}^{|X|} T_j / |Y|

However, such an adaptation is not robust for models that tend to stop hypothesis generation too early and generate translations that are too short. This is more likely to happen in simultaneous speech translation, where a model can generate the end-of-sentence token too early, for example when there is a long pause even though the entire source input has not been consumed. Fig. 1 illustrates this phenomenon. The red line in Fig. 1 corresponds to the ideal policy defined with the hypothesis length |Y|. We can see that when the model stops generating the translation, the lagging behind the ideal policy is negative, because the model stops reading any input after completing hypothesis generation. Such a model can obtain relatively good latency-quality trade-offs as measured by AL (and BLEU) that do not reflect reality. We thus define

d*_i = (i − 1) Σ_{j=1}^{|X|} T_j / |Y*|

to prevent this issue, i.e. the ideal policy is assumed to generate the reference rather than the system hypothesis. The newly defined ideal policy is represented by the green line in Fig. 1.
Differentiable Average Lagging for the speech task still uses Eq. (4) and Eq. (5), with a new γ defined as γ = |Y| / Σ_{j=1}^{|X|} T_j.


Architecture

SIMULEVAL simulates a real-time scenario by setting up a server and a client. The server and client can be run separately or jointly, and are connected through RESTful APIs. An overview is shown in Fig. 2.

Server
The server has four primary functions: 1) reading the source and reference files; 2) sending source segments to the client upon a READ action; 3) receiving predicted segments from the client upon a WRITE action, and recording the corresponding delays; 4) running the evaluation on instances.
The evaluation process run by the server on one instance is shown in Algorithm 1. Note that in line 18 of Algorithm 1, the server only runs sentence-level metrics. The server collects Y, D and T for every instance in the evaluation corpus, and calculates corpus-level metrics after all hypotheses are complete.
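The per-instance loop can be sketched as follows; this is a simplified stand-in for Algorithm 1, and the client interface (`next_action`/`predict`) is an illustrative assumption rather than SimulEval's actual RESTful API.

```python
def evaluate_instance(source_segments, client):
    """Serve source segments on READ, record predictions and delays on WRITE."""
    hypothesis, delays = [], []
    num_sent = 0  # number of source segments served so far
    while True:
        action = client.next_action(num_sent, len(hypothesis))
        if action == "read" and num_sent < len(source_segments):
            num_sent += 1  # serve one more source segment
        else:
            segment = client.predict(source_segments[:num_sent])
            if segment is None:  # client signals end of hypothesis
                break
            hypothesis.append(segment)
            delays.append(num_sent)  # record delay d_i
    return hypothesis, delays
```

The returned hypothesis and delay sequence are exactly what the sentence-level quality and latency metrics consume.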

Client
The client contains two components, an agent and a state. The agent is a user-defined class that operates the policy and generates hypotheses for simultaneous translation; the state provides functions such as pre-processing, post-processing and memorizing context. The purpose of this design is to free the user from complicated setups and let them focus on the policy. The client-side algorithm is shown in Algorithm 2.

User-Defined Agent
A user-defined agent class is required for evaluation, along with the user's model-specific arguments. The user is able to add customized arguments and initialize the model. Two functions must be defined in order to successfully run online decoding. The first one is "policy", which takes the state as input and returns a decision on whether to perform a read or write action. The other is "predict", which is called when "policy" returns a write action and returns a new target prediction given the state. An example of a text wait-k model is shown below. Additionally, the user can define pre-processing or post-processing methods to handle different types of input. For example, for a speech translation model, the pre-processing method can be a feature extraction function that converts speech samples to filterbank features, while for text translation, the pre-processing can be tokenization or subword splitting. Post-processing can implement functions such as merging subwords and detokenization.
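A minimal sketch of such a text wait-k agent follows; the "policy"/"predict" method names match the paper, but the `State` class and action constants here are simplified stand-ins, not SimulEval's actual API, and the "model" is a placeholder that echoes source tokens.

```python
READ, WRITE = "read", "write"
EOS = "</s>"

class State:
    """Stand-in state: tracks source segments read and target segments written."""
    def __init__(self):
        self.source = []             # source segments received so far
        self.target = []             # target segments predicted so far
        self.source_finished = False # set once the server has no more input

class WaitKTextAgent:
    """Wait-k policy: read k source tokens first, then alternate read/write."""
    def __init__(self, k=3):
        self.k = k

    def policy(self, state):
        # Write once we are k tokens ahead, or once the source is exhausted.
        if len(state.source) - len(state.target) >= self.k or state.source_finished:
            return WRITE
        return READ

    def predict(self, state):
        # Placeholder "model": echo the aligned source token; a real agent
        # would run its translation model on the partial source here.
        i = len(state.target)
        if state.source_finished and i >= len(state.source):
            return EOS
        return state.source[min(i, len(state.source) - 1)]
```

Pre-processing (e.g. tokenization) would be applied to segments before they are appended to `state.source`, and post-processing (e.g. detokenization) to the segments returned by `predict`.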

User-Defined Client
A typical user will only need to implement an agent and will rely on the out-of-the-box client implementation of Algorithm 2. However, sometimes, a user may want to customize the client, for example if they want to use a different programming language than Python or make the implementation of Algorithm 2 more efficient. In that case, they can take advantage of the RESTful APIs between the client and the server described in Table 1. Users can easily plug in these APIs into their own client implementations.

Evaluation
With a well-defined agent class, SIMULEVAL is able to start the evaluation automatically. Assuming the agent class is stored in text_waitk_agent.py, the evaluation can be run either with one single command, which launches the server and client jointly, or with separate server and client commands. After all hypotheses are generated, the intermediate results and corpus-level evaluation metrics are saved in the output directory. SIMULEVAL also supports resuming an evaluation if the process has been interrupted.

Visualization
SIMULEVAL provides a web user interface (UI) for visualizing the online decoding process. Fig. 3 shows an interactive example on simultaneous speech translation. A user can move the cursor to find the corresponding translation at a certain point. The visualization server can be started simply by running simuleval server --visual --log-dir $OUT_DIR. The default port is 7777 and the web UI can be accessed at http://ip-of-server:7777.

Case Study: IWSLT 2020
In order to avoid inconsistencies in how latency metrics are computed and to ensure fair comparisons between results presented in research papers, we encourage the research community to use SIMULEVAL when reporting latency in the future. In addition, an earlier version of SIMULEVAL was used in the context of the first simultaneous speech translation shared task at IWSLT (Ansari et al., 2020), where it was of paramount importance to have the same evaluation conditions for all submissions. In order to preserve the integrity of the evaluation process, the test set, including the source side, could not be released to participants. This motivated the server-client design, where participants defined their own agent file and submitted their system in a Docker (Merkel, 2014) environment. The organizers of the task were then able to run SIMULEVAL and score each submission in a consistent way, even for systems implemented in different frameworks.

Conclusion
In this paper, we introduced SIMULEVAL, a general and easy-to-use evaluation toolkit for simultaneous speech and text translation. It simulates a real-time scenario with a server-client scheme and automatically evaluates simultaneous translation given a user-defined agent, both for text and speech. Furthermore, it provides a visualization interface for the user to track the online decoding process. We introduced example use cases of the toolkit and showed that its general design allows evaluation on different frameworks. We encourage future research on simultaneous speech and text translation to make use of this toolkit in order to obtain an accurate and standard comparison of the latency between different systems.