Building a Non-Trivial Paraphrase Corpus Using Multiple Machine Translation Systems

We propose a novel sentential paraphrase acquisition method. To build a well-balanced corpus for Paraphrase Identification, we focus on acquiring both non-trivial positive and negative instances. We use multiple machine translation systems to generate positive candidates and a monolingual corpus to extract negative candidates. To collect non-trivial instances, the candidates are sampled uniformly by word overlap rate. Finally, annotators judge whether each candidate is positive or negative. Using this method, we built and released the first evaluation corpus for Japanese paraphrase identification, which comprises 655 sentence pairs.


Introduction
When two sentences share the same meaning but are written using different expressions, they are deemed a sentential paraphrase pair. Paraphrase Identification (PI) is the task of recognizing whether a pair of sentences is a paraphrase. PI is useful in many applications such as information retrieval (Wang et al., 2013) and question answering (Fader et al., 2013).
Despite this usefulness, there are only a few corpora that can be used to develop and evaluate PI systems, and such corpora are unavailable in many languages other than English. This is because manual paraphrase generation is costly. Furthermore, unlike a bilingual parallel corpus for machine translation, a monolingual parallel corpus for PI does not arise naturally.
Even though some paraphrase corpora are available, they have limitations. For example, the Microsoft Research Paraphrase Corpus (MSRP) (Dolan and Brockett, 2005) is a standard English corpus for the PI task. However, as Rus et al. (2014) pointed out, MSRP collects candidate pairs within a short edit distance, and this approach rarely yields positive instances with a low word overlap rate (WOR) (non-trivial positive instances, hereafter) 1 . In contrast, the Twitter Paraphrase Corpus (TPC) (Xu et al., 2014) comprises short, noisy user-generated texts; hence, it is difficult to acquire negative instances with a high WOR (non-trivial negative instances, hereafter) 2 .
To develop a more robust PI model, it is important for an evaluation corpus to contain both non-trivial positive and non-trivial negative instances. To create such a corpus, we propose a novel paraphrase acquisition method that balances the corpus from two viewpoints: positive/negative and trivial/non-trivial. To balance positive and negative instances, our method has a machine translation part that mainly collects positive instances and a random extraction part that collects negative instances. In the machine translation part, we generate candidate sentence pairs using multiple machine translation systems. In the random extraction part, we extract candidate sentence pairs from a monolingual corpus. To collect both trivial and non-trivial instances, we sample candidate pairs by WOR. Finally, annotators judge whether the candidate pairs are paraphrases.
In this paper, we focus on the Japanese PI task and build a monolingual parallel corpus for its evaluation, as no Japanese sentential paraphrase corpus is available. As Figure 1 shows, we use phrase-based machine translation (PBMT) and neural machine translation (NMT) to generate two different Japanese sentences from one English sentence. We expect the two systems to provide translations that differ widely in surface form, such as lexical variation and word order, because they are known to have different characteristics (Bentivogli et al., 2016); for instance, PBMT produces more literal translations, whereas NMT produces more fluent translations.
We believe that when the translation succeeds, the two Japanese sentences have the same meaning but different expressions, which makes them a positive instance. On the other hand, translated candidates can be negative instances when they include fluent mistranslations, because adequacy is not checked before the annotation phase. Thus, we can also acquire some negative instances in this manner.
To actively acquire negative instances, we randomly extract sentences from Wikipedia. Randomly paired sentences are rarely paraphrases, so this is an effective way to acquire negative instances.
Our contributions are summarized as follows:
• We generated paraphrases using multiple machine translation systems for the first time.
• We balanced the corpus from two viewpoints: positive/negative and trivial/non-trivial.
• We released 3 the first evaluation corpus for the Japanese PI task.

Related Work
Paraphrase acquisition has been actively studied. For instance, paraphrases have been acquired from monolingual comparable corpora such as news articles regarding the same event (Shinyama et al., 2002) and multiple definitions of the same concept (Hashimoto et al., 2011). Although these methods effectively acquire paraphrases, there are not many domains that have comparable corpora. In contrast, our method can generate paraphrase candidates from any sentences, and this allows us to choose any domain required by an application.
Methods using a bilingual parallel corpus are similar to ours. In fact, our method is an extension of previous studies that acquire paraphrases from manual translations of the same documents (Barzilay and McKeown, 2001; Pang et al., 2003). However, it is expensive to manually translate sentences to create large numbers of translation pairs. Thus, we propose a method that inexpensively generates translations using machine translation and Quality Estimation. Ganitkevitch et al. (2013) and Pavlick et al. (2015) also use bilingual parallel corpora to build a paraphrase database via bilingual pivoting (Bannard and Callison-Burch, 2005). Their methods differ from ours in that they aim to acquire phrase-level paraphrase rules and carry out word alignment instead of machine translation.
There are also many studies on building large-scale corpora using crowdsourcing for related tasks such as Recognizing Textual Entailment (RTE) (Marelli et al., 2014; Bowman et al., 2015) and Lexical Simplification (De Belder and Moens, 2012; Xu et al., 2016). Moreover, there are studies collecting paraphrases from captions of videos (Chen and Dolan, 2011) and images (Chen et al., 2015). One advantage of crowdsourcing is that annotation is inexpensive, but it requires careful task design to gather valid data from non-expert annotators. In our study, we collect sentential paraphrase pairs, and we presume that it is difficult for non-expert annotators to provide well-balanced sentential paraphrase pairs, unlike lexical simplification, which only replaces content words. For this reason, annotators classify paraphrase candidate pairs in our study, similar to the method used for the TPC and in previous studies on RTE.
As for Japanese, there exist a paraphrase database (Mizukami et al., 2014) and an evaluation dataset that includes some paraphrases for lexical simplification (Kajiwara and Yamamoto, 2015; Kodaira et al., 2016). They provide lexical or phrase-level paraphrases, but we focus on collecting sentence-level paraphrases for PI evaluation. There is also an evaluation dataset for RTE (Watanabe et al., 2013) containing 70 sentential paraphrase pairs; however, as it is limited in size, we aim to build a larger corpus.


Candidate Generation

Paraphrase Generation using Multiple Machine Translation Systems
We use different types of machine translation systems (PBMT and NMT) to translate source sentences extracted from a monolingual corpus into a target language. This means that each source sentence has two versions in the target language, and we use the sentences as a pair.
To avoid collecting ungrammatical sentences as much as possible, we use Quality Estimation to eliminate inappropriate sentences from the paraphrase candidate pairs. In the Shared Task on Quality Estimation at WMT2016 (Bojar et al., 2016), the winning system YSDA (Kozlova et al., 2016) showed that language model probabilities of the source and target sentences, as well as BLEU scores between the source sentence and its back-translation, are effective features for Quality Estimation. Therefore, we calculate the language model probabilities of source sentences and translate them in the order of their probabilities. To further obtain better translations, we select sentence pairs in descending order of machine translation output quality, which is defined as follows:

Quality(e_i) = SBLEU(e_i, BT_PBMT(e_i)) + SBLEU(e_i, BT_NMT(e_i))

Here, e_i denotes the i-th source sentence, BT_PBMT denotes back-translation using PBMT, BT_NMT denotes back-translation using NMT, and SBLEU denotes the sentence-level BLEU score (Nakov et al., 2012). When this score is high, the difference in sentence meaning before and after translation is small for each machine translation system.
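As a rough sketch of this round-trip quality score, the following Python snippet combines back-translation SBLEU from both systems. The `sentence_bleu` here is a simplified add-one-smoothed stand-in, not the exact SBLEU of Nakov et al. (2012), and the translation systems are passed in as plain functions (stubbed in the usage note); both are illustrative assumptions, not the actual implementation.

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Simplified smoothed sentence-level BLEU: add-one-smoothed n-gram
    precisions with a brevity penalty (a stand-in for SBLEU)."""
    hyp_toks, ref_toks = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp_toks[i:i + n]) for i in range(len(hyp_toks) - n + 1))
        r = Counter(tuple(ref_toks[i:i + n]) for i in range(len(ref_toks) - n + 1))
        matches = sum((h & r).values())
        log_prec += math.log((matches + 1) / (sum(h.values()) + 1))
    bp = min(1.0, math.exp(1 - len(ref_toks) / max(len(hyp_toks), 1)))
    return bp * math.exp(log_prec / max_n)

def quality(source, mt_pbmt, mt_nmt, bt_pbmt, bt_nmt):
    """Sum of source-vs-back-translation SBLEU over both MT systems;
    higher means meaning is better preserved through the round trip."""
    round_trip_pbmt = bt_pbmt(mt_pbmt(source))
    round_trip_nmt = bt_nmt(mt_nmt(source))
    return (sentence_bleu(round_trip_pbmt, source)
            + sentence_bleu(round_trip_nmt, source))
```

With identity stubs for all four systems the score reaches its maximum of 2.0; a system that distorts the sentence during the round trip lowers it, so ranking sources by this score prefers faithfully translatable sentences.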

Non-Paraphrase Extraction from a Monolingual Corpus
This extraction part of our method is for acquiring non-trivial negative instances. Although the machine translation part of our method is expected to collect some non-trivial negative instances too, there will still be a gap between the numbers of positive and negative instances. To fill the gap, we randomly collect sentence pairs from a monolingual corpus written in the target language.
To check whether the negative instances acquired by machine translation and those extracted directly from a monolingual corpus are discernible, we asked three people to annotate 100 randomly extracted instances, judging whether each pair was machine-translated or not. The average F-score of this annotation was 0.34. This means the two kinds of negative instances are hardly distinguishable, so the extraction part does not affect the balance of the corpus.

Balanced Sampling using Word Overlap Rate
To collect both trivial and non-trivial instances, we carefully sample candidate pairs. We classify the pairs into eleven ranges depending on the WOR and sample pairs uniformly for each range, except for the exact match pairs. The WOR is calculated as follows:

WOR(T_PBMT, T_NMT) = |T_PBMT ∩ T_NMT| / |T_PBMT ∪ T_NMT|

Here, T_PBMT and T_NMT denote the sets of words in the sentences translated into the target language by PBMT and NMT, respectively.

Table 2: Examples of candidate pairs for each label.

Positive
  Input: My father was a very strong man.
  PBMT: My father was a very strong man.
  NMT: My father was a very strong man.

Negative
  Input: It is available as a generic medication.
  PBMT: It is available as a generic medicine.
  NMT: It is available as a generic medication.

Unnatural
  Input: I want to wake up in the morning
  PBMT: *I wake up want to in the morning*
  NMT: I want to wake up in the morning

Other
  Input: Academy of Country Music Awards :
  PBMT: Academy of Country Music Awards :
  NMT: Academy of Country Music Awards :
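A minimal sketch of this WOR-based balanced sampling might look as follows. It assumes a Jaccard-style WOR over whitespace tokens and uniform random sampling per range; the per-range sample size, tokenization, and the exact WOR definition are illustrative assumptions, not necessarily the corpus-building setup.

```python
import random
from collections import defaultdict

def word_overlap_rate(sent_a, sent_b):
    """Jaccard-style word overlap rate over token sets
    (one plausible definition of WOR)."""
    a, b = set(sent_a.split()), set(sent_b.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def sample_by_wor(pairs, per_range=200, seed=0):
    """Bucket pairs into eleven WOR ranges ([0, 0.1), ..., [0.9, 1.0), {1.0})
    and sample uniformly per range, skipping the exact-match range."""
    buckets = defaultdict(list)
    for a, b in pairs:
        wor = word_overlap_rate(a, b)
        idx = 10 if wor == 1.0 else int(wor * 10)
        buckets[idx].append((a, b))
    rng = random.Random(seed)
    sampled = []
    for idx in range(10):  # bucket 10 (exact matches) is excluded
        pool = buckets[idx]
        sampled.extend(rng.sample(pool, min(per_range, len(pool))))
    return sampled
```

Sampling the same number of pairs from each range is what keeps both trivial (high-WOR) and non-trivial (low-WOR) candidates represented.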

Acquiring Candidate Pairs in Japanese
We built the first evaluation corpus for Japanese PI using our method. We used Google Translate PBMT 4 and NMT 5 (Wu et al., 2016) to translate English sentences extracted from English Wikipedia 6 into Japanese 7 . We built a 5-gram language model from the English Gigaword Fifth Edition (LDC2011T07) using KenLM (Heafield, 2011) and calculated the language model probabilities with it. Then we translated the top 500,000 sentences and sampled 200 pairs for each range in descending order of machine translation output quality, except for the exact match pairs (Table 1).

Annotation
We used four types of labels: Positive, Negative, Unnatural, and Other (Table 2). When both sentences of a candidate pair were fluent and semantically equivalent, we labeled the pair Positive. In contrast, when the sentences were fluent but semantically inequivalent, the pair was labeled Negative. Positive and Negative pairs were included in our corpus. The label Unnatural was assigned when at least one of the sentences was ungrammatical or not fluent. The label Other was assigned to pairs that consist of named entities or that differ only in minor details such as punctuation, even though they may be paraphrases. Unnatural and Other pairs were discarded from our corpus.
One of the authors annotated 2,000 machine-translated pairs; then another author annotated the pairs labeled either Positive or Negative by the first annotator. The inter-annotator agreement (Cohen's Kappa) was κ = 0.60. Considering that PI requires a deep understanding of sentences and that some instances are ambiguous without context (e.g., good child and good kid), this score is sufficiently high. There were 89 disagreements, and the final labels were decided by discussion. As a result, we acquired 363 positive and 102 negative machine-translated pairs.
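For reference, Cohen's kappa for two annotators labeling the same items can be computed as follows; this is a minimal sketch of the standard two-rater formula, not the annotation tooling actually used.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each
    # annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in freq_a) / (n * n)
    if expected == 1.0:  # degenerate case: a single label throughout
        return 1.0
    return (observed - expected) / (1 - expected)
```

Perfect agreement yields κ = 1.0, while agreement at exactly the chance rate yields κ = 0.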
Although the machine translation part of our method successfully collected non-trivial positive instances, it acquired only a few non-trivial negative instances, as we expected. To fill the gap between positive and negative instances at higher WOR, we randomly collected sentence pairs from Japanese Wikipedia 8 and added 190 non-trivial negative instances. At the end of both parts of our method, we acquired 655 sentence pairs in total, comprising 363 positive and 292 negative instances.


Corpus Analysis
Table 3 shows the result of corpus analysis on machine-translated instances. We randomly sampled ten pairs from each range of WOR for both positive and negative pairs, i.e., 168 pairs in total, and investigated what types of pairs are included. We found that most of the data comprises content word replacement (63.1%). Further investigation of this category shows that 30.2% of it relates to changes in word origin and transliteration. In Example #1 in Table 4, PBMT outputs a transliteration of "a member", and NMT outputs a Japanese translation. The second most common type is phrasal/sentential replacement (25.0%); a pair is assigned to this category when a bigger chunk of the sentence, or the sentence as a whole, is replaced. This implies that our method, which focuses on sampling by WOR, works to collect non-trivial instances such as Examples #2 and #3. Example #4 illustrates instances where the machine translations demonstrate the characteristics mentioned in Section 1 (PBMT is more literal and NMT is more fluent), so negative instances are produced as we expected. The outputs are semantically close, but their surface forms are very different; in this example, the PBMT output entails the NMT output.

Paraphrase Identification
We conducted a simple PI experiment as unsupervised binary classification: we classified each sentence pair as either paraphrase or non-paraphrase using WOR thresholds and evaluated the accuracy. Figure 4 shows the results for each corpus. The fact that such a simple method achieves around 80% accuracy does not mean that a corpus is well built, in any language. In that respect, this result shows that our corpus includes more instances that are difficult to solve with only superficial clues, which helps develop a more robust PI model.


Conclusion
We proposed a novel paraphrase acquisition method that balances a corpus from two viewpoints: positive/negative and trivial/non-trivial instances. With this method, we built the first evaluation corpus for Japanese PI. According to our PI experiment, our method made the corpus difficult to solve with superficial clues alone.
Our method can be used in other languages, as long as machine translation systems and monolingual corpora exist. In addition, more candidates could be added by including additional machine translation systems. A future study will be undertaken to explore these possibilities.
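The unsupervised WOR-threshold baseline from the Paraphrase Identification section can be sketched as follows. Whitespace tokenization and a Jaccard-style WOR are assumptions for illustration; the toy data below merely shows the mechanics.

```python
def wor(sent_a, sent_b):
    """Jaccard word overlap rate over whitespace tokens (an assumed definition)."""
    a, b = set(sent_a.split()), set(sent_b.split())
    return len(a & b) / len(a | b) if a | b else 1.0

def threshold_accuracy(pairs, labels, threshold):
    """Predict 'paraphrase' iff WOR >= threshold; return accuracy."""
    preds = [wor(a, b) >= threshold for a, b in pairs]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

def best_threshold(pairs, labels):
    """Sweep thresholds 0.0, 0.1, ..., 1.0; return the (threshold, accuracy)
    pair that maximizes accuracy on the given data."""
    return max(((t / 10, threshold_accuracy(pairs, labels, t / 10))
                for t in range(11)), key=lambda x: x[1])
```

On a well-balanced corpus with many non-trivial instances, no single threshold separates the classes well, which is exactly what makes the corpus a harder, more informative benchmark.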