Extract Domain-specific Paraphrase from Monolingual Corpus for Automatic Evaluation of Machine Translation

Paraphrases help match synonyms and phrases with the same or similar meaning, and thus play an important role in the automatic evaluation of machine translation. Traditional approaches extract general-domain paraphrases from bilingual corpora. Because the WMT16 metrics task consists of three sub-tasks, namely the news, medical, and IT domains, we propose to extract domain-specific paraphrase tables from monolingual corpora to replace the general paraphrase table. We use the M-L approach to filter a large-scale general monolingual corpus into a domain-specific sub-corpus, and exploit a Markov network model to extract paraphrase tables from the sub-corpus. Experimental results on the WMT15 Metrics task show that the METEOR metric using the domain-specific paraphrase tables outperforms the one using the general-domain paraphrase table extracted from a bilingual corpus.


Introduction
Automatic evaluation metrics for machine translation (MT), such as BLEU (Papineni et al., 2002), NIST (Doddington, 2002), METEOR (Banerjee et al., 2005), TER (Snover et al., 2006), and MAXSIM (Chan et al., 2008), evaluate the quality of MT system output by calculating the similarity between the translation output and the human reference. Accurately matching words or phrases with the same or similar meaning is critical to the performance of these metrics (Li et al., 2013; Li et al., 2016).
Recently, many works have enhanced traditional metrics by adding paraphrase matching. For instance, in the latest version of the METEOR package (Denkowski and Lavie, 2014), paraphrase matching was added after the standard exact word match, stem match, and synonym match. The latest version of the TER package (Bannard et al., 2005) relaxes the condition of word match or chunk shift by adding paraphrase matching. Note that the paraphrase tables used in the latest METEOR and TER metrics belong to the general domain and are extracted from bilingual parallel corpora by the Pivot approach (Bannard et al., 2005). However, the WMT16 metrics task consists of sub-tasks on specific domains involving several different languages. To address this change, we propose a Monolingual Paraphrase Extraction method based on Domain Adaptation (MPEDA), and use the resulting domain-specific paraphrase tables to replace the traditional paraphrase tables in the latest METEOR package.

Related Work
In statistical natural language processing, both the scale and the quality of the training data have a direct impact on the performance of statistical learning. Take statistical MT as an example: the larger the training data and the more of the n-grams in the test set it covers, the better the quality of the MT output.
To expand the scale of an existing domain-specific corpus, Moore and Lewis (2010) trained language models on a general corpus and on a domain-specific corpus, and computed the cross entropy of each sentence in the general corpus to extract a sub-corpus much larger than the existing domain-specific corpus. In this way, a large-scale domain-specific training corpus for statistical MT was established. Following this approach, Amittai et al. (2011) proposed a bilingual parallel data selection approach based on cross entropy to improve MT performance for spoken language translation. Juri et al. (2015) used Moore and Lewis' approach to filter training data and then extracted paraphrases from the filtered data via the Pivot approach.
Automatically extracting paraphrases from a large-scale corpus is low cost. Barzilay and McKeown (2001) presented an unsupervised learning approach to extract paraphrases of words and phrases from different English translations of the same source-language sentences. Bannard and Callison-Burch (2005) employed the word alignment technique of statistical MT to extract paraphrases from a bilingual parallel corpus. Shinyama et al. (2002) used named entity recognition features to extract paraphrases from a monolingual comparable corpus. Barzilay and Lee (2003) used a text string alignment algorithm to learn sentence-level paraphrases from an unannotated comparable corpus. Yet the latter two monolingual methods place strong restrictions on the corpus, since both require comparable corpora. Therefore, we adopt the Markov-based method proposed by Weng et al. (2015) to extract domain-specific paraphrases from a monolingual corpus, because it places no such restrictions on the monolingual corpus in the target language: it extracts paraphrases by constructing Markov networks of words. Prior to paraphrase extraction, we first filter the large-scale monolingual corpus into a sub-corpus close to the domain of the human references. Compared with the general training corpus, the filtered sub-corpus is smaller and more closely related to the target domain, which improves both the quality of the paraphrase table and the performance of the automatic evaluation metric that uses it.

MPEDA: Monolingual Paraphrase Extraction Based on Domain Adaptation
We extract domain-specific paraphrases from the part of the monolingual corpus that is most closely related to the test data. Our approach aims to accurately match synonyms and phrases with the same or similar meaning in MT outputs and human references with the help of domain-specific paraphrases. We first filter a sub-corpus from a large general corpus with the extended M-L method, then extract paraphrases based on the Markov network model, and finally apply the paraphrase table to the METEOR metric.

Extracting paraphrases based on word chunks
Following the Markov network model, we first use term co-occurrence in the text set to calculate the correlation among terms and construct a term Markov network. The correlation between two words in the network (the edge weight) is computed from the joint conditional probability of the two terms in the text set according to Formulas (1)-(3), in which the conditional probabilities $P(t_i|t_j)$ and $P(t_j|t_i)$ are not equal.
In Formulas (1)-(3), $t_i$ and $t_j$ stand for two terms; $C(t_i, t_j)$ is the number of documents in the whole training data in which $t_i$ and $t_j$ co-occur in the same window; $C(t_i)$ and $C(t_j)$ denote the numbers of documents in the whole training data in which $t_i$ and $t_j$ occur, respectively; and $R(t_i, t_j)$ denotes the correlation between $t_i$ and $t_j$. The greater the value of $R$, the higher the correlation between the two terms.
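As a rough illustration of the counts defined above, the following Python sketch collects document-level occurrence and windowed co-occurrence counts and derives the two conditional probabilities. It assumes the usual relative-frequency estimate $P(t_i|t_j) = C(t_i, t_j)/C(t_j)$ and a sliding token window; the function names, the window size, and the toy documents are hypothetical. The sketch stops at the conditional probabilities and does not attempt to combine them into $R(t_i, t_j)$.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, window=3):
    """Return (C(t), C(t_i, t_j)) as document counts.

    C(t) is the number of documents containing t; C(t_i, t_j) is the number
    of documents in which both terms fall inside one sliding window of
    `window` tokens (assumed window definition).
    """
    term_docs = Counter()
    pair_docs = Counter()
    for tokens in documents:
        term_docs.update(set(tokens))
        pairs = set()
        for start in range(len(tokens)):
            span = set(tokens[start:start + window])
            for a, b in combinations(sorted(span), 2):
                pairs.add((a, b))
        pair_docs.update(pairs)  # each pair counted once per document
    return term_docs, pair_docs


def conditional_prob(t_i, t_j, term_docs, pair_docs):
    """Relative-frequency estimate of P(t_i | t_j) (an assumption)."""
    if term_docs[t_j] == 0:
        return 0.0
    pair = tuple(sorted((t_i, t_j)))
    return pair_docs[pair] / term_docs[t_j]


# Toy documents, already tokenized.
docs = [
    "the computer connects to the internet".split(),
    "the computer is fast".split(),
    "the internet is slow today".split(),
    "a computer crashed".split(),
]
term_docs, pair_docs = cooccurrence_counts(docs, window=6)
p_ci = conditional_prob("computer", "internet", term_docs, pair_docs)
p_ic = conditional_prob("internet", "computer", term_docs, pair_docs)
```

On these toy documents the two conditional probabilities differ, which mirrors the asymmetry noted in the text.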
Extracting paraphrases from the constructed term Markov network rests on the following hypothesis: the more word chunks two terms share, the more similar their meanings are, and thus the more likely the two terms are to form a paraphrase pair. Therefore, we build an n-gram word chunk set for each term and then calculate the ratio between the number of word chunks shared by the two terms and the total number of word chunks in which one of the terms occurs. This ratio is taken as the probability that the two terms form a paraphrase pair, and is obtained by Formulas (4)-(6). Formula (6) is used to calculate the weight of an n-gram word chunk.
In the above formulas, $pos(t_i, t_j)$ is the paraphrase probability of terms $t_i$ and $t_j$; $W_3(t_i, t_j)$ is the sum of the weights of all 3-gram word chunks containing both $t_i$ and $t_j$; $W_3(t_i)$ and $W_3(t_j)$ are the sums of the weights of all 3-gram word chunks containing $t_i$ and $t_j$, respectively; $n$ denotes the number of nodes in a word chunk; and $R(t_i, t_j)$ denotes the correlation between $t_i$ and $t_j$.
We use term co-occurrence to construct the term Markov network, and phrases extracted from the corpus also serve as nodes of the network. Figure 1 shows an example of a 3-gram word chunk, where $t_1$ stands for the term "computer", $t_2$ for "Internet", $t_3$ for "calculating machine", and $t_4$ for "electronic". In this example, the 3-gram word chunk sets of the terms show a high correlation between $t_1$ and $t_3$. Based on the hypothesis of this paper, we consider $t_1$, "computer", and $t_3$, "calculating machine", to be a paraphrase pair.

M-L corpus filtering
The corpus filtering method is similar to the M-L method proposed by Moore and Lewis (2010). To extract a sub-corpus of the target domain from a large general corpus, we first select a domain-specific corpus and a large-scale general corpus. To improve the automatic MT metric, we use the human references of each sub-task in the metrics task as the domain-specific corpus. We train a language model on each of the two corpora and compute the cross entropy of each sentence under both models. Finally, the similarity between a sentence and the human references is measured by the difference between the two cross entropies of that sentence, according to Formula (7). Generally, a smaller value means the sentence is closer to the target domain.

$$ H_{ref}(S_i) - H_{train}(S_i) \qquad (7) $$

In Formula (7), $S_i$ denotes the i-th sentence, $H_{ref}$ denotes the cross entropy under the language model trained on the human references, and $H_{train}$ denotes the cross entropy under the language model trained on the training data.
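The cross-entropy difference can be sketched in Python as follows. This is only an illustration under simplifying assumptions: it uses add-one smoothed unigram models as a stand-in for real n-gram language models, per-token cross entropy in bits, and hypothetical names (`UnigramLM`, `ml_score`) and toy data.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram model (a simplifying stand-in)."""

    def __init__(self, sentences, vocab):
        self.counts = Counter(tok for sent in sentences for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab_size = len(vocab)

    def cross_entropy(self, sentence):
        """Per-token cross entropy (in bits) of `sentence` under the model."""
        log_prob = 0.0
        for tok in sentence:
            p = (self.counts[tok] + 1) / (self.total + self.vocab_size)
            log_prob += math.log2(p)
        return -log_prob / len(sentence)


def ml_score(sentence, lm_ref, lm_train):
    """H_ref(s) - H_train(s); smaller means closer to the references."""
    return lm_ref.cross_entropy(sentence) - lm_train.cross_entropy(sentence)


# Toy corpora: references from the target domain and general training data.
references = [["the", "patient", "received", "a", "dose"],
              ["the", "dose", "was", "reduced"]]
general = [["the", "market", "fell", "sharply"],
           ["the", "patient", "left", "the", "market"],
           ["shares", "fell", "again"]]
vocab = {tok for sent in references + general for tok in sent}
lm_ref = UnigramLM(references, vocab)
lm_train = UnigramLM(general, vocab)

in_domain = ml_score(["the", "patient", "received", "a", "dose"], lm_ref, lm_train)
out_domain = ml_score(["shares", "fell", "again"], lm_ref, lm_train)
```

With these toy corpora the in-domain sentence receives a negative score and the out-of-domain sentence a positive one, so sorting in ascending order ranks the in-domain sentence first.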

Document sets filtering
The Markov network-based automatic paraphrase extraction approach requires dividing the general monolingual corpus into document sets. Weng et al. (2015) treated each fixed-length span of text as a document, without considering the correlation among documents. Instead, we use the bag-of-words model to create a vector for each sentence in the corpus and group the sentences into clusters with the K-means clustering algorithm, where the distance between two sentences is obtained from the cosine of their vectors. Each cluster is viewed as a document. Dividing the corpus into documents via K-means helps ensure that the sentences in a document approximately belong to the same domain.
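The clustering step can be sketched in plain Python as below. This is a minimal illustration, not the configuration used in the experiments: the centroids are seeded with the first k sentences, sentences are assigned to the centroid with the highest cosine similarity, and all names and example sentences are hypothetical.

```python
import math
from collections import Counter


def bow(sentence):
    """Bag-of-words vector as a sparse term -> count mapping."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def kmeans(vectors, k, iterations=10):
    """Cluster sparse vectors by cosine similarity.

    Assumption: the first k vectors seed the centroids, which keeps the
    sketch deterministic; real setups usually use random restarts.
    """
    centroids = [Counter(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by cosine similarity.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[c] = Counter(
                    {t: n / len(members) for t, n in total.items()})
    return labels


sentences = [
    "the patient received a dose",
    "the server rejected the request",
    "the patient needs a lower dose",
    "the server logged the request",
]
labels = kmeans([bow(s) for s in sentences], k=2)
```

Each resulting label identifies the cluster, and therefore the document, to which a sentence belongs.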
Then, the M-L method is used to extract, from the clustered general document sets, the subset of documents that are close to the target domain. In other words, the document rather than the sentence is the smallest filtering unit in the process of corpus filtering. We identify documents that are similar to the target domain by summing the cross-entropy differences of the sentences in each document. However, because the number of sentences varies across the documents produced by K-means, we divide this sum by the number of sentences, and use the resulting mean as the score of each document.
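The document-level score described above is a plain average, which the following sketch makes concrete. The per-sentence differences are made-up numbers, and the function names and the top-n cut-off are hypothetical.

```python
def document_score(sentence_diffs):
    """Mean cross-entropy difference over the sentences of one document."""
    return sum(sentence_diffs) / len(sentence_diffs)


def select_documents(doc_diffs, top_n):
    """Return the ids of the top_n documents with the smallest mean score."""
    scores = {doc_id: document_score(diffs)
              for doc_id, diffs in doc_diffs.items()}
    return sorted(scores, key=scores.get)[:top_n]


# Hypothetical per-sentence cross-entropy differences for three documents.
doc_diffs = {
    "D1": [-0.8, -0.4, -0.6],
    "D2": [0.5, 0.9],
    "D3": [-0.1, 0.1, 0.0, -0.2],
}
selected = select_documents(doc_diffs, top_n=2)
```

Averaging rather than summing keeps long documents from being penalized or favored merely because they contain more sentences.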

Experiments
To test the quality of the domain-specific paraphrase extracted from monolingual corpus by the proposed approach, we conducted experiments on WMT15 Metrics task.
The METEOR-Universal metric (Denkowski and Lavie, 2014), which uses paraphrase tables extracted from bilingual parallel corpora, was set as the baseline metric. We replaced its original paraphrase tables with the general-domain paraphrase tables extracted by the Markov network model and with the domain-specific paraphrase tables extracted by our approach, respectively. The resulting metrics are called METEOR-Markov and METEOR-MPEDA. We compared the METEOR-MPEDA metric with the METEOR-Markov and METEOR-Universal metrics to demonstrate the quality of the domain-specific paraphrase tables extracted by our approach. In addition, we compared METEOR-MPEDA with the METEOR metric (Banerjee et al., 2005), which only uses the exact word match, stem match, and synonym match.

Corpus
The training data and the human references used in the experiments are all provided by the WMT15 Translation task and Metrics task (Bojar et al., 2015); each set of training data has its corresponding references. Table 1 shows the number of sentences in the corpora. The row "T-corpus" denotes the training data, while the row "ref" denotes the references.
The training data was processed by text clustering. We used the bag-of-words model to create a vector for each sentence, obtained the distance between two sentences from the cosine of their vectors, and used the K-means clustering algorithm to group the sentences into clusters. Each cluster was viewed as a document. The i-th document in the training data is named $D_i$, and the number of sentences differs from document to document. Table 2 shows the number of documents after clustering the training data. The row "D-corpus" gives the number of documents in the training data.

Experiments Settings
After dividing the training data into documents, we preprocessed the corpus as follows: tokenize the training data and the references, delete punctuation, and convert all words to lower case. Then, we trained 4-gram language models with Kneser-Ney discounting on the training data and on the references. For each sentence in the training data, we calculated the difference between its cross entropies under the two language models. We then summed and normalized these differences over the sentences of each document, so that every document in the training data received a score. The smaller the score, the closer the document is to the references. We sorted the documents by score in ascending order, set a threshold, and discarded the documents beyond the threshold. In this way, we obtained a smaller sub-corpus whose domain is approximately the same as that of the references. We set a different threshold for each sub-task; in other words, we selected the top n documents after sorting.
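The three preprocessing steps can be sketched as a single Python function. This is a simplified stand-in: it tokenizes on whitespace and removes only ASCII punctuation, whereas the tokenizer actually used is not specified here; the name `preprocess` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(line):
    """Tokenize on whitespace, delete punctuation, and lowercase."""
    tokens = []
    for tok in line.split():
        tok = tok.translate(_PUNCT_TABLE).lower()
        if tok:  # drop tokens that consisted only of punctuation
            tokens.append(tok)
    return tokens


tokens = preprocess("The Patient received a dose , twice !")
```

The same function would be applied to both the training data and the references so that the two language models see identically normalized text.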
We then built a term Markov network model on the sub-corpus: we calculated the correlation among words according to word co-occurrence, extracted the word chunks in the Markov network, and computed the likelihood that two words form a paraphrase pair by comparing the similarity of their word chunk sets. In this work, we extracted ten paraphrase tables for ten sub-tasks in six languages on WMT15.

Results
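The Pearson correlation coefficient is used to compute the system-level correlation between automatic evaluation and human judgments as follows:

$$ r = \frac{\sum_{i=1}^{n}(H_i-\bar{H})(M_i-\bar{M})}{\sqrt{\sum_{i=1}^{n}(H_i-\bar{H})^2}\,\sqrt{\sum_{i=1}^{n}(M_i-\bar{M})^2}} $$

where $H_i$ and $M_i$ are the scores of the i-th system given by human judgment and by the automatic evaluation metric, respectively, $\bar{H}$ and $\bar{M}$ are their means, and $n$ is the number of systems.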
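For reference, the Pearson coefficient over system-level scores can be computed with a few lines of Python. The function name and the score lists below are illustrative only and are not taken from the experiments.

```python
import math


def pearson(human, metric):
    """Pearson correlation between two equally long lists of scores."""
    n = len(human)
    mean_h = sum(human) / n
    mean_m = sum(metric) / n
    cov = sum((h - mean_h) * (m - mean_m) for h, m in zip(human, metric))
    var_h = sum((h - mean_h) ** 2 for h in human)
    var_m = sum((m - mean_m) ** 2 for m in metric)
    return cov / math.sqrt(var_h * var_m)


# Hypothetical human and metric scores for four MT systems.
r = pearson([0.1, 0.3, 0.5, 0.9], [0.21, 0.25, 0.31, 0.38])
```

A value close to 1 indicates that the metric ranks and spaces the systems much as the human judges do.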
The system-level correlation of the metrics is given in Table 3 and Table 4. The tables show that, on average, the system-level correlation of the METEOR-MPEDA metric is better than that of METEOR, METEOR-Universal, and METEOR-Markov.
Furthermore, Kendall's τ coefficient was used to compute the segment-level correlation between automatic evaluation metrics and human judgments as follows:

$$ \tau = \frac{|Concordant| - |Discordant|}{|Concordant| + |Discordant|} $$

where Concordant denotes the set of pairwise comparisons on which the human judgment and the automatic evaluation metric's scores agree, while Discordant denotes the set on which they disagree. The segment-level correlation is given in Table 5 and Table 6. On average, the segment-level correlation of the METEOR-MPEDA metric on the translation into English tasks is better than that of the METEOR, METEOR-Universal, and METEOR-Markov metrics. However, on the translation out of English tasks, the performance of the METEOR-MPEDA metric is slightly lower than that of the METEOR-Universal metric. A possible explanation is that, when a large amount of bilingual parallel training data is available, the paraphrase table extracted from the bilingual corpus is better than that extracted from a monolingual corpus for automatic evaluation of MT.
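The segment-level τ can be sketched as follows. The sketch assumes each human judgment is given as a pair of metric scores (for the translation the human ranked higher and for the one ranked lower), and it counts metric ties as discordant; that tie handling is an assumption, and the function name and scores are hypothetical.

```python
def kendall_tau(judgments):
    """Kendall's tau over pairwise human judgments.

    Each item is (score_better, score_worse): the metric scores of the
    translation the human ranked higher and of the one ranked lower.
    """
    concordant = sum(1 for better, worse in judgments if better > worse)
    # Assumption: a metric tie counts as a disagreement with the human.
    discordant = len(judgments) - concordant
    return (concordant - discordant) / (concordant + discordant)


# Four hypothetical pairwise judgments; the metric agrees on three of them.
tau = kendall_tau([(0.42, 0.31), (0.55, 0.60), (0.38, 0.20), (0.47, 0.33)])
```

The value ranges from -1 (the metric always disagrees with the human ranking) to 1 (it always agrees).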

Conclusion
In this paper, we describe in detail our metric submission for the WMT16 Metrics task. We propose an approach to extract domain-specific paraphrase tables from a monolingual corpus for automatic evaluation of MT, and use them to replace the original paraphrase table in the METEOR metric to improve the correlation between human judgment and automatic evaluation metrics. The proposed approach is tested on the newswire domain.
In future work, we will systematically apply it to different specific domains such as the medical domain, IT domain, etc.