Identification of Parallel Sentences in Comparable Monolingual Corpora from Different Registers

Parallel aligned sentences provide useful information for different NLP applications. Yet, this kind of data is seldom available, especially for languages other than English. We propose to exploit comparable corpora in French which are distinguished by their registers (specialized and simplified versions) to detect and align parallel sentences. These corpora are related to the biomedical area. Our purpose is to state whether a given pair of specialized and simplified sentences is to be aligned or not. Manually created reference data show 0.76 inter-annotator agreement. We exploit a set of features and several automatic classifiers. The automatic alignment reaches up to 0.93 Precision, Recall and F-measure. In order to better evaluate the method, it is applied to data in English from the SemEval STS competitions. The same features and models are applied in monolingual and cross-lingual contexts, in which they show up to 0.90 and 0.73 F-measure, respectively.


Introduction
The purpose of text simplification is to provide simplified versions of texts by removing or replacing difficult words or information. Simplification can be concerned with different linguistic aspects, such as lexicon, syntax, semantics, pragmatics and even document structure. Simplification can address the needs of people or of NLP applications (Brunato et al., 2014). In the first case, simplified documents are typically created for children (Son et al., 2008;De Belder and Moens, 2010;Vu et al., 2014), people with low literacy or foreigners (Paetzold and Specia, 2016), people with mental or neurodegenerative disorders (Chen et al., 2016), or laypeople who face specialized documents (Arya et al., 2011;Leroy et al., 2013). In the second case, the purpose of simplification is to transform documents in order to make them easier to process within other NLP tasks, such as syntactic analysis (Chandrasekar and Srinivas, 1997;Jonnalagadda et al., 2009), semantic annotation (Vickrey and Koller, 2008), summarization (Blake et al., 2007), machine translation (Stymne et al., 2013;Štajner and Popović, 2016), indexing (Wei et al., 2014), or information retrieval and extraction (Beigman Klebanov et al., 2004). Hence, parallel sentences, which align difficult and simple information, provide crucial indicators for text simplification. Indeed, such pairs of sentences contain cues on transformations which are suitable for simplification, such as lexical substitutes and syntactic modifications. Yet, this kind of resource is seldom available, especially in languages other than English. The purpose of our work is to detect and align parallel sentences from comparable monolingual corpora that are differentiated by their registers; such comparable corpora are easier to obtain than parallel ones. More precisely, we work with texts in French written for specialists and with their simplified versions.

Existing Work
In parallel corpora, sentence alignment can rely on empirical information, such as the relative length of the sentences in each language (Gale and Church, 1993), or on lexical information (Chen, 1993). In comparable corpora, both monolingual and bilingual, sentences share relatively loose common semantics and do not necessarily occur in the same order. It should also be noted that (1) the degree of parallelism can vary from nearly parallel corpora, with a lot of parallel sentences, to very non-parallel corpora (Fung and Cheung, 2004); and that (2) such corpora can contain parallel information at various degrees of granularity, such as documents, sentences or sub-sentential segments (Hewavitharana and Vogel, 2011). Detection of parallel sentences in comparable corpora is thus a substantial challenge and requires specific methods.
With monolingual comparable corpora, the main difficulty is that sentences may show low lexical overlap and nevertheless be parallel. Recently, this task has gained in popularity thanks to the semantic text similarity (STS) initiative. Dedicated SemEval competitions have been proposed for several years (Agirre et al., 2013, 2015, 2016). The objective, for a given pair of sentences, is to predict whether they are semantically similar and to assign a similarity score going from 0 (independent semantics) to 5 (semantic equivalence). This task is usually explored in general-language corpora. Among the methods exploited, we can note:
• lexicon-based methods, which rely on the similarity of subwords or words from the processed texts or on machine translation (Madnani et al., 2012). The features exploited can be: lexical overlap, sentence length, string edit distance, numbers, named entities, the longest common substring (Clough et al., 2002;Zhang and Patrick, 2005;Qiu et al., 2006;Zhao et al., 2014;Nelken and Shieber, 2006;Zhu et al., 2010);
• knowledge-based methods, which exploit external resources, such as WordNet (Miller et al., 1993) or PPDB (Ganitkevitch et al., 2013). The features exploited can be: overlap with external resources, distance between the synsets, intersection of synsets, semantic similarity of resource graphs, presence of synonyms, hypernyms or antonyms (Mihalcea et al., 2006;Fernando and Stevenson, 2008;Lai and Hockenmaier, 2014);
• syntax-based methods, which exploit the syntactic modelling of sentences. The features often exploited are: syntactic categories, syntactic overlap, syntactic dependencies and constituents, predicate-argument relations, edit distance between syntactic trees (Wan et al., 2006;Severyn et al., 2013;Tai et al., 2015;Tsubaki et al., 2016);
• corpus-based methods, which exploit distributional methods, latent semantic analysis (LSA), topic modelling, word embeddings, etc. (Barzilay and Elhadad, 2003;Guo and Diab, 2012;Zhao et al., 2014;Kiros et al., 2015;He et al., 2015;Mueller and Thyagarajan, 2016).
These methods and types of features can of course be combined to optimize the results (Bjerva et al., 2014;Lai and Hockenmaier, 2014;Zhao et al., 2014;Rychalska et al., 2016;Severyn et al., 2013;Kiros et al., 2015;He et al., 2015;Tsubaki et al., 2016;Mueller and Thyagarajan, 2016). Our objective is close to the second type of work: we want to detect and align parallel sentences from monolingual comparable corpora. Yet, there are some differences: (1) we work with corpora related to the biomedical area and not to the general language, (2) we have to state whether two sentences have to be aligned (binary decision) rather than compute their similarity score, and (3) we work with data in French, which have not been exploited for this kind of task yet. To our knowledge, the only work which exploited articles from French encyclopedias relied on manual alignment of sentences (Brouwers et al., 2014).
In what follows, we first present the linguistic material used, and the methods proposed. We then present and discuss the results obtained, and conclude with directions of future work.

Linguistic Material
We use three comparable corpora in French. They are related to the biomedical domain and are contrasted by the technicality of the information they contain, with typically specialized and simplified versions of a given text. These corpora cover three genres: drug information, summaries of scientific articles, and encyclopedia articles (Sec. 3.1). We also exploit a set of stopwords (Sec. 3.2), and the reference data with sentences manually aligned by two annotators (Sec. 3.3). Table 1 indicates the size of the source corpora (number of documents, number of words in specialized and simplified versions).

Comparable Corpora
The Drug corpus contains drug information as provided to health professionals and to patients. Indeed, two distinct sets of documents exist, each of which contains common and specific information. This corpus is built from the public drug database of the French Health ministry. These data were downloaded in June 2017. We can see that the specialized versions of the documents contain more word occurrences.
The Scientific corpus contains summaries of meta-reviews of high-evidence health-related articles, as proposed by the Cochrane collaboration (Sackett et al., 1996). These reviews were first intended for health professionals, but recently the collaborators started to create simplified versions of the reviews (Plain language summary) so that they can be read and understood by the whole population. This corpus has been built from the online library of the Cochrane collaboration. The data were downloaded in November 2017. We can see that the specialized versions of the summaries are also larger than the simplified versions, although the difference is small.
The Encyclopedia corpus contains encyclopedia articles from Wikipedia and Vikidia. Wikipedia articles are considered as technical texts, while Vikidia articles are considered as their simplified versions (they are created for children 8 to 13 years old). Following the work done in English, we consider Vikidia as the counterpart of Simple Wikipedia. Only articles related to the medical portal are exploited in this work. These encyclopedia articles were downloaded in August and September 2017. From Table 1, we can see that specialized versions (from Wikipedia) are also longer than simplified versions.
These three corpora are more or less parallel: Wikipedia and Vikidia articles are written independently from each other, drug information documents are related to the same drugs but the types of information presented for experts and laypeople vary a lot, while simplified summaries from the scientific corpus are created starting from the expert summaries.

Reference Data
In this section we describe the data that are used for training and evaluation of the automatic sentence alignments.
The reference data are created manually. We have randomly selected 2*14 encyclopedia articles, 2*12 drug documents, and 2*13 scientific summaries. The sentence alignment is done by two annotators following these guidelines:
4. include sentence pairs in which one sentence is included in the other, which enables many-to-one matching (e.g. "C'est un organe fait de tissus membraneux et musculaires, d'environ 10 à 15 mm de long, qui pend à la partie moyenne du voile du palais." and "Elle est constituée d'un tissu membraneux et musculaire.", i.e. "It is an organ made of membranous and muscular tissues, approximately 10 to 15 mm long, that hangs from the middle part of the soft palate." and "It is made of a membranous and muscular tissue.");
5. include sentence pairs with equivalent semantics, other than semantic intersection and inclusion (e.g. "Les médicaments inhibant le péristaltisme sont contre-indiqués dans cette situation." and "Dans ce cas, ne prenez pas de médicaments destinés à bloquer ou ralentir le transit intestinal.", i.e. "Drugs that inhibit peristalsis are contraindicated in that situation." and "In that case, do not take drugs intended for blocking or slowing down the intestinal transit.").
The judgement on semantic closeness may vary according to the annotators. For this reason, the alignments provided by each annotator undergo consensus discussions. This alignment process provides a set of 663 aligned sentence pairs. The inter-annotator agreement is 0.76 (Cohen, 1960). It is computed within the two sets of sentences proposed for alignment by the two annotators. Table 2 indicates the size of the reference data before (source columns) and after (aligned columns) the alignment. In the last two columns (Alignment rate), we indicate the percentage of sentences aligned in each register and corpus.
We can observe that the scientific corpus is the most parallel, with the highest alignment rate of sentences from specialized and simplified documents, while the two other corpora (drugs and encyclopedia) contain proportionally fewer parallel sentences. Another interesting observation is that sentences from simplified documents in the scientific and drugs corpora are longer than sentences from specialized documents because they often add explanations for technical notions, as in this example: We considered studies involving bulking agents (a fibre supplement), antispasmodics (smooth muscle relaxants) or antidepressants (drugs used to treat depression that can also change pain perceptions) that used outcome measures including improvement of abdominal pain, global assessment (overall relief of IBS symptoms) or symptom score. In the encyclopedia corpus, such notions are replaced by simpler words, or removed. Finally, in all corpora, we observe frequent substitutions by synonyms, as in these pairs: {nutrition; food}, {enteral; directly in the stomach}, {hypersensitivity; allergy}, {incidence; possible complications}. Notice that such substitutions reduce the lexical similarity between sentences.
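The agreement figure reported above is cited to Cohen (1960), i.e. Cohen's kappa. As a minimal sketch of how such a coefficient is obtained, assuming each annotator's output is reduced to one binary decision (align or not) per candidate sentence pair, the computation can be written as follows. The decision lists are hypothetical illustration data, not the annotations of this study.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical decisions (1 = align, 0 = do not align) on ten candidate pairs.
annotator_1 = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
annotator_2 = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy example the annotators agree on 8 pairs out of 10 while chance agreement is 0.5, so kappa is about 0.6.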

Automatic Alignment of Parallel Sentences
As already indicated, our objective is to detect and align parallel sentences within monolingual comparable corpora in French. We already know which documents are comparable. The task is therefore dedicated to the alignment of sentences from specialized and simplified versions of documents. The method is composed of several steps: pre-processing of data (Sec. 4.1), generation of features (Sec. 4.2), automatic alignment of sentences (Sec. 4.3), and evaluation (Sec. 4.4).

Pre-processing of Data
The documents are first pre-processed: they are POS-tagged with TreeTagger (Schmid, 1994), which also provides their lemmatized versions. Then, the documents are segmented into sentences using strong punctuation (i.e. .?!;:). The same pre-processing and segmentation have been applied when creating the reference data.
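The segmentation step can be illustrated with a minimal sketch. It assumes that a sentence boundary is a strong punctuation mark followed by whitespace, which is a simplification chosen here so that decimal numbers such as 0.5 are not split; the source does not specify how such cases are handled. The function name `split_sentences` is hypothetical.

```python
import re

# Strong punctuation marks listed in the text: . ? ! ; :
# Assumption: a boundary is one of these marks followed by whitespace.
BOUNDARY = re.compile(r"(?<=[.?!;:])\s+")

def split_sentences(text):
    """Split a document into sentences after strong punctuation."""
    return [part.strip() for part in BOUNDARY.split(text) if part.strip()]

example = ("Ne prenez pas ce médicament. Consultez votre médecin; "
           "la dose est de 0.5 mg.")
sentences = split_sentences(example)
```

Lemmatization with TreeTagger is not reproduced here, since it depends on an external tool.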

Feature Generation
Our goal is to propose features that can work on textual data in different languages. We use several features which are mainly lexicon-based and corpus-based, so that they can be easily applied or transposed to textual data in other languages. The features are computed on word forms and on lemmas:
1. Number of common non-stopwords. This feature computes the basic lexical overlap between specialized and simplified versions of sentences (Barzilay and Elhadad, 2003). It exploits external knowledge (a set of stopwords), which is nevertheless a very common linguistic resource;
2. Number of common stopwords. This feature also exploits external knowledge (a set of stopwords). It concentrates on the non-lexical content of sentences;
3. Percentage of words from one sentence included in the other sentence, computed in both directions. This feature represents possible lexical and semantic inclusion relations between the sentences;
4. Sentence length difference between specialized and simplified sentences. This feature assumes that simplification may be consistently associated with a change in sentence length;
5. Average length difference in words between specialized and simplified sentences. This feature is similar to the previous one but takes into account the average difference in sentence length;
6. Total number of common bigrams and trigrams. This feature is computed on character n-grams. The assumption is that, at the sub-word level, some sequences of characters may be meaningful for the alignment of sentences if they are shared by them;
7. Word-based similarity measure, which exploits three scores (cosine, Dice and Jaccard). This feature provides a more sophisticated indication of word overlap between the two compared sentences. The weight assigned to each word is set to 1;
8. Word-based similarity measure with the tf*idf weighting of words (Nelken and Shieber, 2006). This feature is similar to the previous one but it also exploits information on context by incorporating the tf*idf weighting (Salton and Buckley, 1988) of words. For this, sentences are considered as documents and documents as corpora. This feature weighs the words in a sentence with respect to their occurrences in other sentences of the document;
9. Character-based minimal edit distance (Levenshtein, 1966). This is the classical definition of edit distance. It takes into account basic edit operations (insertion, deletion and substitution) at the level of characters. The cost of each operation is set to 1;
10. Word-based minimal edit distance (Levenshtein, 1966). This feature is computed with words as units within the sentence. It takes into account the same three edit operations with the same cost set to 1. This feature computes the cost of the lexical transformation of one sentence into another.
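Several of the features listed above (1, 2, 3, 6, 7, 9 and 10) can be sketched in a few lines. This is only an illustration under stated assumptions: tokenization is a plain regular expression rather than TreeTagger output, overlaps are counted on distinct words and distinct character n-grams, and the stopword set is a small hypothetical sample. All function names are ours, not the authors'.

```python
import math
import re

def tokenize(sentence):
    # Assumption: lowercase word tokens; hyphenated words are split.
    return re.findall(r"\w+", sentence.lower())

def common_word_counts(tokens_a, tokens_b, stopwords):
    """Features 1 and 2: (common non-stopwords, common stopwords)."""
    shared = set(tokens_a) & set(tokens_b)
    n_stop = len(shared & stopwords)
    return len(shared) - n_stop, n_stop

def inclusion_rate(tokens_a, tokens_b):
    """Feature 3: share of the words of A that also occur in B."""
    if not tokens_a:
        return 0.0
    vocab_b = set(tokens_b)
    return sum(t in vocab_b for t in tokens_a) / len(tokens_a)

def common_char_ngrams(sent_a, sent_b):
    """Feature 6: number of distinct character bigrams and trigrams shared."""
    def grams(s, n):
        return {s[i:i + n] for i in range(len(s) - n + 1)}
    return sum(len(grams(sent_a, n) & grams(sent_b, n)) for n in (2, 3))

def set_similarities(tokens_a, tokens_b):
    """Feature 7: cosine, Dice and Jaccard with every word weighted 1."""
    a, b = set(tokens_a), set(tokens_b)
    if not a or not b:
        return {"cosine": 0.0, "dice": 0.0, "jaccard": 0.0}
    inter = len(a & b)
    return {"cosine": inter / math.sqrt(len(a) * len(b)),
            "dice": 2 * inter / (len(a) + len(b)),
            "jaccard": inter / len(a | b)}

def levenshtein(seq_a, seq_b):
    """Features 9 and 10: edit distance over characters (str) or words (list)."""
    previous = list(range(len(seq_b) + 1))
    for i, item_a in enumerate(seq_a, start=1):
        current = [i]
        for j, item_b in enumerate(seq_b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Hypothetical stopword sample; the actual stopword list is not given here.
STOPWORDS = {"le", "les", "dans", "de", "ce", "à", "ou", "ne", "pas", "sont", "cette"}
spec = tokenize("Les médicaments inhibant le péristaltisme sont contre-indiqués dans cette situation.")
simp = tokenize("Dans ce cas, ne prenez pas de médicaments destinés à bloquer ou ralentir le transit intestinal.")
overlap = common_word_counts(spec, simp, STOPWORDS)
scores = set_similarities(spec, simp)
```

On this pair, taken from the annotation guidelines, only one content word is shared ("médicaments"), which illustrates why purely lexical overlap can be low for sentences that are nevertheless parallel.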

Automatic Alignment of Sentences
The task is to find parallel sentences within the whole set of sentences we described in section 3.3. Hence, we have to categorize the pairs of sentences in one of the two categories:
• alignment: the sentences are parallel and can be aligned;
• non-alignment: the sentences are non-parallel and cannot be aligned.
The reference data provide positive examples (663 parallel sentences), while negative examples are obtained by randomly pairing some of the remaining sentences (800 non-parallel sentences) from the same documents.
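The construction of negative examples can be sketched as follows. The sketch assumes that "remaining sentences" means sentences that take part in no reference alignment, and that each negative pair combines one specialized and one simplified sentence from the same document; both points are our reading of the description. The data layout (`documents`, index triples) and the function name are hypothetical.

```python
import random

def sample_negative_pairs(documents, positive_pairs, n_negatives, seed=0):
    """Randomly pair unaligned specialized and simplified sentences.

    documents: dict doc_id -> (specialized_sentences, simplified_sentences)
    positive_pairs: iterable of (doc_id, spec_index, simp_index) triples
    """
    rng = random.Random(seed)
    positives = list(positive_pairs)
    aligned_spec = {(d, i) for d, i, _ in positives}
    aligned_simp = {(d, j) for d, _, j in positives}
    # Candidate negatives: same document, both sentences unaligned.
    candidates = [
        (doc_id, i, j)
        for doc_id, (spec, simp) in documents.items()
        for i in range(len(spec)) if (doc_id, i) not in aligned_spec
        for j in range(len(simp)) if (doc_id, j) not in aligned_simp
    ]
    return rng.sample(candidates, min(n_negatives, len(candidates)))

# Toy document with three sentences per register; sentences 0 are aligned.
docs = {"doc1": (["s0", "s1", "s2"], ["t0", "t1", "t2"])}
negatives = sample_negative_pairs(docs, [("doc1", 0, 0)], n_negatives=3)
```

The fixed seed only makes the sketch reproducible; it is not taken from the source.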
We use several linear classifiers with their default parameters if not indicated otherwise: Perceptron (Rosenblatt, 1958), Multilayer Perceptron (MLP) (Rosenblatt, 1961), Linear discriminant analysis (LDA) (Fisher, 1936) with the LSQR solver, Quadratic discriminant analysis (QDA) (Cover, 1965), Logistic regression (Berkson, 1944), Stochastic gradient descent (SGD) (Ferguson, 1982) with the log loss, Linear SVM (Vapnik and Lerner, 1963). We also tested the hinge and modified Huber loss functions with the SGD, and the Eigen and SVD solvers with the LDA, but the results were either lower than or very close to those of the best configuration, so we did not retain them.
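To make the classification step concrete, here is a minimal, dependency-free sketch of a logistic model trained by stochastic gradient descent on the log loss. It is not the implementation used in the experiments (the toolkit is not named in this section); the learning rate, number of epochs and the two-feature toy vectors are arbitrary assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_sgd_logistic(X, y, epochs=200, lr=0.1, seed=0):
    """Fit weights and bias by SGD on the log loss (no regularization)."""
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    bias = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for idx in order:
            z = bias + sum(w * x for w, x in zip(weights, X[idx]))
            error = sigmoid(z) - y[idx]  # derivative of the log loss w.r.t. z
            weights = [w - lr * error * x for w, x in zip(weights, X[idx])]
            bias -= lr * error
    return weights, bias

def predict(model, features):
    """Return 1 (alignment) or 0 (non-alignment)."""
    weights, bias = model
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(z) >= 0.5 else 0

# Hypothetical feature vectors: [Jaccard score, inclusion rate].
X_train = [[0.6, 0.8], [0.5, 0.7], [0.7, 0.9], [0.05, 0.1], [0.1, 0.2], [0.0, 0.1]]
y_train = [1, 1, 1, 0, 0, 0]
model = train_sgd_logistic(X_train, y_train)
predictions = [predict(model, x) for x in X_train]
```

With the full feature set, each sentence pair would be represented by one vector holding all feature values instead of the two used here.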

Evaluation
The training of the system is performed on two thirds of the sentence pairs, and the test is performed on the remaining third. Several classifiers and several combinations of features are tested. Classical evaluation measures are computed: Precision, Recall, F-measure, Mean Square Errors, and True Positives. Our baseline is the combination of length measures with the common words (features 1, 2, 4 and 5). These features are indeed traditionally exploited in the existing work.
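The evaluation protocol can be sketched as follows. The sketch assumes that the pairs are shuffled before the two-thirds split and that Precision, Recall and F-measure are computed for the alignment class; the source does not state whether its scores are averaged over both classes. Function names and the toy labels are hypothetical.

```python
def split_train_test(pairs, train_fraction=2 / 3):
    """Split an already shuffled list of pairs into train and test parts."""
    cut = round(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]

def evaluate(gold, predicted):
    """Precision, Recall, F-measure, MSE and true positives for binary labels."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, predicted))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, predicted))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # With 0/1 labels, the mean squared error equals the error rate.
    mse = sum((g - p) ** 2 for g, p in zip(gold, predicted)) / len(gold)
    return {"P": precision, "R": recall, "F": f_measure, "MSE": mse, "TP": tp}

train, test = split_train_test(list(range(9)))
report = evaluate([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

In the toy call, two of the three positive pairs are found and one negative pair is wrongly aligned, so Precision, Recall and F-measure are all 2/3 and the MSE is 1/3.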
We also evaluate the system on data in English that were released for the STS competitions: we use 750 sentence pairs from SemEval 2012, 1,500 sentence pairs from SemEval 2013, and 3,750 sentence pairs from SemEval 2014. Each pair of sentences is associated with a similarity score in [0;5]. We apply our system to these data in two ways: (1) the system is trained and tested on the STS dataset, and (2) the system is trained on our dataset in French and tested on the STS dataset in English.
We assume indeed that the features used and even the models generated can be transposed to data in other languages. For the experiments with the English data, we use the same evaluation measures (Precision, Recall, F-measure, Mean Square Errors, and True Positives). The set of stopwords in English contains 150 entries.
In Table 3, we present the results obtained on the French data using the whole set of features (but without the tf*idf similarity scores) on the test set, with non-lemmatized texts. The results are indicated in terms of Recall R, Precision P, F-measure F, Mean Square Errors MSE and True positives TP (out of the 221 positive sentence pairs in the test set). We can see that all the classifiers are competitive, with F-measure above 0.80. Overall, several classifiers (LDA, QDA, LogReg, LinSVM) provide stable results, for which we indicate the evaluation scores obtained in one iteration. Other classifiers (Perceptron, MLP, SGD) provide fluctuating results, for which we indicate the average scores obtained after 20 iterations. Another positive observation is that Precision and Recall values are well balanced. Logistic regression seems to be the best classifier for this task, with Precision, Recall and F-measure at 0.93. This classifier is used for the experiments described in the next sections.

Results and Discussion
We first present and discuss the exploitation of various feature sets on French data (Sec. 5.1), and then the exploitation of the features and models on the STS data in English in monolingual (Sec. 5.2) and cross-lingual (Sec. 5.3) contexts. As our final objective (text simplification in French) and the data we work on (French texts from the biomedical domain) are different from the STS context, it should be noted that there are intrinsic limitations to the comparison we can make.
The purpose of these experiments is to detect the most suitable combinations of features. We present the results obtained on our data. We distinguish four sets of features, which are used in isolation and in various combinations. We indicate the corresponding numbers from section 4.2 in brackets:
Contrary to the previous work (Nelken and Shieber, 2006;Zhu et al., 2010), the tf*idf weighting of words is not efficient on our data. For this reason, this set of features was not used in the experiments.
The results are presented in Table 4. The lowest results are obtained with the Levenshtein-based features (F-measure 0.78), followed by the similarity-based features (F-measure 0.84). We obtain 0.86 F-measure with the baseline. Other combinations indicate that each set of features exploited is useful to gain efficiency for this task. Hence, the best results are obtained with the combination BL+L+N and with the whole set of features (BL+L+S+N), which shows 0.93 F-measure. We use the whole set of features for the experiments with the STS dataset.
In this set of experiments, the classification model is trained and tested on the STS reference data in English. Our assumption is that the features exploited are transferable from one language to another. The reference data and categories in English and in French differ. One difference is that the STS pairs of sentences are scored from 0 to 5 according to their similarity, while in French we do binary classification (a given pair of sentences should be aligned or not). To make the two datasets comparable, we propose to transform the STS scores into binary categories. We test similarity thresholds within the interval [2.5;4.5] in steps of 0.5, which makes it possible to disregard identical sentences (scores close to 5) and very distant sentences (scores lower than 2.5). As indicated in Table 5, we obtain up to 0.90 F-measure with the similarity threshold 4.5 on data from 2013 and 2014, while in 2012 the best F-measure (0.82) is obtained with the similarity threshold 2.5. It is difficult to compare our results with those of the participating teams and with already published results because our categories and evaluation differ from the STS protocols: we rate sentence pairs as either aligned or not aligned, while STS offers a scale from 0 to 5. Yet, the MSE (0.308) published by one of the top participants in 2014 (Bjerva et al., 2014) suggests that our MSE is lower, as it is at 0.29 on the 2014 data.
In this set of experiments, the classification model is trained on French data and tested on the STS data in English. Here, our assumption is that the models generated on one language can be transferred to another language in order to detect parallel sentences. Here as well, we test several similarity thresholds. As we can see in Table 6, in this cross-lingual experiment, the best F-measures are obtained with the threshold 2.5 in 2012 (0.82) and in 2014 (0.73), and with thresholds 2.5 and 3.0 in 2013 (0.74). These results indicate that the models generated on our French data can be exploited on the STS data in English quite efficiently and that the features used show cross-lingual relevance for the French-English language pair. These results also indicate that, for the targeted task of text simplification, we need quite a strong similarity between sentences.
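The conversion of STS similarity scores into binary categories, used in both STS experiments, can be sketched as follows. The sketch assumes that a pair counts as aligned when its score is greater than or equal to the threshold; whether the comparison is strict is not stated in the text. The score list is hypothetical.

```python
# Thresholds tested in the text: interval [2.5;4.5] in steps of 0.5.
THRESHOLDS = [2.5, 3.0, 3.5, 4.0, 4.5]

def binarize(scores, threshold):
    """Label a pair 1 (aligned) when its STS score reaches the threshold."""
    return [1 if score >= threshold else 0 for score in scores]

# Hypothetical STS gold scores for five sentence pairs.
sts_scores = [0.0, 2.5, 3.2, 4.5, 5.0]
labels_by_threshold = {t: binarize(sts_scores, t) for t in THRESHOLDS}
```

A higher threshold keeps only near-equivalent pairs as positives, so the same classifier output is evaluated against a stricter definition of alignment.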

Conclusion and Future Work
In this work, we proposed to address the task of detection and alignment of parallel sentences from monolingual comparable corpora in French. The comparable dimension is due to the technicality of documents, which contrast specialized and simplified versions of documents and sentences. We use three corpora which are related to the biomedical area. Several features and classifiers are exploited. Our results reach up to 0.93 F-measure on the French data, with a very good balance between Precision and Recall. Logistic regression appears to be the best classifier for this task. Our approach is then tested on the STS data in English, as proposed by several SemEval competitions between 2012 and 2014. We first test the features, with training and testing done on the STS data. This gives up to 0.90 F-measure with the 4.5 similarity threshold. Then, we test the models: they are generated on the French data and tested on the STS data. This gives up to 0.82 F-measure. We conclude that the proposed approach (features and classifiers) shows good transferability to another language, which supports its validity on data from another language.
In the future, we plan to exploit the best models generated in French for enriching the set of parallel sentences. This will make it possible to prepare the data necessary for the development of simplification methods for French. Parallel sentences may also be helpful for other NLP applications. Other directions for future work are concerned with the exploitation of other features for the alignment of sentences, such as the use of word embeddings to smooth lexical variation or the exploitation of external knowledge. Besides, our approach will be further evaluated on data from other languages.