French Biomedical Text Simplification: When Small and Precise Helps

We present experiments on biomedical text simplification in French. We use two kinds of corpora – parallel sentences extracted from existing comparable health corpora in French, and the WikiLarge corpus translated from English into French – and a lexicon that associates medical terms with layman paraphrases. We then train neural models on these parallel corpora using different ratios of general and specialized sentences, and evaluate the results with BLEU, SARI and Kandel scores. The results indicate that even a small amount of specialized data significantly helps the simplification.


Introduction
The goal of text simplification is to make complex texts accessible to a given target audience (children, non-native speakers, patients...) or easier to process for NLP applications. There is little existing work on biomedical text simplification in English (Peng et al., 2012; Shardlow and Nawaz, 2019), or on text simplification for the general language in French (Abdul Rauf et al., 2020; Gala et al., 2020; Sauvan et al., 2020; Brouwers et al., 2014).
Research on text simplification is usually performed on open-domain English. Currently used methods rely on deep learning and require large parallel monolingual corpora in which each complex sentence is paired with one or more simplified versions (Nisioi et al., 2017; Cooper and Shardlow, 2020). Two English datasets are commonly used: (1) Newsela (Xu et al., 2015), a corpus of news articles manually rewritten according to four levels of simplification, and (2) WikiLarge (Zhang and Lapata, 2017), a compilation of three previously released simplification corpora, all extracted from Wikipedia (Zhu et al., 2010; Woodsend and Lapata, 2011; Kauchak, 2013). The availability of such corpora and resources indeed plays a very important role and conditions the feasibility of NLP research.
We build and test models for biomedical text simplification in French. We first describe the data used (Section 2) and the various configurations of the experiments (Section 3). We then present the evaluation principles (Section 4), and present and discuss the results (Section 5). We also provide the WikiLarge FR corpus and a set of native parallel sentences in French, from both general and biomedical languages.


Linguistic Data
We use two sets of parallel corpora dedicated to simplification. The first is obtained from CLEAR, a freely available simplification corpus in French (Grabar and Cardon, 2018). CLEAR is a comparable corpus containing three types of texts (medical literature reviews, drug information, and medical articles from Wikipedia and Vikidia), from which parallel sentences were extracted. The resulting parallel corpus contains 4,596 sentence pairs, which is not sufficient for the current state-of-the-art methods that rely on deep learning. The second corpus is obtained by automatically translating WikiLarge into French. The translation was done with OpenNMT-py (Klein et al., 2017), using the default parameters and the provided En-Fr model. The resulting WikiLarge FR parallel corpus contains almost 300,000 sentence pairs. Table 1 indicates the volume of data in both corpora. We segmented the CLEAR corpus into train, validation and test sets: 100 examples for testing, three times as many for validation, and the rest for training. WikiLarge FR is already segmented into these three sets, but we reduced its test set from 359 pairs to 100 to make it comparable with the CLEAR test set. As both corpora contain data from Wikipedia, we checked for duplicates to avoid having identical pairs in two different sets, and found none.
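As an illustration of this preparation step, here is a minimal sketch of the split and the duplicate check; the file names and the tab-separated "complex<TAB>simple" pair format are our assumptions, not specified in the paper.

```python
# Minimal sketch (assumed file names and tab-separated pair format):
# split the CLEAR pairs into test/validation/train sets and check for
# duplicates against WikiLarge FR.
import random

with open("clear_pairs.tsv", encoding="utf-8") as f:
    clear_pairs = [tuple(line.rstrip("\n").split("\t")) for line in f]

random.seed(0)
random.shuffle(clear_pairs)

# 100 pairs for testing, three times as many for validation, the rest for training.
test, valid, train = clear_pairs[:100], clear_pairs[100:400], clear_pairs[400:]

with open("wikilarge_fr_pairs.tsv", encoding="utf-8") as f:
    wikilarge_pairs = {tuple(line.rstrip("\n").split("\t")) for line in f}

# Both corpora contain Wikipedia data, so identical pairs must not end up
# in two different sets.
duplicates = set(clear_pairs) & wikilarge_pairs
print(f"{len(duplicates)} duplicate pairs found")
```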

We also use a lexicon that proposes layman paraphrases for technical medical terms, such as {hypotension; baisse de la tension artérielle} ({hypotension; decrease in arterial pressure}). This lexicon has been built using medical terminologies (Lindberg et al., 1993) and French corpora (the CLEAR corpus and various discussion fora). The lexicon currently contains 7,580 paraphrases for 4,516 medical terms.

Experimental Protocol
The experimental protocol focuses on two aspects:
1. Impact of the general- and medical-language corpora, for which we use different ratios of the WikiLarge FR and CLEAR sets. Since we aim at the simplification of biomedical texts and since the CLEAR corpus is not large, we always use all of the CLEAR training and validation sets.
2. Impact of the lexicon, for which we perform the same experiments while feeding them the lexicon. The lexicon is exploited in two ways: (1) during the simplification phase, through the OpenNMT-py --phrase_table flag, which is usually used for dealing with unknown words; this flag was used in the same way in work on the simplification of clinical letters in English (Shardlow and Nawaz, 2019); and (2) during the training phase, for which we add the entire lexicon to the training set (see the sketch after the next paragraph).
We refer to the first set of experiments as NPT (no phrase table), to the second as PTS (phrase table used in the simplification phase), and to the third as PTT (phrase table added to the training set). Across each series of experiments, the validation and test sets are the same, while the training sets differ in their ratios and in whether and how the lexicon is used. We use OpenNMT-py for sentence simplification with the following configuration: two bidirectional LSTM layers of 500 units for the encoder and the decoder, ADAM optimizer, learning rate of 0.001, dropout probability of 0.3, and attention dropout probability of 0.2. Each individual training took about five hours on a GeForce RTX 2070 GPU. During the simplification phase, we use the --replace_unk flag, which tells the program to copy unknown words from the input to the output, except in the PTS set of experiments, where it is replaced by the --phrase_table flag that uses the lexicon.
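As announced above, here is a minimal sketch of how the lexicon could be prepared for the PTS and PTT settings. The file names, the toy lexicon entry, and the "source|||target" one-pair-per-line phrase-table format are our assumptions about OpenNMT-py's expected input, not details taken from the paper.

```python
# Sketch: exploiting the lexicon in the two ways described above.
# Toy excerpt of the lexicon: technical term -> layman paraphrase.
lexicon = {
    "hypotension": "baisse de la tension artérielle",
}

# (1) PTS: write a phrase table to pass via --phrase_table during simplification,
# so that unknown source tokens found in the table are replaced by their paraphrase.
# The "source|||target" line format is an assumption.
with open("lexicon.phrase_table", "w", encoding="utf-8") as f:
    for term, paraphrase in lexicon.items():
        f.write(f"{term}|||{paraphrase}\n")

# (2) PTT: append every lexicon entry to the training set as an extra
# (complex, simple) sentence pair.
with open("train.complex", "a", encoding="utf-8") as src, \
     open("train.simple", "a", encoding="utf-8") as tgt:
    for term, paraphrase in lexicon.items():
        src.write(term + "\n")
        tgt.write(paraphrase + "\n")
```

Note that the phrase-table mechanism operates at the token level, replacing individual unknown words, which matters for the qualitative analysis below.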
We report three baselines: (1) a model trained on CLEAR only, (2) a model trained on WikiLarge FR only, and (3) the identity baseline, where the output is a copy of the input.

Evaluation
We evaluate the models using several metrics: (1) BLEU (Papineni et al., 2002), initially designed for the evaluation of machine translation, is also used in text simplification, which can be seen as a monolingual translation task. It compares the system output with the reference data. This metric gives a rough indication of the performance of a system, especially regarding grammaticality and meaning preservation, but it is not a strong indicator of simplification (Sulem et al., 2018); (2) SARI (Xu et al., 2016) is currently the most common metric for text simplification. SARI is computed by comparing the system output both against the reference and against the input. It should be noted that SARI is more reliable when several references are available (Alva-Manchego et al., 2020; Zhang and Lapata, 2017), which is not the case in our experiments; (3) Kandel (Kandel and Moles, 1958) is a readability metric. It does not compare the output with the reference or the input, and only relies on formal indicators such as sentence length and the number of syllables per word. It is an adaptation to French of the Flesch readability measure (Flesch, 1948), which was designed for English. The absolute indexes are not informative by themselves: the measure is more relevant for comparisons. Higher scores mean that the text should be easier to read.
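To make this concrete, below is a minimal sketch of a Kandel-style readability index. It assumes the commonly cited French adaptation of Flesch, RE = 207 - 1.015 x (words per sentence) - 73.6 x (syllables per word); both the coefficient values and the crude vowel-group syllable counter are our assumptions, not taken from the paper.

```python
import re

def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive French vowels
    (including accented ones). A real implementation would be more careful."""
    return max(1, len(re.findall(r"[aeiouyàâäéèêëîïôöùûü]+", word.lower())))

def kandel_moles(sentences: list[list[str]]) -> float:
    """Readability index over tokenized sentences; higher means easier to read.
    Coefficients follow the commonly cited French adaptation of Flesch."""
    words = [w for sent in sentences for w in sent]
    asl = len(words) / len(sentences)                           # words per sentence
    asw = sum(count_syllables(w) for w in words) / len(words)   # syllables per word
    return 207 - 1.015 * asl - 73.6 * asw

print(kandel_moles([["une", "hypotension", "artérielle", "peut", "être", "observée"]]))
```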
We computed the first two metrics with the EASSE evaluation suite for automatic text simplification (Alva-Manchego et al., 2019).
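For reference, a minimal scoring sketch using EASSE's documented Python API (easse.sari.corpus_sari and easse.bleu.corpus_bleu); the toy sentences are ours, not the paper's data.

```python
# Assumes the easse package is installed (pip install easse).
from easse.sari import corpus_sari
from easse.bleu import corpus_bleu

orig_sents = ["une hypotension artérielle peut être observée ."]  # complex inputs
sys_sents = ["une baisse de la tension artérielle peut être observée ."]  # outputs
refs_sents = [["une hypotension artérielle peut être observée ."]]  # one reference list

print("SARI:", corpus_sari(orig_sents=orig_sents, sys_sents=sys_sents, refs_sents=refs_sents))
print("BLEU:", corpus_bleu(sys_sents=sys_sents, refs_sents=refs_sents))
```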

Quantitative and Qualitative Results
Table 2 indicates the SARI, BLEU and Kandel scores when testing the models on the WikiLarge FR and CLEAR test sets. The first three rows show the baseline results. According to the Kandel scores, medical sentences from CLEAR are indeed less readable than sentences from Wikipedia. We can also see that the model trained on CLEAR only performs very poorly on the general language (<1 BLEU and 20.52 SARI). Its performance is also poor on the medical language, but its BLEU is much higher there than on the general language (21.59 vs 0.15). Symmetrically, the model trained on WikiLarge FR only performs quite well on WikiLarge FR (39.08 BLEU) but poorly on CLEAR (9.72 BLEU), which indicates that models trained on one kind of language hardly transfer to the other. For the sets of experiments that include the lexicon (PTS and PTT), the improvement is substantial. Finally, as the Kandel index favours models that output short sentences regardless of their contents, we will simply observe that no model worsens the Kandel readability score of the original texts: no model produces an output with a Kandel index lower than that of the identity baseline.
Table 3 shows some simplification examples for WikiLarge FR, and Table 4 shows some simplification examples for CLEAR; an excerpt from Table 4 for the lexicon-based models is reproduced below:
PTS 1:75: une tension inférieure à la normale artérielle peut être observée en cas d'administration intraveineuse trop rapide, inférieure à 60 minutes (lower than normal blood pressure may be observed if intravenous administration is too rapid, less than 60 minutes)
PTT 1:50: une hypotension artérielle peut être observée en cas d'administration intraveineuse trop rapide, inférieure à 60 minutes (voir rubrique « 3) (arterial hypotension may be observed if intravenous administration is too rapid, less than 60 minutes (see section « 3))
PTT 1:75: une diminution de la tension artérielle peut être observée en cas d'administration intraveineuse trop rapide, inférieure à 60 minutes (a decrease in blood pressure may be observed if intravenous administration is too rapid, less than 60 minutes)
In Table 3, the sentence Le 14 octobre 1960, le candidat à la présidence John F. Kennedy a proposé le concept de ce qui est devenu le Peace Corps sur les marches de l'Union du Michigan. (On October 14, 1960, presidential candidate John F. Kennedy proposed the concept of what became the Peace Corps on the Union Steps of Michigan.) is processed. The baseline models provide either no changes (WikiLarge FR) or changes that are not meaningful (CLEAR). Indeed, the CLEAR baseline model applied to the WikiLarge FR test example produces a grammatical sentence, but one that has no semantic relation to the input (cancer is medicine). This draws attention to the fact that simplification is very sensitive to the training data and requires caution: the CLEAR baseline model only outputs words related to the medical domain, regardless of the input. Hence, real improvement in quality is obtained with the other models. Indeed, except for the CLEAR baseline, the other examples for WikiLarge FR correspond to state-of-the-art transformations.
As for CLEAR (Table 4), we illustrate the modifications on a sentence from drug information released by the French Ministry of Health: une hypotension artérielle peut être observée en cas d'administration intraveineuse trop rapide, inférieure à 60 minutes (voir rubrique 4.2). (arterial hypotension may be observed if intravenous administration is too rapid, less than 60 minutes (see section 4.2).). The source document is aimed at physicians, whereas the reference is aimed at patients and can be found in drug boxes. The only changes in the reference are the truncation of "minutes" and the deletion of the mention of another section. The NPT outputs are close to the reference except for the truncation, while PTS and PTT are more creative: PTS 1:75 transforms hypotension into lower than normal blood pressure, while PTT 1:75 transforms hypotension into decrease in blood pressure. Lexically, these are correct transformations and an improvement over the reference. Yet, by mechanically replacing hypotension, PTS creates an ungrammatical sentence in French.

Conclusion
We addressed biomedical text simplification in French, which is, to our knowledge, the first attempt to perform this task with a machine translation technique. To cope with the lack of French data for simplification, we translated a freely available English corpus into French. We make the translation available, as well as the parallel data from CLEAR 1 . Using these data, we achieved improvements over the baselines for French biomedical text simplification. Indeed, the baselines produce incorrect and imperfect simplifications, while the results are significantly improved with larger datasets.
We proposed several experiments indicating that (1) an automatically translated resource can help in a low-resource setting, (2) a small amount of good-quality specialized data can significantly improve overall performance, and (3) multiword units can be handled by adding a lexicon to the training set. That is to say, mixing data of different linguistic natures helps simplification. Overall, we obtain interesting simplification results, which often prove to be more creative than the reference and yet correct. In future work, the impact of large native data for simplification should also be studied.