Translating Electronic Health Record Notes from English to Spanish: A Preliminary Study

The Centers for Medicare & Medicaid Services Incentive Programs promote meaningful use of electronic health records (EHRs), which, among many benefits, allow patients to receive electronic copies of their EHRs and thereby empower them to take a more active role in their health. In the United States, however, 17% of the population is Hispanic, of whom 50% have limited English language skills. To help this population take advantage of their EHRs, we are developing English-Spanish machine translation (MT) systems for EHRs. In this study, we first built an English-Spanish parallel corpus and trained NoteAid Spanish, a statistical MT (SMT) system. Google Translate and Microsoft Bing Translator served as two baseline MT systems. In addition, we evaluated hybrid MT systems that first replace medical jargon in EHR notes with lay terms and then translate the notes with SMT systems. Evaluated on a small set of EHR notes, Google Translate outperformed NoteAid Spanish. The hybrid SMT systems first map medical jargon to lay language, and this step improved the translation. A fully implemented hybrid MT system is


Introduction
The Centers for Medicare & Medicaid Services Incentive Programs promote meaningful use of electronic health records (EHRs), which, among many benefits, allow patients to receive electronic copies of their health records and thereby empower them to take a more active role in their health. EHRs present a new and personalized communication channel that has the potential to increase patient involvement in care and improve communication between physicians and patients and their caregivers. In particular, allowing patients access to their physicians' notes has the potential to enhance patients' understanding of their conditions and disease and improve medication adherence and self-managed care.
However, most EHRs are written in English. In the United States, 17% of the population is Hispanic, of whom 50% have limited English language skills. Many general-purpose MT systems are available. For example, Google Translate is a free service that has been used by health professionals. Like most general-purpose MT systems, it is based on SMT, looking for patterns in hundreds of millions of WWW documents. In contrast, EHRs contain medical terms, shortened forms, complex disease and medication names, and other domain-specific jargon that do not typically appear in WWW documents, and therefore Google Translate may not perform well for EHRs, as was found in a prior study that evaluated general-purpose MT systems (Zeng-Treitler et al., 2010). Furthermore, the Health Insurance Portability and Accountability Act of 1996 protects the privacy and security of individually identifiable health information, so a secure MT system may be needed for US hospitals.
Therefore we are developing an EHR domain-specific English-Spanish MT system called NoteAid Spanish, which may help over 37 million Spanish-speaking US residents to meaningfully use their EHRs.

Background
MT has been an active research field for the past 60-70 years. Early systems mainly applied bilingual dictionaries and manually crafted rules. However, since the 1990s, research has turned to SMT (Brown et al., 1990). The best SMT systems are built from translation patterns that are learned automatically from parallel, human-translated text corpora (Koehn, 2010). Translation patterns include phrase translations that translate input text by translating sequences of words at a time (Koehn et al., 2003; Och, 2002), re-ordering tendencies allowing swapping of words or phrases (Tillmann, 2004), hierarchical phrase translations with variables (Chiang, 2007), and syntax-based transformations (Galley et al., 2004). Automatic learning enables systems to imitate human translation behavior and adapt to particular domains. The bulk of current MT research is tested on domains such as news and politics. The BLEU score (Papineni et al., 2002) is a standard evaluation metric for MT. It measures n-gram overlap with human translations and has been shown to correlate with human judgment.
Comparatively few MT systems have been developed in the medical domain. Early work focused on knowledge-based approaches for phrase translation (Eck et al., 2004; Humphrey et al., 1998; Liu et al., 2006; Merabti et al., 2011). Several research groups built parallel corpora and then trained SMT systems (Wu et al., 2011; Yepes et al., 2013). The Shared Task of Medical Translation provided both parallel aligned and monolingual corpora (Bojar et al., 2014). Eight teams participated in the shared task, and most of the systems were based on the Moses phrase-based toolkit with in-domain and out-of-domain language models.
Zeng-Treitler et al. (2010) used a general-purpose MT tool called Babel Fish to translate 213 EHR note sentences from English into Spanish, Chinese, Russian, and Korean and then evaluated the comprehensibility and accuracy of the translations. They found that the majority of the translations were incomprehensible and/or incorrect.
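To make the n-gram overlap idea concrete, the following is a minimal sketch of a sentence-level, single-reference, case-insensitive BLEU computation. It is an illustration only and not the scorer used in this study: the function names `ngrams` and `bleu` are assumptions introduced here, no smoothing is applied, and standard BLEU tools aggregate counts over a whole corpus rather than scoring one sentence at a time.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference (illustrative, unsmoothed)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    print(bleu("the patient is stable now", "the patient is stable today"))
```

The score is the geometric mean of the clipped n-gram precisions (n = 1 to 4) multiplied by the brevity penalty, so an identical candidate scores 1.0 and a candidate sharing no words with the reference scores 0.0.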

Methods
We first built a domain-specific English-Spanish parallel aligned corpus and then developed and evaluated SMT and hybrid machine translation (HMT) systems for translating EHR notes from English to Spanish. This study was approved by the Institutional Review Board of the University of Massachusetts Medical School. All EHR notes were de-identified.

The EHR Corpus (ESPAC EHR)
The UMass Amherst Translation Center translated three de-identified EHR notes (108 sentences, 13.4 word tokens per sentence, and a total of 1,445 words) from English to Spanish.

Phrase-Based SMT
Using ESPAC MedlinePlus, we trained an initial phrase-based Moses (Koehn et al., 2007) system. The training aligns the words in sentence pairs and extracts phrase pairs consistent with those alignments. We set the maximum phrase pair length to 7 words. We trained a 3-gram language model on the Spanish side using SRILM (Stolcke, 2002; Stolcke et al., 2011). We first used the default feature weights in Moses, then adjusted these feature weights using MERT (Och, 2003).
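The phrase extraction step can be sketched as follows. This is a simplified illustration of extracting phrase pairs that are consistent with a word alignment, not the Moses implementation: the function name `extract_phrase_pairs`, the toy sentence pair, and its alignment are assumptions made for this example, and the usual extension of phrases over unaligned boundary words is omitted.

```python
def extract_phrase_pairs(src, tgt, alignment, max_len=7):
    """Return phrase pairs consistent with a word alignment.

    src, tgt: lists of tokens.
    alignment: set of (source_index, target_index) links.
    max_len: maximum phrase length on either side.
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_pos = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_pos:
                continue
            j1, j2 = min(tgt_pos), max(tgt_pos)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no word in the target span may be linked
            # to a source word outside the source span.
            if all(i1 <= i <= i2 for (i, j) in alignment if j1 <= j <= j2):
                pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


if __name__ == "__main__":
    # Toy example (hypothetical, not from the study corpus).
    src = "dry skin".split()
    tgt = "piel seca".split()
    alignment = {(0, 1), (1, 0)}  # dry-seca, skin-piel
    for pair in sorted(extract_phrase_pairs(src, tgt, alignment)):
        print(pair)
```

With `max_len=7`, as in the setting described above, spans longer than seven words on either side are never extracted, which keeps the phrase table manageable.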

HMT Systems
EHR notes contain medical jargon that differs significantly from the consumer-oriented medical corpora most MT systems are trained on. We therefore speculate that if we replace medical jargon with lay terms and then feed the transformed EHR note to an SMT system, we may improve the MT performance. In our HMT system, we first applied MetaMap (Aronson, 2001) to map free text to UMLS concepts. For those mapped concepts that are clinically relevant, we replace the medical jargon with lay terms. A concept is clinically relevant if it belongs to one of the 18 UMLS semantic types, as described in the NoteAid system (Ramesh et al., 2013). A term is a lay term if it appears in the Consumer Health Vocabulary of the UMLS or in MedlinePlus. We also identify abbreviations and replace them with their expanded full terms. The second component of the HMT systems is an SMT system. We explored two state-of-the-art SMT systems, Google Translate and Microsoft Bing Translator, resulting in two HMT systems, NoteAid-Google Spanish and NoteAid-Bing Spanish.
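The pre-translation simplification step can be illustrated with a minimal sketch. It does not call MetaMap or the UMLS: the two small dictionaries below are hypothetical stand-ins for the jargon-to-lay-term and abbreviation mappings, and the function name `simplify` is an assumption introduced for this example. The simplified text would then be passed to whichever SMT system forms the second component.

```python
import re

# Hypothetical toy lexicons (assumptions for illustration only). In the
# system described above, such mappings come from MetaMap/UMLS concepts,
# the Consumer Health Vocabulary, and MedlinePlus.
ABBREVIATIONS = {"b/l": "bilateral", "sob": "shortness of breath"}
LAY_TERMS = {
    "myocardial infarction": "heart attack",
    "hypertension": "high blood pressure",
    "edema": "swelling",
}


def simplify(text, abbreviations=ABBREVIATIONS, lay_terms=LAY_TERMS):
    """Expand abbreviations, then replace jargon with lay terms."""
    for lexicon in (abbreviations, lay_terms):
        # Match longer terms first so multi-word terms are not split.
        for term in sorted(lexicon, key=len, reverse=True):
            # Require non-word characters on both sides of the match.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            replacement = lexicon[term]
            text = re.sub(pattern, lambda m, r=replacement: r, text,
                          flags=re.IGNORECASE)
    return text


if __name__ == "__main__":
    print(simplify("History of hypertension and b/l edema."))
```

Plain string matching of this kind ignores context and word sense, which is one reason a concept mapper such as MetaMap is used in the actual pipeline.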

Baseline MT Systems
The baseline systems are the state-of-the-art general-purpose Google and Bing MT systems, into which EHR notes are fed directly without any medical jargon replacement.

Evaluation Metrics and Procedure
All the MT systems were evaluated by single-reference, case-insensitive BLEU score using the Moses package. We also asked a bilingual domain expert to manually evaluate the five MT system outputs of the three EHR notes.

Automatic Evaluation
The BLEU scores of NoteAid-Moses Spanish on the tuning and testing medical parallel data were 41.8 (1.097) and 41.2 (1.104) before MERT and 50.4 (0.99) and 49.8 (0.99) after MERT. The BLEU score of Google Translate was 49.9 (0.99).

Evaluation by a Domain Expert
A bilingual human expert performed a blind review of the outputs of all five MT systems on the three EHR notes (a total of 15 Spanish outputs). He ranked all five MT systems. In addition, he marked up the errors by each MT system.
The expert judged that each MT system had a few translation omissions. For example, "symptomatically," was omitted by all the MT systems. Of the three EHR notes, Google Translate performed the best for two. NoteAid-Google Spanish and NoteAid-Bing Spanish were second on three. Bing Translator was the best for one. NoteAid-Moses Spanish was the last.
The expert also performed a blind comparison of Google Translate versus NoteAid-Google Spanish. He found that the hybrid system simplified the medical jargon and translated well. However, it introduced inconsistencies a few times. Therefore, Google Translate was rated slightly better on two out of the three EHR notes.

Discussion
There are a number of challenges for translating EHR notes from English to Spanish. Spanish translation frequently increases token length. In addition, rhetoric styles differ, which can considerably affect text length in cases where the medical note is more of a narrative than a sequence of facts and isolated sentences (Valero-Garces, 1996). Finally, it is expensive to create English-Spanish parallel aligned EHR corpora.
Both NoteAid-Moses Spanish and Google Translate achieved competitive performance on ESPAC MedlinePlus. Several factors could have contributed to the excellent MT performance. Since 25% of our data is redundant, the system memorized those sentences during training. This, combined with the fact that the total percentage of unknown words and sentences was small (~16%), may have contributed to the good results. In addition, we found that 37% of the sentences in the tuning and testing sets had fewer than seven words, and about half of those sentences overlapped with the training set. These sentences were memorized as phrases during training, although their contribution to the overall performance was less significant than that of longer sentences. Finally, translating sentences with one word is easier than translating sentences with multiple words because one-word sentences do not have a reordering problem, which is one of the challenges in MT.
The evaluation of MT systems on EHR notes (Table 2) showed much reduced performance. The results are not surprising since 17.9% of the terms in EHR notes do not appear in MedlinePlus.
In addition, all HMT systems performed worse than their SMT counterparts. The lower performance of HMT systems can be attributed to the lack of gold standards that exactly match the source text of hybrid systems. The gold standard consists of original English notes translated to Spanish by human translators. The HMT systems, however, modify the original notes by replacing the medical jargon with lay terms and then translate the notes to Spanish. Since the BLEU score calculates what percentage of the n-grams or phrases from the translations also appear in the gold standard, and the HMT systems modify the original text before translation, they are expected to yield lower scores.
We also found that sentences in EHR notes were not always grammatically well formed. When humans translated the text, they inferred the context from the note and formed coherent and logical sentences by inserting the missing verb or conjunction, whereas the translation systems translated the original ill-formed sentences into Spanish word for word. This resulted in lower BLEU scores for the MT systems.
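The effect described above can be illustrated with a small sketch of clipped n-gram precision, the quantity underlying BLEU. The three Spanish sentences are invented for illustration and are not outputs or references from this study, and the function name `ngram_precision` is an assumption. A hybrid-style output that uses a lay paraphrase is fluent and arguably correct, yet it shares fewer n-grams with a reference that keeps the original term.

```python
from collections import Counter


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of a candidate against one reference."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    cand_counts = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs
    # in the reference.
    matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return matched / total


if __name__ == "__main__":
    # Invented sentences (hypothetical, for illustration only).
    reference = "el paciente tiene hipertensión"
    direct = "el paciente tiene hipertensión"
    hybrid = "el paciente tiene presión arterial alta"
    for name, output in (("direct", direct), ("hybrid", hybrid)):
        p1 = ngram_precision(output, reference, 1)
        p2 = ngram_precision(output, reference, 2)
        print(name, p1, p2)
```

In this toy case the direct output matches every unigram and bigram of the reference, while the hybrid-style output matches only half of its unigrams and two of its five bigrams, even though both convey the same meaning.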
Our manual analyses show that the baseline and the HMT systems perform well and make very few mistakes on EHR notes. The mistakes include:
• Translation omission when they encounter typos in the source language. For example, the MT systems failed to translate typos like "possily" and "phychological."
• Failure to take context into consideration when translating the text. For example, in "we are redrawing blood cultures," the MT systems failed to recognize that "redraw" refers to removing blood cultures, and translated it as "redibujando" or "rediseñando," meaning redrawing or redesigning something.
• Incorrect grammatical gender assignment although the translation is correct. For example, "Skin: Warm and dry" is translated as "Piel: Cálido y seco" ignoring the fact that the grammatical gender context of "Piel"/Skin is feminine.
• Errors in verb conjugation. For example, "to drain" is translated as "para drenar" instead of "á drenar."
We select and describe three examples of errors by MT systems, as shown below.
In the example below, all five MT systems fail to accurately translate the sentence and change the meaning when translated back to English. We also observed that human translators often translate the text using different words while maintaining the semantic sense of the sentence. In this example, NoteAid-Moses Spanish conserves only some of the source text's context and format but omits translation of several words, including medical jargon. NoteAid-Bing Spanish omits only one word, and the remaining MT systems do not omit any word. Google Translate and both hybrid systems make a grammatical mistake by assigning incorrect gender to the patient in Spanish.
Source: ASSESSMENT AND PLAN: The patient was scheduled for a kidney biopsy today, but she was informed by the Renal Transplant Service that they were going to delay this since there was some improvement in her creatinine (today's creatinine is not yet available).
NoteAid-Google Spanish: Pulmones: bilateral: reducción de sonidos pulmonares espiratorio sibilancias presentes ( en el lóbulo superior , en el lóbulo inferior) , en o cerca de la línea de base la visión ? , No hay rhonchis presentes , . Piel: lesión . cambios b / l venostasis en distal rastro tibias anterior tibial Edema
NoteAid-Bing Spanish: Pulmones: bilateral: reducido sibilancias espiratorio de sonidos pulmonares presentes (en el lóbulo superior, en el lóbulo inferior),, en o cerca de base de la visión? , no rhonchis presente,. Piel: lesión. cambios b/l lindo tibias anteriores distales rastrear el Edema tibial
NoteAid-Moses Spanish performed poorly on the EHR notes, suggesting that the system needs to be trained on bigger data sets, or be trained directly on the EHR notes. We found that some errors by NoteAid-Google Spanish were due to engineering errors, which can be fixed.

Limitations, Conclusion and Future Work
This pilot study has limitations. The SMT system was built on the limited MedlinePlus data. We plan to incorporate other biomedical corpora (e.g., Medline and ClinicalTrials.gov). The corpus of EHR notes used for evaluation is small, and we plan to build a larger one.
The BLEU score does not measure whether the semantic content is correctly translated. In future work we may explore other domain-specific evaluation metrics (Castilla et al., 2005).
In this application, we have experimented with simple MT approaches. In the future we may explore other MT approaches, including incorporating biomedical knowledge resources (e.g., the UMLS), domain adaptation, semantic role labeling, and abstract meaning representation.