Adapting Phrase-based Machine Translation to Normalise Medical Terms in Social Media Messages

Previous studies have shown that health reports in social media, such as DailyStrength and Twitter, have potential for monitoring health conditions (e.g. adverse drug reactions, infectious diseases) in particular communities. However, in order for a machine to understand and make inferences on these health conditions, the ability to recognise when laymen's terms refer to a particular medical concept (i.e. text normalisation) is required. To achieve this, we propose to adapt an existing phrase-based machine translation (MT) technique and a vector representation of words to map between a social media phrase and a medical concept. We evaluate our proposed approach using a collection of phrases from tweets related to adverse drug reactions. Our experimental results show that the combination of a phrase-based MT technique and the similarity between word vector representations outperforms baselines that apply only one of them by up to 55%.


Introduction
Social media, such as DailyStrength and Twitter, is a fast-growing and potentially rich source of voice-of-the-patient data about experiences with the benefits and side-effects of drugs and treatments (O'Connor et al., 2014). However, natural language understanding of social media messages is a difficult task because of the lexical and grammatical variability of the language (Baldwin et al., 2013; O'Connor et al., 2014).
Indeed, language understanding by machines requires the ability to recognise when a phrase refers to a particular concept. Given a variable-length phrase, an effective system should return the concept with the most similar meaning.
For example, the Twitter phrase 'No way I'm gettin any sleep 2nite' might be mapped to the medical concept 'Insomnia' (SNOMED:193462001) when using the SNOMED-CT dictionary (Spackman et al., 1997). Successful mapping between social media phrases and formal medical concepts would enable automatic integration of patient experiences with biomedical databases.
Existing work, e.g. (Elkin et al., 2012; Gobbel et al., 2014; Wang et al., 2009), has mostly focused on extracting medical concepts from medical documents. For example, Gobbel et al. (2014) proposed a naive Bayes-based technique to map phrases from clinical notes to medical concepts in the SNOMED-CT dictionary. Wang et al. (2009) identified medical concepts relating to adverse drug events in electronic medical records. On the other hand, O'Connor et al. (2014) investigated the normalisation of medical terms in Twitter messages. In particular, when mapping between Twitter phrases and medical concepts, they used the Lucene retrieval engine to retrieve medical concepts that could potentially be mapped to a given Twitter phrase.
In contrast, we argue that the medical text normalisation task (Limsopatham and Collier, 2015) can be tackled using well-established phrase-based MT techniques, where we translate a text written in a social media language (e.g. 'No way I'm gettin any sleep 2nite') into a text written in a formal medical language (e.g. 'Insomnia'). Indeed, in this work we investigate an effective adaptation of phrase-based MT to map a Twitter phrase to a medical concept. Moreover, we propose to combine the adapted phrase-based MT technique with the similarity between word vector representations to map a Twitter phrase to a medical concept effectively.
The main contributions of this paper are threefold: (1) we investigate the adaptation of phrase-based MT to map a Twitter phrase to a SNOMED-CT concept; (2) we propose to combine our adaptation of phrase-based MT with the similarity between word vector representations to map Twitter phrases to formal medical concepts; (3) we thoroughly evaluate the proposed approach using phrases from our collection of tweets related to the topic of adverse drug reactions (ADRs).

Related Work
Phrase-based MT models (e.g. (Koehn et al., 2003; Och and Ney, 2004)) have been shown to be effective for translation between languages, as they learn local term dependencies, such as collocations, re-orderings, insertions and deletions. Koehn et al. (2003) showed that a phrase-based MT technique markedly outperformed traditional word-based MT techniques on several benchmarks. In this work, we adapt the phrase-based MT technique of Koehn et al. (2003) to the medical text normalisation task. In particular, we use the phrase-based MT technique to translate phrases from the Twitter language into the formal medical language, before mapping the translated phrases to medical concepts based on the ranked similarity of their word vector representations.
Traditional approaches for creating word vector representations treated words as atomic units (Mikolov et al., 2013b; Turian et al., 2010). For instance, the one-hot representation uses a vector whose length is the size of the vocabulary, with a single dimension set to one, to represent a particular word (Turian et al., 2010). Recently, techniques for learning high-quality word vector representations (i.e. distributed word representations) that can capture the semantic similarity between words, such as continuous bag of words (CBOW) (Mikolov et al., 2013b) and global vectors (GloVe) (Pennington et al., 2014), have been proposed. Indeed, these distributed word representations have been applied effectively in systems that achieve state-of-the-art performance on several NLP tasks, such as MT (Mikolov et al., 2013a) and named entity recognition (Passos et al., 2014). In this work, besides using word vector representations to measure the similarity between translated Twitter phrases and medical concepts, we use the similarity between word vector representations of the original Twitter phrase and a medical concept to augment the adapted phrase-based MT technique.

Medical Term Normalisation
We discuss our adaptation of phrase-based MT for medical text normalisation in Section 3.1. Section 3.2 introduces our proposed approach for combining the similarity score of word vector representations with the adapted phrase-based MT technique.

Adapting Phrase-based MT
We aim to learn a translation between a Twitter phrase (i.e. a phrase from a Twitter message) and a formal medical phrase (i.e. the description of a medical concept). For a given Twitter phrase phr_t, we find a suitable medical phrase phr_m using a translation score, based on a phrase-based model, as follows:

phr_m = arg max_{phr_m} p(phr_m | phr_t)    (1)

where p(phr_m | phr_t) can be calculated using any phrase-based MT technique, e.g. (Koehn et al., 2003; Och and Ney, 2004).
We then rank the translated phrases phr_m based on this translation score. The top-k translated phrases are used to identify the corresponding medical concept. However, the translated phrase phr_m may not exactly match the description of any target medical concept. We propose two techniques to deal with this problem. For the first technique, we rank the target concepts based on the cosine similarity between the vector representation of phr_m and the vector representation of the description of each concept desc_c:

similarity(phr_m, desc_c) = cos(V_{phr_m}, V_{desc_c}) = (V_{phr_m} · V_{desc_c}) / (||V_{phr_m}|| ||V_{desc_c}||)    (2)

where V_{phr_m} and V_{desc_c} are the vector representations of phr_m and desc_c, respectively. Any technique for creating word vector representations (e.g. one-hot, CBOW or GloVe) can be used. Note that if a phrase (e.g. phr_m) contains several terms, we create its vector representation by summing the values of the same dimension of the vector representation of each term (i.e. element-wise addition).
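This cosine-similarity scoring over element-wise summed word vectors can be sketched in a few lines of Python. The word vectors below are illustrative toy values only, not learned embeddings from the paper:

```python
import numpy as np

def phrase_vector(phrase, word_vectors, dim):
    """Element-wise sum of the word vectors of the terms in a phrase."""
    vec = np.zeros(dim)
    for term in phrase.lower().split():
        vec += word_vectors.get(term, np.zeros(dim))
    return vec

def cosine(u, v):
    """Cosine similarity; 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

# Toy 3-dimensional vectors (illustrative values only).
word_vectors = {
    "insomnia": np.array([0.9, 0.1, 0.0]),
    "sleep":    np.array([0.8, 0.2, 0.1]),
    "disorder": np.array([0.1, 0.9, 0.0]),
}

# Similarity between a translated phrase and a concept description.
sim = cosine(phrase_vector("sleep disorder", word_vectors, 3),
             phrase_vector("insomnia", word_vectors, 3))
```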
On the other hand, the second technique also incorporates the rank position r at which the translated phrase phr_m was produced from the original phrase phr_t using Equation (1). Indeed, the second technique calculates the similarity score as follows:

similarity_rank(phr_m, desc_c) = (1/r) · cos(V_{phr_m}, V_{desc_c})    (3)
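A minimal sketch of this second technique, assuming the rank discount takes the form 1/r (the function names and toy vectors are ours, for illustration):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def rank_discounted_score(translation_vectors, desc_vector):
    """Score a concept description against the top-k translated phrases,
    discounting each cosine similarity by the translation's rank r."""
    return max(cosine(v, desc_vector) / r
               for r, v in enumerate(translation_vectors, start=1))

# The rank-1 translation is orthogonal to the description; rank-2 matches it,
# so the best discounted score is 1.0 / 2.
translations = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
score = rank_discounted_score(translations, np.array([0.0, 1.0]))
```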

Combining Similarity Score with Phrase-based MT
As discussed in Section 2, word vector representations (e.g. created by CBOW or GloVe) can by themselves capture semantic similarity between words. Hence, we propose to map a Twitter phrase phr_t to a medical concept c, which is represented by a description desc_c, by linearly combining the cosine similarity between the vector representations of the Twitter phrase phr_t and the description desc_c with the similarity score computed using one of the adapted phrase-based MT techniques (introduced in Section 3.1), as follows:

score(phr_t, desc_c) = MT_a(phr_t, desc_c) + cos(V_{phr_t}, V_{desc_c})    (4)

where MT_a(phr_t, desc_c) is calculated using one of the adapted phrase-based MT techniques described in Section 3.1.
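The combination can be sketched as follows, assuming an unweighted sum as the linear combination; `mt_score` stands in for the similarity score produced by one of the adapted phrase-based MT techniques:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity; 0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0

def combined_score(mt_score, phrase_vec, desc_vec):
    """Add the adapted phrase-based MT score to the cosine similarity
    between the original Twitter phrase and the concept description."""
    return mt_score + cosine(phrase_vec, desc_vec)

# An MT score of 0.4 plus a perfect cosine match of 1.0 gives 1.4.
score = combined_score(0.4, np.array([1.0, 0.0]), np.array([1.0, 0.0]))
```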

Test Collection
To evaluate our approach, we use a collection of 25 million tweets related to adverse drug reactions (ADRs). In particular, these tweets are related to cognitive enhancers (Hanson et al., 2013) and anti-depressants (Schneeweiss et al., 2010) that can have adverse side-effects. We use 201 ADR phrases and their corresponding SNOMED-CT concepts, annotated by a PhD-level computational linguist. These phrases were anonymised by replacing numbers, user IDs, URIs, locations, email addresses, dates and drug names with appropriate tokens, e.g. NUMBER.

Evaluation Approach
We conduct experiments using 10-fold cross-validation, where the Twitter phrases are randomly divided into 10 separate folds. We address this task as a ranking task, where we aim to rank the medical concept with the highest similarity score, e.g. calculated using Equation (2), at the top rank. Hence, we evaluate our approach using the Mean Reciprocal Rank (MRR) measure (Craswell, 2009), which is based on the reciprocal of the rank at which the first relevant concept is viewed in the ranking. In addition, we test the significance of differences between the performance achieved by our proposed approach and the baselines using the paired t-test (p < 0.05).
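The MRR measure cut off at rank 5 (MRR-5, as reported in the results) can be computed as follows; the rankings and gold concepts below are toy data, not the paper's:

```python
def reciprocal_rank(ranked_concepts, gold, k=5):
    """1/rank of the first relevant concept within the top k, else 0."""
    for rank, concept in enumerate(ranked_concepts[:k], start=1):
        if concept == gold:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(rankings, golds, k=5):
    """Average reciprocal rank over all test phrases."""
    return sum(reciprocal_rank(r, g, k)
               for r, g in zip(rankings, golds)) / len(golds)

# Two toy phrases: gold concept found at rank 1 and rank 2,
# so MRR-5 = (1.0 + 0.5) / 2 = 0.75.
rankings = [["Insomnia", "Fatigue"], ["Nausea", "Headache"]]
golds = ["Insomnia", "Headache"]
mrr = mean_reciprocal_rank(rankings, golds)
```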

Word Vector Representation
We use three different techniques, namely one-hot, CBOW and GloVe, to create the word vector representations used in our approach (see Section 3). In particular, the vocabulary for creating the one-hot representation includes all terms in the Twitter phrases and the descriptions of the target SNOMED-CT concepts. Meanwhile, we create word vector representations based on CBOW and GloVe using the word2vec and GloVe implementations, respectively. We learn the vector representations from a collection of tweets and a collection of medical articles, respectively, using a window size of 10 words. The tweet collection (denoted Twitter) contains 419,702,147 English tweets, which are related to 11 drug names and 6 cities, while the medical article collection (denoted BMC) includes all medical articles from BioMed Central. For both CBOW and GloVe, we create vector representations with vector sizes of 50 and 200.
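Building a one-hot representation over such a vocabulary can be sketched as below; the toy phrases stand in for the real vocabulary drawn from the Twitter phrases and SNOMED-CT descriptions:

```python
def build_index(phrases):
    """Map each vocabulary term to one dimension of the one-hot space."""
    vocab = sorted({term for phrase in phrases
                    for term in phrase.lower().split()})
    return {term: i for i, term in enumerate(vocab)}

def one_hot(term, index):
    """Vector of vocabulary length with a single dimension set to one."""
    vec = [0] * len(index)
    if term.lower() in index:
        vec[index[term.lower()]] = 1
    return vec

# Toy vocabulary from one Twitter phrase and one concept description.
index = build_index(["no way im gettin any sleep 2nite", "insomnia"])
vec = one_hot("sleep", index)
```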

Learning Phrase-based Model
We use the phrase-based MT technique of Koehn et al. (2003), as implemented in the Moses toolkit (Koehn et al., 2007) with default settings, to learn to translate from the Twitter language into the medical language. In particular, when training the translator, we show the learner pairs of Twitter phrases and the descriptions of the corresponding SNOMED-CT concepts.

Experimental Results
We evaluate 6 different instantiations of the proposed approach discussed in Section 3:
1. bestMT: set k = 1 when finding the translated phrase phr_m for a Twitter phrase phr_t (Equation (1)), before ranking the target medical concepts for the translated phrase phr_m using Equation (2).
2. top5MT: similar to bestMT, but set k = 5.
3. top5MTr: similar to top5MT, but also consider the rank position of the translated phrases when ranking the target medical concepts, using Equation (3).
4. bestMT+vSim: incorporate, with the ranking generated from bestMT, the cosine similarity between the vector representations of the Twitter phrase phr_t and the description desc_c of the target medical concepts, using Equation (4).
5. top5MT+vSim: similar to bestMT+vSim, but use the ranking from top5MT.
6. top5MTr+vSim: similar to bestMT+vSim, but use the ranking from top5MTr.
As a baseline we use vSim, where we consider only the cosine similarity between the vector representations of the Twitter phrase phr_t and the description desc_c of the target medical concepts.
Table 1 compares the performance of these 6 instantiations and the vSim baseline in terms of MRR-5. We first observe that for the vSim baseline, except for the word vector representation with vector size 50 learned using GloVe from the Twitter collection, word vector representations learned using either CBOW or GloVe are more effective than the one-hot representation. However, the differences in MRR-5 performance are not statistically significant (p > 0.05, paired t-test). In addition, word vector representations learned using either CBOW or GloVe with vector size 200 are more effective than those with vector size 50.
Next, we find that our adaptations of phrase-based MT (i.e. bestMT, top5MT and top5MTr) significantly (p < 0.05) outperform the vSim baseline. For example, with the one-hot representation, top5MT (MRR-5 0.2491) and top5MTr (MRR-5 0.2458) perform significantly (p < 0.05) better than vSim (MRR-5 0.1675), by up to 49%. Meanwhile, when using word vector representations with vector size 200 learned using GloVe from the BMC collection, top5MT (MRR-5 0.2638) significantly (p < 0.05) outperforms vSim with both the GloVe vector representation (MRR-5 0.1869) and the one-hot representation (MRR-5 0.1675). We observe similar trends in performance when using vector representations learned from the Twitter collection. These results show that our adapted phrase-based MT techniques are effective for the medical term normalisation task.
In addition, we observe the effectiveness of our combined approach (i.e. bestMT+vSim, top5MT+vSim and top5MTr+vSim), as it further improves upon the performance of the adapted phrase-based MT (i.e. bestMT, top5MT and top5MTr, respectively) when using the one-hot representation. For example, top5MTr+vSim achieves an MRR-5 of 0.2594, while the MRR-5 of top5MTr is 0.2458. However, the performance difference is not statistically significant. Meanwhile, when using the CBOW and GloVe vectors, the achieved performance varies depending on the collection (i.e. BMC or Twitter) used for learning the vectors and on the size of the vectors.

Conclusions
We have introduced an approach that adapts a phrase-based MT technique to normalise medical terms in Twitter messages. We evaluated our proposed approach using a collection of phrases from tweets related to ADRs. Our experimental results show that the proposed approach significantly outperforms an effective baseline by up to 55%. For future work, we aim to investigate the modelling of learned vector representations, such as CBOW and GloVe, within a phrase-based MT model when normalising medical terms.
Table 2: MRR-5 performance of the proposed approach when the word vector representations created by CBOW and GloVe are learned from the BMC collection with vector sizes 50, 100 and 200. Significant differences (p < 0.05) with the cosine similarity with the one-hot representation, and the cosine similarity with the corresponding distributed word representation vector, are denoted △ and , respectively.

Table 1 :
MRR-5 performance of the proposed approach and the baselines. Significant differences (p < 0.05) with the cosine similarity (vSim) baselines with the one-hot representation, and with the corresponding distributed word representation (e.g. CBOW or GloVe), are denoted △ and , respectively.