Deep Learning for Punctuation Restoration in Medical Reports

In clinical dictation, speakers try to be as concise as possible to save time, often resulting in utterances without explicit punctuation commands. Since the end product of a dictated report, e.g. an out-patient letter, does require correct orthography, including exact punctuation, the punctuation needs to be restored, preferably by automated means. This paper describes a method for punctuation restoration based on a state-of-the-art stack of NLP and machine learning techniques including B-RNNs with an attention mechanism and late fusion, as well as a feature extraction technique tailored to the processing of medical terminology using a novel vocabulary reduction model. To the best of our knowledge, the resulting performance is superior to that reported in prior art on similar tasks.


Introduction
Medical dictation has been a major instrument in clinical settings to minimize the administrative burden on physicians (Johnson et al., 2014; Hammana et al., 2015; Hodgson and Coiera, 2016). Rather than filling in forms in electronic medical record systems (EMRs) or typing out-patient letters themselves, physicians often outsource such labor to medical transcription providers, many of which make use of automated speech recognition (ASR), coupled with a manual correction step, to increase the effectiveness and speed of transcription. Although medical dictation substantially reduces the time physicians spend on clinical documentation, an average dictation still takes about three minutes. In an attempt to dictate as efficiently as possible, physicians often (a) speak extremely fast, (b) use pre-dictated paragraphs (so-called physician normals), (c) make massive use of abbreviations, and (d) include very limited (if any) instructions regarding formatting and punctuation. While the ASR system is in charge of turning spoken words into their textual representation, a sophisticated NLP unit, the post-processor, takes care of formatting and structuring the output to produce a draft resembling the out-patient letter as closely as possible. Among other responsibilities (such as formatting numerical expressions, dates, section headers, etc.), the post-processor is also charged with restoring punctuation in the letter's narrative. This paper focuses on automated punctuation restoration in clinical reports, drawing on the latest advances in the NLP sector.
To achieve the best possible results in this study, we paid particular attention to the specific challenges faced in medical texts. Foremost among these is a large domain-specific vocabulary, which makes it difficult if not impossible to apply tools developed for general-domain text. When building a system from scratch, however, several factors conspire to make it hard to obtain enough training data: the large medical vocabulary increases problems related to data sparsity and the handling of out-of-vocabulary (OOV) terms; the data often contain sensitive information and have restricted access or availability; and modern methods, such as the neural networks used here, typically require large amounts of data.
We overcame these issues by developing a text pre-processing strategy to reduce vocabulary size, collapsing particular roots and exploiting the fact that many medical terms are built from relatively few morphemes. Our method, which we call the vocabulary reduction model, effectively allows the punctuation restoration neural network to focus on morphosyntactic features of words rather than their full semantic representation, as usually cap-
After reviewing the prior art in the field of punctuation restoration in Section 2, we describe the corpus used in this study in Section 3. The system's general architecture, based on bidirectional recurrent neural networks with an attention mechanism and late fusion, is discussed in Section 4, followed by Section 5, which provides details on the vocabulary reduction model. Evaluation results are covered in Section 6, and the conclusion and future outlook in Section 7.

Related Work
Early efforts in this field used hidden-event n-gram language modeling to predict where punctuation should be inserted (Stolcke et al., 1998; Beeferman et al., 1998). Numerous other strategies have also been devised: combining n-grams with constituency parse information (Shieber and Tao, 2003); maximum entropy using n-gram and part-of-speech features (Huang and Zweig, 2002); conditional random fields (CRFs) (Ueffing et al., 2013); feed-forward neural networks and CRFs on n-gram and lexical features; even reframing the problem as monolingual machine translation (Peitz et al., 2011).
Most recently, it has been demonstrated that recurrent neural networks can restore punctuation very effectively (Tilk and Alumäe, 2015, 2016). Such methods are promising because they should be able to handle long-distance dependencies that are missed by other methods.
There has been little work on punctuation restoration in the medical domain specifically. While pauses have been shown to help in punctuation restoration for rehearsed speech such as TED Talks (Tilk and Alumäe, 2016), Deoras and Fritsch (2008) note that medical dictations pose a particular challenge because the speech is often delivered rapidly and without typical prosodic cues, such as pauses where one would write commas or other punctuation. Thus, although acoustic information has been successfully incorporated for other domains (Huang and Zweig, 2002; Christensen et al., 2001), the same may not be feasible for medical text, so it is especially desirable to have a reliable text-only method.

Corpus
The corpus used in this study is composed of 32,275 medical reports (i.e., out-patient letters), which we converted into a sequence of tokens with punctuation as tags. Since they are the most relevant to medical dictations, we focused on three punctuation marks: colon, comma, and period, represented in the tag set {COLON, COMMA, PERIOD}. We randomly split our corpus into a training set, a development set, and a blind test set. Detailed corpus statistics are given in Table 1.
To reduce the size of the vocabulary, we performed two layers of text preprocessing. First, we performed several text normalization steps such as converting all digits to "D", normalizing numbers, dates, and times into familiar formats (e.g., "D.D", "DD/DDDD", "DD/DD", "DD/DD/DDDD", "DD:DD"), as well as normalizing other tokens of the medical domain (e.g., "DDD/DD" for blood pressure, "lD-lD" for lumbar spinal discs, and "q.D+h" meaning "every D+ hours"). Normalization also included lowercasing, unifying abbreviations (e.g., "p.r.n" and "p.r.n."), and performing simple segmentation (e.g., splitting "'s" from a word). Second, we ran a vocabulary reduction algorithm, as detailed in Section 5, that maps infrequent and OOV words to word classes. The combination of these two layers dramatically reduced the vocabulary size, as shown in Table 1.
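As a rough illustration of the first preprocessing layer, the following Python sketch covers only three of the steps named above: lowercasing, replacing digits with "D", and splitting "'s" from a word. The function name `normalize` and the regular expressions are assumptions made for this example; the normalization of dates, times, and abbreviations is not reproduced here.

```python
import re


def normalize(text):
    """Hypothetical sketch of a few normalization steps (not the full pipeline)."""
    # Lowercase first, so the "D" placeholders introduced below stay uppercase.
    text = text.lower()
    # Replace every digit with the placeholder "D" (e.g. "120/80" -> "DDD/DD").
    text = re.sub(r"\d", "D", text)
    # Simple segmentation: split a trailing "'s" from the preceding word.
    text = re.sub(r"(\w)'s\b", r"\1 's", text)
    return text


print(normalize("The patient's BP was 120/80 at 10:30"))
```

With this sketch, a blood pressure reading and a time of day collapse to the patterns "DDD/DD" and "DD:DD", so that all such readings share one vocabulary entry each.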

The Neural Network Model
We define punctuation restoration as a tagging problem. We try to tag every word in the input sequence with one of four tags: {NONE, COLON, COMMA, PERIOD}. Tagging a word with a punctuation mark means that the mark should be inserted after this word, while tagging it with NONE means that no punctuation follows the word. Our neural network approach is based on the work of Tilk and Alumäe (2016). Inspired by Bahdanau et al. (2016), our deep neural network model uses a bidirectional recurrent neural network (B-RNN) (Schuster and Paliwal, 1997) with gated recurrent units (Cho et al., 2014). B-RNNs help in learning long-range dependencies on the left and right of the current input word. The B-RNN is composed of a forward RNN and a backward RNN that are preceded by the same word embedding layer. A sliding window of 256 words is passed to the shared embedding layer as one-hot vectors.
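The tagging formulation can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the helper names `to_tagged` and `restore` are hypothetical, punctuation marks are assumed to arrive as separate tokens, and when several marks follow the same word the last one wins.

```python
PUNCT_TO_TAG = {":": "COLON", ",": "COMMA", ".": "PERIOD"}
TAG_TO_PUNCT = {tag: mark for mark, tag in PUNCT_TO_TAG.items()}


def to_tagged(tokens):
    """Turn a token list (punctuation as separate tokens) into (word, tag) pairs."""
    tagged = []
    for tok in tokens:
        if tok in PUNCT_TO_TAG:
            if tagged:  # attach the mark to the word that precedes it
                word, _ = tagged[-1]
                tagged[-1] = (word, PUNCT_TO_TAG[tok])
        else:
            tagged.append((tok, "NONE"))
    return tagged


def restore(tagged):
    """Insert each predicted mark after its word; NONE inserts nothing."""
    return " ".join(word + TAG_TO_PUNCT.get(tag, "") for word, tag in tagged)


tokens = "he is stable . we will add inhibitors , which he is not on .".split()
tagged = to_tagged(tokens)
print(tagged[2])
print(restore(tagged))
```

In this framing, the network only has to predict the second element of each pair from the word sequence; `restore` is the trivial final step that writes the predicted marks back into the text.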
On top of the B-RNN, we stack a unidirectional RNN with an attention mechanism (Bahdanau et al., 2016) to assist in capturing relevant contexts that support punctuation restoration decisions. Finally, we use late fusion (Wang and Cho, 2015) to combine the output of the attention mechanism with the current position in the B-RNN without interfering with its memory.

The Vocabulary Reduction Model
To improve the modeling of rare words and to deal with OOV words in the test and development sets, we implemented a step that maps rare words to common word classes, reducing the overall size of the vocabulary. This vocabulary reduction allows us to reduce the number of model parameters, which is crucial for fast decoding in a live recognizer. Table 2 shows examples of prefixes and suffixes that capture the semantic and morpho-syntactic information of infrequent words in our training data such as medical terminology and proper names. For every input word consisting of alphabetical characters only, our vocabulary reduction algorithm goes through the prefix and suffix lists, starting from the longer affixes and moving to the shorter ones, and tries to match them to the beginning or end of the word, while ensuring that the stem is at least four letters long. If the word starts with a prefix p+ from the prefix list, we replace it with "pAAAA" (where "AAAA" represents an alphabetical stem). If it ends with a suffix +q, we replace it with "AAAAq". Finally, if the word matches both a prefix p+ and a suffix +q, we split it into two tokens, "pAA+" and "+AAq", to ensure that the information in them gets modeled separately. Every rare word consisting of alphabetical characters only that does not match any suffix or prefix is replaced with a token that represents its length range. The length range is computed with a step of five characters, resulting in tokens like AAAA 5 for words shorter than five characters, AAAA 10 for words shorter than ten characters, etc. For example, "angiotensinconvertingenzyme" is replaced with AAAA 30. All other rare words (e.g., "t1cn0m0") are replaced with the token "RARE". These handcrafted rare classes allow us to increase the threshold for considering a word rare. This technique not only significantly reduces the size of the vocabulary, but also allows us to better model rare classes with a higher number of tokens.
Table 2: Examples of prefixes and suffixes, grouped by affix length.

Affix length 4
Prefixes: inte+, anti+, post+, tran+, over+, intr+, peri+, hype+, para+, neur+, hypo+, micr+, rein+, mult+, card+, comp+, retr+, reco+, self+, gran+, extr+, medi+, hemi+, well+, semi+, endo+, radi+, hemo+, fibr+, oste+, elec+
Suffixes: +tion, +ions, +type, +ness, +ized, +date, +able, +gery, +tive, +sult, +tomy, +ated, +tory, +sion, +ates, +ular, +ical, +osis, +ment, +nary, +rate, +ings, +arge, +onal, +itis, +ents, +like, +lity, +ance, +berg

Affix length 3
Prefixes: non+, pre+, per+, pro+, mar+, sub+, sch+, str+, tri+, ben+
Suffixes: +ing, +ion, +ted, +ate, +lly, +ive, +tic, +ers, +ble, +ies, +ity, +cal, +man, +sis, +son, +ial, +ous, +ell, +ary, +lar, +tes, +ton, +dez

Affix length 2
Prefixes: re+, de+, mc+, un+, le+, la+, vi+
Suffixes: +ed, +er, +es, +al, +ry, +te, +ic, +ly, +le

We replace a word with its rare class whenever we find it 20 or fewer times in the training data, and we perform the affix-based replacement described above whenever the word occurs fewer than 100 times. These thresholds were tuned on a held-out development set. Running this algorithm on top of the normalized text lowers the vocabulary size in our training data to 11,766 types, meaning that four out of five types are replaced with a class.
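The replacement rules described in this section can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: the function names are hypothetical, only a small subset of the affix lists is included, the length-class token is written as printed in the text ("AAAA 30"), and the handling of a word whose prefix and suffix leave a stem shorter than four letters (falling back to the prefix class) is an assumption, since the text does not specify it.

```python
# Small subsets of the affix lists, for illustration only.
PREFIXES = ["neur", "hype", "anti", "non", "pre", "sub", "re", "de", "un"]
SUFFIXES = ["ness", "tion", "itis", "osis", "ing", "ous", "ed", "al", "ic", "ly"]
MIN_STEM = 4          # the stem must keep at least four letters
AFFIX_THRESHOLD = 100  # affix replacement for words seen fewer than 100 times
RARE_THRESHOLD = 20    # rare classes for words seen 20 or fewer times


def longest_affix(word, affixes, at_start):
    """Return the longest matching affix that leaves a long enough stem."""
    for affix in sorted(affixes, key=len, reverse=True):
        matches = word.startswith(affix) if at_start else word.endswith(affix)
        if matches and len(word) - len(affix) >= MIN_STEM:
            return affix
    return None


def reduce_word(word, count):
    """Map a word with training-set frequency `count` to one or two tokens."""
    if count >= AFFIX_THRESHOLD:
        return [word]
    if word.isalpha():
        p = longest_affix(word, PREFIXES, at_start=True)
        q = longest_affix(word, SUFFIXES, at_start=False)
        if p and q and len(word) - len(p) - len(q) >= MIN_STEM:
            return [p + "AA+", "+AA" + q]
        if p:  # assumption: the prefix wins when both cannot be kept
            return [p + "AAAA"]
        if q:
            return ["AAAA" + q]
    if count > RARE_THRESHOLD:
        return [word]
    if word.isalpha():
        bucket = (len(word) // 5 + 1) * 5  # smallest multiple of 5 above the length
        return ["AAAA " + str(bucket)]
    return ["RARE"]


for w in ["unsteadiness", "tremulousness", "angiotensinconvertingenzyme", "t1cn0m0"]:
    print(w, "->", reduce_word(w, count=3))
```

Because the affix lists are searched from longest to shortest, a word such as "unsteadiness" is matched by "+ness" rather than by a shorter suffix, and the stem-length check keeps short words from being reduced to an affix alone.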

Evaluation
For the present study, we used Keras with a TensorFlow backend (Chollet, 2015; Abadi et al., 2016; Chollet, 2017). We evaluated on the blind test set by passing the whole set to our system as a sequence of about three million tokens without any indication of the beginning or end of a sentence, paragraph, or report. All words were lowercased, as described earlier, to avoid giving any hint of where a sentence or section header starts or ends. We report the results in Table 3.
We achieve 96.3% F-Score on periods, which we consider the most important as they define sentence boundaries. The latter are crucial for virtually any subsequent NLP process, such as automatic coding of medical reports (Suendermann-Oeft et al., 2016).
The second most important punctuation type in medical reports is colons, as they define section headers and, thus, help format the report structure. We achieve 98.6% F-Score on colons.
Finally, we get 83.1% F-Score on commas, the hardest tag to predict due to human inconsistency in using them. This inconsistency affects the accuracy of the training data as well as the fairness in the evaluation against the test set. The overall performance of the system on all tags is 94.1% in terms of F-Score. Refer to Table 4 for examples of our system's output.
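For readers who want to reproduce this kind of evaluation, per-tag F-Scores can be computed from aligned gold and predicted tag sequences as sketched below. The function name `f_scores` is hypothetical, and the overall score is computed here by pooling counts over the three punctuation tags (micro-averaging); this pooling is an assumption of the sketch, as the text does not state how the overall figure was aggregated.

```python
from collections import Counter

PUNCT_TAGS = ("COLON", "COMMA", "PERIOD")


def f_scores(gold, pred):
    """Per-tag and pooled F-Score over punctuation tags (NONE is not scored)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g != "NONE":
                tp[g] += 1
        else:
            if p != "NONE":
                fp[p] += 1  # predicted a mark that is not in the gold data
            if g != "NONE":
                fn[g] += 1  # missed (or mislabeled) a gold mark

    def f1(t, f_p, f_n):
        denom = 2 * t + f_p + f_n
        return 2 * t / denom if denom else 0.0

    scores = {tag: f1(tp[tag], fp[tag], fn[tag]) for tag in PUNCT_TAGS}
    scores["ALL"] = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return scores


gold = ["NONE", "PERIOD", "NONE", "COMMA", "NONE", "PERIOD"]
pred = ["NONE", "PERIOD", "COMMA", "NONE", "NONE", "PERIOD"]
print(f_scores(gold, pred))
```

In the toy sequences above, both periods are recovered while the single comma is placed one word too early, which counts as one false positive and one false negative for COMMA.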

Conclusion and Future Work
Although prior work on punctuation restoration has used different corpora from the work presented in this paper, our result (F-Score 94.1%) compares very favorably with previous publications. For example,  achieve an F-Score of 61.8% on a meeting and lecture corpus, Tilk and Alumäe (2016) produce 64.4% on TED talk transcripts, and Ueffing et al. (2013) report an F-Score of 66.8% on one of Nuance's in-house dictation corpora.
While we have tested the performance of the presented punctuation restoration algorithm on naturalistic medical dictations, we have not yet measured the impact the speech recognizer's word error rate has on the F-Score, a task we plan to address in the near future. We are also interested to learn whether analyzing the speech waveform and characteristic pauses and prosodic patterns in medical dictations can be exploited in a hybrid speech/text punctuation restoration system to enhance accuracy even further. We also plan to replace the vocabulary reduction model by fusing a morphology-aware neural network such as a character-based convolutional network.

Input: ... is reasonable we will optimize his medications by adding low dose angiotensinconvertingenzyme inhibitors which he currently is not on if the ...
Gold: ... is reasonable. we will optimize his medications by adding low dose angiotensinconvertingenzyme inhibitors, which he currently is not on. if the ...
Punctuated: ... is reasonable PERIOD we will optimize his medications by adding low dose AAAA 30 inhibitors COMMA which he currently is not on PERIOD if the ...

Table 4: Examples of the output of our system on word sequences of the input. The first example shows the correct handling of consecutive colons indicating a section header and a subsection header. The second example shows the preprocessing of infrequent medical terminology like "neurodermatosis", "neurotic", and "excoriations" by capturing their semantic and part-of-speech information. The third and fourth examples emphasize the case of parallelism captured by mapping "tremulousness and unsteadiness" to "AAAAness and unAA+ +AAness" and "hopelessness helplessness worthlessness" to "AAAAness AAAAness AAAAness", thus predicting commas when needed since the meaning is irrelevant to the punctuation task. The fourth example also shows the correct prediction of coordinated lists, separating them with commas. The final example presents the mapping of a very long word, "angiotensinconvertingenzyme", into "AAAA 30", which reduces the confusion of the network and results in the correct prediction.