Numerically Grounded Language Models for Semantic Error Correction

Semantic error detection and correction is an important task for applications such as fact checking, speech-to-text or grammatical error correction. Current approaches generally focus on relatively shallow semantics and do not account for numeric quantities. Our approach uses language models grounded in numbers within the text. Such groundings are easily achieved for recurrent neural language model architectures, which can be further conditioned on incomplete background knowledge bases. Our evaluation on clinical reports shows that numerical grounding improves perplexity by 33% and F1 for semantic error correction by 5 points when compared to ungrounded approaches. Conditioning on a knowledge base yields further improvements.


Introduction
In many real world scenarios it is important to detect and potentially correct semantic errors and inconsistencies in text. For example, when clinicians compose reports, some statements in the text may be inconsistent with measurements taken from the patient (Bowman, 2013). Error rates in clinical data range from 2.3% to 26.9% (Goldberg et al., 2008) and many of them are number-based errors (Arts et al., 2002). Likewise, a blog writer may make statistical claims that contradict facts recorded in databases (Munger, 2008). Numerical concepts constitute 29% of contradictions in Wikipedia and GoogleNews (De Marneffe et al., 2008) and 8.8% of contradictory pairs in entailment datasets (Dagan et al., 2006). These inconsistencies may stem from oversight, lack of reporting guidelines or negligence. In fact they may not even be errors at all, but point to interesting outliers, or to errors in a reference database. In all cases, it is important to spot and possibly correct such inconsistencies. This task is known as semantic error correction (SEC) (Dahlmeier and Ng, 2011).
In this paper, we propose a SEC approach to support clinicians with writing patient reports. A SEC system reads a patient's structured background information from a knowledge base (KB) and their clinical report. Then it recommends improvements to the text of the report for semantic consistency. An example of an inconsistency is shown in Figure 1.
The SEC system has been trained on a dataset of records and learnt that the phrases "non dilated" and "severely dilated" correspond to high and low values for "EF" (abbreviation for "ejection fraction", a clinical measurement), respectively. If the system is then presented with the phrase "non dilated" in the context of a low value, it will detect a semantic inconsistency and correct the text to "severely dilated".
Our contributions are: 1) a straightforward extension to recurrent neural network (RNN) language models for grounding them in numbers available in the text; 2) a simple method for modelling text conditioned on an incomplete KB by lexicalising it; 3) our evaluation on a semantic error correction task for clinical records shows that our method achieves F1 improvements of 5 and 6 percentage points with grounding and KB conditioning, respectively, over an ungrounded approach (F1 of 49%).

Methodology
Our approach to semantic error correction (Figure 1) starts with training a language model (LM), which can be grounded in numeric quantities mentioned inline with text (Subsection 2.1) and/or conditioned on a potentially incomplete KB (Subsection 2.2). Given a document for semantic checking, a hypothesis generator proposes corrections, which are then scored using the trained language model (Subsection 2.3). A final decision step involves accepting the best scoring hypothesis.

Numerically grounded language modelling
Let {w_1, ..., w_T} denote a document, where w_t is the one-hot representation of the t-th token and V is the vocabulary size. A neural LM uses a matrix, E_in ∈ R^{D×V}, to derive word embeddings, e_{w_t} = E_in w_t. A hidden state from the previous time step, h_{t−1}, and the current word embedding, e_{w_t}, are sequentially fed to an RNN's recurrence function to produce the current hidden state, h_t ∈ R^D. The conditional probability of the next word is estimated as softmax(E_out h_t), where E_out ∈ R^{V×D} is an output embeddings matrix.
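The softmax step above can be sketched as follows (a minimal pure-Python illustration; the function name is ours, not from the paper):

```python
import math

def next_word_probs(E_out, h_t):
    """p(w_{t+1} | w_1..w_t) = softmax(E_out h_t) over the vocabulary.

    E_out is a V x D matrix (list of rows), h_t a length-D hidden state.
    """
    logits = [sum(row[j] * h_t[j] for j in range(len(h_t))) for row in E_out]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```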
We propose concatenating a representation, e_{n_t}, of the numeric value of w_t to the inputs of the RNN's recurrence function at each time step. Through this numeric representation, the model can generalise to out-of-vocabulary numbers. A straightforward representation is defining e_{n_t} = float(w_t), where float(.) is a numeric conversion function that returns a floating point number constructed from the string of its input. If conversion fails, it returns zero. The proposed mechanism for numerical grounding is shown in Figure 2. Now the probability of each next word depends on numbers that have appeared earlier in the text. We treat numbers as a separate modality that happens to share the same medium as natural language (text), but can convey exact measurements of properties of the real world. At training time, the numeric representations mediate to ground the language model in the real world.
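A minimal sketch of this grounding mechanism, using the scalar float(w_t) feature described above (function names are ours, not from the paper):

```python
def numeric_feature(token):
    """e_n_t = float(w_t) if the token parses as a number, else 0.0."""
    try:
        return float(token)
    except ValueError:
        return 0.0

def grounded_input(word_embedding, token):
    """Concatenate the word embedding e_w_t with the scalar numeric
    representation e_n_t before feeding the RNN recurrence."""
    return list(word_embedding) + [numeric_feature(token)]
```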

Conditioning on incomplete KBs
The proposed extension can also be used in conditional language modelling of documents given a knowledge base. Consider a set of KB tuples accompanying each document and describing its attributes in the form <attribute, value>, where attributes are defined by a KB schema. We can lexicalise the KB by converting its tuples into textual statements of the form "attribute : value". Figure 2 shows an example of a lexicalised KB in which the values of some attributes are missing.
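The lexicalisation step could look like this (a sketch; the tuple representation and the use of None for missing values are our assumptions):

```python
def lexicalise_kb(tuples):
    """Convert <attribute, value> KB tuples into textual statements of the
    form "attribute : value", skipping attributes whose value is missing."""
    return [f"{attr} : {val}" for attr, val in tuples if val is not None]
```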

Semantic error correction
A statistical model chooses the most likely correction from a set of possible correction choices. If the model scores a corrected hypothesis higher than the original document, the correction is accepted. A hypothesis generator function, G, takes the original document, H_0, as input and generates a set of candidate corrected documents G(H_0) = {H_1, ..., H_M}. A simple hypothesis generator uses confusion sets of semantically related words to produce all possible substitutions.
A scorer model, s, assigns a score s(H_i) ∈ R to a hypothesis H_i. The scorer is based on a likelihood ratio test between the original document (null hypothesis, H_0) and each candidate correction (alternative hypotheses, H_i), i.e. s(H_i) = p(H_i) / p(H_0). The assigned score represents how much more probable a correction is than the original document.
The probability of observing a document, p(H i ), can be estimated using language models, or grounded and conditional variants thereof.
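The generator and scorer described above can be sketched as follows (illustrative; the function names are ours):

```python
def generate_hypotheses(tokens, confusion_sets):
    """G(H_0): every single-word substitution licensed by the confusion sets."""
    hypotheses = []
    for i, tok in enumerate(tokens):
        for cset in confusion_sets:
            if tok in cset:
                for alt in sorted(cset - {tok}):
                    hypotheses.append(tokens[:i] + [alt] + tokens[i + 1:])
    return hypotheses

def likelihood_ratio(p_hi, p_h0):
    """s(H_i) = p(H_i) / p(H_0); a ratio above 1 favours the correction."""
    return p_hi / p_h0
```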

Data
Our dataset comprises 16,003 clinical records from the London Chest Hospital (Table 1). Each patient record consists of a text report and accompanying structured KB tuples. The latter describe 20 possible numeric attributes (age, gender, etc.), which are also partly contained in the report. On average, 7.7 tuples are completed per record. Numeric tokens constitute only a small proportion of each sentence (4.3%), but account for a large part of the unique tokens vocabulary (>40%) and suffer high OOV rates.

Table 2: Confusion sets (description: confusion set).
  intensifiers (adv): non, mildly, severely
  intensifiers (adj): mild, moderate, severe
  units: cm, mm, ml, kg, bpm
  viability: viable, non-viable
  quartiles: 25, 50, 75, 100
  inequalities: <, >
To evaluate SEC, we generate a "corrupted" dataset of semantic errors from the test part of the "trusted" dataset (Table 1, last column). We manually build confusion sets (Table 2) by searching the development set for words related to numeric quantities and grouping them if they appear in similar contexts. Then, for each document in the trusted test set we generate an erroneous document by sampling a substitution from the confusion sets. Documents with no possible substitution are excluded. The resulting "corrupted" dataset is balanced, containing 2,926 correct and 2,926 incorrect documents.
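The corruption procedure can be sketched as follows (our reconstruction; the exact sampling details are assumptions):

```python
import random

def corrupt(tokens, confusion_sets, rng):
    """Sample a single substitution from the confusion sets to create an
    erroneous document; return None when no token is confusable (such
    documents are excluded from the corrupted dataset)."""
    sites = [(i, cset) for i, tok in enumerate(tokens)
             for cset in confusion_sets if tok in cset]
    if not sites:
        return None
    i, cset = rng.choice(sites)
    corrupted = list(tokens)
    corrupted[i] = rng.choice(sorted(cset - {tokens[i]}))
    return corrupted
```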

Results and discussion
Our base LM is a single-layer long short-term memory network (LSTM; Hochreiter and Schmidhuber, 1997) with all latent dimensions (internal matrices, input and output embeddings) set to D = 50. We extend this baseline to a conditional variant by conditioning on the lexicalised KB (see Section 2.2). We also derive a numerically grounded model by concatenating the numerical representation of each token to the inputs of the base LM model (see Section 2.1). Finally, we consider a model that is both grounded and conditional (g-conditional).
The vocabulary contains the V = 1000 most frequent tokens in the training set. Out-of-vocabulary tokens are substituted with <num_unk>, if numeric, and <unk>, otherwise. We extract the numerical representations before masking, so that the grounded models can generalise to out-of-vocabulary numbers. Models are trained to minimise token cross-entropy, with 20 epochs of backpropagation and adaptive mini-batch gradient descent (AdaDelta) (Zeiler, 2012).

Table 3: Perplexity (PP) and adjusted perplexity (APP). Best results in bold.
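The OOV masking described above can be sketched as follows (the placeholder token names follow the paper; function names are ours):

```python
def is_numeric(token):
    """True if the token parses as a floating point number."""
    try:
        float(token)
        return True
    except ValueError:
        return False

def mask_oov(token, vocab):
    """Replace out-of-vocabulary tokens with class-specific placeholders.
    Numeric representations are extracted from the raw token *before*
    this masking step, so grounding survives for OOV numbers."""
    if token in vocab:
        return token
    return "<num_unk>" if is_numeric(token) else "<unk>"
```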
For SEC, we use an oracle hypothesis generator that has access to the groundtruth confusion sets (Table 2). We estimate the scorer (Section 2.3) using the trained base, conditional, grounded or g-conditional LMs. As additional baselines we consider a scorer that assigns random scores from a uniform distribution and always (never) scorers that assign the lowest (highest) score to the original document and uniformly random scores to the corrections.

Experiment 1: Numerically grounded LM
We report perplexity and adjusted perplexity (Ueberla, 1994) of our LMs on the test set for all tokens and token classes (Table 3). Adjusted perplexity is not sensitive to OOV-rates and thus allows for meaningful comparisons across token classes. Perplexities are high for numeric tokens because they form a large proportion of the vocabulary. The grounded and g-conditional models achieved a 33.3% and 36.9% improvement in perplexity, respectively, over the base LM model. Conditioning without grounding yields only slight improvements, because most of the numerical values from the lexicalised KB are out-of-vocabulary.
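For reference, standard (unadjusted) perplexity is the exponentiated mean negative log-likelihood per token; a minimal sketch (the adjusted variant of Ueberla (1994), which normalises for the probability mass collected by OOV placeholder classes, is not shown):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over a sequence of tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))
```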
The qualitative example in Figure 3 demonstrates how numeric values influence the probability of tokens given their history. We select a document from the development set and substitute its numeric values as we vary EF (the rest are set by solving a known system of equations). The selected exact values were unseen in the training data. We calculate the probabilities for observing the document with different word choices {"non", "mildly", "severely"} under the grounded LM and find that "non dilated" is associated with higher EF values. This shows that the model has captured semantic dependencies on numbers.

Experiment 2: Semantic error correction
We evaluate SEC systems on the corrupted dataset (Section 3) for detection and correction.
For detection, we report precision, recall and F1 scores in Table 4. Our g-conditional model achieves the best results, a total F1 improvement of 2 points over the base LM model and 7 points over the best baseline. The conditional model without grounding performs slightly worse in the F1 metric than the base LM. Note that with more hypotheses the random baseline behaves more similarly to always. Our hypothesis generator generated on average 12 hypotheses per document. The results of never are zero as it fails to detect any error.
Table 4: Error detection results on the test set. We report precision (P), recall (R) and F1. Best results in bold.

For correction, we report mean average precision (MAP) in addition to the same metrics as for detection (Table 5). MAP measures the rank position of the correct hypothesis. The always (never) baseline ranks the correct hypothesis at the top (bottom). Again, the g-conditional model yields the best results, achieving an improvement of 6 points in F1 and 5 points in MAP over the base LM model and an improvement of 47 points in F1 and 9 points in MAP over the best baseline. The conditional model without grounding has the worst performance among the LM-based models.
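Since each corrupted document has exactly one correct hypothesis, average precision reduces to the reciprocal rank of that hypothesis; a minimal sketch (our reconstruction):

```python
def mean_average_precision(ranks):
    """MAP over documents when each has a single correct hypothesis:
    average precision per document is 1/rank (rank 1 = top of the list)."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```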

Related Work
Grounded language models represent the relationship between words and the non-linguistic context they refer to. Previous work grounds language on vision (Bruni et al., 2014; Socher et al., 2014; Silberer and Lapata, 2014), audio, video (Fleischman and Roy, 2008), colour (McMahan and Stone, 2015), and olfactory perception. However, no previous approach has explored in-line numbers as a source of grounding.
Our language modelling approach to SEC is inspired by LM approaches to grammatical error correction (GEC) (Ng et al., 2013; Felice et al., 2014). They similarly derive confusion sets of semantically related words, substitute the target words with alternatives and score them with an LM. Existing semantic error correction approaches aim at correcting word choice errors (Dahlmeier and Ng, 2011), collocation errors (Kochmar, 2016), and semantic anomalies in adjective-noun combinations (Vecchi et al., 2011). So far, SEC approaches have focused on short-distance semantic agreement, whereas our approach can detect errors that require resolving long-range dependencies. Work on GEC and SEC shows that language models are useful for error correction; however, they neither ground in numeric quantities nor incorporate background KBs.

Table 5: Error correction results on the test set. We report mean average precision (MAP), precision (P), recall (R) and F1. Best results in bold.

Conclusion
In this paper, we proposed a simple technique to model language in relation to the numbers it refers to, as well as conditionally on incomplete knowledge bases. We found that the proposed techniques lead to performance improvements in the tasks of language modelling, and semantic error detection and correction. Numerically grounded models make it possible to capture semantic dependencies of content words on numbers.
In future work, we plan to apply numerically grounded models to other tasks, such as numeric error correction. We will explore alternative ways of deriving the numeric representations, such as accounting for verbal descriptions of numbers. For SEC, a trainable hypothesis generator can potentially improve the coverage of the system.