KW-ATTN: Knowledge Infused Attention for Accurate and Interpretable Text Classification

Text classification has wide-ranging applications in various domains. While neural network approaches have drastically advanced performance in text classification, they tend to require large amounts of training data, and interpretability is often an issue. As a step towards better accuracy and interpretability, especially on small data, in this paper we present a new knowledge-infused attention mechanism, called KW-ATTN (KnoWledge-infused ATTentioN), to incorporate high-level concepts from external knowledge bases into Neural Network models. We show that KW-ATTN outperforms baseline models that use only words, as well as other concept-based approaches, in classification accuracy, which indicates that high-level concepts help model prediction. Furthermore, crowdsourced human evaluation suggests that the additional concept information helps the interpretability of the model.


Introduction
Text classification is a fundamental Natural Language Processing (NLP) task with wide-ranging applications such as topic classification (Lee et al., 2011), fake news detection (Shu et al., 2017), and medical text classification (Botsis et al., 2011). The current state-of-the-art approaches for text classification use Neural Network (NN) models. When these techniques are applied to real data in various domains, two problems arise. First, neural approaches tend to require large training data, but large training data or pretrained embeddings are often unavailable in domain-specific applications. Second, when text classification is applied in real life, not only the accuracy but also the interpretability or explainability of the model is important.
As a way to improve interpretability as well as accuracy, incorporating high-level concept information can be useful. High-level concepts can help in interpreting model results because concepts summarize individual words. The concept "medication" is not only easier to interpret than the words "ibuprofen" or "topiramate" but also contributes to understanding those words better. In addition, higher-level concepts can make raw words with low frequency more predictive. For instance, the words "hockey" and "archery" might not occur in a corpus frequently enough to be considered important by a model, but knowing that they belong to the concept "athletics" can give more predictive power to these less frequent individual words, depending on the task, because the frequency of the concept "athletics" is higher than that of the individual words.
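To make the frequency argument concrete, here is a minimal sketch; the word-to-concept mapping is a toy stand-in for a knowledge base lookup:

```python
from collections import Counter

# Toy corpus: each rare sport word occurs once, so none looks important on its own.
tokens = ["hockey", "archery", "fencing", "movie", "movie", "movie"]

# Hypothetical word-to-concept mapping (from BabelNet/UMLS in the paper).
concept_of = {"hockey": "athletics", "archery": "athletics", "fencing": "athletics"}

word_counts = Counter(tokens)
concept_counts = Counter(concept_of[w] for w in tokens if w in concept_of)

print(word_counts["hockey"])        # 1 -- each individual sport word is rare
print(concept_counts["athletics"])  # 3 -- the shared concept is frequent
```

Aggregated under the shared concept, three singleton words become one feature with three observations, which is the sparsity-reducing effect described above.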
In this paper we present a new approach that incorporates high-level concept information from external knowledge sources into NN models. We devise a novel attention mechanism, KW-ATTN, that allows the network to separately and flexibly attend to the words and/or concepts occurring in a text, so that attended concepts can offer information for predictions in addition to the information a model learns from texts or a pretrained model. We test KW-ATTN on two different tasks: patient need detection in the healthcare domain and topic classification in general domains. Data is annotated with high level concepts from external knowledge bases: BabelNet (Navigli and Ponzetto, 2012) and UMLS (Unified Medical Language System) (Lindberg, 1990). We also conduct experiments and analyses to evaluate how high-level concept information helps with interpretability of resultant classifications as well as accuracy. Our results indicate that KW-ATTN improves both classification accuracy and interpretability.
Our contribution is threefold: (1) We propose a novel attention mechanism that exploits high-level concept information from external knowledge bases, designed to provide an additional layer of interpretation using attention. This attention mechanism can be plugged into different architectures and applied in any domain for which we have a knowledge resource and a corresponding tagger.
(2) Experiments show KW-ATTN makes statistically significant gains over a widely used attention mechanism plugged into RNN models and over other approaches using concepts. We also show that the attention mechanism can help prediction accuracy when added on top of the pretrained BERT model. Additionally, our attention analysis on patient need data annotated with BabelNet and UMLS indicates that the choice of external knowledge impacts the model's performance. (3) Lastly, our human evaluation using crowdsourcing suggests our model improves interpretability.
Section 2 relates prior work to ours. Section 3 explains our method. Section 4 evaluates our model on two different tasks in terms of classification accuracy. Section 5 describes our human evaluation on interpretability. Section 6 concludes.

Knowledge-infused Neural Networks
There has been growing interest in incorporating external semantic knowledge into neural models for text classification. One framework based on convolutional neural networks combines explicit and implicit representations of short text for classification by conceptualizing a short text as a set of relevant concepts using a large taxonomy knowledge base. Yang and Mitchell (2017) proposed KBLSTM, an RNN model that uses continuous representations of knowledge bases for machine reading. Xu et al. (2017) incorporated background knowledge in the entity-attribute format for conversation modeling. Stanovsky et al. (2017) overrode word embeddings with DBpedia concept embeddings and used RNNs to recognize mentions of adverse drug reactions in social media.
More advanced neural architectures such as Transformers have also benefited from external knowledge. Zhong et al. (2019) proposed the Knowledge Enriched Transformer (KET), in which contextual utterances are interpreted using hierarchical self-attention and external commonsense knowledge is dynamically leveraged using a context-aware affective graph attention mechanism. ERNIE (Zhang et al., 2019) integrated entity embeddings pretrained on a knowledge graph with the corresponding entity mentions in the text to augment the text representation. KnowBERT (Peters et al., 2019) trained BERT for entity linking and language modeling in a multitask setting to incorporate entity representations. K-BERT (Liu et al., 2020) injected triples from knowledge graphs into a sentence to obtain an extended tree-form input for BERT.
Although all these prior models incorporated external knowledge into advanced neural architectures to improve model performance, they did not pay much attention to interpretability benefits. A few knowledge-infused models have considered interpretability. Kumar et al. (2018) proposed a two-level attention network for sentiment analysis using knowledge graph embeddings generated with WordNet (Fellbaum, 2012) and the top-k similar words. Although this work mentions interpretability, it did not show whether or how the model helps interpretability. Margatina et al. (2019) incorporated existing psycho-linguistic and affective knowledge from human experts for sentiment-related tasks, but only showed an attention heatmap for a single example.
Our work is distinguished from the others in that KW-ATTN is designed with both accuracy and interpretability in mind. For this reason, KW-ATTN separately and flexibly attends to the words and/or concepts so that concepts important for prediction can be included in prediction explanations, adding an extra layer of interpretation. We also perform a human evaluation of the effect of incorporating high-level concepts on interpretation, rather than just showing a few visualization examples.

Interpretability
Interpretability is the ability to explain or present a model in an understandable way to humans (Doshi-Velez and Kim, 2017). Interpretability helps developers understand a model, identify and possibly fix issues with it, and enhance it. It is crucial for application end users because knowing the explanations or justifications behind a model's prediction can further assist in decision making or the task at hand.
To provide interpretability, researchers have used inherently interpretable models such as sparse linear regression models, decision trees, or rule sets. These models are generally useful for simple prediction tasks, yet they are difficult to apply to complicated tasks. To interpret complex models used for complex tasks, one can examine how the prediction changes between two different inputs (Shrikumar et al., 2017; Lundberg and Lee, 2017) or under local perturbations of an input (Ribeiro et al., 2016). However, a recent and popular method in NLP has been the use of an attention mechanism, which was found to be effective in helping interpret complex models by highlighting which inputs are informative for prediction (Wang et al., 2016; Lin et al., 2017; Ghaeini et al., 2018; Seo et al., 2016).
Along the lines of work using attention for interpretation, our model improves attention-based interpretability by using high-level concept information. To our knowledge, no prior work has used external high-level concept information for better interpretability.

External Knowledge Bases
We automatically annotate data with high-level concepts from two knowledge bases: BabelNet and UMLS.

BabelNet
BabelNet (Navigli and Ponzetto, 2012) is a constantly growing semantic network that connects concepts and named entities in a large network of semantic relations, currently made up of about 16 million entries, called Babel synsets. In our study, we use the hypernyms of Babel synsets as additional higher-level concept information for the raw words or phrases in text. We first link text to concepts in Babel synsets using an entity linking toolkit, Babelfy (Moro et al., 2014), and then retrieve the hypernyms (higher-level concepts) of those synsets using the BabelNet APIs. Table 1 shows example annotations for the sentence "My mom was diagnosed with stage 3 ovarian cancer."
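Conceptually, the annotation is a two-step lookup: entity linking, then hypernym retrieval. The sketch below stubs both steps with toy dictionaries, since the real system calls Babelfy and the BabelNet API; the synset identifiers here are purely illustrative:

```python
# Hypothetical output of entity linking (Babelfy in the real pipeline):
# surface expression -> Babel synset id (toy ids, not real BabelNet ids).
linked = {
    "mom": "bn:mother_synset",
    "diagnosed": "bn:diagnose_synset",
    "ovarian cancer": "bn:ovarian_cancer_synset",
}

# Hypothetical hypernym lookup standing in for the BabelNet API.
hypernyms = {
    "bn:mother_synset": "mother",
    "bn:diagnose_synset": "analyze",
    "bn:ovarian_cancer_synset": "disease",
}

def annotate(expressions):
    """Map each expression to a high-level concept, or None if it is unlinked."""
    return {e: hypernyms.get(linked.get(e)) for e in expressions}

result = annotate(["mom", "ovarian cancer", "unmatched phrase"])
print(result)
# {'mom': 'mother', 'ovarian cancer': 'disease', 'unmatched phrase': None}
```

Expressions that the entity linker cannot resolve map to None, which corresponds to the no-concept case handled later in the model.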

Expression        BabelNet Concept
"Mom"             mother
"diagnosed"       analyze
"stage"           state
"ovarian cancer"  disease

UMLS
We also exploit an external medical ontology, the UMLS (Lindberg, 1990), for comparison with BabelNet on the patient need task. The UMLS is a high-level ontology that organizes a great number of concepts in the biomedical domain and provides unified access to many different biomedical resources. On top of the UMLS, the UMLS Semantic Network (McCray, 2003) implements an upper-level conceptual layer for all UMLS concepts: it categorizes them into 134 semantic types and provides 54 links between the semantic types to represent relationships in the biomedical domain. We use these semantic types as additional higher-level concepts because they can abstract the finer-grained clinical concepts found across much larger medical ontologies such as the UMLS, SNOMED (Benson, 2010), and ICD-10 (World Health Organization, 2017). To obtain the semantic types, we annotate raw text using MetaMap. Table 2 shows an example from MetaMap. Note that the automatic annotation can be noisy (e.g., incorrect semantic types for "mom" in the example).

Incorporating High-Level Concepts
To incorporate high-level concept information into a NN model, we design a new attention mechanism, KW-ATTN, which gives separate but complementary attention to a word and its corresponding concept. To test KW-ATTN, we choose a one-level RNN architecture with an attention mechanism (1L), a hierarchical RNN architecture with an attention mechanism (2L) as in the Hierarchical Attention Network (HAN) (Yang et al., 2016), and a pretrained BERT (Devlin et al., 2018). Our 2L model architecture is shown in Figure 1. The architecture begins with the words in each sentence as input. They are embedded and encoded using a word encoder, and the resulting hidden representations are concatenated with the corresponding concept embeddings before moving forward to a word-concept attention layer. This differs from common RNN architectures for text classification, where only the hidden representations from the word encoder feed a word-level attention layer. The output of this attention layer is then used in the next phase: a sentence encoder in the case of 2L, and the final layer in the case of 1L. When KW-ATTN is applied to BERT (KW-BERT), the RNN word encoder is replaced with BERT, and the output of KW-ATTN is fed to the final layer as in 1L.

Word and Concept Embeddings: Each word w_it (a one-hot vector, where t ∈ {1, ..., T_i} and T_i is the number of words in the i-th sentence) is mapped to a real-valued vector x_it through an embedding matrix W_e by x_it = W_e w_it. To use high-level concepts, each concept c_it (a one-hot vector) corresponding to word w_it is likewise mapped to x^c_it through an embedding matrix W_ec by x^c_it = W_ec c_it. When a word is not mapped to a concept, we map its concept vector to a no-concept vector.
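As a concrete illustration, the concept lookup with a no-concept fallback can be sketched as follows; the vocabulary, dimension, and random initialization are toy assumptions, not the paper's actual values:

```python
import random

random.seed(1)

EMB_DIM = 4  # toy embedding dimension
# Hypothetical concept vocabulary; id 0 is reserved for the no-concept vector.
CONCEPT_IDS = {"<no_concept>": 0, "mother": 1, "disease": 2}

# Toy embedding matrix W_ec; row i is the embedding of concept id i.
W_ec = [[random.uniform(-0.5, 0.5) for _ in range(EMB_DIM)] for _ in CONCEPT_IDS]

def embed_concept(concept):
    """x^c = W_ec c: look up a concept embedding; words without a mapped
    concept fall back to the shared no-concept vector (row 0)."""
    cid = CONCEPT_IDS.get(concept, CONCEPT_IDS["<no_concept>"])
    return W_ec[cid]

print(embed_concept("mother") == W_ec[1])  # True
print(embed_concept(None) == W_ec[0])      # True: no-concept fallback
```

Because the one-hot multiplication x^c_it = W_ec c_it is equivalent to selecting a row of W_ec, the sketch implements it as a plain row lookup.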
Word and Concept Encoders: We encode the T_i words in each sentence i using a word encoder, and the corresponding T_i concepts using a concept encoder. We use a bi-directional GRU (Cho et al., 2014) to build representations for the t-th word and concept in sentence i, denoted h_it and h^c_it:

→h_it = →GRU(x_it), t ∈ {1, ..., T_i}
←h_it = ←GRU(x_it), t ∈ {T_i, ..., 1}
h_it = [→h_it ; ←h_it]

where T_i is the number of words in the i-th sentence; h^c_it is obtained analogously from x^c_it. Note that we obtain a representation that summarizes the information of the whole sentence around the t-th word w_it by concatenating the forward hidden state →h_it and the backward hidden state ←h_it.
Word-Concept Attention: In this stage, the output of the word encoder h_it and the corresponding concept encoder output h^c_it are combined in a word-concept attention layer. This layer consists of two attention levels. One is an attention vector α_it that tracks the importance of a combined word-concept pair, which we call "combined" attention. The other, which we call "balancing" attention p_it, flexibly incorporates concept information into the model. The balancing attention gives complementary attention to both word and concept, because the relative importance of a word and its concept can differ. For example, when "football" is attended, we do not know whether the word "football" itself is important for the prediction, or whether its concept, covering "football", "tennis", and all other sports together, is what matters. Additionally, the balancing attention helps the model to be more robust to noisy concepts that may be caused by automatic annotation.
In detail, each position in a sentence includes a word and its corresponding concept. For each position, a combined attention α_it is assigned, which represents attention to the position (both word and concept). Within each position, the balancing attention p_it is assigned to the concept and its complement 1 − p_it to the corresponding word. As seen in Figure 1, α_it represents the contribution of position t (both the t-th word and its concept) to the meaning of sentence i, while 1 − p_it weights the word and p_it weights the word's concept. Hence, α_it(1 − p_it) and α_it p_it represent the contributions of the t-th word and concept to sentence i, respectively. This mechanism of combined and balancing attentions enables us to give separate but complementary attention to the word and the concept. In addition, we set p_it to 0 when a word does not have a corresponding concept, because in that case the model should attend only to the word. The new attention mechanism is as follows:

u_it = tanh(W_α [h_it ; h^c_it] + b_α)
α_it = exp(u_it^T u_α) / Σ_t exp(u_it^T u_α)
p_it = σ(w_p^T [h_it ; h^c_it] + b_p)
s_i = Σ_t α_it ((1 − p_it) h_it + p_it h^c_it)

where W_α, b_α, w_p, b_p, and u_α are model parameters and σ is the sigmoid function. s_i is the representation of the i-th sentence.
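A minimal pure-Python sketch of the word-concept attention is given below; the toy dimensions, random initialization, and HAN-style tanh scoring are illustrative assumptions rather than the exact implementation:

```python
import math
import random

random.seed(0)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

D = 4  # toy encoder output size; word and concept representations share it

# Toy parameters (trained jointly with the rest of the network in practice).
W_alpha = [[random.uniform(-0.1, 0.1) for _ in range(2 * D)] for _ in range(2 * D)]
b_alpha = [0.0] * (2 * D)
u_alpha = [random.uniform(-0.1, 0.1) for _ in range(2 * D)]
w_p = [random.uniform(-0.1, 0.1) for _ in range(2 * D)]
b_p = 0.0

def kw_attention(h_word, h_concept, has_concept):
    """Return sentence vector s, combined attention alpha, balancing attention p."""
    concat = [hw + hc for hw, hc in zip(h_word, h_concept)]  # [h_it ; h^c_it]
    u = [[math.tanh(dot(row, x) + b) for row, b in zip(W_alpha, b_alpha)]
         for x in concat]
    alpha = softmax([dot(ui, u_alpha) for ui in u])
    # Balancing attention; forced to 0 when the position has no mapped concept.
    p = [1.0 / (1.0 + math.exp(-(dot(w_p, x) + b_p))) if flag else 0.0
         for x, flag in zip(concat, has_concept)]
    # Each position contributes alpha*(1-p)*word + alpha*p*concept.
    s = [sum(a * ((1 - pi) * hw[d] + pi * hc[d])
             for a, pi, hw, hc in zip(alpha, p, h_word, h_concept))
         for d in range(D)]
    return s, alpha, p

h_w = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(3)]
h_c = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(3)]
s, alpha, p = kw_attention(h_w, h_c, [True, True, False])
print(len(s), round(sum(alpha), 6), p[2])  # 4 1.0 0.0
```

The two returned vectors make the interpretability hooks explicit: alpha ranks positions, while p splits each position's credit between the word and its concept, with p forced to zero at concept-less positions.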
s_i is used as the input to the next layer, the sentence encoder, in the case of 2L (HAN). The sentence representations h_i then go through the sentence-level attention layer to build a document vector v, as shown in Figure 1. In the case of a 1L or BERT model, all the words in the document are treated as one single sentence, so there is a single representation s_1, which is equivalent to the document vector v in the 2L case.
Finally, based on this vector v, the classification probability for each class is computed in the final layer.

Experiments
KW-ATTN is evaluated on two different datasets for patient need detection (need dataset) (Jang et al., 2019) and topic classification (Yahoo answers) (Zhang et al., 2015). We use different tasks to more broadly demonstrate the benefits of our approach.

Data
Patient need detection: This dataset is for detecting patient needs in posts from an online cancer discussion forum. We use the health information need data for binary classification (450 positive samples out of 853). Although this dataset is quite small, we choose it because RNN approaches have shown effectiveness on it (Jang et al., 2019) and it lets us compare the effect of a general knowledge graph against a domain-specific medical ontology. We build two different concept annotations, with BabelNet and with UMLS.
Yahoo answers: This dataset is for topic classification. It includes 10 different topics such as Society & Culture and Sports. To generate a dataset that is still small but one order of magnitude bigger than the need dataset, we randomly select 10,000 instances, enforcing a balanced dataset (1,000 instances per topic), and annotate them with BabelNet concepts.
The statistics of our concept-annotated datasets are summarized in Table 3. The ratios of words that match concepts are 6.6% (the need dataset with BabelNet), 36.3% (the need dataset with UMLS), and 8.9% (Yahoo answers). In all our experiments, we perform 10-fold cross-validation ten times. For each run, we use 80% of the data for training, 10% for development, and 10% for test.
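One way to realize such repeated 80/10/10 splits is sketched below; the paper does not specify the exact fold-assignment scheme, so rotating one fold for test and the next for development is an illustrative assumption:

```python
import random

def cross_validation_splits(n_items, n_folds=10, seed=0):
    """Yield (train, dev, test) index lists: one fold for test, the next fold
    for development, and the remaining eight folds (80%) for training."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        dev_k = (k + 1) % n_folds
        test = folds[k]
        dev = folds[dev_k]
        train = [i for j in range(n_folds) if j not in (k, dev_k)
                 for i in folds[j]]
        yield train, dev, test

splits = list(cross_validation_splits(100))
train, dev, test = splits[0]
print(len(train), len(dev), len(test))  # 80 10 10
```

Re-running the generator with ten different seeds would reproduce the "10-fold cross-validation ten times" protocol.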

Experiment Settings
We compare our KW-ATTN 1L and 2L models with a widely used attention mechanism that leverages only words (Yang et al., 2016; Ying et al., 2018), which we call ATTN. In addition, we use other proven approaches that leverage concept information: Concept-replace uses input documents in which raw words are replaced with the corresponding BabelNet/UMLS high-level concepts when mappings are available, as in (Stanovsky et al., 2017; Magumba et al., 2018). Concept-concat uses concatenation to combine word and concept embeddings, as in (Zhou et al., 2018). Attn-concat uses concatenation to combine a concept embedding with a hidden word representation and then applies ATTN. Attn-gating uses a gating mechanism to select salient features of a hidden word representation, conditioned on the concept information. Both Attn-concat and Attn-gating are state-of-the-art approaches presented by Margatina et al. (2019). All these methods are tested in the 1L and 2L settings.
The parameters for the RNN models are tuned on the development set (Mancini et al., 2016). 1 We randomly initialize word embeddings rather than using pretrained embeddings because our model often uses phrases recognized by the knowledge resources, which are usually not part of pretrained embeddings. We optimize parameters using Adam (Kingma and Ba, 2014) with epsilon 1e-08, decay 0.0, a mini-batch size of 32, and negative log-likelihood as the loss function. We use early stopping.
In addition, we conduct experiments with a pre-trained BERT word encoder (KW-BERT) to see whether injecting concepts also helps a model trained on large-scale corpora. We use the 'bert-base-uncased' model and set the dimension of the concept bi-GRU to 384, making the concept representation the same dimension as the BERT word representations. We report results from both frozen and fine-tuned models. The frozen models do not update the parameters of the pretrained model, i.e., they use pre-trained contextualized embeddings without fine-tuning. In contrast, fine-tuned BERT or KW-BERT is adapted to the target task. The learning rates for the frozen and fine-tuned models are 2e-3 and 1e-6, respectively.

Experiment Results
The results are shown in Table 4. First, we observe that 2L models do not perform better than 1L models. This could be because the 2L models are too large for the data sizes, especially for the need data. It may also indicate that the documents are short enough to fit in one RNN, so sentence boundaries might not be necessary for classification. Second, using concept information alone does not perform well in general, which indicates that concept information by itself is not sufficient. Using word and concept information together (Concept-concat) also does not always yield a performance gain. Third, the Attn-models generally perform better than the simpler Concept-models. However, KW-ATTN significantly improves over all other models on both tasks, indicating the effectiveness of our mechanism.
In addition, Table 4 shows that for the need task, while both types of concepts help the prediction, UMLS concepts help slightly more. This suggests that choosing the right knowledge resource, especially for domain specific tasks, is critical for prediction performance.
To see the effect of data size on the model, we compare KW-ATTN and ATTN across different data sizes of Yahoo answers (Table 5). KW-ATTN models consistently and significantly outperform ATTN models. However, as the data size grows, the performance gains, while still significant, diminish, showing that, in this domain, our method is more effective when the data is smaller.

Table 6 shows the comparison between BERT and KW-BERT. Additional concept information substantially improves performance on both datasets for the frozen models, whereas it only improves performance on the need dataset when fine-tuned. The results from the frozen models indicate that the encoded concepts provide information complementary to BERT. However, when fine-tuned, KW-BERT outperforms BERT only on the need dataset. This could be because BERT itself is trained on Wikipedia, which may lack medical knowledge. Although BERT learns task-specific knowledge during fine-tuning, the data is small, so the additional high-level concept information still helps. This suggests that KW-BERT could be more beneficial for small-data problems in domains that require more expert knowledge than Wikipedia can provide.
We can also notice that the frozen models perform poorly on the need dataset compared with the RNN models (Table 4), whereas they drastically outperform them on the Yahoo dataset. This could be because the documents in the need dataset are conversational, coming from an online forum, and thus markedly different from the Wikipedia data on which BERT is trained. When fine-tuned, both BERT and KW-BERT beat the RNN models, which suggests that fine-tuning allows learning task/domain-specific information.
Attention Analysis: To better understand why UMLS concepts help more on the need dataset, we plot the distributions of concept attentions for models with both annotations in Figure 2. Interestingly, the average attention per concept is greater for the model using BabelNet annotations than for the model using UMLS annotations. However, the maximum attention per concept is greater for UMLS annotations, which indicates that UMLS concepts are more actively used. Additionally, attentions from the model using UMLS concepts show lower variance. This result indicates that the model using UMLS concepts assigns similar attention to each concept, whereas the model using BabelNet concepts sometimes assigns small and sometimes large attention to a concept. In other words, the model using UMLS concepts consistently selects concepts to attend to, whereas the model using BabelNet concepts is less consistent. Intuitively, this makes sense, as the UMLS concepts are specific to the domain of the health information need detection task.

Human Evaluation on Interpretability
We use human evaluation to see whether the additional high-level concept information given by KW-ATTN is beneficial for interpretation. We compare the top-ranked attended words/concepts from KW-ATTN with the top-ranked attended words from ATTN, using Amazon Mechanical Turk (MTurk). Since we use crowdsourcing, we conduct the evaluation only on the Yahoo answers dataset for topic classification, which covers general domains.

Experiment Design
For each Human Intelligence Task (HIT) in MTurk, we provide a prediction and its explanation for a text, generated from either KW-ATTN 1L or ATTN 1L. 2 We use 1L because a single attention layer is simpler to interpret. We then ask whether MTurkers would assign the given topic to the text based on the given explanation. Only one explanation is randomly chosen, and which model the explanation comes from is not shown to MTurkers. Additionally, we ask them to rate their confidence in their answer.

Explanation Type  Example
No concept        "java, yields, best, language, results, built"
KW same number    "java as a(n) object-oriented_programming_language, ide as a(n) application, php as a(n) free_software, swing, best, looking"
KW same length    "java as a(n) object-oriented_programming_language, ide as a(n) application, php as a(n) free_software"
KW replacement    "object-oriented_programming_language, application, free_software, swing, best, looking"

We assume that attention can be used for prediction explanations, based on (Wiegreffe and Pinter, 2019; Serrano and Smith, 2019). We choose to ask about the validity of a given prediction, unlike prior work that asked annotators to guess a model's prediction based on an explanation (Nguyen, 2018; Chen et al., 2020). Although we acknowledge that showing the model's prediction may bias the annotators, we choose this approach because humans already have high-level concepts as background knowledge: they do not need additional external concept information to guess the correct topic label among multiple options, especially when the given options are distinct from each other. For example, even if the high-level concept "athletics" is not given for the word "baseball" in an explanation, humans have no trouble classifying it into the sports category when the topic options are sports and music. However, high-level concepts may give users more confidence when interpreting the explanation for a given topic. Therefore, we evaluate users' trust in the system indirectly by asking them to assess a given topic based on an explanation and to rate their confidence.
The top 6 features (words and concepts) with the highest attention weights are selected as an explanation. The high-level concept of a word is included in the explanation in the format "[word] as a(n) [concept]" only when the balancing weight p for the concept is non-zero (see Section 3.2). We remove stopwords and punctuation from explanations.
Four different types of explanations are given to MTurkers and compared in our analysis, as shown in Table 7. A no-concept explanation consists of 6 words. A KW-same-number explanation also contains 6 words, plus their corresponding concepts when they exist. A KW-same-length explanation is composed of 3 words, plus their corresponding concepts when they exist. A KW-replacement explanation consists of 6 words or concepts: when a word has a lower attention value than its corresponding concept according to the p attention value, it is replaced by its concept in the explanation. Note that all KW explanations come from the same model using KW-ATTN, and no-concept explanations come from a model using ATTN.
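The construction of a KW-style explanation from the attention outputs can be sketched roughly as follows; the feature tuples (word, concept, combined attention α, balancing attention p) are toy values for illustration:

```python
def build_explanation(features, k=6):
    """features: (word, concept_or_None, alpha, p) tuples from the attention
    layer. Keep the top-k features by combined attention alpha, rendering
    'word as a(n) concept' whenever the balancing weight p is non-zero."""
    top = sorted(features, key=lambda f: f[2], reverse=True)[:k]
    parts = []
    for word, concept, alpha, p in top:
        if concept is not None and p > 0:
            parts.append(f"{word} as a(n) {concept}")
        else:
            parts.append(word)
    return ", ".join(parts)

# Toy attention outputs, loosely modeled on the Table 7 example.
feats = [("java", "object-oriented_programming_language", 0.30, 0.6),
         ("swing", None, 0.20, 0.0),
         ("best", None, 0.15, 0.0),
         ("yields", None, 0.05, 0.0)]
print(build_explanation(feats, k=3))
# java as a(n) object-oriented_programming_language, swing, best
```

With k=6 this yields the KW-same-number format; restricting to 3 words gives KW-same-length, and keeping only the higher-weighted member of each word-concept pair gives KW-replacement.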
We randomly pick 200 samples that are correctly predicted by both systems: 100 samples with a prediction probability above .90 for the predicted label, and 100 samples with a prediction probability between .80 and .90. To balance topics, we pick an equal number of samples for each topic. We do not run the same MTurk task on incorrectly predicted samples because when a system makes an incorrect prediction, assessing interpretability is not straightforward; there can be multiple different reasons for the wrong prediction.
For MTurk, each HIT asks questions about an explanation generated by a system for one sample, as shown in Figure 3. For each HIT, 5 MTurkers participate. We hire North American Master MTurkers with HIT acceptance rates above 98% in order to ensure high quality of the evaluation. We pay $0.03-$0.05 for each HIT.

Human Evaluation Results
As shown in Table 8, KW-same-number and KW-same-length explanations resulted in significantly higher confidence in assigning the given topics to explanations compared to no-concept explanations. This indicates that the additional high-level concept information from KW-ATTN is beneficial for improving interpretability. We also observe that KW-replacement explanations improve confidence, although the gain is not significant.
It is important to note that KW-same-length and KW-replacement explanations, like KW-same-number, improve interpretability over no-concept explanations. While KW-same-number explanations provide more information (at most 12 items in total, counting both words and concepts), KW-same-length and KW-replacement give the same amount of information as no-concept or less (at most 6 items in total). This indicates that the high-level concept information itself really helps.

Conclusion
We presented a new attention mechanism, KW-ATTN, which extends a NN model by incorporating high-level concepts. Our experiments showed that using high-level concept information improves predictive power by alleviating the data sparseness problem in small data. Furthermore, in our crowdsourcing experiments, we found a significant improvement in the confidence of human evaluators on predictions, suggesting that our new attention mechanism provides benefits in explaining the predictions. High-level concepts provide an additional layer of information above raw words that can assist in understanding predictions, and our attention mechanism can distinguish between the importance of words vs. concepts, providing further information. We are optimistic that KW-ATTN can be applied widely. Figure 3 shows a screenshot of the Amazon Mechanical Turk user interface used in our human evaluation.