Attention-based Conditioning Methods for External Knowledge Integration

In this paper, we present a novel approach for incorporating external knowledge in Recurrent Neural Networks (RNNs). We propose the integration of lexicon features into the self-attention mechanism of RNN-based architectures. This form of conditioning on the attention distribution enforces the contribution of the most salient words for the task at hand. We introduce three methods, namely attentional concatenation, feature-based gating and affine transformation. Experiments on six benchmark datasets show the effectiveness of our methods. Attentional feature-based gating yields consistent performance improvement across tasks. Our approach is implemented as a simple add-on module for RNN-based models with minimal computational overhead and can be adapted to any deep neural architecture.


Introduction
Modern deep learning algorithms often do away with feature engineering and learn latent representations directly from raw data that are given as input to Deep Neural Networks (DNNs) (McCann et al., 2017; Peters et al., 2018). However, it has been shown that linguistic knowledge (manually or semi-automatically encoded into lexicons and knowledge bases) can significantly improve DNN performance for Natural Language Processing (NLP) tasks, such as natural language inference (Mrkšić et al., 2017), language modelling (Ahn et al., 2016), named entity recognition (Ghaddar and Langlais, 2018) and relation extraction (Vashishth et al., 2018).
For NLP tasks, external sources of information are typically incorporated into deep neural architectures by processing the raw input in the context of such external linguistic knowledge. In machine learning, this contextual processing is known as conditioning; the computation carried out by a model is conditioned or modulated by information extracted from an auxiliary input. The most commonly-used method of conditioning is concatenating a representation of the external information to the input or hidden network layers.
In this work, we propose a novel way of utilizing word-level prior information encoded in linguistic, sentiment, and emotion lexicons to improve classification performance. Usually, lexicon features are concatenated to word-level representations (Trotzek et al., 2018), as additional features to the embedding of each word or the hidden states of the model. By contrast, we propose to incorporate them into the self-attention mechanism of RNNs. Our goal is to enable the self-attention mechanism to identify the most informative words by directly conditioning on their additional lexicon features.
Our contributions are the following: (1) we propose an alternative way of incorporating external knowledge into RNN-based architectures, (2) we present empirical results showing that our proposed approach consistently outperforms strong baselines, and (3) we report state-of-the-art performance on two datasets. We make our source code publicly available.

Related Work
In the traditional machine learning literature where statistical models are based on sparse features, affective lexicons have been shown to be highly effective for tasks such as sentiment analysis, as they provide additional information not captured in the raw training data (Hu and Liu, 2004; Kim and Hovy, 2004; Ding et al., 2008; Yu and Dredze, 2014; Taboada et al., 2011). After the emergence of pretrained word representations (Pennington et al., 2014), the use of lexicons is no longer common practice, since word embeddings can also capture some of the affective meaning of these words.
Recently, there have been notable contributions towards integrating linguistic knowledge into DNNs for various NLP tasks. For sentiment analysis, Teng et al. (2016) integrate lexicon features into an RNN-based model with a custom weighted-sum calculation of word features. Shin et al. (2017) propose three methods of lexicon integration specific to convolutional neural networks, achieving state-of-the-art performance on two datasets. Kumar et al. (2018) concatenate features from a knowledge base to word representations in an attentive bidirectional LSTM architecture, also reporting state-of-the-art results. For sarcasm detection, psycholinguistic, stylistic, structural, and readability features have been incorporated by concatenating them to paragraph-level and document-level representations.
Furthermore, there is limited literature regarding the development and evaluation of methods for combining representations in deep neural networks. Peters et al. (2017) claim that concatenation, non-linear mapping and attention-like mechanisms are unexplored methods for including language model representations in their sequence model. They employ simple concatenation, leaving the exploration of other methods to future work. Dumoulin et al. (2018) provide an overview of feature-wise transformations such as concatenation-based conditioning, conditional biasing and gating mechanisms. They review the effectiveness of conditioning methods in tasks such as visual question answering (Strub et al., 2018), style transfer (Dumoulin et al., 2017) and language modeling (Dauphin et al., 2017). They also extend the work by Perez et al. (2017), which proposes the Feature-wise Linear Modulation (FiLM) framework, and investigate its applications in visual reasoning tasks. Balazs and Matsuo (2019) provide an empirical study showing the effects of different ways of combining character and word representations in word-level and sentence-level evaluation tasks. One of the reported findings is that gating conditioning performs consistently better across a variety of word similarity and relatedness tasks.

Proposed Model

Network Architecture
Word Embedding Layer. The input sequence of words $w_1, w_2, \ldots, w_T$ is projected to a low-dimensional vector space $\mathbb{R}^W$, where $W$ is the size of the embedding layer and $T$ the number of words in a sentence. We initialize the weights of the embedding layer with pretrained word embeddings.
LSTM Layer. A Long Short-Term Memory unit (LSTM) (Hochreiter and Schmidhuber, 1997) takes as input the words of a sentence and produces the word annotations $h_1, h_2, \ldots, h_T$, where $h_i$ is the hidden state of the LSTM at time-step $i$, summarizing all sentence information up to $w_i$.
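Self-Attention Layer. We use a self-attention mechanism (Cheng et al., 2016) to find the relative importance of each word for the task at hand. The attention mechanism assigns a score $a_i$ to each word annotation $h_i$. We compute the fixed representation $r$ of the input sequence as the weighted sum of all the word annotations. Formally:
$$ a_i = \mathrm{softmax}\left(v_a^\top f(h_i)\right) \tag{1} $$
$$ r = \sum_{i=1}^{T} a_i h_i \tag{2} $$
where $f(\cdot)$ corresponds to a non-linear transformation $\tanh(W_a h_i + b_a)$ and $W_a$, $b_a$, $v_a$ are the parameters of the attention layer.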
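The self-attention layer described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the PyTorch implementation used in the experiments. It assumes a tanh scoring layer followed by a softmax over the words of the sentence; the helper names (`matvec`, `softmax`, `self_attention`) and the toy values are hypothetical.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(H, W_a, b_a, v_a):
    """Return the attention weights a and the sentence representation r.

    H   : list of T word annotations h_i, each a list of d floats
    W_a : d x d weight matrix, b_a : bias of size d, v_a : vector of size d
    """
    scores = []
    for h in H:
        # Assumed scoring layer: f(h_i) = tanh(W_a h_i + b_a)
        f_h = [math.tanh(z + b) for z, b in zip(matvec(W_a, h), b_a)]
        # Unnormalized score v_a^T f(h_i)
        scores.append(sum(v * f for v, f in zip(v_a, f_h)))
    a = softmax(scores)
    # r is the attention-weighted sum of the word annotations
    r = [sum(a_i * h[k] for a_i, h in zip(a, H)) for k in range(len(H[0]))]
    return a, r

# Toy example: T = 3 words, d = 2 hidden units (arbitrary values).
H = [[0.1, 0.3], [0.9, -0.4], [0.2, 0.2]]
W_a = [[1.0, 0.0], [0.0, 1.0]]
b_a = [0.0, 0.0]
v_a = [1.0, 1.0]
a, r = self_attention(H, W_a, b_a, v_a)
```

The weights `a` form a probability distribution over the words, so `r` is a convex combination of the annotations in `H`.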

External Knowledge
In this work, we augment our models with existing linguistic and affective knowledge from human experts. Specifically, we leverage lexicons containing psycho-linguistic, sentiment and emotion annotations. We construct a feature vector $c(w_i)$ for every word in the vocabulary by concatenating the word's annotations from the lexicons shown in Table 1. For words that are missing from a lexicon, we set the corresponding dimension(s) of $c(w_i)$ to zero.
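The construction of the lexicon feature vector can be illustrated as follows. This is a sketch under stated assumptions: the two toy lexicons, their dimensions and their scores are invented for illustration and do not correspond to the resources in Table 1, and `build_feature_vector` is a hypothetical helper name.

```python
def build_feature_vector(word, lexicons):
    """Concatenate a word's annotations from every lexicon.

    lexicons: list of (entries, dim) pairs, where entries maps a word to a
    list of dim floats. A word absent from a lexicon contributes dim zeros.
    """
    features = []
    for entries, dim in lexicons:
        features.extend(entries.get(word, [0.0] * dim))
    return features

# Hypothetical toy lexicons (not the real resources used in the paper).
sentiment = ({"good": [0.8], "awful": [-0.9]}, 1)
emotion = ({"awful": [0.7, 0.1], "happy": [0.0, 0.9]}, 2)
lexicons = [sentiment, emotion]

c_good = build_feature_vector("good", lexicons)    # [0.8, 0.0, 0.0]
c_awful = build_feature_vector("awful", lexicons)  # [-0.9, 0.7, 0.1]
c_table = build_feature_vector("table", lexicons)  # [0.0, 0.0, 0.0]
```

Every word receives a vector of the same length, which is what allows the vectors to be fed to a fixed-size layer.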

Conditional Attention Mechanism
We extend the standard self-attention mechanism (Eq. 1, 2) in order to condition the attention distribution of a given sentence on each word's prior lexical information. To this end, we use as input to the self-attention layer both the word annotation $h_i$ and the lexicon feature vector $c(w_i)$ of each word. Therefore, we replace the scoring function $f(h_i)$ in the attention equation (Eq. 1) with a conditioning function $f(h_i, c(w_i))$. Specifically, we explore three conditioning methods, which are illustrated in Figure 1. We refer to the conditioning function as $f_i(\cdot)$, the weight matrix as $W_i$ and the biases as $b_i$, where $i$ is an indicative letter for each method. We present our results in Section 5 (Table 3) and we denote the three conditioning methods as "conc.", "gate" and "affine" respectively.
Attentional Concatenation. In this approach, as illustrated in Fig. 1(a), we learn a function of the concatenation of each word annotation $h_i$ with its lexicon features $c(w_i)$. The intuition is that by adding extra dimensions to $h_i$, learned representations are more discriminative. Concretely:
$$ f_c(h_i, c(w_i)) = \tanh\left(W_c [h_i \,\|\, c(w_i)] + b_c\right) \tag{3} $$
where $\|$ denotes the concatenation operation and $W_c$, $b_c$ are learnable parameters.
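A minimal sketch of attentional concatenation follows, again in plain Python for readability. It assumes a tanh non-linearity over the concatenated vector and a scoring vector `v_a` as in the standard attention layer; the function names and all numeric values are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def f_concat(h, c, W_c, b_c):
    """Assumed form tanh(W_c [h ; c] + b_c) of the conditioning function."""
    x = h + c  # list concatenation yields [h ; c]
    return [math.tanh(sum(w * x_j for w, x_j in zip(row, x)) + b)
            for row, b in zip(W_c, b_c)]

def conditioned_attention(H, C, f, v_a):
    """Score each word with v_a^T f(h_i, c(w_i)); return weights and r."""
    scores = [sum(v * z for v, z in zip(v_a, f(h, c))) for h, c in zip(H, C)]
    a = softmax(scores)
    r = [sum(a_i * h[k] for a_i, h in zip(a, H)) for k in range(len(H[0]))]
    return a, r

# Toy example: d = 2 hidden units, 1 lexicon dimension, so W_c is 2 x 3.
H = [[0.1, 0.3], [0.9, -0.4], [0.2, 0.2]]
C = [[0.0], [1.0], [0.0]]  # only the second word has a lexicon entry
W_c = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
b_c = [0.0, 0.0]
v_a = [1.0, 1.0]
a, r = conditioned_attention(H, C, lambda h, c: f_concat(h, c, W_c, b_c), v_a)
```

With these hand-picked weights the lexicon dimension raises the score of the second word, so it receives the largest attention weight. Note that the lexicon features influence only the attention distribution; the representation `r` is still a weighted sum of the original annotations.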
Attentional Feature-based Gating. The second approach, illustrated in Fig. 1(b), learns a feature mask, which is applied on each word annotation $h_i$. Specifically, a gate mechanism with a sigmoid activation function generates a mask-vector from each $c(w_i)$ with values between 0 and 1 (black and white dots in Fig. 1(b)). Intuitively, this gating mechanism selects salient dimensions (i.e. features) of $h_i$, conditioned on the lexical information. Formally:
$$ f_g(h_i, c(w_i)) = \sigma\left(W_g\, c(w_i) + b_g\right) \odot h_i \tag{4} $$
where $\odot$ denotes element-wise multiplication and $W_g$, $b_g$ are learnable parameters.
Attentional Affine Transformation. The third approach, shown in Fig. 1(c), is adopted from the work of Perez et al. (2017) and applies a feature-wise affine transformation to the latent space of the hidden states. Specifically, we use the lexicon features $c(w_i)$ in order to conditionally generate the corresponding scaling $\gamma(\cdot)$ and shifting $\beta(\cdot)$ vectors. Concretely:
$$ f_a(h_i, c(w_i)) = \gamma(c(w_i)) \odot h_i + \beta(c(w_i)) \tag{5} $$
$$ \gamma(x) = W_\gamma x + b_\gamma, \qquad \beta(x) = W_\beta x + b_\beta \tag{6} $$
where $W_\gamma$, $W_\beta$, $b_\gamma$, $b_\beta$ are learnable parameters.
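The gating and affine conditioning functions can be sketched side by side. This is an illustrative plain-Python version, assuming a sigmoid mask for gating and linear scaling and shifting layers for the affine transformation; the function names and toy parameters are hypothetical, and the outputs would take the place of the unconditioned scoring input in the attention layer.

```python
import math

def linear(W, x, b):
    """Compute W x + b for a matrix W (list of rows) and vectors x, b."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_k
            for row, b_k in zip(W, b)]

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def f_gate(h, c, W_g, b_g):
    """Feature-based gating: a mask in (0, 1) computed from c scales h."""
    mask = [sigmoid(z) for z in linear(W_g, c, b_g)]
    return [m * h_k for m, h_k in zip(mask, h)]

def f_affine(h, c, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise affine transformation: gamma(c) * h + beta(c)."""
    gamma = linear(W_gamma, c, b_gamma)
    beta = linear(W_beta, c, b_beta)
    return [g * h_k + b for g, h_k, b in zip(gamma, h, beta)]

# Toy example: d = 2 hidden units, 2 lexicon dimensions.
h = [0.5, -1.0]
c = [1.0, 0.0]

# All-zero gate parameters give a constant mask of 0.5.
gated = f_gate(h, c, [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
# gated == [0.25, -0.5]

# gamma(c) = [2.0, 1.0] and beta(c) = [0.0, 1.0] for these parameters.
shifted = f_affine(h, c,
                   [[2.0, 0.0], [0.0, 0.0]], [0.0, 1.0],
                   [[0.0, 0.0], [1.0, 0.0]], [0.0, 0.0])
# shifted == [1.0, 0.0]
```

The gate can only shrink each dimension of `h`, whereas the affine variant can rescale it by any factor and shift it, which makes it the more expressive of the two.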

Baselines
We employ two baselines. The first baseline is an LSTM-based architecture augmented with a self-attention mechanism (Sec. 3.1) with no external knowledge. The second baseline incorporates lexicon information by concatenating the $c(w_i)$ vectors to the word representations in the embedding layer. In Table 3 we use the abbreviations "baseline" and "emb. conc." for the two baseline models respectively.

Experiments
Lexicon Features. As prior knowledge, we leverage the lexicons presented in Table 1. We selected widely-used lexicons that represent different facets of affective and psycho-linguistic features, including LIWC (Tausczik and Pennebaker).
Datasets. The proposed framework can be applied to different domains and tasks. In this paper, we experiment with sentiment analysis, emotion recognition, irony, and sarcasm detection. Details of the benchmark datasets are shown in Table 2.
Preprocessing. To preprocess the words, we use the tool Ekphrasis.
Experimental Setup. For all methods, we employ a single-layer LSTM model with 300 neurons augmented with a self-attention mechanism, as described in Section 3. As regularization techniques, we apply early stopping, Gaussian noise $\mathcal{N}(0, 0.1)$ to the word embedding layer, and dropout to the LSTM layer with $p = 0.2$. We use Adam to optimize our networks (Kingma and Ba, 2014) with mini-batches of size 64 and clip the norm of the gradients (Pascanu et al., 2013) at 0.5, as an extra safety measure against exploding gradients. We also use PyTorch (Paszke et al., 2017) and scikit-learn (Pedregosa et al., 2011).
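Two of the regularization steps in the experimental setup, Gaussian noise on the embeddings and gradient norm clipping, can be illustrated with a short stdlib-only sketch. It is not the PyTorch code used in the experiments. It assumes that the second parameter of the noise distribution is the standard deviation, which the text does not specify, and the function names are hypothetical.

```python
import math
import random

def add_gaussian_noise(embedding, sigma, rng):
    """Add zero-mean Gaussian noise to each embedding dimension.

    Assumption: sigma is a standard deviation (the text gives N(0, 0.1)
    without stating whether 0.1 is a standard deviation or a variance).
    """
    return [x + rng.gauss(0.0, sigma) for x in embedding]

def clip_grad_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        grads = [g * scale for g in grads]
    return grads, norm

rng = random.Random(0)  # fixed seed so the example is reproducible
noisy = add_gaussian_noise([0.0, 0.0, 0.0, 0.0], 0.1, rng)

# A gradient of norm 5.0 is scaled down to norm 0.5, keeping its direction.
clipped, original_norm = clip_grad_norm([3.0, 4.0], 0.5)
clipped_norm = math.sqrt(sum(g * g for g in clipped))
```

Clipping changes only the magnitude of the update, not its direction, which is why it acts as a safeguard against exploding gradients without biasing the descent direction. Noise would be applied during training only.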

Results & Analysis
We compare the performance of the three proposed conditioning methods with the two baselines and the state-of-the-art in Table 3. We also provide results for the combination of our best method, attentional feature-based gating, and the second baseline model (emb. conc.).
The results show that incorporating external knowledge in RNN-based architectures consistently improves performance over the baseline for all datasets. Furthermore, feature-based gating improves upon baseline concatenation in the embedding layer across benchmarks, with the exception of the PsychExp dataset.
Figure 2: Attention heatmap of a PsychExp random test sample. The first attention distribution is created with the baseline model without lexicon feature integration, while the second with the combination of our attentional feature-based gating method and the concatenation to word embeddings baseline (gate+emb.conc.).
For the Sent17 dataset we achieve state-of-the-art $F_1$ score using the feature-based gating method; we further improve performance when combining gating with the emb. conc. method. For SST-5, we observe a significant performance boost with combined attentional gating and embedding conditioning (gate + emb. conc.). For PsychExp, we marginally outperform the state-of-the-art also with the combined method, while for Irony18, feature-based gating yields the best results. Finally, concatenation-based conditioning is the top method for SCv1, and the combination method for SCv2.
Overall, attentional feature-based gating is the best performing conditioning method, followed by concatenation. Attentional affine transformation underperforms, especially on smaller datasets; this is probably due to the higher capacity of this model. This is particularly interesting since gating (Eq. 4) is a special case of the affine transformation method (Eq. 5), where the shifting vector $\beta$ is zero and the scaling vector $\gamma$ is bounded to the range [0, 1] (Eq. 6). Interestingly, the combination of gating with traditional embedding-layer concatenation gives additional performance gains for most tasks, indicating that there are synergies to exploit between different conditioning methods.
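The special-case relation between gating and the affine transformation can be written out explicitly. The derivation below assumes the functional forms described in the Conditional Attention Mechanism section: a sigmoid mask multiplied element-wise with $h_i$ for gating, and a scale-and-shift of $h_i$ for the affine method. Strictly speaking, the sigmoid output lies in the open interval (0, 1).

```latex
\begin{align*}
f_a(h_i, c(w_i)) &= \gamma(c(w_i)) \odot h_i + \beta(c(w_i)) \\
\intertext{Setting $\beta(c(w_i)) = 0$ and
           $\gamma(c(w_i)) = \sigma(W_g\, c(w_i) + b_g)$ gives}
f_a(h_i, c(w_i)) &= \sigma(W_g\, c(w_i) + b_g) \odot h_i = f_g(h_i, c(w_i)).
\end{align*}
```

Under these assumptions, gating removes the shift and constrains the scale, which is consistent with the capacity argument above: the affine method has roughly twice as many conditioning parameters and no bound on the scaling vector.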
In addition to the performance improvements, we can visually evaluate the effect of conditioning the attention distribution on prior knowledge and improve the interpretability of our approach. As we can see in Figure 2, lexicon features guide the model to attend to more salient words and thus predict the correct class.

Conclusions & Future work
We introduce three novel attention-based conditioning methods and compare their effectiveness with traditional concatenation-based conditioning. Our methods are simple, yet effective, achieving consistent performance improvement for all datasets. Our approach can be applied to any RNN-based architecture as an extra module to further improve performance with minimal computational overhead.
In the future, we aim to incorporate more elaborate linguistic resources (e.g. knowledge bases) and to investigate the performance of our methods on more complex NLP tasks, such as named entity recognition and sequence labelling, where prior knowledge integration is an active area of research.