Multi-input Recurrent Independent Mechanisms for leveraging knowledge sources: Case studies on sentiment analysis and health text mining

This paper presents a way to inject and leverage existing knowledge from external sources in a deep learning environment, extending the recently proposed Recurrent Independent Mechanisms (RIMs) architecture, which comprises a set of interacting yet independent modules. We show that this extension of the RIMs architecture is an effective framework with lower parameter cost compared to purely fine-tuned systems.


Introduction
Deep neural networks have been successfully applied to a variety of natural language processing tasks such as text classification, sequence labeling, and sequence generation. Deep architectures are often non-modular, homogeneous systems trained end-to-end. End-to-end training is performed with the hope that the structure of a network is sufficient to direct gradient descent from a random initial state to a highly non-trivial solution (Glasmachers, 2017).
An important issue with end-to-end training is that, throughout the training of a system composed of several layers, valuable information contained in the problem decomposition that resulted in a specific network design is ignored (Glasmachers, 2017). In non-modular systems, explicit decomposition of high-level tasks into distinct subprocesses is not possible, and the necessary complexity has to be induced through the complexity of the input stimulus. This results in large systems for which the required number of training samples becomes intractable. Interpretation of these black-box systems is difficult (Miikkulainen and Dyer, 1991).
In compositional systems, in contrast, smaller modules encode specialized expertise known to impact one aspect of the task at hand. The aggregation of the modules acts synergistically to address the overall task. In a modular system, the components act largely independently but communicate occasionally. Module autonomy is crucial because, in the case of distributional shifts (significant changes in some modules), the other modules should remain robust (Schölkopf et al., 2012; Goyal et al., 2019). Modules also need to interact occasionally to achieve compositional behavior (Bengio, 2017).
Many current neural modular systems, such as EntNet (Henaff et al., 2017) and IndRNN, offer only module independence, but no module communication. The recently proposed Recurrent Independent Mechanisms (RIMs) (Goyal et al., 2019), however, model a complex system by dividing the overall model into M communicating recurrent modules.
Deep architectures often rely solely on raw data in large quantities, with the requirement that the data be representative of the task. This becomes problematic for tasks with a specialized, low-frequency terminology, where high-quality knowledge sources for NLP and AI are often available and have proven their effectiveness. Embedding expert knowledge into extended pre-trained word embeddings is costly. We present untied, independent modules to embed knowledge from different sources into the system's input. Knowledge sources, acting as independent experts, provide different annotations (abstractions) of the input, combining various classifications for solving the task.
For instance, providing sentiment lexica for sentiment analysis reduces the demand for training data by expanding the limited training vocabulary with an extended set of annotated terms. Precompiled word embeddings are to be considered knowledge sources in the same spirit and we demonstrate that they inter-operate with a variety of other knowledge sources such as gazetteers and POS encoding.
(1) This is an absurd comedy about alienation, separation and loss.
Figure 1 shows annotations from different knowledge sources for Example 1, such as tokenization (from the ANNIE tokenizer), POS tags (from the Stanford POS tagger), and sentiment annotations from three sentiment lexica (AFINN (Nielsen, 2011), MPQA (Wilson et al., 2005), and NRC (Mohammad et al., 2013)). The annotations of the different sentiment lexica in Figure 1 vary substantially: comedy is classified as positive (+1) in AFINN, as negative in MPQA, and as almost neutral in NRC. Özdemir and Bergler (2015a) showed that this variance in judgements is not prohibitive; in fact, Özdemir and Bergler (2015b) showed that combining 5 sentiment lexica outperformed all other combinations. These differences are in fact advantageous in an ensemble setting and reflect diversity among experts. The differences cannot be exploited when a single embedding is used for tokens, but may be retained when different lexica are embedded independently in different modules.
We add input independence to the RIMs architecture, providing different language annotations as inputs to a set of independent but interacting modules. The resulting system is a flexible modular architecture for leveraging token-level knowledge in the form of different annotation embeddings, which are given different weights for the task at hand depending on their usefulness during training (see Figure 11). The system is evaluated on sentiment analysis tasks and on the analysis of health-related tweets for different health concerns.
Our experiments demonstrate that leveraging knowledge sources under a modular framework consistently improves performance with little increase in parameter space. Additionally, when frozen language models are supplemented with knowledge sources, the drop in performance is minimal, making this technique particularly beneficial for users who do not have access to powerful computational resources. Lastly, the modular nature of the system allows us to visualize the model's functionality.

RIMs
Recurrent Independent Mechanisms (RIMs) is a modular architecture that models a dynamic system by dividing it into $M$ recurrent modules (Goyal et al., 2019). At time-step $t$, each module $R_m$ ($m = 1, \dots, M$) has a hidden state $h^m_t \in \mathbb{R}^{d_h}$.
Input selection Each module $R_m$ receives the augmented input $X_t = x_t \oplus 0$, where $0$ is an all-zero (null) vector and $\oplus$ denotes row-level concatenation. Using an attention mechanism, module $R_m$ then selects its input:

$$A^m_t = \mathrm{softmax}\!\left(\frac{h^m_{t-1} W^{query}_m \left(X_t W^{key}\right)^{\top}}{\sqrt{d^{in}_{key}}}\right) X_t W^{val} \quad (1)$$

where $h^m_{t-1} W^{query}_m$ is the query, $X_t W^{key}$ is the key, and $X_t W^{val}$ is the value of the attention mechanism (Vaswani et al., 2017). The matrices $W^{query}_m \in \mathbb{R}^{d_h \times d^{in}_{query}}$, $W^{key} \in \mathbb{R}^{d_{in} \times d^{in}_{key}}$, and $W^{val} \in \mathbb{R}^{d_{in} \times d^{in}_{val}}$ are linear transformations for constructing the query, key, and value of the input-selection attention. If the input $x_t$ is relevant to module $R_m$, the attention mechanism in Equation 1 assigns more weight to it (selects it); otherwise, more weight is assigned to the null input (Goyal et al., 2019).
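To make the input-selection step concrete, here is a minimal NumPy sketch of Equation 1 for a single module. All names, dimensions, and the scaled-dot-product form are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def input_selection(h_prev, x_t, Wq, Wk, Wv):
    """Sketch of Equation 1 for one module R_m.
    h_prev: (d_h,) previous hidden state; x_t: (d_in,) current input.
    Returns the selected input A_t and the attention weights over
    [real input, null input]."""
    X = np.stack([x_t, np.zeros_like(x_t)])        # augmented input x_t ⊕ 0
    q = h_prev @ Wq                                # query from the hidden state
    K = X @ Wk                                     # keys,   shape (2, d_key)
    V = X @ Wv                                     # values, shape (2, d_val)
    att = softmax(q @ K.T / np.sqrt(K.shape[1]))   # weights over {input, null}
    return att @ V, att
```

Because the null row is all zeros, its value vector is zero as well, so the weight placed on the null input directly scales down the module's effective input.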
The softmax values of Equation 1 determine a set $S_t$ of the top $m_{Active}$ modules: among the $M$ modules, those with the least attention on the null input are the active modules. The selected input $A^m_t$ determines a temporary hidden state $\tilde{h}^m_t$ for the active modules:

$$\tilde{h}^m_t = R_m\!\left(A^m_t, h^m_{t-1}\right), \quad m \in S_t \quad (2)$$

while the hidden states of the inactive modules $R_m$ ($m \notin S_t$) remain unchanged:

$$\tilde{h}^m_t = h^m_{t-1}, \quad m \notin S_t \quad (3)$$

Module communication To obtain the actual hidden states $h^m_t$, the active modules communicate using an attention mechanism:

$$h^m_t = \tilde{h}^m_t + \mathrm{softmax}\!\left(\frac{\tilde{h}^m_t W^{query}_{com} K_{t,:}^{\top}}{\sqrt{d^{com}_{key}}}\right) V_{t,:}, \quad m \in S_t \quad (4)$$

where $K_{t,:} = \tilde{h}_t W^{key}_{com}$ and $V_{t,:} = \tilde{h}_t W^{val}_{com}$. The matrices $W^{query}_{com}$, $W^{key}_{com}$, and $W^{val}_{com}$ are used for constructing the query, key, and value of the communication attention. Note that both the key $K_{t,:}$ and the value $V_{t,:}$ depend on the temporary hidden states of all modules; therefore $h^m_t$ in Equation 4 is determined by attending to all modules. The overall hidden state of the RIMs model at time-step $t$ is $h_t = h^1_t \oplus \dots \oplus h^M_t$, the concatenation of the hidden states of all modules.
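The activation and communication steps can be sketched as follows in NumPy. The residual update and the ascending sort on null-input attention follow the RIMs description; the concrete shapes and names are hypothetical:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def active_set(null_attention, m_active):
    """S_t: the m_Active modules with the LEAST attention on the null input."""
    return set(np.argsort(null_attention)[:m_active].tolist())

def communicate(h_tmp, active, Wq, Wk, Wv):
    """Sketch of the communication attention (Equation 4): each active
    module attends over the temporary states of ALL modules and adds the
    result to its own temporary state; inactive modules keep their
    temporary state, which already equals h^m_{t-1} (Equation 3)."""
    K = h_tmp @ Wk                     # (M, d_key)
    V = h_tmp @ Wv                     # (M, d_h): values live in state space
    h_new = h_tmp.copy()
    for m in active:
        q = h_tmp[m] @ Wq
        a = softmax(q @ K.T / np.sqrt(K.shape[1]))
        h_new[m] = h_tmp[m] + a @ V    # residual update for active modules
    return h_new
```

Only the rows indexed by the active set change; the rest pass through untouched, which is what makes the per-module activation patterns of Section 6 observable.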
Classification We choose a simple attention layer together with a classifier to obtain an appropriate vector representation of a given sample. Attention (Bahdanau et al., 2015) determines importance scores $e_t = w^{\top}_{att} h_t$ using a latent context vector $w_{att}$. The scores are then normalized as $\alpha_t = \frac{\exp(e_t)}{\sum_j \exp(e_j)}$ for a weighted sum $H = \sum_t \alpha_t h_t$, which is the input to the classifier.
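This attention-pooling layer is small enough to sketch directly; the function below is an illustrative NumPy version, not the paper's code:

```python
import numpy as np

def attention_pool(H, w_att):
    """Attention pooling over a sequence of hidden states.
    H: (T, d) RIMs hidden states over time; w_att: (d,) context vector.
    Computes e_t = w_att^T h_t, alpha = softmax(e), returns sum_t alpha_t h_t."""
    e = H @ w_att                 # importance score per time step, shape (T,)
    a = np.exp(e - e.max())       # stable softmax normalization
    a = a / a.sum()
    return a @ H, a               # pooled vector (d,) and the weights (T,)
```

The pooled vector is a convex combination of the per-step hidden states, so every coordinate stays within the range spanned by the sequence.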

Multi-input RIMs
We extend this architecture to so-called multi-input RIMs, which consist of a set of $M$ modules, similar to standard RIMs. The standard RIMs model assumes the same input sequence for all modules ($X_t$ in Equation 1), which share the same linear transformation matrices $W^{key}$ and $W^{val}$ for constructing the keys and values of the attention mechanism. In contrast, we untie the input attention mechanism and consider dedicated linear transformations $W^{key}_m$ and $W^{val}_m$ for each module $R_m$. Untying the attention mechanism allows modules to have different inputs $X^m_t$ ($m = 1, \dots, M$), each potentially with a different dimensionality. This supports our use of each module to encode a different knowledge source, one being word embeddings, another a gazetteer list, etc. The input selection mechanism of Equation 1 then expands to Equation 5:

$$A^m_t = \mathrm{softmax}\!\left(\frac{h^m_{t-1} W^{query}_m \left(X^m_t W^{key}_m\right)^{\top}}{\sqrt{d^{in}_{key}}}\right) X^m_t W^{val}_m \quad (5)$$

where $X^m_t = x^m_t \oplus 0$. In Equation 5, the softmax produces two attention scores: how much module $R_m$ attends to the input $x^m_t$ and to the null input $0$. The top $m_{Active}$ modules with the least attention scores on the null input form the set $S_t$. The temporary hidden states for active modules are determined by Equation 2 and modules communicate according to Equation 4, identical to standard RIMs. An illustration of the multi-input RIMs model is provided in Figure 2.
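A minimal sketch of the untying idea: each module owns its own key/value projections, so the modules can read inputs of different dimensionality. The class, its dimensions, and random initialization are all hypothetical illustrations:

```python
import numpy as np

class MultiInputSelection:
    """Untied input selection (Equation 5): module m owns W_key^m / W_val^m,
    so each module can read a differently-sized input X^m_t."""

    def __init__(self, d_h, d_ins, d_k=16, d_v=16, seed=0):
        rng = np.random.default_rng(seed)
        # one projection triple per module; d_ins lists each module's input dim
        self.Wq = [rng.normal(size=(d_h, d_k)) for _ in d_ins]
        self.Wk = [rng.normal(size=(d, d_k)) for d in d_ins]
        self.Wv = [rng.normal(size=(d, d_v)) for d in d_ins]

    def __call__(self, h_prev, xs):
        """h_prev: (M, d_h) previous states; xs: list of M input vectors.
        Returns selected inputs (M, d_v) and each module's null attention."""
        outs, null_att = [], []
        for m, x in enumerate(xs):
            X = np.stack([x, np.zeros_like(x)])       # x^m_t ⊕ 0
            q = h_prev[m] @ self.Wq[m]
            K = X @ self.Wk[m]
            V = X @ self.Wv[m]
            s = q @ K.T / np.sqrt(K.shape[1])
            a = np.exp(s - s.max()); a /= a.sum()
            outs.append(a @ V)
            null_att.append(a[1])                     # weight on the null input
        return np.stack(outs), np.array(null_att)
```

The returned null-attention scores are exactly what the activation step sorts to choose the set $S_t$.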

Tasks
We explore the potential of multi-input RIMs through ablation studies on tasks that each have very specific task definitions and lack large training datasets: three sentiment analysis tasks and two health-related tweet classification tasks.

Sentiment analysis
Here we consider three sentiment benchmark datasets with their respective tasks: SST-2 The Stanford Sentiment Treebank for the task of binary sentiment classification of movie reviews (Socher et al., 2013). The models are trained on the data provided by the GLUE benchmark (Wang et al., 2018).
SE17-4A SemEval 2017 Task 4 Subtask A is a 3-class problem for sentiment classification of tweets (Rosenthal et al., 2017). The tweets are classified as Negative, Neutral, and Positive. Performance for this task is measured by the macro-average of recall scores for the positive, negative, and neutral classes, evaluated by the TweetEval benchmark (Barbieri et al., 2020).
SE15-11 SemEval 2015 Task 11 is a pilot task of sentiment analysis for figurative-language tweets. The training set comprises a collection of sarcastic, ironic, and metaphoric tweets (4490 tweets) annotated on an 11-point scale (−5, . . . , +5) (Ghosh et al., 2015). Performance is measured by the cosine similarity between the gold-standard labels and the predictions.
We use the following sentiment lexica as knowledge sources: 1. AFINN: A manually compiled lexicon of 2500 words, rated for valence with an integer score between -5 and 5 (Nielsen, 2011).
2. MPQA: The manually compiled MPQA subjectivity lexicon, which annotates words with their prior polarities (Wilson et al., 2005).
3. NRC HashTag Sentiment: An automatically compiled resource that uses seed hashtags (Mohammad et al., 2013). The polarity of the seed hashtag is used to calculate PMI-based scores (Church and Hanks, 1990).
The training set of SE15-11 was released as tweet IDs and part of it is no longer available; therefore, we randomly select 20% of the available tweets as a test set and use the remainder for training.

Health experience classification of tweets
Personal experiences gleaned from social media can enhance awareness of the state of public health. Here we focus on two tasks: SM18-2 The task of medication intake report detection was introduced as SMM4H 2018 Task 2 (Weissenbacher et al., 2018), a 3-way classification task. Tweets in which the user clearly expresses a personal medication intake/consumption are considered Class 1. Tweets where the user may have taken some medication are labeled as Class 2. Class 3 tweets mention medication names but do not indicate personal intake. The total number of samples in the training set is 17,700.
SM20-5 Birth defect mention detection concerning a child is a 3-class problem, where Class 1 tweets indicate that the user's child has a birth defect. Class 2 tweets are unclear as to whether the poster speaks of birth defects of their child. Class 3 tweets merely mention birth defects but not with respect to the poster's child (Klein et al., 2020). The training set includes 18382 samples.
Both SM18-2 and SM20-5 benefit from specialized gazetteers of relevant medical terms. For SM18-2, the gold labels of the competition test set have not been disclosed; we therefore randomly hold out a test set (20% of the original training data). For SM18-2 and SM20-5, performance is measured in terms of micro-F1 scores for classes 0 and 1.

Implementation
Preprocessing We preprocess the data using a GATE pipeline (Cunningham et al., 2002) with the ANNIE English Tokenizer (for SST-2 task) and ANNIE tweet tokenizer as well as the hashtag tokenizer (for the tweet tasks).
Embeddings Each annotation type provides a sequence (see Figure 1) which is used as input for a dedicated module in multi-input RIMs. Therefore, each sequence has to be properly embedded. The annotation types can be embedded either using pretrained embeddings or using randomly initialized embeddings that are learned during the training.
Tokens are embedded using ELMo (Peters et al., 2018) or RoBERTa (Liu et al., 2019) pretrained models. For ELMo, we use the pretrained model provided by AllenNLP (https://allennlp.org/) and for RoBERTa, the model provided by Hugging Face (https://huggingface.co/).
POS tags Following Bagherzadeh and Bergler (2021), we apply Word2Vec to POS tag sequences instead of token sequences. The POS embeddings are trained using the Gensim package (Rehurek and Sojka, 2010) with a window size of 5 and dimensionality 20. The pretraining is performed on the combined training data of all tasks introduced in Section 3.
AFINN and NRC matches do not require an embedding, since the lexica quantify the sentiment scores numerically.
Medical Gazetteer matches are embedded using a learnable embedding matrix $B \in \mathbb{R}^{5 \times 20}$. The 5 rows in $B$ correspond to 4 medical resources plus one row to indicate no annotation.
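The gazetteer embedding is a plain lookup table; the following NumPy sketch shows the mechanics with hypothetical annotation ids (0 for no annotation, 1-4 for the four resources) and a random initialization in place of learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
# Learnable matrix B in R^{5x20}: one row per medical resource (4 rows)
# plus one row for "no annotation"; updated during training in practice.
B = rng.normal(scale=0.1, size=(5, 20))

# Hypothetical annotation ids for a 6-token tweet.
ann_ids = np.array([0, 0, 2, 0, 4, 0])
gaz_input = B[ann_ids]   # (6, 20) sequence fed to the gazetteer module
```

Tokens without a gazetteer match all share row 0, so the module's input only varies where a resource actually fires.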
The multi-input RIMs model is a flexible architecture and the modules can be of any recurrent type. Here, we use LSTMs for complex inputs, such as Token or POS, and RNNs for annotations with simpler encodings, such as gazetteers. Figure 3 summarizes the hyper-parameters used for multi-input RIMs. We use learning rates of lr = 0.5e-2 and lr = 0.5e-4 for the ELMo- and RoBERTa-based models respectively. The hyper-parameters are tuned using a grid search. The multi-input RIMs model itself (excluding the language models) has 4M learnable parameters.
To calculate the classification loss we use cross-entropy, and we optimize the models using the Adam optimizer (Kingma and Ba, 2015). The models are implemented in PyTorch (Paszke et al., 2017).

Numerical results
We present a set of ablation studies to evaluate the effectiveness and contribution of different knowledge sources. Figures 4-6 report results for the multi-input RIMs model when the modules are provided with different annotation types and all modules are kept active ($M = m_{Active}$). For the runs where the Token annotation is the only input ($M = 1$), the model reduces to a simple LSTM with ELMo or RoBERTa embeddings, which we consider as baselines. Figure 4 shows that all sentiment tasks benefit from the sentiment lexica. For SST-2, AFINN and MPQA add more to the task than NRC. On the other hand, NRC yields considerable performance improvements for the tweet sentiment datasets SE17-4A and SE15-11. We surmise the greater effectiveness of the NRC lexicon for the tweet sentiment tasks is due to the fact that it is constructed from tweet corpora.

All modules active
POS constitutes general linguistic knowledge and demonstrates consistent yet small improvements for the sentiment tasks. However, POS improves performance for the health-concern data of SM18-2 (Figure 5) and SM20-5 (Figure 6). Note that both tasks concern detection of personal experience mentions, for which categories such as pronouns (both personal and possessive) and verbs in past tense are important, which carry distinctive POS tags.
Figure 5: Multi-input RIMs for SM18-2, personal drug intake. All modules are active.
Improvements from medical knowledge gazetteers are also compelling. Figure 5 shows that the Disease gazetteer enhances the performance for the medication intake task, corroborating the hypothesis that disease mentions are strong evidence for medication intake. Similarly, Figure 6 shows that the Pregnancy gazetteer, as a complementary knowledge source, provides effective support for birth defect mention detection.
Figure 6: Multi-input RIMs for SM20-5, birth defect in a child. All modules are active.
Some modules active We next evaluate performance when limiting the number of active modules ($m_{Active} < M$). Figures 7-9 show experiments for multi-input RIMs with each annotation as input to a different module. Interestingly, for most tasks, limiting the number of active modules yields better performance, corroborating observations made by Goyal et al. (2019). This confirms the importance of forcing the annotations into competition for the moderate to small datasets: if $m_{Active} < M$, the modules compete for activation. As argued by Goyal et al. (2019) and Parascandolo et al. (2018), the competition between modules for representational resources (here the annotations) potentially leads to independence among learned mechanisms, making each module specialize on a simpler sub-problem, which prevents individual RIMs from dominating (Bengio et al., 2020).
Freezing language model vs fine-tuning We are interested in the behaviour of multi-RIMs when the language models are frozen. Freezing models such as BERT has recently demonstrated improvements (including speed-up) in the Adapters framework (Houlsby et al., 2019) and (Pfeiffer et al., 2020). The Adapters rely on injecting new trainable layers (modules) as intermediate layers within a frozen language model. The trainable layers are then expected to learn task specific representations.
Here, we investigate task adaptation using multi-input RIMs, combining trainable modules with complementary task specific resources/representations to compensate for possible losses in learning capacity of the model.
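The freezing itself amounts to excluding the language model's parameters from the optimizer. A minimal PyTorch sketch, assuming the language model is exposed as an `nn.Module` (the helper names are ours, not the paper's):

```python
import torch
from torch import nn

def freeze_language_model(lm: nn.Module) -> nn.Module:
    """Exclude the language model's parameters from gradient computation,
    so the optimizer only updates the multi-input RIMs layers on top."""
    for p in lm.parameters():
        p.requires_grad = False
    lm.eval()  # also fixes dropout / normalization behavior
    return lm

def trainable_parameters(model: nn.Module):
    """The parameter list handed to the optimizer when the LM is frozen."""
    return [p for p in model.parameters() if p.requires_grad]
```

With the language model frozen, only the task-specific modules appear in `trainable_parameters`, which is where the reported speed-up comes from.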
The last two rows in Figures 4-6 report performance when the language model is frozen (no finetuning). The fully-featured versions of all frozen systems still outperform the token-only baseline for all tasks for ELMo and almost all tasks for RoBERTa.
All runs were executed on an Intel® Core i7 2.20GHz CPU. When we fine-tune our RoBERTa-based models, the average time for a forward pass and back-propagation for one sample is 1.71 seconds, compared to 0.63 seconds when the language model is frozen. This significant reduction in training overhead when freezing language models is helpful for users whose access to computational resources is limited. The reported experiments suggest that appropriate knowledge sources can compensate for losses when freezing heavy language models such as ELMo or RoBERTa.
For the other tasks, however, we replicated the reported SOTA system for each task. For SM18-2, SOTA performance is reported by Xherija (2018) for a two-layer stacked bi-LSTM with attention. The SOTA results for SE15-11 are reported for CRNN-RoBERTa (Potamias et al., 2020), a RoBERTa-based model in which a bi-LSTM layer is stacked on top of the RoBERTa model, together with a pooling operation over its last layer. The model is replicated here based on the hyper-parameters provided in (Potamias et al., 2020). Figure 10 shows that multi-input RIMs perform at or above SOTA for all benchmarks, with greater performance gains for tasks with comparatively smaller datasets and more complex linguistic requirements (SM18-2, SM20-5, SE15-11).

Module activation patterns
An advantage of a modular system is the possibility of module inspection. The functionality of each module during the course of processing has to be transparent for assessment. Figure 11 provides the activation patterns of two multi-input RIMs when applied to two inputs from SST-2 ( Figure 11a) and SM20-5 (Figure 11b) to assess whether they give insight into the functionality of the modules.
In Figure 11a, the modules that operate on sentiment knowledge sources (AFINN, MPQA, and NRC) are active only when an annotation is available and are idle (inactive) otherwise. The sentiment modules also compete with one another. Consider Beautifully at t = 1. For this token, both AFINN and MPQA provide annotations (AFINN: +3, MPQA: Pos.), but the AFINN module wins the competition and is active while the MPQA module is inactive. The larger NRC lexicon provides more annotations for the input, leading to more activity for the NRC module compared to the other sentiment modules for this sentence. Inactivity of token modules at certain time steps is particularly interesting, indicating that the model has chosen to attend to an external knowledge source. We find that 63% of the time, when the sentiment lexica provide consistent sentiment polarities, the token module is inactive.
The activation patterns in Figure 11b show the Birth Defect and Pregnancy gazetteer modules are active only when an annotation is available. The tokens CHD (t = 9) and T18 (t = 15) are matched by the Birth Defect gazetteer and the token stillbirth (t = 20) is matched by the Pregnancy gazetteer. The activity patterns are the result of the input selection mechanism (attention). Multi-input RIMs modules are free to select an input signal or ignore it, which allows each module to potentially focus on a specific part of the input. The input selection mechanism prevents the modules from being updated with spurious inputs (here, the input at steps where no annotation is available). Additionally, this allows the system to develop different modules that select complementary input signals, biasing the behavior away from combining redundant encodings.
We believe that the activation patterns can be useful for model explanation. Nevertheless, the activation patterns have to be studied under a variety of NLP tasks and different, richer annotations, which demands a dedicated study and is beyond the scope of this paper.

Conclusion
This paper presents a proof of concept for a modular system for leveraging different knowledge sources. Under the proposed model, various annotations with different encodings are used as inputs for a set of independent, decoupled, but interacting modules, a novel extension of the RIMs architecture.
Deploying several readily available knowledge sources (gazetteer lists and part-of-speech information), our experiments report on different sentiment tasks and data sets, as well as two health-related tasks and datasets. The results suggest that the modules successfully interoperate for addressing different target tasks and multiple datasets with drastically reduced parameter space (and processing resources).
In addition to the transfer potential of RIMs, we probed their transparency. The activation patterns of the modules in multi-input RIMs showed interestingly differentiated motifs. In particular, the activation patterns show that modules are active only when their input annotation is relevant for the target task. To interpret the functionality of different modules in multi-input RIMs architectures, we plan a detailed analysis of the module activation patterns under different NLP tasks in the future.