Exploiting Domain Knowledge via Grouped Weight Sharing with Application to Text Categorization

A fundamental advantage of neural models for NLP is their ability to learn representations from scratch. However, in practice this often means ignoring existing external linguistic resources, e.g., WordNet or domain-specific ontologies such as the Unified Medical Language System (UMLS). We propose a general, novel method for exploiting such resources via weight sharing. Prior work on weight sharing in neural networks has considered it largely as a means of model compression. In contrast, we treat weight sharing as a flexible mechanism for incorporating prior knowledge into neural models. We show that this approach consistently yields improved performance on classification tasks compared to baseline strategies that do not exploit weight sharing.


Introduction
Neural models are powerful in part due to their ability to learn good representations of raw textual inputs, mitigating the need for extensive task-specific feature engineering (Collobert et al., 2011). However, a downside of learning from scratch is failing to capitalize on prior linguistic or semantic knowledge, often encoded in existing resources such as ontologies. Such prior knowledge can be particularly valuable when estimating highly flexible models. In this work, we address how to exploit known relationships between words when training neural models for NLP tasks.
We propose exploiting the feature-hashing trick, originally proposed as a means of neural network compression (Chen et al., 2015). Here we instead view the partial parameter sharing induced by feature hashing as a flexible mechanism for tying together network node weights that we believe to be similar a priori. In effect, this acts as a regularizer that constrains the model to learn weights that agree with the domain knowledge codified in external resources like ontologies.
Figure 1: An example of grouped partial weight sharing. Here there are two groups. We stochastically select embedding weights to be shared between words belonging to the same group(s).
More specifically, as external resources we use Brown clusters (Brown et al., 1992), WordNet (Miller, 1995) and the Unified Medical Language System (UMLS) (Bodenreider, 2004). From these we derive groups of words with similar meaning. We then use feature hashing to share a subset of weights between the embeddings of words that belong to the same semantic group(s). This forces the model to respect prior domain knowledge, insofar as words similar under a given ontology are compelled to have similar embeddings.
Our contribution is a novel, simple and flexible method for injecting domain knowledge into neural models via stochastic weight sharing. Results on seven diverse classification tasks (three sentiment and four biomedical) show that our method consistently improves performance over (1) baselines that fail to capitalize on domain knowledge, and (2) an approach that uses retrofitting (Faruqui et al., 2014).

Grouped Weight Sharing
We incorporate similarity relations codified in existing resources (here derived from Brown clusters, SentiWordNet and the UMLS) as prior knowledge in a Convolutional Neural Network (CNN). To achieve this we construct a shared embedding matrix such that words known a priori to be similar are constrained to share some fraction of embedding weights.
Concretely, suppose we have $N$ groups of words derived from an external resource. Note that one could derive such groups in several ways, e.g., using the synsets in SentiWordNet. We denote groups by $\{g_1, g_2, \dots, g_N\}$. Each group $g_i$ is associated with an embedding $\mathbf{g}_{g_i}$, which we initialize by averaging the pre-trained embeddings of each word in the group.
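The averaging step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function name `init_group_embeddings`, the dictionary layout, and the toy three-dimensional vectors are all assumptions made for the example, as is the choice to skip groups with no pre-trained members.

```python
from typing import Dict, List


def init_group_embeddings(
    groups: Dict[str, List[str]],
    pretrained: Dict[str, List[float]],
) -> Dict[str, List[float]]:
    """Average the pre-trained vectors of each group's member words."""
    group_emb: Dict[str, List[float]] = {}
    for gid, words in groups.items():
        vecs = [pretrained[w] for w in words if w in pretrained]
        if not vecs:
            # Assumption: a group with no pre-trained member is skipped.
            continue
        dim = len(vecs[0])
        group_emb[gid] = [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]
    return group_emb


# Toy pre-trained vectors (made up for illustration).
pretrained = {
    "good": [1.0, 0.0, 2.0],
    "nice": [0.0, 1.0, 2.0],
    "amazing": [2.0, 2.0, 2.0],
    "interesting": [3.0, 0.0, 0.0],
}
groups = {"g1": ["good", "nice", "amazing"], "g2": ["good", "interesting"]}
group_emb = init_group_embeddings(groups, pretrained)
```

A word such as "good" that belongs to two groups contributes to both averages.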
To exploit both grouped and independent word weights, we adopt a two-channel CNN model (Zhang et al., 2016). The embedding matrix of the first channel is initialized with pre-trained word vectors. We denote this input by $E^p \in \mathbb{R}^{V \times d}$ ($V$ is the vocabulary size and $d$ the dimension of the word embeddings). The second channel input matrix is initialized with our proposed weight-sharing embedding $E^s \in \mathbb{R}^{V \times d}$. $E^s$ is initialized by drawing from both $E^p$ and the external resource following the process we describe below.
Given an input text sequence of length $l$, we construct sequence embedding representations $W^p \in \mathbb{R}^{l \times d}$ and $W^s \in \mathbb{R}^{l \times d}$ using the corresponding embedding matrices. We then apply independent sets of linear convolution filters on these two matrices. Each filter generates a feature map vector $v \in \mathbb{R}^{l-h+1}$ ($h$ is the filter height). We perform 1-max pooling over each $v$, extracting one scalar per feature map. Finally, we concatenate the scalars from all of the feature maps (from both channels) into a feature vector, which is fed to a softmax function to predict the label (Figure 2).
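To make the shapes concrete, the following sketch applies one linear filter per channel to a toy sequence and 1-max pools the result. It is an illustration under simplifying assumptions: bias terms, multiple filter heights, and the final softmax layer are omitted, and the function names and input values are invented for the example.

```python
from typing import List

Matrix = List[List[float]]


def conv_feature_map(seq: Matrix, filt: Matrix) -> List[float]:
    """Slide an h x d linear filter down an l x d sequence matrix.

    Returns a feature map with l - h + 1 entries.
    """
    l, h, d = len(seq), len(filt), len(filt[0])
    return [
        sum(seq[t + r][c] * filt[r][c] for r in range(h) for c in range(d))
        for t in range(l - h + 1)
    ]


def two_channel_features(
    W_p: Matrix, W_s: Matrix, filters_p: List[Matrix], filters_s: List[Matrix]
) -> List[float]:
    """1-max pool every feature map and concatenate across both channels."""
    feats = [max(conv_feature_map(W_p, f)) for f in filters_p]
    feats += [max(conv_feature_map(W_s, f)) for f in filters_s]
    return feats


# Toy input: l = 4 tokens, d = 2 dimensions, one filter of height h = 2 per channel.
W_p = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
W_s = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
filt = [[1.0, 0.0], [0.0, 1.0]]

feature_map = conv_feature_map(W_p, filt)                  # length l - h + 1 = 3
features = two_channel_features(W_p, W_s, [filt], [filt])  # one scalar per filter
```

The resulting `features` vector is what would be passed to the softmax layer.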
We initialize $E^s$ as follows. Each row $e_i \in \mathbb{R}^d$ of $E^s$ is the embedding of word $i$. Words may belong to one or more groups. A mapping function $G(i)$ retrieves the groups that word $i$ belongs to, i.e., $G(i)$ returns a subset of $\{g_1, g_2, \dots, g_N\}$, which we denote by $\{g^{(i)}_1, g^{(i)}_2, \dots, g^{(i)}_K\}$, where $K$ is the number of groups that contain word $i$. To initialize $E^s$, for each dimension $j$ of each word embedding $e_i$, we use a hash function $h_i$ to map (hash) the index $j$ to one of the $K$ group IDs: $h_i : \mathbb{N} \to \{g^{(i)}_1, g^{(i)}_2, \dots, g^{(i)}_K\}$. Following prior work on feature hashing (Weinberger et al., 2009; Shi et al., 2009), we use a second hash function $b$ to remove bias induced by hashing. This is a signing function, i.e., it maps $(i, j)$ tuples to $\{+1, -1\}$. We then set $e_{i,j}$ to the product of $\mathbf{g}_{h_i(j),j}$ and $b(i, j)$. $h$ and $b$ are both approximately uniform hash functions. Algorithm 1 provides the full initialization procedure.
Figure 2: Proposed two-channel model. The first channel input is a standard pre-trained embedding matrix. The second channel receives a partially shared embedding matrix constructed using external linguistic resources.
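Algorithm 1: Initialization of $E^s$
for each word $i \in \{1, \dots, V\}$ do
    retrieve the groups $G(i)$ that contain word $i$
    for each dimension $j \in \{1, \dots, d\}$ do
        $e_{i,j} := \mathbf{g}_{h_i(j),j} \cdot b(i, j)$
    end for
end for
For illustration, consider Figure 1. Here $g_1$ contains three words: good, nice and amazing, while $g_2$ has two words: good and interesting. The group embeddings $\mathbf{g}_{g_1}$ and $\mathbf{g}_{g_2}$ are initialized as averages over the pre-trained embeddings of the words they comprise. Here, embedding parameters $e_{1,1}$ and $e_{2,1}$ are both mapped to $\mathbf{g}_{g_1,1}$, and thus share this value. Similarly, $e_{1,3}$ and $e_{2,3}$ share the value at $\mathbf{g}_{g_1,3}$. We have elided the second hash function $b$ from this figure for simplicity.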
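The initialization procedure can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the text only requires that the two hash functions be approximately uniform, so the MD5-based `_stable_hash` helper is a stand-in chosen for reproducibility, and every function and variable name is hypothetical. The sketch also assumes each word belongs to at least one group.

```python
import hashlib
from typing import Dict, List


def _stable_hash(*parts: object) -> int:
    """Deterministic integer hash of the given parts (MD5 is a stand-in)."""
    data = ":".join(str(p) for p in parts).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def group_hash(i: int, j: int, k: int) -> int:
    """h_i(j): choose one of word i's k groups for dimension j."""
    return _stable_hash("h", i, j) % k


def sign_hash(i: int, j: int) -> int:
    """b(i, j): map the pair (i, j) to +1 or -1."""
    return 1 if _stable_hash("b", i, j) % 2 == 0 else -1


def init_shared_embeddings(
    word_groups: List[List[str]],
    group_emb: Dict[str, List[float]],
    dim: int,
) -> List[List[float]]:
    """Set e[i][j] = g[h_i(j)][j] * b(i, j) for every word i and dimension j."""
    E_s = []
    for i, gids in enumerate(word_groups):  # assumes gids is non-empty
        row = []
        for j in range(dim):
            gid = gids[group_hash(i, j, len(gids))]
            row.append(group_emb[gid][j] * sign_hash(i, j))
        E_s.append(row)
    return E_s


# Words 0..3 stand for good, nice, amazing, interesting (toy values).
word_groups = [["g1", "g2"], ["g1"], ["g1"], ["g2"]]
group_emb = {"g1": [1.0, 1.0, 2.0], "g2": [2.0, 0.0, 1.0]}
E_s = init_shared_embeddings(word_groups, group_emb, dim=3)
```

Words that belong to a single group copy that group's embedding up to sign, while a word in several groups mixes dimensions drawn from each of them.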
During training, we update $E^p$ as usual using back-propagation (Rumelhart et al., 1986). We update $E^s$ and the group embeddings $\mathbf{g}$ in a manner similar to Chen et al. (2015). In the forward propagation before each training step (mini-batch), we derive the value of $e_{i,j}$ from $\mathbf{g}$:
$$e_{i,j} = \mathbf{g}_{h_i(j),j} \cdot b(i, j) \tag{1}$$
We use this newly updated $e_{i,j}$ for forward propagation in the CNN.
During backward propagation, we first compute the gradient with respect to $E^s$, and then use this to derive the gradient with respect to the group embeddings $\mathbf{g}$. To do this, for each dimension $j$ in $\mathbf{g}_{g_k}$, we aggregate the gradients with respect to the elements of $E^s$ that are mapped to this dimension:
$$\nabla \mathbf{g}_{g_k,j} = \sum_{i} \nabla E^s_{i,j} \cdot b(i, j) \cdot \delta_{h_i(j)=g_k} \tag{2}$$
where $\delta_{h_i(j)=g_k} = 1$ when $h_i(j) = g_k$, and 0 otherwise. Each training step involves executing Equations 1 and 2. Once the shared gradient is calculated, gradient descent proceeds as usual.
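The forward derivation and the gradient aggregation can be illustrated with a small self-contained sketch. It is written under stated assumptions: the MD5-based hashes are stand-ins for the approximately uniform hash functions, the loss is a made-up linear function of $E^s$ (so its gradient with respect to $E^s$ is simply the coefficient matrix), and the final update is a plain gradient step, not the Adadelta update used in the experiments. All names are hypothetical.

```python
import hashlib


def stable_hash(*parts):
    """Deterministic integer hash (MD5 is a stand-in for a uniform hash)."""
    data = ":".join(map(str, parts)).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def h(i, j, groups_of_word):
    """h_i(j): the group ID that dimension j of word i is tied to."""
    gids = groups_of_word[i]
    return gids[stable_hash("h", i, j) % len(gids)]


def b(i, j):
    """b(i, j): sign hash returning +1.0 or -1.0."""
    return 1.0 if stable_hash("b", i, j) % 2 == 0 else -1.0


def forward(group_emb, groups_of_word, dim):
    """Derive E_s from the group embeddings: e[i][j] = g[h_i(j)][j] * b(i, j)."""
    return [
        [group_emb[h(i, j, groups_of_word)][j] * b(i, j) for j in range(dim)]
        for i in range(len(groups_of_word))
    ]


def backward(grad_E, groups_of_word, group_ids, dim):
    """Sum the signed gradients of all E_s entries tied to each group weight."""
    grad_g = {gid: [0.0] * dim for gid in group_ids}
    for i in range(len(groups_of_word)):
        for j in range(dim):
            grad_g[h(i, j, groups_of_word)][j] += grad_E[i][j] * b(i, j)
    return grad_g


groups_of_word = [["g1", "g2"], ["g1"], ["g1"], ["g2"]]
group_emb = {"g1": [1.0, 1.0, 2.0], "g2": [2.0, 0.0, 1.0]}
dim = 3
# Toy linear loss L = sum_ij coeff[i][j] * e[i][j], so dL/dE_s equals coeff.
coeff = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5], [-2.0, 1.0, 0.75], [0.1, 0.2, 0.3]]


def loss(g):
    E = forward(g, groups_of_word, dim)
    return sum(coeff[i][j] * E[i][j] for i in range(len(E)) for j in range(dim))


grad_g = backward(coeff, groups_of_word, list(group_emb), dim)

# One plain gradient step on the shared weights (illustrative learning rate).
lr = 0.1
updated = {
    gid: [g - lr * dg for g, dg in zip(group_emb[gid], grad_g[gid])]
    for gid in group_emb
}
```

Because every tied entry of $E^s$ reads from the same group weight, a single update to that weight moves all of the tied entries together, which is what enforces the sharing during training.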
We update all parameters aside from the shared weights in the standard way.
Experimental Setup

Implementation Details and Baselines
We use SentiWordNet (Baccianella et al., 2010) for the sentiment tasks. SentiWordNet assigns to each WordNet synset three sentiment scores: positivity, negativity and objectivity, constrained to sum to 1. We keep only the synsets with positivity or negativity scores greater than 0, i.e., we remove synsets deemed objective. The synsets in SentiWordNet constitute our groups. We also run the Brown clustering algorithm on the three sentiment datasets. We generate 1000 clusters and treat each as a group.
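The synset filter and the resulting word-to-group mapping can be sketched as below. The records are invented placeholders, not real SentiWordNet entries, and the field names (`id`, `words`, `pos`, `neg`) are assumptions made for the example; no particular SentiWordNet reader or file format is implied.

```python
from typing import Dict, List


def build_groups(synsets: List[dict]) -> Dict[str, List[str]]:
    """Keep synsets with a positive or negative score above 0; each is a group."""
    return {
        syn["id"]: list(syn["words"])
        for syn in synsets
        if syn["pos"] > 0 or syn["neg"] > 0
    }


def invert(groups: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each word to the IDs of the groups that contain it."""
    word_to_groups: Dict[str, List[str]] = {}
    for gid, words in groups.items():
        for word in words:
            word_to_groups.setdefault(word, []).append(gid)
    return word_to_groups


# Hypothetical records; the objectivity score is implied by 1 - pos - neg.
synsets = [
    {"id": "syn_a", "words": ["good", "nice"], "pos": 0.75, "neg": 0.0},
    {"id": "syn_b", "words": ["good", "interesting"], "pos": 0.25, "neg": 0.0},
    {"id": "syn_c", "words": ["table"], "pos": 0.0, "neg": 0.0},
    {"id": "syn_d", "words": ["awful"], "pos": 0.0, "neg": 0.875},
]
groups = build_groups(synsets)
word_to_groups = invert(groups)
```

In this toy data the fully objective synset `syn_c` is dropped, so "table" receives no group.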
For the biomedical datasets, we use the Medical Subject Headings (MeSH) terms attached to each abstract to classify them. Each MeSH term has a tree number indicating the path from the root in the UMLS. For example, 'Alagille Syndrome' has tree number 'C06.552.150.125'; periods denote tree splits, numbers are nodes. We induce groups comprising MeSH terms that share the same first three parent nodes, e.g., all terms with 'C06.552.150' as their tree number prefix constitute one group.
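Grouping by tree number prefix amounts to truncating each tree number to its first three nodes. The sketch below shows one way to do this. Only the 'Alagille Syndrome' entry comes from the text; the other terms and tree numbers are hypothetical placeholders. Skipping tree numbers with fewer than three nodes and allowing several tree numbers per term are assumptions, since the text does not specify either case.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_tree_prefix(
    term_tree_numbers: Dict[str, List[str]], depth: int = 3
) -> Dict[str, List[str]]:
    """Group terms whose tree numbers share the same first `depth` nodes."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for term, numbers in term_tree_numbers.items():
        for number in numbers:
            nodes = number.split(".")
            if len(nodes) < depth:
                # Assumption: numbers shallower than `depth` form no group.
                continue
            prefix = ".".join(nodes[:depth])
            if term not in groups[prefix]:
                groups[prefix].append(term)
    return dict(groups)


term_tree_numbers = {
    "Alagille Syndrome": ["C06.552.150.125"],   # example from the text
    "hypothetical term A": ["C06.552.150.250"],  # placeholder
    "hypothetical term B": ["C06.552.380.705"],  # placeholder
    "hypothetical term C": ["C06"],              # placeholder, too shallow
}
groups = group_by_tree_prefix(term_tree_numbers)
```

A larger `depth` would yield smaller, more specific groups, and a smaller one broader groups.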
We compare our approach to several baselines. All use pre-trained embeddings to initialize $E^p$, but we explore several approaches to initializing $E^s$: (1) randomly initialize $E^s$; (2) initialize $E^s$ to reflect the group embeddings $\mathbf{g}$, but do not share weights thereafter; (3) use the linguistic resources to retrofit (Faruqui et al., 2014) the pre-trained embeddings, and use these to initialize $E^s$.
For the sentiment datasets we use three filter heights (3, 4, 5) for each of the two CNN channels. For the biomedical datasets, we use only one filter height (1), because the inputs are unstructured MeSH terms. In both cases we use 100 filters of each unique height. For the sentiment datasets, we use Google word2vec (Mikolov et al., 2013) to initialize $E^p$. For the biomedical datasets, we use word2vec trained on biomedical texts (Moen and Ananiadou, 2013) to initialize $E^p$. For parameter estimation, we use Adadelta (Zeiler, 2012). We developed our approach using the MR sentiment dataset, tuning our approach to constructing groups from the available resources; experiments on other sentiment datasets were run after we finalized the model and hyperparameters. Similarly, we used the anemia (AN) review as a development set for the biomedical tasks, especially with respect to constructing groups from MeSH terms using the UMLS.
Table 2: Accuracies on sentiment datasets. 'p': channel initialized with the pre-trained embeddings $E^p$. 'r': channel randomly initialized. 'retro': initialized with retrofitted embeddings. 'S/B (no sharing)': channel initialized with $E^s$ (using SentiWordNet or Brown clusters), but weights are not shared during training. 'S/B (sharing)': proposed weight-sharing method.

Results
We report results (averages from 10-fold cross validation) on the sentiment and biomedical corpora in Tables 2 and 3, respectively. These exploit different external resources to induce the word groupings that in turn inform weight sharing. We report AUC for the biomedical datasets because these are highly imbalanced (see Table 1).
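A small example illustrates why AUC is more informative than accuracy on imbalanced data. This sketch uses the standard pairwise definition of AUC (the probability that a positive example is scored above a negative one, counting ties as one half). The labels, scores, and the 0.5 decision threshold are made up for illustration and are unrelated to the datasets in this paper.

```python
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(labels: List[int], scores: List[float], threshold: float = 0.5) -> float:
    """Accuracy after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# One positive among ten examples (toy imbalance).
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
constant = [0.0] * 10  # uninformative scorer
ranked = [0.4, 0.1, 0.2, 0.3, 0.05, 0.15, 0.25, 0.35, 0.0, 0.1]  # ranks the positive first

acc_constant, auc_constant = accuracy(labels, constant), auc(labels, constant)
acc_ranked, auc_ranked = accuracy(labels, ranked), auc(labels, ranked)
```

Both scorers reach the same accuracy at this threshold because neither predicts the positive class, yet only the second ranks the positive example above every negative, which AUC reflects.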
Our method improves performance compared to all relevant baselines (including an approach that also exploits external knowledge via retrofitting) in six of seven cases. Informing weight initialization using external resources improves performance independently, but additional gains are realized by also enforcing sharing during training.
We note that our aim here is not necessarily to achieve state-of-the-art results on any given dataset, but rather to evaluate the proposed method for incorporating external linguistic resources into neural models via weight sharing. We have therefore compared to baselines that enable us to assess this. Note that the sentiment task results are not directly comparable to prior work due to different preprocessing steps.
The biomedical word2vec embeddings (Moen and Ananiadou, 2013) are from bio.nlplab.org/.
Table 3: AUC on the biomedical datasets. Notation is as in Table 2, except here the external resource is the UMLS MeSH ontology ('U').

Related Work
Neural Models for NLP. Recently there has been enormous interest in neural models for NLP generally (Collobert et al., 2011; Goldberg, 2016). Most relevant to this work, simple CNN-based models (which we have built on here) have proven extremely effective for text categorization (Kim, 2014; Zhang and Wallace, 2015).
Exploiting Linguistic Resources. A potential drawback to learning from scratch in end-to-end neural models is a failure to capitalize on existing knowledge sources. There have been efforts to exploit such resources specifically to induce better word vectors (Yu and Dredze, 2014; Faruqui et al., 2014; Yu et al., 2016; Xu et al., 2014). But these models do not attempt to exploit external resources jointly during training for a particular downstream task (which uses word embeddings as inputs), as we do here. Past work on sparse linear models has shown the potential of exploiting linguistic knowledge in statistical NLP models. For example, Yogatama and Smith (2014) used external resources to inform structured, grouped regularization of log-linear text classification models, yielding improvements over standard regularization approaches. Elsewhere, Doshi-Velez et al. (2015) proposed a variant of LDA that exploits a priori known tree-structured relations between tokens (e.g., derived from the UMLS) in topic modeling.
Weight-sharing in NNs. Recent work has considered stochastically sharing weights in neural models. Notably, Chen et al. (2015) proposed randomly sharing weights in neural networks. Their primary motivation was compression, whereas here we view the hashing trick as a mechanism to encode domain knowledge.

Conclusion
We have proposed a novel method for incorporating prior semantic knowledge into neural models via stochastic weight sharing. We demonstrated that this generally improves text classification performance, compared to model variants that fail to exploit external resources and to an approach based on retrofitting prior to training.
In future work, we hope to generalize the approach beyond classification tasks, and to inform weight sharing using other varieties and sources of linguistic knowledge.