A Prism Module for Semantic Disentanglement in Named Entity Recognition

Natural Language Processing has long struggled with the problem that multiple semantic aspects are entangled within a single word, even with the help of context. To address this problem, we propose a prism module that disentangles the semantic aspects of words and reduces noise at the input layer of a model. In the prism module, some words are selectively replaced with task-related semantic aspects; these denoised word representations can then be fed into downstream tasks, making them easier. We also introduce a structure to train this module jointly with the downstream model without additional data. The module can be easily integrated into the downstream model and significantly improves the performance of baselines on the named entity recognition (NER) task. An ablation analysis demonstrates the rationality of the method. As a side effect, the proposed method also provides a way to visualize the contribution of each word.


Introduction
In Natural Language Processing (NLP), words contribute differently to different tasks. Attention-based models therefore pay more attention to important words than to unimportant ones. Since information unrelated to the task can be regarded as noise, unimportant words contain more noise than important words do. From this perspective, attention is a noise reduction mechanism.
Hard attention and soft attention are the two main types of attention mechanisms, proposed in (Xu et al., 2015). A hard attention mechanism selects some important tokens from the input sequence and ignores the others, which leads to the loss of necessary information contained in the ignored tokens. By contrast, a soft attention mechanism computes a probability distribution over the tokens of the input sequence that reflects their importance. However, since unimportant words contain more useless information than useful information, assigning them non-zero probabilities keeps more noise. Overall, both attention mechanisms have drawbacks in noise reduction.

* Kun Liu and Shen Li contributed equally to this work. † Work performed while Kun Liu was an intern at Deeplycurious.ai. 1 Our code is available at https://github.com/liukun95/Prism-Module
The attention mechanism was first applied in Computer Vision (CV), where pixels are the basic units. In NLP, however, the minimum unit is not the word but the sense. Therefore, NLP tasks need a noise reduction method at a finer granularity than the attention mechanism.
Normally, various aspects of semantics are entangled in word embeddings (Bengio et al., 2003; Mikolov et al., 2013). However, only some of these aspects are needed for a specific task, and the remaining redundant aspects can be regarded as noise. To reduce this noise, entangled word embeddings can be replaced with distributed representations of disentangled semantic aspects. Considering that it can be hard to find the corresponding semantics of each aspect, we call them abstract aspects.
In this paper, we propose a prism module to generate parallel denoised sentences from multiple aspects. Unlike the attention mechanism, the module reduces noise at the level of semantic aspects rather than words. Specifically, we selectively replace some words in the sentence with abstract aspects. These denoised sentences are expected to keep sufficient information to make predictions in the downstream task, like low-noise versions of the original sentence. Compared with the attention mechanism, the proposed method reduces not only the noise but also the loss of necessary information. Furthermore, it allows noise to be reduced from different aspects. As a side effect, the interpretability of models is improved, since different abstract aspects can represent different semantics.
We introduce a method to train this module jointly with the downstream model without extra training data. During training, the prism module learns to find the proper words to replace for each abstract aspect and also learns the embeddings of the abstract aspects, which represent the task-related semantics of words. Furthermore, we introduce a novel trick to reduce the high variance in training brought by the REINFORCE method.
The prism module can be easily integrated into a downstream model to reduce noise and improve performance. We evaluate our method on the NER task. Results show that our model outperforms the baseline by a substantial margin.

Related Work
Attention-based models achieve state-of-the-art performance on a broad range of NLP tasks. Although soft attention is more popular, hard attention is found to be more effective when well trained (Xu et al., 2015). Hard attention has been successfully applied in computer vision (Ba et al., 2014), but its application in NLP is limited. Lei et al. (2016) proposed a novel type of hard attention and applied it to improve the interpretability of models; however, accuracy was not improved. Inspired by this, our proposed method can also be understood as hard-attention based, but it successfully improves accuracy.
In addition to improving accuracy, attention-based models also improve interpretability by showing the inner workings of neural networks (Rush et al., 2015; Rocktäschel et al., 2015; Lei et al., 2016). Disentangling provides another way to improve interpretability, by extracting information from different aspects of the input. Lin et al. (2017) propose a multi-aspect self-attention to disentangle the latent semantic information of the input sentence. Jain et al. (2018) propose a model to learn disentangled representations of texts for four given biomedical aspects. Our proposed method can be regarded as a combination of the above two types of methods to improve the interpretability of the model.

Prism Module
The target of this module is to obtain sentences with less noise by replacing some of the words with abstract aspects. In a sentence, each word has different semantics and contributes differently to the task, so the key is to calculate the probability distribution over possible replacements.
Given a sentence X with n words, let w_i be the embedding of the i-th word in the sentence. We also have m different abstract aspects, representing m aspects of semantics, where a_i is the embedding of the i-th abstract aspect.
We apply a bidirectional LSTM to the input sentence, which can capture dependencies between words:

→h_t = LSTM(w_t, →h_{t-1})
←h_t = LSTM(w_t, ←h_{t+1})

where →h_t and ←h_t denote the forward and backward hidden states. We use h_t, the concatenation of →h_t and ←h_t, as the annotation of the t-th word. All n hidden states form the matrix H = (h_1, h_2, h_3, ..., h_n). We define a binary variable s_{i,j} ∈ {0, 1} which indicates whether the j-th word w_j is replaced by the i-th abstract aspect a_i. Then the probability matrix P of shape m-by-n can be computed, where each element p_{i,j} is the probability that s_{i,j} = 1:

p_{i,j} = p(s_{i,j} = 1 | X) = σ(W h_j + b)_i

Here, W is a weight matrix of size m-by-2h and b is the bias. Each s_{i,j} is a Bernoulli random variable parametrized by p_{i,j}. To obtain the replaced sentences, we sample S according to this distribution; the i-th row of S indicates which words in the sentence are replaced with the i-th abstract aspect. After replacing the words as guided by S, we obtain m replaced sentences (X_1, X_2, X_3, ..., X_m), each denoised from a different aspect. These parallel sentences, the m denoised sentences plus the original sentence, are then used as the input of the downstream model.
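The sampling and replacement step can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the sizes, the random stand-ins for the BiLSTM annotations H, and the element-wise sigmoid parametrization of p_{i,j} are assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: n words, m abstract aspects, BiLSTM hidden size h.
n, m, h = 5, 3, 4
W = rng.normal(size=(m, 2 * h))   # weight of size m-by-2h, as in the text
b = rng.normal(size=(m, 1))       # bias
H = rng.normal(size=(2 * h, n))   # stand-in for the BiLSTM annotations h_1..h_n
A = rng.normal(size=(m, h))       # embeddings of the m abstract aspects a_1..a_m
X = rng.normal(size=(n, h))       # word embeddings w_1..w_n

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# p_{i,j}: probability that the j-th word is replaced by the i-th aspect.
P = sigmoid(W @ H + b)            # shape (m, n)

# Sample the binary replacement matrix S: one Bernoulli draw per (aspect, word).
S = (rng.random(P.shape) < P).astype(int)

# Build the m denoised sentences: row i of S says which words are
# replaced with the aspect embedding a_i, the rest keep their word embedding.
denoised = [np.where(S[i][:, None] == 1, A[i], X) for i in range(m)]
```

The m denoised matrices plus the original X would then form the m + 1 parallel inputs of the downstream model.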

Model Training
The prism module is trained jointly with the downstream model. The parameters of the model can be divided into two parts: θ_o for the downstream model and θ_a for the prism module.
The objective for optimizing θ_o is to improve the prediction accuracy of the model. Since the input of the model includes both the original sentence and the replaced sentences, the loss function for the parameters θ_o is:

L(θ_o) = L(θ_o, X, y) + Σ_{i=1}^{m} L(θ_o, X_i, y)

The objective for optimizing θ_a is to replace the proper words with the proper abstract aspects. Because of the discrete variables s_{i,j}, the loss function is non-differentiable with respect to θ_a, so we use the policy gradient / REINFORCE method (Williams, 1992) to optimize θ_a. Since we expect not only that the downstream model is well trained but also that the replaced sentences achieve favorable performance on the downstream task, the loss L(θ_o) is used as the reward R. The objective function for θ_a is:

J(θ_a) = E_{S∼p(S|X)}[R]

whose gradient is estimated by REINFORCE as ∇_{θ_a} J(θ_a) ≈ R ∇_{θ_a} log p(S|X). Besides, we also introduce a penalization term Ω(A), proposed by Lin et al. (2017), to diversify the abstract aspects, which are expected to represent different disentangled aspects:

Ω(A) = ||Ā Ā^T − I||_F^2
where ||·||_F denotes the Frobenius norm of a matrix, I stands for the identity matrix, and Ā is obtained by normalizing each row a_i of A.
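The penalization term can be written in a few lines; the following is a minimal numpy sketch of Ω(A) as defined above (the aspect matrices used to exercise it are made-up examples).

```python
import numpy as np

def penalization(A):
    """Ω(A) = ||Ā Ā^T − I||_F^2, where Ā row-normalizes the aspect matrix A.
    Near-orthogonal (diverse) aspect embeddings give a small penalty."""
    A_bar = A / np.linalg.norm(A, axis=1, keepdims=True)
    m = A.shape[0]
    return np.linalg.norm(A_bar @ A_bar.T - np.eye(m), ord="fro") ** 2

# Orthogonal rows give Ω = 0; identical rows are maximally penalized.
orthogonal_aspects = np.eye(3)
identical_aspects = np.ones((3, 2))
```

Minimizing Ω(A) therefore pushes the abstract aspects away from collapsing onto one direction, which matches the stated goal of keeping the aspects disentangled.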
Since we sample S according to the probability distribution to simplify the expectation, the loss function L for all parameters is:

L = L(θ_o, X, y) + Σ_{i=1}^{m} L(θ_o, X_i, y) + R log p(S|X) + Ω(A)
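The REINFORCE term above can be sketched as a surrogate scalar whose gradient with respect to the prism parameters equals R ∇ log p(S|X). This is a hypothetical illustration assuming one sampled replacement matrix S per sentence and independent Bernoulli variables s_{i,j}, as described in the text.

```python
import numpy as np

def reinforce_surrogate_loss(P, S, R):
    """Return R * log p(S|X), with log p(S|X) = Σ_{i,j} log p(s_{i,j}).

    P : (m, n) replacement probabilities p_{i,j}
    S : (m, n) sampled binary replacement matrix
    R : scalar reward (the downstream loss L(θ_o) in the text)
    """
    log_p = np.where(S == 1, np.log(P), np.log(1.0 - P)).sum()
    return R * log_p
```

In an autodiff framework, backpropagating through this scalar (treating S and R as constants) yields the REINFORCE gradient estimate for θ_a.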

Normalization of Reward
High variance is one of the disadvantages of the REINFORCE method and makes models difficult to converge; our model is no exception. We propose a novel method to reduce the variance and stabilize the training process: we normalize the rewards so that they have a mean of 0 and a variance of 1:

R̃_i = (R_i − μ) / σ
where the mean μ and standard deviation σ are calculated over each mini-batch and R̃_i denotes the normalized reward. The loss L keeps the same form, with the normalized reward R̃ replacing R in the REINFORCE term.
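The normalization itself is a one-liner; a minimal sketch (the small epsilon guarding against a zero-variance batch is our addition, not stated in the text):

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Normalize the REINFORCE rewards of a mini-batch to zero mean and
    unit variance, reducing the variance of the gradient estimate."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```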

Experiments
We evaluate the effectiveness of our noise reduction method on the NER task. Dataset: CoNLL 2003 (Sang and De Meulder, 2003) is used as our dataset.
Baseline: Yang et al. (2018) compare the performance of twelve neural sequence labeling models on the NER task, and the CNN-BiLSTM (Bi-directional LSTM)-CRF architecture (Ma and Hovy, 2016) achieves the best result (F1). We therefore use this model as our baseline. Figure 1 shows our model, where the prism module is integrated into the CNN-BiLSTM-CRF architecture. The sentence is fed into the prism module, whose output is m (e.g., 3) sentences denoised from different aspects. These m + 1 parallel sentences, the m denoised sentences plus the original sentence, are fed into the BiLSTM-CRF network to predict the labels. Note that only the original sentence is used at test time.

Model Configuration
In the prism module, the hidden size of the BiLSTM is the same as in the CNN-BiLSTM-CRF architecture. The number of abstract aspects is set to 8. Apart from the hyperparameters of the prism module, all other hyperparameters are set as in (Ma and Hovy, 2016).

Result and Analysis
The experimental results are shown in Table 1.
Our model outperforms the baseline by a clear margin.
To prove the effectiveness of our prism module, we design three ablation experiments. Multi-aspect hard attention: instead of replacing words with abstract aspects, we replace the embeddings of the selected words with zero vectors. This method can be regarded as a type of multi-aspect hard attention in which some of the words are ignored.
Random replacement: instead of learning which words to replace guided by the downstream task, we select the words to be replaced randomly for each abstract aspect. This is a kind of data noising technique, similar to the method proposed in (Xie et al., 2017).
Single aspect: in our full model, one word can be replaced with different abstract aspects in different denoised sentences. In this experiment, there is only one denoised sentence, in which each word can only be replaced with the abstract aspect of maximum probability.
As shown in Table 1, our model performs better than all three ablations. The results indicate that (1) the trainable embedding of each abstract aspect captures information that is valuable for the task; (2) our model learns to replace words properly, guided by the downstream task (e.g., NER); and (3) for each word, more than one aspect of its semantics is task-related. Additionally, considering that the first two ablation experiments improve F1 by 0.3% while the last one improves it by only 0.1%, multi-aspect denoising is important for the prism module.

Visualization
We visualize the matrix S by drawing a heat map of each row vector, as shown in Figure 2. In this example, japan and china are location entities. Each row corresponds to one abstract aspect, and each element indicates whether the corresponding word is replaced. The heat map shows that each abstract aspect replaces some of the words, keeping certain task-related semantics and filtering out other information. Since the abstract aspects represent different meanings, the selected words vary between rows, which indicates that noise is reduced from different aspects. From the heat map, we can also see that a word can be replaced with multiple abstract aspects; this process is the disentanglement of semantics.
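A text rendering of such a heat map can be produced directly from a sampled S. The following is a toy sketch with a made-up binary matrix and word list (the example sentence is hypothetical, not the one from Figure 2); '#' marks a word replaced by the aspect, '.' a word that is kept.

```python
import numpy as np

def render_heatmap(S, words):
    """Render the binary replacement matrix S row by row, one line per
    abstract aspect, followed by the word list for reference."""
    lines = []
    for i, row in enumerate(S):
        cells = "".join("#" if s else "." for s in row)
        lines.append(f"aspect {i}: {cells}")
    lines.append("words   : " + " ".join(words))
    return "\n".join(lines)

# Made-up example: two aspects over a four-word sentence.
S = np.array([[1, 0, 0, 1],
              [0, 1, 0, 0]])
print(render_heatmap(S, ["japan", "beat", "china", "today"]))
```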

Conclusion
In this paper, we propose a prism module to reduce the noise of word embeddings by selectively replacing some words with task-related semantic aspects. We also introduce a structure to train this prism module jointly with an existing model, with no extra data needed. Since the REINFORCE method is used in training, a novel method is introduced to reduce the variance of rewards. As a result, our model outperforms the baseline by a clear margin, and the ablation analysis proves the effectiveness of our method. As a side effect, the module also improves the interpretability of models. Since the prism module can be easily integrated into existing models, it can be applied to a wide range of neural architectures.