Pointing the Unknown Words

The problem of rare and unknown words is an important issue that can potentially affect the performance of many NLP systems, including both traditional count-based and deep learning models. We propose a novel way to deal with rare and unseen words in neural network models with attention. Our model uses two softmax layers to predict the next word in conditional language models: one softmax layer predicts the location of a word in the source sentence, and the other predicts a word in the shortlist vocabulary. The decision of which softmax layer to use at each timestep is made adaptively by an MLP conditioned on the context. We motivate this work with psychological evidence that humans naturally tend to point towards objects in the context or the environment when the name of an object is not known. Using our proposed model, we observe improvements on two tasks: neural machine translation on the Europarl English-to-French parallel corpus and text summarization on the Gigaword dataset.


Introduction
Words are the basic input/output units in most NLP systems, and thus the ability to cover a large number of words is key to building a robust NLP system. However, considering that (i) the number of all words in a language, including named entities, is very large and that (ii) a language itself is an evolving system (people create new words), this is a challenging problem.
The traditional approach taken by most current neural network based NLP systems is to use a softmax output layer where each output dimension corresponds to a word in a predefined word shortlist. Because computing a high-dimensional softmax is computationally expensive, in practice the shortlist is usually limited to the top-K most frequent words in the training set. All other words are then replaced by a special token, the unknown word (UNK).
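To make the shortlist/UNK preprocessing concrete, here is a minimal Python sketch of building a top-K shortlist and mapping out-of-shortlist words to UNK; the function names, the toy corpus, and the choice of K are illustrative assumptions, not part of any of the systems discussed.

```python
from collections import Counter

def build_shortlist(corpus_tokens, K):
    """Keep the top-K most frequent words; everything else maps to UNK."""
    counts = Counter(corpus_tokens)
    shortlist = [w for w, _ in counts.most_common(K)]
    word2id = {w: i for i, w in enumerate(shortlist)}
    unk_id = len(word2id)          # reserve one extra index for UNK
    return word2id, unk_id

def to_ids(sentence, word2id, unk_id):
    return [word2id.get(w, unk_id) for w in sentence]

# Illustrative usage on a toy corpus with K = 3:
corpus = "the cat sat on the mat the cat".split()
word2id, unk_id = build_shortlist(corpus, K=3)
print(to_ids("the dog sat".split(), word2id, unk_id))  # 'dog' becomes UNK
```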
This approach has two fundamental problems. The first, known as the rare word problem, is that some of the words in the shortlist occur so infrequently in the training set that it is difficult to learn good representations for them, resulting in poor performance. Second, we obviously lose important information by mapping different words to a single dummy token UNK. Even with a very large shortlist that includes all unique words in the training set, test performance does not necessarily improve, because there is still a chance of seeing an unknown word at test time. This is known as the unknown word problem. Moreover, increasing the shortlist size mostly adds rare words, due to Zipf's law.
These two problems are particularly critical in language understanding tasks such as factoid question answering (Bordes et al., 2015), where the words we are mainly interested in are usually named entities, which are likely to be unknown or rare words.
In a similar situation, where we have limited information on what to call an object of interest, humans (and also some primates) have an efficient behavioral mechanism for drawing attention to the object: pointing (Matthews et al., 2012). Pointing makes it possible to deliver information and to associate context with a particular object without knowing what to call it. In particular, human infants use pointing as a fundamental communication tool (Tomasello et al., 2007).
In this paper, inspired by the pointing behavior of humans and by recent advances in the attention mechanism (Bahdanau et al., 2014) and pointer networks (Vinyals et al., 2015), we propose a novel method to deal with the rare or unknown word problem. The basic idea is that many NLP problems can be seen as the task of predicting target text given context text, where some of the target words also appear in the context. We observe that in this case we can make the model learn to point to a word in the context and copy it to the target text, as well as when to point. For example, in machine translation, we can see the source sentence as the context and the target sentence as what we need to predict. Although the source and target languages are different, many words such as named entities are usually represented by the same characters in both languages, making it possible to copy them. Similarly, in text summarization, it is natural for some words in the original text to also appear in the summary.
Specifically, to predict a target word at each timestep, our model first determines the source of word generation, that is, whether to take a word from the predefined shortlist or to copy one from the context. For the former, we apply the typical softmax operation, and for the latter, we use the attention mechanism to obtain a pointing softmax probability over the context words and pick the one with the highest probability. The model learns this decision so as to point only when the context includes a word that can be copied to the target. This way, our model can predict even words that are not in the shortlist, as long as they appear in the context. Although a word still needs to be labeled as UNK if it is neither in the shortlist nor in the context, our experiments show that learning when and where to point improves performance on tasks such as machine translation and text summarization.
The rest of the paper is organized as follows. In the next section, we review related work, including pointer networks and previous approaches to the rare/unknown word problem. In Section 3, we review the neural machine translation model with attention that serves as the baseline in our experiments. Then, in Section 4, we propose our method for dealing with the rare/unknown word problem, called the pointer softmax. Experimental results are provided in Section 5, and we conclude in Section 6.

Related Work
The attention-based pointing mechanism was first introduced in pointer networks (Vinyals et al., 2015). In pointer networks, the output space of the target sequence is constrained to the observations in the input sequence (not the input space). Instead of a fixed-dimension softmax output layer, a softmax output of varying dimension is dynamically computed for each input sequence so as to maximize the attention probability of the target input. However, its applicability is rather limited because, unlike our model, there is no option to choose whether to point or not; it always points. In this sense, pointer networks can be seen as a special case of our model in which we always choose to point to a context word.
Several approaches have been proposed for the rare/unknown word problem, and they can be broadly divided into three categories. The first category focuses on improving the computation speed of the softmax output so that it can maintain a softmax over a very large vocabulary. Because this only increases the shortlist size, it helps to mitigate the unknown word problem but still suffers from the rare word problem. The hierarchical softmax (Morin and Bengio, 2005), importance sampling (Bengio and Senécal, 2008; Jean et al., 2014), and noise contrastive estimation (Gutmann and Hyvärinen, 2012; Mnih and Kavukcuoglu, 2013) are in this class.
The second category, to which our proposed method also belongs, uses information from the context. Notable works are (Luong et al., 2015) and (Hermann et al., 2015). In particular, applied to machine translation, (Luong et al., 2015) learns to point to words in the source sentence and copy them to the target sentence, similarly to our method. However, it does not use an attention mechanism, and by having a fixed-size softmax output over a relative pointing range (e.g., -7, ..., -1, 0, 1, ..., 7), their model (the Positional All model) is limited in its applicability to more general problems such as summarization and question answering, where, unlike in machine translation, the length of the context and the pointing locations can vary quite widely. In question answering, (Hermann et al., 2015) used placeholders for named entities in the context. However, the placeholder id is directly predicted in the softmax output rather than predicting its location in the context.
The third category changes the unit of input/output itself from words to a smaller resolution such as characters (Graves, 2013) or subword units and bytes (Sennrich et al., 2015; Gillick et al., 2015). Although this approach has the main advantage of suffering less from the rare/unknown word problem, training usually becomes much harder because the length of the sequences increases significantly.

Neural Machine Translation Model with Attention
As the baseline neural machine translation system, we use the model proposed by Bahdanau et al. (2014), which learns to (soft-)align and translate jointly. We refer to this model as NMT.
The encoder of the NMT is a bidirectional RNN (Schuster and Paliwal, 1997). The forward RNN reads the input sequence $x = (x_1, \dots, x_{T_x})$ and outputs the hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$. The backward RNN reads $x$ in the reversed direction and outputs $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x})$. We then concatenate the hidden states of the forward and backward RNNs at each time step and obtain a sequence of annotation vectors $(h_1, \dots, h_{T_x})$, where $h_j = [\overrightarrow{h}_j \,||\, \overleftarrow{h}_j]$. Here, $||$ denotes the concatenation operator. Thus, each annotation vector $h_j$ encodes information about the $j$-th word with respect to all the other surrounding words in both directions.

In the decoder, we use the gated recurrent unit (GRU) (Chung et al., 2014). Specifically, at each time step $t$, the soft-alignment mechanism first computes the relevance weight $e_{tj}$, which determines the contribution of annotation vector $h_j$ to the $t$-th target word. We use a non-linear mapping $f$ (e.g., an MLP) which takes $h_j$, the previous decoder hidden state $s_{t-1}$, and the previous output $y_{t-1}$ as input:

$$e_{tj} = f(s_{t-1}, h_j, y_{t-1}).$$

The outputs $e_{tj}$ are then normalized as follows:

$$l_{tj} = \frac{\exp(e_{tj})}{\sum_{k} \exp(e_{tk})}. \qquad (1)$$

We call $l_{tj}$ the relevance score, or the alignment weight, of the $j$-th annotation vector.
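To make the soft-alignment computation concrete, the following sketch computes annotation vectors and the normalized relevance scores $l_{tj}$ using a small one-hidden-layer MLP as the scoring function $f$. It is a minimal numpy illustration under assumed shapes and parameter names (W_h, W_s, W_y, v), not the authors' implementation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def annotations(h_fwd, h_bwd):
    """Concatenate forward and backward encoder states: h_j = [h_fwd_j || h_bwd_j]."""
    return np.concatenate([h_fwd, h_bwd], axis=1)          # shape (T_x, 2 * dim)

def alignment_weights(h, s_prev, y_prev, W_h, W_s, W_y, v):
    """Relevance scores e_tj from a one-hidden-layer MLP f, normalized into l_tj (Eq. 1)."""
    e = np.array([v @ np.tanh(W_h @ h_j + W_s @ s_prev + W_y @ y_prev) for h_j in h])
    return softmax(e)                                      # l_t: one weight per source position
```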
The relevance scores are used to compute the context vector $c_t$ of the $t$-th target word in the translation:

$$c_t = \sum_{j} l_{tj} \, h_j. \qquad (2)$$

The hidden state of the decoder $s_t$ is computed based on the previous hidden state $s_{t-1}$, the context vector $c_t$, and the output word of the previous time step $y_{t-1}$:

$$s_t = f_r(s_{t-1}, y_{t-1}, c_t),$$

where $f_r$ is a GRU. We use a deep output layer (Pascanu et al., 2013) to compute the conditional distribution over words:

$$p(y_t = a \mid y_{<t}, x) \propto \exp\Big(\psi^a_{(W_o, b_o)}\big(f_o(s_t, y_{t-1}, c_t)\big)\Big), \qquad (3)$$

where $W_o$ is a learned weight matrix and $b_o$ is a bias of the output layer, $f_o$ is a single-layer feed-forward neural network, $\psi_{(W_o, b_o)}(\cdot)$ is a function that performs an affine transformation on its input, and the superscript $a$ in $\psi^a$ indicates the $a$-th column vector of $\psi$.
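Continuing the previous sketch, one decoding step can be written as below. The GRU transition $f_r$ and the feed-forward network $f_o$ are passed in as callables; all names and shapes are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def decoder_step(l_t, h, s_prev, y_prev, gru, f_o, W_o, b_o):
    """One decoding step: context vector (Eq. 2), new decoder state, word distribution (Eq. 3)."""
    c_t = (l_t[:, None] * h).sum(axis=0)          # context vector: l_tj-weighted sum of h_j
    s_t = gru(s_prev, y_prev, c_t)                # f_r: the GRU transition, passed as a callable
    logits = W_o @ f_o(s_t, y_prev, c_t) + b_o    # deep output layer followed by an affine map
    return softmax(logits), s_t                   # p(y_t = a | y_<t, x) over the shortlist
```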
The whole model, including both the encoder and decoder, is jointly trained to maximize the (conditional) log-likelihood of the target sequences given the input sequences, where the training corpus is a set of pairs $(x^n, y^n)$. Figure 1 illustrates the architecture of the NMT.


The Pointer Softmax

In this section, we introduce our method, called the pointer softmax (PS), to deal with the rare and unknown words. The pointer softmax is more generally applicable to many NLP tasks because it resolves the limitations of previous approaches and can be used in parallel with other existing techniques such as the large vocabulary trick (Jean et al., 2014). To make the pointing mechanism applicable in more general settings, our model learns two key abilities jointly: (i) to predict whether it is necessary to point at each time step, and (ii) to point to any location of the context sequence, whose length can vary widely over examples. Note that pointer networks lack ability (i), and ability (ii) is not achieved by the models of (Luong et al., 2015).
To achieve this, our model uses two softmax output layers: the shortlist softmax and the location softmax. The shortlist softmax is the same as the typical softmax output layer, where each dimension corresponds to a word in the predefined word shortlist. The location softmax is a pointer network where each output dimension corresponds to the location of a word in the context sequence. Thus, the output dimension of the location softmax varies with the length of the given context sequence.
At each time step, if the model decides to use the shortlist softmax, we generate a word w_t from the shortlist. Otherwise, if it is expected that the context sequence contains the word that needs to be generated at that time step, we obtain the location l_t of the context word from the location softmax.
The key to making this work is deciding when to use the shortlist softmax and when to use the location softmax at each time step. For this, we introduce a switching network to the model. The switching network, which is a multilayer perceptron in our experiments, takes the representation of the context sequence (similar to the input annotation in NMT) and the previous hidden state of the output RNN as input, and outputs a binary variable z_t which indicates whether to use the shortlist softmax (when z_t = 1) or the location softmax (when z_t = 0). Note that if the word expected to be generated at a given time step is neither in the shortlist nor in the context sequence, the switching network selects the shortlist softmax, which then predicts UNK.
In Figure 2, we provide a simple depiction of the architecture of the pointer softmax model. More specifically, our goal is to maximize the probability of observing the target word sequence $y = (y_1, y_2, \dots, y_{T_y})$ and the word generation source $z = (z_1, z_2, \dots, z_{T_y})$, given the context sequence $x = (x_1, x_2, \dots, x_{T_x})$:

$$p_\theta(y, z \mid x) = \prod_{t=1}^{T_y} p_\theta(y_t, z_t \mid y_{<t}, z_{<t}, x). \qquad (4)$$

Note that the word observation $y_t$ can be either a word $w_t$ from the shortlist softmax or a location $l_t$ from the location softmax, depending on the switching variable $z_t$. Considering this, we can factorize the above equation further:

$$p_\theta(y, z \mid x) = \prod_{t \in T_w} p_\theta(w_t, z_t \mid (y, z)_{<t}, x) \times \prod_{t' \in T_l} p_\theta(l_{t'}, z_{t'} \mid (y, z)_{<t'}, x). \qquad (5)$$

Here, $T_w$ is the set of time steps where $z_t = 1$, and $T_l$ is the set of time steps where $z_t = 0$, with $T_w \cup T_l = \{1, 2, \dots, T_y\}$ and $T_w \cap T_l = \emptyset$. We denote all previous observations at step $t$ by $(y, z)_{<t}$. Note also that $h_t = f((y, z)_{<t})$.
Then, the joint probabilities inside each product can be further factorized as follows:

$$p(w_t, z_t \mid (y, z)_{<t}) = p(w_t \mid z_t = 1, (y, z)_{<t}) \times p(z_t = 1 \mid (y, z)_{<t}), \qquad (6)$$

$$p(l_t, z_t \mid (y, z)_{<t}) = p(l_t \mid z_t = 0, (y, z)_{<t}) \times p(z_t = 0 \mid (y, z)_{<t}), \qquad (7)$$

where we omitted $x$, which all probabilities above are conditioned on. The switch probability is modeled as a multilayer perceptron with binary output:

$$p(z_t = 1 \mid (y, z)_{<t}, x) = \sigma(f(x, h_{t-1}; \theta)), \qquad (8)$$

$$p(z_t = 0 \mid (y, z)_{<t}, x) = 1 - \sigma(f(x, h_{t-1}; \theta)). \qquad (9)$$

Here, $p(w_t \mid z_t = 1, (y, z)_{<t}, x)$ is the shortlist softmax and $p(l_t \mid z_t = 0, (y, z)_{<t}, x)$ is the location softmax, which can be a pointer network. $\sigma(\cdot)$ stands for the sigmoid function, $\sigma(x) = \frac{1}{1 + \exp(-x)}$.
Given $N$ such context and target sequence pairs, our training objective is to maximize the following log-likelihood w.r.t. the model parameters $\theta$:

$$\arg\max_\theta \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y^n, z^n \mid x^n). \qquad (10)$$
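As a sketch of how the objective in Eq. (10) decomposes under Eqs. (6)-(9), the per-example log-likelihood can be accumulated over time steps as follows; the dictionary-based interface is purely illustrative and not the authors' code.

```python
import numpy as np

def sequence_log_likelihood(steps):
    """log p(y, z | x) for one example, following Eqs. (6)-(9).

    Each element of `steps` is a dict with:
      'switch_prob' : p(z_t = 1 | ...), the switch MLP output
      'z'           : observed source of generation (1 = shortlist, 0 = location)
      'word_probs'  : shortlist softmax probabilities (used when z_t = 1)
      'loc_probs'   : location softmax over source positions (used when z_t = 0)
      'target'      : observed word id or source location
    """
    ll = 0.0
    for s in steps:
        if s['z'] == 1:
            ll += np.log(s['switch_prob']) + np.log(s['word_probs'][s['target']])
        else:
            ll += np.log(1.0 - s['switch_prob']) + np.log(s['loc_probs'][s['target']])
    return ll
```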

Basic Components of the Pointer Softmax
In this section, we discuss the three fundamental components of the pointer softmax and some practical details of those components, as depicted in Figure 2.

Location softmax l_t: The location of the word to copy from the source text to the target is predicted by the location softmax l_t. The location softmax outputs the conditional probability distribution p(l_t | z_t = 0, (y, z)_<t, x). For models using the attention mechanism, such as NMT, we can use the attention distribution to predict the location of the word to copy. Otherwise, we can simply use a pointer network within the model to predict the location.
Shortlist softmax w_t: The shortlist softmax w_t predicts a word from the shortlist, i.e., a subset of the words in the vocabulary V.
Switching network d_t: The switching network is an MLP with a sigmoid output that produces a scalar probability d_t of switching between l_t and w_t, representing the conditional probability distribution p(z_t | (y, z)_<t, x). For the NMT model, we condition the MLP that outputs the switching probability on the representation of the context of the source text c_t and the hidden state of the decoder h_t. Note that during training, the switch decision z_t is observed, and thus we do not have to sample. The output of the pointer softmax, f_t, is the concatenation of the two vectors d_t × w_t and (1 − d_t) × l_t.
At test time, we compute Eqns. (6) and (7) for every shortlist word w_t and every location l_t, and pick the word or location with the highest probability.
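A minimal sketch of this test-time combination, assuming the shortlist probabilities w_t, the location probabilities l_t, and the switch probability d_t have already been computed; the function and argument names are illustrative.

```python
import numpy as np

def pointer_softmax_step(w_t, l_t, d_t, shortlist, source_tokens):
    """Combine the two softmaxes with the switch probability and pick the output.

    w_t : shortlist softmax probabilities, shape (|V_shortlist|,)
    l_t : location softmax (attention) over the source, shape (T_x,)
    d_t : scalar switch probability in (0, 1)
    """
    f_t = np.concatenate([d_t * w_t, (1.0 - d_t) * l_t])   # combined output distribution
    k = int(np.argmax(f_t))
    if k < len(w_t):
        return shortlist[k]                                # generate a word from the shortlist
    return source_tokens[k - len(w_t)]                     # copy the pointed-to source word
```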

Experiments
In this section, we provide our main experimental results with the pointer softmax on machine translation and summarization tasks. In our experiments, we used the same baseline models and simply replaced the final softmax layer with the pointer softmax layer. We used Adadelta (Zeiler, 2012) to train the NMT models.

The Rarest Word Detection
We constructed a synthetic task and ran preliminary experiments on it in order to compare the pointer softmax with a regular softmax on rare words. We constructed an artificial dataset with a vocabulary of size |V| = 600 and sequences of length 7. The words in each sequence are sampled according to their unigram distribution, which has the form of a geometric distribution. The task is to predict the least frequent word in the sequence according to the unigram distribution of the words. During training, the sequences are generated randomly online. Validation and test sets are constructed before training with a fixed seed.
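A possible way to generate such synthetic examples is sketched below; the geometric parameter p and the convention that larger word ids are rarer are assumptions made purely for illustration.

```python
import numpy as np

def make_example(rng, vocab_size=600, seq_len=7, p=0.01):
    """One synthetic example: words follow a truncated geometric unigram distribution;
    the target is the rarest word in the sequence (here, the largest word id)."""
    probs = p * (1.0 - p) ** np.arange(vocab_size)
    probs /= probs.sum()                               # truncated geometric over word ids
    seq = rng.choice(vocab_size, size=seq_len, p=probs)
    target = seq.max()                                 # larger id == lower unigram probability
    return seq, target

rng = np.random.default_rng(0)                         # fixed seed, e.g. for validation/test data
sequence, rarest_word = make_example(rng)
```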
We use a GRU layer over the input sequence and take the last hidden state in order to obtain the summary c_t of the input sequence. w_t and l_t are conditioned only on c_t, and the MLP predicting d_t is conditioned on the latent representations of w_t and l_t. We used minibatches of size 250 and the Adam adaptive learning rate algorithm (Kingma and Ba, 2015) with a learning rate of 8 × 10^-4 and a hidden layer of 1000 units.
We train a model with the pointer softmax where we assign pointers for the 60 rarest words, and the remaining words are predicted from a shortlist softmax of size 540. We observe that increasing the inverse temperature of the sigmoid output of d_t to 2, i.e. d_t = σ(2x), making the decisions of d_t sharper, works better.
At the end of training, the pointer softmax obtains an error rate of 17.4%, while a softmax over all 600 tokens obtains an error rate of 48.2%.

Summarization
In this series of experiments, we used the annotated Gigaword corpus as described in (Rush et al., 2015). We used the scripts made available by the authors of that work (https://github.com/facebook/NAMAS) to preprocess the data, which resulted in about 3.8M training examples. The script also generates about 400K validation and an equal number of test examples, but we use a randomly sampled subset of 2000 examples each for validation and testing. We also made small modifications to the script to extract not only the tokenized words but also the system-generated named-entity tags. We created two different versions of the training data for pointers, which we call the UNK-pointers data and the entity-pointers data, respectively.
In the UNK-pointers data, we trimmed the vocabulary of the source and target data in the training set and replaced a word by the UNK token whenever it occurred fewer than 5 times in either the source or the target data separately. Then, we created pointers from each UNK token in the target data to the position in the corresponding source document where the same word occurs in the source, as seen in the data before UNKs were created. It is possible that the source data also has an UNK in the matching position, but we still created a pointer in that scenario as well. The resulting data has 2.7 pointers per 100 examples in the training set and 9.1 pointers per 100 examples in the validation set.
In the entity-pointers data, we exploited the named-entity tags in the annotated corpus and first anonymized the entities by replacing them with an integer id that always starts from 1 in each document and increments from left to right. Entities that occur more than once in a single document share the same id. We performed the anonymization at the token level, so as to allow partial entity matches between the source and target for multi-token entities. Next, we created pointers from the target to the source along similar lines as before, but only for exact matches of the anonymized entities. The resulting data has 161 pointers per 100 examples in the training set and 139 pointers per 100 examples in the validation set.
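A minimal sketch of this token-level anonymization, assuming a boolean per-token entity tag and an "@entityN" placeholder format (both illustrative choices, not necessarily those of the preprocessing scripts):

```python
def anonymize_entities(tokens, entity_tags):
    """Replace entity tokens with per-document integer ids, at the token level.

    entity_tags[i] is True when tokens[i] was tagged as (part of) a named entity.
    The same surface token keeps the same id within a document; ids start at 1."""
    ids, next_id, out = {}, 1, []
    for tok, is_entity in zip(tokens, entity_tags):
        if is_entity:
            if tok not in ids:
                ids[tok] = next_id
                next_id += 1
            out.append("@entity%d" % ids[tok])
        else:
            out.append(tok)
    return out
```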
When there are multiple matches in the source, in either the UNK-pointers data or the entity-pointers data, we resolved the conflict in favor of the first occurrence of the matching word in the source document. In the UNK-pointers data, we modeled the UNK tokens on the source side using a single placeholder embedding that is shared across all documents, and in the entity-pointers data, we modeled each entity id in the source with a distinct placeholder, each of which is shared across all documents.
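The pointer-creation step with first-occurrence conflict resolution might look like the following sketch; `needs_pointer` is a placeholder predicate (e.g., "the target token became UNK" or "the token is an anonymized entity"), and the interface is an illustrative assumption.

```python
def make_pointer_targets(source_tokens, target_tokens, needs_pointer):
    """For each target token that needs a pointer, point to the FIRST source position
    holding the same token; None means no pointer is created for that position."""
    first_pos = {}
    for i, tok in enumerate(source_tokens):
        first_pos.setdefault(tok, i)                  # keep only the first occurrence
    return [first_pos.get(tok) if needs_pointer(tok) else None
            for tok in target_tokens]
```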
In all our experiments, we used a bidirectional GRU-RNN (Chung et al., 2014) for the encoder and a unidirectional RNN for the decoder. To speed up training, we use the large-vocabulary trick (Jean et al., 2014), limiting the vocabulary of the decoder's softmax layer to 2000 words chosen dynamically from the words in the source documents of each batch and the most common words in the target vocabulary. In both experiments, we fixed the embedding size to 100 and the hidden state dimension to 200. We used word2vec vectors pre-trained on the same corpus to initialize the embeddings, but allowed them to be further tuned during training. Our vocabulary sizes were fixed at 125K for the source and 75K for the target in both experiments.
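A rough sketch of how the per-batch softmax vocabulary could be assembled under this variant of the large-vocabulary trick; `target_word_freq` is assumed to be a Counter over the target training corpus, and this illustrates the idea described above rather than the exact procedure of (Jean et al., 2014).

```python
from collections import Counter

def batch_softmax_vocab(batch_sources, target_word_freq, limit=2000):
    """Per-batch decoder vocabulary: words from the batch's source documents,
    padded up to `limit` with the most common target-vocabulary words."""
    vocab, seen = [], set()
    for doc in batch_sources:                        # source words of this batch first
        for w in doc:
            if w not in seen:
                seen.add(w)
                vocab.append(w)
    for w, _ in target_word_freq.most_common():      # then the most frequent target words
        if len(vocab) >= limit:
            break
        if w not in seen:
            seen.add(w)
            vocab.append(w)
    return vocab[:limit]
```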
The reference data for pointers is consumed by the model only at training time. At test time, the switch makes a decision at every timestep on which softmax layer to use.
For evaluation, we use full-length Rouge F1 computed with the official evaluation tool. In their work, the authors of (Rush et al., 2015) used full-length Rouge Recall on this corpus, since the 75-byte limit of the limited-length version of Rouge Recall (intended for DUC data) is already long for Gigaword summaries. However, since full-length Recall can unfairly reward longer summaries, we decided to use full-length F1 in our experiments for a fair comparison between our models, independent of summary length.
The experimental results comparing the pointer softmax model with the NMT model are displayed in Table 1 for the UNK-pointers data and in Table 2 for the entity-pointers data. As the experiments show, the pointer softmax improves over the baseline NMT on both the UNK data and the entities data. Our hope was that the improvement would be much greater for the entities data, since the incidence of pointers was much higher. However, it turns out this is not the case, and we suspect the main reason is the anonymization of entities, which removed data sparsity by converting all entities to integer ids that are shared across all documents. We are also running additional experiments on de-anonymized data, where we believe our model could help more, since the issue of data sparsity is more acute in that case. In Table 3, we provide the results for summarization on the Gigaword corpus in terms of recall, mirroring the comparison done by (Rush et al., 2015). We observe improvements on all scores with the addition of the pointer softmax. Note that, since the test set of (Rush et al., 2015) is not publicly available, we sampled 2000 texts with their summaries, without replacement, from the validation set and used those examples as our test set.
We present a few system generated summaries from the Pointer Softmax model trained on the UNK pointers data in Table 4. From the examples, it is apparent that the model has learned to accurately point to the source positions whenever it needs to generate rare words in the summary.

Neural Machine Translation
In our neural machine translation (NMT) experiments, we trained NMT models with attention (Bahdanau et al., 2014) on the Europarl corpus, using sequences of length up to 50, for English-to-French translation. All models were trained with early stopping based on the NLL on the development set. We report the performance of our models on newstest2011 using the BLEU score. We use 30,000 tokens for both the source and the target language shortlist vocabularies (one token is reserved for unknown words). The whole corpus contains 134,831 unique English words and 153,083 unique French words. We created a word-level dictionary from French to English which contains translations of 15,953 words that are neither in the shortlist vocabulary nor in the dictionary of common words for the source and the target. There are about 49,490 words shared between the English and French parallel corpora of Europarl.
During training, in order to decide whether to pick a word from the source sentence using attention/pointers or to predict the word from the shortlist vocabulary, we use a very simple heuristic. If the word is not in the shortlist vocabulary, we first check whether the word y_t itself appears in the source sentence. If it does not, we check whether a counterpart of the word from the common words table for the source and target languages appears in the source sentence. If the word is found in the source sentence, we use its location in the source as the target. Otherwise, we check whether one of the English senses of the French word from the cross-language dictionary is in the source; if it is, we use the location of that word as our translation target. Otherwise, we just use the argmax of l_t as the target.
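The heuristic could be sketched as follows; the dictionaries `common_words` and `fr2en` (each mapping a French word to candidate English words) and the convention of returning None to mean "fall back to the argmax of l_t" are illustrative assumptions, not the exact data structures used.

```python
def pointer_target(y_t, source_tokens, shortlist, common_words, fr2en):
    """Training-time heuristic for a French target word y_t.

    Returns ('shortlist', y_t), ('pointer', source_position), or ('pointer', None),
    where None means: fall back to the argmax of l_t."""
    if y_t in shortlist:
        return ('shortlist', y_t)
    if y_t in source_tokens:                            # the word itself appears in the source
        return ('pointer', source_tokens.index(y_t))    # index() returns the first occurrence
    for cand in common_words.get(y_t, []) + fr2en.get(y_t, []):
        if cand in source_tokens:                       # a translation/sense appears in the source
            return ('pointer', source_tokens.index(cand))
    return ('pointer', None)                            # use the argmax of l_t as the target
```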
For the switching network d_t, we observed that using a two-layered MLP with a noisy-tanh activation (Gulcehre et al., 2016) and a residual connection from the lower layer (He et al., 2015) to the upper hidden layer improves the BLEU score by about 1 point over a d_t with a ReLU activation. We initialized the biases of the last sigmoid layer of d_t to −1, so that d_t is biased toward choosing the shortlist vocabulary at the beginning of training. We renormalized the gradients whenever their norm exceeded 1 (Pascanu et al., 2012). In Table 5, we provide the results of NMT with the pointer softmax; we observed an improvement of about 3.6 BLEU over our baseline.

Table 4: Generated summaries from NMT with PS. Boldface words are the words copied from the source.

Source #1: china 's tang gonghong set a world record with a clean and jerk lift of ### kilograms to win the women 's over-## kilogram weightlifting title at the asian games on tuesday .
Target #1: china 's tang <unk> , sets world weightlifting record
NMT+PS #1: china 's tang gonghong wins women 's weightlifting weightlifting title at asian games

Source #2: owing to criticism , nbc said on wednesday that it was ending a three-month-old experiment that would have brought the first liquor advertisements onto national broadcast network television .
Target #2: advertising : nbc retreats from liquor commercials
NMT+PS #2: nbc says it is ending a three-month-old experiment

Source #3: a senior trade union official here wednesday called on ghana 's government to be " mindful of the plight " of the ordinary people in the country in its decisions on tax increases .
Target #3: tuc official , on behalf of ordinary ghanaians
NMT+PS #3: ghana 's government urged to be mindful of the plight
In Figure 3, we show the validation curves of the NMT model with the pointer softmax and the NMT model with the shortlist softmax layer. The pointer softmax converges faster and reaches a lower validation negative log-likelihood (NLL) (63.91) after 200k minibatch updates over the Europarl dataset than the NMT model with the shortlist softmax trained for 400k minibatch updates (65.26). The pointer softmax converges faster than the model with the shortlist softmax because the targets provided to the pointer softmax also act as guiding hints.