Learning to Tag OOV Tokens by Integrating Contextual Representation and Background Knowledge

Neural context-aware models for slot tagging have achieved state-of-the-art performance. However, the presence of OOV (out-of-vocabulary) words significantly degrades the performance of neural models, especially in few-shot scenarios. In this paper, we propose a novel knowledge-enhanced slot tagging model that integrates the contextual representation of input text with large-scale lexical background knowledge. In addition, we use multi-level graph attention to explicitly model lexical relations. Experiments show that our proposed knowledge integration mechanism achieves consistent improvements across settings with different sizes of training data on two public benchmark datasets.


Introduction
Slot tagging is a critical component of spoken language understanding (SLU) in dialogue systems. It aims at parsing semantic concepts from user utterances. For instance, given the utterance "I'd also like to have lunch during my flight" from the ATIS dataset, a slot tagging model should identify lunch as a meal description type. Given sufficient training data, recent neural models (Mesnil et al., 2014; Liu and Lane, 2015, 2016; Goo et al., 2018; Haihong et al., 2019; He et al., 2020) have achieved remarkably good results.
However, these models often suffer from poor slot tagging accuracy when rare or OOV (out-of-vocabulary) words occur. Ray et al. (2018) verified that the presence of OOV words further degrades the performance of neural models, especially in few-shot scenarios where the training data cannot provide adequate contextual semantics. Previous context-aware models merely focus on how to capture deep contextual semantics to aid

* Weiran Xu is the corresponding author.

Figure 1: An example of slot tagging in the few-shot scenario where "scat singing" (gold tags B-music_type I-music_type) is unseen in the training set. The prior context-aware model fails to recognize its correct type because of low-coverage contextual information. After integrating background knowledge from WordNet, the model succeeds in reasoning out the correct type via lexical relations.
in recognizing slot entities, while neglecting the ontology behind the words and large-scale background knowledge. Explicit lexical relations are vital for recognizing unseen words when there is not adequate training data, i.e., in few-shot scenarios. Fig 1 gives a motivating example of slot tagging to illustrate this phenomenon. The example suggests that slot tagging requires not only understanding complex linguistic context constraints but also reasoning over explicit lexical relations via large-scale background knowledge graphs. Previous state-of-the-art context-aware models (Goo et al., 2018; Haihong et al., 2019) only learn contextual information based on a multi-layer BiLSTM encoder and a self-attention layer. Others (Dugas and Nichols, 2016; Williams, 2019; Shah et al., 2019) use handcrafted lexicons (also known as gazetteers or dictionaries), which are typically collections of semantically related phrases, to improve slot tagging. One major limitation is that lexicons collected by domain experts are relatively small in scale and fail to model complicated relations between words, such as relation hierarchies.
In this paper, we propose a novel knowledge-enhanced method for slot tagging that integrates the contextual representation of input text with large-scale lexical background knowledge, enabling the model to reason over explicit lexical relations. We aim to leverage both the linguistic regularities covered by deep LMs and the high-quality knowledge derived from curated KBs. Consequently, our model can infer rare and unseen words in the test dataset by combining contextual semantics learned from the training dataset with lexical relations from the ontology. As depicted in Fig 2, given an input sequence, we first retrieve potentially relevant KB entities and encode them into distributed representations that describe global graph-structured information. Then we employ a BERT (Devlin et al., 2019) encoder layer to capture context-aware representations of the sequence and attend to the KB embeddings using multi-level graph attention. Finally, we integrate the BERT embeddings and the attended KB embeddings to predict the slot type. Our main contributions are three-fold: (1) We investigate and demonstrate the feasibility of applying a lexical ontology to facilitate recognizing OOV words in the few-shot scenario. To the best of our knowledge, this is the first work to consider large-scale background knowledge for enhancing context-aware slot tagging models. (2) We propose a knowledge integration mechanism and use multi-level graph attention to model explicit lexical relations. (3) Extensive experiments on two benchmark datasets show that our proposed method achieves consistently better performance than various state-of-the-art context-aware methods.

Our Approach
In this work, we consider the slot tagging task in the few-shot scenario, especially for OOV tokens. Given a sequence with n tokens X = {x_i}_{i=1}^n, our goal is to predict the corresponding tag sequence Y = {y_i}_{i=1}^n. This section first explains our BERT-based model and then introduces the proposed knowledge integration mechanism for inducing background commonsense. The overall model architecture is illustrated in Fig 2.

BERT-Based Model for Slot Tagging
The model architecture of BERT is a multi-layer bidirectional Transformer encoder. The input representation is a concatenation of WordPiece embeddings (Wu et al., 2016), positional embeddings, and segment embeddings. Inspired by previous RNN-based works (Mesnil et al., 2014; Liu and Lane, 2016), we extend BERT to a slot tagging model. We first feed the input sequence X = {x_i}_{i=1}^n to a pre-trained BERT encoding layer and obtain the final hidden states H = (h_1, ..., h_n). To make this procedure compatible with the original BERT tokenization, we feed each input word into a WordPiece tokenizer and use the hidden state corresponding to the first sub-word as input to the softmax classifier:

y_i = softmax(W h_i + b),

where h_i ∈ R^{d_1} is the hidden state corresponding to the first sub-word of the i-th input word x_i and y_i is the predicted slot label.
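As a minimal illustration of the first-sub-word alignment (using a hypothetical toy tokenizer, not the real WordPiece vocabulary):

```python
def toy_wordpiece(word):
    """Hypothetical stand-in for a WordPiece tokenizer: chunks long words."""
    if len(word) <= 4:
        return [word]
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]

def first_subword_indices(words):
    """Tokenize each word and record the index of its first sub-word;
    only the hidden states at these indices feed the softmax classifier."""
    tokens, firsts = [], []
    for w in words:
        firsts.append(len(tokens))   # position of this word's first piece
        tokens.extend(toy_wordpiece(w))
    return tokens, firsts

tokens, firsts = first_subword_indices(["play", "scat", "singing"])
# "singing" splits into "sing" + "##ing"; only its first piece is classified.
```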

Knowledge Integration Mechanism
The knowledge integration mechanism aims to enhance the deep contextual representation of input text by leveraging large-scale lexical background knowledge, WordNet (Miller, 1995), to recognize tokens unseen in the training set. Essentially, it applies multi-level graph attention over KB embeddings, using the BERT representations from the previous layer as queries, to enrich the contextual BERT embeddings with human-curated background knowledge.
We first introduce the KB embedding and retrieval process. In this paper, we use the lexical KB WordNet, stored as (subject, relation, object) triples, where each triple indicates a specific relation between word synsets, e.g., (state, hypernym-of, california). Each synset expresses a distinct concept and is organized in a human-curated tree hierarchy.
KB Embeddings We represent KB concepts as continuous vectors. The goal is that KB triples (s, r, o) can be scored in the dense vector space based on the embeddings. We adopt the BILINEAR model (Yang et al., 2014), which measures relevance via a bilinear function f(s, r, o) = s^T M_r o, where s, o ∈ R^{d_2} are the vector embeddings of s and o, respectively, and M_r is a relation-specific embedding matrix. We then train the embeddings with the max-margin ranking objective

L = Σ_{(s,r,o)∈T} Σ_{(s',r,o')∈T'} max(0, γ − f(s, r, o) + f(s', r, o')),

where T denotes the set of triples observed in the KB, T' denotes negative triples not observed in the KB, and γ is the margin. Finally, we obtain vector representations for the concepts of the KB. Since we mainly focus on the slot tagging task, the datasets are relatively small for jointly learning KB embeddings, and the KB contains many triples not present in the ATIS and Snips datasets, we pre-train the KB vectors and keep them fixed while training the whole model to reduce complexity.
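A minimal numpy sketch of the BILINEAR score and a max-margin ranking loss of this form (toy dimensions; the margin value and the pairing of positives with negatives are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def bilinear_score(s, M_r, o):
    """f(s, r, o) = s^T M_r o: relevance of a KB triple under relation r."""
    return s @ M_r @ o

def max_margin_loss(pos_triples, neg_triples, margin=1.0):
    """Hinge loss pushing each observed triple's score above its paired
    negative triple's score by at least `margin`."""
    total = 0.0
    for (s, M_r, o), (s_n, M_rn, o_n) in zip(pos_triples, neg_triples):
        total += max(0.0, margin
                     - bilinear_score(s, M_r, o)
                     + bilinear_score(s_n, M_rn, o_n))
    return total
```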
KB Concept Retrieval We need to retrieve all the concepts (synsets) relevant to an input word x_i from the KB. Different from (Yang and Mitchell, 2017; Yang et al., 2019), for a word x_i, we first return its synsets as the first-level candidate set C_1(x_i) of KB concepts. Then we construct the second-level candidate set C_2(x_i) by retrieving all the direct hyponyms of each synset in C_1(x_i), as shown in the right part of Fig 2.

Multi-Level Graph Attention After obtaining the two-level concept candidate sets, we use the BERT embedding h_i of input token x_i to attend over the multi-level memory. The first-level attention is calculated by a bilinear operation between h_i and each synset embedding c_j in the first-level set C_1(x_i):

α_ij ∝ exp(c_j^T W_1 h_i).

We then add an additional sentinel vector c̄ (Yang and Mitchell, 2017) and accumulate all the embeddings as

s_i^1 = Σ_j α_ij c_j + γ_i c̄,

where γ_i is computed in the same way as α_ij and Σ_j α_ij + γ_i = 1.
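A numpy sketch of the first-level attention with the knowledge sentinel (toy dimensions; `W1` and the sentinel vector are hypothetical trainable parameters, here fixed for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def first_level_state(h, synsets, W1, sentinel):
    """Attend from the BERT state h over first-level synset embeddings plus
    a sentinel; returns (s1, alphas, gamma) with alphas.sum() + gamma == 1."""
    scores = np.array([c @ W1 @ h for c in synsets] + [sentinel @ W1 @ h])
    weights = softmax(scores)                 # joint normalization
    alphas, gamma = weights[:-1], weights[-1]
    s1 = sum(a * c for a, c in zip(alphas, synsets)) + gamma * sentinel
    return s1, alphas, gamma

h = np.ones(3)                                # toy BERT state
synsets = [np.eye(3)[0], np.eye(3)[1]]        # toy C1(x_i) embeddings
s1, alphas, gamma = first_level_state(h, synsets, np.eye(3), np.zeros(3))
```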
Here, s_i^1 can be regarded as a one-hop knowledge state vector, as it only represents directly linked synsets. We therefore perform second-level graph attention to encode the hyponyms of the direct synsets and enrich the information of the original synsets. Intuitively, the second-level attention over the hyponyms can be viewed as a relational reasoning process: once a synset belongs to an entity type, its hyponyms always conform to the same type. Likewise, the second-level attention over C_2(x_i) is calculated as

β_ijk ∝ exp(c_jk^T W_2 h_i),

where c_j is the j-th synset linked to token x_i and c_jk is the k-th hyponym of c_j. We then obtain the multi-hop knowledge state vector

s_i^2 = Σ_j Σ_k β_ijk c_jk.

Finally, we concatenate the multi-level knowledge-aware vectors s_i^1 and s_i^2 with the original BERT representation h_i, yielding f_i = [s_i^1; s_i^2; h_i]. We also add a BiLSTM matching layer that takes the knowledge-enriched representations f_i as input. We then forward its hidden states to a CRF layer to predict the final results. The training objective is the sum of the log-likelihoods of all the words.
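Continuing the toy setup, the second-level attention over hyponyms and the final concatenation might look like the following (`W2` is a hypothetical parameter, and averaging the per-synset summaries is our own simplifying assumption about how they are pooled):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def second_level_state(h, hyponyms_per_synset, W2):
    """Attend over the hyponyms c_jk of each first-level synset c_j, then
    pool the per-synset summaries into the multi-hop state s2."""
    summaries = []
    for hyps in hyponyms_per_synset:
        betas = softmax(np.array([c @ W2 @ h for c in hyps]))
        summaries.append(sum(b * c for b, c in zip(betas, hyps)))
    return np.mean(summaries, axis=0)         # assumed pooling across synsets

def knowledge_enriched(s1, s2, h):
    """f_i = [s1; s2; h], the input to the BiLSTM matching layer."""
    return np.concatenate([s1, s2, h])

h = np.ones(3)
hyponyms = [[np.eye(3)[0], np.eye(3)[1]], [np.eye(3)[2]]]   # toy C2(x_i)
s2 = second_level_state(h, hyponyms, np.eye(3))
f_i = knowledge_enriched(np.zeros(3), s2, h)
```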

Setup
Datasets To evaluate our approach, we conduct experiments on two public benchmark datasets, ATIS (Tür et al., 2010) and Snips (Coucke et al., 2018). ATIS contains 4,478 utterances in the training set and 893 utterances in the test set, while Snips contains 13,084 and 700 utterances, respectively. The percentage of OOV words between the training and test sets is 0.77% (ATIS) and 5.95% (Snips). To simulate few-shot scenarios, we downsample the original training sets of ATIS and Snips to different extents while keeping the validation and test sets fixed. We aim to evaluate the effectiveness of integrating an external KB under settings with varied amounts of training data.
Evaluation We evaluate the performance of slot tagging using the F1 score. In the experiments, we use the English uncased BERT-base model, which has 12 layers, 768 hidden states, and 12 heads. The hidden size of the BiLSTM layer is set to 128. Adam (Kingma and Ba, 2014) is used for optimization with an initial learning rate of 1e-5. The dropout probability is 0.1, and the batch size is 64. We fine-tune all hyperparameters on the validation set.
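The F1 score here is the standard span-level (exact-match) slot F1. A minimal sketch of how it is computed from BIO tags (the official conlleval script handles more edge cases, e.g., stray I- tags starting a span):

```python
def bio_spans(tags):
    """Extract (type, start, end) spans from a BIO tag sequence;
    end is exclusive. Stray I- tags are ignored for simplicity."""
    spans, start, typ = [], None, None
    for i, t in enumerate(tags + ["O"]):     # trailing "O" flushes the last span
        inside = t.startswith("I-") and t[2:] == typ and start is not None
        if not inside:
            if start is not None:
                spans.append((typ, start, i))
            if t.startswith("B-"):
                start, typ = i, t[2:]
            else:
                start, typ = None, None
    return spans

def span_f1(gold, pred):
    """Micro-averaged F1 over exactly matching (type, start, end) spans."""
    g, p = set(bio_spans(gold)), set(bio_spans(pred))
    tp = len(g & p)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(p), tp / len(g)
    return 2 * prec * rec / (prec + rec)
```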

Baselines
Attention-Based (Liu and Lane, 2016) uses an RNN layer and a self-attention layer to encode the input text. Slot-Gated (Goo et al., 2018), which has two variants, Full Atten and Intent Atten, applies information from the intent detection task to enhance slot tagging. SF-ID Network (Haihong et al., 2019) designs a multiple-iteration mechanism to construct bi-directional interrelated connections between slot tagging and intent detection. Most of the previous methods improve the performance of slot tagging by joint learning with intent detection. However, the effectiveness of background knowledge for slot tagging remains unexplored. Consequently, our proposed approach integrates the large-scale lexical background knowledge in WordNet to enhance the deep contextual representation of input text. We hope to further improve the performance of slot tagging, especially in the few-shot scenario where plenty of training data is not available.

Overall Results
We display the experimental results in Table 2, where we choose two model architectures, RNN and BERT, as the encoding layer. Table 2 shows that our proposed knowledge integration mechanism significantly outperforms the baselines on both datasets, demonstrating that explicitly integrating large-scale background knowledge with contextual representation benefits slot tagging effectively. Moreover, the improvement of 0.72% over the strong BERT baseline on Snips is considerably higher than the 0.27% improvement on ATIS. Considering the distinct complexity of the two datasets, the probable reason is that a simpler slot tagging task, such as ATIS, does not require much background knowledge to achieve good results: the vocabulary of ATIS is much smaller than that of Snips, so context-aware models can provide enough cues for recognizing rare or OOV words. Hence, our method makes a notable difference in scenarios where samples are linguistically diverse and the vocabulary is large. The results also demonstrate that incorporating external knowledge does not introduce much noise, since we use a knowledge sentinel for a better trade-off between the impact of background knowledge and information from the context.
On the other hand, the main results of the

Ablation Study
To study the effect of each component of our method, we conduct an ablation analysis under the 10% training data setting (Table 3). We can see that knowledge integration is crucial to the improvements. Besides, the first-level graph attention yields a larger performance gain than the second-level attention. We assume that directly linked synsets are more significant than their hyponyms. The matching layer and the CRF also play a role. The RNN matching layer matters partly because it builds explicit interactions between knowledge vectors and context vectors.

Conclusion
We present a novel knowledge integration mechanism that incorporates a background KB and deep contextual representations to facilitate the few-shot slot tagging task. Experiments confirm the effectiveness of modeling explicit lexical relations, which has not been explored by previous works. Moreover, we find that our method delivers more benefits in data-scarce scenarios. We hope to provide new guidance for future slot tagging work.