Adversarial Semantic Decoupling for Recognizing Open-Vocabulary Slots

Open-vocabulary slots, such as file name, album name, or schedule title, significantly degrade the performance of neural-based slot filling models since these slots can take on values from a virtually unlimited set and have no semantic restriction nor a length limit. In this paper, we propose a robust adversarial model-agnostic slot filling method that explicitly decouples the local semantics inherent in open-vocabulary slot words from the global context. We aim to disentangle the contextual semantics and focus more on the holistic context at the level of the whole sentence. Experiments on two public datasets show that our method consistently outperforms other methods by a statistically significant margin on all the open-vocabulary slots without deteriorating the performance of normal slots.


Introduction
Slot filling is a critical component of spoken language understanding (SLU) in task-oriented dialogue systems. It aims at extracting semantic constituents from user queries. Given an immense amount of labeled training data, recent neural networks (Mesnil et al., 2015; Liu and Lane, 2015, 2016; Goo et al., 2018; Haihong et al., 2019; He et al., 2020a,b) have been actively applied to the slot filling task and achieved good results.
Although most previous neural-based models achieve state-of-the-art performance across a wide range of slot filling datasets, they often suffer from poor slot filling accuracy when dealing with 'open-vocabulary' slots. Open-vocabulary slots signify slot types that can take on values from a virtually unlimited set, such as file name, album name, text body, or schedule title.

[Figure 1: an example from the Snips dataset (Coucke et al., 2018). Here "water" is mistakenly recognized as the "entity name" type by the baseline model (Liu and Lane, 2016) due to the local context "don't drink the water"; however, it represents a playlist at the level of the whole sentence. Figure: the top-10 slot types with the highest error rates (Liu and Lane, 2016).]

Typically, these slot values
have no constraints on the length or specific semantic patterns of their content. Besides, these words are employed differently from the meaning inherent in themselves, as Fig 1 shows. Intrinsically, the complexity of recognizing open-vocabulary slots comes from inconsistent context at different granularities. For example, consider the utterance "add the song don't drink the water to my playlist" in Fig 1. While identifying the slot type of the word "water", the slot filling model will mistakenly recognize it as the "entity name" slot type if it only focuses on the local context "don't drink the water". By contrast, it should instead focus on the global context "add the song ... to my playlist" to recognize "don't drink the water" as the correct "playlist" slot type.

Kim et al. (2018) exploit a long-term aware attention structure and positional encoding with multi-task learning to capture global information. Kim et al. (2019) focus on data augmentation by adding random noise to the embeddings of all slot words. Ray et al. (2019) propose an iterative delexicalization algorithm that utilizes model uncertainty to improve delexicalization for open-vocabulary slots. One major limitation is that these methods cannot explicitly distinguish the semantic representation inherent in open-vocabulary slot words from the holistic context.
In this paper, we propose a robust adversarial slot filling approach that explicitly decouples the local semantic representation inherent in open-vocabulary slot words from the global context. Our approach aims to focus more on the holistic semantics at the level of the whole sentence, not only on the local context in the vicinity of open-vocabulary slots. Specifically, our approach generates model-agnostic adversarial worst-case perturbations to the inputs in the direction that most increases the model's loss. Our main contributions are threefold: (1) We dive into the issues of open-vocabulary slots in the slot filling task and propose a novel adversarial semantic decoupling method which distinguishes local semantics from the global context.
(2) Our method can be easily applied to all previous neural-based slot filling models. (3) Experiments show that our proposed method consistently outperforms various SOTA baselines, especially in open-vocabulary slot F1.


Approach

Problem Formulation Given a sentence X = {x_1, ..., x_n} with n tokens, the slot filling task is to predict a corresponding tag sequence Y = {y_1, ..., y_n} in BIO format, where each y_i can take three types of values: B-slot_type, I-slot_type, and O.
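As an illustration of the BIO format, a tag sequence can be decoded into typed spans with a few lines of code (a minimal sketch; the function name and the (slot_type, start, end) span convention are ours, not from the paper):

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (slot_type, start, end) spans, end exclusive."""
    spans, start, cur = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" sentinel flushes the last span
        inside = tag.startswith("I-") and tag[2:] == cur
        if not inside:                      # a B- tag, an O, or a type switch closes the span
            if cur is not None:
                spans.append((cur, start, i))
            cur = tag[2:] if tag[0] in "BI" else None
            start = i if cur is not None else None
    return spans
```

For example, bio_to_spans(["O", "B-playlist", "I-playlist", "O"]) yields [("playlist", 1, 3)], i.e., one playlist slot covering tokens 1-2.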
Fig 3 shows the overall architecture of our method. Here we adopt BiLSTM (Liu and Lane, 2016) as our backbone. Our method includes three core steps: forward, backward, and decoupling forward. We first feed each word into an embedding layer to get word embeddings e_i = E(x_i). Then, in the forward step, we adopt a BiLSTM layer and a softmax output layer to calculate the classification cross-entropy loss L(f(e; θ), Y) for each word.
Then, in the second backward step, we perform adversarial attacks (Goodfellow et al., 2015; Kurakin et al., 2016; Miyato et al., 2016; Jia and Liang, 2017; Zhang et al., 2019; Ren et al., 2019) to explicitly shift the local semantics of open-vocabulary slot words and decouple them from the global context. Theoretically, we need to compute a decoupling vector v_dec that effectively degrades the current model's performance (i.e., maximizes the loss function):

v_dec = argmax_{||v|| ≤ ε} L(f(e + v; θ), Y)

where L indicates the loss function and ε is the norm bound of the decoupling vector. However, due to model complexity, accurate computation of v_dec is costly and inefficient. Similar to Vedula et al. (2020) and Ru et al. (2020), we apply the Fast Gradient Value (FGV) method (Rozsa et al., 2016) to approximate a worst-case perturbation as our decoupling vector:

g = ∇_e L(f(e; θ), Y),  v_dec = ε · g / ||g||

Here, the gradient g is the first-order differential of the loss function L w.r.t. e, representing the direction that most rapidly increases the loss function. We normalize g and then use a small ε to ensure the approximation is reasonable. Finally, we perform a mask operation to filter out normal words and add the decoupling vector to the original token embeddings e. Hence, the updated word embeddings are e' = e + v_dec, while the other model parameters are fixed. An ablation study shows that adding the decoupling vector only to open-vocabulary slot words achieves a better improvement.
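The FGV perturbation and the weighted training loss can be sketched with a toy linear tagger standing in for the BiLSTM (a hedged illustration under our own assumptions, not the paper's implementation; all function names and the analytic softmax gradient are ours):

```python
import numpy as np

def ce_loss(e, W, y):
    """Token-level cross-entropy of a toy linear tagger f(e) = softmax(e @ W)."""
    logits = e @ W
    m = logits.max(axis=1, keepdims=True)
    logp = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def fgv_decouple(e, W, y, ov_mask, eps=1.5):
    """FGV approximation of the decoupling vector, v_dec = eps * g / ||g||,
    added only to open-vocabulary slot tokens (ov_mask == 1)."""
    logits = e @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0           # dL/dlogits for cross-entropy
    g = p @ W.T                              # g = dL/de: steepest loss-increase direction
    v_dec = eps * g / (np.linalg.norm(g) + 1e-12)
    return e + ov_mask[:, None] * v_dec      # mask filters out normal words

def final_loss(e, W, y, ov_mask, eps=1.5, alpha=0.5):
    """Weighted sum of the clean loss L and the adversarial loss L'."""
    e_adv = fgv_decouple(e, W, y, ov_mask, eps)
    return alpha * ce_loss(e, W, y) + (1 - alpha) * ce_loss(e_adv, W, y)
```

Because the perturbation follows the loss gradient, for a small eps the adversarial loss is typically no smaller than the clean loss, which is exactly the "worst-case" behavior the decoupling vector is meant to approximate.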
In the third decoupling forward step, we feed e' into the same BiLSTM model and calculate a new adversarial loss L'. The final loss is a weighted sum of L and L' controlled by a hyperparameter α:

L_final = α · L + (1 − α) · L'


Experiments

Baselines For a fair comparison, we use the same slot filling architecture, BiLSTM (Liu and Lane, 2016), as (Kim et al., 2019; Ray et al., 2019). Kim et al. (2019) propose two model variants, where random noise means adding random noise to the embeddings of all slot words and cw represents concatenating the context word window as input. Note that the random noise in (Kim et al., 2019) is independently sampled regardless of the global context, which is significantly different from our method. Our adversarial semantic decoupling method takes into account the impact of different contexts (global semantics) on local semantics, thereby enabling more accurate decoupling. Ray et al. (2019) propose greedy delex and iterative delex methods for open-vocabulary slots. We also validate our method in BERT-based models (Devlin et al., 2019) for a comprehensive analysis.
Evaluation We evaluate the performance of slot filling using the F1 metric (Sang and Buchholz, 2000). Specifically, we report the F1 score over all open-vocabulary slots, denoted as F1-ov. We follow the setups in (Liu and Lane, 2016; Kim et al., 2019) and re-implement the baseline BiLSTM, +random noise, and +random noise,cw under the same settings. We report the original results of greedy delex and iterative delex from (Ray et al., 2019).
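F1-ov is simply the usual exact-match, span-level F1 restricted to the open-vocabulary slot types (a small sketch under our own (slot_type, start, end) span convention; not the authors' evaluation script):

```python
def span_f1(gold, pred, slot_types=None):
    """Span-level F1 over (slot_type, start, end) spans.
    Pass the set of open-vocabulary types as slot_types to obtain F1-ov."""
    keep = lambda spans: {s for s in spans if slot_types is None or s[0] in slot_types}
    g, p = keep(gold), keep(pred)
    tp = len(g & p)                               # only exact-match spans count as correct
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, if the model predicts one of two gold spans correctly and gets the other's boundary wrong, precision and recall are both 0.5 and span_f1 returns 0.5; the exact-match criterion follows the CoNLL-2000 convention (Sang and Buchholz, 2000).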

Main Results
We display the experiment results in Table 3. Compared to the previous state-of-the-art methods, our method achieves consistent improvements. We also show the results of BERT models. Table 3 displays that our method still achieves an improvement of 8.29% in F1-ov over the original BERT model and 0.74% over the previous SOTA, which substantiates that our method is model-agnostic and can be easily integrated into different slot filling architectures. Meanwhile, the F1-ov scores of the BERT-based models are consistently higher than those of the BiLSTM-based models, which indicates that BERT captures global context semantics and handles long-term dependencies more effectively than BiLSTM.

Qualitative Analysis
Results of all open-vocabulary slot categories Our method improves over the baselines on every open-vocabulary slot type, which confirms that our method is not specific to a few slot types. For the restaurant name type, the random noise model suffers a performance drop of 7.62% compared to BiLSTM. This illustrates that simply adding random noise is unconstrained and offers no guarantee of semantic decoupling. Conversely, our method employs a deliberate adversarial disturbance and outperforms BiLSTM by 9.58%.

Open-vocabulary slots vs normal slots We also report the overall test F1, F1-ov on all the open-vocabulary slots, and F1-normal on all the normal slots in Table 4 to compare comprehensive performance. The results show that our method significantly outperforms BiLSTM by 14.31% on F1-ov and 2.6% on F1-normal, which proves that our method obtains a notable improvement on open-vocabulary slots without harming the performance of normal slots. We hypothesize that the improvement on normal slots likewise stems from the model attending more to sentence-level context.

Ablation studies To study the effects of different hyperparameters of our method, we conduct an ablation analysis under the BiLSTM architecture (Table 5). We can see that adding the perturbation to the embedding layer of open-vocabulary slots yields a significant improvement. Specifically, for the Filter setting, adding the perturbation only to open-vocabulary slots outperforms adding it to all slots by 1.65%. For the Space setting, adding the perturbation to the word embedding layer is superior to the RNN layer. For the hyperparameters ε and α, ε = 1.5 and α = 0.5 achieve the best performance.
Case study Table 6 gives three examples from the Snips dataset: (1) the baseline model identifies the partial word "one" in "the sound of one hand clipping" as "rating value" due to overfitting; (2) the baseline model fails to identify "look to you" since it is heavily coupled with "put" in local semantics; (3) the predicate "would die" inside an open-vocabulary slot is taken as the predicate of the whole sentence and thus mistakenly labeled as "O" by the baseline model. In all cases, the baseline model focuses too much on local semantics and neglects global hints. With our proposed approach, the model is trained to pay more attention to global semantics and succeeds in identifying the open-vocabulary slots.

Conclusion
In this paper, we dive into the issues of open-vocabulary slots in the slot filling task and propose a novel model-agnostic adversarial semantic decoupling method which distinguishes the local semantics inherent in open-vocabulary slot words from the global context. Experiments confirm the effectiveness of semantic decoupling. We hope to provide new guidance for future slot filling work.