Attention-based Semantic Priming for Slot-filling

Sequence labelling in language understanding may benefit from approaches inspired by the semantic priming phenomenon. We propose that an attention-based RNN architecture can be used to simulate semantic priming for sequence labelling. Specifically, we employ pre-trained word embeddings to characterise the semantic relationship between utterances and labels. We validate the approach on varying sizes of the ATIS and MEDIA datasets, and show improvements of 1.4-1.9% in F1 score. The developed framework can enable more explainable and generalizable spoken language understanding systems.


Introduction
Priming (Waltz and Pollack, 1985) is a cognitive mechanism in which a primary stimulus (the prime) influences the response to a subsequent stimulus (the target) in an implicit and intuitive manner. In the case of semantic priming, both the prime and the target typically belong to the same semantic category. Semantic priming can be explained in terms of induced activation in associative neural networks (McClelland and Rogers, 2003). Further, there is empirical evidence that the processing of words in natural language is influenced by preceding words that are semantically related (Foss, 1982). We therefore hypothesize that approaches inspired by semantic priming can improve sequence labelling.
Previous studies have leveraged contextual information in utterance sequences (Mesnil et al., 2015) and dependencies between labels (Ma and Hovy, 2016) to improve performance in sequence labelling tasks. However, there has been limited work on using contextual information in utterances to inform the inference of subsequent labels through semantic priming. For instance, "I'd like to book ..." suggests not only the next word(s), e.g., flight, but also the label of the next word(s), e.g., services. We posit that systems employing this kind of cross-linked semantic priming could enhance performance in a variety of sequence labelling tasks.
In this work, we hypothesize that semantic priming in human cognition can be simulated by an attention mechanism that uses word context to enhance the discriminating power of sequence labelling models. We propose and explore the use of attention (Bahdanau et al., 2014) in a deep learning architecture to simulate the semantic priming mechanism. We apply this concept to slot filling, an example of sequence labelling in spoken language understanding, which aims to label utterance sequences with a set of begin/in/out (BIO) tags. Specifically, we use pre-trained word embeddings to characterise not only the context of words, but also the semantic relationship between words in utterances and words in labels.
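As an illustration of the BIO tagging scheme used in slot filling, consider a toy ATIS-style utterance; the tag names below are hypothetical examples for exposition, not taken from the actual dataset:

```python
# Toy ATIS-style utterance with BIO slot labels (illustrative only).
utterance = ["book", "a", "flight", "to", "new", "york", "tomorrow"]
bio_tags = ["O", "O", "O", "O",
            "B-toloc.city_name", "I-toloc.city_name",
            "B-depart_date.today_relative"]

# Slot filling assigns exactly one BIO tag per token: "B-" opens a slot,
# "I-" continues it, and "O" marks tokens outside any slot.
assert len(utterance) == len(bio_tags)
```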
Overall, we develop a semantic-priming-based approach for the task of slot filling that associates utterances and label sequences. Our contributions are as follows: (1) We propose an approach that applies semantic priming to sequence labelling. To capture semantic associations between utterance words and label words, we use three different strategies for deriving label embeddings from pre-trained embeddings. (2) We implement the approach in an LSTM-based architecture and validate its efficacy.
In Section 2 we review related work. Section 3 elaborates the proposed approach. An empirical evaluation is provided in Section 4. Finally, Section 5 concludes the paper.

Related Work
Our proposed method draws on the attention mechanism, which has been shown to be effective for sequence-based NLP tasks, particularly machine translation (Bahdanau et al., 2014; Luong et al., 2015). Since attention allows a neural network to dynamically attend to important features in the inputs, it is a suitable mechanism for achieving semantic priming between utterances and labels. Conditional random fields (CRFs) have been used together with RNNs, sometimes also including CNNs, to improve accuracy (Mesnil et al., 2013, 2015; Ma and Hovy, 2016; Reimers and Gurevych, 2017b). Dinarelli et al. (2017) propose to learn label embeddings to improve tagging accuracy, whereas our label embeddings are computed directly from pre-trained word embeddings. Furthermore, our approach does not require shifted label sequences as input.
To use external knowledge, previous studies consider graph or entity embeddings (Huang et al., 2017; Chen et al., 2016; Yang and Mitchell, 2017), together with other contextual information such as dependency graphs (Huang et al., 2017) or sentence structures (Chen et al., 2016). Specifically, Yang and Mitchell (2017) extend LSTMs with graph embeddings to learn concepts from knowledge bases and integrate the concept embeddings into the state vectors of words. In contrast, our approach does not learn or parse sentences to obtain extra contextual information, which makes it suitable for languages lacking well-trained parsers. Moreover, context integration is achieved not by fine-tuning the underlying RNN structure, but through the attention mechanism.

Semantic Priming
Figure 1 depicts an LSTM-based neural network architecture for semantic priming. Given an utterance, a priming matrix is computed to connect the labels to input features generated by a bi-directional LSTM. The priming effects are then used for prediction.

Computing Priming Matrix
This section presents three different strategies for the proposed attention-based semantic priming mechanism. In all three cases, the input words are compared to proxies of the semantic categories in the word-vector space.
Let m denote the number of labels. An utterance of length n is represented by the matrix X : n × k, where k is the dimension of the pre-trained word vectors. Given a word vector x_j, semantic priming is achieved by comparing x_j with a label embedding matrix L : m × k, with m unique concepts, each encoded in k dimensions. In addition, let E_{l_i}, 1 ≤ i ≤ m, denote the set of embedded words tagged with the label l_i in the dataset. Note that the corresponding embedding of l_i is L_i. Below are the definitions of the three strategies used to compute the label embeddings L.
• Priming using Instance Centroid (PIC): L is defined to be m × k and L_i = mean(E_{l_i}). Intuitively, the proxy of the concept, L_i, is the centroid (mean vector) of the cluster of all known instance words in the concept.
• Priming using Instance Neighbor (PIN): L is defined to be m × k, and L_i is the instance in E_{l_i} nearest to the current word vector x_j. In this case, the proxy of the concept is the nearest instance having the same label as x_j.
• Priming using Concepts (PC): L is defined to be m × k, m is pre-specified, and L_i = c_i, where c_i is a manually selected concept from l_i. The embedding representation, c_i, is of dimension k, as it is either the word vector of a single concept label or the mean vector of a set of such word vectors.
While PIN is a straightforward simulation of the semantic priming mechanism between a prime and its potential targets in different classes, PIC and PC are variants of a categorization mechanism referred to as the Basic Level (Rosch et al., 1976), in which the targets are intermediate, dominant concepts that represent the category. Once L is computed, the priming matrix is obtained from the cosine similarity, or the induced distance, between the word embeddings of the utterance and L, i.e., p = cos(X, L).
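A minimal NumPy sketch of the PIC and PIN strategies and of the cosine priming matrix is given below. The function and variable names are ours, not the paper's code; the PC strategy is omitted since it simply stacks manually chosen concept vectors:

```python
import numpy as np

def pic_label_embeddings(instances_by_label):
    """PIC: each label's proxy L_i is the centroid of its instance vectors E_{l_i}."""
    return np.stack([np.mean(vecs, axis=0) for vecs in instances_by_label])

def pin_label_embeddings(instances_by_label, x_j):
    """PIN: each label's proxy L_i is the instance in E_{l_i} nearest to x_j."""
    rows = []
    for vecs in instances_by_label:
        vecs = np.asarray(vecs)
        dists = np.linalg.norm(vecs - x_j, axis=1)
        rows.append(vecs[np.argmin(dists)])
    return np.stack(rows)

def priming_matrix(X, L):
    """p[j, i] = cos(x_j, L_i): cosine similarity between each utterance
    word vector (rows of X, n x k) and each label embedding (rows of L, m x k)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Ln = L / np.linalg.norm(L, axis=1, keepdims=True)
    return Xn @ Ln.T
```

With two orthogonal word vectors and two labels whose instances align with them, the priming matrix reduces to the identity, as expected from the cosine definition.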

Attention to Semantic Priming
In Figure 1, the hidden states, h, of the bi-directional LSTM are considered to be the source, while the priming matrix p is analogous to the target. Following Luong et al. (2015), we define the alignment scoring function to be s(p, h) = p W_a h, where W_a is a trainable weight matrix, and compute the final output from the resulting attention-weighted combination of the hidden states.
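The scoring and attention steps could be sketched as follows in plain NumPy. This is an illustration under the stated scoring function s(p, h) = p W_a h only; the exact shapes and the way the context vector feeds the FC output layer are our assumptions, since the paper's output equation is not reproduced here:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def primed_attention(p, H, W_a):
    """Attend over bi-LSTM hidden states using the priming vector.

    p   : (m,)   priming similarities for the current position (the "target")
    H   : (n, d) bi-LSTM hidden states (the "source")
    W_a : (m, d) trainable alignment weights

    Returns the (d,) context vector assumed to feed the FC output layer.
    """
    scores = (p @ W_a) @ H.T    # s(p, h_t) = p W_a h_t for each position t
    weights = softmax(scores)   # normalised attention over hidden states
    return weights @ H
```

With zero alignment weights the scores are uniform, so the context vector is simply the mean of the hidden states, which is a quick sanity check on the implementation.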

Experiments
To validate the efficacy of the architecture in Figure 1, an empirical evaluation was performed, implemented in Keras (https://keras.io/). This section describes the experimental setup and presents our results.

Datasets
Two spoken dialogue datasets were used in the experiments: the Air Travel Information System (ATIS) task (Dahl et al., 1994) and MEDIA, a corpus of French dialogues collected by ELDA (Bonneau-Maynard et al., 2005). The statistics of the two datasets are given in Table 1. For MEDIA, using entities significantly impacts performance; thus entities are used together with words in utterances, as implied by the vocabulary size in Table 1. Since a bi-directional LSTM is used in the architecture in Figure 1, no context word windows (Mesnil et al., 2015) were used as additional inputs. The pre-trained word embedding sources for the two datasets are GloVe (English) (Pennington et al., 2014) and fastText (French) (Bojanowski et al., 2016), respectively. In particular, we found that about 100 words are missing from the fastText French word embeddings; some of these, however, are due to the original tokenization in MEDIA.

Setup and Hyperparameters
To facilitate mini-batching during training, the utterances were padded to the maximum utterance length. For all experiments, we use one set of fixed hyperparameters to enable meaningful comparison. The dimension of the word embeddings is 300 for both GloVe and fastText. Following the recommendations in (Reimers and Gurevych, 2017a), all dropout layers have a rate of 0.5, and the LSTM has an additional recurrent dropout of 0.5 between recurrent units. During the learning phase, a mini-batch size of 18 and an initial learning rate of 0.004 were used with the Adam optimizer to minimize the cross-entropy loss. The learning rate was reduced by 50% after no improvement for three epochs.
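The reduce-on-plateau schedule above (halve the learning rate after three epochs without improvement) can be replayed in isolation. The function below is our own offline sketch of that rule, not the paper's training code:

```python
def reduce_lr_on_plateau(val_losses, lr0=0.004, factor=0.5, patience=3):
    """Replay a validation-loss history and return the learning rate used at
    each epoch, multiplying it by `factor` whenever `patience` consecutive
    epochs pass without an improvement over the best loss seen so far."""
    lr, best, wait, lrs = lr0, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        lrs.append(lr)
    return lrs
```

For example, a loss history that stalls for three epochs after an initial improvement triggers one halving: `reduce_lr_on_plateau([1.0, 0.9, 0.9, 0.9, 0.9])` yields `[0.004, 0.004, 0.004, 0.004, 0.002]`.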
As semantic priming provides connections between words and labels through the use of the same pre-trained embeddings, it should enable more robust performance even when the datasets are small. To validate this, we investigated the effects of semantic priming in cases where the datasets are reduced. Note that both ATIS and MEDIA have many short utterances; in particular, MEDIA has over 4000 utterances consisting of a single word. For reduction, we rank the vocabulary by word frequency in the training and development sets and choose utterances containing those words until the desired percentage of the vocabulary is covered.
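One possible reading of this reduction procedure is sketched below; the greedy selection order and the helper name are our assumptions about a step the paper does not spell out:

```python
from collections import Counter

def reduce_by_coverage(utterances, coverage=1.0):
    """Walk the vocabulary in descending frequency order and keep every
    utterance containing the current word, stopping once the requested
    fraction of the vocabulary has been covered by the kept utterances."""
    freq = Counter(w for u in utterances for w in u)
    target = coverage * len(freq)
    covered, keep = set(), set()
    for word, _ in freq.most_common():
        if len(covered) >= target:
            break
        for i, u in enumerate(utterances):
            if word in u:
                keep.add(i)
                covered.update(u)
    return [utterances[i] for i in sorted(keep)]
```

On a toy corpus, full coverage keeps every utterance, while a 50% target keeps only the utterances needed to cover the most frequent words.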

Results
In this section the conlleval F1 scores are reported. The experiments were run on an NVIDIA DGX-1 station (Tesla V100, 16GB memory), and the F1 scores are averaged over the first 30 epochs of three independent runs.
The results shown are for the baseline with trainable embeddings (BE), the baseline with pre-trained embeddings (BP), and the strategies defined in Section 3.1, i.e., PIC, PIN and PC. For PC, the concepts are the keywords that occur in the labels; example concepts include airline in ATIS and chambre in MEDIA. A total of 30 and 53 concepts were extracted for PC in ATIS and MEDIA, respectively.
Although BE yields a much higher F1, we compare the proposed approach with the baseline BP, whose F1 is computed using pre-trained embeddings, because all strategies except BE are based on pre-trained word embeddings. We also compare the results on the MEDIA dataset with and without CRF; since CRF was shown to lead to no improvement on ATIS (Dinarelli et al., 2017), no CRF layer was applied to ATIS in the experiments. Table 2 shows the F1 computed over the full datasets. In ATIS, although no significant conclusions can be drawn, all strategies, in particular PC, outperform the baseline BP. For MEDIA, F1 drops considerably when pre-trained word embeddings are used instead of trainable embeddings. When SOFTMAX is used, none of the strategies outperformed the baselines BP or BE; in contrast, when CRF is used instead, there is an increase of 4% for BE, 7% for BP, and 10% for PIC/PIN. Table 4 describes the results over further reduced datasets: these two reduced datasets cover only 70% of the whole vocabulary, containing 348 and 1216 utterances (train/dev) for ATIS and MEDIA, respectively. As shown in Table 4, PC was the best strategy for ATIS, while PIN consistently outperformed the baseline BP in MEDIA.
Overall, we have seen performance gains when priming is used, over both the original and the reduced datasets, compared to the pre-trained baseline BP. In particular, we recommend PIN over the other strategies, as it is less computationally expensive than PIC while providing more consistent improvement over BP than the other strategies.

Conclusions and Future Work
We have demonstrated an approach that leverages semantic priming for natural language understanding tasks. The approach employs pre-trained embeddings to prime label concepts based on utterance words. Our experimental results suggest that improvements over the baselines are feasible. However, we note that the coverage of the dataset vocabulary in the pre-trained word embeddings may limit performance improvements; for example, the missing words in the pre-trained French word embeddings adversely affected the F1 scores for MEDIA. The approach can be easily adapted to a variety of network architectures (e.g., Dinarelli et al., 2017) and word embeddings (e.g., Reimers and Gurevych, 2017a). Future studies will focus on how to choose a good set of concepts for the PC priming strategy. It will also be fruitful to understand how to explain sequence labelling outputs using attention mechanisms.

Figure 1 :
Figure 1: Proposed topology for priming. FC denotes a fully connected layer.

Table 1 :
Statistics of the datasets. † The vocabulary is a mix of words and entities.

Table 2 :
F1 of the two datasets. † CRF used.
In contrast, once CRF is used, both PIC and PIN gained over a 1% increase compared with BP.

Table 3 :
F1 of the reduced datasets. † CRF used. 100% of the vocabulary in the datasets is retained.

Table 3 describes the results over reduced datasets that cover the full (100%) vocabulary of the datasets. The reduced ATIS set has a total of 583 utterances for training/development, while the reduced MEDIA set has 1717. Note that no reduction was performed on the test data, i.e., full test sets were used. For both ATIS and MEDIA, PIN shows a consistent performance gain (+1%) over the pre-trained baseline approach (BP).

Table 4 :
F1 of the reduced datasets. † CRF used. 70% of the vocabulary in the datasets is retained.