Noise Robust Named Entity Understanding for Voice Assistants

Named Entity Recognition (NER) and Entity Linking (EL) play an essential role in voice assistant interaction, but are challenging due to the special difficulties associated with spoken user queries. In this paper, we propose a novel architecture that jointly solves the NER and EL tasks by combining them in a joint reranking module. We show that our proposed framework improves NER F1 by up to 3.13% and EL F1 by up to 3.6%. The features used also lead to better accuracies in other natural language understanding tasks, such as domain classification and semantic parsing.


Introduction
Understanding named entities correctly when interacting with virtual assistants (e.g. "Call Jon", "Play Adele hello", "Score for Warrior Kings game") is crucial for a satisfying user experience. However, NER and EL methods that work well on written text often perform poorly in such applications: utterances are relatively short (with just 5 tokens, on average), so there is not much context to help disambiguate; speech recognizers make errors ("Play Bohemian raspberry" for "Play Bohemian Rhapsody"); users also make mistakes ("Cristiano Nando" for "Cristiano Ronaldo"); non-canonical forms of names are frequent ("Shaq" for "Shaquille O'Neal"); and users often mention new entities unknown to the system.
In order to address these issues we propose a novel Named Entity Understanding (NEU) system that combines and optimizes NER and EL for noisy spoken natural language utterances. We pass multiple NER hypotheses to EL for reranking, enabling NER to benefit from EL by including information from the knowledge base (KB).
We also design a retrieval engine tuned for spoken utterances for retrieving candidates from the KB. The retrieval engine, along with other techniques devised to address fuzzy entity mentions, lets the EL model be more robust to partial mentions, variation in named entities, use of aliases, as well as human and speech transcription errors.
Finally, we demonstrate that our framework can also empower other natural language understanding tasks, such as domain classification (a sentence classification task) and semantic parsing.

Related Work
There have been a few attempts to explore NER on the output of a speech pipeline (Ghannay et al., 2018; Abujabal and Gaspers, 2018; Coucke et al., 2018). Among these, our NER model is closest to Abujabal and Gaspers (2018) and Coucke et al. (2018); however, unlike the former, we use a richer set of features rather than phonemes as input, and unlike the latter, we are able to use a deep model because of the large volume of data available.
EL has been well explored in the context of clean (Martins et al., 2019; Kolitsas et al., 2018; Luo et al., 2015) and noisy text inputs (Eshel et al., 2017; Guo et al., 2013; Liu et al., 2013), but as with NER, there have been only a few efforts to explore EL in the context of transcribed speech (Benton and Dredze, 2015; Gao et al., 2017), although crucially, both these works assume gold standard NER and focus purely on the EL component.
Traditionally, a pipelined architecture of NER followed by EL has been used to address the entity linking task (Lin et al., 2012; Derczynski et al., 2015; Bontcheva et al., 2017; Bowden et al., 2018). Since these approaches rely only on the best NER hypothesis, errors from NER propagate to the EL step. To alleviate this, joint models have been proposed: Sil and Yates (2013) proposed an NER+EL model which re-ranks candidate mentions and entity links produced by their base model. Our work differs in that we use a high-precision NER system, while they use a large number of heuristically obtained Noun Phrase (NP) chunks and word n-grams as input to the EL stage. Luo et al. (2015) jointly train an NER and EL system using a probabilistic graphical model. However, these systems are trained and tested on clean text and do not address the noise problems we are concerned with.

Architecture Design
For a given utterance, we first detect and label entities using the NER model and generate the top-l candidate hypotheses using beam search. The EL model consists of two stages: (i) candidate retrieval and (ii) joint linking and re-ranking. In the retrieval stage, for each NER hypothesis, we construct a structured search query and retrieve the top-c candidates from the retrieval engine. In the ranking stage, we use a neural network to rank these candidate entity links within each NER hypothesis while simultaneously using rich signals (entity popularity, similarity between entity embeddings, the relation across multiple entities in one utterance, etc.) from these entity links as additional features to re-rank the NER hypotheses from the previous step, thus jointly addressing both the NER and EL tasks.

NER
For the NER task, following Lample et al. (2016) we use a combination of character- and word-level features. These are extracted by a bi-directional LSTM (biLSTM) (Hochreiter and Schmidhuber, 1997), concatenated with pre-trained GloVe word embeddings (Pennington et al., 2014), passed through another biLSTM, and fed into a CRF model that produces the final label prediction based on a score s(ỹ_i, x; θ) which jointly optimizes the probability of labels for the tokens and the transition score for the entire sequence ỹ_i = (y_1, . . . , y_T) given the input x:

s(ỹ_i, x; θ) = Σ_{t=1}^{T} ψ_{t,θ}(y_t) + Σ_{t=2}^{T} φ(y_{t−1}, y_t),

where ψ_{t,θ} is the biLSTM prediction score for label y_t of the t-th token, and φ(j, k) is the transition score from label j to label k.
During training, we maximize the probability of the correct label sequence p_seq, defined as

p_seq(ỹ_i, x; θ) = exp(s(ỹ_i, x; θ)) / Σ_{ỹ ∈ S} exp(s(ỹ, x; θ)),

where ỹ_i is the label sequence for hypothesis i, and S is the set of all possible label sequences. During inference, we generate up to 5 NER alternatives for each utterance using beam search. We also calculate a mention-level confidence p_men for each entity mention m_k, computed by aggregating the sequence-level confidence of all prediction sequences that share the same mention sub-path m_k:

p_men(m_k) = Σ_{ỹ_i ∈ S_{m_k}} p_seq(ỹ_i, x; θ),

where S_{m_k} is the set of prediction sequences that all have m_k as the prediction for the corresponding tokens. Both p_seq and p_men are computed by dynamic programming, and serve as informative features in the EL model.
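As an illustration, the sequence-level and mention-level confidences can be sketched as follows. This is a brute-force toy sketch with hypothetical inputs and names of our own choosing; as noted above, the actual system computes both quantities with dynamic programming rather than by enumerating sequences.

```python
from itertools import product
import math

def sequence_score(emissions, transitions, labels):
    """CRF score: per-token emission scores plus label-transition scores."""
    s = sum(emissions[t][y] for t, y in enumerate(labels))
    s += sum(transitions[labels[t - 1]][labels[t]] for t in range(1, len(labels)))
    return s

def sequence_probs(emissions, transitions, label_set):
    """p_seq: softmax over all label sequences (brute force for illustration)."""
    seqs = list(product(label_set, repeat=len(emissions)))
    scores = [sequence_score(emissions, transitions, s) for s in seqs]
    z = sum(math.exp(s) for s in scores)
    return {seq: math.exp(s) / z for seq, s in zip(seqs, scores)}

def mention_confidence(probs, mention_span, mention_labels):
    """p_men: total probability of sequences agreeing on the mention sub-path."""
    i, j = mention_span
    return sum(p for seq, p in probs.items() if seq[i:j] == mention_labels)
```
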

Joint Linking and Re-ranking
The entity linking system follows the NER model and consists of two steps: candidate retrieval, and joint linking and re-ranking.
To build the candidate retrieval engine, we first index the list of entities in our knowledge base, which can be updated daily to capture new entities and changes in their popularity. To construct the index, we iterate through the flattened list of entities and construct token-level unigram, bigram and trigram terms from the surface form of each entity. Apart from the original entity names, we also use common aliases, harvested from usage logs, for popular entities (e.g. LOTR as an alias for "Lord of the Rings") to make the retrieval engine more robust to commonly occurring variations. Next, we create an inverted index which maps the unique list of n-gram terms to the lists of entities that these n-grams are part of, also known as posting lists. Further, to capture cross-entity relationships in the knowledge base (such as relationships between an artist and a song, or two sports teams belonging to the same league), we assign a pointer for each entity in the knowledge base to its related entities; this relational information is leveraged by the EL model for entity disambiguation (described in Section 5.2). We then compute the tf-idf score for all the n-gram terms present in the entities and store them in the inverted index.
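A minimal sketch of this indexing step, assuming a simple id-to-name entity list; the function names and the plain tf-idf weighting are illustrative, and the production index additionally stores aliases and relation pointers as described above.

```python
import math
from collections import defaultdict

def ngrams(tokens, n_max=3):
    """Token-level unigram/bigram/trigram terms from an entity surface form."""
    return [" ".join(tokens[i:i + n]) for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def build_index(entities):
    """Inverted index: term -> posting list of entity ids, plus tf-idf weights."""
    postings = defaultdict(set)
    tf = defaultdict(lambda: defaultdict(int))
    for eid, name in entities.items():
        for term in ngrams(name.lower().split()):
            postings[term].add(eid)
            tf[eid][term] += 1
    n = len(entities)
    weights = {
        (eid, term): tf[eid][term] * math.log(n / len(postings[term]))
        for term in postings for eid in postings[term]
    }
    return postings, weights
```
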
For each hypothesis predicted by the NER model we query the retrieval engine with the corresponding text. We first send the query through a high-precision seq-to-seq correction model (Schmaltz et al., 2017; Ge et al., 2019) trained on common errors observed in usage. Next, we construct n-gram features from the corrected query in a similar way to the indexing phase and retrieve all entities matching these n-gram features in our inverted index. Additionally, we use synonyms derived from usage for each term in the query to expand our search criteria: for example, our synonym list for "Friend" contains "Friends", which matches the TV show name and would have been missed if only the original term were used.
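The synonym-based query expansion can be sketched as follows; the synonym table here is a hypothetical stand-in for the usage-derived lists described above.

```python
def expand_query(tokens, synonyms):
    """Expand each query term with usage-derived synonyms before index lookup."""
    terms = set(tokens)
    for t in tokens:
        terms.update(synonyms.get(t, ()))
    return terms
```
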
For each entity retrieved, we get the tf-idf score for the terms present in the query chunk from the inverted index. We then aggregate the tf-idf scores of all the terms present in the query for this entity and linearly combine this aggregate score with other attributes such as popularity (i.e. prior usage probability) of the entity to generate a final score for all retrieved entity candidates for this query. Finally, we perform an efficient sort across all the entity candidates based on this score and return a top-c (in our case c = 25) list filtered by the entity type detected by the NER model for that hypothesis. These entity candidates coupled with the original NER hypothesis are sent to the ranker model described below for joint linking and re-ranking.
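A sketch of this candidate-scoring step, assuming a prebuilt inverted index with per-(entity, term) tf-idf weights; the linear mixing weight alpha is illustrative, not the production value.

```python
def score_candidates(query_terms, postings, weights, popularity, alpha=0.7, c=25):
    """Aggregate tf-idf over matched terms, mix with popularity, return top-c."""
    scores = {}
    for term in query_terms:
        for eid in postings.get(term, ()):
            scores[eid] = scores.get(eid, 0.0) + weights[(eid, term)]
    final = {eid: alpha * s + (1 - alpha) * popularity.get(eid, 0.0)
             for eid, s in scores.items()}
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)[:c]
```
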
Following the candidate retrieval step, we introduce a neural model to rerank the candidate entities, aggregating features from both the NER model and the candidate retrieval engine.
The EL model scores each entity linking hypothesis separately. An entity linking hypothesis consists of a prediction from the NER model (the named entity chunks in the input utterance and their types) and the candidate retrieval results for each chunk. Formally, we define an entity linking hypothesis y with k entity predictions as

y = ((m_1, e_1), . . . , (m_k, e_k); f_utter, f_NER, f_CR),

where m_j is the j-th mention in the utterance, and e_j is the entity name associated with this mention from the knowledge base. f_utter, f_NER, and f_CR are features derived from the original utterance text, the NER model, and the candidate retrieval system, respectively. In our system, f_utter is a representation of the utterance obtained by averaging the pre-trained word embeddings of its tokens. Intuitively, a dense representation of the full utterance can help the EL model better leverage signals from the utterance context. f_NER includes the type of each mention, as well as the sequence and mention confidences computed by the NER model. f_CR includes popularity, and whether a relation exists between the retrieved entities in y.
To be robust to noise, the EL model adopts a pair of CNNs to compare each entity mention m_j with its corresponding knowledge base entity name e_j. The CNN learns a name embedding with one-dimensional convolution on the character sequence, and the kernel parameters are shared between the CNN used for the user mention and the one used for the canonical name. A character-based text representation model is better at handling mis-transcribed or mis-pronounced entity names: while a noisy entity name may be far from the canonical name in the word embedding space when they are semantically different, the two are usually close to each other in the character embedding space due to similar spellings. To model the similarity between the CNN name embeddings of m_j and e_j, we use the standard cosine similarity as a baseline; we also experiment with an MLP that takes the concatenated name embeddings as input. The MLP can model more expressive interactions between the two name embeddings, and in turn better handle errors. Finally, we concatenate the similarity features with the other features as input to another MLP that computes the final score for y. Formally, the scoring function is defined in Equation 1, where ⊕ denotes concatenation:

s(y) = MLP(sim(m_1, e_1) ⊕ . . . ⊕ sim(m_k, e_k) ⊕ f_utter ⊕ f_NER ⊕ f_CR).   (1)

In our data, the number of entity mentions in an utterance averages less than 3; we pad the entity feature sequence to length 5, which provides good coverage. In the scoring model above, we use a simple concatenation to aggregate the embedding similarities of multiple entity mentions, which empirically performs as well as sequence models like LSTMs while being much cheaper in computation.
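To illustrate why a character-level representation is robust to mis-transcriptions, the following sketch replaces the learned CNN embedding with a simple bag of character trigrams and compares names by cosine similarity; this is a crude stand-in for the model described above, not the model itself.

```python
import math
from collections import Counter

def char_ngram_vector(name, n=3):
    """Bag of character trigrams; a crude stand-in for a learned name embedding."""
    padded = f"^{name.lower()}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Even though "Carla Cabello" is a mis-transcription, it shares most character trigrams with the canonical "Camila Cabello", so the two names stay close in this space while unrelated names do not.
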
To train the EL model, we use the standard max-margin loss for ranking tasks. If for the i-th example we denote the ground truth by y*_i and an incorrect prediction by ŷ_i, with the scoring function s(·) as defined in Equation 1, the loss function is

L = Σ_i max(0, γ(y*_i, ŷ_i) − s(y*_i) + s(ŷ_i)).

The max-margin loss encourages the ground truth score to be at least a margin γ higher than the score of an incorrect prediction. The margin is defined as a function of the ground truth and the incorrect prediction, and is thus adaptive to the quality of the prediction: a larger margin is needed when the incorrect prediction is further from the ground truth. For our reranking task, we set a smaller margin when only the resolved entities are incorrect but the NER result is correct, and a larger margin when the NER result is wrong. This adaptive margin helps rerank NER hypotheses even when the model cannot rank the linking results correctly. During training, we uniformly sample the negative predictions from the candidates returned by the retrieval engine.
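The adaptive margin can be sketched as follows, representing a hypothesis as a list of (mention, type, entity) triples; the two margin values are hypothetical, chosen only to show the smaller-margin/larger-margin distinction.

```python
def adaptive_margin(gold, pred, gamma_el=0.1, gamma_ner=0.5):
    """Smaller margin when only the linked entities differ; larger when the
    NER spans/types themselves differ."""
    gold_ner = [(m, t) for m, t, _ in gold]
    pred_ner = [(m, t) for m, t, _ in pred]
    return gamma_el if gold_ner == pred_ner else gamma_ner

def max_margin_loss(score_gold, score_pred, margin):
    """Hinge loss: the gold score must beat the negative by at least the margin."""
    return max(0.0, margin - score_gold + score_pred)
```
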

Improvement on Other Language Understanding Tasks
We also explore the impact of our NEU feature encoding on two tasks: a domain classifier and a domain-specific shallow semantic parser.

Domain Classification
Domain classification identifies which domain a user's request falls into (sports, weather, music, etc.) and is usually posed as a sequence classification task: our baseline uses word embeddings and gazetteer features as inputs to an RNN, in a manner similar to Chen et al. (2019). Consider a specific token t. Let a be the number of alternatives used from the NER model in the domain classifier (which we treat as a hyperparameter), p_i the (scalar) sequence-level confidence score p_seq(ỹ_i, x; θ) of the i-th NER alternative defined in Section 3.1, c_i an integer encoding the entity type that NER hypothesis i assigns to the token t, and o(·) a function converting an integer into its corresponding one-hot vector. Then the additional NER feature vector f_r concatenated to the input vector fed into token t as part of the domain classifier can be written as

f_r = p_1 ⊕ o(c_1) ⊕ . . . ⊕ p_a ⊕ o(c_a).

Likewise, for the featurization that uses both NER and EL, let a be the number of alternatives used from the NER+EL system in the domain classifier (also a hyperparameter); these a alternatives are now sorted by the scores of the EL hypotheses, as opposed to the sequence-level confidence scores from NER. Let s_i be the i-th re-ranked alternative's cosine similarity score between the mention and the knowledge base entity name, as output by the EL model. p_i and c_i are consistent with our earlier notation, except that they now correspond to the i-th NER alternative after re-ranking. Then the additional NER+EL feature vector f_u concatenated to the input fed into token t as part of the domain classifier can be written as

f_u = p_1 ⊕ o(c_1) ⊕ s_1 ⊕ . . . ⊕ p_a ⊕ o(c_a) ⊕ s_a.
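The two featurizations can be sketched as follows with a list-based toy encoding; the alternative tuples are assumed to be already sorted (by NER confidence for f_r, by EL score for f_u), and all names are ours.

```python
def one_hot(c, num_types):
    """o(c): one-hot vector of length num_types."""
    v = [0.0] * num_types
    v[c] = 1.0
    return v

def ner_features(alternatives, a, num_types):
    """f_r: concatenate [p_i, o(c_i)] over the top-a NER alternatives."""
    f = []
    for p_i, c_i in alternatives[:a]:
        f += [p_i] + one_hot(c_i, num_types)
    return f

def ner_el_features(alternatives, a, num_types):
    """f_u: same, with the EL cosine-similarity score s_i appended per alternative."""
    f = []
    for p_i, c_i, s_i in alternatives[:a]:
        f += [p_i] + one_hot(c_i, num_types) + [s_i]
    return f
```
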

Semantic Parsing
Our virtual assistant also uses domain-specific shallow semantic parsers, running after domain classification, responsible both for identifying the correct intent that the user expects (such as the "play" intent associated with a song) and for assigning semantic labels to each of the tokens in a user's utterance (such as the words "score" and "game" being tagged as tokens related to a sports statistic and a sports event, respectively, in the utterance "What's the score of yesterday's Warriors game?"). Each semantic parser is structured as a multi-task sequence classification (for the intent) and sequence tagging (for the token-level semantic labelling) task, with our production baseline using word embeddings and gazetteer features as inputs into an RNN similar to our domain classifier. Here, f_r and f_u are featurized as described above. Note that in contrast to the NEU system, the semantic parser uses a domain-specific ontology, to enable each domain to work independently and to not be encumbered by the need to align ontologies.

Datasets and Training Methodology
To create our datasets, we randomly sampled around 600k unique anonymous English transcripts (machine-transcribed utterances) and annotated them with NER and EL labels. Utterances are subject to Apple's baseline privacy practices with respect to Siri requests, including that such requests are not associated with a user's Apple ID, email address, or other data Apple may have from a user's use of other Apple services, and have been filtered as described in Section 7. We then split the annotated data 80/10/10 into train, development and test sets. For both the NER and EL tasks, we report results on test sets sampled from the "music", "sports" and "movie & TV" domains. These are popular domains in usage and have a high percentage of named entities, with an average of 0.6, 1.1 and 0.7 entities per utterance in the three domains, respectively. To evaluate model performance specifically on noisy user inputs, we select queries from the test sets that are marked by the annotators as containing speech transcription or user errors, and report results on this "noisy" subset, which constitutes 13.5% and 12.7% of the data containing at least one entity for the movie & TV and music domains, respectively. To evaluate the relation feature, we also look at a "related" subset in which a valid relation exists in the utterance; this subset constitutes 13.4% and 5.3% of the data with at least one entity for the music and sports domains, respectively.

We first train the NER model described in Section 3.1. Next, for every example in our training dataset, we run inference on the trained NER model and generate the top-5 NER hypotheses using beam search. Following this, we retrieve the top 25 candidates for each of these hypotheses using our retrieval engine; these candidates, combined with the ground-truth NER and EL labels, are fed to the EL model for training.
To measure NER model performance, we use the standard NER F1 metric from the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003). To measure the quality of the top-5 NER hypotheses, we compute an oracle top-5 F1 score: for each test utterance, we choose the best hypothesis among the 5 and compute its F1 score. In this manner, we also know the upper bound that EL can reach by reranking NER hypotheses. As described in Section 3.2, the EL model is optimized to perform two tasks simultaneously: entity linking and reranking of NER hypotheses. Hence, to evaluate the performance of the EL model, we use two metrics: the reranked NER F1 score and the EL F1 score. The reranked NER F1 score is measured on the NER predictions of the top EL hypothesis, and is defined in the same way as in the NER task above. To evaluate entity linking quality, we adopt a strict F1 metric similar to the one used for NER: besides the entity boundary and entity type, the resolved entity also needs to be correct for the entity prediction to be counted as a true positive.
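The strict EL F1 can be sketched as follows, representing each prediction as a (span, type, resolved entity) triple so that all three must match for a true positive; the encoding is illustrative, not the evaluation script itself.

```python
def strict_f1(gold, pred):
    """Strict F1: a prediction counts only if span, type, and entity all match."""
    gold_set, pred_set = set(gold), set(pred)
    if not gold_set or not pred_set:
        return 0.0
    tp = len(gold_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(gold_set)
    return 2 * precision * recall / (precision + recall)
```
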
For NER model training, we use standard mini-batch gradient descent with the Adam optimizer, an initial learning rate of 0.001, a scheduled learning rate decay of 0.99, an LSTM with a hidden layer of size 350 and a batch size of 256. We apply a dropout of 0.5 to the embedding and biLSTM layers, and include token-level gazetteer features (Ratinov and Roth, 2009) to boost performance in recognizing common entities. We linearly project these gazetteer features and concatenate the projection with the 200-dimensional word embeddings and 100-dimensional character embeddings, which are then fed into the biLSTM followed by the CRF.
For EL, the character CNN model we use has two layers, each with 100 convolution kernels of sizes 3, 4, and 5. Character embeddings are 25-dimensional and trained end-to-end with the entity linking task. The MLP for embedding similarity takes the concatenation of the two name embeddings, as well as their element-wise sum, difference, minimum, maximum, and product. It has two hidden layers of size 1024 and 256, with output dimension 64. Similarity features of the mentions in the prediction are averaged, while the other features, such as NER confidence and entity popularity, are concatenated to the representation. The final MLP for scoring has two hidden layers, of size 256 and 64. We train the model on 4 GPUs with synchronous SGD, and for each gradient step we send a batch of 100 examples to each GPU. In Table 2, we show improvements achieved by several specific model design choices and features on entity linking performance.

Qualitative Analysis
We provide a few examples to showcase the effectiveness of our NEU system. Firstly, the EL model is able to link noisy entity mentions to the corresponding entity canonical name in the knowledge base. For instance, when the transcribed utterance is "play Carla Cabello", the EL model is able to resolve the mention "Carla Cabello" to the correct artist name "Camila Cabello".
Secondly, the EL model is able to recover from errors made by the NER system by leveraging the knowledge base to disambiguate entity mentions. The reranking is especially powerful when the utterance contains little context around the entity for the NER model to leverage. For example, for "Doctor Strange", the top NER hypothesis labels the full utterance as a generic "Person" type; after reranking, the EL model is able to leverage the popularity information ("Doctor Strange" is a movie that was recently released and has a high popularity in our knowledge base) and correctly label the utterance as "movieTitle". Reranking is also effective when the entity mentions are noisy, which causes mismatches for the gazetteer features that NER uses. For "play Avengers Age of Ultra", the top NER hypothesis only predicts "Avengers" as "movieTitle", while after reranking, the EL model is able to recover the full span "Avengers Age of Ultra" as a "movieTitle", and resolve it to "Avengers: Age of Ultron", the correct canonical title.
The entity relations from the knowledge base are helpful for entity disambiguation. When the user refers to a sports team with the name "Giants", they could be asking for either "New York Giants", a National Football League (NFL) team, or "San Francisco Giants", a Major League Baseball team. When there are multiple sports team mentions in an utterance, the EL model leverages a relation feature from the knowledge base indicating whether the teams are from the same sports league (as the user is more likely to mention two teams from the same league and the same sport). Knowing entity relations, the EL model is able to link the mention "Giants" in "Cowboys versus Giants" to the NFL team, knowing that "Cowboys" is referring to "Dallas Cowboys".
To validate the utility of our proposed NEU framework, we illustrate performance improvements in the Domain Classifier and the Semantic Parsers corresponding to the three domains (music, movies & TV and sports) as described in Section 3.3. Table 3 reports the classification accuracy for the Domain Classifier and the parse accuracies for the Semantic Parsers (the model is said to have predicted the parse correctly if all the tokens are tagged with their correct semantic parse labels). We observe substantial improvements in all 4 cases when NER features are used as additional input, with all other components of the system held fixed. In turn, we observe further improvements when our NER+EL featurization is used.

Conclusion
In this work, we have proposed a Named Entity Understanding framework that jointly identifies and resolves entities present in an utterance when a user interacts with a voice assistant. Our proposed architecture consists of two modules, NER and EL, with EL serving the additional task of correcting, where possible, the entities recognized by NER by leveraging rich signals from entity links in the knowledge base, while simultaneously linking these entities to the knowledge base. With several design strategies in our system targeted towards noisy natural language utterances, we have shown that our framework is robust to speech transcription and user errors that occur frequently in spoken dialog systems. We have also shown that featurizing the output of NEU and feeding these features into other language understanding tasks substantially improves the accuracy of these models.

Ethical Considerations
We randomly sampled transcripts from Siri production datasets over a period of months, and we believe it to be a representative sample of usage in the domains described. In accordance with Apple's privacy practices with respect to Siri requests, Siri utterances are not associated with a user's Apple ID, email address, or other data Apple may have from a user's use of other Apple services. In addition to Siri's baseline privacy guarantees, we filtered the sampled utterances to remove transcripts that were too long, contained rare words, or contained references to contacts before providing the dataset to our annotators.