GEMNET: Effective Gated Gazetteer Representations for Recognizing Complex Entities in Low-context Input

Named Entity Recognition (NER) remains difficult in real-world settings; current challenges include short texts (low context), emerging entities, and complex entities (e.g., movie names). Gazetteer features can help, but results have been mixed due to challenges with adding extra features, and a lack of realistic evaluation data. It has been shown that including gazetteer features can cause models to overuse or underuse them, leading to poor generalization. We propose GEMNET, a novel approach for gazetteer knowledge integration, including (1) a flexible Contextual Gazetteer Representation (CGR) encoder that can be fused with any word-level model; and (2) a Mixture-of-Experts gating network that overcomes the feature overuse issue by learning to conditionally combine the context and gazetteer features, instead of assigning them fixed weights. To comprehensively evaluate our approaches, we create 3 large NER datasets (24M tokens) reflecting current challenges. In an uncased setting, our methods show large gains (up to +49% F1) in recognizing difficult entities compared to existing baselines. On standard benchmarks, we achieve a new uncased SOTA on CoNLL03 and WNUT17.


Introduction
Identifying entities is a core NLP task. Named Entity Recognition (NER) is the task of finding entities and recognizing their type (e.g., person or location). Mention Detection (MD) is a simpler task of identifying entity spans, without the types.
Advances in neural NER have produced high scores on benchmark datasets like CoNLL03 and OntoNotes (Devlin et al., 2019). However, a number of challenges remain. As noted by Augenstein et al. (2017), these scores are driven by the use of well-formed news text, the presence of "easy" entities, and memorization effects due to entity overlap between train/test sets; these models perform significantly worse on unseen entities or noisy text.

Current NER Challenges
Beyond news text, many challenges remain in NER. Context information has been shown to be important for NER (Jayarao et al., 2018), and short texts like search queries are very challenging due to low context and a lack of surface features (Guo et al., 2009; Carmel et al., 2014). Unseen and emerging entities also pose a challenge (Bernier-Colborne and Langlais, 2020). Finally, some entities, like movie names, are not simple noun phrases and are harder to recognize (Ashwini and Choi, 2014). Table 1 lists more details about these challenges, and how they can be evaluated.
Entity Knowledge is essential for overcoming these issues, and critical in the absence of casing. Even a human may not correctly parse "what is [[life is beautiful]]?" without knowing that a movie is being referenced. However, most models start with no knowledge of real world entities, learning them from the training data. Continuous data annotation can add new entities, but is expensive and often not feasible.
Consequently, methods for integrating external knowledge, e.g., Knowledge Bases (KBs) or gazetteers, into neural architectures have gained renewed attention. However, such studies have reported limited gains (Liu et al., 2019; Rijhwani et al., 2020). The mixed success of gazetteers stems from three main limitations in current work: the gazetteer feature representation, its integration with contextual models, and a lack of data.
For the representation, one-hot binary encoding is often used to represent gazetteer features (Song et al., 2020). However, this does not capture contextual information or span boundaries. Alternatively, independent span taggers trained on gazetteers have been proposed to extract potential entities (Liu et al., 2019), but such models can be difficult to train and may not provide reliable features.

Table 1: Current NER challenges, and the evaluation each requires.

Short Texts (for: voice, search): News texts have long sentences discussing many entities, but other use cases (search queries, questions) have shorter inputs. Datasets with minimal context are needed to assess performance in such use cases. Capitalization/punctuation features are large drivers of success in NER (Mayhew et al., 2019), but short inputs (ASR, user input) often lack these surface features. An uncased evaluation setting is needed to understand model performance.

Long-tail Entities (for: domains with many entities): In many domains entities have a long-tail distribution, with millions of values (e.g., location names). This makes it hard to build representative training data, as it can only cover a portion of the potentially infinite entity space. A very large test set is required for effective evaluation.

Emerging Entities (for: domains with growing entities): All entity types are open classes (new ones are added), but some groups grow faster than others, e.g., new books/songs/movies are released weekly. Assessing true generalization requires test sets with many unseen entities, to mimic an open-world setting.

Complex Entities (for: voice, search): Not all entities are proper names: some types (e.g., creative works) can be linguistically complex. They can be complex noun phrases (Eternal Sunshine of the Spotless Mind), gerunds (Saving Private Ryan), infinitives (To Kill a Mockingbird), or full clauses (Mr. Smith Goes to Washington). Syntactic parsing of such names is hard, and most current parsers/NER systems fail to recognize them. The top system from WNUT 2017 achieved 8% recall for creative work entities (Aguilar et al., 2017). Effective evaluation requires corpora with many such entities.

There are also limitations in the integration of gazetteer features. Existing studies often add extra features to a word-level model's Contextual Word Representations (CWRs), which typically contain no information about real-world entities or their spans (Yamada et al., 2020). This concatenation approach is sub-optimal as it creates additional, and often highly correlated, features. It has been shown to cause feature "under-training", where the model learns to rely mostly on either the context or the gazetteer during training, and underuses the other (Yang et al., 2016). This is problematic because the utility of the gazetteer is variable: it is valuable in low-context cases, but may not be useful when rich syntactic context (from the CWR) can identify entities. Conversely, a true entity may be missing from the gazetteer. However, when the gazetteer is represented as an independent feature, the model assigns it a fixed weight, and its contribution to the prediction is static. To overcome this, external knowledge should be dynamically infused into relevant dimensions of the CWR, with the model learning to conditionally balance the contribution of the CWR and the gazetteer to the prediction.
Finally, these issues are compounded by a lack of data reflecting the challenges from Table 1, which prevents the exploration of effective architectures for knowledge injection.

Our Contributions
The key contributions of this paper are new data and methods to address the above challenges.
We propose GEMNET, a gazetteer expert mixture network for effectively integrating gazetteers into any word-level model. The model includes an encoder for Contextual Gazetteer Representations (CGRs) as a way to incorporate any number of gazetteers into a single, span-aware, dense representation. We also propose a gated Mixture-of-Experts (MoE) method to fuse CGRs with Contextual Word Representations from any word-level model (e.g., BERT), something not explored in previous work. Our novel MoE approach allows the model to conditionally compute a joint CGR-CWR representation, training a gating network to learn how to balance the contribution of context and gazetteer features. Finally, we employ multi-stage training to drive further improvements by aligning the CGR/CWR vectors.
To evaluate our proposed approaches, we create 3 challenging NER datasets that represent short sentences, questions, and search queries. The created datasets have complex entities with low-context and represent the challenges in Table 1.
Extensive experiments in an uncased setting show that our MoE model outperforms other baselines, including concatenation, in all experiments. We achieve state-of-the-art (SOTA) results on CoNLL03/WNUT17, but the method's utility is most notable on our difficult low-context data. We show that short texts make NER much harder, but gazetteers yield huge gains of up to +49% F1, especially in recognizing complex/unseen entities. We also show that gazetteer coverage during training is important.

Related Work
Deep Learning for NER: Neural approaches have greatly improved NER results in recent years. A shift to neural encoders, e.g., BiLSTM-CRF models (Huang et al., 2015) using static word embeddings, eliminated the need for manual feature engineering (e.g., capitalization features). More recently, transformer-based Language Models (LMs), e.g., BERT (Devlin et al., 2019), achieved further improvements by using deep contextual word representations. Such models jointly learn syntactic cues and entity knowledge, and may fail to recognize unseen or syntactically ambiguous entities. Consequently, training data is often augmented with gazetteers.
NER with Gazetteers: Annotated NER data can only achieve coverage for a finite set of entities, but models face a potentially infinite entity space in the real world. To address this, researchers have integrated gazetteers into models (Bender et al., 2003; Malmasi and Dras, 2015). String matching is commonly used to extract gazetteer matches, which are then concatenated to word representations. Song et al. (2020) use gazetteers from the Wikidata KB to generate one-hot vectors that are concatenated to BERT representations, yielding minor improvements on CoNLL03. This concatenation approach has been shown to cause feature "under-training" (Yang et al., 2016), as discussed in §1. An alternative approach uses gazetteers to train a subtagger model to recognize entity spans. Liu et al. (2019) propose a hybrid semi-Markov CRF subtagger, reporting minor improvements. While a subtagger may learn regularities in entity names, a key limitation is that it needs retraining and re-evaluation whenever the gazetteer is updated. Recent work has considered directly integrating knowledge into transformers, e.g., KnowBert adds knowledge to BERT layers (Peters et al., 2019), and LUKE is pretrained to predict masked entities (Yamada et al., 2020). The drawbacks of such methods are that they are specific to Transformers, and the model's knowledge cannot be updated without retraining. We aim to overcome the limitations of previous work by designing a model-agnostic gazetteer representation that can be fused into any word-level model.

Mixture-of-Experts (MoE) Models
MoE is an approach for conditionally computing a representation given several expert inputs, which can be neural models with different architectures (Arnaud et al., 2020) or models using different knowledge sources (Jain et al., 2019). In MoE, a gating network is trained to dynamically weight the experts per instance, according to the input. MoE has been shown to be useful in various applications, like recommendation (Zhu et al., 2020) and domain adaptation for sentiment analysis and POS tagging (Guo et al., 2018). For NER, Liu et al. (2020) proposed a Mixture of Entity Experts (MoEE) approach where they train an expert layer for each entity type, and then combine them using an MoE approach. Their approach does not include external gazetteers, and the experts provide an independent representation that is not combined with the word representation. In our work we treat the word and external gazetteer representations as independent experts, applying MoE to learn a dynamically fused representation.

Datasets
We experiment using three standard benchmarks: CoNLL03, OntoNotes, and WNUT17. However, these corpora do not capture the issues from Table 1; rich context and common entities (country names) allow a simple RNN model to achieve near-SOTA results. A key contribution of our paper is the creation of 3 new datasets that represent those challenges. They are difficult, as shown in §5.1. Our datasets are described below. All data are uncased, and we make them publicly available. Their statistics, listed in Table 2, show that they reflect the challenges from §1: short inputs (low context), with many unseen entities in the test set.

LOWNER (Low-Context Wikipedia NER)
To create our training set, we take advantage of the rich interlinks in Wikipedia. We parse the English Wikipedia dump and extract sentences from all articles. The sentences are parsed, and linked pages are resolved to their respective Wikidata entities to identify their type. To mimic search and voice settings, we minimize the context around the entities by dropping sentences with unlinked entities, identified using interlinks and a capitalization heuristic. The result is a corpus of 1.4 million low-context sentences with annotated entities, e.g., "A version for the [sega cd] was also announced."

MSQ-NER: MS-MARCO Question NER

To represent the QA domain, we templatize questions from the MS-MARCO corpus: NER is applied to extract item names, which are then mapped to our taxonomy. Entities are replaced with their types to create templates, e.g., "who sang <CW>" and "when did <PROD> come out". Approximately 3.5k templates (appearing ≥ 5 times) are chosen and slotted with entities from a knowledge base to generate 18k annotated questions, e.g., "when did [xbox 360] come out". These questions cover a wide range of shapes and entity types; see Appendix A for examples.

ORCAS-NER: Search Query NER
To represent the search query domain, we utilize 10 million Bing user queries from the ORCAS dataset (Craswell et al., 2020) and apply the same templatization procedure as for MSQ-NER. This yields search templates, e.g., "<PROD> price" and "<CORP> phone number", which are used to create annotated queries, e.g., "[airpods pro] reviews". A total of 472k queries are generated from 97k unique templates; see Appendix A for examples.
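To make the templatization concrete, the sketch below shows the two steps in Python. This is a minimal illustration rather than our pipeline: tag_entities is a hypothetical NER function, and the gazetteer layout is assumed.

```python
import random
from collections import Counter

def templatize(queries, tag_entities):
    """Replace tagged entity spans with type placeholders and count how
    often each resulting template occurs, e.g.
    "airpods pro reviews" -> "<PROD> reviews"."""
    templates = Counter()
    for query in queries:
        tokens, spans = tag_entities(query)  # spans: [(start, end, type), ...]
        for start, end, etype in sorted(spans, reverse=True):
            tokens[start:end] = [f"<{etype}>"]  # replace right-to-left
        templates[" ".join(tokens)] += 1
    return templates

def slot(template, gazetteer, rng):
    """Fill each type placeholder with a random entity of that type."""
    return " ".join(
        rng.choice(gazetteer[tok[1:-1]]) if tok.startswith("<") else tok
        for tok in template.split()
    )

# Frequent templates are kept and re-slotted with KB entities:
gaz = {"PROD": ["xbox 360", "airpods pro"]}  # toy stand-in for the KB gazetteer
print(slot("when did <PROD> come out", gaz, random.Random(0)))
```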

Gazetteer Data
Our gazetteer is composed of 1.67 million entities from the English Wikidata KB. Instead of collecting entities from the web (Khashabi et al., 2018), we focused on entities that map to our taxonomy. Alternative names (aliases) for entities are included. Gazetteer statistics are listed in Appendix B.

The GEMNET Model
We propose GEMNET, a generic gazetteer fusion approach that can be integrated with any word-level model, e.g., RNNs and Transformers. We experiment with both BiLSTM-CRF and BERT-CRF models, which produce (contextual) word representations, and complement these "word experts" with gazetteers. The overall architecture is shown in Figure 1, and the components are detailed below.

Contextual Gazetteer Representations
Our gazetteer representation is obtained in two steps: entry matching and contextual encoding.

Gazetteer Entry Matching: A gazetteer g is a list of entries that are associated with a category. For instance, a PERSON gazetteer contains a list of known people. The k-th entry g^(k) is associated with a tokenized string (e.g., 'John Carpenter') and its tag t^(k) ∈ T, where T is the IOB2-encoded tag set over all gazetteer types (e.g., O, B-PROD, I-PROD, B-CORP, I-CORP). We denote input sentences as (w_1, w_2, ..., w_L), where w_i is the i-th word and L is the sentence length. Full string matching is applied to inputs to identify matches across all gazetteers. Overlapping matches are resolved by preferring longer matches over shorter ones, and earlier matches over later ones. A match matrix M ∈ {0,1}^{L×|T|} represents the matching results. It is initialized with zeros, and a successful match of type t spanning words w_i, ..., w_{i+k} sets the B-t entry of row i and the I-t entries of rows i+1, ..., i+k to 1, indicating that each word w_{i+j} is represented by a one-hot vector over the tag set T (Table 3 shows an example).
A key advantage of this representation is that it captures multiple matches for a given span in a sentence. As shown in Table 3, the word "apple" can be matched to product and organization types. Furthermore, it is span-aware due to the IOB2 encoding. Any number of types and gazetteers can be added as needed, allowing the model to learn from correlations, and identify ambiguous entities. M is extracted by a gazetteer matcher, as a preprocessing step outside the network. This modular approach has an important advantage: it allows the gazetteer to be updated without model retraining. This is useful for efficiently recognizing emerging entities, and supporting personalized user-defined entities (e.g., contact lists).
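The following is a minimal sketch of how such a match matrix can be built. It assumes a dictionary gazetteer keyed by token tuples and applies one plausible overlap policy (longer matches first, then earlier ones); the production matcher's tie-breaking, e.g., for same-span matches of different types, may differ.

```python
import numpy as np

def build_match_matrix(words, gazetteer, tag_idx, max_len=10):
    """Build M in {0,1}^{L x |T|} with IOB2 tags for gazetteer hits.

    gazetteer: dict mapping lowercased token tuples to a set of types,
               e.g. ("apple", "watch") -> {"PROD"}.
    tag_idx:   dict mapping IOB2 tags ("O", "B-PROD", "I-PROD", ...) to columns.
    """
    L = len(words)
    M = np.zeros((L, len(tag_idx)), dtype=np.int8)
    covered = [False] * L
    for n in range(max_len, 0, -1):            # longer matches win...
        for i in range(L - n + 1):             # ...then earlier ones
            if any(covered[i:i + n]):
                continue
            types = gazetteer.get(tuple(w.lower() for w in words[i:i + n]), set())
            for t in types:                    # keep every matching type
                M[i, tag_idx[f"B-{t}"]] = 1
                for j in range(1, n):
                    M[i + j, tag_idx[f"I-{t}"]] = 1
            if types:
                covered[i:i + n] = [True] * n
    for i in range(L):                         # unmatched words get the O tag
        if not covered[i]:
            M[i, tag_idx["O"]] = 1
    return M
```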
Contextual Encoding: M can be directly used as input features, but it is sparse. We use a linear projection to obtain a dense representation per word:

g_i = f(W M_i + b),

where W ∈ R^{D×|T|} and b ∈ R^D are trainable parameters, D is the hidden dimension of the gazetteer representation, and f is an activation function. This creates a dense representation that captures interactions between multiple matches. We then contextualize this representation by applying a BiLSTM:

h_i^gaz = [LSTM_fw(g)_i, LSTM_bw(g)_i],

where [·,·] is the concatenation. A sample visualization of the embeddings is shown in Appendix D.
This dense contextualized gazetteer representation (CGR) can capture information about entity span boundaries (present in M), as well as interactions between entities in a sentence.
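A minimal PyTorch sketch of this encoder follows; the choice of activation (ReLU here) and the dimensions are illustrative assumptions.

```python
import torch.nn as nn

class CGREncoder(nn.Module):
    """Project the sparse match matrix M to dense per-word vectors
    (g_i = f(W M_i + b)), then contextualize them with a BiLSTM."""

    def __init__(self, num_tags, dim=128, hidden=256):
        super().__init__()
        self.proj = nn.Linear(num_tags, dim)   # W, b
        self.act = nn.ReLU()                   # f (assumed)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, M):                      # M: (batch, L, |T|) float tensor
        g = self.act(self.proj(M))             # dense gazetteer embeddings
        h_gaz, _ = self.lstm(g)                # (batch, L, 2 * hidden)
        return h_gaz
```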

Gazetteer Knowledge (CGR) Integration
The CGR operates on IOB2 tags and cannot memorize specific patterns; it is designed to be integrated with a lexical model. We consider these representations to be orthogonal: CGRs can complement the model's knowledge and syntactic representation.

CGR Concatenation
The simplest integration is to concatenate the dense CGR to the CWR, while jointly training the two representations.

Mixture-of-Experts (MoE) Model
The word-level model and the CGRs complement each other and may not always be in agreement. The word model may have low confidence about the span of an unseen entity, but the gazetteer may have knowledge of it. Conversely, the model's syntactic context may be confident about a span not in the gazetteer.
In fact, the two sources can be considered as independent experts and an effective model should learn to use their outputs dynamically. Inspired by the MoE architecture (Pavlitskaya et al., 2020), we apply conditional computation to combine our representations, allowing the model to learn the contexts where it should rely more on each expert.
We add a gating network to create a weighted linear combination of the word and gazetteer representations. For a sentence, the two experts output their representations h_word and h_gaz (the output sizes must be equal, e.g., the CGR dimension must match BERT's), which are used to train the gating network:

w_e = σ(θ^T [h_word, h_gaz]),
h = w_e · h_word + (1 − w_e) · h_gaz,

where θ are trainable parameters with size 2L, [·,·] is the concatenation, and σ is the sigmoid activation function. Learning the gating weights w_e lets the model dynamically compute the hidden representation h for each word. The architecture of our model is shown in Figure 1. After obtaining h, we feed it to a CRF layer to predict a tag.
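In PyTorch, the gate reduces to a few lines. This sketch assumes a scalar per-word gate and reads the θ of size 2L above as twice the expert hidden size; it is an illustration, not the exact implementation.

```python
import torch
import torch.nn as nn

class GazetteerGate(nn.Module):
    """Compute w_e = sigmoid(theta^T [h_word, h_gaz]) per word, then the
    convex combination h = w_e * h_word + (1 - w_e) * h_gaz."""

    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(2 * dim, 1)  # trainable gate parameters

    def forward(self, h_word, h_gaz):       # both: (batch, L, dim)
        w_e = torch.sigmoid(self.theta(torch.cat([h_word, h_gaz], dim=-1)))
        return w_e * h_word + (1.0 - w_e) * h_gaz
```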
Two-stage Training: Our architecture jointly optimizes over both experts, but their initial states differ. The word expert often contains pretrained elements, either word embeddings or transformers. The randomly-initialized CGR encoder will have high initial loss, and its representation is not aligned with the word expert, preventing correct convergence. We tackle this problem with a two-stage training method that adapts the two experts to each other. In the first stage, we freeze the word expert and only train the CGR encoder with the MoE and CRF layers, forcing the model to use gazetteer knowledge in order to minimize the loss. Importantly, this also adapts the CGR encoder to align its representation with that of the word expert, e.g., the dimensions with noun signals will be aligned with those of BERT, enabling the computation of their linear combination. In the second stage, the two experts are jointly fine-tuned to co-adapt them. This ensures that the CGR encoder starts with reasonable weights, and allows the MoE gating network to better learn how to balance the two experts.
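Operationally, the two stages amount to freezing and unfreezing parameter groups. A minimal sketch, where train_fn stands in for an ordinary training loop over the modules that currently require gradients (hypothetical helper):

```python
def set_requires_grad(module, flag):
    """Freeze (flag=False) or unfreeze (flag=True) a module's parameters."""
    for p in module.parameters():
        p.requires_grad = flag

def two_stage_train(word_expert, cgr_encoder, gate, crf, train_fn,
                    stage1_epochs=3, stage2_epochs=10):
    # Stage 1: freeze the word expert so the loss can only be reduced by
    # learning gazetteer features; this also aligns the CGR with the CWR.
    set_requires_grad(word_expert, False)
    train_fn([cgr_encoder, gate, crf], stage1_epochs)

    # Stage 2: unfreeze everything and jointly fine-tune both experts.
    set_requires_grad(word_expert, True)
    train_fn([word_expert, cgr_encoder, gate, crf], stage2_epochs)
```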

Models:
We integrate GEMNET with both BERT and BiLSTM word encoders. For BERT, we use the pretrained BERT-base model. The last output layer is used, and for each word, we use the first wordpiece representation as its representation. The BiLSTM model has 3 inputs: GloVe embeddings (Pennington et al., 2014), ELMo embeddings (Peters et al., 2018) and CharCNN embeddings (Ma and Hovy, 2016).
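For illustration, first-wordpiece pooling can be implemented as follows with the HuggingFace transformers library (an implementation assumption; the paper does not prescribe a specific toolkit):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

words = ["when", "did", "xbox", "360", "come", "out"]
enc = tok(words, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**enc).last_hidden_state[0]    # (num_wordpieces, 768)

# Keep the first wordpiece of every word as that word's representation.
first_piece, seen = [], set()
for idx, wid in enumerate(enc.word_ids()):
    if wid is not None and wid not in seen:
        seen.add(wid)
        first_piece.append(idx)
h_word = hidden[first_piece]                     # (len(words), 768)
```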
Evaluation: We evaluate MD and NER, and report entity-level precision, recall and F1 scores.

MD Baselines
Our first experiment aims to measure the difficulty of our datasets (§3) relative to existing benchmarks. We train a BERT model on CoNLL03 and use it to measure MD performance on our data. Measuring NER performance is not possible as we use a different tag set (WNUT17 vs. CoNLL03).

Results: Compared to the CoNLL03 results, LOWNER performance is worse. Although the evaluation on LOWNER is a transfer setting, the large gap shows the existing model cannot generalize well to our datasets due to the hard entities. Results for MSQ-NER and ORCAS-NER, which are short texts, are lower still. Overall, we note the difficulty of our datasets due to low context and hard entities.

NER Ablation Experiments
We explore all model architectures by training on LOWNER (set 1 in Table 2) and evaluating MD and NER performance on all datasets (sets 3-5 in Table 2). See Appendix C for training details.
Models: The GEMNET model is jointly trained and fused with BERT and BiLSTM word encoders, with and without two-stage training. To assess the impact of the MoE component, we also concatenate the CGR and CWR vectors, without MoE.

Baselines:
We compare against three baselines: (1) no gazetteer baselines; (2) binary concatenation, where we simply concatenate the binary match features (M) to the word representations, as is common in the literature; and (3) the subtagger model of Liu et al. (2019). They are shown as "baselines" in Table 5.
Results: MD and NER performance for all models is shown in Table 5. Overall we note the high effectiveness of the GEMNET model. In particular, our BiLSTM-based GEMNET approach improves F1 by up to 49% over the no gazetteer BiLSTM baseline in ORCAS-NER. Different aspects of the results are discussed below.
Word Encoder Performance: For LOWNER, we note that BERT achieves the best results, which is to be expected since the data consists of full sentences. MD is easier than NER, and represents the upper bound for NER. Performance in all cases decreases with low context, with search queries (ORCAS-NER) being the hardest. BiLSTMs perform better on shorter inputs, e.g., ORCAS-NER.

Impact of Gazetteers:
Results improve in all cases with external knowledge. While the subtagger and the binary concatenation baselines yield gains compared to the no gazetteer baselines, our CGR-based approach outperforms all of them in all NER tests. This indicates the high effectiveness of our CGR. For LOWNER, using CGR+MoE, MD performance improves by 2.4%, while NER improves by 4.7% over the no gazetteer BERT baseline. The low-context datasets, MSQ-NER and ORCAS-NER, have much lower baseline performance and benefit greatly from external knowledge. The best MSQ-NER NER model improves 36% over the no gazetteer BiLSTM baseline, while ORCAS-NER improves by 49%. This clearly demonstrates the impact of gazetteer integration.
Effect of Integration Method: CGR outperforms the baselines in all NER experiments, showing the effectiveness of a span-aware, contextual representation that is jointly trained with the word-level model. The MoE integration is superior to concatenation in all cases. This is most salient in low-context settings, demonstrating that the MoE model can rely on the CGR feature when the syntactic context (CWR) is not discriminative. In some cases the baselines actually degrade performance, as the model cannot effectively balance the experts.
Effect of Two-stage Training: We observe that two-stage training is crucial for BERT, for both concatenation and MoE models, but not for the BiLSTM model. This confirms our hypothesis that the CGR cannot be jointly trained from scratch with a large pretrained model. Freezing BERT first and then jointly fine-tuning provides large improvements.

Results on Benchmarks:
We applied GEMNET, i.e., BERT using CGR+MoE with two-stage training, to the standard benchmarks. We experiment in an uncased setting, and compare with the reported uncased SOTA (Mayhew et al., 2019). The SOTA uses BERT-CRF, the same as our baseline architecture. For comparison, we also reproduce the BERT baseline using our implementation. Results are shown in Table 6. Our models achieve SOTA results in all uncased settings, demonstrating generalization across domains; we improve by 3.9% on WNUT17.

Per-Class Performance & Error Analysis
We also look at performance across different entity classes to understand the source of our improvements. Table 7 shows relative gains per class, comparing the no gazetteer baseline performance against the best model. Detailed precision/recall values are in Appendix E (Table 16).
The smallest gains are on PER and LOC types, and the largest gains are on products and creative works (CW). This agrees with our hypothesis that these complex entities are the hardest to recognize.
Comparing datasets, the increases are much larger on MSQ-NER and ORCAS-NER, confirming the challenges of short low-context inputs, and our models' effectiveness in such cases.

We also conduct a qualitative error analysis to identify instances where the best non-gazetteer baseline fails but our model provides correct output. Some examples are shown in Table 8. The baseline often lacks knowledge about complex and long-tail entities, either missing them (#1, 6, 8 show full or partial MD failure) or misclassifying them (#3-5 show NER errors). Another common trend we observe is baselines incorrectly predicting nested entities within complex entities (#2, 10).

Effect of Gazetteer Coverage
We consider the impact of gazetteer coverage (the proportion of entities that are present in the gazetteer) on performance. We hypothesize that training coverage impacts how much the model learns to rely on the gazetteer. To verify this, we examine two scenarios: (1) the gazetteer coverage for train and test match (i.e., both high or low); and (2) there is a coverage gap between train and test, e.g., train coverage is 90% but test coverage is 25%, or vice versa.
Model and Data: For each train/test set we create gazetteers that have p% coverage of the set's gold entities, with p ∈ {5, 10, 20, 30, 50, 75, 90, 95}. This is achieved by randomly dropping entities. We then train models using each p and evaluate on test sets, using all values of p. This experiment is done using LOWNER and MSQ-NER.
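A sketch of the gazetteer subsampling, under the assumption that coverage is enforced by keeping a random p% of the gold entities and dropping the rest:

```python
import random

def subsample_gazetteer(gazetteer, gold_entities, p, seed=0):
    """Return a gazetteer whose coverage of the dataset's gold entities is
    p percent; entries outside the gold set are left untouched."""
    rng = random.Random(seed)
    keep = set(rng.sample(sorted(gold_entities),
                          k=round(len(gold_entities) * p / 100)))
    return {name: types for name, types in gazetteer.items()
            if name not in gold_entities or name in keep}

# One gazetteer per coverage level for the train/test ablation grid:
# gazetteers = {p: subsample_gazetteer(gaz, gold, p)
#               for p in (5, 10, 20, 30, 50, 75, 90, 95)}
```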

Table 8 (ORCAS-NER excerpt). Each example lists the gold annotation, the baseline output, and our model's output:

Example 7: gold |bee-line|CORP revenue; baseline |bee-line revenue|CW; ours |bee-line|CORP
Example 8: gold |lexus rc 350|PROD height; baseline |lexus rc|PROD; ours |lexus rc 350|PROD
Example 9: gold how old is |ingross|PER; baseline |ingross|LOC; ours |ingross|PER
Example 10: gold cast of |dr. devil and mr. hare|CW; baseline |dr. devil|PER, |mr. hare|PER; ours |dr. devil and mr. hare|CW

We also note that the gap between the best and worst result for LOWNER is not huge, showing the impact of sentence context. This gap is much larger for ORCAS-NER, where the model cannot rely on the context. Finally, we note that an alternative dynamic dropout method achieved similar results.

Performance in a Low-Resource Setting
We also consider the impact of a low-resource setting (limited annotations) on performance, hypothesizing that gazetteers are more helpful in such settings. To verify this, we create random subsets of 5/10/20% of the training data and compare the NER performance of a baseline (BERT-base) vs. our best model (BERT+CGR+MoE+2stage) when trained on this data. Results are shown in Table 9.
The results show that gazetteers are always more effective than the baseline in low-resource scenarios. Specifically, the gazetteer models improve much faster with less data, achieving close to maximum performance with only 20% of the data.

Conclusion
We focused on integrating gazetteers into NER models. We proposed GEMNET, a flexible architecture that includes a Contextual Gazetteer Representation encoder, combined with a novel Mixture-of-Experts gating network to conditionally utilize this information alongside any word-level model. GEMNET supports external gazetteers, allowing the model's knowledge to be updated without retraining. We also developed new datasets to represent the current challenges in NER. Experimental results demonstrated that our method can alleviate the feature weight under-training issue, achieving significant improvements on our data and a standard benchmark, WNUT17. The datasets we released can serve as benchmarks for evaluating the entity knowledge possessed by models in future work.
Future work involves investigating integration with different model architectures, partial gazetteer matching, and additional entity features.

A Dataset Generation Details

MSQ-NER: Entities are replaced with their types to create slotted templates, e.g., "when did [[iphone]] come out" becomes "when did <PROD> come out". The templates are then aggregated by frequency. This process results in 3,445 unique question templates. While the NER system cannot correctly identify many entities, the most frequent templates are reliable. Examples are listed in Table 11.
Finally, we generate MSQ-NER by slotting the templates that have a frequency of ≥ 5 with random entities of the same class from the Wikipedia KB. Each template is slotted the same number of times it appeared in MS-MARCO, in order to maintain the same relative distribution as the original data. This results in 17,868 questions, e.g., "when did [[xbox 360]] come out", which we use as a test set.

ORCAS-NER:
To represent the search query domain, we utilize 10 million Bing user queries from the ORCAS dataset (Craswell et al., 2020) and apply the same templatization procedure described above for MSQ-NER. This yields search templates, e.g., "<PROD> price" and "<CORP> phone number", which are used to create annotated queries, e.g., "[[airpods pro]] reviews". This process creates 97,324 unique query templates. We slot these templates according to their frequency, yielding a final dataset of 471,746 queries. This is our largest, and most challenging, test set. Examples of our templates are listed in Table 12.

B Gazetteer Details and Statistics
We parsed a Wikidata dump from July 2020 and mapped entities to our NER taxonomy ( §3). This was done by traversing Wikidata's class and instance relations, and mapping them to our NER classes, e.g., Wikidata's human class maps to PER in our taxonomy, song to CW, and so on.
We extracted 1.67 million entities that were mapped to our classes. The distribution of these entities is shown in Table 13.
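The class traversal described above can be sketched as a breadth-first walk from each item's instance-of (P31) classes up the subclass-of (P279) hierarchy. The class map below is a tiny hypothetical excerpt of the full taxonomy mapping:

```python
# Hypothetical excerpt of the taxonomy mapping (Q5 = Wikidata "human").
CLASS_MAP = {"Q5": "PER"}  # extended with song -> CW, company -> CORP, etc.

def ner_class(item_p31, subclass_of, class_map=CLASS_MAP, max_depth=10):
    """item_p31: the item's instance-of (P31) class QIDs.
    subclass_of: dict of QID -> parent QIDs via subclass-of (P279)."""
    frontier = list(item_p31)
    for _ in range(max_depth):
        hits = [class_map[q] for q in frontier if q in class_map]
        if hits:
            return hits[0]
        frontier = [p for q in frontier for p in subclass_of.get(q, [])]
    return None  # the item does not map to our taxonomy
```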

C Training Details & Hyperparameters
The hyperparameter search ranges and the optimal values used in Section 5.2, covering the results on our created datasets in Table 5 and the benchmark results in Table 6, are shown in Table 14 (BiLSTM model) and Table 15 (BERT model). During training we clip the gradient norm at 5.0 to ensure smooth training. We also apply early stopping, halting training when performance on the development set has not improved over the last 15 epochs.

D CGR Embedding Visualization
As mentioned in §4, given a sentence, we use a gazetteer matcher to extract its match matrix M, which is passed to the CGR encoder (i.e., the green part in Figure 1). We use the sentences in MSQ-NER as inputs to the CGR encoder and obtain the averaged embedding vectors of all the gazetteer tags, e.g., B-PER and B-CW. To visualize these gazetteer tags, we apply t-SNE (Maaten and Hinton, 2008) and generate the 2D visualization shown in Figure 3. Tags of the same type, e.g., B-PROD and I-PROD, are close to each other. This indicates that our CGR encoder captures the semantic meaning of these tags and provides an effective gazetteer representation.
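The visualization step is a straightforward application of scikit-learn's t-SNE. In this sketch, the tag vectors are random placeholders standing in for the averaged CGR embeddings described above:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Placeholder tag embeddings; in practice these are the averaged
# CGR-encoder outputs per gazetteer tag over MSQ-NER sentences.
rng = np.random.default_rng(0)
tag_vecs = {t: rng.normal(size=64)
            for t in ["O", "B-PER", "I-PER", "B-CW", "I-CW", "B-PROD", "I-PROD"]}

tags = sorted(tag_vecs)
X = np.stack([tag_vecs[t] for t in tags])
xy = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)

plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), t in zip(xy, tags):
    plt.annotate(t, (x, y))
plt.show()
```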

E Additional Results
Some additional detailed results are included in this section. Table 16 shows detailed precision, recall and F1 scores for each entity class, comparing the no gazetteer baseline model and the best model for each dataset. The worst performance is on products and creative works, as we hypothesized, since these entities are much more linguistically complex. These classes also achieve the largest increases with our models, which demonstrates that our methods successfully address the models' weakness on complex entities.

Table 14 (fragment): hyperparameter search ranges and LOWNER-optimal values; e.g., BiLSTM input word dimension, search range [50, 200].

Table 16: Per-class performance across entity types. We show the optimal model (Ours) for each dataset, which is BERT+MoE+Two-stage for LOWNER, and BiLSTM+MoE for MSQ-NER and ORCAS-NER. The baselines are BERT, BiLSTM and BiLSTM without gazetteer, respectively.