Compact Lexicon Selection with Spectral Methods

In this paper, we introduce the task of selecting a compact lexicon from large, noisy gazetteers. This scenario arises often in practice, in particular in spoken language understanding (SLU). We propose a simple and effective solution based on matrix decomposition techniques: canonical correlation analysis (CCA) and rank-revealing QR (RRQR) factorization. CCA is first used to derive low-dimensional gazetteer embeddings from domain-specific search logs. Then RRQR is used to find a subset of these embeddings whose span approximates the entire lexicon space. Experiments on slot tagging show that our method yields a small set of lexicon entities with an average relative error reduction of > 50% over a randomly selected lexicon.


Introduction
Discriminative models trained with large quantities of arbitrary features are a dominant paradigm in spoken language understanding (SLU) (Celikyilmaz et al., 2013; Liu and Sarikaya, 2014; Anastasakos et al., 2014; Xu and Sarikaya, 2014; Celikyilmaz et al., 2015; Kim et al., 2015a; Kim et al., 2015c; Kim et al., 2015b). An important category of these features comes from entity dictionaries or gazetteers: lists of phrases whose labels are given. For instance, they can be lists of movies, music titles, actors, restaurants, and cities. These features enable SLU models to robustly handle unseen entities at test time.
However, these lists are often massive and very noisy. This is because they are typically obtained automatically by mining the web for recent entries (such as newly launched movie names). Ideally, we would like an SLU model to have access to this vast source of information at deployment. But this is difficult in practice because an SLU model needs to be light-weight to support fast user interaction. It becomes more challenging when we consider multiple domains, languages, and locales.
In this paper, we introduce the task of selecting a small, representative subset of noisy gazetteers that will nevertheless improve model performance nearly as much as the original lexicon. This will allow an SLU model to take full advantage of gazetteer resources at test time without being overwhelmed by their scale.
Our selection method has two steps. First, we gather relevant information for each gazetteer element using domain-specific search logs and perform CCA on this information to derive low-dimensional gazetteer embeddings (Hotelling, 1936). Second, we use a subset selection method based on RRQR to locate gazetteer embeddings whose span approximates the entire lexicon space (Boutsidis et al., 2009; Kim and Snyder, 2013). We show in slot tagging experiments that the gazetteer elements selected by our method not only preserve the performance of using the full lexicon but even improve it in some cases. Compared to random selection, our method achieves an average relative error reduction of > 50%.

Motivation
We motivate our task by describing the process of lexicon construction. Entity dictionaries are usually automatically mined from the web using resources that provide typed entities. On a regular basis, these dictionaries are automatically updated and accumulated based on local data feeds and knowledge graphs. Local data feeds are generated from various origins (e.g., yellow pages, Yelp). Knowledge graphs such as www.freebase.com are resources that define a semantic space of entities (e.g., movie names, persons, places and organizations) and their relations.
Because dictionaries must be kept up to date to handle newly emerging entities, lexicon construction is designed to aim for high recall at the expense of precision. Consequently, the resulting gazetteers are noisy. For example, a movie dictionary may contain hundreds of thousands of movie names, but many of them are false positives.
While this large base of entities is useful as a whole, it is challenging to take advantage of at test time. This is because we normally cannot afford to consume so much memory when we deploy an SLU model in practice. In the next section, we will describe a way to filter these entities while retaining their overall benefit.

Row subset selection problem
We frame gazetteer element selection as the row subset selection problem. In this framework, we organize $n$ gazetteer elements as a matrix $A \in \mathbb{R}^{n \times d}$ whose rows $A_i \in \mathbb{R}^d$ are some representations of the gazetteer members. Given $m \le n$, let $S(A, m) := \{B \in \mathbb{R}^{m \times d} : B_i = A_{\pi(i)}\}$ be the set of matrices whose rows are a subset of the rows of $A$. Note that $|S(A, m)| = \binom{n}{m}$. Our goal is to select$^1$

$$B^* = \arg\min_{B \in S(A, m)} \left\| A - A B^+ B \right\|_F$$

where $B^+$ denotes the pseudoinverse of $B$. That is, we want $B$ to satisfy $\mathrm{range}(B^\top) \approx \mathrm{range}(A^\top)$. We can solve for $B^*$ exactly with exhaustive search in $O(n^m)$, but this brute-force approach is clearly not scalable. Instead, we turn to the $O(nd^2)$ algorithm of Boutsidis et al. (2009), which we review below.
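To make the objective concrete, the following minimal sketch evaluates the reconstruction error for a candidate row subset and runs the exhaustive search on a toy matrix. The function names and the toy data are illustrative assumptions, not part of the original method description, and the search is only feasible for very small n.

```python
import itertools
import numpy as np

def reconstruction_error(A, B):
    """Frobenius norm of A - A B^+ B: how well the rows of B span the rows of A."""
    return np.linalg.norm(A - A @ np.linalg.pinv(B) @ B, ord="fro")

def brute_force_select(A, m):
    """Exhaustively try every m-row subset of A (hypothetical helper, tiny n only)."""
    best_rows, best_err = None, np.inf
    for rows in itertools.combinations(range(A.shape[0]), m):
        err = reconstruction_error(A, A[list(rows)])
        if err < best_err:
            best_rows, best_err = rows, err
    return best_rows, best_err

# Toy data: rows 2 and 3 are linear combinations of rows 0 and 1.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0]])
rows, err = brute_force_select(A, 2)
```

Any two linearly independent rows reconstruct this toy matrix with (numerically) zero error, whereas a dependent pair such as rows 0 and 2 does not.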

RRQR factorization
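A key ingredient in the algorithm of Boutsidis et al. (2009) is the use of RRQR factorization. Recall that a (thin) QR factorization of $A$ expresses $A = QR$ where $Q \in \mathbb{R}^{n \times d}$ has orthonormal columns and $R \in \mathbb{R}^{d \times d}$ is an upper triangular matrix. A limitation of QR factorization is that it does not assign a score to each of the $d$ components. This is in contrast to singular value decomposition (SVD), which assigns a score (singular value) indicating the importance of each component.
RRQR factorization is a less well-known variant of QR that addresses this limitation. Let $\sigma_i(M)$ denote the $i$-th largest singular value of matrix $M$. Given $A$, RRQR jointly finds a permutation matrix $\Pi \in \{0, 1\}^{d \times d}$, orthonormal $Q \in \mathbb{R}^{n \times d}$, and upper triangular $R = \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix} \in \mathbb{R}^{d \times d}$ such that $A\Pi = QR$ and, for a target rank $k$ with $R_{11} \in \mathbb{R}^{k \times k}$, the smallest singular value of $R_{11}$ is not much smaller than $\sigma_k(A)$ while the largest singular value of $R_{22}$ is not much larger than $\sigma_{k+1}(A)$. Because of this ranking property, RRQR "reveals" the numerical rank of $A$. Furthermore, the columns of $A\Pi$ are sorted in the order of decreasing importance.

Figure 1: Gazetteer selection algorithm.
• Perform SVD on $A$ and let $U \in \mathbb{R}^{n \times m}$ be a matrix whose columns are the left singular vectors corresponding to the largest $m$ singular values.
• Associate a probability $p_i$ with the $i$-th row of $A$, as in Boutsidis et al. (2009).
• Perform RRQR on $\bar{A}$, the matrix whose columns are the sampled and rescaled rows of $A$, to obtain $\bar{A}\Pi = QR$.
• Return the $m$ rows of the original $A$ corresponding to the top $m$ columns of $\bar{A}\Pi$.

$^1$ The Frobenius norm $\|M\|_F$ is defined as the entry-wise $L_2$ norm: $\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}$.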
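The rank-revealing behaviour of RRQR can be observed with a small experiment. The sketch below uses SciPy's column-pivoted QR (`scipy.linalg.qr` with `pivoting=True`) as a stand-in; this is an assumption for illustration, since column pivoting is a widely used practical heuristic rather than a strong RRQR with guaranteed bounds. The test matrix and the tolerance are likewise hypothetical.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
# A 50 x 6 matrix whose numerical rank is 3 by construction.
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 6))

# Column-pivoted thin QR: A[:, piv] = Q @ R, with piv encoding the permutation.
Q, R, piv = qr(A, mode="economic", pivoting=True)

# The diagonal of R drops sharply after the numerical rank is exhausted.
diag = np.abs(np.diag(R))
numerical_rank = int(np.sum(diag > 1e-10 * diag[0]))
```

Here `piv` lists the columns of `A` in the order chosen by the pivoting rule, so `piv[:numerical_rank]` indexes a set of columns that spans the column space of this toy matrix.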

Gazetteer selection algorithm
The algorithm is a two-stage procedure. In the first step, we randomly sample $O(m \log m)$ rows of $A$ with carefully chosen probabilities and scale them to form columns of matrix $\bar{A} \in \mathbb{R}^{d \times O(m \log m)}$.
In the second step, we perform RRQR factorization on $\bar{A}$ and collect the gazetteer elements corresponding to the top components given by the RRQR permutation. The algorithm is shown in Figure 1. The first stage involves random sampling and scaling of rows, but it is shown that $\bar{A}$ has $O(m \log m)$ columns with constant probability. This algorithm has the following optimality guarantee:
Theorem 3.1 (Boutsidis et al. (2009)). Let $\tilde{B} \in \mathbb{R}^{m \times d}$ be the matrix returned by the algorithm in Figure 1. Then with probability at least 0.7,

$$\left\| A - A \tilde{B}^+ \tilde{B} \right\|_F \le O\left(m \sqrt{\log m}\right) \left\| A - \tilde{A} \right\|_F$$

where $\tilde{A}$ is the best rank-$m$ approximation of $A$.
In other words, the selected rows are not arbitrarily worse than the best rank-$m$ approximation of $A$ (given by SVD) with high probability.
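The two-stage procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the exact algorithm: the sampling probabilities are taken to be normalized leverage scores of the top-m left singular vectors (a simplification of the probabilities in Boutsidis et al. (2009)), the oversampling constant `c` is hypothetical, column-pivoted QR stands in for RRQR, and A is assumed to have rank at least m.

```python
import numpy as np
from scipy.linalg import qr

def select_rows(A, m, c=4.0, seed=0):
    """Return indices of m rows of A chosen by sampling followed by pivoted QR."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]

    # Stage 1: sample roughly O(m log m) rows with SVD-based probabilities.
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    p = np.sum(U[:, :m] ** 2, axis=1) / m          # ASSUMED probabilities, sum to 1
    keep_prob = np.minimum(1.0, c * m * np.log(max(m, 2)) * p)
    kept = np.flatnonzero(rng.random(n) < keep_prob)
    if kept.size < m:                              # rare: fall back to most probable rows
        kept = np.argsort(-p)[:m]

    # Rescale the kept rows and place them as columns of A_bar (d x |kept|).
    A_bar = (A[kept] / np.sqrt(keep_prob[kept])[:, None]).T

    # Stage 2: pivoted QR orders the columns of A_bar by importance.
    _, _, piv = qr(A_bar, mode="economic", pivoting=True)
    return kept[piv[:m]]

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 3)) @ rng.standard_normal((3, 5))  # toy rank-3 data
rows = select_rows(A, 3)
B = A[rows]
err = np.linalg.norm(A - A @ np.linalg.pinv(B) @ B)
```

On this exactly low-rank toy matrix the three selected rows span the full row space, so the reconstruction error is numerically zero; on real gazetteer embeddings the error would instead be compared against the best rank-m approximation.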

Gazetteer embeddings via CCA
In order to perform the selection algorithm in Figure 1, we need a d-dimensional representation for each of the n gazetteer elements. We use CCA for its simplicity and generality.

Canonical Correlation Analysis (CCA)
CCA is a general statistical technique that characterizes the linear relationship between a pair of multi-dimensional variables. CCA seeks to find $k$ dimensions ($k$ is a parameter to be specified) in which these variables are maximally correlated. Let $x_1 \ldots x_n \in \mathbb{R}^d$ and $y_1 \ldots y_n \in \mathbb{R}^{d'}$ be $n$ samples of the two variables. For simplicity, assume that these variables have zero mean. Then CCA computes the following for $i = 1 \ldots k$:

$$(u_i, v_i) = \arg\max_{u \in \mathbb{R}^d,\ v \in \mathbb{R}^{d'}} \mathrm{Cor}(u^\top x_l, v^\top y_l) \quad \text{subject to} \quad \mathrm{Cor}(u^\top x_l, u_j^\top x_l) = 0,\ \ \mathrm{Cor}(v^\top y_l, v_j^\top y_l) = 0 \quad \forall j < i$$

In other words, each $(u_i, v_i)$ is a pair of projection vectors such that the correlation between the projected variables $u_i^\top x_l$ and $v_i^\top y_l$ is maximized, under the constraint that this projection is uncorrelated with the previous $i - 1$ projections. This is a non-convex problem due to the interaction between $u_i$ and $v_i$. However, a method based on singular value decomposition (SVD) provides an efficient and exact solution to this problem (Hotelling, 1936). The resulting solution $u_1 \ldots u_k \in \mathbb{R}^d$ and $v_1 \ldots v_k \in \mathbb{R}^{d'}$ can be used to project the variables from the original $d$- and $d'$-dimensional spaces to a $k$-dimensional space:

$$\bar{x}_l = [u_1^\top x_l, \ldots, u_k^\top x_l] \in \mathbb{R}^k, \qquad \bar{y}_l = [v_1^\top y_l, \ldots, v_k^\top y_l] \in \mathbb{R}^k$$

The new $k$-dimensional representation of each variable now contains information about the other variable. The value of $k$ is usually selected to be much smaller than $d$ or $d'$, so the representation is typically also low-dimensional.
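The SVD-based solution can be illustrated with a minimal dense sketch: whiten each view by its covariance, then take the SVD of the whitened cross-covariance. The function name, the small ridge term `reg`, and the synthetic data are assumptions made for illustration; a practical implementation on sparse indicator vectors would avoid forming dense covariance matrices.

```python
import numpy as np

def cca(X, Y, k, reg=1e-8):
    """X: n x d, Y: n x d'. Returns (d x k) and (d' x k) projections and the top-k correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # reg is an ASSUMED ridge term for stability
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    return Wx @ U[:, :k], Wy @ Vt[:k].T, s[:k]

# Synthetic example: both views share one latent signal z.
rng = np.random.default_rng(0)
z = rng.standard_normal(2000)
X = np.column_stack([z + 0.1 * rng.standard_normal(2000), rng.standard_normal(2000)])
Y = np.column_stack([rng.standard_normal(2000),
                     2 * z + 0.1 * rng.standard_normal(2000),
                     rng.standard_normal(2000)])
proj_x, proj_y, corrs = cca(X, Y, k=1)
x_bar = (X - X.mean(axis=0)) @ proj_x              # k-dimensional representation of each x_l
```

The singular values returned in `corrs` are the canonical correlations, and the rows of `x_bar` play the role of the projected representations described above.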

Inducing gazetteer embeddings
We now describe how to use CCA to induce vector representations for gazetteer elements. Using the same notation, let $n$ be the number of elements across all gazetteers. Let $x_1 \ldots x_n$ be the original representations of the element samples and $y_1 \ldots y_n$ be the original representations of the features associated with each element.
We employ the following definition for the original representations. Let $d$ be the number of distinct element types and $d'$ be the number of distinct feature types.
• $x_l \in \mathbb{R}^d$ is a zero vector in which the entry corresponding to the element type of the $l$-th instance is set to 1.
• $y_l \in \mathbb{R}^{d'}$ is a zero vector in which the entries corresponding to features generated by the element are set to 1.
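The two indicator representations above can be built as follows. This is a small sketch under assumed inputs: each instance is a hypothetical (element, feature list) pair, and the feature strings are made up for illustration. Dense arrays are used for readability; real gazetteers would call for sparse matrices.

```python
import numpy as np

def build_views(instances):
    """instances: list of (element, list_of_features). Returns indicator matrices X (n x d), Y (n x d')."""
    elements = sorted({e for e, _ in instances})
    features = sorted({f for _, fs in instances for f in fs})
    e_idx = {e: i for i, e in enumerate(elements)}
    f_idx = {f: i for i, f in enumerate(features)}
    X = np.zeros((len(instances), len(elements)))
    Y = np.zeros((len(instances), len(features)))
    for l, (element, feats) in enumerate(instances):
        X[l, e_idx[element]] = 1.0          # one-hot: which element this instance is
        for f in feats:
            Y[l, f_idx[f]] = 1.0            # multi-hot: which features it generated
    return X, Y, elements, features

# Hypothetical instances with made-up feature names.
instances = [
    ("the matrix", ["L:watch", "url:imdb.com"]),
    ("furious 7", ["url:imdb.com"]),
    ("the matrix", ["R:online"]),
]
X, Y, elements, features = build_views(instances)
```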
In our case, we want to induce gazetteer (element) embeddings that correlate with the relevant features about gazetteers. For this purpose, we use three types of features: context features, search click log features, and knowledge graph features.
Context features: For each gazetteer element g of domain l, we take sentences from search logs on domain l containing g and extract five words each to the left and the right of the element g in the sentences. For instance, if g = "The Matrix" is a gazetteer element of domain l = "Movie", we collect sentences from movie-specific search logs involving the phrase "The Matrix". Such domain-specific search logs are collected using a pre-trained domain classifier.
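A minimal sketch of the context-window extraction described above is given below. Whitespace tokenization, lowercasing, and the `L:`/`R:` prefixes that record the side of each context word are assumptions for illustration; the source does not specify these details.

```python
def context_features(sentence, element, window=5):
    """Collect up to `window` words on each side of every occurrence of `element`."""
    tokens = sentence.lower().split()
    target = element.lower().split()
    feats = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + len(target):i + len(target) + window]
            feats += [f"L:{w}" for w in left] + [f"R:{w}" for w in right]
    return feats

feats = context_features("watch the matrix online for free", "The Matrix")
# feats == ['L:watch', 'R:online', 'R:for', 'R:free']
```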
Search click log features: Large-scale search engines such as Bing and Google process millions of queries on a daily basis. Together with the search queries, user-clicked URLs are also logged anonymously. These click logs have been used for extracting semantic information for various NLP tasks (Kim et al., 2015a; Tseng et al., 2009). We used the clicked URLs as features to determine the likelihood of an entity being a member of a dictionary. These features are useful because common URLs are shared across different names within a category such as movie, business and music. Table 1 shows the top five most frequently clicked URLs for the movies "Furious 7" and "The age of adaline".

Table 1: Top five most frequently clicked URLs for two movies.

Furious 7            The age of adaline
imdb.com             imdb.com
en.wikipedia.org     en.wikipedia.org
furious7.com         youtube.com
rottentomatoes.com   rottentomatoes.com
www.msn.com          movieinsider.com

One issue with using only click logs is that some entities may not be covered in the query logs, since logs are extracted from a limited time frame (e.g. six months). Even the big search engines employ a moving time window for processing and storing search logs. Consequently, click logs are not necessarily good evidence. For example, "apollo thirteen" is a movie name appearing in the movie training data, but it does not appear in search logs. One way to solve the issue of missing logs for entities is to search bing.com in real time. Given that the search engine is updated on a daily basis, real-time search can make sure we capture the newest entities. We run live search for all entities regardless of whether they appear in search logs. Each URL returned from the live search is considered to have an additional click.
Knowledge graph features: The graph in www.freebase.com contains a large set of tuples in a resource description framework (RDF) defined by W3C. A tuple typically consists of two entities: a subject and an object linked by some relation.
An interesting part of this resource is the entity type defined in the graph for each entity. In the knowledge graph, the "type" relation represents the entity type. Table 2 shows some examples of entities and their relations in the knowledge graph. From the graph, we learn that "Romeo & Juliet" could be a film name or a music album since it has two types: "film.film" and "music.album".

Experiments
To test the effectiveness of the proposed gazetteer selection method, we conduct slot tagging experiments across a test suite of three domains: Movies, Music and Places, all of which are very sensitive to gazetteer features. The task of slot tagging is to find the correct sequence of tags for the words in a user utterance. For example, in the Places domain, a user could say "search for home depot in kingsport", and the phrases "home depot" and "kingsport" are tagged with Place Name and Location respectively. The data statistics are shown in Table 3. One domain can have various kinds of gazetteers. For example, the Places domain has business names, restaurant names, school names, and so on. Candidate dictionaries are mined from the web and search logs automatically using basic pattern matching approaches (e.g. entities sharing the same or similar context in queries or documents) and consequently contain a significant amount of noise. As the table indicates, the total number of elements across all the gazetteers (#total gazet elements) in each domain is too large for models to consume.
In all our experiments, we trained conditional random fields (CRFs) (Lafferty et al., 2001) with the following features: (1) n-gram features up to n = 3, (2) regular expression features, and (3) Brown clusters (Brown et al., 1992) induced from search logs. With these features, we compare the following methods to demonstrate the importance of adding appropriate gazetteers:
• NoG: train without gazetteer features.
• AllG: train with all gazetteers.
• RandG: train with randomly selected gazetteers.
• RRQRG: train with gazetteers selected by RRQR.
• RankAllG: train with all ranked gazetteers.
Here gazetteer features are activated when a phrase contains an entity in a dictionary. For RandG, we first sample a category of gazetteers uniformly and then choose a lexicon from gazetteers in that category. Results from selecting gazetteer elements at random across all categories were very low, so we do not include them here. For the selection methods (RandG and RRQRG), we select 500,000 elements in total.
The size of the test data is about 5k for each locale. The results are shown in Table 5. Interestingly, RRQRG even outperforms AllG. This is because some noisy entities are filtered out.
Finally, we show that the proposed method is useful even in the all-gazetteer scenario (AllG). Using RRQR, we can order entities according to their importance and split a single gazetteer feature into several by binning the entities according to their rankings. For example, instead of having one single big business name gazetteer, we can divide it into lexicons containing the first 1,000 entities, the first 10,000 entities, and so on. Results using ranked gazetteers are shown in Table 6. We see that the ranked gazetteer approach (RankAllG) has consistent gains across domains over AllG.
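The binning of a ranked gazetteer into several features might look like the sketch below. The bin names, the cutoffs, and the choice of disjoint (rather than nested) bins are assumptions for illustration; the text only states that entities are binned by their RRQR ranking.

```python
def ranked_bins(ranked_entities, cutoffs=(1000, 10000)):
    """Map each entity to the first rank bin that contains it (disjoint bins, ASSUMED)."""
    bins = {}
    for rank, entity in enumerate(ranked_entities):
        for cutoff in cutoffs:
            if rank < cutoff:
                bins[entity] = f"top{cutoff}"
                break
        else:
            bins[entity] = "rest"           # ranked below every cutoff
    return bins

# Toy ranking with small cutoffs so the bins are easy to inspect.
bins = ranked_bins(["a", "b", "c", "d", "e"], cutoffs=(2, 4))
# bins == {'a': 'top2', 'b': 'top2', 'c': 'top4', 'd': 'top4', 'e': 'rest'}
```

Each bin name can then serve as a separate gazetteer feature in the tagger, so that highly ranked entities and low-ranked, noisier entities receive different weights.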

Conclusion
We proposed the task of selecting compact lexicons from large and noisy gazetteers. This scenario arises often in practice. We introduced a simple and effective solution based on matrix decomposition techniques: CCA is used to derive low-dimensional gazetteer embeddings and RRQR is used to find a subset of these embeddings. Experiments on slot tagging show that our method yields a relative error reduction of > 50% on average over the random selection method.