Neural Transition Based Parsing of Web Queries: An Entity Based Approach

Web queries with question intent manifest a complex syntactic structure, and the processing of this structure is important for their interpretation. Pinter et al. (2016) formalized the grammar of these queries and proposed semi-supervised algorithms that adapt parsers designed for the standard dependency grammar so that they account for the unique forest grammar of queries. However, their algorithms rely on resources typically not available outside of large web corporations. We propose a new BiLSTM query parser that: (1) explicitly accounts for the unique grammar of web queries; and (2) utilizes named entity (NE) information from a BiLSTM NE tagger that can be trained jointly with the parser. In order to train our model we annotate the query treebank of Pinter et al. (2016) with NEs. When trained on 2500 annotated queries, our parser achieves a UAS of 83.5% and a segmentation F1-score of 84.5, substantially outperforming existing state-of-the-art parsers.


Introduction
Web queries, authored by users in order to search for information, form a major gate to the Web, and their correct interpretation is hence invaluable. While earlier research (Bergsma and Wang, 2007; Barr et al., 2008) suggested that many queries are trivial in structure, Pinter et al. (2016) (henceforth PRS16) demonstrated that this is often not the case. In particular, they demonstrated that queries related to questions that are answered in Community Question Answering (CQA) sites (social QA forums such as Yahoo Answers) follow a complex dependency grammar. As such queries are quite frequent (e.g. an early study (White et al., 2015) showed that they constitute ∼10% of all queries issued to search engines) and their interpretation can benefit from structural analysis (e.g. Tsur et al. (2016)), effective query parsing is important.
In order to properly describe the syntactic structure of queries, PRS16 extended the standard dependency grammar so that it accounts for dependency forests that consist of syntactically independent segments, each of which has its own internal dependency tree. Additionally, they constructed a query treebank consisting of 4,000 CQA queries, manually annotated according to the query grammar. Examples of annotated queries from the treebank are given in Figures 2 and 4.
PRS16 presented two algorithms that adapt off-the-shelf dependency parsers trained on standard edited text so that they produce syntactic structures that conform to the query grammar. Importantly, their methods do not change the states and transitions of the parser. Instead, they require millions of unannotated queries, each paired with the title (usually a grammatical question) of the Yahoo Answers question page that was clicked by the user who initiated the query; it is the alignment between the query and the title that provides the training signal for their algorithms. Unfortunately, millions of (query, title) pairs are typically not available outside of large web corporations, and the practical value of the PRS16 algorithms is hence limited. Moreover, despite the unique supervision signal, their parsers achieve a segmentation F1-score of at most 70.4, which leaves considerable room for improvement.
In this paper we present a transition-based BiLSTM query parser that requires no distant supervision. Our parser is based on two ideas: (a) We change the standard transition system so that the parser explicitly supports the PRS16 query grammar (§ 3.2); and (b) Observing that entities are very frequent in CQA queries and provide a strong structural signal (§ 4), we extend our parser to consider information from a named entity (NE) tagger. We explore both sequential and joint training of the NE tagger and the parser and demonstrate the superiority of the joint approach.
As another contribution of this paper, we annotate the dataset of PRS16 with NEs (§ 4). We use this data to establish our observation about the importance of NEs for query parsing, and in order to train and test our entity-aware parser.
We split the PRS16 corpus into training (2500 queries), development (750) and test (750) sections (§ 6). In this training setup our segmentation and entity-aware parser achieves a segmentation F1-score of 84.5 (100 on single-segment queries, 60.7 on multi-segment queries) and a dependency parsing UAS of 83.5. Our model outperforms its simpler variants that do not utilize segmentation and/or NE information. For example, the BiLSTM parser of Kiperwasser and Goldberg (2016), which forms the basis for our model, scores 67.7 in segmentation F1 and 77.0 in UAS.
We note that our training setup is very different from that of PRS16. They trained their parser on edited text from the OntoNotes 5 corpus (Weischedel et al., 2013) augmented with millions of (query, title) pairs, and their test set consists of the 4000 queries of the query treebank, as they do not train on queries. While our work is not directly comparable to theirs, it is worth mentioning that their best model scores 70.4 in segmentation F1 and 76.4 in UAS, much lower than the numbers we report here for our best models.

Previous Work
We divide this section into two parts: we start with works that analyze the structure of queries and the role of NEs in their processing, and then discuss work on parsing of user-generated content on the web.
Query structure and entity analysis. As noted in PRS16, web queries differ from standard sentences, as they tend to be shorter and have a unique dependency structure. Hence, prior to PRS16 several works addressed the syntactic structure of web queries. However, all these works were restricted to tasks that are much simpler than full dependency parsing, including POS tagging (Bendersky et al., 2010; Ganchev et al., 2012), phrase chunking (Bendersky et al., 2011), semantic tagging (Manshadi and Li, 2009; Li, 2010) and classification of queries into syntactic classes (Allan and Raghavan, 2002; Barr et al., 2008).
NER has been recognized as a fundamental problem in query processing by Guo et al. (2009), and many works since (e.g. Alasiry et al., 2012; Eiselt and Figueroa, 2013; Zhai et al., 2016) explored various models and features for the task. Differently from those works, our goal is to design a BiLSTM model that can be easily integrated with modern BiLSTM parsers. We hence use simple input features and a simple NE scheme (see e.g. Guo et al. (2009) for more fine-grained distinctions). More sophisticated features, entity schemes and deep learning architectures (e.g. Lample et al., 2016) are left for the future.
Syntactic parsing of Web data. Only a handful of papers aimed to parse web data. One important example is the shared task of Petrov and McDonald (2012) on parsing web data from the Google Web Treebank, consisting of texts from the email, weblog, CQA, newsgroup, and review domains. Other relevant works are the tweet parsers of Foster et al. (2011), Kong et al. (2014) and Liu et al. (2018). However, none of these works addressed the unique properties of web queries with question intent, which express information needs in a concise manner (e.g. with one or more phrases or sentence fragments) and follow a forest-based grammar.
PRS16 were the first, and to the best of our knowledge the only, work to address the parsing of web queries with question intent. However, as noted in § 1, their algorithms rely on millions of (query, title) pairs, which renders them impractical for most users. In practice, they started with a query log of 60M Yahoo Answers pages and ended up using 7.5M queries as distant supervision. In this paper we aim to overcome this limitation by introducing a high-quality query parser that can train on several thousand annotated queries and provide higher UAS and segmentation F1 figures than those reported in PRS16 (see § 1 for their training protocol and data).
We finally note that joint parsing and NER was explored in the past (Reichart et al., 2008; Finkel and Manning, 2009, 2010), but for edited text, standard grammar and different modeling techniques. Our work re-emphasizes the strong ties between NER and parsing, in the context of query analysis.

Segmentation-Aware Parsing
In this section we present a parser that explicitly accounts for the query dependency grammar of PRS16. We start (§ 3.1) with a brief description of the BiLSTM parser of Kiperwasser and Goldberg (2016) (henceforth KG16), which forms the basis for our parser, and then describe our query parser.

The KG16 BiLSTM Parser
KG16 presented a BiLSTM model for transition-based dependency parsing (Figure 1). Given a sentence $s$ with words $w_1, \ldots, w_n$ and corresponding POS tags $p_1, \ldots, p_n$, the word $w_i$ is represented as:

$$x_i = e(w_i) \circ e(p_i) \tag{1}$$

where $e(w_i)$ and $e(p_i)$ are the embeddings of $w_i$ and $p_i$, respectively, and $\circ$ is the vector concatenation operator. The BiLSTM consists of two LSTMs: $LSTM_{forward}$ and $LSTM_{backward}$.
Given an input vector $x_i$, $LSTM_{forward}(x_i)$ captures the past context, and is calculated using the information in the input vectors $x_1, \ldots, x_{i-1}$. Similarly, $LSTM_{backward}(x_i)$ captures the future context and is calculated using the information in the input vectors $x_n, \ldots, x_{i+1}$. The resulting representation of $x_i$, denoted $v_i$, is given by:

$$v_i = LSTM_{forward}(x_i) \circ LSTM_{backward}(x_i) \tag{2}$$

The parser implements the arc-hybrid system (Kuhlmann et al., 2011), which uses a configuration $c = (\sigma, \beta, A)$, where $\sigma$ is a stack, $\beta$ is a buffer and $A$ is a set of dependency arcs. The arc-hybrid system allows three transitions: $SHIFT$, $LEFT_{arc}$ and $RIGHT_{arc}$. At each step the parser scores the possible transitions and selects the highest scoring one.
The parser represents each configuration by the concatenation of the BiLSTM embeddings of the first word in the buffer ($b_0$) and the three words at the top of the stack ($s_0$, $s_1$ and $s_2$). Then, a multi-layer perceptron (MLP) with one hidden layer scores the possible transitions given the current configuration:

$$MLP_\theta(c) = W^2 \cdot \tanh\left(W^1 \cdot (v_{s_2} \circ v_{s_1} \circ v_{s_0} \circ v_{b_0}) + b^1\right) + b^2 \tag{3}$$

The embedding vectors are initialized using the Xavier initialization (Glorot and Bengio, 2010) and trained as part of the BiLSTM.
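As a rough illustration of this scoring step, the sketch below concatenates four stand-in word vectors into a configuration vector and scores three transitions with a one-hidden-layer MLP. The dimensions, the random parameters, the `tanh` non-linearity and all function names are assumptions made for the example; the random vectors merely stand in for BiLSTM outputs.

```python
import math
import random

random.seed(0)
D, H = 4, 6                                   # assumed vector and hidden sizes
TRANSITIONS = ["SHIFT", "LEFT_ARC", "RIGHT_ARC"]

def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]

def rand_mat(rows, cols):
    return [rand_vec(cols) for _ in range(rows)]

def affine(W, x, b):
    """Compute W·x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def score_transitions(config_vec, W1, b1, W2, b2):
    """One hidden layer with tanh, then one output score per transition."""
    hidden = [math.tanh(z) for z in affine(W1, config_vec, b1)]
    return affine(W2, hidden, b2)

# Stand-ins for the BiLSTM vectors of s2, s1, s0 and b0.
v_s2, v_s1, v_s0, v_b0 = (rand_vec(D) for _ in range(4))
config = v_s2 + v_s1 + v_s0 + v_b0            # list concatenation

W1, b1 = rand_mat(H, 4 * D), [0.0] * H
W2, b2 = rand_mat(len(TRANSITIONS), H), [0.0] * len(TRANSITIONS)

scores = score_transitions(config, W1, b1, W2, b2)
# Greedy decoding: pick the highest scoring transition.
best = TRANSITIONS[max(range(len(scores)), key=scores.__getitem__)]
```

In a real parser the chosen transition would additionally be restricted to those whose preconditions hold in the current configuration.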
The parser employs a margin-based loss (MBL) function at each step:

$$\max\left(0,\; 1 - \max_{t_o \in G} MLP_\theta(c)[t_o] + \max_{t_p \in T \setminus G} MLP_\theta(c)[t_p]\right) \tag{4}$$

where $T$ is the set of possible transitions and $G$ is the set of correct transitions. The losses are summed throughout the parsing of a sentence and the parameters are updated accordingly. The parser employs a dynamic oracle (Goldberg and Nivre, 2013), which enables exploration in training.
We next describe our modification of the KG16 parser. Specifically, we change the transition logic so that it can directly account for the query grammar defined in PRS16.
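The stepwise margin-based loss can be sketched as a hinge between the best-scoring correct transition and the best-scoring incorrect one. The function name, the dictionary-based interface and the example scores are assumptions for illustration only.

```python
def margin_loss(scores, gold):
    """Hinge loss for one parsing step.

    scores: dict mapping each possible transition (the set T) to its score
    gold:   set of correct transitions (the set G)
    """
    best_gold = max(scores[t] for t in gold)
    wrong = [s for t, s in scores.items() if t not in gold]
    # If every possible transition is correct, there is nothing to separate.
    best_wrong = max(wrong) if wrong else float("-inf")
    return max(0.0, 1.0 - best_gold + best_wrong)

step_scores = {"SHIFT": 2.0, "LEFT_ARC": 1.5, "RIGHT_ARC": -0.5}
loss_small_margin = margin_loss(step_scores, {"SHIFT"})      # margin of 0.5 is too small
loss_wrong_choice = margin_loss(step_scores, {"RIGHT_ARC"})  # gold scored lowest
```

The sentence-level loss is then the sum of these values over all steps taken while parsing the sentence.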

A Segmentation-Aware BiLSTM Parser
In order for our parser to directly account for the forest-based query grammar of PRS16, we follow previous work (e.g. Nivre (2009)) and modify its set of actions and transition logic. Before we do that, we start with a more standard modification.
An arc-eager KG16 parser. The first step in the design of our parser is changing the arc-hybrid system of the KG16 parser to an arc-eager system (Nivre, 2008). To do that we change the definitions of the $RIGHT_{arc}$ and $LEFT_{arc}$ transitions and add a $REDUCE$ transition. The (original) arc-hybrid and the (modified) arc-eager KG16 parsers are denoted with $P_H$ and $P_E$, respectively.
The motivation for this change is the addition of the $REDUCE$ transition, which explicitly facilitates segmentation. After the words in the stack $\sigma$ are reduced, they cannot be connected to the unprocessed words in the buffer $\beta$. This state constitutes a segmentation point. We use this connection between the $REDUCE$ transition and the segmentation operation as an integral part of our new segmentation-aware parser, and hence all the following parsers extend this arc-eager parser.
A segmentation-aware parser. In our parser, denoted as $P_B$ (for BASIC), a configuration $c = (\sigma, \beta_1, \beta_2, A)$ consists of a stack $\sigma = [s_0, s_1, \ldots]$, two buffers, $\beta_1 = [b_{1_0}, b_{1_1}, \ldots]$ and $\beta_2 = [b_{2_0}, \ldots, b_{2_{last}}]$, and a set of dependency arcs $A$. The buffers $\beta_1$ and $\beta_2$ contain the unprocessed tokens, and the words within $\beta_2$ form the current segment. We expand the configuration representation to include not only the representations of the first token in the buffer $\beta_1$ ($b_{1_0}$) and the three tokens at the top of the stack $\sigma$ ($s_0$, $s_1$, $s_2$), but also the representation of the last token in the buffer $\beta_2$ ($b_{2_{last}}$).
Given a sentence $s = w_1, \ldots, w_n$, the initial configuration is

$$c_0 = ([ROOT],\; [w_1, \ldots, w_n],\; [\,],\; \emptyset)$$

In the final configuration the stack $\sigma$ contains only the ROOT token (we refer to this as an empty stack) and both buffers, $\beta_1$ and $\beta_2$, are empty. The new transition set, described in Table 1, includes a new transition: PushToSeg. This transition adds a new token to the current segment by pushing the top token of $\beta_1$ to the end of $\beta_2$. The preconditions of the $REDUCE$ transition have also been modified: this transition is only allowed if $\beta_2$ is empty or there is more than one word in the stack.
To induce a parse forest, this new parser performs a two-step process that repeats until convergence. The first step is segment allocation, consisting of a sequence of PushToSeg transitions. When the parser reaches a configuration in which $\beta_2$ and $\sigma$ are empty, only this transition is allowed. There can be one or more consecutive PushToSeg transitions, each of which pushes a new token from $\beta_1$ to $\beta_2$. This step ends once the parser selects any other transition, which forms a segmentation point. In the second step the allocated segment is parsed. In this step the parser acts as an arc-eager parser with $\beta_2$ as the main buffer, and the PushToSeg transition and the $\beta_1$ buffer are ignored until the segment is completely parsed (the PushToSeg transition is forbidden while the stack is not empty).
An example of the parsing process is provided in Figure 2. Appendix A provides a proof that the parser is complete and sound, as Nivre (2008) requires of any dependency parser.
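The two-step process can be illustrated with a small replay of a transition sequence over a two-buffer configuration. This is only a sketch: Table 1 is not reproduced here, so the transition effects follow the standard arc-eager definitions, the preconditions encode our reading of the text (including counting "more than one word in the stack" without ROOT), and the attachments chosen for the example query are assumed. All names are hypothetical.

```python
from dataclasses import dataclass, field

ROOT = 0  # index of the artificial ROOT token; words are numbered from 1

@dataclass
class Config:
    stack: list                                   # sigma, with ROOT at the bottom
    beta1: list                                   # tokens not yet allocated to a segment
    beta2: list                                   # tokens of the current segment
    arcs: set = field(default_factory=set)        # (head, dependent) pairs
    segments: list = field(default_factory=list)  # segments allocated so far

def initial(n):
    return Config(stack=[ROOT], beta1=list(range(1, n + 1)), beta2=[])

def stack_empty(c):
    # A stack that holds only ROOT is referred to as empty.
    return c.stack == [ROOT]

def apply(c, t):
    """Apply transition t to c in place, asserting the assumed preconditions."""
    if t == "PUSH_TO_SEG":
        assert stack_empty(c) and c.beta1
        if not c.beta2:
            c.segments.append([])                 # first token of a new segment
        tok = c.beta1.pop(0)
        c.beta2.append(tok)
        c.segments[-1].append(tok)
    elif t == "SHIFT":
        assert c.beta2
        c.stack.append(c.beta2.pop(0))
    elif t == "LEFT_ARC":                         # b0 heads s0; s0 is popped
        assert c.beta2 and not stack_empty(c)
        c.arcs.add((c.beta2[0], c.stack.pop()))
    elif t == "RIGHT_ARC":                        # s0 heads b0; b0 is pushed
        assert c.beta2
        c.arcs.add((c.stack[-1], c.beta2[0]))
        c.stack.append(c.beta2.pop(0))
    elif t == "REDUCE":
        # beta2 is empty, or more than one (non-ROOT) word is on the stack.
        assert not stack_empty(c) and (not c.beta2 or len(c.stack) > 2)
        c.stack.pop()
    else:
        raise ValueError(t)

def replay(n, transitions):
    c = initial(n)
    for t in transitions:
        apply(c, t)
    return c

# "invent toy school project": tokens 1..4, two segments of two words each.
sequence = (["PUSH_TO_SEG"] * 2 + ["RIGHT_ARC", "RIGHT_ARC", "REDUCE", "REDUCE"]
            + ["PUSH_TO_SEG"] * 2 + ["SHIFT", "LEFT_ARC", "RIGHT_ARC", "REDUCE"])
final = replay(4, sequence)
```

The twelve transitions follow the pattern described for Figure 2 (two PushToSeg transitions followed by four parsing transitions, twice), and the replay ends in a final configuration with an empty stack and two empty buffers.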
We next describe two auxiliary segmentation models that can be integrated with our parser.

Auxiliary Segmentation Models
We consider two models: one is independent of the parser while the other is added as a component to the parser.
Independent segmentation model. Similarly to our parser, this model, denoted as SEG, is a BiLSTM that feeds an MLP classifier, which predicts for every input word whether it is a segmentation point or not. The loss is a sum of word-level MBL functions (equation 4). The input word representation, $x_i$, and the definition of the hidden word vector, $v_i$, are as in equations 1 and 2, respectively; the output scores vector $o_{seg}(w_i)$ is derived from the hidden vector, $h_{seg}$, as in equation 3:

$$h_{seg}(w_i) = \tanh(W^1_s \cdot v_i + b^1_s), \qquad o_{seg}(w_i) = W^2_s \cdot h_{seg}(w_i) + b^2_s \tag{5}$$

where $W^1_s$, $W^2_s$, $b^1_s$, $b^2_s$ are parameters.
We consider two ways through which our $P_B$ parser uses the information from the SEG model. The parser we denote with $P_S$ concatenates the SEG hidden vector of the top word of the stack to the configuration representation:

$$c_{P_S} = c_{P_B} \circ h_{seg}(s_0)$$

Alternatively, the parser we denote with $P_{FS}$ (for FULL SEG) concatenates the hidden vectors of all the configuration elements to the configuration representation:

$$c_{P_{FS}} = c_{P_B} \circ h_{seg}(s_2) \circ h_{seg}(s_1) \circ h_{seg}(s_0) \circ h_{seg}(b_{1_0}) \circ h_{seg}(b_{2_{last}})$$

Here $c_{P_B}$ denotes the configuration representation of $P_B$. This segmentation model can be trained independently of the parser or jointly with it. In development data experiments independent training was superior, so we report results with this option.
Figure 2: Example of the application of our segmentation-aware parser to the multi-segment query "invent toy school project" (borrowed from PRS16). First, a sequence of PushToSeg transitions is performed in order to insert the first segment, "invent toy", into the buffer $\beta_2$ (transitions 1-2). Then, the segment is parsed until the buffer $\beta_2$ and the stack $\sigma$ are empty (3-6). Similarly, the second segment, "school project", is pushed to the buffer $\beta_2$ through a sequence of PushToSeg transitions (7-8) and then parsed (9-12).
Configuration-based segmentation model. This model, denoted as FLAG, predicts whether a given parser configuration is a segmentation point. The answer is positive if the processed words (words in the stack $\sigma$) are in the same segment and the unprocessed words (words in the buffer $\beta_2$ or $\beta_1$) are not in this segment. The model is a simple MLP that receives a parser configuration as input and produces a hidden vector (denoted with $h_{flag}(c)$) and a scores vector (denoted with $o_{flag}(c)$). The equations for $h_{flag}(c)$ and $o_{flag}(c)$ are similar to equation 5, and the loss is an MBL loss as in equation 4. Information from this model is integrated into the parser configuration representation, and we refer to the resulting parser as $P_{Fl}$ (for FLAG). The FLAG model must be trained jointly with the parser, as its input is a parser configuration.
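The gold label that the FLAG model is asked to predict can be sketched as a simple predicate over a configuration and the gold segmentation. The function name and the `segment_of` mapping are hypothetical, and treating a configuration with an empty stack as a negative example is an assumption of this sketch.

```python
def is_segmentation_point(stack, beta2, beta1, segment_of):
    """Gold FLAG label for a parser configuration.

    stack, beta2, beta1: lists of token indices (ROOT is not included)
    segment_of:          dict mapping a token index to its gold segment id
    """
    stack_segments = {segment_of[t] for t in stack}
    if len(stack_segments) != 1:
        return False          # empty stack, or stack words from several segments
    (seg,) = stack_segments
    # No unprocessed word may belong to the segment currently on the stack.
    return all(segment_of[t] != seg for t in beta2 + beta1)

# "invent toy school project": tokens 1-2 form segment 0, tokens 3-4 segment 1.
gold = {1: 0, 2: 0, 3: 1, 4: 1}
after_first_segment = is_segmentation_point([1, 2], [], [3, 4], gold)
inside_first_segment = is_segmentation_point([1], [2], [3, 4], gold)
```

Under this reading, a configuration in which all remaining words are exhausted is also labeled as a segmentation point, since no unprocessed word shares the segment on the stack.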
As shown in § 7, adapting the KG16 parser to explicitly account for multiple segments improves over the KG16 parser in the task of query parsing. We next show how additional gains can be achieved when recognizing the role of NEs.

Entities in Query Parsing
In this section we explore the role of NEs in the syntactic structure of queries. We first describe our NE annotation process, and then qualitatively demonstrate the valuable structural cues they provide. In § 5 we will describe extensions of the segmentation-aware BiLSTM parser (§ 3.2) that integrate information from a BiLSTM NE tagger.
Data. We consider five entity types: Location (e.g. "London"), Person (e.g. "Marilyn Monroe"), Organization (e.g. "Google"), Product (e.g. "Iphone 4") and Other (see e.g. Figure 4 for NEs such as "song name" and "computer game"). Two human annotators annotated the dataset. Of the 4000 queries, 400 were randomly selected for initial tagging by both annotators, so that they could discuss ambiguous cases and resolve conflicts (the labeled micro-F1 score between the annotators at this stage was 85.6). Then, the remaining 3600 queries were equally split between the two annotators, who again consulted each other in ambiguous cases (the inter-annotator micro-F1, measured on a randomly sampled set of 100 queries, was 92.0).
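For readers who wish to reproduce an agreement figure of this kind, one common way to compute a labeled micro-F1 between two annotators is sketched below. It assumes span-level exact matching on (start, end, type) triples, which may differ from the exact protocol used for the numbers above; the function name and the toy annotations are hypothetical.

```python
def micro_f1(spans_a, spans_b):
    """Labeled micro-F1 between two annotations of the same queries.

    spans_a, spans_b: lists (one item per query) of sets of
                      (start, end, entity_type) tuples.
    """
    matched = sum(len(a & b) for a, b in zip(spans_a, spans_b))
    total = sum(len(a) for a in spans_a) + sum(len(b) for b in spans_b)
    if total == 0:
        return 0.0            # convention used here when no entity was marked
    # Equivalent to 2PR/(P+R) with P = matched/|B| and R = matched/|A|.
    return 100.0 * 2 * matched / total

annotator_a = [{(0, 2, "Person")}, {(0, 1, "Product"), (3, 4, "Location")}]
annotator_b = [{(0, 2, "Person")}, {(0, 1, "Organization"), (3, 4, "Location")}]
agreement = micro_f1(annotator_a, annotator_b)
```

Because precision and recall simply swap when the two annotators are exchanged, the score is symmetric and neither annotator needs to be treated as the reference.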

NEs as a dependency parsing signal
The dataset consists of 3010 single-segment (henceforth SSG) and 990 multi-segment (henceforth MSG) queries. 62.5% of the queries (59% of the SSG and 72% of the MSG) contain at least one NE. Figure 3 presents statistics about NEs in queries and the segmentation signal they provide. For example, as many as 35.4% of segments within the MSG queries start with an NE (42% of the first segments and 30% of the other segments). Finally, Figure 4 presents three example queries where NEs provide invaluable cues about the syntactic structure. Now that we have established the importance of NEs for query parsing, we are ready to describe our entity-aware query parser.
Figure 3. Middle: NE type distribution (the "only names" column refers to queries that consist of named entities only, and the "any name" column refers to queries that contain at least one named entity). Bottom: The percentage of queries and segments that start with an NE.

A Segmentation and Entity-Aware BiLSTM Parser
Here we describe the integration of NE signals in our segmentation-aware query parser (§ 3.2).
A BiLSTM NE tagger. Our NE tagger is a BiLSTM with an MLP classification layer, very similar to our independent segmentation model (SEG, § 3.3). We denote the MLP's hidden state with $h_{ne}(w_i)$ and its output scores vector with $o_{ne}(w_i)$.
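The model equations are:

$$h_{ne}(w_i) = \tanh(W^1_n \cdot v_i + b^1_n), \qquad o_{ne}(w_i) = W^2_n \cdot h_{ne}(w_i) + b^2_n \tag{6}$$

The margin-based loss (MBL) function we use in this model is:

$$\max\left(0,\; 1 - MLP_\theta(w_i)[ne_{correct}] + MLP_\theta(w_i)[ne_{predicted}]\right) \tag{7}$$

where $\theta = (W^1_n, b^1_n, W^2_n, b^2_n)$ are the model parameters, $ne_{predicted}$ is the named entity type predicted by the model and $ne_{correct}$ is the gold named entity type. $MLP_\theta(w_i)[ne_i]$ is the score given by the MLP to the $ne_i$ named entity type.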
NE-aware parsing. We consider two methods for integrating information from the NE tagger into the parser. In both methods, we construct a new feature representation for each input word, denoted with $v^M_i$, where $M$ stands for the integration method. The new word representations are then used in the configuration representation (§ 3).
The first method, denoted with Hi (for Hidden), uses the hidden vector $h_{ne}(w_i)$:

$$v^{Hi}_i = v_i \circ h_{ne}(w_i)$$

The second method, denoted with Fi (for Final), uses the NE embeddings:

$$v^{Fi}_i = v_i \circ e(ne_{predicted}(w_i))$$

For both methods $v_i$ is the word representation generated by the parser (equation 2). For $v^{Fi}_i$, $ne_{predicted}(w_i)$ is the named entity type predicted by the tagger for the word $w_i$ and $e(ne_{predicted}(w_i))$ is its embedding (part of the parser parametrization).
Figure 4: Three example queries from the PRS16 dataset, along with their parse trees. NEs provide an important signal about the structure of the queries. (b) "tom waits" is a name of a person; "chocolate jesus" is a name of a song. (c) Query "skyrim marry jarl elisif": "skyrim" is a name of a computer game; "jarl elisif" is a name of a character in the game.
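The two integration methods, Hi and Fi, both amount to extending each word vector by concatenation. The sketch below uses plain Python lists in place of real network vectors; the function names, the vector sizes and the presence of a "None" (non-entity) label in the label set are assumptions for illustration.

```python
NE_TYPES = ["None", "Location", "Person", "Organization", "Product", "Other"]

def integrate_hidden(v_i, h_ne_i):
    """Hi: append the tagger's hidden vector to the parser's word vector."""
    return v_i + h_ne_i

def integrate_final(v_i, ne_scores_i, ne_embeddings):
    """Fi: append the embedding of the NE type with the highest tagger score."""
    predicted = max(range(len(ne_scores_i)), key=ne_scores_i.__getitem__)
    return v_i + ne_embeddings[predicted]

v_i = [0.1, 0.2]                                   # parser word vector (toy values)
h_ne_i = [0.5, 0.6, 0.7]                           # tagger hidden vector
ne_scores_i = [0.1, 0.3, 2.0, -1.0, 0.0, 0.4]      # one score per NE type
ne_embeddings = [[float(k), -float(k)] for k in range(len(NE_TYPES))]

v_hi = integrate_hidden(v_i, h_ne_i)
v_fi = integrate_final(v_i, ne_scores_i, ne_embeddings)
```

The practical difference is that Hi passes the tagger's soft internal state to the parser, whereas Fi commits to the tagger's single predicted type and lets the parser learn an embedding for it.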
Tagger and parser training. We consider two approaches:
(a) Independent training: First, the tagger is trained with the gold NEs of the training set (§ 4); then the tagger is applied to the training set; and finally the parser is trained with the gold parse trees and the tagger's NE tagging of the training set.
(b) Joint training: The parameters of both models are updated together, with each update taking place after observing a single input sentence. In this joint model, the parser and the tagger both use the same BiLSTM to learn the word representation $v_i$. The loss function of this model is the sum of the losses of the parser (the sum of the stepwise losses of Eq. 4) and the tagger (Eq. 7):

$$L_{joint} = L_{parser} + L_{tagger}$$

Experiments
Task and data. Our task is the query parsing task of PRS16, but unlike them we do not use millions of unannotated queries. Instead, we experiment with a supervised setup where the parser is trained on parsed web queries and no unannotated queries are used. For our experiments we randomly split the PRS16 dataset of 4000 queries annotated with dependency structures and POS tags (available at webscope.sandbox.yahoo.com, dataset L-28) into train, development and test sections. This split is done so that: (a) The train set consists of 2500 queries while the dev and the test sets consist of 750 queries each; and (b) For each k ≥ 1, k-segment queries are split between the three sets so as to keep the 2500:750:750 proportion. As a result, the train, dev and test sets contain 618, 185 and 186 MSG queries, respectively, for a total of 989 MSG queries.
We consider the evaluation measures of PRS16: (a) The standard dependency parsing Unlabeled Attachment Score (UAS); and (b) Segmentation F1-score, where a segment is considered correct if both its start and end point are correctly identified.
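The two evaluation measures can be sketched as follows. The interfaces are hypothetical, and pooling counts over the whole corpus (rather than averaging per query) is an assumption of this sketch; the evaluation scripts of PRS16 may differ in such details.

```python
def uas(gold_heads, pred_heads):
    """Percentage of words whose predicted head equals the gold head.

    gold_heads, pred_heads: lists (one item per query) of head-index lists.
    """
    correct = sum(g == p
                  for gh, ph in zip(gold_heads, pred_heads)
                  for g, p in zip(gh, ph))
    total = sum(len(gh) for gh in gold_heads)
    return 100.0 * correct / total

def segmentation_f1(gold_segments, pred_segments):
    """F1 over segments; a segment counts only if start and end both match.

    gold_segments, pred_segments: lists (one item per query) of sets of
                                  (start, end) token spans.
    """
    matched = sum(len(g & p) for g, p in zip(gold_segments, pred_segments))
    n_gold = sum(len(g) for g in gold_segments)
    n_pred = sum(len(p) for p in pred_segments)
    if matched == 0:
        return 0.0
    precision, recall = matched / n_pred, matched / n_gold
    return 100.0 * 2 * precision * recall / (precision + recall)

# Query 1 is a two-segment query parsed as one segment; query 2 is parsed correctly.
gold_heads = [[0, 1, 4, 0], [2, 0, 2]]
pred_heads = [[0, 1, 2, 0], [2, 0, 2]]
gold_segs = [{(1, 2), (3, 4)}, {(1, 3)}]
pred_segs = [{(1, 4)}, {(1, 3)}]

toy_uas = uas(gold_heads, pred_heads)
toy_f1 = segmentation_f1(gold_segs, pred_segs)
```

Note how a single missed segmentation point in the first toy query costs two gold segments and adds one wrong predicted segment, which is why segmentation F1 is a demanding measure on multi-segment queries.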

Models and baselines
We experiment with three model families: (a) The baseline KG16 parser and our arc-eager variant of the parser (§ 3.1, § 3.2); (b) Our segmentation-aware parsers (§ 3.2, § 3.3); and (c) Our segmentation and entity-aware parsers (§ 5), where the parser and the NE tagger are trained either jointly or independently (we also consider the integration of NE information into the original (arc-hybrid) KG16 parser).
Hyper-parameter tuning. Following KG16, and due to the large number of models we experiment with, we consider a relatively small grid of hyper-parameter values, focused around the values chosen by these authors, as described in Table 2.
To avoid a very large number of experiments, we tune the parameters for the original KG16 arc-hybrid model ($P_H$) and for our arc-eager version of the parser ($P_E$). We then report test-set results for the $P_H$ model with its tuned hyper-parameters, and for all the other models with the hyper-parameters that were estimated for the $P_E$ model. While this setup gives an unfair advantage to the baseline $P_H$ model, it helps us avoid an expensive model-specific tuning process. The auxiliary segmentation models (§ 3.3) and the NE tagger (§ 5) use the hyper-parameters of Table 2, but for these models we do not perform any tuning: for each hyper-parameter with more than one option, we use the leftmost number from the table.

Results
Our results are presented in Table 3. We focus on selected members of each model family, and within each model family we focus on the simplest models ($P_H$, $P_E$ and $P_B$, with segmentation and entity information when appropriate), and on the most complex ones. We make sure to include the best performing model of each category, which happens to be one of the most complex models for all families, emphasizing the quality of our modeling choices. The results for the full list of models are in the supplementary material.
The Baseline parsers section of the table demonstrates the impact of moving from the arc-hybrid variant to an arc-eager variant of KG16, to better support segmentation. While the $P_H$ model performs slightly better than $P_E$ in terms of UAS (78.3 vs. 77.0), the segmentation F1-score of $P_E$ is 4.3 points better (72.0 vs. 67.7). Interestingly, this improvement is not achieved through better segmentation of MSG queries, but by avoiding unnecessary segmentation decisions on SSG queries.
The segmentation-aware parsers section of the table shows that further extending the arc-eager KG16 parser to explicitly account for segmentation results in substantial segmentation improvements. In particular, the overall F1-score of $P_B$, our segmentation-aware parser that does not use information from any auxiliary segmentation models, is as high as 77.4. This amounts to 5.4 and 9.7 additional F1 points compared to the arc-eager and the original arc-hybrid KG16 parsers, respectively. Information from auxiliary segmentation models ($P_{S+Fl}$ and $P_{FS+Fl}$) does not substantially increase performance in this family.
When considering entity information, the performance of our models and of the $P_E$ baseline substantially improves. However, while the UAS of the $P_E$ baseline increases to 80.2 ($P_{E+Fi}$ with independent training), its segmentation F1-score does not cross the 66.9 bound ($P_{E+Hi}$ with joint training). (The UAS numbers of the $P_H$ models with entity information are similar to those of the $P_E$ models, but their segmentation quality is lower. Due to space limitations, we do not provide these numbers.) The gain of the segmentation-aware parser from entity information is much more substantial. First, regardless of how the entity information is integrated into the model (Hi vs. Fi) and of whether the auxiliary segmentation models (S, FS and Fl) are used or not, the model perfectly segments the SSG queries. Moreover, it demonstrates substantial performance boosts with respect to all measures. Our best performing model, $P_{S+Fl+Fi}$ with joint training (bold result in the bottom model section of the table), improves over the original KG16 parser ($P_H$, top row of the table) by 6.5 UAS points (83.5 vs. 77.0) and by 16.8 segmentation F1 points (84.5 vs. 67.7).
Overall, joint training of the parsing and NER models improves over independent training for both segmentation F1 and parsing UAS (in 5 of 8 cases for each measure, two bottom model sections of the table). This improvement comes mostly from SSG queries (7 of 8 UAS cases and the 2 cases where segmentation F1 could improve), but at the cost of some degradation on MSG queries. Moreover, our best model is a jointly trained one.
Table 3 notation: H and E stand for the arc-hybrid and arc-eager KG16 parsers, while B is our segmentation-aware parser without any auxiliary segmentation model or NE information. All models where these letters do not appear refer to extensions of our segmentation-aware parser (B). S: independent segmentation model. FS: full independent segmentation model. Fl: configuration-based segmentation model. Hi: NE-aware parser (Hidden). Fi: NE-aware parser (Final).
Figure 7: Example multi-segment query where the $P_{S+Fl+Fi}$ parser succeeds when trained jointly with the named entity tagger, and fails when using a pre-trained named entity tagger.
While the goal of this paper is mostly to improve the syntactic analysis of web queries, our simple NE tagger provides decent results. When trained independently of the parser, its test-set (labeled) micro-F1 score is 85.6. When trained jointly with the parser it achieves similar scores in some cases: e.g. when jointly trained with the best performing parsing model, $P_{S+Fl+Fi}$, it achieves a micro-F1 of 85.2. Yet, in other cases, such as joint training with $P_{S+Hi}$ and $P_{S+Fi}$, its micro-F1 drops to 78.0 and 78.5, respectively. Finally, Figures 5-7 provide some qualitative analysis of our models and baselines.

Conclusions
We presented a new BiLSTM transition-based parser for web queries. Our parser is the first that explicitly accounts for the forest-based query grammar of PRS16. Moreover, we demonstrated the importance of NEs for understanding the syntactic structure of web queries, annotated the Query Treebank of PRS16 with NEs, and demonstrated how to effectively use NE information in the syntactic parsing of web queries.
In future work we intend to explore methods for closing the performance gap our algorithms still have for MSG queries (both UAS and segmentation F1) and for SSG queries (UAS only). Relevant directions include improving the transition logic of our parser, the BiLSTM NE model and the interactions between the two models.