Incorporating Behavioral Hypotheses for Query Generation

Generative neural networks have proven effective for query suggestion. Commonly posed as a conditional generation problem, the task leverages a user's earlier inputs in a search session to predict queries they are likely to issue at a later time. User inputs come in various forms, such as querying and clicking, each of which can carry different semantic signals channeled through the corresponding behavioral patterns. This paper induces these behavioral biases as hypotheses for query generation, presenting a generic encoder-decoder Transformer framework that aggregates arbitrary hypotheses of choice. Our experimental results show that the proposed approach yields significant improvements in top-$k$ word error rate and BERT F1 score over a recent BART model.


Introduction
Query suggestion is key to the usability of a search engine: it helps users formulate more effective queries or explore related search needs. Prior work tackles this problem with primarily two strategies. The first is based on a discriminative characterization, where candidate queries drawn from production logs are ranked to align with what users are most likely to issue next. Although effective (Ahmad et al., 2018), this strategy is inherently restricted to what is available in the logs, which in turn can penalize tail queries (Dehghani et al., 2017). In this work, we pursue the second strategy, casting query suggestion as a natural language generation problem that aims to produce effective continuations of the user's intent via generative modeling.
For query generation, prior research has focused mostly on extending standard Seq2Seq models where the input is a concatenation of earlier queries a user has submitted in a session (Sordoni et al., 2015; Dehghani et al., 2017). However, the literature often leaves out the influence of clickthrough actions (i.e., red blocks in Figure 1), which we argue should be taken into account in the generative process as they can be surrogates of the user's implicit search intent (Yin et al., 2016). Users may exhibit diverse behaviors, such as consecutively issuing queries without further engagement, or following up a single query with extensive clickthrough actions. These vastly different patterns are indicative of the information pieces that users find most relevant, which we conjecture can help produce suggestions better aligned with user needs.

Figure 1: An example search session where a user issues queries and optionally performs clicks at timestamps 1 to n. At time n+1, the user issues Q_{n+1} following the previous search context of length n.
We present an encoder-decoder Transformer model for the generative task that incorporates these patterns, which we call behavioral hypotheses. One challenge with Transformers is that they make minimal assumptions about the input (i.e., a single string of tokens), making it non-trivial to add multiple hypotheses directly. To address this issue, we propose a generic approach that leverages tokenwise attention to aggregate multiple behavioral hypotheses encoded by a shared Transformer encoder, BART (Lewis et al., 2019). The resulting end-to-end model can capture the underlying user-induced belief while maintaining the same order of complexity as the original BART. For evaluation, we conduct experiments on over 600K search sessions sampled from a major commercial job search engine in Australia. With evaluation metrics including word error rate and BertScore (Zhang et al., 2020), we show that the approach outperforms prior competitive baselines and a recent Transformer model, BART, suggesting that attending to behavioral patterns is crucial to reflecting users' intent.

Related Work
Generative approaches have been studied extensively in machine translation (Sutskever et al., 2014), dialogue systems (Wen et al., 2015), and many other related areas (Gatt and Krahmer, 2018). The methodology was first applied to the query domain by Sordoni et al. (2015). Query suggestion has traditionally been a web search usability task. Ranking-based approaches that leverage query co-occurrence and discriminative modeling are known to be most effective (Ozertem et al., 2012), but are also likely to suffer from the lack of appropriate candidates for rarely seen queries. Some recent work sought to characterize the generative nature of this process (Sordoni et al., 2015; Dehghani et al., 2017). The hierarchical formulation of the sequence-to-sequence model (Sordoni et al., 2015) can effectively capture query transitions, but does not offer a mechanism to incorporate implicit user signals (Wu et al., 2018). Our approach combines heterogeneous behavioral hypotheses by leveraging large-scale encoders and cross-structure attention. Apart from similar attempts at encoding multiple sentences (Dai et al., 2019; Zhao et al., 2020), our work in a generative setting tackles a different problem: decoding over a meshed representation originating from multiple sources.

Approach
Let Q = (Q_1, Q_2, ..., Q_n) represent a sequence of queries submitted consecutively by a user, where each Q_i comprises a sequence of terms. Each Q_i can lead to a succession of follow-up user interactions, among which we are mostly interested in the textual matching cues C_i that enticed clicking on some underlying documents. The full list of such matching cues is denoted as C = (C_1, C_2, ..., C_n), in which each C_i is a set of text excerpts t_{i1}, ..., t_{im}, such as document titles or metadata, displayed in response to Q_i. Given such a search context (Q, C), the query generation task aims to create a candidate query Q_{n+1} that the user is most likely to follow up with. The overall process is depicted in Figure 1.
Behavioral Hypotheses. We conjecture that, when making a new query, the user takes inspiration from his/her preceding search context, following some behavioral hypotheses formed over preceding queries or matching cues. In this paper we characterize this influence on formulated queries through four hypotheses, K_1 through K_4. Each of these loosely specifies a generative story behind the process: influence may come directly from preceding queries (K_1), preceding matching cues (K_2), interacted queries and the respective matching cues (K_3), or the most recently submitted query and observed cues (K_4).
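To make the four hypotheses concrete, the following sketch (our illustration, not the authors' code; the helper name `build_hypotheses` and the data layout are assumptions) assembles K_1 through K_4 from a session's queries and clicked matching cues:

```python
# Illustrative sketch: derive the four behavioral hypotheses from a session.
# `queries` is [Q_1, ..., Q_n]; `cues` is [C_1, ..., C_n], where each C_i is
# a (possibly empty) list of clicked text excerpts shown in response to Q_i.

def build_hypotheses(queries, cues):
    assert len(queries) == len(cues)
    k1 = list(queries)                            # K1: all preceding queries
    k2 = [t for c in cues for t in c]             # K2: all preceding matching cues
    k3 = [x for q, c in zip(queries, cues) if c   # K3: interacted queries + their cues
          for x in [q, *c]]
    k4 = [queries[-1], *cues[-1]]                 # K4: last query + its observed cues
    return {"K1": k1, "K2": k2, "K3": k3, "K4": k4}

session_q = ["data analyst", "data analyst sydney"]
session_c = [[], ["Senior Data Analyst - Sydney CBD"]]
hyps = build_hypotheses(session_q, session_c)
```

Note that K_3 drops queries without any clicks (the first query above), whereas K_1 keeps them.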
Vanilla Encoder-Decoder Transformer. Recent advances in transfer learning have popularized pretrained encoder-decoder Transformer networks, which set the state of the art across the board (Vaswani et al., 2017; Raffel et al., 2019; Lewis et al., 2019). Query generation can be cast as a sequence-to-sequence problem and fine-tuned on any of these pretrained models. A simplistic but typical approach is, on the input side, to concatenate all items in a behavioral hypothesis, regardless of their types, into one sentence with separator tokens inserted in between, and on the output side to simply use the ground truth to be generated, i.e., Q_{n+1}. All input/output sentences are first tokenized using the same byte-pair encoding and formatted by adding start/end tokens at the sentence boundaries. Following this preparation step, the input sentence is encoded into a vector representation by multiple Transformer layers, and decoded on the other side using a similar stack.
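The input preparation step can be sketched as below (an illustration under assumed conventions: the token strings follow BART's "<s>"/"</s>" markers, and the exact formatting in the paper may differ):

```python
# Sketch of the "vanilla" input preparation: concatenate the items of one
# behavioral hypothesis into a single sentence with separator tokens, then
# add start/end markers. Token names assume BART's conventions.

def to_input_sentence(items, bos="<s>", eos="</s>", sep="</s>"):
    body = f" {sep} ".join(items)
    return f"{bos} {body} {eos}"

s = to_input_sentence(["data analyst", "data analyst sydney"])
```

Here `s` is the single string handed to the byte-pair tokenizer before encoding.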
In this paper we use the BART model (Lewis et al., 2019) to implement this encoder-decoder network. BART leverages specialized pretraining objectives such as text infilling and is known to be performant on text generation problems such as machine translation and summarization.
Meshed Representations. One caveat of the above process is that the model is agnostic to the presence of multiple behavioral hypotheses in the input. To solve this problem, we propose a new approach that derives a meshed representation of the input hypotheses K_1, ..., K_4 by reusing a single BART instance. Each hypothesis is first encoded using a shared BART encoder, and a tokenwise attention mechanism learns effective ways to contextualize the individual representations together. This essentially combines four input streams token by token, so that each hypothesis can contribute to the aggregate at each token step, to varying degrees determined by the attention weights. We expect this change to help surface regularities across different hypotheses, much as regularization or multi-task learning eases learning trajectories. Tokenwise attention is also designed to encourage early correction, in the hope that a more robust representation can be formed by the end of the meshed sequence.
The procedure can be described as follows. Let T be the sequence length and h_t^{(k)} the shared encoder's output for hypothesis K_k at token position t. We have:

alpha_t = softmax(W_attn [h_t^{(1)}; h_t^{(2)}; h_t^{(3)}; h_t^{(4)}]),    O_t = sum_{k=1}^{4} alpha_{t,k} h_t^{(k)},

where W_attn is the attention weight matrix to be learned and O the output. On the decoder side, the attention mask is set to the union of the attention masks from all underpinning hypotheses. This approach does not require multiple BART instances, but it takes extra GPU memory, linear in the batch size, to cache the processed representations, and additional computation cycles to work through all four behavioral hypotheses.
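A minimal NumPy sketch of this tokenwise mesh is given below. It is our illustration, not the paper's implementation: the encoder outputs are random stand-ins, and the exact parameterization of the learned scoring matrix is an assumption (here it scores the concatenated per-token states of all four hypotheses).

```python
import numpy as np

# Tokenwise attention mesh (sketch): four encoder output streams of shape
# (T, d) are combined position by position; each token step gets its own
# softmax distribution over the four hypotheses.

def mesh(H, W_attn):
    # H: (4, T, d) encoder outputs; W_attn: (4 * d, 4) learned scoring matrix
    K, T, d = H.shape
    concat = H.transpose(1, 0, 2).reshape(T, K * d)   # (T, 4d): per-token concat
    scores = concat @ W_attn                          # (T, 4): one score per hypothesis
    scores -= scores.max(axis=1, keepdims=True)       # numerically stable softmax
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=1, keepdims=True)         # tokenwise attention weights
    O = np.einsum("tk,ktd->td", alpha, H)             # weighted sum over hypotheses
    return O, alpha

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 6, 8))                        # 4 hypotheses, T=6, d=8
W = rng.normal(size=(32, 4))
O, alpha = mesh(H, W)                                 # O: (6, 8) meshed representation
```

Since the four streams share one encoder, only the small scoring matrix is added on top of BART, keeping the parameter count in the same order as the original model.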

Experimental Setup
Our main testbed was a sample of session logs from the SEEK job search engine (https://www.seek.com.au). This task domain is known for its characteristic query topics, which revolve around role titles, skills, and entities such as company names or geo-locations, and for user behaviors distinctively different from those in general web search. We prefer this dataset over the AOL logs for the availability of clicked document texts, but our approach should be equally applicable to other search domains.
We collected textual queries Q_i and, as C_i, the titles of documents clicked on in response to Q_i. All search sessions were anonymized to ensure that the query and click information cannot be linked back to individual users. Session boundaries were determined by an inactivity of 30 minutes or more between two consecutive actions. In each session the latest query was held out as the ground truth. Training sessions (500K) were gathered from a two-week span starting Oct 1, 2019, with a separate split from the same period selected as a dev set (1K); the latter two weeks of the same month were then sampled to form the test sessions (100K). About 15% of the collected sessions exceeded the maximum sequence length of the BART encoder and were removed from the experiment, to avoid inadvertently favoring the proposed approach when the baseline may only see truncated input. Standard preprocessing steps removed noisy queries that occurred 10 times or fewer across the periods, as well as singleton sessions containing only one query. In our experiments we compare the following approaches:

Seq2Seq+Attn

A standard sequence-to-sequence model using a two-layer bidirectional GRU as the encoder and a uni-directional attentive GRU as the decoder. Our implementation used 1,000 hidden dimensions and the same byte-pair encodings as the other methods.

MPS (Most Popular Suggestion)
A simple yet effective baseline used in prior work (Hasan et al., 2011; Sordoni et al., 2015; Dehghani et al., 2017), based on the co-occurrence frequencies of the last query in the search context with all candidate queries.
BART The vanilla BART model (Lewis et al., 2019). We took the full concatenated search context as input and fine-tuned from the pretrained weights of the BART-Large model, with 12 Transformer layers in total.
MeshBART The proposed meshed variant of BART, configured to have the same model capacity. It takes multiple input hypotheses and combines them using the proposed tokenwise attention before entering the decoding phase.
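As a side note on the data preparation above, the 30-minute inactivity rule for session segmentation can be sketched as follows (our illustration only; the log schema and timestamps are assumptions):

```python
from datetime import datetime, timedelta

# Sketch: split one user's time-ordered action stream into sessions whenever
# two consecutive actions are 30 or more minutes apart.

def sessionize(timestamps, gap=timedelta(minutes=30)):
    sessions, current = [], [0]
    for i in range(1, len(timestamps)):
        if timestamps[i] - timestamps[i - 1] >= gap:
            sessions.append(current)   # close the current session
            current = []
        current.append(i)
    sessions.append(current)
    return sessions                    # lists of action indices per session

ts = [datetime(2019, 10, 1, 9, 0),
      datetime(2019, 10, 1, 9, 10),
      datetime(2019, 10, 1, 10, 0),   # 50-minute gap: starts a new session
      datetime(2019, 10, 1, 10, 5)]
splits = sessionize(ts)
```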
We report word error rate (WER) and BERT F1 score (BertF1) (Popović and Ney, 2007; Zhang et al., 2020), adapted to the top-k setting, with respect to the reference (ground truth) across the given k hypotheses.
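A plausible reading of the top-k WER adaptation is sketched below (an assumption on our part: word-level edit distance normalized by reference length, taking the best of the k generated hypotheses; the paper's exact formulation may differ):

```python
# Sketch of top-k word error rate: the best (minimum) normalized word-level
# edit distance between the reference and any of the top-k hypotheses.

def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance over word lists
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution/match
    return dp[m][n]

def wer_at_k(reference, hypotheses, k):
    ref = reference.split()
    return min(edit_distance(h.split(), ref) / len(ref)
               for h in hypotheses[:k])
```

For example, with reference "data analyst sydney", the hypothesis "data analyst" has one word-level error against three reference words, giving a WER of 1/3.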
In addition to generation quality, we also measure ranking performance by mean reciprocal rank (MRR@k) and success at k (S@k), following prior work. The encoder-decoder models were trained throughout with cross-entropy loss. All neural models were trained for up to 3 epochs (roughly 83k steps) and early-stopped if no further gain on dev WER@3 was observed within the next 10k steps. At inference time, up to 5 suggestions were generated for each session using beam search (width = 8). The batch size was set to 128 for Seq2Seq+Attn and 16 for both BART-based methods. All experiments were conducted on a single NVIDIA T4 GPU.
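The ranking metrics can be sketched per session as follows (exact-match comparison against the held-out query is our assumption):

```python
# Sketch of the ranking metrics over one session's ranked suggestions.
# MRR@k: reciprocal rank of the first suggestion matching the held-out
# ground-truth query; S@k: 1 if any of the top-k suggestions matches.

def mrr_at_k(reference, suggestions, k):
    for rank, s in enumerate(suggestions[:k], start=1):
        if s == reference:
            return 1.0 / rank
    return 0.0

def success_at_k(reference, suggestions, k):
    return 1.0 if reference in suggestions[:k] else 0.0

suggs = ["data analyst", "data analyst sydney", "data scientist"]
```

Corpus-level MRR@k and S@k would then average these per-session values over all test sessions.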

Results
Quality of Generated Queries. We present the effectiveness scores of the different generation models in Table 1. The results show that query generation remains a difficult task: the top-3 beam search output of the standard Seq2Seq-with-attention baseline makes on average 0.88 errors per token. Consistent with prior work (Sordoni et al., 2015), MPS delivers competitive performance, supporting its wide adoption in production systems. Word error rates are pushed down further by the vanilla BART that simply encodes sessions as long sequences, showcasing the superior modeling power of pretrained Transformer networks. Among these results, MeshBART consistently demonstrates the best effectiveness across all metrics, suggesting that combining critical signals from the given behavioral hypotheses improves generation quality.

We also investigate whether generation quality is influenced by other factors of the test population. Bucketing the test sessions by length, we find in Figure 2 that long sessions result in lower performance across all models. The diverse and complex intents commonly involved in long sessions make next-query prediction particularly challenging. Interestingly, WER@3 and MRR@3 respond differently to increased session length for different methods; e.g., BART can perform worse than MPS on excessively long sessions. Apart from that, MeshBART remains the most competitive across all buckets, suggesting that it is a robust approach for query generation.
Further comparisons with the non-generative MPS also shed light on the strengths of the proposed approach. In a win/tie/loss analysis of top-3 ranking performance across all test sessions, MeshBART records 30% wins, 52% ties, and 18% losses against MPS on MRR@3. Another analysis, conditioned on sessions with only one preceding query, shows that MeshBART can produce at least one novel suggestion (i.e., a query not seen in the candidate pool) for 39.2% of the test sessions. The effect is more pronounced when the preceding query is rare or has a relatively small candidate pool. Examples in Table 2 show that the proposed approach can formulate reasonable follow-up queries by generalizing seen query parts.
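The win/tie/loss comparison above reduces to a simple per-session tally, sketched here over paired per-session scores (the pairing scheme is an assumption on our part):

```python
# Sketch of a win/tie/loss analysis: given one score per test session for
# two systems (e.g., MRR@3), count how often system A beats, ties, or
# loses to system B.

def wtl(scores_a, scores_b):
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    ties = sum(a == b for a, b in zip(scores_a, scores_b))
    losses = len(scores_a) - wins - ties
    return wins, ties, losses

result = wtl([1.0, 0.5, 0.0, 1.0], [0.5, 0.5, 1.0, 0.0])
```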
Analysis of Behavioral Hypotheses. Our design of the behavioral hypotheses in Section 3 is inspired by users' interaction patterns. Figure 3a illustrates the intensity of user clicking with respect to the last query in a session, suggesting that the majority of clicks are predominantly centered on the last query, irrespective of the total number of clicks. Job search users are found to be more persistent in articulating an effective query, and from there extensively consume most returned results before disengaging from the search session. This characteristic is best reflected by the K_4 hypothesis in our modeling framework.
To understand the inner workings of the meshed attention, Figure 3b visualizes the actual attention values assigned to the four presented hypotheses across different length buckets. On the one hand, the attention weight associated with K_1 (i.e., all preceding queries) is found to correlate positively with session length. Longer search sessions may reflect the user actively exploring the search space, and this increased attribution signals the importance of explicit search intents from the modeling perspective. On the other hand, K_4 tends to receive less attention as the search session grows, indicating that the importance of the most recent interactions becomes diluted in long, exploratory search journeys. The value of recent interactions in generative modeling is best illustrated by Figure 3c, where the attention weight of K_4 correlates positively with the intensity of last-round clicking in a search session. These results suggest that our approach has the flexibility to draw information from different hypotheses in a unified query generation process.

Conclusions
This paper presents an effective approach for incorporating user-induced interaction patterns, as behavioral hypotheses, into the query generation process. Under an encoder-decoder Transformer framework, the proposed tokenwise attention demonstrates the desired modeling behavior, placing emphasis on different behavioral hypotheses on different occasions. On a domain-specific search benchmark, our model outperforms all reference methods, both in aggregate and across varying session properties, demonstrating its effectiveness in a robust way. In future work, we will focus on producing novel continuations of the user's search intent, extending the approach to other domains, and automating the design of behavioral hypotheses. Qualitative evaluation for open-ended generation is also an interesting topic on our roadmap.