Making Information Seeking Easier: An Improved Pipeline for Conversational Search

This paper presents a highly effective pipeline for passage retrieval in a conversational search setting. The pipeline comprises two components: Conversational Term Selection (CTS) and Multi-View Reranking (MVR). CTS performs the first stage of passage retrieval. Given an input question, it uses a BERT-based classifier (trained with weak supervision) to de-contextualize the input by selecting relevant terms from the dialog history. Using the question and the selected terms, it issues a query to a search engine to retrieve an initial set of passages. MVR, in turn, performs contextualized passage reranking. It first constructs multiple views of the information need embedded within an input question; the views are based on the dialog history and the top passages obtained in the first stage of retrieval. It then uses each view to rerank passages with BERT (fine-tuned for passage ranking) and finally fuses the rankings produced by the individual views. Experiments show that this combination improves first-stage retrieval as well as the overall accuracy of the reranking pipeline. On the key metric of NDCG@3, the proposed combination achieves a relative performance improvement of 14.8% over the state-of-the-art baseline and also surpasses the Oracle.


Introduction
The abilities of current conversational assistants (Alexa, Cortana, etc.) to perform open-domain conversational information seeking (CIS) are limited (Dalton et al., 2019). Thus, to encourage and support research on conversational information seeking, the TREC Conversational Assistance Track (CAsT) (Dalton et al., 2019) defined a model of conversational information seeking in which the conversation is a sequence of related passage ranking tasks, some of which require knowledge of the conversational history.* For example, the question "Can you milk them?" in Table 1 is not by itself sufficient to support effective retrieval; the conversational context is required. More formally, given a series of natural language utterances/questions U = {u_1, u_2, u_3, ..., u_n} based on a conversational topic T, the task is to retrieve relevant passages P_i for each utterance u_i by conditioning on the utterances/questions occurring prior to it, i.e., {u_1, u_2, ..., u_{i-1}}. Note that each utterance in the conversation is essentially a question by itself.

Table 1: An example conversational topic from TREC CAsT.
Title: goat breeds
Description: Interested in buying goats; implies interest in different breeds of goats and their use (milk, meat, and fur).

Turn  Utterance (Question)
1     What are the main breeds of goat?
2     Tell me about boer goats.
3     What breed is good for meat?
4     Are angora goats good for it?
5     What about boer goats?
6     What are pygmies used for?
7     What is the best for fiber production?
8     How long do Angora goats live?
9     Can you milk them?
10    How many can you have per acre?
11    Are they profitable?

* The author is now an Applied Scientist at Amazon Alexa AI. Alternatively, he can be contacted at vaibhav4595@gmail.com.
CAsT questions pose a variety of problems for a conversational information seeking system. To begin with, the evolution of the conversation is accompanied by the introduction of pronouns, which creates an under-specified (or missing) context within the posed questions. Depending on the question, the context markers might be explicit (pronouns) or implicit (ellipsis). For example, in Table 1, turn 4 contains the pronoun 'it', which explicitly refers to the term 'meat' in turn 3. On the other hand, turn 5 does not contain any explicit pronoun marker, but implicitly asks whether 'boer goats are good for meat' by grounding itself in turns 3 and 4.
One can think of coreference resolution as a special case of context resolution. However, off-the-shelf coreference models struggle with conversational questions (Dalton et al., 2019). Such under-specified questions lead to an ineffective representation of the desired information need, causing poor retrieval of informative passages.
Recent (and relatively successful) approaches to conversational search have focused on rewriting conversational questions into de-contextualized questions that contain all the necessary information. These de-contextualized questions are then used for retrieval. For instance, one of the best performing systems submitted to TREC CAsT was ATeam's query rewriter, which used a pre-trained GPT-2 model (Radford et al., 2019) to rewrite questions. More recently, Yu et al. (2020) fine-tuned GPT-2 on a large number of ad hoc search sessions for rewriting questions.
However, the passage ranking performance of the above methods still lags behind non-automatic methods in which ground truth query reformulations are used (Dalton et al., 2019). Both automatic and non-automatic methods use a standard BERT model (fine-tuned for passage ranking) to rerank passages. Thus, even if automatic query reformulations were perfect, their overall passage retrieval performance would be upper-bounded by what the ground truth reformulations can achieve. Moreover, current automatic methods make no attempt to adapt the reranker to the conversational setting.
Similar to the idea behind pseudo-relevance feedback, this paper starts from the observation that going beyond the ground truth question reformulations, by incorporating additional terms from the dialog history and the top-retrieved passages (obtained during the first round of retrieval) that might not be present in the ground truth reformulations, can improve passage retrieval. For example, turn 6 in Table 1 is self-sufficient, i.e., there is no need to reformulate it. However, adding the term 'goat' to the question can still improve retrieval performance. At the same time, this paper also aims to adapt a typical ad hoc reranker to the conversational setting by a simple means of data fusion.
Adding to the above challenges, the TREC CAsT dataset provides only a limited number of training examples, which can hinder effective model training. Addressing all of the above issues, this paper presents a ranking pipeline aimed at improving passage retrieval in a conversational setting. The pipeline consists of two major components: Conversational Term Selection (CTS) and Multi-View Reranking (MVR).
CTS is designed to handle the first round of passage retrieval. Given an input question, CTS utilises BERT (Devlin et al., 2018) in conjunction with a linear classifier to perform binary classification over the terms provided by the dialog history. This yields a set of conversational terms, which are concatenated with the input question and submitted as a query to a search engine in order to retrieve passages. As mentioned earlier, the limited amount of training data in the CAsT dataset hinders effective training of the CTS classifier. To this end, the classifier is trained with weak supervision, utilising dialogs from a task-oriented dialog dataset (Quan et al., 2019).
MVR, on the other hand, is designed for reranking the passages obtained through CTS. By a simple means of data fusion, it adapts an ad hoc reranker to the conversational setting. It begins by constructing three different views of the information need embedded within an input question. Each view is a query in its own right and aims at extracting a different type of contextual information. The first view is based on a reformulation of the input question. Using a mechanism similar to pseudo-relevance feedback, the second and third views use the dialog history and the passages retrieved during the first round of retrieval, respectively, to expand the input question. MVR then uses each view individually to rerank passages with BERT (fine-tuned for passage ranking). Finally, it performs a fusion over the rankings produced by the individual views.
The experimental results show that the entire pipeline is highly effective for passage retrieval, i.e., it improves the first-stage retrieval of passages as well as the overall accuracy of the reranking pipeline. On the key metric of NDCG@3, the proposed pipeline achieves a relative performance improvement of 14.8% over the state-of-the-art baseline.
It also performs 3% (relative) better than the Oracle, which uses ground truth query reformulations for ranking passages. To the best of our knowledge, no automatic system has been able to beat the Oracle until now.

Related Work
Previous research provides guidance about the requirements of conversational search systems. For example, Radlinski and Craswell (2017) described desirable key features for conversational information retrieval systems. Trippas et al. (2018) identified commonly-used interactions and informed conversational search system design by studying the conversations of real users. Thomas et al. (2017) released the Microsoft Information-Seeking Conversation (MISC) dataset, which mimics conversational assistants such as Cortana.
Prior to the CAsT dataset, researchers often utilised dialog response reranking tasks, conversational question answering (Choi et al., 2018; Reddy et al., 2019) and voice-based recommendation as a 'proxy' for a conversational search setting. For example, Kenter and de Rijke (2017) presented an end-to-end trainable Attentive Memory Network for reading comprehension. Most relevant to this work, Yu et al. (2020) use large amounts of ad hoc search sessions to fine-tune GPT-2 in order to rewrite conversational queries; the rewritten queries are then used for ranking passages.
Proposed Approach

Figure 1 presents an overview of the approach. The pipeline consists of i) Conversational Term Selection (CTS), and ii) Multi-View Reranking (MVR). The CTS component uses a classifier to select relevant contextual terms from the dialog history. The classifier's predictions are then used to convert the given question into a query before submitting it to the search engine. The top R passages obtained with this method are passed to the Multi-View Reranking component, which begins by projecting the input question into Multi-View Queries. The first view utilizes the CTS predictions over the dialog history, the second view uses the top passages retrieved by CTS, and the third view is based on question reformulation. Each of these views is then individually used for reranking, which is performed using a BERT-based reranker. Finally, the rerankings produced using the individual views are combined using fusion. CTS and MVR are described in detail below.

Conversational Term Selection (CTS)
Figure 2 provides an overview. CTS is designed for first-stage passage retrieval. Given a conversational topic T, an utterance (question) u_t produced during turn t, and the set of questions T_{t-1} = {u_1, u_2, ..., u_{t-1}} produced in the turns prior to t, where each u_i comprises individual terms {u_{i,1}, u_{i,2}, u_{i,3}, ...} (u_{i,j} represents the j-th term of the i-th utterance), the CTS classifier classifies each term u_{i,j} present within the questions of T_{t-1} as 0 or 1. Term selection thus becomes a binary classification problem in which each term occurring in previous turns is either selected or removed. Each selected term acts as a relevant contextual term which can help improve retrieval. This process can also be thought of as a query expansion technique, albeit different from pseudo-relevance feedback: instead of conditioning on the retrieved documents to find appropriate expansion terms, the previous turns of the conversation are used.
The selected terms, along with the input question, are then submitted as a query to a search engine for retrieving passages. Unlike MVR, CTS does not project an input question into multiple views at retrieval time. This would be unnecessary, as the first round of retrieval only focuses on retrieving relevant passages; the job of bringing the most highly relevant passages to the top of the ranked list belongs to the reranker. Nonetheless, it would still be interesting to see how a multi-view initial ranking would affect the final reranking. This is left for future work.

Training Data Creation
For each question u_t within a conversational topic T, the CAsT dataset also provides a ground truth reformulated question r_t. These reformulated questions can be leveraged to create data for training the CTS classifier. First, for each question u_t, a set of conversational terms CT_t is created that helps resolve the context of u_t. The set CT_t consists of the terms present in r_t but not in u_t, i.e., CT_t = {r_{t,j} | r_{t,j} ∉ u_t}. Next, each term present in the questions u_1 ... u_{t-1} is marked as 1 if it is part of the set CT_t and as 0 otherwise. This process yields the required dataset.
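The labeling process above can be sketched as follows. The whitespace tokenization and punctuation stripping are simplifications of real tokenization, and the example conversation terms are illustrative:

```python
import string

def norm(term):
    """Lowercase and strip surrounding punctuation (a simplification)."""
    return term.lower().strip(string.punctuation)

def conversational_terms(u_t, r_t):
    """CT_t: terms of the ground truth reformulation r_t absent from u_t."""
    u_terms = {norm(t) for t in u_t.split()}
    return {norm(t) for t in r_t.split()} - u_terms

def label_history(history, u_t, r_t):
    """Label each term of the previous turns: 1 = select, 0 = drop."""
    ct = conversational_terms(u_t, r_t)
    return [[(term, int(norm(term) in ct)) for term in turn.split()]
            for turn in history]

labels = label_history(["Are angora goats good for it?"],
                       "Can you milk them?",
                       "Can you milk angora goats?")
# 'angora' and 'goats' appear in the reformulation but not the raw
# question, so they are the positive (label 1) terms in the history.
```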

Training with Weak Supervision
To overcome the limitations caused by the small size of the CAsT training data and to achieve better generalization, the CTS classifier is trained using weak supervision. This is done by additionally training the classifier with examples from a task-oriented dialog dataset. Quan et al. (2019) manually constructed a dataset on the basis of the public CamRest676 dataset (Wen et al., 2016), which is meant for training task-oriented dialog systems. This dataset is particularly suitable for training the CTS classifier because i) the utterances within a conversation contain ellipses and coreferences, which can provide better training signals, and ii) each utterance is accompanied by its ground truth reformulation, making it relatively straightforward to manipulate the dataset into examples suitable for training the CTS classifier. This can be done simply by using the data creation process described above (Section 3.1.1).
Note that this might lead to imprecise examples, as the CamRest676 dataset provides no information about how far back one should look within the dialog history in order to resolve the context of the input utterance. For this reason, training on the created data yields a weakly-supervised classifier.

BERT with Linear Classifier
The CTS classifier uses BERT in conjunction with a linear layer to select conversational terms. Given the question in the current turn, u_t, and the previous questions u_1 ... u_{t-1}, BERT generates token representations of the individual terms within the questions. The token representations of the terms within u_1 ... u_{t-1} are then individually passed as inputs to the linear layer, which decides whether to select each term (as in Figure 2).
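A minimal sketch of the selection head is given below. In the actual model the per-token vectors would come from BERT (768 dimensions) and the weights would be learned; here both are small illustrative stand-ins:

```python
import math

def select_terms(token_vecs, weights, bias, threshold=0.5):
    """Apply a linear layer followed by a sigmoid to each history-token
    representation; a term is selected when its probability of being a
    relevant conversational term exceeds the threshold."""
    selected = []
    for term, vec in token_vecs:
        logit = sum(w * x for w, x in zip(weights, vec)) + bias
        prob = 1.0 / (1.0 + math.exp(-logit))
        if prob > threshold:
            selected.append(term)
    return selected

# Toy 2-d "token representations"; BERT would produce 768-d vectors.
tokens = [("goats", [2.0, 0.5]), ("the", [-1.5, 0.2]), ("angora", [1.0, 1.0])]
picked = select_terms(tokens, weights=[1.0, 0.5], bias=-0.5)
# 'goats' and 'angora' clear the threshold; 'the' does not.
```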

First-Stage Passage Retrieval
After the CTS classifier selects the necessary conversational terms, the selected terms are concatenated with the current (input) question to define a query for passage retrieval. Passage retrieval is performed with the Indri search engine (Strohman et al., 2005), with the query wrapped in a 'combine' operator. Passages are indexed without the removal of stopwords, and are stemmed with the Krovetz stemmer.
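The query construction can be illustrated as below. The paper specifies only that the query is wrapped in Indri's 'combine' operator, so the exact rendering (term order, whitespace) is an assumption:

```python
def build_indri_query(question, selected_terms):
    """Concatenate the question terms with the CTS-selected terms and wrap
    them in Indri's #combine operator, which scores passages by the query
    likelihood of the combined bag of terms."""
    terms = question.split() + list(selected_terms)
    return "#combine( " + " ".join(terms) + " )"

query = build_indri_query("Can you milk them", ["angora", "goats"])
```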

Multi-View Reranking (MVR)
An overview of the architecture can be seen in Fig. 3. The input question is first converted into Multi-View Queries. Each view is produced using a different source and serves a different purpose. The first view is constructed using the terms present in the dialog history, the second view using the terms present in the retrieved passages, and the third view is a reformulation of the input question. Each of these views is individually used for reranking passages. Finally, MVR performs a fusion over the ranked lists produced by the individual views.

Multi-View Queries
As mentioned earlier, MVR constructs three different views of the query. Each view looks at a different source of information and represents the information need embedded within the input question in a different manner. The views are described below:

1. Question Expansion using Dialog History: The terms selected by the CTS classifier over the dialog history are concatenated with the input question.
2. Question Expansion using Passages: Given the input question and a few of the top passages produced during the first round of retrieval, the CTS classifier first classifies the terms present in each of the selected passages. The positively classified terms are then concatenated with the input question.
3. Query Reformulation using GPT-2: This view adopts the method presented by Yu et al. (2020). The input question is reformulated into a de-contextualized question using GPT-2 (Radford et al., 2019). For this task, GPT-2 is fine-tuned on weakly supervised data obtained from large amounts of ad hoc search sessions, aimed at mimicking conversational-style questions.
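Assembled together, the three views can be sketched as follows. The term-selection and rewriting steps are assumed to have already run; their outputs here are illustrative:

```python
def build_views(question, history_terms, passage_terms, gpt2_rewrite):
    """Produce one query string per view. history_terms and passage_terms
    are assumed to be the positively classified terms from the CTS
    classifier run over the dialog history and over the top first-stage
    passages, respectively; gpt2_rewrite is the GPT-2 reformulation."""
    return {
        "history_expansion": question + " " + " ".join(history_terms),
        "passage_expansion": question + " " + " ".join(passage_terms),
        "reformulation": gpt2_rewrite,
    }

views = build_views("Can you milk them?",
                    history_terms=["angora", "goats"],
                    passage_terms=["goats", "dairy"],
                    gpt2_rewrite="Can you milk angora goats?")
```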
Note that all three views attempt to present the same information need in different manners and at different granularities. Question Expansion using Passages attempts to de-contextualize the input question using the retrieved passages. Query Reformulation using GPT-2 attempts to produce a well-formed natural language reformulation of the input question. Question Expansion using Dialog History, in turn, is a type of pseudo-relevance feedback mechanism which aims at selecting terms from the dialog history in order to keep the focus of the input question on topic.
There is a slight difference between the Query Reformulation view and the Query Expansion views. Query reformulation only aims to reformulate the question by resolving ellipses or co-references. Query expansion aims to go beyond that by selecting additional terms which can help keep the focus of the question on topic and at the same time provide more informative terms.

BERT Reranker
Each view is individually used to rerank the passages produced during the first round of retrieval using a BERT-based reranker. Here, the BERT-base model is fine-tuned for the task of ad hoc passage ranking using the MS-MARCO passage ranking dataset. Following Nogueira and Cho (2019), BERT-base is fine-tuned on 2% of the training data.
Note that the reranker used here is the same as the one used by Yu et al. (2020) and ATeam (Dalton et al., 2019). However, MVR extends the capabilities of the reranker and adapts it to the conversational setting by exposing it to multiple forms of the input question.

Fusion
This step of MVR is straightforward: it merges the rankings produced by the individual views through a simple aggregation of the scores produced for each passage by the individual views.
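With per-view scores in hand, the fusion can be sketched as below. The paper specifies only "a simple aggregation of the scores"; summation is one natural reading and is assumed here:

```python
def fuse(view_scores):
    """Sum each passage's reranker scores across the per-view rankings and
    sort passages by the aggregated score, highest first."""
    totals = {}
    for scores in view_scores:        # one {passage_id: score} dict per view
        for pid, s in scores.items():
            totals[pid] = totals.get(pid, 0.0) + s
    return sorted(totals, key=totals.get, reverse=True)

ranking = fuse([{"p1": 0.9, "p2": 0.5},
                {"p1": 0.2, "p2": 0.8},
                {"p2": 0.3, "p3": 0.9}])
# p2 totals 1.6, p1 totals 1.1, p3 totals 0.9
```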

Experimental Methodology
Dataset: The CAsT dataset (Dalton et al., 2020) consists of 30 training topics (9 questions per topic, 269 in total) and 50 test topics (9.6 questions per topic on average, 478 in total). However, relevance judgments are available only for 20 test topics (173 questions); evaluation is therefore performed over the 20 judged topics only. The passages in the CAsT dataset are drawn from MS-MARCO and the TREC Complex Answer Retrieval Track. The annotated CamRest676 dataset used for weak supervision consists of 676 dialogs with coreference and ellipsis annotations (Quan et al., 2019).
Parameter Settings: The CTS classifier uses the BERT-base-uncased model and is fine-tuned for 5 epochs with the Adam optimiser (Kingma and Ba, 2014) and a learning rate of 5 × 10^-5. During training, the maximum length of the context is clipped to 100 tokens, and the length of the input question is clipped to 30. MVR uses a BERT-base-uncased model fine-tuned on 2% of the MS-MARCO Passage Ranking dataset; during training, the maximum length of the query is clipped to 64 tokens and that of the passage to 256. The first round of retrieval by CTS produces a total of 1000 passages per input question, of which only the top R are reranked by MVR.
Evaluation Metrics: The performance of the CTS classifier is measured using Precision (Prec), Recall and F1. Passage retrieval performance is measured using Normalized Discounted Cumulative Gain at ranking depth 3 (NDCG@3), the main metric prescribed by TREC CAsT, as well as Mean Reciprocal Rank (MRR).
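For reference, the two ranking metrics can be computed as follows (graded relevance labels for NDCG, binary relevance for the reciprocal rank); the log2 discount is the standard trec_eval convention:

```python
import math

def ndcg_at_k(gains, all_gains, k=3):
    """NDCG@k: DCG of the returned ranking (relevance labels in rank order)
    divided by the DCG of the ideal ordering of all judged labels."""
    def dcg(g):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(g[:k]))
    ideal = dcg(sorted(all_gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

def reciprocal_rank(relevance):
    """1/rank of the first relevant (label > 0) passage; 0 if none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0
```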

Experiments and Results
This section is divided into two halves. The first half evaluates the performance of CTS; the second evaluates the performance of MVR, i.e., the result of using the entire pipeline.

Efficacy of CTS
Experiments over CTS aim to answer the following questions:
• Q1: How well does the CTS classifier perform?
• Q2: To what extent does incorporating weak supervision help improve the performance of the classifier?
• Q3: What is CTS's first-round retrieval performance?

Q1: Classifier Performance

Table 2 shows the performance of the classifier when trained on the CAsT training data, and also reflects the effect of training the classifier with different amounts of dialog history. The CAsT training set is split into training and validation sets in a ratio of 4:1. Throughout this setup, the classifier is trained with a restricted amount of dialog history and tested with the entire dialog history made available to it; this helps in understanding its generalization capabilities. The precision of the classifier clearly increases with the amount of dialog history, whereas the trend for recall is the exact opposite. The F1 scores remain quite low in all cases. These trends clearly reflect the data scarcity issue mentioned in Sections 1 and 3.1.2: the classifier's generalization capabilities are hindered by the low number of training examples used for fine-tuning.

Q2: Effect of Training with Weak Supervision

Comparing the statistics in Table 2 and Table 3, it is evident that the precision of the classifier improves greatly when trained on 'Add Only'. However, there is no improvement in its recall. On the other hand, increasing k in 'Add + k previous' (with the exception of k = 1) leads to an increase in the classifier's recall. This trend contrasts with Table 2, where recall decreases with an increasing number of turns. A possible reason is that the presence of weakly supervised examples forces better grounding of the coreference terms within the dialog.
The best F1 score is obtained with 'Add + All previous', which provides almost a 60% improvement over the best result in Table 2. Thus, it is clear that the weakly supervised data helps in improving performance.
It is possible that the CTS classifier ends up selecting a few noisy terms, which might lead to low scores for some relevant passages during the first round of retrieval. However, MVR, by utilising three different types of information, should be able to boost the scores of those relevant passages, thereby bringing them to the top of the ranked list.

Q3: Passage Retrieval Performance
The first-round retrieval performance of the proposed method is compared to that of four baselines. Base1 uses the original questions without any modification. Base2 appends the nouns, verbs and adjectives from the preceding turns to the current question before retrieval. AllenCoref (Lee et al., 2017) performs co-reference resolution to rewrite the input question before performing passage retrieval. Finally, NeuralCoref uses Spacy's neural co-reference model to do the same.
The results are shown in Table 4. The results of CTS are based on the model trained on 'Add + All Previous'. The poor performance of Base1 depicts the need for finding appropriate contextual terms for effective query creation. Likewise, the poor performance of AllenCoref and NeuralCoref shows that the co-reference models were unable to resolve the questions effectively, confirming that off-the-shelf co-reference methods struggle with conversational-style questions; their results also hint that co-reference resolution alone is not enough for retrieval. Base2, which simply chooses the nouns, verbs and adjectives, performs better than the co-reference models. However, selecting all nouns, verbs and adjectives can add an undesirable amount of noise to the created query and cause a drift in its topic. By alleviating precisely this issue, CTS outperforms the other methods.
The retrieval performance of CTS is not compared here with the state-of-the-art baselines because those baselines only report their final results, obtained after the reranking phase; it is also unclear how they conduct their first round of retrieval. However, the final results of this paper provide a fair comparison with the final results of the state-of-the-art.

Efficacy of MVR
This part measures the effectiveness of the entire pipeline via the final reranking performance of MVR. Experiments over MVR aim to answer the following questions:
1. Q1: What is MVR's reranking performance?
2. Q2: What is the effect of adding different views?

Passage Reranking
The results can be seen in Table 5, where the performance of MVR is compared against several baselines. Pgbert, h2oloo RUN2 and CFDA CLIP RUN7 are the top three automatic runs submitted to TREC CAsT. Pgbert uses GPT-2 for query rewriting and then reranks the passages using BERT. Both h2oloo RUN2 and CFDA CLIP RUN7 use a heuristic-based method for query expansion; they then use the title of the conversation along with the expanded query to rerank passages using BERT. Note that in a real scenario a user may not necessarily provide a title before starting a conversation, so h2oloo RUN2 and CFDA CLIP RUN7 utilise additional information which may not be readily available; MVR does not make use of any such 'given' additional information. GPT-2 Rewrite (Yu et al., 2020) uses a fine-tuned GPT-2 for question reformulation and reranks passages using BERT. Finally, the Oracle uses ground truth question reformulations for reranking via BERT. Out of the 1000 passages retrieved by CTS, MVR reranks the top R. The Question Expansion using Passages view is constructed using the top K of the 1000 retrieved passages. Both R and K are tuned and set to 500 and 50, respectively.
As is evident from Table 5, MVR outperforms all the automatic baselines by a substantial margin. By using a sophisticated mechanism for conversational term selection, and without using any additional information such as the title, MVR performs better than h2oloo RUN2 and CFDA CLIP RUN7, both of which utilise the title and rely on a heuristic method for query expansion. This clearly shows that the expansion terms selected by CTS help MVR produce an effective ranked list of passages.
Both Pgbert and GPT-2 Rewrite use GPT-2 question reformulation, with the reformulations acting as the sole information for reranking passages. By accumulating three different types of information (one of which includes reformulation), MVR is able to perform better than these question reformulation mechanisms. In particular, MVR outperforms GPT-2 Rewrite, which is also the state-of-the-art, by 14.8% (relative).
On the key metric of NDCG@3, MVR is better than the Oracle by a slight margin. Although the NDCG@3 of MVR is greater than that of the Oracle, its MRR is slightly lower. A possible reason is that the Oracle retrieves more relevant passages than MVR, while MVR better ranks the highly relevant passages.
Note also that the rerankers used by all the baselines share the same configuration, i.e., all are fine-tuned on the MS-MARCO passage ranking corpus. It is therefore fair to say that the power of MVR lies in its Multi-View Queries.

Performance of Adding Views
The results of using different views are presented in Table 6. Clearly, the Question Expansion using Passages view does not perform well by itself. One reason could be that the questions asked in the CAsT conversations do not refer to entities within the previously retrieved passages, i.e., the questions in CAsT can be resolved using the dialog history alone; expansion using passages by itself is therefore not very effective. However, it does help when combined with the other views, because it provides extra credit to the more highly relevant passages. On the other hand, the Question Expansion using Dialog History view performs better than the best baseline, as its NDCG is higher than that of GPT-2 Rewrite (see Table 5). Note that GPT-2 Rewrite is equivalent to the question reformulation view of MVR.
Fusing the expansion-using-passages view with the expansion-using-history view provides a further improvement over the history view alone. Finally, by combining all three views together, MVR provides the best result.

Conclusion
This paper presents a simple yet highly effective pipeline for conversational search. The pipeline consists of two components: CTS and MVR. CTS aids the first round of passage retrieval by selecting important contextual terms from the dialog history. MVR reranks the passages obtained by CTS by expressing the information need embedded within a question in multiple forms. The combination surpasses the state-of-the-art and at the same time performs slightly better than the Oracle. To the best of our knowledge, no automatic system has been able to do so before.