Cross-genre Document Retrieval: Matching between Conversational and Formal Writings

This paper addresses a cross-genre document retrieval task in which the queries are in formal writing and the target documents are in conversational writing. In this task, a query is a sentence extracted from either a summary or a plot of an episode in a TV show, and the target document consists of transcripts from the corresponding episode. To establish a strong baseline, we employ a current state-of-the-art search engine to perform document retrieval on the dataset collected for this work. We then introduce a structure reranking approach that improves the initial ranking by utilizing syntactic and semantic structures generated by NLP tools. Our evaluation shows an improvement of more than 4% when structure reranking is applied, which is very promising.


Introduction
Document retrieval has been a central task in natural language processing and information retrieval. The goal is to match a query against a set of documents. Over the last decade, advanced techniques have emerged and provided powerful systems that can accurately retrieve relevant documents (Blair and Maron, 1985; Callan, 1994; Cao et al., 2006). While the retrieval part is crucial, proper ranking of the retrieved documents can significantly improve overall user satisfaction by putting more relevant documents at the top (Baliński and Daniłowicz, 2005; Yang et al., 2006; Zhou and Wade, 2009). Many previous works provide strong baselines for unstructured text retrieval and ranking problems; however, these systems usually assume a homogeneous domain for queries and target documents.
Due to the surge of applications that need to maintain conversations, dialogue data has recently become a popular target among researchers. The work in this field concerns problems such as learning facts through conversation (Fernández et al., 2011; Williams et al., 2015; Hixon et al., 2015) or dialogue summarization (Oya and Carenini, 2014; Misra et al., 2015). More recent work in this field has focused on several inter-dialogue tasks (Xu and Reitter, 2016; Kim et al., 2016; He et al., 2016). To the best of our knowledge, our work is the first to analyze cross-genre document retrieval between conversational and formal writings.
This paper analyzes the performance of state-of-the-art retrieval techniques targeting TV show transcripts and their descriptions. We first collect a dataset comprising transcripts from a popular TV show along with their summaries and plots (Section 3). We then establish a solid baseline by adapting an advanced search engine and implement structure reranking to improve the initial ranking from the search engine (Section 4). Our evaluation shows a 4% improvement, which is significant (Section 5).

Related work
Information extraction for dialogue data has already been widely explored. Yoshino et al. (2011) presented a spoken dialogue system that extracts predicate-argument structures and uses them to extract facts from news documents. Flycht-Eriksson and Jönsson (2003) developed a dialogue interaction process for accessing textual data from a bird encyclopedia. An unsupervised technique for meeting summarization using decision-related utterances was presented by Wang and Cardie (2012). Gorinski and Lapata (2015) studied movie script summarization. All the aforementioned work uses syntactic and semantic relation extraction and is thus similar to ours; however, it differs in that it lacks the cross-genre aspect.

Data
The Character Mining project provides transcripts of the TV show Friends; transcripts from 8 seasons of the show are publicly available in the JSON format, where the first 2 seasons are annotated for the character identification task (Chen and Choi, 2016). Each season consists of episodes, each episode contains scenes, and each scene includes utterances, where each utterance comes with the speaker information.
For each episode, the episode summary and plot are first collected from fan sites, then sentence-segmented by NLP4J, the same tool used for the provided transcripts. Generally, summaries give broad descriptions of the episodes, whereas plots describe facts within individual scenes. Finally, we create a dataset by treating each sentence as a query and its relevant episode as the target document. Table 2 shows the distributions of this dataset.

Structure Reranking
For each query (summary or plot) in the dataset, the task is to retrieve the document (episode) most relevant to the query. The challenge comes from the cross-genre aspect: how to retrieve documents in dialogues given queries in formal writing. This section describes our structure reranking approach that significantly outperforms an advanced search engine, Elasticsearch.

Relation Extraction
Since our queries and documents appear very different on the surface level (Table 1), relations are first extracted from them and matching is performed on the relation level, which abstracts away certain pragmatic differences between these two types of writing. All data are lemmatized, tagged with parts-of-speech and named entities, parsed into dependency trees, and labeled with semantic roles using NLP4J. A sentence may contain multiple predicates, and each predicate comes with a set of arguments. A predicate together with its arguments is considered a relation. For each argument, heuristics are applied to extract meaningful contextual words by traversing the subtree of the argument. Our heuristics are designed for the type of dependency trees generated by NLP4J, but similar rules can be generalized to other types of dependency trees. Relations from dialogues are attached with the speaker names to compensate for the lack of entity information, as illustrated in Table 1.
By extracting relations that comprise only meaningful words, this approach prunes out much noise (e.g., disfluencies), which allows the system to retrieve relevant documents with higher precision. While our relation extraction operates on the sentence level, it can be extended to the document level by adding coreference relations, which we will explore in the future.
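As a toy illustration, the extraction step above might be sketched as follows; the stopword list, tokenization, and function name are hypothetical stand-ins for our NLP4J-based heuristics, which are not reproduced here.

```python
# Hypothetical sketch of relation extraction from predicate-argument
# structures; the real heuristics traverse NLP4J dependency subtrees.
STOPWORDS = {"the", "a", "an", "to", "of", "uh", "um"}

def extract_relation(predicate, arguments, speaker=None):
    """Build a relation (a set of meaningful words) from one predicate
    and its argument phrases, pruning stopwords and disfluencies.
    The speaker name is attached for dialogue utterances to stand in
    for the missing entity information."""
    words = {predicate.lower()}
    for arg in arguments:
        for tok in arg.split():
            tok = tok.lower().strip(",.?!")
            if tok and tok not in STOPWORDS:
                words.add(tok)
    if speaker:
        words.add(speaker.lower())
    return words

# Toy utterance: "Ross: I gave the ring to Rachel."
rel = extract_relation("gave", ["I", "the ring", "to Rachel"], speaker="Ross")
```

Representing a relation as a flat set of words makes the later matching step a simple set-intersection problem.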

Figure 2: The overview of our structure reranking, consisting of scoring, binary classification, and reranking. Given documents d_1, ..., d_k and a query q, 4 sets of scores are generated: the Elasticsearch scores and the matching scores using 3 comparators: word, lemma, and embedding. The binary classifier Bin predicts whether the highest ranked document from Elasticsearch is the correct answer. If not, the system RR reranks the documents using all scores and returns a new top-ranked prediction.

Structure Matching
All relations extracted from dialogues are stored in an inverted-index manner, where words in each relation are associated with the relation and the episode in which the relation occurs. Algorithm 1 shows how our structure matching works. Given a list of documents retrieved from the index based on a query q, it first initializes the scores of all documents to 0. For each document d_i, it compares each relation r_q from q to the relations extracted from d_i. A relation r from d_i is kept in R_d if it has at least one word overlapping with r_q. For each relation r_d ∈ R_d, the comparator function returns the matching score between r_d and r_q. The maximum matching score is added to the overall score of this document. This procedure is repeated for every document; finally, the algorithm returns the overall matching scores for all documents.
Input: D, a list of documents; q, a query; f_r, a function returning all relations; f_c, a comparator function. Output: S, a list of matching scores for D.
Algorithm 1: The structure matching algorithm.
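Under the representation above (a relation as a set of words), Algorithm 1 can be sketched in a few lines; the names f_r and f_c follow the notation in the text, while the toy documents and comparator are invented for illustration.

```python
def structure_match(D, q, f_r, f_c):
    """Sketch of Algorithm 1. D: list of documents; q: query;
    f_r: returns all relations of a text; f_c: comparator scoring
    a pair of relations (each relation is a set of words)."""
    S = [0.0] * len(D)
    for i, d in enumerate(D):
        for r_q in f_r(q):
            # keep only relations of d sharing at least one word with r_q
            R_d = [r for r in f_r(d) if r & r_q]
            if R_d:
                # add the best match for this query relation
                S[i] += max(f_c(r_d, r_q) for r_d in R_d)
    return S

# Toy data: texts are pre-extracted lists of relations (word sets),
# and the comparator simply counts overlapping words.
d1 = [{"ross", "buy", "ring"}]
d2 = [{"joey", "eat", "sandwich"}]
q = [{"ross", "ring", "propose"}]
S = structure_match([d1, d2], q, lambda text: text, lambda a, b: len(a & b))
```

Here d1 shares two words with the query relation, so it receives a score of 2, while d2 shares none and stays at 0.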
The comparator function f_c takes two relation sets, r_d and r_q, and returns the matching score between them. For word and lemma, the count of overlapping words between the two sets is used to produce two scores, r_d^s and r_q^s, normalized by the length of the utterance and the query, respectively. The harmonic mean of the two scores is then returned as the final score. For embedding, f_c uses word embeddings to generate sum vectors from both sets and returns the cosine similarity of these two vectors.
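A minimal sketch of the two kinds of comparators, assuming relations are word sets; the normalization lengths and the toy embedding table are hypothetical placeholders for the actual utterance/query lengths and pretrained embeddings.

```python
import math

def word_overlap_score(r_d, r_q, utt_len, query_len):
    """Word/lemma comparator sketch: overlap counts normalized by the
    utterance and the query lengths, combined by a harmonic mean."""
    overlap = len(r_d & r_q)
    s_d = overlap / utt_len
    s_q = overlap / query_len
    if s_d + s_q == 0:
        return 0.0
    return 2 * s_d * s_q / (s_d + s_q)

def embedding_score(r_d, r_q, emb):
    """Embedding comparator sketch: cosine similarity between the sum
    vectors of the two relations (emb maps a word to a vector)."""
    def sum_vec(words):
        dims = len(next(iter(emb.values())))
        v = [0.0] * dims
        for w in words:
            for j, x in enumerate(emb.get(w, [0.0] * dims)):
                v[j] += x
        return v
    a, b = sum_vec(r_d), sum_vec(r_q)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Toy embedding table for illustration only.
emb = {"ring": [1.0, 0.0], "propose": [0.0, 1.0]}
```

The harmonic mean penalizes comparisons where only one side is well covered, which helps when a short query relation partially matches a long utterance.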

Document Reranking
The Elasticsearch scores and the 3 sets of matching scores for the top-k documents (ranked by Elasticsearch) are fed into a binary classifier to determine whether or not to accept the highest ranked document. A feedforward neural network with one hidden layer of size 15 is used for this classification. If the binary classifier disqualifies the top-ranked document, the top-k documents are reranked by the weighted sums of these scores. A grid search is performed on the development set to find the optimized set of weights. Finally, the system returns the document d_i with the highest reranked score: i = argmax_i (λ_e·e_i + λ_w·w_i + λ_l·l_i + λ_m·m_i), where e_i, w_i, l_i, and m_i are the Elasticsearch, word, lemma, and embedding scores of d_i.
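The weighted-sum reranking step can be sketched as follows; the score keys e, w, l, m mirror the formula above (Elasticsearch, word, lemma, embedding), while the dictionary layout is an assumption made for illustration.

```python
def rerank(scores, weights):
    """Weighted-sum reranking sketch. `scores` maps each score type
    ('e', 'w', 'l', 'm') to a list over the top-k documents; `weights`
    gives the corresponding lambdas. Returns the index of the document
    maximizing the weighted sum."""
    k = len(scores["e"])
    combined = [
        sum(weights[t] * scores[t][i] for t in ("e", "w", "l", "m"))
        for i in range(k)
    ]
    return max(range(k), key=lambda i: combined[i])

# Toy top-2 list: document 0 leads on the Elasticsearch score,
# but document 1 wins once the structure-matching scores are added.
best = rerank(
    {"e": [2.0, 1.0], "w": [0.0, 3.0], "l": [0.0, 0.0], "m": [0.0, 0.0]},
    {"e": 1.0, "w": 1.0, "l": 1.0, "m": 1.0},
)
```

With all weights set to 1 this corresponds to the unweighted Rerank_1 setting; the grid-searched lambdas give the Rerank_λ setting.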

Experiments
The data in Section 3 is split into training, development, and evaluation sets, where queries from each episode are randomly assigned. Two standard metrics are used for evaluation: precision at k (P@k) and mean reciprocal rank (MRR).
Table 4: Evaluation on the development and evaluation sets for summary, plot, and all (summary + plot). Elastic_10: Elasticsearch with k = 10; Struct_{w,l,m}: structure matching using words, lemmas, and embeddings; Rerank_{1,λ}: unweighted and weighted reranking.

Elasticsearch
Elasticsearch (www.elastic.co/products/elasticsearch) is used to establish a strong baseline. Each episode is indexed as a document using the default setting, Okapi BM25 (Robertson et al., 2009), a TF-IDF-based similarity with improved normalization; the top-k most relevant documents are retrieved for each query. While P@1 is less than 50% (Table 5), P@10 shows greater than 70% coverage, implying that it is possible to achieve a higher P@1 by reranking results from k ≥ 10.
Table 5: Elasticsearch results on all (summary + plot).
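For reference, Okapi BM25 can be computed in plain Python as below; this is a textbook sketch of the scoring formula, not Elasticsearch's exact implementation, which differs in analysis and normalization details.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document (a token list) against a query (a token list)
    with Okapi BM25. k1 controls term-frequency saturation and b controls
    document-length normalization."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    dfs = Counter()                       # document frequencies
    for d in docs:
        dfs.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((N - dfs[t] + 0.5) / (dfs[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores

scores = bm25_scores(["ring"], [["ross", "ring"], ["joey", "sandwich"]])
```

In this toy index, only the first document contains the query term, so only it receives a positive score.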

Structure Matching
The Struct_* rows in Table 4 show the results based on structure matching (Section 4.2). The highest P@1 of 39.53% is achieved on the evaluation set using lemmas. Although this is about 8% lower than the one achieved by Elasticsearch, we hypothesize that this approach can correctly retrieve documents for certain queries that Elasticsearch cannot. To validate this hypothesis, we test structure matching on the subset of queries failed by Elasticsearch: we first take the top-10 results from Elasticsearch, then rerank the results using the scores from structure matching for queries on which Elasticsearch gives a P@1 of 0%. As shown in Table 6, structure matching is capable of correctly reranking a significant portion (around 20%) of these queries, supporting our hypothesis.

Document Reranking
The scores from Elastic_10 and Struct_* for each document are fed into the binary classifier, which decides whether or not to accept the top-1 result from Elasticsearch. If not, the documents are reranked by the weighted sum of these scores (Section 4.3). The Rerank_1 row in Table 4 shows the results when all weights are set to 1, which gives an improvement of over 4% in P@1 on the evaluation set. The Rerank_λ row shows the results with the optimized weights, which give an additional 3% boost on the development set but not on the evaluation set. It is worth mentioning that we initially tackled this as a document classification task using convolutional neural networks similar to Kim (2014); however, this gave P@1 ≈ 20% and MRR ≈ 33%. Such poor results were due to the huge size of our documents, over 4.6K words on average, which is beyond the capacity of a typical CNN. Thus, we decided to focus on reranking, which gave the best performance.

Conclusion
We propose a cross-genre document retrieval task that matches TV show transcripts against their descriptions in summaries and plots. Our structure reranking approach gives an improvement of more than 4% in P@1, showing promising results for this task. In the future, we will add more structural information such as coreference relations to our structure matching and apply a more sophisticated parameter optimization technique, such as Bayesian optimization, for finding λ_*.