Towards Improving Dialogue Topic Tracking Performances with Wikification of Concept Mentions

Dialogue topic tracking aims at analyzing and maintaining topic transitions in on-going dialogues. This paper proposes to utilize Wikification-based features for providing mention-level correspondences to Wikipedia concepts for dialogue topic tracking. The experimental results show that our proposed features can significantly improve the performances of the task in mixed-initiative human-human dialogues.


Introduction
Dialogue topic tracking aims at detecting topic transitions and predicting topic categories in ongoing dialogues which address more than a single topic. Since human communications in real-world situations tend to consist of a series of multiple topics even for a single domain, tracking dialogue topics plays a key role in analyzing human-human dialogues as well as improving the naturalness of human-machine interactions by conducting multitopic conversations.
Some researchers (Nakata et al., 2002; Lagus and Kuusisto, 2002; Adams and Martell, 2008) have attempted to solve this problem by applying text categorization to the utterances in a given turn. However, these approaches are effective only when users explicitly mention topic-related expressions in their utterances, because text categorization models assume that the proper category for each textual unit can be assigned based only on its own contents.
The other direction of dialogue topic tracking makes use of external knowledge sources, including domain models (Roy and Subramaniam, 2006), heuristics (Young et al., 2007), and agendas (Bohus and Rudnicky, 2003; Lee et al., 2008). While these knowledge-based methods have the advantage of handling system-initiative dialogues by controlling dialogue flows based on given resources, they suffer from low flexibility in handling the user's responses and from high costs for building the resources.
Recently, we proposed to explore domain knowledge from Wikipedia for mixed-initiative dialogue topic tracking without significant costs for building resources (Kim et al., 2014a; Kim et al., 2014b). In these methods, a set of articles whose contents are similar to a given dialogue segment is selected using a vector space model. Various types of information obtained from the articles are then used to learn topic trackers based on kernel methods.
In this work, we focus on the following limitations of our former work, which retrieves relevant concepts at a given turn using the term vector similarity between each pair of dialogue segment and Wikipedia article. Firstly, the contents of a conversation can be expressed in totally different ways from the descriptions in the relevant Wikipedia articles. This mismatch between spoken dialogues and a written encyclopedia can lead to inaccuracy in selecting the proper Wikipedia articles as sources of domain knowledge. Secondly, a set of articles selected by comparison with a whole dialogue segment is limited in its ability to reflect multiple relevances when more than one concept is mentioned in the segment. Lastly, the lack of semantic or discourse aspects in concept retrieval limits the tracker's capability to deal with implicitly mentioned subjects.
To solve these issues, we propose to incorporate Wikification (Mihalcea and Csomai, 2007) features for building dialogue topic trackers. The goal of Wikification is to resolve the ambiguities and variabilities of every mention in natural language by linking the expression to its relevant Wikipedia concept. Since this task is performed using not only surface form features, but also various types of semantic and discourse aspects obtained from both the given texts and the Wikipedia collection, our proposed method utilizing the results of Wikification helps to improve tracking performance compared to the former approaches based on dialogue segment-level correspondences.

Figure 1: Examples of dialogue topic tracking on Singapore tour guide dialogues

Dialogue Topic Tracking
Dialogue topic tracking can be defined as a classification problem of detecting where topic transitions occur and which topic category follows each transition. The most probable pair of topics just before and after each turn is predicted by the following classifier:

$$f(x_t) = (y_{t-1}, y_t),$$

where $x_t$ contains the input features obtained at a turn $t$, $y_t \in C$, and $C$ is a closed set of topic categories. If a topic transition occurs at $t$, $y_t$ should be different from $y_{t-1}$. Otherwise, both $y_t$ and $y_{t-1}$ have the same value. Figure 1 shows an example of dialogue topic tracking in a dialogue fragment from the Singapore tour guide domain between a tourist and a guide. This conversation is divided into four segments, since $f$ detects three topic transitions at $t_1$, $t_4$ and $t_6$. The mixed-initiative aspects are also shown in this dialogue, because the first two transitions are initiated by the tourist, while the other one is driven by the guide without any explicit request from the tourist. From these results, we can obtain a topic sequence of 'Attraction', 'Transportation', and 'Food'.
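The relation between per-turn labels and topic transitions can be illustrated with a minimal sketch. The label sequence below is hypothetical and only loosely follows the transitions described for Figure 1; the `NONE` start label and the function name are assumptions made for this illustration, not part of the original system.

```python
from typing import List, Tuple


def topic_transitions(labels: List[str], initial: str = "NONE") -> List[Tuple[int, str, str]]:
    """Return (turn index, previous topic, new topic) for every turn whose topic changes."""
    transitions = []
    previous = initial
    for t, current in enumerate(labels):
        if current != previous:
            # y_t differs from y_{t-1}: a topic transition occurs at turn t
            transitions.append((t, previous, current))
        previous = current
    return transitions


# Hypothetical per-turn topic labels y_0 .. y_8 (assumed for illustration only)
labels = ["NONE", "ATTR", "ATTR", "ATTR", "TRSP", "TRSP", "FOOD", "FOOD", "FOOD"]
transitions = topic_transitions(labels)
num_segments = len(transitions) + 1
print(transitions, num_segments)
```

Under these assumed labels, three transitions split the dialogue into four segments, which matches the segmentation described above.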

Wikification of Concept Mentions in Spoken Dialogues
Wikification aims at linking mentions to the relevant entries in Wikipedia. As shown in the examples in Figure 2 for the dialogue in Figure 1, this task is performed by dealing with co-references, ambiguities, and variabilities of the mentions. Following most previous work on Wikification (Bunescu and Pasca, 2006; Mihalcea and Csomai, 2007; Milne and Witten, 2008; Dredze et al., 2010; Han and Sun, 2011; Chen and Ji, 2011), this work also adopts a supervised learning-to-rank algorithm for determining the most relevant concept for each mention in transcribed utterances.
In this work, every noun phrase in a given dialogue session is defined as a single mention. To capture more abstract concepts, we take not only named entities and base noun phrases, but also every complex or recursive noun phrase in a dialogue as an instance to be linked. For each mention, a set of candidates is retrieved from a Lucene¹ index built on the whole Wikipedia collection divided at the section level. The ranking score $s(m, c)$ for a given pair of a mention $m$ and its candidate concept $c$ is assigned as follows:

$$s(m, c) = \begin{cases} 4 & \text{if } c \text{ is exactly the same as } g(m), \\ 3 & \text{if } c \text{ is the parent article of } g(m), \\ 2 & \text{if } c \text{ belongs to the same article but a different section of } g(m), \\ 1 & \text{otherwise,} \end{cases}$$

where $g(m)$ is the manual annotation for the most relevant concept of $m$.

| Name | Description |
|------|-------------|
| SP | the speaker who spoke that mention |
| WM | word n-grams within the surface of m |
| WT | word n-grams within the title of c |
| EMT | whether the surface of m is the same as the title of c |
| EMR | whether the surface of m is the same as one of the redirections to c |
| MIT | whether the surface of m is a sub-string of the title of c |
| TIM | whether the title of c is a sub-string of the surface form of m |
| MIR | whether the surface of m is a sub-string of a redirected title to c |
| RIM | whether a redirected title to c is a sub-string of the surface form of m |
| PMT | similarity score based on edit distance between the surface of m and the title of c |
| PMR | maximum similarity score between the surface of m and the redirected titles to c |
| OC | whether c previously occurred in the full dialogue history |
| OCw | whether c occurred within the w previous turns, with w ∈ {1, 3, 5, 10} |
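The graded relevance score and the string-matching features listed above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: concepts are represented as hypothetical `(article, section)` pairs with `None` standing for the article itself, strings are lowercased before comparison, the edit-distance similarity is normalized by the longer string, and the example title and redirect are made up for the demonstration.

```python
from typing import Dict, List, Optional, Tuple

Concept = Tuple[str, Optional[str]]  # (article title, section title or None); assumed representation


def relevance_grade(candidate: Concept, gold: Concept) -> int:
    """Graded score s(m, c) of a candidate against the annotated concept g(m)."""
    if candidate == gold:
        return 4  # exactly the annotated concept
    if candidate[0] == gold[0] and candidate[1] is None:
        return 3  # parent article of the annotated section
    if candidate[0] == gold[0]:
        return 2  # same article, different section
    return 1


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit-distance similarity in [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def surface_features(mention: str, title: str, redirects: List[str]) -> Dict[str, object]:
    """String-matching features between a mention and a candidate concept."""
    m, t = mention.lower(), title.lower()
    r = [x.lower() for x in redirects]
    return {
        "EMT": m == t,
        "EMR": m in r,
        "MIT": m in t,
        "TIM": t in m,
        "MIR": any(m in x for x in r),
        "RIM": any(x in m for x in r),
        "PMT": similarity(m, t),
        "PMR": max((similarity(m, x) for x in r), default=0.0),
    }


gold = ("Chilli crab", "Origin")  # hypothetical annotation
grades = [relevance_grade(c, gold) for c in
          [gold, ("Chilli crab", None), ("Chilli crab", "Recipe"), ("Black pepper crab", None)]]
features = surface_features("chilli crab", "Chilli crab", ["Chili crab"])
print(grades, features)
```

The word n-gram features (WM, WT), the speaker feature (SP), and the occurrence features (OC, OCw) depend on the dialogue context and are left out of this sketch.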

Wikification-based Features for Dialogue Topic Tracking
Following our previous work (Kim et al., 2014a; Kim et al., 2014b), the classifier $f$ for dialogue topic tracking is trained on the labeled dataset using supervised machine learning techniques. The simplest baseline is to learn the classifier based on the vector space model (Salton et al., 1975), considering a bag of words for the terms within the given utterances. An instance for each turn is represented by a weighted term vector defined as follows:

$$\phi(x) = \left(\alpha_1, \alpha_2, \cdots, \alpha_{|W|}\right) \in \mathbb{R}^{|W|},$$

where $\alpha_i = \sum_{j=0}^{h} \lambda^{j} \cdot \mathrm{tfidf}(w_i, u_{t-j})$, $u_t$ is the utterance mentioned in a turn $t$, $\mathrm{tfidf}(w_i, u_t)$ is the product of the term frequency of a word $w_i$ in $u_t$ and the inverse document frequency of $w_i$, $\lambda$ is a decay factor for giving more importance to more recent turns, $|W|$ is the size of the word dictionary, and $h$ is the number of previous turns considered as dialogue history features.
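A minimal sketch of the decayed term vector may help make the baseline concrete. It assumes that the weight of each word is the sum of its tf-idf values over the current turn and the $h$ previous turns, with the value from $j$ turns back scaled by $\lambda^j$; the idf variant $\log(N/\mathrm{df})$, the treatment of each utterance as a document, and the toy utterances are all assumptions made for this illustration.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(utterances: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency, treating each utterance as one document (assumed variant)."""
    n = len(utterances)
    df = Counter(w for u in utterances for w in set(u))
    return {w: math.log(n / df[w]) for w in df}


def term_vector(utterances: List[List[str]], t: int, idf: Dict[str, float],
                vocab: List[str], decay: float = 0.5, history: int = 2) -> List[float]:
    """Weighted term vector for turn t over the current turn and `history` previous turns."""
    index = {w: i for i, w in enumerate(vocab)}
    alpha = [0.0] * len(vocab)
    for j in range(history + 1):
        if t - j < 0:
            break  # no earlier turns available
        for w, count in Counter(utterances[t - j]).items():
            if w in index:
                # the contribution of a turn j steps back is scaled by decay**j
                alpha[index[w]] += (decay ** j) * count * idf.get(w, 0.0)
    return alpha


# Toy tokenized utterances (hypothetical)
utterances = [["sentosa", "beach"], ["mrt", "station"], ["chilli", "crab"]]
idf = idf_table(utterances)
vocab = sorted(idf)
weights = dict(zip(vocab, term_vector(utterances, t=2, idf=idf, vocab=vocab, decay=0.5, history=1)))
print(weights)
```

With a history of one turn, words from the current turn keep their full tf-idf weight, words from the previous turn are halved, and older words drop out of the vector.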
To overcome the limitations caused by the lack of semantic or domain-specific aspects in the first baseline, we previously proposed (Kim et al., 2014b) to leverage Wikipedia as an external knowledge source with an extended feature space defined by concatenating the concept space with the previous term vector space as follows:

$$\phi'(x) = \left(\alpha_1, \alpha_2, \cdots, \alpha_{|W|}, \beta_1, \beta_2, \cdots, \beta_{|C|}\right) \in \mathbb{R}^{|W|+|C|},$$

where $\beta_i$ is the semantic relatedness between the input $x$ and the concept in the $i$-th Wikipedia article, and $|C|$ is the number of concepts in the Wikipedia collection. The value of $\beta_i$ is computed with the cosine similarity between term vectors as follows:

$$\beta_i = \cos\left(\phi(x), \phi(c_i)\right) = \frac{\phi(x) \cdot \phi(c_i)}{\lVert \phi(x) \rVert \, \lVert \phi(c_i) \rVert},$$

where $\phi(c_i)$ is the term vector composed from the $i$-th Wikipedia concept in the collection.

In this work, the results of Wikification described in Section 3 are utilized to extend the feature space for training the topic tracker, instead of or in addition to the above-mentioned feature values obtained from dialogue segment-level analyses. A value $\gamma_i$ in the new feature space is defined as the weighted sum of the number of mentions linked to a given concept $c_i$ within a dialogue segment as follows:

$$\gamma_i = \sum_{j=0}^{h} \lambda^{j} \sum_{m_k \in u_{t-j}} \mathbb{1}\left[g(m_k) = c_i\right],$$

where $m_k$ is the $k$-th mention in a given utterance $u$, $g(m)$ is the top-ranked result of Wikification for the mention $m$, $\lambda$ is a decay factor, and $h$ is the window size for considering dialogue history.
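The Wikification-based feature values can be sketched in the same way. The code below assumes that every mention linked to a concept contributes $\lambda^j$ to that concept's value when it occurs $j$ turns before the current one, and that mentions resolved to NIL are skipped; the sparse dictionary representation, the function name, and the concept titles are hypothetical choices for this illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


def wikification_features(linked_mentions: List[List[Optional[str]]], t: int,
                          decay: float = 0.5, history: int = 2) -> Dict[str, float]:
    """Decayed count of mentions linked to each concept over the current and previous turns.

    linked_mentions[t] holds the top-ranked concept for each mention in turn t,
    with None standing for a NIL (unlinked) mention.
    """
    gamma: Dict[str, float] = defaultdict(float)
    for j in range(history + 1):
        if t - j < 0:
            break  # no earlier turns available
        for concept in linked_mentions[t - j]:
            if concept is not None:
                gamma[concept] += decay ** j
    return dict(gamma)


# Hypothetical Wikification output for three consecutive turns
linked = [
    ["Sentosa"],
    ["Mass Rapid Transit (Singapore)", "Sentosa"],
    ["Chilli crab", None],
]
gamma = wikification_features(linked, t=2, decay=0.5, history=2)
print(gamma)
```

A sparse mapping is used here because only a handful of the concepts in the collection are mentioned in any given window; a dense vector over all concepts would carry the same information.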

Evaluation
To demonstrate the effectiveness of our proposed approach for dialogue topic tracking using Wikification results, we performed experiments on the Singapore tour guide dialogues, which consist of 35 sessions collected from human-human conversations between tour guides and tourists. All the recorded dialogues, with a total length of 21 hours, were manually transcribed, and the resulting 31,034 utterances were manually annotated with the following nine topic categories: Opening, Closing, Itinerary, Accommodation, Attraction, Food, Transportation, Shopping, and Other. For topic tracking, an instance for both training and prediction of topic transitions was created for every utterance in the dialogues. For each instance $x$, the term vector $\phi(x)$ was generated with the $\alpha$ values from utterances within the window sizes $h = 2$ for the current and previous turns and $h = 10$ for the history turns. The $\beta$ values representing the segment-level relevances were computed based on 3,155 Singapore-related articles which were used in our previous work (Kim et al., 2014b).
For Wikification, all the utterances were first preprocessed with the Stanford CoreNLP toolkit². Each noun phrase in the constituent trees provided by the parser was considered as an instance for Wikification and manually annotated with the corresponding concept in Wikipedia. For every mention, we retrieved the top 100 candidates from the Lucene index based on the Wikipedia database dump as of January 2015, which has 4,797,927 articles and 25,577,464 sections in total, and added one more special candidate for NIL detection. Then, a ranking function was trained on this dataset using SVMrank³, which achieved 38.04, 31.97, and 34.74 in precision, recall, and F-measure, respectively, in the mention-level evaluation of Wikification based on five-fold cross validation. The $\gamma$ values in our proposed approach were assigned based on the top-ranked results from this ranking function for the mentions in the dialogues.
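As a quick consistency check on the reported Wikification scores, the F-measure can be recomputed from the precision and recall, assuming it is the balanced harmonic mean of the two (the function name is chosen for this illustration):

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for mention-level Wikification
wikification_f = f_measure(38.04, 31.97)
print(round(wikification_f, 2))
```

Rounded to two decimals, the result agrees with the reported F-measure of 34.74.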
In this evaluation, the following three different schedules were applied for both training the models and predicting the topic transitions: (a) taking every utterance into account regardless of the speaker; (b) considering only the turns taken by the tourists; and (c) considering only the turns taken by the guides. While the first schedule aims at learning human behaviours in topic tracking from a third-person point of view, the others show the tracking capabilities of the models as a sub-component of a dialogue system that acts as a guide and as a tourist, respectively.

² http://nlp.stanford.edu/software/corenlp.shtml
³ http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html
The SVM models were trained using SVMlight⁴ (Joachims, 1999) with different combinations of the features. All the evaluations were done in five-fold cross validation against the manual annotations with two different metrics: one is the accuracy of the predicted topic label for every turn, and the other is the precision/recall/F-measure for each topic transition event that occurred either in the answer or in the predicted result. Table 2 compares the performances of the feature combinations for each schedule. While the dialogue segment-level $\beta$ features failed to show significant improvement over the baseline with term vectors only, the models with our proposed Wikification-based features $\gamma$ achieved better performances in both transition-level and turn-level evaluations for all the schedules. The further improvement obtained with the oracle features, in which $\gamma$ is computed from the manual annotations for Wikification, indicates that the overall performances could be improved by refining the Wikification model.
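The two metrics can be sketched as follows. The matching criterion for transition events is not specified in the text, so this sketch assumes that a predicted transition counts as correct only if it occurs at the same turn with the same pair of topics as a transition in the answer; the `NONE` start label, the function names, and the toy label sequences are likewise assumptions.

```python
from typing import List, Set, Tuple


def turn_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of turns whose predicted topic label equals the annotated one."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def transition_events(labels: List[str], initial: str = "NONE") -> Set[Tuple[int, str, str]]:
    """Set of (turn, previous topic, new topic) events where the topic changes."""
    events = set()
    previous = initial
    for t, current in enumerate(labels):
        if current != previous:
            events.add((t, previous, current))
        previous = current
    return events


def transition_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure over transition events (exact-match assumption)."""
    gold_events, pred_events = transition_events(gold), transition_events(pred)
    correct = len(gold_events & pred_events)
    precision = correct / len(pred_events) if pred_events else 0.0
    recall = correct / len(gold_events) if gold_events else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


# Toy example: the predicted ATTR -> TRSP transition is one turn late
gold = ["ATTR", "ATTR", "TRSP", "TRSP", "FOOD"]
pred = ["ATTR", "ATTR", "ATTR", "TRSP", "FOOD"]
accuracy = turn_accuracy(gold, pred)
precision, recall, f = transition_prf(gold, pred)
print(accuracy, precision, recall, f)
```

In this toy case a single mislabeled turn costs one point of turn-level accuracy but removes one of three transitions from both the precision and the recall counts, which shows why the two metrics are reported separately.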

Conclusions
This paper presented a dialogue topic tracking approach using Wikification-based features. This approach aimed to incorporate more detailed information regarding the correspondences between a given dialogue and Wikipedia concepts. Experimental results show that our proposed approach helped to improve the topic tracking performances compared to the baselines. For future work, we plan to apply the kernel methods proposed in our previous work to the feature spaces based on Wikification, as well as to improve the Wikification model itself to achieve better overall performances in dialogue topic tracking.