DocChat: An Information Retrieval Approach for Chatbot Engines Using Unstructured Documents

Most current chatbot engines are designed to reply to user utterances based on existing utterance-response (or Q-R)¹ pairs. In this paper, we present DocChat, a novel information retrieval approach for chatbot engines that can leverage unstructured documents, instead of Q-R pairs, to respond to utterances. A learning to rank model with features designed at different levels of granularity is proposed to measure the relevance between utterances and responses directly. We evaluate our proposed approach in both English and Chinese: (i) For English, we evaluate DocChat on WikiQA and QASent, two answer sentence selection tasks, and compare it with state-of-the-art methods. Reasonable improvements and good adaptability are observed. (ii) For Chinese, we compare DocChat with XiaoIce², a famous chitchat engine in China, and side-by-side evaluation shows that DocChat is a perfect complement for chatbot engines using Q-R pairs as the main source of responses.


Introduction
Building chatbot engines that can interact with humans in natural language is one of the most challenging problems in artificial intelligence. Along with the explosive growth of social media, such as community question answering (CQA) websites (e.g., Yahoo Answers and WikiAnswers) and social media websites (e.g., Twitter and Weibo), the amount of utterance-response (or Q-R) pairs has experienced massive growth in recent years, and such corpora greatly promote the emergence of various data-driven chatbot approaches.
Instead of multiple rounds of conversation, we only consider a much simplified task, short text conversation (STC), in which the response R is a short text and depends only on the last user utterance Q. Previous methods for the STC task mostly rely on Q-R pairs and fall into two categories:
• Retrieval-based methods (e.g., Ji et al., 2014). This type of method first retrieves the most likely pair $\langle \hat{Q}, \hat{R} \rangle$ from a set of existing Q-R pairs, namely the one that best matches the current utterance Q according to semantic matching models, and then takes $\hat{R}$ as the response R. One disadvantage of such a method is that, for many specific domains, collecting such Q-R pairs is intractable.
• Generation-based methods (e.g., Shang et al., 2015). This type of method usually uses an encoder-decoder framework, which first encodes Q as a vector representation and then feeds this representation to a decoder to generate the response R. Like retrieval-based methods, such approaches also depend on existing Q-R pairs as training data. As in other language generation tasks, such as machine translation and paraphrasing, the fluency and naturalness of machine-generated text is another drawback.
To overcome the issues mentioned above, we present a novel response retrieval approach, DocChat, to find responses based on unstructured documents. For each user utterance, instead of looking for the best Q-R pair or generating a word sequence based on language generation techniques, our method selects a sentence from the given documents directly, by ranking all possible sentences based on features designed at different levels of granularity. On one hand, using documents rather than Q-R pairs greatly improves the adaptability of chatbot engines to different chatting topics. On the other hand, all responses come from existing documents, which guarantees their fluency and naturalness. We also show promising results in experiments, on both QA and chatbot scenarios.
* Contribution during internship at Microsoft Research.
¹ For convenience, we denote all utterance-response pairs (either QA pairs or conversational exchanges from social media websites like Twitter) as Q-R pairs in this paper.
² http://www.msxiaoice.com

Task Description
Formally, given an utterance Q and a document set D, the document-based chatbot engine retrieves a response R in the following three steps:
• response retrieval, which retrieves response candidates C from D based on Q:
$$C = \mathrm{Retrieve}(Q, D)$$
Each S ∈ C is a sentence existing in D.
• response ranking, which ranks all response candidates in C and selects the most likely response candidate as $\hat{S}$:
$$\hat{S} = \arg\max_{S \in C} \mathrm{Rank}(S, Q)$$
• response triggering, which decides whether the engine is confident enough to respond to Q using $\hat{S}$:
$$I = \mathrm{Trigger}(Q, \hat{S})$$
where I is a binary value. When I is true, let the response R = $\hat{S}$ and output R; otherwise, output nothing.
In the following three sections, we will describe solutions of these three components one by one.
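The three steps can be read as a small pipeline. The following is a minimal sketch, not the authors' implementation: `respond` and the `toy_*` helpers are hypothetical names, and the toy word-overlap components merely stand in for the retrieval, ranking, and triggering models described in the following sections.

```python
from typing import Callable, List, Optional


def respond(
    utterance: str,
    documents: List[List[str]],
    retrieve: Callable[[str, List[List[str]]], List[str]],
    rank: Callable[[str, str], float],
    trigger: Callable[[str, str], bool],
) -> Optional[str]:
    """Return a response sentence for `utterance`, or None if nothing is triggered."""
    candidates = retrieve(utterance, documents)  # response retrieval: candidate set C
    if not candidates:
        return None
    best = max(candidates, key=lambda s: rank(s, utterance))  # response ranking
    return best if trigger(utterance, best) else None  # response triggering


# Toy stand-ins (hypothetical, for illustration only).
def toy_retrieve(q: str, docs: List[List[str]]) -> List[str]:
    q_words = set(q.lower().split())
    return [s for doc in docs for s in doc if q_words & set(s.lower().split())]


def toy_rank(s: str, q: str) -> float:
    return float(len(set(s.lower().split()) & set(q.lower().split())))


def toy_trigger(q: str, s: str) -> bool:
    return toy_rank(s, q) >= 2


docs = [["Beijing is the capital of China.", "It has a long history."]]
answer = respond("what is the capital of China", docs, toy_retrieve, toy_rank, toy_trigger)
```

Passing the three components as functions keeps the control flow independent of how each component is modeled.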

Response Retrieval
Given a user utterance Q, the goal of response retrieval is to efficiently find a small number of sentences from D that are likely to contain a suitable response to Q. Although a good response does not necessarily share more words with a given utterance, word overlap is still helpful in finding possible response candidates (Ji et al., 2014).
In this paper, the BM25 term weighting formula (Jones et al., 2000) is used to retrieve response candidates from documents. Given each document $D_k \in D$, we collect a set of sentence triples $\langle S_{prev}, S, S_{next} \rangle$ from $D_k$, where S denotes a sentence in $D_k$, and $S_{prev}$ and $S_{next}$ denote the previous and next sentences of S respectively. Two special tags, $\langle BOD \rangle$ and $\langle EOD \rangle$, are added at the beginning and end of each passage, to make sure that such sentence triples can be extracted for every sentence in the document. The reason for indexing each sentence together with its context sentences is intuitive: if a sentence within a document can respond to an utterance, then its context should be relevant to the utterance as well.
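As an illustration of indexing sentences together with their context, the sketch below builds sentence triples and scores them with one common BM25 variant. The tokenizer, the IDF form, the parameter values `k1` and `b`, and the `__BOD__`/`__EOD__` tag strings are all assumptions made for this example; the paper does not specify them.

```python
import math
from collections import Counter
from typing import List, Tuple

BOD, EOD = "__BOD__", "__EOD__"  # stand-ins for the begin/end-of-document tags


def sentence_triples(doc: List[str]) -> List[Tuple[str, str, str]]:
    """Pair every sentence with its previous and next sentence."""
    padded = [BOD] + doc + [EOD]
    return [(padded[i - 1], padded[i], padded[i + 1]) for i in range(1, len(padded) - 1)]


def tokenize(text: str) -> List[str]:
    tokens = []
    for w in text.split():
        if w in (BOD, EOD):
            continue
        w = w.strip(".,?!").lower()
        if w:
            tokens.append(w)
    return tokens


def bm25_scores(query: str, units: List[List[str]], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score each tokenized unit against the query with BM25."""
    if not units:
        return []
    n = len(units)
    avg_len = (sum(len(u) for u in units) / n) or 1.0
    df = Counter(term for u in units for term in set(u))
    q_terms = set(tokenize(query))
    scores = []
    for u in units:
        tf = Counter(u)
        score = 0.0
        for term in q_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(u) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query: str, docs: List[List[str]], top_k: int = 3) -> List[str]:
    """Return the centre sentences of the top-scoring triples."""
    triples = [t for doc in docs for t in sentence_triples(doc)]
    units = [tokenize(" ".join(t)) for t in triples]
    scores = bm25_scores(query, units)
    ranked = sorted(zip(scores, triples), key=lambda p: p[0], reverse=True)
    return [t[1] for score, t in ranked[:top_k] if score > 0]


doc = ["Beijing is the capital of China.", "It has a long history.", "Many tourists visit every year."]
candidates = retrieve("capital of China", [doc])
```

Because each sentence is scored with its neighbours, "It has a long history." becomes a candidate even though it shares no word with the query.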

Response Ranking
Given a user utterance Q and a response candidate S, the ranking function Rank(S, Q) is designed as an ensemble of individual matching features:
$$\mathrm{Rank}(S, Q) = \sum_{k} \lambda_k \cdot h_k(S, Q)$$
where $h_k(\cdot)$ denotes the k-th feature function and $\lambda_k$ denotes the corresponding weight of $h_k(\cdot)$.
We design features at different levels of granularity to measure the relevance between S and Q, including word-level, phrase-level, sentence-level, document-level, relation-level, type-level and topic-level, which will be introduced below.

Word-level Feature
We define three word-level features in this work:
(1) $h_{WM}(S, Q)$ denotes a word matching feature that counts the number (weighted by the IDF value of each word in S) of non-stopwords shared by S and Q.
(2) $h_{W2W}(S, Q)$ denotes a word-to-word translation-based feature that calculates the IBM Model 1 score (Brown et al., 1993) of S and Q based on word alignments trained on 'question-related question' pairs using GIZA++ (Och and Ney, 2003).
(3) $h_{W2V}(S, Q)$ denotes a word embedding-based feature that calculates the average cosine similarity between the word embeddings of all non-stopword pairs $\langle v_{S_j}, v_{Q_i} \rangle$, where $v_{S_j}$ represents the word vector of the j-th word in S and $v_{Q_i}$ represents the word vector of the i-th word in Q.
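The word matching and word embedding features can be sketched as follows. This is an illustrative reading of the definitions, not the authors' code: the stopword list, the IDF table, and the two-dimensional embeddings are toy assumptions, and the translation feature is omitted because it needs a trained alignment model.

```python
import math
from typing import Dict, List, Sequence

STOPWORDS = {"the", "is", "a", "of", "what"}  # hypothetical tiny stopword list


def content_words(tokens: Sequence[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS]


def h_wm(s_tokens: Sequence[str], q_tokens: Sequence[str], idf: Dict[str, float]) -> float:
    """IDF-weighted count of non-stopwords shared by S and Q."""
    shared = set(content_words(s_tokens)) & set(content_words(q_tokens))
    return sum(idf.get(w, 0.0) for w in shared)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def h_w2v(s_tokens: Sequence[str], q_tokens: Sequence[str], emb: Dict[str, List[float]]) -> float:
    """Average cosine over all non-stopword pairs that have embeddings."""
    pairs = [
        (emb[s], emb[q])
        for s in content_words(s_tokens)
        for q in content_words(q_tokens)
        if s in emb and q in emb
    ]
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


s = ["beijing", "is", "the", "capital", "of", "china"]
q = ["what", "is", "the", "capital", "of", "china"]
idf = {"capital": 2.0, "china": 1.5, "beijing": 3.0}
emb = {"capital": [1.0, 0.0], "china": [0.0, 1.0], "beijing": [1.0, 1.0]}
wm_score = h_wm(s, q, idf)
w2v_score = h_w2v(s, q, emb)
```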

Paraphrase
We first describe how to extract phrase-level paraphrases from an existing SMT (statistical machine translation) phrase table.
$PT = \{\langle s_i, t_i, p(t_i|s_i), p(s_i|t_i) \rangle\}$ is a phrase table extracted from a bilingual corpus, where $s_i$ (or $t_i$) denotes a phrase in the source (or target) language, and $p(t_i|s_i)$ (or $p(s_i|t_i)$) denotes the translation probability from $s_i$ (or $t_i$) to $t_i$ (or $s_i$). We follow Bannard and Callison-Burch (2005) to extract a paraphrase table $PP = \{\langle s_i, s_j, score(s_j; s_i) \rangle\}$, where $s_i$ and $s_j$ denote two phrases in the source language, and $score(s_j; s_i)$ denotes a confidence score that $s_i$ can be paraphrased to $s_j$, which is computed based on PT:
$$score(s_j; s_i) = \sum_{t} p(t|s_i) \cdot p(s_j|t)$$
The underlying idea of this approach is that two source phrases that are aligned to the same target phrase tend to be paraphrases of each other.
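The pivot idea can be made concrete with a short sketch that sums, over every shared target phrase t, the product p(t|s_i) · p(s_j|t), following the pivot formulation of Bannard and Callison-Burch (2005). The function name and the toy English-German entries are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (source phrase, target phrase, p(target | source), p(source | target))
PhraseEntry = Tuple[str, str, float, float]


def paraphrase_table(phrase_table: List[PhraseEntry]) -> Dict[Tuple[str, str], float]:
    """Map (s_i, s_j) to score(s_j; s_i), pivoting through shared target phrases."""
    by_target = defaultdict(list)
    for s, t, p_t_given_s, p_s_given_t in phrase_table:
        by_target[t].append((s, p_t_given_s, p_s_given_t))

    scores: Dict[Tuple[str, str], float] = defaultdict(float)
    for entries in by_target.values():
        for s_i, p_t_given_si, _ in entries:
            for s_j, _, p_sj_given_t in entries:
                if s_i != s_j:
                    scores[(s_i, s_j)] += p_t_given_si * p_sj_given_t
    return dict(scores)


toy_pt = [
    ("how come", "warum", 0.6, 0.3),
    ("why", "warum", 0.9, 0.7),
    ("how come", "wieso", 0.2, 0.1),
    ("why", "wieso", 0.1, 0.8),
]
pp = paraphrase_table(toy_pt)
```

Here `pp[("how come", "why")]` accumulates 0.6 · 0.7 through "warum" and 0.2 · 0.8 through "wieso".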
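We then define a paraphrase-based feature as:
$$h_{PP}(S, Q) = \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_{j=0}^{|S|-n} Count_{PP}(S_j^{j+n-1}, Q)}{|S| - n + 1}$$
where $S_j^{j+n-1}$ denotes the consecutive word sequence (or phrase) in S that starts at $S_j$ and ends at $S_{j+n-1}$, and N denotes the maximum n-gram order (3 here). $Count_{PP}(S_j^{j+n-1}, Q)$ is computed based on the following rules:
• If $S_j^{j+n-1} \in Q$, then $Count_{PP}(S_j^{j+n-1}, Q) = 1$;
• Else, if $\langle S_j^{j+n-1}, s, score(s; S_j^{j+n-1}) \rangle \in PP$ and the paraphrase s of $S_j^{j+n-1}$ occurs in Q, then $Count_{PP}(S_j^{j+n-1}, Q) = score(s; S_j^{j+n-1})$;
• Else, $Count_{PP}(S_j^{j+n-1}, Q) = 0$.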

Phrase-to-Phrase Translation
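Similar to $h_{PP}(S, Q)$, a phrase translation-based feature based on a phrase table PT is defined as:
$$h_{PT}(S, Q) = \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_{j=0}^{|S|-n} Count_{PT}(S_j^{j+n-1}, Q)}{|S| - n + 1}$$
where $Count_{PT}(S_j^{j+n-1}, Q)$ is computed based on the following rules:
• If $S_j^{j+n-1} \in Q$, then $Count_{PT}(S_j^{j+n-1}, Q) = 1$;
• Else, if $\langle S_j^{j+n-1}, s, p(s|S_j^{j+n-1}), p(S_j^{j+n-1}|s) \rangle \in PT$ and the translation s of $S_j^{j+n-1}$ occurs in Q, then $Count_{PT}(S_j^{j+n-1}, Q) = p(s|S_j^{j+n-1}) \cdot p(S_j^{j+n-1}|s)$;
• Else, $Count_{PT}(S_j^{j+n-1}, Q) = 0$.
We train a phrase table based on 'question-answer' pairs crawled from community QA websites.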

Sentence-level Feature
We first present an attention-based sentence embedding method based on a convolutional neural network (CNN), whose input is a sentence pair and whose output is a sentence embedding pair. Two features will be introduced in Sections 4.3.1 and 4.3.2, which are designed based on two sentence embedding models trained using different types of data.
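In the input layer, given a sentence pair $\langle S_X, S_Y \rangle$, an attention matrix $A \in \mathbb{R}^{|S_X| \times |S_Y|}$ is generated based on the pre-trained word embeddings of $S_X$ and $S_Y$, where each element $A_{i,j} \in A$ is computed as:
$$A_{i,j} = \mathrm{cosine}(v_i^{S_X}, v_j^{S_Y})$$
where $v_i^{S_X}$ (or $v_j^{S_Y}$) denotes the embedding of the i-th (or j-th) word in $S_X$ (or $S_Y$). Then, column-wise and row-wise max-pooling are applied to A to generate two attention vectors $V_{S_X} \in \mathbb{R}^{|S_X|}$ and $V_{S_Y} \in \mathbb{R}^{|S_Y|}$, where the k-th element $V^k_{S_X}$ (or $V^k_{S_Y}$) can be interpreted as the attention score of the k-th word in $S_X$ (or $S_Y$) with regard to all words in $S_Y$ (or $S_X$).
Next, two attention distributions $D_{S_X} \in \mathbb{R}^{|S_X|}$ and $D_{S_Y} \in \mathbb{R}^{|S_Y|}$ are generated for $S_X$ and $S_Y$ based on $V_{S_X}$ and $V_{S_Y}$ respectively, where the k-th elements of $D_{S_X}$ and $D_{S_Y}$ are computed as:
$$D^k_{S_X} = \frac{e^{V^k_{S_X}}}{\sum_{l=1}^{|S_X|} e^{V^l_{S_X}}}, \qquad D^k_{S_Y} = \frac{e^{V^k_{S_Y}}}{\sum_{l=1}^{|S_Y|} e^{V^l_{S_Y}}}$$
$D^k_{S_X}$ (or $D^k_{S_Y}$) can be interpreted as the normalized attention score of the k-th word in $S_X$ (or $S_Y$) with regard to all words in $S_Y$ (or $S_X$).
Last, we update each pre-trained word embedding $v_k^{S_X}$ (or $v_k^{S_Y}$) to $\hat{v}_k^{S_X}$ (or $\hat{v}_k^{S_Y}$) by multiplying every value in $v_k^{S_X}$ (or $v_k^{S_Y}$) with $D^k_{S_X}$ (or $D^k_{S_Y}$). The underlying intuition of updating pre-trained word embeddings is to re-weight the importance of each word in $S_X$ (or $S_Y$) based on $S_Y$ (or $S_X$), instead of treating all words equally.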
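The input-layer re-weighting can be sketched in plain Python. This sketch assumes cosine similarity for the attention matrix entries and a softmax for the normalization step; both are assumptions of the example, and the function names are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def reweight(x_emb: List[Vector], y_emb: List[Vector]) -> Tuple[List[Vector], List[Vector], List[float], List[float]]:
    """Scale each word vector by its normalized attention score."""
    attention = [[cosine(vx, vy) for vy in y_emb] for vx in x_emb]
    v_x = [max(row) for row in attention]  # best match of each S_X word in S_Y
    v_y = [max(attention[i][j] for i in range(len(x_emb))) for j in range(len(y_emb))]
    d_x, d_y = softmax(v_x), softmax(v_y)
    new_x = [[d * val for val in vec] for d, vec in zip(d_x, x_emb)]
    new_y = [[d * val for val in vec] for d, vec in zip(d_y, y_emb)]
    return new_x, new_y, d_x, d_y


x_words = [[1.0, 0.0], [0.0, 1.0]]
y_words = [[1.0, 0.0]]
new_x, new_y, d_x, d_y = reweight(x_words, y_words)
```

In the toy input, the first word of `x_words` matches the only word of `y_words` exactly, so it receives the larger share of the attention mass.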
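In the convolution layer, we first derive an input sequence $Z_{S_X} = \{l_1, ..., l_{|S_X|}\}$, where $l_t$ is the concatenation of a sequence of updated word embeddings $[\hat{v}^{S_X}_{t-d}, ..., \hat{v}^{S_X}_{t}, ..., \hat{v}^{S_X}_{t+d}]$, centralized in the t-th word in $S_X$. Then, the convolution layer performs sliding window-based feature extraction to project each vector representation $l_t \in Z_{S_X}$ to a contextual feature vector $h_t^{S_X}$:
$$h_t^{S_X} = \tanh(W_c \cdot l_t)$$
where $W_c$ is the convolution matrix and $\tanh(x) = \frac{1 - e^{-2x}}{1 + e^{-2x}}$ is the activation function. The same operation is performed on $S_Y$ as well.
In the pooling layer, we aggregate local features extracted by the convolution layer from $S_X$, and form a sentence-level global feature vector with a fixed size independent of the length of the input sentence. Here, max-pooling is used to force the network to retain the most useful local features by
$$l_p^{S_X} = [v_1, ..., v_K]^T, \qquad v_i = \max_{t=1,...,|S_X|} h_t^{S_X}(i)$$
where $h_t^{S_X}(i)$ denotes the i-th value in the vector $h_t^{S_X}$.
In the output layer, one more non-linear transformation is applied to $l_p^{S_X}$:
$$y(S_X) = \tanh(W_s \cdot l_p^{S_X})$$
where $W_s$ is the semantic projection matrix and $y(S_X)$ is the final sentence embedding of $S_X$. The same operation is performed on $S_Y$ to obtain $y(S_Y)$.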
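The convolution, pooling, and output layers can be sketched as a forward pass in plain Python. The window half-width `d`, the zero padding at sentence boundaries, and the tiny weight matrices are assumptions of this example; real weights would be learned.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def sentence_embedding(words: List[Vector], w_c: Matrix, w_s: Matrix, d: int = 1) -> Vector:
    """Window convolution, tanh, max-pooling over positions, then a tanh projection."""
    dim = len(words[0])
    pad = [[0.0] * dim] * d  # zero vectors for positions outside the sentence (assumption)
    padded = pad + words + pad
    # Each window concatenates the 2d+1 word vectors centred on word t.
    windows = [sum(padded[t:t + 2 * d + 1], []) for t in range(len(words))]
    hidden = [[math.tanh(v) for v in matvec(w_c, l_t)] for l_t in windows]
    # Max-pooling per feature dimension gives a fixed-size vector.
    pooled = [max(h[i] for h in hidden) for i in range(len(w_c))]
    return [math.tanh(v) for v in matvec(w_s, pooled)]


w_c = [[0.1, 0.0, 0.5, -0.2, 0.0, 0.3], [0.0, 0.4, -0.1, 0.6, 0.2, 0.0]]  # K=2 filters
w_s = [[0.7, -0.3], [0.2, 0.9], [-0.5, 0.1]]  # projects to 3 dimensions
short = sentence_embedding([[1.0, 0.0], [0.0, 1.0]], w_c, w_s)
longer = sentence_embedding([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]], w_c, w_s)
```

Both `short` and `longer` have three dimensions, which illustrates that the embedding size does not depend on sentence length.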
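We train the model parameters $W_c$ and $W_s$ by minimizing the following ranking loss function:
$$L = \max\{0, M - \mathrm{cosine}(y(S_X), y(S_Y)) + \mathrm{cosine}(y(S_X), y(S_Y^-))\}$$
where M is a constant margin and $S_Y^-$ is a negative instance.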
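A pairwise ranking loss of this kind is commonly written as a margin-based hinge over cosine similarities. The sketch below assumes that form; the margin value 0.1 is a hypothetical choice for illustration, not a value reported in the paper.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def ranking_loss(y_x: Sequence[float], y_pos: Sequence[float], y_neg: Sequence[float], margin: float = 0.1) -> float:
    """Zero once the positive pair beats the negative pair by at least `margin`."""
    return max(0.0, margin - cosine(y_x, y_pos) + cosine(y_x, y_neg))


good = ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])  # positive is closer
bad = ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])   # negative is closer
```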

Causality Relationship Modeling
We train the first attention-based sentence embedding model on a set of 'question-answer' pairs as input sentence pairs, and then design a causality relationship-based feature as:
$$h_{SCR}(S, Q) = \mathrm{cosine}(y_{SCR}(S), y_{SCR}(Q))$$
where $y_{SCR}(S)$ and $y_{SCR}(Q)$ denote the sentence embeddings of S and Q respectively. We expect this feature to capture the causality relationship between questions and their corresponding answers, and to work on question-like utterances.

Discourse Relationship Modeling
We train the second attention-based sentence embedding model on a set of 'sentence-next sentence' pairs as input sentence pairs, and then design a discourse relationship-based feature as:
$$h_{SDR}(S, Q) = \mathrm{cosine}(y_{SDR}(S), y_{SDR}(Q))$$
where $y_{SDR}(S)$ and $y_{SDR}(Q)$ denote the sentence embeddings of S and Q respectively. We expect this feature to capture the discourse relationship between sentences and their next sentences, and to work on statement-like utterances. Here, a large number of 'sentence-next sentence' pairs can be easily obtained from documents.

Document-level Feature
We take document-level information into consideration to measure the semantic similarity between Q and S, and define two context features as:
$$h_{DM}(S_*, Q) = \mathrm{cosine}(y_{SCR}(S_*), y_{SCR}(Q))$$
where $S_*$ can be $S_{prev}$ or $S_{next}$, which denote the previous and next sentences of S in the original document. The sentence embedding model trained on 'question-answer' pairs (in Section 4.3.1) is directly used to generate context embeddings for $h_{DM}(S_{prev}, Q)$ and $h_{DM}(S_{next}, Q)$, so no further training data is needed for this feature.

Relation-level Feature
Given a structured knowledge base, such as Freebase, a single-relation question Q (in natural language) with its answer can first be parsed into a fact formatted as $\langle e_{sbj}, rel, e_{obj} \rangle$, where $e_{sbj}$ denotes a subject entity detected in the question, rel denotes the relationship expressed by the question, and $e_{obj}$ denotes an object entity found in the knowledge base based on $e_{sbj}$ and rel. Then we can get $\langle Q, rel \rangle$ pairs. This rel can help model semantic relationships between Q and R. For example, the Q-A pair 'What does Jimmy Neutron do?' − 'inventor' can be parsed into $\langle$Jimmy Neutron, fictional character occupation, inventor$\rangle$, where the rel is fictional character occupation.
Similar to Yih et al. (2014), we use $\langle Q, rel \rangle$ pairs as training data, and learn a rel-CNN model, which can encode each question Q (or each relation rel) into a relation embedding. For a given question Q, the corresponding relation $rel^+$ is treated as a positive example, and other randomly selected relations are used as negative examples $rel^-$. The posterior probability of $rel^+$ given Q is computed as:
$$P(rel^+|Q) = \frac{e^{\mathrm{cosine}(y(rel^+), y(Q))}}{\sum_{rel^-} e^{\mathrm{cosine}(y(rel^-), y(Q))}}$$
where y(rel) and y(Q) denote the relation embeddings of rel and Q based on rel-CNN. rel-CNN is trained by maximizing the log-posterior.
We then define a relation-based feature as:
$$h_{RE}(S, Q) = \mathrm{cosine}(y_{RE}(S), y_{RE}(Q))$$
where $y_{RE}(S)$ and $y_{RE}(Q)$ denote the relation embeddings of S and Q respectively, coming from rel-CNN.

Type-level Feature
We extend each $\langle Q, e_{sbj}, rel, e_{obj} \rangle$ in the SimpleQuestions data set to $\langle Q, e_{sbj}, rel, e_{obj}, type \rangle$, where type denotes the type name of $e_{obj}$ based on Freebase. Thus, we obtain $\langle Q, type \rangle$ pairs. Similar to rel-CNN, we use $\langle Q, type \rangle$ pairs to train another CNN model, denoted as type-CNN. Based on this model, we define a type-based feature as:
$$h_{TE}(S, Q) = \mathrm{cosine}(y_{TE}(S), y_{TE}(Q))$$
where $y_{TE}(S)$ and $y_{TE}(Q)$ denote the type embeddings of S and Q respectively, coming from type-CNN.

Unsupervised Topic Model
Under the assumption that a Q-R pair should share a similar topic distribution, we define an unsupervised topic model-based feature $h_{UTM}$ as the average cosine similarity between the topic vectors of all non-stopword pairs $\langle v_{S_j}, v_{Q_i} \rangle$, where $v_w = [p(t_1|w), ..., p(t_N|w)]^T$ denotes the topic vector of a given word w. Given a corpus, various topic modeling methods, such as pLSI (probabilistic latent semantic indexing) and LDA (latent Dirichlet allocation), can be used to estimate $p(t_i|w)$, which denotes the probability that w belongs to a topic $t_i$.

Supervised Topic Model
One shortcoming of the unsupervised topic model is that the number of topics is pre-defined, which might not reflect the true topic structure of a specific corpus. In this paper, we explore a supervised topic model approach as well, based on 'sentence-topic' pairs.
We crawl a large number of $\langle S, topic \rangle$ pairs from Wikipedia documents, where S denotes a sentence and topic denotes the content name of the section that S is extracted from. Such content names are labeled by Wikipedia article editors, and can be found in the Contents fields.
Similar to rel-CNN and type-CNN, we use the $\langle S, topic \rangle$ pairs to train another CNN model, denoted as topic-CNN. Based on this model, we define a supervised topic model-based feature as:
$$h_{STM}(S, Q) = \mathrm{cosine}(y_{STM}(S), y_{STM}(Q))$$
where $y_{STM}(S)$ and $y_{STM}(Q)$ denote the topic embeddings of S and Q respectively, coming from topic-CNN.

Learning to Rank Model
We employ a regression-based learning to rank method (Nallapati, 2004) to train the response ranking model. Feature weights in the ranking model are trained by SGD on training data that consists of a set of labeled $\langle Q, C \rangle$ pairs, where Q denotes a user utterance and C denotes a set of response candidates. Each candidate S in C is labeled + or −, which indicates whether S is a suitable response to Q (+) or not (−).
As manually labeled data, such as WikiQA (Yang et al., 2015), needs expensive human annotation effort, we propose an automatic way to collect training data. First, 'question-answer' (or Q-A) pairs $\{\langle Q_i, A_i \rangle\}_{i=1}^{M}$ are crawled from community QA websites. $Q_i$ denotes a question. $A_i$ denotes the answer of $Q_i$, which includes one or more sentences $A_i = \{s_1, ..., s_K\}$. Then, we index the answer sentences of all questions. Next, for each question $Q_i$, we run response retrieval to obtain answer sentence candidates $C_i = \{s_1, ..., s_N\}$. Last, since we know the correct answer sentences of each question $Q_i$, we can label each candidate in $C_i$ as + or −. In experiments, manually labeled data (WikiQA) is used in the open domain question answering scenario, and automatically generated data is used in the chatbot scenario.
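The automatic labeling step and the weight training can be sketched as follows. Two assumptions are made for illustration: a retrieved candidate is labeled + exactly when it is one of the question's own answer sentences, and the weights are fit by pointwise logistic regression with SGD. The paper only states that a regression-based method trained by SGD is used, so the loss, learning rate, and epoch count here are hypothetical.

```python
import math
import random
from typing import List, Sequence, Set, Tuple


def label_candidates(candidates: Sequence[str], answer_sentences: Set[str]) -> List[Tuple[str, int]]:
    """Label a candidate 1 (+) if it belongs to the question's answer, else 0 (-)."""
    return [(s, 1 if s in answer_sentences else 0) for s in candidates]


def train_weights(
    examples: Sequence[Tuple[Sequence[float], int]],
    epochs: int = 200,
    lr: float = 0.1,
    seed: int = 0,
) -> List[float]:
    """Fit one weight per feature with SGD on the logistic loss."""
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0][0])
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, label in data:
            z = sum(w * f for w, f in zip(weights, features))
            pred = 1.0 / (1.0 + math.exp(-z))
            for k, f in enumerate(features):
                weights[k] += lr * (label - pred) * f  # gradient step
    return weights


def rank_score(weights: Sequence[float], features: Sequence[float]) -> float:
    return sum(w * f for w, f in zip(weights, features))


# Toy feature vectors (hypothetical values of two matching features).
train = [([1.0, 0.2], 1), ([0.9, 0.1], 1), ([0.1, 0.3], 0), ([0.0, 0.2], 0)]
weights = train_weights(train)
labels = label_candidates(["s1", "s2"], {"s2"})
```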

Response Triggering
There are two types of utterances: chit-chat utterances and informative utterances. The former should be handled by chit-chat engines, and the latter are more suitable for our work, as documents usually contain formal and informative content. Thus, we have to respond to informative utterances only. In addition, response retrieval cannot guarantee that the candidate set contains at least one suitable response, but response ranking always outputs the best available candidate. So, we have to decide which responses are confident enough to be output, and which are not.
In this paper, we define response triggering as a function that decides whether a response candidate S has enough confidence to be output:
$$\mathrm{Trigger}(Q, S) = I_U(Q) \wedge I_{Rank}(S, Q) \wedge I_R(S)$$
where Trigger(Q, S) returns true if and only if all three of its sub-functions return true.
$I_U(Q)$ returns true if Q is an informative query. We collect and label chit-chat queries based on conversational exchanges from social media websites to train the classifier.
$I_{Rank}(S, Q)$ returns true if the score s(S, Q) exceeds an empirical threshold τ:
$$s(S, Q) = \frac{1}{1 + e^{-\alpha \cdot \mathrm{Rank}(S, Q)}}$$
where α is the scaling factor that controls whether the distribution of s(·) is smooth or sharp. Both α and τ are selected based on a separate development set.
$I_R(S)$ returns true if (i) the length of S is less than a pre-defined threshold, and (ii) S does not start with a phrase that expresses a progressive relation, such as but also, besides, moreover, etc. The content of sentences starting with such phrases usually depends on their context sentences, so they are not suitable as responses.
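The triggering decision can be sketched as a conjunction of three checks. In this sketch the informative-query classifier is replaced by a boolean argument, the length threshold `max_len` is a hypothetical value, and the confidence score is assumed to be a sigmoid of the scaled ranking score. The defaults α = 0.9 and τ = 0.5 are the values reported later for the WikiQA development set.

```python
import math

PROGRESSIVE_PREFIXES = ("but also", "besides", "moreover")


def confidence(rank_score: float, alpha: float) -> float:
    """Squash a ranking score into (0, 1); alpha controls how sharp the curve is."""
    return 1.0 / (1.0 + math.exp(-alpha * rank_score))


def i_rank(rank_score: float, alpha: float = 0.9, tau: float = 0.5) -> bool:
    return confidence(rank_score, alpha) > tau


def i_r(sentence: str, max_len: int = 30) -> bool:
    """Reject long sentences and sentences that lean on their preceding context."""
    text = sentence.lower().lstrip()
    return len(text.split()) < max_len and not text.startswith(PROGRESSIVE_PREFIXES)


def trigger(is_informative: bool, rank_score: float, sentence: str) -> bool:
    # is_informative stands in for the output of the chit-chat classifier.
    return is_informative and i_rank(rank_score) and i_r(sentence)


ok = trigger(True, 2.0, "Beijing is the capital of China.")
low_score = trigger(True, -1.0, "Beijing is the capital of China.")
dangling = trigger(True, 2.0, "Moreover, it is large.")
chitchat = trigger(False, 2.0, "Beijing is the capital of China.")
```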

Related Work
For modeling dialogue. Previous work mainly focused on rule-based or learning-based approaches (Litman et al., 2000; Schatzmann et al., 2006; Williams and Young, 2007). These methods require effort in designing rules or labeling training data, and they suffer from coverage issues.
For short text conversation. With the fast development of social media, such as microblog and CQA services, large-scale conversation data and data-driven approaches have become possible. Ritter et al. (2011) proposed an SMT-based method, which treats response generation as a machine translation task. Shang et al. (2015) presented an RNN-based method, which is trained on a large amount of single-round conversation data. Grammatical and fluency problems are the biggest issue for such generation-based approaches. Retrieval-based methods select the most suitable response to the current utterance from a large number of Q-R pairs. Ji et al. (2014) built a conversation system using learning to rank and semantic matching techniques. However, collecting enough Q-R pairs to build chatbots is often intractable for many domains. Compared to previous methods, DocChat learns internal relationships between utterances and responses based on statistical models at different levels of granularity, and relaxes the dependency on Q-R pairs as response sources. This makes DocChat a general response generation solution for chatbots, with high adaptation capability.
For answer sentence selection. Prior work on measuring the relevance between question and answer is mainly at the word level and the syntactic level (Wang and Manning, 2010; Heilman and Smith, 2010; Yih et al., 2013). Learning representations with neural network architectures (Yu et al., 2014; Wang and Nyberg, 2015; Severyn and Moschitti, 2015) has become a hot research topic that goes beyond word-level or phrase-level methods. Compared to previous work, we find that (i) large-scale existing resources, even with noise, are advantageous as training data, and (ii) knowledge-based semantic models can play important roles.

Evaluation on QA (English)
Considering that the response ranking task and the answer selection task are similar, we first evaluate DocChat in a QA scenario as a simulation. Here, response ranking is treated as the answer selection task, and response triggering is treated as the answer triggering task.

Experiment Setup
We select WikiQA as the evaluation data, as it is precisely constructed based on natural language questions and Wikipedia documents. It contains 2,118 'question-document' pairs in the training set, 296 'question-document' pairs in the development set, and 633 'question-document' pairs in the testing set. Each sentence in the document of a given question is labeled 1 or 0, where 1 denotes that the sentence is a correct answer sentence and 0 denotes that it is not. Given a question, the task of WikiQA is to select answer sentences from all sentences in the question's corresponding document. The training data settings of the response ranking features are described below.
$F_w$ denotes the 3 word-level features, $h_{WM}$, $h_{W2W}$ and $h_{W2V}$. For $h_{W2W}$, GIZA++ is used to train word alignments on 11.6M 'question-related question' pairs (Fader et al., 2013) crawled from WikiAnswers. For $h_{W2V}$, Word2Vec (Mikolov et al., 2013) is used to train word embeddings on sentences from English Wikipedia.
$F_p$ denotes the 2 phrase-level features, $h_{PP}$ and $h_{PT}$. For $h_{PP}$, bilingual data is used to extract a phrase-based translation table (Koehn et al., 2003), from which paraphrases are extracted (Section 4.2.1). For $h_{PT}$, GIZA++ trains word alignments on 4M 'question-answer' pairs crawled from Yahoo Answers, and then a phrase table is extracted from the word alignments using the intersect-diag-grow refinement.
$F_s$ denotes the 2 sentence-level features, $h_{SCR}$ and $h_{SDR}$. For $h_{SCR}$, 4M 'question-answer' pairs (the same as for $h_{PT}$) are used to train the CNN model. For $h_{SDR}$, we randomly select 0.5M 'sentence-next sentence' pairs from English Wikipedia.
$F_d$ denotes the document-level feature $h_{DM}$. Here, we do not train a new model. Instead, we reuse the CNN model used in $h_{SCR}$.
$F_r$ and $F_{ty}$ denote the relation-level feature $h_{RE}$ and the type-level feature $h_{TE}$. Bordes et al. (2015) released the SimpleQuestions data set, which consists of 108,442 English questions. Each question (e.g., What does Jimmy Neutron do?) is written by human annotators based on a triple in Freebase formatted as $\langle e_{sbj}, rel, e_{obj} \rangle$ (e.g., $\langle$Jimmy Neutron, fictional character occupation, inventor$\rangle$). Here, as described in Sections 4.5 and 4.6, 'question-relation' pairs and 'question-type' pairs based on the SimpleQuestions data set are used to train $h_{RE}$ and $h_{TE}$.
$F_{to}$ denotes the 2 topic-level features, $h_{UTM}$ and $h_{STM}$. For $h_{UTM}$, we run LightLDA (Yuan et al., 2015) on sentences from English Wikipedia, where the number of topics is set to 1,000. For $h_{STM}$, 4M 'sentence-topic' pairs are extracted from English Wikipedia (Section 4.7.2), where the most frequent 25,000 content names are used as topics.

Results on Answer Selection (AS)
The performance of answer selection is evaluated by Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR). Among all 'question-document' pairs in WikiQA, only one-third of the documents contain answer sentences to their corresponding questions. Similar to previous work, questions without correct answers in the candidate sentences are not taken into account. We first evaluate the impact of features at each level, and show results in Table 1. $F_w$, $F_p$, and $F_s$ perform best among all features, which makes sense, as they can capture lexical features. $F_r$ and $F_{ty}$ do not perform very well, which is also understandable, as their training data (i.e., SimpleQuestions) are based on Freebase instead of Wikipedia. Interestingly, we find that $F_{to}$ and $F_d$ achieve comparable results as well. We think the reason is that their training data come from Wikipedia, which fits the WikiQA task very well.
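For reference, the two metrics can be computed from ranked 0/1 relevance labels as in the following sketch. It uses the standard definitions of MAP and MRR and, following the setup above, skips questions that have no correct answer among their candidates; the function names are hypothetical.

```python
from typing import List, Tuple


def average_precision(labels: List[int]) -> float:
    """Mean of precision@k over the ranks k that hold a correct answer."""
    hits, total = 0, 0.0
    for rank, label in enumerate(labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def reciprocal_rank(labels: List[int]) -> float:
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def map_mrr(ranked_labels: List[List[int]]) -> Tuple[float, float]:
    """MAP and MRR over questions that have at least one correct candidate."""
    kept = [labels for labels in ranked_labels if any(labels)]
    if not kept:
        return 0.0, 0.0
    n = len(kept)
    mean_ap = sum(average_precision(labels) for labels in kept) / n
    mean_rr = sum(reciprocal_rank(labels) for labels in kept) / n
    return mean_ap, mean_rr


# Each inner list holds the labels of one question's candidates in ranked order.
mean_ap, mean_rr = map_mrr([[1, 0, 0], [0, 1, 1], [0, 0, 0]])
```

In the toy input, the third question has no correct candidate and is ignored, so both metrics average over two questions.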
We evaluate the quality of DocChat on WikiQA, and show results in Table 2. The first four rows in Table 2 represent four baseline methods: (1) Yih et al. (2013), which makes use of rich lexical semantic features; (2) Yang et al. (2015), which uses a bi-gram CNN model with average pooling; (3) Miao et al. (2015), which uses an enriched LSTM with a latent stochastic attention mechanism to model similarity between Q-R pairs; and (4) Yin et al. (2015), which adds the attention mechanism to the CNN architecture. Table 2 shows that, without using WikiQA's training set (only the development set, for ranking weights), DocChat achieves performance comparable with state-of-the-art baselines. Furthermore, by combining DocChat with the CNN model proposed by Yang et al. (2015) and trained on the WikiQA training set, we achieve the best result on both metrics.
Compared to previous methods, we think DocChat has the following two advantages. First, our feature models depend on existing resources that are readily available (such as Q-Q pairs, Q-A pairs, and 'sentence-next sentence' pairs), instead of requiring manually annotated data (such as WikiQA and QASent). Training the response ranking model does need labeled data, but the amount required is acceptable. Second, as the training data used in our approach come from open domain resources, we can expect a high adaptation capability and comparable results on other WikiQA-like tasks, as our models are task-independent.
To verify the second advantage, we evaluate DocChat on another answer selection data set, QASent (Wang et al., 2007), and list results in Table 3. $CNN_{WikiQA}$ and $CNN_{QASent}$ refer to the results of the method of Yang et al. (2015), where the CNN models are trained on WikiQA's training set and QASent's training set respectively. All three methods train feature weights using QASent's development set. Table 3 shows that DocChat outperforms $CNN_{WikiQA}$ in terms of MAP and MRR, and achieves results comparable to $CNN_{QASent}$. This comparison shows a good adaptation capability of DocChat.
Table 4 evaluates the contributions of features at different levels of granularity. To highlight the differences, we report the percent deviation obtained by removing the features at one level from DocChat. From Table 4 we can see that: 1) each feature group is indispensable to DocChat; 2) features at the sentence level are more important than the other feature groups; 3) compared to the results in Table 1, combining all features significantly improves the performance.

Evaluation of Answer Triggering (AT)
In both QA and chatbot scenarios, response triggering is important. Similar to Yang et al. (2015), we evaluate answer triggering using Precision, Recall, and F1 score as metrics. We use the WikiQA development set to tune the scaling factor α and the trigger threshold τ that are described in Section 5, where α is set to 0.9 and τ is set to 0.5. Table 5 shows the evaluation results compared to Yang et al. (2015). We think the improvements come from the fact that our response ranking model is more discriminative, as more semantic-level features are leveraged.

Evaluation on Chatbot (Chinese)
XiaoIce is a famous Chinese chatbot engine, which can be found on many platforms, including WeChat official accounts (similar to business pages on Facebook Messenger). The documents that each official account maintains and posts to its followers can be easily obtained from the Web. Meanwhile, a WeChat official account can choose to authorize XiaoIce to respond to its followers' utterances. We design an interesting evaluation below to compare DocChat with XiaoIce, based on the publicly available documents.
[Table 6 example response (English gloss): Beijing is a historical city that can be traced back to 3,000 years ago.]
Table 6: XiaoIce response is more colloquial, as it comes from Q-R pairs; while DocChat response is more formal, as it comes from documents.

Experiment Setup
$h_{STM}$. As there is no knowledge base-based labeled data for Chinese, we ignore the relation-level feature $h_{RE}$ and the type-level feature $h_{TE}$. For ranking weights, we generate 90,321 $\langle Q, C \rangle$ pairs based on Baidu Zhidao Q-A pairs by the automatic method described in Section 4.8. This data set is used to train the feature weights $\{\lambda_k\}$ of the learning to rank model by SGD.
For documents, we randomly select 10 WeChat official accounts, and index their documents separately. The average number of documents per account is 600.
Human annotators are asked to freely issue 100 queries to each official account to get XiaoIce responses. Thus, we obtain 100 $\langle$query, XiaoIce response$\rangle$ pairs for each official account. We also send the same 100 queries of each official account to DocChat, based on the official account's corresponding document index, and obtain another 100 $\langle$query, DocChat response$\rangle$ pairs. Given these 1,000 $\langle$query, XiaoIce response, DocChat response$\rangle$ triples, we let human annotators do a side-by-side evaluation, by asking them which response is better for each query. Note that the source of each response is masked during the evaluation procedure. Table 6 gives an example.
Table 7 shows the results. Better (or Worse) denotes that a DocChat response is better (or worse) than a XiaoIce response, and Tie denotes that a DocChat response and a XiaoIce response are equally good or bad. From Table 7 we observe that: (1) 156 DocChat responses (58+47+51) out of 1,000 queries are triggered, so the trigger rate of DocChat is 15.6%. We check un-triggered queries, and find that most of them are chitchat, such as "hi", "hello", and "who are you?". (2) There are more better cases than worse cases. Most queries in better cases are non-chitchat ones, and their contents are highly related to the domain of their corresponding WeChat official accounts. (3) Our proposed method is a perfect complement for chitchat engines on informative utterances.
The reasons for bad cases are two-fold. First, a DocChat response overlaps with a query, but does not actually answer it. For this issue, we need to refine the capability of our response ranking model in measuring causality relationships. Second, we wrongly send a chitchat query to DocChat, as currently we only use a white list of chitchat queries for chitchat/non-chitchat classification (Section 5).

Conclusion
This paper presents a response retrieval method for chatbot engines based on unstructured documents. We evaluate our method on both question answering and chatbot scenarios, and obtain promising results. We leave a better triggering component and the handling of multiple rounds of conversation to future work.