Toward Data-Driven Tutorial Question Answering with Deep Learning Conversational Models

There has been an increase in the popularity of data-driven question answering systems given their recent success. This paper explores the possibility of building a tutorial question answering system for Java programming from data sampled from a community-based question answering forum. This paper reports on the creation of a dataset that could support building such a tutorial question answering system and discusses the methodology used to create the dataset of 106,386 questions. We investigate how retrieval-based and generative models perform on the given dataset. The work also investigates the usefulness of hybrid approaches that combine retrieval-based and generative models. The results indicate that building data-driven tutorial systems using community-based question answering forums holds significant promise.


Introduction
Question answering in dialogue is a central concern for designing the next generation of dialogue systems. Recent work has made great strides in generating dialogue, for example, with neural conversation models (Vinyals and Le, 2015), persona-based conversation models (Li et al., 2014) and adversarial models (Li et al., 2017). Specifically, for responding to questions, information-retrieval techniques have long been explored (Jeon et al., 2005; Ramos, 2003; Lowe et al., 2015). A critical open question is how to build data-driven systems for specific domains. A challenge faced by the community for such systems is the availability of data for those domains. Given that transfer learning has not yet been shown to yield good results (Mou et al., 2016), there has been investigation in the area of partially data-driven and hand-crafted systems (Williams et al., 2017). However, handcrafted systems face tremendous limitations in authoring. Data-driven dialogue systems, which derive their functionality from corpora, have the potential to eliminate this bottleneck.
This work explores the possibility of building a data-driven question-answering system for Java programming. We leverage a promising source of data by drawing from community-based question answering forums of Stack Exchange. Forums typically also have sub-forums, such as Stack Overflow for programming questions and Ask Ubuntu for Ubuntu operating system related questions. Such community-based forums serve as excellent datasets for specific domains, such as programming or IT support, that are otherwise not easily available to the general public. The promise of this data is further demonstrated by other work done using the Stack Exchange data: Campbell and Treude (2017) explore how to use semantic parsing to convert an English sentence or query into a code snippet, while Campos et al. (2016) investigate returning relevant question answer pairs for Swing, Boost and LINQ by using indexing techniques and building feature-based classifiers.
With technology becoming ubiquitous, programming skills are highly sought after. In a university or MOOC setting, 'Introduction to Programming' courses typically have large class sizes, and with a limited number of teaching assistants, providing individual help becomes a difficult task. The work in this paper focuses on helping students learn Java programming with a data-driven tutorial question answering system. This work builds the tutorial question-answering system as both a retrieval-based question answering system (Ji et al., 2014) via the Dual Encoder architecture (Medsker and Jain, 2001; Bromley et al., 1994) and as a generative question answering system (Ritter et al., 2011) via the Sequence-to-Sequence architecture (Sutskever et al., 2014; Cho et al., 2014). The retrieval-based model answers the user's question by predicting the most relevant answer from a set of predefined answers. In contrast to the retrieval-based model, the generative model answers the user's question by generating new answers based on the data on which the model was trained. Both of these approaches rely on building good semantic representations of the input in the vector space using word embeddings.
This work also explores the usefulness of a hybrid approach involving the combination of the retrieval-based and generative models. This paper thus represents the first work to explore deep learning techniques for data-driven tutorial dialogue for Java programming.

Related Work
Recently, there has been work using natural language processing and machine learning techniques within tools for programming support and computer science education. One line of work explored using Deep Belief Networks to grade short-answer texts and showed that this approach outperformed conventional machine learning models. They also explored using student modeling and clustering based on engineered features to predict the grades with reasonable success. Wang et al. (2017) used a recurrent neural network to attempt to represent a student's knowledge states for programming exercises and found that the model was able to successfully identify students with knowledge gaps and provide indications that assistance may be necessary.
Work is also being done to build models from data that can generate their own answers to questions. Bengio et al. (2003) introduced neural language models, and Mikolov et al. (2010) successfully constructed a neural language model using recurrent neural networks, further reinforcing the prevailing conclusion that recurrent neural networks are the architecture of choice for this task. Sordoni et al. (2015) and Shang et al. (2015) were also able to model short conversations using a recurrent neural network.
A critical turning point for generative models came when Sutskever et al. (2014) and Cho et al. (2014) introduced the sequence-to-sequence framework in the domain of machine translation. The authors proposed an architecture to convert one sequence to another sequence using recurrent neural networks as encoders and decoders. Inspired by the previous success of recurrent neural networks and the sequence-to-sequence framework, Vinyals and Le (2015) proposed applying this framework to conversational modeling, framing question answering as a machine translation problem. While Vinyals and Le (2015) showed that the model was able to give short, coherent answers for queries in a variety of settings, they also mentioned limitations of the system: it is restricted to short answers and lacks a personality.
In addition to generative systems, retrieval-based systems have also shown success in the recent past. Kannan et al. (2016) used semi-supervised learning with an LSTM RNN along with semantic intent clustering to generate high-quality responses for the Google Smart Reply system. Lu et al. (2017) explored how to generate responses from a large answer space by using a dual encoder LSTM network and employing clustering to generate templates from their large answer set, reducing the answer set space for a customer support question answering system. Jeon et al. (2005) investigated how to find question similarity using word translation probabilities. Lowe et al. (2015) constructed a corpus of one million multi-turn dialogues from the Ask Ubuntu forum, then performed experiments with retrieval-based models that demonstrated that a useful question answering system could be built using a dataset sampled from a community-based question answering forum. These techniques helped us gain insight on how to identify the most appropriate responses from a knowledge base.
The work in this paper attempts to employ deep learning techniques to support computer science education by developing a programming support tool for Java Programming that provides automated tutorial question answering. The work builds upon recent work in retrieval-based and generative models to construct answers that combine the English language with the Java programming language.

Dataset
Stack Exchange is a set of community-based question answering websites, with each website covering a specific topic. Stack Overflow deals with programming questions and relies on self-moderation through peer upvoting mechanisms. The user who posts the question can select the answer that they deem most appropriate. In some cases, the original poster does not select an answer, and in these cases the highest upvoted answer could be considered the best answer.
A typical Stack Overflow question can be seen in Figure 1. We see a title for the question at the top, "Java String Declaration", followed by a description, "What is the difference between ... performance variation." An important piece of information is the set of meta tags seen underneath the description. We see the meta tags "java" and "string", which describe at a high level what the post relates to. We see an upvote count to the left of the answer, a measure of how many other users agree with this answer. For the question in Figure 1, we see that there is an answer that has received the user's accepted answer status as well as 29 upvotes from the community.
Stack Exchange provides an anonymized data dump of all the user-contributed content, with the most recent version published on Dec 1, 2017. The data dump is in the format of a SQL database, with the various components of the website represented as SQL tables such as the Posts table.

Working with the Stack Exchange Database
The Posts table contained about 38 million posts, i.e. all the post data on Stack Overflow as of the data dump publication date. Every question and answer posted on the website is part of the Posts table, with different identifiers to signify the type of Post and relationships between the Posts. The question-answer relationship was defined as follows: the original question had a post ID, and answers corresponding to this question had the same post ID in their parentID column.
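The question-answer relationship described above can be sketched with a toy in-memory database. The column names (Id, ParentId, PostTypeId) follow the public Stack Exchange schema, where PostTypeId 1 denotes a question and 2 an answer; the rows here are purely illustrative.

```python
import sqlite3

# Toy in-memory stand-in for the Stack Overflow Posts table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Posts (
    Id INTEGER PRIMARY KEY, PostTypeId INTEGER,
    ParentId INTEGER, Title TEXT, Body TEXT)""")
conn.executemany(
    "INSERT INTO Posts VALUES (?, ?, ?, ?, ?)",
    [(1, 1, None, "Java String Declaration", "What is the difference ..."),
     (2, 2, 1, None, "String s = \"abc\"; uses the string pool ...")])

# Pair every question with its answers via the ParentId column.
pairs = conn.execute("""
    SELECT q.Title, q.Body, a.Body
    FROM Posts q JOIN Posts a ON a.ParentId = q.Id
    WHERE q.PostTypeId = 1 AND a.PostTypeId = 2""").fetchall()
print(len(pairs))   # one question-answer pair in this toy table
```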

Filtering Posts
This work focuses on Java programming questions, which required us to narrow our search to Java-related questions from the Posts table. We first filtered to ignore questions containing the '<code>' tag in the 'Body' column, as our present goal is to answer general questions within a future tutorial system. In order to obtain posts related to Java, we used the Posts table's 'Tags' column, which contained meta tags related to the post, as seen in Figure 1. In order to ignore technology-specific questions, such as questions about 'Spring' or 'Hibernate', we created a list of tags to ignore based on frequency counts and prefixes (such as 'google-api-xx' or 'facebook-api-xx'). Once these filters were in place, we filtered to ignore all unanswered questions based on the 'AnswerCount' column in the Posts table. Another filtering step was to take all the answers that contained code snippets delimited by the <code> token and replace the tokens with 'CODE_START' and 'CODE_END' as labels to mark the beginning and end of the code snippet.
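A minimal sketch of the filtering steps above; the tag lists and helper names are illustrative, not the exact filters used in the paper.

```python
import re

IGNORED_TAG_PREFIXES = ("google-api", "facebook-api")  # illustrative subset
IGNORED_TAGS = {"spring", "hibernate"}                 # illustrative subset

def keep_question(tags, body, answer_count):
    """Apply the filters described above to one question row."""
    if "<code>" in body:        # general (non-code) questions only
        return False
    if answer_count == 0:       # drop unanswered questions
        return False
    if "java" not in tags:      # must be tagged java
        return False
    for t in tags:              # drop technology-specific questions
        if t in IGNORED_TAGS or t.startswith(IGNORED_TAG_PREFIXES):
            return False
    return True

def mark_code(answer_body):
    """Replace <code>...</code> with CODE_START / CODE_END labels."""
    out = re.sub(r"<code>", "CODE_START ", answer_body)
    return re.sub(r"</code>", " CODE_END", out)

print(keep_question({"java", "string"}, "What is ...", 2))        # True
print(mark_code("use <code>int[] a = new int[5];</code> here"))
```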

Dataset Statistics
We collected all corresponding answers from our set of filtered questions to create an initial corpus. This corpus contained 107,961 question-description-answer triplets, of which 47,220 questions did not have a 'user accepted best answer'. A statistical analysis based on a naive word split showed that there were outliers in the corpus, with very large maximum lengths of up to 10,000 words in an answer. We identified and removed the outliers in the corpus by removing the current largest sample and monitoring the average length of the corpus. We continued to remove the largest sample until we obtained a relatively stable average value. This outlier determination was performed separately for each sample type: question, description and answer. Ultimately, we removed questions longer than 19 words, or whose descriptions were longer than 125 words, or with answers longer than 175 words.
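The iterative outlier-removal procedure can be sketched as follows; the relative stopping tolerance is an illustrative choice, not a value from the paper.

```python
def trim_outliers(lengths, rel_tol=0.1):
    """Repeatedly drop the longest sample until removing it no longer
    shifts the mean length by more than rel_tol of its current value."""
    lengths = sorted(lengths)
    while len(lengths) > 1:
        mean_all = sum(lengths) / len(lengths)
        mean_trim = sum(lengths[:-1]) / (len(lengths) - 1)
        if (mean_all - mean_trim) / mean_all < rel_tol:
            break              # mean is stable: stop removing
        lengths.pop()          # drop the current largest sample
    return lengths

# One extreme 10,000-word answer dominates the mean and gets removed.
print(trim_outliers([12, 15, 18, 20, 10000]))   # [12, 15, 18, 20]
```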
The questions, descriptions and answers in the dataset were then converted into sequences of numbers using word indexing techniques, in order to be usable by a machine learning model. The word indexing involved first tokenizing the sentences into word tokens using an open-source tokenizer (Python NLTK). Each word was labelled with a unique index and stored as a key-value pair in a data structure. Secondly, the words in each sentence were replaced by the corresponding indexes using the data structure created above, to obtain a sequence of numbers corresponding to the original sentence. A total of 284,827 words were obtained through tokenization and subsequently indexed in the data structure.
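The indexing step can be sketched as follows; a naive whitespace split stands in here for the NLTK tokenizer used in the paper.

```python
sentences = ["how can we create an integer array",
             "you can use CODE_START int [ ] a CODE_END"]

word_index = {}                      # word -> unique integer index
for sent in sentences:
    for word in sent.split():
        if word not in word_index:
            word_index[word] = len(word_index) + 1   # reserve 0 for padding

# Replace each word with its index to get a sequence of numbers.
sequences = [[word_index[w] for w in sent.split()] for sent in sentences]
print(sequences[0])   # [1, 2, 3, 4, 5, 6, 7]
```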
To maintain the uniformity of sentence length, we 'pre-pad' the sequence with 0 before the original sequence. Adding zeros at the start of the original sequence (if required) allows the network to accept a fixed sequence length and the nature of the number zero also allows us to denote that the element in the sequence is an empty space. We 'pre-pad' and thus structure the sequence with actual content towards the end of the sequence because a time-based neural network is more likely to 'remember' time steps towards the end of the sequence, as those would be stored in the more recent memory which is captured by the network.
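The pre-padding step can be sketched as below; the front-truncation behavior for over-long sequences is an illustrative choice.

```python
def pre_pad(seq, maxlen):
    """Left-pad with zeros so the actual content sits at the end of the
    sequence, in the network's most recent time steps."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]     # keep the most recent tokens if too long
    return [0] * (maxlen - len(seq)) + seq

print(pre_pad([4, 8, 15], 6))    # [0, 0, 0, 4, 8, 15]
```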
The filtering of the sentences with length thresholds is important, as it is difficult to capture semantic representations for lengthy text using word embeddings. Setting these thresholds resulted in a reduced dataset of 106,386 questions. The statistics for the final dataset are shown in Table 1. We also make this dataset available for public use as a contribution of this paper.

Methods and Techniques
With the future objective of building a data-driven tutorial question-answering system, we first explore three overarching approaches to Java programming tutorial question answering: retrieval-based models, generative models, and hybrid models.
The challenge with this dataset is that, unlike traditional question answering datasets, it has three streams of input, each with its own vocabulary and length characteristics. The answers in the dataset contain interspersed English and Java, which could make building meaningful word vector representations difficult. Long sentences are typically more difficult to represent in a vector space, and the typical description and answer lengths in this dataset are longer than those seen in the previous work of Lowe et al. (2015) and Lu et al. (2017). As part of this work, we investigate which combinations of inputs from the dataset yield the best results.

Dual Encoder LSTM (Siamese network)
The Siamese Network or Dual Encoder architecture (Medsker and Jain, 2001; Bromley et al., 1994) has shown success in the recent past in building retrieval-based question answering systems (Lowe et al., 2015; Lu et al., 2017).
To use the dataset with the Dual Encoder architecture, we needed to perform some additional pre-processing. We first built a dataset containing the question along with its description and the corresponding correct answer, and we assigned a label of 1 to these samples. We then created a sample containing an incorrect answer for a given question and description pair by randomly choosing another answer from the rest of the answer set, and we assigned a label of 0 to these samples.

Description & Answer Dual Encoder (DADE):
In this architecture, the first encoder encoded the description statement and the second encoder encoded the answer statement. Each encoder was a bidirectional LSTM network (Schuster and Paliwal, 1997), as bidirectionality allows the network to understand the context of a word with respect to both previous and next words and thus build better vector representations of the words. Each bidirectional encoder's output was merged together to obtain a single 600-dimensional output, and then this output was fed to a fully connected network of two layers, with the first layer containing 500 neurons and the second layer containing 300 neurons. This was then run through a sigmoid activation function in order to obtain the result. This architecture used pre-trained GloVe word embeddings which were updated during the training phase. The LSTM cells contained 300 hidden units and 2 layers, and the network optimized the binary cross-entropy loss function.
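A toy forward pass through a dual-encoder scorer can illustrate the idea. Mean-pooled random embeddings stand in for the trained bidirectional LSTM encoders, and a single dense layer stands in for the fully connected layers described above; the dimensions and weights here are illustrative only.

```python
import math, random

random.seed(0)
DIM = 4   # toy embedding size; the actual model uses 300-d GloVe vectors

def encode(tokens, emb):
    """Mean-pool word vectors: a toy stand-in for a bidirectional
    LSTM encoder."""
    vecs = [emb[t] for t in tokens]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(DIM)]

def score(desc, ans, emb, w):
    """Concatenate both encodings and apply one dense layer + sigmoid;
    the actual model uses two dense layers of 500 and 300 units."""
    x = encode(desc, emb) + encode(ans, emb)
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-z))    # probability that the answer matches

vocab = ["how", "create", "array", "use", "new", "int"]
emb = {t: [random.uniform(-1, 1) for _ in range(DIM)] for t in vocab}
w = [random.uniform(-1, 1) for _ in range(2 * DIM)]

p = score(["how", "create", "array"], ["use", "new", "int"], emb, w)
print(0.0 < p < 1.0)   # a valid match probability
```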

Question & Description Dual Encoder (QDDE):
This architecture was similar to the Description and Answer Dual Encoder in that it consisted of a Dual Encoder LSTM network, where the first encoder encoded the question statement and the second encoder encoded the description statement. Each encoder's outputs were merged together to obtain a single 300-dimensional output and then this output was run through a sigmoid activation function in order to obtain the result.
Again, this architecture also used pre-trained GloVe word embeddings which were updated during the training phase. The LSTM cells contained 300 hidden units and a single layer and optimized the binary cross-entropy loss function.
The rationale for this architecture was to build a dual encoder that would be able to predict a description given a question. We wanted to investigate whether the dual encoder could learn relationships between smaller questions and longer descriptions. If we could successfully predict the description for a given question, it would allow us to leverage the similar lengths of the description and answer to obtain better results.

Techniques to answer queries
The aforementioned architectures were able to determine answers for the given training, validation and testing sets, where the correct answers are predetermined. To extend our model's use to the real world, we needed to define a different set of strategies to answer questions for which we do not know the predetermined answer. We explore our proposed strategies in the following section.
Question Description Matching followed by Description Answer Matching: This approach attempted to find a similarity measure between a given user question and a description of the given question via QDDE. The best matching description was then run against all the answers to determine the top 10 best possible answers, as seen in Figure 2. The intuition behind using this approach was that the description-answer dual encoder should provide better results, as it used a bidirectional LSTM network and both sequence lengths were approximately the same.
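The two-stage matching can be sketched with a placeholder word-overlap similarity standing in for the trained QDDE and DADE scores; the data and similarity function here are illustrative only.

```python
def top_k(query, candidates, sim, k):
    """Rank candidates by a similarity score and keep the best k."""
    return sorted(candidates, key=lambda c: sim(query, c), reverse=True)[:k]

def overlap(a, b):
    """Toy similarity: shared-word count stands in for dual-encoder scores."""
    return len(set(a.split()) & set(b.split()))

descriptions = ["how to declare an integer array in java",
                "difference between string pool and new string"]
answers = ["use int brackets to declare an integer array",
           "string literals go to the string pool"]

# Stage 1 (QDDE): match the user question against the descriptions.
best_desc = top_k("create an integer array", descriptions, overlap, 1)[0]
# Stage 2 (DADE): match the best description against all answers.
top_answers = top_k(best_desc, answers, overlap, 10)   # top 10 in the paper
print(top_answers[0])
```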

Sequence-to-Sequence Models
Sequence-to-Sequence models are generative models that, unlike their retrieval-based counterparts, do not rely on choosing from an existing set of answers but rather generate answers on their own.
The preprocessing steps for the sequence-to-sequence model were identical to the preprocessing steps specified for the dual encoder model.
Description to Answer Encoder-Decoder: This architecture used the description of a question as the input to the encoder and attempted to match the actual answer to the question using the decoder. The intuition behind matching a description to the answer was that, as the sequences are of almost equal length, the task could be framed as a machine translation problem, which has seen significant success with the sequence-to-sequence model (Sutskever et al., 2014; Cho et al., 2014; Vinyals and Le, 2015).
The encoder was a bidirectional recurrent neural network using LSTM cells. We chose bidirectionality for better sentence vector representation, and LSTM cells for their ability to capture long-term dependencies. The decoder is a standard recurrent neural network with LSTM cells.
The LSTM cells contained 512 hidden units and 2 layers. We used a dropout (Srivastava et al., 2014) probability of 0.2 and gradient normalization (Pascanu et al., 2013) of 3.6. We used 15 buckets, so that the maximum answer length of 175 would be split into smaller chunks of roughly 12 words each. The Luong attention mechanism (Luong et al., 2015) was implemented in order to boost accuracy, as was beam search (Wiseman and Rush, 2016) with a beam width of 10, in order to obtain a better output for a given input. The vocabulary had to be reduced to 60,000 words due to memory constraints.
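Beam search, used above with a beam width of 10, can be sketched over a toy next-token distribution; the `toy_step` function is a hypothetical stand-in for the decoder's softmax over the vocabulary.

```python
import math

def beam_search(start, step, width, length):
    """Keep the `width` highest-scoring partial sequences at each step.
    `step(seq)` returns (token, probability) continuations."""
    beams = [([start], 0.0)]                     # (sequence, log-prob)
    for _ in range(length):
        expanded = []
        for seq, lp in beams:
            for tok, p in step(seq):
                expanded.append((seq + [tok], lp + math.log(p)))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:width]
    return beams[0][0]                           # best full sequence

def toy_step(seq):
    """Toy distribution that always slightly prefers token 'a'."""
    return [("a", 0.6), ("b", 0.4)]

print(beam_search("<s>", toy_step, width=10, length=3))
```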
All the hyperparameters stated for the networks discussed above were determined by performing a hyperparameter search.

Hybrid architecture (Dual Encoder + Sequence-to-Sequence)
In this work, we built a hybrid structure that combined the retrieval-based dual encoder model with the generative sequence-to-sequence model. The intuition behind building this model was that a user typically asks questions with a length of fewer than 20 words and may not necessarily provide enough of a description to fill the 125-word limit. The proposed architecture combats this issue by taking the user's question and finding the most appropriate descriptions from a set of predefined descriptions; this is done by the question and description dual encoder (QDDE) mentioned earlier. The entire workflow can be seen in Figure 3. We take the top 10 predicted descriptions and feed them as input to the description-to-answer sequence-to-sequence model.
The Description to Answer model then generates 10 different answers, which are ranked according to the ranking of their input descriptions. This architecture also lets us leverage the nature of the dataset, in that it contains a question, a description and an answer, as opposed to a traditional question-answer pair dataset.
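The hybrid workflow can be sketched with placeholder components standing in for the trained models; the scoring and generation functions here are illustrative, not the paper's implementations.

```python
def hybrid_answer(question, descriptions, qdde_score, generate, k=10):
    """Hybrid pipeline: retrieve the k best-matching descriptions with
    the QDDE dual encoder, then generate one answer per description with
    the description-to-answer sequence-to-sequence model. Answers inherit
    the ranking of their source descriptions."""
    ranked = sorted(descriptions,
                    key=lambda d: qdde_score(question, d), reverse=True)[:k]
    return [generate(d) for d in ranked]

# Placeholder components standing in for the trained models.
descs = ["declare an integer array", "read a text file"]

def score(q, d):
    return len(set(q.split()) & set(d.split()))

def gen(d):
    if "array" in d:
        return "CODE_START int [ ] a = new int [ 5 ] ; CODE_END"
    return "use a BufferedReader"

ranked_answers = hybrid_answer("create an integer array", descs, score, gen)
print(ranked_answers[0])
```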

Experiments
All the experiments performed as a part of this work were done on a desktop with the following specifications: an i7 8-core CPU, 32GB RAM, and an NVIDIA GTX 1070 with 8GB VRAM.
The dataset of 106,386 triplets was split into separate training, testing and validation sets. Given the large network sizes used in the experiments, there were a correspondingly large number of parameters to be trained for each network, which in turn required a sufficiently large dataset to train on. Taking this into consideration, we chose not to follow the traditional 80-20 train-test split but rather to maintain a large enough training set and use the rest of the data for testing and validation. The training set thus contained 100,000 question-description-answer triplets, the test set contained 5,000 triplets, and the validation set contained 1,386 triplets.
For the dual encoder experiments, the effective training set size was 200,000, as we used both positive and negative samples during training, whereas for the sequence-to-sequence model we used only the 100,000 description-answer pairs.

Quantitative Analysis
Dual Encoder: The recall@k metric works in conjunction with the group size. With a group size of 5, recall@1 is the probability that, if we choose 1 of the 5 candidate answers, the choice is correct. Table 2 reports results for a group size of 10 and compares another popular retrieval-based method, TF-IDF (Ramos, 2003), to our obtained results. TF-IDF has been outperformed by dual encoders for conversational models in the past (Lowe et al., 2015), but we see some interesting results for our dataset.
We see that for DADE, TF-IDF is able to slightly outperform the dual encoder on the recall@1 score, but the dual encoder outperforms TF-IDF for recall@2 and recall@5. We further see that QDDE is outperformed by TF-IDF in both recall@1 and recall@2, only for QDDE to do better in recall@5.
We believe that we see this behavior because TF-IDF works based on word similarity and rates rare words shared between two documents as highly related (Ramos, 2003). Questions and descriptions containing common phrases are better perceived by TF-IDF than by the dual encoders. In addition, previous results like those of Lowe et al. (2015) were obtained on a corpus with an average utterance length of 10 words, where the dual encoder architecture significantly outperformed TF-IDF, whereas our work deals with much longer utterances. In spite of the long utterances, the dual encoders do a comparable or better job than TF-IDF.
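The recall@k metric discussed above can be computed as follows; the candidate rankings here are toy examples.

```python
def recall_at_k(rankings, k):
    """Fraction of test cases whose correct answer appears in the top k
    of the model's ranking over the candidate group."""
    hits = sum(1 for ranked, correct in rankings if correct in ranked[:k])
    return hits / len(rankings)

# Two toy test cases, each with a candidate group of size 5.
cases = [(["a3", "a1", "a5", "a2", "a4"], "a1"),   # correct ranked 2nd
         (["a2", "a4", "a1", "a5", "a3"], "a5")]   # correct ranked 4th
print(recall_at_k(cases, 1))   # 0.0
print(recall_at_k(cases, 2))   # 0.5
print(recall_at_k(cases, 5))   # 1.0
```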
Sequence-to-Sequence: We followed Google's Neural Machine Translation tutorial (Luong et al., 2017) to build our sequence-to-sequence models. An important point to note is that while traditional machine translation is judged based on BLEU score (Papineni et al., 2002) and perplexity, a conversational model cannot be judged on BLEU, and hence we used perplexity as the primary measure of judgment (Shao et al., 2017).

Description to Answer Sequence-to-Sequence Model: The perplexity for the dev set continued to decrease, thus we could assume no overfitting had occurred over the epochs of training this network. The dev and test perplexity scores were better than the previous model, with scores of 68.91 and 70.15 respectively, and this is also reflected in the coherent responses made by the model.
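Perplexity, the primary measure used above, is the exponential of the average per-token negative log-likelihood; the token probabilities below are illustrative.

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning uniform probability over a 70-word vocabulary
# to every token has perplexity 70.
lp = [math.log(1 / 70)] * 10
print(round(perplexity(lp), 2))   # 70.0
```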

Qualitative Analysis
We chose a question that was not part of the training, development or test corpora to analyze the qualitative results. These results are presented as qualitative because, since the question is not part of any of the corpora, we do not have an actual expected response. The question that we chose is fairly simple and straightforward: "How can we create an integer array in java". We take a look at the responses given by the models in Table 3. As the models provide multiple answers, we have handpicked the answer from the top 10 that we thought was most relevant. We have also cleaned the answers for readability by referring to the original posts on Stack Overflow.
The QDDE+DADE model produces a response that suggests using Java collections to achieve the same purpose of the array. The drawback here is that it diverges from the actual answer but is still relevant nonetheless. Another one of the top ten answers suggested looking at some of the Java documentation related to arrays.
We now take a look at the responses generated by the sequence-to-sequence models and the hybrid model. While the hybrid model suggests using an 'ArrayList' instead of an array, it was able to form different code for the condition of the 'for' loop in both answers, suggesting that it may understand a relationship between functions such as 'get' and 'add' and the 'for' loop condition.
It is also interesting to see the response generated by the Description to Answer sequence-to-sequence model. We can analyze some of the aforementioned testing responses generated via the Description to Answer Sequence-to-Sequence model, as can be seen in Table 4.
By analyzing the generated responses for the samples above, we can see that the model has learned how to create new objects and has also learned what kinds of commands are related to a given object, such as that a date in Java needs the 'simpledateformat' class or that a file could need a file path. Perhaps the most notable was the creation of a coherent 'for' loop using the previously created object and referencing the appropriate method. (The sequence-to-sequence model does not include non-word tokens such as '=' or '{'; these have been added for readability.)
We also see that the models are able to successfully combine the English language with Java code, starting answers with phrases such as "you can use...", "i dont think there is a way...", "i am not sure but try…" and so on. The models are also able to draw a clear line between code snippets and the English language, and code start labels are mostly correctly completed with code end labels. There have also been instances where English phrases such as "you can also try" are used between two code snippets.
While these examples have been sampled from a much larger set in which not all the responses are as appropriate, this still shows promise in using this architecture to build models that can appropriately respond to a query by generating their own response.

Conclusion
This work has examined how we can leverage community-based question answering forums as a source of data to build a dataset specific to general Java programming questions. We have seen that retrieval-based models obtain high recall rates on the testing set but are restricted to the available answer set. On the other hand, generative models are able to successfully combine the English language with Java code to make coherent responses at times, but the responses are short and do not completely answer the question. We found reasonable success with the hybrid model, which combines the retrieval-based approach with the generative approach. The proposed approaches show promise for building a useful tutorial system based on the sampled dataset. These are the first steps in that direction.
This work could be furthered by investigating jointly training the hybrid model to improve description selection and answer generation. One could also frame this task as a machine comprehension task, where the entire answer set could be used as the context. Doing so would allow us to leverage the memory network architecture, which performs better at tasks involving storing long-term memory. Finally, we could explore using adversarial training, as it has seen success on conversational models in the recent past (Li et al., 2017).

Acknowledgments
The authors wish to thank the members of the LearnDialogue group at the University of Florida for their helpful input. This work is supported in part by the National Science Foundation through grant CNS-1622438. Any opinions, findings, conclusions, or recommendations expressed in this report are those of the authors, and do not necessarily represent the official views, opinions, or policy of the National Science Foundation.