Predicting Algorithm Classes for Programming Word Problems

We introduce the task of algorithm class prediction for programming word problems. A programming word problem is a problem written in natural language, which can be solved using an algorithm or a program. We define classes of various programming word problems which correspond to the class of algorithms required to solve the problem. We present four new datasets for this task, two multiclass datasets with 550 and 1159 problems each and two multilabel datasets having 3737 and 3960 problems each. We pose the problem as a text classification problem and train neural network and non-neural network based models on this task. Our best performing classifier gets an accuracy of 62.7 percent for the multiclass case on the five class classification dataset, Codeforces Multiclass-5 (CFMC5). We also do some human-level analysis and compare human performance with that of our text classification models. Our best classifier has an accuracy only 9 percent lower than that of a human on this task. To the best of our knowledge, these are the first reported results on such a task. We make our code and datasets publicly available.


Introduction
In this paper we introduce and work on the problem of predicting algorithm classes for programming word problems (PWPs). A PWP is a problem written in natural language which can be solved using a computer program. These problems generally map to one or more classes of algorithms, which are used to solve them. Binary search, disjoint-set union, and dynamic programming are some examples. In this paper, our aim is to automatically map programming word problems to the relevant classes of algorithms. We approach this problem by treating it as a classification task.
Figure 1: An example programming word problem. Note that the example shown here is one of the easy Codeforces problems; most problems are much harder.
* denotes equal contribution
Programming word problems. A programming word problem (PWP) requires the solver to design correct and efficient programs. The correctness and efficiency are checked by various test cases provided by the problem writer. A PWP usually consists of three parts: the problem statement, a well-defined input and output format, and time and memory constraints. An example PWP can be seen in Figure 1.
Solving PWPs is difficult for several reasons. One reason is that the problems are often embedded in a narrative, that is, they are described as quasi real-world situations in the form of short stories or riddles. The solver must first decode the intent of the problem, or understand what the problem is. Then the solver needs to apply their knowledge of algorithms to write a solution program. Another reason is that the solution programs must be efficient with respect to the given time and memory constraints. A consequence is that the algorithm required to solve a particular problem depends not only on the problem statement but also on the constraints. For example, two different algorithms, such as linear search and binary search, may both generate the correct output, but only one of them may abide by the time and memory constraints.
With the growing popularity of these problems, various competitions like ACM-ICPC and Google CodeJam have emerged. Additionally, several companies including Google, Facebook, and Amazon evaluate the problem-solving skills of candidates for software-related jobs (McDowell, 2016) using PWPs. Consequently, as noted by Forišek (2010), programming problems have been becoming more difficult over time. To solve a PWP, humans get information from all its parts, not just the problem statement. Thus, we predict algorithms from the entire text of a PWP. We also try to identify which parts of a PWP contribute the most towards predicting algorithms.
Significance of the Problem. Many interesting real-world problems can be solved and optimised using standard algorithms. Time spent grocery shopping can be optimised by posing it as a graph traversal problem (Gertin, 2012). Arranging and retrieving items like mail or books in a library can be done more efficiently using sorting and searching algorithms. Solving problems using algorithms can be scaled by using computers, transforming the algorithms into programs. A program is an algorithm that has been customised to solve a specific task under a specific set of circumstances using a specific language. Converting textual descriptions of such real-world problems into algorithms, and then into programs, has largely been a human endeavour. An AI agent that could automatically generate programs from natural language problem descriptions could greatly increase the rate of technological advancement by quickly providing efficient solutions to such real-world problems. A subsystem that could identify algorithm classes from natural language would significantly narrow down the search space of possible programs. Consequently, such a subsystem would be a useful feature for, or likely even be part of, such an agent. Therefore, building a system to predict algorithms from programming word problems is potentially an important first step toward an automatic program generating AI. More immediately, such a system could serve as an application to help people improve their algorithmic problem-solving skills for software job interviews, competitive programming, and other uses.
To the best of our knowledge, this task has not been addressed in the literature before. Hence, there is no standard dataset available for this task. We generate and introduce new datasets by extracting problems from Codeforces¹, a sport programming platform. We release the datasets and our experiment code².
Contribution. The major contributions of this paper are: (1) four datasets on programming word problems: two multiclass³ datasets having 5 and 10 classes, and two multilabel⁴ datasets having 10 and 20 classes; and (2) an evaluation of various multiclass and multilabel classifiers that can predict classes for programming word problems on our datasets, along with a human baseline.
We define our problem more clearly in section 2. Then we explain our datasets, their generation and format, along with the human evaluation in section 3. We describe the models we use for multiclass and multilabel classification in section 4. We delineate our experiments, models, and evaluation metrics in section 5. We report our classification results in section 6. We analyse some dataset nuances in section 7. Finally, we discuss related work and the conclusion in sections 8 and 9 respectively.

Problem Definition
The focus of this paper is the problem of mapping a PWP to one or more classes of algorithms. A class of algorithms is a set containing more specific algorithms. For example, breadth-first search and Dijkstra's algorithm belong to the class of graph algorithms. A PWP can be solved using one of the algorithms in the class it is mapped to. Problems on the Codeforces platform have tags that correspond to the class of algorithms.
Thus, our aim is to find a tagging function, $f^* : S \to \mathcal{P}(T)$, which maps a PWP string, $s \in S$, to a set of tags, $\{t_1, t_2, \ldots\} \in \mathcal{P}(T)$. We also consider another variant of the problem. For the PWPs that only have one tag, we focus on finding a tagging function, $f^*_1 : S \to T$, which maps a PWP string, $s \in S$, to a tag, $t \in T$. We approximate $f^*$ and $f^*_1$ by training models on data.
1 codeforces.com
2 https://github.com/aayn/codeforces-clean
3 each problem belongs to only one class
4 each problem belongs to one or more classes

Data Collection
We collected the data from a popular sport programming platform called Codeforces. Codeforces was founded in 2010, and now has over 43000 active registered participants⁵. We first collected a total of 4300 problems from this platform. Each problem has associated tags, with most of the problems having more than one tag. These tags correspond to the algorithm or class of algorithms that can be used to solve that particular problem. The tags for a problem are given by the problem writer, and they can be edited only by high-rated (expert) contestants who have solved the problem. Next, we performed basic filtering on the data: we removed the problems which had non-algorithmic tags, problems with no tags assigned to them, and problems whose problem statement was not extracted completely. After this filtering, we got 4019 problems with 35 different tags. This forms the Codeforces dataset. The label (tag) cardinality (average number of labels/tags per problem) was 2.24. Since the Codeforces dataset is the first dataset generated for a new problem, we select different subsets of this dataset with differing properties. This is to check if classification models are robust to different variations of the problem.
5 http://codeforces.com/ratings/page/219

Multilabel Datasets
We found that a large number of tags had a very low frequency. Hence, we removed those problems and tags from the Codeforces dataset as follows. First, we got the list of the 20 most frequently occurring tags, ordered by decreasing frequency. We observed that the 20th tag in this list had a frequency of 98; in other words, 98 problems had this tag. Next, for each problem, we removed the tags that are not in this list. After that, all problems that did not have any tags left were removed.
This led to the formation of the Codeforces Multilabel-20 (CFML20) dataset, which has 20 tags. We used the same procedure for the 10 most frequently occurring tags to get the Codeforces Multilabel-10 (CFML10) dataset. CFML20 retains 98.53 percent (3960 problems) of the problems of the original dataset, and the label (tag) cardinality only reduces from 2.24 to 2.21. CFML10, on the other hand, retains 92.9 percent of the problems, with a label (tag) cardinality of 1.69. Statistics about both these multilabel datasets are given in Table 2.
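The filtering procedure above can be sketched in a few lines. This is a minimal illustration and not the released preprocessing code: the `(text, tags)` pair format, the function names, and the toy problems are assumptions made for the example.

```python
from collections import Counter


def keep_top_k_tags(problems, k):
    """Keep only the k most frequent tags and drop problems left with no tag.

    `problems` is a list of (text, tags) pairs; this in-memory format is an
    assumption for the sketch, not the layout of the released datasets.
    """
    tag_counts = Counter(tag for _, tags in problems for tag in tags)
    top_tags = {tag for tag, _ in tag_counts.most_common(k)}
    filtered = []
    for text, tags in problems:
        kept = [tag for tag in tags if tag in top_tags]
        if kept:  # problems whose tags were all removed are discarded
            filtered.append((text, kept))
    return filtered


def label_cardinality(problems):
    """Average number of tags per problem."""
    return sum(len(tags) for _, tags in problems) / len(problems)


# Toy data: "dp" occurs 3 times, "math" twice, every other tag once.
toy_problems = [
    ("p1", ["dp", "math"]),
    ("p2", ["dp", "greedy"]),
    ("p3", ["dp", "math", "2-sat"]),
    ("p4", ["geometry"]),
]
toy_top2 = keep_top_k_tags(toy_problems, k=2)
```

On the toy data, keeping the two most frequent tags drops `p4` entirely and lowers the label cardinality from 2.0 to 5/3, which mirrors the small drop in cardinality reported for CFML20.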

Multiclass Datasets
To generate the multiclass datasets, we first extracted the problems from the CFML20 dataset that only had one tag. There were about 1300 such problems. From those, we selected the problems whose tags occur in the list of the 10 most common tags. These problems formed the Codeforces Multiclass-10 (CFMC10) dataset, which contains 1159 examples. We found that the CFMC10 dataset has a class (tag) imbalance. We therefore also make a balanced dataset, Codeforces Multiclass-5 (CFMC5), in which the prior class (tag) distribution is uniform. The CFMC5 dataset has five tags, each having 110 problems. To make CFMC5, we first extracted the problems whose tags are among the five most common tags. The fifth most common tag occurs 110 times. We then sampled 110 random problems for each of the other four tags to give a total of 550 problems. Statistics about both the multiclass datasets are given in Table 1.
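The construction of a balanced multiclass subset can be sketched as follows. This is an illustrative sketch under assumptions: the `(text, tags)` format, the function names, the fixed random seed, and the choice to count tag frequencies over the single-tag problems are ours, not details confirmed by the paper.

```python
import random
from collections import Counter, defaultdict


def single_tag_examples(problems):
    """Keep problems with exactly one tag, as (text, tag) pairs."""
    return [(text, tags[0]) for text, tags in problems if len(tags) == 1]


def balanced_subset(examples, num_classes, seed=0):
    """Keep the num_classes most common tags and downsample each of them
    to the size of the rarest kept tag, giving a uniform class prior."""
    counts = Counter(tag for _, tag in examples)
    kept_tags = [tag for tag, _ in counts.most_common(num_classes)]
    per_class = min(counts[tag] for tag in kept_tags)
    by_tag = defaultdict(list)
    for text, tag in examples:
        by_tag[tag].append(text)
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    subset = []
    for tag in kept_tags:
        for text in rng.sample(by_tag[tag], per_class):
            subset.append((text, tag))
    return subset


# Toy data: tag "a" has 4 single-tag problems, "b" 3, "c" 2, "d" 1.
toy = [(f"a{i}", ["a"]) for i in range(4)]
toy += [(f"b{i}", ["b"]) for i in range(3)]
toy += [(f"c{i}", ["c"]) for i in range(2)]
toy += [("d0", ["d"]), ("multi", ["a", "b"])]
toy_single = single_tag_examples(toy)
toy_balanced = balanced_subset(toy_single, num_classes=3)
```

With three classes kept, every kept tag is reduced to two problems, the count of the rarest kept tag, in the same way that CFMC5 is capped at the 110 problems of its fifth most common tag.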

Dataset Format
Each problem in the datasets follows the same format (refer to Figure 1 for an example problem). The header contains the problem title, and the time and memory constraints for a program running on the problem test cases. The problem statement is the natural language description of the problem framed as a real world scenario. The input and output format describe the input to, and the output from, a valid solution program. They also contain constraints that will be put on the size of inputs (for example, max size of input array, max size of 2 input values). The tags associated with the problem are the algorithm classes that we are trying to predict using the above information.

Class Categories in the Dataset
The classes for PWPs can be divided into two categories:
Problem category classes tell us what broad kind of problem the PWP belongs to. For instance, math and string are two such classes.
Solution category classes tell us what kind of algorithm can solve a particular PWP. For example, a PWP of class dp or binary search would need a dynamic programming or binary search based algorithm to solve it.
Problem category PWPs are easier to classify because, in some cases, simple keyword mapping may lead to the classification (an equation in the problem is a strong indicator that a problem is of math type), whereas for solution category PWPs a deeper understanding of the problem is required.
The classes belonging to the problem and solution categories for CFML20 are listed in the supplementary material.

Human Evaluation
In this section, we evaluate and analyze the performance of an average competitor on the task of predicting an algorithm for a PWP. The tags for a given PWP are added by its problem setter or other high-rated contestants who have solved it. Our test participants were recent computer science graduates with some experience in algorithms and competitive programming. We gave 5 participants the problem text along with all the constraints, and the input and output format. We also provided them with a list of all the tags and a few example problems for each tag. We randomly sampled 120 problems from the CFML20 dataset and split them into two parts containing 20 and 100 problems respectively. The 20 problems were given along with their tags to familiarize the participants with the task. For the remaining 100 problems, the participants were asked to predict the tags (one or more) for each problem. We chose to sample the problems from the CFML20 dataset as it is the closest to a real-world scenario of predicting algorithms for solving problems. We find that there is some variation in the scores obtained by different humans, with the highest F1 micro score being 11 percent greater than the lowest (see supplementary material for more details). The F1 micro score averaged over all 5 participants was 51.8, while the averaged F1 macro score was 42.7. The results are not surprising since this task is like any other problem solving task, and people would get different results depending on their proficiency. This shows us that the problem is hard even for humans with a computer science education.
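For reference, the micro- and macro-averaged F1 scores used above can be computed as in the sketch below. It follows the standard definitions; the function names and toy labels are ours, and the macro score here is the unweighted mean over tags, which may differ from a weighted variant.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0 when there is nothing to count."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, tags):
    """gold and pred are lists of tag sets, one set per problem."""
    counts = {}
    for tag in tags:
        tp = sum(1 for g, p in zip(gold, pred) if tag in g and tag in p)
        fp = sum(1 for g, p in zip(gold, pred) if tag not in g and tag in p)
        fn = sum(1 for g, p in zip(gold, pred) if tag in g and tag not in p)
        counts[tag] = (tp, fp, fn)
    # Micro: pool the counts over all tags, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts.values()),
        sum(c[1] for c in counts.values()),
        sum(c[2] for c in counts.values()),
    )
    # Macro: compute F1 per tag, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts.values()) / len(tags)
    return micro, macro


toy_tags = ["dp", "greedy", "math"]
toy_gold = [{"dp"}, {"dp", "math"}, {"greedy"}, {"math"}]
toy_pred = [{"dp"}, {"dp"}, {"dp"}, {"math", "greedy"}]
toy_micro, toy_macro = micro_macro_f1(toy_gold, toy_pred, toy_tags)
```

In the toy example the frequent tag `dp` is predicted well and `greedy` is never predicted correctly, so the micro score (0.6) is higher than the macro score (about 0.49). The same pattern, micro above macro, appears in the human scores reported above.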

Classification Models
To test the compatibility of our problem with the text classification paradigm, we apply some standard text classification models from recent literature to it.

Multiclass Classification
To approximate the optimal tagging function $f^*_1$ (see section 2), we use the following models.
Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM). Wang and Manning (2012) proposed several simple and effective baselines for text classification. An MNB is a naive Bayes classifier for multinomial models. An SVM is a discriminative hyperplane-based classifier (Hearst et al., 1998). These baselines use unigrams and bigrams as features. We also try applying TF-IDF to these features.
Multi-layer Perceptron (MLP). An MLP is a class of artificial neural network that uses backpropagation for training in a supervised setting (Rumelhart et al., 1986). MLP-based models are standard for text classification baselines (Glorot et al., 2011).
Convolutional Neural Network (CNN). We also train a CNN-based model, similar to the one used by Kim (2014), to classify the problems. We use the model both with and without pre-trained GloVe word embeddings (Pennington et al., 2014).
CNN ensemble. Hansen and Salamon (1990) introduced neural network ensemble learning, in which many neural networks are trained and their predictions combined. These neural network systems show greater generalization ability and predictive power. We train five CNNs and combine their predictions by majority voting.
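Majority voting over the predictions of several trained models can be sketched as follows. The function names and toy predictions are illustrative. The tie-breaking rule (the label that reaches the top count first in model order wins) is an assumption of this sketch, as the text does not specify one.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most common label among the votes for one problem.

    Ties are broken in favour of the label seen first; this rule is an
    assumption of the sketch.
    """
    return Counter(votes).most_common(1)[0][0]


def ensemble_by_voting(per_model_predictions):
    """per_model_predictions[m][i] is model m's label for problem i."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Five hypothetical models, three problems.
toy_predictions = [
    ["dp", "greedy", "math"],
    ["dp", "math", "math"],
    ["greedy", "greedy", "dp"],
    ["dp", "greedy", "math"],
    ["math", "dp", "math"],
]
toy_ensemble = ensemble_by_voting(toy_predictions)
```

Here no single model is correct on all three problems if the true labels are `dp`, `greedy`, `math`, yet the vote recovers all of them. This illustrates the kind of gain in generalization that ensembles are meant to provide.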

Multilabel Classifiers
To approximate $f^*$ (see section 2), we apply the following augmentations to the models described above.
Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM). For applying these models to the multilabel case, we use the one-vs-rest (or one-vs-all) strategy. This strategy involves training a single classifier for each class, with the samples of that class as positive samples and all other samples as negatives (Bishop, 2006).
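The relabelling behind the one-vs-rest strategy can be illustrated with a small sketch. It only shows how the multilabel targets are turned into one binary problem per tag and how binary decisions are merged back into tag sets. The binary classifiers themselves are omitted, and all names are ours.

```python
def one_vs_rest_targets(tag_sets, tags):
    """For each tag, build a binary target vector: 1 if the problem has the
    tag (positive sample), 0 otherwise (negative sample)."""
    return {tag: [int(tag in tag_set) for tag_set in tag_sets] for tag in tags}


def combine_binary_predictions(binary_predictions, num_problems):
    """Merge per-tag binary decisions back into one predicted tag set per problem."""
    predicted = [set() for _ in range(num_problems)]
    for tag, decisions in binary_predictions.items():
        for i, positive in enumerate(decisions):
            if positive:
                predicted[i].add(tag)
    return predicted


toy_tags = ["dp", "greedy", "math"]
toy_gold = [{"dp"}, {"dp", "math"}, {"greedy"}]
toy_targets = one_vs_rest_targets(toy_gold, toy_tags)
# Feeding the gold targets back in recovers the original tag sets.
toy_roundtrip = combine_binary_predictions(toy_targets, len(toy_gold))
```

Each of the per-tag target vectors would be used to train one binary classifier, so the number of classifiers grows linearly with the number of tags.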
Multi-layer Perceptron (MLP). Nam et al. (2014) use MLP-based models for multilabel text classification. We use similar models, but use the MSE loss instead of the cross-entropy loss.
Convolutional Neural Network (CNN). For multilabel classification we use a CNN-based feature extractor similar to the one used by Kim (2014). The output is passed through a sigmoid activation function, $\sigma(x) = \frac{1}{1 + e^{-x}}$. The labels which have a corresponding activation greater than 0.5 are taken as the predicted labels. Similar to the multiclass case, we train the model both with and without pre-trained GloVe (Pennington et al., 2014) word embeddings.
CNN ensemble. We train five CNNs and add their output linear activation values. We pass this sum through a sigmoid function and consider the labels (tags) with activation greater than 0.5.
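The multilabel ensemble rule (sum the linear outputs, apply a sigmoid, threshold at 0.5) can be sketched as below. The function names, the three toy models, and their output values are made up for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function, sigma(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def ensemble_multilabel_predict(logits_per_model, tags, threshold=0.5):
    """Sum the linear (pre-sigmoid) outputs of all models for one problem,
    squash the sum with a sigmoid, and keep tags above the threshold."""
    summed = [sum(column) for column in zip(*logits_per_model)]
    return [tag for tag, z in zip(tags, summed) if sigmoid(z) > threshold]


toy_tags = ["dp", "greedy", "math"]
# Linear outputs of three hypothetical models for a single problem.
toy_logits = [
    [2.0, -1.0, 0.5],
    [1.0, -0.5, -1.0],
    [0.5, 0.2, -0.1],
]
toy_predicted = ensemble_multilabel_predict(toy_logits, toy_tags)
```

Since $\sigma(z) > 0.5$ exactly when $z > 0$, the rule amounts to keeping every tag whose summed linear output is positive. In the toy example only `dp` (sum 3.5) survives, even though one model scores `math` positively and another scores `greedy` positively.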

Experiment setup
All hyperparameter tuning experiments were performed with 10-fold cross validation. For the non-neural network-based methods, we first vectorize each problem using a bag-of-words vectorizer, scikit-learn's (Pedregosa et al., 2011) CountVectorizer. We also experiment with TF-IDF features for each problem. In the multiclass case, we use the LIBSVM (Chang and Lin, 2001) implementation of the SVM classifier and we grid search over different kernels. However, the LIBSVM implementation is not compatible with the one-vs-rest strategy (complexity $O(n)$, where $n$ is the number of classes), but only with the one-vs-one strategy (complexity $O(n^2)$). This becomes prohibitively slow and thus we use the LIBLINEAR (Fan et al., 2008) implementation for the multilabel case. For hyperparameter tuning, we applied a grid search over the parameters of the vectorizers, classifiers, and other components. The exact parameters tuned can be seen in our code repository.
For the neural network-based methods, we tokenize each problem using the spaCy tokenizer (Honnibal and Montani, 2017). We only use words appearing 2 or more times in building the vocabulary and replace the words that appear fewer times with a special UNK token. Our CNN architecture is similar to that used by Kim (2014). The batch size used is 32. We apply 512 one-dimensional convolution filters of size 3, 4, and 5 on each problem. The rectifier, $R(x) = \max(x, 0)$, is used as the activation function. We concatenate the outputs of these filters and apply global max-pooling, followed by a fully-connected layer with output size equal to the number of classes. We use the PyTorch framework (Paszke et al., 2017) to build this model. For the word embedding we use two approaches: a vanilla PyTorch trainable embedding layer and a 300-dimensional GloVe embedding (Pennington et al., 2014). The networks were initialized using the Xavier method (Glorot and Bengio, 2010) at the beginning of each fold. We use the Adam optimization algorithm (Kingma and Ba, 2014) as we observe that it converges faster than vanilla stochastic gradient descent.
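The vocabulary construction described above can be sketched as follows. To keep the example self-contained, the input is already split into tokens rather than tokenized with spaCy. The spelling of the UNK token, the function names, and the toy sentences are assumptions.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical spelling of the special UNK token


def build_vocab(tokenized_problems, min_count=2):
    """Map every token seen at least min_count times to an integer id;
    id 0 is reserved for the UNK token."""
    counts = Counter(tok for tokens in tokenized_problems for tok in tokens)
    vocab = {UNK: 0}
    for tok, count in counts.items():
        if count >= min_count:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace tokens outside the vocabulary with the UNK id."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


toy_corpus = [
    ["find", "the", "shortest", "path"],
    ["find", "the", "maximum", "sum"],
]
toy_vocab = build_vocab(toy_corpus)
toy_ids = encode(["find", "the", "longest", "path"], toy_vocab)
```

Only `find` and `the` occur twice in the toy corpus, so every other word, including the unseen `longest`, is mapped to the UNK id. The resulting id sequences are what an embedding layer would consume.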

Multiclass Results
We see that the classification accuracy of the best performing classifier, the CNN ensemble, for the CFMC5 dataset is 62.7%. The highest accuracy for the CFMC10 dataset was achieved by the CNN classifier which does not use any pretrained embeddings. For all the multiclass classification results refer to Table 3. We observe that CNN-based classifiers perform better than the other classifiers (MLP, MNB, and SVM) for both the CFMC5 and CFMC10 datasets. Since these are the first learning results on the task of algorithm prediction for PWPs, we also train a CNN classifier on a random labelling of the dataset as a baseline. The results are given in the row called CNN random. To obtain this random labelling we randomly shuffle the current mapping from problem to tag. This ensures that the class distribution of the datasets remains the same. We see that all the classifiers significantly outperform the classifier trained on the random labelling. We also observe that the classification accuracy is not the same for every class. We get the highest accuracy (see Fig. 2) for the class data structures, at 90%, while the lowest accuracy is for the class greedy, at 40%. These results are on the CFMC5 dataset.
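The random labelling used for the CNN random row can be obtained by permuting the label column, as in the sketch below. The function name, the seed, and the toy labels are illustrative assumptions.

```python
import random
from collections import Counter


def shuffled_labelling(labels, seed=0):
    """Return a random permutation of the labels. The problems keep their
    order, so each problem receives the label of a random other problem,
    while the overall class distribution is unchanged."""
    shuffled = list(labels)  # copy, so the original labelling is untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled


toy_labels = ["dp"] * 3 + ["greedy"] * 2 + ["math"]
toy_shuffled = shuffled_labelling(toy_labels)
```

Because only the assignment of labels to problems changes, a classifier trained on the shuffled labelling can at best exploit the class prior. This is what makes it a useful lower reference point.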

Multilabel Results
We see that CNN-based classifiers give the best results for the CFML10 and CFML20 datasets. The best F1 micro and macro scores for the CFML10 dataset were 45.32 and 38.9 respectively. These were obtained by the CNN ensemble model. For complete results see Table 4. The best performing model on the CFML20 dataset was also the CNN ensemble. As we did in the multiclass case, we train a CNN model on a randomly shuffled labelling for both the CFML10 and CFML20 datasets. We find that all the classifiers significantly outperform the model trained on the shuffled labelling. The human-level F1 micro and macro scores on a subset of the CFML20 dataset were 51.2 and 40.5. In comparison, our best performing classifier on the CFML20 dataset, the CNN ensemble, got F1 micro and macro scores of 42.75 and 37.29 respectively. We see that the performance of our best classifier trails average human performance by about 8.45 and 3.21 points on F1 micro and F1 macro respectively.

Experiments with various subsets of the problem
As described in section 1, a PWP consists of three components: the problem statement, the input and output format, and the time and memory constraints.
We seek to answer the following questions. Does one component contribute to the accuracy more than any other? Does the contribution of different components vary over the problem class? We performed some experiments to address these questions. We split the problem into two parts: 1) the problem statement, and 2) the input and output format, and time and memory constraints. We train an SVM and a CNN on these two components independently.
Multiclass PWP component analysis. We report classifier accuracies on the CFMC5 dataset. We choose the CFMC5 dataset out of the two multiclass datasets because it has a balanced class distribution. We find that the classifiers perform quite well on only the input and output format, and time and memory constraints, with the best classifier getting an accuracy of 56.4 percent (only 5.3 percent lower than the accuracy of a CNN trained on the whole problem). Classification using only the problem statement gives worse results than using the format and constraints, with a classification accuracy of 45.2 percent for the best classifier, the CNN (16.5 percent lower than the accuracy of a CNN trained on the whole problem). Complete results are given in Table 5. We also see that the performance across different classes varies when trained on different inputs. We find that the class dp performs better when trained on the problem statement, whereas the other classes perform much better on the format and constraints. For each class except greedy, we see an additive trend: the accuracy is improved by combining both these features. Refer to Figure 2 for more details.
Multilabel partial problem results. We also tabulate the classifier accuracies on the CFML20 dataset by training them only on the format and constraints, and only on the problem statement. Here too, we observe trends similar to those in the multiclass partial problem experiments. We find that classifiers are more accurate when trained only on the format and constraints than only on the problem statement. Again, the accuracy is improved by combining both these features. Refer to Table 5 for more details.

Problem category and Solution category results
We find that correctly classifying PWPs of the solution category is harder than correctly classifying PWPs of the problem category (table 5). For instance, take a look at the row corresponding to CFMC5 dataset and "all prob" feature. The accuracy for solution category is 54.24% as compared to 71.36% for the problem category. This trend is followed for both CFMC5 and CFML20 datasets and also when using different features of the PWPs. In spite of the difficulty, the classification scores for the solution category are significantly better than random.

Related Work
Our work is related to three major topics of research: math word problem solving, text document classification, and program synthesis.
Math word problem solving. In recent years, many models have been built to solve different kinds of math word problems. Some models solve only arithmetic problems (Hosseini et al., 2014), while others solve algebra word problems. There are some recent solvers which solve a wide range of pre-university level math word problems (Matsuzaki et al., 2017; Hopkins et al., 2017). Deep neural network based solvers for math word problems have also been built (e.g. Mehta et al., 2017).
Program synthesis. Work related to the task of converting natural language descriptions to code comes under the research areas of program synthesis and natural language understanding. This work is still in its nascent stage. Zhong et al. (2017) worked on generating SQL queries automatically from natural language descriptions. Lin et al. (2017) worked on automatically generating bash commands from natural language descriptions. Iyer et al. (2016) worked on summarizing source code. Sudha et al. (2017) use a CNN-based model to classify the algorithm used in a programming problem using the C++ code. Our model tries to accomplish this task by using the natural language problem description. Gulwani et al. (2017) provide a comprehensive treatise on program synthesis.
Document classification. The problem of classifying a programming word problem in natural language is similar to the task of document classification. The current state-of-the-art approach for single-label classification is to use a hierarchical attention network based model (Yang et al., 2016). This model is improved by using transfer learning (Howard and Ruder, 2018). Other approaches include a Recurrent Convolutional Neural Network based approach (Lai et al., 2015) or the fastText model (Joulin et al., 2016), which uses bag-of-words features and a hierarchical softmax. Nam et al. (2014) use a feed-forward neural network with binary cross entropy per label to perform multilabel document classification. Kurata et al. (2016) leverage label co-occurrence to improve multilabel classification. CNN-based architectures have also been used to perform extreme multilabel classification.
Table 5: Performance on different categories of PWPs for different parts of the PWPs. The rows with "only statement" features use only the problem description part of the PWP, the rows with "only i/o" use only the I/O and constraint information, and the rows with "all prob" use the entire PWP. The results under the "Soln category" and "Prob category" columns are for the problems which have a label under the solution and problem category respectively. "All" is for the entire dataset. So, for example, the F1 Micro score using only I/O and constraint information for solution category problems of CFML20 is 34.63. Note that for CFMC5, F1 Mi (F1 Micro) is the same as accuracy, and the F1 Ma (F1 Macro) score is a weighted Macro F1-score.
Figure 2: [...] only format and constraints information (center), and only problem statement (right). Performance on the whole problem is the highest, followed by only format and constraints information. Performance across different classes (except greedy) is additive, which shows that features extracted from both the parts are of importance.

Conclusion
We introduced a new problem of predicting the algorithm classes for programming word problems. For this task we generated four datasets: two multiclass (CFMC5 and CFMC10), having five and 10 classes respectively, and two multilabel (CFML10 and CFML20), having 10 and 20 classes respectively. Our classifiers fall short of the human score by only about 9 percent. We also did some experiments which show that increasing the size of the training dataset improves the accuracy (see supplementary material). These problems are much harder than high school math word problems as they require a good knowledge of various computer science algorithms and an ability to reduce a problem to these known algorithms. Even our human analysis shows that trained computer science graduates only get an F1 of 51.8. Based on these results, we see that algorithm class prediction is compatible with and can be solved using text classification.