Learning to Define Terms in the Software Domain

One way to test a person’s knowledge of a domain is to ask them to define domain-specific terms. Here, we investigate the task of automatically generating definitions of technical terms by reading text from the technical domain. Specifically, we learn definitions of software entities from a large corpus built from the user forum Stack Overflow. To model definitions, we train a language model and incorporate additional domain-specific information such as word co-occurrence and ontological category information. Our approach improves over previous baselines by 2 BLEU points on the definition generation task. Our experiments also show the additional challenges associated with the task and the shortcomings of language-model-based architectures for definition generation.


Introduction
Dictionary definitions have previously been used in various Natural Language Processing (NLP) pipelines, such as knowledge base population (Dolan et al., 1993), relationship extraction, and extraction of semantic information (Chodorow et al., 1985). Creating dictionaries in a new domain is time-consuming and often requires hand curation by domain experts. Developing systems that automatically learn and generate definitions of words can lead to greater time-efficiency (Muresan and Klavans, 2002). Additionally, it helps accelerate resource-building efforts for any new domain.
In this paper, we study the task of generating definitions of domain-specific entities. In particular, our goal is to generate definitions for technical terms with the freely available Stack Overflow (SO, https://stackoverflow.com) as our primary corpus. Stack Overflow is a technical question-and-answer forum aimed at supporting programmers in various aspects of computer science. Each question is tagged with associated entities or "tags", and the top answers are ranked based on user upvotes (de Souza et al., 2014). Figure 1 shows a screenshot from the forum of a question and the entities tagged with the question. Our work explores the challenge of generating definitions of entities in SO using the background data of question-answer pairs and their associated tags.
Our base definition generation model is adapted from Noraset et al. (2017), who use a Recurrent Neural Network (RNN) language model to learn to generate definitions for common English words using embeddings trained on the Google News corpus. On top of this base model, we leverage distributed word information via embeddings trained on the domain-specific Stack Overflow corpus. We extend this model to additionally incorporate domain-specific information, such as co-occurring entities and a domain ontology, in the definition generation process. Our model also uses an additional loss function to reconstruct the entity word representation from the generated sequence. Our best model generates definitions in the software domain with a BLEU score of 10.91, improving upon the baseline by 2 points.
In summary, our contributions are as follows:
1. We propose a new dataset of entities in the software domain and their corresponding definitions, for the definition generation task.
2. We provide ways to incorporate domain-specific knowledge such as co-occurring entities and ontology information into a language model trained for the definition generation task.
3. We study the effectiveness of the model using the BLEU (Papineni et al., 2002) metric and present and discuss our results.
Section 4 of this paper presents the dataset. Section 5 discusses the model in detail. In Sections 6 and 7, we present the experimental details and the results of our experiments. Section 8 provides an analysis and discussion of the results and the generated definitions.

Related Work
Definition modeling: The closest work to ours is Noraset et al. (2017), who learn to generate definitions for general English words using an RNN language model initialized with pre-trained word embeddings. We adapt their method and use it in a domain-specific setting: we aim to learn definitions of entities in the software domain. Hill et al. (2015) learn to produce distributed embeddings for words using their dictionary definitions as a means to bridge the gap between lexical and phrase semantics. Similarly, Tissier et al. (2017) use lexical definitions to augment the Word2Vec algorithm by adding an objective of reconstructing the words in the definition. In contrast, we focus solely on generating the definitions of entities, and we add an objective of reconstructing the embedding of the word from the generated sequence. Also, all of the above work focuses on lexical definitions of general English words, while we focus on closed-domain software terms. Dhingra et al. (2017) present a dataset of cloze-style queries constructed from definitions of software entities on Stack Overflow. In contrast to their work, we focus on generating the entire definition of entities.

RNN Language Models: We use RNN-based language models, conditioned on the term to be defined and its ontological category, to generate definitions. Such neural language models have been shown to work successfully for image-captioning tasks (Karpathy and Fei-Fei, 2015; Kiros et al., 2014), concept-to-text generation (Lebret et al., 2016; Mei et al., 2015), machine translation (Luong et al., 2014; Bahdanau et al., 2014), and conversation and dialog systems (Shang et al., 2015; Wen et al., 2015).
Reconstruction Loss Framework: We also build an explicit loss framework to reconstruct the term by reducing the cosine distance between the embedding of the term and the embedding of the reconstructed term. We adapt this approach from Hill et al. (2015), who apply it to learn word representations using dictionaries. Inan et al. (2016) propose a loss framework for language modeling that minimizes the distance between the prediction distribution and the true data distribution. Though we use a different loss framework, we use a similar type of parameter tying in our implementation.

Definitions
Dictionary definitions represent a large source of our knowledge of the meaning of words (Amsler, 1980). Definitions are composed of a 'genus' phrase and a 'differentia' phrase (Amsler, 1980). The 'genus' phrase identifies the general category of the defined word. This helps derive an 'Is A' relationship between the general category and the word being defined. The 'differentia' phrase distinguishes this instance of the general category from other instances. Definitions can have a further set of differentia to give a more granular explanation. For example, Merriam-Webster defines the word 'house' as 'a building that serves as living quarters for one or a few families'. Here the phrase 'a building' is the genus, which denotes that a house is a building. The phrases 'that serves as living quarters' and 'for one or a few families' are differentia phrases which help distinguish a house from other buildings. Our model of definitions adapts this interpretation. From a modeling perspective, we hypothesize that language models would learn the template structure of definitions, and that incorporating entity-entity co-occurrence as well as ontological category information would help fill the specific genus and differentia concepts into the template structure.

Dataset
In SO, users can associate a question with a 'tag', such as 'Java' or 'machine learning', to help other users find and answer it. These tags are nearly always names of domain-specific entities. Each tag has a definition on SO. For the definitions, we created a dataset of 25K software entities (tags from SO) and their definitions on SO. The data collection and pre-processing for the task are similar to those used for the cloze-style software questions collected in Dhingra et al. (2017). The definitions dataset was built from the definitional "excerpt" entry for each tag (entity) on Stack Overflow. For example, the excerpt for the "java" tag is, "Java is a general purpose object-oriented programming language designed to be used in conjunction with the Java Virtual Machine (JVM)." The dataset statistics can be seen in Table 2. This dataset is used for training our definition generation models. Examples of definitions in the dataset are shown in Table 1.

Table 1: Examples of software tags, their definitions, and their ontological categories.
Software Tag | Definition | Category
hmac | in cryptography hmac hash-based message authentication code is a specific construction for calculating a message authentication code mac involving a cryptographic hash-function in combination with a secret-key | authentication
persistence | persistence in computer programming refers to the capability of saving data outside the application memory . | database
ndjango | ndjango is a port of the popular django template-engine to .net . | framework
intellisense | intellisense is microsoft s implementation of automatic code-completion best known for its use in the microsoft visual-studio integrated development-environment . | compiler

We use a background corpus of the top 50 threads tagged with every entity on Stack Overflow (Dhingra et al., 2017) and attempt to learn definitions of entities from this data. We use this background corpus for training word embeddings and for obtaining tag co-occurrences. In SO, a particular question can have multiple tags associated with it, which we call 'co-occurring tags'. We extracted the top 50 question posts for each tag, along with any answer-post responses and metadata (tags, authorship, comments), using Scrapy. From each thread, we used all text that is not marked as code and segmented it into sentences. Each sentence is truncated to 2048 characters, lower-cased, and tokenized using a custom tokenizer compatible with special characters in software terms (e.g. .net, c++). The background corpus for our task consists of 27 million sentences.
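The sentence-level pre-processing described above (truncation, lower-casing, and tokenization that preserves software terms) could look roughly like the following sketch. The regular expression and the `preprocess` helper are illustrative assumptions; the tokenizer actually used for the corpus is not specified beyond its handling of terms such as .net and c++.

```python
import re

# Hypothetical token pattern (an assumption, not the authors' tokenizer):
# keeps software-style tokens such as ".net", "c++", "c#", "node.js" and
# hyphenated tags like "hash-function" in one piece; any other
# non-space character becomes its own token.
TOKEN_RE = re.compile(r"\.?[a-z0-9_]+(?:[-.][a-z0-9_]+)*[+#]*|[^\sa-z0-9_]")

MAX_CHARS = 2048  # truncation length stated in the text


def preprocess(sentence):
    """Truncate, lower-case and tokenize one sentence."""
    text = sentence[:MAX_CHARS].lower()
    return TOKEN_RE.findall(text)


tokens = preprocess("Porting Django to .NET and C++ .")
```

A regex-based tokenizer is only one possible choice; the point is that splitting on punctuation alone would break tokens like ".net" and "c++" apart.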

Definition generation as language modeling
We model the task of generating definitions as a language modeling task. The architecture of the model is shown in Figure 2. We model the problem of generating a definition $D = w_1, w_2, \ldots, w_T$ given the entity $w^*$, where the probability of generating a token $w_t$ is given by $P(w_t \mid w_1, \ldots, w_{t-1}, w^*)$ and the probability of generating the entire definition is given by:

$$P(D \mid w^*) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1}, w^*) \quad (1)$$

We model this using LSTM language models (Mikolov et al., 2010; Hochreiter and Schmidhuber, 1997). An LSTM unit is composed of three multiplicative gates which control the proportions of information to forget and to pass on to the next time step. During training, the inputs to the LSTM are the word embeddings of the gold definition sequence. At test time, the input is the embedding of the input entity and of the previously generated words. In the functions of our model, $E$ is an embedding matrix initialized with pre-trained embeddings, $W_k$ is a weight matrix, and sm is the softmax function.
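As a rough illustration of the factorization in Equation 1 and of the softmax output layer, the sketch below scores a definition from per-step distributions. The LSTM itself is not implemented; the hidden state `h_t` is assumed to be given, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_distribution(W_k, h_t):
    """sm(W_k h_t): project a hidden state onto the vocabulary.

    W_k is a list of rows (one per vocabulary item); h_t is assumed to be
    the LSTM hidden state at step t, which this sketch does not compute.
    """
    logits = [sum(w * h for w, h in zip(row, h_t)) for row in W_k]
    return softmax(logits)


def definition_log_prob(step_distributions, token_ids):
    """Log of Equation 1: sum over t of log P(w_t | w_1..w_{t-1}, w*)."""
    return sum(math.log(dist[i]) for dist, i in zip(step_distributions, token_ids))
```

Working in log space avoids numerical underflow when multiplying many small per-token probabilities for longer definitions.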
Adapting the baselines from Noraset et al. (2017), we explore two variants of providing the model with the input entity:
Seed Model: The input entity is given as the first input token to the RNN as a seed. The loss of predicting the start token, <sos>, given the word is not taken into account.
Concat Model: Along with being given as a seed to the model, the input entity is concatenated with the input token of the RNN at every timestep.
We use these as baselines for our approach as well.
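The difference between the two variants lies in how the input sequence is assembled. The sketch below uses plain Python lists as stand-ins for embedding vectors; the helper names are hypothetical, and whether the seed step itself is also concatenated with the entity in the Concat Model is an assumption made here for simplicity.

```python
def seed_inputs(entity_vec, token_vecs):
    """Seed Model: the entity embedding is fed as the first input step."""
    return [entity_vec] + list(token_vecs)


def concat_inputs(entity_vec, token_vecs):
    """Concat Model: seed with the entity, then append the entity
    embedding to the input vector at every timestep.

    Assumption: the seed step is concatenated with the entity as well,
    so that all steps share one input dimension.
    """
    steps = seed_inputs(entity_vec, token_vecs)
    return [step + entity_vec for step in steps]  # list concatenation


entity = [1.0, 2.0]
tokens = [[0.1, 0.2], [0.3, 0.4]]
seed = seed_inputs(entity, tokens)
concat = concat_inputs(entity, tokens)
```

In the Concat variant the LSTM input dimension doubles, since each step carries both the token embedding and the entity embedding.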

Incorporating Entity-Entity Co-occurrence
We propose an extended model that incorporates entity-entity association information from the tags when generating the final definition. We define a co-occurrence-based probability measure for every entity $w_e$. Here $c$ is the count function and $w_e$ is any software entity other than the entity being defined; $c(w_e, w^*)$ is the count of sentences for which the entities $w_e$ and $w^*$ were tagged together. This probability is smoothed for non-entity words with an $\epsilon$ value; we use $\epsilon = 0.0001$ in all cases. To incorporate this probability into our model, we interpolate it with the language model probability defined in Equation 1, using an interpolation weight $\lambda_t$ that is computed from the learnable parameters $W_p$, $W_q$, $W_r$ and $b$. The $\lambda_t$ parameter learns to switch between the contextual language model probability, when generating tokens that form the structure of the definition, and the entity-entity probability, when generating tokens which are themselves entities.
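A minimal sketch of the smoothed co-occurrence measure and its interpolation with the language model probability is given below. The normalization (dividing by the total co-occurrence count of the defined entity), the flooring of zero counts at the smoothing value, and the convex form of the mixture are all assumptions; the text fixes only the count function, the smoothing value of 0.0001, and the existence of a learned gate. The gate is passed in as a plain number here rather than computed from learned parameters.

```python
EPSILON = 0.0001  # smoothing value stated in the text


def cooccurrence_prob(token, target, counts, entities):
    """Smoothed co-occurrence measure for `token` given the entity `target`.

    `counts` maps (w_e, w_star) pairs to the number of sentences tagged
    with both entities. Normalizing by the total count for `target` is an
    assumption of this sketch.
    """
    if token == target or token not in entities:
        return EPSILON  # non-entity words receive the smoothing value
    total = sum(n for (_, other), n in counts.items() if other == target)
    if total == 0:
        return EPSILON
    return max(counts.get((token, target), 0) / total, EPSILON)


def interpolate(p_lm, p_co, lam):
    """Assumed convex mixture: `lam` weights the language model term and
    (1 - lam) the co-occurrence term."""
    return lam * p_lm + (1.0 - lam) * p_co


counts = {("django", "ndjango"): 3, (".net", "ndjango"): 1, ("java", "esper"): 2}
entities = {"django", ".net", "ndjango", "java", "esper"}
p_co = cooccurrence_prob("django", "ndjango", counts, entities)
p_mix = interpolate(0.2, p_co, 0.5)
```

With a gate close to 1 the mixture falls back to the language model, which matches the described behaviour for structural, non-entity tokens.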

Modeling Ontological Category Information
We use a pre-defined set of 86 categories, which are the ontological categories for the software domain proposed as part of the GNAT project (footnote 5). For each category, we compute a distributed representation vector by taking the dimension-wise mean of the embeddings of the constituent tokens in the category name. We term this the average category embedding. We map every entity to its closest category in embedding space using cosine distance. Examples of category mappings to the terms are shown in Table 1. We explore two ways of using the average category embedding:
1) Adding the category embedding vector (ACE) to the word vector of the entity to obtain a new vector, which we hope is closer to the words defining the entity. This is inspired by the idea of additive compositionality of vectors shown in Mikolov et al. (2013b).
2) Concatenating the category embedding vector (CCE) with the input embeddings at every timestep of the LSTM.
In both variants, $E$ denotes the word embedding matrix, the category embeddings are stored in a separate category embedding matrix, and $c(x)$ is a function that maps every entity to its corresponding ontological category.
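The category pipeline described above (averaging token embeddings, nearest-category lookup by cosine similarity, and the ACE and CCE combinations) can be sketched as follows. The toy two-dimensional vectors and all function names are illustrative assumptions, not the actual embeddings or code.

```python
import math


def mean_vector(vectors):
    """Dimension-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def category_embedding(name, word_vecs):
    """Average category embedding from the tokens of the category name."""
    return mean_vector([word_vecs[tok] for tok in name.split()])


def closest_category(entity_vec, category_vecs):
    """Category with the highest cosine similarity (lowest cosine distance)."""
    return max(category_vecs, key=lambda c: cosine_similarity(entity_vec, category_vecs[c]))


def ace(entity_vec, cat_vec):
    """ACE: element-wise sum of entity and category embeddings."""
    return [e + c for e, c in zip(entity_vec, cat_vec)]


def cce(step_vec, cat_vec):
    """CCE: concatenate the category embedding to one timestep's input."""
    return step_vec + cat_vec


# Toy 2-d embeddings, purely illustrative.
word_vecs = {"web": [1.0, 0.0], "framework": [0.0, 1.0], "database": [-1.0, 0.0]}
category_vecs = {name: category_embedding(name, word_vecs)
                 for name in ("web framework", "database")}
entity_vec = [1.0, 0.5]
best = closest_category(entity_vec, category_vecs)
```

ACE keeps the input dimension unchanged, while CCE widens every LSTM input by the category embedding size.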

Loss Augmentation
To force the model to condition on the entity when generating definitions, we propose to augment the standard cross-entropy loss with a loss framework that focuses on reconstructing the entity from the generated sequence. This additionally constrains the model to generate tokens close to the tokens in the definition. We introduce a second LSTM model which reverse-encodes the output text sequence of the forward model and projects the encoded sequence into an embedding space of the same dimension as the term being defined.
$$h_t = \mathrm{LSTM}(y_1, \ldots, y_T) \quad (2)$$
$$e^*_{wr} = W_b \, h_t$$
where $y_1, \ldots, y_T$ is the generated definition sequence and $W_b$ is a weight matrix. We add an additional objective to the model to minimize the cosine distance between the projected vector and the embedding of the input term, where $w^*$ is the input term and $e^*_{wr}$ is the reconstructed term vector.
Footnote 5: http://curtis.ml.cmu.edu/gnat/software/
The resulting network can be trained end-to-end to minimize the cross-entropy loss between the output and target sequences, $L(y, \hat{y})$, in addition to the reconstruction loss between the input and the reconstructed input vector, $L(w^*, e^*_{wr})$. Since the decode step is greedy, gradients cannot propagate through it. To solve this, we share the parameters between the two LSTM networks and between the forward and reconstruction linear layers (Chisholm et al., 2017). To generate definitions at test time, the backward network does not need to be evaluated.
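The combined objective can be sketched as a cross-entropy term plus a cosine-distance term. Taking the cosine distance to be one minus the cosine similarity, averaging the cross entropy over timesteps, and adding a hypothetical `weight` on the reconstruction term are assumptions of this sketch; the text states only that both losses are minimized together.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity (assumed form of the cosine distance)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 1.0  # treat a zero vector as maximally uninformative
    return 1.0 - dot / norm


def cross_entropy(step_distributions, target_ids):
    """Mean negative log-likelihood of the target tokens."""
    nll = -sum(math.log(dist[i]) for dist, i in zip(step_distributions, target_ids))
    return nll / len(target_ids)


def total_loss(step_distributions, target_ids, term_vec, reconstructed_vec, weight=1.0):
    """Cross entropy plus weighted reconstruction loss.

    `weight` is a hypothetical knob; the text does not mention one.
    """
    return (cross_entropy(step_distributions, target_ids)
            + weight * cosine_distance(term_vec, reconstructed_vec))


loss = total_loss([[0.5, 0.5]], [0], [1.0, 0.0], [0.0, 2.0])
```

Because cosine distance ignores vector magnitude, the reconstruction term only constrains the direction of the projected vector relative to the term embedding.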

Experimental Setup
To train the model, we use the 25K definitions dataset built as described in Section 4. We split the data randomly into train, test and validation sets in a 90:5:5 ratio, as shown in Table 2. The words being defined are mutually exclusive across the three sets, and thus our experiments evaluate how well the models generalize to new words.
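A split of this kind can be sketched by shuffling the list of defined entities and slicing it, which guarantees that no entity appears in more than one set. The function name and the fixed seed are illustrative assumptions.

```python
import random


def split_entities(entities, seed=0):
    """Shuffle entities and split them 90:5:5 into train, test and validation.

    Splitting at the entity level keeps the defined words mutually
    exclusive across the three sets.
    """
    items = sorted(entities)            # deterministic starting order
    random.Random(seed).shuffle(items)  # reproducible shuffle
    n_train = len(items) * 90 // 100
    n_test = len(items) * 5 // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    valid = items[n_train + n_test:]
    return train, test, valid


train, test, valid = split_entities(["tag%d" % i for i in range(100)])
```

Integer arithmetic is used for the split sizes so that rounding never moves an entity between sets unexpectedly.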
All of the models use the same set of fixed word embeddings from two different corpora. The first set of vectors is trained entirely on the Stack Overflow background corpus, and the second set consists of pre-trained open-domain word embeddings. The two embeddings are concatenated, and we use the result as the representation for each word. For the embeddings trained on the Stack Overflow corpus, we use the Word2Vec (Mikolov et al., 2013a) implementation of the Gensim toolkit. In the corpus, we prepend to every sentence in a question-answer pair every tag it is associated with. We further eliminate stopwords from the corpus and set a larger context window size of 10.
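The corpus preparation and the embedding concatenation described above can be sketched in plain Python. The stopword list is a tiny illustrative stand-in, and falling back to a zero vector for words missing from the open-domain embeddings is an assumption; the Word2Vec training itself is left out.

```python
# Illustrative stopword list only; the actual list used is not specified.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "in", "and", "how"}


def prepare_sentence(tokens, tags):
    """Prepend the thread's tags to a sentence and drop stopwords."""
    content = [t for t in tokens if t not in STOPWORDS]
    return list(tags) + content


def concat_embeddings(word, so_vecs, open_vecs, open_dim):
    """Concatenate the Stack Overflow vector with the open-domain vector.

    Assumption: a word missing from the open-domain set gets a zero vector.
    """
    return so_vecs[word] + open_vecs.get(word, [0.0] * open_dim)


sentence = prepare_sentence(["how", "to", "parse", "json", "in", "java"], ["java", "json"])
vec = concat_embeddings("ndjango", {"ndjango": [0.1, 0.2]}, {}, 3)
```

Prepending tags places every tag inside the context window of the sentence's words, which, together with the window size of 10, ties tag vectors to the vocabulary of their threads.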
For the model, we use a 2-layer LSTM with a hidden size of 500 units for both the forward and reconstruction layers of the models. The size of the embedding layer is set to 300 dimensions. For training, we minimize the cross-entropy and reconstruction loss using the Adam (Kingma and Ba, 2014) optimizer with a learning rate of 0.001 and a gradient clip of 5. We evaluate the task using BLEU (Papineni et al., 2002).
Table 4: Experimental results of our baselines on common English words dataset (Noraset et al., 2017).
Footnote 6: Our implementation of baselines from Noraset et al. (2017).
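For readers unfamiliar with the metric, the following is a simplified sentence-level BLEU sketch with uniform n-gram weights and a brevity penalty, in the spirit of Papineni et al. (2002). It is not the evaluation script used for the reported scores: it handles a single reference, applies no smoothing, and returns a value in [0, 1], whereas scores such as 10.91 are conventionally reported on a 0 to 100 scale.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference BLEU without smoothing."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped counts: a candidate n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing: any empty precision zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for candidates no longer than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1.0 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)


ref = "ndjango is a port of django".split()
score_same = bleu(ref, ref)
score_none = bleu("completely unrelated words here".split(), ref)
```

Without smoothing, short or loosely matching definitions often score zero at the sentence level, which is one reason corpus-level aggregation is normally preferred.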

Results
The results for the different models are summarized in Table 4. The first section shows the results of the baselines as reported by Noraset et al. (2017). The second section shows the results of our implementation of the baselines on the common English word definitions dataset. We report BLEU scores on definitions generated using a greedy approach. The remaining two sections show results from our proposed models on the software entity definitions dataset.
In comparison, on the software entity definitions dataset, the same baselines do not generate any reasonable definitions, giving low BLEU scores. This demonstrates that using a language model with embeddings trained on general-purpose large-scale, domain-specific corpora is inadequate when the definitions are longer and more domain-specific.
Surprisingly, adding the entity-entity relationships to the Concat Model does not provide any gains. However, providing the category information from the ontology by adding it to the input word vector (ACE) gives a higher BLEU score of 10.86.
Further, adding the reconstruction loss objective to the ACE model provides small additional gains and achieves a BLEU score of 10.91. Although we see incremental improvements on the task, overall our results show that language models are inadequate to model definitions, as empirically shown by the low overall BLEU scores.

Discussion
Noraset et al. (2017) showed that RNN language models can be used to learn to generate definitions for common English words. On adapting their techniques to closed-domain software entities, we find that the language models generate definitions which follow the template of definitions but have incorrect terms in the genus and differentia components. Table 5 shows some reference definitions and definitions generated by our best model. In the generated definition of "virtual-pc", we observe that the model generates a definition which has a distinguishable genus and differentia, but the genus is not the right ontological category for the entity. The differentia is incorrect as well. Similarly, for "esper" the model generates "UNK is a open-source software for the java programming language", which shows that the model is able to learn the right genus for the entity but generates an incorrect differentia component. We see that the reference differentia is quite long, with many non-entity tokens which would be hard to model. This explains why we obtain lower BLEU scores with the entity-entity co-occurrence models, as most differentia terms consist of many non-entity tokens. We also observed that the genus and differentia components of technical definitions have longer and very specific phrases compared to common English words. These phrases also tend to be very sparse in the vocabulary, making the task even more challenging.

Table 5: Reference definitions and definitions generated by our best model.
Entity | Reference definition | Generated definition
virtual-pc | virtual-pc is a virtualization program that simulates a hardware environment using software. | the UNK is a commercial operating-system for the windows operating-system.
esper | | UNK is a open-source software for the java programming language.
The general English definitions dataset presented by Noraset et al. (2017) contains the 20K most common English words and their definitions. The English words for which the definitions are generated also tend to appear in the corpus very frequently, and therefore have better distributed representations. We presume that the higher BLEU scores for common English words reflect this. In contrast, entities in closed domains are much less frequent in background corpora, which increases the difficulty of the task. Also, the average definition length in the common English words definition corpus is 6.6, while the average length of definitions for software entities is 16.54, which adds further complexity to generating these definitions. We hypothesize that, due to the low expressiveness of the word representations of entities in comparison to common English words, language models are unable to learn relations between entities and their genus and differentia components. The addition of the ontological category information alleviates the problem by a small margin, but is still insufficient for the model to learn to generate close-to-perfect definitions.
Through our observations, we find that RNN language models initialized with distributed word representations of entities are inadequate for generating definitions from scratch. We envision that future models should be able to learn better associations between entities and their genus and differentia phrases. Also, such a model should have adequate long-term memory to generate longer definitions.

Conclusion and Future Work
In this paper, we present our initial work on the task of definition generation for software entities or terms. We introduced different approaches for the task, exploring ways of incorporating ontology information and entity-entity co-occurrence relationships. We also presented the results and an analysis of these approaches. Given the complexity of the task, we achieve an improvement of around 2 BLEU points over the baselines. We demonstrate that the current models are inadequate for automatically learning to generate complex definitions for entities.
As an immediate next step, we would like to approach the task from an encoder-decoder perspective by collecting external data about the word being defined and using it to guide the generation process. Our hypothesis is that providing external information about an entity and its usage in various contexts would help us better identify the genus and differentia for the entity. Currently, we give only the immediate parent category as an input from the ontology; we would also like to explore how to leverage the entire ontology structure for definition generation.