There is an increasing interest in studying natural language and computer code together, as large corpora of programming texts become readily available on the Internet. For example, StackOverflow currently has over 15 million programming related questions written by 8.5 million users. Meanwhile, there is still a lack of fundamental NLP techniques for identifying code tokens or software-related named entities that appear within natural language sentences. In this paper, we introduce a new named entity recognition (NER) corpus for the computer programming domain, consisting of 15,372 sentences annotated with 20 fine-grained entity types. We trained in-domain BERT representations (BERTOverflow) on 152 million sentences from StackOverflow, which lead to an absolute increase of +10 F1 score over off-the-shelf BERT. We also present the SoftNER model which achieves an overall 79.10 F-1 score for code and named entity recognition on StackOverflow data. Our SoftNER model incorporates a context-independent code token classifier with corpus-level features to improve the BERT-based tagging model. Our code and data are available at: https://github.com/jeniyat/StackOverflowNER/


Introduction
Recently there has been significant interest in modeling human language together with computer code (Quirk et al., 2015; Iyer et al., 2016; Yin and Neubig, 2018), as more data becomes available on websites such as StackOverflow and GitHub. This is an ambitious yet promising direction for scaling up language understanding to richer domains. Access to domain-specific NLP tools could help a wide range of downstream applications, for example, extracting software knowledge bases from text (Movshovitz-Attias and Cohen, 2015), developing better quality measurements of StackOverflow posts (Ravi et al., 2014), and finding similar questions (Amirreza Shirani, 2019). However, there is a lack of NLP resources and techniques for identifying software-related named entities (e.g., variable names or application names) within natural language texts.
In this paper, we present a comprehensive study that investigates the unique challenges of named entity recognition in the social computer programming domain. These named entities are often ambiguous and rely implicitly on the accompanying code snippets. For example, the word 'list' commonly refers to a data structure, but can also be used as a variable name (Figure 1). In order to recognize these entities, we propose a named entity recognizer (NER) that utilizes a multi-level attention network to combine the textual context with knowledge from code snippets. Using our newly annotated corpus of 15,372 sentences from StackOverflow, we rigorously test our proposed model, which outperforms state-of-the-art BiLSTM-CRF tagging models for identifying 20 types of software-related named entities. Our key contributions are the following:
• A new StackOverflow NER corpus manually annotated with 20 types of named entities, including all in-line code within natural language sentences (§2). We demonstrate that NER in the software domain is an ideal benchmark task for testing the effectiveness of contextual word representations, such as ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019), due to its inherent polysemy and salient reliance on context. For example, 'windows' can be an English word, a variable, or a computer operating system, entirely depending on context.
• An in-domain trained neural NER tagger for StackOverflow (§3) that can recognize 20 fine-grained named entities related to software development. We also tested its performance on GitHub text data, which include readme files and issue reports.
• A code token recognizer (§3.1) that utilizes StackOverflow code snippets to capture the character patterns of code-related entities, and consistently improves the NER tagger.
• In-domain trained ELMo and BERT representations (§3.3) on 152 million sentences from StackOverflow that lead to a more than 14-point increase in F1 score over off-the-shelf ELMo, and significantly outperform off-the-shelf BERT.
Our named entity tagger achieves a 78.41% F1 score on StackOverflow and a 62.69% F1 score on GitHub data for extracting the 20 software-related named entity types. We believe this performance is sufficiently strong to be practically useful. We have released our data and code, including the named entity tagger, our annotated corpus, the annotation guideline, a specially designed tokenizer, and pre-trained StackOverflow ELMo and BERT embeddings.

Annotated StackOverflow Corpus
In this section, we describe the construction of our StackOverflow NER corpus. We randomly selected 1,237 question-answer threads from the StackOverflow 10-year archive (from September 2008 to March 2018) and manually annotated them with 20 types of entities. For each question, four answers were annotated, including the accepted answer, the most upvoted answer, and two randomly selected answers (if they exist). Table 1 shows the statistics of our corpus. 40% of the question-answer threads were double-annotated and used as the development and test sets in our experiments (§4). We also annotated 6,510 sentences from GitHub readme files and issue reports as additional evaluation data.

Annotation Schema
We defined and annotated 20 types of fine-grained entities, including 8 code-related entities and 12 natural language entities. Our annotation guideline was developed through several pilots and further updated with notes to resolve difficult cases as the annotation progressed. Each entity type was defined to encourage maximum span length (e.g., 'SGML parser' instead of 'SGML'). We annotated noun phrases without including modifiers (e.g., 'C' instead of 'Plain C'), except in a few special cases (e.g., 'rich text' as a common FILE TYPE). On average, an entity contains about 1.5 tokens. While VARIABLE, FUNCTION and CLASS names mostly consist of only a single token, our annotators found that some are written as multiple tokens when mentioned in natural language text (e.g., 'array list' for 'ArrayList' in Figure 1). The annotators were asked to read relevant code blocks or software repositories to make a decision, if needed. Annotators also searched Google or Wikipedia to categorize unfamiliar cases.
The annotators were asked to update, correct, or add annotations based on the user-provided code markdown tags. StackOverflow users can utilize code markdowns to highlight the code entities within natural language sentences. However, in reality, many users do not enclose code snippets within code tags, and sometimes use them to highlight non-code elements, such as email addresses, user names, or natural language words. While creating the StackOverflow NER corpus, we found that 59.73% of code-related entities are not marked by the StackOverflow users. Moreover, only 75.54% of the code-enclosed texts are actually code-related, while 10.12% are used to highlight natural language texts. The rest of the cases refer to non-code entities, such as SOFTWARE NAMES and VERSIONS. While markdown tags could be a useful feature for entity segmentation (§3.1.3), we emphasize the importance of having a human-annotated corpus for training and evaluating NLP tools in the software domain.

Annotation Agreement
Our corpus was annotated by four annotators who are college students majoring in computer science. We used a web-based annotation tool, BRAT (Stenetorp et al., 2012), and provided annotators with links to the original posts on StackOverflow. For every iteration, each annotator was given 50 question-answer threads to annotate, 20 of which were double-annotated. An adjudicator then discussed disagreements with the annotators, and also cross-checked the 30 single-annotated questions in each batch. The inter-annotator agreement is 0.62 before adjudication, measured by span-level Cohen's Kappa (Cohen, 1960).
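For reference, Cohen's Kappa measures agreement between two annotators corrected for chance. A minimal sketch of the underlying computation, shown here on token-level labels rather than the span-level variant used in the paper:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators' label sequences of equal length."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of positions where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

A value of 1.0 indicates perfect agreement; 0.0 indicates agreement no better than chance.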

Additional GitHub Data
To better understand the domain adaptability of our work, we further annotated the readme files and issue reports from 143 randomly sampled repositories in the GitHub dump (Gousios and Spinellis, 2012) (from October 29, 2007 to December 31, 2017). We removed all the code blocks from the issue reports and readme files collected from these 143 repositories. The resulting GitHub NER dataset consists of 6,510 sentences and 10,963 entities of 20 types, labeled by two in-house annotators. The inter-annotator agreement on this dataset is 0.68, measured by span-level Cohen's Kappa.

StackOverflow/GitHub Tokenization
We designed a new tokenizer, SOTOKENIZER, specifically for the social computer programming domain. StackOverflow and GitHub posts exhibit common features of web texts, including abbreviations, emoticons, URLs, ungrammatical sentences and spelling errors. We found that tokenization is non-trivial, as many code-related tokens are mistakenly split by existing web-text tokenizers, including the CMU Twokenizer (Gimpel et al., 2011), Stanford TweetTokenizer (Manning et al., 2014), and NLTK Twitter Tokenizer (Bird et al., 2009). Therefore, we implemented a new tokenizer, using Twokenizer as the starting point, and added additional regular expression rules to avoid splitting code-related tokens.
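The regex-based approach can be sketched as follows. The patterns here are illustrative stand-ins, not SOTOKENIZER's actual rule set: code-like tokens (function calls, dotted names, markup tags) are matched before generic word and punctuation splitting, so they survive as single tokens.

```python
import re

# Illustrative code-aware patterns, tried in order before generic splitting.
CODE_PATTERNS = [
    r"\w+(?:\.\w+)+\(\)",  # dotted method calls like os.path.join()
    r"\w+\(\)",            # plain calls like list()
    r"\w+(?:\.\w+)+",      # dotted names like numpy.ndarray
    r"</?\w+>",            # markup tags like <div> or </div>
]
TOKEN_RE = re.compile("|".join(CODE_PATTERNS) + r"|\w+|[^\w\s]")

def tokenize(sentence):
    """Split a sentence, keeping code-like tokens intact."""
    return TOKEN_RE.findall(sentence)
```

For example, `tokenize("Use list() on a numpy.ndarray .")` keeps 'list()' and 'numpy.ndarray' whole, where a generic web-text tokenizer would split them at the parentheses and the period.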

Named Entity Recognition
The extraction of software-related named entities imposes significant challenges, as it requires resolving a significant amount of unseen tokens, inherent polysemy, and salient reliance on context. Unlike in news or biomedical data, spelling patterns and long-distance dependencies are more crucial in the software domain to resolve ambiguities and categorize unseen words. Taken in isolation, many tokens are highly ambiguous and can refer to either programming concepts or common English words, such as 'go', 'react', 'spring', 'while', 'if', 'select'. Therefore, we design the SoftNER model, which leverages sentential context to disambiguate and domain-specific character representations to handle rare words. Figure 2 shows the architecture of our model, which consists of three primary components:
1. An embedding extraction layer (§3.1) that creates contextualized ELMo embeddings and two new domain-specific embeddings for each word in the input sentence.
2. A multi-level attention layer ( §3.2) that combines the three word embeddings using an embedding-level and a word-level attention network.
3. A BiLSTM-CRF layer that predicts the entity type of each word using the weighted word representations from the previous layer.

Input Embeddings
For each word in the input sentence, we extract the ELMo (Peters et al., 2018) representation and two new domain-specific embeddings produced by (i) a Code Recognizer, which represents whether a word can be part of a code entity regardless of context; and (ii) an Entity Segmenter, which predicts whether a word is part of any named entity in the given sentence. Each domain-specific embedding is created by passing a binary value, predicted by a network independent from the SoftNER model, through an embedding layer. We describe the two standalone auxiliary models that generate these domain-based vectors below.
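The mapping from a binary auxiliary prediction to a dense domain-specific embedding amounts to a two-row lookup table. The dimension and initialization below are illustrative assumptions, not the paper's actual values:

```python
import random

class BinaryEmbedding:
    """Look up a dense vector for a binary signal (0 or 1), as done for the
    Code Recognizer and Entity Segmenter outputs. In a trained model these
    two rows would be learned parameters; here they are randomly initialized."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        # One vector per binary value {0, 1}.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(2)]

    def __call__(self, bit):
        return self.table[bit]
```

At inference time, each token's 0/1 prediction from an auxiliary model is replaced by the corresponding row, which is then concatenated or attended over alongside the ELMo vector.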

In-domain Word Embeddings
Texts in the software engineering domain contain programming language tokens, such as variable names or code segments, interspersed with natural language words. This makes input representations pre-trained on general newswire text unsuitable for the software domain. Therefore, we pre-trained different in-domain word embeddings, including ELMo, BERT and GloVe vectors, on the StackOverflow 10-year archive of 2.3 billion tokens (§3.3).

Context-independent Code Recognition
Humans with prior programming knowledge can easily recognize that 'list()' is code, 'list' can be either code or a verb, whereas 'listing' is more likely a non-code token. We introduce a code recognition model to capture such prior probability of how likely a word is to be a code token, without considering any contextual information. It is worth noting that this standalone code recognition model is also useful for language-and-code research, such as retrieving code snippets based on natural language queries (Iyer et al., 2016; Giorgi and Bader, 2018; Yao et al., 2019). Our code recognition model, which eventually generates the Code Recognizer vector, is a binary classifier. It utilizes language model features and character patterns to predict whether a word is a code entity. The input features include unigram word and 6-gram character probabilities from two language models (LMs) trained on the Gigaword corpus (Napoles et al., 2012) and on all the code snippets from the StackOverflow 10-year archive, respectively. We also pre-trained FastText (Joulin et al., 2016) word embeddings using these code snippets, where a word vector is represented as a sum of its character n-grams. We first transform each n-gram probability into a k-dimensional vector using Gaussian binning (Maddela and Xu, 2018), which has been shown to improve the performance of neural models using numeric features (Sil et al., 2017; Liu et al., 2016; Maddela and Xu, 2018). We then feed the vectorized features into a linear layer, concatenate the output with FastText character-level embeddings, and pass them through another hidden layer with sigmoid activation. We predict the token as a code entity if the output probability is greater than 0.5.
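Gaussian binning, which converts a scalar feature (such as an n-gram probability) into a k-dimensional vector of soft bin activations, can be sketched as follows. The even bin spacing and kernel width here are illustrative assumptions; see Maddela and Xu (2018) for the exact formulation:

```python
import math

def gaussian_binning(value, low, high, k=10, gamma=1.0):
    """Map a scalar feature to k soft bin activations. Bin centers are
    evenly spaced over [low, high]; each activation is a Gaussian kernel
    response to the value, so nearby bins fire partially rather than the
    value falling into exactly one hard bin."""
    width = (high - low) / k
    centers = [low + width * (i + 0.5) for i in range(k)]
    return [math.exp(-((value - c) ** 2) / (2 * (gamma * width) ** 2))
            for c in centers]
```

Compared to hard binning, the soft activations let the downstream linear layer interpolate smoothly between neighbouring bins, which is why this vectorization helps neural models consume numeric features.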

Entity Segmentation
The segmentation task refers to identifying entity spans without assigning entity categories. Segmentation is simpler and less error-prone than entity recognition, as it does not require a fine-grained classification of the segmented tokens. In fact, a segmentation model trained on our annotated StackOverflow corpus achieves an accuracy of 97.4 on the dev set (details in §4.5). To leverage the high performance of segmentation for entity recognition, we introduce the Entity Segmenter, which predicts whether each token is part of an entity mention in the given sentence. For this binary tagging task, the model classifies a token as either I-ENTITY or O, instead of using the traditional BIO scheme.
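The conversion from fine-grained NER tags to this binary segmentation scheme is a simple relabeling; the entity tag names below are illustrative examples from the corpus's 20 types:

```python
def to_segmentation_tags(tags):
    """Collapse fine-grained NER tags (e.g., B-CLASS, I-FUNCTION) into the
    binary segmentation scheme: any entity tag becomes I-ENTITY, O stays O."""
    return ["O" if tag == "O" else "I-ENTITY" for tag in tags]
```

This is the same transformation applied to the training data for the segmentation model (§4.1).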
Our segmentation model, which generates the Entity Segmenter vector, consists of a BiLSTM encoder and a CRF decoder. For input, we concatenate ELMo embeddings with two hand-crafted features, namely word frequency and code markdown. The inclusion of hand-crafted features is influenced by Wu et al. (2018), where word shapes and POS tags were shown to improve the performance of sequence tagging models.
Word Frequency represents the word occurrence count in the training set. As many code tokens are defined by individual users, they occur much less frequently than normal English words. In fact, code and non-code tokens have an average frequency of 1.47 and 7.41, respectively, in our corpus. Moreover, ambiguous tokens that can be either code or non-code entities have a much higher average frequency of 92.57. To leverage this observation, we include word frequency as a feature, converting the scalar value into a k-dimensional vector by Gaussian binning (Maddela and Xu, 2018).
Code Markdown indicates whether the given token appears inside a code markdown tag in the StackOverflow post. It is worth noting that code tags are noisy, as users do not always enclose inline code in a code tag, or use the tag to highlight non-code texts (details in §2.1). However, we find it helpful to include the markdown information as a feature, as it improves the performance of our segmentation model.

Multi-Level Attention
We build an aggregated word vector from the input embeddings using a multi-level attention network similar to Yang et al. (2016). We combine the input embeddings in the first attention layer and calculate the importance of each word for the task in the second layer. Although such embedding-level attention is not commonly used in NER, we found it empirically helpful for the software domain.
Embedding-Level Attention We use three embeddings, ELMo (w_{i1}), Code Recognizer (w_{i2}), and Entity Segmenter (w_{i3}), for each word w_i in the input sentence. We introduce the embedding-level attention α_{it} (t ∈ {1, 2, 3}) to capture each embedding's contribution towards the meaning of the word. To compute α_{it}, we pass the input embeddings through a bidirectional GRU and generate their corresponding hidden representations h_{it} = GRU(w_{it}). These vectors are then passed through a non-linear layer, which outputs u_{it} = tanh(W_e h_{it} + b_e). We introduce an embedding-level context vector, u_e, which is learned during training. This context vector is combined with the hidden embedding representations using a softmax function to extract the weight of each embedding:

α_{it} = exp(u_{it}^T u_e) / Σ_t exp(u_{it}^T u_e)

Finally, we create the word vector as a weighted sum of the information from the different embeddings:

word_i = Σ_t α_{it} h_{it}

Weighted Word Representation We also use a word-level weighting factor α_i to emphasize the importance of each word w_i for the NER task. Similar to the embedding-level attention, we calculate α_i from the weighted word vectors word_i. We use a bidirectional GRU to encode the summarized information from neighbouring words and get h_i = GRU(word_i). This is then passed through a hidden layer, which outputs u_i = tanh(W_w h_i + b_w). Using this vector, we extract the normalized weight for each word:

α_i = exp(u_i^T u_w) / Σ_i exp(u_i^T u_w)

where u_w is another word-level context vector that is learned during training. Finally, we compute the weighted word representation α_i h_i, which is then fed into a BiLSTM-CRF network that predicts the entity category for each word.
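The embedding-level attention can be sketched numerically as follows, with the GRU and tanh projection omitted for brevity (the hidden vectors h_it are assumed precomputed, and dimensions are illustrative):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def embedding_attention(hidden_states, context_vector):
    """Weight the three per-embedding hidden vectors h_it by their
    similarity to a learned context vector u_e, then return the
    attention weights and the weighted-sum word vector."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    scores = [dot(h, context_vector) for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    word = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
            for d in range(dim)]
    return alphas, word
```

Hidden states more aligned with the context vector receive larger weights, so the aggregated word vector leans toward whichever of the three embeddings is most informative for that word.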

Implementation Details
We use the PyTorch framework to implement our proposed SoftNER model and the two auxiliary systems, namely the code recognition and entity segmentation systems. Our SoftNER model consists of a BiLSTM encoder with character-level CNN features and a CRF decoder. The input of the network consists of 500-dimensional segmenter vectors, 300-dimensional code recognizer vectors and 1024-dimensional contextual word representations. To extract in-domain word representations, we pre-trained GloVe, ELMo and BERT vectors on 152 million sentences from the StackOverflow archive, excluding all the sentences from the 1,237 posts in our annotated corpus. The pre-training of the 300-dimensional GloVe embeddings, with a frequency cut-off of 5, took 8 hours. We train the SoftNER model and the two auxiliary systems separately. Our segmentation model follows the same architecture and training setup as SoftNER, except for the input, where ELMo embeddings are concatenated with 100-dimensional code markdown and 10-dimensional word frequency features. We set the number of bins k to 10 for Gaussian vectorization. Our code recognition model is a feedforward network with two hidden layers and a single output node with sigmoid activation.

Evaluation
In this section, we show that our SoftNER model outperforms all previous NER approaches on the StackOverflow and GitHub data. We also discuss the factors pivotal to the performance of our model, namely the pre-trained in-domain ELMo embeddings and our two domain-specific vectors.

Data
We train and evaluate our SoftNER model on the StackOverflow NER corpus of 9,352 train, 2,942 development and 3,115 test sentences we constructed in §2. We use the same data for our segmentation model, but replace all the entity tags with an I-ENTITY tag. For the code recognition model, we created a lexicon of 6,000 unique words randomly selected from the train set of the StackOverflow NER corpus. Each word was labelled individually, without context, as CODE, AMBIGUOUS or NON-CODE by two annotators. The inter-annotator agreement was 0.89, measured by Cohen's Kappa. After discarding disagreements, we divided the remaining 5,312 tokens into 4,312 train and 1,000 test instances. Then, we merged the AMBIGUOUS and NON-CODE categories to facilitate binary classification. We name this dataset of 5,312 individual tokens SOLEXICON.

Baselines
We compare our model with the following baseline and state-of-the-art approaches:
• A BiLSTM-CRF model with in-domain ELMo embeddings (ELMoVerflow; details in §3.3). This architecture is the state-of-the-art baseline NER model in various domains (Lample et al., 2016; Kulkarni et al., 2018; Dai et al., 2019).
• A fine-tuned in-domain BERT model, where we fine-tune the in-domain pre-trained BERT-base cased (BERTOverflow; details in §3.3) checkpoint with our annotated corpus.
• A fine-tuned out-domain BERT model, where we fine-tune the out-domain BERT-base cased checkpoint with our annotated corpus.
• A feature-based linear CRF model, which uses standard orthographic, contextual and gazetteer features, along with the code markdown tags and hand-crafted regular expressions to recognize code entities (details in Appendix A).

Results
Table 2 shows the precision (P), recall (R) and F1 score comparison of different models evaluated on the StackOverflow NER corpus. Our SoftNER model outperforms the existing NER approaches in all three metrics. Compared to BiLSTM-CRF, SoftNER demonstrates a 9.7 increase in F1 on the test set.

In-domain vs. Out-domain Word Embeddings
Table 3 shows the performance comparison between in-domain and out-domain word embeddings. We consider off-the-shelf ELMo (Peters et al., 2018) and GloVe (Pennington et al., 2014) vectors trained on newswire and web texts as out-domain embeddings. Using the state-of-the-art BiLSTM-CRF model (Lample et al., 2016; Kulkarni et al., 2018; Dai et al., 2019), we observe a large increase of 13.64 F1 score when employing in-domain ELMo (ELMoVerflow) representations over in-domain GloVe (GloVeOverflow), and an increase of 15.71 F1 score over out-domain ELMo. We found that fine-tuned out-domain BERT (Devlin et al., 2019) outperforms out-domain ELMo (Table 3), although it underperforms in-domain ELMo (ELMoVerflow) by 12.8 F1 score (Table 2) on our StackOverflow NER corpus. Similarly, on GitHub data (more details in §5), in-domain ELMo outperforms the out-domain fine-tuned BERT by 10.67 F1 score (Table 8). In our experiments, fine-tuned BERTOverflow extracts the named entities with higher recall, whereas ELMoVerflow extracts them with higher precision. However, as the overall F1 score of ELMoVerflow is slightly higher than that of BERTOverflow, we used in-domain ELMo for the rest of our experiments.
It is worth noting that the performance improvements from contextual word embeddings are more pronounced in our software domain than in the newswire and biomedical domains. The original ELMo and BERT outperform GloVe by 2.06 and 2.12 points in F1, respectively, on the CoNLL 2003 NER task of newswire data (Peters et al., 2018; Devlin et al., 2019). For the biomedical domain, in-domain ELMo outperforms out-domain ELMo by 1.33 points in F1 on the BC2GM dataset (Sheikhshabbafghi et al., 2018).
We hypothesize that the performance gains from the in-domain contextual embeddings are largely aided by the model's ability to handle ambiguous and unseen tokens. The increase in performance is especially notable (41% → 70% accuracy) for unseen tokens, which constitute 38% of the tokens inside gold entity spans in our dataset. This experiment also demonstrates that our annotated NER corpus provides an attractive test-bed for measuring the adaptability of different contextual word representations.

Evaluation of Auxiliary Systems
Our domain-specific vectors, namely the Code Recognizer and Entity Segmenter vectors, are also crucial for the overall performance of our SoftNER model. Table 6 shows an ablation study. Removing the code recognizer and segmenter vectors results in 2.25 and 4.65 drops in F1 score, respectively. If we replace the embedding-level attention with a simple concatenation of embeddings, the performance also drops by 1.84 in F1. In addition, we evaluate the effectiveness of our two domain-specific auxiliary systems on their respective tasks:
Code Recognition Table 4 compares the performance of our code recognition model with other baselines on the SOLEXICON test set (§4.1), which consists of 1,000 random words from the train set of the StackOverflow NER corpus, classified as either code or non-code tokens. The baselines include: (i) a Most Frequent Label baseline, which assigns the most frequent label according to the human annotation in the SOLEXICON train set; and (ii) a frequency baseline, which learns a threshold over token frequency in the train set of the StackOverflow NER corpus using a decision tree classifier. Our model outperforms both baselines in terms of F1 score. Although the most frequent label baseline achieves better precision than our model, it performs poorly on unseen tokens, resulting in a large drop in recall and F1 score. The ablation experiments show that the FastText word embeddings, along with the character- and word-level features, are crucial for the code token recognition task.
Entity Segmentation Table 5 shows

Error Analysis
Based on our manual inspection, the incorrect predictions made by the NER systems can be largely classified into the following two categories (see examples in Table 7):
Table 7: Representative examples of the two error categories, Segmentation Mismatch and Entity-Type Mismatch. In each example pair, the first sentence contains the gold entities, and the second sentence contains the predicted entities from the NER model.
Segmentation Mismatch refers to cases where the model predicts the boundaries of entities incorrectly. Our SoftNER model reduces such segmentation errors by 80.33% compared to the BiLSTM-CRF baseline.
Entity-Type Mismatch refers to errors where a code entity (e.g., the name of a variable) is predicted as a non-code entity (e.g., the name of a device), and vice versa. Our SoftNER model reduces such entity type errors by 23.34% compared to the BiLSTM-CRF baseline.
As illustrated in Figure 3, our SoftNER model reduces the errors in both categories by incorporating the auxiliary outputs from the segmenter and code recognizer models.

Domain Adaptation to GitHub data
To understand the domain adaptability of our StackOverflow-based SoftNER, we evaluate its performance on readme files and issue reports from 143 randomly sampled repositories in the GitHub dump (Gousios and Spinellis, 2012). We also trained GitHub ELMo embeddings on 4 million sentences from 5,000 randomly sampled GitHub repositories.
Table 8 shows that the performance of our SoftNER model using StackOverflow ELMo embeddings is similar to that of the top-performing BiLSTM-CRF model using GitHub ELMo embeddings, with a difference of only 2.1 points in F1. We also did not observe any significant gain after adding the code recognizer and segmenter vectors to the GitHub ELMo embeddings. We think one likely explanation is that GitHub data contains fewer code-related tokens compared to StackOverflow: the percentage of code-related entity tokens is 63.20% in GitHub and 77.21% in StackOverflow.
Overall, we observe a drop in F1 for our SoftNER tagger from 78.41 on StackOverflow (Table 2) to 62.69 on GitHub data (Table 8) due to domain mismatch. However, we believe that our NER tagger still achieves sufficient performance to be useful for applications on GitHub. We leave further investigation of semi-supervised learning and other domain adaptation approaches for future work.
Related Work

There has been relatively little prior work on named entity recognition in the software engineering domain. Ye et al. (2016) annotated 4,646 sentences from StackOverflow with five named entity types (Programming Language, Platform, API, Tool-Library-Framework and Software Standard). The authors used a traditional feature-based CRF to recognize these entities. In contrast, we present a much larger annotated corpus consisting of 15,372 sentences labeled with 20 fine-grained entity types. We also develop a novel attention-based neural NER model to extract those fine-grained entities.

Conclusion
In this work, we investigated the task of named entity recognition in the social computer programming domain. We developed a new NER corpus, consisting of 15,372 sentences from StackOverflow and 6,510 sentences from GitHub, annotated with 20 fine-grained named entity types. We demonstrate that this new corpus is an ideal benchmark dataset for contextual word representations, as there are many challenging ambiguities that often require long-distance context to resolve. We proposed a novel attention-based model, named SoftNER, that outperforms the state-of-the-art NER models on this dataset. Furthermore, we investigated the important sub-task of code recognition. Our novel code recognition model captures additional spelling information beyond character-based ELMo and consistently improves the performance of the NER model. We believe our corpus and StackOverflow-based named entity tagger will be useful for various language-and-code tasks, such as code retrieval, software knowledge base extraction and automated question answering.
A Feature-Based CRF Baseline

We implemented a CRF baseline model using CRFsuite to extract the software entities. This model uses standard orthographic, contextual and gazetteer features. It also includes the code markdown tags (§3.1.3) and a set of regular expression features. The regular expressions are developed to recognize specific categories of code-related entities. Feature ablation experiments on this CRF model are presented in Table 9. One noticeable distinction from named entity recognition in many other domains is that contextual features are not as helpful in feature-based CRFs for classifying software entities.

Figure 1 :
Figure 1: Examples of software-related named entities in a StackOverflow post.

Figure 2 :
Figure 2: Our SoftNER model. It utilizes an attention network to combine the contextual word embeddings (ELMo) with the domain-specific embeddings (Code Recognizer and Entity Segmenter). The detailed structure of the attention network is depicted on the right.
the performance of our segmentation model on the dev set of our StackOverflow corpus, where the entity tags are replaced by an I-ENTITY tag. Our model achieves an F1 score of 84.3 and an accuracy of 97.4. Incorporating the word frequency and code markdown features increases the F1 score by 1.2 and 2.1 points, respectively. The low 10.5 F1 score of the Stanford NER tagger (Manning et al., 2014), which is trained on newswire text, demonstrates the importance of domain-specific tools for the software engineering domain.

Figure 3 :
Figure 3: Comparison of errors made by the ELMo BiLSTM-CRF baseline and our SoftNER on the dev set of the StackOverflow NER corpus. In the error heatmap, darker cell color corresponds to higher error counts. Our SoftNER model reduces errors in all the categories.

Table 2 :
Evaluation on the dev and test sets of the StackOverflow NER corpus. Our SoftNER model outperforms the existing approaches.

Table 4 :
Evaluation results and feature ablation of our code recognition model on the SOLEXICON test set of 1,000 manually labeled unique tokens, which are sampled from the train set of the StackOverflow NER corpus.

Table 5 :
Evaluation of our segmentation model on the dev set of the StackOverflow NER corpus.

Table 6 :
Ablation study of SoftNER on the dev set of the StackOverflow NER corpus.

Table 8 :
Evaluation on the GitHub NER dataset of readme files and issue posts. All the models are trained on our StackOverflow NER corpus. Our SoftNER model performs close to the BiLSTM-CRF model trained on GitHub ELMo embeddings.

Table 9 :
Feature-based CRF performance with varying input features on the dev set (e.g., Rule and Gazetteer Features: P 69.71, R 40.66, F1 51.36). This is because, in the StackOverflow NER corpus, a significant number of neighbouring words are shared among different software entities. For example, the bigram 'in the' frequently appears as the left context of the following types: APPLICATION, CLASS, FUNCTION, FILE TYPE, UI ELEMENT, LIBRARY, DATA STRUCTURE and LANGUAGE.