Graph-based Semi-supervised Gene Mention Tagging

The rapidly growing biomedical literature has been a challenging target for natural language processing algorithms. One of the tasks these algorithms focus on is called named entity recognition (NER), often employed to tag gene mentions. Here we describe a new approach for this task, an approach that uses graph-based semi-supervised learning to train a Conditional Random Field (CRF) model. Benchmarking it on the BioCreative II Gene Mention tagging task, we achieved statistically significant improvements in F-measure over BANNER, a widely used biomedical NER system. We note that our tool is transductive and modular in nature, and can be integrated with other CRF-based supervised NER tools.


Introduction
Detecting biomedical named entities such as genes and proteins is one of the first steps in many natural language processing systems that analyze biomedical text. Finding relations between entities and expanding knowledge bases are examples of research that depends heavily on the accuracy of gene and protein mention tagging.
Named entity recognition is typically modelled as a sequence tagging problem (Sha and Pereira, 2003). One of the most commonly used models for sequence tagging is a Conditional Random Field (CRF) (Lafferty et al., 2001;Sha and Pereira, 2003).
Many popular and best-performing biomedical named entity recognition systems, such as BANNER (Leaman et al., 2008), Gimli (Campos et al., 2013), and BANNER-CHEMDNER (Munkhdalai et al., 2015), use a CRF as their core machine learning model, built on the MALLET toolkit (McCallum, 2002).
Inspired by the success of graph-based semi-supervised learning methods in other NLP tasks (Subramanya et al., 2010; Zhu et al., 2003; Subramanya and Bilmes, 2009; Alexandrescu and Kirchhoff, 2009; Liu et al., 2012; Saluja et al., 2014; Tamura et al., 2012; Talukdar et al., 2008; Das and Petrov, 2011), we integrated the graph-based semi-supervised algorithm of Subramanya et al. (2010) and adapted their approach to improve on the results from BANNER. We show that our approach achieves a statistically significant improvement in terms of F-measure on the BioCreative II dataset for gene mention tagging.
Semi-supervised learning for gene mention tagging is not without precedent. There have been several semi-supervised approaches to the gene mention task, and they have consistently been more successful than fully supervised approaches (Jiao et al., 2006; Ando, 2007; Campos et al., 2013; Munkhdalai et al., 2015).
Ando (2007) used a semi-supervised approach, Alternating Structure Optimization (ASO), in the BioCreative II gene mention shared task, along with other extensions such as using a lexicon or combining several classifiers. ASO ranked first among all competitors in the 2007 shared task competition. Ando reported that the use of unlabeled data was the most useful part of the system, improving the F-measure of the baseline by 2.09 points, whereas the complete (winning) system had a total improvement of 3.23 points over the baseline CRF (Ando, 2007). Jiao et al. (2006) combined conditional entropy over the unlabeled data with conditional likelihood over the labeled data in the objective function of the CRF. Munkhdalai et al. (2015) trained word representations using Brown clustering (Brown et al., 1992) and word2vec (Mikolov et al., 2013) on MEDLINE and PMC document collections and used them as features along with traditional features in a CRF. Like many of these approaches, we also use unlabeled data to augment our baseline CRF model. In all these previous studies, the unlabelled data was orders of magnitude larger than the labelled data and distinct from the test data.
In this paper we take a transductive approach and use the test set as our unlabelled data. Moreover, our approach is orthogonal to all these approaches and can be used to augment many of them. It can be easily implemented as a post-processing step in any system that uses a CRF model. Examples of such systems include Gimli (Campos et al., 2013) and BANNER-CHEMDNER (Munkhdalai et al., 2015). These tools have achieved the highest F-scores in the literature after ASO (Ando, 2007). Our approach relies on the extraction of label distributions from the CRF and augments the decoding algorithm to incorporate the new information about gene mentions from the graph-based learning approach we describe in this paper.

Method
Like many previous studies (Leaman et al., 2008; Munkhdalai et al., 2015; Campos et al., 2013), we formulate the gene mention tagging problem as a word-level sequence prediction problem, where the label for each word in the input is one of Gene-Beginning, Gene-Inside, or Outside (not a gene). This representation is called IOB (for inside-outside-beginning). For gene mention tagging, we applied a graph-based semi-supervised learning (SSL) approach previously shown to be effective on a similar labelling task, part-of-speech tagging (Subramanya et al., 2010). In graph-based SSL, a graph is constructed to represent partially labelled data. Each node in the graph represents a single word-level gene mention tagging decision, and the edges between the nodes represent similarity between the nodes. The goal is to associate probability distributions over the IOB tags with all vertices. Label distributions for vertices that appear in labelled data are estimated based on the reference labels and propagated to vertices for unlabelled data in the graph. These label distributions are combined with the CRF decoding algorithm used for labelling the test data. Graph-based SSL is categorized into inductive and transductive approaches. In inductive settings (e.g. Subramanya et al. (2010)), a model is trained and can be used as-is for unseen data. In transductive settings, however, the unlabelled data includes the test data. We took a transductive approach, constructing our graph on the union of the train set and the test set as labelled and unlabelled data, respectively.
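As a small illustration of the IOB representation, the following Python sketch converts token-level gene spans into IOB tags. The function name, the short tag names (B, I, O), and the example sentence are assumptions made for illustration and do not come from the BioCreative II data.

```python
def iob_tags(tokens, gene_spans):
    """Assign IOB tags given gene mentions as inclusive (start, end) token spans."""
    tags = ["O"] * len(tokens)
    for start, end in gene_spans:
        tags[start] = "B"                      # first word of the mention
        for i in range(start + 1, end + 1):
            tags[i] = "I"                      # remaining words of the mention
    return tags

# Hypothetical sentence with one three-word gene mention.
tokens = ["The", "p53", "tumor", "suppressor", "binds", "DNA"]
tags = iob_tags(tokens, [(1, 3)])
print(list(zip(tokens, tags)))
```

Each word receives exactly one tag, which is what allows the task to be treated as sequence prediction.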
Since the graph is the cornerstone of the algorithm, let us describe its construction and usage before the overall algorithm.

Graph Construction
We use the following steps for constructing the graph for the gene mention tagging task, adapted from the graph construction for part-of-speech tagging described in Subramanya et al. (2010):
1. Each vertex represents a 3-gram type, and the middle word of this 3-gram is the word that is tagged as a gene mention using the IOB tags. The label distribution for this middle word is learned during graph propagation and subsequently combined with the CRF model at test time.
2. A vertex is represented by a vector of pointwise mutual information values between feature instances and its 3-gram type.
3. Edge weights represent the similarity between vertices and are obtained by computing the cosine similarity of feature vectors of their two end vertices.
4. For each vertex only the K nearest neighbours are kept (default = 10).
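Steps 3 and 4 can be sketched in Python as follows. This is a naive all-pairs version written for illustration, not the implementation used in our experiments; the sparse-dictionary vector format, the function names, and the toy vectors are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {feature: value} dicts."""
    dot = sum(val * b[f] for f, val in a.items() if f in b)
    norm_a = math.sqrt(sum(val * val for val in a.values()))
    norm_b = math.sqrt(sum(val * val for val in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_graph(vectors, k=10):
    """For each vertex, keep edges to its k most similar other vertices."""
    graph = {}
    for u, vec_u in vectors.items():
        sims = [(cosine(vec_u, vec_v), v) for v, vec_v in vectors.items() if v != u]
        sims = [(s, v) for s, v in sims if s > 0.0]      # no edge without shared features
        sims.sort(key=lambda pair: (-pair[0], pair[1]))  # most similar first
        graph[u] = {v: s for s, v in sims[:k]}
    return graph

# Toy vertices; in the paper each vertex is a 3-gram type with a PMI vector.
vectors = {
    "a": {"f1": 1.0, "f2": 1.0},
    "b": {"f1": 1.0},
    "c": {"f2": 2.0},
    "d": {"f3": 1.0},
}
graph = knn_graph(vectors, k=1)
print(graph)
```

Because every vertex is compared with every other vertex, this sketch is quadratic in the number of vertices and is only practical for small graphs.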
We considered several feature sets, namely contextual features (Table 1), simplified contextual features (Table 2), all features from the base CRF model, and the most informative features from the base CRF model. We picked the simplified contextual features based on preliminary results using cross-validation on our development set. To represent a vertex v with 3-gram $w_{-1} w_0 w_1$, we look at all occurrences of its 3-gram in the text, consider the larger context $w_{-2} w_{-1} w_0 w_1 w_2$, and get the lemmas of these words. v is represented by a vector of point-wise mutual information values between all possible feature instances (e.g. all possible lemmas for $w_{-2}$) and $w_{-1} w_0 w_1$. We eliminated extremely frequent features (default > 10,000) to reduce the time complexity of graph construction. This should not affect the structure of the graph substantially, because the point-wise mutual information between a feature and any given vertex decreases as the frequency of the feature increases, leaving extremely frequent features with relatively small weights.
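The vertex representation can be illustrated with the Python sketch below. It assumes, for illustration only, that each feature instance is a (position, lemma) pair drawn from the five-word window; the actual feature templates are those of Table 2. The padding symbols, function name, and toy sentences are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict

def pmi_vectors(sentences, max_feature_count=10_000):
    """Build one PMI vector per 3-gram type from (position, lemma) features."""
    pair_counts = Counter()     # (trigram, feature) co-occurrence counts
    trigram_counts = Counter()  # feature events observed with each trigram
    feature_counts = Counter()  # events observed for each feature
    for lemmas in sentences:
        padded = ["<s>", "<s>"] + list(lemmas) + ["</s>", "</s>"]
        for i in range(2, len(padded) - 2):
            trigram = tuple(padded[i - 1:i + 2])
            for offset in range(-2, 3):
                feature = (offset, padded[i + offset])
                pair_counts[(trigram, feature)] += 1
                trigram_counts[trigram] += 1
                feature_counts[feature] += 1
    total = sum(pair_counts.values())
    vectors = defaultdict(dict)
    for (trigram, feature), n in pair_counts.items():
        if feature_counts[feature] > max_feature_count:
            continue  # drop extremely frequent features
        # PMI = log( p(trigram, feature) / (p(trigram) * p(feature)) )
        vectors[trigram][feature] = math.log(
            n * total / (trigram_counts[trigram] * feature_counts[feature])
        )
    return dict(vectors)

# Two hypothetical lemmatized sentences.
sentences = [["brca1", "gene", "mutation"], ["brca1", "gene", "expression"]]
vectors = pmi_vectors(sentences)
vertex = vectors[("brca1", "gene", "mutation")]
print(vertex[(1, "mutation")], vertex[(0, "gene")])
```

In this toy corpus the lemma "gene" occurs at the centre of two different 3-gram types, so its PMI with either one is lower than that of the rarer feature, which matches the argument above about frequent features.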

Graph Propagation
In graph propagation we associate any given vertex u with a label distribution $X_u$ that represents how likely we think each label is for that vertex. The goal of graph-based SSL is to propagate existing knowledge about the labels through the graph. The initial knowledge about graph nodes is provided by the labeled data and potentially some prior knowledge. Figure 1 shows how graph propagation can assign label distributions to unlabelled vertices and change the label distributions coming from labelled data.
Propagation is accomplished by optimizing an objective function over the label distributions at each node in the graph. The objective function consists of three types of constraints:
1. For any labeled vertex u, the associated label distribution $X_u$ should be close to the reference distribution $\hat{X}_u$ (obtained from labeled data);
2. Adjacent vertices u and v should have similar label distributions $X_u$ and $X_v$;
3. The label distributions of all vertices should comply with the prior knowledge, if such knowledge exists, or be uniformly distributed otherwise.
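The following objective function, which follows the form used by Subramanya et al. (2010), represents these three components:

$$C(X) = \sum_{u \in L} \lVert X_u - \hat{X}_u \rVert^2 + \mu \sum_{u \in V} \sum_{v \in N(u)} w_{uv} \lVert X_u - X_v \rVert^2 + \nu \sum_{u \in V} \lVert X_u - U \rVert^2 \tag{1}$$

where u and v are nodes in the graph, L is the set of labelled vertices, V is the set of all vertices, N(u) is the set of neighbours of u, $w_{uv}$ is the weight of the edge between u and v, U is the uniform distribution over all labels, and µ and ν are weight constants for constraints 2 and 3, respectively. We used Euclidean distance as the distance metric.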
While the first two terms in the objective function, and their corresponding constraints make intuitive sense, the uniformity constraint needs further explanation. The rationale behind using distance from uniform distribution is to avoid preferring a label over others in the absence of strong evidence.
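The objective function is optimized using stochastic gradient descent. We implement the optimization algorithm as described in Subramanya et al. (2010):

$$X_i^{(m)}(y) = \frac{\gamma_i(y)}{\kappa_i} \tag{2}$$

$$\gamma_i(y) = \hat{X}_i(y)\,\delta(i \in L) + \mu \sum_{k \in N(i)} w_{ik} X_k^{(m-1)}(y) + \frac{\nu}{Y}$$

$$\kappa_i = \delta(i \in L) + \nu + \mu \sum_{k \in N(i)} w_{ik}$$

where $X_i^{(m)}$ and $X_i^{(m-1)}$ denote the label distributions of vertex i in iterations m and m − 1, respectively, $w_{ik}$ is the weight of the edge between i and k, δ(i ∈ L) is 1 if and only if i is a labeled vertex, and Y is the number of labels.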
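A minimal Python sketch of this propagation is given below. It assumes the per-vertex iterative update of Subramanya et al. (2010), in which each new distribution is a normalized mix of the reference distribution (for labelled vertices), the weighted neighbour distributions, and the uniform distribution. The function name, the data layout, and the uniform initialization of unlabelled vertices are illustrative assumptions; in the full algorithm the initial distributions would come from the CRF posteriors.

```python
def propagate(graph, reference, num_labels, mu, nu, iterations):
    """Iteratively update label distributions over a weighted graph.

    graph:     {vertex: {neighbour: edge weight}}, every neighbour is also a key
    reference: {labelled vertex: reference label distribution}
    """
    uniform = [1.0 / num_labels] * num_labels
    dist = {u: list(reference.get(u, uniform)) for u in graph}
    for _ in range(iterations):
        new_dist = {}
        for i, neighbours in graph.items():
            labelled = 1.0 if i in reference else 0.0
            # Normalizer: seed weight + uniform weight + total neighbour weight.
            kappa = labelled + nu + mu * sum(neighbours.values())
            gamma = []
            for y in range(num_labels):
                seed = reference[i][y] if i in reference else 0.0
                spread = sum(w * dist[k][y] for k, w in neighbours.items())
                gamma.append(seed + mu * spread + nu / num_labels)
            new_dist[i] = [g / kappa for g in gamma]
        dist = new_dist
    return dist

# Toy graph: labelled vertex "a" (label 0) connected to unlabelled vertex "b".
graph = {"a": {"b": 1.0}, "b": {"a": 1.0}}
reference = {"a": [1.0, 0.0, 0.0]}
dist = propagate(graph, reference, num_labels=3, mu=0.5, nu=0.1, iterations=2)
print(dist["b"])
```

After propagation the unlabelled vertex "b" favours the label of its labelled neighbour, while the ν term keeps some probability mass on the other labels.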

Overall algorithm
Once the label distributions have been propagated through the graph, we need to combine what we learned in the graph with the tagging results from the CRF model. For that we use a self-training algorithm, shown in Figure 2.
On an input of a partially-labeled corpus, we first train a CRF model in a supervised fashion on the labeled data (crf-train, line 1); we then use this trained CRF model to assign label probability distributions to each word in the entire (labeled + unlabeled) corpus (posterior decode, line 4). As a result, each n-gram token in the corpus has a label distribution (the posteriors). For each n-gram type u (a vertex in the graph), we find all instances (n-gram tokens) of u and average over the label distributions of these instances to get a label distribution for u (token to type, line 5). Next, we perform graph propagation (i.e. we optimize the objective function in equation 1) to learn the label distributions for all vertices. Finally, we linearly interpolate the trained CRF model and the label distributions from the graph:

$$X_{int}(t) = \alpha X_{CRF}(t) + (1 - \alpha) X_{Graph}(t) \tag{3}$$

where t is a 3-gram token in a specific sentence, $X_{CRF}(t)$ denotes the posterior probability from the CRF model for the middle word in t, $X_{Graph}(t)$ denotes the label distribution of the 3-gram type of t after graph propagation, and α ∈ [0, 1] is the mixture parameter between the CRF and graph models. The best label for all words in the entire corpus is then found using Viterbi decoding for the CRF using $X_{int}$ instead of $X_{CRF}$ (viterbi-decode, line 7). Viterbi decoding provides us with the best label for every n-gram token in the unlabeled corpus, which implies that our labeled set has grown to include the unlabeled corpus. We re-train the CRF on this expanded training set (crf-train, line 8), and iterate until convergence.
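The interpolation and decoding steps can be sketched in Python as follows. The interpolation is the linear mixture described above, with α weighting the CRF posterior and (1 − α) the graph distribution. The Viterbi routine is a simplified stand-in that combines per-position label probabilities with a label transition matrix; it is not the MALLET decoder, and all numbers, names, and the transition matrix are hypothetical.

```python
import math

LABELS = ["B", "I", "O"]

def safe_log(p):
    """Log probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else -math.inf

def interpolate(x_crf, x_graph, alpha):
    """Mix the CRF posterior and the graph distribution, label by label."""
    return [alpha * c + (1.0 - alpha) * g for c, g in zip(x_crf, x_graph)]

def viterbi(node_probs, trans_probs):
    """Most probable label sequence under node and transition probabilities."""
    m = len(LABELS)
    score = [safe_log(p) for p in node_probs[0]]
    backpointers = []
    for probs in node_probs[1:]:
        new_score, pointers = [], []
        for j in range(m):
            best = max(range(m), key=lambda i: score[i] + safe_log(trans_probs[i][j]))
            new_score.append(score[best] + safe_log(trans_probs[best][j]) + safe_log(probs[j]))
            pointers.append(best)
        score = new_score
        backpointers.append(pointers)
    path = [max(range(m), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return [LABELS[j] for j in reversed(path)]

# Hypothetical posteriors for a two-word sentence (label order B, I, O).
x_crf = [[0.4, 0.1, 0.5], [0.2, 0.5, 0.3]]
x_graph = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]]
trans = [[0.2, 0.5, 0.3], [0.2, 0.4, 0.4], [0.3, 0.0, 0.7]]  # O -> I is forbidden

x_int = [interpolate(c, g, alpha=0.02) for c, g in zip(x_crf, x_graph)]
print(viterbi(x_crf, trans), viterbi(x_int, trans))
```

In this toy case, decoding with the CRF posteriors alone labels both words as Outside, whereas decoding with the interpolated distributions recovers a two-word mention.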
Note that the steps indicated by lines 1, 4, and 8 work on the corpus whereas graph propagation in line 6 works on the graph. So, the step in line 5 takes us from corpus to the graph, and the step in line 7 takes us back from the graph to the corpus.
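The corpus-to-graph step (token to type) amounts to averaging posteriors over all tokens of the same 3-gram type. A minimal Python sketch, with hypothetical names and toy posteriors, is:

```python
from collections import defaultdict

def token_to_type(trigram_tokens, posteriors):
    """Average per-token label posteriors over all tokens of each 3-gram type."""
    sums = {}
    counts = defaultdict(int)
    for trigram, posterior in zip(trigram_tokens, posteriors):
        if trigram not in sums:
            sums[trigram] = [0.0] * len(posterior)
        sums[trigram] = [s + p for s, p in zip(sums[trigram], posterior)]
        counts[trigram] += 1
    return {t: [s / counts[t] for s in total] for t, total in sums.items()}

# Two tokens of the same 3-gram type and one token of another type.
tokens = [("the", "p53", "gene"), ("the", "p53", "gene"), ("in", "the", "cell")]
posteriors = [[0.6, 0.1, 0.3], [0.8, 0.1, 0.1], [0.0, 0.0, 1.0]]
type_dist = token_to_type(tokens, posteriors)
print(type_dist[("the", "p53", "gene")])
```

The reverse step, from graph to corpus, simply looks up the propagated distribution of each token's 3-gram type before interpolation.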

Integration with BANNER
BANNER (Leaman et al., 2008) is a well-known, widely used open-source biomedical named entity recognizer. Many studies have used BANNER for gene mention tagging (Hakala et al., 2015; Pyysalo et al., 2015; Lee et al., 2014; Leaman et al., 2013) and many have cited it as a biomedical NER system with good performance (Dai et al., 2015; Krallinger et al., 2015; Luo et al., 2016; Gonzalez et al., 2016; Hebbring et al., 2015). BANNER uses CRF as its machine learning core, and we used it as our base CRF in lines 1 and 8 in Figure 2. We also modified BANNER's source code in order to extract the posterior probabilities from the underlying MALLET CRF model (line 4). These probabilities were used in lines 5 through 7 in Figure 2. Furthermore, the lemmas we used as features in our graph construction (see section 2.1) came from BANNER's lemmatizer.

[Table 3: Graph-based SSL improves BANNER by increasing the precision. Surviving row of values: 84.93 88.28 86.57]
BANNER also does some post-processing: it discards all the mentions that contain unmatched brackets. We ran our method with and without this post-processing step and verified its utility in our approach as well.
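A filter of this kind can be sketched in Python as below. This is an illustrative re-implementation, not BANNER's code; which bracket types BANNER checks is an assumption here, and the example mentions are hypothetical.

```python
def has_unmatched_brackets(mention):
    """Return True if the mention contains an unbalanced bracket of any kind."""
    closing = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in mention:
        if ch in "([{":
            stack.append(ch)
        elif ch in closing:
            if not stack or stack.pop() != closing[ch]:
                return True      # closing bracket without a matching opener
    return bool(stack)           # leftover openers are also unmatched

mentions = ["IL-2 receptor (alpha", "NF-kappaB", "p21(WAF1)"]
kept = [m for m in mentions if not has_unmatched_brackets(m)]
print(kept)
```

Discarding such candidates trades a small amount of recall for precision, since a mention with an unbalanced bracket is usually a boundary error.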

Experiments
We show improvements over BANNER on the dataset of BioCreative II Gene Mention Tagging Task. This data set contains 15,000 training sentences and 5,000 test sentences. Annotations are given by the starting character index and finishing character index of the gene in the sentence (space characters are ignored). Some sentences have alternative annotations presented in a separate file.
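Because the annotation offsets ignore space characters, they must be mapped back to ordinary string positions before they can be aligned with tokens. The Python sketch below shows one way to do this, assuming zero-based offsets with an inclusive end index; the function name and example sentences are hypothetical.

```python
def mention_text(sentence, start, end):
    """Return the text spanning non-space characters start..end (inclusive)."""
    # Positions in the original string of every non-space character.
    positions = [i for i, ch in enumerate(sentence) if not ch.isspace()]
    return sentence[positions[start]:positions[end] + 1]

print(mention_text("The p53 gene", 3, 5))
print(mention_text("human p53 gene", 5, 11))
```

The second call shows that a mention spanning several words is returned with its internal spaces restored.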
The upper part of Table 3 shows the results of BANNER; Graph-Based SSL without post-processing; and Graph-Based SSL with post-processing. The hyper-parameters of Graph-Based SSL were chosen by cross-validation over different train/test splits with different hyper-parameters tested for each split (α = 0.02, µ = $10^{-6}$, ν = $10^{-4}$, and number of iterations = 2). Table 3 shows that the improvement we get in F-measure is due to better precision, which is further boosted by dropping the candidates with unmatched parentheses (which is our only post-processing step).
The lower part of Table 3 puts our method in context. Although our method is competitive with these best-performing methods in the literature, it has not outperformed any of them other than BANNER. Its precision, however, is better than that of all other methods with the exception of Gimli. It would be interesting to integrate the graph-based approach into the ones with CRF as their machine learning core (BANNER-CHEMDNER, Gimli, and the approach of ) to further test the utility of the graph approach.

Qualitative analysis
To understand the differences between BANNER and the graph propagation results, a human domain expert compared the errors occurring in their respective outputs. Table 4 shows the number of these errors as well as some examples. These examples illustrate two important observations. First, there are examples of categories more general than genes in both false positives and false negatives for both systems. For example, Kinase is a functional group of proteins; POZ/Zn, Ig-like domain, and SH2 are protein domains; and E3 ubiquitin and NF-kappaB are gene families. Anecdotal evidence suggests that this is due to the presence of similar annotations in the training/test data set. For example, the bZIP protein, a protein family, and Ig-like domain, a gene/protein functional domain, were both annotated as genes. This calls for a better gene mention corpus annotated according to more recent gene annotation guidelines. Second, there are some hard-to-explain false positives in BANNER. Examples include Ann Arbor, a city in Michigan; SAS GLM, a type of statistical test; and 1.6-kb cDNA, a molecular length. Our graph-based approach has eliminated these false positives.

Cross validation study
We conducted extensive cross-validation experiments using different train and test splits in order to explore the hyper-parameter values and to detect trends in the values that were optimal for this task. The results show that graph-propagation consistently improves results over BANNER. Figures 3 and 4 were created by running graph-propagation over different train and test splits with different hyper-parameter values for each split. For each train/test split, we show only the Pareto optimal points (for each choice of hyper-parameters we include it in the graph only if the performance is not dominated by another choice in both recall and precision). Figure 3 illustrates two points: 1) the precision and recall for the different Pareto optimal points for each train/test split are very similar, and 2) overall the different train/test splits have similar precision and recall values. Figure 4 shows the performance for each train/test split as the difference from the BANNER scores for that split. It shows that the precision scores of graph-propagation are always better than the BANNER baseline, while recall is sometimes worse. The F-scores for all train/test splits and for all Pareto optimal points in each split are always better than the BANNER baseline.

Figure 4: The same points as in Figure 3 shown as the difference from the BANNER scores for the same train/test split. The origin in this graph is the BANNER score. Each cluster of points in Figure 3 becomes a line in this graph.
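The selection of Pareto optimal points can be made concrete with a short Python sketch. A point is kept unless some other point is at least as good in both precision and recall; the function name and the (precision, recall) values below are hypothetical.

```python
def pareto_optimal(points):
    """Keep the (precision, recall) points that no other point dominates."""
    def dominates(a, b):
        # a dominates b if it is at least as good on both axes and not identical.
        return a[0] >= b[0] and a[1] >= b[1] and a != b
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical results for four hyper-parameter settings on one split.
points = [(0.88, 0.84), (0.87, 0.86), (0.86, 0.85), (0.88, 0.83)]
front = pareto_optimal(points)
print(front)
```

Here the third setting is dominated by the second and the fourth by the first, so only two settings remain on the front.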
We can collect useful statistics about which hyper-parameter values are the most useful in graph-propagation in this task from the extensive set of experiments described above: for different train/test splits and for each split with different hyper-parameter values. Figure 5 shows the number of times different hyper-parameter values have appeared in the set of Pareto optimal points over all the train/test splits.
The hyper-parameter α (see equation 3) controls the interpolation between the BANNER posterior probability over labels and the label distribution from the graph-propagation step. Higher α values would prefer BANNER over graph-propagation. Figure 5 shows that smaller α values are preferred, which implies that the label distribution produced through graph-propagation is found to be more useful than the label distribution produced by BANNER. We also investigated the two extreme cases of α = 0 (only graph) and α = 1.0 (only BANNER followed by an extra Viterbi decoding step), and observed that both of these options were worse than the BANNER baseline.
In equation (1), higher ν values keep the label distribution at each vertex of the graph closer to the uniform distribution. Higher µ values would allow adjacent vertices to have a greater influence on the label distribution at the vertex. Figure 5 shows that, in our experiments, graph-propagation is sensitive to the values of µ. Lower µ values appear in Pareto optimal points more often. On the other hand, Figure 5 shows that graph-propagation is not as sensitive to different values of ν as long as it is not too high ($10^{-1}$). This might be due to our setting, where about 73% of vertices are labelled.
We looked for strong correlations between ν values, µ values, and number of iterations in graph propagation and found none.
Finally, for different iteration numbers of graph-propagation, we collected the frequency with which each number appeared in the Pareto optimal results. One iteration of graph-propagation produced 68 Pareto optimal points, two iterations produced 198 points, and three iterations produced 120 points in our experiments. This shows that having more than one iteration of graph-propagation can improve the results.
Our algorithm (Figure 2) has two levels of iteration: an outer iteration (the while loop) and an inner iteration in graph propagation. The numbers mentioned above refer to this inner iteration. All our reported results are for one outer iteration only. Our experiments in this paper were in a transductive setting where the graph was constructed over the test and training data. For this reason we did not experiment extensively with more than one outer iteration. In future work, we plan to experiment with increasing the amount of unlabeled data, and in this setting explore increasing the number of outer iterations.

A note on scalability
The most time-consuming step in our approach was graph construction, where the bottleneck is computing the edge weights between all possible vertex pairs. We experimented with a naive algorithm, where for every vertex pair the values of feature vectors for shared features were considered, and the cosine similarity was computed. We also implemented a variation on it, where the similarities between all pairs sharing a specific feature instance were computed, and the contributions of individual feature instances were summed to give the final similarity between any given pair. The first algorithm was too slow, as expected, due to its $O(|V|^2)$ time complexity; the second one was too slow due to high-frequency features. This is an important issue, since the graph needs to be constructed for our approach to work on a new dataset.
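The second variant can be sketched in Python with an inverted index from feature instances to the vertices that contain them. This is an illustration under assumed data structures, not the code used in our experiments; the function name, the cap on postings-list length, and the toy vectors are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

def pairwise_cosine(vectors, max_postings=10_000):
    """Cosine similarity for every vertex pair sharing at least one feature."""
    index = defaultdict(list)            # feature -> [(vertex, value), ...]
    for vertex, vec in vectors.items():
        for feature, value in vec.items():
            index[feature].append((vertex, value))
    dots = defaultdict(float)
    for feature, postings in index.items():
        if len(postings) > max_postings:
            continue  # a feature shared by n vertices contributes n*(n-1)/2 pairs
        for (u, a), (v, b) in combinations(postings, 2):
            dots[(u, v)] += a * b        # this feature's share of the dot product
    # Norms use all features, so skipping a feature above only approximates cosine.
    norms = {v: math.sqrt(sum(x * x for x in vec.values())) for v, vec in vectors.items()}
    return {
        (u, v): d / (norms[u] * norms[v])
        for (u, v), d in dots.items()
        if norms[u] and norms[v]
    }

vectors = {
    "a": {"f1": 1.0, "f2": 1.0},
    "b": {"f1": 1.0},
    "c": {"f2": 2.0},
    "d": {"f3": 1.0},
}
sims = pairwise_cosine(vectors)
print(sims)
```

Pairs with no shared feature are never visited, which is the source of the savings, but the comment on the inner loop also shows why a single very frequent feature can dominate the running time.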
Apart from the graph construction, the graph-based approach is as scalable as CRF if a labeled train set is available for the new domain, as the CRF only needs to be trained on the new labelled set. If we wish to adapt the method to a domain where there is no labelled data in the target domain, there is no need for any training.

Figure 5: These graphs show the number of times specific hyper-parameter values α, µ and ν appeared in Pareto optimal points over all train/test splits.

Conclusion and future directions
Our results show that propagating labels from 3-grams present in the training set to 3-grams appearing only in the test set can significantly improve BANNER, a well-known and frequently used biomedical named entity recognition system, on the gene mention tagging task. Our cross-validation study shows the robustness of this improvement. We also presented a qualitative comparison by a human domain expert. Our ideas for future work are categorized into three groups:
1. Adding more unlabelled data: The only unlabelled data we included in the graph were the test data. Since the success of semi-supervised learning methods is usually due to a huge amount of unlabelled data, we plan to use many more PubMed abstracts to construct the graph. This, however, will be challenging because the graph construction can be time-consuming, as it was in our case due to high-frequency features.
2. Constructing a better graph: The contextual features we used to construct our graph are only one of the feature sets that have been shown to be useful in the gene mention tagging task. Other feature sets include orthographic features, contextual features learnt from neural networks, and features from parse trees. These features may also prove useful in constructing a graph that represents the similarity between gene mentions. Also, we can preprocess the raw sentences to collapse some collocations into one word so that the middle word in the 3-gram vertices is more meaningful.
3. Improving the latest approach: Although BANNER is one of the most frequently used biomedical named entity recognition systems, it is not the one with the best performance. Previous approaches have improved on BANNER in a variety of ways, including semi-supervised learning. In particular, Munkhdalai et al. have achieved an F-measure of 87.04 by including word representations learnt from massive unlabelled data as features (Munkhdalai et al., 2015). We plan to test our approach on their freely available system.