A Semi-Markov Structured Support Vector Machine Model for High-Precision Named Entity Recognition

Named entity recognition (NER) is the backbone of many NLP solutions. F1 score, the harmonic mean of precision and recall, is often used to select and evaluate the best models. However, when precision needs to be prioritized over recall, a state-of-the-art model might not be the best choice. There is little in the literature that directly addresses training-time modifications to achieve higher-precision information extraction. In this paper, we propose a neural semi-Markov structured support vector machine model that controls the precision-recall trade-off by assigning weights to different types of errors in the loss-augmented inference during training. The semi-Markov property provides more accurate phrase-level predictions, thereby improving performance. We empirically demonstrate the advantage of our model when high precision is required by comparing against strong CRF-based baselines. In our experiments with the CoNLL 2003 dataset, our model achieves a better precision-recall trade-off at various precision levels.


Introduction
Named Entity Recognition (NER) is the task of locating and categorizing phrases into a closed set of classes, such as organizations, people, and locations. NER is an information extraction task that is important for understanding large bodies of text and is an essential component of many natural language processing (NLP) pipelines. The most common evaluation metric for information extraction tasks is F1, the harmonic mean between precision and recall: that is, false positives and false negatives are weighted equally.
In certain real-world applications (e.g., medicine and finance), extracting wrong information is much worse than extracting nothing; hence, in such domains, high precision is emphasized. Trade-offs between precision and recall have been well researched for classification (Joachims, 2005; Jansche, 2005; Cortes and Mohri, 2004). However, barring studies on inference-time heuristics, there is limited work on training precision-oriented sequence tagging models. In this paper, we present a method for training precision-driven NER models.

* Work conducted while working at Bloomberg L.P.
By defining custom loss objectives for the structured SVM (SSVM) model, we extend cost-sensitive learning (Domingos, 1999; Margineantu, 2001) to sequence tagging problems. A difficulty in applying cost-sensitive learning to NER is that the model needs to operate on segmentations of the input sentence and the labels of the segments. Inspired by the semi-Markov CRF (Sarawagi and Cohen, 2005), we propose a semi-Markov SSVM model that scores and labels consecutive tokens together, which allows us to interact directly with the segment-level errors in the precision-beneficial loss of the SSVM model.
We compare our semi-Markov SSVM model with several competitive inference-time baselines that have been proposed for high-precision NER. Our results show that our model outperforms competitive baselines on organization names, and is at least as good as the best inference-time approaches at some precision levels for other NER classes.

Related Work
For classification, several papers try to optimize different evaluation metrics directly. Joachims (2005) proposes an SSVM model for optimizing multivariate performance measures of binary classification tasks; F_β is one of the metrics in their examples. Similarly, Jansche (2005) maximizes expected F-measure, while Cortes and Mohri (2004) and Narasimhan and Agarwal (2013) optimize AUC and partial AUC, respectively. However, these methods cannot be directly applied to sequence tagging, as labels are assigned at the token or segment level.
Cost-sensitive classification (Domingos, 1999; Margineantu, 2001; Elkan, 2001; Zadrozny et al., 2003) is another body of work where different misclassification errors have different costs, and one attempts to minimize the total cost that a model incurs on the test data. Our approach uses similar ideas (we make the cost of false-positive predictions higher than the false-negative cost) and can therefore be viewed as a cost-sensitive model for sequence tagging problems.
For sequence tagging problems, inference-time heuristics for tuning the precision-recall trade-off of information extraction models have been proposed. Culotta and McCallum (2004) calculate confidence scores for the phrases extracted by a CRF model; these scores are used for sorting and filtering extractions. Similarly, Carpenter (2007) computes phrase-level conditional probabilities from an HMM model, and tries to increase the recall of gene name extraction by lowering the threshold on these probabilities. Given a trained CRF model, Minkov et al. (2006) hyper-tune the weight of the feature which indicates that a token is not a named entity. Changing this weight can encourage or discourage the CRF decoding process from extracting entities. We compare our model with these inference-time approaches.

Models
We adopt the BiLSTM-CNNs architecture (Ma and Hovy, 2016) to extract features from a sequence of words for all models in this paper. 1 Each word is passed through a character-level CNN, and the result is concatenated with a GloVe word embedding (Pennington et al., 2014) to form the input of a bi-directional LSTM. One layer of feed-forward neural network then maps the word representation obtained from the BiLSTM into k (label) dimensions.

1 Our implementation is based on NCRF++ (Yang and Zhang, 2018).
At the output layer, instead of using a CRF (Lafferty et al., 2001) to capture the output label dependencies, we use the SSVM objective (Tsochantaridis et al., 2004). While CRFs have consistently given state-of-the-art NER results, their objective function is difficult to modify directly for high-precision extraction. Hence, we select the SSVM formulation, as it allows us to directly modify the loss function for high precision. Given training sequences (x_i, y_i), i = 1 . . . m, the loss function for SSVM is:

L = Σ_{i=1}^{m} max_{y ∈ Y_{x_i}} [ ∆(y_i, y) + s(x_i, y) − s(x_i, y_i) ],

where ∆ is the Hamming loss between two sequences, Y_{x_i} contains all possible label assignments for the sentence x_i, and s is the decoding score between input sentence x and label sequence y.
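The loss above can be made concrete with a toy example. The sketch below enumerates all label sequences of a tiny sentence to perform loss-augmented inference by brute force; the function names (`hamming`, `score`, `ssvm_loss`) and the scores are illustrative, not from the paper, and a real implementation would use Viterbi decoding instead of enumeration.

```python
from itertools import product

LABELS = ["O", "B-ORG", "I-ORG"]

def hamming(y_gold, y):
    """Standard Hamming loss: 1 per mismatched token."""
    return sum(a != b for a, b in zip(y_gold, y))

def score(unary, y):
    """Decoding score s(x, y): sum of per-token label scores (transitions omitted)."""
    return sum(unary[j][lab] for j, lab in enumerate(y))

def ssvm_loss(unary, y_gold):
    """max_y [Delta(y_gold, y) + s(x, y)] - s(x, y_gold), clipped at zero."""
    best = max(hamming(y_gold, y) + score(unary, y)
               for y in product(LABELS, repeat=len(y_gold)))
    return max(0.0, best - score(unary, y_gold))

# Tiny example: two tokens, gold labels B-ORG I-ORG.
unary = [{"O": 0.1, "B-ORG": 1.0, "I-ORG": 0.0},
         {"O": 0.2, "B-ORG": 0.0, "I-ORG": 0.9}]
print(ssvm_loss(unary, ["B-ORG", "I-ORG"]))
```

The hinge is positive whenever some wrong labeling scores within its Hamming margin of the gold labeling, which is what pushes the model to separate the gold sequence from competitors by a loss-scaled margin.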

High-Precision SSVM
Without modifications, the SSVM performs similarly to the CRF. However, the presence of ∆(y_i, y) in the SSVM loss allows us to design custom loss functions for high-precision NER. No inference-time changes are introduced.

Class-specific Token-level Loss
The first modification we make is to pick a target entity class and modify ∆(y_i, y) to assign a word-wise loss of c_tgt to false positives on the target class and a loss of c_oth to all other types of errors. That is, letting y_i^j be the j-th element of sequence y_i, we define

∆(y_i, y) = Σ_j δ(y_i^j, y^j), where δ(y_i^j, y^j) = 0 if y^j = y_i^j; c_tgt if y^j ≠ y_i^j and y^j belongs to the target class; and c_oth otherwise.

Note that the target class in the above equation contains all the labels related to the target entity type; that is, if the target class is ORG, we consider B-ORG and I-ORG to be the related labels. Typically c_tgt ≫ c_oth, so that false positives on the target class generate more loss, thereby discouraging the model from making such decisions. Both c_tgt and c_oth are determined through hyper-parameter tuning. Setting c_tgt = c_oth = 1 falls back to the standard Hamming loss.
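A minimal sketch of this class-specific token cost follows. The symbols are written `c_tgt` and `c_oth` here, matching the two weights described above; the helper names are ours, not the paper's.

```python
TARGET = {"B-ORG", "I-ORG"}  # all labels tied to the target entity type

def token_cost(gold, pred, c_tgt, c_oth):
    """0 if correct; c_tgt for a false positive on the target class,
    c_oth for any other error."""
    if pred == gold:
        return 0.0
    if pred in TARGET:
        return c_tgt
    return c_oth

def delta(y_gold, y_pred, c_tgt=5.0, c_oth=0.01):
    """Class-specific weighted Hamming loss between two label sequences."""
    return sum(token_cost(g, p, c_tgt, c_oth) for g, p in zip(y_gold, y_pred))

# Predicting ORG where the gold label is O is penalized heavily;
# dropping an ORG token (predicting O) is penalized lightly.
print(delta(["O", "B-ORG"], ["B-ORG", "O"]))
```

With c_tgt = c_oth = 1, `delta` reduces to the plain Hamming loss, matching the fallback noted above.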
Semi-Markov SSVM A problem with the token-level loss is that it does not always reflect phrase-level errors accurately; it may over-generate loss, since a phrase can consist of multiple tokens, and it is unclear how individual token false positives contribute to phrase-level false positives.
Therefore, we try a semi-Markov variation of the SSVM following Sarawagi and Cohen (2005). The semi-Markov formulation groups consecutive tokens into segments. Whole segments are treated as single units, and only transitions between segments are modeled. We ignore all intra-segment transitions, effectively collapsing the number of labels to 5 (ORG, PER, LOC, MISC, O, instead of the BIO labelling scheme for the CoNLL data). The score of each segment is obtained by summing the word-level class scores of the words in the segment (Ye and Ling, 2018). We restrict segments to be ≤ 7 tokens long, and we do not use any additional segment-level features. During decoding, all possible segmentations of a sentence (with segments of length ≤ 7) are considered. The architecture of our BiLSTM semi-Markov SSVM model is shown in Figure 1.
To tune the semi-Markov SSVM model toward high precision for a specific class, a segment contributes c_tgt to the loss if it is predicted as the target class and does not exist in the gold segmentation. Other types of errors in the prediction have a loss of c_oth. This is analogous to the class-specific loss used at the token level in the SSVM formulation. In our experiments, we refer to the token-level model simply as SSVM, and to the segment-level model as semi-Markov SSVM.
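The segment-level analogue of the cost can be sketched as below; the weight names and example spans are ours. A predicted segment labeled with the target class that is absent from the gold segmentation costs c_tgt, and any other erroneous predicted segment costs c_oth.

```python
def segment_delta(gold_segs, pred_segs, target="ORG", c_tgt=5.0, c_oth=0.01):
    """Segment-level class-specific loss over (start, end, label) triples."""
    gold = set(gold_segs)
    loss = 0.0
    for seg in pred_segs:
        if seg in gold:
            continue  # correct segment: no loss
        _, _, label = seg
        loss += c_tgt if label == target else c_oth
    return loss

gold = [(0, 2, "ORG"), (2, 3, "O")]
pred = [(0, 3, "ORG")]            # too-long ORG span: a target-class error
print(segment_delta(gold, pred))
```

Because the unit of loss is a whole segment, a three-token spurious ORG phrase incurs one c_tgt here rather than three token-level penalties, which is the over-generation issue the token-level loss suffers from.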

Results
All experiments were conducted on the CoNLL 2003 English dataset. We first show the performance of the CRF, SSVM, and semi-Markov SSVM models without tuning for high precision in Table 1. We see that all three models perform similarly, with the CRF being slightly better. These numbers are the starting points for the rest of the experiments. We compare the proposed models with the following inference-time baselines:

Thresholded CRF We compute the probability of each extracted phrase using the Constrained Forward-Backward algorithm (Culotta and McCallum, 2004). An extraction is dropped if its phrase probability is lower than a given threshold, a tunable hyper-parameter.
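The thresholding step itself is simple to illustrate. In the sketch below, the confidences are fabricated stand-ins for phrase probabilities from Constrained Forward-Backward; the function name and candidate list are ours.

```python
def threshold_extractions(extractions, threshold):
    """Keep only extractions whose confidence meets the threshold.
    extractions: list of (phrase, label, confidence) triples."""
    return [(phrase, label) for phrase, label, conf in extractions
            if conf >= threshold]

candidates = [("Acme Corp", "ORG", 0.97),
              ("Smith", "PER", 0.62),
              ("Paris", "LOC", 0.88)]
print(threshold_extractions(candidates, 0.8))
```

Raising the threshold trades recall for precision: only the most confident phrases survive.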
Bootstrap CRF By generating bootstrap samples of the CoNLL training set, we train 100 BiLSTM-CRF models. To increase precision over a single CRF, we decode each sentence with each of the 100 models and compute the votes for each proposed named entity. The threshold (percentage of votes) for accepting a candidate entity is hyper-tuned.
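The voting rule can be sketched as below; the span encoding and example outputs are fabricated for illustration.

```python
from collections import Counter

def vote_filter(model_outputs, min_fraction):
    """Keep entities proposed by at least min_fraction of the models.
    model_outputs: one list of (span, label) entities per model."""
    votes = Counter(e for out in model_outputs for e in set(out))
    n = len(model_outputs)
    return sorted(e for e, v in votes.items() if v / n >= min_fraction)

# Three bootstrap models; only the ORG span is proposed by all of them.
outputs = [[("0-2", "ORG"), ("4-5", "PER")],
           [("0-2", "ORG")],
           [("0-2", "ORG"), ("7-8", "LOC")]]
print(vote_filter(outputs, 0.6))
```

Requiring broad agreement across resampled models suppresses idiosyncratic (high-variance) extractions, which is where the precision gain comes from.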
Using the dev set, we tune the hyper-parameters of each model to the point at which the desired precision is achieved. For our proposed SSVM-based models, the hyper-parameters are c_tgt and c_oth. 3 To speed up training, we initialize the parameters of the entire model (neural network and SSVM) using a pre-trained model with c_tgt = 1, c_oth = 1, and train further for 20 epochs.

Figure 2: Precision-recall trade-off of the proposed SSVM model versus baselines: semi-Markov SSVM outperforms all models for ORG, is on par with Thresholded CRF for LOC, and is competitive for the PER class. The detailed numbers are listed in the Appendix.
We set several precision levels from 90 to 100. For each precision level, we choose the hyper-parameters which achieve precision higher than the target level with the maximum F1 score on the dev set, and report the corresponding test performance. The results are shown in Figure 2. Thresholded CRF can achieve a wider range of precision than the SSVM-based models; in this figure, we focus only on the range which the SSVM-based models can achieve.
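The selection rule described above can be sketched as a small filter-then-argmax; the run tuples here are fabricated for illustration.

```python
def select_run(runs, target_precision):
    """Among dev-set runs, keep those meeting the target precision and
    return the one with the best dev F1 (or None if none qualify).
    runs: list of (hyperparams, dev_precision, dev_f1)."""
    feasible = [r for r in runs if r[1] >= target_precision]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r[2])

runs = [({"c_tgt": 1.0}, 0.89, 0.91),
        ({"c_tgt": 3.0}, 0.93, 0.88),
        ({"c_tgt": 5.0}, 0.96, 0.84)]
print(select_run(runs, 0.92))
```

The test-set numbers reported in Figure 2 then come from whichever configuration this rule picks on the dev set, so the dev set alone drives the precision/recall operating point.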
We can see that semi-Markov SSVM clearly outperforms all the other models for ORG, is on par with Thresholded CRF for LOC, and has some strong points in the high-precision region for PER. The good performance on ORG is consistent with the observation in Ye and Ling (2018) that semi-Markov models have advantages on longer phrases, because labels are assigned directly at the segment level. Since longer mentions tend to have smaller phrase probabilities, and the length of ORG mentions varies more than that of the other two types, Thresholded CRF is less robust for ORG. The token-based SSVM is consistently worse than semi-Markov SSVM and fails to achieve higher precision, especially for PER. This shows that the semi-Markov property penalizes false positives at the phrase level more accurately. Bootstrap CRF does not perform well for ORG and LOC, but is fairly strong for PER at some precision levels. We believe the higher performance of Bootstrap CRF on the PER class comes from the fact that the baseline CRF model itself achieves very high precision for this class, which allows the bootstrapping technique to accurately reduce the variance of the predictions. This makes the bootstrapping approach more promising in situations where models have already achieved very high precision.

3 c_tgt is searched in the range between 1 and 5, and c_oth between 0.0001 and 0.1.

Error Analysis
We perform error analysis for the two main methods: Thresholded CRF and semi-Markov SSVM. We pick model settings such that both models achieve the same precision level (ORG: 94.5, PER: 97.9, LOC: 95.5) for a given class. Table 2 shows the recall values achieved by these models for different entity mention lengths. We can see that semi-Markov SSVM clearly outperforms Thresholded CRF on multi-token mentions, especially long organization names. The high percentage of long mentions in ORG explains semi-Markov SSVM's superior performance in Figure 2. However, we also see that semi-Markov SSVM produces more "larger predicted span" errors, so its recall on unit-length mentions is lower than that of Thresholded CRF. We believe this is a side effect of semi-Markov models being more willing to predict longer segments.
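The length-bucketed recall used in this analysis is straightforward to compute; the gold and predicted spans below are fabricated examples, and the function name is ours.

```python
from collections import defaultdict

def recall_by_length(gold_spans, pred_spans):
    """Recall per mention length. Spans are (start, end, label) triples
    over token indices; length = end - start."""
    pred = set(pred_spans)
    hits, totals = defaultdict(int), defaultdict(int)
    for span in gold_spans:
        length = span[1] - span[0]
        totals[length] += 1
        hits[length] += span in pred  # exact-match credit only
    return {length: hits[length] / totals[length] for length in totals}

gold = [(0, 1, "LOC"), (3, 6, "ORG"), (8, 9, "PER"), (10, 13, "ORG")]
pred = [(0, 1, "LOC"), (3, 6, "ORG")]
print(recall_by_length(gold, pred))
```

Note that exact-match credit is what makes "larger predicted span" errors costly: a prediction that extends one token past the gold boundary counts as a miss for that gold mention.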
These two methods can be applied together to achieve even better results. For example, thresholding and bootstrap techniques can be applied to semi-Markov SSVM models as well. In this work, we focus on showing the performance of individual approaches.
Another question is which types of errors are reduced when tuning toward precision. We find that precision tuning reduces all error types, but especially MISC-type errors for all three classes (i.e., MISC mentions being classified as one of the other three classes).

Conclusion
We proposed a semi-Markov SSVM model for high-precision NER. To the best of our knowledge, it is the first training-time model for high-precision structured prediction. Experimental results show that our model performs better than inference-time approaches at several precision levels, especially for longer mentions. The proposed model offers promising future extensions, such as directly optimizing other metrics like recall and F_β. This work also opens up a range of questions, from modeling to evaluation methodology.