NTNU-1@ScienceIE at SemEval-2017 Task 10: Identifying and Labelling Keyphrases with Conditional Random Fields

We present NTNU’s systems for Task A (prediction of keyphrases) and Task B (labelling as Material, Process or Task) at SemEval 2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications (Augenstein et al., 2017). Our approach relies on supervised machine learning using Conditional Random Fields. Our system yields a micro F-score of 0.34 for Tasks A and B combined on the test data. For Task C (relation extraction), we relied on an independently developed system described in (Barik and Marsi, 2017). For the full Scenario 1 (including relations), our approach reaches a micro F-score of 0.33 (5th place). Here we describe our systems, report results and discuss errors.


Approach
We chose Conditional Random Fields (Lafferty et al., 2001) because they have produced state-of-the-art results on comparable sequence labelling tasks, such as named entity recognition in biomedicine. Two systems were developed, using different feature sets and alternative CRF implementations.
Preprocessing Input text is linguistically analysed using the spaCy NLP pipeline (Honnibal and Johnson, 2015), including sentence splitting, tokenisation, lemmatisation and dependency parsing. Since CRFs cannot handle Brat's stand-off annotation format directly, keyphrase annotations are first converted to the Inside-Outside-Begin (IOB) tagging scheme by aligning their character offsets with the character offsets of tokens: if the start character offset of a token coincides with the start offset of an annotated keyphrase, the token receives a B (begin) tag; if the offset span of a token falls within the offset bounds of a keyphrase, the token gets an I (inside) tag; otherwise the token is assigned an O (outside) tag. Each sentence corresponds to a sequence of IOB tags, serving as the labelled sequence for the CRF. Separate IOB tags are derived for each of the three keyphrase classes (Material, Process, Task). Annotations and tokens do not always properly align; the resulting errors are discussed in Section 3.
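The offset-alignment step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual code; tokens and keyphrases are represented here simply as (start, end) character spans.

```python
# Sketch of the character-offset alignment producing one IOB tag per token
# (hypothetical helper; names are illustrative).

def to_iob(tokens, keyphrases):
    """tokens: list of (start, end) offsets; keyphrases: list of (start, end) offsets."""
    tags = []
    for tok_start, tok_end in tokens:
        tag = "O"
        for kp_start, kp_end in keyphrases:
            if tok_start == kp_start:
                tag = "B"  # token starts exactly where a keyphrase starts
                break
            if kp_start < tok_start and tok_end <= kp_end:
                tag = "I"  # token falls inside the keyphrase span
                break
        tags.append(tag)
    return tags

# 'novel carbon material' annotated as a single keyphrase spanning chars 0-21
tokens = [(0, 5), (6, 12), (13, 21), (22, 24)]
print(to_iob(tokens, [(0, 21)]))  # ['B', 'I', 'I', 'O']
```

Misaligned annotations (discussed in Section 3) fall through to the O tag in this scheme, which is one source of the conversion errors reported later.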
System 1 relies on the CRFsuite implementation (Okazaki, 2007) as wrapped by the sklearn-crfsuite module for scikit-learn. A dedicated classifier is trained for each of the three keyphrase classes. CRFs are used with default parameter settings. The following features were selected per class by cross-validation on the development data:

• Word features: word shape (e.g. 'Xxxx'), is-alpha, is-lower-case, is-ascii, is-capitalized, is-upper-case, is-punctuation, like-number, prefix-chars (2,3,4), suffix-chars (2,3,4), is-stopword, all in a window of size 3 for Material and Process;

• Lemma and POS, in a window of size 5 for Material and Process, in a window of size 3 for Task;

• WordNet (for Material only): synset names of all hypernyms (transitive closure), in a window of size 5.

Supervised learning is generally hampered by skewed class distributions, where minority classes tend to be predicted poorly. In our case, the O tag is by far the most frequent tag. To reduce its weight, all sentences without a Material keyphrase are removed from the training material of the CRF for predicting the Material class, and likewise for the other two classes.
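The class-specific filtering of training sentences can be sketched as below. Feature extraction and the sklearn-crfsuite fit itself are omitted; the function and variable names are illustrative, not taken from the actual system.

```python
# Sketch of the training-data filter: for each class-specific CRF, drop
# sentences whose IOB sequence contains no keyphrase of that class, reducing
# the dominance of the O tag (hypothetical helper).

def filter_training_sentences(sentences, tag_sequences):
    """Keep only (sentence, tags) pairs whose tags contain a B tag."""
    return [(s, t) for s, t in zip(sentences, tag_sequences) if "B" in t]

sents = [["We", "use", "graphene"], ["Results", "are", "shown"]]
tags = [["O", "O", "B"], ["O", "O", "O"]]
# only the first sentence contains a keyphrase of the target class
print(filter_training_sentences(sents, tags))
```

The filtered pairs would then be passed to a per-class CRF trained with default parameters, as described above.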
Output is postprocessed with the intention of improving consistent labelling throughout a single text. For example, if a majority of the occurrences of the phrase 'carbon' in a text is labelled as Material, then any unlabelled occurrences are by extension also labelled as Material.
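The majority-label propagation step can be sketched as follows. This is a hedged reconstruction under the assumption that a label is propagated only when it accounts for a strict majority of a phrase's occurrences; names are illustrative.

```python
# Sketch of document-level label propagation: unlabelled occurrences of a
# phrase inherit the label carried by the majority of its occurrences
# (hypothetical helper, assumed strict-majority threshold).
from collections import Counter

def propagate_majority_labels(occurrences):
    """occurrences: list of (phrase, label-or-None) pairs from one document."""
    majority = {}
    for phrase in {p for p, _ in occurrences}:
        labels = [l for p, l in occurrences if p == phrase and l is not None]
        total = sum(1 for p, _ in occurrences if p == phrase)
        if labels:
            label, count = Counter(labels).most_common(1)[0]
            if count > total / 2:  # label held by a strict majority
                majority[phrase] = label
    return [(p, l if l is not None else majority.get(p)) for p, l in occurrences]

occ = [("carbon", "Material"), ("carbon", "Material"), ("carbon", None)]
print(propagate_majority_labels(occ))  # the unlabelled 'carbon' becomes Material
```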

IOB-to-Brat conversion
The final step consists of merging the IOB tags predicted by the three separate models in order to produce labelled keyphrases in Brat format.
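For a single class, the IOB-to-span step can be sketched as below: contiguous B/I runs are merged into character spans that can then be serialised in Brat stand-off format. This is an illustrative sketch, not the actual conversion code.

```python
# Sketch of merging predicted IOB tags back into labelled character spans
# for one keyphrase class (hypothetical helper).

def iob_to_spans(tokens, tags, label):
    """tokens: (start, end) offsets; tags: parallel IOB tags for one class."""
    spans = []
    start = None
    prev_end = None
    for (tok_start, tok_end), tag in zip(tokens, tags):
        if tag == "B":
            if start is not None:           # close the previous span
                spans.append((label, start, prev_end))
            start = tok_start               # open a new span
        elif tag == "I" and start is not None:
            pass                            # extend the current span
        else:                               # an O tag closes any open span
            if start is not None:
                spans.append((label, start, prev_end))
                start = None
        prev_end = tok_end
    if start is not None:
        spans.append((label, start, prev_end))
    return spans

tokens = [(0, 5), (6, 12), (13, 21), (22, 24)]
print(iob_to_spans(tokens, ["B", "I", "I", "O"], "Material"))
# [('Material', 0, 21)]
```

In the full system this merge is performed once per class, and the three resulting span lists are combined into a single Brat file.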
Experimental setup Cross-validation on the training data was used to select features and tune hyper-parameters. The best-performing systems were then evaluated on the dev data to check for undesired overfitting. Finally, the best systems were trained on the combination of train and dev data to make predictions on the test data.
Relation extraction For Task C (relation extraction), we relied on an independently developed system described in (Barik and Marsi, 2017), which performs exhaustive pairwise classifications of keyphrase pairs of the same type within a sentence.

Results
Results for our three systems are shown in Table 1. Micro averages are weighted across the three labels for keyphrases and the two relation types, but since keyphrases are substantially more frequent, the weight of the relations is relatively small. System 1 performs worst and system 2 performs best, although the differences are small; where system 1 wins, it is mainly on precision. Combining both in system 3 does not offer any advantages, except for higher recall. All systems obtain their best scores for Material and their worst for Task. This can be partly explained by the support for each class: Material and Process instances are much more frequent than Task instances in the training data.
Another part of the explanation may be that Process and Task are harder to distinguish from each other.
Results on test data are substantially lower than on the dev data, with 6 to 7 percent lower average F-scores. This suggests that the models were overfitted on the combination of train and dev data. This is somewhat surprising, because no such differences showed up between cross-validated scores on the training data and scores on the dev data.
Performance on relation extraction is rather poor when compared with the scores obtained with manually annotated keyphrases as input. This is to be expected, as errors in keyphrase extraction propagate to errors in relation extraction. For more analysis of the relation extraction system, see (Barik and Marsi, 2017).

Discussion
IOB tags The offsets of annotated phrases did not always properly align with the beginning or end of a token. This was partly due to tokenisation errors. In particular, spaCy tended to treat periods as part of an abbreviation instead of the end of a sentence. For example, it took the period after 'Co(II)OEP.' to be part of an abbreviation rather than a sentence ending, which does not align with the annotated phrase 'Co(II)OEP'. Likewise, words compounded with a dash or slash (e.g. 'solid-liquid') were sometimes individually annotated as keyphrases but not split by spaCy, or the other way around. There were also errors where annotators did not include all characters in the text span (e.g. 'ossil mass' instead of 'fossil mass') or unintentionally included extra characters (e.g. 'EBL and HSQ development, t').
In order to estimate the impact of IOB conversion errors on the scores, we converted annotated keyphrases in Brat format to IOB format and then back to Brat format. We then used the eval.py script to compute the scores of the resulting 'predictions'. The number of misalignments and their impact on precision, recall and F-score are shown in Table 2. We conclude that the impact of conversion to IOB tags on F-score is relatively small: at most 1 to 3 percent, assuming all predictions are correct.
Failed attempts We tried tuning the CRF hyper-parameters using grid search (for run 1), optimising the micro-averaged F-score over the B and I tags. However, this criterion did not correspond well with the official scores reported by eval.py. In fact, CRFs with optimised hyper-parameters yielded lower official scores than CRFs with default parameter settings. Optimising directly on the official scores is more expensive and complicated because of the conversion of IOB tags to Brat annotations, but doing so may improve performance.
Qualitative error analysis Errors were analysed over a random sample of 10% of the documents from the test data, using the output of the best system (2). This analysis shows that almost half of the errors are words or phrases incorrectly tagged as keyphrases. The other half are due to incorrect boundaries (19%), such as ERP system instead of hybrid ERP system in S0166361516300926; an incorrect label (18%), e.g. FIB instruments as Material instead of Process in S0168583X14003929; or both incorrect boundaries and label (15%), e.g. finding a group of optimized coefficients in S0021999113002945 is automatically annotated as Process whereas optimized coefficients is Material in the test data.
Other types of errors are those in which the same phrase has been annotated with two different labels and only one of these is correct. For example, SNR (S096386951400070X) or DP (S0010938X15301268) are both Material and Process, but only the former exists in the gold standard data. This is especially frequent among acronyms.
It is worth mentioning that some of these errors are due to errors already present in the annotated test data. For instance, RH ceramics in value of the fracture toughness of RH ceramics is clearly some kind of material, but it is unlabelled in the gold standard data. In addition, this analysis shows that more than three quarters of these errors are due to keyphrases incorrectly labelled as Material (43%) or Process (42%), whereas only 15% are Task. Interestingly, a similar proportion of keyphrases is observed in the training data: there are considerably fewer keyphrases labelled as Task (1132) than as Process (2992) or Material (2608). For example, nuclear fission reactors in S0263822312000657 was labelled as Material but is a Task in the gold standard data; capture features in the solution (S0021999113006955) was predicted as Task but should be a Process; and optimized coefficients in S0021999113002945 was predicted as a Task but is a Material.
Regarding coverage, 62 entities are not covered by System 2 at all, amounting to 35% of the gold standard data. The distribution of these errors is very similar to the one reported for precision, with 45% of the uncovered entities being Material, 40% Process and 15% Task. For instance, in S0021999113006955 there are two instances of true surface that were ignored by the classifier. Interestingly, another mention of the same keyphrase in the same document was correctly annotated as Material. However, postprocessing of predictions to enforce consistent labelling in System 1 did not show any net improvement.