The NTNU System at SemEval-2017 Task 10: Extracting Keyphrases and Relations from Scientific Publications Using Multiple Conditional Random Fields

This study describes the design of the NTNU system for the ScienceIE task at the SemEval 2017 workshop. We use self-defined feature templates and multiple conditional random fields with extracted features to identify keyphrases along with categorized labels and their relations from scientific publications. A total of 16 teams participated in evaluation scenario 1 (subtasks A, B, and C), with only 7 teams competing in all sub-tasks. Our best micro-averaging F1 across the three subtasks is 0.23, ranking in the middle among all 16 submissions.


Introduction
Keyphrases are usually regarded as phrases that capture the main topics mentioned in a given text. Automatically extracting keyphrases and determining their relations from scientific articles has various applications, such as recommending articles to readers, matching reviewers to submissions, and facilitating the exploration of huge document collections. An adapted nominal group chunker and a supervised ranking method based on support vector machines have previously been used to extract keyphrase candidates (Eichler and Neumann, 2010). A conditional random field based keyphrase extraction method has also been presented (Bhaskar et al., 2012). A naïve approach has been proposed to investigate characteristics of keyphrases with section information from well-structured scientific articles (Park et al., 2010). Features broadly used for supervised approaches in scientific articles have been assessed in the compilation of a comprehensive feature list (Kim and Kan, 2009). Maximal sequences and page ranking have been combined to discover latent keyphrases within scientific articles (Ortiz et al., 2010). Noun phrases containing multiple modifiers have been extracted from earth science publications and generalized by matching tree patterns to the syntax trees of the source texts (Marsi and Öztürk, 2015). Keyphrase boundary classification has been regarded as a multi-task learning problem using deep recurrent neural networks (Augenstein and Søgaard, 2017).
The ScienceIE task seeks solutions to automatically identify keyphrases within scientific publications, label them, and determine their relationships. Specifically, the ScienceIE task contains three subtasks: (A) Identification of keyphrases: to identify all the keyphrases within a given scientific publication; (B) Classification of identified keyphrases: to label each keyphrase as Process, Task, or Material; (C) Extraction of relationships between two identified keyphrases: to label the relation between keyphrases as Hyponym-of or Synonym-of.
The ScienceIE task presents three evaluation scenarios. In Scenario 1, only plain text is given for subtasks A, B, and C; in Scenario 2, plain text with manually annotated keyphrase boundaries is given for subtasks B and C; and in Scenario 3, plain text with manually annotated keyphrases and their types is given for subtask C. System output is matched against a gold standard to measure system performance. The micro-averaged precision, recall, and F1 across the subtask(s) are used in the task. Each participating team can submit at most three results, and the best result for each evaluation scenario is taken as the performance of the participating team.
This article describes the NTNU (National Taiwan Normal University) system for the ScienceIE task at the SemEval 2017 workshop. Our solution uses multiple conditional random fields at the sentence level. Each sentence is parsed to obtain features, including words, lemmas, part-of-speech tags, and syntactic phrases. CRFs are then trained to learn sequential patterns using the datasets provided by task organizers. We participated in the evaluation scenario 1 with three subtasks. Our best micro-averaging F1 of 0.23 ranked in the middle of all 16 submissions.
The rest of this paper is organized as follows. Section 2 describes the details of the NTNU system for the ScienceIE task. Section 3 presents the evaluation results and performance comparisons. Section 4 discusses some findings. Conclusions are finally drawn in Section 5.

The NTNU System
Our proposed approach uses the Conditional Random Field (CRF) technique (Lafferty et al., 2001), a type of discriminative probabilistic graphical model, which learns from linguistically motivated features to extract keyphrases from scientific articles and identify their relations. The linear-chain CRF is empirically effective for predicting a sequence of labels given an input sequence. Each word in a sentence is regarded as a state in our CRF. Given an observation and its adjacent states, described in terms of the distinguishing features, the probability of reaching a state is estimated using stochastic gradient descent. In the testing phase, the CRF reports the sequence of categories with the largest probability as the identified result.
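For reference, the standard linear-chain CRF of Lafferty et al. (2001) can be written as below. The notation (feature functions f_k, weights \lambda_k, normalizer Z) is the usual textbook form and is an assumption for illustration, not notation taken from our implementation.

```latex
P(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)

\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} P(\mathbf{y} \mid \mathbf{x})
```

Here \mathbf{x} is the token sequence of one sentence with its features, \mathbf{y} is the label sequence, and \hat{\mathbf{y}} is the label sequence reported in the testing phase.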
The following four features are used for training the CRF model with the Stanford CoreNLP toolkit (Manning et al., 2014).
• Word: the original words in the sentence of a scientific article are directly used without any revision.
• Lemma: inflectional forms and derivationally related forms are reduced to the lemma of a word in terms of its intended meaning.
• Part-of-Speech: noun, verb, adjective, adverb, pronoun, etc.
• Syntactic Phrase: a phrasal category, which is a type of syntactic unit in the grammar structure. Noun phrases are usually regarded as keyphrases in scientific texts. Hence, we only adopt noun phrases and their upper phrasal category as features.
Table 1 shows an example sentence with its corresponding features. Each row denotes a token in the sequence. In addition to words, the remaining three features (i.e., lemmas, part-of-speech tags, and syntactic phrase tags) are provided by the Stanford CoreNLP toolkit. Table 2 shows the same example sentence with encoding for training multiple CRF models. We use the simplest IO encoding, which tags each token as either being in a particular type of keyphrase X or in no keyphrase (denoted as "O").
Table 1: An example sentence with features.
We regard the relations Synonym-of and Hyponym-of as individual types in this sequential labeling problem. The one-vs.-rest strategy, which involves training a single classifier per class, is adopted using the samples of a class as positive instances and all other samples as negatives. In total, we have five CRF models, one for each type (i.e., Task, Process, Material, Synonym-of, and Hyponym-of). During the testing phase, all trained CRF models are run in parallel, each labeling one type. The tags predicted by the Synonym-of and Hyponym-of CRF models depend on the other three models, because the pairs of keyphrases must be identified first before relations can hold between them. Hence, we check the pairs of keyphrases and keep only those that are also identified by the Task, Process, and Material CRF models. Finally, we integrate all identified results as our system outputs without handling any conflicts.
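The one-vs.-rest IO labeling and the relation check described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual code: the function names are hypothetical, spans are given in token positions, and the rule that a relation span must exactly match a span tagged by one of the keyphrase models is an assumed reading of the check.

```python
from typing import Dict, List, Set, Tuple

KEYPHRASE_TYPES = ["Task", "Process", "Material"]
RELATION_TYPES = ["Synonym-of", "Hyponym-of"]

Span = Tuple[int, int]  # (start, end_exclusive) in token positions


def io_labels(n_tokens: int, spans: List[Tuple[int, int, str]], target: str) -> List[str]:
    """IO-encode one sentence for a single target type (one-vs.-rest).

    Tokens inside a span of the target type get the type name; all others get "O".
    """
    labels = ["O"] * n_tokens
    for start, end, span_type in spans:
        if span_type == target:
            for i in range(start, end):
                labels[i] = target
    return labels


def tagged_spans(labels: List[str]) -> List[Span]:
    """Collect maximal runs of non-"O" tokens as (start, end_exclusive) spans."""
    spans: List[Span] = []
    start = None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing run
        if label != "O" and start is None:
            start = i
        elif label == "O" and start is not None:
            spans.append((start, i))
            start = None
    return spans


def filter_relation_spans(predictions: Dict[str, List[str]]) -> Dict[str, List[Span]]:
    """Keep a relation span only if a keyphrase model tagged the same span.

    Assumption: "identified by the keyphrase models" means an exact span match.
    """
    keyphrases: Set[Span] = set()
    for kp_type in KEYPHRASE_TYPES:
        keyphrases.update(tagged_spans(predictions[kp_type]))
    return {
        rel: [s for s in tagged_spans(predictions[rel]) if s in keyphrases]
        for rel in RELATION_TYPES
    }
```

One consequence of IO encoding visible in `tagged_spans` is that two adjacent keyphrases of the same type merge into a single run, since the scheme has no begin marker.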

Data
The datasets for the ScienceIE task were provided by the task organizers. The collected corpus consisted of journal articles from ScienceDirect open access publications evenly distributed among Computer Science, Material Science and Physics. The training, development, and test datasets consisted of sampled paragraphs, of which 350 were used for training, 50 for development, and 100 for testing. These datasets were made available to participants without copyright restrictions.
No external resources were used to supplement the datasets. To pre-process and post-process the datasets, we converted between the character-based start/end offsets of the annotations and word-based positions.
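The conversion between character offsets and word positions can be sketched as below. This is an illustrative sketch, not our pre-processing code: the function names are hypothetical, and it assumes that every token appears verbatim in the text in order, which may not hold if a tokenizer normalizes characters.

```python
from typing import List, Optional, Tuple


def token_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Locate each token in the text, scanning left to right.

    Returns (start, end_exclusive) character offsets per token.
    Raises ValueError if a token cannot be found after the previous one.
    """
    offsets = []
    cursor = 0
    for tok in tokens:
        start = text.index(tok, cursor)
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def char_span_to_token_span(
    offsets: List[Tuple[int, int]], char_start: int, char_end: int
) -> Optional[Tuple[int, int]]:
    """Map a character span to (first_token, last_token_exclusive).

    A token is included if it overlaps the character span at all.
    """
    hits = [i for i, (s, e) in enumerate(offsets) if s < char_end and e > char_start]
    if not hits:
        return None
    return hits[0], hits[-1] + 1


def token_span_to_char_span(
    offsets: List[Tuple[int, int]], first: int, last_exclusive: int
) -> Tuple[int, int]:
    """Map a token span back to character offsets for writing system output."""
    return offsets[first][0], offsets[last_exclusive - 1][1]
```

The first function supports pre-processing (gold annotations to token labels) and the last supports post-processing (predicted token spans back to character offsets).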

Implementation
The CRF++ toolkit was used for system implementation. CRF++ is an open source implementation of conditional random fields for segmenting or labeling sequential data, and is available at https://taku910.github.io/crfpp/. The Supplementary Material in the Appendix shows the feature templates used in our implemented system. Each line denotes one template, in which the first character, "U" or "B", indicates a unigram or bigram feature, respectively. In each template, the special macro %x[row,col] specifies a token in the input data, where row is the position relative to the current focus token and col is the absolute position of the column.
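To make the template macro concrete, the following sketch expands CRF++-style %x[row,col] macros for one focus token. It is a simplified re-implementation for illustration only: the template strings and feature rows are hypothetical examples rather than the templates in our Appendix, and the `_PAD` marker for positions outside the sentence is a stand-in, since CRF++ applies its own boundary handling.

```python
import re
from typing import List

# Matches macros such as %x[-1,0] or %x[0,2] (no spaces, as in CRF++ templates).
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template: str, rows: List[List[str]], pos: int) -> str:
    """Expand every %x[row,col] macro in a template for the token at index pos.

    rows holds one feature row per token: [word, lemma, pos_tag, phrase].
    """

    def lookup(match: "re.Match[str]") -> str:
        row = pos + int(match.group(1))  # relative to the focus token
        col = int(match.group(2))        # absolute column index
        if 0 <= row < len(rows):
            return rows[row][col]
        return "_PAD"  # assumption: placeholder for out-of-sentence positions

    return MACRO.sub(lookup, template)


# Hypothetical three-token sentence with the four feature columns.
rows = [
    ["CRFs", "crf", "NNS", "NP"],
    ["are", "be", "VBP", "VP"],
    ["trained", "train", "VBN", "VP"],
]
print(expand("U01:%x[0,0]", rows, 1))
print(expand("U02:%x[-1,2]/%x[0,2]", rows, 1))
```

The second template combines the part-of-speech tags of the previous and current tokens into a single feature string, which is how context windows are expressed in a template file.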
The encoding scheme we used was one-hot. We had 5 columns, where the first four denoted the features, i.e., Word, Lemma, Part-of-Speech, and Syntactic Phrase, and the last was the given type (e.g., Process or not) for training a specific CRF model to label that type. In the testing phase, the same template file was used and the last column was the type predicted by the trained CRF model.
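The five-column layout can be sketched as a small serializer. CRF++ reads one token per line with whitespace-separated columns and an empty line between sentences; the function name and the example rows below are hypothetical, and the sketch assumes no feature value contains whitespace.

```python
from typing import List


def to_crfpp(sentences: List[List[List[str]]], labels: List[List[str]]) -> str:
    """Serialize sentences into CRF++ column format.

    sentences: per sentence, one [word, lemma, pos_tag, phrase] row per token.
    labels:    per sentence, one label per token for a single target type.
    """
    blocks = []
    for rows, sent_labels in zip(sentences, labels):
        if len(rows) != len(sent_labels):
            raise ValueError("each token needs exactly one label")
        lines = ["\t".join(row + [label]) for row, label in zip(rows, sent_labels)]
        blocks.append("\n".join(lines))
    # An empty line separates consecutive sentences.
    return "\n\n".join(blocks) + "\n"
```

Under the one-vs.-rest setup, the same four feature columns would be written five times, each time with the label column of a different type.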

Metrics
The traditional metrics of precision, recall, and F1-score were computed to measure system performance for each subtask. The micro-averaging strategy was then used to obtain an overall score across the subtask(s). Table 3 shows our results for each defined type. "Task" for subtask B and "Hyponym-of" for subtask C clearly performed worse than the other three types. Table 4 shows our results for each subtask. Comparing subtask C with subtasks A and B shows that the former is relatively more difficult.
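Micro-averaging pools the true positive, false positive, and false negative counts over all types before computing the scores, as in the sketch below. This is not the official evaluation script, and the counts in the usage example are made-up numbers for illustration, not our results.

```python
from typing import Dict, Tuple


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def micro_average(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Pool (tp, fp, fn) over all types, then compute precision, recall, and F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return prf(tp, fp, fn)


# Made-up counts per type: (true positives, false positives, false negatives).
example = {"Process": (6, 2, 4), "Task": (2, 2, 6)}
print(micro_average(example))
```

Because the counts are pooled, frequent types weigh more heavily in the micro-averaged score than rare ones.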

Comparisons
Of the total 16 submissions, 9 teams did not participate in subtask C. We participated in all subtasks, achieving a micro-averaged F1 of 0.23 and ranking 9th of the 16 submissions.

Discussion
For this task, we only use multiple CRF models with four defined features. Apart from the Stanford CoreNLP toolkit for extracting features, we do not use any other methods, such as a named entity recognition (NER) tool. Our error analysis suggests that NER may help improve the performance of Task keyphrase identification. It is also difficult to extract the Hyponym-of relation, because the existing feature templates are limited in capturing long-distance dependencies.
During the development phase, we attempted to identify the relations between extracted phrases using manually crafted rules. Our multiple CRF models with the help of rules improved the performance on the development set, but performed worse on the test set. Hence, we did not adopt rules in the final system. Our observations suggest that human-crafted rules do not perform well due to the challenge of coverage.

Conclusions
This study describes the NTNU system in the ScienceIE task, including system design, implementation and evaluation. This is our first exploration of this research topic. Future work will explore other features to further improve performance.