MayoNLP at SemEval 2017 Task 10: Word Embedding Distance Pattern for Keyphrase Classification in Scientific Publications

In this paper, we present MayoNLP's results from our participation in the ScienceIE shared task at SemEval 2017. We focused on the keyphrase classification task (Subtask B). We explored semantic similarities and patterns of keyphrases in scientific publications using pre-trained word embedding models. We propose the Word Embedding Distance Pattern, which uses the head noun word embedding to generate distance patterns based on labeled keyphrases, as an incremental feature set to enhance conventional Named Entity Recognition feature sets. A support vector machine is used as the supervised classifier for keyphrase classification. Our system achieved an overall F1 score of 0.67 for keyphrase classification and 0.64 for keyphrase classification and relation detection.


Introduction
In this paper, we present details of our participation in SemEval 2017 Task 10, ScienceIE (Augenstein et al., 2017). Named Entity Recognition (NER) is one of the major challenges in Natural Language Processing (NLP) and text mining. The entity types of interest in NER tasks vary across communities and corpora. In general, the NLP community has mainly focused on the identification of proper nouns or noun phrases, e.g., locations, names and organizations in news corpora (Nadeau and Sekine, 2007). In contrast, the biomedical community is more interested in finding biomedical or clinical terminologies (Leaman and Gonzalez, 2008; Tsuruoka and Tsujii, 2005) in biomedical texts and scientific literature. Several machine learning based methods have been used in biomedical NER, including Support Vector Machines (SVM) (Lee et al., 2004), Hidden Markov Models (HMM) (Zhou and Su, 2004) and Conditional Random Fields (CRF) (Tsai et al., 2006). Semantic word embedding (Mikolov et al., 2013) is designed to capture different degrees of similarity between words using a vectorized representation, which preserves semantic and syntactic relationships. Word embeddings and word embedding based features have drawn increasing attention for classification tasks (Ma et al., 2015) and similarity prediction tasks (Afzal et al., 2016).
We leveraged pre-trained word embeddings to obtain head noun pattern features, and combined them with several other NER feature sets to improve keyphrase classification performance. Although our team participated in Scenario 2 (keyphrase classification and relation detection), our efforts were focused on the keyphrase classification task (Subtask B). For the relation detection problem (Subtask C), we implemented a straightforward rule-based system to detect synonyms and hyponyms given annotated keyphrases.
The rest of the paper is organized as follows. Section 2 briefly introduces the corpus used in this task. Section 3 discusses the methods proposed in our NER system. Section 4 addresses the experimental results in the development set, our submitted runs and official evaluation results. Finally, Section 5 concludes the paper with possible extensions for future work.

Materials
The corpus provided by the ScienceIE organizers consisted of 500 introductory paragraphs from ScienceDirect journal articles in Computer Science, Material Sciences and Physics. The corpus was divided into training, development and test sets, which contained 350, 50 and 100 documents, respectively. It is the first publicly available corpus with annotations focused on the research topics and goals of general domain scientific literature. The annotated keyphrases were relatively longer than those in other annotated corpora, which makes the boundary detection and classification task very challenging. More details of the corpus can be found in (Augenstein et al., 2017).

Preprocessing
To facilitate feature extraction for supervised classification, all plain text sentences and annotations were pre-processed with NLTK for tokenization, Part-of-Speech (POS) tagging and sentence detection.

Head Noun Extraction
Intuitively, the head noun of a keyphrase provides important information about its semantic category (Li, 2010).
For example, in the phrase "homonuclear chains of tangent Mie spherical CG segments" from the ScienceIE 2017 corpus, the noun "chains" determines that the phrase belongs to the category "Material". In another example, the category of the phrase "applications of methodology of research" is determined by the head noun "applications", which is an instance of "Task". Extracting the head noun helps eliminate ambiguous context while preserving the semantic information needed for the classification step. Therefore, we used features extracted from the head noun, rather than from the whole phrase, to determine the semantic category.
A shallow parsing approach was applied to extract head nouns from the given phrases. We removed the tokens at and after the preposition tokens "of", "with", "for" and "on", and kept only the features from the head noun for the feature extraction step. In the above examples, we extracted the head nouns "chains" and "applications".
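The truncation rule above can be sketched as follows. This is an illustrative reconstruction, not the system's exact code: it cuts the phrase at the first preposition and takes the final remaining token as the head noun.

```python
# Hypothetical sketch of the shallow head-noun extraction described above.
PREPOSITIONS = {"of", "with", "for", "on"}

def extract_head_noun(phrase):
    """Truncate the phrase at the first preposition, then take the last
    token of what remains as the head noun."""
    kept = []
    for tok in phrase.split():
        if tok.lower() in PREPOSITIONS:
            break
        kept.append(tok)
    # The head of an English noun phrase is typically its final token.
    return kept[-1] if kept else None

print(extract_head_noun("homonuclear chains of tangent Mie spherical CG segments"))
# -> chains
```

For "applications of methodology of research" the same rule returns "applications", matching the examples in the text.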

Feature Set
Given a sentence and a head noun token w_i, we adopted several commonly used feature sets as the input of supervised classifiers for the baseline system.
Part-of-Speech features The Part-of-Speech (POS) tags of the tokens in a ±2 window.
Lemma features Lemmatized word of w_i and its verb form from WordNet. For example, for the token "derivations", the lemmatized word is "derivation" and the verb form is "derive".

Word Embedding Distance Pattern
The extraction of head nouns in keyphrases enables utilizing word embedding information as features in the keyphrase classification task.
To improve on the performance of the baseline NER features described above, we proposed the Word Embedding Distance Pattern (WEDP). It is based on the assumption that the differences among the head nouns in each semantic category should follow similar patterns in semantic word vector space. We aim to validate this assumption and learn such patterns in this keyphrase classification task.
We selected the 10 most frequent head nouns from each category in the training corpus. After removing duplicates, we obtained the following list of keywords M = {model, particle, data, system, film, problem, algorithm, function, effect, equation, reaction, method, surface, alloy, layer, structure}. We also added the category names (task, material, process) to M.
Given a token w, the word embedding distance to the k-th keyword m_k in M is calculated by

d_k(w) = dist(w2v(w), w2v(m_k)), k = 1, . . . , |M|,

where the distance function dist is the cosine distance, and w2v is the dictionary lookup method, which returns the embedding of the input token from a pre-trained word-to-vector (word2vec) model. If the token w cannot be found in the word embedding dictionary, we set d_k(w) = 1 for all k.
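The WEDP computation can be sketched as below. The pre-trained word2vec model is mocked by a plain dictionary of vectors here; a real system would load an actual embedding model.

```python
import numpy as np

# The keyword list M from the text (10 most frequent head nouns per
# category, deduplicated, plus the three category names).
KEYWORDS = ["model", "particle", "data", "system", "film", "problem",
            "algorithm", "function", "effect", "equation", "reaction",
            "method", "surface", "alloy", "layer", "structure",
            "task", "material", "process"]

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def wedp(token, w2v, keywords=KEYWORDS):
    """Return the distance pattern d_k(w) = dist(w2v(w), w2v(m_k)).
    Out-of-vocabulary tokens map to the constant pattern of 1s."""
    if token not in w2v:
        return [1.0] * len(keywords)
    return [cosine_distance(w2v[token], w2v[m]) for m in keywords]
```

Note that d_k(w) = 0 when w is itself the keyword m_k, so head nouns close to a category's typical keywords produce small distances in the corresponding positions of the pattern.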

Classification
In this study, we modeled the keyphrase classification task as a supervised multi-class classification problem. All features described in Section 3.3 were encoded into a sparse vector, and then combined with the WEDP as the input of supervised classifiers.
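The feature assembly step can be sketched as follows, under the assumption that the categorical baseline features are one-hot encoded and then concatenated with the dense WEDP vectors; the feature names are illustrative.

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction import DictVectorizer

def build_feature_matrix(feature_dicts, wedp_vectors):
    """One-hot encode categorical NER features into a sparse matrix and
    append the WEDP distance columns for each head noun token."""
    vec = DictVectorizer()
    X_baseline = vec.fit_transform(feature_dicts)     # sparse one-hot features
    X_wedp = csr_matrix(np.asarray(wedp_vectors))     # dense distance patterns
    return hstack([X_baseline, X_wedp]).tocsr()
```

For example, two tokens with POS and lemma features plus 19-dimensional WEDP vectors yield a matrix whose columns are the one-hot baseline features followed by the 19 distance values.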

Relation Extraction
For the relation extraction subtask, we implemented a simple rule-based system. For each sentence, we considered all possible pairs of entities as relation candidates. For each candidate, the context text between the two entities was extracted, including one character after the entity that appears later in the pair. The matching patterns we used are shown in Table 1. If any of these patterns matched the context, we identified the pair as a detected relation. Relations sharing at least one entity were grouped together as one relation, as required by the output format. We used the hearstPattern package, which implements Hearst patterns (Hearst, 1992), for hyponym detection.
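The matching step can be sketched as below. The two patterns shown are illustrative stand-ins of the same flavor as Table 1 (e.g., a parenthetical abbreviation, or an "also known as" apposition), not the table's actual contents.

```python
import re

# Hypothetical examples of context patterns between two candidate entities.
SYNONYM_PATTERNS = [
    r"^\s*\(\s*$",                              # "SVM" introduced as "... machine (SVM)"
    r"^\s*,?\s*(?:i\.e\.|also (?:known|called))",  # appositive synonym cue
]

def is_synonym_pair(context):
    """Return True if the text between two entities matches any pattern."""
    return any(re.search(p, context) for p in SYNONYM_PATTERNS)
```

With this scheme, the pair ("support vector machine", "SVM") in "support vector machine (SVM)" yields the context " (", which matches the first pattern.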

Results
We tested several supervised classification methods; the results on the development set are shown in Table 2. The L2-loss linear kernel SVM was selected as the classifier, using the scikit-learn implementation. The results also validated that SVM can outperform other classification methods on high dimensional data (Chang and Lin, 2011). The hyperparameter C was tested in the range from 0.01 to 10. The F1 scores ranged from 0.70 to 0.78, with the highest F1 score on the development set achieved when C was set to 0.5.
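The selection step can be sketched with scikit-learn as follows; X_train and y_train stand for the feature matrix and keyphrase labels, and the C grid shown is an illustrative sample of the reported 0.01-10 range.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

def tune_svm(X_train, y_train):
    """Grid-search C for an L2-loss linear-kernel SVM and return the
    best fitted estimator."""
    grid = GridSearchCV(
        LinearSVC(loss="squared_hinge"),  # L2 (squared hinge) loss
        param_grid={"C": [0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0]},
        scoring="f1_macro",
        cv=5,
    )
    grid.fit(X_train, y_train)
    return grid.best_estimator_
```

Cross-validated macro F1 is used here as the model-selection criterion, mirroring the development-set F1 comparison described above.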

Ablation experiments were conducted on the development set to assess the importance of individual feature sets. The ablation results in F1 scores are shown in Table 3. From Table 3, we see that both the baseline feature sets and the WEDP contributed to the overall performance, since the combination of the two outperformed the other feature settings.

The official evaluation uses the standard precision (P), recall (R) and F1 score as metrics. We submitted two runs for official evaluation. Run 1 uses the feature set described in Section 3.3 together with the synonym detection results. Run 2 extends Run 1 with predicted hyponyms. Both runs achieved an F1 score of 0.64 for Subtasks B and C, because the positive "Hyponym-of" relations predicted in Run 2 had no significant effect. The results of Run 2 are shown in Table 4 (official evaluation results of the best submitted run on the test set using annotated keyphrase boundaries, Scenario 2). From the results, "Task" is the most difficult category for our proposed method, but its relatively low proportion reduces its impact on the overall F1 score. Compared to the development set results in Table 2, the F1 scores of all three categories drop by at least 0.09, which indicates that the selected classifier suffers from overfitting.

Conclusion
In this paper, we presented details of MayoNLP's participation in the ScienceIE shared task at SemEval 2017. We used a supervised classifier for the keyphrase classification task with word embedding distance patterns, which improved the performance of conventional feature sets. Our system achieved an overall F1 score of 0.67 for the keyphrase classification subtask and 0.64 for the combined keyphrase classification and relation detection subtasks. It outperformed the other participating systems in Scenario 2. A future extension of this work is to test the patterns with different pre-trained word embeddings. We will also develop methods for more accurate head noun extraction, such as dependency parsing, to improve overall classification performance.