SZTE-NLP at SemEval-2017 Task 10: A High Precision Sequence Model for Keyphrase Extraction Utilizing Sparse Coding for Feature Generation

In this paper we introduce our system participating in the SemEval-2017 shared task on keyphrase extraction from scientific documents. We aimed to create a keyphrase extraction approach that relies on as few external resources as possible. Without applying any hand-crafted external resources, and utilizing only a transformed version of word embeddings trained on Wikipedia, our proposed system performs among the best participating systems in terms of precision.


Introduction
The sheer amount of scientific publications makes intelligent processing of papers increasingly important. Automated keyphrase extraction techniques can mitigate the severe difficulties that arise when navigating massive document collections. Hence, extracting keyphrases from scientific literature has generated substantial academic interest over the past years (Witten et al., 1999; Hulth, 2003; Kim et al., 2010; Berend, 2016a).
Continuous word representations such as word2vec (Mikolov et al., 2013) have gained increasing popularity recently. These representations assign a semantically meaningful low-dimensional vector $w_i$ to each vocabulary entry of a large text corpus.
We demonstrated previously (Berend, 2016b) that useful features can be derived for various sequence labeling tasks by performing a sparse decomposition of the word embedding matrix. In this paper, we investigate the generalization properties of our proposed approach for the task of keyphrase extraction.

Sequence labeling framework
Our sequence labeling framework builds on our previous work, which targeted multiple sequence labeling tasks, namely part-of-speech tagging and named entity recognition.

Feature representation
Each token in a sequence is described in our model by its own feature values and those of its direct neighbors. We relied on multiple sources for deriving features, namely
• sparse coding of dense word embeddings,
• Brown clustering of words,
• word identity features and
• orthographic characteristics.
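As a minimal sketch of how a token can be described by its own features and those of its direct neighbors, the snippet below prefixes every feature with the relative position of the token it comes from. The function name, the offset prefixes and the `<pad>` marker used at sequence boundaries are illustrative assumptions and are not taken from our implementation.

```python
def add_neighbor_features(token_features):
    """Combine each token's features with those of its direct neighbors.

    token_features: one list of feature strings per token in the sequence.
    Returns one list per token with features from positions -1, 0 and +1,
    each prefixed by its relative offset.
    """
    windowed = []
    n = len(token_features)
    for i in range(n):
        feats = []
        for offset in (-1, 0, 1):
            j = i + offset
            if 0 <= j < n:
                feats.extend(f"{offset:+d}:{f}" for f in token_features[j])
            else:
                # Hypothetical boundary marker for missing neighbors.
                feats.append(f"{offset:+d}:<pad>")
        windowed.append(feats)
    return windowed


sentence = [["w=sparse"], ["w=coding"]]
for feats in add_neighbor_features(sentence):
    print(feats)
```

Prefixing by offset keeps a feature of the previous token distinct from the same feature on the current token, which lets a linear model weight them separately.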

Sparse coding derived features
The main source of features was sparse coding performed on continuous word embeddings. We demonstrated in (Berend, 2016b) that sequence labeling tasks can benefit substantially from the sparse decomposition of dense word embedding matrices. That is, given a word embedding matrix $W \in \mathbb{R}^{d \times |V|}$, whose columns contain the $d$-dimensional dense word embeddings, we seek its decomposition into a product of $D \in \mathbb{R}^{d \times K}$ and $\alpha \in \mathbb{R}^{K \times |V|}$, where $\alpha$ contains the sparse linear combination coefficients for each of the word embeddings, such that $\|W - D\alpha\|_F^2 + \lambda\|\alpha\|_1$ is minimized.
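Features for a word $w_i$ are then determined from its corresponding vector $\alpha_i$ by taking the signs and indices of its non-zero coefficients, i.e. each non-zero coefficient $\alpha_i[j]$ yields an indicator feature that encodes the index $j$ together with the sign of $\alpha_i[j]$.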
As we observed a consistent benefit of using polyglot (Al-Rfou et al., 2013) embeddings previously, we now also rely on those embeddings for keyphrase extraction.
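The mapping from a sparse coefficient vector to indicator features described above can be sketched as follows. The function name, the `P`/`N` naming of the sign and the toy coefficient vector are illustrative assumptions; only the idea of using the signs and indices of the non-zero coefficients comes from the text.

```python
def sparse_indicator_features(alpha_i, eps=0.0):
    """Return one string feature per non-zero coefficient of alpha_i.

    Each feature encodes the sign ('P' for positive, 'N' for negative)
    and the index j of a coefficient whose magnitude exceeds eps.
    """
    features = []
    for j, coeff in enumerate(alpha_i):
        if abs(coeff) > eps:
            sign = "P" if coeff > 0 else "N"
            features.append(f"{sign}{j}")
    return features


# Toy sparse code with K = 8 basis vectors (values are made up).
alpha_i = [0.0, 0.73, 0.0, -0.12, 0.0, 0.0, 0.4, 0.0]
print(sparse_indicator_features(alpha_i))  # ['P1', 'N3', 'P6']
```

Because the magnitudes are discarded, a larger λ (sparser α) directly translates into fewer active features per word.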

Brown clustering
Brown clustering (Brown et al., 1992) defines a hierarchical clustering over words, and cluster supersets can easily be turned into features. We used the common approach of deriving features from the length-p (p ∈ {4, 6, 10, 20}) prefixes of Brown cluster identifiers, as done previously by Ratinov and Roth (2009) and Turian et al. (2010).
We used the implementation of Liang (2005) to determine 1024 Brown clusters based on the same Wikipedia dump that was used to train the freely available polyglot word embeddings on which we performed sparse decomposition.
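The prefix-based Brown cluster features can be sketched as below. The function name, the feature string format and the example bit string are illustrative assumptions; the sketch also assumes that an identifier shorter than p is simply used in full.

```python
PREFIX_LENGTHS = (4, 6, 10, 20)


def brown_prefix_features(cluster_id, prefix_lengths=PREFIX_LENGTHS):
    """Turn a Brown cluster bit string into one feature per prefix length.

    Shorter prefixes correspond to coarser clusters higher up in the
    hierarchy; slicing past the end of the string returns the whole id.
    """
    return [f"brown{p}={cluster_id[:p]}" for p in prefix_lengths]


# Hypothetical 10-bit cluster identifier of some word.
print(brown_prefix_features("1011001110"))
```

For the identifier above this yields `brown4=1011`, `brown6=101100`, `brown10=1011001110` and `brown20=1011001110`.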

Orthographic features
Orthographic clues can help considerably in identifying keyphrases in scientific publications. For this reason, a set of indicator features is determined for each word w.

Training the model
Features described in Section 2.1 were utilized in linear chain CRFs (Lafferty et al., 2001) relying on the CRFsuite (Okazaki, 2007) implementation. CRFsuite was applied with its default regularization parameters, i.e. 1.0 and 0.001 for $\ell_1$ and $\ell_2$ regularization, respectively.
The shared task also required the identification of keyphrase types beyond merely finding the keyphrases within the text. Since the scopes of keyphrases of different types can overlap, we trained a separate CRF model for each keyphrase type and merged the predictions of the different models in a post-processing step. The models we trained employ the 5-class BIOES tagging scheme for the labels.
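A minimal sketch of the per-type BIOES encoding and the merging step is given below. The span representation (token indices with exclusive end), the function names and the merging strategy (a plain union of the typed spans predicted by each model) are assumptions made for illustration; the text does not specify the details of the post-processing.

```python
def spans_to_bioes(num_tokens, spans):
    """Encode non-overlapping (start, end) token spans as BIOES tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"          # single-token keyphrase
        else:
            tags[start] = "B"          # beginning
            for k in range(start + 1, end - 1):
                tags[k] = "I"          # inside
            tags[end - 1] = "E"        # end
    return tags


def bioes_to_spans(tags):
    """Decode a BIOES tag sequence back into (start, end) spans."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.append((i, i + 1))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            spans.append((start, i + 1))
            start = None
        elif tag == "O":
            start = None
    return spans


def merge_predictions(tags_by_type):
    """Union the spans predicted by the separate per-type models."""
    merged = []
    for ktype, tags in tags_by_type.items():
        merged.extend((s, e, ktype) for s, e in bioes_to_spans(tags))
    return sorted(merged)


predictions = {
    "Process": ["B", "I", "E", "O", "O", "O"],
    "Material": ["O", "S", "O", "O", "O", "O"],
}
print(merge_predictions(predictions))
```

Since each model only sees one keyphrase type, a Material span nested inside a Process span (as in the toy input) can be recovered without any tag conflict.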

Experiments
In this section we report our evaluations on the SemEval-2017 Task 10 dataset which consists of 350 training, 50 development and 100 test text passages, respectively. Each text passage originates from either Computer Science, Material Sciences or Physics publications and the task was to identify and classify keyphrases into the types of Material, Process and Task.
The shared task included both a keyphrase type insensitive (Subtask A) and sensitive (Subtask B) evaluation. Further details about the dataset and the description of the keyphrase types can be accessed in (Augenstein et al., 2017).
The only preprocessing we performed on the shared task data was sentence splitting and tokenization, both carried out with spaCy. For the sparse word embedding and Brown clustering-based features to work effectively, it is important that a substantial share of the tokens in the shared task data have a word representation, i.e. that the coverage of the word representations is satisfactory. Table 1a reports this coverage in terms of word forms and tokens. Table 1b provides a more detailed breakdown of the coverage of word representations for the different keyphrase types. As subsequent results illustrate, higher word coverage for a certain keyphrase type does not necessarily imply better performance on that type: Task-type keyphrases, for example, have the highest token coverage, yet scores are the lowest on that particular type (cf. Table 4).
Figure 1 illustrates the effect of varying the K and λ hyperparameters of sparse coding when not relying on orthographic or Brown clustering-derived features. Figure 1b illustrates [...]
Subsequently, we investigate how adding orthographic and Brown clustering-derived features affects results for two markedly different hyperparameter combinations of sparse coding, i.e. K = 128, λ = 0.9 and K = 1024, λ = 0.1. These results are presented in Tables 4a-4d. Table 4 reveals that when orthographic and/or Brown clustering-based features are used in conjunction with the sparse coding-derived ones, results become more stable, i.e. they are much less affected by the choice of K and λ. At the same time, the importance of word identity features diminishes once orthographic and/or Brown clustering-related features are included in the model. This effect is more pronounced when adding orthographic features.
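The coverage of word representations discussed above, measured over distinct word forms and over tokens, can be computed along the following lines. The function name, the toy data and the exact-match lookup (no lowercasing or other normalization) are assumptions made for this sketch.

```python
from collections import Counter


def coverage(tokens, vocabulary):
    """Return (word form coverage, token coverage) of a vocabulary.

    Word form coverage counts each distinct word once; token coverage
    weights each word by how often it occurs in the data.
    """
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    covered = [w for w in counts if w in vocabulary]
    form_cov = len(covered) / len(counts)
    token_cov = sum(counts[w] for w in covered) / len(tokens)
    return form_cov, token_cov


tokens = ["the", "sparse", "coding", "of", "the", "embeddings"]
vocabulary = {"the", "of", "coding", "embeddings"}
print(coverage(tokens, vocabulary))
```

Token coverage is typically higher than word form coverage, since frequent words are more likely to have a representation.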

Results on development data
Interestingly, when both orthographic and Brown clustering-related features are employed, results improve for small values of K. This was not the case without these additional feature classes.

Results on test data
Based on our experiments on the development data, our official shared task submissions employed K = 128, λ = 0.9 along with orthographic and Brown clustering-derived features. One of our official submissions relied on word form features, whereas the other omitted them. The final results of our submissions are included in Table 2.
As our main goal was to verify the applicability of sparse coding-derived features to keyphrase extraction, we also checked the performance of the model that uses all features except for the sparse coding-derived ones. The result of that model is presented in Table 3. Comparing these scores with those in Table 2 shows that even with a low value of K and a large regularization parameter λ, we obtain better F-scores when sparse coding-related features are employed.

Conclusion
In this paper, we proposed an approach for extracting keyphrases from scientific publications. A key source of features in our approach was the sparse coding of continuous word embeddings.
In our approach we did not use any task-specific features (such as lists or gazetteers), which suggests that i) results on this task could likely be improved by relying on some extra task-specific features and ii) the proposed approach is likely to be applicable to further sequence labeling tasks without severe modifications.