Multi-Task Learning of Keyphrase Boundary Classification

Keyphrase boundary classification (KBC) is the task of detecting keyphrases in scientific articles and labelling them with respect to predefined types. Although important in practice, this task is so far underexplored, partly due to the lack of labelled data. To overcome this, we explore several auxiliary tasks, including semantic super-sense tagging and identification of multi-word expressions, and cast the task as a multi-task learning problem with deep recurrent neural networks. Our multi-task models perform significantly better than previous state-of-the-art approaches on two scientific KBC datasets, particularly for long keyphrases.


Introduction
The scientific keyphrase boundary classification (KBC) task consists of a) determining keyphrase boundaries, and b) labelling keyphrases with their types according to a predefined schema. KBC is motivated by the need to efficiently search scientific literature, in which articles can be summarised by their keyphrases. Several companies are working on keyphrase-based recommender systems for scientific literature or search interfaces where scientific articles decorate graphs, in which nodes are keyphrases. Such keyphrases must be dynamically retrieved from the articles, because important scientific concepts emerge on a daily basis, and the most recent concepts are typically the ones of interest to scientists.
KBC is not a common task in NLP, and there are only a few small annotated datasets for inducing supervised KBC models, made available recently (QasemiZadeh and Schumann, 2016; Augenstein et al., 2017). Typical KBC approaches therefore rely on hand-crafted gazetteers (Hasan and Ng, 2014) or reduce the task to extracting a list of keyphrases for each document (Kim et al., 2010) instead of identifying mentions of keyphrases in sentences. For related, more common NLP tasks such as named entity recognition and identification of multi-word expressions, neural sequence labelling methods have been shown to be useful (Lample et al., 2016). To overcome the small data problem, we study using more widely available data for tasks related to KBC and exploit their synergies in a deep multi-task learning setup.
⋆ Both authors contributed equally.
Multi-task learning has become popular within natural language processing and machine learning over the last few years; in particular, hard parameter sharing of hidden layers in deep learning models. This approach to multi-task learning has three advantages: a) it significantly reduces Rademacher complexity (Baxter, 2000; Maurer, 2007), i.e., the risk of over-fitting, b) it is space-efficient, reducing the number of parameters, and c) it is easy to implement. This paper shows how hard parameter sharing can be used to improve gazetteer-free keyphrase boundary classification models, by exploiting different syntactically and semantically annotated corpora, as well as more readily available data such as hyperlinks.
Contributions. We study the task of scientific keyphrase boundary classification, which is important in practice but so far widely underexplored, and for which only a small amount of training data is available. We overcome this by identifying good auxiliary tasks and casting KBC as a multi-task learning problem. We evaluate our models across two new, manually annotated corpora of scientific articles and outperform single-task approaches by up to 9.64% F1, mostly due to better performance for long keyphrases.

Keyphrase Boundary Classification
Consider the following sentence from a scientific paper:
(1) We find that simple interpolation methods, like log-linear and linear interpolation, improve the performance but fall short of the performance of an oracle.
This sentence occurs in the ACL RD-TEC 2.0 corpus. Here, interpolation methods and log-linear and linear interpolation are annotated as technical keyphrases, performance as a keyphrase related to measurements, and oracle as a keyphrase labelled as miscellaneous. Below, we are interested in predicting the boundaries and the types of all keyphrases.
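To make the prediction target concrete, the annotated sentence can be written as one tag per token. The sketch below assumes a BIO encoding, which the text does not specify; the type names `TECH`, `MEASURE` and `MISC`, the helper `spans_to_bio`, and the choice to tag both occurrences of performance are illustrative assumptions rather than the corpus's actual label inventory.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start token index, end index exclusive, type)


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Encode keyphrase spans as one BIO tag per token (assumed scheme)."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the keyphrase
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the keyphrase
    return tags


tokens = ("We find that simple interpolation methods , like log-linear and "
          "linear interpolation , improve the performance but fall short of "
          "the performance of an oracle .").split()

# Hypothetical type names; token offsets were counted by hand for this sentence.
spans = [
    (4, 6, "TECH"),      # interpolation methods
    (8, 12, "TECH"),     # log-linear and linear interpolation
    (15, 16, "MEASURE"), # performance (first occurrence)
    (21, 22, "MEASURE"), # performance (second occurrence)
    (24, 25, "MISC"),    # oracle
]

tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Under this encoding, boundary identification amounts to getting the B/I/O part of each tag right, and classification to getting the type suffix right.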

Multi-Task Learning
Multi-task learning is an approach to learning, in which generalisation is improved by taking advantage of the inductive bias in training signals of related tasks. When abundant labelled data is available for an auxiliary task, but little data for the target task, multi-task learning can act as a form of semi-supervised learning combined with a distant supervision signal. Inducing a model from only the sparse target task data may lead to overfitting to random noise in the data, but relying on auxiliary data helps the model generalise, making it easier to abstract away from noise, as well as leveraging the marginal distribution of auxiliary input data. From a representation learning perspective, auxiliary tasks can be used to induce representations that may be beneficial for the target task. Caruana (1993) also suggests that the auxiliary task can help focus attention in the induction of the target task model. Finally, multi-task learning can be cast as a regulariser as studies show reductions in Rademacher complexity in multi-task architectures over single-task architectures (Baxter, 2000; Maurer, 2007).
Here, we follow what is probably the most common approach to multi-task learning, known as hard parameter sharing. This was introduced in Caruana (1993) in the context of deep neural networks, in which hidden layers can be shared among tasks. We assume $T$ different training sets, $D_1, \dots, D_T$, where each $D_t$ contains pairs of input-output sequences $(w_{1:n}, y^t_{1:n})$, with $w_i \in V$ and $y^t_i \in L_t$. The input vocabulary $V$ is shared across tasks, but the output vocabularies (tagsets) $L_t$ are task-dependent. At each step in the training process we choose a random task $t$, followed by a random training instance $(w_{1:n}, y^t_{1:n}) \in D_t$. We use the tagger to predict the labels $\hat{y}^t_i$, suffer a loss with respect to the true labels $y^t_i$ and update the model parameters. The parameters are trained jointly for a sentence, i.e. a cross-entropy loss over each sentence is employed. Each task is associated with an independent classification function, but all tasks share the hidden layers. Note that for our experiments, we only consider one auxiliary task at a time.
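The training procedure above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the networks used in the experiments: the shared component is a single tanh-transformed embedding table standing in for the shared hidden layers, each task gets its own softmax output matrix, and parameters are updated after every token rather than once per sentence. The names `SharedTagger` and `train` are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Example = Tuple[List[int], List[int]]  # (word ids w_1..w_n, label ids y_1..y_n)


class SharedTagger:
    """Toy tagger: one shared layer plus one softmax output layer per task."""

    def __init__(self, vocab_size: int, dim: int,
                 tagset_sizes: Dict[str, int], seed: int = 0) -> None:
        rng = random.Random(seed)

        def matrix(rows: int) -> List[List[float]]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                    for _ in range(rows)]

        self.shared = matrix(vocab_size)  # shared across all tasks
        self.out = {t: matrix(k) for t, k in tagset_sizes.items()}  # per task

    def step(self, task: str, words: List[int], labels: List[int],
             lr: float) -> float:
        """Run SGD over one sentence and return its summed cross-entropy.

        Simplification: parameters are updated after every token rather
        than once per sentence.
        """
        W = self.out[task]
        loss = 0.0
        for w, y in zip(words, labels):
            h = [math.tanh(x) for x in self.shared[w]]
            logits = [sum(a * b for a, b in zip(row, h)) for row in W]
            top = max(logits)
            exps = [math.exp(z - top) for z in logits]
            total = sum(exps)
            probs = [e / total for e in exps]
            loss -= math.log(probs[y])
            # d(loss)/d(logits) = probs - one_hot(y)
            d_logits = [p - (1.0 if k == y else 0.0)
                        for k, p in enumerate(probs)]
            # Gradient reaching the shared layer, computed before W changes.
            d_h = [sum(d_logits[k] * W[k][j] for k in range(len(W)))
                   for j in range(len(h))]
            for k, row in enumerate(W):          # task-specific update
                for j, h_j in enumerate(h):
                    row[j] -= lr * d_logits[k] * h_j
            for j, h_j in enumerate(h):          # shared update
                self.shared[w][j] -= lr * d_h[j] * (1.0 - h_j * h_j)
        return loss


def train(model: SharedTagger, data: Dict[str, List[Example]],
          steps: int, lr: float = 0.1, seed: int = 0) -> None:
    """Alternate between tasks by sampling a task, then an instance."""
    rng = random.Random(seed)
    tasks = sorted(data)
    for _ in range(steps):
        task = rng.choice(tasks)                # random task t
        words, labels = rng.choice(data[task])  # random instance from D_t
        model.step(task, words, labels, lr)
```

The point of the sketch is the parameter layout: a step on the auxiliary task changes `model.shared`, which the main task also reads, while each entry of `model.out` is only ever touched by its own task.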

Experiments
Experimental Setup. We perform experiments for both keyphrase boundary identification (unlabelled), and keyphrase boundary identification and classification (labelled). We report token-level precision, recall and F1, micro-averaged across keyphrase types. Types are defined by the two datasets studied.
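The metrics can be illustrated with a short sketch. It assumes one plausible reading of the setup, since the exact evaluation script is not described here: each token carries either a keyphrase type or `"O"`, counts are pooled over all types (micro-averaging), and the unlabelled setting collapses every type into a single keyphrase class. The function name `token_prf` and the type names in the usage example are illustrative.

```python
from typing import List, Tuple


def token_prf(gold: List[str], pred: List[str],
              labelled: bool = True) -> Tuple[float, float, float]:
    """Token-level micro-averaged precision, recall and F1.

    "O" marks tokens outside any keyphrase. With labelled=False, only the
    keyphrase / non-keyphrase distinction is scored.
    """
    if not labelled:
        gold = ["O" if g == "O" else "KEY" for g in gold]
        pred = ["O" if p == "O" else "KEY" for p in pred]
    # Counts are pooled over all types, which is what micro-averaging means.
    tp = sum(1 for g, p in zip(gold, pred) if p != "O" and g == p)
    n_pred = sum(1 for p in pred if p != "O")
    n_gold = sum(1 for g in gold if g != "O")
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["O", "Task", "Task", "O", "Process"]
pred = ["O", "Process", "O", "Process", "Process"]
print(token_prf(gold, pred, labelled=True))
print(token_prf(gold, pred, labelled=False))
```

In this example the unlabelled scores are higher than the labelled ones, because the second token is found as a keyphrase but assigned the wrong type.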
Auxiliary tasks. We experiment with five auxiliary tasks: (1) syntactic chunking using annotations extracted from the English Penn Treebank, following Søgaard and Goldberg (2016); (2) frame target annotations from FrameNet 1.5 (corresponding to the target identification and classification tasks in Das et al. (2014)); (3) hyperlink prediction using the dataset from Spitkovsky et al. (2010); (4) identification of multi-word expressions using the STREUSLE corpus (Schneider and Smith, 2015); and (5) semantic super-sense tagging using the SemCor dataset, following Johannsen et al. (2014). We train our models on the main task with one auxiliary task at a time. Note that the datasets for the auxiliary tasks are not annotated with keyphrase boundary identification or classification labels.
Datasets. We evaluate on the SemEval 2017 Task 10 dataset (Augenstein et al., 2017) and the ACL RD-TEC 2.0 dataset (QasemiZadeh and Schumann, 2016). The SemEval 2017 dataset is annotated with three keyphrase types, the ACL RD-TEC dataset with seven. For the former, we test on the development portion of the dataset, as the test set is not released yet. We randomly split ACL RD-TEC into a
Models. Our single- and multi-task networks are three-layer, bi-directional LSTMs (Graves and Schmidhuber, 2005) with pre-trained SENNA embeddings (footnote 1). For the multi-task networks, we follow the training procedure outlined in Section 3. The dimensionality of the embeddings is 50, and we follow Søgaard and Goldberg (2016) in using the same dimensionality for the hidden layers. We add a dropout of 0.1 to the input and train these architectures with momentum SGD with an initial learning rate of 0.001 and momentum of 0.9 for 10 epochs. We use the implementations released by the authors and re-train models on our data.

Results and Analysis
Results for the SemEval 2017 Task 10 corpus are presented in Table 2, and for the ACL RD-TEC corpus in Table 3. For the SemEval corpus, all five labelled multi-task learning models outperform both examples of previous work, as well as our single-task BiLSTM baseline, by some margin. For ACL RD-TEC, three out of five labelled multi-task learning models perform better than the single-task BiLSTM baseline. On the SemEval corpus, the F1 error reduction of the best labelled model over the Stanford tagger is 9.64%. The lexicalised Finkel et al. (2005) model shows a surprisingly competitive performance on the ACL RD-TEC corpus, where it is only 2 points in F1 behind our best-performing labelled model and on par with our best-performing unlabelled model. Results with Lample et al. (2016), on the other hand, are lower than the Finkel et al. (2005) baseline. This might be due to the model having a large set of parameters to model state transitions, which poses a difficulty for small training datasets.
1 http://ronan.collobert.com/senna/
2 http://nlp.stanford.edu/software/CRF-NER.shtml
3 https://github.com/clab/stack-lstm-ner
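For readers unfamiliar with the term, an F1 error reduction is usually computed relative to the baseline's remaining error, 1 - F1. The sketch below assumes that standard definition, which the text does not spell out; the function name is hypothetical and the numbers in the usage line are made up for illustration, not taken from the results tables.

```python
def f1_error_reduction(f1_baseline: float, f1_model: float) -> float:
    """Relative reduction in F1 error (1 - F1), with both scores in [0, 1]."""
    baseline_error = 1.0 - f1_baseline
    if baseline_error <= 0.0:
        raise ValueError("baseline F1 must be below 1.0")
    # Equivalent to (baseline_error - model_error) / baseline_error.
    return (f1_model - f1_baseline) / baseline_error


# Hypothetical scores: moving from 0.50 to 0.60 F1 removes a fifth of the error.
print(f"{f1_error_reduction(0.50, 0.60):.2%}")
```

Note that an error reduction is a relative quantity and is therefore not the same as an absolute difference in F1 points.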
Overall, multi-task models show bigger improvements over baselines for the SemEval corpus, and all models achieve better results on ACL RD-TEC. Statistics shown in Table 1 help to explain this. Most notably, the SemEval dataset contains a significantly higher proportion of long keyphrases than the ACL dataset. Interestingly, ACL RD-TEC contains a large proportion of keyphrases which only appear once in the training set (singletons), significantly fewer keyphrases and more keyphrase types, but that does not seem to impact results as much as a high proportion of long keyphrases.
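Statistics of this kind are straightforward to compute from the list of annotated keyphrase mentions in a training set. The sketch below is illustrative only: the threshold for a "long" keyphrase (`long_min_tokens`) is an assumption, as the text does not define one, and singletons are counted over distinct lower-cased keyphrase strings, which is one of several reasonable readings.

```python
from collections import Counter
from typing import Dict, List


def keyphrase_stats(train_keyphrases: List[str],
                    long_min_tokens: int = 4) -> Dict[str, float]:
    """Summarise keyphrase mentions: count, share of long ones, singletons."""
    counts = Counter(kp.lower() for kp in train_keyphrases)
    n_mentions = len(train_keyphrases)
    # A mention is "long" if it has at least long_min_tokens tokens (assumed).
    n_long = sum(1 for kp in train_keyphrases
                 if len(kp.split()) >= long_min_tokens)
    # A singleton is a distinct keyphrase seen exactly once.
    n_singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "mentions": float(n_mentions),
        "long_proportion": n_long / n_mentions if n_mentions else 0.0,
        "singleton_proportion": n_singletons / len(counts) if counts else 0.0,
    }


example = ["oracle", "performance", "performance",
           "honeycomb network of graphite bricks"]
print(keyphrase_stats(example))
```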
All models struggle with semantically vague or broad keyphrases (e.g. 'items', 'scope', 'key') and long keyphrases, especially those containing clauses (e.g. 'complete characterisation of the oxide particles', 'earley deduction proof procedure for definite clauses'). The multi-task models generally outperform the BiLSTM baseline for long phrases (e.g. 'language-independent system for automatic discovery of text in parallel translation', 'honeycomb network of graphite bricks'). Being able to recognise long keyphrases correctly is part of the reason our multi-task models outperform the baselines, especially on the SemEval dataset, which contains many such long keyphrases.
Table 3: Results for keyphrase boundary classification on the ACL RD-TEC corpus

Related Work
Multi-Task Learning. Hard sharing of all hidden layers was introduced in Caruana (1993), and popularised in NLP by Collobert et al. (2011a). Several variants have been introduced, including hard sharing of selected layers (Søgaard and Goldberg, 2016) and sharing of parts (subspaces) of layers (Liu et al., 2015). Søgaard and Goldberg (2016) show that hard parameter sharing is an effective regulariser, also on heterogeneous tasks such as the ones considered here. Hard parameter sharing has been studied for several tasks, including CCG super tagging (Søgaard and Goldberg, 2016), text normalisation (Bollmann and Søgaard, 2016), neural machine translation (Dong et al., 2015; Luong et al., 2016), and super-sense tagging (Martínez Alonso and Plank, 2017). Sharing of information can further be achieved by extending LSTMs with an external memory shared across tasks (Liu et al., 2016). A further instance of multi-task learning is to optimise a supervised training objective jointly with an unsupervised training objective, as shown in Yu et al. (2016) for natural language generation and auto-encoding, and in Rei (2017) for different sequence labelling tasks and language modelling.
Boundary Classification. KBC is very similar to named entity recognition (NER), though arguably harder. Deep neural networks have been applied to NER by Collobert et al. (2011b) and Lample et al. (2016). Other successful methods rely on conditional random fields, thereby modelling the probability of each output label conditioned on the label at the previous time step. Lample et al. (2016), currently state-of-the-art for NER, stack CRFs on top of recurrent neural networks. We leave exploring such models in combination with multi-task learning for future work.
Keyphrase detection methods specific to the scientific domain often use keyphrase gazetteers as features or exploit citation graphs (Hasan and Ng, 2014). However, previous methods relied on corpora annotated for type-level identification, not for mention-level identification (Kim et al., 2010; Sterckx et al., 2016). While most applications rely on extracting keyphrases (as types), this has the unfortunate consequence that previous work ignores acronyms and other short-hand forms referring to methods, metrics, etc. Further, relying on gazetteers makes overfitting likely, leading to lower scores on out-of-gazetteer keyphrases.

Conclusions and Future Work
We present a new state of the art for keyphrase boundary classification, using data from related, auxiliary tasks; in particular, super-sense tagging and identification of multi-word expressions. Deep multi-task learning improves significantly on previous approaches to KBC, with error reductions of up to 9.64%, mostly due to better identification and labelling of long keyphrases.
In future work, we want to explore alternative multi-task learning regimes to hard parameter sharing and experiment with additional auxiliary tasks. The auxiliary tasks considered here are standard NLP tasks, hyperlink prediction aside. Other tasks may be more directly relevant, such as predicting the layout of calls for papers for scientific conferences, or predicting hashtags in tweets by scientists, since both data sources contain scientific keyphrases.