Syntactically Aware Neural Architectures for Definition Extraction

Automatically identifying definitional knowledge in text corpora (Definition Extraction or DE) is an important task with direct applications in, among others, Automatic Glossary Generation, Taxonomy Learning, Question Answering and Semantic Search. It is generally cast as a binary classification problem between definitional and non-definitional sentences. In this paper we present a set of neural architectures combining Convolutional and Recurrent Neural Networks, which are further enriched by incorporating linguistic information via syntactic dependencies. Our experimental results in the task of sentence classification, on two benchmarking DE datasets (one generic, one domain-specific), show that these models obtain consistent state of the art results. Furthermore, we demonstrate that models trained on clean Wikipedia-like definitions can successfully be applied to more noisy domain-specific corpora.


Introduction
Dictionaries and glossaries are among the most important sources of meaning for humankind. Compiling, updating and translating them has traditionally been left mostly to domain experts and professional lexicographers. However, the last two decades have witnessed a growing interest in automating the construction of lexicographic resources.
In this context, systems able to address the problem of Definition Extraction (DE), i.e., identifying definitional information in free text, are of great value both for computational lexicography and for NLP. In the early days of DE, rule-based approaches leveraged linguistic cues observed in definitional data (Rebeyrolle and Tanguy, 2000; Klavans and Muresan, 2001; Malaisé et al., 2004; Saggion and Gaizauskas, 2004; Storrer and Wellinghoff, 2006). However, in order to deal with problems like language dependence and domain specificity, machine learning was incorporated in more recent contributions (Del Gaudio et al., 2013), which focused on encoding informative lexico-syntactic patterns in feature vectors (Cui et al., 2005; Fahmi and Bouma, 2006; Westerhout and Monachesi, 2007; Borg et al., 2009), both in supervised and semi-supervised settings (Reiplinger et al., 2012). On the other hand, while encoding definitional information using deep learning techniques has been addressed in the past (Hill et al., 2015; Noraset et al., 2016), to the best of our knowledge no previous work has tackled the problem of DE by reconciling both the linguistic lessons learned in the past decades (e.g., the importance of lexico-syntactic patterns or long-distance relations between definiendum and definiens) and the processing potential of neural networks.
Thus, we propose to bridge this gap by learning high-level features over candidate definitions via convolutional filters, and then applying recurrent neural networks to learn long-term dependencies over these feature maps. Without preprocessing and only taking pretrained embeddings as input, it is already possible to consistently obtain state-of-the-art results on two benchmark datasets for DE (one generic, one domain-specific). Further improvements over this simple model are obtained by incorporating syntactic information, namely by composing and embedding head-modifier syntactic dependencies and dependency labels. One interesting side result of our experiments is the observation that a model trained only on canonical Wikipedia-like definitions performs significantly better in a domain-specific academic setting than a model that has been trained on that domain, which somewhat contradicts previously assumed notions about the creativity of academic authors when presenting and describing novel terminology.²

Method
The impact of deep learning methods in NLP is today indisputable. The utilization of neural networks has improved the state of the art almost systematically in a wide number of tasks, from language modeling (Bengio et al., 2003;Yih et al., 2011;Mikolov et al., 2013) to text classification (Kim, 2014) or machine translation (Bahdanau et al., 2014), among many others.
In this paper we leverage two of the most popular architectures in deep learning for NLP with the goal of predicting, given an input sentence, its probability of including definitional knowledge. In our best performing model we take advantage of Convolutional Neural Networks (CNNs) to learn local features via convolved filters (LeCun et al., 1998), and then apply to the learned feature maps a Bidirectional Long Short Term Memory (BLSTM) network (Hochreiter and Schmidhuber, 1997). In this way, we aim at capturing ngram-wise features (Zhou et al., 2015), which may be strong indicators of definitional patterns (e.g., the classic X is a Y pattern), combined with the learning of long-term sequential dependencies over these learned feature maps.
Following standard notation for sentence modeling via CNNs (Kim, 2014), we let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector associated with the i-th word in an input sentence S. We use as pretrained embeddings the word2vec (Mikolov et al., 2013) vectors trained with negative sampling on the Google News corpus. Each sentence is represented as an $n \times k$ matrix S, where n is the length of the longest sentence in the corpus (using padding where necessary). The convolution layer applies a filter $w_j \in \mathbb{R}^{(h+1)k}$ to each ngram window of h+1 tokens. Specifically, writing $x_{i:i+h}$ for the concatenation of the word vectors $x_i, x_{i+1}, \ldots, x_{i+h}$, we have:

$$c^i_j = f(w_j \cdot x_{i:i+h} + b_j)$$

where $b_j \in \mathbb{R}$ is a bias term and f is the ReLU activation function (Nair and Hinton, 2010). In total, we use 100 such convolutional features, i.e. we use the vector $c^i = \langle c^i_1, c^i_2, \ldots, c^i_{100} \rangle$ to encode the i-th ngram. We empirically set the length h+1 of each ngram to 3. To reduce the size of the representation, we then use a max pooling layer with a pool size of 4. Let us write $d^i_j = \max(c^i_j, c^{i+1}_j, c^{i+2}_j, c^{i+3}_j)$ and $d^i = \langle d^i_1, \ldots, d^i_{100} \rangle$. The input sentence S is then represented as the sequence $d^1, d^5, d^9, \ldots, d^{n-3}$, which is used as the input to a bidirectional LSTM (BLSTM) layer. Finally, the output vectors of the final states for both directions of this BLSTM are connected to a single neuron with a sigmoid activation function. In all the experiments reported in this paper, we classify a sentence as definitional when the output of this neuron is at least 0.5.
² Code available at bitbucket.org/luisespinosa/neural_de
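The convolution and pooling steps described above can be illustrated with a short, dependency-free Python sketch. It is only an illustration under stated assumptions, not the released implementation: the function names `conv_features` and `max_pool` are hypothetical, the two toy filters are hand-picked rather than learned, and incomplete ngram windows are dropped instead of padded.

```python
from typing import List

Vector = List[float]


def relu(value: float) -> float:
    """ReLU activation: max(0, value)."""
    return max(0.0, value)


def conv_features(sentence: List[Vector], filters: List[Vector],
                  biases: List[float], window: int = 3) -> List[Vector]:
    """Compute c[i][j] = ReLU(w_j . x_{i:i+h} + b_j) with window = h + 1.

    Assumption: only complete windows are used (no padding).
    """
    features = []
    for i in range(len(sentence) - window + 1):
        # Concatenate the word vectors of the current ngram window.
        ngram = [value for word in sentence[i:i + window] for value in word]
        features.append([
            relu(sum(w * x for w, x in zip(filt, ngram)) + bias)
            for filt, bias in zip(filters, biases)
        ])
    return features


def max_pool(features: List[Vector], pool_size: int = 4) -> List[Vector]:
    """Element-wise max over non-overlapping blocks of pool_size ngram vectors."""
    pooled = []
    for start in range(0, len(features) - pool_size + 1, pool_size):
        block = features[start:start + pool_size]
        pooled.append([max(column) for column in zip(*block)])
    return pooled


# Toy input: 10 words with k = 2, so each filter has (h + 1) * k = 6 weights.
sentence = [[float(i), 1.0] for i in range(10)]
filters = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # copies the first value of the first word
    [0.0, 0.0, 0.0, 0.0, -1.0, 0.0],  # negates the first value of the third word
]
biases = [0.0, 5.0]

features = conv_features(sentence, filters, biases)  # 8 trigram windows
pooled = max_pool(features)                          # 2 pooled vectors
print(pooled)  # [[3.0, 3.0], [7.0, 0.0]]
```

In the model described above there would be 100 filters instead of two, and the pooled sequence would be fed to the BLSTM layer, which is omitted here.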

Incorporating Syntactic Information
The role of syntax has been extensively studied for improving semantic modeling of domain terminologies. Examples where syntactic cues are leveraged include medical acronym expansion (Pustejovsky et al., 2001), hyponym-hypernym extraction and detection (Hearst, 1992; Shwartz et al., 2016), and definition extraction from the web (Saggion and Gaizauskas, 2004), from scholarly articles (Reiplinger et al., 2012), and more recently from Wikipedia-like definitions (Boella et al., 2014).
However, the interplay between syntactic information and the generalization potential of neural networks remains unexplored in definition modeling, although intuitively it seems reasonable to assume that a syntax-informed architecture should have more tools at its disposal for discriminating between definitional and non-definitional knowledge. As an example of the importance of syntax in encyclopedic definitions, among the definitions contained in the WCL definition corpus (see Section 3.1), 71% include the lexico-syntactic pattern [...].
To explore the potential of syntactic information, we represent dependency-based phrases by embedding them in the same vector space as the pretrained word embeddings introduced above. This approach draws from previous work on modeling phrases by composing their parts and the relations that link them (Socher et al., 2011, 2013, 2014). Specifically, let $S_d$ be the list of head-modifier relations obtained by parsing⁴ sentence S. Each relation r in $S_d$ is a head-modifier tuple ⟨h, m, l⟩, where l denotes the dependency label of the relation (e.g., nsubj). We represent each relation as the vector $\mathbf{r} = \frac{1}{2}(\mathbf{h} + \mathbf{m})$, with $\mathbf{h}$ and $\mathbf{m}$ the vector representations of words h and m respectively. This setting for composing first-order head-modifier relations is similar to the one proposed in Dyer et al. (2015) for dependency parsing. This leads to a representation of the sentence as a sequence $\mathbf{r}_1, \ldots, \mathbf{r}_{|S_d|}$, which preserves the original order of head words. The intuition is that this "coarser" grained sorting⁵ provides integrated semantic-syntactic information that can be leveraged both by the convolutional feature extraction step and, more importantly, by the sequential BLSTM module.
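The head-modifier composition just described can be sketched as follows. This is a minimal illustration under explicit assumptions: `compose_relations` is a hypothetical helper, the dependency triples are written by hand rather than produced by a parser, the two-dimensional embeddings are toy values, and sorting relations by the position of their head word is our reading of "preserves the original order of head words". The optional `labels` argument appends a one-hot dependency label to each vector, corresponding to the label-enriched variant of the model.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
# (head position, modifier position, dependency label); positions index the token list.
Relation = Tuple[int, int, str]


def compose_relations(tokens: List[str], relations: List[Relation],
                      embeddings: Dict[str, Vector],
                      labels: Optional[List[str]] = None) -> List[Vector]:
    """Represent each relation as r = 0.5 * (h + m), ordered by head position."""
    # sorted() is stable, so relations sharing a head keep their input order.
    ordered = sorted(relations, key=lambda rel: rel[0])
    sequence = []
    for head, modifier, label in ordered:
        h = embeddings[tokens[head]]
        m = embeddings[tokens[modifier]]
        r = [0.5 * (a + b) for a, b in zip(h, m)]
        if labels is not None:
            # Optional enrichment: one-hot encoding of the dependency label.
            r += [1.0 if label == known else 0.0 for known in labels]
        sequence.append(r)
    return sequence


# Hand-written toy parse of "amiga is a computer" (assumed, not parser output).
tokens = ["amiga", "is", "a", "computer"]
embeddings = {
    "amiga": [2.0, 0.0],
    "is": [0.0, 2.0],
    "a": [0.0, 0.0],
    "computer": [4.0, 4.0],
}
relations = [(1, 0, "nsubj"), (3, 2, "det"), (1, 3, "attr")]

plain = compose_relations(tokens, relations, embeddings)
labelled = compose_relations(tokens, relations, embeddings,
                             labels=["nsubj", "attr", "det"])
print(plain)  # [[1.0, 1.0], [2.0, 3.0], [2.0, 2.0]]
```

Averaging keeps each relation vector in the same k-dimensional space as the word embeddings, so the resulting sequence can be appended to the word vector sequence without any projection.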
Then, for each sentence we concatenate the dependency-based representation $\mathbf{r}_1, \ldots, \mathbf{r}_{|S_d|}$ to the word vector sequence $x_1, \ldots, x_n$, to obtain the input to the convolutional layer of our model. It is worth mentioning that we tried different merging schemes (concatenation, but also dot product and averaging) at different layers, and found that the best way to inform our neural definition extractor is to encode this syntactic information explicitly at input time. Finally, we also explore the effect of enriching the input representation with the information of the dependency label. For each sentence, we enrich each head-modifier mean vector $\mathbf{r}_i$ by concatenating to it a one-hot representation of its corresponding dependency label. There are 46 distinct labels (e.g., nsubj or dobj). An illustrative diagram of our proposed architecture is provided in Figure 1.
⁴ We use the dependency parser provided in the spaCy NLP library: spacy.io.
⁵ It is coarser because in a dependency tree modifiers naturally lose their original order.

Datasets
WCL: This corpus consists of manually annotated Wikipedia definitions and distractors (1,871 and 2,847 respectively). These distractors are sentences that also include the term (i.e., the Wikipedia page title) and are what the authors call "syntactically plausible false definitions". The style of the definitions is fairly consistent, and follows in most cases the Aristotelian genus et differentia structure of a definition (A is a B which C). We list below both an example definition and one of its distractors:
The Amiga is a family of personal computers originally developed by Amiga Corporation.
Development on the Amiga began in 1982 with Jay Miner as the principal hardware designer.
W00: Introduced in Jin et al. (2013), this corpus consists of a collection of 731 definition sentences compiled from the ACL-ARC anthology (Bird et al., 2008), and 1,454 distractors. Their style is different, as they are used mostly for introducing and describing novel terminology in NLP research papers. Let us show an example for each sentence class:
Our system, SNS (pronounced "essence"), retrieves documents related to an unrestricted user query and summarizes a subset of them as selected by the user.
The senses with the highest confidence scores are the senses that contribute the most to the maximization function for the set .

Baselines
Let us provide a succinct description of each competing baseline.
(1) WCL: An algorithm that learns word-class lattices for modeling higher-level features over shallow parsing and part of speech.
As for our proposed models, we include results for a CNN architecture alone (CNN), as well as for the proposed CNN and BLSTM (C-BLSTM) combination. For both architectures, subscripts d or l denote the syntactically informed variant without and with one-hot label encoding information, respectively. Finally, among the many hyperparameters that can be explored, we report the impact of the dimensionality of the output vectors of the BLSTM layer, with sizes of 100 and 300. We did not attempt to tune the other hyperparameters.

Experiment 1: In-domain 10-fold CV
In this experiment, we compare the performance of different configurations of our proposed model with previous contributions in a 10-fold cross validation (CV) setting. The experimental results, listed in Table 1, show that a fairly simple CNN architecture with no preprocessing already achieves remarkably strong results, especially for the WCL dataset. Among our proposed systems, the overall best performance on Wikipedia definitions is obtained by the CNN_l configuration. However, incorporating a BLSTM layer yields the best performing model on the NLP-specific dataset (C-BLSTM100_d). Several conclusions can be drawn from these results. First, CNNs are capable of capturing a great deal of Wikipedia-like definitional information. This is probably due to the fairly recurrent linguistic structure of these definitions. LSTMs, in contrast, seem necessary in more complex scenarios, e.g., those presented in the W00 dataset. Here, we argue that long-term dependencies may play an important role, for example, for capturing cases where a full-fledged definition spans only the last tokens of a sentence. Finally, syntax seems to help for most configurations and for both datasets, although the difference is more pronounced in the more challenging W00 dataset.
These differences in performance are, however, small enough to make it difficult to draw strong conclusions other than that neural network architectures are a sensible choice for this task, and that syntax can play an important role depending on the type of data to be processed. It is important to highlight, finally, that depending on the application, one may be more interested in having an almost perfect precision (as in the system described in [...]). For automatic glossary generation from text, on the other hand, having a more balanced model, with high recall at the expense of only slightly lower precision, may be preferred, as automatic glossaries usually undergo a human post-editing and revision step.
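To make the evaluation protocol and the precision/recall trade-off discussed above concrete, the sketch below computes precision, recall and F-score for binary predictions obtained with the 0.5 decision threshold, and generates 10-fold train/test index splits. It is an assumption-laden illustration: the helper names are hypothetical, the scores are made up, and the folds are built by simple striding without shuffling or stratification, since the exact splitting procedure is not specified in the text.

```python
from typing import Iterator, List, Tuple


def precision_recall_f1(gold: List[bool],
                        predicted: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive (definitional) class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def k_fold_indices(n_items: int, k: int = 10) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; fold f tests on items f, f + k, f + 2k, ..."""
    for fold in range(k):
        test = list(range(fold, n_items, k))
        held_out = set(test)
        train = [i for i in range(n_items) if i not in held_out]
        yield train, test


# Made-up sigmoid outputs for five sentences; a sentence counts as
# definitional when its score is at least 0.5.
gold = [True, True, True, False, False]
scores = [0.9, 0.5, 0.2, 0.7, 0.1]
predicted = [score >= 0.5 for score in scores]
precision, recall, f1 = precision_recall_f1(gold, predicted)

# 4,718 is the size of the WCL corpus (1,871 definitions + 2,847 distractors).
folds = list(k_fold_indices(4718, k=10))
```

Raising the threshold above 0.5 would typically trade recall for precision, which is the kind of application-dependent choice mentioned above.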

Experiment 2: Cross-domain DE
In this experiment we assess the performance of a cross-domain model on the W00 dataset (cf. Section 3.1). The main goal is to verify to what extent a model trained only on Wikipedia-like definitions can do well in a domain-specific setting. To this end, we apply our best performing configuration trained on the whole WCL corpus to the W00 dataset (WCL>W00), and compare it with the performance of our best configuration as per 10-fold CV (C-BLSTM100_d, see Table 1). This experiment is important, for example, for learning what would be more appropriate if we were to aim at constructing domain-specific glossaries or at extracting highly specific semantic relations from a domain terminology.

[Table 2. Columns: System, Precision, Recall, F-Score]
The results in Table 2 reveal that, despite differences in style, a system trained on encyclopedic definitions outperforms a neural model trained only on the idiosyncratic W00 definitions. This might be due to several reasons. First, the W00 dataset is smaller (2,185 sentences, against 4,718 in WCL). Second, the noisier nature of the corpus may make it harder for a neural model to identify recurrent definitional patterns. Still, our experimental results seem to suggest that these patterns do exist, as evidenced by the strong performance of the Wikipedia-trained model.

Qualitative Evaluation
We run our best performing model over a subset of the ACL-ARC anthology (Bird et al., 2008), specifically the subcorpus described in Espinosa-Anke et al. (2016a), from which noisy sentences produced by the PDF-to-text conversion were removed.
In Table 3 we show three high quality definitions discovered by our model, as well as three false positives. We may highlight the model's remarkable, and somewhat surprising, capacity to identify definitions beyond the is-a pattern (e.g., using the verb 'mean') and with long-distance dependencies between subject and object. As for the incorrect cases, we find that for this model to be used in the automatic glossary construction task, in addition to further refinement, it would have to be coupled with a term extraction system so that only definitions associated with meaningful domain terms are extracted.
the segmentation of a translation memory is a key feature for our system
Table 3: Examples of extracted definitions with over 0.9 confidence from a subset of the ACL-ARC corpus.

Conclusion
We have presented and evaluated a neural model based on CNNs and Bidirectional LSTMs which obtains state-of-the-art results on two well known definition extraction datasets. From our experiments, we conclude that: (1) neural network architectures perform well for identifying definitional text snippets in corpora, more so with syntactic information; (2) a model trained on Wikipedia is competitive even in a domain-specific setting; and (3) more complex linguistic structures seem to be better captured with more complex models. As for future work, it would be interesting to explore whether meaningful further gains can be obtained by performing hyperparameter tuning.