Evaluating the Utility of Hand-crafted Features in Sequence Labelling

Conventional wisdom holds that hand-crafted features are redundant for deep learning models, as such models already learn adequate representations of text automatically from corpora. In this work, we test this claim by proposing a new method for exploiting hand-crafted features as part of a novel hybrid learning approach, incorporating a feature auto-encoder loss component. We evaluate on the task of named entity recognition (NER), where we show that including manual features for part-of-speech tags, word shapes and gazetteers can improve the performance of a neural CRF model. We obtain an F1 of 91.89 on the CoNLL 2003 English shared task, which significantly outperforms a collection of highly competitive baseline models. We also present an ablation study showing the importance of auto-encoding over using features as either inputs or outputs alone, and moreover show that including the auto-encoder component reduces training data requirements to 60% while retaining the same predictive accuracy.


Introduction
Deep neural networks have proven to be a powerful framework for natural language processing, demonstrating strong performance on a number of challenging tasks, ranging from machine translation (Cho et al., 2014a,b) to text categorisation (Zhang et al., 2015; Joulin et al., 2017; Liu et al., 2018b). Not only do such deep models outperform traditional machine learning methods, they also come with the benefit of not requiring difficult feature engineering. For instance, both Lample et al. (2016) and Ma and Hovy (2016) propose end-to-end models for sequence labelling tasks and achieve state-of-the-art results.

* https://github.com/minghao-wu/CRF-AE
† Work carried out at The University of Melbourne

Orthogonal to these advances in deep learning is the effort spent on feature engineering. A representative example is the task of named entity recognition (NER), which requires both lexical and syntactic knowledge, and where, until recently, most models relied heavily on statistical sequence labelling models taking in manually engineered features (Florian et al., 2003; Chieu and Ng, 2002; Ando and Zhang, 2005). Typical features include POS and chunk tags, prefixes and suffixes, and external gazetteers, all of which represent years of accumulated knowledge in the field of computational linguistics.
The work of Collobert et al. (2011) started the trend of feature-engineering-free modelling by learning internal representations of compositional components of text (e.g., word embeddings). Subsequent work has shown impressive progress by capturing syntactic and semantic knowledge with dense real-valued vectors trained on large unannotated corpora (Mikolov et al., 2013a,b; Pennington et al., 2014). Enabled by the powerful representational capacity of such embeddings and of neural networks, feature engineering has largely been replaced by taking off-the-shelf pre-trained word embeddings as input, thereby making models fully end-to-end, and the research focus has shifted to neural network architecture engineering.
More recently, there has been increasing recognition of the utility of linguistic features (Chen et al., 2017; Wu et al., 2017; Liu et al., 2018a), where such features are integrated to improve model performance. Inspired by this, and taking NER as a case study, we investigate the utility of hand-crafted features in deep learning models, challenging the conventional wisdom that such features are redundant. Of particular interest to this paper is the work of Ma and Hovy (2016), who introduce a strong end-to-end model combining a bi-directional Long Short-Term Memory (Bi-LSTM) network with Convolutional Neural Network (CNN) character encoding in a Conditional Random Field (CRF). Their model is highly capable of capturing not only word- but also character-level features. We extend this model by integrating an auto-encoder loss, allowing the model to take hand-crafted features as input and reconstruct them as output, and show that, even with such a highly competitive model, incorporating linguistic features is still beneficial. Perhaps the closest to this study is the work of Ammar et al. (2014) and follow-up work, showing how CRFs can be framed as auto-encoders in unsupervised or semi-supervised settings. With our proposed model, we achieve strong performance on the CoNLL 2003 English NER shared task with an F1 of 91.89, significantly outperforming an array of competitive baselines. We conduct an ablation study to better understand the impact of each manually crafted feature. Finally, we provide an in-depth analysis of model performance when trained with varying amounts of data, and show that the proposed model remains highly competitive with only 60% of the training set.

Methodology
In this section, we first outline the model architecture, then the manually crafted features, and finally how they are incorporated into the model.

Model Architecture
We build on a highly competitive sequence labelling model, namely the Bi-LSTM-CNN-CRF, first introduced by Ma and Hovy (2016). Given an input sequence x = {x_1, x_2, ..., x_T} of length T, the model tags each input token with a predicted label ŷ, producing a sequence ŷ = {ŷ_1, ŷ_2, ..., ŷ_T} that should closely match the gold label sequence y = {y_1, y_2, ..., y_T}. Here, we extend the model by incorporating an auto-encoder loss that takes hand-crafted features as input and output, thereby forcing the model to preserve the crucial information stored in such features and allowing us to evaluate the impact of each feature on model performance. Specifically, our model, referred to as Neural-CRF+AE, consists of four major components: (1) a character-level CNN (char-CNN); (2) a word-level bi-directional LSTM (Bi-LSTM); (3) a conditional random field (CRF); and (4) an auto-encoder auxiliary loss. An illustration of the model architecture is presented in Figure 1.

Char-CNN. Previous studies (Santos and Zadrozny, 2014; Chiu and Nichols, 2016; Ma and Hovy, 2016) have demonstrated that CNNs are highly capable of capturing character-level features. Our character-level CNN is similar to that used in Ma and Hovy (2016), but differs in that we use a ReLU activation (Nair and Hinton, 2010).
Bi-LSTM. We use a Bi-LSTM to learn the contextual information of a sequence of words. As input to the Bi-LSTM, we first concatenate the pre-trained embedding of each word w_i with its character-level representation c_{w_i} (the output of the char-CNN) and a vector of manually crafted features f_i (described in Section 2.2): x_i = [w_i; c_{w_i}; f_i], where [;] denotes concatenation. The outputs of the forward and backward passes of the Bi-LSTM are then concatenated to form the contextual representation h_i.

CRF. For sequence labelling tasks, it is intuitive and beneficial to utilise the information carried between neighbouring labels when predicting the best label sequence for a given sentence. We therefore employ a conditional random field layer (Lafferty et al., 2001) taking as input the output of the Bi-LSTM, h_i. Training is carried out by maximising the log probability of the gold sequence, L_CRF = log p(y|x), while decoding can be performed efficiently with the Viterbi algorithm.
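Viterbi decoding over the CRF's emission and transition scores can be sketched as follows. This is a minimal pure-Python illustration of the generic max-sum recursion, not the authors' implementation; the score lists stand in for the Bi-LSTM outputs and learned transition matrix:

```python
def viterbi(emissions, transitions):
    # emissions[t][k]: score of label k at position t (here, from the Bi-LSTM)
    # transitions[j][k]: CRF score for moving from label j to label k
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])          # best score ending in each label at t = 0
    backpointers = []
    for t in range(1, T):
        new_score, ptrs = [], []
        for k in range(K):
            # best previous label for each current label k
            best_prev = max(range(K), key=lambda j: score[j] + transitions[j][k])
            ptrs.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][k]
                             + emissions[t][k])
        backpointers.append(ptrs)
        score = new_score
    # trace the best path backwards from the highest-scoring final label
    best = max(range(K), key=lambda k: score[k])
    path = [best]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```

The recursion runs in O(T K^2), which is what makes exact decoding over the exponentially many label sequences tractable.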
Auto-encoder loss. Alongside sequence labelling as the primary task, we also deploy, as auxiliary tasks, three auto-encoders for reconstructing the hand-engineered feature vectors. To this end, we add multiple independent fully-connected dense layers, all taking as input the Bi-LSTM output h_i, with each responsible for reconstructing a particular type of feature: f̂_i^t = σ(W^t h_i), where σ is the sigmoid activation function, t denotes the feature type, and W^t is a trainable parameter matrix. More formally, we define the auto-encoder loss L_AE^t for feature type t as the cross-entropy between the original feature vector f_i^t and its reconstruction f̂_i^t, summed over all positions i.

Model training. Training is carried out by optimising the joint loss L = L_CRF + Σ_t λ_t L_AE^t, where, in addition to L_CRF, we add the auto-encoder losses, each weighted by λ_t. In all our experiments, we set λ_t to 1 for all t.
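The reconstruction and the joint loss can be sketched in a few lines of pure Python. This is an illustrative sketch, assuming a binary cross-entropy reconstruction loss per feature type; the function and variable names are our own, not from the paper's codebase:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reconstruct(h, W):
    # f_hat^t = sigmoid(W^t h): one dense layer per feature type,
    # taking the Bi-LSTM output h as input
    return [sigmoid(sum(w * x for w, x in zip(row, h))) for row in W]

def xent(f, f_hat, eps=1e-9):
    # cross-entropy between the original (binary) feature vector
    # and its sigmoid reconstruction
    return -sum(fi * math.log(fh + eps) + (1 - fi) * math.log(1 - fh + eps)
                for fi, fh in zip(f, f_hat))

def joint_loss(l_crf, ae_losses, lambdas):
    # L = L_CRF + sum_t lambda_t * L_AE^t  (lambda_t = 1 in all experiments)
    return l_crf + sum(lam * l for lam, l in zip(lambdas, ae_losses))
```

Because the auto-encoder heads share h_i with the CRF, gradients from the reconstruction terms push the Bi-LSTM to retain the information carried by the hand-crafted features.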

Hand-crafted Features
We consider three categories of widely used features: (1) POS tags; (2) word shapes; and (3) gazetteers. Each feature type is encoded as a vector, and the per-type vectors are concatenated to form f_i for the i-th word. In addition, we also experimented with including the label of the incoming dependency edge to each word as a feature, but observed performance deterioration on the development set. While we still study and analyse the impact of this feature in Table 3 and Section 3.2, it is excluded from our model configuration (i.e., it is not part of f_i unless indicated otherwise).
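Of the three feature types, word shape is the only one computed directly from the surface string. One common convention maps uppercase letters to X, lowercase to x and digits to d; this is a sketch of that convention, and the paper's exact shape alphabet may differ:

```python
def word_shape(word):
    # Map uppercase -> X, lowercase -> x, digit -> d;
    # punctuation and other symbols pass through unchanged.
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)
```

For example, "McDonald" becomes "XxXxxxxx" and "1996-07" becomes "dddd-dd", so capitalisation and digit patterns survive even for out-of-vocabulary words.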

Experiments
In this section, we present our experimental setup and results for named entity recognition on the CoNLL 2003 English NER shared task dataset (Tjong Kim Sang and De Meulder, 2003).

Experimental Setup
Dataset. We use the CoNLL 2003 NER shared task dataset, consisting of 14,041/3,250/3,453 sentences in the training/development/test sets respectively, all extracted from Reuters news articles published between 1996 and 1997. The dataset is annotated with four categories of named entity: PERSON, LOCATION, ORGANIZATION and MISC. We use the IOBES tagging scheme, as previous studies have shown that this scheme provides a modest improvement in model performance (Ratinov and Roth, 2009; Chiu and Nichols, 2016; Lample et al., 2016; Ma and Hovy, 2016).
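Converting the dataset's tags to IOBES only requires looking one tag ahead: chunk-final tokens become E- and single-token chunks become S-. A minimal sketch, assuming IOB2 input (every chunk starts with B-); the function name is ours:

```python
def iob_to_iobes(tags):
    # IOB2 -> IOBES: singleton chunks become S-, chunk-final tokens become E-
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        chunk_ends = not (nxt.startswith("I-") and nxt[2:] == etype)
        if prefix == "B":
            out.append(("S-" if chunk_ends else "B-") + etype)
        else:  # prefix == "I"
            out.append(("E-" if chunk_ends else "I-") + etype)
    return out
```

The richer scheme lets the CRF learn distinct transition scores for entering, continuing and leaving an entity, which is where the reported modest gains come from.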
Model configuration. Following the work of Ma and Hovy (2016), we initialise word embeddings with GloVe (Pennington et al., 2014) (300-dimensional, trained on a 6B-token corpus). Character embeddings are 30-dimensional and randomly initialised from a uniform distribution. Parameters are optimised with stochastic gradient descent (SGD) with an initial learning rate of η = 0.015 and momentum of 0.9. Exponential learning rate decay is applied every 5 epochs with a factor of 0.8. To reduce the impact of exploding gradients, we employ gradient clipping at 5.0 (Pascanu et al., 2013).
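The decay schedule above is a simple step function of the epoch number; as a sketch (the hyperparameter values are the ones stated above, the function itself is illustrative):

```python
def learning_rate(epoch, eta0=0.015, decay=0.8, step=5):
    # Exponential step decay: multiply by `decay` once every `step` epochs
    return eta0 * decay ** (epoch // step)
```

Over the 40-epoch run this decays the rate through eight plateaus, ending at roughly 0.015 * 0.8^7 ≈ 0.0031 for the final epochs.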
We train our models on a single GeForce GTX TITAN X GPU. With the above hyper-parameter setting, training takes approximately 8 hours for a full run of 40 epochs.
Evaluation. We measure model performance with the official CoNLL evaluation script and report span-level named entity F-score on the test set using early stopping based on the performance on the validation set. We report average F-scores and standard deviation over 5 runs for our model.
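Span-level scoring, as implemented by the official conlleval script, counts an entity as correct only if both its boundaries and its type match exactly. The following is a simplified sketch of that metric over pre-extracted spans, not the official script itself:

```python
def span_f1(gold_spans, pred_spans):
    # spans are sets of (start, end, entity_type) tuples; exact-match scoring
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Exact-match scoring is stricter than token-level accuracy: a prediction that clips one token off a gold entity counts as both a false positive and a false negative.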
Baseline. In addition to reporting a number of prior results of competitive baseline models, as listed in Table 2, we also re-implement the Bi-LSTM-CNN-CRF model by Ma and Hovy (2016) (referred to as Neural-CRF in Table 2) and report its average performance.

Results
The experimental results are presented in Table 2. Neural-CRF+AE, whether trained on the training set only or with the addition of the development set, achieves substantial improvements in F-score, superior to all but one of the benchmark models, highlighting the utility of hand-crafted features incorporated via the proposed auto-encoder loss. Compared against the Neural-CRF, a very strong model in itself, our model significantly improves performance, showing the positive impact of our technique for exploiting manually engineered features. Although Peters et al. (2018) report a higher F-score using their ELMo embedding technique, our approach is orthogonal, and we would accordingly expect a performance increase if we were to incorporate their ELMo representations into our model.

Ablation Study
To gain a better understanding of the impact of each feature, we perform an ablation study and present the results in Table 3.

Table 2: F1 scores on the CoNLL 2003 English test set.

    Model                      F1
    Chieu and Ng (2002)        88.31
    Florian et al. (2003)      88.76
    Ando and Zhang (2005)      89.31
    Collobert et al. (2011)    89.59
    …                          90.10
    Passos et al. (2014)       90.90
    Lample et al. (2016)       90.94
    Luo et al. (2015)          91.20
    Ma and Hovy (2016)         91.21
    …                          91.62
    Peters et al. (2018)       90.15
    Peters et al. (2018)       …

We observe performance degradation when eliminating the POS, word shape and gazetteer features, showing that each feature contributes to NER performance beyond what is learned through deep learning alone. Interestingly, the contribution of gazetteers is much smaller than that of the other features, most likely because of the noise introduced in the matching process, with many false-positive matches.
Including features based on dependency tags into our model decreases the performance slightly. This might be a result of our simple implementation (as illustrated in Table 1), which does not include dependency direction, nor parent-child relationships.
Next, we investigate the impact of different means of incorporating manually engineered features into the model. To this end, we experiment with three configurations, using the features as: (1) input only; (2) output only (equivalent to multi-task learning); and (3) both input and output (Neural-CRF+AE), and present the results in Table 4. Simply using features as either input or output only improves model performance slightly, but not significantly. It is only when features are incorporated with the proposed auto-encoder loss that we observe a significant performance boost.

Training requirements. Neural systems typically require a large amount of annotated data. Here we measure the impact of training with varying amounts of annotated data, as shown in Figure 2. With the proposed model architecture, the amount of labelled training data can be drastically reduced: our model achieves performance comparable to the baseline Neural-CRF with as little as 60% of the training data. Moreover, as we increase the amount of training text, the performance of Neural-CRF+AE continues to improve.
Hyperparameters. Our model introduces three extra hyperparameters, controlling the weight of the auto-encoder loss relative to the CRF loss for each feature type. Figure 3 shows the effect of each hyperparameter on test performance. We observe that setting λ_t = 1 gives strong performance, and that the impact of the gazetteer weight is less marked than that of the other two feature types. While increasing λ_t is mostly beneficial, performance drops if the λs are overly large, that is, when the auto-encoder loss overwhelms the main prediction task.

Conclusion
In this paper, we set out to investigate the utility of hand-crafted features. To this end, we presented a hybrid neural architecture that extends a Bi-LSTM-CNN-CRF with an auto-encoder loss, taking manual features as input and reconstructing them as output. On the task of named entity recognition, we showed significant improvements over a collection of competitive baselines, verifying the value of such features. Lastly, the method presented in this work can easily be applied to other tasks and models where hand-engineered features provide key insights about the data.