Dynamic Feature Induction: The Last Gist to the State-of-the-Art

We introduce a novel technique called dynamic feature induction that keeps inducing high dimensional features automatically until the feature space becomes 'more' linearly separable. Dynamic feature induction searches for the feature combinations that give strong clues for distinguishing certain label pairs, and generates joint features from these combinations. These induced features are trained along with the primitive low dimensional features. Our approach was evaluated on two core NLP tasks, part-of-speech tagging and named entity recognition, and achieved state-of-the-art results for both tasks, with an accuracy of 97.64 and an F1-score of 91.00, respectively, and about a 25% increase in the feature space.


Introduction
Feature engineering typically involves two processes: discovering novel features with domain knowledge, and optimizing combinations of existing features. Discovering novel features may require linguistic background as well as a good understanding of machine learning, which often makes it difficult. Optimizing feature combinations can also be difficult, but it usually requires less domain knowledge and, more importantly, can be as effective as discovering new features. It has been shown for many tasks that approaches using simple machine learning with extensive feature engineering outperform ones using more advanced machine learning with less intensive feature engineering (Xue and Palmer, 2004; Bengtson and Roth, 2008; Ratinov and Roth, 2009; Zhang and Nivre, 2011).
Recently, researchers have tried to automate the second part of feature engineering, the optimization of feature combinations, through leading-edge models such as neural networks (Collobert et al., 2011). Coupled with embedding approaches (Mikolov et al., 2013; Le and Mikolov, 2014; Pennington et al., 2014), neural networks can find the optimal feature combinations using techniques such as random weight initialization and back-propagation, and have established the new state-of-the-art for several tasks (Socher et al., 2013; Devlin et al., 2014; Yu et al., 2014). However, neural networks are not as good at optimizing combinations between sparse features, which are still the most dominant factors in natural language processing. This paper introduces a new technique called dynamic feature induction that automates the optimization of feature combinations (Section 3), and can be easily adapted to any NLP task using sparse features. Dynamic feature induction allows humans to focus on the first part of feature engineering, the discovery of novel features, while machines handle the second part. Our approach was evaluated on two core NLP tasks, part-of-speech tagging (Section 4) and named entity recognition (Section 5), and achieved state-of-the-art results for both tasks.

Nonlinearity in NLP
Linear classification algorithms such as Perceptron, Winnow, or Support Vector Machines with a linear kernel have performed exceptionally well for various NLP tasks (Collins, 2002; Zhang and Johnson, 2003; Pradhan et al., 2005). This is not because our feature space is linearly separable by nature, but because the sparse features introduced to NLP yield such a high dimensional vector space that it is effectively forced to be linearly separable. For example, NLP features for a word w_i typically involve the word forms of w_{i−1} and w_i (e.g., f_{i−1}, f_i). If the feature space is not linearly separable with these features, a common trick is to introduce 'higher' dimensional features by joining 'lower' dimensional features together (e.g., f_{i−1} f_i). The more joint features we introduce, the higher the chance that the feature space becomes linearly separable, although these joint features can easily overfit.
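The effect of joint features on separability can be illustrated with a toy case. The sketch below is not taken from the experiments: it assumes two binary primitive features whose XOR is the label, and a hypothetical helper `perceptron_separates` that reports whether a plain perceptron reaches zero training errors within a fixed number of epochs.

```python
from itertools import product


def perceptron_separates(points, labels, epochs=1000):
    """Return True if a perceptron (with bias) reaches zero training errors."""
    w = [0.0] * (len(points[0]) + 1)  # the last weight is the bias
    for _ in range(epochs):
        errors = 0
        for x, y in zip(points, labels):
            xb = list(x) + [1.0]
            score = sum(wi * xi for wi, xi in zip(w, xb))
            pred = 1 if score > 0 else -1
            if pred != y:
                errors += 1
                w = [wi + y * xi for wi, xi in zip(w, xb)]
        if errors == 0:
            return True
    return False


# Two binary primitive features; the label is their XOR.
low = [(a, b) for a, b in product((0, 1), repeat=2)]
gold = [1 if a != b else -1 for a, b in low]

# Append one joint feature that fires only when both primitives fire.
high = [(a, b, a * b) for a, b in low]

low_separable = perceptron_separates(low, gold)
high_separable = perceptron_separates(high, gold)
```

With only the primitive features no separating hyperplane exists, so the perceptron never converges; adding the single joint feature makes the same four points separable.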
Let us define low dimensional features as the primitive features such as f_{i−1} or f_i, and high dimensional features as the joint features such as f_i f_{i+1}. Low dimensional features are well explored for most NLP tasks; it is the high dimensional features that are quite sensitive to specific tasks. Finding high dimensional features can be manually intensive work, and this is what dynamic feature induction intends to take over. Kudo and Matsumoto (2003) introduced the polynomial kernel expansion that explicitly enumerated the feature combinations. Our approach differs in that they used a frequency-based PrefixSpan algorithm (Pei et al., 2001), whereas we use the online learning weights to find the feature combinations. Goldberg and Elhadad (2008) suggested an efficient algorithm for computing polynomial kernel SVMs by combining inverted indexing and kernel expansion. Their work is focused more on improving support vector machines, whereas our work is generalized to any linear classification algorithm. Okanohara and Tsujii (2009) introduced an approach for generating feature combinations using ℓ1 regularization and grafting (Perkins et al., 2003). Although we share similar ideas, their grafting algorithm starts with an empty feature set whereas ours starts with low dimensional features, and their correlation parameters α_{i,y} are pre-computed whereas ours are dynamically determined. Strubell et al. (2015) suggested an algorithm that dynamically selected strong features during decoding. Our work differs in that we do not run multiple training phases as they do for figuring out strong features.

Dynamic Feature Induction
The intuition behind dynamic feature induction is to keep populating high dimensional features by joining low dimensional features together until the feature space becomes 'more' linearly separable. Figure 1 shows how features are induced during training:
1. Given a training instance (x_1, y_1), where x_1 is a feature set consisting of 5 features and y_1 is the gold label, the classifier predicts the label ŷ_1.
2. Let "strong features for y against ŷ" refer to features that give strong clues for distinguishing y from ŷ. If ŷ_1 is not equal to y_1 (2.1), strong features for y_1 against ŷ_1 in x_1 are selected (2.2), and combinations of these features are added to the induced feature set F (2.3).
3. Given a new training instance (x_2, y_2), combinations of features in x_2 are checked against F (3.1), and appended to x_2 if allowed (3.2).
4. The extended feature set x_2 is fed into the classifier. If ŷ_2 is equal to y_2, no feature combination is induced from x_2.
Thus, high dimensional features in F are incrementally induced and learned along with low dimensional features during training. During decoding, each feature set is extended by the induced features in F, and the prediction is made using the extended feature set. The size of F can grow up to |X|^2, where |X| is the number of low dimensional features. However, we found that |F| is closer to 1/4 · |X| in practice.
The following sections explain our approach in detail. Sections 3.1, 3.2, and 3.3 describe how features are induced and learned during training. Sections 3.4 and 3.5 describe how the induced features are stored and expanded during decoding.

Feature Induction
Algorithm 1 shows an online learning algorithm that induces and learns high dimensional features during training. It takes the set of training instances D and the learning rate η, and returns the weight vector w and the set of induced features F.

Algorithm 1 Feature Induction
Input: D: training set, η: learning rate.
Output: w: weight vector, F: induced feature set.
 1: w ← g ← 0
 2: F ← ∅
 3: until max epoch is reached do
 4:   foreach (x, y) ∈ D do
 5:     ŷ ← arg max_{y′∈Y} (w · φ(x, y′, F) − I_y(y′))
 6:     if y ≠ ŷ then
 7:       ∂ ← φ(x, y, F) − φ(x, ŷ, F)
 8:       g ← g + ∂ ∘ ∂
 9:       w ← w + (η / (ρ + √g)) · ∂
10:       v ← [w ∘ φ(x, y, ∅)]_y − [w ∘ φ(x, ŷ, ∅)]_ŷ
11:       L ← arg k-max_{∀i} v_i
12:       for i ← 2 to |L| do
13:         F ← F ∪ {(L_1, L_i)}
14: return w, F

The algorithm begins by initializing the weight vector w, the diagonal vector g, and the induced feature set F (lines 1-2). For each instance (x, y) ∈ D, where y is the gold label for the feature set x, it predicts ŷ maximizing w · φ(x, y′, F) − I_y(y′), where I is defined as follows (lines 4-5):

    I_y(y′) = 1 if y = y′; 0 otherwise

The feature map φ takes (x, y, F) and returns a d×l-dimensional vector, where d and l are the numbers of features and labels, respectively; each dimension contains the value for a particular feature and a label. If certain combinations between features in x exist in F, they are appended to the feature vector along with the low dimensional features (see Section 3.5 for more details). The indicator function I allows our algorithm to be optimized for the hinge loss for multiclass classification (Crammer and Singer, 2002):

    ℓ_h = max[0, 1 + w · (φ(x, ŷ, F) − φ(x, y, F))]

If y is not equal to ŷ (line 6), the partial vector ∂ is measured (line 7), and g and w are updated (lines 8-9) by AdaGrad (Duchi et al., 2011), where the learning rate η is adjusted by g and ρ is a small constant that keeps the denominator positive. The strength vector v is then measured (line 10), where [·]_y returns only the portion of the values relevant to y (Figure 2). The i'th element in v represents the strength of the i'th feature for y against ŷ; the greater v_i is, the stronger the i'th feature is. Next, indices of the top-k entries in v are collected in the ordered list L (line 11), representing the strongest features for y against ŷ. Finally, the pairs of the first index in L, representing the strongest feature, and the other indices in L are added to the induced feature set F (lines 12-13). For example, if L = [i, j, k], the pairs (i, j) and (i, k) are added to F. For all our experiments, k = 3 is used; increasing k beyond this cutoff did not show much improvement.

Notice that all induced features in F are derived by joining only low dimensional features together. Our algorithm does not join a high dimensional feature with either a low dimensional feature or another high dimensional feature. This was done intentionally to prevent the feature space from exploding; such features can be induced by replacing ∅ with F in line 10 as follows:

    v ← [w ∘ φ(x, y, F)]_y − [w ∘ φ(x, ŷ, F)]_ŷ

It is worth mentioning that we did not find it useful to join intermediate features together (e.g., (j, k) in the above example). It is possible to utilize these combinations by weighting them differently, which we will explore in the future. Additionally, we experimented with the combinations between strong and weak features (joining the i'th and j'th features, where v_i > 0 and v_j < 0), which again was not so useful. We are planning to evaluate our approach on more tasks and data, which will give us a better understanding of which combinations are the most effective.
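The induction step of Algorithm 1 (lines 10-13) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the weights are stored per label in a hypothetical `weights[label][i]` layout, the strength of each active feature is taken as the weight difference between the gold and predicted labels times the feature value, and the sketch does not filter by the sign of the strength.

```python
def induce_pairs(weights, x, gold, pred, k=3):
    """Return index pairs (strongest, other) built from the top-k strong features.

    weights[label][i] is the weight of feature i for `label` (assumed layout);
    x maps active low dimensional feature indices to their values.
    """
    # Strength of each active feature for the gold label against the prediction.
    v = {i: (weights[gold][i] - weights[pred][i]) * val for i, val in x.items()}
    # Indices of the k largest strengths, strongest first.
    top = sorted(v, key=lambda i: v[i], reverse=True)[:k]
    # Pair the strongest feature with each of the remaining top-k features.
    return [(top[0], other) for other in top[1:]]


weights = {"NN": [0.5, 0.1, 0.9, 0.0], "VB": [0.1, 0.3, 0.2, 0.0]}
x = {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0}
pairs = induce_pairs(weights, x, gold="NN", pred="VB")
```

In this toy input, feature 2 is the strongest for NN against VB, so it is paired with the next two strongest features; the intermediate pair between those two is never generated.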

Regularized Dual Averaging
Each high dimensional feature in F is induced for distinguishing between two labels, y and ŷ, but it may or may not be helpful for distinguishing labels other than those two. Our algorithm can be modified to learn the weights of the induced features only for their relevant labels by adding the label information to F, which would change line 13 in Algorithm 1 as follows:

    F ← F ∪ {(L_1, L_i, y, ŷ)}

However, introducing features targeting specific label pairs potentially confuses the classifier, especially when they are trained with the low dimensional features targeting all labels. Instead, it is better to apply a feature selection technique such as ℓ1 regularization so the induced features can be selectively learned for labels that find those features useful. We adapt regularized dual averaging (RDA; Xiao, 2010), an efficient method for online convex optimization that works most effectively with sparse feature vectors. To apply regularized dual averaging, line 1 in Algorithm 1 is changed to also initialize two additional variables, c and t. c is a d × l-dimensional vector consisting of accumulative penalties. t is the number of weight vectors generated during training. Although w is technically not updated when y = ŷ, it is still considered a new vector. Thus, t is incremented for every training instance, so t ← t + 1 is inserted after line 5. c is updated by adding the partial vector ∂ as follows (to be inserted after line 7):

    c ← c + ∂

Thus, each dimension in c represents the accumulative penalty (or reward) for a particular feature and a label. At last, line 9 is changed to:

    w ← (η / (ρ + √g)) · ℓ1(c, t, λ)
    ℓ1(c, t, λ)_i = c_i − sgn(c_i) · λ · t   if |c_i| > λ · t;   0 otherwise

Here ρ is a small positive constant that keeps the denominator nonzero. The function ℓ1 takes c, t, and the regularization parameter λ tuned during development. If the absolute value of the accumulative penalty c_i is greater than λ · t, the weight w_i is updated by λ and t; otherwise, it is set to 0.
For our experiments, RDA was able to throw out irrelevant features successfully, and showed improvement in accuracy; in fact, dynamic feature induction without RDA did not show as much improvement over low dimensional features.
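The truncation performed by the ℓ1 function can be sketched as below. The sketch assumes the standard soft-thresholding form used in regularized dual averaging, where each accumulated penalty is shrunk toward zero by λ · t and zeroed when it does not exceed that threshold; the function name `l1_truncate` and the plain-list representation are illustrative choices.

```python
def l1_truncate(c, t, lam):
    """Soft-threshold each accumulated penalty c_i by lam * t."""
    threshold = lam * t
    out = []
    for ci in c:
        if abs(ci) > threshold:
            sign = 1.0 if ci > 0 else -1.0
            out.append(ci - sign * threshold)  # shrink toward zero
        else:
            out.append(0.0)  # the feature is dropped for this label
    return out


truncated = l1_truncate([3.0, -0.5, -4.0], t=10, lam=0.1)
```

Because the threshold grows with t, a feature whose accumulated penalty stops growing is eventually zeroed, which is how irrelevant induced features get thrown out.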

Locally Optimal Learning to Search
Features in most NLP tasks are extracted from structures (e.g., sequence, tree). For structured learning, we adapt "locally optimal learning to search" (LOLS; Chang et al., 2015b), which is a member of the imitation learning family similar to DAGGER (Ross et al., 2011). LOLS not only performs well relative to the reference policy, but can also improve upon the reference policy, showing very good results for tasks such as part-of-speech tagging and dependency parsing. We adapt LOLS by setting the reference policy as follows:
1. The reference policy π determines how often the gold label y is picked over the predicted label ŷ to build a structure. For all our experiments, π is initialized to 0.95.
2. For the first epoch, since π is 0.95, y is randomly picked over ŷ 95% of the time.
3. After every epoch, π is multiplied by 0.95. This allows the next epoch to pick y less often than the previous epoch (e.g., π becomes 0.95^2 = 0.9025 for the 2nd epoch, so y is picked about 90% of the time instead of 95%).
For our experiments, LOLS gave only marginal improvement, probably because the tasks we evaluated, part-of-speech tagging and named entity recognition, did not yield complex structures. However, we still included this in our framework because we wanted to evaluate our approach on more tasks such as dependency parsing where learning to search algorithms show a clear advantage (Goldberg and Nivre, 2012;Choi and McCallum, 2013;Chang et al., 2015a).
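The decaying schedule for π described above can be sketched as follows. The function names and the 1-based epoch convention are assumptions made for illustration; only the initial value 0.95 and the per-epoch multiplier 0.95 come from the text.

```python
import random


def gold_pick_rate(epoch, initial=0.95, decay=0.95):
    """Probability of picking the gold label in a 1-based epoch."""
    return initial * decay ** (epoch - 1)


def choose_label(gold, predicted, epoch, rng):
    """Pick the gold label with the epoch's probability, else the prediction."""
    return gold if rng.random() < gold_pick_rate(epoch) else predicted


rng = random.Random(0)
label = choose_label("NN", "VB", epoch=1, rng=rng)
```

As epochs progress, the structure is built increasingly from the classifier's own predictions, so later training instances reflect the errors the model would see during decoding.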

Feature Hashing
Feature hashing is a technique for converting string features to vectors (Ganchev and Dredze, 2008; Weinberger et al., 2009). Given a string feature f and a hash function h, the index of f in the vector space is determined by taking the remainder of the hash code:

    k ← h_{string→int}(f) mod δ

The divisor δ is tuned during development. Feature hashing makes it possible to convert string features into sparse vectors without reserving extra space for a map whose keys and values are the string features and their indices. Given a feature index pair (i, j) representing strong features for y against ŷ (Section 3.1), the index of the induced feature can be measured as follows:

    k ← h_{int→int}(i · |X| + j) mod δ

For efficiency, feature hashing is adapted to our system such that the induced feature set F is actually not a set but a δ-dimensional boolean array, where each dimension represents the validity of the corresponding induced feature. Thus, line 13 in Algorithm 1 is changed to:

    F_k ← True, where k ← h_{int→int}(L_1 · |X| + L_i) mod δ

For the choice of h, xxHash is used, which is a fast non-cryptographic hash algorithm showing the perfect score on the Q.Score.
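The two index computations can be sketched as below. As an explicit substitution, the sketch uses `zlib.crc32` from the Python standard library as a stand-in hash function rather than xxHash, and the function names, the 8-byte integer encoding, and the toy values of |X| and δ are assumptions.

```python
import zlib


def string_index(feature, delta):
    """Index of a string feature: hash code modulo delta."""
    return zlib.crc32(feature.encode("utf-8")) % delta


def pair_index(i, j, num_low, delta):
    """Index of the induced feature joining low dimensional features i and j."""
    key = i * num_low + j  # one integer per ordered pair when j < num_low
    return zlib.crc32(key.to_bytes(8, "little")) % delta


delta = 1000                # toy divisor; tuned on development data in the paper
induced = [False] * delta   # F as a delta-dimensional boolean array
k = pair_index(3, 7, num_low=100, delta=delta)
induced[k] = True           # marks the pair (3, 7) as a valid induced feature
```

Storing F as a boolean array makes the validity check a single array lookup, at the cost of occasional collisions between different pairs.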

Feature Expansion
Algorithm 2 describes how high dimensional features are expanded from low dimensional features during training and decoding. It takes the sparse vector x_l containing only low dimensional features and returns a new sparse vector x_{l+h} containing both low and high dimensional features.

Algorithm 2 Feature Expansion
Input: x_l: sparse feature vector containing only low dimensional features.
Output: x_{l+h}: sparse feature vector containing both low and high dimensional features.
1: x_{l+h} ← copy(x_l)
2: for i ← 1 to |x_l| do
3:   for j ← i + 1 to |x_l| do
4:     k ← h_{int→int}(i · |X| + j) mod δ
5:     if F_k then x_{l+h}.append(k)
6: return x_{l+h}

The algorithm begins by copying x_l to x_{l+h} (line 1). For every combination (i, j) ∈ x_l × x_l, where i and j represent the corresponding feature indices (lines 2-3), it first measures the index k of the feature combination (line 4), then checks if this combination is valid (Section 3.4). If the combination is valid, meaning that F_k = True, k is added to x_{l+h} (line 5). Finally, x_{l+h} is returned with the expanded high dimensional features.
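Algorithm 2 can be sketched in a few lines. The sketch makes several assumptions: `zlib.crc32` stands in for the hash function, the sparse vector is a plain list of active feature indices, and each pair is hashed in the order in which its indices appear in that list.

```python
import zlib


def pair_index(i, j, num_low, delta):
    """Hashed index of the feature pair (i, j); crc32 is a stand-in hash."""
    key = i * num_low + j
    return zlib.crc32(key.to_bytes(8, "little")) % delta


def expand(x_low, induced, num_low):
    """Return a copy of x_low extended with the indices of valid induced features."""
    delta = len(induced)
    x_all = list(x_low)
    for a in range(len(x_low)):
        for b in range(a + 1, len(x_low)):
            k = pair_index(x_low[a], x_low[b], num_low, delta)
            if induced[k]:  # the combination was induced during training
                x_all.append(k)
    return x_all


induced = [False] * 1000
target = pair_index(2, 5, num_low=100, delta=1000)
induced[target] = True
expanded = expand([2, 5, 9], induced, num_low=100)
```

The nested loop is quadratic in the number of active features per instance, which stays small for sparse NLP feature vectors.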
Part-of-Speech Tagging

Corpus
The Wall Street Journal corpus from the Penn Treebank III is used (Marcus et al., 1993) with the standard split for part-of-speech tagging experiments.


Tagging and Learning Algorithms
A one-pass, left-to-right tagging algorithm is used for our experiments. Such a simple algorithm is chosen because we want to see the performance gain purely from our approach, not from a more sophisticated tagging algorithm (Toutanova et al., 2003; Shen et al., 2007), which may improve the performance further. For learning, the final algorithm from Section 3 is used. Additionally, mini-batch training is applied, where each batch consists of the training instances from k sentences, so the batch sizes vary. We found that grouping instances with respect to sentence boundaries was more effective than batching them across arbitrary sentence boundaries. For all our experiments, the learning rate η = 0.02 and the mini-batch boundary k = 5 were used without tuning.
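Batching by sentence boundary can be sketched as follows. The representation is an assumption: each sentence is a list of training instances (one per word), and the hypothetical `sentence_batches` generator concatenates the instances of k consecutive sentences.

```python
def sentence_batches(sentences, k=5):
    """Yield batches holding all instances from k consecutive sentences."""
    for start in range(0, len(sentences), k):
        batch = []
        for sentence in sentences[start:start + k]:
            batch.extend(sentence)
        yield batch


# Three toy sentences with 2, 1, and 3 instances; batches of k = 2 sentences.
toy = [["w1", "w2"], ["w3"], ["w4", "w5", "w6"]]
batches = list(sentence_batches(toy, k=2))
```

Batch sizes differ because sentences differ in length, but no sentence is ever split across two batches.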

Ambiguity Classes
The ambiguity class of a word is the concatenation of all possible tags for that word. For example, if the word 'study' can be tagged as NN (common noun) or VB (base verb), its ambiguity class becomes NN VB. Instead of building ambiguity classes only from the training dataset, we automatically tagged a mixture of large datasets, the English Wikipedia articles (dumps.wikimedia.org/enwiki) and the New York Times corpus (catalog.ldc.upenn.edu/LDC2008T19), and pre-constructed ambiguity classes using the automatic tags before training. This was motivated by Moore (2015), who showed extraordinary results on out-of-vocabulary words by limiting the classification to the ambiguity classes collected from such large corpora. We used the ClearNLP POS tagger (Choi and Palmer, 2012) to tag the data (about 141M words), threw away tags appearing less often than a certain threshold, and created the ambiguity classes. For each word, tags appearing less than 20% of the time for that word were discarded. As a result, about 2M ambiguity classes were collected from these datasets.

Feature Template
Table 2 shows the template for low dimensional features. Digits inside the curly brackets indicate the context windows with respect to the word w_i to be tagged. For example, f_{0,±1} represents the word forms of w_i, w_{i−1}, and w_{i+1}. No joint features (e.g., f_0 f_1) are included in this template; they should be automatically induced by dynamic feature induction.
Orthographic (Giménez and Màrquez, 2004) and word shape (Finkel et al., 2005) features are adapted from the previous work. The positional features indicate whether w_i is the first or the last word in the sentence. Word clusters are trained on the same datasets in Section 4.3 using Brown et al. (1992).

Table 2: Feature template for part-of-speech tagging. f: word form, f_u: uncapitalized word form, s: word shape, c: word cluster, π_k: k'th prefix, σ_k: k'th suffix, p: part-of-speech tag, a: ambiguity class, O: orthographic feature set, P: positional feature set.

Development
The regularization parameter λ (Section 3.2) and the modulo divisor δ (Section 3.4) are tuned during development through grid search on λ ∈ [1E-9, 1E-6] and δ ∈ [1.5M, 5M]. Table 3 shows the accuracies achieved by our models on the development set.

M_0 used the tagging and the learning algorithms in Section 4.2 and the feature template in Section 4.4, where the ambiguity classes were collected only from the training dataset; dynamic feature induction was not used for M_0. By applying the external ambiguity classes in Section 4.3, M_1 achieved about a 5.8% improvement on OOV. M_2 gained small improvements by adding word clusters. Coupled with dynamic feature induction, M_3 and M_4 gained about 0.04% and 0.2% improvements on average for ALL and OOV, respectively. For both M_3 and M_4, about 100K more features were generated than for M_1 and M_2, implying that about 25% of the features were automatically induced by dynamic feature induction. It is worth pointing out that improving upon M_1 was a difficult task because it was already near the state-of-the-art. The external ambiguity classes by themselves were strong enough to make accurate predictions such that the induced features did not play a critical role in the classification.

Table 4 shows the accuracies achieved by the models from Section 4.5 and the previous state-of-the-art approaches on the evaluation set.

Table 4: Part-of-speech tagging accuracies on the evaluation set (columns: Approach, ALL, OOV, EXT). EXT: whether or not the approach used external data.
The results on the evaluation set appear much more promising. The biggest gain was still made by M_1, but our final model M_4 was able to achieve a 0.8% improvement on OOV over M_2, and showed state-of-the-art results on both ALL and OOV. Interestingly, M_2 showed a slightly lower accuracy on OOV than M_1 even with the additional word cluster features. On the other hand, M_2 did show a slightly higher accuracy on ALL, indicating that the model was probably overfitted to the in-vocabulary words. However, M_4 was still able to achieve improvements over M_2 on both ALL and OOV, implying that dynamic feature induction helped the classifier to be trained more robustly.

Named Entity Recognition

Corpus
The English corpus from the CoNLL'03 shared task is used (Tjong Kim Sang and De Meulder, 2003) for named entity recognition experiments.

Feature Template
Table 6 shows the feature template for NER, adapting the specifications in Table 2. Following the state-of-the-art approaches (Table 8), word clusters are trained on the Reuters Corpus Volume I (Lewis et al., 2004) using Brown et al. (1992). Named entity gazetteers are collected from DBPedia. Word embeddings are trained on the datasets in Section 4.3 using Mikolov et al. (2013) and appended to the sparse feature vectors as dense vectors. Note that the word embedding features did not participate in dynamic feature induction; it was not intuitive how to combine sparse and dense features together, so we left it as future work.

Development
The regularization parameter and the modulo divisor are tuned during development through the same grid search as in Section 4.5. Table 7 shows the precisions and the recalls achieved by our models on the development set (the F1-scores are shown in Table 8).

M_0 used the tagging and the learning algorithms in Section 4.2 and the feature template in Section 5.2, excluding the gazetteer, cluster, and embedding features; dynamic feature induction was not applied to M_0. M_{1,2,3} gained incremental improvements from the gazetteer, cluster, and embedding features, respectively. M_4 showed 0.36% and 0.67% improvements on precision and recall, respectively, and generated about 40K more features compared to M_3. This is about a 23% increase in features, which is similar to the increase shown in Table 3.

All models showed improvements over their predecessors; the improvements made in TST were more dramatic than the ones made in DEV although they followed a very similar trend. Notice that M_3, not using dynamic feature induction, showed very similar scores to Ratinov and Roth (2009). This was not surprising because M_3 adapted many features suggested by them, except for the non-local features. M_4 achieved about a 0.5% improvement over M_3, showing the state-of-the-art result on TST. Considering that M_3 was already near state-of-the-art, this improvement was meaningful. It was interesting that Suzuki and Isozaki (2008) achieved the state-of-the-art result on DEV although their score on TST was much lower than the other approaches. This might be because features extracted from the huge external data they used were overfitted to DEV, but a more thorough analysis needs to be done. On the other hand, Passos et al. (2014) achieved a near state-of-the-art result on DEV while also getting a very high score on TST by utilizing phrase embeddings, which we will look into in the future.

Conclusion
In this paper, we introduced a novel technique called dynamic feature induction that automatically induces high dimensional features so the feature space can be more linearly separable. Our approach was evaluated on two NLP tasks, part-of-speech tagging and named entity recognition, and showed state-of-the-art results on both tasks. The improvements achieved by dynamic feature induction might not be statistically significant, but they are important because they gave the last gist to the state-of-the-art; without this last gist, our system would not have reached the bar.
It is worth mentioning that we also experimented with several feature templates including many joint features without applying dynamic feature induction. The results we got from these manually induced features were not any better (often worse) than the ones achieved by dynamic feature induction, which was very encouraging. In the future, we will evaluate our approach on more NLP tasks such as dependency parsing and coreference resolution, where induced features should play a more critical role.
We concede that our approach is more empirically motivated than theoretically justified. For instance, the choice of k (line 11) and the combination configuration for L (line 13) in Algorithm 1 were derived empirically. All the parameters are automatically tuned by running grid searches on the development sets (Sections 4.5 and 5.3); it would be intellectually intriguing to find a more principled way of adjusting these hyper-parameters than brute-force search.
Locally optimal learning to search is used to help structured learning, although it has a relatively small impact on tasks involving sequence classification such as part-of-speech tagging and named entity recognition. This framework is used because we plan to apply our approach to more structurally oriented tasks such as dependency parsing and AMR parsing. Our work is also related to feature grouping, which has been shown to be beneficial in learning high-dimensional data (Zhong and Kwok, 2011; Suzuki and Nagata, 2013). It will be interesting to compare our work to the previous work and see the strengths and weaknesses of our approach.