NILC at CWI 2018: Exploring Feature Engineering and Feature Learning

This paper describes the results of the NILC team at CWI 2018. We developed solutions following three approaches: (i) a feature engineering method using lexical, n-gram and psycholinguistic features, (ii) a shallow neural network method using only word embeddings, and (iii) a Long Short-Term Memory (LSTM) language model, pre-trained on a large text corpus to produce contextualized word vectors. The feature engineering method obtained our best results for the binary classification task, and the LSTM model achieved the best results for the probabilistic classification task. Our results show that deep neural networks can perform as well as traditional machine learning methods based on manually engineered features for the task of complex word identification in English.


Introduction
Research efforts on text simplification have mostly focused on either lexical (Devlin and Tait, 1998; Biran et al., 2011; Glavaš and Štajner, 2015; Paetzold and Specia, 2016b) or syntactic simplification (Siddharthan, 2006; Kauchak, 2013). Lexical simplification involves replacing specific words to reduce lexical complexity. It remains an open problem, as identifying and simplifying complex words in a given context is not straightforward. Although intuitive, the task is challenging, since the substitutions must preserve both the original meaning and the grammaticality of the sentence being simplified. Complex word identification is part of the usual lexical simplification pipeline (Paetzold and Specia, 2015), which is illustrated in Figure 1.
For the challenge, we focused on the English monolingual CWI track. We implemented three approaches using machine learning: the first uses feature engineering; the second takes the average embedding of the target words as input to a neural network; and the third models the context of the target words using an LSTM (Gers et al., 1999). Our code is publicly available on GitHub.

* The opinions expressed in this article are those of the authors and do not necessarily reflect the official policy or position of Itaú-Unibanco.

Task Description
The setup of the CWI Shared Task 2018 is as follows: given a target word (or chunk of words) in a sentence, predict whether or not a non-native English speaker would be able to understand it. The predictions are based on annotations collected from a mixture of 10 native and 10 non-native speakers. In the binary classification task, a label of "1" (complex) was assigned if at least one of the 20 annotators did not understand the word, and "0" (simple) otherwise. In the probabilistic classification task, the label is the number of annotators who marked the word as difficult divided by the total number of annotators.
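For concreteness, the two labeling schemes can be computed from raw annotations as in the minimal sketch below (the representation of annotations as 0/1 difficulty flags is our assumption):

```python
def binary_label(annotations):
    """1 (complex) if at least one annotator found the word difficult."""
    return 1 if sum(annotations) >= 1 else 0

def probabilistic_label(annotations):
    """Fraction of annotators who marked the word as difficult."""
    return sum(annotations) / len(annotations)

# Example: 3 out of the 20 annotators marked the target as difficult.
votes = [1, 1, 1] + [0] * 17
assert binary_label(votes) == 1
assert probabilistic_label(votes) == 0.15
```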
In this edition, a multilingual dataset was available, and participants could choose to take part in one or more of the following tracks: English monolingual CWI, German monolingual CWI, Spanish monolingual CWI, and multilingual CWI with a French test set. Participants could also choose between the binary and the probabilistic classification task. We chose to participate in the English monolingual track and in both classification tasks (see Table 1).

Datasets
In this work, we used two extra corpora to train language models (one of them also used to train a neural language model):
• BookCorpus dataset: 11,038 free books written by as-yet unpublished authors;
• One Billion Word dataset: the largest public benchmark for language modeling (Chelba et al., 2013).

Proposed Methods
In this section, we describe our methods, which follow three approaches: feature engineering, feature learning, and ensembles.

Methods based on Feature Engineering
We used linguistic, psycholinguistic and language model features to train several classification and probabilistic classification methods. Our feature set consists of three groups:
• LEX: includes word length, number of syllables, and number of senses, hypernyms and hyponyms in WordNet (Fellbaum, 1998);
• N-gram: includes the log probabilities of n-grams containing the target words, from two language models trained on the BookCorpus and One Billion Word datasets using SRILM (Stolcke, 2002);
• PSY: contains word-level psycholinguistic features such as familiarity, age of acquisition, concreteness and imagery values for every target word (Paetzold and Specia, 2016a).
Because an instance can contain more than one target word, we computed the mean, standard deviation, minimum and maximum of each feature, for a total of 38 features per instance. We also normalized the features using z-scores.
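As an illustration, the LEX group and the per-instance aggregation could look like the sketch below (using NLTK's WordNet interface; the syllable heuristic and function names are ours, not necessarily those of the submitted system):

```python
import numpy as np
from nltk.corpus import wordnet as wn

def count_syllables(word):
    """Rough heuristic: count maximal groups of consecutive vowels."""
    vowels = "aeiouy"
    groups, prev = 0, False
    for ch in word.lower():
        is_vowel = ch in vowels
        if is_vowel and not prev:
            groups += 1
        prev = is_vowel
    return max(groups, 1)

def lex_features(word):
    """LEX group: length, syllables, senses, hypernyms, hyponyms."""
    synsets = wn.synsets(word)
    hypernyms = sum(len(s.hypernyms()) for s in synsets)
    hyponyms = sum(len(s.hyponyms()) for s in synsets)
    return [len(word), count_syllables(word), len(synsets), hypernyms, hyponyms]

def instance_features(target):
    """Mean, std, min and max of each feature over a multi-word target."""
    per_word = np.array([lex_features(w) for w in target.split()], dtype=float)
    return np.concatenate(
        [per_word.mean(0), per_word.std(0), per_word.min(0), per_word.max(0)]
    )
```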
We trained Linear Regression, Logistic Regression, Decision Trees, Gradient Boosting, Extra Trees, AdaBoost and XGBoost methods for both classification and probabilistic classification tasks.
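A sketch of the resulting model-selection loop for the binary task (default scikit-learn/xgboost hyperparameters and synthetic stand-in data; the submitted system may have been tuned differently):

```python
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Synthetic stand-ins for the 38 z-scored features per instance.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(500, 38)), rng.integers(0, 2, 500)
X_dev, y_dev = rng.normal(size=(100, 38)), rng.integers(0, 2, 100)

classifiers = {
    "logistic": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(),
    "gradient_boosting": GradientBoostingClassifier(),
    "extra_trees": ExtraTreesClassifier(),
    "adaboost": AdaBoostClassifier(),
    "xgboost": XGBClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    auc = roc_auc_score(y_dev, clf.predict_proba(X_dev)[:, 1])
    print(f"{name}: dev ROC-AUC = {auc:.3f}")
```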

Methods based on Feature Learning and Transfer Learning
An alternative to feature engineering is to let the machine learning model itself create a data representation. This is the principle of feature learning. In this scenario, each element of the vector contains an independent value that has some meaning for the model (LeCun et al., 2015). Most importantly, we can reuse this representation in other tasks, which is called transfer learning or domain adaptation. This strategy is already used with success in Computer Vision, where deep neural networks are pre-trained on large supervised training sets like ImageNet (Girshick et al., 2014; Esteva et al., 2017).
It is common in Natural Language Processing (NLP) tasks to use pre-trained word embeddings from models like Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014). More recently, however, some studies have used distributed sentence representations to produce contextualized embeddings from a language model, machine translation model, or autoencoder (Dai and Le, 2015; Yuan et al., 2016; Le et al., 2017; Peters et al., 2017, 2018; McCann et al., 2017; Howard and Ruder, 2018).
In the following sections, we explain how we used both strategies.

Average Embedding Method
Word embeddings represent words as dense real-valued vectors that capture semantic and morphological information, which helps NLP tasks and improves neural network models (Collobert et al., 2011; Kim, 2014; Bowman et al., 2015). In this work, we obtained word vector representations for the target words; when the target was a chunk of words, we took the average of their vectors. We used pre-trained GloVe vectors (6B tokens; Pennington et al., 2014).
The resulting vector was passed to a neural network with two ReLU layers (Nair and Hinton, 2010) followed by a Sigmoid layer, which predicted the probability that the word was complex (Figure 2).
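A minimal PyTorch sketch of this architecture (hidden sizes are our assumption; the paper specifies only two ReLU layers followed by a Sigmoid):

```python
import torch
import torch.nn as nn

class AverageEmbeddingMLP(nn.Module):
    def __init__(self, emb_dim=300, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),  # P(target is complex)
        )

    def forward(self, avg_embedding):
        return self.net(avg_embedding)

# Stand-in for the GloVe vectors of a three-word target chunk.
chunk_vectors = torch.randn(3, 300)
model = AverageEmbeddingMLP()
prob = model(chunk_vectors.mean(dim=0))  # averaged embedding -> probability
```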

LSTM Method
LSTMs are a powerful tool for modeling sequential data: this type of neural network architecture can learn to map an input sentence of variable length into a fixed-dimensional vector representation. For this reason, many state-of-the-art systems in NLP incorporate an LSTM, for example in language modeling (Jozefowicz et al., 2016; Melis et al., 2017), machine translation (Di Gangi et al., 2017), textual inference (Tay et al., 2017), and other tasks.
Some studies have used a pre-trained LSTM language model (Dai and Le, 2015; Yuan et al., 2016; Le et al., 2017; Peters et al., 2017, 2018) to represent a sentence or document and used this representation to improve their results. Therefore, we trained a language model on the One Billion Word dataset using parameters similar to those of Le et al. (2017): a one-layer LSTM with 512 units, embedding size 128, and a sampled softmax loss (Jean et al., 2015). However, we used weight tying, meaning the weights of the embedding and softmax layers are shared, which reduces the total number of parameters of the model (Melis et al., 2017). For the CWI task, the LSTM read the five words before the target word, then the target word itself (or the chunk of words). We took the last hidden vector from the LSTM and passed it through a Sigmoid layer.
Figure 3 shows the pipeline: the blue boxes represent the context words and the red boxes represent the target word, which is a chunk in this example.
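A sketch of this model in PyTorch (the hidden-to-embedding projection is our assumption, added to make the 512-unit LSTM compatible with the tied 128-dimensional embedding/softmax weights):

```python
import torch
import torch.nn as nn

class LSTMComplexityModel(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # LM head used during pre-training: project to the embedding size
        # so the softmax weights can be tied with the embedding matrix.
        self.proj = nn.Linear(hidden_dim, emb_dim)
        self.decoder = nn.Linear(emb_dim, vocab_size, bias=False)
        self.decoder.weight = self.embedding.weight  # weight tying
        # CWI head: last hidden vector -> complexity probability.
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):
        # token_ids: five context words followed by the target word(s).
        emb = self.embedding(token_ids)          # (batch, seq, emb_dim)
        _, (h_n, _) = self.lstm(emb)             # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.out(h_n[-1]))  # (batch, 1)
```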

Sent2Vec
We also used sentence embeddings generated by Skip-thought (Kiros et al., 2015). This model produces a sentence representation with 2,400 dimensions. We trained two models on top of these embeddings: in the first, we passed the embedding through a Sigmoid layer; in the second, we used two ReLU layers of 1,200 and 600 dimensions, respectively, followed by a Sigmoid layer, with dropout (0.7) between all layers. Both models obtained good results on the training set but performed poorly on the development set.
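For reference, the deeper of the two models corresponds to the following sketch (PyTorch; applying dropout also to the input embedding is our reading of "between all of the layers"):

```python
import torch.nn as nn

# Skip-thought embedding (2,400-d) -> 1,200 -> 600 -> complexity probability.
sent2vec_mlp = nn.Sequential(
    nn.Dropout(0.7),
    nn.Linear(2400, 1200), nn.ReLU(),
    nn.Dropout(0.7),
    nn.Linear(1200, 600), nn.ReLU(),
    nn.Dropout(0.7),
    nn.Linear(600, 1), nn.Sigmoid(),
)
```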

Ensembles
We combined our three best systems: Feature Engineering, MLP Average Embedding and LSTM Transfer Learning. For the binary classification task, we combined the systems by majority voting. For the probabilistic classification task, we used stacking with Linear Regression as the meta-learner, which took the probabilities from our three best systems as features.
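Both combination rules are simple to reproduce; a sketch under the assumption that each base system outputs a per-instance probability (stand-in data, scikit-learn for the stacker):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
P_train = rng.uniform(size=(500, 3))  # stand-in: probs from the 3 systems
y_train = rng.uniform(size=500)       # stand-in: gold complexity fractions

def majority_vote(probs, threshold=0.5):
    """Binary task: each system votes; the majority (2 of 3) decides."""
    votes = (probs >= threshold).sum(axis=1)
    return (votes >= 2).astype(int)

# Probabilistic task: stack the three probabilities with Linear Regression.
stacker = LinearRegression().fit(P_train, y_train)
blended = stacker.predict(P_train)
```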

Results
For the binary classification task, we first evaluated the ROC-AUC of all methods on the development set. For the Feature Engineering method, we chose an XGBoost classifier, which achieved the best AUC on the development set (0.91).
For the Feature Engineering, shallow and deep neural network methods, we selected the threshold that maximizes F1 on the training set. It is important to mention that these thresholds were found on the whole training set and not per subset, which guarantees that we are not overfitting our method to the test data or to a specific dataset. Our results for the English monolingual classification task are described in Table 2. Among the individual methods, the Feature Engineering method achieved by far our best results on the three test sets. In order to achieve better results, we submitted a fourth system that takes the majority vote of our three methods. This voting system surpassed our individual methods on two test sets, and was inferior to the Feature Engineering method by less than 10^-3 F1 on the WikiNews dataset. Majority voting was our best method for the classification task.
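The threshold search itself can be a simple sweep over candidate values on the training set (a sketch; the candidate grid is our assumption):

```python
import numpy as np
from sklearn.metrics import f1_score

def best_threshold(y_true, probs, grid=np.linspace(0.05, 0.95, 91)):
    """Return the threshold that maximizes F1 on the (whole) training set."""
    scores = [f1_score(y_true, (probs >= t).astype(int)) for t in grid]
    return float(grid[int(np.argmax(scores))])
```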
For the probabilistic classification task, our Feature Engineering method also used an XGBoost model, which achieved the best MAE on the development set (0.28). Our results for the English monolingual probabilistic classification task are described in Table 3. While neither Feature Engineering nor Average Embedding performed well, our best individual system, by a large margin, was the LSTM method. In order to achieve better results, we used stacking of our three models. Stacking performed better than the individual methods on two datasets, but was not better than the LSTM on the News test set (a gap of 2 x 10^-4).

Conclusion
For the binary classification task, majority voting achieved our best results, although only slightly better than the standalone Feature Engineering model. For the probabilistic classification task, the LSTM had better results on one dataset, while the stacking method performed slightly better on the other datasets. The deep learning methods showed their potential when contrasted with the feature engineering method.
In the future, we intend to explore more powerful neural language models, such as those using character embeddings (Jozefowicz et al., 2016), bidirectional language models (Peters et al., 2017, 2018), and other transfer learning methods (Howard and Ruder, 2018).

References
Or Biran, Samuel Brody, and Noémie Elhadad. 2011. Putting it simply: A hands-on approach to lexical simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: Short Papers - Volume 2, pages 496-501. Association for Computational Linguistics.
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.