Deep Learning Architecture for Complex Word Identification

We describe a system for the CWI task that includes information on 5 aspects of the (complex) lexical item, namely distributional information of the item itself, morphological structure, psychological measures, corpus counts and topical information. We constructed a deep learning architecture that combines these features and applied it to the probabilistic and binary classification tasks for all English sets and the Spanish set. We achieved reasonable performance on all sets, with the best performance on the probabilistic task, particularly on the English news set (MAE of 0.054 and F1-score of 0.872). An analysis of the results shows that reasonable performance can be achieved with a single architecture without any domain-specific tweaking of the parameter settings, and that distributional features capture almost all of the information also found in hand-crafted features.


Introduction
In general, complex word identification (CWI) aims to identify words that are perceived as difficult for a given target audience. As such, children (De Belder and Moens, 2010), foreign language learners (Paetzold and Specia, 2016c) and readers suffering from aphasia (Devlin and Tait, 1998), dyslexia (Rello et al., 2013) or autism spectrum disorder (Štajner et al., 2017) will struggle with different words.
The goal of the current CWI shared task (Yimam et al., 2018) is to predict which words can be difficult for a non-native speaker, based on annotations collected from a mixture of native and non-native speakers. The instructions for the English dataset are formulated so that annotators mark the words they think are problematic for children, non-native speakers, or people with language disabilities.
Having such a diverse target audience requires a system that includes a variety of information at different levels of linguistic description. We include information that covers 5 aspects of the lexical item at hand, namely distributional information of the item itself, morphological structure, psychological measures, corpus-counts and topical information. With the exception of the psychological measures, all can be readily trained by an appropriate neural network architecture and/or acquired from large-scale corpora.
We train a neural network to integrate said sources of information and apply it to the probabilistic and the binary complexity assessment for the three English datasets and the Spanish one.

Complex Word Identification
The task of complex word identification has often been regarded as a critical first step for automatic lexical simplification (Shardlow, 2014). Indeed, erroneously identifying or failing to identify words as complex is likely to trigger important errors in the simplification pipeline. As a result, a growing number of studies have been dedicated specifically to complex word identification and have focused on developing accurate statistical learning methods and on collecting appropriate gold standards (Paetzold and Specia, 2016a; Yimam et al., 2017a,b; Štajner et al., 2017).
Complex word identification has only relatively recently been framed as a machine learning (ML) problem (Zeng et al., 2005; Shardlow, 2013). Indeed, before any gold-standard datasets were made available, the early approaches to the identification of complex words in a text included, on the one hand, readability measures determining complex words based on word familiarity (Dale and Chall, 1948) or on syllable count (Gunning, 1952; McLaughlin, 1969) and, on the other hand, simplification methods which simply considered all words as complex and simplified everything (Devlin and Tait, 1998) or simplified words based on a threshold on word familiarity (Elhadad, 2006).
The SemEval-2016 shared task on complex word identification (described in detail in Paetzold and Specia, 2016a) was the first evaluation campaign which provided a gold-standard dataset as well as an extensive comparison of different machine learning approaches for the task at hand. The submitted systems included different types of classifiers, such as SVMs, random forests and maximum entropy systems, which combined different types of features, ranging from linguistic information (on a lexical, morphological, semantic and syntactic level) through psycholinguistic measures to corpus-based information such as frequencies.
The results on the shared task showed how ensemble methods (Paetzold and Specia, 2016b) outperformed any other ML technique and neural approaches in particular (Bingel et al., 2016). The task also showed, however, how a lack of annotation standards made it difficult for any ML approach to model the rather inconsistent human assessment (Zampieri et al., 2017).

Deep Learning Architectures
The system we describe likewise follows the ML approach to CWI and draws inspiration from the neural network literature in NLP. We adapt these architectures from their initial purposes and apply them to the task at hand. Collobert et al. (2011) show how distributional information from words, called word embeddings, can be used in combination with a neural network architecture to largely replace hand-crafted features for learning NLP-related tasks such as POS-tagging and NER. The embeddings capture fine-grained information about a word's linguistic behavior and the neural network model successfully teases out the relevant properties from that representation for the given task. Character embeddings take it one step further and also make it possible to encode and capture subword information in the modeling process.

Data sources
The English datasets cover 3 informationally dense target domains for which to assess lexical complexity, namely news, Wikipedia and Wikipedia news. The Spanish dataset contains data taken from Spanish Wikipedia pages. Table 1 summarizes the number of training, development and test items for each dataset we used in the experiment. We combined the training and development sets and used them as a single training set.
As a general domain corpus we use the COW-corpora (Schäfer, 2015; Schäfer and Bildhauer, 2012). The corpora are gathered online and cover a wide scope of topics. The English corpus contains well over 13 billion tokens, the Spanish one over 4 billion tokens.
We have at our disposal psychological measures for English from the MRC Psycholinguistic Database (Wilson, 1988). To avoid skewness we perform a rank transformation, with equal ranks being given the first encountered rank, and normalize again by dividing by the highest rank.
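The rank transformation described above can be illustrated with a minimal sketch. It assumes that values are ranked in ascending order starting at 1 and that tied values all receive the first (lowest) rank at which they occur; the function name `rank_normalize` is hypothetical and is not taken from our implementation.

```python
def rank_normalize(values):
    """Rank-transform raw scores and scale the ranks into (0, 1].

    Assumptions of this sketch: ascending ranks starting at 1, and
    tied values share the first rank at which the value occurs.
    """
    if not values:
        return []
    first_rank = {}
    for rank, value in enumerate(sorted(values), start=1):
        # setdefault keeps the first (lowest) rank seen for a tied value
        first_rank.setdefault(value, rank)
    ranks = [first_rank[v] for v in values]
    highest = max(ranks)
    # Dividing by the highest rank maps the largest value to 1.0
    return [r / highest for r in ranks]


# Toy raw counts (invented for illustration only)
normalized = rank_normalize([10, 200, 10, 5000])
print(normalized)  # [0.25, 0.75, 0.25, 1.0]
```

Because only the ordering of the values is retained, a few very large raw values no longer dominate the feature range.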
Word length. The length of each target word is also included as a feature.
Word embeddings. Word embeddings are pretrained on the COW-corpora and are used to initialize several of the input layers in the neural network. We use the gensim implementation of word2vec to construct a 300-dimensional embedding space with a window size of 5, including only words that reach a minimum frequency threshold of 20.
Character embeddings. Character embeddings are trained on the training and development sets of all target words. Each character is replaced by a 16-dimensional encoding which has been randomly initialized. Each word consists of a concatenation of its character representations.
Figure 1 shows the general architecture for the CWI task. The model has been constructed with the Keras deep learning library (Chollet et al., 2015) with tensorflow-gpu as a backend. It includes the 5 sources of information we discussed in the previous section, which are used as features to represent information at the word and the sentence level. At the word level, we include engineered features (psychological measures, corpus counts and word length) and distributional information (word and character embeddings). At the sentence level we concatenate embeddings to capture topical information.

Input Layers
We include engineered features for the English dataset following the idea that they correlate with cognitive complexity. The features include psychological information, corpus counts and word length. Corpus counts measure familiarity, and infrequent words are attributed a higher degree of complexity. Word length has been shown to be related to processing difficulties and is relevant, for instance, to determine which words pose problems for persons with dyslexia.
Each target word is encoded by its word embedding, or in the case of word groups by their concatenated embeddings. The idea is that words with similar distributional patterns might have a comparable complexity. An LSTM layer of size 64 reduces the dimensionality of the representation.
Each target is also encoded as a sequence of its character embeddings. This input encoding is meant to capture morphological information as well as cues from letter sequences which might be perceived as difficult. The character embeddings are trained through 2 convolutional layers (4 filters, kernel size of 4, stride of 1) followed by max pooling (with a size of 2). An LSTM of size 64 is the final layer that directly encodes the character information.
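Before a target can be passed to a character embedding layer, it has to be turned into a fixed-length sequence of integer indices. The sketch below shows one common way to do this. The reserved padding and unknown-character indices, the maximum length and the helper names are assumptions made for illustration and do not describe our exact preprocessing.

```python
PAD = 0  # index reserved for padding (assumption of this sketch)
UNK = 1  # index reserved for characters unseen in training (assumption)


def build_char_index(words):
    """Map every character observed in the training targets to an integer."""
    chars = sorted({ch for word in words for ch in word})
    return {ch: i for i, ch in enumerate(chars, start=2)}


def encode_word(word, char_index, max_len):
    """Encode a word as character indices, truncated or padded to max_len."""
    ids = [char_index.get(ch, UNK) for ch in word[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


char_index = build_char_index(["cat", "act"])
print(encode_word("cat", char_index, max_len=5))  # [3, 2, 4, 0, 0]
print(encode_word("dog", char_index, max_len=5))  # [1, 1, 1, 0, 0]
```

Each index is then looked up in a trainable embedding matrix, which yields the sequence of character vectors that the convolutional layers operate on.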
The entire sentence is encoded as a concatenation of word embeddings and serves as an approximation of the topic using contextual cues. An LSTM of size 128 summarizes the information captured in this layer.

Dense Layers
All inputs are then concatenated and run through a shallow 3-layer fully connected network (each layer consisting of 32 nodes) with a moderate dropout rate of 0.3. A final dense layer predicts the output. Two auxiliary loss functions are provided to ensure smooth training of the character and the topic model. We use binary cross-entropy as the loss function for the binary outcome task and mean squared error for the probabilistic one. We applied the architecture to the English datasets and, with the exception of the psychological measures, also to the Spanish one.
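The overall layout can be sketched with the Keras functional API as follows. This is an approximation under stated assumptions rather than our exact implementation. The number of engineered features, the sequence lengths, the character vocabulary size, the ReLU and sigmoid activations, the "same" padding, the placement of the pooling step after both convolutions, the form of the auxiliary outputs and the Adam optimizer are not specified in the text and are chosen here for illustration only.

```python
from tensorflow.keras import Model, layers


def build_cwi_model(n_features=10, n_chars=60, max_chars=30,
                    max_target_words=5, max_sent_words=50,
                    emb_dim=300, probabilistic=True):
    """Hypothetical multi-input CWI model; all default sizes are assumptions."""
    feats = layers.Input(shape=(n_features,), name="engineered")
    target = layers.Input(shape=(max_target_words, emb_dim), name="target")
    chars = layers.Input(shape=(max_chars,), dtype="int32", name="chars")
    sent = layers.Input(shape=(max_sent_words, emb_dim), name="sentence")

    # Target word(s): pretrained word vectors compressed by an LSTM of size 64
    word_vec = layers.LSTM(64)(target)

    # Characters: 16-dim embeddings, 2 convolutions (4 filters, kernel 4,
    # stride 1), max pooling of size 2, then an LSTM of size 64
    c = layers.Embedding(n_chars, 16)(chars)
    c = layers.Conv1D(4, 4, strides=1, padding="same", activation="relu")(c)
    c = layers.Conv1D(4, 4, strides=1, padding="same", activation="relu")(c)
    c = layers.MaxPooling1D(2)(c)
    char_vec = layers.LSTM(64)(c)

    # Sentence context as a topical approximation: LSTM of size 128
    sent_vec = layers.LSTM(128)(sent)

    # Auxiliary outputs give the character and topic branches their own loss
    aux_char = layers.Dense(1, activation="sigmoid", name="aux_char")(char_vec)
    aux_sent = layers.Dense(1, activation="sigmoid", name="aux_sent")(sent_vec)

    # Three dense layers of 32 nodes with dropout of 0.3, then the prediction
    x = layers.Concatenate()([feats, word_vec, char_vec, sent_vec])
    for _ in range(3):
        x = layers.Dense(32, activation="relu")(x)
        x = layers.Dropout(0.3)(x)
    main = layers.Dense(1, activation="sigmoid", name="main")(x)

    model = Model([feats, target, chars, sent], [main, aux_char, aux_sent])
    loss = "mse" if probabilistic else "binary_crossentropy"
    model.compile(optimizer="adam", loss=[loss, loss, loss])
    return model


model = build_cwi_model()
model.summary()
```

Setting `probabilistic=False` switches all three outputs to binary cross-entropy for the binary task. For a dataset without engineered features of a given type, `n_features` would simply be smaller.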

[Table 2: Dataset, Result, Rank, Maximum-score]
The results in Table 2 show reasonably good performance for all tasks. Our architecture seems to work especially well for the regression task, but shows its aptitude for the classification task as well. The size of the training data seems to play a direct role in the system's ability to make accurate predictions. This is in line with other deep learning literature. This does not hold for the Spanish set, however, which might be due to a slight difference in how complexity was apprehended during the data collection phase. The inclusion of corpus counts and pre-trained embeddings from a general corpus, rather than a Wikipedia corpus, shows directly in the performance on the respective tasks. Using a Wikipedia corpus would probably improve the results for those particular sets. Yet, the inclusion of general corpus information proves to be a valid alternative in the absence of specialized corpora. The inclusion of the engineered features does not seem to affect the obtained scores much.
Table 3 provides an overview of the relative contribution of each input layer to the final result for the English news dataset. The models were trained for 50 epochs. Considering each input layer separately, the word embeddings are the best estimator for the complexity task, followed closely by the character embeddings. Engineered features capture some information on the word's complexity, yet not as much as the embedding layers. Interestingly, sentence information does not outperform the baseline.
The combination of input layers shows the relative improvement that can be achieved by adding more information to the best performing input layer. The results indicate that combining information only marginally improves performance. They also confirm that the engineered features in combination with the embeddings do not contribute much to the final score.
This leads to the following conclusions for the current dataset. First, complexity is best determined by including focused information on the target word itself. The inclusion of contextual, topical information does not show any noticeable advantage. Second, looking at the combination of input layers, we can conclude that the engineered features add information that differs only marginally from that of the other input sources. This could be due to the limited number of words that are actually covered by the psychological dataset, but it also implies that the information from the corpus counts is indirectly captured by the word embeddings and that the information from the word length is captured by the character encodings. It is a case in point for replacing manual feature engineering by word and character embeddings. Based on these results we cannot conclude whether the word embeddings' better performance over the character embeddings is due to pre-training.

Conclusion
Reasonable performance can be achieved with a single architecture including information from different levels of linguistic description. Information derived from large-scale corpora can serve as a starting point on which to build a general architecture that learns the appropriate weights for the specific problem, in our case the CWI task. Embeddings at the word and the character level seem to contain sufficient information to model the problem well.
Future work will include a search for optimal hyperparameter settings for the identification task. We will likewise explore whether pre-training the character embeddings on a larger corpus will put their performance on par with the pre-trained word embeddings. The latter would pave the way for a model with fewer parameters to train and would significantly reduce complexity.