Sensible at SemEval-2016 Task 11: Neural Nonsense Mangled in Ensemble Mess

This paper describes our submission to the Complex Word Identification (CWI) task in SemEval-2016. We test an experimental approach of blindly applying neural nets to the CWI task, about which we know little to nothing. By structuring the input as a sequence and the output as a binary label, where 1 denotes a complex word and 0 otherwise, we introduce a novel approach to complex word identification using Recurrent Neural Nets (RNN). We also show that when we are unsure of the optimal hyperparameters or the best performing models, we can simply ensemble several RNN classifiers using eXtreme gradient boosted trees classifiers. Our systems submitted to the CWI task achieved the highest accuracy and F-score among the systems that use neural networks.


Introduction
The Deep Learning Tsunami has hit the Natural Language Processing (NLP) and Computational Linguistics field (Manning, 2016). Deep neural nets have proven to be the ultimate hammer in various NLP shared tasks; systems trained on neural nets often emerge as the top systems and/or beat state-of-the-art performance (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014; Levy et al., 2014; Shazeer et al., 2016; Gupta et al., 2015; Jean et al., 2015; Kreutzer et al., 2015; Sultan et al., 2014; Sultan et al., 2015). In the concluding remarks of Google's Deep Learning course on Udacity (https://www.udacity.com/course/deep-learning-ud730), Vincent Vanhoucke said, "What's really cool about those [neural net application] examples is that you don't have to know much about the problem you're trying to solve". Armed with basic knowledge of deep learning and neural nets and almost zero familiarity with the problem, we attempt to treat the Complex Word Identification (CWI) task as a binary classification task using Long Short-Term Memory (LSTM) Recurrent Neural Nets (RNN) with Gated Recurrent Units (GRU).

Neural Network and Deep Learning
Neural Networks are powerful at modelling various modalities, e.g. signals, text, images and videos. As the name suggests, a neural network is inspired by the brain's synaptic transmission mechanism, which transmits signalling molecules (aka neurotransmitters) to different signal receptors (aka neurons) throughout the body. Metaphorically, we can emulate a neuron as a computational unit and consider the neurotransmitters as real-numbered inputs and outputs passed from one neuron to another. Each input to a neuron comes with an associated weight; the neuron processes the different inputs (often by summing them) and passes the result to a non-linear function, which provides an output value.
For instance, we can think of a neuron as a typical AND/OR logic gate. Given two binary inputs x1 and x2 and a bias unit with input 1, each with a varying weight, we pass them to a neuron that sums the products of the weights and inputs and passes the sum to a non-linear function that outputs a boolean y value of 0 if the sum is below 0 and 1 if the sum is above 0.
From the left graph and table in Figure 1, we see that the neuron emulates an OR logic gate: it outputs 1 when either input is positive and outputs 0 when both inputs are 0. Similarly, the right graph and table in Figure 1 present a neural depiction of an AND logic gate. If we consider the 2nd row of the left table in Figure 1, the bias with the value of 1 and the inputs x1 = 0 and x2 = 1 are fed into the neuron to produce the y output. Within the neuron, it first sums the products of the inputs and their associated weights (i.e. w0 * bias + w1 * x1 + w2 * x2). Then, using a non-linear thresholding function, the neuron outputs y = 1 since the sum is larger than 0. Thus the neuron fulfils the function of an OR gate that accepts a 1 and a 0 input bit and produces a positive bit.
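The single-neuron gates above can be sketched in a few lines of plain Python. The weight values here are illustrative choices that satisfy the thresholding behaviour described above; they are not taken from Figure 1 itself.

```python
# A single neuron: sum the weighted inputs plus a bias term,
# then apply a step function that outputs 1 if the sum exceeds 0.
def neuron(inputs, weights, bias_weight):
    total = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

# OR gate: fires when at least one input is 1.
def or_gate(x1, x2):
    return neuron([x1, x2], weights=[1, 1], bias_weight=-0.5)

# AND gate: fires only when both inputs are 1.
def and_gate(x1, x2):
    return neuron([x1, x2], weights=[1, 1], bias_weight=-1.5)
```

For example, `or_gate(0, 1)` sums -0.5 + 0 + 1 = 0.5 > 0 and outputs 1, matching the 2nd row of the left table.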
The problem gets more complicated when we want to use neurons to emulate an exclusive OR (XOR) logic gate: XOR returns a positive binary output only when exactly one of the inputs is positive, and a negative output when there are more or fewer than one positive input.
We can split the XOR problem into smaller logical expressions. As shown in Figure 2, we can emulate an XOR gate by stacking two layers of neurons. On the first layer, we solve for (i) z1 to represent [x1 AND NOT x2], with weights -0.5, 1 and -1 attached to the bias, x1 and x2, and (ii) z2 to represent [NOT x1 AND x2], with weights -0.5, -1 and 1 attached to the bias, x1 and x2.
At the second layer, we apply the same weights we used for the OR gate and feed z1 and z2 as the inputs to produce the XOR outputs. Interestingly, if we look at the first layer, we notice that z1 and z2 will never both be 1. The network architecture we use to solve the XOR example is often referred to as a feed-forward multi-layered network.
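The two-layer construction above translates directly into code, using exactly the weights given in the text (-0.5, 1, -1 for z1; -0.5, -1, 1 for z2; the OR weights on the second layer):

```python
# Step-function neuron, as in the single-gate examples.
def neuron(inputs, weights, bias_weight):
    total = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def xor_gate(x1, x2):
    # First layer: z1 = [x1 AND NOT x2], z2 = [NOT x1 AND x2].
    z1 = neuron([x1, x2], weights=[1, -1], bias_weight=-0.5)
    z2 = neuron([x1, x2], weights=[-1, 1], bias_weight=-0.5)
    # Second layer: OR over z1 and z2 gives XOR.
    return neuron([z1, z2], weights=[1, 1], bias_weight=-0.5)
```

Tracing the four input pairs confirms that z1 and z2 are never both 1, and the second-layer OR yields the XOR truth table.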
The logic gate examples motivate the simple use of single neurons and the effect of stacking layers of neurons to produce the desired outcome 2 . Hence the notion, "Deep Learning".
In all the logic gate examples we have manually assigned the weights associated with the neurons, and the network perfectly predicts the desired XOR outputs. In practice, these weights have to be trained using pairs of input bits and their respective outputs.
The rest of the paper will not go through the neural network architecture used in our submission in the same level of detail as this section. Goldberg (2015) and Cho (2015) provide a great read on using deep learning and neural networks for NLP tasks, along with formal mathematical descriptions of how to train the networks.

Recurrent Neural Net
A Recurrent Neural Net (RNN) is a deep neural network architecture that chains up neurons in a sequential manner. The Elman Network is the simplest formulation of an RNN; it allows arbitrarily sized structured inputs to be represented by a fixed-size vector while observing the structured properties of the input (Elman, 1990). Returning to the XOR problem, instead of using a feed-forward multi-layered network, we can recast the problem as a sequential one. Figure 3 shows how the inputs can be chained sequentially in the cadence x1, x2, y, x1, x2, y, ... (input input output, input input output, ...). At the end of the sequence, the network can predict the output of the last set of inputs that has no output. Despite its simplicity, the RNN produces competitive results for sequence tagging (Xu et al., 2015) and language modelling (Mikolov et al., 2010). However, it is hard to train effectively due to the vanishing gradient problem: the gradients in the later steps of the sequence quickly diminish during backpropagation (Rumelhart et al., 1988) and do not reach the earlier inputs.
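To make the recurrence concrete, here is a minimal sketch of a single Elman step in plain Python (our own illustrative notation, not the actual Passage implementation): the new hidden state mixes the current input with the previous hidden state through a tanh non-linearity, and the same weight matrices are reused at every step of the sequence.

```python
import math

def elman_step(x, h_prev, W_xh, W_hh, b_h):
    """One Elman RNN step: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h).
    x is the current input vector, h_prev the previous hidden state."""
    return [
        math.tanh(
            b_h[i]
            + sum(W_xh[i][j] * x[j] for j in range(len(x)))
            + sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        )
        for i in range(len(b_h))
    ]
```

Because the hidden state is threaded through every step, gradients flowing back through many applications of tanh shrink multiplicatively, which is the vanishing gradient problem described above.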
To solve the vanishing gradient problem, Hochreiter and Schmidhuber (1997) introduced the Long Short-Term Memory (LSTM). The intuition is to introduce a "memory cell" to preserve the gradients; at every input state, a gate is used to control how much of the new input should be kept in the memory cell and how much it should forget. Cho et al. (2014) proposed a similar "memory" device, the Gated Recurrent Unit (GRU), which adds extra weight matrices to learn which long-distance relationships to remember or forget.
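As a sketch, the GRU gating described by Cho et al. (2014) can be written in standard notation (these are the textbook GRU equations, not formulas from this paper): an update gate z_t and a reset gate r_t, each with its own weight matrices, control how much of the previous hidden state is kept or overwritten.

```latex
% z_t: update gate, r_t: reset gate, \odot: element-wise product.
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\
\tilde{h}_t &= \tanh(W x_t + U (r_t \odot h_{t-1})) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

When z_t is close to 0, the old hidden state h_{t-1} passes through unchanged, which is what lets gradients survive over long distances.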

Complex Word Identification
Complex Word Identification (CWI) is the task of automatically identifying difficult words in a text. Usually, it is structured as a subtask prior to lexical simplification, where difficult words in a text are substituted with simpler ones (Specia et al., 2012; Shardlow, 2013). The inputs of the task are a target word and the context sentence in which it occurs. For example, given the underlined word and the context sentence:

The short words math or maths are often used for arithmetic , geometry or basic algebra by young student and their schools.

The desired output would be 1 to indicate that the target word is complex and 0 if it is not.

Complex Word Identification with RNN
Neural networks have opened a Pandora's box where engineers can stack networks in different architectures to train models for almost any NLP task. The sequential nature of language production fits the recurrent structure of the RNN, and engineers can easily recast any NLP task as a sequence prediction task. Knowing little about the task, we lemmatize and lowercase the sentence 3 and restructure the CWI inputs as a sequence in which the target word is separated from the context sentence by a placeholder symbol < s >, e.g.
arithmetic < s > the short word math or math are often use for arithmetic , geometry or basic algebra by young student and their school .
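This restructuring step can be sketched as a small helper (the function name `restructure_input` is ours, for illustration; the lemmatization step is omitted here and would plug in any off-the-shelf lemmatizer before the lowercasing):

```python
def restructure_input(target_word, context_tokens):
    """Build the CWI input sequence: the target word, the placeholder
    token '< s >', then the (already tokenized) context sentence,
    all lowercased."""
    tokens = [target_word.lower(), "< s >"] + [t.lower() for t in context_tokens]
    return " ".join(tokens)
```

Feeding the example above through this helper (without lemmatization) yields `arithmetic < s > the short words math or maths ...`, which is then consumed token by token by the RNN, with the binary complexity label as the final output.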
We select the top model with the lowest labelling error on the training labels as our Baseline submission. Since we do not know the variance between the training and evaluation data, we also select the outputs from the top 5 models with the lowest training error and train an eXtreme Gradient Boosted Trees regressor (Friedman, 2001; Chen and Guestrin, 2016) to produce a single output label.
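The ensembling step has the shape of model stacking: the per-instance outputs of the 5 RNNs become a 5-dimensional feature vector, and a boosted-trees model is fit on top. A minimal sketch of that shape (the submitted system used XGBoost; scikit-learn's `GradientBoostingClassifier` is used here as a stand-in, and the function name is ours):

```python
from sklearn.ensemble import GradientBoostingClassifier

def ensemble_rnn_outputs(model_scores, labels):
    """Stack the outputs of several RNN classifiers and fit gradient
    boosted trees on top.
    model_scores: list of per-model score lists, shape (n_models, n_instances).
    labels: gold binary labels, length n_instances."""
    # Transpose so each row holds one instance's scores from all models.
    features = list(zip(*model_scores))
    clf = GradientBoostingClassifier(n_estimators=50)
    clf.fit(features, labels)
    return clf
```

At test time the same transposition is applied to the 5 models' outputs on the evaluation data, and the boosted-trees model emits the single 0/1 label per instance.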
The open-source implementation of our system can be found at https://github.com/alvations/stubboRNNess. It is based on the Passage RNN 4 and XGBoost 5 libraries. Table 1 presents the results of the best systems and the neural network systems from the CWI task in SemEval-2016 (Paetzold and Specia, 2016). We submitted our systems under the team name Sensible.

Results
The CWI task was evaluated based on the classic accuracy, precision, recall and F-score metrics. Additionally, the organizers decided to account for the harmonic mean between accuracy and recall, which they called the G-score.
The top teams used a variety of heuristic and classification-based techniques. PLUJAGH-SEWDFF uses frequency thresholding. CoastalCPH's NeuralNet system extracted an array of features (including parts-of-speech, frequencies, character perplexity and embeddings) and trained a deep neural network with 2 hidden layers. AmritaCEN's w2vecSim trained an SVM classifier using Word2Vec embeddings and similarity features for the target word; in addition, they used character- and token-based features to train the classifier. Their w2vecSimPos system added a POS feature to train the classifier.
Among the neural network systems, our baseline system achieved the highest F- and G-scores. We are also the only team that restructured the target word and sentence to train a recurrent neural net to predict the output label. One possible reason for the poor performance of our systems is the training data size of the task: the training data contains 2,237 labelled instances while the test data contains 88,221 instances. Given more data, we believe that our system can scale towards accuracies comparable to the top systems. Although our ensemble system performed poorly on the harmonic scores, it achieves reasonably high accuracy, close to the top systems. Our ensemble system was penalized due to its low recall rate. Provided more training data, the recall should increase proportionally and improve our ensemble system.

Conclusion
In this paper, we motivated the use of deep learning and neural nets in NLP applications and introduced basic notions of feed-forward and recurrent neural nets through the XOR example. As expected, we could easily build a relatively competitive system with little understanding of the task by restructuring the inputs as a sequence to train an RNN classifier.
We introduced a novel approach using RNNs to solve the complex word identification task and showed that, when we are unsure of the optimal hyperparameters or the best performing models, we can easily ensemble several RNN classifiers using eXtreme gradient boosted trees classifiers.