SyntNN at SemEval-2018 Task 2: is Syntax Useful for Emoji Prediction? Embedding Syntactic Trees in Multi Layer Perceptrons

In this paper, we present SyntNN as a way to include traditional syntactic models in multilayer neural networks for the emoji prediction task of SemEval-2018 Task 2. The model builds on the distributed tree embedder, also known as the distributed tree kernel. Initial results are extremely encouraging, but additional analysis is needed to overcome the problem of overfitting.


Introduction
Syntactic models of language have always played a crucial role in many natural language processing tasks but, in recent years, they have been marginalized by the advent of neural networks and in particular of long short-term memory (LSTM) networks. These networks have had a tremendous impact on how linguistic tasks are modeled and, sometimes, solved.
In this paper, we explore the use of "traditional" syntactic information within a neural network framework for the emoji prediction task of SemEval-2018 Task 2 (Barbieri et al., 2018, 2017). We propose SyntNN as a way to include traditional syntactic models in multilayer neural networks. The model builds on the distributed tree embedder, also known as the distributed tree kernel (Zanzotto and Dell'Arciprete, 2012; Ferrone and Zanzotto, 2014; Zanzotto et al., 2015), which is a way to transpose syntactic information into small vectors. Initial results are extremely encouraging: SyntNN outperforms syntax-unaware neural networks on the trial set. Unfortunately, these promising results are not confirmed on the test set. Hence, we analyzed the results to understand why this happened.

SyntNN: a Syntax-aware Multilayer Perceptron
SyntNN is a simple, non-recurrent neural network that aims to exploit traditional syntactic interpretations of tweets in classification tasks. The network is designed to explore two questions: first, whether "traditional" syntactic interpretations can be used within neural networks and, second, whether syntactic information is useful in modeling tweets for the specific task of emoji prediction. The architecture of SyntNN is organized around two subnetworks (see Fig. 1): (1) a syntactic subnetwork and (2) a semantic subnetwork. The outputs of these two subnetworks are then merged and fed to the final classification layer.
The rest of the section describes the two subnetworks and the merging layer.

Syntactic Subnetwork with a Distributed Tree Embedding Layer
The syntactic subnetwork aims to take syntactic interpretations of tweets and make them available to a simple, non-recurrent neural network. The key point is how to transform the symbolic representation of syntactic trees into tensor-based representations that have meaningful properties. The Distributed Tree Embedding Layer (DTE) (see Fig. 1) is the core component of the syntactic subnetwork. DTE is based on a technique to embed the structural information of syntactic trees into dense, low-dimensional vectors of real numbers (Zanzotto and Dell'Arciprete, 2012; Ferrone and Zanzotto, 2014; Zanzotto et al., 2015). This technique originated as a way to replace syntactic kernel functions (Collins and Duffy, 2002) in kernel machines (Cristianini and Shawe-Taylor, 2000), but it can be seen as a principled way to insert syntactic information into vectors. In this technique, trees are transformed into distributed trees, which are low-dimensional vectors. These distributed trees build on the recently revitalized research in Distributed Representations (DR) (Hinton et al., 1986; Plate, 1994; Bengio, 2009; Collobert et al., 2011; Socher et al., 2011).
DTE is a layer that maps trees into low-dimensional vectors in a space $\mathbb{R}^d$. This low-dimensional space embeds the space that represents trees according to their subtrees. DTE is then represented as the following embedding layer:

$$DTE(T) = W^{(DTE)}_{d \times n}\, S(T) \qquad (1)$$

where $S(T) = t$ is a one-hot layer that maps trees to vectors in the space of subtrees $\mathbb{R}^n$ and $W^{(DTE)}_{d \times n}$ is a transformation matrix that embeds the huge space of subtrees $\mathbb{R}^n$ in a smaller space $\mathbb{R}^d$. DTE is based on the Johnson-Lindenstrauss Transform (JLT) (Johnson and Lindenstrauss, 1984; Dasgupta and Gupta, 1999) and it is not learned. Using JLT, the layer DTE maps vectors representing trees in the space of subtrees into vectors in a reduced space where similarity is approximately preserved. That is, given two syntactic trees $T_1$ and $T_2$ and given an $\epsilon$, it is possible to find a $W$ for which, if $\mathbb{R}^d$ has a specific relation with $\mathbb{R}^n$ (see Dasgupta and Gupta, 1999), the property referred to as Equation 2 holds: the similarity between $W t_1$ and $W t_2$ approximates the similarity between $t_1$ and $t_2$ within a tolerance set by $\epsilon$, where $t_1 = S(T_1)$ and $t_2 = S(T_2)$. Hence, the space $\mathbb{R}^d$ embeds the space $\mathbb{R}^n$ of the subtrees.
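For reference, the Johnson-Lindenstrauss guarantee is commonly stated in terms of squared distances, as in Dasgupta and Gupta (1999). The block below gives that textbook form for a finite set $X$ of $m$ points; the symbols $X$, $m$, $u$ and $v$ are introduced here for illustration, and the form used in this paper's own equation may be phrased differently (for example, directly in terms of similarity).

```latex
% Textbook statement of the Johnson-Lindenstrauss lemma (assumed notation:
% X is a finite set of m points in R^n, W is a linear map from R^n to R^d).
\[
(1-\epsilon)\,\lVert u - v \rVert^2
\;\le\; \lVert W u - W v \rVert^2
\;\le\; (1+\epsilon)\,\lVert u - v \rVert^2
\qquad \text{for all } u, v \in X,
\]
\[
\text{which some } W \text{ satisfies whenever } 0 < \epsilon < 1 \text{ and }
d \;\ge\; 4\left(\frac{\epsilon^2}{2} - \frac{\epsilon^3}{3}\right)^{-1} \ln m .
\]
```

The bound on $d$ depends on the number of points $m$ and on $\epsilon$, not on the original dimension $n$, which is what makes a small $d$ plausible even when the space of subtrees is huge.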
However, directly producing a DTE with JLT is infeasible, as the space of subtrees $\mathbb{R}^n$ is huge and intractable.
Our solution is to use the recursive model for computing distributed trees (Zanzotto and Dell'Arciprete, 2012; Zanzotto et al., 2015), which empirically guarantees the property in Equation 2. This model is a mapping function that encodes trees into vectors by assigning random vectors to node labels and by recursively computing vectors for subtrees through the composition of the vectors of their nodes. The mapping function has linear complexity with respect to the size of the tree.
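The recursive idea can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified toy that only sums vectors of complete subtrees (the distributed tree kernel covers a richer set of tree fragments), and the class name, the tree encoding as (label, children) pairs, the choice of shuffled circular convolution as composition function and the dimension are all assumptions made for illustration.

```python
import numpy as np


class DistributedTreeSketch:
    """Toy recursive tree embedder (illustrative, not the authors' code)."""

    def __init__(self, dim=1024, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)
        self.label_vectors = {}
        # Two fixed permutations make the composition order-sensitive.
        self.perm_a = self.rng.permutation(dim)
        self.perm_b = self.rng.permutation(dim)

    def label_vector(self, label):
        # Each distinct node label gets one random unit vector, drawn once.
        if label not in self.label_vectors:
            v = self.rng.standard_normal(self.dim)
            self.label_vectors[label] = v / np.linalg.norm(v)
        return self.label_vectors[label]

    def compose(self, a, b):
        # Shuffled circular convolution, computed with the FFT.
        fa = np.fft.rfft(a[self.perm_a])
        fb = np.fft.rfft(b[self.perm_b])
        return np.fft.irfft(fa * fb, n=self.dim)

    def _visit(self, tree):
        # Returns (vector of the subtree rooted here, sum over all subtrees below).
        label, children = tree
        rooted = self.label_vector(label)
        total = np.zeros(self.dim)
        for child in children:
            child_rooted, child_total = self._visit(child)
            rooted = self.compose(rooted, child_rooted)
            total += child_total
        return rooted, total + rooted

    def embed(self, tree):
        # Each node is visited exactly once, so the cost is linear in tree size.
        return self._visit(tree)[1]


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Trees are (label, [children]) pairs; leaves have an empty child list.
t1 = ("S", [("NP", [("I", [])]), ("VP", [("love", []), ("NP", [("pizza", [])])])])
t2 = ("S", [("NP", [("I", [])]), ("VP", [("hate", []), ("NP", [("rain", [])])])])

embedder = DistributedTreeSketch(dim=1024, seed=0)
v1, v2 = embedder.embed(t1), embedder.embed(t2)
print(cosine(v1, v1), cosine(v1, v2))
```

In this sketch the two trees share the subtree rooted at the first NP, so their vectors share one summand; trees with no common subtree would only share near-orthogonal random components.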
After the DTE, the syntactic subnetwork has two dense layers with ReLU activation functions. These dense layers should select subtrees that are relevant for the final task of emoji prediction.
As it is designed, the syntactic subnetwork should take into consideration the syntactic information of tweets and it should be a valuable model to experiment with syntactic information on the task of emoji prediction.

Semantic Subnetwork with a (Bag-of-)Word Embedding Layer
The semantic subnetwork is composed of a classic multilayer perceptron (MLP) network that takes as input tweets represented as vectors $x_{idf}$. These vectors represent tweets in the following way. Each dimension represents a word $w$: if a tweet contains the word $w$, the dimension has the inverse document frequency (idf) value of $w$; otherwise it is 0. Hence, the first layer of the semantic subnetwork is a word embedding layer that maps $x_{idf}$ to the idf-weighted sum of the embeddings of the words in the tweet, $E\,x_{idf}$, where $E$ is a word embedding matrix. As word embeddings, we used Stanford's GloVe (Twitter 27B, 200d) for the English task and the word embeddings used in the paper How Cosmopolitan Are Emojis? (Barbieri et al., 2016) for the Spanish task.
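The idf-weighted bag-of-words input and its projection through the embedding matrix can be sketched as follows. The idf formula (log of the number of documents over the document frequency), the toy corpus, the function names and the random stand-in for the pretrained matrix E are assumptions for illustration; the paper does not specify these details.

```python
import math
import numpy as np


def idf_table(corpus):
    # Document frequency of each word, then idf = log(N / df) (assumed variant).
    n_docs = len(corpus)
    doc_freq = {}
    for tokens in corpus:
        for w in set(tokens):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    vocab = sorted(doc_freq)
    idf = np.array([math.log(n_docs / doc_freq[w]) for w in vocab])
    return vocab, idf


def idf_vector(tokens, vocab, idf):
    # One dimension per vocabulary word: idf value if present, 0 otherwise.
    index = {w: i for i, w in enumerate(vocab)}
    x = np.zeros(len(vocab))
    for w in set(tokens):
        if w in index:  # words outside the vocabulary are ignored
            x[index[w]] = idf[index[w]]
    return x


corpus = [["good", "morning"], ["good", "night"], ["morning", "run"]]
vocab, idf = idf_table(corpus)

rng = np.random.default_rng(0)
E = rng.standard_normal((200, len(vocab)))  # stand-in for pretrained 200d embeddings

x_idf = idf_vector(["good", "night", "unseen"], vocab, idf)
sentence_vector = E @ x_idf  # idf-weighted sum of the word embeddings
print(sentence_vector.shape)
```

With this formulation a tweet is reduced to a single vector of the embedding size, so word order is lost; this is the sense in which the layer is a bag-of-word embedding.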

Merging Syntactic and Semantic Subnetworks
The merging layer of the syntactic and semantic subnetworks is composed of a concatenation layer that concatenates the syntactic vector with the semantic vector along the feature axis. Then, a batch normalization layer is applied (Ioffe and Szegedy, 2015). Finally, a 20-unit layer computes the probability of each emoji through a softmax.
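The forward pass of the merging stage can be sketched in plain NumPy. This is only an illustration under stated assumptions: the batch normalization below uses batch statistics and omits the learned scale and shift, the weights are random, and the two 1024-dimensional inputs stand in for the outputs of the two subnetworks.

```python
import numpy as np


def batch_norm(x, eps=1e-3):
    # Normalize each feature over the batch (learned scale/shift omitted).
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def merge_and_classify(h_synt, h_sem, W, b):
    # Concatenate along the feature axis, normalize, then a 20-way softmax.
    merged = np.concatenate([h_synt, h_sem], axis=1)
    return softmax(batch_norm(merged) @ W + b)


rng = np.random.default_rng(0)
batch, n_classes = 8, 20
h_synt = rng.standard_normal((batch, 1024))  # stand-in for the syntactic output
h_sem = rng.standard_normal((batch, 1024))   # stand-in for the semantic output
W = rng.standard_normal((2048, n_classes)) * 0.01
b = np.zeros(n_classes)

probs = merge_and_classify(h_synt, h_sem, W, b)
print(probs.shape)
```

Each row of `probs` is a distribution over the 20 emoji classes for one tweet in the batch.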

Experimental Setting
To ensure replicability, this section fully describes the implementation details of SyntNN (Fig. 1) and the values of its metaparameters. Moreover, it introduces the networks used for comparison and the datasets used in the experiments. For the Syntactic Subnetwork of SyntNN, we used the Python implementation of the Distributed Tree Encoder. The parse trees of tweets are obtained with the probabilistic context-free grammar parser of Stanford's CoreNLP. Distributed trees are represented in a space $\mathbb{R}^d$ with $d = 4000$. Then, the layer $l_{1(synt)}$ is composed of 5512 units. The layer $l_{2(synt)}$ is a cascade of two dense layers composed, respectively, of 2018 units and 1024 units. All three layers have dropout 0.5 and a ReLU activation function.
For the Semantic Subnetwork of SyntNN, we used pretrained word embeddings: Stanford's GloVe and the word embeddings given by the organizers of the task (Barbieri et al., 2016). The rest of the semantic subnetwork is the following. The first layer, the input layer $I$, is composed of 200 or 300 neurons. Each neuron takes as input one dimension of the BoW vector. The number of input neurons varies according to the word embeddings used: 200 for GloVe and 300 for the other word embeddings mentioned above. The second layer $l_{1(sem)}$ consists of 512 neurons, with dropout 0.5 and a ReLU activation function. The third layer $l_{2(sem)}$ consists of 1024 neurons, with dropout 0.5 and a ReLU activation function.
To understand whether SyntNN positively uses syntactic information, we compared our system with two neural networks trained in comparable conditions: (1) BOW-MLP and (2) BiLSTM. BOW-MLP is the Semantic Subnetwork of SyntNN without the Syntactic Subnetwork. BiLSTM is a bidirectional LSTM (Huang et al., 2015), which has proven effective in many natural language processing tasks. For the BiLSTM, we used the same embedding layer used in SyntNN, followed by a recurrent layer of 100 bidirectional LSTM units with tanh activation function, hard sigmoid recurrent activation function, dropout and recurrent dropout probability 0.5, and an l2 weight regularizer with λ = 0.01. The output layer is composed of 20 neurons with a softmax activation function.
All models are implemented using the Keras library (Chollet et al., 2015) and run on a TensorFlow (Abadi et al., 2015) back-end on different CUDA GPUs. Models are trained with the Adam (Kingma and Ba, 2014) gradient descent algorithm with lr = 0.0001, β1 = 0.9, β2 = 0.999. The loss function is the categorical cross-entropy. The BOW-MLP and SyntNN models are trained for 140 epochs with batch size 50, while the BiLSTM model is trained for 18 epochs with batch size 50.
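To make the optimizer settings concrete, the following sketch spells out a single Adam update and the categorical cross-entropy loss in NumPy, using the reported lr, β1 and β2. The function names, the value of `eps` and the toy parameters and gradients are assumptions for illustration; in the experiments these steps are handled by Keras.

```python
import numpy as np


def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean over the batch of -sum_k y_true[k] * log(y_prob[k]).
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-(y_true * np.log(y_prob)).sum(axis=1).mean())


def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


theta = np.array([1.0, -2.0])   # toy parameters
grad = np.array([0.5, -3.0])    # toy gradient
theta_new, m, v = adam_step(theta, grad, np.zeros(2), np.zeros(2), t=1)

loss = categorical_crossentropy(np.array([[0.0, 1.0]]), np.array([[0.25, 0.75]]))
print(theta_new, loss)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by about lr = 0.0001 regardless of the gradient magnitude.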
We performed our tests on the emoji prediction dataset and we used the Macro F1 Score evaluator provided by the organizers (Barbieri et al., 2018). No additional datasets have been used.
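For readers unfamiliar with the metric, Macro F1 is the unweighted mean of the per-class F1 scores, so rare emojis count as much as frequent ones. The sketch below is a minimal reimplementation for illustration, not the organizers' scorer; the function name and the toy labels are assumptions.

```python
def macro_f1(y_true, y_pred, labels):
    # Unweighted average of per-class F1 scores.
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Toy example with two classes.
score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], labels=[0, 1])
print(round(score, 4))
```

In the toy example the per-class F1 scores are 2/3 and 4/5, giving a Macro F1 of 11/15.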

Results and Analysis
Initial experiments (Table 1), performed using the trial dataset as the testing set, are extremely positive with respect to our research question: syntactic information is definitely important for both languages and SyntNN seems to be a good way to represent syntactic relations among words. These results seem to show that syntactic information is useful and that distributed tree embedders are a possible, effective way to take syntactic information into consideration in multilayer perceptrons. Surprisingly, results on the official test set did not confirm results on the trial set (Table 2). The first observation is that Macro F1 scores on the test set are definitely lower than the results obtained with the different models on the trial set. Moreover, the relative rank among the models is not fully preserved. In fact, SyntNN outperforms BOW-MLP only for the en dataset, and it is definitely worse than BiLSTM. In contrast, the models perform similarly on the es dataset. The question is: why?
We then tried to analyze where the poor results on the test set came from. The first observation is that there are more unknown words in the test set (Table 3) than in the trial set for both datasets. The number of unknown words in the test set is more than double that of the trial set for the English dataset and more than 4 times that of the trial set for the Spanish dataset. The test set is definitely farther from the training set than the trial set is. This seems to be the first reason why results are poorer on the test set. The second observation concerns the degree of overfitting. In fact, this seems to be the major problem of SyntNN and of the other two models (see Fig. 2). Looking at the loss function, all three models largely overfit as the epochs progress: the loss on the training set and the loss on the trial set diverge. This can partially explain the poor results.
However, for a more in-depth analysis, we need to know what these networks model, and how, with respect to symbols in general and syntactic information in particular.
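The unknown-word comparison can be reproduced with a few lines of code. The sketch below counts word types of an evaluation set that never occur in the training set; whether Table 3 counts types or tokens is not stated, so counting types is an assumption, as are the function name and the toy sentences.

```python
def unknown_words(train_tokens, eval_tokens):
    # Word types of the evaluation set that never occur in the training set.
    known = set(train_tokens)
    eval_vocab = set(eval_tokens)
    unknown = eval_vocab - known
    rate = len(unknown) / len(eval_vocab) if eval_vocab else 0.0
    return unknown, rate


train = "i love pizza and i love sun".split()
trial = "i love rain".split()
test = "we adore rain and sun".split()

trial_unknown, trial_rate = unknown_words(train, trial)
test_unknown, test_rate = unknown_words(train, test)
print(len(trial_unknown), len(test_unknown))
```

A larger count or rate on the test set than on the trial set is the kind of gap described above: the test data is lexically farther from the training data.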

Conclusions
In this paper, we presented a way to include traditional syntactic information in neural networks and we experimented with this model on the emoji prediction task. Although results on the test set do not confirm results on the trial set, this approach is promising as it opens the way to a higher explainability of the decisions of the neural network (Zanzotto and Ferrone, 2017).