uOttawa at SemEval-2018 Task 1: Self-Attentive Hybrid GRU-Based Network

We propose a novel self-attentive hybrid GRU-based network (SAHGN), which we used at SemEval-2018 Task 1: Affect in Tweets. Our network has two main characteristics: 1) it internally optimizes its feature representation using attention mechanisms, and 2) it builds a hybrid representation from a character-level Convolutional Neural Network (CNN) and a self-attentive word-level encoder. The key advantage of our model is its ability to emphasize relevant and important information, which enables self-optimization. Results are reported on the valence intensity regression task.


Introduction
Affect analysis is one of the main topics in natural language processing (NLP). It involves many sub-tasks, such as analyzing the sentiment and valence expressed in text. We focus on the task of determining valence intensity.
Affect in Tweets (AIT) is a challenging task, as it requires handling an informal writing style that typically contains many grammar mistakes, slang terms, and misspellings.
Our contributions can be summarized as follows:
• The implementation of a social media text processor: a library that helps process social media text such as short forms, emoticons, emojis, misspellings, hashtags, and slang, and that provides tokenization, word normalization, and sentence encoding.
• The implementation of a self-attentive deep learning system: a system that can predict valence and intensity with limited corpora and vocabulary, yet achieves acceptable performance.

High-Level Description of Our System
Our goal is to provide a system that can predict valence and intensity for short text. Figure 1 shows a high-level description of our solution, which consists of two main components: a social media text processor (Section 3) and a self-attentive hybrid GRU-based network (Section 4.2).

Social Media Text Processor
The social media text processor aims to provide reliable and fast tokenization. It involves the following preprocessing steps:
• Use a named entity recognizer (NER) (Finkel, Grenager, & Manning, 2005) to identify entities such as persons, names, and places, and then replace them accordingly.
• Build a vocabulary using an N-gram tokenizer.
• Tokenize sentences into sets of tokens, and then use them to encode text into sequences of indices (Table 1), which are fed into the network.
• Clean text of accents, punctuation, and non-Latin characters.
• Identify emoticons and emojis, and then replace them with meaningful text; e.g., replace the happy-face emoticon :) with <happy>.
• Recognize hashtags and URLs, and then briefly describe them; e.g., replace #depressed with <hashtag_start>depressed<hashtag_end>.
• Identify user mentions, and then replace them with a person entity, e.g., <person>.
A sketch of these replacement rules follows the list.
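The following is a minimal sketch of the replacement rules above. The placeholder tokens and regular expressions are illustrative stand-ins, not the exact rules used in our library:

```python
import re

# Illustrative emoticon map; the actual library covers many more cases.
EMOTICONS = {":)": "<happy>", ":(": "<sad>", ":D": "<laugh>"}

def preprocess(text):
    # Replace emoticons with meaningful placeholder tokens.
    for emo, token in EMOTICONS.items():
        text = text.replace(emo, token)
    # Wrap hashtag bodies in start/end markers, e.g. #depressed.
    text = re.sub(r"#(\w+)", r"<hashtag_start>\1<hashtag_end>", text)
    # Replace user mentions with a person entity.
    text = re.sub(r"@\w+", "<person>", text)
    # Describe URLs with a placeholder token.
    text = re.sub(r"https?://\S+", "<url>", text)
    return text

print(preprocess("@bob I feel #depressed today :( https://t.co/x"))
# -> <person> I feel <hashtag_start>depressed<hashtag_end> today <sad> <url>
```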

Model Description
The overall architecture of our SAHGN model is shown in Figure 2. The main components are 1) a word sequence encoder, 2) a bidirectional GRU-based layer that applies a self-attentive mechanism at the word level, 3) a character-level CNN feature extractor, and 4) a context-aware attention mechanism.

Word Sequence Encoder
A network input is described as a sequence $S$ of tokens (such as words), where $S = [w_1, w_2, \dots, w_T]$ and $t$ denotes the timestep. Each $w_t$ is a one-hot input vector, and $T$ is the fixed length of the sequence in tokens. A sequence that exceeds this length is truncated.
Word encoding. We use a word vocabulary $V$ to encode a sequence. $V$ contains reserved tokens that mark the start and end of a sequence, as well as out-of-vocabulary (OOV) words. We handle variable lengths by padding short sequences and truncating long ones, as sketched below.
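A minimal sketch of this encoding scheme, assuming a particular layout of the reserved token indices (the actual vocabulary layout may differ):

```python
# Assumed reserved token indices: padding, start, end, out-of-vocabulary.
PAD, START, END, OOV = 0, 1, 2, 3

def encode(tokens, vocab, max_len):
    # Map tokens to indices, wrapping the sequence in start/end markers.
    ids = [START] + [vocab.get(t, OOV) for t in tokens] + [END]
    ids = ids[:max_len]                  # truncate long sequences
    ids += [PAD] * (max_len - len(ids))  # pad short sequences
    return ids

vocab = {"i": 4, "feel": 5, "happy": 6}
print(encode(["i", "feel", "happy"], vocab, max_len=8))
# -> [1, 4, 5, 6, 2, 0, 0, 0]
```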
Embedding layer. We apply pretrained GloVe word embeddings (Pennington, Socher, & Manning, 2014) to $S$. GloVe projects each word into a low-dimensional vector representation $e_t \in \mathbb{R}^d$, where $W_e$ is the word embedding weight matrix. $W_e$ is used to initialize the word embedding layer. We used the official training and development corpora to train the GloVe word embeddings with a dimension of 100. The vocabulary size of this model is 8145 words, which is small and poses a major challenge to training, as well as to performance.
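A minimal sketch of initializing an embedding layer from $W_e$, using Keras. The random matrix below is a stand-in for the trained GloVe vectors; the shapes follow the paper ($|V| = 8145$ words, $d = 100$):

```python
import numpy as np
import tensorflow as tf

vocab_size, embed_dim = 8145, 100
# Stand-in for the pretrained GloVe matrix W_e.
W_e = np.random.uniform(-0.05, 0.05, (vocab_size, embed_dim)).astype("float32")

embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
embedding.build((None,))       # create the layer's weight matrix
embedding.set_weights([W_e])   # initialize it from W_e
```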

Self-attentive GRU-based Mechanism
Recurrent neural networks (RNNs) are commonly used for NLP problems (Yin, Kann, Yu, & Schütze, 2017; Young, Hazarika, Poria, & Cambria, 2017), as they can remember values over arbitrary time durations. An RNN processes every element of an input embedding $e$ sequentially, such that $h_t = \tanh(W e_t + U h_{t-1})$, where $W$ is the weight matrix between the input and hidden states, $U$ is the recurrent weight matrix, and $h_t$ is the hidden state of the recurrent connection at timestep $t$. This design enables variable-length processing while preserving the sequence order. However, RNNs have many limitations with long sequences, in particular exponentially growing or decaying gradients. A common way to resolve these issues is to use gating mechanisms, such as LSTM and GRU (Gers, Schmidhuber, & Cummins, 2000; Hochreiter & Schmidhuber, 1997). We use GRU as it converges faster and is more memory efficient.
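The vanilla recurrence above, written out in numpy (a sketch without bias terms, with illustrative dimensions):

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U):
    # h_t = tanh(W x_t + U h_{t-1}): the vanilla recurrence from the text.
    return np.tanh(W @ x_t + U @ h_prev)

d_in, d_h = 100, 150
rng = np.random.default_rng(0)
W, U = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):  # process a 5-step sequence
    h = rnn_step(x_t, h, W, U)
```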
Bidirectional GRU layer. In our model, we use bidirectional GRU layers. The GRU receives a sequence of tokens as input and projects word information $H = (h_1, h_2, \dots, h_T)$, where $h_t$ denotes the hidden state of the GRU at timestep $t$. It captures the temporal and abstract information of sequences in a forward ($\overrightarrow{h_t}$) or backward ($\overleftarrow{h_t}$) manner. After that, we concatenate the forward and backward representations, i.e., $h_t = \overrightarrow{h_t} \,\|\, \overleftarrow{h_t}$.
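A sketch of this encoder in Keras. With 150 units per direction (the size used in Section 5), the concatenated state has 300 dimensions; the batch size, sequence length, and embedding dimension below are illustrative:

```python
import tensorflow as tf

bi_gru = tf.keras.layers.Bidirectional(
    tf.keras.layers.GRU(150, return_sequences=True),  # one h_t per timestep
    merge_mode="concat",                               # h_t = forward || backward
)
x = tf.random.normal((32, 50, 100))  # (batch, timesteps, embedding dim)
h = bi_gru(x)
print(h.shape)                       # (32, 50, 300)
```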
Attention mechanism. Words do not carry equal valence weight in a sentence. We therefore use an attention mechanism to emphasize the relatively important words.
Attention is used to compute the compatibility between a given source $x$ and query $q$. It uses an alignment function $f(x_i, q)$ to measure the level of dependency of $x_i$ on $q$. This function produces attention weights $a_i = f(x_i, q)$ for $i = 1, \dots, T$. Then, a softmax function is applied to produce a probability distribution $p(z = i \mid x, q)$ over the words $x_i$ of an input $x$. Hence, a bigger weight $a_i$ indicates a higher importance than the other words.
The attention alignment approaches share the same implementation, but they mainly differ in how they compute the weights. This can be done in an additive manner, $f(x_i, q) = w^{T} \tanh(W^{(1)} x_i + W^{(2)} q)$ (Bahdanau, Cho, & Bengio, 2014), or in a multiplicative manner, $f(x_i, q) = \tanh(x_i^{T} W q)$ (Vaswani et al., 2017). In our model training, we use the additive attention mechanism, as it helped improve prediction performance.
Self-attention mechanism. Our training corpora are small, which is not sufficient to train an efficient word embedding or to alleviate well-known problems such as polysemy. In an effort to overcome such limitations, we use a self-attention mechanism. This approach measures the dependency between different tokens of the same input embedding: it computes attention for each word by replacing the source and query with pairs of tokens $(x_i, x_j)$ drawn from the same sequence, as sketched below.
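A minimal numpy sketch of additive self-attention over token pairs. The dimensions and random parameters are illustrative, not the trained values:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def additive_self_attention(X, W1, W2, w):
    # Score every token pair (x_i, x_j) with the additive alignment
    # f(x_i, x_j) = w^T tanh(W1 x_i + W2 x_j), then return the
    # attention-weighted sum of tokens for each query token x_j.
    T = X.shape[0]
    scores = np.array([[w @ np.tanh(W1 @ X[i] + W2 @ X[j]) for i in range(T)]
                       for j in range(T)])           # (T, T): query j vs. source i
    probs = np.apply_along_axis(softmax, 1, scores)  # row-wise softmax
    return probs @ X                                 # attended representations

rng = np.random.default_rng(0)
d, d_a, T = 8, 16, 5
X = rng.normal(size=(T, d))
out = additive_self_attention(X, rng.normal(size=(d_a, d)),
                              rng.normal(size=(d_a, d)), rng.normal(size=d_a))
print(out.shape)  # (5, 8)
```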

Character-level CNN
The CNN encoding layer (Figure 3) takes as input a sequence $C$ of characters, where $C = [c_1, c_2, \dots, c_L]$ and $t$ denotes the timestep. Each $c_t$ is a one-hot input vector, and $L$ is the fixed length of the sequence in characters. For text analysis, CNNs usually use temporal (timestep-based) convolutions rather than spatial convolutions.
We mainly use convolutions to extract low-level character information such as misspellings, slang, and so on.
Character encoding. We define a charset of size 95, comprising the upper- and lower-case English alphabet, special characters, padding, and the start and end of a given input sequence. This charset is used to build a vocabulary for encoding character sequences. As with the word encoding, we handle variable lengths through padding and truncation (Section 4.1).
Character embedding layer. We build a character embedding of 32 dimensions and initialize its weight matrix from a uniform distribution over the range (-0.5, +0.5).
We apply 3 convolutions of 100 features each, with filter lengths of 2, 3, and 4. Each one-dimensional convolution produces a feature map $f_k = \mathrm{Conv1D}_k(E_c)$, where $k$ is the filter length and $E_c$ is the character embedding. After that, a max-pooling layer is applied to each feature map to extract abstract information, $\hat{f}_k = \max(f_k)$. Then, we concatenate these feature representations into one output, as sketched below.
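A Keras sketch of this character-level CNN. The maximum character length (140) and the tanh activation are assumptions for illustration; the charset size (95), embedding dimension (32), filter counts, and kernel sizes follow the text:

```python
import tensorflow as tf

char_input = tf.keras.Input(shape=(140,))            # assumed max char length
emb = tf.keras.layers.Embedding(95, 32)(char_input)  # charset of 95, 32-dim E_c
pooled = []
for k in (2, 3, 4):
    f = tf.keras.layers.Conv1D(100, k, activation="tanh")(emb)  # f_k = Conv1D_k(E_c)
    pooled.append(tf.keras.layers.GlobalMaxPooling1D()(f))      # f_hat_k = max(f_k)
char_features = tf.keras.layers.Concatenate()(pooled)           # 300-dim char vector
```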
As opposed to recurrent layers (Section 4.2), convolutional operations with max-pooling help extract word features without regard to their sequence order (Kalchbrenner, Grefenstette, & Blunsom, 2014). These features are combined with the recurrent features to improve the performance of our model.

Attention with Context
The output vectors received from the previous steps are concatenated and then fed into an attention-with-context layer. We use a context-aware attention mechanism (Yang et al., 2016) to compute a fixed representation $r$ of a sequence as the weighted sum of all tokens in that sequence. This representation is used as a classification feature vector and is fed to the final fully-connected sigmoid layer, which outputs a continuous value representing the valence of a given sentence. A sketch of this mechanism follows.
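A numpy sketch of Yang et al. (2016)-style context attention: each token state is projected, scored against a learned context vector $u$, and the softmax-normalized scores weight the sum. Dimensions and random parameters are illustrative:

```python
import numpy as np

def context_attention(H, W, b, u):
    # u_t = tanh(W h_t + b); alpha_t = softmax(u_t . u); r = sum_t alpha_t h_t
    U = np.tanh(H @ W.T + b)          # (T, d_a) hidden token representations
    scores = U @ u                    # (T,) similarity to context vector u
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()              # softmax over tokens
    return alpha @ H                  # fixed-size weighted sum r

rng = np.random.default_rng(0)
T, d, d_a = 6, 300, 100
H = rng.normal(size=(T, d))
r = context_attention(H, rng.normal(size=(d_a, d)),
                      rng.normal(size=d_a), rng.normal(size=d_a))
print(r.shape)  # (300,)
```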

Training
In our training, we use mini-batch stochastic gradient descent with a batch size of 32 to minimize the mean-squared error via back-propagation. We use the Adam optimizer with a learning rate of 0.001 (Kingma & Ba, 2014). We train on 80% of the training set and validate on the remaining 20%. We test and report our results on both the development and test sets.
Regularization. We use dropout to randomly drop neurons from the network, which helps prevent co-adaptation of neurons (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). Dropout is also applied to the recurrent connections of our GRU-based layers. Additionally, we apply weight decay by setting an L2 regularization penalty (Cortes, Mohri, & Rostamizadeh, 2012).
Hyperparameters. The size of the embedding layer is 200, and that of the GRU layers is 150 (300 for bidirectional GRU). We apply a dropout of 0.4, with a dropout of 0.2 on the recurrent connections. Finally, an L2 regularization penalty of 0.00001 is applied in the loss function. A sketch of this configuration is given below.
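A Keras sketch of the training configuration described above. The wiring is abbreviated: the pooling layer stands in for the attention mechanisms, the character branch is omitted, and the input length and embedding dimension are illustrative:

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

word_input = tf.keras.Input(shape=(50,))    # assumed max word length
x = layers.Embedding(8145, 100)(word_input)
x = layers.Bidirectional(
    layers.GRU(150, return_sequences=True,
               dropout=0.4, recurrent_dropout=0.2))(x)  # dropout as in the paper
x = layers.GlobalAveragePooling1D()(x)      # placeholder for the attention layers
out = layers.Dense(1, activation="sigmoid",
                   kernel_regularizer=regularizers.l2(1e-5))(x)  # L2 = 0.00001

model = tf.keras.Model(word_input, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")                   # mean-squared error
# model.fit(X_train, y_train, batch_size=32, validation_split=0.2)
```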

Results
We report our results using the Pearson correlation between the predicted and gold ratings on the test set (all instances). The second metric (gold in 0.5-1, shown in Table 2) differs in that it includes only tweets with an intensity greater than or equal to 0.5.
Our model performed well on the development set, scoring 0.869, while on the test set the performance degraded to 0.752. This degradation could be related to the small size of the corpus used to train our word embedding; we also trained on only 80% of the training set.

Conclusion
In this paper, we presented a self-attentive hybrid GRU-based network for predicting valence intensity in short text. We used a hybrid approach that combines low-level character features with a self-attentive word embedding. Our network uses two different attention mechanisms to emphasize the relevant and important words, and hence optimize the feature representation.
With limited corpora and a vocabulary of 8152 words, our model still managed to achieve an optimized feature representation, which yielded excellent results on the development set. However, our model failed to maintain the same performance on the test set.
For future work, we will explore the performance of our model on the test set with larger corpora. It would also be interesting to see whether the model performs well on other, longer-text NLP tasks such as topic classification.