Tweet2Vec: Character-Based Distributed Representations for Social Media

Text from social media provides a set of challenges that can cause traditional NLP approaches to fail. Informal language, spelling errors, abbreviations, and special characters are all commonplace in these posts, leading to a prohibitively large vocabulary size for word-level approaches. We propose a character composition model, tweet2vec, which finds vector-space representations of whole tweets by learning complex, non-local dependencies in character sequences. The proposed model outperforms a word-level baseline at predicting user-annotated hashtags associated with the posts, doing significantly better when the input contains many out-of-vocabulary words or unusual character sequences. Our tweet2vec encoder is publicly available.


Introduction
We understand from Zipf's Law that in any natural language corpus a majority of the vocabulary word types will either be absent or occur in low frequency. Estimating the statistical properties of these rare word types is naturally a difficult task. This is analogous to the curse of dimensionality when we deal with sequences of tokens -most sequences will occur only once in the training data. Neural network architectures overcome this problem by defining non-linear compositional models over vector space representations of tokens and hence assign non-zero probability even to sequences not seen during training (Bengio et al., 2003;Kiros et al., 2015). In this work, we explore a similar approach to learning distributed representations of social media posts by 1 https://github.com/bdhingra/tweet2vec composing them from their constituent characters, with the goal of generalizing to out-of-vocabulary words as well as sequences at test time.
Traditional Neural Network Language Models (NNLMs) treat words as the basic units of language and assign independent vectors to each word type. To constrain memory requirements, the vocabulary size is fixed before-hand; therefore, rare and out-of-vocabulary words are all grouped together under a common type 'UNKNOWN'. This choice is motivated by the assumption of arbitrariness in language, which means that surface forms of words have little to do with their semantic roles. Recently, (Ling et al., 2015) challenge this assumption and present a bidirectional Long Short Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) for composing word vectors from their constituent characters which can memorize the arbitrary aspects of word orthography as well as generalize to rare and out-of-vocabulary words.
Encouraged by their findings, we extend their approach to a much larger unicode character set, and model long sequences of text as functions of their constituent characters (including whitespace). We focus on social media posts from the website Twitter, which are an excellent testing ground for character based models due to the noisy nature of text. Heavy use of slang and abundant misspellings means that there are many orthographically and semantically similar tokens, and special characters such as emojis are also immensely popular and carry useful semantic information. In our moderately sized training dataset of 2 million tweets, there were about 0.92 million unique word types. It would be expensive to capture all these phenomena in a word based model in terms of both the memory requirement (for the increased vocabulary) and the amount of training data required for effective learning. Additional benefits of the character based approach include language independence of the methods, and no requirement of NLP preprocessing such as word-segmentation.
A crucial step in learning good text representations is to choose an appropriate objective function to optimize. Unsupervised approaches attempt to reconstruct the original text from its latent representation (Mikolov et al., 2013;Bengio et al., 2003). Social media posts however, come with their own form of supervision annotated by millions of users, in the form of hashtags which link posts about the same topic together. A natural assumption is that the posts with the same hashtags should have embeddings which are close to each other. Hence, we formulate our training objective to maximize cross-entropy loss at the task of predicting hashtags for a post from its latent representation.
We propose a Bi-directional Gated Recurrent Unit (Bi-GRU) (Chung et al., 2014) neural network for learning tweet representations. Treating white-space as a special character itself, the model does a forward and backward pass over the entire sequence, and the final GRU states are linearly combined to get the tweet embedding. Posterior probabilities over hashtags are computed by projecting this embedding to a softmax output layer. Compared to a word-level baseline this model shows improved performance at predicting hashtags for a held-out set of posts. Inspired by recent work in learning vector space text representations, we name our model tweet2vec.

Related Work
Using neural networks to learn distributed representations of words dates back to (Bengio et al., 2003). More recently, (Mikolov et al., 2013) released word2vec -a collection of word vectors trained using a recurrent neural network. These word vectors are in widespread use in the NLP community, and the original work has since been extended to sentences (Kiros et al., 2015), documents and paragraphs (Le and Mikolov, 2014), topics (Niu and Dai, 2015) and queries (Grbovic et al., 2015). All these methods require storing an extremely large table of vectors for all word types and cannot be easily generalized to unseen words at test time (Ling et al., 2015). They also require preprocessing to find word boundaries which is non-trivial for a social network domain like Twitter.
In (Ling et al., 2015), the authors present a compositional character model based on bidirectional LSTMs as a potential solution to these problems. A major benefit of this approach is that large word lookup tables can be compacted into character lookup tables and the compositional model scales to large data sets better than other stateof-the-art approaches. While (Ling et al., 2015) generate word embeddings from character representations, we propose to generate vector representations of entire tweets from characters in our tweet2vec model.
Our work adds to the growing body of work showing the applicability of character models for a variety of NLP tasks such as Named Entity Recognition (Santos and Guimarães, 2015), POS tagging (Santos and Zadrozny, 2014), text classification (Zhang et al., 2015) and language modeling (Karpathy et al., 2015;Kim et al., 2015).
Previously, (Luong et al., 2013) dealt with the problem of estimating rare word representations by building them from their constituent morphemes. While they show improved performance over word-based models, their approach requires a morpheme parser for preprocessing which may not perform well on noisy text like Twitter. Also the space of all morphemes, though smaller than the space of all words, is still large enough that modelling all morphemes is impractical.
Hashtag prediction for social media has been addressed earlier, for example in (Weston et al., 2014;Godin et al., 2013). (Weston et al., 2014) also use a neural architecture, but compose text embeddings from a lookup table of words. They also show that the learned embeddings can generalize to an unrelated task of document recommendation, justifying the use of hashtags as supervision for learning text representations.

Tweet2Vec
Bi-GRU Encoder: Figure 1 shows our model for encoding tweets. It uses a similar structure to the C2W model in (Ling et al., 2015), with LSTM units replaced with GRU units.
The input to the network is defined by an alphabet of characters C (this may include the entire unicode character set). The input tweet is broken into a stream of characters c 1 , c 2 , ...c m each of which is represented by a 1-by-|C| encoding. These one-hot vectors are then projected to a character space by multiplying with the matrix P C ∈ ..x m be the sequence of character vectors for the input tweet after the lookup. The encoder consists of a forward-GRU and a backward-GRU. Both have the same architecture, except the backward-GRU processes the sequence in reverse order. Each of the GRU units process these vectors sequentially, and starting with the initial state h 0 compute the sequence h 1 , h 2 , ...h m as follows: Here r t , z t are called the reset and update gates respectively, andh t is the candidate output state which is converted to the actual output state h t . W r , W z , W h are d h × d c matrices and U r , U z , U h are d h × d h matrices, where d h is the hidden state dimension of the GRU. The final states h f m from the forward-GRU, and h b 0 from the backward GRU are combined using a fully-connected layer to the give the final tweet embedding e t : where d t is the dimension of the final tweet embedding. In our experiments we set d t = d h . All parameters are learned using gradient descent.

Softmax:
Finally, the tweet embedding is passed through a linear layer whose output is the same size as the number of hashtags L in the data set. We use a softmax layer to compute the posterior hashtag probabilities: . (2) Objective Function: We optimize the categorical cross-entropy loss between predicted and true hashtags: Here B is the batch size, L is the number of classes, p i,j is the predicted probability that the ith tweet has hashtag j, and t i,j ∈ {0, 1} denotes the ground truth of whether the j-th hashtag is in the i-th tweet. We use L2-regularization weighted by λ.

Word Level Baseline
Since our objective is to compare character-based and word-based approaches, we have also implemented a simple word-level encoder for tweets. The input tweet is first split into tokens along white-spaces. A more sophisticated tokenizer may be used, but for a fair comparison we wanted to keep language specific preprocessing to a minimum. The encoder is essentially the same as tweet2vec, with the input as words instead of characters. A lookup table stores word vectors for the V (20K here) most common words, and the rest are grouped together under the 'UNK' token.

Data
Our dataset consists of a large collection of global posts from Twitter 2 between the dates of June 1, 2013 to June 5, 2013. Only English language posts (as detected by the lang field in Twitter API) and posts with at least one hashtag are retained. We removed infrequent hashtags (< 500 posts) since they do not have enough data for good generalization. We also removed very frequent tags (> 19K posts) which were almost always from automatically generated posts (ex: #androidgame) which are trivial to predict. The final dataset contains 2 million tweets for training, 10K for validation and 50K for testing, with a total of 2039 distinct hashtags. We use simple regex to preprocess the post text and remove hashtags (since these are to be predicted) and HTML tags, and replace usernames and URLs with special tokens. We also removed retweets and convert the text to lower-case.

Implementation Details
Word vectors and character vectors are both set to size d L = 150 for their respective models. There were 2829 unique characters in the training set and we model each of these independently in a character look-up table. Embedding sizes were chosen such that each model had roughly the same number of parameters (Table 2). Training is performed using mini-batch gradient descent with Nesterov's momentum. We use a batch size B = 64, initial learning rate η 0 = 0.01 and momentum parameter µ 0 = 0.9. L2-regularization with λ = 0.001 was applied to all models. Initial weights were drawn from 0-mean gaussians with σ = 0.1 and initial biases were set to 0. The hyperparameters were tuned one at a time keeping others fixed, and values with the lowest validation cost were chosen. The resultant combination was used to train the models until performance on validation set stopped increasing. During training, the learning rate is halved everytime the validation set precision increases by less than 0.01 % from one epoch to the next. The models converge in about 20 epochs. Code for training both the models is publicly available on github.

Results
We test the character and word-level variants by predicting hashtags for a held-out test set of posts.
Since there may be more than one correct hashtag per post, we generate a ranked list of tags for each  post from the output posteriors, and report average precision@1, recall@10 and mean rank of the correct hashtags. These are listed in Table 3. To see the performance of each model on posts containing rare words (RW) and frequent words (FW) we selected two test sets each containing 2,000 posts. We populated these sets with posts which had the maximum and minimum number of out-of-vocabulary words respectively, where vocabulary is defined by the 20K most frequent words. Overall, tweet2vec outperforms the word model, doing significantly better on RW test set and comparably on FW set. This improved performance comes at the cost of increased training time (see Table 2), since moving from words to characters results in longer input sequences to the GRU.
We also study the effect of model size on the performance of these models. For the word model we set vocabulary size V to 8K, 15K and 20K respectively. For tweet2vec we set the GRU hidden state size to 300, 400 and 500 respectively. Figure 2 Table 4: Precision @1 as training data size and number of output labels is increased. Note that the test set is different for each setting.
set described above. There is not much variation in the performance, and moreover tweet2vec always outperforms the word based model for the same number of parameters. Table 4 compares the models as complexity of the task is increased. We created 3 datasets (small, medium and large) with an increasing number of hashtags to be predicted. This was done by varying the lower threshold of the minimum number of tags per post for it to be included in the dataset. Once again we observe that tweet2vec outperforms its word-based counterpart for each of the three settings.
Finally, table 1 shows some predictions from the word level model and tweet2vec. We selected these to highlight some strengths of the character based approach -it is robust to word segmentation errors and spelling mistakes, effectively interprets emojis and other special characters to make predictions, and also performs comparably to the word-based approach for in-vocabulary tokens.

Conclusion
We have presented tweet2vec -a character level encoder for social media posts trained using supervision from associated hashtags. Our result shows that tweet2vec outperforms the word based approach, doing significantly better when the input post contains many rare words. We have focused only on English language posts, but the character model requires no language specific preprocessing and can be extended to other languages. For future work, one natural extension would be to use a character-level decoder for predicting the hashtags. This will allow generation of hashtags not seen in the training dataset. Also, it will be interesting to see how our tweet2vec embeddings can be used in domains where there is a need for semantic understanding of social media, such as tracking infectious diseases (Signorini et al., 2011). Hence, we provide an off-the-shelf encoder trained on medium dataset described above to compute vector-space representations of tweets along with our code on github.