Hallym: Named Entity Recognition on Twitter with Word Representation

Twitter is a type of social media that contains diverse user-generated texts. Traditional models do not transfer well to tweet data because the text is not as grammatical as that of newswire. In this paper, we construct word embeddings via canonical correlation analysis (CCA) on a considerable amount of tweet data and show the efficacy of word representation. Besides word embeddings, we use part-of-speech (POS) tags, chunks, and Brown clusters induced from Wikipedia as features. Here, we describe our system and present the final results along with their analysis. Our model achieves an F1 score of 37.21% with entity types and distinguishes 53.01% of the entity boundaries.


Introduction
Named entity recognition (NER) is the task of finding and classifying names of things, such as persons, locations, and organizations, given a sequence of words. NER is an important subtask of information extraction (IE). With the development of the Internet, a huge amount of information has been generated by users. The information generated on the Internet, particularly on social media (e.g., Twitter and Facebook), includes very diverse and noisy texts. The volume of Twitter data has increased rapidly, and about 500 million tweets are sent per day (see http://www.internetlivestats.com/twitter-statistics/). In recent years, Twitter data have come to be regarded as a new kind of data source, and researchers are paying increasing attention to them (Bollen et al., 2011; Mathioudakis and Koudas, 2010).
Twitter is a microblogging service in which users post content such as short messages, individual images, or videos. There are a number of microblogging sites, such as Twitter, Tumblr, Plurk, and identi.ca, and each service has its own characteristics. For example, Plurk has a timeline view for videos and pictures, and Twitter has "status updates." The "status updates" format is one of the characteristics that make the classification of named entities on Twitter difficult. Twitter limits the number of characters that can be posted at once, so people express their thoughts in short sentences; as a result, tweets often do not contain sufficient contextual information (Ritter et al., 2011).
The shared task of ACL W-NUT 2015 is to find named entities on Twitter. We focus on ten types of named entities: company, facility, geo-loc, movie, musicartist, other, person, product, sportsteam, and tvshow. The shared task provides training and development data for Twitter along with 53 gazetteers. In this paper, we describe the datasets in Section 2 and present the model in Section 3. In Section 4, we discuss the features and the methods used for generating them. We present our final results along with their analysis in Section 5 and conclude in Section 6.

Data and Labels
In this section, we introduce the considered datasets and describe the data format used. We also list the characteristics of each entity type with some examples.

Data
The datasets provided by the shared task are raw tweets. Table 1 shows an overview of the sizes of these datasets. In the data files, each line contains a word and its label separated by a tab, and a blank line marks a sentence boundary. All tokens follow the IOB format. A token with a B- prefix indicates the beginning of a named entity, and a token with an I- prefix indicates the inside of a named entity. An I- prefixed token appears only after a B- prefixed token has opened the entity. An O tag indicates that a token does not belong to any named entity.
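The format described above can be read with a few lines of Python. The following is a minimal sketch, not the shared task's own tooling: the function names `read_iob` and `extract_entities` and the sample tweet are illustrative assumptions, and stray I- tags without a preceding B- tag are simply ignored.

```python
def read_iob(text):
    """Split tab-separated IOB text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.rsplit("\t", 1)
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence):
    """Return (entity_type, start, end) spans; end is exclusive."""
    spans, start, etype = [], None, None
    for i, (_, tag) in enumerate(sentence):
        # An I- tag continues the open entity only if the types agree.
        inside = tag.startswith("I-") and etype == tag[2:]
        if not inside:
            if etype is not None:
                spans.append((etype, start, i))
                start, etype = None, None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
    if etype is not None:
        spans.append((etype, start, len(sentence)))
    return spans


# Hypothetical sample in the described format (not taken from the dataset).
sample = (
    "Taylor\tB-musicartist\n"
    "Swift\tI-musicartist\n"
    "in\tO\n"
    "Chicago\tB-geo-loc\n"
    "\n"
    "hello\tO\n"
)
sentences = read_iob(sample)
entities = extract_entities(sentences[0])
```

Representing entities as typed spans rather than per-token tags makes it straightforward to separate boundary errors from type errors later in the analysis.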

Labels
In the system, we focus on the following ten types of named entities:
• company: The name of a company or a brand, e.g., Snapchat, Twitter, and Facebook.
• facility: The name of an institution such as a museum, a center, or a restaurant, e.g., Iowa City schools and Disneyland.
• geo-loc: The name of a city or country, e.g., Chicago and Russia.
• movie: The title of a movie, e.g., Interstellar and Inception.
• musicartist: The name of music groups or disc jockeys (DJs), e.g., Taylor Swift and Lady Gaga.
• other: A phrase that can be used generally, such as the name of a ceremony or an anniversary, or the title of a song, e.g., X-mas and Murphy's law.

Model
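Conditional random fields (CRFs) (Lafferty et al., 2001) and their variants have been successfully applied to various sequence labeling tasks (Maaten et al., 2011; Collins, 2002; McCallum and Li, 2003; Kim and Snyder, 2012; Kim et al., 2015b; Kim et al., 2015a; Kim and Snyder, 2013a; Kim and Snyder, 2013b). The NER task produces a sequence of named entity tags, $y = (y_1 \ldots y_n)$, given a sequence of words, $x = (x_1 \ldots x_n)$. We model the conditional probability $p(y|x; \theta)$ using linear-chain CRFs:
$$p(y|x; \theta) = \frac{\exp(\theta \cdot \Phi(x, y))}{\sum_{y' \in \mathcal{Y}(x)} \exp(\theta \cdot \Phi(x, y'))}$$
where $\theta$ denotes a set of model parameters, $\mathcal{Y}(x)$ returns all possible label sequences of $x$, and $\Phi$ maps $(x, y)$ into a feature vector that is a linear sum of the local feature vectors:
$$\Phi(x, y) = \sum_{i=1}^{n} \phi(x, i, y_{i-1}, y_i)$$
Given labeled training sequences $\{(x^{(j)}, y^{(j)})\}_{j=1}^{N}$, the objective of the training is to find $\theta$ that maximizes the log likelihood of the training data under the model with $l_2$-regularization:
$$\theta^* = \arg\max_{\theta} \sum_{j=1}^{N} \log p(y^{(j)}|x^{(j)}; \theta) - \frac{\lambda}{2} \|\theta\|^2$$
where $\lambda$ is the regularization strength.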
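To make the normalization over all label sequences concrete, the following is a minimal sketch of how log p(y|x; θ) can be computed for a linear-chain model with the forward algorithm. It assumes that the dot product of the parameters with each local feature vector has already been collapsed into per-position emission scores and label-to-label transition scores; the toy scores, the two-label set, and the omission of a start transition are illustrative assumptions, not the system's actual parameterization.

```python
import math
from itertools import product


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sequence_score(emit, trans, labels):
    """Unnormalized score of one label sequence: sum of local scores."""
    score = emit[0][labels[0]]
    for i in range(1, len(labels)):
        score += trans[labels[i - 1]][labels[i]] + emit[i][labels[i]]
    return score


def log_partition(emit, trans):
    """Forward algorithm: log of the summed exp-scores of all label sequences."""
    n_labels = len(trans)
    alpha = list(emit[0])
    for i in range(1, len(emit)):
        alpha = [
            emit[i][k]
            + log_sum_exp([alpha[j] + trans[j][k] for j in range(n_labels)])
            for k in range(n_labels)
        ]
    return log_sum_exp(alpha)


def log_prob(emit, trans, labels):
    """log p(y | x) = score(x, y) - log Z(x)."""
    return sequence_score(emit, trans, labels) - log_partition(emit, trans)


# Toy scores (assumed values): 3 tokens, 2 labels (0 = O, 1 = B-person).
emit = [[0.5, 1.0], [2.0, -1.0], [0.0, 0.3]]
trans = [[0.2, -0.4], [0.1, -1.5]]

# The probabilities of all 2^3 label sequences should sum to one.
total = sum(
    math.exp(log_prob(emit, trans, y)) for y in product(range(2), repeat=3)
)
```

The forward recursion visits each position once, so its cost grows linearly with sequence length, whereas enumerating every label sequence grows exponentially.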

Features
In this section, we describe the features used in this study. We use CRFsuite because it makes adding new features easy. Apart from the base features and gazetteer features provided by the organizers, we use the following new features: POS tags, chunks, Brown clustering, and word representation. Our model is composed of the following features:

Base features
Base features include the gazetteer features and orthographic features. In the NER task, a huge amount of unlabeled data is often used for identifying unseen entities. The baseline system already includes 53 gazetteers. The maximum window size for gazetteer features is 6, and the model learns the named entity type associated with a specific phrase if the phrase appears in one or more of the gazetteer lexicons. Orthographic features can be divided into five types. The orthographic feature templates are as follows:
• n-gram: w_i for i in {-1, 0, 1}, and the conjunction of the previous word and the current word, w_{i-1}|w_i, for i in {-1, 0}.
• Affixes: Prefixes and suffixes of x_i, i.e., the first and last n characters for n ranging from 1 to 3.
• Capitalization: There are two patterns of capitalization: one is an indicator of capitalization for the first character, and the other is an indicator of capitalization for all characters.
• Digit: There are three patterns for numbers: i) whether the current word has a digit, ii) whether the current word is a single digit, and iii) whether the current word has two digits.
• Non-alphabet: Whether the current word contains a hyphen or other punctuation marks. One of these punctuation marks is the colon (:). In the feature format, whatever follows a colon is read as a feature weight, so a literal colon inside a token would be misinterpreted. To make the model learn correctly, we normalize only the colon.
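The orthographic templates above can be sketched as a single feature function. This is an illustrative reading of the templates rather than the baseline system's code: the feature names, the `<PAD>` token for positions outside the tweet, the interpretation of "has two digits" as exactly two digit characters, and the `__COLON__` replacement string are all assumptions.

```python
import string


def orthographic_features(words, i):
    """Return orthographic feature strings for the token at position i."""

    def word_at(j):
        # Positions outside the tweet map to a padding token (assumption).
        return words[j] if 0 <= j < len(words) else "<PAD>"

    w = words[i]
    feats = []

    # n-gram: neighbouring words and conjunctions of adjacent words.
    for off in (-1, 0, 1):
        feats.append(f"w[{off}]={word_at(i + off)}")
    for off in (-1, 0):
        left, right = word_at(i + off - 1), word_at(i + off)
        feats.append(f"w[{off - 1}]|w[{off}]={left}|{right}")

    # Affixes: first and last n characters, n = 1..3.
    for n in (1, 2, 3):
        if len(w) >= n:
            feats.append(f"prefix{n}={w[:n]}")
            feats.append(f"suffix{n}={w[-n:]}")

    # Capitalization: first character, and all characters.
    if w[:1].isupper():
        feats.append("init_cap")
    if w.isupper():
        feats.append("all_caps")

    # Digits: any digit, a single digit, exactly two digit characters.
    n_digits = sum(c.isdigit() for c in w)
    if n_digits > 0:
        feats.append("has_digit")
    if len(w) == 1 and w.isdigit():
        feats.append("single_digit")
    if n_digits == 2:
        feats.append("two_digits")

    # Non-alphabet: hyphen and other punctuation.
    if "-" in w:
        feats.append("has_hyphen")
    if any(c in string.punctuation for c in w):
        feats.append("has_punct")

    # Normalize the colon so it cannot be mistaken for a weight separator.
    return [f.replace(":", "__COLON__") for f in feats]


tweet = ["Visit", "New-York", "at", "10:30"]
feats_name = orthographic_features(tweet, 1)
feats_time = orthographic_features(tweet, 3)
```

Normalizing the colon after all features are generated, rather than in the raw token, keeps the other templates (affixes, punctuation) operating on the original characters.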

POS tags and chunks
In the NER task, POS tags and chunks contain very useful information for finding and classifying named entities. We predict POS tags and chunks by using models trained on Twitter data. For POS tags, we use a model trained with the Penn Treebank-style tagset (Ritter et al., 2011), to which Ritter et al. (2011) added Twitter-specific tags for retweets, @usernames, #hashtags, and URLs. For chunks, we use a named entity tagger by Ritter et al. (2012). Predicted tags are used as features as follows:
• POS tag: a conjunction feature of the current word and the current POS tag, w_0|p_0.
• Chunk tag: a unigram feature for the chunk tag, c_0, and a conjunction feature of the current word and the current chunk tag, w_0|c_0.
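As a small illustration of the two templates above, the sketch below builds the POS and chunk features for each token of a tweet. The function name, the feature string layout, and the example tags are assumptions made for illustration; in the actual system the tags come from the Twitter-trained taggers.

```python
def tag_features(words, pos_tags, chunk_tags):
    """Build per-token POS and chunk features for one tweet."""
    features = []
    for w, p, c in zip(words, pos_tags, chunk_tags):
        features.append([
            f"w0|p0={w}|{p}",  # current word conjoined with its POS tag
            f"c0={c}",         # chunk tag on its own
            f"w0|c0={w}|{c}",  # current word conjoined with its chunk tag
        ])
    return features


# Hypothetical predicted tags for a two-token tweet.
feats = tag_features(["Taylor", "sings"], ["NNP", "VBZ"], ["B-NP", "B-VP"])
```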

Brown clustering
Brown clustering is a hierarchical clustering method that groups words into a binary tree of classes (Brown et al., 1992). We downloaded Brown clusters based on Wikipedia provided by Turian et al. (2010). We use the whole bit string of the current word as a feature.
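The lookup can be sketched as follows. The sketch assumes the clusters are stored one word per line as "bit string, tab, word, tab, count", which is a common layout for Brown cluster files but is not confirmed by the text; the bit strings and counts in the example are made up.

```python
def load_brown_paths(lines):
    """Parse 'bitstring<TAB>word<TAB>count' lines into a word -> bitstring map."""
    clusters = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2:
            clusters[parts[1]] = parts[0]
    return clusters


def brown_feature(word, clusters):
    """Return the whole-bit-string feature, or nothing for unknown words."""
    bits = clusters.get(word)
    return [f"brown={bits}"] if bits is not None else []


# Hypothetical cluster entries (illustrative values only).
toy_lines = ["0110\tchicago\t120\n", "0111\tboston\t95\n", "1010\trun\t300\n"]
clusters = load_brown_paths(toy_lines)
known = brown_feature("chicago", clusters)
unknown = brown_feature("southie", clusters)
```

Because words that share a long bit-string prefix sit close together in the cluster tree, the bit string gives the model a signal that generalizes beyond the exact word form.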

Word representation
Because tweets are a new data source with a different text structure, traditional models do not apply well to them. For a new model, it is natural to use annotated data. However, it is difficult to keep creating labeled data at the rate at which tweets are generated. Instead of constantly annotating new data, a general solution is to induce word representations from a large body of unlabeled data (Mikolov et al., 2013; Pennington et al., 2014; Anastasakos et al., 2014). Much previous work has used CCA because of its simplicity and generality (Kim et al., 2015c; Kim et al., 2015d; Stratos et al., 2014; Kim et al., 2015b). We create a word representation by using canonical correlation analysis (Hotelling, 1936). The word embeddings are induced from 13 million tweets containing 270 million tokens. We use 50-dimensional embeddings for words occurring more than twice in the data. The window size for the contextual information is 3: the current word and one word to its left and one to its right.
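The settings above (a window of 3 and a frequency threshold) determine the word-context statistics that a CCA-based method takes as input. The sketch below covers only that counting step, not the CCA projection itself, which would additionally require a low-rank decomposition of the scaled count matrix. Mapping infrequent words to a `<RARE>` token and marking contexts with `L=` and `R=` are assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_counts(tweets, min_count=3):
    """Count (word, context) pairs using one word to the left and right.

    Words seen fewer than min_count times are replaced by "<RARE>" when they
    act as context and are skipped as target words (an assumption).
    """
    freq = Counter(w for tweet in tweets for w in tweet)

    def norm(w):
        return w if freq[w] >= min_count else "<RARE>"

    counts = Counter()
    for tweet in tweets:
        toks = [norm(w) for w in tweet]
        for i, w in enumerate(toks):
            if w == "<RARE>":
                continue
            if i > 0:
                counts[(w, "L=" + toks[i - 1])] += 1
            if i + 1 < len(toks):
                counts[(w, "R=" + toks[i + 1])] += 1
    return counts


# Tiny hypothetical corpus; "xyz" occurs once and falls below the threshold.
tweets = [["i", "love", "nyc"]] * 3 + [["i", "love", "xyz"]]
counts = cooccurrence_counts(tweets)
```

With `min_count=3`, only words occurring more than twice are kept as targets, matching the threshold stated in the text.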

Error analysis
Twitter contains noisy and informal text, and most state-of-the-art applications perform poorly on Twitter data (Ritter et al., 2011). In this section, we examine the errors that the baseline system makes on noisy text and categorize them. The last two error categories are specific to user-generated texts such as tweets.
Unseen word sequences: The main cause of this error is a previously unseen sequence. A huge number of tweets are posted on Twitter every day, and they contain up-to-date information on events. The most recent information, such as new product information, can lead to the formation of unprecedented word sequences. These sequences do not appear in the training data.
Foreign languages: This error is caused by tweets written in languages other than English. Words written in foreign languages are annotated with the O tag and are not treated as named entities. However, some of these words have the same spelling as an English word and thus activate the gazetteer features. As a result, words with the O tag are predicted as a named entity type.
Type disambiguation: Some words have the same spelling but belong to different types depending on the context. This error is often observed for named entities such as sportsteam and musicartist. Word sequences with this error have a correctly identified entity boundary but are assigned the wrong entity type. For example, Tampa Bay in "Losing to the Penguins quasi-AHL lineup in December is a non-issue for Tampa Bay" is a sportsteam entity, but the model classifies it as geo-loc. In another example, the names of the two music artists in "Will Shawn Mendez be opening up for Taylor Swift" are predicted as person rather than musicartist.
Informal names or abbreviations: Twitter users compress what they want to say to meet the limit of 140 characters. This leads to informal text, unlike in news articles. Note that the abbreviations considered here are not official short forms such as those of airports or countries. For example, Southie in "Proud that the 1st modern Olympic Champion is James Brendan Connolly of #Southie ." is an informal name for South Boston, and this word does not appear in the training set or the gazetteers. As for abbreviations, people abbreviate days and months, such as Mon for Monday and Jan for January. These words are contained in the gazetteers and activate the gazetteer features, so the model makes errors by predicting them as named entities.
Hashtag: A hashtag is a combination of the "#" sign and some characters for organizing word sequences as searchable links in Twitter. The rule is to not use any space between the characters in the hashtag. For instance, the word New Delhi is transformed into #NewDelhi as a hashtag, so it is difficult to check the gazetteer lexicons for such text.
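The mismatch can be illustrated with a small gazetteer lookup. This is a sketch under assumptions, not the baseline system's matcher: it checks every token span up to the maximum window size of 6 against a lexicon of lowercased, space-joined phrases, and the lexicon contents are hypothetical.

```python
def gazetteer_matches(tokens, lexicon, max_window=6):
    """Return (start, end, phrase) for each span of up to max_window tokens
    whose lowercased, space-joined form appears in the lexicon."""
    matches = []
    for start in range(len(tokens)):
        last = min(start + max_window, len(tokens))
        for end in range(start + 1, last + 1):
            phrase = " ".join(tokens[start:end]).lower()
            if phrase in lexicon:
                matches.append((start, end, phrase))
    return matches


lexicon = {"new delhi"}  # hypothetical one-entry geo-loc gazetteer

plain = gazetteer_matches(["Flying", "to", "New", "Delhi"], lexicon)
tagged = gazetteer_matches(["Flying", "to", "#NewDelhi"], lexicon)
```

The plain tweet yields a match for the two-token span, while the hashtag version yields none, because "#NewDelhi" is a single token with no internal space.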

The effectiveness of word embedding
In this subsection, we describe the effectiveness of word embedding by analyzing the results obtained by using the model with and without word embedding. The only difference between the two models is the use of Brown clustering and the word representation based on CCA.
In the NER task, the F1 score is a more appropriate metric than accuracy. Most tokens in NER data carry the O tag, indicating that they are not part of an entity, so a model can reach high accuracy simply by predicting O. The F1 score, the harmonic mean of precision and recall, gives a more reasonable measure. Table 2 shows the results obtained by using models with and without word embedding. As shown in Table 2, Brown clustering and word embedding improve performance. All entity types except movie show error reduction. To determine the efficacy of word embedding, we compare the errors made by the models without and with word embedding. We find that word embedding plays an important role in resolving the problem of unseen word sequences and the problem of type disambiguation. First, the model without word embedding fails to recognize the entity "ipad Mini Retina 2nd Generation 16GB wifi" because some of its words do not appear in the training data. In contrast, the model with word embedding can draw on the induced word representation for unseen words, which helps it predict that this entity is a product name. Second, the model without word embedding has trouble disambiguating the word Edison because it learns only from the gazetteers that this word is a person's name. However, in the word sequence "Edison #weather on January 16 , 2015", Edison refers to a town in New Jersey. The model with word embedding receives additional information from the word representation and correctly predicts this word as geo-loc.
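The difference between the two metrics can be seen on a toy example. The sketch below assumes entities are compared as exact (type, start, end) spans and accuracy is computed per token; the tag sequence is invented for illustration and is not taken from the shared task data.

```python
def prf1(gold_spans, pred_spans):
    """Precision, recall, and F1 over exact-match entity spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def token_accuracy(gold_tags, pred_tags):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)


# A 20-token tweet with one two-token entity; the model predicts O everywhere.
gold_tags = ["O"] * 18 + ["B-geo-loc", "I-geo-loc"]
pred_tags = ["O"] * 20

accuracy = token_accuracy(gold_tags, pred_tags)
_, _, f1 = prf1([("geo-loc", 18, 20)], [])
```

Here the all-O prediction is right on 18 of 20 tokens, yet it finds no entity at all, so its F1 score is zero.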

Conclusion
In this paper, we described the data and features used for generating our model. Besides POS tags and chunk tags, we used a word representation based on CCA for improving the model's performance. Our final model shows an error reduction of 14.08% from the baseline system. We also presented some primary and Twitter-specific problems by categorizing errors.