Context-aware Entity Morph Decoding

People create morphs, a special type of fake alternative names, to achieve certain communication goals such as expressing strong sentiment or evading censors. For example, “Black Mamba”, the name of a highly venomous snake, is a morph that Kobe Bryant created for himself due to his agility and aggressiveness in playing basketball games. This paper presents the first end-to-end context-aware entity morph decoding system that can automatically identify, disambiguate, and verify morph mentions based on specific contexts, and resolve them to target entities. Our approach is based on an absolute “cold-start”: it does not require any candidate morph or target entity lists as input, nor any manually constructed morph-target pairs for training. We design a semi-supervised collective inference framework for morph mention extraction, and compare various deep learning based approaches for morph resolution. Our approach achieved significant improvement over


Introduction
Morphs (Huang et al., 2013; Zhang et al., 2014) refer to the fake alternative names created by social media users to entertain readers or evade censors. For example, during the World Cup in 2014, a morph "Su-tooth" was created to refer to the Uruguay striker "Luis Suarez" for his habit of biting other players. Automatically decoding human-generated morphs in text is critical for downstream deep language understanding tasks such as entity linking and event argument extraction.
However, even for humans, it is difficult to decode many morphs without certain historical, cultural, or political background knowledge (Zhang et al., 2014). For example, "The Hutt" can be used to refer to a fictional alien entity in the Star Wars universe ("The Hutt stayed and established himself as ruler of Nam Chorios"), or the governor of New Jersey, Chris Christie ("The Hutt announced a bid for a seat in the New Jersey General Assembly"). Huang et al. (2013) did a pioneering pilot study on morph resolution, but their approach assumed the entity morphs were already extracted and used a large amount of labeled data. In fact, they resolved morphs at the corpus level instead of the mention level, and thus their approach was context-independent. A practical morph decoder, as depicted in Figure 1, consists of two problems: (1) Morph Extraction: given a corpus, extract morph mentions; and (2) Morph Resolution: for each morph mention, figure out the entity that it refers to.
In this paper, we aim to solve the fundamental research problem of end-to-end morph decoding and propose a series of novel solutions to tackle the following challenges.

Challenge 1: Large-scope candidates
Only a very small percentage of terms can be used as morphs, which should be interesting and fun. In a sample of 4,668 Chinese Weibo tweets that we annotated, only 450 out of 19,704 unique terms are morphs. To extract morph mentions, we propose a two-step approach that first identifies individual mention candidates to narrow down the search scope, and then verifies whether they refer to morphed entities instead of their original meanings.

Challenge 2: Ambiguity, Implicitness, Informality
Compared to regular entities, many morphs contain informal terms with hidden information. For example, "不厚 (not thick)" is used to refer to "薄熙来 (Bo Xilai)", whose last name "薄 (Bo)" means "thin". Therefore we attempt to model the rich contexts with careful considerations for morph characteristics both globally (e.g., language models learned from a large amount of data) and locally (e.g., phonetic anomaly analysis) to extract morph mentions.
For morph resolution, the main challenge is that the surface forms of morphs usually appear quite different from their target entity names. Based on the distributional hypothesis (Harris, 1954), which states that words that often occur in similar contexts tend to have similar meanings, we propose to use deep learning techniques to capture and compare the deep semantic representations of a morph and its candidate target entities based on their contextual clues. For example, the morph "平西王 (Conquer West King)" and its target entity "薄熙来 (Bo Xilai)" share similar implicit contextual representations such as "重庆 (Chongqing)" (Bo was the governor of Chongqing) and "倒台 (fall from power)".

Challenge 3: Lack of labeled data
To the best of our knowledge, no sufficiently large set of mention-level morph annotations exists for training an end-to-end decoder. Manual morph annotation requires native speakers with the relevant cultural background (Zhang et al., 2014). In this paper we focus on exploring novel approaches to save annotation cost in each step. For morph extraction, based on the observation that morphs tend to share similar characteristics and appear together, we propose a semi-supervised collective inference approach to extract morph mentions from multiple tweets simultaneously. Deep learning techniques have been successfully used to model word representation in an unsupervised fashion. For morph resolution, we make use of a large amount of unlabeled data to learn the semantic representations of morphs and target entities based on the unsupervised continuous bag-of-words method (Mikolov et al., 2013b).

Problem Formulation
Following the recent work on morphs (Huang et al., 2013; Zhang et al., 2014), we use Chinese Weibo tweets for experiments. Our goal is to develop an end-to-end system that automatically extracts morph mentions and resolves them to their target entities. Given a corpus of tweets $D = \{d_1, d_2, \ldots, d_{|D|}\}$, we define a candidate morph $m_i$ as a unique term $t_j$ in $T$, where $T = \{t_1, t_2, \ldots, t_{|T|}\}$ is the set of unique terms in $D$. To extract $T$, we first apply several well-developed Natural Language Processing tools, including the Stanford Chinese word segmenter (Chang et al., 2008), the Stanford part-of-speech tagger (Toutanova et al., 2003) and the Chinese lexical analyzer ICTCLAS (Zhang et al., 2003), to process the tweets and identify noun phrases. Then we define a morph mention $m_i^p$ of $m_i$ as the $p$-th occurrence of $m_i$ in a specific document $d_j$. Note that a mention with the same surface form as $m_i$ but referring to its original entity is not considered a morph mention. For instance, the "平西王 (Conquer West King)" in $d_1$ and $d_3$ in Figure 1 are morph mentions since they refer to the modern politician "薄熙来 (Bo Xilai)", while the one in $d_4$ is not a morph mention since it refers to the original entity, the king "吴三桂 (Wu Sangui)".
For each morph mention, we discover a list of target candidates $E = \{e_1, e_2, \ldots, e_{|E|}\}$ from Chinese web data for morph mention resolution. We design an end-to-end morph decoder which consists of the following procedure:
• Morph Mention Extraction
  - Potential Morph Discovery: This first step aims to obtain a set of potential entity-level morphs $M = \{m_1, m_2, \ldots\}$ ($M \subseteq T$). Then, we only verify and resolve the mentions of these potential morphs, instead of all the terms in $T$ in a large corpus.
  - Morph Mention Verification: In this step, we aim to verify whether each mention $m_i^p$ of the potential morph $m_i$ ($m_i \in M$) from a specific context $d_j$ is a morph mention or not.
• Morph Mention Resolution: The final step is to resolve each morph mention $m_i^p$ to its target entity (e.g., "薄熙来 (Bo Xilai)" for the morph mention "平西王 (Conquer West King)" in $d_1$ in Figure 1).

Why Traditional Entity Mention Extraction Doesn't Work
In order to automatically extract morph mentions from any given documents, our first instinct was to formulate the task as a sequence labeling problem, just like labeling regular entity mentions. We adopted the commonly used conditional random fields (CRFs) (Lafferty et al., 2001) and obtained an F-score of only 6%. Many morphs are not presented as regular entity mentions. For example, the morph "天线 (Antenna)" refers to "温家宝 (Wen Jiabao)" because it shares one character "宝 (baby)" with the famous children's television series "天线宝宝 (Teletubbies)". Even when they are presented as regular entity mentions, they must refer to new target entities which are different from the regular ones. So we propose the following novel two-step solution.

Potential Morph Discovery
We first introduce the first step of our approach, potential morph discovery, which aims to narrow down the scope of morph candidates without losing recall. This step takes advantage of the common characteristics shared among morphs and identifies the potential morphs using a supervised method, since it is relatively easy to collect a certain number of corpus-level morphs as training data compared to labeling morph mentions. Formulating this task as a binary classification problem, we adopt Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) as the learning model. We propose the following four categories of features.
Basic: (i) character unigram, bigram, trigram, and surface form; (ii) part-of-speech tags; (iii) the number of characters; (iv) whether some characters are identical. These basic features help identify several common characteristics of morph candidates (e.g., they are very likely to be nouns, and very unlikely to contain single characters).
Dictionary: Many morphs are non-regular names derived from proper names while retaining some characteristics. For example, the morphs "薄督 (Governor Bo)" and "吃省 (Gourmand Province)" are derived from their target entity names "薄熙来 (Bo Xilai)" and "广东省 (Guangdong Province)", respectively. Therefore, we adopt a dictionary of proper names (Li et al., 2012) and propose the following features: (i) whether a term occurs in the dictionary; (ii) whether a term starts with a commonly used last name and includes uncommonly used characters as its first name; (iii) whether a term ends with a geopolitical entity or organization suffix word but is not in the dictionary.
Phonetic: Many morphs are created based on phonetic (Chinese pinyin in our case) modifications. For instance, the morph "饭饼饼 (Rice Cake)" has the same phonetic transcription as its target entity name "范冰冰 (Fan Bingbing)". To extract phonetic-based features, we compile a dictionary of (phonetic transcription, term) pairs from the Chinese Gigaword corpus. Then for each term, we check whether it has the same phonetic transcription as any entry in the dictionary while consisting of different characters.
Language Modeling: Many morphs rarely appear in a general news corpus (e.g., "六步郎 (Six Step Man)" refers to the NBA basketball player "勒布朗·詹姆斯 (Lebron James)"). Therefore, we use character-based language models trained on Gigaword to calculate the occurrence probability of each term, and use the n-gram probabilities ($n \in [1 : 5]$) as features.
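The basic, phonetic, and language-modeling feature categories above can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration only: `TOY_PINYIN` is a hypothetical four-character pronunciation table standing in for the Gigaword-derived dictionary, `CharNgramLM` is a simple add-one smoothed character model rather than the language model actually used, and the part-of-speech and dictionary features are omitted.

```python
import math
from collections import Counter

# Hypothetical toy pinyin table; a real system needs a full pronunciation lexicon.
TOY_PINYIN = {"饭": "fan", "饼": "bing", "范": "fan", "冰": "bing"}


def char_ngrams(term, n):
    """Return all character n-grams of a term."""
    return [term[i:i + n] for i in range(len(term) - n + 1)]


def basic_features(term):
    """Surface form, length, repeated-character flag, and character 1- to 3-grams."""
    feats = {
        "surface=" + term: 1.0,
        "num_chars": float(len(term)),
        "has_repeated_char": float(len(set(term)) < len(term)),
    }
    for n in (1, 2, 3):
        for gram in char_ngrams(term, n):
            feats[f"char{n}gram={gram}"] = 1.0
    return feats


def to_pinyin(term, table=TOY_PINYIN):
    """Map a term to a tuple of syllables, or None if a character is unknown."""
    syllables = [table.get(ch) for ch in term]
    return None if None in syllables else tuple(syllables)


def is_homophone_of_known_term(term, known_terms):
    """True if some known term sounds the same but is written differently."""
    key = to_pinyin(term)
    if key is None:
        return False
    return any(other != term and to_pinyin(other) == key for other in known_terms)


class CharNgramLM:
    """Add-one smoothed character n-gram model (a simplification for illustration)."""

    def __init__(self, corpus, n):
        self.n = n
        self.ngrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for sentence in corpus:
            padded = "^" * (n - 1) + sentence  # "^" marks the sequence start
            self.vocab.update(sentence)
            for i in range(len(sentence)):
                ctx = padded[i:i + n - 1]
                self.ngrams[ctx + padded[i + n - 1]] += 1
                self.contexts[ctx] += 1

    def log_prob(self, term):
        """Log-probability of a term, treated here as if it started a sequence."""
        padded = "^" * (self.n - 1) + term
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(len(term)):
            ctx = padded[i:i + self.n - 1]
            gram = ctx + padded[i + self.n - 1]
            total += math.log((self.ngrams[gram] + 1) / (self.contexts[ctx] + v))
        return total
```

In this sketch a classifier would receive the union of `basic_features(term)`, the homophone flag, and one `log_prob` value per n-gram order. A term that is rare in news text but homophonous with a known name, such as "饭饼饼", scores low under the language model and fires the phonetic feature.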

Morph Mention Verification
The second step is to verify whether a mention of the discovered potential morphs is indeed used as a morph in a specific context. Based on the observation that closely related morph mentions often occur together, we propose a semi-supervised graph-based method that leverages a small set of labeled seeds, coreference and correlation relations, and a large amount of unlabeled data to perform collective inference and thus save annotation cost. According to our observation of morph mentions, we propose the following two hypotheses:
Hypothesis 1: If two mentions are coreferential, then they should either both be morph mentions or both be non-morph mentions. For instance, the morph mentions "平西王 (Conquer West King)" in $d_1$ and $d_3$ in Figure 1 are coreferential: they both refer to the modern politician "薄熙来 (Bo Xilai)".
Hypothesis 2: Highly correlated mentions tend to either all be morph mentions or all be non-morph mentions. In our annotated dataset, 49% of morph mentions co-occur at the tweet level. Based on these hypotheses, we aim to design an effective approach to compensate for the limited annotated data. Graph-based semi-supervised learning approaches (Zhu et al., 2003; Smola and Kondor, 2003; Zhou et al., 2004) have been successfully applied to many NLP tasks (Niu et al., 2005; Chen et al., 2006). Therefore we build a mention graph to capture the semantic relatedness (weighted arcs) between potential morph mentions (nodes) and propose a semi-supervised graph-based algorithm to collectively verify a set of relevant mentions using a small amount of labeled data. We now describe the algorithm in detail.

Mention Graph Construction
First, we construct a mention graph that can reflect the association between all the mentions of potential morphs. According to the above two hypotheses, mention coreference and correlation relations are the basis to build our mention graph, which is represented by a matrix.
In Chinese Weibo, there exist rich and clean social relations including authorship, replying, retweeting, and user mentioning relations. We make use of these social relations to judge the possibility of two mentions of the same potential morph being coreferential. If there exists a social relation between two mentions $m_i^p$ and $m_i^q$ of the morph $m_i$, they are usually coreferential and are assigned an association score of 1. We also detect coreferential relations by performing content similarity analysis, using cosine similarity over the tf-idf representations of the contexts of two mentions. Then we get a coreference matrix $W^1$:

$$W^1_{m_i^p, m_i^q} = \begin{cases} 1.0 & \text{if } m_i^p \text{ and } m_i^q \text{ are linked by a social relation} \\ \cos(m_i^p, m_i^q) & \text{else if } q \in kNN(p) \\ 0 & \text{otherwise} \end{cases}$$

where $m_i^p$ and $m_i^q$ are two mentions from the same potential morph $m_i$, and $kNN$ means that each mention is connected to its $k$ nearest neighboring mentions.
Users tend to use morph mentions together to achieve their communication goals. To incorporate such evidence, we measure the correlation between two mentions $m_i^p$ and $m_j^q$ of two different potential morphs $m_i$ and $m_j$ as $corr(m_i^p, m_j^q) = 1.0$ if there exists a social relation between them, and $corr(m_i^p, m_j^q) = 0$ otherwise. Then we can obtain the correlation matrix $W^2$, with entries $W^2_{m_i^p, m_j^q} = corr(m_i^p, m_j^q)$.
To tune the balance between the coreferential relation and the correlation relation during learning, we first get two matrices $\hat{W}^1$ and $\hat{W}^2$ by row-normalizing $W^1$ and $W^2$, respectively. Then we obtain the final mention matrix $W$ as a linear combination of $\hat{W}^1$ and $\hat{W}^2$: $W = \alpha \hat{W}^1 + (1 - \alpha) \hat{W}^2$, where $\alpha$ is a coefficient between 0 and 1.
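The graph construction described above can be sketched in a few lines of plain Python. This is an illustrative reading of the text and not the authors' implementation: the input format (a list of `(morph, context_tokens)` pairs plus a set of index pairs that share a social relation), the smoothed idf formula, and the function names are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(contexts):
    """Sparse tf-idf vectors for tokenized contexts (idf = log(N/df) + 1, an assumed variant)."""
    n = len(contexts)
    df = Counter(tok for ctx in contexts for tok in set(ctx))
    return [{t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(ctx).items()}
            for ctx in contexts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def row_normalize(matrix):
    """Divide each row by its sum; all-zero rows are left unchanged."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([x / total for x in row] if total else list(row))
    return out


def build_mention_graph(mentions, related, k=2, alpha=0.5):
    """mentions: list of (morph, context_tokens); related: set of frozenset({p, q})."""
    n = len(mentions)
    vecs = tfidf_vectors([ctx for _, ctx in mentions])
    w1 = [[0.0] * n for _ in range(n)]  # coreference: mentions of the same morph
    w2 = [[0.0] * n for _ in range(n)]  # correlation: mentions of different morphs
    for p in range(n):
        same = [q for q in range(n) if q != p and mentions[q][0] == mentions[p][0]]
        sims = {q: cosine(vecs[p], vecs[q]) for q in same}
        knn = set(sorted(same, key=lambda q: -sims[q])[:k])
        for q in range(n):
            if q == p:
                continue
            linked = frozenset((p, q)) in related
            if mentions[q][0] == mentions[p][0]:
                w1[p][q] = 1.0 if linked else (sims[q] if q in knn else 0.0)
            else:
                w2[p][q] = 1.0 if linked else 0.0
    w1n, w2n = row_normalize(w1), row_normalize(w2)
    return [[alpha * a + (1 - alpha) * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(w1n, w2n)]
```

Row normalization is applied before the linear combination, so $\alpha$ controls the relative share of coreference and correlation evidence for each mention regardless of how many neighbors it has.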

Graph-based Semi-supervised Learning
Intuitively, if two mentions are strongly connected, they tend to hold the same label. A label of 1 indicates that a mention is a morph mention, and 0 indicates a non-morph mention. We use $Y = [Y_l \; Y_u]^T$ to denote the label vector of all mentions, where the first $l$ nodes are verified mentions labeled as 1 or 0, and the remaining $u$ nodes need to be verified and are initialized with the label 0.5. Our final goal is to obtain the final label vector $Y_u$ by incorporating evidence from the initial labels and the mention graph. Following the graph-based semi-supervised learning algorithm of Zhu et al. (2003), the mention verification problem is formulated as optimizing the objective function

$$Q(Y) = \mu \sum_{i=1}^{l} (y_i - y_i^0)^2 + \frac{1}{2} \sum_{i,j} W_{ij} (y_i - y_j)^2$$

where $y_i^0$ denotes the initial label, and $\mu$ is a regularization parameter that controls the trade-off between the initial labels and the consistency of labels on the mention graph. Zhu et al. (2003) have shown that this formulation has both closed-form and iterative solutions.
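One way to minimize this objective iteratively is a coordinate-wise (Gauss-Seidel) update: setting the partial derivative of $Q$ with respect to $y_i$ to zero gives $y_i = (2\mu \, [i \le l] \, y_i^0 + \sum_j s_{ij} y_j) / (2\mu \, [i \le l] + \sum_j s_{ij})$ with $s_{ij} = W_{ij} + W_{ji}$. The sketch below implements that update in plain Python. It is an assumed solver chosen for brevity, not necessarily the procedure used in the paper, and the function name and arguments are hypothetical.

```python
def propagate_labels(W, initial, labeled, mu=1.0, iters=200):
    """Minimize Q(Y) by repeatedly solving dQ/dy_i = 0 for one node at a time.

    W:        square weight matrix (list of lists), not necessarily symmetric
    initial:  initial labels y^0 (1 or 0 for seeds, 0.5 for unlabeled nodes)
    labeled:  set of indices of the seed nodes
    """
    n = len(W)
    y = list(initial)
    for _ in range(iters):
        for i in range(n):
            num, den = 0.0, 0.0
            for j in range(n):
                if j == i:
                    continue
                s = W[i][j] + W[j][i]  # both directions contribute to dQ/dy_i
                num += s * y[j]
                den += s
            if i in labeled:
                # The fitting term only applies to seed nodes.
                num += 2.0 * mu * initial[i]
                den += 2.0 * mu
            if den > 0.0:
                y[i] = num / den
    return y
```

For a four-node chain with unit weights, seeds at the two ends labeled 1 and 0, and $\mu = 1$, the update converges to approximately [0.8, 0.6, 0.4, 0.2]: seed labels are softened by their neighbors, and unlabeled nodes interpolate between them. Thresholding the final scores (for example at 0.5) would give the morph or non-morph decision; the threshold is likewise an assumption.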

Morph Mention Resolution
The final step is to resolve the extracted morph mentions to their target entities.

Candidate Target Identification
We start by identifying a list of target candidates for each morph mention from comparable corpora including Sina Weibo, Chinese News and English Twitter. After preprocessing the corpora using word segmentation, noun phrase chunking and name tagging, the named entity list is still too large and too noisy for candidate ranking. To clean the named entity list, we adopt the Temporal Distribution Assumption proposed in our recent work (Huang et al., 2013). It assumes that a morph $m$ and its real target $e$ should have similar temporal distributions in terms of their occurrences. Following the same heuristic, we assume that an entity is a valid candidate for a morph if and only if the candidate appears fewer than seven days after the morph's appearance.
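A minimal sketch of this temporal filter is given below. The text leaves open how "appearance" is measured, so the code makes one assumption explicit: an entity is kept if any of its occurrence dates falls on or after some morph occurrence date and fewer than seven days later. The function name and input format are hypothetical.

```python
from datetime import date, timedelta


def filter_candidates(morph_dates, entity_dates, window_days=7):
    """Keep entities with an occurrence fewer than `window_days` after a morph occurrence.

    morph_dates:  list of dates on which the morph was observed
    entity_dates: dict mapping an entity name to its list of occurrence dates
    """
    window = timedelta(days=window_days)
    kept = []
    for entity, dates in entity_dates.items():
        # Assumed reading: the entity must occur at or after a morph occurrence,
        # and strictly within the window.
        if any(timedelta(0) <= d_e - d_m < window
               for d_e in dates for d_m in morph_dates):
            kept.append(entity)
    return kept
```

Under a different reading of the heuristic (for example, a symmetric window around the morph's appearance), only the comparison inside `any(...)` would change.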

Motivations of Using Deep Learning
Compared to regular entity linking tasks (Ji et al., 2010; Ji et al., 2011), the major challenge of ranking a morph's candidate target entities is that surface features such as the orthographic similarity between morph and target candidates have been proven inadequate (Huang et al., 2013). Therefore, it is crucial to capture the semantics of both mentions and target candidates. For instance, in order to correctly resolve "平西王 (Conquer West King)" from $d_1$ and $d_3$ in Figure 1 to the modern politician "薄熙来 (Bo Xilai)" instead of the ancient king "吴三桂 (Wu Sangui)", it is important to model the surrounding contextual information effectively to capture important information (e.g., "重庆 (Chongqing)", "倒台 (fall from power)", and "唱红歌 (sing red songs)") to represent the mentions and target entity candidates. Inspired by the recent success achieved by deep learning based techniques on learning semantic representations for various NLP tasks (e.g., (Bengio et al., 2003; Collobert et al., 2011; Mikolov et al., 2013b; He et al., 2013)), we design and compare the following two approaches, which employ hierarchical architectures with multiple hidden layers to extract useful features and map morphs and target entities into a latent semantic space.

Pairwise Cross-genre Supervised Learning
Ideally, we hope to obtain a large amount of coreferential entity mention pairs for training. A natural knowledge resource is Wikipedia, which includes anchor links. We compose an anchor's surface string and the title of the entity it is linked to as a positive training pair. Then we randomly sample negative training instances from those pairs that do not share any links. Our approach consists of the following steps: (1) generating a high-quality embedding for each training instance; (2) pre-training with the stacked denoising auto-encoder (Bengio et al., 2003) for feature dimension reduction; and (3) supervised fine-tuning to optimize the neural networks towards a similarity measure (e.g., dot product). Figure 2 depicts the overall architecture of this approach.
However, morph resolution is significantly different from the traditional entity linking task since the latter mainly focuses on formal and explicit entities (e.g., "薄熙来 (Bo Xilai)") which tend to have stable referents in Wikipedia. In contrast, morphs tend to be informal, implicit and have newly emergent meanings which evolve over time. In fact, these morph mentions rarely appear in Wikipedia. For example, almost all "平西王 (Conquer West King)" mentions in Wikipedia refer to the ancient king instead of the modern politician "薄熙来 (Bo Xilai)". In addition, the contextual words used in Wikipedia to describe entities are quite different from those in social media. For example, to describe a death event, Wikipedia usually uses the formal expression "去世 (pass away)" while the informal expression "挂了 (hang up)" is used more often in tweets. Therefore this approach suffers from the knowledge discrepancy between these two genres.
To address the above challenge, we propose the second approach, which learns semantic embeddings of both morph mentions and entities directly from tweets. We also prefer unsupervised learning methods due to the lack of training data.
Following Mikolov et al. (2013a), we develop a continuous bag-of-words (CBOW) model that can effectively model the surrounding contextual information. CBOW is discriminatively trained by maximizing the conditional probability of a term $w_i$ given its contexts $c(w_i) = \{w_{i-n}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+n}\}$, where $n$ is the contextual window size, and $w_i$ is a term obtained using the preprocessing step introduced in Section 2. The architecture of CBOW is depicted in Figure 3. We obtain a vector $X_{w_i}$ through the projection layer by summing up the embedding vectors of all terms in $c(w_i)$, and then use the sigmoid activation function to obtain the final embedding of $w_i$ in $c(w_i)$ in the output layer.
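Formally, the objective function of CBOW can be formulated as

$$L(\theta) = \sum_{w_i \in W} \sum_{w_j \in W} \log p(w_j \mid c(w_i))$$

where $W$ is the set of unique terms obtained from the whole training corpus, and $p(w_j \mid c(w_i))$ is the conditional likelihood of $w_j$ given the context $c(w_i)$, which is formulated as follows:

$$p(w_j \mid c(w_i)) = \left[\sigma(X_{w_i}^T \theta^{w_j})\right]^{L^{w_i}(w_j)} \cdot \left[1 - \sigma(X_{w_i}^T \theta^{w_j})\right]^{1 - L^{w_i}(w_j)}$$

where $L^{w_i}(w_j) = 1$ if $w_i = w_j$ and $0$ otherwise, $\sigma$ is the sigmoid activation function, and $\theta^{w_i}$ is the embedding of $w_i$ to be learned with back-propagation during training.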
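To make the roles of the projection vector, the sigmoid, and the indicator label concrete, the following plain-Python sketch computes the log-likelihood of one (target, context) example and performs one gradient-ascent step. It assumes the per-term Bernoulli likelihood suggested by the indicator and sigmoid in the text, scores every vocabulary term instead of sampling negatives, and uses hypothetical names (`emb` for input embeddings, `theta` for output parameters); it is a toy illustration, not the training code behind the reported results.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(context, emb):
    """Projection layer: sum the embedding vectors of all context terms."""
    dim = len(next(iter(emb.values())))
    x = [0.0] * dim
    for w in context:
        for d, value in enumerate(emb[w]):
            x[d] += value
    return x


def log_likelihood(target, context, candidates, emb, theta):
    """Sum of log p(w_j | c(w_i)) over candidates, with label 1 only for the target."""
    x = project(context, emb)
    total = 0.0
    for w in candidates:
        s = sigmoid(dot(x, theta[w]))
        total += math.log(s) if w == target else math.log(1.0 - s)
    return total


def train_step(target, context, candidates, emb, theta, lr=0.1):
    """One gradient-ascent step on the log-likelihood of a single example."""
    x = project(context, emb)
    grad_x = [0.0] * len(x)
    for w in candidates:
        g = (1.0 if w == target else 0.0) - sigmoid(dot(x, theta[w]))
        for d in range(len(x)):
            grad_x[d] += g * theta[w][d]  # accumulate before theta[w][d] changes
            theta[w][d] += lr * g * x[d]
    # Because X is a plain sum, every context embedding gets the same gradient.
    for w in context:
        for d in range(len(x)):
            emb[w][d] += lr * grad_x[d]
```

In this sketch, terms that share contexts (for example a morph and its target entity both co-occurring with "重庆") are pushed toward similar parameters, which is the property the resolution step relies on.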

Data
We retrieved 1,553,347 tweets from Chinese Sina Weibo from May 1 to June 30, 2013 and 66,559 web documents from the embedded URLs in tweets for experiments. We then randomly sampled 4,688 non-redundant tweets and asked two Chinese native speakers to manually annotate morph mentions in these tweets. The annotated dataset is randomly split into training, development, and testing sets, with detailed statistics shown in Table 1. We used 225 positive instances and 225 negative instances to train the model in the first step of potential morph discovery. We collected a Chinese Wikipedia dump of October 9th, 2014, which contains 2,539,355 pages. We pulled out person, organization and geopolitical pages based on entity type matching with DBpedia. We also filtered out the pages with fewer than 300 words. For training the model, we use 60,000 mention-target pairs along with one randomly generated negative sample for each pair, of which 20% are reserved for parameter tuning.

Overall: End-to-End Decoding
In this subsection, we first study the end-to-end decoding performance of our best system, and compare it with the state-of-the-art supervised learning-to-rank approach proposed by Huang et al. (2013), which is based on information network construction and traversal with meta-paths. We feed the 225 extracted morphs as input to the system of Huang et al. (2013). The experiment setting, implementation and evaluation process are similar to those of Huang et al. (2013).
The overall performance of our approach using within-genre learning for resolution is shown in Table 2.
We can see that our system achieves significantly better performance (95.0% confidence level by the Wilcoxon Matched-Pairs Signed-Ranks Test) than the approach proposed by Huang et al. (2013). We found that the approach of Huang et al. (2013) failed to resolve many unpopular morphs (e.g., "小马 (Little Ma)" is a morph referring to Ma Yingjiu, and it only appeared once in the data), because it relies heavily on aggregating contextual and temporal information from multiple instances of each morph. In contrast, our unsupervised resolution approach only leverages the pre-trained word embeddings to capture the semantics of morph mentions and entities.
Now we evaluate the performance of morph mention verification. We compare our approach with two baseline methods: (i) Naive, which considers all mentions as morph mentions; (ii) SVMs, a fully supervised model using Support Vector Machines (Cortes and Vapnik, 1995) based on unigram and bigram features. Table 3 shows the results. We can see that our approach achieves significantly better performance than the baseline approaches. In particular, it can verify the mentions of newly emergent morphs. For instance, "棒棒棒 (Good Good Good)" is mistakenly identified by the first step as a potential morph, but the second step correctly filters it out.

Diagnosis: Morph Mention Resolution
The target candidate identification step successfully filters out 86% of irrelevant entities with high precision (98.5% of morphs retain their target entities). For candidate ranking, we compare several approaches as follows:
• BOW: We compute cosine similarity over bag-of-words vectors with tf-idf values to measure the context similarity between a mention and its candidates.
• Pair-wise Cross-genre Supervised Learning: We first construct a vocabulary by choosing the 100,000 most frequent terms. Then we randomly sample 48,000 instances for training and 12,000 instances for development. At the pre-training step, we set the number of hidden layers to 3, the size of each hidden layer to 1000, the masking noise probability for the first layer to 0.7, and use Gaussian noise with a standard deviation of 0.1 for higher layers. The learning rate is set to 0.01. At the fine-tuning stage, we add a 200-unit layer on top of the auto-encoders and optimize the neural network models based on the training data.
• Within-genre Unsupervised Learning: We directly train morph mention and entity embeddings from the large-scale tweets and web documents that we collected. We set the window size to 10 and the vector dimension to 800 based on the development set.
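As a concrete reference point for the BOW baseline in the list above, the following Python sketch ranks target candidates by the tf-idf cosine similarity between their contexts and the mention's context. The tokenized-context input format, the smoothed idf variant, and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Sparse tf-idf vectors for tokenized documents (idf = log(N/df) + 1, an assumed variant)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [{t: c * (math.log(n / df[t]) + 1.0) for t, c in Counter(doc).items()}
            for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def rank_candidates(mention_context, candidate_contexts):
    """Return (candidate, score) pairs sorted by decreasing context similarity.

    mention_context:    list of tokens around the morph mention
    candidate_contexts: dict mapping a candidate entity to its context tokens
    """
    names = list(candidate_contexts)
    vectors = tfidf([mention_context] + [candidate_contexts[name] for name in names])
    mention_vec, cand_vecs = vectors[0], vectors[1:]
    scored = [(name, cosine(mention_vec, vec)) for name, vec in zip(names, cand_vecs)]
    return sorted(scored, key=lambda pair: -pair[1])
```

The embedding-based approaches can be slotted into the same ranking loop by replacing the sparse tf-idf vectors with dense learned vectors. That substitution is an assumption about how the comparison could be organized, since the text does not spell out the scoring function used with the embeddings.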
The overall performance of various resolution approaches using perfect morph mentions is shown in Figure 4. We can clearly see that our second approach, within-genre learning, achieves the best performance. Figure 5 demonstrates the differences between our two deep learning based methods. When learning semantic embeddings directly from Wikipedia, we can see that the top 10 closest entities of the mention "平西王 (Conquer West King)" are all related to the ancient king "吴三桂 (Wu Sangui)". Therefore this method is only able to capture the original meanings of morphs. In contrast, when we learn embeddings directly from tweets, most of the closest entities are relevant to its target entity "薄熙来 (Bo Xilai)".

Related Work
The first morph decoding work (Huang et al., 2013) assumed morph mentions are already discovered and didn't take contexts into account. To the best of our knowledge, this is the first work on context-aware end-to-end morph decoding.
Morph decoding is related to several traditional NLP tasks: entity mention extraction (e.g., (Zitouni and Florian, 2008; Ohta et al., 2012; Li and Ji, 2014)), metaphor detection (e.g., (Wang et al., 2006; Tsvetkov, 2013; Heintz et al., 2013)), word sense disambiguation (WSD) (e.g., (Yarowsky, 1995; Mihalcea, 2007; Navigli, 2009)), and entity linking (EL) (e.g., (Mihalcea and Csomai, 2007; Ji et al., 2010; Ji et al., 2011)). However, none of these previous techniques can be applied directly to tackle this problem. As mentioned in Section 3.1, entity morphs are fundamentally different from regular entity mentions. Our task is also different from metaphor detection because morphs cover a much wider range of semantic categories and can include either abstract or concrete information. Some common features for detecting metaphors (e.g., (Tsvetkov, 2013)) are not effective for morph extraction: (1) Semantic categories: metaphors usually fall into certain semantic categories such as noun.animal and noun.cognition. (2) Degree of abstractness: if the subject or an object of a concrete verb is abstract, then the verb is likely to be a metaphor. In contrast, morphs can be very abstract (e.g., "函数 (Function)" refers to "杨幂 (Yang Mi)" because her first name "幂 (Mi)" means the Power Function) or very concrete (e.g., "薄督 (Governor Bo)" refers to "薄熙来 (Bo Xilai)"). In contrast to traditional WSD, where the senses of a word are usually quite stable, the "sense" (target entity) of a morph may be newly emergent or evolve rapidly over time. The same morph can also have multiple senses. The EL task focuses more on explicit and formal entities (e.g., named entities), while morphs tend to be informal and convey implicit information.
Morph mention detection is also related to malware detection (e.g., (Firdausi et al., 2010; Chandola et al., 2009; Christodorescu and Jha, 2003)), which discovers abnormal behavior in code and malicious software. In contrast, our task tackles anomalous text in semantic context.
Deep learning-based approaches have been demonstrated to be effective in disambiguation-related tasks such as WSD (Bordes et al., 2012), entity linking (He et al., 2013) and question linking (Yih et al., 2014; Bordes et al., 2014; Yang et al., 2014). In this paper we showed that it is crucial to keep the genres consistent between learning embeddings and applying embeddings.

Conclusions and Future Work
This paper describes the first work on context-aware end-to-end morph decoding. By conducting deep analysis to identify the common characteristics of morphs and the unique challenges of this task, we leverage a large amount of unlabeled data and the coreferential and correlation relations to perform collective inference to extract morph mentions. Then we explore deep learning-based techniques to capture the semantics of morph mentions and entities and resolve morph mentions on the fly. Our future work includes exploiting the profiles of target entities as feedback to refine the results of morph mention extraction. We will also extend the framework to event morph decoding.