Combining Word-Level and Character-Level Representations for Relation Classification of Informal Text

Word representation models have achieved great success in natural language processing tasks such as relation classification. However, they do not always work well on informal text, where the morphemes of misspelled words may carry important short-distance semantic information. We propose a hybrid model that combines the merits of word-level and character-level representations to learn better representations of informal text. Experiments on two relation classification datasets, SemEval-2010 Task8 and a large-scale one we compile from informal text, show that our model achieves a competitive result on the former and a state-of-the-art result on the latter.


Introduction
Deep learning has made significant progress in natural language processing, and most approaches treat word representations as the cornerstone. Though effective, word-level representation is inherently problematic: it assumes that each word type has its own vector that can vary independently, yet most words occur only once in the training data, and out-of-vocabulary (OOV) words cannot be handled. A word typically includes a root and one or more affixes (rock-s, red-ness, quick-ly, run-ning, un-expect-ed), or more than one root in a compound (black-board, rat-race). It is reasonable to assume that words sharing common components (root, prefix, suffix) are potentially related, whereas word-level representation considers each word separately. Moreover, new words enter English from every area of life, e.g. chillaxing, a blend of chilling and relaxing that means taking a break from stressful activities to rest or relax. Because the vocabulary of a word-level model is fixed beforehand, the absence of representations for such words may lose important semantic information.
These problems of word-level representation are amplified on informal text and become hard to ignore. Recently, character-level representation, which takes characters as atomic units to derive embeddings, has been shown to memorize arbitrary aspects of word orthography. Such models are simple and have fewer parameters, but they are not ideal for processing long sentences. Combining word-level and character-level representations is an attempt to overcome the weaknesses of each.
We use a Bidirectional Gated Recurrent Unit (Bi-GRU) (Chung et al., 2014) and a Convolutional Neural Network (CNN) to capture the word-level and character-level semantic representations, respectively. Because character-level information is likely to be drowned out by word-level information if the two are simply concatenated, we adopt Highway Networks (Srivastava et al., 2015) to balance them. We first evaluate our model on a public benchmark, SemEval-2010 Task8. This dataset is small and restricted in its relation types and in its syntactic and lexical variation, and it remains unknown whether learning on such a specific range of relations transfers well to informal text. We therefore introduce a large-scale dataset, called KBP-SF48, based on the corpus and queries of the TAC-KBP Slot Filling Track (Surdeanu and Ji, 2014) from 2009 to 2014, which contains 48k relation sentences.
The TAC-KBP corpus comes from newswire, Web, post and discussion forum documents that actually contain informal content, including language mismatch and spelling errors. We extract sentences from the slots and fillers of the Slot Filling Evaluation and add position indicators to keep the same format as SemEval-2010 Task8. For instance, the following sentence, with two nominals surrounded by position indicators, belongs to the org:founded_by relation: Bharara's office brought insider trading charges against <e1>Raj Rajaratnam <e1/>, the co-founder of hedge fund <e2>Galleon Group<e2/>.
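As an illustration of the position-indicator format, the following minimal sketch recovers the two nominals from a marked sentence. The function name `extract_nominals` and the regular expression are assumptions made for this example, not part of the dataset tooling; the pattern accepts both the `<e1/>` closing form shown above and the `</e1>` form.

```python
import re

# Hypothetical pattern: an opening <e1>/<e2> marker, the nominal text, then a
# closing marker written either as <e1/> (as in the example sentence) or </e1>.
ENTITY_PATTERN = re.compile(r"<e([12])>(.*?)<(?:/e\1|e\1/)>")


def extract_nominals(sentence):
    """Return ({'e1': text, 'e2': text}, sentence with the markers removed)."""
    nominals = {
        f"e{idx}": text.strip() for idx, text in ENTITY_PATTERN.findall(sentence)
    }
    plain = ENTITY_PATTERN.sub(lambda match: match.group(2).strip(), sentence)
    return nominals, plain


example = (
    "Bharara's office brought insider trading charges against "
    "<e1>Raj Rajaratnam <e1/>, the co-founder of hedge fund "
    "<e2>Galleon Group<e2/>."
)
nominals, plain = extract_nominals(example)
print(nominals)
print(plain)
```

Keeping the markers in the input (rather than stripping them as done here for display) is what lets a character-level model see the entity boundaries directly.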

Related Work
Some works (Mikolov et al., 2013; Pennington et al., 2014) learn semantic representations of words with unsupervised approaches. Recently, work on relation classification has focused on neural networks. Zeng et al. (2014) used a CNN to learn relation patterns from raw text and made representative progress, but a potential problem is that CNNs are not well suited to learning long-distance semantic information. Santos et al. (2015) proposed a similar model named CR-CNN and replaced the cost function with a ranking-based function. Some models (Xu et al., 2015; Cai et al., 2016) leveraged the shortest dependency path (SDP) between two nominals. Others employed attention mechanisms to capture the more important semantic information.
Working on a new dataset, KBP37, Zhang and Wang (2015) proposed a framework based on a bidirectional Recurrent Neural Network (RNN). However, all these methods depend on learning word-level distributed representations without using morphological features.
Recent work captures word orthography using character-based neural networks. dos Santos and Zadrozny (2014) proposed a deep neural network that learns character-level representations of words for POS tagging. Character-level CNNs have also been shown to be effective in text classification. Kim et al. (2015) employed a CNN and a highway network to learn rich semantic and orthographic features from character encodings. Other models (Ling et al., 2015; Dhingra et al., 2016) are based on RNN structures, which can memorize arbitrary aspects of word orthography over characters.
Our model uses multi-channel GRU units and a CNN architecture to learn word-level and character-level representations, and projects them to a softmax output layer for relation classification.

Model
As shown in Figure 1, the model learns word-level and character-level representations separately, and combines them with interaction to obtain the final representation.

Word-level
Given a relation sentence consisting of words $w_1, w_2, \ldots, w_m$, each $w_i$ is defined as a one-hot vector $1_{w_i}$, with value 1 at index $w_i$ and 0 in all other dimensions. We multiply a matrix $P_W \in \mathbb{R}^{d_w \times |V|}$ by $1_{w_i}$ to project the word $w_i$ into its word embedding $x_i$, as with a lookup table:

$$x_i = P_W 1_{w_i} \qquad (1)$$

where $d_w$ is the size of the word embedding and $V$ is the vocabulary of the training set.
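The equivalence between the one-hot product and a table lookup can be checked with a small sketch. The toy vocabulary, the embedding size, and the function names below are assumptions chosen only for illustration.

```python
import random

rng = random.Random(0)
vocab = ["<unk>", "the", "co-founder", "of", "hedge", "fund"]  # toy vocabulary V
d_w = 4                                                        # toy embedding size

# P_W has shape d_w x |V|: one column per word in the vocabulary.
P_W = [[rng.uniform(-0.1, 0.1) for _ in vocab] for _ in range(d_w)]


def one_hot(index, size):
    """Vector with 1 at `index` and 0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed_by_product(word):
    """Embedding as an explicit matrix-vector product with the one-hot vector."""
    indicator = one_hot(vocab.index(word), len(vocab))
    return [sum(p * b for p, b in zip(row, indicator)) for row in P_W]


def embed_by_lookup(word):
    """Equivalent lookup: read the column of P_W that belongs to the word."""
    column = vocab.index(word)
    return [row[column] for row in P_W]


x_product = embed_by_product("hedge")
x_lookup = embed_by_lookup("hedge")
print(x_product)
```

In practice only the lookup form is used, since the product touches every column of the matrix for a single non-zero entry.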
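We then feed the sequence $x_1, x_2, \ldots, x_m$ to a Bi-GRU network step by step. Each GRU unit applies the following transformations, written here in the standard form of Chung et al. (2014):

$$\begin{aligned} r_t &= \sigma(W_r x_t + U_r h_{t-1}) \\ z_t &= \sigma(W_z x_t + U_z h_{t-1}) \\ \tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) \\ h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \end{aligned} \qquad (2)$$

where $z_t$ is a set of update gates, $r_t$ is a set of reset gates, and $\odot$ is element-wise multiplication. $W_r$, $W_z$, $W_h$ and $U_r$, $U_z$, $U_h$ are weight matrices to be learned, and $\tilde{h}_t$ is the candidate activation. We use an element-wise sum to combine the final states of the forward and backward passes as the word-level representation $h^w_m$.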
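The recurrence described above (reset gates, update gates, candidate activation, and an element-wise sum of the forward and backward final states) can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes the gate convention of Chung et al. (2014), omits bias terms, and uses hypothetical helper names and toy dimensions.

```python
import math
import random


def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def init_gru(d_in, d_h, rng):
    """Random W_* (input) and U_* (recurrent) weight matrices for one GRU."""
    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
    return {name: rand(d_h, d_in if name[0] == "W" else d_h)
            for name in ("W_r", "W_z", "W_h", "U_r", "U_z", "U_h")}


def gru_step(p, x_t, h_prev):
    """One GRU transformation (assumed convention: z_t weights the candidate)."""
    r = sigmoid([a + b for a, b in zip(matvec(p["W_r"], x_t), matvec(p["U_r"], h_prev))])
    z = sigmoid([a + b for a, b in zip(matvec(p["W_z"], x_t), matvec(p["U_z"], h_prev))])
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    candidate = [math.tanh(a + b)
                 for a, b in zip(matvec(p["W_h"], x_t), matvec(p["U_h"], gated))]
    return [(1.0 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, candidate)]


def run_gru(p, inputs, d_h):
    """Run the GRU over the whole sequence and return the final state."""
    h = [0.0] * d_h
    for x_t in inputs:
        h = gru_step(p, x_t, h)
    return h


def bi_gru_sum(p_forward, p_backward, inputs, d_h):
    """Element-wise sum of the forward and backward final states."""
    h_forward = run_gru(p_forward, inputs, d_h)
    h_backward = run_gru(p_backward, list(reversed(inputs)), d_h)
    return [a + b for a, b in zip(h_forward, h_backward)]


rng = random.Random(0)
d_w, d_h, m = 4, 3, 5  # toy sizes
xs = [[rng.uniform(-1, 1) for _ in range(d_w)] for _ in range(m)]
h_w = bi_gru_sum(init_gru(d_w, d_h, rng), init_gru(d_w, d_h, rng), xs, d_h)
print(h_w)
```

Summing the two final states, rather than concatenating them, keeps the word-level representation at the hidden size of a single GRU.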

Character-level
To capture morphological features, we use convolutions to learn local n-gram features at the lower network layer. As character-level input, the original sentence is decomposed into a sequence of characters, including special characters such as white space. We first project each character into a character embedding $x_i$ by a lookup table that works exactly as in Eq. 1.
Given the embedding sequence $x_1, x_2, \ldots, x_n$, we compose the matrix $D^k \in \mathbb{R}^{k d_c \times n}$ and execute convolutions with same padding by applying the convolution weights $W^k_{con}$ to it. Here $d_c$ is the size of the character embedding, each column $i$ of $D^k$ consists of the concatenation of $k$ embedding vectors (i.e. the $k$ embeddings centered at the $i$-th character), $W^k_{con}$ is the weight matrix of the convolution layer, and $C^k \in \mathbb{R}^{c \times n}$ is the output of the convolution with $c$ filters. We use $p$ groups of filters with varying widths to obtain n-gram features, and concatenate their outputs column by column into a matrix $C$. Next, the column vectors $c_1, \ldots, c_n$ of $C$ are fed as the input sequence to a forward GRU network (Eq. 2), and we take the final state activation $h^c_n$ as the character-level representation.

Figure 1: Hybrid model combining word-level and character-level representation.
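The construction of the window matrix and the multi-width convolution can be sketched as follows. This is an illustrative sketch under stated assumptions: zero vectors are used for the same padding, no bias or nonlinearity is applied because the text does not specify one, for even widths the window is only approximately centered, and all names and sizes are hypothetical.

```python
import random


def build_window_matrix(embeddings, k):
    """Columns of D^k: the k character embeddings around position i, concatenated.

    Zero vectors pad both ends so that the result keeps n columns. The matrix
    is returned as a list of n columns, each of length k * d_c.
    """
    d_c = len(embeddings[0])
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [[0.0] * d_c] * left + embeddings + [[0.0] * d_c] * right
    return [[v for emb in padded[i:i + k] for v in emb]
            for i in range(len(embeddings))]


def convolve(w_con, columns):
    """Apply c filters (rows of w_con) to every column; returns n columns of size c."""
    return [[sum(w * x for w, x in zip(row, col)) for row in w_con]
            for col in columns]


def char_features(embeddings, filter_banks):
    """Concatenate the outputs of all filter widths position by position."""
    per_width = [convolve(w_con, build_window_matrix(embeddings, k))
                 for k, w_con in filter_banks]
    return [[v for output in per_width for v in output[i]]
            for i in range(len(embeddings))]


rng = random.Random(0)
d_c, c = 3, 2          # toy character-embedding size and filters per width
widths = (2, 3, 4)     # p = 3 groups of filters
chars = "chillaxing"
emb = [[rng.uniform(-1, 1) for _ in range(d_c)] for _ in chars]
banks = [(k, [[rng.uniform(-0.5, 0.5) for _ in range(k * d_c)] for _ in range(c)])
         for k in widths]
C = char_features(emb, banks)
print(len(C), len(C[0]))
```

Each of the n columns of the result has p times c entries, and these columns are what the forward GRU would consume as its input sequence.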

Combination
Instead of a fully connected layer, we use Highway Networks to emphasize the impact of the character level. A highway layer can adaptively copy or transform representations, even when large depths are not required. We apply this idea to retain some independence between the word and character representations when merging them with interaction. Let $h^*$ be the concatenation of $h^w_m$ and $h^c_n$. The combination $z$ is obtained by the Highway Network:

$$z = t \odot g(W_H h^* + b_H) + (1 - t) \odot h^*, \qquad t = \sigma(W_T h^* + b_T)$$

where $g$ is a nonlinear function (tanh), $t$ is referred to as the transform gate, and $(1 - t)$ as the carry gate. $W_T$ and $W_H$ are square weight matrices, and $b_T$ and $b_H$ are bias vectors.
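The behaviour of the transform and carry gates can be seen in a small sketch. It assumes the standard highway formulation of Srivastava et al. (2015) with a sigmoid transform gate and tanh as the nonlinearity; the vectors, weights, and function names are toy values introduced only for illustration.

```python
import math
import random


def affine(weights, bias, vector):
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def highway(h_star, w_t, b_t, w_h, b_h):
    """Mix a transformed copy of h_star with h_star itself, gate by gate."""
    t = [1.0 / (1.0 + math.exp(-a)) for a in affine(w_t, b_t, h_star)]  # transform gate
    transformed = [math.tanh(a) for a in affine(w_h, b_h, h_star)]      # g(.)
    return [ti * gi + (1.0 - ti) * hi                                   # (1 - t) carries
            for ti, gi, hi in zip(t, transformed, h_star)]


rng = random.Random(0)
h_word = [0.2, -0.4, 0.1]   # stand-in for the word-level representation
h_char = [0.7, 0.3]         # stand-in for the character-level representation
h_star = h_word + h_char    # concatenation
d = len(h_star)

w_t = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
w_h = [[rng.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(d)]
b_h = [0.0] * d

# A strongly negative transform bias closes the gate, so the input is carried.
z_carry = highway(h_star, w_t, [-20.0] * d, w_h, b_h)
# A zero bias lets the layer mix transformed and carried values.
z_mixed = highway(h_star, w_t, [0.0] * d, w_h, b_h)
print(z_carry)
print(z_mixed)
```

Because the carry path passes each dimension through unchanged, the character-level part of the concatenation can survive the merge even when the transform is dominated by word-level dimensions.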

Training
Training our model to classify sentence relations is a process of optimizing all parameters $\theta$ of the network layers. Given an input sentence $X$ and the candidate set of relations $Y$, the classifier returns the output $\hat{y}$ as follows:

$$\hat{y} = \arg\max_{y \in Y} p(y \mid X, \theta)$$

We pass the combination vector $z$ through a softmax layer to obtain the distribution $y$ over relations. The training objective is the penalized cross-entropy loss between the predicted and true relations:

$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} t_{i,j} \log y_{i,j} + \lambda \lVert \theta \rVert_2^2$$

where $N$ is the mini-batch size, $m$ is the size of the relation set, $t_i \in \mathbb{R}^m$ denotes the one-hot ground truth of the $i$-th sentence, $y_{i,j}$ is the predicted probability that the $i$-th sentence belongs to class $j$, and $\lambda$ is the coefficient of L2 regularization.
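A minimal sketch of the softmax output and the penalized cross-entropy objective follows. It assumes that the loss is averaged over the mini-batch and that the penalty is the sum of squared parameters; the `scores` passed to the softmax stand in for whatever the output layer computes from the combination vector, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one vector of relation scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(probs):
    """Index of the most probable relation (the arg max)."""
    return max(range(len(probs)), key=probs.__getitem__)


def penalized_cross_entropy(batch_probs, batch_targets, params, lam):
    """Mean cross-entropy over the mini-batch plus lam times the squared L2 norm."""
    n = len(batch_probs)
    cross_entropy = -sum(
        t * math.log(y)
        for probs, target in zip(batch_probs, batch_targets)
        for y, t in zip(probs, target)
        if t > 0
    ) / n
    l2 = sum(p * p for p in params)
    return cross_entropy + lam * l2


# Toy mini-batch: N = 2 sentences, m = 4 relations.
batch_probs = [softmax([0.0, 0.0, 0.0, 0.0]), softmax([2.0, 0.5, 0.1, -1.0])]
batch_targets = [[0, 1, 0, 0], [1, 0, 0, 0]]   # one-hot ground truth
toy_params = [1.0, 2.0]                        # stand-in for the flattened parameters
loss = penalized_cross_entropy(batch_probs, batch_targets, toy_params, lam=0.1)
print(loss, [predict(p) for p in batch_probs])
```

With uniform predictions the cross-entropy term equals the logarithm of the number of relations, which is a convenient sanity check at the start of training.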

Dataset
We evaluate our model on two datasets. The SemEval-2010 Task8 dataset contains 9 directional relations and an Other class. There are existing datasets derived from TAC-KBP for relation classification, such as KBP37 (20k examples for evaluation) collected by Zhang and Wang (2015). Based on this and on more public corpora from recent years, we introduce a new, larger-scale dataset called KBP-SF48. It contains 48,340 annotated examples distributed among 40 relations (excluding no_relation and org:website): 33,838 sentences for training, which contain 102 unique characters, 9,668 for testing, and 4,834 for validation.
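As a quick consistency check on the split sizes reported above, the following sketch sums them and computes each split's share of the dataset. The dictionary keys are labels chosen for this example.

```python
# Split sizes as reported for KBP-SF48.
splits = {"train": 33_838, "test": 9_668, "validation": 4_834}

total = sum(splits.values())
fractions = {name: size / total for name, size in splits.items()}

print(f"total      {total:>6}")
for name, size in splits.items():
    print(f"{name:<10} {size:>6}  {fractions[name]:.1%}")
```

The three parts add up to the 48,340 annotated examples and correspond to a 70/20/10 division into training, test, and validation sets.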
Compared to SemEval-2010 Task8, the relation types of KBP-SF48 are designed for building a knowledge base from unstructured text, including quite a few informal documents, and the specific nominals that belong to these relations can be filled into specific slots. There are non-directional relations as well as corresponding directional relations (e.g. per:children & per:parents and org:members & org:member_of).
On SemEval-2010 Task8, our model yields an F1-score of 84.1% and outperforms most of the existing competing approaches without using any human-designed features or lexical resources.
On the KBP-SF48 benchmark, we evaluate our model by top-1 precision and by the mean rank of the correct relation, because of the existence of non-directional relations. We reproduce the results of the other systems ourselves with the same train/dev/test splits, and ablate different aspects of the proposed model to show the impact of every component of our architecture. As seen in Table 2, our model achieves a state-of-the-art result on the KBP-SF48 dataset.

Model                              Precision@1   Mean Rank
RNN-based (Zhang and Wang, 2015)   68.9%         2.01
CNN (Zeng et al., 2014)            79.1%         1.55

Our model already outperforms the RNN-based model (Zhang and Wang, 2015) of the KBP37 dataset and the CNN (Zeng et al., 2014), BLSTM and Att-BLSTM models of SemEval-2010 Task8, and it achieves a better result by incorporating character features into the word-level representation. We then evaluate the Bi-GRU architecture of Tweet2Vec (Dhingra et al., 2016), a pure character-level composition model, to show the effectiveness of character-level representation. Next, we remove the character-level component to run a word-level-only experiment, and replace the highway network with a fully connected layer. These clean comparisons demonstrate that the character-level representation and the Highway Network help to learn a better representation for classification.
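The two KBP-SF48 metrics, top-1 precision and mean rank of the correct relation, can be sketched as follows. The tie-handling rule (rank is one plus the number of strictly higher-scored relations) and the toy score lists are assumptions made for this example, not details taken from the evaluation setup.

```python
def rank_of_gold(scores, gold):
    """1-based rank of the gold relation when relations are sorted by score."""
    return 1 + sum(1 for s in scores if s > scores[gold])


def precision_at_1(all_scores, golds):
    """Fraction of sentences whose gold relation is ranked first."""
    hits = sum(1 for scores, gold in zip(all_scores, golds)
               if rank_of_gold(scores, gold) == 1)
    return hits / len(golds)


def mean_rank(all_scores, golds):
    """Average rank of the gold relation over all sentences."""
    ranks = [rank_of_gold(scores, gold) for scores, gold in zip(all_scores, golds)]
    return sum(ranks) / len(ranks)


# Toy predictions for three sentences over three relations.
all_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
golds = [1, 1, 2]  # index of the correct relation for each sentence

p1 = precision_at_1(all_scores, golds)
mr = mean_rank(all_scores, golds)
print(p1, mr)
```

In this toy case the gold relations are ranked 1, 2 and 1, so top-1 precision is 2/3 and the mean rank is 4/3. Mean rank gives partial credit when the correct relation is ranked just below a closely related one, which is why it complements top-1 precision when some relations have non-directional counterparts.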

Conclusion
In this paper, we propose a hybrid model that combines word-level and character-level representations. The model encodes characters with a cascade of CNN and GRU units, encodes words with Bi-GRU units, and uses a Highway Network to combine the two. We demonstrate that our model achieves competitive results on the popular benchmark SemEval-2010 Task8 and performs better at learning character features on the KBP-SF48 dataset, without relying on any lexical resources. In the future, we plan to add interactions between each word and its corresponding positional characters.