Supersense Tagging with a Combination of Character, Subword, and Word-level Representations

Recently, there has been increased interest in utilizing characters or subwords for natural language processing (NLP) tasks. However, the effect of utilizing character, subword, and word-level information simultaneously has not yet been examined. In this paper, we propose a model that leverages multiple levels of input features to improve performance on a supersense tagging task. Detailed analysis of the experimental results shows that different levels of input representation offer distinct characteristics that explain performance discrepancies among different tasks.


Introduction
Recently, there has been increased interest in using characters or subwords, instead of words, as the basic unit of linguistic features in natural language processing tasks. Utilizing subword information has been shown to be very effective for named entity alignment of parallel corpora and for named entity recognition (Lample et al., 2016; Santos and Guimaraes, 2015). Recent advances have also been achieved using character or subword features in neural machine translation and language modeling (Sennrich et al., 2015; Chung et al., 2016; Lee et al., 2016; Kim et al., 2016).
The main benefit of utilizing features below the word level is the ability to overcome the out-of-vocabulary (OOV) and rare-word problems. When faced with very infrequent or OOV words in the test data, word-level models must resort to replacing them with "unknown word" tokens. In many cases the discarded information is vital for understanding the semantics of the text, so word-level models can perform poorly when such words appear frequently.
Traditionally, words are segmented into subwords using carefully engineered morpheme analyzers (Smit et al., 2014). Recently, data-driven methods have risen in popularity, such as learning an efficient encoding scheme over character sequences (e.g., byte-pair encoding). Words can also be split into individual characters to capture even finer syntactic details. Subword schemes of varying linguistic granularity thus offer a trade-off between capturing semantic and syntactic features.
Despite the success of character- and subword-level approaches, there has been a lack of studies on ways to combine different levels of features, namely character, subword, and word-level features. To the best of our knowledge, subword units have not yet been applied to supersense tagging. In this paper, we present a novel neural network architecture that incorporates all three types of feature units (Section 3). We conduct experiments on the SemCor dataset using our model (Section 4.2), and then analyze in detail the optimal combination of features for each class of the 41 supersenses (Section 4.3).

Supersense Tagset
The supersense tagset consists of a total of 41 supersenses, the top-level semantic classes used in WordNet (Fellbaum, 1998), as shown in Table 1. This set is commonly used for evaluating approaches to coarse-grained word sense disambiguation and to information extraction tasks such as extended NER (Ciaramita and Johnson, 2003; Ciaramita and Altun, 2006). In this paper, we use the SemCor dataset.

Subword Segmentation
We use Byte Pair Encoding (BPE) to segment words into subwords. BPE first produces an efficient character encoding scheme from a given corpus; the encoding scheme consists of a fixed-size dictionary containing the most frequent character sequences. If a word is not frequent enough to be listed in the dictionary, it is broken down into subwords that do exist in the dictionary, and the meaning of the word is inferred from the meanings of its subwords. For example, the infrequent word "transition" could be split into the frequent character sequences "transi@@" and "tion".
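To make the segmentation step concrete, the following is a minimal sketch of applying a list of learned BPE merges to a single word. The merge list shown is purely illustrative; real BPE toolkits (e.g., subword-nmt) learn the merges from corpus statistics and repeatedly apply the highest-priority applicable pair, which this simplified in-order loop only approximates.

```python
def apply_bpe(word, merges):
    """Greedily apply BPE merges to a word and return its subword units.

    `merges` is an ordered list of symbol pairs; pairs learned earlier
    (i.e., more frequent in the training corpus) are applied first.
    """
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)   # merge the adjacent pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Mark non-final subwords with "@@" so the original word can be restored.
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]

# Hypothetical merge list: with enough merges a frequent word stays whole,
# while a rarer word like "transition" splits into "transi@@" and "tion".
merges = [("o", "n"), ("t", "i"), ("ti", "on"), ("t", "r"),
          ("tr", "a"), ("tra", "n"), ("tran", "s"), ("trans", "i")]
print(apply_bpe("transition", merges))  # ['transi@@', 'tion']
```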

Model Description
We define supersense tagging as a sequence labeling problem: given an input word sequence W = (w_1, w_2, ..., w_n), the sequence is segmented into a subword sequence using some encoding scheme (e.g., BPE), and a supersense label is predicted for each word. We present a novel neural network model that incorporates all of the varying levels of word features: character, subword, and word (Figure 1). This model is similar to (Lample et al., 2016), but differs in that (i) our model uses subword-level features as the basic unit of the main LSTM architecture (Section 3.2), (ii) it uses delayed prediction to synchronize subword-level sequences with word-level predictions (Section 3.2), and (iii) it takes subword-level input representations along with characters and words (Section 3.1).

Input Representation
For each subword unit x, our model produces three types of embeddings: (i) a character-level embedding z^(c), (ii) a subword embedding z^(s), and (iii) a word-level embedding z^(w). To produce the character-level representation, a bidirectional long short-term memory network (BiLSTM), BiLSTM_c, is run over the characters of x. The final hidden states of the two directions are concatenated into a single character-level representation: z^(c) = [h^(f); h^(b)]. Producing the subword embedding is trivial, as each x is assigned a trainable vector z^(s). Lastly, the word embedding z^(w) is obtained by taking the embedding of the word to which x belongs. Note that for some experiments, we use GloVe to initialize the word embeddings (https://nlp.stanford.edu/projects/glove/). These representations are concatenated to produce a single vector z ∈ R^r for each x, where r is the dimension of the combined subword-level input:

z = [z^(c); z^(s); z^(w)]
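Below is a minimal PyTorch sketch of how the three embeddings for a subword unit x could be built and concatenated as described above. The module name, dimensions, and id-based indexing scheme are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class SubwordInputRepresentation(nn.Module):
    """Builds z = [z^(c); z^(s); z^(w)] for one subword unit x (sketch).

    Vocabulary sizes and embedding dimensions below are placeholders.
    """
    def __init__(self, n_chars, n_subwords, n_words,
                 char_dim=25, char_hidden=25, sub_dim=100, word_dim=100):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        # BiLSTM_c runs over the characters of the subword.
        self.char_lstm = nn.LSTM(char_dim, char_hidden,
                                 bidirectional=True, batch_first=True)
        self.sub_emb = nn.Embedding(n_subwords, sub_dim)    # z^(s)
        self.word_emb = nn.Embedding(n_words, word_dim)     # z^(w), optionally GloVe-initialized

    def forward(self, char_ids, sub_id, word_id):
        # char_ids: (1, number of characters in the subword x)
        _, (h_n, _) = self.char_lstm(self.char_emb(char_ids))
        # Concatenate final forward and backward states: z^(c) = [h^(f); h^(b)]
        z_c = torch.cat([h_n[0], h_n[1]], dim=-1)
        z_s = self.sub_emb(sub_id)
        z_w = self.word_emb(word_id)   # embedding of the word containing x
        return torch.cat([z_c, z_s, z_w], dim=-1)

# Example usage with toy ids (a single subword):
# rep = SubwordInputRepresentation(n_chars=100, n_subwords=5000, n_words=20000)
# z = rep(torch.tensor([[3, 7, 12]]), torch.tensor([42]), torch.tensor([1337]))
```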

BiLSTM-CRF Architecture
We employ BiLSTM-CRF as the base architecture. Unlike previous work, subword-level embeddings instead of word-level embeddings are fed in at each time step. Given a subword-level embedding sequence Z = (z_1, z_2, ..., z_m), the main bidirectional LSTM, BiLSTM_s, together with a synchronization layer L_s and a linear output layer L_o, produces prediction scores O = (o_1, o_2, ..., o_n).
Note that because the input and output lengths m and n differ, synchronization between the two adjacent layers is required. The synchronization layer delays the supersense prediction until a word is fully formed by its subwords. The untrainable layer L_s is implemented by letting only those hidden outputs pass through whose subword aligns with the end of the word it belongs to:

H^(w) = W^(s) H^(s),

where H^(s) denotes the hidden outputs of BiLSTM_s, W^(s) ∈ R^(n×m), and each element is defined as W^(s)_(i,j) = 1(end(x_j, w_i)), i.e., it is 1 if subword x_j is the last subword of word w_i. The output layer L_o then applies a linear transformation to H^(w) to produce label scores O ∈ R^(n×k).
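The following sketch illustrates the synchronization layer as the selection matrix W^(s) applied to the subword-level hidden states. The function name and the way word boundaries are passed in (as the number of subwords per word) are assumptions made for illustration.

```python
import torch

def synchronize(h_sub, subwords_per_word):
    """Select the hidden state at the last subword of each word (sketch).

    h_sub: (m, d) hidden states of BiLSTM_s over the m subwords.
    subwords_per_word: list of length n with the subword count of each word.
    Returns H^(w) of shape (n, d), i.e. W^(s) @ h_sub with
    W^(s)[i, j] = 1 iff subword j is the final subword of word i.
    """
    m, d = h_sub.shape
    n = len(subwords_per_word)
    W_s = torch.zeros(n, m)
    j = -1
    for i, length in enumerate(subwords_per_word):
        j += length            # index of the last subword of word i
        W_s[i, j] = 1.0
    return W_s @ h_sub         # (n, d) word-synchronized hidden states

# Example: "the transition" -> subwords ["the", "transi@@", "tion"],
# subwords_per_word = [1, 2]; the rows of W^(s) pick subwords 0 and 2.
```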
As the final layer, the conditional random field (CRF) takes the time-independent label scores O and produces a joint score of the entire sequence by considering the interdependency among labels:

s(W, Y) = Σ_{i=1}^{n} ( A_{y_{i-1}, y_i} + o_{i, y_i} ),

where A is the transition matrix among labels. During training, we maximize the log-probability of the gold label sequence Y:

log p(Y | W) = s(W, Y) − log Σ_{Ȳ} e^{s(W, Ȳ)},

where Ȳ ranges over all possible label sequences. Maximizing this objective encourages the model to produce valid sequences of labels.
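As a concrete illustration, the sketch below computes the sequence score s(W, Y) and the negative log-likelihood. The partition term is computed by brute-force enumeration purely for clarity; in practice it is computed with the forward algorithm, and formulations such as Lample et al. (2016) additionally include start/end transition scores, which are omitted here.

```python
import itertools
import torch

def sequence_score(O, A, y):
    """s(W, Y): emission scores plus label-transition scores (sketch).

    O: (n, k) per-word label scores from the linear layer.
    A: (k, k) transition matrix; A[p, q] scores label p followed by q.
    y: list of n gold label indices.
    """
    score = O[0, y[0]]
    for i in range(1, len(y)):
        score = score + A[y[i - 1], y[i]] + O[i, y[i]]
    return score

def crf_negative_log_likelihood(O, A, y):
    """-log p(Y|W) = log Σ_{Ȳ} exp(s(W, Ȳ)) - s(W, Y).

    Brute-force enumeration of all label sequences, for illustration only;
    the forward algorithm makes this linear in the sequence length.
    """
    n, k = O.shape
    all_scores = torch.stack([sequence_score(O, A, list(y_bar))
                              for y_bar in itertools.product(range(k), repeat=n)])
    return torch.logsumexp(all_scores, dim=0) - sequence_score(O, A, y)
```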

Experimental Setup
The dropout rate was 0.5, stochastic gradient descent (SGD) was used as the learning method with a learning rate of 0.005, and the gradient clipping threshold was 5.0.

Table 3: Comparison of character, subword, and word-level models with/without pre-trained vectors.
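As an illustration, a minimal PyTorch training loop using these settings might look as follows; the model, data, and loss here are toy placeholders standing in for the actual tagger, the SemCor batches, and the CRF negative log-likelihood.

```python
import torch
import torch.nn as nn

# Reported settings: dropout 0.5, SGD with learning rate 0.005, gradient clipping 5.0.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(32, 41))   # 41 supersense classes
optimizer = torch.optim.SGD(model.parameters(), lr=0.005)
loss_fn = nn.CrossEntropyLoss()   # placeholder for the CRF objective

for step in range(100):
    x = torch.randn(8, 10)                 # placeholder input features
    y = torch.randint(0, 41, (8,))         # placeholder supersense labels
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    optimizer.step()
```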

SemCor Evaluations
The classification results on the SemCor dataset using different combinations of input representations are shown in Table 3. We note that in single-representation settings the word-level model performs better than the character-level or subword-level model, presumably because supersense tagging predicts a label for each word. We also note that when word embeddings are pre-trained, performance always improves with the addition of character- or subword-level embeddings. Overall, the best result is obtained when the subword and word embeddings are pre-trained and all three embeddings are utilized.

Detailed Analysis
To investigate the effect of using character- or subword-level embeddings, we select 15 supersenses and examine each of them individually (Table 4). With pre-trained vectors, c+s+w performs much better than the other combinations, outperforming them in many classes. Without pre-trained vectors, however, it does not maintain this dominance.
Also, 5 out of the 7 combinations perform best in at least one class. This shows that character, subword, and word-level embeddings offer features with different characteristics that can be either advantageous or disadvantageous depending on the class.
We further conduct a nearest-neighbor analysis on various embedding combinations (Table 5). We find that, in most cases, words of the same supersense are mapped close to each other in the embedding space. As in the previous analysis, we also find that each model exhibits distinct characteristics. For example, in the c+s model, the nearest neighbors of Mr. are Dr. and Mrs., whereas in the subword-only (sub) model, male names such as Thomas and Bob are identified as the nearest neighbors.

Conclusion
In this paper, we examine the effect of various combinations of input representations on the performance of a supersense tagging task. Furthermore, we propose a modified BiLSTM-CRF model that is able to take subword sequences as input and predict word-level labels. Our experiments on supersense tagging show that utilizing all token units (character, subword, and word-level) along with pre-trained word vectors performs best. Based on a detailed analysis of selected supersense classes, we conjecture that each granularity level of input representation offers different semantic and syntactic features that can have varying effects depending on the task. As future work, we intend to investigate the feasibility of a model that self-learns the optimal continuous combination of different levels of subword information depending on the task and data characteristics.