Learning Better Internal Structure of Words for Sequence Labeling

Character-based neural models have recently proven very useful for many NLP tasks. However, there is a gap in sophistication between methods for learning representations of sentences and words. While most character models for learning representations of sentences are deep and complex, models for learning representations of words are shallow and simple. Also, in spite of considerable research on learning character embeddings, it is still not clear which kind of architecture is the best for capturing character-to-word representations. To address these questions, we first investigate the gaps between methods for learning word and sentence representations. We conduct detailed experiments and comparisons on different state-of-the-art convolutional models, and also investigate the advantages and disadvantages of their constituents. Furthermore, we propose IntNet, a funnel-shaped wide convolutional neural architecture with no down-sampling for learning representations of the internal structure of words by composing their characters from limited, supervised training corpora. We evaluate our proposed model on six sequence labeling datasets, including named entity recognition, part-of-speech tagging, and syntactic chunking. Our in-depth analysis shows that IntNet significantly outperforms other character embedding models and obtains new state-of-the-art performance without relying on any external knowledge or resources.


Introduction
Sequence labeling is the task of assigning a label or class to each element of a sequence of data, and is one of the first stages in many natural language processing (NLP) tasks. For example, named entity recognition (NER) aims to classify words in a sentence into several predefined categories of interest such as person, organization, location, etc.
Part-of-speech (POS) tagging assigns a part of speech to each word in an input sentence. Syntactic chunking divides text into syntactically related, non-overlapping groups of words. Sequence labeling is a challenging problem because human annotation is very expensive and typically only a small amount of tagging data is available.
Most traditional sequence labeling systems have been dominated by linear statistical models which heavily rely on feature engineering. As a result, carefully constructed hand-crafted features and domain-specific knowledge are widely used for solving these tasks. Unfortunately, it is costly to develop domain-specific knowledge and hand-crafted features. Recently, neural networks using character-level information have been used successfully for minimizing the need for feature engineering. There are basically two threads of character-based modeling: one focuses on learning representations of sentences for semantics and syntax (Zhang et al., 2015; Conneau et al., 2017); the other focuses on learning representations of words for the purpose of eliminating handcrafted features for word shape information (Lample et al., 2016; Ma and Hovy, 2016).
Two main state-of-the-art approaches to learning character representations for sequence labeling emerged from the latter thread. One is based on RNNs and uses bidirectional LSTMs or GRUs to learn forward and backward character information (Ling et al., 2015; Lample et al., 2016; Yang et al., 2017). The other approach is based on CNNs with a fixed-size window around each word to create character-level representations (Santos and Zadrozny, 2014; Chiu and Nichols, 2016; Ma and Hovy, 2016). However, there is a gap in sophistication between character-based methods for learning representations of sentences and those for words. We found that most of the state-of-the-art character-based CNN models for words use a convolution followed by max pooling as a shallow feature extractor, which is very different from the CNN models with deep and complex architectures for sentences. In spite of considerable research on learning character embeddings, it is still not clear which kind of architecture is the best for capturing character-to-word representations.
Therefore, a number of questions remain open:
• Why is there a gap between methods for learning representations of sentences and words? How can this gap be bridged?
• How do state-of-the-art character embedding models differ in terms of performance?
• What kind of neural network architecture is better for learning the internal structure of a word? Deep or shallow? Narrow or wide?
To answer these questions, we first investigate the gap between learning word representations and sentence representations for convolutional architectures. The most straightforward idea is to add more convolutional layers, following the approaches used for learning representations of sentences. Interestingly, we observe that accuracy does not increase much, and that it drops as we continue to increase the depth of the network. This observation shows that learning character representations for the internal structure of words is very different from learning them for sentences, and might explain one of the reasons for the gap between character-based CNN models for representing words and sentences.
In this paper, we present detailed experiments and comparisons across different state-of-the-art convolutional models from natural language processing and computer vision. We also investigate the advantages and disadvantages of some of their constituents on different convolutional architectures. Furthermore, we propose IntNet, a funnel-shaped wide convolutional neural network for learning the internal structure of words by composing their characters. Unlike previous CNN-based approaches, our funnel-shaped IntNet explores a deeper and wider architecture with no down-sampling for learning character-to-word representations from limited supervised training corpora. Lastly, we combine our IntNet model with LSTM-CRF, which captures both word shape and context information, and jointly decode tags for sequence labeling.
The main contributions of this paper are the following:
• We conduct detailed studies on investigating the gap between learning word representations and sentence representations.
• We provide in-depth experiments and empirical comparisons of different convolutional models and explore the advantages and disadvantages of their components for learning character-to-word representations.
• We propose a funnel-shaped wide convolutional neural architecture with no down-sampling that focuses on learning a better internal structure of words.
• Our proposed compositional character-to-word model combined with LSTM-CRF achieves state-of-the-art performance for various sequence labeling tasks.
This paper is organized as follows: Section 2 describes multiple threads of related work. Section 3 presents the whole architecture of the neural network. Section 4 provides details about experimental settings and compared methods. Section 5 reports model results on different benchmarks with detailed analyses and discussion.

Related Work
There exist three threads of related work regarding the topic of this paper: (i) different convolutional architectures from different domains; (ii) character embedding models for words; (iii) sequence labeling with deep neural networks.
CNN models across domains. Convolutional neural networks (CNNs) are very useful in extracting information from raw signals. In the area of NLP, Kim (2014) was the first to propose a shallow CNN with word embeddings for sentence classification. Zhang et al. (2015) proposed a CNN with 6 convolutional layers that directly extracts character-level information for learning representations of the semantic structure of sentences. Recently, Conneau et al. (2017) proposed a VDCNN architecture with 29 convolutional layers using residual connections for text classification. However, one study on randomly dropping layers for training deep residual networks (Huang et al., 2016) has shown that not all layers may be needed and highlighted that there is some amount of redundancy in ResNet (He et al., 2016). Also, some research has shown promising results with wide architectures, for example, wide ResNet (Zagoruyko and Komodakis, 2016), Inception-ResNet (Szegedy et al., 2017) and DenseNet (Huang et al., 2017). The character-level models above learn representations of sentences, not words.
Character embedding models. Santos and Zadrozny (2014) proposed a CNN model to learn character representations of words to replace hand-crafted features for part-of-speech tagging. Ling et al. (2015) proposed a bidirectional LSTM over characters to use as input for learning character-to-word representations. Chiu and Nichols (2016) proposed a bidirectional LSTM-CNN with lexicons for named entity recognition by applying the CNN-based character embedding model from Santos and Zadrozny (2014). Plank et al. (2016) proposed a bi-LSTM model with auxiliary loss for multilingual part-of-speech tagging by following the LSTM-based character embedding model from Ling et al. (2015). Cotterell and Heigold (2017) proposed a character-level transfer learning model for neural morphological tagging.
Sequence labeling. Collobert et al. (2011) first proposed a method based on CNN-CRF that learns important features from words and requires few hand-crafted features. Huang et al. (2015) proposed a bidirectional LSTM-CRF model using word embeddings and hand-crafted features for sequence tagging. Lample et al. (2016) applied the LSTM-based character embedding model from Ling et al. (2015) with bidirectional LSTM-CRF and obtained the best results on NER for Spanish, Dutch, and German. Ma and Hovy (2016) applied the CNN-based character embedding model from Chiu and Nichols (2016), but without using any data preprocessing or external knowledge, and achieved the best results on NER for English and on part-of-speech tagging. Also, there have been some joint models which use additional knowledge, like transfer learning (Yang et al., 2017), pre-trained language models (Peters et al., 2017), language model joint training (Rei, 2017), and multi-task learning (Liu et al., 2018). Without any additional supervision or extra resources, LSTM-CRF (Lample et al., 2016) and LSTM-CNN-CRF (Ma and Hovy, 2016) are the current state-of-the-art methods. To test the effectiveness of our proposed model, we use these two models as our baselines in later sections.

IntNet
Character embeddings. The first step is to initialize the character embeddings for each word $w$ in the input sequence. We define the finite set of characters $V^{char}$. This vocabulary contains all the variations of the raw text, including uppercase and lowercase letters, numbers, punctuation marks, and symbols. Unlike some character-based approaches, we do not use any character-level preprocessing, which enables our model to learn and capture regularities from prefixes to suffixes to construct character-to-word representations. The input word $w$ is decomposed into a sequence of characters $\{c_1, \dots, c_n\}$, where $n$ is the length of $w$. Character embeddings are encoded by column vectors in the embedding matrix $W^{char} \in \mathbb{R}^{d^{char} \times |V^{char}|}$, where $d^{char}$ is the number of parameters for each character in $V^{char}$. Given a character $c_i$, its embedding $r_i^{char}$ is obtained by the matrix-vector product:

$$r_i^{char} = W^{char} v_i^{char} \tag{1}$$

where $v_i^{char}$ is defined as a one-hot vector for $c_i$. We randomly initialize a look-up table with values drawn from a uniform distribution with range $[-\sqrt{3/d^{char}}, +\sqrt{3/d^{char}}]$, where $d^{char}$ is empirically chosen by users. The character set includes all unique characters and the special tokens PADDING and UNKNOWN. We do not perform any character-level preprocessing, such as case normalization or digit replacement (e.g. replacing all sequences of digits 0-9 with a single "0"), nor do we use any capitalization features (e.g. allCaps, upperInitial, lowercase, mixedCaps, noinfo).
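The lookup described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`build_char_vocab`, `init_char_embeddings`, `embed_word`) are hypothetical, the table is stored row-wise rather than column-wise, and the initialization bound is taken to be the square root of 3 divided by the embedding dimension.

```python
import math
import random

def build_char_vocab(words):
    """Map every distinct character to an index; 0 and 1 are reserved for special tokens."""
    vocab = {"PADDING": 0, "UNKNOWN": 1}
    for word in words:
        for ch in word:                      # no lowercasing, no digit replacement
            vocab.setdefault(ch, len(vocab))
    return vocab

def init_char_embeddings(vocab_size, d_char, seed=0):
    """One d_char-dimensional row per character, uniform in [-sqrt(3/d), +sqrt(3/d)] (assumed bound)."""
    rng = random.Random(seed)
    bound = math.sqrt(3.0 / d_char)
    return [[rng.uniform(-bound, bound) for _ in range(d_char)]
            for _ in range(vocab_size)]

def embed_word(word, vocab, table):
    """Row lookup; equivalent to multiplying the embedding matrix by a one-hot vector."""
    unk = vocab["UNKNOWN"]
    return [table[vocab.get(ch, unk)] for ch in word]

vocab = build_char_vocab(["Thursday", "11-month"])
table = init_char_embeddings(len(vocab), d_char=32)
vectors = embed_word("Thursday", vocab, table)   # 8 vectors of size 32
```

Because case is preserved, "T" and "t" receive different indices, which is what lets later layers pick up capitalization patterns without explicit features.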
Convolutional blocks. The input to IntNet is the sequence of character embeddings $\{r_1^{char}, \dots, r_n^{char}\}$. First is the initial convolutional layer, which is a temporal convolutional module that computes 1-D convolutions. Let $x_i \in \mathbb{R}^{d^{char} \times r^{char}}$ be the concatenation of the character embeddings for each $w$. The initial convolutional layer applies a matrix-vector operation to each successive window of size $k^{char}$. An input k-gram $x_{i:i+k-1}$ is transformed through a convolution filter $w_c$:

$$c_i = f(w_c \cdot x_{i:i+k-1} + b_c) \tag{2}$$

where $c_i$ is the feature map of the 1-D convolution, $f$ is the non-linear ReLU function, and $b_c$ is a bias term. Equation 2 produces $m$ filters with different kernel sizes. The feature maps computed with different kernels by the initial convolutional layer are concatenated:

$$g_0 = [c^{1}; c^{2}; \dots; c^{h}] \tag{3}$$

where $c^{j}$ denotes the feature maps of the $j$-th kernel, $h$ is the number of kernels, and $g_0$ is the output of the initial convolutional layer, which feeds into the next convolutional block. We define $F(\cdot)$ as a function of several consecutive operations within a convolutional block. First, an N×1 convolution transforms the input. The output size is $4 \times m \times h$ feature maps, like a bottleneck layer. The next step consists of multiple 1-D convolutions with kernels of different sizes. Lastly, we concatenate all the feature maps from kernels of different sizes. In each convolution, we use batch normalization, followed by a ReLU activation and an N×k temporal convolution.
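As a rough sketch of the initial layer, the code below applies several kernel sizes to the same character sequence and concatenates the resulting feature maps per position. It assumes zero padding so that every kernel yields one output per character (needed for the concatenation to line up); the padding scheme, the random weight range, and the function names are illustrative assumptions, not details given in the text.

```python
import random

def relu(x):
    return x if x > 0.0 else 0.0

def conv1d_same(seq, filters, biases, k):
    """seq: n vectors of size d. filters: m filters with k*d weights each.
    Zero-pads so the output keeps n positions (assumed, no down-sampling)."""
    n, d = len(seq), len(seq[0])
    left = (k - 1) // 2
    zero = [0.0] * d
    padded = [zero] * left + seq + [zero] * (k - 1 - left)
    out = []
    for i in range(n):
        window = [v for vec in padded[i:i + k] for v in vec]   # flattened k-gram
        out.append([relu(sum(w * x for w, x in zip(f, window)) + b)
                    for f, b in zip(filters, biases)])
    return out

def initial_layer(seq, m, kernel_sizes, seed=0):
    """Run m filters per kernel size and concatenate along the feature axis."""
    rng = random.Random(seed)
    d = len(seq[0])
    maps = []
    for k in kernel_sizes:
        filters = [[rng.uniform(-0.1, 0.1) for _ in range(k * d)] for _ in range(m)]
        maps.append(conv1d_same(seq, filters, [0.0] * m, k))
    return [[v for fmap in maps for v in fmap[i]] for i in range(len(seq))]

chars = [[0.1 * (i + j) for j in range(4)] for i in range(8)]   # 8 characters, d_char = 4
g0 = initial_layer(chars, m=2, kernel_sizes=(3, 4, 5))          # 8 positions x 6 features
```

Each position ends up with m times h features, one group per kernel size, while the sequence length stays unchanged.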
Funnel-shaped wide architecture. The network comprises $L$ convolutional layers, which implies $(L-1)/2$ convolutional blocks. We use direct connections from every other layer to all subsequent layers, inspired by dense connections. Therefore, the $l$-th layer has access to the feature maps of all the alternate layers (Equation 4), which ensures maximum information flow between blocks in the network. Compared to the residual connection $F_l(g_{l-1}) + g_{l-1}$, it can be viewed as an extreme case of residual connection and makes feature reuse possible. Unlike DenseNet and ResNet, we concatenate feature maps from different kernels in every other convolutional layer, which captures different levels of features and makes our wide architecture possible, inspired by Inception. Different levels of concatenation can help IntNet to learn different patterns of word shape information. We compare our architecture to residual connections and dense connections for learning character-to-word representations in Section 5.
Without down-sampling. Compared to other CNN models like ResNet and DenseNet, our model does not contain any halving down-sampling layers or average pooling to reduce resolution. We did not find these operations to be helpful and, in some cases, found them to be detrimental to performance. These operations are useful for sentences and images, but might break the internal structure of words, like the sequential patterns for prefixes and suffixes.
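The difference between residual addition and dense concatenation can be illustrated with a small sketch. Here each block receives the concatenation of the initial output and all earlier block outputs, so the feature width grows from block to block while the number of character positions stays fixed. The function `toy_block` is a hypothetical stand-in for a real convolutional block, and the exact set of layers that are connected in IntNet is not reproduced here.

```python
def dense_forward(g0, blocks):
    """g0: n positions x p0 features. blocks: callables mapping (n x c) -> (n x p).
    Every block sees g0 plus all earlier block outputs, concatenated per position."""
    outputs = [g0]
    n = len(g0)
    for block in blocks:
        concat = [[v for out in outputs for v in out[i]] for i in range(n)]
        outputs.append(block(concat))
    # the final representation keeps every intermediate feature map
    return [[v for out in outputs for v in out[i]] for i in range(n)]

def toy_block(p):
    """Hypothetical stand-in for a block F_l: emits p features per position."""
    def apply(x):
        return [[sum(row) / len(row) * (j + 1) for j in range(p)] for row in x]
    return apply

g0 = [[1.0, 2.0], [3.0, 4.0]]                       # 2 positions, p0 = 2
out = dense_forward(g0, [toy_block(3), toy_block(3)])
# width grows as p0 + p * (number of blocks): 2 + 3 + 3 = 8 features per position
```

With a residual connection the width would stay constant because outputs are summed; with concatenation earlier features remain directly available to every later block.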
Character-to-word representations. In the last layer, we use a max-over-time pooling operation:

$$\hat{c} = \max_i c_i \tag{5}$$

which takes the maximum value corresponding to a particular filter. The idea is to capture the most important feature, the one with the highest value, for each feature map. Finally, we concatenate all the salient features together as a representation for this word, $[\hat{c}_1; \dots; \hat{c}_u]$ (Equation 6), where $u$ is the number of salient features, which is equal to the total number of output feature maps in the last layer. If each function $F_l$ produces $p$ feature maps, we obtain $p_0 + p \times \frac{L-1}{2}$ representations, where $p_0$ is the number of output feature maps in the initial convolution layer.
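A minimal sketch of the pooling step and the resulting representation size follows. The example widths (96 initial feature maps, 48 per block) are illustrative assumptions chosen for the arithmetic, not values reported in the text.

```python
def max_over_time(feature_maps):
    """feature_maps: n positions x u features -> u values, the maximum over positions."""
    return [max(column) for column in zip(*feature_maps)]

def num_output_features(p0, p, num_layers):
    """p0 + p * (L - 1) / 2, where (L - 1) / 2 is the number of convolutional blocks."""
    assert num_layers % 2 == 1, "L is assumed odd: one initial layer plus two layers per block"
    return p0 + p * ((num_layers - 1) // 2)

word_vector = max_over_time([[1, 5], [3, 2], [0, 4]])     # one value per feature map
width_5 = num_output_features(p0=96, p=48, num_layers=5)  # hypothetical widths
width_9 = num_output_features(p0=96, p=48, num_layers=9)
```

Whatever the word length, the pooled vector has a fixed size, which is what allows it to be fed to the sentence-level model.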

Bi-directional RNN
Given the character-to-word representations computed by IntNet in Equation 6, we denote the input vectors for a sentence as $(z_1, z_2, \dots, z_n)$. An LSTM (Hochreiter and Schmidhuber, 1997) returns the sequence $(h_1, h_2, \dots, h_n)$ that represents the sequential information at every step. We use the following implementation:

$$
\begin{aligned}
i_t &= \sigma(W_{zi} z_t + W_{hi} h_{t-1}) \\
f_t &= \sigma(W_{zf} z_t + W_{hf} h_{t-1}) \\
o_t &= \sigma(W_{zo} z_t + W_{ho} h_{t-1}) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{zc} z_t + W_{hc} h_{t-1}) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned} \tag{7}
$$

where $\sigma$ is the element-wise sigmoid function and $\odot$ is the element-wise product. $z_t$ is the input vector at time $t$, and $i_t$, $f_t$, $o_t$, $c_t$ are the input gate, forget gate, output gate, and cell vectors, all of which are the same size as the hidden vector $h_t$. $W_{zi}$, $W_{zf}$, $W_{zo}$, $W_{zc}$ denote the weight matrices of the different gates for the input $z_t$; $W_{hi}$, $W_{hf}$, $W_{ho}$, $W_{hc}$ are the weight matrices for the hidden state $h_t$.
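The gate and weight names defined above map onto a single recurrence step as sketched below. This is a standard LSTM cell written in plain Python under several assumptions: bias terms are omitted, initial states are zero, and the bi-directional output is formed by concatenating a forward and a backward pass, which is the usual construction but is not spelled out in the text.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def lstm_step(z_t, h_prev, c_prev, W):
    """W maps 'zi','hi','zf','hf','zo','ho','zc','hc' to weight matrices (no biases here)."""
    def pre(gate):
        a = matvec(W["z" + gate], z_t)
        b = matvec(W["h" + gate], h_prev)
        return [x + y for x, y in zip(a, b)]
    i_t = [sigmoid(x) for x in pre("i")]        # input gate
    f_t = [sigmoid(x) for x in pre("f")]        # forget gate
    o_t = [sigmoid(x) for x in pre("o")]        # output gate
    g_t = [math.tanh(x) for x in pre("c")]      # candidate cell values
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def run(seq, W, hidden):
    h, c, out = [0.0] * hidden, [0.0] * hidden, []
    for z in seq:
        h, c = lstm_step(z, h, c, W)
        out.append(h)
    return out

def bilstm(seq, W_fwd, W_bwd, hidden):
    """Concatenate forward and backward hidden states at every position (assumed)."""
    fwd = run(seq, W_fwd, hidden)
    bwd = run(seq[::-1], W_bwd, hidden)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

W0 = {name: [[0.0]] for name in ("zi", "hi", "zf", "hf", "zo", "ho", "zc", "hc")}
states = bilstm([[1.0], [2.0], [3.0]], W0, W0, hidden=1)   # 3 positions x 2 features
```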

Scoring Function
Instead of predicting each label independently, we consider the correlations between labels in neighborhoods and jointly decode the best chain of labels for a given input sentence by leveraging a conditional random field (Lafferty et al., 2001). Formally, the sequence of labels is defined as $y = (y_1, y_2, \dots, y_T)$.
To define the scoring function $f(h, y)$, for each position $t$ we multiply the hidden state $h_t^{w}$ with a parameter vector $w_{y_t}$ that is indexed by the tag $y_t$ to obtain the matrix of scores output by the bidirectional LSTM network. Therefore, the function $f$ can be written as:

$$f(h, y) = \sum_{t=1}^{T} w_{y_t}^{\top} h_t^{w} + \sum_{t=1}^{T-1} A_{y_t, y_{t+1}} \tag{8}$$

In Equation 8, $A$ is a matrix of transition scores, $A_{i,j}$ represents the score of a transition from tag $i$ to tag $j$, and $y_1$ is the start tag of a sentence. Let $\mathcal{Y}(h)$ denote the set of possible label sequences for $h$. A probabilistic model for a sequence defines a family of conditional probabilities $p(y \mid h)$ over all possible label sequences $y$ given $h$ with the following form:

$$p(y \mid h) = \frac{\exp(f(h, y))}{\sum_{y' \in \mathcal{Y}(h)} \exp(f(h, y'))} \tag{9}$$
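To make the scoring function concrete, the sketch below sums per-position tag scores and tag-to-tag transition scores, then normalizes by brute-force enumeration of all label sequences. Enumeration is only feasible for tiny examples and is used here purely for illustration; the sketch also treats all tags uniformly and does not model a dedicated start tag. The names `emissions` and `transitions` are illustrative.

```python
import itertools
import math

def sequence_score(emissions, transitions, tags):
    """emissions[t][y]: score of tag y at position t; transitions[i][j]: score of i -> j."""
    score = sum(emissions[t][y] for t, y in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score

def probability(emissions, transitions, tags):
    """Softmax over all possible tag sequences (exponential cost, toy sizes only)."""
    num_tags = len(transitions)
    paths = itertools.product(range(num_tags), repeat=len(emissions))
    log_z = math.log(sum(math.exp(sequence_score(emissions, transitions, p)) for p in paths))
    return math.exp(sequence_score(emissions, transitions, tags) - log_z)

emissions = [[1.0, 0.0], [0.0, 2.0]]        # 2 positions, 2 tags
transitions = [[0.5, -0.5], [0.0, 1.0]]
score_01 = sequence_score(emissions, transitions, (0, 1))
prob_01 = probability(emissions, transitions, (0, 1))
```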

Objective Function and Inference
For end-to-end network training, we use maximum conditional likelihood estimation to maximize the log probability of the correct tag sequence:

$$\log p(y \mid h) = f(h, y) - \log \sum_{y' \in \mathcal{Y}(h)} \exp(f(h, y')) \tag{10}$$

While decoding, we predict the label sequence that obtains the highest score given by:

$$y^{*} = \arg\max_{y' \in \mathcal{Y}(h)} f(h, y') \tag{11}$$

The objective function and its gradients can be efficiently computed by dynamic programming; for inference, we use the Viterbi algorithm to find the best tag path which maximizes the score.
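Both dynamic programs mentioned above can be sketched compactly. The forward recursion computes the log of the normalizer needed by the training objective, and Viterbi replaces the log-sum-exp with a max and keeps back-pointers. The `emissions` and `transitions` inputs follow the same illustrative layout as a per-position score table and a tag-to-tag score matrix; no dedicated start or end tag is modeled here.

```python
import math

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exponentiated scores of all tag paths."""
    alpha = list(emissions[0])
    for em in emissions[1:]:
        alpha = [em[j] + log_sum_exp([alpha[i] + transitions[i][j]
                                      for i in range(len(alpha))])
                 for j in range(len(em))]
    return log_sum_exp(alpha)

def viterbi(emissions, transitions):
    """Return (best_score, best_path) for the highest-scoring tag sequence."""
    delta = list(emissions[0])
    back = []
    for em in emissions[1:]:
        step, ptr = [], []
        for j in range(len(em)):
            best_i = max(range(len(delta)), key=lambda i: delta[i] + transitions[i][j])
            ptr.append(best_i)
            step.append(delta[best_i] + transitions[best_i][j] + em[j])
        delta = step
        back.append(ptr)
    last = max(range(len(delta)), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(back):              # follow back-pointers to the first position
        path.append(ptr[path[-1]])
    return delta[last], path[::-1]

emissions = [[1.0, 0.0], [0.0, 2.0], [1.5, 0.0]]
transitions = [[0.5, -0.5], [0.0, 1.0]]
best_score, best_path = viterbi(emissions, transitions)
log_z = log_partition(emissions, transitions)
```

The log-likelihood of a gold sequence is then its score minus `log_z`, and both recursions cost time linear in the sentence length and quadratic in the number of tags.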

Datasets
We performed experiments on six standard datasets for sequence labeling tasks, i.e. named entity recognition, part-of-speech tagging, and syntactic chunking. To test the effectiveness of our proposed model, we do not use language-specific resources (such as gazetteers), external knowledge (such as transfer learning or joint training), handcrafted features, or any character preprocessing, and we do not replace any rare words with UNKNOWN.
Named entity recognition. The CoNLL-2002 and CoNLL-2003 datasets (Tjong Kim Sang, 2002; Tjong Kim Sang and De Meulder, 2003) contain named entity labels for Spanish, Dutch, English and German as separate datasets. These four datasets contain different types of named entities: locations, persons, organizations, and miscellaneous entities. Unlike some approaches, we do not combine the validation set with the training set. Although POS tags were made available for these datasets, we do not leverage those as additional information, which sets our approach apart from that of transfer learning.
Part-of-speech tagging. The Wall Street Journal (WSJ) portion of Penn Treebank (PTB) (Marcus et al., 1993) contains 25 sections and categorizes each word into one out of 45 POS tags. We adopt the standard split and use sections 0-18 as training data, sections 19-21 as development data, and sections 22-24 as test data.
Syntactic chunking. The CoNLL 2000 chunking task (Tjong Kim Sang and Buchholz, 2000) uses sections 15-18 from the Wall Street Journal corpus for training and section 20 for testing. It defines 11 syntactic chunk types (e.g., NP, VP, ADJP). We adopt the standard split and sample 1000 sentences from the training set as the development set.

Training Settings
Initialization. Character embeddings have 32 dimensions and are randomly initialized using a uniform distribution. We adopt the same initialization method for randomly initialized word embeddings that are updated during training. For IntNet, the initial convolution uses 32 filters and the other convolutions use 16. We use kernels of size [3, 4, 5] throughout. The number of convolutional layers is 5 and 9 for IntNet-5 and IntNet-9, respectively, and we adopt the same weight initialization as that of ResNet. We use pre-trained word embeddings for initialization: GloVe (Pennington et al., 2014) 100-dimension word embeddings for English, and fastText (Bojanowski et al., 2017) 300-dimension word embeddings for Spanish, Dutch, and German. The state size of the bi-directional LSTMs is set to 256. We adopt the standard BIOES tagging scheme for NER and chunking.
Optimization. We employ mini-batch stochastic gradient descent with momentum. The batch size is 10 and the momentum is 0.9. The learning rate is decayed as $\eta_t = \eta_0 / (1 + \rho t)$, where $\eta_0 = 0.01$ is the initial learning rate and $\rho = 0.05$ is the decay ratio. The gradient clipping value is 5. Dropout with a fixed ratio of 0.5 is applied to the inputs of IntNet, the LSTMs, and the CRF, but no dropout is used inside IntNet.
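For readers unfamiliar with the BIOES scheme, the conversion can be sketched as follows. The sketch assumes the input tags are in IOB2 form (every entity starts with a B- tag), which may differ from how a given dataset is distributed; the function name is illustrative.

```python
def bio_to_bioes(tags):
    """Convert IOB2 tags (B-X, I-X, O) to BIOES: S- marks single-token spans, E- marks span ends."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label          # does the same span go on?
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        else:                                    # prefix == "I"
            out.append(("I-" if continues else "E-") + label)
    return out

converted = bio_to_bioes(["B-PER", "I-PER", "O", "B-LOC", "B-ORG", "I-ORG", "I-ORG"])
```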
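The learning rate decay can be written as a one-line function. The sketch assumes that the step counter t counts completed epochs, which the text does not state explicitly.

```python
def learning_rate(t, eta0=0.01, rho=0.05):
    """Decayed learning rate eta0 / (1 + rho * t); t is assumed to be the epoch index."""
    return eta0 / (1.0 + rho * t)

schedule = [learning_rate(t) for t in (0, 10, 20, 40)]
```

Under this assumption the rate halves after 20 epochs and falls to a third of its initial value after 40.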

Compared Methods
To address the open questions in Section 1, we conduct detailed experiments and empirical comparisons on different state-of-the-art character embedding models across different domains. Firstly, we use LSTM-CRF with randomly initialized word embeddings as our initial baseline. We adopt two state-of-the-art methods in sequence labeling, denoted as char-LSTM (Lample et al., 2016) and char-CNN (Ma and Hovy, 2016). We add more layers to the char-CNN model and refer to the results as char-CNN-5 and char-CNN-9, for 5 and 9 convolutional layers respectively. Furthermore, we add residual connections to char-CNN-9 and refer to it as char-ResNet. Also, we apply 3 dense blocks based on char-ResNet, which we refer to as char-DenseNet, to compare the difference between residual connections and dense connections. Lastly, we refer to our proposed model, which uses different convolution layers, as char-IntNet-5 and char-IntNet-9.

Results and Analysis

Character-to-word Models
Table 1 presents the performance of different character-to-word models on six benchmark datasets. For sequence labeling, char-LSTM and char-CNN are the current state-of-the-art character embedding models for learning character-to-word representations. We observe that char-LSTM performs better than char-CNN in most cases. However, char-CNN uses a convolution layer followed by max pooling as a shallow feature extractor, which does not explore the full potential of CNNs. Therefore, we implement two variations based on char-CNN, referred to as char-CNN-5 and char-CNN-9. The results show that for most of the datasets, the F1 score does not improve much when we directly add more layers. We also observe some accuracy drop when we continue to increase the depth. This confirms why most CNN-based approaches for learning representations of words are shallow, which is very different from learning representations of sentences. Furthermore, we add residual connections to char-CNN-9 as char-ResNet-9, which confirms that residual connections can help train deep layers. We further improve char-ResNet-9 by changing residual connections into dense connection blocks as char-DenseNet-9, which shows that dense connections are better than residual connections for learning word shape information.
Our proposed character-to-word models, char-IntNet-5 and char-IntNet-9, generally improve the results across all datasets. Our IntNet significantly outperforms other character embedding models; for example, the improvement is more than 2% in terms of F1 score for German and Dutch. Also, we observe that char-IntNet-5 is more effective for learning character-to-word representations than char-IntNet-9 in most cases. The only exception is German, which seems to require a deeper and wider model for learning better representations.

State-of-the-art Results
Table 2 presents our proposed model in comparison with state-of-the-art results. LSTM-CRF is our baseline which uses fine-tuned pre-trained word embeddings. Its comparison with LSTM-CRF using random initializations for word embeddings, as shown in Table 1, confirms that pre-trained word embeddings are useful for sequence labeling. Since the training corpus for sequence labeling is relatively small, pre-trained embeddings learned from a huge unlabeled corpus can help to enhance word semantics. Furthermore, we adopt and re-implement two state-of-the-art character models, char-LSTM and char-CNN, by combining them with LSTM-CRF, which we refer to as LSTM-CRF-char-LSTM and LSTM-CRF-char-CNN. Lastly, we combine our proposed model with LSTM-CRF, which we refer to as LSTM-CRF-char-IntNet-9 and LSTM-CRF-char-IntNet-5. These experiments show that our char-IntNet generally improves results across different models and datasets. The improvement is more pronounced for non-English datasets; for example, IntNet improves the F1 score over the state-of-the-art results by more than 2% for Dutch and Spanish. It also shows that the results of LSTM-CRF are significantly improved after adding character-to-word models, which confirms that word shape information is very important for sequence labeling. Figure 2 presents the details of training epochs in comparison with other state-of-the-art character models for different languages. It shows that char-CNN and char-LSTM converge early, whereas char-IntNet takes more epochs to converge and generally performs better. This suggests that IntNet is less prone to overfitting, since we used early stopping while training.

Rare and OOV Words Analysis
Another advantage of learning the internal structure of words is that it can capture representations for out-of-vocabulary (OOV) words. To better understand the behavior of IntNet, Table 3 presents error analysis on in-vocabulary words (IV), out-of-training-vocabulary words (OOTV), out-of-embedding-vocabulary words (OOEV), and out-of-both-vocabulary words (OOBV) compared to different character models. The results show that our proposed model significantly outperforms other character models on OOV words including OOTV, OOEV, and OOBV. For example, in the OOBV category, our IntNet outperforms other models by more than 3% in terms of F1 score for the Dutch and German datasets.
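The four categories can be expressed as a small membership test. The sketch assumes the conventional reading of these terms: a word is IV if it appears in both the training vocabulary and the pre-trained embedding vocabulary, OOTV if it is missing only from the training set, OOEV if it is missing only from the embeddings, and OOBV if it is missing from both. The function and variable names are illustrative.

```python
def oov_category(word, train_vocab, embed_vocab):
    """Classify a test word by membership in the training and embedding vocabularies."""
    in_train = word in train_vocab
    in_embed = word in embed_vocab
    if in_train and in_embed:
        return "IV"
    if in_embed:
        return "OOTV"      # seen in the embeddings, unseen in training
    if in_train:
        return "OOEV"      # seen in training, no pre-trained embedding
    return "OOBV"          # unseen in both

train_vocab = {"the", "Thursday", "IntNet"}
embed_vocab = {"the", "Thursday", "month"}
labels = [oov_category(w, train_vocab, embed_vocab)
          for w in ("the", "month", "IntNet", "11-month")]
```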
Furthermore, we present comparisons of nearest neighbors with different models for frequent words, rare words, and OOV words. Table 4 shows the results of nearest neighbors for learning word shape information, which gives insights on what kind of character-to-word representations can be learned by different models. For example, among OOV words, our IntNet model learns a better xx-month shape pattern when matching 11-month compared to other models.

Discussion
In many situations, learning character-to-word representations of subword sequences that exceed the typical length of word shape patterns or morpheme sequences might result in noise. RNNs can capture longer sequences in theory; however, longer sequences do not guarantee better results when learning prefixes and suffixes. The funnel-shaped wide architecture of IntNet uses different kernels with different levels of concatenation to capture patterns of different subword lengths, which is more flexible than char-LSTM and char-CNN.
For example, Table 4 shows that for Thursday among the OOV words, our model learns a better word-shape structure for character-to-word representations compared to other methods.
When considering training time, IntNet is only 20% slower than char-CNN for the whole training process. Also, learning word representations uses fewer parameters than learning sentence representations. Therefore, the impact on training speed for sequence labeling is limited. The inference time of IntNet is almost the same as that of char-CNN.

Conclusion
We presented empirical comparisons of different character embedding models for learning character-to-word representations and investigated the gaps between methods for learning representations of words and sentences. We conducted detailed experiments on different state-of-the-art convolutional models, and explored the advantages and disadvantages of their components for learning word shape information. Furthermore, we presented IntNet, a funnel-shaped wide convolutional neural architecture with no down-sampling that focuses on learning a better internal structure of words by composing their characters from limited supervised training corpora. Our in-depth analysis showed that a shallow wide architecture is better than a narrow deep architecture for learning character-to-word representations. Omitting down-sampling operations is useful for capturing the sequential patterns of prefixes and suffixes. Our proposed compositional character-to-word model does not leverage any external resources, hand-crafted features, additional knowledge, joint training, or character-level preprocessing, and achieves new state-of-the-art performance for various sequence labeling tasks, including named entity recognition, part-of-speech tagging and syntactic chunking. In the future, we would like to explore using the IntNet model for other NLP tasks.