Syllable-aware Neural Language Models: A Failure to Beat Character-aware Ones

Syllabification does not seem to improve word-level RNN language modeling quality when compared to character-based segmentation. However, our best syllable-aware language model, achieving performance comparable to the competitive character-aware model, has 18%-33% fewer parameters and is trained 1.2-2.2 times faster.


Introduction
Recent advances in neural language modeling (NLM) are connected with character-aware models (Kim et al., 2016; Ling et al., 2015b; Verwimp et al., 2017). This is a promising approach, and we propose the following direction related to it: we would like to make sure that, in the pursuit of the most fine-grained representations, one has not missed possible intermediate ways of segmentation, e.g., by syllables. Syllables, in our opinion, are better supported as linguistic units of language than single characters. In most languages, words can be naturally split into syllables:
ES: el par-la-men-to a-po-yó la en-mien-da
RU: пар-ла-мент под-дер-жал по-прав-ку
(EN: the parliament supported the amendment)
Based on this observation, we attempted to determine whether syllable-aware NLM has any advantages over character-aware NLM. We experimented with a variety of models but could not find any evidence to support this hypothesis: splitting words into syllables does not seem to improve the language modeling quality when compared to splitting into characters. However, there are some positive findings: while our best syllable-aware language model achieves performance comparable to the competitive character-aware model, it has 18%-33% fewer parameters and is 1.2-2.2 times faster to train.

Related Work
Much research has been done on subword-level and subword-aware neural language modeling when subwords are characters (Ling et al., 2015b; Kim et al., 2016; Verwimp et al., 2017) or morphemes (Botha and Blunsom, 2014; Qiu et al., 2014; Cotterell and Schütze, 2015). However, not much work has been done on syllable-level or syllable-aware NLM. Mikolov et al. (2012) show that subword-level language models outperform character-level ones. They keep the most frequent words untouched and split all other words into syllable-like units. Our approach differs mainly in the following aspects: we make predictions at the word level, use a more linguistically sound syllabification algorithm, and consider a variety of more advanced neural architectures.
We have recently come across a concurrent paper (Vania and Lopez, 2017) in which the authors systematically compare different subword units (characters, character trigrams, BPE (Sennrich et al., 2016), morphemes) and different representation models (CNN, Bi-LSTM, summation) on languages with different morphological typologies. However, they do not consider syllables, and they experiment with relatively small models on small data sets (0.6M-1.4M tokens).

Syllable-aware word embeddings
Let $\mathcal{W}$ and $\mathcal{S}$ be finite vocabularies of words and syllables, respectively. We assume that both words and syllables have already been converted into indices. Let $E_S \in \mathbb{R}^{|\mathcal{S}| \times d_S}$ be an embedding matrix for syllables, i.e., a matrix in which the $s$th row (denoted as $\mathbf{s}$) corresponds to an embedding of the syllable $s \in \mathcal{S}$. Any word $w \in \mathcal{W}$ is a sequence of its syllables $(s_1, s_2, \ldots, s_{n_w})$, and hence can be represented as a sequence of the corresponding syllable vectors:
$[\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_{n_w}]$. (1)
The question is: How shall we pack the sequence (1) into a single vector $\mathbf{x} \in \mathbb{R}^{d_W}$ to produce a better embedding of the word $w$? In our case "better" means "better than a character-aware embedding of $w$ via the Char-CNN model of Kim et al. (2016)". Below we present several viable approaches.
Figure 1: Syllable-aware language model.

Recurrent sequential model (Syl-LSTM)
Since the syllables come in a sequence, it is natural to try a recurrent sequential model:
$\mathbf{h}_t = f(\mathbf{s}_t, \mathbf{h}_{t-1})$, (2)
which converts the sequence of syllable vectors (1) into a sequence of state vectors $\mathbf{h}_{1:n_w}$. The last state vector $\mathbf{h}_{n_w}$ is assumed to contain the information on the whole sequence (1), and is therefore used as a word embedding for $w$. There is a wide variety of transformations from which one can choose $f$ in (2); however, a recent thorough evaluation (Jozefowicz et al., 2015) shows that the LSTM (Hochreiter and Schmidhuber, 1997) with its forget bias initialized to 1 outperforms other popular architectures on almost all tasks, so we decided to use it for our experiments. We will refer to this model as Syl-LSTM.
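To make the idea of using the last recurrent state as a word embedding concrete, the following NumPy sketch folds a standard LSTM cell over the syllable vectors of one word. It is only an illustration under stated assumptions: the gate ordering, the sizes, and the helper names `lstm_step` and `syl_lstm_embed` are hypothetical and are not taken from the released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(s_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the input, forget, output and
    candidate blocks (this ordering is an assumption of the sketch)."""
    d = h_prev.shape[0]
    z = W @ s_t + U @ h_prev + b           # shape (4 * d,)
    i = sigmoid(z[0:d])                    # input gate
    f = sigmoid(z[d:2 * d])                # forget gate
    o = sigmoid(z[2 * d:3 * d])            # output gate
    g = np.tanh(z[3 * d:4 * d])            # candidate cell state
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def syl_lstm_embed(syllable_vectors, W, U, b):
    """Run the LSTM over the syllable vectors and return the last state."""
    d = U.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    for s_t in syllable_vectors:
        h, c = lstm_step(s_t, h, c, W, U, b)
    return h

rng = np.random.default_rng(0)
d_S, d_W = 50, 64                          # illustrative sizes, not tuned values
W = rng.uniform(-0.05, 0.05, (4 * d_W, d_S))
U = rng.uniform(-0.05, 0.05, (4 * d_W, d_W))
b = np.zeros(4 * d_W)
b[d_W:2 * d_W] = 1.0                       # forget bias initialized to 1
word = rng.normal(size=(3, d_S))           # a word with three syllables
x = syl_lstm_embed(word, W, U, b)          # word embedding of size d_W
```

The only state carried between syllables is the pair (h, c), so words of any length map to a vector of the same size.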

Convolutional model (Syl-CNN)
Inspired by recent work on character-aware neural language models (Kim et al., 2016), we decided to try this approach (Char-CNN) on syllables. Our case differs mainly in the following two aspects:
1. The set of syllables $\mathcal{S}$ is usually bigger than the set of characters $\mathcal{C}$, and the dimensionality $d_S$ of syllable vectors is expected to be greater than the dimensionality $d_C$ of character vectors. Both of these factors result in allocating more parameters to syllable embeddings than to character embeddings.
2. On average, a word contains fewer syllables than characters, and therefore we need narrower convolutional filters for syllables. This results in spending fewer parameters per convolution.
This means that by varying $d_S$ and the maximum width of convolutional filters $L$ we can still fit the parameter budget of Kim et al. (2016) to allow a fair comparison of the models.
As in Char-CNN, our syllable-aware model, which is referred to as Syl-CNN-[L], utilizes max-pooling and highway layers (Srivastava et al., 2015) to model interactions between the syllables. The dimensionality of a highway layer is denoted by $d_{HW}$.
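A minimal NumPy sketch of this kind of convolutional embedding is given below, using the pre-selection settings from Appendix A for L = 3 (filter widths 1 to L, filter depths c·l, tanh activation, ReLU highway layer). The zero-padding of short words, the sigmoid transform gate, and the helper names `syl_cnn_features` and `highway` are assumptions made for the illustration, not a description of the released implementation.

```python
import numpy as np

def syl_cnn_features(S, filters):
    """S: (n_w, d_S) syllable vectors. filters: list of (kernels, bias) pairs,
    kernels of shape (depth, width, d_S). Returns max-over-time features."""
    n_w, d_S = S.shape
    max_width = max(k.shape[1] for k, _ in filters)
    if n_w < max_width:                        # assumption: zero-pad short words
        S = np.vstack([S, np.zeros((max_width - n_w, d_S))])
    pooled = []
    for kernels, bias in filters:
        width = kernels.shape[1]
        windows = np.stack([S[t:t + width]
                            for t in range(S.shape[0] - width + 1)])
        # windows: (positions, width, d_S) -> conv: (positions, depth)
        conv = np.tanh(np.einsum("pwd,kwd->pk", windows, kernels) + bias)
        pooled.append(conv.max(axis=0))        # max-pooling over positions
    return np.concatenate(pooled)

def highway(x, W_H, b_H, W_T, b_T):
    """Highway layer: gated mix of a ReLU transform and the input itself."""
    t = 1.0 / (1.0 + np.exp(-(W_T @ x + b_T)))
    return t * np.maximum(0.0, W_H @ x + b_H) + (1.0 - t) * x

rng = np.random.default_rng(0)
d_S, L, c = 50, 3, 60
filters = [(rng.uniform(-0.05, 0.05, (c * l, l, d_S)), np.zeros(c * l))
           for l in range(1, L + 1)]
d_HW = c * sum(range(1, L + 1))                # 60 * (1 + 2 + 3) = 360
word = rng.normal(size=(2, d_S))               # a two-syllable word
feats = syl_cnn_features(word, filters)
x = highway(feats,
            rng.uniform(-0.05, 0.05, (d_HW, d_HW)), np.zeros(d_HW),
            rng.uniform(-0.05, 0.05, (d_HW, d_HW)), np.zeros(d_HW))
```

Because pooling takes a maximum over positions, the output size depends only on the filter depths and not on the number of syllables in the word.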

Linear combinations
We also considered using linear combinations of syllable vectors to represent the word embedding:
$\mathbf{x} = \sum_{t=1}^{n_w} \alpha_t(s_t) \cdot \mathbf{s}_t$. (3)
The choice of $\alpha_t$ is motivated mainly by the existing approaches (discussed below), which proved to be successful for other tasks.
Syl-Sum: Summing up syllable vectors to get a word vector can be obtained by setting $\alpha_t(s_t) = 1$. This approach was used by Botha and Blunsom (2014) to combine a word and its morpheme embeddings into a single word vector.

Syl-Avg:
A simple average of syllable vectors can be obtained by setting $\alpha_t(s_t) = 1/n_w$. This can also be called a "continuous bag of syllables" by analogy with the CBOW model (Mikolov et al., 2013), where vectors of neighboring words are averaged to get a word embedding of the current word.

Syl-Avg-A:
We let the weights $\alpha_t$ in (3) be a function of parameters $(a_1, \ldots, a_n)$ of the model, which are trained jointly with the other parameters. Here $n = \max_w \{n_w\}$ is the maximum word length in syllables. In order to have a weighted average in (3), we apply a softmax normalization to these parameters: $\alpha_t = \mathrm{softmax}(\mathbf{a})_t$.

Syl-Avg-B:
We can let $\alpha_t$ depend on both the syllables and their positions, using a set of parameters that determine the importance of each syllable type in each (relative) position, together with a bias $\mathbf{b} \in \mathbb{R}^n$ that is conditioned only on the relative position. This approach is motivated by recent work on using an attention mechanism in the CBOW model (Ling et al., 2015a). We feed the resulting $\mathbf{x}$ from (3) into a stack of highway layers to allow interactions between the syllables.
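The simpler weighting schemes (Syl-Sum, Syl-Avg, Syl-Avg-A) can be sketched in a few lines of NumPy. The function names are hypothetical, and for Syl-Avg-A the sketch assumes that the softmax is taken over the first n_w position parameters so that the weights of a word sum to one; the exact normalization range should be checked against the released code.

```python
import numpy as np

def syl_sum(S):
    """Syl-Sum: every syllable vector gets weight 1."""
    return S.sum(axis=0)

def syl_avg(S):
    """Syl-Avg: every syllable vector gets weight 1 / n_w."""
    return S.mean(axis=0)

def syl_avg_a(S, a):
    """Syl-Avg-A: position-dependent weights from trainable parameters a.
    Assumption: softmax over the first n_w entries of a."""
    n_w = S.shape[0]
    logits = a[:n_w]
    alpha = np.exp(logits - logits.max())      # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ S

rng = np.random.default_rng(0)
S = rng.normal(size=(3, 8))                    # 3 syllables, toy dimension 8
a = np.zeros(5)                                # n = 5 positions, untrained
x_sum, x_avg, x_a = syl_sum(S), syl_avg(S), syl_avg_a(S, a)
```

With all position parameters equal, Syl-Avg-A reduces to the plain average, which makes Syl-Avg a special case of the trainable variant.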

Concatenation (Syl-Concat)
In this model we simply concatenate syllable vectors (1) into a single word vector:
$\mathbf{x} = [\mathbf{s}_1; \mathbf{s}_2; \ldots; \mathbf{s}_{n_w}; \mathbf{0}; \mathbf{0}; \ldots; \mathbf{0}]$.
We zero-pad $\mathbf{x}$ so that all word vectors have the same length $n \cdot d_S$ to allow batch processing, and then we feed $\mathbf{x}$ into a stack of highway layers.
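The concatenation with zero-padding can be illustrated as follows. This is a sketch only: the name `syl_concat` is hypothetical, and the highway layers that follow the concatenation are omitted.

```python
import numpy as np

def syl_concat(S, n):
    """Concatenate the rows of S (n_w, d_S) and zero-pad to length n * d_S,
    where n is the maximum word length in syllables."""
    n_w, d_S = S.shape
    if n_w > n:
        raise ValueError("word has more syllables than the maximum n")
    x = np.zeros(n * d_S)
    x[:n_w * d_S] = S.reshape(-1)              # s_1; s_2; ...; s_{n_w}
    return x

S = np.arange(6, dtype=float).reshape(2, 3)    # 2 syllables, toy d_S = 3
x = syl_concat(S, n=4)                         # fixed length 4 * 3 = 12
```

Unlike the linear combinations, this representation keeps the order of the syllables, since each position occupies its own slice of the word vector.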

Word-level language model
Once we have word embeddings $\mathbf{x}_{1:k}$ for a sequence of words $w_{1:k}$, we can use a word-level RNN language model to produce a sequence of states $\mathbf{h}_{1:k}$ and then predict the next word according to the probability distribution
$\Pr(w_{t+1} \mid w_{1:t}) = \mathrm{softmax}(\mathbf{h}_t W + \mathbf{b})$,
where $W \in \mathbb{R}^{d_{LM} \times |\mathcal{W}|}$, $\mathbf{b} \in \mathbb{R}^{|\mathcal{W}|}$, and $d_{LM}$ is the hidden layer size of the RNN. Training the model involves minimizing the negative log-likelihood over the corpus $w_{1:K}$:
$-\sum_{t=1}^{K} \log \Pr(w_t \mid w_{1:t-1})$. (5)
As was mentioned in Section 3.1, there is a huge variety of RNN architectures to choose from. The most advanced recurrent neural architectures, at the time of this writing, are recurrent highway networks (Zilly et al., 2017) and a novel model which was obtained through a neural architecture search with reinforcement learning (Zoph and Le, 2017). These models can be combined with the most recent regularization techniques for RNNs (Gal and Ghahramani, 2016) to reach state-of-the-art results. However, to make our results directly comparable to those of Kim et al. (2016), we select a two-layer LSTM and regularize it as in Zaremba et al. (2014).
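The training objective and the perplexity (PPL) reported in our tables are related in the standard way: perplexity is the exponential of the average negative log-likelihood per token. The sketch below computes both from a matrix of RNN states; the function names and the toy sizes are assumptions for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def negative_log_likelihood(H, W, b, targets):
    """H: (K, d_LM) states, W: (d_LM, |W|), b: (|W|,),
    targets: (K,) indices of the words to be predicted."""
    log_probs = log_softmax(H @ W + b)
    return -log_probs[np.arange(len(targets)), targets].sum()

def perplexity(total_nll, num_tokens):
    """Exponential of the average per-token negative log-likelihood."""
    return float(np.exp(total_nll / num_tokens))

# Toy check: with zero weights the model is uniform over the vocabulary,
# so its perplexity equals the vocabulary size.
vocab_size, d_LM, K = 10, 8, 4
H = np.zeros((K, d_LM))
W = np.zeros((d_LM, vocab_size))
b = np.zeros(vocab_size)
targets = np.array([1, 2, 3, 4])
nll = negative_log_likelihood(H, W, b, targets)
ppl = perplexity(nll, K)
```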

Experimental Setup
We search for the best model in two steps: first, we fix the word-level LSTM's architecture and pre-select the three best models under a small parameter budget (5M), and then we tune the hyperparameters of these three models under a larger budget (20M).

Pre-selection:
We fix $d_{LM}$ (hidden layer size of the word-level LSTM) at 300 units per layer and run each syllable-aware word embedding method from Section 3 on the English PTB data set (Marcus et al., 1993), keeping the total parameter budget at 5M. The architectural choices are specified in Appendix A.
Hyperparameter tuning: The hyperparameters of the three best-performing models from the pre-selection step are then thoroughly tuned on the same English PTB data through a random search according to the marginal distributions:
• … (160), log(2000)),
• $\log(d_{LM}) \sim U(\log(300), \log(2000))$,
with the restriction $d_S < d_{LM}$. The total parameter budget is kept at 20M to allow for easy comparison to the results of Kim et al. (2016). Then these three best models (with their hyperparameters tuned on PTB) are trained and evaluated on small (DATA-S) and medium-sized (DATA-L) data sets in six languages.
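A random search of this kind can be sketched with the standard library alone. The range for d_LM follows the text; the range passed for d_S below is a placeholder chosen for the example, not the range used in our experiments, and the check of the 20M parameter budget is left out. The names `log_uniform` and `sample_config` are hypothetical.

```python
import math
import random

def log_uniform(rng, low, high):
    """Draw x such that log(x) is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_config(rng, d_s_range, d_lm_range=(300, 2000), max_tries=1000):
    """Rejection-sample one configuration satisfying d_S < d_LM."""
    for _ in range(max_tries):
        d_s = round(log_uniform(rng, *d_s_range))
        d_lm = round(log_uniform(rng, *d_lm_range))
        if d_s < d_lm:
            return {"d_S": d_s, "d_LM": d_lm}
    raise RuntimeError("no configuration satisfied d_S < d_LM")

rng = random.Random(0)
# (50, 500) is a placeholder range for d_S, used only for illustration.
configs = [sample_config(rng, d_s_range=(50, 500)) for _ in range(5)]
```

Sampling on a log scale spreads the trials evenly across orders of magnitude, so small and large layer sizes are explored equally often.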
Optimization is performed in almost the same way as in the work of Zaremba et al. (2014). See Appendix B for details.

Syllabification:
The true syllabification of a word requires its grapheme-to-phoneme conversion and then splitting it into syllables based on some rules. Since these are not always available for less-resourced languages, we decided to utilize Liang's widely used hyphenation algorithm (Liang, 1983).

Results
The results of the pre-selection are reported in Table 1. All syllable-aware models comfortably outperform the Char-CNN when the budget is limited to 5M parameters. Surprisingly, a pure word-level model, LSTM-Word (in which words are directly embedded into $\mathbb{R}^{d_W}$ through an embedding matrix $E_W \in \mathbb{R}^{|\mathcal{W}| \times d_W}$), also beats the character-aware one under such a budget. The three best configurations are Syl-Concat, Syl-Sum, and Syl-CNN-3 (hereinafter referred to as Syl-CNN), and tuning their hyperparameters under a 20M parameter budget gives the architectures in Table 2. The results of evaluating these three models on small (1M tokens) and medium-sized (17M-57M tokens) data sets against Char-CNN for different languages are provided in Table 3. Syl-CNN results on DATA-L are not reported since computational resources were insufficient to run these configurations. The models demonstrate similar performance on small data, but Char-CNN scales significantly better on medium-sized data. Of the three syllable-aware models, Syl-Concat looks the most advantageous, as it demonstrates stable results and has the fewest parameters. Therefore, in what follows we make a more detailed comparison of Syl-Concat with Char-CNN.
Table 3: Evaluation of the syllable-aware models against Char-CNN. In each case the smallest model, Syl-Concat, has 18%-33% fewer parameters than Char-CNN and is trained 1.2-2.2 times faster (Appendix C).

Shared errors:
It is interesting to see whether Char-CNN and Syl-Concat are making similar errors. We say that a model gives an error if it assigns a probability less than $p^*$ to a correct word from the test set. Figure 2 shows the percentage of errors which are shared by Syl-Concat and Char-CNN depending on the value of $p^*$. We see that the vast majority of errors are shared by both models even when $p^*$ is small (0.01).
PPL breakdown by token frequency: To find out how Char-CNN outperforms Syl-Concat, we partition the test sets by token frequency, as computed on the training data. We can observe in Figure 3 that, on average, the more frequent the word is, the bigger the advantage of Char-CNN over Syl-Concat. The more Char-CNN sees a word in different contexts, the more it can learn about this word (due to its powerful CNN filters). Syl-Concat, on the other hand, has limitations: it cannot see below syllables, which prevents it from extracting the same amount of knowledge about the word.
PCA of word embeddings: The intrinsic advantage of Char-CNN over Syl-Concat is also supported by the following experiment: we took word embeddings produced by both models on the English PTB, and applied PCA to them. Regardless of the threshold percentage of variance to retain, the embeddings from Char-CNN always have more principal components than the embeddings from Syl-Concat (see Table 4). This means that Char-CNN embeds words into a higher-dimensional space than Syl-Concat, and thus can better distinguish them in different contexts.
LSTM limitations: During hyperparameter tuning we noticed that increasing $d_S$, $d_{HW}$ and $d_{LM}$ from the optimal values (in Table 2) did not result in better performance for Syl-Concat. Could it be due to the limitations of the word-level LSTM (the topmost layer in Fig. 1)?
To find out whether this was the case, we replaced the LSTM with a Variational RHN (Zilly et al., 2017), and that resulted in a significant reduction of perplexities on PTB for both Char-CNN and Syl-Concat (Table 5). Moreover, increasing $d_{LM}$ from 439 to 650 did result in better performance for Syl-Concat. Optimization details are given in Appendix B.
Comparing syllable and morpheme embeddings: It is interesting to compare morphemes and syllables. We trained Morfessor 2.0 (Creutz and Lagus, 2007) in its default configuration on the PTB training data and used it instead of the syllabifier in our models. Interestingly, we got ≈3K unique morphemes, whereas the number of unique syllables was ≈6K. We then trained all our models on PTB under a 5M parameter budget, keeping the state size of the word-level LSTM at 300 (as in our pre-selection step for syllable-aware models). The reduction in the number of subword types allowed us to give them higher dimensionality $d_M = 100$ (cf. $d_S = 50$). Convolutional (Morph-CNN-3) and additive (Morph-Sum) models performed better than others, with test set PPLs of 83.0 and 83.9, respectively. Due to the limited amount of time, we did not perform a thorough hyperparameter search under the 20M budget. Instead, we ran two configurations for Morph-CNN-3 and two configurations for Morph-Sum with hyperparameters close to those that were optimal for Syl-CNN-3 and Syl-Sum, respectively. All told, our best morpheme-aware model is Morph-Sum with $d_M = 550$, $d_{HW} = 1100$, $d_{LM} = 550$, and test set PPL 79.5, which is practically the same as the result of our best syllable-aware model. This makes Morph-Sum a notable alternative to Char-CNN and Syl-Concat, and we defer its thorough study to future work.
Source code: The source code for the models discussed in this paper is available at https://github.com/zh3nis/lstm-syl.
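The shared-error statistic and the PCA comparison described in this section can be approximated as in the sketch below. Two choices are assumptions of the sketch rather than statements about our analysis: the shared-error percentage is taken relative to the union of both models' errors, and the number of principal components is counted as the smallest number whose cumulative explained variance reaches the threshold. The function names are hypothetical.

```python
import numpy as np

def shared_error_rate(p_a, p_b, p_star):
    """Percentage of errors made by both models, where an error means that
    the probability of the correct word is below p_star.
    Assumption: the denominator is the union of both models' errors."""
    err_a = np.asarray(p_a) < p_star
    err_b = np.asarray(p_b) < p_star
    union = np.logical_or(err_a, err_b).sum()
    shared = np.logical_and(err_a, err_b).sum()
    return 100.0 * shared / union if union else 0.0

def num_components(X, variance_to_retain):
    """Smallest number of principal components of the rows of X whose
    cumulative explained variance reaches the given fraction."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(ratio, variance_to_retain)) + 1
    return min(k, len(s))

# Toy inputs: probabilities assigned to the correct test words by two models.
p_model_a = [0.5, 0.001, 0.001, 0.2]
p_model_b = [0.5, 0.001, 0.2, 0.001]
rate = shared_error_rate(p_model_a, p_model_b, p_star=0.01)

# Toy embeddings lying on a single line need one principal component.
X = np.outer(np.arange(10.0), np.array([1.0, 2.0, 3.0]))
k = num_components(X, 0.99)
```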

A Pre-selection
In all models with highway layers there are two of them, and the non-linear activation of any highway layer is a ReLU.
LSTM-Word: $d_W = 108$, $d_{LM} = 300$.
Syl-LSTM: $d_S = 50$, $d_{LM} = 300$.
Syl-CNN-[L]: $d_S = 50$, convolutional filter widths are $[1, \ldots, L]$, the corresponding convolutional filter depths are $[c \cdot l]_{l=1}^{L}$, $d_{HW} = c \cdot (1 + \ldots + L)$. We experimented with $L = 2, 3, 4$. The corresponding values of $c$ are chosen to be 120, 60, 35 to fit the total parameter budget. CNN activation is tanh.
Linear combinations: We give higher dimensionality to syllable vectors here (compared to other models) since the resulting word vector will have the same size as syllable vectors (see (3)

B Optimization
LSTM-based models: We perform the training (5) by truncated BPTT (Werbos, 1990; Graves, 2013). We backpropagate for 70 time steps on DATA-S and for 35 time steps on DATA-L using stochastic gradient descent, where the learning rate is initially set to 1.0 and halved if the perplexity does not decrease on the validation set after an epoch. We use batch sizes of 20 for DATA-S and 100 for DATA-L. We train for 50 epochs on DATA-S and for 25 epochs on DATA-L, picking the best-performing model on the validation set. Parameters of the models are randomly initialized uniformly in [−0.05, 0.05], except the forget bias of the word-level LSTM, which is initialized to 1. For regularization we use dropout (Srivastava et al., 2014) with probability 0.5 between word-level LSTM layers and on the hidden-to-output softmax layer. We clip the norm of the gradients (normalized by minibatch size) at 5. These choices were guided by previous work on word-level language modeling with LSTMs (Zaremba et al., 2014).
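Two of the ingredients above, the learning-rate schedule and gradient norm clipping, can be sketched as follows. The sketch assumes that the validation perplexity is compared with that of the previous epoch and that clipping is applied to the global norm over all parameter gradients; both function names are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so that their joint L2 norm
    does not exceed max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / total) if total > 0 else 1.0
    return [g * scale for g in grads]

def next_learning_rate(lr, val_ppl, prev_val_ppl):
    """Halve the learning rate if validation perplexity did not decrease.
    Assumption: comparison is against the previous epoch."""
    return lr * 0.5 if val_ppl >= prev_val_ppl else lr

grads = [np.array([30.0, 40.0])]               # joint norm 50
clipped = clip_by_global_norm(grads)           # rescaled to joint norm 5
lr = next_learning_rate(1.0, val_ppl=100.0, prev_val_ppl=90.0)
```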
To speed up training on DATA-L we use a sampled softmax (Jean et al., 2015) with the number of samples equal to 20% of the vocabulary size (Chen et al., 2016). Although Kim et al. (2016) used a hierarchical softmax (Morin and Bengio, 2005) for the same purpose, a recent study (Grave et al., 2016) shows that it is outperformed by sampled softmax on the Europarl corpus, from which DATA-L was derived (Botha and Blunsom, 2014).
RHN-based models are optimized as in Zilly et al. (2017), except that we unrolled the networks for 70 time steps in truncated BPTT, and dropout rates were chosen to be as follows: 0.2 for the embedding layer, 0.7 for the input to the gates, 0.7 for the hidden units and 0.2 for the output activations.

C Sizes and speeds
On DATA-S, Syl-Concat has 28%-33% fewer parameters than Char-CNN, and on DATA-L the reduction is 18%-27% (see Fig. 4). Training speeds are provided in Table 6. Models were implemented in TensorFlow and were run on an NVIDIA Titan X (Pascal).