Improving Neural Language Models by Segmenting, Attending, and Predicting the Future

Common language models typically predict the next word given the context. In this work, we propose a method that improves language modeling by learning to align the given context and the following phrase. The model does not require any linguistic annotation of phrase segmentation. Instead, we define syntactic heights and phrase segmentation rules, enabling the model to automatically induce phrases, recognize their task-specific heads, and generate phrase embeddings in an unsupervised learning manner. Our method can easily be applied to language models with different network architectures since an independent module is used for phrase induction and context-phrase alignment, and no change is required in the underlying language modeling network. Experiments have shown that our model outperformed several strong baseline models on different data sets. We achieved a new state-of-the-art performance of 17.4 perplexity on the Wikitext-103 dataset. Additionally, visualizing the outputs of the phrase induction module showed that our model is able to learn approximate phrase-level structural knowledge without any annotation.


Introduction
Neural language models are typically trained by predicting the next word given a past context (Bengio et al., 2003). However, natural sentences are not constructed as simple linear word sequences, as they usually contain complex syntactic information. For example, a subsequence of words can constitute a phrase, and two non-neighboring words can depend on each other. These properties make natural sentences more complex than simple linear sequences.
Most recent work on neural language modeling learns a model by encoding contexts and matching the context embeddings to the embedding of the next word (Bengio et al., 2003; Merity et al., 2017; Melis et al., 2017). In this line of work, a given context is encoded with a neural network, for example a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) network, and is represented with a distributed vector. The log-likelihood of predicting a word is computed by calculating the inner product between the word embedding and the context embedding. Although most models do not explicitly consider syntax, they still achieve state-of-the-art performance on different corpora. Efforts have also been made to utilize structural information to learn better language models. For instance, parsing-reading-predict networks (PRPN; Shen et al., 2017) explicitly learn a constituent parsing structure of a sentence and predict the next word considering the internal structure of the given context with an attention mechanism. Experiments have shown that the model is able to capture some syntactic information.
Similar to word representation learning models that learn to match word-to-word relation matrices (Mikolov et al., 2013; Pennington et al., 2014), standard language models are trained to factorize context-to-word relation matrices (Yang et al., 2017). In such work, the context comprises all previous words observed by a model for predicting the next word. However, we believe that context-to-word relation matrices are not sufficient for describing how natural sentences are constructed. We argue that natural sentences are generated at a higher level before being decoded to words. Hence a language model should be able to predict the following sequence of words given a context. In this work, we propose a model that factorizes a context-to-phrase mutual information matrix to learn better language models. The context-to-phrase mutual information matrix describes the relation among contexts and the probabilities of phrases following given contexts. We make the following contributions in this paper:
• We propose a phrase prediction model that improves the performance of state-of-the-art word-level language models.
• Our model learns to predict approximate phrases and headwords without any annotation.

Related Work
Neural networks have been widely applied in natural language modeling and generation (Bengio et al., 2003; Bahdanau et al., 2014) for both encoding and decoding. Among different neural architectures, the most popular models are recurrent neural networks (RNNs; Mikolov et al., 2010), long short-term memory networks (LSTMs; Hochreiter and Schmidhuber, 1997), and convolutional neural networks (CNNs; Bai et al., 2018; Dauphin et al., 2017). Many modifications of network structures have been made based on these architectures. LSTMs with self-attention can improve the performance of language modeling (Tran et al., 2016; Cheng et al., 2016). As an extension of simple self-attention, transformers (Vaswani et al., 2017) apply multi-head self-attention and have achieved competitive performance compared with recurrent neural language models. A current state-of-the-art model, Transformer-XL (Dai et al., 2018), applied both a recurrent architecture and a multi-head attention mechanism. To improve the quality of input word embeddings, character-level information is also considered (Kim et al., 2016). It has also been shown that context encoders can learn syntactic information (Shen et al., 2017).
However, instead of introducing architectural changes, for example a self-attention mechanism or character-level information, previous studies have shown that careful hyper-parameter tuning and regularization techniques on standard LSTM language models can obtain significant improvements (Melis et al., 2017; Merity et al., 2017). Similarly, applying more careful dropout strategies can also improve the language models (Gal and Ghahramani, 2016; Melis et al., 2018). LSTM language models can be improved with these approaches because LSTMs suffer from serious over-fitting problems.
Recently, researchers have also attempted to improve language models at the decoding phase. Inan et al. (2016) showed that reusing the input word embeddings in the decoder can reduce the perplexity of language models. Yang et al. (2017) showed the low-rank issue in factorizing the context-to-word mutual information matrix and proposed a multi-head softmax decoder to solve the problem. Instead of predicting the next word by using only similarities between contexts and words, the neural cache model (Grave et al., 2016) can significantly improve language modeling by considering the global word distributions conditioned on the same contexts in other parts of the corpus.
To learn the grammar and syntax in natural languages, Dyer et al. (2016) proposed the recurrent neural network grammar (RNNG) that models language incorporating a transition parsing model. Syntax annotations are required in this model. To utilize the constituent structures in language modeling without syntax annotation, parsing-reading-predict networks (PRPNs; Shen et al., 2017) calculate syntactic distances among words and compute self-attentions. Syntactic distances have been proved effective in constituent parsing tasks (Shen et al., 2018a). In this work, we learn phrase segmentation with a model based on this method and our model does not require syntax annotation.

Syntactic Height and Phrase Induction
In this work, we propose a language model that not only predicts the next word of a given context, but also attempts to match the embedding of the next phrase. The first step of this approach is conducting phrase induction based on syntactic heights. In this section, we explain the definition of syntactic height in our approach and describe the basic ideas about whether a word can be included in an induced phrase.
Intuitively, the syntactic height of a word aims to capture its distance to the root node in a dependency tree. In Figure 1, the syntactic heights are represented by the red bars. A word has high syntactic height if it has low distance to the root node.
A similar idea, named syntactic distance, is proposed by Shen et al. (2017) for constructing constituent parsing trees. We apply the method for calculating syntactic distance to calculate syntactic height. Given a sequence of embeddings of input words $[x_1, x_2, \dots, x_n]$, we calculate their syntactic heights with a temporal convolutional network (TCN) (Bai et al., 2018), where $h_i$ stands for the syntactic height of word $x_i$. The syntactic height $h_i$ for each word is a scalar, and $W_h$ is a $1 \times D$ matrix, where $D$ is the dimensionality of $d_i$, the convolutional feature vector at position $i$. These heights are learned and not imposed by external syntactic supervision. In Shen et al. (2017), the syntactic heights are used to generate context embeddings. In our work, we use the syntactic heights to predict induced phrases and calculate their embeddings.
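The height computation can be illustrated with a short Python sketch. It is not the authors' implementation: a single causal convolution stands in for the TCN, and the ReLU nonlinearity, the bias `b_h`, and the names `syntactic_heights`, `kernels`, and `w_h` are assumptions introduced here for illustration.

```python
def relu(vector):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, v) for v in vector]


def syntactic_heights(embeddings, kernels, w_h, b_h=0.0):
    """Map word embeddings to one scalar height per word.

    embeddings: list of word vectors, each of length E.
    kernels:    list of D x E matrices; kernels[t] is applied to the word
                t steps in the past, so only past and current words are used.
    w_h:        list of D weights, playing the role of the 1 x D matrix W_h.
    b_h:        scalar bias (an assumption of this sketch).
    """
    zero = [0.0] * len(embeddings[0])
    heights = []
    for i in range(len(embeddings)):
        d_i = [0.0] * len(w_h)  # D-dimensional convolutional feature
        for t, kernel in enumerate(kernels):
            # Left-pad with zeros so the convolution stays causal.
            x = embeddings[i - t] if i - t >= 0 else zero
            for r, row in enumerate(kernel):
                d_i[r] += sum(w * v for w, v in zip(row, x))
        heights.append(sum(w * v for w, v in zip(w_h, relu(d_i))) + b_h)
    return heights
```

Because each height depends only on the current and preceding words, the heights can be computed left to right during language modeling without looking at future words.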
We define the phrase induced by a word based on the syntactic heights. Consider two words $x_i$ and $x_k$. $x_k$ belongs to the phrase induced by $x_i$ if and only if for any $j \in (i, k)$, $h_j < \max(h_i, h_k)$. For example, in Figure 1, the phrase induced by the red-marked word "the" is "the morning flights", since the syntactic height of the word "morning" satisfies $h_{morning} < h_{flights}$. However, the word "to" does not belong to the phrase because $h_{flights}$ is higher than both $h_{the}$ and $h_{to}$. The induced phrase and the inducing dependency connection are labeled in blue in the figure.
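The membership rule can be checked directly in code. The sketch below is a minimal illustration; the function name `in_induced_phrase` and the numeric heights are hypothetical values chosen to mimic the ordering described for Figure 1, not values taken from the paper.

```python
def in_induced_phrase(heights, i, k):
    """Return True if word k belongs to the phrase induced by word i.

    Implements the rule: for every j strictly between i and k,
    heights[j] < max(heights[i], heights[k]).
    """
    if k <= i:
        raise ValueError("k must come after i")
    bound = max(heights[i], heights[k])
    return all(heights[j] < bound for j in range(i + 1, k))


# Hypothetical heights for: United canceled the morning flights to Houston
words = ["United", "canceled", "the", "morning", "flights", "to", "Houston"]
heights = [3, 4, 1, 2, 3, 1, 2]

the = words.index("the")
members = [words[k] for k in range(the + 1, len(words))
           if in_induced_phrase(heights, the, k)]
```

With these illustrative heights, `members` contains "morning" and "flights" but not "to", because the height of "flights" exceeds the heights of both "the" and "to".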
Note that this definition of an induced phrase does not necessarily correspond to a phrase in the syntactic constituency sense. For instance, the words "to Houston" would be included in the phrase "the morning flights to Houston" in a traditional syntactic tree. Given the definition of induced phrases, we propose phrase segmenting conditions (PSCs) to find the last word of an induced phrase. Consider the induced phrase of the $i$-th word, and let $x_j$ be its last word. If $x_j$ is not the last word of a given sentence, there are two conditions that $x_j$ should satisfy:
1. (PSC-1) The syntactic height of $x_j$ must be higher than the height of $x_i$, that is, $h_j > h_i$.
2. (PSC-2) The syntactic height of $x_{j+1}$ should be lower than that of $x_j$, that is, $h_{j+1} < h_j$.
Given the PSCs, we can decide the induced phrases for the sentence shown in Figure 1. The last word of the phrase induced by "United" is "canceled", and the last word of the phrase induced by "flights" is "Houston". For the word assigned the highest syntactic height, its induced phrase is all remaining words in the sentence.
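A hard (non-probabilistic) version of the two conditions can be sketched as follows. The sketch assumes that the phrase ends at the first position satisfying both PSC-1 and PSC-2, and that the phrase runs to the end of the sentence if no position does; the function name and the heights are hypothetical.

```python
def last_word_of_induced_phrase(heights, i):
    """Index of the last word of the phrase induced by word i.

    Scans the following words and stops at the first j with
    heights[j] > heights[i] (PSC-1) and heights[j + 1] < heights[j] (PSC-2).
    If no such j exists, the phrase extends to the end of the sentence.
    """
    n = len(heights)
    if not 0 <= i < n - 1:
        raise ValueError("word i must be followed by at least one word")
    for j in range(i + 1, n - 1):
        if heights[j] > heights[i] and heights[j + 1] < heights[j]:
            return j
    return n - 1


# Hypothetical heights for: United canceled the morning flights to Houston
words = ["United", "canceled", "the", "morning", "flights", "to", "Houston"]
heights = [3, 4, 1, 2, 3, 1, 2]

ends = {w: words[last_word_of_induced_phrase(heights, i)]
        for i, w in enumerate(words[:-1])}
```

Under these illustrative heights, `ends` maps "United" to "canceled", "flights" to "Houston", and "canceled" (the highest word) to the final word of the sentence, matching the cases described above.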

Model
In this work, we formulate multi-layer neural language models as a two-part framework. For example, in a two-layer LSTM language model (Merity et al., 2017), we use the first layer as a phrase generator, which outputs the context embedding $c_i$, and the last layer as a word generator, which outputs the hidden state $y_i$. For an $L$-layer network, we can regard the first $L_1$ layers as the phrase generator and the next $L_2 = L - L_1$ layers as the word generator. Note that we use $y_i$ to represent the hidden state output by the second layer instead of $h_i$, since $h_i$ in our work is defined as the syntactic height of $x_i$. In the traditional setting, the first layer does not explicitly learn the semantics of the following phrase because there is no extra objective function for phrase learning.
In this work, we force the first layer to output context embeddings $c_i$ for phrase prediction in three steps. Firstly, we predict the induced phrase for each word according to the PSCs proposed in Section 3. Secondly, we calculate the embedding of each phrase with a head-finding attention. Lastly, we align the context embedding and phrase embedding with negative sampling. The word generation is trained in the same way as standard language models. The diagram of the model is shown in Figure 2. The three steps are described next.

Phrase Segmentation
We calculate the syntactic height $h_i$ of each word with $\mathrm{TCN}(\cdot)$, the TCN model described in Equations (1) and (2), where $n$ is the width of the convolution window, and then predict the induced phrase for each word.
Based on the proposed phrase segmenting conditions (PSCs) described in the previous section, we predict the probability of a word being the first word outside an induced phrase. Firstly, we decide if each word, $x_{j-1}$, $j \in (i + 1, n]$, satisfies the two phrase segmenting conditions, PSC-1 and PSC-2. The probability that $x_j$ satisfies PSC-1 and the probability that $x_j$ satisfies PSC-2 are both computed with $f_{HT}$, the HardTanh function with a temperature $a$. This approach is inspired by the context attention method proposed in the PRPN model (Shen et al., 2017).
Then we can infer the probability of whether a word belongs to the induced phrase of $x_i$, where $p_{ind}(x_i)$ stands for the probability that $x_i$ belongs to the induced phrase. Note that the factorization in Equation 10 assumes that words are independently likely to be included in the induced phrase of $x_i$.
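The soft segmentation can be sketched in Python under explicit assumptions. The exact formulas are not reproduced here, so the following forms are hypothetical: HardTanh with temperature $a$ is taken to be the clamp of $a \cdot x$ to $[-1, 1]$, each PSC probability is taken to be $(f_{HT}(\Delta h) + 1) / 2$, and a word is taken to be inside the phrase when no earlier word has closed it, with the factors multiplied as if independent.

```python
def hard_tanh(x, a=1.0):
    """HardTanh with temperature a: clamp a * x to [-1, 1] (assumed form)."""
    return max(-1.0, min(1.0, a * x))


def p_psc1(heights, i, j, a=1.0):
    """Soft PSC-1: word j is higher than the target word i."""
    return 0.5 * (hard_tanh(heights[j] - heights[i], a) + 1.0)


def p_psc2(heights, j, a=1.0):
    """Soft PSC-2: word j + 1 is lower than word j."""
    return 0.5 * (hard_tanh(heights[j] - heights[j + 1], a) + 1.0)


def induced_phrase_probs(heights, i, a=1.0):
    """Probability that each word after i belongs to the phrase induced by i.

    Returns one value per word i + 1, ..., n - 1. A word is inside the
    phrase if no earlier word has closed it, where word k closes the phrase
    with probability p_psc1 * p_psc2.
    """
    probs = []
    still_open = 1.0
    last = len(heights) - 1
    for k in range(i + 1, len(heights)):
        probs.append(still_open)
        if k < last:
            close = p_psc1(heights, i, k, a) * p_psc2(heights, k, a)
            still_open *= 1.0 - close
    return probs
```

With a large temperature the probabilities saturate to 0 or 1 and the result reduces to the hard segmentation; with a smaller temperature the membership values stay fractional, which keeps the segmentation differentiable with respect to the heights.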

Phrase Embedding with Attention
Given induced phrases, we can calculate their embeddings based on syntactic heights. To calculate the embedding of phrase $s$, we calculate an attention distribution over the phrase from the syntactic heights, where $h_i$ stands for the syntactic height for word $x_i$ and $c$ is a constant real number for smoothing the attention distribution. Then we generate the phrase embedding with a linear transformation of the attention-weighted word embeddings, where $e_i$ is the word embedding of $x_i$. In training, we apply a dropout layer on $s$.
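A minimal sketch of head-finding attention is given below. The scoring rule is an assumption made for illustration: each word's score is its membership probability times its height plus the smoothing constant `c`, and scores are normalized linearly, which presumes non-negative heights. The function names and the matrix `W` are likewise hypothetical.

```python
def head_attention(heights, p_ind, c=0.1):
    """Attention weights over the words of an induced phrase.

    heights: syntactic heights of the candidate words (assumed >= 0).
    p_ind:   probability that each candidate word is inside the phrase.
    c:       smoothing constant; larger values flatten the distribution.
    """
    scores = [p * (h + c) for h, p in zip(heights, p_ind)]
    total = sum(scores)
    if total <= 0.0:
        raise ValueError("attention scores must have a positive sum")
    return [s / total for s in scores]


def phrase_embedding(word_embeddings, attention, W):
    """Attention-weighted sum of word embeddings followed by a linear map W."""
    dim = len(word_embeddings[0])
    pooled = [sum(a * e[d] for a, e in zip(attention, word_embeddings))
              for d in range(dim)]
    return [sum(w * x for w, x in zip(row, pooled)) for row in W]
```

The word with the largest height inside the phrase receives the largest weight, so it acts as the headword of the phrase embedding; words with near-zero membership probability contribute almost nothing.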

Phrase and Word Prediction
A traditional language model learns the probability of a sequence of words:
$$p(x_1, x_2, \dots, x_n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1} \mid x_1^i),$$
where $x_1^i$ stands for $x_1, x_2, \dots, x_i$, which is the context used for predicting the next word, $x_{i+1}$.
In most related studies, the probability $p(x_{i+1} \mid x_1^i)$ is calculated with the output of the top layer of a neural network $y_i$ and the word representations $e_{i+1}$ learned by the decoding layer:
$$p(x_{i+1} \mid x_1^i) = \frac{\exp(y_i^\top e_{i+1})}{\sum_{w \in V} \exp(y_i^\top e_w)},$$
where $V$ is the vocabulary and $e_w$ is the representation of word $w$. The state-of-the-art neural language models contain multiple layers. The outputs of different hidden layers capture different levels of semantics of the context. In this work, we force one of the hidden layers to align its output with the embeddings of induced phrases $s_i$. We apply an embedding model similar to Mikolov et al. (2013) to train the hidden output and phrase embedding alignment. We define the context-phrase alignment model as follows.
We first define the probability that a phrase $ph_i$ can be induced by context $[x_1, \dots, x_i]$:
$$p(ph_i \mid x_1^i) = \sigma(c_i^\top s_i),$$
where $\sigma(x) = \frac{1}{1 + e^{-x}}$, and $c_i$ stands for the context embedding of $x_1, x_2, \dots, x_i$ output by a hidden layer, defined in Equation 5. $s_i$ is the generated embedding of an induced phrase. The probability that a phrase $ph_i$ cannot be induced by context $[x_1, \dots, x_i]$ is $1 - p(ph_i \mid x_1^i)$. This approach follows the method for learning word embeddings proposed in Mikolov et al. (2013).
We use an extra objective function and the negative sampling strategy to align context representations and the embeddings of induced phrases. Given the context embedding $c_i$, the induced phrase embedding $s_i$, and randomly sampled negative phrase embeddings $s_i^{neg}$, we train the neural network to maximize the likelihood of true induced phrases and minimize the likelihood of negative samples. We define an objective function for context $i$ over the true induced phrase and the negative samples, where $n$ stands for the number of negative samples. With this loss function, the model learns to maximize the similarity between the context and true induced phrase embeddings, and minimize the similarity between the context and negative samples randomly selected from the induced phrases of other words. In practice, this loss function is used as a regularization term with a coefficient $\gamma$.
It is worth noting that our approach is model-agnostic and can be applied to various architectures. The TCN network for calculating the syntactic heights and phrase inducing is an independent module. In context-phrase alignment training with negative sampling, the objective function provides phrase-aware gradients and does not change the word-by-word generation process of the language model.
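The alignment objective can be sketched in Python. The concrete loss below is an assumption: it uses the log-sigmoid negative-sampling form of Mikolov et al. (2013), averaged over the negative samples, and adds it to the language modeling loss weighted by $\gamma$. The function names are hypothetical, and the exact loss used in the experiments may differ.

```python
import math


def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def alignment_loss(context, phrase, negatives):
    """Negative-sampling loss for one context embedding.

    context:   context embedding c_i.
    phrase:    embedding s_i of the true induced phrase.
    negatives: list of n embeddings of phrases induced by other words.
    """
    positive_term = -math.log(sigmoid(dot(context, phrase)))
    # 1 - sigmoid(x) equals sigmoid(-x), which is the probability that a
    # negative phrase is not induced by this context.
    negative_term = -sum(math.log(sigmoid(-dot(context, s)))
                         for s in negatives) / len(negatives)
    return positive_term + negative_term


def total_loss(lm_loss, align_loss, gamma=0.5):
    """Language modeling loss regularized by the alignment loss."""
    return lm_loss + gamma * align_loss
```

The alignment term only adds gradients to the layer that produces the context embedding; the word prediction loss itself is unchanged, which is why the term can be attached to different architectures.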

Experiments
The Penn Treebank (PTB) dataset has a vocabulary size of 10,000 unique words. The entire corpus includes roughly 40,000 sentences in the training set, and more than 3,000 sentences in both the validation and test sets.
The Wikitext-2 (WT2) data is about twice as large as the PTB dataset. The dataset consists of Wikipedia articles. The corpus includes 30,000 unique words in its vocabulary and is not cleaned as heavily as the PTB corpus.
The Wikitext-103 (WT103) corpus contains a larger vocabulary and more articles than WT2. It consists of 28k articles and more than 100M words in the training set. The WT2 and WT103 corpora can evaluate the ability of capturing long-term dependencies (Dai et al., 2018).
In each corpus, we apply our approach to publicly-available, state-of-the-art models. This demonstrates that our approach can improve different existing architectures. Our trained models will be published for downloading. The implementation of our models is publicly available.

Penn Treebank
We train a 3-layer AWD-LSTM language model (Merity et al., 2017) on the PTB dataset. We use 1,150 as the number of hidden neurons and 400 as the size of word embeddings. We also apply the word embedding tying strategy (Inan et al., 2016). We apply variational dropout for hidden states (Gal and Ghahramani, 2016) and the dropout rate is 0.25. We also apply weight dropout (Merity et al., 2017) and set the weight dropout rate to 0.5. We apply stochastic gradient descent (SGD) and averaged SGD (ASGD; Polyak and Juditsky, 1992) for training. The learning rate is 30 and we clip the gradients with a norm of 0.25. For the phrase induction model, we randomly sample 1 negative sample for each context, and the context-phrase alignment loss is given a coefficient of 0.5. The output of the second layer of the neural network is used for learning context-phrase alignment, and the final layer is used for word generation.
We compare the word-level perplexity of our model with other state-of-the-art models and our baseline is AWD-LSTM (Merity et al., 2017). The experimental results are shown in Table 1. Although not as good as the Transformer-XL model (Dai et al., 2018) and the mixture of softmax model (Yang et al., 2017), our model significantly improved the AWD-LSTM, reducing perplexity by 2.2 points on the validation set and by 1.6 points on the test set. Note that the "finetuning" process stands for further training the language models with the ASGD algorithm (Merity et al., 2017).
We also did an ablation study without either headword attention or negative sampling (NS). The results are listed in Table 1. By simply averaging word vectors in the induced phrase without the attention mechanism, the model performs worse than the full model by 0.5 perplexity, but is still better than our baseline, the AWD-LSTM model. In the experiment without negative sampling, we only use the embedding of true induced phrases to align with the context embedding. The results also indicate that the negative sampling strategy can improve the performance by 1.1 perplexity. Hence we just test the full model in the following experiments.

Wikitext-2
We also trained a 3-layer AWD-LSTM language model on the WT2 dataset. The network has the same input size, output size, and hidden size as the model we applied on the PTB dataset, following the experiments done by Merity et al. (2017). Some hyper-parameters are different from the PTB language model. We use a batch size of 60. The embedding dropout rate is 0.65 and the dropout rate of hidden outputs is set to 0.2. Other hyper-parameters are the same as we set in training on the PTB dataset.
The experimental results are shown in Table 2. Our model improves the AWD-LSTM model by reducing perplexity by 1.7 points on both the validation and test sets, while we did not make any change to the architecture of the AWD-LSTM language model.

Wikitext-103
The current state-of-the-art language model trained on the Wikitext-103 dataset is the Transformer-XL (Dai et al., 2018). We apply our method on the state-of-the-art Transformer-XL Large model, which has 18 layers and 257M parameters. The input size and hidden size are 1024, and 16 attention heads are used. We regard the first 14 layers as the phrase generator and the last 4 layers as the word generator. In other words, the context-phrase alignment is trained with the outputs of the 14th layer.
The model is trained on 4 Titan X Pascal GPUs, each of which has 12 GB of memory. Because of the limitation of computational resources, we use our approach to fine-tune the officially released pretrained Transformer-XL Large model for 1 epoch. The experimental results are shown in Table 3. Our approach achieved 17.4 perplexity with the officially released evaluation scripts, significantly outperforming all baselines and achieving new state-of-the-art performance.

Discussion
In this section, we show what is learned by training language models with the context-phrase alignment objective function by visualizing the syntactic heights output by the TCN model and the phrases induced by each target word in a sentence. We also visualize the headword attentions over the induced phrase.
The first example is the sentence shown in Figure 1. The sentence came from Jurafsky and Martin (2014) and did not appear in our training set. Figure 1 shows the syntactic heights and the induced phrase of "the" according to the ground-truth dependency information. Our model is not given such high-quality inputs in either training or evaluation.
Figure 3 visualizes the structure learned by our phrase induction model. The inferred syntactic heights are shown in Figure 3a. Heights assigned to words "the" and "to" are significantly lower than others, while the verb "canceled" is assigned the highest in the sentence. Induced phrases are shown in Figure 3b. The words at the beginning of each row stand for the target word of each step. Values in the matrix stand for attention weights for calculating phrase embedding. The weights are calculated with the phrase segmenting conditions (PSC) and the syntactic heights described in Equations 8 to 11. For the target word "united", $h_{united} < h_{canceled}$ and $h_{canceled} > h_{the}$, hence the induced phrase of "united" is a single word "canceled", and the headword attention of "canceled" is 1, which is indicated in the first row of Figure 3b. The phrase induced by "canceled" is the entire following sequence, "the morning flights to houston", since no following word has a higher syntactic height than the target word. It is also shown that the headword of the induced phrase of "canceled" is "flights", which agrees with the dependency structure indicated in Figure 1.
More examples are shown in Figure 4. Figures 4a to 4d show random examples without any unknown word, while the examples shown in Figures 4e and 4f are randomly selected from sentences with unknown words, which are marked with the UNK symbol. The examples show that the phrase induction model does not always predict the exact structure represented by the dependency tree. For example, in Figure 4b, the TCN model assigned the highest syntactic height to the word "market" and induced the phrase "expect a rough market" for the context "the fund managers". However, in a ground-truth dependency tree, the verb "expect" is the word directly connected to the root node and therefore has the highest syntactic height.
Although not exactly matching linguistic dependency structures, the phrase-level structure predictions are reasonable. The segmentation is interpretable and the predicted headwords are appropriate. In Figure 4c, the headwords are "trying", "quality", and "involvement". The model is also robust to unknown words. In Figure 4e, "the <unk> council" is segmented as the induced phrase of "but a majority of". In this case, the model recognized that the unknown word is dependent on "council".
The sentence in Figure 4f includes even more unknown words. However, the model still correctly predicted the root word, the verb "speak". For the target word "with", the induced phrase is "strong <unk>". Two unknown words are located in the last few words of the sentence. The model failed to induce the phrase "<unk> and <unk>" for the word "do", but still successfully split "<unk>" and "and". Meanwhile, the attentions over the phrases induced by "speak", "do", and the first "<unk>" are not very informative, suggesting that unknown words caused some difficulty for headword prediction in this example. However, the unknown words are assigned significantly higher syntactic heights than the word "and".

Conclusion
In this work, we improved state-of-the-art language models by aligning context and induced phrases. We defined syntactic heights and phrase segmentation rules. The model generates phrase embeddings with headword attentions. We improved the AWD-LSTM and Transformer-XL language models on different data sets and achieved state-of-the-art performance on the Wikitext-103 corpus. Experiments showed that our model successfully learned approximate phrase-level knowledge, including segmentation and headwords, without any annotation. In future work, we aim to capture better structural information and possible connections to unsupervised grammar induction.

Figure 1: Ground-truth dependency tree and syntactic heights of each word.
Figure 2: The 3-step diagram of our approach. The current target word is "the", the induced phrase is "morning flights", and the next word is "morning". The context-phrase and context-word alignments are jointly trained.
Induced phrases and headword attentions.

Figure 3: Examples of induced phrases and corresponding headword attention for generating the phrase embedding. The word of each row stands for the target word as the current input of the language model, and the values in each row in the matrices stand for the words constituting the induced phrase and their weights.

Table 1: Experimental results on the Penn Treebank dataset. Compared with the AWD-LSTM baseline models, our method reduced the perplexity on the test set by 1.6.