Direct Output Connection for a High-Rank Language Model

This paper proposes a state-of-the-art recurrent neural network (RNN) language model that combines probability distributions computed not only from the final RNN layer but also from the middle layers. This method raises the expressive power of a language model based on the matrix factorization interpretation of language modeling introduced by Yang et al. (2018). Our proposed method improves the current state-of-the-art language model and achieves the best score on the Penn Treebank and WikiText-2, which are the standard benchmark datasets. Moreover, we show that our proposed method contributes to two application tasks: machine translation and headline generation.


Introduction
Neural network language models have played a central role in recent natural language processing (NLP) advances. For example, neural encoder-decoder models, which were successfully applied to various natural language generation tasks including machine translation, summarization (Rush et al., 2015), and dialogue (Wen et al., 2015), can be interpreted as conditional neural language models. Neural language models also positively influence syntactic parsing (Dyer et al., 2016; Choe and Charniak, 2016). Moreover, such word embedding methods as Skip-gram (Mikolov et al., 2013) and vLBL (Mnih and Kavukcuoglu, 2013) originated from neural language models designed to handle much larger vocabulary and data sizes. Neural language models can also be used as contextualized word representations (Peters et al., 2018). Thus, language modeling is a good benchmark task for investigating the general frameworks of neural methods in the NLP field.
In language modeling, we compute the joint probability of a word sequence as the product of conditional probabilities. Let $w_{1:T}$ be a word sequence of length $T$: $w_1, ..., w_T$. We obtain the joint probability of word sequence $w_{1:T}$ as follows:

\[ p(w_{1:T}) = p(w_1) \prod_{t=1}^{T-1} p(w_{t+1} \mid w_{1:t}). \tag{1} \]

In this literature, $p(w_1)$ is generally assumed to be 1, and thus we can ignore its calculation; see the implementation of Zaremba et al. (2014) for an example. RNN language models obtain conditional probability $p(w_{t+1} \mid w_{1:t})$ from a probability distribution over all words. To compute this distribution, RNN language models encode sequence $w_{1:t}$ into a fixed-length vector and apply a transformation matrix and the softmax function. Previous studies demonstrated that RNN language models achieve high performance by using several regularizations and selecting appropriate hyperparameters (Melis et al., 2018; Merity et al., 2018). However, Yang et al. (2018) proved that existing RNN language models have low expressive power due to the Softmax bottleneck, which means that the output matrix of RNN language models is low rank when we interpret the training of RNN language models as a matrix factorization problem. To solve the Softmax bottleneck, Yang et al. (2018) proposed Mixture of Softmaxes (MoS), which increases the rank of the matrix by combining multiple probability distributions computed from the encoded fixed-length vector.
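As a small illustration of Equation 1, the following sketch accumulates conditional probabilities into a joint log-probability and the corresponding perplexity (the exponential of the average negative log-probability per predicted word). The function names and the toy probabilities are assumptions made for illustration only; the convention $p(w_1) = 1$ follows the text above.

```python
import math

def joint_log_prob(cond_probs):
    """Sum of log p(w_{t+1} | w_{1:t}); p(w_1) = 1 contributes log 1 = 0."""
    return sum(math.log(p) for p in cond_probs)

def perplexity(cond_probs):
    """exp of the average negative log-probability per predicted word."""
    return math.exp(-joint_log_prob(cond_probs) / len(cond_probs))

# Hypothetical conditional probabilities for a 4-word sequence (3 predictions).
cond_probs = [0.5, 0.25, 0.5]
log_p = joint_log_prob(cond_probs)
ppl = perplexity(cond_probs)
```

Working in log space avoids numerical underflow when the sequence is long, which is why the sum of logarithms is used instead of the raw product.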
In this study, we propose Direct Output Connection (DOC) as a generalization of MoS. For stacked RNNs, DOC computes the probability distributions from the middle layers including input embeddings. In addition to raising the rank, the proposed method helps weaken the vanishing gradient problem in backpropagation because DOC provides a shortcut connection to the output.
We conduct experiments on standard benchmark datasets for language modeling: the Penn Treebank and WikiText-2. Our experiments demonstrate that DOC outperforms MoS and achieves state-of-the-art perplexities on each dataset. Moreover, we investigate the effect of DOC on two applications: machine translation and headline generation. We show that DOC can improve the performance of an encoder-decoder with an attention mechanism, which is a strong baseline for such applications. In addition, we conduct an experiment on the Penn Treebank constituency parsing task to investigate the effectiveness of DOC.

RNN Language Model
In this section, we briefly overview RNN language models. Let $V$ be the vocabulary size and let $P_t \in \mathbb{R}^V$ be the probability distribution over the vocabulary at timestep $t$. Moreover, let $D_{h^n}$ be the dimension of the hidden state of the $n$-th RNN, and let $D_e$ be the dimension of the embedding vectors. Then RNN language models predict probability distribution $P_{t+1}$ by the following equations:

\[ P_{t+1} = \mathrm{softmax}(W h^N_t), \tag{2} \]
\[ h^n_t = f(h^{n-1}_t, h^n_{t-1}), \tag{3} \]
\[ h^0_t = E x_t, \tag{4} \]

where $W \in \mathbb{R}^{V \times D_{h^N}}$ is a weight matrix, $E \in \mathbb{R}^{D_e \times V}$ is a word embedding matrix, $x_t \in \{0, 1\}^V$ is a one-hot vector of input word $w_t$ at timestep $t$, and $h^n_t \in \mathbb{R}^{D_{h^n}}$ is the hidden state of the $n$-th RNN at timestep $t$. (We also apply a bias term in addition to the weight matrix, but we omit it to simplify the following discussion.) We define $h^n_t$ at timestep $t = 0$ as a zero vector: $h^n_0 = 0$. Let $f(\cdot)$ represent an abstract function of an RNN, which might be the Elman network (Elman, 1990), the Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997), or another RNN variant.

Language Modeling as Matrix Factorization
Yang et al. (2018) indicated that the training of language models can be interpreted as a matrix factorization problem. In this section, we briefly introduce their description. Let word sequence $w_{1:t}$ be context $c_t$. Then we can regard a natural language as a finite set of pairs of a context and its conditional probability distribution: $L = \{(c_1, P^*(X \mid c_1)), ..., (c_U, P^*(X \mid c_U))\}$, where $U$ is the number of possible contexts and $X \in \{0, 1\}^V$ is a variable representing a one-hot vector of a word. Here, we consider matrix $A \in \mathbb{R}^{U \times V}$ that represents the true log probability distributions and matrix $H \in \mathbb{R}^{U \times D_{h^N}}$ that contains the hidden states of the final RNN layer for each context $c_t$:

\[ A = \begin{bmatrix} \log P^*(X \mid c_1) \\ \vdots \\ \log P^*(X \mid c_U) \end{bmatrix}, \qquad H = \begin{bmatrix} h_{c_1} \\ \vdots \\ h_{c_U} \end{bmatrix}, \tag{5} \]

where $h_{c_t}$ denotes the hidden state of the final RNN layer for context $c_t$. Then we obtain the set of matrices $F(A) = \{A + \Lambda S\}$, where $S \in \mathbb{R}^{U \times V}$ is an all-ones matrix and $\Lambda \in \mathbb{R}^{U \times U}$ is a diagonal matrix. $F(A)$ contains the matrices obtained by shifting each row of $A$ by an arbitrary real number. In other words, if we take a matrix from $F(A)$ and apply the softmax function to each of its rows, we obtain a matrix that consists of the true probability distributions. Therefore, for some $A' \in F(A)$, training RNN language models amounts to finding the parameters that satisfy the following equation:

\[ HW^\top = A'. \tag{6} \]

Equation 6 indicates that training RNN language models can also be interpreted as a matrix factorization problem. In most cases, the rank of matrix $HW^\top$ is $D_{h^N}$ because $D_{h^N}$ is smaller than $V$ and $U$ in common RNN language models. Thus, an RNN language model cannot express the true distributions if $D_{h^N}$ is much smaller than $\mathrm{rank}(A')$. Yang et al. (2018) also argued that $\mathrm{rank}(A')$ is as high as vocabulary size $V$ based on the following two assumptions:
1. Natural language is highly context-dependent. In addition, since we can imagine many kinds of contexts, it is difficult to assume a basis that represents a conditional probability distribution for any context. In other words, compressing $U$ is difficult.
2. Since we also have many kinds of semantic meanings, it is difficult to assume basic meanings that can create all other semantic meanings by such simple operations as addition and subtraction; compressing $V$ is difficult.
In summary, Yang et al. (2018) indicated that $D_{h^N}$ is much smaller than $\mathrm{rank}(A)$ because its scale is usually $10^2$, whereas vocabulary size $V$ is at least $10^4$.
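The rank argument can be spelled out with two standard rank inequalities. This is a sketch in the notation of this section, with $H \in \mathbb{R}^{U \times D_{h^N}}$, $W \in \mathbb{R}^{V \times D_{h^N}}$, the product taken as $H$ times the transpose of $W$, and $A' = A + \Lambda S$ for the all-ones matrix $S$ (which has rank 1):

```latex
\begin{align*}
\operatorname{rank}(HW^\top)
  &\le \min\bigl(\operatorname{rank}(H),\, \operatorname{rank}(W)\bigr)
   \le D_{h^N}, \\
\operatorname{rank}(A')
  = \operatorname{rank}(A + \Lambda S)
  &\ge \operatorname{rank}(A) - \operatorname{rank}(\Lambda S)
   \ge \operatorname{rank}(A) - 1 .
\end{align*}
```

Hence, whenever $\operatorname{rank}(A) - 1 > D_{h^N}$, no choice of $H$ and $W$ can reproduce any member of $F(A)$ exactly, which is the bottleneck described above.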

Proposed Method: Direct Output Connection
To construct a high-rank matrix, Yang et al. (2018) proposed Mixture of Softmaxes (MoS). MoS computes multiple probability distributions from the hidden state of the final RNN layer $h^N$ and regards the weighted average of these probability distributions as the final distribution. In this study, we propose Direct Output Connection (DOC), which is a generalization of MoS. DOC computes probability distributions from the middle layers in addition to the final layer. In other words, DOC directly connects the middle layers to the output. Figure 1 shows an overview of DOC, which uses the middle layers (including word embeddings) to compute the probability distributions. Figure 1 computes three probability distributions from all the layers, but we can vary the number of probability distributions for each layer and skip some layers entirely. In our experiments, we search for the appropriate number of probability distributions for each layer.
Formally, instead of Equation 2, DOC computes the output probability distribution at timestep $t + 1$ by the following equation:

\[ P_{t+1} = \sum_{j=1}^{J} \pi_{j,c_t} \, \mathrm{softmax}(\tilde{W} k_{j,c_t}), \tag{7} \]
\[ \text{s.t.} \quad \sum_{j=1}^{J} \pi_{j,c_t} = 1, \tag{8} \]

where $\pi_{j,c_t}$ is a weight for each probability distribution, $k_{j,c_t} \in \mathbb{R}^d$ is a vector computed from each hidden state $h^n$, and $\tilde{W} \in \mathbb{R}^{V \times d}$ is a weight matrix. Thus, $P_{t+1}$ is the weighted average of $J$ probability distributions. For each $j$, we define the $U \times U$ diagonal matrix whose elements are the weights $\pi_{j,c}$ for each context $c$ as $\Phi_j$. Then we obtain matrix $\tilde{A} \in \mathbb{R}^{U \times V}$:

\[ \tilde{A} = \log \sum_{j=1}^{J} \Phi_j \, \mathrm{softmax}(K_j \tilde{W}^\top), \tag{9} \]

where $K_j \in \mathbb{R}^{U \times d}$ is a matrix whose rows are the vectors $k_{j,c}$. $\tilde{A}$ can have an arbitrarily high rank because the right-hand side of Equation 9 involves not only matrix multiplication but also a nonlinear function. Therefore, an RNN language model with DOC can output a distribution matrix whose rank is identical to that of the true distributions. In other words, $\tilde{A}$ is a better approximation of $A$ than the output of a standard RNN language model.

Next we describe how to acquire weight $\pi_{j,c_t}$ and vector $k_{j,c_t}$. Let $\pi_{c_t} \in \mathbb{R}^J$ be a vector whose elements are the weights $\pi_{j,c_t}$. Then we compute $\pi_{c_t}$ from the hidden state of the final RNN layer:

\[ \pi_{c_t} = \mathrm{softmax}(W_\pi h^N_t), \tag{10} \]

where $W_\pi \in \mathbb{R}^{J \times D_{h^N}}$ is a weight matrix. We next compute $k_{j,c_t}$ from the hidden state of the $n$-th RNN layer:

\[ k_{j,c_t} = W_j h^n_t, \tag{11} \]

where $W_j \in \mathbb{R}^{d \times D_{h^n}}$ is a weight matrix. In addition, let $i_n$ be the number of vectors $k_{j,c_t}$ computed from $h^n_t$. Then we define the sum of $i_n$ over all $n$ as $J$; that is, $\sum_{n=0}^{N} i_n = J$. In short, DOC computes $J$ probability distributions from all the layers, including the input embedding ($h^0$). For $i_N = J$, DOC becomes identical to MoS. In addition to increasing the rank, we expect that DOC weakens the vanishing gradient problem during backpropagation because a middle layer is directly connected to the output, as with the auxiliary classifiers described in Szegedy et al. (2015).
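The computation described above can be sketched for a single timestep as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes a plain linear projection for each $k_{j,c_t}$, omits bias terms and dropout, and uses hypothetical names (`doc_distribution`, `layer_of_j`) and toy dimensions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # subtract the max for numerical stability
    return e / e.sum()

def doc_distribution(hiddens, i_n, W_pi, W_js, W_tilde):
    """hiddens: [h^0, ..., h^N] for one timestep; i_n[n] distributions come from layer n."""
    pi = softmax(W_pi @ hiddens[-1])   # mixture weights from the final layer
    # Map each distribution index j to the layer it reads from.
    layer_of_j = [n for n, count in enumerate(i_n) for _ in range(count)]
    probs = np.zeros(W_tilde.shape[0])
    for j, n in enumerate(layer_of_j):
        k_j = W_js[j] @ hiddens[n]                 # projection of layer n (assumed linear)
        probs += pi[j] * softmax(W_tilde @ k_j)    # weighted average of J softmaxes
    return probs, pi

rng = np.random.default_rng(0)
V, d = 10, 3
dims = [5, 6, 6, 4]                    # toy sizes for h^0 (embedding), h^1, h^2, h^3
i_n = [0, 0, 1, 2]                     # one distribution from layer 2, two from layer 3
J = sum(i_n)
hiddens = [rng.normal(size=D) for D in dims]
W_pi = rng.normal(size=(J, dims[-1]))
W_js = [rng.normal(size=(d, dims[n])) for n, c in enumerate(i_n) for _ in range(c)]
W_tilde = rng.normal(size=(V, d))
probs, pi = doc_distribution(hiddens, i_n, W_pi, W_js, W_tilde)
```

Setting `i_n = [0, 0, 0, J]` makes every distribution read from the final layer, which corresponds to the MoS special case mentioned in the text.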
For a network that computes the weights for several vectors, such as Equation 10, Shazeer et al. (2017) indicated that it often converges to a state where it always produces large weights for a few vectors. In fact, we observed that DOC tends to assign large weights to shallow layers. To prevent this phenomenon, we compute the coefficient of variation of Equation 10 in each mini-batch as a regularization term following Shazeer et al. (2017). In other words, we encourage the sums of the weights assigned to each probability distribution to be equal within each mini-batch. Formally, we compute the following equations for a mini-batch consisting of $w_b, w_{b+1}, ..., w_{\tilde{b}}$:

\[ B = \sum_{t=b}^{\tilde{b}} \pi_{c_t}, \tag{12} \]
\[ \beta = \left( \frac{\mathrm{std}(B)}{\mathrm{avg}(B)} \right)^2, \tag{13} \]

where $\mathrm{std}(\cdot)$ and $\mathrm{avg}(\cdot)$ are functions that respectively return the standard deviation and the average of their input. In the training step, we add $\beta$ multiplied by weight coefficient $\lambda_\beta$ to the loss function.
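The regularization term can be illustrated with a few lines of NumPy. This sketch assumes the population standard deviation (NumPy's default); the text does not state which variant is used, and the function name `cv_squared` and the toy weight matrices are hypothetical.

```python
import numpy as np

def cv_squared(pi_batch):
    """pi_batch: (timesteps, J) mixture weights of one mini-batch.

    Returns beta, the squared coefficient of variation of the per-distribution
    weight sums B.
    """
    B = pi_batch.sum(axis=0)               # total weight given to each distribution
    return (B.std() / B.mean()) ** 2       # population std: an assumption of this sketch

balanced = np.full((4, 3), 1.0 / 3.0)          # every distribution used equally
skewed = np.array([[0.8, 0.1, 0.1]] * 4)       # one distribution dominates
beta_balanced = cv_squared(balanced)
beta_skewed = cv_squared(skewed)

lambda_beta = 0.001                            # weight coefficient from the experiments
penalty = lambda_beta * beta_skewed            # term added to the training loss
```

The penalty is close to zero when all distributions receive the same total weight and grows as the weights concentrate on a few distributions.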

Experiments on Language Modeling
We investigate the effect of DOC on the language modeling task. In detail, we conduct word-level prediction experiments and show that DOC improves the performance of MoS, which only uses the final layer to compute the probability distributions. Moreover, we evaluate various combinations of layers to explore which combination achieves the best score.

Datasets
We used the Penn Treebank (PTB) (Marcus et al., 1993) and WikiText-2 (Merity et al., 2017) datasets, which are the standard benchmark datasets for the word-level language modeling task. Mikolov et al. (2010) and Merity et al. (2017) respectively published preprocessed PTB and WikiText-2 datasets. Table 1 describes their statistics. We used these preprocessed datasets for fair comparisons with previous studies.

Hyperparameters
Our implementation is based on the averaged stochastic gradient descent Weight-Dropped LSTM (AWD-LSTM) proposed by Merity et al. (2018) [...] dropout rate for vector $k_{j,c_t}$ and the non-monotone interval. Since we found that the dropout rate for vector $k_{j,c_t}$ greatly influences $\beta$ in Equation 13, we varied it from 0.3 to 0.6 with 0.1 intervals. We selected 0.6 because this value achieved the best score on the PTB validation dataset. For the non-monotone interval, we adopted the same value as Zolna et al. (2018). Table 2 summarizes the hyperparameters of our experiments.

Results
In Table 3, $i_n$ represents the number of probability distributions computed from hidden state $h^n_t$. To find the best combination, we varied the number of probability distributions from each layer while fixing their total to 20: $J = 20$. Moreover, the top row of Table 3 shows the perplexity of AWD-LSTM with MoS reported in Yang et al. (2018) for comparison. Table 3 indicates that language models using middle layers outperformed the one using only the final layer. In addition, Table 3 shows that increasing the number of distributions from the final layer ($i_3 = 20$) degraded the score relative to the language model with $i_3 = 15$ (the top row of Table 3). Thus, to obtain a superior language model, we should not increase the number of distributions from the final layer; we should instead use the middle layers, as with our proposed DOC.

Table 3 shows that the $i_3 = 15$, $i_2 = 5$ setting achieved the best performance and that the other settings with shallow layers have little effect. This result implies that we need some layers to output accurate distributions. In fact, most previous studies adopted two LSTM layers for language modeling. This suggests that we need at least two layers to obtain high-quality distributions.

Table 6: Perplexities of our implementations and reruns on the PTB dataset. We set the non-monotone interval to 60. † represents results obtained by original implementations with identical hyperparameters except for the non-monotone interval. ‡ indicates the result obtained by our AWD-LSTM-MoS implementation with dropout rates identical to those of AWD-LSTM-DOC. For (fin), we repeated fine-tuning until convergence.

For the $i_3 = 15$, $i_2 = 5$ setting, we explored the effect of $\lambda_\beta$ in $\{0, 0.01, 0.001, 0.0001\}$. Although Table 3 shows that $\lambda_\beta = 0.001$ achieved the best perplexity, the effect is not consistent. Table 4 shows the coefficient of variation of Equation 10, i.e., $\sqrt{\beta}$, on the PTB dataset. This table demonstrates that the coefficient of variation decreases as $\lambda_\beta$ grows. In other words, the model trained with a large $\lambda_\beta$ assigns balanced weights to each probability distribution. These results indicate that it is not always necessary to use each probability distribution equally, but we can acquire a better model with some values of $\lambda_\beta$. Hereafter, we refer to the setting that achieved the best score ($i_3 = 15$, $i_2 = 5$, $\lambda_\beta = 0.001$) as AWD-LSTM-DOC.

Table 5 shows the ranks of the matrices containing the log probability distributions from each method. In other words, Table 5 describes the rank of $\tilde{A}$ in Equation 9 for each method. As shown by this table, the rank of the output of AWD-LSTM is restricted to $D_{h^3}$. In contrast, AWD-LSTM-MoS (Yang et al., 2018) and AWD-LSTM-DOC output matrices whose ranks equal the vocabulary size. This fact indicates that DOC (including MoS) can output a matrix with the same rank as that of the true distributions.

Figure 2 illustrates the learning curves of each method on PTB. This figure contains the validation scores of AWD-LSTM, AWD-LSTM-MoS, and AWD-LSTM-DOC at each training epoch. We trained AWD-LSTM and AWD-LSTM-MoS by setting the non-monotone interval to 60, as with AWD-LSTM-DOC. In other words, we used hyperparameters identical to the original ones to train AWD-LSTM and AWD-LSTM-MoS, except for the non-monotone interval. We note that the optimization method switches from ordinary stochastic gradient descent (SGD) to averaged SGD at the point where training has almost converged. In Figure 2, this turning point is the epoch at which each method drastically decreases the perplexity. Figure 2 shows that each method similarly reduces the perplexity at the beginning. AWD-LSTM and AWD-LSTM-MoS were slow to decrease the perplexity after around 50 epochs. In contrast, AWD-LSTM-DOC constantly decreased the perplexity and achieved a lower value than the other methods with ordinary SGD. Therefore, we conclude that DOC positively affects the training of language models.

Table 6 shows the AWD-LSTM, AWD-LSTM-MoS, and AWD-LSTM-DOC results in our configurations. For AWD-LSTM-MoS, we trained our implementation with the same dropout rates as AWD-LSTM-DOC for a fair comparison. AWD-LSTM-DOC outperformed both the original AWD-LSTM-MoS and our implementation. In other words, DOC outperformed MoS.
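The kind of rank comparison reported in Table 5 can be mimicked on a toy scale. The sketch below is an illustration under stated assumptions, not a reproduction of the experiment: it uses random matrices in place of trained models, hypothetical toy sizes, and takes every mixture component from the same hidden vectors (the MoS special case).

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

rng = np.random.default_rng(0)
U, V, D, J = 60, 30, 4, 3                # contexts, vocabulary, hidden size, mixtures

H = rng.normal(size=(U, D))              # one random "hidden state" per context
W = rng.normal(size=(V, D))
A_single = log_softmax(H @ W.T)          # log-probabilities from a single softmax

# Mixture of J softmaxes with context-dependent weights.
W_pi = rng.normal(size=(J, D))
pi = np.exp(log_softmax(H @ W_pi.T))     # (U, J), each row sums to 1
Ws = rng.normal(size=(J, D, D))          # one projection per mixture component
mix = sum(pi[:, [j]] * np.exp(log_softmax((H @ Ws[j].T) @ W.T)) for j in range(J))
A_mix = np.log(mix)

rank_single = np.linalg.matrix_rank(A_single)
rank_mix = np.linalg.matrix_rank(A_mix)
```

The single-softmax matrix is a rank-$D$ product plus a per-row shift, so its rank cannot exceed $D + 1$, whereas the logarithm of a mixture is not bound by the hidden size.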
Since the averaged SGD uses the average of the parameters from each update step, the parameters of the early steps are harmful to the final parameters. Therefore, when the model converges, recent studies and ours discard the averaging history and then retrain the model. Merity et al. (2018) referred to this retraining process as fine-tuning. Although most previous studies only conducted fine-tuning once, Zolna et al. (2018) argued that two fine-tunings provided additional improvement. Thus, we repeated fine-tuning until we achieved no more improvements on the validation data. We refer to this model as AWD-LSTM-DOC (fin) in Table 6, which shows that repeated fine-tuning improved the perplexity by about 0.5.
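The effect of discarding the averaging history can be seen on a toy problem. The sketch below is only an illustration on a deterministic quadratic loss, not the training procedure used in the experiments; the function `averaged_sgd`, the learning rate, and the choice to restart from the last iterate are all assumptions.

```python
def averaged_sgd(grad, x0, lr, steps):
    """Plain gradient steps that also track the running average of the iterates."""
    x, avg = float(x0), 0.0
    for t in range(1, steps + 1):
        x -= lr * grad(x)
        avg += (x - avg) / t          # running mean of x_1 .. x_t
    return x, avg

def grad(x):
    return x                          # gradient of the toy loss 0.5 * x**2 (optimum at 0)

# First run: the average is dominated by early iterates far from the optimum.
x_last, avg_first = averaged_sgd(grad, x0=10.0, lr=0.1, steps=50)

# "Fine-tuning": restart from the current parameters with an empty averaging history.
_, avg_restart = averaged_sgd(grad, x0=x_last, lr=0.1, steps=50)
```

After the restart, the average no longer includes the early iterates, so it ends up much closer to the optimum than the first average.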
Tables 7 and 8 respectively show the perplexities of AWD-LSTM-DOC and previous studies on PTB and WikiText-2. These tables show that AWD-LSTM-DOC achieved the best perplexity. AWD-LSTM-DOC improved the perplexity by almost 2.0 on PTB and 3.5 on WikiText-2 from the state-of-the-art scores. The ensemble technique provided further improvement, as described in previous studies (Zaremba et al., 2014), and improved the perplexity by at least 4 points on both datasets. Finally, the ensemble of the repeated fine-tuning models achieved 47.17 on the PTB test set and 53.09 on the WikiText-2 test set.

We exclude models that use the statistics of the test data (Grave et al., 2017; Krause et al., 2017) from these tables because we regard neural language models as the basis of NLP applications and consider it unreasonable to know the correct outputs during applications, e.g., machine translation. In other words, we focus on neural language models as the foundation of applications, although we can combine methods that use the statistics of test data with our AWD-LSTM-DOC.

Experiments on Application Tasks
As described in Section 1, a neural encoder-decoder model can be interpreted as a conditional language model. To investigate the effect of DOC on an encoder-decoder model, we incorporate DOC into the decoder and examine its performance.

Dataset
We conducted experiments on machine translation and headline generation tasks. For machine translation, we used two kinds of sentence pairs (English-German and English-French) in the IWSLT 2016 dataset. The training sets contain about 189K English-German and 208K English-French sentence pairs. We experimented in four settings: from English to German (En-De), its reverse (De-En), from English to French (En-Fr), and its reverse (Fr-En).
Headline generation is a task that creates a short summary of an input sentence (Rush et al., 2015). Rush et al. (2015) constructed a headline generation dataset by extracting pairs of first sentences of news articles and their headlines from the annotated English Gigaword corpus (Napoles et al., 2012). They also divided the extracted sentence-headline pairs into three parts: training, validation, and test sets. The training set contains about 3.8M sentence-headline pairs. For our evaluation, we used the test set constructed by Zhou et al. (2017) because the one constructed by Rush et al. (2015) contains some invalid instances, as reported in Zhou et al. (2017).

Encoder-Decoder Model
For the base model, we adopted an encoder-decoder with an attention mechanism described in Kiyono et al. (2017). The encoder consists of a 2-layer bidirectional LSTM, and the decoder consists of a 2-layer LSTM with the attention proposed by Luong et al. (2015). We interpreted the layer after computing the attention as the 3rd layer of the decoder. We refer to this encoder-decoder as EncDec. For the hyperparameters, we followed the setting of Kiyono et al. (2017) except for the sizes of hidden states and embeddings. We used 500 for machine translation and 400 for headline generation. We constructed a vocabulary set by using Byte-Pair-Encoding (BPE) (Sennrich et al., 2016). We set the number of BPE merge operations at 16K for machine translation and 5K for headline generation.
In this experiment, we compare DOC to the base EncDec. We prepared two DOC settings: using only the final layer, that is, a setting that is identical to MoS, and using both the final and middle layers. We used the 2nd and 3rd layers in the latter setting because this case achieved the best performance on the language modeling task in Section 5.3. We set $i_3 = 2$ for the former setting and $i_2 = 2$, $i_3 = 2$ for the latter. For this experiment, we modified a publicly available encoder-decoder implementation.

Results
Table 9 shows the BLEU scores of each method. Since the initial values often drastically vary the result of a neural encoder-decoder, we report the average of three models trained from different initial values and random seeds. Table 9 indicates that EncDec+DOC outperformed EncDec.
Table 10 shows the ROUGE F1 scores of each method. In addition to the results of our implementations (the upper part), the lower part presents the published scores reported in previous studies. For the upper part, we report the average of three models (as in Table 9). EncDec+DOC outperformed EncDec on all scores. Moreover, EncDec outperformed the state-of-the-art method (Zhou et al., 2017) on the ROUGE-2 and ROUGE-L F1 scores. In other words, our baseline is already very strong. We believe that this is because we adopted a larger embedding size than Zhou et al. (2017). It is noteworthy that DOC improved the performance of EncDec even though EncDec is very strong.

Table 10 (lower part): ROUGE F1 scores reported in previous studies.
Model                        ROUGE-1   ROUGE-2   ROUGE-L
(Rush et al., 2015)          37.41     15.87     34.70
SEASS (Zhou et al., 2017)    46.86     24.58     43.53
Kiyono et al. (2017)         46.34     24.85     43.49

These results indicate that DOC positively influences a neural encoder-decoder model. Using the middle layer also yields further improvement because EncDec+DOC ($i_3 = i_2 = 2$) outperformed EncDec+DOC ($i_3 = 2$).

Experiments on Constituency Parsing
Choe and Charniak (2016) achieved high F1 scores on the Penn Treebank constituency parsing task by transforming candidate trees into a symbol sequence (S-expression) and reranking them based on the perplexity obtained by a neural language model. To investigate the effectiveness of DOC, we evaluate our language models following their configurations.
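The reranking step can be sketched as follows: each candidate tree is linearized into a symbol sequence, scored by a language model, and the candidate with the lowest perplexity is selected. The add-one smoothed unigram model below is only a toy stand-in for a neural language model, and all function names and example trees are hypothetical.

```python
import math
from collections import Counter

def make_unigram_lm(corpus_tokens):
    """Toy stand-in for a neural LM: add-one smoothed unigram log-probabilities."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1                      # +1 slot for unseen symbols
    return lambda tok: math.log((counts[tok] + 1) / (total + vocab))

def perplexity(tokens, log_prob):
    return math.exp(-sum(log_prob(t) for t in tokens) / len(tokens))

def rerank(candidates, log_prob):
    """Sort linearized candidate trees from lowest to highest perplexity."""
    return sorted(candidates, key=lambda c: perplexity(c.split(), log_prob))

# Hypothetical training data and candidate parses in S-expression form.
train = "(S (NP the dog ) (VP barks ) )".split() * 3
lm = make_unigram_lm(train)
candidates = [
    "(S (NP the dog ) (VP barks ) )",
    "(S (VP the dog ) (X barks ) )",
]
best = rerank(candidates, lm)[0]
```

In the actual setting, the scoring function would be the conditional probabilities of a trained language model, so that the ranking reflects the whole symbol context rather than symbol frequencies.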

Dataset
We used the Wall Street Journal portion of the Penn Treebank dataset. We used sections 2-21 for training, 22 for validation, and 23 for testing. We applied the preprocessing code of Choe and Charniak (2016) to the dataset and converted each token that appears fewer than ten times in the training dataset into a special token unk. For reranking, we prepared 500 candidates obtained by the Charniak parser (Charniak, 2000).

Models
We compare AWD-LSTM-DOC with AWD-LSTM (Merity et al., 2018) and AWD-LSTM-MoS (Yang et al., 2018). We trained each model with the same hyperparameters as in our language modeling experiments (Section 5). We selected the model that achieved the best perplexity on the validation set during training.

Results
[...] represents the current state-of-the-art scores in the setting without external data. The upper part also contains the score reported in Choe and Charniak (2016), who reranked candidates with a simple LSTM language model. This part indicates that our implemented rerankers outperformed the simple LSTM language model based reranker, which achieved a 92.6 F1 score (Choe and Charniak, 2016). Moreover, AWD-LSTM-DOC outperformed AWD-LSTM and AWD-LSTM-MoS. These results are consistent with those of the language modeling task.
The middle part shows that AWD-LSTM-DOC also outperformed AWD-LSTM and AWD-LSTM-MoS in the ensemble setting. In addition, we can improve the performance by exchanging the base parser with a stronger one. In fact, we achieved a 94.29 F1 score by reranking the candidates from retrained Recurrent Neural Network Grammars (RNNG) (Dyer et al., 2016), which achieved a 91.2 F1 score in our configuration. (The output of RNNG is not in descending order because it samples candidates based on their scores. Thus, we prepared more candidates, i.e., 700, to be able to obtain correct instances as candidates.) Moreover, the lowest row of the middle part indicates the result of reranking the candidates from the retrained neural encoder-decoder based parser (Suzuki et al., 2018). Our base parser differs from that of Suzuki et al. (2018) in two respects. First, we used the sum of the hidden states of the forward and backward RNNs as the hidden layer for each RNN; that is, we used the deep bidirectional encoder described at http://opennmt.net/OpenNMT/training/models/ instead of a basic bidirectional encoder. Second, we tied the embedding matrix to the weight matrix to compute the probability distributions in the decoder. The retrained parser achieved a 93.12 F1 score. Finally, we achieved a 94.47 F1 score by reranking its candidates with AWD-LSTM-DOC. We expect that we can achieve an even better score by replacing the base parser with the current state-of-the-art one (Kitaev and Klein, 2018).

Related Work
Bengio et al. (2003) are pioneers of neural language models. To address the curse of dimensionality in language modeling, they proposed a method using word embeddings and a feed-forward neural network (FFNN). They demonstrated that their approach outperformed n-gram language models, but FFNN can only handle fixed-length contexts. Instead of FFNN, Mikolov et al. (2010) applied RNN (Elman, 1990) to language modeling to address the entire given sequence as a context. Their method outperformed the Kneser-Ney smoothed 5-gram language model (Kneser and Ney, 1995; Chen and Goodman, 1996).
Researchers continue to try to improve the performance of RNN language models. Zaremba et al. (2014) used LSTM (Hochreiter and Schmidhuber, 1997) instead of a simple RNN for language modeling and significantly improved an RNN language model by applying dropout (Srivastava et al., 2014) to all the connections except for the recurrent connections. To regularize the recurrent connections, Gal and Ghahramani (2016) proposed variational inference-based dropout. Their method uses the same dropout mask at each timestep. Zolna et al. (2018) proposed fraternal dropout, which minimizes the differences between outputs from different dropout masks to be invariant to the dropout mask. Melis et al. (2018) used black-box optimization to find appropriate hyperparameters for RNN language models and demonstrated that the standard LSTM with proper regularizations can outperform other architectures.
Apart from dropout techniques, Inan et al. (2017) and Press and Wolf (2017) proposed the word tying method (WT), which unifies word embeddings (E in Equation 4) with the weight matrix to compute probability distributions (W in Equation 2). In addition to quantitative evaluation, Inan et al. (2017) provided a theoretical justification for WT and proposed the augmented loss technique (AL), which computes an objective probability based on word embeddings. In addition to these regularization techniques, Merity et al. (2018) used DropConnect (Wan et al., 2013) and averaged SGD (Polyak and Juditsky, 1992) for an LSTM language model. Their AWD-LSTM achieved lower perplexity than Melis et al. (2018) on PTB and WikiText-2.
Previous studies also explored superior architectures for language modeling. Zilly et al. (2017) proposed recurrent highway networks that use highway layers (Srivastava et al., 2015) to deepen recurrent connections. Zoph and Le (2017) adopted reinforcement learning to construct the best RNN structure. However, as mentioned, Melis et al. (2018) established that the standard LSTM is superior to these architectures. Apart from RNN architectures, the input-to-output gate (IOG) has been proposed, which boosts the performance of trained language models.
As described in Section 3, Yang et al. (2018) interpreted the training of language models as matrix factorization and improved performance by computing multiple probability distributions. In this study, we generalized their approach to use the middle layers of RNNs. Finally, our proposed method, DOC, achieved the state-of-the-art score on the standard benchmark datasets.
Some studies provided methods that boost performance by using statistics obtained from test data. Grave et al. (2017) extended a cache model (Kuhn and De Mori, 1990) for RNN language models. Krause et al. (2017) proposed dynamic evaluation that updates parameters based on a recent sequence during testing. Although these methods might also improve the performance of DOC, we omitted such investigation to focus on comparisons among methods trained only on the training set.

Conclusion
We proposed Direct Output Connection (DOC), a generalization of MoS introduced by Yang et al. (2018). DOC raises the expressive power of RNN language models and improves the quality of the model. DOC outperformed MoS and achieved the best perplexities on the standard benchmark datasets of language modeling: PTB and WikiText-2. Moreover, we investigated its effectiveness on machine translation and headline generation. Our results show that DOC also improved the performance of EncDec and that using a middle layer positively affected such application tasks.