Revisiting Representation Degeneration Problem in Language Modeling

Weight tying is now a common setting in many language generation tasks such as language modeling and machine translation. However, a recent study reveals a potential flaw in weight tying: the learned word embeddings are likely to degenerate and lie in a narrow cone when training a language model. The authors call this the representation degeneration problem and propose a cosine regularization to solve it. Nevertheless, we prove that the cosine regularization is insufficient to solve the problem, as the degeneration is still likely to happen under certain conditions. In this paper, we revisit the representation degeneration problem and theoretically analyze the limitations of the previously proposed solution. We then propose an alternative regularization method, Laplacian regularization, to tackle the problem. Experiments on language modeling demonstrate the effectiveness of the proposed Laplacian regularization.


Introduction
Language modeling is a fundamental task in natural language processing; its applications include machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), image captioning (Vinyals et al., 2015; Xu et al., 2015), and speech recognition (Yu and Deng, 2016), to name a few. In the era of deep learning, a typical model architecture contains a word embedding layer as input, multiple layers that encode the word context into a fixed-size hidden state, and a softmax layer that transforms the hidden state into a categorical distribution over the next word (Merity et al., 2018; Yang et al., 2018; Gong et al., 2018; Gao et al., 2019). In practice, the parameters of the embedding layer and the softmax layer are usually shared, which is called weight tying (Inan et al., 2017; Press and Wolf, 2017).

* Corresponding author: http://dm.uestc.edu.cn

Despite the improvements from weight tying, a recent work (Gao et al., 2019) discovers that, with weight tying, the learned word embeddings are positively correlated and spread in a narrow cone, as visualized in Figure 1(a). A similar phenomenon is observed in Gong et al. (2018). Thus, the semantic expressiveness of word embeddings is limited. They call this the representation degeneration problem. To tackle the problem, the authors propose a cosine regularization that minimizes the cosine similarities between any two word embeddings so as to enlarge the aperture of the cone. They show that it improves language modeling performance and eases the degeneration, as visualized in Figure 1(b). However, we argue that the cosine regularization might not be the best choice for solving this problem, for the following reasons: i) The cosine regularization minimizes the similarity between any two word embeddings without considering whether they are semantically close or not, whereas we wish two words with similar semantics to stay close in the embedding space.
ii) Although the cosine regularization improves language generation performance, it does not fundamentally solve the representation degeneration problem. We prove that the degeneration still exists under certain conditions on the regularized objective. We then analyze a more general condition for degeneration and show that many low-frequency words meet this condition and thus still degenerate. Therefore, we argue that the degeneration is still likely to happen even with cosine regularization.
Motivated by these issues, we propose an alternative Laplacian regularization to tackle the representation degeneration problem. As the distributional hypothesis (Harris, 1954) states, two words that occur in similar contexts tend to have similar meanings. The general idea of Laplacian regularization is to minimize the squared Euclidean distance between two word embeddings when they have large context similarity. In contrast to cosine regularization, Laplacian regularization avoids minimizing the similarities of all word pairs indiscriminately. Although the Laplacian regularization does not theoretically solve the degeneration problem either, we empirically demonstrate that it achieves better performance in most language modeling experiments, and the word embeddings are less likely to degenerate.
In summary, the main contributions of our work are listed as follows.
• We revisit the representation degeneration problem and theoretically analyze the limitations of the previously proposed cosine regularization solution.
• We propose an alternative Laplacian regularization to tackle the representation degeneration problem. We show that it eases the degeneration to an extent compared with cosine regularization.
• We conduct experiments on the language modeling task to demonstrate the effectiveness of our method.

Representation Degeneration Problem
In this section, we introduce the notation and review the representation degeneration problem. Given a vocabulary of word indices V = {1, ..., N} and a text corpus represented as a sequence of words y = (y_1, ..., y_M), where y_i ∈ V, the joint probability of the sequence y is factorized into a product of conditional probabilities using the chain rule:

$$P(y) = \prod_{t=1}^{M} P(y_t \mid y_{<t}), \qquad (1)$$

where $y_{<t}$ denotes the first $t-1$ words of $y$. Current neural language models encode a variable-length context as a fixed-size hidden state denoted $h_i$. The conditional probability is calculated by the softmax function, and the model is trained by minimizing the negative log-likelihood loss:

$$\mathcal{L}_{NLL} = -\frac{1}{M}\sum_{i=1}^{M} \log \frac{\exp(w_{y_i}^T h_i)}{\sum_{l=1}^{N}\exp(w_l^T h_i)}, \qquad (2)$$
where $w$ is the parameter matrix of the softmax layer. When using weight tying, $w_l$ is also the embedding of the $l$-th word. Next, we investigate the optimization process of word embeddings. We follow the analysis in Gao et al. (2019) and focus only on the extreme case of a non-appeared word $w_N$, since the analysis extends to rarely appeared words by applying Theorem 3 in Gao et al. (2019). Assume $y_i \neq N$ for all $i$, which means the $N$-th word, with embedding $w_N$, does not appear in the corpus. Under the log-likelihood maximization objective and fixing all other parameters, we write the objective function for optimizing the variable $w_N$ as

$$L(w_N) = \frac{1}{M}\sum_{i=1}^{M}\log\left(G_i + \exp(w_N^T h_i)\right), \qquad (3)$$

where $G_i = \sum_{l=1}^{N-1} \exp(w_l^T h_i)$ can be considered a constant. Let $v$ be a uniformly negative direction of the hidden states, i.e., $v^T h_i < 0$ for all $i$. It is easy to see that the optimal solution of Eq. (3) is achieved by setting $w_N^* = \lim_{k \to \infty} k \cdot v$, and the minimum objective value is bounded below by $\frac{1}{M}\sum_{i=1}^{M}\log(G_i)$. The authors prove that such a uniformly negative direction $v$ exists if and only if the convex hull of the hidden states does not contain the origin. They argue that this condition is very likely to hold, especially when layer normalization is applied. We further observe that the condition holds almost surely in actual language modeling, even without layer normalization.
From the above analysis, we gain an intuition for the representation degeneration problem: the embedding $w_N$ can be optimized along any uniformly negative direction to infinity. As the set of uniformly negative directions is convex, $w_N$ is likely to lie in a convex cone and move toward infinity during optimization. This conclusion also applies, to a large extent, to rarely appeared words (Gao et al., 2019). As most words in natural language are low-frequency words according to Zipf's law, the learned word embeddings tend to degenerate and lie in a narrow cone, which limits the model's semantic expressiveness. Notably, Gong et al. (2018) also show that the learned word embeddings overly encode word frequency information rather than semantic information, which implicitly supports the existence of the degeneration problem.
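The drift described above can be reproduced in a few lines. Below is a minimal numpy sketch (the hidden states and constants $G_i$ are toy values of our own choosing, not from the paper) that runs gradient descent on the non-appeared word's objective $\frac{1}{M}\sum_i \log(G_i + \exp(w_N^T h_i))$: the iterate keeps growing along a uniformly negative direction instead of converging.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: hidden states whose convex hull excludes the origin --
# every h_i has first coordinate > 0, so v = (-1, 0, ..., 0) is a
# uniformly negative direction (v^T h_i < 0 for all i).
D, M = 8, 50
H = rng.normal(size=(M, D))
H[:, 0] = np.abs(H[:, 0]) + 0.5
G = np.full(M, 10.0)  # stand-ins for the constants G_i

def objective(w):
    # (1/M) sum_i log(G_i + exp(w^T h_i)): the part of the NLL that
    # depends on the non-appeared word's embedding w_N
    return np.mean(np.log(G + np.exp(H @ w)))

def grad(w):
    p = np.exp(H @ w) / (G + np.exp(H @ w))
    return (H * p[:, None]).mean(axis=0)

w = 0.01 * rng.normal(size=D)
for _ in range(3000):
    w -= 0.3 * grad(w)  # plain gradient descent on the objective above

# w drifts along a uniformly negative direction: its norm keeps growing
# and its first coordinate becomes increasingly negative.
print(round(float(np.linalg.norm(w)), 2), round(float(w[0]), 2))
```

The gradient shrinks exponentially as $w$ moves into the cone, so the norm grows without ever reaching a finite minimizer, matching the $\lim_{k\to\infty} k \cdot v$ analysis.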

Solutions to The Problem
In this section, we first introduce the solution proposed in Gao et al. (2019). Then we theoretically analyze the limitations of the previously proposed method. Finally, we propose an alternative regularization to tackle the problem.

Cosine Regularization
As word embeddings tend to lie in a narrow cone, a straightforward solution is to enlarge the aperture of the cone, defined as the maximum angle between any two boundaries of the cone. For ease of optimization, however, Gao et al. (2019) propose to minimize the cosine similarities between any two word embeddings. The overall loss is the typical negative log-likelihood loss plus the regularization term as follows.
$$\mathcal{L} = \mathcal{L}_{NLL} + \frac{\gamma}{N^2} \sum_{i \neq j} \hat{w}_i^T \hat{w}_j, \qquad (4)$$

where $\hat{w} = w/\|w\|$ is the normalized direction of $w$, and $\gamma > 0$ is a hyperparameter. The cosine regularization minimizes the similarities of all word pairs indiscriminately, which might not be a good idea, especially when two words are semantically close and correlated. More importantly, this regularization is theoretically insufficient to solve the representation degeneration problem, as the following analysis shows.
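As a side note, the pairwise cosine penalty is cheap to compute. The PyTorch sketch below (the function name `cos_reg` and the mean-over-pairs normalization are our choices; the exact scaling in Gao et al. (2019) may differ) uses the identity $\sum_{i,j}\hat{w}_i^T\hat{w}_j = \|\sum_i \hat{w}_i\|^2$ to avoid materializing the $N \times N$ similarity matrix.

```python
import torch

def cos_reg(W: torch.Tensor, gamma: float = 1.0) -> torch.Tensor:
    """Mean pairwise cosine similarity of the embedding rows (a sketch).

    Uses sum_{i,j} w_hat_i . w_hat_j = ||sum_i w_hat_i||^2, so the
    N x N similarity matrix is never materialized.
    """
    W_hat = W / W.norm(dim=1, keepdim=True)   # normalize each row
    n = W.shape[0]
    total = W_hat.sum(dim=0).pow(2).sum()     # ||sum_i w_hat_i||^2
    # drop the n diagonal terms (each equals 1), average the rest
    return gamma * (total - n) / (n * (n - 1))

# sanity check against the naive double loop
W = torch.randn(5, 3)
naive = sum((W[i] / W[i].norm()) @ (W[j] / W[j].norm())
            for i in range(5) for j in range(5) if i != j) / (5 * 4)
print(bool(torch.isclose(cos_reg(W), naive, atol=1e-5)))  # prints True
```

The closed form makes the regularizer $O(ND)$ per step rather than $O(N^2 D)$, which is what makes it practical at vocabulary sizes of tens of thousands.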
Following the previous study, we write the objective function with the cosine regularization term w.r.t. a non-appeared word $w_N$ as

$$L(w_N) = \frac{1}{M}\sum_{i=1}^{M}\log\left(G_i + \exp(w_N^T h_i)\right) + \hat{w}_N^T \hat{w}_C, \qquad (5)$$

where $\hat{w}_C = \frac{2\gamma}{N^2}\sum_{j=1}^{N-1} \hat{w}_j$ can be considered a constant. As the cosine regularization term is a function of $w_N$, setting $w_N = \lim_{k\to\infty} k \cdot v$ may no longer achieve the optimal solution of Eq. (5), which prevents word embeddings from lying in the cone. However, we find that the degeneration still exists in certain cases. To show this, we first define the uniformly negative direction cone.

Definition 1. Let $C$ denote the uniformly negative direction cone of the hidden states, i.e., $C = \{v \mid v^T h_i < 0, \; \forall i \in \{1, ..., M\}\}$.

Note that $C$ is a set of vectors; for convenience, we use $-C$ to denote the set of their negative vectors. Since the cosine regularization term $\hat{w}_N^T \hat{w}_C$ is the projection length of the vector $\hat{w}_C$ onto the unit vector $\hat{w}_N$, the objective value depends on $\hat{w}_C$. The following theorem states that the degeneration exists when $\hat{w}_C$ lies in certain directions.

Theorem 1. If the uniformly negative direction cone $C$ is not empty and $\hat{w}_C$ is in $-C$, then the optimal solution of Eq. (5) can be achieved by setting $w_N^* = \lim_{k\to\infty} k \cdot v^*$, where $v^* \in C$ has the opposite direction of $\hat{w}_C$.

Proof. Since $\hat{w}_C \in -C$, it is easy to check that there exists a uniformly negative direction vector $v^*$ that is in $C$ and has the opposite direction of $\hat{w}_C$. Note that the two terms in Eq. (5) have bounded minimum values $\frac{1}{M}\sum_{i=1}^{M}\log(G_i)$ and $-\|\hat{w}_C\|$, which are both simultaneously achieved by setting $w_N^* = \lim_{k\to\infty} k \cdot v^*$.

We argue that the condition in Theorem 1 is likely to hold in language modeling. Under the log-likelihood maximization objective, each appeared word embedding $w_{y_i}$ tends to be optimized to maximize its correlation with its hidden state $h_i$. Note that $\hat{w}_C$ represents the average direction of all appeared words. Therefore, $\hat{w}_C$ is likely to correlate negatively with any uniformly negative direction $v$ and thus lie in $-C$. From Theorem 1, the degeneration still exists as long as $\hat{w}_C$ has a direction opposite to $C$. Nevertheless, this condition still seems strong, so we next give a more general condition under which the degeneration exists. We first provide a lemma.

Lemma 1. Let $w_N^*$ be the optimal solution of Eq. (5). If $w_N^*$ is in $C$, then $\|w_N^*\| = \infty$ and the minimum objective value is $\frac{1}{M}\sum_{i=1}^{M}\log(G_i) + \hat{w}_N^{*T}\hat{w}_C$.

Proof. We prove the lemma by contradiction. Suppose there is an optimal solution $w_N \in C$ with a finite length. Let $w_N^* = \lim_{k\to\infty} k \cdot w_N$ and let $L(\cdot)$ denote the objective function of Eq. (5). Since $w_N \in C$, scaling it to infinity strictly decreases the first term of Eq. (5) while leaving the regularization term, which depends only on the direction $\hat{w}_N$, unchanged. Hence $L(w_N^*) < L(w_N)$,

which raises the contradiction.
Based on Lemma 1, we give the following theorem.
Theorem 2. If the uniformly negative direction cone $C$ is not empty and $\mathbb{E}\left[\frac{G}{Z}\right] < \exp\left(-\frac{4\gamma(N-1)}{N^2}\right)$, where $Z_i = G_i + \exp(w_N^T h_i)$ denotes the full softmax normalizer and the expectation is taken over positions $i$, then the optimal solution of Eq. (5) is in $C$.
Proof. Consider the two cases for a candidate optimal solution: $w_N^{(1)} \in C$ and $w_N^{(2)} \notin C$. From Lemma 1, we have $\|w_N^{(1)}\| = \infty$, and $L(w_N^{(1)})$ is bounded. We compare the maximum value of $L(w_N^{(1)})$ with the minimum value of $L(w_N^{(2)})$.

We write Eq. (7) in expectation form and apply Jensen's inequality.

Eq. (10) gives the condition under which $L(w_N^{(1)})$ is always smaller than $L(w_N^{(2)})$; under this condition, the optimal solution is in $C$.
Note that the vocabulary size N is usually large in language modeling, e.g., 10,000 for the Penn Treebank data set and over 30,000 for the WikiText-2 data set. With γ = 1, the right-hand side of the inequality evaluates to 0.9996 and 0.9999, respectively. This makes the inequality very likely to hold in practice, especially for low-frequency words, as we empirically demonstrate in the experiments. Based on Theorem 2 and Lemma 1, we argue that the cosine regularization is insufficient to solve the representation degeneration problem.
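The two threshold values quoted above are easy to verify. A short Python check of the right-hand side of the Theorem 2 condition, $\exp(-\frac{4\gamma(N-1)}{N^2})$, at $\gamma = 1$:

```python
import math

def threshold(N: int, gamma: float = 1.0) -> float:
    # right-hand side of the Theorem 2 condition
    return math.exp(-4.0 * gamma * (N - 1) / N ** 2)

for N in (10_000, 30_000):
    print(N, round(threshold(N), 4))
# 10000 -> 0.9996, 30000 -> 0.9999
```

Because the exponent shrinks as $O(1/N)$, the threshold approaches 1 for large vocabularies, which is exactly why the condition becomes easier to satisfy in practice.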

Laplacian Regularization
The distributional hypothesis (Harris, 1954) is a common assumption in various NLP tasks: two words that occur in similar contexts tend to have similar meanings. We borrow this idea and propose an alternative Laplacian regularization technique. The overall objective is

$$\mathcal{L} = \mathcal{L}_{NLL} + \lambda \sum_{i,j} s_{ij} \|w_i - w_j\|^2 = \mathcal{L}_{NLL} + 2\lambda\,\mathrm{tr}(W^T L W),$$

where $\lambda > 0$ is a hyperparameter and $s_{ij}$ is a similarity weight that measures the context similarity between $w_i$ and $w_j$. $L = D - S$ is the graph Laplacian matrix, where $S = (s_{ij})$ and $D$ is a diagonal matrix whose entries are the column (or row) sums of $S$. $s_{ij}$ can be calculated by any similarity function; we use cosine similarity in this study.
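The graph-Laplacian form of the penalty rests on the standard identity $\sum_{i,j} s_{ij}\|w_i - w_j\|^2 = 2\,\mathrm{tr}(W^T L W)$ with $L = D - S$ (the factor 2 assumes the sum runs over all ordered pairs). A small numpy check of this identity on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D_dim = 6, 4
W = rng.normal(size=(N, D_dim))            # toy embedding matrix

S = rng.random((N, N))
S = (S + S.T) / 2                          # symmetric similarity weights
np.fill_diagonal(S, 0.0)

L = np.diag(S.sum(axis=1)) - S             # graph Laplacian L = D - S

pairwise = sum(S[i, j] * np.sum((W[i] - W[j]) ** 2)
               for i in range(N) for j in range(N))
quadratic = 2.0 * np.trace(W.T @ L @ W)

print(np.isclose(pairwise, quadratic))     # the two forms agree
```

The trace form is the one usually implemented, since it replaces an explicit double loop over word pairs with two matrix products.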
Note that, in the implementation, we detach h from the computational graph to cut off the backpropagation gradient flow. However, computing the Laplacian regularization term over the full vocabulary is computationally expensive. Another issue is that computing s_ij requires sampling appropriate contexts for the words w_i and w_j. To address these issues, we compute the Laplacian regularization term in a stochastic mini-batch way. Specifically, let H ∈ R^{B×T×D} be the hidden-state tensor before the softmax layer, where B is the batch size and T is the sequence length in one batch. We only compute the regularizer over the words predicted by these B × T hidden states and use the corresponding hidden states as contexts to calculate s_ij. We use this simple scheme for ease of implementation; nevertheless, one could design a more sophisticated strategy that incorporates extra knowledge by selecting word pairs and manipulating similarity weights.
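The mini-batch scheme above can be sketched in a few lines of PyTorch. The function name, the use of cosine similarity between hidden states as $s_{ij}$, and the averaging over in-batch pairs are our choices for illustration, not necessarily the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def laplacian_reg(W, H, targets, lam=0.01):
    """Mini-batch Laplacian regularizer (a sketch; names are ours).

    W:       (N, D) tied embedding / softmax weight matrix
    H:       (B*T, D) hidden states before the softmax layer
    targets: (B*T,) indices of the words predicted at these positions
    """
    H = H.detach()  # cut off gradient flow through the hidden states
    # context similarity s_ij: cosine similarity between hidden states
    s = F.cosine_similarity(H.unsqueeze(1), H.unsqueeze(0), dim=-1)
    E = W[targets]                           # embeddings of predicted words
    d2 = torch.cdist(E, E).pow(2)            # squared distances ||w_i - w_j||^2
    return lam * (s * d2).sum() / s.numel()  # averaged over in-batch pairs

# toy usage (random tensors stand in for a real model's states)
torch.manual_seed(0)
W = torch.randn(20, 8, requires_grad=True)
H = torch.randn(6, 8)
targets = torch.randint(0, 20, (6,))
reg = laplacian_reg(W, H, targets)
reg.backward()  # gradients reach W only, not the detached H
print(f"reg = {reg.item():.4f}")
```

The `detach()` call mirrors the paper's note: the regularizer shapes the embedding matrix but is not allowed to push gradients back into the encoder through the hidden states.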
By contrast with cosine regularization, Laplacian regularization treats word pairs discriminately: it pulls word embeddings with similar contexts closer in Euclidean space, which better captures the semantic correlation of words. More importantly, we show that it is less affected by the representation degeneration problem. We first write the objective function with the Laplacian regularization term w.r.t. a non-appeared word $w_N$ as

$$L(w_N) = \frac{1}{M}\sum_{i=1}^{M}\log\left(G_i + \exp(w_N^T h_i)\right) + \lambda \sum_{j=1}^{N} s_j \|w_N - w_j\|^2, \qquad (13)$$

where $s_j$ denotes the similarity weight between $w_N$ and $w_j$.
Theorem 3. If the similarity weights $s_j$ are positive, then the optimal solution $w_N^*$ of Eq. (13) has a finite norm.

Proof. We prove the theorem by contradiction. Suppose $w_N$ is an optimal solution with $\|w_N\| = \infty$. It is easy to check that $\sum_{j=1}^{N} \|w_N - w_j\|^2 s_j = \infty$. Because the first term in Eq. (13) has a bounded minimum value, the overall objective value is infinite. However, the objective function attains finite values (e.g., at $w_N = 0$), which raises the contradiction.
Note that when using cosine similarity to calculate $s_{ij}$, positive weights are not guaranteed. However, we observe in actual language modeling experiments that it is nearly impossible to have $h_i^T h_j \le 0$, which further suggests the existence of $C$. From the above theorem, the optimal solution $w_N^*$ cannot go to infinity along any direction. However, it is difficult to quantitatively characterize whether the optimal solution lies in $C$; we give a qualitative analysis instead. We first write the derivative of Eq. (13) w.r.t. $w_N$:

$$\frac{\partial L(w_N)}{\partial w_N} = \frac{1}{M}\sum_{i=1}^{M}\frac{\exp(w_N^T h_i)}{G_i + \exp(w_N^T h_i)}\, h_i + 2\lambda \sum_{j=1}^{N} s_j (w_N - w_j).$$
Qualitatively, the gradient involves three directions: h_i, w_N, and −w_j. If h_i dominated the gradient direction, gradient descent would likely push the solution into the uniformly negative direction cone C. However, as w_j is an appeared word, it is likely to be positively correlated with h_i under the log-likelihood maximization objective. Therefore, −w_j can take the opposite direction of h_i and serve as a counterbalance that eases the degeneration effect. As for the w_N term, it acts as a regularizer that prevents the parameters from growing too large. We empirically demonstrate the effectiveness of our method in the following experiments.
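The counterbalancing effect can also be observed numerically. Below is a self-contained numpy sketch (toy hidden states, appeared-word embeddings, and positive weights $s_j$, all constants of our own choosing) that runs gradient descent on an objective of the form of Eq. (13): unlike the unregularized case, the iterate settles at a finite point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup in the style of Eq. (13): a uniformly negative direction
# exists (every h_i has first coordinate > 0), and positive weights
# s_j tie w_N to the embeddings of some appeared words.
D, M = 8, 50
H = rng.normal(size=(M, D))
H[:, 0] = np.abs(H[:, 0]) + 0.5
G = np.full(M, 10.0)
W_app = rng.normal(size=(10, D))   # embeddings of appeared words
s = np.full(10, 0.1)               # positive similarity weights
lam = 0.01

def grad(w):
    p = np.exp(H @ w) / (G + np.exp(H @ w))
    g_nll = (H * p[:, None]).mean(axis=0)                     # pushes along -h_i
    g_lap = 2 * lam * (s[:, None] * (w - W_app)).sum(axis=0)  # pulls toward w_j
    return g_nll + g_lap

w = 0.01 * rng.normal(size=D)
for _ in range(5000):
    w -= 0.3 * grad(w)

# Unlike the regularizer-free case, the iterate converges to a finite point.
print(round(float(np.linalg.norm(w)), 2))
```

The quadratic pull toward the appeared embeddings grows linearly with $\|w_N\|$ while the softmax push decays exponentially, so the two forces balance at a finite norm, as Theorem 3 predicts.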

Experiments
In this section, we conduct experiments on the language modeling task to demonstrate the effectiveness of our method.

Language Modeling
We conduct language modeling experiments on two widely used data sets: Penn Treebank (PTB) (Mikolov et al., 2010) and WikiText-2 (WT2) (Merity et al., 2017). We use two recent models as baselines: AWD-LSTM (Merity et al., 2018) and AWD-LSTM-MoS (Yang et al., 2018), which achieved state-of-the-art performance. We also compare with the cosine regularization technique (Gao et al., 2019), as it targets the same representation degeneration problem.
For the experimental settings, we faithfully follow all the settings of AWD-LSTM and AWD-LSTM-MoS. Our method introduces no extra hyperparameters except λ, which we set to 0.01 and 0.001 for PTB and WT2, respectively. For cosine regularization, we set γ to 1 as described in the original paper.
It is worth noting that the baseline papers' results are based on the older Pytorch 0.4.1. We find that the Pytorch version has a large impact on language modeling performance, since Pytorch 0.4.1 and versions above 1.0 differ significantly in implementation. On the PTB data set, we obtain a better 57.39/54.94 perplexity compared with 58.34/56.18 simply by switching to a newer Pytorch without any other changes. We must point out that building a new model upon the latest codebase while still borrowing numbers directly from the baseline papers could be misleading and result in unfair comparisons. To this end, all experiments, including the baselines, are conducted under the same newer Pytorch environment.

The language modeling results on the PTB and WT2 data sets are presented in Table 1 and Table 2, respectively. Our method generally outperforms the baseline methods both with and without finetuning. On the PTB data set, our method improves the AWD-LSTM and AWD-LSTM-MoS baselines by up to 0.83/0.83 and 0.42/0.26 in valid/test perplexity, respectively. On the WT2 data set, it improves them by up to 0.46/0.04 and 0.47/0.62, respectively. Compared with cosine regularization, our method equipped with AWD-LSTM sometimes underperforms, but it consistently outperforms cosine regularization equipped with AWD-LSTM-MoS, by up to 0.53/0.35 and 0.71/0.48 in valid/test perplexity on PTB and WT2, respectively. Note that we do not change any configuration of the baselines; we only add the regularization terms to the loss function. Thus, the improvements come purely from the regularization, which suggests that it eases the degeneration to an extent. Overall, Laplacian regularization is generally better than cosine regularization.
To see how the regularization strength λ affects language modeling performance, we run AWD-LSTM-MoS-LapReg on the larger WT2 data set with λ tuned over the orders of magnitude {1.0, 0.1, 0.01, 0.001, 0.0001}. The test perplexities are divergence (no convergence), 62.60, 62.88, 62.83, and 63.02, respectively. The perplexity fluctuates within an acceptable range and is best at λ = 0.1.

Empirical Study for Theorem 2
We empirically examine whether the condition in Theorem 2 holds in actual language modeling. We calculate $\mathbb{E}[G/Z]$ from the trained AWD-LSTM-CosReg model on the PTB and WT2 data sets, respectively. As we can see from Figure 2, many low-frequency words have $\mathbb{E}[G/Z]$ smaller than $\exp(-\frac{4\gamma(N-1)}{N^2})$, especially on the data set with the larger vocabulary, which shows that the condition in Theorem 2 is likely to hold in practice. It suggests that the degeneration still exists even with the cosine regularization, which is therefore insufficient to solve the problem.

Visualization of Word Embeddings
To empirically investigate the effect of the regularization techniques on word embeddings, we extract the word embeddings trained on the PTB data set and project them into a 2-dimensional space for visualization. As shown in Figure 3(a), the word embeddings are clustered by their frequencies rather than their semantics. The low-frequency words tend to cluster in a local region, which suggests that the word embeddings lie in a narrow cone in the embedding space and the degeneration happens. When the regularization techniques are applied, however, the learned word embeddings are more uniformly distributed around the origin and the degeneration effect is eased. As we can see from Figure 3(b) and Figure 3(c), the low- and high-frequency word embeddings are better mixed, and Laplacian regularization looks better than the others.

Discussion and Future Work
In the above study, we analyzed the limitations of the cosine regularization and empirically demonstrated the effectiveness of the proposed Laplacian regularization. However, an issue remains, which we discuss further in this section; we hope the discussion provides some inspiration for future research.
There is one question that must be asked: Does the Laplacian regularization completely solve the representation degeneration problem? Unfortunately, we cannot give a definite positive answer.
From the above empirical studies, we have evidence that the Laplacian regularization can ease the degeneration to an extent. However, there is also a failure case: the model cannot converge when λ is set too large, because a sufficiently large λ lets the regularization term dominate the objective, and all word embeddings are then optimized to huddle together. The premise of this failure case is that the similarity weights s_ij are all positive. Interestingly, we observe that almost all similarity weights are indeed positive, even though they are calculated by the cosine function, which further suggests that some intrinsic mechanism may cause the degeneration phenomenon. We leave this to future study. Despite this issue, Laplacian regularization is also a general framework for incorporating external knowledge of word-pair relations, such as semantic knowledge graphs and synonymy/antonymy, which might bring benefits in certain applications.
In addition, we find that the representation degeneration problem is highly related to the softmax bottleneck problem (Yang et al., 2018); in fact, we consider them two sides of the same problem. The softmax bottleneck states that a language model's output log-probability matrix should be high-rank for natural language, but the rank is limited by the embedding dimension D, and thus the expressiveness of the model is compromised. The softmax bottleneck problem is rooted in an insufficient embedding dimension D; however, what really matters is the rank of the embedding matrix rather than the dimension D. The representation degeneration problem, in turn, reveals that for an N × D word embedding matrix, the rank can be smaller than D, since the word embeddings are correlated and lie in a narrow cone. Thus, the true crux is the spectral density distribution of the embedding matrix. There is also evidence (Mu and Viswanath, 2018) that an embedding matrix with more uniformly distributed singular values improves downstream task performance. We therefore suggest two lines of research to enhance the expressiveness of a language model. The first is to learn more expressive word embeddings (Gao et al., 2019; Gong et al., 2018). The second is to design more expressive output/activation functions (Yang et al., 2018; Ganea et al., 2019; Kanai et al., 2018; Takase et al., 2018). Nonetheless, we want to clarify that focusing only on the embedding/output layers is far from sufficient for language modeling, since it is the middle layers that provide the major non-linearity, which matters most for expressiveness. Exploring new architectures like BERT (Devlin et al., 2019) and the Transformer-XL (Dai et al., 2019) is also essential for future study.
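As a rough illustration of this spectral view, one can score an embedding matrix by the entropy of its normalized singular values. The measure below (`spectrum_flatness` is our own illustrative construction, not from this paper or Mu and Viswanath (2018)) is near 1 for well-spread embeddings and drops when one direction dominates, as in a narrow cone.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectrum_flatness(W):
    """Entropy of the normalized singular-value distribution, in (0, 1].

    Values near 1 mean the singular values are nearly uniform (embeddings
    spread across many directions); small values mean a few directions
    dominate the matrix.
    """
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

iso = rng.normal(size=(1000, 64))        # well-spread embeddings
cone = 0.05 * rng.normal(size=(1000, 64))
cone[:, 0] += 5.0                        # degenerate: all rows share one direction

print(round(spectrum_flatness(iso), 2), round(spectrum_flatness(cone), 2))
```

On the toy matrices, the isotropic cloud scores close to 1 while the cone-shaped one scores much lower, matching the intuition that degeneration concentrates the spectrum in a few directions.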

Related Work
For neural language modeling, Merity et al. (2018) build an important baseline named AWD-LSTM, which applies various regularization techniques to train an LSTM. Melis et al. (2018) achieve similar results with highly regularized LSTMs. Built on AWD-LSTM, Yang et al. (2018) propose the AWD-LSTM-MoS model, which achieves significantly lower perplexities by addressing the softmax bottleneck. Gong et al. (2018) find that word embeddings in language modeling are biased towards word frequency and propose an adversarial training scheme to address the problem. Similarly, adversarial noise has been introduced into the embedding layer while training language models. Recently, another promising line of language models built upon the self-attention mechanism, such as the Transformer-XL (Dai et al., 2019), has rapidly emerged. Gao et al. (2019) first point out the representation degeneration problem in training neural language models with the weight tying technique. A similar phenomenon can also be observed in Gong et al. (2018), though that work does not explicitly target the degeneration problem. Furthermore, Ethayarajh (2019) observes that contextualized representations are also anisotropic and lie in a narrow cone in all non-input layers. Recently, Wang et al. (2020) propose a method that explicitly controls the singular value distribution to tackle the representation degeneration problem. We also consider the softmax bottleneck problem (Yang et al., 2018) highly related to the representation degeneration problem; a series of works (Ganea et al., 2019; Kanai et al., 2018; Takase et al., 2018) follow this line of research.
Laplacian regularization has been widely used in various fields such as semi-supervised learning (Belkin and Niyogi, 2004), face recognition (Cai et al., 2007), graph embedding (Yu et al., 2020), and metric learning (Hoi et al., 2010), to name a few. However, to the best of our knowledge, it has not yet been applied to regularize the word embedding matrix; we are probably the first to propose Laplacian regularization on word embeddings.

Conclusion
In this paper, we study the representation degeneration problem first pointed out by Gao et al. (2019). We theoretically analyze the limitations of the previously proposed solution and then propose an alternative Laplacian regularization method to tackle the problem. Experiments on language modeling demonstrate the effectiveness of our method. In future work, we will further investigate this problem from the perspective of the spectral density of the embedding matrix.