Specializing Word Embeddings (for Parsing) by Information Bottleneck

Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.


Introduction
Word embedding systems like BERT and ELMo use spelling and context to obtain contextual embeddings of word tokens. These systems are trained on large corpora in a task-independent way. The resulting embeddings have proved to be useful for both syntactic and semantic tasks, with different layers of ELMo or BERT being somewhat specialized to different kinds of tasks (Peters et al., 2018b; Goldberg, 2019). State-of-the-art performance on many NLP tasks can be obtained by fine-tuning, i.e., back-propagating task loss all the way back into the embedding function (Peters et al., 2018a; Devlin et al., 2018).
In this paper, we explore what task-specific information appears in the embeddings before fine-tuning takes place. We focus on the task of dependency parsing, but our method can be easily extended to other syntactic or semantic tasks. Our method compresses the embeddings by extracting just their syntactic properties: specifically, the information needed to reconstruct parse trees (because that is our task). Our nonlinear, stochastic compression function is explicitly trained by variational information bottleneck (VIB) to forget task-irrelevant information. This is reminiscent of canonical correspondence analysis (Anderson, 2003), a method for reducing the dimensionality of an input vector so that it remains predictive of an output vector, although here we predict an output tree instead. However, VIB goes beyond mere dimensionality reduction to a fixed lower dimensionality, since it also avoids unnecessary use of the dimensions that are available in the compressed representation, blurring unneeded capacity via randomness. The effective number of dimensions may therefore vary from token to token. For example, a parser may be content to know about an adjective token only that it is adjectival, whereas to find the dependents of a verb token, it may need to know the verb's number and transitivity, and to attach a preposition token, it may need to know the identity of the preposition.
We try compressing to both discrete and continuous task-specific representations. Discrete representations yield an interpretable clustering of words. We also extend the information bottleneck method to allow us to control the contextual specificity of the token embeddings, making them more like type embeddings.
This specialization method is complementary to the previous fine-tuning approach. Fine-tuning introduces new information into word embeddings by backpropagating the loss, whereas the VIB method learns to exploit the existing information found by the ELMo or BERT language model. VIB also has less capacity and less danger of overfitting, since it fits fewer parameters than fine-tuning (which in the case of BERT has the freedom to adjust the embeddings of all words and word pieces, even those that are rare in the supervised fine-tuning data). VIB is also very fast to train on a single GPU.
We discover that our syntactically specialized embeddings are predictive of the gold POS tags in a few-shot learning setting, validating the intuition that a POS tag summarizes a word token's syntactic properties. However, our representations are tuned explicitly for discriminative parsing, so they prove to be even more useful for this task than POS tags, even at the same level of granularity. They are also more useful than the uncompressed ELMo representations when it comes to generalizing to test data. (The first comparison uses discrete tags, and the second uses continuous tags.)


Background: Information Bottleneck

The information bottleneck (IB) method originated in information theory and has been adopted by the machine learning community as a training objective (Tishby et al., 2000) and a theoretical framework for analyzing deep neural networks (Tishby and Zaslavsky, 2015b).
Let X represent an "input" random variable such as a sentence, and Y represent a correlated "output" random variable such as a parse. Suppose we know the joint distribution p(X, Y). (In practice, we will use the empirical distribution over a sample of (x, y) pairs.) Our goal is to learn a stochastic map p_θ(t | x) from X to some compressed representation T, which in our setting will be something like a tag sequence. IB seeks to minimize

  min_θ −I(Y; T) + β · I(X; T)   (1)

where I(·; ·) is the mutual information.¹ A low loss means that T does not retain very much information about X (the second term), while still retaining enough information to predict Y (the first term).² The balance between the two MI terms is controlled by a Lagrange multiplier β. By increasing β, we increase the pressure to keep I(X; T) small, which "narrows the bottleneck" by favoring compression over predictive accuracy I(Y; T). Regarding β as a Lagrange multiplier, we see that the goal of IB is to maximize the predictive power of T subject to some constraint on the amount of information about X that T carries. If the map from X to T were deterministic, then it could lose information only by being non-injective: the traditional example is dimensionality reduction, as in the encoder of an encoder-decoder neural net. But IB works even if T can take values throughout a high-dimensional space, because the randomness in p_θ(t | x) means that T is noisy in a way that wipes out information about X. Using a high-dimensional space is desirable because it permits the amount of effective dimensionality reduction to vary, with T perhaps retaining much more information about some x values than others, as long as the average retained information I(X; T) is small.
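To make the two terms of objective (1) concrete, the sketch below (our own toy illustration, not the paper's code) computes both mutual-information quantities exactly for a tiny discrete world: a joint p(x, y) over 3 "sentences" and 2 "parses", and a stochastic encoder p(t | x) with 2 tags. All numbers are invented for illustration.

```python
import numpy as np

def mutual_info(p_joint):
    """I(A;B) in nats, computed exactly from a joint distribution table p(a, b)."""
    pa = p_joint.sum(axis=1, keepdims=True)
    pb = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (pa @ pb)[mask])).sum())

# Toy world: 3 "sentences" x, 2 "parses" y, with joint p(x, y).
p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.3],
                 [0.1, 0.1]])
p_x = p_xy.sum(axis=1)                # marginal over sentences

# A stochastic encoder p(t | x) with 2 tags (rows sum to 1).
p_t_given_x = np.array([[0.9, 0.1],
                        [0.1, 0.9],
                        [0.5, 0.5]])

p_xt = p_x[:, None] * p_t_given_x     # joint p(x, t)
# T depends on Y only through X (Markov chain T <- X -> Y), so
# p(t, y) = sum_x p(x, y) * p(t | x).
p_ty = p_t_given_x.T @ p_xy

beta = 0.1
ib_loss = -mutual_info(p_ty) + beta * mutual_info(p_xt)   # objective (1)
```

Raising `beta` penalizes I(X; T) more heavily, so the optimal encoder would become noisier (more compression); the data-processing inequality guarantees I(X; T) ≥ I(Y; T) here.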

Formal Model
In this paper, we extend the original IB objective (1) by adding terms I(T_i; X | X̂_i) to control the context-sensitivity of the extracted tags:

  min_θ −I(Y; T) + β · I(X; T) + γ · Σ_{i=1}^n I(T_i; X | X̂_i)   (2)

Here T_i is the tag associated with the ith word, X_i is the ELMo token embedding of the ith word, and X̂_i is the same word's ELMo type embedding (before context is incorporated).
In this section, we will explain the motivation for the additional term and how to efficiently estimate variational bounds on all terms (a lower bound for I(Y; T) and upper bounds for the rest).³

¹ In our IB notation, larger β means more compression. Note that there is another version of IB that instead puts β as the coefficient in front of I(Y; T); the two versions are equivalent.

² Since T is a stochastic function of X with no access to Y, it obviously cannot convey more information about Y than the uncompressed input X does. As a result, Y is independent of T given X, as in the graphical model T → X → Y.

³ Traditional Shannon entropy H(·) is defined on discrete variables. In the case of continuous variables, we interpret H(·) to instead denote differential entropy (which would be −∞ for discrete variables). Scaling a continuous random variable affects its differential entropy, but not its mutual information with another random variable, which is what we use here.


I(X; T) - the Token Encoder p_θ

We instantiate the variational IB (VIB) estimation method (Alemi et al., 2016) on our dependency parsing task, as illustrated in Figure 1. We compress a sentence's word embeddings X_i into continuous vector-valued tags or discrete tags T_i ("encoding") such that the tag sequence T retains maximum ability to predict the dependency parse Y ("decoding"). Our chosen architecture compresses each X_i independently using the same stochastic, information-losing transformation.

The IB method introduces the new random variable T, the tag sequence that compresses X, by defining the conditional distribution p_θ(t | x). In our setting, p_θ is a stochastic tagger, for which we will adopt a parametric form (§3.1 below). Its parameters θ are chosen to minimize the IB objective (2). By IB's independence assumption,² the joint probability can be factored as p_θ(x, y, t) = p(x, y) · p_θ(t | x), and the compression term of the objective can be written as

  I(X; T) = E_x [ E_{t∼p_θ(t|x)} [ log (p_θ(t | x) / p_θ(t)) ] ].

Making this term small yields a representation T that, on average, retains little information about X. The outer expectation is over the true distribution of sentences x; we use an empirical estimate, averaging over the unparsed sentences in a dependency treebank. To estimate the inner expectation, we could sample, drawing taggings t from p_θ(t | x).

We must also compute the quantities within the inner brackets. The p_θ(t | x) term is defined by our parametric form. The troublesome term is p_θ(t) = E_{x'} [p_θ(t | x')], since even estimating it from a treebank requires an inner loop over treebank sentences x'. To avoid this, variational IB replaces p_θ(t) with some variational distribution r_ψ(t). This can only increase our objective function, since the difference between the variational and original versions of this term is a KL divergence and hence non-negative:

  E_x [ E_{t∼p_θ(t|x)} [ log (p_θ(t | x) / r_ψ(t)) ] ] − I(X; T) = KL[p_θ(t) || r_ψ(t)] ≥ 0.

Thus, the variational version (the first term above) is indeed an upper bound for I(X; T) (the second term above). We will minimize this upper bound by adjusting not only θ but also ψ, thus making the bound as tight as possible given θ. Also, we will no longer need to sample t for the inner expectation of the upper bound, E_{t∼p_θ(t|x)} [log (p_θ(t | x) / r_ψ(t))], because this expectation equals KL[p_θ(t | x) || r_ψ(t)], and we will define the parametric p_θ and r_ψ so that this KL divergence can be computed exactly: see §4.
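The upper-bound property can be checked numerically on a toy discrete example. The sketch below (ours, with invented numbers) writes I(X; T) as E_x[KL(p_θ(t|x) || p_θ(t))] and confirms that substituting any variational r(t) for the true marginal only adds the non-negative gap KL(p_θ(t) || r(t)).

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions, in nats."""
    mask = p > 0
    return float((p[mask] * np.log(p[mask] / q[mask])).sum())

p_x = np.array([0.4, 0.4, 0.2])              # distribution over 3 "sentences"
p_t_given_x = np.array([[0.9, 0.1],          # stochastic encoder, 2 tags
                        [0.1, 0.9],
                        [0.5, 0.5]])
p_t = p_x @ p_t_given_x                      # true marginal p_theta(t)

# I(X;T) written as E_x[ KL(p(t|x) || p(t)) ].
i_xt = sum(p_x[i] * kl(p_t_given_x[i], p_t) for i in range(3))

# Any variational r(t) gives an upper bound; the gap is KL(p(t) || r(t)).
r = np.array([0.3, 0.7])
bound = sum(p_x[i] * kl(p_t_given_x[i], r) for i in range(3))
gap = kl(p_t, r)
```

Minimizing the bound over r drives r toward the true marginal p_θ(t), at which point the bound is tight.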

Two Token Encoder Architectures
We choose to define p_θ(t | x) = ∏_{i=1}^n p_θ(t_i | x_i). That is, our stochastic encoder will compress each word x_i individually (although x_i is itself a representation that depends on context): see Figure 1. We make this choice not for computational reasons (our method would remain tractable even without it) but because our goal in this paper is to find the syntactic information in each individual ELMo token embedding (a goal we will further pursue in §3.3 below).
To obtain continuous tags, we define p_θ(t_i | x_i) such that t_i ∈ R^d is Gaussian-distributed with a mean vector and diagonal covariance matrix computed from the ELMo word vector x_i via a feedforward neural network with 2d outputs and no transfer function at the output layer. To ensure positive semidefiniteness of the diagonal covariance matrix, we square the latter d outputs to obtain the diagonal entries.⁴ Alternatively, to obtain discrete tags, we define p_θ(t_i | x_i) such that t_i ∈ {1, . . ., k} follows a softmax distribution, where the k softmax parameters are similarly computed by a feedforward network with k outputs and no transfer function at the output layer.
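A minimal numpy sketch of the two encoder heads just described, under illustrative layer sizes and weight initializations of our own choosing (the paper's actual implementation and hyperparameters may differ): a tanh hidden layer feeds either 2d linear outputs (mean plus pre-variance, squared to keep the diagonal covariance positive semidefinite) or k linear outputs (softmax logits).

```python
import numpy as np

rng = np.random.default_rng(0)
elmo_dim, hidden, d, k = 1024, 128, 5, 64   # d: continuous tag dim; k: discrete tags

W1 = rng.normal(0.0, 0.02, (hidden, elmo_dim))    # shared tanh hidden layer
W_gauss = rng.normal(0.0, 0.02, (2 * d, hidden))  # 2d outputs: mean and pre-variance
W_cat = rng.normal(0.0, 0.02, (k, hidden))        # k outputs: softmax logits

def continuous_encoder(x_i):
    """p(t_i | x_i) = Normal(mu, diag(sigma2)); no output transfer function."""
    out = W_gauss @ np.tanh(W1 @ x_i)
    mu, sigma2 = out[:d], out[d:] ** 2   # squaring keeps the variances nonnegative
    return mu, sigma2

def discrete_encoder(x_i):
    """p(t_i | x_i) = softmax over k tags."""
    logits = W_cat @ np.tanh(W1 @ x_i)
    e = np.exp(logits - logits.max())    # max-shift for numerical stability
    return e / e.sum()

x_i = rng.normal(size=elmo_dim)          # stand-in for an ELMo token vector
mu, sigma2 = continuous_encoder(x_i)
probs = discrete_encoder(x_i)
```

The same architecture, fed a level-0 type vector instead of a token vector, serves as the type encoder s_ξ of §3.3.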
We similarly define r_ψ(t) = ∏_{i=1}^n r_ψ(t_i), where ψ directly specifies the 2d or k values corresponding to the output layer above (since there is no input x_i to condition on).

I(T_i; X | X̂_i) - Controlling Context Sensitivity
While the IB objective (1) asks each tag t_i to be informative about the parse Y, we were concerned that it might not be interpretable as a tag of word i specifically. Given ELMo or any other black-box conversion of a length-n sentence to a sequence of contextual vectors x_1, . . ., x_n, it is possible that x_i contains not only information about word i but also information describing word i + 1, say, or the syntactic constructions in the vicinity of word i. Thus, while p_θ(t_i | x_i) might extract some information from x_i that is very useful for parsing, there is no guarantee that this information came from word i and not from its neighbors. Although we do want tag t_i to consider context (e.g., to distinguish between noun and verb uses of word i), we want "most" of t_i's information to come from word i itself. Specifically, it should come from ELMo's level-0 embedding of word i, denoted by x̂_i: a word type embedding that does not depend on context.
To penalize T_i for capturing "too much" contextual information, our modified objective (2) adds a penalty term γ · I(T_i; X | X̂_i), which measures the amount of information about T_i given by the sentence X as a whole, beyond what is given by X̂_i:

  I(T_i; X | X̂_i) = E_x [ E_{t_i∼p_θ(t_i|x)} [ log (p_θ(t_i | x) / p_θ(t_i | x̂_i)) ] ].

Setting γ > 0 will reduce this contextual information.
In practice, we found that I(T_i; X | X̂_i) was small even when γ = 0, on the order of 3.5 nats, whereas I(T_i; X) was 50 nats. In other words, the tags extracted by the classical method were already fairly local, so increasing γ above 0 had little qualitative effect. Still, γ might be important when applying our method to ELMo's competitors such as BERT.
We can derive an upper bound on I(T_i; X | X̂_i) by approximating the conditional distribution p_θ(t_i | x̂_i) with a variational distribution s_ξ(t_i | x̂_i), similar to §3.1:

  I(T_i; X | X̂_i) ≤ E_x [ E_{t_i∼p_θ(t_i|x)} [ log (p_θ(t_i | x) / s_ξ(t_i | x̂_i)) ] ].

The formal presentation above does not assume the specific factored model that we adopted in §3.2. When we adopt that model, p_θ(t_i | x) above reduces to p_θ(t_i | x_i), but our method in this section still has an effect, because x_i still reflects the context of the full sentence whereas x̂_i does not.
Type Encoder Architectures Notice that s_ξ(t_i | x̂_i) may be regarded as a type encoder, with parameters ξ that are distinct from the parameters θ of our token encoder p_θ(t_i | x_i). Given a choice of neural architecture for p_θ(t_i | x_i) (see §3.2), we always use the same architecture for s_ξ(t_i | x̂_i), except that p_θ takes a token vector as input whereas s_ξ takes a context-independent type vector. s_ξ is not used at test time, but only as part of our training objective.

I(Y ; T) -the Decoder q
Writing I(Y; T) = E_{y,t∼p_θ} [ log (p_θ(y | t) / p(y)) ], the p(y) factor can be omitted during optimization as it does not depend on θ. Thus, making I(Y; T) large tries to obtain a high log-probability p_θ(y | t) for the true parse y when reconstructing it from t alone.
But how do we compute p_θ(y | t)? This quantity effectively marginalizes over possible sentences x that could have explained t. Recall that p_θ is a joint distribution over (x, y, t); thus

  p_θ(y | t) = Σ_x p_θ(x, y, t) / Σ_{x,y'} p_θ(x, y', t).

To estimate these sums accurately, we would have to identify the sentences x that are most consistent with the tagging t (that is, those for which p(x) · p_θ(t | x) is large): these contribute the largest summands, but might not appear in any corpus.
To avoid this, we replace p_θ(y | t) with a variational approximation q_φ(y | t) in our formula for I(Y; T). Here q_φ(· | ·) is a tractable conditional distribution, and may be regarded as a stochastic parser that runs on a compressed tag sequence t instead of a word embedding sequence x. This modified version of I(Y; T) forms a lower bound on I(Y; T), for any value of the variational parameters φ, since the difference between them is a KL divergence and hence non-negative:

  E_{y,t∼p_θ} [ log (p_θ(y | t) / q_φ(y | t)) ] = E_t [ KL[p_θ(y | t) || q_φ(y | t)] ] ≥ 0.

We will maximize this lower bound of I(Y; T) with respect to both θ and φ. For any given θ, the optimal φ minimizes the expected KL divergence, meaning that q_φ approximates p_θ well. More precisely, we again drop p(y) as constant and then maximize a sampling-based estimate of E_{y,t∼p_θ} [log q_φ(y | t)]. To sample y, t from the joint p_θ(x, y, t) we must first sample x, so we rewrite this as E_{x,y} [ E_{t∼p_θ(t|x)} [log q_φ(y | t)] ]. The outer expectation E_{x,y} is estimated as usual over a training treebank. The expectation E_{t∼p_θ(t|x)} recognizes that t is stochastic, and again we estimate it by sampling. In short, when t is a stochastic compression of a treebank sentence x, we would like our variational parser on average to assign high log-probability q_φ(y | t) to its treebank parse y.
Decoder Architecture We use the deep biaffine dependency parser (Dozat and Manning, 2016) as our variational distribution q_φ(y | t), which functions as the decoder. This parser uses a Bi-LSTM to extract features from the compressed tags or vectors and to assign a score to each tree edge, setting q_φ(y | t) proportional to the exp of the total score of all edges in y. During IB training, the code⁵ computes only an approximation to q_φ(y | t) for the gold tree y (although in principle, it could have computed the exact normalizing constant in polytime with Tutte's matrix-tree theorem (Smith and Smith, 2007; Koo et al., 2007; McDonald and Satta, 2007)). When we test the parser, the code exactly finds argmax_y q_φ(y | t) via the directed spanning tree algorithm of Edmonds (1966).

Training and Inference
With the approximations in §3, our final minimization objective is this upper bound on (2):

  E_{x,y} [ −E_{t∼p_θ(t|x)} [log q_φ(y | t)] + β · KL[p_θ(t | x) || r_ψ(t)] + γ · Σ_{i=1}^n KL[p_θ(t_i | x) || s_ξ(t_i | x̂_i)] ]

We apply stochastic gradient descent to optimize this objective. To get a stochastic estimate of the objective, we first sample some (x, y) from the treebank. We then have many expectations over t ∼ p_θ(t | x), including the KL terms. We could estimate these by sampling t from the token encoder p_θ(t | x) and then evaluating all of the q_φ, p_θ, r_ψ, and s_ξ probabilities. However, in fact we use the sampled t only to estimate the first expectation (by computing the decoder probability q_φ(y | t) of the gold tree y); we can compute the KL terms exactly by exploiting the structure of our distributions. The structure of p_θ and r_ψ means that the first KL term decomposes into Σ_{i=1}^n KL[p_θ(t_i | x_i) || r_ψ(t_i)]. All KL terms are now between either two Gaussian distributions over a continuous tagset⁶ or two categorical distributions over a small discrete tagset.⁷

[Table 1 caption: "Treebank" is the treebank identifier in UD, "#Token" is the number of tokens in the treebank, "H(A)" is the entropy of a gold POS tag (in nats), and "H(A | X)" is the conditional entropy of a gold POS tag conditioned on a word type (in nats).]

To compute the stochastic gradient, we run backpropagation on this computation. We must apply the reparametrization trick to backpropagate
through the step that sampled t. This finds the gradient of the parameters that derive t from a random variate z, while holding z itself fixed. For continuous t, we use the reparametrization trick for multivariate Gaussians (Rezende et al., 2014). For discrete t, we use the Gumbel-softmax variant (Jang et al., 2016; Maddison et al., 2016).
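The two sampling steps above can be sketched as follows (our own illustration; variable names and the temperature value are invented). The Gaussian sampler exposes the reparametrization t = μ + σ·z, the Gumbel-softmax sampler gives a differentiable relaxation of a categorical draw, and a closed-form diagonal-Gaussian KL is included since the KL terms of the objective are computed exactly rather than sampled.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_gaussian(mu, sigma2):
    """Reparametrization trick: t = mu + sigma * z with z ~ N(0, I).
    Gradients w.r.t. mu and sigma2 flow through t; z itself is held fixed."""
    z = rng.standard_normal(mu.shape)
    return mu + np.sqrt(sigma2) * z

def sample_gumbel_softmax(probs, tau=0.5):
    """Gumbel-softmax relaxation of a draw from a categorical distribution."""
    g = -np.log(-np.log(rng.uniform(size=probs.shape)))  # Gumbel(0,1) noise
    logits = (np.log(probs) + g) / tau                   # tau -> 0: nearly one-hot
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl_diag_gaussians(mu0, s0, mu1, s1):
    """Closed-form KL between diagonal Gaussians; no sampling needed."""
    return 0.5 * float(np.sum(np.log(s1 / s0) + (s0 + (mu0 - mu1) ** 2) / s1 - 1.0))

mu, sigma2 = np.array([0.3, -1.0]), np.array([0.5, 2.0])
t = sample_gaussian(mu, sigma2)
soft_tag = sample_gumbel_softmax(np.array([0.7, 0.2, 0.1]))
```

In an autodiff framework, the same expressions make the Monte Carlo term of the objective differentiable in θ while the KL terms are evaluated analytically.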
To evaluate our trained model's ability to parse a sentence x from compressed tags, we obtain a parse as argmax_y q_φ(y | t), where t ∼ p_θ(· | x) is a single sample. A better parser would instead estimate argmax_y E_t [q_φ(y | t)], where E_t averages over many samples t, but this is computationally hard.

Experimental Setup
Data Throughout §§6-7, we will examine our compressed tags on a subset of Universal Dependencies (Nivre et al., 2018), or UD, a collection of dependency treebanks across 76 languages that uses the same POS tags and dependency labels. We experiment on Arabic, Hindi, English, French, Spanish, Portuguese, Russian, Italian, and Chinese (Table 1): languages with different syntactic properties such as word order. We use only the sentences of length ≤ 30. For each sentence, x is obtained by running the standard pre-trained ELMo on the UD token sequence (although UD's tokenization may not perfectly match that of ELMo's training data), and y is the labeled UD dependency parse without any part-of-speech (POS) tags. Thus, our tags t are tuned to predict only the dependency relations in UD, and not the gold POS tags a that are also given in UD.
Pretrained Word Embeddings For English, we used the pre-trained English ELMo model from the AllenNLP library (Gardner et al., 2017). For the other 8 languages, we used the pre-trained models from Che et al. (2018). Recall that ELMo has two layers of bidirectional LSTM (layers 1 and 2) built upon a context-independent character CNN (layer 0). We use either layer 1 or layer 2 as the input (x_i) to our token encoder p_θ. Layer 0 is the input (x̂_i) to our type encoder s_ξ. Each encoder network (§§3.2-3.3) has a single hidden layer with a tanh transfer function; this layer has 2d hidden units (typically 128 or 512) for continuous encodings and 512 hidden units for discrete encodings.
Optimization We optimize with Adam (Kingma and Ba, 2014), a variant of stochastic gradient descent. We alternate between improving the model p_θ(t | x) on even epochs and the variational distributions q_φ(y | t), r_ψ(t), s_ξ(t_i | x̂_i) on odd epochs.
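The alternating schedule can be sketched as follows (a trivial illustration of our own; in practice each group would be a set of optimizer parameter groups rather than strings):

```python
# Alternate which parameter group is updated, per epoch.
encoder_params = ["theta"]                 # token encoder p_theta
variational_params = ["phi", "psi", "xi"]  # decoder q_phi, marginal r_psi, type encoder s_xi

def params_for_epoch(epoch):
    """Even epochs update the encoder; odd epochs update the variational distributions."""
    return encoder_params if epoch % 2 == 0 else variational_params

schedule = [params_for_epoch(e) for e in range(4)]
```

Freezing one side while the other improves is a standard coordinate-ascent pattern: tightening the variational bounds on odd epochs makes the even-epoch encoder updates optimize a closer approximation of the true objective.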

Scientific Evaluation
In this section, we study what information about words is retained by our automatically constructed tagging schemes. First, we show the relationship between I(Y; T) and I(X; T) on English as we reduce β to capture more information in our tags.⁸ Second, across 9 languages, we study how our automatic tags correlate with gold part-of-speech tags (and in English, with other syntactic properties), while suppressing information about semantic properties. We also show how decreasing β gradually refines the automatic discrete tag set, giving intuitive fine-grained clusters of English words.

⁸ We always set γ = β to simplify the experimental design.

[Figure 2 caption: "dim" in the legends means the dimensionality of the continuous tag vector or the cardinality of the discrete tag set. On the left, we plot predictiveness I(Y; T) versus I(X; T) as we lower β multiplicatively from 10^1 to 10^-6 on a log scale. On the right, we alter the y-axis to show the labeled attachment score (LAS) of 1-best dependency parsing. All mutual information and entropy values in this paper are reported in nats per token. Furthermore, the mutual information values that we report are actually our variational upper bounds, as described in §3. The reason that I(X; T) is so large for continuous tags is that it is differential mutual information (see footnote 3). Additional tradeoff curves w.r.t. I(T_i; X | X̂_i) are in Appendix B.]

Tradeoff Curves
As we lower β to retain more information about X, both I(X; T) and I(Y; T) rise, as shown in Figure 2. There are diminishing returns: after some point, the additional information retained in T does not contribute much to predicting Y. Also noteworthy is that at each level of I(X; T), very low-dimensional tags (d = 5) perform on par with high-dimensional ones (d = 256). (Note that the high-dimensional stochastic tags must be noisier to keep I(X; T) the same.) The low-dimensional tags allow far faster CPU parsing. This indicates that VIB can achieve strong practical task-specific compression.

Learned Tags vs. Gold POS Tags
We investigate how our automatic tag T_i correlates with the gold POS tag A_i provided by UD.

Continuous Version We use t-SNE (van der Maaten and Hinton, 2008) to visualize our compressed continuous tags on held-out test data, coloring each token in Figure 3 according to its gold POS tag. (Similar plots for the discrete tags are in Figure 6 in the appendix.) In Figure 3, the first plot shows the original uncompressed level-1 ELMo embeddings of the tokens in test data. In this two-dimensional visualization, the POS tags are vaguely clustered, but the boundaries merge together and some tags are diffuse. The second plot shows β = 10^-3 (moderate compression): our compressed embeddings show clear clusters that correspond well to gold POS tags. Note that the gold POS tags were not used in training either ELMo or our method. The third plot shows β = 1 (too much compression), where POS information is largely lost. An interesting observation is that the purple NOUN and blue PROPN clusters overlap in the middle plot, meaning that it was unnecessary to distinguish common nouns from proper nouns for purposes of our parsing task.⁹

⁹ Both can serve as arguments of verbs and prepositions. Both can be modified by determiners and adjectives, giving rise to proper NPs like "The Daily Tribune."

Discrete Version We also quantify how well our specialized discrete tags capture the traditional POS categories, by investigating I(A; T). This can be written as H(A) − H(A | T). Similarly to §3.4, we upper-bound H(A | T) by E_{a,t} [−log q(a | t)], where q(a | t) = ∏_i q(a_i | t_i) is a variational distribution that we train to minimize this upper bound. This is equivalent to training q(a | t) by maximum conditional likelihood. In effect, we are doing transfer learning: we fix our trained IB encoder (p_θ) and now use it to predict A instead of Y, but otherwise follow §3.4. We similarly upper-bound H(A) by assuming a model q'(a) = ∏_i q'(a_i) and estimating q' as the empirical distribution over training tags. Having trained q and q' on training data, we estimate H(A | T) and H(A) using the same upper-bound formulas on our test data.
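The cross-entropy upper bound on H(A | T) can be verified on a toy joint distribution. In this sketch (our own, with invented numbers), the true conditional q(a | t) = p(a | t) attains the bound exactly, and any imperfect classifier only reports a larger value, so the estimate of retained POS information is conservative.

```python
import numpy as np

# Toy joint over (tag t, gold POS a): 3 tags, 2 POS categories.
p_ta = np.array([[0.35, 0.05],
                 [0.05, 0.35],
                 [0.10, 0.10]])
p_t = p_ta.sum(axis=1, keepdims=True)
p_a_given_t = p_ta / p_t                         # true conditional p(a | t)

def cond_cross_entropy(q_a_given_t):
    """E_{t,a}[-log q(a|t)] in nats: an upper bound on H(A|T) for any q."""
    return float(-(p_ta * np.log(q_a_given_t)).sum())

h_a_given_t = cond_cross_entropy(p_a_given_t)    # bound is tight at q = p
q = np.array([[0.6, 0.4],
              [0.5, 0.5],
              [0.5, 0.5]])                       # an imperfect classifier
loose_bound = cond_cross_entropy(q)
```

By Gibbs' inequality the bound holds for every row of q, so training q by maximum conditional likelihood (i.e., minimizing this cross-entropy) tightens the estimate of H(A | T) from above.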
We experiment on all 9 languages, taking T_i at the moderate compression level β = 0.001, k = 64. As Figure 4 shows, averaging over the 9 languages, the reconstruction retains 71% of the POS information (as high as 80% on Spanish and French). We conclude that the information encoded in the specialized tags correlates with the gold POS tags, but does not perfectly predict the POS.
The graph in Figure 4 shows a "U-shaped" curve, with the best overall error rate at β = 0.01. That is, moderate compression of ELMo embeddings helps for predicting POS tags. Too much compression squeezes out POS-related information, while too little compression allows the tagger to overfit the training data, harming generalization to test data. We will see the same pattern for parsing in §7.
Syntactic Features As a quick check, we determine that our tags also make syntactic distinctions beyond those that are recognized by the UD POS tag set, such as tense, number, and transitivity. See Appendix D for graphs. For example, even with moderate compression, we achieve 0.87 classification accuracy in distinguishing between transitive and intransitive English verbs, given only the tag t_i.
Stem When we compress ELMo embeddings to k discrete tags, the semantic information must be squeezed out because k is small. But what about the continuous case? In order to verify that semantic information is excluded, we train a classifier that predicts the stem of word token i from its mean tag vector E[T_i]. We expect "player" and "buyer" to have similar compressed vectors, because they share syntactic roles, but we should fail to predict that they have the different stems "play" and "buy." The classifier is a feedforward neural network with a tanh activation function, whose last layer is a softmax over the stem vocabulary. For the English treebank, we take the word lemma given in UD and use the NLTK library (Bird et al., 2009) to stem each lemma token. Our results (Appendix E) suggest that more compression destroys stem information, as hoped. With light compression, the error rate on stem prediction can be below 15%. With moderate compression (β = 0.01), the error rate is 89% for ELMo layer 2 and 66% for ELMo layer 1. Other languages show the same pattern (Appendix E). Thus, moderate and heavy compression indeed squeeze out semantic information.

Annealing of Discrete Tags
Deterministic annealing (Rose, 1998; Friedman et al., 2001) is a method that gradually decreases β during training of IB. Each token i has a stochastic distribution over the possible tags {1, . . ., k}. This can be regarded as a soft clustering where each token is fractionally associated with each of the k clusters. With high β, the optimal solution turns out to assign all tokens an identical distribution over clusters, for a mutual information of 0. Since all clusters then have the same membership, this is equivalent to having a single cluster. As we gradually reduce β, the cluster eventually splits. Further reduction of β leads to recursive splitting, yielding a hierarchical clustering of tokens (Appendix A). We apply deterministic annealing to the English dataset, and the resulting hierarchical structure reflects properties of English syntax. At the top of the hierarchy, the model places nouns, adjectives, adverbs, and verbs in different clusters. At lower levels, the anaphors ("yourself," "herself," . . .), possessive pronouns ("his," "my," "their," . . .), accusative-case pronouns ("them," "me," "him," "myself," . . .), and nominative-case pronouns ("I," "they," "we," . . .) each form a cluster, as do the wh-words ("why," "how," "which," "who," "what," . . .).
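The recursive splitting can be sketched as follows (our own minimal illustration of the cluster-duplication step described in Appendix A; the exact perturbation scheme is an assumption): each cluster's membership probabilities are divided approximately evenly between two copies, with a small perturbation so that the copies can later diverge as β decreases.

```python
import numpy as np

rng = np.random.default_rng(2)

def split_clusters(p_c_given_x, eps=0.01):
    """Duplicate every cluster, dividing each token's membership probability
    approximately evenly (with a small random perturbation) between the copies."""
    n, k = p_c_given_x.shape
    noise = rng.uniform(-eps, eps, size=(n, k))
    half_a = (0.5 + noise) * p_c_given_x
    half_b = (0.5 - noise) * p_c_given_x
    out = np.empty((n, 2 * k))
    out[:, 0::2], out[:, 1::2] = half_a, half_b   # interleave the two copies
    return out

# Start with one cluster: every token belongs to it with probability 1.
p = np.ones((4, 1))
p = split_clusters(p)          # now a soft assignment over 2 clusters
```

Because the two halves of each split sum back to the original probability, every row remains a valid distribution after any number of splits; retraining between splits then lets near-duplicate clusters either merge back (effectively) or specialize.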

Engineering Evaluation
As we noted in §1, learning to compress ELMo's embeddings for a given task is a fast alternative to fine-tuning all the ELMo parameters. We find that, indeed, training a compression method to keep only the relevant information improves our generalization performance on the parsing task.
We compare 6 different token representations according to the test accuracy of a dependency parser trained to use them.The same training data is used to jointly train the parser and the token encoder that produces the parser's input representations.

Continuous tags:
Iden is a baseline model that leaves the ELMo embeddings uncompressed, so d = 1024. PCA is a baseline that simply uses Principal Components Analysis to reduce the dimensionality to d = 256; again, this is not task-specific. MLP is another deterministic baseline that uses a multi-layer perceptron (as in Dozat and Manning (2016)) to reduce the dimensionality to d = 256 in a task-specific and nonlinear way. This is identical to our continuous VIB method except that the variance of the output Gaussians is fixed to 0, so that the d dimensions are fully informative. VIBc uses our stochastic encoder, still with d = 256. The average amount of stochastic noise is controlled by β, which is tuned per-language on dev data.

Discrete tags: POS is a baseline that uses the k ≤ 17 gold POS tags from the UD dataset. VIBd is our stochastic method with k = 64 tags. To compare fairly with POS, we pick a β value for each language such that H(T) roughly matches H(A), so that the two tag sets have comparable granularity.

Runtime. Our VIB approach is quite fast. With minibatching on a single GPU, it is able to train on 10,000 sentences in 100 seconds per epoch.
Table 2: Parsing accuracy for 9 languages (LAS). Black rows use continuous tags; gray rows use discrete tags (which do worse). In each column, the best score for each color is boldfaced, along with all results of that color that are not significantly worse (paired permutation test, p < 0.05). These results use only ELMo layer 1; results from all layers are shown in Table 3 in the appendix, for both LAS and UAS metrics.
Analysis. Table 2 shows the test accuracies of these parsers, using the standard training/development/test split for each UD language.
In the continuous case, the VIB representation outperforms all three baselines in 8 of 9 languages, and is not significantly worse in the 9th language (Hindi). In short, our VIB joint training generalizes better to test data. This is because the training objective (2) includes terms that focus on the parsing task and also regularize the representations.
In the discrete case, the VIB representation outperforms gold POS tags (at the same level of granularity) in 6 of 9 languages; of the other 3, it is not significantly worse in 2. This suggests that our learned discrete tag set could be an improved alternative to gold POS tags (cf. Klein and Manning, 2003) when a discrete tag set is needed for speed.

Related Work
Much recent NLP literature examines the syntactic information encoded by deep models (Linzen et al., 2016) and, more specifically, by powerful unsupervised word embeddings. Hewitt and Manning (2019) learn a linear projection from the embedding space to predict the distance between two words in a parse tree. Peters et al. (2018b) and Goldberg (2019) assess the ability of BERT and ELMo directly on syntactic NLP tasks. Tenney et al. (2019) extract information from the contextual embeddings by self-attention pooling within a span of word embeddings.
The IB framework was first used in NLP to cluster distributionally similar words (Pereira et al., 1993). In cognitive science, it has been used to argue that color-naming systems across languages are nearly optimal (Zaslavsky et al., 2018). In machine learning, IB provides an information-theoretic perspective to explain the performance of deep neural networks (Tishby and Zaslavsky, 2015b).
The VIB method makes use of variational upper and lower bounds on mutual information. An alternative lower bound was proposed by Poole et al. (2019), who found it to work better empirically.
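To make the upper bound concrete: when the encoder p(t | x) is a diagonal Gaussian and the variational marginal r(t) is fixed to a standard normal, the per-token KL term KL(p(t | x) ‖ r(t)), whose expectation over x upper-bounds I(X; T), has a closed form. The snippet below is an illustrative sketch of that standard identity; fixing r(t) = N(0, I) is a common choice in VIB implementations and is an assumption here, not necessarily the marginal used in this paper.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in nats.

    Averaged over tokens x, this quantity upper-bounds I(X; T) when the
    variational marginal r(t) is a standard normal.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Closed form: 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 )
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * np.log(sigma))
```

The bound collapses to zero exactly when the encoder ignores x and emits the marginal itself, i.e., when T carries no information about X.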

Conclusion and Future Work
In this paper, we have proposed two ways to syntactically compress ELMo word token embeddings, using variational information bottleneck. We automatically induce stochastic discrete tags that correlate with gold POS tags but are as good or better for parsing. We also induce stochastic continuous token embeddings (each is a Gaussian distribution over R^d) that forget non-syntactic information captured by ELMo. These stochastic vectors yield improved parsing results, in a way that simpler dimensionality reduction methods do not. They also transfer to the problem of predicting gold POS tags, which were not used in training.
One could apply the same training method to compress the ELMo or BERT token sequence x for other tasks. All that is required is a model-specific decoder q_φ(y | t). For example, in the case of sentiment analysis, the approach should preserve only sentiment information, discarding most of the syntax. One possibility that does not require supervised data is to create artificial tasks, such as reproducing the input sentence or predicting missing parts of the input (such as affixes and function words). In this case, the latent representations would be essentially generative, as in the variational autoencoder (Kingma and Welling, 2013).

Supplementary Material

A Details of Deterministic Annealing
In practice, deterministic annealing (§6.3) is implemented in a way that dynamically increases the number of clusters k (Friedman et al., 2001), leading to a hierarchical clustering. First, we initialize with one cluster, and all the word tokens are mapped to that cluster with probability 1. Second, for each cluster i, duplicate the cluster C_i to form C_ia, C_ib, and divide the probabilities associated with C_i approximately evenly (with perturbation) between the two clusters, i.e., set

C Additional t-SNE plots
Recall that Figure 3 showed t-SNE plots of the continuous token embeddings. Figure 6 also shows rows for the discrete type and token embeddings. In both cases, the "moderate compression" condition shows β = 0.001.
In the continuous case, each point given to t-SNE is the mean of a Gaussian-distributed stochastic embedding, so it lies in R^d. In the discrete case, each point given to t-SNE is a vector of k tag probabilities, so it lies in R^k, and more specifically in the (k − 1)-dimensional simplex. The t-SNE visualizer plots these points in 2 dimensions.
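The two input preparations described above can be sketched as follows; the array sizes (200 tokens, d = 256, k = 64) and the random stand-ins for the model's outputs are illustrative assumptions, not the paper's data.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Continuous case: one point per token, the mean of its Gaussian embedding, in R^d.
means = rng.normal(size=(200, 256))   # stand-in for d = 256 posterior means

# Discrete case: one point per token, its k tag probabilities, on the simplex.
logits = rng.normal(size=(200, 64))   # stand-in for k = 64 tag logits
probs = np.exp(logits)
probs /= probs.sum(axis=1, keepdims=True)

# t-SNE projects both kinds of points down to 2 dimensions for plotting.
xy_cont = TSNE(n_components=2, random_state=0).fit_transform(means)
xy_disc = TSNE(n_components=2, random_state=0).fit_transform(probs)
```

Each output row would then be plotted as one marker, colored by the token's gold POS tag.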
The message of all these graphs is that the tokens or types with the same gold part of speech (shown in the same color) are grouped together most cleanly in the moderate-compression condition.

D Syntactic Feature Classification
Figure 7 shows results for the Syntactic Features paragraph in §6.2, plotting the prediction accuracy of subcategorization frame, tense, and number from t_i as a function of the level of compression. We used an SVM classifier with a radial basis function kernel.
All results are on the English UD data with the usual training/test split. To train and test the classifiers, we used the gold UD annotations to identify the nouns and verbs and their correct syntactic features.
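Such a probe can be sketched with scikit-learn as below. The synthetic arrays stand in for the compressed token vectors t_i and one binary syntactic feature (e.g., number); the real experiment instead reads both from the UD annotations.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-ins: 400 compressed token vectors t_i (here 32-dimensional) and a
# binary feature label correlated with the first coordinate.
X = rng.normal(size=(400, 32))
y = (X[:, 0] + 0.1 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# SVM with a radial basis function kernel, as in the paper's probe.
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

Sweeping this probe over representations trained at different β values traces out accuracy as a function of the level of compression.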

Figure 1:
Figure 1: Our instantiation of the information bottleneck, with bottleneck variable T. A jagged arrow indicates a stochastic mapping, i.e., the jagged arrow points from the parameters of a distribution to a sample drawn from that distribution.

Figure 2:
Figure 2: Compression-prediction tradeoff curves of VIB in our dependency parsing setting. The upper figures use discrete tags, while the lower figures use continuous tags. The dashed lines are for test data, and the solid lines for training data. The "dim" in the legends means the dimensionality of the continuous tag vector or the cardinality of the discrete tag set. On the left, we plot predictiveness I(Y; T) versus I(X; T) as we lower β multiplicatively from 10^1 to 10^-6 on a log scale. On the right, we alter the y-axis to show the labeled attachment score (LAS) of 1-best dependency parsing. All mutual information and entropy values in this paper are reported in nats per token. Furthermore, the mutual information values that we report are actually our variational upper bounds, as described in §3. The reason that I(X; T) is so large for continuous tags is that it is differential mutual information (see footnote 3). Additional tradeoff curves w.r.t. I(T_i; X | X̂_i) are in Appendix B.

Figure 3
Figure 3: t-SNE visualization of the VIB model (d = 256) on the projected space of the continuous tags. Each marker in the figure represents a word token, colored by its gold POS tag. This series of figures (from left to right) shows a progression from no compression to moderate compression to too much compression.
p(c_ia | x) = (1/2) p(c_i | x) + ε and p(c_ib | x) = (1/2) p(c_i | x) − ε, where ε is a small perturbation. Third, update β ← β/α, and run optimization until convergence. Fourth, for each former cluster i, if C_ia and C_ib have not differentiated from each other, re-merge them by setting p(c_i | x) = p(c_ia | x) + p(c_ib | x). (Optimization will have pulled them together again for higher β values and pushed them apart for lower β values.) Our heuristic is to re-merge them if, for all word tokens x, |p(c_ia | x) − p(c_ib | x)| ≤ 0.01. Finally, loop back to the second step, unless the β value has fallen below a given threshold β_min or we have reached a desired maximum number of clusters.
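The split/re-merge bookkeeping of this annealing loop can be sketched as below, representing the token-to-cluster posteriors as a tokens × clusters matrix. The optimization step that runs between splitting and re-merging is omitted, and the function names and default ε are our own illustrative choices.

```python
import numpy as np

def split_clusters(P, eps=1e-3, rng=None):
    """Duplicate every cluster: each column i of P (tokens x clusters) becomes
    twin columns with each token's probability divided roughly in half, plus a
    small perturbation so optimization can pull the twins apart."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n, k = P.shape
    noise = eps * rng.uniform(-1.0, 1.0, size=(n, k))
    halves_a = 0.5 * P + noise
    halves_b = 0.5 * P - noise
    return np.concatenate([halves_a, halves_b], axis=1)  # tokens x 2k; rows still sum to 1

def remerge(P, tol=0.01):
    """Re-merge twin clusters whose posteriors never drifted apart by more than
    tol for any token (optimization is assumed to have run in between)."""
    n, k2 = P.shape
    k = k2 // 2
    cols = []
    for i in range(k):
        a, b = P[:, i], P[:, k + i]
        if np.all(np.abs(a - b) <= tol):
            cols.append(a + b)       # undifferentiated twins collapse back into one cluster
        else:
            cols.extend([a, b])      # differentiated twins survive as two clusters
    return np.stack(cols, axis=1)
```

Looping split → optimize → remerge while dividing β grows the cluster inventory only where the data justify the extra distinctions.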

Figure 5
Figure 5 supplements the tradeoff curves in Figure 2 by plotting the relationship between I(T_i; X | X̂_i) vs. I(Y; T), and I(T_i; X | X̂_i) vs. LAS. Moving leftward on the graphs, each T_i contains less contextual information about word i (because γ in equation (2) is larger) as well as less information overall about word i (because we always set γ = β, so β is larger as well). The graphs show that the tag sequence T then becomes less informative about the parse Y.

Figures 8-9:
Figures 8-9 supplement the Stem paragraph in §6.2. Figure 8 plots the error rate of reconstructing English stems as a function of the level of compression. Figure 9 shows the reconstruction error rate for the other 8 languages.

Figure 9:
Figure 9: Error rate in reconstructing the stem of a word from the compressed version of the ELMo layer-1 and layer-2 embeddings. Slight compression refers to β = 0.0001, and moderate compression refers to β = 0.01.

Table 1:
Statistics of the datasets used in this paper.
Graph at left: I(A; T) vs. I(X; T) in English (in units of nats per token). Table at right: how well the discrete specialized tags predict gold POS tags for 9 languages. The H(A) row is the entropy (in nats per token) of the gold POS tags in the test data corpus, which is an upper bound on I(A; T). The remaining rows report the percentage I(A; T)/H(A).
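The reported percentage I(A; T)/H(A) can be computed from paired tag sequences using plug-in (empirical) estimates; the sketch below uses a toy example with hypothetical induced tag names, not the paper's data.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Plug-in entropy estimate in nats per token."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(a, t):
    """Plug-in estimate of I(A; T) in nats per token: H(A) + H(T) - H(A, T)."""
    return entropy(a) + entropy(t) - entropy(list(zip(a, t)))

# Toy example: induced tags that happen to predict gold POS perfectly,
# so I(A; T) reaches its upper bound H(A) and the percentage is 100.
gold = ["NOUN", "VERB", "NOUN", "DET", "NOUN", "VERB"]
tags = ["t1",   "t2",   "t1",   "t3",  "t1",   "t2"]
pct = 100.0 * mutual_information(gold, tags) / entropy(gold)
```

On real data the induced tags predict gold POS imperfectly, so the percentage falls strictly below 100.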