Grounded Compositional Outputs for Adaptive Language Modeling

Language models have emerged as a central component across NLP, and a great deal of progress depends on the ability to cheaply adapt them (e.g., through finetuning) to new domains and tasks. A language model's vocabulary (typically selected before training and permanently fixed afterwards) affects its size and is part of what makes it resistant to such adaptation. Prior work has used compositional input embeddings based on surface forms to ameliorate this issue. In this work, we go one step beyond and propose a fully compositional output embedding layer for language models, which is further grounded in information from a structured lexicon (WordNet), namely semantically related words and free-text definitions. To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary. We evaluate the model on conventional language modeling as well as challenging cross-domain settings with an open vocabulary, finding that it matches or outperforms previous state-of-the-art output embedding methods and adaptation approaches. Our analysis attributes the improvements to sample efficiency: our model is more accurate for low-frequency words.


Introduction
Language models (LMs) are at the heart of natural language processing, especially following their recent success in the pretraining paradigm (Dai and Le, 2015; Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019, inter alia). Continued advances in NLP rely on the adaptability of LMs to new domains and tasks beyond their training data, e.g., through domain-adaptive pretraining followed by finetuning (Gururangan et al., 2020). Here, we focus on an important component of LMs, namely the output vocabulary (over which a LM's probability distribution over the "next word," given the history, ranges) and investigate how the type of its representation affects the adaptability of neural LMs.
Today, LMs are typically trained with a closed output vocabulary derived from the training data; the vocabulary is not modified when the language model is adapted or deployed. This makes large pretrained language models struggle with rare words, despite being able to produce contextualized representations for them (Schick and Schütze, 2020). More importantly, it means a generative LM can never give nonzero probability to a specific word it did not see in training. This is a longstanding challenge of language modeling (Jelinek, 1997), but it becomes especially important when we adapt to new domains and tasks.
One way to "open up" the vocabulary is to model sequences of bytes, characters, or "wordpieces" rather than conventional word tokens (Sennrich et al., 2016; Radford et al., 2018; Ponti et al., 2019). While effective, this approach requires the LM to memorize subsequences if it is to treat them as words. These models appear to require greater network depth and show slower convergence than word-based alternatives (Cherry et al., 2018; Al-Rfou et al., 2019); the extra work comes at a cost. This is one of the reasons why word-level language modeling remains a very active area (Baevski and Auli, 2019; Sukhbaatar et al., 2019; Khandelwal et al., 2020; Press et al., 2020).
Interpolations between word- and character- or morphology-based LMs represent another class of solutions (Mielke and Eisner, 2018; Gerz et al., 2018; Ataman et al., 2020). These "hybrid" approaches combine benefits from both model types, but they introduce complexity that can make them more difficult to train, maintain, and analyze. Notable for enabling adaptability are interpolated LMs based on copy mechanisms (Merity et al., 2017), dynamic evaluation (Krause et al., 2018), and neural caches (Grave et al., 2017c,a); the last provides state-of-the-art adaptation performance and, unlike the rest, does not require additional training.
We propose a new word-level Grounded Compositional output LM (GroC) that applies a compositional representation to the output vocabulary (Section 3). Each word's output embedding is built from its surface character sequence and (if available) those of semantically related words and a free-text definition from WordNet (Fellbaum, 1998). This parameterization offers two chief advantages. First, GroC can assign probability to words not seen during training, so a vocabulary different from the training vocabulary (e.g., one associated with a different text domain, crucial in adaptive settings) can be considered at inference time. Second, because there are no word type-specific parameters, the number of model parameters in GroC does not depend on the training vocabulary or its size.
We evaluate GroC on language modeling with both fixed and open vocabularies in English. On standard language modeling (Section 4) we observe that our model achieves superior perplexity and is more sample efficient than a variety of existing output embedding approaches, including the recent adaptive embedding of Baevski and Auli (2019). The open-vocabulary settings include a cross-domain setting and finetuning (Section 5). We find that GroC also outperforms strong interpolated baselines, including the unbounded neural cache model of Grave et al. (2017a), on "near" domains and performs competitively on "far" domains.
Our analysis shows that our approach has improved sharing across words in the output vocabulary.We show experimentally that the perplexity gains are strongest for low-frequency words, implying improved sample efficiency relative to baselines: compositional output representations allow us to predict words from fewer training examples.

Preliminaries on Language Modeling
Language models assign probability to sequences of tokens; the task is usually framed as learning the conditional probability distributions over individual tokens given their histories of tokens to the left (Bahl et al., 1983). Training requires a sequence of T tokens x = x_1, ..., x_T, each x_t a member of a preselected vocabulary V. We let x_t ∈ {0, 1}^{|V|} denote the one-hot encoding of x_t. The probability of the sequence x is factored using the chain rule of probability:

p(x) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t−1}).   (1)

To approximate this joint distribution, researchers have fit parametric families based on relative frequencies (Bahl et al., 1983; Kneser and Ney, 1995; Goodman, 2001) and neural networks (Bengio et al., 2003; Mikolov et al., 2010). Here, we focus on the latter due to their established effectiveness (Merity et al., 2018; Baevski and Auli, 2019). Tokens in this work correspond to words, but they can also correspond to individual characters (Al-Rfou et al., 2019) or byte pairs (Radford et al., 2019).
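To make Eq. 1 concrete, the factorization can be sketched in a few lines (a toy illustration; the conditional probabilities here are made up, not the output of any model):

```python
import math

def sequence_log_prob(cond_probs):
    """Chain rule (Eq. 1): log p(x) = sum over t of log p(x_t | x_1, ..., x_{t-1})."""
    return sum(math.log(p) for p in cond_probs)

def perplexity(cond_probs):
    """Standard per-token perplexity: exp(-log p(x) / T)."""
    return math.exp(-sequence_log_prob(cond_probs) / len(cond_probs))

# Toy 3-token sequence whose model-assigned conditionals are 0.5, 0.25, 0.1.
ppl = perplexity([0.5, 0.25, 0.1])
```

Perplexity, the evaluation measure used throughout the paper, is simply the exponentiated negative average of these per-token log-probabilities.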

Neural Language Models
To make clear this paper's contributions, we describe neural language models by decomposing them into several abstract parts.
In most neural language models, the first layer of computation obtains an input embedding of each history word x_j using a lookup function. In our notation, this corresponds to selecting the word type's row in a fixed input embedding matrix E_in, i.e., x_j^⊤ E_in, which we denote e^in_{x_j}. Importantly, however, input embeddings need not be lookups; for example, they can be built compositionally from the characters in the surface form of the word (Ling et al., 2015), an idea central to this work.
Next, the history or "prefix" words x_{<t} = x_1, ..., x_{t−1} is encoded into a fixed, d-dimensional vector h_{t−1} using a prefix function f : V* → R^d. f can be a recurrent or feedforward network; we will experiment with LSTMs (Hochreiter and Schmidhuber, 1997) in Section 4, but our method is agnostic to the prefix function design. In general, each history encoding is defined as

h_{t−1} = f(e^in_{x_1}, ..., e^in_{x_{t−1}}).   (2)

Finally, the distribution over the next word (random variable X_t) is given by

p(X_t | x_{<t}) = softmax(E_out h_{t−1} + b),   (3)

where E_out ∈ R^{|V|×d} is the output embedding matrix and b ∈ R^{|V|} is the bias vector (corresponding roughly to unigram log-frequencies of words in the vocabulary).
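A minimal numpy sketch of the output layer in Eq. 3, with toy dimensions (the shapes, not the values, are what matter here):

```python
import numpy as np

def next_word_distribution(h, E_out, b):
    """Eq. 3 sketch: p(X_t | x_<t) = softmax(E_out @ h + b).

    h: (d,) prefix encoding from f; E_out: (|V|, d); b: (|V|,) bias vector.
    """
    logits = E_out @ h + b
    logits = logits - logits.max()   # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(0)
d, V = 4, 10
p = next_word_distribution(rng.normal(size=d), rng.normal(size=(V, d)), np.zeros(V))
```

The result is a proper distribution over the |V| candidate next words; everything in the paper's Section 3 is about how E_out and b are parameterized.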
The parameters of the model (including all parameters of the prefix function f, as well as E_in, E_out, and b) are chosen by maximizing the likelihood of the training sequence x under the model (Eq. 1). Note that, though we focus on an autoregressive (left-to-right) language model objective, our analysis below is applicable to other language model pretraining objectives such as masked language modeling (Devlin et al., 2019) and replaced token detection (Clark et al., 2020).

Choice of Output Representations
Above we assumed an output embedding matrix E_out that independently parameterizes each word in the vocabulary with a separate d-dimensional vector. This approach requires d × |V| parameters, leading to concerns about cost and overparameterization. Prior work addressed this issue by tying parameters between the input and output embedding matrices (i.e., E_out = E_in; Inan et al., 2017; Press and Wolf, 2017). However, the parameters for each word are still independent from each other, as displayed in Figure 1(a).
An alternative, also considered here, is to share output parameters across words as well as with the input embeddings. Specifically, this involves making the output embedding a function of the input embedding using a shared parameterization across words, E_out = g(E_in), as displayed in Figure 1(b). For example, Gulordava et al. (2018) used a linear transformation, while Baevski and Auli (2019) used a linear transformation for each frequency bin, dedicating parameters to words in proportion to their frequencies. Pappas and Henderson (2019) used a deep residual transformation as g, demonstrating that shared parameterizations perform better than independent ones. The two latter studies also provided evidence that models with shared parameterizations are more sample efficient than independent ones, since they perform better on low-frequency words.
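The E_out = g(E_in) idea with the simplest choice of g (a single shared linear map, in the spirit of Gulordava et al., 2018) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)
d, V = 4, 10
E_in = rng.normal(size=(V, d))

# One d x d matrix W is shared by every word, so g contributes d*d
# parameters regardless of the vocabulary size |V|.
W = rng.normal(size=(d, d))
E_out = E_in @ W
```

Each row of E_out is a transformed version of the corresponding input embedding; only W is learned on top of E_in.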
Limitations We argue that dependence of a model's parameterization on the size of the vocabulary leads to several limitations shared by current word-level language models. First, the output embedding methods above have terms that scale with the vocabulary size, such as the lookup table for the input embedding or the bias vector, which is a concern for the parameterization of infrequent words. Second, handling of words unseen in the training data leads us to the convention of uninformative "out-of-vocabulary" word types or linguistically naïve, data-driven vocabulary transformations that aggressively decompose words into smaller units (Sennrich et al., 2016). Finally, when pretrained language models are adapted on a downstream task, they do not allow graceful modifications to the vocabulary as required by the task or its data domain. Decoupling the training vocabulary from the target vocabulary that a model can use during inference or finetuning will simplify sequential training and enable open vocabularies.
Building on encouraging results with compositional input embeddings (Ling et al., 2015; Józefowicz et al., 2016; Peters et al., 2018), we introduce a language model with shared compositional embeddings for input as well as output word representations. Further, we go beyond past work based on surface forms, making optional use of relations and natural language definitions from structured lexicons like WordNet (Fellbaum, 1998). To our knowledge, this is the first word-level language model whose parameters do not depend on the vocabulary size and which is grounded in an external structured lexicon. Our experiments show that our models are more sample efficient on closed vocabularies (Section 4) and perform competitively in cross-domain settings (Section 5).

GroC: Grounded Compositional Output Language Models
We present our grounded compositional output language model (Figure 2). Following the decomposition of neural language models in Section 2 (Equations 2-3), we consider each part of the model in turn: input embeddings (Section 3.1), output embeddings (Section 3.2), and bias (Section 3.3).
[Figure 2: overview of GroC, showing the output embedding and bias components for the example prefix "The hungry person ate with ___" and target "voracious appetite".]

As noted above, our approach is agnostic to the training vocabulary (V) and to the prefix encoder (f) that has been the focus of most innovations in neural language model design.

Compositional Input Embeddings
We build on the compositional model of Ling et al. (2015), which encodes a word using its surface string (i.e., character sequence), adding two more sources of information. The first follows Peters et al. (2019), who enhanced word representations with information from external relational knowledge bases, specifically for words that refer to entities. Like them, we use a structured lexicon (WordNet); we encode every word in the lexicon using its neighbors. The second follows Bahdanau et al. (2017), who used definitions to represent out-of-vocabulary words; we encode definitions for all words (regardless of training-set frequency). We begin by replacing the matrix E_in ∈ R^{|V|×d} with a neural network that defines a word's embedding compositionally from its surface form, its position relative to other words in a structured lexicon, and a natural language definition. For each word x, we refer to these, respectively, as the word type's surface embedding c_x, relational embedding r_x, and definitional embedding d_x. We assume each has a dimensionality of d. The last two are optional (if missing, they are set to zero), and we redefine e_x as the concatenation of the three, namely e_x = [c_x; r_x; d_x]. For r_x and d_x, we used the structured relations (synonyms and hyponyms) and free-text definitions in WordNet (Fellbaum, 1998).
In this study, we focus on simple, computationally efficient options for the three encoders. A word x's character sequence is encoded as surface encoding c_x using a convolutional network followed by a highway network (Józefowicz et al., 2016; Peters et al., 2018). Its relational encoding r_x is given by an average of the surface encodings c_{x′} across WordNet synonyms and hyponyms x′. For the definitional encoding d_x, we similarly take an average of the surface encodings c_{x′} over words x′ appearing in the definition of x. For computational efficiency, we set a maximum limit on the number of words used for both relations and definitions (see Appendix B.1).
If a word's information is not in WordNet, we set r_x and/or d_x to 0. In future work, additional encodings could be appended, such as contextualized examples (Khandelwal et al., 2020).
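A sketch of how the three parts combine into e_x = [c_x; r_x; d_x], with zeros standing in for missing WordNet information (toy vectors here, not the paper's CNN+highway encoders):

```python
import numpy as np

def grounded_embedding(c_x, neighbor_encs=None, definition_encs=None):
    """e_x = [c_x; r_x; d_x]: surface, relational, and definitional embeddings.

    r_x / d_x are averages of the surface encodings of WordNet neighbors /
    definition words; if the word is absent from WordNet, they are zeros.
    """
    dim = c_x.shape[0]
    r_x = np.mean(neighbor_encs, axis=0) if neighbor_encs else np.zeros(dim)
    d_x = np.mean(definition_encs, axis=0) if definition_encs else np.zeros(dim)
    return np.concatenate([c_x, r_x, d_x])

e_missing = grounded_embedding(np.ones(4))                        # not in WordNet
e_full = grounded_embedding(np.ones(4), [np.ones(4), 3 * np.ones(4)])
```

Either way the resulting vector has the same dimensionality, so downstream layers never need to know whether lexicon information was available.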
A notable property of these input embeddings is that their parameter count does not depend on the vocabulary size |V|. Further, the vocabulary used in training need not be identical to the one used during finetuning, evaluation, or deployment. For example, during training we can use the full vocabulary combined with a softmax approximation method (e.g., Grave et al., 2017b), or dynamically narrow the choice of x_t based on its history using co-occurrence statistics (L'Hostis et al., 2016). During finetuning or evaluation, one can use the same vocabulary (required for traditional perplexity evaluations) or a different one, chosen statically or dynamically, since any word's input embedding can be calculated compositionally.
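Because every embedding is computed from the surface form rather than looked up, swapping in a new vocabulary at evaluation time is just recomputation. The encoder below is a hypothetical, deterministic stand-in for the actual surface encoder, used only to make the point that any string can be embedded:

```python
import numpy as np

def toy_surface_encoder(word, dim=8):
    """Hypothetical stand-in for the CNN+highway surface encoder: any string,
    seen in training or not, maps deterministically to a vector."""
    v = np.zeros(dim)
    for i, ch in enumerate(word):
        v[(i + ord(ch)) % dim] += 1.0
    return v / max(len(word), 1)

train_vocab = ["cat", "dog", "ate"]
eval_vocab = ["cat", "axolotl"]          # "axolotl" never appeared in training
E_train = np.stack([toy_surface_encoder(w) for w in train_vocab])
E_eval = np.stack([toy_surface_encoder(w) for w in eval_vocab])
```

The two embedding matrices come from the same parameters; only the list of words changed.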

Compositional Output Embeddings
One straightforward option for vocabulary-size-independent output embeddings is to reuse the compositional input embeddings from Section 3.1, along the lines of Press and Wolf (2017). Concretely, at timestep t, we take the set V_t of output word types allowed, embed each word type v ∈ V_t as in Section 3.1, and stack these into a matrix E^in_t which serves directly as E_out.
Though these compositional representations do enable extensive sharing across the vocabulary, we suspect that the features they capture may require additional processing before capturing "output" distributional similarity, especially when another domain is the real target use case for the language model. This follows prior work discussed in Section 2.2, which showed that making the output embedding a function of the input embeddings with shared parameters improves over simple tying. We therefore adopt a depth-k residual network for the output embedding function g (from Section 2.2) that consists of a d-dimensional feedforward function g_j at each layer j and apply it to the input embedding at timestep t:

E^out_t(j) = E^out_t(j−1) + g_j(E^out_t(j−1)),   with E^out_t(0) = E^in_t.

Hence, we use E^out_t(k) as the output embedding at timestep t. To avoid overfitting, we apply variational dropout between the layers, following Pappas and Henderson (2019). In contrast to that work, our resulting output embeddings are compositional.
The depth k and the dropout rate are hyperparameters to be tuned on development data. The number of parameters is proportional to k times the number of parameters in the feedforward network (O(d^2)); it does not depend on the vocabulary size.
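The depth-k residual transform can be sketched as follows (toy dimensions; the paper additionally applies variational dropout between layers, omitted here):

```python
import numpy as np

def residual_output_embedding(E_in_t, layers, act=np.tanh):
    """E(j) = E(j-1) + act(E(j-1) @ W_j), with E(0) = E_in_t; the final E(k)
    serves as the output embedding. Parameter count is k * d^2, independent
    of the vocabulary size."""
    E = E_in_t
    for W in layers:
        E = E + act(E @ W)
    return E

rng = np.random.default_rng(0)
d, V, k = 4, 10, 2
E_in_t = rng.normal(size=(V, d))
E_out_t = residual_output_embedding(E_in_t, [rng.normal(size=(d, d)) for _ in range(k)])
```

With zero-initialized layers the transform is the identity, which is one reason residual parameterizations of g are easy to train on top of existing embeddings.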

Bias
In conventional language models, each word in the vocabulary is assigned a bias parameter that roughly captures its log-frequency under a unigram distribution. This is the last part of a neural language model whose parameters depend on the vocabulary size. Instead of a dedicated, independent bias parameter for each word v ∈ V, we define

b_v = a σ(e^out_v · w),

where σ is the activation function and we introduce parameters w ∈ R^d and a ∈ R. The bias values b_v are stacked to form b and used in Equation 3.
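A sketch of this shared bias, with tanh as a stand-in for the tuned activation σ: only w and a are learned, replacing |V| per-word bias parameters:

```python
import numpy as np

def compositional_bias(E_out, w, a, act=np.tanh):
    """b_v = a * act(e_v . w): a single weight vector w in R^d and a scalar a
    are shared across the whole vocabulary, so the bias of any word (seen or
    unseen) is computed from its output embedding."""
    return a * act(E_out @ w)

rng = np.random.default_rng(0)
b = compositional_bias(rng.normal(size=(10, 4)), rng.normal(size=4), a=1.5)
```

The resulting vector plugs directly into the softmax of Eq. 3 in place of a learned per-word bias.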

Training
Since all components are differentiable with respect to their parameters, the entire model can be trained to maximize training-data likelihood as described earlier (Section 2.1). Parameters include:
• Input character embeddings, the convolutional network for c_*, and 3d^2 parameters for projection (Section 3.1);
• Output embedding transformation, including the depth-k feedforward network for output embeddings (Section 3.2) and the bias parameters (Section 3.3); and
• Prefix encoder f, an orthogonal design choice to our method (an LSTM in our experiments).
The model size can be adjusted by changing output embedding hyperparameters to fit a given memory requirement; this is true of any neural network. Note that despite our vocabulary-size-independent parameterization, we still need to process all the words in the supplied vocabulary, leading to increased training times despite the model's sample efficiency. This can be prohibitive for very large vocabularies (≥ 100K), where we recommend using softmax approximation methods and making sparse updates of the output embedding parameters (see Appendix 1.3). During inference, E_out can be cached for fast access; there is no need to execute a forward pass more than once.
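The inference-time caching mentioned above can be sketched as a memo keyed on the vocabulary (a minimal illustration; `encode` stands in for the full compositional embedding pipeline):

```python
import numpy as np

class CachedOutputEmbeddings:
    """Computes the compositional E_out once per vocabulary and reuses it at
    every timestep, so the expensive encoders run only once at inference."""
    def __init__(self, encode):
        self.encode = encode
        self._cache = {}

    def get(self, vocab):
        key = tuple(vocab)
        if key not in self._cache:
            self._cache[key] = np.stack([self.encode(w) for w in vocab])
        return self._cache[key]

calls = []
def counting_encoder(word):
    calls.append(word)       # record every actual encoder invocation
    return np.ones(3)

cache = CachedOutputEmbeddings(counting_encoder)
E1 = cache.get(["a", "b"])
E2 = cache.get(["a", "b"])   # served from the cache; no re-encoding
```

Because the vocabulary is fixed during a given evaluation, the amortized cost per timestep is just the matrix-vector product of Eq. 3.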

Conventional Language Modeling
We first establish the performance of GroC in the conventional closed-vocabulary setting on two datasets. We measure out-of-sample generalization (test-set perplexity) and also analyze fit across the vocabulary by frequency bin.

Experimental Setup
Datasets. We evaluate our methods on two English datasets: penn (Marcus et al., 1993) and wikitext2 (Merity et al., 2017). We report test perplexity using the provided training/dev./test splits (see details in Appendix B.3). Table 1 also quantifies the percentage of each dataset's vocabulary that is covered by WordNet (used to derive relational and definitional encodings).

Models. All of the models compared use the same prefix encoder: a vanilla recurrent neural network based on the implementation by Merity et al. (2017) with 2 layers and 1024 LSTM units, regularized with hidden unit dropout of 0.65 along the lines of Grave et al. (2017a). More details are given in Appendix B.4. The following output embedding approaches are compared:
• Lookup table: trains a full output embedding lookup table that corresponds to the vocabulary as defined in Eq. 3.
• Convolutional (Józefowicz et al., 2016): an alternative to a lookup table that uses a character-level convolutional neural network followed by a highway network, plus a linear "correction" for each vocabulary element, to represent the outputs.
• Tied (Press and Wolf, 2017): avoids training separate input and output embedding matrices by tying their parameters. This is a common technique that mitigates the overparameterization issue of the lookup table.
• Bilinear (Gulordava et al., 2018): performs a simple linear transformation of the input embedding to produce the output embedding that effectively shares parameters across outputs.
• Deep residual (Pappas and Henderson, 2019): performs a deep residual transformation of the input embedding with variational dropout in between its layers, which is more expressive than the bilinear one.
• Adaptive (Baevski and Auli, 2019): uses a bilinear transformation of the input and output embedding with parameters proportional to the word frequencies, to assign more capacity to frequent words and less capacity to infrequent ones.This is considered to be a state-of-the-art output embedding method.
For fair comparison, we apply variational dropout to all output embeddings. Hyperparameter selection of dropout rates, output network depth and activation, linear "correction," and adaptive frequency cutoffs was conducted by grid search on validation data. Details are given in Appendix B.2.

Results
Table 2 reports perplexities achieved by all seven models. The main finding is that GroC achieves lower perplexity than the previous models on both datasets. Note that GroC outperforms the state-of-the-art output embedding method of Baevski and Auli (2019), specifically by −9.8 and −8.2 points on penn and wikitext-2, respectively. The difference with the other methods is even larger. We also confirm the findings of Pappas and Henderson (2019) that output parameter sharing methods outperform tied output embeddings and the lookup table, and of Józefowicz et al. (2016) that convolutional output embeddings lag behind the full softmax (lookup table). Notably, GroC outperforms the best reported scores of Merity et al. (2017) and Grave et al. (2017a) on penn, using about 11M fewer parameters and a similar prefix network to the latter. See Appendix B.5 for a more detailed comparison with state-of-the-art models of similar size. Nevertheless, GroC is about 1.3× slower than the convolutional method on penn; with sparse updates (p > 0.3) we can make it 2.1× faster than that method, which is comparable to the speed of the bilinear method, while maintaining a perplexity improvement of −26 points (see detailed speed comparisons in Table 10 in Appendix B.4).

Analysis
The experiment above establishes that our approach achieves improved perplexity relative to alternative output embeddings.We next decompose its performance in various ways to understand why.
Word frequency effects. We conjecture that GroC's main benefit comes from words that are rare in the training data, since the core contribution is to share representations across the vocabulary. To evaluate this hypothesis, we consider the difference in test loss (cross entropy) between GroC and a baseline model, following Baevski and Auli (2019) but computing the median instead of the average to reduce the effect of outliers. We decompose this score by data frequency bins (e.g., words occurring 1-50 times in the training dataset). Figure 3 displays the results for the penn and wikitext2 datasets.
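The per-bin median comparison can be sketched as follows (the bin edges here are illustrative, not the paper's):

```python
from collections import defaultdict
from statistics import median

def median_loss_gap_by_bin(freqs, loss_groc, loss_base,
                           bins=((1, 50), (51, 500), (501, float("inf")))):
    """Median of (loss_groc - loss_base) per training-frequency bin;
    a negative median means GroC fits that bin better."""
    gaps = defaultdict(list)
    for f, lg, lb in zip(freqs, loss_groc, loss_base):
        for lo, hi in bins:
            if lo <= f <= hi:
                gaps[(lo, hi)].append(lg - lb)
                break
    return {b: median(v) for b, v in gaps.items()}

# Toy inputs: per-word training frequencies and per-word test losses.
out = median_loss_gap_by_bin([3, 10, 600], [1.0, 2.0, 0.5], [1.5, 2.5, 0.4])
```

Using the median rather than the mean keeps a handful of very badly predicted words from dominating a bin's score.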
The trend we observe is that GroC has the greatest relative benefit for words in lower frequency bins, compared to each baseline. The lowest-frequency bin on penn deviates from this pattern, which we take as an indication that generalizing to infrequent words with only 1M training tokens and a small 10K vocabulary is inherently challenging.
Ablations. To assess the contributions of GroC's components, we performed ablation tests on penn and wikitext2 (Table 3). These include removing relational and/or definitional forms, either with or without a deep residual output network. For fairness, we tune the hyperparameters of the ablated model variants as above. Overall, removing the relational and definitional forms from the main model, with or without an output network on top, increases the perplexity. The largest performance drop happens when we remove both forms, which highlights their notable contribution to the full model. Lastly, the results on wikitext2 highlight the importance of capturing output similarity with an output network (out) for datasets with a larger vocabulary, as opposed to merely reusing the grounded compositional embeddings as output embeddings.

Lexicon coverage. To measure the effect of lexicon coverage on model performance in a controlled setting, we artificially remove words from WordNet, making them unavailable for relational and definitional encodings. In this experiment, we consider the penn dataset, where WordNet's coverage over the (relatively small) vocabulary is highest to begin with. Table 4 shows the resulting test perplexity of a pretrained model (inference) and a model trained from scratch (train) when such controlled manipulation is applied, from 0% up to the maximum of 82% coverage (Table 1). Note that we treat relational forms independently of definitional forms since they are not always co-present. Overall, the results indicate that the model is sensitive to changes in the forms of words that have been seen during training, but it is robust to such changes if it is trained from scratch. In the next section, we investigate what happens when we add forms for words which the model has never seen before.


Cross-Domain Language Modeling

Models. We compare GroC to the tied output embedding model described in Section 4.1 when combined with the following adaptation methods:
• Unigram: we interpolate the model's distribution with a unigram cache, which assigns probabilities based on the counts of words in the test data observed so far during evaluation.
• Neural cache: we interpolate the model's distribution with a neural cache (Grave et al., 2017c), which assigns probabilities based on the similarity of the current hidden state to previous hidden states during evaluation.
• Finetuning: the model is finetuned on 2M tokens from the target domain.
(We also compare to the reported unbounded cache results from Grave et al., 2017a.) Cache models provide effective adaptation without training by using recent history to develop an auxiliary distribution during evaluation, informing predictions of unseen or rarely seen words. However, since GroC already assigns non-negligible weight to new words not seen prior to evaluation, the cache has less effect by default, even if its predictions are more accurate, an effect we observed in validation. To address this, we down-weighted the model's predictions for new words prior to cache interpolation by 0.1. For finetuning, both the tied and GroC models were trained for an additional 3 epochs on the target domain, allowing them to adapt to the new domain. See Appendix C.4 for hyperparameter details.
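The down-weighting and cache interpolation can be sketched as follows (λ is a tuned interpolation weight, an assumption of this sketch; the 0.1 factor is the value mentioned above):

```python
import numpy as np

def interpolate_with_cache(p_model, p_cache, lam, new_word_mask, down_weight=0.1):
    """Scale the model's mass on words unseen before evaluation by
    `down_weight`, renormalize, then mix with a cache distribution."""
    p = p_model.copy()
    p[new_word_mask] *= down_weight
    p = p / p.sum()
    return (1.0 - lam) * p + lam * p_cache

p_model = np.array([0.5, 0.3, 0.2])
p_cache = np.array([0.0, 0.0, 1.0])
mask = np.array([False, False, True])   # the third word is new at evaluation
p = interpolate_with_cache(p_model, p_cache, lam=0.2, new_word_mask=mask)
```

Down-weighting first ensures the cache, rather than the model's prior on novel words, drives probability mass toward words actually recurring in the test stream.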
Vocabulary setting. For a fair comparison, all models are evaluated on the union of the training and test vocabularies. Tied models are interpolated with the uniform distribution at test time to prevent infinite perplexities on unseen words, prior to cache interpolation if applicable. Words present in the finetuning data but not in the original training data are given random embeddings prior to finetuning.

Results
The results for the cross-domain experiments are shown in

Conclusion
We proposed an adaptive language model based on grounded compositional outputs. We demonstrated that it reduces the number of parameters and increases sample efficiency, outperforming strong output embedding methods and adaptation baselines in in-domain and open-vocabulary settings, respectively. In principle, our results should be applicable to word-piece language models, which are currently based on lookup tables, to improve their sample efficiency and compactness.
In future work, it would be interesting to investigate to what extent pretrained language models benefit from GroC in such zero-resource or low-resource adaptation settings. This work also indicates several other future directions for language modeling in low-resource domains: extension to other languages, scaling training to even larger vocabularies, and applying GroC in a large pretraining setting to expand its zero-shot generalization.

B.2 Hyperparameter Optimization
For all methods, the hyperparameter selection of output embedding dropout rate (r), output network depth (k) and activation (act), linear "correction", and adaptive frequency cutoffs was conducted by grid search on development data over the specific range of values given in Table 7. Note that not all the hyperparameters apply to all methods, as can be seen in Table 8, where we report the optimal hyperparameter values for each method. For all the baselines we performed exhaustive grid search on both datasets, but for our method we performed grid search only on penn and searched manually on wikitext-2, selecting values that ranked high in the grid search on penn, to avoid the increased cost that comes with training our method (see speed comparison in Appendix B.4). The total numbers of trials for all methods, including our ablations, were 204 for penn and 67 for wikitext-2. The reduced number of trials on the latter is due to not performing exhaustive search for our method and its ablations, as explained above.
The number of trials per method can be derived by multiplying the non-zero columns per row with the number of trials required for each column.

B.3 Development Scores
Table 9 displays the development scores and number of parameters along with the test perplexities for our model and all the baseline output embedding methods for our main experiment.The development scores for the models of the ablation study and for the base models of the coverage experiment have already been given in

B.4 Training Speed
Table 10 displays the average training speed per epoch in seconds for each method. This experiment was run on a single, dedicated GeForce RTX 2080 Ti. As we mentioned in Section 3.1, even though our model has a vocabulary-size-independent parameterization, it is not independent of the computation required to encode the vocabulary. This has a negative impact on the training speed of GroC, making it 1.3× slower than the convolutional method.
To mitigate this problem, we recommend training GroC with sparse updates for the output embedding parameters, as described in the main paper (Section 3.4). Concretely, at each training iteration we make a full update with probability p and keep the output embedding frozen otherwise; the rest of the network is trained with full updates as before. We can observe that this optimization strategy makes GroC nearly as efficient as the baselines with p = 0.1 or p = 0.3; in particular, it becomes 2.1× faster than the convolutional baseline. Furthermore, our best model with p = 0.3, which is much faster, reaches 75.3 perplexity on penn without additional hyperparameter optimization, still 4 points lower than the second best method, the adaptive output embedding; tuning the model from scratch would likely lead to even better results. This is encouraging because it means the benefits of our model need not come with a large computational cost. In future work, the training speed could be optimized further by devising specialized efficient training methods for compositional outputs.
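The sparse-update schedule can be illustrated with a toy simulation. This is not the paper's implementation: the parameters and gradients below are scalar stand-ins, and only the scheduling logic matches the description above.

```python
import random

def train(num_steps, p, lr=0.1, seed=0):
    """Toy sparse-update loop: the output embedding gets a full update
    with probability p per step; the rest of the network always updates.
    Returns the fraction of steps with a full embedding update."""
    rng = random.Random(seed)
    output_embedding = 1.0   # stand-in for the output embedding parameters
    body = 1.0               # stand-in for the rest of the network
    embedding_updates = 0
    for _ in range(num_steps):
        grad_out, grad_body = 0.01, 0.01   # dummy gradients
        body -= lr * grad_body             # body: always updated
        if rng.random() < p:               # embedding: updated with prob. p
            output_embedding -= lr * grad_out
            embedding_updates += 1
    return embedding_updates / num_steps
```

Over many steps, roughly a fraction p of iterations pay the full cost of encoding the vocabulary, which is why p = 0.1 or p = 0.3 recovers most of the baseline training speed.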

B.5 Comparison with State-of-the-Art Models
Table 11 displays several state-of-the-art models with parameter counts ranging from 9M to 20M on Penn Treebank. We can observe that our model, with only 9.7M parameters, achieves better performance than all models with at most 21M parameters, and even the model by Inan et al. (2017), which has 24M parameters. Note that our model has lower perplexity than the pointer sentinel mixture model by Merity et al. (2017) and the neural cache model by Grave et al. (2017a), while having 11M fewer parameters. Moreover, it comes very close to the other models with around 23-25M parameters without being heavily regularized (weight dropout, input dropout) or using advanced optimization strategies (SGD + ASGD, finetuning) like AWD-LSTM (Merity et al., 2017). Training larger models and investigating the potential of competing with even higher-capacity models is an interesting direction which we hope will be explored in future studies.

C Cross-Domain Language Modeling
For the cross-domain language modeling experiment, we used the following computing infrastructure: 2 GeForce RTX 2080 Ti and 2 TITAN RTX GPUs to train and finetune our GroC models, and 2 Tesla P100 GPUs to train and finetune the baselines and to perform hyperparameter search.

C.2 Data
As described in Section 5, the choice of data and preprocessing for the cross-domain experiments follows Grave et al. (2017a). News Crawl and Common Crawl can be downloaded from the WMT 2014 website. WikiText-103 was downloaded from the Salesforce website. For the News Crawl datasets, the first 2M tokens of the English data for each year were used as the training set.
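The token-budget truncation just described might be sketched as follows; this is a hypothetical helper of our own, not the actual preprocessing code, and it assumes whitespace-tokenized input lines.

```python
def take_first_tokens(lines, budget=2_000_000):
    """Keep whole whitespace-tokenized lines until adding another line
    would exceed the token budget. Returns (kept_lines, token_count)."""
    kept, count = [], 0
    for line in lines:
        tokens = line.split()
        if count + len(tokens) > budget:
            break
        kept.append(line)
        count += len(tokens)
    return kept, count
```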

C.3 Finetuning Validation Results
Because most of our cross-domain experiments require no target-domain training, validation scores were not computed for most model-domain combinations; however, to aid replication, we report the validation perplexity for the finetuned models in Table 12.

C.4 Hyperparameter Selection
Cache hyperparameters were selected via grid search, with θ, the flattening hyperparameter described in Grave et al. (2017c), ranging over 5 values from 0 to 1, and λ ranging over 5 values from 0.833 to 0.966 (bounds selected based on the optimal hyperparameter ranges in Grave et al. (2017c)). The perplexity of a model trained on 2007 and evaluated on the 2008 validation set was the metric used to select the optimal hyperparameters: λ = 0.966 for the unigram and neural cache and θ = 0.5 for the neural cache. Because the cache is only used during evaluation, this hyperparameter search was quite efficient to carry out with the tied model, requiring no additional training, only 25 evaluation runs on the validation set. The search is illustrated in Figure 6. We then used the same hyperparameters for all cache models; this gives a slight advantage to the tied model, as the optimal hyperparameters for GroC might differ from those selected with the tied model. A cache size of 5,000 was used during hyperparameter tuning, but at test time we used 10,000 for all experiments, following its use in Grave et al. (2017c). Figure 7 shows a separate hyperparameter search over the penn validation set to confirm the accuracy of our neural cache reimplementation; compare to Figure 2a in Grave et al. (2017c), noting that their λ is 1 minus ours.
For GroC, we also selected a downweighting hyperparameter dw, based on validation performance on the wiki dataset only. We searched over 5 values (0.1, 0.3, 0.5, 0.7, and 0.9) using GroC with the neural cache, and selected dw = 0.1 as the best value, with a validation perplexity of 154.01.
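The cache model being tuned here interpolates the base LM distribution with a cache distribution built from recent hidden states, following the formulation of Grave et al. (2017c). Below is a minimal sketch with toy hidden states; the function names are ours, not from the paper's code.

```python
import math

def cache_probs(history, query, vocab, theta):
    """Cache distribution: exp(theta * <h_query, h_i>) accumulated per
    word over the (hidden_state, word) pairs in the history, normalized."""
    scores = {w: 0.0 for w in vocab}
    for h_i, w_i in history:
        sim = sum(a * b for a, b in zip(query, h_i))
        scores[w_i] += math.exp(theta * sim)   # theta flattens/sharpens
    total = sum(scores.values()) or 1.0
    return {w: s / total for w, s in scores.items()}

def interpolate(p_model, p_cache, lam):
    """(1 - lam) * model distribution + lam * cache distribution."""
    return {w: (1 - lam) * p_model[w] + lam * p_cache.get(w, 0.0)
            for w in p_model}
```

The grid search above sweeps theta (flattening) and lam (interpolation weight); since both only affect evaluation-time scoring, no retraining is needed per configuration.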

Figure 2: Grounded compositional output language modeling. (Left) The compositional input embedding is grounded in surface, relational, and definitional word forms from an external structured lexicon. (Right) The encoded prefix words are given as input to the prefix function, and the words in an arbitrary vocabulary are given as input to the output embedding function and the bias function to predict the next word.

Figure 3: Median loss difference between each baseline and GroC over different word frequency intervals on penn (a) and wikitext-2 (b). The biggest differences are mostly observed on words with low training frequencies.

Figure 4: Training and validation loss for GroC and the tied model during finetuning on near domains.

C.1 Finetuning Loss Curves
Figures 4 and 5 show the loss on the training and validation data for the target domain during finetuning. GroC generalizes better from the training to the validation data than the tied model, consistently having lower validation loss. The training loss for GroC consistently starts out lower than that of the tied model, showing that it has less difficulty adapting to the new data, and ends up higher, indicating stronger regularization than the tied model. The web dataset is a clear outlier, in which the tied model improves much more dramatically than in any other domain. The difference in validation performance here is reflected in the test perplexity (Table 5) but does not have a clear explanation.

Figure 6: Validation accuracy for various hyperparameter settings on the 2008 validation set.

Figure 7: Validation accuracy for various hyperparameter settings on the penn validation set.

Table 1: Language modeling dataset statistics.

Table 2: Perplexity scores on conventional language modeling benchmarks with closed vocabulary. |Θ| denotes the total number of model parameters.

Table 3: Ablated model variants on penn and wikitext-2. out: the deep residual output network.

Table 4: Effect of external lexicon coverage on the perplexity of GroC on the penn test set. surf.: model with surface forms only from Table 3, last row.

Table 5: Near and far cross-domain language modeling with an open vocabulary in a zero-resource or a low-resource setting. The top four rows display scores from Grave et al. (2017a), while the next three are from our re-implementation with a stronger base model. Boldface marks the best perplexity on each test set.

As shown in Table 5, standalone GroC improves perplexity relative to the tied model in every domain by up to 30 points, and also improves over the local neural cache and the unbounded neural cache model in the near domain, even when the former is applied to our own stronger tied-embedding baseline model.

Table 7: Hyperparameters, ranges of values, and number of trials required to search them. Adaptive cutoffs are read as follows: e.g., for 253 the cutoff array contains 0.2·n, 0.5·n, 0.3·n words per bin, with n = |V|.

Table 8: Best hyperparameter values per method.

Table 9: Development and test scores on conventional language modeling benchmarks with closed vocabulary. |Θ| denotes the total number of model parameters. Overall, we can observe that in most cases the ranking based on the development scores is indicative of the ranking of the methods according to the test scores.

Table 10: Training speed for each method. We report the average time in seconds to complete one epoch.