Beyond Weight Tying: Learning Joint Input-Output Embeddings for Neural Machine Translation

Tying the weights of the target word embeddings with the target word classifiers of neural machine translation models leads to faster training and often to better translation quality. Given the success of this parameter sharing, we investigate other forms of sharing in between no sharing and hard equality of parameters. In particular, we propose a structure-aware output layer which captures the semantic structure of the output space of words within a joint input-output embedding. The model is a generalized form of weight tying which shares parameters but allows learning a more flexible relationship with input word embeddings and allows the effective capacity of the output layer to be controlled. In addition, the model shares weights across output classifiers and translation contexts which allows it to better leverage prior knowledge about them. Our evaluation on English-to-Finnish and English-to-German datasets shows the effectiveness of the method against strong encoder-decoder baselines trained with or without weight tying.


Introduction
Neural machine translation (NMT) predicts the target sentence one word at a time, and thus models the task as a sequence classification problem where the classes correspond to words. Typically, words are treated as categorical variables which lack description and semantics. This makes training speed and parametrization dependent on the size of the target vocabulary (Mikolov et al., 2013). Previous studies overcome this problem by truncating the vocabulary to limit its size and mapping out-of-vocabulary words to a single "unknown" token. Other approaches attempt to use a limited number of frequent words plus sub-word units (Sennrich et al., 2016), the combination of which can cover the full vocabulary, or to perform character-level modeling (Chung et al., 2016;Lee et al., 2017;Costa-jussà and Fonollosa, 2016;Ling et al., 2015); with the former being the most effective between the two. The idea behind these alternatives is to overcome the vocabulary size issue by modeling the morphology of rare words. One limitation, however, is that semantic information of words or sub-word units learned by the input embedding are not considered when learning to predict output words. Hence, they rely on a large amount of examples per class to learn proper word or sub-word unit output classifiers.
One way to consider information learned by input embeddings, albeit restrictively, is with weight tying i.e. sharing the parameters of the input embeddings with those of the output classifiers (Press and Wolf, 2017;Inan et al., 2016) which is effective for language modeling and machine translation (Sennrich et al., 2017;Klein et al., 2017). Despite its usefulness, we find that weight tying has three limitations: (a) It biases all the words with similar input embeddings to have a similar chance to be generated, which may not always be the case (see Table 1 for examples). Ideally, it would be better to learn distinct relationships useful for encoding and decoding without forcing any general bias. (b) The relationship between outputs is only implicitly captured by weight tying because there is no parameter sharing across output classifiers. (c) It requires that the size of the translation context vector and the input embeddings are the same, which in practice makes it difficult to control the output layer capacity.
In this study, we propose a structure-aware output layer which overcomes the limitations of previous output layers of NMT models. To achieve this, we treat words and subwords as units with textual descriptions and semantics. The model consists of a joint input-output embedding which learns what to share between input embeddings  Table 1: Top-5 most similar input and output representations to two query words based on cosine similarity for an NMT trained without (NMT) or with weight tying (NMT-tied) and our structure-aware output layer (NMT-joint) on De-En (|V| ≈ 32K). Our model learns representations useful for encoding and generation which are more consistent to the dominant semantic and syntactic relations of the query such as verbs in past tense, adjectives and nouns (inconsistent words are marked in red). and output classifiers, but also shares parameters across output classifiers and translation contexts to better capture the similarity structure of the output space and leverage prior knowledge about this similarity. This flexible sharing allows it to distinguish between features of words which are useful for encoding, generating, or both. Figure 1 shows examples of the proposed model's input and output representations, compared to those of a softmax linear unit with or without weight tying. This proposal is inspired by joint input-output models for zero-shot text classification (Yazdani and Henderson, 2015;Nam et al., 2016a), but innovates in three important directions, namely in learning complex non-linear relationships, controlling the effective capacity of the output layer and handling structured prediction problems.
Our contributions are summarized as follows: • We identify key theoretical and practical limitations of existing output layer parametrizations such as softmax linear units with or without weight tying and relate the latter to joint input-output models.
• We propose a novel structure-aware output layer which has flexible parametrization for neural MT and demonstrate that its mathe-matical form is a generalization of existing output layer parametrizations.
• We provide empirical evidence of the superiority of the proposed structure-aware output layer on morphologically simple and complex languages as targets, including under challenging conditions, namely varying vocabulary sizes, architecture depth, and output frequency. The evaluation is performed on 4 translation pairs, namely English-German and English-Finnish in both directions using BPE (Sennrich et al., 2016) of varying operations to investigate the effect of the vocabulary size to each model. The main baseline is a strong LSTM encoder-decoder model with 2 layers on each side (4 layers) trained with or without weight tying on the target side, but we also experiment with deeper models with up to 4 layers on each side (8 layers). To improve efficiency on large vocabulary sizes we make use of negative sampling as in (Mikolov et al., 2013) and show that the proposed model is the most robust to such approximate training among the alternatives.

Background: Neural MT
The translation objective is to maximize the conditional probability of emitting a sentence in a target language Y = {y 1 , ..., y n } given a sentence in a source language X = {x 1 , ..., x m }, noted p Θ (Y |X), where Θ are the model parameters learned from a parallel corpus of length N : (1) By applying the chain rule, the output sequence can be generated one word at a time by calculating the following conditional distribution: where f Θ returns a column vector with an element for each y t . Different models have been proposed to approximate the function f Θ (Kalchbrenner and Blunsom, 2013;Sutskever et al., 2014;Bahdanau et al., 2015;Cho et al., 2014;Gehring et al., 2017;Vaswani et al., 2017). Without loss of generality, we focus here on LSTM-based encoder-decoder model with attention Luong et al. (2015).

Softmax Linear Unit
The most common output layer (Figure 3a), consists of a linear unit with a weight matrix W ∈ IR d h ×|V| and a bias vector b ∈ IR |V| followed by a softmax activation function, where V is the vocabulary, noted as NMT. For brevity, we focus our analysis specifically on the nominator of the normalized exponential which characterizes softmax. Given the decoder's hidden representation h t with dimension size d h , the output probability distribution at a given time, y t , conditioned on the input sentence X and the previously predicted outputs y t−1 1 can be written as follows: where I is the identity function. From the second line of the above equation, we observe that there is no explicit output space structure learned by the model because there is no parameter sharing across outputs; the parameters for output class i, W T i , are independent from parameters for any other output class j, W T j .

Softmax Linear Unit with Weight Tying
The parameters of the output embedding W can be tied with the parameters of the input embedding E ∈ IR |V|×d by setting W = E T , noted as NMT-tied. This can happen only when the input dimension of W is restricted to be the same as that of the input embedding (d = d h ). This creates practical limitations because the optimal dimensions of the input embedding and translation context may actually be when d h = d.
With tied embeddings, the parametrization of the conditional output probability distribution from Eq. 3 can be re-written as: As above, this model does not capture any explicit output space structure. However, previous studies have shown that the input embedding learns linear relationships between words similar to distributional methods (Mikolov et al., 2013). The hard equality of parameters imposed by W = E T forces the model to re-use this implicit structure in the output layer and increases the modeling burden of the decoder itself by requiring it to match this structure through h t . Assuming that the latent linear structure which E learns is of the form E ≈ E l W where E l ∈ IR |V|×k and W ∈ IR k×d and d = d h , then Eq. 4 becomes: The above form, excluding bias b, shows that weight tying learns a similar linear structure, albeit implicitly, to joint input-output embedding models with a bilinear form for zero-shot classification (Yazdani and Henderson, 2015;Nam et al., 2016a). 1 This may explain why weight tying is more sample efficient than the baseline softmax linear unit, but also motivates the learning of explicit structure through joint input-output models.

Challenges
We identify two key challenges of the existing parametrizations of the output layer: (a) their difficulty in learning complex structure of the output space due to their bilinear form and (b) their rigidness in controlling the output layer capacity due to their strict equality of the dimensionality of the translation context and the input embedding.
The structure-aware output layer is a joint embedding between translation contexts and word classifiers. Figure 1: Schematic of existing output layers and the proposed output layer for the decoder of the NMT model with source context vector c t , previous word y t−1 ∈ IR d , and decoder hidden states, h t ∈ IR d h .

Learning Complex Structure
The existing joint input-output embedding models (Yazdani and Henderson, 2015;Nam et al., 2016a) have the following bilinear form: where W ∈ IR d×d h . We can observe that the above formula can only capture linear relationships between encoded text (h t ) and input embedding (E) through W. We argue that for structured prediction, the relationships between different outputs are more complex due to complex interactions of the semantic and syntactic relations across outputs but also between outputs and different contexts. A more appropriate form for this purpose would include a non-linear transformation σ(·), for instance with either:

Controlling Effective Capacity
Given the above definitions we now turn our focus to a more practical challenge, which is the capacity of the output layer. Let Θ base , Θ tied , Θ bilinear be the parameters associated with a softmax linear unit without and with weight tying and with a joint bilinear input-output embedding, respectively. The capacity of the output layer in terms of effective number of parameters can be expressed as: But since the parameters of Θ tied are tied to the parameters of the input embedding, the effective number of parameters dedicated to the output layer is only |Θ tied | = |V|.
The capacities above depend on external factors, that is |V|, d and d h , which affect not only the output layer parameters but also those of other parts of the network. In practice, for Θ base the capacity d h can be controlled with an additional linear projection on top of h t (e.g. as in the Open-NMT implementation), but even in this case the parametrization would still be heavily dependent on |V|. Thus, the following inequality for the effective capacity of these models holds true for fixed |V |, d, d h : This creates in practice difficulty in choosing the optimal capacity of the output layer which scales to large vocabularies and avoids underparametrization or overparametrization (left and right side of Eq. 11 respectively). Ideally, we would like to be able to choose the effective capacity of the output layer more flexibly moving freely in between C bilinear and C base in Eq. 11.

Structure-aware Output Layer for Neural Machine Translation
The proposed structure-aware output layer for neural machine translation, noted as NMTjoint, aims to learn the structure of the output space by learning a joint embedding between translation contexts and output classifiers, as well as, by learning what to share with input embeddings ( Figure 1b). In this section, we describe the model in detail, showing how it can be trained efficiently for arbitrarily high number of effective parameters and how it is related to weight tying.

Joint Input-Output Embedding
Let g inp (h t ) and g out (e j ) be two non-linear projections of d j dimensions of any translation context h t and any embedded output e j , where e j is the j th row vector from the input embedding matrix E, which have the following form: where the matrix U ∈ IR d j ×d and bias b u ∈ IR d j is the linear projection of the translation context and the matrix V ∈ IR d j ×d h and bias b v ∈ IR d j is the linear projection of the outputs, and σ is a nonlinear activation function (here we use Tanh).
Note that the projections could be high-rank or low-rank for h t and e j depending on their initial dimensions and the target joint space dimension.
With E ∈ IR |V|×d j being the matrix resulting from projecting all the outputs e j to the joint space, i.e. g out (E), and a vector b ∈ IR |V| which captures the bias for each output, the conditional output probability distribution of Eq 3 can be rewritten as follows:

What Kind of Structure is Captured?
From the above formula we can derive the general form of the joint space which is similar to Eq. 7 with the difference that it incorporates both components for learning output and context structure: where W o ∈ IR d×d j and W c ∈ IR d j ×d h are the dedicated projections for learning output and context structure respectively (which correspond to U and V projections in Eq. 14). We argue that both nonlinear components are essential and validate this hypothesis empirically in our evaluation by performing an ablation analysis (Section 4.4).

How to Control the Effective Capacity?
The capacity of the model in terms of effective number of parameters (Θ joint ) is: By increasing the joint space dimension d j above, we can now move freely between C bilinear and C base in Eq .11 without depending anymore on the external factors (d, d h , |V |) as follows: However, for very large number of d j the computational complexity increases prohibitively because the projection requires a large matrix multiplication between U and E which depends on |V|.
In such cases, we resort to sampling-based training, as explained in the next subsection.

Sampling-based Training
To scale up to large output sets we adopt the negative sampling approach from (Mikolov et al., 2013). The goal is to utilize only a sub-set V of the vocabulary instead of the whole vocabulary V for computing the softmax. The sub-set V includes all positive classes whereas the negative classes are randomly sampled. During back propagation only the weights corresponding to the subset V are updated. This can be trivially extended to mini-batch stochastic optimization methods by including all positive classes from the examples in the batch and sampling negative examples randomly from the rest of the vocabulary. Given that the joint space models generalize well on seen or unseen outputs (Yazdani and Henderson, 2015;Nam et al., 2016b), we hypothesize that the proposed joint space will be more sample efficient than the baseline NMT with or without weight tying, which we empirically validate with a sampling-based experiment in Section 4.5 (Table  2, last three rows with |V| ≈ 128K).

Relation to Weight Tying
The proposed joint input-output space can be seen as a generalization of weight tying (W = E T , Eq. 3), because its degenerate form is equivalent to weight tying. In particular, this can be simply derived if we set the non-linear projection functions in the second line of Eq. 14 to be the identity function, g inp (·) = g out (·) = I, as follows: Overall, this new parametrization of the output layer generalizes over previous ones and addresses their aforementioned challenges in Section 2.2. En

Evaluation
We compare the NMT-joint model to two strong NMT baselines trained with and without weight tying over four large parallel corpora which include morphologically rich languages as targets (Finnish and German), but also morphologically less rich languages as targets (English) from WMT 2017 (Bojar et al., 2017) 2 . We examine the behavior of the proposed model under challenging conditions, namely varying vocabulary sizes, architecture depth, and output frequency.

Datasets and Metrics
The English-Finnish corpus contains 2.5M sentence pairs for training, 1.3K for development (Newstest2015), and 3K for testing (New-stest2016), and the English-German corpus 5.8M for training, 3K for development (Newstest2014), and 3K for testing (Newstest2015). We preprocess the texts using the BPE algorithm (Sennrich et al., 2016) with 32K, 64K and 128K operations. Following the standard evaluation practices in the field (Bojar et al., 2017), the translation quality is measured using BLEU score (Papineni et al., 2002) (multi-blue) on tokenized text and the significance is measured with the paired bootstrap re-sampling method proposed by (Koehn et al., 2007). 3 The quality on infrequent words is measured with METEOR (Denkowski and Lavie, 2014) which has originally been proposed to measure performance on function words.
To adapt it for our purposes on English-German pairs (|V| ≈ 32K), we set as function words different sets of words grouped according to three frequency bins, each of them containing |V| 3 words of high, medium and low frequency respectively and set its parameters to {0.85, 0.2, 0.6, 0.} and {0.95, 1.0, 0.55, 0.} when evaluating on English and German respectively.

Model Configurations
The baseline is an encoder-decoder with 2 stacked LSTM layers on each side from OpenNMT (Klein et al., 2017), but we also experiment with varying depth in the range {1, 2, 4, 8} for German-English. The hyperparameters are set according to validation accuracy as follows: maximum sentence length of 50, 512-dimensional word embeddings and LSTM hidden states, dropout with a probability of 0.3 after each layer, and Adam (Kingma and Ba, 2014) optimizer with initial learning rate of 0.001. The size of the joint space is also selected on validation data in the range {512, 2048, 4096}. For efficiency, all models on corpora with V ≈ 128K (∼) and all structure-aware models with d j ≥ 2048 on corpora with V ≤ 64K are trained with 25% negative sampling. 4  NMT baseline in many cases, but the differences are not consistent and it even scores significantly lower than NMT baseline in two cases, namely on Fi → En and De → En with V ≈ 64K. This validates our claim that the parametrization of the output space of the original NMT is not fully redundant, otherwise the NMT-tied would be able to match its BLEU in all cases. In contrast, the NMTjoint model outperforms consistently both baselines with a difference up to +2.2 and +1.6 BLEU points respectively, 5 showing that the NMT-tied model has a more effective parametrization and retains the advantages of both baselines, namely sharing weights with the input embeddings, and dedicating enough parameters for generation. Overall, the highest scores correlate with a high number of BPE operations, namely 128K, 64K, 128K and 64k respectively. This suggests that the larger the vocabulary the better the performance, especially for the morphologically rich target languages, namely En → Fi and En → De. Lastly, the NMT baseline seems to be the least robust to sampling since its BLEU decreases in two cases. The other two models are more robust to sampling, however the difference of NMT-tied with the NMT is less significant than that of NMT-joint.

Ablation Analysis
To demonstrate whether all the components of the proposed joint input-output model are useful and to which extend they contribute to the performance, we performed an ablation analysis; the results are displayed in Table 3. Overall, all the variants of the NMT-joint outperform the baseline with varying degrees of significance. The NMTjoint with a bilinear form (Eq. 6) as in (Yaz- dani and Henderson, 2015; Nam et al., 2016b) is slightly behind the NMT-tied and outperforms the NMT baseline; this supports our theoretical analysis in Section 2.1.2 which demonstrated that weight tying is learning an implicit linear structure similar to bilinear joint input-output models.
The NMT-joint model without learning explicit translation context structure (Eq. 7 a) performs similar to the bilinear model and the NMTtied model, while the NMT-joint model without learning explicit output structure (Eq. 7 b) outperforms all the previous ones. When keeping same capacity (with d j =512), our full model, which learns both output and translation context structure, performs similarly to the latter model and outperforms all the other baselines, including joint input-output models with a bilinear form (Yazdani and Henderson, 2015;Nam et al., 2016b). But when the capacity is allowed to increase (with d j =2048), it outperforms all the other models. Since both nonlinearities are necessary to allow us to control the effective capacity of the joint space, these results show that both types of structure induction are important for reaching the top performance with NMT-joint.

Effect of Embedding Size
Performance Figure 2 displays the BLEU scores of the proposed model when varying the size of the joint embedding, namely d j ∈ {512, 2048, 4096}, against the two baselines. For English-Finish pairs, the increase in embedding size leads to a consistent increase in BLEU in favor of the NMTjoint model. For the English-German pairs, the difference with the baselines is much more evident    and the optimal size is observed around 2048 for De → En and around 512 on En → De. The results validate our hypothesis that there is parameter redundancy in the typical output layer. However the ideal parametrization is data dependent and is achievable systematically only with the joint output layer which is capacity-wise in between the typical output layer and the tied output layer.
Training speed Table 4 displays the target tokens processed per second by the models on En → DE with |V| ≈ 128K using different levels of negative sampling, namely 50%, 25%, and 5%. In terms of training speed, the 512-dimensional NMT-joint model is as fast as the baselines, as we can observe in all cases. For higher dimensions of the joint space, namely 2048 and 4096 there is a notable decrease in speed which is remidiated by reducing the percentage of the negative samples.

Effect of Output Frequency and Architecture Depth
Figure 3 displays the performance in terms of ME-TEOR on both directions of German-English language pair when evaluating on outputs of different frequency levels (high, medium, low) for all the competing models. The results on De → EN show that the improvements brought by the NMTjoint model against baselines are present consistently for all frequency levels including the lowfrequency ones. Nevertheless, the improvement is most prominent for high-frequency outputs, which is reasonable given that no sentence filtering was performed and hence frequent words have higher impact in the absolute value of METEOR. Similarly, for En → De we can observe that NMTjoint outperforms the others on high-frequency and low-frequency labels while it reaches parity with them on the medium-frequency ones.
We also evaluated our model in another challenging condition in which we examine the effect of the NMT architecture depth in the performance of the proposed model. The results are displayed in Table 5. The results show that the NMTjoint outperforms the other two models consistently when varying the architecture depth of the encoder-decoder architecture. The NMT-joint overall is much more robust than NMT-tied and it outperforms it consistently in all settings. Compared to the NMT which is overparametrized the improvement even though consistent it is smaller for layer depth 3 and 4. This happens because NMT has a much higher number of parameters than NMT-joint with d j =512.
Increasing the number of dimensions d j of the joint space should lead to further improvements, as shown in Fig. 2. In fact, our NMT-joint with d j = 2048 reaches 18.11 score with a 2-layer deep model, hence it outperforms all other NMT and NMT-tied models even with a deeper architecture (3-layer and 4-layer) regardless of the fact that it utilizes fewer parameters than them (48.8M vs 69.2-73.4M and 50.9-55.1M respectively).

Related Work
Several studies focus on learning joint inputoutput representations grounded to word semantics for zero-shot image classification (Weston et al., 2011;Socher et al., 2013;Zhang et al., 2016), but there are fewer such studies for NLP tasks. (Yazdani and Henderson, 2015) proposed a zero-shot spoken language understanding model based on a bilinear joint space trained with hinge loss, and (Nam et al., 2016b), proposed a similar joint space trained with a WARP loss for zero-shot biomedical semantic indexing. In addition, there exist studies which aim to learn output representations directly from data such as (Srikumar and Manning, 2014;Yeh et al., 2018;Augenstein et al., 2018); their lack of semantic grounding to the input embeddings and the vocabulary-dependent parametrization, however, makes them data hungry and less scalable on large label sets. All these models, exhibit similar theoretical limitations as the softmax linear unit with weight tying which were described in Sections 2.2.
To our knowledge, there is no existing study which has considered the use of such joint inputoutput labels for neural machine translation. Compared to previous joint input-label models our model is more flexible and not restricted to linear mappings, which have limited expressivity, but uses non-linear mappings modeled similar to energy-based learning networks (Belanger and McCallum, 2016). Perhaps, the most similar embedding model to ours is the one by (Pappas and Henderson, 2018), except for the linear scaling unit which is specific to sigmoidal linear units designed for multi-label classification problems and not for structured prediction, as here.

Conclusion and Perspectives
We proposed a re-parametrization of the output layer for the decoder of NMT models which is more general and robust than a softmax linear unit with or without weight tying with the input word embeddings. Our evaluation shows that the structure-aware output layer outperforms weight tying in all cases and maintains a significant difference with the typical output layer without compromising much the training speed. Furthermore, it can successfully benefit from training corpora with large BPE vocabularies using negative sampling. The ablation analysis demonstrated that both types of structure captured by our model are essential and complementary, as well as, that their combination outperforms all previous output layers including those of bilinear input-output embedding models. Our further investigation revealed the robustness of the model to samplingbased training, translating infrequent outputs and to varying architecture depth.
As future work, the structure-aware output layer could be further improved along the following directions. The computational complexity of the model becomes prohibitive for a large joint projection because it requires a large matrix multiplication which depends on |V|; hence, we have to resort to sampling based training relatively quickly when gradually increasing d j (e.g. for d j >= 2048). A more scalable way of increasing the output layer capacity could address this issue, for instance, by considering multiple consecutive additive transformations with small d j . Another useful direction would be to use more advanced output encoders and additional external knowledge (contextualized or generically defined) for both words and sub-words. Finally, to encourage progress in joint input-output embedding learning for NMT, our code is available on Github: http://github.com/idiap/ joint-embedding-nmt.