Discovering the Compositional Structure of Vector Representations with Role Learning Networks

How can neural networks perform so well on compositional tasks even though they lack explicit compositional representations? We use a novel analysis technique called ROLE to show that recurrent neural networks perform well on such tasks by converging to solutions which implicitly represent symbolic structure. This method uncovers a symbolic structure which, when properly embedded in vector space, closely approximates the encodings of a standard seq2seq network trained to perform the compositional SCAN task. We verify the causal importance of the discovered symbolic structure by showing that, when we systematically manipulate hidden embeddings based on this symbolic structure, the model’s output is changed in the way predicted by our analysis.


Introduction
Traditional models of cognition, and language in particular, have relied heavily on symbol structures and symbol manipulation. However, in the current era, deep learning research has shown that neural networks (NNs) can display remarkable degrees of generalization on tasks traditionally viewed as depending on symbolic structure (Wu et al., 2016; McCoy et al., 2019a), albeit with some important limits to their generalization (Lake and Baroni, 2018). Given that standard NNs have no obvious mechanisms for representing symbolic structures, parsing inputs into such structures, or applying compositional symbol-manipulating rules to them, this success raises the question that we address in this paper: how do NNs achieve such strong performance on compositional tasks?
Could it be that NNs do learn symbolic representations, covertly embedded as vectors in their state spaces? McCoy et al. (2019a) showed that when trained on highly compositional tasks, standard NNs learned representations that are functionally equivalent to compositional vector embeddings of symbolic structures (Sec. 3). Processing in these NNs assigns structural representations to inputs and generates outputs that are governed by compositional rules stated over those representations. We refer to the networks we will analyze as target NNs, because we will propose a new type of NN (in Sec. 4), the Role Learner (ROLE), which is used to analyze the target network. In contrast with the analysis model of McCoy et al. (2019a), which relies on a hand-specified hypothesis about the structure underlying the learned representations of the target NN, ROLE automatically learns a symbolic structure that best approximates the internal representation of the target network. This yields two advantages. First, ROLE achieves success at analyzing networks for which the underlying structure is unclear. We show this in Sec. 5, where ROLE successfully uncovers the symbolic structures learned by a seq2seq RNN trained on the SCAN synthetic semantic parsing task (Lake and Baroni, 2018). Second, removing the need for hand-specified structural hypotheses reduces the burden on the analyst, who only needs to provide input sequences and their target NN encodings. Discovering symbolic structure within a model enables us to perform precise alterations to the internal representations in order to produce desired alterations in the output (Sec. 5.3). Then, in Sec. 6, we turn briefly to partially-compositional tasks in NLP.
The novel contributions of this research are:
• ROLE, a NN module that learns to assign symbolic structures to input sequences (Sec. 4).
• Demonstration that RNNs converge to compositional solutions on the synthetic SCAN task (Sec. 5).
• A precise closed-form expression for the distributed encoding learned by an RNN trained on SCAN, exhibiting its latent symbolic structure (Sec. 5.2).
• Demonstration of the causal relevance of this symbolic structure by using the equation for its vector encoding to control RNN output through precise alteration of the RNN's internal encoding (Sec. 5.3).
• Additional evidence showing that sentence embedding models do not capture compositional structure (Sec. 6).

Background and related work

Compositionality
Certain cognitive tasks consist in computing a function ϕ that is governed by strict rules: e.g., if ϕ is the function mapping a mathematical expression to its value (e.g., mapping '19 − 2 * 7' to 5), then ϕ obeys the rule that ϕ(x + y) = sum(ϕ(x), ϕ(y)) for any expressions x and y. This rule is compositional: the output of a structure (here, x + y) is a function of the outputs of the structure's constituents (here, x and y). The rule can be stated with full generality once the input is assigned a symbolic structure giving its decomposition into constituents. For a fully-compositional task, completely determined by compositional rules, a system that can assign appropriate symbolic structures to inputs and apply appropriate compositional rules to these structures will display full systematic generalization: it will correctly process arbitrary novel combinations of familiar constituents. This is a core capability of symbolic AI systems.
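To make the compositional rule concrete, the sketch below (ours, not part of the paper; all names are illustrative) evaluates arithmetic expressions by recursion over the symbolic structure produced by Python's parser, so that the value of a structure is computed solely from the values of its constituents:

```python
import ast
import operator

# phi maps an arithmetic expression to its value, compositionally:
# the output for a structure is a function of the outputs of its constituents.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

def phi(node):
    if isinstance(node, ast.Expression):
        return phi(node.body)
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp):
        # The rule phi(x op y) = op(phi(x), phi(y)), stated over the structure.
        return OPS[type(node.op)](phi(node.left), phi(node.right))
    raise ValueError("unsupported constituent")

assert phi(ast.parse("19 - 2 * 7", mode="eval")) == 5
```

Because the rule is stated over the assigned structure, the same function handles arbitrary novel combinations of familiar constituents, which is exactly the systematic generalization that fully-compositional symbolic systems display.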
Other tasks, including most natural language tasks such as machine translation, are only partially characterizable by compositional rules: natural language is only partially compositional in nature. For example, if ϕ is the function that assigns meanings to English adjectives, it generally obeys the rule that ϕ(in- + x) = not ϕ(x) (e.g., ϕ(inoffensive) = not ϕ(offensive)), yet there are exceptions: ϕ(inflammable) = ϕ(flammable). On these "partially-compositional" tasks, this strategy of compositional analysis has demonstrated considerable, but limited, generalization capabilities.

Analysis of NNs
Many past works in the rich literature on analyzing NNs focus on compositional structure (Hupkes et al., 2018, 2020; Hewitt and Manning, 2019; Li et al., 2019) and systematicity (Lake and Baroni, 2018; Goodwin et al., 2020). Two of the most popular analysis techniques are the behavioral and probing approaches. In the behavioral approach, a model is evaluated on a set of examples carefully chosen to require competence in particular linguistic phenomena (Marvin and Linzen, 2018; Wang et al., 2018; Dasgupta et al., 2019; Poliak et al., 2018; Linzen et al., 2016; McCoy et al., 2019b; Warstadt et al., 2020). This technique can illuminate behavioral shortcomings but says little about how the internal representations are structured, treating the model as a black box.
In the probing approach, an auxiliary classifier is trained to classify the model's internal representations based on some linguistically-relevant distinction (Adi et al., 2017; Giulianelli et al., 2018; Conneau et al., 2018; Conneau and Kiela, 2018; Belinkov et al., 2017; Blevins et al., 2018; Peters et al., 2018; Tenney et al., 2019). In contrast with the behavioral approach, the probing approach tests whether some particular information is present in the model's encodings, but it says little about whether this information is actually used by the model. Indeed, in some cases models fail despite their representations containing the information needed to succeed, showing that the ability of a classifier to extract that information does not mean that the model is using it (Voita and Titov, 2020; Ravichander et al., 2020; Vanmassenhove et al., 2017).
We build on McCoy et al. (2019a), which introduced the analysis task DISCOVER (DISsecting COmpositionality in VEctor Representations): take a NN and, to the extent possible, find an explicitly-compositional approximation to its internal distributed representations. DISCOVER allows us to bridge the gap between representation and behavior: it reveals not only what information is encoded in the representation, but does so in a way that we can manipulate to show that the information is causally implicated in the model's behavior (Sec. 5.3). Moreover, it provides a much more comprehensive window into the representation than the probing approach does; while probing extracts particular types of information from a representation (e.g., "does this representation distinguish between active and passive sentences?"), DISCOVER exhaustively decomposes the model's representational space. In this regard, DISCOVER is most closely related to the approaches of Andreas (2019), Chrupała and Alishahi (2019), and Abnar et al. (2019), who also propose methods for discovering a complete symbolic characterization of a set of vector representations, and to Omlin and Giles (1996) and Weiss et al. (2018), who also seek to extract more interpretable symbolic models that approximate neural network behavior. Like Andreas (2019) and Chrupała and Alishahi (2019), we seek the structure encoded in neural networks, rather than seeking structure directly from the data, as is the goal in grammar induction work such as that of Shen et al.

McCoy et al. (2019a) showed that, in GRU (Cho et al., 2014) encoder-decoder networks performing simple, fully-compositional string manipulations, the medial encoding (between encoder and decoder) could be extremely well approximated, up to an affine transformation, by Tensor Product Representations (TPRs) (Smolensky, 1990), which are explicitly-compositional vector embeddings of symbolic structures.
To represent a string of symbols as a TPR, the symbols in the string 337 might be parsed into three constituents {3 : pos1, 3 : pos2, 7 : pos3}, where pos_n is the role for the n-th position from the left edge of the string; other role schemes are also possible, such as roles denoting right-to-left position: {3 : third-to-last, 3 : second-to-last, 7 : last}. The embedding of a constituent 7 : pos3 is e(7 : pos3) = e_F(7) ⊗ e_R(pos3), where ⊗ is the tensor product (outer product), and e_F, e_R are respectively a vector embedding of the fillers of the roles (here, the digits) and a vector embedding of the roles themselves. The embedding of the whole string is the sum of the embeddings of its constituents. In general, for a symbol structure S with roles {r_k} that are respectively filled by the symbols {f_k}, e_TPR(S) = Σ_k e_F(f_k) ⊗ e_R(r_k). The DISCOVER task, including the TPR equations, is depicted in Figure 2.
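The TPR construction can be illustrated directly. The sketch below (ours, with hypothetical one-hot filler and role embeddings; learned embeddings are dense) builds e_TPR for the string 337 with left-to-right positional roles, and shows that with orthonormal role vectors each filler can be recovered from the sum by "unbinding" with its role vector:

```python
import numpy as np

# Toy embeddings: one-hot fillers for digits 0-9, one-hot roles for 3 positions.
d_F, d_R = 10, 3
e_F = np.eye(d_F)   # e_F[f]: embedding of digit filler f
e_R = np.eye(d_R)   # e_R[p]: embedding of role pos_{p+1}

def tpr(digits):
    # e_TPR(S) = sum_k e_F(f_k) (outer product) e_R(r_k)
    return sum(np.outer(e_F[int(c)], e_R[p]) for p, c in enumerate(digits))

T = tpr("337")
# Unbinding: multiplying the TPR by a role vector recovers that role's filler.
assert np.array_equal(T @ e_R[0], e_F[3])   # pos1 holds 3
assert np.array_equal(T @ e_R[2], e_F[7])   # pos3 holds 7
```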

NN embedding of symbol structures
At a high level, these role embeddings serve a purpose similar to positional embeddings in a Transformer (Vaswani et al., 2017): both are vector embeddings of a token's position in a sequence. The roles discussed above, like the positional embeddings used in Transformers, illustrate role schemes based on sequential position; non-sequential role schemes, such as positions in a tree, are also possible. McCoy et al. (2019a) showed that, for a given seq2seq architecture learning a given string-mapping task, there exists a highly accurate TPR approximation of the medial encoding, given an appropriate pre-defined role scheme. The main technical contribution of the present paper is the Role Learner (ROLE) model, an RNN that learns its own role scheme to optimize the fit of a TPR approximation to a given set of internal representations in a pre-trained target NN. This makes the DISCOVER framework more general by removing the need for human-generated hypotheses about the role schemes the network might be implementing. Learned role schemes, as we will see in Sec. 5.1, can enable good TPR approximation of networks for which human-generated role schemes fail.

The Role Learner (ROLE) Model
ROLE¹ produces a vector-space embedding of an input string of T symbols S = s_1 s_2 ... s_T by producing a TPR T(S) and then passing it through an affine transformation. ROLE is trained to approximate a pre-trained target string-encoder E. Given a set of N training strings {S^(1), ..., S^(N)}, ROLE minimizes the total mean-squared error (MSE) between its output W T(S^(i)) + b and E(S^(i)).
ROLE is an extension of the Tensor-Product Encoder (TPE) introduced in McCoy et al. (2019a) (as the "Tensor Product Decomposition Network") and depicted in Figure 3. Crucially, ROLE is not given role labels for the input symbols, but learns to compute them. More precisely, it learns a dictionary of n_R role-embedding vectors of dimension d_R, R ∈ ℝ^(d_R × n_R), and, for each input symbol s_t, computes a soft-attention vector a_t over these role vectors: the role vector assigned to s_t is then the attention-weighted linear combination of role vectors, r_t = R a_t. ROLE simultaneously learns a dictionary of n_F symbol-embedding filler vectors of dimension d_F, F ∈ ℝ^(d_F × n_F), the φ-th column of which is f_φ, the embedding of symbol type φ; φ ∈ 1, ..., n_F, where n_F is the size of the vocabulary of symbol types. The TPR generated by ROLE is thus T(S) = Σ_{t=1}^{T} f_{τ(s_t)} ⊗ r_t, where τ(s_t) is symbol s_t's type. Finally, ROLE learns an affine transformation to map this TPR into ℝ^d, where d is the dimension of the representations of the encoder E.

¹ Code available at https://github.com/psoulos/role-decomposition.

Figure 3: The Tensor-Product Encoder. The fillers (yellow circles) and roles (blue circles) are first vectorized with an embedding layer. These two vector embeddings are combined by an outer product to produce the green matrix representing the TPR of the constituent. All of the constituents are summed together to produce the TPR of the sequence, and then a linear transformation is applied to resize the TPR to the target encoder's dimensionality. ROLE replaces the role embedding layer and directly produces the blue role vector.
ROLE uses an LSTM (Hochreiter and Schmidhuber, 1997) to compute the role-assigning attention vectors a_t from its learned embedding F of the input symbols s_t: at each t, the hidden state of the LSTM passes through a linear layer and then a softmax to produce a_t (depicted in Figure 4). Let the t-th LSTM hidden state be q_t ∈ ℝ^H; let the output-layer weight-matrix K have rows k_ρ ∈ ℝ^H, and let the columns of R be v_ρ ∈ ℝ^(d_R), with ρ = 1, ..., n_R. Then r_t = R a_t = Σ_{ρ=1}^{n_R} [softmax(K q_t)]_ρ v_ρ: the result of query-key attention (e.g., Vaswani et al., 2017) with query q_t to a fixed external memory containing key-value pairs {(k_ρ, v_ρ)}, ρ = 1, ..., n_R. Since a TPR for a discrete symbol structure deploys a discrete set of roles specifying discrete structural positions, ideally a single role would be selected for each s_t: a_t would be one-hot. ROLE training therefore deploys regularization to bias learning towards one-hot a_t vectors (based on the regularization proposed in Palangi et al. (2017), developed for the same purpose). See Appendix A.2 for the precise regularization terms that we used.

Figure 4: The role learning module. The role attention vector a_t is encouraged to be one-hot through regularization; if a_t were one-hot, the produced role embedding r_t would correspond directly to one of the roles defined in the role matrix R. The LSTM can be unidirectional or bidirectional.
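The role-assignment step can be sketched as query-key attention over a fixed memory of roles. The block below (ours) uses randomly initialized stand-ins for the learned parameters (K for the output-layer rows k_ρ, R for the role dictionary) and a random vector in place of the LSTM hidden state; the real model trains all of these:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())  # numerically stable softmax
    return z / z.sum()

# Hypothetical dimensions; random stand-ins for learned weights.
rng = np.random.default_rng(0)
H, n_R, d_R = 8, 50, 6
K = rng.normal(size=(n_R, H))     # output-layer rows k_rho, acting as keys
R = rng.normal(size=(d_R, n_R))   # role dictionary; columns v_rho act as values
R /= np.linalg.norm(R, axis=0)    # normalized role embeddings (cf. Appendix A.2)

q_t = rng.normal(size=H)          # stand-in for the LSTM hidden state at step t
a_t = softmax(K @ q_t)            # soft attention over the n_R roles
r_t = R @ a_t                     # attention-weighted role vector r_t = R a_t

assert np.isclose(a_t.sum(), 1.0) and r_t.shape == (d_R,)
```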
It is essential to note that, while we impose this regularization on ROLE, there is no explicit bias favoring discrete compositional representations in the target encoder E: any such structure that ROLE finds hidden in the representations learned by E must result from biases implicit in the vanilla RNN architecture of E when applied to its target task.

The SCAN task
Returning to our central question from Sec. 1: how can neural networks without explicit compositional structure perform well on fully-compositional tasks? Our hypothesis is that, though these models have no constraint forcing them to be compositional, they still have the ability to learn compositional structure implicitly. To test this hypothesis, we apply ROLE to a standard RNN-based seq2seq model (Sutskever et al., 2014) trained on a fully-compositional task. Because the RNN has no constraint forcing it to use TPRs, we do not know a priori whether there exists any TPR solution that ROLE could find; thus, if ROLE does find one, that is a significant empirical finding about how these RNNs operate.
We consider the SCAN task (Lake and Baroni, 2018), which was designed to test compositional generalization and systematicity. SCAN is a synthetic semantic parsing task: an input sequence describing an action plan, e.g., jump opposite left, is mapped to a sequence of primitive actions, e.g., TL TL JUMP (see Sec. 5.3 for a complex example). We use TL to abbreviate TURN LEFT, sometimes written LTURN; similarly, we use TR for TURN RIGHT. The SCAN mapping is defined by a complete set of compositional rules (Lake and Baroni, 2018, Supplementary Fig. 7).
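To make the task concrete, here is a toy interpreter (ours) for a small fragment of SCAN's compositional rules; it covers only single commands of the form ACTION [opposite|around] [left|right] [twice|thrice], whereas the full grammar (Lake and Baroni, 2018) also includes and, after, and turn commands:

```python
# Mappings for a fragment of SCAN (TL/TR abbreviate TURN LEFT / TURN RIGHT).
ACT = {"jump": "JUMP", "walk": "WALK", "run": "RUN", "look": "LOOK"}
TURN = {"left": "TL", "right": "TR"}
REP = {"twice": 2, "thrice": 3}

def interpret(cmd):
    words = cmd.split()
    reps = REP.get(words[-1], 1)       # trailing cardinality, if any
    if words[-1] in REP:
        words = words[:-1]
    act = ACT[words[0]]
    if len(words) == 1:
        unit = [act]                                  # e.g. "walk"
    elif words[1] == "opposite":
        unit = [TURN[words[2]]] * 2 + [act]           # turn around, then act
    elif words[1] == "around":
        unit = [TURN[words[2]], act] * 4              # turn and act, four times
    else:
        unit = [TURN[words[1]], act]                  # e.g. "jump left"
    return unit * reps

assert interpret("jump opposite left") == ["TL", "TL", "JUMP"]
assert interpret("jump around left") == ["TL", "JUMP"] * 4
```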

The compositional structure of SCAN encoder representations
For our target SCAN encoder E, we trained a standard GRU with one hidden layer of dimension 100 for 100,000 steps (batch-size 1) with a dropout of 0.1 on the simple train-test split (hyperparameters determined by a limited search; see Appendix A.3). E achieves 98.47% (full-string) accuracy on the test set.
Thus E provides what we want: a standard RNN achieving near-perfect accuracy on a non-trivial fully compositional task. After training, we extract the final hidden embedding from the encoder for each example in the training and test sets. These are the encodings we attempt to approximate as explicitly compositional TPRs. We provide ROLE with 50 roles to use as it wants (hyperparameters described in Appendix A.4). We evaluate the quality of the learned role scheme by substitution accuracy: the proportion of examples for which the decoder still produces the correct output when E's encoding is replaced by ROLE's TPR approximation. We evaluate this in three ways. The continuous method tests ROLE in the same way as it was trained, with input symbol s_t assigned role vector r_t = R a_t. The continuous method does not produce a discrete set of role vectors, because the attention weights in a_t are continuously valued. The remaining two methods test the efficacy of a truly discrete set of role vectors. First, in the snapped method, a_t is replaced at evaluation time by the one-hot vector m_t singling out role m_t = arg max(a_t): r_t = R m_t. This method enforces the discreteness of roles, but it is expected to decrease performance because it tests ROLE in a different way than it was trained. Our final evaluation method, the discrete method, uses discrete roles without such a train/test discrepancy, via a two-stage process. In the first stage, the snapped method is used to output a one-hot role m_t for every symbol in the dataset. In the second stage, we train a TPE which does not learn roles but instead uses the one-hot vectors m_t as input during training. In this case, ROLE acts as an automatic data labeler, assigning a role to every input word.
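The snapped method amounts to replacing each soft attention vector by its one-hot argmax before forming the role vector, so that r_t becomes exactly one column of the role dictionary. A minimal sketch (toy numbers, ours):

```python
import numpy as np

def snap(a):
    """'Snapped' evaluation: replace a soft attention vector a_t by the
    one-hot vector m_t singling out arg max(a_t)."""
    m = np.zeros_like(a)
    m[np.argmax(a)] = 1.0
    return m

R = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])          # toy role dictionary (columns = roles)
a_t = np.array([0.05, 0.85, 0.10])       # soft attention over 3 roles
m_t = snap(a_t)
assert m_t.tolist() == [0.0, 1.0, 0.0]
assert (R @ m_t).tolist() == [0.0, 3.0]  # r_t is now exactly one dictionary column
```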

For comparison, we also train TPEs using a variety of discrete hand-crafted role schemes: left-to-right (LTR), right-to-left (RTL), bidirectional (Bi), tree position, neighbor-based Wickelroles (Wickel), and bag-of-words (BOW) (descriptions of these role schemes are in Appendix A.1).
The mean substitution accuracy from these different methods is shown in Table 1. All of the predefined role schemes provide poor approximations, none surpassing 44.00% accuracy. The role scheme learned by ROLE does significantly better than any of the predefined role schemes: when tested with the basic, continuous role-attention method, the accuracy is 94.83%.
The success of ROLE tells us two things. First, it shows that the target model's compositional behavior relies on compositional internal representations: it was by no means guaranteed that ROLE would be successful here, so its success tells us that the encoder has learned compositional representations. Second, it further validates the efficacy of ROLE, showing that it can be a useful analysis tool in cases of significantly greater complexity than the simple string manipulation tasks studied in McCoy et al. (2019a). In fact, it allows us to write in closed form, to an excellent degree of approximation (as measured by substitution accuracy), the embedding e(S) learned by the SCAN encoder for an input S = s_1 ... s_T:

e(S) ≈ W [Σ_{t=1}^{T} f_{τ(s_t)} ⊗ r_t] + b,

where r_t is the role assigned to s_t by the algorithm discussed next, and where the matrix W, the filler embeddings f, the role embeddings r, and the bias vector b are learned by ROLE. Note that this expression is bilinear, even though the GRU encoder that generates it includes nonlinearities.

Interpreting the learned role scheme
By analyzing the roles assigned by ROLE to the sequences in the SCAN training set, we created a symbolic algorithm for predicting which role will be assigned to each filler. This section covers the primary factors of the algorithm, while the entire algorithm is described in Appendix A.5 and discussed at additional length in Appendix A.6. Though the algorithm was created based only on sequences in the SCAN training set, it is equally successful at predicting which roles will be assigned to test sequences, exactly matching ROLE's predicted roles for 98.7% of sequences.
The algorithm illuminates how the filler-role scheme encodes information relevant to the task. First, one of the initial facts that the decoder must determine is whether the sequence is a single command, a pair of subcommands connected by and, or a pair of subcommands connected by after; such a determination is crucial for knowing the basic structure of the output (how many actions to perform and in what order). We have found that role 30 is used for, and only for, the filler and, while role 17 is used in and only in sequences containing after (usually with after as the filler bound to role 17). Thus, the decoder can use these roles to tell which basic structure is in play: if role 30 is present, it is an and sequence; if role 17 is present, it is an after sequence; otherwise it is a single command.
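The decoding logic just described can be stated as a three-way check on the set of assigned role indices. The sketch below (ours) hard-codes the role numbers from the analysis, which are specific to this particular trained model:

```python
def basic_structure(role_indices):
    """Infer the top-level SCAN structure from the set of role indices that
    ROLE assigned. Role numbers 30 and 17 are those found for this model."""
    if 30 in role_indices:
        return "and"      # role 30 is used for, and only for, the filler 'and'
    if 17 in role_indices:
        return "after"    # role 17 appears in, and only in, 'after' sequences
    return "single"       # otherwise: a single command

# e.g. jump around left after walk thrice receives roles {11, 36, 8, 17, 4, 46}
assert basic_structure({11, 36, 8, 17, 4, 46}) == "after"
assert basic_structure({4, 2}) == "single"
```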
Once the decoder has established the basic syntactic structure of the output, it must then fill in the particular actions. This can be accomplished using the remaining roles, which mainly encode absolute position within a subcommand. For example, the last word of the subcommand before after (left in jump left after walk twice) is always assigned role 8, while the last word of the subcommand after after (twice in the same example) is always assigned role 46. Therefore, once the decoder knows (based on the presence of role 17) that it is dealing with an after sequence, it can check for the fillers bound to roles 8 and 46 to begin to figure out what the two subcommands surrounding after look like. The identity of the last word in a subcommand is informative because that is where a cardinality (i.e., twice or thrice) appears if there is one. Thus, by checking what filler is at the end of a subcommand, the model can determine whether there is a cardinality present and, if so, which one.

ROLE itself does not provide an interpretation for the symbolic structure it generates, but we have shown that this structure can be successfully interpreted by humans. By contrast, it is very difficult to interpret the continuous neuron values of RNN representations; even the rare successful cases of doing so, such as Lakretz et al. (2019) and Mu and Andreas (2020), only interpret a few isolated units, while we were able to exhaustively explain the entire symbolic structure discovered by ROLE.

Precision constituent-surgery on internal representations produces desired outputs
The substitution-accuracy results above show that if the entire learned representation is replaced by ROLE's approximation, the output remains correct. But do the individual word embeddings in this TPR have the appropriate causal consequences when processed by the decoder?
To address this causal question (Pearl, 2000), we actively intervene on the constituent structure of the internal representations by replacing one constituent with another syntactically equivalent one,² and see whether this produces the expected change in the output of the decoder. We take the encoding generated by the RNN encoder E for an input such as jump opposite left, subtract the vector embedding of the opposite constituent, add the embedding of the around constituent, and see whether this causes the output to change from the correct output for jump opposite left (TL TL JUMP) to the correct output for jump around left (TL JUMP TL JUMP TL JUMP TL JUMP). The roles in these constituents are determined by the algorithm of Appendix A.5. If changing a word leads other roles in the sequence to change (according to the algorithm), we update the encoding with those new roles as well. Such surgery can be viewed as a more general extension of the analogy approach used by Mikolov et al. (2013) for analysis of word embeddings. An example of applying a sequence of five such constituent surgeries to a sequence is shown in Figure 5 (left). Even long sequences of such replacements produce the expected change in the decoder's output with high accuracy (Figure 5, right), indicating that the compositional structure discovered by ROLE does play a central causal role in the model's behavior.

² We extract syntactic categories from the SCAN grammar (Lake and Baroni, 2018, Supplementary Fig. 6) by saying that two words belong to the same category if every occurrence of one could be grammatically replaced by the other. We do not replace occurrences of and and after, since the presence of either of these words causes substantial changes in the roles assigned within the sequence (Appendix A.5).
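In TPR terms, the surgery is exact: subtracting a filler⊗role binding and adding a new binding in the same role yields precisely the TPR of the edited sequence. The sketch below (ours) uses random stand-in embeddings and illustrative role indices; in the actual experiments, the edit is applied to E's encoding via the embedding of each constituent under ROLE's learned maps:

```python
import numpy as np

rng = np.random.default_rng(1)
d_F, d_R = 5, 5

# Random stand-ins for ROLE's learned filler and role dictionaries.
f = {w: rng.normal(size=d_F) for w in ["jump", "opposite", "around", "left"]}
r = {k: rng.normal(size=d_R) for k in [11, 36, 8]}   # illustrative role indices

def bind(word, role):
    """One constituent of the (flattened) TPR: filler (outer product) role."""
    return np.outer(f[word], r[role]).ravel()

# TPR of "jump opposite left" (the role assignment here is illustrative).
enc = bind("jump", 11) + bind("opposite", 36) + bind("left", 8)

# Surgery: subtract the 'opposite' constituent, add 'around' in the same role.
enc_surg = enc - bind("opposite", 36) + bind("around", 36)

# The result is exactly the TPR of "jump around left".
assert np.allclose(enc_surg,
                   bind("jump", 11) + bind("around", 36) + bind("left", 8))
```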

Partially-compositional NLP tasks
The previous sections explored fully-compositional tasks where there is a strong signal for compositionality. In this section, we explore whether the representations of NNs trained on tasks that are only partially-compositional also capture compositional structure. Partially-compositional tasks are especially challenging to model because a fully-compositional model may enforce compositionality too strictly to handle the non-compositional aspects of the task, while a model without a compositional bias may not learn any sort of compositionality from the weak cues in the training set.
As a baseline, we also train TPEs that use predefined role schemes (hyperparameters described in Appendix A.7). For all of the sentence embedding models except Skip-thought, ROLE with continuous attention provides the lowest mean squared error at approximating the encoding (Table 2). The BOW (bag-of-words) role scheme represents a TPE that uses a degenerate 'compositional' structure which assigns the same role to every filler; for each of the sentence embedding models tested except for SST, performance is within the same order of magnitude as structure-free BOW. Parikh et al. (2016) found that a bag-of-words model scores extremely well on Natural Language Inference despite having no knowledge of word order, showing that structure is not necessary to perform well on the sorts of tasks commonly used to train sentence encoders. Although not definitive, the ROLE results provide no evidence that these models' sentence embeddings possess compositional structure.

Conclusion
We have introduced ROLE, a neural network that learns to approximate the representations of an existing target neural network E using an explicit symbolic structure. ROLE successfully discovers symbolic structure in a standard RNN trained on the fully-compositional SCAN semantic parsing task, even though the RNN has no such structure explicitly present in its architecture. This yields a closed-form equation for the RNN's encoding of any input string. When applied to sentence embedding models trained on partially-compositional tasks, ROLE performs better than hand-specified hypothesized structures but still provides little evidence that the sentence encodings represent compositional structure.
While this work has shown that NNs can converge to TPRs to solve compositional tasks, it is still unknown how the weights in the NN actually convert the raw input into a TPR. To investigate this process, in future work we plan to apply our technique to representations of partial sequences. For instance, when the complete input is jump right twice, the target RNN must first represent jump right as a well-formed TPR at the point when only those two words have been encountered. The representation then needs to be updated when the next word, twice, is encountered. By studying the nature of that update, we can gain insight into how the target model builds up a TPR from the input elements.
Uncovering the latent symbolic structure of NN representations learned for fully-compositional tasks is a significant step towards explaining how NNs achieve the level of compositional generalization that they do. In addition, by illuminating shortcomings in the representations learned for standard tasks that are not fully-compositional, ROLE can help suggest types of inductive bias for improving models' generalization on standard, partially-compositional datasets.

A.1 Hand-crafted role schemes

1. Left-to-right (LTR): Each filler's role is its position counting from the left edge of the sequence.

2. Right-to-left (RTL): Each filler's role is its position counting from the right edge of the sequence.

3. Bidirectional (Bi): Each filler's role is the pair of its left-to-right and right-to-left positional roles.

4. Tree: Each filler's role is given by its position in a tree. This depends on a tree parsing algorithm.

5. Wickelroles (Wickel): Each filler's role is a 2-tuple containing the filler before it and the filler after it (Wickelgren, 1969).

6. Bag-of-words (BOW): Each filler is assigned the same role. The position and context of the filler are ignored.

A.2 ROLE regularization
Letting A = {a_t}, t = 1, ..., T, the regularization term applied during ROLE training is R = λ(R₁ + R₂ + R₃), where λ is a regularization hyperparameter and:

R₁(A) = Σ_{t=1}^{T} Σ_{ρ=1}^{n_R} [a_t]_ρ (1 − [a_t]_ρ),  R₂(A) = −Σ_{t=1}^{T} ||a_t||₂²,  R₃(A) = Σ_{ρ=1}^{n_R} [s_A]_ρ² ([s_A]_ρ − 1)².

Since each a_t results from a softmax, its elements are positive and sum to 1. Thus the factors in R₁(A) are all non-negative, so R₁ assumes its minimal value of 0 when each a_t has binary elements; since these elements must sum to 1, such an a_t must be one-hot. R₂(A) is also minimized when each a_t is one-hot, because when a vector's L1 norm is 1, its L2 norm is maximized when it is one-hot. Although each of these terms individually favors one-hot vectors, empirically we find that using both terms helps the training process. In a discrete symbolic structure, each position can hold at most one symbol, and the final term R₃ in ROLE's regularizer R is designed to encourage this. In the vector s_A = Σ_{t=1}^{T} a_t, the ρ-th element is the total attention weight, over all symbols in the string, assigned to the ρ-th role: in the discrete case, this must be 0 (if no symbol is assigned this role) or 1 (if a single symbol is assigned this role). Thus R₃ is minimized when all elements of s_A are 0 or 1 (R₃ is similar to R₁, but with squared factors, since we are no longer assured that each element is at most 1). It is important to normalize each role embedding in the role matrix R so that small attention weights have correspondingly small impacts on the weighted-sum role embedding.
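The three terms can be written down directly. The sketch below (ours) uses the functional forms implied by the description above; the exact coefficients and weighting used in training may differ:

```python
import numpy as np

def role_regularizer(A, lam=1.0):
    """One-hot-encouraging regularizer over attention vectors.
    A: (T, n_R) array whose rows are the softmax outputs a_t (each sums to 1)."""
    R1 = np.sum(A * (1.0 - A))            # 0 iff every a_t is binary, hence one-hot
    R2 = -np.sum(A ** 2)                  # minimized when each a_t is one-hot
    s = A.sum(axis=0)                     # s_A: total attention per role
    R3 = np.sum(s ** 2 * (s - 1.0) ** 2)  # 0 iff each role is used 0 or 1 times
    return lam * (R1 + R2 + R3)

one_hot = np.array([[1.0, 0.0], [0.0, 1.0]])   # two one-hot attention vectors
soft = np.array([[0.5, 0.5], [0.5, 0.5]])      # maximally soft attention
assert role_regularizer(one_hot) < role_regularizer(soft)
```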

A.3 RNN trained on SCAN
To train the standard RNN on SCAN, we ran a limited hyperparameter search similar to the procedure in Lake and Baroni (2018). Since our goal was to produce a single embedding that captured the entire input sequence, we fixed the architecture as a GRU with a single hidden layer. We did not train models with attention, to investigate whether a standard RNN could capture compositionality in its single bottleneck encoding. The remaining hyperparameters were hidden dimension and dropout. We ran a search over hidden dimension sizes of 50, 100, 200, and 400, as well as dropout values of 0, 0.1, and 0.5 applied to the word embeddings and recurrent layer. Each network was trained with the Adam optimizer (Kingma and Ba, 2015) and a learning rate of 0.001 for 100,000 steps with a batch-size of 1. The best performing network had a hidden dimension of 100 and dropout of 0.1.

A.5 A role-assignment algorithm implicitly learned by the SCAN seq2seq encoder
The input sequences have three basic types that are relevant to determining the role assignment: sequences that contain and (e.g., jump around left and walk thrice), sequences that contain after (e.g., jump around left after walk thrice), and sequences without and or after (e.g., turn opposite right thrice). Within commands containing and or after, it is convenient to break the command down into the command before the connecting word and the command after it; for example, in the command jump around left after walk thrice, these two components would be jump around left and walk thrice.
• Sequence with and:
  - Elements of the command before and:
    - Action word directly before a cardinality: 4
    - Action word before, but not directly before, a cardinality: 34
    - thrice directly after an action word: 2
    - twice directly after an action word: 2
    - opposite in a sequence ending with twice: 8
    - opposite in a sequence ending with thrice: 34
    - around in a sequence ending with a cardinality: 22
    - Direction word directly before a cardinality: 2
    - Action word in a sequence without a cardinality: 46
    - opposite in a sequence without a cardinality: 2
    - Direction after opposite in a sequence without a cardinality: 26
    - around in a sequence without a cardinality: 3
    - Direction after around in a sequence without a cardinality: 22
    - Direction directly after an action in a sequence without a cardinality: 22

To show how this works with an example, consider the input jump around left after walk thrice. The command before after is jump around left. left, as the last word, is given role 8. around, as the second-to-last word, gets role 36. jump, as a first word that is not also the last or second-to-last word, gets role 11. The command after after is walk thrice. thrice, as the last word, gets role 46. walk, as the second-to-last word, gets role 4. Finally, after gets role 17 because no other elements have been assigned role 17 yet. These predicted outputs match those given by the Role Learner.
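The positional rules invoked in this worked example can be written as a small function. This is a hypothetical illustration covering only the rules the example uses (commands joined by after), not the full 47-conditional algorithm, and the function name is our own.

```python
# Sketch of just the role-assignment rules from the worked example above;
# roles 8, 36, 11, 17, 46, and 4 are the ones the example mentions.
def example_role_assignment(command):
    words = command.split()
    idx = words.index("after")
    roles = [None] * len(words)
    # Command before "after": last word -> role 8, second-to-last -> 36,
    # and a first word that is neither of those -> 11.
    roles[idx - 1] = 8
    if idx >= 2:
        roles[idx - 2] = 36
    if idx >= 3:
        roles[0] = 11
    roles[idx] = 17          # "after" itself
    # Command after "after": last word -> 46, second-to-last -> 4.
    roles[-1] = 46
    if len(words) - idx - 1 >= 2:
        roles[-2] = 4
    return roles
```

On the example input this reproduces the assignment described in the text: jump→11, around→36, left→8, after→17, walk→4, thrice→46.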
A.6 Discussion of the algorithm
We offer several observations about this algorithm.
1. This algorithm may seem convoluted, but a few observations can illuminate how the roles assigned by such an algorithm support success on the SCAN task. First, a sequence will contain role 30 if and only if it contains and, and it will contain role 17 if and only if it contains after. Thus, by implicitly checking for the presence of these two roles (regardless of the fillers bound to them), the decoder can tell whether the output involves one or two basic commands, where the presence of and or after leads to two basic commands and the absence of both leads to one basic command. Moreover, if there are two basic commands, whether it is role 17 or role 30 that is present can tell the decoder whether the input order of these commands also corresponds to their output order (when it is and in play, i.e., role 30), or if the input order is reversed (when it is after in play, i.e., role 17).
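The structural check described above can be made concrete. The sketch below is our own illustration of the logic, not the decoder's actual mechanism: the mere presence of role 30 or role 17 fixes the macro-structure of the output.

```python
# Roles that signal the connective, per the analysis above.
AND_ROLE, AFTER_ROLE = 30, 17

def output_plan(roles):
    """Given the set of roles present in an encoding, return how many basic
    commands the output contains and whether their input order is reversed."""
    if AND_ROLE in roles:        # "and": two commands, order preserved
        return (2, False)
    if AFTER_ROLE in roles:      # "after": two commands, order reversed
        return (2, True)
    return (1, False)            # neither: a single basic command
```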
With these basic structural facts established, the decoder can begin to decode the specific commands. For example, if the input is a sequence with after, it can begin with the command after after, which it can decode by checking which fillers are bound to the relevant roles for that type of command.
It may seem odd that so many of the roles are based on position (e.g., "first word" and "second-to-last word"), rather than more functionally-relevant categories such as "direction word." However, this approach may actually be more efficient: Each command consists of a single mandatory element (namely, an action word such as walk or jump) followed by several optional modifiers (namely, rotation words, direction words, and cardinalities). Because most of the word categories are optional, it might be inefficient to check for the presence of, e.g., a cardinality, since many sequences will not have one. By contrast, every sequence will have a last word, and checking the identity of the last word provides much functionally-relevant information: if that word is not a cardinality, then the decoder knows that there is no cardinality present in the command (because if there were, it would be the last word); and if it is a cardinality, then that is important to know, because the presence of twice or thrice can dramatically affect the shape of the output sequence. In this light, it is unsurprising that the SCAN encoder has implicitly learned several different roles that essentially mean the last element of a particular subcommand.
2. The algorithm does not constitute a simple, transparent role scheme. But its job is to describe the representations that the original network produces, and we have no a priori expectation about how complex that process may be. The role-assignment algorithm implicitly learned by ROLE is interpretable locally (each line is readily expressible in simple English), but not intuitively transparent globally. We see this as a positive result, in two respects.
First, it shows why ROLE is crucial: no human-generated role scheme would provide a good approximation to this algorithm. Such an algorithm can only be identified because ROLE is able to use gradient descent to find role schemes far more complex than any we would hypothesize intuitively. This enables us to analyze networks far more complex than we previously could, when analysis was necessarily limited to hand-designed role schemes based on human intuitions about how to perform the task.
Second, when future work illuminates the computation in the original SCAN GRU seq2seq decoder, the baroqueness of the role-assignment algorithm that ROLE has shown to be implicit in the seq2seq encoder can potentially explain certain limitations in the original model, which is known to suffer from severe failures of systematic generalization outside the training distribution (Lake and Baroni, 2018). It is reasonable to hypothesize that systematic generalization requires that the encoder learn an implicit role scheme that is relatively simple and highly compositional. Future proposals for improving the systematic generalization of models on SCAN can be examined using ROLE to test the hypothesis that greater systematicity requires greater compositional simplicity in the role scheme implicitly learned by the encoder.
3. While the role-assignment algorithm of A.5 may not be simple, from a certain perspective it is quite surprising that it is not far more complex. Although ROLE is provided 50 roles to learn to deploy as it likes, it only chooses to use 16 of them (only 16 are ever selected as the argmax(a_t); see Sec. 6.1). Furthermore, the SCAN grammar generates 20,910 input sequences, containing a total of 151,688 words (an average of 7.25 words per input). This means that, if one were to generate a series of conditional statements to determine which role is assigned to each word in every context, this could in theory require up to 151,688 conditionals (e.g., "if the filler is 'jump' in the context 'walk thrice after opposite left', then assign role 17"). However, our algorithm involves just 47 conditionals. This reduction helps explain how the model performs so well on the test set: if it used many more of the 151,688 possible conditional rules, it would completely overfit the training examples in a way that would be unlikely to generalize. The 47-conditional algorithm we found is more likely to generalize because it abstracts over many details of the context.

4. Were it not for ROLE's ability to characterize the representations generated by the original encoder in terms of implicit roles, providing an equally complete and accurate interpretation of those representations would require identifying the conditions determining the activation level of each of the 100 neurons hosting those representations. It seems to us overly optimistic to estimate that each neuron's activation level in the representation of a given input could be characterized by a property of the input statable in, say, two lines of roughly 20 words/symbols; yet even then, the algorithm would require 200 lines, whereas the algorithm in A.5 requires 47 lines of that scale.
Thus, by even such a crude estimate of the degree of complexity expected for an algorithm describing the representations in terms of neuron activities, the algorithm we find, stated over roles, is 4 times simpler.

A.7 TPEs trained on sentence embedding models
For each sentence embedding model, we trained three randomly initialized TPEs for each role scheme and selected the best-performing one, as measured by the lowest MSE. For each TPE, we used the original filler embedding from the sentence embedding model; this filler dimensionality is 25 for SST, 300 for SPINN and InferSent, and 620 for Skipthought. We applied a linear transformation to the pre-trained filler embedding whose input and output sizes both equal the dimensionality of the pre-trained embedding; the linearly transformed embedding is used as the filler vector in the filler-role binding in the TPE. For each TPE, we use a role dimension of 50. Training was done with a batch size of 32 using the Adam optimizer with a learning rate of .001. To generate tree roles for the English sentences, we used the constituency parser released in version 3.9.1 of Stanford CoreNLP (Klein and Manning, 2003).
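The filler pre-processing described above amounts to a square (dimension-preserving) linear map applied before the tensor-product binding of filler and role. The sketch below is an illustrative pure-Python version with our own names; in the real model the weight matrix is learned by gradient descent rather than fixed.

```python
# Hedged sketch of the TPE filler pipeline: linear map, then outer-product
# binding with a role vector. Not the authors' implementation.
def linear_map(W, v):
    """Apply a d x d weight matrix W (list of rows) to filler vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def bind(filler, role):
    """Tensor-product binding: outer product of filler and role vectors."""
    return [[f * r for r in role] for f in filler]

# Toy usage with d_filler = d_role = 2 (SST would use 25 and 50).
W = [[1.0, 0.0], [0.0, 1.0]]   # identity stands in for the learned map
filler = linear_map(W, [2.0, 3.0])
binding = bind(filler, [1.0, 0.0])
```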

A.8 ROLE trained on sentence embedding models
For each sentence embedding model, we trained three randomly initialized ROLE models and selected the best-performing one, as measured by the lowest MSE. We used the original filler embedding from the sentence embedding model (25 dimensions for SST, 300 for SPINN and InferSent, and 620 for Skipthought). We applied a linear transformation to the pre-trained filler embedding whose input and output sizes both equal the dimensionality of the pre-trained embedding; this linearly transformed embedding is used as the filler vector in the filler-role binding. We also applied a similar linear transformation to the pre-trained filler embedding before input to the role-learner LSTM. For each ROLE model, we provide up to 50 roles with a role dimension of 50. Training was done with a batch size of 32 using the Adam optimizer with a learning rate of .001. We performed a hyperparameter search over the regularization coefficient λ using the values in the set {1, 0.1, 0.01, 0.001, 0.0001}. For SST, SPINN, InferSent, and Skipthought, respectively, the best-performing network used λ = 0.001, 0.01, 0.001, 0.1.
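The λ search amounts to training one model per grid value and keeping the one with the lowest MSE. The sketch below is a hypothetical illustration: `train_role` is a stand-in (our own name) for training a ROLE model with a given λ and returning its MSE.

```python
# Grid of regularization coefficients from the search described above.
LAMBDAS = [1, 0.1, 0.01, 0.001, 0.0001]

def search_lambda(train_role):
    """Return (best_lambda, best_mse), where train_role(lam) trains one
    ROLE model with coefficient lam and returns its MSE."""
    results = {lam: train_role(lam) for lam in LAMBDAS}
    best = min(results, key=results.get)
    return best, results[best]
```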