Tensor Product Generation Networks for Deep NLP Modeling

We present a new approach to the design of deep networks for natural language processing (NLP), based on the general technique of Tensor Product Representations (TPRs) for encoding and processing symbol structures in distributed neural networks. A network architecture — the Tensor Product Generation Network (TPGN) — is proposed which is capable in principle of carrying out TPR computation, but which uses unconstrained deep learning to design its internal representations. Instantiated in a model for image-caption generation, TPGN outperforms LSTM baselines when evaluated on the COCO dataset. The TPR-capable structure enables interpretation of internal representations and operations, which prove to contain considerable grammatical content. Our caption-generation model can be interpreted as generating sequences of grammatical categories and retrieving words by their categories from a plan encoded as a distributed representation.


Introduction
In this paper we introduce a new architecture for natural language processing (NLP). On what type of principles can a computational architecture be founded? It would seem a sound principle to require that the hypothesis space for learning which an architecture provides include network hypotheses that are independently known to be suitable for performing the target task. Our proposed architecture makes available to deep learning network configurations that perform natural language generation by use of Tensor Product Representations (TPRs) (Smolensky and Legendre, 2006). Whether learning will create TPRs is unknown in advance, but what we can say with certainty is that the hypothesis space being searched during learning includes TPRs as one appropriate solution to the problem.
* LD is currently at Citadel.
TPRs are a general method for generating vector-space embeddings of complex symbol structures. Prior work has proved that TPRs enable powerful symbol processing to be carried out using neural network computation (Smolensky, 2012). This includes generating parse trees that conform to a grammar (Cho et al., 2017), although incorporating such capabilities into deep learning networks such as those developed here remains for future work. The architecture presented here relies on simpler use of TPRs to generate sentences; grammars are not explicitly encoded here.
We test the proposed architecture by applying it to image-caption generation on the MS-COCO dataset (COCO, 2017). The results improve upon a baseline deploying a state-of-the-art LSTM architecture (Vinyals et al., 2015), and the TPR foundations of the architecture provide greater interpretability.
Section 2 of the paper reviews TPR. Section 3 presents the proposed architecture, the Tensor Product Generation Network (TPGN). Section 4 describes the particular model we study for image captioning, and Section 5 presents the experimental results. Importantly, what the model has learned is interpreted in Section 5.3. Section 6 discusses the relation of the new model to previous work and Section 7 concludes.

Review of tensor product representation
The central idea of TPRs (Smolensky, 1990) can be appreciated by contrasting the TPR for a word string with a bag-of-words (BoW) vector-space embedding. In a BoW embedding, the vector that encodes Jay saw Kay is the same as the one that encodes Kay saw Jay: $J + K + s$, where $J$, $K$, $s$ are respectively the vector embeddings of the words Jay, Kay, saw. A TPR embedding that avoids this confusion starts by analyzing Jay saw Kay as the set {Jay/SUBJ, Kay/OBJ, saw/VERB}. (Other analyses are possible: see Section 3.) Next we choose an embedding in a vector space $V_F$ for Jay, Kay, saw as in the BoW case: $J$, $K$, $s$. Then comes the step unique to TPRs: we choose an embedding in a vector space $V_R$ for the roles SUBJ, OBJ, VERB: $r_{SUBJ}$, $r_{OBJ}$, $r_{VERB}$. Crucially, $r_{SUBJ} \neq r_{OBJ}$. Finally, the TPR for Jay saw Kay is the following vector:
$v_{\text{Jay saw Kay}} = J \otimes r_{SUBJ} + K \otimes r_{OBJ} + s \otimes r_{VERB}$ (1)
Each word is tagged with the role it fills in the sentence; Jay and Kay fill different roles.
This TPR avoids the BoW confusion: $v_{\text{Jay saw Kay}} \neq v_{\text{Kay saw Jay}}$ because $J \otimes r_{SUBJ} + K \otimes r_{OBJ} \neq J \otimes r_{OBJ} + K \otimes r_{SUBJ}$. In the terminology of TPRs, in Jay saw Kay, Jay is the filler of the role SUBJ, and $J \otimes r_{SUBJ}$ is the vector embedding of the filler/role binding Jay/SUBJ. In the vector space embedding, the binding operation is the tensor (or generalized outer) product $\otimes$; i.e., $J \otimes r_{SUBJ}$ is a tensor with 2 indices defined by:
$[J \otimes r_{SUBJ}]_{\varphi\rho} = [J]_{\varphi} [r_{SUBJ}]_{\rho}$
The tensor product can be used recursively, which is essential for the TPR embedding of recursive structures such as trees and for the computation of recursive functions over TPRs. However, in the present context, recursion will not be required, in which case the tensor product can be regarded as simply the matrix outer product (which cannot be used recursively); we can regard $J \otimes r_{SUBJ}$ as the matrix product $J r_{SUBJ}^{\top}$. Then Equation 1 becomes
$v_{\text{Jay saw Kay}} = J r_{SUBJ}^{\top} + K r_{OBJ}^{\top} + s\, r_{VERB}^{\top}$ (2)
Note that the set of matrices (or the set of tensors with any fixed number of indices) is a vector space; thus Jay saw Kay → $v_{\text{Jay saw Kay}}$ is a vector-space embedding of the symbol structures constituting sentences. Whether we regard $v_{\text{Jay saw Kay}}$ as a 2-index tensor or as a matrix, we can call it simply a 'vector' since it is an element of a vector space: in the context of TPRs, 'vector' is used in a general sense and should not be taken to imply a single-indexed array.
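The contrast between the BoW embedding and the TPR of Equation 2 can be checked numerically. The following is a minimal NumPy sketch, not code from the paper: the filler embeddings are random, the role vectors are taken to be rows of an identity matrix, and the dimensions `d_f` and `d_r` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_f, d_r = 4, 3  # assumed sizes of the filler space V_F and role space V_R

# Hypothetical filler embeddings for the words Jay, Kay, saw.
J, K, s = rng.normal(size=(3, d_f))
# Role embeddings for SUBJ, OBJ, VERB (orthonormal here for simplicity).
r_subj, r_obj, r_verb = np.eye(d_r)

def bow(*fillers):
    """Bag-of-words embedding: the sum of the word vectors."""
    return np.sum(fillers, axis=0)

def tpr(bindings):
    """TPR: the sum of filler/role outer products, as in Equation 2."""
    return sum(np.outer(f, r) for f, r in bindings)

bow_a = bow(J, s, K)                                   # Jay saw Kay
bow_b = bow(K, s, J)                                   # Kay saw Jay
tpr_a = tpr([(J, r_subj), (K, r_obj), (s, r_verb)])    # Jay saw Kay
tpr_b = tpr([(K, r_subj), (J, r_obj), (s, r_verb)])    # Kay saw Jay

same_bow = bool(np.allclose(bow_a, bow_b))
same_tpr = bool(np.allclose(tpr_a, tpr_b))
print(same_bow, same_tpr)  # True False
```

The two sentences collapse to one BoW vector but yield distinct `d_f × d_r` TPR matrices, because swapping the fillers of SUBJ and OBJ changes which columns the word vectors occupy.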
Crucial to the computational power of TPRs and to the architecture we propose here is the notion of unbinding. Just as an outer product (the tensor product) can be used to bind the vector embedding a filler Jay to the vector embedding a role SUBJ, giving $J \otimes r_{SUBJ}$ or $J r_{SUBJ}^{\top}$, so an inner product can be used to take the vector embedding a structure and unbind a role contained within that structure, yielding the symbol that fills the role.
In the simplest case of orthonormal role vectors $r_i$, to unbind role SUBJ in Jay saw Kay we can compute the matrix-vector product: $v_{\text{Jay saw Kay}}\, r_{SUBJ} = J$ (because $r_i^{\top} r_j = \delta_{ij}$ when the role vectors are orthonormal). A similar situation obtains when the role vectors are not orthonormal, provided they are not linearly dependent: for each role such as SUBJ there is an unbinding vector $u_{SUBJ}$ such that $r_i^{\top} u_j = \delta_{ij}$, so we get: $v_{\text{Jay saw Kay}}\, u_{SUBJ} = J$. A role vector such as $r_{SUBJ}$ and its unbinding vector $u_{SUBJ}$ are said to be duals of each other. (If $R$ is the matrix in which each column is a role vector $r_j$, then $R$ is invertible when the role vectors are linearly independent; then the unbinding vectors $u_i$ are the rows of $R^{-1}$. When the $r_j$ are orthonormal, $u_i = r_i$. Replacing the matrix inverse with the pseudo-inverse allows approximate unbinding if the role vectors are linearly dependent.)
We can now see how TPRs can be used to generate a sentence one word at a time. We start with the TPR for the sentence, e.g., $v_{\text{Jay saw Kay}}$. From this vector we unbind the role of the first word, which is SUBJ: the embedding of the first word is thus $v_{\text{Jay saw Kay}}\, u_{SUBJ} = J$, the embedding of Jay. Next we take the TPR for the sentence and unbind the role of the second word, which is VERB: the embedding of the second word is then $v_{\text{Jay saw Kay}}\, u_{VERB} = s$, the embedding of saw. And so on.
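Unbinding with non-orthonormal role vectors can be illustrated with a small NumPy sketch. It assumes three linearly independent roles stored as the columns of a square matrix `R` (the particular values are arbitrary), takes the unbinding vectors as the rows of its inverse, and uses random filler embeddings; none of these values come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
d_f = 5
roles = ["SUBJ", "VERB", "OBJ"]   # listed in the order the words are produced
words = ["Jay", "saw", "Kay"]

F = rng.normal(size=(d_f, 3))     # column j: hypothetical embedding of words[j]
R = np.eye(3) + 0.3               # column j: role vector of roles[j]; not orthonormal
U = np.linalg.inv(R)              # row i: unbinding vector u_i, dual to r_i

# Duality: the inner product of r_j and u_i is 1 if i == j, else 0.
duality_ok = bool(np.allclose(U @ R, np.eye(3)))

# TPR of the sentence: sum over j of f_j r_j^T, i.e. F R^T.
v = F @ R.T

# Generate the sentence one word at a time by unbinding successive roles.
recovered = [v @ U[i] for i in range(3)]
unbinding_ok = all(np.allclose(recovered[i], F[:, i]) for i in range(3))

# The pseudo-inverse coincides with the inverse when R is invertible.
pinv_ok = bool(np.allclose(np.linalg.pinv(R), U))
print(duality_ok, unbinding_ok, pinv_ok)
```

When the role vectors are linearly dependent, `np.linalg.inv` fails, and `np.linalg.pinv` would give the approximate unbinding mentioned in the text.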
To accomplish this, we need two representations to generate the $t$-th word: (i) the TPR of the sentence, $S$ (or of the string of not-yet-produced words, $S_t$) and (ii) the unbinding vector for the $t$-th word, $u_t$. The architecture we propose will therefore be a recurrent network containing two subnetworks: (i) a subnet S hosting the representation $S_t$, and (ii) a subnet U hosting the unbinding vector $u_t$. This is shown in Fig. 1.

A TPR-capable generation architecture
As Fig. 1 shows, the proposed Tensor Product Generation Network architecture (the dashed box labeled N) is designed to support the technique for generation just described: the architecture is TPR-capable. There is a sentence-encoding subnetwork S which could host a TPR of the sentence to be generated, and an unbinding subnetwork U which could output a sequence of unbinding vectors $u_t$; at time $t$, the embedding $f_t$ of the word produced, $x_t$, could then be extracted from $S_t$ via the matrix-vector product (shown in the figure by "2 ×"): $f_t = S_t u_t$. The lexical-decoding subnetwork L converts the embedding vector $f_t$ to the 1-hot vector $\mathbf{x}_t$ corresponding to the word $x_t$. Unlike some other work (Palangi et al., 2017), TPGN is not constrained to literally learn TPRs. The representations that will actually be housed in S and U are determined by end-to-end deep learning on a task: the bubbles in Fig. 1 show what would be the meanings of $S_t$, $u_t$ and $f_t$ if an actual TPR scheme were instantiated in the architecture. The learned representations $S_t$ will not be proven to literally be TPRs, but by analyzing the unbinding vectors $u_t$ the network learns, we will gain insight into the process by which the learned matrices $S_t$ give rise to the generated sentence.
Figure 1: Architecture of TPGN, a TPR-capable generation network. "2 ×" denotes the matrix-vector product.
The task studied here is image captioning; Fig. 1 shows that the input to this TPGN model is an image, preprocessed by a CNN which produces the initial representation in S, $S_0$. This vector $S_0$ drives the entire caption-generation process: it contains all the image-specific information for producing the caption. (We will call a caption a "sentence" even though it may in fact be just a noun phrase.) The two subnets S and U are mutually connected LSTMs (Hochreiter and Schmidhuber, 1997): see Fig. 2. The internal hidden state of U, $p_t$, is sent as input to S; U also produces output, the unbinding vector $u_t$. The internal hidden state of S, $S_t$, is sent as input to U, and also produced as output. As stated above, these two outputs are multiplied together to produce the embedding vector $f_t = S_t u_t$ of the output word $x_t$. Furthermore, the 1-hot encoding $\mathbf{x}_t$ of $x_t$ is fed back at the next time step to serve as input to both S and U.
What type of roles might the unbinding vectors be unbinding? A TPR for a caption could in principle be built upon positional roles, syntactic/semantic roles, or some combination of the two. In the caption "a man standing in a room with a suitcase", the initial "a" and "man" might respectively occupy the positional roles of POS(ITION)$_1$ and POS$_2$; "standing" might occupy the syntactic role of VERB; "in" the role of SPATIAL-P(REPOSITION); while "a room with a suitcase" might fill a 5-role schema DET(ERMINER)$_1$ N(OUN)$_1$ P DET$_2$ N$_2$. In fact we will provide evidence in Sec. 5.3.2 that our network learns just this kind of hybrid role decomposition; further evidence for these particular roles is presented elsewhere.
What form of information does the sentence-encoding subnetwork S need to encode in $S$? Continuing with the example of the previous paragraph, $S$ needs to be some approximation to the TPR summing several filler/role binding matrices. In one of these bindings, a filler vector $f_{a}$ (which the lexical subnetwork L will map to the article "a") is bound via the outer product to a role vector $r_{POS_1}$ which is the dual of the first unbinding vector produced by the unbinding subnetwork U: $u_{POS_1}$. In the first iteration of generation the model computes $S_1 u_{POS_1} = f_{a}$, which L then maps to "a". Analogously, another binding approximately contained in $S_2$ is $f_{man} r_{POS_2}^{\top}$. There are corresponding approximate bindings for the remaining words of the caption; these employ syntactic/semantic roles. One example is $f_{standing} r_{V}^{\top}$. At iteration 3, U decides the next word should be a verb, so it generates the unbinding vector $u_{V}$ which, when multiplied by the current output of S, the matrix $S_3$, yields a filler vector $f_{standing}$ which L maps to the output "standing". S decided the caption should deploy "standing" as a verb and included in $S$ an approximation to the binding $f_{standing} r_{V}^{\top}$. It similarly decided the caption should deploy "in" as a spatial preposition, approximately including in $S$ the binding $f_{in} r_{SPATIAL\text{-}P}^{\top}$; and so on for the other words in their respective roles in the caption.
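The idealized generation process described in this section can be sketched as a toy NumPy example. It is a deliberately simplified illustration, not the trained model: the sentence representation is an exact TPR held fixed across steps (in TPGN, $S_t$ is an LSTM state that changes with $t$), the role vectors are orthonormal so each unbinding vector equals its role vector, and the hypothetical word-embedding matrix `W_e` is given orthonormal columns so that decoding is exact.

```python
import numpy as np

vocab = ["a", "man", "standing", "in", "room"]       # toy vocabulary (assumed)
roles = ["POS1", "POS2", "V", "SPATIAL-P"]           # hybrid positional/syntactic roles
caption = ["a", "man", "standing", "in"]             # word filling each role, in order

rng = np.random.default_rng(2)
d = 8
# Hypothetical embedding matrix, d x V, with orthonormal columns (via QR).
W_e, _ = np.linalg.qr(rng.normal(size=(d, len(vocab))))
role_vecs = np.eye(len(roles))                       # row k: role vector r_k (= u_k here)

# Sentence TPR: the sum of filler/role outer products f_w r_k^T.
S = sum(np.outer(W_e[:, vocab.index(w)], role_vecs[k])
        for k, w in enumerate(caption))

def softmax(z):
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

generated = []
for k in range(len(roles)):
    f_t = S @ role_vecs[k]               # unbinding: filler vector for role k
    probs = softmax(W_e.T @ f_t)         # lexical decoding over the vocabulary
    generated.append(vocab[int(np.argmax(probs))])

print(generated)  # ['a', 'man', 'standing', 'in']
```

With learned, only approximately TPR-like representations, the unbound vector `f_t` would be a noisy version of a word embedding, and the softmax would distribute probability over several candidate words.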

System Description
As stated above, the unbinding subnetwork U and the sentence-encoding subnetwork S of Fig. 1 are each implemented as (1-layer, 1-directional) LSTMs (see Fig. 2); the lexical subnetwork L is implemented as a linear transformation followed by a softmax operation.
In the equations below, the LSTM variables internal to the S subnet are indexed by 1 (e.g., the forget-, input-, and output-gates are respectively $\hat{f}_1$, $\hat{i}_1$, $\hat{o}_1$) while those of the unbinding subnet U are indexed by 2.
Thus the state updating equations for S are, for $t = 1, \cdots, T$ = caption length, those of an LSTM. In these equations the gate nonlinearity is the (element-wise) logistic sigmoid function; $\sigma_h(\cdot)$ is the hyperbolic tangent function; and the operator $\odot$ denotes the Hadamard (element-wise) product. For clarity, biases (included throughout the model) are omitted from all equations in this paper. The initial state $\hat{S}_0$ is initialized by $\hat{S}_0 = C_s (v - \bar{v})$, where $v \in \mathbb{R}^{2048}$ is the vector of visual features extracted from the current image by ResNet (Gan et al., 2017) and $\bar{v}$ is the mean of all such vectors; $C_s \in \mathbb{R}^{(d \times d) \times 2048}$. On the output side, $\mathbf{x}_t \in \mathbb{R}^{V}$ is a 1-hot vector with dimension equal to the size of the caption vocabulary, $V$, and $W_e \in \mathbb{R}^{d \times V}$ is a word embedding matrix, the $i$-th column of which is the embedding vector of the $i$-th word in the vocabulary; it is obtained by the Stanford GloVe algorithm with zero mean (Pennington et al., 2017).
$\mathbf{x}_0$ is initialized as the one-hot vector corresponding to a "start-of-sentence" symbol. For U in Fig. 1, the state updating equations include the hidden-state update $p_t = \hat{o}_{2,t} \odot \sigma_h(c_{2,t})$. The initial state $p_0$ is the zero vector. The dimensionality of the crucial vectors shown in Fig. 1, $u_t$ and $f_t$, is increased from $d \times 1$ to $d^2 \times 1$ as follows. A block-diagonal $d^2 \times d^2$ matrix $S_t$ is created by placing $d$ copies of the $d \times d$ matrix $\hat{S}_t$ as blocks along the principal diagonal. This matrix is the output of the sentence-encoding subnetwork S. Now the 'filler vector' $f_t \in \mathbb{R}^{d^2}$, 'unbound' from the sentence representation $S_t$ with the 'unbinding vector' $u_t$, is obtained by Eq. (16):
$f_t = S_t u_t$ (16)
Here $u_t \in \mathbb{R}^{d^2}$, the output of the unbinding subnetwork U, is computed as in Eq. (17), where $W_u \in \mathbb{R}^{d^2 \times d}$ is U's output weight matrix:
$u_t = W_u p_t$ (17)
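The block-diagonal construction and the unbinding product can be made concrete with a short NumPy sketch. It assumes that U's output layer is the plain linear map $u_t = W_u p_t$ (biases omitted, as elsewhere in the paper), uses a small illustrative $d$ rather than the value used in the experiments, and fills $\hat{S}_t$, $p_t$ and $W_u$ with random numbers in place of learned values.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4                                   # illustrative size only

S_hat = rng.normal(size=(d, d))         # stand-in for the d x d LSTM state of S
p_t = rng.normal(size=d)                # stand-in for the hidden state of U
W_u = rng.normal(size=(d * d, d))       # stand-in for U's output weight matrix

# Block-diagonal d^2 x d^2 matrix with d copies of S_hat on the diagonal.
S_t = np.kron(np.eye(d), S_hat)

u_t = W_u @ p_t                         # unbinding vector in R^{d^2} (assumed linear map)
f_t = S_t @ u_t                         # filler vector in R^{d^2}

# Equivalent computation without materializing the d^2 x d^2 matrix:
# split u_t into d chunks of length d and apply S_hat to each chunk.
f_t_fast = (u_t.reshape(d, d) @ S_hat.T).reshape(-1)

shapes = (S_t.shape, u_t.shape, f_t.shape)
agree = bool(np.allclose(f_t, f_t_fast))
print(shapes, agree)
```

Because $S_t$ is block diagonal, the product costs $O(d^3)$ in the chunked form rather than $O(d^4)$ for the dense matrix-vector product; whether the original implementation exploits this is not stated in the text.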
Finally, the lexical subnetwork L produces a decoded word $\mathbf{x}_t \in \mathbb{R}^{V}$ by
$\mathbf{x}_t = \sigma_s(W_x f_t)$ (18)
where $\sigma_s(\cdot)$ is the softmax function and $W_x \in \mathbb{R}^{V \times d^2}$ is the overall output weight matrix. Since $W_x$ plays the role of a word de-embedding matrix, we can set it from the word-embedding matrix $W_e$ (Eq. (19)). Since $W_e$ is pre-defined, we directly set $W_x$ by Eq. (19) without training L through Eq. (18). Note that S and U are learned jointly through end-to-end training as shown in Algorithm 1.
Algorithm 1: End-to-end training of S and U
Input: Image feature vector $v^{(i)}$ and corresponding caption

Dataset
To evaluate the performance of our proposed model, we use the COCO dataset (COCO, 2017). The COCO dataset contains 123,287 images, each of which is annotated with at least 5 captions. We use the same pre-defined splits as in (Karpathy and Fei-Fei, 2015; Gan et al., 2017): 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. We use the same vocabulary as that employed in (Gan et al., 2017), which consists of 8,791 words.

Evaluation
For the CNN of Fig. 1, we used ResNet-152 (He et al., 2016), pretrained on the ImageNet dataset. The feature vector v has 2048 dimensions. Word embedding vectors in W e are downloaded from the web (Pennington et al., 2017). The model is implemented in TensorFlow (Abadi et al., 2015) with the default settings for random initialization and optimization by backpropagation.
In our experiments, we choose $d = 25$ (where $d$ is the dimension of vector $p_t$). The dimension of $S_t$ is $625 \times 625$ (while $\hat{S}_t$ is $25 \times 25$); the vocabulary size is $V = 8{,}791$; the dimension of $u_t$ and $f_t$ is $d^2 = 625$.
The main evaluation results on the MS COCO dataset are reported in Table 5.2. The widely used BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and CIDEr (Vedantam et al., 2015) metrics are reported in our quantitative evaluation of the performance of the proposed model. In evaluation, our baseline is the widely used CNN-LSTM captioning method originally proposed in (Vinyals et al., 2015). For comparison, we include the results from that paper in the first line of Table 5.2. We also re-implemented the model using the latest ResNet features and report the results in the second line of Table 5.2. Our re-implementation of the CNN-LSTM method matches the performance reported in (Gan et al., 2017), showing that the baseline is a state-of-the-art implementation. For TPGN, we use parameter settings in a similar range to those in (Gan et al., 2017). TPGN has slightly more parameters than the CNN-LSTM, though the counts are comparable. The training time of TPGN is roughly 50% longer than that of the CNN-LSTM model. The weights in TPGN are updated at every mini-batch; in the experiments, we use a batch size of 64 images.
As shown in Table 5.2, the proposed TPGN appreciably outperforms the CNN-LSTM baseline on all metrics. The improvement in BLEU-n is greater for greater n; TPGN particularly improves generation of longer subsequences. The results attest to the effectiveness of the TPGN architecture. It is worth mentioning that this paper aims to develop a Tensor Product Representation (TPR) inspired network to replace the core layers in an LSTM; it is therefore directly comparable to an LSTM baseline, and in the experiments we focus on comparison with a strong CNN-LSTM baseline.
Table 5.2 columns: Methods, METEOR, BLEU-1, BLEU-2, BLEU-3, BLEU-4, CIDEr; first row: NIC (Vinyals et al., 2015).
We acknowledge that more recent papers (Xu et al., 2017; Rennie et al., 2017; Yao et al., 2017; Lu et al., 2017; Gan et al., 2017) reported better performance on the task of image captioning. Performance improvements in these more recent models are mainly due to using better image features such as those obtained by Region-based Convolutional Neural Networks (R-CNN), or using reinforcement learning (RL) to directly optimize metrics such as CIDEr, or using more complex attention mechanisms (Gan et al., 2017) to provide a better context vector for caption generation, or using an ensemble of multiple LSTMs, among others. However, the LSTM is still playing a core role in these works and we believe improvement over the core LSTM, in both performance and interpretability, is still very valuable; that is why we compare the proposed TPGN with a state-of-the-art native LSTM (the second line of Table 5.2).

Interpretation of learned unbinding vectors
To get a sense of how the sentence encodings $S_t$ learned by TPGN approximate TPRs, we now investigate the meaning of the role-unbinding vector $u_t$ the model uses to unbind from $S_t$, via Eq. (16), the filler vector $f_t$ that produces, via Eq. (18), the one-hot vector $\mathbf{x}_t$ of the $t$-th generated caption word. The meaning of an unbinding vector is the meaning of the role it unbinds. Interpreting the unbinding vectors reveals the meaning of the roles in a TPR that S approximates.

Visualization of u t
We run the TPGN model with 5,000 test images as input, and obtain the unbinding vector $u_t$ used to generate each word $x_t$ in the caption of a test image. We plot 1,000 unbinding vectors $u_t$, which correspond to the first 1,000 words in the resulting captions of these 5,000 test images. There are 17 parts of speech (POS) in these 1,000 words. The POS tags are obtained by the Stanford Parser (Manning, 2017). We use the Embedding Projector in TensorBoard (Google, 2017) to plot the 1,000 unbinding vectors $u_t$, with a custom linear projection in TensorBoard reducing the 625 dimensions of $u_t$ to the 2 dimensions shown in Fig. 3 through Fig. 7. Fig. 3 shows the unbinding vectors of the 1,000 words; different POS tags of words are represented by different colors. In fact, we can partition the 625-dimensional space of $u_t$ into 17 regions, each of which contains on average 76.3% words of a single POS type; i.e., each region is dominated by words of one POS type. This clearly indicates that each unbinding vector contains important grammatical information about the word it generates. As examples, Fig. 4 to Fig. 7 show the distribution of the unbinding vectors of nouns, verbs, adjectives, and prepositions, respectively. Furthermore, we show in (Huang et al., 2018) that the subject and the object of a sentence can be distinguished based on $u_t$.

Clustering of u t
Since the previous section indicates that there is a clustering structure for $u_t$, in this section we partition the vectors $u_t$ into $N_u$ clusters and examine the grammatical roles played by $u_t$.
First, we run the trained TPGN model on the 113,287 training images, obtaining the role-unbinding vector $u_t$ used to generate each word $x_t$ in the caption sentence. There are approximately 1.2 million $u_t$ vectors over all the training images. We apply the K-means clustering algorithm to these vectors to obtain $N_u$ clusters and the centroid $\mu_i$ of each cluster $i$ ($i = 0, \cdots, N_u - 1$).
Then, we run the TPGN model with 5,000 test images as input, and obtain the role-unbinding vector $u_t$ of each word $x_t$ in the caption sentence of a test image. Using the nearest neighbor rule, we obtain the index $i$ of the cluster that each $u_t$ is assigned to.
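The two-stage procedure (K-means on training vectors, then nearest-centroid assignment of test vectors) can be sketched as follows. This is a plain NumPy version of Lloyd's algorithm run on synthetic data, offered only as an illustration: the function names `kmeans` and `assign`, the random initialization, and the two artificial blobs standing in for unbinding vectors are all assumptions, and the K-means implementation actually used in the experiments is not specified in the text.

```python
import numpy as np

def assign(X, centroids):
    """Nearest-neighbor rule: index of the closest centroid for each row of X."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = assign(X, centroids)
        for i in range(k):
            members = X[labels == i]
            if len(members):              # keep the old centroid if a cluster empties
                centroids[i] = members.mean(axis=0)
    return centroids

# Synthetic stand-ins for training unbinding vectors: two well-separated blobs.
rng = np.random.default_rng(0)
train_u = np.vstack([rng.normal(0.0, 0.1, size=(100, 6)),
                     rng.normal(5.0, 0.1, size=(100, 6))])
centroids = kmeans(train_u, k=2)          # N_u = 2 clusters

# Synthetic stand-ins for test unbinding vectors, one near each blob.
test_u = np.array([[0.05] * 6, [4.95] * 6])
test_labels = assign(test_u, centroids)
train_labels = assign(train_u, centroids)
print(test_labels)
```

Cluster indices produced by K-means are arbitrary, so which group ends up as Cluster 0 or Cluster 1 depends on the initialization; only the grouping itself is meaningful.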
The partitioning of the unbinding vectors $u_t$ into $N_u = 2$ clusters exposes the most fundamental distinction made by the roles. We find that the vectors assigned to Cluster 1 generate words which are nouns, pronouns, indefinite and definite articles, and adjectives, while the vectors assigned to Cluster 0 generate verbs, prepositions, conjunctions, and adverbs. Thus Cluster 1 contains the noun-related words, Cluster 0 the verb-like words (verbs, prepositions and conjunctions are all potentially followed by noun-phrase complements, for example). Cross-cutting this distinction is another dimension, however: the initial word in a caption (always a determiner) is sometimes generated with a Cluster 1 unbinding vector, sometimes with a Cluster 0 vector. Outside the caption-initial position, exceptions to the nominal/verbal ∼ Cluster 1/0 generalization are rare, as attested by the high rates of conformity to the generalization shown in Table 5.3.1. Table 5.3.1 shows the likelihood of correctness of this 'N/V' generalization for the words in 5,000 sentences captioned for the 5,000 test images; $N_w$ is the number of words in the category, $N_r$ is the number of words conforming to the generalization, and $P_c = N_r / N_w$ is the proportion conforming. We use the Natural Language Toolkit (NLTK, 2017) to identify the part of speech of each word in the captions.
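The conformity statistics $N_w$, $N_r$ and $P_c = N_r / N_w$ can be computed per category as in the following sketch. The category names follow the prose above, but the function name `conformity`, the input format (one `(category, cluster)` pair per generated word) and the toy records are hypothetical; the records are not results from the paper.

```python
from collections import Counter

# Cluster expected for each category under the 'N/V' generalization.
EXPECTED = {
    "noun": 1, "pronoun": 1, "article": 1, "adjective": 1,        # nominal: Cluster 1
    "verb": 0, "preposition": 0, "conjunction": 0, "adverb": 0,   # verbal: Cluster 0
}

def conformity(records):
    """Return {category: (N_w, N_r, P_c)} for (category, cluster) pairs."""
    n_w, n_r = Counter(), Counter()
    for category, cluster in records:
        if category not in EXPECTED:      # ignore categories outside the generalization
            continue
        n_w[category] += 1
        n_r[category] += int(cluster == EXPECTED[category])
    return {c: (n_w[c], n_r[c], n_r[c] / n_w[c]) for c in n_w}

# Toy input, invented for illustration only.
toy_records = [
    ("noun", 1), ("noun", 1), ("noun", 0),
    ("verb", 0), ("verb", 1),
    ("preposition", 0), ("adjective", 1),
]
table = conformity(toy_records)
for category, (nw, nr, pc) in table.items():
    print(f"{category:12s} N_w={nw} N_r={nr} P_c={pc:.3f}")
```

Whether caption-initial words are excluded before counting is left to the caller, since they are generated from both clusters.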
A similar analysis with N u = 10 clusters reveals the results shown in Table 5.3.1; these results concern the first 100 captions, which were inspected manually to identify interpretable patterns. (More comprehensive results will be discussed elsewhere.) The clusters can be interpreted as falling into 3 groups (see Table 5.3.1). Clusters 2 and 3 are clearly positional roles: every initial word is generated by a role-unbinding vector from Cluster 2, and such vectors are not used elsewhere in the string. The same holds for Cluster 3 and the second caption word.
For caption words after the second word, position is replaced by syntactic/semantic properties for interpretation purposes. The vector clusters aside from 2 and 3 generate words with a dominant grammatical category: for example, unbinding vectors assigned to Cluster 4 generate words that are 91% likely to be prepositions, and 72% likely to be spatial prepositions. Cluster 7 generates 88% nouns and 9% adjectives, with the remaining 3% scattered across other categories. As Table 5.3.1 shows, Clusters 1, 5, 7, and 9 are primarily nominal, and Clusters 0, 4, 6, and 8 primarily verbal. (Only Cluster 5 spans the N/V divide.)

Related work
This work follows a great deal of recent caption-generation literature in exploiting end-to-end deep learning with a CNN image-analysis front end producing a distributed representation that is then used to drive a natural-language generation process, typically using RNNs (Mao et al., 2015; Vinyals et al., 2015; Devlin et al., 2015; Chen and Zitnick, 2015; Donahue et al., 2015; Karpathy and Fei-Fei, 2015; Kiros et al., 2014a,b; Xu et al., 2017; Rennie et al., 2017; Yao et al., 2017; Lu et al., 2017). Our grammatical interpretation of the structural roles of words in sentences makes contact with other work that incorporates deep learning into grammatically-structured networks (Tai et al., 2015; Kumar et al., 2016; Kong et al., 2017; Andreas et al., 2015; Yogatama et al., 2016; Maillard et al., 2017; Socher et al., 2010; Pollack, 1990). Here, the network is not itself structured to match the grammatical structure of sentences being processed; the structure is fixed, but is designed to support the learning of distributed representations that incorporate structure internal to the representations themselves: filler/role structure.
TPRs are also used in NLP in (Palangi et al., 2017) but there the representation of each individual input word is constrained to be a literal TPR filler/role binding. (The idea of using the outer product to construct internal representations was also explored in (Fukui et al., 2016).) Here, by contrast, the learned representations are not themselves constrained, but the global structure of the network is designed to display the somewhat abstract property of being TPR-capable: the architecture uses the TPR unbinding operation of the matrix-vector product to extract individual words for sequential output.

Conclusion
Tensor Product Representation (TPR) (Smolensky, 1990) is a general technique for constructing vector embeddings of complex symbol structures in such a way that powerful symbolic functions can be computed using hand-designed neural network computation. Integrating TPR with deep learning is a largely open problem for which the work presented here proposes a general approach: design deep architectures that are TPR-capable, meaning that TPR computation is within the scope of the capabilities of the architecture in principle. For natural language generation, we proposed such an architecture, the Tensor Product Generation Network (TPGN): it embodies the TPR operation of unbinding which is used to extract particular symbols (e.g., words) from complex structures (e.g., sentences). The architecture can be interpreted as containing a part that encodes a sentence and a part that selects one structural role at a time to extract from the sentence. We applied the approach to image-caption generation, developing a TPGN model that was evaluated on the COCO dataset, on which it outperformed LSTM baselines on a range of standard metrics. Unlike standard LSTMs, however, the TPGN model admits a level of interpretability: we can see which roles are being unbound by the unbinding vectors generated internally within the model. We find such roles contain considerable grammatical information, enabling POS tag prediction for the words they generate and displaying clustering by POS.