Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization

Abstractive summarization, the task of generating a concise summary of input documents, requires: (1) reasoning over the source document to determine the salient pieces of information scattered across the long document, and (2) composing a cohesive text by reconstructing these salient facts into a shorter summary that faithfully reflects the complex relations connecting these facts. In this paper, we adapt TP-Transformer (Schlag et al., 2019), an architecture that enriches the original Transformer (Vaswani et al., 2017) with the explicitly compositional Tensor Product Representation (TPR), for the task of abstractive summarization. The key feature of our model is a structural bias that we introduce by encoding two separate representations for each token to represent the syntactic structure (with role vectors) and semantic content (with filler vectors) separately. The model then binds the role and filler vectors into the TPR as the layer output. We argue that the structured intermediate representations enable the model to take better control of the contents (salient facts) and structures (the syntax that connects the facts) when generating the summary. Empirically, we show that our TP-Transformer outperforms the Transformer and the original TP-Transformer significantly on several abstractive summarization datasets based on both automatic and human evaluations. On several syntactic and semantic probing tasks, we demonstrate the emergent structural information in the role vectors and the performance gain by information specificity of the role vectors and improved syntactic interpretability in the TPR layer outputs. (Code and models are available at https://github.com/jiangycTarheel/TPT-Summ)


Introduction
Abstractive summarization is the task of generating a shorter version of a source text without necessarily reusing the sentences from the original source, while preserving the meaning of its salient contents. It is a complex task that requires: semantic understanding of the source text and reasoning over its lexical units, making inferences about their relation to extract salient facts which are scattered across the long document, as well as generating a concise and coherent sequence of new sentences that covers the salient facts. While humans are remarkably good at this type of reasoning and abstraction, developing models that are capable of extraction, comprehension, abstraction, and reformulation of salient contents has been an open research question.
One prominent aspect of abstractive summarization is that models struggle with combining multiple salient aspects in the source text into a coherent and grammatical set of sentences that preserve the original information in the source document. As shown in Fig. 1, these pieces of salient information ("death", "emergency landing", "beach") are often connected by complex syntactic, causal, and temporal relations and are loosely grouped under the main topic of the source document. Transformer models (Vaswani et al., 2017) encode the syntactic and semantic information of the input text into a single representation space with self-attention, and decode the salient aspects into a short summary with cross-attention. However, despite the large number of training examples, current state-of-the-art Transformer-based approaches still struggle with systematic generalization of the composition of multiple salient pieces of information.
In this paper, we investigate new types of computational primitives for transformers based on Tensor Product Representations (TPRs) (Smolensky, 1990) which are explicitly-compositional vector embeddings of symbolic structures. A Tensor Product Representation encodes a constituent in a symbolic structure as a composite of a role, which encodes the structural information (e.g., the dependency relation with another word), and a filler, which encodes the content of the constituent (e.g., the meaning of a word). Analogously, the TP-TRANSFORMER constructs a pair of representations for every token at every layer: a filler vector returned by attention and a novel role vector. As visualized in Fig. 2, the model then binds the role and filler vectors to produce the output of every token as a TPR. We adapt the TP-TRANSFORMER (Schlag et al., 2019), which was proposed for solving mathematics problems, for the task of abstractive summarization. Unlike the original TP-TRANSFORMER, which directly projects the input representation into a continuous role vector space, our model generates the role vectors by attending to a learned dictionary of role embeddings (Palangi et al., 2018). We observe that most learned role attention distributions are approximately one-hot, thus restricting the role vectors to a highly discrete space. This structural inductive bias encourages the TP-TRANSFORMER to encode the syntactic information in the discrete roles while isolating the semantics in the continuous fillers.
To test the ability of our TP-TRANSFORMER with discrete roles against the standard Transformer and the TP-TRANSFORMER with continuous roles, we build several models from scratch on a number of summarization datasets spanning different degrees of abstractiveness, output summary lengths, and domains. Our TP-TRANSFORMER significantly outperforms the standard Transformer and the TP-TRANSFORMER with continuous roles on the XSum (Narayan et al., 2018), Wikihow (Koupaee and Wang, 2018), and Arxiv (Cohan et al., 2018) datasets and achieves competitive performance on the CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016) dataset, measured by automatic metrics including ROUGE (Lin, 2004) and METEOR (Denkowski and Lavie, 2014). Our human evaluations on the XSum and Wikihow datasets also correlate with the automatic metrics, demonstrating that summaries generated by our TP-TRANSFORMER are indeed better than the Transformer's generations.
Furthermore, to investigate the structural representation that naturally emerges during training and the advantage of having compositional TPR hidden states, we design a suite of decoder probing tasks to explore the information encoded in the role, filler, and TPR space. We adopt the encoder probing task design presented in Tenney et al. (2019b) and create four decoder probing tasks: Part-of-speech tagging (POS), Dependency Labeling (DEP), Semantic Role Labeling (SRL), and Named Entity Labeling (NEL). Our findings collectively show that the decoder's role vectors encode a wealth of syntactic structures, aiding the decoder in deducing the syntactic features (e.g., being a proper noun, being the object of the root predicate) of the next token to be generated. The decoder's filler vectors on the other hand encode more semantic information (e.g., being a person's name). Furthermore, we observe that having the compositional TPR results in a more interpretable final representation than the original Transformer has at every layer, regarding the syntactic features of the next word to be generated. Our results support our hypothesis that by disentangling semantics and syntax, such structured intermediate representations enable the model to better control both the content to be conveyed and the syntactic structure needed to express it, ultimately improving the factuality and grammaticality of the generated summaries.
Our overall contributions are as follows: (1) we present a novel adaptation of the original Transformer architecture that incorporates a dictionary of role embeddings at every layer and generates Tensor Product Representation by binding the role vectors with attention outputs (filler vectors); (2) show that our TP-TRANSFORMER outperforms the Transformer as well as the original TP-TRANSFORMER (Schlag et al., 2019) on several abstractive summarization datasets; and (3) demonstrate the emergent structures in representations by revealing the disentangled syntactic and semantic information encoded in the role and filler spaces.

The TP-TRANSFORMER
We build our TP-TRANSFORMER based on the Transformer architecture used in Raffel et al. (2020). A TP-TRANSFORMER encoder applied to a sequence of tokens i = 1, ..., I can be seen as a 2-dimensional lattice of cells (i, l) where i is the position of the input token and l = 1, ..., L are the layer indices. All cells in the encoder have the same architecture and the cells at the same layer share the same weights. We introduce the basic components of a TP-TRANSFORMER cell in Sec. 2.2 and its encoder and decoder cells in Sec. 2.3.

Tensor-Product Representation Basics
Tensor-Product Representations (TPRs; Smolensky, 1990) are explicitly compositional vector embeddings of symbolic structures, where each constituent of the structure is represented as the product of a role vector, which encodes its structural information, and a filler vector, which contains the content. The TPR of a whole structure is the sum of the representations of its constituents. To represent any 3-digit number using TPRs, we need three role vectors, {$r(p_1)$: ones place, $r(p_2)$: tens place, $r(p_3)$: hundreds place}, and ten filler vectors $f$ for the ten digits. For example, the TPR of a number with ones digit $a$, tens digit $b$, and hundreds digit $c$ is
$$r(p_1) \otimes f(a) + r(p_2) \otimes f(b) + r(p_3) \otimes f(c),$$
where $\otimes$ is the tensor product. When representing a number, the role vectors operate similarly to the positional embeddings in a Transformer (Vaswani et al., 2017). However, when representing natural languages, the role vectors need to encode a variety of structural information (e.g., predicate-argument, tense, etc.) and thus it is infeasible to hand-design an entire suite of role vectors as we did for numbers. To overcome this challenge, for every token, we dynamically compute its role vector from a dictionary of a finite number of role embeddings learned with the entire model and treat the self-attention outputs as the fillers. We introduce the full computation procedure in Sec. 2.2.2.
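The 3-digit example can be made concrete with a small NumPy sketch. It assumes one-hot (hence orthonormal) role and filler vectors, a choice made here purely for illustration; the names `encode` and `decode` are hypothetical and not part of the model. With orthonormal roles, each digit is recovered exactly by projecting the TPR onto the role of its place.

```python
import numpy as np

# Assumption: one-hot roles and fillers, chosen only to keep the example exact.
roles = np.eye(3)      # rows: ones place, tens place, hundreds place
fillers = np.eye(10)   # rows: digits 0..9


def encode(number: int) -> np.ndarray:
    """TPR of a 3-digit number: sum over places of role (outer product) filler."""
    digits = [(number // 10**p) % 10 for p in range(3)]  # ones, tens, hundreds
    return sum(np.outer(roles[p], fillers[d]) for p, d in enumerate(digits))


def decode(tpr: np.ndarray) -> int:
    """Unbind each place by projecting the TPR onto its role vector."""
    digits = [int(np.argmax(roles[p] @ tpr)) for p in range(3)]
    return sum(d * 10**p for p, d in enumerate(digits))


tpr = encode(472)        # a (3, 10) matrix holding all three bindings
recovered = decode(tpr)  # unbinding returns the original number
```

Hand-designed roles of this kind work for numbers because the structure is fixed in advance, which is exactly what is not available for natural language.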

The TP-TRANSFORMER Cell
Similar to the basic Transformer cell, at every layer, a TP-TRANSFORMER encoder cell starts with a layer normalization and the multi-head self-attention followed by a residual layer. Then, the cell treats the output vectors as fillers and binds them to role vectors to construct a Tensor Product Representation, which is then passed through the feed-forward network to yield the final states.

Multi-Head Attention
The TP-TRANSFORMER cell adopts multi-head attention (Vaswani et al., 2017) to enable information passing between tokens. At any layer, denote the input vectors as $X \in \mathbb{R}^{k_x \times d_m}$ and the attention target vectors as $Y \in \mathbb{R}^{k_y \times d_m}$, where $k_x$, $k_y$ are the lengths of the sequences and $d_m$ is the dimension of the input vectors. In the case of self-attention, we have $Y = X$, while for the encoder-decoder cross-attention, $Y$ is the encoder's output vectors. We first apply layer normalization (Ba et al., 2016) to get $\hat{X}$ and then linearly project to the query, key, and value vectors for each attention head $h = 1, ..., H$:
$$Q^h = \hat{X} W_q^h, \qquad K^h = Y W_k^h, \qquad V^h = Y W_v^h$$
where $W_q^h, W_k^h, W_v^h \in \mathbb{R}^{d_m \times d_k}$. The attention output matrix $\bar{V}^h$ for each head $h$ is computed as:
$$\bar{V}^h = \mathrm{softmax}\left(\frac{Q^h (K^h)^\top}{\sqrt{d_k}}\right) V^h$$
where $d_k$ is the dimension of the key vectors $K^h$. The multi-head attention output $O$ is the concatenation of the attention outputs from all heads followed by another linear projection $W_o \in \mathbb{R}^{d_m \times d_m}$. We end the multi-head attention with a residual connection with the layer input vectors $\hat{X}$:
$$\mathrm{MHAttn}(X, Y) = \hat{X} + [\bar{V}^1, ..., \bar{V}^H]\, W_o \qquad (3)$$
where $\bar{V}^h$ is the attention output for the $h$-th head.
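The attention computation described above can be sketched in NumPy as follows. This is a minimal illustration rather than the released implementation: it assumes bias-free projections, a layer normalization without learned gain, and a residual connection on the normalized input, none of which the extracted text fully confirms. The function names and the stacked weight layout are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned gain)."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def mh_attn(X, Y, Wq, Wk, Wv, Wo):
    """Multi-head attention with a residual connection.

    X: (k_x, d_m) inputs, Y: (k_y, d_m) attention targets.
    Wq, Wk, Wv: (H, d_m, d_k) per-head projections; Wo: (H * d_k, d_m).
    """
    X_hat = layer_norm(X)
    heads = []
    for h in range(Wq.shape[0]):
        Q, K, V = X_hat @ Wq[h], Y @ Wk[h], Y @ Wv[h]
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (k_x, k_y) weights
        heads.append(A @ V)                           # (k_x, d_k)
    # Assumption: the residual is taken with the normalized input X_hat.
    return X_hat + np.concatenate(heads, axis=-1) @ Wo
```

Self-attention corresponds to calling `mh_attn(X, X, ...)`, and cross-attention to passing the encoder output as `Y`.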

Computing TPRs
Role Embeddings. Following Palangi et al. (2018), but departing from Schlag et al. (2019), every layer of our TP-TRANSFORMER is equipped with a dictionary $r \in \mathbb{R}^{N_r \times d_r}$ of $N_r$ distinct role embeddings with a dimension of $d_r$. Each role embedding $r_n$, $n = 1, ..., N_r$, is randomly initialized in the entire network. The role embeddings are normalized before computing role vectors:
$$\hat{r}_n = \frac{r_n}{\lVert r_n \rVert_2} \quad \text{for } n = 1, ..., N_r \qquad (4)$$
At each layer, the model computes a weighted combination of these role embeddings $\hat{r}$ to form a unique role vector for every token.
Multi-Head TPR Binding. Our filler vectors correspond to the multi-head attention output $F = \mathrm{MHAttn}(X)$ (Eqn. 3). The filler $F$ of each token has a corresponding role vector $R$. We first compute the role vector $R^h \in \mathbb{R}^{d_r}$ of each token at every head $h = 1, ..., H$ as a weighted average of the normalized role embeddings $\hat{r}$. We then concatenate the $R^h \in \mathbb{R}^{k_x \times d_r}$ of the $H$ heads to get the multi-head role vectors $R \in \mathbb{R}^{k_x \times (d_r \cdot H)}$ for all $k_x$ tokens. We define this process formally as:
$$R^h = \mathrm{softmax}(F W_r^h)\, \hat{r}, \qquad R = [R^1, ..., R^H] \qquad (5)$$
where $W_r^h \in \mathbb{R}^{d_m \times N_r}$ is the linear projection that computes the attention scores over the role embeddings for every token.² We use a Hadamard product³ to approximate the full tensor product in binding the role vectors $R$ with the filler vectors $F$, as Schlag et al. (2019) showed that using the Hadamard product allows learning an optimal lower-rank approximation of the full TPRs. The binding operation is followed by an addition with the unbound fillers ($F$) to return the residual TPR vectors:
$$\mathrm{TPR}(F) = R \odot F + F$$
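The role attention and binding step can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the released code: the function name `tpr_bind` is hypothetical, the role dictionary is assumed to be shared by all heads of a layer, each head is assumed to have its own score projection, and $d_r \cdot H = d_m$ is assumed so that roles and fillers have matching widths.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def tpr_bind(F, role_emb, Wr):
    """Bind fillers to attention-selected roles with a Hadamard product.

    F: (k_x, d_m) filler vectors (multi-head attention output).
    role_emb: (N_r, d_r) role dictionary of one layer.
    Wr: (H, d_m, N_r) per-head score projections (assumed layout).
    Returns the residual TPR (k_x, d_m) and the role attention (H, k_x, N_r).
    """
    # L2-normalize every role embedding before it is used.
    r_hat = role_emb / np.linalg.norm(role_emb, axis=-1, keepdims=True)
    # Attention of every token over the N_r roles, separately per head.
    attn = softmax(np.einsum("kd,hdn->hkn", F, Wr))
    R_heads = attn @ r_hat                       # (H, k_x, d_r)
    R = np.concatenate(list(R_heads), axis=-1)   # (k_x, H * d_r)
    # Hadamard binding plus the unbound fillers (residual TPR).
    return R * F + F, attn
```

Returning the attention weights alongside the TPR makes it straightforward to inspect how close the learned role distributions are to one-hot.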

Residual Feed-forward Layer
The feed-forward layer of a cell consists of a linear projection followed by a ReLU activation and a second linear projection. The feed-forward output is then added to the input vectors:
$$\mathrm{Feedforward}(x) = x + \mathrm{ReLU}(x W_1)\, W_2$$
where $W_1 \in \mathbb{R}^{d_m \times d_f}$ and $W_2 \in \mathbb{R}^{d_f \times d_m}$ denote the two linear projections, and $x$ is the function argument.
² We set $d_r \cdot H = d_m$ so that the multi-head role vectors $R$ have the same dimension as $F$.
³ The Hadamard (or elementwise) product is the diagonal of the full tensor product.
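The remark that the Hadamard product is the diagonal of the full tensor product can be checked numerically. The snippet below is only an illustration with random vectors of an arbitrarily chosen dimension (8); it also shows the difference in size between the two representations.

```python
import numpy as np

rng = np.random.default_rng(0)
r = rng.normal(size=8)  # a role vector (dimension chosen arbitrarily)
f = rng.normal(size=8)  # a filler vector of the same dimension

full = np.outer(r, f)   # full tensor product: 8 x 8 = 64 entries
hadamard = r * f        # elementwise product: 8 entries

# The elementwise product coincides with the diagonal of the outer product.
diag_matches = np.allclose(np.diag(full), hadamard)
```

For vectors of dimension $d$, the full binding stores $d^2$ numbers while the Hadamard binding stores $d$, which is what keeps the bound representation at the model dimension.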

TP-TRANSFORMER Encoder & Decoder
Given the components of our basic TP-TRANSFORMER cell in the previous section, we now describe how we construct the TP-TRANSFORMER encoder and decoder.
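First, the self-attention and the encoder-decoder cross-attention for every token can be computed as:
$$\mathrm{Self}(X) = \mathrm{TPR}(\mathrm{MHAttn}(X, X)), \qquad \mathrm{Cross}(Y, H) = \mathrm{TPR}(\mathrm{MHAttn}(Y, H))$$
where $\mathrm{TPR}(\cdot)$ is the role-filler binding of Sec. 2.2.2 and $H$ is the output of the encoder's final layer. $Y$ represents the previous layer's output vectors of either the partially (so-far) decoded sequence at test time or the masked reference summary at training time. The encoder's and decoder's operations at every layer can be summarized as:
$$\mathrm{Encode}(X) = \mathrm{Feedforward}(\mathrm{Self}(X)), \qquad \mathrm{Decode}(H, Y) = \mathrm{Feedforward}(\mathrm{Cross}(\mathrm{Self}(Y), H))$$
After $L$ layers of encoding and decoding, the final distribution of the $i$-th output token is given by:
$$z_i = \mathrm{softmax}(E^\top y_{L,i})$$
where $Y_L = \mathrm{Decode}(H, Y_{L-1})$ are the decoder's output states at the last layer, $y_{L,i}$ is the state of the $i$-th token in $Y_L$, and $E$ is the tied input/output word embeddings.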
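The last step, turning the final decoder states into a distribution over the vocabulary with tied embeddings, can be sketched as follows. The function name `output_distribution` is hypothetical, and the embedding matrix is assumed to be stored with one row per vocabulary item, which is a layout choice made for this example.

```python
import numpy as np


def output_distribution(y_last, E):
    """Softmax over the vocabulary from the final decoder states.

    y_last: (k, d_m) decoder states at the last layer.
    E: (vocab, d_m) word embeddings, reused as the output projection.
    """
    logits = y_last @ E.T                                 # (k, vocab)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=-1, keepdims=True)
```

Because the same matrix embeds input tokens and scores output tokens, a decoder state that lies close to a word's embedding assigns that word a high probability.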

Abstractive Summarization Datasets
We train our models on four English abstractive summarization datasets varying the level of abstractiveness (explained below) and the length of summaries, as well as input domain.
XSum (Narayan et al., 2018) consists of 227k BBC articles from 2010 to 2017 concerning various subjects along with professionally written single-sentence summaries. Its summaries cover a wide variety of syntactic structures (relative clause, etc.) and relations (causal, temporal, etc.).
Wikihow (Koupaee and Wang, 2018) is a dataset consisting of instructions from the WikiHow.com website. Each of 200k examples has multiple instruction-step paragraphs, each paired with a summarizing sentence. The task is to generate the concatenated summaries of all paragraphs.

Dataset: Summary
XSum: Luxury fashion designer Burberry has returned to profit after opening new stores and spending more on online marketing.
Wikihow: Build a trustworthy bond with your piggy. Research different training methods. Choose the training method that works best for you and your guinea pig. Gather the materials that you will need for training.
Arxiv (Abbreviated): We study the phase behavior of a nematic liquid crystal confined between a flat substrate with strong anchoring and a patterned substrate whose structure and local anchoring strength we vary. [. . . ] In addition the effective energy method allows one to determine the energy barriers between two states in a bistable nematic device.
CNN/DM: Mentally ill inmates in Miami are housed on the "forgotten floor". Judge Steven Leifman says most are there as a result of "avoidable felonies". While CNN tours facility, patient shouts: "I am the son of the president".

Experimental Setup
The Transformer and the two TP-TRANSFORMERs all have 6 layers, 8 heads per layer, dimension per head $d_k = 64$, model dimension $d_m = 512$, and feed-forward dimension $d_f = 2048$ for the encoder and decoder. Our TP-TRANSFORMER with discrete roles has $N_r = 50$ role embeddings of dimension $d_r = 64$ at every layer. For each dataset above, we train all three models from scratch using the Adafactor optimizer (Shazeer and Stern, 2018) with square-root learning rate decay and a dropout rate of 0.1. We evaluate the models using automatic metrics including ROUGE F1 score and METEOR.
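The hyperparameters above satisfy two dimension constraints: the attention heads tile the model dimension, and the concatenated role vectors match the filler dimension. The sketch below records the reported values in a hypothetical `TPTConfig` dataclass (the class and field names are illustrative, not taken from the released code) and checks both constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TPTConfig:
    # Values as reported in the experimental setup; names are illustrative.
    n_layers: int = 6
    n_heads: int = 8
    d_k: int = 64
    d_m: int = 512
    d_f: int = 2048
    n_roles: int = 50
    d_r: int = 64

    def check(self) -> None:
        # Attention heads must tile the model dimension.
        assert self.n_heads * self.d_k == self.d_m
        # Concatenated role vectors must match the filler dimension.
        assert self.n_heads * self.d_r == self.d_m

    @property
    def role_dictionary_size(self) -> int:
        """Number of entries in one layer's role dictionary (N_r x d_r)."""
        return self.n_roles * self.d_r


cfg = TPTConfig()
cfg.check()
```

With these values, one layer's role dictionary holds 50 x 64 = 3200 numbers, a small addition relative to the attention and feed-forward weights.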

Results
We report automatic metric scores from our evaluated models in Table 2. We refer to the TP-TRANSFORMER with freely-generated continuous role vectors (no role dictionary) (Schlag et al., 2019) as TPT-c, and our own TP-TRANSFORMER with a discrete set of role embeddings as TPT-d.
Table 3: Human Evaluation results on 120 random samples from the XSum (Narayan et al., 2018) and Wikihow (Koupaee and Wang, 2018) test sets. The best numbers with an advantage of at least 5 points are underlined.
On the XSum, Arxiv, and Wikihow datasets, our TP-TRANSFORMER (TPT-d) outperforms the original Transformer on all metrics. On the CNN/Daily Mail dataset, both models obtain similar performance across all metrics. On every dataset, the TPT-c model, which excels on the mathematics dataset, is the worst among the three models being compared. This suggests that continuous role vectors are not suited to the summarization tasks.
As we explain in Sec. 3.1, CNN/Daily Mail is the least abstractive one among the four datasets. In contrast, summaries from the XSum and Wikihow datasets contain very few n-grams (n>2) that can be copied from the source documents and thus push the model's ability to compose a coherent summary restating the salient aspects from the source. Furthermore, as illustrated in Table 1, the XSum summary contains a long sentence that combines multiple pieces of information scattered through the long source document. These facts are usually connected by syntactic, temporal, or causal relations and thus the model must be able to connect and reason across these salient facts and then convert them into a coherent sentence that faithfully reflects the original facts and their relations. We argue that the compositional TPR can better enable these abilities required for XSum, where we indeed find that our TP-TRANSFORMER achieves the largest advantage over the Transformer among its improvements on all datasets.

Human Evaluation
We conduct human evaluation to compare the summaries generated by the Transformer and our TP-TRANSFORMER. We randomly sample 120 examples from the test sets of XSum and Wikihow datasets with the beam-searched model summaries.
We refer to the appendix for the complete setup. As shown in Table 3, on the XSum dataset, summaries generated by the TP-TRANSFORMER are significantly better in grammar. This corroborates our claim that having the TPR can improve the model's ability to follow the correct syntax in composing the summary. On the Wikihow dataset, the Transformer receives more votes regarding saliency. However, our TP-TRANSFORMER maintains an advantage in grammar and achieves significantly better overall preferences.
Unfaithful XSum Examples. It is well-known that the XSum dataset contains a portion of unfaithful reference summaries that mention facts not included in the source article (Durmus et al., 2020; Maynez et al., 2020). Therefore, we are interested in finding out whether our TP-TRANSFORMER is better than the baseline only at expressing the faithful content, or whether it can also generate some external, "unfaithful" facts that the baseline can't cover. To answer this question, we randomly sample 100 examples from the XSum dev set and manually examine the source document, reference summary, and the two generated summaries. Among these 100 examples, we identify 71 examples whose reference summary includes "unfaithful" facts that are not mentioned in the source. In 21 out of 71 examples, the Transformer baseline manages to generate some "unfaithful" facts that match those in the reference, while our TP-TRANSFORMER achieves this in 17 examples. Such "unfaithful" facts that were recovered by the models include the full name of a person when only the last name is mentioned in the source, and the political party or the job title of a person, each of which can be attributed to at least one example seen by the models during training. Therefore, we believe that both models learn to draw external information from their memory of the seen examples, while our TP-TRANSFORMER doesn't do better than the baseline Transformer at referring to external facts to obtain higher ROUGE scores.
Probing is a method to test whether some particular information is present in the model's encodings. To achieve this, an auxiliary classifier is trained to predict specified linguistic features from the model's internal representations. We probe different components (roles, filler, TPRs) in our TP-TRANSFORMERs as well as the attention+residual outputs (equivalent to the filler) of the Transformer to assess the naturally emergent structures encoded in the role vectors and the effectiveness of the TPR in the decoding process. By conducting the probing experiments, we aim to (1) provide some insights and evidence of the different information encoded by the role and filler vectors; and (2) explain the ROUGE advantage of our TP-TRANSFORMER by showing that its output representation can better encode the linguistic structural information concerning multiple probing tasks.

Decoder Probing Tasks
When studying an encoder, previous works probe its i-th intermediate representation at a certain layer for information about the i-th input token. For a decoder, however, we probe its i-th representation for clues about the i-th token it generates given the i − 1 previously generated tokens as the input. Intuitively, we are probing for the decoder's internal decision about the syntactic roles and semantic content of this token before it was ultimately selected. Based on encoder probing tasks used by Tenney et al. (2019b), we select and adapt four tasks to probe our decoders.
Part-of-speech tagging (POS) is the syntactic task of assigning tags such as noun (singular/mass noun: NN, proper noun: NNP, etc.), verb (past tense: VBD, past participle: VBN, etc.), adjective (comparative: JJR, etc.), etc. to each token i. We let $s_1 = [i, i + 1)$ be a single token, and seek to predict its POS tag.
Dependency labeling (DEP) seeks to predict the functional relationships of one token relative to another: e.g., is it a modifier-head relationship, a subject-verb relationship, etc. We take $s_1 = [i, i + 1)$ to be a single token and $s_2 = [j, j + 1)$ to be its syntactic head, and seek to predict the dependency relation between tokens i and j.
Semantic role labeling (SRL) is the task of imposing predicate-argument structure onto a sentence. We let $s_1 = [i_1, j_1)$ represent a known predicate (e.g., "push") and $s_2 = [i_2, j_2)$ represent a known argument ("Peter") of that predicate, and seek to predict the role that the argument $s_2$ fills, e.g., ARG0 (agent, the pusher) vs. ARG1 (patient, the pushee).
Named entity labeling (NEL) is the task of predicting the category of an entity. The categories include PERSON, LOCATION, ORGANIZATION, etc. We let $s_1 = [i, j)$ represent a known entity span and seek to predict its type.

Experimental Setup
As there is no existing dataset for probing decoders, we create our own training and evaluation data by running off-the-shelf models on the summarization datasets. Specifically, to probe a decoder trained on the XSum dataset on the POS task, we run a POS tagger on the reference summaries from the XSum training set and the model-generated summaries for the XSum dev set to create the ground-truth labels for the training set and model-specific dev set. We restore the model trained on a summarization dataset and freeze its parameters. Following Tenney et al. (2019b), we train a span convolution layer followed by a 2-layer MLP on top of the target representation that projects it onto the output label space. Table 4 presents the results of probing the decoder of a TP-TRANSFORMER trained on the XSum (Narayan et al., 2018) dataset. Note that the Transformer doesn't have role vectors. It directly outputs the vector after the multi-head attention and the residual layer. Therefore, its fillers and final representations are equivalent.

Results
The decoder role vectors can encode grammatical information while the filler vectors represent the semantics. We first focus on the results of the POS tagging probing task. Overall, we see a trend of increasing scores as the representations get closer to the final step of computing the distribution over the vocabulary. This implies that, as the computation progresses through the layers, the generated representations are gradually deciding the POS tag of the next word to generate. Next, we observe that the role vectors (the 1st number in the TPT-d column) of the TP-TRANSFORMER encode a considerable amount of information about the POS tag of the next word generated. Additionally, because the job of deducing the POS tag of the next word is partially shared by the role vectors, the filler vectors' performance degrades compared to the Transformer. This pattern demonstrates that the TP-TRANSFORMER's decoder is representing the next word to be generated as a composite of structural information encoded in the role vectors and semantic contents encoded in the filler vectors. Comparing the fillers (the 2nd number in the TPT-d column) with the TPR (the 3rd number in the TPT-d column) of the TP-TRANSFORMER, we see that the TPRs, which bind the roles and fillers, outperform the roles and fillers alone at every layer. This indicates that the TPR effectively aggregates the linguistic knowledge encoded in the roles and fillers into a shared space, where the POS tag of the next word can be decoded more easily than in the role space or filler space alone. Last, the final representations of the TP-TRANSFORMER achieve higher F1 scores than their counterparts in the Transformer in the last three layers. This demonstrates the benefits of having the TPR in interpreting the POS tag of the word to be generated.
When we consider the Dependency labeling (DEP) and Semantic role labeling (SRL) tasks, we observe that our TP-TRANSFORMER's final representations consistently beat the Transformer across all layers, with only one exception in the DEP task at layer 2. We also observe that the TP-TRANSFORMER's advantage becomes larger in the last three layers, except for the final layer in the SRL task. However, unlike in the POS task, the TPRs only achieve F1 scores similar to those of the fillers.
Finally, in the Named entity labeling (NEL) task, which is considered to require more semantic information than syntax, the role vectors' performance is poorer than their performance in the three syntactic tasks. For example, the TP-TRANSFORMER's final representations at layer 6 obtain similar F1 scores in the POS and NEL tasks (74.5 vs. 73.8), but its role vectors only achieve a 42.2 F1 score in the NEL task compared to 56.0 in the POS task. However, even though the role vectors encode little information about the named entity type of the next token to be generated, the TPR still strongly outperforms the Transformer's filler-only representation at every layer. We argue that although the syntactic information encoded in the role vectors is not enough to predict the correct named entity, it is still a beneficial complement to the knowledge encoded in the distributed filler vectors in certain situations. For example, whether the subject "Chanel" refers to a PERSON or an ORGANIZATION could depend on its syntactic role and its relation to other words in the sentence (e.g., whether it is the subject or object of "wears").
Compositional representations improve the interpretability of the representations. Overall, by probing the different intermediate representations of the TP-TRANSFORMER and the Transformer, we show that having the compositional TPR results in more interpretable final representations at every layer regarding the syntactic features of the next word to be generated. Considering the automatic evaluations of the generated summaries in Sec. 3.3, we argue that this compositionality in the learned representations and its syntactic interpretability enable the decoder to take better control of the syntactic structure of the generation when assembling multiple distant facts, and thus lead to summaries of better quality.

Discrete Role Vectors
During the training of our TP-TRANSFORMER models on the summarization datasets, we observe that most learned role attention distributions are approximately one-hot, as more than 90% of the role attention distributions (as computed in Eqn. 5) have a maximum score larger than 0.98. Because each role vector is the concatenation of $H$ vectors, each selected from $N_r$ role embeddings, completely one-hot role attentions will yield $(N_r)^H$ possible role vectors. Therefore, the learned, approximately one-hot role vectors span $(N_r)^H$ discrete subspaces, each of which only covers the close proximity of a concatenation of $H$ role embeddings. This finding indicates that as we represent the role vectors as multi-head attention over a learnable dictionary of role embeddings, the structural inductive bias: (1) pushes the role vector space to be even more discrete, and (2) induces the syntactic structures encoded in these discrete role vectors. We also believe there is a connection between the above two effects, as the structural, syntactic information favors a lower-dimensional or even discrete space while the distributed, semantic information favors a higher-dimensional space.
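Both quantities in this paragraph are easy to compute. The sketch below evaluates $(N_r)^H$ for the reported setting ($N_r = 50$, $H = 8$) and shows one way the share of near-one-hot role attention distributions could be measured; the helper names and the toy attention matrix are hypothetical, and only the 0.98 threshold comes from the text.

```python
import numpy as np


def n_discrete_role_vectors(n_roles: int, n_heads: int) -> int:
    """Number of distinct role vectors under fully one-hot role attention."""
    return n_roles ** n_heads


def near_one_hot_fraction(attn: np.ndarray, threshold: float = 0.98) -> float:
    """Share of distributions whose largest attention weight exceeds threshold.

    attn: (..., N_r) role attention distributions; each row sums to 1.
    """
    return float((attn.max(axis=-1) > threshold).mean())


count = n_discrete_role_vectors(50, 8)  # 50**8 = 39,062,500,000,000

# Toy attention matrix: one near-one-hot row and one uniform row.
toy_attn = np.array([[0.99, 0.01], [0.50, 0.50]])
fraction = near_one_hot_fraction(toy_attn)  # 0.5
```

Even with a small dictionary, the multi-head concatenation therefore gives a very large discrete inventory of role vectors, roughly $3.9 \times 10^{13}$ in this setting.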

Related Work
Explicit TPR Structures in Neural Networks
While earlier TPR work based on Smolensky (1990) focused on computability rather than learnability questions, recently TPRs have been incorporated into several recurrent deep learning models in order to solve various NLP tasks including Part-of-Speech tagging, constituency parsing, image captioning (Huang et al., 2018, 2019), question answering (Palangi et al., 2018; Schlag and Schmidhuber, 2018), and natural-to-formal language generation (program synthesis) (Chen et al., 2020). Most recently, TPRs have been introduced into Transformer architectures, starting with Schlag et al. (2019), which introduced the TP-TRANSFORMER to improve the performance and interpretability of mathematical problem solving models. This model generated continuous role vectors by directly projecting from layer inputs, whereas our model indexes from a dictionary of role embeddings to form the role vectors, which are shown to reside in a highly discrete space.

Structured Representations for Abstractive Summarization
Compared to the extractive methods, abstractive summarization models usually fail to show extractive properties, and have a tendency to copy text from the source (See et al., 2017; Paulus et al., 2018; Pasunuru and Bansal, 2018; Celikyilmaz et al., 2018). More recent approaches that use standard transformers deal with this issue by introducing hierarchical structures to encode local and global information separately, focusing on only the semantic content (Liu and Lapata, 2018, 2019). To preserve salient source relations and generate abstractive summaries of the source document, previous work infused models with semantic parsers: while Song et al. (2018) introduce a new structure-infused copy mechanism that combines the source syntactic structure with the copy mechanism, Liao et al. (2018) use abstract meaning representations (AMR). While these approaches require that the document sentence semantic parsers are provided beforehand, our models can implicitly learn to approximate the syntactic structure and semantic content in their representations.

Conclusion
In this work, we enrich the Transformer model with the structured Tensor Product Representation for abstractive summarization tasks. We represent every token as a pair of role and filler vectors. We show that our TP-TRANSFORMER with discrete roles outperforms Transformer and TP-TRANSFORMER with continuous roles on several abstractive summarization datasets, in both metrics scores and human evaluation. We further demonstrate the syntactic structures encoded in the role vectors and show the improved syntactic interpretability in our model's hidden states.

Ethics Statement
In this work we propose a new encoder-decoder modeling architecture and build several models to benchmark our new architecture with baseline architectures on several open source summarization datasets.
Intended use. Our architecture is designed to build models of abstractive summarization. Potentially our architecture could be used to train models for summarizing any type of company internal datasets (e.g., internal documents, reports, meetings, legal forms, etc.) to further improve the productivity and efficiency of the users in their daily activities without needing to read long documents.
Failure mode. Even though our models yield factually consistent summaries, as judged by human evaluation, they can still generate factually inconsistent summaries or sometimes hallucinate information that the source document does not include. This might be due to bias or noise in the training data. Model builders wanting to use our architecture to build models on their company internal datasets should do so with consideration of intellectual property and privacy rights.
Misuse Potential. We note that models built with our architecture should be used with careful consideration. The summaries produced by our models are not controlled and come from a generative approach; therefore, they could contain unreliable text. Researchers working on abstractive summarization should focus on generating factually correct, ethical, and reliable text. If our models are trained on news datasets, careful consideration should be given to the factuality of the generated text, and measures should be taken to prevent model hallucinations.

overall preference. We then take the majority vote for each example from its three human annotators.
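The majority vote over three annotators can be sketched as below. The label names and the example identifiers are hypothetical, and returning `None` when all three annotators disagree is an assumption of this sketch; the text does not say how such cases are resolved.

```python
from collections import Counter

def majority_vote(labels):
    """Return the label chosen by most annotators, or None if there is no strict winner."""
    if not labels:
        return None
    counts = Counter(labels).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # top two labels are tied: no strict majority
    return counts[0][0]

# Hypothetical preference labels from three annotators per example.
annotations = {
    "example_1": ["ours", "ours", "baseline"],
    "example_2": ["baseline", "tie", "baseline"],
    "example_3": ["ours", "baseline", "tie"],
}
votes = {name: majority_vote(labels) for name, labels in annotations.items()}
```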

C Probing Experimental Setup
As there is no existing dataset for probing decoders, we create our own training and evaluation data by running off-the-shelf models on the summarization datasets. Specifically, to probe a decoder trained on the XSum dataset on the POS task, we run a POS tagger on the reference summaries from the XSum training set and on the model-generated summaries for the XSum dev set to create the ground-truth labels for the training set and the model-specific dev set. We use Stanford CoreNLP (Manning et al., 2014) to get the labels for the POS, dependency, and named entity probing tasks. We use a BERT-base model (Devlin et al., 2019) from AllenNLP (Gardner et al., 2018) to get the ground-truth labels for SRL. We restore the model trained on a summarization dataset and freeze its parameters during probing. We simply add a linear layer on top of the target representation to project it onto the output label space.
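The probing recipe above (frozen representations, a single trainable linear layer, labels from an off-the-shelf tagger) can be illustrated with a small self-contained sketch. Everything concrete here is an assumption for illustration: the `LinearProbe` class, the two-dimensional toy "decoder states", the toy labels, and the plain SGD training loop stand in for the real model states, tagger outputs, and optimizer.

```python
import math
import random

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class LinearProbe:
    """A single linear layer mapping a frozen representation to label scores."""

    def __init__(self, input_dim, num_labels, seed=0):
        rng = random.Random(seed)
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim)]
                        for _ in range(num_labels)]
        self.bias = [0.0] * num_labels

    def probabilities(self, representation):
        scores = [sum(w * x for w, x in zip(row, representation)) + b
                  for row, b in zip(self.weights, self.bias)]
        return softmax(scores)

    def predict(self, representation):
        probs = self.probabilities(representation)
        return max(range(len(probs)), key=probs.__getitem__)

    def train_step(self, representation, label, learning_rate=0.5):
        """One SGD step on cross-entropy; only the probe's parameters change."""
        probs = self.probabilities(representation)
        for k, row in enumerate(self.weights):
            grad = probs[k] - (1.0 if k == label else 0.0)
            for d, x in enumerate(representation):
                row[d] -= learning_rate * grad * x
            self.bias[k] -= learning_rate * grad

# Toy stand-ins: "frozen" decoder states and the labels a tagger would assign.
frozen_states = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
tagger_labels = [0, 0, 1, 1]

probe = LinearProbe(input_dim=2, num_labels=2)
for _ in range(200):
    for state, label in zip(frozen_states, tagger_labels):
        probe.train_step(state, label)

predictions = [probe.predict(state) for state in frozen_states]
```

Because the representations are never updated, any accuracy the probe reaches reflects information already linearly accessible in the frozen states rather than something learned during probing.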

D Related Works
Implicit TPR Encodings in Neural Networks. McCoy et al. (2019) showed that, in GRU-based (Cho et al., 2014) encoder-decoder networks performing fully compositional string manipulations, trained on extensive data that fully exemplifies the range of possible compositions, the medial encoding between encoder and decoder could be extremely well approximated by TPRs. Soulos et al. (2019) presented the ROLE model, which learns its own role scheme to optimize the fit of a TPR approximation to a given set of internal representations in a pre-trained target neural network, removing the need for human-generated hypotheses about the role schemes the network might be implementing. While this work successfully interprets the Tensor Product Representation in fully compositional tasks, abstractive summarization, like most other NLP tasks, is only partially compositional, and the symbolic rules in language are much more complex. Although these two works showed that Tensor Product Representations can naturally emerge in unstructured representations, we argue that standard models only learn TPRs without any special bias to do so when the compositional structure of the task is simple and evident and when the training set makes that structure clear by providing a good sample of the compositional possibilities. That is possible for the simple string tasks addressed in the two previous works, but not for abstractive summarization or other real-world NLP tasks, where we show that having an explicit TPR helps in modeling the structural information.
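For readers unfamiliar with the classical construction, the following sketch encodes a short string as a TPR, i.e. a sum of filler-role outer products, and recovers each symbol by unbinding. The left-to-right positional role scheme, the one-hot role and filler vectors, and the helper names are assumptions chosen to keep the example small; they are not taken from the cited works.

```python
def outer(filler, role):
    """Tensor (outer) product of a filler vector and a role vector."""
    return [[f * r for r in role] for f in filler]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def encode(symbols, filler_vectors, role_vectors):
    """TPR of a sequence: the sum over positions of filler (outer product) role."""
    dim_f = len(next(iter(filler_vectors.values())))
    dim_r = len(role_vectors[0])
    tpr = [[0.0] * dim_r for _ in range(dim_f)]
    for position, symbol in enumerate(symbols):
        tpr = add(tpr, outer(filler_vectors[symbol], role_vectors[position]))
    return tpr

def unbind(tpr, role):
    """Recover the filler bound to a role; exact when the roles are orthonormal."""
    return [sum(t * r for t, r in zip(row, role)) for row in tpr]

# Hypothetical one-hot fillers and left-to-right positional roles.
filler_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
role_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

tpr = encode("abb", filler_vectors, role_vectors)
recovered = [unbind(tpr, role) for role in role_vectors]
```

With orthonormal roles, unbinding returns each position's filler exactly, which is the sense in which such an encoding is fully compositional.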
Sequence Models Encode Implicit Structure. Several recent works have used various interpretation methods to show that the pre-trained Transformer-based BERT (Devlin et al., 2019) embeddings implicitly encode structural linguistic relations. The first and most popular method (Tenney et al., 2019a) is to train an auxiliary classifier to probe the model's hidden representations for specific linguistic information. The second method (Lin et al., 2019) abstracts the Transformer model into a graph based on the attention weights and explores syntactic structures based on the graph's structure. The third method (Hewitt and Manning, 2019) treats the hidden representations of BERT as points in a metric space and directly connects the distance between representations to the distance between elements in a symbolic structure (e.g., a dependency-parse tree) to extract the implicit structures without extra training. The interpretation method deployed here falls under the probing family, but future work will also pursue other interpretation methods.

E Examples of Generated Summary
We provide examples generated by the Transformer baseline and our TP-TRANSFORMER in Table 5 and Table 6.

Source
Nottinghamshire Police said it would expand its categories to include misogynistic incidents.It means abuse or harassment which might not be a crime can be reported to and investigated by the police, and support for the victim put in place.Nottingham Women's Centre said it hopes it will help give more victims the courage to report incidents.Chief Constable Sue Fish claimed it will make the county a safer place for women. </br>"What women face, often on a daily basis, is absolutely unacceptable and can be extremely distressing," she said. </br>"Nottinghamshire Police is committed to taking misogynistic hate crime seriously and encourages anyone who is affected by it to contact us without hesitation. </br>"Work on the idea first started with the Nottinghamshire Safer for Women Conference last year, co-hosted by the police with the Nottingham Women's Centre.BBC TV reporter Sarah Teale was harassed in the street while reporting on the conference.The force defines misogyny hate crime as: "Incidents against women that are motivated by an attitude of a man towards a woman and includes behaviour targeted towards a woman by men simply because they are a woman. </br>"The classification now means people can report incidents which might not be considered to be a crime and the police will investigate.Nottingham Women's Centre has been helping train call centre, force control staff and officers on the beat to recognise misogynistic hate crime and ways to tackle it.These officers will also examine if and how a victim can be supported or if anything can be done to help prevent them being targeted again.Domestic abuse will not be recorded as a misogyny hate crime because it has its own procedure, the force said.Melanie Jeffs, centre manager at Nottingham Women's Centre, said: "We're pleased to see Nottinghamshire Police recognise the breadth of violence and intimidation that women experience on a daily basis in our communities. 
</br>"She added: "Recording this as a hate crime will give us a detailed picture of how often, when and where it is happening. </br>It has been very difficult to build that picture before but we will now get detailed data to analyse. </br>"Showing that the police take it seriously will also give people the confidence to come forward and report offences. </br>"A crime that the victim or any other person perceives to be motivated by hostility or prejudice towards any aspect of a person's identity.Police forces in England, Wales and Northern Ireland annually monitor five strands of hate crime:Forces can include their own definition of a hate crime with several recently adding sub cultures.

Reference
Harassment of women is to be recorded as a hate crime in a bid to tackle sexist abuse.

Transformer
Women who commit misogyny and harassed a woman are to be asked to take part in an anti-Semitic conference.

TP-TRANSFORMER
A police force has launched a national drive to combat misogyny and hate crimes in Nottinghamshire.

Table 5: An example from the XSum dev set and the summaries generated by the Transformer baseline and TP-TRANSFORMER.

Source
Sixty patrol boats will protect the UK's two new aircraft carriers which are due to arrive at Portsmouth Naval Base in 2017.The first carrier, HMS Queen Elizabeth, is expected to be operational in 2020. </br>"We are going to see a bigger Royal Navy and the flagship... will be here in Portsmouth," Michael Fallon said.The 60 Pacific 24 rigid-hulled inflatable boats will be built by BAE systems to "guard the carriers in the harbour and our new frigates and destroyers", Mr Fallon said.He said they will also enhance security by providing a rapid response in rescue, anti-piracy and counter-narcotics missions in the area.Mr Fallon said: "Through the defence review, defence spending is going to go up every April for the rest of this parliament.He said as part of the larger investment, the government will also be able to provide the new aircraft carriers with sufficient fighter jets. </br>"We have said we will maintain a minimum fleet of 19 destroyers and frigates, but as the older frigates are retired we also hope to add a lighter frigate between the offshore patrol vessel and Type 26 and to build more of those as well. </br>"Mr Fallon's visit to Portsmouth Naval Base comes as work has begun to rebuild the jetty for the arrival of HMS Queen Elizabeth in 2017.Floating cranes are also dredging Portsmouth harbour to prepare deeper channels for the aircraft carriers to sail from the base, which are the largest ships ever built for the Royal Navy. 
</br>"This is a huge financial investment in making sure the channel is wide enough, in enlarging the jetty here so they can take the carriers and in making sure the carriers are properly guarded," Mr Fallon said.Taller than Nelson's Column and longer than Portsmouth's Spinnaker Tower laid on its side, the new carriers will displace 65,000 tonnes of water.To make room for the carriers three million cubic metres of clay, sand and gravel will be removed from a two-mile stretch of Portsmouth Harbour covering an area the size of 200 football pitches.

Reference
Increased spending will result in a "bigger" Royal Navy, the defence secretary has said, as he announced a new £13.5m shipbuilding contract.

Transformer
The Royal Navy's new aircraft carriers will be patrolling the Portsmouth harbour this year, the defence secretary has said.

TP-TRANSFORMER
Plans for a new Royal Navy aircraft carriers to be built in Portsmouth have been unveiled.

Table 6: An example from the XSum dev set and the summaries generated by the Transformer baseline and TP-TRANSFORMER.