Semantically Driven Sentence Fusion: Modeling and Evaluation

Sentence fusion is the task of joining related sentences into coherent text. Current training and evaluation schemes for this task are based on single reference ground-truths and do not account for valid fusion variants. We show that this hinders models from robustly capturing the semantic relationship between input sentences. To alleviate this, we present an approach in which ground-truth solutions are automatically expanded into multiple references via curated equivalence classes of connective phrases. We apply this method to a large-scale dataset and use the augmented dataset for both model training and evaluation. To improve the learning of semantic representation using multiple references, we enrich the model with auxiliary discourse classification tasks under a multi-tasking framework. Our experiments highlight the improvements of our approach over state-of-the-art models.


Introduction
Generative NLP tasks, such as machine translation and summarization, often rely on human-generated ground truth. Datasets for such tasks typically contain only a single reference per example. This may result from the costly effort of human annotation, or from collection methodologies that are restricted to single-reference resources (e.g., utilizing existing corpora; Koehn, 2005; Nallapati et al., 2016). However, typically there are other possible generation results, such as ground-truth paraphrases, that are also valid. Failing to consider multiple references hurts the development of generative models, since such models are considered correct, at both training and evaluation time, only if they follow one specific and often arbitrary generation path per example.
In this work we address Sentence Fusion, a challenging task where a model should combine related sentences, which may overlap in content, into a compact coherent text. The output should preserve the information in the input sentences as well as their semantic relationship. It is a crucial component in many NLP applications, including text summarization, question answering and retrieval-based dialogues (Jing and McKeown, 2000; Barzilay and McKeown, 2005; Marsi and Krahmer, 2005; Lebanoff et al., 2019; Szpektor et al., 2020).
Our analysis of state-of-the-art fusion models (Geva et al., 2019; Rothe et al., 2019) indicates that they still struggle to correctly detect the semantic relationship between the input sentences, which is reflected in inappropriate discourse marker selection in the generated fusions ( §4). At the same time, DISCOFUSE (Geva et al., 2019), the dataset they use, is limited to a single reference per example, ignoring discourse marker synonyms such as 'but' and 'however'. Noticing that humans tend to judge these synonyms as equally suitable ( §3), we hypothesize that relying on single references may limit the performance of those models.
To overcome this limitation, we explore an approach in which ground-truth solutions are automatically expanded into multiple references. Concretely, connective terms in gold fusions are replaced with equivalent terms (e.g., {'however', 'but'}), where the semantically equivalent sets are derived from the Penn Discourse TreeBank 2.0 (Prasad et al., 2008). Human evaluation of a sample of these generated references indicates the high quality of this process ( §3). We apply our method to automatically augment the DISCOFUSE dataset with multiple references, using the new dataset both for evaluation and training. We will make this dataset publicly available.
We then adapt a seq2seq fusion model in two ways so that it can exploit the multiple references in the new dataset ( §4). First, each training example with its multiple references is converted into multiple examples, each consisting of the input sentence pair with a different single reference fusion. Hence, the model is exposed to a more diverse and balanced set of fusion examples. Second, we direct the model to learn a common semantic representation for equivalent surface forms offered by the multiple references. To that end, we enhance the model with two auxiliary tasks: Predicting the type of the discourse relation and predicting the connective pertaining to the fused output, as derived from the reference augmentation process. We evaluate our model against state-of-the-art models in two experimental settings ( §5, 6): In-domain and cross-domain learning. The cross-domain setting is more challenging but may also be more realistic as labeled data is available only for the source domain but not for the target domain. To evaluate against multiple-reference examples, we measure the similarity of each generated fusion to each of the ground-truth fusions and report the highest score. This offers a more robust analysis, and reveals that the performance of fusion models is higher than previously estimated. In both settings, our model demonstrates substantial performance improvement over the baselines.
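The max-over-references evaluation described above can be sketched as follows. This is a minimal illustration: `token_f1` is a toy word-overlap similarity standing in for EXACT or SARI, not a metric used in this work.

```python
def token_f1(a, b):
    """A toy similarity metric: F1 over word-type overlap
    (a stand-in for EXACT or SARI, for illustration only)."""
    A, B = set(a.split()), set(b.split())
    common = len(A & B)
    if common == 0:
        return 0.0
    prec, rec = common / len(A), common / len(B)
    return 2 * prec * rec / (prec + rec)

def best_reference_score(metric, generated, references):
    """Multi-reference evaluation: score the generated fusion against
    each valid reference and report the highest score."""
    return max(metric(generated, ref) for ref in references)

refs = ["It rained , but we went out .", "It rained ; however , we went out ."]
best_reference_score(token_f1, "It rained , but we went out .", refs)  # -> 1.0
```

With single-reference scoring, the same generated fusion would be rewarded or penalized depending on which of the two equally valid references happened to be the gold one.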
Related Work

Fusion Tasks
Traditionally, supervised sentence fusion models had access to only small labeled datasets. Therefore, they relied on hand-crafted features (Barzilay and McKeown, 2005; Filippova and Strube, 2008; Elsner and Santhanam, 2011; Filippova, 2010; Thadani and McKeown, 2013). Recently, DISCOFUSE, a large-scale dataset for sentence fusion, was introduced by Geva et al. (2019). This dataset was generated by automatically applying hand-crafted rules for 12 different discourse phenomena to break fused text examples from two domains, Wikipedia and Sports news, into two unfused sentences, while the content of the original text is preserved. We follow prior work (Malmi et al., 2019; Rothe et al., 2019) and use the balanced version of DISCOFUSE, containing ∼16.5 million examples, where the most frequent discourse phenomena were down-sampled. With DISCOFUSE, it became possible to train data-hungry neural fusion models. Geva et al. (2019) showed that a Transformer model (Vaswani et al., 2017) outperforms an LSTM-based (Hochreiter and Schmidhuber, 1997) seq2seq model on this dataset. Malmi et al. (2019) further improved accuracy by introducing LaserTagger, modeling sentence fusion as a sequence tagging problem. Rothe et al. (2019) set the state-of-the-art with a BERT-based (Devlin et al., 2019) model.
Related to sentence fusion is the task of predicting the discourse marker that should connect two input sentences (Elhadad and McKeown, 1990; Grote and Stede, 1998; Malmi et al., 2018). It is typically utilized as an intermediate step to improve downstream tasks, mainly for discourse relation prediction (Pitler et al., 2008; Zhou et al., 2010; Braud and Denis, 2016; Qin et al., 2017). Connective prediction was included in multi-task frameworks for discourse relation prediction (Liu et al., 2016) and unsupervised sentence embedding (Jernite et al., 2017; Nie et al., 2019). We follow this approach of guiding a main task with the semantic information encompassed in discourse markers, studying it in the context of sentence fusion.

Generation Evaluation
Two main approaches are used to evaluate generation models against a single ground-truth reference. The first estimates the correctness of a generated text using a 'softer' similarity metric between the text and the reference instead of exact matching. Earlier metrics, like BLEU and ROUGE (Papineni et al., 2002; Lin, 2004), considered n-gram agreement. Later metrics matched words in the two texts using their word embeddings (Lo, 2017; Clark et al., 2019). More recently, contextual similarity measures were devised for this purpose (Lo, 2019; Wieting et al., 2019; Zhao et al., 2019; Zhang et al., 2020; Sellam et al., 2020). In §7 we provide a qualitative analysis of the latter, presenting typical evaluation mistakes made by a recently proposed contextual-similarity-based metric (Zhang et al., 2020). This analysis reveals properties that characterize such methods and make them less suitable for our task.
The second approach extends the (single) reference into multiple ones by automatically generating paraphrases of the reference (a.k.a. pseudo-references) (Albrecht and Hwa, 2008; Yoshimura et al., 2019; Kauchak and Barzilay, 2006; Edunov et al., 2018; Gao et al., 2020). Our method ( §3.3) follows this paradigm. It applies curated paraphrase rules to generate highly accurate variations, putting an emphasis on precision. This contrasts with the recall-oriented similarity-based approaches, which may detect correct fusions beyond paraphrasing approaches, but may also promote erroneous solutions due to their soft matching nature.

Multiple References in Sentence Fusion
In this section we discuss the limitations of using single references for evaluation and training in sentence fusion. We then propose an automatic, precision-oriented method to create valid fusion variants. Human-based evaluation confirms the reliability of our method, which generates pseudo-references that are considered as suitable as the original references. Finally, we demonstrate the effectiveness of the new references for fusion evaluation. Our observations, which are used here for reference generation and evaluation, will also guide our fusion model design and training ( §4).

Single-reference Based Evaluation
Recent fusion works (Geva et al., 2019; Malmi et al., 2019; Rothe et al., 2019) rely on single references for training and evaluation. Two evaluation metrics are used: (1) EXACT, where the generated fusion should match the reference exactly, and (2) SARI (Xu et al., 2016), which measures the F1 score of kept and added n-grams, and the precision of deleted n-grams, compared to the gold fusion and the input sentences, weighting each equally.
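As a rough illustration of how SARI combines keep, add and delete operations, the following is a simplified, unigram-only sketch. The actual metric uses n-grams up to length 4, counts rather than word sets, and different handling of edge cases, so this is an approximation for intuition only.

```python
def unigram_sari(source, prediction, reference):
    """Simplified, unigram-only SARI sketch: the mean of add-F1,
    keep-F1 and delete-precision, computed over word sets."""
    s, p, r = set(source.split()), set(prediction.split()), set(reference.split())

    def ratio(num, den):
        return len(num) / len(den) if den else 0.0

    def f1(prec, rec):
        return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

    added_p, added_r = p - s, r - s          # words the output should introduce
    kept_p, kept_r = p & s, r & s            # words the output should keep
    deleted_p, deleted_r = s - p, s - r      # words the output should drop

    add_f1 = f1(ratio(added_p & added_r, added_p),
                ratio(added_p & added_r, added_r))
    keep_f1 = f1(ratio(kept_p & kept_r, kept_p),
                 ratio(kept_p & kept_r, kept_r))
    del_prec = ratio(deleted_p & deleted_r, deleted_p)
    return (add_f1 + keep_f1 + del_prec) / 3
```

Note that even in this sketch, substituting a correct connective with an equivalent one is penalized in the add and keep components exactly as a non-equivalent substitution would be.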
A significant limitation of the above metrics, when measured using a single fusion reference, is that they do not properly handle semantically equivalent variants. For EXACT this is obvious, since even a one-word difference would count as an error. In SARI, the penalty for equivalent words, e.g., 'but' and 'however', and non-equivalent ones, e.g., 'but' and 'moreover', is identical. To validate this, we conducted a qualitative single-reference evaluation of a fusion model (AuxBERT, §4.3) under the EXACT metric. We randomly selected 50 examples, assessed as mistakes, from the dev sets of both DISCOFUSE domains. Analyzing these mistakes, we identified six types of errors (Table 1).
The distribution of these error types is depicted in Table 2. We note that the most frequent error type refers to valid fusion variants that differ from the gold fusion: As much as 40% and 44% of the examples in the Wikipedia and the Sports datasets, respectively. While the sample size is too small for establishing accurate statistics, the high-level trend is clear, indicating that a significant portion of the generated fusions classified as mistakes by the EXACT metric are in fact correct.
A possible solution would be to keep relying on single references, but use 'softer' evaluation metrics (see §2.2). We experimented with the state-of-the-art BERTScore metric (Zhang et al., 2020) and noticed that it often fails to correctly account for the semantics of discourse markers (see §7), which is particularly important for sentence fusion. Furthermore, we notice that recent soft metrics depend on trainable models, mainly BERT (Devlin et al., 2019), which is also used in state-of-the-art fusion models (Malmi et al., 2019; Rothe et al., 2019). Thus, we expect these metrics to struggle in evaluation when fusion models struggle in prediction.

Multi-Reference Generation
Generation of valid variants that differ from the ground-truth reference is a challenge for various generation tasks. For open-ended tasks like text summarization, researchers often resort to manually annotating multiple valid reference summaries for a small sample of examples. Sentence fusion, however, is a more restricted task, enabling high-quality automatic paraphrasing of gold fusions into multiple valid references. We introduce a precision-oriented approach for this aim.
Instead of generating arbitrary semantically equivalent paraphrases, we focus on generating variants that differ only by discourse markers, which are key phrases to be added when fusing sentences. The Penn Discourse TreeBank 2.0 (PDTB; Prasad et al., 2008) contains examples of argument pairs with an explicit discourse marker and a human-annotated sense tag. The same marker may be associated with multiple sense tags (for instance, 'since' may indicate both temporal and causal relations), and for our precision-oriented approach we only considered unambiguous markers.
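The selection of unambiguous markers can be sketched as follows. The occurrence counts below are hypothetical and for illustration only; the dominance threshold mirrors the criterion described in this section.

```python
from collections import Counter

def unambiguous_markers(marker_sense_counts, relevant_senses, threshold=0.90):
    """Keep only markers whose dominant sense tag is one of the relevant
    senses in at least `threshold` of their annotated occurrences."""
    kept = {}
    for marker, counts in marker_sense_counts.items():
        total = sum(counts.values())
        sense, freq = counts.most_common(1)[0]
        if sense in relevant_senses and freq / total >= threshold:
            kept[marker] = sense
    return kept

# Hypothetical PDTB occurrence counts, for illustration only.
counts = {
    "however": Counter({"Comparison": 95, "Conjunction": 5}),
    "since": Counter({"Cause": 55, "Temporal": 45}),  # ambiguous -> dropped
}
unambiguous_markers(counts, {"Comparison", "Cause", "Conjunction"})
```

Under these toy counts, 'however' passes the filter while the ambiguous 'since' is discarded, matching the precision-oriented design.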
Concretely, we identified the three PDTB sense tags most relevant to the DISCOFUSE dataset and chose the markers whose tag is one of those three in at least 90% of all PDTB argument pairs with an explicit marker. The resulting clusters are presented in Table 3. Finally, we add a fourth cluster containing relative clause paraphrases (such as 'who is the' and 'which is a'). Paraphrases from this cluster are not equivalent and cannot be replaced with one another. Instead, they are replaceable with apposition paraphrases (as demonstrated in Table 4, under Relative Clause).

Table 1 (excerpt): mistake types with examples, where (a+b) marks the input sentences, (I) the generated fusion, and (G) the gold fusion.
Correct fusion variant: (a+b) It is situated around Bad Segeberg but not part of it. Bad Segeberg is the seat of the AMT. (I) It is situated around Bad Segeberg, the seat of the AMT, but not part of it. (G) It is situated around Bad Segeberg, which is the seat of the AMT, but not part of it.
Missing/added anaphora: (a+b) Of the three, purple is preferred. Purple reinforces the red. (I) Of the three, purple is preferred because purple reinforces the red. (G) Of the three, purple is preferred because it reinforces the red.
Given a DISCOFUSE target fusion t i , if t i is annotated with a connective phrase c i that appears in one of our semantic clusters, we define the set V(t i ) that includes t i and its variants. These variants are automatically generated by replacing the occurrence of c i in t i by the other cluster members. Table 4 demonstrates this process. More details and examples are in the appendix ( §A).
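A minimal sketch of this variant-generation process follows. The cluster contents shown are illustrative subsets, not the exact clusters of Table 3, and a real implementation would also need to handle punctuation, capitalization and marker position.

```python
# Illustrative subsets of the connective clusters (not the exact Table 3 sets).
CLUSTERS = [
    {"however", "but", "yet"},                    # Comparison
    {"moreover", "in addition", "additionally"},  # Conjunction
    {"hence", "therefore", "thus"},               # Cause
]

def variant_set(gold_fusion, connective):
    """Expand a gold fusion t_i into V(t_i) by swapping its annotated
    connective c_i with the other members of its equivalence cluster."""
    variants = {gold_fusion}
    for cluster in CLUSTERS:
        if connective in cluster:
            for other in cluster - {connective}:
                variants.add(gold_fusion.replace(connective, other, 1))
    return variants

V = variant_set("It rained , however we went out .", "however")
# V contains the gold fusion plus the 'but' and 'yet' variants.
```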

The Quality of Multiple References
To validate the reliability of our automatically generated variants as ground-truth fusions, we evaluate their quality with human annotators. To this end, we randomly sampled 350 examples from the DISCOFUSE dev sets (Wikipedia and Sports). Each example consists of two input sentences and two fusion outcomes: the gold fusion and one automatically generated variant. We then conducted a crowd-sourcing experiment using Amazon Mechanical Turk, in which each example was rated by 5 native English speakers. Each rater indicated whether one fusion outcome is better than the other, or whether both outcomes are of similar quality (good or bad). We took the majority of rater votes for each example. Table 5 summarizes this experiment. It shows that the raters did not favor a specific fusion outcome, which reinforces our confidence in our precision-based automatic generation method.
To demonstrate the benefit of our generated multiple references in fusion evaluation, we re-evaluated the mistakes marked by single-reference EXACT in §3.1. Concretely, each gold fusion t_i was automatically expanded into the multiple-reference set V(t_i). We define a multi-reference accuracy:

MR-EXACT = (1/N) Σ_{i=1}^{N} 1[f_i ∈ V(t_i)],

where f_i is the generated fusion for example i, 1[·] is the indicator function, and N is the size of the test set. MR-EXACT considers a generated fusion for an example correct if it matches one of the variants in V(t_i). For illustration (cf. Table 4), the gold fusion (G) 'She is famed for her noble art Raikiri, which is a slash powered by lightning, that is believed to be inevitable.' has the variant (V) 'She is famed for her noble art Raikiri, a slash powered by lightning, that is believed to be inevitable.' We measured an absolute error reduction of 15% in both domains, where all these cases come from the correct fusion class of Table 2.
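The MR-EXACT computation amounts to a membership test against V(t_i), averaged over the test set; a minimal sketch:

```python
def mr_exact(generated_fusions, reference_sets):
    """MR-EXACT: the fraction of examples whose generated fusion f_i
    matches some member of its multi-reference set V(t_i)."""
    hits = sum(1 for f, variants in zip(generated_fusions, reference_sets)
               if f in variants)
    return hits / len(generated_fusions)

preds = ["A , however B .", "A , so B ."]
refs = [{"A , but B .", "A , however B ."}, {"A , because B ."}]
mr_exact(preds, refs)  # -> 0.5
```

The first prediction differs from the original gold fusion only in its connective, yet is credited because it matches a valid variant.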

A Semantically Directed Fusion Model
In the previous section we established the importance of multiple references for sentence fusion. We next show ( §4.1) that the state-of-the-art fusion model fails to detect the semantic relationship between the input sentences. We aim to solve this problem by expanding the training data to include multiple references (MRs) per input example, where together these references provide good coverage of the semantic relationship and are not limited to a single connective phrase. We then present our model ( §4.3), which utilizes auxiliary tasks in order to facilitate the learning of the semantic relationships from the MR training data ( §4.2).

The SotA Model: Error Analysis
Rothe et al. (2019) set the current state-of-the-art results on the DISCOFUSE dataset with a model that consists of a pre-trained BERT encoder paired with a randomly initialized Transformer decoder, which are then fine-tuned for the fusion task. We re-implemented this model, denoted here by BERT, which serves as our baseline. We then evaluated BERT on DISCOFUSE using MR-EXACT ( §3.3) and report its performance on each of the discourse phenomena manifested in the dataset (Table 11).
We found that this model excels in fusion cases that are entity-centric in nature. In these cases, the fused elements are different information pieces related to a specific entity, such as pronominalization and apposition (bottom part of Table 11). These fusion cases do not revolve around the semantic relationship between the two sentences. This is in line with recent work that has shown that the pre-trained BERT captures the syntactic structure of its input text (Tenney et al., 2019).
On the other hand, the performance of the BERT model degrades when fusion requires the detection of the semantic relationship between the input sentences, which is usually reflected via an insertion of a discourse marker. Indeed, this model often fails to identify the correct discourse marker (top part of Table 11). Table 6 demonstrates some of the semantic errors made by BERT.

Automatic Dataset Augmentation
We aim to expose a fusion model to various manifestations of the semantic relation between the input sentences in each training example, rather than to a single one, as well as to reduce the skewness in connective occurrences. We hypothesize that this should help the model better capture the semantic relationship between input sentences.
To this end, we utilize our implementation of the variant set V ( §3.2). Specifically, for each training example we add the input sentence pair paired with each variant in V(t_i) as a separate single-reference example to the augmented training set. We then train a fusion model on this augmented dataset. The augmented dataset balances between variants of the same fusion phenomenon: if in the original dataset one variant was prominent, its occurrences are now augmented with occurrences of all other variants offered by V. We denote the baseline model trained on the augmented dataset by AugBERT.
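The balancing effect can be illustrated on a toy skewed corpus: once every example is expanded with all cluster members, each connective in the cluster appears equally often. The cluster below is a single assumed equivalence class, for illustration only.

```python
from collections import Counter

CLUSTER = {"but", "however", "yet"}  # one assumed equivalence class

def augment(dataset):
    """Expand each (inputs, fusion, connective) example with all cluster
    variants, flattening multiple references into single-reference examples."""
    out = []
    for inputs, fusion, conn in dataset:
        members = CLUSTER if conn in CLUSTER else {conn}
        for m in sorted(members):
            out.append((inputs, fusion.replace(conn, m, 1), m))
    return out

# A skewed toy corpus in which 'but' dominates 8-to-2.
data = [(("s1", "s2"), "s1 , but s2 .", "but")] * 8
data += [(("s1", "s2"), "s1 , however s2 .", "however")] * 2

counts = Counter(conn for _, _, conn in augment(data))
# After augmentation every cluster member appears equally often (10 each).
```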

(I) Grace is told she can not get pregnant and IVF is unlikely to help. (G) Grace is told she can not get pregnant because IVF is unlikely to help.
(I) The mounds now appear smaller than they did in the past because extensive flooding in the centuries since their construction has deposited 3 feet. (G) The mounds now appear smaller than they did in the past, although extensive flooding in the centuries since their construction has deposited 3 feet.
(I) A Grand Compounder was a degree candidate at the University of Oxford who was required to pay extra for his degree because he had a certain high level of income. (G) A Grand Compounder was a degree candidate at the University of Oxford who was required to pay extra for his degree unless he had a certain high level of income.
(I) The Battalion lost 41 men killed or died of wounds received on 1 July 1916. (G) The Battalion lost 41 men killed and died of wounds received on 1 July 1916.
Table 6: Examples of semantic errors made by the BERT model. The generated fusion is marked with (I) and the ground-truth fusion with (G). These examples are handled correctly by our AuxBERT model.

Semantically Directed Modeling
Multiple references introduce diversity to the training set that could guide a model towards a more robust semantic representation. Yet, we expect that more semantic directives are needed to utilize this data appropriately. Specifically, we hypothesize that the lower performance of the state-of-the-art BERT on semantic phenomena is partly due to its mean cross-entropy (MCE) loss function:

MCE = -(1/N) Σ_{i=1}^{N} (1/T_i) Σ_{j=1}^{T_i} log p(t_{i,j} | s^1_i, s^2_i, t_{i,<j}),

where N is the size of the training set, T_i is the length of the target fusion t_i, t_{i,j} is the j-th token of t_i, and p(w | s^1_i, s^2_i, pre) is the model's probability for the next token to be w, given the input sentences s^1_i, s^2_i and the previously generated prefix pre. As discussed earlier, the word-level overlap between the fusion and its input sentences is often high. Hence, many token-level predictions made by the model are mere copies of the input, and should be relatively easy to generate compared to new words that do not appear in the input. However, as the MCE loss does not distinguish copied words from newly generated ones, it incurs only a small penalty if just one or two words in a long fused sentence are incorrect, even if these words form an erroneous discourse marker. Moreover, the loss function does not directly account for the semantic (dis)similarities between connective terms. This may misguide the model into differentiating between similar connective terms, such as 'moreover' and 'additionally'.
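A small numeric illustration of this dilution effect (the per-token probabilities below are hypothetical):

```python
import math

def mean_cross_entropy(token_probs):
    """MCE for a single target fusion: the mean of -log p over its tokens."""
    return -sum(map(math.log, token_probs)) / len(token_probs)

# Hypothetical per-token probabilities for a 12-token fusion: every copied
# token is easy (p = 0.99) and only the connective token is wrong (p = 0.05).
loss_wrong_connective = mean_cross_entropy([0.99] * 11 + [0.05])
loss_all_correct = mean_cross_entropy([0.99] * 12)

# The erroneous discourse marker flips the meaning of the whole fusion,
# yet its large token-level penalty (-log 0.05) is diluted by the eleven
# easy copy tokens in the sentence-level average.
```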
To address these problems, we introduce a multi-task framework, where the main fusion task is jointly learned with two auxiliary classification tasks whose goal is to make the model explicitly consider the semantic choices required for correct fusion. The first auxiliary task predicts the type of discourse phenomenon that constitutes the fusion act, out of 12 possible tags (e.g. apposition or discourse connective; see Table 11). The second auxiliary task predicts the correct connective phrase (e.g. 'however', 'plus' or 'hence') out of a list of 71 connective phrases. As gold labels for these tasks we utilize the structured information provided for each DISCOFUSE example, which includes the ground-truth discourse phenomenon and the connective phrase that was removed as part of the example construction. We denote this model AuxBERT, and our full model with auxiliary tasks trained on the multiple-reference dataset AugAuxBERT.
The AuxBERT architecture is depicted in Figure 1. Both the fusion task and the two auxiliary classification tasks share the contextualized representation provided by the BERT encoder. Each classification task has its own output head, while the fusion task is performed via a Transformer decoder. The token-level outputs of the BERT encoder are processed by the attention mechanism of the Transformer decoder. BERT's CLS token, which provides a sentence-level representation, is post-processed by the pooler and fed to the classification heads. The three tasks we employ are combined in the following objective function:

L_total = L_gen + α · L_type + β · L_conn,

where L_gen is the cross-entropy loss of the generation task, while L_type and L_conn are the cross-entropy losses of the discourse type and connective phrase predictions, respectively, with scalar weights α and β. We utilize a pre-trained BERT encoder and fine-tune only its top two layers.
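The combined objective can be sketched as follows. This is a toy illustration with hypothetical head outputs; the actual heads score 12 discourse types and 71 connectives.

```python
import math

def cross_entropy(probs, gold):
    """Cross-entropy of a classification head's distribution vs. the gold label."""
    return -math.log(probs[gold])

def total_loss(gen_loss, type_probs, gold_type, conn_probs, gold_conn,
               alpha=1.0, beta=1.0):
    """L_total = L_gen + alpha * L_type + beta * L_conn (alpha, beta tuned on dev)."""
    return (gen_loss
            + alpha * cross_entropy(type_probs, gold_type)
            + beta * cross_entropy(conn_probs, gold_conn))

# Hypothetical head outputs, for illustration only.
type_probs = {"discourse connective": 0.7, "apposition": 0.3}
conn_probs = {"however": 0.6, "but": 0.4}
loss = total_loss(0.25, type_probs, "discourse connective", conn_probs, "however",
                  alpha=0.5, beta=0.5)
```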

Experimental Setup
We follow prior work and use the balanced version of DISCOFUSE ( §2). The dataset consists of 4.5M examples for Wikipedia ('W') and 12M examples for Sports ('S'), split into 98% training, 1% test and 1% dev. We evaluate fusion models in both in-domain and cross-domain settings (training in one domain and testing on the other). We denote with W → S the setup where training is done on Wikipedia and testing on Sports, and similarly use S → W for the other direction.
We evaluate the following fusion models:
Transformer - the Transformer-based model by Geva et al. (2019).
BERT - the BERT-based state-of-the-art model by Rothe et al. (2019).
AuxBERT - our multi-task model ( §4.3) trained on the original DISCOFUSE training set.
AugBERT - the BERT baseline trained on our augmented MR training set ( §4.2).
AugAuxBERT - our multi-task model trained on our augmented MR training set ( §4.3).
All baselines used the same parameter settings described in the cited works, and our models follow the parameter settings of the BERT baseline. The same batch size and number of training steps were used for all models, so each model is trained on the same number of examples whether using the original DISCOFUSE or our augmented version. The α and β hyper-parameters of the multi-task objective are tuned on the dev set (see the supp. material).

Results
We report results with MR-EXACT (Table 7) and MR-SARI (Table 8). To maintain compatibility with prior work, we also report results with single-reference (SR) EXACT (Table 9) and SARI (Table 10). Under SR evaluation, our augmented models do not show the same advantage, which is not surprising since SR evaluation uses one arbitrary reference, while the augmented dataset guides the model towards balanced fusion variants.

Our premise in this paper is that multi-reference evaluation is more adequate for assessing outcomes that paraphrase the original DISCOFUSE fusions. Indeed, the results in Tables 7 and 8 show that with MR evaluation all our models outperform all baselines across setups, with AugAuxBERT, which combines multi-reference training and semantic guidance using auxiliary tasks, performing best.

We further analyze in Table 11 the in-domain performance of the strongest baseline, BERT, and our strongest model, AugAuxBERT, using MR-EXACT, sliced by the different discourse phenomena annotated in DISCOFUSE. As discussed in §4.1, we distinguish between two types of fusion phenomena. Entity-centric fusion phenomena bridge between two mentions of the same entity, and for these no connective discourse marker should be added by the model. Our analysis shows that both models perform well on 3 of the 5 entity-centric phenomena (bottom part of Table 11). For None and Anaphora, there is a drop in AugAuxBERT performance, which may be attributed to the change in example distribution introduced by our augmented dataset, and will be addressed in future work.
The semantic relationship phenomena, on the other hand, require deeper understanding of the relationship between the two input sentences. They tend to be more difficult as they involve the choice of the right connective according to this relation. On these phenomena (top part of Table 11), AugAuxBERT provides substantial improvements compared to BERT, indicating the effectiveness of guiding a model towards a robust semantic interpretation of the fusion task via multiple references and multitasking. Specifically, in the most difficult phenomenon for BERT, discourse connectives, performance increased relatively by 62% for Sports and 26% for Wikipedia. The gap is even larger for the composite cases of discourse connectives combined with anaphora ("Discourse connective+A"): 328.3% (Sports) and 65.4% (Wikipedia).
Finally, to explore the relative importance of the different components of our model, we examined model performance sliced by the clusters introduced in §3.2 (see Table 3). The results (Table 12) show that AuxBERT outperforms BERT in 7 of 8 cases, but the gap is ≤ 2% in all cases but one. On the other hand, AugBERT improves over BERT mostly for 'Comparison' and 'Cause', but the average improvements on these clusters are large: 15.4% (Sports) and 9.2% (Wikipedia). This shows that while our auxiliary tasks offer a consistent performance boost, the inclusion of multiple references contributes significant changes to the model's semantic perception.

Analysis of Similarity-Based Evaluation Metrics
In this analysis, we focus on potential alternative evaluation measures. As mentioned in §2, a possible direction for solving issues in the evaluation of sentence fusion stemming from having a single reference could be to use similarity-based evaluation metrics (Sellam et al., 2020; Kusner et al., 2015; Clark et al., 2019; Zhang et al., 2020). We notice two limitations in applying such metrics to sentence fusion. First, similarity-based metrics depend on trainable models that are often in use within fusion models. Thus, we expect these metrics to struggle in evaluation when fusion models struggle in prediction. Second, these metrics fail to correctly account for the semantics of discourse markers, which is particularly important for sentence fusion.

Table 13 (excerpt): each reference fusion (R) is paired with a meaning-preserving variant (G) and a meaning-changing variant (B), with BERTScore (F1) and MR-EXACT values.
(R) Ruby is the traditional birthstone for July and is usually more pink than garnet, however some rhodolite garnets have a similar pinkish hue to most rubies. (G) Although ruby is the traditional birthstone for July and is usually more pink than garnet, some rhodolite garnets have a similar pinkish hue to most rubies. [BERTScore 0.9670, MR-EXACT 1] (B) Ruby is the traditional birthstone for July and is usually more pink than garnet, thus some rhodolite garnets have a similar pinkish hue to most rubies. [BERTScore 0.9893, MR-EXACT 0]
(R) The water level in the wells has risen. As a result, work on agricultural lands is going on. (G) The water level in the wells has risen, hence work on agricultural lands is going on. [BERTScore 0.9713, MR-EXACT 1] (B) The water level in the wells has risen. However, work on agricultural lands is going on. [BERTScore 0.9745, MR-EXACT 0]
(R) August 28, which is the second day after school starts, is their first away game. (G) August 28, the second day after school starts, is their first away game. [BERTScore 0.9834, MR-EXACT 1] (B) August 28, who is the second day after school starts, is their first away game. [BERTScore 0.9879, MR-EXACT 0]
In Table 13 we illustrate typical evaluation mistakes made by BERTScore (Zhang et al., 2020), a recent similarity-based evaluation measure. We calculate BERTScore (F1) for each reference fusion against two variants: (1) a fusion that holds the same meaning and (2) a fusion with a different meaning. A valid evaluation measure for the task should favor the first option (i.e. the fusion with the same meaning). However, that is not the case for the given examples. The measure often fails to consider the semantic differences between sentences, which is an important element of the task.
Consider the second example in Table 13: BERTScore favors the structural similarity between the gold reference (R) and the incorrect variant (B), which differ in meaning (yet are based around the same fusion phenomenon: discourse connective). Meanwhile, the correct variant (G) holds the same meaning as the reference (while a different fusion phenomenon is used: sentence coordination instead of discourse connective).

Conclusions
We studied the task of sentence fusion and argued that a major limitation of common training and evaluation methods is their reliance on a single reference ground-truth. To address this, we presented a method that automatically expands ground-truth fusions into multiple references via curated equivalence classes of connective terms. We applied our method to a leading resource for the task.
We then introduced a model that utilizes multiple references by training on each reference as a separate example, while learning a common semantic representation for surface-form variation using auxiliary tasks for semantic relationship prediction in a multi-tasking framework. Our model achieves state-of-the-art performance across a variety of evaluation scenarios.
Our approach for evaluating and training with generated multiple references is complementary to an approach that uses a similarity measure to match between similar texts. In future work, we plan to study the combination of the two approaches.

A Augmentation Rules
In this section we provide the technical details of the augmentation rules used to augment DISCOFUSE (see §3.2). For the sake of clarity, we only provide a general explanation of most rules, avoiding fine-grained issues, minor implementation details, and repetitions of similar rules with minor differences. We note that our augmented dataset will be made publicly available upon acceptance of the paper. Given a triplet (s^1_i, s^2_i, t_i), where s^1_i and s^2_i are the input sentences and t_i is the ground-truth fusion, and its corresponding discourse phenomenon and marker, p_i and c_i, respectively, we consider the semantic relationship in t_i which is expressed by c_i (see the beginning of Table 18). Our augmentation rules relate to three semantic classes: Conjunction, Comparison and Cause, and to one syntactic class: Relative clause (see class definitions, §3.2). We design a set of rules for each of these classes, such that each rule first specifies how to detect a fusion that can be augmented according to the rule, and then describes which operations to perform on the ground-truth fusion and its inputs, t_i, s^1_i and s^2_i, in order to generate a new valid fusion.

Delete(T, s0): delete occurrences of s0 from T (one of the string operations listed in Table 16). T, s0 and s1 are strings, where T is often an entire sentence while s0 and s1 are phrases. Table 16 describes each rule, presenting its detection and augmentation schema, accompanied by the notations and definitions provided in Table 17.
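The Delete operation can be sketched as a plain string edit (a minimal illustration; the actual implementation presumably handles casing, punctuation and whitespace more carefully):

```python
def delete(T: str, s0: str) -> str:
    """Delete occurrences of the phrase s0 from the sentence T,
    collapsing any doubled space left behind."""
    return T.replace(s0, "").replace("  ", " ").strip()

result = delete("She left early and quietly.", " and quietly")
# removes the phrase and keeps the rest of the sentence intact
```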
In Table 18 we provide a detailed example of the augmentation process. We start with a description of the input structure, which is detected as a fit for an augmentation rule. We then demonstrate how the variant generation is performed, in a step-by-step manner.

Notation      Definition
t_i           The ground-truth fusion of the i-th example
s^1_i, s^2_i  The two input sentences of the i-th example
c_i           The discourse marker used in t_i
p_i           The discourse phenomenon of t_i

Separate tables present statistics of the augmented data used for training AugBERT and AugAuxBERT. These tables provide details about the imbalanced augmentation, where specific phenomena and markers are generated more often than others during the augmentation process.

C Probability Distribution across Valid Fusions
According to the results our models achieve in MR evaluation, we conclude that they are better at generating a fused text that is included in the ground-truth set. Here we show that, in addition, they assign a more uniform probability to the members of this set, compared to the BERT model. Figure 2 graphically illustrates this pattern for three typical examples with 9, 9 and 5 ground-truth fusions, respectively (in each example, fusion 1 is the one in the original DISCOFUSE, and the others were created in our expansion). We first formally show that the probability mass tends to be uniformly allocated among the various references: for any t ∈ V(t_i), let p̂_i(t) = p(t | s^1_i, s^2_i) / Σ_{t'∈V(t_i)} p(t' | s^1_i, s^2_i) be the probability of a variant t relative to the overall probability of the variants in V(t_i). Indeed, for more than 99% of the test-set examples, the entropy −Σ_{t∈V(t_i)} p̂_i(t) log p̂_i(t) induced by AugBERT and AugAuxBERT over the ground-truth solutions is higher than that of BERT, indicating that our augmented models are less inclined to prefer one of the solutions over the others. Moreover, we computed the sum of the multiple-reference probabilities Σ_{t∈V(t_i)} p(t | s^1_i, s^2_i) over the test-set examples. In about 71% of the test-set examples, the sum of probabilities induced by AugBERT and AugAuxBERT is higher than that of BERT. That is, our models learn to direct more overall probability mass towards the correct variants, indicating higher confidence in the correct solutions.
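The normalization and entropy computation described above can be sketched as follows (the probability values are illustrative placeholders, not numbers from our models):

```python
import math

def normalized_probs(variant_probs):
    """Normalize the raw model probabilities p(t | s1, s2) over the
    ground-truth set V(t_i), yielding the relative probabilities p_hat."""
    z = sum(variant_probs)
    return [p / z for p in variant_probs]

def entropy(variant_probs):
    """Entropy of the normalized distribution over the valid fusions."""
    q = normalized_probs(variant_probs)
    return -sum(p * math.log(p) for p in q if p > 0)

# A model that spreads mass evenly over 3 valid fusions has higher
# entropy than one that concentrates mass on a single fusion.
uniform_h = entropy([0.2, 0.2, 0.2])
peaked_h = entropy([0.5, 0.05, 0.05])
```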

D Hyper-Parameters and Configurations
The BERT, AuxBERT, AugBERT and AugAuxBERT models share the same hyper-parameters with respect to their shared architecture and the optimization process. All models use an initialized BERT-Base Uncased encoder with a randomly initialized Transformer (Vaswani et al., 2017) decoder. Configuration details and the hyper-parameters of the training process are provided in Table 21.
Recall that we define our multi-task loss as follows: L_total = L_gen + α · L_type + β · L_conn, where L_gen is the cross-entropy loss of the generation task, and L_type and L_conn are the cross-entropy losses of the discourse type and connective phrase predictions, respectively, with scalar weights α and β. We tuned α and β on the DISCOFUSE development sets, considering the values {0.1, 0.5, 1} for both weights. We then chose the best performing hyper-parameter set according to the highest EXACT score on the appropriate development set. In all cases the resulting value was 0.1 for both weights.
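As a minimal sketch, the combined objective is a weighted sum of the three scalar cross-entropy losses (the loss values in the example are illustrative):

```python
# Tuned weights from the development-set search described above.
ALPHA, BETA = 0.1, 0.1

def total_loss(l_gen, l_type, l_conn, alpha=ALPHA, beta=BETA):
    """L_total = L_gen + alpha * L_type + beta * L_conn."""
    return l_gen + alpha * l_type + beta * l_conn

# e.g. combining a generation loss with the two auxiliary losses:
l = total_loss(2.0, 1.0, 3.0)  # 2.0 + 0.1 * 1.0 + 0.1 * 3.0
```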
The auxiliary heads of AuxBERT and AugAuxBERT also share the same architecture and hyper-parameters. For each auxiliary classifier we used one fully-connected layer, where the input dimension is 768, derived from BERT's pooler output, and the output dimension is determined by the auxiliary output dimension (71 discourse markers and 12 discourse phenomena).
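A minimal sketch of such a head, assuming a 768-dim pooled input (the random weights are placeholders; in the model they are trained jointly with generation):

```python
import random

# Each auxiliary head is a single fully-connected layer mapping BERT's
# 768-dim pooler output to its label space: 71 discourse markers or
# 12 discourse phenomena.
HIDDEN = 768

def make_head(n_labels, rng=random.Random(0)):
    """Build a linear layer head: logits = W x + b."""
    w = [[rng.gauss(0, 0.02) for _ in range(HIDDEN)] for _ in range(n_labels)]
    b = [0.0] * n_labels
    def head(pooled):  # pooled: list of HIDDEN floats
        return [sum(wi * xi for wi, xi in zip(row, pooled)) + bi
                for row, bi in zip(w, b)]
    return head

marker_head = make_head(71)      # discourse marker classifier
phenomenon_head = make_head(12)  # discourse phenomenon classifier
logits = phenomenon_head([0.1] * HIDDEN)  # one logit per phenomenon
```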
We use the best-performing architecture and hyper-parameters specified by Malmi et al. (2019) for the LaserTagger model. Specifically, we use the auto-regressive model, AR-LaserTagger, with an initialized BERT-Base Cased encoder and a small randomly initialized Transformer decoder. This model has shown better results on the fusion task than FF-LaserTagger, the non-auto-regressive variant.

E Experimental Details
All experiments were performed on either one or two Nvidia GeForce GTX 1080 Ti GPUs, with two cores, 11 GB GPU memory per core, 6 CPU cores and 62.7 GB RAM.
We measured an average of 8.5 hours for 45,000 training steps for BERT, AuxBERT and AugAuxBERT, which is approximately a full 'Wikipedia' epoch and about one-third of a full 'Sports' epoch. To achieve full convergence, each model requires about 675K-900K training steps.
In Table 22 we provide the corresponding single-reference EXACT validation performance for each reported test result. Notice that domain adaptation setups are not included in this table, since in such setups we use development data from the source domain.