Bilingually-constrained Synthetic Data for Implicit Discourse Relation Recognition

To alleviate the shortage of labeled data, we propose to use bilingually-constrained synthetic implicit data for implicit discourse relation recognition. These data are extracted from a bilingual sentence-aligned corpus according to the implicit/explicit mismatch between different languages. Incorporating these data via a multi-task neural network model achieves significant improvements over baselines, on both the English PDTB and Chinese CDTB data sets.


Introduction
Discovering the discourse relation between two sentences is crucial to understanding the meaning of a coherent text, and is also beneficial to many downstream NLP applications, such as question answering and machine translation. Implicit discourse relation recognition (DRR_imp) remains a challenging task due to the absence of strong surface clues like discourse connectives (e.g. "but"). Most work resorts to large amounts of manually designed features (Soricut and Marcu, 2003; Pitler et al., 2009; Lin et al., 2009; Louis et al., 2010; Rutherford and Xue, 2014), or distributed features learned via neural network models (Braud and Denis, 2015). These methods usually suffer from limited labeled data. Marcu and Echihabi (2002) attempt to create labeled implicit data automatically by removing connectives from explicit instances, and use the result as additional training data. These data are usually called synthetic implicit data (hereafter SynData). However, Sporleder and Lascarides (2008) argue that SynData has two drawbacks: 1) the meaning shifts in some cases when connectives are removed, and 2) its word distribution differs from that of real implicit data. They also show that using SynData directly degrades performance. Recent work seeks to derive valuable information from SynData while filtering noise, via domain adaptation (Braud and Denis, 2014), classifying connectives (Rutherford and Xue, 2015) or multi-task learning (Lan et al., 2013; Liu et al., 2016), and shows promising results.

[Figure 1 example, word-by-word gloss: society reckon existence youth problems, but many young people think themselves no problems.]
Figure 1: An example illustrating the implicit/explicit mismatch between Chinese (ch) and English (en). A Chinese implicit instance is translated into an English explicit one. In the PDTB, a discourse instance is defined as a connective (e.g. "but") taking two arguments (Arg1 and Arg2).
Different from previous work, we propose to construct bilingually-constrained synthetic implicit data (called BiSynData) for DRR_imp, which can alleviate the drawbacks of SynData. Our method is inspired by the finding that a discourse instance expressed implicitly in one language may be expressed explicitly in another. For example, Zhou and Xue (2012) show that connectives are omitted much more frequently in Chinese than in English (about 82.0% vs. 54.5%). Li et al. (2014a) further argue that there are about 23.3% implicit/explicit mismatches between Chinese/English instances. As illustrated in Figure 1, a Chinese implicit instance, in which the connective is absent, is translated into an English explicit one with the connective "but". Intuitively, the Chinese instance is a real implicit one whose relation can be signaled by "but". Hence, it could potentially serve as additional training data for the Chinese DRR_imp, avoiding the different word distribution problem of SynData. Meanwhile, for the English explicit instance, it is very likely that removing "but" would not lose any information, since its Chinese counterpart can be omitted. Therefore it could be used for the English DRR_imp, alleviating the meaning shift problem of SynData.
We extract our BiSynData from a Chinese-English sentence-aligned corpus (Section 2). Then we design a multi-task neural network model to incorporate the BiSynData (Section 3). Experimental results, on both the English PDTB (Prasad et al., 2008) and Chinese CDTB (Li et al., 2014b), show that BiSynData is more effective than SynData used in previous work (Section 4). Finally, we review the related work (Section 5) and draw conclusions (Section 6).

BiSynData
Formally, given a Chinese-English sentence pair (S_ch, S_en), we try to find an English explicit instance (Arg1_en, Arg2_en, Conn_en) in S_en, and a Chinese implicit instance (Arg1_ch, Arg2_ch) in S_ch, where (Arg1_en, Arg2_en, Conn_en) is the translation of (Arg1_ch, Arg2_ch). In most cases, discourse relations should be preserved during translation, so the connective Conn_en is potentially a strong indicator of the discourse relation between not only Arg1_en and Arg2_en, but also Arg1_ch and Arg2_ch. Therefore, we can construct two synthetic implicit instances labeled by Conn_en, denoted as ((Arg1_en, Arg2_en), Conn_en) and ((Arg1_ch, Arg2_ch), Conn_en), respectively. We refer to these synthetic instances as BiSynData.

In our experiments, we extract our BiSynData from a combined corpus (FBIS and HongKong Law), with about 2.38 million Chinese-English sentence pairs. We generate 30,032 synthetic English instances and the same number of Chinese instances, with 80 connectives, as our BiSynData. Table 1 lists the top 10 most frequent connectives in our BiSynData, which are roughly consistent with the statistics of Chinese/English implicit/explicit mismatches in (Li et al., 2014a). According to connectives and their related relations in the PDTB, in most cases "and" and "also" indicate the Expansion relation, "if" and "because" the Contingency relation, "before" the Temporal relation, and "but" the Comparison relation. The connectives "as", "when", "while" and "since" are ambiguous. For example, "while" can indicate the Comparison or Temporal relation. Overall, our constructed BiSynData covers all four main discourse relations defined in the PDTB.
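The construction of the two synthetic instances can be sketched as follows. This is only a minimal illustration, assuming the English explicit instance and its aligned Chinese implicit instance have already been identified in the sentence pair; the names SyntheticInstance and make_bisyn_pair are hypothetical, and the tokens in the usage lines are invented stand-ins rather than data from the corpus.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SyntheticInstance:
    """An argument pair without a connective, labeled by an English connective."""
    arg1: Tuple[str, ...]
    arg2: Tuple[str, ...]
    label: str  # Conn_en, used directly as the classification label


def make_bisyn_pair(
    arg1_en: List[str], arg2_en: List[str], conn_en: str,
    arg1_ch: List[str], arg2_ch: List[str],
) -> Tuple[SyntheticInstance, SyntheticInstance]:
    """Build ((Arg1_en, Arg2_en), Conn_en) and ((Arg1_ch, Arg2_ch), Conn_en).

    The connective is kept only as the label; it is not part of either argument.
    """
    label = conn_en.lower()
    english = SyntheticInstance(tuple(arg1_en), tuple(arg2_en), label)
    chinese = SyntheticInstance(tuple(arg1_ch), tuple(arg2_ch), label)
    return english, chinese


# Invented tokens for illustration; "ch_w1" etc. stand in for Chinese words.
en_inst, ch_inst = make_bisyn_pair(
    ["society", "sees", "problems"], ["young", "people", "disagree"], "But",
    ["ch_w1", "ch_w2"], ["ch_w3", "ch_w4"],
)
```

Both instances share the same label, which mirrors the idea that the English connective signals the relation on both sides of the translation.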
With our BiSynData, we define two connective classification tasks: 1) given (Arg1_en, Arg2_en), predict the connective Conn_en, and 2) given (Arg1_ch, Arg2_ch), predict Conn_en. We incorporate the first task to help the English DRR_imp, and the second to help the Chinese DRR_imp. It is worth noting that in both tasks we use the English connectives themselves as classification labels rather than mapping them to relations.

In our multi-task neural network model (MTN), the main and auxiliary tasks are essentially the same, just with different output labels. Therefore, as illustrated in Figure 2, MTN shares parameters in all feature layers (L1-L3) and uses two separate classifiers in the classifier layer (L4). For each task, given an instance (Arg1, Arg2), MTN simply averages the embeddings of words to represent the arguments, as v_Arg1 and v_Arg2. These two vectors are then concatenated and transformed through two non-linear hidden layers. Finally, the corresponding softmax layer is used to perform classification. MTN ignores the word order in arguments and uses two hidden layers to capture the interactions between the two arguments. The idea behind MTN is borrowed from (Iyyer et al., 2015), where a deep averaging network achieves close to state-of-the-art performance on text classification. Though MTN is simple, it is easy to train and efficient in both memory and computational cost. In addition, the simplicity of MTN allows us to focus on measuring the quality of BiSynData.
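The forward pass described above can be sketched in NumPy as follows. This is a rough illustration rather than the authors' implementation: the class name, the random initialization (used here in place of pre-trained embeddings), the vocabulary size and the task keys "main" and "aux" are all assumptions. The layer sizes and class counts (4 relations, 80 connectives) follow the numbers given in the text.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


class MTN:
    """Shared averaging and hidden layers, one softmax classifier per task."""

    def __init__(self, vocab_size, n_classes, emb_dim=100, l2=200, l3=100, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.1, size=shape)

        self.emb = init(vocab_size, emb_dim)  # stand-in for fixed pre-trained embeddings
        self.W2, self.b2 = init(2 * emb_dim, l2), np.zeros(l2)  # shared hidden layer
        self.W3, self.b3 = init(l2, l3), np.zeros(l3)           # shared hidden layer
        # Separate classifier parameters for each task.
        self.heads = {t: (init(l3, n), np.zeros(n)) for t, n in n_classes.items()}

    def forward(self, arg1_ids, arg2_ids, task):
        v1 = self.emb[arg1_ids].mean(axis=0)  # average word embeddings of Arg1
        v2 = self.emb[arg2_ids].mean(axis=0)  # average word embeddings of Arg2
        h = relu(np.concatenate([v1, v2]) @ self.W2 + self.b2)
        h = relu(h @ self.W3 + self.b3)
        W, b = self.heads[task]
        return softmax(h @ W + b)


model = MTN(vocab_size=50, n_classes={"main": 4, "aux": 80})
p_main = model.forward([1, 2, 3], [4, 5], task="main")
p_aux = model.forward([1, 2, 3], [4, 5], task="aux")
```

Because each argument is represented by an average, permuting the word ids inside an argument leaves the output unchanged, which is the sense in which the model ignores word order.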
We use the cross-entropy loss function and mini-batch AdaGrad (Duchi et al., 2011) to optimize parameters. Pre-trained word embeddings are kept fixed, because we find that fine-tuning them during training leads to severe overfitting in our experiments. Following Liu et al. (2016), we alternate between the two tasks to train the model, one task per epoch. For tasks on both the PDTB and CDTB, we use the same hyper-parameters. The dimension of word embeddings is 100. We set the size of L2 to 200, and that of L3 to 100. ReLU is used as the non-linear function. Learning rates of 0.005 and 0.001 are used for the main and auxiliary tasks, respectively. To avoid overfitting, we randomly drop 20% of the words in each argument, following Iyyer et al. (2015). All hyper-parameters are tuned on the development set.
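Three of the training details above (word dropout, the AdaGrad update and the per-epoch task alternation) can be sketched as small helpers. This is an illustrative sketch only: the function names are hypothetical, the fallback that keeps an argument non-empty is an assumption, and so is the choice to start with the main task.

```python
import numpy as np

LEARNING_RATE = {"main": 0.005, "aux": 0.001}  # values stated in the text
WORD_DROPOUT = 0.2


def word_dropout(token_ids, rate, rng):
    """Drop each word independently with probability `rate`."""
    keep = rng.random(len(token_ids)) >= rate
    kept = [t for t, k in zip(token_ids, keep) if k]
    # Assumption: if every word was dropped, fall back to the full argument.
    return kept if kept else list(token_ids)


def adagrad_step(param, grad, cache, lr, eps=1e-8):
    """In-place AdaGrad update: scale each step by the accumulated squared gradient."""
    cache += grad ** 2
    param -= lr * grad / (np.sqrt(cache) + eps)


def task_schedule(n_epochs):
    """One task per epoch, alternating (starting with the main task is an assumption)."""
    return ["main" if epoch % 2 == 0 else "aux" for epoch in range(n_epochs)]


rng = np.random.default_rng(0)
param, cache = np.array([1.0]), np.zeros(1)
adagrad_step(param, np.array([2.0]), cache, lr=LEARNING_RATE["main"])
```

In a full training loop, each epoch would pick its task from task_schedule, apply word_dropout to both arguments of every training instance, and update the shared layers plus that task's classifier with the task-specific learning rate.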

Experiments
We evaluate our method on both the English PDTB and Chinese CDTB data sets. We tokenize English data and segment Chinese data using the Stanford CoreNLP toolkit (Manning et al., 2014). The English and Chinese Gigaword corpora (3rd edition) are used to train the English and Chinese word embeddings, respectively, via word2vec (Mikolov et al., 2013). Due to the skewed class distribution of the test data (see Section 4.1), we use the macro-averaged F1 for performance evaluation.

On the PDTB, we follow Rutherford and Xue (2015). Table 2 shows the results of MTN combining our BiSynData (denoted as MTN_bi) on the PDTB.
STN denotes MTN trained with only the main task. On the macro F1, MTN_bi gains an improvement of 4.17% over STN. The improvement is significant under a one-tailed t-test (p<0.05). A closer look at the results shows that MTN_bi performs better across all relations on precision, recall and F1 score, except for a small drop in the recall of Cont. The reason for this recall drop is not clear. The greatest improvement is observed on Comp, up to 6.36% in F1 score. A possible reason is that, among the top 10 connectives in our BiSynData, only "while" is ambiguous between Comp and Temp, whereas "as", "when" and "since" are all ambiguous between Temp and Cont. Meanwhile, the amount of labeled data for Comp is relatively small. Overall, using BiSynData under our multi-task model achieves significant improvements on the English DRR_imp. We believe the reasons for the improvements are twofold: 1) the added synthetic English instances from our BiSynData can alleviate the meaning shift problem, and 2) a multi-task learning method is helpful for addressing the different word distribution problem between implicit and explicit data.
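For reference, the macro-averaged F1 used in these comparisons gives every relation the same weight regardless of its frequency, which is why it suits a skewed class distribution. A minimal sketch in plain Python is given below; the function name and the toy labels are illustrative assumptions, and averaging over the classes that occur in the gold labels is one common convention rather than a detail stated in the text.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the classes present in `gold`."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[g] += 1  # gold class gets a false negative
    scores = []
    for label in sorted(set(gold)):
        n_pred = tp[label] + fp[label]
        n_gold = tp[label] + fn[label]
        precision = tp[label] / n_pred if n_pred else 0.0
        recall = tp[label] / n_gold if n_gold else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example: always predicting the majority class looks fine on accuracy (0.75)
# but is penalized by macro F1 because the minority class scores 0.
gold = ["Exp", "Exp", "Exp", "Comp"]
pred = ["Exp", "Exp", "Exp", "Exp"]
score = macro_f1(gold, pred)
```

In the toy example the per-class F1 values are 6/7 for Exp and 0 for Comp, so the macro average is 3/7.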
Considering that some of the English connectives (e.g., "while") are highly ambiguous, we compare our method with ones that use only unambiguous connectives. Specifically, we first discard "as", "when", "while" and "since" in the top 20 connectives, and get 22,999 synthetic instances. Then, we leverage these instances in two different ways: 1) using them in our multi-task model as above, and 2) using them directly as additional training data after mapping the unambiguous connectives to relations. Neither method using only unambiguous connectives achieves better performance. One possible reason is that the synthetic instances become more unbalanced after those with ambiguous connectives are discarded.
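The second variant, discarding ambiguous connectives and mapping the rest to relations, can be sketched as below. The mapping covers only the six unambiguous connectives that the text names explicitly, so treating every other connective as unmapped is an assumption of this sketch, as are the function name and the toy instances.

```python
# Connective-to-relation mapping as stated in the text for frequent connectives.
UNAMBIGUOUS = {
    "and": "Expansion",
    "also": "Expansion",
    "if": "Contingency",
    "because": "Contingency",
    "before": "Temporal",
    "but": "Comparison",
}
AMBIGUOUS = {"as", "when", "while", "since"}


def to_relation_instances(instances):
    """Keep instances with an unambiguous connective and relabel them by relation.

    `instances` is an iterable of (arg1, arg2, connective) triples.
    """
    relabeled = []
    for arg1, arg2, conn in instances:
        relation = UNAMBIGUOUS.get(conn.lower())
        if relation is not None:  # ambiguous or unmapped connectives are dropped
            relabeled.append((arg1, arg2, relation))
    return relabeled


toy = [
    (["a"], ["b"], "But"),
    (["c"], ["d"], "while"),
    (["e"], ["f"], "because"),
]
relabeled = to_relation_instances(toy)
```

Filtering in this way changes the label distribution of the synthetic data, which is consistent with the imbalance the text offers as a possible explanation for the weaker results.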
We also compare MTN_bi with recent systems using additional training data. Rutherford and Xue (2015) select explicit instances that are similar to the implicit ones via connective classification, to enrich the training data. Liu et al. (2016) use a multi-task model with three auxiliary tasks: 1) conn: connective classification on explicit instances, 2) exp: relation classification on the labeled explicit instances in the PDTB, and 3) rst: relation classification on the labeled RST corpus (William and Thompson, 1988), which defines discourse relations different from those in the PDTB. The results are shown in Table 3. Although Liu et al. (2016) achieve state-of-the-art performance (Line 5), they use two additional labeled corpora. We find that MTN_bi (Line 6) yields better results than the systems incorporating SynData (Lines 1, 2 and 3), or even the labeled RST (Line 4). These results confirm that BiSynData can indeed alleviate the disadvantages of SynData effectively.

The results in Table 5 show that, for the task on the CDTB, MTN incorporating BiSynData (Line 3) performs better than using SynData (Lines 1 and 2).

Related Work
One line of research related to DRR_imp tries to take advantage of explicit discourse data. Zhou et al. (2010) predict the absent connectives based on a language model, and using these predicted connectives as features proves helpful. Biran and McKeown (2013) aggregate word-pair features that are collected around the same connectives, which can effectively alleviate the feature sparsity problem. More recently, Braud and Denis (2014) and related work consider explicit data as coming from a different domain, and use domain adaptation methods to explore its effect. Rutherford and Xue (2015) propose to gather weakly labeled data from explicit instances via connective classification, which are used directly as additional training data. Lan et al. (2013) and Liu et al. (2016) combine explicit and implicit data using multi-task learning models and gain improvements. Different from all the above work, we construct additional training data from a bilingual corpus.

Multi-task neural networks have been successfully used for many NLP tasks. For example, Collobert et al. (2011) jointly train models for part-of-speech tagging, chunking, named entity recognition and semantic role labeling using a convolutional network. Liu et al. (2015) successfully combine the tasks of query classification and ranking for web search using a deep multi-task neural network. Luong et al. (2016) explore multi-task sequence-to-sequence learning for constituency parsing, image caption generation and machine translation.

Conclusion
In this paper, we introduce bilingually-constrained synthetic implicit data (BiSynData), which are generated based on the bilingual implicit/explicit mismatch, into implicit discourse relation recognition for the first time. On both the PDTB and CDTB, using BiSynData as the auxiliary task significantly improves the performance of the main task. We also show that BiSynData is more beneficial than the synthetic implicit data typically used in previous work. Since the lack of labeled data is a major challenge for implicit discourse relation classification, our proposed BiSynData can enrich the training data and then benefit future work.