Improving Implicit Discourse Relation Recognition with Discourse-specific Word Embeddings

We introduce a simple and effective method to learn discourse-specific word embeddings (DSWE) for implicit discourse relation recognition. Specifically, DSWE is learned by performing connective classification on massive explicit discourse data, and is capable of capturing discourse relationships between words. On the PDTB data set, using DSWE as features achieves significant improvements over baselines.


Introduction
Recognizing discourse relations (e.g., Contrast, Conjunction) between two sentences is a crucial subtask of discourse structure analysis. These relations can benefit many downstream NLP tasks, including question answering and machine translation. A discourse relation instance is usually defined as a discourse connective (e.g., but, and) taking two arguments (e.g., clause, sentence). For explicit discourse relation recognition, using only connectives as features achieves more than 93% accuracy. Without obvious clues like connectives, implicit discourse relation recognition is still challenging.
Earlier research usually develops linguistically informed features and uses supervised learning methods to perform the task (Lin et al., 2009; Louis et al., 2010; Rutherford and Xue, 2014; Braud and Denis, 2015). Among these features, word pairs occurring in argument pairs are considered important, since they can partially capture discourse relationships between two arguments. For example, synonym word pairs like (good, great) may indicate a Conjunction relation, while antonym word pairs like (good, bad) may indicate a Contrast relation. However, classifiers based on word pairs in previous work do not work well because of the data sparsity problem. To address this problem, recent work uses word embeddings (aka distributed representations) instead of words as input features, and designs various neural networks to capture discourse relationships between arguments (Zhang et al., 2015; Ji and Eisenstein, 2015; Qin et al., 2016). While these approaches achieve promising results, they are all based on pre-trained word embeddings that ignore discourse information (e.g., good, great, and bad are often mapped into close vectors). Intuitively, using word embeddings sensitive to discourse relations would further boost the performance.
In this paper, we propose to learn discourse-specific word embeddings (DSWE) from explicit data for implicit discourse relation recognition. Our method is inspired by the observation that synonym (antonym) word pairs tend to appear around the discourse connective and (but). Other connectives can also provide some discourse clues. We expect to encode these discourse clues into the distributed representations of words, to capture discourse relationships between them. To this end, we use a simple neural network to perform connective classification on massive explicit data. Explicit data can be considered to be automatically labeled by connectives. While they cannot be directly used as training data for implicit discourse relation recognition and contain some noise, they are effective enough to provide weakly supervised signals for training the discourse-specific word embeddings.
We apply DSWE as features in a supervised neural network for implicit discourse relation recognition. On the PDTB (Prasad et al., 2008), using DSWE yields significantly better performance than using off-the-shelf word embeddings, or recent systems incorporating explicit data. We detail our method in Section 2 and evaluate it in Section 3. Conclusions are given in Section 4. Our learned DSWE is publicly available.

Discourse-specific Word Embeddings
In this section, we first introduce the neural network model for learning discourse-specific word embeddings (DSWE), and then the way of collecting explicit discourse data for training. Finally, we highlight the differences between our work and related research.
Figure 1: An explicit instance is denoted as (arg1, arg2, conn), and w^1_arg1, ..., w^m_arg1 denote the words in arg1. The two arguments are concatenated as input, and the number of hidden layers is not limited to two.
We induce DSWE based on explicit data by performing connective classification. The connective classification task predicts which discourse connective is suitable for combining two given arguments. It is essentially similar to implicit relation recognition, just with different output labels. Therefore, any existing neural network model for implicit relation recognition can be easily used for connective classification. We adapt the model of Wu et al. (2016) for connective classification because it is simple enough to enable us to train on massive data. As illustrated in Figure 1, an argument is first represented as the average of the distributed representations of the words in it. On the concatenation of the two arguments, multiple non-linear hidden layers are then used to capture the interactions between them. Finally, a softmax layer is stacked for classification. The objective function is the sum of the cross-entropy error and the regularization error multiplied by the coefficient λ. During training, we initialize the distributed representations of all words randomly and tune them to minimize the objective function. The distributed representations obtained at the end of training are our discourse-specific word embeddings.
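The forward computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in our experiments: the tanh activation and the L2 form of the regularization error are assumptions (the text specifies neither), the helper names are hypothetical, and the gradient updates that tune the embeddings are omitted.

```python
import math
import random

def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dense(x, weights, bias):
    """Affine map; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(arg1, arg2, emb, hidden, out):
    """arg1/arg2: lists of word ids; emb: one vector per word id;
    hidden: list of (W, b) pairs; out: (W, b) of the softmax layer."""
    # Average each argument, then concatenate the two averages.
    h = average([emb[w] for w in arg1]) + average([emb[w] for w in arg2])
    for W, b in hidden:
        h = [math.tanh(v) for v in dense(h, W, b)]  # tanh is an assumption
    return softmax(dense(h, *out))

def objective(probs, gold, matrices, lam):
    """Cross-entropy of the gold connective plus lam times an L2 penalty
    (the L2 form of the regularization error is an assumption)."""
    l2 = sum(v * v for m in matrices for row in m for v in row)
    return -math.log(probs[gold]) + lam * l2

def init(rows, cols, rng):
    """Random initialization, as for the word representations in the text."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim, hid, n_conn, vocab = 4, 5, 3, 10   # toy sizes, chosen for illustration
emb = init(vocab, dim, rng)
hidden = [(init(hid, 2 * dim, rng), [0.0] * hid),
          (init(hid, hid, rng), [0.0] * hid)]
out = (init(n_conn, hid, rng), [0.0] * n_conn)

probs = forward([1, 2, 3], [4, 5], emb, hidden, out)
loss = objective(probs, 0, [emb] + [W for W, _ in hidden] + [out[0]], lam=1e-4)
```

In a full training loop, the rows of `emb` would be updated together with the other parameters, and the final `emb` would play the role of DSWE.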
Collecting explicit discourse data includes two steps: 1) distinguish whether an occurrence of a connective reflects a discourse relation. For example, the connective and can either function as a discourse connective to join two Conjunction arguments, or just be used to link two nouns in a phrase. 2) identify the positions of the two arguments. According to (Prasad et al., 2008), arg2 is defined as the argument following a connective; however, arg1 can be located within the same sentence as the connective, or in some previous or following sentence. Lin et al. (2014) show that the accuracy of distinguishing connectives is more than 97%, while that of identifying arguments is below 80%. Therefore, we use an existing toolkit to find discourse connectives, and just collect explicit instances using patterns like [arg1 because arg2], where the two arguments are in the same sentence, to decrease noise. We believe these simple patterns are enough when using a very large corpus. Note that there are 100 discourse connectives in the PDTB; we ignore four parallel connectives (e.g., if...then) for simplicity. The way of collecting explicit data can be easily generalized to other languages: one just needs to train a classifier to find discourse connectives following (Lin et al., 2014).
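A pattern such as [arg1 because arg2] can be matched with very little code. The following is a rough sketch, not the toolkit we use: the connective list is a hypothetical three-word subset, the tokenizer is a simple regular expression, and the sketch does not check whether the connective is actually used in its discourse sense, which is the step handled by the toolkit.

```python
import re

# Hypothetical subset for illustration; the PDTB inventory is much larger.
CONNECTIVES = ["because", "but", "although"]

def extract_explicit(sentence, connectives=CONNECTIVES, min_len=2):
    """Return (arg1, arg2, conn) for the first sentence-medial connective,
    or None if no connective has enough words on both sides."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok in connectives:
            # Keep only alphanumeric tokens, dropping punctuation.
            arg1 = [t for t in tokens[:i] if t.isalnum()]
            arg2 = [t for t in tokens[i + 1:] if t.isalnum()]
            if len(arg1) >= min_len and len(arg2) >= min_len:
                return arg1, arg2, tok
    return None

instance = extract_explicit("He stayed home because it was raining.")
skipped = extract_explicit("But he left.")
```

Requiring both arguments to lie in the same sentence discards many valid instances, which is acceptable only because the source corpus is very large.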
Some aspects of this work are similar to (Biran and McKeown, 2013; Braud and Denis, 2016). Based on massive explicit instances, they first build a word-connective co-occurrence frequency matrix, and then weight these raw frequencies. In this way, they represent words in the space of connectives to directly encode their discourse function. The major limitation of their approach is that the dimension of the word representations must be less than or equal to the number of connectives. By comparison, we learn DSWE by predicting connectives conditioned on arguments, which yields better performance and has no such dimension limitation.
Some researchers use explicit data as additional training data via multi-task learning (Lan et al., 2013) or data selection (Rutherford and Xue, 2015; Wu et al., 2016). In both cases, explicit data are directly used to estimate the parameters of implicit relation classifiers. As a result, it is hard for them to incorporate massive explicit data because of the noise problem. By contrast, we leverage massive explicit data by learning word embeddings from them.

Data and Settings
We collect explicit data from the XIN and LTW parts of the English Gigaword Corpus (3rd edition), and get about 4.92M explicit instances. We randomly sample 20,000 instances as the development set and use the others as the training set for DSWE. After discarding words occurring fewer than 5 times, the size of the vocabulary is 185,048. For the connective classification task, we obtain an accuracy of about 53% on the development set.
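The data preparation just described amounts to a frequency cut-off and a random split. A minimal sketch follows, assuming instances are stored as (arg1, arg2, conn) triples of token lists; the function names and the fixed seed are hypothetical.

```python
import random
from collections import Counter

def build_vocab(instances, min_count=5):
    """Keep words that occur at least min_count times across all arguments."""
    counts = Counter(w for arg1, arg2, _ in instances for w in arg1 + arg2)
    return {w for w, c in counts.items() if c >= min_count}

def split_dev(instances, dev_size, seed=0):
    """Randomly hold out dev_size instances; returns (train, dev)."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[dev_size:], shuffled[:dev_size]

# Toy data: "a" occurs 11 times, "b" 5 times, "c" once.
toy = [(["a", "b"], ["a"], "and")] * 5 + [(["c"], ["a"], "but")]
vocab = build_vocab(toy)
train, dev = split_dev(toy, dev_size=2)
```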
We adapt the neural network model described in Figure 1 as the classifier for implicit discourse relation recognition (CDRR). Specifically, we concatenate some surface features with the last hidden layer as the input of the softmax layer to predict discourse relations. We choose 500 Production rule (Lin et al., 2009) and 500 Brown Cluster Pair (Rutherford and Xue, 2014) features based on mutual information using the toolkit provided by Peng et al. (2005). Our learned DSWE is used as the pre-trained word embeddings for CDRR, and fixed during training.
Hyper-parameters for training DSWE and CDRR are selected based on their corresponding development sets, and listed in Table 1. Following previous work, we perform a 4-way classification on the four top-level relations in the PDTB: Temporal (Temp), Comparison (Comp), Contingency (Cont) and Expansion (Expa). The PDTB is split into the training set (Sections 2-20), development set (Sections 0-1) and test set (Sections 21-22). Table 2 lists the statistics of these data sets. Because the test set is small and unevenly distributed, we run our method 10 times with different random seeds (and therefore different initial parameters), and report the results of the run that is closest to the average. Finally, we use both Accuracy and Macro F1 (macro-averaged F1) to evaluate our method.
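The evaluation protocol can be made concrete with a short sketch. Accuracy and macro-averaged F1 follow their standard definitions; how "closest to the average" is measured is not specified above, so the squared Euclidean distance over (Accuracy, Macro F1) used here is an assumption, and the function names are hypothetical.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def closest_to_average(runs):
    """runs: list of (accuracy, macro_f1) pairs, one per random seed.
    Returns the index of the run nearest the mean of both metrics
    (squared Euclidean distance; this distance is an assumption)."""
    mean_acc = sum(a for a, _ in runs) / len(runs)
    mean_f1 = sum(f for _, f in runs) / len(runs)
    return min(range(len(runs)),
               key=lambda i: (runs[i][0] - mean_acc) ** 2
                             + (runs[i][1] - mean_f1) ** 2)

labels = ["Temp", "Comp", "Cont", "Expa"]
gold = ["Temp", "Comp", "Cont", "Expa", "Expa"]
pred = ["Temp", "Expa", "Cont", "Expa", "Expa"]
acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred, labels)
chosen = closest_to_average([(0.5, 0.4), (0.6, 0.5), (0.7, 0.6)])
```

In the toy example, the single missed Comp instance lowers Macro F1 far more than Accuracy, which is why both metrics are reported for an uneven label distribution.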

Results
We compare our learned discourse-specific word embeddings (DSWE) with two publicly available embeddings:
1) GloVe: trained on 6B words from Wikipedia 2014 and Gigaword 5 using the count-based model in (Pennington et al., 2014), with a vocabulary of 400K and a dimensionality of 300.
2) word2vec: trained on 100B words from Google News using the CBOW model in (Mikolov et al., 2013), with a vocabulary of 3M and a dimensionality of 300.
Results in Table 3 show that using DSWE gains significant improvements (one-tailed t-test with p<0.05) over using GloVe or word2vec, on both Accuracy and Macro F1. Furthermore, using DSWE achieves better performance across all relations on the F1 score, especially for minority relations (Temp, Comp and Cont). Overall, our DSWE can effectively incorporate discourse information in explicit data, and thus benefits implicit discourse relation recognition.
Table 3: Results of using different word embeddings. We also list the Precision, Recall and F1 score for each relation.
We also compare our method with three recent systems which also use explicit data to boost the performance:
1) R&X2015: Rutherford and Xue (2015) construct weakly labeled data from explicit data based on the chosen connectives, to enlarge the training data directly.
2) B&D2016: Braud and Denis (2016) learn connective-based word representations and build a logistic regression model based on them.
3) Liu2016: Liu et al. (2016) use a multi-task neural network to incorporate several discourse-related data, including explicit data and the RST-DT corpus (William and Thompson, 1988).
Results in Table 4 show the superiority of our method. Although Liu2016 performs slightly better on Macro F1, it uses the additional labeled RST-DT corpus. R&X2015 and Liu2016 both incorporate relatively small amounts of explicit data because of the noise problem, for example, 20,000 and 40,000 instances respectively. By contrast, our method benefits from about 4.9M explicit instances. While B&D2016 uses massive explicit data, it is limited by the fact that the maximum dimension of word representations is restricted to the number of connectives, for example 96 in their work. Overall, our method can effectively utilize massive explicit data, and thus is more powerful than baselines.
To give an intuition of what information is encoded into the learned DSWE, we list in Table 5 the top 15 closest words of not and good, according to cosine similarity. We find that, in DSWE, words similar to not have, to some extent, negative meanings. Since declined is similar to not, a classifier may easily identify the implicit instance [A network spokesman would not comment. ABC Sports officials declined to be interviewed.] as the Conjunction relation. For good in DSWE, the similar words no longer include words like bad. Furthermore, the similarity score between good and great is 0.54 while the score between good and bad is just 0.33, which may make it easier for a classifier to distinguish the word pairs (good, great) and (good, bad), and thus is helpful for predicting the Conjunction relation. This qualitative analysis demonstrates the ability of our DSWE to capture the discourse relationships between words.
Finally, we conduct experiments to investigate the impact of connectives used in training DSWE on our results.
Specifically, we use the explicit discourse instances with the top 10, 20, 30, 60 most frequent or all connectives to learn DSWE, accounting for 78.9%, 91.9%, 95.8%, 99.4% or 100% of total instances, respectively. The top 10 most frequent connectives are: and, but, also, while, as, when, after, if, however and because, which cover all four top-level relations defined in the PDTB. As illustrated in Figure 2, with only the top 10 connectives, the learned DSWE achieves better performance than the common word embeddings. We observe a significant improvement when using the top 20 connectives, almost the best performance with the top 30 connectives, and no further substantial improvement with more connectives. These results indicate that we can use only the top n most frequent connectives to collect explicit discourse data for DSWE, which is very convenient for most languages.
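Selecting the top n connectives and measuring how much of the data they cover is a simple counting exercise. The sketch below assumes instances are (arg1, arg2, conn) triples; the function names and the toy counts are hypothetical and are not the statistics reported above.

```python
from collections import Counter

def top_n_coverage(connectives, n):
    """Return the n most frequent connectives and the fraction of
    instances they account for."""
    counts = Counter(connectives)
    top = [c for c, _ in counts.most_common(n)]
    covered = sum(counts[c] for c in top)
    return top, covered / len(connectives)

def filter_instances(instances, keep):
    """Keep only the instances whose connective is in the selected set."""
    keep = set(keep)
    return [inst for inst in instances if inst[2] in keep]

# Toy connective distribution, for illustration only.
conns = ["and"] * 5 + ["but"] * 3 + ["because"] * 2
top, coverage = top_n_coverage(conns, 2)
instances = [(["x"], ["y"], c) for c in conns]
kept = filter_instances(instances, top)
```

The filtered instances would then be used to train DSWE exactly as with the full connective set, only with a smaller output layer.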

Conclusion
In this paper, we learn discourse-specific word embeddings from massive explicit data for implicit discourse relation recognition. Experiments on the PDTB show that using the learned word embeddings as features can significantly boost the performance. We also show that our method can use explicit data more effectively than previous work. Since most neural network models for implicit discourse relation recognition use pre-trained word embeddings as input, we hope that our learned word embeddings will benefit them.