TED-CDB: A Large-Scale Chinese Discourse Relation Dataset on TED Talks

Because genres are known to differ in their communicative properties, and because discourse relations in Chinese have so far been annotated only over news text, we have created the TED-CDB dataset. TED-CDB comprises a large set of TED talks in Chinese that have been manually annotated according to the goals and principles of the Penn Discourse Treebank, adapted to features of Chinese that are not present in English. It serves as a unique Chinese corpus of spoken discourse. Benchmark experiments show that TED-CDB poses a challenge for state-of-the-art discourse relation classifiers, whose F1 performance on 4-way classification is < 60%. This is a dramatic drop of 35% from performance on the news text in the Chinese Discourse Treebank. Transfer learning experiments have been carried out with TED-CDB for both same-language cross-domain transfer and same-domain cross-language transfer. Both demonstrate that TED-CDB can improve the performance of systems being developed for languages other than Chinese and can help where data in other corpora are insufficient or unbalanced. The dataset and our Chinese annotation guidelines have been made freely available. 1


Introduction
Recent years have witnessed increasing attention to the properties of discourse in a wide variety of natural language processing (NLP) tasks, e.g., machine translation (Ohtani et al., 2019; Voita et al., 2019), summarization (Isonuma et al., 2019; Xu et al., 2020), and machine reading comprehension (Mihaylov and Frank, 2019). One of these properties is the coherence between clauses and sentences that arises from shallow discourse relations. As empirical approaches to modeling discourse relations usually require corpora annotated with such relations, the Penn Discourse Treebank (PDTB) (Prasad et al., 2008b), based on the idea that discourse relations are grounded in an identifiable set of discourse connectives or AltLex expressions, has been widely applied in the field of natural language processing. Largely because the PDTB is effective for extracting discourse semantic features, it serves as a useful substrate for the development and evaluation of neural models in many downstream NLP applications (Qin et al., 2017; Nie et al., 2019; Narasimhan and Barzilay, 2015).
1 https://github.com/wanqiulong0923/TED-CDB
Because discourse relations in Chinese have been annotated only over news text, and few of the resulting corpora are freely available, we have created the TED-CDB dataset. TED-CDB currently comprises 72 TED talks in Chinese (∼268.1K words), annotated with 15,540 discourse relations, almost 3 times as many as the CDTB (Zhou and Xue, 2015). Because Tonelli et al. (2010) have shown that discourse relations in spoken discourse are expressed differently than in written text, TED-CDB offers considerable potential for scenarios involving Chinese spoken discourse (e.g., dialogue, spoken language translation).
Our contributions comprise:
• the largest PDTB-style Chinese discourse corpus over spoken monologues (Section 3.1). Table 1 compares TED-CDB with other discourse-annotated Chinese corpora.
• new annotation elements to accommodate Chinese-specific discourse phenomena (Section 3.2).
• benchmark results on Level-2 discourse relation classification for future comparison with other models (Section 5).
• experiments with cross-domain and cross-lingual transfer learning showing that TED-CDB can improve the performance of systems being developed for languages other than Chinese and can help where data in other corpora are insufficient or unbalanced (Section 6).

Corpus | Domain | Total Relations | Availability
CDTB (Zhou and Xue, 2015) | News report | 5,534 | Through LDC
CUHK (Zhou et al., 2014) | News report | - | From owner
HIT-CDTB (Zhang et al., 2013) | Internet news | 21,505 | From owner
NTU (Huang and Chen, 2011a) | Sino and travel set | 3,081 | From owner
TED-CDB (ours) | TED Talks | 15,540 | Freely available
Table 1: Comparison of our corpus to related data sets. "-" means the work does not mention the number.

Related Work
Most PDTB-style annotation has been conducted on written texts originating from news reports. Before 2015, there had been just one corpus for spoken discourse (Tonelli et al., 2010), in which PDTB annotations were constructed on Italian dialogues. Recently, researchers have realized that the PDTB annotation guidelines should be used more widely instead of being confined to corpora of written texts. Zeyrek et al. (2018) annotate 6 TED talks for 7 languages. Scheffler et al. (2019) build a discourse corpus on Twitter conversations. Regarding Chinese corpora for discourse relations, as illustrated in Table 1, there are mainly 4 Chinese discourse corpora based on the PDTB framework (Prasad et al., 2008a). Zhou and Xue (2012) present a PDTB-style discourse corpus for Chinese, which is further expanded to contain 164 documents, namely the Chinese Discourse Treebank (CDTB) (Zhou and Xue, 2015). Huang and Chen (2011b) construct a Chinese discourse corpus with 81 articles. They adopt the top-level senses from the PDTB sense hierarchy and focus on the annotation of inter-sentential relations. Zhou et al. (2014) present the CUHK Discourse Treebank. They adapt the annotation scheme of the Penn Discourse Treebank 2 (PDTB-2) to Chinese and re-annotate the documents of the Chinese Treebank with only inter-sentence explicit discourse relations. The largest Chinese discourse relation corpus for written texts is HIT-CDTB (Zhang et al., 2013), which presents a new Chinese discourse relation hierarchy adapted from the PDTB system. Nevertheless, these four corpora can be acquired only by purchase or by application to their owners.
Therefore, the scarcity of Chinese datasets, and especially the lack of corpora for spoken monologues, motivated us to build TED-CDB.

The TED-CDB Corpus
This section describes the annotation procedure for TED-CDB, including details on the data, annotation scheme, annotation process and agreement study among the annotators.

Data Description
TED talks (TED is short for technology, entertainment and design), as examples of planned spoken monologues delivered to a live audience (Greenbaum et al., 1996), are given by experts from different fields and different countries, and most of them are translated into various languages. Hai and Sandra (2020) indicate that Chinese translations as a whole can be reliably distinguished from texts originally written in Chinese, because texts translated into a target language possess linguistic properties that are very different from those of comparable texts originally written in that language. Hence, we collect two types of TED talks: (1) 16 TED talks originally presented in English and translated into Chinese, and (2) 56 TED talks originally presented in Chinese (in Taipei, Shanghai or Chengdu). Together, these 72 TED talks contain 268,099 words after preprocessing.

Annotation Scheme
Our annotation scheme has been adapted from the PDTB 3.0 relation hierarchy. In the PDTB 3.0 relation hierarchy, there are 4 top-level senses (Expansion, Temporal, Contingency, Comparison) and their second- or, in some cases, third-level senses, as shown in Table 2. To this hierarchy, an additional second-level sense, Progression, has been added under Expansion, specifically for Chinese.
Discourse relations are taken to hold between two abstract object arguments, named Argument 1 and Argument 2. Generally, the arguments are clauses or sentences. Using the PDTB annotator tool, we annotated each explicit connective, identified its two arguments, and then labeled the sense of the explicit relation. For implicit relations, once we had inferred the type of relation between two arguments, we tried to insert a connective for this relation. If a connective conveys more than one sense, or more than one relation can be inferred, multiple senses are assigned to the token. We also apply a set of consistency rules to handle specific linguistic properties of Chinese, such as subject ellipsis and paired connectives.
Table 2: PDTB-3 sense hierarchy (Webber et al., 2019). The Level-2 senses are used in assessing system performance (Section 5.1).
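To make the annotation unit concrete, the following is a minimal sketch of how one annotated relation could be represented in code. The class name, field names, dotted sense-label format and example sentences are all assumptions made for illustration; they are not taken from the TED-CDB release or from the PDTB annotator tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical record layout; field names are illustrative only and do not
# reflect the actual TED-CDB file format.
@dataclass
class DiscourseRelation:
    rel_type: str                  # e.g. "Explicit", "Implicit", "AltLex"
    arg1: str                      # text span of Argument 1
    arg2: str                      # text span of Argument 2
    connective: Optional[str]      # explicit connective, or the one inserted for an implicit relation
    senses: List[str] = field(default_factory=list)  # one or more dotted sense labels

    def level(self, n: int) -> List[str]:
        """Truncate each dotted sense label to its first n levels."""
        return [".".join(s.split(".")[:n]) for s in self.senses]


# Invented example sentences, used only to exercise the structure.
rel = DiscourseRelation(
    rel_type="Implicit",
    arg1="it rained all night",
    arg2="the match was cancelled",
    connective="so",
    senses=["Contingency.Cause.Result"],
)
print(rel.level(1))  # top-level sense(s)
print(rel.level(2))  # Level-2 sense(s)
```

Keeping `senses` as a list mirrors the rule that a token may receive multiple senses, and truncating labels by level matches the way evaluation is reported at the top level and at Level 2.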
As some syntactic and textual contexts could not be annotated in our previous work (Long et al., 2020), we loosen the constraints on arguments, connectives, and the distance between arguments. In this way, more relations are captured on the same texts, revealing the discourse coherence and structure more fully and clearly. The following are the main additions to our annotation scheme, which future efforts at Chinese discourse annotation might consider adopting as well. In the examples throughout the paper, explicit connectives are underlined, while connectives inserted for implicit relations are both underlined and parenthesized. Sense labels are indicated after the connectives.

Relations have been annotated across non-adjacent sentences
In the PDTB, relations between non-adjacent sentences have been annotated only when Arg1 of an explicit connective is not adjacent to Arg2; implicit relations between non-adjacent sentences were not annotated, except in a small-scale study by Prasad et al. (2017) of relations between paragraph-initial sentences and material in the previous text. In contrast, we annotate relations across non-adjacent sentences for both explicit and implicit relations. We believe that this is useful for annotating spoken monologues in general, not just for Chinese, because in communicating with an audience, speakers often insert material meant to explain the details of the first argument. The following are two examples: the first of an explicit relation, and the second of an implicit relation.
(1) "[When we started to do Qzone, we designed a game about collecting treasures]1. [This design may seem a bit weird now, but it worked at the time. Because it helps us retain some users who can't wait, and also avoids the loss of users. So from the early beginning, our space has been closely related to games]2. [Then_ASYNCHRONOUS, we joined to make the game of QQ farm]3."
(2) "[My research tells me that consciousness is not just a manifestation of intelligence, but more about our nature as a living, breathable organism]1. [The difference between consciousness and intelligence is very large. You will feel pain even if you are not smart, but only if you have to live]2. [(Therefore)_RESULT+SPEECHACT stories I want to tell you, our conscious experience of the world around us and our own existence are all controlled illusions, all of which originate from our living bodies]3."
In both examples, the relation holds between sentences 1 and 3: an explicit Temporal relation in the first example and an implicit relation in the second. The sentences added between the two non-adjacent sentences give details for sentence 1. This intervening material is annotated as "Arg2-as-detail" with respect to the given Arg1 (sentence 1).

Verbs can serve as explicit connectives
We follow the practice in the PDTB-3 of using PropBank annotation of modifier relations (ARG-M) to seed intra-sentential discourse relations (Webber et al., 2019). For Chinese, we adopt conventions from the Chinese PropBank annotation (Xue and Palmer, 2009). This allows additional discourse relations to be included. This is the first work to explore Chinese verbs that can signal discourse relations. In this example, the verb "使得" can be identified, and the discourse relation is expressed through a combination of the adverbial of Arg2 and the anaphoric reference to Arg1 as the implicit subject. In terms of Chinese PropBank annotation, "使得我们比赛输了" (made us lose the game) is the ARGM-ADV, and there is a Cause.Result relation between the two clauses. In Example (3), "参加一个16天的德语强化" (attend a 16-day intensive German course) is the purpose and has been labelled as an ARGM-PRP adjunct in the Chinese PropBank (Xue and Palmer, 2003). The verb "去" is rendered as "to" in the English translation. Although "去" is polysemous and often corresponds to "go" in English, it tends to act as a structural auxiliary word in this example. Several other Chinese verbs, such as "来", "让" and "用", have the same function. They signal relation senses such as Condition, Purpose, Result and Manner.

Noun phrases can serve as arguments
Noun phrases have been annotated as arguments previously in Chinese discourse corpora such as the CDTB (Zhou and Xue, 2015). The Chinese NomBank (Xue, 2006) annotates nominalized predicates, and the Chinese Proposition Bank (Xue and Palmer, 2009) performs similar annotation of nominalized verbs. Accordingly, we do not annotate all noun phrases as arguments, but only those that are nominalizations of a verbal form. Chinese verbs and their nominalizations share the same form, so we identify this kind of argument by whether the structure NP + 的 (of) + nominalized predicate appears. Moreover, in this structure, the NP can always be regarded as the object or subject of the nominalized predicate.
The noun phrase in this example is a typical NP + 的 (of) + nominalized predicate structure: the nominalized verb "限制" (limit) can be seen as the predicate of the object NP "自由" (freedom), and the noun phrase can be paraphrased as "我们限制他的自由" (we limit his freedom).

Punctuation can serve as an AltLex
AltLex (Alternative Lexicalization) expressions convey the sense of a discourse relation without being explicit connectives. If AltLex expressions such as "这导致了" (this caused), "一个例子是" (one example is...) or "原因是" (the reason is) appear, the insertion of a connective becomes redundant. Although such expressions have usually been taken to be words, we have found that punctuation such as the colon can also play the role of an AltLex expression. With it, the relation of detail can usually be expressed without adding a connective. Treating the colon as an AltLex expression is a first for PDTB-style corpora in any language. In Example (5), we can see that the colon is sufficient to convey the relation between the clauses. Hence, we have reason to regard it as a special kind of AltLex expression.

Annotation Process
The annotator team comprised a professor as the supervisor, an experienced annotator and a PDTB researcher as counselors, and 6 master's degree candidates as annotators. All annotators are engaged in research on natural language processing and have a theoretical foundation in linguistics. This professional guidance and annotation experience support both the quality and the efficiency of the annotation. To ensure annotation quality, the annotation process had the following phases:
• Training and discussion. The experienced annotator trained the six annotators in training meetings, based on the Chinese tutorial 2 we wrote on the PDTB guidelines and our adaptation scheme;
• Self-pre-annotation. The annotators independently annotated the same texts, finding samples for different relation senses, exploring problems individually and discussing issues together, and the experienced annotator checked their work and provided advice to each of them. This step was repeated three times until the annotators were all well trained;
• Group pre-annotation. To ensure consistency between the annotators, they were divided into two groups to annotate the same texts and compare their annotations;
• Formal annotation. We annotated 10 TED talks per cycle. During each cycle, the annotated texts were handed in to the experienced annotator, who gathered the problems in the annotation and gave suggestions. Uncertain or new issues were discussed in the weekly meeting. After each cycle, we exchanged partners between the groups;
• Check and improve. This phase is critical for minimizing errors.
2 https://github.com/wanqiulong0923/TED-CDB

Agreement Study
To ensure annotation consistency, we measured the annotators' agreement on the relation type (Explicit, Implicit, AltLex, NoRel, EntRel, Hypophora) and on the senses from the top level to the third level. Kappa is a quantitative measure of reliability for two raters rating the same items, corrected for how often the raters may agree by chance. The formulas are:

\[ \kappa = \frac{P_o - P_e}{1 - P_e} \tag{1} \]
\[ P_o = P(\text{consistent}) = \frac{A + D}{A + B + C + D} \tag{2} \]
\[ P_e = P(\text{correct}) + P(\text{incorrect}) = \frac{(A + C)(A + B)}{(A + B + C + D)^2} + \frac{(B + D)(C + D)}{(A + B + C + D)^2} \tag{3} \]

A quantifies instances where both annotators' annotations are correct; D does so where both annotators' annotations are incorrect. B quantifies instances where annotator 1 is incorrect while annotator 2 is correct, while C does the reverse. P_o refers to the agreement rate for 500 instances. P(consistent) quantifies instances where the annotators are consistent. We compute the Kappa value and agreement rate for each pair of annotators and then average the Kappa values and agreement rates over the six annotators. The results of the agreement study are given in Table 3.
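As a worked illustration, the sketch below computes the agreement rate and the standard two-rater Cohen's Kappa from the four counts A, B, C and D defined above. The function name and the counts in the usage example are invented for illustration and are not figures from the agreement study.

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's Kappa from the 2x2 counts described in the text.

    a: both annotators correct
    b: annotator 1 incorrect, annotator 2 correct
    c: annotator 1 correct, annotator 2 incorrect
    d: both annotators incorrect
    """
    n = a + b + c + d
    p_o = (a + d) / n                             # observed agreement rate
    p_correct = ((a + c) / n) * ((a + b) / n)     # chance that both are correct
    p_incorrect = ((b + d) / n) * ((c + d) / n)   # chance that both are incorrect
    p_e = p_correct + p_incorrect                 # expected chance agreement
    return (p_o - p_e) / (1 - p_e)


# Invented counts for 500 doubly annotated instances (not from the paper).
kappa = cohen_kappa(a=400, b=20, c=30, d=50)
print(round(kappa, 4))
```

Averaging this value over all annotator pairs would give the kind of corpus-level figure the text describes.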
As indicated in Table 3, we achieve relatively high agreement rates and Kappa values for the discourse relation type and top-level senses (≥ 0.9). Strong results were also achieved on the second-level and third-level senses, with an agreement rate of 0.85 and a Kappa value of 0.83 for the second level, and an agreement rate of 0.83 and a Kappa value of 0.81 for the third level.

Statistics on TED-CDB
Table 4 shows statistics on TED-CDB. The corpus contains 15,540 discourse relations, almost three times the number of discourse relations in the Chinese Discourse Treebank (Zhou and Xue, 2015). Of these, 5,531 are explicit relations, while 7,015 are implicit. This means that implicit relations are more frequent than explicit ones in Chinese spoken discourse, whereas approximately the same number of explicit and implicit relations are found in the PDTB-3. There is also a large number of AltLex relations (1,034). This type of relation is crucial for automatically identifying discourse relations when no explicit connective is present. In our work, we try to detect all possible AltLex expressions that are capable of conveying discourse relations. The numbers of intra-sentential and inter-sentential relations in PDTB-3 are almost the same, but the discourse relations in our corpus are more commonly annotated within the sentence: there are 9,847 intra-sentential relations and 5,693 inter-sentential relations. The reason perhaps lies in the use of punctuation, which is quite different in Chinese than in English. For example, a comma in Chinese sometimes serves the same function as a full stop in English (Xue and Yang, 2011). Therefore, a long Chinese sentence may require multiple English sentences to express the same content and preserve grammaticality. This may be why there are more intra-sentential relations in Chinese than in English.
We also compared the CDTB and TED-CDB with respect to sense distribution, as displayed in Figures 1(a) and 1(b). CDTB uses an annotation style similar to the PDTB for texts from the Chinese Treebank corpus. Each discourse relation is assigned one of eight discourse relation senses. Although all senses in the CDTB are at the same level of the hierarchy, we can map them to the four top-level relation senses in the PDTB hierarchy according to their definitions: Alternative → Expansion; Causation → Contingency; Conditional → Contingency; Conjunction → Expansion; Contrast → Comparison; Expansion → Expansion; Purpose → Contingency; Temporal → Temporal; Progression → Expansion. As Figure 1(b) shows, most relations in CDTB are Expansion, which accounts for 82%, while the other three types together account for less than a quarter. By contrast, Figure 1(a) shows that TED-CDB has a more balanced and richer distribution over the senses. Expansion is still more frequent than the other types, but it represents only 38%, while Contingency, Temporal and Comparison account for 29%, 18% and 15% respectively. Moreover, there are several different second-level senses under each of the four top-level senses, among which Cause is the most frequent.
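The sense mapping listed above can be written down as a simple lookup table. The sketch below does so and counts top-level senses for a short list of labels; the dictionary and function names are assumptions, and the sample labels are invented rather than drawn from CDTB.

```python
from collections import Counter
from typing import Iterable, List

# Mapping from CDTB senses to PDTB top-level senses, as listed in the text.
CDTB_TO_PDTB_TOP = {
    "Alternative": "Expansion",
    "Causation": "Contingency",
    "Conditional": "Contingency",
    "Conjunction": "Expansion",
    "Contrast": "Comparison",
    "Expansion": "Expansion",
    "Purpose": "Contingency",
    "Temporal": "Temporal",
    "Progression": "Expansion",
}


def to_top_level(cdtb_senses: Iterable[str]) -> List[str]:
    """Map each CDTB sense label to its PDTB top-level sense."""
    return [CDTB_TO_PDTB_TOP[s] for s in cdtb_senses]


# Invented sample labels, only to show how a distribution would be computed.
sample = ["Conjunction", "Causation", "Purpose", "Contrast", "Expansion"]
distribution = Counter(to_top_level(sample))
print(distribution)
```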
To explore the differences between implicit and explicit relations, we compare the distribution of top-level senses between the two relation types. Figures 2(a) and 2(b) show that there are more Contingency relations among the explicits, whereas there are more Expansion relations among the implicits. The statistics also show that, among explicit relations, "因为" (because) and "所以" (so) are the two most frequent connectives.

Experiments
This section describes benchmark experiments for discourse relation recognition on our dataset.

Methods
We fine-tuned state-of-the-art pretrained language models on our corpus to conduct the benchmark test. In particular, we used the following three baselines:
• BERT, a bidirectional encoder from transformers (Devlin et al., 2019), which is trained with two objectives: masked language modeling and next sentence prediction. We adopted two BERT systems: BERT-wwm and BERT-wwm-ext (Cui et al., 2019). "-wwm" denotes whole word masking, which means that if a part of a complete word (i.e., a wordpiece) is replaced by [mask], the other parts of the same word will also be replaced by the mask. "-ext" denotes the model trained on larger data (5.4B).
• ERNIE (115M) 3, a.k.a. Enhanced Representation through Knowledge Integration (Sun et al., 2019), which is trained not only on Wikipedia data but also on community QA, Baike (similar to Wikipedia), etc.
• RoBERTa, a robust BERT. We used RoBERTa-wwm-ext. 4
For all models, we used the default hyperparameters (batch=8, learning rate=2e-5, epoch=10). BERT-wwm (110M) and BERT-wwm-ext have the same hidden size H=768 and are trained on different numbers of tokens (0.4B and 5.4B respectively). RoBERTa-wwm-ext (325M) has hidden size H=1024; it is trained in the same way as RoBERTa but without next sentence prediction and with more training steps. We adopted F1 and accuracy to evaluate both explicit and implicit relation recognition, and we evaluated the tasks on both the top level (4-way classification) and the second level (18-way classification). We used 80% of the dataset as the training set, 10% as the dev set and 10% as the test set.
3 https://github.com/PaddlePaddle/ERNIE
4 https://github.com/ymcui/Chinese-BERT-wwm
Table 6: Results on level-2 discourse relation classification; F1 score (%) for each level-2 relation in the PDTB-3 hierarchy plus the "Progression" sense relation for both explicit and implicit relations on TED-CDB; Total macro F1 and Total Accuracy are for all level-2 senses in the hierarchy; "-" means that this type of sense does not occur in the test set.
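The whole word masking idea described for the "-wwm" models can be illustrated with a small sketch. It assumes the common WordPiece convention in which a piece beginning with "##" continues the previous word; for Chinese models, word boundaries would more plausibly come from a word segmenter, so this is a simplification and not the procedure used to train BERT-wwm. The function name and the example pieces are hypothetical.

```python
import random
from typing import List


def whole_word_mask(pieces: List[str], mask_prob: float, rng: random.Random) -> List[str]:
    """Mask whole words: if a word is selected, all of its pieces are masked."""
    # Group piece indices into words; "##" marks a continuation piece (assumed convention).
    words: List[List[int]] = []
    for i, piece in enumerate(pieces):
        if piece.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    masked = list(pieces)
    for indices in words:
        if rng.random() < mask_prob:      # the decision is made once per word
            for i in indices:
                masked[i] = "[MASK]"
    return masked


# Invented example input.
pieces = ["the", "play", "##ing", "field", "is", "un", "##even"]
masked = whole_word_mask(pieces, mask_prob=0.5, rng=random.Random(0))
print(masked)
```

Because the masking decision is taken per word rather than per piece, "play" and "##ing" are always either both masked or both kept.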

Results
As can be seen from Tables 5 and 6, these pre-trained models perform differently on our dataset, but most of the differences are not large. With respect to 4-way relation classification, all four models achieve high results for the explicit relations, with average accuracy and average F1 both above 92%. This may indicate good annotation consistency for the explicit relations in the corpus. On the other hand, implicit relation classification is much more difficult for the models, with an average accuracy of 60% and an average F1 score of 57%. As for the second level (18-way classification), Table 6 shows that the models still obtain quite good results for explicit relations. However, classifying implicit relations at the second level is more challenging. Even the best model, RoBERTa-wwm-ext, achieves an accuracy of only 49.79% and an F1 of 36.45%. In short, implicit relation classification on TED-CDB remains challenging for state-of-the-art models, and the corpus can be used as a testbed for future efforts devoted to spoken discourse relation recognition.
Table 7: F1 score (%) and total accuracy (%) for level-1 implicit relation classification on CDTB.

Transfer Learning via TED-CDB
We also conducted transfer learning experiments across discourse corpora in different domains and languages. In particular, we considered the following two tasks: (1) training on TED-CDB and testing on CDTB, and (2) training on TED-CDB and testing on TED-MDB. The former is transfer learning across domains within the same language, while the latter is transfer learning across seven languages within the same domain. The goal of these experiments is to investigate whether TED-CDB can help improve the performance of systems being developed for other languages and can help where data in other corpora are insufficient or unbalanced.

Same-Language Cross-Domain Learning
While the best pre-trained models achieve an accuracy of only around 60% for 4-way classification and less than 50% for 18-way classification on implicit relations in our dataset, models in previous work (Rutherford et al., 2017) achieve a significantly higher accuracy of more than 85% on CDTB. Therefore, we used the BERT-wwm model with the same parameters as in our baseline experiments to perform 4-way implicit relation classification on CDTB. Table 7 shows that, although the accuracy of the model is 93.45%, its F1 score is just 38.31%. A closer look at the model performance on each sense shows that this high accuracy can be attributed to the most common sense relation, Expansion, on which the accuracy is 98.92%. However, accuracy for the other, less frequent senses is much lower.
In particular, Temporal gets 0 for both accuracy and F1. The reason is that the sense distribution of CDTB is quite unbalanced: most of the annotated relations are Expansion, as shown in Figure 1(b), while implicit Temporal relations are very few. In other words, the training data for the other sense types are not sufficient. We therefore ask whether TED-CDB is useful as a training set for classifying the three remaining types of relations in CDTB, with relations of the Expansion category removed from both datasets. The model used here is BERT-wwm, with the same parameters as before. Table 8 shows the 3-way implicit relation classification results on TED-CDB and those of zero-shot transfer learning from TED-CDB to CDTB. Compared with the model performance for 3-way implicit relation classification on TED-CDB, Contingency and Comparison get better scores when these three kinds of relations in CDTB are used as the test set for models fine-tuned on TED-CDB. However, for Temporal, the model trained on TED-CDB does not perform well on the CDTB test set. We looked into the test set and discovered that there are only 7 implicit Temporal relations and that the annotation of several of them is not consistent with how we would annotate them. For one such example, we would annotate the relation as Contingency.Condition, whereas in CDTB the sense Temporal is assigned to the two arguments.
Table 8: F1 score (%) and total accuracy (%) for comparison of 3-way implicit relation classification. The left is the result for zero-shot learning from TED-CDB to CDTB, while the right is for TED-CDB.
Table 9: F1 score (%) for cross validation within TED-MDB and zero-shot transfer learning from TED-CDB to TED-MDB. The task is 4-way (level-1) implicit relation classification; Total Macro F1 is for all level-1 senses in each language.
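The gap between a high accuracy and a low F1 on a skewed sense distribution can be reproduced with a toy calculation. The sketch below implements accuracy and macro-averaged F1 in plain Python and applies them to invented labels in which a classifier always predicts the majority sense; the label counts are assumptions chosen for illustration and are not CDTB statistics.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all classes seen in gold or pred."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Invented, heavily skewed gold labels and a majority-class predictor.
gold = ["Expansion"] * 90 + ["Contingency"] * 5 + ["Comparison"] * 3 + ["Temporal"] * 2
pred = ["Expansion"] * 100

acc = accuracy(gold, pred)
mf1 = macro_f1(gold, pred)
print(f"accuracy={acc:.2f}, macro F1={mf1:.4f}")
```

Every minority class contributes an F1 of zero to the average, which is why macro F1 is a more informative measure than accuracy when one sense dominates.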

Same-Domain Cross-Language Learning
TED-MDB (Zeyrek et al., 2018) follows the PDTB 3.0 annotation framework. It contains manual annotations of 6 TED talks in seven languages (English, Turkish, European Portuguese, Polish, German, Russian, and Lithuanian). The sub-corpus for each language is quite small, with about 200 implicit discourse relations, compared with the ∼7.0K implicit relations in TED-CDB. We therefore examine whether TED-CDB can help. For this experiment, we used multilingual BERT, which is as large as BERT-wwm but whose training data cover 104 languages. We used the multilingual BERT implementation from Huggingface. 5 The experiments compare cross validation within TED-MDB with zero-shot transfer learning from TED-CDB to TED-MDB. Because of the unbalanced distribution of senses in TED-MDB, we followed the Easy Ensemble method (Liu, 2009): we divided the Expansion data of every language in TED-MDB into 4 parts, and each part was combined with the data of the other types to form a training set. Finally, we merged these training sets from 6 languages into one training set, and the remaining data, from one language, served as the test set. Therefore, what we used here is 4-fold cross validation where each fold is used as the test set exactly once. The average test set accuracy is then reported. Table 9 shows the results for transfer learning from TED-CDB to TED-MDB and for cross validation within TED-MDB on the task of 4-way implicit relation classification. Comparing the performance with and without TED-CDB as the training set suggests that using the model trained on TED-CDB leads to noticeably better performance for all 7 languages in TED-MDB. In addition, when TED-CDB is used for training, the performance for the 7 languages is close to that on the TED-CDB test set itself.
In particular, it is noteworthy from the table that the performance on Comparison increases dramatically with the model trained on TED-CDB.
5 https://github.com/huggingface/transformers
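The partitioning step used to balance the TED-MDB training data can be sketched as follows. The sketch covers only the splitting of the majority class into parts and the pairing of each part with all minority-class data; it does not cover model training or the merging across languages, and the function name, data layout and toy examples are assumptions made for illustration.

```python
import random
from typing import List, Tuple

Example = Tuple[str, str]  # (text, sense); a hypothetical minimal layout


def balanced_subsets(
    examples: List[Example],
    majority: str = "Expansion",
    n_parts: int = 4,
    seed: int = 0,
) -> List[List[Example]]:
    """Split the majority class into n_parts and pair each part with all other data."""
    rng = random.Random(seed)
    majority_items = [ex for ex in examples if ex[1] == majority]
    minority_items = [ex for ex in examples if ex[1] != majority]
    rng.shuffle(majority_items)
    # Round-robin slicing gives n_parts disjoint parts that cover the majority class.
    parts = [majority_items[i::n_parts] for i in range(n_parts)]
    return [part + minority_items for part in parts]


# Toy data: 8 Expansion examples and 4 examples of other senses (invented).
toy = [(f"exp-{i}", "Expansion") for i in range(8)] + [
    ("con-0", "Contingency"),
    ("con-1", "Contingency"),
    ("tmp-0", "Temporal"),
    ("cmp-0", "Comparison"),
]
subsets = balanced_subsets(toy)
for subset in subsets:
    print(len(subset), sorted({sense for _, sense in subset}))
```

Each resulting subset sees only a quarter of the Expansion examples but all of the rarer senses, which reduces the dominance of the majority class in any single training set.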

Conclusion
We have presented TED-CDB, a large-scale dataset for discourse relations on spoken monologues in Chinese. It is equipped with high-quality annotations and linguistic elements tailored for both Chinese and the genre of spoken monologue. The benchmark results of pretrained language models suggest that TED-CDB is a challenging dataset, which can be used to promote further development on discourse relation recognition and discourse-level NLP tasks. Moreover, we demonstrate that TED-CDB can help address the issue of insufficient or unbalanced data in other corpora and improve the performance of models for other languages.