MATINF: A Jointly Labeled Large-Scale Dataset for Classification, Question Answering and Summarization

Recently, large-scale datasets have vastly facilitated the development in nearly all domains of Natural Language Processing. However, there is currently no cross-task dataset in NLP, which hinders the development of multi-task learning. We propose MATINF, the first jointly labeled large-scale dataset for classification, question answering and summarization. MATINF contains 1.07 million question-answer pairs with human-labeled categories and user-generated question descriptions. Based on such rich information, MATINF is applicable for three major NLP tasks, including classification, question answering, and summarization. We benchmark existing methods and a novel multi-task baseline over MATINF to inspire further research. Our comprehensive comparison and experiments over MATINF and other datasets demonstrate the merits held by MATINF.


Introduction
In recent years, large-scale datasets (e.g., ImageNet (Deng et al., 2009) and SQuAD (Rajpurkar et al., 2016)) have inspired remarkable progress in many areas like Computer Vision (CV) and Natural Language Processing (NLP). On the one hand, well-annotated data provide essential information for training supervised machine learning models. On the other hand, benchmarked datasets make it possible to evaluate and compare the capability of different methods on the same footing.
Due to the high cost of data annotation, existing NLP datasets are usually labeled for only one particular task (e.g., SQuAD (Rajpurkar et al., 2016) for question answering, CNN/DM (Hermann et al., 2015) for summarization and AGNews (Zhang et al., 2015) for text classification). These single-task datasets hinder the development of learning common and task-invariant knowledge (Liu et al., 2017). Although multi-task learning and transfer learning have delivered encouraging results, we still cannot determine whether the improvement is from the extension of input or supervision. Furthermore, task-specific data make the models tend to learn task-specific leakage features rather than meaningful knowledge that could generalize to other tasks. However, as a key step to Artificial General Intelligence (AGI), knowledge acquisition requires the model to learn more general knowledge instead of overfitting on a specific task. Therefore, a large-scale and cross-task dataset is in huge demand for future NLP research. Nevertheless, to the best of our knowledge, none of the existing datasets could meet such demand.
* The first two authors contribute equally to this paper.
† Chenliang Li is the corresponding author.
1 The implementation of MTF-S2S and information about obtaining access to the dataset can be found at https://github.com/WHUIR/MATINF.
In this paper, we propose Maternal and Infant Dataset (MATINF), the first large-scale dataset covering three major NLP tasks: text classification, question answering and summarization. MATINF consists of question answering data crawled from a large Chinese maternity and baby caring QA site. On this site, users can ask questions related to maternity and baby caring. When submitting a question, a detailed description is required to provide essential information and the asker also needs to assign a category for this question from a pre-defined topic list. Each user could submit an answer to a question post, and the asker will select the best answer out of all the candidates. To attract more attention, the askers are encouraged to set rewards using virtual coins when submitting the question and these coins will be given to the user who submitted the best answer selected by the asker. This rewarding mechanism could constantly ensure high-quality answers.
MATINF supports three NLP tasks as follows.
Text Classification. Given a question and its detailed description, the task is to select an appropriate category from the fine-grained category list. Different from previous news classification tasks whose category set is general topics like entertainment and sports, MATINF-C is a fine-grained classification under a single domain. That is, the distance between different categories is smaller, which provides a more challenging stage to test the continuously evolving state-of-the-art neural models.
Question Answering. Given a question, the task is to produce an answer in natural language. This task is slightly different from previous Machine Reading Comprehension (MRC) since the document which contains the correct answer is not directly provided. Therefore, how to collect the domain knowledge from massive QA data becomes extremely important.
Summarization. Given a question description, the task is to produce the corresponding question. Previous summarization datasets are all constructed with news or academic articles. The limited text genres covered in these datasets hinder the thorough evaluation of summarization models. Also, the noisy nature of MATINF encourages more robust models. MATINF can be considered as the first social media summarization dataset.
MATINF holds the following merits: (1) Large. MATINF includes 1.07M unique QA pairs, making it an ideal playground for the new advancements of deeper and larger models (e.g., Pretrained Language Models). (2) Multi-task applicable. MATINF is the first dataset that simultaneously contains ground truths for three major NLP tasks, which could facilitate new multi-task learning methods for these tasks. Here, to set a baseline and inspire future research, we present Multi-task Field-shared Sequence to Sequence (MTF-S2S), a straightforward yet effective model, which achieves better performance on all three tasks compared to its single-task counterparts.

Topic Classification
Topic classification is one of the most fundamental tasks in NLP. As a deeply explored task, many datasets have been used in previous research both in English (AGNews, DBPedia, Yahoo Answer (Zhang et al., 2015), TREC (Voorhees and Tice, 1999)) and Chinese (THUCNews (Sun et al., 2016), SogouCS (Wang et al., 2008a), Fudan Corpus, iFeng and ChinaNews (Zhang and LeCun, 2017)). These datasets were useful and indispensable in the past decades to test the performance of different kinds of classifiers.
However, as most of them consist of formal text and the target categories are general topics, even simply leveraging n-gram features can achieve acceptable results. In addition, some of them are small in scale. Nowadays, with the prevalence of neural models and pretraining techniques, recent algorithms are approaching the ceiling of these datasets with accuracy scores up to 98%. Different from any of the existing datasets, MATINF is more challenging, providing a new stage to test the performance of future algorithms.

Question Answering
Following the definition in (Jurafsky and Martin, 2009), Question Answering (QA) can be generally divided into Information Retrieval (IR) based Question Answering and Knowledge-based Question Answering. For IR-based Question Answering, the answer is often a span in the retrieved document. As for Knowledge-based Question Answering, a human-constructed knowledge base is provided for querying and the answer is in the form of a query result. Recently, Open Domain QA (Chen et al., 2017) has been recognized as a new genre where a natural language response instead of text spans is returned as an answer.
Currently, several datasets are available for Chinese Question Answering. NLPCC Shared Task (Duan and Tang, 2017) provided two datasets for IR-based and Knowledge-based QA, respectively. DuReader (He et al., 2018) is an Open Domain dataset derived from user search logs and provided with human-picked documents as evidence. Zhang and Zhao (2018) provided a QA dataset in the domain of Chinese College Entrance Test history exam questions, with documents from standard history textbooks. Different from these datasets, instead of providing pre-defined documents as evidence, MATINF-QA only contains sufficient QA pairs in the training set. In this way, there are various approaches to exploit these questions as evidence. Thus, MATINF-QA encourages innovations in retrieval, generation and hybrid question answering methods.

Summarization
Summarization datasets can be roughly categorized into extractive and abstractive datasets, which respectively favor extractive and abstractive methods. Extractive datasets are composed of long documents and summaries. Since the summary is long, extracted sentences and spans from the document could compose a good summary. Newsroom (Grusky et al., 2018), ArXiv and PubMed (Cohan et al., 2018) and the CNN / Daily Mail dataset (Hermann et al., 2015) are commonly used extractive datasets.
Abstractive datasets often contain short documents and summaries, which encourages a thorough understanding of the document and style transfer between a document and its corresponding summary. Gigaword (Napoles et al., 2012) and XSum (Narayan et al., 2018) fall into this category. Also, the abstractive dataset LCSTS (Hu et al., 2015), crawled from verified short news feeds of major newspapers and television stations, is the only public dataset for Chinese text summarization to date.
However, all of these existing datasets are composed of either news or academic articles. The narrow sources of these datasets bring two main drawbacks. First, due to the nature of news reporting and academic writing, the summary-eligible contents do not distribute uniformly (Sharma et al., 2019). Second, models evaluated on these noiseless formal-text datasets are not robust enough for real-world applications. To address these problems, we propose MATINF-SUMM, a new abstractive Chinese summarization dataset.

MATINF Dataset
We present the Maternal and Infant (MATINF) Dataset, a large-scale dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby caring in Chinese. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A). An example is shown in Figure 1, and the average character and word numbers of each field are reported in Table 1. We collect nearly two million question-answer pairs with fine-grained human-labeled classes from a large Chinese maternity and baby caring QA site. We conduct both automatic and manual data cleansing and remove: (1) classes with insufficient samples; (2) entries in which the length of the description field is less than the length of the question field; (3) data with any field longer than 256 characters; (4) human-spotted ill-formed data. After the data cleansing, we construct MATINF with the remaining 1.07 million entries.
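The automatic part of the cleansing, rules (1) to (3), can be sketched as a simple filter. This is only an illustration under stated assumptions: the dictionary keys, the function name `clean`, the order in which the rules are applied and the `min_class_size` threshold are all hypothetical, since the paper does not specify them. Rule (4) is manual and is not modeled.

```python
from collections import Counter

MAX_FIELD_CHARS = 256  # rule (3) in the text
FIELDS = ("question", "description", "class", "answer")


def clean(entries, min_class_size=2):
    """Apply cleansing rules (1)-(3) to a list of entry dicts.

    min_class_size is a hypothetical threshold; the paper only says
    "classes with insufficient samples" without giving a number.
    """
    # Rules (2) and (3) are checked per entry.
    kept = [
        e for e in entries
        if len(e["description"]) >= len(e["question"])
        and all(len(e[f]) <= MAX_FIELD_CHARS for f in FIELDS)
    ]
    # Rule (1): drop classes left with too few samples.
    counts = Counter(e["class"] for e in kept)
    return [e for e in kept if counts[e["class"]] >= min_class_size]
```

Applying the per-entry rules before counting class sizes is a design choice of this sketch; counting first would give a slightly different result for classes near the threshold.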
We first randomly split the whole data into training, validation and test sets with a proportion of 7:1:2. Then, we use the splits for summarization and QA. For classification, we further divide the data into two sub-tasks according to different classification standards within each split.
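A random 7:1:2 split of the kind described above might look as follows. The function name, the fixed seed and the rounding of split sizes are assumptions made for this sketch, not details from the paper.

```python
import random


def split_7_1_2(entries, seed=0):
    """Shuffle entries and split them into train/validation/test at 7:1:2."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 7 // 10
    n_valid = n // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder, about 20%
    return train, valid, test
```

Because every entry carries all four fields, the same three splits can be reused for summarization, QA and both classification sub-tasks.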

MATINF-C: Fine-grained Text Classification
In MATINF, the class labels are first selected by the users when submitting a question. Then, if the question is not in the right class, the forum administrators manually re-categorize the question into the correct class. In our data, there are two parallel standards for classifying a question: topic class and age of the baby. We use these two standards to construct our two subsets. Thus, we define two tasks: (1) classifying a question into different age groups; (2) classifying a question into a fine-grained topic. We list the classes of the two tasks in Table 2. Formally, we define the task as predicting the class of a QA pair with its question and description fields (i.e., Q, D → C). Different from previous datasets, our task is a fine-grained classification (i.e., to classify documents in a domain) rather than classifying general topics (e.g., politics, sports, entertainment), which means the semantic difference between classes is markedly smaller. It requires meticulous exploitation of semantics instead of recognizing unique n-gram features for each class. We provide a statistical comparison of MATINF-C with other datasets in Table 3.

MATINF-QA: Health-Domain Question Answering
Typically, to return an answer for a specific question, the model needs to retrieve from a pre-defined document set or query a manually-constructed knowledge base. MS-MARCO (Nguyen et al., 2016) utilizes a search engine to pre-filter 10 documents from the Internet and uses them as the document set. However, searching itself is a challenging task that significantly affects the final performance.
On the other hand, in a real-world scenario, it is impossible to define a document set covering all knowledge needed to answer a user question. Thus, we provide the training set of MATINF-QA as the possible document source and encourage all kinds of methods including retrieval, generation and hybrid models. Formally, the task is defined as replying to a question with natural text (i.e., Q → A). The large scale of our dataset ensures that a model is able to generalize and learn enough knowledge to answer a user question. Note that we do not use the description when defining this task since we observe a negative effect on generalization in our experiments. Table 4 lists statistics of MATINF-QA and other commonly used datasets.

MATINF-SUMM: Summarization in Professional Domain
All current datasets for summarization are in the domain of news and academic articles. However, following the conventions of news reporting and academic writing, in extractive datasets the summary-eligible contents often appear at the beginning or the end of an article, which prevents the summarization model from fully understanding the text and results in impractically high performance in evaluation. On the other hand, current abstractive datasets are all formal news datasets, which lack diversity. Models trained on such a single-source dataset are not robust enough to handle real-world complexity.
In MATINF-SUMM, the question description can be seen as an extended and specific version of the question itself, containing more detailed background information with respect to the question. Besides, the question itself is often a well-formed interrogative sentence rather than extracted phrases. Our task is to generate the question from the corresponding description (i.e., D → Q). Note that our task itself can support many meaningful real-world applications, e.g., generating an informative title for user-generated content (UGC). Also, there is only one public dataset for summarization in Chinese to date. Our dataset can be used to verify the effectiveness of existing models and eliminate the overfitting bias caused by evaluation on merely one dataset. We compare MATINF-SUMM with other datasets in Table 5.

Multi-task Learning
Recently, many attempts have been made on multi-task learning in NLP, and several benchmarks are available for multi-task evaluation (Wang et al., 2019a,b). Though recent studies show that multi-task learning is effective, one question remains open. When training models on multiple tasks, multiple datasets are used by default, so, as illustrated in Figure 2(a), each added task brings both new input (X) and new supervision (Y). Raffel et al. (2019) have shown that corpora (X) from different sources can make the model more robust and significantly improve the performance. Consequently, it is not easy to determine whether the success of a multi-task model should be mainly attributed to the addition of X or Y. However, as depicted in Figure 2(b), in MATINF, the joint labeling guarantees that X remains the same as in a single task and only Y is added. Thus, MATINF provides a fair and ideal stage for exploring multi-task learning, especially auxiliary and multi-task supervision under a single dataset.
To set a baseline and also inspire future research, we design a multi-task learning network, named Multi-task Field-shared Sequence to Sequence (MTF-S2S). We illustrate the architecture of MTF-S2S in Figure 3. For generation tasks, we combine the summarization (D → Q) and QA (Q → A) to be the form of D → Q → A, with a shared Long Short-Term Memory (LSTM) for decoding questions in the summarization task and encoding questions for both QA and classification tasks. Previous studies often share layers among tasks to regularize the representation learning, as illustrated in Figure 2(c). Different from that, MTF-S2S shares on both module level (i.e., field encoder/decoder, as shown in Figure 2(d)) and layer level. An attention mechanism is applied when decoding for summarization and QA. Also, we concatenate the encoded representations of description and question, and feed the result to a shared fully connected layer and then to specialized fully connected layers for age classification and topic classification, respectively.
Figure 3: The architecture of MTF-S2S. Note that a common attention mechanism (Luong et al., 2015) is applied when decoding question and answer (in the blue and green boxes), but we do not illustrate it in this figure for clarity.
When training, since the sizes of datasets for different tasks are not equal, we first determine the batch size for different tasks to make sure that the training progress for each task is approximately synchronized by:
$$\frac{bs_i}{n_i} \approx \frac{bs_j}{n_j}, \quad \forall i, j \in T,$$
where $T$ includes four tasks: summarization, QA, and two classification tasks. $bs_*$ is the batch size of each task, and $n_*$ is the number of samples in the dataset for that task. If one task is iterated to the last data batch, it will start over from the first batch.
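The synchronization rule can be sketched as below, assuming it means that batch sizes are proportional to dataset sizes, so that all tasks finish an epoch after roughly the same number of steps. The function names, the choice of a reference task and the sample counts in the usage note are hypothetical and are not taken from the paper.

```python
def synced_batch_sizes(n_samples, ref_task, ref_batch_size):
    """Scale batch sizes so every task needs about the same steps per epoch.

    n_samples maps task name -> number of training samples.
    """
    ratio = ref_batch_size / n_samples[ref_task]
    return {task: max(1, round(n * ratio)) for task, n in n_samples.items()}


def cycle_batches(data, batch_size):
    """Yield batches forever, restarting from the first batch at the end."""
    if not data:
        raise ValueError("data must not be empty")
    while True:
        for start in range(0, len(data), batch_size):
            yield data[start:start + batch_size]
```

For instance, with made-up counts `{"summ": 1000, "qa": 1000, "topic": 200, "age": 800}` and a reference batch size of 64 for summarization, the smaller classification subsets receive proportionally smaller batches.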
For each iteration, we successively calculate the cross-entropy loss for each task on one batch. Then, we train the model to minimize the total loss:
$$\mathcal{L} = \sum_{i \in T} \lambda_i \mathcal{L}_i,$$
where $\mathcal{L}_i$ is the loss of task $i$ and $\lambda_*$ is the manually set weight for each task. We stop the co-training after one epoch, then fine-tune the model separately for each task to obtain its peak performance.
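As a minimal illustration of the weighted objective, the following computes a cross-entropy loss from raw logits and combines per-task losses with manually set weights. The function names and the task keys used in the usage note are assumptions for this sketch; a real implementation would use a deep learning framework and averaged batch losses.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one example, computed from unnormalized logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def total_loss(task_losses, weights):
    """Weighted sum of per-task losses; weights are set manually."""
    return sum(weights[task] * loss for task, loss in task_losses.items())
```

With four tasks and every weight set to 0.25, as in the experimental settings, the total loss is simply the mean of the four task losses.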

Experiments
In this section, we benchmark a few baselines and MTF-S2S on the three tasks of MATINF. We run each experiment with three different random seeds and report the average result of the three runs.

Experimental Settings
MTF-S2S. For MTF-S2S, we set all $\lambda_i = 0.25$ and use an Adam (Kingma and Ba, 2015) optimizer to co-train the model for one epoch with a learning rate of 0.001 and batch sizes of 64, 64, 12 and 52 for $bs_{Summ}$, $bs_{QA}$, $bs_{CTopic}$ and $bs_{CAge}$, respectively. Then we fine-tune the model for each task with a learning rate of $5 \times 10^{-5}$. We report both the performance after co-training and after fine-tuning. The hidden size of all LSTM encoders/decoders and attentions is 200. For each task, we also train MTF-S2S on that task alone to provide a single-task baseline. Both MTF-S2S and Seq2Seq baselines are character-based and their embeddings are initialized with Tencent AI Lab Embedding (Song et al., 2018). For both MTF-S2S and Seq2Seq baselines, we use Beam Search (Wiseman and Rush, 2016) when decoding.
Classification. For classification, we conduct experiments with a statistical learning baseline, several deep neural networks and pretrained large-scale language models. For the statistical baseline, we extract character-based unigram and bigram features and use a logistic classifier to predict the classes. For neural networks, we choose fastText (Grave et al., 2017), Text CNN (Kim, 2014), DCNN (Kalchbrenner et al., 2014), RCNN (Lai et al., 2015) and DPCNN (Johnson and Zhang, 2017). As a classical step in Chinese text classification, we segment the sentences into words with Jieba, a commonly used out-of-the-box word segmentation toolkit. We then initialize the word embedding with pretrained Tencent AI Lab Embedding (Song et al., 2018) except for fastText, which has its own algorithm to construct word embeddings. We minimize the cross-entropy with the Adam (Kingma and Ba, 2015) optimizer with a learning rate of 0.001 and apply early stopping. For language models, we fine-tune BERT (Devlin et al., 2019) and ERNIE, both of which have released official pretrained Chinese models. We set the learning rate for fine-tuning to $5 \times 10^{-5}$ and apply early stopping. We also compress the fine-tuned 12-layer BERT model with BERT-of-Theseus (Xu et al., 2020) and obtain the performance of a 6-layer model.
Question Answering. For retrieval-based QA, following MS-MARCO (Nguyen et al., 2016), we calculate the average best scores between each answer in the test set and all answers in the training set within the same class, to determine the oracle retrieval performance. Then, we construct our retrieval-based baseline by fine-tuning BERT-Base (Devlin et al., 2019) for question matching on an external dataset, LCQMC. Then we use the trained model to score the match between each question in the test set and all questions in the training set with the same class and return the answer of the top-1 matched question.
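The retrieval-based baseline described above follows a simple pattern: restrict candidates to the same class, score each training question against the test question, and return the answer of the best match. The sketch below shows only that control flow. The scoring function here is a character-bigram Jaccard overlap used as a cheap stand-in; it is not the fine-tuned BERT matcher used in the paper, and all names and dictionary keys are hypothetical.

```python
def char_bigrams(text):
    """Set of character bigrams in a string."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a, b):
    """Stand-in matcher: bigram Jaccard overlap (NOT the BERT matcher)."""
    sa, sb = char_bigrams(a), char_bigrams(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def retrieve_answer(question, cls, train, score=jaccard):
    """Return the answer of the top-1 matched training question in cls."""
    candidates = [e for e in train if e["class"] == cls]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: score(question, e["question"]))
    return best["answer"]
```

Any pairwise scorer with the same signature, such as a trained question-matching model, can be passed through the `score` parameter. Scoring every same-class candidate is linear in the size of the class, which is why exhaustive matching with a large model is slow at inference time.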
For generation-based baselines, we use character-based Seq2Seq (Sutskever et al., 2014) and Seq2Seq with Attention (Luong et al., 2015), since character-based methods perform markedly better for Chinese text generation (Hu et al., 2015; Li et al., 2019). The evaluation metrics are ROUGE scores (Lin and Hovy, 2003) calculated on the character level.
Summarization. We categorize the baselines into two groups: extractive methods (i.e., extracting sentences or phrases from the text) and abstractive methods (i.e., generating summaries according to the text). For extractive methods, we choose two widely used classical methods, TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004). For abstractive methods, we use WEAN and Global Encoding along with Seq2Seq (Sutskever et al., 2014; Luong et al., 2015) as the baselines. We also add BertAbs (Liu and Lapata, 2019), a BERT-based summarization model, to reflect the recent progress on this task. We use the officially released Chinese BERT-Base as the backbone. We use ROUGE scores (Lin and Hovy, 2003) to evaluate the quality of generated summaries.
Table 6 (excerpt): Accuracy on the two MATINF-C sub-tasks.
Model                                AGE     TOPIC
Text CNN (Kim, 2014)                 90.95   64.41
DCNN (Kalchbrenner et al., 2014)     90.96   64.60
RCNN (Lai et al., 2015)              90.81   63.56
fastText (Grave et al., 2017)        87.76   61.81
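Character-level ROUGE, used above for both QA and summarization, treats every character as a token. The following is a minimal sketch of ROUGE-L at the character level based on the longest common subsequence (LCS). It reports a plain F1 score, which is an assumption: the paper does not state which ROUGE-L variant or weighting is used, and official toolkits may differ in details.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Character-level ROUGE-L F1 between two strings."""
    cand, ref = list(candidate), list(reference)  # one token per character
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Working on characters avoids any dependence on a Chinese word segmenter, so scores are comparable across systems that tokenize differently.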

Results and Analysis
Classification. We show the experimental results of two classification sub-tasks in Table 6. On the tougher MATINF-C-TOPIC, language models markedly outperform other baselines. Among non-LM neural networks, DPCNN (Johnson and Zhang, 2017), which has the deepest architecture and the most parameters, outperforms other baselines by a considerable margin. On MATINF-C-AGE, which is a smaller dataset with fewer classes, DPCNN outperforms all other baselines including language models with an accuracy of 91.02. One explanation is that this task has fewer training samples, which favors a model with a moderate number of parameters over the very large parameter counts of language models. Also, the task is relatively easier due to the smaller number of classes, which makes the advantage of language models less pronounced. For the multi-task baseline, MTF-S2S shows a satisfying performance on both MATINF-C-AGE and MATINF-C-TOPIC, outperforming the same model trained only on the single task by 0.14 and 0.19 in terms of accuracy. Notably, BERT-of-Theseus (Xu et al., 2020) performs well when compressing the fine-tuned BERT into a smaller model.
Question Answering. The experimental results are shown in Table 7. The high scores of Best Passage (maximum possible performance) indicate that using training data as a document set is completely feasible. Seq2Seq with Attention outperforms the retrieval-based baseline by a margin of 2.56 in terms of ROUGE-L. This suggests that a generation-based neural network can effectively learn from multiple relevant samples and generalize. Besides, since we match each question against every entry within the same class in the training set, the inference of BERT Matching takes quite a long time. Similar to MS-MARCO (Nguyen et al., 2016), it is possible to use a search engine (e.g., Elastic Search) to pre-filter the documents and reduce the computational cost. Meanwhile, MTF-S2S is effective on the QA task and outperforms its single-task version by 0.74 on ROUGE-L.
Summarization. We further conduct a performance comparison for summarization across three datasets, CNN/DM (Hermann et al., 2015), LCSTS (Hu et al., 2015), and our MATINF-SUMM in Table 8. By comparing the performance of two basic baselines, TextRank (Mihalcea and Tarau, 2004) and Seq2Seq+Att (Luong et al., 2015), we can see an obvious difference in performance between extractive and abstractive methods on datasets of different genres. BertAbs (Liu and Lapata, 2019), the powerful BERT-based model, significantly outperforms all other baselines on MATINF-SUMM thanks to its exploitation of pretraining and the capacity of a BERT model. MTF-S2S outperforms its single-task counterpart by 4.73 on ROUGE-L.

Discussion
Since MATINF is a web-crawled dataset, it is inevitably noisier than a dataset annotated by hired annotators, though we have made every effort to clean the data. On the bright side, this noise can encourage more robust models and facilitate real-world applications. For future work, we would like to see more work exploring new multi-task learning approaches.

Conclusion
To conclude, in this paper, we present MATINF, a jointly labeled large-scale dataset for classification, question answering and summarization. We benchmark existing methods and a straightforward baseline with a novel multi-task paradigm on MATINF and analyze their performance on these three tasks. Our extensive experiments reveal the potential of the proposed dataset for accelerating the innovations in the three tasks and multi-task learning.