DUT-NLP at MEDIQA 2019: An Adversarial Multi-Task Network to Jointly Model Recognizing Question Entailment and Question Answering

In this paper, we propose a novel model called Adversarial Multi-Task Network (AMTN) for jointly modeling the Recognizing Question Entailment (RQE) and medical Question Answering (QA) tasks. AMTN utilizes a pre-trained BioBERT model and an Interactive Transformer to learn shared semantic representations across the two tasks through a parameter sharing mechanism. Meanwhile, an adversarial training strategy is introduced to separate the private features of each task from the shared representations. Experiments on the BioNLP 2019 RQE and QA Shared Task datasets show that our model benefits from the shared representations of both tasks provided by multi-task learning and adversarial training, and obtains significant improvements over the single-task models.


Introduction
With the rapid development of the Internet and medical care, online health queries are increasing at a high rate. In 2012, 59% of U.S. adults looked for health information online. However, it is difficult for search engines to return relevant and trustworthy health information every time if the symptoms are not accurately described (Pletneva et al., 2012; Scantlebury et al., 2017). Therefore, many websites provide online doctor consultation services, in which doctors or experts answer questions or give advice to customers. Unfortunately, manually answering simple queries, or answering similar questions multiple times, is time-consuming and wasteful. A Question Answering (QA) system that can automatically understand and answer the health care questions asked by customers is therefore urgently needed (Wren, 2012).
The RQE task aims to identify the entailment relation between two questions in the context of QA (Abacha and Fushman, 2016), defined as: "a question Q1 entails a question Q2 if every answer to Q2 is also a complete or partial answer to Q1".
The QA task aims to automatically filter and improve the ranking of automatically retrieved answers (Abacha and Fushman, 2019). It has two targets: (1) determining whether a given sentence can answer a given question; (2) ranking all the correct answers according to their relevance to the question.
Neural networks and deep learning (DL) currently provide the best solutions for the RQE and QA tasks. Among various neural networks, such as traditional Convolutional Neural Networks (CNN) (LeCun et al., 1998) and Long Short-Term Memory (LSTM) networks (Hochreiter and Schmidhuber, 1997), the Transformer (Vaswani et al., 2017) has demonstrated superiority in multiple natural language processing tasks (Verga et al., 2017; Shen et al., 2017; Yu et al., 2018). The Transformer is based solely on attention mechanisms and can effectively capture long-range dependencies between words.
More recently, pre-trained language models, such as ELMo (Matthew et al., 2018), OpenAI GPT, and BERT (Devlin et al., 2018), have shown their effectiveness at capturing deep semantic and syntactic information about words. BioBERT (Lee et al., 2019) is a BERT-based pre-trained language model for the biomedical domain, and it achieves great improvements on many biomedical tasks. For this reason, we believe that pre-trained language models, especially BioBERT, should also be effective for RQE and QA when used appropriately.
Most previous research trains models for the RQE task or the QA task separately on a single training set. However, such single-task methods cannot provide essential mutual support between the two tasks, whose inherent interactions might help us do even better on both. The RQE task can find Frequently Asked Questions (FAQs) similar to a consumer health question, providing consumers with appropriate FAQs and enabling QA systems to identify the right answers with greater precision and higher speed (Harabagiu and Hickl, 2006).
Multi-Task Learning (MTL) is a learning paradigm in machine learning whose aim is to leverage the shared representations contained in multiple related tasks to improve the generalization performance of all the tasks. MTL is usually implemented through parameter sharing of hidden layers. Hard parameter sharing is the most commonly used approach to MTL in neural networks: the hidden layers are shared between all tasks, while several task-specific output layers are kept. However, it is difficult for MTL alone to distinguish the commonalities and differences between tasks.
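As a toy illustration (all names, shapes, and initializations here are hypothetical, not the paper's actual architecture), hard parameter sharing amounts to one shared hidden layer feeding several task-specific output heads:

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared hidden layer: the same weights serve both tasks (hard sharing).
W_shared = rng.normal(size=(8, 16)) * 0.1

# Task-specific output heads, one per task.
W_rqe = rng.normal(size=(16, 2)) * 0.1   # RQE head: entailment vs. not
W_qa = rng.normal(size=(16, 2)) * 0.1    # QA head: answers vs. not

def forward(x, task):
    h = np.maximum(x @ W_shared, 0.0)               # shared representation (ReLU)
    logits = h @ (W_rqe if task == "rqe" else W_qa)  # task-specific head
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # softmax over the task's labels

x = rng.normal(size=8)
p_rqe = forward(x, "rqe")
p_qa = forward(x, "qa")
```

Gradients from both tasks' losses flow into `W_shared`, which is what lets each task regularize the other.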
A common way to improve the robustness of a system is to train it on different datasets through adversarial training (Goodfellow et al., 2014). Chen et al. (2017) propose a shared-private model, which extracts shared features and private features from multiple corpora and introduces adversarial training for shared representation learning. Drawing on these previous studies, we use an adversarial multi-task framework to extract a noise-robust shared representation directly.
Considering the similarity between the RQE and QA tasks, this paper proposes a novel Adversarial Multi-Task Network (AMTN) to jointly model the two tasks. Specifically, AMTN first utilizes BioBERT as an embedding layer to generate context-dependent word representations. Then, a common Interactive Transformer layer is introduced for sentence representation learning and inter-sentence relationship modeling, which allows knowledge transfer between tasks. Finally, two task-specific classifiers are used for the RQE and QA tasks respectively. Here, we only consider target (1) of the QA task for multi-task learning, to ensure consistency between the RQE and QA tasks. Furthermore, to prevent the shared and private feature spaces from interfering with each other, an adversarial training strategy is introduced to make the shared feature representations more compatible and task-invariant across tasks. Experimental results show that our AMTN model improves the performance of both the RQE and QA tasks over the single-task models, which demonstrates the superiority of the adversarial multi-task strategy.
Our contributions can be summarized as follows:
• A well-designed Interactive Transformer layer is introduced for sentence representation learning and inter-sentence relationship modeling.
• A novel adversarial multi-task strategy is introduced to jointly model the RQE and QA tasks, in which multi-task learning is used for shared representation learning and adversarial training is used to make the shared representations purer and task-invariant.

Method
This section gives a detailed description of the proposed AMTN, which is shown in Figure 1. AMTN mainly consists of three parts: a shared encoder, a task discriminator, and two classifiers (one each for the RQE and QA tasks). The shared encoder learns shared semantic representations across the tasks through a parameter sharing mechanism. The task discriminator forms an adversarial game with the shared encoder to separate the private features of each task from the shared representations. The two task-specific classifiers judge whether a sentence pair is in an entailment relationship (RQE task) or a question-and-answer relationship (QA task).
Next, we will use four subsections to introduce our AMTN model in detail: Data Preprocessing, Shared Representation Learning, Task Specific Classifier and Adversarial Training.

Data Preprocessing
Define the dataset of the k-th task as D_k = {(X_i^(k), y_i^(k))} for i = 1, ..., N_k, where X_i^(k) is the i-th sentence pair, y_i^(k) is the corresponding label of X_i^(k), and N_k is the number of training examples in the k-th task. In this paper, k = 1 refers to the RQE task and k = 2 refers to the QA task.
Each input is a sentence pair X = (x_A, x_B), where x_A = (w_1^A, ..., w_n^A), x_B = (w_1^B, ..., w_m^B), and n, m are the sequence lengths. In particular, since the answers in the QA task are too long, we take only their first sentence as x_B.
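The first-sentence truncation of long QA answers could be implemented, for example, with a simple punctuation-based split (a sketch only; the paper does not specify which sentence splitter was used):

```python
import re

def first_sentence(answer: str) -> str:
    """Keep only the first sentence of a long answer, to be used as x_B."""
    # Split once on sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", answer.strip(), maxsplit=1)
    return parts[0]

ans = "Aspirin reduces fever. It also thins the blood. Consult a doctor."
short = first_sentence(ans)  # -> "Aspirin reduces fever."
```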

Shared Representation Learning
We use the shared encoder to learn the shared representations as the input for the classifiers and the task discriminator. Figure 2 illustrates the architecture of shared encoder, which contains BioBERT Embedding Layer, Interactive Transformer Layer and Combination Layer.

BioBERT Embedding Layer:
BioBERT is a domain specific language representation model pre-trained on large-scale biomedical corpora (Lee et al., 2019). It could effectively enhance the learning ability of encoding biomedical information.
We use BioBERT as an embedding layer, treating the final hidden representation of each word as its word embedding. Given the input sequence X, BioBERT produces the hidden representation sequences h_A and h_B for the two sentences, together with a flag representation h_C for the special classification token.

Interactive Transformer Layer: To effectively capture long-range dependency information and establish an interaction between the two sentences, the hidden representation sequences h_A and h_B are fed to an Interactive Transformer layer. The Interactive Transformer consists of N blocks, each of which contains a multi-head attention with an interactive process. Multi-head attention performs scaled dot-product attention multiple times on linearly projected queries (Q), keys (K) and values (V):

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_K)) V

where d_K is the dimension of K. Vaswani et al. (2017) point out that when the input of the softmax grows large in magnitude, it pushes the softmax function into regions where it has extremely small gradients. Therefore, the dot products are scaled by 1 / sqrt(d_K) to counteract this effect.
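The scaled dot-product attention above can be sketched in NumPy as follows (the shapes are illustrative, not the model's actual dimensions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_K)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # scaling keeps softmax gradients healthy
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 64))   # 5 query positions, d_K = 64
K = rng.normal(size=(7, 64))   # 7 key positions
V = rng.normal(size=(7, 64))
out = scaled_dot_product_attention(Q, K, V)   # one output vector per query position
```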
For the first sentence x_A, we take its hidden representation h_A as Q and h_B as K and V. In this way, the information flow inside the features of sentence x_B is dynamically conditioned on the features of sentence x_A. The inputs for the first sentence x_A can be represented as:

Q_A = h_A, K_A = V_A = h_B

For the second sentence x_B, we take h_B as Q and h_A as K and V:

Q_B = h_B, K_B = V_B = h_A

Therefore, the multi-head attention with interactive process for the given pair of sentences can be formulated as:

MultiHead(Q, K, V) = Concat(head_1, ..., head_L) W^O, with head_l = Attention(Q W_l^Q, K W_l^K, V W_l^V)

where Concat(head_1, ..., head_L) is a concatenation of the outputs of the L heads. Different from the original Transformer, in which the inputs Q, K and V are all the same, the Interactive Transformer takes different sentences as the inputs of Q and of K, V. In this way, we expect to effectively compute dependencies between any two words of the sentence pair and encode abundant semantic information for each word of the sequence.
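A minimal NumPy sketch of this interactive multi-head attention follows; the head count, dimensions, and random initialization are hypothetical, not the paper's actual hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 64, 4
d_head = d_model // n_heads

# One set of Q/K/V projections per head (toy initialization).
Wq = rng.normal(size=(n_heads, d_model, d_head)) * 0.1
Wk = rng.normal(size=(n_heads, d_model, d_head)) * 0.1
Wv = rng.normal(size=(n_heads, d_model, d_head)) * 0.1

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def interactive_multi_head(h_q, h_kv):
    """Queries come from one sentence, keys/values from the other."""
    heads = [attention(h_q @ Wq[i], h_kv @ Wk[i], h_kv @ Wv[i])
             for i in range(n_heads)]
    return np.concatenate(heads, axis=-1)   # concatenation of the L head outputs

h_A = rng.normal(size=(5, d_model))   # sentence x_A, n = 5 tokens
h_B = rng.normal(size=(7, d_model))   # sentence x_B, m = 7 tokens

out_A = interactive_multi_head(h_A, h_B)   # Q = h_A, K = V = h_B
out_B = interactive_multi_head(h_B, h_A)   # Q = h_B, K = V = h_A
```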
Combination Layer: After modeling the association between the two sentences, we apply max pooling to obtain the final shared semantic representations of x_A and x_B respectively:

S_A = MaxPooling(ĥ_A), S_B = MaxPooling(ĥ_B)

Then, we combine S_A, S_B, and the flag representation h_C through a dense layer to generate the sentence pair representation S for classification:

S = ReLU(W [S_A; S_B; h_C] + b)

where W and b are trainable parameters.
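The pooling and combination steps can be sketched as follows (dimensions and initialization are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

h_A_hat = rng.normal(size=(5, d))   # Interactive Transformer output for x_A
h_B_hat = rng.normal(size=(7, d))   # Interactive Transformer output for x_B
h_C = rng.normal(size=d)            # flag (classification token) representation

# Max pooling over the token dimension.
S_A = h_A_hat.max(axis=0)
S_B = h_B_hat.max(axis=0)

# Dense combination layer with trainable W, b (toy values here).
W = rng.normal(size=(3 * d, d)) * 0.1
b = np.zeros(d)
S = np.maximum(np.concatenate([S_A, S_B, h_C]) @ W + b, 0.0)  # ReLU
```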

Task Specific Classifier
For each task, a specific classifier is employed to judge whether a sentence pair is in an entailment relationship (RQE) or a question-and-answer relationship (QA). Each classifier is a two-layer fully-connected neural network, with a ReLU nonlinearity after the first fully connected layer and a softmax after the second:

ŷ = softmax(W_2 ReLU(W_1 S + b_1) + b_2)

The classifier takes the sentence pair representation S as input and outputs a probability distribution predicting whether the current sentence pair is in an entailment relation or a question-and-answer relation.
Both classifiers are trained by optimizing the cross-entropy loss:

J_task = -Σ_i Σ_{j=1}^{C} y_{i,j} log(ŷ_{i,j})

where C is the number of label categories, y_{i,j} is the gold label, and ŷ_{i,j} is the predicted probability of the j-th category for the i-th sentence pair.
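Taken together, the classifier and its cross-entropy loss can be sketched as follows (layer sizes and initialization are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, hidden, C = 64, 32, 2   # C = number of label categories

W1, b1 = rng.normal(size=(d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.normal(size=(hidden, C)) * 0.1, np.zeros(C)

def classify(S):
    """Two-layer classifier: ReLU after layer 1, softmax after layer 2."""
    h = np.maximum(S @ W1 + b1, 0.0)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(y_true, y_hat):
    """-sum_j y_j log(y_hat_j) for a single sentence pair."""
    return -np.sum(y_true * np.log(y_hat + 1e-12))

S = rng.normal(size=d)
y_hat = classify(S)
loss = cross_entropy(np.array([1.0, 0.0]), y_hat)   # gold label = category 0
```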

Adversarial Training
In order to make the shared representations contain more common information and reduce the mixing of task-specific information, adversarial training is introduced into the above multi-task framework.
The goal of the proposed adversarial training strategy is to form an adversary with shared representation learning by introducing a task discriminator. In this paper, we take the shared encoder as the generative network G and the task discriminator as the discriminative model D, where G needs to learn as much semantic information as possible from the data distribution shared between the two tasks, and D aims to determine which task (RQE or QA) the input sentence belongs to using the shared representations.
Specifically, we first use the shared encoder G(X; θ_s), described in Section 2.2, to obtain the sentence pair representation S, where θ_s denotes the shared parameters to be trained. Then, the shared representation is fed to the task discriminator D to determine the task to which the current input belongs:

D(S; θ_d) = softmax(W_d S + b_d)

where θ_d = {W_d, b_d} are the discriminator parameters. Besides the task losses for RQE and QA, we additionally introduce an adversarial loss J_adv to prevent task-specific features from creeping into the shared space and thus obtain a purer shared representation. J_adv is optimized in an alternating fashion:

J_adv = min_{θ_s} max_{θ_d} Σ_k Σ_i t_i^(k) log D(G(X_i^(k); θ_s))

where t_i^(k) is the correct task label (RQE task or QA task) of the given sentence pair X_i^(k). The basic idea is that the shared representations learned by the shared encoder need to mislead the task discriminator, while the task discriminator needs to predict the task (RQE or QA) to which the data belongs as accurately as possible. The two are adversarial to each other and are optimized alternately to separate the private features from the shared representations.
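The alternating optimization can be illustrated with a toy encoder and a binary logistic discriminator (everything here is a deliberate simplification of the paper's softmax discriminator: shapes, learning rate, and update rule are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

theta_s = rng.normal(size=(d, d)) * 0.1   # shared-encoder parameters (G)
theta_d = rng.normal(size=d) * 0.1        # task-discriminator parameters (D)

def G(x):
    """Toy shared encoder: one tanh layer."""
    return np.tanh(x @ theta_s)

def D(s):
    """P(task = QA | shared representation), numerically stable sigmoid."""
    return 0.5 * (1.0 + np.tanh(0.5 * (s @ theta_d)))

lr = 0.1
for step in range(100):
    x, t = rng.normal(size=d), float(step % 2)   # alternate RQE (0) / QA (1) samples
    s = G(x)
    p = D(s)
    # D-step: descend the discrimination loss (predict the task correctly).
    theta_d -= lr * (p - t) * s
    # G-step: ascend the same loss so the shared features mislead D.
    theta_s += lr * np.outer(x, (p - t) * theta_d * (1 - s**2))
```

At convergence D cannot tell the tasks apart from the shared representation, which is the balance point described above.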
Finally, the shared encoder and the task discriminator reach a balance point and achieve mutual promotion.

Dataset
Our experiments are conducted on the BioNLP 2019 RQE and QA shared tasks. The QA dataset contains a total of 3042 question-answer pairs: 1701 for training, 234 for validation, and 1107 for testing. The RQE dataset contains a total of 9120 question pairs: 8588 for training, 302 for validation, and 230 for testing. The statistics of the two datasets are shown in Table 1.

Experimental settings and metric
In the shared encoder module, we use the pre-trained uncased BioBERT-base model for computational complexity considerations.

Effects of the Adversarial Multi-Task Learning Strategy
This section first introduces two baseline strategies for comparison:
• Multi-Task: Under this strategy, the architecture is constructed by removing the discriminator D from AMTN. We call it AMTN-Discriminator.
• Single-Task: Under this strategy, the architecture is constructed by removing the discriminator D from AMTN and using the same classifier for the two tasks. We call it the Single-Task Network (STN).
The results are shown in Table 3. From the table, we can see that single-task learning achieves the worst results, which is probably due to its simple model architecture. Among the three single-task settings trained on different datasets, STN (QA+RQE) performs better than STN (QA) and STN (RQE). This demonstrates that the two datasets have quite similar information distributions that complement each other and contribute to both the RQE and QA tasks.
From the second block in Table 4, we can see that the Multi-Task strategy performs clearly better than Single-Task. Note that AMTN-Discriminator reaches accuracies of 63.6% and 74.5% on the RQE and QA tasks respectively, which is the result of our submission on the task website. Multi-task learning jointly trains multiple sub-task models through a shared encoder; it can effectively capture the common features of the two tasks' data, thereby improving the generalization ability on the RQE and QA tasks simultaneously.
To explore the effects of the proposed adversarial multi-task strategy, we further arm the above Multi-Task strategy with adversarial training, i.e., we add a discriminator to form an adversary with shared representation learning:
• Adversarial Multi-Task: Under this strategy, two architectures are constructed. One is our proposed AMTN. The other is a variant of AMTN that adds a Private Encoder for each task to learn task-specific representations in parallel with the shared representations.
The results show that AMTN outperforms AMTN-Discriminator with the help of the adversarial loss. We believe that the discriminator strips private features from the shared representations and makes them more general. Finally, when we add a private encoder for each task, i.e. AMTN+Private Encoder, performance drops significantly, by 6.0% and 3.3% accuracy on the RQE and QA tasks respectively. Although private representations could provide task-specific information, they introduce too many redundant parameters, which can make the model prone to over-fitting and degrade performance.

Effects of the Shared Encoder
Our AMTN model uses the Interactive Transformer as the shared encoder to perform shared representation learning. To verify its effect, we compare the Interactive Transformer with three baseline encoders. In the standard Transformer baseline, for sentence x_B the three inputs (Q, K and V) are all h_B (and likewise all h_A for sentence x_A); that is to say, there is no interaction between the two sentences in this encoder.
Note that, the final sentence representation is generated by max pooling on the output of the above shared encoder.
In addition, previous work on biomedical RQE and QA often uses word embeddings trained on PubMed or PMC corpora. To verify the superiority of pre-trained language representation models, the above four shared encoders (including the Interactive Transformer) are each equipped with the following three word representation methods:
• Word2Vec: Each word in a sentence is represented by word embeddings trained on PubMed abstracts and PubMed Central full-text articles (Wei et al., 2013) with the Word2Vec toolkit (Mikolov et al., 2013). The dimension of the pre-trained word embeddings is 100; we use a transition matrix to convert it to 768.
• BERT: The pre-trained BERT model is used to generate a hidden representation h of each word in the sentence as its word embedding. The purpose of this method is to improve on the generalization ability of Word2Vec and to fully capture character-level, word-level and sentence-level information, and even the relationships between sentences.
• BioBERT: Likewise, the pre-trained BioBERT model is used to generate a hidden representation h of each word in the sentence as its word embedding.
The results show that pre-trained models improve model robustness and uncertainty estimates. On the other hand, among the four encoders, the Interactive Transformer shows the best results overall. It can not only capture long-range dependency information but also establish an interaction between the two given sentences through its interactive process, which efficiently computes dependencies between any two words in a sentence pair and encodes rich semantic information for each word of the sequence.
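The 100-to-768 dimension conversion mentioned in the Word2Vec setting can be sketched as a single projection (a sketch only; in the model the transition matrix would be a trainable parameter learned jointly with everything else):

```python
import numpy as np

rng = np.random.default_rng(0)

emb_100 = rng.normal(size=(5, 100))          # Word2Vec vectors for 5 tokens
W_trans = rng.normal(size=(100, 768)) * 0.1  # transition matrix (trainable in practice)
emb_768 = emb_100 @ W_trans                  # now matches the 768-d encoder input
```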

Error Analysis
Although the proposed AMTN achieves great performance over strong baselines, some failure cases are observed. We have carried out detailed statistics on and analysis of these errors, and classify their possible causes into the following three categories.
The first error type is acronyms. Since most biomedical concepts have acronyms, e.g. "Gastrointestinal Stromal Tumor" vs. "GIST" in the first sentence pair in Table 6, it is quite difficult for the model to determine whether two sentences focus on the same topic without any external knowledge, thus resulting in misclassification. This problem is a concern for future work, e.g. how to integrate prior knowledge into the model.
The second error type is ambiguous samples, in which the relationship between the sentences is fuzzy and difficult to judge, such as the QA sentence pair shown in the second block of Table 5. Its gold label is True; however, the answer sentence seems irrelevant to the question, which leads our model to a wrong classification.
The third error type is semantic confusion, which refers to semantic misunderstanding caused by complex syntax or phrase collocations. Take the sentence pair in the third block of Table 6 as an example: Q1 contains almost all the words in Q2 ("possible", "atypical pneumonia", "treatments", etc.), yet the two sentences are in a Contradiction relation. We believe the sentence pair is so confusing that AMTN does not really "understand" it.

Conclusion
In this paper, we propose an Adversarial Multi-Task Network to jointly model the RQE and QA shared tasks. AMTN employs BioBERT and an Interactive Transformer as the shared encoder to learn shared representations across the two tasks. A discriminator is further introduced to form an adversarial game with the shared encoder to obtain purer shared semantic representations. Experiments on the BioNLP 2019 RQE and QA shared tasks show that the proposed AMTN model benefits from the shared representations of both tasks provided by multi-task learning and adversarial training, and gains a significant improvement over the single-task models.