Label-Aware Double Transfer Learning for Cross-Specialty Medical Named Entity Recognition

We study the problem of named entity recognition (NER) from electronic medical records, which is one of the most fundamental and critical problems for medical text mining. Medical records written by clinicians from different specialties usually contain quite different terminologies and writing styles. The difference of specialties and the cost of human annotation make it particularly difficult to train a universal medical NER system. In this paper, we propose a label-aware double transfer learning framework (La-DTL) for cross-specialty NER, so that a medical NER system designed for one specialty could be conveniently applied to another one with minimal annotation efforts. The transferability is guaranteed by two components: (i) we propose label-aware MMD for feature representation transfer, and (ii) we perform parameter transfer with a theoretical upper bound which is also label aware. We conduct extensive experiments on 12 cross-specialty NER tasks. The experimental results demonstrate that La-DTL provides consistent accuracy improvement over strong baselines. Besides, the promising experimental results on non-medical NER scenarios indicate that La-DTL has the potential to be seamlessly adapted to a wide range of NER tasks.


Introduction
The development of hospital information systems and medical informatics drives the use of various medical data for more efficient and intelligent medical care services. Among the many kinds of medical data, electronic health records (EHRs) are among the most valuable and informative, as they contain detailed information about patients and clinical practices. EHRs are essential to many intelligent clinical applications, such as hospital quality control and clinical decision support systems (Wu et al., 2015). Most EHRs are recorded in an unstructured form, i.e., natural language. Hence, extracting structured information from EHRs using natural language processing (NLP), e.g., named entity recognition (NER) and entity linking, plays a fundamental role in medical informatics (Zhang and Elhadad, 2013). In this paper, we focus on medical NER from EHRs, which is a fundamental task and is widely studied in the research community (Nadeau and Sekine, 2007;Uzuner et al., 2011).
* Weinan Zhang is the corresponding author.
In practice, the difficulty of building a universally robust and high-performance medical NER system lies in the variety of medical terminologies and expressions across specialty departments and hospitals. However, building separate NER systems for so many specialties comes with a prohibitively high cost. The data privacy issue further discourages the sharing of data across departments or hospitals, making it more difficult to train a canonical NER system to be applied everywhere. This raises a natural question: if we have sufficient annotated EHR data in one source specialty, can we distill the knowledge and transfer it to help train models in a related target specialty with few annotations? By transferring the knowledge we can achieve higher performance in target specialties with lower annotation cost and bypass the data sharing concerns. This is commonly referred to as transfer learning (Pan and Yang, 2010).
Current state-of-the-art transfer learning methods for NER are mainly based on deep neural networks, which perform end-to-end training to distill sequential dependency patterns in natural language (Ma and Hovy, 2016;Lample et al., 2016). These transfer learning methods include (i) feature representation transfer (Peng and Dredze, 2017;Kulkarni et al., 2016), which normally leverages deep neural networks to learn a close feature mapping between the source and target domains, and (ii) parameter transfer (Murthy et al., 2016;Yang et al., 2017), which performs parameter sharing or joint training to get the target-domain model parameters close to those of the source-domain model. To the best of our knowledge, no previous work has addressed transfer learning for NER in the medical domain, or even in a larger scope, i.e., medical natural language processing.
In this paper, we propose a novel NER transfer learning framework, namely label-aware double transfer learning (La-DTL): (i) We leverage a bidirectional long short-term memory (Bi-LSTM) network (Graves and Schmidhuber, 2005) to automatically learn the text representations, based on which we perform a label-aware feature representation transfer. We propose a variant of maximum mean discrepancy (MMD) (Gretton et al., 2012), namely label-aware MMD (La-MMD), to explicitly reduce the domain discrepancy of feature representations of tokens with the same label between two domains. (ii) Based on the learned feature representations from the Bi-LSTM, two conditional random field (CRF) models perform sequence labeling for the source and target domains separately, and parameter transfer learning is performed between them. Specifically, an upper bound of the KL divergence between the source and target domains' CRF label distributions is added over the emission and transition matrices across the source and target CRF models to explore the shareable parts of the parameters. Both (i) and (ii) have a label-aware characteristic, which will be discussed later. We further argue that the label-aware characteristic is crucial for transfer learning in sequence labeling problems, e.g., NER, because only when the corresponding labels are matched can the "similar" contexts (i.e., feature representations) and model parameters be efficiently borrowed to improve the label prediction.
Extensive experiments are conducted on 12 cross-specialty medical NER tasks with real-world EHRs. The experimental results demonstrate that La-DTL provides consistent accuracy improvement over strong baselines, with an overall 2.62% to 6.70% absolute F1-score improvement over the state-of-the-art methods. Besides, the promising experimental results on two other non-medical NER scenarios indicate that La-DTL has the potential to be seamlessly adapted to a wide range of NER tasks.

Related Works
Named Entity Recognition (NER) is a fundamental task in information extraction that aims at the automatic detection of named entities (e.g., person, organization, location and geo-political) in free text (Marrero et al., 2013). Many high-level applications such as entity linking (Moro et al., 2014) and knowledge graph construction (Hachey et al., 2011) could be built on top of an NER system. Traditional high-performance approaches include conditional random fields models (CRFs) (Lafferty et al., 2001), maximum entropy Markov models (MEMMs) (McCallum et al., 2000) and hidden Markov models (HMMs). Recently, many neural network-based models have been proposed (Collobert et al., 2011;Chiu and Nichols, 2016;Ma and Hovy, 2016;Lample et al., 2016), in which little feature engineering is needed to train a high-performance NER system. The architectures of those neural network-based models are similar: different neural networks (LSTMs, CNNs) at different levels (character- and word-level) are applied to learn feature representations, and on top of the neural networks, a CRF model is employed to make label predictions.
Transfer Learning distills knowledge from a source domain to help create a high-performance learner for a target domain. Transfer learning algorithms are mainly categorized into three types, namely instance transfer, feature representation transfer and parameter transfer (Pan and Yang, 2010). Instance transfer normally samples or reweights source-domain samples to match the distribution of the target domain (Chen et al., 2011;Chu et al., 2013). Feature representation transfer typically learns a feature mapping which projects source and target domain data simultaneously onto a common feature space following similar distributions (Zhuang et al., 2015;Long et al., 2015;Shen et al., 2017).
Parameter transfer normally involves joint or constrained training of the models on the source and target domains, usually introducing connections between source and target parameters via sharing (Srivastava and Salakhutdinov, 2013), initialization (Perlich et al., 2014), or inter-model parameter penalty schemes (Zhang et al., 2016).
Transfer Learning for NER Training a high-performance NER system requires expensive and time-consuming manually annotated data. But sufficient labeled data is critical for the generalization of an NER system, especially for neural network-based models. Thus, transfer learning for NER is a practically important problem. The first group of methods focuses on sharing model parameters, but they differ in the training schemes. He and Sun (2017) proposed to train the parameter-shared model with source and target data jointly, while the learning rates for sentences from the source domain are re-weighted by their similarity with the target domain corpus. Yang et al. (2017) proposed a family of frameworks which share model parameters in hierarchical recurrent networks to handle cross-application, cross-lingual, and cross-domain transfer in sequence labeling tasks. Differently, Lee et al. (2017) first trained the model with source domain data and then fine-tuned the model with a small amount of annotated target domain data.
Domain adaptation methods have been well studied in NER scenarios, such as using distributed word representations (Kulkarni et al., 2016) and leveraging rule-based annotators (Chiticariu et al., 2010). Multi-task learning has also been studied to improve performance in multiple NER tasks by transferring meaningful knowledge from other tasks (Collobert et al., 2011;Peng and Dredze, 2016). To take advantage of both domain adaptation and multi-task learning, Peng and Dredze (2017) proposed a multi-task domain adaptation model.

Preliminaries
This section briefly introduces bidirectional LSTM, conditional random field and maximum mean discrepancy, which are the building blocks of our transfer learning framework.
Bidirectional LSTM Recurrent neural networks (RNNs) are widely used in NLP tasks for their great capability to capture contextual information in sequence data. A widely used variant of RNNs is long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997), which incorporates input and forget gates to capture both long and short term dependencies. Furthermore, it is beneficial to process the sequence not only in a forward but also in a backward way. Thus, bidirectional LSTM (Bi-LSTM) was employed in many previous works (Chiu and Nichols, 2016;Ma and Hovy, 2016;Lample et al., 2016) to capture bidirectional information in a sequence. More specifically, for token $x_t$ (embedding vector) at timestep $t$, two hidden vectors $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ are produced by a forward LSTM and a backward one, respectively. Then we concatenate $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$ to $h_t$ as the final hidden vector produced by Bi-LSTM:
$$h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t].$$
The representations learned from Bi-LSTM for sequence $X$ are thus denoted as $H = (h_1, h_2, \ldots, h_n)$.
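The concatenation of forward and backward states can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the gate layout (input, forget, output, candidate), the zero initial states, the random weights, and the helper names `lstm_pass`, `init_params` and `bilstm` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(X, params, reverse=False):
    """Run one LSTM direction over X (n x d_in); return hidden states (n x d_h)."""
    W, U, b = params                      # W: d_in x 4d_h, U: d_h x 4d_h, b: 4d_h
    d_h = U.shape[0]
    h, c = np.zeros(d_h), np.zeros(d_h)   # assumed zero initial states
    out = np.zeros((X.shape[0], d_h))
    n = X.shape[0]
    steps = range(n - 1, -1, -1) if reverse else range(n)
    for t in steps:
        z = X[t] @ W + h @ U + b
        i, f, o, g = np.split(z, 4)       # input, forget, output gates and candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        out[t] = h
    return out

def init_params(d_in, d_h, rng):
    """Small random weights, for illustration only."""
    return (rng.normal(scale=0.1, size=(d_in, 4 * d_h)),
            rng.normal(scale=0.1, size=(d_h, 4 * d_h)),
            np.zeros(4 * d_h))

def bilstm(X, fwd_params, bwd_params):
    """h_t is the concatenation of the forward and backward hidden vectors."""
    h_fwd = lstm_pass(X, fwd_params)
    h_bwd = lstm_pass(X, bwd_params, reverse=True)
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))               # 5 tokens, 8-dimensional embeddings
fp, bp = init_params(8, 4, rng), init_params(8, 4, rng)
H = bilstm(X, fp, bp)                     # shape (5, 8): 4 forward + 4 backward dims
```

Each row of `H` therefore sees the whole sentence: its first half summarizes the tokens to the left, its second half the tokens to the right.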

Conditional Random Field
The goal of NER is to detect named entities in a sequence $X$ by predicting a sequence of labels $y = (y_1, y_2, \ldots, y_n)$.
Conditional random field (CRF) is widely used to make joint labeling of the tokens in a sequence (Lafferty et al., 2001).
Recently, Lample et al. (2016) proposed to build a CRF layer on top of a Bi-LSTM so that the automatically learned feature representation $H = (h_1, h_2, \ldots, h_n)$ of the sequence can be directly fed into the CRF for sequence labeling. For a sequence of labels $y$, given the hidden vector sequence $H$, we define its $\theta_c$-parametrized score function $s_{\theta_c}(H, y)$ as:
$$s_{\theta_c}(H, y) = \sum_{i=1}^{n} E_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}},$$
where $E$ is the emission score matrix of size $n \times m$ ($m$ is the number of unique labels), computed by $E = HW$ where $W$ is the label emission parameter matrix; $A$ is the label transition parameter matrix; thus $\theta_c = \{W, A\}$. We then define the conditional probability of label sequence $y$ given $H$ by a softmax over all possible label sequences in the set $\mathcal{Y}(H)$ as:
$$p_{\theta_c}(y \mid H) = \frac{\exp\big(s_{\theta_c}(H, y)\big)}{\sum_{y' \in \mathcal{Y}(H)} \exp\big(s_{\theta_c}(H, y')\big)}, \tag{1}$$
where $\theta_c$ is omitted for simplicity in the following part. The training objective in the CRF layer is to maximize the log-likelihood $\max_{\theta_c} \log p(y \mid H)$. In the label prediction phase, we output the label sequence $y^*$ with the highest conditional probability, $y^* = \arg\max_{y' \in \mathcal{Y}(H)} p(y' \mid H)$, by dynamic programming (Sutton et al., 2012).

Maximum Mean Discrepancy
Maximum mean discrepancy (MMD) (Gretton et al., 2012) is a non-parametric test statistic to measure the distribution discrepancy in terms of the distance between the kernel mean embeddings of two distributions $p$ and $q$. The MMD is defined in particular function spaces that witness the difference in distributions:
$$\mathrm{MMD}[\mathcal{F}, p, q] = \sup_{f \in \mathcal{F}} \big( \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{y \sim q}[f(y)] \big).$$
By defining the function class $\mathcal{F}$ as the unit ball in a universal Reproducing Kernel Hilbert Space (RKHS), denoted by $\mathcal{H}$, it holds that $\mathrm{MMD}[\mathcal{F}, p, q] = 0$ if and only if $p = q$. Then, given two sets of samples $X = \{x_1, \ldots, x_m\}$ and $Y = \{y_1, \ldots, y_n\}$ drawn independently and identically distributed (i.i.d.) from $p$ and $q$ on the data space $\mathcal{X}$, the empirical estimate of MMD can be written as the distance between the empirical mean embeddings after mapping to the RKHS:
$$\mathrm{MMD}(X, Y) = \Big\| \frac{1}{m} \sum_{i=1}^{m} \phi(x_i) - \frac{1}{n} \sum_{j=1}^{n} \phi(y_j) \Big\|_{\mathcal{H}}, \tag{2}$$
where $\phi(\cdot) : \mathcal{X} \to \mathcal{H}$ is the nonlinear feature mapping that induces $\mathcal{H}$.

Methodology
In this section, we present a label-aware double transfer learning (La-DTL) framework and discuss its rationale. Figure 1 gives an overview of La-DTL for NER. From the bottom up, each input sentence is converted into a sequence of embedding vectors, which are then fed into a Bi-LSTM to sequentially encode contextual information into fixed-length hidden vectors. The embedding and Bi-LSTM layers are shared among the source/target domains. With label-aware maximum mean discrepancy (La-MMD) reducing the feature representation discrepancy between the two domains, the hidden vectors are directly fed into source/target domain-specific CRF layers to predict the label sequence. We use domain-constrained CRF layers to enhance the target domain performance.

Framework Overview
We occasionally use $H(X)$ to denote the corresponding hidden vectors when feeding $X$ into the Bi-LSTM. The CRF decodes hidden vectors $H$ to a label sequence $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$. Our goal is to improve label prediction accuracy on the target domain $D_t$ by utilizing the knowledge from the source domain $D_s$:
$$p(y \mid X) = p(y \mid H(X)). \tag{3}$$
Thus training a transferable model $p(y \mid X)$ requires both $H(X)$ and $p(y \mid H)$ to be transferable.
We share the word embedding and Bi-LSTM layers to bring closer the feature representation distributions $p(h \mid D_s)$ and $p(h \mid D_t)$, i.e., the distributions of Bi-LSTM hidden vectors at each timestep of the sentences from the source and target domains respectively. The rationale lies in the insufficiency of labeled target data. Even though LSTM has high capacity, its generalization ability relies heavily on seeing "sufficient" data. Otherwise, LSTM is very likely to overfit the data. Trained on both source and target data, the Bi-LSTM is expected to learn high-quality feature representations. Yosinski et al. (2014) provided a justification of this solution, namely that sharing bottom layers is promising for transfer learning in practice.
With the sentences projected onto the same hidden space, the conditional distributions $p(h^s \mid D_s)$ and $p(h^t \mid D_t)$ may nevertheless be distant, because LSTM hidden vectors contain contextual information which differs across domains. In order to reduce the source/target discrepancy, we refine MMD (Gretton et al., 2012) with label constraints, i.e., label-aware MMD (La-MMD). Using La-MMD, the source/target hidden states are pushed toward similar distributions to make the transfer of the feature representation $H(X)$ feasible.
Based on the hidden vectors from Bi-LSTM, we adopt independent CRF layers for each domain. The rationale lies in the hypotheses that (i) the target domain predictor can better capture the target data distribution, which could be very unique; (ii) a good predictor trained on the source domain could be leveraged to assist the target domain predictor without directly borrowing the source domain training data, which bypasses the data privacy issue. With respect to the emission and transition score matrices $E_{i, y_i}$ and $A_{y_i, y_{i+1}}$, we impose an upper-bound constraint between the source/target domains, which helps the target domain predictor to be guided by the source domain predictor. Thus $p(y \mid H)$ is also transferable.
There are also other transfer methods, including fine-tuning, sharing parameters directly (without constraints) (He and Sun, 2017;Lee et al., 2017;Yang et al., 2017), etc. However, simply sharing models may neglect target-specific instances.

Learning Objective
The learning objective is to minimize the following loss $L$ with respect to parameters $\Theta = \{\theta_b, \theta_c\}$:
$$L = L_c + \alpha L_{\text{La-MMD}} + \beta L_p + \gamma L_r,$$
where $L_c$ is the CRF loss, $L_{\text{La-MMD}}$ is the La-MMD loss, $L_p$ is the parameter similarity loss on CRF layers, and $L_r$ is the regularization term, with $\alpha$, $\beta$, $\gamma$ as hyperparameters to balance the loss terms.
The CRF loss is our ultimate objective of predicting the label sequence given the input sentence, i.e., we minimize the negative log-likelihood of training samples from both source/target domains:
$$L_c = -\frac{\epsilon}{N^s} \sum_{i=1}^{N^s} \log p(y^s_i \mid H^s_i) - \frac{1 - \epsilon}{N^t} \sum_{i=1}^{N^t} \log p(y^t_i \mid H^t_i),$$
where $H$ are hidden vectors obtained from Bi-LSTM, and $\epsilon$ is the balance coefficient. The La-MMD loss $L_{\text{La-MMD}}$ and parameter similarity loss $L_p$ are discussed in Sections 4.3 and 4.4, respectively. The regularization term generally controls overfitting:
$$L_r = \|\theta_b\|_2^2 + \|\theta_c\|_2^2.$$
We provide the model convergence and hyperparameter study in Section 5.1.
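As a rough illustration of how the four terms combine, the sketch below assembles the total loss from precomputed pieces. The function name `la_dtl_loss`, its argument names, and the choice to weight the source and target negative log-likelihoods by ε and 1 − ε are assumptions for this example; the individual terms are taken as given numbers.

```python
import numpy as np

def la_dtl_loss(nll_src, nll_tgt, l_la_mmd, l_p, l_r,
                alpha, beta, gamma, eps):
    """Combine the loss terms: L = L_c + alpha*L_LaMMD + beta*L_p + gamma*L_r.

    nll_src / nll_tgt: per-sentence negative log-likelihoods from the
    source and target CRF layers (hypothetical inputs for this sketch).
    """
    # Assumed balance: eps on the source average, (1 - eps) on the target average.
    l_c = eps * np.mean(nll_src) + (1.0 - eps) * np.mean(nll_tgt)
    return l_c + alpha * l_la_mmd + beta * l_p + gamma * l_r

# Usage with made-up numbers and the coefficients found by the grid search
# reported later in the paper (alpha = 0.02, beta = 0.03, eps = 0.3).
loss = la_dtl_loss(nll_src=[2.0, 4.0], nll_tgt=[1.0],
                   l_la_mmd=0.5, l_p=0.2, l_r=10.0,
                   alpha=0.02, beta=0.03, gamma=0.0, eps=0.3)
```

Setting `alpha=0` or `beta=0` switches off the corresponding transfer component, which is how ablation variants of the objective can be expressed.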

Bi-LSTM Feature Representation Transfer
To learn transferable feature representations, the maximum mean discrepancy (MMD), which measures the distance between two distributions, has been widely used in domain adaptation scenarios (Long et al., 2015;Rozantsev et al., 2016). Almost all these works focus on reducing the marginal distribution distance between different domain features in an unsupervised manner to make them indistinguishable. However, because a word is not evenly distributed across labels, the discriminative properties of features from different domains may differ, which means that close source and target samples may not have the same label. Different from previous works, we propose label-aware MMD (La-MMD) in Eq. (5) to explicitly reduce the discrepancy between hidden representations with the same label, i.e., the linear combination of the MMD for each label. For each label class $y \in \mathcal{Y}_v$, where $\mathcal{Y}_v$ is the set of matched labels in the two domains, we compute the squared population MMD between the hidden representations of source/target samples with the same label $y$:
$$\mathrm{MMD}^2(R^s_y, R^t_y) = \frac{1}{(N^s_y)^2} \sum_{i,j=1}^{N^s_y} k(h^s_i, h^s_j) + \frac{1}{(N^t_y)^2} \sum_{i,j=1}^{N^t_y} k(h^t_i, h^t_j) - \frac{2}{N^s_y N^t_y} \sum_{i=1}^{N^s_y} \sum_{j=1}^{N^t_y} k(h^s_i, h^t_j), \tag{4}$$
where $R^s_y$ and $R^t_y$ are the sets of hidden representations $h^s$ and $h^t$ with label $y$, of sizes $N^s_y$ and $N^t_y$ respectively. Eq. (4) can be easily derived by casting Eq. (2) into inner product form and applying $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} = k(x, y)$, where $k$ is the reproducing kernel function (Gretton et al., 2012). For each label class, we compute the MMD loss in the normal manner. After that, we define the La-MMD loss as:
$$L_{\text{La-MMD}} = \sum_{y \in \mathcal{Y}_v} \mu_y \cdot \mathrm{MMD}^2(R^s_y, R^t_y), \tag{5}$$
where $\mu_y$ is the corresponding coefficient. The illustration of La-MMD is shown in Figure 2.
Once we have applied this La-MMD to our representations learned from Bi-LSTM, the representation distribution of instances with the same label from different domains should be close. Then the standard CRF layer which has a simple linear structure takes these similar representations as input and is likely to give a more transferable label decision for instances with the same label.
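A minimal NumPy sketch of the label-aware MMD computation is given below. It assumes a Gaussian (RBF) kernel with a fixed bandwidth `sigma`, which the text does not specify, and uses the biased estimator that includes the diagonal kernel terms; the function names and the integer label encoding are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between rows of A and rows of B (assumed kernel)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))

def mmd2(Rs, Rt, sigma=1.0):
    """Squared MMD between two sets of hidden vectors, in kernel form."""
    return (rbf_kernel(Rs, Rs, sigma).mean()
            + rbf_kernel(Rt, Rt, sigma).mean()
            - 2.0 * rbf_kernel(Rs, Rt, sigma).mean())

def la_mmd(Hs, ys, Ht, yt, mu=None, sigma=1.0):
    """Sum of per-label squared MMDs over labels present in both domains.

    Hs, Ht: token-level hidden vectors (N_s x d, N_t x d).
    ys, yt: integer label ids per token.
    mu: optional dict of per-label coefficients (defaults to 1 for every label).
    """
    matched = sorted(set(ys.tolist()) & set(yt.tolist()))
    total = 0.0
    for y in matched:
        weight = 1.0 if mu is None else mu[y]
        total += weight * mmd2(Hs[ys == y], Ht[yt == y], sigma)
    return total

rng = np.random.default_rng(0)
Hs = rng.normal(size=(20, 4))
ys = np.array([0, 1] * 10)
same = la_mmd(Hs, ys, Hs, ys)            # identical domains: essentially zero
shifted = la_mmd(Hs, ys, Hs + 3.0, ys)   # shifted target: clearly positive
```

Labels that occur in only one domain contribute nothing, because the sum runs over the matched label set only.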

CRF Parameter Transfer
Simply sharing the CRF layer is unpromising when source/target data are diversely distributed. According to the probability decomposition in Eq. (3), in order to transfer across the source/target CRF layers, more specifically $p(y \mid H)$, we reduce the KL divergence from $p_t(y \mid H)$ to $p_s(y \mid H)$. Because directly reducing $D_{KL}(p_s(y \mid H) \,\|\, p_t(y \mid H))$ is intractable, we instead reduce its upper bound, Eq. (6), in which $H(\cdot)$ denotes the entropy of distribution $(\cdot)$ and $c$ is a constant. The detailed proof is provided in Appendix A.1. Since $c(\|W^s - W^t\|_2^2 + \|A^s - A^t\|_2^2)$ is the upper bound of $D_{KL}(p_s(y \mid H) \,\|\, p_t(y \mid H))$, we conduct CRF parameter transfer by minimizing
$$L_p = \|W^s - W^t\|_2^2 + \|A^s - A^t\|_2^2.$$
It turns out that a similar regularization term is applied in our CRF parameter transfer method and in the regularization framework (RF) for domain adaptation (Lu et al., 2016). However, RF was proposed to generalize the feature augmentation method of Daume III (2007), and these two methods are only discussed from the perspective of the parameters. There is no guarantee that two models having similar parameters yield similar output distributions. In this work, we discuss the model behavior under CRF conditions, and we prove that two CRF models having similar parameters (in Euclidean space) yield similar output distributions. In other words, our method guarantees transferability at the model behavior level, while previous works are limited to the parameter level. The CRF parameter transfer is illustrated in Figure 3. It is also label-aware, since the L2 constraint is added over parameters corresponding to the same label in the two domains, e.g., $W^s_O$ and $W^t_O$.
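The label-aware L2 constraint can be sketched as below. This is an illustration under stated assumptions: emission parameters are stored with one column per label, transition parameters with one row and column per label, and a transition entry is constrained only when both of its labels occur in both domains. The function name and label strings are hypothetical.

```python
import numpy as np

def crf_param_loss(Ws, As, Wt, At, src_labels, tgt_labels):
    """Squared L2 distance between CRF parameters of labels shared by both domains.

    Ws, Wt: emission parameter matrices (d x m_s, d x m_t), one column per label.
    As, At: transition parameter matrices (m_s x m_s, m_t x m_t).
    src_labels, tgt_labels: label names in column order for each domain.
    """
    shared = [lab for lab in src_labels if lab in tgt_labels]
    si = [src_labels.index(lab) for lab in shared]
    ti = [tgt_labels.index(lab) for lab in shared]
    d_w = Ws[:, si] - Wt[:, ti]                      # matched emission columns
    d_a = As[np.ix_(si, si)] - At[np.ix_(ti, ti)]    # matched transition entries
    return float(np.sum(d_w ** 2) + np.sum(d_a ** 2))

# Usage: only the shared label "O" is constrained; "B-X" and "B-Y" are not.
src_labels, tgt_labels = ["O", "B-X"], ["B-Y", "O"]
Ws, As = np.zeros((3, 2)), np.zeros((2, 2))
Wt, At = np.zeros((3, 2)), np.zeros((2, 2))
Wt[:, 1] = 1.0    # target emission column for "O" differs by 1 in 3 entries
Wt[:, 0] = 5.0    # unmatched label "B-Y": ignored
At[1, 1] = 2.0    # target O -> O transition differs by 2
At[0, 0] = 7.0    # unmatched transition B-Y -> B-Y: ignored
penalty = crf_param_loss(Ws, As, Wt, At, src_labels, tgt_labels)
```

Because unmatched labels are left unconstrained, the two domains may keep different label sets while still sharing what they have in common.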

Training
We train La-DTL in an end-to-end manner with mini-batch AdaGrad (Duchi et al., 2011). Each mini-batch contains training samples from both domains; otherwise $L_{\text{La-MMD}}$ cannot be computed. During training, word (and character) embeddings are fine-tuned to adapt to the real data distribution. During both training and decoding (testing) of the CRF layers, we use dynamic programming to compute the normalizer in Eq. (1) and to infer the label sequence.
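The two dynamic programs mentioned above can be sketched as follows: the forward algorithm for the log-normalizer and Viterbi decoding for the best label sequence. The sketch assumes a linear-chain score made of emission scores `E[i, y_i]` and transition scores `A[y_i, y_{i+1}]` with no special start or end transitions; the function names are hypothetical.

```python
import numpy as np
from itertools import product

def sequence_score(E, A, y):
    """Score of one label sequence: emissions plus transitions."""
    emit = sum(E[i, y[i]] for i in range(len(y)))
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def log_partition(E, A):
    """Log of the normalizer, computed with the forward algorithm in log space."""
    alpha = E[0].copy()
    for i in range(1, E.shape[0]):
        scores = alpha[:, None] + A          # scores[j, k]: previous j, current k
        mx = scores.max(axis=0)
        alpha = mx + np.log(np.exp(scores - mx).sum(axis=0)) + E[i]
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())

def viterbi(E, A):
    """Highest-scoring label sequence via max-product dynamic programming."""
    n, m = E.shape
    delta = E[0].copy()
    back = np.zeros((n, m), dtype=int)
    for i in range(1, n):
        scores = delta[:, None] + A
        back[i] = scores.argmax(axis=0)      # best previous label for each current label
        delta = scores.max(axis=0) + E[i]
    y = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]

def brute_force_log_partition(E, A):
    """Enumerate all label sequences; only feasible for tiny examples."""
    n, m = E.shape
    scores = [sequence_score(E, A, y) for y in product(range(m), repeat=n)]
    return float(np.log(np.sum(np.exp(scores))))
```

The forward pass costs on the order of n·m² operations instead of the mⁿ terms of the explicit sum, which is what makes exact training of the CRF layer practical.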

Experiments
In this section, we evaluate La-DTL¹ and other baseline methods on 12 cross-specialty NER problems based on real-world datasets. The experimental results show that La-DTL steadily and significantly outperforms the other baseline models in all tasks. We also conduct a further ablation study and robustness study. We evaluate La-DTL on two more non-medical NER transfer tasks to validate its general efficacy over a wide range of applications.

Cross-Specialty NER
Datasets We collected a Chinese medical NER (CM-NER) corpus for our experiments. This corpus contains 1600 de-identified EHRs of our affiliated hospital from four different specialties in four departments: Cardiology (500), Respiratory (500), Neurology (300) and Gastroenterology (300). The research was reviewed and approved by the ethics committee. Named entities are annotated in the BIOES format (Begin, Inside, Outside, End and Single), with 30 types in total. The statistics of CM-NER are shown in Table 1.
Baselines The following methods are compared. For a fair comparison, we implement La-DTL and the baselines with the same base model introduced in (Lample et al., 2016) but with different transfer techniques.
• Non-transfer uses the target domain labeled data only.
• Domain mask and Linear projection belong to the same framework proposed by Peng and Dredze (2017) but have different implementations at the projection layer, which aims to produce shared feature representations among different domains through a linear transformation.
• Re-training is proposed by Lee et al. (2017), where an artificial neural network (ANN) is first trained on the source domain and then re-trained on the target domain.
• Joint-training is a transfer learning method proposed by Yang et al. (2017) where different tasks are trained jointly.
• CD-learning is a cross-domain learning method proposed by He and Sun (2017), where each source domain training example's learning rate is re-weighted.
¹ https://github.com/felixwzh/La-DTL
Experimental Settings We use 23,217 unlabeled clinical records to train the word embeddings (word2vec) at 128 dimensions using the skip-gram model (Mikolov et al., 2013). The hidden state size is set to 200 for the word-level Bi-LSTM. We evaluate La-DTL for cross-specialty NER with CM-NER in 12 transfer tasks, with results shown in Table 2. For each task, we take the whole source domain training set $D_s$ and 10% of the sentences of the target domain training set $D_t$ as training data. We use the development set in the target domain to search hyperparameters, including training epochs. We then use the models to make predictions on the target domain test set and use F1-score as the evaluation metric. Statistical significance has been determined using a randomization version of the paired sample t-test (Cohen, 1995).
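For readers who want to reproduce the metric, the sketch below computes an F1-score from BIOES tag sequences. It assumes exact-match, entity-level scoring (an entity counts as correct only if its boundaries and type both match), which is a common convention but is not spelled out in the text; the tag names `DIS` and `SYM` are made up for the example.

```python
def bioes_spans(tags):
    """Extract (start, end, type) entity spans from one BIOES tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":                     # single-token entity
            spans.add((i, i, label))
            start = None
        elif prefix == "B":                   # entity begins
            start, etype = i, label
        elif prefix == "I":                   # entity continues if consistent
            if start is None or label != etype:
                start = None
        elif prefix == "E":                   # entity ends
            if start is not None and label == etype:
                spans.add((start, i, label))
            start = None
        else:                                 # "O" or an unknown tag
            start = None
    return spans

def entity_f1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level F1 over a list of sentences."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = bioes_spans(gold), bioes_spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)

# Usage: one of the two gold entities is recovered, so P = 1, R = 0.5.
gold = [["B-DIS", "I-DIS", "E-DIS", "O", "S-SYM"]]
pred = [["B-DIS", "I-DIS", "E-DIS", "O", "O"]]
score = entity_f1(gold, pred)
```

Malformed predictions, such as an `I-` tag without a preceding `B-`, simply yield no span and therefore count against recall.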

Results and Discussion
From the results of the 12 cross-specialty NER tasks shown in Table 2, we find that La-DTL outperforms all the strong baselines in all 12 cross-specialty transfer learning tasks, with a 2.62% to 6.70% F1-score lift over state-of-the-art baseline methods. Meanwhile, Linear projection and Domain mask (Peng and Dredze, 2017) do not perform as well as the other three baselines, which may be because such linear transformation methods are likely to weaken the representations. The other three baseline methods all share the whole model between the source/target domains, but differ in their training schemes and performance.
To better understand the transferability of La-DTL, we also evaluate three variants of La-DTL: La-MMD, CRF-L2, and MMD-CRF-L2. La-MMD and CRF-L2 have the same networks and loss function as La-DTL but with different building blocks: La-MMD has β = 0, while CRF-L2 has α = 0. In MMD-CRF-L2, we replace the La-MMD loss $L_{\text{La-MMD}}$ in La-DTL with a vanilla MMD loss:
$$L_{\text{MMD}} = \mathrm{MMD}^2(R^s, R^t),$$
where $R^s$ and $R^t$ are the sets of hidden representations from the source and target domains. Results in Table 2 show that: (i) Using La-MMD alone does achieve satisfactory performance, since it outperforms the best baseline Joint-training (Yang et al., 2017) in 7 of 12 tasks. It also has a significant improvement over the Domain mask and Linear projection methods (Peng and Dredze, 2017), which indicates that using La-MMD to reduce the domain discrepancy of feature representations in sequence tagging tasks is promising. (ii) CRF-L2 is also a promising method when transferring between NER tasks, and it improves the La-MMD method significantly when these two methods are combined to form La-DTL. (iii) The label-aware characteristic is important in sequence labeling problems, because there is an obvious performance drop when La-MMD is replaced with a vanilla MMD in La-DTL. But MMD-CRF-L2 still has very competitive performance compared to all the baseline methods. This is positive empirical evidence that transferring knowledge at both the Bi-LSTM feature representation level and the CRF parameter level for NER tasks is better than transferring knowledge at only one of these two levels, as discussed in Section 4.1.

Robustness to Target Domain Data Sparsity
We further study the target domain data sparsity problem of La-DTL in the C→R task, compared with Joint-training (Yang et al., 2017) and the Non-transfer method. We evaluate La-DTL with different data volumes (sampling rates: 10%, 25%, 50%, 100%) of the target domain training set. Results are shown in Figure 4(a). We observe that La-DTL outperforms Joint-training and Non-transfer under all circumstances, and the improvement of La-DTL is more significant when the sampling rate is lower.
To show La-DTL's convergence and significant improvement over Joint-training, we repeat the 10% sampling rate experiment 10 times with 10 random seeds. The F1-scores on the target domain development set for the two methods, with 95% confidence intervals, are shown in Figure 4(b), where La-DTL significantly outperforms the Joint-training method.
Hyperparameter Study We study the influence of three key hyperparameters in La-DTL: α, β, and ε in the C→R task with a 10% target domain sampling rate. We first apply a rough grid search over the three hyperparameters, which yields (α = 0.02, β = 0.03, ε = 0.3). We then fix two hyperparameters and test the third one at a finer granularity. The results in Figure 5 show that we need to balance the learning objectives of the source and target domains for better transferability.

Table 4: F1-scores on the WeiboNER transfer task.
| Method | F1 |
|---|---|
| … (Peng and Dredze, 2017) | 56.99 |
| Domain mask (Peng and Dredze, 2017) * | 56.80 |
| Domain mask (Peng and Dredze, 2017) | 56.32 |
| CD-learning (He and Sun, 2017) * | 52.05 |
| CD-learning (He and Sun, 2017) | 56.46 |
| Re-training (Lee et al., 2017) | 55.36 |
| Joint-training (Yang et al., 2017) | 56.80 |
| La-DTL | 57.74 |

NER Transfer Experiment on Non-medical Corpus
To show that La-DTL can be applied in a wide range of NER transfer learning scenarios, we conduct experiments on two non-medical NER tasks. Details of the corpora are shown in Table 3.
WeiboNER Transfer Following He and Sun (2017) and Peng and Dredze (2017), we transfer knowledge from SighanNER (MSR corpus of the sixth SIGHAN Workshop on Chinese language processing) to WeiboNER (a social media NER corpus) (Peng and Dredze, 2015). Results in Table 4 show that La-DTL outperforms all the baseline methods in the Chinese social media domain.
TwitterNER Transfer We transfer knowledge from the CoNLL 2003 English NER corpus (Tjong Kim Sang and De Meulder, 2003) to TwitterNER (Ritter et al., 2011). Since the entity types in these two corpora cannot be exactly matched, only La-DTL and Joint-training (Yang et al., 2017) can be applied directly in this case, while the other baselines cannot. This is because the CRF parameter transfer of La-DTL is label-aware, and Joint-training simply leverages two independent CRF layers. The results indicate that La-DTL can be applied to transfer learning scenarios with mismatched label sets and to languages like English.

Conclusions
In this paper, we propose La-DTL, a label-aware double transfer learning framework, to conduct both Bi-LSTM feature representation transfer and CRF parameter transfer with label-aware constraints for cross-specialty medical NER tasks. To the best of our knowledge, this is the first work on transfer learning for medical NER in the cross-specialty scenario. Experiments on 12 cross-specialty NER tasks show that La-DTL provides consistent performance improvement over strong baselines. We further perform a set of experiments on different target domain data sizes, a hyperparameter study, and other non-medical NER tasks, where La-DTL shows great robustness and wide efficacy. For future work, we plan to jointly perform NER and entity linking for better cross-specialty medical structural information extraction.

A.1 Detailed Proof
Recall the bound in Eq. (6).
Lemma A.1. $c_1(\|W^s - W^t\|_2^2 + \|A^s - A^t\|_2^2)$ is the upper bound of $(s_s(H, y) - s_t(H, y))^2$.
Proof of Lemma A.1. $\otimes$ refers to the convolutional product, $H_W$ and $H_A$ are mask matrices corresponding to the given hidden vectors $H$, and $c_1$ is a constant. We have:
$$\cdots = 2\big(\|W^s - W^t\|_2^2 \cdot \|H_W\|_2^2\big) + 2\big(\|A^s - A^t\|_2^2 \cdot \|H_A\|_2^2\big) \le c_1\big(\|W^s - W^t\|_2^2 + \|A^s - A^t\|_2^2\big).$$
Lemma A.2. $c\big(\|W^s - W^t\|_2^2 + \|A^s - A^t\|_2^2\big)^{\frac{1}{2}}$ is the upper bound of $D_{KL}(p_s(y \mid H) \,\|\, p_t(y \mid H))$.