HGCN4MeSH: Hybrid Graph Convolution Network for MeSH Indexing

Deep learning models such as DeepMeSH and TextCNN have recently been applied to Medical Subject Headings (MeSH) indexing to reduce the time and monetary cost of manual annotation. However, these models still fail to capture the complex correlations between MeSH terms. To this end, we introduce a Graph Convolutional Network (GCN) to learn the relationships between these terms and present a novel Hybrid Graph Convolution Network for MeSH indexing (HGCN4MeSH). Specifically, we use two BiGRUs to learn embedding representations of the abstract and the title of each article, respectively. At the same time, we build the adjacency matrix of MeSH terms from their co-occurrence relationships in the corpus, which makes it easy to apply GCN representation learning. Based on the learned hybrid representation, MeSH term prediction is cast as an extreme multi-label classification problem after an attention layer. Experimental results on two datasets show that HGCN4MeSH is competitive with state-of-the-art methods.


Introduction
MEDLINE (https://www.nlm.nih.gov/bsd/medline.html) is an important database of biomedical and life science publications, containing more than 24 million journal citations. To facilitate information storage and retrieval, the National Library of Medicine (NLM) created Medical Subject Headings (MeSH; https://www.nlm.nih.gov/mesh/meshhome.html) to index articles in MEDLINE. MeSH is an annually updated hierarchical glossary. In 2019, MeSH contained 29368 concepts (https://www.nlm.nih.gov/databases/download/mesh.html), covering areas from biomedicine to information technology. Currently, the articles in MEDLINE are indexed primarily by NLM human experts, and indexing new articles is estimated to cost millions of dollars each year (Mork et al., 2013). Therefore, it is necessary to build an efficient and accurate model for indexing documents automatically, i.e., MeSH indexing.
Xun et al. (2019) demonstrated that the MeSH indexing problem can be cast as an extreme multi-label classification task. Each MeSH term can be regarded as a tag, giving 29368 tags in total, and each article has 13 tags on average. Recently, several deep learning models have been applied successfully to MeSH indexing, such as AttentionMeSH (Jin et al., 2018) and MeSHProbeNet (Xun et al., 2019). However, these models do not consider the correlation and co-occurrence relationships between MeSH terms. By ignoring these relationships between tags, such methods are inherently limited. Table 1 shows a real example of article tags from the data.
In this paper, we propose a novel GCN-based (Kipf and Welling, 2016) MeSH indexing model, HGCN4MeSH, which learns the co-occurrence representation of tags via a GCN-based mapping function. Specifically, we design a novel data-driven adjacency matrix to guide the information propagation between nodes. To cope with the very large number of tags in extreme multi-label classification, we propose a hybrid adjacency matrix: we construct a bidirectional GCN among high-frequency tags and a unidirectional GCN between high-frequency and low-frequency tags, which reduces the computation.
Figure 1: The proposed model framework. Balls of various sizes and colors represent different representations of MeSH terms; BiGRU is the bidirectional gated recurrent unit. First, a hybrid graph is constructed for MeSH terms, where each node represents a MeSH term. The abstract and title are fed into separate GRUs for feature extraction, and the GCN updates the representation of MeSH terms by learning co-occurrences of MeSH terms during training. The final representation of MeSH terms consists of two parts: the representation generated by the GCN and the semantic representation of the MeSH terms. We then calculate the attention weights between the MeSH terms and the title and abstract, and output the final score via a linear layer and a sigmoid activation function.
The major contributions are:
• We propose a novel end-to-end extreme multi-label classification framework (Figure 1), which employs a GCN to learn tag representations.
• We use a partial block adjacency matrix to reduce computation and noise in extreme multi-label classification.
The experimental results show that our method is competitive with the state-of-the-art method.

Related Work
Aronson et al. (2004) introduced the Medical Text Indexer (MTI) to help experts find suitable MeSH terms for articles quickly and accurately. Peng et al. (2016) proposed DeepMeSH, which achieved the best results in the 2017 BioASQ challenge task A. BioASQ is a challenge funded by the European Union; its task A requires participants to predict the corresponding MeSH terms using only the abstracts and titles. DeepMeSH used TF-IDF (Jones, 1972) and document to vector (D2V) (Le and Mikolov, 2014) to represent each abstract, and k-nearest-neighbor (KNN) (Altman, 1992) classifiers to generate candidate MeSH terms. AttentionMeSH (Jin et al., 2018) was also divided into two parts: the first used KNN to generate candidate MeSH terms, and the second used a bidirectional Gated Recurrent Unit (BiGRU) architecture to capture context features. Xun et al. (2019) combined a representation learned from the journal name with information from the abstract and used a multi-view neural classifier to obtain the results. Wang and Mercer (2019) provided a usable dataset that includes the title, abstract, paragraphs associated with the figures, and tables of each text, and used a multi-channel TextCNN (Kim, 2014) to solve the problem.
MeSH terms were modelled independently in those methods, which ignored the relationships between MeSH terms. In this paper, we use a GCN to capture the more complex topological relationships.

Graph Convolutional Network and Correlation Matrix
We use a Graph Convolutional Network (GCN) to model the relationships between MeSH terms. Kipf and Welling (2016) proposed the GCN, which induces embedding vectors of nodes from the properties of their neighboring nodes. Consider a graph $G = (V, E)$, where $V$ and $E$ denote the sets of nodes and edges, respectively. The GCN is a multi-layer neural network. With convolutional operations, the propagation of every layer can be written as
$$H^{l+1} = h(\tilde{A} \cdot H^{l} \cdot W^{l})$$
Here, $H^{l} \in \mathbb{R}^{n \times d}$ and $H^{l+1} \in \mathbb{R}^{n \times d'}$ are the node representations of the $l$-th and $(l+1)$-th hidden layers, respectively (where $n$ is the number of nodes and $d$, $d'$ are the dimensions of the node representations), $\tilde{A} \in \mathbb{R}^{n \times n}$ is the normalized version of the correlation matrix $A \in \mathbb{R}^{n \times n}$, $h(\cdot)$ is a non-linear operation such as ReLU, $\cdot$ denotes the matrix product, and $W^{l} \in \mathbb{R}^{d \times d'}$ is a layer-specific trainable transformation matrix.
The GCN updates the node features by propagating information between neighboring nodes according to the correlation matrix. Hence, the crucial question is how to build the adjacency matrix. In most applications, the adjacency matrix is predefined. However, no such predefined adjacency matrix exists for extreme multi-label text classification. To address this problem, we propose a hybrid adjacency matrix construction method, which builds the adjacency matrix from tag frequencies and the co-occurrence relationships between tags.
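As an illustration of the propagation rule, the following minimal sketch applies one GCN layer to a toy three-node graph in plain Python. The row normalization used to obtain the normalized matrix, the toy values, and the helper names (`matmul`, `row_normalize`, `gcn_layer`) are assumptions made for this example; the text does not state which normalization is used.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def row_normalize(a):
    """Assumed normalization: divide each row by its sum (all-zero rows stay zero)."""
    out = []
    for row in a:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def gcn_layer(a_norm, h, w):
    """One propagation step: H^{l+1} = h(A_norm . H^l . W^l), with h = ReLU."""
    return relu(matmul(matmul(a_norm, h), w))


# Toy graph with n = 3 nodes, input dimension d = 2, output dimension d' = 2.
A = [[1, 1, 0],
     [1, 1, 1],
     [0, 1, 1]]
H0 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]
W0 = [[1.0, -1.0],
      [0.5, 1.0]]

H1 = gcn_layer(row_normalize(A), H0, W0)
```

Each output row mixes the features of a node's neighbors before the linear map, which is why the choice of adjacency matrix determines what information each tag receives.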
In extreme multi-label text classification, the number of tags is often in the tens of thousands. If we considered the relationships between all tags, the adjacency matrix would be huge and would consume considerable memory and time during computation. In extreme multi-label classification, the tag distribution is long-tailed, meaning that some tags appear rarely; hence $\tilde{A}$ is a sparse matrix. We therefore set a frequency threshold to divide tags into low-frequency and high-frequency groups. Empirically, we find that more low-frequency tags co-occur with high-frequency tags than with other low-frequency tags. Thus, we build an adjacency matrix $\tilde{A} \in \mathbb{R}^{m \times n}$, where $m$ is the number of high-frequency tags and $n$ is the total number of tags. Because it uses the information between high-frequency and low-frequency tags, we call it a hybrid adjacency matrix. Figure 2 shows an example of the adjacency matrix. We use the empirical conditional probability to model the directed relationship between tags:
$$P(L_j \mid L_i) = \frac{M_{ij}}{N_i}$$
which is the probability that tag $L_j$ occurs when tag $L_i$ appears, where $N_i$ denotes the number of occurrences of tag $L_i$ and $M_{ij}$ denotes the number of co-occurrences of tags $L_i$ and $L_j$.
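The construction described above can be sketched on a toy corpus as follows. The function name `hybrid_cooccurrence`, the toy tag lists, the zeroed diagonal, and the strict "more than" comparison against the frequency threshold are assumptions for illustration only.

```python
from collections import Counter
from itertools import permutations


def hybrid_cooccurrence(docs, freq_threshold):
    """Return (high_tags, all_tags, P) with P[i][j] = M_ij / N_i.

    Rows cover only the m high-frequency tags; columns cover all n tags.
    """
    counts = Counter(tag for tags in docs for tag in set(tags))   # N_i
    pair_counts = Counter()                                       # M_ij
    for tags in docs:
        pair_counts.update(permutations(sorted(set(tags)), 2))

    all_tags = sorted(counts)
    high_tags = [t for t in all_tags if counts[t] > freq_threshold]
    # Self-pairs are set to 0 here; how the diagonal is treated is an assumption.
    P = [[pair_counts[(ti, tj)] / counts[ti] if ti != tj else 0.0
          for tj in all_tags] for ti in high_tags]
    return high_tags, all_tags, P


# Toy corpus: each document is the list of MeSH tags assigned to it.
docs = [
    ["Humans", "Neoplasms"],
    ["Humans", "Neoplasms", "Mice"],
    ["Humans", "Mice"],
    ["Humans", "Zebrafish"],
]
high_tags, all_tags, P = hybrid_cooccurrence(docs, freq_threshold=1)
```

In this toy corpus the matrix has 3 rows and 4 columns, and it is not symmetric: every document tagged "Mice" is also tagged "Humans", but only half of the "Humans" documents are tagged "Mice". This asymmetry is what makes the relationship directed.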
However, because the number of tags is large, these co-occurrence statistics may be noisy estimates for tag pairs with low co-occurrence frequency, so we apply a threshold $\tau$ to the conditional probabilities.
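The thresholding formula itself is not reproduced in this text. One plausible reading, assumed here and not confirmed by the source, is a binarization that keeps an edge only when the conditional probability reaches $\tau$. The function name `binarize` and the value $\tau = 0.4$ are purely illustrative.

```python
def binarize(P, tau):
    """Keep an edge (1) only where the conditional probability reaches tau."""
    return [[1 if p >= tau else 0 for p in row] for row in P]


# Conditional probabilities for 3 high-frequency tags (rows) over 4 tags (columns).
P = [[0.0, 0.5, 0.5, 0.25],
     [1.0, 0.0, 0.5, 0.0],
     [1.0, 0.5, 0.0, 0.0]]
A = binarize(P, tau=0.4)
```

With this choice, the weak 0.25 link to the rare fourth tag is dropped while the stronger links are kept.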

Document Representation
The core challenge in MeSH indexing is to learn representations for the title and abstract. After tokenizing the titles and abstracts, we derive context-aware representations via a bidirectional Gated Recurrent Unit (BiGRU):
$$H_{\text{title}} = \mathrm{BiGRU}(X_{\text{title}}), \qquad H_{\text{abstract}} = \mathrm{BiGRU}(X_{\text{abstract}})$$
where $H_{\text{title}}$ and $H_{\text{abstract}}$ are the hidden states of the title and abstract, respectively. $X_{\text{title}} \in \mathbb{R}^{L \times d_e}$ and $X_{\text{abstract}} \in \mathbb{R}^{L' \times d_e}$ denote the features of the title and abstract, respectively ($d_e$ is the word embedding dimension), $L$ is the length of the title, $L'$ is the length of the abstract, and $d_h$ is the hidden layer dimension. In this work, the title and the abstract go through the same process.
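To make the bidirectional encoding concrete, here is a small plain-Python sketch of a single-layer BiGRU over a toy title. It is a sketch under stated assumptions, not the model's implementation: the weights are random, bias terms are omitted, forward and backward states are concatenated (giving $2 d_h$ values per token), and the names `GRUCell` and `bigru` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(w, v):
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Standard GRU cell without bias terms, with randomly initialised weights."""

    def __init__(self, d_e, d_h, rng):
        self.d_h = d_h
        # Input-to-hidden (W) and hidden-to-hidden (U) weights for the
        # update gate z, the reset gate r and the candidate state n.
        self.W = {g: rand_matrix(d_h, d_e, rng) for g in "zrn"}
        self.U = {g: rand_matrix(d_h, d_h, rng) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b)
             for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        n = [math.tanh(a + ri * b)
             for a, ri, b in zip(matvec(self.W["n"], x), r, matvec(self.U["n"], h))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, xs):
        h, out = [0.0] * self.d_h, []
        for x in xs:
            h = self.step(x, h)
            out.append(h)
        return out


def bigru(xs, fwd, bwd):
    """Concatenate forward and backward hidden states at every position."""
    f = fwd.run(xs)
    b = bwd.run(xs[::-1])[::-1]
    return [fi + bi for fi, bi in zip(f, b)]


rng = random.Random(0)
d_e, d_h, L = 4, 3, 5                     # toy sizes, not the paper's settings
X_title = [[rng.uniform(-1, 1) for _ in range(d_e)] for _ in range(L)]
H_title = bigru(X_title, GRUCell(d_e, d_h, rng), GRUCell(d_e, d_h, rng))
```

The result has one row per token, and each row sees context from both directions: the first half summarizes the words up to that position, the second half the words from that position onward.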

MeSH Representation
First, we use the word embeddings of all MeSH terms as the initial input ($H^{0}$) to the GCN. Using the adjacency matrix $A$ introduced in the Graph Convolutional Network and Correlation Matrix section, we obtain new representations of MeSH terms that carry co-occurrence information after multiple GCN layers:
$$H^{l+1} = h(\tilde{A} \cdot H^{l} \cdot W^{l})$$
where $H^{l} \in \mathbb{R}^{m \times d_l}$ is the representation of the high-frequency MeSH terms at the $l$-th layer, $\tilde{A}$ is the normalized version of the adjacency matrix, and $W^{l}$ is a layer-specific trainable transformation matrix. In other words, only the representations of high-frequency MeSH terms are propagated at each GCN layer. After obtaining the GCN representation of the relationships between MeSH terms, we also use the word embeddings of the MeSH terms to retain semantic information: the GCN output is concatenated (denoted by the symbol :) with $e_{MeSH}$, the word embedding of the MeSH terms.
We can now use the MeSH representations to select the most relevant text features for classification through an attention mechanism. We calculate the similarity between MeSH terms and text by dot products and normalize along the word axis with Softmax, which gives the attention score $A_{attn}$ between the abstract and the MeSH terms. Weighting the abstract hidden states $H_{\text{abstract}}$ by $A_{attn}$ then gives a text representation for each MeSH term. The score of each MeSH term is obtained from this representation by a linear layer with trainable weight matrix $W$ and bias $b$, followed by the sigmoid function $\sigma(\cdot)$. The binary cross-entropy loss function is applied in the model:
$$\mathcal{L} = -\sum_{j} \left[ y_j \log \hat{y}_j + (1 - y_j) \log (1 - \hat{y}_j) \right]$$
where $y_j$ is the ground truth and $\hat{y}_j \in [0, 1]$ is the predicted score. The total loss accumulates this loss over all $K$ training examples, where $K$ is the total number of training data. Finally, the MeSH multi-label classifier outputs the predicted MeSH terms.
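The attention, scoring, and loss steps described in this section can be sketched as follows. This is a simplified illustration under several assumptions: label vectors and word states share the same dimension, a single weight vector `w` and bias `b` are shared across all MeSH terms, the loss is summed rather than averaged, and all names and toy values are hypothetical.

```python
import math


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def label_attention_scores(h_mesh, h_text, w, b):
    """Score every MeSH term against one document.

    h_mesh: n label vectors; h_text: L word states of the same dimension.
    """
    scores = []
    for label_vec in h_mesh:
        # Dot-product similarity with every word, normalised over the word axis.
        attn = softmax([dot(label_vec, word) for word in h_text])
        # Attention-weighted sum of word states: a label-specific text vector.
        ctx = [sum(a * word[k] for a, word in zip(attn, h_text))
               for k in range(len(h_text[0]))]
        # Linear layer followed by a sigmoid gives a score in (0, 1).
        scores.append(1.0 / (1.0 + math.exp(-(dot(w, ctx) + b))))
    return scores


def bce_loss(y_true, y_pred, eps=1e-12):
    """Binary cross-entropy summed over labels (eps guards against log(0))."""
    return -sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for y, p in zip(y_true, y_pred))


H_mesh = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                   # 3 toy MeSH terms
H_abstract = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [0.0, 0.0]]   # 4 toy word states
W, b = [1.0, -1.0], 0.0                                          # toy linear layer

y_hat = label_attention_scores(H_mesh, H_abstract, W, b)
loss = bce_loss([1, 0, 0], y_hat)
```

Because each MeSH term builds its own weighted view of the text, two tags can draw on different words of the same abstract before they are scored.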

Implementation Details
During preprocessing, non-English characters are removed. The embedding dimension of both the title and the abstract is 200, the number of GRU layers is set to 2, and the hidden dimension is 200. For the GCN, we use a single GCN layer with input and output dimensions of 200. LeakyReLU (Maas et al., 2013) with a negative slope of 0.2 is used as the non-linear activation function. To split MeSH terms by frequency on the PMC Collection dataset, we treat terms with more than 1000 occurrences as high-frequency and terms with fewer than 1000 occurrences as low-frequency. For the SETC2015 dataset, the threshold is 500. We set τ in Eq. (4) [...] (Yao et al., 2007). The model is implemented with PyTorch (Paszke et al., 2017).

Evaluation Metrics
Because the tag space is large, only a few tags match each text. Hence, the major metrics for performance evaluation are ranking-based. Precision at k (P@k) and normalized discounted cumulative gain (nDCG) are two widely used ranking-based metrics, and we adopt both in this paper.
Table 2 shows the results on the ranking-based metrics. Although there are some strong baselines from the BioASQ challenge, their code is not available for testing on the two datasets. We therefore compare with the state-of-the-art method, multi-channel TextCNN (Wang and Mercer, 2019). For the proposed model, we report results both with and without the GCN. Our model without the GCN already outperforms the baseline, and the model with the GCN achieves the best results, which may be because the model with the GCN pays more attention to the co-occurrence relationships between the tags.
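For reference, the two ranking metrics named above can be computed as in the following sketch, which uses binary relevance and the common convention of normalizing by the best DCG achievable within the top k. The function names and the toy ranking are assumptions for illustration; the exact variant used in the experiments is not specified in the text.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted tags that are true tags."""
    return sum(1 for t in ranked[:k] if t in relevant) / k


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, t in enumerate(ranked[:k]) if t in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0


# Predicted tags sorted by decreasing score, and the true tag set of one article.
ranked = ["Humans", "Mice", "Neoplasms", "Zebrafish", "Animals"]
relevant = {"Humans", "Neoplasms", "Animals"}

p_at_3 = precision_at_k(ranked, relevant, 3)
ndcg_at_3 = ndcg_at_k(ranked, relevant, 3)
```

P@k only counts how many of the top k tags are correct, whereas nDCG also rewards placing the correct tags nearer the top of the ranking.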

Experimental Results
In addition, the scores on the PMC Collection dataset increase by about 2-4 points after introducing the GCN, whereas the scores on SETC2015 increase by only 1-2 points. We attribute this to the size of SETC2015: with only 14000 samples, the data-driven adjacency matrix is biased. In contrast, the PMC Collection dataset contains about 250000 samples, so the adjacency matrix built from it should be closer to the true co-occurrence relationships between MeSH terms, which results in better performance.

Ablation Studies
Table 3 shows the effect of the frequency threshold that separates low-frequency from high-frequency MeSH terms. If the threshold is too high, few MeSH terms count as high-frequency, which makes the representations of different MeSH terms too smooth. When the threshold is too low, however, there are many high-frequency terms, and some of their co-occurrences may become noise.
Table 4 shows that the results decrease as the number of GCN layers increases. With more GCN layers, the information transmitted between nodes may accumulate, resulting in excessive smoothing of the final representation.
The results of the ablation experiment are shown in Table 5. The title contains a lot of useful information, and extracting information from the title and abstract separately is slightly better than directly concatenating the two.

Conclusion
Modelling the relationships between MeSH terms is a key issue in MeSH indexing. This paper proposes a GCN-based method for specifying the relationships between MeSH terms and a new end-to-end model for MeSH indexing.
In the field of biomedicine, co-occurrence relationships between tags are very common and useful. We use these co-occurrence relationships to design the adjacency matrix for the GCN in a data-driven way, an approach that can also be extended to other extreme multi-label classification fields.