Automatic Term Name Generation for Gene Ontology: Task and Dataset

Terms in the Gene Ontology (GO) are widely used in biology and bio-medicine. Most previous research focuses on inferring new GO terms, while the term names that reflect gene function are still written by experts. To fill this gap, we propose a novel task, term name generation for GO, and build a large-scale benchmark dataset. Furthermore, we present a graph-based generative model that incorporates the relations between genes, words and terms for term name generation, and show that it exhibits clear advantages over strong baselines.


Introduction and Related Work
Gene Ontology (GO) is a widely used biological ontology that contains a large number of terms describing gene function in three aspects: molecular function, biological process and cellular component (Consortium, 2015, 2016). The terms are organized hierarchically like a tree and can be used to annotate genes, as demonstrated in Figure 1. GO has been extensively studied in the bio-medicine and biology research communities for its great value in many applications, such as protein function analysis (Cho et al., 2016) and disease association prediction (Menche et al., 2015).
A major concern in GO is its construction, including term discovery, naming and organization (Mazandu et al., 2017; Koopmans et al., 2019). In early studies, terms were manually defined and organized by experts in particular areas of biology, which is labor-intensive and inefficient given the large volume of biological literature published every year (Tomczak et al., 2018). Moreover, different experts may use different expressions to describe the same biological concept, causing an inconsistency problem in term naming. Recently, many researchers have turned to automatic methods for GO construction. Dutkowski et al. (2013) proposed the Network-eXtracted Ontology (NeXO), which clusters genes hierarchically based on their connections in molecular networks and recovers around 40% of GO terms according to the alignment between NeXO and GO. To further improve performance, Kramer et al. (2014) identified gene cliques in an integrated biological network and treated each clique as a term. Although these methods infer new GO terms and their relationships automatically from structured networks (Gligorijević et al., 2014; Li and Yip, 2016; Peng et al., 2015), the new terms are still named manually by experts, which is prone to inefficiency and inconsistency. Furthermore, only the structural information in existing networks is utilized, while the genes' rich textual information, which potentially describes the corresponding term, has not been well studied.
To obtain term names automatically and thus boost GO construction, we propose a novel task that aims to generate term names based on the textual information of the related genes. An illustrative example of the task is shown in Figure 1. The genes IGFBP3, OGFR and BAP1 are annotated by the term with ID GO:0001558 and name "Regulation of cell growth". Since we observe considerable word overlap between the term name and the gene text (aliases and descriptions), we aim to generate the term name from the gene text. To facilitate this research, we first present a dataset for term name generation in GO. Then, we propose a graph-based generative model that incorporates the potential relations between genes, words and terms for term name generation. The experimental results demonstrate the effectiveness of our proposed model. The contributions of our work are threefold: (1) To the best of our knowledge, this is the first attempt to generate term names for GO automatically.
(2) We present a large-scale dataset for term name generation based on various biological resources, which will help boost the research in bio-medicine and biology. (3) We conduct extensive experiments with in-depth analyses, which verify the effectiveness of our proposed model.

Dataset
We build a large-scale dataset 1 for term name generation, which contains the GO terms for Homo sapiens (human). We collect the term ID, term name and the IDs of the corresponding genes from the Gene Ontology Consortium 2 . In addition, gene aliases and descriptions are crawled from GeneCards 3 , which includes information from the Universal Protein Resource (UniProt) 4 .
Our dataset contains 18,092 samples in total. Each sample contains a term ID, a term name and the related genes with aliases and descriptions, as demonstrated in Figure 1. Statistics and distributions of the dataset are shown in Table 1 and Figure 2. We observe that about 51.3% of the words are shared between term names and related genes, indicating the potential of utilizing the textual information of genes for term name generation. It is also interesting that some patterns such as "regulation of" appear frequently in term names, providing valuable clues for enhancing generation performance.
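The overlap statistic above can be computed with a simple token-level procedure. The sketch below is illustrative: the function name and the sample gene texts are hypothetical, only the fraction-of-shared-words idea comes from the text.

```python
def term_gene_overlap(term_name, gene_texts):
    """Fraction of term-name tokens that also appear in the related genes' text."""
    term_tokens = term_name.lower().split()
    gene_tokens = set()
    for text in gene_texts:
        gene_tokens.update(text.lower().split())
    shared = [t for t in term_tokens if t in gene_tokens]
    return len(shared) / len(term_tokens)

# Hypothetical sample loosely mirroring Figure 1; "growth" is the only shared token
overlap = term_gene_overlap(
    "regulation of cell growth",
    ["IGFBP3 insulin like growth factor binding protein 3",
     "OGFR opioid growth factor receptor"],
)
```

Averaging this quantity over all samples would yield the reported 51.3% on the full dataset.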

Graph-based Generative Model
Classical generative models such as Seq2Seq (Sutskever et al., 2014), HRNNLM (Lin et al., 2015) and the Transformer (Vaswani et al., 2017) only incorporate the sequential information of the source text for sentence generation, while the potential structure within the text is neglected. To alleviate this problem, we build a heterogeneous graph with the words, genes and terms as nodes, and adopt a graph-based generative model for term name generation. The overall architecture of our graph-based generative model is shown in Figure 3; it consists of two components: a GCN based encoder and a graph attention based decoder.

GCN based Encoder
The GCN-based encoder aims to encode the relations between genes, words and terms to boost term name generation. We first construct a heterogeneous graph based on the dataset, and then apply a Graph Convolutional Network (GCN) (Vashishth et al., 2019) for representation learning.
Graph Construction. We build a heterogeneous graph whose nodes are the words, genes and terms, and whose edges reflect the relations between them. The words come from the gene text. There are two types of edges: word-gene and gene-term. The weight of a word-gene edge is the normalized count of the word in the gene text, while the weight of a gene-term edge is 1 if the gene is annotated by the term.
Figure 3: The overall architecture of our Graph-based Generative Model. Prob("beta", g) and Prob("beta", c) denote the probabilities based on the generation-mode and copy-mode respectively.
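The graph construction above can be sketched as a single symmetric adjacency matrix over the concatenated node set [terms | genes | words]. This is an illustrative reconstruction: the function name, the per-gene count normalization, and the toy inputs are assumptions; the two edge types and their weights follow the description.

```python
import numpy as np
from collections import Counter

def build_adjacency(term_genes, gene_texts):
    """Heterogeneous adjacency over [terms | genes | words].

    word-gene edge: count of the word in the gene text, normalized per gene;
    gene-term edge: 1 if the gene is annotated by the term.
    """
    terms = sorted(term_genes)
    genes = sorted(gene_texts)
    vocab = sorted({w for text in gene_texts.values() for w in text.split()})
    n = len(terms) + len(genes) + len(vocab)
    t_idx = {t: i for i, t in enumerate(terms)}
    g_idx = {g: len(terms) + i for i, g in enumerate(genes)}
    w_idx = {w: len(terms) + len(genes) + i for i, w in enumerate(vocab)}

    A = np.zeros((n, n))
    for term, gs in term_genes.items():        # gene-term edges (binary)
        for g in gs:
            A[t_idx[term], g_idx[g]] = A[g_idx[g], t_idx[term]] = 1.0
    for g, text in gene_texts.items():         # word-gene edges (normalized counts)
        counts = Counter(text.split())
        total = sum(counts.values())
        for w, c in counts.items():
            A[g_idx[g], w_idx[w]] = A[w_idx[w], g_idx[g]] = c / total
    return A

# Hypothetical toy example: one term, two genes, four distinct words
A = build_adjacency(
    {"GO:0001558": ["IGFBP3", "OGFR"]},
    {"IGFBP3": "growth factor binding", "OGFR": "growth factor receptor"},
)
```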
Representation Learning. The initial representation of a word node is its word embedding. For a gene node, the gene alias and description encoded by a GRU model are used as the initial representation. For a term node, the pooling over the representations of all related gene nodes is used as the initial representation. Then, we update the node representations via a GCN model due to its effectiveness in modeling structural information (Kipf and Welling, 2016), which is formulated as follows:
Z = σ(Â σ(Â X W^(0)) W^(1)) (1)
where Â = A + I, A is the adjacency matrix of the graph, I is the identity matrix, and σ is a nonlinear activation function. X is the initial representation of the nodes, denoted as X = (t, g_1, ..., g_m, w_1, ..., w_n), where g_i, w_i and t denote the initial representations of the i-th gene, the i-th word and the term respectively. W^(0) and W^(1) represent the weight matrices of the first and second layers of the GCN.
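The two-layer update with Â = A + I can be sketched as follows. This is a minimal reconstruction, not the authors' implementation: the choice of ReLU as the nonlinearity and the random toy features are assumptions (the paper does not specify the activation), and the Â here omits any degree normalization since the text only defines Â = A + I.

```python
import numpy as np

def gcn_forward(A, X, W0, W1):
    """Two-layer GCN: Z = sigma(A_hat @ sigma(A_hat @ X @ W0) @ W1),
    with A_hat = A + I (self-loops added)."""
    A_hat = A + np.eye(A.shape[0])
    H = np.maximum(A_hat @ X @ W0, 0)      # first layer, ReLU
    return np.maximum(A_hat @ H @ W1, 0)   # second layer, ReLU

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [1.0, 0.0]])     # toy 2-node graph
X = rng.normal(size=(2, 4))                # initial node representations
Z = gcn_forward(A, X, rng.normal(size=(4, 3)), rng.normal(size=(3, 3)))
```

Each row of Z is the updated representation of one node; in the model these rows serve as the term, gene and word node vectors consumed by the decoder.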

Graph Attention based Decoder
Motivated by the effectiveness of the attention mechanism for generation (Bahdanau et al., 2014), we adopt a graph attention based decoder to generate the term name. The attentive word node representations produced by the GCN are utilized, formulated as:
α_{t,j} = softmax(v^T tanh(W_a [h_{t-1}; w_j])) (2)
where h_{t-1} is the previous hidden state, w_j is the word node representation by GCN, v is a parameter vector, and W_a is a parameter matrix. Given the word overlap between the gene text and the term name, we utilize the copy mechanism of CopyNet (Gu et al., 2016) for decoding, making it possible to generate a word either from the vocabulary of the training set or from the current gene text. The initial hidden state h_0 is the term node representation (i.e., t) obtained by GCN, and the hidden state is updated as:
h_t = f(h_{t-1}, [w_{t-1}; w_SR]) (3)
where f is the RNN function, w_{t-1} is the word embedding of the previously generated word, and w_SR is the selective read (SR) vector of CopyNet. When the previously generated word appears in the gene text, the next word will probably also come from it, and thus w_SR is that word's node representation; otherwise it is a zero vector. The probability of generating a target word y_t is calculated as a mixture of the generation-mode and copy-mode probabilities:
p(y_t) = (1/Z) (e^{ψ_g(y_t)} + Σ_{j: x_j = y_t} e^{ψ_c(x_j)}) (4)
where ψ_g(y_t) and ψ_c(x_j) are the score functions for the generation-mode and copy-mode respectively, which can be defined as in (Gu et al., 2016). Z = Σ_{v∈V} e^{ψ_g(v)} + Σ_{x∈S} e^{ψ_c(x)} is the normalization term, where V denotes the word vocabulary of the training set and S denotes the set of source words in the gene text. It is notable that there are many fixed patterns in the term names, as mentioned in Section 2. We therefore extract top-ranked bigrams and trigrams and treat them as new words for ease of generation.
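The mixture of generation-mode and copy-mode probabilities can be sketched numerically. This is an illustrative reconstruction: the function name and the toy scores are hypothetical, and the score functions ψ_g and ψ_c are taken as given inputs rather than computed from hidden states as in CopyNet.

```python
import numpy as np

def copynet_word_prob(word, gen_scores, copy_scores, source_words):
    """p(y_t) = (e^{psi_g(y_t)} + sum over source positions j with x_j = y_t
    of e^{psi_c(x_j)}) / Z, where Z normalizes over the full vocabulary
    and all source positions."""
    Z = sum(np.exp(s) for s in gen_scores.values()) + \
        sum(np.exp(s) for s in copy_scores)
    p = np.exp(gen_scores.get(word, -np.inf))  # 0 if word not in vocabulary
    for x, s in zip(source_words, copy_scores):
        if x == word:                          # copy mass from matching positions
            p += np.exp(s)
    return p / Z

# Hypothetical toy scores: "cell" can be generated or copied, "growth" only copied
gen_scores = {"regulation": 1.0, "of": 0.5, "cell": 0.2}
source_words, copy_scores = ["cell", "growth"], [0.3, 0.8]
p_cell = copynet_word_prob("cell", gen_scores, copy_scores, source_words)
```

Because Z sums both score families, the probabilities over the union of the vocabulary and the source words form a proper distribution.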

Experimental Setup
Implementation Details. The dataset is divided into training, validation and test sets with a proportion of 8:1:1. We adopt the widely used evaluation metrics BLEU-1/2/3 (Papineni et al., 2002) and Rouge-1/2/L (Lin, 2004) for the generation task. The word embeddings are initialized from N(0, 1) with a dimension of 300 and updated during training. The dimension of the hidden units for the GRU (Chung et al., 2014) and GCN is 300. We initialize the parameters from a uniform distribution with the Xavier scheme (Kumar, 2017), and the dropout rate is set to 0.5. The Adam (Kingma and Ba, 2014) optimizer with a learning rate of 1e-3 is used for training.
Baseline Methods. To evaluate the effectiveness of our proposed model, we compare against five strong baselines in two categories: (1) TF-IDF; (2) LexRank (Erkan and Radev, 2004); (3) Seq2Seq (Sutskever et al., 2014); (4) HRNNLM (Lin et al., 2015); (5) Transformer (Vaswani et al., 2017). The former two are extractive models, which extract words from the gene text as the term name; the latter three are generative models, which generate words from the vocabulary space as the term name.
Table 2: Overall performance of different models. The best result is marked in bold. Only the Rouge-1 and BLEU-1 scores for the extractive models are shown since they usually extract unigrams independently.
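The 8:1:1 split can be sketched as below; the shuffling, the fixed seed, and the function name are assumptions for reproducibility, as the paper does not describe how the split was drawn.

```python
import random

def split_dataset(samples, seed=42):
    """Shuffle and split samples into train/valid/test with an 8:1:1 ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n = len(samples)
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_valid],
            samples[n_train + n_valid:])

# Applied to the 18,092 samples reported in Section 2
train, valid, test = split_dataset(range(18092))
```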

Experimental Results
The experimental results are shown in Table 2. We observe that the generative models perform better than the extractive models by incorporating the language model probability into generation, which makes the generated term names more coherent; the extractive models usually extract keywords independently, which are hard to form into complete and concise term names. It is also notable that our graph-based generative model achieves the best performance in all cases by incorporating the relations between genes, words and terms into generation, whereas the other generative models introduce unnecessary sequential information across multiple genes, which may have a side effect on term name generation. From the ablation study, we find that treating the frequent patterns as new words during generation and then restoring them further boosts performance. In addition, the copy mechanism helps improve generation performance, especially on the BLEU scores, which verifies the effectiveness of using the words shared between genes and terms for term name generation.

Visualization of Attention
To gain insight into why our proposed graph-based generative model is more effective, we randomly sample a generated term name that matches the ground truth, and draw an attention heatmap for the words in the term name and the corresponding gene aliases in Figure 4. The attention result for the gene descriptions is not presented here due to limited space. We observe that the word Tweety, which represents a gene group 5 , in the gene aliases is highly related to the words Transporter and Activity in the term name, indicating the potential of modeling the relations between words, genes and terms for enhancing term name generation.

Conclusions and Future Work
In this paper, we propose a novel task of automatic term name generation for GO based on the gene text. We construct a large-scale dataset and provide insights into this task. Experimental results show that our proposed graph-based generative model is superior to strong baselines by modeling the relations between genes, words and terms. In the future, we will explore how to utilize more knowledge to guide term name generation.