Bacteria Biotope Relation Extraction via Lexical Chains and Dependency Graphs

Abstract
In this article, we describe our approach to the Bacteria Biotopes relation extraction (BB-rel) subtask of the BioNLP Shared Task 2019. This task aims to promote the development of text mining systems that extract relationships between Microorganism, Habitat and Phenotype entities. We propose a novel approach to dependency graph construction based on lexical chains, so that one dependency graph can represent one or multiple sentences. We then propose a neural network model consisting of a bidirectional long short-term memory (BiLSTM) network and an attention-guided graph convolutional neural network, which learns relation extraction features from the graph. Our approach can extract both intra- and inter-sentence relations while also utilizing syntactic information. The results show that our approach achieved the best F1 (66.3%) in the official evaluation, in which 7 teams participated.


Introduction
The BioNLP Shared Task 2019 (Bossy et al., 2019) is a continuation of the previous efforts organized around the BioNLP Shared Task workshop series (Kim et al., 2009, 2011; Deléger et al., 2017). It aims to facilitate the development and sharing of computational tasks in biomedical text mining and of solutions to them. The Bacteria Biotope (BB) task is one of the six main tasks of the BioNLP Open Shared Tasks 2019. Three teams participated in the BB task when it was first organized in 2011. INRA Bibliome (Ratkovic et al., 2011) achieved the best F-score of 45% with the Alvis system, which used dictionary mapping, ontology inference and semantic analysis for NER, and co-occurrence-based rules for detecting relations between the entities. The 2013 BB task contained three subtasks: the first concerned the recognition and normalization of bacteria and habitat entities, and the other two involved relation extraction. Four teams participated in these tasks, with the UTurku TEES system (Björne and Salakoski, 2013) ranking first with F-scores of 42% and 14%. Compared with the 2013 BB task, the 2016 BB task contained more subtasks, and its subtask 2 concerned only relation extraction. The VERSE team (Lever and Jones, 2016) achieved the best F-score of 55.8% in subtask 2.
The Bacteria Biotopes relation extraction (BB-rel) subtask of the BioNLP Shared Task 2019 aims to automatically extract Microorganism-Habitat and Microorganism-Phenotype relationships from biomedical literature. The BB-rel task follows the previous Bacteria Biotopes shared tasks, annotating directed binary relationships between Microorganism, Habitat and Phenotype entities. In the BB-rel task, not all relations hold between two entities within the same sentence. In the preprocessing step, we found that for about one fourth of all relations the argument entities are located in different sentences. Therefore, we need a model that considers entity relationships not only within one sentence, but also across sentence boundaries.
A lexical chain (Morris and Hirst, 1991) is a sequence of words that are semantically similar or related. These words are related sequentially in the text, defining the topic of the text segment that they cover and establishing associations between sentences. Building on this property, researchers have applied lexical chains successfully to many NLP tasks, such as word sense induction (Tao et al., 2014), machine translation (Mascarell, 2017) and text segmentation (Stokes et al., 2004). In the BB-rel dataset, the sentences in which inter-sentence relations occur usually express the same topic or are semantically associated with each other. These associations usually surface as related words that can form lexical chains. Following this observation, we propose a novel approach to building an inter-sentence dependency graph based on lexical chains.
In this paper, we propose a novel relation extraction method for the BB-rel task that incorporates dependency graphs and lexical chains into a neural network. As shown in Figure 1, inter-sentence relations are usually expressed in interrelated sentences, and these sentences may contain semantically related words that can form lexical chains. We use these lexical chains and dependency graphs to build an inter-sentence dependency graph for inter-sentence relation extraction. Specifically, we use word embeddings to find the semantic relationships between words that occur in different sentences, in order to build reliable lexical chains. Then, we use the Stanford CoreNLP toolkit (Manning et al., 2014) to obtain sentence-level dependency and part-of-speech (POS) information, and build an inter-sentence dependency graph from this information and the lexical chains.
After that, we employ a neural network model consisting of a bidirectional long short-term memory (BiLSTM) network and an attention-guided graph convolutional neural network to extract features from the inter-sentence dependency graph. The features are fed into a multi-layer perceptron (MLP) to classify the relation between an entity pair. Our approach has two advantages. First, it can extract both intra-sentence and inter-sentence relations by connecting the dependency graphs of different sentences via lexical chains. Second, it can leverage syntactic information. The results in the BB-rel task demonstrate the effectiveness of our method: it achieves the highest F1-score and the second-highest precision and recall in the official evaluation.

Method
In this section, we first introduce our strategy of relation candidate generation. Then, the approach for constructing lexical chains is described. After that, we will introduce how to build inter-sentence dependency graphs. Lastly, the architecture of our neural network model is described.

Relation Candidate Generation
In the BB-rel dataset, if all candidate pairs (a bacterium with a habitat or a phenotype) that occur in a document are listed as candidate training examples, the positive and negative examples become highly imbalanced, because most entity pairs that cross a sentence boundary do not hold any relation. Based on our observations, most entity pairs spanning more than two sentences have no relation between them. Therefore, we consider all entity pairs that span at most two sentences as candidates for generating training examples. The statistics of our dataset are summarized in Table 1.
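The filtering step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `(mention_id, sentence_index)` tuple format and the function name are hypothetical, and "spanning at most two sentences" is read here as "same sentence or adjacent sentences" (`max_sentence_gap=1`), which is an assumption.

```python
from itertools import product


def generate_candidates(microorganisms, targets, max_sentence_gap=1):
    """Pair each Microorganism with each Habitat/Phenotype mention that is
    at most `max_sentence_gap` sentences away.

    Each mention is a hypothetical (mention_id, sentence_index) tuple.
    """
    candidates = []
    for (m_id, m_sent), (t_id, t_sent) in product(microorganisms, targets):
        # Assumption: a gap of 0 or 1 means the pair spans at most two sentences.
        if abs(m_sent - t_sent) <= max_sentence_gap:
            candidates.append((m_id, t_id))
    return candidates


microorganisms = [("M1", 0), ("M2", 3)]
targets = [("H1", 0), ("H2", 1), ("P1", 4), ("H3", 6)]
print(generate_candidates(microorganisms, targets))
# [('M1', 'H1'), ('M1', 'H2'), ('M2', 'P1')]
```

Raising `max_sentence_gap` admits more distant pairs at the cost of a larger share of negative examples.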

Lexical Chain Construction
In previous work, there are mainly three approaches to constructing lexical chains. The first utilizes WordNet (Hirst and St-Onge, 1997) to capture the semantic relationships between words. The second (Remus and Biemann, 2013) automatically extracts lexical chains using statistical methods. The third (Li et al., 2017) is based on semantic word vectors. In this paper, we assume that the lexical relationship between two words can be captured by the similarity of their semantic vectors. To compute similarities, we use 200-dimensional pre-trained word vectors. Moreover, we only consider nouns when constructing the lexical chains, since they usually carry the relevant information.
Figure 2: Process of lexical chain construction. Orange words denote nouns. C is the set of lexical chains. The similarity here refers to the cosine similarity between word vectors. We set the threshold to 0.5.
Given a sentence, we first use the Stanford CoreNLP toolkit (Manning et al., 2014) to obtain the POS tag of each word. We then pick the words whose POS tags belong to N = {NN, NNP, NNS} as candidates for chain construction. We take one candidate at a time and check where it should be placed. Let C be the set of lexical chains; we add each candidate w to C according to the following steps (Figure 2):
• Step 1: each noun is treated as a candidate w. If C is empty, we create a new lexical chain in C and add the current candidate w to it.
• Step 2: for the current candidate w, we traverse all the lexical chains in C and compute the similarity between the last word of each lexical chain and w. If the similarity surpasses a predefined threshold, w is attached to the corresponding lexical chain.
• Step 3: if w cannot be attached to any existing lexical chain, we create a new lexical chain for it.
Figure 3: An example of the dependency graph and its corresponding adjacency matrix. If there is a dependency relation between nodes $i$ and $j$ in the dependency graph, the value of the element $M_{ij}$ in the adjacency matrix is 1.
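The three steps of lexical chain construction can be sketched as below. This is a minimal sketch under stated assumptions: the two-dimensional toy vectors stand in for the 200-dimensional pre-trained embeddings, the function names are hypothetical, and a candidate is attached only to the first chain whose last word is similar enough (the text does not say what happens when several chains match).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_lexical_chains(nouns, vectors, threshold=0.5):
    """Greedily assign each noun (in text order) to a lexical chain."""
    chains = []
    for w in nouns:
        attached = False
        for chain in chains:
            # Step 2: compare the candidate with the last word of each chain.
            if cosine(vectors[chain[-1]], vectors[w]) > threshold:
                chain.append(w)
                attached = True
                break  # assumption: attach to the first matching chain only
        if not attached:
            # Steps 1 and 3: no chain exists yet, or none is similar enough.
            chains.append([w])
    return chains


# Toy vectors chosen for illustration only.
vectors = {
    "bacteria": [1.0, 0.1],
    "strain": [0.9, 0.2],
    "cheese": [0.1, 1.0],
    "milk": [0.2, 0.9],
}
print(build_lexical_chains(["bacteria", "cheese", "strain", "milk"], vectors))
# [['bacteria', 'strain'], ['cheese', 'milk']]
```

Because only the last word of each chain is compared, a chain can drift in meaning as it grows; the threshold of 0.5 is the value reported in the caption of Figure 2.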

Dependency Graph Construction
In this section, we propose an approach to building an inter-sentence dependency graph using lexical chains. For an entity pair that occurs within the same sentence, we directly use the dependency graph of that sentence. If the two entities occur in different sentences, we construct their dependency graph using lexical chains. We design two rules for building an inter-sentence graph. We use the following notation: $C$ is the set of lexical chains, and $A$ and $B$ are nouns belonging to sentence $s_1$ and sentence $s_2$, respectively.
• Rule 1: if A and B exist in the same chain of C, we will add an edge between A and B to build an inter-sentence dependency graph.
• Rule 2: if A and B do not appear in the same lexical chain, we will use the root nodes of the two sentences to build the inter-sentence dependency graph.
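The two rules can be sketched as follows, with the resulting undirected graph stored as a symmetric 0/1 adjacency matrix. The sketch makes several assumptions that the text does not state: chains are stored as lists of surface words, Rule 2 is applied as a fallback only when no noun pair shares a chain, and "using the root nodes" is read as adding an edge between the two roots. The function name and the toy parse are hypothetical.

```python
def build_inter_sentence_adjacency(tokens1, edges1, root1,
                                   tokens2, edges2, root2, chains):
    """Join two sentence dependency graphs into one adjacency matrix.

    edges1/edges2 hold (i, j) token-index pairs local to each sentence;
    sentence-2 indices are shifted by len(tokens1) in the joint graph.
    """
    offset = len(tokens1)
    n = offset + len(tokens2)
    edges = list(edges1) + [(i + offset, j + offset) for i, j in edges2]

    # Rule 1: link words of s1 and s2 that belong to the same lexical chain.
    cross = []
    for chain in chains:
        members = set(chain)
        for i, a in enumerate(tokens1):
            for j, b in enumerate(tokens2):
                if a in members and b in members:
                    cross.append((i, j + offset))

    # Rule 2 (assumed fallback): connect the two sentence roots.
    if not cross:
        cross.append((root1, root2 + offset))

    matrix = [[0] * n for _ in range(n)]
    for i, j in edges + cross:
        matrix[i][j] = matrix[j][i] = 1  # undirected graph
    return matrix


tokens1, edges1, root1 = ["MRSA", "was", "isolated"], [(2, 0), (2, 1)], 2
tokens2, edges2, root2 = ["The", "strain", "grew"], [(1, 0), (2, 1)], 2

with_chain = build_inter_sentence_adjacency(
    tokens1, edges1, root1, tokens2, edges2, root2, [["MRSA", "strain"]])
without_chain = build_inter_sentence_adjacency(
    tokens1, edges1, root1, tokens2, edges2, root2, [])
print(with_chain[0][4], without_chain[2][5])
```

In the first call the chain links "MRSA" (index 0) to "strain" (index 4); in the second call no chain applies, so the roots "isolated" (index 2) and "grew" (index 5) are linked instead.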
Then we convert the dependency graph into an adjacency matrix. An example of this process is shown in Figure 3. Given a sequence $S = \{s_1, s_2, ..., s_n\}$, we treat its dependency graph as an undirected graph, which can be converted into an adjacency matrix. If there is a dependency relation between nodes $i$ and $j$ in the dependency graph, the element $M_{ij}$ of the adjacency matrix is set to 1.
Figure 4: The architecture of our model. The input sentence is "MRSA were isolated by oxacillin screening agar" with a Microorganism entity "MRSA" and a Habitat entity "oxacillin screening agar". M denotes the adjacency matrix.
Figure 4 shows the neural network architecture of our model. It takes words and POS tags as input. We adopt 200-dimensional word embeddings and 20-dimensional POS tag embeddings. The final representation of each token, $x_i$, is the concatenation of the word embedding $s_i$ and the POS tag embedding $p_i$. We initialize the word embeddings with pre-trained biomedical embeddings and randomly initialize the POS tag embeddings.

BiLSTM Layer
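After obtaining the word representation sequence $x = \{x_1, x_2, ..., x_n\}$, we leverage bidirectional LSTMs (Hochreiter, 1998) to encode context information into each word. The forward and backward hidden states ($\overrightarrow{h}_i$ and $\overleftarrow{h}_i$) are concatenated to form the contextual representation of the $i$-th word, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.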

Attention-Guided GCNN Layer
We employ the attention-guided graph convolutional neural network (AGCNN) (Guo et al., 2019a) to incorporate the dependency information into the word representations. The AGCNN is composed of M identical blocks. Each block has three types of layers: an attention-guided layer, a densely connected layer and a linear combination layer.
In the attention-guided layer, we first update the representation of each node using a graph convolutional network (GCNN) (Zhang et al., 2018). For an $L$-layer GCNN, we denote the inputs of the first layer as $g_1^{(0)}, ..., g_n^{(0)}$, and $g_i^{(l)}$ denotes the output vector of node $i$ in the $l$-th layer. The convolution operation in the $l$-th layer can be written as:
$$g_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} \tilde{M}_{ij} W^{(l)} g_j^{(l-1)} / d_i + b^{(l)}\Big)$$
where $W^{(l)}$ is a linear transformation, $b^{(l)}$ is a bias term, and $\sigma$ is a nonlinear function (e.g., ReLU). $\tilde{M}$ is computed as $M + I$, where $I \in \mathbb{R}^{n \times n}$ is the identity matrix, and $d_i = \sum_{j=1}^{n} \tilde{M}_{ij}$ is the degree of node $i$ in the dependency graph. Intuitively, during the graph convolution of each layer, each node gathers the information of all its neighboring nodes in the graph.
After the $L$-layer graph convolution, we transform the original dependency graph into fully connected edge-weighted graphs by constructing $N$ attention-guided adjacency matrices ($N$ is a hyper-parameter). Each attention-guided adjacency matrix $\tilde{A}$ corresponds to a fully connected graph. In this paper, we use multi-head attention (Vaswani et al., 2017) to calculate $\tilde{A}$, which allows the model to focus on information from different representation sub-spaces. The output is computed as a weighted sum of values, where the weight is calculated by a function of the query and the corresponding key:
$$\tilde{A}^{(t)} = \mathrm{softmax}\Big(\frac{Q W_t^{Q} \, (K W_t^{K})^{\top}}{\sqrt{d}}\Big)$$
where $Q$ and $K$ are both equal to the collective representation $h^{(l-1)}$ at layer $l-1$ of the model. The projections are parameter matrices $W_t^{Q} \in \mathbb{R}^{d \times d}$ and $W_t^{K} \in \mathbb{R}^{d \times d}$. $\tilde{A}^{(t)}$ is the $t$-th attention-guided adjacency matrix, corresponding to the $t$-th head.
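To make the two operations of the attention-guided layer concrete, the sketch below implements one degree-normalised graph convolution (self-loops added via $M + I$, ReLU as the nonlinearity) and one scaled dot-product attention head that yields a row-stochastic, fully connected adjacency matrix. It is a plain-Python illustration under assumptions: the function names are hypothetical, node vectors are stored as rows (so weights are applied on the right), and the weights are toy values rather than learned parameters.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(M, G, W, b):
    """One layer: g_i = ReLU(sum_j (M + I)_ij * (g_j W) / d_i + b)."""
    n = len(M)
    M_tilde = [[M[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    GW = matmul(G, W)  # row j is the transformed vector of node j
    out = []
    for i in range(n):
        d_i = sum(M_tilde[i])  # degree of node i, including the self-loop
        row = [sum(M_tilde[i][j] * GW[j][k] for j in range(n)) / d_i + b[k]
               for k in range(len(b))]
        out.append([max(0.0, v) for v in row])  # ReLU
    return out


def attention_adjacency(H, W_q, W_k):
    """One head: softmax((H W_q)(H W_k)^T / sqrt(d)), applied row-wise."""
    d = len(H[0])
    Q, K = matmul(H, W_q), matmul(H, W_k)
    A = []
    for q in Q:
        scores = [sum(x * y for x, y in zip(q, k)) / math.sqrt(d) for k in K]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        A.append([e / total for e in exps])
    return A


M = [[0, 1], [1, 0]]              # two nodes joined by one dependency edge
G = [[1.0, 0.0], [0.0, 1.0]]      # toy node features
identity = [[1.0, 0.0], [0.0, 1.0]]

G1 = gcn_layer(M, G, identity, [0.0, 0.0])
A1 = attention_adjacency(G, identity, identity)
print(G1)
print([round(sum(row), 6) for row in A1])
```

Unlike the 0/1 matrix $M$, every entry of the attention-guided matrix is positive, which is what turns the dependency graph into a fully connected, edge-weighted one.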
Following Guo et al. (2019b), we introduce dense connections (Huang et al., 2017) into our model to capture more structural information on large graphs. We define $h_j^{(l)}$ as the concatenation of the initial node representation $g_j^{(0)}$ and the node representations $g_j^{(1)}, ..., g_j^{(l-1)}$ produced in layers $1, ..., l-1$:
$$h_j^{(l)} = [g_j^{(0)}; g_j^{(1)}; ...; g_j^{(l-1)}]$$
Each densely connected layer has $L$ sub-layers. The dimension of these sub-layers, $d_{hidden}$, is determined by $L$ and the input feature dimension $d$. In our model, we use $d_{hidden} = d/L$. We then use $N$ separate densely connected layers, modifying the computation of each layer as follows (for the $t$-th matrix $\tilde{A}^{(t)}$):
$$g_{t_i}^{(l)} = \sigma\Big(\sum_{j=1}^{n} \tilde{A}_{ij}^{(t)} W_t^{(l)} h_j^{(l)} + b_t^{(l)}\Big)$$
where $t = 1, ..., N$ and $t$ selects the weight matrix and bias term associated with the attention-guided adjacency matrix $\tilde{A}^{(t)}$. The column dimension of the weight matrix increases by $d_{hidden}$ per sub-layer, i.e., $W_t^{(l)} \in \mathbb{R}^{d_{hidden} \times d^{(l)}}$, where $d^{(l)} = d + d_{hidden} \times (l-1)$.
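The dimension bookkeeping of the densely connected layer can be checked with a few lines of arithmetic. The sketch below lists the weight-matrix shape of each sub-layer from $d_{hidden} = d/L$ and $d^{(l)} = d + d_{hidden} \times (l-1)$; the function name and the values $d = 300$, $L = 3$ are illustrative assumptions, not the settings reported in Table 2.

```python
def dense_sublayer_shapes(d, num_sublayers):
    """Return the (rows, columns) shape of the weight matrix of each sub-layer."""
    d_hidden = d // num_sublayers
    return [(d_hidden, d + d_hidden * (l - 1)) for l in range(1, num_sublayers + 1)]


shapes = dense_sublayer_shapes(300, 3)
print(shapes)
# [(100, 300), (100, 400), (100, 500)]
```

Each sub-layer emits $d_{hidden}$ features, so concatenating the outputs of all $L$ sub-layers restores the input dimension $d$ (here 3 × 100 = 300).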
Finally, we use a linear combination layer to integrate the representations from the $N$ different densely connected layers. Formally, the output of the linear combination layer is defined as:
$$g_{comb} = W_{comb} \, g_{out} + b_{comb}$$
where $g_{out}$ is obtained by concatenating the outputs of the $N$ separate densely connected layers, i.e., $g_{out} = [g^{(1)}; ...; g^{(N)}] \in \mathbb{R}^{d \times N}$. $W_{comb} \in \mathbb{R}^{(d \times N) \times d}$ is a weight matrix and $b_{comb}$ is a bias vector for the linear transformation.

Output Layer
We treat the BB-rel task as a classification task. The goal is to predict whether there is a "Live in" or "Exhibits" relationship between the entities $H_e$ and $M_e$. After applying the attention-guided GCNN layer to the input word vectors, we obtain a representation for each word. The sequence representation is obtained as:
$$g_{sent} = f(g_1, ..., g_n)$$
where $g_1, ..., g_n$ denote the outputs of the attention-guided GCNN layer and $f: \mathbb{R}^{d \times n} \rightarrow \mathbb{R}^{d}$ is a max-pooling function. Since we also observed that entity information is often critical for BB-rel extraction, the entity representations $M_e$ and $H_e$ are also used, given by:
$$M_e = f(g_{m_1}, ..., g_{m_2}), \quad H_e = f(g_{h_1}, ..., g_{h_2})$$
where $m_1, ..., m_2$ and $h_1, ..., h_2$ index the tokens of the two entities. Inspired by (Santoro et al., 2017; Lee et al., 2017), we obtain the final feature for BB-rel extraction by feeding the sequence and entity representations into a multi-layer perceptron (MLP):
$$g_{final} = \mathrm{MLP}([g_{sent}; M_e; H_e])$$
where "[]" denotes the concatenation operation.
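The pooling and concatenation steps of the output layer can be sketched as below. The sketch assumes that the entity representations are obtained with the same max-pooling function as the sequence representation, applied to the token span of each entity; the span format `(start, end)` with an exclusive end, the function names, and the toy vectors are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def relation_features(G, micro_span, target_span):
    """Concatenate sequence, Microorganism and Habitat/Phenotype representations.

    G holds one output vector per token; spans are (start, end) token
    indices with an exclusive end (a hypothetical convention).
    """
    g_sent = max_pool(G)
    m_e = max_pool(G[micro_span[0]:micro_span[1]])
    h_e = max_pool(G[target_span[0]:target_span[1]])
    return g_sent + m_e + h_e  # list concatenation, i.e. [g_sent; M_e; H_e]


# Toy outputs for a 4-token sentence with d = 2.
G = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4], [0.8, 0.1]]
features = relation_features(G, micro_span=(0, 1), target_span=(2, 4))
print(features)
# [0.8, 0.9, 0.1, 0.9, 0.8, 0.4]
```

The resulting vector has dimension $3d$ and would be the input to the MLP that produces the final feature.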
Finally, $g_{final}$ is fed into a softmax layer to compute the probability distribution over all classes. During training, our model uses the cross-entropy loss:
$$\mathcal{L} = -\sum_{j=1}^{J} \log P(y_j \mid S_j)$$
where $J$ denotes the size of the training set $S = \{(S_1, y_1), ..., (S_J, y_J)\}$ and $y_j$ denotes the gold answer of the $j$-th training instance. $P(y_j \mid S_j)$ denotes the probability that $S_j$ belongs to class $y_j$, which is calculated as $P(y_j \mid S_j) = \mathrm{softmax}(g_{final})$.
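As a worked illustration of the softmax and cross-entropy computation, the sketch below sums the negative log-probability of the gold class over a small batch. It assumes an unnormalised sum over training instances (no division by the set size) and uses toy scores in place of real model outputs; the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert a list of class scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_scores, gold_labels):
    """Sum of -log P(y_j | S_j) over the instances of a batch."""
    return -sum(math.log(softmax(scores)[y])
                for scores, y in zip(batch_scores, gold_labels))


# Two classes with equal scores give P = 0.5, so the loss is log 2.
loss = cross_entropy([[0.0, 0.0]], [0])
print(loss)
```

A confident correct prediction drives the term for that instance towards zero, while a confident wrong prediction makes it arbitrarily large.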

Evaluation Metrics
We sent the predictions of our model on the test set to the task organizers for evaluation. The performance of our model was evaluated with the standard measures: precision (P), recall (R) and F1-score (F1).
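For reference, the three measures can be computed from sets of predicted and gold relations as sketched below. This is a generic illustration, not the official evaluation script: relations are represented as hypothetical `(type, microorganism_id, target_id)` tuples and are compared by exact match.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F1 over two collections of relations."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [("Lives_In", "M1", "H1"), ("Lives_In", "M1", "H2"),
        ("Exhibits", "M1", "P1"), ("Lives_In", "M2", "H3"),
        ("Lives_In", "M2", "H4"), ("Exhibits", "M2", "P2")]
predicted = [("Lives_In", "M1", "H1"), ("Lives_In", "M1", "H2"),
             ("Exhibits", "M1", "P1"), ("Lives_In", "M1", "H3")]

p, r, f1 = precision_recall_f1(predicted, gold)
print(round(p, 2), round(r, 2), round(f1, 2))
# 0.75 0.5 0.6
```

F1 is the harmonic mean of precision and recall, so it rewards systems that balance the two rather than maximising one at the expense of the other.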

Hyper-parameter
The hyper-parameter setting is listed in Table 2. We tuned hyper-parameters based on the development set.

Official Results
The official results on the test set are shown in Table 3. In total, 7 teams participated in the BB-rel task. Each team could submit up to 2 predictions. We report the top result of each team.

Ensemble Training and Inference
In relation extraction tasks, ensemble training and inference have proven to be an effective way to improve the performance of neural network models (Mehryary et al., 2016; Lim and Kang, 2018). Following previous work (Lim and Kang, 2018), we improve the performance of our model using ensemble training and inference. We sum the output probabilities (logits) of the ensemble members, which are generated using the same neural network model but different weight initializations.
As shown in Figure 5, M1 to M10 are models with the same structure and hyper-parameters. In the training phase, we trained each ensemble member independently with differently initialized parameters. When inferring a relation for an easy sample, the trained ensemble members make relatively consistent predictions. When inferring for a difficult sample, the trained ensemble members may make different predictions. We incorporate the voting results of the 10 ensemble members to produce the final results.
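The combination step can be sketched as below: the per-class outputs of the ensemble members are summed and the class with the largest total is chosen. This is a minimal sketch assuming that each member outputs a probability distribution over the classes; the function name and the three toy members (instead of ten) are illustrative.

```python
def ensemble_predict(member_outputs):
    """Sum per-class outputs over ensemble members and pick the best class."""
    num_classes = len(member_outputs[0])
    summed = [sum(output[c] for output in member_outputs)
              for c in range(num_classes)]
    best = max(range(num_classes), key=summed.__getitem__)
    return best, summed


# Three members disagree on a difficult sample; class 1 wins on the summed scores.
member_outputs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
best_class, summed = ensemble_predict(member_outputs)
print(best_class)
# 1
```

Summing scores lets a few confident members outweigh several uncertain ones, which a simple majority vote over hard labels would not do.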
To investigate the effectiveness of ensemble training and inference, we conducted the following experiment on the development set. First, we ran our model five times and averaged the results to obtain the result of the single model, as shown in Table 4. Second, we ran ensemble training and inference once. The results show that ensemble training and inference achieved relatively balanced precision and recall, thus yielding a better F1.

Results of Recognizing Inter- and Intra-Sentence Relations
In this section, we discuss the performance of our model on intra- and inter-sentence relations. As shown in Table 5, we obtained an F1-score of 65.0 when we evaluated only the intra-sentence relations. When we evaluated both intra- and inter-sentence relations, the F1-score and recall increased by 0.7% and 3.2%, respectively, while precision dropped by 1.7%. We can also see from the table that performance on the "Exhibits" relation is better than on the "Live in" relation, because most "Exhibits" relations occur within a single sentence and follow a certain pattern.

Effects of Lexical Chains
To verify the effectiveness of constructing inter-sentence dependency graphs with lexical chains, we also conducted experiments on the development set. The results are shown in Table 6. "lexical chains" denotes the model that employs the proposed method, which constructs inter-sentence dependency graphs using lexical chains. "root nodes" denotes the model in which the inter-sentence dependency graphs are built using root nodes only. The "lexical chains" method obtained better performance than the "root nodes" method, which demonstrates that our idea is effective. Related sentences are usually expressed with related words. The related words found by lexical chains can serve as associations that connect the dependency graphs of different sentences. Therefore, we can build an effective representation for an inter-sentence entity pair.

Error Analysis
In this section, we manually analyze which cases lead to false positives, since those are more critical than false negatives. Figure 6 shows some examples of false positives. Most false positives are caused by overlapping target entities. For example, there is a "Live in" relation between "Listeria sp." and "chicken nugget processing plant", but there is no "Live in" relation between "Listeria sp." and "chicken" or "chicken nugget". These errors arise because the model is confused by overlapping entities that share a similar context.

Related Work
In the natural language processing community, there are a number of related competitions and tasks (Deléger et al., 2016). Most prior work focused on extracting relations within one sentence and ignored relations that cross sentence boundaries.
In the NLP community, combining linguistic features with neural networks has proven effective for relation extraction (Miwa and Bansal, 2016). Bunescu et al. (2005) demonstrated that the relationship of an entity pair can be captured along the shortest dependency path between the entities in the dependency graph, because the words on this path concentrate the most relevant information and reduce redundant information. Following this observation, several studies (Xu et al., 2015; Liu et al., 2015) achieved outstanding performance by combining shortest dependency paths with various neural networks. As deep learning developed, attention-based neural architectures (Zhou et al., 2016; Lin et al., 2016) were proposed for relation classification and showed state-of-the-art performance. However, with a few exceptions, almost all of this work focused only on intra-sentence relation extraction, without considering inter-sentence relations.
Recent work has explored several approaches that consider inter-sentence relations, such as Graph LSTMs (Peng et al., 2017), self-attention (Verga et al., 2018) and Graph CNNs (Sahu et al., 2019). However, none of these studies investigated lexical chains for inter-sentence relation extraction. In the future, we will evaluate our approach on large-scale datasets for intra- and inter-sentence relation extraction (Yao et al., 2019).

Conclusion
In this paper, we described the approach we used to participate in the Bacteria Biotope task at BioNLP-OST 2019. Our approach achieved very competitive performance in the official evaluation. We found that using lexical chains to build inter-sentence dependency graphs is effective. Moreover, ensemble training and inference can improve the performance of our model. The attention-guided graph convolutional neural network performs well in extracting Bacteria Biotope relations. Our approach is not specific to Bacteria Biotope relation extraction, however, and it can be applied to other relation extraction tasks.