YNU-junyi in BioNLP-OST 2019: Using CNN-LSTM Model with Embeddings for SeeDev Binary Event Extraction

We participated in the binary relation extraction subtask of the SeeDev task in the BioNLP 2019 Open Shared Tasks. Our model combines convolutional neural networks (CNN) and long short-term memory networks (LSTM) so that both the full-text information and the context information are captured. The model consists of two main modules: a distributed semantic representation, built from word embedding, distance embedding and entity type embedding; and a CNN-LSTM model. On the test set, our system obtained an F1 score of 0.342 for all types, which was the second highest in the task. The results show that the proposed method is effective for binary relation extraction.


Introduction
The goal of Information Extraction (IE) (Finkel et al., 2005) is to transform textual information into structured information, with a focus on quickly locating useful information in large amounts of data. IE (Fader et al., 2011) can also mine useful data and hidden knowledge from large text corpora, which has led to new research methods in many disciplines. For example, as demand grows for answers to key questions in life science and biology, many biological problems have reached a bottleneck because of inadequate methods. Biological information extraction (Bio-IE) has emerged in response and attracts more and more researchers. Typical applications include named entity recognition, the classification of relationships between proteins, and the extraction of interactions between drugs. Information extraction in the field of biology, especially event extraction, has also drawn increasing attention. It is a far-reaching task and a major challenge for information extraction.
The BioNLP Shared Task series is representative of biomolecular event extraction and has been held four times; this year is its fifth edition. The topics in this series range from fine-grained extraction and generalization to knowledge base construction, and the scope has grown with each edition. For instance, the BioNLP 2016 Shared Task (Nédellec et al., 2016) contained three separate parts: the Bacteria Biotope subtask (BB3), the Seed Development subtask (SeeDev) and the Genia Event subtask (GE4). The BioNLP 2019 Open Shared Task is broader; its parts include the Integrated structure, semantics and coreference subtask (CRAFT), the PharmaCoNER task, the Active Gene Annotation Corpus subtask (AGAC), BB3, SeeDev and the Research Domain Criteria subtask (RDoC).
We participated in the binary relation extraction task, which is part of the SeeDev task. The SeeDev task (Nédellec et al., 2013; Chaix et al., 2016) aims to promote the extraction of complex events on regulation in plants from scientific articles. It focuses on events describing the genetic and molecular mechanisms involved in seed development of the model plant Arabidopsis thaliana, and it involves both n-ary and binary relation extraction. The SeeDev task was first proposed at the BioNLP Shared Task 2016 (Nédellec et al., 2016; Mehryary et al., 2016). The 2019 edition is a rerun of the task, with an evaluation methodology more focused on the biological contribution.
Many teams participated in the BioNLP 2016 Shared Task (He et al., 2016). For example, VERSE (Lever and Jones, 2016) uses a support vector machine (SVM) and k-fold cross-validation to identify the best parameters, while DUTIR uses a deep learning method based on a convolutional neural network. Motivated by these studies, we build on the CNN and integrate an LSTM (Hochreiter and Schmidhuber, 1997) to address the CNN's inability to capture context information. With this improvement, we obtained good results.
The rest of the paper is structured as follows. Section 2 introduces the model. Section 3 presents the results and discussion. Section 4 concludes.

Model
The SeeDev-binary task can be viewed as binary relation extraction: deciding whether two entities interact. In relation extraction, the semantic and syntactic information of a sentence plays an important role. Traditional methods often require designing and extracting complex features based on domain-specific knowledge (such as tree kernels and graph kernels) to construct the model. As a result, such models depend heavily on the corpus and generalize poorly. We therefore use a CNN in place of manual feature engineering: starting from the word embeddings of the original input, convolution and pooling operations followed by fully connected layers learn high-level features automatically. In addition, we capture relative distance information and entity types as complementary features of the sentence. The output of the CNN is then fed into the LSTM. A CNN alone does not capture context information well, and the connections between parts of the text can make relation extraction more accurate. The LSTM captures this context information, which leads to a better final result.
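The data flow described above (embedded tokens, then convolution, then LSTM, then a prediction) can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the layer sizes, the random weights, the omission of bias terms and pooling, and the single sigmoid output are all assumptions made for the sketch, and every function name is hypothetical.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    """Small random weight matrix (stand-in for trained parameters)."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def conv1d(seq, filters, width):
    """Apply each filter to windows of `width` consecutive token vectors (ReLU)."""
    out = []
    for t in range(len(seq) - width + 1):
        window = [x for vec in seq[t:t + width] for x in vec]  # flatten the window
        out.append([max(0.0, a) for a in matvec(filters, window)])
    return out


def lstm_last_state(seq, params, hidden):
    """Run a single-layer LSTM (no biases) and return the final hidden state."""
    h = [0.0] * hidden
    c = [0.0] * hidden
    for x in seq:
        z = x + h  # concatenate current input and previous hidden state
        i = [sigmoid(a) for a in matvec(params["i"], z)]    # input gate
        f = [sigmoid(a) for a in matvec(params["f"], z)]    # forget gate
        o = [sigmoid(a) for a in matvec(params["o"], z)]    # output gate
        g = [math.tanh(a) for a in matvec(params["g"], z)]  # candidate cell state
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
    return h


def cnn_lstm_forward(seq, filters, width, params, hidden, w_out):
    """CNN features feed the LSTM; the last state is scored with a sigmoid."""
    features = conv1d(seq, filters, width)
    h = lstm_last_state(features, params, hidden)
    return sigmoid(sum(w * x for w, x in zip(w_out, h)))


# Illustrative sizes only; none of these values come from the paper.
EMB, WIDTH, N_FILTERS, HIDDEN = 6, 3, 4, 5
sentence = rand_matrix(8, EMB)  # stand-in for 8 embedded tokens
filters = rand_matrix(N_FILTERS, WIDTH * EMB)
params = {gate: rand_matrix(HIDDEN, N_FILTERS + HIDDEN) for gate in "ifog"}
w_out = [random.uniform(-0.1, 0.1) for _ in range(HIDDEN)]
prob = cnn_lstm_forward(sentence, filters, WIDTH, params, HIDDEN, w_out)
print(prob)
```

With untrained weights the output is close to 0.5. A system that also predicts the relation type would replace the sigmoid with a softmax over types.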
As shown in Figure 1, the model consists of two modules: the construction of the distributed semantic representation (word embedding, distance embedding and entity type embedding) and the CNN-LSTM module. The following sections give more details.

Data preprocessing
For data preprocessing, we first use the Stanford CoreNLP tool (Manning et al., 2014) to process the task data. The text is split into sentences and tokenized, parts of speech and lemmas are identified, and a dependency parse is generated for each sentence. We then further process the resulting data.

Embedding
We use the context of two entities to predict the type of relationship. In our task, the context is represented by the words between the two entities in a sentence. By analyzing the data, we observe that entity pairs of different types have different interaction probabilities, provided that the entity types satisfy the relationship constraints. The types of the two entities are therefore an important factor in predicting the relationship type, and in our model entity types are treated as a complement to the word embedding. In addition, we find that distance information usually plays an important role, because it captures the relative position between the two entities. We therefore concatenate the word embedding (Levy and Goldberg, 2014), type embedding (Su and Wang, 2011) and distance embedding (Cormode, 2003). We use pre-trained word embeddings. We now introduce the notation for the word embedding, entity type embedding and distance embedding.
Here, $S$ stands for the sentence. $E_1$ and $E_2$ are the types of the first and second entity, respectively. $W_1$ stands for the first word. $W$ is the word embedding table, $W_T$ is the type embedding table and $W_d$ is the distance embedding table. $LT_W(S)$ is the word embedding representation, $LT_{W_T}(S)$ is the entity type embedding representation and $LT_{W_d}(S)$ is the distance embedding representation. In the distance embedding, the zero vector ($\mathbf{0}$) is used to pad the sentence.
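The concatenation of the three embeddings can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the table names `W`, `W_T` and `W_d` follow the notation above, but the dimensions, the clipping range `MAX_DIST`, the random initialization, the `<unk>` entry, the choice to pad whole rows with zeros, and the example tokens and entity types are all hypothetical.

```python
import random

random.seed(0)

# Illustrative sizes; the paper does not report these values.
EMB_WORD, EMB_TYPE, EMB_DIST = 4, 2, 2
MAX_DIST = 5  # distances are clipped to [-MAX_DIST, MAX_DIST]


def make_table(keys, dim):
    """Build a lookup table mapping each key to a small random vector."""
    return {k: [random.uniform(-0.1, 0.1) for _ in range(dim)] for k in keys}


def relative_distance(i, entity_pos):
    """Signed token offset from an entity, clipped to the table range."""
    return max(-MAX_DIST, min(MAX_DIST, i - entity_pos))


def embed_sentence(tokens, e1_pos, e2_pos, e1_type, e2_type, W, W_T, W_d):
    """Concatenate word, entity type and distance embeddings for each token."""
    type_vec = W_T[e1_type] + W_T[e2_type]  # identical for every token
    rows = []
    for i, tok in enumerate(tokens):
        word_vec = W.get(tok, W["<unk>"])
        dist_vec = (W_d[relative_distance(i, e1_pos)]
                    + W_d[relative_distance(i, e2_pos)])
        rows.append(word_vec + type_vec + dist_vec)
    return rows


def pad(rows, length):
    """Right-pad the sentence matrix with zero vectors to a fixed length."""
    width = len(rows[0])
    return rows + [[0.0] * width for _ in range(length - len(rows))]


# Hypothetical three-token sentence with entities at positions 0 and 2.
tokens = ["GENE_A", "regulates", "PROTEIN_B"]
W = make_table(tokens + ["<unk>"], EMB_WORD)
W_T = make_table(["Gene", "Protein"], EMB_TYPE)
W_d = make_table(range(-MAX_DIST, MAX_DIST + 1), EMB_DIST)

matrix = pad(embed_sentence(tokens, 0, 2, "Gene", "Protein", W, W_T, W_d), 6)
print(len(matrix), len(matrix[0]))  # 6 12
```

Each row has 4 + 2·2 + 2·2 = 12 values: one word vector, two type vectors and two distance vectors. The resulting matrix is the kind of input a convolution layer would consume.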

Model training
We run our model five times and take the best run as the final result. In all runs, the dropout rate (Srivastava et al., 2014) is set to 0.5. The loss stabilizes at around 120 epochs, which suggests that the model has converged by then, so we train for 120 epochs. The batch size is set to 64. We use a pooling approach that combines average pooling and max pooling.
We also compare the CNN-LSTM model with a CNN-only model. The CNN-LSTM model performs better than the CNN-only model on the development set, so we use it for the final submission.
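One way to combine average pooling and max pooling is to pool each feature channel over the sequence with both operators and concatenate the results. The text does not state how the two are combined, so the concatenation below is an assumption, and `combined_pooling` is a hypothetical name.

```python
def combined_pooling(feature_maps):
    """Pool each channel over time with max and average pooling.

    feature_maps: list of time steps, each a list of channel activations.
    Returns the max-pooled values followed by the average-pooled values.
    """
    channels = list(zip(*feature_maps))  # transpose to per-channel sequences
    max_part = [max(c) for c in channels]
    avg_part = [sum(c) / len(c) for c in channels]
    return max_part + avg_part


# Three time steps, two channels (made-up activations).
features = [[1.0, -2.0], [3.0, 0.0], [2.0, 5.0]]
print(combined_pooling(features))  # [3.0, 5.0, 2.0, 1.0]
```

Max pooling keeps the strongest response per channel, while average pooling summarizes the whole sequence, so the concatenated vector carries both signals.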

Results and discussion
The SeeDev-binary task data consist of three parts: the training set, the development set and the test set. There are a total of 87 sections from 20 complete articles on Arabidopsis seed development.
Our method obtained F1 scores of 0.342 for all types and 0.394 when ignoring relation types and direction on the test set. The organizers report results under three evaluation conditions: global results, relations by type cluster, and ignoring relation types and direction. Two of these are new compared with the BioNLP 2016 Shared Task and were added to better reflect the biological contribution. Our system scored well against the official results of the other systems and ranked second among all teams, which indicates that the proposed method performs well in binary relation extraction. Table 2 shows the F1, recall and precision by cluster on the test set, and Table 3 shows the results for all types on the test set. Table 4 shows the results when ignoring types on the test set, and Table 5 shows detailed results of our method on the test set.
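The difference between the strict score and the score that ignores relation types and direction can be illustrated with a small scorer. This is a sketch of the general idea only: it is not the official evaluation tool, the matching rules are assumptions, and the relation labels and entity identifiers are made up.

```python
def prf(gold, pred, ignore_type_and_direction=False):
    """Precision, recall and F1 over sets of (type, arg1, arg2) relations."""
    def key(rel):
        rel_type, a, b = rel
        if ignore_type_and_direction:
            return frozenset((a, b))  # unordered pair, type dropped
        return (rel_type, a, b)

    gold_keys = {key(r) for r in gold}
    pred_keys = {key(r) for r in pred}
    tp = len(gold_keys & pred_keys)
    precision = tp / len(pred_keys) if pred_keys else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted relations.
gold = [("R1", "e1", "e2"), ("R2", "e3", "e4")]
pred = [("R1", "e1", "e2"), ("R1", "e4", "e3"), ("R1", "e5", "e6")]

strict = prf(gold, pred)
relaxed = prf(gold, pred, ignore_type_and_direction=True)
print(round(strict[2], 3), round(relaxed[2], 3))  # 0.4 0.8
```

In this toy case the second prediction has the wrong type and the arguments reversed, so it counts only under the relaxed condition. This is why a relaxed F1 can be higher than the strict one.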

Conclusions
We use a distributed semantic representation and a CNN-LSTM model to extract binary relationships between entities. Word embeddings with rich semantic knowledge, distance embeddings and entity type embeddings are fed into the CNN to learn the intrinsic relationship between the candidate entities. In the task, our F1 score for all types is 0.342, which indicates that the proposed method is effective for the extraction of binary relations.
However, using only the original word embeddings in the CNN-LSTM may not be sufficient to capture the hidden information between words. Moreover, this score on one task does not guarantee that the model works well on other tasks.
In the future, we will continue to focus on building richer distributed semantic embeddings, and we will improve our model by changing its structure and tuning its parameters. In addition, we will explore other neural architectures, such as attention mechanisms and capsule networks, for binary relation and event extraction problems.