Chinese Relation Extraction with Multi-Grained Information and External Linguistic Knowledge

Chinese relation extraction is conducted with neural networks using either character-based or word-based inputs, and most existing methods suffer from segmentation errors and the ambiguity of polysemous words. To address these issues, we propose a multi-grained lattice framework (MG lattice) for Chinese relation extraction that takes advantage of multi-grained language information and external linguistic knowledge. In this framework, (1) we incorporate word-level information into character sequence inputs so that segmentation errors can be avoided; (2) we also model multiple senses of polysemous words with the help of external linguistic knowledge, so as to alleviate the ambiguity of polysemy. Experiments on three real-world datasets in distinct domains show the consistent and significant superiority and robustness of our model compared with other baselines. We will release the source code of this paper in the future.


Introduction
Relation extraction (RE) plays a pivotal role in information extraction (IE), aiming to extract semantic relations between entity pairs in natural language sentences. In downstream applications, this technology is a key module for constructing large-scale knowledge graphs. Recent developments in deep learning have heightened interest in neural relation extraction (NRE), which attempts to use neural networks to automatically learn semantic features (Liu et al., 2013; Zeng et al., 2014, 2015; Lin et al., 2016; Zhou et al., 2016; Jiang et al., 2016). Although NRE does not require manual feature engineering, existing models ignore the fact that the granularity of the input language has a significant impact on the model, especially for Chinese RE. Conventionally, according to the difference in granularity, most existing methods for Chinese RE can be divided into two types: character-based RE and word-based RE.
For the character-based RE, it regards each input sentence as a character sequence. The shortcoming of this kind of method is that it cannot fully exploit word-level information, capturing fewer features than the word-based methods. For the word-based RE, word segmentation should be first performed. Then, a word sequence is derived and fed into the neural network model. However, the performance of the word-based models could be significantly impacted by the quality of segmentation.
For example, as shown in Fig. 1, the Chinese sentence "达尔文研究所有杜鹃 (Darwin studies all the cuckoos)" has two entities, "达尔文 (Darwin)" and "杜鹃 (cuckoos)", and the relation between them is Study. In this case, the correct segmentation is "达尔文 (Darwin) / 研究 (studies) / 所有 (all the) / 杜鹃 (cuckoos)". Nevertheless, the semantics of the sentence could become entirely different as the segmentation changes. If the segmentation is "达尔文 (Darwin) / 研究所 (institute) / 有 (there are) / 杜鹃 (cuckoos)", the meaning of the sentence becomes 'there are cuckoos in the Darwin institute' and the relation between "达尔文 (Darwin)" and "杜鹃 (cuckoos)" turns into Ownership, which is wrong. Hence, neither character-based methods nor word-based methods can sufficiently exploit the semantic information in the data. Worse still, this problem becomes more severe on finely annotated datasets, which are scarce in number. Obviously, to discover high-level entity relations from plain text, we need the assistance of comprehensive information at various granularities.
Furthermore, the presence of many polysemous words in datasets is another point neglected by existing RE models, which limits their ability to explore deep semantic features. For instance, the word "杜鹃" has two different senses, 'cuckoo' and 'azalea', but it is difficult to learn both senses from plain text without the help of external knowledge. Therefore, the introduction of external linguistic knowledge is of great help to NRE models.
In this paper, we propose the multi-grained lattice framework (MG lattice), a unified model that comprehensively utilizes both internal information and external knowledge, for the Chinese RE task. (1) The model uses a lattice-based structure to dynamically integrate word-level features into the character-based method. Thus, it can leverage multi-granularity information of the input without suffering from segmentation errors.
(2) Moreover, to alleviate the issue of polysemy ambiguity, the model utilizes HowNet (Dong and Dong, 2003), an external knowledge base in which polysemous Chinese words are manually annotated with their senses. The senses of words are then automatically selected during the training stage, and consequently the model can fully exploit the semantic information in the data for better RE performance.
Experiments have been conducted on three manually labeled RE datasets. The results indicate that our model significantly outperforms multiple existing methods, achieving state-of-the-art results on datasets across different domains.

Related Work
In recent years, RE, especially NRE, has been widely studied in the NLP field. As a pioneer, Liu et al. (2013) proposed a simple CNN-based RE model, regarded as a seminal work that uses a neural network to automatically learn features. On this basis, Zeng et al. (2014) developed a CNN model with max-pooling, where position embeddings were first used to represent positional information. Then the PCNN model (Zeng et al., 2015) applied the multi-instance learning paradigm to RE. However, the PCNN model suffers from the issue of sentence selection. To address this problem, Lin et al. (2016) applied an attention mechanism over all the instances in a bag. Further, Jiang et al. (2016) proposed a model with multi-instance and multi-label paradigms. Although PCNN models are more efficient, they cannot exploit contextual information as RNNs do. Hence, LSTMs with attention mechanisms have also been applied to the RE task (Zhang and Wang, 2015; Zhou et al., 2016; Lee et al., 2019).
Existing methods for Chinese RE are mostly character-based or word-based implementations of mainstream NRE models (Chen and Hsu, 2016; Rönnqvist et al., 2017; Xu et al., 2017). In most cases, these methods focus only on improving the model itself, ignoring the fact that the granularity of the input has a significant impact on RE models. Character-based models cannot utilize word information, capturing fewer features than word-based models. On the other hand, the performance of word-based models is significantly impacted by the quality of segmentation (Zhang and Yang, 2018). Although some techniques combine character-level and word-level information in other NLP tasks, such as character bigrams and soft words (Zhao and Kit, 2008; Chen et al., 2014; Peng and Dredze, 2016), the utilization of the information is still very limited.
Tree-structured RNNs were then proposed to address these shortcomings. Tai et al. (2015) proposed a tree-like LSTM model to improve semantic representation. This type of structure has been applied to various tasks, including human action recognition, NMT encoders, speech tokenization (Sperber et al., 2017) and NRE (Zhang and Yang, 2018). Although the lattice LSTM model can exploit word and word sequence information, it can still be severely affected by the ambiguity of polysemy. In other words, these models cannot handle the polysemy of words as the linguistic context changes. Therefore, the introduction of external linguistic knowledge is necessary. We utilize sense-level information with the help of HowNet (Dong and Dong, 2003), a concept knowledge base that annotates Chinese words with their senses. In addition, the open-source HowNet API (Qi et al., 2019) is also used in our work.

Methodology
Given a Chinese sentence and two marked entities in it, the task of Chinese relation extraction is to extract the semantic relation between the two entities. In this section, we present our MG lattice model for Chinese relation extraction in detail. As shown in Fig. 2, the model can be introduced from three aspects:

Input Representation. Given a Chinese sentence with two target entities as input, this part represents each word and character in the sentence, so that the model can utilize both word-level and character-level information.
MG Lattice Encoder. Incorporating external knowledge into word sense disambiguation, this part uses a lattice-structure LSTM network to construct a distributed representation for each input instance.
Relation Classifier. After the hidden states are learned, a character-level mechanism is adapted to merge features. Then the final sentence representations are fed into a softmax classifier to predict relations.
We introduce these three parts in detail in the following subsections.

Input Representation
The input of our model is a Chinese sentence s with two marked entities. In order to utilize multigranularity information, we represent both characters and words in the sentence.

Character-level Representation
Our model takes character-based sentences as direct inputs, regarding each input sentence as a character sequence. Given a sentence s consisting of M characters s = {c_1, ..., c_M}, we first map each character c_i to a vector of d_c dimensions, denoted as x^{ce}_i ∈ R^{d_c}, via the Skip-gram model (Mikolov et al., 2013).
In addition, we leverage position embeddings to specify entity pairs, which are defined as the relative distances from the current character to the head and tail entities (Zeng et al., 2014). Specifically, the relative distances from the i-th character c_i to the two marked entities are denoted as p^1_i and p^2_i respectively. We calculate p^1_i as below:

p^1_i = i - b_1 if i < b_1;  p^1_i = 0 if b_1 ≤ i ≤ e_1;  p^1_i = i - e_1 if i > e_1,  (1)

where b_1 and e_1 are the start and end indices of the head entity. The computation of p^2_i is similar to Eq. 1. Then, p^1_i and p^2_i are transformed into two corresponding vectors, denoted as x^{p1}_i ∈ R^{d_p} and x^{p2}_i ∈ R^{d_p}, by looking up a position embedding table.
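As an illustration, the piecewise relative-distance scheme above can be sketched in a few lines of Python (the indexing convention is an assumption; this is not the authors' code):

```python
def relative_position(i, b, e):
    """Relative distance from the i-th character to an entity spanning
    characters b..e (inclusive), in the style of Zeng et al. (2014):
    characters inside the entity get distance 0, characters before it
    get negative offsets, and characters after it positive offsets."""
    if i < b:
        return i - b
    if i > e:
        return i - e
    return 0
```

For a head entity spanning characters 2 to 4, character 0 gets p = -2 and character 6 gets p = 2; each such integer is then mapped to a d_p-dimensional vector via the position embedding table.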
Finally, the input representation for character c_i, denoted as x^c_i, is the concatenation of the character embedding x^{ce}_i and the position embeddings x^{p1}_i and x^{p2}_i: x^c_i = [x^{ce}_i; x^{p1}_i; x^{p2}_i]. Then, the representation of characters x^c = {x^c_1, ..., x^c_M} is directly fed into our model.

Word-level Representation
Although our model takes character sequences as direct inputs, in order to fully capture word-level features, it also needs the information of all potential words in the input sentence. Here, a potential word is any character subsequence that matches a word in a lexicon D built over large segmented raw text. Let w_{b,e} denote such a subsequence, starting at the b-th character and ending at the e-th character.
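For concreteness, matching all potential words against the lexicon D can be sketched as follows (a simplified illustration, not the authors' implementation; the maximum word length is a hypothetical efficiency bound):

```python
def potential_words(chars, lexicon, max_len=4):
    """Enumerate all spans w_{b,e} (0-indexed, inclusive) whose
    characters form an entry of the lexicon D."""
    spans = []
    for b in range(len(chars)):
        for e in range(b + 1, min(b + max_len, len(chars))):
            word = "".join(chars[b:e + 1])
            if word in lexicon:
                spans.append((b, e, word))
    return spans
```

On the running example "达尔文研究所有杜鹃" with a lexicon containing 达尔文, 研究, 研究所, 所有 and 杜鹃, this yields exactly those five spans, including the overlapping 研究 and 研究所 that the lattice keeps as parallel paths.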
To represent w_{b,e}, we use word2vec (Mikolov et al., 2013) to convert it into a real-valued vector x^w_{b,e}. However, the word2vec method maps each word to only a single embedding, ignoring the fact that many words have multiple senses. To tackle this problem, we incorporate HowNet as an external knowledge base into our model and represent word senses rather than words.
Hence, given a word w_{b,e}, we first obtain all K senses of it by querying HowNet. Using Sense(w_{b,e}) to denote the sense set of w_{b,e}, we then convert each sense sen into a real-valued vector x^{sen}_{b,e,k} ∈ R^{d_{sen}} through the SAT model (Niu et al., 2017). The SAT model builds on Skip-gram and can jointly learn word and sense representations. Finally, the representation of w_{b,e} is the vector set x^{sen}_{b,e} = {x^{sen}_{b,e,1}, ..., x^{sen}_{b,e,K}}. In the next section, we introduce how our model utilizes these sense embeddings.
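The sense lookup can be illustrated with a toy stand-in for HowNet and the SAT vectors (both dictionaries below are invented for illustration; a real implementation would query the HowNet API and load pre-trained SAT embeddings):

```python
# Toy stand-ins: a word-to-senses map (playing the role of HowNet)
# and a sense-to-vector map (playing the role of SAT embeddings).
# All contents and values here are illustrative only.
TOY_HOWNET = {"杜鹃": ["azalea", "cuckoo"]}
TOY_SENSE_VECS = {"azalea": [0.1, 0.9], "cuckoo": [0.8, 0.2]}

def sense_representations(word):
    """Return the list of sense vectors x_sen for a potential word,
    or an empty list if the word has no entry in the sense lexicon."""
    return [TOY_SENSE_VECS[s] for s in TOY_HOWNET.get(word, [])]
```

Each vector in the returned set becomes one sense-level path in the MG lattice encoder described below.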

Encoder
The direct input of the encoder is a character sequence, together with all potential words that match the lexicon D. The output of the encoder is the sequence of hidden state vectors h of an input sentence. We introduce the encoder with two strategies: the basic lattice LSTM and the multi-grained lattice (MG lattice) LSTM.

Basic Lattice LSTM Encoder
Generally, a classical LSTM (Hochreiter and Schmidhuber, 1997) unit is composed of three basic gates: an input gate i_j, which controls which information enters the unit; an output gate o_j, which controls which information is output from the unit; and a forget gate f_j, which controls which information is removed from the unit. All three gates are accompanied by weight matrices W. The current cell state c_j records all historical information flow up to the current time step. The character-based LSTM therefore follows the standard gate equations, where σ(·) denotes the sigmoid function, and the current cell state c^c_j is generated by calculating the weighted sum of the previous cell state and the current information generated by the cell (Graves, 2013). Given a word w_{b,e} in the input sentence which matches the external lexicon D, its representation is obtained as x^w_{b,e} = e^w(w_{b,e}), where b and e denote the start and end indices of the word, and e^w is the word embedding lookup table. Under this circumstance, the computation of c^c_j incorporates the word-level representation x^w_{b,e} to construct the basic lattice LSTM encoder. Further, a word cell c^w_{b,e} is used to represent the memory cell state of x^w_{b,e}, computed with a set of word-level input and forget gates i^w_{b,e} and f^w_{b,e}.
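The gate equations lost in extraction can be sketched in the standard LSTM form (a reconstruction consistent with the description above; the original paper's exact parameterization may differ slightly):

```latex
\begin{aligned}
i^{c}_{j} &= \sigma\!\left(W_{i}\,x^{c}_{j} + U_{i}\,h^{c}_{j-1} + b_{i}\right) \\
o^{c}_{j} &= \sigma\!\left(W_{o}\,x^{c}_{j} + U_{o}\,h^{c}_{j-1} + b_{o}\right) \\
f^{c}_{j} &= \sigma\!\left(W_{f}\,x^{c}_{j} + U_{f}\,h^{c}_{j-1} + b_{f}\right) \\
\widetilde{c}^{\,c}_{j} &= \tanh\!\left(W_{c}\,x^{c}_{j} + U_{c}\,h^{c}_{j-1} + b_{c}\right) \\
c^{c}_{j} &= f^{c}_{j} \odot c^{c}_{j-1} + i^{c}_{j} \odot \widetilde{c}^{\,c}_{j} \\
h^{c}_{j} &= o^{c}_{j} \odot \tanh\!\left(c^{c}_{j}\right)
\end{aligned}
```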
The cell state of the e-th character is calculated by incorporating the information of all the words that end at index e, i.e., w_{b,e} with b ∈ {b' | w_{b',e} ∈ D}. To control the contribution of each word, an extra gate i^c_{b,e} is used. The cell value of the e-th character is then computed as a weighted sum over the word cells c^w_{b,e} and the character candidate cell, where α^c_{b,e} and α^c_e are normalization factors whose values sum to 1. Finally, we use Eq. 5 to compute the final hidden state vector h^c_j for each character of the sequence. This structure is also used in Zhang and Yang (2018).
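The word-cell and merging equations that did not survive extraction can be sketched as follows, based on the lattice LSTM of Zhang and Yang (2018) that the text references; the details are a reconstruction, not necessarily the authors' exact formulas:

```latex
\begin{aligned}
c^{w}_{b,e} &= f^{w}_{b,e} \odot c^{c}_{b} + i^{w}_{b,e} \odot \widetilde{c}^{\,w}_{b,e} \\
i^{c}_{b,e} &= \sigma\!\left(W^{l} x^{c}_{e} + U^{l} c^{w}_{b,e} + b^{l}\right) \\
\alpha^{c}_{b,e} &= \frac{\exp\!\left(i^{c}_{b,e}\right)}
  {\exp\!\left(i^{c}_{e}\right) + \sum_{b' \in \{b'' \mid w_{b'',e} \in D\}} \exp\!\left(i^{c}_{b',e}\right)} \\
c^{c}_{e} &= \sum_{b \in \{b' \mid w_{b',e} \in D\}} \alpha^{c}_{b,e} \odot c^{w}_{b,e}
  \;+\; \alpha^{c}_{e} \odot \widetilde{c}^{\,c}_{e}
\end{aligned}
```

Here α^c_e is normalized with the same denominator as α^c_{b,e}, so all weights sum to 1.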

MG Lattice LSTM Encoder
Although the basic lattice encoder can explicitly leverage character and word information, it cannot fully handle the ambiguity of Chinese. For instance, as shown in Figure 2, the word w_{2,3} (杜鹃) has two senses: sen_1(w_{2,3}) represents 'azalea' and sen_2(w_{2,3}) represents 'cuckoo', but there is only one representation for w_{2,3} in the basic lattice encoder, namely x^w_{2,3}. To address this shortcoming, we improve the model by adding sense-level paths as external knowledge, so that a more comprehensive lexicon is constructed. As mentioned in Section 3.1, the representation of the k-th sense of the word w_{b,e} is x^{sen}_{b,e,k}. For each word w_{b,e} that matches the lexicon D, we take all of its sense representations into the calculation. The memory cell of the k-th sense of the word w_{b,e} is denoted as c^{sen}_{b,e,k}. Then all the senses are merged into a comprehensive representation to compute the memory cell of w_{b,e}, denoted as c^{sen}_{b,e}, where an extra gate i^{sen}_{b,e,k} controls the contribution of the k-th sense and is computed similarly to Eq. 9.
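The sense-path computation can be sketched as follows (a hedged reconstruction consistent with the description above, mirroring the word-cell equations of the basic lattice; not necessarily the authors' exact formulas):

```latex
\begin{aligned}
c^{sen}_{b,e,k} &= f^{sen}_{b,e,k} \odot c^{c}_{b} + i^{sen}_{b,e,k} \odot \widetilde{c}^{\,sen}_{b,e,k} \\
\alpha^{sen}_{b,e,k} &= \frac{\exp\!\left(i^{sen}_{b,e,k}\right)}
  {\sum_{k'=1}^{K} \exp\!\left(i^{sen}_{b,e,k'}\right)} \\
c^{sen}_{b,e} &= \sum_{k=1}^{K} \alpha^{sen}_{b,e,k} \odot c^{sen}_{b,e,k}
\end{aligned}
```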
In this situation, all the sense-level cell states are incorporated into the word representation c^{sen}_{b,e}, which can better represent a polysemous word. Then, similarly to Eq. 9-12, all the recurrent paths of words ending at index e flow into the current cell c^c_e. Finally, the hidden states h are still computed by Eq. 5 and then sent to the relation classifier.

Relation Classifier
After the hidden states of an instance h ∈ R^{d_h × M} are learned, we first adopt a character-level attention mechanism to merge h into a sentence-level feature vector, denoted as h* ∈ R^{d_h}. Here, d_h indicates the dimension of the hidden states and M is the sequence length. Then, the final sentence representation h* is fed into a softmax classifier to compute the confidence of each relation.
The representation h* of the sentence is computed as a weighted sum of all character feature vectors in h, where w ∈ R^{d_h} is a trained parameter and α ∈ R^M is the weight vector over h.
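A sketch of the character-level attention, following the formulation of Zhou et al. (2016) that the text cites (the final tanh is part of that formulation and is an assumption here):

```latex
\begin{aligned}
\alpha &= \mathrm{softmax}\!\left(w^{\top} \tanh(h)\right) \\
h^{*} &= \tanh\!\left(h\,\alpha^{\top}\right)
\end{aligned}
```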
To compute the conditional probability of each relation, the feature vector h* of sentence S is fed into a softmax classifier, where W ∈ R^{Y × d_h} is the transformation matrix and b ∈ R^Y is a bias vector. Y indicates the total number of relation types, and y is the estimated probability of each type. This mechanism is also applied in (Zhou et al., 2016). Finally, given all T training examples (S^(i), y^(i)), we define the objective function using cross-entropy, where θ indicates all parameters of our model. To avoid co-adaptation of hidden units, we apply dropout (Hinton et al., 2012) on the LSTM layer by randomly removing feature detectors from the network during forward propagation.
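The classifier and objective described above can be written out as follows (a reconstruction of the lost display equations, consistent with the symbols defined in the text):

```latex
\begin{aligned}
p(y \mid S) &= \mathrm{softmax}\!\left(W h^{*} + b\right) \\
J(\theta) &= -\sum_{i=1}^{T} \log p\!\left(y^{(i)} \mid S^{(i)}, \theta\right)
\end{aligned}
```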

Experiments
In this section, we conduct a series of experiments on three manually labeled datasets. Our models show superiority and effectiveness compared with other models. Furthermore, generalization is another advantage of our models, because five corpora, entirely different in topic and manner of writing, are used to construct the three datasets. The experiments are organized as follows: (1) first, we study the ability of our model to combine character-level and word-level information by comparing it with char-based and word-based models; (2) then we focus on the impact of sense representations, carrying out experiments on three different kinds of lattice-based models; (3) finally, we make comparisons with other proposed models on the relation extraction task.

Datasets and Experimental Settings
Datasets. We carry out our experiments on three different datasets: Chinese SanWen (Xu et al., 2017), the ACE 2005 Chinese corpus (LDC2006T06) and FinRE.
The Chinese SanWen dataset contains 9 types of relations among 837 Chinese literature articles, of which 695 are used for training, 84 for testing and the remaining 58 for validation. The ACE 2005 dataset is collected from newswires, broadcasts and weblogs, containing 8023 relation facts with 18 relation subtypes. We randomly select 75% of it to train the models and use the remainder for evaluation.
For more diversity in test domains, we manually annotate the FinRE dataset from 2647 financial news articles from Sina Finance, with 13486, 3727 and 1489 relation instances for training, testing and validation respectively. FinRE contains 44 distinct relation types, including a special relation NA that indicates no relation between the marked entity pair.
Evaluation Metrics. Models are evaluated with multiple metrics, including the precision-recall curve, F1-score, precision at top N predictions (P@N) and area under the curve (AUC). With these comprehensive evaluations, models can be assessed from multiple angles.
Parameter Settings. We tune the parameters of our models by grid search on the validation set. Grid search is utilized to select the optimal learning rate λ for the Adam optimizer (Kingma and Ba, 2014) among {0.0001, 0.0005, 0.001, 0.005} and the position embedding dimension d_p among {5, 10, 15, 20}. Table 1 shows the values of the best hyperparameters in our experiments. The best models were selected by early stopping based on evaluation results on the validation set. For other parameters, we follow empirical settings, as they have little influence on the overall performance of our models.
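The grid search described above can be sketched as a simple loop (the `train_and_eval` callback is a hypothetical stand-in for training a model and returning its validation score):

```python
from itertools import product

def grid_search(train_and_eval):
    """Exhaustively try every (learning rate, position-embedding size)
    pair from the paper's candidate sets and keep the best-scoring one."""
    best_score, best_cfg = float("-inf"), None
    for lr, d_p in product([0.0001, 0.0005, 0.001, 0.005],
                           [5, 10, 15, 20]):
        score = train_and_eval(lr=lr, d_p=d_p)
        if score > best_score:
            best_score, best_cfg = score, (lr, d_p)
    return best_cfg, best_score
```

In practice each call would train to convergence with early stopping on the validation set, so the full sweep costs 16 training runs.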

Effect of Lattice Encoder.
In this part, we mainly focus on the effect of the encoder layer. As shown in Table 2, we conducted experiments with char-based, word-based and lattice-based models on all datasets. The word-based and character-based baselines are implemented by replacing the lattice encoder with a bidirectional LSTM. In addition, character and word features are added to these two baselines respectively, so that they can use both character and word information. For the word baseline, we utilize an extra CNN/LSTM to learn hidden states for the characters of each word (char CNN/LSTM). For the char baseline, bichar and softword (the word in which the current character is located) are used as word-level features to improve the character representation. The lattice-based approaches include two lattice-based models, both of which can explicitly leverage character and word information. The basic lattice uses the encoder described in 3.2.1, which can dynamically incorporate word-level information into character sequences. For the MG lattice, each sense embedding is used to construct an independent sense path; hence, not only word information but also sense information flows into the cell states.

Results of the word-based model. With automatic word segmentation, the word-based baseline yields 41.23%, 54.26% and 64.43% F1-scores on the three datasets. The F1-scores increase to 41.6%, 56.62% and 68.86% when a character CNN is added to the baseline model. Compared with the character CNN, character LSTM representations give slightly higher F1-scores, which are 42.2%, 57.92% and 69.81% respectively.
The results indicate that character information will promote the performance of the word-based model, but the increase in F1-score is not significant.
Results of the character-based model. The character baseline gives higher F1-scores than the word-based methods. By adding the soft word feature, the F1-scores increase slightly on the FinRE and SanWen datasets. Similar results are achieved by adding character bigrams. Additionally, a combination of both word features yields the best F1-scores among the character-based models, which are 42.03%, 61.75% and 72.63%.
Results of the lattice-based models. Although we take multiple strategies to combine character and word information in the baselines, the lattice-based models still significantly outperform them. The basic lattice model improves the F1-scores on the three datasets from 42.2% to 47.35%, 61.75% to 63.88% and 72.63% to 77.12% respectively. The results demonstrate the lattice-based model's ability to exploit character and word sequence information. Comparisons and analysis of the lattice-based models are presented in the next subsection.

Effect of Word Sense Representations
In this section, we will study the effect of word sense representations by utilizing sense-level information with different strategies. Hence, three types of lattice-based models are used in our experiments. First, the basic lattice model uses word2vec (Mikolov et al., 2013) to train the word embeddings, which considers no word sense information. Then, we introduce the basic lattice (SAT) model as a comparison, for which the pre-trained word embeddings are improved by sense information (Niu et al., 2017). Moreover, the MG lattice model uses sense embeddings to build independent paths and dynamically selects the appropriate sense.
The P@N results shown in Table 3 demonstrate the effectiveness of word sense representations. The basic lattice (SAT) model performs better than the original basic lattice model by incorporating sense information into the word embeddings. Although the basic lattice (SAT) model reaches better overall results, its precision on the top 100 instances is still lower than that of the basic lattice model. Compared with the other two models, MG lattice shows superiority on all P@N metrics, achieving the best mean scores.
To compare and analyze the effectiveness of all lattice-based models more intuitively, we report the precision-recall curves on the ACE 2005 dataset in Figure 3 as an example. Although the basic lattice (SAT) model obtains better overall performance than the original basic lattice model, its precision is still lower when the recall is low, which corresponds to the results in Table 3. This situation indicates that considering multiple senses only in the pre-training stage adds noise to the word representations. In other words, the word representation tends to favor the senses commonly used in the corpora, which disturbs the model when the correct sense of the current word is not the common one. Nevertheless, the MG lattice model successfully avoids this problem, giving the best performance in all parts of the curve. This result indicates that the MG lattice model is not significantly impacted by noisy information because it can dynamically select sense paths in different contexts. Although the MG lattice model shows effectiveness and robustness in the overall results, it is worth noting that the improvement is limited. This indicates that the utilization of multi-grained information could still be improved; a more detailed discussion is given in Section 5.

Final Results
In this section, we compare the performance of the lattice-based models with various previously proposed methods. The selected models are as follows: PCNN (Zeng et al., 2015) puts forward a piecewise CNN model with multi-instance learning.
We conduct experiments on both character-based and word-based versions of the five models mentioned above. The results show that the character-based versions perform better than the word-based versions for all models on all datasets. Consequently, we only use the character-based versions of the five selected models in the following experiments.
For comprehensive comparison and analysis, we report precision-recall curves in Figure 4 and F1-scores and AUC in Table 4. From the results, we can observe that: (1) Lattice-based models significantly outperform other proposed models on datasets from different domains. Thanks to the polysemy information, the MG lattice model performs best among all models, showing superiority and effectiveness on the Chinese RE task.
The results indicate that sense-level information can enhance the ability to capture deep semantic information from text. (2) The gap between the basic lattice model and the MG lattice model narrows on the FinRE dataset. The reason for this phenomenon is that FinRE is constructed from a financial report corpus, and the wording of financial reports is often rigorous and unambiguous.
(3) In comparison, the PCNN and PCNN+ATT models perform worse on the SanWen and ACE datasets. The reason is that there are positional overlaps between entity pairs in these two datasets, which prevents PCNN from taking full advantage of its piecewise mechanism. The results indicate that PCNN-based methods depend strongly on the form of the dataset. In comparison, our models show robustness on all three datasets.

Conclusion and Future Work
In this paper, we propose the MG lattice model for Chinese relation extraction. The model incorporates word-level information into character sequences to explore deep semantic features and alleviates the issue of polysemy ambiguity by introducing external linguistic knowledge in the form of sense-level information. We comprehensively evaluate our model on various datasets. The results show that our model significantly outperforms other proposed methods, reaching state-of-the-art results on all datasets.
In the future, we will attempt to improve the ability of the MG lattice to utilize multi-grained information. Although we have used word, sense and character information in our work, more levels of information can be incorporated into the MG lattice. From coarse to fine, sememe-level information is intuitively valuable; a sememe is the minimum semantic unit of a word sense, and its information may potentially assist the model in exploring deeper semantic features. From fine to coarse, sentences and paragraphs should also be taken into account so that a broader range of contextual information can be captured.