Noise Mitigation for Neural Entity Typing and Relation Extraction

In this paper, we address two different types of noise in information extraction models: noise from distant supervision and noise from pipeline input features. Our target tasks are entity typing and relation extraction. For the first noise type, we introduce multi-instance multi-label learning algorithms using neural network models, and apply them to fine-grained entity typing for the first time. Our model outperforms the state-of-the-art supervised approach which uses global embeddings of entities. For the second noise type, we propose ways to improve the integration of noisy entity type predictions into relation extraction. Our experiments show that probabilistic predictions are more robust than discrete predictions and that joint training of the two tasks performs best.


Introduction
Knowledge bases (KBs) are important resources for natural language processing tasks like question answering and entity linking. However, KBs are far from complete (e.g., Socher et al. (2013)). Therefore, methods for automatic knowledge base completion (KBC) are beneficial. Two subtasks of KBC are entity typing (ET) and relation extraction (RE). We address both tasks in this paper.
As in other information extraction tasks, obtaining labeled training data for ET and RE is challenging. The challenge grows as labels become more fine-grained. Therefore, distant supervision (Mintz et al., 2009) is widely used. It reduces the need for manually created resources. Distant supervision assumes that if an entity has a type (resp. two entities have a relationship) in a KB, then all sentences mentioning that entity (resp. those two entities) express that type (resp. that relationship). However, that assumption is too strong and gives rise to many noisy labels. Different techniques to deal with that problem have been investigated. The main technique is multi-instance (MI) learning. It relaxes the distant supervision assumption to the assumption that at least one instance of a bag (the collection of all sentences containing the given entity/entity pair) expresses the type/relationship given in the KB. Multi-instance multi-label (MIML) learning is a generalization of MI in which one bag can have several labels (Surdeanu et al., 2012).
Most MI and MIML methods are based on hand-crafted features. Recently, Zeng et al. (2015) introduced an end-to-end approach to MI learning based on neural networks. Their MI method takes the most confident instance as the prediction of the bag. Lin et al. (2016) further improved that method by taking other instances into account as well; they proposed MI learning based on selective attention as an alternative way of reducing the impact of noisy labels on RE. In selective attention, a weighted average of instance representations is calculated first and then used to compute the prediction of a bag.
In this paper, we introduce two multi-label versions of MI. (i) MIML-MAX takes, for each label, the instance with the maximum probability. (ii) MIML-ATT applies, for each label, selective attention to the instances. We apply MIML-MAX and MIML-ATT to fine-grained ET. In contrast to RE, the ET task we consider contains a larger set of labels, with a variety of different granularities and hierarchical relationships. We show that MIML-ATT deals well with noise in corpus-level ET and improves or matches the results of a supervised model based on global embeddings of entities.
The second type of noise we address in this paper influences the integration of ET into RE. It has been shown that adding entity types as features improves RE models (cf. Ling and Weld (2012), Liu et al. (2014)). However, noisy training data and difficulties of classification often cause wrong predictions of ET and, as a result, noisy inputs to RE. To address this, we propose a joint model of ET and RE and compare it with methods that integrate ET results in a strict pipeline. The joint model performs best. Among the pipeline models, we show that using probabilities instead of binary decisions better deals with noise (i.e., possible ET errors).
To sum up, our contributions are as follows. (i) We introduce new algorithms for MIML using neural networks. (ii) We apply MIML to fine-grained entity typing for the first time and show that it outperforms the state-of-the-art supervised method based on entity embeddings. (iii) We show that a novel way of integrating noisy entity type predictions into a relation extraction model and joint training of the two tasks lead to large improvements in RE performance.
We release code and data for future research at cistern.cis.lmu.de.

Related Work
Noise mitigation for distant supervision. Distant supervision can be used to train information extraction systems, e.g., in relation extraction (e.g., Mintz et al. (2009), Hoffmann et al. (2011), Zeng et al. (2015)) and entity typing (e.g., Ling and Weld (2012), Yogatama et al. (2015), Dong et al. (2015)). To mitigate the noisy label problem, multi-instance (MI) learning has been introduced and applied in relation extraction (Ritter et al., 2013). Surdeanu et al. (2012) introduced multi-instance multi-label (MIML) learning to extend MI learning for multi-label relation extraction. Those models are based on manually designed features. Zeng et al. (2015) and Lin et al. (2016) introduced MI learning methods for neural networks. We introduce MIML algorithms for neural networks. In contrast to most MI/MIML methods, which are applied in relation extraction, we apply MIML to the task of fine-grained entity typing. Ritter et al. (2013) applied MI on a Twitter dataset with ten types. Our dataset has a larger number of classes or types (namely 102) and input examples, compared to that Twitter dataset and also to the most widely used datasets for evaluating MI. This makes our setup more challenging because of different dependencies and the multi-label nature of the problem. Also, there seems to be a difference between how entity relations and entity types are expressed in text. Our experiments support that hypothesis.
Knowledge base completion (KBC). Most KBC systems focus on identifying triples R(e1, r, e2) missing from a KB (Socher et al., 2013; Jiang et al., 2012; Wang et al., 2014). Work on entity typing or unary relations for KBC is more recent (Neelakantan and Chang, 2015; Yaghoobzadeh and Schütze, 2015; Yaghoobzadeh et al., 2017). In this paper, we build a KBC system for unary and binary relations using contextual information of words and entities.
Named entity recognition (NER) and typing. NER systems (e.g., Finkel et al. (2005), Collobert et al. (2011)) used to consider only a small set of entity types. Recent work also addresses fine-grained NER (Yosef et al., 2012; Ling and Weld, 2012; Yogatama et al., 2015; Dong et al., 2015; Del Corro et al., 2015; Ren et al., 2016a; Ren et al., 2016b; Shimaoka et al., 2016). Some of this work (cf. Yogatama et al. (2015), Dong et al. (2015)) treats entity segment boundaries as given and classifies mentions into fine-grained types. We make a similar assumption, but in contrast to NER, we evaluate on the corpus-level entity typing task of Yaghoobzadeh and Schütze (2015); thus, we do not need test sentences annotated with context-dependent entity types. This task was also used to evaluate embedding learning methods (Yaghoobzadeh and Schütze, 2016).
Entity types for relation extraction. Several studies have integrated entity type information into relation extraction, using either coarse-grained (Hoffmann et al., 2011; Zhou et al., 2005) or fine-grained (Liu et al., 2014; Du et al., 2015; Augenstein et al., 2015; Vlachos and Clark, 2014; Ling and Weld, 2012) entity types. In contrast to most of this work, we do not incorporate binary entity type values, but probabilistic outputs. Thus, we allow the relation extraction system to compensate for errors of entity typing. Additionally, we compare this approach to various other possibilities, to investigate which approach performs best. Prior work found that joint training of entity typing and relation extraction is better than a pipeline model; we show that this result also holds for neural network models and when the number of entity types is large.

MIML Learning for Entity Typing
Entity typing (ET) is the task of finding, for each named entity, a set of types or classes that it belongs to, e.g., "author" and "politician" for "Obama". Our goal is corpus-level prediction of entity types. We use the entity-type information from a KB and annotated contexts of entities in a corpus to estimate P(t|e), the probability that entity e has type t.
More specifically, consider an entity e and B = {c_1, c_2, ..., c_q}, the set of q contexts of e in the corpus. Each c_i is an instance of e, and since e can have several labels, this is a multi-instance multi-label (MIML) learning problem. We address MIML using neural networks by representing each context as a vector c_i ∈ R^h and learning P(t|e) from the set of contexts of entity e. In the following, we first describe our MIML algorithms and then explain how c_i is computed.

Algorithms
Distant supervision. The basic way to estimate P(t|e) is distant supervision: we learn the type probability of each c_i individually, assuming that each c_i expresses all labels of e. Therefore, we define the context-level probability function as:
$$P(t \mid c_i) = \sigma\left(\mathbf{w}_t^\top \mathbf{c}_i + b_t\right) \tag{1}$$
where σ is the sigmoid function, w_t ∈ R^h is the output weight vector and b_t is the bias scalar for type t. The cost function is defined based on binary cross entropy:
$$L(\theta) = -\sum_{c} \sum_{t} \left[ y_t \log P(t \mid c) + (1 - y_t) \log\left(1 - P(t \mid c)\right) \right] \tag{2}$$
where y_t is 1 if entity e has type t and 0 otherwise. To compute P(t|e) at prediction time, i.e., P_pred(t|e), the context-level probabilities must be aggregated. Averaging is the usual way of doing that:
$$P_{\mathrm{pred}}(t \mid e) = \frac{1}{q} \sum_{i=1}^{q} P(t \mid c_i) \tag{3}$$
Multi-instance multi-label. The distant supervision assumption is that all contexts of an entity with type t are contexts of t; e.g., we label all contexts mentioning "Barack Obama" with all of his types. Obviously, the labels are incorrect or noisy for some contexts. Multi-instance multi-label (MIML) learning addresses this problem. We apply MIML to fine-grained ET for the first time. Our assumption is: if entity e has type t, then there is at least one context of e in the corpus in which e occurs as type t. We apply this assumption during training with the following estimation of the type probability of an entity:
$$P(t \mid e) = \max_{1 \le i \le q} P(t \mid c_i) \tag{4}$$
which means we take the maximum probability of type t over all contexts of entity e as P(t|e). We call this approach MIML-MAX. MIML-MAX picks the most confident context for t, ignoring the probabilities of all the other contexts. Apart from missing information, this can be especially harmful if the entity annotations in the corpus are the result of an entity linking system. In that case, the most confident context might be wrongly linked to the entity. So, it can be beneficial to incorporate all contexts into the final prediction, e.g., by averaging the type probabilities of all contexts of entity e:
$$P(t \mid e) = \frac{1}{q} \sum_{i=1}^{q} P(t \mid c_i) \tag{5}$$
We call this approach MIML-AVG. We also propose a combination of the maximum and average, which uses MIML-MAX (Eq. 4) in training and MIML-AVG (Eq. 5) in prediction. We call this approach MIML-MAX-AVG.
MIML-AVG treats every context equally, which might be problematic since many contexts are irrelevant for a particular type. A better way is to weight the contexts according to their similarity to the types. Therefore, we propose using selective attention over contexts as follows and call this approach MIML-ATT. MIML-ATT is the multi-label version of selective attention. To compute the type probability for e, we define:
$$P(t \mid e) = \sigma\left(\mathbf{w}_t^\top \mathbf{a}_t + b_t\right) \tag{6}$$
where w_t ∈ R^h is the output weight vector and b_t the bias scalar for type t, and a_t is the aggregated representation of all contexts c_i of e for type t, computed as follows:
$$\mathbf{a}_t = \sum_{i=1}^{q} \alpha_{i,t}\, \mathbf{c}_i \tag{7}$$
where α_{i,t} is the attention score of context c_i for type t, and a_t ∈ R^h can be interpreted as the representation of entity e for type t. α_{i,t} is defined as:
$$\alpha_{i,t} = \frac{\exp\left(\mathbf{c}_i^\top M\, \mathbf{t}\right)}{\sum_{j=1}^{q} \exp\left(\mathbf{c}_j^\top M\, \mathbf{t}\right)} \tag{8}$$
where M ∈ R^{h×d_t} is a weight matrix that measures the similarity of c and t, and t ∈ R^{d_t} is the representation of type t.
Table 1 summarizes the differences of our MIML methods with respect to the aggregation function they use to get corpus-level probabilities.
Table 1: Different MIML algorithms for entity typing, and the aggregation function they use to get corpus-level probabilities.
For optimization of all MIML methods, we use the binary cross entropy loss function:
$$L(\theta) = -\sum_{e} \sum_{t} \left[ y_t \log P(t \mid e) + (1 - y_t) \log\left(1 - P(t \mid e)\right) \right] \tag{9}$$
In contrast to the loss function of distant supervision in Eq. 2, which iterates over all contexts, we iterate over all entities here.
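The aggregation functions described in this section can be sketched in a few lines of NumPy. This is a minimal illustration and not the released implementation: the function names, the array shapes and the random toy inputs are assumptions, and the attention weights are taken to be a softmax over contexts of the bilinear score c_i^T M t.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def context_probs(C, W, b):
    # C: (q, h) context vectors, W: (T, h) output weights, b: (T,) biases.
    # Returns a (q, T) matrix whose entry [i, t] is P(t|c_i).
    return sigmoid(C @ W.T + b)

def miml_max(C, W, b):
    # MIML-MAX: most confident context per type.
    return context_probs(C, W, b).max(axis=0)

def miml_avg(C, W, b):
    # MIML-AVG: mean of the context-level probabilities per type.
    return context_probs(C, W, b).mean(axis=0)

def miml_att(C, W, b, M, type_emb):
    # MIML-ATT (assumed form): softmax attention over contexts, per type.
    # M: (h, d_t) similarity matrix, type_emb: (T, d_t) type representations.
    scores = C @ M @ type_emb.T                      # (q, T): c_i^T M t
    scores = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=0, keepdims=True)        # each column sums to 1
    A = alpha.T @ C                                  # (T, h): a_t per type
    return sigmoid(np.sum(A * W, axis=1) + b), alpha

def bce(p, y, eps=1e-12):
    # Binary cross entropy over the types of one entity.
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)))

# Toy entity with q = 4 contexts, h = 5, T = 3 types, d_t = 2 (all hypothetical).
rng = np.random.default_rng(0)
q, h, T, d_t = 4, 5, 3, 2
C = rng.normal(size=(q, h))
W = rng.normal(size=(T, h))
b = np.zeros(T)
M = rng.normal(size=(h, d_t))
type_emb = rng.normal(size=(T, d_t))

p_max, p_avg = miml_max(C, W, b), miml_avg(C, W, b)
p_att, alpha = miml_att(C, W, b, M, type_emb)
loss = bce(p_att, np.array([1.0, 0.0, 1.0]))
```

MIML-MAX-AVG needs no extra function in this sketch: it would use `miml_max` inside the training loss and `miml_avg` at prediction time.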

Context Representation
To produce a high-quality context representation c, we use convolutional neural networks (CNNs). The first layer of the CNN is a lookup table that maps each word in c to an embedding of size d. The output of the lookup layer is a matrix E ∈ R^{d×s} (the embedding layer), where s is the context size (a fixed number of words).
The CNN uses n filters of different window widths w to narrowly convolve E. For each of the n filters H ∈ R^{d×w}, the result of applying H to matrix E is a feature map m ∈ R^{s−w+1}:
$$m[i] = g\left(E_{:,\,i:i+w-1} \odot H\right)$$
where g is the relu function, ⊙ is the Frobenius product, E_{:,i:i+w−1} are the columns i to i + w − 1 of E and 1 ≤ w ≤ k are the window widths we consider. Max pooling then gives us one feature for each filter, and the concatenation of those features is the CNN representation of c.
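As an illustration of the narrow convolution and max pooling just described, the following NumPy sketch computes one feature map per filter and pools it to a single feature. It assumes no bias term and uses hypothetical toy matrices; `feature_map` and `cnn_representation` are names chosen for this example only.

```python
import numpy as np

def feature_map(E, H):
    # E: (d, s) embedding matrix, H: (d, w) filter.
    # Returns m of length s - w + 1 with m[i] = relu(<E[:, i:i+w], H>_F).
    d, s = E.shape
    d_h, w = H.shape
    assert d == d_h and w <= s
    m = np.empty(s - w + 1)
    for i in range(s - w + 1):
        frobenius = np.sum(E[:, i:i + w] * H)   # Frobenius (elementwise) product
        m[i] = max(0.0, frobenius)              # relu
    return m

def cnn_representation(E, filters):
    # Max pooling keeps one feature per filter; the features are concatenated.
    return np.array([feature_map(E, H).max() for H in filters])

# Toy input: d = 2, s = 3, two filters of width w = 2.
E = np.array([[0.0, 1.0, 2.0],
              [3.0, 4.0, 5.0]])
H_pos = np.ones((2, 2))
H_neg = -np.ones((2, 2))
m_pos = feature_map(E, H_pos)
phi = cnn_representation(E, [H_pos, H_neg])
```

In the toy example the all-negative filter is clipped to zero by the relu, so only the first filter contributes a non-zero feature.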
As shown in the entity typing part of Figure 1, we apply the CNN to the left and right context of the entity mention, and the concatenation φ(c) ∈ R^{2n} is fed into a multi-layer perceptron (MLP) to get the final context representation c ∈ R^h.

Type-aware Relation Extraction
Relation extraction (RE) is mostly defined as finding relations between pairs of entities, for instance, finding the relation "president-of" between "Obama" and "USA". Given a set of q contexts for an entity pair z, B = {c_1, c_2, ..., c_q} in the corpus, we learn P(r|z), which is the probability of relation r for z. We assume that each z has one relation r(z). Each c_i is represented by a vector c_i ∈ R^h, which is our type-aware representation of context described in Section 4.1.
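To learn P(r|z), we use the multi-instance (MI) learning method of Zeng et al. (2015):
$$P(r \mid z) = \max_{1 \le i \le q} P(r \mid c_i)$$
where P(r|c_i) is the probability of relation r for context c_i. The cost function we optimize is:
$$L(\theta) = -\sum_{z} \log P\left(r(z) \mid z\right)$$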
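A minimal sketch of this multi-instance scheme follows. It assumes that the bag-level probability is the maximum over the context-level probabilities and that the cost is the negative log-likelihood of the single gold relation r(z) of each entity pair; the function names and the toy numbers are hypothetical.

```python
import numpy as np

def bag_probability(P_ctx):
    # P_ctx: (q, R) matrix whose rows are P(r|c_i) for one entity pair z.
    # Returns P(r|z) for every relation r as the maximum over contexts.
    return P_ctx.max(axis=0)

def mi_cost(bags, gold):
    # bags: list of (q, R) matrices, gold: list with the index of r(z) per bag.
    return float(-sum(np.log(bag_probability(P)[r]) for P, r in zip(bags, gold)))

# One toy entity pair with two contexts and two relations; gold relation is 0.
bag = np.array([[0.2, 0.8],
                [0.6, 0.4]])
p_z = bag_probability(bag)
cost = mi_cost([bag], [0])
```

Only the most confident context for the gold relation contributes to the cost, which is what lets the model ignore contexts that do not express the relation.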

Context Representation
Similar to our entity typing system, we apply CNNs to compute the context representation φ(c).
In particular, we use the contextwise CNN of Adel et al. (2016), which splits the context into three parts: left, middle and right of the relation arguments. The parts "overlap", i.e., the left (resp. right) argument is included in both left (resp. right) and middle parts. For each of the three parts, convolution and 3-max pooling (Kalchbrenner et al., 2014) is performed. The context representation φ(c) ∈ R^{3·3·n} is the concatenation of the pooling results.
Figure 1: Our architecture for joint entity typing and relation extraction.

Integration of Entity Types
We concatenate the entity type representations t_1 ∈ R^τ and t_2 ∈ R^τ of the relation arguments to the CNN representation of the context, φ(c), which gives the vector [t_1; t_2; φ(c)]. Our context representation c is then computed from this concatenation by a hidden layer with weight matrix W_h ∈ R^{h×(3·3·n+2τ)}. This is also depicted in Figure 1, right column, third layer from the top: t_1, t_2, φ(c). We calculate t_1 and t_2 from the predictions of the entity typing model with the following transformation:
$$\mathbf{t}_k = f\left(W_t \left[P(t_1 \mid c_{e_k}), \dots, P(t_T \mid c_{e_k})\right]^\top\right) \tag{15}$$
where c_{e_k} is the context of e_k, W_t ∈ R^{τ×T} is a weight matrix (learned from the corpus or during training) and f is a function (identity or tanh).
With the transformation W_t, the model can combine predictions for different types to learn better internal representations t_1 and t_2. The choices of W_t and f depend on the different representations we investigate and describe in the following.
(1) Pipeline: We integrate entity types into the RE model, using the output of ET in a pipeline model (see Eq. 15). We test the following representations of t_k, k ∈ {1, 2}. PREDICTED-HIDDEN: W_t from Eq. 15 is learned during training and f is tanh. That means that a hidden layer learns representations based on the predictions P(t_1|c_{e_k}), ..., P(t_T|c_{e_k}). BINARY-HIDDEN: This is the binarization of the input of PREDICTED-HIDDEN, i.e., each probability estimate is converted to 0 or 1 (with a threshold of 0.5). BINARY: t_k is the binary vector itself (used by Ling and Weld (2012)). WEIGHTED: The columns of matrix W_t from Eq. 15 are the distributional embeddings of types trained on the corpus (see Section 5.1). f is the identity function.
(2) Joint model: As an alternative to the pipeline model, we investigate integrating entity typing into RE by jointly training both models. We use the architecture depicted in Figure 1. The key difference to the pipeline model PREDICTED-HIDDEN is that we learn P(t|c) and P(r|c) jointly; we call this model JOINT-TRAIN. We compare JOINT-TRAIN to other models, including the pipeline models.
During training of JOINT-TRAIN, we compute the cost of the ET model for typing the first entity, L_1(θ_T), the cost for typing the second entity, L_2(θ_T), and the cost of the RE model for assigning a relation to the two entities, L(θ_R). Then, we combine those costs into a single objective with a weight γ, which is tuned on the development set. Note that, with this objective, the ET parameters are optimized on the contexts of the RE examples, which are a subset of all training examples of ET. In the pipeline models, however, ET is trained on the whole training set used for typing. Also note that in JOINT-TRAIN we do not use MIML for the ET part but a distantly supervised cost function.
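The four pipeline representations differ only in how the vector of ET probabilities is transformed before it is concatenated with the CNN output. The sketch below illustrates this under stated assumptions: `type_representation` is a hypothetical helper, the 2×3 matrix `W_t` and the probability vector are toy values, and the binarization uses p > 0.5.

```python
import numpy as np

def type_representation(p, mode, W_t=None, threshold=0.5):
    # p: (T,) vector of ET probabilities for one relation argument.
    # W_t: (tau, T) matrix; learned for the *-HIDDEN variants, type
    # embeddings as columns for WEIGHTED. Not needed for BINARY.
    binary = (p > threshold).astype(float)
    if mode == "BINARY":
        return binary
    if mode == "BINARY-HIDDEN":
        return np.tanh(W_t @ binary)
    if mode == "PREDICTED-HIDDEN":
        return np.tanh(W_t @ p)
    if mode == "WEIGHTED":
        return W_t @ p                       # f is the identity
    raise ValueError(f"unknown mode: {mode}")

# Toy example with T = 3 types and tau = 2.
p = np.array([0.49, 0.51, 0.90])
W_t = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 1.0]])

t_binary = type_representation(p, "BINARY")
t_weighted = type_representation(p, "WEIGHTED", W_t)
t_hidden = type_representation(p, "PREDICTED-HIDDEN", W_t)

# Input to the RE hidden layer: both argument representations plus phi(c).
phi_c = np.zeros(4)                          # placeholder CNN output
rel_input = np.concatenate([t_weighted, t_weighted, phi_c])
```

In the toy vector, the first type misses the threshold by 0.01 and the second passes it by 0.01. `BINARY` maps them to 0 and 1, while the probabilistic variants keep them almost equal, which is the information the RE model can use to compensate for ET errors.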
For relation extraction, we first select the ten most frequent relations (plus NA for no relation according to Freebase) of entity pairs in CF-FIGMENT. We ensure that the entity pairs have at least one context in CF-FIGMENT. This results in 5815, 3054 and 6889 unique entity pairs for train, dev and test. Dev and test set sizes are 124,462 and 556,847 instances. For the train set, we take a subsample of 135,171 sentences. The entity and sentence sets of CF-FIGMENT were constructed to ensure that entities in the entity test set do not occur in the sentence train and dev sets; that is, a sentence was assigned to the train set only if all entities it contains are train entities.

Word, Entity and Type Embeddings
We use 100-dimensional word embeddings to initialize the input layer of ET and RE. Embeddings are kept fixed during training. Since we need embeddings for words, entities and types in the same space, we process ClueWeb+FACC1 (corpus with entity information) as follows. For each sentence s, we add two copies: s itself, and a copy in which each entity is replaced with its notable type, the most important type according to Freebase. We process train, dev and test this way, but do not replace test entities with their notable type because the types of test entities are unknown in our application scenario. We run word2vec (Mikolov et al., 2013) on the resulting corpus to learn embeddings for words, entities and types. Note that our application scenario is that we are given an unannotated input corpus and our system then extracts entity types and relations from this input corpus to enhance the KB.
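The corpus preparation described above can be sketched as follows. The helper name, the token format and the toy identifiers (`/m/ent_a`, `/m/ent_b`) are hypothetical; the sketch only shows how each sentence yields two copies and how test entities are left untouched.

```python
def expand_sentence(tokens, notable_type, test_entities=frozenset()):
    # tokens: list of str in which entity mentions are replaced by entity IDs.
    # notable_type: dict mapping entity ID -> notable type label.
    # test_entities: IDs whose types are unknown and must not be replaced.
    typed = [
        notable_type[tok] if tok in notable_type and tok not in test_entities else tok
        for tok in tokens
    ]
    # First copy: the sentence itself; second copy: entities replaced by types.
    return [list(tokens), typed]

# Toy sentence with one train entity and one test entity.
notable_type = {"/m/ent_a": "PERSON", "/m/ent_b": "LOCATION"}
sentence = ["/m/ent_a", "was", "born", "in", "/m/ent_b"]
copies = expand_sentence(sentence, notable_type, test_entities={"/m/ent_b"})
```

Running an embedding learner such as word2vec over all resulting copies places words, entity IDs and type labels in one vector space, because type labels then occur in the same contexts as the entities they replace.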

Entity Typing Experiments
Entity context setup. We use a window size of 5 on each side of the entity mentions. Following Yaghoobzadeh and Schütze (2015), we replace other entities occurring in the context with their Freebase notable type mapped to FIGER.
Models. Yaghoobzadeh and Schütze (2015) applied a multi-layer perceptron (MLP) architecture to create context representations. Therefore, we use an MLP baseline to compute the context representation φ(c). The input to the MLP model is a concatenation of context word embeddings. As an alternative to MLP, we also train a CNN (see Section 3.2) to compute context representations. We run experiments with MLP and CNN, each trained with standard distant supervision and with MIML.
EntEmb and FIGMENT baselines. Following Yaghoobzadeh and Schütze (2015), we also learn entity embeddings and classify those embeddings to types, i.e., instead of distant supervision, we classify entities based on aggregated information represented in entity embeddings. An MLP with one hidden layer is used as classifier. We call that model EntEmb. We join the results of EntEmb with our best model (line 13 in Table 3), similar to the joint model (FIGMENT) in Yaghoobzadeh and Schütze (2015).
Table 3: P@1, micro F1 for all, head and tail entities and MAP results for entity typing.
We use the same evaluation measures as Ling and Weld (2012), Yaghoobzadeh and Schütze (2015) and Neelakantan and Chang (2015) for entity typing: precision at 1 (P@1), which is the accuracy of picking the most confident type for each entity, micro average F1 of all entity-type assignments and mean average precision (MAP) over types. We could make assignment decisions based on the standard criterion p > θ, θ = 0.5, but we found that tuning θ improves results. For each probabilistic classifier and each type, we set θ to the value that maximizes performance on dev.
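The per-type threshold tuning and P@1 can be illustrated with a short NumPy sketch. It assumes that "performance on dev" means per-type F1 and that candidate thresholds come from a fixed grid; both are assumptions of this example, as are the function names and the toy dev set.

```python
import numpy as np

def f1(pred, gold):
    # pred, gold: boolean arrays over entities for a single type.
    tp = np.sum(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / pred.sum()
    recall = tp / gold.sum()
    return float(2 * precision * recall / (precision + recall))

def tune_thresholds(P_dev, Y_dev, grid=np.linspace(0.05, 0.95, 19)):
    # P_dev: (E, T) predicted probabilities, Y_dev: (E, T) boolean gold labels.
    # Picks, for each type, the grid value with the best dev F1.
    thetas = np.full(P_dev.shape[1], 0.5)
    for t in range(P_dev.shape[1]):
        scores = [f1(P_dev[:, t] > th, Y_dev[:, t]) for th in grid]
        thetas[t] = grid[int(np.argmax(scores))]
    return thetas

def precision_at_1(P, Y):
    # Fraction of entities whose most confident type is a gold type.
    top = P.argmax(axis=1)
    return float(np.mean(Y[np.arange(len(P)), top]))

# Toy dev set with 3 entities and 2 types.
P_dev = np.array([[0.9, 0.2],
                  [0.4, 0.3],
                  [0.1, 0.1]])
Y_dev = np.array([[True, False],
                  [True, True],
                  [False, False]])
thetas = tune_thresholds(P_dev, Y_dev)
p_at_1 = precision_at_1(P_dev, Y_dev)
```

With the default θ = 0.5, the second type in the toy data is never predicted and its F1 is 0; the tuned threshold recovers it, which mirrors why tuning θ per type can help for rare or low-confidence types.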
Results. Table 3 shows results for P@1, micro F1 and MAP. For F1, we report separate results for all, head (frequency higher than 100) and tail (frequency less than 5) entities.
Discussion. The improvement of CNN (6) compared to MLP (1) is not surprising considering the effectiveness of CNNs in finding position-independent local features, compared to the flat representation of MLP. Lines 2-5 and 7-10 show the results of different MIML algorithms for MLP and CNN, respectively. Considering micro F1 for all entities as the most important measure, the trend is similar in both MLP and CNN for MIML methods: ATT > MAX-AVG > AVG > MAX.
MAX is worse than even basic distant supervised models, especially for micro F1. MAX predictions are based on only one context of each entity (for each type), and the results suggest that this is harmful for entity typing. This contradicts previous results in RE (cf. Zeng et al. (2015)) and suggests that there might be a significant difference between expressing types of entities and relations between them in text. Related to this, MAX-AVG, which averages the type probabilities at prediction time, improves MAX by a large margin. Averaging the context probabilities seems to be a way to smooth the entity type probabilities. MAX-AVG models are also better than the corresponding models with AVG that train and predict with averaging. This is due to the fact that AVG gives equal weights to all context probabilities both in training and prediction. ATT uses weighted contexts in both training and prediction, and that is probably the reason for its effectiveness over all other MIML algorithms.
Overall, using attention (ATT) significantly improves the results of both MLP and CNN models. CNN+MIML-ATT (10) performs comparably to EntEmb (11), with better micro F1 on all and tail entities and worse MAP and micro F1 on head entities. These two models have different properties, e.g., MIML is also able to type each mention of entities while EntEmb works only for corpus-level typing of entities. (See Yaghoobzadeh and Schütze (2015) for more differences.) It is important to note that MIML can also be applied to any entity typing architecture or model that is trained by distant supervision. Due to the lack of large annotated corpora, distant supervision is currently the only viable approach to fine-grained entity typing; thus, our demonstration of the effectiveness of MIML is an important finding for entity typing.
Figure 2: Example contexts of the entity /m/024g5w:
… /m/024g5w , and DOCTOR into disease will be ...
… whooping cough , and kidney disease ( /m/024g5w 's disease ...
In 7 , DOCTOR and /m/024g5w write Elements of the ...
book but his catarrhal bronchitis turned to /m/024g5w 's disease and ...
It has cured /m/024g5w 's disease that could be traced to ...
two clinical wards so /m/024g5w can carry on intensive study ...
Joining the results of CNN+MIML-ATT with EntEmb (line 13) gives large improvements over each of the single models. It is also consistently better (by more than 3% in all measures) than our baseline FIGMENT (12), which is basically MLP+EntEmb. This improvement is achieved by using CNN instead of MLP for context representation and integrating MIML-ATT. EntEmb is improved by Yaghoobzadeh et al. (2017) by using entity names. We leave the integration of that model to future work.
Example. To show the behavior of MIML-MAX and MIML-ATT, we extract the scores that each method assigns to the labels for each context. A comparison for the example entity "Richard Bright" (MID: /m/024g5w) who is a PERSON, DOCTOR and AUTHOR is shown in Figure 2. Note that the weights from MIML-ATT (Eq. 8) sum to 1 for each label because of the applied softmax function while the scores from MIML-MAX (Eq. 1) do not. For both methods, the scores for the type PERSON are more equally distributed than for the other types which makes sense since the entity has the PERSON characteristics in every sentence. For the other types, both models seem to be influenced by other entities occurring in the context (e.g., an occurrence with another DOCTOR could indicate that the entity is also a DOCTOR) but also by trigger words such as "write" or "book" for the type AUTHOR or "disease" for the type DOCTOR.

Relation Extraction Experiments
Models. In our experiments, we compare two state-of-the-art RE architectures: piecewise CNN (Zeng et al., 2015) and contextwise CNN (Adel et al., 2016). We use the publicly available implementation for the piecewise CNN (URL, 2016a) and our own implementation for the contextwise CNN. Both CNNs represent the input words with embeddings and split the contexts based on the positions of the relation arguments. The contextwise CNN splits the input before convolution, the piecewise CNN after convolution. Also, while the piecewise CNN applies a softmax layer directly after pooling, the contextwise CNN feeds the pooling results into a fully-connected hidden layer first. For both models, we use MI learning to mitigate the noise from distant supervision.
Results. The precision recall (PR) curves in Figure 3 show that the contextwise CNN outperforms the piecewise CNN on our RE dataset. We also compare them to a baseline model that does not learn context features but uses only the embeddings of the relation arguments as an input and feeds them into an MLP (similar to the EntEmb baseline for ET). The results confirm that the context features which the CNNs extract are very important, not only for ET but also for RE. Note that the PR curves are calculated on the corpus level and not on the sentence-level, i.e., after aggregating the predictions for each entity pair. Following Ritter et al. (2013), we compute the area A under the PR curves which supports this trend (EntEmb: A = 0.34, piecewise CNN: A = 0.48, contextwise CNN: A = 0.50).
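For readers who want to reproduce this kind of comparison, the sketch below shows one common way to compute a corpus-level PR curve and the area under it from aggregated (entity pair, relation) scores. The trapezoidal integration and the function names are assumptions of this example and may differ from the procedure used for the reported areas.

```python
import numpy as np

def pr_curve(scores, labels):
    # scores: confidence of each aggregated (entity pair, relation) prediction.
    # labels: 1 if the prediction is correct according to the KB, else 0.
    order = np.argsort(-scores)                 # most confident first
    hits = labels[order].astype(float)
    tp = np.cumsum(hits)
    precision = tp / np.arange(1, len(hits) + 1)
    recall = tp / hits.sum()
    return precision, recall

def area_under_pr(precision, recall):
    # Trapezoidal area, starting the curve at recall 0 with the first precision.
    r = np.concatenate(([0.0], recall))
    p = np.concatenate(([precision[0]], precision))
    return float(np.sum((r[1:] - r[:-1]) * (p[1:] + p[:-1]) / 2.0))

# A perfect ranking and a ranking with one error in the middle (toy data).
area_perfect = area_under_pr(*pr_curve(np.array([0.9, 0.8, 0.1]),
                                       np.array([1, 1, 0])))
area_mixed = area_under_pr(*pr_curve(np.array([0.9, 0.8, 0.3]),
                                     np.array([1, 0, 1])))
```

Because the scores are ranked once over all entity pairs, the curve reflects corpus-level decisions rather than sentence-level ones.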
Pipeline vs. joint training. Since the contextwise CNN outperforms the piecewise CNN, we use the contextwise CNN for integrating entity types. Figure 4 shows that the performance on the RE dataset increases when we integrate entity type information into the CNN. The main trend of the PR curves and the areas under them shows the following order of model performances: JOINT-TRAIN > WEIGHTED > PREDICTED-HIDDEN > BINARY-HIDDEN > BINARY.
Discussion. The better performance of our approaches of integrating type predictions into the contextwise CNN (PREDICTED-HIDDEN, WEIGHTED) compared to baseline type integrations (BINARY, BINARY-HIDDEN) shows that probabilistic predictions of an entity typing system can be a valuable resource for RE. With binary types, it is not possible to tell whether one of the selected types had a higher probability than another or whether a type whose binary value is 0 just barely missed the threshold. Probabilistic representations preserve this information. Thus, using probabilistic representations, the RE system can compensate for noise in ET predictions.
WEIGHTED, with access to the distributional type embeddings learned from the corpus, works better than all other pipeline models. This shows that our type embeddings can be valuable for RE. JOINT-TRAIN performs better than all pipeline models, even though the ET part in the pipelines is trained on more data. The area of JOINT-TRAIN under the PR curve is A = 0.66. A plausible reason is the mutual dependencies of those two tasks, which a joint model can learn better than a pipeline model. We can also relate it to better noise mitigation of joint ET, compared to isolated models. (On the joint dataset, joint training improves MAP for entity typing by about 20% compared to the best isolated model.)
Analysis of joint training. In this paragraph, we investigate the joint training in more detail. In particular, we evaluate different variants of it by combining relation extraction with other entity typing approaches: EntEmb and FIGMENT. For joint training with ET-EntEmb, we do not use the context for predicting the types of the relation arguments but only their embeddings. Then, we feed those embeddings into an MLP which computes a representation that we use for the type prediction. This corresponds to the EntEmb model presented in Table 3 (line 11). For joint training with ET-FIGMENT, we compute two different cost functions for entity typing: one for typing based on entity embeddings (see ET-EntEmb above) and one for typing based on an MLP context model. This does not correspond exactly to the FIGMENT model from Table 3 (line 12), which combines an entity embedding and MLP context model as a postprocessing step, but comes close. In addition to those two baseline ET models, we also train a version in which both entity typing and relation extraction use EntEmb as their only input features. Figure 5 shows the PR curves for those models. The curve for the model that uses only entity embedding features for both entity typing and relation extraction is much worse than the other curves. This emphasizes the importance of our context model for RE (see also Figure 3), also in combination with joint training. Similarly, the curve for the model with EntEmb as entity typing component has more precision variations than the curves for the other models, which use context features for ET. Thus, joint training does not help per se; it matters which models are trained together. The areas under the PR curves show the following model trends: joint with ET-FIGMENT ≈ joint as in Figure 1 > joint with ET-EntEmb > joint with ET-EntEmb and RE-EntEmb.
Most improved relations. To identify which

Conclusion
In this paper, we addressed different types of noise in two information extraction tasks: entity typing and relation extraction. We presented the first multi-instance multi-label methods for entity typing and showed that they help to alleviate the noise from distantly supervised labels. This is an important contribution because most current fine-grained entity typing systems are trained by distant supervision. Our best model sets a new state of the art in corpus-level entity typing. For relation extraction, we mitigated noise from using predicted entity types as features. We compared different pipeline approaches with each other and with our proposed joint type-relation extraction model. We observed that using type probabilities is more robust than binary predictions of types, and that joint training gives the best results.