Optimizing Multivariate Performance Measures for Learning Relation Extraction Models

We describe a novel max-margin learning approach to optimize non-linear performance measures for distantly-supervised relation extraction models. Our approach can be generally used to learn latent variable models under multivariate non-linear performance measures, such as the $F_\beta$-score. It interleaves the Concave-Convex Procedure (CCCP) for populating latent variables with dual decomposition to factorize the original hard problem into smaller independent sub-problems. The experimental results demonstrate that our learning algorithm is more effective than the ones commonly used in the literature for distant supervision of information extraction models. On several data conditions, we show that our method outperforms the baseline and results in up to 8.5% improvement in the $F_1$-score.


Introduction
Rich models with latent variables are popular for many problems in natural language processing. In information extraction, for example, one needs to predict the relation labels y that an entity-pair x can have based on the hidden relation mentions h, i.e., the relation labels for occurrences of the entity-pair in a given corpus. These models, however, are often trained by optimizing performance measures (such as conditional log-likelihood or error rate) that are not directly related to the task-specific non-linear performance measure, e.g., the $F_1$-score. Better models may be trained by optimizing the task-specific performance measure directly while allowing latent variables to adapt their values accordingly.
We present a large-margin method to learn parameters of latent variable models for a wide range of non-linear multivariate performance measures such as $F_\beta$. Our method can be applied to learning graphical models that incorporate inter-dependencies among the output variables either directly, or indirectly through hidden variables.
Large-margin methods have been shown to be a compelling approach to learn rich models detailing the inter-dependencies among the output variables, via optimizing loss functions decomposable over the training instances (Taskar et al., 2003; Tsochantaridis et al., 2004) or non-decomposable loss functions (Ranjbar et al., 2013; Tarlow and Zemel, 2012; Rosenfeld et al., 2014; Keshet, 2014). They have also been shown to be powerful when applied to latent variable models when optimizing for decomposable loss functions (Wang and Mori, 2011; Felzenszwalb et al., 2010; Yu and Joachims, 2009).
Our large-margin method learns latent variable models via optimizing non-decomposable loss functions. It interleaves the Concave-Convex Procedure (CCCP) (Yuille and Rangarajan, 2001) for populating latent variables with dual decomposition (Komodakis et al., 2011; Rush and Collins, 2012). The latter factorizes the hard optimization problem (encountered in learning) into smaller independent sub-problems over the training instances. We then present linear programming and local search methods for effective optimization of the sub-problems encountered in the dual decomposition. Our local search algorithm leads to a speed-up of 7,000 times compared to the exhaustive search used in the literature (Joachims, 2005; Ranjbar et al., 2012).
Our work is the first to make use of max-margin training in distant supervision of relation extraction models. We demonstrate the effectiveness of our proposed method compared to two strong baseline systems which optimize for the error rate and conditional likelihood, including a state-of-the-art system by Hoffmann et al. (2011). On several data conditions, we show that our method outperforms the baseline and results in up to 8.5% improvement in the $F_1$-score.

Distant Supervision for Relation Extraction
Our framework is motivated by distant supervision for learning relation extraction models (Mintz et al., 2009). The goal is to learn relation extraction models by aligning facts in a database to sentences in a large unlabeled corpus. Since the individual sentences are not hand labeled, the facts in the database act as "weak" or "distant" labels, hence the learning scenario is termed distantly supervised.
Prior work casts this problem as a multi-instance multi-label learning problem (Hoffmann et al., 2011; Surdeanu et al., 2012). It is multi-instance since, for a given entity-pair, only the label of the bag of sentences containing both entities (aka mentions) is given. It is multi-label since a bag of mentions can have multiple labels. The inter-dependencies between relation labels and (hidden) mention labels are modeled by a Markov Random Field (Figure 1) (Hoffmann et al., 2011). The learning algorithms used in the literature for this problem optimize the (conditional) likelihood, but the evaluation measure is commonly the F-score.
Formally, the training data is $D := \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathcal{X}$ is the entity-pair, $y_i \in \mathcal{Y}$ denotes the relation labels, and $h_i \in \mathcal{H}$ denotes the hidden mention labels. The possible relation labels for the entity pair are observed from a given knowledge-base. If there are $L$ candidate relation labels in the knowledge-base, then $y_i \in \{0, 1\}^{L}$ (e.g., $y_{i,\ell}$ is 1 if the relation $\ell$ is licensed by the knowledge-base for the entity-pair) and $h_i \in \{1, \dots, L, \text{nil}\}^{|x_i|}$ (i.e., each mention realizes one of the relation labels or nil).
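To make the label spaces concrete, the following minimal sketch encodes one entity pair. The relation names, the helper names, and the use of `None` for nil are illustrative assumptions, not part of the original system.

```python
RELATIONS = ["born_in", "lives_in", "works_for"]  # hypothetical label set, so L = 3
NIL = None  # stands for the "nil" mention label


def relation_vector(kb_relations):
    """Build y_i in {0, 1}^L: entry l is 1 iff the knowledge base licenses relation l."""
    return [1 if r in kb_relations else 0 for r in RELATIONS]


def is_valid_mention_labelling(h, num_mentions):
    """Check h_i in {1, ..., L, nil}^{|x_i|}: one label (or nil) per mention."""
    allowed = set(range(1, len(RELATIONS) + 1)) | {NIL}
    return len(h) == num_mentions and all(label in allowed for label in h)


# One entity pair with three mentions; all values are made up for illustration.
y_i = relation_vector({"born_in", "works_for"})
h_i = [1, NIL, 3]
```

Only `y_i` is observed during training; `h_i` is the latent part that the learner has to populate.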
Notation. In the rest of the paper, we denote the collection of all entity-pairs $\{x_i\}_{i=1}^{N}$ by $X \in \boldsymbol{\mathcal{X}} := \mathcal{X} \times \dots \times \mathcal{X}$, the collection of mention relations $\{h_i\}_{i=1}^{N}$ by $H \in \boldsymbol{\mathcal{H}} := \mathcal{H} \times \dots \times \mathcal{H}$, and the collection of relation labels $\{y_i\}_{i=1}^{N}$ by $Y \in \boldsymbol{\mathcal{Y}} := \mathcal{Y} \times \dots \times \mathcal{Y}$.
The aim is to learn a parameter vector $w \in \mathbb{R}^{d}$ by which the relation labels for a new entity-pair $x$ can be predicted:
$$f_w(x) := \arg\max_{y} \max_{h} \; w \cdot \Phi(x, h, y)$$
where $\Phi \in \mathbb{R}^{d}$ is a feature vector defined according to the Markov Random Field, modeling the inter-dependencies between $x$ and $y$ through $h$ (Figure 1). In training, we would like to minimize the loss function $\Delta$ by which the model will be assessed at test time. For the relation extraction task, the loss can be considered to be the negative of the $F_\beta$ score:
$$\Delta_{F_\beta}(Y, Y') := -F_\beta(Y, Y') = -\frac{1}{\frac{\beta}{\text{Precision}} + \frac{1-\beta}{\text{Recall}}}$$
where $\beta = 0.5$ results in optimizing against the $F_1$-score. Our proposed learning method optimizes those loss functions $\Delta$ which cannot be decomposed over individual training instances. For example, $F_\beta$ depends non-linearly on Precision and Recall, which in turn require the predictions for all the entity pairs in the training set; hence it cannot be decomposed over individual training instances.
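The non-decomposability can be seen on a tiny example. The sketch below assumes the weighted-harmonic-mean form of $F_\beta$ with weight $\beta$ on precision; the function name and the toy labels are made up. It compares the corpus-level score with the average of per-instance scores.

```python
def f_beta(gold, pred, beta=0.5):
    """Corpus-level F_beta = 1 / (beta / P + (1 - beta) / R).

    gold, pred: lists of 0/1 label vectors, one vector per entity pair.
    """
    tp = sum(g & p for gv, pv in zip(gold, pred) for g, p in zip(gv, pv))
    if tp == 0:
        return 0.0
    n_pred = sum(sum(pv) for pv in pred)   # predicted positives
    n_gold = sum(sum(gv) for gv in gold)   # gold positives
    precision, recall = tp / n_pred, tp / n_gold
    return 1.0 / (beta / precision + (1.0 - beta) / recall)


gold = [[1, 1], [1, 0]]
pred = [[1, 1], [0, 1]]

# Score computed once over all entity pairs.
corpus_f1 = f_beta(gold, pred)
# Average of scores computed separately per entity pair.
mean_instance_f1 = sum(f_beta([g], [p]) for g, p in zip(gold, pred)) / len(gold)
```

Here the corpus-level $F_1$ is 2/3 while the per-instance average is 1/2, so the sample-level loss is not an average of instance-level losses.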

Structured Prediction Learning
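The goal of our learning problem is to find $w \in \mathbb{R}^{d}$ which minimizes the expected loss, aka risk, over a new sample $D'$ of size $N$:
$$R^{\Delta}_{f_w} := \int \Delta\big((y'_1, \dots, y'_N), (f_w(x'_1), \dots, f_w(x'_N))\big)\, d\Pr(D')$$
Generally, the loss function $\Delta$ cannot be decomposed into a linear combination of a loss function $\delta$ over individual training samples. However, most discriminative large-margin learning algorithms assume for simplicity that the loss function is decomposable and the samples are i.i.d. (independent and identically distributed), which simplifies the sample risk $R^{\Delta}_{f_w}$ as:
$$R^{\delta}_{f_w} := \int \delta\big(y', f_w(x')\big)\, d\Pr(x', y')$$
Often learning algorithms make use of the empirical risk as an approximation of the sample risk:
$$\hat{R}^{\delta}_{f_w} := \frac{1}{N} \sum_{i=1}^{N} \delta\big(y_i, f_w(x_i)\big) \tag{5}$$
For non-decomposable loss functions, such as $F_\beta$, $\Delta$ cannot be expressed in terms of an instance-specific loss function $\delta$ to construct the empirical risk of the kind in Eq. (5). Instead, we need to optimize the empirical risk constructed based on the sample loss:
$$\hat{R}^{\Delta}_{f_w} := \Delta\big((y_1, \dots, y_N), (f_w(x_1), \dots, f_w(x_N))\big) \tag{7}$$
Having defined the empirical risk in Eq. (7), we formulate the learning problem as a structured prediction problem. Instead of learning a mapping function $f_w : \mathcal{X} \to \mathcal{Y}$ from an individual instance $x \in \mathcal{X}$ to its label $y \in \mathcal{Y}$, let us learn a mapping function $f : \boldsymbol{\mathcal{X}} \to \boldsymbol{\mathcal{Y}}$ from all instances $X \in \boldsymbol{\mathcal{X}}$ to their labels $Y \in \boldsymbol{\mathcal{Y}}$. We then define the best labeling using a linear discriminant function:
$$f(X) := \arg\max_{Y'} \max_{H} \; w \cdot \Phi(X, H, Y'), \qquad \Phi(X, H, Y) := \sum_{i=1}^{N} \Phi(x_i, h_i, y_i)$$
Based on the margin re-scaling formulation of structured prediction problems (Tsochantaridis et al., 2004), the training objective can be written as the following unconstrained optimization problem, with regularization constant $C$:
$$\min_{w} \; \frac{1}{2}\|w\|^{2} + C \max_{Y'} \Big[ \max_{H'} w \cdot \Phi(X, H', Y') + \Delta(Y, Y') \Big] - C \max_{H} w \cdot \Phi(X, H, Y) \tag{9}$$
which is similar to the training objective for the latent SVMs (Yu and Joachims, 2009), with the difference that the instance-dependent loss function $\delta$ is replaced by the sample loss function $\Delta$. Learning $w$ by optimizing the above objective function is challenging, and is the subject of the next section.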

Optimizing the Training Objective
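In this section we present our method to learn latent SVMs with non-decomposable loss functions. Our training objective is Eq. (9), which can be equivalently expressed as (with $C$ the regularization constant):
$$\min_{w} \; \frac{1}{2}\|w\|^{2} + C \max_{y'_1, \dots, y'_N} \Big[ \Delta\big((y_1, \dots, y_N), (y'_1, \dots, y'_N)\big) + \sum_{i=1}^{N} \max_{h} w \cdot \Phi(x_i, h, y'_i) \Big] - C \sum_{i=1}^{N} \max_{h} w \cdot \Phi(x_i, h, y_i) \tag{10}$$
The training objective is non-convex, since it is the difference of two convex functions. In this section we make use of the CCCP to populate the hidden variables (Yu and Joachims, 2009; Yuille and Rangarajan, 2001), and interleave it with dual decomposition (DD) to solve the resulting intermediate loss-augmented inference problems (Ranjbar et al., 2012; Rush and Collins, 2012; Komodakis et al., 2011).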

Concave-Convex Procedure (CCCP)
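The CCCP (Yuille and Rangarajan, 2001) gives a general iterative method to optimize those non-convex objective functions which can be written as the difference of two convex functions $g_1(w) - g_2(w)$. The idea is to iteratively lower-bound $g_2$ with a linear function $g_2(w^{(t)}) + v \cdot (w - w^{(t)})$, and take the following step to update $w$:
$$w^{(t+1)} := \arg\min_{w} \; g_1(w) - w \cdot v$$
In our case, the training objective Eq. (10) is the difference of two convex functions, where the second function is $g_2(w) := C \sum_{i=1}^{N} \max_{h} w \cdot \Phi(x_i, h, y_i)$. The linear lower bound of $g_2(w)$ involves populating the hidden variables by:
$$h_i^{*} := \arg\max_{h} \; w^{(t)} \cdot \Phi(x_i, h, y_i), \qquad i = 1, \dots, N$$
Therefore, in each iteration of our CCCP-based algorithm we need to optimize Eq. (12), which is reminiscent of the standard structural SVM without latent variables:
$$\min_{w} \; \frac{1}{2}\|w\|^{2} + C \max_{y'_1, \dots, y'_N} \Big[ \Delta\big((y_1, \dots, y_N), (y'_1, \dots, y'_N)\big) + \sum_{i=1}^{N} \max_{h} w \cdot \Phi(x_i, h, y'_i) \Big] - C \sum_{i=1}^{N} w \cdot \Phi(x_i, h_i^{*}, y_i) \tag{12}$$
The above objective function can be optimized using the standard cutting-plane algorithms for structural SVM (Tsochantaridis et al., 2004; Joachims, 2005). The cutting-plane algorithm in turn needs to solve the loss-augmented inference, which is the subject of the next sub-section. The CCCP-based training algorithm is summarized in Algorithm 1.

Algorithm 1 The Training Algorithm (Optimizing Eq. 10)
    Initialize $w^{(0)}$ and set $t = 0$
    repeat
        for $i := 1$ to $N$ do
            $h_i^{*} := \arg\max_{h} w^{(t)} \cdot \Phi(x_i, h, y_i)$
        $w^{(t+1)}$ := the minimizer of Eq. (12)
        $t := t + 1$
    until some stopping condition is met
    return $w^{(t)}$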
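The CCCP update can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's objective: it assumes $g_1(w) = w^4$ and $g_2(w) = 2w^2$, so that $f(w) = w^4 - 2w^2$ has minima at $w = \pm 1$ and the linearized sub-problem has a closed-form solution.

```python
import math


def f(w):
    """Toy difference-of-convex objective: g1(w) - g2(w) = w^4 - 2 w^2."""
    return w ** 4 - 2.0 * w ** 2


def cccp(w, iters=30):
    """Run CCCP from the starting point w and record the objective values."""
    history = [f(w)]
    for _ in range(iters):
        v = 4.0 * w  # gradient of g2 at the current iterate (linear lower bound)
        # argmin_w  w^4 - v * w  satisfies 4 w^3 = v, i.e. w = cbrt(v / 4)
        w = math.copysign(abs(v / 4.0) ** (1.0 / 3.0), v)
        history.append(f(w))
    return w, history


w_star, history = cccp(0.1)
```

Starting from 0.1 the iterates approach the minimum at $w = 1$, and the recorded objective values never increase, which is the property CCCP guarantees for each step.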

Loss-Augmented Inference
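To be able to optimize the training objective Eq. (12) encountered in each iteration of Algorithm 1, we need to solve the so-called loss-augmented inference:
$$\max_{y'_1, \dots, y'_N} \; \Delta\big((y_1, \dots, y_N), (y'_1, \dots, y'_N)\big) + \sum_{i=1}^{N} \max_{h} w \cdot \Phi(x_i, h, y'_i) \tag{13}$$
We make use of the dual decomposition (DD) technique to decouple the two terms of the above objective function, and efficiently find an approximate solution. DD has been shown to be an effective technique for loss-augmented inference in structured prediction models without hidden variables (Ranjbar et al., 2012).
To apply DD to the loss-augmented inference (13), let us rewrite it as a constrained optimization problem:
$$\max_{y'_1, \dots, y'_N,\; y''_1, \dots, y''_N} \; \Delta\big((y_1, \dots, y_N), (y'_1, \dots, y'_N)\big) + \sum_{i=1}^{N} \max_{h} w \cdot \Phi(x_i, h, y''_i) \qquad \text{subject to } y'_i = y''_i \text{ for all } i$$
Introduction of the new variables $(y''_1, \dots, y''_N)$ decouples the two terms in the objective function, and as we will see, leads to an effective optimization algorithm. After forming the Lagrangian, the dual objective function is derived as:
\begin{align}
L(\Lambda) := {} & \max_{Y'} \Big[ \Delta(Y, Y') + \sum_{i=1}^{N} \lambda_i \cdot y'_i \Big] \tag{14} \\
 & + \max_{Y''} \sum_{i=1}^{N} \Big[ \max_{h} w \cdot \Phi(x_i, h, y''_i) - \lambda_i \cdot y''_i \Big] \tag{15}
\end{align}
where $\Lambda := (\lambda_1, \dots, \lambda_N)$, and $\lambda_i$ is a vector of Lagrange multipliers for $L$ binary variables, each of which represents a relation label. The two optimization problems involved in the dual $L(\Lambda)$ are independent and can be solved separately. The dual is an upper bound on the loss-augmented objective function for any value of $\Lambda$; therefore, we can find the tightest upper bound as an approximate solution:
$$\min_{\Lambda} \; L(\Lambda)$$
The dual is non-differentiable at those points $\Lambda$ where either of the two optimization problems has multiple optima. Therefore, it is optimized using the subgradient descent method:
$$\lambda_i^{(t)} := \lambda_i^{(t-1)} - \eta^{(t)} \big( {y'_i}^{*} - {y''_i}^{*} \big)$$
where $\eta^{(t)} = 1/\sqrt{t}$ is the step size, and ${Y'}^{*}$ and ${Y''}^{*}$ are the optimal solutions of the two sub-problems (14) and (15) at $\Lambda^{(t-1)}$. (Other non-increasing functions of the iteration number $t$ are also plausible, as long as they satisfy the following conditions (Komodakis et al., 2011) needed to guarantee the convergence to the optimal solution in the subgradient descent method: $\eta^{(t)} \ge 0$, $\lim_{t \to \infty} \eta^{(t)} = 0$, and $\sum_{t=1}^{\infty} \eta^{(t)} = \infty$.)
The DD algorithm to compute the loss-augmented inference is outlined in Algorithm 2. Now the challenge is how to solve the above two optimization problems, which is the subject of the following section.

Algorithm 2 Loss-Augmented Inference via Dual Decomposition
    Initialize $\Lambda$ and set $t = 1$
    repeat
        ${Y'}^{*} := \arg\max_{Y'} \Delta(Y, Y') + \sum_i \lambda_i \cdot y'_i$
        ${Y''}^{*} := \arg\max_{Y''} \sum_i \big[ \max_{h} w \cdot \Phi(x_i, h, y''_i) - \lambda_i \cdot y''_i \big]$
        if ${Y'}^{*} = {Y''}^{*}$ then
            return ${Y'}^{*}$
        for $i := 1$ to $N$ do
            for $\ell := 1$ to $L$ do
                $\lambda_{i,\ell} := \lambda_{i,\ell} - \eta^{(t)} \big( {y'}^{*}_{i,\ell} - {y''}^{*}_{i,\ell} \big)$
        $t := t + 1$
    until some stopping condition is met
    return ${Y'}^{*}$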
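The subgradient scheme can be sketched on a toy problem. The code below is only an illustration under simplifying assumptions: the multivariate loss is replaced by a hypothetical function `count_fn` of the number of active labels, the model score is replaced by a linear score vector, and all function names are made up. It maximizes `count_fn(sum(y)) + scores . y` over binary vectors by splitting the two terms and updating the multipliers with step size $1/\sqrt{t}$.

```python
import math


def solve_count_term(count_fn, lam):
    """max_y count_fn(sum(y)) + lam . y: for each count k, switch on the k largest lam."""
    n = len(lam)
    order = sorted(range(n), key=lambda j: -lam[j])
    best_val, best_k, running = -math.inf, 0, 0.0
    for k in range(n + 1):
        if k > 0:
            running += lam[order[k - 1]]
        val = count_fn(k) + running
        if val > best_val:
            best_val, best_k = val, k
    y = [0] * n
    for j in order[:best_k]:
        y[j] = 1
    return best_val, y


def solve_linear_term(scores, lam):
    """max_y (scores - lam) . y: switch on every strictly positive coefficient."""
    y = [1 if s - l > 0 else 0 for s, l in zip(scores, lam)]
    return sum((s - l) * yi for s, l, yi in zip(scores, lam, y)), y


def primal_value(count_fn, scores, y):
    return count_fn(sum(y)) + sum(s * yi for s, yi in zip(scores, y))


def dual_decomposition(count_fn, scores, iters=200):
    lam = [0.0] * len(scores)
    best_dual, best_primal, best_y = math.inf, -math.inf, None
    for t in range(1, iters + 1):
        v1, y1 = solve_count_term(count_fn, lam)
        v2, y2 = solve_linear_term(scores, lam)
        best_dual = min(best_dual, v1 + v2)      # every dual value is an upper bound
        for y in (y1, y2):                       # both are feasible primal candidates
            val = primal_value(count_fn, scores, y)
            if val > best_primal:
                best_primal, best_y = val, y
        if y1 == y2:                             # agreement certifies optimality
            break
        eta = 1.0 / math.sqrt(t)
        lam = [l - eta * (a - b) for l, a, b in zip(lam, y1, y2)]
    return best_y, best_primal, best_dual


count_fn = lambda k: -(k - 2) ** 2               # toy stand-in for the loss term
scores = [1.5, -0.5, 0.7, -2.0, 0.2]             # toy stand-in for the model score
y_hat, primal, dual = dual_decomposition(count_fn, scores)
```

The returned `dual` always upper-bounds the true optimum and `primal` lower-bounds it, so the gap between them indicates how tight the approximate solution is.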

Effective Optimization of the Dual
The two optimization problems involved in the dual are hard in general. More specifically, the optimization of the affine-augmented model score (in Eq. 15) is as difficult as the MAP inference in the underlying graphical model, which can be challenging for loopy graphs. For the graphical model underlying distant supervision of relation extraction (Figure 1), we formulate the inference as an ILP (integer linear program). Furthermore, we relax the ILP to an LP to speed up the inference, at the expense of trading exact solutions for approximate ones.
Likewise, the optimization of the affine-augmented multivariate loss (in Eq. 14) is difficult. This is because we have to search over the entire space of $Y' \in \boldsymbol{\mathcal{Y}}$, which is exponentially large, $O(2^{N \cdot L})$. However, if the loss term $\Delta$ can be expressed in terms of some aggregate statistics over $Y'$, such as false positives (FPs) and false negatives (FNs), the optimization can be performed efficiently. This is due to the fact that the number of FPs can range from zero to the number of negative labels, and the number of FNs can range from zero to the number of positive labels. Therefore, the loss term can take $O(N^{2} L^{2})$ different values, which can be represented on a two-dimensional grid. Fixing FPs and FNs to a grid point, $\Lambda \cdot Y'$ is maximized with respect to $Y'$. The grid point which has the best value for $\Delta(Y, Y') + \Lambda \cdot Y'$ will then give the optimal solution for Eq. (14).
Exhaustive search in the space of all possible grid points is not efficient as soon as the grid becomes large. Therefore, we have to adapt the techniques proposed in previous work (Ranjbar et al., 2012; Joachims, 2005). We propose a simple but effective local search strategy for this purpose. The procedure is outlined in Algorithm 3. We start with a random grid point, and move to the best neighbour. We keep hill climbing until there is no neighbour better than the current point. We define the neighbourhood by a set of exponentially-spaced points in all directions around the current point, to improve the exploration of the search space. We present some analysis on the benefits of using this search strategy vis-a-vis the exhaustive search in the Experiments section.

Algorithm 3 Local Search over the (FP, FN) Grid
    $(fp, fn)$ := a random grid point
    repeat
        for $(fp', fn') \in$ Neighbours$(fp, fn)$ do
            compute loss$(fp', fn')$, the best value of $\Delta(Y, Y') + \Lambda \cdot Y'$ at that grid point
        $(fp', fn')$ := the neighbour with the highest loss value
        if loss$(fp, fn)$ > loss$(fp', fn')$ then
            break
        else
            $(fp, fn) := (fp', fn')$
    until some stopping condition is met
    return $(fp, fn)$
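The hill-climbing strategy can be sketched as follows. This is a minimal illustration with assumptions: the neighbourhood uses steps 1, 2, 4, ... along the four axis directions only, `objective` is a hypothetical stand-in for the value of $\Delta(Y, Y') + \Lambda \cdot Y'$ at a grid point, and the start point is passed in rather than drawn at random.

```python
def neighbours(fp, fn, max_fp, max_fn):
    """Exponentially spaced grid points (steps 1, 2, 4, ...) along each axis."""
    points = set()
    step = 1
    while step <= max(max_fp, max_fn):
        for dfp, dfn in ((step, 0), (-step, 0), (0, step), (0, -step)):
            p, n = fp + dfp, fn + dfn
            if 0 <= p <= max_fp and 0 <= n <= max_fn:
                points.add((p, n))
        step *= 2
    return points


def local_search(objective, start, max_fp, max_fn):
    """Move to the best neighbour until no neighbour improves the objective."""
    current = start
    while True:
        candidates = neighbours(*current, max_fp, max_fn)
        best = max(candidates, key=objective, default=current)
        if objective(best) <= objective(current):
            return current
        current = best


def toy_objective(point):
    """Toy objective with a single peak at (7, 3); not the paper's loss."""
    fp, fn = point
    return -(fp - 7) ** 2 - (fn - 3) ** 2


found = local_search(toy_objective, start=(20, 10), max_fp=20, max_fn=10)
```

Each step evaluates only a logarithmic number of neighbours per axis instead of the whole grid. On objectives with several peaks the search can stop at a local optimum, which is the trade-off against exhaustive search.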

Experiments
Dataset: We use the challenging benchmark dataset created by Riedel et al. (2010) for distant supervision of relation extraction models. It is created by aligning relations from Freebase with the sentences in the New York Times corpus (Sandhaus, 2008). The labels for the datapoints come from the Freebase database, but Freebase is incomplete (Ritter et al., 2013). A data point is therefore labeled nil when either no relation exists or the relation is absent in Freebase. To avoid this ambiguity, we train and evaluate the baseline and our algorithms on a subset of this dataset which consists of only non-nil relation labeled datapoints (termed the positive dataset). For the sake of completeness, we do report the accuracies of the various approaches on the entire evaluation dataset.
Systems and Baseline: Hoffmann et al. (2011) describe a state-of-the-art approach for this task. They use a perceptron-style parameter update scheme adapted to handle latent variables; their training objective is the conditional likelihood. Of the two available implementations of this algorithm, we use the better one as our baseline (denoted by Hoffmann). For a fair comparison, the training dataset and the set of features defined over it are common to all the experiments.
We discuss the results of two of our approaches. One is the LatentSVM max-margin formulation with the simple decomposable Hamming loss function, which minimizes the error rate (denoted by MM-hamming). The other is the LatentSVM max-margin formulation with the non-decomposable loss function, which minimizes the negative of the $F_\beta$ score (denoted by MM-F-loss).
Evaluation Measure: The performance measure is $F_\beta$, which can be expressed in terms of false positives (FP) and false negatives (FN) as:
$$F_\beta = \frac{N_p - FN}{\beta\,(FP - FN) + N_p}$$
where $\beta$ is the weight assigned to precision (and $1 - \beta$ to recall). $FP$, $FN$ and $N_p$ are defined as:
$$FP = \sum_{i=1}^{N} \sum_{\ell=1}^{L} (1 - y_{i,\ell})\, y'_{i,\ell}, \qquad FN = \sum_{i=1}^{N} \sum_{\ell=1}^{L} y_{i,\ell}\, (1 - y'_{i,\ell}), \qquad N_p = \sum_{i=1}^{N} \sum_{\ell=1}^{L} y_{i,\ell}$$
We use $1 - F_\beta$ as the expression for the multivariate loss.
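As a check on the count-based form of $F_\beta$, the short derivation below starts from the weighted harmonic mean of precision and recall with weight $\beta$ on precision. It introduces $TP$ for the number of true positives, a symbol not used in the original text.

```latex
\begin{align*}
TP &= N_p - FN, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{N_p}, \\
F_\beta &= \frac{1}{\frac{\beta}{\text{Precision}} + \frac{1-\beta}{\text{Recall}}}
         = \frac{TP}{\beta\,(TP + FP) + (1-\beta)\,N_p}
         = \frac{N_p - FN}{N_p + \beta\,(FP - FN)}, \\
F_{0.5} &= \frac{2\,TP}{2\,TP + FP + FN} = F_1 .
\end{align*}
```

The last line confirms that $\beta = 0.5$ recovers the usual $F_1$-score, and the middle line shows that $F_\beta$ depends on the predictions only through the two counts $FP$ and $FN$.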

Training on Sub-samples of Data
We performed a number of experiments using different randomized subsets of the Riedel dataset (10% of the positive dataset) for training the max-margin approaches. This was done in order to empirically determine a good set of parameters for training. We also compare the results of the approaches with Hoffmann trained on the same sub-samples.
Comparison with the Baseline: We report the average over 15 subsets of the dataset with a 90% confidence interval (using the Student's t-distribution). The results of these experiments are shown in Figure 2 and Table 1. We observe that both MM-hamming and MM-F-loss have a higher $F_1$-score than the baseline. MM-F-loss, which optimizes the multivariate performance measure, improves the $F_1$-score significantly, by 8.52% over Hoffmann and by 7.12% over MM-hamming. This highlights the importance of using non-linear loss functions rather than simple loss functions like error rate during training. However, Hoffmann has a marginally higher precision, by about 1.13%. We noticed that this was due to over-fitting the data, as the performance on the training datasets was very high. One more interesting observation is that MM-F-loss is fairly balanced with respect to precision and recall, which the other approaches are not.

Tuning towards Precision/Recall:
Often we come across situations where either precision or recall is more important for a given application. This is modeled by the notion of $F_\beta$ (van Rijsbergen, 1979). One of the main advantages of using a non-decomposable loss function like $F_\beta$ is the ability to adapt the learning algorithm to such situations. For instance, we can tune the objective to favor precision more than recall by "up-weighting" precision in the $F_\beta$-score.
In the previous case, for example, we observed that MM-F-loss has a marginally poorer precision compared to Hoffmann. If we increase the weight of precision to $\beta = 0.833$, we observe a dramatic increase in precision from 65.83% to 86.59%. As expected, due to the precision-recall trade-off, we observe a decrease in recall. The results are shown in Figure 3.
Local vs. Exhaustive Grid Search: As we described in Section 3.3, we devise a simple yet efficient local search strategy to search the space of $(FP, FN)$ grid-points. This enables a speed-up of three orders of magnitude in solving the dual-optimization problem. In Table 2, we compare the average time per iteration and the $F_1$-score when each of these techniques is used for training on a sub-sample dataset. We observe that there is a significant decrease in training time when we use local search (almost 7,000 times faster), with a negligible decrease in $F_1$-score (0.073%).

Figure 4 and Table 3 present the overall results of our approaches compared to the baseline on the positive dataset. We observe that MM-F-loss has an increase in $F_1$-score of ∼8% compared to the baseline. This confirms the observation we made earlier on the sub-sample datasets.
By assigning more weight to precision, we are able to improve over the precision of Hoffmann by ∼1.6% (Table 4). When precision is tuned with a higher weight during training of MM-F-loss, we see an improvement in precision without much dip in recall.

Discussion
So far we have discussed the performance of various approaches on the positive evaluation dataset. Our approach is shown to improve the overall $F_\beta$-score, with better recall than the baseline. By suitably adjusting $\beta$, we show an improvement in precision as well.
The performance of the approaches when evaluated on the entire test dataset (consisting of both nil and non-nil datapoints) is shown in Table 5. Max-margin based approaches generally perform well when trained only on the positive dataset compared to Hoffmann. However, our $F_1$-scores are ∼8% lower when we train on the entire dataset consisting of both nil and non-nil datapoints. In a recent work, Xu et al. (2013) provide some statistics about the incompleteness of the Riedel dataset. Of the 1,854 sentences sampled from the NYTimes corpus, most of the entity pairs expressing a relation in Freebase correspond to false negatives. This is one of the reasons why we do not consider nil labeled datapoints during training and evaluation.
MIMLRE (Surdeanu et al., 2012) is another state-of-the-art system, which is based on the EM algorithm. Since that system uses an additional set of features for the relation variables y, it is not our primary baseline. On the positive dataset, our model MM-F-loss achieves an $F_1$-score of 65.598% compared to 65.341% for MIMLRE. As part of future work, we would like to incorporate the additional features present in MIMLRE into our approach.

Conclusion
In this paper, we described a novel max-margin approach to optimize non-linear performance measures, such as $F_\beta$, in distant supervision of information extraction models. Our approach is general and can be applied to other latent variable models in NLP. It solves the hard optimization problem in learning by interleaving the Concave-Convex Procedure with dual decomposition. Dual decomposition allowed us to solve the hard sub-problems independently. A key aspect of our approach is a local-search algorithm which has led to a speed-up of 7,000 times in our experiments. We have demonstrated the efficacy of our approach in distant supervision of relation extraction. Under several conditions, we have shown that our technique outperforms very strong baselines, and results in up to 8.5% improvement in $F_1$-score.
For future work, we would like to maximize other performance measures, such as area under the curve, for information extraction models. Furthermore, we would like to explore our approach for other latent variable models in NLP, such as those in machine translation.