Don’t understand a measure? Learn it: Structured Prediction for Coreference Resolution optimizing its measures

An interesting aspect of structured prediction is the evaluation of an output structure against the gold standard. Especially in the loss-augmented setting, the need to find the max-violating constraint has severely limited the expressivity of effective loss functions. In this paper, we trade off exact computation for the ability to use and study more complex loss functions for coreference resolution. Most interestingly, we show that such functions can be (i) automatically learned, even from controversial but commonly accepted coreference measures, e.g., MELA, and (ii) successfully used in learning algorithms. An accurate model comparison on the standard CoNLL-2012 setting shows the benefit of more expressive loss functions.


Introduction
In recent years, interesting structured prediction methods have been developed for coreference resolution (CR), e.g., (Fernandes et al., 2014; Björkelund and Kuhn, 2014; Martschat and Strube, 2015). These models are supposed to output clusters but, to better control the exponential nature of the problem, the clusters are converted into tree structures. Although this simplifies the problem, optimal solutions are associated with an exponential set of trees, which requires maximizing over such a set. This gave rise to latent models (Yu and Joachims, 2009) optimizing the so-called loss-augmented objective functions.
In this setting, loss functions need to be factorizable together with the feature representations for finding the max-violating constraints. The consequence is that only simple loss functions, basically just counting incorrect edges, were applied in previous work, giving up expressivity for simplicity. This is a critical limitation, as domain experts consider more information than just counting edges.
In this paper, we study the use of more expressive loss functions in the structured prediction framework for CR, although some findings are clearly applicable to more general settings. We attempted to optimize the complicated official MELA measure (Pradhan et al., 2012) of CR within the learning algorithm. Unfortunately, MELA is the average of several measures, among which CEAF$_e$ has an excessive computational complexity that prevents its direct use. To solve this problem, we defined a model for learning MELA from data using a fast linear regressor, which can then be effectively used in structured prediction algorithms. We defined features to learn such a loss function, e.g., different link counts or aggregations such as Precision and Recall. Moreover, we designed methods for generating training data from which our regression loss algorithm (RL) can generalize well and accurately predict MELA values on unseen data.
Since RL is not factorizable over a mention graph, we designed a latent structured perceptron (LSP) that can optimize non-factorizable loss functions on CR graphs. We tested LSP using RL and other traditional loss functions in the same setting as the CoNLL-2012 Shared Task, thus enabling an exact comparison with previous work. The results confirmed that RL can be effectively learned and used in LSP, although the improvement was smaller than expected, considering that our RL provides the algorithm with more accurate feedback.
Thus, we analyzed the theory behind this process, also contributing to the definition of the properties of loss optimality. These show that the available loss functions, e.g., by Fernandes et al.; Yu and Joachims, are enough for optimizing MELA on the training set, at least when the data is separable. Thus, in such conditions, we cannot expect a very large improvement from RL.
To confirm this conjecture, we tested the models in a more difficult setting in terms of separability. We used different feature sets of a smaller size and found that, in such conditions, RL requires fewer epochs to converge and produces better results than the simpler loss functions. The accuracy of the RL-based model, using 16 times fewer features, decreases by just 0.3 points, still improving the state of the art in structured prediction. Accordingly, in the Arabic setting, where the available features are less discriminative, our approach substantially improves the standard LSP.

Related Work
There are a number of works attempting to directly optimize coreference metrics. The solution proposed by Zhao and Ng (2010) consists in finding an optimal weighting (by beam search) of training instances, which would maximize the target coreference metric. Their models, optimizing MUC and B$^3$, deliver a significant improvement on the MUC and ACE corpora. Uryupina et al. (2011) benefited from applying genetic algorithms for the selection of features and architecture configuration by multi-objective optimization of MUC and the two CEAF variants. Our approach is different in that the evaluation measure (its approximation) is injected directly into the learning algorithm. Clark and Manning (2016) optimize B$^3$ directly as well, within a mention-ranking model. For efficiency reasons, they omit optimization of CEAF, which we enable in this work.
SVM$^{cluster}$, a structured output approach by Finley and Joachims (2005), enables optimization of any clustering loss function (including non-decomposable ones). The authors experimentally show that optimizing particular loss functions results in better classification accuracy in terms of the same functions. However, these are in general fast to compute, which is not the case for MELA.
While Finley and Joachims are compelled to perform approximate inference to overcome the intractability of finding an optimal clustering, the latent variable structural approaches, namely the SVM of Yu and Joachims (2009) and the perceptron of Fernandes et al. (2014), render exact inference possible by introducing auxiliary graph structures. The modeling of Fernandes et al. (also referred to as the antecedent tree approach) is exploited in the works of Björkelund and Kuhn (2014), Martschat and Strube (2015), and Lassalle and Denis (2015). Like us, the first couples such an approach with approximate inference, but in order to enable the use of non-local features. The current state-of-the-art model of Wiseman et al. (2016) also employs a greedy inference procedure, as it has global features from an RNN as a non-decomposable term in the inference objective.

Structure Output Learning for CR
We consider online learning algorithms for linking structured input and output patterns. More formally, such algorithms find a linear mapping $f(x, y) = \langle w, \Phi(x, y) \rangle$, where $f: X \times Y \to \mathbb{R}$, $w$ is a linear model, and $\Phi(x, y)$ is a combined feature vector of input variables $X$ and output variables $Y$. The predicted structure is derived with $\arg\max_{y \in Y} f(x, y)$. In the next sections, we show how to learn $w$ for CR using the structured perceptron. Additionally, we provide a characterization of effective loss functions for separable cases.
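The prediction rule above can be illustrated with a minimal sketch. It assumes that the candidate outputs can be enumerated explicitly, which is not the case for CR (hence the tree-based decoding of the following sections); the names `dot`, `predict`, and `feature_map` are hypothetical and introduced only for illustration.

```python
from typing import Callable, Iterable, Sequence, TypeVar

Y = TypeVar("Y")


def dot(w: Sequence[float], phi: Sequence[float]) -> float:
    """Inner product <w, phi> of two equally long vectors."""
    return sum(wi * pi for wi, pi in zip(w, phi))


def predict(w: Sequence[float], x: object, candidates: Iterable[Y],
            feature_map: Callable[[object, Y], Sequence[float]]) -> Y:
    """Return argmax_y <w, Phi(x, y)> over an explicitly enumerated candidate set."""
    return max(candidates, key=lambda y: dot(w, feature_map(x, y)))
```

With a one-hot feature map over three candidate outputs and `w = [0.1, 0.9, 0.3]`, `predict` returns the second candidate.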

Modeling CR
In this framework, CR is essentially modeled as a clustering problem, where an input-output example is described by a tuple $(x, y, h)$: $x$ is a set of entity mentions contained in a text document, $y$ is the set of the corresponding mention clusters, and $h$ is a latent variable, i.e., an auxiliary structure that can represent the clusters of $y$. For example, given the following text:
Although (she)$_{m_1}$ was supported by (President Obama)$_{m_2}$, (Mrs. Clinton)$_{m_3}$ missed (her)$_{m_4}$ (chance)$_{m_5}$, (which)$_{m_6}$ looked very good before counting votes.
the clusters of the entity mentions are represented by the latent tree in Figure 1, where the nodes are mentions and the subtrees connected to the additional root node form distinct clusters. The tree $h$ is called a latent variable as it is consistent with $y$, i.e., it contains only links between mention nodes that corefer or fall into the same cluster according to $y$. Clearly, an exponential set of trees, $H$, can be associated with one and the same clustering $y$. Using only one tree to represent a clustering makes the search for optimal mention clusters tractable. In particular, structured prediction algorithms select the $h$ that maximizes the model learned at time $t$, as shown in the next section.
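How a latent tree induces a clustering can be sketched as follows. The parent assignments used in the usage note are an assumption made for illustration (one of the many trees consistent with a plausible reading of the example sentence), not a transcription of Figure 1; the function name and the convention that 0 denotes the artificial root are also hypothetical.

```python
from typing import Dict, List, Set


def clusters_from_tree(parent: Dict[int, int]) -> List[Set[int]]:
    """Recover mention clusters from a latent tree.

    parent[i] is the antecedent of mention i; 0 stands for the artificial root.
    Every subtree attached directly to the root forms one cluster.
    """
    clusters: Dict[int, Set[int]] = {}
    for mention in parent:
        top = mention
        # Climb until we reach the node attached directly to the root.
        while parent[top] != 0:
            top = parent[top]
        clusters.setdefault(top, set()).add(mention)
    return sorted(clusters.values(), key=min)
```

For instance, `clusters_from_tree({1: 0, 2: 0, 3: 1, 4: 3, 5: 0, 6: 5})` groups mentions 1, 3, 4 together, keeps mention 2 alone, and groups mentions 5 and 6. Attaching mention 4 to mention 1 instead of mention 3 would yield the same clusters, which is why many trees correspond to one clustering.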

Latent Structured Perceptron (LSP)
The LSP model proposed by Sun et al. (2009) and specialized for solving CR tasks by Fernandes et al. (2012) is described by Alg. 1.
Given a training set $\{(x_i, y_i)\}_{i=1}^{n}$, an initial weight vector $w_0$ (either 0 or a random vector), a trade-off parameter $C$, and the maximum number of epochs $T$, LSP iterates the following operations. Line 5 finds a latent tree $h_i^*$ that maximizes $\langle w_t, \Phi(x_i, h) \rangle$ for the current example $(x_i, y_i)$, i.e., it finds the best ground-truth tree with respect to the current $w_t$. Finding this maximum requires exploring the tree set $H(x_i, y_i)$, which only contains arcs between mentions that corefer according to the gold standard clustering $y_i$. Line 6 seeks the max-violating tree $\hat{h}_i$ in $H(x_i)$, which is the set of all candidate trees using any possible combination of arcs. Line 7 tests whether the produced tree $\hat{h}_i$ contains mistakes with respect to the gold clustering $y_i$, using the loss function $\Delta(y_i, h_i^*, \hat{h}_i)$. Note that some models define a loss that also exploits the current best latent tree $h_i^*$. If mistakes are found, the model is updated with the vector $\Phi(x_i, h_i^*) - \Phi(x_i, \hat{h}_i)$.
Fernandes et al. (2012) used exactly the directed trees we showed as latent structures and applied Edmonds' spanning tree algorithm (Edmonds, 1967) for finding the max. Their model achieved the best results in the CoNLL-2012 Shared Task, a challenge for CR systems (Pradhan et al., 2012). Their selected loss function also plays an important role, as shown in the following.
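The training loop described above can be sketched as follows. This is a minimal illustration rather than a transcription of Alg. 1: the two decoders and the loss are passed in as callables, and all function and parameter names (`lsp_train`, `best_gold_tree`, `max_violating_tree`, `feature_map`) are hypothetical. Weight averaging and other refinements that an actual system may use are omitted.

```python
from typing import Callable, List, Sequence, Tuple


def lsp_train(data: Sequence[Tuple[object, object]],
              feature_map: Callable[[object, object], Sequence[float]],
              best_gold_tree: Callable,      # (w, x, y) -> h*, searched in H(x, y)
              max_violating_tree: Callable,  # (w, x, y, h*, C) -> h^, searched in H(x)
              loss: Callable,                # (y, h*, h^) -> non-negative number
              dim: int, C: float = 1.0, epochs: int = 5) -> List[float]:
    """Latent structured perceptron with a loss-augmented decoder."""
    w = [0.0] * dim  # initial model: the zero vector
    for _ in range(epochs):
        for x, y in data:
            h_star = best_gold_tree(w, x, y)               # best tree consistent with y
            h_hat = max_violating_tree(w, x, y, h_star, C)  # loss-augmented prediction
            if loss(y, h_star, h_hat) > 0:                  # update only on mistakes
                phi_star = feature_map(x, h_star)
                phi_hat = feature_map(x, h_hat)
                w = [wi + a - b for wi, a, b in zip(w, phi_star, phi_hat)]
    return w
```

On separable toy data the loop stops changing `w` once the max-violating structure has zero loss, which is the situation analyzed in the optimality discussion later in this section.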

Loss functions
When defining a loss, it is very important to preserve the factorization of the model components along the latent tree edges since this leads to efficient maximization algorithms (see Section 5).
Fernandes et al. use a loss function that (i) compares a predicted tree $\hat{h}$ against the gold tree $h^*$ and (ii) factorizes over the edges in the way the model does. Its equation is:
$\Delta_F(h^*, \hat{h}) = \sum_{i} \mathbb{1}_{\hat{h}(i) \neq h^*(i)} \left(1 + r \cdot \mathbb{1}_{h^*(i) = \text{root}}\right)$, (1)
where the sum ranges over the mention nodes $i$, $h^*(i)$ and $\hat{h}(i)$ output the parent of the mention node $i$ in the gold and predicted tree, respectively, and $\mathbb{1}_{\hat{h}(i) \neq h^*(i)}$ checks whether the parents differ; if they do, a penalty of 1 (or $1 + r$ if the gold parent is the root) is added.
Yu and Joachims's loss is based on an undirected tree without a root and on the gold clustering $y$. It is computed as:
$\Delta_{YJ}(y, \hat{h}) = n(y) - k(y) + \sum_{e \in \hat{h}} l(y, e)$, (2)
where $n(y)$ is the number of graph nodes, $k(y)$ is the number of clusters in $y$, and $l(y, e)$ assigns $-1$ to any edge $e$ that connects nodes from the same cluster in $y$, and $r$ otherwise.
In our experiments, we adopt both loss functions. However, in contrast to Fernandes et al., we always measure $\Delta_F$ against the gold label $y$ and not against the current $h^*$, i.e., in the way it is done by Martschat and Strube (2015), who employ an equivalent LSP model in their work.
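The two edge-counting losses described above can be sketched in a few lines. The data layout is an assumption made for illustration: trees are given as parent maps with 0 denoting the root, the gold clustering as a map from mention to cluster label, and undirected edges as pairs; the function names are hypothetical.

```python
from typing import Dict, Hashable, Iterable, Tuple


def delta_f(gold_parent: Dict[int, int], pred_parent: Dict[int, int],
            r: float = 1.0) -> float:
    """Fernandes-style loss: 1 per wrong parent, 1 + r if the gold parent is the root (0)."""
    total = 0.0
    for node, gold in gold_parent.items():
        if pred_parent[node] != gold:
            total += 1.0 + (r if gold == 0 else 0.0)
    return total


def delta_yj(cluster_of: Dict[int, Hashable],
             pred_edges: Iterable[Tuple[int, int]], r: float = 1.0) -> float:
    """Yu-and-Joachims-style loss: n(y) - k(y) + sum over predicted edges of l(y, e)."""
    n = len(cluster_of)                # number of graph nodes
    k = len(set(cluster_of.values()))  # number of gold clusters
    edge_sum = sum(-1.0 if cluster_of[a] == cluster_of[b] else r
                   for a, b in pred_edges)
    return n - k + edge_sum
```

Both functions return 0 for an error-free prediction and grow with every additional wrong edge, which is the monotonicity property used in the next section.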

On optimality of simple loss functions
The above loss functions are rather simple and mainly based on counting the number of mistaken edges. Below, we show that, when the training data is separable, such simple loss functions also optimize any general task measure that reaches its maximum when they count zero mistakes. The latter is a desirable characteristic of many measures used in CR and NLP research.
Proposition 1 (Sufficient condition for optimality of loss functions for learning graphs). Let $\Delta(y, h^*, \hat{h}) \geq 0$ be a simple, edge-factorizable loss function, which is also monotone in the number of edge errors, and let $\mu(y, \hat{h})$ be any graph-based measure maximized by no edge errors. Then, if the training set is linearly separable, LSP optimizing $\Delta$ converges to the $\mu$ optimum.

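Proof. If the training set is linearly separable, LSP converges, i.e., $\Delta(y_i, h_i^*, \hat{h}_i) = 0$, $\forall x_i$. Since the loss is edge-factorizable,
$\Delta(y_i, h_i^*, \hat{h}_i) = \sum_{e \in \hat{h}_i} l(y_i, h_i^*, e)$, (3)
where $l(\cdot)$ is an edge loss function. Thus, $\sum_{e \in \hat{h}_i} l(y_i, h_i^*, e) = 0$. The latter equation and monotonicity imply $l(y_i, h_i^*, e) = 0$, $\forall e \in \hat{h}_i$, i.e., there are no edge mistakes: otherwise, by fixing such edges, we would obtain a smaller $\Delta$, i.e., a negative one, contradicting the non-negativity hypothesis. Thus, no edge mistake in any $x_i$ implies that $\mu(y, \hat{h})$ is maximized on the training set.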
Corollary 1. Both $\Delta_F$ and $\Delta_{YJ}$ fulfil the sufficient condition of Proposition 1.
Proof. Equations 1 and 2 show that both are 0 when applied to a clustering with no mistake on the edges. Additionally, with each additional edge mistake, both loss functions increase, implying monotonicity. Thus, they satisfy all the assumptions of Proposition 1.
The above characteristic suggests that $\Delta_F$ and $\Delta_{YJ}$ can optimize any measure that reasonably targets no mistakes as its best outcome. Clearly, this property does not guarantee that loss functions are suitable for a given task measure, e.g., the latter may have different max points and behave rather discontinuously. However, a common practice in NLP is to optimize the maximum of a measure, e.g., in the case of Precision and Recall, or Accuracy; therefore, loss functions able to at least achieve such an optimum are preferable.

Automatically learning a loss function
How to measure a complex task such as CR has generated a long and controversial discussion in the research community. While this debate is progressing, the most accepted and used measure is the so-called Mention, Entity, and Link Average (MELA) score. As will be clear from the description below, MELA is not easily interpretable and not robust to the mention identification effect (Moosavi and Strube, 2016). Thus, loss functions showing the optimality property may not be enough to optimize it. Our proposal is to use a version of MELA transformed into a loss function optimized by an LSP algorithm with inexact inference. However, the computational complexity of the measure prevents effective learning. Our solution is thus to learn MELA with a fast linear regressor, which also produces a continuous version of the measure.
MUC is based on the number of correctly predicted links between mentions. The number of links required for obtaining the key entity set $K$ is $\sum_{k_i \in K} (|k_i| - 1)$, where $k_i$ are key entities in $K$ (cardinality of each entity minus one). MUC Recall computes what fraction of these were predicted; the number of predicted links is $\sum_{k_i \in K} (|k_i| - |p(k_i)|)$, where $p(k_i)$ is the partition of the key entity $k_i$ formed by intersecting it with the corresponding response entities $r_j \in R$ such that $k_i \cap r_j \neq \emptyset$. This number equals the number of key links minus the number of missing links required to unite the parts of the partition $p(k_i)$ to obtain $k_i$.
B$^3$ computes Precision and Recall individually for each mention. For mention $m$: $\text{Recall}_m = \frac{|k_i^m \cap r_j^m|}{|k_i^m|}$, where $k_i^m$ and $r_j^m$, indexed by $m$, denote, correspondingly, the key and response entities into which $m$ falls. The over-document Recall is then an average of these taken with respect to the number of the key mentions. The MUC and B$^3$ Precision is computed by interchanging the roles of the key and response entities.
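The MUC and B$^3$ Recall computations described above can be sketched as follows, with entities represented as sets of mention identifiers. Two conventions are assumptions of this sketch rather than statements of the paper: in MUC, key mentions absent from every response entity are counted as singleton parts of the partition (the standard MUC convention), and in B$^3$ such mentions contribute zero. The function names are hypothetical.

```python
from typing import List, Set


def muc_recall(key: List[Set[int]], response: List[Set[int]]) -> float:
    """MUC Recall: sum(|k| - |p(k)|) / sum(|k| - 1) over key entities k."""
    num = den = 0
    for k in key:
        parts = [k & r for r in response if k & r]
        covered = set().union(*parts) if parts else set()
        # Mentions missing from the response are treated as singleton parts.
        n_parts = len(parts) + len(k - covered)
        num += len(k) - n_parts
        den += len(k) - 1
    return num / den if den else 0.0


def b3_recall(key: List[Set[int]], response: List[Set[int]]) -> float:
    """B^3 Recall: average over key mentions m of |k^m & r^m| / |k^m|."""
    total = sum(len(k) for k in key)
    # Each of the |k & r| shared mentions contributes |k & r| / |k|.
    score = sum(len(k & r) ** 2 / len(k) for k in key for r in response)
    return score / total if total else 0.0


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of Precision and Recall."""
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0
```

Precision is obtained by swapping the arguments, e.g., `muc_recall(response, key)`.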
CEAF$_e$ computes the similarity between key and system entities after finding an optimal alignment between them. Using $\psi(k_i, r_j) = \frac{2|k_i \cap r_j|}{|k_i| + |r_j|}$ as the entity similarity measure, it finds an optimal one-to-one map $g^*: K \to R$, which maps every key entity to a response entity, maximizing the overall similarity $\Psi(g) = \sum_{k_i \in K} \psi(k_i, g(k_i))$ of the example. This is solved as a bipartite matching problem by the Kuhn-Munkres algorithm. Then Precision and Recall are $\frac{\Psi(g^*)}{\sum_{r_j \in R} \psi(r_j, r_j)}$ and $\frac{\Psi(g^*)}{\sum_{k_i \in K} \psi(k_i, k_i)}$, respectively.
MELA computation is rather expensive, mostly because of CEAF$_e$. Its complexity is bounded by $O(M l^2 \log l)$ (Luo, 2005), where $M$ and $l$ are, correspondingly, the maximum and minimum number of entities in $y$ and $\hat{y}$. Computing CEAF$_e$ is especially slow for candidate outputs $\hat{y}$ with a low prediction quality, i.e., when $l$ is big and the coherence with the gold $y$ is scarce.

Algorithm 2 Finding a Max-violating Spanning Tree
1: Input: training example $(x, y)$; graph $G(x)$ with vertices $V$ denoting mentions; set of the incoming candidate edges, $E(v)$, $v \in V$; weight vector $w$
2: $h^* \leftarrow \emptyset$
3: for $v \in V$ do
4:   $e^* = \arg\max_{e \in E(v)} \langle w, e \rangle + C \times l(y, e)$
5:   $h^* = h^* \cup e^*$
6: end for
7: return max-violating tree $h^*$
8: (clustering $y^*$ is induced by the tree $h^*$)
Finally, B$^3$ and CEAF$_e$ are strongly influenced by the mention identification effect (Moosavi and Strube, 2016). Thus, $\Delta_F$ and $\Delta_{YJ}$ may output identical values for different clusterings that can have a big gap in terms of MELA.
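To make the CEAF$_e$ alignment step concrete, the following sketch finds the best one-to-one map by brute force over permutations. This replaces the Kuhn-Munkres algorithm with exhaustive search purely for readability, so it is exponential and only suitable for toy inputs; the function names are hypothetical.

```python
from itertools import permutations
from typing import List, Set, Tuple


def psi(k: Set[int], r: Set[int]) -> float:
    """Entity similarity 2|k & r| / (|k| + |r|)."""
    return 2 * len(k & r) / (len(k) + len(r))


def ceaf_e(key: List[Set[int]], response: List[Set[int]]) -> Tuple[float, float]:
    """Return (Precision, Recall) of CEAF_e using an exhaustive alignment search."""
    # Align the smaller side to distinct entities of the larger side (psi is symmetric).
    small, large = (key, response) if len(key) <= len(response) else (response, key)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        total = sum(psi(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, total)
    # psi(e, e) = 1, so the denominators are simply the numbers of entities.
    precision = best / len(response) if response else 0.0
    recall = best / len(key) if key else 0.0
    return precision, recall
```

For `key = [{1, 3, 4}, {5, 6}]` and `response = [{1, 3}, {4, 5, 6}]`, the identity alignment scores 0.8 + 0.8, so both Precision and Recall are 0.8. A MELA value would then be the average of the MUC, B$^3$, and CEAF$_e$ F1 scores.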

Features for learning measures
As computational reasons prevent the use of MELA in LSP (see our inexact search algorithm in Section 5), we study methods for approximating it with a linear regressor. For this purpose, we define nine features, which count either exact or simplified versions of Precision, Recall and F1 of each of the three metric components of MELA. Clearly, neither $\Delta_F$ nor $\Delta_{YJ}$ provides the same values.
Apart from the computational complexity, the difficulty of evaluating the quality of the predicted clustering $\hat{y}$ during training is also due to the fact that CR is carried out on automatically detected mentions, while it needs to be compared against a gold standard clustering of a gold mention set. However, we can use simple information about automatic mentions and how they relate to gold mentions and gold clusters. In particular, we use four numbers: (i) correctly detected automatic mentions, (ii) links they have in the gold standard, (iii) gold mentions, and (iv) gold links. The last one enables the precise computation of Precision, Recall and F1-measure values of MUC; the required partitions $p(k_i)$ of key entities are also available at training time as they contain only automatic mentions. These are the first three features that we design. Likewise for B$^3$, the feature values can be derived using (ii) and (iii).
For computing the CEAF$_e$ heuristics, we do not perform cluster alignment to find an optimal $\Psi(g^*)$. Instead of $\Psi(g^*)$, which can be rewritten as $\sum_{m \in K \cap R} \frac{2}{|k_i^m| + |g^*(k_i^m)|}$ if summing up over the mentions rather than the entities, we simply use $\tilde{\Psi} = \sum_{m \in K \cap R} \frac{2}{|k_i^m| + |r_j^m|}$, pretending that for each $m$ its key $k_i^m$ and response $r_j^m$ entities are aligned. The terms $\sum_{r_j \in R} \psi(r_j, r_j)$ and $\sum_{k_i \in K} \psi(k_i, k_i)$ in the denominators of the Precision and Recall are the number of predicted and gold clusters, correspondingly. The imprecision of the CEAF$_e$-related features is expected to be compensated for when they are put together with the exact B$^3$ and MUC values in regression learning on the exact MELA values (implicitly, exact CEAF$_e$ values as well).
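The alignment-free CEAF$_e$ approximation can be sketched as below. It assumes the per-mention form $2 / (|k^m| + |r^m|)$, which follows from the definition of $\psi$ when every shared mention is treated as if its key and response entities were aligned; the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def psi_tilde(key: List[Set[int]], response: List[Set[int]]) -> float:
    """Sum 2 / (|k^m| + |r^m|) over mentions m present in both key and response."""
    key_of: Dict[int, Set[int]] = {m: k for k in key for m in k}
    resp_of: Dict[int, Set[int]] = {m: r for r in response for m in r}
    shared = key_of.keys() & resp_of.keys()
    return sum(2.0 / (len(key_of[m]) + len(resp_of[m])) for m in shared)


def ceaf_e_heuristic(key: List[Set[int]], response: List[Set[int]]) -> Tuple[float, float]:
    """Heuristic (Precision, Recall): psi_tilde over the numbers of clusters."""
    value = psi_tilde(key, response)
    precision = value / len(response) if response else 0.0
    recall = value / len(key) if key else 0.0
    return precision, recall
```

Because no one-to-one constraint is enforced, the heuristic can overestimate the exact value: for `key = [{1, 3, 4}, {5, 6}]` and `response = [{1, 3}, {4, 5, 6}]` it sums to about 1.93, whereas the optimal alignment gives 1.6. No matching problem has to be solved, which is the point of the approximation.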

Generating training and test data
The features described above can be used to characterize the clustering variables $\hat{y}$. For generating training data, we collected all the max-violating $\hat{y}$ produced during LSP$_F$ (using $\Delta_F$) learning and associated them with their correct MELA scores from the scorer. This way, we can have both training and test data for our regressor. In our experiments, for the generation purpose, we decided to run LSP$_F$ on each document separately to obtain more variability in the $\hat{y}$'s. We use a simple linear SVM to learn a model $w_\rho$. Considering that the MELA$(y, \hat{y})$ score lies in the interval $[0, 100]$, a simple approximation of the loss could be:
$\Delta_\rho(y, \hat{y}) = 100 - w_\rho \cdot \phi(y, \hat{y})$, (4)
where $\phi(y, \hat{y})$ is the feature vector described in the previous section. Below, we show its improved version and an LSP for learning with it based on inexact search.
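A learned loss of this kind can be sketched as 100 minus the MELA value predicted by the linear regressor. This form is an assumption consistent with a MELA score in [0, 100] and with a loss that decreases as MELA grows; the function name, the optional `bias` term, and the toy feature values are hypothetical.

```python
from typing import Sequence


def delta_rho(w_rho: Sequence[float], features: Sequence[float],
              bias: float = 0.0) -> float:
    """Loss derived from a linear MELA regressor: 100 - (w_rho . features + bias)."""
    predicted_mela = sum(w * f for w, f in zip(w_rho, features)) + bias
    return 100.0 - predicted_mela
```

For example, `delta_rho([0.5, 0.5], [80.0, 60.0])` predicts a MELA of 70 and therefore returns a loss of 30. Since the regressor is not constrained, the value is only approximately within [0, 100].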

Learning with learned loss functions
Our experiments will demonstrate that $\Delta_\rho$ can be accurately learned from data. However, the features we used for this are not factorizable over the edges of the latent trees. Thus, we design a new LSP algorithm that can use our learned loss in an approximated max search.

A general inexact algorithm for CR
If the loss function can be factorized over tree edges (see Equation 3), the max-violating constraint in Line 6 of Alg. 1 can be efficiently found by exact decoding, e.g., using Edmonds' algorithm as in Fernandes et al. (2014) or Kruskal's as in Yu and Joachims (2009). The candidate graph, by construction, does not contain cycles, and the inference by Edmonds' algorithm does technically the same as the "best-left-link" inference algorithm by Chang et al. (2012). This can be schematically represented as in Alg. 2. When we deal with $\Delta_\rho$, Alg. 2 can no longer be applied, as our new loss function is non-factorizable. Thus, we designed a greedy solution, Alg. 3, which still uses the spanning tree algorithm, although it is not guaranteed to deliver the max-violating constraint. However, finding even a suboptimal solution optimizing a more accurate loss function may achieve better performance both in terms of speed and accuracy.

Algorithm 3 Inexact Inference of a Max-violating Spanning Tree with a Global Loss
1: Input: training example $(x, y)$; graph $G(x)$ with vertices $V$ denoting mentions; set of the incoming candidate edges, $E(v)$, $v \in V$; $w$, ground truth tree $h^*$
2: $\hat{h} \leftarrow \emptyset$
3: score $\leftarrow 0$
4: repeat
5:   prev_score = score
6:   score = 0
7:   for $v \in V$ do
8:     $\hat{e} = \arg\max_{e \in E(v)}$
9:       $\langle w, e \rangle + C \times \Delta(y, h^*, \hat{h} \cup e)$
10:    $\hat{h} = \hat{h} \cup \hat{e}$
11:    score = score + $\langle w, \hat{e} \rangle$
12:  end for
13:  score = score + $\Delta(y, h^*, \hat{h})$
14: until score = prev_score
15: return max-violating tree $\hat{h}$

We reformulate Step 4 of Alg. 2, where a max-violating incoming edge $\hat{e}$ is identified for a vertex $v$. The new max-violating inference objective now contains a global loss measured on the partial structure $\hat{h}$ built up to now plus a candidate edge $e$ for the vertex $v$ under consideration (Line 10 of Alg. 3). On a high level, this resembles the inference procedure of Wiseman et al. (2016), who use it for optimizing global features coming from an RNN. Differently though, after processing all the vertices, we repeat the procedure until the score of $\hat{h}$ no longer improves.
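The two decoders discussed above can be sketched side by side. The code is an illustration under stated assumptions, not the authors' implementation: edges are hashable pairs with precomputed feature vectors, the global loss is a callable over a list of edges (the gold information is assumed to be captured inside it), a vertex's previously chosen edge is replaced on later passes, and `max_rounds` is a safeguard added here. All names are hypothetical.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Edge = Hashable


def dot(w: Sequence[float], phi: Sequence[float]) -> float:
    return sum(wi * pi for wi, pi in zip(w, phi))


def greedy_max_violating(vertices: Sequence[int], candidates: Dict[int, List[Edge]],
                         edge_features: Dict[Edge, Sequence[float]],
                         w: Sequence[float], C: float,
                         edge_loss: Callable[[Edge], float]) -> List[Edge]:
    """Factorizable case: pick the best loss-augmented incoming edge per vertex."""
    return [max(candidates[v],
                key=lambda e: dot(w, edge_features[e]) + C * edge_loss(e))
            for v in vertices]


def inexact_max_violating(vertices: Sequence[int], candidates: Dict[int, List[Edge]],
                          edge_features: Dict[Edge, Sequence[float]],
                          w: Sequence[float], C: float,
                          global_loss: Callable[[List[Edge]], float],
                          max_rounds: int = 10) -> List[Edge]:
    """Non-factorizable case: greedy passes with a loss on the partial structure."""
    chosen: Dict[int, Edge] = {}  # vertex -> currently selected incoming edge
    score = 0.0
    for _ in range(max_rounds):
        prev_score, score = score, 0.0
        for v in vertices:
            partial = [e for u, e in chosen.items() if u != v]
            best = max(candidates[v],
                       key=lambda e: dot(w, edge_features[e])
                       + C * global_loss(partial + [e]))
            chosen[v] = best
            score += dot(w, edge_features[best])
        score += global_loss(list(chosen.values()))
        if score == prev_score:  # stop when another pass changes nothing
            break
    return [chosen[v] for v in vertices]
```

With a loss that simply counts wrong edges, both functions return the same tree; they differ once the loss depends on the structure as a whole, as a learned MELA approximation does.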
Note that Björkelund and Kuhn (2014) perform inexact search on the same latent tree structures to extend the model to non-local features.In contrast to our approach, they use beam search and accumulate the early updates.
In addition to the design of an algorithm enabling the use of our $\Delta_\rho$, there are other intricacies caused by the lack of factorization that need to be taken into account (see the next section).

Approaching factorization properties
The $\Delta_\rho$ defined by Equation 4 approximately falls into the interval $[0, 100]$. However, the simple optimal loss functions, $\Delta_F$ and $\Delta_{YJ}$, output a value dependent on the size of the input training document in terms of edges (as they factorize in terms of edges). Since this property cannot be learned from MELA by our regression algorithm, we calibrate our loss with respect to the number of correctly predicted mentions, $c$, in that document, obtaining $\Delta'_\rho = \frac{c}{100} \Delta_\rho$. Finally, another important issue is that, as we incrementally construct a max-violating tree according to Alg. 3, $\Delta_\rho$ decreases (and MELA grows) as we add more mentions to the output while traversing the tree nodes $v$. Thus, to equalize the contribution of the loss among the candidate edges of different nodes, we also scale the loss of the candidate edges of the node $v$ having order $i$ in the document, according to the formula $\Delta''_\rho = \frac{i}{|V|} \Delta'_\rho$. This can be interpreted as giving more weight to the hard-to-classify instances, an important issue alleviated by Zhao and Ng (2010). Towards the end of the document, the probability of correctly predicting an incoming edge for a node generally decreases, as the number of hypotheses increases.
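The two scaling steps can be written as a small helper. It assumes that the two factors are simply multiplied, i.e., that the position-dependent scaling is applied on top of the calibrated loss; the function and parameter names are hypothetical.

```python
def scaled_loss(delta: float, correct_mentions: int,
                node_order: int, num_nodes: int) -> float:
    """Calibrate a [0, 100]-range loss by document size and node position.

    delta            -- learned loss value, roughly in [0, 100]
    correct_mentions -- c, number of correctly predicted mentions in the document
    node_order       -- i, 1-based position of the current node v in the document
    num_nodes        -- |V|, number of mention nodes in the document
    """
    calibrated = correct_mentions / 100.0 * delta  # document-size calibration
    return node_order / num_nodes * calibrated     # position-dependent scaling
```

For a document with 50 correctly detected mentions and 4 nodes, a raw loss of 30 at the second node becomes `scaled_loss(30.0, 50, 2, 4) = 7.5`; the same raw loss at the last node counts twice as much.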

Experiments
In our experiments, we first show that our regressor for learning MELA approximates it rather accurately. Then, we examine the impact of our $\Delta_\rho$ on state-of-the-art systems in comparison with other loss functions. Finally, we show that the impact of our model is amplified when learning in smaller feature spaces.

Setup
Data We conducted our experiments on the English and Arabic parts of the corpus from the CoNLL-2012 Shared Task. The English data contains 2,802, 343, and 348 documents in the training, dev. and test parts, respectively. The Arabic data includes 359, 44, and 44 documents for the training, dev. and test sets, respectively.

Models
We implement our version of LSP, where LSP$_F$, LSP$_{YJ}$, and LSP$_\rho$ use the loss functions $\Delta_F$, $\Delta_{YJ}$, and $\Delta_\rho$, defined in Sections 3.3 and 5.2, respectively. We used cort, the coreference toolkit by Martschat and Strube (2015), both to preprocess the English data and to extract candidate mentions and features (the basic set). For Arabic, we used mentions and features from BART (Uryupina et al., 2012). We extended the initial feature set for Arabic with the feature combinations proposed by Durrett and Klein (2013), limited to those permitted by the available initial features.
Parametrization All the perceptron models require tuning of a regularization parameter $C$; LSP$_F$ and LSP$_{YJ}$ also require tuning of a specific loss parameter $r$. We select the parameters on the entire dev. set by training on 100 random documents from the training set. We choose $C \in \{1.0, 100.0, 1000.0, 2000.0\}$, the $r$ values for LSP$_F$ from the interval $[0.5, 2.5]$ with step 0.5, and the $r$ values for LSP$_{YJ}$ from $\{0.05, 0.1, 0.5\}$. Ultimately, for English, we used $C = 1000.0$ in all the models, $r = 1.0$ in LSP$_F$, and $r = 0.1$ in LSP$_{YJ}$. For Arabic, wider ranges of parameter values were considered due to the lower mention detection rate: $C = 1000.0$, $r = 6.0$ for LSP$_F$; $C = 1000.0$, $r = 0.01$ for LSP$_{YJ}$; and $C = 5000.0$ for LSP$_\rho$. A standard setting in previous work for the number of epochs $T$ of LSP is 5 (Martschat and Strube, 2015). Fernandes et al. (2014) noted that $T = 50$ was sufficient for convergence. We selected the best $T$ from 1 to 50 on the dev. set.
Evaluation measure We used MUC, B$^3$, CEAF$_e$ and their average MELA for evaluation, computed by version 8 of the official CoNLL scorer.

Learning loss functions
For learning MELA, we generated training and test examples from LSP$_F$ according to the procedure described in Section 4.3. In the first experiment, we trained the $w_\rho$ model on a set of examples $S_1$, generated from a sample of 100 English documents, and tested it on a set of examples $S_2$, generated from another sample of the same size, and vice versa. The results in Table 1 show that, with just 5,000/6,000 examples, the Mean Squared Error (MSE) is roughly between 2.4 and 2.7: these are rather small numbers considering that the regression output values lie in the interval $[0, 100]$. The Squared Correlation Coefficient (SCC) reaches about 99.7%, demonstrating that our regression approach is effective in estimating MELA.
Additionally, Figure 2 shows the regression learning curves evaluated with MSE and SCC. The former rapidly decreases and, with about 1,000 examples, reaches a plateau of around 2.3. The latter shows a similar behaviour, approaching a correlation of about 99.8% with the real MELA.
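The two regression quality figures can be computed as in the sketch below. It assumes that SCC denotes the squared Pearson correlation between predicted and true values, which is how the squared correlation coefficient is commonly reported by SVM regression tools; the function names are hypothetical.

```python
from typing import Sequence


def mse(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Mean squared error between predictions and true values."""
    return sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold)


def scc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Squared Pearson correlation coefficient (0.0 if either side is constant)."""
    n = len(gold)
    mean_p, mean_g = sum(pred) / n, sum(gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_g = sum((g - mean_g) ** 2 for g in gold)
    if var_p == 0 or var_g == 0:
        return 0.0
    return cov * cov / (var_p * var_g)
```

Note that the two figures capture different things: predictions that are perfectly correlated with the true MELA but systematically shifted have an SCC of 1 and a non-zero MSE.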

State of the art and model comparison
We first experimented with the standard CoNLL setting to compare the LSP accuracy in terms of MELA using the three different loss functions, i.e., LSP$_F$, LSP$_{YJ}$ and LSP$_\rho$. In particular, we used all the documents of the training set and all $N \sim 16.8$M features from cort, and tested on both the dev. and test sets. The results are reported in Columns All of Table 2.
We note first that our $\Delta_\rho$ is effective, as it stays on a par with $\Delta_F$ and $\Delta_{YJ}$ on the dev. set. This is interesting, as Corollary 1 shows that such functions can optimize MELA; the reported values refer to the optimal epoch numbers. Also, LSP$_\rho$ improves on the other models on the test set by 0.3 percentage points (statistically significant at the 93% level of confidence). Secondly, all three models improve the state of the art on CR using LSP, i.e., by Martschat and Strube (2015) using antecedent trees (M&S AT) or mention ranking (M&S MR), Björkelund and Kuhn (2014) using a global feature model (B&K) and Fernandes et al. (2014) (Fer). Note that all the LSP models were trained on the training set only, without retraining on the training and dev. sets together; thus, our scores can be improved.
Thirdly, Table 3 shows the breakdown of the MELA results in terms of its components on the test set. Interestingly, LSP$_\rho$ is noticeably better in terms of B$^3$ and CEAF$_e$, while LSP with simple losses, as expected, delivers a higher MUC score.
Finally, the overall improvement of $\Delta_\rho$ is not impressive. This mainly depends on the optimality of the competing loss functions, which, in a setting of $\sim 16.8$M features, satisfy the separability condition of Proposition 1.

Learning in more challenging conditions
In these experiments, we verify the hypothesis that, when the optimality property is partially or totally missing, $\Delta_\rho$ is more visibly superior to $\Delta_F$ and $\Delta_{YJ}$. As we do not want to degrade their effectiveness, the only condition we can vary through the setting is data separability: the data must be inseparable or at least harder to separate. These conditions can be obtained by reducing the size of the feature space. However, since we aim at testing conditions where $\Delta_\rho$ is practically useful, we filter out less important features, preserving the model accuracy (at least when the selection is not extremely harsh). For this purpose, we use a feature selection approach based on a basic binary classifier trained to discriminate between correct and incorrect mention pairs. It is typically used in non-structured CR methods and has the nice property of using the same features as LSP (we do not use global features in our study). We carried out the selection by ranking features according to the absolute values of the classifier's model weights and then selecting those having a higher rank (Haponchyk and Moschitti, 2017).
The MELA produced by our models using all the training data is presented in Figure 3. The first 7 plots show learning curves in terms of LSP epochs for different feature sets with increasing size $N$, evaluated on the dev. set. We note that, firstly, the fewer features are available, the better the LSP$_\rho$ curves are than those of LSP$_F$ and LSP$_{YJ}$ in terms of accuracy and convergence speed. The intuition is that finding a separation of the training set (generalizing well) becomes more challenging (e.g., with 10k features, the data is not linearly separable); thus, a loss function that is closer to the real measure provides some advantages.
Secondly, when using all features, LSP$_\rho$ is still overall better than the other models, but clearly the latter can achieve the same MELA on the dev. set.
Thirdly, the last plot shows the MELA produced by LSP models on the test set, when trained with the best epoch derived from the dev. set (previous plots). We observe that LSP$_\rho$ is consistently better than the other models, though its advantage decreases as the number of features increases.
Next, in Column 1 (Selected) of Table 2, we report the model MELA using 1 million features. We note that LSP$_\rho$ improves on the other models by at least 0.6 percentage points, achieving the same accuracy as the best of its competitors, i.e., LSP$_F$, using all the features.
Finally, $\Delta_\rho$ does not satisfy Proposition 1; therefore, generally, we do not know if it can optimize any $\mu$-type measure over graphs. However, being learned to optimize MELA, it clearly separates data maximizing such a measure. We empirically verified this by checking the MELA score obtained on the training set: we found that LSP$_\rho$ always optimizes MELA, iterating for fewer epochs than the other loss functions.
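The ranking step of the feature selection can be sketched as follows, assuming the weights of an already trained linear pairwise classifier are available as a plain list; the training of that classifier is not shown, and the function name is hypothetical.

```python
from typing import List, Sequence


def select_top_features(weights: Sequence[float], n: int) -> List[int]:
    """Return the indices of the n features with the largest absolute weight."""
    ranked = sorted(range(len(weights)), key=lambda j: abs(weights[j]), reverse=True)
    return sorted(ranked[:n])  # keep the selected indices in their original order
```

For example, `select_top_features([0.1, -2.0, 0.5, 0.0], 2)` keeps features 1 and 2: a large negative weight is as informative as a large positive one.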

Generalization to other languages
Here, we test the effectiveness of the proposed method on Arabic using all available data and features. The results in Table 4 reveal a clear superiority of LSP$_\rho$ over the counterparts optimizing simple loss functions. They support the results of the previous section, as we had to deal with the insufficiency of the expert-based features for Arabic. In such a difficult case, LSP$_\rho$ was able to improve over LSP$_F$ by more than 4.7 points.
We also tested the loss model $w_\rho$ trained for the experiments on the English data (i.e., the setting All of Section 6.3) in LSP$_\rho$ on Arabic. This corresponds to the LSP$^{EN}_\rho$ model. Notably, it performs even better, by 1.5 points, than LSP$_\rho$ using a loss learned from Arabic examples. This suggests a nice property of data invariance of $\Delta_\rho$. The improvement delivered by the "English" $w_\rho$ is due to the fact that it was trained on richer data: (i) quantitatively, since it comes from almost 8 times more training documents in comparison to Arabic, and (ii) qualitatively, in the sense of diversity with respect to the RL target value. Indeed, the Arabic data is much less separable than the English data, and this prevents obtaining examples with higher MELA values.

Conclusions
In this paper, we studied the use of complex loss functions in structured prediction for CR. Given the scale of our investigation, we limited our study to LSP, which is in any case considered state of the art. We derived several findings: (i) for the first time, to our knowledge, we showed that a complex measure, such as MELA, can be learned by a linear regressor (RL) with high accuracy and effective generalization. (ii) The latter was essential for designing a new LSP based on inexact search and RL. (iii) We showed that an automatically learned loss can be optimized and provides state-of-the-art performance in a real setting, including thousands of documents and millions of features, such as the CoNLL-2012 Shared Task. (iv) We defined a property of optimal loss functions for CR, which shows that in separable cases, such losses are enough to reach the state of the art. However, as soon as separability becomes harder to achieve, simple loss functions lose optimality and RL becomes more accurate and faster. (v) Our MELA approximation provides a loss that is data invariant and, once learned, can be optimized in LSP on different datasets and in different languages.
Our study opens several future directions, ranging from defining algorithms based on automatically learned loss functions to learning more effective measures from expert examples.

Figure 1: Latent tree used for structural learning

Figure 3: Results of LSP models on the dev. set using different numbers of features, $N$. The last plot reports the MELA score on the test set of the models using the optimal number of epochs tuned on the dev. set.

Table 1: Accuracy of the loss regressor on two different sets of examples generated from different document samples.

Table 2: Results of our and previous work models evaluated on the dev. and test sets following the exact CoNLL-2012 English setting, using all training documents with All and 1M features. $T_{best}$ is evaluated on the dev. set.

Table 3: Results on the test set using the same setting as Table 2 and the measures composing MELA.

Table 4: Results of our and baseline models evaluated on the dev. and test sets following the exact CoNLL-2012 Arabic setting, using all training documents. $T_{best}$ is evaluated on the dev. set.