Learning What’s Easy: Fully Differentiable Neural Easy-First Taggers

We introduce a novel neural easy-first decoder that learns to solve sequence tagging tasks in a flexible order. In contrast to previous easy-first decoders, our models are end-to-end differentiable. The decoder iteratively updates a “sketch” of the predictions over the sequence. At its core is an attention mechanism that controls which parts of the input are strategically the best to process next. We present a new constrained softmax transformation that ensures the same cumulative attention to every word, and show how to efficiently evaluate and backpropagate over it. Our models compare favourably to BILSTM taggers on three sequence tagging tasks.


Introduction
In recent years, neural models have led to major advances in several structured NLP problems, including sequence tagging (Plank et al., 2016; Lample et al., 2016), sequence-to-sequence prediction (Sutskever et al., 2014), and sequence-to-tree prediction (Dyer et al., 2015). Part of the success comes from clever architectures such as (bidirectional) long short-term memories (BILSTMs; Hochreiter and Schmidhuber (1997); Graves et al. (2005)) and attention mechanisms (Bahdanau et al., 2015), which are able to select the pieces of context relevant for prediction.
A noticeable aspect of many of the systems above is that they typically decode from left to right, greedily or with a narrow beam. While this is computationally convenient and reminiscent of the way humans process spoken language, the combination of unidirectional decoding and greediness leads to error propagation and suboptimal classification performance. This can partly be mitigated by globally normalized models (Andor et al., 2016) and imitation learning (Daumé et al., 2009; Ross et al., 2011); however, these techniques still have a left-to-right bias. Easy-first decoders (Tsuruoka and Tsujii, 2005; Goldberg and Elhadad, 2010, §2) are an interesting alternative: instead of following a fixed decoding order, these methods schedule their own actions by preferring "easier" decisions over more difficult ones. A disadvantage is that these models are harder to learn, due to the factorial number of orderings leading to correct predictions. Usually, gradients are not backpropagated over this combinatorial latent space (Kiperwasser and Goldberg, 2016a), or a separate model is used to determine the easiest next move (Clark and Manning, 2016).
In this paper, we develop novel, fully differentiable, neural easy-first sequence taggers (§3). Instead of taking discrete actions, our decoders use an attention mechanism to decide (in a soft manner) which word to focus on for the next tagging decision. Our models are able to learn their own sense of "easiness": the words receiving focus may not be the ones the model is most confident about, but the best to avoid error propagation in the long run. To make sure that all words receive the same cumulative attention, we further contribute a new constrained softmax transformation (§4). This transformation extends the softmax by permitting upper bound constraints on the amount of probability a word can receive. We show how to evaluate this transformation and backpropagate its gradients.
We run experiments on three sequence tagging tasks: multilingual part-of-speech (POS) tagging, named entity recognition (NER), and word-level quality estimation (§5). We complement our findings with a visual analysis of the attention distributions.

Easy-First Decoders
The idea behind easy-first decoding is to perform "easier" and less risky decisions before committing to more difficult ones (Tsuruoka and Tsujii, 2005; Goldberg and Elhadad, 2010; Ma et al., 2013). Alg. 1 shows the overall procedure for a sequence tagging problem (the idea carries over to other structured problems). Let $x_{1:L}$ be an input sequence (e.g. words in a sentence) and $y_{1:L}$ be the corresponding tag sequence (e.g. their POS tags). The algorithm assigns tags one position $i$ at a time, maintaining a set $B$ of covered positions. It also maintains a set $S$ of pairs $(i, y_i)$, storing the tags that have already been predicted at those positions. We can regard this set as a sketch of the output sequence, built incrementally while the algorithm is executed. At each time step, the model computes a score $f(i, y_i; x_{1:L}, S)$ for each position $i \notin B$ and each candidate tag $y_i$, taking into account the current "sketch" $S$, which provides useful contextual information. The "easiest" position and the corresponding tag are then jointly obtained by maximizing this score (line 6). The algorithm terminates when all positions are covered.
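The procedure described above can be sketched in a few lines of Python. This is only an illustration of the greedy loop, not the authors' implementation: the function name `easy_first_decode`, the scorer `toy_score`, and the score table `local` are hypothetical, and the scorer stands in for $f(i, y_i; x_{1:L}, S)$.

```python
import numpy as np

def easy_first_decode(length, num_tags, score_fn):
    """Greedy easy-first decoding: repeatedly commit to the best (position, tag) pair."""
    covered = set()   # B: positions already tagged
    sketch = {}       # S: position -> predicted tag
    order = []        # order in which positions were tagged
    while len(covered) < length:
        best = None
        for i in range(length):
            if i in covered:
                continue
            for y in range(num_tags):
                s = score_fn(i, y, sketch)
                if best is None or s > best[0]:
                    best = (s, i, y)
        _, i, y = best            # joint argmax over uncovered positions and tags
        covered.add(i)
        sketch[i] = y
        order.append(i)
    return [sketch[i] for i in range(length)], order

# Hypothetical scorer: fixed local scores plus a bonus for agreeing with
# an already-tagged left neighbour (this is how the sketch gives context).
local = np.array([[0.1, 0.2], [2.0, 0.0], [0.3, 0.35]])

def toy_score(i, y, sketch):
    bonus = 0.5 if sketch.get(i - 1) == y else 0.0
    return local[i, y] + bonus

tags, order = easy_first_decode(3, 2, toy_score)
print(tags, order)   # [1, 0, 0] [1, 2, 0]
```

In this toy run the middle word is tagged first, and its tag then changes the decision for the last word (tag 0 instead of the locally preferred tag 1), which is the kind of contextual effect the sketch is meant to provide.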
Previous work has trained easy-first systems with variants of the perceptron algorithm (Goldberg and Elhadad, 2010; Ma et al., 2013) or with a gradient-based method (Kiperwasser and Goldberg, 2016a), but without backpropagating information about the best ordering chosen by the algorithm (only tag mistakes). In fact, doing so directly would be hard, since the space of possible orderings is combinatorial: the argmax in line 6 is not continuous, let alone differentiable. In the next section, we introduce a fully differentiable easy-first system that sidesteps this problem by working with a "continuous" space of actions.

Figure 1: A neural easy-first system applied to a POS tagging problem. Given the current input/sketch representation, an attention mechanism decides where to focus (see bar plot) and is used to generate the next sketch. Right: A sequence of sketches $(S^n)_{n=1}^N$ generated along the way.

Neural Easy-First Sequence Taggers
Let $\Delta^{L-1} := \{\alpha \in \mathbb{R}^L \mid \mathbf{1}^\top \alpha = 1, \alpha \ge 0\}$ be the probability simplex. Our neural easy-first decoders depart from Alg. 1 in the following key points:
• Instead of picking the position with the largest score at each step (line 6 in Alg. 1), we compute a (continuous) attention distribution $\alpha \in \Delta^{L-1}$ over word positions.
• Instead of a set of covered positions $B$, we maintain a (continuous) cumulative attention vector $\beta \in \mathbb{R}^L$ (ideally in $[0, 1]^L$) over the $L$ positions in the sequence.
• The sketch set $S$ is replaced by a sketch matrix $S \in \mathbb{R}^{D_s \times L}$, whose columns are $D_s$-dimensional vector representations of the output labels to be predicted.
The high-level procedure is shown in Figure 1. We describe two models that implement this procedure: a single-state model and a full-state model. They differ in the way they update the sketch matrix: the single-state model applies a rank-one update, while the full-state model does a full update.

Single-State Model (NEF-S)
Let $\mathrm{concat}(x_1, \ldots, x_K) \in \mathbb{R}^{\sum_{k=1}^K D_k}$ be the concatenation of the vectors $x_k \in \mathbb{R}^{D_k}$. We use the shorthand $\mathrm{affine}(x) := Wx + b$ to denote an affine transformation of $x$, where $W$ is a weight matrix and $b$ is a bias vector.
As stated above, our algorithm maintains a cumulative attention vector $\beta \in \mathbb{R}^L$ and a sketch matrix $S \in \mathbb{R}^{D_s \times L}$, both with all entries initialized to zero. It then performs $N$ sketching steps, which progressively refine this sketch matrix, producing versions $S^1, \ldots, S^N$. At the $n$th step, the following operations are performed:

Input-Sketch Contextual Representation. For each word $i \in [L]$, we compute a state $c_i^n$ summarizing the surrounding local information about other words and sketches. We use a simple vector concatenation over a $w$-wide context window:

$$c_i^n = \mathrm{concat}(h_{i-w}, \ldots, h_{i+w}, s_{i-w}^{n-1}, \ldots, s_{i+w}^{n-1}), \tag{1}$$

where we denote by $h_j$ the encoder state of the $j$th word and by $s_j^{n-1}$ the $j$th column of $S^{n-1}$. The intuition is that the current sketches provide valuable information about the neighboring words' predictions that can influence the prediction for the $i$th word. In the vanilla easy-first algorithm, this was assumed in the score computation (line 4 of Alg. 1).
Attention Mechanism. We then use an attention mechanism to decide which word is the "best" to focus on next. This is done in a similar way to the feedforward attention proposed by Bahdanau et al. (2015). We first compute a score for each word $i \in [L]$ based on its contextual representation,

$$z_i^n = v^\top \tanh(\mathrm{affine}(c_i^n)), \tag{2}$$

where $v \in \mathbb{R}^{D_z}$ is a model parameter. Then, we aggregate these scores in a vector $z^n \in \mathbb{R}^L$ and apply a transformation $\rho$ to map them to a probability distribution $\alpha^n \in \Delta^{L-1}$ (optionally taking into account the past cumulative attention $\beta^{n-1}$):

$$\alpha^n = \rho(z^n; \beta^{n-1}). \tag{3}$$

The "standard" choice for $\rho$ is the softmax transformation. However, in this work, we consider other possible transformations (to be described in §4). After this, the cumulative attention is updated via $\beta^n = \beta^{n-1} + \alpha^n$.
Sketch Generation. Now that we have a distribution $\alpha^n \in \Delta^{L-1}$ over word positions, it remains to generate a sketch for those words. We first compute a single-state vector representation of the entire sentence, $\bar{c}^n = \sum_{i=1}^L \alpha_i^n c_i^n$, as the weighted average of the word states defined in Eq. 1. Then, we update each column of the sketch matrix as:

$$s_i^n = s_i^{n-1} + \alpha_i^n \tanh(\mathrm{affine}(\bar{c}^n)), \quad \forall i \in [L]. \tag{4}$$

The intuition for this update is the following: in the extreme case where the attention distribution is peaked on a single word (say, the $k$th word, $\alpha^n = e_k$), we obtain $\bar{c}^n = c_k^n$ and the sketch update only affects that word, i.e.,

$$s_i^n = \begin{cases} s_i^{n-1} + \tanh(\mathrm{affine}(c_k^n)) & \text{if } i = k, \\ s_i^{n-1} & \text{if } i \ne k. \end{cases} \tag{5}$$

This is similar to the sketch update in the original easy-first algorithm (line 7 in Alg. 1).
The three operations above are repeated for $N$ "sketch steps". The standard choice is $N = L$ (one step per word), which mimics the vanilla easy-first algorithm. However, it is possible to have fewer steps (or more, if we want the decoder to be able to "self-correct"). After completing the $N$ sketch steps, we obtain the final sketch matrix $S^N = [s_1^N, \ldots, s_L^N]$. Then, we compute a tag probability for every word as follows:

$$p_i = \mathrm{softmax}(\mathrm{affine}(\mathrm{concat}(h_i, s_i^N))), \quad \forall i \in [L]. \tag{6}$$

In §5, we compare this to a BILSTM tagger, which predicts according to $p_i = \mathrm{softmax}(\mathrm{affine}(h_i))$.
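To make the three operations concrete, here is a minimal NumPy sketch of a single-state forward pass with plain softmax attention. It is an illustration under stated assumptions rather than the authors' DyNet code: the parameter names (`Wz`, `bz`, `v`, `Ws`, `bs`, `Wo`, `bo`), the tanh nonlinearities, the zero padding at sentence borders, and the random toy dimensions are all assumptions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context_states(H, S, w):
    """Eq. 1: stack encoder states and sketches over positions i-w..i+w (zero-padded borders)."""
    L = H.shape[1]
    X = np.pad(np.vstack([H, S]), ((0, 0), (w, w)))   # (Dh + Ds) x (L + 2w)
    return np.stack([X[:, i:i + 2 * w + 1].ravel() for i in range(L)], axis=1)

def nef_s_forward(H, p, w, n_steps, rho=lambda z, beta: softmax(z)):
    L = H.shape[1]
    S = np.zeros((p["Ws"].shape[0], L))   # sketch matrix, initialised to zero
    beta = np.zeros(L)                    # cumulative attention
    for _ in range(n_steps):
        C = context_states(H, S, w)                              # one column per word
        z = p["v"] @ np.tanh(p["Wz"] @ C + p["bz"][:, None])     # attention scores
        alpha = rho(z, beta)                                     # attention distribution
        beta = beta + alpha
        s_bar = np.tanh(p["Ws"] @ (C @ alpha) + p["bs"])         # single sentence-level state
        S = S + np.outer(s_bar, alpha)                           # rank-one sketch update
    logits = p["Wo"] @ np.vstack([H, S]) + p["bo"][:, None]      # concat(h_i, s_i^N)
    P = np.exp(logits - logits.max(axis=0))
    return P / P.sum(axis=0), beta

rng = np.random.default_rng(0)
Dh, Ds, Dz, K, L, w = 6, 4, 5, 3, 7, 2
Dc = (Dh + Ds) * (2 * w + 1)
p = {"Wz": rng.normal(size=(Dz, Dc)), "bz": np.zeros(Dz), "v": rng.normal(size=Dz),
     "Ws": rng.normal(size=(Ds, Dc)), "bs": np.zeros(Ds),
     "Wo": rng.normal(size=(K, Dh + Ds)), "bo": np.zeros(K)}
H = rng.normal(size=(Dh, L))              # stand-in for BILSTM encoder states
P, beta = nef_s_forward(H, p, w=w, n_steps=L)
```

Each column of `P` is a tag distribution for one word, and `beta` sums to the number of sketch steps because every attention vector sums to one. Passing a different `rho` (one that actually uses `beta`) is where a constrained transformation would plug in.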

Full-State Model (NEF-F)
The full-state model differs from the single-state model in §3.1 by computing a full matrix for every sketch step, instead of a single vector. Namely, instead of Eq. 4, it does the following sketch update for every word $i \in [L]$:

$$s_i^n = s_i^{n-1} + \alpha_i^n \tanh(\mathrm{affine}(c_i^n)). \tag{7}$$

Note that the only difference is in replacing the single vector $\bar{c}^n$ by the word-specific vector $c_i^n$. As a result, this is no longer a rank-one update of the sketch matrix, but a full update. In the extreme case where the attention is peaked on a single word, the sketch update reduces to the same form as in the single-state model (Eq. 5). However, the full-state model is generally more flexible and allows processing words in parallel, since it allows different sketch updates for multiple words receiving attention, instead of trying to force those words to receive the same update. We will see in the experiments (§5) that this flexibility can be important in practice.
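The difference between the two updates can be checked numerically. The sketch below assumes a tanh nonlinearity in both updates and uses hypothetical function and variable names; it compares the rank of the change applied to the sketch matrix and confirms that the two updates coincide when the attention is one-hot.

```python
import numpy as np

def sketch_update_single(S, C, alpha, Ws, bs):
    """Single-state update: every column moves along the same vector, scaled by alpha_i."""
    c_bar = C @ alpha
    return S + np.outer(np.tanh(Ws @ c_bar + bs), alpha)

def sketch_update_full(S, C, alpha, Ws, bs):
    """Full-state update: column i moves along its own vector, scaled by alpha_i."""
    return S + np.tanh(Ws @ C + bs[:, None]) * alpha

rng = np.random.default_rng(1)
Ds, Dc, L = 4, 10, 6
S = np.zeros((Ds, L))
C = rng.normal(size=(Dc, L))
Ws = 0.1 * rng.normal(size=(Ds, Dc))
bs = rng.normal(size=Ds)

one_hot = np.eye(L)[2]            # attention peaked on a single word
uniform = np.full(L, 1.0 / L)     # attention spread over all words

same_when_peaked = np.allclose(sketch_update_single(S, C, one_hot, Ws, bs),
                               sketch_update_full(S, C, one_hot, Ws, bs))
rank_single = np.linalg.matrix_rank(sketch_update_single(S, C, uniform, Ws, bs))
rank_full = np.linalg.matrix_rank(sketch_update_full(S, C, uniform, Ws, bs))
```

With spread-out attention, `rank_single` is 1 while `rank_full` is larger, which is the sense in which the full-state model can give different updates to several words in the same step.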

Computational Complexity
For both models, assuming that the $\rho(z; \beta)$ transformation in Eq. 3 takes $O(L)$ time to compute, the total runtime of Alg. 2 is $O((N + K)L)$ (where $K$ is the number of tags), which becomes $O(L^2)$ if $K \le N = L$. This is so because the input-sketch representation, the attention mechanism, and the sketch generation step all have $O(L)$ complexity, and the final softmax layer requires $O(KL)$ operations. This is the same runtime as the vanilla easy-first algorithm, though the latter can be reduced to $O(KL \log L)$ with caching and a heap, if the scores in line 4 depend only on local sketches (Goldberg and Elhadad, 2010). By comparison, a standard BILSTM tagger has runtime $O(KL)$.

Constrained Softmax Attention
An important part of our models is their attention component (line 7 in Alg. 2). To keep the "easy-first" intuition, we would like the transformation $\rho$ in Eq. 3 to have a couple of properties:
1. Sparsity: being able to generate sparse distributions $\alpha^n$ (ideally, peaked on a single word).
2. Evenness: over the sketch steps, spreading the cumulative attention evenly over the words (ideally, $\beta^N = \mathbf{1}$).
The standard choice for attention mechanisms is the softmax transformation, $\alpha^n = \mathrm{softmax}(z^n)$. However, the softmax does not satisfy either of the properties above. For the first requirement, we could incorporate a "temperature" parameter in the softmax to push for more peaked distributions. However, this does not guarantee sparsity (only "hard attention" in the limit) and we found it numerically unstable when plugged into Alg. 2. For the second one, we could add a penalty before the softmax transformation, $\alpha^n = \mathrm{softmax}(z^n - \lambda \beta^{n-1})$, where $\lambda \ge 0$ is a tunable hyperparameter. This strategy was found effective at preventing a word from receiving too much attention, but it made the model less accurate. An alternative is the sparsemax transformation (Martins and Astudillo, 2016):

$$\mathrm{sparsemax}(z) = \arg\min_{\alpha \in \Delta^{L-1}} \|\alpha - z\|^2. \tag{8}$$

The sparsemax maintains most of the appealing properties of the softmax (efficiency to evaluate and backpropagate), and it is able to generate truly sparse distributions. However, it still does not satisfy the "evenness" property. Instead, we propose a novel constrained softmax transformation that satisfies both requirements. It resembles the standard softmax, but it allows imposing hard constraints on the maximal probability assigned to each word. Let us start by writing the (standard) softmax in the following variational form:

$$\mathrm{softmax}(z) = \arg\min_{\alpha \in \Delta^{L-1}} \mathrm{KL}(\alpha \,\|\, \mathrm{softmax}(z)) = \arg\min_{\alpha \in \Delta^{L-1}} -H(\alpha) - z^\top \alpha, \tag{9}$$

where KL and H denote the Kullback-Leibler divergence and the entropy, respectively. Based on this observation, we define the constrained softmax transformation as follows:

$$\mathrm{csoftmax}(z; u) = \arg\min_{\alpha \in \Delta^{L-1}} \mathrm{KL}(\alpha \,\|\, \mathrm{softmax}(z)) \quad \text{s.t.} \quad \alpha \le u, \tag{10}$$

where $u \in \mathbb{R}^L$ is a vector of upper bounds. Note that, if $u \ge \mathbf{1}$, all constraints are loose and this reduces to the standard softmax; on the contrary, if $u \in \Delta^{L-1}$, they are tight and we must have $\alpha = u$ due to the normalization constraint. Thus, we propose the following for Eq. 3:

$$\alpha^n = \mathrm{csoftmax}(z^n; \mathbf{1} - \beta^{n-1}). \tag{11}$$

The constraints guarantee $\beta^n = \beta^{n-1} + \alpha^n \le \mathbf{1}$.

Algorithm 3: Constrained Softmax Forward.
Since $\mathbf{1}^\top \beta^N = \sum_{n=1}^N \mathbf{1}^\top \alpha^n = N$, they also ensure that $\beta^N = \mathbf{1}$ when $N = L$ (the $L$ entries of $\beta^N$ are each at most 1 and sum to $L$), hence the "evenness" property is fully satisfied. Intuitively, each word gets a credit of one unit of attention that is consumed during the execution of the algorithm. When this credit expires, all subsequent attention weights for that word will be zero.
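The "credit" intuition can be simulated with a small NumPy sketch. The `csoftmax` function below is a simple active-set variant written for illustration (it repeatedly clips entries that exceed their bound and renormalizes the rest); it is not the sort-based Alg. 3, and the random scores stand in for the attention scores a trained model would produce.

```python
import numpy as np

def csoftmax(z, u):
    """Constrained softmax: softmax(z) projected (in KL) onto {alpha in simplex, alpha <= u}.
    Assumes u >= 0 and sum(u) >= 1; otherwise the constraints are infeasible."""
    z = np.asarray(z, dtype=float)
    u = np.asarray(u, dtype=float)
    free = np.ones(len(z), dtype=bool)     # entries not (yet) clipped to their bound
    alpha = np.zeros(len(z))
    while free.any():
        mass = max(1.0 - u[~free].sum(), 0.0)      # probability left for the free entries
        ef = np.exp(z[free] - z[free].max())
        alpha[free] = mass * ef / ef.sum()
        violated = free & (alpha > u)
        if not violated.any():
            break
        alpha[violated] = u[violated]              # clip to the upper bound
        free &= ~violated
    return alpha

# One unit of attention credit per word, consumed over N = L sketch steps.
rng = np.random.default_rng(0)
L = 6
beta = np.zeros(L)
for n in range(L):
    z = rng.normal(size=L)                          # stand-in attention scores
    alpha = csoftmax(z, np.clip(1.0 - beta, 0.0, 1.0))
    beta += alpha
print(np.round(beta, 6))
```

After $L$ steps every entry of `beta` equals one up to floating-point error, regardless of the scores. The clipping of `1 - beta` to `[0, 1]` is only a guard against rounding.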
The next proposition shows how to evaluate the constrained softmax and compute its gradients.
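Proposition 1 Let $\alpha = \mathrm{csoftmax}(z; u)$, and define the set $A = \{i \in [L] \mid \alpha_i < u_i\}$ of the constraints in Eq. 10 that are met strictly. Then:
• Forward propagation. The solution of Eq. 10 can be written in closed form as

$$\alpha_i = \min\{\exp(z_i)/Z, \, u_i\}, \quad \text{where } Z = \frac{\sum_{i \in A} \exp(z_i)}{1 - \sum_{i \notin A} u_i}. \tag{12}$$

• Gradient backpropagation. Let $\mathcal{L}(\theta)$ be a loss function, $d\alpha = \nabla_\alpha \mathcal{L}(\theta)$ be the output gradient, and $dz = \nabla_z \mathcal{L}(\theta)$ and $du = \nabla_u \mathcal{L}(\theta)$ be the input gradients. Then, we have:

$$dz_i = \mathbb{1}(i \in A) \, \alpha_i (d\alpha_i - m), \qquad du_i = \mathbb{1}(i \notin A) \, (d\alpha_i - m), \tag{13}$$

where $m = (\sum_{i \in A} \alpha_i \, d\alpha_i) / (1 - \sum_{i \notin A} u_i)$.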
Algs. 3-4 turn the results in Prop. 1 into concrete procedures for evaluating csoftmax and for backpropagating its gradient. Their runtimes are respectively $O(L \log L)$ and $O(L)$.
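As a sanity check on the gradient expressions in Prop. 1, the following NumPy sketch implements them and compares against central finite differences. The forward pass is a compact active-set variant used only for this check (not Alg. 3), the loss is a hypothetical linear function of $\alpha$, and the code assumes at least one constraint is inactive so that $1 - \sum_{i \notin A} u_i > 0$.

```python
import numpy as np

def csoftmax(z, u):
    """Compact forward pass (active-set variant); assumes u >= 0 and sum(u) >= 1."""
    free = np.ones(len(z), dtype=bool)
    alpha = np.zeros(len(z))
    while free.any():
        mass = max(1.0 - u[~free].sum(), 0.0)
        ef = np.exp(z[free] - z[free].max())
        alpha[free] = mass * ef / ef.sum()
        violated = free & (alpha > u)
        if not violated.any():
            break
        alpha[violated] = u[violated]
        free &= ~violated
    return alpha

def csoftmax_backward(alpha, u, dalpha):
    """Gradients w.r.t. z and u from the output gradient dalpha (Prop. 1)."""
    A = alpha < u                                        # constraints met strictly
    m = (alpha[A] * dalpha[A]).sum() / (1.0 - u[~A].sum())
    dz = np.where(A, alpha * (dalpha - m), 0.0)
    du = np.where(A, 0.0, dalpha - m)
    return dz, du

def numeric_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for j in range(len(x)):
        d = np.zeros_like(x)
        d[j] = eps
        g[j] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

z = np.array([0.5, -0.2, 1.0, 0.1])
u = np.array([1.0, 1.0, 0.3, 1.0])       # only the third bound is active here
wgt = np.array([0.3, -1.2, 0.7, 2.0])    # loss = wgt . alpha, hence dalpha = wgt

alpha = csoftmax(z, u)
dz, du = csoftmax_backward(alpha, u, wgt)
dz_num = numeric_grad(lambda t: wgt @ csoftmax(t, u), z)
du_num = numeric_grad(lambda t: wgt @ csoftmax(z, t), u)
```

The analytic and numerical gradients agree to within the finite-difference error. The backward pass is a single sweep over the entries, in line with the linear runtime stated above.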

Experiments
We evaluate our neural easy-first models on three sequence tagging tasks: POS tagging, NER, and word-level quality estimation.

Part-of-Speech Tagging
We ran POS tagging experiments in 12 languages from the Universal Dependencies project v1.4 (Nivre et al., 2016), using the standard splits. The datasets contain 17 universal tags. We implemented Alg. 2 in DyNet (Neubig et al., 2017), which we extended with the constrained softmax operator (Algs. 3-4). We used 64-dimensional word embeddings, initialized with pre-trained Polyglot vectors (Al-Rfou et al., 2013). Apart from the words, we embedded prefix and suffix character n-grams with n ≤ 4. We set the affix embedding size to 50 and summed all these embeddings; in the end, we obtained a 164-dimensional representation for each word (words, prefixes, and suffixes). We then fed these embeddings into a BILSTM (with 50 hidden units in each direction) to obtain the encoder states $[h_1, \ldots, h_L] \in \mathbb{R}^{100 \times L}$. The other hyperparameters were set as follows: we used a context size $w = 2$, set the pre-attention size $D_z$ and the sketch size $D_s$ to 50, and applied dropout with a probability of 0.2 after the embedding and BILSTM layers and before the final softmax output layer. We ran 20 epochs of Adagrad to minimize the cross-entropy loss, with a stepsize of 0.1 and gradient clipping of 5 (DyNet's default). We excluded from the training set sentences longer than 50 words.

Table 1 compares several variants of our neural easy-first system: the single-state model (NEF-S), the full-state model (NEF-F), and the latter with softmax, sparsemax, and csoftmax attention. We used as many sketch steps as the number of words, $N = L$. As baselines, we used:
• A feature-based linear model (TurboTagger, Martins et al. (2013)).
• A BILSTM tagger identical to our system, but without sketch steps ($N = 0$).
• A vanilla easy-first tagger (Alg. 1), using the argument of the softmax in Eq. 6 as the scoring function. This uses the same sketch representations as the neural easy-first systems, but replaces the attention mechanism by "hard" attention placed on the highest scored word.
For comparison, we also show the accuracies reported by Gillick et al. (2016) for their byte-to-span system (trained separately on each language) and by Plank et al. (2016) for their state-of-the-art multi-task BILSTM tagger (these results are not fully comparable though, due to different treebank versions).

Among the neural easy-first systems, we observe that NEF-F with csoftmax attention generally outperforms the others, but the differences are very slight (excluding the sparsemax attention system, which performed substantially worse). The NEF-F system with csoftmax wins over the linear system for all languages but Spanish, and over the BILSTM baseline for 9 out of 12 languages (it loses in Arabic and German, and ties in Japanese). Note, however, that the differences are small (95.47% against 95.39%, averaged across treebanks). We conjecture that this is due to the fact that the BILSTM already captures most of the relevant context in its encoder. Our NEF-F system with csoftmax also wins over the vanilla easy-first system for 10 out of 12 languages (arguably due to its ability to backpropagate the gradients through the soft attention mechanism), but the difference in the average score is again small (95.47% against 95.42%).

Figure 2 depicts some patterns learned by the NEF-F model with various attention types. With the csoftmax, the model learns to move left and right, and the main verb "thought" is the most prominent candidate for the easiest decision. In fact, in 57% of the test sentences, the model focuses first on a verb. The "raindrop" appearance of the plot is due to the evenness property of csoftmax, which causes the attention over a word to increase gradually until the cumulative attention is exhausted. This contrasts with the softmax attention (less diverse and non-sparse) and the sparsemax (sparse, but not even). We show for comparison the (hard) decisions made by the vanilla easy-first decoder.

Named Entity Recognition
Next, we applied our model to NER. We used the official datasets from the CoNLL 2002-3 shared tasks (Sang, 2002; Sang and De Meulder, 2003), which tag names, locations, and organizations using a BIO scheme, and cover four languages (Dutch, English, German, and Spanish). We ran two experiments: one using the exact same BILSTM and NEF models with a standard softmax output layer, as in §5.1 (which does not guarantee valid segmentations), and another one replacing the output softmax layer by a sequential CRF layer, which requires learning $O(K^2)$ additional parameters for pairs of consecutive tags (Huang et al., 2015; Lample et al., 2016). We used the same hyperparameters as in the POS tagging experiments, except the dropout probability, which was set to 0.3 (tuned on the validation set). For English, we used pre-trained 300-dimensional GloVe-840B embeddings (Pennington et al., 2014); for Spanish and German, we used the 64-dimensional word embeddings from Lample et al. (2016); for Dutch we used the aforementioned Polyglot vectors. All embeddings are fine-tuned during training. Since many words are not entities, and those receive a default "outside" tag, we expect that fewer sketch steps are necessary to achieve top performance.

Table 2 shows the results, which confirm this hypothesis. We compare the same BILSTM baseline to our NEF-S and NEF-F models with csoftmax attention (with and without the CRF output layer), varying the maximum number of sketch steps. We also compare against the byte-to-span model of Gillick et al. (2016) and the state-of-the-art character-based LSTM-CRF system of Lample et al. (2016).⁶ We can see that, for all languages, the NEF-CRF-F model with 5 steps is consistently better than the BILSTM-CRF and, with the exception of English, the NEF-CRF-S model. The same holds for the BILSTM and NEF-F models without the CRF output layer. With the exception of German, increasing the number of steps did not make a big difference.
Figure 3 shows the attention distributions over the sketch steps for an English sentence, for full-state models trained with $N \in \{5, L\}$. The model with $L$ sketch steps learned that it is easiest to focus on the beginning of a named entity, and then to move to the right to identify the full span. The model with only 5 sketch steps learns to go straight to the point, placing most attention on the entity words and ignoring most of the O-tokens.

Word-Level Quality Estimation
Finally, we evaluate our model's performance on word-level translation quality estimation. The goal is to evaluate a translation system's quality without access to reference translations (Blatz et al., 2004;Specia et al., 2013). Given a sentence pair (a source sentence and its machine translated sentence in a target language), a word-level system classifies each target word as OK or BAD. We used the official English-German dataset from the WMT16 shared task (Bojar et al., 2016).
This task differs from the previous ones in that its input is a sentence pair and not a single sentence. We replaced the affix embeddings by the concatenation of the 64-dimensional embeddings of the target words with those of the aligned source words (we used the alignments provided in the shared task), yielding 128-dimensional representations. We used the same hyperparameters as in the POS tagging task, except the dropout probability, set to 0.1. We followed prior work (Kreutzer et al., 2015) and upweighted the BAD words in the loss function to make the model more pessimistic; we used a weight of 5 (tuned on the validation set).

Table 3 shows the results. We see that all our NEF-S and NEF-F models outperform the BILSTM, and that the NEF-F model with 5 sketch steps achieved the best results.⁷ Figure 4 illustrates the attention over the target words for 5 sketches. We observe that the attention focuses early on the areas predicted BAD and moves left and right within these areas, not wasting attention on the OK part of the sentence. This block-wise focus makes sense for quality estimation, since often complete phrases are BAD.

⁶ ...eters, mixing character and word-based models, sharing a model across languages, or combining CRFs with convolutional and recurrent layers. We used simpler models in our experiments since our goal is to assess how much the neural easy-first systems can bring in addition to a BILSTM system, rather than building a state-of-the-art system.

⁷ Our best system would rank third in the shared task, out of 13 submissions. The winning system, which achieved 49.52 F1-MULT, was considerably more complex than ours, using an ensemble of three neural networks with a linear system (Martins et al., 2016).

Table 3: $F_1$-MULT scores (product of $F_1$ for OK and BAD words) for word-level quality estimation, computed by the official shared task script.
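For readers unfamiliar with the metric, the following sketch computes $F_1$-MULT as the product of the per-class $F_1$ scores, together with a BAD-upweighted negative log-likelihood. It is an illustration only, not the official shared task script: the function names are hypothetical, and the exact form of the weighted loss (a per-word weight of 5 on BAD words) is an assumption based on the description above.

```python
import math

def f1(gold, pred, label):
    """F1 score of one class, treating `label` as the positive class."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def f1_mult(gold, pred):
    """Product of the F1 scores for OK and BAD words."""
    return f1(gold, pred, "OK") * f1(gold, pred, "BAD")

def weighted_nll(p_bad, gold, bad_weight=5.0):
    """Negative log-likelihood where errors on BAD words cost `bad_weight` times more."""
    total = 0.0
    for p, g in zip(p_bad, gold):
        total += -bad_weight * math.log(p) if g == "BAD" else -math.log(1.0 - p)
    return total

gold = ["OK", "OK", "BAD", "BAD", "OK"]
pred = ["OK", "BAD", "BAD", "OK", "OK"]
score = f1_mult(gold, pred)   # F1(OK) = 2/3, F1(BAD) = 1/2, product = 1/3
```

Because the two $F_1$ scores are multiplied, a system that labels everything OK scores zero, which is one reason upweighting BAD words during training can help.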

Ablation Study
To better understand our proposed model, we carried out an ablation study for NER on the English dataset. The following alternate configurations were tried and compared against the NEF-CRF-F model with csoftmax attention and 5 sketch steps:
• A NEF-CRF-F model for which the final concatenation in Eq. 6 was removed, being replaced by $p_i = \mathrm{softmax}(\mathrm{affine}(s_i^N))$. The goal was to see if the sketches retain enough information about the input to make a final prediction without requiring the states $h_i$.
• A model for which the attention mechanism applied at each sketch step was replaced by a uniform distribution over the input words.
• A vanilla easy-first system (Alg. 1). Since this system can only focus on one word at a time (unlike the models with soft attention), we tried both $N = 5$ and $N = L$ sketch steps.
• A left-to-right and right-to-left model, which replaces the attention mechanism by one of these two prescribed orders.

Table 4 shows the results. As expected, the neural easy-first system was the best performing one, although the difference with respect to the ablated systems is relatively small. Removing the concatenation in Eq. 6 is harmful, which suggests that there is information about the input not retained in the sketches. The uniform attention performs surprisingly well, and so do the left-to-right and right-to-left models, but they are still about half a point behind. The vanilla easy-first system has the worst performance with $N = 5$. This is due to the fact that the vanilla model is incapable of processing words "in parallel" in the same sketch step, a disadvantage with respect to the neural easy-first models, which have this capability due to their soft attention mechanisms (see the top image in Fig. 3).

Related Work
Vanilla easy-first decoders have been used in POS tagging (Tsuruoka and Tsujii, 2005; Ma et al., 2013), dependency parsing (Goldberg and Elhadad, 2010), and coreference resolution (Stoyanov and Eisner, 2012), being related to cyclic dependency networks and guided learning (Toutanova et al., 2003; Shen et al., 2007). More recent works compute scores with a neural network (Socher et al., 2011; Clark and Manning, 2016; Kiperwasser and Goldberg, 2016a), but they still operate in a discrete space to pick the easiest actions (the non-differentiable argmax in line 6 of Alg. 1). Generalizing this idea to "continuous" operations is at the very core of our paper, allowing gradients to be fully backpropagated. In a different context, building differentiable computation structures has also been addressed by Graves et al. (2014) and Grefenstette et al. (2015).

An important contribution of our paper is the constrained softmax transformation. Others have proposed alternatives to softmax attention, including the sparsemax (Martins and Astudillo, 2016) and multi-focal attention (Globerson et al., 2016). The latter computes a KL projection onto a budget polytope to focus on multiple words. Our constrained softmax also corresponds to a KL projection, but (i) it involves box constraints instead of a budget, (ii) it is normalized to 1, and (iii) we also backpropagate the gradient over the constraint variables. It also achieves sparsity (see the "raindrop" plots in Figures 2-4), and is suitable for sequentially computing attention distributions when diversity is desired (e.g. soft 1-to-1 alignments). Recently, Chorowski and Jaitly (2016) developed a heuristic with a threshold on the total attention as a "coverage criterion" (see their Eq. 11); however, their heuristic is non-differentiable.
Our sketch generation step is similar in spirit to the "deep recurrent attentive writer" (DRAW, Gregor et al. (2015)) which generates images by iteratively refining sketches with a recurrent neural network (RNN). However, our goal is very different: instead of generating images, we generate vectors that lead to a final sequence tagging prediction.
Finally, the visualization provided in Figures 2-4 raises the question of how to understand and rationalize predictions by neural network systems, addressed by Lei et al. (2016). Their model, however, uses a form of stochastic attention and it does not perform any iterative refinement like ours.

Conclusions
We introduced novel fully differentiable easy-first taggers that learn to make predictions over sequences in an order that is adapted to the task at hand. The decoder iteratively updates a sketch of the predictions by interacting with an attention mechanism. To spread attention evenly through all words, we introduced a new constrained softmax transformation, along with an algorithm to backpropagate its gradients. Our neural easy-first decoder consistently outperformed a BILSTM on a range of sequence tagging tasks.
A natural direction for future work is to go beyond sequence tagging (which we regard as a simple first step) toward other NLP structured prediction problems, such as sequence-to-sequence prediction. This requires replacing the sketch matrix in Alg. 2 by a dynamic memory structure.

A Proof of Proposition 1
We provide here a detailed proof of Proposition 1.

A.1 Forward Propagation
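The optimization problem is

$$\mathrm{csoftmax}(z, u) = \arg\min_{\alpha} \; -H(\alpha) - z^\top \alpha \quad \text{s.t.} \quad \mathbf{1}^\top \alpha = 1, \;\; 0 \le \alpha \le u. \tag{14}$$

The Lagrangian function is:

$$\mathcal{L}(\alpha, \lambda, \mu, \nu) = -H(\alpha) - z^\top \alpha + \lambda (\mathbf{1}^\top \alpha - 1) - \mu^\top \alpha + \nu^\top (\alpha - u). \tag{15}$$

To obtain the solution, we invoke the Karush-Kuhn-Tucker conditions. From the stationarity condition, we have $0 = \log(\alpha) + \mathbf{1} - z + \lambda \mathbf{1} - \mu + \nu$, which due to the primal feasibility condition implies that the solution is of the form:

$$\alpha = \exp(z + \mu - \nu) / Z, \tag{16}$$

where $Z$ is a normalization constant. From the complementary slackness condition, we have that $0 < \alpha_i < u_i$ implies that $\mu_i = \nu_i = 0$ and therefore $\alpha_i = \exp(z_i)/Z$. On the other hand, $\nu_i > 0$ implies $\alpha_i = u_i$. Hence the solution can be written as $\alpha_i = \min\{\exp(z_i)/Z, u_i\}$, where $Z$ is determined such that the distribution normalizes:

$$Z = \frac{\sum_{i \in A} \exp(z_i)}{1 - \sum_{i \notin A} u_i},$$

with $A = \{i \in [L] \mid \alpha_i < u_i\}$.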
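A small worked instance may help fix ideas. The values of $z$ and $u$ below are chosen purely for illustration; they show how the closed form $\alpha_i = \min\{\exp(z_i)/Z, u_i\}$ redistributes the probability mass that a clipped entry cannot take.

```latex
\begin{align*}
z &= (0, 0, 0), \qquad u = (0.2, 1, 1), \\
\mathrm{softmax}(z) &= (1/3, 1/3, 1/3)
  \;\Rightarrow\; \text{only the bound } u_1 = 0.2 < 1/3 \text{ is active, so } A = \{2, 3\}, \\
Z &= \frac{\sum_{i \in A} \exp(z_i)}{1 - \sum_{i \notin A} u_i} = \frac{2}{1 - 0.2} = 2.5, \\
\alpha &= \big(\min\{0.4, 0.2\},\; \min\{0.4, 1\},\; \min\{0.4, 1\}\big) = (0.2, 0.4, 0.4).
\end{align*}
```

The first entry is clipped at its bound, and the remaining mass of 0.8 is shared by the other two entries in proportion to $\exp(z_i)$, so the result still sums to one.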

A.2 Gradient Backpropagation
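We now turn to the problem of backpropagating the gradients through the constrained softmax transformation. For that, we need to compute its Jacobian matrix, i.e., the derivatives $\partial \alpha_i / \partial z_j$ and $\partial \alpha_i / \partial u_j$ for $i, j \in [L]$. Let us first express $\alpha$ as

$$\alpha_i = \begin{cases} \dfrac{(1 - s) \exp(z_i)}{\sum_{j \in A} \exp(z_j)}, & i \in A, \\ u_i, & i \notin A, \end{cases} \tag{17}$$

where $s = \sum_{j \notin A} u_j$. Note that we have $\partial s / \partial z_j = 0$, $\forall j$, and $\partial s / \partial u_j = \mathbb{1}(j \notin A)$. To compute the entries of the Jacobian matrix, we need to consider several cases.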
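Case 1: $i \in A$. In this case, the evaluation of Eq. 17 goes through the first branch. Let us first compute the derivative with respect to $u_j$. Two things can happen: if $j \in A$, then $s$ does not depend on $u_j$, hence $\partial \alpha_i / \partial u_j = 0$. Otherwise, if $j \notin A$, we have $\partial \alpha_i / \partial u_j = -\exp(z_i) / \sum_{k \in A} \exp(z_k) = -\alpha_i / (1 - s)$. Let us now compute the derivative with respect to $z_j$. If $j \in A$ and $i = j$, we have $\partial \alpha_i / \partial z_i = \alpha_i (1 - \alpha_i / (1 - s))$. If $j \in A$ and $i \ne j$, we have $\partial \alpha_i / \partial z_j = -\alpha_i \alpha_j / (1 - s)$. Finally, if $j \notin A$, we have $\partial \alpha_i / \partial z_j = 0$.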
Case 2: $i \notin A$. In this case, the evaluation of Eq. 17 goes through the second branch, which means that $\partial \alpha_i / \partial z_j = 0$, always. Let us now compute the derivative with respect to $u_j$. This derivative is always zero unless $i = j$, in which case $\partial \alpha_i / \partial u_j = 1$.
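To sum up, we have:

$$\frac{\partial \alpha_i}{\partial z_j} = \begin{cases} \mathbb{1}(i = j) \, \alpha_i - \dfrac{\alpha_i \alpha_j}{1 - s}, & i, j \in A, \\ 0, & \text{otherwise}, \end{cases}$$

and

$$\frac{\partial \alpha_i}{\partial u_j} = \begin{cases} -\dfrac{\alpha_i}{1 - s}, & i \in A, \; j \notin A, \\ 1, & i = j \notin A, \\ 0, & \text{otherwise}. \end{cases}$$

Therefore, we obtain:

$$dz_j = \sum_{i} \frac{\partial \alpha_i}{\partial z_j} \, d\alpha_i = \mathbb{1}(j \in A) \, \alpha_j (d\alpha_j - m),$$

and

$$du_j = \sum_{i} \frac{\partial \alpha_i}{\partial u_j} \, d\alpha_i = \mathbb{1}(j \notin A) \, (d\alpha_j - m),$$

where $m = \dfrac{\sum_{i \in A} \alpha_i \, d\alpha_i}{1 - s}$.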