Backpropagating through Structured Argmax using a SPIGOT

We introduce structured projection of intermediate gradients (SPIGOT), a new method for backpropagating through neural networks that include hard-decision structured predictions (e.g., parsing) in intermediate layers. SPIGOT requires no marginal inference, unlike structured attention networks and reinforcement learning-inspired solutions. Like so-called straight-through estimators, SPIGOT defines gradient-like quantities associated with intermediate nondifferentiable operations, allowing backpropagation before and after them; SPIGOT’s proxy aims to ensure that, after a parameter update, the intermediate structure will remain well-formed. We experiment on two structured NLP pipelines: syntactic-then-semantic dependency parsing, and semantic parsing followed by sentiment classification. We show that training with SPIGOT leads to a larger improvement on the downstream task than a modularly-trained pipeline, the straight-through estimator, and structured attention, reaching a new state of the art on semantic dependency parsing.


Introduction
Learning methods for natural language processing are increasingly dominated by end-to-end differentiable functions that can be trained using gradient-based optimization. Yet traditional NLP often assumed modular stages of processing that formed a pipeline; e.g., text was tokenized, then tagged with parts of speech, then parsed into a phrase-structure or dependency tree, then semantically analyzed. Pipelines, which make "hard" (i.e., discrete) decisions at each stage, appear to be incompatible with neural learning, leading many researchers to abandon earlier-stage processing.
Inspired by findings that continue to see benefit from various kinds of linguistic or domain-specific preprocessing (He et al., 2017; Oepen et al., 2017; Ji and Smith, 2017), we argue that pipelines can be treated as layers in neural architectures for NLP tasks. Several solutions are readily available:
• Reinforcement learning (most notably the REINFORCE algorithm; Williams, 1992), and structured attention (SA; Kim et al., 2017). These methods replace argmax with a sampling or marginalization operation. We note two potential downsides of these approaches: (i) not all argmax-able operations have corresponding sampling or marginalization methods that are efficient, and (ii) inspection of intermediate outputs, which could benefit error analysis and system improvement, is more straightforward for hard decisions than for posteriors.
• The straight-through estimator (STE; Hinton, 2012) treats discrete decisions as if they were differentiable and simply passes through gradients. While fast and surprisingly effective, it ignores constraints on the argmax problem, such as the requirement that every word has exactly one syntactic parent. We will find, experimentally, that the quality of intermediate representations degrades substantially under STE.
This paper introduces a new method, the structured projection of intermediate gradients optimization technique (SPIGOT; §2), which defines a proxy for the gradient of a loss function with respect to the input to argmax. Unlike STE's gradient proxy, SPIGOT aims to respect the constraints in the argmax problem. SPIGOT can be applied with any intermediate layer that is expressible as a constrained maximization problem, and whose feasible set can be projected onto. We show empirically that SPIGOT works even when the maximization and the projection are done approximately.
We offer two concrete architectures that employ structured argmax as an intermediate layer: semantic parsing with syntactic parsing in the middle, and sentiment analysis with semantic parsing in the middle (§3). These architectures are trained using a joint objective, with one part using data for the intermediate task, and the other using data for the end task. The datasets are not assumed to overlap at all, but the parameters for the intermediate task are affected by both parts of the training data.
Our experiments (§4) show that our architecture improves over a state-of-the-art semantic dependency parser, and that SPIGOT offers stronger performance than a pipeline, SA, and STE. On sentiment classification, we show that semantic parsing offers improvement over a BiLSTM, more so with SPIGOT than with alternatives. Our analysis considers how the behavior of the intermediate parser is affected by the end task (§5). Our code is open-source and available at https://github.com/Noahs-ARK/SPIGOT.

Method
Our aim is to allow a (structured) argmax layer in a neural network to be treated almost like any other differentiable function. This would allow us to place, for example, a syntactic parser in the middle of a neural network, so that the forward calculation simply calls the parser and passes the parse tree to the next layer, which might derive syntactic features for the next stage of processing.
The challenge is in the backward computation, which is key to learning with standard gradient-based methods. When its output is discrete as we assume here, argmax is a piecewise constant function. At every point, its gradient is either zero or undefined. So instead of using the true gradient, we will introduce a proxy for the gradient of the loss function with respect to the inputs to argmax, allowing backpropagation to proceed through the argmax layer. Our proxy is designed as an improvement to earlier methods (discussed below) that completely ignore constraints on the argmax operation. It accomplishes this through a projection of the gradients.
We first lay out notation, and then briefly review max-decoding and its relaxation (§2.1). We define SPIGOT in §2.2, and show how to use it to backpropagate through NLP pipelines in §2.3.
Notation. Our discussion centers around two tasks: a structured intermediate task followed by an end task, where the latter considers the outputs of the former (e.g., syntactic-then-semantic parsing). Inputs are denoted as $x$, and end task outputs as $y$. We use $z$ to denote intermediate structures derived from $x$. We will often refer to the intermediate task as "decoding", in the structured prediction sense. It seeks an output $\hat{z} = \operatorname{argmax}_{z \in \mathcal{Z}} S$ from the feasible set $\mathcal{Z}$, maximizing a (learned, parameterized) scoring function $S$ for the structured intermediate task. $L$ denotes the loss of the end task, which may or may not also involve structured predictions. We use $\Delta^{k-1}$ to denote the $(k-1)$-dimensional simplex. We denote the domain of binary variables as $\mathbb{B} = \{0, 1\}$, and the unit interval as $\mathbb{U} = [0, 1]$. By projection of a vector $v$ onto a set $\mathcal{A}$, we mean the closest point in $\mathcal{A}$ to $v$, measured by Euclidean distance:
$$\operatorname{proj}_{\mathcal{A}}(v) = \operatorname*{argmin}_{v' \in \mathcal{A}} \lVert v' - v \rVert_2.$$

Relaxed Decoding
Decoding problems are typically decomposed into a collection of "parts", such as arcs in a dependency tree or graph. In such a setup, each element of $z$, $z_i$, corresponds to one possible part, and $z_i$ takes a boolean value to indicate whether the part is included in the output structure. The scoring function $S$ is assumed to decompose into a vector $s(x)$ of part-local, input-specific scores:
$$\hat{z} = \operatorname*{argmax}_{z \in \mathcal{Z}} z^\top s(x). \tag{1}$$
In the following, we drop $s$'s dependence on $x$ for clarity.
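As a minimal sketch of part-factored max-decoding, the snippet below scores every candidate structure as the dot product of its binary part vector with the part scores and returns the best one. The function name `decode`, the exhaustive search, and the toy feasible set are assumptions made for illustration; real decoders use dynamic programming or relaxations rather than enumeration.

```python
import numpy as np

def decode(scores, feasible_set):
    """Exhaustive max-decoding: return the z in feasible_set maximizing z . s."""
    Z = np.asarray(feasible_set, dtype=float)  # one row per candidate structure
    totals = Z @ scores                        # z^T s for every candidate
    return Z[int(np.argmax(totals))]

# Toy problem (hypothetical): three parts, every structure uses exactly two.
toy_feasible_set = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
toy_scores = np.array([0.5, -1.0, 2.0])

z_hat = decode(toy_scores, toy_feasible_set)
print(z_hat)  # the structure that selects parts 0 and 2
```

Because the output only changes when the ranking of candidate totals changes, small perturbations of the scores leave `z_hat` unchanged, which is the piecewise-constant behavior that motivates a gradient proxy.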
In many NLP problems, the output space $\mathcal{Z}$ can be specified by linear constraints (Roth and Yih, 2004):
$$A \begin{bmatrix} z \\ \psi \end{bmatrix} \le b, \tag{2}$$
where $\psi$ are auxiliary variables (also scoped by argmax), together with integer constraints (typically, each $z_i \in \mathbb{B}$).
The problem in Equation 1 can be NP-complete in general, so the $\{0, 1\}$ constraints are often relaxed to $[0, 1]$ to make decoding tractable (Martins et al., 2009). Then the discrete combinatorial problem over $\mathcal{Z}$ is transformed into the optimization of a linear objective over a convex polytope $\mathcal{P} = \{p \in \mathbb{R}^d \mid Ap \le b\}$, which is solvable in polynomial time (Bertsimas and Tsitsiklis, 1997). This is not necessary in some cases, where the argmax can be solved exactly with dynamic programming.

From STE to SPIGOT
We now view structured argmax as an activation function that takes a vector of input-specific part-scores $s$ and outputs a solution $\hat{z}$. For backpropagation, to calculate gradients for parameters of $s$, the chain rule defines:
$$\nabla_s L = J^\top \nabla_{\hat{z}} L, \tag{3}$$
where the Jacobian matrix $J = \partial \hat{z} / \partial s$ contains the derivative of each element of $\hat{z}$ with respect to each element of $s$. Unfortunately, argmax is a piecewise constant function, so its Jacobian is either zero (almost everywhere) or undefined (in the case of ties).
One solution, taken in structured attention, is to replace the argmax with marginal inference and a softmax function, so that $\hat{z}$ encodes probability distributions over parts (Kim et al., 2017; Liu and Lapata, 2018). As discussed in §1, there are two reasons to avoid this modification. Softmax can only be used when marginal inference is feasible, by sum-product algorithms for example (Eisner, 2016; Friesen and Domingos, 2016); in general marginal inference can be #P-complete. Further, a soft intermediate layer will be less amenable to inspection by anyone wishing to understand and improve the model.
In another line of work, argmax is augmented with a strongly-convex penalty on the solutions (Martins and Astudillo, 2016; Amos and Kolter, 2017; Niculae and Blondel, 2017; Niculae et al., 2018; Mensch and Blondel, 2018). However, their approaches require solving a relaxation even when exact decoding is tractable. Also, the penalty will bias the solutions found by the decoder, which may be an undesirable conflation of computational and modeling concerns.
A simpler solution is the STE method (Hinton, 2012), which replaces the Jacobian matrix in Equation 3 by the identity matrix. This method has been demonstrated to work well when used to "backpropagate" through hard threshold functions (Bengio et al., 2013; Friesen and Domingos, 2018) and categorical random variables (Jang et al., 2016; Choi et al., 2017).
Consider for a moment what we would do if $\hat{z}$ were a vector of parameters, rather than intermediate predictions. In this case, we are seeking points in $\mathcal{Z}$ that minimize $L$; denote that set of minimizers by $\mathcal{Z}^*$. Given $\nabla_{\hat{z}} L$ and step size $\eta$, we would update $\hat{z}$ to be $\hat{z} - \eta \nabla_{\hat{z}} L$. This update, however, might not return a value in the feasible set $\mathcal{Z}$, or even (if we are using a linear relaxation) the relaxed set $\mathcal{P}$.
SPIGOT therefore introduces a projection step that aims to keep the "updated" $\hat{z}$ in the feasible set. Of course, we do not directly update $\hat{z}$; we continue backpropagation through $s$ and onward to the parameters. But the projection step nonetheless alters the parameter updates in the way that our proxy for "$\nabla_s L$" is defined.
The procedure is defined as follows:
$$p = \hat{z} - \eta \nabla_{\hat{z}} L, \tag{4a}$$
$$\tilde{z} = \operatorname{proj}_{\mathcal{P}}(p), \tag{4b}$$
$$\nabla_s L \triangleq \hat{z} - \tilde{z}. \tag{4c}$$
First, the method makes an "update" to $\hat{z}$ as if it contained parameters (Equation 4a), letting $p$ denote the new value. Next, $p$ is projected back onto the (relaxed) feasible set (Equation 4b), yielding a feasible new value $\tilde{z}$. Finally, the gradients with respect to $s$ are computed by Equation 4c. Due to the convexity of $\mathcal{P}$, the projected point $\tilde{z}$ will always be unique, and is guaranteed to be no farther than $p$ from any point in $\mathcal{Z}^*$ (Luenberger and Ye, 2015). Compared to STE, SPIGOT involves a projection and limits $\nabla_s L$ to a smaller space to satisfy constraints. See Figure 1 for an illustration.
When efficient exact solutions (such as dynamic programming) are available, they can be used. Yet, we note that SPIGOT does not assume the argmax operation is solved exactly.
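The contrast between the two gradient proxies can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, and the relaxed feasible set is assumed to be the unit box $[0, 1]^d$ (keeping only the $[0, 1]$ constraints) so that the projection reduces to clipping.

```python
import numpy as np

def ste_proxy(grad_z):
    """Straight-through estimator: identity in place of the Jacobian."""
    return grad_z

def spigot_proxy(z_hat, grad_z, eta, project):
    """SPIGOT proxy: step, project back onto the relaxed set, take the difference."""
    p = z_hat - eta * grad_z   # "update" z_hat as if it were a parameter vector
    z_tilde = project(p)       # closest feasible point to p
    return z_hat - z_tilde     # proxy for the gradient w.r.t. the scores s

def project_box(p):
    """Projection onto [0, 1]^d (assumed relaxation for this toy example)."""
    return np.clip(p, 0.0, 1.0)

z_hat = np.array([1.0, 0.0, 0.0])

# Case 1: the step stays inside the box, so the proxy equals eta * grad_z.
g_inside = np.array([0.5, -0.25, 0.0])
inside = spigot_proxy(z_hat, g_inside, eta=1.0, project=project_box)

# Case 2: the step leaves the box and is pulled back to z_hat itself,
# so SPIGOT passes no gradient while STE passes g_outside unchanged.
g_outside = np.array([-1.0, 0.5, 0.0])
outside = spigot_proxy(z_hat, g_outside, eta=1.0, project=project_box)
```

In the second case the end task asks for changes that no feasible point can realize (a part value above 1 or below 0), which is the situation where the two proxies disagree most.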

Backpropagation through Pipelines
Using SPIGOT, we now devise an algorithm to "backpropagate" through NLP pipelines. In these pipelines, an intermediate task's output is fed into an end task for use as features. The parameters of the complete model are divided into two parts: denote the parameters of the intermediate task model by $\phi$ (used to calculate $s$), and those in the end task model as $\theta$. As introduced earlier, the end-task loss function to be minimized is $L$, which depends on both $\phi$ and $\theta$.
Algorithm 1 describes the forward and backward computations. It takes an end task training pair $\langle x, y \rangle$, along with the intermediate task's feasible set $\mathcal{Z}$, which is determined by $x$. It first runs the intermediate model and decodes to get intermediate structure $\hat{z}$, just as in a standard pipeline. Then forward propagation is continued into the end-task model to compute loss $L$, using $\hat{z}$ to define input features. Backpropagation in the end-task model computes $\nabla_\theta L$ and $\nabla_{\hat{z}} L$, and $\nabla_s L$ is then constructed using Equation 4. Backpropagation then continues into the intermediate model, computing $\nabla_\phi L$.
Due to its flexibility, SPIGOT is applicable to many training scenarios. When there is no $\langle x, z \rangle$ training data for the intermediate task, SPIGOT can be used to induce latent structures for the end task (Yogatama et al., 2017; Kim et al., 2017; Choi et al., 2017, inter alia). When intermediate-task training data is available, one can use SPIGOT to adopt joint learning by minimizing an interpolation of the end-task loss $L$ (on end-task data $\langle x, y \rangle$) and an intermediate-task loss function (on intermediate-task data $\langle x, z \rangle$). This is the setting in our experiments; note that we do not assume any overlap in the training examples for the two tasks.
Algorithm 1 Forward and backward computation with SPIGOT.
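The forward and backward computation described in the text can be sketched on a toy pipeline. Everything concrete here is an assumption made for illustration: the intermediate model is a linear scorer `phi @ x`, decoding picks a single part (one-hot argmax), the end-task model is a linear regressor with squared loss, and the projection clips to the unit box. The names `spigot_step`, `decode_one_hot`, and `project_box` are hypothetical.

```python
import numpy as np

def decode_one_hot(s):
    """Toy decoder: select the single highest-scoring part."""
    z = np.zeros_like(s)
    z[int(np.argmax(s))] = 1.0
    return z

def project_box(p):
    """Projection onto [0, 1]^d, an assumed relaxation of the feasible set."""
    return np.clip(p, 0.0, 1.0)

def spigot_step(x, y, phi, theta, eta):
    # Forward: intermediate scores, hard decoding, then the end-task model.
    s = phi @ x
    z_hat = decode_one_hot(s)
    y_pred = theta @ z_hat
    loss = 0.5 * (y_pred - y) ** 2

    # Backward through the end-task model (ordinary gradients).
    d_pred = y_pred - y
    grad_theta = d_pred * z_hat
    grad_z = d_pred * theta

    # Gradient proxy through the argmax: step, project, difference.
    p = z_hat - eta * grad_z
    z_tilde = project_box(p)
    grad_s = z_hat - z_tilde

    # Backward into the intermediate model: s = phi @ x.
    grad_phi = np.outer(grad_s, x)
    return loss, grad_phi, grad_theta

x = np.array([1.0, 2.0])
phi = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
theta = np.array([1.0, -1.0, 0.5])
loss, grad_phi, grad_theta = spigot_step(x, y=1.0, phi=phi, theta=theta, eta=0.25)
```

In this toy run the decoder selects part 1, the end task would have preferred parts 0 or 2, and a gradient-descent step on `phi` lowers the score of part 1 while raising the others.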
In early experiments we observe that for both tasks, projecting with respect to all constraints of their original formulations using a generic quadratic program solver was prohibitively slow. Therefore, we construct relaxed polytopes by considering only a subset of the constraints. The projection then decomposes into a series of singly constrained quadratic programs (QP), each of which can be efficiently solved in linear time.
The two approximate projections discussed here are used in backpropagation only. In the forward pass, we solve the decoding problem using the models' original decoding algorithms.
Arc-factored unlabeled dependency parsing. For unlabeled dependency trees, we impose $[0, 1]$ constraints and single-headedness constraints. Formally, given a length-$n$ input sentence, excluding self-loops, an arc-factored parser considers $d = n(n-1)$ candidate arcs. Let $i \to j$ denote an arc from the $i$th token to the $j$th, and $\sigma(i \to j)$ denote its index. We construct the relaxed feasible set by:
$$\mathcal{P}_{\text{DEP}} = \Big\{ p \in \mathbb{U}^d \;\Big|\; \sum_{i \neq j} p_{\sigma(i \to j)} = 1, \ \forall j \Big\}, \tag{5}$$
i.e., we consider each token $j$ individually, and force single-headedness by constraining the number of arcs incoming to $j$ to sum to 1. Algorithm 2 summarizes the procedure to project onto $\mathcal{P}_{\text{DEP}}$.
Line 3 forms a singly constrained QP, and can be solved in O(n) time (Brucker, 1984).
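A per-token projection of this kind can be sketched as follows. Two assumptions are worth flagging: arc values are stored in an `n × n` matrix with `p[i, j]` for arc i→j (rather than a flat vector indexed by σ), and the simplex projection uses the common sort-based method, which runs in O(n log n) per token instead of the linear-time algorithm of Brucker (1984). The function names are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]                  # values in decreasing order
    css = np.cumsum(u)
    ks = np.arange(1, len(v) + 1)
    # Largest k such that the k-th sorted value stays positive after shifting.
    rho = np.nonzero(u - (css - 1.0) / ks > 0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)    # shift that makes the result sum to 1
    return np.maximum(v - tau, 0.0)

def dep_proj(p):
    """Project arc values so that, for every token j, its incoming arcs
    are nonnegative and sum to 1. The diagonal (self-loops) is left at 0."""
    n = p.shape[0]
    z = np.zeros_like(p, dtype=float)
    for j in range(n):
        heads = [i for i in range(n) if i != j]   # candidate heads of token j
        z[heads, j] = project_simplex(p[heads, j])
    return z

p = np.array([[0.0, 0.9, 0.2],
              [0.7, 0.0, 0.2],
              [0.6, 0.4, 0.0]])
z_tilde = dep_proj(p)
```

Each column is handled independently, which is why the projection decomposes into one small problem per token.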
Algorithm 2 Projection onto the relaxed polytope $\mathcal{P}_{\text{DEP}}$ for dependency tree structures. Let bold $\sigma(\cdot \to j)$ denote the index set of arcs incoming to $j$. For a vector $v$, we use $v_{\sigma(\cdot \to j)}$ to denote vector $[v_k]_{k \in \sigma(\cdot \to j)}$.
1: procedure DEPPROJ($p$)
2:   for $j = 1, 2, \ldots, n$ do
3:     $\tilde{z}_{\sigma(\cdot \to j)} \leftarrow \operatorname{proj}_{\Delta^{n-2}}(p_{\sigma(\cdot \to j)})$
4:   end for
5:   return $\tilde{z}$
6: end procedure
First-order semantic dependency parsing. Semantic dependency parsing uses labeled bilexical dependencies to represent sentence-level semantics (Oepen et al., 2014, 2015, 2016). Each dependency is represented by a labeled directed arc from a head token to a modifier token, where the arc label encodes broadly applicable semantic relations. Figure 2 diagrams a semantic graph from the DELPH-IN MRS-derived dependencies (DM), together with a syntactic tree.
We use a state-of-the-art semantic dependency parser (Peng et al., 2017) that considers three types of parts: heads, unlabeled arcs, and labeled arcs. Let $\sigma(i \xrightarrow{\ell} j)$ denote the index of the arc from $i$ to $j$ with semantic role $\ell$. In addition to $[0, 1]$ constraints, we constrain that the predictions for labeled arcs sum to the prediction of their associated unlabeled arc:
$$\mathcal{P}_{\text{SDP}} = \Big\{ p \in \mathbb{U}^d \;\Big|\; \sum_{\ell} p_{\sigma(i \xrightarrow{\ell} j)} = p_{\sigma(i \to j)}, \ \forall i \neq j \Big\}. \tag{6}$$
This ensures that exactly one label is predicted if and only if its arc is present. The projection onto $\mathcal{P}_{\text{SDP}}$ can be solved similarly to Algorithm 2. We drop the determinism constraint imposed by Peng et al. (2017) in the backward computation.

Experiments
We empirically evaluate our method with two sets of experiments: using syntactic tree structures in semantic dependency parsing, and using semantic dependency graphs in sentiment classification.

Syntactic-then-Semantic Parsing
In this experiment we consider an intermediate syntactic parsing task, followed by semantic dependency parsing as the end task. We first briefly review the neural network architectures for the two models (§4.1.1), and then introduce the datasets (§4.1.2) and baselines (§4.1.3).

Architectures
Syntactic dependency parser. For intermediate syntactic dependencies, we use the unlabeled arc-factored parser of Kiperwasser and Goldberg (2016). It uses bidirectional LSTMs (BiLSTM) to encode the input, followed by a multilayer perceptron (MLP) to score each potential dependency. One notable modification is that we replace their use of Chu-Liu/Edmonds' algorithm (Chu and Liu, 1965; Edmonds, 1967) with the Eisner algorithm (Eisner, 1996, 2000), since our dataset is in English and mostly projective.
Martins et al., 2015).
To add syntactic features to NEURBOPARSER, we concatenate a token's contextualized representation to that of its syntactic head, predicted by the intermediate parser. Formally, given a length-$n$ input sentence, we first run a BiLSTM. We use the concatenation of the two hidden representations $h_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$ at each position $j$ as the contextualized token representations. We then concatenate $h_j$ with the representation of its head $h_{\text{HEAD}(j)}$ by
$$\tilde{h}_j = [h_j; h_{\text{HEAD}(j)}] = \Big[h_j; \sum_{i \neq j} \hat{z}_{\sigma(i \to j)} h_i\Big], \tag{7}$$
where $\hat{z} \in \mathbb{B}^{n(n-1)}$ is a binary encoding of the tree structure predicted by the intermediate parser.
We then use $\tilde{h}_j$ anywhere $h_j$ would have been used in NEURBOPARSER. In backpropagation, we compute $\nabla_{\hat{z}} L$ with an automatic differentiation toolkit (DyNet; Neubig et al., 2017).
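The head-feature concatenation can be sketched with a single matrix product. This is a minimal illustration under an assumed layout, not the DyNet implementation: token representations are rows of `H`, and the predicted tree is an `n × n` binary matrix `Z` with `Z[i, j] = 1` when token i is the head of token j. The function name is hypothetical.

```python
import numpy as np

def add_head_features(H, Z):
    """Concatenate each token's representation with that of its predicted head.

    H: (n, k) contextualized token representations.
    Z: (n, n) binary matrix, Z[i, j] = 1 iff token i is the head of token j.
    Returns an (n, 2k) matrix whose j-th row is [h_j ; sum_i Z[i, j] * h_i].
    """
    head_repr = Z.T @ H   # row j picks out the representation of j's head
    return np.concatenate([H, head_repr], axis=1)

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [2.0, 2.0]])
Z = np.zeros((3, 3))
Z[1, 0] = 1.0   # token 1 heads token 0
Z[2, 1] = 1.0   # token 2 heads token 1
Z[1, 2] = 1.0   # token 1 heads token 2 (toy values, not a well-formed tree)

H_tilde = add_head_features(H, Z)
```

Writing the head lookup as a sum weighted by the entries of `Z` makes the features linear in the structure encoding, so a gradient of the loss with respect to each entry of `Z` is well defined even though `Z` itself is discrete.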
We note that this approach can be generalized to convolutional neural networks over graphs (Mou et al., 2015; Duvenaud et al., 2015; Kipf and Welling, 2017, inter alia), recurrent neural networks along paths (Xu et al., 2015; Roth and Lapata, 2016, inter alia) or dependency trees (Tai et al., 2015). We choose to use concatenations to control the model's complexity, and thus to better understand which parts of the model work.
We refer the readers to Kiperwasser and Goldberg (2016) and Peng et al. (2017) for further details of the parsing models.
Training procedure. Following previous work, we minimize structured hinge loss (Tsochantaridis et al., 2004) for both models. We jointly train both models from scratch, by randomly sampling an instance from the union of their training data at each step. In order to isolate the effect of backpropagation, we do not share any parameters between the two models.5 Implementation details are summarized in the supplementary materials.
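The sampling scheme in the training procedure can be sketched as a generator that draws one instance at a time from the union of the two training sets and tags it with the task it belongs to. The function name, the task tags, and the use of Python's `random` module are assumptions for illustration; the actual training loop is described in the supplementary materials.

```python
import random

def joint_training_stream(intermediate_data, end_data, steps, seed=0):
    """Yield (task, instance) pairs sampled uniformly from the union of both sets."""
    rng = random.Random(seed)
    pool = [("intermediate", ex) for ex in intermediate_data]
    pool += [("end", ex) for ex in end_data]
    for _ in range(steps):
        yield rng.choice(pool)

# Toy usage with placeholder instances.
stream = list(joint_training_stream(["syn-1", "syn-2"], ["sem-1"], steps=10))
for task, instance in stream:
    # An "intermediate" instance would update only the intermediate model;
    # an "end" instance would update both models through the gradient proxy.
    pass
```

Sampling uniformly over the union means the larger dataset is visited proportionally more often.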

Datasets
• For semantic dependencies, we use the English dataset from SemEval 2015 Task 18 (Oepen et al., 2015). Among the three formalisms provided by the shared task, we consider DELPH-IN MRS-derived dependencies (DM) and Prague Semantic Dependencies (PSD).6 It includes §00-19 of the WSJ corpus as training data, §20 and §21 for development and in-domain test data, resulting in a 33,961/1,692/1,410 train/dev./test split, and
5 Parameter sharing has proved successful in many related tasks (Collobert and Weston, 2008; Søgaard and Goldberg, 2016; Ammar et al., 2016; Swayamdipta et al., 2016, 2017, inter alia), and could be easily combined with our approach.
6 We drop the third (PAS) because its structure is highly predictable from parts-of-speech, making it less interesting.

Baselines
We compare to the following baselines:
• A pipelined system (PIPELINE). The pretrained parser achieves 92.9 test unlabeled attachment score (UAS).8
• Structured attention networks (SA; Kim et al., 2017). We use the inside-outside algorithm (Baker, 1979) to populate $z$ with arcs' marginal probabilities, and use log-loss as the objective in training the intermediate parser.

Empirical Results
Table 1 compares the semantic dependency parsing performance of SPIGOT to all five baselines. FREDA3 (Peng et al., 2017) is a state-of-the-art variant of NEURBOPARSER that is trained using multitask learning to jointly predict three different semantic dependency graph formalisms. Like the basic NEURBOPARSER model that we build from, FREDA3 does not use any syntax. Strong DM performance is achieved in a more recent work by using joint learning and an ensemble (Peng et al., 2018), which is not fairly comparable to the models discussed here.
We found that using syntactic information improves semantic parsing performance: using pipelined syntactic head features brings 0.5-1.4% absolute labeled F1 improvement to NEURBOPARSER.
Such improvements are smaller than those in previous works, where dependency path and syntactic relation features are included (Almeida and Martins, 2015; Ribeyre et al., 2015; Zhang et al., 2016), indicating the potential to get better performance by using more syntactic information, which we leave to future work.
Both STE and SPIGOT use hard syntactic features. By allowing backpropagation into the intermediate syntactic parser, they both consistently outperform PIPELINE. On the other hand, when marginal syntactic tree structures are used, SA outperforms PIPELINE only on the out-of-domain PSD test set, and improvements under other cases are not observed.
SPIGOT outperforms STE on DM by more than 0.3% absolute labeled F1, both in-domain and out-of-domain. For PSD, SPIGOT achieves similar performance to STE on the in-domain test set, but has a 0.5% absolute labeled F1 improvement on out-of-domain data, where syntactic parsing is less accurate.
tecture achieves 93.5 UAS when trained and evaluated with the standard split, close to the results reported by Kiperwasser and Goldberg (2016).

Semantic Dependencies for Sentiment Classification
Our second experiment uses semantic dependency graphs to improve sentiment classification performance. We are not aware of any efficient algorithm that solves marginal inference for semantic dependency graphs under determinism constraints, so we do not include a comparison to SA.

Architectures
Here we use NEURBOPARSER as the intermediate model, as described in §4.1.1, but with no syntactic enhancements.
Sentiment classifier. We first introduce a baseline that does not use any structural information. It learns a one-layer BiLSTM to encode the input sentence, and then feeds the sum of all hidden states into a two-layer ReLU-MLP.
To use semantic dependency features, we concatenate a word's BiLSTM-encoded representation to the averaged representation of its heads, together with the corresponding semantic roles, similarly to that in Equation 7. Then the concatenation is fed into an affine transformation followed by a ReLU activation. The rest of the model is kept the same as the BiLSTM baseline.
Training procedure. We use structured hinge loss to train the semantic dependency parser, and log-loss for the sentiment classifier. Due to the discrepancy in the training data size of the two tasks (33K vs. 7K), we pre-train a semantic dependency parser, and then adopt joint training together with the classifier. In the joint training stage, we randomly sample 20% of the semantic dependency training instances each epoch. Implementations are detailed in the supplementary materials.

Datasets
For semantic dependencies, we use the DM dataset introduced in §4.1.2.
We consider a binary classification task using the Stanford Sentiment Treebank (Socher et al., 2013). It consists of roughly 10K movie review sentences from Rotten Tomatoes. The full dataset includes a rating on a scale from 1 to 5 for each constituent (including the full sentences), resulting in more than 200K instances. Following previous work (Iyyer et al., 2015), we only use full-sentence instances, with neutral instances excluded (3s) and the remaining four rating levels converted to binary "positive" or "negative" labels. This results in a 6,920/872/1,821 train/dev./test split.

Empirical Results
Table 2 compares our SPIGOT method to three baselines. Pipelined semantic dependency predictions bring 0.9% absolute improvement in classification accuracy, and SPIGOT outperforms all baselines. In this task STE achieves slightly worse performance than a fixed pre-trained PIPELINE.

Analysis
We examine here how the intermediate model is affected by the end-task training signal. Is the end-task signal able to "overrule" intermediate predictions?
We use the syntactic-then-semantic parsing model (§4.1) as a case study. Table 3 compares a pipelined system to one jointly trained using SPIGOT. We consider the development set instances where both syntactic and semantic annotations are available, and partition them based on whether the two systems' syntactic predictions agree (SAME), or not (DIFF). The second group includes sentences with much lower syntactic parsing accuracy (91.3 vs. 97.4 UAS), and SPIGOT further reduces this to 89.6. Even though these changes hurt syntactic parsing accuracy, they lead to a 1.1% absolute gain in labeled F1 for semantic parsing. Furthermore, SPIGOT has an overall less detrimental effect on the intermediate parser than STE: using SPIGOT, intermediate dev. parsing UAS drops to 92.5 from the 92.9 pipelined performance, while STE reduces it to 91.8.
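The SAME/DIFF partition and the per-group attachment scores can be sketched as below. The data layout is an assumption for illustration: each sentence's parse is a list of head indices, one per token, and the helper names are hypothetical.

```python
def uas(pred_heads, gold_heads):
    """Unlabeled attachment score (in percent) over a list of sentences."""
    correct = sum(p == g
                  for ps, gs in zip(pred_heads, gold_heads)
                  for p, g in zip(ps, gs))
    total = sum(len(gs) for gs in gold_heads)
    return 100.0 * correct / total

def partition_by_agreement(preds_a, preds_b):
    """Split sentence indices by whether two systems predict the same tree."""
    same, diff = [], []
    for idx, (a, b) in enumerate(zip(preds_a, preds_b)):
        (same if a == b else diff).append(idx)
    return same, diff

# Toy data: two sentences, heads given per token.
gold = [[2, 0, 2], [0, 0]]
pipeline = [[2, 0, 2], [0, 1]]
joint = [[2, 0, 2], [0, 0]]

same, diff = partition_by_agreement(pipeline, joint)
diff_uas_pipeline = uas([pipeline[i] for i in diff], [gold[i] for i in diff])
diff_uas_joint = uas([joint[i] for i in diff], [gold[i] for i in diff])
```

Scoring both systems on the DIFF group alone isolates the sentences where end-task training actually changed the intermediate parse.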
We then take a detailed look and categorize the changes in intermediate trees by their correlations with the semantic graphs. Specifically, when a modifier $m$'s head is changed from $h$ to $h'$ in the […] The first two reflect modifications to the syntactic parse that rearrange semantically linked words to be neighbors. Under (c), the semantic parser removes a syntactic dependency that reverses the direction of a semantic dependency. These cases account for 17.6%, 10.9%, and 12.8%, respectively (41.2% combined) of the total changes. Making these changes, of course, is complicated, since they often require other modifications to maintain well-formedness of the tree. Figure 2 gives an example.

Related Work
Joint learning in NLP pipelines. To avoid cascading errors, much effort has been devoted to joint decoding in NLP pipelines (Habash and Rambow, 2005; Cohen and Smith, 2007; Goldberg and Tsarfaty, 2008; Lewis et al., 2015; Zhang et al., 2015, inter alia). However, joint inference can sometimes be prohibitively expensive. Recent advances in representation learning facilitate exploration in the joint learning of multiple tasks by sharing parameters (Collobert and Weston, 2008; Blitzer et al., 2006; Finkel and Manning, 2010; Zhang and Weiss, 2016; Hashimoto et al., 2017, inter alia).
Differentiable optimization. Gould et al. (2016) review the generic approaches to differentiation in bi-level optimization (Bard, 2010; Kunisch and Pock, 2013). Amos and Kolter (2017) extend their efforts to a class of subdifferentiable quadratic programs. However, they both require that the intermediate objective has an invertible Hessian, limiting their application in NLP. In another line of work, the steps of a gradient-based optimization procedure are unrolled into a single computation graph (Stoyanov et al., 2011; Domke, 2012; Goodfellow et al., 2013; Brakel et al., 2013). This comes at a high computational cost due to the second-order derivative computation during backpropagation. Moreover, constrained optimization problems (like many NLP problems) often require projection steps within the procedure, which can be difficult to differentiate through (Belanger and McCallum, 2016; Belanger et al., 2017).

Conclusion
We presented SPIGOT, a novel approach to backpropagating through neural network architectures that include discrete structured decisions in intermediate layers. SPIGOT devises a proxy for the gradients with respect to argmax's inputs, employing a projection that aims to respect the constraints in the intermediate task. We empirically evaluate our method with two architectures: a semantic parser with an intermediate syntactic parser, and a sentiment classifier with an intermediate semantic parser. Experiments show that SPIGOT achieves stronger performance than baselines under both settings, and outperforms state-of-the-art systems on semantic dependency parsing. Our implementation is available at https://github.com/Noahs-ARK/SPIGOT.

Figure 1: The original feasible set $\mathcal{Z}$ (red vertices) is relaxed into a convex polytope $\mathcal{P}$ (the area encompassed by blue edges). Left: making a gradient update to $\hat{z}$ makes it step outside the polytope, and it is projected back to $\mathcal{P}$, resulting in the projected point $\tilde{z}$. $\nabla_s L$ is then along the edge. Right: updating $\hat{z}$ keeps it within $\mathcal{P}$, and thus $\nabla_s L = \eta \nabla_{\hat{z}} L$.

Figure 2: A development instance annotated with both gold DM semantic dependency graph (red arcs on the top), and gold syntactic dependency tree (blue arcs at the bottom). A pretrained syntactic parser predicts the same tree as the gold; the semantic parser backpropagates into the intermediate syntactic parser, and changes the dashed blue arcs into dashed red arcs (§5).

Table 1: Semantic dependency parsing performance in both unlabeled (UF) and labeled (LF) F1 scores. Bold font indicates the best performance. Peng et al. (2017) does not report UF.
(b) F1 on out-of-domain test set.