Pre- and In-Parsing Models for Neural Empty Category Detection

Motivated by the positive impact of empty categories on syntactic parsing, we study neural models for pre- and in-parsing detection of empty categories, which has not previously been investigated. We find several non-obvious facts: (a) a BiLSTM can capture the non-local contextual information that is essential for detecting empty categories, (b) even with a BiLSTM, syntactic information is still able to enhance the detection, and (c) automatic detection of empty categories improves parsing quality for overt words. Our neural ECD models outperform the prior state of the art by significant margins.


Introduction
Encoding unpronounced nominal elements, such as dropped pronouns and traces of dislocated elements, the empty category is an important piece of machinery in representing the (deep) syntactic structure of a sentence (Carnie, 2012). Figure 1 shows an example. In linguistic theory, e.g. Government and Binding (GB; Chomsky, 1981), the empty category is a key concept bridging S-Structure and D-Structure, since traces record the history of movement operations. In practical treebanking, empty categories have been used to indicate long-distance dependencies, discontinuous constituents, and certain dropped elements (Marcus et al., 1993; Xue et al., 2005). Recently, there has been increasing interest in automatic empty category detection (ECD; Johnson, 2002; Seeker et al., 2012; Xue and Yang, 2013; Wang et al., 2015), and it has been shown that ECD is able to improve linear model-based dependency parsing (Zhang et al., 2017b).
          Pre-Parsing  In-Parsing  Post-Parsing
  Linear       ✔            ✔            ✔
  Neural       ✘            ✘            ✔

Table 1: ECD approaches that have been investigated.

There are two key dimensions of approaches
for ECD: the relationship with parsing and the statistical disambiguation technique. Considering the relationship with parsing, we can divide ECD models into three types: (1) the pre-parsing approach (e.g. Dienes and Dubey (2003)), where empty categories are identified without using syntactic analysis; (2) the in-parsing approach (e.g. Cai et al. (2011); Zhang et al. (2017b)), where detection is integrated into a parsing model; and (3) the post-parsing approach (e.g. Johnson (2002); Wang et al. (2015)), where parser outputs are utilized as clues to determine the existence of empty categories. For disambiguation, while early work focused on linear models, recent work has started exploring deep learning techniques for the post-parsing approach (Wang et al., 2015). Table 1 situates all existing ECD systems along these two dimensions. Neural models for pre- and in-parsing ECD have not been studied yet. In this paper, we fill this gap in the literature.

It is obvious that empty categories are highly related to surface syntactic analysis. Determining the existence of empty elements between two overt words relies not only on sequential contexts but also on hierarchical contexts. Traditional linear structured prediction models for sequence structures, e.g. conditional random fields (CRFs), are rather weak at capturing hierarchical contextual information, which is essentially non-local for their architectures. Accordingly, pre-parsing models based on linear disambiguation techniques fail to produce accuracy comparable to the other two approaches. In striking contrast, RNN-based sequence labeling models have been shown to be very powerful at capturing non-local information, and therefore have great potential to advance the pre-parsing approach for ECD. In this paper, we propose a new bidirectional LSTM (BiLSTM) model for pre-parsing ECD using information about contextual words.

Figure 1: An example from CTB: Shanghai Pudong recently enacted 71 regulatory documents involving the economic fields. The dependency structure is according to Xue (2007). "∅1" indicates a null operator that represents an empty relative pronoun. "∅2" indicates a trace left by relativization.

Previous studies highlight the usefulness of syntactic analysis for ECD. Furthermore, syntactic parsing of overt words can benefit from the detection of empty elements and vice versa (Zhang et al., 2017b). In this paper, we follow Zhang et al.'s encouraging results obtained with linear models and study first- and second-order neural models for in-parsing ECD. The main challenge for neural in-parsing ECD is to encode empty element candidates and integrate the corresponding embeddings into a parsing model. We focus on the state-of-the-art parsing architecture developed by Kiperwasser and Goldberg (2016) and Dozat and Manning (2016), which uses BiLSTMs to extract features from contexts, followed by a nonlinear transformation to perform local scoring.
To evaluate the effectiveness of deep learning techniques for ECD, we conduct experiments on a pro-drop language, namely Chinese. The empirical evaluation indicates some non-obvious facts:

1. Neural ECD models outperform the prior state of the art by significant margins. Even a pre-parsing model without any syntactic information outperforms the best existing linear in-parsing and post-parsing ECD models.
2. Incorporating empty elements can help neural dependency parsing. This parallels Zhang et al.'s investigation on linear models.
3. Our in-parsing neural models obtain better predictions than the pre-parsing model.
The implementation of all models is available at https://github.com/draplater/ empty-parser.
Pre-Parsing Detection

Context of Empty Categories

Sequential Context
Viewing a natural language sentence as a word-by-word sequence is perhaps the most intuitive representation. Analyzing contextual information by modeling neighboring words according to this sequential structure is a basic strategy for a large number of NLP tasks, e.g. POS tagging and syntactic parsing. It is also important for ECD to consider sequential contexts, deriving horizontal features that exploit the lexical context of the current candidate position: one or more preceding and following word tokens, as well as their part-of-speech (POS) tags.

Hierarchical Context
The detection of ECs requires broad contextual knowledge. Besides the one-dimensional representation, vertical features are equally essential to characterize empty elements. The hierarchical structure is a compact reflection of syntactic content. By integrating the hierarchical context, we can exploit the regular distributional patterns of ECs in a syntactic tree. More specifically, this means considering the head of the EC and the relevant dependencies to augment the prediction.
Both sequential and hierarchical contexts are essential to determine the existence of empty elements between two overt words. Words close to each other in the hierarchical structure may appear far apart in the sequential representation, which makes it hard for linear sequential tagging models to capture hierarchical contextual information. RNN-based sequence models have been proven very powerful at capturing non-local features. In this paper, we show that an LSTM is able to advance pre-parsing ECD significantly.

Figure 2: An example of the four kinds of annotations. The phrase is cut out from the sentence in Figure 1. "@@" marks interspaces between words.

A Sequence-Oriented Model
In the sequence-oriented model, we formulate ECD as a sequence labeling problem. In general, we attach ECs to surrounding overt tokens to represent their identities, i.e. their locations and types. We explore four annotation specifications, denoted Interspace, Pre2, Pre3 and Prepost, respectively. Detailed descriptions follow.
Interspace We convert ECs' information into tags on the interspaces between words. The assigned tag is the concatenation of the ECs between the two words. If there is no EC, we simply tag the interspace as O. As a special case, since only one EC in our data set occurs at the very end of a sentence, we only consider the interspace heading each position and ignore the one standing at the end of the sentence. Assuming there are n words in a given sentence, there are then 2n items (n words and n interspaces) to tag.
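As an illustration, the Interspace conversion can be sketched as follows. The token format and the tag names (`O`, `WORD`, `+`-joined EC types) are our own simplification for exposition, not the paper's exact tag inventory.

```python
def to_interspace_tags(tokens):
    """Convert a token sequence containing empty categories (ECs) into the
    Interspace labeling: one tag for the interspace preceding each overt
    word plus one tag per overt word, i.e. 2n items for n overt words.

    `tokens` is a list where ECs appear as ("EC", type) tuples and overt
    words are plain strings. Trailing sentence-final ECs are ignored,
    mirroring the paper's observation that they are extremely rare."""
    tags, words, pending = [], [], []
    for tok in tokens:
        if isinstance(tok, tuple) and tok[0] == "EC":
            pending.append(tok[1])          # consecutive ECs are concatenated
        else:
            # tag for the interspace preceding this overt word
            tags.append("+".join(pending) if pending else "O")
            tags.append("WORD")             # placeholder tag for the word itself
            words.append(tok)
            pending = []
    return words, tags
```

For the fragment in Figure 1, a run of two ECs before a word yields a single concatenated interspace tag such as `OP+T`.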

Pre2 and Pre3
We attach ECs to the words following them. In experiments using POS information, ECs are attached to the POS tag of the next word, while normal words are tagged with their POS alone. In experiments without POS information, ECs directly serve as the label of the following word. Words without ECs ahead of them are consistently tagged with an empty marker. As in Interspace, linearly consecutive ECs are concatenated into a single tag. Pre2 means that at most two preceding consecutive ECs are considered, while Pre3 raises the limit on the considered continuous length to three. The choice of window lengths is grounded in the distribution of ECs' continuous lengths, shown in Table 2.
Prepost Considering that capturing long-distance features may be a challenge, we introduce another labeling rule called Prepost. Different from Pre2 and Pre3, the responsibility for representing ECs is shared by both the preceding and the following words; tags heading sentences remain unchanged. In particular, if the number of consecutive ECs at the current position is odd, we attach the extra EC to the following word for consistency and clarity. Take part of the sentence in Figure 1 as an example: the four kinds of representations are depicted in Figure 2. To investigate the effect of POS in the tagging process, we also conduct experiments that integrate POS into tagging. For Interspace, POS tags are individual output labels, while for the other representations, the POS information is used to divide an EC-integrated tag into subtypes.
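The Prepost splitting rule described above can be sketched as follows. This is a toy reconstruction under our reading of the scheme (each word gets a "pre" and a "post" slot; an odd run gives its extra EC to the following word; sentence-initial runs attach entirely to the first word), not the paper's exact tag format.

```python
def to_prepost_tags(tokens):
    """Prepost labeling sketch: a run of k consecutive ECs is split between
    the preceding word (floor(k/2) ECs) and the following word (ceil(k/2)
    ECs), so an odd extra EC goes to the following word. Sentence-initial
    runs attach wholly to the following word. Returns (pre, word, post)
    triples; ECs are ("EC", type) tuples, words are plain strings."""
    words = [t for t in tokens if not (isinstance(t, tuple) and t[0] == "EC")]
    post = ["O"] * len(words)   # ECs attributed to the preceding word
    pre = ["O"] * len(words)    # ECs attributed to the following word
    run, widx = [], -1          # widx: index of the last overt word seen

    def flush(run, widx):
        if not run:
            return
        if widx < 0:                       # sentence-initial run
            pre[0] = "+".join(run)
        else:
            half = len(run) // 2           # floor(k/2) to the preceding word
            if half:
                post[widx] = "+".join(run[:half])
            pre[widx + 1] = "+".join(run[half:])

    for tok in tokens:
        if isinstance(tok, tuple):
            run.append(tok[1])
        else:
            flush(run, widx)
            run, widx = [], widx + 1
    return list(zip(pre, words, post))
```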

Tagging Based on LSTM-CRF
To capture long-range syntactic information for accurate disambiguation in the pre-parsing phase, we build an LSTM-CRF model inspired by the neural network proposed by Ma and Hovy (2016). A BiLSTM layer over character embeddings extracts a character-level representation of each word, which is concatenated with the pre-trained word embedding before being fed into another BiLSTM layer that captures contextual information. We thus obtain dense and continuous representations of the words in a given sentence. The last component decodes with a linear-chain CRF, which optimizes the output sequence by factoring in local characteristics. Dropout layers both before and after the sentence-level network serve to prevent over-fitting.
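The CRF decoding step can be illustrated with a plain Viterbi search over per-position emission scores (here standing in for the BiLSTM outputs) and tag-transition scores. This is a generic linear-chain CRF sketch, not the paper's implementation:

```python
def viterbi(emissions, transitions):
    """Find the highest-scoring tag sequence for a linear-chain CRF.
    emissions[t][y] is the score of tag y at position t; transitions[p][y]
    is the score of moving from tag p to tag y. Returns the best tag path."""
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])              # best score ending in each tag
    back = []                               # backpointers per position
    for t in range(1, n):
        new_score, ptr = [], []
        for y in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
            ptr.append(best_prev)
        score, back = new_score, back + [ptr]
    best = max(range(k), key=lambda y: score[y])
    path = [best]
    for ptr in reversed(back):              # follow backpointers to the start
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Unlike greedy per-position tagging, the transition scores let the decoder veto locally attractive but globally inconsistent tag pairs, which is exactly what the CRF layer contributes on top of the BiLSTM.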
In-Parsing Detection

Zhang et al. (2017b) designed novel algorithms to produce dependency trees in which empty elements are allowed. Their results show that integrating empty categories can augment the parsing of overt tokens when a structured perceptron, a global linear model, is applied for disambiguation. From a different perspective, by jointly performing ECD and dependency parsing, we can utilize full syntactic information in the process of detecting ECs. Parallel to their work, we explore the effect of ECD on neural dependency-based parsing in this section.

Joint ECD and Dependency Parsing
To perform ECD and dependency parsing in a unified framework, we formulate the task as an optimization problem. Assume that we are given a sentence s with n overt words. We use an index set I_o = {(i, j) | i, j ∈ {1, ..., n}} to denote all possible overt dependency edges, and I_c = {(i, φ_j) | i, j ∈ {1, ..., n}} to denote all possible covert dependency edges, where φ_j denotes an empty node that precedes the j-th word. A dependency parse with empty nodes can then be represented as a 0-1 vector indexed by the candidate edges:

z = (z_e)_{e ∈ I_o ∪ I_c}, z_e ∈ {0, 1}

Let Z denote the set of all possible z, and PART(z) the factors in the dependency tree, including edges (and edge siblings in the second-order model). Parsing with ECD can then be defined as a search for the highest-scored z*(s) among all compatible analyses, just like parsing without empty elements:

z*(s) = argmax_{z ∈ Z} Σ_{p ∈ PART(z)} SCORE(s, p)

The graph-based parsing algorithms proposed by Zhang et al. rely on two properties: ECs can only serve as dependents, and the number of successive ECs is limited. The latter makes it reasonable to treat consecutive ECs governed by the same head as one word. We also follow this set-up.
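The candidate-edge sets can be enumerated explicitly for a tiny sentence; the exact indexing conventions here (no root index, a covert slot before every word) are our assumption for illustration:

```python
def edge_index_sets(n):
    """Enumerate candidate edges for joint parsing and ECD on a sentence of
    n overt words. An overt edge (i, j) links two distinct overt words; a
    covert edge (i, phi_j) attaches the empty node preceding word j to an
    overt head i (the head may be word j itself, as with an EC governed by
    the immediately following word)."""
    overt = {(i, j) for i in range(1, n + 1)
             for j in range(1, n + 1) if i != j}
    covert = {(i, ("phi", j)) for i in range(1, n + 1)
              for j in range(1, n + 1)}
    return overt, covert
```

Even for n = 3 the covert candidates add nine edges on top of six overt ones, which is the search-space growth that the structure regularization section below has to contend with.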
BiLSTMs have recently brought substantial improvements to dependency parsers. In particular, a BiLSTM can be utilized as a powerful feature extractor to assist a dependency parser: mainstream data-driven parsers, both transition- and graph-based, can conduct disambiguation on top of the word vectors it provides. Following Kiperwasser and Goldberg (2016)'s graph-based dependency parser, we implement such a parser to recover empty categories and to evaluate the impact of empty categories on surface parsing.
Here we present the details of our parser's design. A vector is associated with each word and each POS tag to transform them into continuous and dense representations. We use pre-trained word embeddings and randomly initialized POS-tag embeddings.

The concatenation of the word embedding and the POS-tag embedding of each word in a sentence is used as the input to BiLSTMs, which extract context-related feature vectors r_i:

r_{1:n} = BiLSTM(s; 1:n)

The context-related feature vectors are fed into a non-linear transformation to perform scoring.

A First-Order Model
In the first-order model, we only consider the head and the dependent of each possible dependency arc. The two feature vectors of a word pair are scored with a non-linear transformation g to give the first-order score. When words i and j are overt, we define the score function for sentence s as:

SCOREDEP(s, i, j) = W_2 · g(W_{1,1} · r_i + W_{1,2} · r_j + b)

where W_2, W_{1,1} and W_{1,2} denote the weight matrices of the linear transformations. The score of a covert edge from word i to empty node φ_j is calculated in a similar way with different parameters:

SCOREEMPTY(s, i, φ_j) = W'_2 · g(W'_{1,1} · r_i + W'_{1,2} · r_j + b')

These non-linear transformations are also known as multilayer perceptrons (MLPs). The total score in our first-order model is:

SCORE(z) = Σ_{(i,j) ∈ DEP(z)} SCOREDEP(s, i, j) + Σ_{(i,φ_j) ∈ DEPEMPTY(z)} SCOREEMPTY(s, i, φ_j)

where DEP(z) and DEPEMPTY(z) denote all overt and covert edges in z, respectively. Because each overt and covert edge is selected independently of the others, decoding amounts to computing the maximum subtree over overt edges (we use the Eisner algorithm in our experiments) and appending each covert edge (i, φ_j) with SCOREEMPTY(s, i, φ_j) > 0.
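A minimal sketch of the two pieces above, with g = tanh, plain-list vectors, and toy dimensions (the weight shapes and names are illustrative, not the paper's hyperparameters):

```python
import math

def mlp_score(r_head, r_dep, W1h, W1d, b, W2):
    """First-order arc score W2 · g(W1h·r_head + W1d·r_dep + b), g = tanh.
    W1h and W1d are lists of rows (hidden_dim x input_dim), b and W2 are
    vectors of length hidden_dim."""
    hidden = [math.tanh(sum(w * x for w, x in zip(W1h[k], r_head)) +
                        sum(w * x for w, x in zip(W1d[k], r_dep)) + b[k])
              for k in range(len(b))]
    return sum(w * h for w, h in zip(W2, hidden))

def attach_covert_edges(covert_scores):
    """Covert edges are scored independently of the overt tree, so first-order
    decoding simply keeps every candidate (i, phi_j) with a positive score."""
    return [edge for edge, s in covert_scores.items() if s > 0]
```

With one hidden unit and cancelling inputs the score is tanh(0) = 0, i.e. exactly on the decision boundary for covert-edge attachment.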

A Second-Order Model
In the second-order model, we also consider sibling arcs. We extend the neural network described above to perform second-order parsing, calculating second-order scores (scores defined over sibling arcs) in a similar way. Each pair of overt sibling arcs, for example (i, j) and (i, k) with j < k, is denoted (i, j, k) and scored with a non-linear transformation.
Zhang et al. (2017b) defined two kinds of second-order scores to describe the interaction between concrete nodes and empty categories: the covert-inside sibling (i, φ_j, k) and the covert-outside sibling (i, j, φ_k). Their scores are calculated in a similar way with different parameters.
Finally, the score function over the whole syntactic analysis is:

SCORE(z) = Σ_{(i,j) ∈ DEP(z)} SCOREDEP(s, i, j) + Σ_{(i,φ_j) ∈ DEPEMPTY(z)} SCOREEMPTY(s, i, φ_j) + Σ_{(i,j,k) ∈ SIB(z)} SCORESIB(s, i, j, k) + Σ_{(i,φ_j,k) ∈ COVERTIN(z)} SCORECOVERTIN(s, i, φ_j, k) + Σ_{(i,j,φ_k) ∈ COVERTOUT(z)} SCORECOVERTOUT(s, i, j, φ_k)

where SIB(z), COVERTIN(z) and COVERTOUT(z) denote the overt-both, covert-inside and covert-outside siblings of z, respectively. In total, five MLPs are used to calculate the five types of scores. The network structure is shown in Figure 3.
Labeled Parsing Similar to Kiperwasser and Goldberg (2016) and Zhang et al. (2017a), we perform labeled parsing in two steps: conduct unlabeled parsing, then assign a label to each dependency edge. Labels are determined with a non-linear classifier; we use separate classifiers for edges between concrete nodes and edges involving empty categories.
Training In order to penalize graphs that have high model scores but are very wrong, we use a margin-based approach to compute the loss from the gold tree T* and the best prediction T̂ under the current model.
We define the loss term as:

loss = max(0, ∆(T*, T̂) + SCORE(T̂) − SCORE(T*))

The margin term ∆ measures the dissimilarity between the gold tree T* and the prediction T̂. Following Kiperwasser and Goldberg (2016)'s loss-augmented inference, we define ∆ as the number of dependency edges that appear in the prediction but do not belong to the gold tree.
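The hinge loss and the ∆ term can be sketched directly on edge sets (here `score` is any callable mapping an edge set to a model score; a toy per-edge lookup stands in for the MLPs):

```python
def margin_loss(gold_edges, pred_edges, score):
    """Margin-based training loss sketch. Delta counts predicted edges that
    are missing from the gold tree; the loss is the margin violation
    score(pred) + Delta - score(gold), clipped at zero. Returns (loss, Delta)."""
    delta = len(set(pred_edges) - set(gold_edges))
    loss = max(0.0, score(pred_edges) + delta - score(gold_edges))
    return loss, delta
```

Adding ∆ to the predicted score during inference pushes the model to keep the gold tree ahead of every wrong tree by a margin proportional to how many wrong edges that tree contains.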

Structure Regularization
ECD significantly increases the search space for parsing, which has a side effect in practice: given the limited annotations available for training, searching for more complex structures in a larger space harms the generalization ability of structured prediction (Sun, 2014). To control structure-based overfitting, we train a normal dependency parser, i.e. a parser for overt words only, and use its first- and second-order scores to augment the corresponding score functions of the joint parsing and ECD model. The two parsers are trained separately; at test time, the scores calculated by the individual models are added for decoding.
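The test-time score combination is a simple addition over shared factors; a minimal sketch (dict-based score tables are our simplification of the models' score functions):

```python
def add_parser_scores(joint_scores, overt_scores):
    """Structure-regularized scoring sketch: the separately trained
    overt-only parser's scores are added to the joint parsing-and-ECD
    model's scores before decoding. Factors unknown to the overt-only
    parser (those involving covert edges) keep the joint score unchanged."""
    return {e: s + overt_scores.get(e, 0.0) for e, s in joint_scores.items()}
```

Because the overt-only parser never sees covert candidates, it acts as a regularizer purely on the overt substructure, which is where the enlarged search space would otherwise hurt most.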

Data
We conduct experiments on a subset of Penn Chinese Treebank (CTB; Xue et al., 2005) 9.0. Since Chinese is a pro-drop language, empty categories are a particularly useful device for representing its (deep) syntactic analyses. Empty categories in CTB are divided into six classes: pro, PRO, OP, T, RNR and *, which are described in detail in Xue and Yang (2013) and Wang et al. (2015).
For comparability with the state of the art, the division into training, development and testing data follows previous work (Xue and Yang, 2013). Our experiments fall into two groups. The first group compares a linear conditional random field (Linear-CRF) model with the LSTM-CRF tagging model to evaluate the gains from introducing neural structures. The second group is designed for the dependency-based in-parsing models.

Evaluation Metrics
We adopt two metrics to evaluate our experiments. The first focuses on an EC's position and type, in accordance with the labeled empty elements measure proposed by Cai et al. (2011), and can be applied to all models in our experiments. The second is stricter: besides position and type, it also checks the EC's head, so an EC is considered correct only when all three parts match the corresponding gold standard. Thus only models that involve dependency structures can be evaluated by the latter metric. Based on these measures of the two degrees of strictness, we evaluate our neural pre- and in-parsing models on each type of EC as well as on overall performance.
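The looser metric can be sketched as set intersection over (position, type) pairs; for the stricter head-aware variant, one would simply include the head index in each tuple (the tuple format here is our own simplification):

```python
def ec_prf(gold, pred):
    """Labeled empty-element precision/recall/F1 sketch (after Cai et al.,
    2011): a predicted EC counts as correct when its (position, type) pair
    occurs in the gold annotation. `gold` and `pred` are iterables of
    hashable tuples."""
    gold_set, pred_set = set(gold), set(pred)
    correct = len(gold_set & pred_set)
    p = correct / len(pred_set) if pred_set else 0.0
    r = correct / len(gold_set) if gold_set else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```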
Besides, to compare different models' abilities to capture non-local information, we define the Dependency Distance of an EC as the number of words from the EC to its head, not counting other ECs on the path. Taking the two ECs in Figure 1 as an example, ∅2 has a Dependency Distance of 0 while ∅1's Dependency Distance is 3. We calculate labeled recall scores for each Dependency Distance; a higher score means a greater capability to capture and represent long-distance details.

Table 3 shows the overall performance of the two sequential models on development data. From the results, we can clearly see that the introduction of neural structures pushes up the scores substantially. The reason is that our LSTM-CRF model not only benefits from the linearly weighted combination of local characteristics, like ordinary CRF models, but is also able to integrate more contextual information, especially long-distance information. This confirms the great superiority of LSTM-based models in sequence labeling problems.
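The Dependency Distance defined above can be computed by counting overt tokens between an EC and its head; positions here are indices into the full token sequence with ECs included (our own convention for the sketch):

```python
def dependency_distance(ec_pos, head_pos, ec_positions):
    """Dependency Distance of an EC: the number of overt words strictly
    between the EC and its head, skipping any other ECs in the span.
    `ec_positions` is the set of all EC positions in the sentence."""
    lo, hi = sorted((ec_pos, head_pos))
    return sum(1 for p in range(lo + 1, hi) if p not in ec_positions)
```

An EC immediately preceding its head (like ∅2 in Figure 1) gets distance 0, while an EC separated from its head by three overt words and another EC gets distance 3.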

Results of Pre-Parsing Models
Furthermore, we find that the differences among the four kinds of representations are not obvious. The best performing one with the LSTM-CRF model is Interspace, but its advantage is narrow. Pre3 uses a larger window length to incorporate richer contextual tokens, but at the same time the search space for decoding grows larger, which explains why performance drops slightly with increasing window length. In general, experiments with POS tags show higher scores, as more syntactic clues are incorporated.
We compare LSTM-CRF with other state-of-the-art systems in Table 4. We can see that a simple neural pre-parsing model outperforms state-of-the-art linear in-parsing systems. The analysis of results on different EC types displayed in Table 5 shows that, compared with previous systems, the sequence-oriented pre-parsing model is good at detecting pro, which is used widely in pro-drop languages. Additionally, the model succeeds in detecting seven * EC tokens during evaluation. * indicates a trace left by passivization as well as raising, and is very rare in the training data; previous models usually cannot identify any *. This detail reflects that the LSTM-CRF model makes the most of limited training data compared with existing systems.

Table 6 presents detailed results of the in-parsing models on test data. Compared with the state of the art, the first-order model performs a little worse while the second-order model achieves a remarkable score. The first-order parsing model only constrains the dependencies of the covert and overt tokens to form a tree. Due to this loose scoring constraint, the prediction of empty nodes is affected little by the prediction of dependencies of overt words. The four bold numbers in the table support the conclusion that integrating an empty edge and its sibling overt edges is necessary to boost performance. This makes sense because empty categories are highly related to syntactic analysis: when we conduct ECD and dependency parsing simultaneously, we can leverage more hierarchical contextual information. Comparing results across EC types, we find that OP and T benefit most from the parsing information, with F1 scores increasing by about ten points, more markedly than the other types.

Table 6: The performances of the first- and second-order in-parsing models on test data.
Table 7 shows the impact of automatic detection of empty categories on the parsing of overt words. We compare the results of both steps of labeled parsing. We can clearly see that integrating empty elements into dependency parsing improves the neural parsing accuracy of overt words. Moreover, when the parsing models without and with ECs are combined, performance is pushed up further. These results confirm the conclusion of Zhang et al. (2017b) that empty elements help parse the overt words. The main reason is that the existence of ECs provides extra structural information, which can reduce approximation errors in a structured prediction problem.

             -EC    +EC    -+EC
  Unlabeled  87.6   88.9   89.6
  Labeled    84.6   85.9   86.6

Table 7: Parsing accuracy for overt words without ECs (-EC), with ECs (+EC), and with the combined models (-+EC).

According to the above analysis, we can conclude that ECD and syntactic parsing mutually promote each other, which partially explains why in-parsing models can outperform pre-parsing models. Meanwhile, it provides a new approach to improving dependency parsing quality in a unified framework.

Dependency Distance
Figure 4: Recall scores of different models with respect to Dependency Distance. "Pre-parsing" and "In-parsing" refer to the LSTM-CRF model and the dependency-based in-parsing model, respectively.
We compare the pre- and in-parsing models with respect to Dependency Distance; the former refers to the LSTM-CRF model and the latter to the dependency-based in-parsing model. Figure 4 shows the results. The abscissa ranges from 0 to 26, with the longest dependency arc spanning 26 non-EC word tokens. We can see that long-distance disambiguation is a challenge shared by both models: once the Dependency Distance exceeds four, the recall score drops gradually as the abscissa increases. Comparing the two curves, we find that the in-parsing model performs better on ECs that are close to their heads, whereas for ECs far from their heads the two models perform almost identically. This demonstrates that the LSTM structure is capable of capturing non-local features, making up for its lack of exposure to parsing information.

Challenges
On the whole, the most challenging EC type is pro. We assume this is because pro-drop situations are complicated and diverse in Chinese. According to Chinese linguistic theory, pronouns are dropped as a result of continuation from the preceding discourse or simply idiomatic rules, such as the ellipsis of the first-person pronoun "我/I" in the subject position. To close this gap, we may need to extract deeper structural features.
Another difficulty is the detection of consecutive ECs. In our experiments, the in-parsing dependency-based model can only accurately detect up to two consecutive ECs. Too many empty elements in the same sentence conceal too much syntactic information, making it hard to recover the original structure.
Moreover, given that ECs play an essential role in syntactic analysis, the current detection accuracy of ECs is far from sufficient. We still have a long way to go.

Related Work
The detection of empty categories is an essential foundation for many downstream tasks. For example, Chung and Gildea (2010) showed that automatic empty category detection has a positive impact on machine translation, and Zhang et al. (2017b) showed that ECD can benefit linear syntactic parsing of overt words. To accurately identify empty elements in sentences, there are generally three approaches. The first is to build pre-processors before syntactic parsing. Dienes and Dubey (2003) proposed a shallow trace tagger that can detect discontinuities and can be combined with unlexicalized PCFG parsers to implement deep syntactic processing; due to the lack of phrase structure information, it did not achieve remarkable results. The second is to integrate ECD into parsing, as in Schmid (2006) and Cai et al. (2011), which involved empty elements in the process of generating parse trees. Another in-parsing system was proposed by Zhang et al. (2017b), who designed algorithms to produce dependency trees in which empty elements are allowed; to add empty elements into dependency structures, they extended Eisner's first-order dynamic programming algorithm to second- and third-order algorithms. The last approach is post-parsing: Johnson (2002) proposed a simple pattern-matching algorithm for recovering empty nodes in phrase structure trees, while Campbell (2004) presented a rule-based algorithm. Xue and Yang (2013) conducted ECD based on dependency trees; their methods can leverage richer syntactic information and thus achieved more satisfying scores.
As neural networks have been demonstrated to have a great ability to capture complex features, they have been applied to many NLP tasks (Bengio and Schwenk, 2006; Collobert et al., 2011). Neural methods have also been explored for identifying empty elements: Wang et al. (2015) described a novel ECD solution using distributed word representations and achieved state-of-the-art performance. Building on the above work, we explore neural pre- and in-parsing models for ECD.

Conclusion
Neural networks have recently played a big role in many NLP tasks owing to their nonlinear mapping ability and their avoidance of hand-engineered features. They are therefore a well-justified choice both for identifying empty categories and for integrating empty categories into syntactic analysis. In this paper, we studied neural models to detect empty categories and observed three facts: (1) a BiLSTM significantly advances pre-parsing ECD; (2) automatic ECD improves the neural dependency parsing quality for overt words; (3) even with a BiLSTM, syntactic information can enhance the detection further. Experiments on Chinese show that our neural models for ECD substantially boost the state-of-the-art detection accuracy.