A Neural Transition-based Model for Nested Mention Recognition

It is common for entity mentions to contain other mentions recursively. This paper introduces a scalable transition-based method to model the nested structure of mentions. We first map a sentence with nested mentions to a designated forest where each mention corresponds to a constituent of the forest. Our shift-reduce based system then learns to construct the forest structure in a bottom-up manner through an action sequence whose maximal length is guaranteed to be three times the sentence length. Based on Stack-LSTM, which is employed to efficiently and effectively represent the states of the system in a continuous space, our system further incorporates a character-based component to capture letter-level patterns. Our model achieves state-of-the-art performance on the ACE datasets, showing its effectiveness in detecting nested mentions.


Introduction
There has been increasing interest in named entity recognition, or more generally in recognizing entity mentions [2] (Alex et al., 2007; Finkel and Manning, 2009; Lu and Roth, 2015; Muis and Lu, 2017), in which the nested hierarchical structure of entity mentions is taken into account to better facilitate downstream tasks like question answering (Abney et al., 2000), relation extraction (Mintz et al., 2009; Liu et al., 2017), event extraction (Riedel and McCallum, 2011; Li et al., 2013), and coreference resolution (Soon et al., 2001; Ng and Cardie, 2002; Chang et al., 2013). Practically, mentions with nested structures frequently occur in news (Doddington et al., 2004) and biomedical documents (Kim et al., 2003). For example in [...]
Traditional sequence labeling models such as conditional random fields (CRF) (Lafferty et al., 2001) do not allow hierarchical structures between segments, making them incapable of handling such problems. Finkel and Manning (2009) presented a chart-based parsing approach where each sentence with nested mentions is mapped to a rooted constituent tree. The issue with using a chart-based parser is its cubic time complexity in the number of words in the sentence.
To achieve a scalable and effective solution for recognizing nested mentions, we design a transition-based system inspired by the recent success of transition-based methods for constituent parsing (Zhang and Clark, 2009) and named entity recognition (Lou et al., 2017), especially when they are paired with neural networks (Watanabe and Sumita, 2015). Generally, each sentence with nested mentions is mapped to a forest where each outermost mention forms a tree consisting of its inner mentions. Then our transition-based system learns to construct this forest through a sequence of shift-reduce actions. Figure 1 shows an example of such a forest. In contrast, the tree structure of Finkel and Manning (2009) further uses a root node to connect all tree elements. Our forest representation eliminates the root node so that the number of actions required to construct it can be reduced significantly.
Following Dyer et al. (2015), we employ Stack-LSTM to represent the system's state, which consists of the states of input, stack and action history, in a continuous space incrementally. The (partially) processed nested mentions in the stack are encoded with recursive neural networks (Socher et al., 2013) where composition functions are used to capture dependencies between nested mentions. Based on the observation that letter-level patterns such as capitalization and prefixes can be beneficial in detecting mentions, we incorporate a character-level LSTM to capture such morphological information. Meanwhile, this character-level component can also help deal with the out-of-vocabulary problem of neural models. We conduct experiments on three standard datasets. Our system achieves state-of-the-art performance on the ACE datasets and comparable performance on the GENIA dataset.
[1] We make our implementation available at https://github.com/berlino/nest-trans-em18.
[2] Mentions are defined as references to entities that could be named, nominal or pronominal (Florian et al., 2004).

Related Work
Entity mention recognition with nested structures was first explored with rule-based approaches (Zhou, 2006), where the authors first detected the innermost mentions and then relied on rule-based post-processing methods to identify outer mentions. McDonald et al. (2005) proposed a structured multi-label model to represent overlapping segments in a sentence, but it came with a cubic time complexity in the number of words. Alex et al. (2007) proposed several ways to combine multiple conditional random fields (CRF) (Lafferty et al., 2001) for such tasks. Their best results were obtained by cascading several CRF models in a specific order, with each model responsible for detecting mentions of a particular type. However, such an approach cannot model nested mentions of the same type, which frequently appear. Lu and Roth (2015) and Muis and Lu (2017) proposed new representations, mention hypergraphs and mention separators respectively, to model overlapping mentions. However, the nested structure is not guaranteed in such approaches since overlapping structures additionally include crossing structures [3], which rarely exist in practice (Lu and Roth, 2015). Also, their representations did not model the dependencies between nested mentions explicitly, which may limit their performance. In contrast, the chart-based parsing method (Finkel and Manning, 2009) can capture the dependencies between nested mentions with composition rules which allow an outer entity to be influenced by its contained entities. However, its cubic time complexity makes it hard to scale to large datasets.
As neural network based approaches have proven effective in entity or mention recognition (Collobert et al., 2011; Lample et al., 2016; Huang et al., 2015; Chiu and Nichols, 2016; Ma and Hovy, 2016), recent efforts focus on incorporating neural components for recognizing nested mentions. Ju et al. (2018) dynamically stacked multiple LSTM-CRF layers (Lample et al., 2016), detecting mentions in an inside-out manner until no outer entities are extracted. Katiyar and Cardie (2018) used recurrent neural networks to extract features for a hypergraph which encodes all nested mentions based on the BILOU tagging scheme.

Model
Specifically, given a sequence of words $\{x_0, x_1, \ldots, x_n\}$, the goal of our system is to output a set of mentions where nested structures are allowed. We use the forest structure to model the nested mentions scattered in a sentence, as shown in Figure 1. The mapping is straightforward: each outermost mention forms a tree where the mention is the root and its contained mentions correspond to constituents of the tree [4].
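The mapping from mentions to a forest can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `build_forest` function and the span convention (inclusive start, exclusive end, plus a label) are hypothetical, and the input is assumed to contain no crossing mentions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    start: int   # index of the first word (inclusive)
    end: int     # index after the last word (exclusive)
    label: str
    children: List["Node"] = field(default_factory=list)


def build_forest(mentions: List[Tuple[int, int, str]]) -> List[Node]:
    """Turn nested (non-crossing) mention spans into a list of trees."""
    # Visit spans left to right, longer spans first when starts are equal,
    # so every mention is seen after the mentions that contain it.
    ordered = sorted(mentions, key=lambda m: (m[0], -m[1]))
    roots: List[Node] = []
    open_nodes: List[Node] = []  # mentions that may still contain later ones
    for start, end, label in ordered:
        node = Node(start, end, label)
        # Drop open mentions that do not contain the current span.
        while open_nodes and not (
            open_nodes[-1].start <= start and end <= open_nodes[-1].end
        ):
            open_nodes.pop()
        if open_nodes:
            open_nodes[-1].children.append(node)  # inner mention
        else:
            roots.append(node)                    # outermost mention: new tree
        open_nodes.append(node)
    return roots


# Hypothetical example: a PER mention containing a GPE mention, plus an ORG mention.
forest = build_forest([(1, 3, "GPE"), (0, 5, "PER"), (6, 7, "ORG")])
print([(t.label, [c.label for c in t.children]) for t in forest])
# [('PER', ['GPE']), ('ORG', [])]
```

Words outside any mention are not represented here; in the transition system they would simply remain single-word elements of the forest.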

Shift-Reduce System
Our transition-based model is based on the shift-reduce parser for constituency parsing (Watanabe and Sumita, 2015), which in turn follows Zhang and Clark (2009) and Sagae and Lavie (2005). Generally, our system employs a stack to store (partially) processed nested elements. The system's state is defined as $[S, i, A]$, denoting the stack, the buffer front index and the action history respectively. In each step, an action is applied to change the system's state.
Our system consists of three types of transition actions, which are also summarized in Figure 2:
• SHIFT pushes the next word from the buffer to the stack.
• REDUCE-X pops the top two items $t_0$ and $t_1$ from the stack and combines them as a new tree element $\{X \to t_0 t_1\}$, which is then pushed onto the stack.
• UNARY-X pops the top item $t_0$ from the stack and constructs a new tree element $\{X \to t_0\}$, which is pushed back to the stack.
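The three actions can be replayed with a small interpreter. The sketch below is a hedged illustration rather than the released implementation: the function name `run_transitions`, the tuple encoding of tree elements and the example sentence are assumptions, and the older of the two popped items is placed on the left so that word order is preserved.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple]  # a word, (label, child) or (label, left, right)


def run_transitions(words: List[str], actions: List[str]) -> List[Tree]:
    """Replay an action sequence and return the final stack (a forest)."""
    stack: List[Tree] = []    # S: (partially) processed elements
    i = 0                     # i: index of the buffer front
    history: List[str] = []   # A: actions taken so far
    for action in actions:
        if action == "SHIFT":
            stack.append(words[i])
            i += 1
        elif action.startswith("REDUCE-"):
            label = action[len("REDUCE-"):]
            top = stack.pop()      # t0
            second = stack.pop()   # t1
            # Assumption: keep sentence order, so the older item is the left child.
            stack.append((label, second, top))
        elif action.startswith("UNARY-"):
            label = action[len("UNARY-"):]
            stack.append((label, stack.pop()))
        else:
            raise ValueError(f"unknown action: {action}")
        history.append(action)
    return stack


# Hypothetical sentence; "$" is the auxiliary end symbol.
words = ["the", "US", "president", "spoke", "$"]
actions = ["SHIFT", "SHIFT", "UNARY-GPE", "REDUCE-PER*",
           "SHIFT", "REDUCE-PER", "SHIFT", "SHIFT"]
final_stack = run_transitions(words, actions)
print(final_stack)
```

The final stack holds three elements: the PER tree (with the nested GPE mention), the plain word "spoke" and the end symbol, which is a forest rather than a single rooted tree.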
Since the shift-reduce system assumes unary and binary branching, we binarize the trees in each forest in a left-branching manner. For example, if three consecutive words A, B, C are annotated as Person, we convert the annotation into a binary tree $\{\mathit{Person} \to \{\mathit{Person}^* \to A, B\}, C\}$, where $\mathit{Person}^*$ is a temporary label for $\mathit{Person}$. Hence, the X in reduce actions also includes such temporary labels.
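A minimal sketch of left-branching binarization follows. The function name `binarize_left` and the tuple encoding are illustrative assumptions; the temporary label is formed by appending `*` to the mention label, as in the Person example.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple]  # a word, (label, child) or (label, left, right)


def binarize_left(label: str, children: List[Tree]) -> Tree:
    """Left-branching binarization of one mention with the given children."""
    if len(children) == 1:
        return (label, children[0])  # unary node, nothing to binarize
    node = children[0]
    for k in range(1, len(children)):
        is_last = k == len(children) - 1
        # Intermediate nodes get the temporary label, the final one the real label.
        node = (label if is_last else label + "*", node, children[k])
    return node


print(binarize_left("Person", ["A", "B", "C"]))
# ('Person', ('Person*', 'A', 'B'), 'C')
```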
Note that since most words are not contained in any mention, they are only shifted to the stack and not involved in any reduce or unary actions. An example sequence of transitions can be found in Figure 3. Our shift-reduce system is different from previous parsers in terms of the terminal state. 1) It does not require the terminal stack to be a rooted tree. Instead, the final stack should be a forest consisting of multiple nested elements with tree structures. 2) To conveniently determine the ending of our transition process, we add an auxiliary symbol $ to each sentence. Once it is pushed to the stack, it implies that all deductions of actual words are finished. Since we do not allow unary rules between labels like X1 → X2, the maximal length of the action sequence is 3n [5].
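One way to obtain a gold action sequence from a binarized forest is a post-order traversal, sketched below. This is an assumption about how such an oracle could be written, not a description of the authors' code; the names `tree_actions` and `forest_actions`, the tuple encoding and the example forest are hypothetical.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple]  # a word, (label, child) or (label, left, right)


def tree_actions(tree: Tree) -> List[str]:
    """Post-order traversal of one binarized forest element."""
    if isinstance(tree, str):
        return ["SHIFT"]                      # a plain word is only shifted
    if len(tree) == 2:
        label, child = tree
        return tree_actions(child) + ["UNARY-" + label]
    label, left, right = tree
    return tree_actions(left) + tree_actions(right) + ["REDUCE-" + label]


def forest_actions(forest: List[Tree]) -> List[str]:
    """Actions for a whole sentence, ending with the shift of the symbol '$'."""
    actions: List[str] = []
    for element in forest:
        actions += tree_actions(element)
    return actions + ["SHIFT"]


# Hypothetical forest over n = 4 words: one PER tree and one word outside any mention.
forest = [("PER", ("PER*", "the", ("GPE", "US")), "president"), "spoke"]
actions = forest_actions(forest)
print(actions)
print(len(actions) <= 3 * 4)  # the sequence stays within the 3n bound
```

For this example the sequence has 8 actions, well below the worst case of 12 for a four-word sentence, because most words take part in a shift only.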

Action Constraints
To make sure that each action sequence is valid, we impose hard constraints on the action to take. For example, a reduce action can only be performed when there are at least two elements in the stack. Please see the Appendix for the full list of restrictions. Formally, we use $V(S, i, A)$ to denote the valid actions given the parser state. Let us denote the feature vector for the parser state at time step $k$ as $p_k$. The distribution of actions is computed as follows:
$$p(z_k \mid p_k) = \frac{\exp(w_{z_k}^\top p_k + b_{z_k})}{\sum_{z' \in V(S, i, A)} \exp(w_{z'}^\top p_k + b_{z'})} \qquad (1)$$
where $w_z$ is a column weight vector for action $z$, and $b_z$ is a bias term.
[5] In this case, each word is shifted (n) and involved in a unary action (n). Then all elements are reduced to a single node (n − 1). The last action is to shift the symbol \$.
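Restricting the action distribution to valid actions amounts to a softmax over a subset of scores. The sketch below illustrates this in plain Python under stated assumptions: only the reduce constraint comes from the text, while the shift and unary conditions, the function names and the example scores are hypothetical, and `scores[z]` stands in for the score of action z.

```python
import math
from typing import Dict, List


def valid_actions(stack_size: int, buffer_size: int, labels: List[str]) -> List[str]:
    """A few illustrative constraints; the full list is in the paper's appendix."""
    actions: List[str] = []
    if buffer_size > 0:                       # assumption: need a word to shift
        actions.append("SHIFT")
    if stack_size >= 1:                       # assumption: need an item to relabel
        actions += ["UNARY-" + label for label in labels]
    if stack_size >= 2:                       # stated in the text
        actions += ["REDUCE-" + label for label in labels]
    return actions


def action_distribution(scores: Dict[str, float], valid: List[str]) -> Dict[str, float]:
    """Softmax restricted to the valid actions."""
    highest = max(scores[z] for z in valid)   # subtract the max for stability
    exps = {z: math.exp(scores[z] - highest) for z in valid}
    total = sum(exps.values())
    return {z: value / total for z, value in exps.items()}


valid = valid_actions(stack_size=1, buffer_size=2, labels=["PER"])
probs = action_distribution(
    {"SHIFT": 1.0, "UNARY-PER": 0.0, "REDUCE-PER": 5.0}, valid
)
print(probs)  # REDUCE-PER is excluded even though its raw score is highest
```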

Neural Transition-based Model
We use neural networks to learn the representation of the parser state, which is $p_k$ in (1).

Representation of Words
Words are represented by concatenating three vectors:
$$e_{x_i} = [e_{w_i}, e_{p_i}, c_{w_i}]$$
where $e_{w_i}$ and $e_{p_i}$ denote the embeddings for the $i$-th word and its POS tag respectively. $c_{w_i}$ denotes the representation learned by a character-level model using a bidirectional LSTM. Specifically, for the character sequence $s_0, s_1, \ldots, s_n$ in the $i$-th word, we use the last hidden states of the forward and backward LSTMs as the character-based representation of this word, as shown below:
$$c_{w_i} = [\overrightarrow{\mathrm{LSTM}}_c(s_0, \ldots, s_n), \overleftarrow{\mathrm{LSTM}}_c(s_0, \ldots, s_n)]$$

Representation of Parser States
Generally, the buffer and action history are encoded using two vanilla LSTMs (Graves and Schmidhuber, 2005). For the stack, which involves popping out top elements, we use the Stack-LSTM (Dyer et al., 2015) to encode it efficiently. Formally, if the unprocessed word sequence in the buffer is $x_i, x_{i+1}, \ldots, x_n$ and the action history sequence is $a_0, a_1, \ldots, a_{k-1}$, then we can compute the buffer representation $b_k$ and the action history representation $a_k$ at time step $k$ as follows:
$$b_k = \overleftarrow{\mathrm{LSTM}}_b[e_{x_i}, \ldots, e_{x_n}], \qquad a_k = \overrightarrow{\mathrm{LSTM}}_a[e_{a_0}, \ldots, e_{a_{k-1}}]$$
where $e_{x_j}$ is the word representation of $x_j$ and each action is also mapped to a distributed representation $e_a$ [6]. For the state of the stack, we also use an LSTM to encode a sequence of tree elements. However, the top elements of the stack are updated frequently. Stack-LSTM provides an efficient implementation that incorporates a stack pointer [7]. Formally, the state of the stack $s_k$ at time step $k$ is computed as:
$$s_k = \text{Stack-LSTM}[h_{t_m}, \ldots, h_{t_0}]$$
where $h_{t_i}$ denotes the representation of the $i$-th tree element from the top, which can be computed recursively, similarly to a Recursive Neural Network (Socher et al., 2013), as follows:
$$h_{\text{parent}} = W_{u,l}\, h_{\text{child}}, \qquad h_{\text{parent}} = W_{b,l}\, [h_{\text{lchild}}, h_{\text{rchild}}]$$
where $W_{u,l}$ and $W_{b,l}$ denote the weight matrices for unary ($u$) and binary ($b$) composition with the parent node being label $l$. Note that the composition function is distinct for each label $l$. Recall that the leaf nodes of each tree element are raw words. Instead of representing them with their original embeddings introduced in Section 3.3, we found during our initial experiments that concatenating them with the buffer state is beneficial. Formally, when a word $x_i$ is shifted to the stack at time step $k$, its representation is computed from the concatenation of its word representation $e_{x_i}$ and the buffer state $b_k$. Finally, the state of the system $p_k$ is the concatenation of the states of buffer, stack and action history:
$$p_k = [b_k, s_k, a_k]$$

Training
We employ a greedy strategy to maximize the log-likelihood of the local action classifier in (1). Specifically, let $z_{ik}$ denote the $k$-th action for the $i$-th sentence; the loss function with $\ell_2$ norm is:
$$\mathcal{L} = -\sum_i \sum_k \log p(z_{ik}) + \frac{\lambda}{2} \lVert \theta \rVert^2$$
where $\lambda$ is the $\ell_2$ coefficient and $\theta$ denotes the model parameters.
[6] Note that $\mathrm{LSTM}_b$ runs in a right-to-left order such that the output can represent the contextual information of $x_i$.
[7] Please refer to Dyer et al. (2015) for details.
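To make the training objective concrete, the following sketch computes a regularized negative log-likelihood from the probabilities that the local classifier assigns to the gold actions. The function name `regularized_nll`, the flat parameter list and the scaling of the penalty by λ/2 are assumptions made for illustration; a real implementation would operate on tensors and an optimizer.

```python
import math
from typing import List


def regularized_nll(gold_probs: List[List[float]], params: List[float], lam: float) -> float:
    """Negative log-likelihood of gold actions plus an L2 penalty.

    gold_probs[i][k] is the probability assigned to the k-th gold action
    of the i-th sentence; params is a flat list of model parameters.
    """
    nll = -sum(math.log(p) for sentence in gold_probs for p in sentence)
    l2 = sum(w * w for w in params)
    return nll + 0.5 * lam * l2   # assumption: penalty scaled by lambda / 2


# Two hypothetical sentences with three gold actions in total.
loss = regularized_nll([[0.9, 0.8], [0.5]], params=[0.1, -0.2], lam=0.01)
print(round(loss, 4))
```

Because the loss sums over individual actions, each decision is trained locally, which matches the greedy strategy described above.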

Experiments
We mainly evaluate our models on the standard ACE-04, ACE-05 (Doddington et al., 2004), and GENIA (Kim et al., 2003) datasets with the same splits used by previous research efforts (Lu and Roth, 2015;Muis and Lu, 2017). In ACE datasets, more than 40% of the mentions form nested structures with some other mention. In GENIA, this number is 18%. Please see Lu and Roth (2015) for the full statistics.

Setup
Pre-trained GloVe embeddings (Pennington et al., 2014) of dimension 100 are used to initialize the word vectors for all three datasets [9]. The embeddings of POS tags are initialized randomly with dimension 32. The model is trained using Adam (Kingma and Ba, 2014) with a gradient clipping of 3.0. Early stopping is used based on the performance on the development sets. Dropout (Srivastava et al., 2014) is used after the input layer. The $\ell_2$ coefficient $\lambda$ is also tuned during the development process.

Results
The main results are reported in Table 1. Our neural transition-based model achieves the best results on the ACE datasets and comparable results on the GENIA dataset in terms of $F_1$ measure. We hypothesize that the performance gain of our model compared with other methods is largely due to improved performance on the portions of nested mentions in our datasets. To verify this, we design an experiment to evaluate how well a system can recognize nested mentions.

Handling Nested Mentions
The idea is that we split the test data into two portions: sentences with and without nested mentions. The results on GENIA are listed in Table 2. We can observe that the margin of improvement is more significant in the portion with nested mentions, revealing our model's effectiveness in handling nested mentions. This observation helps explain why our model achieves greater improvement on ACE than on GENIA in Table 1, since the former has many more nested structures than the latter. Moreover, the model of Ju et al. (2018) performs better on non-nested mentions, possibly due to the CRF they used, which globally normalizes each stacked layer.
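The split into the two portions can be sketched as follows. This is a hypothetical helper, not the evaluation script used for Table 2: the names `has_nested_mentions` and `split_by_nesting` are assumptions, each sentence is represented only by its mention spans (inclusive start, exclusive end), and a mention counts as nested when it lies inside another mention of the same sentence.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end exclusive


def has_nested_mentions(spans: List[Span]) -> bool:
    """True if some mention lies inside another mention of the same sentence."""
    for a, (a_start, a_end) in enumerate(spans):
        for b, (b_start, b_end) in enumerate(spans):
            if a != b and a_start <= b_start and b_end <= a_end:
                return True
    return False


def split_by_nesting(sentences: List[List[Span]]):
    """Return (sentences with nested mentions, sentences without)."""
    nested = [s for s in sentences if has_nested_mentions(s)]
    flat = [s for s in sentences if not has_nested_mentions(s)]
    return nested, flat


# Three hypothetical sentences: nested, flat, and without any mention.
sentences = [[(0, 5), (1, 3)], [(0, 2), (3, 4)], []]
nested, flat = split_by_nesting(sentences)
print(len(nested), len(flat))
```

Precision, recall and $F_1$ would then be computed separately on each portion.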

Decoding Speed
Note that the models of Lu and Roth (2015) and Muis and Lu (2017) also feature linear-time complexity, but with a greater constant factor. To compare the decoding speed, we re-implemented their models on the same platform (PyTorch) and ran them on the same machine (CPU: Intel i5 2.7GHz). Our model turns out to be around 3-5 times faster than theirs, showing its scalability.
[9] We also additionally tried using embeddings trained on PubMed for GENIA but the performance was comparable.

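[Table 2: GENIA results on sentences with nested mentions (Nested) and without (Non-Nested); columns P, R, $F_1$ for each portion; first row: Lu and Roth (2015).]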

Ablation Study
To evaluate the contribution of neural components including pre-trained embeddings, the character-level LSTM and dropout layers, we test the performance of ablated models. The results are listed in Table 1. From the performance gap, we can conclude that these components contribute significantly to the effectiveness of our model on all three datasets.

Conclusion and Future Work
In this paper, we present a transition-based model for nested mention recognition using a forest representation. Coupled with Stack-LSTM for representing the system's state, our neural model can capture dependencies between nested mentions efficiently. Moreover, the character-based component helps capture letter-level patterns in words.
The system achieves state-of-the-art performance on the ACE datasets. One potential drawback of the system is its greedy training and decoding. We believe that alternatives like beam search and training with exploration (Goldberg and Nivre, 2012) could further boost the performance. Another direction that we plan to work on is to apply this model to recognizing overlapping entities and entities that involve discontinuous spans (Muis and Lu, 2016), which frequently exist in the biomedical domain.