CU-NLP at SemEval-2016 Task 8: AMR Parsing using LSTM-based Recurrent Neural Networks

We describe the system used in our participation in the AMR Parsing task for SemEval-2016. Our parser does not rely on a syntactic pre-parse or heavily engineered features, and uses five recurrent neural networks as the key architectural components for estimating AMR graph structure.


Introduction
Abstract Meaning Representation, or AMR (Banarescu et al., 2012), is a graph-based representation of the meaning of sentences which incorporates linguistic phenomena such as semantic roles, coreference, negation, and more.1 The process of creating AMRs for sentences is called AMR parsing. We used an early version of the system described in this paper to generate our submission to the SemEval-2016 Meaning Representation Parsing task.2 The details of our system will be explained using this example sentence: France plans further nuclear cooperation with numerous countries.
A graphical depiction is shown in Figure 1.
The system extracts features from the sentence which are processed by a form of recurrent neural network called a BDLSTM to create a set of AMR concepts. Features from these concepts are processed by a pair of BDLSTM networks to compute relation probabilities. All concepts are then connected using an iterative, greedy algorithm to compute the set of relations in the AMR. Another two BDLSTM networks compute attribute and name categories to complete the estimation of AMR element probabilities.

1 http://amr.isi.edu/language.html
2 http://alt.qcri.org/semeval2016/task8/

Related Work
Most current AMR parsers assume input that has undergone varying degrees of syntactic analysis, ranging from simple part-of-speech tagging to more complex dependency or phrase-structure analysis (Wang et al., 2015; Vanderwende et al., 2015; Peng et al., 2015; Pust et al., 2015; Artzi et al., 2015; Flanigan et al., 2014; Werling et al., 2015). In contrast, we follow the spirit of minimal feature extraction using pre-trained word embeddings, as in (Collobert et al., 2011), and a recurrent network architecture similar to that described in (Zhou and Xu, 2015).
System Architecture

Feature Extraction
In our system, all features are represented by embedding vectors, trained and stored in lookup tables. Word feature embeddings are mapped from the words in the sentence, and are trained with backpropagation just like other parameters in the network. They are initialized with vectors pre-trained on large corpora of English text; we use the word embeddings from (Collobert et al., 2011).
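A minimal sketch of such a lookup table, in pure Python: the class name and initialization details are our own illustration, not the paper's implementation. Pretrained vectors seed the table, and out-of-vocabulary entries are initialized randomly and then trained like any other parameter.

```python
import random

class EmbeddingTable:
    """Hypothetical sketch: maps feature values (e.g. words) to vectors.
    Known words start from pretrained embeddings; unseen words get
    small random vectors that would be tuned by backpropagation."""
    def __init__(self, dim, pretrained=None, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}
        if pretrained:
            for word, vec in pretrained.items():
                self.table[word] = list(vec)

    def lookup(self, word):
        # Lazily initialize embeddings for out-of-vocabulary words.
        if word not in self.table:
            self.table[word] = [self.rng.uniform(-0.1, 0.1)
                                for _ in range(self.dim)]
        return self.table[word]

# Toy pretrained vectors (invented values, dim=3 for illustration;
# the paper's word embeddings are 50-dimensional).
pretrained = {"france": [0.5, -0.2, 0.1]}
emb = EmbeddingTable(dim=3, pretrained=pretrained)
france_vec = emb.lookup("france")
oov_vec = emb.lookup("cooperation")
```

The same mechanism serves every feature type in the system (words, suffixes, capitalization patterns), differing only in vocabulary size and dimension.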
The only explicit features not derived from the raw input are features based on named entity recognition (NER). We first use the University of Illinois Wikifier to find and classify named entities, and then encode these features as embeddings.

Neural Networks
Unlike relatively simple sequence processing tasks like part-of-speech tagging and NER, semantic analysis requires the ability to keep track of relevant information that may be arbitrarily far away from the words currently under consideration. Fortunately, recurrent neural networks (RNNs) are a class of neural architecture that use a form of short-term memory to solve this semantic distance problem. Basic RNN systems have been enhanced with the use of special memory cell units, referred to as Long Short-Term Memory networks, or LSTMs (Hochreiter and Schmidhuber, 1997). Such systems can effectively process information dispersed over hundreds of words (Schmidhuber et al., 2002; Gers et al., 2001).
Bidirectional LSTM (BDLSTM) networks are LSTMs connected so that both future and past words in the sentence can be examined. We use the LSTM cell described in (Graves et al., 2013), shown in Figure 3, configured in the bidirectional structure of (Zhou and Xu, 2015), shown in Figure 4, as the core network in our system. Five BDLSTM networks comprise our parser.
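The bidirectional wiring can be sketched as follows. This is not a full LSTM: `simple_cell` is a toy recurrent cell standing in for the gated cell of Graves et al. (2013), and only the forward/backward/concatenate pattern reflects the actual architecture.

```python
def simple_cell(x, h):
    # Toy recurrent cell: the new state mixes input and previous state.
    # A real LSTM cell would add input, forget, and output gates.
    return [0.5 * xi + 0.5 * hi for xi, hi in zip(x, h)]

def bidirectional_layer(inputs, dim):
    """Run a recurrent cell over the sequence in both directions and
    concatenate the two hidden states for each position, so every
    output sees both past (fwd) and future (bwd) context."""
    fwd, bwd = [], []
    h = [0.0] * dim
    for x in inputs:                  # left-to-right pass
        h = simple_cell(x, h)
        fwd.append(h)
    h = [0.0] * dim
    for x in reversed(inputs):        # right-to-left pass
        h = simple_cell(x, h)
        bwd.append(h)
    bwd.reverse()                     # realign with word order
    return [f + b for f, b in zip(fwd, bwd)]

# Three "words" with 2-dimensional embeddings (invented values).
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = bidirectional_layer(sent, dim=2)
```

Each output vector has twice the cell dimension, since it concatenates the forward and backward states for that word.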

Level 0 Concepts BDLSTM Network (L0)
The first step in our process is to create the set of concepts (nodes) that form the basis for any AMR representation; we call these Level 0, or L0, concepts. For the most part in current AMR training data, these concepts are in a direct relationship to words and sequences of words in a sentence. The task of the L0 network is, therefore, to take the input sequence of words and produce an output sequence of IOB tags that identify and classify the concepts in the AMR output.
For training, AMR concepts are first aligned to words using an AMR-to-word alignment algorithm. We used the alignment provided in the SemEval dataset. In cases where multiple concepts are associated with the same word, we use only the lower level concept and ignore upper level concept(s).
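The conversion from aligned concept spans to per-word tags can be sketched as below. The span boundaries and label names are illustrative; the actual label inventory comes from the 19 concept types the L0 network predicts.

```python
def spans_to_iob(n_words, aligned_spans):
    """Turn aligned concept spans into IOB tags for training targets.
    aligned_spans: list of (start, end, label), end exclusive.
    Words outside any span are tagged 'O'."""
    tags = ["O"] * n_words
    for start, end, label in aligned_spans:
        tags[start] = "B-" + label          # begin the concept span
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # inside the concept span
    return tags

words = "France plans further nuclear cooperation with numerous countries".split()
# Hypothetical alignments for the example sentence.
spans = [(0, 1, "name"), (1, 2, "pred"), (4, 5, "nonpred")]
tags = spans_to_iob(len(words), spans)
```

At parse time the network's predicted tag sequence is decoded by the inverse of this mapping to recover concept spans.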
The system classifies each L0 concept as predicate or non-predicate, and predicts the PropBank sense for the predicates. AMR concepts are either English words (boy), PropBank framesets (plan-01), or special keywords. A translation table was created from training data by calculating the most probable AMR concept, given the sentence word and the general concept identifier.

Figure 3: LSTM Cell. An "unrolled" representation of an LSTM cell. Rectangles represent linear layers followed by the labelled nonlinearity. Each cell learns how to weigh, or gate, the input, previous cell memory, and output.
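The translation table amounts to a maximum-likelihood lookup over training counts; a minimal sketch, with invented example triples:

```python
from collections import Counter, defaultdict

def build_translation_table(examples):
    """For each (word, concept-type) pair seen in training, keep the
    most frequent AMR concept.
    examples: iterable of (word, concept_type, amr_concept) triples."""
    counts = defaultdict(Counter)
    for word, ctype, concept in examples:
        counts[(word, ctype)][concept] += 1
    # Keep only the most probable concept for each key.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# Invented training triples for illustration.
train = [("plans", "pred", "plan-01"),
         ("plans", "pred", "plan-01"),
         ("plans", "pred", "plan-02"),
         ("boy", "nonpred", "boy")]
table = build_translation_table(train)
```

At parse time, the predicted concept-type tag and the word index into this table to produce the final concept label.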
The most common multilevel cases, a subgraph composed of a named entity, its related category, and a wiki link when available, are identified as exceptions and tagged as name concepts, which will be expanded.3 The features used in the L0 network are:
• word: 130Kx50, the word embedding.
• suffix: 430x5, embedding based on the final two letters of each word.
• caps: 5x5, embedding based on the capitalization pattern of the word.
The L0 network produces probabilities for 19 BIOES tagged concept types, and the highest probability tag is chosen for each word, as shown for the example sentence in Table 1.
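The suffix and caps features reduce each word to a small closed vocabulary before embedding. A sketch of plausible extractors; the five capitalization classes are an assumption based on the 5x5 embedding size above, not the paper's exact inventory:

```python
def suffix_feature(word):
    # Final two letters, lowercased; ~430 distinct values fits the
    # 430x5 embedding table described in the text.
    return word[-2:].lower()

def caps_feature(word):
    # Hypothetical five-way capitalization pattern.
    if word.isupper():
        return "ALLCAPS"
    if word[0].isupper():
        return "INITCAP"
    if any(c.isupper() for c in word):
        return "MIXED"
    if word.islower():
        return "LOWER"
    return "OTHER"
```

Each returned symbol is then looked up in its own small embedding table, exactly as with word embeddings.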

Predicate Argument Relations BDLSTM Network (Args)
The Args Network is run once for each predicate concept, and produces a matrix P_args which defines the probability of each type of argument relation from the source predicate concept to every other L0 concept. Its features include:
• the L0 embedding of the word and surrounding 2 words associated with the source predicate concept.
• regionMark: 21x5, indexed by the distance in words between the word and the word associated with the source predicate concept.

3 For example, France in the shaded section of Figure 1.
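The regionMark feature can be sketched as a signed, clipped distance: the 21 entries suggest distances of -10 to +10 mapped onto embedding indices 0 to 20, though the exact clipping range is our assumption.

```python
def region_mark(word_idx, pred_idx, max_dist=10):
    """Index a word by its signed distance to the predicate word,
    clipped so it fits a 21-entry embedding table."""
    d = word_idx - pred_idx
    d = max(-max_dist, min(max_dist, d))   # clip to [-10, 10]
    return d + max_dist                    # shift to a 0-based index

# For the predicate "plans" at position 1 in the example sentence:
marks = [region_mark(i, 1) for i in range(8)]
```

The resulting index selects one of 21 five-dimensional embeddings, letting the network condition on each word's position relative to the predicate.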

Non-Predicate Relations BDLSTM Network (Nargs)
The Nargs Network uses features similar to the Args Network, is run once for each concept, and produces a matrix P_nargs which defines the probability of a type of relation from an L0 concept to any other L0 concept, prior to the identification of any relations.5 The matrix has dimensions a_n by c, where a_n is the number of non-arg relations to be identified, and c is the total number of concepts.

Attributes BDLSTM Network (Attr)
The Attr Network determines a primary attribute for each concept, if any.6 The attributes (op words) associated with named entities are determined directly during L0 concept identification. This network is simplified to detect only one attribute per concept (there could be many), and only computes probabilities for the most common attributes: TOP, polarity, and quant.

Named Category BDLSTM Network (NCat)
The NCat Network uses features similar to the L0 Network, along with the suggested categories (up to eight) from the Wikifier, and produces probabilities for each of 68 :instance roles, or categories, for named entities identified in the training set AMRs.
• word, suffix, and caps: as in the L0 network.

Wikifier
Named entities in AMR are annotated with a canonical form, using Wikipedia as the standard (see France in Figure 1). A :wiki role, or link, should be provided if an appropriate Wikipedia page exists; a root category (or top-level :instance role) should also be provided. To determine these fields, prior to running the L0 network, we run the sentences through the University of Illinois Wikifier (Ratinov et al., 2011; Cheng and Roth, 2013), which provides a wiki link and a list of possible categories. We insert the link directly as a :wiki role, and use the possible categories as feature inputs to the NCat Network.

Relation Resolution
The generated P_args and P_nargs for each identified L0 concept are processed to determine the most likely relation connections, using the constraints:
1. AMRs are single-component graphs without cycles.
2. AMRs are simple directed graphs; at most one relation between two concepts is allowed.
3. Outgoing predicate relations are limited to one of each kind (i.e., a predicate cannot have two ARG0s).
We apply a greedy algorithm which repeatedly selects the most probable edge from P_args and P_nargs, then adjusts P_args and P_nargs based on the constraints (hard decisions change the probabilities), until all edge probabilities are below a threshold. From then on, only the most probable edges which join separate subgraphs are chosen, until the graph contains a single component.
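The greedy procedure can be sketched as follows. This is a simplification, not the paper's implementation: edges from P_args and P_nargs are pooled into one list of (probability, source, target, relation) tuples, constraint adjustment is modeled as simply rejecting violating edges, and cycle checking treats the graph as undirected via union-find.

```python
def resolve_relations(edges, n_concepts, threshold=0.5):
    """Greedy relation resolution under simplified constraints.
    edges: list of (prob, src, tgt, rel) tuples.
    Returns the selected (src, tgt, rel) triples."""
    chosen = []
    used_pairs = set()          # rule 2: one relation per concept pair
    used_args = set()           # rule 3: one outgoing ARGx per predicate
    parent = list(range(n_concepts))

    def find(i):                # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def allowed(src, tgt, rel):
        if (src, tgt) in used_pairs or (tgt, src) in used_pairs:
            return False
        if rel.startswith("ARG") and (src, rel) in used_args:
            return False
        return find(src) != find(tgt)   # rule 1: no cycles

    def take(src, tgt, rel):
        chosen.append((src, tgt, rel))
        used_pairs.add((src, tgt))
        used_args.add((src, rel))
        parent[find(src)] = find(tgt)

    # Phase 1: accept edges above the threshold, most probable first.
    for prob, src, tgt, rel in sorted(edges, reverse=True):
        if prob >= threshold and allowed(src, tgt, rel):
            take(src, tgt, rel)
    # Phase 2: below threshold, only edges joining separate components.
    for prob, src, tgt, rel in sorted(edges, reverse=True):
        if prob < threshold and allowed(src, tgt, rel):
            take(src, tgt, rel)
    return chosen

# Invented probabilities for concepts 0=plan-01, 1=France, 2=cooperate-01.
edges = [(0.9, 0, 1, "ARG0"), (0.8, 0, 2, "ARG1"), (0.3, 2, 1, "ARG0")]
result = resolve_relations(edges, n_concepts=3)
```

Here the low-probability edge (2, 1) is rejected in phase 2 because its endpoints are already in the same component, so accepting it would create a cycle.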

Results
SemEval Task 8 provides aligned, split datasets. Our Smatch F1 result for the test dataset was 66.1%, and 56.0% for the eval dataset (Table 2). Reportedly, the eval dataset is more challenging than the provided test dataset. The mean of all Task 8 results for the eval dataset is 55% with a standard deviation of 6%; more detail is not yet available.

Conclusion
In this paper, we have described our submission to the AMR Parsing task for SemEval-2016. Our parser does not make use of a syntactic pre-parse, and avoids the use of heavily engineered features. Future work will include expanding the identification of concepts and exploring the use of more sophisticated alignments and word embeddings.