Semantic Frame Labeling with Target-based Neural Model

This paper explores automatically learning distributed representations of a target's context for semantic frame labeling with a target-based neural model. We constrain the model's input to the whole sentence, without extracting features from it. This differs from many previous works, in which local feature extraction around the target is widely used. The constraint makes the task harder, especially for long sentences, but also makes our model easily applicable to a range of resources and similar tasks. We evaluate our model on several resources and obtain the state-of-the-art result on subtask 2 of SemEval-2015 task 15. Finally, we extend the task to word sense disambiguation and also achieve a strong result in comparison with state-of-the-art work.


Introduction and Related Work
Semantic frame labeling is the task of selecting the correct frame for a given target based on its semantic scene. A target, often called a lexical unit, evokes the corresponding semantic frame; it can be a verb, adjective or noun. Generally, a semantic frame describes how the lexical unit is used and specifies its characteristic interactions. There are many semantic frame resources, such as FrameNet (Baker et al., 1998), VerbNet (Schuler, 2006), PropBank (Palmer et al., 2005) and Corpus Pattern Analysis (CPA) frames (Hanks, 2012). However, most existing frame resources are manually created, which is time-consuming and expensive. Automatic semantic frame labeling can lead to the development of a broader range of resources.

Early works on semantic frame labeling mainly focus on the FrameNet, PropBank and VerbNet resources, but most of them address only one resource and rely heavily on feature engineering (e.g., Honnibal and Hawker 2005; Abend et al. 2008). Recently, there has been work on learning CPA frames based on a new semantic frame resource, the Pattern Dictionary of English Verbs (PDEV) (El Maarouf and Baisa, 2013; El Maarouf et al., 2014). These two works also rely on hand-crafted features and are tested on only 25 verbs. Most works aim at constructing context representations of the target with explicit rules based on basic features, e.g., part-of-speech (POS) tags, named entities (NE) and dependency relations related to the target. Recently, some deep learning models have been applied with dependency features: Hermann et al. (2014) used the direct dependents and the dependency path to build the context representation from distributed word embeddings on English FrameNet, and, inspired by that work, Zhao et al. (2016) used a deep feed-forward neural network on Chinese FrameNet with similar features.
Our goal is different: we want to explore an appropriate deep learning architecture that constructs context representations without complex rules. Feng et al. (2016) used a multilayer perceptron (MLP) model on CPA frames without extra feature extraction, but the model is quite simple and requires a fixed input window, which is not convenient.
In this paper, we present a target-based neural model which takes the whole target-specific sentence as input and outputs the semantic frame label. Our goal is to keep the model light, without explicit rules for constructing context representations, and applicable to a range of resources. To cope with variable-length sentences under our constraint, a simple idea is to use recurrent neural networks (RNN) to process the sentences. However, noise caused by irrelevant words in long sentences may hinder learning. In fact, the arguments related to the target are usually distributed near the target, because when we write or speak we focus mainly on arguments in the immediate context of a core word. We therefore use two RNNs, each of which processes one part of the sentence split by the target. The model takes the target as its center, so we call it the target-based recurrent network (TRNN). TRNN itself is not especially novel, but to our knowledge no previous research has focused on this topic. We will show that TRNN is quite suitable for learning the context of the target.

Figure 1: Architecture of TRNN with an example sentence whose target word is in bold.
In our model we use long short-term memory (LSTM) networks, a type of RNN designed to avoid vanishing and exploding gradients. The overall structure is illustrated in Figure 1. $w_t$ is the $t$-th word in a sentence of length $T$, and $target$ is the index of the target word. $x_t$ is obtained by mapping $w_t$ to a fixed vector through well pre-trained word vectors. The model has two LSTMs, each of which processes one part of the sentence split by the target. The model can automatically learn the distributed representation of the target's context from $w$ with little manual design.
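The input handling can be sketched as follows. This is an illustrative Python sketch under our reading of the architecture, not the released implementation; the function name and data layout are ours. The left LSTM reads the words before the target left-to-right, while the right LSTM reads the words from the end of the sentence back to the target, so both directions terminate next to or at the target word.

```python
import numpy as np

# Illustrative sketch (not the authors' code): split a target-specific
# sentence into the two spans consumed by the two LSTMs.
# `target` is the 0-based index of the target word here.
def split_at_target(words, target, embeddings, dim=300):
    # Words missing from the vector model map to zero vectors,
    # as described in the training setup.
    vecs = [embeddings.get(w, np.zeros(dim)) for w in words]
    left = vecs[:target]            # w_1 .. w_{target-1}, left to right
    right = vecs[target:][::-1]     # w_T .. w_target, right to left
    return left, right
```

Because both sequences end at the target, the final hidden states of the two LSTMs summarize the target's local context.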

Context Representations
An introduction to LSTM can be found in the work of Hochreiter and Schmidhuber (1997). The parameters of LSTM are $W_{x*}$, $W_{h*}$ and $b_*$, where $*$ stands for one of the internal gates. $W_{x*}$ is the matrix between the input vector $x_t$ and the gates, $W_{h*}$ is the matrix between the LSTM output $h_t$ and the gates, and $b_*$ is the bias vector on the gates. The formulas of LSTM are:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$h_t = o_t \odot \tanh(c_t)$$

where $\sigma$ is the sigmoid function and $\odot$ represents element-wise multiplication. $i_t$, $f_t$, $c_t$ and $o_t$ are the outputs of the input gates, forget gates, cell states and output gates, respectively. In our model, the two LSTMs share the same parameters. Finally, the target's context representation $cr$ is the sum of the outputs of the two LSTMs:

$$cr = h^{left}_{target-1} + h^{right}_{target}$$

The dimension of $cr$ is determined by the number of hidden units in the LSTM, a hyperparameter of our model, and is usually much lower than that of a single word vector.

Here we give some intuition behind the above formulas. The gradients from the last layer flow equally into the $(target-1)$-th LSTM box and the $target$-th LSTM box, and the two flows then proceed toward both ends of the sentence. As is common in deep learning models, gradients usually become ineffective as the depth of the flow increases, especially when the sentence is very long. Words far from the target therefore receive less gradient impact than those near the target; as a whole, more data are usually required to learn arguments far from the target than those near it. If the real arguments are distributed near the target, this model is suitable, since its architecture is designed to take care of the local context of the target.
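The gate equations and the shared-parameter context representation above can be written out directly. The following numpy sketch is illustrative (parameter dictionary layout and function names are ours); it implements one LSTM step per the formulas and sums the final outputs of the two directions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, P):
    # P holds W_x*, W_h* and b_* for each gate * in {i, f, o, c}.
    i = sigmoid(P["Wxi"] @ x + P["Whi"] @ h_prev + P["bi"])
    f = sigmoid(P["Wxf"] @ x + P["Whf"] @ h_prev + P["bf"])
    o = sigmoid(P["Wxo"] @ x + P["Who"] @ h_prev + P["bo"])
    c = f * c_prev + i * np.tanh(P["Wxc"] @ x + P["Whc"] @ h_prev + P["bc"])
    h = o * np.tanh(c)              # * is element-wise multiplication
    return h, c

def encode(seq, P, hidden):
    # Run one LSTM over a sequence of input vectors; return final output.
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in seq:
        h, c = lstm_step(x, h, c, P)
    return h

def context_representation(left, right, P, hidden):
    # Both directions share the same parameters P; cr is the sum of
    # their final outputs.
    return encode(left, P, hidden) + encode(right, P, hidden)
```

Note that `hidden` (the number of hidden units) fixes the dimension of `cr` regardless of sentence length.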

Output Layer
We use a softmax layer as the output layer on top of the context representation. The output layer computes a probability distribution over the semantic frame labels. During training, the cost we minimize is the negative log-likelihood of the model:

$$C = -\frac{1}{M} \sum_{m=1}^{M} \log p(t_m)$$

where $M$ is the number of training sentences, $t_m$ is the index of the correct frame label for the $m$-th sentence and $p$ is the probability assigned by the model.
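A minimal numpy sketch of this output layer and cost (illustrative function names, assuming unnormalized scores come from a linear layer over $cr$):

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def negative_log_likelihood(scores, gold):
    # scores[m]: unnormalized label scores for the m-th sentence;
    # gold[m]: index t_m of its correct frame label.
    M = len(scores)
    return -sum(np.log(softmax(s)[t]) for s, t in zip(scores, gold)) / M
```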

Datasets
We divide the datasets into two types: per-target and non-per-target. Per-target semantic frame resources define a different set of frame labels for each target, and we train one model per target; in non-per-target resources, different targets may share semantic frame labels, and we train a single model for the whole resource. We use the Semlink project to create our datasets 1. Semlink aims to link different lexical resources via a set of mappings; we use its corpus, which annotates FrameNet and PropBank frames on the WSJ section of the Penn Treebank. Another resource we use is PDEV 2, which is quite new and has CPA-frame-annotated examples from the British National Corpus. All the original instances are sentence-tokenized and punctuation is removed. The details of creating the datasets are as follows:
• FrameNet: Non-per-target type. We obtain FrameNet-annotated instances through Semlink. If a FrameNet frame label has more than 300 instances, we divide them proportionately into 70%, 20% and 10%. We then accumulate the three parts over all frame labels to create the training, test and validation sets.
• PropBank: Per-target type. The creation process is the same as for FrameNet, except that we obtain training, test and validation sets for each target and the cutoff is set to 70 instead of 300.
• PDEV: Same as PropBank but with the cutoff set to 100 instead of 70.
Since the performance of our model is largely determined by the training data, we empirically chose the cutoffs above to ensure that each label has enough instances. Summary statistics of the datasets are given in Table 2.
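The dataset construction described above can be sketched as follows (an illustrative sketch; the shuffling and exact rounding are our assumptions, only the cutoff and the 70/20/10 proportions come from the text):

```python
import random

def build_splits(instances_by_label, cutoff, seed=0):
    # For each frame label with at least `cutoff` instances, split its
    # instances 70%/20%/10% and accumulate per label into the
    # training, test and validation sets.
    rng = random.Random(seed)
    train, test, valid = [], [], []
    for label, insts in instances_by_label.items():
        if len(insts) < cutoff:
            continue                      # drop labels with too few instances
        insts = insts[:]
        rng.shuffle(insts)
        a, b = int(0.7 * len(insts)), int(0.9 * len(insts))
        train += insts[:a]
        test += insts[a:b]
        valid += insts[b:]
    return train, test, valid
```

With a cutoff of 300 this yields the FrameNet splits; cutoffs of 70 and 100 yield the per-target PropBank and PDEV splits, applied per target.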

Models and Training
We compare our model with the following baselines.

1 The current version of the Semlink project has some problems getting the right positions of targets in the WSJ section of the Penn Treebank. Instead, we use the annotations of the PropBank corpus, also on the WSJ section of the Penn Treebank, to index targets.
2 http://pdev.org.uk/

Table 1: Example sentences with their frame names.
Sentence | Frame Name
In Moscow they kept asking us things like why do you make 15 different corkscrews | Activity_ongoing
It said it has taken measures to continue shipments during the work stoppage. | Activity_ongoing
But the Army Corps of Engineers expects the river level to continue falling this month. | Process_continue
The oil industry's middling profits could persist through the rest of the year. | Process_continue

• MF: The most frequent (MF) method selects the most frequent semantic frame label seen in the training instances for each instance in the test dataset. MF is actually a strong baseline for the per-target datasets because we observed that most targets have one main frame label.
• Target-Only: For the FrameNet dataset, we use the Target-Only method: if the target of a test instance has a unique frame label in the training data, we assign that label; if the target has multiple frame labels in the training data, we select the most frequent of them; if the target is not seen in the training data, we select the most frequent label in the whole training data. This baseline is designed especially for FrameNet because we observed that each frame label has a set of targets but only a few targets have multiple frame labels, so it may be easy to predict the frame label of a test instance from the target alone.
• MaxEnt: The maximum entropy model. We use the Stanford CoreNLP module 3 to extract features for the MaxEnt toolkit 4. All dependents related to the target, their POS tags, dependency relations, lemmas and NE tags, and the target itself are extracted as features.
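The MF and Target-Only baselines are simple enough to state exactly. The sketch below is illustrative (the data layout as `(target, label)` pairs is our assumption), following the rules described above:

```python
from collections import Counter

def mf_baseline(train_labels, n_test):
    # Predict the most frequent training label for every test instance.
    most_frequent = Counter(train_labels).most_common(1)[0][0]
    return [most_frequent] * n_test

def target_only_baseline(train_pairs, test_targets):
    # train_pairs: (target, frame_label) tuples from the training data.
    per_target = {}
    for tgt, lab in train_pairs:
        per_target.setdefault(tgt, Counter())[lab] += 1
    global_mf = Counter(lab for _, lab in train_pairs).most_common(1)[0][0]
    preds = []
    for tgt in test_targets:
        if tgt in per_target:
            # Unique label, or the most frequent among the target's labels.
            preds.append(per_target[tgt].most_common(1)[0][0])
        else:
            preds.append(global_mf)      # unseen target
    return preds
```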
The number of iterations for MaxEnt is decided on the validation set. For simplicity, we set the learning rate to 1.0 for TRNN and LSTM. The number of hidden units is selected on validation data from {35, 45, 55} for per-target resources and {80, 100, 120} for non-per-target resources. We use publicly available 300-dimensional word vectors trained with the GloVe model (Pennington et al., 2014) on Wikipedia and Gigaword. For words that do not appear in the vector model, the word vectors are set to zero. We train the models by stochastic gradient descent with minibatches; the minibatch size is 10 for per-target resources and 50 for non-per-target resources. We keep the word vectors static, since no obvious improvement was observed from updating them. Training stops when the zero-one loss over the training data reaches zero.

Results
The results on the above datasets are in Table 3. Target-Only achieves very high scores on the FrameNet dataset. The FrameNet dataset has 55 targets with multiple frame labels in the training data, covering 1981 instances in the test data; we obtain 0.769 F-score on these instances and 0.393 F-score on 64 unseen targets with 77 test instances. This can be seen as the extreme case in which the main feature for the correct frame is the target itself. Despite this simple fact, the standard LSTM performs very badly on FrameNet. The main reason is that sentences in the FrameNet dataset are too long, and the standard LSTM cannot learn well due to the large number of irrelevant words in long sentences. To show this, we selected a truncation-window size for the original FrameNet sentences on validation data; the best size is 5, i.e., two words on each side of the target. With this truncation we obtain 0.958 F-score on the FrameNet test data, which is still lower than TRNN on full sentences. For the PropBank and PDEV datasets we train one model per target, so the final F-score is the average over all targets; however, the number of training instances per target is limited. TRNN usually does not perform well when it tries to learn frames that cover many different concepts, especially when the frame has few training instances. Taking sentence 4 of Table 4 as an example, it is difficult for TRNN to learn what 'Activity' means in the correct frame because this concept is huge; TRNN may need a lot of data to learn something related to it, yet this frame has only 6 instances in our training data. The second reason for TRNN's failures is lack of knowledge due to unseen words in the test data. Sentence 1 of Table 4 shows that TRNN makes the right decision: it has seen the word 'cow' in the training data and knows this word belongs to the concept 'Animate or Plant' in the correct frame.
But TRNN does not know the word 'Elegans' in sentence 3, so it usually falls back on the most frequent frame seen in the training data. However, in many cases unseen words can be captured by well-trained word embeddings, as sentence 2 shows, where 'ducks', 'chickens' and 'geese' are all unseen words.

Table 3: Results on several semantic frame resources. The format of each cell is "F-score/hidden units" for TRNN and LSTM and "F-score/iterations" for the MaxEnt toolkit.
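The truncation-window variant of the LSTM baseline mentioned above is straightforward; a sketch (illustrative function name, using a 0-based target index) with the best window of two words on each side of the target:

```python
def truncate(words, target, half_window=2):
    # Keep the target plus `half_window` words on each side
    # (window size 5 when half_window=2), clipped at the
    # sentence boundaries. Returns the truncated sentence and
    # the target's new index within it.
    lo = max(0, target - half_window)
    hi = min(len(words), target + half_window + 1)
    return words[lo:hi], target - lo
```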

CPA Experiment
Corpus Pattern Analysis (CPA) is a new technique for identifying the main patterns in which a word is used in text and is currently being used to build the PDEV resource mentioned above. It is also a shared task, SemEval-2015 task 15 (Baisa et al., 2015), divided into three subtasks: CPA parsing, CPA clustering and CPA lexicography. We only introduce the first two, which are related. CPA parsing aims at identifying the arguments of the target and tagging them with predefined semantic meanings; CPA clustering clusters the instances to obtain CPA frames based on the result of CPA parsing. However, the results of the first step seem unpromising (Feng et al., 2015; Mills and Levow, 2015; Elia, 2016), which influences the process of obtaining CPA frames. Since our model can be applied to sentence-level input without feature extraction, we can directly evaluate it on the CPA clustering subtask.

Word Sense Disambiguation Experiment
Finally, we choose the word sense disambiguation (WSD) task to extend our experiments. As our benchmark, we choose the English Lexical Sample WSD task of SemEval-2007 task 17 (Pradhan et al., 2007). Using cross-validation on the training set, we observe that the model performs better when we update the word vectors, which differs from the preceding experimental setup. The number of hidden units is set to 55. The results are in Table 6. Rows 4 to 6 come from Iacobacci et al. (2016). They integrate word embeddings into the IMS (It Makes Sense) system (Zhong and Ng, 2010), which uses a support vector machine as its classifier on top of standard WSD features, and they obtain the best result. They use an exponential decay function, also designed to give more importance to close context, to compute the word representation, but their method requires manually choosing the window size around the target word and one parameter of the exponential decay function. With word vectors only, our model is comparable with the system in the sixth row.
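For contrast with TRNN's learned weighting, an exponentially decayed context vector of the kind attributed to Iacobacci et al. (2016) can be sketched as follows. This is our illustrative reconstruction: the decay base `alpha`, the exact decay form and the function name are assumptions; only the idea of down-weighting distant words within a hand-chosen window comes from the text.

```python
import numpy as np

def decay_context(vectors, target, window, alpha=0.5):
    # Weighted sum of word vectors around the target: a word at
    # distance d contributes with weight alpha ** d, and words
    # outside the window (or the target itself) are ignored.
    ctx = np.zeros(len(vectors[0]))
    for i, v in enumerate(vectors):
        d = abs(i - target)
        if i == target or d > window:
            continue
        ctx += (alpha ** d) * np.asarray(v)
    return ctx
```

Unlike TRNN, both `window` and `alpha` must be chosen by hand, which is the inconvenience noted above.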

Conclusion
In this paper, we describe an end-to-end neural model for target-specific semantic frame labeling. Without explicit rule construction tailored to specific resources, our model can be easily applied to a range of semantic frame resources and similar tasks. In the future, non-English semantic frame resources can be considered to extend the coverage of our model, and the model can integrate the best features explored in state-of-the-art work to see how much improvement they bring.