Improved Dependency Parsing using Implicit Word Connections Learned from Unlabeled Data

Pre-trained word embeddings and language models have been shown to be useful in many tasks. However, neither can directly capture word connections in a sentence, which are important for dependency parsing, given that its goal is to establish dependency relations between words. In this paper, we propose to implicitly capture word connections from unlabeled data with a word ordering model that uses a self-attention mechanism. Experiments show that these implicit word connections do improve our parsing model. Furthermore, by combining it with a pre-trained language model, our model achieves state-of-the-art performance on the English PTB dataset, with 96.35% UAS and 95.25% LAS.


Introduction
Dependency parsing is a fundamental task in language processing that aims to establish syntactic relations between words in a sentence. Graph-based models (McDonald et al., 2005; McDonald and Pereira, 2006; Carreras, 2007; Koo and Collins, 2010) and transition-based models (Nivre, 2008; Zhang and Nivre, 2011) are the most successful solutions to this challenge.
Recently, neural network methods have been successfully introduced into dependency parsing. Deep feed-forward neural network models (Chen and Manning, 2014; Pei et al., 2015) were proposed first; they alleviate the heavy burden of feature engineering. LSTM networks (Hochreiter and Schmidhuber, 1997) were then applied to dependency parsing (Dyer et al., 2015; Cross and Huang, 2016; Wang and Chang, 2016; Kiperwasser and Goldberg, 2016; Dozat and Manning, 2016) due to their ability to capture contextual information. Generative neural network models (Dyer et al., 2016; Smith et al., 2017; Choe and Charniak, 2016) also show promising parsing performance. Different from the machine translation task, where massive labeled data sets can be obtained easily, parsing performance is limited by the relatively small size of available treebanks. Vinyals et al. (2015), among others, adopt a tri-training approach to augment the labeled data: they generate large quantities of parse trees by parsing unlabeled data with two existing parsers and selecting only the sentences for which the two parsers produce the same trees. However, the trees produced this way contain noise and tend to come from short sentences, since it is easier for different parsers to agree on them.
Pre-trained neural networks are another way to take advantage of unlabeled data. Pre-trained word embeddings (Mikolov et al., 2013) and language models (Józefowicz et al., 2016; Peters et al., 2017, 2018) have been shown useful in modeling NLP tasks, since word embeddings can capture word semantic information and language models can capture contextual information at the sentence level. However, connections between words in a sentence cannot be directly captured by word embeddings or language models, and these connections are crucial for dependency parsing, given that its goal is to establish dependency relations between words. In this paper, we propose to implicitly model word connections with a word ordering model. The purpose of a word ordering model is to generate a well-formed sentence given a bag of words. Humans can make sentences from unordered words easily because we have syntactic knowledge; thus a model that generates well-formed sentences from a bag of words must encode syntactic information. In addition, the word ordering task allows us to use a self-attention mechanism to model connections between words in the sentence. Different from the tri-training approach, our approach takes advantage of implicit word connections learned by a self-attended word ordering model in an unsupervised way. Experiments show that the pre-trained word ordering model significantly improves our dependency parsing model. Ablation tests also show that the self-attention mechanism is critical. Moreover, by combining the word ordering model and a language model, our graph-based dependency parsing model achieves state-of-the-art performance on the English Penn Treebank (Marcus et al., 1993).

Figure 1: Overview of our word ordering model.

Neural Word Ordering Model
The target of word ordering is to generate a well-formed sentence given a bag of words. To capture the word connections implicit in the sentence, an LSTM-based word ordering model with self-attention is proposed. The self-attention mechanism decides which words in the word bag are more important for generating the next word, improving the ability of our model to capture word connections. As illustrated in Figure 1, the proposed word ordering model consists of two layers:

Encoder Layer
Given a bag of words w_1, w_2, ..., w_n, we encode each word with a character-level BiLSTM (c^wo_{w_{1:n}}), which reduces the number of parameters compared with word embeddings. For the input word at the current time-step (w_i), a self-attention layer is used to align the word with its related words, producing its self-attended vector (sa^wo_{w_i}). The scores s^i_{1:n} in the self-attention explicitly represent the connections between words.
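The exact scoring function of the self-attention layer is not spelled out in this excerpt. A minimal sketch, assuming simple dot-product scoring over the character-level embeddings (an assumption, not the paper's stated formulation):

```python
import numpy as np

def self_attend(char_embs, i):
    """Attend from word i over all words in the bag.

    char_embs: (n, d) array of character-level BiLSTM embeddings c^wo.
    Returns the attention scores s^i_{1:n} (softmax-normalized) and the
    self-attended vector sa^wo_{w_i} as their weighted sum over the bag.
    Dot-product scoring is an illustrative assumption; the word itself
    is included among its own attention targets here.
    """
    scores = char_embs @ char_embs[i]                 # s^i_j = c_j . c_i
    scores = scores - scores.max()                    # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over the bag
    sa = weights @ char_embs                          # weighted sum -> sa^wo_{w_i}
    return weights, sa
```

The normalized scores are exactly the "implicit word connections" the parser later reuses.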
We then concatenate the character-level word embedding (c^wo_{w_i}) and its self-attended vector (sa^wo_{w_i}) into x^wo_i, which is fed into the decoder layer to generate the next word.

Decoder Layer
Given the current input vector (x^wo_i), which contains the current word's information and weighted information from related words, a forward LSTM is used to generate the next word. We initialize the forward LSTM with the average of the input word embeddings (c^wo_{w_{1:n}}), and a virtual token BOS is added as the input of the first LSTM time-step. At each time-step, the hidden state h^wo_i is used to predict the next word. Since the output vocabulary is limited to the bag of words, we only compute scores for the given words (w_{1:n}). To reduce the number of parameters, each output word is represented by a character-level BiLSTM embedding (cd^wo_{w_j}) and a low-dimensional word embedding (wd^wo_{w_j}); M is a matrix projecting a low-dimensional embedding back up to the dimensionality of the LSTM hidden states. The scores so^i_{1:n} are then normalized with a Softmax, and the word with the maximum probability is chosen as the next token.
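The scoring equations themselves are elided in this excerpt. A rough sketch of the bag-restricted output layer, assuming the two output embeddings are combined by summation after projecting the low-dimensional one with M (that combination rule is an assumption):

```python
import numpy as np

def next_word_scores(h, cd, wd, M):
    """Score only the n words in the bag, not a full vocabulary.

    h : (dh,)   decoder hidden state at the current time-step.
    cd: (n, dh) character-level BiLSTM embeddings of the bag words.
    wd: (n, dl) low-dimensional word embeddings.
    M : (dh, dl) projection from the low-dimensional space back up to
                 the hidden-state dimensionality.
    """
    out = cd + wd @ M.T                  # (n, dh) output word representations
    scores = out @ h                     # so^i_{1:n}: one score per bag word
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                 # Softmax over the bag only
    return probs, int(np.argmax(probs))  # next token = max-probability word
```

Restricting the softmax to n candidates (instead of the whole vocabulary) is what makes decoding cheap even with a 1-billion-word pre-training corpus.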
The word ordering model can be trained easily in an unsupervised manner: given a large set of unlabeled sentences, we simply ignore the word order of each sentence and train the model to regenerate the corresponding well-formed sentence. Specifically, we minimize the sum of negative log probabilities of the ground-truth words on the unlabeled data set. Different from a language model, the choice at each decoder step is limited to the bag of words. Moreover, self-attention can be introduced into the word ordering model because the bag of words is known in advance, which allows it to capture the dependency connections between words. We also pre-train a backward word ordering model that generates sentences in reverse order. The forward and backward models share the character-level BiLSTM embeddings, the self-attention layer, and the Softmax layer.
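The unsupervised training recipe above can be sketched in a few lines: each raw sentence yields one (shuffled bag, original order) pair, and the loss sums negative log probabilities of the gold words (the per-step distribution shapes here are placeholders, not the paper's exact implementation):

```python
import math
import random

def make_example(sentence, seed=0):
    """Turn an unlabeled sentence into a word-ordering training pair:
    the input is the shuffled bag of words, the target is the original,
    well-formed order. No annotation of any kind is needed."""
    words = sentence.split()
    bag = words[:]
    random.Random(seed).shuffle(bag)
    return bag, words            # (input bag, gold output sequence)

def order_nll(step_probs, gold_indices):
    """Sum of negative log probabilities of the ground-truth words,
    given one distribution over the (remaining) bag per decoder step."""
    return -sum(math.log(p[g]) for p, g in zip(step_probs, gold_indices))
```

This is what lets the model consume the 1-billion-word corpus later in the paper without any treebank supervision.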
Different from previous word ordering models (Liu et al., 2015; Schmaltz et al., 2016), a self-attention mechanism is introduced into our model to capture word connections. Moreover, our more important goal is to implicitly exploit large-scale unlabeled data to help dependency parsing.

Neural Graph-based Parsing Model
We implement an LSTM-based neural network model as our graph-based dependency parsing baseline, similar to Kiperwasser and Goldberg (2016) and Wang and Chang (2016). As shown in Figure 2, it consists of three layers:

Input Layer
Given an n-word input sentence s with words w_1, w_2, ..., w_n and POS tags p_1, p_2, ..., p_n, the input layer creates a sequence of input vectors x_{1:n}, in which each x_i is a concatenation of the word embedding (e_{w_i}), the POS tag embedding (e_{p_i}), the character-level BiLSTM embedding (c_{w_i}), and the word ordering model pre-trained vector (wo_{w_i}). To get the word ordering model pre-trained vector (wo_{w_i}), the sentence s is fed into the pre-trained word ordering model. Following Peters et al. (2018), we then combine the representations of each layer (l = 0, ..., L) with a Softmax-normalized weight (W^woc) and a scalar parameter (γ). The parameters of the word ordering model are fixed during the training of the parsing model; the weight and scalar parameters, however, are tuned to better adapt to it. The combined output wo_{w_i} contains word connections from self-attention, word information from the character-level embedding, and contextual information from the LSTM.
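The Peters et al. (2018)-style layer combination can be sketched as follows (a minimal numpy version of the ELMo-style scalar mix; the layer count and shapes are illustrative):

```python
import numpy as np

def mix_layers(layer_outputs, w, gamma):
    """Combine the L+1 layer representations of one word from the frozen
    word ordering model into a single vector wo_{w_i}.

    layer_outputs: (L+1, d) stacked per-layer vectors for the word.
    w            : (L+1,)   unnormalized layer weights W^woc (trained
                            with the parser, unlike the frozen layers).
    gamma        : scalar parameter scaling the whole mixture.
    """
    s = np.exp(w - w.max())
    s /= s.sum()                        # Softmax-normalized layer weights
    return gamma * (s @ layer_outputs)  # weighted sum, scaled by gamma
```

With equal weights this reduces to gamma times the mean of the layers, which is the usual initialization behavior of such scalar mixes.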

Encoder & Output Layer
To introduce more contextual information, we encode each input element with deep BiLSTMs: v_i = BiLSTM(x_{1:n}, i). Two different highway networks (Srivastava et al., 2015) are then used to encode head word representations (v^head_{1:n}) and dependent word representations (v^dep_{1:n}). For a head-dependent dependency pair (w_h, w_d), the dependency arc and label scores are computed by two MLP networks. We use the Max-Margin criterion to train our parsing model, the same as Kiperwasser and Goldberg (2016) and Wang and Chang (2016).


Experimental Setup

The POS tagger of Toutanova et al. (2003) is used for assigning POS tags. Following previous work, UAS (unlabeled attachment score) and LAS (labeled attachment score) are calculated excluding punctuation. For the CoNLL 09 English dataset, we follow the standard practice and include all punctuation in the evaluation. We pre-train our word ordering model on the 1 billion word benchmark (Chelba et al., 2014).
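The punctuation-excluding UAS/LAS computation can be made concrete as follows (the set of punctuation POS tags below is the conventional PTB choice and is an assumption, not taken from the paper):

```python
# PTB punctuation POS tags conventionally excluded from evaluation
# (an assumption; the paper does not list them explicitly).
PUNCT = {",", ".", ":", "``", "''"}

def attachment_scores(gold_heads, gold_labels, pred_heads, pred_labels, pos_tags):
    """UAS/LAS for one sentence, excluding punctuation tokens.

    UAS counts tokens whose predicted head is correct; LAS additionally
    requires the predicted dependency label to be correct.
    """
    ua = la = total = 0
    for gh, gl, ph, pl, pos in zip(gold_heads, gold_labels,
                                   pred_heads, pred_labels, pos_tags):
        if pos in PUNCT:
            continue                 # punctuation is not scored on PTB
        total += 1
        if gh == ph:
            ua += 1
            if gl == pl:
                la += 1
    return ua / total, la / total
```

For CoNLL 09 evaluation the `if pos in PUNCT` filter would simply be dropped, matching the "include all punctuation" convention the paper follows there.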

Implementation Details
The graph-based dependency parsing model and the word ordering model are optimized with Adam, with an initial learning rate of 2e-3. The β1 and β2 used in Adam are 0.9 and 0.999, respectively.
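For reference, a single Adam update with these hyper-parameters looks as follows (a textbook numpy sketch, not the paper's training code):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=2e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with the paper's settings (lr=2e-3, beta1=0.9,
    beta2=0.999). m and v are the running first/second moment estimates;
    t is the 1-based step count used for bias correction."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)                       # bias-corrected mean
    v_hat = v / (1 - b2 ** t)                       # bias-corrected variance
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```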
The following hyper-parameters are used in all graph-based dependency parsing models: word embedding size = 300, POS tag embedding size = 32, character embedding size = 50, word-level LSTM hidden vector size = 200, word-level BiLSTM layer number = 3, character-level LSTM hidden vector size = 50, character-level BiLSTM layer number = 2, batch size = 32. We also apply dropout to the input and to each layer with a dropout rate of 0.3. We use pre-trained case-sensitive GloVe embeddings to initialize word embeddings; these word embeddings are fine-tuned with the graph-based dependency parsing model. The parameters of the pre-trained word ordering model are fixed during the training of the dependency parsing model. For the deep BiLSTM, we concatenate the outputs of each layer as its final output.
For our word ordering model: input character-level LSTM hidden vector size = 512, input character-level BiLSTM layer number = 1, word-level LSTM hidden vector size = 1024, word-level LSTM layer number = 2, output character-level LSTM hidden vector size = 512, output character-level BiLSTM layer number = 1, output low-dimensional word embedding size = 64, batch size = 32, dropout for the input and each layer = 0.5.


Main Results & Ablation Study

Table 1 shows the performance of our model and previous work on the two English benchmarks. Our model achieves promising results on both datasets. Two sets of experiments demonstrate the effectiveness of the pre-trained word ordering model. Although our baseline system is similar to Kiperwasser and Goldberg (2016) and Wang and Chang (2016), with only subtle differences in architecture, it performs much better than expected and thus constitutes a very strong baseline. Compared with this baseline, introducing the pre-trained word ordering model achieves a significant improvement (almost 0.6% UAS gains on both datasets, p < 0.001). To further show the effectiveness of the word ordering model, we also implement an even stronger baseline with a pre-trained language model. Compared with this much stronger baseline, incorporating the pre-trained word ordering model still achieves a significant improvement (0.3% UAS gains on both datasets, p < 0.01). We attribute the improvement to the ability of the word ordering model to capture word connections, which cannot be directly captured by a language model. Moreover, by combining with a pre-trained language model, our model achieves state-of-the-art performance on the English PTB dataset. In our experiments, POS tag features contribute about 0.2% improvement; the word ordering model is more helpful without POS tag features and seems to compensate for their absence.

To show the importance of the self-attention mechanism, we conduct ablation tests on the models with pre-trained word ordering model vectors, removing the self-attention vectors by replacing them with the character-level representations. As shown in Table 2, self-attention further improves dependency parsing, indicating that the word connections modeled by self-attention are important for parsing. Figure 3 shows an example of the word connections learned by the model, where solid lines indicate the word connections learned by the word ordering model and dashed lines the expected dependencies. A meaningful overlap can be observed in the example: the percentage of overlap between connections and dependency arcs is over 40% for sentences shorter than 10 words. The differences between connections and dependency arcs arise because our word ordering model is trained without any supervised dependency information; the connections are actually built to increase the likelihood.
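The overlap statistic between learned connections and gold dependency arcs can be computed as a simple set intersection (treating arcs as undirected, since an attention connection has no head/dependent direction; that undirected treatment is an assumption):

```python
def connection_overlap(attn_pairs, gold_arcs):
    """Fraction of learned word connections that coincide with gold
    dependency arcs, compared undirected.

    attn_pairs: list of (i, j) word-index pairs linked by self-attention.
    gold_arcs : list of (head, dependent) index pairs from the treebank.
    """
    undirected_gold = {frozenset(arc) for arc in gold_arcs}
    hits = sum(1 for pair in attn_pairs if frozenset(pair) in undirected_gold)
    return hits / len(attn_pairs)
```

Averaging this quantity over sentences shorter than 10 words is how a figure like the reported ">40%" overlap would be obtained.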

Conclusion
In this paper, we propose to implicitly capture word connections from large-scale unlabeled data by a word ordering model with self-attention.