Hierarchical Low-Rank Tensors for Multilingual Transfer Parsing

Accurate multilingual transfer parsing typically relies on careful feature engineering. In this paper, we propose a hierarchical tensor-based approach for this task. This approach induces a compact feature representation by combining atomic features. However, unlike traditional tensor models, it enables us to incorporate prior knowledge about desired feature interactions, eliminating invalid feature combinations. To this end, we use a hierarchical structure that uses intermediate embeddings to capture desired feature combinations. Algebraically, this hierarchical tensor is equivalent to the sum of traditional tensors with shared components, and thus can be effectively trained with standard online algorithms. In both unsupervised and semi-supervised transfer scenarios, our hierarchical tensor consistently improves UAS and LAS over state-of-the-art multilingual transfer parsers and the traditional tensor model across 10 different languages.


Introduction
The goal of multilingual syntactic transfer is to parse a resource-lean target language utilizing annotations available in other languages. Recent approaches have demonstrated that such transfer is possible, even in the absence of parallel data. As a main source of guidance, these methods rely on the commonalities in dependency structures across languages. These commonalities manifest themselves through a broad and diverse set of indicators, ranging from standard arc features used in monolingual parsers to typological properties (Dryer et al., 2005) needed to guide cross-lingual sharing (e.g., verb-subject ordering preference). In fact, careful feature engineering has been shown to play a crucial role in state-of-the-art multilingual transfer parsers. (The source code is available at https://github.com/yuanzh/TensorTransfer.)

Table 1: (Dryer et al., 2005) feature codes for verb-subject and noun-adjective ordering preferences.
Tensor-based models are an appealing alternative to manual feature design. These models automatically induce a compact feature representation by factorizing a tensor constructed from atomic features (e.g., the head POS). No prior knowledge about feature interactions is assumed. As a result, the model considers all possible combinations of atomic features, and addresses the parameter explosion problem via a low-rank assumption.
In the multilingual transfer setting, however, we have some prior knowledge about legitimate feature combinations. Consider, for instance, a typological feature that encodes verb-subject preferences. As Table 1 shows, it is expressed as a conjunction of five atomic features. Ideally, we would like to treat this composition as a single non-decomposable feature. However, the traditional tensor model decomposes this feature into multiple dimensions, and considers various combinations of these features as well as their individual interactions with other features. Moreover, we want to avoid invalid combinations that conjoin the above feature with unrelated atomic features. For instance, there is no point in constructing features of the form {head POS=ADJ} ∧ {head POS=VERB} ∧ · · · ∧ {82A=SV}, as the head POS takes a single value. However, the traditional tensor technique still considers these unobserved feature combinations and assigns them non-zero weights (see Section 7). This inconsistency between prior knowledge and the low-rank assumption results in sub-optimal parameter estimation.
To address this issue, we introduce a hierarchical tensor model that constrains the parameter representation. The model encodes prior knowledge by explicitly excluding undesired feature combinations over the same atomic features. At the bottom level of the hierarchy, the model constructs combinations of atomic features, generating intermediate embeddings that represent the legitimate feature groupings. For instance, these groupings will not combine the verb-subject ordering feature with the head POS feature. At higher levels of the hierarchy, the model combines these embeddings with the expert-defined typological features over the same atomic features. The hierarchical tensor is thereby able to capture the interaction between features over various subsets of atomic features. Algebraically, the hierarchical tensor is equivalent to the sum of traditional tensors with shared components. Thus, we can use standard online algorithms to optimize the low-rank hierarchical tensor.

We evaluate our model on labeled dependency transfer parsing using the newly released multilingual universal dependency treebank. We compare our model against a state-of-the-art multilingual transfer dependency parser and a direct transfer model. All the parsers utilize the same training resources but different feature representations. When trained on source languages alone, our model outperforms the baselines on 7 out of 10 languages on both unlabeled attachment score (UAS) and labeled attachment score (LAS). On average, it achieves a 1.1% UAS improvement over the state-of-the-art transfer model and a 4.8% UAS improvement over direct transfer. We also consider a semi-supervised setting in which the multilingual data is augmented with 50 annotated sentences in the target language. In this case, our model achieves improvements of 1.7% UAS over the state-of-the-art transfer model and 4.5% UAS over direct transfer.

Related Work
Multilingual Parsing The lack of annotated parsing resources for the vast majority of world languages has kindled significant interest in multi-source parsing transfer (Hwa et al., 2005; Durrett et al., 2012; Zeman and Resnik, 2008; Yu et al., 2013b; Cohen et al., 2011; Rasooli and Collins, 2015). Recent research has focused on the non-parallel setting, where transfer is driven by cross-lingual commonalities in syntactic structure (Naseem et al., 2010; Berg-Kirkpatrick and Klein, 2010; Cohen and Smith, 2009; Duong et al., 2015).
Our work is closely related to selective-sharing approaches (Naseem et al., 2012). The core of these methods is the assumption that head-modifier attachment preferences are universal across different languages. However, the sharing of arc direction is selective and is based on typological features. While this selective sharing idea was first realized in a generative model (Naseem et al., 2012), higher performance was achieved in a discriminative arc-factored model. These gains were obtained by a careful construction of feature templates that combine standard dependency parsing features and typological features. In contrast, we propose an automated, tensor-based approach that can effectively capture the interaction between these features, yielding a richer representation for cross-lingual transfer. Moreover, our model handles labeled dependency parsing, while previous work only focused on the unlabeled dependency parsing task.
Tensor-based Models Our approach also relates to prior work on tensor-based modeling. Lei et al. (2014) employ three-way tensors to obtain a low-dimensional input representation optimized for parsing performance. Srikumar and Manning (2014) learn a multi-class label embedding tailored for document classification and POS tagging in the tensor framework. Yu and Dredze (2015) and Fried et al. (2015) apply low-rank tensor decompositions to learn task-specific word and phrase embeddings. Other applications of the tensor framework include low-rank regularization (Primadhanty et al., 2015; Quattoni et al., 2014; Singh et al., 2015) and neural tensor networks (Socher et al., 2013; Yu et al., 2013a). While these methods can automatically combine atomic features into a compact composite representation, they cannot take into account constraints on feature combinations. In contrast, our method can capture features at different composition levels and, more generally, can incorporate structural constraints based on prior knowledge. As our experiments show, this approach delivers higher transfer accuracy.
Hierarchical Low-rank Scoring for Transfer Parsing

Background
We start by briefly reviewing the traditional three-way tensor scoring function (Lei et al., 2014). The three-way tensor characterizes each arc h → m using the tensor product of three feature vectors: the head vector (φ_h ∈ R^n), the modifier vector (φ_m ∈ R^n) and the arc vector (φ_{h→m} ∈ R^l). φ_h captures atomic features associated with the head, such as its POS tag and its word form. Similarly, φ_m and φ_{h→m} capture atomic features associated with the modifier and the arc respectively. The tensor product of these three vectors is a rank-1 tensor:

φ_h ⊗ φ_m ⊗ φ_{h→m} ∈ R^{n×n×l}

This rank-1 tensor captures all possible combinations of the atomic features in each vector, and therefore significantly expands the feature set. The tensor score is the inner product between a three-way parameter tensor A ∈ R^{n×n×l} and this rank-1 feature tensor:

S(h → m) = vec(A) · vec(φ_h ⊗ φ_m ⊗ φ_{h→m})

where vec(·) denotes the vector representation of a tensor. This tensor scoring method avoids the parameter explosion and overfitting problem by assuming a low-rank factorization of the parameters A. Specifically, A is decomposed into the sum of r rank-1 components:

A = Σ_{i=1}^{r} U(i) ⊗ V(i) ⊗ W(i)

where r is the rank of the tensor, and U, V ∈ R^{r×n} and W ∈ R^{r×l} are parameter matrices. U(i) denotes the i-th row of matrix U, and similarly for V(i) and W(i). Figure 1 shows the representation of a more general multiway factorization. With this factorization, the model effectively alleviates the feature explosion problem by projecting sparse feature vectors into dense r-dimensional embeddings via U, V and W. Subsequently, the score is computed as follows:

S(h → m) = Σ_{i=1}^{r} [Uφ_h]_i [Vφ_m]_i [Wφ_{h→m}]_i

where [·]_i denotes the i-th element of the vector. In multilingual transfer, however, we want to incorporate typological features that do not fit in any of the components. For example, if we add the verb-subject ordering preference into φ_{h→m}, the tensor will represent the conjunction of this preference with a noun-adjective arc, even though this feature should never trigger.

Figure 2: Visual representation of the hierarchical tensor as a tree structure. The tensor first captures the low-level interaction (Hφ_h, Mφ_m and Dφ_d) by an element-wise product, and then combines the intermediate embedding with components higher in the hierarchy, e.g. e_2 and Lφ_l. Two representations are composed by an element-wise sum.
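As an illustration (not the authors' code), the low-rank scoring above can be sketched in NumPy; the toy dimensions and random parameters are assumptions for the example. The sketch also verifies that the low-rank score matches the dense inner product with the reconstructed tensor A:

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, r = 30, 12, 5  # head/modifier dim, arc dim, rank (toy sizes)

# Low-rank parameters: A = sum_i U(i) ⊗ V(i) ⊗ W(i)
U, V, W = rng.normal(size=(r, n)), rng.normal(size=(r, n)), rng.normal(size=(r, l))

# Sparse one-hot-style atomic feature vectors
phi_h, phi_m, phi_a = np.zeros(n), np.zeros(n), np.zeros(l)
phi_h[3] = phi_m[7] = phi_a[1] = 1.0

# Low-rank score: sum_i [U phi_h]_i [V phi_m]_i [W phi_a]_i
score_lowrank = np.sum((U @ phi_h) * (V @ phi_m) * (W @ phi_a))

# Equivalent dense computation: vec(A) . vec(phi_h ⊗ phi_m ⊗ phi_a)
A = np.einsum('ri,rj,rk->ijk', U, V, W)
score_dense = np.einsum('ijk,i,j,k->', A, phi_h, phi_m, phi_a)

assert np.isclose(score_lowrank, score_dense)
```

Note that the low-rank form never materializes the n×n×l tensor, which is what makes the factorized score tractable.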

Hierarchical Low-rank Tensor
To address this issue, we propose a hierarchical factorization of the tensor parameters. The key idea is to generate intermediate embeddings that capture the interaction of the same set of atomic features as other expert-defined features. As Figure 2 shows, this design enables the model to handle expert-defined features over various subsets of the atomic features. We now illustrate this idea in the context of multilingual parsing. Table 2 summarizes the notation for the feature vectors and the corresponding parameters. Specifically, for each arc h → m with label l, we first compute the intermediate feature embedding e_1 that captures the interaction between the head φ_h, the modifier φ_m, and the arc direction and length φ_d, by an element-wise product:

[e_1]_i = [Hφ_h]_i [Mφ_m]_i [Dφ_d]_i    (1)

where [·]_i denotes the i-th value of the feature embedding, and H, M and D are the parameter matrices in Table 2.
The embedding e_1 captures the unconstrained interaction over the head, the modifier and the arc. Note that φ_tu includes expert-defined typological features that rely on the specific values of the head POS, the modifier POS and the arc direction, such as the example noun-adjective feature in Table 1. Therefore, the embedding T_u φ_tu captures an expert-defined interaction over the head, the modifier and the arc. Thus e_1 and T_u φ_tu provide two different representations of the same set of atomic features (e.g. the head), and our prior knowledge motivates us to exclude the interaction between them, since the low-rank assumption would not apply. We therefore combine e_1 and T_u φ_tu as e_2 using an element-wise sum, thereby avoiding such combinations:

e_2 = e_1 + T_u φ_tu    (2)

As Figure 2 shows, e_2 in turn is used to capture the higher-level interaction with the arc label features φ_l:

[e_3]_i = [e_2]_i [Lφ_l]_i    (3)

Now e_3 captures the interaction between head, modifier, arc direction, length and label. It is over the same set of atomic features as the typological features that depend on arc labels, φ_tl, such as the example verb-subject ordering feature in Table 1. Therefore, we sum these embeddings:

e_4 = e_3 + T_l φ_tl    (4)

Finally, we capture the interaction between e_4 and the context feature embeddings H_c φ_hc and M_c φ_mc:

S_tensor(h → m) = Σ_{i=1}^{r} [e_4]_i [H_c φ_hc]_i [M_c φ_mc]_i    (5)

By combining Equations 1 to 5, we observe that our hierarchical tensor score decomposes into three multiway tensor scoring functions.
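The composition above alternates element-wise products (interactions) and element-wise sums (combining alternative representations). A minimal NumPy sketch of the forward computation, with toy dimensions and random parameters as assumptions for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 4  # rank (toy size)

# Toy atomic feature vectors; the dimensions are assumptions for the example
phi_h, phi_m, phi_d = rng.random(6), rng.random(6), rng.random(3)
phi_tu, phi_l, phi_tl = rng.random(5), rng.random(8), rng.random(5)
phi_hc, phi_mc = rng.random(6), rng.random(6)

# Parameter matrices projecting each feature vector to an r-dim embedding
H, M, D = rng.normal(size=(r, 6)), rng.normal(size=(r, 6)), rng.normal(size=(r, 3))
Tu, L, Tl = rng.normal(size=(r, 5)), rng.normal(size=(r, 8)), rng.normal(size=(r, 5))
Hc, Mc = rng.normal(size=(r, 6)), rng.normal(size=(r, 6))

e1 = (H @ phi_h) * (M @ phi_m) * (D @ phi_d)  # element-wise product (interaction)
e2 = e1 + Tu @ phi_tu                         # element-wise sum (alternative reps)
e3 = e2 * (L @ phi_l)                         # interact with the arc label
e4 = e3 + Tl @ phi_tl                         # add label-dependent typology
score = float(np.sum(e4 * (Hc @ phi_hc) * (Mc @ phi_mc)))  # interact with context
```

The sums at e_2 and e_4 are exactly where invalid cross-products between a typological feature and its own constituent atomic features are excluded.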
This decomposition provides another view of our tensor model. That is, our hierarchical tensor is algebraically equivalent to the sum of three multiway tensors, where H_c, M_c and L are shared. From this perspective, we can see that our tensor model effectively captures the following three sets of combinations over atomic features:

f_1 = φ_tu ⊗ φ_l ⊗ φ_hc ⊗ φ_mc
f_2 = φ_tl ⊗ φ_hc ⊗ φ_mc
f_3 = φ_h ⊗ φ_m ⊗ φ_d ⊗ φ_l ⊗ φ_hc ⊗ φ_mc

The last set of features, f_3, captures the interaction across standard atomic features. The other two sets, f_1 and f_2, focus on combining atomic typological features with atomic label and context features. Consequently, we explicitly assign zero weight to invalid assignments by excluding the combination of φ_tu with φ_h and φ_m.
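The algebraic equivalence is easy to check numerically: distributing the products over the sums in the hierarchy yields exactly three multiway tensor scores. The following sketch (toy dimensions and random parameters are assumptions) verifies this:

```python
import numpy as np

rng = np.random.default_rng(1)
r = 4
phi_h, phi_m, phi_d = rng.random(6), rng.random(6), rng.random(3)
phi_tu, phi_l, phi_tl = rng.random(5), rng.random(8), rng.random(5)
phi_hc, phi_mc = rng.random(6), rng.random(6)
H, M, D = rng.normal(size=(r, 6)), rng.normal(size=(r, 6)), rng.normal(size=(r, 3))
Tu, L, Tl = rng.normal(size=(r, 5)), rng.normal(size=(r, 8)), rng.normal(size=(r, 5))
Hc, Mc = rng.normal(size=(r, 6)), rng.normal(size=(r, 6))

# Hierarchical score (Eqs. 1-5 composed)
e4 = ((H @ phi_h) * (M @ phi_m) * (D @ phi_d) + Tu @ phi_tu) * (L @ phi_l) + Tl @ phi_tl
s_hier = np.sum(e4 * (Hc @ phi_hc) * (Mc @ phi_mc))

# Sum of three multiway tensor scores with shared L, Hc, Mc components
ctx = (Hc @ phi_hc) * (Mc @ phi_mc)
f3 = np.sum((H @ phi_h) * (M @ phi_m) * (D @ phi_d) * (L @ phi_l) * ctx)
f1 = np.sum((Tu @ phi_tu) * (L @ phi_l) * ctx)
f2 = np.sum((Tl @ phi_tl) * ctx)

assert np.isclose(s_hier, f1 + f2 + f3)
```

This is simply distributivity of the element-wise product over the element-wise sum, so the equivalence holds for any parameter values.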

Lexicalization Components
In order to encode lexical information in our tensor-based model, we add two additional components, H_w φ_hw and M_w φ_mw, for head and modifier lexicalization respectively. We compute the final score as the interaction between the delexicalized feature embedding in Equation 5 and the lexical components. Specifically:

S_lex(h → m) = Σ_{i=1}^{r} [e_5]_i [H_w φ_hw]_i [M_w φ_mw]_i

where e_5 is the embedding that represents the delexicalized transfer result, i.e. [e_5]_i = [e_4]_i [H_c φ_hc]_i [M_c φ_mc]_i. We describe the features in φ_hw and φ_mw in Section 5.

Combined Scoring
Similar to previous work on low-rank tensor scoring models (Lei et al., 2014; Lei et al., 2015), we combine the traditional linear scoring and the low-rank tensor scoring. More formally, for a sentence x and a dependency tree y, our final scoring function has the form

S(x, y) = Σ_{(h→m) ∈ y} [ γ · w · φ(h −l→ m) + (1 − γ) · S_tensor(h −l→ m) ]

where γ ∈ [0, 1] balances the two scoring terms.

Learning
In this section, we describe our learning method. Following standard practice, we optimize the parameters θ = (w, H, M, D, L, T_u, T_l, H_c, M_c) in a maximum soft-margin framework, using online passive-aggressive (PA) updates (Crammer et al., 2006).
For the tensor parameter update, we employ the joint update method originally used by Lei et al. (2015) in the context of four-way tensors. While our tensor has a very high order (8 components for the delexicalized parser and 10 for the lexicalized parser) and is hierarchical, the gradient computation is nevertheless similar to that of traditional tensors. As described in Section 3.2, we can view our hierarchical tensor as the combination of three multiway tensors with parameter sharing. Therefore, we can compute the gradient of each multiway tensor and take the sum accordingly. For example, the gradient of the label component is

∂S/∂L = Σ_{(h→m) ∈ y*} g(h → m) φ_l^T − Σ_{(h→m) ∈ ỹ} g(h → m) φ_l^T,
g(h → m) = (Hφ_h ⊙ Mφ_m ⊙ Dφ_d + T_u φ_tu) ⊙ H_c φ_hc ⊙ M_c φ_mc

where ⊙ is the element-wise product and + denotes element-wise addition. y* and ỹ are the gold tree and the maximum violated tree respectively. For each sentence x, we find ỹ via cost-augmented decoding.
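For a single arc, the label-component gradient is an outer product of the "everything but the label" vector with φ_l; the PA update then sums such terms over the arcs of y* and ỹ. A sketch with a finite-difference check (toy dimensions and random parameters are assumptions; one arc only):

```python
import numpy as np

rng = np.random.default_rng(2)
r = 4
phi_h, phi_m, phi_d = rng.random(6), rng.random(6), rng.random(3)
phi_tu, phi_l, phi_tl = rng.random(5), rng.random(8), rng.random(5)
phi_hc, phi_mc = rng.random(6), rng.random(6)
H, M, D = rng.normal(size=(r, 6)), rng.normal(size=(r, 6)), rng.normal(size=(r, 3))
Tu, L, Tl = rng.normal(size=(r, 5)), rng.normal(size=(r, 8)), rng.normal(size=(r, 5))
Hc, Mc = rng.normal(size=(r, 6)), rng.normal(size=(r, 6))

def score(L_):
    # Per-arc hierarchical tensor score as a function of the label component
    e4 = ((H @ phi_h) * (M @ phi_m) * (D @ phi_d) + Tu @ phi_tu) * (L_ @ phi_l) + Tl @ phi_tl
    return np.sum(e4 * (Hc @ phi_hc) * (Mc @ phi_mc))

# Analytic gradient w.r.t. L: outer(g, phi_l) with g = (e1 + Tu phi_tu) ⊙ ctx
g = ((H @ phi_h) * (M @ phi_m) * (D @ phi_d) + Tu @ phi_tu) * (Hc @ phi_hc) * (Mc @ phi_mc)
grad_L = np.outer(g, phi_l)

# Finite-difference check on one entry (the score is linear in L)
eps = 1e-6
Lp = L.copy(); Lp[0, 0] += eps
assert np.isclose((score(Lp) - score(L)) / eps, grad_L[0, 0], atol=1e-4)
```

Because the score is linear in each component, this per-component gradient is exact, which is what makes the alternating joint update cheap.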
Tensor Initialization Given the high tensor order, initialization has a significant impact on learning quality. We extend the power method for high-order tensor initialization (Lei et al., 2015) to the hierarchical structure, using the same algebraic view as in the gradient computation.
Briefly, the power method incrementally computes the most important rank-1 component for H(i), M(i), etc., for i = 1 . . . r. In each iteration, the algorithm updates each component by taking the multiplication between the tensor T and the rest of the components. When we update the label component l, we do the multiplication for the different multiway tensors and then take the sum.

Table 3: Typological features from WALS (Dryer et al., 2005) used to build the feature templates in our work, inspired by Naseem et al. (2012).

  Feature  Description
  82A      Order of Subject and Verb
  83A      Order of Object and Verb
  85A      Order of Adposition and Noun Phrase
  86A      Order of Genitive and Noun
  87A      Order of Adjective and Noun

Unlike previous work (Naseem et al., 2012), we use 82A and 83A instead of 81A (order of subject, object and verb) because we can distinguish between subject and object relations based on dependency labels.
The contraction ⟨T_0, h_c, m_c, −, t_u⟩ returns a vector whose i-th element is

[⟨T_0, h_c, m_c, −, t_u⟩]_i = Σ_{j,k,s} [T_0]_{j,k,i,s} [h_c]_j [m_c]_k [t_u]_s

i.e. the tensor is multiplied by every component except the label component, leaving the label mode free. The algorithm updates the other components in a similar fashion until convergence.

Features
Linear Scoring Features Our traditional linear scoring features in φ(h −l→ m) are mainly drawn from previous work. Table 3 lists the typological features from "The World Atlas of Language Structures (WALS)" (Dryer et al., 2005) used to build the feature templates in our work. We use 82A and 83A for verb-subject and verb-object order respectively because we can distinguish between these two relations based on dependency labels. Table 4 summarizes the typological feature templates we use. In addition, we expand features with dependency labels to enable labeled dependency parsing.
Tensor Scoring Features For our tensor model, the feature vectors listed in Table 2 capture five types of atomic features: (a) φ_h, φ_m: POS tags of the head or the modifier. (b) φ_hc, φ_mc: POS tags of the left/right neighboring words. (c) φ_l: dependency labels. (d) φ_d: dependency length conjoined with direction. (e) φ_tu, φ_tl: selectively shared typological features, as described in Table 4. In Table 4, 82A-87A denote the WALS typological feature values, δ(·) is the indicator function, and subj ∈ l denotes that the arc label l indicates a subject relation (similarly for obj ∈ l).
We further conjoin atomic features (b) and (d) with the family and the typological class of the language, because the arc direction and the word order distribution depend on the typological properties of the language. We also add a bias term to each feature vector.

Partial Lexicalization
We utilize multilingual word embeddings to incorporate partial lexical information in our model. We use the CCA method (Faruqui and Dyer, 2014) to generate multilingual word embeddings. Specifically, we project word vectors in each non-English language into the English embedding space. To reduce the noise from the automatic projection process, we only incorporate lexical information for the 100 most frequent words in the following closed classes: pronoun, determiner, adposition, conjunction, particle and punctuation mark. Therefore, we call this feature extension partial lexicalization. We follow previous work (Lei et al., 2014) in adding embedding features. For the linear scoring model, we simply append the head and the modifier word embeddings to the feature vector. For the tensor-based model, we add each entry of the word embedding as a feature value in φ_hw and φ_mw. In addition, we add indicator features for the English translations of words, because this improved performance in preliminary experiments. For example, for the German word und, we add the word and as a feature.

Experimental Setup
Dataset We evaluate our model on the newly released multilingual universal dependency treebank v2.0, which consists of 10 languages: English (EN), French (FR), German (DE), Indonesian (ID), Italian (IT), Japanese (JA), Korean (KO), Brazilian Portuguese (PT), Spanish (ES) and Swedish (SV). This multilingual treebank is annotated with a universal POS tagset and a universal dependency label set, which makes it an excellent benchmark for cross-lingual transfer evaluation. For POS tags, the gold universal annotation uses the coarse tagset that consists of 12 tags: noun, verb, adjective, adverb, pronoun, determiner, adposition, numeral, conjunction, particle, punctuation mark, and a catch-all tag X. For dependency labels, the universal annotation extends the Stanford dependencies (De Marneffe and Manning, 2008) to a rich set of 40 labels. This universal annotation enables labeled dependency parsing in cross-lingual transfer.
Evaluation Scenarios We first consider the unsupervised transfer scenario, in which we assume no target language annotations are available. Following the standard setup, for each target language evaluated, we train our model on the concatenation of the training data in all other source languages.
In addition, we consider the semi-supervised transfer scenario, in which we assume 50 annotated sentences in the target language are available. However, we observe that random selection of the supervised sample results in high performance variance. Instead, we select sentences that contain patterns that are absent or rare in the source language treebanks. To this end, we greedily select, one sentence at a time, the sentence that minimizes the KL divergence between the trigram distribution of the target language and the trigram distribution of the training data after adding this sentence. The training data includes both the target and the source languages. The trigrams are based on universal POS tags. Note that our method does not require any dependency annotations. To incorporate the new supervision, we simply add the new sentences to the original training set, weighting their impact by a factor of 10.
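The greedy selection step can be sketched as follows. This is an illustration, not the authors' code: the function names and the smoothing constant are assumptions, and real treebank sentences would replace the toy POS-tag lists.

```python
import math
from collections import Counter

def trigrams(tags):
    # POS-tag trigrams of one sentence (a list of universal POS tags)
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def kl(p, q, eps=1e-9):
    # KL(p || q) over p's support, smoothing trigrams unseen in q
    return sum(pv * math.log(pv / q.get(k, eps)) for k, pv in p.items())

def greedy_select(target_sents, source_counts, k):
    """Greedily pick k target-language sentences so that the trigram
    distribution of the growing training set moves toward the
    target-language trigram distribution (no parses needed)."""
    target_dist = normalize(sum((Counter(trigrams(s)) for s in target_sents), Counter()))
    train = Counter(source_counts)
    chosen, pool = [], list(target_sents)
    for _ in range(k):
        best = min(pool, key=lambda s: kl(target_dist, normalize(train + Counter(trigrams(s)))))
        chosen.append(best)
        train += Counter(trigrams(best))
        pool.remove(best)
    return chosen
```

Because only POS trigram statistics are compared, the selection can run before any target-language annotation exists, which is what makes the 50-sentence budget cheap to spend well.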
Baselines We compare against different variants of our model.

• NT-Select: our model without the tensor component. This baseline corresponds to the prior feature-based transfer method, with extensions to labeled parsing, lexicalization and semi-supervised parsing.

• Multiway: a tensor-based model in which typological features are added as an additional component and parameters are factorized in the multiway structure, similarly to Figure 1.

• Sup50: our model trained only on the 50 annotated sentences in the target language in the semi-supervised scenario.

In all experiments we incorporate partial lexicalization for all variants of our model, and we focus on labeled dependency parsing.
Supervised Upper Bound As a performance upper bound, we train RBGParser (Lei et al., 2014), the state-of-the-art tensor-based parser, on the full target language training set. We train the first-order model with default parameter settings, using the current version of the code.

Evaluation Measures Following standard practice, we report unlabeled attachment score (UAS) and labeled attachment score (LAS), excluding punctuation. For all experiments, we report results on the test set and omit the development results due to space constraints.
Experimental Details For all experiments, we use the arc-factored model and Eisner's algorithm (Eisner, 1996) to infer the projective Viterbi parse. We train our model and the baselines for 10 epochs. We set a strong regularization C = 0.001 during learning, because cross-lingual transfer contains noise and the models can easily overfit. Other hyper-parameters are set as γ = 0.3 and r = 200 (the rank of the tensor). For partial lexicalization, we set the embedding dimension to 50.

Results

Tables 5 and 7 summarize the results for the unsupervised and the semi-supervised scenarios. Averaged across languages, our model outperforms all baselines.

Table 5: Unsupervised: unlabeled attachment scores (UAS) and labeled attachment scores (LAS) of different variants of our model with partial lexicalization in the unsupervised scenario. "Direct" and "Multiway" indicate the direct transfer baseline and the multiway variant of our model. "NT-Select" indicates our model without the tensor component, corresponding to a re-implementation of the previous transfer model (whose code is not publicly available) with extensions to partial lexicalization and labeled parsing. The last column shows the results of our hierarchical tensor-based model. Boldface numbers indicate the best UAS or LAS.

Impact of Hierarchical Tensors
We first analyze the impact of using a hierarchical tensor by comparing against the Multiway baseline, which implements the traditional tensor model. As Table 6 shows, this model learns non-zero weights even for invalid feature combinations. This disregard for known constraints hurts the resulting performance. In the unsupervised scenario, our hierarchical tensor achieves an average improvement of 0.5% on UAS and 1.3% on LAS. Moreover, our model obtains better UAS on all languages and better LAS on 9 out of 10 languages. This shows that multilingual transfer consistently benefits from a hierarchical tensor structure. In addition, we observe a similar gain over this baseline in the semi-supervised scenario.

Impact of Tensor Models
To evaluate the effectiveness of tensor modeling in multilingual transfer, we compare our model against the NT-Select baseline. In the unsupervised scenario, our tensor model yields a 1.1% gain on UAS and a 1.5% gain on LAS. In the semi-supervised scenario, the improvement is more pronounced, reaching 1.7% on UAS and 1.9% on LAS. The relative error reduction almost doubles, e.g. 7.1% vs. 3.8% on UAS. While both our model and NT-Select outperform the Direct baseline by a large margin on UAS, we observe that NT-Select achieves slightly worse LAS than Direct. By adding a tensor component, our model outperforms both baselines on LAS, demonstrating that the tensor scoring function captures better labeled features for transfer than the Direct and NT-Select baselines.
Transfer Performance in the Context of Supervised Results To assess the contribution of multilingual transfer, we compare against the Sup50 results, in which we train our model only on 50 target language sentences. As Table 7 shows, our model improves UAS by 2.3% and LAS by 2.7%. We also provide a performance upper bound by training RBGParser on the full training set. When trained with partial lexical information, as in our model, RBGParser achieves 82.9% UAS and 74.5% LAS. By utilizing source language annotations, our model closes the performance gap between training on the 50 sentences and training on the full training set by about 30% on both UAS and LAS. We further compare to the performance upper bound with full lexical information (87.3% UAS and 83.5% LAS). In this case, our model still closes the performance gap by 21% on UAS and 15% on LAS.

Table 7: Semi-supervised and Supervised: UAS and LAS of different variants of our model when 50 annotated sentences in the target language are available. "Sup50" columns show the results of our model when only supervised data in the target language is available. The last two columns show the supervised training results with partial or full lexicalization as the performance upper bound. Other columns have the same meaning as in Table 5. Boldface numbers indicate the best UAS or LAS.

Time Efficiency of Hierarchical Tensors
We observe that our hierarchical structure retains the time efficiency of tensor models. On the English test set, the decoding speed of our hierarchical tensor is close to that of the multiway counterpart (58.6 vs. 61.2 sentences per second), and is slower than the three-way tensor by a factor of 3.1 (184.4 sentences per second). The time complexity of tensor scoring is linear in the number of low-rank components, and is independent of the factorization structure.

Conclusions
In this paper, we introduce a hierarchical tensor-based model which enables us to constrain the learned representation based on desired feature interactions. We demonstrate that our model outperforms state-of-the-art multilingual transfer parsers and the traditional tensor model across 10 languages, in both unsupervised and semi-supervised transfer scenarios.