Joint part-of-speech and dependency projection from multiple sources

Most previous work on annotation projection has been limited to a subset of Indo-European languages, using only a single source language, and projecting annotation for one task at a time. In contrast, we present an Integer Linear Programming (ILP) algorithm that simultaneously projects annotation for multiple tasks from multiple source languages, relying on parallel corpora available for hundreds of languages. When training POS taggers and dependency parsers on POS tags and syntactic dependencies jointly projected by our algorithm, we obtain better performance than a standard approach on 20/23 languages using one parallel corpus, and on 18/27 languages using another.


Introduction
Cross-language annotation projection for unsupervised POS tagging and syntactic parsing was introduced fifteen years ago (Yarowsky et al., 2001; Hwa et al., 2005), and the best unsupervised dependency parsers today rely on annotation projection (Rasooli and Collins, 2015).
Despite the maturity of the field, there is an inherent language bias in previous work on cross-language annotation projection. Cross-language annotation projection experiments require training data in m source languages, a parallel corpus of translations from the m source languages into the target language of interest, as well as evaluation data for the target language. Since the canonical resource for parallel text is the Europarl Corpus (Koehn, 2005), which covers languages spoken in the European parliament, annotation projection is typically limited to the subset of Indo-European languages that have treebanks.
Previous work is also limited in another respect. While treebanks typically contain multiple layers of annotation, previous work has focused on projecting data for a single task.
We go significantly beyond previous work in two ways: 1) by considering multi-source projection across languages in parallel corpora that are available for hundreds of languages, including many non-Indo-European languages; and 2) by jointly projecting annotation for two mutually dependent tasks, namely POS tagging and dependency parsing. Using multiple source languages makes our projections denser. In single source projection, the source language may not contain all syntactic phenomena of the target language; we combat this by transferring syntactic information from multiple source languages. Our work also differs from previous work on annotation projection in projecting soft rather than hard constraints, i.e., scores rather than labels and edges.
Contributions We present a novel ILP-based algorithm for jointly projecting POS labels and dependency annotations across word-aligned parallel corpora. The performance of our algorithm compares favorably to that of a state-of-the-art projection algorithm, as well as to multi-source delexicalized transfer. Our experiments cover between 23 and 27 languages, depending on which of two parallel corpora is used; both corpora, a collection of Bibles and a collection of Watchtower periodicals, are available for hundreds of languages. Finally, we make both the parallel corpora and the code publicly available.

Projection algorithm
The projection algorithm is divided into two distinct steps. First, we project potential syntactic edges and POS tags from all source languages into an intermediate target graph, which is left deliberately ambiguous. In the second step, we decode the target graph by solving a constrained optimisation problem, which simultaneously resolves all ambiguities and produces a single dependency tree with a fixed set of POS tags. Below we describe both steps in more detail.

Cross-language sentence
The input to our projection algorithm is a cross-language sentence, a data structure that ties together a collection of aligned sentences from a parallel corpus, i.e., sentences in many different languages that are determined to be translation equivalents. One sentence of the set is designated as the target while the rest are sources. We project syntactic information from the sources to the target.
All source sentences are automatically parsed with a graph-based dependency parser and labeled with parts of speech. Instead of using the single best dependency tree output by the parser, we extract its scoring matrix, an ambiguous structure that assigns a numeric score to each potential dependency edge. The target sentence is not parsed or POS-tagged. In fact, our approach is explicitly designed to work for target languages where no such resources are available. Only unsupervised word alignments couple the target sentence with each source sentence.
More formally, a cross-language sentence may be represented as a graph $G = (V, E)$, where each vertex is a POS-tagged token of a sentence in some language. With one target and $n$ source languages, the total set of tagged word vertices $V$ can be written as the union of sentence vertices:

$$V = V_t \cup V_1 \cup \dots \cup V_n$$

Two kinds of weighted edges connect the graph. Edges that go between tagged tokens of a sentence $V_i$ represent potential dependency edges. Thus, for the sentence $i$, the induced subgraph $G[V_i]$ is the (ambiguous) dependency graph. Edges connecting a source vertex to a target vertex represent word alignments. The set of alignment edges is

$$A \subseteq \bigcup_{s=1}^{n} V_s \times V_t$$

To account for POS we introduce a vertex labeling function $l : V \to \Sigma$, where $\Sigma$ is the POS vocabulary. The source sentences are automatically tagged, and for any source vertex the label function simply returns this tag. For the target sentence the POS labels are unknown, which is to say that every target token is ambiguous between $|\Sigma|$ POS tags. We represent this ambiguity in the graph by creating a vertex for each possible combination of target word and POS. Concretely, if a source sentence $i$ has $n_i$ tokens, and the target sentence has $m$ tokens, then $|V_i| = n_i$ and $|V_t| = m|\Sigma|$.
Alignments are constrained such that an alignment $(u, v) \in V_s \times V_t$ only exists if the source and target token were linked by the automatic aligner and $l(u) = l(v)$, i.e., the POS tags match. This filters out potential source relations with dissimilar syntax, a luxury that we are allowed in a multiple source language setup.
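The graph described above can be sketched in a few lines of Python. This is a minimal illustration only: the `Vertex` class, the toy tag set, and the shape of `aligner_links` are assumptions made for the example and are not taken from the released code.

```python
from dataclasses import dataclass

POS_TAGS = ("NOUN", "VERB", "DET")  # toy POS vocabulary (assumption)


@dataclass(frozen=True)
class Vertex:
    sentence: str  # language id such as "de"; "tgt" marks the target (assumption)
    index: int     # 1-based token position
    pos: str       # POS label l(v)


def target_vertices(num_tokens, tags=POS_TAGS):
    """Create one vertex per (target token, POS) pair: m * |tags| vertices."""
    return [Vertex("tgt", i, t) for i in range(1, num_tokens + 1) for t in tags]


def alignment_edges(source_vertices, target_vs, aligner_links):
    """Keep (u, v) only if the aligner linked the tokens and l(u) == l(v).

    aligner_links maps (source language, source index, target index)
    to an alignment probability (hypothetical input format).
    """
    edges = {}
    for u in source_vertices:
        for v in target_vs:
            prob = aligner_links.get((u.sentence, u.index, v.index))
            if prob is not None and u.pos == v.pos:
                edges[(u, v)] = prob
    return edges
```

Because every target token is expanded into one vertex per tag, the POS-match filter ties each alignment link to exactly one of the candidate target vertices for that token.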

Projecting to ambiguous target graph
The target graph $G[V_t]$ starts out empty and is populated with edges in the following way. We go through the source sentences, looking for potential dependency edges where both endpoints are aligned to the target sentence, and transferring the edge whenever we find one. Technically, for every source sentence $i$ and for each edge $(u_s, v_s)$ in the source graph $G[V_i]$ for which alignment edges $(u_s, u_t)$ and $(v_s, v_t)$ exist, we add the edge $(u_t, v_t)$ to the target graph. The edge weight is the source edge score (as determined by an automatic parser) weighted by the joint alignment probability of $(u_s, u_t)$ and $(v_s, v_t)$:

$$d(u_t, v_t) = d(u_s, v_s) \cdot a(u_s, u_t) \cdot a(v_s, v_t)$$

For clarity, $d$ refers to weights of dependency edges, and $a$ to alignment edge weights. Multiple source sentences may project the same edge to the target graph. When this happens we update the target edge weight only if the new weight is larger than the existing one. The weight then reflects the strongest evidence found for a given syntactic relation across all source languages.
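The projection step can be sketched as follows. The sketch assumes that the joint alignment probability is the product of the two alignment probabilities, and that vertices are plain `(token index, POS)` tuples; both choices, like the function and argument names, are assumptions made for illustration.

```python
def project_edges(source_edges, alignments):
    """Project scored source edges into the ambiguous target graph.

    source_edges: {language: {(u_s, v_s): d}} with parser edge scores d
    alignments:   {language: {(source vertex, target vertex): a}}
                  with alignment probabilities a
    Vertices are (token index, POS) tuples (assumption).
    Returns {(u_t, v_t): weight}, keeping the maximum weight per target edge.
    """
    target = {}
    for lang, edges in source_edges.items():
        aligned = alignments.get(lang, {})
        for (u_s, v_s), d in edges.items():
            heads = [(t, a) for (s, t), a in aligned.items() if s == u_s]
            deps = [(t, a) for (s, t), a in aligned.items() if s == v_s]
            for u_t, a_u in heads:
                for v_t, a_v in deps:
                    if u_t[0] == v_t[0]:
                        continue  # both endpoints land on the same target token
                    weight = d * a_u * a_v  # product form is an assumption
                    if weight > target.get((u_t, v_t), float("-inf")):
                        target[(u_t, v_t)] = weight
    return target
```

Edges headed by the artificial root can be handled in the same way if the caller adds a root-to-root alignment with probability 1, which is again an assumption of this sketch.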

Decoding the target graph
We are now ready to decode the target graph. The result of decoding is a dependency tree as well as a labeling of the target sentence with POS tags. Labeling with POS corresponds to selecting a subset of the vertices $\tilde{V} \subset V_t$, such that exactly one vertex is chosen for each token. Similarly, the decoded dependency tree is a subset of the projected target edges with the constraint that it must form a tree over the vertices of $\tilde{V}$. The joint optimization objective is to simultaneously select a set of vertices $\tilde{V}$ and edges $\tilde{E}$ to maximize the score of the decoded tree. We solve this constrained optimization problem by casting it as an integer linear programming (ILP) problem.
The full specification of the ILP model is displayed as Figure 1. The model is optimized over two types of binary decision variables mapping directly to the target graph representation discussed in the previous section, plus additional flow variables that enforce tree structure. An edge variable $e_{i,k,j,l}$ represents a target edge $(i, j)$ where the POS of $i$ is $k$ and the POS of $j$ is $l$. For instance, the variable $e_{2,V,1,N}$ represents a directed edge from the second token (a verb) to the first (a noun). An active vertex variable $v_{i,k}$ indicates that the POS of token $i$ is chosen as $k$.
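To make the joint objective concrete, the sketch below decodes a toy target graph by exhaustive search. It is not an ILP and does not scale beyond a handful of tokens; it only illustrates that tags and heads are chosen together so that the summed edge weight is maximal. The function names and the `(0, "ROOT")` root vertex are assumptions of the example.

```python
from itertools import product


def is_tree(heads):
    """True if following head pointers from every token reaches the root 0."""
    for token in heads:
        node, steps = token, 0
        while node != 0:
            if steps >= len(heads):
                return False  # cycle: the root was never reached
            node = heads[node]
            steps += 1
    return True


def decode(n, tags, edge_scores):
    """Exhaustive joint decoding of POS tags and heads for n tokens.

    edge_scores: {((i, k), (j, l)): weight}; the root vertex is (0, "ROOT").
    Returns (score, tags_by_token, heads) or None if no tree exists.
    """
    tokens = range(1, n + 1)
    best = None
    for tagging in product(tags, repeat=n):
        pos = {0: "ROOT", **dict(zip(tokens, tagging))}
        for head_choice in product(range(n + 1), repeat=n):
            heads = dict(zip(tokens, head_choice))
            if any(h == t for t, h in heads.items()):
                continue  # no self-loops
            try:
                score = sum(edge_scores[((h, pos[h]), (t, pos[t]))]
                            for t, h in heads.items())
            except KeyError:
                continue  # uses an edge that was never projected
            if is_tree(heads) and (best is None or score > best[0]):
                best = (score, dict(zip(tokens, tagging)), heads)
    return best
```

An ILP solver searches the same space implicitly, which is what makes decoding feasible for sentences of realistic length.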
Following Martins (2012), we constrain the search space to spanning trees by using a single-commodity-flow construction. In the commodity-flow analogy, we imagine the root as a factory that produces n commodities (for an n token sentence) which are distributed along the edges of the tree. Each token is a consumer that must receive and pass on all except one commodity to its dependents, i.e., the difference between incoming and outgoing flow should be 1. Since all commodities must be consumed, the outgoing flow for a leaf node will be zero. Together with the requirement that each token must have exactly one head, this ensures all tokens are connected to the root in the tree structure.
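The flow analogy can be checked on a concrete tree. The sketch below assumes a tree given as a head map `{token: head}` with the root at index 0, derives the flow carried by each edge (the number of tokens in the subtree below it), and verifies the two flow conditions stated above. It illustrates the property only; it is not the ILP encoding, and the function names are hypothetical.

```python
def edge_flows(heads):
    """Return {(head, token): flow} for a tree given as {token: head}.

    The flow on an edge equals the size of the subtree hanging below it.
    Raises ValueError if the head map is not a tree rooted at 0.
    """
    n = len(heads)
    size = {token: 0 for token in heads}
    for token in heads:
        node, steps = token, 0
        while node != 0:
            if node not in heads or steps >= n:
                raise ValueError("not a tree rooted at 0")
            size[node] += 1  # 'token' lies in the subtree of 'node'
            node = heads[node]
            steps += 1
    return {(heads[token], token): size[token] for token in heads}


def satisfies_flow_constraints(heads):
    """Check: the root sends n units, and every token consumes exactly one."""
    flows = edge_flows(heads)
    n = len(heads)
    if sum(f for (h, _), f in flows.items() if h == 0) != n:
        return False
    for token in heads:
        incoming = flows[(heads[token], token)]
        outgoing = sum(f for (h, _), f in flows.items() if h == token)
        if incoming - outgoing != 1:
            return False
    return True
```

For the three-token tree `{1: 2, 2: 0, 3: 2}`, the root sends 3 units to token 2, which keeps one and passes one each to tokens 1 and 3; a cyclic head map such as `{1: 2, 2: 1}` admits no such flow.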
The last two constraint groups enforce edge and POS consistency, and the selection of single POS per token. Both are new to this work.
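The objective and two of the constraint groups described in prose above (unit flow consumption and a single POS per token) could be written out roughly as follows. This is a sketch derived from the surrounding description rather than a copy of Figure 1: the weight symbol $w_{i,k,j,l}$ for the projected edge score and the exact index ranges are assumptions.

```latex
\begin{align}
\max_{e,\, v,\, \phi} \quad & \sum_{i,k,j,l} w_{i,k,j,l}\, e_{i,k,j,l} \\
\text{s.t.} \quad
  & \sum_{i,k,l} \phi_{i,k,j,l} \;-\; \sum_{x,k,l} \phi_{j,l,x,k} = 1
    && \forall j \neq 0 \\
  & \sum_{k} v_{i,k} = 1
    && \forall i \neq 0
\end{align}
```

The first constraint states that each non-root token keeps exactly one unit of the flow it receives, and the second that exactly one POS vertex is active per token.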

Data sources
Our projection requires parallel text, ideally spanning a large number of languages, and dependency treebanks for the sources.
Treebanks To train the source-side taggers and dependency parsers, and to evaluate the cross-lingual taggers and parsers, we use the Universal Dependencies (UD) version 1.2 treebanks with the corresponding test sets.
Parallel texts We exploit two sources of parallel text: the Edinburgh Multilingual Bible corpus (EBC) (Christodouloupoulos and Steedman, 2014), and our own collection of online texts published by the Watchtower Society (WTC). While the two collections span more than 100 languages, we focus on the subsets that overlap with the UD languages to facilitate evaluation. For EBC, that amounts to 27 languages, and 23 for WTC.

One parent per token: $\sum_{i,k,l} e_{i,k,j,l} = 1 \quad \forall j \neq 0$
The root token (index 0) sends $n$ flow: $\sum_{j,l} \phi_{0,0,j,l} = n$
Each token consumes one unit of flow
Active edges choose token POS: $v_{i,k} \geq e_{i,k,j,l} \quad \forall i \neq 0, j, k, l$ and $v_{j,l} \geq e_{i,k,j,l} \quad \forall i, j, k, l$
Above, $i$, $j$, and $x$ are token indices, while $k$ and $l$ refer to POS. Quantification over these symbols in the equations is always with respect to a given target graph.
Figure 1: Specification of the ILP model. We list, in order, the decision variables, the objective, and the five groups of constraint templates.
Preprocessing We use simple sentence splitting and tokenization models to segment the parallel corpora. To sentence- and word-align the individual language pairs, we use a Gibbs sampling-based IBM1 alignment model called efmaral (Östling, 2015). IBM1 has been shown to lead to more robust alignments across typologically distant language pairs (Östling, 2015). We modify the aligner to output alignment probabilities. All the source-side texts are POS-tagged and dependency parsed using TnT (Brants, 2000) and TurboParser (Martins et al., 2013). We use our own fork of the arc-factored TurboParser to output the edge weight matrices. 6

Experiments

Setup
In our experiments, as in the preprocessing, we use the TnT tagger and the arc-factored TurboParser, which we train on the EBC and WTC texts with projected and decoded annotations. We randomly sample up to 20k sentences per training file in both tagging and parsing. This 20k sampling limit applies to all systems. We compare two cross-lingual projection-based parsing systems, and one baseline system.
ILP The ILP-based joint projection algorithm we presented in Section 2.
DCA Our implementation of the de facto standard annotation projection algorithm of Hwa et al. (2005), as refined by Tiedemann (2014). In contrast to our ILP approach, it uses heuristics to ensure dependency tree constraints on a source-target sentence pair basis. We gather all the pairwise projections into a target sentence graph and then perform maximum spanning tree decoding following Sagae and Lavie (2006).
DELEX The multi-source direct delexicalized transfer baseline of McDonald et al. (2011). Each source is represented by an approximately equal number of sentences.

Results
Table 1 provides a summary of dependency parsing scores. We report UAS scores over predicted and gold POS. The predicted tags come from our cross-lingual taggers. Our ILP approach consistently outperforms DCA in both settings, by 3-5 points UAS using predicted POS and by 5-10 points using gold POS. Note that DELEX is trained on gold POS and therefore has an advantage in this scenario. 8
6 https://github.com/andersjo/TurboParser
8 We do not include DELEX in the comparison for the gold POS scenario only. In this particular scenario, DELEX is also trained on gold POS, and thus biased: the cross-lingual taggers do not have gold POS available for training, and the same holds for DELEX and projected POS.
In Table 2, we split the scores across the test languages and parallel data sources, and we also report the POS tagging accuracies. Our WTC taggers are on average 3.5 points better than EBC taggers, yielding the top score for 16/23 languages from the overlap. Notably, on several non-Indo-European languages, we observe significant improvements. For example, on Indonesian, DCA improves over DELEX by 12 points UAS, while ILP adds 6 more points on top. We observe a similar pattern for Arabic and Estonian. We note that DELEX tops ILP and DCA on only 1 EBC and 3 WTC languages, and by a narrow margin.
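For reference, the unlabeled attachment score (UAS) used in these comparisons is the percentage of tokens that receive the correct head. A minimal sketch, assuming heads are given as lists of integer indices per sentence and that every token is scored (evaluation scripts differ, for instance, in whether punctuation is counted):

```python
def uas(gold_heads, predicted_heads):
    """Unlabeled attachment score in percent.

    Both arguments are lists of sentences; each sentence is a list with
    the head index of every token (0 for the root).
    """
    correct = total = 0
    for gold, pred in zip(gold_heads, predicted_heads):
        if len(gold) != len(pred):
            raise ValueError("sentence length mismatch")
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return 100.0 * correct / total if total else 0.0
```

With one wrong head out of four tokens, as in `uas([[2, 0, 2], [0]], [[2, 0, 1], [0]])`, the score is 75.0.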
Analysis A projected parse is allowed to be a composite of edges from many source languages. To find out to what degree this actually happens, we analyze all projections into English and German on the WTC corpus.
For German the top four source languages are Czech, Norwegian, French, and English, each contributing between 7% and 16% of all edges. For English the top languages are Norwegian, Italian, Indonesian, and Swedish. Here, the top language Norwegian is responsible for 42% of the edges, while Swedish accounts for 13%. Only the language projecting the highest scoring edge is counted. On average, a German sentence has edges from 4.1 source languages. The corresponding number for English is slightly higher, at 4.5.
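Statistics of this kind can be computed once each decoded edge is annotated with the source language that projected its highest-scoring copy. The sketch below assumes such provenance labels are available as one list of language ids per sentence; the input format and function name are hypothetical.

```python
from collections import Counter


def source_contributions(decoded_sentences):
    """Summarize which source languages the decoded edges come from.

    decoded_sentences: list of sentences, each a list with one source
    language id per decoded edge (the language that projected the
    highest-scoring copy of that edge).
    Returns (share of edges per language, mean distinct languages per sentence).
    """
    counts = Counter(lang for sent in decoded_sentences for lang in sent)
    total = sum(counts.values())
    shares = {lang: c / total for lang, c in counts.items()} if total else {}
    if decoded_sentences:
        mean_langs = (sum(len(set(s)) for s in decoded_sentences)
                      / len(decoded_sentences))
    else:
        mean_langs = 0.0
    return shares, mean_langs
```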
Manually annotated data We annotate a small number of sentences in English from EBC and

Related work
In recent years, we note an increased interest in cross-lingual processing, and particularly in POS tagging and dependency parsing of low-resource languages. Yarowsky et al. (2001) proposed the idea of inducing NLP tools via parallel corpora. Their contribution started a line of work in annotation projection. Das and Petrov (2011) used graph-based label propagation to yield competitive POS taggers, while Hwa et al. (2005) introduced the projection of dependency trees. Tiedemann (2014) further improved this approach to single-source projection in the context of synthesizing dependency treebanks (Tiedemann and Agić, 2016). The current state of the art in cross-lingual dependency parsing also involves exploiting large parallel corpora (Ma and Xia, 2014; Rasooli and Collins, 2015).
Transferring models by training parsers without lexical features was first introduced by Zeman and Resnik (2008). McDonald et al. (2011) and Søgaard (2011) coupled delexicalization with contributions from multiple sources, while McDonald et al. (2013) were the first to leverage uniform representations of POS and syntactic dependencies in cross-lingual parsing.
Even more recently, Agić et al. (2015) exposed a bias towards closely related Indo-European languages shared by most previous work on annotation projection, while introducing a bias-free projection algorithm for learning 100 POS taggers from multiple sources. Their line of work has since been non-trivially extended to multilingual dependency parsing.
The work in annotation projection for cross-lingual NLP invariably treats mutually dependent layers of annotation separately. Our contribution differs from these works in implementing the first approach to joint projection of POS and dependencies, while maintaining the outlook on processing truly low-resource languages.

Conclusion
In our contribution, we addressed tagging and parsing for low-resource languages through joint cross-lingual projection of POS tags and syntactic dependencies from multiple source languages. Our novel approach to transferring the annotations via word alignments is based on integer linear programming, more specifically on a commodity-flow formalization for spanning trees.
In our experiments with 27 treebanks from the Universal Dependencies (UD) project, our approach compared very favorably to two competitive cross-lingual systems: we provided the best cross-lingual taggers and parsers for 18/27 and 20/23 languages, depending on the parallel corpora used. We made no unrealistic assumptions as to the availability of parallel texts and preprocessing tools for the target languages. Our code and data are freely available.