Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages

We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: We only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.


Introduction
Significant linguistic resources such as machine-readable lexicons and part-of-speech (POS) taggers are available for at most a few hundred languages. This means that the majority of the languages of the world are low-resource. Low-resource languages like Fulani are spoken by tens of millions of people and are politically and economically important; e.g., to manage a sudden refugee crisis, NLP tools would be of great benefit. Even "small" languages are important for the preservation of the common heritage of humankind that includes natural remedies and linguistic and cultural diversity that can potentially enrich everybody. Thus, developing analysis methods for low-resource languages is one of the most important challenges of NLP today.
We address this challenge by proposing a new method for analyzing what we call superparallel corpora, i.e., corpora that cover an order of magnitude more languages than the parallel corpora that have been available in NLP to date. The corpus we work with in this paper is the Parallel Bible Corpus (PBC), which consists of translations of the New Testament in 1169 languages. Given that no NLP analysis tools are available for most of these 1169 languages, how can we extract the rich information that is potentially hidden in such superparallel corpora?
The method we propose is based on two hypotheses.
H1 Existence of overt encoding. For any important linguistic distinction f that is frequently encoded across languages in the world, there are a few languages that encode f overtly on the surface.
H2 Overt-to-overt and overt-to-non-overt projection. For a language l that encodes f, a projection of f from the "overt languages" to l in the superparallel corpus will identify the encoding that l uses for f, both in cases in which the encoding that l uses is overt and in cases in which the encoding that l uses is non-overt.
Based on these two hypotheses, our method proceeds in 5 steps.
1. Selection of a linguistic feature. We select a linguistic feature f of interest.
Running example: We select past tense as feature f.
2. Heuristic search for head pivot. Through a heuristic search, we find a language l_h that contains a head pivot p_h that is highly correlated with the linguistic feature of interest.
Running example: "ti" in Seychelles Creole (CRS). CRS "ti" meets our requirements for a head pivot well, as will be verified empirically in §3. First, "ti" is a surface marker: it is easily identifiable through whitespace tokenization and it is not ambiguous, e.g., it does not have a second meaning apart from being a grammatical marker. Second, "ti" is a good marker for past tense in terms of both "precision" and "recall". CRS has mandatory past tense marking (as opposed to languages in which tense marking is facultative) and "ti" is highly correlated with the general notion of past tense.
This does not mean that every clause that a linguist would regard as past tense is marked with "ti" in CRS. For example, some tense-aspect configurations that are similar to English present perfect are marked with "in" in CRS, not with "ti" (e.g., ENG "has commanded" is translated as "in ordonn").
Our goal is not to find a head language and a head pivot that is a perfect marker of f. Such a head pivot probably does not exist; or, more precisely, linguistic features are not completely rigorously defined. In a sense, one of the significant contributions of this work is that we provide more rigorous definitions of past tense across languages; e.g., "ti" in CRS is one such rigorous definition of past tense and it automatically extends (through projection) to 1000 languages in the superparallel corpus.
3. Projection of head pivot to larger pivot set. Based on an alignment of the head language to the other languages in the superparallel corpus, we project the head pivot to all other languages and search for highly correlated surface markers, i.e., we search for additional pivots in other languages. This projection to more pivots achieves three goals. First, it makes the method more robust. Relying on a single pivot would result in many errors due to the inherent noisiness of linguistic data and because several components we use (e.g., alignment of the languages in the superparallel corpus) are imperfect. Second, as we discussed above, the head pivot does not necessarily have high "recall"; our example was that CRS "ti" is not applied to certain clauses that would be translated using present perfect in English. Thus, moving to a larger pivot set increases recall. Third, as we will see below, the pivot set can be leveraged to create a fine-grained map of the linguistic feature. Consider clauses referring to eventualities in the past that English speakers would render in past progressive, present perfect and simple past tense. Our hope is that the pivot set will cover these distinctions, i.e., one of the pivots marks past progressive, but not present perfect and simple past, another pivot marks present perfect, but not the other two, and so on. It is beyond the scope of this paper to verify that we can produce such an analysis for all linguistic features, but a promising example of this type of map, including distinctions like progressive and perfective aspect, is given in §4.
Running example: We compute the correlation of "ti" with words in other languages and select the 100 highest correlated words as pivots. Examples of pivots we find this way are Torres Strait Creole "bin" (from English "been") and Tzotzil "laj". "laj" is a perfective marker, e.g., "Laj meltzaj -uk" 'LAJ be-made subj' means "It's done being built" (Aissen, 1987).
4. Projection of pivot set to all languages. Now that we have a large pivot set, we project the pivots to all other languages to search for linguistic devices that express the linguistic feature f. Up to this point, we have made the assumption that it is easy to segment text in all languages into pieces of a size that is not too small (individual characters of the Latin alphabet would be too small) and not too large (entire sentences as tokens would be too large). Segmentation on standard delimiters is a good approximation for the majority of languages, but not for all: it undersegments some (e.g., the polysynthetic language Inuit) and oversegments others (e.g., languages that use punctuation marks as regular characters).
For this reason, we do not employ tokenization in this step. Rather, we search for character n-grams (2 ≤ n ≤ 6) to find linguistic devices that express f. This implementation of the search procedure is a limitation: there are many linguistic devices that cannot be found using it, e.g., templates in templatic morphology. We leave addressing this for future work (§7).
Running example: We find "-ed" for English and "-te" for German as surface features that are highly correlated with the 100 past tense pivots.
5. Linguistic analysis. The result of the previous steps is a superparallel corpus that is richly annotated with information about linguistic feature f. This structure can be exploited for the analysis of a single language l_i that may be the focus of a linguistic investigation. Starting with the character n-grams that were found in the step "projection of pivot set to all languages", we can explore their use and function, e.g., for the mined n-gram "-ed" in English (assuming English is the language l_i and it is unfamiliar to us). Many of the other 1000 languages provide annotations of linguistic feature f for l_i: both the languages that are part of the pivot set (e.g., Tzotzil "laj") and the mined n-grams in other languages that we may have some knowledge of (e.g., "-te" in German).
We can also use the structure we have generated for typological analysis across languages following the work of Michael Cysouw. He has pioneered a new methodology for typology (Cysouw, 2014; see §5). We do not contribute any innovations to typology in this paper, but our method is a significant advancement computationally over Cysouw's work because we overcome many of his limiting assumptions. Most importantly, our method scales to thousands of languages, as we demonstrate below, whereas Cysouw worked on a few dozen.
Running example: We sketch the type of analysis that our new method makes possible in §4.
The above steps "2. heuristic search for head pivot" and "3. projection of head pivot to larger pivot set" are based on H1: we assume the existence of overt coding in a subset of languages.
The above steps "3. projection of head pivot to larger pivot set" and "4. projection of pivot set to all languages" are based on H2: we assume that overt-to-overt and overt-to-non-overt projection is possible.
In the rest of the paper, we will refer to the method that consists of steps 1 to 5 as SuperPivot: "linguistic analysis of SUPERparallel corpora using surface PIVOTs".
We make three contributions. (i) Our basic hypotheses are H1 and H2. (H1) For an important linguistic feature, there exist a few languages that mark it overtly and easily recognizably. (H2) It is possible to project overt markers to overt and non-overt markers in other languages. Based on these two hypotheses we design SuperPivot, a new method for analyzing highly parallel corpora, and show that it performs well for the crosslingual analysis of the linguistic phenomenon of tense. (ii) Given a superparallel corpus, SuperPivot can be used for the analysis of any low-resource language represented in that corpus. In the supplementary material, we present results of our analysis for three tenses (past, present, future) for 1163 languages. An evaluation of accuracy is presented in Table 3.2. (iii) We extend Michael Cysouw's pioneering work on typological analysis using parallel corpora by overcoming several limiting factors. The most important is that Cysouw's method is only applicable if markers of the relevant linguistic feature are recognizable on the surface in all languages. In contrast, we only assume that markers of the relevant linguistic feature are recognizable on the surface in a small number of languages.
2 SuperPivot: Description of method
1. Selection of a linguistic feature. The linguistic feature of interest f is selected by the person who performs a SuperPivot analysis, i.e., by a linguist, NLP researcher or data scientist. Henceforth, we will refer to this person as the linguist.
2. Heuristic search for head pivot. There are several ways of finding the head language and the head pivot. Perhaps the linguist knows a language that has a good head pivot. Or she is a trained typologist and can find the head pivot by consulting the typological literature.
In this paper, we use our knowledge of English and an alignment from English to all other languages to find head pivots. (See below for details on alignment.) We define a "query" in English and search for words that are highly correlated with the query in other languages. For future tense, the query is simply the word "will", so we search for words in other languages that are highly correlated with "will". For present tense, the query is the union of "is", "are" and "am". So we search for words in other languages that are highly correlated with the "merger" of these three words. For past tense, we POS tag the English part of PBC and merge all words tagged as past tense into one past tense word. We then search for words in other languages that are highly correlated with this artificial past tense word.
As an additional constraint, we do not select the most highly correlated word as the head pivot, but the most highly correlated word in a Creole language.Our rationale is that Creole languages are more regular than other languages because they are young and have not accumulated "historical baggage" that may make computational analysis more difficult.
Table 1 lists the three head pivots for F .
3. Projection of head pivot to larger pivot set. We first use fast_align (Dyer et al., 2013) to align the head language to all other languages in the corpus. This alignment is on the word level.
We compute a score for each word in each language based on the number of times it is aligned to the head pivot, the number of times it is aligned to another word and the total frequencies of head pivot and word. We use χ² (Casella and Berger, 2008) as the score throughout this paper. Finally, we select as pivots the k words that have the highest association score with the head pivot.
We impose the constraint that we only select one pivot per language. So as we go down the list, we skip pivots from languages for which we have already found a pivot. We set k = 100 in this paper. Table 1 gives the top 10 pivots.
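The scoring and selection in step 3 can be illustrated with a minimal sketch. It assumes that the four counts described above are arranged as a 2×2 contingency table and that `total` is the overall number of aligned tokens; that choice of table, as well as the function names `chi_square_2x2`, `association_score` and `select_pivots`, are illustrative assumptions and not taken from the paper.

```python
from typing import List, Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def association_score(aligned: int, word_freq: int, pivot_freq: int, total: int) -> float:
    """Score a candidate word against the head pivot (assumed table layout).

    a: word aligned to the head pivot
    b: word aligned to something else
    c: head pivot aligned to something else
    d: all remaining aligned tokens
    """
    a = aligned
    b = word_freq - aligned
    c = pivot_freq - aligned
    d = total - a - b - c
    return chi_square_2x2(a, b, c, d)


def select_pivots(candidates: List[Tuple[str, str, float]], k: int) -> List[Tuple[str, str, float]]:
    """Walk down the score-sorted list and keep at most one pivot per language."""
    pivots: List[Tuple[str, str, float]] = []
    seen_languages = set()
    for lang, word, score in sorted(candidates, key=lambda t: t[2], reverse=True):
        if lang in seen_languages:
            continue  # this language already contributed a pivot
        seen_languages.add(lang)
        pivots.append((lang, word, score))
        if len(pivots) == k:
            break
    return pivots
```

Note that χ² is symmetric in the direction of association, so a practical implementation might additionally check that a candidate co-occurs with the head pivot more often than expected; the sketch omits this.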
4. Projection of pivot set to all languages. As discussed above, the process so far has been based on tokenization. To be able to find markers that cannot be easily detected on the surface (like "-ed" in English), we identify non-tokenization-based character n-gram features in step 4.
The immediate challenge is that without tokens, we no longer have an alignment between the languages. We could simply assume that the occurrence of a pivot has scope over the entire verse. But this is clearly inadequate, e.g., for the sentence "I arrived yesterday, I'm staying today, and I will leave tomorrow", it is incorrect to say that it is marked as past tense (or future tense) in its entirety. Fortunately, the verses in the New Testament mostly have a simple structure that limits the variation in where a particular piece of content occurs in the verse. We therefore make the assumption that a particular relative position in language l_1 (e.g., the character at relative position 0.62) is aligned with the same relative position in l_2 (i.e., the character at relative position 0.62). This is likely to work for a simple example like "I arrived yesterday, I'm staying today, and I will leave tomorrow" across languages.
In our analysis of errors, we found many cases where this assumption breaks down. A well-known problematic phenomenon for our method is the difference between, say, VSO and SOV languages: the first class puts the verb at the beginning, the second at the end. However, keep in mind that we accumulate evidence over k = 100 pivots and then compute aggregate statistics over the entire corpus. As our evaluation below shows, the "linear alignment" assumption does not seem to do much harm given the general robustness of our method.
One design element that increases robustness is that we find the two positions in each verse that are most highly (resp. least highly) correlated with the linguistic feature f. Specifically, we compute the relative position x of each pivot that occurs in the verse and apply a Gaussian filter (σ = 6, where the unit of length is the character), i.e., we set p(x) ≈ 0.066 (0.066 is the density of a Gaussian with σ = 6 at x = 0) and center a bell curve around x. The total score for a position x is then the sum of the filter values at x summed over all occurring pivots. Finally, we select the positions x_min and x_max with lowest and highest values for each verse.
χ² is then computed based on the number of times a character n-gram occurs in a window of size w around x_max (positive count) and in a window of size w around x_min (negative count). Verses in which no pivot occurs are used for the negative count in their entirety. The top-ranked character n-grams are then output for analysis by the linguist. We set w = 20.
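The position scoring and window extraction of step 4 can be sketched as follows. The sketch assumes that a relative position is mapped to a character index by scaling with the verse length, and that a window of size w is centered on the selected position; both are illustrative readings of the description above, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple

SIGMA = 6.0   # width of the Gaussian filter in characters (value from the text)
WINDOW = 20   # window size w in characters (value from the text)


def gaussian(distance: float, sigma: float = SIGMA) -> float:
    """Density of a zero-mean Gaussian; about 0.066 at distance 0 for sigma = 6."""
    return math.exp(-0.5 * (distance / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def position_scores(verse_len: int, pivot_rel_positions: List[float]) -> List[float]:
    """Sum one bell curve per pivot occurrence, evaluated at every character index.

    Assumption: relative position r corresponds to index r * (verse_len - 1).
    """
    centers = [r * (verse_len - 1) for r in pivot_rel_positions]
    return [sum(gaussian(i - c) for c in centers) for i in range(verse_len)]


def extreme_positions(scores: List[float]) -> Tuple[int, int]:
    """Return (x_min, x_max) for a non-empty score list; ties go to the leftmost index."""
    indices = range(len(scores))
    x_max = max(indices, key=scores.__getitem__)
    x_min = min(indices, key=scores.__getitem__)
    return x_min, x_max


def window_ngrams(text: str, center: int, w: int = WINDOW,
                  n_min: int = 2, n_max: int = 6) -> Counter:
    """Count character n-grams (n_min <= n <= n_max) in a window of w characters."""
    lo = max(0, center - w // 2)
    hi = min(len(text), center + w // 2)
    chunk = text[lo:hi]
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(chunk) - n + 1):
            counts[chunk[i:i + n]] += 1
    return counts
```

Accumulating `window_ngrams` around x_max over all verses gives the positive counts and around x_min the negative counts; these two tallies would then feed the χ² ranking.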
5. Linguistic analysis. We have now created a structure that contains rich information about the linguistic feature: for each verse we have relative positions of pivots that can be projected across languages. We also have maximum positions within a verse that allow us to pinpoint the most likely place in the vicinity of which linguistic feature f is marked in all languages. This structure can be used for the analysis of individual low-resource languages as well as for typological analysis. We will give an example of such an analysis in §4.
6. Hierarchical clusterings of markers and languages. As an additional evaluation, we worked on hierarchical clusterings of past, present and future pivots. As detailed in §2.4, we represent each verse by a vector of length 100 showing which pivot markers are used to express this verse. The other way of looking at these data is that for each marker we have an occurrence distribution over verses, and we can use these distributions to measure the distance between markers. To compare two markers, we propose calculating the Jensen-Shannon divergence (JSD) between their normalized occurrence distributions over verses, which in its standard form is
$\mathrm{JSD}(m_{p_i}, m_{p_j}) = \frac{1}{2} D_{\mathrm{KL}}(m_{p_i} \,\|\, m) + \frac{1}{2} D_{\mathrm{KL}}(m_{p_j} \,\|\, m), \quad m = \frac{1}{2}(m_{p_i} + m_{p_j}),$
where $m_{p_i}$ and $m_{p_j}$ are the normalized occurrence distributions over verses of the two markers and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence. We compare the obtained distance between markers with the genetic distance of their corresponding languages using WALS information (Dryer et al., 2005). For visualization purposes, we perform Unweighted Pair Group Method with Arithmetic Mean (UPGMA) hierarchical clustering on the pairwise distance matrix of the markers for each tense separately (Johnson, 1967).
In addition to clustering pivot markers for each tense separately, we performed the same comparison for all top markers of 1107 languages and took the average distances of languages in past, present, and future marking. This allows us to compare the average tense behavior of languages.
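As an illustration of the marker comparison in step 6, the following sketch computes the Jensen-Shannon divergence between occurrence vectors over verses. It uses the standard definition with base-2 logarithms, which bounds the value between 0 and 1; the function names and the toy input format (one count per verse) are assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn per-verse occurrence counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("marker never occurs; distribution is undefined")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p_counts: Sequence[float], q_counts: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two markers' verse distributions."""
    p = normalize(p_counts)
    q = normalize(q_counts)
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pairwise_jsd(markers: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """JSD for every unordered pair of markers, keyed by sorted name pair."""
    names = sorted(markers)
    return {
        (a, b): jsd(markers[a], markers[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

The resulting pairwise distances could be passed to any average-linkage (UPGMA) clustering routine, for example SciPy's `scipy.cluster.hierarchy.linkage` with `method="average"`, if that library is available.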
3 Data, experiments and results

Data
We use a New Testament subset of the Parallel Bible Corpus (PBC) (Mayer and Cysouw, 2014) that consists of 1556 translations of the Bible in 1169 unique languages. We consider two languages to be different if they have different ISO 639-3 codes.
The translations are aligned on the verse level. However, many translations do not have complete coverage, so that most verses are not present in at least one translation. One reason for this is that sometimes several consecutive verses are merged, so that one verse contains material that is in reality not part of it and the merged verses may then be missing from the translation. Thus, there is a trade-off between number of parallel translations and number of verses they have in common. Although some preprocessing was done by the authors of the resource, many translations are not preprocessed. For example, Japanese is not tokenized. We also observed some incorrectness and sparseness in the metadata. One example is that one Fijian translation (see §4) is tagged fij hindi, but it is Fijian, not Fiji Hindi.
We use the 7958 verses with the best coverage across languages.

Experiments
1. Selection of a linguistic feature. We conduct three experiments for the linguistic features past tense, present tense and future tense.
3. Projection of head pivot to larger pivot set. Using the method described in §2, we project each head pivot to a set of k = 100 pivots. Table 1 gives the top 10 pivots for each tense.
4. Projection of pivot set to all languages. Using the method described in §2, we compute highly correlated character n-gram features, 2 ≤ n ≤ 6, for all 1163 languages.
See §4 for the last step of SuperPivot: 5. Linguistic analysis.

Evaluation
We rank n-gram features and retain the top 10, for each linguistic feature, for each language and for each n-gram size. We process 1556 translations. Thus, in total, we extract 1556 × 5 × 10 n-grams.
Table 3.2 shows Mean Reciprocal Rank (MRR) for 10 languages. The rank for a particular ranking of n-grams is the position of the first n-gram that is highly correlated with the relevant tense; e.g., character subsequences of the name "Paulus" are evaluated as incorrect, the subsequence "-ed" in English as correct for past. MRR is averaged over all n-gram sizes, 2 ≤ n ≤ 6. Chinese has consistent tense marking only for future, so results are poor. Russian and Polish perform poorly because their central grammatical category is aspect, not tense. The poor performance on Arabic is due to the limits of character n-gram features for a "templatic" language.
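For concreteness, the MRR computation described above can be sketched as below. The sketch assumes that a ranking with no correct n-gram among its retained entries contributes a reciprocal rank of 0, which is a common convention but is not stated in the text; the judgment function `is_correct` stands in for the manual evaluation and is purely hypothetical.

```python
from typing import Callable, List, Sequence


def reciprocal_rank(ranked: Sequence[str], is_correct: Callable[[str], bool]) -> float:
    """1 / position of the first correct n-gram, or 0.0 if none is correct."""
    for rank, ngram in enumerate(ranked, start=1):
        if is_correct(ngram):
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings: List[Sequence[str]],
                         is_correct: Callable[[str], bool]) -> float:
    """Average reciprocal rank over several rankings (e.g., one per n-gram size)."""
    if not rankings:
        return 0.0
    return sum(reciprocal_rank(r, is_correct) for r in rankings) / len(rankings)


# Toy usage: three rankings for English past tense, judged by a stand-in rule.
rankings = [["ed", "au"], ["Pau", "ed "], ["aulu", "ulus"]]
mrr = mean_reciprocal_rank(rankings, lambda g: "ed" in g)
```

In this toy case the first correct n-gram appears at ranks 1 and 2 in the first two rankings and not at all in the third, so the reciprocal ranks are 1, 0.5 and 0.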
During this evaluation, we noticed a surprising amount of variation within translations of one language; e.g., top-ranked n-grams for some German translations include names like "Paulus". We suspect that for literal translations, linear alignment (§2) yields good n-grams. But many translations are free, e.g., they change the sequence of clauses. This degrades the mined n-grams. See §7.
Hierarchical clusterings of markers. Hierarchical clusterings of past, present and future pivots using JSD between the normalized occurrence distributions over verses are shown in Figures 1, 2 and 3, respectively. In addition to the marker clusterings, the average tense behavior clustering of 1107 languages is depicted in Figure 4. In these figures, languages are colored based on their language families using WALS (Dryer et al., 2005); languages without family information in WALS are uncolored. We observed that most of the past and future pivot markers belong to the Niger-Congo family, while present markers are mostly within the Indo-European family. It can be seen that in many cases languages of the same family behave similarly in tense marking. For instance, in past tense marking Oto-Manguean languages use almost the same marker "ni" with small spelling variations (Figure 1). Although Tezoatlán Mixtec did not have a record in WALS, its marker is the same as in other Oto-Manguean languages and behaves almost identically to "ni" in those languages, so we may guess that this language is also Oto-Manguean, which turned out to be true when we performed further searches. There were many such cases for which we could guess the family of a language based on tense marking similarities in Figure 1, Figure 2 and Figure 3.
We use normalized JSD (0 ≤ JSD ≤ 1) for the comparison of each pair of languages/markers; this allows us to investigate whether a simple threshold of 0.5 accurately predicts whether two languages are genetically related or not. The results are summarized in Table 3.3. Although the average tense marking divergence has low recall, it has a high precision of 0.36, where random chance is 1/103 ≈ 0.01. This means that if the divergence of tense marking is low, the languages are very likely to be genetically related. This conclusion is supported by Figure 4, where many small clusters of nodes have the same color. This suggests that our method may help in the completion of WALS.
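The threshold evaluation can be made concrete with a small sketch. It assumes that a language pair is predicted to be related when its JSD is strictly below the threshold and that gold labels come from family membership; the function name and the toy data are hypothetical and do not reproduce the numbers in Table 3.3.

```python
from typing import Iterable, Tuple


def threshold_precision_recall(pairs: Iterable[Tuple[float, bool]],
                               threshold: float = 0.5) -> Tuple[float, float]:
    """Precision and recall of the rule 'JSD < threshold implies same family'.

    pairs: (jsd_value, same_family) for each language pair.
    """
    tp = fp = fn = 0
    for divergence, related in pairs:
        predicted_related = divergence < threshold
        if predicted_related and related:
            tp += 1
        elif predicted_related and not related:
            fp += 1
        elif not predicted_related and related:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy usage with made-up divergences and family labels.
toy_pairs = [(0.2, True), (0.4, False), (0.7, True), (0.9, False)]
precision, recall = threshold_precision_recall(toy_pairs)
```

Low recall with comparatively high precision, as reported above, would correspond to few related pairs falling below the threshold while most pairs below it are indeed related far more often than chance.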

A map of past tense
To illustrate the potential of our method, we select five out of the 100 past tense pivots that give rise to large clusters of distinct combinations. Starting with CRS, we find other pivots that "split" the set of verses that contain the CRS past tense pivot "ti" into two parts that have about the same size. This gives us two sets. We now look for a pivot that splits one of these two sets about evenly and so on. After iterating four times, we arrive at five pivots: CRS "ti", Fijian (FIJ) "qai", Hawaiian Creole (HWC) "wen", Torres Strait Creole (TCS) "bin" and Tzotzil (TZO) "laj".
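The iterative selection can be read as a greedy procedure, sketched below. The text does not say which of the current sets is split next; the sketch assumes the largest one is chosen, and it measures evenness by how far a pivot's coverage is from half of the set. Function names and the toy verse IDs are hypothetical.

```python
from typing import Dict, List, Optional, Set


def most_even_splitter(target: Set[int], pivots: Dict[str, Set[int]],
                       used: Set[str]) -> Optional[str]:
    """Pick the unused pivot whose verses split `target` closest to half and half."""
    best: Optional[str] = None
    best_gap: Optional[int] = None
    for name, verses in pivots.items():
        if name in used:
            continue
        inside = len(target & verses)
        gap = abs(2 * inside - len(target))  # 0 means a perfectly even split
        if best_gap is None or gap < best_gap:
            best, best_gap = name, gap
    return best


def select_splitting_pivots(head: str, pivots: Dict[str, Set[int]],
                            rounds: int = 4) -> List[str]:
    """Start from the head pivot's verses and repeatedly split the largest part."""
    chosen = [head]
    parts: List[Set[int]] = [set(pivots[head])]
    for _ in range(rounds):
        target = max(parts, key=len)  # assumption: split the largest part next
        name = most_even_splitter(target, pivots, set(chosen))
        if name is None:
            break  # no unused pivots left
        parts.remove(target)
        parts.extend([target & pivots[name], target - pivots[name]])
        chosen.append(name)
    return chosen


# Toy usage: verse IDs are small integers, pivot names are placeholders.
toy = {"ti": set(range(1, 9)), "a": {1, 2, 3, 4}, "b": {1, 2, 5, 6, 9}, "c": {1}}
selected = select_splitting_pivots("ti", toy, rounds=2)
```

With four rounds and the 100 past tense pivots, a procedure of this kind would yield five pivots in total, matching the count given above, although the particular pivots depend on tie-breaking and on which set is split.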
Figure 5 shows a t-SNE (Maaten and Hinton, 2008) visualization of the large clusters of combinations that are found for these five languages, including one cluster of verses that do not contain any of the five pivots.
This figure is a map of past tense for all 1163 languages, not just for CRS, FIJ, HWC, TCS and TZO: once the interpretation of a particular cluster has been established based on CRS, FIJ, HWC, TCS and TZO, we can investigate this cluster in the 1164 other languages by looking at the verses that are members of this cluster. This methodology supports the empirical investigation of questions like "how is progressive past tense expressed in language X?" We just need to look up the cluster(s) that correspond to progressive past tense, look up the verses that are members and retrieve the text of these verses in language X.
[Figure caption (beginning lost in extraction): Languages with no record on WALS remained white. This clustering is based on JSD of markers in marking 5960 verses in the Bible.]
To give the reader a flavor of the distinctions that are reflected in these clusters, we now list phenomena that are characteristic of verses that contain only one of the five pivots; these phenomena identify properties of one language that the other four do not have.
CRS "ti". CRS has a set of markers that can be systematically combined, in particular, a progressive marker "pe" that can be combined with the past tense marker "ti". As a result, past progressive sentences in CRS are generally marked with "ti". Example: "43004031 Meanwhile, the disciples were urging Jesus, 'Rabbi, eat something.'" "crs bible 43004031 Pandan sa letan, bann disip ti pe sipliy Zezi, 'Met! Manz en pe.'" The other four languages do not consistently use the pivot for marking the past progressive; e.g., HWC uses "was begging" in 43004031 (instead of "wen") and TCS uses "kip tok strongwan" 'keep talking strongly' in 43004031 (instead of "bin").
FIJ "qai". This pivot means "and then". It is highly correlated with past tense in the New Testament because most sequential descriptions of events are descriptions of past events. But there are also some non-past sequences. Example: "eng newliving 44009016 And I will show him how much he must suffer for my name's sake." "fij hindi 44009016 Au na qai vakatakila vua na levu ni ka e na sota kaya e na vukuqu." This verse is future tense, but it continues a temporal sequence (it starts in the preceding verse) and therefore FIJ uses "qai". The pivots of the other four languages are not general markers of temporal sequentiality, so they are not used for the future.
[Figure caption (beginning lost in extraction): Languages with no record on WALS remained white. This clustering is based on JSD of markers in marking 6590 verses in the Bible. It can be seen that in many cases languages with the same family behave accordingly in tense marking.]
HWC "wen". HWC is less explicit than the other four languages in some respects and more explicit in others. It is less explicit in that not all sentences in a sequence of past tense sentences need to be marked explicitly with "wen", resulting in some sentences that are indistinguishable from present tense. On the other hand, we found many cases of noun phrases in the other four languages that refer implicitly to the past, but are translated as a verb with explicit past tense marking in HWC. Examples: "hwc 2000 40026046 Da guy who wen set me up . . ." 'the guy who WEN set me up', "eng newliving 40026046 . . . my betrayer . . ."; "hwc 2000 43008005 . . . Moses wen tell us in da Rules . . ." 'Moses WEN tell us in the rules', "eng newliving 43008005 The law of Moses says . . ."; "hwc 2000 47006012 We wen give you guys our love . . .", "eng newliving 47006012 There is no lack of love on our part . . .". In these cases, the other four languages (and English too) use a noun phrase with no tense marking that is translated as a tense-marked clause in HWC.
While preparing this analysis, we realized that HWC "wen" unfortunately does not meet one of the criteria we set out for pivots: it is not unambiguous. In addition to being a past tense marker (derived from standard English "went"), it can also be a conjunction, derived from "when". This ambiguity is the cause of some noise in the clusters marked for presence of HWC "wen" in the figure.
TCS "bin". Conditionals are one pattern we found in verses that are marked with TCS "bin", but are not marked for past tense in the other four languages. Example: "tcs bible 46015046 Wanem i bin kam pas i da nomal bodi ane den da spiritbodi i bin kam apta." 'what came first is the normal body and then the spirit body came after', "eng newliving 46015046 What comes first is the natural body, then the spiritual body comes later." Apparently, "bin" also has a modal aspect in TCS: generic statements that do not refer to specific events are rendered using "bin" in TCS whereas the other four languages (and also English) use the default unmarked tense, i.e., present tense.
TZO "laj". This pivot indicates perfective aspect. The other four past tense pivots are not perfective markers, so that there are verses that are marked with "laj", but not marked with the past tense pivots of the other four languages. Example: "tzo huixtan 40010042 . . . ja'ch-ac'bat bendición yu'un hech laj spas . . ." (literally "a blessing . . . LAJ make"), "eng newliving 40010042 . . . you will surely be rewarded." Perfective aspect and past are correlated in the real world since most events that are viewed as simple wholes are in the past. But future events can also be viewed this way, as the example shows.
[Figure caption (beginning lost in extraction): Languages with no record on WALS remained white. This clustering is based on JSD of markers in marking 5733 verses in the Bible. It can be seen that in many cases languages with the same family behave accordingly in tense marking.]
Similar maps for present and future tenses are presented in Figure 6 and Figure 7.

Related work
Our work is inspired by (Cysouw, 2014;Cysouw and Wälchli, 2007); see also (Dahl, 2007;Wälchli, 2010).Cysouw creates maps like Figure 5 by manually identifying occurrences of the proper noun "Bible" in a parallel corpus of Jehovah's Witnesses' texts.Areas of the map correspond to semantic roles, e.g., the Bible as actor (it tells you to do something) or as object (it was printed).This is a definition of semantic roles that is complementary to and different from prior typological research because it is empirically grounded in real language use across a large number of languages.It allows typologists to investigate traditional questions from a radically new perspective.
Low resource. Even resources with the widest coverage like the World Atlas of Linguistic Structures (WALS) (Dryer et al., 2005) have little information for hundreds of languages. Many researchers have taken advantage of parallel information for extracting linguistic knowledge in low-resource settings (Resnik et al., 1997; Resnik, 2004; Mihalcea and Simard, 2005; Mayer and Cysouw, 2014; Christodoulopoulos and Steedman, 2015; Lison and Tiedemann, 2016).
[Figure 4 caption (beginning lost in extraction): ... present, and future marking of their respective top markers. Each node is colored based on its family information. Languages with no record on WALS remained white. It can be seen that many small clusters of nodes have the same color, which together with our quantitative evaluation supports that if divergence of tense marking is low the languages are very likely to be genetically related.]

Parallel corpora and annotation projection
In general, parallel corpora are a resource of immense importance in natural language processing at least since Brown et al. (1993)'s work on machine translation and they are widely used.
In addition to machine translation, other applications include typology (Asgari and Mofrad, 2016;Malaviya et al., 2017) and paraphrase mining (Bannard and Callison-Burch, 2005).Annotation projection is a specific use of paral-lel corpora: a set of labels that is available for L 1 is projected to L 2 via alignment links within the parallel corpus.L 1 labels can either be obtained through manual annotation or through an analysis module that may be available for L 1 , but not for L 2 .We interpret label here broadly, including, e.g., part of speech labels, morphological tags and segmentation boundaries, sense labels, mood labels, event labels, syntactic analysis and coreference.We can only cite a small subset of papers using annotation projection published in the last two decades: McEnery and Xiao (1999), Ide (2000), Yarowsky et al. (2001), Xiao andMcEnery (2002), Diab and Resnik (2002), Hwa et al. (2005)  For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language.The sixth subfigure highlights verses not marked by any of the five pivots.For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language.The sixth subfigure highlights verses not marked by any of the five pivots.
In contrast to this previous work, the labels we project in this paper are neither the result of human annotation nor the output of an NLP analysis module. Instead, we interpret words in L1 as annotation labels (words like CRS "ti" and TZO "laj") and project these word annotation labels to another language L2.
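The projection of word annotation labels can be illustrated with a minimal sketch. It assumes verse-level alignment by shared verse ID and whitespace-tokenized verses; the function names, the label format "CRS:ti" and the toy verses are hypothetical placeholders for illustration, not data or code from our experiments.

```python
from typing import Dict, List, Set

# Toy verse-aligned corpus: verse ID -> tokens. The tokens are invented
# placeholders, not real Bible text.
L1_VERSES: Dict[str, List[str]] = {
    "v1": ["a", "ti", "b"],
    "v2": ["a", "b"],
    "v3": ["ti", "c"],
}
L2_VERSES: Dict[str, List[str]] = {
    "v1": ["x", "y"],
    "v2": ["x", "z"],
    "v3": ["y", "z"],
}


def verses_marked_by(pivot: str, verses: Dict[str, List[str]]) -> Set[str]:
    """Return the IDs of all verses in which the pivot word occurs."""
    return {vid for vid, tokens in verses.items() if pivot in tokens}


def project_labels(
    pivot: str,
    label: str,
    l1_verses: Dict[str, List[str]],
    l2_verses: Dict[str, List[str]],
) -> Dict[str, str]:
    """Attach `label` to every L2 verse whose aligned L1 verse contains the pivot.

    Alignment is assumed to be given by identical verse IDs.
    """
    marked = verses_marked_by(pivot, l1_verses)
    return {vid: label for vid in l2_verses if vid in marked}


# The L1 word itself serves as the annotation label.
projected = project_labels("ti", "CRS:ti", L1_VERSES, L2_VERSES)
```

In this toy setting, the L2 verses v1 and v3 receive the label because their L1 counterparts contain the pivot; v2 stays unlabeled.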

Discussion
Our motivation is not to develop a method that can then be applied to many other corpora. Rather, our motivation is that many of the more than 1000 languages in the Parallel Bible Corpus are low-resource and that providing a method for creating the first richly annotated corpus (through the projection of annotation we propose) for many of these languages is a significant contribution.
The original motivation for our approach is provided by the work of the typologist Michael Cysouw. He created the same type of annotation as we do, but he produced it manually, whereas we use automatic methods. The structure of the annotation and its use in linguistic analysis, however, are the same as what we provide.
The basic idea behind the utility of the final outcome of SuperPivot is that the 1163 languages all richly annotate each other. As long as a few of the 1163 languages have a clear marker for a linguistic feature f, this marker can be projected to all other languages to richly annotate them. For any linguistic feature, there is a good chance that a few languages clearly mark it. Of course, this small subset of languages will be different for every linguistic feature.
Thus, even for extremely resource-poor languages for which at present no annotated resources exist, SuperPivot will make available richly annotated corpora that should advance linguistic research on these languages.

Conclusion
We presented SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We showed that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produced analysis results for more than 1000 languages, conducting, to the best of our knowledge, the largest crosslingual computational study performed to date. We extended existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: we only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

Future directions
There are at least two future directions that seem promising to us.
• Creating a common map of tense along the lines of Figure 5, but unifying the three tenses.
• Addressing shortcomings of the way we compute alignments: (i) generalizing character n-grams to more general features, so that templates in templatic morphology, reduplication and other more complex manifestations of linguistic features can be captured; (ii) using n-gram features of different lengths to account for differences among languages, e.g., shorter ones for Chinese, longer ones for English; (iii) segmenting verses into clauses and performing alignment not on the verse level (which caused many errors in our experiments), but on the clause level instead; (iv) using global information more effectively, e.g., by extracting alignment features from automatically induced bi- or multilingual lexicons.
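Direction (ii) could be sketched as follows. This is only an illustration under stated assumptions: the per-language length ranges in NGRAM_RANGE, the default range and the function names are hypothetical and were not tuned or used in our experiments.

```python
from collections import Counter
from typing import Dict, Tuple


def char_ngrams(text: str, n_min: int, n_max: int) -> Counter:
    """Count all character n-grams of `text` with n_min <= n <= n_max."""
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Hypothetical length ranges: shorter n-grams for Chinese, longer for English.
NGRAM_RANGE: Dict[str, Tuple[int, int]] = {
    "zho": (1, 2),
    "eng": (2, 5),
}


def ngram_features(text: str, lang: str, default: Tuple[int, int] = (1, 3)) -> Counter:
    """Extract n-gram features using the length range chosen for `lang`."""
    n_min, n_max = NGRAM_RANGE.get(lang, default)
    return char_ngrams(text, n_min, n_max)
```

A language-specific range keeps the feature space small for scripts in which one or two characters already carry a morpheme, while allowing longer units where markers span several letters.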

Figure 1: Clustering of 100 pivot past tense markers. Each node is colored according to its language family. Languages with no record in WALS remain white. The clustering is based on the JSD of the markers in marking 5960 verses of the Bible. We observed that most pivot past and future markers belong to the Niger-Congo family, while present markers are mostly within the Indo-European family. In many cases, languages of the same family behave similarly in tense marking. For instance, in past tense marking, Oto-Manguean languages use almost the same marker, ni, with small orthographic variations (Figure 1). Although Tezoatlán Mixtec has no record in WALS, its marker is the same as that of other Oto-Manguean languages and behaves almost identically to ni in those languages; we may therefore guess that this language is also Oto-Manguean, which further searches confirmed.

Figure 2: Clustering of 100 pivot present tense markers. Each node is colored according to its language family. Languages with no record in WALS remain white. The clustering is based on the JSD of the markers in marking 6590 verses of the Bible. In many cases, languages of the same family behave similarly in tense marking.

Figure 3: Clustering of 100 pivot future tense markers. Each node is colored according to its language family. Languages with no record in WALS remain white. The clustering is based on the JSD of the markers in marking 5733 verses of the Bible. In many cases, languages of the same family behave similarly in tense marking.

Figure 4: Clustering of 1107 languages based on the average Jensen-Shannon divergence in past, present, and future marking of their respective top markers. Each node is colored according to its language family. Languages with no record in WALS remain white. Many small clusters of nodes have the same color, which, together with our quantitative evaluation, supports the conclusion that languages with low divergence in tense marking are very likely to be genetically related.
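The distance underlying Figure 4 can be illustrated with a minimal sketch of the Jensen-Shannon divergence (base 2) averaged over the three tenses. How a marker is turned into a distribution is an assumption here: per-verse marking counts are simply normalized to sum to one, and every marker is assumed to mark at least one verse. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence

TENSES = ("past", "present", "future")


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn per-verse marking counts into a probability distribution.

    Assumes at least one nonzero count.
    """
    total = sum(counts)
    return [c / total for c in counts]


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: mean KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_tense_jsd(
    a: Dict[str, Sequence[float]], b: Dict[str, Sequence[float]]
) -> float:
    """Average JSD between two languages' top markers over the three tenses.

    `a` and `b` map each tense to per-verse marking counts over the same verses.
    """
    return sum(jsd(normalize(a[t]), normalize(b[t])) for t in TENSES) / len(TENSES)
```

With base-2 logarithms the divergence lies between 0 (identical marking) and 1 (markers never co-occur in a verse), so the average over tenses can be used directly as a distance for clustering.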

Figure 5: A map of past tense based on the largest clusters of verses with particular combinations of the past tense pivots from Seychellois Creole (CRS), Fijian (FIJ), Hawaiian Creole (HWC), Torres Strait Creole (TCS) and Tzotzil (TZO). For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language. The sixth subfigure highlights verses not marked by any of the five pivots.

Figure 6: A map of present tense based on the largest clusters of verses with particular combinations of the present tense pivots from Papiamento (PAP), Waima (RRO), Afrikaans (AFR), Urdu (URD) and Icelandic (ISL). For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language. The sixth subfigure highlights verses not marked by any of the five pivots.

Figure 7: A map of future tense based on the largest clusters of verses with particular combinations of the future tense pivots from Bwanabwana (TTE), Tok Pisin (TPI), Quiché (QUC), Malay (MSA) and Maskelynes (KLV). For each of the five languages, we present a subfigure that highlights the subset of verse clusters that are marked by the pivot of that language. The sixth subfigure highlights verses not marked by any of the five pivots.

Table 1: Top ten past, present, and future tense pivots extracted from 1163 languages. C. = Creole.

Table 2: MRR results for step 4. See text for details.

Table 3: Language family similarity prediction results based on coordinated marking of verses. Only languages with records in WALS are included in this evaluation. TNR: true negative rate.