Fine-grained Coordinated Cross-lingual Text Stream Alignment for Endless Language Knowledge Acquisition

This paper proposes to study fine-grained coordinated cross-lingual text stream alignment through a novel information network decipherment paradigm. We use Burst Information Networks as media to represent text streams and present a simple yet effective network decipherment algorithm with diverse clues to decipher the networks for accurate text stream alignment. Experiments on Chinese-English news streams show that our approach not only outperforms previous approaches on bilingual lexicon extraction from coordinated text streams but also harvests high-quality alignments from large amounts of streaming data for endless language knowledge mining, making it a promising new paradigm for automatic language knowledge acquisition.


Introduction
Coordinated text streams (Wang et al., 2007) refer to text streams that are topically related and indexed by the same set of time points. Previous studies on coordinated text streams (Wang et al., 2007; Hu et al., 2012) focus on discovering and aligning common topic patterns across languages. Despite their contributions to applications like cross-lingual information retrieval and topic analysis, such a coarse-grained topic-level alignment framework inevitably overlooks much useful fine-grained alignment knowledge. For example, Figure 1 shows typical knowledge that can be derived from fine-grained Chinese-English text stream alignments. In addition to (a) bilingual word translations, we can also discover (b) polysemous and multi-referential words if one Chinese word is aligned to multiple English words, (c) synonymous and co-referential word pairs if two Chinese words are aligned to the same English word, and (d) entity phrases (e.g., 阿布扎比 in Figure 1) if adjacent Chinese words in text are aligned to the same English named entity.

In order to acquire language knowledge for Natural Language Processing (NLP) applications, we study fine-grained cross-lingual text stream alignment. Instead of directly turning massive, unstructured data streams into structured knowledge (D2K), we adopt a new Data-to-Network-to-Knowledge (D2N2K) paradigm, based on the following observations: (i) most information units are not independent; instead, they are interconnected or interacting, forming massive networks; (ii) if information networks can be constructed across multiple languages, they may make knowledge mining algorithms considerably more scalable and effective, because we can employ the graph structures to acquire and propagate knowledge.
Based on these motivations, we employ a promising text stream representation, Burst Information Networks (BINets) (Ge et al., 2016a), which can be easily constructed without rich language resources, as media to display the most important information units and illustrate their connections in the text streams. With the BINet representation, we propose a simple yet effective network decipherment algorithm for aligning cross-lingual text streams, which takes advantage of the co-burst characteristic of cross-lingual text streams and easily incorporates prior knowledge and rich clues for fast and accurate network decipherment.

For example, in Figure 2, each node in a BINet is a bursty word together with one of its burst periods, representing an important information unit in a text stream. To decipher the Chinese BINet, our approach first focuses on the nodes in the English BINet in Figure 2 as candidates because they co-burst with the Chinese nodes. Then, we decipher some nodes based on prior knowledge (the green node), the pronunciation similarity clue (the orange nodes) or the literal translation similarity clue (the blue node). These deciphered nodes then serve as neighbor clues to decipher their adjacent nodes (the red node), which are in turn used for further decipherment (e.g., deciphering the yellow node) through knowledge propagation across the network, as the dashed arrows in Figure 2 show. Experiments on Chinese-English coordinated news streams show that our approach can accurately align nodes across the cross-lingual BINets and derive various knowledge, and that with more streaming data provided, we can harvest more high-quality alignments and thus derive more knowledge. By aligning endless text streams, our approach is promising for never-ending language knowledge mining, which can not only complement language resources but also benefit NLP applications.
The main contributions of this paper are:
• We propose a promising framework to mine knowledge from inexhaustible coordinated cross-lingual text streams through fine-grained alignment, exploring a new paradigm for language knowledge acquisition.
• We propose a network decipherment approach for text stream alignment, which can work in both low and rich resource settings and outperform previous approaches.
• We release our data (annotations) and systems to guarantee the reproducibility and help future work improve on this task.

Burst Information Network
A Burst Information Network (BINet) is a graph-based text stream representation that has proven effective for multiple text stream mining tasks (Ge et al., 2016a,b,c). In contrast to many information networks (e.g., (Ji, 2009; Li et al., 2014)), BINets are designed specifically for text streams: they focus on the burst information units, which are usually related to important events or trending topics in text streams, and illustrate their connections. A BINet is originally defined as G = ⟨V, E, ω⟩ in (Ge et al., 2016a). Each node v ∈ V is a burst element, defined as a burst word during one of its burst periods, i.e., a pair ⟨w, P⟩ where w denotes a word and P denotes one consecutive burst period of w, as Figure 2 shows. Each edge in E indicates the connection between two burst elements, with a weight ω defined as the number of documents in which the two burst elements co-occur in the text stream. In this paper, we extend the BINet definition to G = ⟨V, E, ω, π⟩ by adding a binary indicator π that indicates whether two nodes (i.e., burst elements) are frequently (more than 5 times) adjacent (as a bigram) in text, for mining knowledge such as the entity phrases in Figure 1(d).
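To make the data structure concrete, here is a minimal Python sketch of the extended BINet ⟨V, E, ω, π⟩. The class and method names are our own, and document processing is simplified to explicit co-occurrence/adjacency calls:

```python
from collections import defaultdict

class BINet:
    """Minimal sketch of a Burst Information Network G = <V, E, omega, pi>.

    A node is a burst element (word, burst_period); edge weights count the
    documents where two burst elements co-occur; pi marks node pairs that
    are frequently adjacent as a bigram.
    """
    ADJ_THRESHOLD = 5  # "more than 5 times" adjacency cutoff from the paper

    def __init__(self):
        self.nodes = set()                      # V: (word, (start, end))
        self.weight = defaultdict(int)          # omega: frozenset({u, v}) -> count
        self.adjacent_count = defaultdict(int)  # bigram counts backing pi

    def add_node(self, word, period):
        self.nodes.add((word, period))

    def add_cooccurrence(self, u, v):
        # called once per document in which burst elements u and v co-occur
        self.weight[frozenset((u, v))] += 1

    def add_adjacency(self, u, v):
        # called once per occurrence of u, v as an adjacent bigram
        self.adjacent_count[(u, v)] += 1

    def pi(self, u, v):
        # binary indicator: are u and v frequently adjacent in text?
        return self.adjacent_count[(u, v)] > self.ADJ_THRESHOLD

    def neighbors(self, u):
        return {w for pair in self.weight for w in pair
                if u in pair and w != u}
```

The frozenset edge keys make ω symmetric, while π is kept directional since bigram order matters for phrase mining.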

Decipherment
After constructing a BINet from a foreign language (we use Chinese as the foreign language in this paper), we can decipher it by consulting an English BINet constructed from its coordinated English text stream. We define G_c = ⟨V_c, E_c, ω_c, π_c⟩ and G_e = ⟨V_e, E_e, ω_e, π_e⟩ as the Chinese BINet and the English BINet respectively. For people who do not know Chinese, G_c is a network of ciphers. We design a novel BINet decipherment procedure to decipher G_c by aligning as many nodes in G_c as possible to G_e. The decipherment process is defined as finding, for a node c ∈ V_c, the node e ∈ V_e such that e is c's counterpart in the English text stream.

Starting Point
To decipher the Chinese BINet, we need a few seeds based on prior knowledge as a starting point. Inspired by previous work on bilingual lexicon induction, decipherment and name translation mining, we utilize a few linguistic resources, namely a bilingual lexicon and language-universal representations such as time/calendar dates, numbers, website URLs, currencies and emoticons, to decipher a subset of Chinese nodes. For the example shown in Figure 2, we can decipher some nodes in the Chinese BINet such as "7-6" (to "7-6") and "种子" (to "seed").
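The seeding step can be sketched as follows. The pattern set here is illustrative, since the paper does not enumerate the exact language-universal representations it matches:

```python
import re

# A few language-universal token patterns usable as decipherment seeds;
# the exact pattern inventory is an assumption of this sketch.
UNIVERSAL_PATTERNS = {
    "score":  re.compile(r"^\d+-\d+$"),      # e.g. "7-6"
    "number": re.compile(r"^\d+(\.\d+)?$"),
    "url":    re.compile(r"^https?://\S+$"),
}

def seed_pairs(cn_nodes, en_nodes, lexicon):
    """Seed alignments from a bilingual lexicon and identical
    language-universal tokens (numbers, set scores, URLs, ...).

    Nodes are (word, period) pairs; burst-period filtering is omitted
    here for brevity.
    """
    en_by_word = {}
    for e in en_nodes:
        en_by_word.setdefault(e[0], []).append(e)

    seeds = set()
    for c in cn_nodes:
        if c[0] in lexicon:
            # bilingual-lexicon seed: look up the English translation
            targets = en_by_word.get(lexicon[c[0]], [])
        elif any(p.match(c[0]) for p in UNIVERSAL_PATTERNS.values()):
            # language-universal seed: identical surface token
            targets = en_by_word.get(c[0], [])
        else:
            targets = []
        for e in targets:
            seeds.add((c, e))
    return seeds
```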

Candidate Generation
For the nodes that cannot be deciphered by prior knowledge, we first need to discover their possible candidates. For a node c in the Chinese BINet, its counterpart e can be any node in the English BINet, or may not exist in the English BINet at all, resulting in an extremely large search space. Fortunately, burst information that refers to a hot topic usually co-bursts across languages. Based on this characteristic, for a node in the Chinese BINet, its counterpart is likely to be a node with an overlapping burst period in the English BINet. For example, the node "威廉(Williams)" in the Chinese BINet in Figure 2 bursts between January 25 and January 31, 2010. We only need to look for its counterpart among the nodes in the English BINet whose burst periods overlap with this period. Formally, for a node c ∈ V_c in the Chinese BINet, its candidate nodes in the English BINet can be derived as:

Cand(c) = {e ∈ V_e : P(c) ∩ P(e) ≠ ∅}

where P(c) and P(e) are the burst periods of c and e respectively.
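A direct implementation of candidate generation by burst-period overlap might look like this; representing a period as an inclusive (start, end) pair of day indices is an assumption of this sketch:

```python
def periods_overlap(p, q):
    """True if burst periods p = (start, end) and q = (start, end) intersect.

    Periods are inclusive integer day ranges.
    """
    return p[0] <= q[1] and q[0] <= p[1]

def candidates(c, english_nodes):
    """Cand(c): English nodes whose burst period overlaps P(c).

    Nodes are (word, (start, end)) tuples.
    """
    _, period_c = c
    return [e for e in english_nodes if periods_overlap(period_c, e[1])]
```

This turns an otherwise unbounded search over V_e into a small per-node candidate list, which is what makes the later verification step tractable.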

Candidate Verification
For the candidate list for c (i.e., Cand(c)), we need to verify each node e ∈ Cand(c) and choose the most probable one as c's counterpart. Formally, we define Score(c, e) as the credibility score of e being the correct counterpart of c and propose the following novel clues for verification.

Pronunciation
Inspired by previous work on name translation mining (e.g., (Schafer III, 2006; Sproat et al., 2006; Ji, 2009)), for a node e ∈ Cand(c), if its pronunciation is similar to that of c, then e is likely to be the translation of c. For a Chinese node c and an English node e, we define S_p as their scaled pronunciation score, whose range is [0, 1]:

S_p(⟨c, e⟩) = 1 − LD(c, e)

where LD(c, e) is the Levenshtein edit distance between c's pinyin string and e's word string, normalized by e's length.
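The pronunciation clue can be sketched as below. Pinyin conversion is assumed to happen upstream (e.g., via a hypothetical to_pinyin helper), and the score is clamped at 0 when the edit distance exceeds e's length:

```python
def levenshtein(a, b):
    """Standard edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def pronunciation_score(pinyin_c, word_e):
    """S_p in [0, 1]: 1 minus edit distance normalized by len(word_e).

    pinyin_c is the pinyin romanization of the Chinese node's word.
    """
    if not word_e:
        return 0.0
    ld = levenshtein(pinyin_c, word_e.lower()) / len(word_e)
    return max(0.0, 1.0 - ld)
```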

Translation
For a node e ∈ Cand(c), it is possible that e's word exists, or partially exists, in the bilingual lexicon. We can exploit this translation clue to verify whether e is c's counterpart. For example, "Australian Open" is a candidate for "澳洲网球公开赛(Australian Open)" as shown in Figure 2. Even though "澳洲网球公开赛(Australian Open)" is not in the bilingual lexicon, "Australian" and "open" are in the lexicon, and their Chinese translations are "澳洲的(Australian)" and "公开(open)" respectively. If we literally translate "Australian Open" word by word, we get "澳洲的公开", which has a long common subsequence with the Chinese node "澳洲网球公开赛(Australian Open)", suggesting that "Australian Open" is likely to be the translation of "澳洲网球公开赛".
Motivated by this observation, for a candidate e ∈ Cand(c), we first extract its possible Chinese translations C(e) from the bilingual lexicon. Note that if e is a multiword expression, we concatenate the translations of its components. Then, for ⟨c, e⟩, we define S_t as its scaled translation similarity score, whose range is [0, 1]:

S_t(⟨c, e⟩) = max_{c′ ∈ C(e)} |LCS(c, c′)| / |c|

where |LCS(c, c′)| is the length of the longest common subsequence between c and a literal translation c′ ∈ C(e), and |c| is the length of c.
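A sketch of the translation clue follows. For brevity this version keeps only the first lexicon translation per component instead of enumerating every combination in C(e), and requires all components of a multiword candidate to be in the lexicon; both are simplifying assumptions:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def translation_score(c_word, e_word, lexicon):
    """S_t in [0, 1]: LCS overlap between c and a literal word-by-word
    translation of e, normalized by len(c_word).

    lexicon maps a lowercased English word to a list of Chinese
    translations; multiword candidates are translated component by
    component and concatenated.
    """
    parts = e_word.split()
    if not parts or not all(p.lower() in lexicon for p in parts):
        return 0.0
    literal = "".join(lexicon[p.lower()][0] for p in parts)
    return lcs_len(c_word, literal) / len(c_word)
```

On the running example, the literal translation "澳洲的公开" shares the subsequence 澳洲公开 with 澳洲网球公开赛, giving a score of 4/7.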

Neighbor
The graph topological structure of a BINet is also an important clue for decipherment. By analyzing a node's neighbors, we can learn useful topic-level knowledge to decipher the node. For the example in Figure 2, "艾宁(Henin)" in the Chinese BINet has neighbors such as "威廉(Williams)", "澳洲网球公开赛(Australian Open)" and "郑洁(Zheng Jie)", while "Justine Henin" in the English BINet is connected with "Serena Williams", "Australian Open" and "Zheng Jie". If we know that "Serena Williams", "Australian Open" and "Zheng Jie" are the counterparts of "威廉", "澳洲网球公开赛" and "郑洁" respectively, we can infer that "Justine Henin" is likely to be the counterpart of "艾宁", which can then be further used as a clue to decipher its neighbors, such as "外卡(wildcard)", through knowledge propagation.
We define N(c) and N(e) as the sets of adjacent nodes of c in the Chinese BINet and of e in the English BINet respectively. The neighbor clue score S_n of ⟨c, e⟩ is defined as:

S_n(⟨c, e⟩) = Σ_{c′ ∈ N(c)} ω̄(c, c′) · max_{e′ ∈ N(e)} Score(c′, e′)

where Score(c′, e′) is the overall score of e′ being the counterpart of c′, as defined at the beginning of this section, and ω̄(c, c′) is the normalized weight of the edge between c and c′.
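The neighbor clue could be implemented as follows, assuming a max-over-English-neighbors aggregation for each Chinese neighbor; the paper's exact aggregation may differ:

```python
def neighbor_score(c, e, cn_neighbors, en_neighbors, edge_weight, score):
    """S_n: evidence from already-(partially-)deciphered neighbors.

    cn_neighbors / en_neighbors map a node to its adjacent nodes;
    edge_weight[(c, c2)] is the normalized weight of the Chinese edge;
    score(c2, e2) is the current overall credibility of the pair <c2, e2>.
    For each Chinese neighbor c2 of c, we credit the best-matching
    English neighbor of e.
    """
    total = 0.0
    for c2 in cn_neighbors.get(c, ()):
        best = max((score(c2, e2) for e2 in en_neighbors.get(e, ())),
                   default=0.0)
        total += edge_weight[(c, c2)] * best
    return total
```

With the Figure 2 example, once 威廉/"Serena Williams" and 澳洲网球公开赛/"Australian Open" are trusted, the pair 艾宁/"Justine Henin" accumulates weight from both shared neighbors.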

Correlation of burst
If the word of e ∈ Cand(c) frequently co-bursts with the word of c, then e is likely to be the counterpart of c. For example, "Serena Williams" in the English stream usually co-bursts with "小威" in the Chinese stream, as shown in Figure 3, which is a useful clue for inferring that "Serena Williams" is the counterpart of "小威". We define S_b as the burst correlation score:

S_b(⟨c, e⟩) = (s_{w(c)} · s_{w(e)}) / (|s_{w(c)}| + |s_{w(e)}| − s_{w(c)} · s_{w(e)})

where w(v) denotes the word of node v and s_w denotes the burst sequence of word w, in which each entry is a binary variable indicating whether w bursts at that moment of the time frame; |s_w| denotes the number of burst days in s_w. Note that in the above equation we regard s_w as a vector: the numerator is the number of days when w(c) and w(e) co-burst, and the denominator is the number of days when either w(c) or w(e) bursts.
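Treating the burst sequences as binary vectors, S_b reduces to a Jaccard overlap of burst days:

```python
def burst_correlation(s_c, s_e):
    """S_b: Jaccard overlap of two binary burst sequences.

    s_c and s_e are equal-length 0/1 sequences, one entry per day:
    the numerator counts co-burst days, the denominator counts days
    where either word bursts.
    """
    both = sum(a & b for a, b in zip(s_c, s_e))
    either = sum(a | b for a, b in zip(s_c, s_e))
    return both / either if either else 0.0
```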

Graph-based Decipherment
We define the overall (credibility) score as a linear combination of the clues introduced above:

Score(c, e) = η·S_p + λ·S_t + γ·S_n + δ·S_b    (3)

where S_p, S_t, S_n and S_b are the scores that measure the value/reliability of the pronunciation, translation, neighbor and burst correlation clues respectively, and η, λ, γ and δ are hyperparameters for adjusting their weights. Based on Eq (3), we can now compute the score of any candidate pair ⟨c, e⟩. For pairs that are known to be correct alignments according to prior knowledge, their overall scores are fixed at 1.0. For other possible candidate pairs, we simply initialize their scores as follows:

Score(c, e) = 1 / |Cand(c)|    (4)

where Cand(c) is the set of c's candidate nodes in the English BINet.
Given that Score(c, e) is influenced by other pairs' scores, we design an iterative algorithm to compute and update the scores to decipher the entire Chinese BINet through propagation. This process is elaborated in Algorithm 1.
Algorithm 1 Graph-based Decipherment
1: For each pair ⟨c, e⟩ determined by prior knowledge, Score(c, e) ← 1.0
2: For every other undetermined pair ⟨c, e⟩, initialize Score(c, e) according to Eq (4)
3: while ∆Conf(G_c, G_e) > 0.0001 do
4:   for each undetermined pair ⟨c, e⟩ do
5:     Compute new_score according to Eq (3)
6:     update(c, e) ← min(1.0, new_score)
7:   end for
8:   for each undetermined pair ⟨c, e⟩ do
9:     Score(c, e) ← update(c, e)
10:  end for
11: end while

∆Conf(G_c, G_e) in line 3 of Algorithm 1 is the difference between the network decipherment confidence score at the current iteration and that at the previous iteration. Conf(G_c, G_e) reflects how much confidence we have in our network decipherment result and is defined as:

Conf(G_c, G_e) = Σ_{c ∈ V_c} max_{e ∈ Cand(c)} Score(c, e)

In practice, the propagation of prior knowledge and clues makes the confidence score increase because it helps us learn more about the network (as illustrated by Figure 2). When the confidence score stops increasing, or increases only marginally (≤ 0.0001) between iterations, the algorithm terminates.
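Algorithm 1 can be sketched as below. Here clue_score stands in for Eq (3), and the confidence is taken as the sum of each node's best candidate score, an assumption of this sketch:

```python
def decipher(candidates, seeds, clue_score, tol=1e-4, max_iter=100):
    """Sketch of Algorithm 1 (graph-based decipherment).

    candidates maps each Chinese node c to its list Cand(c) of English
    candidates; seeds is the set of (c, e) pairs fixed at 1.0 by prior
    knowledge; clue_score(c, e, score) recomputes the overall
    credibility (Eq 3) from the current score table.
    """
    # seed pairs fixed at 1.0, others initialized uniformly (Eq 4)
    score = {}
    for c, cand in candidates.items():
        for e in cand:
            score[(c, e)] = 1.0 if (c, e) in seeds else 1.0 / len(cand)

    def confidence():
        return sum(max(score[(c, e)] for e in cand)
                   for c, cand in candidates.items() if cand)

    conf = confidence()
    for _ in range(max_iter):
        # compute all updates first, then apply: synchronous update,
        # matching the two separate loops in Algorithm 1
        updates = {}
        for (c, e) in score:
            if (c, e) in seeds:
                continue  # seed scores stay fixed
            updates[(c, e)] = min(1.0, clue_score(c, e, score))
        score.update(updates)
        new_conf = confidence()
        if new_conf - conf <= tol:
            break
        conf = new_conf
    return score
```

The synchronous update matters: within one iteration, every pair is rescored against the same snapshot of the score table, so propagation order cannot bias the result.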

Experiments
We first evaluate our approach on aligning nodes across the cross-lingual BINets for fine-grained cross-lingual stream alignment in Section 4.1. Then, we show the value of the derived alignments for endless language knowledge acquisition in Section 4.2.

Data
We used the public 2010 Agence France Presse (AFP) news in Chinese (Graff and Chen, 2005) and English Gigaword (Graff et al., 2003) as our cross-lingual text streams. The Chinese stream has 17,327 documents while the English one contains 186,737 documents.
We removed stopwords and conducted lemmatization and name tagging for the English stream, and performed word segmentation and name tagging for the Chinese stream, using the Stanford CoreNLP toolkit.
We detected bursts and constructed the BINets for the Chinese and English streams following (Ge et al., 2016a). The constructed Chinese BINet has 7,360 nodes and 33,892 edges, while the English one has 8,852 nodes and 85,125 edges. Our seed bilingual lexicon was released by Zens and Ney (2004) and contains 81,990 Chinese word entries, each of which has an English translation. Among the 7,360 nodes in the Chinese BINet, 2,281 nodes need to be deciphered since their words are not in the bilingual lexicon.

Evaluation Setting
We evaluate our approach in an end-to-end fashion. For a node c in the Chinese BINet, we choose the node e* with the highest score as c's counterpart in the English BINet:

e* = arg max_{e ∈ Cand(c)} Score(c, e)
We rank the aligned node pairs by their scores and manually evaluate the quality of the top K pairs. A pair ⟨c, e⟩ is annotated as correct if e is a correct translation of c or if e refers to the same entity that c refers to. The annotation is done by three human judges with 89.4% agreement. The disagreements mainly arise from the ambiguity of some named entities. In the evaluation, we consider ⟨c, e⟩ correct if at least two judges annotate it as correct.
We compare our approach with baselines that use various combinations of the clues to verify candidates for decipherment, as well as with the state-of-the-art algorithm for language decipherment from non-parallel corpora:
• pv, tv, nv and cv: decipherment using only the pronunciation, translation, neighbor and burst correlation clue respectively, as well as their combinations (e.g., pv+tv);
• Bayesian inference (Dou and Knight, 2012): the state-of-the-art decipherment approach based on bigram language models across languages. We adapt it to our experimental setting by considering adjacent nodes in a BINet as bigrams for decipherment.
We used the 2009 AFP Chinese/English news in Gigaword as our development set to tune the hyperparameters. Since our approach has only 4 parameters (i.e., η, λ, γ, δ in Eq (3)), it is easy to tune them using grid search (from 0.0 to 1.0 with a step of 0.2) on the development set. For the baselines other than Bayesian inference, the score computation function is almost identical to Eq (3) except that the weights of the unused clues are set to 0.

Results
We present the results in Figure 4. Our approach outperforms all the baselines because it considers various clues for decipherment. Among the baselines, the accuracy scores of pv and tv drop dramatically as K increases, because a single clue can only decipher a limited number of nodes effectively. pv+tv alleviates the problem to some extent: its accuracy does not drop as drastically as that of pv or tv because multiple clues allow us to decipher more nodes, but its accuracy is still unsatisfactory. Among the clues, cv performs worst, demonstrating that the burst correlation clue alone is far from sufficient for decipherment. Compared with pv, tv and cv, nv deciphers the nodes in the Chinese BINet through propagation, but the neighbor clue alone is not sufficient for accurate decipherment either. It is notable that nv achieves performance comparable to the Bayesian inference method, which uses similar clues, demonstrating the effectiveness of our decipherment framework despite its simplicity. Moreover, our graph-based decipherment approach is more flexible in incorporating a variety of clues. When it is combined with pv+tv, the performance shows a significant boost and achieves approximately 90% accuracy on the top 200 results, though it is slightly inferior to our final approach due to the lack of awareness of burst correlation. Another interesting observation from Figure 4 is that our approach clearly knows the confidence of its predictions. For the top 100 mined pairs with the highest confidence scores (i.e., the score in Eq (3)), the accuracy is 98%. Therefore, it is easy to control the quality of the mined pairs, which is important for a text mining algorithm.
We also study the effect of language resources on performance. We first randomly sample different numbers of entries from the original bilingual lexicon to form new bilingual lexicons. The results in Figure 5 show that accuracy improves as the size of the bilingual lexicon grows, because more prior knowledge benefits deciphering the BINet. In addition, we test our approach in a low-resource setting where there is no knowledge of the romanization system (i.e., pinyin) and no pre-trained word segmentation or name tagging tools are available. The only available resource is a very small bilingual lexicon with the 1,000 most common Chinese words and their corresponding English translations. In this setting, we use an unsupervised Chinese word segmentation approach that combines a Hierarchical Dirichlet Process (HDP) model with a Bayesian HMM model (Chen et al., 2014) to segment Chinese text, instead of the preprocessing steps mentioned in Section 4.1.1. According to Figure 5, our approach still performs well in the low-resource setting although its accuracy curve is lower than in the rich-resource settings, demonstrating that it can work in both rich- and low-resource settings. To test generalization ability, we evaluate our approach with the same hyperparameters on another pair of coordinated text streams: the AFP Chinese and APW English news streams in 2008. The results in Figure 6 show that our decipherment approach consistently outperforms the other baseline and still deciphers the top 100 nodes with high accuracy, even though the curve for 2008 is lower than that for 2010. The performance difference between 2008 and 2010 mainly arises from the difference in topic overlap: in the 2010 streams, the Chinese and English news are from the same news agency (i.e., AFP), so the topic overlap of the cross-lingual streams is larger than in 2008, allowing more nodes to be deciphered correctly.
Finally, we investigated the performance of our approach with various sizes of data provided, as shown in Figure 7. When the data size is small (e.g., 6-month coordinated text streams), the approach works poorly because there are very few nodes in the BINets that can be aligned. As the data size increases, our approach efficiently harvests a growing number of high-quality alignments (efficiency is reported in the supplementary notes), as reflected by the higher curves in Figure 7. Considering the massive coordinated text streams generated every day, if the approach can be applied to these endless streams, it is possible to monitor the streaming data and derive countless alignments for never-ending language knowledge acquisition.

Endless language knowledge mining

Table 1 shows the stream alignment result of our approach. As demonstrated above, we can derive a variety of language knowledge from the fine-grained cross-lingual alignments. Word/entity translations are the main knowledge that can be derived from our alignment results, by extracting word pairs from the aligned cross-lingual node pairs. Formally, we find a Chinese word w's English translation w* as follows:

w* = w(e*),  where e* = arg max_{e ∈ V_e} max_{c ∈ V_c(w)} Score(c, e)

where V_c(w) is the set of Chinese nodes whose word is w, and w(e) denotes the word of node e.
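The translation extraction rule could be implemented as follows; the node and score-table representations are assumptions of this sketch:

```python
def extract_translation(w, score, nodes_cn, word_of):
    """Pick Chinese word w's English translation from aligned node pairs.

    nodes_cn(w) returns the Chinese nodes whose word is w (V_c(w));
    score is the final (c, e) -> credibility table produced by
    decipherment; word_of(e) gives the word of an English node.
    Returns (w*, best_score), or (None, 0.0) if w has no aligned node.
    """
    best_e, best = None, float("-inf")
    for c in nodes_cn(w):
        for (c2, e), s in score.items():
            if c2 == c and s > best:
                best_e, best = e, s
    return (word_of(best_e), best) if best_e is not None else (None, 0.0)
```

Because a word may have several burst-period nodes, the outer max over V_c(w) lets any one confidently deciphered burst of w determine its translation.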
We evaluate our approach on mining translations of bursty Chinese words, using the evaluation criteria of bilingual lexicon extraction. Specifically, we test how many out-of-vocabulary (OOV) words appearing in the Chinese BINet are correctly translated. The datasets used for evaluation are the 2010 and 2008 streams in Figure 6. In total, there are 1,226 and 1,082 distinct Chinese OOV words (excluding incorrectly segmented words) in the corresponding Chinese BINets. Accuracy is used to measure the proportion of words that are correctly translated, as in (Tamura et al., 2012). Table 2 compares our approach with representative bilingual lexicon extraction approaches.

Model                                  Acc1 (2010)   Acc1 (2008)
CONTEXT (Fung and Yee, 1998)           0.32%         0.37%
COLP (Tamura et al., 2012)             0.32%         0.46%
SIMLP (Tamura et al., 2012)            0.49%         0.46%
DIVERSE (Schafer and Yarowsky, 2002)   5.22%         4.25%
DIVERSESP (Sproat et al., 2006)        5.46%         4.44%
BAYESIAN(LM) (Dou and Knight, 2012)    0.57%         0.55%
BAYESIAN(BINET)                        11.17%        4.81%
Ours                                   28.38%        19.78%

CONTEXT is one of the earliest approaches for extracting word translations from comparable corpora based on context similarity. COLP and SIMLP are label propagation models over word co-occurrence and similarity graphs for bilingual lexicon extraction. DIVERSE is a variant of CONTEXT that adds various information (e.g., pronunciation and temporality), and DIVERSESP uses phonetic and frequency correlation with a score propagation strategy. BAYESIAN is the Bayesian decipherment approach introduced in the previous section, evaluated in two settings (based on traditional bigram language models and on BINets). According to Table 2, our approach substantially outperforms the other approaches on both datasets, showing its advantages for mining translations of bursty words in coordinated text streams. It is also notable that the BINet-based BAYESIAN improves over its LM-based counterpart, demonstrating the advantage of burst-level alignment for this task.
In addition to the comparisons with the classical baselines, we also test the latest representative unsupervised bilingual lexicon extraction approaches (Zhang et al., 2017a,b) based on word embeddings and generative adversarial nets (GANs). Unfortunately, these approaches do not perform well in our setting. For example, the approach in (Zhang et al., 2017a) achieved <1% accuracy (we ran the code released by the authors; their reported accuracy for common words with over 1,000 occurrences is 2.53% on the Gigaword corpus). One reason is that the topic overlap of coordinated cross-lingual text streams is not as significant as in the Wikipedia data used for their experiments; another is that their approaches focus on common fundamental words like "城市(city)" while our targets are OOVs like "东协(ASEAN)" which do not frequently appear in a corpus. In contrast, our approach is more practical: it not only works well on easily available and endless coordinated text streams without a high content overlap requirement, but also accurately mines translations of many OOVs which appear infrequently and whose translations genuinely need to be mined. As illustrated in Figure 1, besides word/entity translations, various types of knowledge can also be derived from the BINet alignment results as by-products. For example, for node 9 in Table 1, deciphering the nickname "小威" into Serena Williams can benefit cross-lingual entity linking. Nodes 10-11 also demonstrate the potential effect on synonym detection, entity linking and coreference resolution, as in the case of Figure 1(c). Nodes 12-13 show that the deciphered BINets can detect polysemous/multi-referential words like "央行(Central bank)", which may refer to different entities during different burst periods, as in Figure 1(b). Moreover, the deciphered BINets can also help entity phrase extraction, based on the idea of Figure 1(d).
For example, in nodes 14-15, 翁山苏姬(Aung San Suu Kyi) is not recognized as a person name by the Chinese name tagger; instead, it is mistakenly separated into two words, 翁山(Aung San) and 苏姬(Suu Kyi). However, since 翁山(Aung San) and 苏姬(Suu Kyi) are deciphered into the same English named entity, Aung San Suu Kyi, we can merge them back to form the correct entity.
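The phrase-merging idea can be sketched as a single pass over a tokenized sentence; the alignment and adjacency lookups are assumed to be precomputed from the deciphered BINets:

```python
def merge_entity_phrases(tokens, alignment, adjacent):
    """Merge adjacent Chinese words deciphered to the same English
    entity into one phrase (the Figure 1(d) idea).

    tokens is a tokenized Chinese sequence; alignment maps a Chinese
    word to its deciphered English entity (or None); adjacent(a, b) is
    the pi indicator: True if a, b frequently occur as a bigram.
    """
    merged = []
    for w in tokens:
        prev = merged[-1][-1] if merged else None
        if (prev is not None
                and alignment.get(w) is not None
                and alignment.get(prev) == alignment.get(w)
                and adjacent(prev, w)):
            merged[-1].append(w)   # extend the current entity phrase
        else:
            merged.append([w])     # start a new unit
    return ["".join(p) for p in merged]
```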
To evaluate our approach's performance on language knowledge acquisition, we align the AFP Chinese-English text streams from 2002 to 2010. The Chinese stream has 119,196 documents and the English one contains 1,608,636 documents. Our approach obtained 7,211 node alignments. Among them, we focus on the top 500 alignments to guarantee their quality and use the aforementioned ideas for deriving language knowledge. Table 3 shows the result of deriving knowledge from the alignments. Among the top 500 alignments, we derived 416 correct word/entity translation pairs (83% accuracy). We also correctly derived 8 polysemous/multi-referential words, 49 synonymous/co-referential word pairs and 84 entity phrases as by-products. It is notable that the amount of coordinated cross-lingual text stream data available on the web is much larger than that used in our experiments, and these streams are endlessly updated.
This means it is promising to endlessly derive language knowledge by applying our approach to these huge, endless cross-lingual text streams, which may benefit NLP applications like machine translation, entity linking and name tagging.
In contrast to previous cross-lingual projection work such as data transfer (Padó and Lapata, 2009) and model transfer (McDonald et al., 2011), we do not require any parallel data. Moreover, our BINets are cheap to construct and can be easily extended to other languages. This is also the first attempt to apply the decipherment idea (e.g., (Ravi and Knight, 2011; Dou and Knight, 2012; Dou et al., 2014)) to graph structures instead of sequence data.

Conclusions and Future Work
This paper proposes an approach to deciphering Burst Information Networks constructed from a foreign language as a novel way to align cross-lingual text streams. For the first time, we propose to model stream alignment as a network decipherment problem. By leveraging the network structures with stream-level burst features as well as various clues, our approach can accurately align the important information units across languages and derive a variety of knowledge. Given that our approach is unsupervised, effective, intuitive, interpretable and easy to implement, it is a promising framework for never-ending language knowledge mining from big data, which may benefit NLP applications such as machine translation and cross-lingual information access.
For future work, we plan to 1) conduct more experiments and analyses following this preliminary study to verify our approach's effectiveness for more languages and domains (e.g., social streams vs. news streams); 2) attempt to use word embeddings (e.g., word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) and ELMo (Peters et al., 2018)) for local context encoding and use them as a clue for decipherment; 3) apply our approach to real-time coordinated text streams for never-ending knowledge mining and use the mined knowledge to improve downstream applications.