Multilingual Clustering of Streaming News

Clustering news across languages enables efficient media monitoring by aggregating articles from multilingual sources into coherent stories. Doing so in an online setting allows scalable processing of massive news streams. To this end, we describe a novel method for clustering an incoming stream of multilingual documents into monolingual and crosslingual clusters. Unlike typical clustering approaches that report results on datasets with a small and known number of labels, we tackle the problem of discovering an ever growing number of cluster labels in an online fashion, using real news datasets in multiple languages. In our formulation, the monolingual clusters group together documents while the crosslingual clusters group together monolingual clusters, one per language that appears in the stream. Our method is simple to implement, computationally efficient and produces state-of-the-art results on datasets in German, English and Spanish.


Introduction
Following developing news stories is imperative to making real-time decisions on important political and public safety matters. Given the abundance of media providers and languages, this endeavor is an extremely difficult task. As such, there is a strong demand for automatic clustering of news streams, so that they can be organized into stories or themes for further processing. Performing this task in an online and efficient manner is a challenging problem, not only for newswire, but also for scientific articles, online reviews, forum posts, blogs, and microblogs.
A key challenge in handling document streams is that the story clusters must be generated on the fly in an online fashion: this requires handling documents one-by-one as they appear in the document stream. In this paper, we provide a treatment to the problem of online document clustering, i.e. the task of clustering a stream of documents into themes. For example, for news articles, we would want to cluster them into related news stories.
To this end, we introduce a system which aggregates news articles into fine-grained story clusters across different languages in a completely online and scalable fashion from a continuous stream. Our clustering approach is part of a larger media monitoring project to solve the problem of monitoring massive text and TV/Radio streams (speech-to-text). In particular, media monitors write intelligence reports about the most relevant events, and being able to search, visualize and explore news clusters assists in gathering more insight about a particular story. Since relevant events may originate in any part of the world (and in many multilingual sources), it becomes imperative to cluster news across different languages.
In terms of granularity, the story clusters we are interested in are groups of articles that, for example: (i) narrate recent air-strikes in Eastern Ghouta (Syria); (ii) describe the recent launch of SpaceX's Falcon Heavy rocket.
Main Contributions While most existing news clustering approaches assume a monolingual document stream (an unrealistic scenario given the diversity of languages on the Web), we assume a general, multilingual document stream. This means that in our problem formulation story documents appear in multiple languages and we need to group them into crosslingual clusters. Our main contributions are as follows:
• We develop a system that aggregates news articles into fine-grained story clusters across different languages in a completely online and scalable fashion from a continuous stream. As discussed above, this is a highly relevant task for the use case of media monitoring.
• We formulate the problem of online multilingual document clustering and the representation that such clustering takes by interlacing the problem of monolingual clustering with crosslingual clustering. The representation of our clusters is interpretable and, similarly to topic models, consists of a set of keywords and weights associated with the relevant cluster. In our formulation, a monolingual cluster is a group of documents, and a crosslingual cluster is a group of monolingual clusters in different languages.
• We compare our approach to our own implementation of a state-of-the-art streaming method, and show substantially better results on a dataset in English, Spanish and German.

Problem Formulation
We focus on clustering a stream of documents, where the number of clusters is not fixed but is learned automatically. We denote by D a (potentially infinite) space of multilingual documents. Each document d is associated with the language in which it is written through a function $L : D \to \mathcal{L}$, where $\mathcal{L}$ is a set of languages. For example, L(d) could return English, Spanish or German. (In the rest of the paper, for an integer n, we denote by [n] the set {1, . . . , n}.) We are interested in associating each document with a monolingual cluster via the function $C(d) \in \mathbb{N}$, which returns the cluster label given a document. This is done independently for each language, such that the space of indices we use for each language is separate. Furthermore, we interlace the problem of monolingual clustering with crosslingual clustering. This means that as part of our problem formulation we are also interested in a function $E : \mathbb{N} \times \mathcal{L} \to \mathbb{N}$ that associates each monolingual cluster with a crosslingual cluster, such that each crosslingual cluster groups at most one monolingual cluster per language at any given time. The crosslingual cluster for a document d is E(C(d), L(d)).
Intuitively, building both monolingual and crosslingual clusters allows the system to leverage high-precision monolingual features (e.g., words, named entities) to cluster documents of the same language, while simplifying the task of crosslingual clustering to the computation of similarity scores across monolingual clusters. This is a smaller problem space, since there are (by definition) fewer clusters than articles. We validate this choice in §5.

The Clustering Algorithm
Each document d is represented by two vectors in $\mathbb{R}^{k_1}$ and $\mathbb{R}^{k_2}$. The first vector exists in a "monolingual space" (of dimensionality $k_1$) and is based on a bag-of-words representation of the document. The second vector exists in a "crosslingual space" (of dimensionality $k_2$) which is common to all languages. More details about these representations are discussed in §4.
Online Clustering With our clustering algorithm, we maintain two types of centroid functions. The first is a centroid function $H : \mathbb{N} \times \mathcal{L} \to \mathbb{R}^{k_1} \cup \{\bot\}$ that assists in associating each document with a monolingual cluster. The second is a centroid function $G : \mathbb{N} \to \mathbb{R}^{k_2} \cup \{\bot\}$ that assists in associating each monolingual cluster with a crosslingual cluster. The ⊥ symbol is reserved to denote cluster indices that are not yet associated with any cluster.
In our algorithm, we need to incrementally construct the functions H, G (the two centroid functions), C (the monolingual clustering function) and E (the crosslingual clustering function). Informally, we do so by first identifying a monolingual cluster for an incoming document by finding the closest centroid with the function H, and then associating that monolingual cluster with the crosslingual cluster that is closest based on the function G. The first update changes C and the second update changes E. Once we do that, we also update H and G to reflect the new information in the incoming document.
Example Figure 1 depicts the algorithm and the state it maintains. A document in some language ($d_9$) appears in the stream, and is clustered into one of the monolingual clusters (circles) that group together documents about the same story (for example, $c_{2,\mathrm{DE}}$ could be a German cluster about a recent political event). Then, following this monolingual update, the online clustering algorithm updates the crosslingual clusters (rounded rectangles), each grouping together a set of monolingual clusters, at most one per language. The centroids for the monolingual clusters are maintained by the function H. For example, H(2, English) gives the centroid of the upper-left English monolingual cluster. The function G maintains the centroids of the crosslingual clusters. Considering the upper-leftmost crosslingual cluster, $a_1$, G(1) returns its centroid.

Figure 1: A pictorial description of the algorithm and the state it maintains (panels: monolingual space, crosslingual space). The algorithm maintains a monolingual cluster space, in which each cluster is a set of documents in a specific language. The algorithm also maintains a crosslingual cluster space, in which a cluster is a set of monolingual clusters in different languages. Documents are denoted by $d_i$, monolingual clusters by $c_i$ (circles) and crosslingual clusters by $a_i$.

Algorithm
To be more precise, the online clustering process works as follows. H and G start by returning ⊥ for every cluster number, both monolingual and crosslingual. For a new incoming document d, represented as a vector, we compute a similarity metric $\Gamma_0 : \mathbb{R}^{k_1} \times \mathbb{R}^{k_1} \to \mathbb{R}$ between the document vector and each of the existing centroids, i.e. those with index in {i | H(i, L(d)) ≠ ⊥}. If the largest similarity exceeds a threshold τ, attained at cluster index j, then we set C(d) = j. In that case, we also update the value of H(j, L(d)) to include new information from document d, as detailed below under "H update." If none of the similarity values exceeds τ, we find the first i such that H(i, L(d)) = ⊥ (the first cluster id which is still unassigned) and set C(d) = i, thereby creating a new cluster. We again follow an "H update", this time to start a new cluster.
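The monolingual step above can be illustrated with a short Python sketch for a single language. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: plain cosine similarity stands in for Γ_0, centroids are stored as running sums plus counts, cluster ids start at 0, and the names `MonolingualClusterer`, `centroid` and `add` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


class MonolingualClusterer:
    """Online clustering for one language (hypothetical helper).

    Assumption: plain cosine similarity stands in for the similarity Gamma_0.
    """

    def __init__(self, tau):
        self.tau = tau    # similarity threshold tau
        self.sums = []    # per-cluster sum of document vectors
        self.counts = []  # per-cluster number of documents

    def centroid(self, j):
        """Plays the role of H(j, language): mean of the vectors in cluster j."""
        return [s / self.counts[j] for s in self.sums[j]]

    def add(self, vec):
        """Assign vec to the most similar cluster or open a new one; return C(d)."""
        best_j, best_sim = None, float("-inf")
        for j in range(len(self.sums)):
            sim = cosine(vec, self.centroid(j))
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim > self.tau:
            # "H update" for an existing cluster
            self.sums[best_j] = [s + x for s, x in zip(self.sums[best_j], vec)]
            self.counts[best_j] += 1
            return best_j
        # No centroid is similar enough: start a new cluster
        self.sums.append(list(vec))
        self.counts.append(1)
        return len(self.sums) - 1
```

In a multilingual stream, one such object per language would keep the cluster index spaces separate, as required by the problem formulation.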
In both cases, we also update the function G, by selecting the best crosslingual cluster for the recently updated (or created) monolingual cluster. To this end, we use another similarity metric $\Gamma_1 : \mathbb{R}^{k_2} \times \mathbb{R}^{k_2} \to \mathbb{R}$. Accordingly, we compute the similarity (using $\Gamma_1$) between the updated (or created) monolingual cluster and all monolingual clusters in each candidate crosslingual cluster, in the crosslingual feature space. The crosslingual cluster with the highest sum of similarity scores is then selected. We also experimented with computing this similarity by considering just the monolingual cluster of a particular "pivot language". The pivot language is a language that serves as the main indicator for a given crosslingual cluster. In our experiments, we mostly use English as the pivot language.
H Update To update H, we maintain a centroid for each cluster that is created as the average of all monolingual representations of documents that belong to that cluster. This is done for each language separately. This update can be done in $O(k_1)$ time in each step. Similarly, the update of G can be done in $O(k_2)$ time. In principle, we consider an "infinite" stream of documents, which means the number of documents in each cluster can be large. As such, for efficiency purposes, updates to H are immutable, which means that when a document is assigned to a monolingual cluster, that assignment is never changed.
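The constant-time cost of this update follows from writing the centroid as a running mean. As a sketch, assume that each cluster also stores the number n of documents assigned to it so far (a bookkeeping detail not spelled out above), and let v be the monolingual vector of the newly assigned document:

```latex
H_{\text{new}} \;=\; \frac{n\,H_{\text{old}} + v}{n + 1}
\;=\; H_{\text{old}} + \frac{v - H_{\text{old}}}{n + 1}
```

Each coordinate is touched once, so the update costs $O(k_1)$ regardless of how many documents the cluster already contains. The same argument gives $O(k_2)$ for G.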
G Update As described, updates to function G result in associating a monolingual cluster with a crosslingual cluster (and consequently, with other monolingual clusters). Therefore, errors committed in updating G are of a higher magnitude than those committed in H, since they involve groups of documents. We also note that the best crosslingual cluster for a particular monolingual cluster might not be found right at the beginning of the process. We experiment with two types of updates to G. The first is immutable: changes to G are not reversed (as described above). The second introduces a novel technique that makes a sequence of changes to G if necessary, as a mechanism to self-correct past clustering decisions. When a past decision is modified, it may result in a chain of consequent modifications ("toppling of dominoes") which need to be evaluated. We call this method "domino-toppling".
The motivation behind this technique is the change in news stories over time. The technique allows the method to modify past crosslingual clustering decisions and enables higher quality clustering results.
Our method of "domino-toppling" works by making (potentially sequences of) changes to previous clustering decisions for the crosslingual clusters, at each step placing a residual monolingual cluster in the crosslingual cluster that is most similar to it. Figure 2 gives the pseudocode for domino toppling.
This "domino-toppling" technique could in principle have quadratic complexity in the number of crosslingual clusters. However, we have verified that in practice it converges very fast, and in our evaluation dataset only 1% of the crosslingual updates result in topples. We apply this technique only to update G (and not H) because reversing cluster assignments in G can be done much more efficiently than in H: the total number of monolingual clusters (the clustered elements in G) is significantly smaller than the number of documents (the clustered elements in H). Crosslingual clustering is also a harder problem, which motivated the additional effort of developing this algorithm.

Inputs: A monolingual cluster c and a list of pairs $\langle a_j, \Gamma_1(c, a_j) \rangle$, $j \in [N]$.

Algorithm:
• For all pairs $\langle a_j, \Gamma_1(c, a_j) \rangle$, $j \in [N]$, ordered by the second coordinate:
  • If L(c) is not in $a_j$, add c to $a_j$ and break.
  • Otherwise, let $y \leftarrow M(a_j, L(c))$. If $\Gamma_1(c, a_j) > \Gamma_1(y, a_j)$ then:
    • Add c to $a_j$, remove y from $a_j$, call domino toppling with y playing the role of c, and break.
• If c is left unassigned, create $a_{N+1}$ and add c to it.

Figure 2: Crosslingual "domino-toppling". $a_j$ is the jth crosslingual cluster (out of N clusters in total) and $\Gamma_1$ is the similarity between them as in §4. L(c) is the language of cluster c. $M(a, \ell)$ returns the monolingual cluster for language $\ell \in \mathcal{L}$ in crosslingual cluster a. See text for details.
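The procedure in Figure 2 can be sketched in Python as follows. This is an illustrative sketch, not the evaluated implementation. It assumes that a crosslingual cluster is a dictionary from language to monolingual cluster id, that candidates are visited from most to least similar, and that the similarity is supplied as a callable `sim(cluster_id, other_ids)` standing in for Γ_1, where `other_ids` are the members in other languages. All names are hypothetical.

```python
def domino_topple(c, lang_of, clusters, sim):
    """Place monolingual cluster c into a crosslingual cluster (in place).

    c:        id of the monolingual cluster to place
    lang_of:  dict, monolingual cluster id -> language, i.e. L(c)
    clusters: list of dicts, language -> monolingual cluster id
    sim:      callable (cluster_id, other_ids) -> float, stand-in for Gamma_1
    """
    lang = lang_of[c]

    def score(x, a):
        # Similarity of x to the members of a written in other languages
        others = [m for l, m in a.items() if l != lang_of[x]]
        return sim(x, others)

    # Assumption: candidates are ordered from most to least similar
    ranked = sorted(clusters, key=lambda a: score(c, a), reverse=True)
    for a in ranked:
        if lang not in a:
            a[lang] = c  # free slot for this language
            return
        y = a[lang]      # current occupant, M(a, L(c))
        if score(c, a) > score(y, a):
            a[lang] = c  # c replaces y ...
            domino_topple(y, lang_of, clusters, sim)  # ... and y is re-placed
            return
    clusters.append({lang: c})  # c left unassigned: create a new cluster
```

With a fixed similarity function, every replacement strictly increases the score of the occupant of a language slot, so the chain of topples terminates.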

Document Representation
In this section, we give more details about the way we construct the document representations in the monolingual and crosslingual spaces. In particular, we define the similarity functions $\Gamma_0$ and $\Gamma_1$ that were referred to in §3.

Monolingual Representation
The monolingual representation for each document d in language L(d) is a vector in $\mathbb{R}^{k_1}$ constructed from several TF-IDF subvectors with words, word lemmas and named entities. Each subvector is repeated for different sections of the document: the title, the body and both of them together. Besides these text fields and document timestamps, no other metadata was used. To detect named entities, we used Priberam's Text Analysis (Amaral et al., 2008) for English and Spanish, and Turbo Parser (Martins et al., 2013) for German. The extracted entities consist of people, organizations, places and other types.
Crosslingual Representation In the crosslingual space, a document representation is a vector in $\mathbb{R}^{k_2}$. Let e(d, i) be a crosslingual embedding of word i in the document d, which is a vector of length m. Then the document representation v(d) of d consists of subvectors of the form $v(d) = \sum_{i=1}^{n} t_i \, e(d, i)$, where $t_i$ is the TF-IDF score of the ith word in the relevant section of the document (title, body or both). As detailed further in §5, we compute IDF values from a large pretraining dataset. Furthermore, for both the monolingual and crosslingual cases, we also experiment with using document timestamp features, as explained in §4.1. We use a new set of diverse timestamp features in addition to the simple absolute difference (in hours) between timestamps used by Rupnik et al. (2016).
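As an illustration of the crosslingual subvector for one section, the following Python sketch computes a TF-IDF-weighted sum of word embeddings. It is a minimal sketch under assumptions that the text does not fix: term frequency is the raw count, the sum runs over distinct words, and words without an embedding are skipped. The function name and its parameters are hypothetical.

```python
from collections import Counter


def crosslingual_vector(tokens, idf, embeddings, dim):
    """TF-IDF-weighted sum of crosslingual word embeddings for one section.

    tokens:     list of words in the section (title, body or both)
    idf:        dict word -> IDF value (precomputed on a separate corpus)
    embeddings: dict word -> list of floats of length dim
    """
    tf = Counter(tokens)
    vec = [0.0] * dim
    for word, count in tf.items():
        if word not in embeddings:
            continue  # assumption: out-of-vocabulary words are skipped
        weight = count * idf.get(word, 0.0)  # t_i, with raw-count TF (assumed)
        for k, x in enumerate(embeddings[word]):
            vec[k] += weight * x
    return vec
```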

Similarity Metrics
Our similarity metric computes weighted cosine similarity on the different subvectors, both in the case of monolingual clustering and crosslingual clustering. Formally, for the monolingual case, the similarity is given by a function defined as:
$$\Gamma_0(d_j, c_l) = \sum_{i=1}^{K} \phi_i(d_j, c_l) \cdot q^0_i + \sum_{i=1}^{3} \gamma_i(d_j, c_l) \cdot q^1_i \qquad (1)$$
and is computed on the TF-IDF subvectors, where K is the number of subvectors for the relevant document representation. For the crosslingual case, we discuss below the function $\Gamma_1$, which has a similar structure.
Here, $d_j$ is the jth document in the stream and $c_l$ is a monolingual cluster. The function $\phi_i(d_j, c_l)$ returns the cosine similarity between the ith subvector of the representation of the jth document and the corresponding subvector of the centroid for cluster $c_l$. The vector $q^0$ denotes the weights applied to the cosine similarity values of the subvectors, whereas $q^1$ denotes the weights for the timestamp features, as detailed below. Details on learning the weights $q^0$ and $q^1$ are discussed in §4.2.
The function γ(d, c), which maps a document and cluster pair to $\mathbb{R}^3$, is defined as follows. Let
$$f(t) = \exp\left(-\frac{(t - \mu)^2}{2\sigma^2}\right)$$
for a given µ and σ > 0. For each document d and cluster c, we generate the following three-dimensional vector $\gamma(d, c) = (s_1, s_2, s_3)$:
• $s_1 = f(t(d) - n_1(c))$, where t(d) is the timestamp of document d and $n_1(c)$ is the timestamp of the newest document in cluster c.
• $s_2 = f(t(d) - n_2(c))$, where $n_2(c)$ is the average timestamp of all documents in cluster c.
• $s_3 = f(t(d) - n_3(c))$, where $n_3(c)$ is the timestamp of the oldest document in cluster c.
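The timestamp features and their combination with the cosine similarities can be sketched in Python as follows. This is a sketch under assumptions rather than the exact implementation: f is taken to be an unnormalized Gaussian, Γ_0 is taken to be a weighted sum of the K cosine similarities and the three timestamp features, timestamps are expressed in hours, and the defaults µ = 0 and σ = 72 are the values reported in §5. The function names are hypothetical.

```python
import math


def f(delta_hours, mu=0.0, sigma=72.0):
    """Unnormalized Gaussian of a timestamp difference (assumed form of f)."""
    return math.exp(-((delta_hours - mu) ** 2) / (2.0 * sigma ** 2))


def timestamp_features(t_doc, cluster_times, mu=0.0, sigma=72.0):
    """gamma(d, c) = (s1, s2, s3): newest, average and oldest timestamp in c."""
    newest = max(cluster_times)
    mean = sum(cluster_times) / len(cluster_times)
    oldest = min(cluster_times)
    return tuple(f(t_doc - n, mu, sigma) for n in (newest, mean, oldest))


def gamma0(cosines, ts_feats, q0, q1):
    """Weighted sum of K cosine similarities and three timestamp features."""
    text_part = sum(p * w for p, w in zip(cosines, q0))
    time_part = sum(s * w for s, w in zip(ts_feats, q1))
    return text_part + time_part
```

With all weights set to 1, a document published at the same time as a single-document cluster receives the maximum timestamp contribution of 3, and the contribution decays as the time gap grows relative to σ.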
These three timestamp features model the time aspect of the online stream of news data and help disambiguate clustering decisions, since time is a valuable indicator that a news story has changed, even if a cluster representation has a reasonable match in the textual features with the incoming document. In the same way that a news story becomes popular and fades over time (Lerman and Hogg, 2010), we model the probability of a document belonging to a cluster (in terms of timestamp difference) with a probability distribution.
For the case of crosslingual clustering, we introduce $\Gamma_1$, which has a similar definition to $\Gamma_0$, except that instead of passing document/cluster similarity feature vectors, we pass cluster/cluster similarities, across all language pairs. Furthermore, the features are the crosslingual embedding vectors of the sections title, body and both combined (similarly to the monolingual case) and the timestamp features. As the cluster timestamp, we use the average timestamp of all articles in the cluster.

Learning to Rank Candidates
In §4.1 we introduced $q^0$ and $q^1$ as the weight vectors for the several document representation features. We experiment with both setting these weights to just 1 ($q^0_i = 1\ \forall i$ and $q^1_j = 1\ \forall j \in [3]$) and learning these weights using support vector machines (SVMs). To generate the SVM training data, we simulate the execution of the algorithm on a training data partition (on which we do not evaluate) for which the gold standard labels are given. We run the algorithm using only the first subvector $\phi_1(d_j, c_l)$, which is the TF-IDF vector with the words of the document in the body and title. For each incoming document, we create a collection of positive examples from the document and the clusters which share at least one document with it in the gold labeling. We then generate 20 negative examples for the document from the 20 best-matching clusters which are not correct. To find the best-matching clusters, we rank them according to their similarity to the input document using only the first subvector $\phi_1(d_j, c_l)$.
Using this scheme we generate a collection of ranking examples (one for each document in the dataset, with the ranking of the best cluster matches), which are then used to train a model with the SVM-Rank algorithm (Joachims, 2002). We run 5-fold cross-validation on this data to select the best model, and train both a separate model for each language according to $\Gamma_0$ and a crosslingual model according to $\Gamma_1$.
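The selection of positive and negative clusters for one incoming document can be sketched in Python as follows. This is a minimal sketch, not the training pipeline used in our experiments. It assumes that the current system clusters are given as lists of document ids, that gold labels are available for all of them, and that `score` returns the similarity computed from the first subvector only. The function name and data layout are hypothetical.

```python
def ranking_examples(doc_id, gold, clusters, score, max_negatives=20):
    """Return (positives, negatives) as cluster indices for one document.

    gold:     dict doc_id -> gold story label
    clusters: list of lists of doc_ids (current system clusters)
    score:    callable cluster_index -> similarity to the incoming document
    """
    label = gold[doc_id]
    # Positive: the cluster holds at least one document of the same gold story
    positives = [
        i for i, members in enumerate(clusters)
        if any(gold[m] == label for m in members)
    ]
    # Negatives: the best-matching clusters among the incorrect ones
    wrong = [i for i in range(len(clusters)) if i not in positives]
    wrong.sort(key=score, reverse=True)
    return positives, wrong[:max_negatives]
```

Each returned pair would then be converted into one ranking query in which the positive clusters are preferred over the negative ones.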

Experiments
Our system was designed to cluster documents from a (potentially infinite) real-world data stream. The datasets typically used in the literature (TDT, Reuters) have a small number of clusters (≈ 20) with coarse topics (economy, society, etc.), and are therefore not relevant to the media monitoring use case we address, which requires much more fine-grained story clusters about particular events. To evaluate our approach, we adapted a dataset constructed for the different purpose of binary classification of joining cluster pairs. We processed it to become a collection of articles annotated with monolingual and crosslingual cluster labels. Statistics about this dataset are given in Table 1. As described further, we tune the hyper-parameter τ on the development set. As for the hyper-parameters related to the timestamp features, we fixed µ = 0 and tuned σ on the development set, yielding σ = 72 hours (3 days). To compute IDF scores (which are global numbers computed across a corpus), we used a different and much larger dataset that we collected from Deutsche Welle's news website (http://www.dw.com/). The dataset consists of 77,268, 118,045 and 134,243 documents for Spanish, English and German, respectively.
The conclusions from our experiments are: (a) the weighting of the similarity metric features using SVM significantly outperforms unsupervised baselines such as CluStream (Table 2); (b) learning when to create a new cluster with an SVM classifier improves over a grid search for the optimal τ (Table 4); (c) separating the feature space into one for monolingual clusters in the form of keywords and the other for crosslingual clusters based on crosslingual embeddings significantly helps performance.
Table 1: Statistics of the dataset adapted from Rupnik et al. (2016), as explained in §5. "Size" denotes the number of documents in the collection, "Avg. L." is the average number of words in a document, "C" denotes the number of clusters in the collection and "Avg. S." is the average number of documents in each cluster.

Evaluation Method
We evaluate clustering in the following manner: let tp be the number of correctly clustered-together document pairs, let fp be the number of incorrectly clustered-together document pairs and let fn be the number of incorrectly not-clustered-together document pairs. Then we report precision as $\frac{tp}{tp+fp}$, recall as $\frac{tp}{tp+fn}$ and $F_1$ as the harmonic mean of the precision and recall measures. We do the same to evaluate crosslingual clustering, but on a higher level: we count tp, fn and fp for the decisions of clustering clusters, as crosslingual clusters are groups of monolingual gold clusters.
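The pairwise measures can be computed with a few lines of Python. This sketch assumes that the gold and predicted clusterings are given as dictionaries from item to cluster label over the same set of items, and that undefined ratios are reported as 0. The function name is hypothetical. The same function applies to the crosslingual case when the items are monolingual clusters.

```python
from itertools import combinations


def pairwise_prf(gold, pred):
    """Pairwise precision, recall and F1 of a predicted clustering.

    gold, pred: dicts mapping each item to its gold / predicted cluster label.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(gold), 2):
        same_gold = gold[a] == gold[b]
        same_pred = pred[a] == pred[b]
        if same_pred and same_gold:
            tp += 1  # correctly clustered together
        elif same_pred:
            fp += 1  # incorrectly clustered together
        elif same_gold:
            fn += 1  # incorrectly kept apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```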

Monolingual Results
In our first set of experiments, we report results on monolingual clustering for each language separately. Monolingual clustering of a stream of documents is an important problem that has been inspected by others, such as by Ahmed et al. (2011) and by Aggarwal and Yu (2006). We compare our results to our own implementation of the online micro-clustering routine presented by Aggarwal and Yu (2006), which shall be referred to as CluStream. We note that CluStream of Aggarwal and Yu (2006) has been a widely used state-of-the-art system in media monitoring companies as well as academia, and serves as a strong baseline to this day.
In our preliminary experiments, we also evaluated an online latent semantic analysis method, in which the centroids we keep for the function H (see §3) are the average of reduced-dimensional vectors of the incoming documents, as generated by an incremental singular value decomposition (SVD) of a document-term matrix that is updated after each incoming document. However, we discovered that online LSA performs significantly worse than representing the documents in the way described in §4. Furthermore, it was also significantly slower than our algorithm due to the time it took to perform the singular value decomposition.
Table 2 gives the final monolingual results on the three datasets. For English, we see that the significant improvement we get using our algorithm over the algorithm of Aggarwal and Yu (2006) is due to an increased recall score. We also note that the trained models surpass the baseline for all languages, and that the timestamp feature (denoted by TS), while not required to beat the baseline, contributes substantially in all cases. Although the results for both the baseline and our models differ across languages, one can verify a consistent improvement of the latter over the former, suggesting that the score differences are mostly tied to the different difficulty of the datasets for each language. The presented scores show that our learning framework generalizes well to different languages and enables high quality clustering results.
Table 2: Clustering results on the labeled dataset. We compare our algorithm (with and without timestamps) with the online micro-clustering routine of Aggarwal and Yu (2006) (denoted by CluStream). The $F_1$ values are for the precision (P) and recall (R) in the following columns. See Table 3 for a legend of the different models. Best result for each language is in bold.

Clustering experiments
To investigate the impact of the timestamp features, we ran an additional experiment using only the same three timestamp features as used in the best model on the English dataset. This experiment yielded scores of $F_1$ = 61.1, P = 44.5 and R = 97.6, which leads us to conclude that while these features are not competitive when used alone (hence temporal information by itself is not sufficient to predict the clusters), they contribute significantly to recall in the final feature ensemble. We note that, as described in §3, the optimization of the τ parameter is part of the development process. The parameter τ is a similarity threshold used to decide whether an incoming document should be merged into the best cluster or start a new one. We tune τ on the development set for each language, and the sensitivity to it is demonstrated in Figure 3 (this process is further referred to as $\tau_{\text{search}}$). Although applying grid-search on this parameter is the most immediate approach to this problem, we experimented with a different method which yielded superior results: as described further, we discuss how to make this decision with an additional classifier (denoted SVM-merge), which captures more information about the incoming documents and the existing clusters.
Additionally, we also experimented with computing the monolingual clusters with the same embeddings as used in the crosslingual clustering phase, which yielded poor results. In particular, this system achieved F 1 score of 74.8 for English, which is below the bag-of-words baseline presented in Table 2. This result supports the approach we then followed of having two separate feature spaces for the monolingual and crosslingual clustering systems, where the monolingual space is discrete and the crosslingual space is based on embeddings.
SVM ranker experiments To investigate the importance of each feature, we now consider in Table 3 the accuracy of the SVM ranker for English as described in §4.2. We note that adding features increases the accuracy of the SVM ranker, especially the timestamp features. However, the timestamp features actually interfere with our optimization of τ to identify when new clusters are needed, although they improve the SVM reranking accuracy. We speculate that this is because high accuracy in the reranking problem does not necessarily help with identifying when new clusters need to be opened. To investigate this issue, we experimented with a different technique to learn when to create a new cluster. To this end, we trained another SVM classifier just to learn this decision, this time a binary classifier using LIBLINEAR (Fan et al., 2008), passing as the input feature vector the maximum of each similarity feature between the incoming document and the current clustering pool. This way, the classifier learns when the current clusters, as a whole, belong to a different news story than the incoming document. As presented in Table 4, this method, which we refer to as SVM-merge, solved the issue of searching for the optimal τ parameter for the SVM-rank model with timestamps, greatly improving the $F_1$ score with respect to the original grid-search approach ($\tau_{\text{search}}$).
Table 4: Comparison of the two cluster-creation decision methods for the English SVM-rank model with timestamps (see Table 2). The first method, $\tau_{\text{search}}$, corresponds to executing grid-search to find the optimal clustering τ parameter (see §3). SVM-merge is an alternative method in which we train an SVM binary classifier to decide if a new cluster should be created or not, where we use as features the maximal value of each coordinate for each document in a cluster.
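The input to the SVM-merge classifier can be illustrated with a short Python sketch. It assumes that, for an incoming document, one similarity feature vector per existing cluster has already been computed (cosine similarities followed by timestamp features), and it only shows the coordinate-wise maximum described above. Training the binary classifier itself is not shown, and the function name is hypothetical.

```python
def merge_features(feature_rows):
    """Coordinate-wise maximum over the per-cluster similarity feature vectors.

    feature_rows: list of equal-length lists, one per existing cluster.
    Returns an empty list when there are no clusters yet.
    """
    if not feature_rows:
        return []
    return [max(column) for column in zip(*feature_rows)]
```

The resulting vector summarizes how well the best existing cluster matches the document on each feature, independently of which cluster attains each maximum.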

Crosslingual Results
As mentioned in §3, crosslingual embeddings are used for crosslingual clustering. We experimented with the crosslingual embeddings of Gardner et al. (2015) and Ammar et al. (2016). In our preliminary experiments we found that the former worked better for our use-case than the latter.
We test two different scenarios for optimizing the similarity threshold τ for the crosslingual case. Table 5 shows the results for these experiments. First, we consider the simpler case of adjusting a global τ parameter for the crosslingual distances, as also described for the monolingual case. As shown, this method works poorly, since the τ grid-search could not find a reasonable τ which worked well for every possible language pair. Subsequently, we also consider the case of using English as a pivot language (see §3), where distances for every other language are only compared to English, and crosslingual clustering decisions are made only based on this distance. This yielded our best crosslingual score of $F_1$ = 84.0, confirming that crosslingual similarity is of higher quality between each language and English, for the embeddings we used. This is only a small degradation with respect to the monolingual results, which is expected since clustering across different languages is a harder problem.

Related Work
Early research efforts, such as the TDT program (Allan et al., 1998), have studied news clustering for some time. The problem of online monolingual clustering algorithms (for English) has also received a fair amount of attention in the literature. One of the earlier papers by Aggarwal and Yu (2006) introduced a two-step clustering system with both offline and online components, where the online model is based on a streaming implementation of k-means and a bag-of-words document representation. Other authors have experimented with distributed representations, such as Ahmed et al. (2011), who cluster news into storylines using Markov chain Monte Carlo methods, Řehůřek and Sojka (2010), who used incremental Singular Value Decomposition (SVD) to find relevant topics from streaming data, and Sato et al. (2017), who used the paragraph vector model (Le and Mikolov, 2014) in an offline clustering setting.
More recently, crosslingual linking of clusters has been discussed by Rupnik et al. (2016) in the context of linking existing clusters from the Event Registry (Leban et al., 2014) in a batch fashion, and by Steinberger (2016), who also presents a batch cluster-linking system. However, these are not "truly" online crosslingual clustering systems since they only decide on the linking of already-built monolingual clusters. In particular, Rupnik et al. (2016) compute distances of document pairs across clusters using nearest neighbors, which might not scale well in an online setting. As detailed before, we adapted the cluster-linking dataset from Rupnik et al. (2016) to evaluate our online crosslingual clustering approach. Preliminary work makes use of deep learning techniques (Xie et al., 2016; Guo et al., 2017) to cluster documents while learning their representations, but not in an online or multilingual fashion, and with a very small number of cluster labels (4, in the case of the text benchmark).
In our work, we studied the problem of monolingual and crosslingual clustering, experimenting with several directions and methods and measuring their impact on the final clustering quality. We described the first system which aggregates news articles into fine-grained story clusters across different languages in a completely online and scalable fashion from a continuous stream.

Conclusion
We described a method for monolingual and crosslingual clustering of an incoming stream of documents. The method works by maintaining centroids for the monolingual and crosslingual clusters, where a monolingual cluster groups a set of documents and a crosslingual cluster groups a set of monolingual clusters. We presented an online crosslingual clustering method which auto-corrects past decisions in an efficient way. We showed that our method gives state-of-the-art results on a multilingual news article dataset for English, Spanish and German. Finally, we discussed how to leverage different SVM training procedures for ranking and classification to improve monolingual and crosslingual clustering decisions. Our system is integrated in a larger media monitoring project (Liepins et al., 2017; Germann et al., 2018), where it addresses the use cases of monitors and journalists and has been validated with qualitative user testing.