Jarkko Lagus


2022

Optimizing singular value based similarity measures for document similarity comparisons
Jarkko Lagus | Arto Klami
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

Second-order Document Similarity Metrics for Transformers
Jarkko Lagus | Niki Loppi | Arto Klami
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)

2021

Learning to Lemmatize in the Word Representation Space
Jarkko Lagus | Arto Klami
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Lemmatization, typically performed by grammar-based lemmatizers, is often used with morphologically rich languages to address issues caused by morphological complexity. We propose an alternative in the form of a tool that performs lemmatization in the space of word embeddings. Word embeddings, as distributed representations, natively encode some information about the relationship between base and inflected forms, and we show that it is possible to learn a transformation that approximately maps the embeddings of inflected forms to the embeddings of the corresponding lemmas. This facilitates an alternative processing pipeline that replaces traditional lemmatization with the lemmatizing transformation in downstream processing for any application. We demonstrate the method for the Finnish language, outperforming traditional lemmatizers in an example task of document similarity comparison, but the approach is language-independent and can be trained for new languages with mild requirements.
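
As a rough illustration of learning a map from inflected-form embeddings to lemma embeddings, the sketch below fits a linear transformation by ordinary least squares. The abstract does not state the form of the transformation, so the linear model, the synthetic data, and the names `fit_lemmatizing_map`, `lemma_vecs`, and `inflected_vecs` are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

# Synthetic stand-ins for real embeddings (assumption: real data would be
# aligned pairs of inflected-form and lemma vectors from a trained model).
rng = np.random.default_rng(0)
dim = 4
lemma_vecs = rng.normal(size=(50, dim))

# Hypothetical "inflection" effect: a mild linear distortion of the lemma vectors.
distortion = np.eye(dim) + 0.1 * rng.normal(size=(dim, dim))
inflected_vecs = lemma_vecs @ distortion


def fit_lemmatizing_map(inflected, lemmas):
    """Least-squares W minimizing ||inflected @ W - lemmas||_F (linear-map assumption)."""
    w, *_ = np.linalg.lstsq(inflected, lemmas, rcond=None)
    return w


w = fit_lemmatizing_map(inflected_vecs, lemma_vecs)

# Applying the learned map moves inflected-form vectors toward their lemma vectors.
mapped = inflected_vecs @ w
print(np.abs(mapped - lemma_vecs).max())
```

Because the toy data is generated by an exactly linear distortion, the fit here is essentially perfect; with real embeddings the mapping would only be approximate, as the abstract notes.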

A COVID-19 news coverage mood map of Europe
Frankie Robertson | Jarkko Lagus | Kaisla Kajava
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

We present a COVID-19 news dashboard which visualizes sentiment in pandemic news coverage in different languages across Europe. The dashboard shows analyses of positive/neutral/negative sentiment and moral sentiment for news articles across countries and languages. First, we extract news articles from news-crawl. Then, we use a pre-trained multilingual BERT model for sentiment analysis of news article headlines and a dictionary- and word-vector-based method for moral sentiment analysis of news articles. The resulting dashboard gives a unified overview of COVID-19 news events, their overall sentiment, and the region and language of publication for the period from the beginning of January 2020 to the end of January 2021.

2019

Low-Rank Approximations of Second-Order Document Representations
Jarkko Lagus | Janne Sinkkonen | Arto Klami
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Document embeddings, created with methods ranging from simple heuristics to statistical and deep models, are widely applicable. Bag-of-vectors models for documents include the mean and quadratic approaches (Torki, 2018). We present evidence that quadratic statistics alone, without the mean information, can offer superior accuracy, fast document comparison, and compact document representations. In matching news articles to their comment threads, low-rank representations only 3-4 times the size of the mean vector give the most accurate matching, and in standard sentence comparison tasks, results are state of the art despite faster computation. Similarity measures are discussed, and the Frobenius product implicit in the proposed method is contrasted with the Wasserstein or Bures metric from transportation theory. We also briefly demonstrate matching of unordered word lists to documents, to measure the topicality or sentiment of documents.
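
To make the idea of a low-rank second-order representation compared via a Frobenius product more concrete, here is a minimal sketch. It assumes the quadratic statistic is the uncentered second-moment matrix X^T X / n of a document's word vectors, that the low-rank form comes from a truncated SVD, and that the similarity is normalized like a cosine; these choices, the random toy data, and the names `second_order_factor` and `frobenius_similarity` are assumptions for illustration and may differ from the paper's exact procedure.

```python
import numpy as np


def second_order_factor(word_vectors, rank):
    """Return a (d, rank) factor A such that A @ A.T approximates X.T @ X / n."""
    n = word_vectors.shape[0]
    # Thin SVD of the (n, d) matrix whose rows are the document's word vectors.
    _, s, vt = np.linalg.svd(word_vectors, full_matrices=False)
    # Keep the top `rank` right singular vectors, scaled by their singular values.
    return vt[:rank].T * (s[:rank] / np.sqrt(n))


def frobenius_similarity(a, b):
    """Normalized Frobenius product <A A^T, B B^T>_F computed from the factors only."""
    # <A A^T, B B^T>_F = ||A^T B||_F^2, so the (d, d) matrices are never formed.
    inner = np.sum((a.T @ b) ** 2)
    norm_a = np.sqrt(np.sum((a.T @ a) ** 2))
    norm_b = np.sqrt(np.sum((b.T @ b) ** 2))
    return inner / (norm_a * norm_b)


# Toy "documents": random word vectors standing in for real embeddings.
rng = np.random.default_rng(0)
doc_a = rng.normal(size=(30, 8))
doc_b = rng.normal(size=(40, 8))

factor_a = second_order_factor(doc_a, rank=3)
factor_b = second_order_factor(doc_b, rank=3)
print(frobenius_similarity(factor_a, factor_b))
```

With rank r and embedding dimension d, each document is stored as d × r numbers, i.e. r times the size of a mean vector, and a comparison costs O(d r²) rather than O(d²).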