Surangika Ranathunga


2024

Quality Does Matter: A Detailed Look at the Quality and Utility of Web-Mined Parallel Corpora
Surangika Ranathunga | Nisansa De Silva | Velayuthan Menan | Aloka Fernando | Charitha Rathnayake
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We conducted a detailed analysis on the quality of web-mined corpora for two low-resource languages (making three language pairs, English-Sinhala, English-Tamil and Sinhala-Tamil). We ranked each corpus according to a similarity measure and carried out an intrinsic and extrinsic evaluation on different portions of this ranked corpus. We show that there are significant quality differences between different portions of web-mined corpora and that the quality varies across languages and datasets. We also show that, for some web-mined datasets, Neural Machine Translation (NMT) models trained with their highest-ranked 25k portion can be on par with human-curated datasets.

2022

Pre-Trained Multilingual Sequence-to-Sequence Models: A Hope for Low-Resource Language Translation?
En-Shiun Lee | Sarubi Thillainathan | Shravan Nayak | Surangika Ranathunga | David Adelani | Ruisi Su | Arya McCarthy
Findings of the Association for Computational Linguistics: ACL 2022

What can pre-trained multilingual sequence-to-sequence models like mBART contribute to translating low-resource languages? We conduct a thorough empirical experiment in 10 languages to ascertain this, considering five factors: (1) the amount of fine-tuning data, (2) the noise in the fine-tuning data, (3) the amount of pre-training data in the model, (4) the impact of domain mismatch, and (5) language typology. In addition to yielding several heuristics, the experiments form a framework for evaluating the data sensitivities of machine translation systems. While mBART is robust to domain differences, its translations for unseen and typologically distant languages remain below 3.0 BLEU. In answer to our title’s question, mBART is not a low-resource panacea; we therefore encourage shifting the emphasis from new models to new data.

Some Languages are More Equal than Others: Probing Deeper into the Linguistic Disparity in the NLP World
Surangika Ranathunga | Nisansa de Silva
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Linguistic disparity in the NLP world is a problem that has been widely acknowledged recently. However, different facets of this problem, or the reasons behind this disparity, are seldom discussed within the NLP community. This paper provides a comprehensive analysis of the disparity that exists within the languages of the world. We show that simply categorising languages considering data availability may not always be correct. Using an existing language categorisation based on speaker population and vitality, we analyse the distribution of language data resources, amount of NLP/CL research, inclusion in multilingual web-based platforms and the inclusion in pre-trained multilingual models. We show that many languages do not get covered in these resources or platforms, and even within the languages belonging to the same language group, there is wide disparity. We analyse the impact of family, geographical location, GDP and the speaker population of languages and provide possible reasons for this disparity, along with some suggestions to overcome the same.

Dataset and Baseline for Automatic Student Feedback Analysis
Missaka Herath | Kushan Chamindu | Hashan Maduwantha | Surangika Ranathunga
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we present a student feedback corpus, which contains 3000 instances of feedback written by university students. This dataset has been annotated for aspect terms, opinion terms, polarities of the opinion terms towards targeted aspects, document-level opinion polarities and sentence separations. We develop a hierarchical taxonomy for aspect categorization, which covers all the areas of the teaching-learning process. We annotated both implicit and explicit aspects using this taxonomy. The annotation methodology, the difficulties faced during annotation, and the aspect term categorization are discussed in detail. This annotated corpus can be used for Aspect Extraction, Aspect Level Sentiment Analysis, and Document Level Sentiment Analysis. Baseline results for all three tasks are also given in the paper.

BERTifying Sinhala - A Comprehensive Analysis of Pre-trained Language Models for Sinhala Text Classification
Vinura Dhananjaya | Piyumal Demotte | Surangika Ranathunga | Sanath Jayasena
Proceedings of the Thirteenth Language Resources and Evaluation Conference

This research provides the first comprehensive analysis of the performance of pre-trained language models for Sinhala text classification. We test on a set of different Sinhala text classification tasks and our analysis shows that out of the pre-trained multilingual models that include Sinhala (XLM-R, LaBSE, and LASER), XLM-R is the best model by far for Sinhala text classification. We also pre-train two RoBERTa-based monolingual Sinhala models, which are far superior to the existing pre-trained language models for Sinhala. We show that when fine-tuned, these pre-trained language models set a very strong baseline for Sinhala text classification and are robust in situations where labeled data is insufficient for fine-tuning. We further provide a set of recommendations for using pre-trained models for Sinhala text classification. We also introduce new annotated datasets useful for future research in Sinhala text classification and publicly release our pre-trained models.

Math Word Problem Generation with Multilingual Language Models
Kashyapa Niyarepola | Dineth Athapaththu | Savindu Ekanayake | Surangika Ranathunga
Proceedings of the 15th International Conference on Natural Language Generation

Proceedings of the Sixth Widening NLP Workshop (WiNLP)
Shaily Bhatt | Sunipa Dev | Bonaventure Dossou | Tirthankar Ghosal | Hatem Haddad | Haley M. Lepp | Fatemehsadat Mireshghallah | Surangika Ranathunga | Xanda Schofield | Isidora Tourni | Weijia Xu
Proceedings of the Sixth Widening NLP Workshop (WiNLP)

2021

Proceedings of the Fifth Workshop on Widening Natural Language Processing
Erika Varis | Ryan Georgi | Alicia Tsai | Antonios Anastasopoulos | Kyathi Chandu | Xanda Schofield | Surangika Ranathunga | Haley Lepp | Tirthankar Ghosal
Proceedings of the Fifth Workshop on Widening Natural Language Processing

Data Augmentation to Address Out of Vocabulary Problem in Low Resource Sinhala English Neural Machine Translation
Aloka Fernando | Surangika Ranathunga
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

Classification of Code-Mixed Text Using Capsule Networks
Shanaka Chathuranga | Surangika Ranathunga
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

A major challenge in analysing social media data belonging to languages that use non-English script is its code-mixed nature. Recent research has presented state-of-the-art contextual embedding models (both monolingual, such as BERT, and multilingual, such as XLM-R) as a promising approach. In this paper, we show that the performance of such embedding models depends on multiple factors, such as the level of code-mixing in the dataset, and the size of the training dataset. We empirically show that a newly introduced Capsule+biGRU classifier could outperform a classifier built on the English-BERT as well as XLM-R just with a training dataset of about 6500 samples for the Sinhala-English code-mixed data.

Metric Learning in Multilingual Sentence Similarity Measurement for Document Alignment
Charith Rajitha | Lakmali Piyarathna | Dilan Sachintha | Surangika Ranathunga
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Document alignment techniques based on multilingual sentence representations have recently shown state-of-the-art results. However, these techniques rely on unsupervised distance measurement techniques, which cannot be fine-tuned to the task at hand. In this paper, instead of these unsupervised distance measurement techniques, we employ Metric Learning to derive task-specific distance measurements. These measurements are supervised, meaning that the distance measurement metric is trained using a parallel dataset. Using a dataset belonging to English, Sinhala, and Tamil, which belong to three different language families, we show that these task-specific supervised distance learning metrics outperform their unsupervised counterparts for document alignment.

2020

Word Embedding Evaluation for Sinhala
Dimuthu Lakmal | Surangika Ranathunga | Saman Peramuna | Indu Herath
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents the first ever comprehensive evaluation of different types of word embeddings for the Sinhala language. Three standard word embedding models, namely, Word2Vec (both Skipgram and CBOW), FastText, and GloVe are evaluated under two types of evaluation methods: intrinsic evaluation and extrinsic evaluation. Word analogy and word relatedness evaluations were performed in terms of intrinsic evaluation, while sentiment analysis and part-of-speech (POS) tagging were conducted as the extrinsic evaluation tasks. Benchmark datasets used for intrinsic evaluations were carefully crafted considering specific linguistic features of Sinhala. In general, FastText word embeddings with 300 dimensions reported the highest accuracies across all the evaluation tasks, while GloVe reported the lowest results.

Multi-lingual Mathematical Word Problem Generation using Long Short Term Memory Networks with Enhanced Input Features
Vijini Liyanage | Surangika Ranathunga
Proceedings of the Twelfth Language Resources and Evaluation Conference

A Mathematical Word Problem (MWP) differs from a general textual representation in that it comprises numerical quantities and units, in addition to text. Therefore, MWP generation should be carefully handled. When it comes to multi-lingual MWP generation, language specific morphological and syntactic features become additional constraints. Standard template-based MWP generation techniques are incapable of identifying these language specific constraints, particularly in morphologically rich yet low resource languages such as Sinhala and Tamil. This paper presents the use of a Long Short Term Memory (LSTM) network that is capable of generating elementary level MWPs, while satisfying the aforementioned constraints. Our approach feeds a combination of character embeddings, word embeddings, and Part of Speech (POS) tag embeddings to the LSTM, in which attention is provided for numerical values and units. We trained our model for three languages, English, Sinhala and Tamil, using separate MWP datasets. Irrespective of the language and the type of the MWP, our model could generate accurate single sentenced and multi sentenced problems. The accuracies, reported as average BLEU scores for English, Sinhala and Tamil, were 22.97%, 24.49% and 20.74%, respectively.

2019

Transfer Learning Based Free-Form Speech Command Classification for Low-Resource Languages
Yohan Karunanayake | Uthayasanker Thayasivam | Surangika Ranathunga
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Current state-of-the-art speech-based user interfaces use data-intensive methodologies to recognize free-form speech commands. However, this is not viable for low-resource languages, which lack speech data. This restricts the usability of such interfaces to a limited number of languages. In this paper, we propose a methodology to develop a robust domain-specific speech command classification system for low-resource languages using speech data of a high-resource language. In this transfer learning-based approach, we used a Convolutional Neural Network (CNN) to identify a fixed set of intents using an ASR-based character probability map. We were able to achieve significant results for Sinhala and Tamil datasets using an English-based ASR, which attests to the robustness of the proposed approach.

2018

Handling Rare Word Problem using Synthetic Training Data for Sinhala and Tamil Neural Machine Translation
Pasindu Tennage | Prabath Sandaruwan | Malith Thilakarathne | Achini Herath | Surangika Ranathunga
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Annotating Opinions and Opinion Targets in Student Course Feedback
Janaka Chathuranga | Shanika Ediriweera | Ravindu Hasantha | Pranidhith Munasinghe | Surangika Ranathunga
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Improving domain-specific SMT for low-resourced languages using data from different domains
Fathima Farhath | Pranavan Theivendiram | Surangika Ranathunga | Sanath Jayasena | Gihan Dias
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Graph Based Semi-Supervised Learning Approach for Tamil POS tagging
Mokanarangan Thayaparan | Surangika Ranathunga | Uthayasanker Thayasivam
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Opinion Target Extraction for Student Course Feedback
Janaka Chathuranga | Shanika Ediriweera | Pranidhith Munasinghe | Ravindu Hasantha | Surangika Ranathunga
Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017)

Multi-Domain Aspect Extraction Using Support Vector Machines
Nadheesh Jihan | Yasas Senarath | Dulanjaya Tennekoon | Mithila Wickramarathne | Surangika Ranathunga
Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017)

2016

Implicit Aspect Detection in Restaurant Reviews using Cooccurence of Words
Rrubaa Panchendrarajan | Nazick Ahamed | Brunthavan Murugaiah | Prakhash Sivakumar | Surangika Ranathunga | Akila Pemasiri
Proceedings of the 7th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

Sinhala Short Sentence Similarity Calculation using Corpus-Based and Knowledge-Based Similarity Measures
Jcs Kadupitiya | Surangika Ranathunga | Gihan Dias
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

Currently, corpus-based similarity, string-based similarity, and knowledge-based similarity techniques are used to compare short phrases. However, no work has been conducted on the similarity of phrases in the Sinhala language. In this paper, we present a hybrid methodology to compute the similarity between two Sinhala sentences using a Semantic Similarity Measurement technique (corpus-based similarity measurement plus knowledge-based similarity measurement) that makes use of word order information. Since Sinhala WordNet is still under construction, we used lexical resources in performing this semantic similarity calculation. Evaluation using 4000 sentence pairs yielded an average MSE of 0.145 and a Pearson correlation factor of 0.832.

Automatic Creation of a Sentence Aligned Sinhala-Tamil Parallel Corpus
Riyafa Abdul Hameed | Nadeeshani Pathirennehelage | Anusha Ihalapathirana | Maryam Ziyad Mohamed | Surangika Ranathunga | Sanath Jayasena | Gihan Dias | Sandareka Fernando
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

A sentence aligned parallel corpus is an important prerequisite in statistical machine translation. However, manual creation of such a parallel corpus is time consuming, and requires experts fluent in both languages. Automatic creation of a sentence aligned parallel corpus using parallel text is the solution to this problem. In this paper, we present the first ever empirical evaluation carried out to identify the best method to automatically create a sentence aligned Sinhala-Tamil parallel corpus. Annual reports from Sri Lankan government institutions were used as the parallel text for aligning. Despite both Sinhala and Tamil being under-resourced languages, we were able to achieve an F-score value of 0.791 using a hybrid approach that makes use of a bilingual dictionary.

Comprehensive Part-Of-Speech Tag Set and SVM based POS Tagger for Sinhala
Sandareka Fernando | Surangika Ranathunga | Sanath Jayasena | Gihan Dias
Proceedings of the 6th Workshop on South and Southeast Asian Natural Language Processing (WSSANLP2016)

This paper presents a new comprehensive multi-level Part-Of-Speech tag set and a Support Vector Machine based Part-Of-Speech tagger for the Sinhala language. The currently available tag set for Sinhala has two limitations: the unavailability of tags to represent some word classes and the lack of tags to capture inflection based grammatical variations of words. The new tag set presented in this paper overcomes both of these limitations. The accuracy of available Sinhala Part-Of-Speech taggers, which are based on Hidden Markov Models, still falls far behind the state of the art. Our Support Vector Machine based tagger achieved an overall accuracy of 84.68%, with 59.86% accuracy for unknown words and 87.12% for known words, when the test set contains 10% unknown words.

2015

Ruchi: Rating Individual Food Items in Restaurant Reviews
Burusothman Ahiladas | Paraneetharan Saravanaperumal | Sanjith Balachandran | Thamayanthy Sripalan | Surangika Ranathunga
Proceedings of the 12th International Conference on Natural Language Processing

Dialogue Act Recognition for Text-based Sinhala
Sudheera Palihakkara | Dammina Sahabandu | Ahsan Shamsudeen | Chamika Bandara | Surangika Ranathunga
Proceedings of the 12th International Conference on Natural Language Processing
