Kevin Scannell


2022

Diachronic Parsing of Pre-Standard Irish
Kevin Scannell
Proceedings of the 4th Celtic Language Technology Workshop within LREC2022

Irish underwent a major spelling standardization in the 1940s and 1950s, and as a result it can be challenging to apply language technologies designed for the modern language to older, “pre-standard” texts. Lemmatization, tagging, and parsing of these pre-standard texts play an important role in a number of applications, including the lexicographical work on Foclóir Stairiúil na Gaeilge, a historical dictionary of Irish covering the period from 1600 to the present. We have two main goals in this paper. First, we introduce a small benchmark corpus containing just over 3800 words, annotated according to the Universal Dependencies guidelines and covering a range of dialects and time periods since 1600. Second, we establish baselines for lemmatization, tagging, and dependency parsing on this corpus by experimenting with a variety of machine learning approaches.

2020

Neural Models for Predicting Celtic Mutations
Kevin Scannell
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

The Celtic languages share a common linguistic phenomenon known as initial mutations; these consist of pronunciation and spelling changes that occur at the beginning of some words, triggered in certain semantic or syntactic contexts. Initial mutations occur quite frequently and all non-trivial NLP systems for the Celtic languages must learn to handle them properly. In this paper we describe and evaluate neural network models for predicting mutations in two of the six Celtic languages: Irish and Scottish Gaelic. We also discuss applications of these models to grammatical error detection and language modeling.

Universal Dependencies for Manx Gaelic
Kevin Scannell
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

Manx Gaelic is one of the three Q-Celtic languages, along with Irish and Scottish Gaelic. We present a new dependency treebank for Manx consisting of 291 sentences and about 6000 tokens, annotated according to the Universal Dependencies (UD) guidelines. To the best of our knowledge, this is the first annotated corpus of any kind for Manx. Our annotations generally follow the conventions established by the existing UD treebanks for Irish and Scottish Gaelic, although we highlight some areas where the grammar of Manx diverges, requiring new analyses. We use 10-fold cross-validation to evaluate the accuracy of dependency parsers trained on the corpus, and compare these results with delexicalised models transferred from Irish and Scottish Gaelic.

2019

Code-switching in Irish tweets: A preliminary analysis
Teresa Lynn | Kevin Scannell
Proceedings of the Celtic Language Technology Workshop

Improving full-text search results on dúchas.ie using language technology
Brian Ó Raghallaigh | Kevin Scannell | Meghan Dowling
Proceedings of the Celtic Language Technology Workshop

2015

Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets
Teresa Lynn | Kevin Scannell | Eimear Maguire
Proceedings of the Workshop on Noisy User-generated Text

2014

Statistical models for text normalization and machine translation
Kevin Scannell
Proceedings of the First Celtic Language Technology Workshop