Laura Rimell


2023

pdf bib
A Natural Bias for Language Generation Models
Clara Meister | Wojciech Stokowiec | Tiago Pimentel | Lei Yu | Laura Rimell | Adhiguna Kuncoro
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

After just a few hundred training updates, a standard probabilistic model for language generation has likely not yet learnt many semantic or syntactic rules of natural language, making it difficult to estimate the probability distribution over next tokens. Yet around this point, these models have identified a simple, loss-minimising behaviour: to output the unigram distribution of the target training corpus. The use of such a heuristic raises the question: Can we initialise our models with this behaviour and save precious compute resources and model capacity? Here we show that we can effectively endow standard neural language generation models with a separate module that reflects unigram frequency statistics as prior knowledge, simply by initialising the bias term in a model’s final linear layer with the log-unigram distribution. We use neural machine translation as a test bed for this simple technique and observe that it: (i) improves learning efficiency; (ii) achieves better overall performance; and perhaps most importantly (iii) appears to disentangle strong frequency effects by encouraging the model to specialise in non-frequency-related aspects of language.

2022

pdf bib
Proceedings of the 7th Workshop on Representation Learning for NLP
Spandana Gella | He He | Bodhisattwa Prasad Majumder | Burcu Can | Eleonora Giunchiglia | Samuel Cahyawijaya | Sewon Min | Maximilian Mozes | Xiang Lorraine Li | Isabelle Augenstein | Anna Rogers | Kyunghyun Cho | Edward Grefenstette | Laura Rimell | Chris Dyer
Proceedings of the 7th Workshop on Representation Learning for NLP

2021

pdf bib
You should evaluate your language model on marginal likelihood over tokenisations
Kris Cao | Laura Rimell
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Neural language models typically tokenise input text into sub-word units to achieve an open vocabulary. The standard approach is to use a single canonical tokenisation at both train and test time. We suggest that this approach is unsatisfactory and may bottleneck our evaluation of language model performance. Using only the one-best tokenisation ignores tokeniser uncertainty over alternative tokenisations, which may hurt model out-of-domain performance. In this paper, we argue that instead, language models should be evaluated on their marginal likelihood over tokenisations. We compare different estimators for the marginal likelihood based on sampling, and show that it is feasible to estimate the marginal likelihood with a manageable number of samples. We then evaluate a pretrained language model on both the one-best-tokenisation and marginal perplexities, and show that the marginal perplexity can be significantly better than the one best, especially on out-of-domain data. We link this difference in perplexity to the tokeniser uncertainty as measured by tokeniser entropy. We discuss some implications of our results for language model training and evaluation, particularly with regard to tokenisation robustness.

pdf bib
Pretraining the Noisy Channel Model for Task-Oriented Dialogue
Qi Liu | Lei Yu | Laura Rimell | Phil Blunsom
Transactions of the Association for Computational Linguistics, Volume 9

Direct decoding for task-oriented dialogue is known to suffer from the explaining-away effect, manifested in models that prefer short and generic responses. Here we argue for the use of Bayes’ theorem to factorize the dialogue task into two models, the distribution of the context given the response, and the prior for the response itself. This approach, an instantiation of the noisy channel model, both mitigates the explaining-away effect and allows the principled incorporation of large pretrained models for the response prior. We present extensive experiments showing that a noisy channel model decodes better responses compared to direct decoding and that a two-stage pretraining strategy, employing both open-domain and task-oriented dialogue data, improves over randomly initialized models.

2020

pdf bib
Syntactic Structure Distillation Pretraining for Bidirectional Encoders
Adhiguna Kuncoro | Lingpeng Kong | Daniel Fried | Dani Yogatama | Laura Rimell | Chris Dyer | Phil Blunsom
Transactions of the Association for Computational Linguistics, Volume 8

Textual representation learners trained on large amounts of data have achieved notable success on downstream tasks; intriguingly, they have also performed well on challenging tests of syntactic competence. Hence, it remains an open question whether scalable learners like BERT can become fully proficient in the syntax of natural language by virtue of data scale alone, or whether they still benefit from more explicit syntactic biases. To answer this question, we introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining, by distilling the syntactically informative predictions of a hierarchical—albeit harder to scale—syntactic language model. Since BERT models masked words in bidirectional context, we propose to distill the approximate marginal distribution over words in context from the syntactic LM. Our approach reduces relative error by 2–21% on a diverse set of structured prediction tasks, although we obtain mixed results on the GLUE benchmark. Our findings demonstrate the benefits of syntactic biases, even for representation learners that exploit large amounts of data, and contribute to a better understanding of where syntactic biases are helpful in benchmarks of natural language understanding.

2019

pdf bib
Scalable Syntax-Aware Language Models Using Knowledge Distillation
Adhiguna Kuncoro | Chris Dyer | Laura Rimell | Stephen Clark | Phil Blunsom
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Prior work has shown that, on small amounts of training data, syntactic neural language models learn structurally sensitive generalisations more successfully than sequential language models. However, their computational complexity renders scaling difficult, and it remains an open question whether structural biases are still necessary when sequential models have access to ever larger amounts of training data. To answer this question, we introduce an efficient knowledge distillation (KD) technique that transfers knowledge from a syntactic language model trained on a small corpus to an LSTM language model, hence enabling the LSTM to develop a more structurally sensitive representation of the larger training data it learns from. On targeted syntactic evaluations, we find that, while sequential LSTMs perform much better than previously reported, our proposed technique substantially improves on this baseline, yielding a new state of the art. Our findings and analysis affirm the importance of structural biases, even in models that learn from large amounts of data.

pdf bib
Neural Generative Rhetorical Structure Parsing
Amandla Mabona | Laura Rimell | Stephen Clark | Andreas Vlachos
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Rhetorical structure trees have been shown to be useful for several document-level tasks including summarization and document classification. Previous approaches to RST parsing have used discriminative models; however, these are less sample efficient than generative models, and RST parsing datasets are typically small. In this paper, we present the first generative model for RST parsing. Our model is a document-level RNN grammar (RNNG) with a bottom-up traversal order. We show that, for our parser’s traversal order, previous beam search algorithms for RNNGs have a left-branching bias which is ill-suited for RST parsing. We develop a novel beam search algorithm that keeps track of both structure-and word-generating actions without exhibit-ing this branching bias and results in absolute improvements of 6.8 and 2.9 on unlabelled and labelled F1 over previous algorithms. Overall, our generative model outperforms a discriminative model with the same features by 2.6 F1points and achieves performance comparable to the state-of-the-art, outperforming all published parsers from a recent replication study that do not use additional training data

2017

pdf bib
Proceedings of the 2nd Workshop on Representation Learning for NLP
Phil Blunsom | Antoine Bordes | Kyunghyun Cho | Shay Cohen | Chris Dyer | Edward Grefenstette | Karl Moritz Hermann | Laura Rimell | Jason Weston | Scott Yih
Proceedings of the 2nd Workshop on Representation Learning for NLP

pdf bib
Learning to Negate Adjectives with Bilinear Models
Laura Rimell | Amandla Mabona | Luana Bulat | Douwe Kiela
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We learn a mapping that negates adjectives by predicting an adjective’s antonym in an arbitrary word embedding model. We show that both linear models and neural networks improve on this task when they have access to a vector representing the semantic domain of the input word, e.g. a centroid of temperature words when predicting the antonym of ‘cold’. We introduce a continuous class-conditional bilinear neural network which is able to negate adjectives with high precision.

2016

pdf bib
Proceedings of the 1st Workshop on Representation Learning for NLP
Phil Blunsom | Kyunghyun Cho | Shay Cohen | Edward Grefenstette | Karl Moritz Hermann | Laura Rimell | Jason Weston | Scott Wen-tau Yih
Proceedings of the 1st Workshop on Representation Learning for NLP

pdf bib
Predicting the Direction of Derivation in English Conversion
Max Kisselew | Laura Rimell | Alexis Palmer | Sebastian Padó
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib
SLEDDED: A Proposed Dataset of Event Descriptions for Evaluating Phrase Representations
Laura Rimell | Eva Maria Vecchi
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

pdf bib
RELPRON: A Relative Clause Evaluation Data Set for Compositional Distributional Semantics
Laura Rimell | Jean Maillard | Tamara Polajnar | Stephen Clark
Computational Linguistics, Volume 42, Issue 4 - December 2016

pdf bib
Take and Took, Gaggle and Goose, Book and Read: Evaluating the Utility of Vector Differences for Lexical Relation Learning
Ekaterina Vylomova | Laura Rimell | Trevor Cohn | Timothy Baldwin
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2015

pdf bib
Exploiting Image Generality for Lexical Entailment Detection
Douwe Kiela | Laura Rimell | Ivan Vulić | Stephen Clark
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
An Exploration of Discourse-Based Sentence Spaces for Compositional Distributional Semantics
Tamara Polajnar | Laura Rimell | Stephen Clark
Proceedings of the First Workshop on Linking Computational Models of Lexical, Sentential and Discourse-level Semantics

2014

pdf bib
Distributional Lexical Entailment by Topic Coherence
Laura Rimell
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics

pdf bib
Evaluation of Simple Distributional Compositional Operations on Longer Texts
Tamara Polajnar | Laura Rimell | Stephen Clark
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Distributional semantic models have been effective at representing linguistic semantics at the word level, and more recently research has moved on to the construction of distributional representations for larger segments of text. However, it is not well understood how the composition operators that work well on short phrase-based models scale up to full-length sentences. In this paper we test several simple compositional methods on a sentence-length similarity task and discover that their performance peaks at fewer than ten operations. We also introduce a novel sentence segmentation method that reduces the number of compositional operations.

2013

pdf bib
UCAM-CORE: Incorporating structured distributional similarity into STS
Tamara Polajnar | Laura Rimell | Douwe Kiela
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the Main Conference and the Shared Task: Semantic Textual Similarity

2012

pdf bib
Multi-way Tensor Factorization for Unsupervised Lexical Acquisition
Tim Van de Cruys | Laura Rimell | Thierry Poibeau | Anna Korhonen
Proceedings of COLING 2012

2010

pdf bib
Cambridge: Parser Evaluation Using Textual Entailment by Grammatical Relation Comparison
Laura Rimell | Stephen Clark
Proceedings of the 5th International Workshop on Semantic Evaluation

pdf bib
Evaluation of Dependency Parsers on Unbounded Dependencies
Joakim Nivre | Laura Rimell | Ryan McDonald | Carlos Gómez-Rodríguez
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Chart Pruning for Fast Lexicalised-Grammar Parsing
Yue Zhang | Byung-Gyu Ahn | Stephen Clark | Curt Van Wyk | James R. Curran | Laura Rimell
Coling 2010: Posters

2009

pdf bib
Unbounded Dependency Recovery for Parser Evaluation
Laura Rimell | Stephen Clark | Mark Steedman
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

2008

pdf bib
Adapting a Lexicalized-Grammar Parser to Contrasting Domains
Laura Rimell | Stephen Clark
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
Constructing a Parser Evaluation Scheme
Laura Rimell | Stephen Clark
Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation