Embeddings in Natural Language Processing

Embeddings have been one of the most important topics of interest in NLP for the past decade. Representing knowledge through low-dimensional vectors that are easily integrable into modern machine learning models has played a central role in the development of the field. Embedding techniques initially focused on words, but attention soon started to shift to other forms. This tutorial will provide a high-level synthesis of the main embedding techniques in NLP, in the broad sense. We will start with conventional word embeddings (e.g., Word2Vec and GloVe) and then move to other types of embeddings, such as sense-specific and graph alternatives. We will conclude with an overview of the trending contextualized representations (e.g., ELMo and BERT) and explain their potential and impact in NLP.


Description
In this tutorial we will start by providing a historical overview on word-level vector space models, and word embeddings in particular. Word embeddings (e.g. Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014) or FastText (Bojanowski et al., 2017)) have proven to be powerful keepers of prior knowledge to be integrated into downstream Natural Language Processing (NLP) applications.
However, despite their flexibility and success in capturing semantic properties of words, the effectiveness of word embeddings is generally hampered by an important limitation, known as the meaning conflation deficiency: the inability to discriminate among different meanings of a word. A word can have one meaning (monosemous) or multiple meanings (ambiguous). For instance, the noun mouse can refer to two different meanings depending on the context: an animal or a computer device. Hence, mouse is said to be ambiguous. In fact, according to the Principle of Economical Versatility of Words (Zipf, 1949), frequent words tend to have more senses. Moreover, this meaning conflation can have additional negative impacts on accurate semantic modeling, e.g., semantically unrelated words that are similar to different senses of a word are pulled towards each other in the semantic space (Neelakantan et al., 2014; Pilehvar and Collier, 2016). In our example, the two semantically unrelated words rat and screen are pulled towards each other in the semantic space due to their similarities to two different senses of mouse (see Figure 1). This, in turn, contributes to the violation of the triangle inequality in Euclidean spaces (Tversky and Gati, 1982; Neelakantan et al., 2014).
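To make the conflation concrete, the following sketch uses hypothetical toy vectors (the numbers are illustrative, not learned) to show how collapsing the two sense vectors of mouse into a single embedding pulls rat and screen towards each other:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# Hypothetical 3-d sense vectors, one per meaning of "mouse".
mouse_animal = [0.9, 0.1, 0.0]   # rodent sense
mouse_device = [0.0, 0.1, 0.9]   # computer-device sense

# A single word embedding conflates the two senses (here: their average).
mouse = [(a + d) / 2 for a, d in zip(mouse_animal, mouse_device)]

rat    = [0.8, 0.2, 0.1]  # similar to the rodent sense
screen = [0.1, 0.2, 0.8]  # similar to the device sense

# Both rat and screen end up close to the conflated "mouse" vector,
# even though they are almost unrelated to each other.
print(cosine(mouse, rat), cosine(mouse, screen), cosine(rat, screen))
```

With these toy numbers, rat and screen each have a cosine similarity of roughly 0.79 to the conflated mouse vector, while their similarity to each other stays below 0.3.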
Accurately capturing the meaning of words (both ambiguous and unambiguous) plays a crucial role in the language understanding of NLP systems. In order to deal with the meaning conflation deficiency, this tutorial covers approaches that have attempted to model individual word senses (Reisinger and Mooney, 2010; Huang et al., 2012; Neelakantan et al., 2014; Rothe and Schütze, 2015; Li and Jurafsky, 2015; Pilehvar and Collier, 2016; Mancini et al., 2017). Sense representation techniques, however, suffer from limitations which hinder their effective application in downstream NLP tasks: they either need vast amounts of training data to obtain reliable representations or require an additional sense disambiguation step on the input text to make them integrable into NLP systems. Such annotated data is highly expensive to obtain in practice, which causes the so-called knowledge-acquisition bottleneck (Gale et al., 1992).
As a practical way to deal with the knowledge-acquisition bottleneck, an emerging branch of research has focused on directly integrating unsupervised embeddings into downstream models. Instead of learning a fixed number of senses per word, contextualized word embeddings learn "senses" dynamically, i.e., their representations change depending on the context in which a word appears. Context2vec (Melamud et al., 2016) and ELMo (Peters et al., 2018a) are some of the early examples of this type of representation. These models represent a target word by extracting its embedding in context from a bi-directional LSTM language model. The latter further proposed a seamless integration into neural NLP systems, as depicted in Figure 2. More recently, Transformers (Vaswani et al., 2017) have proven very effective for encoding contextualized knowledge, thanks to their self-attention mechanism (Figure 3). BERT (Devlin et al., 2018), which is based on Transformers, has revolutionized the field of representation learning and has impacted many other areas of NLP. Many derivatives and subsequent models have followed, rapidly pushing the state of the art in different benchmarks. In this tutorial we extensively cover this recent type of representation.
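The self-attention operation at the core of the Transformer can be sketched in a few lines of plain Python. For clarity, this minimal version omits the learned query/key/value projections (they are set to the identity, so Q = K = V = X) and uses a single head; the 2-d token "embeddings" below are hypothetical:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention over a sequence of vectors X.

    A real Transformer layer would first project X into separate query,
    key and value spaces; here those projections are the identity.
    """
    d = len(X[0])
    out = []
    for q in X:  # one query per position
        # Similarity of this position's query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)  # attention distribution over positions
        # Output: attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Toy 2-d "embeddings" for a 3-token sentence (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = self_attention(X)
```

Each output vector is a convex combination of the input vectors, so every position's representation mixes in information from the whole sequence, which is what makes the resulting embeddings context-dependent.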
We also discuss other types of embeddings, for instance for graph structures which are a popular choice in many scenarios, or for longer units of texts such as sentences and documents. Finally, we conclude this tutorial by discussing some of the ethical issues around the implicit gender and stereotypical biases encoded in word embeddings and proposals for reducing these artifacts.
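As an illustration of the graph-embedding idea, the following DeepWalk-style sketch builds node representations for a hypothetical toy graph from random-walk co-occurrence counts (actual methods such as DeepWalk or node2vec feed the walks to a skip-gram model instead of using raw counts):

```python
import random

# Hypothetical toy graph with two communities {a, b, c} and {d, e, f}
# joined by a single bridge edge c-d (undirected adjacency lists).
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
rng = random.Random(0)
nodes = sorted(graph)
idx = {n: i for i, n in enumerate(nodes)}

def random_walk(start, length):
    """Uniform random walk over the graph, as in DeepWalk."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

# Each node's vector = co-occurrence counts with every node within a
# window of 2 over many walks (raw counts keep the sketch short).
counts = [[0.0] * len(nodes) for _ in nodes]
for n in nodes:
    for _ in range(200):
        walk = random_walk(n, 8)
        for i, u in enumerate(walk):
            for v in walk[max(0, i - 2):i] + walk[i + 1:i + 3]:
                counts[idx[u]][idx[v]] += 1.0

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

print(cosine(counts[idx["a"]], counts[idx["b"]]),  # same community
      cosine(counts[idx["a"]], counts[idx["e"]]))  # across the bridge
```

Nodes within the same community (e.g., a and b) come out noticeably more similar than nodes on opposite sides of the bridge (e.g., a and e), i.e., the embedding space preserves the structural properties of the graph.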

Type of tutorial
Cutting-edge, although the first part of the tutorial could also be considered introductory. The tutorial provides an overview that starts from vector space models and word representations and then moves to newer contextualized approaches.

Outline
The tutorial is split into seven sections, each of which is roughly self-contained.
1. Introduction (20 minutes) A quick warm-up introduction to NLP: why it is important for NLP systems to have a semantic comprehension of texts, and how this is usually achieved by representing semantics through mathematical or computer-interpretable notations. This section provides the necessary motivation for the tutorial and highlights the role of semantic representation as a core component of many NLP systems. It also briefly describes the Vector Space Model and discusses the evolution of semantic representations in NLP.

2. Word embeddings (25 minutes)
This section explains the main approaches to learning word embeddings from text corpora, what their advantages are and how they have revolutionized the field of lexical semantics. We describe the concepts behind some of the major word embedding techniques, such as Word2vec and GloVe, and their application in NLP. Finally, we briefly cover other types of word embeddings, such as character-based, cross-lingual and knowledge-enhanced embeddings.
3. Graph Embeddings (20 minutes) Graphs are ubiquitous data structures. They are often the preferred choice for representing various types of data, including social networks, word co-occurrence and semantic networks, citation networks, telecommunication networks, molecular graph structures and biological networks. In this part of the tutorial, we discuss some of the prominent techniques for transforming graph nodes and edges into vectors. The goal is for the resulting embedding space to preserve the structural properties of the graph, be it the relative positioning of the nodes or the relations (edges) among them.

4. Sense Embeddings (20 minutes)
In this section we cover those approaches that learn distinct representations for individual meanings of words (i.e., word senses) with the aim of addressing the meaning conflation deficiency. For this part, we will discuss both knowledge-based and unsupervised paradigms.

5. Contextualized Representations (45 minutes)
In this part of the tutorial we will introduce the latest type of embeddings, which aim at providing dynamic representations of words, capable of adapting to the syntactic and semantic characteristics of a given context. We will start by discussing the need for contextualization. We then provide a very brief introduction to the architecture and building blocks of the Transformer model. We then start the overview of contextualized models with some of the earliest proposals in this category, i.e., Context2vec and "Embeddings from Language Models" (ELMo). We then discuss the newer and more prominent models based on Transformers. Specifically, we will describe BERT and some of its derivatives and subsequent works, such as XLNet, DistilBERT, GPT-2 and RoBERTa. We will provide an in-depth analysis of these techniques and point out not only their strengths, but also some of the limitations from which they suffer, which can be taken as possible research directions.
6. Sentence and Document Embeddings (15 minutes) This section goes beyond the level of words, and describes how sentences and documents can be encoded into vectorial representations. We cover some of the widely used supervised and unsupervised techniques and discuss the applications and evaluation methods for these representations. Given the tutorial's main focus on word-level representation, this section provides partial coverage but also pointers for further reading.
7. Ethics and bias (10 minutes) In this section we will talk about the implicit bias in vector representations of meaning, with a focus on gender bias and word representations. We will also overview some of the recent techniques for debiasing word embeddings from gender stereotypes and biases.
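To ground the word embeddings section of the outline, here is a minimal skip-gram-with-negative-sampling sketch in plain Python (the training objective behind Word2vec); the corpus is a hypothetical toy example and all hyperparameters are arbitrary:

```python
import math
import random

# Hypothetical toy corpus; real models train on billions of tokens.
corpus = ("the mouse ran from the cat . the cat chased the mouse . "
          "click the mouse button .").split()

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
dim = 16
rng = random.Random(0)
W = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in vocab]  # target vectors
C = [[0.0] * dim for _ in vocab]                                         # context vectors

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

def train_pair(t, c, label, lr=0.05):
    """One SGD step on a (target, context) pair with a binary label."""
    score = sigmoid(sum(W[t][k] * C[c][k] for k in range(dim)))
    g = lr * (label - score)
    for k in range(dim):
        wt = W[t][k]
        W[t][k] += g * C[c][k]
        C[c][k] += g * wt

window, negatives = 2, 3
for _ in range(100):  # epochs
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if i == j:
                continue
            train_pair(idx[w], idx[corpus[j]], 1)  # observed (positive) pair
            for _ in range(negatives):
                train_pair(idx[w], rng.randrange(len(vocab)), 0)  # negative sample
```

Real implementations (e.g., Word2vec or FastText) add subsampling of frequent words, a unigram-based negative-sampling distribution and heavy optimisation, but the underlying objective is the one sketched above.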

Breadth
The tutorial is largely based on a recent book written by the instructors, published in the Synthesis Lectures on Human Language Technologies series of Morgan and Claypool, titled "Embeddings in Natural Language Processing: Theory and Advances in Vector Representations of Meaning" (Pilehvar and Camacho-Collados, 2020). The book covers all the various techniques for vector representations of meaning in detail, while in this tutorial we provide an overview of the main ideas without going into details.

Prerequisites for the attendees
No advanced prerequisites are required for attendees, but some familiarity with linear algebra, natural language processing and machine learning would be desirable.

Small reading list
In addition to the main reference book (Pilehvar and Camacho-Collados, 2020), we present below some references that may be helpful for understanding the tutorial. Nonetheless, reading them in advance is not required, as their main ideas are also covered as part of the tutorial.

Presenters
Jose Camacho-Collados's main area of expertise is Natural Language Processing (NLP) and, in particular, computational semantics or, in other words, how to make computers understand language. His research has pivoted around both scientific contributions, through regular publications in top AI and NLP venues such as ACL, EMNLP, AAAI or IJCAI, and applications with direct impact on society, with a special focus on social media and multilinguality. He has also organised several international workshops, tutorials and open challenges with hundreds of participants across the world.
Mohammad Taher Pilehvar (mp792@cam.ac.uk, http://pilehvar.github.io) is an Assistant Professor at Tehran Institute for Advanced Studies (TeIAS) and an Affiliated Lecturer at the University of Cambridge. Taher's research lies in lexical semantics, mainly focusing on semantic representation and similarity. In the past, he has co-instructed three tutorials on these topics (EMNLP 2015, ACL 2016, and EACL 2017) and co-organised three SemEval tasks and an EACL workshop on sense representations. He has also co-authored several conference papers (including two ACL best paper nominations, in 2013 and 2017).