Embedding Imputation with Grounded Language Information

Due to the ubiquitous use of embeddings as input representations for a wide range of natural language tasks, imputation of embeddings for rare and unseen words is a critical problem in language processing. Embedding imputation involves learning representations for rare or unseen words during the training of an embedding model, often in a post-hoc manner. In this paper, we propose an approach for embedding imputation which uses grounded information in the form of a knowledge graph. This is in contrast to existing approaches which typically make use of vector space properties or subword information. We propose an online method to construct a graph from grounded information and design an algorithm to map from the resulting graphical structure to the space of the pre-trained embeddings. Finally, we evaluate our approach on a range of rare and unseen word tasks across various domains and show that our model can learn better representations. For example, on the Card-660 task our method improves Pearson’s and Spearman’s correlation coefficients upon the state-of-the-art by 11% and 17.8% respectively using GloVe embeddings.


Introduction
Word embeddings (Mikolov et al., 2013; Pennington et al., 2014) are used pervasively in deep learning for natural language processing. However, due to fixed vocabulary constraints in existing approaches to training word embeddings, it is difficult to learn representations for words which are rare or unseen during training. This is commonly referred to as the out-of-vocabulary (OOV) word problem. In the original embedding implementations, a special OOV token is typically reserved for such words. However, this rudimentary approach often degrades the performance of downstream tasks which contain numerous rare or unseen words. Recent works have proposed subword approaches (Zhao et al., 2018; Sennrich et al., 2015), which construct embeddings for OOV words through the composition of characters or sentence pieces. Vector space properties are also utilized to learn embeddings with small amounts of data (Bahdanau et al., 2017; Herbelot and Baroni, 2017). In this paper, we propose a novel approach, knowledge-graph-to-vector (KG2Vec), for the OOV word problem. KG2Vec makes use of grounded language information in the form of a knowledge graph. Grounded information has been extensively used in various NLP tasks to represent real-world knowledge (Niles and Pease, 2003; Gruber, 1993; Guarino, 1998; de Bruijn et al., 2006; Paulheim, 2017). In particular, early question answering systems used expert-crafted ontologies in order to endow these systems with common knowledge (Harabagiu et al., 2005; Xu et al., 2016). Additionally, lexical-semantic ontologies, such as WordNet, have been used to provide semantic relations between words in a wide variety of language processing and inference tasks (Morris and Hirst, 1991; Ovchinnikova et al., 2010).
Published as a conference paper at ACL 2019
Grounded language information has been observed to augment model performance on a wide variety of natural language processing and understanding tasks (He et al., 2017; Choi et al., 2018). In these settings, a model is able to provide better generalization by using relational information from a knowledge graph or knowledge base in addition to the standard set of training examples. Additionally, outputs from models with grounded approaches have been observed to be more factually consistent and logically sound (Bordes et al., 2014) compared with outputs from models without grounding information.
By forgoing the usage of vector space or sub-word information, KG2Vec is able to capture semantic meanings of words directly from the graphical structure in grounded knowledge using recent advances in network representation learning. Furthermore, KG2Vec leverages the most up-to-date information from comprehensive knowledge bases (Wikipedia and Wiktionary). Therefore, KG2Vec can be applied to training embeddings of newly emerging OOV words. In summary, our contributions are three-fold:
1. An approach to constructing graphical representations of entities in a knowledge base in an unsupervised manner.
2. Methods for mapping entities from a graphical representation to the space in which a pre-trained embedding lies.
3. Experimentation on rare and unseen word datasets and a new state-of-the-art performance on the Card-660 dataset.
Related Work

Graph Neural Networks
Graph neural networks (GNNs) are an emerging deep learning approach for representation learning of graphical data (Xu et al., 2018; Kipf and Welling, 2016). GNNs can learn a representation vector $h_v$ for each node in the network by leveraging the graphical structure and node features $f_v$. Node embeddings are generated by recursively aggregating each node's neighborhood information and features. At the $t$-th iteration, the information aggregation is defined as:
$$h_v^t = M_t\left(h_v^{t-1}, \{h_u^{t-1}\}_{u \in N(v)}\right)$$
where $h_v^t$ is the representation for $v$ at the $t$-th iteration, $M_t$ is an iteration-specific message aggregation function parametrized by a neural network and $N(v)$ is the set of neighbors of node $v$. One simple form of $M_t$ is mean neighborhood aggregation:
$$h_v^t = \sigma\left(W_t \cdot \frac{1}{|N(v)|} \sum_{u \in N(v)} h_u^{t-1} + B_t h_v^{t-1}\right)$$
where $W_t$ and $B_t$ are trainable matrices and $\sigma$ is an element-wise nonlinearity. Typically, $h_v^0$ is initialized as $f_v$. The final node representation is usually a function of $h_v^T$ from the last iteration $T$, such as an identity function or a transformation function (Ying et al., 2018).
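As an illustration of mean neighborhood aggregation, the following is a minimal sketch in plain Python rather than code from any GNN library. It assumes a sigmoid nonlinearity and dictionary-based inputs; the names `mean_aggregate` and `matvec` are hypothetical helpers introduced only for this example.

```python
import math

def matvec(M, x):
    """Multiply a matrix M (list of rows) by a vector x."""
    return [sum(m * x_j for m, x_j in zip(row, x)) for row in M]

def mean_aggregate(h, neighbors, W, B):
    """One iteration of mean neighborhood aggregation.

    h:         dict mapping node -> current representation (list of floats)
    neighbors: dict mapping node -> list of neighboring nodes
    W, B:      weight matrices (lists of rows); the sigmoid is an assumed choice
    """
    new_h = {}
    for v, h_v in h.items():
        nbrs = neighbors.get(v, [])
        if nbrs:
            # Element-wise mean of the neighbors' representations.
            mean = [sum(h[u][i] for u in nbrs) / len(nbrs) for i in range(len(h_v))]
        else:
            # Isolated node: no neighborhood signal (an assumption of this sketch).
            mean = [0.0] * len(h_v)
        pre = [a + b for a, b in zip(matvec(W, mean), matvec(B, h_v))]
        new_h[v] = [1.0 / (1.0 + math.exp(-z)) for z in pre]
    return new_h
```

Calling `mean_aggregate` repeatedly, starting from the node features, corresponds to stacking aggregation iterations; in practice the matrices would differ per iteration and be learned by gradient descent.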

The OOV word problem
The out-of-vocabulary (OOV) word problem has been present in word embedding models since their inception (Mikolov et al., 2013; Pennington et al., 2014). Due to space and training data constraints, words which are either infrequent or do not appear in the training corpus can lack representations at the time of inference.
Numerous methods have been proposed to tackle the OOV word problem with a small amount of training data. Deep learning based approaches (Bahdanau et al., 2017) and vector-space based methods (Herbelot and Baroni, 2017) can improve rare word representations on various semantic similarity tasks. One downside to these approaches is that they require at least a small amount of training data for the words whose embeddings are being imputed and, as a result, can have difficulties representing words for which no training samples exist.
Sub-word level representations have been studied in the context of the OOV word problem. Pinter et al. (2017) use the RNN's hidden state of the last sub-word in a word to produce representations. Zhao et al. (2018) propose using character-level decomposition to produce embeddings for OOV words.

Model
We propose the knowledge-graph-to-vector (KG2Vec) model for building OOV word representations from knowledge base information. KG2Vec starts by building a knowledge graph $K$ with nodes consisting of pre-trained words and OOV words. It then utilizes a graph neural network (GNN) to map graph nodes to low-dimensional embeddings. The GNN is trained to minimize the Euclidean distance between the node embeddings and the pre-trained word embeddings in a dictionary such as GloVe (Pennington et al., 2014) or ConceptNet Numberbatch (Speer et al., 2017). Finally, the GNN is used to generate embeddings for OOV words.

Build the Knowledge Graph
In a knowledge graph $K$, each node $v$ represents a word $w_v$. The nodes (words) in the graph are chosen as follows. We count the frequency of occurrences for English words from the Wikipedia English dataset (with 3B tokens). The 2000 words with the highest frequencies of occurrence are skipped to diminish the effect of stop words. Among the words left, we choose the $|V'|$ words with the highest frequencies of occurrence. All OOV words for which we would like to impute embeddings are also added to the graph as nodes.
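The node selection procedure can be sketched as follows. This is a minimal illustration, not the original implementation: the function name `select_graph_words` and the default `n_keep` (standing in for the number of retained words, whose value is not stated here) are assumptions, and a real corpus would be streamed rather than held in a list.

```python
from collections import Counter

def select_graph_words(tokens, oov_words, n_skip=2000, n_keep=10000):
    """Pick graph nodes by corpus frequency.

    tokens:    iterable of corpus tokens
    oov_words: words whose embeddings should be imputed
    n_skip:    number of most frequent words to drop (2000 in the text)
    n_keep:    number of next most frequent words to keep (hypothetical default)
    """
    counts = Counter(tokens)
    # Words ordered from most to least frequent.
    ranked = [word for word, _ in counts.most_common()]
    nodes = ranked[n_skip:n_skip + n_keep]
    seen = set(nodes)
    # OOV words are always added as nodes, without duplicates.
    for word in oov_words:
        if word not in seen:
            nodes.append(word)
            seen.add(word)
    return nodes
```

Skipping the head of the frequency distribution is a simple proxy for stop word removal; it needs no curated stop word list, at the cost of also dropping some frequent content words.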
For each node, we obtain its grounded information from two sources: (I) the word's summary, defined as the first paragraph of the Wikipedia page returned when this word is searched; (II) the word's definition in Wiktionary. We choose Wikipedia and Wiktionary over other knowledge bases because they are comprehensive, well-maintained and up-to-date. Here is an example of the grounded information for the word Brexit.
• Wikipedia page summary: Brexit, a portmanteau of "British" and "exit", is the impending withdrawal of the United Kingdom (UK) from the European Union (EU). It follows the referendum of 23 June 2016 when 51.9 per cent of voters chose to leave the EU...
• Wiktionary definition: Brexit (Britain, politics) The withdrawal of the United Kingdom from the European Union.
All the words in the Wikipedia summary and the Wiktionary definition form the grounded language information of the word $w_v$, denoted $D_v$. Specifically, $D_v$ is the concatenation of $w_v$'s Wikipedia summary and its Wiktionary definition. An undirected edge $e_{vu}$ exists between nodes $v$ and $u$ if the Jaccard coefficient $\frac{|D_v \cap D_u|}{|D_v \cup D_u|} > \eta$, where $\eta$ is a pre-defined threshold, chosen empirically to be 0.5 in the experiments. The edge $e_{vu}$ is then assigned the weight $s_{vu} = \frac{|D_v \cap D_u|}{|D_v \cup D_u|}$. We also compute a feature vector $f_v$ as the mean of the pre-trained embeddings of the words in $D_v$. Finally, the obtained knowledge graph $K = (V, E)$ has a feature vector $f_v$ for each node $v \in V$.
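The edge construction and node features described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: each word's grounded information is treated as a set of tokens for the Jaccard coefficient, tokens without a pre-trained embedding are skipped when averaging, and the names `jaccard`, `build_edges` and `node_feature` are hypothetical.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient between two token collections, treated as sets."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_edges(docs, eta=0.5):
    """Return {(v, u): weight} for every word pair whose Jaccard coefficient exceeds eta.

    docs: dict mapping word -> list of tokens in its grounded information
    """
    words = sorted(docs)
    edges = {}
    for i, v in enumerate(words):
        for u in words[i + 1:]:
            s = jaccard(docs[v], docs[u])
            if s > eta:
                edges[(v, u)] = s  # undirected edge, stored once
    return edges

def node_feature(tokens, pretrained, dim):
    """Mean of the pre-trained embeddings of the tokens that have one."""
    vecs = [pretrained[t] for t in tokens if t in pretrained]
    if not vecs:
        return [0.0] * dim  # fallback when no token is covered (an assumption)
    return [sum(column) / len(vecs) for column in zip(*vecs)]
```

The pairwise loop is quadratic in the number of nodes, which is fine for a sketch; a larger graph would likely need an inverted index or locality-sensitive hashing to find candidate pairs.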

Graph Neural Network
The nodes in the graph are mapped to low-dimensional embeddings via a graph convolutional network (GCN) (Kipf and Welling, 2016). It follows that, at the $t$-th neighborhood aggregation, the node embedding $h_v^t$ for node $v$ is modelled as:
$$h_v^t = \mathrm{ReLU}\left(W_t \sum_{u \in S(v)} \frac{s_{vu}}{C} h_u^{t-1} + b_t\right)$$
where $S(v) = N(v) \cup \{v\}$, and the normalization constant $C = 1 + \sum_{u \in N(v)} s_{vu}$, so that the self-weight $s_{vv}$ is 1. $W_t$ and $b_t$ are trainable parameters. The node embeddings are initialized as the feature vector $f_v$, i.e. $h_v^0 = f_v$. At the final iteration $T$, the generated node embeddings $\{h_v^T\}$ are computed without the ReLU function. The loss function of the GNN model is the mean square error between the pre-trained word vectors and the generated embeddings $h_v^T$ for all words in the graph which are part of the model's vocabulary (e.g. GloVe). During inference, OOV words are assigned the embeddings computed by the GNN.
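A forward pass of one weighted aggregation layer, together with the mean square error loss, might look like the sketch below. It is written in plain Python for readability and makes several assumptions: each node weights itself by 1 (matching the leading 1 in the normalization constant), edges are given as a dictionary of Jaccard weights, and `gcn_layer` and `mse_loss` are hypothetical names. Training, i.e. updating the weights by gradient descent, is omitted.

```python
def gcn_layer(h, edges, W, b, final=False):
    """One weighted neighborhood aggregation over an undirected graph.

    h:     dict mapping node -> embedding from the previous iteration
    edges: dict mapping (v, u) -> edge weight s_vu
    W, b:  weight matrix (list of rows) and bias vector for this iteration
    final: if True, skip the ReLU, as done at the last iteration
    """
    # Weighted adjacency that includes each node itself with weight 1.
    weights = {v: {v: 1.0} for v in h}
    for (v, u), s in edges.items():
        weights[v][u] = s
        weights[u][v] = s
    out = {}
    for v in h:
        C = sum(weights[v].values())  # 1 + sum of neighbor weights
        agg = [sum(s * h[u][i] for u, s in weights[v].items()) / C
               for i in range(len(h[v]))]
        z = [sum(w * a for w, a in zip(row, agg)) + b_k for row, b_k in zip(W, b)]
        out[v] = z if final else [max(0.0, x) for x in z]
    return out

def mse_loss(pred, target):
    """Mean square error over the nodes that have a pre-trained vector.

    pred:   dict node -> generated embedding (may include OOV nodes)
    target: dict node -> pre-trained embedding (in-vocabulary nodes only, non-empty)
    """
    total, count = 0.0, 0
    for v, t in target.items():
        for p_i, t_i in zip(pred[v], t):
            total += (p_i - t_i) ** 2
            count += 1
    return total / count
```

Because the loss only ranges over in-vocabulary nodes, OOV nodes influence training solely through the messages they pass to their neighbors, and their own embeddings are read off the final layer at inference time.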

Experiments
To evaluate our method's ability to impute embeddings, we conduct experiments on rare and unseen word similarity tasks: the Stanford Rare Word Similarity and Card-660 datasets (Table 1). We compare KG2Vec with existing embedding imputation models, including Mimick (Pinter et al., 2017), Definition centroid (Herbelot and Baroni, 2017), Definition LSTM (Bahdanau et al., 2017), SemLand (Pilehvar and Collier, 2017) and BoS (Zhao et al., 2018). During evaluation, zero vectors are assigned to missing words and word-word similarity is computed as the inner product of the corresponding embeddings. In KG2Vec, the number of iterations
To fairly evaluate KG2Vec, we include a baseline model that assigns the node feature $f_v$ as the final word representation for word $w_v$ if $w_v$ is not in the pre-trained dictionary. The results are denoted as "Node features" in Table 1. In all test cases, KG2Vec improves upon this baseline by a large margin. For example, using GloVe on the Card-660 dataset, KG2Vec achieves a performance increase of 14.5% and 14.2% for Pearson's and Spearman's coefficients, respectively, over Node features. This observation suggests that the information aggregation performed by the GNN is critical for embedding imputation and semantic inference. It also indicates that learning from the knowledge graph and its language information is an effective way to parse the semantic meaning of a rare word.
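The evaluation protocol (zero vectors for missing words, inner-product similarity, then Pearson's and Spearman's coefficients against human ratings) can be illustrated with the following standard-library sketch. The function names are hypothetical, Spearman's coefficient is computed here as Pearson's coefficient on average ranks, and both correlation functions assume non-constant inputs.

```python
import math

def similarity(w1, w2, emb, dim):
    """Inner product of two word embeddings; missing words get a zero vector."""
    zero = [0.0] * dim
    a, b = emb.get(w1, zero), emb.get(w2, zero)
    return sum(x * y for x, y in zip(a, b))

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def ranks(xs):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rank correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(xs), ranks(ys))
```

Given a list of word pairs with human scores, one would collect `similarity` for each pair and pass the predicted and human score lists to `pearson` and `spearman`.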

Application on Entity Relations Knowledge Base. Many public knowledge bases consist of relational data in a tuple format: (entity1, entity2, relation), where entities can be considered as the "nodes" in the graph and relations define the edges. Note that there are different kinds of relations and therefore edges in the graph have different types or labels. To impute the embeddings for entities in such a scenario, one can conveniently adapt KG2Vec following Schlichtkrull et al. (2018) by learning different transformations for different types of edges.
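To make the idea of relation-specific transformations concrete, here is a rough sketch of one aggregation step in which each relation type has its own weight matrix. It is only loosely modelled on the relational GCN formulation and is not the adaptation itself: messages are passed in both directions with the same matrix, each relation's messages are averaged over that relation's neighbors, and `relational_layer`, `W_rel` and `W_self` are hypothetical names.

```python
def matvec(M, x):
    """Multiply a matrix M (list of rows) by a vector x."""
    return [sum(m * x_j for m, x_j in zip(row, x)) for row in M]

def relational_layer(h, triples, W_rel, W_self):
    """One aggregation step with a separate transformation per relation type.

    h:       dict mapping entity -> current embedding
    triples: iterable of (entity1, entity2, relation) tuples
    W_rel:   dict mapping relation -> weight matrix (list of rows)
    W_self:  weight matrix applied to the entity's own embedding
    """
    # Group each entity's neighbors by relation type.
    nbrs = {v: {} for v in h}
    for e1, e2, rel in triples:
        nbrs[e1].setdefault(rel, []).append(e2)
        nbrs[e2].setdefault(rel, []).append(e1)
    out = {}
    for v, h_v in h.items():
        z = matvec(W_self, h_v)
        for rel, us in nbrs[v].items():
            for u in us:
                msg = matvec(W_rel[rel], h[u])
                # Average the messages within each relation type.
                z = [a + m / len(us) for a, m in zip(z, msg)]
        out[v] = [max(0.0, x) for x in z]  # ReLU
    return out
```

Keeping one matrix per relation lets the model treat, for example, a "capital of" edge differently from a "located in" edge, at the cost of more parameters as the number of relation types grows.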
Adaptation to New Vocabularies and Information. Considering the fast growth of vocabularies in the current era, the ability to perform online learning and quick adaptation for embedding imputation is a desirable property. One can combine KG2Vec with meta-learning, e.g., MAML (Finn et al., 2017), such that the resulting model can quickly learn the embeddings of newly added nodes (words) or updated node features.

Conclusion and Future Work
In this paper, we introduce KG2Vec, a graph neural network based approach for embedding imputation of OOV words which makes use of grounded language information. Using publicly available information sources like Wikipedia and Wiktionary, KG2Vec can effectively impute embeddings for rare or unseen words. Experimental results show that KG2Vec achieves state-of-the-art results on the Card-660 dataset. Future research directions include a theoretical explanation of KG2Vec and applications to downstream NLP tasks.

Table 1: Performance of the models on the Stanford Rare Word Similarity and Card-660 datasets. Two word dictionaries are used: ConceptNet and GloVe. The overall best results are underlined for each column, and the best results for each type of word dictionary are in bold. We run the BoS experiments with the default hyper-parameters from Zhao et al. (2018). Performances of other baseline models are collected from Pilehvar et al. (2018).