Mikhail Khodak


2018

pdf bib
A Large Self-Annotated Corpus for Sarcasm
Mikhail Khodak | Nikunj Saunshi | Kiran Vodrahalli
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors
Mikhail Khodak | Nikunj Saunshi | Yingyu Liang | Tengyu Ma | Brandon Stewart | Sanjeev Arora
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Motivations like domain adaptation, transfer learning, and feature learning have fueled interest in inducing embeddings for rare or unseen words, n-grams, synsets, and other textual features. This paper introduces a la carte embedding, a simple and general alternative to the usual word2vec-based approaches for building such representations that is based upon recent theoretical results for GloVe-like embeddings. Our method relies mainly on a linear transformation that is efficiently learnable using pretrained word vectors and linear regression. This transform is applicable on the fly in the future when a new text feature or rare word is encountered, even if only a single usage example is available. We introduce a new dataset showing how the a la carte method requires fewer examples of words in context to learn high-quality embeddings and we obtain state-of-the-art results on a nonce task and some unsupervised document classification tasks.

2017

pdf bib
Automated WordNet Construction Using Word Embeddings
Mikhail Khodak | Andrej Risteski | Christiane Fellbaum | Sanjeev Arora
Proceedings of the 1st Workshop on Sense, Concept and Entity Representations and their Applications

We present a fully unsupervised method for automated construction of WordNets based upon recent advances in distributional representations of sentences and word-senses combined with readily available machine translation tools. The approach requires very few linguistic resources and is thus extensible to multiple target languages. To evaluate our method we construct two 600-word testsets for word-to-synset matching in French and Russian using native speakers and evaluate the performance of our method along with several other recent approaches. Our method exceeds the best language-specific and multi-lingual automated WordNets in F-score for both languages. The databases we construct for French and Russian, both languages without large publicly available manually constructed WordNets, will be publicly released along with the testsets.