MoRTy: Unsupervised Learning of Task-specialized Word Embeddings by Autoencoding

Word embeddings have undoubtedly revolutionized NLP. However, pretrained embeddings do not always work for a specific task (or set of tasks), particularly in limited resource setups. We introduce a simple yet effective, self-supervised post-processing method that constructs task-specialized word representations by picking from a menu of reconstructing transformations to yield improved end-task performance (MORTY). The method is complementary to recent state-of-the-art approaches to inductive transfer via fine-tuning, and forgoes costly model architectures and annotation. We evaluate MORTY on a broad range of setups, including different word embedding methods, corpus sizes and end-task semantics. Finally, we provide a surprisingly simple recipe to obtain specialized embeddings that better fit end-tasks.


Introduction
Word embeddings are ubiquitous in Natural Language Processing. They provide a low-effort, high pay-off way to improve the performance of a specific supervised end-task by transferring knowledge. However, recent works indicate that universally best embeddings are not yet possible (Bollegala and Bao, 2018; Kiela et al., 2018a; Dingwall and Potts, 2018), and that they instead need to be tuned to fit specific end-tasks using inductive bias, i.e., semantic supervision for the unsupervised embedding learning process (Conneau et al., 2018; Perone et al., 2018). This way, embeddings can be tuned to fit a specific single-task (ST) or multi-task (MT: set of tasks) semantics (Xiong et al., 2018).
Fine-tuning requires labeled data, which is often too small, unavailable, or of low quality; creating or extending labeled data is costly and slow. Word embeddings are typically induced from huge unlabeled corpora with billions of tokens, but for limited-resource domains like biology or medicine it becomes less clear whether such transfer still helps. We set out to create task-specialized embeddings cheaply, with self-supervision, that are able to provide consistent improvements, even in limited-resource settings.
We evaluate the impact of our method, named MORTY, on 18 publicly available benchmark tasks developed by Jastrzebski et al. (2017), using two ways to induce embeddings, Fasttext and GloVe. We test them in two setups corresponding to two different overall aims: (a) specializing embeddings to better fit a single supervised task, or (b) generalizing embeddings to multiple supervised end-tasks, i.e., optimizing MORTYs for single-task or multi-task settings. Since most embeddings are pre-trained on large corpora, we also investigate whether our method further improves embeddings trained in small-corpus setups.
Hence, we demonstrate the method's application in single-task, multi-task, and small, medium and web-scale (common crawl) corpus-size settings (Section 4). Learning to scale up by pretraining on more (un-)labeled data is both: (a) not always possible in low-resource domains due to a lack of such data, and (b) heavily increases the compute requirements of comparatively small supervised down-stream tasks. This not only leads to high per-model-instance costs but also limits learning to scale out, i.e., combining many smaller models into a larger dynamic model, as is desirable in continual learning settings, where models, inputs and objectives may emerge or disappear over time. To provide an alternative in such settings, we design MORTY as a learning-to-scale-down approach that uses less data and compute to achieve a performance improvement, despite forgoing (un-)supervised fine-tuning on target-domain data. Consequently, MORTY uses very few resources, producing a low carbon footprint, especially compared to recent, compute-intensive scale-up approaches like ELMo or BERT (Peters et al., 2018; Devlin et al., 2018), which have high hardware and training-time requirements and a large carbon footprint, as recently demonstrated by Strubell et al. (2019). As a result, we demonstrate a simple, unsupervised scale-down method that allows further pretraining exploitation while requiring minimal extra effort, time and compute resources. As in standard methodology, optimal post-processed embeddings can be selected according to multiple proxy tasks for overall improvement, or using a single end-task's development split, e.g., on a fast baseline model for further time reduction.

MoRTy embeddings
Our proposed post-processing method provides a Menu of Reconstructing Transformations to yield improved end-task performance (MORTY).

Approach:
The key idea of MORTY is to create a family of embeddings by learning to reconstruct the original pre-trained embeddings space via autoencoders.
The resulting family of representations (post-processed embeddings) gives a "menu" which can be picked from in two ways: (a) standard development-set tuning, to gain performance on a single supervised task (ST), or (b) via benchmark tasks, to boost performance on multiple tasks (MT). The first is geared towards optimizing embeddings for a single specific task (specialization), the latter aims at embedding generalization that works well across tasks.
In more detail, the overall MORTY recipe is: (1) Train (or take) an original (pre-trained) embedding space E org, produced by embedding method f . (2) Reconstruct E org : compute multiple randomly initialized reconstructions of E org using a reconstruction loss (mean squared error, cf. below). (3) Pick: select the performance-optimal representation for the end-task(s) via a task's development split(s) or via proxy tasks, depending on the end goal, i.e., specialization or generalization. (4) Gain: use the optimal MORTY (E post ) to push relative performance on the end task(s).
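Steps (2)-(4) amount to a simple select-by-validation loop. A minimal sketch of the "pick" step follows; note that `dev_score` is a hypothetical stand-in for evaluating an embedding space on a development split (ST) or a proxy-task suite (MT), not part of the paper's released code:

```python
import numpy as np

def pick_best_embedding(candidates, dev_score):
    """Step (3): from a 'menu' of reconstructed embedding spaces,
    keep the one that scores highest on the end-task's development
    split (ST) or on a set of proxy tasks (MT)."""
    return max(candidates, key=dev_score)

# Toy usage: three random candidate spaces and a stand-in scorer.
rng = np.random.default_rng(0)
menu = [rng.normal(size=(100, 50)) for _ in range(3)]
dev_score = lambda E: float(-abs(E.mean()))  # placeholder, not a real task metric
E_post = pick_best_embedding(menu, dev_score)
```

In practice the candidates would be the randomly initialized reconstructions from step (2), and `dev_score` a fast baseline model's development-set accuracy.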
Which autoencoder variant? For step (2), we found the following autoencoder recipe to work best: a linear autoencoder with one hidden layer, trained via bRMSE (batch-wise root mean squared error), with the same hidden-layer size as the original embedding model and half of its learning rate, i.e., a linear, complete autoencoder, trained for a single epoch (cf. end of Section 3).
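In our notation (not spelled out in the text), the batch-wise RMSE over a batch of b original d-dimensional embedding vectors, with encoder and decoder weights W_e and W_d of the linear, complete autoencoder, can be written as:

```latex
% batch-wise root mean squared error; reconstruction \hat{x}_i = W_d W_e x_i
\mathcal{L}_{\mathrm{bRMSE}} =
  \sqrt{\; \frac{1}{b\,d} \sum_{i=1}^{b} \big\lVert W_d W_e x_i - x_i \big\rVert_2^2 \;}
```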
We experimented with alternative autoencoders: sparse (Ranzato et al., 2007), denoising, discrete (Subramanian et al., 2018), and undercomplete autoencoders, but found the simple recipe above to work best. In the remainder of the paper, we evaluate this 'imitation-scheme' recipe.
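As a concrete illustration, the recipe can be sketched in plain NumPy. This is our own minimal reading of the setup; the function name, weight-initialization scale, and batch size are illustrative assumptions, not the authors' released implementation:

```python
import numpy as np

def morty_autoencode(E, lr=0.025, epochs=1, batch_size=128, seed=0):
    """Reconstruct an embedding matrix E (vocab x dim) with a linear,
    complete autoencoder (hidden size = dim) trained with a batch-wise
    RMSE loss; the hidden representations serve as the post-processed
    ("MoRTy") embedding space. Defaults follow the paper's recipe:
    one epoch, lr = half of the base embedder's 0.05."""
    rng = np.random.default_rng(seed)
    vocab, dim = E.shape
    W_enc = rng.normal(0.0, 0.1, (dim, dim))  # illustrative init scale
    W_dec = rng.normal(0.0, 0.1, (dim, dim))
    for _ in range(epochs):
        order = rng.permutation(vocab)
        for start in range(0, vocab, batch_size):
            X = E[order[start:start + batch_size]]       # (b, dim)
            H = X @ W_enc                                # encode
            X_hat = H @ W_dec                            # reconstruct
            diff = X_hat - X
            rmse = np.sqrt(np.mean(diff ** 2)) + 1e-12   # batch-wise RMSE
            g_out = diff / (diff.size * rmse)            # dLoss/dX_hat
            grad_dec = H.T @ g_out
            grad_enc = X.T @ (g_out @ W_dec.T)
            W_dec -= lr * grad_dec
            W_enc -= lr * grad_enc
    return E @ W_enc  # post-processed embeddings

# Toy usage on a random stand-in "embedding" matrix.
E = np.random.default_rng(1).normal(size=(200, 64))
E_post = morty_autoencode(E)
```

Running this with several seeds yields the "menu" of candidate spaces from which the best-performing one is picked.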

Experiments
With the aim of deriving a simple yet effective 'best practice' usage recipe, we evaluate MORTY as follows: a) using two word embedding methods f ; b) using corpora of different sizes to induce E org , i.e., small, medium and web-scale; c) evaluating across 18 semantic benchmark tasks spanning three semantic categories to broadly examine MORTY's impact, while assessing both single and multi-task end goals; and finally d) evaluating 1-epoch setups in relation to different corpus sizes.
Embeddings and Corpus Size: We evaluate embeddings trained on small, medium (millions of tokens) and large (billions of tokens) corpus sizes. In particular, we train 100-dimensional embeddings with Fasttext (Bojanowski et al., 2016) and GloVe (Pennington et al., 2014) on the 2M and 103M WikiText corpora created by Merity et al. (2016). We complement them with off-the-shelf web-scale Fasttext and GloVe embeddings (trained on 600B and 840B tokens, respectively). This results in the following vocabulary sizes for Fasttext and GloVe, respectively: 25,249 and 33,237 word types on 2M, and 197,256 and 267,633 on 103M. The public, off-the-shelf, common-crawl-trained Fasttext and GloVe embeddings have very large vocabularies of 1,999,995 and 2,196,008 words.
To account for variation in results, we train both embedding methods five times each on the two WikiText corpus sizes. We observed only minor variations, < 0.5% between runs for both Fasttext and GloVe, in overall performance Σ, i.e., when summing the scores of all benchmark tasks.

[3] Original Fasttext and GloVe used lr = 0.05, so lr ≈ 0.025 is a 'careful' rate and is used throughout the experiments in this paper. [4] To train Fasttext we used https://fasttext.cc. [5] To train GloVe we used the python glove_python wheel. [6] Fasttext was trained using the implementation's …

Results
The main results are provided in Table 1.

f : Fasttext and GloVe: First, regarding the base embeddings (cf. per-category base performance scores in Table 1): i) we notice that Fasttext performs overall better than GloVe; ii) classification and similarity results improve the larger the corpus, consistently over f ; and iii) GloVe is better for the analogy tasks on web-scale data.

MORTY for multi-task application: Second, the MT % change columns show that a single best MORTY improves overall performance Σ (black row), the sum of the 18 tasks, by 8.9, 5.8 and 3.4 percent compared to the Fasttext base. As corpus size increases, there is less room for MORTY to improve Σ scores. What is interesting to note is that MORTY is able to recover analogy performance on 103M (to more than 2M level). This is also reflected in the Google and MSR analogy scores doubling and tripling (middle column). On 2M we also see a modest improvement (6.2) for similarity tasks, while classification on 2M dropped slightly. Regarding GloVe (3 rightmost columns), we notice lower overall performance (black column), which is consistent with findings by Levy et al. (2015). MORTY on GloVe produces lower but more stable improvements in the MT setting (middle column), with analogy and similarity performance noticeably increasing on the small 2M dataset. Generally, we see both performance increases and drops for individual tasks, especially on 2M with Fasttext, indicating that a single overall best MORTY specializes the base Fasttext embeddings to better fit a specific subset of the 18 tasks, while still beating the base embedder f in overall score (Σ).

MORTY for single-task application:
In the ST % change columns we see the best single-task (ST) results for task-specific optimal MORTY embeddings. Both embedders get consistent boosts, with Fasttext exhibiting significantly higher improvements from MORTY on 2M and 103M, despite already starting out at a higher base performance.
[Figure 1: MT score in % (18 tasks) by training corpus size (small, medium, common crawl).]

Applying the MORTY 1-epoch recipe: So far, we saw MORTY's potential for overall (ST/MT/Σ) performance improvements, but will we observe the same in the wild? To answer this question for the MT use case, we apply a 1-epoch-only training recipe. That is, we train three randomly initialized MORTYs for one epoch each, using a linear, complete autoencoder with half of the base embedder's learning rate, and then test them on the 18-task (MT) setup. Figure 1 shows consistent MT/Σ score improvements for each of the 3 MORTY-over-Fasttext runs (red, yellow, green) on 2M, 103M, and 600B vs. base Fasttext (blue, 100).
We see that, for practical application, this allows MORTY to boost supervised MT performance even without using a supervised development split or proxy task(s), while also eliminating multi-epoch tuning. Both Figure 1 and Table 1 show similar overall (MT) improvements per corpus size, which suggests that 1-epoch training is sufficient and that MORTY is especially beneficial on smaller corpora, i.e., in limited-resource settings.

Related Work
There is a large body of work on information transfer between supervised and unsupervised tasks. First and foremost, unsupervised-to-supervised transfer includes using embeddings for supervised tasks. However, transfer also works vice versa, in a supervised-to-unsupervised setup, to (learn to) specialize embeddings to better fit a specific supervised signal (Ruder and Plank, 2017; Ye et al., 2018). This includes injecting generally relevant semantics via retrofitting or auxiliary multi-task supervision (Faruqui et al., 2015; Kiela et al., 2018b). Supervised-to-supervised methods provide knowledge transfer between supervised tasks, which is exploited successively (Kirkpatrick et al., 2017), jointly (Kiela et al., 2018b), and in joint succession (Hashimoto et al., 2017).
Unsupervised-to-unsupervised transfer is less studied. Dingwall and Potts (2018) proposed a GloVe model modification that retrofits publicly available GloVe embeddings to produce specialized domain embeddings, while Bollegala and Bao (2018) propose meta-embeddings via denoising autoencoders to merge diverse (Fasttext and GloVe) embedding spaces. The latter is also a low-effort approach and the closest to ours. However, it focuses on embedding merging, tuned on a single semantic similarity task, while MORTY provides an overview of tuning for 19 different settings. Furthermore, MORTY requires only a single embedding space, which contributes to the literature by outlining that meta-embedding improvements may partly stem from re-encoding rather than only from semantic merging.

Conclusion
We demonstrated a low-effort, self-supervised, learning-to-scale-down method to construct task-optimized word embeddings from existing ones to gain performance on a (set of) supervised end-task(s) without direct domain adaptation. Despite its simplicity, MORTY is able to produce significant performance improvements in single and multi-task supervision settings, as well as for a variety of desirable word encoding properties, while forgoing building and tuning complex model architectures and labeling. Perhaps most importantly, MORTY shows considerable benefits in low-resource settings and thus provides a learning-to-scale-down alternative to recent scale-up approaches.

Acknowledgements
This work was supported by the German Federal Ministry of Education and Research (BMBF) through the project DEEPLEE (01IW17001) and by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 780495 (BigMedilytics). We also thank Philippe Thomas and Isabelle Augenstein for helpful discussions.