Charmanteau: Character Embedding Models For Portmanteau Creation

Portmanteaus are a word formation phenomenon where two words combine into a new word. We propose character-level neural sequence-to-sequence (S2S) methods for the task of portmanteau generation that are end-to-end-trainable, language independent, and do not explicitly use additional phonetic information. We propose a noisy-channel-style model, which allows for the incorporation of unsupervised word lists, improving performance over a standard source-to-target model. This model is made possible by an exhaustive candidate generation strategy specifically enabled by the features of the portmanteau task. Experiments find our approach superior to a state-of-the-art FST-based baseline with respect to ground truth accuracy and human evaluation.


Introduction
Portmanteaus (or lexical blends; Algeo, 1977) are novel words formed from parts of multiple root words in order to refer to a new concept that cannot otherwise be expressed concisely. Portmanteaus have become frequent in modern-day social media, news reports and advertising, one popular example being Brexit (Britain + Exit) (Petri, 2012). They are found not only in English but also in many other languages, such as Bahasa Indonesia (Dardjowidjojo, 1979), Modern Hebrew (Bat-El, 1996; Berman, 1989) and Spanish (Piñeros, 2004). Their short length makes them ideal for headlines and brand names (Gabler, 2015). Unlike better-defined morphological phenomena such as inflection and derivation, portmanteau generation is difficult to capture using a set of rules. For instance, Shaw et al. (2014) state that the composition of the portmanteau from its root words depends on several factors, two important ones being maintaining prosody and retaining character segments from the root words, especially the head. An existing work by Deri and Knight (2015) aims to solve the problem of predicting portmanteaus using a multi-tape FST model, which is data-driven, unlike prior approaches. Their methods rely on a grapheme-to-phoneme converter, which takes into account the phonetic features of the language but may not be available or accurate for non-dictionary words or low-resource languages.
* denotes equal contribution
Figure 1: A sketch of our BACKWARD, noisy-channel model. The attentional S2S model with bidirectional encoder gives $P(x|y)$ and the next-character model gives $P(y)$, where $y$ (spime) is the portmanteau and $x = \mathrm{concat}(x^{(1)}, \text{";"}, x^{(2)})$ are the concatenated root words (space and time).
Prior works, such as Faruqui et al. (2016), have demonstrated the efficacy of neural approaches for morphological tasks such as inflection. We hypothesize that such neural methods can (1) provide a simpler and more integrated end-to-end framework than the multiple FSTs used in the previous work, and (2) automatically capture features such as phonetic similarity through the use of character embeddings, removing the need for explicit grapheme-to-phoneme prediction. To test these hypotheses, in this paper we propose a neural S2S model to predict portmanteaus given the two root words, specifically making the following contributions:
• We propose an S2S model that attends to the two input words to generate portmanteaus, and an additional improvement that leverages noisy-channel-style modelling to incorporate a language model over the vocabulary of words (§2).
• Instead of using the model to directly predict output character-by-character, we use the features of portmanteaus to exhaustively generate candidates, making scoring using the noisy channel model possible (§3).
In experiments (§5), our model performs better than the baseline of Deri and Knight (2015) on both objective and subjective measures, demonstrating that such methods can be used effectively in a morphological task.

Proposed Models
This section describes our neural models.

Forward Architecture
Under our first proposed architecture, the input sequence is $x = \mathrm{concat}(x^{(1)}, \text{";"}, x^{(2)})$, while the output sequence is the portmanteau $y$. The model learns the distribution $P(y|x)$. The network architecture we use is an attentional S2S model (Bahdanau et al., 2014). We use a bidirectional encoder, which is known to work well for S2S problems in which input and output have a similar token order, as is true in our case. Let $\overrightarrow{\mathrm{LSTM}}$ and $\overleftarrow{\mathrm{LSTM}}$ represent the forward and reverse encoders, and let $e_{enc}()$ and $e_{dec}()$ represent the character embedding functions used by the encoder and decoder. The context vector $c_t$ is computed using dot-product attention over the encoder states. We choose dot-product attention because it doesn't add extra parameters, which is important in a low-data scenario such as portmanteau generation.
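As an illustration of why dot-product attention adds no parameters, the following is a minimal sketch of a single attention step in plain Python. The function name `dot_product_attention` and the representation of states as lists of floats are assumptions made for this example; it is not the paper's implementation, and it assumes the decoder state and encoder states have the same dimensionality.

```python
import math

def dot_product_attention(decoder_state, encoder_states):
    """One attention step: returns (weights, context).

    Hypothetical helper for illustration. Scores are plain dot products,
    so no weight matrix (and hence no extra parameter) is involved.
    """
    # Unnormalised score of each encoder position against the decoder state.
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encoder_states]
    # Numerically stable softmax over the input positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [v / total for v in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * enc[k] for w, enc in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context
```

The weights form a distribution over input characters, which is what an attention heat map over the root words would visualise.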
To capture the fact that portmanteaus of two English words typically sound English-like, and to compensate for the small amount of available portmanteau data, we pretrain the character embeddings on English words. We use character embeddings learnt by an LSTM language model over words in an English dictionary, where each word is a sequence of characters and the model predicts the next character in the sequence conditioned on the previous characters.

Backward Architecture
The second proposed model uses Bayes's rule to reverse the probabilities, $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$. Since $P(x)$ does not depend on $y$, we get $\arg\max_y P(y|x) = \arg\max_y P(x|y)P(y)$. Thus, we have a reverse model of the probability $P(x|y)$ that the given root words were generated from the portmanteau, and a character language model $P(y)$. The latter is a probability distribution over all character sequences $y \in A^*$, where $A$ is the alphabet of the language. This way of factorizing the probability is also known as a noisy channel model, which has recently been shown to be effective for neural MT (Hoang et al., 2017; Yu et al., 2016). Such a model offers two advantages:
1. The reverse direction model (or alignment model) gives higher probability to those portmanteaus from which one can discern the root words easily, which is one feature of good portmanteaus.
2. The character language model $P(y)$ can be trained on a large vocabulary of words in the language. The likelihood of a word $y$ is factorized as $P(y) = \prod_{i=1}^{|y|} P(y_i \mid y_1^{i-1})$, where $y_i^j = y_i, y_{i+1}, \ldots, y_j$, and we train an LSTM to maximize this likelihood.
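The noisy-channel objective can be sketched in log space as follows. This is a minimal illustration, not the paper's code: `reverse_logprob(x, y)` and `next_char_logprob(prefix, ch)` are hypothetical callables standing in for the trained reverse S2S model and the LSTM character language model, and any end-of-word symbol is omitted for brevity.

```python
import math

def sequence_logprob(y, next_char_logprob):
    """log P(y) as a sum of per-character conditional log-probabilities.

    next_char_logprob(prefix, ch) is a hypothetical stand-in for the
    character language model; it returns log P(ch | prefix).
    """
    return sum(next_char_logprob(y[:i], ch) for i, ch in enumerate(y))

def noisy_channel_best(x, candidates, reverse_logprob, next_char_logprob):
    """Pick argmax_y of log P(x|y) + log P(y) over a finite candidate set."""
    def score(y):
        return reverse_logprob(x, y) + sequence_logprob(y, next_char_logprob)
    return max(candidates, key=score)

# Toy usage with made-up scorers (not trained models).
toy_lm = lambda prefix, ch: math.log(0.1)   # every character equally likely
toy_reverse = lambda x, y: 0.0              # reverse model is indifferent
best = noisy_channel_best("space;time", ["spime", "spacime"], toy_reverse, toy_lm)
```

Note that the argmax is taken over a finite list of candidates here; maximizing over all of $A^*$ is not attempted in this sketch.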

Making Predictions
Given these models, we must make predictions, which we do by two methods:
Greedy Decoding: As in most neural sequence-to-sequence models, we perform autoregressive greedy decoding, selecting at each time step the most probable next character under the model's distribution for that step. We refer to this decoding strategy as GREEDY.
Exhaustive Generation: Many portmanteaus were observed to be a concatenation of a prefix of the first word and a suffix of the second. We therefore generate all candidate outputs that follow this rule. Thereafter, we score these candidates with the decoder and output the one with the maximum score. We refer to this decoding strategy as SCORE.
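The candidate generation step of SCORE can be sketched as below. This is an illustrative sketch under one assumption not stated in the text: both the prefix and the suffix are taken to be non-empty. The function names and the `score` callable are hypothetical.

```python
def prefix_suffix_candidates(first, second):
    """All strings formed by a non-empty prefix of `first` followed by a
    non-empty suffix of `second`, in generation order, without duplicates."""
    seen = set()
    candidates = []
    for i in range(1, len(first) + 1):      # prefix first[:i]
        for j in range(len(second)):        # suffix second[j:]
            cand = first[:i] + second[j:]
            if cand not in seen:
                seen.add(cand)
                candidates.append(cand)
    return candidates

def score_decode(first, second, score):
    """Return the highest-scoring candidate under a hypothetical scorer."""
    return max(prefix_suffix_candidates(first, second), key=score)

# Toy usage: the scorer here is a made-up stand-in, not a trained model.
cands = prefix_suffix_candidates("space", "time")
shortest = score_decode("space", "time", score=lambda y: -len(y))
```

The candidate set is at most $|x^{(1)}| \cdot |x^{(2)}|$ strings, which is what makes exhaustive scoring affordable.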
Given that our training data is small in size, we expect ensembling (Breiman, 1996) to help reduce model variance and improve performance. In this paper, we ensemble our models wherever mentioned by training multiple models on 80% subsamples of the training data, and averaging log probability scores across the ensemble at test-time.
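The ensembling scheme described above can be sketched as follows. The helper names are hypothetical, each `scorer` stands in for one trained model returning a log probability, and sampling without replacement is an assumption of this sketch.

```python
import random

def subsample(data, fraction=0.8, seed=0):
    """Draw a random subsample (without replacement, by assumption)
    containing `fraction` of the training examples."""
    rng = random.Random(seed)
    k = int(round(fraction * len(data)))
    return rng.sample(data, k)

def ensemble_score(candidate, scorers):
    """Average the log-probability scores of a candidate across models."""
    return sum(scorer(candidate) for scorer in scorers) / len(scorers)

# Toy usage: two made-up scorers in place of trained ensemble members.
avg = ensemble_score("spime", [lambda y: -1.0, lambda y: -3.0])
```

Each ensemble member would be trained on a different `subsample(train, seed=k)`, and the averaged score would then replace the single-model score when ranking candidates.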

Dataset
The existing dataset by Deri and Knight (2015) contains 401 portmanteau examples from Wikipedia. We refer to this dataset as $D_{Wiki}$. Besides being too small for detailed evaluation, $D_{Wiki}$ is biased by being drawn from just one source. We manually collect $D_{Large}$, a dataset of 1624 distinct English portmanteaus from the following sources:

Experiments
In this section, we show results comparing various configurations of our model to the baseline FST model of Deri and Knight (2015) (BASELINE). Models are evaluated using exact matches (Matches) and average Levenshtein edit distance (Distance) with respect to the ground truth.
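The two metrics can be computed as in the following sketch. It uses the standard dynamic-programming Levenshtein distance with unit costs; reporting Matches as a fraction rather than a percentage is a choice made for this example.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or keep
        prev = cur
    return prev[-1]

def evaluate(predictions, references):
    """Return (fraction of exact matches, mean edit distance)."""
    n = len(references)
    matches = sum(p == r for p, r in zip(predictions, references))
    distance = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    return matches / n, distance / n

# Toy usage on two made-up prediction/reference pairs.
result = evaluate(["spime", "brexit"], ["spime", "britexit"])
```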

Objective Evaluation Results
In Experiment 1, we follow the same setup as Deri and Knight (2015). $D_{Wiki}$ is split into 10 folds. Each fold model uses 8 folds for training, 1 for validation, and 1 for test. Performance metrics on the test folds are then averaged, in the style of 10-fold cross-validation.
[...] initializing the word embeddings. We believe this is because portmanteaus have high fidelity towards their root word characters and it is critical that the model can observe all root sequence characters, which attention manages to do, as shown in Fig. 2.
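The 8/1/1 fold setup of Experiment 1 can be sketched as below. How examples are assigned to folds and which fold serves as validation for a given test fold are not specified in the text; the round-robin assignment and the "next fold is validation" rule are assumptions of this sketch.

```python
def fold_splits(examples, n_folds=10):
    """Yield (train, valid, test) for each of n_folds rotations.

    Assumptions for illustration: round-robin fold assignment, and the
    fold after the test fold (cyclically) is used for validation.
    """
    folds = [examples[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        v = (k + 1) % n_folds
        train = [ex for i, fold in enumerate(folds)
                 if i not in (k, v) for ex in fold]
        yield train, folds[v], folds[k]

# Toy usage on 20 dummy examples.
splits = list(fold_splits(list(range(20))))
```

Metrics would be computed on each `test` fold with the model trained on the matching `train` folds and then averaged over the 10 rotations.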

Performance on Uncovered Examples
The set of candidates generated before scoring in the approximate SCORE decoding approach sometimes does not cover the ground truth.
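Whether a given ground-truth portmanteau is covered can be checked without enumerating the candidates, by testing every split point of the target. This is a sketch assuming the candidate rule is "non-empty prefix of the first word plus non-empty suffix of the second"; the function name is hypothetical.

```python
def is_covered(first, second, target):
    """True if target = (non-empty prefix of first) + (non-empty suffix of second)."""
    return any(first.startswith(target[:k]) and second.endswith(target[k:])
               for k in range(1, len(target)))

# Toy usage: fraction of (first, second, target) triples that are covered.
triples = [("space", "time", "spime"), ("abc", "xyz", "azb")]
coverage = sum(is_covered(*t) for t in triples) / len(triples)
```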

Significance Tests
Since our dataset is still relatively small (1223 examples), it is essential to verify whether BACKWARD is indeed statistically significantly better than BASELINE in terms of Matches. In order to do this, we use a paired bootstrap comparison (Koehn, 2004) between BACKWARD and BASELINE in terms of Matches. BACKWARD is found to be better (gets more Matches) than BASELINE in 99.9% (p = 0.999) of the subsets.
Similarly, BACKWARD has a lower Distance than BASELINE by a margin of 0.2 in 99.5% (p = 0.995) of the subsets.
Table 4: AMT annotator judgements on whether our system's proposed portmanteau is better or worse compared to the baseline.
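A paired bootstrap comparison on Matches can be sketched as follows. The number of resamples, resampling with replacement at the full test-set size, and the function name are assumptions of this sketch; the inputs are per-example 0/1 correctness indicators for the two systems on the same test examples.

```python
import random

def paired_bootstrap(correct_a, correct_b, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system A has strictly
    more exact matches than system B (same resampled indices for both)."""
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]   # sample with replacement
        a = sum(correct_a[i] for i in idx)
        b = sum(correct_b[i] for i in idx)
        wins += a > b
    return wins / n_samples

# Toy usage with made-up correctness vectors.
win_rate = paired_bootstrap([1, 1, 1, 1], [0, 0, 0, 0])
```

The same procedure applies to Distance by replacing the 0/1 indicators with per-example edit distances and counting resamples in which one system's total is lower by the desired margin.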

Subjective Evaluation and Analysis
On inspecting outputs, we observed that output from our system often seemed good in spite of a high edit distance from the ground truth. This perceived quality is not captured satisfactorily by measures like edit distance. To compare the errors made by our model to those of the baseline, we designed and conducted a human evaluation task on AMT. In the survey, we show human annotators outputs from our system and from the baseline. We ask them to judge which alternative is better overall based on the following criteria:
1. It is a good shorthand for the two original words.
2. It sounds better.
We requested annotation on a scale of 1-4. To avoid ordering bias, we shuffled the order in which the two portmanteaus, one from our system and one from the baseline, were shown. We restrict annotators to be from Anglophone countries and to have a HIT Approval Rate > 80%, and we pay $0.40 per HIT (5 questions per HIT).
As seen in Table 4, output from our system was labelled better by humans as compared to the baseline 58.12% of the time. Table 3 shows outputs from different models for a few examples.

Related Work
Ozbal and Strapparava (2012) generate new words to describe a product given its category and properties. However, their method is limited to hand-crafted rules, in contrast to our data-driven approach, and their focus is on brand names. Hiranandani et al. (2017) have proposed an approach to recommend brand names based on a brand or product description. However, they consider only a limited number of features, such as memorability and readability. Smith et al. (2014) devise an approach to generate portmanteaus, which requires user-defined weights for attributes like sounding good.
Generating a portmanteau from two root words can be viewed as an S2S problem. Recently, neural approaches have been used for S2S problems (Sutskever et al., 2014) such as MT. Ling et al. (2015) and Chung et al. (2016) have shown that character-level neural sequence models work as well as word-level ones for language modelling and MT. Zoph and Knight (2016) propose S2S models for multi-source MT, which have multi-sequence inputs, similar to our case.

Conclusion
We have proposed an end-to-end neural system to model portmanteau generation. Our experiments show the efficacy of the proposed system in predicting portmanteaus given the root words. We conclude that pre-training character embeddings on the English vocabulary helps the model. Through human evaluation, we show that our model's predictions are superior to those of the baseline. We have also released our dataset and code to encourage further research on the phenomenon of portmanteaus, along with an online demo where our trained model can be queried for portmanteau suggestions. An obvious extension to our work is to try similar models on multiple languages.