Bringing Order to Neural Word Embeddings with Embeddings Augmented by Random Permutations (EARP)

Word order is clearly a vital part of human language, but it has been used comparatively lightly in distributional vector models. This paper presents a new method for incorporating word order information into word vector embedding models by combining the benefits of permutation-based order encoding with the more recent method of skip-gram with negative sampling. The new method introduced here is called Embeddings Augmented by Random Permutations (EARP). It operates by applying permutations to the coordinates of context vector representations during the process of training. Results show an 8% improvement in accuracy on the challenging Bigger Analogy Test Set, and smaller but consistent improvements on other analogy reference sets. These findings demonstrate the importance of order-based information in analogical retrieval tasks, and the utility of random permutations as a means to augment neural embeddings.


Introduction
The recognition of the utility of corpus-derived distributed representations of words for a broad range of Natural Language Processing (NLP) tasks (Collobert et al., 2011) has led to a resurgence of interest in methods of distributional semantics. In particular, the neural-probabilistic word representations produced by the Skip-gram and Continuous Bag-of-Words (CBOW) architectures (Mikolov et al., 2013a) implemented in the word2vec and fastText software packages have been extensively evaluated in recent years.
As was the case with preceding distributional models (see for example Schütze, 1993; Lund and Burgess, 1996; Schütze, 1998; Karlgren and Sahlgren, 2001), these architectures generate vector representations of words - or word embeddings - such that words with similar proximal neighboring terms within a corpus of text will have similar vector representations. As the relative position of these neighboring terms is generally not considered, distributional models of this nature are often (and sometimes derisively) referred to as bag-of-words models. While methods of encoding word order into neural-probabilistic representations have been evaluated, these methods generally require learning additional parameters, either for each position in the sliding window (Mnih and Kavukcuoglu, 2013), or for each context word-by-position pair (Trask et al., 2015; Ling et al., 2015).
In this paper we evaluate an alternative method of encoding the position of words within a sliding window into neural word embeddings using Embeddings Augmented by Random Permutations (EARP). EARP leverages random permutations of context vector representations (Sahlgren et al., 2008), a technique that has not previously been applied during the training of neural word embeddings. Unlike prior approaches to encoding position into neural word embeddings, it imposes no additional computational requirements and only negligible space requirements.
Results show that the word order information encoded through EARP leads to a nearly 8% improvement (from 29.17% to 37.07%) in the accuracy of analogy predictions on the Bigger Analogy Test Set, with the improvement being largest (over 20%) in the category of analogies that exhibit derivational morphology (a case where the use of subword information also improves accuracy for all representations, with and without order information). Smaller improvements in performance are evident on other analogy sets, and in downstream sequence labeling tasks. This makes EARP a strong contender for enriching word embeddings with order-based information, leading to greater accuracy on more challenging semantic processing tasks.

Distributed representations of words
Methods of distributional semantics learn representations of words from the contexts they occur in across large text corpora, such that words that occur in similar contexts will have similar representations (Turney and Pantel, 2010). Geometrically motivated approaches to this problem often involve decomposition of a term-by-context matrix (Schütze, 1993; Landauer and Dumais, 1997; Pennington et al., 2014), resulting in word vectors of considerably lower dimensionality than the total number of contexts. Alternatively, reduced-dimensional representations can be generated online while processing individual units of text, without the need to represent a large term-by-context matrix explicitly. A seminal example of the latter approach is the Random Indexing (RI) method (Kanerva et al., 2000), which generates distributed representations of words by superposing randomly generated context vector representations.
Neural-probabilistic methods have shown utility as a means to generate semantic vector representations of words (Bengio et al., 2003). In particular, the Skip-gram and CBOW neural network architectures (Mikolov et al., 2013a) implemented within the word2vec and fastText software packages provide scalable approaches to online training of word vector representations that have been shown to perform well across a number of tasks (Mikolov et al., 2013a; Levy et al., 2015).
While the definition of what constitutes a context varies across models, a popular alternative is to use words in a sliding window centered on a focus term for this purpose. Consequently, where decomposition is involved, the matrix in question would be a term-by-term matrix. With online methods, each term has two vector representations - a semantic vector and a context vector - corresponding to the input and output weights for each word in neural-probabilistic approaches.

Order-based distributional models
In most word vector embedding models based on sliding windows, the relative position of words within this sliding window is ignored, but there have been prior efforts to encode this information. Before the popularization of neural word embeddings, a number of researchers developed and evaluated methods to encode word position into distributed vector representations. The BEAGLE model (Jones et al., 2006) uses circular convolution - as described by Plate (1995) - as a binding operator to compose representations of n-grams from randomly instantiated context vector representations of individual terms. These composite representations are then added to the semantic vector representation for the central term in a sliding window. The cosine similarity between the resulting vectors is predictive of human performance in semantic priming experiments.
A limitation of this approach is that it involves a large number of operations per sliding window. For example, Jones and his colleagues (2006) employ eight superposition and nine convolution operations to represent a single sliding window of three terms (excluding the focus term). Sahlgren, Holst and Kanerva (2008) report a computationally simpler method of encoding word order in the context of RI. Like BEAGLE, this approach involves adding randomly instantiated context vectors for adjacent terms to the semantic vector of the central term in a sliding window. However, RI uses sparse context vectors, consisting of mostly zeroes with a small number of non-zero components initialized at random. These vectors are assigned permutations indicating their position within a sliding window. For example, with Π_p representing the permutation assigned to a given position p, and the words on the right-hand side of the equation representing their context vectors, the sliding window "[fast is fine but accuracy] is everything" would result in the following update to S(fine), the semantic vector for the term "fine":

S(fine) ← S(fine) + Π_-2(fast) + Π_-1(is) + Π_+1(but) + Π_+2(accuracy)

Amongst the encoding schemes evaluated using this approach, using a pair of permutations to distinguish between words preceding the focus term and those following it resulted in the best performance on synonym test evaluations (Sahlgren et al., 2008). Furthermore, this approach was shown to outperform BEAGLE on a number of evaluations on account of its ability to scale up to larger amounts of input data (Recchia et al., 2010).
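As a concrete illustration, the permutation-encoded RI update above can be sketched in a few lines of NumPy. The dimensionality, sparsity, and random seed here are illustrative choices, not those of any published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1000  # illustrative dimensionality

def sparse_random_vector():
    """A mostly-zero RI context vector with ten random +1/-1 components."""
    v = np.zeros(dim)
    idx = rng.choice(dim, size=10, replace=False)
    v[idx] = rng.choice([-1.0, 1.0], size=10)
    return v

# Context vectors for the neighbors of "fine" in "[fast is fine but accuracy]"
context = {w: sparse_random_vector() for w in ["fast", "is", "but", "accuracy"]}

# One random permutation per window position, stored as an index array,
# so that permuting a vector is just fancy indexing: v[perm[p]].
perm = {p: rng.permutation(dim) for p in (-2, -1, +1, +2)}

# The update to S(fine): superpose the position-permuted context vectors.
s_fine = np.zeros(dim)
for p, w in zip((-2, -1, +1, +2), ["fast", "is", "but", "accuracy"]):
    s_fine += context[w][perm[p]]
```

Note that permuting a sparse vector leaves its norm and sparsity unchanged; only the positions of its non-zero components move.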

Encoding order into neural embeddings
Some recent work has evaluated the utility of encoding word order into neural word embeddings.
A straightforward way to accomplish this involves maintaining a separate set of parameters for each context word position, so that for a vocabulary of size v and p context window positions the number of output weights in the model is v × p × d for embeddings of dimensionality d. This approach was applied by Ling and his colleagues (2015) to distinguish between terms occurring before and after the central term in a sliding window, with improvements in performance in downstream part-of-speech tagging and dependency parsing tasks.
Trask and his colleagues (2015) develop this idea further in a model they call a Partitioned Embedding Neural Network (PENN). In a PENN, both the input weights (word embeddings) and output weights (context embeddings) of the network have separate position-specific instantiations, so the total number of model parameters is 2v × p × d. In addition to evaluating binary (before/after) context positions, the utility of training separate weights for each position within a sliding window was evaluated. Incorporating order in this way resulted in improvements in accuracy over word2vec's CBOW implementation on proportional analogy problems, which were considerably enhanced by the incorporation of character-level embeddings.
A more space-efficient approach to encoding word order involves learning a vector V_p for each position p in the sliding window. These vectors are applied to rescale context vector representations using pointwise multiplication while constructing a weighted average of the context vectors for each term in the window (Mnih and Kavukcuoglu, 2013). Results reported using this approach are variable, with some authors reporting worse performance with position-dependent weights on a sentence completion task (Mnih and Kavukcuoglu, 2013), and others reporting improved performance on an analogy completion task.

Skipgram-with-negative-sampling
In the current work, we extend the skipgram-with-negative-sampling (SGNS) algorithm to encode positional information without additional computation. The Skip-gram model (Mikolov et al., 2013a) predicts p(c|w): the probability of observing a context word, c, given an observed word, w. This can be accomplished by moving a sliding window through a large text corpus, such that the observed word is at the center of the window, and the context words surround it. The architecture itself includes two sets of parameters for each unique word: the input weights of the network (w⃗), which represent observed words and are usually retained as semantic vectors after training, and the output weights of the network, which represent context words (c⃗) and are usually discarded.
The probability of observing a word in context is calculated by applying the sigmoid function to the scalar product between the input weights for the observed word and the output weights for the context word, such that p(c|w) = σ(w⃗ · c⃗). As maximizing this probability using the softmax over all possible context words would be computationally inconvenient, SGNS instead draws a small number of negative samples (¬c) from the vocabulary to serve as counterexamples to each observed context term. With D representing observed term/context pairs, and D′ representing randomly constructed counterexamples, the SGNS optimization objective is as follows (Goldberg and Levy, 2014):

arg max Σ_{(w,c) ∈ D} log σ(w⃗ · c⃗) + Σ_{(w,c) ∈ D′} log σ(−w⃗ · c⃗)
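A minimal sketch of one gradient-ascent step on this objective, for a single observed pair and its negative samples, might look as follows. The learning rate, vector scale, and variable names are illustrative assumptions, not the settings used in the experiments below:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c, negs, lr=0.05):
    """One gradient-ascent step on the SGNS objective for a single observed
    (word, context) pair plus its negative samples. Returns updated copies."""
    g = 1.0 - sigmoid(w @ c)          # gradient factor: pull w and c together
    new_c = c + lr * g * w
    grad_w = g * c
    new_negs = []
    for n in negs:
        gn = sigmoid(w @ n)           # push w away from each negative sample
        grad_w -= gn * n
        new_negs.append(n - lr * gn * w)
    return w + lr * grad_w, new_c, new_negs

rng = np.random.default_rng(0)
w = 0.1 * rng.standard_normal(50)
c = 0.1 * rng.standard_normal(50)
negs = [0.1 * rng.standard_normal(50) for _ in range(5)]
p_before = sigmoid(w @ c)
w2, c2, negs2 = sgns_step(w, c, negs)
```

After the step, the estimated probability of the observed context word increases, while the estimated probabilities of the negative samples decrease.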

Embeddings Augmented by Random Permutations (EARP)
Sliding window variants of RI are similar in some respects to the SGNS algorithm, in that training occurs through an online update step in which a vector representation for each context term is added to the semantic vector representation for a focus term (with SGNS this constitutes an update of the input weight vector for the focus term, weighted by the gradient and learning rate; for a derivation see (Rong, 2014)). Unlike SGNS, however, context vectors in RI are immutable and sparse. With SGNS, dense context vectors are altered during the training process, providing an enhanced capacity for inferred similarity. Nonetheless, the technique of permuting context vectors to indicate the position of a context term within a sliding window is readily adaptable to SGNS.

Our approach to encoding the position of context words involves assigning a randomly-generated permutation to each position within a sliding window. For example, with a sliding window of radius 2 (considering two positions to the left and the right of a focus term), we assign a random permutation, Π_p, to each position p ∈ {-2, -1, +1, +2}. Upon observing the term "wyatt" in the context of the term "earp", the model would attempt to maximize p(c|w) = σ(S(wyatt) · Π_+1(C(earp))), with S(wyatt) and C(earp) as semantic and context vectors respectively. The underlying architecture is illustrated in Figure 1, using a simplified model generating 5-dimensional vector representations for a small vocabulary of 10 terms. The permutation Π_+1 can be implemented by "rewiring" the components of the input and output weights, without explicitly generating a permuted copy of the context vector concerned. In this way, p(c|w) is estimated and maximized in place without imposing additional computational or space requirements, beyond those required to store the permutations.
This is accomplished by changing the index values used to access components of C(earp) when the scalar product is calculated, and when weights are updated. The inverse permutation (or reverse rewiring - connecting components 1:3 rather than 3:1) is applied to S(wyatt) when updating C(earp), and this procedure is used with both observed context terms and negative samples.
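The "rewiring" trick can be illustrated with NumPy index arrays: permuting a context vector is just fancy indexing, and applying the inverse permutation to the other operand yields the same scalar product, so no permuted copy is ever materialized. This is a sketch of the idea only, not the Semantic Vectors implementation (which is written in Java):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 5
perm = rng.permutation(dim)   # the permutation for position +1, as an index array
inv = np.argsort(perm)        # its inverse: the "reverse rewiring"

wyatt = rng.standard_normal(dim)  # semantic (input) vector S(wyatt)
earp = rng.standard_normal(dim)   # context (output) vector C(earp)

# Scalar product with the permuted context vector, computed by re-indexing
# rather than by materializing a permuted copy:
score = wyatt @ earp[perm]

# Equivalently, the inverse permutation can be applied to the other operand:
same_score = wyatt[inv] @ earp

# Applying a permutation and then its inverse recovers the original vector:
roundtrip = earp[perm][inv]
```

The equivalence of the two forms is what allows the inverse "rewiring" to be applied to the semantic vector when updating the context vector.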
Within this general framework, we evaluate four word order encoding schemes, implemented by adapting the open source Semantic Vectors package for distributional semantics research. 1

1 https://github.com/semanticvectors/semanticvectors

Directional (EARP dir)

Directional encoding draws a distinction between terms that occur before or after the focus term in a sliding window. As such, only two permutations (and their inverse permutations) are employed: Π_-1 for all preceding terms, and Π_+1 for all subsequent terms. Directional encoding with permutations has been shown to improve performance in synonym test evaluations when applied in the context of RI (Sahlgren et al., 2008).

Positional (EARP pos )
With positional encoding, a different permutation is used to encode each position in the sliding window. As a randomly permuted vector is highly likely to be orthogonal or close-to-orthogonal to the vector from which it originated (Kanerva, 2009), this is likely to result in orthogonal encodings for the same context word in different positions. With RI, positional encoding degraded synonym test performance, but permitted a novel form of distributional query, in which permutation is used to retrieve words that occur in particular positions in relation to one another (Sahlgren et al., 2008). EARP pos facilitates queries of this form also: for example, the nearest neighboring context vector to the permuted semantic vector Π_+1(S(wyatt)) is C(earp) in both static-window subword-agnostic EARP pos spaces used in the experiments that follow.
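A toy sketch of such a positional query follows. The vectors here are hypothetical stand-ins for trained weights: the alignment between earp's context vector and the permuted wyatt vector is constructed by hand to illustrate what training would tend to produce:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 100
vocab = ["wyatt", "earp", "holliday", "tombstone"]
sem = {w: rng.standard_normal(dim) for w in vocab}  # semantic (input) vectors
ctx = {w: rng.standard_normal(dim) for w in vocab}  # context (output) vectors
perm_plus1 = rng.permutation(dim)                   # permutation for position +1

# Stand-in for what training would produce: earp's context vector is aligned
# with the permuted semantic vector for wyatt, plus a little noise.
ctx["earp"] = sem["wyatt"][perm_plus1] + 0.1 * rng.standard_normal(dim)

def cosine(x, y):
    return (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

# Query: which context vector is nearest to the permuted semantic vector
# for "wyatt"? That is, which word tends to occur immediately after it?
query = sem["wyatt"][perm_plus1]
best = max(vocab, key=lambda w: cosine(ctx[w], query))
```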

Proximity-based (EARP prox )
With proximity-based encoding, the positional encoding of a particular context term occurring in the first position in a sliding window will be somewhat similar to the encoding when this term occurs in the second position, and less similar (but still not orthogonal) to the encoding when it occurs in the third. This is accomplished by randomly generating an index permutation Π_+1, randomly reassigning half of its elements to generate Π_+2, and repeating this process iteratively until a permutation for every position in the window is obtained (for the current experiments, we assigned two initial index permutations Π_+1 and Π_-1, proceeding bidirectionally). As a low-dimensional example, if Π_+1 were {4:1, 1:2, 2:3, 3:4}, Π_+2 might be {4:3, 1:2, 2:1, 3:4}. The net result is that the similarity between the position-specific representations of a given context vector reflects the proximity between the positions concerned. While this method is reminiscent of interpolation between randomly generated vectors or matrices to encode character position within words (Cohen et al., 2013) and pixel position within images (Gallant and Culliton, 2016) respectively, the iterative application of permutations for this purpose is a novel approach to such positional binding.
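The iterative derivation of proximity-based permutations might be sketched as follows. Re-shuffling exactly half the entries at each step follows the description above; the mechanics of how those entries are selected and shuffled are our assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 1000

def next_permutation(perm, frac=0.5):
    """Derive the permutation for the next window position by randomly
    re-shuffling a fraction of the previous permutation's entries."""
    perm = perm.copy()
    idx = rng.choice(len(perm), size=int(len(perm) * frac), replace=False)
    perm[idx] = perm[rng.permutation(idx)]  # shuffle the chosen entries among themselves
    return perm

p1 = rng.permutation(dim)   # Pi_+1
p2 = next_permutation(p1)   # Pi_+2: shares roughly half of Pi_+1's entries
p3 = next_permutation(p2)   # Pi_+3: still related to Pi_+1, but less so

overlap_12 = np.mean(p1 == p2)  # fraction of positions mapped identically
overlap_13 = np.mean(p1 == p3)
```

The monotonically decreasing overlap between the derived permutations is what makes the similarity between position-specific encodings track positional proximity.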

Subword embeddings
The use of character n-grams as components of distributional semantic models was introduced by Schütze (1993), and has been shown to improve the performance of neural-probabilistic models on analogical retrieval tasks. It is intuitive that this should be the case, as standard analogy evaluation reference sets include many analogical questions that require mapping from a morphological derivative of one word (e.g. fast:faster) to the same morphological derivative of another (e.g. high:higher).
Consequently, we also generated n-gram based variants of each of our models, by adapting this subword approach to our SGNS configuration. Specifically, we decomposed each word into character n-grams, after introducing characters indicating the start (<) and end (>) of a word. N-grams of between 3 and 6 characters (inclusive) were encoded, and in order to place an upper bound on memory requirements a hash function was used to map each observed n-gram to one of at most two million vectors, without constraints on collisions. The input vector V_i for a word with n included n-grams (including the word itself) was then generated as the average of the corresponding word and n-gram vectors. During training, we used V_i as the input vector for EARP and SGNS, with updates propagating back to component word and n-gram vectors in proportion to their contribution to V_i.
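The n-gram decomposition and hashing step can be sketched as follows. Python's built-in hash() (which is salted per-process for strings) is merely a stand-in for whatever hash function an implementation actually uses, and the two-million-vector cap follows the description above:

```python
def char_ngrams(word, nmin=3, nmax=6):
    """Character n-grams of a word wrapped in start/end boundary markers."""
    w = "<" + word + ">"
    return [w[i:i + n] for n in range(nmin, nmax + 1) for i in range(len(w) - n + 1)]

BUCKETS = 2_000_000  # cap on distinct n-gram vectors; hash collisions are tolerated

def ngram_buckets(word):
    # hash() stands in for the specific hash function used in practice
    return [hash(g) % BUCKETS for g in char_ngrams(word)]
```

For example, "earp" decomposes into n-grams such as "<ea", "ear", "arp>" and (at n = 6) the whole marked word "<earp>", each of which is mapped to one of the two million vector slots.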

Training data
All models were trained on the first January 2018 release of the English Wikipedia, to which we applied the pre-processing script distributed with the fastText package, resulting in a corpus of approximately 8.8 billion words.

Training procedure
All models used 500-dimensional vectors, and were trained for a single iteration across the corpus with five negative samples per context term. For each of the four models (SGNS, EARP dir, EARP pos, EARP prox), embeddings were generated with sliding window radii r of 2 and 5 (in each direction), with and without subword embeddings. For all experiments, we excluded numbers, and terms occurring fewer than 150 times in the corpus. SGNS has a number of hyperparameters that are known to influence performance (Levy et al., 2015). We did not engage in tuning of these hyperparameters to improve performance of our models, but rather were guided by prior research in selecting hyperparameter settings that are known to perform well with the SGNS baseline model. Specifically, we used a subsampling threshold t of 10^-5. In some experiments (EARP and SGNS), we used dynamic sliding windows with uniform probability of a sliding window radius between one and r, in an effort to align our baseline model closely with other SGNS implementations. Stochastic reduction of the sliding window radius will result in distal words being ignored at times. With a dynamic sliding window, subsampled words are replaced by the next word in sequence. This increases the data available for training, but will result in relative position being distorted at times. Consequently, we also generated spaces with static fixed-width sliding windows (EARPx) for position-aware models.
After finding that fastText trained on Wikipedia performed better on analogy tests than prior results obtained with word2vec (Levy et al., 2015), we adopted a number of its hyperparameter settings. We set the probability of negative sampling for each term to f^(1/2), where f is the number of term occurrences divided by the total token count. In addition we used an initial learning rate of .05, and subsampled terms with a probability of 1 - (√(t/f) + t/f). 5

5 While using this formula reliably improves performance on some of the analogy sets, it differs from both the formula described in Mikolov et al. (2013b) and the formula implemented in the canonical word2vec implementation of SGNS - see Levy et al. (2015) for details. It is also difficult to justify on theoretical grounds, as it returns values less than zero for some words that meet the subsampling threshold. Nevertheless, retaining it throughout our experiments seemed more principled than altering the fastText baseline in a manner that impaired its performance.
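For concreteness, the subsampling formula above might be implemented as follows, clamped so that it yields a valid probability (since, as noted, the raw expression can fall below zero for words near the threshold):

```python
import math

def discard_prob(f, t=1e-5):
    """Probability of dropping a token whose relative corpus frequency is f,
    per the subsampling formula above. The raw expression can be negative
    for infrequent words, so it is clamped into [0, 1]."""
    p = 1.0 - (math.sqrt(t / f) + t / f)
    return min(max(p, 0.0), 1.0)
```

Under this formula a very frequent token (say f = 0.01) is discarded on nearly every occurrence, while any token at or below the threshold t is always kept.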

Evaluation
To evaluate the nature and utility of the additional information encoded by permutation-based variants, we utilized a set of analogical retrieval reference sets, including the MSR set (Mikolov et al., 2013c), consisting of 8,000 proportional analogy questions that are morphological in nature (e.g. young:younger::quick:?), and the Google analogy set (Mikolov et al., 2013a), which includes 8,869 semantic (and predominantly geographic, e.g. brussels:belgium::dublin:?) and 10,675 morphologically-oriented "syntactic" questions. We also included the Bigger Analogy Test Set (BATS), a more challenging set of 99,200 proportional analogy questions balanced across 40 linguistic types in four categories: Inflections (e.g. plurals, infinitives), Derivation (e.g. verb+er), Lexicography (e.g. hypernyms, synonyms) and Encyclopedia (e.g. country:capital, male:female). We obtained these sets from the distribution described in Finley et al. (2017), in which only the first correct answer to questions with multiple correct answers in BATS is retained, and used a parallelized implementation of the widely used vector offset method, in which for a given proportional analogy a:b::c:d, all word vectors in the space are rank-ordered in accordance with their cosine similarity to the vector c⃗ + b⃗ - a⃗. We report average accuracy, where a result is considered accurate if d is the top-ranked result aside from a, b and c.

To evaluate the effects of encoding word order on the relative distance between terms, we used a series of widely used reference sets that mediate comparison between human and machine estimates of pairwise similarity and relatedness between term pairs. Specifically, we used Wordsim-353 (Finkelstein et al., 2001), split into subsets emphasizing similarity and relatedness (Agirre et al., 2009); MEN (Bruni et al., 2014); and Simlex-999 (Hill et al., 2015).
For each of these sets, we estimated the Spearman correlation of the cosine similarity between vector representations of the words in a given pair with the human ratings (averaged across raters) of similarity and/or relatedness provided in the reference standards.
Only those examples in which all relevant terms were represented in our vector spaces were considered. Consequently, our analogy test sets consisted of 6,136; 19,420; and 88,108 examples for the MSR, Google and BATS sets respectively. With pairwise similarity, we retained 998; 335; and all 3,000 of the Simlex, Wordsim and MEN examples respectively. These numbers were identical across models, including fastText baselines.
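The vector offset method described above reduces to a few lines of code. The toy vectors below are our own construction, chosen so that the comparative suffix corresponds to a consistent offset along one coordinate:

```python
import numpy as np

def solve_analogy(words, vecs, a, b, c):
    """Rank all vectors by cosine similarity to c + b - a and return the
    top-ranked word other than a, b and c themselves."""
    v = dict(zip(words, vecs))
    target = v[c] + v[b] - v[a]
    sims = {w: (x @ target) / (np.linalg.norm(x) * np.linalg.norm(target))
            for w, x in v.items() if w not in (a, b, c)}
    return max(sims, key=sims.get)

# Toy space: the third coordinate encodes the comparative suffix.
words = ["young", "younger", "quick", "quicker", "slow"]
vecs = [np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 1.0]),
        np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, 1.0]),
        np.array([0.5, 0.5, 0.0])]
answer = solve_analogy(words, vecs, "young", "younger", "quick")
```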
In addition we evaluated the effects of incorporating word order with EARP on three standard sequence labeling tasks: part-of-speech tagging of the Wall Street Journal sections of the Penn Treebank (PTB), and the CoNLL'00 sentence chunking (Tjong Kim Sang and Buchholz, 2000) and CoNLL'03 named entity recognition (Tjong Kim Sang and De Meulder, 2003) shared tasks. As was the case with the pairwise similarity and relatedness evaluations, we conducted these evaluations using the repEval2016 package after converting all vectors to the word2vec binary format. This package provides implementations of the neural NLP architecture developed by Collobert and his colleagues (2011), which uses vectors for words within a five-word window as input, a single hidden layer of 300 units, and an output softmax layer. The implementation provided in repEval2016 deviates by design from the original by fixing word vectors during training in order to emphasize differences between models for the purpose of comparative evaluation, which tends to reduce performance (for further details, see (Chiu et al., 2016)). As spaces constructed with narrower sliding windows generally perform better on these tasks (Chiu et al., 2016), we conducted these experiments with models of window radius 2 only. To facilitate fair comparison, we added random vectors representing tokens available in the fastText-derived spaces only to all spaces, replacing the original vectors where these existed. This was important because only fastText retains vector representations for punctuation marks (these are eliminated by the Semantic Vectors tokenization procedure), resulting in a relatively large number of out-of-vocabulary terms and predictably reduced performance with the Semantic Vectors implementation of the same algorithm. With the random vectors added, out-of-vocabulary rates were equivalent across the two SGNS implementations, resulting in similar performance.

Analogical retrieval
The results of our analogical retrieval experiments are shown in Table 1. With the single exception of the semantic component of the Google set, the best result on every set and subset was obtained by a variant of the EARP prox model, strongly suggesting that (1) information concerning relative position is of value for solving analogical retrieval problems; (2) encoding this information in a flexible manner that preserves the natural relationship of proximity between sliding window positions helps more than encoding only direction, or encoding window positions as disparate "slots".
On the syntactically-oriented subsets (Gsyn, Binf, Bder), adding subword information improves performance of baseline SGNS and EARP models, with subword-sensitive EARPx prox models showing improvements of between ∼6% and ∼21% in accuracy on these subtasks, as compared with the best performing baseline. 10 The results follow the same pattern at both sliding window radii, aside from a larger decrease in performance of EARPx models on the semantic component of the Google set at radius 2, attributable to semantically useful information lost on account of subsampling without replacement. In general, better performance on syntactic subsets is obtained at radius 2 with subword-sensitive models, with semantic subsets showing the opposite trend.
While better performance on the total Google set has been reported with larger training corpora (Pennington et al., 2014), the best EARP results on the syntactic component of this set surpass those reported from order-insensitive models trained on more comprehensive corpora for multiple iterations. With this subset, the best EARP prox model obtained an accuracy of 76.74% after a single training iteration on a ∼9 billion word Wikipedia-derived corpus. Pennington and his colleagues (2014) report a best accuracy of 69.3% after training Glove on a corpus of 42 billion words, and Mikolov and colleagues (2017) report an accuracy of 73% when training a subword-sensitive CBOW model for five iterations across a 630 billion word corpus derived from Common Crawl. The latter performance improved to 82% with pre-processing to tag phrases and position-dependent weighting - modifications that may improve EARP performance also, as would almost certainly be the case with multiple training iterations across a much larger corpus.

10 To assess reproducibility, we repeated the window radius 2 experiments a further 4 times, with different stochastic initialization of network weights. Performance was remarkably consistent, with a standard error of the mean accuracy on the MSR, Google and BATS sets at or below .24% (.0024), .33% (.0033) and .12% (.0012) for all models. All differences in performance from the baseline (only SGNS semVec was repeated) were statistically significant by unpaired t-test, aside from the results of EARP dir and EARP prox on the Google set when no subwords were used.
Regarding the performance of other order-sensitive approaches on this subset, Trask and his colleagues (2015) report a 1.41% to 3.07% increase in absolute accuracy over a standard CBOW baseline with PENN, and Mikolov and his colleagues (2017) report a 4% increase over a subword-sensitive CBOW model with the incorporation of position-dependent weights. 11 By comparison, EARPx prox yields improvements of up to 4.27% over the best baseline when subwords are not considered, and 8.29% with subword-sensitive models (both at radius 5).
With the more challenging BATS analogies, Drozd and his colleagues (2016) report results for models trained on a 6 billion word corpus derived from Wikipedia and several other resources, with best results with the vector offset method for BATS components of 61%, 11.2% (both SGNS), 10.9% and 31.5% (both Glove) for the Inflectional, Derivational, Lexicography and Encyclopedia components respectively. While not strictly comparable, on account of our training on Wikipedia alone and acknowledging only one of a set of possible correct answers in some cases, our best results for these sets were 71.56%, 44.30%, 10.07% and 36.52% respectively, with a fourfold increase in performance on the derivational component. These results further support the hypothesis that order-related information is of value in several classes of proportional analogy problem.

Semantic similarity/relatedness
These improvements in analogy retrieval were not accompanied by better correlation with human estimates of pairwise similarity and relatedness. Other models fell between the extremes observed, with a similar pattern observed with a window radius of 5. The differences in performance observed with these tasks are neither as stark nor as consistent as those with analogy tasks, with EARP performance falling between that of the two baseline models in several cases. Nonetheless, EARP models tend to correlate worse with human estimates of pairwise similarity and relatedness. This drop in performance may relate to how semantic information is dispersed with position-aware models - a neighboring word is encoded differently depending on position, which may obscure the semantically useful information that two other words both occur in proximity to it.

11 CBOW baselines were around 10% higher than our SGNS baselines, attributable to differences in corpus size, composition and preprocessing, and perhaps architecture.

Sequence labeling
Results of these experiments are shown in Table 2, and suggest an advantage for models encoding position in sequence labeling tasks. In particular, for the sentence chunking shared task (CoNLL00), the best result obtained with an order-aware model (EARPx pos + subwords) exceeds the best baseline result by around 2%, with smaller improvements in performance on the other two tasks, including a .43% improvement in accuracy for part-of-speech tagging (PTB, without subwords) that is comparable to the .37% improvement over a SGNS baseline reported on this dataset by Ling and his colleagues (2015) when using separate (rather than shared) position-dependent context weights.

Computational performance
All models were trained on a single multi-core machine. In general, Semantic Vectors takes slightly longer to run than fastText, which takes around an hour to generate models, including building of the dictionary. With Semantic Vectors, models were generated in around 1 hour 20 minutes when the corpus was indexed at document level, which is desirable as this package uses Lucene 12 for tokenization and indexing. As the accuracy of fastText on analogy completion tasks dropped considerably when we attempted to train it on unsegmented documents, we adapted Semantic Vectors to treat each line of input as an individual document. As this approximately doubled the time required to generate each model, we would not recommend this other than for the purpose of comparative evaluation. Adding subword embeddings increased training time by three- to fourfold. Source code is available via GitHub 13 , and trained embeddings are publicly available.

12 https://lucene.apache.org/
13 https://github.com/semanticvectors/semanticvectors

Limitations and future work
While this paper focused on encoding positional information, EARP is generalizable and could be used to encode other sorts of relational information also. An interesting direction for future work may involve using EARP to encode the nature of semantic and dependency relations, as has been done with RI previously (Cohen et al., 2009;Basile et al., 2011). As the main focus of the current paper was on comparative evaluation across models with identical hyper-parameters, we have yet to formally evaluate the extent to which hyperparameter settings (such as dimensionality) may affect performance, and it seems likely that hyperparameters that would further accentuate EARP performance remain to be identified.

Conclusion
This paper describes EARP, a novel method through which word order can be encoded into neural word embedding representations. Of note, this additional information is encoded without the need for additional computation, and space requirements are practically identical to those of baseline models. Upon evaluation, encoding word order results in substantive improvements in performance across multiple analogical retrieval reference sets, with best performance when order information is encoded using a novel permutation-based method of positional binding.