Embedding Syntax and Semantics of Prepositions via Tensor Decomposition

Prepositions are among the most frequent words in English and play complex roles in the syntax and semantics of sentences. Not surprisingly, they pose well-known difficulties in the automatic processing of sentences (prepositional attachment ambiguities and idiosyncratic uses in phrases). Existing methods for preposition representation (e.g., word2vec and GloVe) treat prepositions no differently from content words. In addition, recent studies aiming at solving prepositional attachment and preposition selection problems depend heavily on external linguistic resources and use dataset-specific word representations. In this paper we use word-triple counts (one word of each triple being a preposition) to capture a preposition's interaction with its attachment and complement. We then derive preposition embeddings via tensor decomposition on a large unlabeled corpus. We reveal a new geometry involving Hadamard products and empirically demonstrate its utility in paraphrasing phrasal verbs. Furthermore, our preposition embeddings are used as simple features in two challenging downstream tasks: preposition selection and prepositional attachment disambiguation. We achieve results comparable to or better than the state-of-the-art on multiple standardized datasets.


Introduction
Prepositions are a linguistically closed class comprising some of the most frequent words; they play an important role in the English language since they encode rich syntactic and semantic information. Many preposition-related tasks are challenging in computational linguistics because of their polysemous nature and flexible usage patterns. An accurate understanding and representation of prepositions' linguistic role is key to several important NLP tasks such as grammatical error correction and prepositional phrase attachment. A first-order approach is to represent prepositions as real-valued vectors via word embeddings such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014).
Word embeddings have brought a renaissance in NLP research; they have been very successful in capturing word similarities as well as analogies (both syntactic and semantic) and are now mainstream in nearly all downstream NLP tasks (such as question answering (Chen et al., 2017)). Despite this success, the available literature does not highlight any specific properties of word embeddings of prepositions. Indeed, many of the common prepositions have very similar vector representations, as shown in Table 1 for preposition vectors trained using word2vec and GloVe (the tensor embedding is our proposed representation for prepositions). While this suggests that available representations blur the distinctions between prepositions, one could hypothesize that this is primarily because standard word embedding algorithms treat prepositions no differently from content words such as verbs and nouns, i.e., embeddings are created based on co-occurrences with other words. However, prepositions are very frequent and co-occur with nearly all words, which means that their co-occurrence ought to be treated differently.
Linguistic theory proposes to understand a preposition via its interactions with both the word it attaches to (termed the head) and its complement (Huddleston, 1984; DeCarrico, 2000). This theory naturally suggests that one should count co-occurrences of a given preposition with pairs of neighboring words. One way of achieving this would be by considering a tensor of triples (word_1, word_2, preposition), where we do not restrict word_1 and word_2 to be the head and complement words; instead we model a preposition's interaction with all pairs of neighboring words via a slice of a tensor X, where the slice is populated by word co-occurrences restricted to a context window of the specific preposition. Thus, the tensor dimension is N × N × K where N is the vocabulary size and K is the number of prepositions; since K ≈ 50, we note that N ≫ K.
Using such a representation, we notice that the resulting tensor is low-rank and use it to extract embeddings for both preposition and non-preposition words. In doing so, we use a combination of standard ideas from word representations (such as weighted spectral decomposition as in GloVe (Pennington et al., 2014)) and tensor decompositions (alternating least squares (ALS) methods (Sharan and Valiant, 2017)). We find that the preposition embeddings extracted in this manner are discriminative (see the preposition similarity of the tensor embedding in Table 1). Note that the smaller the cosine similarity is, the more distinct the representations are from each other. We demonstrate that the resulting preposition representation captures the core linguistic properties of prepositions: the attachment and the complement properties. We show this using both intrinsic evaluations and downstream tasks, providing new state-of-the-art results on well-known NLP tasks involving prepositions.

Intrinsic evaluations:
We show that the Hadamard product of the embeddings of a verb and a preposition that together make a phrasal verb closely approximates the representation of this phrasal verb's paraphrase as a single verb. Example: $v_{\text{made}} \odot v_{\text{from}} \approx v_{\text{produced}} \odot v_c$, where ⊙ represents the Hadamard product (i.e., elementwise multiplication) of two vectors and $v_c$ is a constant vector (not associated with a specific word; it is defined later). This approximation validates that prepositional semantics are appropriately encoded into their trained embeddings. We provide a mathematical interpretation for this new geometry while empirically demonstrating the paraphrasing of compositional phrasal verbs.

Extrinsic evaluations:
Our preposition embeddings are used as features for a simple classifier in two well-known challenging downstream NLP classification tasks. In both tasks, we perform as well as or strictly better than the state-of-the-art on multiple standardized datasets.
Preposition selection: While the context in which a preposition occurs governs the choice of the preposition, the specific preposition by itself significantly influences the semantics of the context in which it occurs. Furthermore, the choice of the right preposition for a given context can be very subtle. This idiosyncratic behavior of prepositions is the reason why preposition errors are among the most frequent error types made by second-language English speakers (Leacock et al., 2010). We demonstrate the utility of the preposition embeddings in the preposition selection task, which is to choose the correct preposition for a given sentence. We show this for a large set of contexts: 7,000 combined instances from the CoNLL-2013 and the SE datasets (Prokofyev et al., 2014). Our approach achieves 6% and 2% absolute improvement over the previous state-of-the-art results on the respective datasets.

Prepositional phrase attachment disambiguation:
Prepositional phrase attachment is a common cause of structural ambiguity in natural language. In the sentence "Pierre Vinken joined the board as a voting member", the prepositional phrase "as a voting member" can attach to either "joined" (the VP) or "the board" (the NP); in this case the VP attachment is correct. Despite being extensively studied over decades, prepositional attachment continues to be a major source of syntactic parsing errors (Brill and Resnik, 1994; Kummerfeld et al., 2012; de Kok and Hinrichs, 2016). We use our prepositional representations as simple features for a standard classifier on this task. Our approach, tested on a widely studied standard dataset (Belinkov et al., 2015), achieves 89% accuracy and compares favorably with the state-of-the-art. It is noteworthy that while the state-of-the-art results are obtained with significant linguistic resources, including syntactic parsers and WordNet, our approach achieves comparable performance without relying on such resources.
We emphasize two aspects of our contributions: (1) Word representations trained via pairwise word counts have previously been shown to capture much of the benefit of unlabeled sentence data; for example, Sharan and Valiant (2017) report that their word representations via word-triple counts are better than others, but still significantly worse than regular word2vec representations. One of our main observations is that considering word-triple counts makes the most (linguistic) sense when one of the words is a preposition. Furthermore, the sparsity of the corresponding tensor is no worse than the sparsity of the regular word co-occurrence matrix (since prepositions are so frequent and co-occur with essentially every word). Taken together, these two points strongly suggest the benefits of tensor representations in the context of prepositions.
(2) The word and preposition representations obtained via tensor decomposition are used as simple features for a standard classifier. In particular, we do not use dependency parsing (which many prior methods have relied on) or handcrafted features (Prokofyev et al., 2014), nor do we train task-specific representations on the annotated training dataset (Belinkov et al., 2015). The simplicity of our approach, combined with the strong empirical results, lends credence to the strength of the prepositional representations found via tensor decompositions.

Method
We begin with a description of how the tensor of (word, word, preposition) triples is formed and empirically show that its slices are low-rank. Next, we derive low-dimensional vector representations for words and prepositions via appropriate tensor decomposition methods.
Tensor creation: Suppose that K prepositions are in the preposition set $P = \{p_1, \ldots, p_K\}$; here K is 49 in our preposition selection task, and 76 in the attachment disambiguation task. We limited the number of prepositions to what was needed in the dataset. The vocabulary, the set of all words excluding the prepositions, contains N words, $V = \{w_1, \ldots, w_N\}$, with N ≈ 1M. We generate a third-order tensor $X \in \mathbb{R}^{N \times N \times (K+1)}$ from the WikiCorpus (Al-Rfou et al., 2013) as follows. We say two words co-occur if they appear within a distance t of each other in a sentence. For 1 ≤ k ≤ K, the entry $X_{ijk}$ is the number of occurrences where word $w_i$ co-occurs with preposition $p_k$, and $w_j$ also co-occurs with preposition $p_k$ in the same sentence, counted across all sentences in the WikiCorpus. For 1 ≤ k ≤ K, X[:, :, k] is the matrix of counts of the word pairs that co-occur with preposition $p_k$, and we call such a matrix a slice.
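The counting rule above can be sketched on a toy corpus. This is a minimal illustration, not the authors' pipeline: the function name `preposition_slices` is hypothetical, counts are kept in a sparse dictionary keyed by (word_i, word_j, preposition), ordered pairs are counted in both directions, each preposition occurrence contributes separately, and the extra (K+1)-th slice is omitted.

```python
from collections import Counter

def preposition_slices(sentences, prepositions, t=3):
    """Count (word_i, word_j, preposition) triples where both words lie
    within t tokens of the same preposition occurrence (assumed reading)."""
    counts = Counter()
    for tokens in sentences:
        for pos, tok in enumerate(tokens):
            if tok not in prepositions:
                continue
            lo, hi = max(0, pos - t), min(len(tokens), pos + t + 1)
            # Non-preposition words within distance t of this preposition.
            context = [w for idx, w in enumerate(tokens[lo:hi], start=lo)
                       if idx != pos and w not in prepositions]
            # Every ordered pair of distinct context positions is one count.
            for a, wi in enumerate(context):
                for b, wj in enumerate(context):
                    if a != b:
                        counts[(wi, wj, tok)] += 1
    return counts

# Toy corpus (hypothetical data).
sentences = [["she", "relies", "on", "her", "team"],
             ["he", "sat", "on", "the", "chair"]]
X = preposition_slices(sentences, {"on", "in", "of"})
```

A dictionary of non-zero counts is a natural fit here because the tensor is described as very sparse; a real implementation would map words and prepositions to integer indices before decomposition.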
Here we use a window of size t = 3. While prepositions co-occur with many words, there are also a number of other words which do not occur in the context of any preposition. In order to make maximal use of the data, we add an extra slice X[:, :, K + 1], where the entry $X_{ij(K+1)}$ is the number of occurrences where $w_i$ co-occurs with $w_j$ (within distance 2t = 6) but at least one of them is not within a distance of t of any preposition. Note that the preposition window of 3 is smaller than the word window of 6, since it is known that the interaction between prepositions and neighboring words usually weakens more sharply with distance when compared to that of content words (Hassani and Lee, 2017).
Empirical properties of X: We find that the tensor X is very sparse: only 1% of the tensor elements are non-zero. Furthermore, log(1 + X[:, :, k]) is low-rank (here the logarithm is applied component-wise to every entry of the tensor slice). To see this, we choose slices corresponding to the prepositions "about", "before", "for", "in" and "of", and plot their normalized singular values in Figure 1. We see that the singular values decay dramatically, suggesting a low-rank structure in each slice.
Tensor decomposition: We combine standard ideas from word embedding algorithms and tensor decomposition algorithms to arrive at a low-rank approximation to the tensor log(1 + X). In particular, we consider two separate methods:
1. Alternating Least Squares (ALS). A generic method to decompose a tensor into its modes is via the CANDECOMP/PARAFAC (CP) decomposition (Kolda and Bader, 2009). The tensor log(1 + X) is decomposed into three modes, $U \in \mathbb{R}^{d \times N}$, $W \in \mathbb{R}^{d \times N}$ and $Q \in \mathbb{R}^{d \times (K+1)}$, based on the solutions to the optimization problem (1). Here $u_i$, $w_i$ and $q_i$ are the i-th columns of U, W and Q, respectively.
$$\min_{U, W, Q} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{K+1} \left( \langle u_i, w_j, q_k \rangle - \log(1 + X_{ijk}) \right)^2 \qquad (1)$$
where $\langle a, b, c \rangle = \mathbf{1}^{\top}(a \odot b \odot c)$ is the inner product of three vectors a, b and c. Here $\mathbf{1}$ is the column vector of all ones and ⊙ refers to the Hadamard product. We can interpret the columns of U as the word representations and the columns of Q as the preposition representations, each of dimension d (equal to 200 in this paper). There are several algorithmic solutions to this optimization problem in the literature, most of which are based on alternating least squares methods (Kolda and Bader, 2009; Comon et al., 2009; Anandkumar et al., 2014), and we employ a recent one named Orth-ALS (Sharan and Valiant, 2017) in this paper. Orth-ALS periodically orthogonalizes the decomposed components while fixing two modes and updating the remaining one. It is supported by theoretical guarantees and empirically outperforms standard ALS methods in different applications.
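To make the three-way inner product and the CP model concrete, the following NumPy sketch evaluates both on a small dense toy. It is an illustration under stated assumptions only: the helper names are hypothetical, the factors are random, and it computes the squared-error fit to log(1 + X) rather than running Orth-ALS.

```python
import numpy as np

def triple_inner(a, b, c):
    """<a, b, c> = 1^T (a * b * c): sum of the elementwise triple product."""
    return float(np.sum(a * b * c))

def cp_reconstruct(U, W, Q):
    """Entry (i, j, k) is <u_i, w_j, q_k>; columns of U, W, Q are embeddings."""
    return np.einsum("di,dj,dk->ijk", U, W, Q)

def cp_loss(U, W, Q, X):
    """Squared error between the CP model and log(1 + X), entry by entry."""
    return float(np.sum((cp_reconstruct(U, W, Q) - np.log1p(X)) ** 2))

# Toy sizes (hypothetical): d = 4, N = 5 words, K + 1 = 3 slices.
rng = np.random.default_rng(0)
d, N, K1 = 4, 5, 3
U = rng.normal(size=(d, N))
W = rng.normal(size=(d, N))
Q = rng.normal(size=(d, K1))
T = cp_reconstruct(U, W, Q)
```

An ALS-style solver would repeatedly hold two of the three factor matrices fixed and solve a linear least-squares problem for the third; at the vocabulary sizes reported in the text the tensor would have to stay sparse rather than dense as in this toy.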
2. Weighted Decomposition (WD). Based on ideas from the literature on word embedding algorithms, we also consider weighting different elements of the tensor differently in order to reduce the effect of the large dynamic range of the tensor values. Specifically, we apply the GloVe objective function to our tensor model and minimize the objective function (2):
$$\min_{U, W, Q, b} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{K+1} \omega_{ijk} \left( \langle u_i, w_j, q_k \rangle + b_{Ui} + b_{Wj} + b_{Qk} - \log(1 + X_{ijk}) \right)^2 \qquad (2)$$
where $b_{Ui}$ is the scalar bias for word i in the matrix U. Similarly, $b_{Wj}$ is the bias for word j in the matrix W, and $b_{Qk}$ is the bias for preposition k in the matrix Q. Bias terms are learned so as to minimize the loss function. Here $\omega_{ijk}$ is the weight assigned to each tensor element $X_{ijk}$, and we use the weighting proposed by GloVe:
$$\omega_{ijk} = \min\left( \left( X_{ijk} / x_{\max} \right)^{\alpha}, 1 \right)$$
We set the hyperparameters to $x_{\max} = 10$ and α = 0.75 in this work. We solve this optimization problem via standard gradient descent, arriving at word representations U and preposition representations Q.
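The weighting scheme can be sketched as follows. The weight function is the standard GloVe one with the hyperparameters quoted above; the loss function `weighted_loss` is a hypothetical dense illustration of a GloVe-style objective with per-mode biases and a log(1 + X) target, not the authors' training code.

```python
import numpy as np

def glove_weight(x, x_max=10.0, alpha=0.75):
    """GloVe weighting: (x / x_max)^alpha below x_max, capped at 1 above."""
    x = np.asarray(x, dtype=float)
    return np.minimum((x / x_max) ** alpha, 1.0)

def weighted_loss(U, W, Q, bU, bW, bQ, X, x_max=10.0, alpha=0.75):
    """Weighted squared error of <u_i, w_j, q_k> + biases against log(1 + X)."""
    model = np.einsum("di,dj,dk->ijk", U, W, Q)
    model = model + bU[:, None, None] + bW[None, :, None] + bQ[None, None, :]
    residual = model - np.log1p(X)
    return float(np.sum(glove_weight(X, x_max, alpha) * residual ** 2))

# Toy example (hypothetical sizes and random factors).
rng = np.random.default_rng(1)
d, N, K1 = 3, 4, 2
U, W = rng.normal(size=(d, N)), rng.normal(size=(d, N))
Q = rng.normal(size=(d, K1))
bU, bW, bQ = np.zeros(N), np.zeros(N), np.zeros(K1)
X = rng.integers(0, 20, size=(N, N, K1)).astype(float)
loss = weighted_loss(U, W, Q, bU, bW, bQ, X)
```

Because the weight is zero wherever the count is zero, empty tensor entries drop out of the sum, so in practice only the non-zero entries need to be visited.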

Geometry of Phrasal Verbs
Representation interpretation. Suppose that we have a phrase $(h, p_i, c)$ where h, $p_i$ and c are the head word, the i-th preposition (i ≤ K) and the complement, respectively. The inner product of the vectors of h, $p_i$ and c reflects how frequently h and c co-occur in the context of $p_i$. It also reflects how cohesive the triple is.
Recall that there is an extra (K + 1)-th slice that describes the word co-occurrences outside the preposition window, which covers cases such as the verb phrase (v, c) where v and c are the verb and its complement without a preposition in their shared context. Now consider a phrasal verb such as "sparked off" and a verb phrase with head "prompted". For any complement word c that fits these two phrases (the phrasal verb having h as its head verb and $p_i$ as its preposition, and the verb phrase having v as its head), we can expect that
$$\langle u_h, q_i, w_c \rangle \approx \langle u_v, q_{K+1}, w_c \rangle.$$
In other words, $u_h \odot q_i \approx u_v \odot q_{K+1}$, where $a \odot b$ denotes the pointwise multiplication (Hadamard product) of vectors a and b. This suggests that: (1) the vector $q_{K+1}$ is a constant vector for all (v, c) pairs, and (2) we could paraphrase the verb phrase $(h, p_i)$ by finding a verb v such that $u_v \odot q_{K+1}$ is closest to $u_h \odot q_i$:
$$\text{paraphrase} = \arg\min_{v} \left\| u_v \odot q_{K+1} - u_h \odot q_i \right\|^2.$$
This shows that well-trained embeddings are able to capture the relation between phrasal verbs and their equivalent single-verb forms.
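The paraphrasing rule suggested by this geometry can be sketched as a nearest-neighbor search over candidate verbs. This is a minimal illustration assuming Euclidean distance between Hadamard products; the function name, the vectors and the tiny verb list are all hypothetical toy values, not trained embeddings.

```python
import numpy as np

def paraphrase(u_h, q_i, verb_vectors, q_const):
    """Return the verb v whose u_v * q_const is closest to u_h * q_i."""
    target = u_h * q_i  # Hadamard product of head verb and preposition
    return min(verb_vectors,
               key=lambda v: np.linalg.norm(target - verb_vectors[v] * q_const))

# Toy vectors (hypothetical): q_const plays the role of q_{K+1}.
u_made = np.array([1.0, 2.0, 3.0])
q_from = np.array([2.0, 1.0, 0.5])
q_const = np.array([1.0, 2.0, 0.5])
verbs = {
    "produced": np.array([2.0, 1.0, 3.0]),   # chosen so the match is exact
    "slept": np.array([-1.0, 0.0, 1.0]),
}
best = paraphrase(u_made, q_from, verbs, q_const)
```

With trained embeddings the match is only approximate, so one would typically inspect the top few candidates rather than the single nearest verb.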
In Table 2, we list paraphrases of some verb phrases, which are generated from the weighted tensor decomposition. As can be seen, the tensor embedding gives reasonable paraphrases, which validates that the trained embedding is interpretable in terms of lexical semantics.
In the next two sections, we evaluate the proposed tensor-based preposition embeddings in the context of two important downstream NLP tasks: preposition selection and prepositional attachment disambiguation. In this work, we use the English WikiCorpus (around 9 GB) as the training corpus for the different sets of embeddings. We train tensor embeddings with both Orth-ALS and weighted decomposition. The implementation of Orth-ALS is built upon the SPLATT toolkit (Smith and Karypis, 2016). We perform orthogonalization in the first 5 iterations of the Orth-ALS decomposition, and training is completed when its performance stabilizes. As for the weighted decomposition, we train for 20 iterations, and its hyperparameters are set to $x_{\max} = 10$ and α = 0.75.
We also include two baselines for comparison: word2vec's CBOW model and GloVe. We set 20 training iterations for both models. The hyperparameters in word2vec are set as: window size = 6, negative sampling = 25 and down-sampling = 1e-4. The hyperparameters in GloVe are set as: window size = 6, $x_{\max}$ = 10, α = 0.75 and minimum word count = 5. We note that all the representations in this study (word2vec, GloVe and our tensor embedding) are of dimension 200.

Downstream Application: Preposition Selection
Grammatical error detection and correction constitute important tasks in NLP. Among grammatical errors, prepositional errors constitute about 13% of all errors, ranking second among the most common error types (Leacock et al., 2010). This is due to the fact that prepositions are highly polysemous and have idiosyncratic usage. Selecting a preposition depends on how well we can capture the interaction between a preposition and its context. Hence we choose this task to evaluate how well the lexical interactions are captured by different methods.
Task. Given a sentence containing a preposition, we either replace the preposition with the correct one or retain it. For example, in the sentence "It can save the effort to carrying a lot of cards," "to" should be corrected to "of." Formally, there is a closed set of preposition candidates $P = \{p_1, \ldots, p_m\}$. A preposition p is used in a sentence s consisting of words $s = \{\ldots, w_{-2}, w_{-1}, p, w_1, w_2, \ldots\}$. If p is used incorrectly, we need to replace it by another preposition $\hat{p} \in P$ based on the context.
Dataset. For training, we use the data from the Cambridge First Certificate in English (FCE) exam, as used by the state-of-the-art on preposition error correction (Prokofyev et al., 2014). As test data, we use two datasets: the CoNLL-2013 and the Stack Exchange (SE) datasets. The CoNLL dataset on preposition error correction was published by the CoNLL 2013 shared task (Ng et al., 2014), collected from 50 essays written by 25 non-native English learners at a university. The SE dataset consists of texts generated by non-native speakers on the Stack Exchange website. Detailed statistics are shown in Table 3. We focus on the most frequent 49 prepositions listed in Appendix A.
Evaluation metric. Three metrics (precision, recall and F1 score) are used to evaluate the preposition selection performance.
Our algorithm. We first preprocess the dataset by removing articles, determiners and pronouns, and take a context window of 3. We divide the task into two steps: error detection and error correction. First, we decide whether a preposition is used correctly in the context. If not, we suggest another preposition as a replacement in the second step. The detection step uses only three features: the cosine similarity between the current preposition embedding and the average context embedding, the rank of the preposition in terms of this cosine similarity, and the probability that this preposition is not changed in the training corpus. We build a decision tree classifier with these three features and find that we can identify errors with 98% F1 score on the CoNLL dataset and 96% on the SE dataset.
For the error correction part, we focus only on the errors detected in the first stage. Suppose that the original preposition is q, and the candidate preposition is p with the embedding $v_p$. The word vectors in the left context window are averaged as the left context embedding $v_\ell$, and the right vectors are averaged to give the right context embedding $v_r$. We have the following features:
1. Embedding features: $v_\ell$, $v_p$ and $v_r$;
2. Pair similarity between the preposition and the context: the maximum of the similarities of the preposition to the left and the right context, i.e., $\text{pair sim} = \max\left( \frac{v_\ell^{\top} v_p}{\|v_\ell\|_2 \|v_p\|_2}, \frac{v_r^{\top} v_p}{\|v_r\|_2 \|v_p\|_2} \right)$;
3. Triple similarity: $\text{triple sim} = \frac{\langle v_\ell, v_p, v_r \rangle}{\|v_\ell\|_3 \cdot \|v_p\|_3 \cdot \|v_r\|_3}$;
4. Confusion probability: the probability that q is replaced by p in the training data.
A two-layer feed-forward neural network (FNN) with hidden layer sizes of 500 and 10 is trained with these features to score prepositions in each sentence. The preposition with the highest score is the suggested edit.
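The correction features described above can be assembled as in the following sketch. It assumes cosine similarity for the pair feature and normalization by 3-norms for the triple feature; the function names and the toy inputs are hypothetical, and the resulting vector would be the input to the scoring network.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def pair_similarity(v_l, v_p, v_r):
    """Larger of the preposition's cosine similarity to left and right context."""
    return max(cosine(v_l, v_p), cosine(v_r, v_p))

def triple_similarity(v_l, v_p, v_r):
    """Three-way inner product normalized by the product of 3-norms."""
    num = np.sum(v_l * v_p * v_r)
    den = (np.linalg.norm(v_l, 3) * np.linalg.norm(v_p, 3)
           * np.linalg.norm(v_r, 3))
    return float(num / den)

def correction_features(left_vecs, v_p, right_vecs, confusion_prob):
    """Concatenate the embedding, similarity and confusion features."""
    v_l = np.mean(left_vecs, axis=0)   # average left-context embedding
    v_r = np.mean(right_vecs, axis=0)  # average right-context embedding
    sims = [pair_similarity(v_l, v_p, v_r),
            triple_similarity(v_l, v_p, v_r),
            confusion_prob]
    return np.concatenate([v_l, v_p, v_r, sims])

# Toy inputs (hypothetical): 3 context words per side, dimension 4.
v = np.array([1.0, 2.0, 3.0])
feats = correction_features(np.ones((3, 4)), np.arange(1.0, 5.0),
                            np.ones((3, 4)), confusion_prob=0.2)
```

By the generalized Hölder inequality the triple similarity lies in [-1, 1], which puts it on the same scale as the cosine-based pair similarity.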
Baseline. The state-of-the-art on preposition selection uses n-gram statistics from a large corpus (Prokofyev et al., 2014). Features such as point-wise mutual information (PMI) and part-of-speech tags are fed into a supervised scoring system. Given a sentence with a preposition to either replace or retain, the preposition with the highest score is chosen.
The performance of the baseline is affected by both the system architecture and the features. To evaluate the benefits brought about by our tensor embedding-based features, we also consider other baselines which have the same two-step architecture whereas the features are generated from word2vec and GloVe embeddings. These baselines allow us to compare the representation power independent of the classifier.
Result. We compare our proposed embedding-based method against the baselines in Table 4. We note that the proposed tensor embeddings achieve the best performance among all approaches. In particular, the tensor with weighted decomposition has the highest F1 score on the CoNLL dataset, a 6% improvement over the state-of-the-art. However, the tensor with ALS decomposition performs the best on the SE dataset, achieving a 2% improvement over the state-of-the-art. We also note that with the same architecture, tensor embeddings perform much better than word2vec and GloVe embeddings on both datasets. This validates the representation power of tensor embeddings of prepositions.
To get a deeper insight into the importance of the features in the preposition selection task, we also performed an ablation analysis of the tensor method with weighted decomposition, as shown in Table 5. We find that the left context is the most important feature for the CoNLL dataset, whereas the confusion score is the most important for the SE dataset. Pair similarity and triple similarity are less important when compared with the other features. This is because the neural network was able to learn the lexical similarity from the embedding features, thus reducing the importance of the similarity features.
Discussion. We now analyze different cases where our approach selects the wrong preposition. (1) Limited context window. We focus on the local context within a preposition's window. In some cases, we find that head words might be out of the context window. An instance of this is found in the sentence "prevent more of this kind of tragedy to happening", where "to" should be corrected to "from". Given the context window of 3, we cannot get the lexical clues provided by "prevent", which leads to the selection error. (2) Preposition selection requires more context. Even when the context window contains all the words on which the preposition depends, it still may not be sufficient to select the right one. For example, in the sentence "it is controlled by some men in a bad purpose", our approach replaces the preposition "in" with the preposition "on" given the high frequency of the phrase "on purpose". The correct preposition should be "for" based on the whole sentence.

Downstream Application: Prepositional Attachment
In this section, we discuss the task of prepositional phrase (PP) attachment disambiguation, a well-studied but hard task in syntactic parsing. PP attachment disambiguation inherently requires an accurate description of the interactions among the head, the preposition and the complement, which makes it an ideal task to evaluate our tensor-based embeddings.
Task. The English dataset used in this work was collected from a linguistic treebank by Belinkov et al. (2015). It provides 35,359 training and 1,951 test instances. Each instance consists of several head candidates, a preposition and a complement word. The task is to pick the head to which the preposition attaches. In the example "he saw an elephant with long tusks", the words "saw" and "elephant" are the candidate head words.
Our algorithm. Let $v_h$, $v_p$ and $v_c$ be embeddings for the head candidate h, preposition p and child c respectively. We then use the following features: 5. Part-of-speech (POS) tags of the candidates and their next words; 6. Distance between h and p.
We use a basic neural network, a two-layer feed-forward network (FNN) with hidden layers of size 1000 and 20, to take the input features and predict the probability that a candidate is the head. The candidate with the highest likelihood is chosen as the head.
Baselines. For comparison, we include the following state-of-the-art approaches in prepositional attachment disambiguation. The linguistic resources they used to enrich their features are listed in Table 6.
(1) Head-Prep-Child-Dist (HPCD) Model (Belinkov et al., 2015): this compositional neural network is used to train task-specific representations of prepositions.
(2) Low-Rank Feature Representation (LRFR) (Yu et al., 2016): this method incorporates word parts, contexts and labels into a tensor, and uses decomposed vectors as features for disambiguation.
Similar to the experiments in the preposition selection task (see Section 4), we also include baselines which have the same feed-forward network architecture but generate features with vectors trained by word2vec and GloVe. They are denoted as FNN with different initializations in Table 6. Since the attachment disambiguation is a selection task, accuracy is a natural evaluation metric.
Result. We compare the results of the different approaches and the linguistic resources used in Table 6, where we see that our simple classifier built on the tensor representation is comparable in performance to the state-of-the-art (within 1% of the result). This result is notable considering that prior competitive approaches have used significant linguistic resources such as VerbNet and WordNet, whereas we use none. With the same feed-forward neural network as the classifier, our tensor-based approaches (both ALS and WD) achieve better performance than word2vec and GloVe.
An ablation analysis provided in Table 7 shows that the head vector feature affects the performance the most (indicating that heads interact more closely with prepositions), and the POS tag feature comes second. The similarity features appear less important since the classifier has access to the lexical relatedness via the embedding features. Prior works have reported the importance of the distance feature since 81.7% of sentences take the word closest to the preposition as the head. In our experiments, the distance feature was found to be less important compared to the embedding features.
Discussion. We found that one source of attachment disambiguation error is the lack of a broader context in our features. A broader context is critical in examples such as "worked" and "system," which are head candidates of "for trades" in the sentence "worked on a system for trading". They are reasonable heads in the expressions "worked for trades" and "system for trades", and further disambiguation requires a context larger than what we considered.

Related Work
Word representation. Word embeddings have been successfully used in many NLP applications. Word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014) show that embeddings can capture lexical semantics very well. Zhang et al. (2014) studied embeddings which can generalize different similarity perspectives when combined with corresponding linear transformations. Unlike other words, the crucial syntactic roles of prepositions in addition to their rich semantic meanings have been highlighted in prior works (Hovy et al., 2010;Schneider et al., 2015). Nevertheless, word representations specifically focused on prepositions are not available and to the best of our knowledge, ours is the first work exploring this intriguing direction.
Tensor Decomposition. Common tensor decomposition methods include alternating least squares (ALS) (Kolda and Bader, 2009), Simultaneous Diagonalization (SD) (Kuleshov et al., 2015) and optimization-based methods (Liu and Nocedal, 1989; Moré, 1978). Orthogonalized Alternating Least Squares (Orth-ALS) adds the step of component orthogonalization to each update step in the ALS method (Sharan and Valiant, 2017). Owing to its theoretical guarantees and, more relevantly, its good empirical performance, Orth-ALS is the algorithm of choice in this paper.
Preposition Selection. Preposition selection, an important area of study in computational linguistics, is also a very practical topic in the context of grammar correction and second language learning. Prior works have used hand-crafted heuristic rules (Xiang et al., 2013), n-gram features (Prokofyev et al., 2014; Rozovskaya et al., 2013), and POS tags and dependency relations to enrich other features (Kao et al., 2013), all toward addressing preposition error correction.
Prepositional Attachment Disambiguation. There is a long literature on prepositional attachment disambiguation, which has long been recognized as an important part of syntactic parsing (Kiperwasser and Goldberg, 2016). Recent works based on word embeddings have pushed the boundary of state-of-the-art empirical results. A seminal work in this direction is the Head-Prep-Child-Dist Model, which trained embeddings in a compositional network to maximize the accuracy of head prediction (Belinkov et al., 2015). The performance has been further improved in conjunction with semantic and syntactic features. A recent work has proposed an initialization with semantics-enriched GloVe embeddings, and retrained representations with LSTM-RNNs (Dasigi et al., 2017). Another recent work has used tensor decompositions to capture the relation between word representations and their labels (Yu et al., 2016).

Conclusion
Co-occurrence counts of word pairs in sentences and the resulting word vector representations (embeddings) have revolutionized NLP research. A natural generalization is to consider co-occurrence counts of word triples, resulting in a third-order tensor. Partly due to the size of the tensor (a vocabulary of 1M leads to a tensor with 10^18 entries!) and partly due to the extreme dynamic range of entries (including sparsity), word vector representations via tensor decompositions have largely been inferior to their lower-order cousins (i.e., regular word embeddings).
In this work, we trek this well-trodden but arduous terrain by restricting word triples to the scenario where one of the words is a preposition. This is linguistically justified, since prepositions are understood to model interactions between pairs of words. Numerically, this is also very well justified since the sparsity and dynamic range of the resulting tensor are no worse than those of the original matrix of pairwise co-occurrence counts; this is because prepositions are very frequent and co-occur with essentially every word in the vocabulary.
Our intrinsic evaluations and new state-of-the-art results in downstream evaluations lend strong credence to the tensor-based approach to prepositional representation. We expect our vector representations of prepositions to be widely used in more complicated downstream NLP tasks where the prepositional role is crucial, including "text to programs" (Guu et al., 2017).