Modelling Lexical Ambiguity with Density Matrices

Words can have multiple senses. Compositional distributional models of meaning have been argued to deal well with finer shades of meaning variation known as polysemy, but are not so well equipped to handle word senses that are etymologically unrelated, or homonymy. Moving from vectors to density matrices allows us to encode a probability distribution over different senses of a word, and can also be accommodated within a compositional distributional model of meaning. In this paper we present three new neural models for learning density matrices from a corpus, and test their ability to discriminate between word senses on a range of compositional datasets. When paired with a particular composition method, our best model outperforms existing vector-based compositional models as well as strong sentence encoders.


Introduction
An integral part of natural language understanding is the ability to handle lexical ambiguity. Words can have multiple meanings, and the precise meaning of a word only becomes clear when we see it in use: the surrounding context disambiguates it. Word sense disambiguation (WSD) is said to be an 'AI-complete' problem (Navigli, 2009), that is, a problem that is at least as hard as any other problem in AI, and as such has been the subject of extensive research. Standard approaches treat WSD as a classification problem: given a word in context, the task is to classify it into one of a range of possible senses (Lesk, 1986; Schütze, 1998; Navigli, 2009). A more difficult task is to disambiguate every word in a sentence (Chaplot and Salakhutdinov, 2018). A summary of the state of the art is given in (Raganato et al., 2017). More recently, neural approaches (Hadiwinoto et al., 2019; Huang et al., 2019) use contextualised embeddings as input to WSD systems, together with knowledge from WordNet. Other neural approaches generate multiple sense vectors per word (Neelakantan et al., 2014; Cheng and Kartsaklis, 2015) or vectors representing a context (Melamud et al., 2016).
Disambiguation can be costly. Each word should be disambiguated with respect to the correct senses of the other words in the sentence, meaning that the computational complexity of the task can become problematic (Chaplot and Salakhutdinov, 2018). Within a compositional framework, the idea is for words to disambiguate automatically in the process of composition (Kintsch, 2001; Mitchell and Lapata, 2008; Baroni et al., 2014; Boleda, 2020).
Within purely vector-based models, the amount of ambiguity that a word vector can represent is limited. Baroni et al. (2014) argue that distributional vectors work well for polysemy, but less so for homonymy. Piedeleu et al. (2015) extend the vector-based model of meaning to encompass homonymy by using the notion of a density matrix. Density matrices can encode a probability distribution over possible meanings of a word in a single representation. They can also be accommodated within a compositional framework, allowing the ambiguity encoded in the matrix to be resolved via composition.
We use density matrices within a compositional distributional framework to model word and sentence meaning. We propose three new models for building density matrices, based on neural word embedding models. We survey several composition methods for density matrices and evaluate how well our density matrices encode ambiguity and to what extent the composition methods achieve disambiguation, on four disambiguation datasets that test disambiguation in a compositional setting. One of our models (multi-sense Word2DM) emerges as the best model overall. When paired with a particular composition method (Phaser), multi-sense Word2DM outperforms all other models (including existing baselines and high-performing sentence encoders) on most of the disambiguation tasks.

Background
Compositional distributional models come in a range of flavours. Mitchell and Lapata (2008) use simple element-wise operations on vectors. More recently, neural models of composition (Socher et al., 2012; Bowman et al., 2015) and large networks such as BERT (Devlin et al., 2019) have been extremely successful. A third flavour is the type-logical, tensor-based models of composition (Baroni and Zamparelli, 2010; Coecke et al., 2010; Paperno et al., 2014; Sadrzadeh and Muskens, 2018). The tensor-based model of composition works as follows. We choose a vector space N for nouns, and another S for sentences, and represent relational words as multilinear maps over these spaces. Intransitive verbs are represented as linear maps N → S, i.e. matrices in N ⊗ S. Transitive verbs are represented as maps from two copies of N to S, i.e. order-3 tensors or 'cubes' of parameters in N ⊗ S ⊗ N. Composition is performed via tensor contraction, an extension of matrix multiplication. Matrices and tensors require many parameters. To alleviate this problem, Grefenstette and Sadrzadeh (2011a,b) and Kartsaklis et al. (2012) develop ways of building matrices and tensors from word vectors, some of which are described in section 4.1.1.
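To make tensor contraction concrete, here is a minimal numpy sketch of the tensor-based composition just described; the dimensions and random tensors are purely illustrative and not taken from the paper.

```python
import numpy as np

# Illustrative dimensions: noun space N and sentence space S.
N, S = 4, 3

dog = np.random.rand(N)              # noun vector in N
sleeps = np.random.rand(N, S)        # intransitive verb: linear map N -> S, i.e. a matrix in N (x) S
chases = np.random.rand(N, S, N)     # transitive verb: order-3 tensor in N (x) S (x) N

# Composition is tensor contraction, a generalisation of matrix multiplication.
dog_sleeps = np.einsum('n,ns->s', dog, sleeps)              # sentence vector in S
cat = np.random.rand(N)
dog_chases_cat = np.einsum('n,nsm,m->s', dog, chases, cat)  # sentence vector in S

print(dog_sleeps.shape, dog_chases_cat.shape)  # (3,) (3,)
```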
We use an extension of the tensor-based approach, based on methods given in Piedeleu et al. (2015); Bankova et al. (2018). Nouns and sentences are represented as density matrices and relational words (adjectives, verbs, etc.) are represented as completely positive maps, which take density matrices to density matrices.
Representing words with density matrices

A density matrix over R^n is a matrix of the form

ρ = Σ_i p_i v_i v_i^⊤,    (1)

where {p_i}_i are the probabilities assigned to the vectors {v_i}_i. Density matrices over R^n are:

1. Symmetric: ρ^⊤ = ρ
2. Positive semi-definite: x^⊤ ρ x ≥ 0 for all x in R^n
3. Unit trace: tr(ρ) = 1

To represent words, we view each word as a probability distribution over senses, and we view the vectors v_i in equation (1) as representing its different senses. For example, the word bright could mean shiny or clever. Suppose that when bright is used, it is twice as likely to mean shiny as it is to mean clever. The density matrix for bright, denoted ⟦bright⟧, is computed as follows:

⟦bright⟧ = (2/3) shiny shiny^⊤ + (1/3) clever clever^⊤.

Composition with density matrices

Since we are working with density matrices, nouns are now maps N → N, i.e. matrices in N ⊗ N. Sentences are matrices in S ⊗ S. This means that intransitive verbs are order-4 tensors that take a noun density matrix as input and give back a sentence density matrix; they live in the space N ⊗ N ⊗ S ⊗ S. Transitive verbs are order-6 tensors. Clearly, these spaces get very big very quickly.
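As a concrete illustration of equation (1) and the three defining properties, here is a minimal numpy sketch of the bright example above; the 3-dimensional sense vectors are invented for illustration.

```python
import numpy as np

# Hypothetical sense vectors, for illustration only.
shiny = np.array([1.0, 0.2, 0.0])
clever = np.array([0.1, 0.9, 0.3])

def density_matrix(sense_vectors, probabilities):
    """Mixture of normalised sense vectors, as in equation (1)."""
    rho = sum(p * np.outer(v / np.linalg.norm(v), v / np.linalg.norm(v))
              for p, v in zip(probabilities, sense_vectors))
    return rho / np.trace(rho)  # renormalise so that tr(rho) = 1

bright = density_matrix([shiny, clever], [2/3, 1/3])

# The three defining properties:
assert np.allclose(bright, bright.T)                 # symmetric
assert np.all(np.linalg.eigvalsh(bright) >= -1e-10)  # positive semi-definite
assert np.isclose(np.trace(bright), 1.0)             # unit trace
```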
To deal with this increase in dimensionality, tricks to create completely positive maps out of density matrices have been proposed (Lewis, 2019b;Coecke and Meichanetzidis, 2020). This allows composition mechanisms to be specified at the level of density matrices, rather than having to work in the high-order spaces described above. We describe these composition mechanisms in section 3.2.
Other applications of density matrices in NLP include modelling entailment in a compositional setting (Balkir et al., 2015; Bankova et al., 2018; Lewis, 2019a; Bradley and Vlassopoulos, 2020). Blacoe et al. (2013) also use density matrices to model ambiguity, but in a different setting. Baroni et al. (2014) argue that compositional distributional semantic models are particularly able to pick out the more subtle shades of meaning termed polysemy. This idea is used in (Mitchell and Lapata, 2008; Grefenstette and Sadrzadeh, 2011a,b; Kartsaklis et al., 2013), where a range of semantic composition models are tested on datasets built to distinguish different senses of words in context. Neural and distributional models for disambiguation are compared in Milajevs et al. (2014), and the role of ellipsis in disambiguation is investigated in Wijnholds and Sadrzadeh (2019).

Density Matrix Models
We now introduce the three novel methods that we propose for building density matrices.
BERT2DM

BERT (Devlin et al., 2019) produces contextualised embeddings for words and sentences. Given a sentence, it produces vectors for each word that are specific to that particular context (BERT actually models subword units, but we average the subword embeddings of a word to obtain a contextualised word embedding). BERT2DM uses the contextualised embeddings of BERT to build density matrices that encode multiple senses of a word. BERT is applied to a corpus and the contextualised embeddings for a word w are combined to compute w's density matrix according to equation (1). The procedure is outlined in algorithm 1, in which ind(v) denotes the indices at which the word v occurs in the corpus and v_i is the reduced embedding for v.
Since the vectors produced by BERT are fairly large, we apply a dimensionality reduction step (either PCA or SVD) over all content word embeddings before combining to form a density matrix.
We also experiment with clustering the contextual embeddings of a word and applying dimensionality reduction to the cluster centroids instead of the contextualised embeddings. The motivation for this is that clustering contextualised embeddings can produce clusters that correspond to distinct senses (as shown by Wiedemann et al. (2019)).
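The following sketch illustrates the BERT2DM construction for a single word, including the optional clustering step; function names, the toy data, and the choice to fit the dimensionality reduction on a separately supplied embedding matrix are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def bert2dm(word_context_embeddings, all_embeddings, dim=17, n_clusters=None):
    """Sketch of BERT2DM: reduce contextualised BERT embeddings for a word and
    mix their outer products into a density matrix, as in equation (1).
    word_context_embeddings: (num_occurrences, 768) array for one word.
    all_embeddings: the embeddings used to fit the dimensionality reduction."""
    reducer = PCA(n_components=dim).fit(all_embeddings)
    vecs = word_context_embeddings
    if n_clusters is not None:
        # Optional variant: cluster occurrences into senses and keep the centroids.
        vecs = KMeans(n_clusters=n_clusters, n_init=10).fit(vecs).cluster_centers_
    reduced = reducer.transform(vecs)
    rho = sum(np.outer(v, v) for v in reduced)
    return rho / np.trace(rho)  # normalise to unit trace

# Toy usage: 50 fake 768-d "BERT" vectors for one word; reduction fit on 500 vectors.
rng = np.random.default_rng(0)
rho = bert2dm(rng.normal(size=(50, 768)), rng.normal(size=(500, 768)),
              dim=17, n_clusters=5)
```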

Word2DM
Word2DM is an extension of Word2Vec (Mikolov et al., 2013a,b) skip-gram with negative sampling (SGNS).

Algorithm 2: Word2DM training
  for each word v in the vocabulary do
    Randomly initialise an n × m matrix B_v.
  end
  for each target word w_t in the corpus do
    for each context word w_c do
      Sample K negative samples from some noise distribution.
      Maximise equation 2 with respect to B_t, B_c, and B_{w_k} for k = 1, ..., K.
    end
  end
  for each word v in the vocabulary do
    Compute its density matrix as A_v = B_v B_v^⊤, normalised to unit trace.
  end

SGNS modifies word vectors to become closer to words they do occur with, and further away from words they don't occur with (the negative samples). When extending the SGNS algorithm to produce density matrices, we must ensure that the matrices satisfy the conditions resulting from their definition: symmetry, positive semi-definiteness, and unit trace. The first and last are easy to enforce, but preserving positivity is more challenging. To preserve positivity, we utilise the following property of positive semi-definiteness:

Property 3.1. For any matrix B, the product BB^⊤ is positive semi-definite.
We enforce positive semi-definiteness by training the weights of an intermediary matrix B and computing our density matrix as A = BB^⊤. By updating the weights of B and computing A, we indirectly train positive semi-definite matrices. We modify the training objective of SGNS to maximise the similarity of the density matrices of co-occurring words. The objective function at each target-context prediction is then

J(θ) = log σ(tr(A_t A_c)) + Σ_{k=1}^{K} log σ(−tr(A_t A_k)),    (2)

where A_t and A_c are the density matrices of the target and context words respectively, A_1, A_2, ..., A_K are the density matrices of the K negative samples, and θ is the set of weights of the intermediary matrices B_t, B_c and B_1, B_2, ..., B_K.
Word2DM is a straightforward extension of Word2Vec for learning density matrices. However, it turns out that enforcing positive semi-definiteness by introducing intermediary matrices leads to suboptimal training updates. This can be shown by examining the gradients of equation (2) with respect to the intermediary matrices (the derivation is given in appendix A).
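A minimal PyTorch sketch of the Word2DM objective (equation 2) with trainable intermediary matrices is given below; the dimensions, variable names, and the point at which trace normalisation is applied are illustrative assumptions rather than the paper's implementation.

```python
import torch

n, m, K = 17, 17, 5  # embedding dimensions and number of negative samples (illustrative)

# Intermediary matrices; the density matrix of word x is A_x = B_x @ B_x.T.
B_t = torch.randn(n, m, requires_grad=True)                        # target word
B_c = torch.randn(n, m, requires_grad=True)                        # true context word
B_neg = [torch.randn(n, m, requires_grad=True) for _ in range(K)]  # negative samples

def density(B):
    A = B @ B.T
    return A / torch.trace(A)  # unit trace; symmetry and PSD hold by construction

def trace_inner(A1, A2):
    return torch.trace(A1 @ A2)

# Objective of equation (2): pull co-occurring density matrices together and push
# negative samples apart, measured by the trace inner product. The loss is the
# negated objective, so gradient descent maximises J.
A_t = density(B_t)
loss = -(torch.log(torch.sigmoid(trace_inner(A_t, density(B_c))))
         + sum(torch.log(torch.sigmoid(-trace_inner(A_t, density(B)))) for B in B_neg))
loss.backward()  # gradients flow back to the intermediary matrices B
```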

Multi-sense Word2DM
Multi-sense Word2DM is a modification of Word2DM designed to overcome the gradient issues of Word2DM and to explicitly model ambiguity. Multi-sense Word2DM achieves this through the following changes to Word2DM:
• The columns of the intermediary n × m matrix B now represent the m different senses of the word. Each sense of a word has its own n-dimensional embedding. The density matrix of a word is still computed as before, and can be expressed in terms of the sense embeddings as A = BB^⊤ = Σ_{j=1}^{m} b_j b_j^⊤, where b_j is the jth column of B.
• Each word is also associated with a single vector v_w, which represents it as a context word.
• The following objective function is maximised:

J(θ) = log σ(c_t^⊤ b_t) + Σ_{k=1}^{K} log σ(−v_{w_k}^⊤ b_t),    (4)

where c_t is the sum of the context vectors for all words surrounding the target word and b_t is the embedding for the relevant sense of the target word. We select b_t by finding the column of B_t most similar to c_t (measured by either cosine similarity or dot product).
The full training procedure is outlined in algorithm 3.

Algorithm 3: MS-Word2DM training
  for each word w in the vocabulary do
    Randomly initialise an n × m matrix B_w and an n-dimensional vector v_w.
  end
  for each target word w_t in the corpus do
    Sum the context vectors of the words surrounding w_t within a window of size 2l to get a context embedding c_t.
    Compute the similarity (with either cosine similarity or dot product) of the columns b_1, ..., b_m of B_t and c_t, and extract the most similar column as b_t (the embedding of the relevant sense).
    Sample K negative samples from some noise distribution.
    Maximise equation 4 with respect to b_t, c_t, and v_{w_k} for k = 1, ..., K.
  end
  for each word v in the vocabulary do
    Compute its density matrix as A_v = B_v B_v^⊤, normalised to unit trace.
  end
Multi-sense Word2DM explicitly models ambiguity by letting the columns of the intermediary matrix represent the different senses of a word. During training the column closest to the context embedding is selected as the relevant sense embedding and only this column is updated. This enables the model to avoid the gradient issues of Word2DM. The objective function being maximised (equation 4) has the same gradient as Word2Vec and therefore does not lead to suboptimal training updates.
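To illustrate the sense-selection step, here is a minimal numpy sketch; the function name, dimensions, and random data are illustrative.

```python
import numpy as np

def select_sense(B_t, c_t, use_cosine=True):
    """Pick the column of the intermediary matrix B_t (one column per sense)
    that best matches the summed context vector c_t."""
    scores = B_t.T @ c_t
    if use_cosine:
        scores = scores / (np.linalg.norm(B_t, axis=0) * np.linalg.norm(c_t) + 1e-12)
    return int(np.argmax(scores))

# Illustrative shapes: n-dimensional embeddings, m senses per word.
n, m = 17, 5
rng = np.random.default_rng(0)
B_t = rng.normal(size=(n, m))               # sense embeddings of the target word
context_vectors = rng.normal(size=(4, n))   # v_w for the words in the context window
c_t = context_vectors.sum(axis=0)

j = select_sense(B_t, c_t)
b_t = B_t[:, j]    # only this sense column receives the Word2Vec-style update
A_t = B_t @ B_t.T  # the word's density matrix remains the mixture of its senses
```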

Composition methods
The composition methods we use are based on methods in Lewis (2019a) and Coecke and Meichanetzidis (2020). These reduce the high-dimensional representations needed for relational words to composition of density matrices. The relational word is seen as a map that takes nouns as arguments. The composition methods are as follows, using the example of an adjective modifying a noun:

Add: adj + noun
Mult: adj ⊙ noun (element-wise product)
Tensor: (adj ⊗ adj) × noun, where ⊗ denotes the Kronecker product and × denotes tensor contraction
Phaser: adj^{1/2} noun adj^{1/2}

More complex phrases are combined according to their parse. So, a transitive sentence modified with an adjective is composed as (subj (verb (adj obj))). For example, composing the sentence Bob likes old cars consists of composing old with cars, then likes with the result, and finally bob with that, i.e. f(bob, f(likes, f(old, cars))), where the composer f can be substituted by any of the composition methods listed above.
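A minimal numpy/scipy sketch of these composition methods on toy density matrices is given below. The reading of the Tensor method as adj · noun · adj^⊤ (the Kronecker product acting on the vectorised noun) and the function names are our assumptions, not definitions from the paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def add(a, b):
    return a + b

def mult(a, b):
    return a * b  # element-wise (Hadamard) product

def tensor(a, b):
    # One reading of (a (x) a) applied to b via tensor contraction: a @ b @ a.T.
    return np.einsum('ij,kl,jl->ik', a, a, b)

def phaser(a, b):
    r = np.real(sqrtm(a))  # matrix square root of the relational word
    return r @ b @ r

def normalise(rho):
    return rho / np.trace(rho)

# Toy density matrices for an adjective and a noun (random, for illustration only).
rng = np.random.default_rng(1)
def random_dm(n=4):
    B = rng.normal(size=(n, n))
    A = B @ B.T
    return A / np.trace(A)

adj, noun = random_dm(), random_dm()
for f in (add, mult, tensor, phaser):
    # Each composed representation is renormalised to a valid density matrix.
    print(f.__name__, np.trace(normalise(f(adj, noun))))
```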

Experimental Setup
We evaluate our models on three tasks: word similarity, disambiguation, and a word-level ambiguity analysis. The code for training and evaluating our models has been made available at https://github.com/francois-meyer/lexical-ambiguity-dms. In this section we introduce the experimental setup used in all of these tasks.

Baselines
Throughout the experiments we compare our models to existing word and sentence embedding models. For our word embedding baselines we use embeddings produced by three existing models: Word2Vec, GloVe, and FastText. We use the publicly available embeddings trained by Wijnholds and Sadrzadeh (2019). The embeddings are 300-dimensional and were trained on the combined and lemmatised ukWaC and Wackypedia corpora. In the tasks that involve sentence-level semantics (the disambiguation tasks) we compare our models to existing compositional distributional semantic (CDS) models and neural sentence encoders.

CDS models
CDS models compute a sentence vector as a function of the distributional vectors of the words in the sentence. We use the pre-trained word embeddings of Wijnholds and Sadrzadeh (2019) and compute sentence embeddings by either summing, element-wise multiplying, or applying tensor-based composition. For the phrase big house with vectors big and house, the additive method computes big + house, the multiplicative method computes the element-wise product big ⊙ house, and the tensor-based method applies a matrix representing the adjective to the noun vector.
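For concreteness, the three vector composition baselines can be sketched as follows; the rank-1 adjective matrix used for the tensor-based case is purely illustrative, since the actual baselines build their matrices from corpus data (section 4.1.1).

```python
import numpy as np

big = np.random.rand(300)      # pre-trained word vectors (illustrative)
house = np.random.rand(300)

additive = big + house             # vector addition
multiplicative = big * house       # element-wise multiplication
# Tensor-based: the adjective acts as a matrix on the noun vector. A rank-1
# matrix built from the adjective vector stands in for a learned matrix here.
BIG = np.outer(big, big)
tensor_based = BIG @ house
```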

Sentence encoders
We compare our models to two well-known neural sentence encoders: InferSent (Conneau et al., 2017) and BERT (Devlin et al., 2019). InferSent embeddings are 4096-dimensional, which is much larger than the word embeddings used in our CDS baselines. We use two pre-trained InferSent models that are publicly available, referred to as InferSent1 and InferSent2 in our results. We also compare our models to BERT as a sentence encoder. BERT produces an embedding for the entire sentence by adding a special classification token ([CLS]) to the start of every sequence. When BERT is used in a sentence-level task, the [CLS] embedding can be used as a semantic representation for the entire sentence. Some of our evaluation data contains phrases that are not fully formed sentences. To ensure a fair comparison, we convert all phrases to fully formed sentences for evaluation of the sentence encoders, adding "the" before noun phrases and converting verbs to their present tense form.

Context2DM
We also compare our models to a baseline density matrix model, which we call Context2DM. It is based on the procedure of Schütze (1998) for building multi-sense embeddings. Context2DM builds the density matrix of a word w as follows:
1. Context embeddings are obtained for all the contexts in which w occurs (computed by summing the pre-trained embeddings of all the words that occur around w in a particular context).
2. These context embeddings are clustered (using hierarchical agglomerative clustering for k = 2, ..., 10 and the variance reduction criterion to select the number of clusters) and the resulting cluster centroids subsequently represent the different senses of w.
3. The density matrix of w is computed as the mixture of its sense embeddings, i.e. the sum of the outer products of the cluster centroids, normalised to have unit trace.
For the pre-trained word embeddings required for step 1 of the above procedure, we use 17-dimensional word embeddings, trained with the gensim implementation of Word2Vec on the combined ukWaC+Wackypedia corpus.
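The following is a minimal sketch of the Context2DM construction for a single word; the function name and toy data are ours, and the paper's variance reduction criterion for choosing the number of clusters is not reproduced.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def context2dm(context_embeddings, n_senses):
    """Sketch of the Context2DM baseline for one word.
    context_embeddings: one summed context vector per occurrence of the word.
    n_senses: number of clusters; the paper selects this with a variance
    reduction criterion over k = 2..10, which is not reproduced here."""
    X = np.asarray(context_embeddings)
    labels = AgglomerativeClustering(n_clusters=n_senses).fit_predict(X)
    centroids = [X[labels == c].mean(axis=0) for c in range(n_senses)]
    rho = sum(np.outer(c, c) for c in centroids)  # mixture of sense embeddings
    return rho / np.trace(rho)                    # normalise to unit trace

# Toy usage: 17-dimensional context vectors for 40 occurrences of a word.
rng = np.random.default_rng(0)
rho = context2dm(rng.normal(size=(40, 17)), n_senses=3)
```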

Training
All our density matrices are 17 × 17 (so 289 parameters), which is closest in size to the 300-dimensional baseline embeddings. We present results for four different multi-sense Word2DM models (training details and hyperparameters are given in appendix B). Two use cosine similarity to compare sense vectors to context vectors, while the other two use the dot product. We also vary the number of senses modelled (the number of columns in the intermediary matrix) between 5 and 10. We present results for four different BERT2DM models. Two of these cluster the BERT representations into senses before dimensionality reduction, while the other two do not. One of the advantages of clustering the representations is that it reduces the size of the matrix on which dimensionality reduction is applied, so it becomes computationally feasible to train on a larger corpus. We train the unclustered variants on a 10-million word subcorpus of Wackypedia, and the clustered variants on a 20-million word subcorpus. We also vary the dimensionality reduction algorithm between PCA and SVD, to test whether or not centering the contextual embeddings before dimensionality reduction makes any difference. Training BERT2DM takes only a few hours on a 16-core CPU (Intel Xeon Gold 6130) but requires around 4.5GB of memory per 1 million words that it is trained on.

Data Sets
We test our models on data sets designed to test disambiguation in a compositional setting. Data sets for this task contain sentence pairs with:
• An ambiguous target word used in a disambiguating phrase.
• A landmark word that has the same meaning as one of the target word's senses.
• Human judgements of how similar the meaning of the phrase is when the ambiguous word is replaced by the landmark word.
We use four disambiguation data sets to evaluate our models. Three of the four data sets, GS2011 (Grefenstette and Sadrzadeh, 2011a), GS2012, and KS2013-CoNLL (Kartsaklis et al., 2013), are publicly available, while ML2008 (Mitchell and Lapata, 2008) was obtained privately from the authors of Wijnholds and Sadrzadeh (2019). We show examples and statistics of the data sets in table 1.

Results
We introduce each of the evaluation tasks and present our results. For multi-sense Word2DM and BERT2DM we trained four models each, with different hyperparameter settings (as described in section 4.2 and listed in the results tables).

[Table 3: Spearman ρ obtained on ML2008.]

Word Similarity
To validate the quality of our density matrices as general semantic representations we evaluate them on the following standard word similarity data sets: RG (Rubenstein and Goodenough, 1965), WS (Finkelstein et al., 2001), MC (Miller and Charles, 1991), SL (Hill et al., 2015), and MEN (Bruni et al., 2012). We use the evaluation scripts and data sets made publicly available by Faruqui and Dyer (2014) at https://github.com/mfaruqui/eval-word-vectors. The results are shown in table 2. Multi-sense Word2DM performs best out of all the density matrix models, achieving scores comparable to the word embeddings. It substantially improves upon Word2DM, supporting our theoretical findings about Word2DM's learning issues. Using cosine similarity to select the relevant sense results in slightly better scores. The BERT2DM models perform worst of all our models, but still demonstrate some ability to judge word similarity. There is no clear performance difference between using PCA or SVD for dimensionality reduction. Clustering the BERT representations before dimensionality reduction leads to worse correlation scores.

Disambiguation
The results we obtain on the disambiguation data sets are presented in tables 3 to 6. In each of these tables our density matrix models are compared to our baselines. Column headings specify the composition methods used to compute the phrase representation. These do not apply to the sentence encoders (BERT and InferSent). The rightmost composition method (Phaser) does not apply to the CDS models. The leftmost column (Verb) compares the semantic representations of the verbs without composition. The best performing models, among the baselines and the density matrices, are indicated in bold. We compare the best-performing density matrix models to the best-performing baseline using a one-sided paired t-test (applying the Bonferroni correction to account for multiple comparisons). We indicate statistically significant improvements over the baseline models, or statistically equivalent scores, by underlining the corresponding scores. Multi-sense Word2DM is by far the best performing density matrix model. It outperforms all the baseline models on 3 out of the 4 data sets. Among all the composition methods, Phaser most consistently achieves high correlation scores (especially on the more complex data sets). In some cases BERT2DM achieves correlation scores that are comparable to multi-sense Word2DM and the baselines. But in general the BERT2DM density matrices cannot reliably be used to achieve disambiguation.

Ambiguity Analysis
To investigate to what extent our models encode ambiguity at a word level, we turn to von Neumann entropy (VNE). For a density matrix ρ, the von Neumann entropy is

S(ρ) = −tr(ρ ln ρ) = −Σ_i λ_i ln λ_i,

where λ_i are the eigenvalues of ρ. This can be seen as an extension of Shannon entropy to matrices, and quantifies the amount of information encoded in a density matrix.
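As an illustration, von Neumann entropy can be computed directly from the eigenvalues of a density matrix; the helper below is a minimal sketch (names and the 17-dimensional examples are ours, not the paper's).

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -sum_i lambda_i * ln(lambda_i), computed from the eigenvalues."""
    eigvals = np.linalg.eigvalsh(rho)
    eigvals = eigvals[eigvals > 1e-12]  # 0 * ln(0) is taken to be 0
    return float(-np.sum(eigvals * np.log(eigvals)))

# A pure (unambiguous) state has zero entropy; the maximally mixed state has the
# highest possible entropy, ln(n).
n = 17
pure = np.zeros((n, n)); pure[0, 0] = 1.0
mixed = np.eye(n) / n
print(von_neumann_entropy(pure), von_neumann_entropy(mixed), np.log(n))
```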
We perform two analyses of ambiguity with VNE. First we test whether our density matrices model lexical ambiguity. We do this by investigating whether or not the measured ambiguity of a word's density matrix correlates with the number of meanings associated with the word. Secondly, we perform a systematic analysis of how ambiguity changes when words are composed into phrases. Using the four disambiguation data sets, we measure the VNE before and after composition, expecting ambiguity to decrease after composition. Something similar was done by Piedeleu et al. (2015), at a smaller scale.
For these experiments, we only report results for one variant of multi-sense Word2DM (cosine similarity, 5 senses) and two variants of BERT2DM (SVD and PCA).

Ambiguity and polysemy
To determine the number of senses that a word has, we use WordNet synsets (senses) (Miller, 1995). We compute the correlation between the VNE of density matrices and the number of synsets associated with words. The correlation coefficients are shown in table 7 and the relationships are plotted in figure 1.
The results show that both BERT2DM and multi-sense Word2DM successfully encode how ambiguous words are. Word2DM exhibits a very low correlation and Context2DM (not plotted) shows none.

Ambiguity and composition

VNE allows us to measure how ambiguity evolves through composition. We can compare the ambiguity of a word to the ambiguity of a phrase containing the word. Seeing the context in which a word occurs can reveal which sense of the word is being employed, and should therefore reduce the amount of ambiguity present. For example, the multi-sense Word2DM density matrix for the ambiguous word run has a VNE of 1.491; after composition in the sentence The family run the hotel, the sentence density matrix has a VNE of 0.144, so ambiguity has decreased. We test whether this phenomenon holds for our models on the disambiguation data sets, which consist of ambiguous verbs and disambiguating phrases. For each of the data sets we compute the average VNE of the verb density matrices. We compare this to the average VNE of the disambiguating phrases, where the density matrices are composed using different composition methods. The results are presented in tables 8 to 11.
In each of the tables, the leftmost column (Verb) contains the average von Neumann entropy of the uncomposed verb density matrices, and the remaining columns contain the average entropy of the composed phrases under each composition method. The results are quite similar across the data sets. Phaser emerges as the best method for decreasing ambiguity through composition. This nicely supports the results of the disambiguation experiments, in which Phaser also emerged as the best composition method for disambiguating the meaning of ambiguous words through composition. Besides Phaser, none of the other composition methods reliably decrease the measured ambiguity.

Conclusion and Future Work
In this paper we addressed the problem of modelling ambiguity in NLP, and how ambiguous words can be disambiguated in context. We investigated density matrices as semantic representations for modelling ambiguity. Our results confirmed the value of density matrices over vector-based approaches. Equipped with a compositional framework, one of our density matrix models (multi-sense Word2DM) outperformed all other models (including existing compositional models and strong neural baselines) on most of the disambiguation tasks. We also performed a mathematical analysis of the ambiguity encoded by our models. This revealed that the density matrices built by two of our models (multi-sense Word2DM and BERT2DM) reflect true word-level ambiguity. We have shown the value in designing neural models that learn density matrices from scratch. Possible directions for future work include extending our models to larger datasets and longer sentences, and adapting our techniques to settings where a word appears in differing sentences, as in the Word in Context dataset (Pilehvar and Camacho-Collados, 2019) and other WSD tasks. We focused on ambiguity here, but it would be possible to do similar experiments focused on other aspects of meaning such as metaphor or entailment, and to examine how these interact with composition.

A Word2DM Gradients
The objective function that SGNS optimises at each prediction with regard to the model parameters θ is

J(θ) = log σ(v_c^⊤ v_t) + Σ_{k=1}^{K} log σ(−v_k^⊤ v_t),    (6)

where v_t is the embedding of the target word, v_c is the embedding of the context word, and v_1, v_2, ..., v_K are the embeddings of the K negative samples. By optimising equation 6 over a large corpus, skip-gram learns word embeddings that encode distributional information.
Maximising equation 6 adjusts the embeddings of words occurring in the same context to be more similar and adjusts the embeddings of words that don't occur together to be less similar. This becomes clear when we consider the gradients used to update embeddings during training. We briefly recall the details of the gradient calculation so as to refer back to it later in this section. The derivative of equation 6 with respect to the target vector v_t is

∂J/∂v_t = (1 − σ(v_c^⊤ v_t)) v_c − Σ_{k=1}^{K} (1 − σ(−v_k^⊤ v_t)) v_k,    (7)

which is used to update the target vector as follows:

v_t ← v_t + η [ (1 − σ(v_c^⊤ v_t)) v_c − Σ_{k=1}^{K} (1 − σ(−v_k^⊤ v_t)) v_k ],    (8)

where η is the learning rate. The target vector is updated by adding the scaled context vector to it and subtracting the scaled negatively sampled vectors from it. The vectors are scaled proportionally to how dissimilar they are to the target vector. This ensures that the target vector is "pulled closer" to the true context vector and "pushed away" from the negative context vectors. It is this computationally simple training procedure which makes SGNS effective. Word2DM extends SGNS to learn density matrices, replacing equation 6 with the following objective function:

J(θ) = log σ(tr(A_t A_c)) + Σ_{k=1}^{K} log σ(−tr(A_t A_k)),    (9)

where A_t and A_c are the density matrices of the target and context words respectively, A_1, A_2, ..., A_K are the density matrices of the K negative samples, and θ is the set of weights of the intermediary matrices B_t, B_c and B_1, B_2, ..., B_K.
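As a concrete illustration of the update in equations 7 and 8, the following numpy sketch performs one target-side SGNS step; the function name, dimensions, and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_update(v_t, v_c, negatives, lr=0.025):
    """One target-side SGNS update (equations 7-8): pull the target vector towards
    the true context vector and push it away from the negative samples."""
    grad = (1 - sigmoid(v_c @ v_t)) * v_c
    for v_k in negatives:
        grad -= (1 - sigmoid(-v_k @ v_t)) * v_k
    return v_t + lr * grad

rng = np.random.default_rng(0)
v_t, v_c = rng.normal(size=300), rng.normal(size=300)
negs = [rng.normal(size=300) for _ in range(5)]
v_t_new = sgns_update(v_t, v_c, negs)
print(v_c @ v_t, v_c @ v_t_new)  # the similarity to the true context typically increases
```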
Computing this objective function requires multiple matrix multiplications. For each tr(A t A c ) term (including the terms of the K negative samples), the matrices A t and A c have to be computed respectively as A t = B t B t and A c = B c B c and then the matrix product A t A c has to be computed. This means that, for each target-context prediction, we require 3(K +1) matrix multiplications. One of the most attractive features of SGNS is its computational efficiency, which enabled training on very large corpora in reasonable time. The introduction of multiple matrix multiplications into the objective function means that much of this efficiency is lost. In order to reduce the complexity of our model, we make use of the following property and lemma to find a new objective function that is computationally simpler, but equivalent to equation 9.
Property A.1. The trace of the product of two matrices can be expressed as the sum of the element-wise products of their elements. If A is an n × m matrix and B is an m × n matrix, then the trace of the n × n matrix AB can be computed as

tr(AB) = Σ_{i=1}^{n} Σ_{j=1}^{m} A_{ij} B_{ji}.

Lemma A.2. If B_t and B_c are n × m intermediary matrices, then the trace of the matrix product A_t A_c can be written as the sum of the squared elements of the m × m matrix C = B_c^⊤ B_t:

tr(A_t A_c) = Σ_{i=1}^{m} Σ_{j=1}^{m} C_{ij}².

Proof. We can express tr(A_t A_c) as a trace computation involving the intermediary matrices B_t and B_c:

tr(A_t A_c) = tr(B_t B_t^⊤ B_c B_c^⊤).

Then we can use the cyclic property of the trace function to rewrite this as the trace of the product of the matrix C = B_c^⊤ B_t and its transpose:

tr(B_t B_t^⊤ B_c B_c^⊤) = tr(B_c^⊤ B_t B_t^⊤ B_c) = tr(C C^⊤).

Now we can use property A.1 to express this as the element-wise products of the elements of C and its transpose:

tr(C C^⊤) = Σ_{i=1}^{m} Σ_{j=1}^{m} C_{ij} (C^⊤)_{ji} = Σ_{i=1}^{m} Σ_{j=1}^{m} C_{ij}².

This allows us to rewrite equation 9 to find an equivalent objective function that requires fewer computations than straightforward matrix multiplication would. The objective function at each target-context prediction becomes

J(θ) = log σ( Σ_{i,j} ([B_c^⊤ B_t]_{ij})² ) + Σ_{k=1}^{K} log σ( −Σ_{i,j} ([B_{w_k}^⊤ B_t]_{ij})² ).    (10)

By using the result of lemma A.2 we have reduced the number of matrix multiplications required for each target-context prediction from 3(K + 1) to (K + 1). Density matrices are trained by maximising equation 10 with respect to the intermediary matrices B_t, B_c, B_{w_1}, ..., B_{w_K} over a large corpus. The model is trained using stochastic gradient descent. We now derive the gradients used to update B_t during training, and subsequently show that these gradients lead to suboptimal updates to the density matrices. Deriving the gradients with respect to B_c and B_{w_k} would proceed similarly.

To compute the gradients of equation 10 we first rewrite it in terms of the elements of the n × m matrices B_t, B_c, and B_{w_k}:

J(θ) = log σ( Σ_{i,j} ( Σ_p b^c_{pi} b^t_{pj} )² ) + Σ_{k=1}^{K} log σ( −Σ_{i,j} ( Σ_p b^{w_k}_{pi} b^t_{pj} )² ),

where b^x_{pq} denotes the pq-th element of B_x. We derive the gradient of this objective function with respect to b^t_{pq}, an element of the intermediary target word matrix B_t. In order to use the chain rule in gradient calculations we rewrite J(θ) as a composite function:

J(θ) = log σ(y(θ)) + Σ_{k=1}^{K} log σ(z_k(θ)), where y(θ) = Σ_{i,j} ([B_c^⊤ B_t]_{ij})² and z_k(θ) = −Σ_{i,j} ([B_{w_k}^⊤ B_t]_{ij})².

The derivative of J with respect to b^t_{pq} can now be computed as follows:

∂J/∂b^t_{pq} = (1 − σ(y)) ∂y/∂b^t_{pq} + Σ_{k=1}^{K} (1 − σ(z_k)) ∂z_k/∂b^t_{pq}
            = (1 − σ(y)) 2 Σ_i b^c_{pi} [B_c^⊤ B_t]_{iq} − Σ_{k=1}^{K} (1 − σ(z_k)) 2 Σ_i b^{w_k}_{pi} [B_{w_k}^⊤ B_t]_{iq}
            = (1 − σ(y)) 2 [B_c B_c^⊤ B_t]_{pq} − Σ_{k=1}^{K} (1 − σ(z_k)) 2 [B_{w_k} B_{w_k}^⊤ B_t]_{pq}.

The last line in the above derivation is obtained by rewriting the summation expressions as equivalent matrix multiplications. We can now write the derivative of J with respect to the full intermediary matrix B_t:

∂J/∂B_t = (1 − σ(y(θ))) 2 B_c B_c^⊤ B_t − Σ_{k=1}^{K} (1 − σ(z_k(θ))) 2 B_{w_k} B_{w_k}^⊤ B_t.    (13)

As opposed to the gradients of Word2Vec (equation 7), the gradients of Word2DM do not lead to simple and easily interpretable training updates. As discussed in the paragraph following equation 8, in Word2Vec the target vector is made more similar to the context vector and less similar to the negative context vectors. Ideally we would like something similar to occur in Word2DM with density matrices, but equation 13 shows that we lose the intuitive training updates of Word2Vec through the introduction of intermediary matrices. Furthermore, we can show that the gradients of Word2DM sometimes lead to unwanted consequences in training. Consider the case where the density matrices of a target and context word are highly dissimilar. Recall from equation 9 that the y in equation 13 is the trace inner product of the density matrices A_t and A_c (the measure we use to quantify semantic similarity). The minimum value of the trace inner product of two density matrices is zero (this follows from the fact that density matrices are positive semi-definite), so two density matrices are highly dissimilar when their trace inner product is close to zero, i.e. y ≈ 0.
From equation 10 we can recall how y can be written in terms of the intermediary matrices:

y(θ) = Σ_{i=1}^{m} Σ_{j=1}^{m} ([B_c^⊤ B_t]_{ij})².

Observe that y ≈ 0 if and only if the elements of B_c^⊤ B_t are close to zero in value, since squaring the elements in the summation makes them all positive. We have established the following equivalence:

y ≈ 0 if and only if B_c^⊤ B_t ≈ O,

where O is the m × m matrix with all zero entries. Consider how this will affect the target-context update during training. The first term of the gradient in equation 13 becomes

(1 − σ(y(θ))) 2 B_c B_c^⊤ B_t ≈ (1 − σ(0)) 2 B_c O = O,

so the target-context update becomes ineffective for true contexts. The update should make the density matrix of the target word more similar to that of the context word, but the gradient is so small that it makes this impossible. Moreover, the more dissimilar the target and context density matrices are before the update, the less effective the update will be. This is the opposite of the intended effect (achieved by Word2Vec), in which the magnitude of the target-context update should increase when the target and context representations are less similar. This is an example of how the introduction of intermediary matrices in Word2DM leads to suboptimal training updates. We ensure that our density matrices are positive semi-definite, but lose the guarantee that the algorithm will learn high-quality semantic representations.
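To make the lemma and the vanishing-gradient argument concrete, here is a small numpy check. The dimensions, random seed, and the construction of a target matrix orthogonal to the context matrix are illustrative choices, not taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n, m = 17, 5
rng = np.random.default_rng(0)

# Lemma A.2: tr(A_t A_c) equals the sum of squared entries of C = B_c^T B_t.
B_t, B_c = rng.normal(size=(n, m)), rng.normal(size=(n, m))
A_t, A_c = B_t @ B_t.T, B_c @ B_c.T
assert np.isclose(np.trace(A_t @ A_c), np.sum((B_c.T @ B_t) ** 2))

# Now make the target maximally dissimilar to the context: put the columns of
# B_t in the orthogonal complement of B_c's column space, so B_c^T B_t = O.
Q, _ = np.linalg.qr(B_c)
B_t = (np.eye(n) - Q @ Q.T) @ rng.normal(size=(n, m))
y = np.sum((B_c.T @ B_t) ** 2)   # trace inner product, approximately 0

# First (target-context) term of the gradient in equation 13: it vanishes,
# so the update cannot pull the dissimilar representations together.
grad_context_term = (1 - sigmoid(y)) * 2 * B_c @ (B_c.T @ B_t)
print(y, np.linalg.norm(grad_context_term))  # both approximately 0
```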

B Hyperparameters for Word2DM and Multi-Sense Word2DM

Word2DM additional details

We use a dynamic window size, i.e. the size of each context window is sampled uniformly between 1 and the maximum window size. We also discard words that occur fewer times than some minimum threshold and subsample frequently occurring words. Negative samples are drawn from a unigram distribution raised to the power of 3/4. Furthermore, we train two density matrices for each word: one that represents it as a target word and another that represents it as a context word. After training we use the target density matrices as our final density matrices.
Hyperparameters

We train our Word2DM and multi-sense Word2DM models on the ukWaC+Wackypedia corpus, consisting of 2.8 billion words. We use a window size of 5, a minimum word count of 50, 5 negative samples per positive context, and a subsampling rate of 1e-5. We train the model for 4 iterations over the ukWaC+Wackypedia corpus, using the Adam optimisation algorithm, a learning rate of 0.001, and 16 sentences per batch.