Towards Quantum Language Models

This paper presents a new approach for building Language Models based on Quantum Probability Theory: a Quantum Language Model (QLM). It shows that, relying on this probability calculus, it is possible to build stochastic models able to benefit from quantum correlations due to interference and entanglement. We extensively tested our approach, showing its superior performance when compared with state-of-the-art language modelling techniques, both in terms of model perplexity and by inserting it into an automatic speech recognition evaluation setting.


Introduction
Quantum Mechanics Theory (QMT) is one of the most successful theories in modern science. Despite its effectiveness in the physics realm, the attempts to apply it in other domains remain quite limited, excluding, of course, the large body of studies regarding Quantum Information Processing on quantum computers.
Only in recent years have some scholars tried to embody principles derived from QMT into their specific fields, for example in the Information Retrieval community (Zuccon et al., 2009; Melucci and van Rijsbergen, 2011; González and Caicedo, 2011; Melucci, 2015) and in the domain of cognitive sciences and decision making (Khrennikov, 2010; Busemeyer and Bruza, 2012; Aerts et al., 2013). In the machine learning field, (Arjovsky et al., 2016; Wisdom et al., 2016; Jing et al., 2017) have used unitary evolution matrices for building deep neural networks, obtaining interesting results, but we have to observe that their works do not adhere to QMT and use unitary evolution operators in a way not allowed by QMT. In recent years the Natural Language Processing (NLP) community has also started to look at QMT with interest, and some studies using it have already been presented (Blacoe et al., 2013; Liu et al., 2013; Tamburini, 2014; Kartsaklis et al., 2016).
Language models (LM) are basic tools in NLP used in various applications, such as Automatic Speech Recognition (ASR), machine translation, part-of-speech tagging, etc., and were traditionally modelled using N-grams and various smoothing techniques. Among the dozens of tools for computing N-gram LMs, we will refer to CMU-SLM (with Good-Turing smoothing) (Clarkson and Rosenfeld, 1997) and IRSTLM (with linear Witten-Bell smoothing) (Federico et al., 2008); the latter is the tool used in Kaldi (Povey et al., 2011b), one of the most powerful and widely used open-source ASR packages, which we will use for some of the experiments presented in the following sections.
In recent years new techniques from the Neural Networks (NN) domain have been introduced in order to enhance the performance of such models. Elman recurrent NNs, as used in the RNNLM tool (Mikolov et al., 2010, 2011), or Long Short-Term Memory NNs, as in the LSTMLM tool (Soutner and Müller, 2015), produce state-of-the-art performance for current language models. This paper presents a different approach for building LMs based on quantum probability theory. At present, we present a QLM applicable only to problems defined on a small set of different tokens. This is a "proof-of-concept" study, and our main aim is to show the potential of such an approach rather than to build a complete application able to solve this problem in any setting.
The paper is organized as follows: we provide background on Quantum Probability Theory in Section 2 followed by the description of our proposed Quantum Language Model in Section 3. We then discuss some numerical issues mainly related to the optimisation procedure in Section 4, and in Section 5 we present the experiments we did to validate our approach. In Section 6 we discuss our results and draw some provisional conclusions.

Quantum Probability Theory
In QMT the state of a system is usually described, in the most general case, by using density matrices over a Hilbert space H. More specifically, a density matrix ρ is a positive semidefinite Hermitian matrix of unit trace, namely ρ† = ρ, Tr(ρ) = 1, and it encodes all the information about the state of a quantum system 1 .
The measurable quantities, or observables, of the quantum system are associated to Hermitian matrices O defined on H. The axioms of QMT specify how one can make predictions about the outcome of a measurement using a density matrix:
• the possible outcomes of a projective measurement of an observable O are its eigenvalues {λ_j};
• the probability that the outcome of the measurement is λ_j is P(λ_j) = Tr(ρ Π_{λj}) = Tr(Π_{λj} ρ), where Π_{λj} is the projector on the eigenspace of O associated to λ_j. Note that in the following we will use some properties of this kind of measurement, namely Π_{λj}† = Π_{λj} and Π_{λj}² = Π_{λj};
• after the measurement the system state collapses in the following fashion: if the outcome of the measurement was λ_j, the collapse is

ρ → Π_{λj} ρ Π_{λj} / Tr(Π_{λj} ρ),

where the denominator is needed for trace normalization;
• time evolution of states using a fixed time step is described by a unitary matrix U over H, i.e. U†U = I, where I is the identity matrix. Given a state ρ_t, at a specific time t, the system evolution without measurements modifies the state as

ρ_{t+1} = U ρ_t U†.

See for example (Nielsen and Chuang, 2010) or (Vedral, 2007) for a complete introduction to quantum probability theory.
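These rules can be checked numerically. The following is a minimal numpy sketch on a two-dimensional toy system (the state, observable and unitary are invented for illustration):

```python
import numpy as np

# Toy 2-dimensional system: density matrix of an equal superposition state.
psi = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2)
rho = np.outer(psi, psi.conj())        # rho = |psi><psi|, Tr(rho) = 1

# Projector on the eigenspace of an observable associated to one outcome.
e0 = np.array([1.0, 0.0], dtype=complex)
P0 = np.outer(e0, e0.conj())

# Born rule: P(lambda_0) = Tr(rho P0)
p0 = np.trace(rho @ P0).real           # ~0.5 for the equal superposition

# Collapse: rho -> P0 rho P0 / Tr(P0 rho); the denominator restores Tr = 1.
rho_after = P0 @ rho @ P0 / p0

# Unitary time evolution with a fixed time step: rho -> U rho U^dagger.
U = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
rho_next = U @ rho_after @ U.conj().T
print(np.trace(rho_next).real)         # trace stays 1 under evolution
```

Note that the unitary evolution preserves the trace, so the state remains a valid density matrix at every step.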

Quantum Language Models
In this section we describe our approach to building a QLM that can compute probabilities for the occurrence of a sequence w = (w_1, w_2, ..., w_n) of length n, composed using N different symbols (the vocabulary containing all the words in the model), i.e. every symbol w in the sequence satisfies w ∈ {0, ..., N − 1}. We define a set of orthogonal N-dimensional vectors {e_w : w ∈ {0, ..., N − 1}} spanning the complex space H = C^N; to measure the probability of a symbol w, collapsing the state onto the space spanned by e_w, we use the projector Π_w = e_w e_w†. Note that all the words in the vocabulary have been encoded as numbers corresponding to the N dimensions of the vector space H.
Our method is sequential, from QMT point of view, in the sense that we use a quantum system that produces a single symbol upon measurement.
The basic idea is that the probabilistic information for a given sequence w = (w_1, w_2, ..., w_n) is encoded in the density matrix that results from the following process: the first symbol w_1 is measured and collapsed directly on the initial state ρ_0,

ρ_1 = Π_{w_1} ρ_0 Π_{w_1} / Tr(Π_{w_1} ρ_0),

and then, for each subsequent symbol w_k, the state is evolved by the unitary U and measured and collapsed again,

ρ_k = Π_{w_k} U ρ_{k−1} U† Π_{w_k} / Tr(Π_{w_k} U ρ_{k−1} U†),   k = 2, ..., n,

so that the probability assigned to the whole sequence is the product of the conditional probabilities Tr(Π_{w_1} ρ_0) and Tr(Π_{w_k} U ρ_{k−1} U†).
We then use the initial density matrix ρ_0 and the time evolution unitary matrix U as parameters to optimise the perplexity Γ, evaluated on a training corpus of sequences S,

Γ = 2^{−(1/C) Σ_{w∈S} log₂ P(w|ρ_0, U)},

which quantifies the uncertainty of the model; C is the number of tokens in the corpus.
Minimising Γ amounts to learning the model by fitting all its parameters, a typical procedure in the machine learning domain.
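The measure–collapse–evolve chain described above can be sketched in a few lines of numpy (before the ancilla extension of the next section). The vocabulary size, initial state and unitary below are toy values, not the paper's:

```python
import numpy as np

def sequence_probability(seq, rho0, U):
    """Probability of a token sequence: the first token is measured
    directly on rho0, then each later token is measured after one unitary
    evolution step. Tokens are integers in {0, ..., N-1}."""
    N = rho0.shape[0]
    rho, prob = rho0.astype(complex), 1.0
    for t, w in enumerate(seq):
        if t > 0:
            rho = U @ rho @ U.conj().T       # time evolution between tokens
        Pw = np.zeros((N, N), dtype=complex)
        Pw[w, w] = 1.0                       # projector e_w e_w^dagger
        pw = np.trace(Pw @ rho).real         # Born rule for this token
        prob *= pw
        rho = Pw @ rho @ Pw / pw             # collapse, renormalised
    return prob

# Toy check: with N = 2, a Hadamard-like U and a diagonal rho0, the
# probabilities of all length-2 sequences sum to one.
rho0 = np.diag([0.6, 0.4]).astype(complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
total = sum(sequence_probability([a, b], rho0, H)
            for a in (0, 1) for b in (0, 1))
print(total)   # ~1.0
```

As the next section explains, without the ancilla this construction reduces to a classical Markov chain; the sketch is only meant to make the evolve-and-collapse bookkeeping concrete.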

Ancillary system
The problem with this setup is that the 'quantum effects' are completely washed out by the projective measurements on the system. The resulting expression for the probability P(w|ρ_0, U) of a sequence w is identical to that obtained using a classical Markov model.
To solve this issue, our approach is to avoid the complete collapse of the state after each symbol measurement by using a common technique in QMT: we introduce an ancillary system described by a fictitious D-dimensional Hilbert space, H_ancilla = C^D, and we couple the original system to it. The resulting DN-dimensional Hilbert space is

H' = H_ancilla ⊗ H = C^{DN},

where ⊗ denotes the Kronecker product for matrices and D can be seen as a free hyper-parameter of the model. On this new space the projectors are now given by

Π̃_w = I_D ⊗ Π_w,

where I_D is the D × D identity matrix. The advantage of using this method is that the time evolution for the coupled system creates nontrivial correlations between the two entangled systems, such that measuring and collapsing the symbol state keeps some information about the whole sequence stored in the ancillary part of the state. This information is then reshuffled into the symbol state via time evolution, resulting in a 'memory effect' that takes the whole sequence of symbols into account, thereby extending the idea behind the N-gram approach. Larger D values will result in more memory for the system and, of course, in a larger number of parameters to learn.
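As an illustration, the lifted projectors can be built with a Kronecker product. The ancilla-before-system ordering (I_D ⊗ Π_w) used here is one possible index convention, not necessarily the paper's:

```python
import numpy as np

N, D = 3, 2                      # vocabulary size, ancilla dimension

def symbol_projector(w, N):
    """Projector Pi_w = e_w e_w^dagger on the original N-dim. space."""
    e = np.zeros(N)
    e[w] = 1.0
    return np.outer(e, e)

def lifted_projector(w, N, D):
    """I_D (x) Pi_w on the coupled DN-dim. space: it collapses the symbol
    part of the state but leaves the D-dim. ancilla untouched, which is
    where the 'memory' of the sequence survives."""
    return np.kron(np.eye(D), symbol_projector(w, N))

Pi = lifted_projector(1, N, D)
print(Pi.shape)        # (6, 6): a DN x DN matrix
print(np.trace(Pi))    # 2.0 = D: one symbol dimension per ancilla copy
```

The lifted projector is still idempotent and Hermitian, so the measurement axioms of Section 2 apply unchanged on the coupled space.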

System evolution
We need to specify the system evolution for our coupled system. The simplest approach is to use a unitary DN × DN matrix U that acts on the entangled Hilbert space as shown before; it can be specified by (DN ) 2 real parameters with a suitable parametrization (Spengler et al., 2010) that ensures the unitarity of U . However, in our preliminary experiments this approach resulted in an insufficient 'memory' capability for the QLM and in a very complex and slow minimisation procedure.
A different approach could be introduced by using a specific unitary matrix for each word, but this would lead to an enormous amount of parameters to learn with the optimization procedure.
There are many techniques in NLP for representing single words with dense vectors (see for example (Mikolov et al., 2013) for the so-called word embeddings). Following this idea, we can represent every symbol in our system with a specific p-dimensional vector, w → (α_1(w), ..., α_p(w)), trained using one of the available techniques or fixed randomly.
We then work with a set of p DN × DN unitary matrices U = (U_1, ..., U_p), one for each component of the word vector, that are used to dynamically build a different system evolution matrix for each word in this way:

V(w) = U_1^{α_1(w)} U_2^{α_2(w)} · · · U_p^{α_p(w)}.

This results in p(DN)² complex, or 2p(DN)² real, parameters to be learned.
Essentially, we treat the words in our problem in different ways: the evolution operator V(w) for each word is built by using a combination of the operators U defined for each word-vector component, while, considering the system projection, we treat each word as one basis vector for the space H.
Note that the choice to use a set {V (w)} of operators, one for each word w, does not violate the linearity of quantum mechanics: let K be the quantum operation defined using projectors and evolution matrices. Then K is a valid (i.e. a Completely Positive Trace-preserving) evolution map that exactly reproduces our results in the sequence of evolutions and collapses. The number of evolutionary operators is a tradeoff: as we said before, defining only one operator U resulted in a poor performance of the proposed method in all the relevant experiments, while defining an operator for each word would produce too many parameters to be learned. The trade-off that we chose is to use one operator for each word-vector component, and build the set {V (w)} from them as described above while preserving unitarity.
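A sketch of this construction follows, assuming the per-word operator combines the component matrices as an ordered product of fractional powers (one reading of the construction above); the dimensions and the random unitaries are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
DN, p = 4, 3                     # coupled-space dimension, word-vector size

def random_unitary(n):
    """Random unitary via QR of a complex Gaussian matrix."""
    Z = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    Q, _ = np.linalg.qr(Z)
    return Q

U_set = [random_unitary(DN) for _ in range(p)]

def unitary_power(U, alpha):
    """U^alpha via the spectral decomposition U = S diag(u) S^dagger;
    the eigenvalues u lie on the unit circle, so u**alpha stays
    unimodular and the result is again unitary."""
    u, S = np.linalg.eig(U)
    return S @ np.diag(u ** alpha) @ S.conj().T

def V(word_vec):
    """Word-dependent evolution operator, assumed here to be the ordered
    product U_1^alpha_1 ... U_p^alpha_p (the combination rule is an
    assumption for this sketch)."""
    out = np.eye(DN, dtype=complex)
    for U, alpha in zip(U_set, word_vec):
        out = out @ unitary_power(U, alpha)
    return out

Vw = V([1.0, -1.0, 0.0])         # a {-1, 0, 1}-style word vector
print(np.allclose(Vw.conj().T @ Vw, np.eye(DN)))   # True: V(w) is unitary
```

Because each factor is a fractional power of a unitary matrix, unitarity of V(w) is preserved automatically, which is the property the trade-off discussed above relies on.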
With regard to the initial density matrix ρ_0, we have to define it by combining the initial density matrix of our system, ρ_0^s, and the initial density matrix of the ancilla, ρ_0^a. We defined ρ_0^s as a diagonal N × N matrix containing the classical Maximum Likelihood Estimate of the probability of observing a specific symbol at the first sequence position:

(ρ_0^s)_{w,w} = |{w ∈ S : w_1 = w}| / |S|,

where S is again the set of all sequences in the training set and w_1 is the first word in each sequence w. With regard to the ancilla system, we do not know anything about it a priori, and thus we define ρ_0^a as the maximally mixed D × D diagonal matrix

ρ_0^a = (1/D) I_D.

Consequently we can define ρ_0 as

ρ_0 = ρ_0^a ⊗ ρ_0^s.
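The construction of ρ_0 can be sketched as follows; the toy corpus and the dimensions are invented for illustration, and the ancilla ⊗ system ordering is an assumption:

```python
import numpy as np
from collections import Counter

# Hypothetical training corpus: sequences of integer-coded symbols.
corpus = [[0, 1, 2], [1, 2, 0], [0, 2, 1], [0, 1, 1]]
N, D = 3, 2

# rho_s: diagonal matrix of ML estimates for the first symbol of a sequence.
first = Counter(seq[0] for seq in corpus)
rho_s = np.diag([first[w] / len(corpus) for w in range(N)])

# rho_a: maximally mixed ancilla state I_D / D, since nothing is known
# about the ancilla a priori.
rho_a = np.eye(D) / D

# Coupled initial state on the DN-dimensional space.
rho_0 = np.kron(rho_a, rho_s)
print(np.trace(rho_0))   # 1.0: rho_0 is a valid density matrix
```

Since both factors are positive semidefinite with unit trace, their Kronecker product is automatically a valid density matrix on the coupled space.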

The final model
Putting all the ingredients together, we can finally write down the formula for the probability P(w|ρ_0, U) for a sequence w in the QLM specified by ρ_0 and U. The product of conditional probabilities simplifies because of the normalising denominators added at each collapse and time evolution step. The result is:

P(w|ρ_0, U) = Tr( Π̃_{w_n} V(w_n) · · · Π̃_{w_2} V(w_2) Π̃_{w_1} ρ_0 Π̃_{w_1} V(w_2)† Π̃_{w_2} · · · V(w_n)† Π̃_{w_n} ).   (1)

Using the fact that the projectors have many zero entries, one can also re-express this trace of a product of DN × DN matrices in terms of the trace of a product of D × D matrices. The formula for P(w|ρ_0, U) then simplifies to our final result

P(w|ρ_0, U) = Tr( T† R T ),   (2)

where the matrices R and T are defined as follows:
• in terms of entries R_{i,j}, with indices i, j = 0, ..., D − 1, the matrix R is given by

R_{i,j} = (ρ_0)_{iN+w_1, jN+w_1}.

Note that only the value of the first symbol in the sequence, w_1, enters this expression. This is to be expected, since R derives from the initial density matrix ρ_0;
• analogously, the matrix T that encodes the chain of combined collapses and time evolutions is given by the product T = T^(2) T^(3) ... T^(n), where the matrices T^(k) are given, in entries with indices i, j = 0, ..., D − 1, by

T^(k)_{i,j} = [V(w_k)†]_{iN+w_{k−1}, jN+w_k}.

These matrices can be pre-calculated for every pair of the involved symbols, so that the calculation of P(w|ρ_0, U) for all the sequences will be very fast.
The detailed calculation for obtaining the equation (2) can be found in the supplementary material.
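The D × D reduction can be sketched as follows. The index layout (row index i·N + w, i.e. ancilla-major) and the block-entry convention are assumptions of this sketch and should be checked against the supplementary material; the sketch verifies itself against a brute-force DN × DN collapse-and-evolve chain:

```python
import numpy as np

def block(M, w_curr, w_prev, N, D):
    """D x D block of a DN x DN matrix connecting ancilla components
    'at symbol w_prev' to ancilla components 'at symbol w_curr',
    assuming ancilla-major indexing (row index = i*N + w)."""
    rows = [i * N + w_curr for i in range(D)]
    cols = [j * N + w_prev for j in range(D)]
    return M[np.ix_(rows, cols)]

def fast_probability(seq, rho0, V_of, N, D):
    """Sequence probability as a product of cached D x D blocks
    followed by a single trace."""
    R = block(rho0, seq[0], seq[0], N, D)    # block of rho_0 at w_1
    T = np.eye(D, dtype=complex)
    for prev, curr in zip(seq, seq[1:]):
        T = block(V_of(curr), curr, prev, N, D) @ T
    return np.trace(T @ R @ T.conj().T).real

def slow_probability(seq, rho0, V_of, N, D):
    """Brute-force reference: full DN x DN collapse-and-evolve chain."""
    rho, prob = rho0.astype(complex), 1.0
    for t, w in enumerate(seq):
        if t > 0:
            V = V_of(w)
            rho = V @ rho @ V.conj().T
        Pi = np.kron(np.eye(D), np.diag(np.eye(N)[w])).astype(complex)
        pw = np.trace(Pi @ rho).real
        prob *= pw
        rho = Pi @ rho @ Pi / pw
    return prob

rng = np.random.default_rng(1)
N, D = 2, 2
Z = rng.normal(size=(N * D, N * D)) + 1j * rng.normal(size=(N * D, N * D))
Vfix, _ = np.linalg.qr(Z)                    # one fixed random unitary
rho0 = np.kron(np.eye(D) / D, np.diag([0.7, 0.3]))

seq = [0, 1, 1]
print(fast_probability(seq, rho0, lambda w: Vfix, N, D))
print(slow_probability(seq, rho0, lambda w: Vfix, N, D))   # same value
```

Here T is accumulated in the order dictated by the physical evolve-and-collapse chain; the paper's stated product order may correspond to a transposed or conjugated entry convention, but the resulting trace is the same.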

Optimisation and Numerical Issues
In order to optimise the parameters U we numerically minimise the perplexity Γ computed on a given training corpus of sequences S. This requires that the matrices U remain strictly unitary at every step of the minimisation procedure and it can be accomplished in various ways.
The most straightforward way is to employ an explicit parametrization for unitary matrices, as was done in (Spengler et al., 2010). Due to the transcendental functions employed in this parametrisation, this approach resulted in a functional form for Γ that has proven to be very challenging to minimise efficiently in our experiments.
A more elegant and efficient approach is to consider the entries of U as parameters (thereby ensuring a polynomial functional form for Γ) and to employ techniques from differential geometry to keep the parameters from leaving the unitary subspace at each minimisation step. This can be done using a modification of the approach outlined in (Tagare, 2011), which treats the set of unitary matrices as a manifold, the Stiefel manifold U(DN). It is then possible to project the gradient ∇f of a generic function f(M) of the matrix variable M onto the tangent space of the Stiefel manifold and build a line-search algorithm that sweeps out curves on this manifold, so that at each point the parameters are guaranteed to form a unitary matrix.
In our case we have multiple unitary matrices U = (U_1, ..., U_p). This simply results in curves defined on U(DN)^p, parametrised by a p-dimensional vector of DN × DN unitary matrices.
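A minimal sketch of one curvilinear update in the spirit of Tagare's method follows; it uses the Cayley-transform step common to Stiefel-manifold optimisers, with the line search and step-size control omitted:

```python
import numpy as np

def stiefel_step(M, G, tau):
    """One curvilinear step on the unitary manifold: project the Euclidean
    gradient G onto the tangent space at the unitary matrix M and move
    along a Cayley-transform curve, which keeps the iterate exactly
    unitary for every step size tau."""
    A = G @ M.conj().T - M @ G.conj().T          # skew-Hermitian generator
    n = M.shape[0]
    Q = np.linalg.solve(np.eye(n) + (tau / 2) * A,
                        np.eye(n) - (tau / 2) * A)   # Cayley transform
    return Q @ M                                  # still unitary

# Demo: start from the identity, take a step along a random gradient.
rng = np.random.default_rng(0)
M = np.eye(4, dtype=complex)
G = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
M1 = stiefel_step(M, G, tau=0.1)
print(np.allclose(M1.conj().T @ M1, np.eye(4)))   # True: unitarity preserved
```

For the multi-matrix case U = (U_1, ..., U_p), the same step is simply applied componentwise, which realises a curve on U(DN)^p.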

Formula for the gradient
To implement the curvilinear search method described in (Tagare, 2011) one needs an expression for the gradient G = (G_1, ..., G_p) of the probability function. This gradient is organised as a p-dimensional vector of DN × DN matrices, such that the component G_j is obtained by computing the matrix derivative of P(w|ρ_0, U) with respect to U_j, either analytically or by applying some numerical estimate of the gradient, for example finite differences. The latter method, when working with thousands or millions of variables, can be very time consuming; usually, an explicit analytic formula for the gradient considerably accelerates all the required processing.
A lengthy analytic computation yields an explicit result. Firstly, we introduce the following objects: • the spectral decomposition of U_j, given by U_j = S_j D_j S_j†, guaranteed to exist by the spectral theorem; S_j is unitary and the diagonal matrix D_j contains the eigenvalues (u_{j1}, ..., u_{jDN}) of U_j, j = 1, ..., p.
• The DN × DN matrices C_j(α), defined entrywise in terms of the eigenvalues u of U_j and their complex conjugates ū.
• The lesser and greater products associated with the construction of the system evolution matrices,

V_{<j}(w) = U_1^{α_1(w)} · · · U_{j−1}^{α_{j−1}(w)},   V_{>j}(w) = U_{j+1}^{α_{j+1}(w)} · · · U_p^{α_p(w)},

so that V(w) = V_{<j}(w) U_j^{α_j(w)} V_{>j}(w).
With these ingredients we obtain an explicit expression (3) for the components G_j of the gradient, in which · denotes the element-wise (Hadamard) matrix product. Again, all the detailed calculations for obtaining the analytic expression (3) for the gradient G_j can be found in the supplementary material. Using Tagare's method we can project the gradient onto the Stiefel manifold and build a curvilinear search algorithm for the minimisation.
To achieve this aim, Tagare proposed an Armijo-Wolfe line search inserted into a simple gradient descent procedure. We developed an extension of this algorithm, combining the Stiefel-manifold minimisation technique with a Moré-Thuente (1994) line search and a Conjugate Gradient minimisation algorithm that uses the Polak-Ribière method for combining gradients and search directions (Nocedal and Wright, 2006). All the experiments presented in the next section were performed using these methods.
The minimisation uses random mini-batches that increase their size during training: they start at approximately one tenth of the training set size and grow to include all the instances following a parametrised logistic function. As stopping criterion we used the minimum of the perplexity function over the validation set, as suggested in (Bengio, 2012; Prechelt, 2012) for other machine learning techniques.
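A logistic growth schedule of this kind can be sketched as follows; all parameter values (midpoint, steepness) are illustrative, not the paper's:

```python
import numpy as np

def minibatch_size(epoch, total, start_frac=0.1, midpoint=10, steepness=0.5):
    """Parametrised logistic schedule: mini-batches start at roughly
    start_frac of the training set and grow towards the full set.
    midpoint and steepness control when and how fast the growth happens."""
    frac = start_frac + (1.0 - start_frac) / (
        1.0 + np.exp(-steepness * (epoch - midpoint)))
    return int(round(frac * total))

# Example: 3696 training utterances, schedule sampled every 5 epochs.
sizes = [minibatch_size(e, total=3696) for e in range(0, 31, 5)]
print(sizes)   # grows monotonically towards the full training set
```

The smooth ramp lets early epochs run on cheap noisy estimates of Γ while late epochs see (almost) the full training set.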

Data
The TIMIT corpus is a read-speech corpus designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems (Garofolo et al., 1990). It contains broadband recordings of 630 speakers of eight major dialects of American English and includes time-aligned orthographic, phonetic and word transcriptions as well as a 16-bit, 16 kHz speech waveform file for each utterance.
In the speech community, the TIMIT corpus is the base for a standard phone-recognition task with specific evaluation procedures described in detail in (Lopes and Perdigao, 2011). We stick completely to this evaluation to test the effectiveness of our proposed model adopting, among the other procedures, the same splitting between the different data sets: the training set contains 3696 utterances (140225 phones), the validation set 400 utterances (15057 phones) and the test set 192 utterances (7215 phones).

Evaluation Results
We tested the proposed model by setting up two different evaluations: the first is an intrinsic evaluation of LM performance in terms of global perplexity on the TIMIT test set; the second is an extrinsic evaluation in which we replace the LM tools provided with the Kaldi ASR toolkit (Povey et al., 2011b) with our model in order to check the final system performance in a phone-recognition task and compare it with the other state-of-the-art LM techniques briefly introduced in Section 1.

Intrinsic evaluation
The first experiment consisted in an evaluation of model perplexity (PPL) on the TIMIT test set. We compared the QLM with two N-gram implementations, namely CMU-SLM (Clarkson and Rosenfeld, 1997) and IRSTLM (Federico et al., 2008), and two recurrent NN models able to produce state-of-the-art results in language modelling, the RNNLM (Mikolov et al., 2010, 2011) and LSTMLM (Soutner and Müller, 2015) packages. Table 1 shows the results of the intrinsic evaluation. With regard to RNNLM and LSTMLM, only the best hyper-parameter combination, selected after extensive experimentation on the validation set, has been inserted into the Table. With regard to QLM, all the presented experiments are based on artificial word vectors produced randomly using values from the set {−1, 0, 1} instead of real word embeddings. Every word vector is different from the others, and we decided not to use real embeddings in order to test the core QMT method without adding the contextual information contained in word embeddings, which, at least in principle, could have helped our approach obtain better performance.

Extrinsic evaluation
The "TIMIT recipe" contained in the Kaldi distribution 2 reproduces exactly the evaluation setting described in (Lopes and Perdigao, 2011) for a phone-recognition task based on this corpus. Moreover, Kaldi provides some n-best rescoring scripts that apply RNNLM hypothesis rescoring and interpolate the results with the standard N-gram model used in the evaluation. We slightly modified these scripts to work with LSTMLM and QLM in order to test the different models in the same setting. This allowed us to replace the LM used in Kaldi and experiment with all the systems evaluated in the previous section. Table 2 outlines the results we obtained by replacing the LM technique in the Kaldi ASR package, w.r.t. the different ASR systems that the TIMIT recipe implements. These systems are built on top of MFCC, LDA, MLLT, fMLLR with CMN 3 features (see (Povey et al., 2011b; Rath et al., 2013) for all acronym references and complete feature and recipe descriptions).
For this extrinsic evaluation we used the best models obtained in the previous experiments, interpolating their log-probability results for each utterance with the original bigram (or trigram) log-probability using a linear model with a 0.25/0.75 ratio between the original N-gram LM and the tested one, as suggested in the standard Kaldi rescoring script. For this test we rescored the 10,000-best hypotheses.
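The interpolation step can be sketched as follows; the weights implement the 0.25/0.75 ratio mentioned above, while the log-probability values and the helper name are invented for illustration:

```python
# Hypothetical per-hypothesis log-probabilities from the baseline N-gram
# LM and from the rescoring model (QLM, RNNLM, ...).
def interpolated_score(logp_ngram, logp_model, lam=0.25):
    """Linear combination of log-probabilities: lam goes to the original
    N-gram LM, (1 - lam) to the tested model."""
    return lam * logp_ngram + (1.0 - lam) * logp_model

# Rescoring an n-best list: pick the hypothesis with the best combined score.
hyps = [(-34.2, -30.8), (-33.5, -31.9), (-35.0, -29.7)]   # invented numbers
best = max(range(len(hyps)), key=lambda i: interpolated_score(*hyps[i]))
print(best)   # index of the rescored 1-best hypothesis
```

In the real setup this selection runs over the 10,000-best hypotheses produced by the Kaldi decoder for each utterance.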
We have to say that in this experiment we were not trying to build the best possible phone recogniser, but simply to compare the relative performances of the analysed LM techniques, showing the effectiveness of QLM when used in a real application. Thus the absolute Phone Error Rate is not so important here, and it is certainly possible to devise recognisers with better performances by applying more sophisticated techniques. For example, (Peddinti et al., 2015) presented a method for lattice rescoring in Kaldi that exhibits better performances than the n-best rescoring we used to interpolate between N-grams and the tested models, but modifying it in order to test LSTMLM and QLM presented many problems, and thus we decided to use the simpler n-best approach. For completeness, the last column of Table 2 outlines the results obtained using this lattice-rescoring method with RNNLM, as described in (Peddinti et al., 2015).

Discussion and conclusions
We presented a new technique for building LMs based on QMT and its probability calculus, testing it extensively with both intrinsic and extrinsic evaluation methods.
The PPL results for the intrinsic evaluation, outlined in Table 1, show a clear superiority of the proposed method when compared with state-of-the-art techniques such as RNNLM and LSTMLM. It is interesting to note that even using D = 20, i.e. a system with a quarter of the parameters, and therefore much less 'memory', than the system with D = 40, we obtain a PPL better than that of the other methods.
With regard to the second experiment, an extrinsic evaluation in which we replaced the LM of an ASR system with the LM produced by each of the tested methods (see Table 2), QLM consistently exhibits the best performance for all the tested ASR systems from the Kaldi "TIMIT recipe". Despite using an n-best technique for hypothesis rescoring in this evaluation, which is known to perform worse than the lattice-rescoring method proposed in (Peddinti et al., 2015), the QLM performances are even better than those of that method.
The approach we have presented in this paper is not without problems: the number of different word types in the considered language has to be small in order to keep the model computationally tractable. Even though the code we used in the evaluations is analytically highly optimised, the training of this model is rather slow and requires significant computational resources even for small problems. In contrast, inference is very quick, faster than the RNNLM and LSTMLM packages we tested.
The main research question that drove this work was to verify whether the distinguishing properties of quantum probability theory, namely interference and system entanglement, which could allow the ancilla to have a "potentially infinite" memory, were enough to build stochastic systems more powerful than those built using classical probabilities or recurrent NNs. Our main aim was not to build a complete model able to handle all possible LM scenarios, but to present a "proof-of-concept" study to test the potential of this approach. For this reason we tried to keep the model as simple as possible using orthogonal projectors: for measuring probabilities, projecting the system state, each word is mapped onto a single basis vector and the dimension of the system Hilbert space, N, is equal to the number of different words. Given the matrix dimensions that we have to manage when we add the ancilla, DN × DN, this setting does not scale to real LM problems (e.g. the Brown corpus), even though the calculations are performed using D × D submatrices, but it allowed us to successfully verify the research question. For the same reason, out-of-vocabulary words cannot be handled by this model, because there are no basis vectors assigned to them.
In order to overcome these limitations, this work can be extended by using generalized quantum measurement operators (POVMs) and by using a different structure for the system Hilbert space: instead of mapping each word onto a single basis vector, we can span this space using as basis the same p basis vectors used to define the V matrices. In this way we will project the system state on a generic word vector built as a superposition on the p-basis. Such an improvement would dramatically reduce the dimensions of the matrices to Dp × Dp, potentially mitigating the computational issue. Moreover, this would also solve the problem of out-of-vocabulary words, allowing for a proper management of the large set of different words typical of real applications. We are still working on these improvements and we hope to obtain a complete model soon.

Table 2: Phone-recognition performances, in terms of Phone Error Rate, for the TIMIT dataset and the different Kaldi ASR models, rescoring the 10,000-best solutions with the tested LM techniques interpolated with the IRSTLM bigram and trigram LMs (the standard LMs used in Kaldi). In boldface the best performing system and in italics the second best. Kaldi ASR system descriptions: tri1 = a triphone model using 13-dim. MFCC+∆+∆∆; tri2 = tri1+LDA+MLLT; tri3 = tri2+SAT; SGMM2 = Semi-supervised Gaussian Mixture Model (Huang and Hasegawa-Johnson, 2010; Povey et al., 2011a); Dan NN = DNN model by (Zhang et al., 2014).
With this contribution we would also like to raise some interest in the community in analysing and developing more effective techniques, both on the modelling and minimisation/learning sides, to allow building real-world applications based on this framework. QMT and its probability calculus seem to be promising methodologies for enhancing the performance of our systems in NLP and certainly deserve further investigation.