MUSE: Modularizing Unsupervised Sense Embeddings

This paper proposes to address the word sense ambiguity issue in an unsupervised manner, where word sense representations are learned along a word sense selection mechanism given contexts. Prior work focused on designing a single model to deliver both mechanisms, and thus suffered from either coarse-grained representation learning or inefficient sense selection. The proposed modular approach, MUSE, implements flexible modules to optimize distinct mechanisms, achieving the first purely sense-level representation learning system with linear-time sense selection. We leverage reinforcement learning to enable joint training on the proposed modules, and introduce various exploration techniques on sense selection for better robustness. The experiments on benchmark data show that the proposed approach achieves the state-of-the-art performance on synonym selection as well as on contextual word similarities in terms of MaxSimC.


Introduction
Recently, deep learning methodologies have dominated several research areas in natural language processing (NLP), such as machine translation, language understanding, and dialogue systems. However, most applications rely on word-level embeddings to capture semantics. Because natural language is highly ambiguous, standard word embeddings may suffer from polysemy issues. Neelakantan et al. (2014) pointed out that, due to the triangle inequality in vector space, if one word has two different senses but is restricted to one embedding, the sum of the distances between the word and its synonym in each sense upper-bounds the distance between the respective synonyms, which may be mutually irrelevant, in the embedding space. Because a single embedding per word is theoretically unable to account for polysemy, multi-sense word representations have been proposed to address the ambiguity issue using multiple embeddings for the different senses of a word (Reisinger and Mooney, 2010; Huang et al., 2012).
This paper focuses on unsupervised learning from an unannotated corpus. There are two key mechanisms for a multi-sense word representation system in such a scenario: 1) a sense selection (decoding) mechanism infers the most probable sense for a word given its context and 2) a sense representation mechanism learns to embed word senses in a continuous space.
Under this framework, prior work focused on designing a single model to deliver both mechanisms (Neelakantan et al., 2014;Li and Jurafsky, 2015;Qiu et al., 2016). However, the previously proposed models introduce side-effects: 1) mixing word-level and sense-level tokens achieves efficient sense selection but introduces ambiguous word-level tokens during the representation learning process (Neelakantan et al., 2014;Li and Jurafsky, 2015), and 2) pure sense-level tokens prevent ambiguity from word-level tokens but require exponential time complexity when decoding a sense sequence (Qiu et al., 2016).
Unlike prior work, this paper proposes MUSE, a novel modularization framework incorporating sense selection and representation learning models, which implements flexible modules to optimize distinct mechanisms. Specifically, MUSE enables linear-time sense identity decoding with a sense selection module and purely sense-level representation learning with a sense representation module.
With the modular design, we propose a novel joint learning algorithm on the modules by connecting to a reinforcement learning scenario, which achieves the following advantages. First, the decision making process under reinforcement learning better captures the sense selection mechanism than probabilistic and clustering methods. Second, our reinforcement learning algorithm realizes the first single objective function for modular unsupervised sense representation systems. Finally, we introduce various exploration techniques under reinforcement learning on sense selection to enhance robustness.
In summary, our contributions are five-fold:
• MUSE is the first system that maintains purely sense-level representation learning with linear-time sense decoding.
• We are among the first to leverage reinforcement learning to model the sense selection process in sense representation systems.
• We are among the first to propose a single objective for modularized unsupervised sense embedding learning.
• We introduce a sense exploration mechanism for the sense selection module to achieve better flexibility and robustness.
• Our experimental results show state-of-the-art performance for synonym selection and contextual word similarities in terms of MaxSimC.

Related Work
There are three dominant types of approaches for learning multi-sense word representations in the literature: 1) clustering methods, 2) probabilistic modeling methods, and 3) lexical ontology based methods. Our reinforcement learning based approach can be loosely connected to clustering methods and probabilistic modeling methods. Reisinger and Mooney (2010) first proposed multi-sense word representations on the vector space based on clustering techniques. With the power of deep learning, some work exploited neural networks to learn embeddings with sense selection based on clustering (Huang et al., 2012; Neelakantan et al., 2014). Other work replaced the clustering procedure with a word sense disambiguation model using WordNet (Miller, 1995). Kågebäck et al. (2015) and Vu and Parker (2016) further leveraged a weighting mechanism and an interactive process in the clustering procedure. Moreover, Guo et al. (2014) leveraged bilingual resources for clustering. However, most of the above approaches separated the clustering procedure from the representation learning procedure without a joint objective, which may suffer from error propagation. Instead, the proposed approach, MUSE, enables joint training on sense selection and representation learning.
Instead of clustering, probabilistic modeling methods have been applied to learning multi-sense embeddings in order to make sense selection more flexible: Tian et al. (2014) and Jauhar et al. (2015) conducted probabilistic modeling with EM training, and Li and Jurafsky (2015) exploited the Chinese Restaurant Process to infer the sense identity. Furthermore, Bartunov et al. (2016) developed a non-parametric Bayesian extension of the skip-gram model (Mikolov et al., 2013b). Despite reasonable modeling of sense selection, all of the above methods mixed word-level and sense-level tokens during representation learning, and were unable to conduct representation learning at the pure sense level due to the complicated computation in their EM algorithms.
Recently, Qiu et al. (2016) proposed an EM algorithm to learn purely sense-level representations, but the computational cost of decoding the sense identity sequence is high because it takes exponential time to search all sense combinations within a context window. Our modular design addresses this drawback: the sense selection module decodes a sense sequence with linear-time complexity, while the sense representation module keeps representation learning at the pure sense level.
Unlike much related work that requires additional resources such as a lexical ontology (Pilehvar and Collier, 2016; Rothe and Schütze, 2015; Jauhar et al., 2015; Chen et al., 2015; Iacobacci et al., 2015) or bilingual data (Guo et al., 2014; Ettinger et al., 2016; Šuster et al., 2016), which may be unavailable in some languages, our model can be trained using only an unlabeled corpus. Also, some prior work proposed to learn topical embeddings and word embeddings jointly in order to consider the contexts (Liu et al., 2015a,b), whereas this paper focuses on learning multi-sense word embeddings.
Figure 1: The MUSE architecture with a 3-step learning algorithm: 1) collocation sampling, 2) sense selection for sense representation learning, and 3) optimizing sense selection with a reward signal from sense representation. The reward signal is only passed to the target word to stabilize model training due to the directional architecture in the sense representation module.

Proposed Approach: MUSE
This work proposes a framework to modularize two key mechanisms for multi-sense word representations: a sense selection module and a sense representation module. The sense selection module decides which sense to use given a text context, whereas the sense representation module learns meaningful representations based on its statistical characteristics. Unlike prior work that must suffer from either inefficient sense selection (Qiu et al., 2016) or coarse-grained representation learning (Neelakantan et al., 2014;Li and Jurafsky, 2015;Bartunov et al., 2016), the proposed modularized framework is capable of performing efficient sense selection and learning representations in pure sense level simultaneously.
To learn sense-level representations, a sense selection model should be first established for sense identity decoding. On the other hand, the sense embeddings should guide the sense selection model when decoding a sense identity sequence. Therefore, these two modules should be tangled. This indicates that a naive two-stage algorithm or two separate learning algorithms proposed by prior work are not optimal.
By connecting the proposed formulation with the reinforcement learning literature, we design a novel joint training algorithm. In addition, taking advantage of the reinforcement learning formulation, we are among the first to investigate various exploration techniques in sense selection for unsupervised sense embedding learning.

Model Architecture
Our model architecture is illustrated in Figure 1, where there are two modules in optimization.

Sense Selection Module
Formally speaking, given a corpus $C$, vocabulary $W$, and the $t$-th word $C_t = w_i \in W$, we would like to find the most probable sense $z_{ik} \in Z_i$, where $Z_i$ is the set of senses of word $w_i$. Assuming that a word sense is determined by the local context, we exploit a local context $\bar{C}_t = \{C_{t-m}, \cdots, C_{t+m}\}$ for sense selection according to the Markov assumption, where $m$ is the size of a context window. Then we can either formulate a probabilistic policy $\pi(z_{ik} \mid \bar{C}_t)$ about sense selection or estimate the individual likelihood $q(z_{ik} \mid \bar{C}_t)$ for each sense identity.
To ensure efficiency, here we exploit a linear neural architecture that takes word-level input tokens and outputs sense-level identities. The architecture is similar to continuous bag-of-words (CBOW) (Mikolov et al., 2013a). Specifically, given a word embedding matrix $P$, the local context can be modeled as the summation of word embeddings from its context $\bar{C}_t$. The output can be formulated with a 3-mode tensor $Q$, whose dimensions denote words, senses, and latent variables. Then we can model $\pi(z_{ik} \mid \bar{C}_t)$ or $q(z_{ik} \mid \bar{C}_t)$ correspondingly. Here we model $\pi(\cdot)$ as a categorical distribution using a softmax layer:
$$\pi(z_{ik} \mid \bar{C}_t) = \frac{\exp\left(Q_{ik}^T \sum_{j \in \bar{C}_t} P_j\right)}{\sum_{k' \in Z_i} \exp\left(Q_{ik'}^T \sum_{j \in \bar{C}_t} P_j\right)}. \tag{1}$$
On the other hand, the likelihood of selecting distinct sense identities, $q(z_{ik} \mid \bar{C}_t)$, is modeled as a Bernoulli distribution with a sigmoid function $\sigma(\cdot)$:
$$q(z_{ik} \mid \bar{C}_t) = \sigma\left(Q_{ik}^T \sum_{j \in \bar{C}_t} P_j\right). \tag{2}$$
Different modeling approaches require different learning methods, especially in the unsupervised setting. We defer the corresponding learning algorithms to § 3.2. Finally, with a built sense selection module, we can apply any selection algorithm, such as a greedy selection strategy, to infer the sense identity $z_{ik}$ given a word $w_i$ with its context $\bar{C}_t$.
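The two output layers in (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the toy sizes, the random initialization, and the function names `context_vector`, `policy`, and `q_values` are assumptions made for the example.

```python
import numpy as np

def context_vector(P, context_ids):
    """Sum the word embeddings of the context window (CBOW-style)."""
    return P[context_ids].sum(axis=0)

def policy(Q, word_id, ctx):
    """Softmax over the senses of one word, following the form of (1)."""
    scores = Q[word_id] @ ctx        # one score per sense
    scores = scores - scores.max()   # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def q_values(Q, word_id, ctx):
    """Independent sigmoid estimate per sense, following the form of (2)."""
    return 1.0 / (1.0 + np.exp(-(Q[word_id] @ ctx)))

# Toy sizes and random parameters (hypothetical, for illustration only).
rng = np.random.default_rng(0)
vocab, senses, dim = 6, 3, 4
P = rng.uniform(-0.5, 0.5, size=(vocab, dim))          # word embeddings
Q = rng.uniform(-0.5, 0.5, size=(vocab, senses, dim))  # word x sense x latent

ctx = context_vector(P, [0, 1, 2, 3, 4])  # window around target word id 2
pi = policy(Q, 2, ctx)
q = q_values(Q, 2, ctx)
```

Both functions share the same per-sense score, so greedy selection picks the same sense under either parameterization; they differ in how the estimates are coupled, which matters for learning.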
We note that the modularized model enables efficient sense selection by leveraging word-level tokens, while keeping purely sense-level tokens in the representation module. Specifically, if $n$ denotes $\max_k |Z_k|$, decoding $L$ words requires searching $O(nL)$ senses because senses are selected independently. The prior work using a single model with purely sense-level tokens (Qiu et al., 2016) requires exponential time, $O(n^{2m})$, to calculate the collocation energy for every possible combination of sense identities within a context window for a single target sense. Further, Qiu et al. (2016) took an additional sequence decoding step with time complexity $O(n^{4m} L)$, which is quadratic in the exponential base unit $n^{2m}$. This demonstrates the gain in sense inference efficiency of our proposed model.
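To make the gap concrete, the following small calculation compares the two search sizes for the setting used later in the experiments (3 senses per word, window size 5). The sequence length of 20 words and the helper names are assumptions chosen for illustration.

```python
def senses_searched_independent(n, L):
    """Independent selection: each of the L words scores its n senses once."""
    return n * L

def combinations_per_window(n, m):
    """Joint decoding: every assignment of n senses to the 2m context words."""
    return n ** (2 * m)

n, m, L = 3, 5, 20   # L = 20 is a hypothetical sentence length
independent = senses_searched_independent(n, L)
per_window = combinations_per_window(n, m)
print(independent, per_window)
```

With these values, independent selection scores 60 senses for the whole sequence, while enumerating sense combinations costs 59049 evaluations for a single window.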

Sense Representation Module
The semantic representation learning is typically formulated as a maximum likelihood estimation (MLE) problem for collocation likelihood. In this paper, we use the skip-gram formulation (Mikolov et al., 2013b) because it requires less training time: only two sense identities are required for stochastic training. Other popular candidates, like GloVe (Pennington et al., 2014) and CBOW (Mikolov et al., 2013a), require more sense identities to be selected as input and are thus not suitable for our scenario. For example, GloVe takes computationally expensive collocation counting statistics for each token in a corpus as input, which requires sense selection for every occurrence of the target word across the whole corpus for a single optimization step.
To learn the representations, we first create an input sense representation matrix $U$ and a collocation estimation matrix $V$ as the learning targets. Given a target word $w_i$ and a collocated word $w_j$ with corresponding local contexts, we map them to their sense identities $z_{ik}$ and $z_{jl}$ by the sense selection module, and maximize the sense collocation log likelihood $\log L(\cdot)$. A natural choice of the likelihood function is a categorical distribution over all possible collocated senses given the target sense $z_{ik}$:
$$L(z_{jl} \mid z_{ik}) = \frac{\exp\left(U_{z_{ik}}^T V_{z_{jl}}\right)}{\sum_{z_{uv}} \exp\left(U_{z_{ik}}^T V_{z_{uv}}\right)}. \tag{3}$$
Instead of enumerating all possible collocated senses, which is computationally expensive, we use the skip-gram objective (4) (Mikolov et al., 2013b) to approximate (3), as shown in the green block of Figure 1:
$$\log \bar{L}(z_{jl} \mid z_{ik}) = \log \sigma\left(U_{z_{ik}}^T V_{z_{jl}}\right) + \sum_{v=1}^{M} \mathbb{E}_{z_{uv} \sim p_{neg}(z)}\left[\log \sigma\left(-U_{z_{ik}}^T V_{z_{uv}}\right)\right], \tag{4}$$
where $p_{neg}(z)$ is the distribution over all senses for negative samples. In our experiment with $|Z_i|$ senses for word $w_i$, we use $1/|Z_i|$ of the word-level unigram probability as the sense-level unigram probability for efficiency, together with the 3/4-th power trick of Mikolov et al. (2013b). We note that our modular framework can easily maintain purely sense-level tokens with an arbitrary representation learning model. In contrast, most related work using probabilistic modeling (Tian et al., 2014; Jauhar et al., 2015; Li and Jurafsky, 2015; Bartunov et al., 2016) bound sense representations to the sense selection mechanism, so efficient sense selection by leveraging word-level tokens can be achieved only at the cost of mixing word-level and sense-level tokens in their representation learning process.
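The negative-sampling objective in (4) can be sketched as follows. This is an illustrative sketch, not the authors' code: the toy counts, the flat sense indexing (sense k of word i stored at row i * senses_per_word + k), and the order of operations in `sense_unigram` (split the word mass evenly first, then apply the 3/4 power) are assumptions.

```python
import numpy as np

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -np.logaddexp(0.0, -x)

def sense_unigram(word_counts, senses_per_word, power=0.75):
    """Noise distribution over senses (assumed construction): split each
    word's unigram mass evenly among its senses, raise to the 3/4 power,
    and renormalize."""
    word_counts = np.asarray(word_counts, dtype=float)
    per_sense = np.repeat(word_counts / senses_per_word, senses_per_word)
    weights = per_sense ** power
    return weights / weights.sum()

def skipgram_log_likelihood(U, V, target, collocated, negatives):
    """Sampled estimate of (4): one positive term plus M negative terms."""
    positive = log_sigmoid(U[target] @ V[collocated])
    negative = log_sigmoid(-(V[negatives] @ U[target])).sum()
    return float(positive + negative)

rng = np.random.default_rng(0)
senses_per_word, dim, M = 3, 8, 5
p_neg = sense_unigram([10, 30, 60], senses_per_word)  # toy counts, 3 words
U = rng.uniform(-0.1, 0.1, size=(len(p_neg), dim))
V = np.zeros((len(p_neg), dim))                       # zero init, as in the setup
negatives = rng.choice(len(p_neg), size=M, p=p_neg)
ll = skipgram_log_likelihood(U, V, target=0, collocated=4, negatives=negatives)
```

With `V` initialized to zero every dot product is 0, so the estimate starts at (M + 1) log 0.5 and can only move toward 0 as training separates true collocations from noise senses.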

Learning
Without the supervised signal for the proposed modules, it is desirable to connect two modules in a way where they can improve each other by their own estimations. First, a trivial way is to forward the prediction of the sense selection module to the representation module. Then we cast the estimated collocation likelihood as a reward signal for the selected sense for effective learning.
To realize the above procedure, we cast the learning problem as a one-step Markov decision process (MDP) (Sutton and Barto, 1998), where the state, action, and reward correspond to the context $\bar{C}_t$, the sense $z_{ik}$, and the collocation log likelihood $\log \bar{L}(\cdot)$, respectively. Depending on the modeling method ((1) or (2)) used in the sense selection module, we connect the model to the respective reinforcement learning algorithm to solve the MDP. Specifically, we treat (1) as the policy distribution and (2) as Q-value estimation in the reinforcement learning literature.
The proposed MDP framework embodies several nuances of sense selection. First, the decision of a word sense is Markov: taking the whole corpus into consideration is not more helpful than a handful of necessary local contexts. Second, the decision making in MDP exploits a hard decision for selecting sense identity, which captures the sense selection process more naturally than a joint probability distribution among senses (Qiu et al., 2016). Finally, we exploit the reward mechanism in MDP to enable joint training: the estimation of sense representation is treated as a reward signal to guide sense selection. In contrast, the decision making under clustering (Huang et al., 2012;Neelakantan et al., 2014) considers the similarity within clusters instead of the outcome of a decision using a reward signal as MDP.

Policy Gradient Method
Because (1) fits a valid probability distribution, an intuitive optimization target is the expectation of the resulting collocation likelihood over the senses. In addition, as the skip-gram formulation in (4) is unidirectional ($\bar{L}(z_{ik} \mid z_{jl}) \neq \bar{L}(z_{jl} \mid z_{ik})$), we perform one-side optimization for the target sense $z_{ik}$ to stabilize model training (we observe about a 4% performance drop when optimizing the input selection $z_{ik}$ and the output selection $z_{jl}$ simultaneously). That is, for the target word $w_i$ and the collocated word $w_j$ given respective contexts $\bar{C}_t$ and $\bar{C}_{t'}$ ($0 < |t' - t| \le m$), we first draw a sense $z_{jl}$ for $w_j$ from the policy $\pi(\cdot \mid \bar{C}_{t'})$ and optimize the expected collocation likelihood for the target sense $z_{ik}$ as follows:
$$\max_{P, Q} \; \mathbb{E}_{z_{ik} \sim \pi(\cdot \mid \bar{C}_t)}\left[\log \bar{L}(z_{jl} \mid z_{ik})\right]. \tag{5}$$
Note that (4) can be merged into (5) as a single objective. The objective is differentiable and supports stochastic optimization (Lei et al., 2016), which uses a stochastic sample $z_{ik}$ for optimization.
However, there are two possible disadvantages in this formulation. First, because the policy assumes the probability distribution in (1), optimizing the selected sense must affect the estimation of the other senses. Second, applying stochastic gradient ascent to optimize (5) would always lower the probability estimation for the selected sense $z_{ik}$, even if the model accurately selects the right sense. The detailed proof is in Appendix A.
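The second disadvantage can be seen in a tiny numerical example. The sketch below applies one score-function (REINFORCE-style) ascent step directly to the softmax logits rather than to P and Q, which is a simplifying assumption; the logits, reward, and learning rate are hypothetical values.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def reinforce_step(logits, chosen, reward, lr):
    """One stochastic ascent step on reward * log pi(chosen).
    For a softmax, d log pi(chosen) / d logits = onehot(chosen) - pi."""
    pi = softmax(logits)
    grad_log_pi = -pi
    grad_log_pi[chosen] += 1.0
    return logits + lr * reward * grad_log_pi

logits = np.array([0.2, -0.1, 0.4])
chosen = 2                 # the sampled sense, here also the most probable one
reward = np.log(0.9)       # a high collocation likelihood, yet log(0.9) < 0

before = softmax(logits)[chosen]
after = softmax(reinforce_step(logits, chosen, reward, lr=0.1))[chosen]
print(before, after)
```

Because the reward is a log likelihood and therefore never positive, the step pushes the chosen sense's logit down and the others up, so `after` is smaller than `before` even though the selected sense collocates well.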

Value-Based Method
To address the above issues, we apply the Q-learning algorithm (Mnih et al., 2013). Instead of maintaining a probabilistic policy for sense selection, Q-learning estimates the Q-value (the resulting collocation log likelihood) for each sense candidate directly and independently. Thus, the estimation of unselected senses is not influenced by the selected one. Note that in a one-step MDP the reward is equivalent to the Q-value, so we use reward and Q-value interchangeably hereinafter, based on the context. We further follow the convention of recent neural reinforcement learning by reducing the reward range to aid model training (Mnih et al., 2013). Specifically, we replace the log likelihood $\log \bar{L}(\cdot) \in (-\infty, 0]$ with the likelihood $\bar{L}(\cdot) \in [0, 1]$ as the reward function. Because $\log(\cdot)$ is monotonic, the relative ordering of the reward remains the same.
Furthermore, we exploit the probabilistic nature of likelihood for Q-learning. To elaborate, as Q-learning is used to approximate the Q-value for each action in typical reinforcement learning, most literature adopted square loss to characterize the discrepancy between the target and estimated Q-values (Mnih et al., 2013). In our setting, where the Q-value/reward is a likelihood function, our model exploits cross-entropy loss to better capture the characteristics of a probability distribution.
Given that the collocation likelihood in (4) is an approximation to the original categorical distribution with a softmax function shown in (3) (Mikolov et al., 2013b), we revise the formulation by omitting the negative sampling term. The resulting formulation $\hat{L}(\cdot)$ is a Bernoulli distribution indicating whether $z_{jl}$ collocates or not given $z_{ik}$:
$$\hat{L}(z_{jl} \mid z_{ik}) = \sigma\left(U_{z_{ik}}^T V_{z_{jl}}\right). \tag{6}$$
There are three advantages of using $\hat{L}(\cdot)$ instead of the approximated $\bar{L}(\cdot)$ and the original $L(\cdot)$. First, regarding the variance of estimation, $\hat{L}(\cdot)$ better captures $L(\cdot)$ than $\bar{L}(\cdot)$ because $\bar{L}(\cdot)$ involves sampling:
$$Var(\bar{L}(\cdot)) \ge Var(\hat{L}(\cdot)) = Var(L(\cdot)) = 0. \tag{7}$$
Second, regarding the relative ordering of estimation, for any two collocated senses $z_{jl}$ and $z_{jl'}$ with a target sense $z_{ik}$, the following equivalence holds:
$$\hat{L}(z_{jl} \mid z_{ik}) < \hat{L}(z_{jl'} \mid z_{ik}) \iff L(z_{jl} \mid z_{ik}) < L(z_{jl'} \mid z_{ik}). \tag{8}$$
Third, for collocation computation, $L(\cdot)$ requires all sense identities and $\bar{L}(\cdot)$ requires $(M + 1)$ sense identities, whereas $\hat{L}(\cdot)$ only requires 1 sense identity. In sum, the proposed $\hat{L}(\cdot)$ approximates $L(\cdot)$ with no variance, no "bias" (in terms of relative ordering), and significantly less computation. Finally, because both the target distribution $\hat{L}(\cdot)$ and the estimated distribution $q(\cdot)$ in (2) are Bernoulli distributions, we follow the last section and conduct one-side optimization by fixing a collocated sense $z_{jl}$ and optimizing the selected sense $z_{ik}$ with cross entropy:
$$\min_{P, Q} \; H\left(\hat{L}(z_{ik} \mid z_{jl}), \, q(z_{ik} \mid \bar{C}_t)\right). \tag{9}$$
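The cross-entropy target in (9) can be illustrated with a small sketch. The vectors and the helper name `bernoulli_cross_entropy` are hypothetical; the sketch only shows that the loss between two Bernoulli distributions is smallest when the selection module's estimate matches the reward.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bernoulli_cross_entropy(target, estimate, eps=1e-12):
    """H(target, estimate) for two Bernoulli parameters in [0, 1]."""
    estimate = np.clip(estimate, eps, 1.0 - eps)
    return float(-(target * np.log(estimate)
                   + (1.0 - target) * np.log(1.0 - estimate)))

# Toy sense vectors (hypothetical values).
u_target = np.array([0.3, -0.2, 0.5])      # row of U for the selected sense
v_collocated = np.array([0.1, 0.4, -0.3])  # row of V for the collocated sense

# Reward: the Bernoulli collocation likelihood sigma(U^T V), treated as a
# fixed target when updating the selection module.
reward = float(sigmoid(u_target @ v_collocated))

loss_matched = bernoulli_cross_entropy(reward, reward)
loss_low = bernoulli_cross_entropy(reward, 0.2)
loss_high = bernoulli_cross_entropy(reward, 0.8)
```

Unlike the policy gradient update, this loss only touches the estimate of the selected sense, and it can move that estimate up or down depending on where the reward lies.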

Joint Training
To jointly train the sense selection and sense representation modules, we first select a pair of collocated senses, $z_{ik}$ and $z_{jl}$, using the sense selection module with any selection strategy (e.g., greedy), and then optimize the sense representation module and the sense selection module using the above derivations. Algorithm 1 describes the proposed MUSE model training procedure. The major distinction between our modular framework and the two-stage clustering-representation learning framework (Neelakantan et al., 2014; Vu and Parker, 2016) is that we establish a reward signal from the sense representation module to the sense selection module to enable immediate and joint optimization.

Sense Selection Strategy
Given a fitness estimation for each sense, exploiting the greedy sense is the most popular strategy for clustering algorithms (Neelakantan et al., 2014; Kågebäck et al., 2015) and hard-EM algorithms (Qiu et al., 2016; Jauhar et al., 2015) in the literature. However, there are two incentives to conduct exploration. First, in the early training stage when the fitness is not well estimated, it is desirable to explore underestimated senses. Second, due to the high ambiguity of natural language, sometimes multiple senses of a word fit the same context. The dilemma between exploring sub-optimal choices and exploiting the optimal choice is called the exploration-exploitation trade-off in reinforcement learning (Sutton and Barto, 1998).
We introduce exploration mechanisms for sense selection for both policy gradient and Q-learning. For policy gradient, we sample from the policy distribution to approximate the expectation in (5). Because of the flexible formulation of Q-learning, the following classic exploration mechanisms are applied to sense selection:
• Greedy: selects the sense with the largest Q-value (no exploration).
• ε-Greedy: selects a random sense with probability ε, and adopts the greedy strategy otherwise (Mnih et al., 2013).
• Boltzmann: samples the sense based on the Boltzmann distribution modeled by the Q-value. We directly use (1) as the Boltzmann distribution for simplicity.
We note that Q-learning with Boltzmann sampling yields the same sampling process as policy gradient but a different optimization objective. To the best of our knowledge, we are among the first to explore several exploration strategies for unsupervised sense embedding learning.
Algorithm 1: Learning Algorithm. Select senses with the sense selection module; optimize U, V by (4) for the sense representation module; optimize P, Q by (5) or (9) for the sense selection module.
In the following sections, MUSE-Policy denotes the proposed MUSE model with policy learning, and MUSE-Greedy, MUSE-ε-Greedy, and MUSE-Boltzmann denote the models using the corresponding sense selection strategy for Q-learning.
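The three selection strategies can be sketched as small functions over per-sense estimates. This is a minimal sketch with hypothetical function names and toy Q-values; Boltzmann sampling here applies a softmax to the per-sense scores, mirroring the use of (1).

```python
import numpy as np

def greedy(q):
    """Pick the sense with the largest estimate (no exploration)."""
    return int(np.argmax(q))

def epsilon_greedy(q, epsilon, rng):
    """With probability epsilon pick a uniformly random sense, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))
    return greedy(q)

def boltzmann(scores, rng):
    """Sample a sense from the softmax of the per-sense scores."""
    scores = np.asarray(scores, dtype=float)
    e = np.exp(scores - scores.max())
    return int(rng.choice(len(scores), p=e / e.sum()))

rng = np.random.default_rng(0)
q = np.array([0.1, 0.7, 0.2])   # toy Q-values for three senses

picked_greedy = greedy(q)
picked_eps = epsilon_greedy(q, epsilon=0.05, rng=rng)
picked_boltzmann = boltzmann(q, rng)
```

Setting `epsilon=0` recovers the greedy strategy exactly, which makes it easy to compare exploration settings with the same code path.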

Experiments
We evaluate our proposed MUSE model in both quantitative and qualitative experiments.

Experimental Setup
Our model is trained on the April 2010 Wikipedia dump (Shaoul and Westbury, 2010), which contains approximately 1 billion tokens. For fair comparison, we adopt the same vocabulary set as Huang et al. (2012) and Neelakantan et al. (2014). For preprocessing, we convert all words to lower case, apply the Stanford tokenizer and the Stanford sentence tokenizer, and remove all sentences with fewer than 10 tokens. The number of senses per word in Q is set to 3, as in prior work (Neelakantan et al., 2014).
In the experiments, the context window size is set to 5 ($|\bar{C}_t| = 11$). The subsampling technique introduced by word2vec (Mikolov et al., 2013b) is applied to accelerate the training process. The learning rate is set to 0.025. The embedding dimension is 300. We initialize Q and V as zeros, and P and U from the uniform distribution $[-\sqrt{1/100}, \sqrt{1/100}]$ such that each embedding has unit length in expectation (Lei et al., 2015). Our model uses 25 negative senses for negative sampling in (4). We use ε = 5% for the ε-Greedy sense selection strategy. In optimization, we conduct mini-batch training with a batch size of 2048 using the following procedure: 1) select senses in the batch; 2) optimize U, V using stochastic training within the batch for efficiency; 3) optimize P, Q using mini-batch training for robustness.
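The unit-length property of the initialization can be checked numerically. The sketch below assumes the bound is √(1/100) = 0.1, the value for which a 300-dimensional vector with entries uniform on [-a, a] has expected squared norm 300 · a²/3 = 1; the sample of 1000 vectors is an arbitrary choice.

```python
import numpy as np

dim = 300
bound = np.sqrt(1.0 / 100.0)   # assumed bound, equals 0.1

rng = np.random.default_rng(0)
U = rng.uniform(-bound, bound, size=(1000, dim))   # 1000 sample embeddings

# Each entry has variance bound**2 / 3, so the expected squared norm is
# dim * bound**2 / 3 = 1 for dim = 300.
mean_sq_norm = float((U ** 2).sum(axis=1).mean())
print(mean_sq_norm)
```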

Experiment 1: Contextual Word Similarity
To evaluate the quality of the learned sense embeddings, we compute the similarity score between each word pair given their respective local contexts and compare it with the human-judged score using Stanford's Contextual Word Similarities (SCWS) dataset (Huang et al., 2012). Specifically, given a list of word pairs with corresponding contexts, $S = \{(w_i, \bar{C}_t, w_j, \bar{C}_{t'})\}$, we calculate Spearman's rank correlation ρ between human-judged similarity and model similarity estimations. Two major contextual similarity estimations are used, AvgSimC and MaxSimC. AvgSimC is formulated as
$$\text{AvgSimC}(w_i, \bar{C}_t, w_j, \bar{C}_{t'}) = \sum_{k=1}^{|Z_i|} \sum_{l=1}^{|Z_j|} \pi(z_{ik} \mid \bar{C}_t) \, \pi(z_{jl} \mid \bar{C}_{t'}) \, d(z_{ik}, z_{jl}),$$
where $d(z_{ik}, z_{jl})$ refers to the cosine similarity between $U_{z_{ik}}$ and $U_{z_{jl}}$. AvgSimC weights the similarity measurement of each sense pair $z_{ik}$ and $z_{jl}$ by their probability estimations. On the other hand, MaxSimC is a hard measurement that only considers the most probable senses:
$$\text{MaxSimC}(w_i, \bar{C}_t, w_j, \bar{C}_{t'}) = d(z_{ik}, z_{jl}), \quad z_{ik} = \arg\max_{z_{ik'}} \pi(z_{ik'} \mid \bar{C}_t), \quad z_{jl} = \arg\max_{z_{jl'}} \pi(z_{jl'} \mid \bar{C}_{t'}).$$
The baselines for comparison include classic clustering methods (Huang et al., 2012; Neelakantan et al., 2014), EM algorithms (Tian et al., 2014; Qiu et al., 2016; Bartunov et al., 2016), and the Chinese Restaurant Process (Li and Jurafsky, 2015). All approaches are trained on the same corpus, except that Qiu et al. (2016) used more recent Wikipedia dumps. The embedding sizes of all baselines are 300, except 50 in Huang et al. (2012). For every competitor with multiple settings, we report the best performance in each similarity measurement setting and show the results in Table 1.
Our MUSE model achieves state-of-the-art performance on MaxSimC, demonstrating the superior quality of independent sense embeddings. On the other hand, MUSE achieves performance comparable to the best competitor in terms of AvgSimC (68.7 vs. 69.3), while outperforming the same competitor significantly in terms of MaxSimC (67.9 vs. 60.1). The results demonstrate not only the high quality of the sense representations but also accurate sense selection.
From the application perspective, MaxSimC corresponds to a typical scenario using a single embedding per word, while AvgSimC employs multiple sense vectors simultaneously per word, which not only brings computational overhead but also requires changes to existing neural architectures for NLP. Hence, we argue that MaxSimC better characterizes practical usage of a sense representation system than AvgSimC.
Among the various learning methods for MUSE, policy gradient performs worst, echoing our argument in § 3.2.1. On the other hand, the superior performance of Boltzmann sampling and ε-Greedy over Greedy selection demonstrates the effectiveness of exploration.
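The difference between the two measures is easy to see on a toy case. The sketch below is illustrative only: the sense vectors and sense probabilities are hypothetical, and `avg_sim_c` and `max_sim_c` are names chosen for this example.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_sim_c(pi_i, pi_j, U_i, U_j):
    """Probability-weighted cosine similarity over all sense pairs."""
    return sum(pi_i[k] * pi_j[l] * cosine(U_i[k], U_j[l])
               for k in range(len(pi_i)) for l in range(len(pi_j)))

def max_sim_c(pi_i, pi_j, U_i, U_j):
    """Cosine similarity between the most probable sense of each word."""
    return cosine(U_i[int(np.argmax(pi_i))], U_j[int(np.argmax(pi_j))])

# Two words with two senses each (hypothetical embeddings).
U_i = np.array([[1.0, 0.0], [0.0, 1.0]])
U_j = np.array([[1.0, 0.0], [0.0, 1.0]])
pi_i = np.array([0.9, 0.1])   # word i is most likely in its first sense
pi_j = np.array([0.2, 0.8])   # word j is most likely in its second sense

avg_score = avg_sim_c(pi_i, pi_j, U_i, U_j)
max_score = max_sim_c(pi_i, pi_j, U_i, U_j)
```

Here MaxSimC is 0 because the two selected senses are orthogonal, while AvgSimC is 0.26 because the less likely sense pairs still contribute; MaxSimC therefore depends much more heavily on selecting the right sense.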

Experiment 2: Synonym Selection
We further evaluate our model on synonym selection using multi-sense word representations (Jauhar et al., 2015). Three standard synonym selection datasets are used: ESL-50 (Turney, 2001), RD-300 (Jarmasz and Szpakowicz, 2004), and TOEFL-80 (Landauer and Dumais, 1997). In these datasets, each question consists of a question word $w_Q$ and four answer candidates $\{w_A, w_B, w_C, w_D\}$, and the goal is to select the candidate that is most synonymous with the question word. For example, in the TOEFL-80 dataset, a question shows {(Q) enormously, (A) appropriately, (B) uniquely, (C) tremendously, (D) decidedly}, and the answer is (C). A multi-sense representation system selects the synonym of the question word $w_Q$ using the maximum sense-level cosine similarity as a proxy for semantic similarity (Jauhar et al., 2015).
Among unsupervised sense embedding approaches, CRP and MSSG refer to the baselines with the highest MaxSimC and AvgSimC in Table 1, respectively. Here we report the setting for baselines based on the best average performance in this task. We also show the performance of supervised sense embeddings as an upper bound for unsupervised methods, since they use additional supervised information from WordNet.
The results are shown in Table 2, where our MUSE-ε-Greedy and MUSE-Boltzmann significantly outperform all unsupervised sense embedding methods, echoing the superior quality of our sense embeddings.
Table 3:
Context | k-NN Senses
· · · braves finish the season in tie with the los angeles dodgers · · · | scoreless otl shootout 6-6 hingis 3-3 7-7 0-0
· · · his later years proudly wore tie with the chinese characters for · · · | pants trousers shirt juventus blazer socks anfield
· · · of the mulberry or the blackberry and minos sent him to · · · | cranberries maple vaccinium apricot apple
· · · of the large number of blackberry users in the us federal · · · | smartphones sap microsoft ipv6 smartphone
· · · shells and/or high explosive squash head hesh and/or anti-tank · · · | venter thorax neck spear millimeters fusiform
· · · head was shaven to prevent head lice serious threat back then · · · | shaved thatcher loki thorax mao luthor chest
· · · appoint john pope republican as head of the new army of · · · | multi-party appoints unicameral beria appointed
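The selection rule described above can be sketched as follows. The two-dimensional sense vectors are hypothetical toy values, and `max_sense_similarity` and `select_synonym` are names chosen for this example; only the rule itself (maximum sense-level cosine similarity) comes from the text.

```python
import numpy as np

def max_sense_similarity(A, B):
    """Largest cosine similarity over all sense pairs of two words."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).max())

def select_synonym(question, candidates):
    """Return the candidate label whose best sense pair is most similar."""
    return max(candidates,
               key=lambda name: max_sense_similarity(question, candidates[name]))

# Each word is a (num_senses, dim) array of sense embeddings (toy values).
question = np.array([[1.0, 0.0], [0.0, 1.0]])
candidates = {
    "A": np.array([[-1.0, 0.0], [0.0, -1.0]]),
    "B": np.array([[1.0, 1.0], [-1.0, -1.0]]),
    "C": np.array([[0.9, 0.1], [-1.0, 0.0]]),
    "D": np.array([[-0.5, -0.5], [-1.0, 0.2]]),
}
answer = select_synonym(question, candidates)
```

Taking the maximum over sense pairs means a candidate is chosen as soon as any one of its senses matches any sense of the question word, which suits questions presented without context.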

Qualitative Analysis
We further conduct qualitative analysis to check the semantic meanings of different senses learned by MUSE with k-nearest neighbors (k-NN) using sense representations. In addition, we provide contexts in the training corpus where the sense will be selected to validate the sense selection module. Table 3 shows the results. The learned sense embeddings of the words "tie", "blackberry", and "head" clearly correspond to correct senses under different contexts.
Since we address an unsupervised setting that learns sense embeddings from an unannotated corpus, the discovered senses highly depend on the training corpus. From our manual inspection, it is common for our model to discover only two senses in a word, like "tie" and "blackberry". However, this work focuses on developing unsupervised sense embedding learning methods, and the number of discovered senses is not a focus.

Conclusion
This paper proposes a novel modularized framework for unsupervised sense representation learning, which supports not only the flexible design of modular tasks but also joint optimization among modules. The proposed model is the first work that implements purely sense-level representation learning with linear-time sense selection, and achieves state-of-the-art performance on benchmark contextual word similarity and synonym selection tasks. In the future, we plan to investigate reinforcement learning methods to incorporate multi-sense word representations for downstream NLP tasks.

Appendix A
Denote $\Theta = \{P, Q\}$ as the parameter set for the policy $\pi$, and let $J(\Theta)$ be the objective in (5). The gradient with respect to $\Theta$ is:
$$\nabla_\Theta J(\Theta) = \nabla_\Theta \, \mathbb{E}_{z_{ik} \sim \pi(\cdot \mid \bar{C}_t)}\left[\log \bar{L}(z_{jl} \mid z_{ik})\right] = \mathbb{E}_{z_{ik} \sim \pi(\cdot \mid \bar{C}_t)}\left[\log \bar{L}(z_{jl} \mid z_{ik}) \, \nabla_\Theta \log \pi(z_{ik} \mid \bar{C}_t)\right].$$
Accordingly, if we conduct typical stochastic gradient ascent training on $J(\Theta)$ with respect to $\Theta$ from samples $z_{ik}$ with a learning rate $\eta$, the update formula is:
$$\Theta \leftarrow \Theta + \eta \, \log \bar{L}(z_{jl} \mid z_{ik}) \, \nabla_\Theta \log \pi(z_{ik} \mid \bar{C}_t).$$
However, the collocation log likelihood is always non-positive: $\log \bar{L}(z_{jl} \mid z_{ik}) \le 0$. Therefore, as long as the collocation log likelihood $\log \bar{L}(z_{jl} \mid z_{ik})$ is negative, the update minimizes the likelihood of choosing $z_{ik}$, even though $z_{ik}$ may be a good choice. On the other hand, if the log likelihood reaches 0, then according to (4):
$$\log \bar{L}(z_{jl} \mid z_{ik}) = 0 \Rightarrow \bar{L}(z_{jl} \mid z_{ik}) = 1 \Rightarrow U_{z_{ik}}^T V_{z_{jl}} \to \infty, \;\; U_{z_{ik}}^T V_{z_{uv}} \to -\infty, \;\; \forall z_{uv},$$
which leads to computational overflow from infinite values.