Pointwise HSIC: A Linear-Time Kernelized Co-occurrence Norm for Sparse Linguistic Expressions

In this paper, we propose a new kernel-based co-occurrence measure that can be applied to sparse linguistic expressions (e.g., sentences) with a very short learning time, as an alternative to pointwise mutual information (PMI). Just as PMI is derived from mutual information, we derive this new measure from the Hilbert–Schmidt independence criterion (HSIC); thus, we call the new measure the pointwise HSIC (PHSIC). PHSIC can be interpreted as a smoothed variant of PMI that allows various similarity metrics (e.g., sentence embeddings) to be plugged in as kernels. Moreover, PHSIC can be estimated by simple and fast (linear in the size of the data) matrix calculations regardless of whether we use linear or nonlinear kernels. Empirically, in a dialogue response selection task, PHSIC is learned thousands of times faster than RNN-based PMI while outperforming PMI in accuracy. We also demonstrate that PHSIC is beneficial as a criterion for data selection in machine translation owing to its ability to give high (low) scores to pairs that are consistent (inconsistent) with the other pairs.


Introduction
Computing the co-occurrence strength between two linguistic expressions is a fundamental task in natural language processing (NLP). For example, in collocation extraction (Manning and Schütze, 1999), word bigrams are collected from corpora and then strongly co-occurring bigrams (e.g., "New York") are found. In dialogue response selection (Lowe et al., 2015), pairs comprising a context and its response sentence are collected from dialogue corpora and the goal is to rank the candidate responses for each given context sentence. In either case, a set of linguistic expression pairs $D = \{(x_i, y_i)\}_{i=1}^{n}$ is first collected and then the co-occurrence strength of a (new) pair $(x, y)$ is computed.

Table 1: The proposed co-occurrence norm, PHSIC, eliminates the trade-off between robustness to data sparsity and learning time from which PMI suffers (Section 1).
Pointwise mutual information (PMI) (Church and Hanks, 1989) is frequently used to model the co-occurrence strength of linguistic expression pairs. There are two typical types of PMI estimation (computation) method. The first is a counting-based estimator using maximum likelihood estimation, sometimes combined with smoothing techniques:

$\mathrm{PMI}_{\mathrm{MLE}}(x, y; D) = \log \dfrac{n \cdot c(x, y)}{c(x)\, c(y)},$ (1)

where $c(x, y)$ denotes the frequency of the pair $(x, y)$ in the given data $D$, and $c(x)$ and $c(y)$ denote the marginal frequencies of $x$ and $y$. This is easy to compute and is commonly used to measure co-occurrence between words, such as in collocation extraction¹; however, when the data $D$ is sparse, i.e., when $x$ or $y$ is a phrase or sentence, this approach is unrealistic. The second method uses recurrent neural networks (RNNs). Li et al. (2016) proposed employing PMI to suppress dull responses in utterance generation for dialogue systems². They estimated $P(y)$ and $P(y|x)$ with RNN language models and estimated PMI as follows:

$\mathrm{PMI}_{\mathrm{RNN}}(x, y; D) = \log \dfrac{P_{\mathrm{RNN}}(y \mid x)}{P_{\mathrm{RNN}}(y)}.$ (2)

This way of estimating PMI is applicable to sparse linguistic expressions; however, learning RNN language models is computationally costly.
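To make the counting-based estimator concrete, here is a minimal Python sketch of Equation (1). The function name and the list-of-pairs interface are our own illustration; the failure on zero counts is exactly the sparsity problem discussed above.

```python
from collections import Counter
from math import log

def pmi_mle(x, y, pairs):
    """Counting-based PMI estimator, a sketch of Equation (1).

    pairs: the observed data D = [(x_1, y_1), ..., (x_n, y_n)].
    Returns log(n * c(x, y) / (c(x) * c(y))); raises an error when any
    count is zero, which happens constantly for sparse expressions.
    """
    n = len(pairs)
    c_xy = Counter(pairs)                    # joint counts c(x, y)
    c_x = Counter(x_ for x_, _ in pairs)     # marginal counts c(x)
    c_y = Counter(y_ for _, y_ in pairs)     # marginal counts c(y)
    return log(n * c_xy[(x, y)] / (c_x[x] * c_y[y]))
```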
To eliminate this trade-off between robustness to data sparsity and learning time, in this study we propose a new kernel-based co-occurrence measure, which we call the pointwise Hilbert–Schmidt independence criterion (PHSIC) (see Table 1). Our contributions are as follows:
• We formalize PHSIC, which is derived from HSIC (Gretton et al., 2005), a kernel-based dependence measure, in the same way that PMI is derived from mutual information (Section 3).
• We give an intuitive explanation of why PHSIC is robust to data sparsity: PHSIC is a "smoothed variant of PMI" that allows various similarity metrics to be plugged in as kernels (Section 4).
• We propose fast estimators of PHSIC, which reduce to simple and fast matrix calculations regardless of whether we use linear or nonlinear kernels (Section 5).
• We empirically confirm the effectiveness of PHSIC, i.e., its robustness to data sparsity and its short learning time, in two different types of experiment: a dialogue response selection task and a data selection task for machine translation (Section 6).

Problem Setting
Let $X$ and $Y$ denote random variables on $\mathcal{X}$ and $\mathcal{Y}$, respectively. In this paper, we deal with tasks that take a set of linguistic expression pairs $D = \{(x_i, y_i)\}_{i=1}^{n}$, regarded as a set of i.i.d. samples drawn from a joint distribution $P_{XY}$, and then measure the "co-occurrence strength" of each given pair $(x, y) \in \mathcal{X} \times \mathcal{Y}$. Such tasks include collocation extraction and dialogue response selection (Section 1).

Pointwise HSIC
In this section, we give the formal definition of PHSIC, a new kernel-based co-occurrence measure. We show a summary of this section in Table 2. Intuitively, PHSIC is a "kernelized variant of PMI."

Dependence Measure
As a preliminary step, we introduce the simple concept of dependence (see Dependence Measure in Table 2). Recall that random variables $X$ and $Y$ are independent if and only if the joint distribution $P_{XY}$ and the product of the marginals $P_X P_Y$ coincide. Therefore, we can measure the dependence between random variables $X$ and $Y$ via the difference between $P_{XY}$ and $P_X P_Y$. Both the mutual information and the Hilbert–Schmidt independence criterion, described below, are such dependence measures.

MI and PMI
We briefly review the well-known mutual information and PMI (see MI & PMI in Table 2).
The mutual information (MI)³ between two random variables $X$ and $Y$ is defined by

$\mathrm{MI}(X, Y) := \mathrm{KL}[P_{XY} \,\|\, P_X P_Y]$ (4)

(Cover and Thomas, 2006), where $\mathrm{KL}[\cdot \| \cdot]$ denotes the Kullback–Leibler (KL) divergence. Thus, $\mathrm{MI}(X, Y)$ is the degree of dependence between $X$ and $Y$ measured by the KL divergence between $P_{XY}$ and $P_X P_Y$.
Here, by the definition of the KL divergence, MI can be represented in the form of an expectation over $P_{XY}$, i.e., a summation over all possible pairs $(x, y) \in \mathcal{X} \times \mathcal{Y}$:

$\mathrm{MI}(X, Y) = \sum_{(x, y)} P_{XY}(x, y) \log \dfrac{P_{XY}(x, y)}{P_X(x) P_Y(y)}.$ (5)

The log-ratio term in Equation (5) is precisely the pointwise mutual information (PMI) (Church and Hanks, 1989):

$\mathrm{PMI}(x, y) := \log \dfrac{P_{XY}(x, y)}{P_X(x) P_Y(y)}.$ (6)

Therefore, $\mathrm{PMI}(x, y)$ can be thought of as the contribution of $(x, y)$ to $\mathrm{MI}(X, Y)$.
Table 2: Relationship between the mutual information (MI), the pointwise mutual information (PMI), the Hilbert–Schmidt independence criterion (HSIC), and the pointwise HSIC (PHSIC). Just as PMI is defined as the contribution to MI, we define PHSIC as the contribution to HSIC. In short, PHSIC is a "kernelized PMI" (Section 3).

                 Dependence Measure                    Co-occurrence Measure
                 (the dependence between X and Y,      (the contribution of (x, y) to the
                 i.e., the difference between          dependence between X and Y)
                 P_XY and P_X P_Y)
  KL divergence  MI(X, Y)                              PMI(x, y)
  MMD            HSIC(X, Y; k, ℓ)                      PHSIC(x, y; X, Y, k, ℓ)

HSIC and PHSIC
As seen in the previous section, PMI can be derived from MI. Here, we consider replacing MI with the Hilbert-Schmidt independence criterion (HSIC). Then, in analogy with the relationship between PMI and MI, we derive PHSIC from HSIC (see HSIC & PHSIC in Table 2).
Let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ denote positive definite kernels on $\mathcal{X}$ and $\mathcal{Y}$, respectively (intuitively, they are similarity functions between linguistic expressions). The Hilbert–Schmidt independence criterion (HSIC) (Gretton et al., 2005), a kernel-based dependence measure, is defined by

$\mathrm{HSIC}(X, Y; k, \ell) := \mathrm{MMD}^2_{k, \ell}[P_{XY}, P_X P_Y],$ (7)

where $\mathrm{MMD}[\cdot, \cdot]$ denotes the maximum mean discrepancy (MMD), which measures the difference between distributions in a kernel-induced feature space. Thus, $\mathrm{HSIC}(X, Y; k, \ell)$ is the degree of dependence between $X$ and $Y$ measured by the MMD between $P_{XY}$ and $P_X P_Y$, just as MI is measured by the KL divergence (Equation (4)).
Analogous to MI in Equation (5), HSIC can be represented in the form of an expectation over $P_{XY}$ by a simple rearrangement. Unlike MI in Equation (5), HSIC has two such representations, one in feature space and one in data space:

$\mathrm{HSIC}(X, Y; k, \ell) = \mathbb{E}_{(x, y) \sim P_{XY}}\big[\langle \phi(x) - m_X,\; C_{XY}(\psi(y) - m_Y) \rangle\big]$ (8)
$\phantom{\mathrm{HSIC}(X, Y; k, \ell)} = \mathbb{E}_{(x, y) \sim P_{XY}}\big[\mathbb{E}_{(x', y') \sim P_{XY}}[\tilde{k}(x, x')\, \tilde{\ell}(y, y')]\big],$ (9)

where $\phi(x) := k(x, \cdot)$ and $\psi(y) := \ell(y, \cdot)$ are the feature maps, $m_X := \mathbb{E}[\phi(X)]$ and $m_Y := \mathbb{E}[\psi(Y)]$ are the mean elements, $C_{XY}$ is the cross-covariance operator, and $\tilde{k}(x, x') := \langle \phi(x) - m_X, \phi(x') - m_X \rangle$ is the centered kernel (similarly for $\tilde{\ell}$). At first glance, these equations look somewhat complicated; however, the estimators of PHSIC we actually use reduce to simple matrix calculations in Section 5. Equation (8) is the representation in feature space and Equation (9) is the representation in data space. Following the relationship between MI and PMI (Section 3.2), we define the pointwise Hilbert–Schmidt independence criterion (PHSIC) by the inner terms of Equations (8) and (9):

$\mathrm{PHSIC}(x, y; X, Y, k, \ell) := \langle \phi(x) - m_X,\; C_{XY}(\psi(y) - m_Y) \rangle$ (10)
$\phantom{\mathrm{PHSIC}(x, y; X, Y, k, \ell)} = \mathbb{E}_{(x', y') \sim P_{XY}}[\tilde{k}(x, x')\, \tilde{\ell}(y, y')].$ (11)

Namely, $\mathrm{PHSIC}(x, y)$ is defined as the contribution of $(x, y)$ to $\mathrm{HSIC}(X, Y)$.

PHSIC as Smoothed PMI
This section gives an intuitive explanation of the first feature of PHSIC, i.e., its robustness to data sparsity, using Table 3. In short, we show that PHSIC is a "smoothed variant of PMI."

Table 3: Comparison of estimators of PMI and PHSIC in terms of how they match the given (x, y) against the observed (x_i, y_i) in D. PMI matches them exactly, while PHSIC smooths the matching using kernels. Therefore, PHSIC is expected to be robust to data sparsity (Section 4).

First, the maximum likelihood estimator of PMI in Equation (1) can be rewritten as

$\widehat{\mathrm{PMI}}_{\mathrm{MLE}}(x, y; D) = \log \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[x_i = x \wedge y_i = y] \;-\; \log \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[x_i = x] \;-\; \log \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}[y_i = y],$ (16)

where $\mathbb{I}[\text{condition}] = 1$ if the condition is true and $\mathbb{I}[\text{condition}] = 0$ otherwise. According to Equation (16), $\mathrm{PMI}(x, y)$ is the amount computed by repeating the following operation (see the first row in Table 3): collate the given $(x, y)$ with each observed $(x_i, y_i)$ in $D$, and add a score if $(x, y)$ and $(x_i, y_i)$ match exactly, or deduct a score if either the $x$ side or the $y$ side (but not both) matches.
Moreover, the estimator of PHSIC in data space (Equation (15)) is

$\widehat{\mathrm{PHSIC}}(x, y; D) = \frac{1}{n} \sum_{i=1}^{n} \tilde{k}(x, x_i)\, \tilde{\ell}(y, y_i),$ (17)

where $\tilde{k}(\cdot, \cdot)$ and $\tilde{\ell}(\cdot, \cdot)$ are similarity functions centered on the data⁴. According to Equation (17), $\mathrm{PHSIC}(x, y)$ is the amount computed by repeating the following operation (see the second row in Table 3): collate the given $(x, y)$ with each observed $(x_i, y_i)$ in $D$, and add a score if the similarities on the $x$ and $y$ sides are both high (both $\tilde{k}(x, x_i) > 0$ and $\tilde{\ell}(y, y_i) > 0$ hold)⁵, or deduct a score if one side is similar but the other side is not.
⁴ Here, $\tilde{k}(x, x_i) := k(x, x_i) - \frac{1}{n}\sum_{j=1}^{n} k(x_i, x_j)$, which is an estimator of the centered kernel $\tilde{k}(x, x')$ in Equation (13); similarly for $\tilde{\ell}$.
⁵ In addition, a score is added if the similarity on the $x$ side and that on the $y$ side are both lower, that is, if $\tilde{k}(x, x_i) < 0$ and $\tilde{\ell}(y, y_i) < 0$ hold.
As described above, when the estimators of PMI and PHSIC are compared from the viewpoint of how they match the given (x, y) against the observed (x_i, y_i), PMI matches them exactly, while PHSIC smooths the matching using kernels (similarity functions).
With this mechanism, even for completely unknown pairs, it is possible to estimate the co-occurrence strength by referring to observed pairs through the kernels. Therefore, PHSIC is expected to be robust to data sparsity and can be applied to phrases and sentences.
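As an illustration of this smoothed matching, the following minimal sketch implements the data-space estimator of Equation (17), centering each similarity by the data-side kernel mean as in our reconstruction of footnote 4. The function names and the kernel-function interface are our own and not part of the original method.

```python
import numpy as np

def fit_centering_means(xs, ys, k, l):
    """Precompute the data-side kernel means used for centering
    (cf. footnote 4); O(n^2) time, as in the naive estimator of Appendix B."""
    mu_k = np.array([np.mean([k(xi, xj) for xj in xs]) for xi in xs])
    mu_l = np.array([np.mean([l(yi, yj) for yj in ys]) for yi in ys])
    return mu_k, mu_l

def phsic_data_space(x, y, xs, ys, k, l, mu_k, mu_l):
    """Data-space PHSIC estimator, a sketch of Equation (17):
    the mean over the data of products of centered similarities."""
    k_c = np.array([k(x, xi) for xi in xs]) - mu_k   # \tilde{k}(x, x_i)
    l_c = np.array([l(y, yi) for yi in ys]) - mu_l   # \tilde{l}(y, y_i)
    return float(np.mean(k_c * l_c))                 # add / deduct and average
```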
Available Kernels for PHSIC In NLP, a variety of similarity functions (i.e., positive definite kernels) are available. We can freely utilize such resources, e.g., the cosine similarity between sentence embeddings. For a more detailed discussion, see Appendix A.

Empirical Estimators of PHSIC
Recall that we have two types of empirical estimator of PMI, the maximum likelihood estimator (Equation (1)) and the RNN-based estimator (Equation (2)). In this section, we describe how to rapidly estimate PHSIC from data. When using the linear kernel or cosine similarity (e.g., cosine similarity between sentence embeddings), PHSIC can be efficiently estimated in feature space (Section 5.1). When using a nonlinear kernel such as the Gaussian kernel, PHSIC can also be estimated efficiently in data space via a simple matrix decomposition (Section 5.2).

Estimation Using Linear Kernel or Cosine Similarity
When using the linear kernel or cosine similarity, the estimator of PHSIC in feature space (Equation (14)) becomes

$\widehat{\mathrm{PHSIC}}(x, y; D) = (\phi(x) - \widehat{m}_X)^{\top}\, \widehat{C}_{XY}\, (\psi(y) - \widehat{m}_Y),$ (18)

where $\widehat{m}_X := \frac{1}{n}\sum_i \phi(x_i)$, $\widehat{m}_Y := \frac{1}{n}\sum_i \psi(y_i)$, and $\widehat{C}_{XY} := \frac{1}{n}\sum_i (\phi(x_i) - \widehat{m}_X)(\psi(y_i) - \widehat{m}_Y)^{\top}$. Generally in kernel methods, the feature map $\phi(\cdot)$ induced by a kernel $k(\cdot, \cdot)$ is unknown or high-dimensional, and it is difficult to compute estimated values in feature space⁶. However, when we use the linear kernel or cosine similarity, the feature maps can be written explicitly:

$\phi(x) = \hat{x} \ \text{(linear kernel)}, \qquad \phi(x) = \hat{x}/\|\hat{x}\| \ \text{(cosine similarity)},$ (19)

where $\hat{x} \in \mathbb{R}^d$ denotes the vector representation (e.g., a sentence embedding) of $x$, and similarly for $\psi(y)$. For the cosine case, substituting Equation (19) into Equation (18) gives

$\widehat{\mathrm{PHSIC}}_{\cos}(x, y; D) = \Big(\frac{\hat{x}}{\|\hat{x}\|} - \widehat{m}_X\Big)^{\top} \widehat{C}_{XY} \Big(\frac{\hat{y}}{\|\hat{y}\|} - \widehat{m}_Y\Big).$ (21)
Computational Cost When learning Equation (18) with feature maps $\phi : \mathcal{X} \to \mathbb{R}^d$ and $\psi : \mathcal{Y} \to \mathbb{R}^d$, computing $\widehat{m}_X$, $\widehat{m}_Y$, and $\widehat{C}_{XY}$ takes $O(nd^2)$ time and $O(nd)$ space (linear in the size of the input, $n$). When estimating $\mathrm{PHSIC}(x, y)$, computing $\phi(x), \psi(y) \in \mathbb{R}^d$ and Equation (18) takes $O(d^2)$ time (constant; it does not depend on the size of the input, $n$).
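The following NumPy sketch implements this feature-space estimator (Equations (18)-(21)). The variable names are ours; the rows of Phi and Psi are assumed to be precomputed sentence vectors, unit-normalized beforehand when the cosine kernel is intended.

```python
import numpy as np

def fit_phsic(Phi, Psi):
    """Learn the feature-space PHSIC estimator (Equation (18)).

    Phi, Psi: n x d arrays of encoded sentence vectors for the x and y
    sides (normalize rows to unit length for the cosine kernel).
    Runs in O(n d^2) time and O(n d) space, linear in the data size n.
    """
    m_x, m_y = Phi.mean(axis=0), Psi.mean(axis=0)
    # empirical cross-covariance matrix C_XY (d x d)
    C = (Phi - m_x).T @ (Psi - m_y) / Phi.shape[0]
    return m_x, m_y, C

def phsic(x_vec, y_vec, m_x, m_y, C):
    """Score a new pair in O(d^2) time, independent of n."""
    return float((x_vec - m_x) @ C @ (y_vec - m_y))
```

For example, with embedding matrices X and Y, `fit_phsic(X / np.linalg.norm(X, axis=1, keepdims=True), ...)` realizes the cosine-kernel variant of Equation (21).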

Estimation Using Nonlinear Kernels
When using a nonlinear kernel such as the Gaussian kernel, it is necessary to estimate PHSIC in data space. Using a simple matrix decomposition, this can be achieved with the same computational cost as the estimation in feature space. See Appendix B for a detailed derivation.

Experiments
In this section, we provide empirical evidence of the effectiveness of PHSIC over PMI, i.e., its very short learning time and its robustness to data sparsity. Among the many potential applications of PHSIC, we choose two fundamental scenarios: (re-)ranking/classification and data selection.
• In the ranking/classification scenario (measuring the co-occurrence strength of new data pairs with reference to observed pairs), PHSIC is applied as a criterion for the dialogue response selection task (Section 6.2).
• In the data selection/filtering scenario (ordering the entire set of observed data pairs according to co-occurrence strength), PHSIC is applied as a criterion for data selection in the context of machine translation (Section 6.3).

PHSIC Settings
To take advantage of recent developments in representation learning, we used several pre-trained models for encoding sentences into vectors and several kernels between these vectors for PHSIC.
Encoders As sentence encoders, we used two pre-trained models without fine-tuning. First, the sum of word vectors effectively represents a sentence (Mikolov et al., 2013a):

$\hat{x} = \sum_{w \in x} \mathrm{vec}(w), \qquad \hat{y} = \sum_{w \in y} \mathrm{vec}(w).$ (22)

For $\mathrm{vec}(\cdot)$, we used the pre-trained fastText model⁷, which is a high-accuracy and popular word embedding model (Bojanowski et al., 2017); models in 157 languages are publicly distributed (Grave et al., 2018). Second, we also used a DNN-based sentence encoder, the universal sentence encoder (Cer et al., 2018), which utilizes the deep averaging network (DAN) (Iyyer et al., 2015). The pre-trained model for English sentences we used is publicly available⁸.
Kernels As kernels between these vectors, we used the cosine similarity (cos) and the Gaussian kernel (also known as the radial basis function (RBF) kernel), $k(x, x') = \exp(-\|\hat{x} - \hat{x}'\|^2 / (2\sigma^2))$, and similarly for $\ell(y, y')$. The experiments were run with hyperparameter $\sigma = 1.0$ for the RBF kernel and $d = 100$ for the incomplete Cholesky decomposition (for more details, see Appendix B).
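For concreteness, both kernels have standard closed forms; a trivial sketch follows, with the default sigma = 1.0 matching the setting above.

```python
import numpy as np

def cos_kernel(u, v):
    """Cosine similarity, a positive definite kernel on sentence vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian (RBF) kernel with bandwidth sigma."""
    return float(np.exp(-np.sum((u - v) ** 2) / (2.0 * sigma ** 2)))
```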

Ranking: Dialogue Response Selection
In the first experiment, we applied PHSIC as a ranking criterion for the task of dialogue response selection (Lowe et al., 2015). In this task, pairs comprising a context (a previous utterance sequence) and its response are collected from dialogue corpora, and the goal is to rank the candidate responses for each given context sentence. The task involves sentence sequences (very sparse linguistic expressions); moreover, Li et al. (2016) pointed out that (RNN-based) PMI helps suppress dull responses (e.g., "I don't know.") in dialogue systems. Therefore, PHSIC, another co-occurrence measure, is also expected to be effective here. In this setting, where the validity of PMI has been confirmed, we investigate whether PHSIC can replace RNN-based PMI in terms of both learning time and robustness to data sparsity.

Experimental Settings
Dataset For the training data, we gathered approximately $5 \times 10^5$ reply chains from Twitter, following Sordoni et al. (2015)⁹. In addition, we randomly selected $\{10^3, 10^4, 10^5\}$ reply chains from that dataset. Using these small subsets, we confirmed the effect of the size of the training set (data sparseness) on the learning time and predictive performance.
For validation and test data, we used a small (approximately 2000 pairs each) but highly reliable dataset created by Sordoni et al. (2015)¹⁰, which consists only of conversations given high scores by human annotators. Therefore, this set was not expected to include dull responses.
For each dataset, we converted each context-message-response triple into a context-response pair by concatenating the context and message, following Li et al. (2016). In addition, to convert the test set (positive examples) into ten-choice multiple-choice questions, we shuffled the combinations of context and response to generate pseudo-negative examples, as sketched below.
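A minimal sketch of one way to realize this shuffling; the pair-list interface and the convention of placing the true response at index 0 are our own illustration.

```python
import random

def make_multiple_choice(pairs, n_candidates=10, seed=0):
    """Build ten-choice questions: for each (context, response) pair, draw
    n_candidates - 1 responses from other pairs as pseudo-negatives."""
    rng = random.Random(seed)
    responses = [y for _, y in pairs]
    questions = []
    for x, y in pairs:
        negatives = rng.sample([r for r in responses if r is not y],
                               n_candidates - 1)
        questions.append((x, [y] + negatives))  # true response at index 0
    return questions
```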
Evaluation Metrics We adopted the following evaluation metrics for the task: (i) ROC-AUC (the area under the receiver operating characteristic curve), (ii) MRR (the mean reciprocal rank), and (iii) Recall@{1, 2}.

⁹ We collected tweets after 2017 for our training set to avoid duplication with the test set, which contains tweets from the year 2012.
¹⁰ https://www.microsoft.com/en-us/download/details.aspx?id=52375
Experimental Procedure We used the following procedure: (i) train the model with a set of context-response pairs $D = \{(x_i, y_i)\}_{i=1}^{n}$; (ii) for each context sentence $x$ in the test data, rank the candidate responses $\{y_j\}_{j=1}^{10}$ by the model; and (iii) report the three evaluation metrics.
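A sketch of steps (ii)-(iii), assuming (per the data construction above) that the true response sits at index 0 of each candidate list; score can be any trained model, e.g., the PHSIC estimator sketched in Section 5.1. ROC-AUC is omitted for brevity.

```python
import numpy as np

def evaluate_ranking(score, test_items):
    """Rank candidate responses per context and report MRR / Recall@{1,2}.

    test_items: list of (context, [y_0, ..., y_9]) with the true response
    at index 0; score(x, y) is any trained co-occurrence model.
    """
    ranks = []
    for x, candidates in test_items:
        scores = [score(x, y) for y in candidates]
        order = np.argsort(scores)[::-1]                   # descending
        ranks.append(int(np.where(order == 0)[0][0]) + 1)  # rank of true y
    ranks = np.array(ranks)
    return {"MRR": float(np.mean(1.0 / ranks)),
            "Recall@1": float(np.mean(ranks <= 1)),
            "Recall@2": float(np.mean(ranks <= 2))}
```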

Baseline Measures As baseline measures, we trained (1) an RNN language model $P_{\mathrm{RNN}}(y)$ (Mikolov et al., 2010) and (2) a conditional RNN language model $P_{\mathrm{RNN}}(y|x)$ (Sutskever et al., 2014), and (3) used the PMI based on these language models, RNN-PMI, in the experiments (see Equation (2)). We trained these models with all combinations of the following settings: (a) the number of dimensions of the hidden layers being 300 or 1200, and (b) the initialization of the embedding layer being random (uniform on $[-0.1, 0.1]$) or fastText. For more detailed settings, see Appendix C.

Experimental Results
Learning Time Table 4 shows the experimental results for the learning time¹¹. Regardless of the size of the training set $n$, the learning time of PHSIC is much shorter than that of the RNN-based method. For example, even when the size of the training set $n$ is $5 \times 10^5$, PHSIC is approximately 1400-4000 times faster than RNN-based PMI. This is because the estimators of PHSIC reduce to a deterministic and efficient matrix calculation (Section 5), whereas neural network-based models involve the sequential optimization of parameters via gradient descent methods.

Robustness to Data Sparsity
Table 5 shows the experimental results for the predictive performance. When the size of the training data is small ($n = 10^3, 10^4$), that is, when the data is extremely sparse, the predictive performance of PHSIC hardly deteriorates, while that of PMI decays rapidly as the amount of data decreases. This indicates that PHSIC is more robust to data sparsity than RNN-based PMI owing to the effect of the kernels. Moreover, PHSIC with the simple cosine kernel outperforms the RNN-based model regardless of the amount of data, while the learning time of PHSIC is thousands of times shorter than those of the baseline methods (Section 6.2).
Additionally, we report Spearman's rank correlation coefficient between the models to verify whether PHSIC behaves similarly to PMI; see Appendix D for details.

Data Selection for Machine Translation
The aim of our second experiment was to demonstrate that PHSIC is also beneficial as a criterion of data selection. To achieve this, we attempted to apply PHSIC to a parallel corpus filtering task that has been intensively discussed in recent (neural) machine translation (MT, NMT) studies. This task was first adopted as a shared task in the third conference on machine translation (WMT 2018) 12 .
Several existing parallel corpora, especially those automatically gathered from large-scale text data, such as the Web, contain unacceptable amounts of noisy (low-quality) sentence pairs that greatly affect the translation quality. Therefore, the development of an effective method for parallel corpus filtering would potentially have a large influence on the MT community; discarding such noisy pairs may improve the translation quality and shorten the training time.
We expect PHSIC to give low scores to exceptional sentence pairs (misalignments or missing translations) during the selection process, because PHSIC assigns low scores to pairs that are highly inconsistent with the other pairs (see Section 4). Note that applying RNN-based PMI to a parallel corpus selection task is unprofitable, since obtaining RNN-based PMI has a computational cost identical to that of training a sequence-to-sequence MT model itself, and thus we cannot expect a reduction in the total training time.

¹² http://www.statmt.org/wmt18/parallel-corpus-filtering.html

Experimental Settings
Dataset We used the ASPEC-JE corpus¹³, which is an official dataset used for the MT-evaluation shared task held at the fourth workshop on Asian translation (WAT 2017)¹⁴ (Nakazawa et al., 2017). ASPEC-JE consists of approximately three million (3M) Japanese-English parallel sentences from scientific paper abstracts. As discussed by Kocmi et al. (2017), ASPEC-JE contains many low-quality parallel sentences that have the potential to significantly degrade the MT quality. In fact, they empirically revealed that using only the reliable part of the training parallel corpus significantly improved the translation quality. Therefore, ASPEC-JE is a suitable dataset for evaluating data selection ability.
Model For our data selection evaluation, we selected the Transformer architecture (Vaswani et al., 2017) as our baseline NMT model; it is widely used in the NMT community and known as one of the current state-of-the-art architectures. We utilized fairseq¹⁵, a publicly available toolkit for neural sequence-to-sequence models, for building our models.
Experimental Procedure We used the following procedure for this evaluation: (1) rank all parallel sentences in a given parallel corpus according to each criterion, (2) extract the top K ranked parallel sentences, (3) train the NMT model using the extracted parallel sentences, and (4) evaluate the translation quality on the test data using a typical automatic MT evaluation measure, i.e., BLEU (Papineni et al., 2002)¹⁶. In our experiments we evaluated PHSIC with K = 0.5M and 1M.
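Steps (1)-(2) amount to a one-line ranking; a sketch, assuming score is a trained co-occurrence model such as PHSIC over monolingual sentence embeddings of the two sides.

```python
def select_top_k(parallel_pairs, score, k):
    """Rank a noisy parallel corpus by a co-occurrence criterion and keep
    the top-k sentence pairs for NMT training (steps (1)-(2) above)."""
    ranked = sorted(parallel_pairs, key=lambda pair: score(*pair), reverse=True)
    return ranked[:k]
```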

Baseline Measure As a baseline measure, we utilized a publicly available implementation¹⁷ of fast_align (Dyer et al., 2013), one of the state-of-the-art word aligners. We first ran fast_align on the training set $D = \{(x_i, y_i)\}_i$ to obtain the word alignment of each sentence pair $(x_i, y_i)$, i.e., a set of aligned word pairs with their probabilities. We then computed the co-occurrence score of $(x_i, y_i)$ with sentence-length normalization, i.e., as the average log probability of the aligned word pairs.

¹⁷ https://github.com/clab/fast_align
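A sketch of this sentence-length-normalized baseline score; how the aligned word-pair probabilities are extracted from the aligner's output is deliberately left abstract, since that interface depends on the toolchain.

```python
from math import log

def alignment_score(aligned_probs):
    """Baseline co-occurrence score for one sentence pair: the average
    log probability of its aligned word pairs (length normalization).

    aligned_probs: probabilities of the aligned word pairs, however they
    were obtained from the word aligner (interface is hypothetical).
    """
    return sum(log(p) for p in aligned_probs) / max(len(aligned_probs), 1)
```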
Results Table 6 shows the results of our data selection evaluation. It is common knowledge in NMT that more data generally gives better performance. However, we observed that PHSIC successfully extracted beneficial parallel sentences from the noisy parallel corpus: the result using 1M sentence pairs extracted from the 3M corpus by PHSIC was almost the same as that using all 3M pairs (the decrease in the BLEU score was only 0.07), whereas random extraction of 1M pairs reduced the BLEU score by 1.20. This is actually a surprising result, because PHSIC utilizes only monolingual similarity measures (kernels) without any other language resources. This indicates that PHSIC can be applied to language pairs poor in parallel resources. In addition, the surface forms and grammatical characteristics of English and Japanese are extremely different¹⁸; therefore, we expect that PHSIC will work well regardless of the similarity of the language pair.

Related Work
Dependence Measures Measuring independence or dependence (correlation) between two random variables, i.e., estimating dependence from a set of paired data, is a fundamental task in statistics and arises across a very wide area of data science. To measure the complex nonlinear dependence found in real data, we have several choices.
First, the information-theoretic MI (Cover and Thomas, 2006) and its variants (Suzuki et al., 2009; Reshef et al., 2011) are the most commonly used dependence measures. However, to the best of our knowledge, there is no practical method of computing MI for large-scale, multi-class, high-dimensional discrete data with a complex generative model, such as sparse linguistic data.
Second, several kernel-based dependence measures have been proposed for measuring nonlinear dependence (Akaho, 2001; Bach and Jordan, 2002; Gretton et al., 2005). The reason why kernel-based dependence measures work well for real data is that they do not explicitly estimate densities, which is difficult for high-dimensional data. Among them, HSIC (Gretton et al., 2005) is popular because it has a simple estimation method, and it has been used for various tasks such as feature selection (Song et al., 2012), dimensionality reduction (Fukumizu et al., 2009), and unsupervised object matching (Quadrianto et al., 2009; Jagarlamudi et al., 2010). We follow this line.
Co-occurrence Measures In NLP, PMI (Church and Hanks, 1989) and its variants (Bouma, 2009) are the de facto co-occurrence measures between dense linguistic expressions, such as words (Bouma, 2009) and simple narrative-event expressions (Chambers and Jurafsky, 2008). In recent years, positive PMI (PPMI) has played an important role as a component of word vectors (Levy and Goldberg, 2014).
Second, there are several studies in NLP in which the pairwise ranking problem has been solved using deep neural networks (DNNs). Li et al. (2016) proposed a PMI estimation using RNN language models; this was used as a baseline in our experiments (see Section 6.2). Several studies have used DNN-based binary classifiers modeling $P(C = \text{positive} \mid (x, y))$ to solve the given ranking problem directly (Hu et al., 2014; Yin et al., 2016; Mueller and Thyagarajan, 2016); these networks are sometimes called Siamese neural networks. Our study focuses on comparing co-occurrence measures. It is unknown whether Siamese NNs capture co-occurrence strength; therefore, we did not deal with Siamese NNs in this paper.
Finally, to the best of our knowledge, the study by Yokoi et al. (2017) was the first to suggest converting HSIC into a pointwise measure. The present study was inspired by their suggestion; here, we have (i) provided a formal definition (population) of PHSIC; (ii) analyzed the relationship between PHSIC and PMI; (iii) proposed linear-time estimation methods; and (iv) experimentally verified the computation speed and robustness to data sparsity of PHSIC for practical applications.

Conclusion
The NLP community has commonly employed PMI to estimate the co-occurrence strength between linguistic expressions; however, existing PMI estimators have a high computational cost when applied to sparse linguistic expressions (Section 1). We proposed a new kernel-based co-occurrence measure, the pointwise Hilbert–Schmidt independence criterion (PHSIC). Just as PMI is defined as the contribution to mutual information, PHSIC is defined as the contribution to HSIC; intuitively, PHSIC is a "kernelized variant of PMI" (Section 3). PHSIC can be applied to sparse linguistic expressions owing to its mechanism of smoothing by kernels. Comparing the estimators of PMI and PHSIC, PHSIC can be interpreted as a smoothed variant of PMI that allows various similarity metrics to be plugged in as kernels (Section 4). In addition, PHSIC can be estimated in linear time via efficient matrix calculations, regardless of whether we use linear or nonlinear kernels (Section 5). We conducted a ranking task for dialogue systems and a data selection task for machine translation (Section 6). The experimental results show that (i) the learning of PHSIC was completed thousands of times faster than that of RNN-based PMI while outperforming it in ranking accuracy (Section 6.2); and (ii) even with a nonlinear kernel, PHSIC can be applied to a large dataset; moreover, it reduced the amount of training data to one third without sacrificing the output translation quality (Section 6.3).
Future Work Using the PHSIC estimator in feature space (Equation (18)), we can generate the most appropriate $\psi(y)$ for a given $\phi(x)$ (uniquely, up to scale). That is, if a DNN-based sentence decoder is available, $y$ (a sentence) can be restored from $\psi(y)$ (a feature vector), so that generative models of strongly co-occurring sentences can be realized.

A Available Kernels for PHSIC
Similarity between Sentence Vectors A variety of vector representations of phrases and sentences based on the distributional hypothesis have recently been proposed, including sentence encoders (Kiros et al., 2015; Dai and Le, 2015; Iyyer et al., 2015; Hill et al., 2016; Cer et al., 2018) and the sum of word embeddings (e.g., word2vec (Mikolov et al., 2013b), GloVe (Pennington et al., 2014), and fastText (Bojanowski et al., 2017)); it is known that the sum of word vectors expresses the meaning of phrases and sentences well, a property called additive compositionality (Mitchell and Lapata, 2010; Mikolov et al., 2013a; Wieting et al., 2015). Note that various pre-trained models of sentence encoders and word embeddings have also been made publicly available.
The cosine of these vectors, which is a positive definite kernel, can be used as a convenient and highly accurate similarity function between phrases or sentences. Other major kernels can also be used, such as the RBF kernel, the Laplacian kernel, and polynomial kernels.
Structured Kernels Various structured kernels for NLP, such as tree kernels, which capture the fine-grained structure of sentences such as syntax, were devised in the support vector machine era (Collins and Duffy, 2002; Bunescu and Mooney, 2006; Moschitti, 2006).

Combinations
We can freely combine the previously mentioned kernels because the sum and the product of positive definite kernels are also positive definite kernels (Shawe-Taylor and Cristianini, 2004, Proposition 3.22).
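Proposition 3.22 translates directly into code: given two positive definite kernels as functions, their sum and product are again valid kernels. A trivial sketch:

```python
def sum_kernel(k1, k2):
    """The sum of two positive definite kernels is positive definite."""
    return lambda u, v: k1(u, v) + k2(u, v)

def product_kernel(k1, k2):
    """The product of two positive definite kernels is positive definite."""
    return lambda u, v: k1(u, v) * k2(u, v)
```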

B Derivation of Fast PHSIC Estimation in Data Space
Although estimators of HSIC and PHSIC depend on the kernels $k, \ell$ and the data $D$, hereinafter we use the following simplified notation:

$\mathrm{PHSIC}(x, y) := \mathrm{PHSIC}(x, y; D, k, \ell).$ (26)

Naïve Estimation First, an estimator of PHSIC in data space (Equation (15)) is

$\widehat{\mathrm{PHSIC}}(x, y) = \frac{1}{n}\, (\mathbf{k} - \bar{\mathbf{k}})^{\top} H\, (\boldsymbol{\ell} - \bar{\boldsymbol{\ell}}),$ (27)

where $\mathbf{k} := (k(x, x_1), \ldots, k(x, x_n))^{\top} \in \mathbb{R}^n$, and likewise $\boldsymbol{\ell}$; the vector $\bar{\mathbf{k}} := \frac{1}{n} K \mathbf{1}$ denotes the empirical means over the Gram matrix $K = (k(x_i, x_j))_{ij}$, and likewise $\bar{\boldsymbol{\ell}}$; and $H := I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix. This estimation has a large computational cost. When learning, computing the vectors $\bar{\mathbf{k}}, \bar{\boldsymbol{\ell}}$ takes $O(n^2)$ time and $O(n)$ space. When estimating PHSIC, computing $\mathbf{k}, \boldsymbol{\ell}$ and multiplying by the matrix $\frac{1}{n} H$ takes $O(n)$ time.

Fast Estimation via Incomplete Cholesky Decomposition Equation (27) has a large computational cost because it is necessary to construct the Gram matrices $K, L \in \mathbb{R}^{n \times n}$. In kernel methods, several techniques have been proposed for approximating Gram matrices at low cost without constructing them explicitly, such as incomplete Cholesky decomposition (Fine and Scheinberg, 2001).
By incomplete Cholesky decomposition, from data points $\{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ and a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, a matrix $A = (a_1, \ldots, a_n)^{\top} \in \mathbb{R}^{n \times d}$ ($d \ll n$) can be obtained in $O(nd^2)$ time. This makes it possible to approximate the Gram matrix $K$ by the vectors $a_i \in \mathbb{R}^d$ without constructing the whole of $K$: $A A^{\top} \approx K$.
For HSIC as well, an efficient approximation method utilizing incomplete Cholesky decomposition has been proposed (Gretton et al., 2005, Lemma 2):

$\widehat{\mathrm{HSIC}}(X, Y) = \frac{1}{n^2} \big\| A^{\top} H B \big\|_F^2,$ (30)

where $A = (a_1, \ldots, a_n)^{\top} \in \mathbb{R}^{n \times d}$ is a matrix satisfying $A A^{\top} \approx K$ computed via incomplete Cholesky decomposition, and likewise $B$ ($B B^{\top} \approx L$). Equation (30) can be represented in the form of an expectation over the data points:

$\widehat{\mathrm{HSIC}}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (a_i - \bar{a})^{\top} C_{\mathrm{ICD}}\, (b_i - \bar{b}), \qquad C_{\mathrm{ICD}} := \frac{1}{n} \sum_{j=1}^{n} (a_j - \bar{a})(b_j - \bar{b})^{\top},$ (31)

where the vector $\bar{a} := \frac{1}{n} A^{\top} \mathbf{1} \in \mathbb{R}^d$ denotes the empirical mean of $\{a_i\}_{i=1}^{n}$, and likewise $\bar{b} := \frac{1}{n} B^{\top} \mathbf{1}$. Recall that $\mathrm{PHSIC}(x, y)$ is the contribution of $(x, y)$ to $\mathrm{HSIC}(X, Y)$ (see Section 3.3); PHSIC can then be efficiently estimated by the inner term of Equation (31):

$\widehat{\mathrm{PHSIC}}_{\mathrm{ICD}}(x, y) = (a - \bar{a})^{\top} C_{\mathrm{ICD}}\, (b - \bar{b}).$ (33)

Here, the vector $a \in \mathbb{R}^d$ corresponding to the new $x$ can be calculated by "performing from halfway"
on the incomplete Cholesky decomposition algorithm. Let $x^{(1)}, \ldots, x^{(d)}$ denote the pivot points $x_i$ adopted during the decomposition. The $j$-th element of $a$ can be computed as

$a_j = \frac{1}{A_{(j), j}} \Big( k(x, x^{(j)}) - \sum_{t=1}^{j-1} a_t\, A_{(j), t} \Big),$

where $A_{(j), t}$ denotes the element of $A$ in the row corresponding to $x^{(j)}$ and column $t$; likewise for the vector $b \in \mathbb{R}^d$ corresponding to the new $y$. The estimation via incomplete Cholesky decomposition (Equation (33)) is extremely efficient compared with the naïve estimation (Equation (27)); its computational complexity is the same as that of the estimation in feature space (Equation (18)).
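The following NumPy sketch puts Appendix B together: an incomplete Cholesky decomposition, the "performing from halfway" embedding for new points, and the estimator of Equation (33). It is a didactic implementation under our own naming, not the authors' code; d = 100 matches the setting in Section 6.1.

```python
import numpy as np

def icd(points, kernel, d):
    """Incomplete Cholesky decomposition: returns A (n x d) with
    A A^T ~ K, plus the pivot indices, without ever forming K."""
    n = len(points)
    A = np.zeros((n, d))
    residual = np.array([kernel(p, p) for p in points])  # diag of K - A A^T
    pivots = []
    for j in range(d):
        i = int(np.argmax(residual))                     # greedy pivot choice
        pivots.append(i)
        col = np.array([kernel(p, points[i]) for p in points])
        A[:, j] = (col - A[:, :j] @ A[i, :j]) / np.sqrt(residual[i])
        residual = np.maximum(residual - A[:, j] ** 2, 0.0)
    return A, pivots

def icd_embed(x, points, pivots, A, kernel):
    """Embed a new point by "performing from halfway" on the decomposition."""
    a = np.zeros(A.shape[1])
    for j, i in enumerate(pivots):
        a[j] = (kernel(x, points[i]) - a[:j] @ A[i, :j]) / A[i, j]
    return a

def fit_phsic_icd(xs, ys, k, l, d=100):
    """Learn the ICD-based PHSIC estimator (Equations (31) and (33))."""
    A, px = icd(xs, k, d)
    B, py = icd(ys, l, d)
    a_bar, b_bar = A.mean(axis=0), B.mean(axis=0)
    C = (A - a_bar).T @ (B - b_bar) / len(xs)            # C_ICD
    return dict(xs=xs, ys=ys, k=k, l=l, A=A, B=B,
                px=px, py=py, a_bar=a_bar, b_bar=b_bar, C=C)

def phsic_icd(x, y, m):
    """Score a new pair (x, y) in time independent of n (Equation (33))."""
    a = icd_embed(x, m["xs"], m["px"], m["A"], m["k"])
    b = icd_embed(y, m["ys"], m["py"], m["B"], m["l"])
    return float((a - m["a_bar"]) @ m["C"] @ (b - m["b_bar"]))
```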

C Detailed Settings for Learning RNNs
Detailed settings for learning RNNs used in this research are as follows.
• Hidden layers: single-layer LSTMs (Hochreiter and Schmidhuber, 1997)
• Vocabulary: words with frequency 10 or more (n = 5 × 10⁵) or 2 or more (otherwise)
• Dropout rate: 0.1 (300-dim), 0.3 (1200-dim)
• Batch size: 64
• Maximum number of epochs: 5 (n = 5 × 10⁵), 30 (otherwise)
• Deep learning framework: Chainer (Tokui et al., 2015)

D Correlation Between Models in Dialogue Response Selection Task

Table 7: Spearman's ρ between the co-occurrence scores computed by the models in the dialogue response selection task (Section 6.2). The size of the training set n is 5 × 10⁵. The other notation is the same as in Table 4.

Table 7 shows Spearman's rank correlation coefficient (Spearman's ρ) between the co-occurrence scores on the test set computed by the models in the dialogue response selection task (Section 6.2). It shows that the behaviors of RNN-based PMI and PHSIC are considerably different. Furthermore, interestingly, the behaviors of PHSICs using different kernels also differ from each other. Possible reasons for these observations are as follows: (1) the difference in the dependence measures (MI or HSIC) on which the models are based; (2) the validity or numerical stability of estimating PMI with RNN language models; and (3) differences in the behavior of PHSIC originating from the plugged-in kernels. A more detailed analysis of the compatibility between tasks and measures (or kernels) is attractive future work.