Cross-Lingual Document Retrieval with Smooth Learning

Cross-lingual document search is an information retrieval task in which the query language and the document language differ. In this paper, we study the instability of neural document search models and propose a novel end-to-end robust framework that achieves improved performance in cross-lingual search across different document languages. The framework includes a novel measure of relevance between queries and documents, smooth cosine similarity, and a novel loss function, Smooth Ordinal Search Loss, as the objective. We further provide a theoretical guarantee on the generalization error bound of the proposed framework. We conduct experiments comparing our approach with other document search models, and observe significant gains under commonly used ranking metrics on the cross-lingual document retrieval task in a variety of languages.


Introduction
In modern search engines, cross-lingual information retrieval tasks are becoming prevalent and important. For example, when searching for products on an English-language shopping website, immigrants tend to form queries in their native languages, yet would like to see the most relevant products, which are described in English. Another example is international trading: investors might use English to describe their product while searching for sentiment expressed in other languages on online forums from different countries, in order to understand customers' attitudes towards it. Although these tasks can be naturally formulated as information retrieval (IR) tasks and resolved by mono-lingual methods, the rising need for cross-lingual IR techniques also requires robust models to deal with queries and documents from different languages.
Early studies on document retrieval mostly rely on lexical matching, which is error-prone in cross-lingual tasks, as vocabularies and language styles usually change across languages and contextual information is largely lost. With the recent surge of deep neural networks (DNNs), researchers are able to go beyond lexical matching by building neural architectures that represent the textual information of queries and documents as vector representations via non-linear transformations, which have shown great success in many applications [Salakhutdinov and Hinton, 2009; Huang et al., 2013; Shen et al., 2014; Palangi et al., 2016].
Despite the success of the aforementioned advances, several existing difficulties have not been well addressed, e.g., exploding gradients and convergence guarantees. The most widely used optimization method for training DNNs is stochastic gradient descent, which updates model parameters by taking gradients of the loss with respect to the weights. Meanwhile, cosine similarity, a commonly used measure of relevance between queries and documents, is not stable, as its gradient may go to infinity when the Euclidean norm of the representation vector is close to zero, leading to exploding gradients and irrational training. Figure 1 demonstrates the gradients of cosine similarity and our proposed smooth cosine similarity. In addition, in most previous works, the loss functions are either heuristically designed or margin-based [Rennie and Srebro, 2005], tailored to particular properties of the model but lacking interpretability and convergence guarantees. These new scenarios pose a new challenge: the robustness of the cross-lingual information process needs to be well addressed.

[Figure 1: We calculate the smooth cosine similarity r_ε of two vectors, (x_1, x_2) and (1, 1). The plots show heat-maps of the partial derivative with respect to x_1, with ε chosen as 0.05, 0.2, 0.5, from left to right. The non-smooth case ε = 0 resembles the left plot, though the gradient goes to infinity at the center.]
To tackle these issues, we introduce an end-to-end robust framework that achieves high accuracy on the cross-lingual information retrieval task across different document languages. Particularly, for each query q from language A, we are given a set of documents D = (d_1, d_2, ..., d_m) from a different language B and the degrees of relevance y ∈ N^m, where the entries (y_1, y_2, ..., y_m) of y typically belong to the ordered classes {1, ..., K}. The goal of learning-to-rank is to return an ordered or unordered subset (depending on the evaluation metric) of the documents D that are more relevant to the query. Most common evaluation metrics, such as Normalized Discounted Cumulative Gain (NDCG) and Precision, however, are discontinuous and thus cannot be directly optimized. As a result, researchers usually assume an unknown continuous relevance score r, where a higher value of r indicates a larger y and higher relevance between the query and the document. To optimize the model, a surrogate loss function defined on r, which is easier to optimize, is typically used. A subset of the documents is then chosen by ordering r from high to low, to serve the ranking purpose.
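To make the evaluation setup concrete, the following is a minimal sketch of how documents are ranked by a relevance score r and then scored with NDCG; the scores and labels in the usage below are purely hypothetical, and the exact metric variants used in the experiments may differ:

```python
import math

def dcg_at_k(relevances, k):
    # Discounted Cumulative Gain of a ranked list of relevance labels.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(scores, labels, k=5):
    # Rank documents by predicted relevance score r (high to low), then
    # normalize the DCG of that ordering by the DCG of the ideal ordering.
    ranked = [lab for _, lab in sorted(zip(scores, labels), key=lambda t: -t[0])]
    idcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / idcg if idcg > 0 else 0.0

# A perfect ordering attains NDCG = 1; a mis-ranking lowers it.
perfect = ndcg_at_k([0.9, 0.1, 0.5], [2, 0, 1])
swapped = ndcg_at_k([0.1, 0.9, 0.5], [2, 0, 1])
```

Because a small perturbation of the scores can flip the induced ordering, NDCG is discontinuous in r, which is exactly why a surrogate loss is needed.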
Our contributions in this paper can be summarized as follows: 1. First, we propose a novel measure of relevance between queries and documents, Smooth Cosine Similarity (SCS), whose gradient is bounded such that exploding gradients can be avoided, stabilizing the model training. 2. Second, we propose a smooth loss function, Smooth Ordinal Search Loss (SOSL), and provide theoretical guarantees on the generalization error bound for this proposed framework. 3. Third, we empirically show significant gains with our approaches over other document search models under commonly used ranking metrics on the cross-lingual document retrieval task, by conducting experiments in a variety of languages.

Related Works
Document Retrieval Researchers have applied machine learning methods to a variety of document retrieval tasks. Deerwester et al. [1990] proposed LSI, which maps a query and its relevant documents into the same semantic space where they are close, by grouping terms appearing in similar contexts into the same cluster. Salakhutdinov and Hinton [2009] proposed a semantic hashing (SH) method which used non-linear deep neural networks to learn features for information retrieval. Siamese Neural Networks were first introduced by Bromley et al. [1993], where two identical neural architectures receive different types of input vectors (e.g., query and document vectors in information retrieval tasks). Huang et al. [2013] introduced deep structured semantic models (DSSM), which projected query and document into a common low-dimensional space using feed-forward neural network models, choosing the cosine similarity between a query vector and a document vector as the relevance score. Shen et al. [2014] and Palangi et al. [2016] extended the feed-forward structure in DSSM to convolutional neural networks (CNN) and recurrent neural networks (RNN) with Long Short-Term Memory (LSTM) cells. Differing from previous works that used click-through data with only two classes, Nigam et al. [2019] proposed a loss function that differentiates three classes (relevant, partially relevant, and irrelevant) in document search with a huge amount of real commercial data.
Cross-Lingual Information Retrieval Traditionally, Cross-Lingual Information Retrieval (CLIR) is conducted as a two-step pipeline: machine translation followed by monolingual information retrieval [Nie, 2010]. However, this approach requires a well-trained translation model and usually suffers from translation ambiguity [Zhou et al., 2012]; error propagation from machine translation may even deteriorate the retrieval results. As an alternative, pre-trained word embeddings [Mikolov et al., 2013; Pennington et al., 2014], which learn word representations of different languages on large-scale text corpora, have led to a surge of improved performance on many language tasks. Nevertheless, the training objective of these embeddings differs from that of IR tasks, so their direct application may be limited.
Generalization Error There have been previous works on the generalization error of learning-to-rank models. Lan et al. [2008] analyzed the stability of pairwise models and gave query-level generalization error bounds. Lan et al. [2009] provided a theoretical framework for ranking algorithms and proved generalization error bounds for three list-wise losses: ListMLE, ListNet and RankCosine. Chapelle and Wu [2010] introduced an annealing procedure to find the optimal smoothing factor in an extension of the surrogate loss SoftRank, and derived a generally applicable bound on the generalization error of query-level learning-to-rank algorithms. Tewari and Chaudhuri [2015] proved that several loss functions used in learning-to-rank, such as the cross-entropy loss, suffer no degradation in generalization ability as document lists become longer. However, theoretical analysis of search models with neural architectures is still very limited.

Smooth Neural Document Retrieval
In this section, we propose a novel Smooth Cross-Lingual Document Retrieval framework. This framework consists of three parts. First, we use neural models to encode queries and documents from different languages, and represent them by low-dimensional vectors. Second, we propose a smooth cosine similarity to indicate the relevance score r, which avoids gradient explosion and therefore stabilizes the training process. Finally, we introduce Smooth Ordinal Search Loss for optimizing r.

Text Representation
We use embeddings, i.e., low-dimensional dense vectors, to represent both queries and documents. Differing from mono-lingual tasks, in cross-lingual document retrieval one can rarely observe common tokens between queries and documents. Therefore, cross-lingual document retrieval usually requires embeddings built from different vocabularies, as queries and documents come from two different languages [Sasaki et al., 2018]. This naturally increases the number of model parameters if we regard the embeddings as part of the model and fine-tune them during training, so the retrieval model itself demands higher stability.
Queries and documents are represented numerically. A query q is tokenized into a list of words q = q_1 q_2 ... q_{l_q} of length l_q. For example, the tokenization of "Apple is a fruit" is ["Apple", "is", "a", "fruit"], of length 4. Similarly, d = d_1 d_2 ... d_{l_d} is a document of length l_d in the document language. For simplicity, we use A and B to denote the query language and the document language, respectively. We also choose the same dimension p for both word embeddings Emb_A and Emb_B. Therefore, we encode the query q and the document d by embedding representations Q ∈ R^{p×l_q} and D ∈ R^{p×l_d}, where the i-th column of Q is the word embedding from Emb_A for the token in the i-th position of the query, and the j-th column of D is the word embedding from Emb_B for the token in the j-th position of the document.
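The embedding step can be sketched as follows; the toy vocabularies, random embeddings, and dimension p = 4 are purely hypothetical and stand in for the learned tables Emb_A and Emb_B:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4  # embedding dimension, shared by both languages

# Toy per-language embedding tables (hypothetical vocabularies).
emb_A = {w: rng.standard_normal(p) for w in ["apple", "is", "a", "fruit"]}
emb_B = {w: rng.standard_normal(p) for w in ["la", "pomme", "est", "un", "fruit"]}

def encode(tokens, table):
    # Stack word embeddings column-wise: the result is p x l for l tokens,
    # so the i-th column holds the embedding of the i-th token.
    return np.stack([table[t] for t in tokens], axis=1)

Q = encode(["apple", "is", "a", "fruit"], emb_A)          # shape (p, l_q) = (4, 4)
D = encode(["la", "pomme", "est", "un", "fruit"], emb_B)  # shape (p, l_d) = (4, 5)
```

Note that Q and D generally have different numbers of columns; the neural model described next maps both into the same fixed-size space.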

Neural Model Architecture
With the embedding representations Q and D, which can be of different sizes due to different numbers of tokens in the query and the document, we apply neural models to obtain vectors of the same size. Carefully designed neural models can perform dimension reduction along the sequence length, projecting the raw texts of both queries and documents into the same Euclidean space of the same dimension, regardless of the number of tokens they contain. We can then quantify the relevance between a query and a document, as they lie in the same space, by calculating standard metrics such as cosine similarity. For example, if the embedding size is p, a query of length l_q is represented by Q ∈ R^{p×l_q}, and two documents of lengths l_{d1} and l_{d2} are represented by D_1 ∈ R^{p×l_{d1}} and D_2 ∈ R^{p×l_{d2}}, respectively. In this case, Q, D_1 and D_2 live in different spaces. To enable a quantitative comparison of the relevance of the two pairs (Q, D_1) and (Q, D_2), a well-designed neural model is used to transform both Q and D into R^p.
In order to achieve this goal, we propose to use average pooling over the columns of Q and D, followed by a non-linear activation function tanh for both query and document models. For a query q, the query model f q : Q → v q ∈ R p takes embedding Q as input and outputs a final representation v q . Similarly, the document model f d : D → v d ∈ R p also generates a final representation v d into the same space using D. There are other modeling choices such as LSTM and CNN. We compare the performance of different neural models in the experiments.
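A minimal sketch of the pooling encoder described above (average pooling over token-embedding columns followed by tanh); the shapes are illustrative:

```python
import numpy as np

def encode_text(E):
    # E is a p x l embedding matrix (each column is one token's embedding).
    # Average pooling over columns collapses the variable length l into a
    # fixed-size vector in R^p; tanh then applies the non-linearity.
    return np.tanh(E.mean(axis=1))
```

Queries and documents of any length map into the same R^p, so the pair (f_q(q), f_d(d)) can be compared directly by a similarity score.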
The benefits of this model are twofold: on one hand, despite the non-linearity, the models are smooth and Lipschitz continuous with respect to the embedding parameters, and therefore benefit from convergence of the generalization error during training; on the other hand, average pooling avoids extra parameters, which simplifies the model space and reduces tuning effort during training, in accordance with the findings in Nigam et al. [2019].

Smooth Cosine Similarity
Cosine similarity has been widely used to measure the relevance between queries and documents in information retrieval. Given two vectors x and z of the same size in Euclidean space, the cosine similarity cos(x, z) = x^T z / (‖x‖‖z‖) measures how similar the two vectors are irrespective of their norms ‖x‖ and ‖z‖. However, the norms play a crucial role when calculating the gradient. More specifically, the gradient goes to infinity as the norms approach zero, resulting in unstable weight updates during training, a phenomenon also known as exploding gradients. The intuition is that for a vector of small norm, a slight disturbance can greatly change the angle between it and another vector, i.e., the cosine similarity. The use of cosine similarity can therefore lead to exploding gradients regardless of the model structure whenever gradient descent methods are used for optimization. Recently, most semantic matching and learning-to-rank models have been built on neural architectures and are commonly optimized by gradient descent, so these retrieval models suffer greatly from this issue.
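The instability can be checked numerically: the analytic gradient of cos(x, z) with respect to x grows without bound as ‖x‖ shrinks, even though the angle (and hence the similarity itself) stays fixed. The vectors below are arbitrary illustrations:

```python
import numpy as np

def cos_grad_x(x, z):
    # Analytic gradient of cos(x, z) with respect to x:
    #   z / (||x|| ||z||)  -  (x . z) x / (||x||^3 ||z||)
    nx, nz = np.linalg.norm(x), np.linalg.norm(z)
    return z / (nx * nz) - (x @ z) * x / (nx ** 3 * nz)

z = np.array([1.0, 1.0])
x0 = np.array([1.0, -0.5])
# Rescaling x leaves the angle unchanged, but the gradient norm
# scales like 1 / ||x||, blowing up as the norm approaches zero.
grad_norms = [np.linalg.norm(cos_grad_x(s * x0, z)) for s in (1.0, 1e-2, 1e-4)]
```

A single noisy step that shrinks a representation toward the origin can thus produce an enormous update on the next step.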
To increase the stability of model training, we propose Smooth Cosine Similarity (SCS) in place of the regular cosine similarity. We define the SCS between a query and a document as

r_ε(q, d) = f_q(q)^T f_d(d) / ((‖f_q(q)‖ + ε)(‖f_d(d)‖ + ε)),

where ‖·‖ is the Euclidean norm and ε > 0. Under SCS, the gradient of r_ε with respect to f_q(q) and f_d(d) is upper bounded over the whole space, which stabilizes the training procedure. Moreover, by introducing the smoothness hyper-parameter ε into the norms of the feature representation vectors, the similarity score not only measures the angle between the vectors but also carries information about their norms. As a result, SCS is not order-preserving with respect to cosine similarity, i.e., cos(x, z_1) > cos(x, z_2) does not necessarily imply r_ε(x, z_1) > r_ε(x, z_2). The choice of ε is flexible, and the model performance is not sensitive to it, as further analyzed in the experiments.
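A minimal sketch of SCS as defined above, with ε added to each norm in the denominator (a form consistent with the 2/ε-Lipschitz bound derived in the appendix); the default ε = 0.2 is an arbitrary illustration:

```python
import numpy as np

def smooth_cosine(vq, vd, eps=0.2):
    # Smooth cosine similarity: adding eps > 0 to each norm keeps the
    # denominator bounded away from zero, so the gradient with respect
    # to either vector stays bounded (roughly by 2 / eps) everywhere.
    return (vq @ vd) / ((np.linalg.norm(vq) + eps) * (np.linalg.norm(vd) + eps))
```

For vectors of large norm SCS approaches plain cosine similarity, while for near-zero vectors it degrades gracefully toward 0 instead of producing an unstable angle.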
Another common method to avoid exploding gradients is gradient clipping, i.e., clipping gradients whose norm exceeds a given threshold. Our proposed SCS does not exclude gradient clipping; in fact, the two complement each other. In our pilot experiments, we observe that gradient clipping alone is not sufficient in our cross-lingual document retrieval setup, and adding SCS improves performance over using gradient clipping alone.
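For reference, norm-based gradient clipping, which SCS complements rather than replaces, amounts to rescaling any gradient whose Euclidean norm exceeds a threshold; a minimal sketch, with an arbitrary default threshold:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # Standard gradient clipping: rescale the gradient if its Euclidean
    # norm exceeds the threshold; leave it unchanged otherwise.
    n = np.linalg.norm(grad)
    return grad if n <= max_norm else grad * (max_norm / n)
```

Clipping caps the magnitude of a bad update after the fact, whereas SCS prevents the gradient from blowing up in the first place; the two act at different points of the training loop.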

Smooth Ordinal Search Loss
In a search ranking model, it is critical to define a surrogate loss function, since ranking metrics such as NDCG and Precision are not continuous and are therefore difficult to optimize. In choosing a proper loss function, one has to consider two criteria: first, minimizing the surrogate loss on the training set should imply a small surrogate loss on the test set; second, a small surrogate loss on the test set should imply desirable ranking metric results on the test set [Chapelle et al., 2011]. To address the first criterion, we formulate the search ranking model as an ordinal regression problem. For the second, we propose Smooth Ordinal Search Loss (SOSL) as the surrogate loss.
Recall that a pair of query and document has an ordered relevance level, and the goal of a search ranking model is to select a subset of documents such that more relevant documents are ranked at the top and less relevant documents lower. Taking a three-class ranking problem as a concrete example, pairs can be grouped into relevant, partially relevant and irrelevant. If a mis-ranking of a relevant pair is unavoidable, it is preferable to rank a partially relevant document above an irrelevant one: not all mistakes are equal.
Formally, the Smooth Ordinal Search Loss is defined as

SOSL(r_ε, y) = (θ_{y−1} − r_ε)² I(r_ε < θ_{y−1}) + (r_ε − θ_y)² I(r_ε > θ_y),

where r_ε is the smooth relevance score between a query q and a document d, y ∈ {1, ..., K} is the ordered class label denoting the general relevance degree, −1 = θ_0 < ... < θ_K = 1 are the thresholds, and I is the indicator function. Differing from the margin penalty function used in Rennie and Srebro [2005], we choose the smooth function (·)². The thresholds lie within the range [−1, 1] in our setup, instead of the whole real line, due to the boundedness of smooth cosine similarity.
The interpretation of this loss function is intuitive, as shown in Figure 2: if the relevance score falls into the correct segment (i.e., the true ordered class of a particular query-document pair is y and θ_{y−1} ≤ r_ε ≤ θ_y), then the loss is 0; otherwise the loss is the squared degree to which the relevance score violates the threshold.
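The piecewise behaviour described above can be sketched in a few lines; the concrete thresholds below are hypothetical, chosen only for illustration, and the class label y is 1-based:

```python
def sosl(r, y, thresholds):
    # thresholds = [theta_0, ..., theta_K] with theta_0 = -1, theta_K = 1.
    # Zero loss when r lands in the segment of its true class y (1-based);
    # otherwise, a squared penalty on the violation of the nearer boundary.
    lo, hi = thresholds[y - 1], thresholds[y]
    if r < lo:
        return (lo - r) ** 2
    if r > hi:
        return (r - hi) ** 2
    return 0.0

theta = [-1.0, -0.2, 0.3, 1.0]  # hypothetical thresholds for K = 3 classes
```

Because the penalty grows with the distance to the violated boundary, mis-ranking a relevant pair as irrelevant costs more than mis-ranking it as partially relevant, matching the ordinal structure of the labels.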

Theoretical Analysis
Generalization error measures the difference between training error and testing error. Typically, if the generalization error of a model is bounded and converges to zero, then minimizing the empirical loss on the training set implies that the expected loss on unseen testing data is also minimized. Although previous works gave generalization bounds for surrogate losses in learning-to-rank models [Lan et al., 2008, 2009; Tewari and Chaudhuri, 2015], to the best of our knowledge, no theoretical result has been derived for search models with neural architectures. Here, we prove a generalization error bound for the commonly used SGD procedure. This bound suggests that the generalization gap at any training step T converges to zero as the number of training query-document pairs goes to infinity. We show the detailed proof in the appendix.
The following Proposition 1 and Lemma 2 show that SOSL is both smooth and Lipschitz continuous with respect to not only relevance score r but also embedding parameters.
Proposition 1. SOSL is smooth and Lipschitz continuous with respect to r.
Lemma 2. Let l(r, y) be a smooth and Lipschitz continuous loss function with respect to r; then in our search ranking model, l(r_ε(q, d), y) = l(f_q, f_d, y) is also smooth and Lipschitz continuous with respect to the model parameters, i.e., the embeddings.
From Lemma 2, we can assume l(f_q, f_d, y) to be α-Lipschitz and γ-smooth. Next, suppose S = ((q_1, d_1, y_1), ..., (q_n, d_n, y_n)) is a training set of size n sampled from the data distribution, and assume the unseen testing data follows distribution E. Let f = (f_q, f_d) be the neural models for queries and documents. Defining L^l_S(f) as the mean training error and L^l_E(f) as the expected error on the testing set, the following theorem establishes an upper bound on the generalization error L^l_E(f) − L^l_S(f).

Theorem 3 (Generalization Error Bound). Let l be a smooth loss function bounded by M. Suppose we run SGD for T steps with step sizes α_t = c/t, t ∈ {1, 2, ..., T}. Then, with probability at least 1 − δ over the draw of S,

L^l_E(f) − L^l_S(f) ≤ 2β(n) + (4nβ(n) + M) √(ln(1/δ) / (2n)),

where β(n) is the uniform stability of SGD after T steps.

Note that in Theorem 3, the bound does not depend on the thresholds θ. We ignore the dependency on ε and α, as both are constants. Theorem 3 suggests that for any T, the generalization error after T steps of SGD converges as the sample size of S increases. For a proper step size, we can allow T to increase with n; in particular, if γc < 2, then T may increase at a rate of n while the generalization error still converges.

Experiments
Datasets We use the publicly available large-scale Cross-Lingual Information Retrieval (CLIR) dataset built from Wikipedia [Sasaki et al., 2018] for our experiments. All queries are in English, extracted as the first sentences of English pages with title words removed. The Relevant (MR) documents are the foreign-language pages having an inter-language link to the English page; the Partially Relevant (SR) documents are those having mutual links to and from the relevant documents. Additionally, we randomly sample 40 other pages as Irrelevant (NR) documents for each query. To provide a comprehensive study, we use the document datasets of two high-resource languages, French (fr) and Italian (it), and two low-resource languages, Swahili (sw) and Tagalog (tl). Queries are randomly split into training, validation and testing sets at a ratio of 3:1:1. We report the data statistics in Table 1.

SOSL and Other Loss Functions
We first present the results of using different loss functions with smooth cosine similarity, across languages, in Table 2. We compare with three commonly used losses: Mean Square Error (MSE), Proportional Odds Loss (PO) [McCullagh, 1980] and 3PartL2 [Nigam et al., 2019]. The hyper-parameter ε is fixed to 1 for all losses. We observe that SOSL outperforms the other loss functions in all languages and under all metrics. We attribute this success to two factors: first, SOSL encourages smoothness over the optimization of parameters and thus guarantees convergence of the generalization error; second, SOSL adds no penalty when the relevance score falls into the correct segment. Note that the performance in the low-resource languages (sw and tl) is better than in the high-resource languages (fr and it) for some metrics, because the former have fewer SR documents and fewer documents in total per query, which makes it easier to distinguish MR from NR documents and therefore more likely to rank the MR document at the top. To investigate why SOSL performs best, we plot the density curves of the relevance scores predicted under each loss function on the training set in Figure 3. With SOSL (top left), all three classes are well separated. For the 3PartL2 loss, however, there is a large overlap between the NR and SR groups. In the plots of the PO and MSE losses, few MR documents are classified correctly; moreover, MR and SR documents are mixed together, with a portion of MR incorrectly classified as NR, which is undesirable. This analysis showcases the power of SOSL to distinguish different types of documents at the end of training.

Average Pooling and Other Neural Architectures
We also compare average pooling with other popular neural architectures. We choose the DSSM-CNN model [Shen et al., 2014] and the DSSM-LSTM model [Palangi et al., 2016], which are widely used in information retrieval, as baselines. After tuning these two models, we specify their hyper-parameters as follows: for both DSSM-CNN and DSSM-LSTM, we set the initial learning rate to 0.001 and the exponential decay rate to 0.95. We use a batch size of 128 for DSSM-CNN, stopping after 30 epochs, and a batch size of 64 for DSSM-LSTM, stopping after 15 epochs.
We set the window size to 3 (i.e., word trigrams) for DSSM-CNN, with 300 filters in the convolutional layer; a max-pooling layer and a fully connected layer with output size 64 are stacked after the convolutional layer. In the DSSM-LSTM model, we use a bidirectional LSTM with 64 hidden units, and concatenate the first hidden state of the forward LSTM with the last hidden state of the backward LSTM as the output of the LSTM layer; this is followed by a fully connected layer with output size 64 as the final output. To regularize the DSSM-CNN and DSSM-LSTM models, we apply dropout [Srivastava et al., 2014] on the word embedding layer with a rate of 0.4. We observe that using separate modules for queries and documents improves performance compared to sharing the same model, so two modules with the same structure are used for queries and documents in both CNN and LSTM.
In Table 3, we compare the average pooling architecture with the DSSM-CNN and DSSM-LSTM models. Both original studies used data containing only two classes, positive and negative, and designed binary loss functions accordingly. For a fair comparison, all neural architectures are followed by the same SOSL with the same thresholds θ used in Table 2. Across all languages and all evaluation metrics, average pooling performs best among the three. We attribute the success of average pooling to two reasons: first, queries typically have fewer words than documents and do not tend to have long-range dependencies; second, the original approaches only deal with two classes, lacking the flexibility to handle more. In addition, CNN and LSTM are more difficult to optimize, as they require more computational resources and training time, and tend to overfit. Our results are in accordance with the findings in Nigam et al. [2019].
[Table 3: results by language and model structure, reporting P_mr@1, P_mr@5, P_r@5, NDCG@5, MAP, MRR_mr, and MRR_r.]

Impact of Hyper-parameters
The smoothness factor ε determines the smoothness of the cosine similarity. When ε is 0, the loss function is non-smooth with respect to the weights of the neural models, and thus generalization cannot be guaranteed. To show the improvement brought by smoothness and illustrate the effect of ε, we vary ε from 0 to 2 for SOSL with French as the document language. For each ε, we use grid search to find the best θ. The results are visualized in Figure 4. We observe that model performance improves when the model becomes smooth. Moreover, over a wide range of ε (from 0.25 to 1.75), adding the smoothness factor outperforms the non-smooth cosine similarity. We see a concave curve in all metrics, but performance is relatively insensitive to the choice of ε. We suggest taking ε within [0.25, 1] for desirable output and stability.

Different Numbers of Negative Samples
In real industrial search systems, ranking is usually run after a document "filtering" process, e.g., the matching stage, which can greatly reduce the number of documents to be ranked. In Figure 5, we explore the effect of different numbers of irrelevant (NR) documents. We create 9 different datasets in which the number of NR documents per query varies from 20 to 100; the document language is the high-resource language French. We sample the same numbers of queries for the training, validation and testing sets as in the experiments discussed earlier. The average number of SR documents per query varies but remains close to 12.6. The red curve shows that as the number of NR documents increases, the data become noisier, making it more difficult to rank correctly and predict the MR document. On the other hand, NDCG, MRR and MAP are relatively resistant to the increased number of NR documents: they decrease by only about 10% while the number of NR documents grows 5×. This further validates the stability of our proposed framework. [Figure 5: metrics are as in Table 2; NDCG is calculated from the top 5 documents.]

Conclusion
In this study, we propose a smooth learning framework for the cross-lingual information retrieval task. We first introduce a novel measure of relevance between queries and documents, Smooth Cosine Similarity (SCS), whose gradient is bounded and thus avoids exploding gradients, allowing the model to be trained more stably. Additionally, we propose a smooth loss function, Smooth Ordinal Search Loss (SOSL), and provide theoretical guarantees on the generalization error bound of the whole framework. Further, we conduct extensive experiments comparing our approach with existing document search models, and show significant improvements under commonly used ranking metrics on the cross-lingual document retrieval task in several languages. Both the theoretical and empirical results suggest potentially wide applications of this smooth learning framework.

A Appendices
A.1 Proof of Lemma 2

Lemma A.1. If functions f and g are L_1-Lipschitz and L_2-Lipschitz, respectively, then g ∘ f is L_1 L_2-Lipschitz.
Here we use ∘ to denote the function composition operator, i.e., g ∘ f(·) = g(f(·)). The Lemma follows easily from the definition of L-Lipschitz continuity.
Proof. We have the inequality chain ‖g(f(u)) − g(f(v))‖ ≤ L_2 ‖f(u) − f(v)‖ ≤ L_2 L_1 ‖u − v‖, where we use the Lipschitz properties ‖∇g(f(u))‖ ≤ L_2 in the first inequality and ‖∇f(v)‖ ≤ L_1 in the second.

A.2 Proof of Proposition 1
Proof. Let K be the number of classes and −1 = θ_0 ≤ θ_1 ≤ ... ≤ θ_K = 1. With the immediate-threshold L_2 penalty,

SOSL(r, y) = (θ_{y−1} − r)² I(r < θ_{y−1}) + (r − θ_y)² I(r > θ_y).

For any given y, the derivative with respect to r is

∇SOSL(r, y) = −2(θ_{y−1} − r) if r < θ_{y−1}, 0 if θ_{y−1} ≤ r ≤ θ_y, and 2(r − θ_y) if r > θ_y,

and the second derivative is

∇²SOSL(r, y) = 0 if r ∈ (θ_{y−1}, θ_y), and 2 otherwise.

Since |r| ≤ 1 and all thresholds lie in [−1, 1], the first derivative is bounded and |∇²SOSL(r, y)| ≤ 2 for any r and y; hence SOSL is Lipschitz continuous and 2-smooth with respect to r.

For the smooth cosine similarity, a direct calculation of the gradient and Hessian with respect to f_q shows that ‖∇_{f_q} r_ε‖ ≤ 2/ε and ‖∇²_{f_q} r_ε‖ ≤ N(p)/ε², where N(p) is a constant that depends only on p. Therefore, r_ε(f) is (2/ε)-Lipschitz and (N(p)/ε²)-smooth with respect to f_q; similarly, the same holds with respect to f_d.

Deep neural networks normally do not enjoy Lipschitz continuity or smoothness due to their expressive power, which makes it difficult to analyze their theoretical performance. We now show that our particular models are Lipschitz continuous and smooth, for both query and document, with respect to the embedding parameters.
For a text query Q with token length t, we represent it by Q = (q_1, ..., q_t) ∈ R^{n_q × t}, where q_i ∈ R^{n_q} is the one-hot vector whose entry at the index of the i-th token of Q is 1 and whose other entries are 0. The output of the query encoding network f_q is then

f_q(Q) = tanh( (1/t) Σ_{i=1}^{t} W_q^T q_i ).

Note that the hyperbolic tangent tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}) is smooth, and (1/t) Σ_{i=1}^{t} W_q^T q_i is linear in W_q. We can therefore claim that f_q(Q) is smooth and Lipschitz continuous with respect to W_q.
Using Lemma A.1, Lemma A.2 and Proposition 1, we can prove Lemma 2.
The following theorem can be derived from Agarwal [2008].
Theorem A.3. Let A be an ordinal regression algorithm which, given as input a training sample S ∈ (X × [K])^n, learns a real-valued function f_S : X → R and a threshold vector θ_S ≡ (θ_S^1, ..., θ_S^{K−1}). Let l be any loss function in this setting such that 0 ≤ l(f_S, θ_S, (x, y)) ≤ M for all training samples S and all (x, y) ∈ X × [K], and let β : N → R be such that A has loss stability β with respect to l. Then for any 0 < δ < 1 and for any distribution D on X × [K], with probability at least 1 − δ over the draw of S,

L^l_E(f_S, θ_S) − L^l_S(f_S, θ_S) ≤ 2β(n) + (4nβ(n) + M) √(ln(1/δ) / (2n)).

Note that the original theorem in Agarwal [2008] is stated with θ over the real line R, but it trivially applies to θ defined over [−1, 1]. It is also easy to verify that 0 ≤ l(f_S, θ_S, (x, y)) ≤ M, since our ordinal regression loss is bounded. If β(n) decays at a rate of 1/n, then the generalization bound L^l_E(f_S, θ_S) − L^l_S(f_S, θ_S) goes to zero as n goes to infinity. Next, we give the definition of the stability β and show that DNNs optimized by SGD satisfy this requirement.

A.3 Stability
Uniform stability measures how much an algorithm is affected by removing one sample from the training set. Let S = ((x_1, y_1), ..., (x_n, y_n)) be the training set, and let S^{\i} denote the set obtained by removing the i-th element from S:

S^{\i} = ((x_1, y_1), ..., (x_{i−1}, y_{i−1}), (x_{i+1}, y_{i+1}), ..., (x_n, y_n)).

The formal definition was first proposed by Bousquet and Elisseeff [2002] and is stated as follows.

Definition A.3 (Uniform Stability). An algorithm A has uniform stability β with respect to the loss function l if the following holds: ∀S ∈ X^n, ∀i ∈ {1, ..., n}, ‖l(A_S, ·) − l(A_{S^{\i}}, ·)‖_∞ ≤ β.
Proof of Theorem 3. By combining Lemma 2, Theorem A.3 and Theorem A.4, we can prove Theorem 3.