Grouping Words with Semantic Diversity

Deep Learning-based NLP systems are sensitive to unseen tokens and hard to train with high-dimensional inputs, both of which hinder generalization. We introduce an approach that groups input words by their semantic diversity to obtain a simplified, low-ambiguity language representation. Since semantically diverse words occur in different contexts, we can substitute words with their groups and still distinguish word meanings from context. We design several algorithms that compute diverse groupings based on random sampling, geometric distances, and entropy maximization, and we prove formal guarantees for the entropy-based algorithms. Experimental results show that our methods improve the generalization of NLP models: they enhance accuracy on POS tagging and language modeling tasks and yield significant improvements on medium-scale machine translation tasks, up to +6.5 BLEU points. Our source code is available at https://github.com/abdulrafae/dg.


Introduction
Natural Language Understanding has seen remarkable success with the rise of Deep Learning. However, the variety and richness of human languages result in high-dimensional inputs to NLP models, increasing learning complexity and error rates. First, open-vocabulary inputs inevitably bring rare and out-of-vocabulary (OOV) words. Second, network complexity grows with the input dimension; in particular, the "curse of dimensionality" makes learning difficult on medium and small datasets. This paper addresses these limitations by introducing new grouping methods that compute alternative language representations which simplify textual inputs. Alternative language representations such as Pinyin, Metaphone, logograms, and Emoji already exist in natural languages and have been shown to improve various NLP applications (Du and Way, 2017; Liu et al., 2018; Khan et al., 2020). While these representations can help, they were not developed for NLP performance. In this work, our goal is to design algorithms that compute new language representations specifically to enhance NLP performance. We ask: "Can we compute a generalized language representation to improve NLP applications?" An intuitive answer is to group similar words in the training and test sets and replace each word with its group. A word grouping, viewed as a many-to-one mapping, can significantly reduce the vocabulary size, which lowers the input feature dimension and leads to a more generalized NLP model.
For example, consider two sentences: (a) "you ask me."; (b) "she tells me." The vocabulary contains five words: "ask", "she", "tells", "me", and "you". Grouping the words into "A" and "B" reduces the vocabulary size to two, resulting in a simplified language representation. We could apply conventional word clustering after embedding words into a vector space and measuring their distances with cosine similarity. However, clustering can map different sentences to the same sequence of groups, making them indistinguishable. In our example, clustering maps the similar pronouns "you", "she", and "me" into one group denoted "A", and the verbs "ask" and "tells" into "B". Both sentences are then rewritten as "A B A.", and the distinct meanings of the two original sentences are lost. However, if we group semantically diverse words, namely "you" and "tells" as "A", and "she", "me", and "ask" as "B", then we obtain two distinct samples: (a) "A B B." and (b) "B A B." The distinct meanings of the two sentences are retained. This example illustrates the need to group words so that each sentence remains uniquely represented. Generalizing this idea, we ask: "How can we design an algorithm that simplifies the language representation while preserving meaning expressiveness?" Our key observation is that the contexts of semantically diverse words vary more than those of semantically close words. In our approach, we measure semantic similarity with the cosine of word embeddings, which are learned from context (Mikolov et al., 2013; Bojanowski et al., 2017; Pennington et al., 2014); thus, similar contexts indicate semantic similarity and vice versa. In this way, our diverse grouping uses context to distinguish words within the same group, leading to a more expressive representation.
In this paper, we introduce five novel algorithms of three types that group semantically diverse words together. We develop new theoretical methods for diverse grouping and adapt them to our NLP context. We begin with a random-sampling grouping. Next, we develop a grouping algorithm based on geometric distances, which computes a partition of a set of points in a metric space that maximizes the sum of intra-group distances. This objective is essentially the opposite of the one used in clustering problems such as k-means (Forgy, 1965) and k-medians (Jain and Dubes, 1988), where one minimizes a monotone function of intra-group distances. Finally, we present a grouping algorithm that maximizes diversity by maximizing the unigram entropy of the representation.
We show that the unigram entropy algorithm is a $\frac{C-1}{4C+4\varepsilon C}$-approximation of the optimal solution for maximizing the entropy of the new representation, where $C$ is the number of groups and $\varepsilon$ is a small positive real number. This bound means that, in the worst case, our algorithm achieves about a quarter of the optimal entropy, while in typical cases it can be very close to optimal. Importantly, our theoretical results remain useful in NLP tasks after appropriate adjustments. In our experiments, each of the above methods significantly enhances NMT accuracy, by up to 6.5 BLEU points (36.9% relative). Our contributions can be summarized as follows:
1. Diversity Grouping Algorithms. We introduce various algorithms that group semantically diverse words together based on random sampling, geometric distances, and entropy maximization (§3).
2. Provable Guarantees. We prove that the entropy-based algorithm approximates the maximum unigram entropy within a factor of $\frac{C-1}{4C+4\varepsilon C}$ (§4).
3. Applications in NLP. Importantly, we apply the above algorithms to NLP applications and show that they significantly enhance prediction accuracy (§5).

Related Work
Typical word clustering (Baker and McCallum, 1998; Martin et al., 1998; Feng et al., 2020) or word class (Halteren et al., 2001) methods collect similar words together, while our method groups semantically diverse words together. Moreover, unlike the common use of clustering to smooth unseen words, our goal is to reduce the dimension of the input sentence by grouping diverse words so that a word-group sequence uniquely represents a word sequence.
Our diverse grouping approach is also related to sparse representation (Wright et al., 2008), which makes the network parameter matrices sparse without changing their dimension, whereas our methods reduce the Neural Network (NN) input dimension.
Such dimension reduction can be seen as a kind of regularization on NNs. There are many types of NN regularization: Louizos et al. (2018) add a parameter norm penalty to the objective function, Bertsekas (2014) adds constrained optimization, and many works exploit the sub-structure of network models, such as dropout, early stopping, and weight decay. These approaches are very popular but may limit model capacity, while our methods benefit from in-domain linguistic knowledge. Work such as Wang et al. (2018) adds augmented data (e.g., noisy data, pseudo data), but it is a domain-dependent approach that inevitably increases training time.
Our work is orthogonal to the successful research on word embedding, in which each word is mapped one-to-one to a real-valued vector that tries to preserve word-pair distances. In contrast, our methods map many words to one group in a discrete space. Also, our systems build on BPE, but we do not decompose and recombine words. Therefore, our methods are complementary to any improved word embedding (May et al., 2019) or BPE (Provilkov et al., 2020) variant.

Algorithms for Diverse Grouping
We now present our algorithms. We denote the set of words by $V$ and the set of groups by $\mathcal{V}$, i.e., $\mathcal{V}$ is a subset of the powerset of $V$. Each grouping can be encoded as a function $\gamma$ that maps each word $w \in V$ to some group $\gamma(w) \in \mathcal{V}$.

Random Grouping
Our first approach computes a random grouping, as shown in Algorithm 1. This algorithm's complexity is $O(|V|)$, where $|V|$ is the vocabulary size.
We map each word to a group chosen uniformly at random. $C$ is a hyperparameter indicating the total number of groups. Because tuning $C$ by exhaustive search is expensive, we set $C$ to the total number of Metaphones in English, inspired by previous work (Du and Way, 2017; Liu et al., 2018) in which phonetics improves NMT for specific languages. Furthermore, each group's size follows the natural phonetic encoding distribution (e.g., Metaphone (Philips, 1990)), obtained by treating each phonetic encoding as a group. For example, each Metaphone is considered a group, and the number of groups in the random grouping is set to the number of unique Metaphones.
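For concreteness, here is a minimal sketch of this uniform random grouping; the function name, seed handling, and the two-group toy vocabulary are our own illustration, not part of the released implementation.

```python
import random

def random_grouping(vocab, num_groups, seed=0):
    """Algorithm 1 style: map each word to a group chosen uniformly at random."""
    rng = random.Random(seed)
    return {w: rng.randrange(num_groups) for w in vocab}

# In the paper's setting, num_groups would equal the number of distinct
# Metaphone codes observed for the vocabulary.
vocab = ["you", "ask", "me", "she", "tells"]
print(random_grouping(vocab, num_groups=2))
```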
Algorithm 2 extends Algorithm 1 by learning a Poisson or Gaussian model of the group-size distribution. We fit the distribution of the Metaphone group sizes with a Poisson or Gaussian distribution. Then, we sample each group size from this fitted distribution. Finally, we sample the words of each group uniformly at random.

Algorithm 2 Poisson/Gaussian-based Random Grouping
Input: Vocabulary of words $V$, a phonetic encoding
Parameter: Groups $[\gamma_1, \ldots, \gamma_C]$, $C \in \mathbb{N}$
Output: Grouping $\gamma$
Randomly sample each group size $k$ from a Poisson/Gaussian distribution (fitted to the English Metaphone group-size distribution) and fill the group with words sampled uniformly at random.
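A sketch of this size-sampling variant follows, under the assumption that a simple Poisson fit (rate equal to the mean Metaphone group size) is used; the function and variable names are illustrative only.

```python
import numpy as np

def poisson_random_grouping(vocab, metaphone_group_sizes, seed=0):
    """Sketch in the spirit of Algorithm 2: sample group sizes from a Poisson
    fitted to the Metaphone group-size distribution, then fill each group
    with randomly chosen words."""
    rng = np.random.default_rng(seed)
    lam = np.mean(metaphone_group_sizes)        # Poisson rate from the observed sizes
    words = list(vocab)
    rng.shuffle(words)
    gamma, group_id = {}, 0
    while words:
        k = max(1, rng.poisson(lam))            # sampled size of the next group
        for w in words[:k]:
            gamma[w] = group_id
        words = words[k:]
        group_id += 1
    return gamma
```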

Distance-Based Diverse Grouping
We now introduce our grouping algorithm based on distances between the vector representations of words. The complexity of this algorithm is $O(|V|^2)$.
Our approach is described in Algorithm 3, which is inspired by the classical 2-approximation algorithm for k-center clustering (Gonzalez, 1985). The algorithm works as follows: randomly pick a word from the vocabulary and add it to the list $L'$. Pick as the second word the one furthest from the first word, pick as the third word the one furthest from the closest of the two selected words, and so on. Finally, for each group size $k$ (following the Metaphone encoding size distribution), assign the top $k$ words of the ranked list to one group and remove them from the list. This process is repeated until all words are assigned. We use cosine similarity to measure the pairwise distance between words. Figure 1 illustrates the algorithm.
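A minimal sketch of this ranking-and-chunking procedure is shown below, assuming the embeddings are given as a NumPy matrix and the group sizes sum to the vocabulary size; tie-breaking and other details of Algorithm 3 are simplified here.

```python
import numpy as np

def rank_by_farthest_point(emb):
    """Order vocabulary indices by farthest-point traversal under cosine distance."""
    # Normalise so that cosine distance = 1 - dot product.
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    n = X.shape[0]
    order = [0]                                   # start from an arbitrary word
    min_dist = 1.0 - X @ X[0]                     # distance to the selected set
    for _ in range(n - 1):
        nxt = int(np.argmax(min_dist))            # farthest from its closest pick
        order.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - X @ X[nxt])
    return order

def distance_based_grouping(emb, group_sizes):
    """Chunk the farthest-point ranking into consecutive groups of the given sizes."""
    order, gamma, start = rank_by_farthest_point(emb), {}, 0
    for gid, k in enumerate(group_sizes):
        for idx in order[start:start + k]:
            gamma[idx] = gid
        start += k
    return gamma
```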

Maximum Entropy-Based Unigram Diverse Grouping
We now present our grouping algorithm, which maximizes unigram entropy. Our ultimate goal is to maximize the information kept (or, equivalently, to reduce the information lost).

Algorithm 3 Distance-Based Diverse Grouping
Input: Vocabulary of words $V$
Parameter: Groups $[\gamma_1, \ldots, \gamma_C]$ with sizes $k_i$
Output: Grouping $\gamma$
1: Embed $V$ in $\mathbb{R}^N$ using, e.g., word2vec
2: $W \leftarrow$ resulting embedding
3: Randomly pick $w_0 \in W$, append $w_0$ to the ranked list $L'$
4: for $1 \le j \le |V|$ do ...

Distance-based diverse grouping does not consider the probability (relative frequency) of each element (original word), i.e., the input distribution. For example, a very frequent word such as "the" can be followed by many words (different nouns), so the context of "the" does not help much to distinguish its meaning. It is therefore less ambiguous to assign a frequent word ("the"), rather than an infrequent word, to a codeword of its own, without sharing that codeword. Shannon entropy provides a quantitative measure of information that accounts for such an input distribution. Importantly, Maximum Entropy-Based Unigram Diverse Grouping (Entropy) is a more efficient algorithm, with complexity $N^{O(1)}$, where $N$ is the number of running words in training. Furthermore, we provide a provable guarantee of roughly a 1/4-approximation.
The entropy-based diverse grouping aims to maximize the diversity of group assignments in the given text with respect to its entropy. Because entropy is maximal when the underlying distribution is as close to uniform as possible, this objective captures the diversity requirement. As an illustration, consider the following text: (1) "she is running very fast."; (2) "he is running very fast."; (3) "running is very popular today." We want to form three groups. This text has fifteen words; thus, a grouping with high entropy aims to keep the frequency of each group around 5/15. Therefore, frequent words like "running" and "is" are likely to be placed in different groups, and infrequent words are spread among the groups uniformly. For instance, the grouping {1: running, fast, late}, {2: is, she, today}, {3: he, very, popular} has group frequencies of 5/15, 5/15, 5/15, hence achieving maximal entropy. Furthermore, the grouped words are diverse enough that each pair of groups from {11, 12, . . . , 33} appears at most twice after the grouping is applied. We consider the entropy with respect to the distribution induced by the relative frequencies of group unigrams.
Formally, for any group $\gamma_i$ we can define the relative frequency of the group as $c_{\gamma_i} = \sum_{w:\,\gamma(w)=\gamma_i} F_w$, where $F_w$ is the relative frequency of word $w$ in the text, and the unigram entropy of a grouping $\gamma$ as
$$H(\gamma) = -\sum_{i=1}^{C} c_{\gamma_i} \log c_{\gamma_i}. \qquad (1)$$
We are interested in a grouping that maximizes (1).
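For the toy example above, the objective in (1) can be computed as in the following sketch; dropping the punctuation tokens is our own assumption.

```python
import math
from collections import Counter

def unigram_entropy(tokens, gamma):
    """Unigram entropy H(gamma): group frequencies c_i are the summed relative
    frequencies of the words mapped to each group."""
    counts = Counter(gamma[w] for w in tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

text = ("she is running very fast . he is running very fast . "
        "running is very popular today .").split()
tokens = [w for w in text if w != "."]
gamma = {"running": 1, "fast": 1, "late": 1,
         "is": 2, "she": 2, "today": 2,
         "he": 3, "very": 3, "popular": 3}
print(unigram_entropy(tokens, gamma))   # ~1.585 bits, the maximum for three groups
```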

Algorithms for Unigram Entropy Diverse Grouping
We show how to compute a grouping for (1) by adapting the approximation algorithm for submodular maximization under matroid constraints due to Lee et al. (2009). In our terminology, their algorithm applies three operations to all possible pairs $(w, \gamma_i)$ where $w \in V$ and $\gamma_i \in \mathcal{V}$. It terminates when, for every $(w, \gamma_i)$, each operation is either impossible to perform or the resulting entropy is below $\left(1 + \frac{\varepsilon}{(C|V|)^4}\right) H_{\text{old}}$, where $H_{\text{old}}$ denotes the entropy of the grouping before the operation. These operations are:
1. Put a word $w$ into a group $\gamma_i$.
2. Remove a word $w$ from a group $\gamma_i$.
3. Remove a word $w$ from a group $\gamma_i$ and then put another word $v$ into a group $\gamma_j$ (we allow either $w = v$ or $\gamma_i = \gamma_j$).
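The following is a deliberately simplified sketch of such a local search: it implements only the put/move operations (the swap operation 3 is omitted) and uses the acceptance threshold as read above, so it should not be taken as the exact procedure of Lee et al. (2009) or of Algorithm 4.

```python
import math

def grouping_entropy(assign, freq, C):
    """Unigram entropy of a (possibly partial) assignment of words to C groups."""
    c = [0.0] * C
    for w, g in assign.items():
        c[g] += freq[w]
    return -sum(x * math.log2(x) for x in c if x > 0)

def local_search_grouping(freq, C, eps=0.1):
    """Simplified local search over put/move operations only."""
    assign, words = {}, list(freq)
    threshold = 1.0 + eps / (C * len(words)) ** 4   # acceptance factor from the text
    improved = True
    while improved:
        improved = False
        for w in words:
            for g in range(C):
                old = assign.get(w)
                if old == g:
                    continue
                before = grouping_entropy(assign, freq, C)
                assign[w] = g
                after = grouping_entropy(assign, freq, C)
                if after > (threshold * before if before > 0 else 0.0):
                    improved = True              # keep the improving move
                elif old is None:
                    del assign[w]                # undo: w was unassigned before
                else:
                    assign[w] = old              # undo: restore the previous group
    return assign

# freq maps each word to its relative frequency in the training text, e.g.
# freq = {"running": 3/15, "is": 3/15, "very": 3/15, "fast": 2/15,
#         "she": 1/15, "he": 1/15, "popular": 1/15, "today": 1/15}
```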
After we find the initial grouping, some words may remain unassigned. We note that, in general, adding new words to a grouping may decrease the entropy. As an example, assume that we have two groups $\gamma_1, \gamma_2$ with $c_{\gamma_1} = 0.25$, $c_{\gamma_2} = 0.5$, and an ungrouped word $w$ with $F_w = 0.25$. The current entropy of the grouping is $-0.25 \log 0.25 - 0.5 \log 0.5$. Setting $\gamma(w) = \gamma_2$ means that the contribution of $c_{\gamma_2}$ becomes $-0.75 \log 0.75 < -0.5 \log 0.5$; hence the total entropy decreases. To minimize the potential entropy loss, we map ungrouped words to a group $\gamma_j$ with the smallest partial entropy $G(c_{\gamma_j}) = -c_{\gamma_j} \log(c_{\gamma_j})$.
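A quick numeric check of this worked example (base-2 logarithms assumed):

```python
import math

def G(c):
    """Partial entropy of one group with relative frequency c."""
    return -c * math.log2(c) if c > 0 else 0.0

# Worked example from the text: c_g1 = 0.25, c_g2 = 0.5, ungrouped word with F_w = 0.25.
print(G(0.25) + G(0.5))    # 1.0    entropy before assigning w
print(G(0.25) + G(0.75))   # ~0.811 entropy after setting gamma(w) = g2: it decreased
```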
Algorithm 5 gives the details of the unigram entropy diverse grouping algorithm. We prove the following main result for this algorithm:
Theorem 1. Given any precision parameter $\varepsilon > 0$, Algorithm 5 runs in polynomial time and computes a grouping that is a $\frac{C-1}{4C+4\varepsilon C}$-approximation to the maximum unigram entropy.
Roughly speaking, our algorithm is guaranteed to achieve at least about a quarter of the maximum entropy; in typical cases, it can be very close to the optimum. Section 4 describes the details of the proofs.
To apply the results of Lee et al. (2009), we need to show that the grouping set family forms a matroid on $V \times \mathcal{V}$.

Lemma 1. The grouping set family defines a matroid on $V \times \mathcal{V}$.
To satisfy the conditions of a matroid, we need two properties. As an example (the full proof is provided in the Appendix), consider $V$ and $\mathcal{V}$ as in Figure 2. Firstly, let $Q \subseteq V \times \mathcal{V}$ be a set that defines a grouping of $V$, for instance $Q = \{(\text{point}, 1), (\text{graph}, 1), (\text{noun}, 1), (\text{text}, 2), (\text{science}, 2)\}$. Then every $R \subset Q$, such as $R = \{(\text{point}, 1), (\text{graph}, 1), (\text{text}, 2)\}$, must define a grouping as well. Secondly, take two sets $S, T \subseteq V \times \mathcal{V}$ that both define groupings. Then, if $|T| < |S|$, we should always be able to find a pair $(w, \gamma_i) \in S \setminus T$ such that adding $(w, \gamma_i)$ to $T$ results in a grouping. In Figure 2, this pair is $(\text{point}, 1)$ in $S$; $T \cup \{(\text{point}, 1)\}$ does define a new grouping.

Algorithm 5 Unigram Entropy Diverse Grouping
Input: Vocabulary of words $V$, relative frequencies $F_w$
Parameter: Groups $[\gamma_1, \ldots, \gamma_C]$, $C \in \mathbb{N}$, precision parameter $\varepsilon$
Output: Grouping $\gamma : V \to \mathcal{V}$
1: Compute initial grouping $\gamma'$ using Algorithm 4
2: if $\gamma'(w)$ is undefined for some $w$ then
3:   Create a new grouping $\gamma \leftarrow \gamma'$
4:   $W \leftarrow$ the set of ungrouped words
5:   Find a group $\gamma_{i_0}$ with the lowest partial unigram entropy
6:   for all $w \in W$ do
7:     Set $\gamma(w) \leftarrow \gamma_{i_0}$
8:   end for
9: end if
10: return $\gamma$
Moreover, the algorithm from Lee et al. (2009) requires the objective function to be submodular. Intuitively, submodularity means that adding an element to a larger set increases the function value by less than adding it to a smaller one (diminishing returns).

Lemma 2. The function $H : 2^{V \times \mathcal{V}} \to \mathbb{R}$ is non-negative and submodular.
To see that $H$ is submodular, consider $V$ and $\mathcal{V}$ as pictured in Figure 2, and the groupings $\gamma$ and $\gamma'$ induced by $R$ and $Q$ from Figure 2. Assume that we add $(\text{word}, 1)$ to both $\gamma$ and $\gamma'$. The relative frequency of $\gamma_2$ remains unchanged for $\gamma$ and $\gamma'$, so the entropy gain for $\gamma'$ and $\gamma$ depends only on the partial entropy of the group indexed by 1: the gains are $L(c_{\gamma_1})$ and $L(c_{\gamma'_1})$, respectively, where $L(x) = -(x + F_{\text{word}}) \log(x + F_{\text{word}}) + x \log x$. Every pair $(w, \gamma_i)$ grouped by $\gamma$ is also grouped by $\gamma'$, thus $c_{\gamma_1} < c_{\gamma'_1}$. Because $L(x)$ is monotone decreasing for all non-negative real $x$, we have $L(c_{\gamma_1}) > L(c_{\gamma'_1})$. Hence, the larger grouping gains less in entropy than the smaller one. We now give a sketch of the proof of Theorem 1.
Proof. We claim that $H(\gamma) \ge \frac{C-1}{4C+4\varepsilon C} H^*$, where $H^*$ is the largest unigram entropy among all groupings. It suffices to consider the case $H(\gamma) < H(\gamma')$. The groupings $\gamma$ and $\gamma'$ differ only in index $i_0$, so the difference $H(\gamma) - H(\gamma')$ equals the difference of the partial entropies of that group, $G(c_{\gamma_{i_0}}) - G(c_{\gamma'_{i_0}})$. We note that the group $\gamma'_{i_0}$, having the smallest partial entropy, contributes at most $H(\gamma')/C$ to the total entropy of $\gamma'$. Moreover, the partial entropy of $\gamma_{i_0}$ is always non-negative. We obtain $H(\gamma) - H(\gamma') \ge -H(\gamma')/C$, i.e., $H(\gamma) \ge \frac{C-1}{C} H(\gamma')$. Our bound follows by plugging in the estimate $H(\gamma') \ge H^*/(4+4\varepsilon)$, which is the approximation guarantee for Algorithm 4 from Lee et al. (2009): $H(\gamma) \ge \frac{C-1}{C} \cdot \frac{H^*}{4+4\varepsilon} = \frac{C-1}{4C+4\varepsilon C} H^*$.

Experiments
Combination Methods Below, we discuss how to incorporate the new representation produced by any of our grouping methods into NLP tasks. Firstly, we group each word independently: applying a grouping function $\gamma(\cdot)$ from Section 3 to each word $x_1, x_2, \ldots, x_i, \ldots, x_{I'}$ of an input sentence generates a sequence of word groups $\gamma(x_1), \gamma(x_2), \ldots, \gamma(x_i), \ldots, \gamma(x_{I'})$ of the same length $I'$. Note that we use the term "word" loosely here; it can mean a word, a subword (a BPE token), or even a character.

The first combination method is concatenation, see Figure 3a. We apply this method in NMT. First, we concatenate the two input sources. Next, we apply Byte-Pair Encoding (BPE) (Sennrich et al., 2015) and the word embeddings implemented by Řehůřek and Sojka (2010) to each word $\epsilon(x)$ and its codeword $\epsilon_\gamma(\gamma(x))$. We train the word embeddings on groups and on words separately, so $\epsilon_\gamma(\cdot)$ and $\epsilon(\cdot)$ are different functions. As shown in Figure 3, the input to the NLP system is the embedded words of a sentence, $\tilde{\epsilon}(x_1), \tilde{\epsilon}(x_2), \tilde{\epsilon}(x_3), \ldots, \tilde{\epsilon}(x_i), \ldots, \tilde{\epsilon}(x_I)$, where $\tilde{\epsilon}(x_i)$ is the concatenation of the embedded word $\epsilon(x_i)$ and its group $\epsilon_\gamma(\gamma(x_i))$: $\tilde{\epsilon}(x_i) = [\epsilon(x_i); \epsilon_\gamma(\gamma(x_i))]$.

The second method is a linear combination on encoder outputs, see Figure 3b. We use this method in part-of-speech (POS) tagging. The input to the linear combiner is the grouped sentence, represented by a sequence of hidden states $\tilde{h}^1(\epsilon(x_I)), \ldots, \tilde{h}^j(\epsilon(x_I)), \ldots, \tilde{h}^J(\epsilon(x_I))$ of the last position $I$ in each of the encoder layers $j \in [1, 2, \ldots, J]$, where $J$ is the number of encoder layers. Recall that each hidden state is a real vector in $\mathbb{R}^d$, which is why we can apply vector space operations such as addition to it. For convenience, we denote the combined last hidden state of the $j$-th encoder layer, which we take as the input to the decoder, $\tilde{h}^j(\epsilon(x_I))$, by $\tilde{h}^j_I$; the last hidden state of the $j$-th encoder layer of the original textual sentence $h^j(\epsilon(x_I))$ by $h^j_I$; and the last hidden state of the $j$-th encoder layer of the grouped sentence $h^j(\epsilon_\gamma(\gamma(x_I)))$ by $h^{\gamma j}_I$. The combined encoder hidden state $\tilde{h}^j_I$ is a linear interpolation of the hidden states of the textual input and its group input: $\tilde{h}^j_I = (1-\alpha)\, h^j_I + \alpha\, h^{\gamma j}_I$. As shown in Figure 3b, each layer's combined last hidden state is fed into the baseline decoder with the + operator. Here, $\alpha$ is the encoder weight of the grouped sentence, and we set $\alpha = 0.5$.
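The two combination methods can be sketched as follows; the dictionary-based lookups and the exact interpolation form are our reading of Figure 3, not the released code.

```python
import numpy as np

def concat_inputs(sentence, gamma, emb_word, emb_group):
    """Concatenation combination (Figure 3a): each position is the word embedding
    concatenated with the embedding of the word's group."""
    return np.stack([np.concatenate([emb_word[w], emb_group[gamma[w]]])
                     for w in sentence])

def combine_hidden(h_word, h_group, alpha=0.5):
    """Linear combination on encoder outputs (Figure 3b): per-layer interpolation
    of the last hidden states of the textual and the grouped encoders."""
    return (1.0 - alpha) * h_word + alpha * h_group
```

With alpha = 0.5 the interpolation reduces to averaging the two hidden states, which matches the + operator shown in Figure 3b.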
In the following, we present our methods' evaluation results on three representative NLP tasks: (1) machine translation, as a recognition and generation problem; (2) language modeling, as a regression problem; and (3) POS tagging, as a typical sequence labeling problem. We will show that our methods have the potential to improve any NLP application with textual inputs.

Neural Machine Translation
Dataset We empirically verify our method on the IWSLT'17 dataset containing 226 thousand sentences. Table 1 shows the vocabulary statistics before and after the pre-processing on the original and the concatenated data. We carry out experiments on the English-to-French (EN-FR) language direction.
We also carry out experiments on additional medium and small NMT tasks. For the medium-sized tasks, we use the IWSLT'17 dataset with the language directions English to German (EN-DE), German to English (DE-EN), and English to Chinese (EN-ZH). For the small-sized tasks, we use the MTNT'18 dataset with the language directions English to French (MTNT EN-FR) and French to English (MTNT FR-EN).
Baseline and Setup As a pre-processing filter, every sentence is restricted to 250 characters and to a source-target length ratio of at most 1.5, using the Moses tokenizer (Koehn et al., 2007). The Byte-Pair Encoding model (with 16K BPE operations) is jointly trained on the source textual word inputs, the cluster ID inputs, and the target outputs. The baseline NMT model is the Convolutional Sequence to Sequence model (ConvS2S) (Gehring et al., 2017) with the following parameter settings: embedding dimension 512, learning rate 0.25, gradient clipping 0.1, dropout ratio 0.2, and the NAG optimizer. Training is terminated when the validation loss does not decrease for five consecutive epochs. For Chinese translations, we use the IWSLT post-processing script (IWSLT, 2021). Finally, translation accuracy is measured with the BLEU score using SacreBLEU (Post, 2018).
Distance Measure To empirically compare the random and distance-based grouping algorithms, we measure the intra-group average distance over word pairs as follows. For each group in the source-side vocabulary of the training and test sets, we compute the sum of the cosine distances $1 - \frac{A \cdot B}{\|A\|\,\|B\|}$ between the embeddings of each word pair and divide it by the total number of word pairs in the group to obtain the group diversity. We then average this distance over all groups in the vocabulary. Each algorithm generates the same number (63,992) of word groups. The average distance of Poisson/Gaussian-based Random Grouping is 0.1286, and that of Rank-based Diverse Grouping is 0.1291. This finding is consistent with the translation BLEU scores in Table 2: the greater the intra-group distance, the higher the accuracy. The maximum entropy approach cannot be compared with this measure because it uses entropy, rather than distance, as its objective; we have provided its provable guarantee in Section 4.

Results For the IWSLT'17 task, Table 2 shows the improvement when applying each of our methods to the ConvS2S baseline. All of our methods significantly enhance the accuracy of the NMT systems. Among them, the entropy-based diverse grouping achieves the greatest improvement, i.e., +6.5 BLEU points, a +36.9% relative improvement.
Analysis Figure 4 compares the entropy-based and distance-based D.G. methods and the baseline with respect to the sentence-level BLEU score in a histogram (Neubig et al., 2019). The baseline generates almost twice as many low-quality translations (347) as the distance-based D.G. and entropy methods (178 and 179), while the latter two methods generate many more high-quality translations with BLEU above 20%. Table 4 shows the FR-EN baseline and entropy translation outputs, respectively. We observe that our entropy method is particularly better than the baseline when the baseline fails by: (1) not producing a reasonable translation; (2) missing phrases; or (3) mis-translating phrases.

POS Tagging
We evaluate our approach on POS tagging with the Brown Corpus (Francis and Kucera, 1979), a well-known English POS dataset containing 57,341 samples. We sample uniformly at random 64% of the data as the training set, 16% as the validation set, and 20% as the test set. Our baseline is a Keras (Chollet, 2015) implementation (Joshi, 2018) of a Bi-LSTM POS tagger (Wang et al., 2015). We train word embeddings (Mikolov et al., 2013) implemented by Řehůřek and Sojka (2010) with 100 dimensions. Each of the forward and backward LSTMs has 64 dimensions. We use a categorical cross-entropy loss and the RMSProp optimizer, as well as early stopping based on validation loss.
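For orientation, a minimal Keras sketch of such a tagger is given below; it mirrors the stated hyperparameters but is not the implementation of Joshi (2018), and the early-stopping patience value is assumed.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm_tagger(vocab_size, n_tags, emb_dim=100, lstm_dim=64):
    """Bi-LSTM POS tagger sketch: 100-dim embeddings, 64-dim forward/backward
    LSTMs, categorical cross-entropy, RMSProp."""
    model = keras.Sequential([
        layers.Embedding(vocab_size, emb_dim, mask_zero=True),
        layers.Bidirectional(layers.LSTM(lstm_dim, return_sequences=True)),
        layers.TimeDistributed(layers.Dense(n_tags, activation="softmax")),
    ])
    model.compile(loss="categorical_crossentropy", optimizer="rmsprop",
                  metrics=["accuracy"])
    return model

# Early stopping on validation loss, as described; the patience value is assumed.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
```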

Language Modeling (LM)
We train and evaluate on the English side of the EN-FR IWSLT'17 dataset. For efficiency, we use 256 embedding dimensions, six layers, and eight heads. We set the dropout to 0.1, the learning rate to 0.0001, and the number of BPE operations to 32K. We use the Adam optimizer with betas of 0.9. As shown in Table 6, entropy-based diverse grouping reduces the perplexity of the baseline system by 3.76% relative.

Conclusion
We introduce a novel approach that generalizes Deep Learning models by grouping input words to maximize their semantic diversity. To this end, we design a family of algorithms based on random sampling, geometric distances, and entropy, and provide provable guarantees for the entropy-based diverse grouping. Our methods halve the number of low-quality translation outputs (< 10% in BLEU) and greatly increase the ratio of high-quality translations (> 20% in BLEU). Experiments show that our approach significantly improves over state-of-the-art baselines in Neural Machine Translation (i.e., up to +6.5 BLEU points) and achieves higher accuracy in POS Tagging and Language Modeling.
Definition 2 (Submodular function). A function $f : 2^\Omega \to \mathbb{R}$, where $\Omega$ is finite, is submodular if for any $X \subseteq Y \subseteq \Omega$ and any $x \in \Omega \setminus Y$ we have
$$f(X \cup \{x\}) - f(X) \ge f(Y \cup \{x\}) - f(Y).$$
For any non-negative real $x$ and fixed $a > 0$, we denote $-(x+a)\log_2(x+a) + x\log_2 x$ by $L_a(x)$.
Proof of Lemma 2. First, we show that $H(Q) \ge 0$ for all $Q \subseteq V \times \mathcal{V}$. By definition, we have $H(\emptyset) = 0$. Consider an arbitrary non-empty $Q \subseteq V \times \mathcal{V}$. For any $\gamma_i \in \mathcal{V}$ we have $0 \le c_{\gamma_i}(Q) \le 1$, so each term $-c_{\gamma_i}(Q)\log c_{\gamma_i}(Q)$ is non-negative, and hence $H(Q) \ge 0$.
For submodularity, take $R \subseteq Q \subseteq V \times \mathcal{V}$ and a pair $(w', \gamma_{i'}) \notin Q$, and let $Q' = Q \cup \{(w', \gamma_{i'})\}$ and $R' = R \cup \{(w', \gamma_{i'})\}$; we need to show
$$H(R') - H(R) \ge H(Q') - H(Q). \qquad (4)$$
Let us denote the frequency of the unigram $\gamma_j$ in $Q, Q'$ by $c_{\gamma_j}(Q), c_{\gamma_j}(Q')$. Since $Q$ and $Q'$ differ only in the group $\gamma_{i'}$, we have
$$H(Q') - H(Q) = -c_{\gamma_{i'}}(Q')\log c_{\gamma_{i'}}(Q') + c_{\gamma_{i'}}(Q)\log c_{\gamma_{i'}}(Q). \qquad (5)$$
Similarly, (5) holds for $H(R') - H(R)$. Thus, to prove (4) it is enough to show
$$-c_{\gamma_{i'}}(R')\log c_{\gamma_{i'}}(R') + c_{\gamma_{i'}}(R)\log c_{\gamma_{i'}}(R) \ge -c_{\gamma_{i'}}(Q')\log c_{\gamma_{i'}}(Q') + c_{\gamma_{i'}}(Q)\log c_{\gamma_{i'}}(Q). \qquad (6)$$
We have $c_{\gamma_{i'}}(Q') = c_{\gamma_{i'}}(Q) + F_{w'}$; therefore, the right-hand side of (6), i.e., (5), can be rewritten as $L_{F_{w'}}(c_{\gamma_{i'}}(Q))$. Similarly, $c_{\gamma_{i'}}(R') = c_{\gamma_{i'}}(R) + F_{w'}$, hence we need to establish $L_{F_{w'}}(c_{\gamma_{i'}}(R)) \ge L_{F_{w'}}(c_{\gamma_{i'}}(Q))$.
For any $(w, \gamma_{i'}) \in R$ we have $(w, \gamma_{i'}) \in Q$; thus $c_{\gamma_{i'}}(R) \le c_{\gamma_{i'}}(Q)$, and (6) follows from the fact that $L_{F_{w'}}(x)$ is monotone decreasing for all non-negative real $x$.
Proof of Theorem 1. By the result of Lee et al. (2009), Step 1 of Algorithm 5 outputs a map $\gamma'$ such that
$$H(\gamma') \ge \frac{1}{4+4\varepsilon} H(\gamma^*), \qquad (7)$$
where $\gamma^*$ is the grouping that achieves the largest value of $H$. We need to show that the approximation guarantee still holds if $\gamma'(w)$ is undefined for some $w$. After Step 8, the groupings $\gamma'$ and $\gamma$ differ only in the group $i_0$; thus
$$H(\gamma) - H(\gamma') = G(c_{\gamma_{i_0}}) - G(c_{\gamma'_{i_0}}),$$
and, since $\gamma'_{i_0}$ has the smallest partial entropy among the $C$ groups of $\gamma'$, for the group $i_0$ we have
$$G(c_{\gamma'_{i_0}}) \le \frac{H(\gamma')}{C}. \qquad (8)$$
From (8) and the non-negativity of the partial entropy $G$ we obtain
$$H(\gamma) \ge \left(1 - \frac{1}{C}\right) H(\gamma').$$
For a single matroid constraint, the algorithm from Lee et al. (2009) guarantees (7); combining it with the bound above yields $H(\gamma) \ge \frac{C-1}{4C+4\varepsilon C} H(\gamma^*)$, which completes the proof.