Interactive Construction of User-Centric Dictionary for Text Analytics

We propose a methodology for constructing a term dictionary for text analytics through an interactive process between a human and a machine, which supports the creation of flexible dictionaries with the precise granularity required in typical text analysis. This paper introduces the first formulation of interactive dictionary construction to address this issue. To optimize the interaction, we propose a new algorithm that effectively captures an analyst's intention starting from only a small number of sample terms. Along with the algorithm, we also design an automatic evaluation framework that provides a systematic assessment of any interactive method for the dictionary creation task. Experiments using real-scenario-based corpora and dictionaries show that our algorithm outperforms baseline methods and works even with a small number of interactions.


Introduction
Since the emergence of practical interest in text analytics, which finds insights in massive document collections (Nasukawa and Nagano, 2001), several requirements have arisen for enhancing valuable discoveries. The critical issue we tackle in this paper is the effective construction of a term dictionary (Godbole et al., 2010). A term dictionary, which is an arbitrary set of terms, is used in text analytics to represent interesting analysis perspectives (Nasukawa and Nagano, 2001; Nasukawa, 2009); for example, dictionaries of "product names" and "evaluative descriptions" are required for mining customer reputations about products. The motivation of this paper is to reduce the human workload for the dictionary construction as much as possible.

Figure 1: Typical dictionaries in previous works (upper) and flexible, fine-grained dictionaries in this work (lower)

To this end, we establish a methodology of interactive dictionary construction that incrementally captures an analyst's intention starting from a small number of sample terms and enables him/her to effortlessly expand the intended dictionary through suggestions by a machine. A term dictionary for text analytics is expensive to construct because we need to focus on terms with flexible granularity for in-depth analysis (Takeuchi et al., 2009; Godbole et al., 2010; Mostafa, 2013). For instance, if the analyst wants to examine product evaluation from both its function and appearance, he/she needs to create those dictionaries separately even though their boundaries are vague and overlapping (Figure 1). In short, we need to group together any terms the analyst wants, depending on the documents and the objective of analysis, which forces an ad hoc construction of the term dictionary. This situation is especially severe in real-world tasks because the vocabulary size for an exhaustive search of the texts is vast, and the analyst will go through repeated trial and error of creating dictionaries until he/she reaches findings.
At present, there is a demand for a machine that decreases the cost of this ad hoc dictionary construction. As dictionary construction can be considered a type of term collection, there is a related research field, set expansion, which expands a small set of terms by means of bootstrapping (Pantel and Pennacchiotti, 2006). This approach automatically finds new terms for the given set from documents in accordance with a predefined exploration strategy (Pantel et al., 2009; He and Xin, 2011). Although such an automatic procedure is advantageous for reducing the human workload, the quality of the collected terms is questionable for a term dictionary. For example, a good analysis requires more fine-grained dictionaries than the original targets of set expansion, such as distinct ontological terms (e.g., country names; Shen et al. 2017, 2018). Several studies have incorporated a human in the term collection process (Godbole et al., 2010; Coden et al., 2012). Specifically, dictionaries are built in an interactive process where the human gives feedback to the machine and the machine suggests candidates based on the given feedback (Alba et al., 2017, 2018). Such a human-in-the-loop approach has been an active topic in other fields as well, for instance, image classification (Cui et al., 2016), dialogue systems (Li et al., 2017), and audio annotation (Kim and Pardo, 2018). We can generally expect that reliable feedback provided by a human makes a system more accurate. With respect to dictionary construction, however, experimental results in this vein are limited due to empirical evaluation with just a few participants and the use of coarse dictionaries as the test items. In short, it is still an open question: what is the critical issue for interactive construction of fine-grained term dictionaries for text analytics?
Moving in the same promising direction of leveraging both a human and a machine, we establish a well-defined and effective methodology for constructing the term dictionary. In summary, our contribution in this paper is fourfold: (i) We formulate the interactive process of term collection, which brings clarity to the problem to be solved (§2). (ii) We develop a method that captures an analyst's intention from a small number of samples, with our formulation as the basis (§3). (iii) We propose an automatic evaluation framework that provides a systematic assessment of interactive methods (§4). (iv) Our experimental results show that the proposed method surpasses baseline methods such as set expansion, word embeddings, and a linear classifier on a crowdsourced dataset. The dataset emulates the real-world scenario of flexible and fine-grained dictionary construction, and we distribute it to the public (§5).

Task Definition
In this section, we provide the definitions and notations used throughout this paper. First, a term is a string representation of a certain notion such as "apple" and "New York". A dictionary is a collection of terms. A user denotes the person who wants to construct a dictionary, and system denotes the machine that helps the user. Let W be the whole set of terms in documents. Our objective is to rapidly find as many terms of the user's interest U ⊂ W as possible.
As seen in Figure 2, interactive dictionary construction is defined as an iterative process in which each iteration consists of the following steps: 1) user feedback, in which the user selects terms for the dictionary from the current candidate terms, and 2) candidate selection, in which the system finds candidate terms for the next user feedback. For the i-th iteration (i = 0, 1, 2, . . .), let C_i be the set of terms that the system finds in the candidate selection step and U_i be the set of terms that the user selects from C_{i-1} in the user feedback step as positive examples. Here, U_0 is a special feedback, called the seed terms, that is directly given by the user at the start. Note that, because we wish to expand the dictionary, each term in C_i should be new to the user in the (i + 1)-th iteration.
In the i-th step of the user feedback (i ≥ 1), we assume that the user can annotate which terms in C_{i-1} are in U without being aware of the whole U. However, it is impractical to define our objective as an optimization problem for the asymptotic convergence of U_i because the user feedback is done by a human, and i cannot be large. Hence, we try to maximize |C_i ∩ U|, the number of suggested terms that match the user's interest. Also, since C_i is manually reviewed by a human user, the proper size of C_i is practically limited to 5 ∼ 10. Figure 3 shows the steps from setting the seed terms to giving the first feedback on the first candidates. Using the example in Figure 2, U_0 is {Formal}, C_0 is {Nice, Traditional}, U_1 is {Traditional}, and C_0 \ U is {Nice}. The system then selects C_1 based on U_0 and U_1 from W, excluding the already shown terms C_0 ∪ U_0. It is important to design the system effectively so that the overlap of C_i and U becomes larger.
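The iterative process above can be sketched as follows. This is a minimal Python illustration, not an implementation from the paper; `select_candidates` and `ask_user` are hypothetical stand-ins for the system's candidate selection and the user's feedback.

```python
def interaction_loop(seed_terms, select_candidates, ask_user,
                     n_iters=5, n_candidates=10):
    """Run the user-feedback / candidate-selection loop described above."""
    positives = set(seed_terms)   # U_0 plus later positive feedback
    shown = set(seed_terms)       # terms already presented, never re-suggested
    for _ in range(n_iters):
        # Candidate selection: propose terms the user has not seen yet.
        candidates = select_candidates(positives, exclude=shown, k=n_candidates)
        shown.update(candidates)
        # User feedback: the user marks which candidates belong to the dictionary.
        selected = ask_user(candidates)
        positives.update(selected)
    return positives
```

Note that `shown` enforces the requirement that each C_i be new to the user.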
There are two major challenges in this problem: one is the number of seed terms, and the other is the term overlap between different dictionaries. Regarding the first issue, we have only a few seed terms for the target dictionary at the first iteration. If the system requires more seed terms, its advantage drops because that contradicts our purpose of decreasing the human workload in constructing the dictionary. Therefore, we need a method that captures the user's intention from a small number of samples. Regarding the second issue, identifying terms of the user's interest is difficult because the boundaries between dictionaries often overlap in text analytics, as seen in Figure 1. In other words, the system needs to be sensitive to subtle semantic differences with only a few pieces of feedback.

Method
In this section, we first describe a previous candidate selection model, the SetExpan algorithm (Shen et al., 2017), which inspired our method (§3.1). Subsequently, we introduce our method as a weighted version of SetExpan with improvements for the interactive setting (§3.2∼). Throughout this section, we discuss the i-th step of candidate selection for a certain i. For simplicity, C_i and U_i are denoted as C and U, respectively.

Candidate Selection: Similarity Scoring based on Feature Collection
As we stated in §2, the objective of the task is to suggest C that contains as many terms in U as possible. Recall that U is the set of positive examples for terms of the user's interest that have been found in previous steps. Following the strategy taken in set expansion (Shen et al., 2017), a straightforward and reasonable approach to determine C is to define Sim(e, e'|F), which returns a similarity score for two terms e and e' based on a set of features F, and then to select the terms that are most similar to the positive terms in U. The issue is how to obtain the ideal F that assigns a higher score to terms potentially included in U. Shen et al. (2017) formulate this feature selection problem as choosing a fixed number Q of features so that the positive terms are most similar to each other:

    F* = argmax_{F : |F| = Q} Σ_{1≤i<j≤n} Sim(e_i, e_j | F),    (1)

where U := {e_1, . . . , e_n}. They propose using the Jaccard coefficient for Sim(e_i, e_j | F), which narrows the optimization problem to a binary decision on whether to use each feature. This combinatorial problem is NP-hard; hence, they use heuristics to choose an approximation of F*.
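To make the feature-selection idea concrete, the following is a toy greedy stand-in for this kind of heuristic approximation; it is our own illustration, not SetExpan's actual algorithm, and `term_features` (a term-to-feature-set mapping) is an assumed data structure.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two feature sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def select_features(term_features, positives, Q):
    """Greedily pick Q features that maximize the total pairwise Jaccard
    similarity among the positive terms (toy approximation of F*)."""
    all_feats = set().union(*(term_features[e] for e in positives))
    chosen = set()
    for _ in range(min(Q, len(all_feats))):
        def score(F):
            # Pairwise similarity of positives restricted to feature set F.
            return sum(jaccard(term_features[a] & F, term_features[b] & F)
                       for a, b in combinations(positives, 2))
        best = max(all_feats - chosen, key=lambda f: score(chosen | {f}))
        chosen.add(best)
    return chosen
```

Greedy selection is one common way to sidestep the NP-hard combinatorial search.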

From Feature Selection to Feature Weighting with Predefined Similarity
Instead of explicitly choosing which features to use in the similarity calculation, we consider using all of the possible features {f_1, . . . , f_L} with a weight w_k ∈ R for each feature f_k. We then define our optimization problem as finding the best w_k for each f_k (k = 1, . . . , L). Let us develop a formula that extends (1) and takes w_k into consideration. First, in such a formula, Sim(e_i, e_j | F) should become a weighted sum of the similarity scores for each feature f_k, denoted as Sim(e_i, e_j | f_k). By replacing F with w in the expression of the similarity function, we have

    Sim(e_i, e_j | w) = Σ_{k=1}^{L} w_k Sim(e_i, e_j | f_k).    (2)

Next, to define the similarity between a term e and U, we assume that this similarity is the average of the similarities between e and each e_i ∈ U, that is,

    Sim(e, U | w) = (1/n) Σ_{i=1}^{n} Sim(e, e_i | w).    (3)
The initial formulation of our optimization problem is thus as follows:

    w* = argmax_w Σ_{1≤i≤n} Sim(e_i, U | w).    (4)

We show in the Appendix that our formulation (4) can be considered the weighted version of (1) under the natural condition that Sim(e_i, e_i | f_k) = Sim(e_j, e_j | f_k) for any i, j, and k, and Σ_{k=1}^{L} w_k = 1. It is easy to set Sim(e, e' | f_k) satisfying this condition. For a feature f_k, we define a vector v_{f_k}(e) for each term e and define Sim(e, e' | f_k) as the standard inner product of v_{f_k}(e) and v_{f_k}(e'). Then, by normalizing all these vectors, Sim(e_i, e_i | f_k) = ||v_{f_k}(e_i)||^2 = 1 holds for any i; hence, the condition is satisfied, and Sim becomes the conventional cosine similarity of word vectors (Levy et al., 2015). Thus, any mapping from W to a vector space is available as a feature, such as the tf-idf of terms and other discrete features (Manning et al., 2008), word2vec (Mikolov et al., 2013), or GloVe (Pennington et al., 2014). Note that the dimension of the vector space may differ among the features.
Hence, we assume v_{f_k}(e) is defined for each feature f_k and any term e. When we use this Sim(e, e' | f_k), a simple calculation shows that (3) is equal to

    Sim(e, U | w) = Σ_{k=1}^{L} w_k ⟨v_{f_k}(e), v_{f_k}(U)⟩,    (5)

where

    v_{f_k}(U) := (1/n) Σ_{i=1}^{n} v_{f_k}(e_i).    (6)

We simply call v_{f_k}(U) the centroid of U in the feature space of f_k. Formulas (5) and (6) demonstrate that the similarity between any two terms can be measured by combining the characteristics of the L different feature spaces. We "select" the feature spaces in which the terms in U become similar to each other by adjusting the weights, as shown in Figure 4. Note that our feature weighting formulation can be categorized as a conventional linear regression that finds the f_k characterizing U via the weights. However, instead of calculating weights for the bare features of each term, our method estimates weights for differently predefined feature spaces (i.e., for the similarity scores in these spaces). This aims to mitigate the difficulty of finding optimal weights for a vast number of features from only a few labeled samples. The drawback is that this sacrifices the model's degrees of freedom; therefore, we test the effectiveness of our proposed model against an ordinary linear classifier in the experiment.
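Under the unit-norm assumption, the similarity to the centroid decomposes per feature space as a weighted sum of inner products. A minimal sketch (function names are illustrative; `term_vecs` and `centroids` hold one vector per feature space):

```python
import numpy as np

def normalize(v):
    """Map a term vector to the unit sphere so inner product = cosine."""
    return v / np.linalg.norm(v)

def centroid(vectors):
    """Mean of the unit-normalized vectors of the terms in U for one feature space."""
    return np.mean([normalize(v) for v in vectors], axis=0)

def weighted_similarity(term_vecs, centroids, w):
    """Weighted sum over feature spaces of <v_{f_k}(e), v_{f_k}(U)>."""
    return sum(w_k * float(v @ c) for w_k, v, c in zip(w, term_vecs, centroids))
```

Each feature space may have a different dimensionality; only the weights w tie the spaces together.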

Optimization by User Feedback
Although the initial formulation (4) proved to be a natural extension of the discrete version of feature selection, it does not always work as expected. In this section, we discuss the reason for this and how we can improve the initial formulation of our optimization problem.
By substituting (2) and (3) into (4), the objective Σ_{1≤i≤n} Sim(e_i, U | w) is a linear function of w. Assuming that Σ_{k=1}^{L} w_k = 1, the optimal w is obtained by putting all the weight on the particular feature space with the highest averaged similarity between the terms in U and the centroid of U. This is equivalent to selecting only one feature space for the similarity computation. Such extreme optimization is not suitable for our interactive setting because the target dictionary is obscure, especially in earlier iterations. We want the system to diversify the candidate terms to broadly cover the user's interests and allow the user to discover related vocabulary for a customized dictionary. To address this issue, we modify our formulation (4) as

    w* = argmax_w min_{1≤i≤n} Sim(e_i, U | w).    (7)

We maximize the minimum similarity score between a term in U and the centroid of U. The idea here is to reduce the distance between the farthest positive term and the centroid. This strategy is analogous to those used in active learning, where examples near the separating hyperplane are actively leveraged (Schohn and Cohn, 2000). Our objective function min_{1≤i≤n} Sim(e_i, U | w) is a concave function of w (see Appendix); therefore, we can solve (7) by, for example, projected gradient ascent.
We can also leverage negative feedback, i.e., the unselected terms in C, to make the system more sophisticated. Let N := C \ U = {z_1, . . . , z_m}; then we can extend (7) as

    w* = argmax_w [ min_{1≤i≤n} Sim(e_i, U | w) − max_{1≤j≤m} Sim(z_j, U | w) ].    (8)

The second term on the right-hand side of (8) increases the distance between the closest negative term and the centroid of U. The objective function of (8) is again a concave function of w; thus, the information of both positive and negative examples is taken into consideration to learn the optimal w*.
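Because the similarity is linear in w, the min-max objectives reduce to maximizing a concave piecewise-linear function over the probability simplex. The sketch below uses projected subgradient ascent; the projection routine, learning rate, and matrix layout are our own choices, not specified in the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto {w : w_k >= 0, sum w_k = 1} (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0)

def optimize_weights(S_pos, S_neg=None, lr=0.1, steps=200):
    """Subgradient ascent for the min-max objectives.
    S_pos[i, k] = Sim(e_i, centroid | f_k); S_neg[j, k] likewise for negatives."""
    L = S_pos.shape[1]
    w = np.full(L, 1.0 / L)                 # start from the uniform distribution
    for _ in range(steps):
        i = np.argmin(S_pos @ w)            # farthest positive term
        grad = S_pos[i].astype(float)
        if S_neg is not None and len(S_neg):
            j = np.argmax(S_neg @ w)        # closest negative term
            grad = grad - S_neg[j]
        w = project_simplex(w + lr * grad)
    return w
```

With only positive feedback this implements the first objective; passing `S_neg` adds the negative-feedback term.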

Feedback Denoising
Although our min-max optimization strategy diversifies candidates, it may leave the system vulnerable to outliers. Several terms in U (especially manually fed terms such as seeds) may be distributed differently from the rest of the positive terms in some feature spaces. Such a case hampers learning because the similarity of the outliers to the centroid remains low no matter how the weights are set. The left side of Figure 5 shows an example of this problem: the system cannot put a higher weight on f_1 because the optimization target, i.e., the term most distant from the centroid ("watermelon" in this case), is biased toward f_2. Feedback denoising is a simple solution to this problem. We apply a clustering algorithm (e.g., K-means) to the terms in U and obtain K term sets U^(1), . . . , U^(K). We then conduct the optimization by replacing U in (7) and (8) with U^(k*), where k* = argmax_k |U^(k)|, that is, the majority class among the terms in U, as shown in the right side of Figure 5. This is effective for removing terms with irregular feature distributions and for guiding the system to a promising w*.
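Feedback denoising can be sketched as follows, with a small hand-rolled K-means for self-containment; the paper only specifies "a clustering algorithm (e.g., K-Means)", so this concrete routine is an assumption.

```python
import numpy as np

def denoise_feedback(vectors, terms, K=3, iters=20, seed=0):
    """Cluster the positive terms' vectors and keep only the majority cluster."""
    X = np.asarray(vectors, dtype=float)
    rng = np.random.default_rng(seed)
    K = min(K, len(terms))
    centers = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(iters):
        # Assign each term to its nearest center, then recompute the centers.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    majority = np.bincount(labels, minlength=K).argmax()
    return [t for t, l in zip(terms, labels) if l == majority]
```

The returned majority subset then replaces U in the min-max optimization.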

Evaluation Framework
In this section, we explain an automatic evaluation framework for interactive dictionary construction. By using a predefined dictionary as the oracle dictionary U*, we emulate the manual feedback process and apply a new evaluation metric to estimate the effectiveness of building a dictionary while taking the human interaction into account.

Human Emulation
We describe the emulation process with U*; the entire flow is given in Algorithm 1. At the beginning of the emulation, a small number of seed terms are randomly chosen from U*, and U_0 is initialized with them (l.1). The number of iterations I (l.2) and the number of suggested terms per iteration |C| (l.3) are also determined. The iteration consisting of user feedback and candidate selection is then launched. In every i-th iteration, the system first suggests C_i based on the known positive terms U_{i-1} (l.5). After receiving the suggested C_i, the automatic evaluation process takes the intersection of C_i and U* and records the overlapping terms as U_i (l.6). It also takes the set difference of C_i and U* as the negative terms N_i (l.7). If the system is trainable, its training process runs before moving to the next iteration (l.8-10).
Algorithm 1 Human emulation with oracle dictionary
1: Set seed terms U_0 from U*
2: Set the number of iterations I
3: Set the number of suggested terms per iteration |C|
4: for i = 1 to I do
5:     C_i ← CandidateSelection(U_{i-1})
6:     U_i ← C_i ∩ U*
7:     N_i ← C_i \ U*
8:     if the system is trainable then
9:         Run training with U_i (and N_i)
10:    end if
11: end for
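The emulation of Algorithm 1 can be sketched as the following loop, where `suggest` and `train` are placeholders for the system under evaluation (this is our illustration, not the authors' harness):

```python
def emulate(seeds, oracle, suggest, train=None, I=5, C_size=10):
    """Human-emulation loop: the oracle dictionary U* plays the user."""
    U = set(seeds)                 # accumulated positive terms
    shown = set(seeds)
    hits = []                      # |U_i| per iteration, for later scoring
    for i in range(1, I + 1):
        C_i = suggest(U, exclude=shown, k=C_size)   # candidate selection
        shown.update(C_i)
        U_i = set(C_i) & oracle                     # emulated user feedback
        N_i = set(C_i) - oracle                     # unselected (negative) terms
        U |= U_i
        hits.append(len(U_i))
        if train is not None:                       # trainable systems only
            train(U, N_i)
    return U, hits
```

The recorded per-iteration hit counts are exactly what the evaluation metric consumes.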

A Metric for Effectiveness Estimation
In addition to the automatic evaluation process, we introduce a new metric that takes the interaction quality into account when evaluating the accuracy of the candidate selection.
The final goal of dictionary construction is to obtain a complete set of terms consistent with U*; however, there is a limit stemming from the user's workload in real scenarios. Given that an effective system should suggest terms of the user's interest in earlier iterations, we propose weighted coverage per iteration (WCpI) as the evaluation metric for interactive dictionary construction:

    WCpI = ( Σ_{i=1}^{I} (1−α)^{i−1} |∪_{j≤i} U_j| ) / ( Σ_{i=1}^{I} (1−α)^{i−1} min(i|C|, |U*|) ),

where α is a hyperparameter that adjusts the importance of the iteration number. We illustrate the intuition of WCpI in Figure 6. WCpI is the area ratio of accumulated positive terms from the system's suggestions to its upper bound at each iteration. In short, it measures how many correct suggestions the system can provide in comparison with a "perfect" system that never suggests unrelated terms.
We can regulate the importance of the iteration number by adjusting α. Specifically, a larger value of α discounts terms found in later iterations; in other words, it attaches importance to terms found in earlier iterations. As an intuitive interpretation based on an actual scenario, α represents a constant probability that the user quits dictionary construction midway through. The graphs in Figure 6 compare the calculation of WCpI for the same system suggestions. The right one, with α = 0.1, in which we assume the user quits creating a dictionary with 10% probability at every iteration, has a higher WCpI than the left one with α = 0.0.
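Under one concrete reading of WCpI as a (1−α)-discounted area ratio (an interpretation on our part, not necessarily the authors' exact formula), the metric can be computed from the per-iteration hit counts:

```python
def wcpi(hits_per_iter, C_size, oracle_size, alpha=0.0):
    """Discounted ratio of accumulated correct suggestions to the accumulated
    upper bound of a perfect system that never suggests unrelated terms."""
    num = den = 0.0
    found = 0
    for i, hits in enumerate(hits_per_iter):
        found += hits                       # accumulated positive terms so far
        weight = (1.0 - alpha) ** i         # chance the user is still working
        num += weight * found
        den += weight * min((i + 1) * C_size, oracle_size)
    return num / den
```

With alpha = 0, every iteration counts equally; larger alpha shifts credit toward early iterations.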

Experiments
We conduct experiments following the automatic evaluation framework, using public datasets and oracle dictionaries created through crowdsourcing. In the experiments, we compare several methods in addition to our proposed method. As emulation parameters, we set the number of seed terms (|U_0|), the number of terms in one suggestion (|C|), and the number of total iterations (I) to 3, 10, and 30, respectively. Note that we tried different numbers of seeds (1 and 5), but the overall tendencies were the same.

Dataset
We used crowdsourcing to create oracle dictionaries on the Amazon review corpus (Blitzer et al., 2007), which is publicly available. First, we explain the corpus processing and the procedure used to construct the oracle dictionaries. We then describe the evaluation items. Our evaluation items will be made publicly available for system evaluation in future research.
Corpus. The corpus originally consists of sub-corpora from 25 domains. Given that their sizes and domains vary, we pick five domains: apparel (APP), baby (BAB), camera & photo (CAM), health & personal care (HEL), and sports & outdoors (SPO). We process the raw texts with spaCy and its distributed English model. We then construct the vocabulary from the words and noun chunks that appear more than five times, excluding standard stopwords. Note that all terms in the vocabulary are identified after lemmatization by spaCy.
Oracle Dictionaries. For each selected corpus, we create oracle dictionaries through crowdsourcing. In the task, we provide predefined dictionaries and ask workers to choose one or more dictionaries to which a given term belongs. For example, we prepared three independent nursery-item dictionaries for sleeping, movement, and safety in the BAB corpus, and asked a worker to judge which dictionary includes the term "car seat". For each corpus, we define multiple dictionaries and request three workers to make judgments for every term in the vocabulary. We determine that a term is included in a dictionary when at least one of the three workers chooses the dictionary for the term. Note that we filter noisy users and their answers beforehand according to the reliability score estimated by the crowdsourcing service. Finally, we manually clean each dictionary. Excluding dictionaries consisting of fewer than 15 terms or containing too much noise, we eventually obtain 22 dictionaries. We list the dictionaries and example terms in Table 1.
Evaluation Items. We generate ten evaluation items per dictionary, for 220 items in total. An evaluation item consists of a unique set of seed terms (U_0) and the remaining terms in the corresponding dictionary as the oracle (U* := U \ U_0). We argue that fewer seed terms are adequate for evaluating an interactive dictionary construction method because the purpose is to gather terms with minimum human effort, as mentioned in §2.

Methods
We compare four methods: Word2Vec, SetExpan, logistic regression, and our proposed method with several configurations. All methods share the same vocabulary W, and all methods except Word2Vec use the same feature spaces: tf-idf bag-of-words, unigrams, bigrams, and word embeddings. Any feature space is applicable, though.
Word2Vec: Word2Vec is a popular and promising method for representing word meanings in a continuous vector space, and vector similarity is naturally applicable to interactive dictionary construction (Alba et al., 2018). We use two computation methods of candidate selection based on Word2Vec. The first, w2v(avg), simply takes the cosine similarity with the averaged vector of the terms in U. The second, w2v(rank), calculates the mean reciprocal rank from the terms in U. Both select candidates in order of their estimated scores. The embeddings are learned for each corpus with the gensim implementation using the default parameters.

SetExpan: We implement SetExpan (SE; Shen et al. 2017), a feature-selection method for conventional set expansion. The original version does not involve the user in the iteration and updates U_i according to its own criteria for filtering incorrect terms. In our scenario, we instead provide the correct terms in the update phase of U_i. We use the same input features as the other methods and set the hyperparameters to those Shen et al. (2017) reported as best.
Logistic Regression: We include logistic regression in our comparison because feature weighting is a conventional type of linear discriminant analysis. The logistic regression version, LR, takes a word representation and predicts the probability of the word belonging to the current dictionary. For the word representation, we concatenate the vectors of each feature space (explained in §5.2) and compress the result into 300 dimensions with singular value decomposition. In every iteration, we train LR from scratch with the positive and negative terms. For the negative terms at the first iteration (i.e., N_0), however, we randomly select |U_0| negative words from the entire vocabulary excluding the dictionary terms. We select candidates following the order of estimated probabilities. While we tried other models (SVM and random forest) and other dimensions of the input vector (no compression, 50, 100, 200, 500, and 1000), the above configuration was the best.
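The LR baseline can be sketched with scikit-learn as follows; the function name, argument layout, and compression size are our illustrative choices, and the paper's exact pipeline may differ in detail.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def lr_rank(X, pos_idx, neg_idx, k=300):
    """Compress concatenated feature vectors with SVD, fit logistic regression
    on the labeled terms, and rank the vocabulary by predicted probability."""
    # SVD compression of the (vocabulary x concatenated-features) matrix.
    k = min(k, min(X.shape))
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :k] * s[:k]
    # Fit on positives (label 1) and negatives (label 0), from scratch.
    y = [1] * len(pos_idx) + [0] * len(neg_idx)
    clf = LogisticRegression().fit(Z[list(pos_idx) + list(neg_idx)], y)
    # Candidates are suggested in order of estimated probability.
    probs = clf.predict_proba(Z)[:, 1]
    return np.argsort(-probs)
```

In the interactive loop, the top-ranked unseen terms would form the next candidate set C_i.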

Feature Weighting with Predefined Similarity:
We test six versions of our proposed method:
• FWPS: our base model without optimization, where w is the uniform distribution → §3
• +PickOne: selecting only the one feature space with the highest similarity scores among the positive terms → §3.3
• +Op(p): with optimization using positive feedback → Eq. (7)
• +Op(p/n): with optimization using both positive and negative feedback → Eq. (8)
• +Fd(p): +Op(p) with feedback denoising → §3.4
• +Fd(p/n): +Op(p/n) with feedback denoising → §3.4
We use the K-means algorithm for the feedback denoising in +Fd(p) and +Fd(p/n) with K = 3, though the overall trend was almost the same with K = 2 and 5.
Hybrid: We also introduce a joint method, HB, that combines LR and an FWPS version. The strategy is simple: HB first uses FWPS's mechanism to broadly cover candidate terms and then switches to LR once the amount of feedback increases. This naturally resolves LR's problems of requiring negative feedback from the beginning and demanding a moderate number of labels for training. Any of the FWPS versions can be combined with LR; we chose the best one for our experiment. The switch timing is empirically set to the 5th iteration.

Results

Table 2 lists the WCpI scores for each method across the five corpora with α = 0.0. In all domain texts, HB outperforms the others. The scores of LR are second highest, which implies that the combination with an FWPS model boosts performance. Among the versions of FWPS, +PickOne drops largely in score, which indicates the importance of the min-max optimization strategy for this task (see §3.3). However, at least when α = 0.0, which assumes the user never quits the process midway through, the performances of FWPS and the other versions with optimized w do not differ much. In particular, the negative feedback tends to degrade performance. SE, w2v(avg), and w2v(rank) perform poorly. SE may not be suitable for gathering arbitrary terms from a corpus of moderate size because it was originally designed and tested for collecting ontological terms from large-scale data (Shen et al., 2017). We also find that leveraging embeddings in a straightforward manner is not sufficient, especially for interactive dictionary construction.

Let us now discuss the changes when adjusting WCpI's α, listed in Table 3. Ignoring corpus differences, we take the average scores over all evaluation items. The most crucial change is found in LR, whose score drops significantly as α increases. When α = 0.1, the score of LR already becomes inferior to most of the FWPS versions.
Also, the scores of FWPS tend to be higher with a larger value of α. When α ≥ 0.3, +Fd(p) performs best among all methods. In short, LR suggests correct terms in later iterations, while FWPS, in particular the trainable versions (+Op(p), +Fd(p)), suggests correct terms in earlier iterations. Figure 7 directly describes the score differences under different values of α by showing the hit ratios, defined as |U_i|/|C|, at each iteration for LR, +Fd(p), and HB. Regardless of the number of seed terms, LR suggests fewer correct terms in earlier iterations, but its hit ratio stably exceeds that of +Fd(p) after obtaining a moderate number of training labels (around five iterations, i.e., fifty labels). On the other hand, +Fd(p) performs better than LR by a large margin in earlier iterations. In short, our method using predefined term similarities overcomes the small-sample issue that a conventional linear classifier suffers from and contributes to quick dictionary construction. This result is practically important because the analyst will go through repeated trial and error, observing documents from various points of view, by creating many small dictionaries. In addition, the contrast is much stronger when we give only one seed term (the upper graph), which is also meaningful because the user often starts dictionary construction with only one seed term in a real situation. HB enjoys both the coverage of LR and the quickness of +Fd(p). In other words, a conventional classifier and our method are complementary: LR is favorable when the user prioritizes coverage over quickness, and +Fd(p) is favorable in the opposite case. As a possible use case of HB, the analyst may quickly find interesting perspectives by creating various dictionaries with one of the FWPS methods and, once they are found, switch to a linear classifier to further expand the promising dictionaries.

Conclusion
To the best of our knowledge, this paper proposes the first formulation of interactive dictionary construction for text analytics, which clarifies the critical issues to resolve. In response to those issues, we provide a method, an evaluation framework, and an experimental dataset. Our experimental results also show the promising performance of our method in realistic text analytics situations. We hope this systematic study will pave the way for future research on the effective construction of dictionaries for text analytics.

A.1 Proof
We prove that our formulation of the optimization problem is a natural extension of that of SetExpan, assuming a reasonable normalization constraint for the term vectors and their weights. Notation follows the main paper. Recall that the formulation of SetExpan is

    F* = argmax_{F : |F| = Q} Σ_{1≤i<j≤n} Sim(e_i, e_j | F),

where Q, the number of features in F, is a fixed integer. Building on this, our formulation is

    w* = argmax_w Σ_{1≤i≤n} Sim(e_i, U | w).
Substituting (2) and (3) from the main paper into the above formulation, we obtain

    w* = argmax_w Σ_{1≤i≤n} Sim(e_i, U | w)
       = argmax_w (1/n) ( 2 Σ_{1≤i<j≤n} Sim(e_i, e_j | w) + Σ_{1≤i≤n} Sim(e_i, e_i | w) ).

On the right-hand side of the last equation, the second term is a constant when all of the vectors {v_{f_k}(e_i)}_{i=1,...,n} have the same norm and Σ_{k=1}^{L} w_k = 1. Then our optimization problem is equivalent to

    w* = argmax_w Σ_{1≤i<j≤n} Sim(e_i, e_j | w),

which is a continuous version of (1). Next, let us prove that, in the modified version of our optimization problem ((7) in the main paper), min_{1≤i≤n} Sim(e_i, U | w) is a concave function of w; hence, we can apply standard techniques of convex optimization to solve (7). First, let us rewrite the objective as follows:

    min_{1≤i≤n} Sim(e_i, U | w) = min_{1≤i≤n} Σ_{k=1}^{L} w_k Sim(e_i, U | f_k).

Then it is sufficient to prove the following lemma.

Lemma 1. Let a_{ik} (i = 1, . . . , n; k = 1, . . . , L) be constants. The function g(w) = min_{1≤i≤n} Σ_{k=1}^{L} w_k a_{ik} is concave in w when w is defined on a convex set.
Proof. For any w, w' in the domain and λ ∈ [0, 1],

    g(λw + (1−λ)w') = min_{1≤i≤n} ( λ Σ_{k} w_k a_{ik} + (1−λ) Σ_{k} w'_k a_{ik} )
                    ≥ λ g(w) + (1−λ) g(w').

Here we use the inequality min_{1≤i≤n}(A_i + B_i) ≥ min_{1≤i≤n} A_i + min_{1≤i≤n} B_i, which holds for any sequences of real numbers {A_i}_{i=1,...,n} and {B_i}_{i=1,...,n}.
Since {w : Σ_{k=1}^{L} w_k = 1, w_k ≥ 0 (k = 1, . . . , L)} is a convex set, we can apply this lemma to our objective function.