A Case Study of In-House Competition for Ranking Constructive Comments in a News Service

Ranking the user comments posted on a news article is important for online news services because comment visibility directly affects the user experience. Research on ranking comments with different metrics to measure the comment quality has shown “constructiveness” used in argument analysis is promising from a practical standpoint. In this paper, we report a case study in which this constructiveness is examined in the real world. Specifically, we examine an in-house competition to improve the performance of ranking constructive comments and demonstrate the effectiveness of the best obtained model for a commercial service.


Introduction
In online news services, the user comments posted on news articles function as a type of useful content known as user-generated content (UGC). Figure 1 shows examples of comments posted on Yahoo! JAPAN News, a Japanese news portal. By reading these comments along with the article, users can obtain supplementary information such as other users' opinions, experiences, and simplified explanations of the article. There is a limit, however, on the number of comments that can be displayed on a page, and as users typically do not have the time or inclination to read through all the comments, ideally they should be ranked in some way. Prioritizing the comments for display is directly linked to user satisfaction, so improving this ranking is an important issue for such services.
There have already been multiple studies on comment ranking in online news services and discussion forums (Hsu et al., 2009;Das Sarma et al., 2010;Brand and Van Der Merwe, 2014;Wei et al., 2016). All of these studies have utilized user feedback (e.g., "Like"-button clicks in Figure 1) as their ranking metrics. Although such user feedback is easy to obtain, this type of measurement has two drawbacks: (i) user feedback does not always satisfy the service provider's needs, such as to create a fair place (i.e., a news space that is neutral), and (ii) user feedback will be biased by where comments appear in a comment thread (also known as "position bias" (Craswell et al., 2008)). A typical example for (i) can be seen in political comments, where the "goodness" of the comment tends to be decided on the basis of the political views of the majority of the users rather than on its quality. A typical example of (ii) can be illustrated by a case where earlier comments tend to receive more feedback since they are displayed at the top of the page, which implies later comments will be ignored irrespective of their quality. To resolve this issue, Fujita et al. (2019) introduced a metric representing a comment's constructiveness (see Section 2 for details), which has also been studied in argument analysis (Kolhatkar and Taboada, 2017a;Napoles et al., 2017a). Interestingly, they found empirical evidence that the constructiveness has no correlation with the user feedback, which has been commonly used for ranking comments. This implies that we need to consider the constructiveness rather than the user feedback to avoid unfavorable situations (i) and (ii) in real services.
In this paper, we take their study one step further towards practical application. Specifically, in collaboration with Yahoo! JAPAN News, we report a case study of deploying a model that ranks constructive comments in a commercial service. The distinguishing feature of our study is that we aim to improve the ranking quality through an in-house competition. As represented by Kaggle (Kaggle, 2020), the machine learning competition platform, it has become common to improve a model's performance through a competition format. This kind of experiment has also been conducted in various research areas through shared-task workshops, with the WMT translation task (Barrault et al., 2019), TAC text analysis task (Demner-Fushman et al., 2018), and NTCIR information retrieval task (Kato and Liu, 2019) being well-known examples. Following this trend, we also aim to improve the ranking performance through a competition format. As this kind of work conducted within a company towards a commercial service is rarely released in the form of an academic paper, we expect our findings to become valuable knowledge for practitioners in the field. We clarify the novelty of our study against other previous studies in Section 7.
Our main contributions are as follows:
• We report the details of the in-house competition (i.e., constructive comment ranking task) conducted in a commercial news service, Yahoo! JAPAN News, where we obtained a new model with a 2.73% improvement in performance (NDCG) compared to the baseline (Section 3). We also administer a participant survey and discuss positive and negative opinions relating to this competition (Section 6).
• We consider several ensembles of the submitted models and show that the best one performed better than the best single model (Section 4). Nevertheless, the service did not find it reasonable for practical use when weighing the need for maintainability and low latency against the performance increase (0.62%). This suggests that while an ensemble of various models submitted in the competition is promising in an academic sense, it still has challenges in an industrial sense. We believe that this will open a new direction for the ensemble research field to solve such challenges.
• We demonstrate that the high-performance models in the competition are practically useful in the real world with a service perspective evaluation (Section 5), and in fact, the service decided to introduce the best single model.
• We will release the 59K labeled dataset and the models submitted in the competition for future research.

Preliminaries
Constructiveness: We use the concept of constructiveness to prioritize comments that provide insight and encourage healthy discussion. According to the dictionary (Oxford, 2020), the term "constructive" is defined as "having or intended to have a useful or beneficial purpose." However, this dictionary definition is a bit too generic to determine whether a comment is constructive or not. To avoid individual variation as much as possible, we need a more specific definition for our task. Thus, we follow a previous study (Kolhatkar and Taboada, 2017a) on constructiveness, where a questionnaire administered to 100 people clarified the detailed conditions for constructive comments. Table 1 shows a summarized version of the conditions, which was also used by Fujita et al. (2019). The conditions consist of one precondition for maintaining decency and relevance and four main conditions for representing typical cases of being constructive. Specifically, a constructive comment is defined as one that satisfies the precondition and at least one of the main conditions. YJCCR Dataset: We use (part of) the YJ Constructive Comment Ranking (YJCCR) Dataset, which was created by Fujita et al. (2019). The YJCCR dataset consists of more than 100K Japanese comments labelled with a constructiveness score (C-score), which is a graded numeric score representing the level of constructiveness for ranking comments. The C-score was defined as the number of crowdsourced workers who judged a comment as constructive in response to a yes-or-no (binary) question. As a consequence, the C-score indicates how many people think that a comment is constructive with the goal of sufficiently satisfying as many users as possible.
The detailed settings of the crowdsourcing were as follows. The task was prepared with questions referencing a news article and its comments extracted from Yahoo! JAPAN News and conducted on a crowdsourcing service. The workers were asked to read the definition of constructiveness and then judge whether each comment was constructive. To ensure reliability, only the results of serious workers who correctly answered quality-control questions that were randomly included in each task were kept. Ten workers were used for each comment in the dataset, so a C-score of 8, for example, means that eight workers judged a comment as constructive. The reliability of this annotation was confirmed with Krippendorff's alpha, which indicated "moderate agreement." The comments in Figure 1 are actual ones in the YJCCR dataset. The lower comment has a high score (9) because it includes a constructive opinion with some reasoning, whereas the upper comment has a low score (0) since it includes offensive content (see Appendix C for more examples).
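The C-score computation described above can be sketched as follows. This is a minimal illustration, not the pipeline used to build YJCCR: the tuple layout of `judgments`, the `reliable_workers` set, and the function name are assumptions introduced here.

```python
from collections import defaultdict


def c_scores(judgments, reliable_workers):
    """Count, per comment, the reliable workers who answered "constructive".

    judgments: iterable of (comment_id, worker_id, is_constructive) tuples
               (hypothetical layout, one tuple per yes-or-no answer).
    reliable_workers: ids of workers who passed the quality-control questions.
    """
    scores = defaultdict(int)
    for comment_id, worker_id, is_constructive in judgments:
        if worker_id not in reliable_workers:
            continue  # discard answers from workers who failed quality control
        scores[comment_id] += int(is_constructive)
    return dict(scores)
```

With ten reliable workers per comment, the returned value for a comment lies between 0 and 10, matching the C-score range used in the rest of the paper.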

In-House Competition
Task: The competition task consisted of ranking comments based on their degree of constructiveness, that is, the C-score defined in Section 2. Specifically, given that we have training data with triples {(a, x, y)} consisting of a news article a, a comment x on the article, and its corresponding C-score y, the task is to predict the ranking of comments for every article in the test dataset {(a, x)}, where the C-scores are unknown. The goal of this task is to create a model that predicts the correct ranking from the training data as closely as possible.
The competition was held for about six weeks (Dec. 13, 2018 to Jan. 23, 2019), and a dozen employees related to the comment ranking service were made aware of it. The information shared among them included not only the dataset but also sample code consisting of a simple feature extraction, model creation, and evaluation pipeline in order to reduce the burden on the participants. We also prepared a leaderboard to display the latest evaluation results for submitted models. The participants reported their evaluation results on the leaderboard and were able to update them any number of times during the competition period.
Dataset: The dataset consisted of a combination of the above-mentioned public dataset YJCCR and a new dataset of long comments created for this study. We used 49,215 comments (9,845 articles with five comments each) from the YJCCR dataset, each comment having a C-score assigned by crowdsourcing. While this dataset only contained comments up to 125 characters in length, we noticed in our preliminary experiments that long comments tended to be incorrectly judged as constructive, even though they have a bigger impact on visibility than short ones. For that reason, we additionally extracted long comments (from 126 up to the maximum of 400 characters) posted to the articles in YJCCR and created a long comment dataset with C-scores assigned by crowdsourcing in the same way as for YJCCR, as described in Section 2. The resulting combination of the above two datasets yielded 59,120 comments (9,845 articles with an average of six comments). We split it into 80% training data, 10% validation data, and 10% test data to form the competition dataset.
Evaluation: We used Normalized Discounted Cumulative Gain (NDCG) (Burges et al., 2005), which is a widely used evaluation measure for ranking tasks. In this competition, we adopted a variant defined as
$$\mathrm{NDCG@}k = Z_k \sum_{i=1}^{k} \frac{2^{r_i} - 1}{\log_2(i+1)},$$
which was also used in the Yahoo! Learning to Rank Challenge (Chapelle and Chang, 2011). This NDCG@k computes how close the top $k$ comments predicted by a model are to the correct ranking, where $r_i$ is the true C-score of the comment with predicted rank $i$, and $Z_k$ is a normalization term.
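As a concrete reading of this definition, the following sketch computes NDCG@k for the comments of a single article, together with the per-article average over k = 1, ..., K that serves as the competition's main measure. It assumes that $Z_k$ is the reciprocal of the ideal DCG@k (the DCG of the comments sorted by true C-score) and that an article whose comments all have a C-score of 0 receives a value of 0; neither detail is stated explicitly in the text, and the function names are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain of the first k graded scores."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(gains[:k], start=1))


def ndcg_at_k(true_scores, predicted_scores, k):
    """NDCG@k for one article; Z_k is assumed to be 1 / (ideal DCG@k)."""
    # Order the true C-scores by descending predicted score.
    order = sorted(range(len(true_scores)),
                   key=lambda j: predicted_scores[j], reverse=True)
    ranked = [true_scores[j] for j in order]
    ideal_dcg = dcg_at_k(sorted(true_scores, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # assumption: all-zero articles contribute 0
    return dcg_at_k(ranked, k) / ideal_dcg


def mean_ndcg(true_scores, predicted_scores):
    """Average of NDCG@k over k = 1..K, with K the number of comments."""
    K = len(true_scores)
    return sum(ndcg_at_k(true_scores, predicted_scores, k)
               for k in range(1, K + 1)) / K
```

Because the gain is $2^{r_i} - 1$, placing a comment with a C-score of 9 below one with a C-score of 5 is penalized far more heavily than a swap between two low-scored comments.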
To simplify the evaluation process, we set the average value of NDCG@k, i.e., $\frac{1}{K}\sum_{k=1}^{K} \mathrm{NDCG@}k$, as the main measure in the competition, where $K$ is the number of comments included in the article. Furthermore, to particularly encourage the performance improvement for long comments, we extracted a dataset consisting of only long comments (305 articles, 917 comments) from the test data and used its NDCG@k value as a supplementary measure. This was meant to discourage naive methods that merely judged long comments to be constructive. From here on we call the normal measure NDCG and the one for long comments NDCG-L.
Submitted Models: Eight individuals participated in the competition and submitted 14 models during the competition period (before the deadline). Figure 2 shows the total number of submissions across the competition period. We can see that the number of submissions was low during the initial period of the competition but increased significantly at the start of the year (beginning of work), when participants had relatively more time (Jan. 9, 2019), and on the day of the deadline (Jan. 23, 2019). Moreover, after the submission deadline had passed, several participants continued to work on the task and created an additional four models. We included these additional models when carrying out our analysis, although only the models submitted before the deadline were eligible for internal awards. We obtained a wide variety of models through the participants' trial and error, but due to space limitations, we only discuss in detail the four highest-performing models, which were Model-4, Model-11, Model-14, and Model-17. The following list includes the summary of each model with its detailed settings and features (see Appendix A for their hyperparameter settings).
• (Facebook, 2020), an open-source library that includes a subword-based extension of the skip-gram model (Mikolov et al., 2013). The training dataset consisted of 100M news articles in the service, and they were split into words using MeCab (ver. 0.996), a Japanese morphological analyzer (Kudo et al., 2004;Kudo, 2020a), with IPADIC (ver. 2.7.0). Finally, the features of each comment were set to the average vector of the pretrained word embeddings for the words in the comment.
• Model-11: The model with the highest sum of NDCG and NDCG-L. It is a linear rankSVM (Lee and Lin, 2014) model (pairwise learning) with features based on C-score prediction and the distance between an article and its comment, where this setting is a kind of stacking ensemble.
Model: The model was an L2-regularized L2-loss linear rankSVM model that was implemented with the well-known SVM tool LIBLINEAR (ver. 2.1.1) (Lin, 2020).
The cost parameter $C$ was determined from $\{2^{-13}, \ldots, 2^{1}\}$ on the basis of the performance on the validation set.
Features: The features consisted of two factors. The first was the expected C-score, which was determined by first computing the probabilities of C-scores (considered as classes) using the open-source library fastText (ver. 0.2.0) (Facebook, 2020)  which has an encoder-scorer structure consisting of BiLSTMs and Gated CNNs (see Appendix A for the detailed model structure). The C-score was predicted by (a) extracting the representations of the input subwords, (b) obtaining one vector averaging their representations, (c) estimating the classification probabilities, regarding the prediction problem of the C-score (0-10) as an 11-class classification problem, and (d) calculating the expected C-score with the probabilities. The loss was a combination of a pointwise loss, i.e., cross entropy loss for C-score probabilities, and a listwise loss, i.e., permutation probability loss for comment lists (Cao et al., 2007). The optimizer was Adam (Kingma and Ba, 2015) with parameters ($\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$), and the training was done in ten epochs with early stopping after random initialization in the range of [−0.01, 0.01], where the batch size was 32 and the dropout rate was 0.3.

Features: The features (input) were a sequence of subwords based on SentencePiece (ver. 0.1.8) (Kudo and Richardson, 2018;Kudo, 2020b), where the subword model was trained with the training data using the unigram language model algorithm with the vocabulary size of 5,000.
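Step (d) and the two loss terms mentioned above can be illustrated with a small sketch. It is not the submitted implementation: the listwise term below uses the top-one approximation of the permutation probability from Cao et al. (2007), which is an assumption, and the way the pointwise and listwise terms are weighted against each other is not stated in the text, so they are kept as separate functions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expected_c_score(class_probs):
    """Expected C-score, where class_probs[c] = P(C-score = c) for c = 0..10."""
    return sum(c * p for c, p in enumerate(class_probs))


def pointwise_loss(class_probs, true_c_score):
    """Cross-entropy loss of the 11-class prediction for one comment."""
    return -math.log(class_probs[true_c_score])


def listwise_top1_loss(true_scores, predicted_scores):
    """Top-one listwise loss over the comments of one article (assumed variant)."""
    target = softmax(true_scores)
    predicted = softmax(predicted_scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))
```

Treating the prediction as classification and then taking the expectation yields a real-valued score, so comments can still be ranked even when several share the same most likely class.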
Comparison with Baseline: We analyzed how well the submitted models performed compared to the baseline described below.
• Baseline: A linear rankSVM model (pairwise learning) with features based on term-frequency vectors. It was almost the same as the model in the previous study (Fujita et al., 2019) but was tuned for this competition. Model: The model was an L2-regularized L2-loss linear rankSVM model, which was implemented in LIBLINEAR (ver. 2.1.1). The cost parameter $C$ was determined from $\{2^{-13}, \ldots, 2^{1}\}$ on the basis of the performance on the validation set.
Features: The features consisted of the frequencies of words in each comment. Note that this setting performed better than the one-hot representations, the fractions (normalized frequencies) of the words, the number of distinct words, the tf-idf values, and any combinations thereof. They were scaled to the range of [−1, 1] by using svm-scale in LIBLINEAR.
As we can see, many models performed better than Baseline. Interestingly, a high NDCG score did not necessarily correspond to a high NDCG-L score, and in fact, Model-4 with a high NDCG in particular had a lower NDCG-L than Baseline. The use of the leaderboard had a positive effect, with participants submitting high-performance models for both measures in the latter half of the competition (right sides of the graphs).
In the end, the highest performance increase was 2.73% by Model-17 for NDCG and 2.34% by Model-14 for NDCG-L.
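For readers unfamiliar with pairwise learning, the Baseline's input preparation can be sketched as follows: term-frequency vectors, per-feature scaling to [−1, 1], and the difference vectors that a linear rankSVM separates. This is a simplified illustration under stated assumptions, not the LIBLINEAR pipeline itself: the comments are assumed to be already tokenized, `vocabulary` is a hypothetical fixed word list, and constant feature columns are mapped to 0 here as a convenience.

```python
from collections import Counter
from itertools import combinations


def term_frequencies(tokens, vocabulary):
    """Word-frequency vector of one tokenized comment."""
    counts = Counter(tokens)
    return [float(counts[word]) for word in vocabulary]


def scale_columns(rows, lo=-1.0, hi=1.0):
    """Min-max scale every feature column to the range [lo, hi]."""
    scaled_columns = []
    for column in zip(*rows):
        cmin, cmax = min(column), max(column)
        if cmax == cmin:
            scaled_columns.append([0.0] * len(column))  # assumption for constants
        else:
            scaled_columns.append(
                [lo + (hi - lo) * (v - cmin) / (cmax - cmin) for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def pairwise_examples(features, c_scores):
    """Difference vectors for comment pairs of one article with distinct C-scores."""
    examples = []
    for i, j in combinations(range(len(features)), 2):
        if c_scores[i] == c_scores[j]:
            continue  # ties carry no ordering information
        diff = [a - b for a, b in zip(features[i], features[j])]
        label = 1 if c_scores[i] > c_scores[j] else -1
        examples.append((diff, label))
    return examples
```

A linear model trained to classify these difference vectors yields a weight vector whose dot product with a comment's features can be used directly as its ranking score.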

Model Ensemble
To further improve the performance, we considered using an ensemble of the models submitted in the competition. For ease of implementation, we focused on unsupervised ensemble methods that combine predicted scores. Assuming practical use, we only used the models that could accurately (or stably) reproduce their leaderboard performance, resulting in ensembles of 12 models. Ensemble Methods: We prepared various ensemble methods covering both commonly used and recently proposed ones as follows.
• ScoreAve: Use the average of the predicted scores of all models as an ensemble score.
(Kobayashi, 2018), where the similarity of two outputs was calculated with NDCG.
• WeightEval: Use the weighted average of the top-k promising outputs (Fujita et al., 2020), where k was chosen with the validation set. This method is a hybrid of output selection (PostEval) and output average (NormAve), where NDCG was used as a similarity function for selecting and weighting.
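The effect of score scales on a plain average can be illustrated with a short sketch. `score_ave` follows the ScoreAve description above; `norm_ave` is only an assumed stand-in for a normalized average, using min-max scaling over the comments of one article, since the exact normalization is not specified here. All function names are illustrative.

```python
def score_ave(model_scores):
    """Average the raw scores of all models, comment by comment.

    model_scores: one list of predicted scores per model, aligned so that
    position j always refers to the same comment.
    """
    n_models = len(model_scores)
    return [sum(column) / n_models for column in zip(*model_scores)]


def min_max(scores):
    """Rescale one model's scores to [0, 1] (assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def norm_ave(model_scores):
    """Average the scores after putting every model on the same scale."""
    return score_ave([min_max(scores) for scores in model_scores])


def ranking(scores):
    """Comment indices sorted by descending ensemble score."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

When one model outputs values in the hundreds and another in fractions, `score_ave` effectively reproduces the ranking of the large-scale model, whereas the normalized variant lets both models contribute.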
Evaluation Measures: Along with NDCG and NDCG-L, we used NDCG@3 and Prec@3 as supplementary measures, since only the top three comments are displayed first on each article page in the actual service, although users can read all comments on the next comment list page. Prec@3 is defined as the proportion of the predicted top-3 comments being in the correct top-3. Note that Järvelin and Kekäläinen (2002) reported that NDCG is more suitable than precision for graded scores like in our setting.
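A minimal sketch of Prec@3 as defined above is given below. How ties among the true C-scores are resolved when forming the correct top-3 is not specified in the text; this sketch assumes that ties keep their original order, and the function name is illustrative.

```python
def precision_at_k(true_scores, predicted_scores, k=3):
    """Proportion of the predicted top-k comments that are in the true top-k."""
    n = len(true_scores)
    k = min(k, n)
    if k == 0:
        return 0.0
    indices = range(n)
    predicted_top = set(sorted(indices, key=lambda j: predicted_scores[j],
                               reverse=True)[:k])
    # Assumption: ties in the true C-scores are broken by original order.
    true_top = set(sorted(indices, key=lambda j: true_scores[j],
                          reverse=True)[:k])
    return len(predicted_top & true_top) / k
```

Unlike NDCG, this measure ignores both the order within the top three and the size of the C-score gap, which is consistent with its role as a supplementary measure.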
Results: Table 2 lists the results of the four high-performance models in Section 3 and the six ensembles of submitted models. Looking at the ensemble models, we can see that the recently proposed WeightEval performed the best for the main measure NDCG, and NormAve also performed competitively despite its simplicity. ScoreAve and RankAve did not perform as well as NormAve, as ScoreAve did not adjust outputs with different scales and RankAve, although it adjusts them, ignores the shapes of the scores. These results imply that score adjustment (NormAve, TopkAve) and model selection (PostEval, WeightEval) contributed to the performance improvement. As a whole, NormAve is the most promising for practical use, since TopkAve and WeightEval need parameter tuning. Looking at single models, all the models performed better than Baseline for the main measure NDCG, and Model-17 performed the best overall. The differences between Baseline and Model-17 and between Model-17 and NormAve for the main measure NDCG were statistically significant in a Wilcoxon signed-rank test (p < 0.05). The high NDCG-L of Model-14 seems to be related to how its features were constructed. Model-14 used maximal substrings, which cover longer text spans than ordinary words. This implies that Model-14 can successfully characterize long comments, even if it might be harmful for short ones. We may need to consider this effect for other tasks including only long texts, although it was not effective for the main measure NDCG of our task since most comments are short.

Towards Practical Use
To determine if the submitted models can be used in the running service, we carried out a qualitative evaluation from the perspective of service, not just constructiveness. Specifically, we prepared the comment lists ranked by candidate models for each news article and asked three experts in the comment service to rank them. We instructed the experts to evaluate them on the basis of "which list should be provided as a service" rather than "which list is constructive," as the goal of this evaluation was to improve the service quality. As an evaluation measure, we calculated the micro-average of the ranks by the experts over the evaluation data prepared separately from the competition data. We used 104 articles (
Likes/Dislikes. This model has been used in the service.
• Latest: A model ranking in descending order of comment date. This model is a naive method used when user feedback and constructiveness scores are not available.
• Length: A model ranking in descending order of comment length. This model is a naive method based on the rule of thumb that long comments tend to be constructive.
• Baseline: A model ranking in descending order of predicted C-score, which is almost the same as the model in the previous study (Fujita et al., 2019) but has been tuned to this competition.
Table 3 shows the results of the qualitative evaluation. We can see that Baseline clearly performed better than the other models. The differences between Baseline and Feedback and between Baseline and Length were statistically significant in a Wilcoxon signed-rank test (p < 0.05). These results mean that the finding in the previous paper holds true even in human evaluation.
Baseline vs. Submitted Models: We prepared the four high-performance single models in Table 2 (excluding ensemble models) for comparison with Baseline.
We also suggested introducing the most promising ensemble model, NormAve, but the service preferred not to because it would be unreasonable to maintain 12 different models and to re-normalize the scores every time a comment was posted, given that static scores must be stored in the DB to satisfy the low-latency constraint. Table 4 lists the results of the qualitative evaluation. As shown, the best single model for NDCG, Model-17, also had the best (lowest) average rank. The difference between Baseline and Model-17 was statistically significant in a Wilcoxon signed-rank test (p < 0.05). This implies that a competition format is effective in terms of obtaining an improved model even when we consider service-level judgment. As a result, the service introduced Model-17 into its comment ranking module.
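The average rank reported in this evaluation can be read as a micro-average, i.e., every (article, expert) judgment is pooled and weighted equally. The sketch below illustrates this reading; the data layout and the function name are hypothetical, and it does not reproduce the significance test.

```python
def micro_average_rank(judgments):
    """Average rank per model over all pooled (article, expert) judgments.

    judgments: list of dicts mapping a model name to the rank (1 = best)
    that one expert assigned to that model's comment list for one article.
    """
    totals, counts = {}, {}
    for judgment in judgments:
        for model, rank in judgment.items():
            totals[model] = totals.get(model, 0) + rank
            counts[model] = counts.get(model, 0) + 1
    return {model: totals[model] / counts[model] for model in totals}
```

A lower value means that the experts more often preferred the comment list produced by that model.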
One of the reasons Model-17 performed better than the others seems to be related to the fact that it had a full neural structure (as explained in Section 3), which implies "robustness" (or expressiveness of the model) thanks to a lot of parameters, as in Neyshabur et al. (2017)'s study. In fact, the evaluators reported that Model-17 had few critical errors compared to the other models. Although Model-4 and Model-11 performed well in Table 2 (automatic evaluation), we will have to consider the robustness (or the number of critical errors) from a practical point of view. Note that the detailed investigation of these factors is beyond the scope of this study.

Participant Survey and Future Issues
After the competition, we collected opinions from the participants through an optional survey. We discuss certain positive and negative opinions in detail below (see Appendix B for other opinions). Positive Opinions: The most popular opinion was that the number of model submissions was greater than initially expected. According to the participants, this was mainly due to the game element of the competition, i.e., publicly competing against other participants. In other words, the fun of the task was an implicit incentive to encourage submissions. As a result, we were able to use a wide variety of models for the ensemble experiment (Section 4), which seems to have contributed to the performance improvement. Another interesting opinion was about disclosure of the modeling methods. In this competition, the participants were encouraged to include model descriptions such as structures and features when reporting their evaluation results on the leaderboard. This information helped the participants make improved models, which contributed to the best performance of single models (Section 3). Other positive opinions were related to the improved knowledge and skills acquired by the participants. Negative Opinions: One major negative opinion was about the leaderboard system, where the participants individually posted their own results pertaining to the evaluation tool and test data. This setting allowed the participants to purposefully design models effective only on the test data, although we confirmed that they actually used the validation data for fine-tuning. To hold a competition on a larger scale, we should prepare an automatic evaluation system with private test data. Such a setting is relatively common in strict competitions such as Kaggle, while most test datasets tend to be publicly available in research communities (under research ethics). 
Another insightful opinion was to make an incentive for exploring new directions, since it is valuable to obtain findings in unknown/rare directions, even if the results are not superior. In addition, model diversity can contribute to the ensemble performance, as discussed above. We suggest preparing a special prize for novelty in order to encourage exploring different directions.

Related Work
Constructiveness: Analyzing the comments on online news services or discussion forums has been extensively studied (Wanas et al., 2008;Ma et al., 2012;Llewellyn et al., 2016;Shi and Lam, 2018). In this line of research, many studies have focused on ranking comments (Hsu et al., 2009;Das Sarma et al., 2010;Brand and Van Der Merwe, 2014;Wei et al., 2016). However, the prior approaches have been based on user feedback, which is completely different from constructiveness.
Constructiveness has been introduced in argument analysis frameworks (Napoles et al., 2017a,b;Kolhatkar and Taboada, 2017a,b;Kolhatkar et al., 2020). The purpose of these studies was to classify constructive comments, whereas Fujita et al. (2019) recently expanded their tasks to a ranking one. They created a new dataset for ranking constructive comments on a news service and showed that the commonly used method that ranks comments by user feedback does not contribute to constructiveness in terms of automatic evaluation (NDCG). Our study has value as a deployment report of their approach, and we also confirmed that constructiveness performed better than user feedback for ranking comments in terms of human evaluation by experts.
Aside from constructiveness and user feedback, we may consider hate speech detection (Kwok and Wang, 2013;Nobata et al., 2016;Davidson et al., 2017) and sentiment analysis (Fan and Sun, 2010;Siersdorfer et al., 2014) as alternative approaches for analyzing the quality of comments on the basis of their content. Although these approaches are useful for other tasks, they do not directly solve our task, namely, ranking constructive comments. For example, the simple comment "Great!" is positive and is not hate speech, but it is not suitable as a top-ranked comment in our task.
Shared Tasks and Competitions: There have been many competitions in various research areas through shared-task workshops, such as the WMT translation task (Barrault et al., 2019), TAC text analysis task (Demner-Fushman et al., 2018), and NTCIR information retrieval task (Kato and Liu, 2019). Their purpose, finding good models for a specific task, is almost the same as ours; the main difference (ignoring the task) is that the competition in our work was conducted within a company. As this kind of work towards a commercial service is rarely released in the form of an academic paper, we expect that our findings will become valuable knowledge for practitioners in this field.
As for "learning to rank" tasks, there have also been several competitions such as the Internet Mathematics 2009 (Yandex, 2020), the Yahoo! Learning to Rank Challenge (Chapelle and Chang, 2011), and the Personalized Web Search Challenge (Kharitonov and Serdyukov, 2020). Their tasks are basically to rank pages in terms of relevance to a search query, which is common in the information retrieval field. In contrast, our task is to rank comments in terms of constructiveness. It has value in the sense of applying the concept of argument analysis in the real world.
A unique aspect of our work is the ensemble of submitted models in the competition. Although there have been many studies on model ensembles (Hoi and Jin, 2008;Cormack et al., 2009;Burges et al., 2011), the models for prior ensemble experiments were basically prepared by either random initialization or a researcher's preference, which is different from our competition setting. The most closely related study involves the concept of "Resource by Collaborative Contribution (RbCC)" (Sekine et al., 2019), which collaboratively creates a large-scale dataset for named entity recognition by using the predicted labels of submitted models in a shared task, although their purpose and task were completely different from ours. We believe our findings in a commercial service will be useful for future ensemble studies.

Conclusion
We reported a case study of an in-house competition for ranking constructive comments. Our experimental results showed that the competition format is effective for testing various model structures, and that ensembling submitted models can further improve the ranking performance. Moreover, we confirmed that the submitted models were practically useful in a service perspective evaluation.

samples ('neg') = 5, and loss function ('loss') = 'softmax'.
• Parameters of LightGBM for Model-14: boosting type ('boosting_type') = Gradient Boosting Decision Tree (Friedman, 2000) ('gbdt'), objective function ('objective') = L2-loss ('regression'), evaluation metric ('metric') = L2-loss ('l2'), maximum number of leaves in one tree ('num_leaves') = 128, learning rate ('learning_rate') = 0.1, fraction to randomly select part of features on each iteration or tree ('feature_fraction') = 0.9, fraction to randomly select part of data without resampling ('bagging_fraction') = 0.8, frequency for bagging ('bagging_freq') = 5 (every 5 iterations), maximum number of bins that feature values are bucketed in ('max_bin') = 1000, number of iterations ('num_iteration') = 1000, and number of rounds for early stopping ('early_stopping_rounds') = 10 (stop if a validation metric does not improve in last 10 rounds).
• Feature construction for NDCG-L. The substrings were extracted by making a dictionary of maximal substrings (whose frequencies were more than 2) from all the comments by using a suffix tree-based extraction algorithm (Okanohara and Tsujii, 2009) with pykwic (ver. 0.1.5), a Python library (Aihara, 2020), and searching for maximal substrings in each comment by using the Aho-Corasick dictionary-matching algorithm (Aho and Corasick, 1975) with pyahocorasick (ver. 1.4.0), another Python library (Muła, 2020).
Table 5 shows the details of the participant survey (translated from Japanese to English). Table 6 shows examples of scored comments (translated into English) in the YJCCR dataset. Ex. 1 has a high score because it includes a constructive opinion with some reasoning. Ex. 2 has a middle score because the judgement, e.g., whether the comment is a new idea, depends on each worker's background knowledge. Ex. 3 has a low score since it includes offensive content.
Opinion
+ There were more participants than initially expected and a wide variety of models were submitted, so it turned out to be a good competition.
+ Since the participants disclosed their modeling methods, there were cases where one participant adopted the methods of other participants, which had a positive effect on improving the model's performance.
+ Although I did not understand much about the work I was not in charge of, my participation in this competition deepened my understanding of the task and made it easier to participate in discussions during meetings.
+ I managed to learn a lot through trial and error in the competition.
- It would be better to have a system that automatically evaluates predictions upon submission.
- It would be better to not publicly disclose the test data.
- When we were able to create a model with a high performance, we could not share detailed knowledge such as what kind of library was used, so it seems like there is room for improvement in the knowledge sharing system.
- It would be good to have a system that rewards not only an increase in performance but also trying out new methods. If we give freedom, punishment should also be strictly given.

6
They are irrational because they smoke, or they smoke because they are irrational. 0
Table 6: Examples of comments and scores for article "Lifting the ban on drinking and smoking at 18."