Prediction Models for Risk of Type-2 Diabetes Using Health Claims

This study focuses on highly accurate prediction of the onset of type-2 diabetes. We investigated whether prediction accuracy can be improved by utilizing lab test data obtained from health checkups and incorporating health claim text data such as medically diagnosed diseases with ICD10 codes and pharmacy information. In a previous study, prediction accuracy increased slightly when independent variables such as diagnosed disease names and prescribed medicines were added. Therefore, in the current study we explored more suitable prediction models using state-of-the-art techniques such as XGBoost and long short-term memory (LSTM) based on recurrent neural networks. The text data was vectorized using word2vec, and each prediction model was compared with logistic regression. The results confirmed that the onset of type-2 diabetes can be predicted with a high degree of accuracy when the XGBoost model is used.


Introduction
The incidence of lifestyle-related diseases is increasing in many regions (WHO, 2009; Lim SS et al., 2012). Predicting the onset of lifestyle-related diseases and implementing preventive measures in advance is important for municipalities and insurers. Particularly in type-2 diabetes mellitus, not only medical costs but also indirect costs such as reduced productivity present a serious problem (American Diabetes Association, 2018); therefore, it is very important to take preventive measures early.
From reports to date on the prediction of the onset of diabetes, it is well known that health checkup data items such as HbA1c, BMI, and age are important indicators for estimating the onset of type-2 diabetes (Edelstein et al., 1997). Many related studies achieved accurate results by means of logistic regression and Cox proportional hazards regression models based mainly on blood test results (Droumaguet et al., 2006; Guasch-Ferré et al., 2012). These studies aimed to predict the onset of type-2 diabetes using a simple form. However, it is now common for machine learning and data mining methods to be used thanks to higher computer performance. Several studies have reported the effectiveness of using machine learning techniques to improve classification accuracy (Meng et al., 2013; Tapak et al., 2013; Kavakiotis et al., 2017). Another line of work involves using clinical information such as health claims or electronic health records (EHRs). Health insurance claims data could prove to be a rich source of information for the early detection of type-2 diabetes, as previous studies showed a slight improvement in prediction using such data (Krishnan et al., 2013; Razavian et al., 2015).
In this study, we aim to develop and evaluate prediction models for the risk of type-2 diabetes using health insurance claims data in addition to health checkup data.

Related work
Many related studies are based on conventional prediction models for early detection of type-2 diabetes (Schulze et al., 2006; Thomas et al., 2006). Some research groups use a small number of risk factors as variables, as their intention is to develop a practical method. A simple risk score enables healthcare providers to evaluate patients for further intervention and treatment (Lindström et al., 2013; Kengne et al., 2014; Nanri et al., 2015). In these studies, logistic regression is one of the most effective models when compared to other machine learning models. On the other hand, healthcare data management systems now integrate large amounts of medical information, such as diagnoses, medical procedures, lab test results, and more. Health claims and EHRs are two examples of this medical information, which includes medical text data. It has been suggested that there are latent factors that could improve disease prediction models by including diagnoses and prescribed medicines (Krishnan et al., 2013; Razavian et al., 2015). In addition, natural language processing (NLP) techniques such as word2vec have been widely used to discover novel patterns and features (Choi et al., 2017; Jo et al., 2017). It is expected that data-driven assessment of individual patient risk would provide better personalized care (Neuvirth et al., 2011).
Recently, Razavian et al. (2015) showed that an L1-regularized logistic regression (L1LR) model with about 900 variables from health insurance claims data achieved an area under the ROC curve (AUC) of 0.80, compared with an AUC of 0.75 when using conventional diabetes risk factors. The L1LR model is an effective method when there are many independent variables, although a recent machine learning study suggested that a gradient boosting method (XGBoost) could achieve high-performance prediction (Wei et al., 2017). Furthermore, long short-term memory (LSTM), which is based on a recurrent neural network model, is well suited to capturing long-range dependencies in sequential data.
In this paper, we compare multiple prediction models for diabetes incidence using health checkup and insurance claims data. In the study, three classification models (i.e. L1LR, XGBoost, and LSTM) are developed, and their prediction performance is evaluated in terms of AUC.

Methods
In this section, the dataset and variables used for the evaluation of the proposed methods are described, and three prediction models are also presented.

Dataset
In the experiments, a collection of anonymized yearly health checkup and health claims data from a health insurance society in Japan is used. The health checkup items consist of profile information (e.g. age, sex), lab test results (e.g. body mass index, blood pressure, HbA1c), and a health questionnaire (e.g. smoking, alcohol intake, exercise level). We used 33 health checkup items as features for further experiments. The data were obtained from about 40,000 people aged 20 to 64 years. From the whole dataset, we selected those subjects who had health checkups regularly over a period of at least three years. In addition, we excluded samples with missing blood test data. After selection was complete, the final total sample size was 31,000. We randomly sampled 20% of the dataset for test data, and the rest was used for training. Subjects were diagnosed with diabetes if they had a measured fasting blood sugar (FBS) ≥126 mg/dL, HbA1c ≥6.5%, or a diagnosis of diabetes on a health insurance claim. The outcome was defined as onset of diabetes in the final year of the dataset.
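The diagnostic criteria above can be expressed as a simple labeling rule. The sketch below is our own illustration of that rule; the function and argument names are hypothetical, not from the study's code.

```python
def has_diabetes(fbs_mg_dl, hba1c_pct, claim_has_diabetes_dx):
    """Label a subject as diabetic per the study's criteria:
    fasting blood sugar >= 126 mg/dL, or HbA1c >= 6.5%,
    or a diabetes diagnosis recorded on a health insurance claim."""
    return (fbs_mg_dl >= 126) or (hba1c_pct >= 6.5) or claim_has_diabetes_dx
```

Any one of the three conditions is sufficient, which matches how positive examples are extracted from the combined checkup and claims data.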

Health insurance claims
Patient records of health insurance claims include medical costs, laboratory tests, medically diagnosed diseases with ICD10 (International Statistical Classification of Diseases and Related Health Problems) codes, and pharmacy information for the individuals between the years 2011 and 2016. About 5% of subjects had no claims data and had never visited clinics or hospitals. We used ICD10 codes and medicine name data as additional features. To build the training data, we first checked the FBS level and HbA1c in the health checkup data, and the ICD10 codes of diabetes in the health insurance claims, to extract positive examples.
Our goal is to predict the onset of diabetes one year ahead or later. Therefore, for training and prediction, we did not use health checkup results or health insurance claims from the year immediately preceding the diabetes diagnosis.
Since health insurance claims are issued monthly, a single claim can contain multiple ICD10 codes and medicine names. We preprocessed them using word2vec (Mikolov et al., 2013; Rehurek R 2014; Choi et al., 2017). Here, we regarded the array of ICD10 codes, or of medicinal ingredients of prescribed medicines, as one sentence. We then applied word2vec to obtain distributed representations of ICD10 codes and medicinal ingredients. In our experiments, we set the dimensions of both the ICD10 vector and the medicinal ingredient vector to 200. Through this preprocessing, one month's health insurance claim was converted into two vectors (an ICD10 vector and a medicinal ingredient vector).

Prediction model
As a baseline, a conventional L1LR model was used. For the L1 regularization hyperparameter, we searched over the values [0.001, 0.01, 0.1, 1, 10], and 0.1 was selected as the optimum value.
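A minimal sketch of this baseline and its hyperparameter search with scikit-learn, on synthetic stand-in data. Note that in scikit-learn the grid is expressed via the inverse regularization strength `C`; whether the paper's grid refers to `C` or to the regularization strength itself is not stated, so this sketch assumes the former.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concatenated checkup + claims features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Grid search over the values listed in the text, selecting by validation AUC.
best_c, best_auc = None, -1.0
for c in [0.001, 0.01, 0.1, 1, 10]:
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=c)
    clf.fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    if auc > best_auc:
        best_c, best_auc = c, auc
```

The `liblinear` solver is used because it supports the L1 penalty directly.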
In the experiment, we compare two state-of-the-art prediction models. One is XGBoost, a scalable machine learning system based on tree boosting (Chen T. and Guestrin C. 2016). To train the XGBoost model, we used the scikit-learn API with default parameters. For training both the XGBoost and L1LR models, all features, including medical checkup results and the distributed representations of ICD10 codes and medicinal ingredients, are simply concatenated. The other prediction model is long short-term memory (LSTM). Figure 1 shows the LSTM architecture used in our experiments. As shown in the figure, the LSTM method consists of two training parts: the first processes the health checkup data, and the second processes the ICD10 codes and/or the medicinal ingredients of prescribed medicines.
At each time step t, the input gate i_t, forget gate f_t, output gate o_t, and candidate cell vector g_t are computed as:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
g_t = tanh(W_g x_t + U_g h_{t-1} + b_g)

where W and U are weight matrices and b are bias vectors, and σ(·) and tanh(·) are the element-wise sigmoid function and hyperbolic tangent function, respectively. Using these vectors, the cell state c_t and the hidden layer vector h_t are calculated as follows:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)

where ⊙ is an element-wise multiplication. In our experiments, we used up to three kinds of feature sets (shown in Table 1). Each feature set is processed by an individual LSTM. After all feature sets have been processed by their LSTMs, the last hidden layer vectors are concatenated:

h = [h^(1); h^(2); h^(3)]

Using h, the output layer calculates the probability of diabetes.
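The single LSTM step above can be written directly in NumPy. This is a generic sketch of the standard equations (not the study's implementation), with small hypothetical dimensions for the usage example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b are dicts keyed by gate name
    ('i', 'f', 'o', 'g') holding weight matrices and bias vectors."""
    i = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])   # input gate
    f = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])   # forget gate
    o = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])   # output gate
    g = np.tanh(W['g'] @ x_t + U['g'] @ h_prev + b['g'])   # candidate cell
    c = f * c_prev + i * g       # element-wise (⊙) cell-state update
    h = o * np.tanh(c)           # hidden layer vector
    return h, c

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W = {k: rng.normal(size=(d_h, d_in)) for k in 'ifog'}
U = {k: rng.normal(size=(d_h, d_h)) for k in 'ifog'}
b = {k: np.zeros(d_h) for k in 'ifog'}
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), W, U, b)
```

Running one such LSTM per feature set and concatenating the final hidden vectors yields the input to the output layer described above.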

Results
The incidence of type-2 diabetes in our dataset was 4%. Detailed characteristics are shown in Table 2.
We developed three models, namely XGBoost, LSTM, and L1LR. For each model, we used four patterns of health claim variables. Table 3 shows the AUC for the three models. The results show that the performance of the XGBoost and LSTM models was superior to that of the L1LR model without health claim features. In our experiments, the highest performance was obtained when XGBoost with ICD10 plus medicine features was used. On the other hand, the L1LR model had the lowest AUC, although a slight improvement was obtained by incorporating health claim data. The LSTM model with ICD10 features showed relatively high performance; however, adding prescribed medicine features did not improve its prediction.

Discussion
In this study, we compared the predictive performance of a conventional model to that of machine learning-based models using health checkup data and additional health claim features vectorized by word2vec. The results showed that the XGBoost and LSTM models achieved better performance compared to the L1LR model without using health claim information. Adding health claim features improved prediction performance in each of the three models. This is consistent with a previous study in which use of the L1LR model obtained slightly improved prediction performance (Razavian et al. 2015). These results suggest that medical information contains latent signals for risk factors associated with the onset of diabetes.
In terms of how to use health claim data, a previous study encoded the data as one-hot vectors. However, one-hot encoding cannot express the relationships and meanings between words. By contrast, word2vec makes it possible to give latent meaning to the vectors. This effect appeared to be particularly beneficial for the XGBoost model.
In recent years, the LSTM model has been used to estimate disease names or mortality from medical information obtained from medical systems with a high degree of performance (Ayyar et al., 2016; Lipton et al., 2016; Jo et al., 2017). LSTM can capture dependencies across time-series data through its recurrent layers. Although we expected this effect in our experiments, prediction performance did not improve much when ICD10 codes and medicine names were used in combination, compared with using ICD10 codes alone. This result can probably be attributed to a difference in the quality of the information between diagnosed disease names and prescribed medicines.
Our study has several limitations. First, the vectorization of health claims data was empirically set to 200 dimensions, and it is not clear what the optimal dimensionality is. Second, the duration of the dataset is relatively short. From the standpoint of disease prevention, it may be desirable for predictive purposes to extend this period to three years or more. Finally, the dataset sample population may have been biased because our data collection depended on information from a single health insurance society.

Conclusion and Future Work
It would be useful in terms of practicality if risk could be estimated easily with noninvasive data. However, it is also very important, from the viewpoint of personal care, to predict the onset of disease with a high degree of precision using various types of medical information. In this study, we developed and evaluated several prediction models for type-2 diabetes to explore an effective means of vectorizing health claims. We used health claims, ICD10 codes, and prescribed medicine names as variables in addition to health checkup data, vectorizing them via word2vec. The results showed that the XGBoost model with health claim variables achieved higher performance than the LSTM and L1LR models. Our study suggests that there are potential factors contained in large amounts of medical information that may be signals of the onset of diabetes. It is possible that the LSTM model may still be able to further improve prediction performance as well. As future work, we plan to test the effect of dimensional compression through parameter tuning.