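Abstract: |
Objective To construct a prediction model by analyzing the factors associated with type 2 diabetes mellitus combined with
hypertension. Methods A total of 475 patients with type 2 diabetes mellitus combined with hypertension were selected as the
case group, and 505 healthy individuals from the physical examination center during the same period served as the control
group. The feature variables selected by least absolute shrinkage and selection operator (LASSO) regression were used as
inputs to random forest (RF), extreme gradient boosting (XGBoost), and logistic regression models. The best prediction model
was obtained through Bayesian optimization and iterative training with cross-validation, and feature importance ranking and
Shapley additive explanations (SHAP) were then used for interpretation. Results Feature selection identified urine glucose
(GLU) (OR=1.189, 95% CI=1.170~1.208, P<0.05), family history of diabetes (OR=1.341, 95% CI=1.273~1.411, P<0.05), age
(OR=1.006, 95% CI=1.004~1.009, P<0.05), body mass index (BMI) (OR=1.017, 95% CI=1.010~1.023, P<0.05), heart rate (HR)
(OR=1.004, 95% CI=1.003~1.006, P<0.05), education level (OR=0.954, 95% CI=0.934~0.975, P<0.05), and place of residence
(OR=0.958, 95% CI=0.931~0.985, P<0.05) as the main feature variables. The experiments showed that, after parameter tuning,
both the RF and XGBoost models outperformed the logistic regression model, and the accuracy of XGBoost (92.85%) was slightly
higher than that of RF (92.34%). Feature importance ranked the influencing factors of type 2 diabetes mellitus combined with
hypertension in the order GLU, family history of diabetes, education level, place of residence, age, BMI, and HR; GLU, family
history of diabetes, age, BMI, and HR were risk factors, whereas education level and place of residence were protective
factors. Conclusion The XGBoost-based prediction model for type 2 diabetes mellitus combined with hypertension showed better
performance. Using SHAP to enhance the interpretability of the model makes it possible to identify the risk factors for the
disease and provides a reference for the prevention of type 2 diabetes mellitus combined with hypertension. |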
Key words: diabetes with hypertension; random forest; extreme gradient boosting; classification prediction; SHAP model |
DOI: |
|
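Funding: Guangdong Basic and Applied Basic Research Foundation, Regional Joint Fund Project (Key Project) (2020B1515120021); Guangdong Medical University Discipline Construction Project (4SG21276P) |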
|
Application and comparison of type 2 diabetes with comorbid hypertension classification prediction models based on random forest and XGBoost algorithms |
|
Abstract: |
Objective To construct a prediction model for type 2 diabetes mellitus combined with hypertension by analyzing its related
factors, with the aim of supporting early detection and treatment. Methods A total
of 475 patients with type 2 diabetes mellitus combined with hypertension from the Endocrinology Department of Guangdong
Medical University Affiliated Hospital and Affiliated Second Hospital from March to December 2022 were selected as the case
group, while 505 healthy individuals undergoing physical examinations during the same period were chosen as the control
group. The feature variables selected by Least Absolute Shrinkage and Selection Operator (LASSO) regression were used as
inputs for Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Logistic Regression models. The best predictive
model was obtained through Bayesian optimization and iterative training with cross-validation. Finally, feature importance
ranking and Shapley additive explanations (SHAP) were used to interpret the model. Results Feature selection
identified urine glucose (GLU) (OR=1.189, 95% CI=1.170~1.208, P<0.05), family history of diabetes (OR=1.341,
95% CI=1.273~1.411, P<0.05), age (OR=1.006, 95% CI=1.004~1.009, P<0.05), body mass index (BMI) (OR=1.017,
95% CI=1.010~1.023, P<0.05), heart rate (HR) (OR=1.004, 95% CI=1.003~1.006, P<0.05), education level (OR=0.954,
95% CI=0.934~0.975, P<0.05), and place of residence (OR=0.958, 95% CI=0.931~0.985, P<0.05) as the main feature variables.
After parameter tuning, the RF and XGBoost models both outperformed the logistic regression model, and the accuracy of
XGBoost (92.85%) was slightly higher than that of RF (92.34%). Feature importance ranked the influencing factors of
type 2 diabetes mellitus combined with hypertension in the following order: GLU, family history of diabetes, education level,
place of residence, age, BMI, and HR. Among these, GLU, family history of diabetes, age, BMI, and HR were risk factors,
while education level and place of residence were protective factors. Conclusion The XGBoost-based prediction model for
type 2 diabetes mellitus combined with hypertension performed better than the RF and logistic regression models. Using SHAP
to make the model more interpretable allowed the risk factors for the disease to be identified, providing a reference for
the prevention of type 2 diabetes mellitus combined with hypertension. |
Key words: diabetes with hypertension; random forest; XGBoost; classification prediction; SHAP model |
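
The reading of the odds ratios as risk or protective factors, and the idea behind a Shapley-value explanation, can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes, purely for illustration, that each reported OR can be treated as exp(beta) in a logistic model, and it computes exact Shapley values by enumerating feature subsets rather than using the SHAP library on a trained XGBoost model. The intercept, the variable coding, and the baseline and patient values are all hypothetical.

```python
from itertools import combinations
from math import comb, exp, log

# Odds ratios as reported in the abstract (feature selection results).
ODDS_RATIOS = {
    "GLU": 1.189, "family_history": 1.341, "age": 1.006, "BMI": 1.017,
    "HR": 1.004, "education": 0.954, "residence": 0.958,
}

# ASSUMPTION: treat each OR as exp(beta) of a logistic model, so beta = ln(OR).
BETAS = {name: log(value) for name, value in ODDS_RATIOS.items()}

# HYPOTHETICAL intercept, variable coding, and feature values (not from the study).
INTERCEPT = -2.0
BASELINE = {"GLU": 0, "family_history": 0, "age": 50, "BMI": 23.0,
            "HR": 72, "education": 2, "residence": 1}
PATIENT = {"GLU": 3, "family_history": 1, "age": 62, "BMI": 27.5,
           "HR": 84, "education": 1, "residence": 0}


def direction(odds_ratio):
    """OR > 1 raises the odds (risk factor); otherwise it lowers them (protective)."""
    return "risk" if odds_ratio > 1 else "protective"


def log_odds(x):
    """Linear predictor of the illustrative logistic model."""
    return INTERCEPT + sum(BETAS[f] * x[f] for f in BETAS)


def probability(x):
    """Predicted probability from the linear predictor."""
    return 1.0 / (1.0 + exp(-log_odds(x)))


def shapley_values(value_fn, x, baseline):
    """Exact Shapley values of value_fn at x relative to a baseline point.

    Features outside the coalition are held at their baseline values.
    """
    features = list(x)
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of size k.
            weight = 1.0 / (n * comb(n - 1, k))
            for subset in combinations(others, k):
                without = {g: (x[g] if g in subset else baseline[g]) for g in features}
                with_f = dict(without)
                with_f[f] = x[f]
                total += weight * (value_fn(with_f) - value_fn(without))
        phi[f] = total
    return phi


contributions = shapley_values(log_odds, PATIENT, BASELINE)
for name in sorted(contributions, key=lambda f: abs(contributions[f]), reverse=True):
    print(f"{name:15s} {direction(ODDS_RATIOS[name]):10s} {contributions[name]:+.4f}")
print(f"predicted probability: {probability(PATIENT):.3f}")
```

Because this toy model is linear in the log-odds, each Shapley value reduces to beta times the difference between the patient's value and the baseline, and the contributions sum to the difference in log-odds between the two points. For tree ensembles such as RF or XGBoost the attributions are generally not this simple, which is why a dedicated explainer is normally used.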