Leader: Licia Iacoviello (NEUROMED); Other collaborator(s): Giuseppe Sergi (UNIPD); Patrizia Rovere Querini (UNISR)
Functional outcomes are of fundamental importance in monitoring healthy aging in the population and may provide useful prognostic markers of clinical risks associated with aging (e.g. hospitalizations and mortality). Applying advanced statistical and AI techniques to large longitudinal cohorts, we will predict adverse functional outcomes, using measures of frailty, cognitive performance, and physical and mental quality of life, based on high-dimensional datasets that include both biological and environmental features. This will help stratify subjects by their predicted risk of adverse outcomes and identify the population subsets that need deeper follow-up, assessment and healthcare interventions.
Brief description of the activities and of the intermediate results: Algorithms for the prediction of cognitive performance in the Moli-sani cohort have been developed. In particular, feature selection, hyperparameter tuning, training and testing have been carried out for different models, including multiple linear regression (with and without stepwise selection of features), elastic net, ridge and lasso regressions, and XGBoost.
Performance across these algorithms has been tested and the most important features within each model have been identified, either through association betas or through agnostic approaches like permutation feature importance and SHAP (SHapley Additive exPlanations) analysis.
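As an illustration of the model-agnostic approach mentioned above, the following is a minimal Python sketch of permutation feature importance. It is not the code used in the project (the analyses were run in R): the data are simulated and `toy_model` is a hypothetical stand-in for any fitted predictor. Importance is measured as the average increase in mean squared error after shuffling one feature at a time.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = mse(y, [predict(row) for row in X_perm])
            increases.append(permuted - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Simulated data (not cohort data): feature 0 matters most, feature 2 not at all.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]

def toy_model(row):
    """Hypothetical fitted model, used only for illustration."""
    return 3.0 * row[0] + 0.5 * row[1]

y = [toy_model(row) for row in X]
importance = permutation_importance(toy_model, X, y)
print([round(v, 3) for v in importance])
```

Because the score depends only on predictions, the same function can be applied unchanged to linear, tree-based or kernel models, which is what makes the resulting rankings comparable across algorithms.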
Further algorithms will be implemented (e.g. random forests and deep neural networks) and the variety of predictors will be enlarged (e.g. including air pollution, polygenic susceptibility and social stress scales), through a real exposomic approach.
Main policy, industrial and scientific implications: This activity will provide useful tools to predict health outcomes in the elderly population through a real exposomic approach.
The MoCA questionnaires collected in the CASSIOPEA study, to evaluate cognitive function, have been analyzed to derive a cognitive performance score. This score has been associated with the pro-inflammatory potential of diet and other lifestyle factors; a possible mediating role of inflammation has been tested.
At the same time, we are using a supervised learning approach to select potential predictive features of cognitive performance through an exposomic approach that includes environmental exposures, lifestyle, genetic and epigenetic susceptibility to dementia, biomarkers, and a detailed biometric and psychometric profile.
We are also evaluating available methods for validating incident dementia cases in the cohort, comparing them with the literature in the field (i.e., evaluating similar studies in population-based cohorts) to estimate the feasibility of building a supervised learning algorithm that includes time-to-event analysis.
We fitted elastic net regression models through the ‘glmStepAIC’ method of the ‘caret’ library (v. 6.0-94) in R. This involved hyperparameter tuning with 5-fold cross-validation, and training and testing in two independent subsets of the cohort accounting for 80% and 20% of the data, respectively. The results were compared with those of other regularized regressions, namely lasso and ridge regression models, fitted following the same procedure. A feature importance analysis was also performed through the ‘varImp’ method of the ‘caret’ package, to identify the most influential features and compare them with those of the other models.
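The workflow described above (an 80%/20% split, 5-fold cross-validation for tuning, and an elastic net penalty) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ‘caret’ implementation: the data are simulated, the penalty grid is arbitrary, the mixing parameter is fixed at 0.5, and the model is fitted with a plain coordinate descent routine written for this example.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso part of the update)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_elastic_net(X, y, lam, alpha, n_sweeps=100):
    """Coordinate descent for RSS/(2n) + lam*(alpha*|b|_1 + (1-alpha)/2*|b|_2^2)."""
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    intercept, beta = sum(y) / n, [0.0] * p
    residual = [yi - intercept for yi in y]
    for _ in range(n_sweeps):
        for j, col in enumerate(columns):
            z = sum(v * v for v in col) / n
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            updated = soft_threshold(rho, lam * alpha) / (z + lam * (1 - alpha))
            residual = [r - v * (updated - beta[j]) for v, r in zip(col, residual)]
            beta[j] = updated
        shift = sum(residual) / n  # the intercept is not penalized
        intercept += shift
        residual = [r - shift for r in residual]
    return intercept, beta

def predict(model, X):
    intercept, beta = model
    return [intercept + sum(b * v for b, v in zip(beta, row)) for row in X]

def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def cross_validated_mse(X, y, lam, alpha, k=5):
    """Mean validation MSE over k folds (every k-th row forms one fold)."""
    n, errors = len(X), []
    for fold in range(k):
        held_out = set(range(fold, n, k))
        X_fit = [X[i] for i in range(n) if i not in held_out]
        y_fit = [y[i] for i in range(n) if i not in held_out]
        X_val = [X[i] for i in sorted(held_out)]
        y_val = [y[i] for i in sorted(held_out)]
        model = fit_elastic_net(X_fit, y_fit, lam, alpha)
        errors.append(mse(y_val, predict(model, X_val)))
    return sum(errors) / k

# Simulated data (not cohort data): 4 features, two of which are pure noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [2.0 * row[0] - 1.0 * row[1] + rng.gauss(0, 0.5) for row in X]

# 80% / 20% split into training and test sets.
order = list(range(len(X)))
rng.shuffle(order)
cut = int(0.8 * len(order))
train, test = order[:cut], order[cut:]
X_train, y_train = [X[i] for i in train], [y[i] for i in train]
X_test, y_test = [X[i] for i in test], [y[i] for i in test]

# Tune lambda by 5-fold cross-validation on the training set only.
lambda_grid = [0.01, 0.1, 1.0]
cv_errors = {lam: cross_validated_mse(X_train, y_train, lam, alpha=0.5)
             for lam in lambda_grid}
best_lambda = min(cv_errors, key=cv_errors.get)

# Refit on the full training set and evaluate once on the held-out test set.
final_model = fit_elastic_net(X_train, y_train, best_lambda, alpha=0.5)
test_mse = mse(y_test, predict(final_model, X_test))
print(best_lambda, round(test_mse, 3))
```

Setting `alpha` to 1 or 0 in this sketch gives the lasso and ridge penalties, respectively, so the three regularized regressions can be compared with the same procedure.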
We tested XGBoost for the prediction of the MoCA score through the ‘xgbLinear’ and ‘xgbTree’ methods of the ‘caret’ library (v. 6.0-94) in R. This involved feature selection, hyperparameter tuning, training and testing of the model (80% of the data for training, 20% for testing) and 5-fold cross-validation. The influence of features on the prediction was analyzed through a SHAP (SHapley Additive exPlanations) analysis using the ‘SHAPforxgboost’ library (v. 0.1.3). We then plotted the computed SHAP values using the ‘shapviz’ library (v. 0.9.3) and compared them with those of the other models tested.
We tested support vector regression (SVR) algorithms for the prediction of the MoCA score through the ‘svmLinear’, ‘svmRadial’, and ‘svmPoly’ methods of the ‘caret’ library (v. 6.0-94) in R. This involved feature selection, hyperparameter tuning, training and testing of the models (80% of the data for training, 20% for testing), and 5-fold cross-validation. The influence of features on the prediction was analyzed through a SHAP (SHapley Additive exPlanations) analysis using the ‘kernelshap’ library (v. 0.4.1). We then plotted the computed SHAP values using the ‘shapviz’ library (v. 0.9.3) and compared them with those of the other models tested.
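To make the quantity behind a SHAP analysis concrete, the sketch below computes exact Shapley values for a single observation by enumerating all feature orderings. It is a brute-force illustration in Python, feasible only for a handful of features, and not the algorithm implemented in the R packages named above. The model `toy_model`, the observation and the background rows are all hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, background):
    """Exact Shapley values for observation x.

    Features outside a coalition take their values from the background rows,
    and the prediction is averaged over those rows.
    """
    p = len(x)

    def coalition_value(coalition):
        total = 0.0
        for ref in background:
            row = [x[j] if j in coalition else ref[j] for j in range(p)]
            total += predict(row)
        return total / len(background)

    phi = [0.0] * p
    orderings = list(permutations(range(p)))
    for ordering in orderings:
        coalition = set()
        previous = coalition_value(coalition)
        for j in ordering:
            coalition.add(j)
            current = coalition_value(coalition)
            phi[j] += current - previous  # marginal contribution of feature j
            previous = current
    return [value / len(orderings) for value in phi]

def toy_model(row):
    """Hypothetical model with one additive term and one interaction."""
    return 2.0 * row[0] + row[1] * row[2]

background = [[0.0, 1.0, 1.0], [2.0, 1.0, 3.0]]
x = [4.0, 3.0, 2.0]

phi = shapley_values(toy_model, x, background)
baseline = sum(toy_model(row) for row in background) / len(background)
print(phi, baseline, toy_model(x))
```

The values sum to the difference between the prediction for `x` and the average background prediction, which is the additivity property that makes SHAP values comparable across different model families.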
We started implementing further algorithms, such as deep neural networks (DNN), for the prediction of cognitive performance in the Moli-sani cohort. We explored different combinations of hyperparameters (e.g. number of nodes, number of layers, batch size, regularization parameters, learning function, dropout rate) through grid search and random search, using the ‘caret’ package combined with in-house functions built to ease the comparison of prediction performances.
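The difference between the two search strategies can be shown with a short Python sketch. It is only a schematic example: the search space is invented, and `toy_validation_loss` is a hypothetical stand-in for training a network with a given configuration and returning its validation error.

```python
import itertools
import random

def grid_search(space, objective):
    """Evaluate every combination of the listed hyperparameter values."""
    names = list(space)
    best_config, best_loss = None, float("inf")
    for values in itertools.product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

def random_search(space, objective, n_trials, seed=0):
    """Evaluate a fixed number of randomly drawn combinations."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss

def toy_validation_loss(config):
    """Hypothetical loss surface with its minimum at 2 layers, 64 nodes, dropout 0.2."""
    return ((config["n_layers"] - 2) ** 2
            + ((config["n_nodes"] - 64) / 64) ** 2
            + 10 * (config["dropout"] - 0.2) ** 2)

# Invented search space, for illustration only.
space = {
    "n_layers": [1, 2, 3, 4],
    "n_nodes": [16, 32, 64, 128],
    "dropout": [0.0, 0.2, 0.5],
}

grid_config, grid_loss = grid_search(space, toy_validation_loss)        # 48 evaluations
random_config, random_loss = random_search(space, toy_validation_loss, n_trials=10)
print(grid_config, grid_loss)
print(random_config, random_loss)
```

Grid search is exhaustive, so its cost grows multiplicatively with every added hyperparameter, whereas random search has a fixed evaluation budget but no guarantee of visiting the best combination.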
We aim to refine the hyperparameter tuning of DNN models through Bayesian approaches, using the ‘rBayesianOptimization’ package (v. 1.2.1) in R. This is a crucial aspect for DNNs, since their performance depends strongly on the combination of hyperparameters. Bayesian optimization uses surrogate models (such as Gaussian processes) to approximate the true objective function, and is expected to be more efficient than, and to outperform, other optimization strategies (grid and random searches).
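The surrogate-model idea can be sketched for a single hyperparameter as follows. This is a simplified Python illustration, not the ‘rBayesianOptimization’ implementation: it assumes a Gaussian process with a fixed RBF kernel, a lower-confidence-bound acquisition rule, a discrete set of candidate values, and a hypothetical objective `toy_validation_loss` standing in for the validation error of a trained network.

```python
import math

def rbf(a, b, length_scale=0.2):
    """Squared-exponential kernel with unit variance."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def gp_posterior(xs, ys, x_new, jitter=1e-6):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    mean_y = sum(ys) / len(ys)
    K = [[rbf(a, b) + (jitter if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k_star = [rbf(a, x_new) for a in xs]
    weights = solve(K, k_star)  # K^{-1} k_star
    mu = mean_y + sum(w * (y - mean_y) for w, y in zip(weights, ys))
    var = rbf(x_new, x_new) - sum(w * k for w, k in zip(weights, k_star))
    return mu, math.sqrt(max(var, 0.0))

def bayesian_optimize(objective, candidates, initial, n_iter=10, kappa=2.0):
    """Minimize objective by repeatedly evaluating the lowest confidence bound."""
    xs = list(initial)
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        remaining = [c for c in candidates if c not in xs]

        def lower_bound(c):
            mu, sigma = gp_posterior(xs, ys, c)
            return mu - kappa * sigma  # optimistic estimate of the loss

        next_x = min(remaining, key=lower_bound)
        xs.append(next_x)
        ys.append(objective(next_x))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best], xs

def toy_validation_loss(dropout):
    """Hypothetical objective with its minimum at a dropout rate of 0.3."""
    return 4.0 * (dropout - 0.3) ** 2 + 0.1

candidates = [i / 50 for i in range(51)]
best_rate, best_loss, evaluated = bayesian_optimize(
    toy_validation_loss, candidates, initial=[0.0, 0.5, 1.0])
print(best_rate, round(best_loss, 4), len(evaluated))
```

Each iteration spends one expensive evaluation where the surrogate is either promising (low mean) or uncertain (high standard deviation); `kappa` controls that trade-off. In practice the kernel parameters would be estimated from the data rather than fixed as they are here.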
In the frame of task 3.1, we started to measure biomarkers of systemic/neuro-inflammation in the subset of the Moli-sani cohort recalled between 2017 and 2020. So far, about 1,000 samples have been analyzed for GDF15, FGF21 and NFL through the Simple Plex ELLA platform (Ella™). Moreover, we are completing DNA extraction from the stored buffy coats of the same population for methylation-wide analyses.
The first polygenic risk score for Alzheimer's disease (AD-PRS) was trained using the SBayesRC method proposed by Zheng et al. (2024; doi: 10.1038/s41588-024-01704-y). This method leverages functional genomic annotations to refine signals from genome-wide association study (GWAS) summary statistics. By considering both the likelihood of a variant being causal and its effect size, SBayesRC has been reported to improve prediction accuracy compared with other methods. The PRS was trained on one of the largest published GWASs of AD, including 111,326 clinically diagnosed/‘proxy’ Alzheimer's disease cases and 677,663 controls (Bellenguez et al. 2022; doi: 10.1038/s41588-022-01024-z). This resulted in a PRS based on 7,243,470 SNPs, computed for the 23,374 participants with genome-wide array data available after QC and imputation. This score will be included among the predictors of continuous cognitive performance (MoCA score) in machine learning models.
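Once per-SNP weights have been estimated, an individual score is in general a weighted sum of allele dosages. The minimal Python sketch below shows that scoring step only. The SNP identifiers, weights and dosages are invented for illustration; missing genotypes are treated as a dosage of zero, which is a simplification, and the final standardization is an assumption about how the score might be scaled before modelling.

```python
from statistics import mean, pstdev

def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages (0 to 2) over the scored SNPs."""
    return sum(weight * dosages.get(snp, 0.0) for snp, weight in weights.items())

def standardize(scores):
    """Convert raw scores to z-scores across participants."""
    centre, spread = mean(scores.values()), pstdev(scores.values())
    return {pid: (value - centre) / spread for pid, value in scores.items()}

# Hypothetical per-SNP weights (in a real analysis, the method's posterior effects).
weights = {"snp_A": 0.10, "snp_B": -0.05, "snp_C": 0.20}

# Hypothetical imputed dosages for three participants.
genotypes = {
    "p1": {"snp_A": 2.0, "snp_B": 1.0, "snp_C": 0.0},
    "p2": {"snp_A": 0.0, "snp_B": 0.0, "snp_C": 1.0},
    "p3": {"snp_A": 1.0, "snp_B": 2.0, "snp_C": 2.0},
}

raw_scores = {pid: polygenic_score(d, weights) for pid, d in genotypes.items()}
z_scores = standardize(raw_scores)
print(raw_scores)
print(z_scores)
```

A standardized score of this kind can be entered as a single continuous predictor alongside the other features.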
Ghulam A, Bonaccio M, Gianfagna F, Costanzo S, Di Castelnuovo A, Gialluisi A, Cerletti C, Donati MB, de Gaetano G, Iacoviello L; Moli-sani Investigators. Association of perceived mental health with mortality, and analysis of potential pathways in Italian men and women: Prospective results from the Moli-sani Study cohort. J Affect Disord. 2024 Sep 1;360:403-411. doi: 10.1016/j.jad.2024.05.114.
Quiccione MS, Tirozzi A, Cassioli G, et al. Are Methylation Patterns in the KALRN Gene Associated with Cognitive and Depressive Symptoms? Findings from the Moli-sani Cohort. Int J Mol Sci. 2024 Sep 25;25(19):10317. doi: 10.3390/ijms251910317. PMID: 39408648.
Bracone F, Gialluisi A, Bonaccio M, et al. Exploring the Correlation Between Pro-Inflammatory Dietary and Lifestyle Patterns and Cognitive Function: Cross-sectional results from the Moli-sani Study. Submitted to: Journal of Affective Disorders (July 2024).