Publications
1.
Mao, Jialin; Goodney, Philip; Banerjee, Samprit; Kostic, Zoran; Smolderen, Kim; Mena-Hurtado, Carlos; Matheny, Michael E.
Neural network models for predicting readmission among patients undergoing peripheral vascular intervention using electronic health record data and clinical registry data Journal Article
In: BMJ Surgery, Interventions, & Health Technologies, vol. 7, iss. 1, pp. e000387, 2025.
Tags: methodology, outcomes research, real-world data, vascular devices
@article{Mao2025,
title = {Neural network models for predicting readmission among patients undergoing peripheral vascular intervention using electronic health record data and clinical registry data},
author = {Jialin Mao and Philip Goodney and Samprit Banerjee and Zoran Kostic and Kim Smolderen and Carlos Mena-Hurtado and Michael E. Matheny},
doi = {10.1136/bmjsit-2025-000387},
year = {2025},
date = {2025-06-26},
journal = {BMJ Surgery, Interventions, \& Health Technologies},
volume = {7},
issue = {1},
pages = {e000387},
abstract = {Objectives: To determine whether neural network models based on electronic health record (EHR) data can match and augment the performance of models based on clinical registry data in predicting readmission after peripheral vascular intervention (PVI).
Design: Observational cohort study.
Setting: Vascular Quality Initiative registry and INSIGHT Clinical Research Network EHR data from multiple academic institutions in New York City.
Participants: Patients undergoing PVI during January 1, 2013 to September 30, 2021.
Main outcome measures: Our outcome variable was 90-day readmission. We developed logistic regression (LR), multilayer perceptron (MLP), and recurrent neural network (RNN) models using registry alone, EHR data alone, and combined registry-EHR data. EHR data were evaluated using derived variables to match registry variables (EHR-derived data) and clinically meaningful code aggregation (EHR-direct data). Models were evaluated using area under the curve (AUC) for discrimination, Spiegelhalter z score for calibration, and Brier score for overall performance.
Results: The analytical cohort included 2348 patients undergoing PVI (mean age: 69.9±11.5 years). 832 (35%) patients were readmitted within 90 days. LR to predict 90-day readmission based on registry data alone had an AUC of 0.710, Spiegelhalter z score of 1.021, and Brier score of 0.211. MLP based on registry data alone had similar performance. MLP and RNN based on EHR-direct data (MLP: AUC=0.742, Spiegelhalter z=0.933, Brier=0.204; RNN: AUC=0.737, Spiegelhalter z=1.026, Brier=0.206) and registry+EHR-direct data (MLP: AUC=0.756, Spiegelhalter z=0.794, Brier=0.199; RNN: AUC=0.751, Spiegelhalter z=1.057, Brier=0.200) had improved performances. LR based on EHR-direct data and combined registry+EHR-direct data had worse performances.
Conclusions: EHR data, when used with neural network models, can be useful to establish readmission predictive models or augment clinical registry data. EHR-based models can be potentially embedded in the clinical workflow, but model performance may be constrained by the absence of certain information in clinical encounters, such as social determinants of health.},
keywords = {methodology, outcomes research, real-world data, vascular devices},
pubstate = {published},
tppubtype = {article}
}
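The three evaluation metrics named in the abstract above (AUC, Spiegelhalter z score, Brier score) can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy labels and probabilities are hypothetical, and the formulas are the standard textbook definitions.

```python
import math

def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)

def spiegelhalter_z(y, p):
    """Spiegelhalter's calibration statistic; values near 0 suggest good calibration."""
    num = sum((yi - pi) * (1 - 2 * pi) for yi, pi in zip(y, p))
    var = sum((1 - 2 * pi) ** 2 * pi * (1 - pi) for pi in p)
    return num / math.sqrt(var)

def auc(y, p):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 1 = readmitted within 90 days, 0 = not readmitted.
y_true = [1, 0, 1, 0, 0, 1]
p_pred = [0.8, 0.3, 0.35, 0.4, 0.2, 0.5]

print("AUC:", auc(y_true, p_pred))
print("Brier:", brier_score(y_true, p_pred))
print("Spiegelhalter z:", spiegelhalter_z(y_true, p_pred))
```

Under these definitions a higher AUC indicates better discrimination, a lower Brier score indicates better overall performance, and a Spiegelhalter z close to zero indicates little evidence of miscalibration.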
2.
Conderino, Sarah; Divers, Jasmin; Dodson, John A.; Thorpe, Lorna E.; Weiner, Mark G.; Adhikari, Samrachana
Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data Journal Article
In: Health Services Research, vol. 60, iss. 5, pp. e14649, 2025.
Tags: electronic health records, methodology
@article{Conderino2025,
title = {Evaluating Methods for Imputing Race and Ethnicity in Electronic Health Record Data},
author = {Sarah Conderino and Jasmin Divers and John A. Dodson and Lorna E. Thorpe and Mark G. Weiner and Samrachana Adhikari},
doi = {10.1111/1475-6773.14649},
year = {2025},
date = {2025-05-27},
journal = {Health Services Research},
volume = {60},
issue = {5},
pages = {e14649},
abstract = {Objective: To compare anonymized and non-anonymized approaches for imputing race and ethnicity in descriptive studies of chronic disease burden using electronic health record (EHR)-based datasets.
Study setting and design: In this New York City-based study, we first conducted simulation analyses under different missing data mechanisms to assess the performance of Bayesian Improved Surname Geocoding (BISG), single imputation using neighborhood majority information, random forest imputation, and multiple imputation with chained equations (MICE). Imputation performance was measured using sensitivity, precision, and overall accuracy; agreement with self-reported race and ethnicity was measured with Cohen's kappa (κ). We then applied these methods to impute race and ethnicity in two EHR-based data sources and compared chronic disease burden (95% CIs) by race and ethnicity across imputation approaches.
Data sources and analytic sample: Our data sources included EHR data from NYU Langone Health and the INSIGHT Clinical Research Network from 3/6/2016 to 3/7/2020 extracted for a parent study on older adults in NYC with multiple chronic conditions.
Principal findings: Under simulation analyses, the non-anonymized BISG imputation provided the most accurate classification of race and ethnicity, ranging from 66% to 73% across missing data mechanisms. Anonymized imputation methods were more sensitive to the missing data mechanism, with agreement dropping when race and ethnicity was missing not at random (MNAR) (κ single = 0.25, κ MICE = 0.25, κ randomforest = 0.33). When these methods were applied to the NYU and INSIGHT cohorts, however, racial and ethnic distributions and chronic disease burden were consistent across all imputation methods. Slight improvements in the precision of estimates were observed under all imputation approaches compared to a complete case analysis.
Conclusions: BISG imputation may provide a more accurate racial and ethnic classification than single or multiple imputation using anonymized covariates, particularly if the missing data mechanism is MNAR. Descriptive studies of disease burden may not be sensitive to methods for imputing missing data.},
keywords = {electronic health records, methodology},
pubstate = {published},
tppubtype = {article}
}
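The agreement measure named in the abstract above, Cohen's kappa (κ), compares observed agreement between two sets of labels with the agreement expected by chance. The sketch below is a minimal illustration, not the authors' code: the function name, the placeholder category labels, and the toy data are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement).

    Undefined (division by zero) when chance agreement is 1, e.g. when both
    label sets contain only a single shared category.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the marginal proportions, summed over categories.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical toy data with placeholder category labels.
self_reported = ["A", "A", "B", "B", "C", "C"]
imputed = ["A", "B", "B", "B", "C", "A"]

print("kappa:", cohens_kappa(self_reported, imputed))
```

A kappa of 1 corresponds to perfect agreement and 0 to agreement no better than chance, which is why the values of 0.25 to 0.33 reported under MNAR indicate weak agreement with self-report.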
