Bhramar Mukherjee, PhD

bama

Perform mediation analysis in the presence of high-dimensional mediators based on the potential outcome framework. Bayesian Mediation Analysis (BAMA), developed by Song et al (2019) <doi:10.1111/biom.13189> and Song et al (2020) <doi:10.48550/arXiv.2009.11409>, relies on two Bayesian sparse linear mixed models to simultaneously analyze a relatively large number of mediators for a continuous exposure and outcome assuming a small number of mediators are truly active. This sparsity assumption also allows the extension of univariate mediator analysis by casting the identification of active mediators as a variable selection problem and applying Bayesian methods with continuous shrinkage priors on the effects.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / bama package

Platform: R

Reference: doi.org (bama)


CompMix

Quantitative characterization of the health impacts associated with exposure to chemical mixtures has received considerable attention in current environmental and epidemiological studies. The 'CompMix' package allows practitioners to estimate the health impacts of exposure to chemical mixtures through various statistical approaches, including Lasso, Elastic net, Bayesian kernel machine regression (BKMR), hierNet, Quantile g-computation, Weighted quantile sum (WQS) and Random forest. Hao W, Cathey A, Aung M, Boss J, Meeker J, Mukherjee B. (2024) "Statistical methods for chemical mixtures: a practitioners guide". <doi:10.1101/2024.03.03.24303677>.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / CompMix package

Platform: R

Reference: doi.org (CompMix)


CIMPLE

Analyzes longitudinal Electronic Health Record (EHR) data with possibly informative observational time. These methods are grouped into two classes depending on the inferential task. One group focuses on estimating the effect of an exposure on a longitudinal biomarker while the other group assesses the impact of a longitudinal biomarker on time-to-diagnosis outcomes. The accompanying paper is Du et al (2024) <doi:10.48550/arXiv.2410.13113>.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / CIMPLE package

Platform: R

Reference: doi.org (CIMPLE)


EHRmuse

Analyzes longitudinal Electronic Health Record (EHR) data with possibly informative observational time. These methods are grouped into two classes depending on the inferential task. One group focuses on estimating the effect of an exposure on a longitudinal biomarker while the other group assesses the impact of a longitudinal biomarker on time-to-diagnosis outcomes. The accompanying paper is Du et al (2024) <doi:10.48550/arXiv.2410.13113>.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / EHRmuse package

Platform: R

Reference: doi.org (EHRmuse)


hdmed

A suite of functions for performing mediation analysis with high-dimensional mediators. In addition to centralizing code from several existing packages for high-dimensional mediation analysis, we provide organized, well-documented functions for a handful of methods which, though programmed by their original authors, have not previously been formalized into R packages or been made presentable for public use. The methods we include cover a broad array of approaches and objectives, and are described in detail by both our companion manuscript, "Methods for Mediation Analysis with High-Dimensional DNA Methylation Data: Possible Choices and Comparison", and the original publications that proposed them. The specific methods offered by our package include the Bayesian sparse linear mixed model (BSLMM) by Song et al. (2019); high-dimensional mediation analysis (HDMA) by Gao et al. (2019); high-dimensional multivariate mediation (HDMM) by Chén et al. (2018); high-dimensional linear mediation analysis (HILMA) by Zhou et al. (2020); high-dimensional mediation analysis (HIMA) by Zhang et al. (2016); latent variable mediation analysis (LVMA) by Derkach et al. (2019); mediation by fixed-effect model (MedFix) by Zhang (2021); pathway LASSO by Zhao & Luo (2022); principal component mediation analysis (PCMA) by Huang & Pan (2016); and sparse principal component mediation analysis (SPCMA) by Zhao et al. (2020). Citations for the corresponding papers can be found in their respective functions.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / hdmed package

Platform: R

Reference: doi.org (hdmed)


LGEWIS

Functions for genome-wide association studies (GWAS)/gene-environment-wide interaction studies (GEWIS) with longitudinal outcomes and exposures. He et al. (2017) "Set-Based Tests for Gene-Environment Interaction in Longitudinal Studies" and He et al. (2017) "Rare-variant association tests in longitudinal studies, with an application to the Multi-Ethnic Study of Atherosclerosis (MESA)".

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / LGEWIS package

Platform: R

Reference: doi.org (LGEWIS)


lodi

Impute observed values below the limit of detection (LOD) via censored likelihood multiple imputation (CLMI) in single-pollutant models, developed by Boss et al (2019) <doi:10.1097/EDE.0000000000001052>. CLMI handles exposure detection limits that may change throughout the course of exposure assessment. 'lodi' provides functions for imputing and pooling for this method.
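
The core idea of imputing below a detection limit can be sketched in a few lines. The Python example below is only an illustration and is not the 'lodi' API: the function name `impute_below_lod` and all of its parameters are hypothetical, the values are assumed to be on a scale where a normal distribution is plausible (for example, log concentrations), and `mu` and `sigma` are passed in directly, whereas CLMI would estimate the distribution from a censored likelihood. Each non-detect is drawn from the normal distribution truncated above at its own limit of detection, so the limits may differ between observations.

```python
import random
import sys
from statistics import NormalDist


def impute_below_lod(values, lods, mu, sigma, n_imputations=5, seed=0):
    """Return n_imputations completed copies of `values`.

    values: observed measurements, with None marking a non-detect.
    lods:   limit of detection for each observation (may vary).
    mu, sigma: ASSUMED known parameters of the normal model; a real
               analysis would estimate them from the censored data.
    """
    rng = random.Random(seed)
    dist = NormalDist(mu, sigma)
    completed = []
    for _ in range(n_imputations):
        filled = []
        for value, lod in zip(values, lods):
            if value is None:
                # Inverse-CDF draw from the normal truncated above at lod.
                upper = dist.cdf(lod)
                u = max(rng.random() * upper, sys.float_info.min)
                filled.append(dist.inv_cdf(u))
            else:
                filled.append(value)
        completed.append(filled)
    return completed


if __name__ == "__main__":
    data = [1.2, None, 0.8, None]
    limits = [0.5, 0.5, 0.3, 0.3]
    for dataset in impute_below_lod(data, limits, mu=1.0, sigma=0.5, n_imputations=3):
        print(dataset)
```

Each completed dataset would then be analyzed separately and the results pooled, which is the step the package's pooling functions address.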

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / lodi package

Platform: R

Reference: doi.org (lodi)


medScan

A collection of methods for large scale single mediator hypothesis testing. The six included methods for testing the mediation effect are Sobel's test, Max P test, joint significance test under the composite null hypothesis, high dimensional mediation testing, divide-aggregate composite null test, and Sobel's test under the composite null hypothesis. Du et al (2023) <doi:10.1002/gepi.22510>.
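
Two of the simpler tests named above can be written down directly. The Python sketch below is an illustration only and does not use or reproduce the 'medScan' interface: the function names are hypothetical, and the inputs are assumed to be the estimated exposure-mediator effect `alpha` and mediator-outcome effect `beta` with their standard errors, each treated as approximately normal. Sobel's test uses the first-order delta-method standard error of the product, and the Max P test takes the larger of the two component p-values.

```python
import math


def normal_two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def sobel_test(alpha, se_alpha, beta, se_beta):
    """Sobel's test of the mediation effect alpha * beta.

    Returns the z statistic and its two-sided p-value, using the
    first-order delta-method standard error of the product.
    """
    se_product = math.sqrt(alpha**2 * se_beta**2 + beta**2 * se_alpha**2)
    z = alpha * beta / se_product
    return z, normal_two_sided_p(z)


def max_p_test(alpha, se_alpha, beta, se_beta):
    """Max P test: the larger of the two component p-values."""
    p_alpha = normal_two_sided_p(alpha / se_alpha)
    p_beta = normal_two_sided_p(beta / se_beta)
    return max(p_alpha, p_beta)


if __name__ == "__main__":
    z, p = sobel_test(0.5, 0.1, 0.4, 0.1)
    print(f"Sobel z = {z:.3f}, p = {p:.4g}")
    print(f"Max P   = {max_p_test(0.5, 0.1, 0.4, 0.1):.4g}")
```

In a large-scale scan, functions like these would be applied once per candidate mediator, followed by a multiple-testing correction.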

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / medScan package

Platform: R

Reference: doi.org (medScan)


miselect

Penalized regression methods, such as lasso and elastic net, are used in many biomedical applications when simultaneous regression coefficient estimation and variable selection are desired. However, missing data complicates the implementation of these methods, particularly when missingness is handled using multiple imputation. Applying a variable selection algorithm on each imputed dataset will likely lead to different sets of selected predictors, making it difficult to ascertain a final active set without resorting to ad hoc combination rules. 'miselect' presents Stacked Adaptive Elastic Net (saenet) and Grouped Adaptive LASSO (galasso) for continuous and binary outcomes, developed by Du et al (2022) <doi:10.1080/10618600.2022.2035739>. By construction, these methods force selection of the same variables across multiply imputed data. 'miselect' also provides cross-validated variants of these methods.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / miselect package

Platform: R

Reference: doi.org (miselect)


PRSweb

To facilitate scientific collaboration on polygenic risk scores (PRSs) research, we created an extensive PRS online repository for 35 common cancer traits integrating freely available genome-wide association studies (GWASs) summary statistics from three sources: published GWASs, the NHGRI-EBI GWAS Catalog, and UK Biobank-based GWASs. Our framework condenses these summary statistics into PRSs using various approaches such as linkage disequilibrium pruning/p value thresholding (fixed or data-adaptively optimized thresholds) and penalized, genome-wide effect size weighting. We evaluated the PRSs in two biobanks: the Michigan Genomics Initiative (MGI), a longitudinal biorepository effort at Michigan Medicine, and the population-based UK Biobank (UKB). For each PRS construct, we provide measures on predictive performance and discrimination. Besides PRS evaluation, the Cancer-PRSweb platform features construct downloads and phenome-wide PRS association study results (PRS-PheWAS) for predictive PRSs. We expect this integrated platform to accelerate PRS-related cancer research.
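
The p-value thresholding construction mentioned above reduces to a weighted sum, which the following Python sketch illustrates. It is not code from PRSweb: the function name, argument names, and default threshold are assumptions made for the example, and linkage disequilibrium pruning is assumed to have been applied to the variant list beforehand. For one individual, the score adds up allele dosage times GWAS effect size over the variants whose association p-value passes the threshold.

```python
def polygenic_risk_score(dosages, weights, p_values, p_threshold=5e-8):
    """Weighted allele-dosage sum over variants with p-value <= p_threshold.

    dosages:  risk-allele counts (0, 1, 2, or imputed dosages) for one person.
    weights:  per-variant effect sizes from GWAS summary statistics.
    p_values: per-variant association p-values from the same GWAS.
    The variant list is ASSUMED to be LD-pruned already.
    """
    if not (len(dosages) == len(weights) == len(p_values)):
        raise ValueError("dosages, weights and p_values must have equal length")
    return sum(
        weight * dosage
        for dosage, weight, p_value in zip(dosages, weights, p_values)
        if p_value <= p_threshold
    )


if __name__ == "__main__":
    dosages = [0, 1, 2]
    weights = [0.2, -0.1, 0.3]
    p_values = [1e-9, 1e-3, 1e-10]
    print(polygenic_risk_score(dosages, weights, p_values))
    print(polygenic_risk_score(dosages, weights, p_values, p_threshold=1.0))
```

A data-adaptive threshold would correspond to repeating this calculation over a grid of `p_threshold` values and keeping the one with the best predictive performance in a tuning sample.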

Faculty: Bhramar Mukherjee, PhD

Download: PRSweb package

Platform: R Shiny

Reference: doi.org (PRSweb)


SAMBA

Health research using data from electronic health records (EHR) has gained popularity, but misclassification of EHR-derived disease status and lack of representativeness of the study sample can result in substantial bias in effect estimates and can impact power and type I error for association tests. Here, the assumed target of inference is the relationship between binary disease status and predictors modeled using a logistic regression model. 'SAMBA' implements several methods for obtaining bias-corrected point estimates along with valid standard errors as proposed in Beesley and Mukherjee (2020) <doi:10.1101/2019.12.26.19015859>, currently under review.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / SAMBA package

Platform: R

Reference: doi.org (SAMBA)


snif

snif is an R package that implements "Selection of Nonlinear Interactions by a Forward Stepwise Algorithm". snif is still being tested and polished, and as such it is BETA software.

Faculty: Bhramar Mukherjee, PhD

Download: GitHub / snif package

Platform: R

Reference: doi.org (snif)


subgxe

Classical methods for combining summary data from genome-wide association studies (GWAS) only use marginal genetic effects, and power can be compromised in the presence of heterogeneity. 'subgxe' is an R package that implements p-value assisted subset testing for association (pASTA), a method developed by Yu et al. (2019) <doi:10.1159/000496867>. pASTA generalizes association analysis based on subsets by incorporating gene-environment interactions into the testing procedure.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / subgxe package

Platform: R

Reference: doi.org (subgxe)


SynDI

There is a growing need for flexible general frameworks that integrate individual-level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithm they used to predict the outcome Y given these predictors may or may not be known. The underlying populations corresponding to each external model may be different from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation-based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population and uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.

Faculty: Bhramar Mukherjee, PhD

Download: GitHub / SynDI package

Platform: R

Reference: doi.org (SynDI)


synthEHRella

A Python package for synthetic Electronic Health Records (EHR) data generation benchmarking.

Faculty: Bhramar Mukherjee, PhD

Download: GitHub / synthEHRella package

Platform: Python

Reference: doi.org (synthEHRella)


svycdiff

Estimates the population average controlled difference for a given outcome between levels of a binary treatment, exposure, or other group membership variable of interest for clustered, stratified survey samples where sample selection depends on the comparison group. Provides three methods for estimation, namely outcome modeling and two factorizations of inverse probability weighting. Under stronger assumptions, these methods estimate the causal population average treatment effect. Salerno et al. (2024) <doi:10.48550/arXiv.2406.19597>.

Faculty: Bhramar Mukherjee, PhD

Download: Cran R / svycdiff package

Platform: R

Reference: doi.org (svycdiff)