Prominent Statistician Opens Dean’s Speaker Series

August 21, 2023

Professor Bhramar Mukherjee provides insight into challenges, biases, and risks associated with big data in electronic health records research

Big data has transformed the landscape of modern public health research. Whether through electronic health records, wearables, or genomic data, we have access to vast repositories of information.

But there are limits to its applications and researchers need to pay more attention to potential biases – especially those that do not diminish with increased sample size – that can weaken a study’s validity and limit generalizations of inference. This is especially important when it comes to population-based scientific research.

That was the key message delivered by Professor Bhramar Mukherjee, a prominent statistician from the University of Michigan, who presented the inaugural lecture for a new Dean’s speaker series at the Yale School of Public Health called “Leaders in Public Health.”

YSPH Dean Megan L. Ranney, MD, called Mukherjee a “pioneer” who “embodies every aspect of public health leadership.” A member of the National Academy of Medicine whose scholarship has been cited nearly 16,000 times, Mukherjee, Ranney said, is noted for her “methodological rigor, passion to solve real problems, unbelievable dedication to education and mentorship, and most of all, her commitment to doing work that matters and creates change.”

Mukherjee is the John D. Kalbfleisch Collegiate Professor and chair of biostatistics at the University of Michigan School of Public Health. She also is a professor of epidemiology and global health and associate director for quantitative data sciences at Michigan’s Rogel Cancer Center.

The title of Mukherjee’s August 14 address was “Analysis of ‘Big’ Real-World Health Care Data: Promises and Perils.”

Speaking to a capacity crowd in YSPH’s main science building, plus many joining online, Mukherjee emphasized the importance of collaboration between scientists, statisticians, and other specialists when it comes to designing studies and building vast and inclusive data sets.

“I really believe in inclusive data science where we all are trying to solve a practical problem together,” she said. “I definitely feel that statisticians have a lot to contribute, as an integral part of a magnificent research team and where the science is driving our work.”

Having a highly skilled team that can identify and address potential sampling errors, selection bias, and mistakes in classifying medical phenotypes and downstream associations, among other issues, is critically important, she said, especially because larger sample sizes can give a false sense of precision. This paradox is important to recognize as big data resources like electronic health records are increasingly used as a basis for inferences in population-based research and for driving policy.
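
The paradox can be illustrated with a small simulation. The sketch below is not from Mukherjee's lecture; the 20% true prevalence, the care-seeking probabilities, and the function name `simulate_ehr_prevalence` are all hypothetical values chosen for illustration. It assumes that people with a condition are more likely to appear in the records than people without it, and shows the standard error shrinking as the sample grows while the estimate stays far from the truth.

```python
import random

# Hypothetical values chosen only for illustration.
TRUE_PREVALENCE = 0.20   # share of the population with the condition
P_RECORD_IF_SICK = 0.60  # assumed chance a sick person appears in the records
P_RECORD_IF_WELL = 0.20  # assumed chance a healthy person appears in the records


def simulate_ehr_prevalence(n, seed=0):
    """Draw n records from a non-probability sample driven by care-seeking.

    Returns the naive prevalence estimate and its naive standard error.
    """
    rng = random.Random(seed)
    cases = 0
    collected = 0
    while collected < n:
        sick = rng.random() < TRUE_PREVALENCE
        p_in_records = P_RECORD_IF_SICK if sick else P_RECORD_IF_WELL
        # A person enters the data set only if they sought care.
        if rng.random() < p_in_records:
            collected += 1
            cases += sick
    estimate = cases / n
    std_error = (estimate * (1 - estimate) / n) ** 0.5
    return estimate, std_error


for n in (1_000, 10_000, 100_000):
    estimate, std_error = simulate_ehr_prevalence(n, seed=1)
    print(f"n={n:>7}  estimate={estimate:.3f}  std_error={std_error:.4f}  "
          f"truth={TRUE_PREVALENCE:.2f}")
```

Under these assumed probabilities, the naive estimate converges to 0.12 / 0.28, or roughly 0.43, rather than 0.20. A larger sample only narrows the interval around the wrong number.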

“Who is in your study is incredibly important. The cohorts you are training your algorithms on determine who you can generalize the results to. Even the fanciest AI cannot rescue you when you're training your algorithms on exclusionary and incorrect datasets,” Mukherjee said. “To formulate a problem, to understand what your target of inference is, you should always see a statistician before you do a study, during your study, and when you do your analysis.”

Mukherjee called electronic health records both “fascinating” in their potential and “frustrating” in the challenges they create for scientists and statisticians alike. Electronic health records, she said, are non-probability samples that are not designed for population-based research. The records are not structured and can include everything from prescription information and lab results to treatments and clinical notes. As a result, the data is driven by patients’ health care-seeking behavior and access to health care.

One way of enhancing the accuracy and reliability of studies using electronic health records is to go beyond standard statistical analysis based on just internal data and leverage more external data sources and assumption-lean models, Mukherjee said.

To reduce bias, she suggested conducting an in-depth retrospective chart review on a study subsample, collecting more data from a randomized subsample, applying data to negative controls, and generating frameworks to conduct sensitivity analysis.

“You have to really appeal to your classical notions of biostatistics and epidemiology … otherwise it’s going to be very difficult to make sense of this data,” she said.

As an example of building a strong data resource for interdisciplinary health research, Mukherjee cited the Michigan Genomics Initiative, which the University of Michigan launched 10 years ago. The initiative has so far collected biospecimen samples and electronic health information from more than 100,000 people who agreed to participate. Recognizing that electronic health records and genetic health information do not fully codify a person’s existence, the robust database is augmented with information on other social determinants of health such as a patient’s family history, lifestyle, and the environment they live in, Mukherjee said.

This integrated data ecosystem, or what Mukherjee referred to as a “data quilt,” came in handy when COVID-19 became an emergent threat. Study participants had agreed to be contacted for additional information when necessary. The system allowed researchers to mine the data in real time and provide frontline health care workers with important information to assist with patient management and community response.

Despite researchers’ best efforts, bias can never be fully eradicated, Mukherjee said. But it can be significantly reduced, and researchers should be honest about the limitations of their work.

“I think every analyst using real-world health care data should be required to give a declaration of what the potential sources of bias are in their current analysis so that we are honest and transparent as data scientists,” she said. “We should not make tall order claims based on imperfect data.”

****

The Leaders in Public Health speaker series is intended to highlight extraordinary people and thought leaders in the field of public health. Future speaker events are planned for the entire academic year.

“I have seen over and over again over the course of my career the power that gets unleashed when you bring folks together,” Dean Ranney said. (More than 150 people attended Mukherjee’s lecture in person. Another 100 followed along online.)

“As we pioneer together a new era in our school's independence, we are going to welcome a series of folks to come and join us – to engage in discussions about scientific rigor, about diversity, equity, and inclusion, about ways in which the work that we do within the bounds of this school can affect the health of the world far beyond here at 60 College Street, beyond the walls of Yale, whether within New Haven, Connecticut, the nation, or the globe.”