Big data has transformed the landscape of modern public health research. Whether through electronic health records, wearables, or genomic data, we have access to vast repositories of information.
But there are limits to its applications, and researchers need to pay more attention to potential biases, especially those that do not diminish with increased sample size, that can weaken a study’s validity and limit the generalizability of its inferences. This is especially important when it comes to population-based scientific research.
That was the key message delivered by Professor Bhramar Mukherjee, a prominent statistician from the University of Michigan, who presented the inaugural lecture for a new Dean’s speaker series at the Yale School of Public Health called “Leaders in Public Health.”
YSPH Dean Megan L. Ranney, MD, called Mukherjee a “pioneer” who “embodies every aspect of public health leadership.” A member of the National Academy of Medicine whose scholarship has been cited nearly 16,000 times, Mukherjee, Ranney said, is noted for her “methodological rigor, passion to solve real problems, unbelievable dedication to education and mentorship, and most of all, her commitment to doing work that matters and creates change.”
Mukherjee is the John D. Kalbfleisch Collegiate Professor and chair of biostatistics at the University of Michigan School of Public Health. She also is a professor of epidemiology and global health and associate director for quantitative data sciences at Michigan’s Rogel Cancer Center.
The title of Mukherjee’s August 14 address was “Analysis of ‘Big’ Real-World Health Care Data: Promises and Perils.”
Speaking to a capacity crowd in YSPH’s main science building, plus many joining online, Mukherjee emphasized the importance of collaboration between scientists, statisticians, and other specialists when it comes to designing studies and building vast and inclusive data sets.
“I really believe in inclusive data science where we all are trying to solve a practical problem together,” she said. “I definitely feel that statisticians have a lot to contribute, as an integral part of a magnificent research team and where the science is driving our work.”
Having a highly skilled team that can identify and address potential sampling errors, selection bias, and mistakes in classifying medical phenotypes and downstream associations, among other issues, is critically important, she said, especially because larger sample sizes can give a false sense of precision. This paradox is important to recognize as big data resources like electronic health records are increasingly used as a basis for inferences in population-based research and to drive policy.
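The paradox described above can be illustrated with a small simulation. The sketch below is not drawn from Mukherjee's lecture; it assumes a hypothetical health measure with a true population mean of zero and an invented selection rule under which people with higher values are more likely to appear in the data set. Under those assumptions, the confidence interval narrows as the sample grows, while the estimate stays just as far from the truth.

```python
import math
import random


def biased_sample_mean(n, rng):
    """Collect n selected records and return (sample mean, 95% CI half-width).

    Assumption for illustration only: the population values are standard
    normal (true mean 0), and the chance of a record being captured rises
    with its value through a logistic curve.
    """
    values = []
    while len(values) < n:
        y = rng.gauss(0.0, 1.0)                 # draw from the full population
        p_select = 1.0 / (1.0 + math.exp(-y))   # hypothetical selection mechanism
        if rng.random() < p_select:             # only some people enter the data
            values.append(y)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)      # naive interval, ignores selection
    return mean, half_width


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {n: biased_sample_mean(n, rng) for n in (100, 10_000, 100_000)}

for n, (mean, half_width) in results.items():
    print(f"n={n:>7}: estimate={mean:.3f} +/- {half_width:.3f} (true mean is 0)")
```

In a run of this sketch, the interval around the estimate tightens with each increase in sample size, yet it tightens around the wrong value. More data improves precision but does nothing to correct who ended up in the data.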