YSPH Biostatistics Seminar: “Generalized Bayes Calibration of Compositional Cause-specific Mortality Data from Verbal Autopsies"
October 19, 2023
Abhirup Datta, PhD, Associate Professor, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
October 17, 2023
- 00:00<v ->And welcome.</v>
- 00:05Today, it is my pleasure to introduce Professor Abhi Datta
- 00:09from Johns Hopkins University in Baltimore, Maryland.
- 00:13Professor Datta earned his BS and MS
- 00:15from the Indian Statistical Institute
- 00:17in 2008 and 2010 respectively,
- 00:20and PhD from the University of Minnesota in 2016.
- 00:25In addition to being a well-cited researcher
- 00:27with one publication that has almost 600 citations,
- 00:30which is pretty nice,
- 00:32he's also an award-winning educator,
- 00:35having repeatedly won an excellence in teaching award
- 00:37from his institution.
- 00:39So let's welcome Dr. Datta.
- 00:44<v ->Thank you, Robert,</v>
- 00:45for the invitation to come here and give the seminar,
- 00:48and for the very nice introduction.
- 00:50Thank you everyone for coming.
- 00:52My talk is about improving cause-specific mortality data
- 00:56in low and middle-income countries
- 00:58where the main tool to collect data
- 01:00is something called verbal autopsies.
- 01:02And the way I do it
- 01:03is using a statistical approach called generalized Bayes.
- 01:07If you have not heard
- 01:08of verbal autopsies or generalized Bayes,
- 01:11I can tell you that I hadn't heard of either of those things
- 01:14when I started working on the project,
- 01:17so don't worry about that,
- 01:18I try to give an introduction.
- 01:20Because I mostly work on spatial and spatiotemporal data
- 01:24and this was a project that came along,
- 01:27which is very different from what I used to work on.
- 01:29But over the years, there's been a nice body of work
- 01:31developed in this project.
- 01:35So this is a joint work
- 01:39with many different institutes and collaborators.
- 01:44The top row is the Hopkins bio stats team,
- 01:46which included my former students,
- 01:48Jacob Fiksel and Brian Gilbert,
- 01:51and my current postdoc, Sandi,
- 01:53and my colleague, Scott Zeger, and I
- 01:56lead the bio stats part of the team.
- 02:00Agbessi is the PI of the project in Mozambique
- 02:03that's sort of picked up developments for this work.
- 02:07And there are a lot of colleagues
- 02:09from the International Health Department
- 02:10that helped to collaborate.
- 02:12And then Li is the PI of a new project
- 02:16where we're going to apply our methodology
- 02:17to produce mortality estimates for the WHO.
- 02:22So we're collaborating with Li there as well.
- 02:25And then a couple of people outside Hopkins,
- 02:27Dianna at CDC and Emory University,
- 02:31who is the director of the CHAMPS project.
- 02:35And Ivalda, in the government body in Mozambique,
- 02:39is currently doing the work in Mozambique.
- 02:44So this is funded by three grants from the Gates Foundation.
- 02:49The first one was the grant that kind of started things.
- 02:52And then we have a grant that is kind of developing more
- 02:55on the methods side of the work.
- 02:59So, many low and middle-income countries
- 03:05often lack high-quality data on causes of death.
- 03:08Often for most deaths,
- 03:10there is no sort of medical certification
- 03:13or like an autopsy done.
- 03:16And without kind of high-quality data
- 03:19on what people are dying of,
- 03:21it's kind of hard to estimate the disease burden
- 03:23in these countries.
- 03:25And specifically, the quantity of interest
- 03:27is the cause-specific mortality fraction,
- 03:29which is basically the percentage of deaths in an age group
- 03:34that can be attributable to a given cause.
- 03:38So cause-specific mortality fractions
- 03:40are key pieces of information
- 03:42in determining the global burden of disease,
- 03:44which in turn dictates policy,
- 03:47as well as resource allocations
- 03:49for programs operating in these countries.
- 03:54So verbal autopsy is an alternate way
- 03:57to count deaths and attribute causes
- 03:59without actually doing a clinical autopsy.
- 04:02So verbal autopsy is basically
- 04:04a sort of a systematic interview
- 04:07of the household members of the deceased.
- 04:08So the government or the program has a set of field workers
- 04:12who go out and go from household to household
- 04:15and ask if anyone died in their household
- 04:17within the last several months.
- 04:18And if they died, what were the symptoms?
- 04:20And the set of questions they ask is now standardized
- 04:23by the WHO.
- 04:24Some example questions are here.
- 04:27Most of the questions would have binary answers
- 04:29like yes, no, but there are some questions
- 04:32that have more like continuous responses.
- 04:38So as I said, the WHO has standardized
- 04:41the verbal autopsy tool.
- 04:43The 2016 version has around 200 to 350 questions,
- 04:47depending on the age group.
- 04:48There are separate sections of the questionnaire
- 04:50for neonate, child, and adult deaths.
- 04:54And if you're interested in more information
- 04:56about verbal autopsy, there's a page on the WHO website about it.
- 05:02So a verbal autopsy, of course,
- 05:04doesn't give you a cause of death,
- 05:05it just gives you a bunch of yes-no responses
- 05:08to various questions related to the symptoms.
- 05:14So a verbal autopsy is basically a survey questionnaire.
- 05:17So you can pass that survey through a computer software
- 05:20and that can give a predictive cause of death.
- 05:23And so there are a bunch
- 05:24of different computer software available.
- 05:27InSilicoVA, developed by Tyler McCormick
- 05:31and Richard Li, who was a postdoc here,
- 05:34and published in "JASA" in 2016,
- 05:36is one of the, I think,
- 05:37most statistically-principled approaches to do it.
- 05:40But there are other approaches, because
- 05:43this is basically a classification problem.
- 05:45So you're basically given your data on symptoms,
- 05:48you're kind of classifying the cause of death
- 05:50as one of several causes.
- 05:51So you can use standard classifiers
- 05:54and machine learning approaches as well.
- 05:58OpenVA is an excellent resource
- 05:59to learn about verbal autopsies.
- 06:00Again, I think Richard
- 06:04is one of the maintainers
- 06:06and creators of openVA.
- 06:11So for the COMSA project in Mozambique,
- 06:14one of the main goals was to generate
- 06:17these cause-specific mortality fractions
- 06:21for neonates and under-five children
- 06:24for the country of Mozambique.
- 06:26And the data that we collected was a large dataset
- 06:30of verbal autopsy records
- 06:32for different households that were surveyed,
- 06:34and this is a map of Mozambique
- 06:38where the green regions show
- 06:41where the data was collected
- 06:43as part of the COMSA project.
- 06:44So in statistical terms, the data just has the symptoms,
- 06:49it doesn't have the true cause of death,
- 06:51so we call it the unlabeled data.
- 06:57So how do we go from unlabeled data to labeling
- 07:00the causes of death
- 07:01and then estimating these cause fractions?
- 07:04This is the standard procedure that is typically done
- 07:08and this is what we were supposed to do as well,
- 07:10which is simply take each record,
- 07:12pass it through the computer software
- 07:14and get a cause of death.
- 07:16And once you get a cause of death,
- 07:18then you can sort of simply aggregate.
- 07:19So in the toy example,
- 07:21three out of the six cases were assigned to be from HIV.
- 07:25And so the cause-specific mortality fraction for HIV
- 07:27would be 50% and similar for malaria and sepsis and so on.
- 07:32So that's the basic template
- 07:35of how to get cause-specific mortality fractions
- 07:38from verbal autopsies.
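As an editorial aside, the classify-and-count template just described can be sketched in a few lines of Python; the cause names and counts below are simply the talk's made-up toy example:

```python
from collections import Counter

def classify_and_count(predicted_causes):
    """Classify-and-count: turn a list of predicted causes of death
    into cause-specific mortality fractions (CSMFs)."""
    counts = Counter(predicted_causes)
    n = len(predicted_causes)
    return {cause: count / n for cause, count in counts.items()}

# Toy example: six deaths, three of which the algorithm assigns to HIV.
predictions = ["HIV", "HIV", "HIV", "Malaria", "Malaria", "Sepsis"]
csmf = classify_and_count(predictions)  # HIV -> 0.5
```

As the talk goes on to argue, this estimate is only trustworthy if the classifier's predictions can be taken at face value.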
- 07:39The question is, can we trust these estimates?
- 07:41Because these are not true causes of death
- 07:43as determined by a doctor or by a clinical procedure.
- 07:46These are cause of death predicted by an algorithm
- 07:48based on just surveying the household members
- 07:52of the deceased.
- 07:57So it turns out machine learning has a name
- 08:00for this type of problem,
- 08:01it's called quantification learning,
- 08:04which is basically estimating population prevalence
- 08:07using predicted labels instead of true labels
- 08:10and the predictions are coming from a classifier.
- 08:13And so there has been some work in quantification learning
- 08:16in the machine learning literature.
- 08:19So when we were working on this problem,
- 08:21we realized that estimating
- 08:22cause-specific mortality fractions
- 08:24using predicted cause of death data from verbal autopsy
- 08:27is an example of quantification learning.
- 08:31So just a sort of an overview of terms that we'll be using
- 08:35and the corresponding statistical notation.
- 08:37So our true cause of death is y which we do not observe.
- 08:42We want to estimate the probability
- 08:43of population prevalence of y,
- 08:45so y is a categorical variable.
- 08:49And so probability of y or p
- 08:51is our cause-specific mortality fraction,
- 08:53which is the estimand.
- 08:55We observe the verbal autopsy, which is a,
- 08:57think of this as a high dimensional
- 09:00or a long list of yes-no answers
- 09:02to the verbal autopsy questions, so that is x,
- 09:06and this x is passed through a software
- 09:08to give a predicted label, which is a of x or simply a.
- 09:17So what we have in the COMSA project
- 09:21is simply an unlabeled dataset
- 09:25which takes these verbal autopsy responses,
- 09:28passes them through a software and gets the predicted labels.
- 09:34We do not observe the true labels, y,
- 09:37we may or may not retain the verbal autopsy responses
- 09:40because those are identifiable data
- 09:42and those are often not released,
- 09:43so often, just the predicted cause of death is available.
- 09:47So even these covariates, x, may or may not be available.
- 09:50And then we are interested in estimating the probability
- 09:53that y belongs to one of the C many cause categories,
- 09:58so that's a quantity of interest.
- 10:05For some reason, there is a conditional sign
- 10:07that's missing there.
- 10:09But you can use the law of total probability
- 10:13to write the probability of the predicted cause of death,
- 10:16which is the a,
- 10:18probability of a, as a sum over y of probability of a given y
- 10:22times probability of y.
- 10:24So there's a conditional sign missing here,
- 10:26I don't know what's going on here.
- 10:32But for the COMSA data,
- 10:33we only get information on the left-hand side, right?
- 10:36And we want to infer the quantity, probability of y,
- 10:41which would be the true CSMFs.
- 10:44So there is only one known quantity,
- 10:46the left-hand side, which you can estimate.
- 10:48There are two unknown quantities on the right-hand side.
- 10:50So without making assumptions, you cannot really identify
- 10:54probability of y, right?
- 10:56So any quantification learning method
- 10:59needs to either estimate those conditional probabilities,
- 11:02probability of a given y,
- 11:04or make some assumptions on it.
- 11:08So again, all the conditional signs are missing.
- 11:16One of the most common approaches,
- 11:19and this is what is used in the verbal autopsy world
- 11:22is called classify and count,
- 11:25which is you simply predict the cause of death
- 11:28and then aggregate.
- 11:29So you're simply claiming that probability of a
- 11:33is the same as probability of y, which is equivalent to claiming
- 11:36that this misclassification rate matrix
- 11:39is an identity matrix, right?
- 11:41Because you're saying that the left hand quantity
- 11:44is the same as the rightmost quantity, which would be true
- 11:48if there is no misclassification by the algorithm
- 11:51and if the predicted cause of death
- 11:53is always the true cause of death.
- 11:56And that's what is typically done
- 11:58in these cause-specific mortality fraction estimates.
- 12:02But it's a very strong assumption, right?
- 12:04Because it amounts to assuming perfect sensitivity and specificity
- 12:07of the algorithm.
- 12:10So let's look at how perfect the algorithms are.
- 12:12So these are two algorithms,
- 12:13Tariff and InSilicoVA,
- 12:16PHMRC data is a benchmark dataset from four countries
- 12:20that has both the verbal autopsy data
- 12:22as well as a gold standard cause of death diagnosis.
- 12:26And you can see the accuracy of either method
- 12:30is around 30%, so they're far from being
- 12:33like fully accurate.
- 12:36So there are large misclassification rates
- 12:39for these algorithms, and if you don't adjust
- 12:42for these misclassifications,
- 12:44the burden estimates,
- 12:46the cause-specific mortality fractions you get,
- 12:48are likely going to be very biased.
- 12:54So this is where the CHAMPS project comes into play.
- 12:58So the CHAMPS is an ongoing project
- 13:00in like seven or eight countries including Mozambique,
- 13:05which is collecting data on both verbal autopsy
- 13:07and a more comprehensive cause of death procedure
- 13:11called minimally invasive tissue sampling.
- 13:14So it basically takes a tissue sample
- 13:17from the deceased person and then runs a bunch
- 13:20of pathological tests and imaging analysis
- 13:23and then gives a cause of death.
- 13:25And the MITS cause of death assignments
- 13:30have been shown to be quite accurate when you compare
- 13:33to like a full diagnostic autopsy.
- 13:36So MITS is being done in a bunch
- 13:38of different countries including Mozambique.
- 13:41And for the cases where MITS is being done,
- 13:43the verbal autopsies are also collected.
- 13:46So what you get from this CHAMPS data
- 13:48is a labeled or paired dataset
- 13:50where you have both the verbal autopsy
- 13:52as well as the MITS cause of death
- 13:54and you can pass the verbal autopsy to the software
- 13:58to get the verbal autopsy predicted cause of death.
- 14:00And then you can cross tabulate the two
- 14:02and get an estimate of the misclassification rates, right?
- 14:04Like you can say like,
- 14:06"Oh okay, so there are 10 cases
- 14:08where the MITS cause of death was HIV,
- 14:11out of those 10 cases,
- 14:12seven of them were correctly assigned to HIV
- 14:15by verbal autopsy.
- 14:16So then the sensitivity would be 70%
- 14:20and the false negative rate would be 30%, and so on."
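That cross-tabulation step can be sketched as follows; the paired labels here are hypothetical, not actual CHAMPS counts:

```python
import numpy as np

def estimate_misclassification(mits_labels, va_labels, causes):
    """Estimate M[i, j] = P(VA-predicted cause j | MITS cause i)
    by cross-tabulating paired labels and normalizing each row."""
    index = {c: i for i, c in enumerate(causes)}
    counts = np.zeros((len(causes), len(causes)))
    for y, a in zip(mits_labels, va_labels):
        counts[index[y], index[a]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# Hypothetical paired data: 10 MITS-confirmed HIV deaths, 7 of which
# the VA algorithm also calls HIV, plus 5 MITS-confirmed malaria deaths.
causes = ["HIV", "Malaria"]
mits = ["HIV"] * 10 + ["Malaria"] * 5
va = ["HIV"] * 7 + ["Malaria"] * 3 + ["HIV"] * 1 + ["Malaria"] * 4
M = estimate_misclassification(mits, va, causes)  # M[0, 0] = 0.7
```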
- 14:27So this is the broad idea of the methodology.
- 14:29So for the COMSA data, which is the unpaired data,
- 14:32you get only the verbal autopsy record
- 14:34so you can get an estimate of the predicted cause of deaths
- 14:37from the verbal autopsy.
- 14:39From the CHAMPS data, which is the paired data,
- 14:41you can get an estimate of the misclassification rates.
- 14:44And then the only unknown is the probabilities
- 14:48of the cause of death
- 14:50if you were able to do the MITS autopsy for every death.
- 14:54So then this is an equation with two knowns and one unknown
- 14:58and you can solve for it and get the calibrated estimates.
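Numerically, the "two knowns, one unknown" step amounts to inverting q = M-transpose times p. A minimal sketch with made-up numbers (the model-based version described in the talk also propagates uncertainty, which this ignores):

```python
import numpy as np

# Hypothetical misclassification matrix: row i is
# P(VA-predicted cause j | true cause i).
M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
p_true = np.array([0.5, 0.3, 0.2])  # true CSMFs, unknown in practice

q = M.T @ p_true                 # distribution of predicted causes (observable)
p_hat = np.linalg.solve(M.T, q)  # calibration: solve M^T p = q for p
```

Note that q itself, the classify-and-count answer, differs from p_true; that gap is exactly the bias the calibration removes.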
- 15:01So that's the broad idea and we do it in a model-based way.
- 15:09So here's the formal model.
- 15:11So for the COMSA dataset, the unlabeled data or U,
- 15:15we have the predicted labels, ar.
- 15:21And for the CHAMPS data,
- 15:22we have both the predicted labels from verbal autopsy, ar,
- 15:26as well as the MITS-determined labels, yr.
- 15:29And our quantity of interest is the probabilities of yr
- 15:34belonging to the different causes.
- 15:41There's a conditional sign missing here.
- 15:44But if the conditional probabilities
- 15:48are denoted by Mij, which is if the MITS cause is i,
- 15:52what is the probability that the VA-predicted cause is j?
- 15:57Then you can use a law of total probability
- 15:59to write down the marginal distribution
- 16:02of the VA-predicted cause.
- 16:03So that would be in terms of the misclassification rates
- 16:07and the marginal cause distribution of the MITS-COD.
- 16:10So that's the whole idea.
- 16:11So you can write this in terms of a matrix vector notation
- 16:15as probability of a as M transpose p
- 16:18where M is the misclassification rate matrix,
- 16:21p is the unknown quantity of interest,
- 16:24which is the probability that the cause of death
- 16:27is coming from a given cause.
- 16:31So the data model is very simple.
- 16:34For the unlabeled data,
- 16:36a follows a multinomial with this probability,
- 16:38which is coming from this law of total probability.
- 16:41And then for the labeled data,
- 16:43this is ar given yr equals i,
- 16:46it follows a multinomial with the i-th row
- 16:48of the misclassification matrix.
- 16:49So if the MITS-COD is i,
- 16:51the misclassification rates are given by the i-th row
- 16:53of the misclassification matrix,
- 16:55so it's multinomial with that probability.
- 16:59And then we've put priors on M and p
- 17:01and then we can get estimates of both M and p.
- 17:04M is a nuisance parameter, p is the parameter of interest.
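To make the data model concrete, here is a small simulation from it; all parameter values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: row i of M is P(VA-predicted cause | true cause i).
M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
p = np.array([0.5, 0.3, 0.2])  # true CSMFs

# Unlabeled (COMSA-like) data: predicted-cause counts ~ Multinomial(M^T p).
unlabeled_counts = rng.multinomial(10_000, M.T @ p)

# Labeled (CHAMPS-like) data: given true cause y = i,
# the predicted cause a follows row i of M.
y = rng.choice(3, size=500, p=p)
a = np.array([rng.choice(3, p=M[i]) for i in y])
```

For illustration the labeled y is drawn from p here, but as the talk stresses, the method only models a given y, so the labeled data's cause distribution is allowed to differ from the population's.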
- 17:10Just to carefully go over what are the assumptions here.
- 17:13The main assumption is that the misclassification rates
- 17:18of verbal autopsy given MITS
- 17:20are the same in your labeled data
- 17:23as they would be in your unlabeled data.
- 17:25This is not verifiable because we don't have
- 17:28any true cause of death in the unlabeled data,
- 17:30so it's an assumption.
- 17:33Given that the verbal autopsy
- 17:35is a function of your symptoms,
- 17:37the assumption is essentially that given a true cause,
- 17:42the probability of the symptoms is going to be the same
- 17:44in your unlabeled dataset as in your labeled dataset.
- 17:49And it's a reasonable assumption
- 17:50because if you have a given cause of death,
- 17:53it's likely that certain symptoms will appear
- 17:56and certain symptoms will not appear.
- 17:59And that is true regardless of whether the data is coming
- 18:02from the labeled set or the unlabeled set.
- 18:08We do not assume that the marginal distribution
- 18:12of the causes in the labeled CHAMPS data
- 18:16is representative of the population,
- 18:17because it is not:
- 18:20the CHAMPS project is done
- 18:21at specific hospitals in the country,
- 18:24and the distribution of causes in hospitals
- 18:28is typically not the same as the distribution
- 18:30of causes in the community.
- 18:31And we are interested
- 18:32in the cause distribution in the population.
- 18:34So there is no assumption
- 18:37that the marginal distribution of y in the labeled data
- 18:40is the same as the marginal distribution of y in the unlabeled data,
- 18:43which is our quantity of interest.
- 18:45And the reason there is no assumption
- 18:47is that we only model a given y in the labeled data.
- 18:51We never model y in the labeled data.
- 18:54So we only model the conditional,
- 18:56and the assumption is that the conditional
- 18:57misclassification rates are transportable
- 19:00from the labeled to the unlabeled side.
- 19:06So that's the main idea.
- 19:07And this was the first work we did,
- 19:09we just used this top cause prediction.
- 19:13But many of these algorithms
- 19:15are actually probabilistic in nature in the sense
- 19:17that if you look at their outputs,
- 19:18they won't give a single cause of death,
- 19:20but they will give scores to each cause.
- 19:22So for example,
- 19:24this would be a typical output of an algorithm
- 19:26for, say, six people.
- 19:28So for the first person, it will say
- 19:3370% HIV, 20% malaria, 10% sepsis and so on.
- 19:38And the standard procedure is to take the top cause,
- 19:41so for the first person, it would be HIV,
- 19:44for the second person, it will be malaria and so on.
- 19:48So that's how you get a single cause
- 19:50from a probabilistic prediction.
- 19:53So that essentially ignores sort of the scores
- 19:57assigned to the second most likely cause,
- 20:01the third most likely cause and so on.
- 20:04And if you ignore those, you can end up with a biased estimate.
- 20:09So you can see these are the CSMF estimates
- 20:12using the top cause,
- 20:14these are the CSMF estimates
- 20:15using the exact scores that are assigned
- 20:17and those are different, right?
- 20:18So when we kind of change this probabilistic output
- 20:22to a single cause output, we discard information.
- 20:30So we wanted to extend the work
- 20:32to kind of use the full set of scores and the set of scores
- 20:36can be thought of as compositional data in the sense
- 20:38that the scores sum up to one
- 20:40because it assigns 100% probability across all causes
- 20:45and then they're each non-negative.
- 20:48The issue is that for the categorical data,
- 20:51our model is based on multinomial distribution.
- 20:53And then for compositional data,
- 20:55the models are typically like Dirichlet
- 20:57or log ratio based models,
- 20:59which are very different from the multinomial distribution.
- 21:03So if we have some cases
- 21:05for which we have categorical output,
- 21:07for some, we have compositional output,
- 21:09this would lead to different models
- 21:11for different parts of the dataset.
- 21:15These Dirichlet or log-ratio models
- 21:17also do not allow zeros in the data.
- 21:20So if you have zeros or ones in the composition,
- 21:22they don't allow that.
- 21:23And these are very specific models for the data,
- 21:27which are subject to model misspecification.
- 21:29So if the data distribution does not look like a Dirichlet,
- 21:33assuming a Dirichlet
- 21:34would lead to wrong results.
- 21:41So how do we extend the multinomial framework we had
- 21:46for categorical data to compositional data?
- 21:51Again, there would be a conditional sign here.
- 21:56But the basic assumption that we had
- 21:58for the multinomial case was probability of a given y
- 22:02is the i-th row of the misclassification matrix, right?
- 22:05And for categorical data, a probability statement
- 22:10is the same as an expectation statement, right?
- 22:12So we can equivalently write this
- 22:14as expectation of a given y
- 22:16is the i-th row of M.
- 22:19The advantage of the expectation statement
- 22:20is that it's more generally applicable.
- 22:23It will not be just for categorical data, right?
- 22:27So for categorical data, it's equivalent.
- 22:30For other data types, this statement can be valid
- 22:33even though the previous statement may not be applicable.
- 22:37So we kind of write this as our model
- 22:41for the compositional data and we make no other assumptions
- 22:45about this distribution.
- 22:46So only a first moment conditional expectation statement
- 22:53without any full distributional specification.
- 22:59So what do we do?
- 23:00So we have expectation of a given y
- 23:03is the i-th row of the misclassification matrix.
- 23:08We can use something called
- 23:10the Kullback-Leibler divergence
- 23:12or the cross entropy loss
- 23:14between a and its model expectation.
- 23:17So again, all the conditional signs are missing here.
- 23:22So basically a is the data we observe,
- 23:26this is the modeled expectation,
- 23:29which is basically the i-th row
- 23:30of the misclassification matrix,
- 23:31and we use the cross entropy loss,
- 23:34the Kullback-Leibler loss, between the two.
- 23:37What's the advantage?
- 23:38So first of all,
- 23:39the Kullback-Leibler loss allows zeroes in the composition.
- 23:42So it is well-defined even if you have zeroes or ones.
- 23:45If you take the negative loss and exponentiate it,
- 23:48it's exactly the multinomial likelihood.
- 23:50So if your data is indeed multinomial,
- 23:52you get back your likelihood that you're using
- 23:54for your single class model.
- 23:57But if your data is not multinomial,
- 24:00you get a pseudo likelihood that you can work with.
- 24:04If you take the derivative of the loss function
- 24:07and take the expectation under the true parameter,
- 24:10you'll see that it's a valid score function
- 24:13in the sense that you get an unbiased estimating equation
- 24:16for your misclassification rate matrix, M,
- 24:19based on just the first moment assumption.
- 24:23And then similarly, you can do the same thing
- 24:25for the unlabeled data.
- 24:27The probability statement becomes expectation statement
- 24:30and then we have the Kullback-Leibler loss.
- 24:32This gives an unbiased estimating equation for both M and p.
- 24:36And again,
- 24:38if the data is truly multinomial and not compositional,
- 24:41this becomes exactly the multinomial likelihood.
- 24:43If the data is compositional,
- 24:45it becomes a pseudo likelihood.
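A sketch of that loss in code (my own illustration, not the authors' implementation): the cross-entropy between an observed composition a and its modeled expectation is well-defined when a has zeros, and for a one-hot a, exponentiating the negative loss recovers the single-class multinomial likelihood:

```python
import numpy as np

def cross_entropy_loss(a, mu):
    """Cross-entropy -sum_j a_j log(mu_j) between an observed
    composition a and its modeled expectation mu; terms with
    a_j = 0 contribute nothing, so zeros in a are allowed."""
    a, mu = np.asarray(a, float), np.asarray(mu, float)
    mask = a > 0
    return -np.sum(a[mask] * np.log(mu[mask]))

mu = np.array([0.7, 0.2, 0.1])  # modeled expectation, e.g. a row of M

# One-hot (categorical) case: exp(-loss) is the multinomial likelihood.
a_onehot = np.array([1.0, 0.0, 0.0])
lik = np.exp(-cross_entropy_loss(a_onehot, mu))  # = mu[0] = 0.7

# Compositional case with a zero entry: still a finite pseudo likelihood.
a_comp = np.array([0.6, 0.4, 0.0])
pseudo = np.exp(-cross_entropy_loss(a_comp, mu))
```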
- 24:50Okay, so how do we do Bayes analysis
- 24:52with pseudo likelihoods?
- 24:54So this is where this idea of generalized Bayes
- 24:57or model-free Bayesian inference comes in
- 24:59and there have been parallel developments
- 25:01in computer science, econometrics, and statistics
- 25:04without much communication among the three fields
- 25:07for the last 30, 40 years.
- 25:10Basically, if you're given a loss function
- 25:13without a full likelihood for the data,
- 25:15you can take negative of that loss function
- 25:18multiplied by some tuning parameter, alpha,
- 25:22exponentiate it and treat it as a pseudo likelihood
- 25:26and apply your priors
- 25:27and then your posterior is going to be proportional to this
- 25:30as long as the normalization constant exists.
- 25:33And there has been a lot of work that has shown
- 25:35that this is a valid posterior,
- 25:38it is a generalization of the Bayesian posterior,
- 25:41like if this is an actual likelihood,
- 25:42this is the Bayesian posterior,
- 25:44but if it's not an actual likelihood,
- 25:48it has been shown that it basically minimizes
- 25:49the Bayes risk for that loss function.
- 25:54It has nice asymptotic properties
- 25:56shown by Victor Chernozhukov in this paper
- 25:59and then this JRSS-B paper in 2016, I think,
- 26:04showed that if you're given a loss function
- 26:06and a prior,
- 26:07this is the only coherent way you can get a posterior.
- 26:12So there's now been a lot of work and it's been called
- 26:15by different names like Gibbs posteriors,
- 26:17pseudo posterior, Laplace-type estimators
- 26:20and quasi-Bayesian estimators along with generalized Bayes.
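As a toy illustration of the recipe just described (entirely my own example, not the paper's model): take any loss, form exp(-alpha times loss) as a pseudo likelihood, multiply by a prior, and normalize; here on a one-dimensional grid:

```python
import numpy as np

def gibbs_posterior(loss_fn, prior_fn, grid, alpha=1.0):
    """Generalized Bayes / Gibbs posterior on a grid:
    posterior(theta) proportional to exp(-alpha * loss(theta)) * prior(theta)."""
    unnorm = np.exp(-alpha * np.array([loss_fn(t) for t in grid]))
    unnorm *= np.array([prior_fn(t) for t in grid])
    return unnorm / unnorm.sum()

# Toy data: 7 ones out of 10; a squared-error loss in place of a likelihood.
data = np.array([1] * 7 + [0] * 3)
loss = lambda theta: np.sum((data - theta) ** 2)
prior = lambda theta: 1.0  # flat prior on (0, 1)

grid = np.linspace(0.001, 0.999, 999)
post = gibbs_posterior(loss, prior, grid, alpha=1.0)
```

The tuning parameter alpha here plays the calibration role mentioned above; alpha = 1 with an actual negative log likelihood recovers ordinary Bayes.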
- 26:25So for our case, we have the pseudo likelihood
- 26:28for the labeled data.
- 26:29We have the pseudo likelihood for the unlabeled data.
- 26:32We put priors.
- 26:33If all of our data were categorical,
- 26:35this reduces to that multinomial model we had
- 26:38for the categorical data.
- 26:39But if some of the data is compositional,
- 26:41then this becomes generalized Bayes,
- 26:44so we call it generalized Bayes quantification learning.
- 26:47It allows sparsity of the outputs in the sense
- 26:50that if some of the data have zeroes and ones in them,
- 26:54this is well-defined.
- 26:56It's the same pseudo likelihood
- 26:58for categorical and compositional predictions.
- 27:01And then it also allows
- 27:02a nice Gibbs sampler using conjugacy.
- 27:11One final sort of data aspect we had
- 27:15was that this minimally invasive tissue sampling
- 27:18was also sometimes inconclusive in the sense
- 27:21that they gave two causes.
- 27:22Like often, they were ambiguous between HIV and tuberculosis
- 27:29and they would give one as the immediate cause
- 27:31and one as the underlying cause.
- 27:32So sometimes, even the true cause of death is compositional.
- 27:36So your predicted cause of death is compositional,
- 27:39your true cause of death is also compositional
- 27:41and we call it like b, which represents the belief.
- 27:45And you can show that if you're only given b
- 27:49instead of a single cause of death,
- 27:53your conditional expectation becomes M transpose b
- 27:56instead of the i-th row of the M matrix.
- 27:59And you can do the same thing
- 28:01using the compositional true cause of death
- 28:05instead of the actual true cause of death.
- 28:08And all the conditional signs are missing here
- 28:10but you can just formulate the Kullback-Leibler loss
- 28:14to generate the pseudo likelihood.
- 28:19So this gave rise to a digression,
- 28:22where we looked at the fact that basically
- 28:25your true cause of death is a compositional covariate
- 28:28and your predicted cause of death is a compositional output.
- 28:31So we kind of looked at regression
- 28:33of a compositional outcome on compositional predictors.
- 28:36So this was kind of an offshoot paper
- 28:40where we just developed this piece
- 28:42and if you look at compositional regression,
- 28:45most of the work has been done using Dirichlet models
- 28:50or log ratio transformations.
- 28:52So this was a different approach to that in the sense
- 28:55that it's both transformation free
- 28:57and it doesn't specify a whole distribution
- 28:59like the Dirichlet,
- 29:00it just uses a first moment assumption.
- 29:02And we have an R package, called codalm,
- 29:07to do composition-on-composition regression.
- 29:12But going back to the verbal autopsy work,
- 29:16we have the loss functions
- 29:17for the labeled and unlabeled data,
- 29:20we do the negative pseudo likelihoods,
- 29:23put priors on the parameters and we get posterior inference.
- 29:28One last extension of the methodology
- 29:31was that there are multiple different
- 29:34verbal autopsy algorithms and there are papers
- 29:36where every new algorithm comes out and they say
- 29:39they're better than all the previous algorithms.
- 29:41And in practice, you never know which is the best algorithm.
- 29:44So we developed an ensemble method that takes in predictions
- 29:49from multiple algorithms, estimates
- 29:54algorithm-specific misclassification rates
- 29:57and then they're connected to the unknown estimand.
- 30:00So we can show that it gives more weight
- 30:04to the more accurate algorithm in a data-driven way.
- 30:10So you don't have to make the choice
- 30:12of which is the best algorithm in advance.
- 30:14If you have multiple candidates,
- 30:15you can use multiple algorithms together.
- 30:23So we looked at some theoretical properties of the method.
- 30:26We have two loss functions, one for the labeled data,
- 30:29one for the unlabeled data.
- 30:31The labeled data
- 30:32doesn't even feature the estimand, which is p,
- 30:36so, on its own, it cannot identify p.
- 30:39The unlabeled data only uses p through this quantity,
- 30:43M transpose p.
- 30:44So again, for different combinations of M and p,
- 30:48as long as this product is the same,
- 30:50it will never be able to identify p on its own.
- 30:53So each loss function on its own
- 30:54cannot identify the parameters.
- 30:57But using both the loss functions together,
- 30:59you can identify the estimand, p,
- 31:02and we were able to show that the posterior has nice properties
- 31:06in terms of asymptotic normality,
- 31:08well-calibrated interval estimates,
- 31:11and near parametric concentration rates.
- 31:13And the theory also extends to the ensemble method
- 31:16and we use some approximations in the sampler,
- 31:19and the theory holds for that.
- 31:24Some empirical validations,
- 31:27since we're estimating a probability vector,
- 31:32the common metric that is used is called
- 31:34this chance-corrected normalized absolute accuracy,
- 31:38which is basically a scaled L1 error,
- 31:42centered by the L1 error you would get if you had predicted
- 31:46the cause of death randomly.
- 31:47So this is the error if you predict randomly
- 31:50and then we look at how much improvement we get
- 31:52over random predictions.
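One way to compute such a chance-corrected accuracy (a sketch only; the exact normalization used for the CCNAA metric in the paper may differ) is to scale the L1 error of an estimate by the average L1 error of random guesses on the simplex:

```python
import numpy as np

def chance_corrected_accuracy(p_hat, p, n_draws=10_000, seed=0):
    """Sketch of a chance-corrected, scaled L1 accuracy: improvement of
    p_hat over uniformly random predictions on the simplex. The exact
    normalization of the paper's CCNAA may differ."""
    rng = np.random.default_rng(seed)
    err = np.abs(p_hat - p).sum()
    random_preds = rng.dirichlet(np.ones_like(p), size=n_draws)
    err_random = np.abs(random_preds - p).sum(axis=1).mean()
    return 1.0 - err / err_random  # 1 = perfect, 0 = no better than chance

p = np.array([0.5, 0.3, 0.2])
acc_perfect = chance_corrected_accuracy(p, p)  # a perfect estimate scores 1
acc_bad = chance_corrected_accuracy(np.array([0.0, 0.0, 1.0]), p)  # negative: worse than chance
```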
- 31:57So this is an illustration of what happens if the data
- 32:01is not Dirichlet and you use Dirichlet distribution.
- 32:03So on the left-hand side,
- 32:05the data is generated from Dirichlet
- 32:08and we use both our method and the Dirichlet-based model
- 32:12and they both do well.
- 32:14On the right-hand side,
- 32:15the data is from an overdispersed Dirichlet
- 32:17and we use both the Dirichlet model and our model.
- 32:20And because our model doesn't specify a distribution,
- 32:22it just uses a first-moment specification,
- 32:25it's much more robust and has much higher accuracy
- 32:29than the Dirichlet model, which becomes misspecified.
- 32:35And then we also did a bunch of evaluations
- 32:37using the PHMRC data.
- 32:38So what we did was we trained the classifiers
- 32:42on three of the countries leaving one country out
- 32:44and then used a slice of data from that left out country
- 32:47to estimate the misclassification rates,
- 32:50and then we apply our method.
- 32:55The green one is our method
- 32:56and the x axis is the sample size of the dataset
- 33:02used from the left out country
- 33:04to estimate the misclassification rates.
- 33:07The blue one is sort of the uncalibrated one,
- 33:11the red one is the one that is calibrated
- 33:13using the training data.
- 33:14So you can see that our method does better than both of them
- 33:18and the higher the sample size we use
- 33:20from the left out country of interest
- 33:23to estimate the misclassifications, the more accurate it is.
- 33:30And also one interesting aspect
- 33:31was that we looked at calibration
- 33:33using individual algorithms and the calibration
- 33:36using the ensemble one.
- 33:37And more often than not, the ensemble one,
- 33:40which is the orange one,
- 33:42tends to perform similar to the best performing algorithm,
- 33:46and the best performing algorithm can be very different
- 33:48across different countries.
- 33:50For example, in Mexico,
- 33:51InSilicoVA is one of the best performing algorithms,
- 33:54but in Tanzania, InSilicoVA was doing very poorly
- 33:57and then InterVA was one
- 33:59of the better performing algorithms.
- 34:00So the ensemble always tends to give more weight
- 34:03to the more accurate algorithms.
- 34:07So this is an overview of what we did for Mozambique.
- 34:10So we had the unlabeled data with only verbal autopsies.
- 34:14We've passed it through two algorithms,
- 34:16InSilicoVA and Expert VA, to get the uncalibrated estimates.
- 34:21Then we had the labeled data with the MITS cause of death,
- 34:23with which we estimated the misclassification rates
- 34:25of those two algorithms,
- 34:28and then we combined them in the ensemble method
- 34:30to get calibrated estimates.
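A toy sketch of this pipeline with invented numbers and a single algorithm (a crude plug-in calibration, not the generalized Bayes ensemble itself): estimate the misclassification rates from the labeled deaths, then invert q = M^T p to calibrate the uncalibrated cause distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1) Labeled deaths: MITS causes plus the algorithm's VA-based predictions.
true_cause = rng.integers(0, 3, size=500)
M_true = np.array([[0.8, 0.1, 0.1],     # hypothetical misclassification rates
                   [0.2, 0.7, 0.1],
                   [0.3, 0.3, 0.4]])
pred_cause = np.array([rng.choice(3, p=M_true[c]) for c in true_cause])

# 2) Row-normalized confusion counts estimate the misclassification matrix.
counts = np.zeros((3, 3))
np.add.at(counts, (true_cause, pred_cause), 1)
M_hat = counts / counts.sum(axis=1, keepdims=True)

# 3) Calibrate the uncalibrated cause distribution q by solving q = M^T p,
#    then project back onto the probability simplex.
q = np.array([0.45, 0.35, 0.20])        # invented uncalibrated CSMF
p_cal = np.linalg.solve(M_hat.T, q)
p_cal = np.clip(p_cal, 0, None)
p_cal /= p_cal.sum()
```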
- 34:38Some results from Mozambique.
- 34:40We have two age groups,
- 34:42neonatal deaths, that's the first four weeks,
- 34:45and child deaths, that's under five years.
- 34:48Two algorithms, seven causes of death for children,
- 34:52five causes of death for neonates.
- 34:55I'm going to just show the neonatal results here.
- 34:57So these are the misclassification matrices for neonates.
- 35:01And ideally, you would want the matrices
- 35:03to have large numbers on the diagonals
- 35:05because those are the correct matches
- 35:07and then small numbers on the off diagonals.
- 35:09But you don't see that,
- 35:10you see quite a bit of large numbers on the off diagonals.
- 35:14One thing that stands out is that
- 35:17if you look at prematurity, it has a very high sensitivity,
- 35:20close to 90%,
- 35:22which means that if the true cause is prematurity,
- 35:25the verbal autopsy correctly diagnoses it.
- 35:28But then it also has high false positives,
- 35:31in the sense that if the true cause is infection,
- 35:3420% of the time, it is assigned as prematurity.
- 35:37If the true cause is intrapartum-related events,
- 35:40almost 30% of the time,
- 35:41it's assigned to be prematurity and so on.
- 35:43So it tends to overcount a lot of deaths
- 35:46from different causes as prematurity.
- 35:48So what would be the result after calibration
- 35:52is that the percentage of prematurity comes down.
- 35:54So this is the uncalibrated estimate of prematurity.
- 35:58This is the calibrated estimate of prematurity.
- 36:01You can see that it comes down
- 36:02because we can see in the data that there is a lot
- 36:05of over counting of prematurity deaths.
- 36:09So after calibration, it tends to come down quite a bit.
- 36:17And also, we looked at the model-estimated sensitivities
- 36:22using both the single-cause
- 36:24and the compositional cause-of-death data.
- 36:27So this is the difference in the sensitivities
- 36:29and you can see that using the compositional cause of death,
- 36:33you'll always get a higher match because it uses
- 36:36information from multiple causes instead of
- 36:39just considering the top cause.
- 36:41And so it generally leads to better matching
- 36:43between the verbal autopsy and the minimal tissue sampling.
- 36:49Some ongoing work.
- 36:51So when we did this for Mozambique,
- 36:53there was very little paired data.
- 36:57So even though the data was for seven countries,
- 36:59we kind of merged them together
- 37:01and estimated the misclassification rates.
- 37:04Now we have more data coming in for those countries
- 37:07so we have a chance to assess
- 37:08whether the misclassification rates vary by country
- 37:12because if they do,
- 37:12we should model the misclassification rates
- 37:17in a way that's specific to each country.
- 37:21So these are the misclassification rates now
- 37:26resolved by country.
- 37:27So there are six countries, Bangladesh, Ethiopia,
- 37:30Kenya, Mali, Mozambique and Sierra Leone.
- 37:35You can see the estimates.
- 37:36These are the empirical estimates
- 37:37and the confidence intervals for each country.
- 37:40And the horizontal black line
- 37:42is what the pooled estimate looks like.
- 37:44So you can see that for some cause pairs, like here,
- 37:49there is not much variability across countries.
- 37:51But then for some other cause pairs, like say here,
- 37:56there's quite a bit of variability across countries.
- 38:00And so now that we are getting more data,
- 38:03the next step for the project
- 38:05is to estimate country-specific misclassification rates.
- 38:09The issue however is that even with more data,
- 38:12there are, I think, around 600 cases here for six countries,
- 38:17which is approximately 100 cases per country.
- 38:20And there are 25 cells of the misclassification matrix.
- 38:23So that's like four cases per cell,
- 38:25which is clearly not enough to fit separate
- 38:27country-specific models.
- 38:30So we'd have to kind of do
- 38:32a sort of a borrowing of information
- 38:35both across the rows and columns of the matrix
- 38:38but also across different countries.
- 38:42So what we do first is borrow information
- 38:45across the rows and columns of the matrix.
- 38:49And to do this,
- 38:52instead of an unstructured misclassification matrix
- 38:55where we estimate each cell separately,
- 38:57we start with a structured misclassification matrix
- 39:00using two basic mechanisms.
- 39:02So we say that a classifier operates using two mechanisms:
- 39:07for a given cause, it can either match that cause,
- 39:12and we call that the intrinsic accuracy,
- 39:15and that matching probability will be different
- 39:18for different causes, so there are three causes here,
- 39:20and you can see
- 39:21that the matching probability can be different.
- 39:24If it doesn't match the true cause,
- 39:26then it randomly distributes its prediction
- 39:29to the other causes
- 39:31and that random distribution will also have some weights,
- 39:36and those we call the systematic bias
- 39:38or the pull of the classifier.
- 39:40So if it's not matching,
- 39:42we saw that it'll often assign a cause to prematurity
- 39:46regardless of what the true cause is.
- 39:48So that's kind of the basis for this model.
- 39:51And if you have this model,
- 39:52we rearrange these three bars here
- 39:57and then put them in the circle there.
- 39:59And these will give you the misclassification probabilities.
- 40:03So we can write each of the misclassification probabilities
- 40:08in terms of just these six parameters and we can do the same
- 40:13for the green cause and for the blue cause.
- 40:17And so basically, these are the nine misclassification rates
- 40:22written in terms of the six parameters.
- 40:23So this is not that much of a dimension reduction
- 40:26if there are three causes,
- 40:27but if there are in general C causes,
- 40:32this model for the misclassification matrix will only have
- 40:342C - 1 parameters as opposed to C-squared parameters.
- 40:39So in practice, we use seven causes for children
- 40:43and five causes for neonates,
- 40:44so this leads to a lot of dimension reduction.
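One plausible parameterization of this structured matrix (an assumption for illustration; the paper's exact form may differ): a diagonal of intrinsic accuracies, with misses redistributed to the other causes in proportion to a single pull-weight vector. That gives C accuracies plus a pull vector constrained to sum to one, so 2C - 1 free parameters.

```python
import numpy as np

def structured_M(a, pull):
    """Structured misclassification matrix: cause i is matched with its
    intrinsic accuracy a[i]; otherwise the prediction is redistributed
    to the other causes in proportion to the pull weights. One plausible
    parameterization of the model sketched in the talk."""
    C = len(a)
    M = np.zeros((C, C))
    for i in range(C):
        M[i, i] = a[i]
        off = np.delete(np.arange(C), i)
        w = pull[off] / pull[off].sum()   # renormalize pull over other causes
        M[i, off] = (1 - a[i]) * w
    return M

a = np.array([0.9, 0.6, 0.5, 0.7])        # C intrinsic accuracies (invented)
pull = np.array([0.5, 0.2, 0.2, 0.1])     # systematic bias, e.g. towards prematurity
M = structured_M(a, pull)

assert np.allclose(M.sum(axis=1), 1)      # each row is a probability vector
# The odds of misclassifying into j versus k do not depend on the true cause i:
assert np.isclose(M[0, 2] / M[0, 3], M[1, 2] / M[1, 3])
```

The last assertion is exactly the odds property discussed next: the ratio of off-diagonal entries within a row depends only on the pull weights, not on the true cause.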
- 40:49And one of the justifications
- 40:53for this dimension-reduced model
- 40:54is that if this model is true, then
- 41:01the odds of misclassification into two causes, j and k,
- 41:05will not depend on what the true cause is.
- 41:08And we do see that in the data.
- 41:10So these are different cause pairs, j and k,
- 41:13and these are the odds for each true cause.
- 41:17So we are plotting the ratio of misclassification rates,
- 41:20Mij over Mik.
- 41:22So this is j and k
- 41:24and the colors here give you i.
- 41:26So you do see that they do not vary
- 41:28for different choices of i,
- 41:30they are only specific to j and k,
- 41:32and that's an equivalent characterization
- 41:36of that systematic preference
- 41:39and intrinsic accuracy model that we have,
- 41:41so we do see that reflected in the data.
- 41:44But we don't fix that as the model.
- 41:49So this is the base model.
- 41:51We allow some deviation, with shrinkage towards it
- 41:54controlled by a tuning parameter.
- 41:56So then we get the homogeneous model
- 41:58and then we allow a deviation from the homogeneous model
- 42:01to get the country-specific model.
- 42:03So that's the broad idea,
- 42:04I won't go into the modeling details.
- 42:07And these are the predictions
- 42:09using the country specific model.
- 42:13I won't go into details here, but there are many cases,
- 42:15for example, take this one here:
- 42:17the star is the empirical rate,
- 42:19the triangle is the heterogeneous model.
- 42:24And you can see it does much better
- 42:26than the horizontal line, which is the homogeneous model.
- 42:30And we see this throughout the misclassification rates.
- 42:36These are the estimates for Bangladesh.
- 42:38So the red density is the pooled estimate
- 42:41from the homogeneous model.
- 42:43The blue density is the Bangladesh specific estimate.
- 42:48The dotted vertical line
- 42:50is the empirical estimate for Bangladesh
- 42:52and the solid vertical line
- 42:53is the pooled empirical estimate.
- 42:56So you can see that as we get
- 42:59more and more data from Bangladesh,
- 43:01the country-specific estimate moves away
- 43:03from the pooled estimate
- 43:04towards the Bangladesh-specific empirical estimate.
- 43:06So that's basically the hope going forward:
- 43:12we will have much more data within each country
- 43:14and we'll have estimates that are much closer
- 43:16to the dotted lines than to the solid lines.
- 43:22So that's the summary.
- 43:23So in general, these cause of death classifiers
- 43:26are super inaccurate.
- 43:28So we need to calibrate for that, and we have limited data
- 43:31to estimate their inaccuracy,
- 43:32so we calibrate them in a Bayesian way.
- 43:36The methods give probabilistic cause of death
- 43:39instead of categorical cause of death.
- 43:40So we develop a generalized Bayes approach
- 43:43that is equivalent to a multinomial model
- 43:45if the data is categorical.
- 43:47But if it's not categorical, it becomes a pseudo-likelihood
- 43:50Bayesian approach for compositional data
- 43:54that allows zeros and ones in the data
- 43:57and does not depend on a full model specification.
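The generalized Bayes step can be sketched as a Gibbs posterior: exponentiate a negative loss in place of a likelihood and sample it with Metropolis-Hastings. A toy version with invented numbers (the paper's loss, prior, and sampler differ in detail):

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(x, alpha):
    # Log density of a Dirichlet(alpha) distribution at x on the simplex.
    return (lgamma(alpha.sum()) - sum(lgamma(a) for a in alpha)
            + ((alpha - 1) * np.log(x)).sum())

def log_gibbs_target(p, M, counts):
    # Negative loss: multinomial-type pseudo log-likelihood of the
    # predicted-cause counts under q = M^T p (flat prior on the simplex).
    return counts @ np.log(M.T @ p)

def gibbs_posterior_samples(M, counts, n_iter=5000, kappa=200.0, seed=0):
    rng = np.random.default_rng(seed)
    C = M.shape[0]
    p = np.full(C, 1.0 / C)
    out = np.empty((n_iter, C))
    for t in range(n_iter):
        prop = rng.dirichlet(kappa * p)        # proposal centered at current state
        log_ratio = (log_gibbs_target(prop, M, counts)
                     - log_gibbs_target(p, M, counts)
                     + dirichlet_logpdf(p, kappa * prop)   # Hastings correction
                     - dirichlet_logpdf(prop, kappa * p))
        if np.log(rng.uniform()) < log_ratio:
            p = prop
        out[t] = p
    return out

M = np.array([[0.8, 0.1, 0.1],                 # invented misclassification rates
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])
counts = np.array([520, 320, 160])             # predicted causes among 1000 unlabeled deaths
samples = gibbs_posterior_samples(M, counts)
p_hat = samples[1000:].mean(axis=0)            # posterior mean after burn-in
```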
- 44:02And then it kind of led to this independent development
- 44:05of the composition on composition regression.
- 44:09Some papers and software.
- 44:10So the single-cause paper was the first one,
- 44:13then we extended it to compositional data
- 44:17and developed the theory for it.
- 44:19The package for calibration is available on GitHub
- 44:22and then the composition-on-composition regression
- 44:25was a separate piece
- 44:26and we have the codalm package for it on CRAN.
- 44:30And then we used this approach
- 44:32to produce calibrated estimates
- 44:36for neonatal and child deaths in Mozambique,
- 44:39which were published in the last three papers.
- 44:41Thank you.
- 44:51<v ->Questions? Yes.</v>
- 44:53<v ->So I just had a quick question 'cause you were saying</v>
- 44:55the model basically looks at the symptoms
- 44:58to be able to predict which cause it would be.
- 45:00Does it also factor in what diseases and stuff
- 45:04are most common in those areas or does it kind of just-
- 45:07<v ->Oh, very good question.</v>
- 45:09It does factor it in but in a very crude way
- 45:12in the sense that the models have some settings
- 45:14called like high malaria, low malaria or high HIV, low HIV.
- 45:18So depending on which country you're running it in,
- 45:21you will set the setting to, like, high-HIV country
- 45:24or low-HIV country, the same for malaria,
- 45:27but it doesn't do anything beyond that,
- 45:30so only at a very coarse level.
- 45:34<v ->Causes of death or.</v>
- 45:37<v ->So the ICD-10 classification</v>
- 45:40will have around 30 plus causes of death
- 45:42for children and neonates,
- 45:44I think much more for adults.
- 45:47There are no MITS for adults.
- 45:48MITS was only done for children and neonates,
- 45:51only now are adult MITS being started,
- 45:54but we have to kind of group them into broader categories
- 45:57because if you have 30 causes,
- 45:59your misclassification matrix will be 30 times 30.
- 46:02So we don't have the data to do estimation
- 46:05at that fine resolution.
- 46:06So we group them into broader categories.
- 46:08So seven for children, five for neonates.
- 46:11<v ->Is one of the categories, I have no idea,</v>
- 46:14it is totally unknown.
- 46:15And if so, is that different from the uniform distribution
- 46:18across causes of death?
- 46:21<v ->That would be the uniform distribution.</v>
- 46:23There is no category which is, I have no idea,
- 46:25but it'll be probably reflected in a score that is very flat
- 46:28across the causes.
- 46:30<v ->If you think there are seven causes of death</v>
- 46:32and I'm working with the same dataset
- 46:34and I think there are 100 causes of death,
- 46:36will there be substantial differences in our marginal
- 46:39estimates of probability?
- 46:41Because our uniform posteriors
- 46:45place such different amounts of mass across the say
- 46:4830 versus 100 causes of death.
- 46:51<v ->Yes, there will be differences</v>
- 46:54and even when we are aggregating from the 30 causes
- 46:58to seven causes, the assumption is that within each category
- 47:02the misclassification rates are homogeneous
- 47:04within the finer category.
- 47:05So that is an assumption that we're working with.
- 47:08So definitely, there will be differences.
- 47:11<v ->Thank you.</v>
- 47:16<v ->I have one more question.</v>
- 47:22I'll ask a philosophical question
- 47:23if I may. <v ->Sure, yeah.</v>
- 47:24<v ->You commented,</v>
- 47:26I don't know, about halfway through,
- 47:27about how statisticians are working on a thing.
- 47:32Computer scientists are working on the same thing.
- 47:34There's a third group I forget.
- 47:37And nobody talks to each other.
- 47:40Now, many of us are,
- 47:42many of the students here
- 47:44are within the data science track of biostatistics.
- 47:49By the way, love your Twitter handle.
- 47:52But yeah, so how do we bridge those things
- 47:56that we take advantage of these things
- 47:57and it's not three separate versions of the same thing?
- 48:01<v ->I don't know if there's a systematic way.</v>
- 48:04Honestly, I came to know about much of the literature
- 48:08going through the revisions
- 48:09and one of the reviewers or associate editors said
- 48:11there is a lot of work here in the econometrics literature,
- 48:14you should take a look.
- 48:15And that's kind of the value
- 48:16of the peer review system I guess.
- 48:17And so we looked at it and yes, there was a lot of work
- 48:20and they just called it different things
- 48:22and so I had no idea
- 48:23when I was searching for that in the literature.
- 48:26And we did see the Victor Chernozhukov paper,
- 48:29I think it's in the "Journal of Econometrics,"
- 48:30but it's basically an asymptotic statistics paper.
- 48:33It kind of shows that this generalized Bayes stuff,
- 48:36which they call Laplace-type estimators,
- 48:38has all these nice properties
- 48:40that a standard Bayesian posterior will have.
- 48:43But yeah, I think talking to more people
- 48:46and like interacting and telling about your work
- 48:49will kind of,
- 48:50and someone will say that, oh yeah, I do something similar.
- 48:52You should look at this paper,
- 48:55it's probably. <v ->Hopefully Twitter helps.</v>
- 48:57<v ->Sorry?</v>
- 48:58<v ->Hopefully Twitter helps.</v>
- 48:58<v ->Yeah, yeah, definitely.</v>
- 49:00Engagement through any like in-person
- 49:02or social media platform would be useful, yeah.
- 49:08<v ->All right, well thanks so much.</v>
- 49:08I think we're out of time so we'll stop it there.
- 49:12(attendant muttering indistinctly)
- 49:15Hope everybody has a wonderful fall break.
- 49:17See you next week.
- 49:19(attendants chattering indistinctly)
- 49:37<v Learner>The other organizer.</v>
- 49:38(learner muttering indistinctly)
- 49:39(attendants chattering indistinctly)
- 49:53<v ->Or maybe because they're susceptible.</v>
- 49:55(attendants chattering indistinctly)
- 50:04<v ->Thank you. Anyone else need to sign in?</v>
- 50:06(attendants chattering indistinctly)
- 50:19<v ->Infection but they're also premature babies.</v>
- 50:21(attendants chattering indistinctly)
- 50:30<v ->Premature, but also it's that</v>
- 50:32it's not a distinct.
- 50:34(attendants chattering indistinctly)
- 50:35<v ->Cause of death is very blurry in this day.</v>
- 50:38<v ->Is that part of why like.</v>
- 50:40(attendants chattering indistinctly)
- 50:46<v ->'Cause a symptom given cause session</v>
- 50:49with that much of variation across country.
- 50:51<v Learner>Cause.</v>
- 50:52(learner muttering indistinctly)
- 50:53Cause.
- 50:54<v ->Reporting depends on who is answering.</v>
- 50:58(attendants chattering indistinctly)
- 51:04<v ->You need to go next.</v>
- 51:05<v ->Back to.</v>
- 51:09<v ->I guess, yeah.</v>
- 51:10You need one of us to let you.
- 51:12(lecturer muttering indistinctly)
- 51:14<v ->It might be a short answer.</v>
- 51:16Yeah, and it's short answer.
- 51:17(attendants chattering indistinctly)
- 51:20<v ->I don't have to, will you? (laughs)</v>