YSPH Biostatistics Seminar: "Generalized Bayes Calibration of Compositional Cause-specific Mortality Data from Verbal Autopsies"
October 19, 2023
Abhirup Datta, PhD, Associate Professor, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health
October 17, 2023
Information
- ID
- 10875
Transcript
- 00:00<v ->And welcome.</v>
- 00:02Today, it's my, eh.
- 00:05Today, it is my pleasure to introduce Professor Abhi Datta
- 00:09from Johns Hopkins University in Baltimore, Maryland.
- 00:13Professor Datta earned his BS and MS
- 00:15from the Indian Statistical Institute
- 00:17in 2008 and 2010 respectively,
- 00:20and PhD from the University of Minnesota in 2016.
- 00:25In addition to being a well-cited researcher
- 00:27with one publication that has almost 600 citations,
- 00:30which is pretty nice,
- 00:32he's also an award-winning educator,
- 00:35having repeatedly won an excellence in teaching award
- 00:37from his institution.
- 00:39So let's welcome Dr. Datta.
- 00:44<v ->Thank you, Robert,</v>
- 00:45for the invitation to come here and give the seminar,
- 00:48and for the very nice introduction.
- 00:50Thank you everyone for coming.
- 00:52My talk is about improving cause-specific mortality data
- 00:56in low and middle-income countries
- 00:58where the main tool to collect data
- 01:00is something called verbal autopsies.
- 01:02And the way I do it
- 01:03is using a statistical approach called generalized Bayes.
- 01:07If you have not heard
- 01:08of verbal autopsies or generalized Bayes,
- 01:11I can tell you that I hadn't heard of either of those things
- 01:14when I started working on the project,
- 01:17so don't worry about that,
- 01:18I try to give an introduction.
- 01:20'Cause I mostly work on spatial and spatiotemporal data
- 01:24and this was a project that came along,
- 01:27which is very different from what I used to work on.
- 01:29But over the years, there's been a nice body of work
- 01:31developed in this project.
- 01:35So this is a joint work
- 01:39with many different institutes and collaborators.
- 01:44The top row is the Hopkins biostats team,
- 01:46which included my former students,
- 01:48Jacob Fiksel and Brian Gilbert,
- 01:51and my current postdoc, Sandi,
- 01:53and my colleague, Scott Zeger, and I
- 01:56lead the biostats part of the team.
- 02:00Agbessi is the PI of the project in Mozambique
- 02:03that sort of kicked off the development of this work.
- 02:07And there are a lot of colleagues
- 02:09from the International Health Department
- 02:10that helped to collaborate.
- 02:12And then Li is the PI of a new project
- 02:16where we're going to apply our methodology
- 02:17for producing mortality estimates for the WHO.
- 02:22So we're collaborating with Li there as well.
- 02:25And then a couple of people outside Hopkins,
- 02:27Dianna at CDC and Emory University,
- 02:31is the director of the CHAMPS project.
- 02:35And Ivalda in the government body at Mozambique
- 02:39has been now currently doing the work in Mozambique.
- 02:44So this is funded by three grants from the Gates Foundation.
- 02:49The first one was the grant that kind of started things.
- 02:52And then we have a grant that is kind of developing more
- 02:55on the method side of the world.
- 02:59So, many low and middle-income countries
- 03:05often lack high-quality data on causes of death.
- 03:08Often for most deaths,
- 03:10there is no sort of medical certification
- 03:13or like an autopsy done.
- 03:16And without kind of high-quality data
- 03:19on what people are dying of,
- 03:21it's kind of hard to estimate the disease burden
- 03:23in these countries.
- 03:25And specifically, the quantity of interest
- 03:27is the cause-specific mortality fraction,
- 03:29which is basically the percentage of deaths in an age group
- 03:34that can be attributable to a given cause.
- 03:38So cause-specific mortality fractions
- 03:40are key pieces of information
- 03:42in determining the global burden of disease,
- 03:44which in turn dictates sovereign policy,
- 03:47as well as like resource allocations
- 03:49for programs operating in these countries.
- 03:54So verbal autopsy is an alternate way
- 03:57to count deaths and attribute causes
- 03:59without actually doing a clinical autopsy.
- 04:02So verbal autopsy is basically
- 04:04a sort of a systematic interview
- 04:07of the household members of the deceased.
- 04:08So the government or the program has a set of field workers
- 04:12who go out and go from household to household
- 04:15and ask if anyone died in their household
- 04:17within the last several months.
- 04:18And if they died, what were the symptoms?
- 04:20And the set of questions they ask is now standardized
- 04:23by the WHO.
- 04:24Some example questions are here.
- 04:27Most of the questions would have binary answers
- 04:29like yes, no, but there are some questions
- 04:32that have more like continuous responses.
- 04:38So, as I said, the WHO has standardized
- 04:41the verbal autopsy tool.
- 04:43The 2016 version has around 200 to 350 questions,
- 04:47depending on the age group.
- 04:48There are separate sections of the questionnaire
- 04:50for neonates, child deaths and adult deaths.
- 04:54And if you're interested in more information
- 04:56about verbal autopsy, there's a page in WHO about it.
- 05:02So a verbal autopsy, of course,
- 05:04doesn't give you a cause of death,
- 05:05it just gives you a bunch of yes-no responses
- 05:08to various questions related to the symptoms.
- 05:14So a verbal autopsy is basically a survey questionnaire.
- 05:17So you can pass that survey through a computer software
- 05:20and that can give a predicted cause of death.
- 05:23And so there are a bunch
- 05:24of different computer software available.
- 05:27InSilicoVA, developed by Tyler McCormick
- 05:31and Richard Li, who was a postdoc here,
- 05:34and published in "JASA" in 2016,
- 05:36is one of the, I think,
- 05:37most statistically-principled approaches to do it.
- 05:40But there are other approaches and then you can,
- 05:43this is basically a classification problem.
- 05:45So you're basically given your data on symptoms,
- 05:48you're kind of classifying the cause of death
- 05:50as one of several causes.
- 05:51So you can use standard classifiers
- 05:54and machine learning approaches as well.
- 05:58OpenVA is an excellent resource
- 05:59to learn about verbal autopsies.
- 06:00Again, openVA is,
- 06:04I think Richard is one of the maintainers
- 06:06and creators of openVA.
- 06:11So the COMSA project in Mozambique,
- 06:14one of the main goals was to generate
- 06:17these cause-specific mortality fractions
- 06:19for children's and under,
- 06:21for neonates and under-five children
- 06:24for the country of Mozambique.
- 06:26And the data that we collected was a large dataset
- 06:30of verbal autopsy records
- 06:32for different households that were surveyed,
- 06:34and this is a map of Mozambique
- 06:38and the green regions show
- 06:41where the data was collected
- 06:43as part of the COMSA project.
- 06:44So in statistical terms, the data just has the symptoms,
- 06:49it doesn't have the true cause of death,
- 06:51so we call it the unlabeled data.
- 06:57So how do we go from unlabeled data to the labeling
- 07:00of the causes of death
- 07:01and then estimate these cause fractions.
- 07:04This is the standard procedure that is typically done
- 07:08and this is what we were supposed to do as well,
- 07:10which is simply take each record,
- 07:12pass it through the computer software
- 07:14and get a cause of death.
- 07:16And once you get a cause of death,
- 07:18then you can sort of simply aggregate.
- 07:19So in this toy example,
- 07:21three out of the six cases were assigned to be from HIV.
- 07:25And so the cause-specific mortality fraction for HIV
- 07:27would be 50% and similar for malaria and sepsis and so on.
- 07:32So that's the basic template
- 07:35of how to get cause-specific mortality fractions
- 07:38from verbal autopsies.
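
The template just described (predict a cause for each record, then aggregate) can be sketched in a few lines of Python. This is only an illustration: the function name `classify_and_count` and the six predicted causes are hypothetical, chosen so that three of six records are assigned to HIV as in the toy example.

```python
from collections import Counter

def classify_and_count(predicted_causes):
    """Return the share of records assigned to each predicted cause."""
    counts = Counter(predicted_causes)
    n = len(predicted_causes)
    return {cause: count / n for cause, count in counts.items()}

# Hypothetical algorithm output for six verbal autopsy records.
predictions = ["HIV", "HIV", "malaria", "HIV", "sepsis", "malaria"]
csmf = classify_and_count(predictions)
print(csmf["HIV"])  # share of records assigned to HIV
```

Note that the result is a summary of the predicted causes only; nothing in this step accounts for errors made by the algorithm.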
- 07:39The question is can we trust these estimates?
- 07:41Because these are not true causes of death
- 07:43as determined by a doctor or by a clinical procedure.
- 07:46These are cause of death predicted by an algorithm
- 07:48based on just surveying the household members
- 07:52of the deceased.
- 07:57So turns out machine learning has a name
- 08:00for this type of problem,
- 08:01it's called quantification learning,
- 08:04which is basically estimating population prevalence
- 08:07using predicted labels instead of true labels
- 08:10and the predictions are coming from a classifier.
- 08:13And so there has been some work in quantification learning
- 08:16and in the machine learning literature.
- 08:19So when we were working on this problem,
- 08:21we realized that estimating
- 08:22cause-specific mortality fractions
- 08:24using predicted cause of death data from verbal autopsy
- 08:27is an example of quantification learning.
- 08:31So just a sort of an overview of terms that we'll be using
- 08:35and the corresponding statistical notation.
- 08:37So our true cause of death is y which we do not observe.
- 08:42We want to estimate the probability
- 08:43or population prevalence of y,
- 08:45so y is a categorical variable.
- 08:49And so probability of y or p
- 08:51is our cause-specific mortality fraction,
- 08:53which is the estimand.
- 08:55We observed the verbal autopsy, which is a,
- 08:57think of this as a high dimensional
- 09:00or a long list of yes-no answers
- 09:02to the verbal autopsy questions, so that is x,
- 09:06and this x is passed through a software
- 09:08to give a predicted label, which is a of x or simply a.
- 09:17So what we have in the COMSA project
- 09:21is simply an unlabeled dataset
- 09:25which takes these verbal autopsy responses,
- 09:28passes them through a software and gets the predicted labels.
- 09:34We do not observe the true labels, y,
- 09:37we may or may not retain the verbal autopsy responses
- 09:40because those are identifiable data
- 09:42and those are often not released,
- 09:43so often, just the predicted cause of death is available.
- 09:47So even these covariates, x, may or may not be available.
- 09:50And then we are interested in estimating the probability
- 09:53that y belongs to one of the C many cause categories,
- 09:58so that's a quantity of interest.
- 10:05For some reason, there is a conditional sign
- 10:07that's missing there.
- 10:09But you can use the law of total probability
- 10:13to write the probability of the predicted cause of death,
- 10:16which is the a,
- 10:18probability of a, as a sum over y of the probability of a given y
- 10:22times the probability of y.
- 10:24So there's a conditional sign missing here,
- 10:26I don't know what's going on here.
- 10:32But from the COMSA data,
- 10:33we only get information on the left-hand side, right?
- 10:36And we want to infer the quantity probability of y
- 10:41which would be the true CSMFs.
- 10:44So there is only one known quantity,
- 10:46the left-hand side, which you can estimate.
- 10:48There are two unknown quantities on the right-hand side.
- 10:50So without making assumptions, you cannot really identify
- 10:54probability of y, right?
- 10:56So any quantification learning method
- 10:59needs to either estimate those conditional probabilities,
- 11:02probability of a given y,
- 11:04or make some assumptions on it.
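
With the conditional signs restored, the identity described here can be written out as below. The indices i and j and the number of causes C follow the talk's description; the exact typesetting is a reconstruction for this transcript, not a copy of the slide.

```latex
P(a = j) \;=\; \sum_{i=1}^{C} P(a = j \mid y = i)\, P(y = i),
\qquad j = 1, \dots, C
```

The left-hand side is estimable from predicted causes alone, while both factors on the right are unknown without additional data or assumptions.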
- 11:08So again, all the conditional signs are missing.
- 11:16One of the most common approaches,
- 11:19and this is what is used in the verbal autopsy world
- 11:22is called classify and count,
- 11:25which is you simply predict the cause of death
- 11:28and then aggregate.
- 11:29So you're simply claiming that probability of a
- 11:33is the same as probability of y, which is equivalent to claiming
- 11:36that this misclassification rate matrix
- 11:39is an identity matrix, right?
- 11:41Because you're saying that the left hand quantity
- 11:44is the same as the rightmost quantity, which would be true
- 11:48if there is no misclassification by the algorithm
- 11:51and if the predicted cause of death
- 11:53is always the true cause of death.
- 11:56And that's what is typically done
- 11:58in these cause-specific mortality fraction estimates.
- 12:02But it's a very strong assumption, right?
- 12:04Because it's assuming perfect sensitivity and specificity
- 12:07of the algorithm.
- 12:10So let's look at how perfect the algorithms are.
- 12:12So these are two algorithms,
- 12:13Tariff and InSilicoVA.
- 12:16The PHMRC data is a benchmark dataset from four countries
- 12:20that has both the verbal autopsy data
- 12:22as well as a gold standard cause of death diagnosis.
- 12:26And you can see the accuracy of either method
- 12:30is around 30%, so they're far from being
- 12:33like fully accurate.
- 12:36So there are large misclassification rates
- 12:39for these algorithms and if you don't kind of adjust
- 12:42for these misclassifications,
- 12:44the disease burden estimates,
- 12:46the cause-specific mortality fractions you get,
- 12:48are likely going to be very biased.
- 12:54So this is where the CHAMPS project comes into play.
- 12:58So the CHAMPS is an ongoing project
- 13:00in like seven or eight countries including Mozambique,
- 13:05which is collecting data on both verbal autopsy
- 13:07and a more comprehensive cause of death procedure
- 13:11called minimally invasive tissue sampling.
- 13:14So it basically takes a sample of tissue
- 13:17from the deceased person and then runs a bunch
- 13:20of pathological tests and imaging analysis
- 13:23and then gives a cause of death.
- 13:25And the MITS cause of death assignments
- 13:30have been shown to be quite accurate when you compare
- 13:33to like a full diagnostic autopsy.
- 13:36So MITS is being done in a bunch
- 13:38of different countries including Mozambique.
- 13:41And for the cases where MITS is being done,
- 13:43the verbal autopsies are also collected.
- 13:46So what you get from this CHAMPS data
- 13:48is a labeled or paired dataset
- 13:50where you have both the verbal autopsy
- 13:52as well as the MITS cause of death
- 13:54and you can pass the verbal autopsy to the software
- 13:58to get the verbal autopsy predicted cause of death.
- 14:00And then you can cross tabulate the two
- 14:02and get an estimate of the misclassification rates, right?
- 14:04Like you can say like,
- 14:06"Oh okay, so there are 10 cases
- 14:08that the MITS cause of death was HIV,
- 14:11out of those 10 cases,
- 14:12seven of them were correctly assigned to HIV
- 14:15by verbal autopsy.
- 14:16So then the sensitivity would be 70%
- 14:20and the false negative rate would be 30%, and so on."
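
The cross tabulation described here can be sketched as follows. This is a minimal illustration, not the project's code: the function name, the integer coding of causes (0 for HIV, 1 for malaria) and the 15 paired cases are all hypothetical, with the HIV row built to match the "seven out of ten" example.

```python
import numpy as np

def misclassification_matrix(true_labels, predicted_labels, n_causes):
    """Row i holds the empirical distribution of predicted causes
    among cases whose reference (MITS) cause is i."""
    counts = np.zeros((n_causes, n_causes))
    for i, j in zip(true_labels, predicted_labels):
        counts[i, j] += 1
    row_totals = counts.sum(axis=1, keepdims=True)
    # Rows with no cases are left as zeros instead of dividing by zero.
    return np.divide(counts, row_totals,
                     out=np.zeros_like(counts), where=row_totals > 0)

# Hypothetical paired data: 10 HIV cases (7 predicted correctly), 5 malaria cases.
mits_cause = [0] * 10 + [1] * 5
va_cause = [0] * 7 + [1] * 3 + [1] * 4 + [0] * 1
M_hat = misclassification_matrix(mits_cause, va_cause, n_causes=2)
print(M_hat)
```

The diagonal entries are the cause-specific sensitivities and the off-diagonal entries in each row are the rates at which that cause is assigned to something else.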
- 14:27So this is the broad idea of the methodology.
- 14:29So for the COMSA data, which is the unpaired data,
- 14:32you get only the verbal autopsy record
- 14:34so you can get an estimate of the predicted causes of death
- 14:37from the verbal autopsy.
- 14:39From the CHAMPS data, which is the paired data,
- 14:41you can get an estimate of the misclassification rates.
- 14:44And then the only unknown is then the probabilities
- 14:48of the cause of death
- 14:50if you were able to do the MITS autopsy for every death.
- 14:54So then this is an equation with two knowns and one unknown
- 14:58and you can solve for it and get the calibrated estimates.
- 15:01So that's the broad idea and we do it in a model-based way.
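
Before the model-based version, the "two knowns and one unknown" idea can be illustrated with a plain linear solve. This is a plug-in sketch under assumptions, not the Bayesian method presented in the talk: the matrix `M`, the vector `p_true` and the helper `calibrate` are hypothetical, and the clip-and-renormalize step is a crude guard added here.

```python
import numpy as np

# Hypothetical misclassification matrix: M[i, j] = P(predicted = j | true = i).
M = np.array([[0.7, 0.3],
              [0.2, 0.8]])
p_true = np.array([0.2, 0.8])   # hypothetical true cause fractions
q = M.T @ p_true                # distribution of predicted causes

def calibrate(q, M):
    """Solve q = M^T p for p, then project back onto the simplex."""
    p_hat = np.linalg.solve(M.T, q)
    p_hat = np.clip(p_hat, 0.0, None)  # guard against small negative solutions
    return p_hat / p_hat.sum()

p_hat = calibrate(q, M)
print(q, p_hat)
```

In this toy setup, classify and count would report `q`, which differs from `p_true`, while the solve recovers `p_true` because `M` is known exactly. With an estimated `M` the solve can be unstable, which is one motivation for a model-based treatment.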
- 15:09So here's the formal model.
- 15:11So for the COMSA dataset, the unlabeled data or U,
- 15:15we have the predicted labels, ar,
- 15:17and then for the,
- 15:20that's for the COMSA data,
- 15:21and for the CHAMPS data,
- 15:22we have both the predicted labels from verbal autopsy, ar,
- 15:26as well as the MITS-determined labels, yr.
- 15:29And our quantity of interest is the probabilities of yr
- 15:34belonging to the different causes.
- 15:41There's a conditional sign missing here.
- 15:44But if the conditional probabilities
- 15:48are denoted by Mij, which is if the MITS cause is i,
- 15:52what is the probability that the VA-predicted cause is j,
- 15:57then you can use the law of total probability
- 15:59to write down the marginal distribution
- 16:02of the VA-predicted cause.
- 16:03So that would be in terms of the misclassification rates
- 16:07and the marginal cause distribution of the MITS-COD.
- 16:10So that's the whole idea.
- 16:11So you can write this in terms of a matrix vector notation
- 16:15as probability of a as M transpose p
- 16:18where M is the misclassification rate matrix,
- 16:21p is the unknown quantity of interest,
- 16:24which is probability that the cause of death
- 16:27is coming from a given cause.
- 16:31So the data model is very simple:
- 16:34for the unlabeled data,
- 16:36it follows a multinomial with this probability
- 16:38which is coming from this law of total probability.
- 16:41And then for the labeled data,
- 16:43this is ar given yr equals i,
- 16:46it follows a multinomial with the i-th
- 16:48row of the misclassification matrix.
- 16:49So if the MITS-COD is i,
- 16:51the misclassification rates are given by the i-th
- 16:53row of the misclassification matrix,
- 16:55so it's multinomial with that probability.
- 16:59And then we've put priors on M and p
- 17:01and then we can get estimates of both M and p.
- 17:04M is a nuisance parameter, p is the parameter of interest.
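
The data model described above can be summarized in symbols. The set names U and L and the row notation are conventions assumed for this transcript; the priors are left unspecified because the talk does not state them at this point.

```latex
\begin{aligned}
a_r &\sim \mathrm{Multinomial}\big(1,\; M^\top p\big), && r \in U \ \text{(unlabeled data)} \\
a_r \mid y_r = i &\sim \mathrm{Multinomial}\big(1,\; M_{i\ast}\big), && r \in L \ \text{(labeled data)} \\
M_{ij} &= P(a = j \mid y = i), \qquad p_i = P(y = i)
\end{aligned}
```

Here M_{i*} denotes the i-th row of M, each row of M and the vector p lie on the probability simplex, and priors are placed on both.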
- 17:10Just to carefully go over what are the assumptions here.
- 17:13The main assumption is that the misclassification rates
- 17:18of verbal autopsy given MITS
- 17:20are the same in your labeled data
- 17:23as they would be in your unlabeled data.
- 17:25This is not verifiable because we don't have
- 17:28any true cause of death in the unlabeled data,
- 17:30so it's an assumption.
- 17:33Given that the verbal autopsy
- 17:35is a function of your symptoms,
- 17:37the assumption is essentially that given a true cause,
- 17:42the probabilities of the symptoms are going to be the same
- 17:44in your unlabeled dataset as in your labeled dataset.
- 17:49And it's a reasonable assumption,
- 17:50as if you have a cause of death,
- 17:53it's likely that certain symptoms will appear
- 17:56and certain other symptoms will not appear.
- 17:59And that is true regardless of whether the data is coming
- 18:02from the labeled set or the unlabeled set.
- 18:08We do not assume that the marginal distribution
- 18:12of the causes in the CHAMPS data, the labeled data,
- 18:16is representative of the population,
- 18:17because it is not, because the CHAMPS state,
- 18:20so the CHAMPS project is done
- 18:21at specific hospitals in the country
- 18:24and the distribution of causes in hospitals
- 18:28is typically not the same as the distribution
- 18:30of causes in the community.
- 18:31And we are interested
- 18:32in the cause distribution in the population.
- 18:34So there is no assumption
- 18:37that the marginal distribution of y in the labeled data
- 18:40is the same as the marginal distribution of y in the unlabeled data,
- 18:43which is our quantity of interest.
- 18:45And the reason there is no assumption
- 18:47is we only model a given y in the labeled data.
- 18:51We never model y in the labeled data.
- 18:54So we only model the conditional,
- 18:56and the assumption is that the conditional
- 18:57misclassification rates are transportable
- 19:00from the labeled to the unlabeled side.
- 19:06So that's the main idea.
- 19:07And this was the first work we did,
- 19:09we just used this top cause prediction.
- 19:13But many of these algorithms
- 19:15are actually probabilistic in nature in the sense
- 19:17that if you look at their outputs,
- 19:18they won't give a single cause of death,
- 19:20but they will give scores to each cause.
- 19:22So for example,
- 19:24this would be a typical output of an algorithm
- 19:26for, like, say, six persons.
- 19:28So for the first person, it will say
- 19:3370% HIV, 20% malaria, 10% sepsis and so on.
- 19:38And the standard procedure is to take the top cause,
- 19:41so for the first person, it would be HIV,
- 19:44for the second person, it will be malaria and so on.
- 19:48So that's how you get a single cause
- 19:50from a probabilistic prediction.
- 19:53So that essentially ignores sort of the scores
- 19:57assigned to the second most likely cause,
- 20:01the third most likely cause and so on.
- 20:04And if you ignore those, you can end up with a biased estimate.
- 20:09So you can see these are the CSMF estimates
- 20:12using the top cause,
- 20:14these are the CSMF estimates
- 20:15using the exact scores that are assigned
- 20:17and those are different, right?
- 20:18So when we kind of change this probabilistic output
- 20:22to a single cause output, we discard information.
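
The difference between the two aggregations can be seen in a small sketch. The score matrix below is hypothetical: only the first row (70% HIV, 20% malaria, 10% sepsis) comes from the example in the talk, and the variable names are chosen for illustration.

```python
import numpy as np

# Rows are deaths, columns are causes (HIV, malaria, sepsis).
scores = np.array([[0.7, 0.2, 0.1],
                   [0.3, 0.6, 0.1],
                   [0.5, 0.1, 0.4]])

# Top-cause aggregation: keep only the most likely cause per death.
top_cause = scores.argmax(axis=1)
csmf_top = np.bincount(top_cause, minlength=scores.shape[1]) / len(scores)

# Score aggregation: average the full compositional output.
csmf_scores = scores.mean(axis=0)

print(csmf_top)
print(csmf_scores)
```

In this toy case sepsis never wins the top spot, so the top-cause estimate gives it zero weight even though the scores assign it a noticeable share.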
- 20:30So we wanted to extend the work
- 20:32to kind of use the full set of scores and the set of scores
- 20:36can be thought of as compositional data in the sense
- 20:38that the scores sum up to one
- 20:40because it assigns 100% probability across all causes
- 20:45and then they're each non-negative.
- 20:48The issue is that for the categorical data,
- 20:51our model is based on multinomial distribution.
- 20:53And then for compositional data,
- 20:55the models are typically like Dirichlet
- 20:57or log ratio based models,
- 20:59which are very different from the multinomial distribution.
- 21:03So if we have some cases
- 21:05for which we have categorical output,
- 21:07for some, we have compositional output,
- 21:09this would lead to different models
- 21:11for different parts of the dataset.
- 21:15These Dirichlet or log-ratio models
- 21:17also do not allow zeros in the data.
- 21:20So if you have zeros or ones in the composition,
- 21:22they don't allow that.
- 21:23And then these are very specific models for the data,
- 21:27which are subject to model misspecification.
- 21:29So if the data distribution does not look like a Dirichlet,
- 21:33assuming a Dirichlet
- 21:34would lead to kind of wrong results.
- 21:41So how do we extend the multinomial framework we had
- 21:46for categorical data to compositional data?
- 21:51Again, there would be a conditional sign here.
- 21:56But the basic assumption that we had
- 21:58for the multinomial case was probability of a given y
- 22:02is the i-th row of the misclassification matrix, right?
- 22:05And for categorical data, a probability statement
- 22:10is the same as an expectation statement, right?
- 22:12So we can equivalently write this
- 22:14as expectation of a given y
- 22:16is the i-th row of M.
- 22:19The advantage of the expectation statement
- 22:20is that it's more generally applicable.
- 22:23It will not be just for categorical data, right?
- 22:27So for categorical data, they're equivalent.
- 22:30For other data types, this statement can be valid
- 22:33even though the previous statement may not be applicable.
- 22:37So we kind of write this as our model
- 22:41for the compositional data and we make no other assumptions
- 22:45about this distribution.
- 22:46So only a first moment conditional expectation statement
- 22:53without any full distributional specification.
- 22:59So what do we do?
- 23:00So we have expectation of a given y
- 23:03is the i-th row of the misclassification matrix.
- 23:08We can use something called
- 23:10the Kullback-Leibler divergence
- 23:12or the cross entropy loss
- 23:14between a and its modeled expectation.
- 23:17So all the conditional signs are missing here.
- 23:22So basically a is the data we observe,
- 23:26this is the modeled expectation,
- 23:29which is basically the i-th
- 23:30row of the misclassification matrix
- 23:31and we use the cross entropy loss,
- 23:34the Kullback-Leibler loss, between the two.
- 23:37What's the advantage?
- 23:38So first of all,
- 23:39the Kullback-Leibler loss allows zeroes in the composition.
- 23:42So it is well-defined even if you have zeroes or ones.
- 23:45If you take the negative loss and exponentiate it,
- 23:48it's exactly the multinomial likelihood.
- 23:50So if your data is indeed multinomial,
- 23:52you get back your likelihood that you're using
- 23:54for your single class model.
- 23:57But if your data is not multinomial,
- 24:00you get a pseudo likelihood that you can work with.
- 24:04If you take the derivative of the loss function
- 24:07and take the expectation under the true parameter,
- 24:10you'll see that it's a valid score function
- 24:13in the sense that you get an unbiased estimating equation
- 24:16for your misclassification rate matrix, M,
- 24:19based on just the first-moment assumption.
- 24:23And then similarly, you can do the same thing
- 24:25for the unlabeled data.
- 24:27The probability statement becomes expectation statement
- 24:30and then we have the Kullback-Leibler loss.
- 24:32This is an unbiased estimating equation for both M and p.
- 24:36And again,
- 24:38if the data is truly multinomial and not compositional,
- 24:41this becomes exactly the multinomial likelihood.
- 24:43If the data is compositional,
- 24:45it becomes a pseudo likelihood.
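
The two properties mentioned here (zeros are allowed, and a single-cause observation gives back the multinomial likelihood) can be checked numerically. This is a sketch under assumptions: `kl_loss` is a hypothetical helper using the convention 0 log 0 = 0, and the vector `mu` stands in for a modeled expectation such as one row of M.

```python
import numpy as np

def kl_loss(a, mu):
    """Kullback-Leibler divergence from observed composition a to modeled
    expectation mu, with the convention 0 * log(0) = 0."""
    a = np.asarray(a, dtype=float)
    mu = np.asarray(mu, dtype=float)
    positive = a > 0
    return float(np.sum(a[positive] * np.log(a[positive] / mu[positive])))

mu = np.array([0.2, 0.5, 0.3])          # hypothetical modeled expectation

one_hot = np.array([0.0, 1.0, 0.0])     # categorical (single-cause) output
pseudo_lik_one_hot = np.exp(-kl_loss(one_hot, mu))

composition = np.array([0.7, 0.3, 0.0]) # compositional output with a zero
pseudo_lik_comp = np.exp(-kl_loss(composition, mu))

print(pseudo_lik_one_hot, pseudo_lik_comp)
```

For the single-cause vector, the exponentiated negative loss equals the modeled probability of the observed category, which is the multinomial likelihood of one draw. For the composition it is a finite pseudo likelihood value even though one component is exactly zero.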
- 24:50Okay, so how do we do Bayes analysis
- 24:52with pseudo likelihoods?
- 24:54So this is where this idea of generalized Bayes
- 24:57or model-free Bayesian inference comes in
- 24:59and there have been parallel developments
- 25:01in computer science, econometrics and statistics
- 25:04without much communication among the three fields
- 25:07for the last 30, 40 years.
- 25:10Basically, if you're given a loss function
- 25:13without being given a full likelihood for the data,
- 25:15you can take negative of that loss function
- 25:18multiplied by some tuning parameter, alpha,
- 25:22exponentiate it and treat it as a pseudo likelihood
- 25:26and apply your priors
- 25:27and then your posterior is going to be proportional to this
- 25:30as long as the normalization constant exists.
- 25:33And there has been a lot of work that has shown
- 25:35that this is a valid posterior,
- 25:38it is a generalization of the Bayesian posterior,
- 25:41like if this is an actual likelihood,
- 25:42this is the Bayesian posterior,
- 25:44but if it's not an actual likelihood,
- 25:48it has been shown that it basically minimizes
- 25:49the Bayes risk for that loss function.
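
The construction described here can be written compactly. The symbols θ for the parameters, ℓ for the loss, α for the tuning parameter and π for the prior are notation assumed for this transcript.

```latex
\pi_\alpha(\theta \mid \text{data}) \;\propto\;
\exp\{-\alpha\, \ell(\theta; \text{data})\}\; \pi(\theta)
```

When ℓ is a negative log-likelihood and α = 1, this is the ordinary Bayesian posterior; otherwise it is the generalized (Gibbs) posterior, defined whenever the normalizing constant is finite.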
- 25:54It has nice asymptotic properties
- 25:56shown by Victor Chernozhukov in this paper
- 25:59and then in this JRSS-B paper in 2016 I think
- 26:04it showed that if you're given a loss function
- 26:06and a prior,
- 26:07this is the only coherent way you can get a posterior.
- 26:12So there's now been a lot of work and it's been called
- 26:15by different names like Gibbs posteriors,
- 26:17pseudo posterior, Laplace-type estimators
- 26:20and quasi-Bayesian estimators along with generalized Bayes.
- 26:25So for our case, we have the pseudo likelihood
- 26:28for the labeled data.
- 26:29We have the pseudo likelihood for the unlabeled data.
- 26:32We put priors.
- 26:33If all of our data were categorical,
- 26:35this reduces to that multinomial model we had
- 26:38for the categorical data.
- 26:39But if some of the data is compositional,
- 26:41then this becomes generalized Bayes,
- 26:44so we call it generalized Bayes quantification learning.
- 26:47It allows sparsity of the outputs in the sense
- 26:50that if some of the data have zeroes and ones in them,
- 26:54this is well-defined.
- 26:56It's the same pseudo likelihood
- 26:58for categorical and compositional predictions.
- 27:01And then it also allows
- 27:02a nice Gibbs sampler using conjugacy.
- 27:11One final sort of data aspect we had
- 27:15was that this minimally invasive tissue sampling
- 27:18was also sometimes inconclusive in the sense
- 27:21that they gave two causes.
- 27:22Like often, they were ambiguous between HIV and tuberculosis
- 27:29and they would give one as the immediate cause
- 27:31and one as the underlying cause.
- 27:32So sometimes, even the true cause of death is compositional.
- 27:36So your predicted cause of death is compositional,
- 27:39your true cause of death is also compositional
- 27:41and we call it like b, which represents the belief.
- 27:45And you can show that if you're only given b
- 27:49instead of a single cause of death,
- 27:53your conditional expectation becomes M transpose b
- 27:56instead of the i-th row of the M matrix.
- 27:59And you can do the same thing
- 28:01using the compositional true cause of death
- 28:05instead of the actual true cause of death.
- 28:08And all the conditional signs are missing here
- 28:10but you can just formulate the Kullback-Leibler loss
- 28:14to generate the pseudo likelihood.
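
The statement that the conditional expectation becomes M transpose b can be illustrated numerically. The matrix `M`, the cause ordering (HIV, tuberculosis, other) and the helper name are hypothetical.

```python
import numpy as np

# Hypothetical misclassification matrix over (HIV, tuberculosis, other).
M = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

def expected_prediction(M, b):
    """Modeled expectation of the predicted composition given belief b."""
    return M.T @ np.asarray(b, dtype=float)

# A conclusive reference cause (all belief on HIV) picks out row 0 of M.
single = expected_prediction(M, [1.0, 0.0, 0.0])

# An ambiguous reference cause split evenly between HIV and tuberculosis.
mixed = expected_prediction(M, [0.5, 0.5, 0.0])

print(single, mixed)
```

A single-cause belief reproduces the corresponding row of M, so the single-cause model is a special case, and a mixed belief gives the matching mixture of rows.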
- 28:19So this kind of gave rise to a digression
- 28:22where we kind of looked at this is basically
- 28:25your true cause of death is a compositional covariate
- 28:28and your predicted cause of death is a compositional output.
- 28:31So we kind of looked at regression
- 28:33of a compositional outcome on compositional predictors.
- 28:36So this was kind of an offshoot paper
- 28:40where we just developed this piece
- 28:42and if you look at compositional regression,
- 28:45most of the work has been done using Dirichlet models
- 28:50or log ratio transformations.
- 28:52So this was a different approach to that in the sense
- 28:55that it's both transformation free
- 28:57and it doesn't specify a whole distribution
- 28:59like the Dirichlet,
- 29:00it just uses a first-moment assumption.
- 29:02And we have an R package to do regression on compositions,
- 29:07to do composition-on-composition regression, called codalm.
- 29:12But going back to the verbal autopsy work,
- 29:16we have the loss functions
- 29:17for the labeled and unlabeled data,
- 29:20we form the pseudo likelihoods from the negative losses,
- 29:23put priors on the parameters and we get posterior inference.
- 29:28One last extension of the methodology
- 29:31was that there are multiple different
- 29:34verbal autopsy algorithms and there are papers
- 29:36where every new algorithm comes out and they say
- 29:39they're better than all the previous algorithms.
- 29:41And in practice, you never know which is the best algorithm.
- 29:44So we developed an ensemble method that takes in predictions
- 29:49from multiple algorithms, estimates
- 29:54algorithm-specific misclassification rates
- 29:57and then connects them to the unknown estimand.
- 30:00So we can show that it gives more weight
- 30:04to the more accurate algorithm in a data-driven way.
- 30:07And then you're not kind of,
- 30:10you don't have to make the choice
- 30:12of which is the best algorithm in advance.
- 30:14If you have multiple candidates,
- 30:15you can use multiple algorithms together.
- 30:23So we looked at some theoretical properties of the method.
- 30:26We have two loss functions, one for the labeled data,
- 30:29one for the unlabeled data.
- 30:31The labeled data
- 30:32doesn't even feature the estimand, which is p,
- 30:36so it will, on its own, it cannot identify p.
- 30:39The unlabeled data only uses p through this quantity,
- 30:43M transpose p.
- 30:44So again, for different combinations of M and p,
- 30:48as long as this product is the same,
- 30:50it will never be able to identify p on its own.
- 30:53So each loss function on its own
- 30:54cannot identify the true parameters.
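
The identifiability problem for the unlabeled data alone can be shown with two hypothetical parameter pairs that produce the same distribution of predicted causes. The matrices and vectors below are made up for illustration.

```python
import numpy as np

# Pair 1: an imperfect classifier and one set of cause fractions.
M1 = np.array([[0.7, 0.3],
               [0.2, 0.8]])
p1 = np.array([0.2, 0.8])

# Pair 2: a perfect classifier and a different set of cause fractions.
M2 = np.eye(2)
p2 = np.array([0.3, 0.7])

q1 = M1.T @ p1
q2 = M2.T @ p2
print(q1, q2)
```

Both pairs give the same predicted-cause distribution, so unlabeled data cannot tell them apart; the labeled data is what pins down M.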
- 30:57But using both the loss functions together,
- 30:59you can identify the estimand, p,
- 31:02and we were able to show that the posterior has nice properties
- 31:06in terms of asymptotic normality,
- 31:08well-calibrated interval estimates
- 31:11and near-parametric concentration rates.
- 31:13And the theory also extends to the ensemble method
- 31:16and we use some approximations and we give a sampler,
- 31:19and the theory holds for that.
- 31:24Some empirical validations,
- 31:27since we're estimating a probability vector,
- 31:32the common metric that is used is called
- 31:34this chance-corrected normalized absolute accuracy,
- 31:38which is basically a scaled L1 error,
- 31:42centered by the L1 error you would get if you had predicted
- 31:46the cause of death randomly.
- 31:47So this is the error if you predict randomly
- 31:50and then we look at how much improvement we get
- 31:52over random predictions.
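
The metric described above can be sketched as follows. This assumes the form commonly cited in the verbal autopsy literature: the L1 error is scaled by its largest possible value, and the chance level is taken to be roughly 0.632. Both the function names and the 0.632 constant are assumptions here, not something stated in the talk.

```python
import numpy as np

def csmf_accuracy(p_hat, p_true):
    """Scaled L1 accuracy: 1 minus the L1 error over its maximum possible value."""
    p_hat = np.asarray(p_hat, dtype=float)
    p_true = np.asarray(p_true, dtype=float)
    l1_error = np.abs(p_hat - p_true).sum()
    # The worst case puts all mass on the least likely true cause.
    max_error = 2.0 * (1.0 - p_true.min())
    return 1.0 - l1_error / max_error

def ccnaa(p_hat, p_true, chance=0.632):
    """Chance-corrected accuracy; `chance` is an assumed random-guess baseline."""
    return (csmf_accuracy(p_hat, p_true) - chance) / (1.0 - chance)

p_true = [0.5, 0.3, 0.2]
print(ccnaa(p_true, p_true))           # perfect estimate
print(ccnaa([0.4, 0.4, 0.2], p_true))  # imperfect estimate
```

A value of 1 means a perfect estimate, 0 means no better than random allocation, and negative values mean worse than random.
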
- 31:57So this is an illustration of what happens if the data
- 32:01is not Dirichlet and you use Dirichlet distribution.
- 32:03So on the left-hand side,
- 32:05the data is generated from Dirichlet
- 32:08and we use both our method and the Dirichlet-based model
- 32:12and they both do well.
- 32:14On the right-hand side,
- 32:15the data is from an overdispersed Dirichlet
- 32:17and we use both the Dirichlet and our model.
- 32:20And because our model doesn't specify a distribution,
- 32:22it just uses a first moment specification,
- 32:25it's much more robust and has much higher accuracy
- 32:29than the Dirichlet, which becomes misspecified.
- 32:35And then we also did a bunch of evaluations
- 32:37using the PHMRC data.
- 32:38So what we did was we trained the classifiers
- 32:42on three of the countries leaving one country out
- 32:44and then used a slice of data from that left out country
- 32:47to estimate the misclassification rates,
- 32:50and then we apply our method.
- 32:55The green one is our method
- 32:56and the x axis is the sample size of the dataset
- 33:02used from the left out country
- 33:04to estimate the misclassification rates.
- 33:07The blue one is sort of the uncalibrated one,
- 33:11the red one is the one that is calibrated
- 33:13using the training data.
- 33:14So you can see that our method does better than both of them
- 33:18and the higher the sample size we use
- 33:20from the left out country of interest
- 33:23to estimate the misclassifications, the more accurate it is.
- 33:30And also one interesting aspect
- 33:31was that we looked at calibration
- 33:33using individual algorithms and the calibration
- 33:36using the ensemble one.
- 33:37And more often than not, the ensemble one,
- 33:40which is the orange one,
- 33:42tends to perform similar to the best performing algorithm,
- 33:46and the best performing algorithm can be very different
- 33:48across different countries.
- 33:50For example, in Mexico,
- 33:51InSilicoVA is one of the best performing algorithms,
- 33:54but in Tanzania, InSilicoVA was doing very poorly
- 33:57and then InterVA was one
- 33:59of the better performing algorithms.
- 34:00So the ensemble always tends to give more weight
- 34:03to more accurate algorithms.
- 34:07So this is an overview of what we did for Mozambique.
- 34:10So we had the unlabeled data with only verbal autopsies.
- 34:14We passed it through two algorithms,
- 34:16InSilicoVA and Expert VA, to get the uncalibrated estimates.
- 34:21Then we had the labeled data with the MITS cause of death
- 34:23with which we estimated the misclassifications
- 34:25of those two algorithms
- 34:28and then we combine them in the ensemble method
- 34:30and get calibrated estimates.
- 34:38Some results from Mozambique.
- 34:40We have two age groups,
- 34:42neonatal deaths, first four weeks,
- 34:45and children, that's under five years.
- 34:48Two algorithms, seven causes of death for children,
- 34:52five causes of death for neonates.
- 34:55I'm going to just show the neonatal results here.
- 34:57So these are the misclassification matrices for neonates.
- 35:01And ideally, you would want the matrices
- 35:03to have large numbers on the diagonals
- 35:05because those are the correct matches
- 35:07and then small numbers on the off diagonals.
- 35:09But you don't see that,
- 35:10you see quite a bit of large numbers on the off diagonals.
- 35:14One thing that stands out is that
- 35:17if you look at prematurity, it has a very high sensitivity,
- 35:20close to 90%,
- 35:22which means that if the true cause is prematurity,
- 35:25the verbal autopsy correctly diagnoses it.
- 35:28But then it also has high false positives
- 35:31in the sense that if the true cause is infection,
- 35:3420% of the time, it is assigned as prematurity.
- 35:37If the true cause is intrapartum related events,
- 35:40almost 30% of the time,
- 35:41it's assigned to be prematurity and so on.
- 35:43So it tends to over count a lot of deaths
- 35:46from different causes as prematurity.
- 35:48So what would be the result after calibration
- 35:52is that the percentage of prematurity comes down.
- 35:54So this is the uncalibrated estimate of prematurity.
- 35:58This is the calibrated estimate of prematurity.
- 36:01You can see that it comes down
- 36:02because we can see in the data that there is a lot
- 36:05of over counting of prematurity deaths.
- 36:09So after calibration, it tends to come down quite a bit.
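
To illustrate why over-counting pulls the calibrated prematurity share down, here is a small sketch with a hypothetical three-cause misclassification matrix. The prematurity column loosely mirrors the rates mentioned above (about 90% sensitivity, 20% and 30% false assignment), but every other number is invented. The plain matrix inversion used here is a stand-in for illustration only: it is not the generalized Bayes method from the talk, and it ignores the uncertainty in the estimated matrix.

```python
import numpy as np

causes = ["prematurity", "infection", "intrapartum"]

# Hypothetical misclassification matrix: row = true cause, column = VA cause.
M = np.array([[0.90, 0.05, 0.05],
              [0.20, 0.70, 0.10],
              [0.30, 0.10, 0.60]])

# Hypothetical true cause-specific mortality fractions.
p_true = np.array([0.30, 0.40, 0.30])

# What the uncalibrated classifier would report on average.
q_uncal = M.T @ p_true

# Naive calibration: solve M^T p = q for p (illustration only).
p_cal = np.linalg.solve(M.T, q_uncal)

for name, q, p in zip(causes, q_uncal, p_cal):
    print(f"{name:12s} uncalibrated={q:.3f} calibrated={p:.3f}")
```

In this toy setting the uncalibrated prematurity share is inflated by deaths from the other two causes, and calibration brings it back down.
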
- 36:17And also, we looked at the model estimated sensitivities
- 36:22using both the single cause
- 36:24and the compositional cause of death data.
- 36:27So this is the difference in the sensitivities
- 36:29and you can see that using the compositional cause of death,
- 36:33you'll always get a higher match because it kind of uses
- 36:36information from multiple causes instead of
- 36:39just considering the top cause.
- 36:41And so it generally leads to better matching
- 36:43between the verbal autopsy and the minimal tissue sampling.
- 36:49Some ongoing work.
- 36:51So when we did this for Mozambique,
- 36:53there was very little paired data.
- 36:57So even though the data was for seven countries,
- 36:59we kind of merged them together
- 37:01and estimated the misclassification rates.
- 37:04Now we have more data coming in for those countries
- 37:07so we have a chance to assess
- 37:08whether the misclassification rates vary by country
- 37:12because if they do,
- 37:12we should model the misclassification rates
- 37:17in a way that's specific to each country.
- 37:21So these are the misclassification rates now
- 37:26resolved by country.
- 37:27So there are six countries, Bangladesh, Ethiopia,
- 37:30Kenya, Mali, Mozambique and Sierra Leone.
- 37:35You can see the estimates.
- 37:36These are the empirical estimates
- 37:37and the confidence intervals for each country.
- 37:40And the horizontal black line
- 37:42is what the pooled estimate looks like.
- 37:44So you can see that there is for some causes like here,
- 37:49there is not much variability across countries.
- 37:51But then for some other cause pairs, like say here,
- 37:56there's quite a bit of variability across countries.
- 38:00And so now that we are getting more data,
- 38:03the next step for the project
- 38:05is to estimate country-specific misclassification rates.
- 38:09The issue however is that even with more data,
- 38:12there are, I think, around 600 cases here for six countries,
- 38:17which is approximately 100 cases per country.
- 38:20And there are 25 cells of the misclassification matrix.
- 38:23So that's like four cases per cell,
- 38:25so that's clearly not enough to do separate
- 38:27country specific models.
- 38:30So we'd have to kind of do
- 38:32a sort of a borrowing of information
- 38:35both across the rows and columns of the matrix
- 38:38but also across different countries.
- 38:42So what we do first is first, we kind of borrow information
- 38:45across the rows and columns of the matrix.
- 38:49And to do this, we start with a,
- 38:52instead of an unstructured misclassification matrix
- 38:55where we estimated each cell separately,
- 38:57we start with a structured misclassification matrix
- 39:00using two basic mechanisms.
- 39:02So we say that a classifier operates using two mechanisms,
- 39:07for a given cause, it can either match that cause
- 39:12and we call that an intrinsic accuracy
- 39:15and that matching probability will be different
- 39:18for different causes, so there are three causes here,
- 39:20and you can see
- 39:21that the matching probability can be different.
- 39:24If it doesn't match the true cause,
- 39:26then it randomly distributes its prediction
- 39:29to the other causes
- 39:31and that random distribution will also have some weights,
- 39:36and those we call the systematic bias
- 39:38or the pull of the classifier.
- 39:40So if it's not matching,
- 39:42we saw that it'll often assign a cause to prematurity
- 39:46regardless of what the true cause is.
- 39:48So that's kind of the basis for this model.
- 39:51And if you have this model,
- 39:52we kind of rearrange these three bars here
- 39:57and then we put in the circle from there.
- 39:59And these will give you the misclassification probabilities.
- 40:03So we can write each of the misclassification probabilities
- 40:08in terms of just these six parameters and we can do the same
- 40:13for the green cause and for the blue cause.
- 40:17And so basically, these are the nine misclassification rates
- 40:22written in terms of the six parameters.
- 40:23So this is not that much of a dimension reduction
- 40:26if there are three causes,
- 40:27but if there are in general C causes,
- 40:32this model for misclassification matrix will only have
- 40:342C - 1 parameters as opposed to C square parameters.
- 40:39So in practice, we use seven causes for children
- 40:43and five causes for neonates,
- 40:44so this leads to a lot of dimension reduction.
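
The two-mechanism structure above can be sketched as follows. The exact functional form is an assumption: each row matches the true cause with its intrinsic accuracy and otherwise draws a predicted cause from a shared pull vector that sums to one, so the pull may also land on the true cause. The function name and the numbers are hypothetical.

```python
import numpy as np

def structured_misclassification(accuracy, pull):
    """Build M with M[i, j] = a[i] * 1(i == j) + (1 - a[i]) * w[j] (assumed form)."""
    a = np.asarray(accuracy, dtype=float)
    w = np.asarray(pull, dtype=float)
    if a.ndim != 1 or a.shape != w.shape:
        raise ValueError("accuracy and pull must be 1-D arrays of equal length")
    if not np.isclose(w.sum(), 1.0):
        raise ValueError("pull weights must sum to 1")
    # Diagonal part: direct matches. Outer product: pull-driven assignments.
    return np.diag(a) + (1.0 - a)[:, None] * w[None, :]

a = np.array([0.6, 0.4, 0.5])   # hypothetical intrinsic accuracies
w = np.array([0.5, 0.3, 0.2])   # hypothetical pull, first cause favored
M = structured_misclassification(a, w)

n_causes = len(a)
n_free_structured = 2 * n_causes - 1   # C accuracies + (C - 1) free pull weights
n_cells = n_causes ** 2                # cells in the unstructured matrix
print(M)
print(n_free_structured, n_cells)
```

With seven causes the same count is 13 parameters against 49 cells, which is where the dimension reduction becomes substantial.
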
- 40:49And one of the justifications
- 40:53for this dimension reduced model
- 40:54is that if this model is true then the misclassification
- 40:59into different causes,
- 41:01the odds of misclassification into two causes, j and k,
- 41:05will not depend on what the true cause is.
- 41:08And we do see that in the data.
- 41:10So these are different cause pairs, j and k,
- 41:13and these are the odds for what the true cause is.
- 41:17So we are plotting the misclassification rates,
- 41:20mij over mik.
- 41:22So this is j and k
- 41:24and the colors here give you i.
- 41:26So you do see that they do not vary
- 41:28for different choices of i,
- 41:30it only is specific to j and k,
- 41:32and that's an equivalent characterization
- 41:36of that systematic preference
- 41:39and intrinsic accuracy model that we have,
- 41:41so we do see that reflected in the data.
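
The odds property can be written out under an assumed form of the model, where $a_i$ is the intrinsic accuracy for true cause $i$ and $w_j$ is the pull toward predicted cause $j$ (these symbols are introduced here for illustration and do not appear in the talk):

```latex
m_{ij} = a_i \,\mathbb{1}(i = j) + (1 - a_i)\, w_j,
\qquad \sum_{j} w_j = 1 .

\text{For } j \neq i \text{ and } k \neq i: \quad
\frac{m_{ij}}{m_{ik}}
  = \frac{(1 - a_i)\, w_j}{(1 - a_i)\, w_k}
  = \frac{w_j}{w_k} .
```

The factor $(1 - a_i)$ cancels, so the odds of misclassifying into $j$ versus $k$ do not depend on the true cause $i$.
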
- 41:44But we don't impose that as a fixed model.
- 41:49So this is the base model.
- 41:51We allow some deviation from it, or shrinkage towards it,
- 41:54and there's a tuning parameter.
- 41:56So then we get the homogeneous model
- 41:58and then we have a deviation from the homogeneous model
- 42:01to get the country-specific model.
- 42:03So that's the broad idea,
- 42:04I won't go into the modeling details.
- 42:07And these are the predictions
- 42:09using the country specific model.
- 42:13I won't go into details here, but there are many cases,
- 42:15for example, take it here,
- 42:17star is the empirical rate,
- 42:19triangle is the heterogeneous model.
- 42:24And you can see it does much better
- 42:26than the horizontal line, which is the homogeneous model.
- 42:30And we do see it throughout the misclassification rates.
- 42:36These are the estimates for Bangladesh.
- 42:38So the red density is the pooled estimate
- 42:41or the homogeneous estimate.
- 42:43The blue density is the Bangladesh specific estimate.
- 42:48The dotted vertical line
- 42:50is the empirical estimate for Bangladesh
- 42:52and the solid vertical line
- 42:53is the pooled empirical estimate.
- 42:56So you can see that as we get
- 42:59more and more data from Bangladesh,
- 43:01the country specific estimate moves away
- 43:03from the pooled estimate
- 43:04towards the country-specific empirical estimate.
- 43:06So basically the hope is that going forward,
- 43:12we will have much more data within each country
- 43:14and we'll have estimates that are much closer
- 43:16to the dotted lines than the solid lines.
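
The behavior described above, where a country-specific estimate moves from the pooled value toward that country's empirical rate as data accumulate, can be illustrated with a simple Beta-binomial shrinkage sketch. This is a swapped-in textbook device, not the hierarchical model from the talk, whose details were not presented. The function name, the prior strength, and all numbers are hypothetical.

```python
def shrunk_rate(successes, n, pooled_rate, prior_strength=20.0):
    """Posterior mean of a rate under a Beta prior centered at the pooled rate.

    prior_strength acts like a number of pseudo-observations (an assumption).
    """
    prior_successes = pooled_rate * prior_strength
    return (successes + prior_successes) / (n + prior_strength)

pooled = 0.30      # hypothetical pooled misclassification rate
empirical = 0.60   # hypothetical country-specific empirical rate

sample_sizes = (10, 100, 1000)
estimates = [shrunk_rate(empirical * n, n, pooled) for n in sample_sizes]

for n, est in zip(sample_sizes, estimates):
    print(f"n={n:5d}  country-specific estimate={est:.3f}")
```

With few cases the estimate stays near the pooled value, and with many cases it approaches the country's own empirical rate.
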
- 43:22So that's the summary.
- 43:23So in general, these cause of death classifiers
- 43:26are super inaccurate.
- 43:28So we need to calibrate for that and we have limited data
- 43:31to estimate their inaccuracy,
- 43:32so we calibrate them in a Bayesian way.
- 43:36The methods give probabilistic cause of death
- 43:39instead of categorical cause of death.
- 43:40So we develop a generalized Bayes approach
- 43:43that is equivalent to a multinomial model
- 43:45if the data is categorical.
- 43:47But if it's not categorical, it becomes a pseudo likelihood
- 43:50Bayesian approach for compositional data
- 43:54and that allows zeroes and ones in the data
- 43:57and is not kind of dependent on the model specification.
- 44:02And then it kind of led to this independent development
- 44:05of the composition on composition regression.
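
The link between the compositional loss and the multinomial model can be sketched as below. The cross-entropy form of the loss is an assumption made for illustration, since the formula is not spelled out in this part of the talk, and the function name and data are hypothetical. The point shown is that with one-hot (categorical) classifier outputs the loss equals the multinomial negative log-likelihood, while fractional outputs, including exact zeroes and ones, are handled by the same expression.

```python
import numpy as np

def compositional_loss(B, M, p):
    """Assumed loss: -sum_r sum_j B[r, j] * log((M^T p)[j])."""
    B = np.asarray(B, dtype=float)
    q = np.asarray(M, dtype=float).T @ np.asarray(p, dtype=float)
    # The log is taken of q, never of B, so zeroes in B cause no problem.
    return -float(np.sum(B * np.log(q)))

M = np.array([[0.8, 0.2],
              [0.4, 0.6]])   # hypothetical misclassification matrix
p = np.array([0.5, 0.5])     # hypothetical cause distribution

# Categorical outputs: each row is a single predicted cause.
B_onehot = np.array([[1, 0], [1, 0], [0, 1]])
counts = B_onehot.sum(axis=0)
multinomial_nll = -float(np.sum(counts * np.log(M.T @ p)))
loss_onehot = compositional_loss(B_onehot, M, p)

# Compositional outputs: each row is a probability vector over causes.
B_comp = np.array([[0.7, 0.3], [1.0, 0.0], [0.2, 0.8]])
loss_comp = compositional_loss(B_comp, M, p)

print(loss_onehot, multinomial_nll, loss_comp)
```

A generalized Bayes posterior would then be proportional to the prior times the exponential of the negative loss, without committing to a full distribution for the compositional outputs.
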
- 44:09Some papers and software.
- 44:10So the single cause paper was the first one,
- 44:13then we extend it to compositional data
- 44:17and develop the theory for it.
- 44:19The package for calibration is available on GitHub
- 44:22and then the composition on composition regression
- 44:25was a separate piece
- 44:26and we have the coda linear model package for it on CRAN.
- 44:30And then we use this approach
- 44:32to produce calibration estimates
- 44:36for neonate and children deaths in Mozambique
- 44:39which were published in the last three papers.
- 44:41Thank you.
- 44:51<v ->Questions? Yes.</v>
- 44:53<v ->So I just had a quick question 'cause you were saying</v>
- 44:55the model basically looks at the symptoms
- 44:58that'll be able to predict which it would be.
- 45:00Does it also factor in what diseases and stuff
- 45:04are most common in those areas or does it kind of just-
- 45:07<v ->Oh, very good question.</v>
- 45:09It does factor it in but in a very crude way
- 45:12in the sense that the models have some settings
- 45:14called like high malaria, low malaria or high HIV, low HIV.
- 45:18So depending on which country you're running it,
- 45:21you will set the setting to like high HIV country
- 45:24or low HIV country, the same for malaria,
- 45:27but it doesn't do anything beyond that,
- 45:30so only at a very coarse level.
- 45:34<v ->Causes of death or.</v>
- 45:37<v ->So the ICD-10 classification</v>
- 45:40will have around 30 plus causes of death
- 45:42for children and neonates,
- 45:44I think much more for adults.
- 45:47There are no MITS for adults.
- 45:48MITS was only done for children and neonates,
- 45:51only now adult MITS are being started,
- 45:54but we have to kind of group them into broader categories
- 45:57because if you have 30 causes,
- 45:59your misclassification matrix will be 30 times 30.
- 46:02So we don't have the data to do estimation
- 46:05at that fine resolution.
- 46:06So we group them into broader categories.
- 46:08So seven for children, five for neonates.
- 46:11<v ->Is one of the categories, I have no idea,</v>
- 46:14it is totally unknown.
- 46:15And if so, is that different from the uniform distribution
- 46:18across causes of death?
- 46:21<v ->That would be the uniform distribution.</v>
- 46:23There is no category which is, I have no idea,
- 46:25but it'll be probably reflected in a score that is very flat
- 46:28across the causes.
- 46:30<v ->If you think there are seven causes of death</v>
- 46:32and I'm working with the same dataset
- 46:34and I think there are 100 causes of death,
- 46:36will there be substantial differences in our marginal
- 46:39estimates of probability?
- 46:41Because our uniform posteriors
- 46:45place such different amounts of mass across the say
- 46:4830 versus 100 causes of death.
- 46:51<v ->Yes, there will be differences</v>
- 46:54and even when we are aggregating from the 30 causes
- 46:58to seven causes, the assumption is that within each category
- 47:02the misclassification rates are homogeneous
- 47:04across the finer causes.
- 47:05So that is an assumption that we're working with.
- 47:08So definitely, there will be differences.
- 47:11<v ->Thank you.</v>
- 47:16<v ->I have one more question.</v>
- 47:22I'll ask a philosophical question
- 47:23if I may. <v ->Sure, yeah.</v>
- 47:24<v ->You commented,</v>
- 47:26I don't know, about halfway through,
- 47:27about how statisticians are working on a thing.
- 47:32Computer scientists are working on the same thing.
- 47:34There's a third group I forget.
- 47:37And nobody talks to each other.
- 47:40Now, many of us are,
- 47:42many of the students here
- 47:44are within the data science track of biostatistics.
- 47:49By the way, love your Twitter handle.
- 47:52But yeah, so how do we bridge those things
- 47:56so that we take advantage of these things
- 47:57and it's not three separate versions of the same thing?
- 48:01<v ->I don't know if there's a systematic way.</v>
- 48:04Honestly, I came to know about much of the literature
- 48:08going through the revisions
- 48:09and one of the reviewers or associate editors said
- 48:11there is a lot of work here in the econometrics literature,
- 48:14you should take a look.
- 48:15And that's kind of the value
- 48:16of the peer review system I guess.
- 48:17And so we looked at it and yes, there was a lot of work
- 48:20and they just called it different things
- 48:22and so I had no idea
- 48:23when I was searching for that in the literature.
- 48:26And we did see the Victor Chernozhukov paper,
- 48:29I think it is in "Journal of Econometrics,"
- 48:30but it's basically an asymptotic statistics paper.
- 48:33It kind of shows that this generalized Bayes stuff,
- 48:36which they call Laplace-type estimators,
- 48:38has all these nice properties
- 48:40that a standard Bayesian posterior will have.
- 48:43But yeah, I think talking to more people
- 48:46and like interacting and telling about your work
- 48:49will kind of,
- 48:50and someone will say that, oh yeah, I do something similar.
- 48:52You should look at this paper,
- 48:55it's probably. <v ->Hopefully Twitter helps.</v>
- 48:57<v ->Sorry?</v>
- 48:58<v ->Hopefully Twitter helps.</v>
- 48:58<v ->Yeah, yeah, definitely.</v>
- 49:00Engagement through any like in-person
- 49:02or social media platform would be useful, yeah.
- 49:08<v ->All right, well thanks so much.</v>
- 49:08I think we're out of time so we'll stop it there.
- 49:12(attendant muttering indistinctly)
- 49:15Hope everybody has a wonderful fall break.
- 49:17See you next week.
- 49:19(attendants chattering indistinctly)
- 49:37<v Learner>The other organizer.</v>
- 49:38(learner muttering indistinctly)
- 49:39(attendants chattering indistinctly)
- 49:53<v ->Or maybe because they're susceptible.</v>
- 49:55(attendants chattering indistinctly)
- 50:04<v ->Thank you. Anyone else need to sign in?</v>
- 50:06(attendants chattering indistinctly)
- 50:19<v ->Infection but they're also premature babies.</v>
- 50:21(attendants chattering indistinctly)
- 50:30<v ->Premature, but also it's that</v>
- 50:32it's not a distinct.
- 50:34(attendants chattering indistinctly)
- 50:35<v ->Cause of death is very blurry in this day.</v>
- 50:38<v ->Is that part of why like.</v>
- 50:40(attendants chattering indistinctly)
- 50:46<v ->'Cause a symptom given cause session</v>
- 50:49with that much of variation across country.
- 50:51<v Learner>Cause.</v>
- 50:52(learner muttering indistinctly)
- 50:53Cause.
- 50:54<v ->Reporting depends on who is answering.</v>
- 50:58(attendants chattering indistinctly)
- 51:04<v ->You need to go next.</v>
- 51:05<v ->Back to.</v>
- 51:09<v ->I guess, yeah.</v>
- 51:10You need one of us to let you.
- 51:12(lecturer muttering indistinctly)
- 51:14<v ->It might be a short answer.</v>
- 51:16Yeah, and it's short answer.
- 51:17(attendants chattering indistinctly)
- 51:20<v ->I don't have to, will you? (laughs)</v>