# YSPH Biostatistics Seminar: “Generalized Bayes Calibration of Compositional Cause-specific Mortality Data from Verbal Autopsies"

October 19, 2023

## Information

Abhirup Datta, PhD, Associate Professor, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health

October 17, 2023


- 00:00And welcome.
- 00:02Today, it is my pleasure to introduce Professor Abhi Datta
- 00:09from Johns Hopkins University in Baltimore, Maryland.
- 00:13Professor Datta earned his BS and MS
- 00:15from the Indian Statistical Institute
- 00:17in 2008 and 2010 respectively,
- 00:20and PhD from the University of Minnesota in 2016.
- 00:25In addition to being a well-cited researcher
- 00:27with one publication that has almost 600 citations,
- 00:30which is pretty nice,
- 00:32he's also an award-winning educator,
- 00:35having repeatedly won an excellence in teaching award
- 00:37from his institution.
- 00:39So let's welcome Dr. Datta.
- 00:44Thank you, Robert,
- 00:45for the invitation to come here and give the seminar,
- 00:48and for the very nice introduction.
- 00:50Thank you everyone for coming.
- 00:52My talk is about improving cause-specific mortality data
- 00:56in low and middle-income countries
- 00:58where the main tool to collect data
- 01:00is something called verbal autopsies.
- 01:02And the way I do it
- 01:03is using a statistical approach called generalized Bayes.
- 01:07If you have not heard
- 01:08of verbal autopsies or generalized Bayes,
- 01:11I can tell you that I hadn't heard of either of those things
- 01:14when I started working on the project,
- 01:17so don't worry about that,
- 01:18I'll try to give an introduction.
- 01:20'Cause I mostly work on spatial and spatiotemporal data
- 01:24and this was a project that came along,
- 01:27which is very different from what I used to work on.
- 01:29But over the years, there's been a nice body of work
- 01:31developed in this project.
- 01:35So this is a joint work
- 01:39with many different institutes and collaborators.
- 01:44The top row is the Hopkins biostats team,
- 01:46which included my former students,
- 01:48Jacob Fiksel and Brian Gilbert,
- 01:51and my current postdoc, Sandi,
- 01:53and my colleague, Scott Zeger, and I
- 01:56lead the biostats part of the team.
- 02:00Agbessi is the PI of the project in Mozambique
- 02:03that's sort of picked up developments for this work.
- 02:07And there are a lot of colleagues
- 02:09from the International Health Department
- 02:10that helped to collaborate.
- 02:12And then Li is the PI of a new project
- 02:16in which we're going to apply our methodology
- 02:17to produce mortality estimates for the WHO.
- 02:22So we're collaborating with Li there as well.
- 02:25And then a couple of people outside Hopkins,
- 02:27Dianna at CDC and Emory University,
- 02:31as the director of the CHAMPS project.
- 02:35And Ivalda, in the government body in Mozambique,
- 02:39is currently doing the work in Mozambique.
- 02:44So this is funded by three grants from the Gates Foundation.
- 02:49The first one was the grant that kind of started things.
- 02:52And then we have a grant that is developing more
- 02:55on the methods side of the work.
- 02:59So, many low and middle-income countries
- 03:05often lack high-quality data on causes of death.
- 03:08Often for most deaths,
- 03:10there is no sort of medical certification
- 03:13or like an autopsy done.
- 03:16And without kind of high-quality data
- 03:19on what people are dying of,
- 03:21it's kind of hard to estimate the disease burden
- 03:23in these countries.
- 03:25And specifically, the quantity of interest
- 03:27is the cause-specific mortality fraction,
- 03:29which is basically the percentage of deaths in an age group
- 03:34that can be attributable to a given cause.
- 03:38So cause-specific mortality fractions
- 03:40are key pieces of information
- 03:42in determining the global burden of disease,
- 03:44which in turn dictates sovereign policy,
- 03:47as well as like resource allocations
- 03:49for programs operating in these countries.
- 03:54So verbal autopsy is an alternate way
- 03:57to count deaths and attribute causes
- 03:59without actually doing a clinical autopsy.
- 04:02So verbal autopsy is basically
- 04:04a sort of a systematic interview
- 04:07of the household members of the deceased.
- 04:08So the government or the program has a set of field workers
- 04:12who go out and go from household to household
- 04:15and ask if anyone died in their household
- 04:17within the last several months.
- 04:18And if they died, what were the symptoms?
- 04:20And the set of questions they ask is now standardized
- 04:23by the WHO.
- 04:24Some example questions are here.
- 04:27Most of the questions would have binary answers
- 04:29like yes, no, but there are some questions
- 04:32that have more like continuous responses.
- 04:38So as I said, the WHO has standardized
- 04:41the verbal autopsy tool.
- 04:43The 2016 version has around 200 to 350 questions,
- 04:47depending on the age group.
- 04:48There are separate sections of the questionnaire
- 04:50for neonatal, child, and adult deaths.
- 04:54And if you're interested in more information
- 04:56about verbal autopsy, there's a page on the WHO website about it.
- 05:02So a verbal autopsy, of course,
- 05:04doesn't give you a cause of death,
- 05:05it just gives you a bunch of yes-no responses
- 05:08to various questions related to the symptoms.
- 05:14So a verbal autopsy is basically a survey questionnaire.
- 05:17So you can pass that survey through a computer software
- 05:20and that can give a predictive cause of death.
- 05:23And so there are a bunch
- 05:24of different computer software available.
- 05:27InSilicoVA, developed by Tyler McCormick
- 05:31and Richard Li, who was a postdoc here,
- 05:34and published in "JASA" in 2016,
- 05:36is, I think, one of the
- 05:37most statistically-principled approaches to do it.
- 05:40But there are other approaches and then you can,
- 05:43this is basically a classification problem.
- 05:45So you're basically given your data on symptoms,
- 05:48you're kind of classifying the cause of death
- 05:50as one of several causes.
- 05:51So you can use standard classifiers
- 05:54and machine learning approaches as well.
- 05:58OpenVA is an excellent resource
- 05:59to learn about verbal autopsies.
- 06:00Again, openVA is,
- 06:04I think Richard is one of the maintainers
- 06:06and creators of openVA.
- 06:11So the COMSA project in Mozambique,
- 06:14one of the main goals was to generate
- 06:17these cause-specific mortality fractions
- 06:21for neonates and under-five children
- 06:24for the country of Mozambique.
- 06:26And the data that we collected was a large dataset
- 06:30of verbal autopsy records
- 06:32for different households that were surveyed
- 06:34and here is a map of Mozambique
- 06:38where the green regions show
- 06:41the areas where the data was collected
- 06:43as part of the COMSA project.
- 06:44So in statistical terms, the data just has the symptoms,
- 06:49it doesn't have the true cause of death,
- 06:51so we call it the unlabeled data.
- 06:57So how do we go from unlabeled data to labels
- 07:00for the causes of death
- 07:01and then estimate these cause fractions?
- 07:04This is the standard procedure that is typically done
- 07:08and this is what we were supposed to do as well,
- 07:10which is simply take each record,
- 07:12pass it through the computer software
- 07:14and get a cause of death.
- 07:16And once you get a cause of death,
- 07:18then you can sort of simply aggregate.
- 07:19So in the toy example,
- 07:21three out of the six cases were assigned to be from HIV.
- 07:25And so the cause-specific mortality fraction for HIV
- 07:27would be 50% and similar for malaria and sepsis and so on.
- 07:32So that's the basic template
- 07:35of how to get cause-specific mortality fractions
- 07:38from verbal autopsies.
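The classify-and-count step just described can be sketched in a few lines (a toy illustration with made-up cause labels, not the project's actual pipeline):

```python
from collections import Counter

def classify_and_count(predicted_causes):
    """Estimate CSMFs by aggregating predicted causes of death."""
    counts = Counter(predicted_causes)
    n = len(predicted_causes)
    return {cause: count / n for cause, count in counts.items()}

# Toy example from the talk: three of six deaths predicted as HIV,
# so the HIV cause-specific mortality fraction is 50%.
preds = ["HIV", "HIV", "HIV", "Malaria", "Malaria", "Sepsis"]
csmf = classify_and_count(preds)  # csmf["HIV"] == 0.5
```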
- 07:39The question is, can we trust these estimates?
- 07:41Because these are not true causes of death
- 07:43as determined by a doctor or by a clinical procedure.
- 07:46These are cause of death predicted by an algorithm
- 07:48based on just surveying the household members
- 07:52of the deceased.
- 07:57So turns out machine learning has a name
- 08:00for this type of problem,
- 08:01it's called quantification learning,
- 08:04which is basically estimating population prevalence
- 08:07using predicted labels instead of true labels
- 08:10and the predictions are coming from a classifier.
- 08:13And so there has been some work in quantification learning
- 08:16and in the machine learning literature.
- 08:19So when we were working on this problem,
- 08:21we realized that estimating
- 08:22cause-specific mortality fractions
- 08:24using predicted cause of death data from verbal autopsy
- 08:27is an example of quantification learning.
- 08:31So just a sort of an overview of terms that we'll be using
- 08:35and the corresponding statistical notation.
- 08:37So our true cause of death is y which we do not observe.
- 08:42We want to estimate the population prevalence of y,
- 08:45so y is a categorical variable.
- 08:49And so probability of y or p
- 08:51is our cause-specific mortality fraction,
- 08:53which is the estimand.
- 08:55We observed the verbal autopsy, which is a,
- 08:57think of this as a high dimensional
- 09:00or a long list of yes-no answers
- 09:02to the verbal autopsy questions, so that is x,
- 09:06and this x is passed through a software
- 09:08to give a predicted label, which is a of x or simply a.
- 09:17So what we have in the COMSA project
- 09:21is simply an unlabeled dataset
- 09:25which uses these verbal autopsy responses,
- 09:28pass it through a software and get the predicted labels.
- 09:34We do not observe the true labels, y,
- 09:37we may or may not retain the verbal autopsy responses
- 09:40because those are identifiable data
- 09:42and those are often not released,
- 09:43so often, just the predicted cause of death is available.
- 09:47So even these covariates, x, may or may not be available.
- 09:50And then we are interested in estimating the probability
- 09:53that y belongs to one of the C many cause categories,
- 09:58so that's a quantity of interest.
- 10:05For some reason, there is a conditional sign
- 10:07that's missing there.
- 10:09But you can use the law of total probability
- 10:13to write the probability of the predicted cause of death,
- 10:16which is the a,
- 10:18probability of a as a sum of our probability of a given y
- 10:22times probability of y.
- 10:24So there's a conditional sign missing here,
- 10:26I don't know what's going on here.
- 10:32But the COMSA data,
- 10:33we only get information on the left-hand side, right?
- 10:36And we want to infer the quantity probability of y
- 10:41which would be the true CSMFs.
- 10:44So the left-hand side is the only known quantity,
- 10:46which you can estimate from the data.
- 10:48There are two unknown quantities on the right-hand side.
- 10:50So without making assumptions, you cannot really identify
- 10:54probability of y, right?
- 10:56So any quantification learning methods
- 10:59need to either estimate those conditional probabilities,
- 11:02probability of a given y,
- 11:04or make some assumptions on it.
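With the conditional bars the speaker mentions restored, the identity is the law of total probability over the C cause categories:

```latex
P(a = j) \;=\; \sum_{i=1}^{C} P(a = j \mid y = i)\, P(y = i), \qquad j = 1, \dots, C.
```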
- 11:08So again, all the conditional signs are missing.
- 11:16One of the most common approaches,
- 11:19and this is what is used in the verbal autopsy world
- 11:22is called classify and count,
- 11:25which is you simply predict the cause of death
- 11:28and then aggregate.
- 11:29So you're simply claiming that probability of a
- 11:33is the same as probability of y, which is equivalent to claiming
- 11:36that this misclassification rate matrix
- 11:39is an identity matrix, right?
- 11:41Because you're saying that the left hand quantity
- 11:44is the same as the rightmost quantity, which would be true
- 11:48if there is no misclassification by the algorithm
- 11:51and if the predicted cause of death
- 11:53is always the true cause of death.
- 11:56And that's what is typically done
- 11:58in these cause-specific mortality fraction estimates.
- 12:02But it's a very strong assumption, right?
- 12:04Because it assumes perfect sensitivity and specificity
- 12:07of the algorithm.
- 12:10So let's look at how perfect the algorithms are.
- 12:12So these are two algorithms,
- 12:13Tariff and InSilicoVA,
- 12:16PHMRC data is a benchmark dataset from four countries
- 12:20that has both the verbal autopsy data
- 12:22as well as a gold standard cause of death diagnosis.
- 12:26And you can see the accuracies of either method
- 12:30is around 30%, so they're far from being
- 12:33like fully accurate.
- 12:36So there are large misclassification rates
- 12:39of these algorithms and if you don't kind of adjust
- 12:42for these misclassifications,
- 12:44the burden estimates
- 12:46of the cause-specific mortality fractions you get
- 12:48are likely going to be very biased.
- 12:54So this is where the CHAMPS project comes into play.
- 12:58So the CHAMPS is an ongoing project
- 13:00in like seven or eight countries including Mozambique,
- 13:05which is collecting data on both verbal autopsy
- 13:07and a more comprehensive cause of death procedure
- 13:11called minimally invasive tissue sampling.
- 13:14So it basically takes a tissue sample
- 13:17from the deceased person and then runs a bunch
- 13:20of pathological tests and imaging analysis
- 13:23and then gives a cause of death.
- 13:25And the MITS cause of death assignments
- 13:30have been shown to be quite accurate when you compare
- 13:33to like a full diagnostic autopsy.
- 13:36So MITS is being done in a bunch
- 13:38of different countries including Mozambique.
- 13:41And for the cases where MITS is being done,
- 13:43the verbal autopsies are also collected.
- 13:46So what you get from this CHAMPS data
- 13:48is a labeled or paired dataset
- 13:50where you have both the verbal autopsy
- 13:52as well as the MITS cause of death
- 13:54and you can pass the verbal autopsy to the software
- 13:58to get the verbal autopsy predicted cause of death.
- 14:00And then you can cross tabulate the two
- 14:02and get an estimate of the misclassification rates, right?
- 14:04Like you can say like,
- 14:06"Oh okay, so there are 10 cases
- 14:08that the MITS cause of death was HIV,
- 14:11out of those 10 cases,
- 14:12seven of them were correctly assigned to HIV
- 14:15by verbal autopsy.
- 14:16So then the sensitivity would be 70%
- 14:20and the false negative rate would be 30%, and so on."
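The cross-tabulation just described can be sketched as follows (hypothetical counts following the 7-out-of-10 example above; an illustration, not the CHAMPS pipeline):

```python
import numpy as np

def misclassification_matrix(true_causes, predicted_causes, causes):
    """Row i, column j: fraction of deaths with true (MITS) cause i
    that the VA algorithm assigned to predicted cause j."""
    idx = {c: i for i, c in enumerate(causes)}
    counts = np.zeros((len(causes), len(causes)))
    for y, a in zip(true_causes, predicted_causes):
        counts[idx[y], idx[a]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

# 10 MITS-confirmed HIV deaths, 7 predicted as HIV by VA (sensitivity 70%),
# plus 10 MITS-confirmed malaria deaths, 9 predicted as malaria.
y = ["HIV"] * 10 + ["Malaria"] * 10
a = ["HIV"] * 7 + ["Malaria"] * 3 + ["Malaria"] * 9 + ["HIV"] * 1
M = misclassification_matrix(y, a, ["HIV", "Malaria"])
# M[0] is the HIV row of the misclassification matrix: [0.7, 0.3]
```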
- 14:27So this is the broad idea of the methodology.
- 14:29So for the COMSA data, which is the unpaired data,
- 14:32you get only the verbal autopsy record
- 14:34so you can get an estimate of the predicted cause of deaths
- 14:37from the verbal autopsy.
- 14:39From the CHAMPS data, which is the paired data,
- 14:41you can get an estimate of the misclassification rates.
- 14:44And then the only unknown is then the probabilities
- 14:48of the cause of death
- 14:50if you were able to do the MITS autopsy for every death.
- 14:54So then this is an equation with two knowns and one unknown
- 14:58and you can solve for it and get the calibrated estimates.
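The "two knowns, one unknown" step can be illustrated with a simple plug-in inversion (made-up numbers; the actual method, described next, solves this in a model-based Bayesian way with priors on both M and p):

```python
import numpy as np

# Hypothetical misclassification matrix: rows are true causes,
# columns are VA-predicted causes (HIV, Malaria, Sepsis).
M = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])
p_true = np.array([0.5, 0.3, 0.2])  # population CSMF (unknown in practice)
q = M.T @ p_true                    # what classify-and-count would report

# Plug-in calibration: invert q = M^T p, then project back to the simplex.
p_hat = np.linalg.solve(M.T, q)
p_hat = np.clip(p_hat, 0, None)
p_hat /= p_hat.sum()
# p_hat recovers p_true, while the uncalibrated q is biased away from it
```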
- 15:01So that's the broad idea and we do it in a model-based way.
- 15:09So here's the formal model.
- 15:11So for the COMSA dataset, the unlabeled data or U,
- 15:15we have the predicted labels, ar,
- 15:21and for the CHAMPS data,
- 15:22we have both the predicted labels from verbal autopsy, ar,
- 15:26as well as the MITS-determined labels, yr.
- 15:29And our quantity of interest is the probabilities of yr
- 15:34belonging to the different causes.
- 15:41There's a conditional sign missing here.
- 15:44But if the conditional probabilities
- 15:48are denoted by Mij, which is if the MITS cause is i,
- 15:52what is the probability that the VA predicted cause is j?
- 15:57Then you can use a law of total probability
- 15:59to write down the marginal distribution
- 16:02of the VA predicted cause.
- 16:03So that would be in terms of the misclassification rates
- 16:07and the marginal cause distribution of the MITS-COD.
- 16:10So that's the whole idea.
- 16:11So you can write this in terms of a matrix vector notation
- 16:15as probability of a as M transpose p
- 16:18where M is the misclassification rate matrix,
- 16:21p is the unknown quantity of interest,
- 16:24which is the probability that the cause of death
- 16:27is coming from a given cause.
- 16:31So the data model is very simple,
- 16:34for the unlabeled data,
- 16:36it follows multinomial with this probability
- 16:38which is coming from this law of total probability.
- 16:41And then for the label data,
- 16:43this is ar given yr equals to i,
- 16:46it follows multinomial with the i-th row
- 16:48of the misclassification matrix.
- 16:49So if the MITS-COD is i,
- 16:51the misclassification rates are given by the i-th row
- 16:53of the misclassification matrix,
- 16:55so it's multinomial with that probability.
- 16:59And then we've put priors on M and p
- 17:01and then we can get estimates of both M and p.
- 17:04M is a nuisance parameter, p is the parameter of interest.
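Putting the two pieces together, with the conditional bars restored, the data model reads:

```latex
a_r \sim \mathrm{Multinomial}\!\left(1,\; M^{\top} p\right) \ \text{for unlabeled } r,
\qquad
a_r \mid y_r = i \;\sim\; \mathrm{Multinomial}\!\left(1,\; M_{i\cdot}\right) \ \text{for labeled } r,
```

where $M_{i\cdot}$ denotes the $i$-th row of the misclassification matrix $M$ and $p$ is the CSMF vector of interest.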
- 17:10Just to carefully go over what are the assumptions here.
- 17:13The main assumption is that the misclassification rates
- 17:18of verbal autopsy given MITS
- 17:20are the same in your label data
- 17:23as they would be in your unlabeled data.
- 17:25This is not verifiable because we don't have
- 17:28any true cause of death in the unlabeled data,
- 17:30so it's an assumption.
- 17:33Given that the verbal autopsy
- 17:35is a function of your symptoms,
- 17:37the assumption is essentially that given a true cause,
- 17:42the probabilities of the symptoms are going to be the same
- 17:44in your unlabeled dataset as in your labeled dataset.
- 17:49And it's a reasonable assumption:
- 17:50if you have a given cause of death,
- 17:53it's likely that certain symptoms will appear
- 17:56and certain other symptoms will not appear.
- 17:59And that is true regardless of whether the data is coming
- 18:02from the labeled set or the unlabeled set.
- 18:08We do not assume that the marginal distribution
- 18:12of the causes in the labeled CHAMPS data
- 18:16is representative of the population,
- 18:17because it is not:
- 18:20the CHAMPS project is done
- 18:21at specific hospitals in the country
- 18:24and the distribution of causes in hospitals
- 18:28is typically not the same as the distribution
- 18:30of causes in the community.
- 18:31And we are interested
- 18:32in the cause distribution in the population.
- 18:34So there is no assumption
- 18:37that the marginal distribution of y in the labeled data
- 18:40is the same as the marginal distribution of y in the unlabeled data,
- 18:43which is our quantity of interest.
- 18:45And the reason there is no assumption
- 18:47is that we only model a given y in the labeled data.
- 18:51We never model y itself in the labeled data.
- 18:54So we only model the conditional,
- 18:56and the assumption is that the conditional
- 18:57misclassification rates are transportable
- 19:00from the labeled to the unlabeled side.
- 19:06So that's the main idea.
- 19:07And this was the first work we did,
- 19:09we just used this top cause prediction.
- 19:13But many of these algorithms
- 19:15are actually probabilistic in nature in the sense
- 19:17that if you look at their outputs,
- 19:18they won't give a single cause of death,
- 19:20but they will give scores to each cause.
- 19:22So for example,
- 19:24this would be a typical output of an algorithm
- 19:26for, like, say six deaths.
- 19:28So for the first person, it will say
- 19:3370% HIV, 20% malaria, 10% sepsis and so on.
- 19:38And the standard procedure is to take the top cause,
- 19:41so for the first person, it would be HIV,
- 19:44for the second person, it will be malaria and so on.
- 19:48So that's how you get a single cause
- 19:50from a probabilistic prediction.
- 19:53So that essentially ignores sort of the scores
- 19:57assigned to the second most likely cause,
- 20:01the third most likely cause and so on.
- 20:04And if you ignore those, you can end up with a biased estimate.
- 20:09So you can see these are the CSMF estimates
- 20:12using the top cause,
- 20:14these are the CSMF estimates
- 20:15using the exact scores that are assigned
- 20:17and those are different, right?
- 20:18So when we kind of change this probabilistic output
- 20:22to a single cause output, we discard information.
- 20:30So we wanted to extend the work
- 20:32to kind of use the full set of scores and the set of scores
- 20:36can be thought of as a compositional data in the sense
- 20:38that the scores sum up to one
- 20:40because the algorithm distributes 100% probability across all causes
- 20:45and the scores are each non-negative.
- 20:48The issue is that for the categorical data,
- 20:51our model is based on multinomial distribution.
- 20:53And then for compositional data,
- 20:55the models are typically like Dirichlet
- 20:57or log ratio based models,
- 20:59which are very different from the multinomial distribution.
- 21:03So if we have some cases
- 21:05for which we have categorical output,
- 21:07for some, we have compositional output,
- 21:09this would lead to different models
- 21:11for different parts of the dataset.
- 21:15These Dirichlet or log-ratio models
- 21:17also do not allow zeros in the data.
- 21:20So if you have zeros or ones in the composition,
- 21:22they don't allow that.
- 21:23And these are very specific models for the data,
- 21:27which are subject to model misspecification.
- 21:29So if the data distribution does not look like a Dirichlet,
- 21:33assuming a Dirichlet
- 21:34would lead to wrong results.
- 21:41So how do we extend the multinomial framework we had
- 21:46for categorical data to compositional data?
- 21:51Again, there would be a conditional sign here.
- 21:56But the basic assumption that we had
- 21:58for the multinomial case was probability of a given y
- 22:02is the i-th row of the misclassification matrix, right?
- 22:05And for categorical data, a probability statement
- 22:10is the same as an expectation statement, right?
- 22:12So we can equivalently write this
- 22:14as expectation of a given y
- 22:16is the i-th row of M.
- 22:19The advantage of the expectation statement
- 22:20is that it's more generally applicable.
- 22:23It will not be just for categorical data, right?
- 22:27So for categorical data, they're equivalent.
- 22:30For other data types, this statement can be valid
- 22:33even though the previous statement may not be applicable.
- 22:37So we kind of write this as our model
- 22:41for the compositional data and we make no other assumptions
- 22:45about this distribution.
- 22:46So only a first moment conditional expectation statement
- 22:53without any full distributional specification.
- 22:59So what do we do?
- 23:00So we have expectation of a given y
- 23:03is the i-th row of the misclassification matrix.
- 23:08We can use something called
- 23:10the Kullback Leibler Divergence
- 23:12or the cross entropy loss
- 23:14between a and its model expectation.
- 23:17So these are all the conditional signs are missing here.
- 23:22So basically a is the data we observe,
- 23:26this is the modeled expectation,
- 23:29which is basically the i-th row
- 23:30of the misclassification matrix
- 23:31and we use the cross entropy loss,
- 23:34the Kullback Leibler loss between the two.
- 23:37What's the advantage?
- 23:38So first of all,
- 23:39the Kullback Leibler loss allows zeroes in the composition.
- 23:42So it is well-defined even if you have zeroes or ones.
- 23:45If you take the negative loss and exponentiate it,
- 23:48it's exactly the multinomial likelihood.
- 23:50So if your data is indeed multinomial,
- 23:52you get back your likelihood that you're using
- 23:54for your single class model.
- 23:57But if your data is not multinomial,
- 24:00you get a pseudo likelihood that you can work with.
- 24:04If you take the derivative of the loss function
- 24:07and take the expectation under the true parameter,
- 24:10you'll see that it's a valid score function
- 24:13in the sense that you get an unbiased estimating equation
- 24:16for your misclassification rate matrix, M,
- 24:19based on just the first moment assumption.
- 24:23And then similarly, you can do the same thing
- 24:25for the unlabeled data.
- 24:27The probability statement becomes expectation statement
- 24:30and then we have the Kullback Leibler loss.
- 24:32This is an unbiased estimating equation for both M and p.
- 24:36And again,
- 24:38if the data is truly multinomial and not compositional,
- 24:41this becomes exactly the multinomial likelihood.
- 24:43If the data is compositional,
- 24:45it becomes a pseudo likelihood.
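A minimal sketch of the cross-entropy (Kullback-Leibler) pseudo-likelihood for a single observation, using the convention that 0·log 0 = 0 so zeros in the composition are allowed:

```python
import numpy as np

def cross_entropy_loss(a, mu):
    """Cross-entropy (KL-type) loss between an observed composition a
    (non-negative, sums to one, zeros allowed) and its modeled
    expectation mu; entries with a_j = 0 contribute nothing."""
    a, mu = np.asarray(a, float), np.asarray(mu, float)
    mask = a > 0
    return -np.sum(a[mask] * np.log(mu[mask]))

mu = np.array([0.7, 0.2, 0.1])       # modeled expectation, e.g. a row of M

# For a one-hot (categorical) observation, exp(-loss) is exactly the
# multinomial likelihood of that single observation.
one_hot = np.array([1.0, 0.0, 0.0])
lik = np.exp(-cross_entropy_loss(one_hot, mu))   # approx. mu[0] = 0.7

# For a genuinely compositional observation with a zero, the loss is
# still finite, which is the sparsity property mentioned in the talk.
comp = np.array([0.6, 0.4, 0.0])
loss = cross_entropy_loss(comp, mu)
```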
- 24:50Okay, so how do we do Bayes analysis
- 24:52with pseudo likelihoods?
- 24:54So this is where this idea of generalized Bayes
- 24:57or model-free Bayesian inference comes in
- 24:59and there have been parallel developments
- 25:01in both computer science, econometrics and statistics
- 25:04without much communication among the three fields
- 25:07for the last 30, 40 years.
- 25:10Basically, if you're given a loss function
- 25:13without a given like a full likelihood for the data,
- 25:15you can take negative of that loss function
- 25:18multiplied by some tuning parameter, alpha,
- 25:22exponentiate it and treat it as a pseudo likelihood
- 25:26and apply your priors
- 25:27and then your posterior is going to be proportional to this
- 25:30as long as the normalization constant exists.
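In symbols, given a loss $\ell$, a tuning parameter $\alpha > 0$, and a prior $\pi(\theta)$, the generalized Bayes posterior the speaker describes is

```latex
\pi_\alpha(\theta \mid \mathrm{data}) \;\propto\; \exp\{-\alpha\, \ell(\theta; \mathrm{data})\}\, \pi(\theta),
```

which reduces to the ordinary Bayesian posterior when $\exp\{-\ell\}$ is an actual likelihood.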
- 25:33And there has been a lot of work that has shown
- 25:35that this is a valid posterior,
- 25:38it is a generalization of the Bayesian posterior,
- 25:41like if this is an actual likelihood,
- 25:42this is the Bayesian posterior,
- 25:44but if it's not an actual likelihood,
- 25:48it has been shown that it basically minimizes
- 25:49the Bayes risk for that loss function.
- 25:54It has nice asymptotic properties
- 25:56shown by Victor Chernozhukov in this paper
- 25:59and then in this JRSS paper in 2016, I think,
- 26:04it showed that if you're given a loss function
- 26:06and a prior,
- 26:07this is the only coherent way you can get a posterior.
- 26:12So there's now been a lot of work and it's been called
- 26:15by different names like Gibbs posteriors,
- 26:17pseudo posterior, Laplace-type estimators
- 26:20and quasi-Bayesian estimators along with generalized Bayes.
- 26:25So for our case, we have the pseudo likelihood
- 26:28for the labeled data.
- 26:29We have the pseudo likelihood for the unlabeled data.
- 26:32We put priors.
- 26:33If all of our data were categorical,
- 26:35this reduces to that multinomial model we had
- 26:38for the categorical data.
- 26:39But if some of the data is compositional,
- 26:41then this becomes generalized Bayes,
- 26:44so we call it generalized Bayes quantification learning.
- 26:47It allows sparsity of the outputs in the sense
- 26:50that if some of the data have zeroes and ones in them,
- 26:54this is well-defined.
- 26:56It's the same pseudo likelihood
- 26:58for categorical and compositional predictions.
- 27:01And then it also allows
- 27:02a nice Gibbs sampler using conjugacy.
- 27:11One final sort of data aspect we had
- 27:15was that this minimal tissue sampling
- 27:18was also sometimes inconclusive in the sense
- 27:21that they gave two causes.
- 27:22Like often, they were ambiguous between HIV and tuberculosis
- 27:29and they would give one as the immediate cause
- 27:31and one as the underlying cause.
- 27:32So sometimes, even the true cause of death is compositional.
- 27:36So your predicted cause of death is compositional,
- 27:39your true cause of death is also compositional
- 27:41and we call it like b, which represents the belief.
- 27:45And you can show that if you're only given b
- 27:49instead of a single cause of death,
- 27:53your conditional expectation becomes M transpose b
- 27:56instead of the i-th row of the M matrix.
- 27:59And you can do the same thing
- 28:01using the compositional true cause of death
- 28:05instead of the actual true cause of death.
- 28:08And all the conditional signs are missing here
- 28:10but you can just formulate the Kullback Leibler likelihood
- 28:14to generate pseudo likelihood.
- 28:19So this kind of gave rise to a digression,
- 28:22where we looked at this as basically:
- 28:25your true cause of death is a compositional covariate
- 28:28and your predicted cause of death is a compositional output.
- 28:31So we kind of looked at regression
- 28:33of a compositional outcome on compositional predictors.
- 28:36So this was kind of an offshoot paper
- 28:40where we just developed this piece
- 28:42and if you look at compositional regression,
- 28:45most of the work has been done using Dirichlet models
- 28:50or log ratio transformations.
- 28:52So this was a different approach to that in the sense
- 28:55that it's both transformation free
- 28:57and it doesn't specify a whole distribution
- 28:59like the Dirichlet,
- 29:00it just uses a first moment assumption.
- 29:02And we have an R package
- 29:07to do composition-on-composition regression, called codalm.
- 29:12But going back to the verbal autopsy work,
- 29:16we have the loss functions
- 29:17for the labeled and unlabeled data,
- 29:20we do the negative pseudo likelihoods,
- 29:23put priors on the parameters and we get posterior inference.
- 29:28One last extension of the methodology
- 29:31was that there are multiple different
- 29:34verbal autopsy algorithms and there are papers
- 29:36where every new algorithm comes out and they say
- 29:39they're better than all the previous algorithms.
- 29:41And in practice, you never know which is the best algorithm.
- 29:44So we developed an ensemble method that takes in predictions
- 29:49from multiple algorithms, estimates
- 29:54algorithm-specific misclassification rates,
- 29:57which are then connected to the unknown estimand.
- 30:00So we can show that it gives more weight
- 30:04to the more accurate algorithm in a data-driven way.
- 30:07And then you're not kind of,
- 30:10you don't have to make the choice
- 30:12of which is the best algorithm in advance.
- 30:14If you have multiple candidates,
- 30:15you can use multiple algorithms together.
- 30:23So we looked at some theoretical properties of the method.
- 30:26We have two loss functions, one for the labeled data,
- 30:29one for the unlabeled data.
- 30:31The labeled data
- 30:32doesn't even feature the estimand, which is p,
- 30:36so on its own, it cannot identify p.
- 30:39The unlabeled data only uses p through this quantity,
- 30:43M transpose p.
- 30:44So again, for different combinations of M and p,
- 30:48as long as this product is the same,
- 30:50it will never be able to identify p on its own.
- 30:53So each loss function on its own
- 30:54cannot identify the true parameters.
- 30:57But using both the loss functions together,
- 30:59you can identify the estimand, p,
- 31:02and we were able to show that posterior has nice properties
- 31:06in terms of asymptotic normality
- 31:08and well-calibrated interval estimates
- 31:11and near parametric concentration rates.
- 31:13And the theory also extends to the ensemble method
- 31:16and we use some approximations in the Gibbs sampler
- 31:19and the theory holds for that.
- 31:24Some empirical validations,
- 31:27since we're estimating a probability vector,
- 31:32the common metric that is used is called
- 31:34this chance-corrected normalized absolute accuracy,
- 31:38which is basically a scaled L1 error,
- 31:42centered by the L1 error you would get if you had predicted
- 31:46the cause of death randomly.
- 31:47So this is the error if you predict randomly
- 31:50and then we look at how much improvement we get
- 31:52over random predictions.
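A sketch of such a metric in code. The scaled L1 accuracy below is the standard CSMF-accuracy form; the exact chance-correction constant used in the literature may differ, so here it is estimated by Monte Carlo over uniform Dirichlet guesses, which is an assumption for illustration only.

```python
import numpy as np

def csmf_accuracy(p_hat, p_true):
    """1 minus the L1 error, scaled by its worst possible value 2*(1 - min p)."""
    p_hat = np.atleast_2d(np.asarray(p_hat, float))
    p_true = np.asarray(p_true, float)
    err = np.abs(p_hat - p_true).sum(axis=1)
    acc = 1.0 - err / (2.0 * (1.0 - p_true.min()))
    return acc[0] if acc.size == 1 else acc

def chance_corrected(acc, p_true, n_draws=20000, seed=0):
    """Center by the average accuracy of random (uniform Dirichlet) guesses,
    estimated by Monte Carlo -- an assumed stand-in for the exact constant."""
    rng = np.random.default_rng(seed)
    rand = rng.dirichlet(np.ones(len(p_true)), size=n_draws)
    acc_rand = csmf_accuracy(rand, p_true).mean()
    return (acc - acc_rand) / (1.0 - acc_rand)

p_true = np.array([0.5, 0.3, 0.2])
print(csmf_accuracy(p_true, p_true))  # perfect prediction: 1.0
print(round(csmf_accuracy([1 / 3, 1 / 3, 1 / 3], p_true), 3))
print(round(chance_corrected(0.9, p_true), 3))
```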
- 31:57So this is an illustration of what happens if the data
- 32:01is not Dirichlet and you use Dirichlet distribution.
- 32:03So on the left-hand side,
- 32:05the data is generated from Dirichlet
- 32:08and we use both our method and the Dirichlet-based model
- 32:12and they both do well.
- 32:14On the right-hand side,
- 32:15the data is from an overdispersed Dirichlet
- 32:17and we use both the Dirichlet model and our model.
- 32:20And because our model doesn't specify a distribution,
- 32:22it just uses a first-moment specification,
- 32:25it's much more robust and has much higher accuracy
- 32:29than the Dirichlet model, which becomes misspecified.
- 32:35And then we also did a bunch of evaluations
- 32:37using the PHMRC data.
- 32:38So what we did was we trained the classifiers
- 32:42on three of the countries leaving one country out
- 32:44and then used a slice of data from that left out country
- 32:47to estimate the misclassification rates,
- 32:50and then we apply our method.
- 32:55The green one is our method
- 32:56and the x axis is the sample size of the dataset
- 33:02used from the left out country
- 33:04to estimate the misclassification rates.
- 33:07The blue one is sort of the uncalibrated one,
- 33:11the red one is the one that is calibrated
- 33:13using the training data.
- 33:14So you can see that our method does better than both of them
- 33:18and the larger the sample size we use
- 33:20from the left-out country of interest
- 33:23to estimate the misclassification rates, the more accurate it is.
- 33:30And also one interesting aspect
- 33:31was that we looked at calibration
- 33:33using individual algorithms and the calibration
- 33:36using the ensemble one.
- 33:37And more often than not, the ensemble one,
- 33:40which is the orange one,
- 33:42tends to perform similarly to the best-performing algorithm,
- 33:46and the best performing algorithm can be very different
- 33:48across different countries.
- 33:50For example, in Mexico,
- 33:51InSilicoVA is one of the best performing algorithms,
- 33:54but in Tanzania, InSilicoVA was doing very poorly
- 33:57and then InterVA was one
- 33:59of the better performing algorithms.
- 34:00So the ensemble always tends to give more weight
- 34:03to more accurate algorithms.
- 34:07So this is an overview of what we did for Mozambique.
- 34:10So we had the unlabeled data with only verbal autopsies.
- 34:14We passed it through two algorithms,
- 34:16InSilicoVA and Expert VA, to get the uncalibrated estimates.
- 34:21Then we had the labeled data with the MITS cause of death,
- 34:23with which we estimated the misclassifications
- 34:25of those two algorithms,
- 34:28and then we combined them in the ensemble method
- 34:30to get calibrated estimates.
- 34:38Some results from Mozambique.
- 34:40We have two age groups,
- 34:42neonatal deaths, the first four weeks,
- 34:45and child deaths, under five years.
- 34:48Two algorithms, seven causes of death for children,
- 34:52five causes of death for neonates.
- 34:55I'm going to just show the neonatal results here.
- 34:57So these are the misclassification matrices for neonates.
- 35:01And ideally, you would want the matrices
- 35:03to have large numbers on the diagonals
- 35:05because those are the correct matches
- 35:07and then small numbers on the off diagonals.
- 35:09But you don't see that,
- 35:10you see quite a bit of large numbers on the off diagonals.
- 35:14One thing that stands out is that
- 35:17if you look at prematurity, it has a very high sensitivity,
- 35:20close to 90%,
- 35:22which means that if the true cause is prematurity,
- 35:25the verbal autopsy correctly diagnoses it.
- 35:28But then it also has high false positives,
- 35:31in the sense that if the true cause is infection,
- 35:3420% of the time it is assigned as prematurity.
- 35:37If the true cause is intrapartum-related events,
- 35:40almost 30% of the time
- 35:41it's assigned to be prematurity, and so on.
- 35:43So it tends to over count a lot of deaths
- 35:46from different causes as prematurity.
- 35:48So the result after calibration
- 35:52is that the estimated percentage of prematurity comes down.
- 35:54So this is the uncalibrated estimate of prematurity.
- 35:58This is the calibrated estimate of prematurity.
- 36:01You can see that it comes down
- 36:02because we can see in the data that there is a lot
- 36:05of over counting of prematurity deaths.
- 36:09So after calibration, it tends to come down quite a bit.
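The mechanics of why calibration pulls prematurity down can be sketched with invented numbers: a classifier with high sensitivity for one cause but large false-positive rates inflates that cause in the raw output, and inverting the misclassification matrix undoes the inflation (the real method solves this inside a Bayesian model, not by direct inversion).

```python
import numpy as np

# Illustrative 3-cause example mimicking the prematurity pattern: high
# sensitivity for cause 0 ("prematurity") but large false-positive rates,
# so the raw VA output over-counts cause 0. All numbers are made up.
causes = ["prematurity", "infection", "intrapartum"]
M = np.array([[0.90, 0.05, 0.05],   # true prematurity: 90% sensitivity
              [0.20, 0.70, 0.10],   # 20% of infections called prematurity
              [0.30, 0.10, 0.60]])  # 30% of intrapartum called prematurity
p_true = np.array([0.35, 0.40, 0.25])

q = M.T @ p_true                 # what the uncalibrated VA reports
p_cal = np.linalg.solve(M.T, q)  # calibration inverts the misclassification
print(np.round(q, 3))      # uncalibrated: prematurity inflated above 0.35
print(np.round(p_cal, 3))  # calibrated: recovers p_true
```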
- 36:17And also, we looked at the model-estimated sensitivities
- 36:22using both the single-cause
- 36:24and the compositional cause-of-death data.
- 36:27So this is the difference in the sensitivities,
- 36:29and you can see that using the compositional cause of death,
- 36:33you'll always get a higher match because it uses
- 36:36information from multiple causes instead of
- 36:39just considering the top cause.
- 36:41And so it generally leads to better matching
- 36:43between the verbal autopsy and the minimally invasive tissue sampling.
- 36:49Some ongoing work.
- 36:51So when we did this for Mozambique,
- 36:53there was only a small amount of paired data.
- 36:57So even though the data was for seven countries,
- 36:59we kind of merged them together
- 37:01and estimated the misclassification rates.
- 37:04Now we have more data coming in for those countries
- 37:07so we have a chance to assess
- 37:08whether the misclassification rates vary by country
- 37:12because if they do,
- 37:12we should model the misclassification rates
- 37:17in a way that's specific to each country.
- 37:21So these are the misclassification rates now
- 37:26resolved by country.
- 37:27So there are six countries, Bangladesh, Ethiopia,
- 37:30Kenya, Mali, Mozambique and Sierra Leone.
- 37:35You can see the estimates.
- 37:36These are the empirical estimates
- 37:37and the confidence intervals for each country.
- 37:40And the horizontal black line
- 37:42is what the pooled estimate looks like.
- 37:44So you can see that for some cause pairs, like here,
- 37:49there is not much variability across countries.
- 37:51But then for some other cause pairs, like say here,
- 37:56there's quite a bit of variability across countries.
- 38:00And so now that we are getting more data,
- 38:03the next step for the project
- 38:05is to estimate country-specific misclassification rates.
- 38:09The issue however is that even with more data,
- 38:12there are, I think, around 600 cases here for six countries,
- 38:17which is approximately 100 cases per country.
- 38:20And there are 25 cells of the misclassification matrix.
- 38:23So that's like four cases per cell,
- 38:25which is clearly not enough to fit separate
- 38:27country-specific models.
- 38:30So we'd have to kind of do
- 38:32a sort of a borrowing of information
- 38:35both across the rows and columns of the matrix
- 38:38but also across different countries.
- 38:42So first, we borrow information
- 38:45across the rows and columns of the matrix.
- 38:49And to do this,
- 38:52instead of an unstructured misclassification matrix
- 38:55where we estimate each cell separately,
- 38:57we start with a structured misclassification matrix
- 39:00built from two basic mechanisms.
- 39:02So we say that a classifier operates using two mechanisms,
- 39:07for a given cause, it can either match that cause
- 39:12and we call that an intrinsic accuracy
- 39:15and that matching probability will be different
- 39:18for different causes, so there are three causes here,
- 39:20and you can see
- 39:21that the matching probability can be different.
- 39:24If it doesn't match the true cause,
- 39:26then it randomly distributes its prediction
- 39:29to the other causes
- 39:31and that random distribution will also have some weights,
- 39:36and those we call the systematic bias
- 39:38or the pull of the classifier.
- 39:40So if it's not matching,
- 39:42we saw that it'll often assign a cause to prematurity
- 39:46regardless of what the true cause is.
- 39:48So that's kind of the basis for this model.
- 39:51And if you have this model,
- 39:52we rearrange these three bars here,
- 39:57put them into the circle from there,
- 39:59and these give you the misclassification probabilities.
- 40:03So we can write each of the misclassification probabilities
- 40:08in terms of just these six parameters and we can do the same
- 40:13for the green cause and for the blue cause.
- 40:17And so basically, these are the nine misclassification rates
- 40:22written in terms of the six parameters.
- 40:23So this is not that much of a dimension reduction
- 40:26if there are three causes,
- 40:27but if there are, in general, C causes,
- 40:32this model for the misclassification matrix will only have
- 40:342C − 1 parameters as opposed to C² parameters.
- 40:39So in practice, we use seven causes for children
- 40:43and five causes for neonates,
- 40:44so this leads to a lot of dimension reduction.
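A sketch of this structured matrix with invented accuracies and pull weights: the diagonal holds the intrinsic accuracies, and each row's remaining mass is spread over the other causes in proportion to a shared pull vector, giving 2C − 1 free parameters instead of C².

```python
import numpy as np

def structured_M(a, w):
    """Structured misclassification matrix from the two mechanisms described:
    intrinsic accuracy a[i] = P(match | true cause i) on the diagonal, and a
    systematic-bias ("pull") weight vector w that spreads the remaining
    1 - a[i] mass over the other causes. C accuracies plus C - 1 free weights
    give 2C - 1 parameters versus C^2 for an unstructured matrix."""
    a, w = np.asarray(a, float), np.asarray(w, float)
    C = len(a)
    M = np.zeros((C, C))
    for i in range(C):
        off = np.delete(np.arange(C), i)
        M[i, i] = a[i]
        M[i, off] = (1 - a[i]) * w[off] / w[off].sum()
    return M

# Invented numbers: cause 0 plays the "prematurity" role with a large pull weight.
M = structured_M(a=[0.9, 0.6, 0.5, 0.7], w=[0.5, 0.25, 0.15, 0.10])
print(np.round(M, 3))

# The equivalent characterization from the talk: the odds of misclassification
# into causes j vs k do not depend on the true cause i (here, always w[j]/w[k]).
print(M[0, 2] / M[0, 3], M[1, 2] / M[1, 3])  # both equal 1.5
```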
- 40:49And one justification
- 40:53for this dimension-reduced model
- 40:54is that if this model is true, then
- 41:01the odds of misclassification into two causes, j and k,
- 41:05will not depend on what the true cause is.
- 41:08And we do see that in the data.
- 41:10So these are different cause pairs, j and k,
- 41:13and these are the odds for each true cause.
- 41:17So we are plotting the ratios of misclassification rates,
- 41:20M_ij over M_ik.
- 41:22So this is j and k,
- 41:24and the colors here give you i.
- 41:26So you do see that they do not vary
- 41:28for different choices of i;
- 41:30they are specific only to j and k,
- 41:32and that's an equivalent characterization
- 41:36of the systematic-preference
- 41:39and intrinsic-accuracy model that we have,
- 41:41so we do see that reflected in the data.
- 41:44But we don't impose that as a fixed model.
- 41:49So this is the base model.
- 41:51We allow some divergence from it, with shrinkage towards it,
- 41:54and there's a tuning parameter.
- 41:56So then we get the homogeneous model,
- 41:58and then we allow divergence from the homogeneous model
- 42:01to get the country-specific model.
- 42:03So that's the broad idea,
- 42:04I won't go into the modeling details.
- 42:07And these are the predictions
- 42:09using the country specific model.
- 42:13I won't go into details here, but there are many cases,
- 42:15for example, take this one here:
- 42:17the star is the empirical rate,
- 42:19the triangle is the heterogeneous model.
- 42:24And you can see it does much better
- 42:26than the horizontal line, which is the homogeneous model.
- 42:30And we see this throughout the misclassification rates.
- 42:36These are the estimates for Bangladesh.
- 42:38So the red density is the pooled estimate
- 42:41from the homogeneous model.
- 42:43The blue density is the Bangladesh specific estimate.
- 42:48The dotted vertical line
- 42:50is the empirical estimate for Bangladesh
- 42:52and the solid vertical line
- 42:53is the pooled empirical estimate.
- 42:56So you can see that as we get
- 42:59more and more data from Bangladesh,
- 43:01the country-specific estimate moves away
- 43:03from the pooled estimate
- 43:04towards the country-specific empirical estimate.
- 43:06So the hope is that, going forward,
- 43:12we will have much more data within each country
- 43:14and we'll have estimates that are much closer
- 43:16to the dotted lines than to the solid lines.
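The qualitative behavior, a country-specific estimate that moves from the pooled value toward the country's empirical value as its sample size grows, can be sketched with a simple beta-binomial-style shrinkage rule. The weighting form and the tuning constant `tau` are assumptions for illustration, not the actual hierarchical model.

```python
def shrink(country_counts, pooled_rate, tau=50.0):
    """Shrink a country-specific rate toward the pooled rate: with n country
    observations and k events, return a weighted average whose weight on the
    country's own data, n / (n + tau), grows with n. tau is an assumed
    prior-strength tuning parameter."""
    k, n = country_counts
    lam = n / (n + tau)
    return lam * (k / n) + (1 - lam) * pooled_rate

pooled = 0.40
print(shrink((3, 10), pooled))      # little data: stays near the pooled 0.40
print(shrink((300, 1000), pooled))  # lots of data: near the country rate 0.30
```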
- 43:22So that's the summary.
- 43:23So in general, these cause of death classifiers
- 43:26are super inaccurate.
- 43:28So we need to calibrate for that, and we have limited data
- 43:31to estimate their inaccuracy,
- 43:32so we calibrate them in a Bayesian way.
- 43:36The methods give probabilistic causes of death
- 43:39instead of categorical causes of death.
- 43:40So we developed a generalized Bayes approach
- 43:43that is equivalent to a multinomial model
- 43:45if the data is categorical.
- 43:47But if it's not categorical, it becomes a pseudo-likelihood
- 43:50Bayesian approach for compositional data
- 43:54that allows zeroes and ones in the data
- 43:57and does not depend on the model specification.
- 44:02And then it led to this independent development
- 44:05of the composition-on-composition regression.
- 44:09Some papers and software.
- 44:10So the single-cause paper was the first one,
- 44:13then we extended it to compositional data
- 44:17and developed the theory for it.
- 44:19The package for calibration is available on GitHub,
- 44:22and the composition-on-composition regression
- 44:25was a separate piece,
- 44:26and we have the CoDA linear model package for it on CRAN.
- 44:30And then we used this approach
- 44:32to produce calibrated estimates
- 44:36for neonatal and child deaths in Mozambique,
- 44:39which were published in the last three papers.
- 44:41Thank you.
- 44:51<v ->Questions? Yes.</v>
- 44:53<v ->So I just had a quick question 'cause you were saying</v>
- 44:55the model basically looks at the symptoms
- 44:58to be able to predict which cause it would be.
- 45:00Does it also factor in what diseases and stuff
- 45:04are most common in those areas or does it kind of just-
- 45:07<v ->Oh, very good question.</v>
- 45:09It does factor it in but in a very crude way
- 45:12in the sense that the models have some settings
- 45:14called like high malaria, low malaria or high HIV, low HIV.
- 45:18So depending on which country you're running it,
- 45:21you will set the setting to like high HIV country
- 45:24or low HIV country, the same for malaria,
- 45:27but it doesn't do anything beyond that,
- 45:30so it only adjusts at a very coarse level.
- 45:34<v ->Causes of death or.</v>
- 45:37<v ->So the ICD-10 classification</v>
- 45:40will have around 30 plus causes of death
- 45:42for children and neonates,
- 45:44and I think many more for adults.
- 45:47There are no MITS for adults.
- 45:48MITS was only done for children and neonates;
- 45:51only now adult MITS are being started,
- 45:54but we have to kind of group them into broader categories
- 45:57because if you have 30 causes,
- 45:59your misclassification matrix will be 30 times 30.
- 46:02So we don't have the data to do estimation
- 46:05at that fine resolution.
- 46:06So we group them into broader categories.
- 46:08So seven for children, five for neonates.
- 46:11<v ->Is one of the categories, I have no idea,</v>
- 46:14it is totally unknown.
- 46:15And if so, is that different from the uniform distribution
- 46:18across causes of death?
- 46:21<v ->That would be the uniform distribution.</v>
- 46:23There is no category which is, I have no idea,
- 46:25but it'll be probably reflected in a score that is very flat
- 46:28across the causes.
- 46:30<v ->If you think there are seven causes of death</v>
- 46:32and I'm working with the same dataset
- 46:34and I think there are 100 causes of death,
- 46:36will there be substantial differences in our marginal
- 46:39estimates of probability?
- 46:41Because our uniform priors
- 46:45place such different amounts of mass across the say
- 46:4830 versus 100 causes of death.
- 46:51<v ->Yes, there will be differences</v>
- 46:54and even when we are aggregating from the 30 causes
- 46:58to seven causes, the assumption is that within each category
- 47:02the misclassification rates are homogeneous
- 47:04within the finer category.
- 47:05So that is an assumption that we're working with.
- 47:08So definitely, there will be differences.
- 47:11<v ->Thank you.</v>
- 47:16<v ->I have one more question.</v>
- 47:22I'll ask a philosophical question
- 47:23if I may. <v ->Sure, yeah.</v>
- 47:24<v ->You commented,</v>
- 47:26I don't know, about halfway through,
- 47:27about how statisticians are working on a thing.
- 47:32Computer scientists are working on the same thing.
- 47:34There's a third group I forget.
- 47:37And nobody talks to each other.
- 47:40Now, many of us are,
- 47:42many of the students here
- 47:44are within the data science track of biostatistics.
- 47:49By the way, love your Twitter handle.
- 47:52But yeah, so how do we bridge those things
- 47:56that we take advantage of these things
- 47:57and it's not three separate versions of the same thing?
- 48:01<v ->I don't know if there's a systematic way.</v>
- 48:04Honestly, I came to know about much of the literature
- 48:08going through the revisions
- 48:09and one of the reviewers or associate editors said
- 48:11there is a lot of work here in the econometrics literature,
- 48:14you should take a look.
- 48:15And that's kind of the value
- 48:16of the peer review system I guess.
- 48:17And so we looked at it and yes, there was a lot of work
- 48:20and they just called it different things
- 48:22and so I had no idea
- 48:23when I was searching for that in the literature.
- 48:26And we did see the Victor Chernozhukov paper,
- 48:29which I think is in the "Journal of Econometrics,"
- 48:30but it's basically an asymptotic statistics paper.
- 48:33It shows that this generalized Bayes machinery,
- 48:36which they call Laplace-type estimators,
- 48:38has all the nice properties
- 48:40that a standard Bayesian posterior would have.
- 48:43But yeah, I think talking to more people
- 48:46and like interacting and telling about your work
- 48:49will kind of,
- 48:50and someone will say that, oh yeah, I do something similar.
- 48:52You should look at this paper,
- 48:55it's probably. <v ->Hopefully Twitter helps.</v>
- 48:57<v ->Sorry?</v>
- 48:58<v ->Hopefully Twitter helps.</v>
- 48:58<v ->Yeah, yeah, definitely.</v>
- 49:00Engagement through any like in-person
- 49:02or social media platform would be useful, yeah.
- 49:08<v ->All right, well thanks so much.</v>
- 49:08I think we're out of time so we'll stop it there.
- 49:12(attendant muttering indistinctly)
- 49:15Hope everybody has a wonderful fall break.
- 49:17See you next week.