# BIS Seminar: A General Framework for Quantile Estimation with Incomplete Data

October 14, 2020
• 00:00- Hi, everyone.
• 00:02Welcome to the departmental seminar of
• 00:04the Department of Biostatistics, Yale University.
• 00:09I'm pleased to introduce Linglong Kong.
• 00:12He is an associate professor in the Department of Mathematical
• 00:16and Statistical Sciences at the University of Alberta.
• 00:20His research interests are in, and correct me if I'm wrong,
• 00:24functional and neuroimaging data analysis,
• 00:27statistical machine learning,
• 00:29and robust statistics and quantile regression.
• 00:32So today, he is gonna talk about his work on
• 00:35a general framework for quantile estimation
• 00:38with incomplete data.
• 00:40Thank you, Linglong. And whenever you're ready.
• 00:44- Thank you Laura for the introduction.
• 00:47And also thanks Professor John for the invitation.
• 00:52I'm very happy to be here, although it's way too early.
• 00:57So today I'm going to talk about a general framework for
• 01:01quantile estimation with incomplete data.
• 01:13So, this is joint work with Peisong from
• 01:20the University of Michigan and Jiwei from
• 01:23the University of Wisconsin-Madison, and Xingeai.
• 01:27And we started this work in the second year
• 01:33after I started my position at the University of Alberta.
• 01:37I got to know Peisong a long time ago, when he was a student,
• 01:44and at that time he had just started his position as
• 01:48an assistant professor at the University of Waterloo.
• 01:51And I invited him to visit me and afterwards,
• 01:56he invited me to visit him.
• 01:58And we felt that, since we had visited each other already,
• 02:02we should get something done.
• 02:04But I remember that we stayed in his office
• 02:11at the University of Waterloo, thinking about
• 02:15what we could do together.
• 02:17And eventually we thought, "Okay, what am I good at?
• 02:21Well, my research area is quantile regression.
• 02:24And what is Peisong good at?
• 02:27One of Peisong's research areas is missing data."
• 02:31So we said maybe we can put them together,
• 02:34then we wrote a couple of formulas on paper.
• 02:41Then we felt like, "Okay, we've got a paper already."
• 02:45Then we went to have dinner.
• 02:48And then one year later Peisong sent me like
• 02:52a two-page draft and said maybe we should continue it.
• 02:57And that's the first scenario in this topic.
• 03:04And then after another half year, I sent him my feedback.
• 03:12I said, "Why don't we make it more general,
• 03:15make it a framework?"
• 03:17So the same method would be able to apply
• 03:20to other scenarios.
• 03:22And then we both felt it was a good idea,
• 03:26then we started working on it.
• 03:27At that time, Jiwei was a postdoc at the University of Waterloo
• 03:33and Xingeai was my postdoc.
• 03:35So, we got together and started the project.
• 03:38Eventually, it became a project that I'm kind of proud of.
• 03:47So, what's missing data?
• 03:49Missing data arise in almost all
• 03:52serious statistical analyses.
• 03:56Missing values are representative of the
• 04:03messiness of the real world.
• 04:05Why would we have missing values?
• 04:08There could be all kinds of reasons.
• 04:12For example, it may be due to social or natural processes.
• 04:17For example, a student graduates
• 04:20or gets a job, or in a clinical trial, people die, and so on.
• 04:26It could also happen in a survey.
• 04:29For example, only some respondents are asked
• 04:35to continue to answer certain questions.
• 04:38Or maybe it's intentional missingness
• 04:41as a part of the data collection process.
• 04:45Or some other scenarios, including random data collection
• 04:48issues, respondent refusal, or non-response.
• 04:56So, mathematically, how do we categorize these kinds of missingness?
• 05:01Here are the three scenarios.
• 05:05Now, the first scenario we call missing completely at random.
• 05:10What does that mean?
• 05:11That means the missingness has nothing to do with the
• 05:15person being studied.
• 05:17The data just go missing completely at random;
• 05:19it's not related to any feature of this person.
• 05:23The second scenario is missing at random.
• 05:26Missingness has to do with the person, but can be predicted
• 05:30from other information about the person.
• 05:34In a certain scenario,
• 05:39the missingness may be predicted from some
• 05:43auxiliary variables, auxiliary information.
• 05:48The third one, a very hard one, is missing not at random.
• 05:55The missingness depends on unobserved information
• 05:59and sometimes even the response itself.
• 06:05So, the missingness is specifically related to
• 06:08what is missing.
• 06:09For example, a person does not attend a drug test
• 06:13because the person took drugs the night before.
• 06:17And therefore the second day,
• 06:18he couldn't make it to the drug test.
• 06:20We couldn't get that drug test result.
• 06:23These are the three missingness mechanisms.
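To make the three mechanisms concrete, here is a small simulation sketch. The data-generating model, the logistic missingness probabilities, and the function name `complete_case_mean` are illustrative assumptions and do not come from the talk.

```python
import math
import random

def complete_case_mean(mechanism, n=20000, seed=0):
    """Simulate y = x + noise (true mean 0); return the mean of the observed y."""
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)        # covariate, always observed
        y = x + rng.gauss(0.0, 1.0)    # response, possibly missing
        if mechanism == "MCAR":
            p_obs = 0.6                          # unrelated to the person
        elif mechanism == "MAR":
            p_obs = 1.0 / (1.0 + math.exp(-x))   # predictable from observed x
        elif mechanism == "MNAR":
            p_obs = 1.0 / (1.0 + math.exp(-y))   # depends on y itself
        else:
            raise ValueError(mechanism)
        if rng.random() < p_obs:
            observed.append(y)
    return sum(observed) / len(observed)

for name in ("MCAR", "MAR", "MNAR"):
    print(name, round(complete_case_mean(name), 3))
```

In this toy setup only the MCAR complete-case mean stays close to the true value of zero. Under MAR the bias can in principle be corrected using `x`; under MNAR it cannot without modeling the missingness itself.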
• 06:30How do we handle those missing data?
• 06:33There are many strategies.
• 06:35For example, the first one would be,
• 06:37well, let's try to get the missing data.
• 06:40That would be great.
• 06:42But in reality, that's usually impossible.
• 06:48The second is, well, since we have incomplete cases,
• 06:57just analyze the complete cases, right?
• 07:02But this could cause some other problems.
• 07:05We will talk about it.
• 07:07And the third one is we replace the missing data
• 07:12by some conservative estimate.
• 07:14For example, using the sample mean, sample median, and so on.
• 07:20The fourth one is we try to estimate the missing data
• 07:25from other data on the person.
• 07:27We use some sort of more sophisticated method to impute.
• 07:37Now in particular, mathematically speaking,
• 07:43among the strategies we use today to deal
• 07:46with missing data,
• 07:48the first one is complete case analysis.
• 07:51This is very simple, okay?
• 07:52We just analyze the complete cases, okay?
• 07:56We only take into consideration the individuals with
• 08:01no missing data.
• 08:05Sometimes it can provide good results,
• 08:07but the estimates obtained from this complete case analysis
• 08:12may be biased if the excluded individuals are systematically
• 08:18different from those included.
• 08:20So, if the complete cases are a good
• 08:24representation of the missing cases,
• 08:28then this method would be fine.
• 08:34Otherwise, if the complete cases are quite different from
• 08:38those we miss, then our results can be biased.
• 08:44And then there's the inverse probability weighting method, IPW.
• 08:50This is a commonly used method to correct the bias from a
• 08:53complete case analysis.
• 08:56What does that mean?
• 08:57It means, okay, we give each complete case a weight.
• 09:03This weight is the inverse of the probability of
• 09:07being a complete case.
• 09:12Well, this can also cause some bias
• 09:16if this IPW relies on the data distribution.
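As a rough illustration of the weighting idea, the sketch below compares a complete-case mean with an IPW mean on simulated missing-at-random data. The simulation design and the function name `cc_and_ipw_mean` are assumptions made for this example, and the true probability of being a complete case is plugged in rather than estimated.

```python
import math
import random

def cc_and_ipw_mean(n=20000, seed=1):
    """Return (complete-case mean, IPW mean) of y when the true mean is 1."""
    rng = random.Random(seed)
    cc_sum, cc_count = 0.0, 0
    ipw_num, ipw_den = 0.0, 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + x + rng.gauss(0.0, 1.0)
        pi = 1.0 / (1.0 + math.exp(-x))   # P(complete case | x), assumed known
        if rng.random() < pi:              # y is observed
            cc_sum += y
            cc_count += 1
            ipw_num += y / pi              # weight = inverse of that probability
            ipw_den += 1.0 / pi
    return cc_sum / cc_count, ipw_num / ipw_den

cc, ipw = cc_and_ipw_mean()
print("complete case:", round(cc, 3), "IPW:", round(ipw, 3))
```

Subjects who were unlikely to be observed get large weights, so they stand in for similar subjects whose responses are missing.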
• 09:25The third strategy, which is more sophisticated,
• 09:29is multiple imputation.
• 09:31It's quite a common method,
• 09:32especially nowadays in genetic studies.
• 09:35How do we do multiple imputation?
• 09:39We create multiple sets of imputations for
• 09:44the missing values, using an imputation process
• 09:48with a random component.
• 09:51Now, we have several full data sets.
• 09:54Then we analyze each data set.
• 09:59Those full data sets can be a little bit different.
• 10:02They can be slightly different because of the randomness of
• 10:08the imputation process.
• 10:11Anyway, we analyze those completed data sets,
• 10:14and then we get a set of parameter estimates for each.
• 10:17Then we can combine those results.
• 10:20We can combine these results,
• 10:22and hopefully we get a better result.
• 10:26Multiple imputation sometimes works quite well,
• 10:31but only if the missing data can be ignored,
• 10:36and also if we have good imputation models.
• 10:39And it depends on the nature of the data;
• 10:41the outcome depends on what kind of imputation model
• 10:45we are going to use.
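The impute, analyze, combine loop can be sketched as follows. This is a simplified illustration: the linear imputation model, the helper names, and the simulated data are assumptions, and a full multiple imputation procedure would also draw the regression parameters at random rather than fixing them at their fitted values.

```python
import math
import random

def fit_line(xs, ys):
    """Least-squares fit y = a + b*x; returns (a, b, residual sd)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    sd = math.sqrt(sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2))
    return a, b, sd

def multiple_imputation_mean(xs, ys, n_imputations=10, seed=0):
    """Estimate the mean of y; ys[i] is None where the response is missing."""
    rng = random.Random(seed)
    obs = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b, sd = fit_line([x for x, _ in obs], [y for _, y in obs])
    estimates = []
    for _ in range(n_imputations):
        # Fill each gap with a prediction plus a random component.
        completed = [y if y is not None else a + b * x + rng.gauss(0.0, sd)
                     for x, y in zip(xs, ys)]
        estimates.append(sum(completed) / len(completed))  # analyze one data set
    return sum(estimates) / n_imputations                   # combine the results

def simulate_mar(n=5000, seed=3):
    """Toy data with true mean 1; y is missing at random given x."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + x + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y if rng.random() < 1.0 / (1.0 + math.exp(-x)) else None)
    return xs, ys

xs, ys = simulate_mar()
print(round(multiple_imputation_mean(xs, ys), 3))
```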
• 10:51Now, that's how we deal with missing data,
• 10:56the strategies we have to deal with missing data.
• 11:01But let's match them with the missingness mechanisms:
• 11:06how we use these missing data strategies to deal with
• 11:11different missingness mechanisms.
• 11:14For example, if the data are missing completely at random,
• 11:19now in this case, the complete case analysis is quite good.
• 11:25Multiple imputation or any other imputation method
• 11:29is also okay.
• 11:31Is also valid.
• 11:32So, missing completely at random is
• 11:36the easiest case to deal with.
• 11:40What if the data are missing at random?
• 11:43Then in this case, some complete case analyses are valid,
• 11:51and multiple imputation is nearly okay too,
• 11:56if the bias is negligible.
• 12:00Now in the third case,
• 12:02if the data are missing not at random,
• 12:05then we have to model the missingness explicitly.
• 12:11We need to jointly model the response.
• 12:15We need to jointly model the response,
• 12:17and also the missingness.
• 12:22In practice, of course,
• 12:23we try to assume missing at random whenever it's possible
• 12:28and try to avoid dealing with
• 12:32the missing-not-at-random situation.
• 12:34But in reality, it's not something that we can control.
• 12:41Sometimes we have data that are missing not at random.
• 12:45Think in that case center or there is one special issue
• 12:53dedicated to the missing-not-at-random situation.
• 13:02Now, we have different strategies.
• 13:04And these different strategies have different properties.
• 13:12For example, multiple imputation is generally more efficient
• 13:17than IPW, but it's more complex.
• 13:23And the imputation and IPW approach
• 13:28require us to model the data distribution
• 13:32and the missingness probability, respectively.
• 13:35Imputation, we need to model the data distribution.
• 13:39IPW, we need to model the missingness probability.
• 13:45And also, for all kinds of strategies,
• 13:48we would have good properties
• 13:52only if the corresponding model is correctly specified.
• 13:59Most existing methods are vulnerable to
• 14:03these model misspecifications.
• 14:06Of course, we can use nonparametric methods to reduce the risk
• 14:11of model misspecification, but it's often impractical
• 14:16due to the curse of dimensionality.
• 14:21So now, how do we deal with this model misspecification?
• 14:27We have some methods available.
• 14:30For example, we can use a double robust method.
• 14:37In particular, in the double robust method,
• 14:40we have this augmented IPW.
• 14:44We not only model the missingness probability,
• 14:49but also the data distribution.
• 14:52Why is it double robust?
• 14:54Because the result would be consistent
• 14:58if either model is correct.
• 15:02If the way we model missingness probability
• 15:07or the way we model the distribution is correct,
• 15:12then we would get consistent result.
• 15:14And that's why it's called double robust.
• 15:18Well, now suppose we are not satisfied with double robust;
• 15:22what if we can have multiple guarantees?
• 15:25So, we have this multiple robust method.
• 15:27This was proposed by Peisong.
• 15:33And the multiple robust method is proposed to account for
• 15:38multiple models for the missingness probability
• 15:42and the data distribution.
• 15:45In double robust, we can only have one model for the missingness
• 15:48probability and one model for the data distribution.
• 15:51Well, for multiple robust,
• 15:54we can have multiple models for the missingness probability,
• 15:59and we can have multiple models for the data distribution.
• 16:05The good thing is the estimation result will be consistent
• 16:11if any one of the models is correct.
• 16:19Now, let's look at those methods mathematically.
• 16:26So, we are looking at missing at random.
• 16:29We assume the observed data are i.i.d.
• 16:34So we have data (R, RY, X^T).
• 16:38R, we use it to indicate missingness. For the IPW estimator,
• 16:48essentially we are trying to solve this equation.
• 16:52And here, this π is the probability
• 16:58of being a complete case.
• 17:01And IPW is consistent
• 17:03only if this π(X) is correctly specified.
• 17:08And then from the equation,
• 17:10we can get a consistent estimate of those
• 17:13parameters we are interested in.
• 17:17This is IPW. The other one is imputation.
• 17:23For imputation, we need to model the data distribution.
• 17:28And here we have a model for f(Y|X).
• 17:36And as you can see,
• 17:37we have our imputation for those missing data.
• 17:44This imputation is consistent
• 17:47only if this data distribution is correctly modeled,
• 17:52this f(Y|X) is correctly modeled.
• 17:58Now for this augmented inverse probability weighted method,
• 18:05we actually combine these two together.
• 18:11We have the first part from IPW,
• 18:14the second part from imputation.
• 18:17So the estimation result would be consistent
• 18:23if either this model for missingness probability
• 18:28or the model for data distribution is correctly specified.
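The double robustness property can be checked numerically for the simplest target, a mean. The sketch below is an illustration under assumed toy models: the simulation design and the names `aipw_mean`, `true_pi`, `wrong_pi`, `true_m`, and `wrong_m` are hypothetical, and the working models are plugged in directly instead of being fitted.

```python
import math
import random

def aipw_mean(pi_model, outcome_model, n=20000, seed=2):
    """AIPW estimate of E[y] (true value 1) from data missing at random."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 1.0 + x + rng.gauss(0.0, 1.0)
        r = 1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0
        pi = pi_model(x)        # working model for P(R = 1 | x)
        m = outcome_model(x)    # working model for E[y | x]
        ipw_part = y / pi if r == 1 else 0.0   # y is used only when observed
        augmentation = (r - pi) / pi * m       # imputation-based correction
        total += ipw_part - augmentation
    return total / n

def true_pi(x):
    return 1.0 / (1.0 + math.exp(-x))

def wrong_pi(x):
    return 0.5

def true_m(x):
    return 1.0 + x

def wrong_m(x):
    return 0.0

print("pi right, m wrong:", round(aipw_mean(true_pi, wrong_m), 3))
print("pi wrong, m right:", round(aipw_mean(wrong_pi, true_m), 3))
print("both wrong:       ", round(aipw_mean(wrong_pi, wrong_m), 3))
```

With either working model correct the estimate stays near 1; with both wrong it drifts away, which is the situation a multiple robust method tries to make less likely by allowing several candidate models.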
• 18:35Well, for multiple robust method,
• 18:38they have a series of models for the missingness probability
• 18:44and a series of models for the data distribution.
• 18:49And our results would be consistent
• 18:53if any one model is correctly specified.
• 19:01Well, this is something
• 19:07Like I said, this part is
• 19:12one of Peisong's research areas.
• 19:14For me, my research area is quantile regression.
• 19:18So, in terms of quantile regression, at that time
• 19:23we were thinking, "Okay, those methods,
• 19:26IPW, AIPW or the double robust method,
• 19:32and the multiple robust method, have been quite well studied
• 19:35for modeling the conditional mean.
• 19:39But for conditional quantiles, there are not
• 19:41a lot of methods available."
• 19:44Why do we care about quantiles?
• 19:46Quantiles not only describe the central features
• 19:49of the distribution, but also capture the tail behavior.
• 19:57And also under very mild conditions,
• 20:01the quantile function can uniquely determine
• 20:05the underlying distribution.
• 20:07So, there are a lot of advantages to model the quantiles.
• 20:13Then, we decided to study these missingness
• 20:18in quantile estimation.
• 20:21In particular, we proposed a general framework
• 20:23for quantile estimation with missing data.
• 20:30So, our proposed method, this kind of framework,
• 20:35can handle a lot of settings for
• 20:38missingness in quantile estimation.
• 20:43But in this paper,
• 20:46we specifically applied our proposed method
• 20:50to these three scenarios.
• 20:52Okay, three commonly encountered situations.
• 20:56In the first one, we are trying to estimate
• 21:01the marginal quantile of a response.
• 21:04This response has some missingness,
• 21:09while there are fully observed covariates.
• 21:13That's the first scenario: the response has some missingness
• 21:16while the corresponding covariates are fully observed.
• 21:20The second scenario, we are looking at
• 21:23the conditional quantile of a fully observed response.
• 21:28In this scenario, we look at
• 21:31some covariates that are only partially available.
• 21:36So, we have some missingness for covariates.
• 21:39And then the third scenario, we are still looking at
• 21:43the conditional quantile of a response.
• 21:47And in this case, the response has some missingness
• 21:52and we have fully observed covariates
• 21:55and also extra auxiliary variables.
• 22:02Now, let's look at the first situation.
• 22:07We want to estimate the marginal quantile.
• 22:10In this scenario, the response has some missingness
• 22:18and the covariates are fully observed.
• 22:22Now, let m be the number of subjects with
• 22:26data completely observed.
• 22:30Then our method consists of the following five steps.
• 22:38In the first step, we calculate, or estimate, this α.
• 22:45This α is related to the missingness probability, okay?
• 22:52The way we estimate this is by maximizing
• 22:57the binomial likelihood.
• 23:01So, in the first step we estimate the α,
• 23:03and then we get an estimate of the missingness probability.
• 23:10Okay?
• 23:11In the second step, we calculate gamma.
• 23:16This gamma is related to the data distribution.
• 23:21So, we maximize the likelihood of this data distribution.
• 23:25This gamma is a parameter related to the distribution.
• 23:33And then in the third step, we compute
• 23:39a sort of preliminary estimate of the quantile,
• 23:43the marginal quantile, through this imputation process,
• 23:51by solving this equation.
• 23:55And as you can see this is quite close to the AIPW scenario.
• 24:05Okay?
• 24:06And in this equation, this ψ is the score function
• 24:13of the quantile loss function.
• 24:17This ψ_τ(r) is τ - I(r < 0).
• 24:23This is the generalized derivative
• 24:28of the quantile loss function, okay?
• 24:34Here, this one may not be exactly zero.
• 24:39The reason is that this ψ is a non-smooth function,
• 24:46and sometimes it won't be exactly zero.
• 24:53That's basically the third step, okay?
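A tiny numerical sketch of the quantile loss and its score function may help here. It uses the standard check loss r(τ - I(r < 0)) and its generalized derivative τ - I(r < 0); the function names and the five data points are made up for illustration.

```python
def check_loss(r, tau):
    """Quantile (check) loss: r * (tau - I(r < 0))."""
    return r * (tau - (1.0 if r < 0 else 0.0))

def psi(r, tau):
    """Score function, the generalized derivative of the check loss."""
    return tau - (1.0 if r < 0 else 0.0)

def score_sum(ys, q, tau):
    """Left-hand side of the estimating equation sum_i psi(y_i - q) = 0."""
    return sum(psi(y - q, tau) for y in ys)

def quantile_by_loss(ys, tau):
    """Data point that minimizes the total check loss."""
    return min(sorted(ys), key=lambda q: sum(check_loss(y - q, tau) for y in ys))

ys = [2.0, 7.0, 1.0, 9.0, 4.0]
q_hat = quantile_by_loss(ys, 0.5)
print(q_hat, score_sum(ys, q_hat, 0.5), score_sum(ys, q_hat + 1e-6, 0.5))
```

For this sample the score sum jumps from 0.5 to -0.5 at the median 4.0 without ever equalling zero, which is why the estimating equation is solved only approximately.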
• 24:57Now, we have a preliminary estimator
• 25:01of the marginal quantile.
• 25:03The fourth step is the key step of our method;
• 25:08it is where the multiple robustness comes from.
• 25:14Now, we calculate weights for the complete cases.
• 25:19In total, we have m complete cases.
• 25:21For each case, we calculate the weight.
• 25:24As you can see, the weight is determined by three parts.
• 25:32The first part is related to this alpha,
• 25:36which is related to the missing probability, okay?
• 25:40Missing probability.
• 25:43The second part is related to this gamma.
• 25:47This is related to the data distribution.
• 25:52The third part is related to this q,
• 25:56this preliminary estimate of the marginal quantile,
• 26:02which is related to the third step.
• 26:07As you can see, in the first three steps,
• 26:10we are trying to get ready for this,
• 26:14to get the estimate of the weights for the complete cases,
• 26:18for these complete cases.
• 26:21And also, we have a parameter, ρ,
• 26:23which is obtained through
• 26:27minimizing this equation, through minimizing this equation.
• 26:33Now, after we calculate the weights,
• 26:36we get our final multiple robust estimate
• 26:42by solving the following weighted estimating equation.
• 26:50This w_i is the weight.
• 26:52We estimate it from the first four steps.
• 26:58And this ψ is the score function of the quantile loss, okay?
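Once weights are available, solving a weighted estimating equation of the form sum_i w_i ψ_τ(y_i - q) = 0 amounts to computing a weighted quantile. The sketch below assumes the weights are simply given; in the talk they come from the empirical likelihood step, which is not reproduced here, and the function name `weighted_quantile` is hypothetical.

```python
def weighted_quantile(ys, ws, tau):
    """Smallest y whose cumulative weight reaches tau times the total weight.

    At this point the weighted score sum over psi(y_i - q) changes sign,
    so it is an approximate root of the weighted estimating equation.
    """
    pairs = sorted(zip(ys, ws))
    total = sum(ws)
    cumulative = 0.0
    for y, w in pairs:
        cumulative += w
        if cumulative >= tau * total:
            return y
    return pairs[-1][0]

ys = [1.0, 2.0, 3.0, 4.0]
print(weighted_quantile(ys, [0.25, 0.25, 0.25, 0.25], 0.5))  # equal weights
print(weighted_quantile(ys, [0.1, 0.1, 0.1, 0.7], 0.5))      # last case up-weighted
```

Up-weighting a complete case pulls the estimated quantile toward it, which is how the weights correct the complete-case analysis.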
• 27:06Now, you may be wondering what's going on
• 27:10with these five steps.
• 27:14And let me try to explain it one by one, okay?
• 27:20In the first step, we get the estimate of alpha, okay?
• 27:24We get the estimate of alpha.
• 27:28In a sense, we are trying to model the missingness probability, okay?
• 27:33Missingness probability.
• 27:35And of course, this missingness probability is consistent
• 27:41only if this model is correctly specified, okay?
• 27:45So in the first step, we actually have multiple models
• 27:49to model the missingness probability.
• 27:52And we hope at least one model is correct.
• 27:57Otherwise, the missingness probability
• 28:00will not be correctly specified.
• 28:05Well, in the second step, we only estimate gamma.
• 28:09We are trying to model the data distribution
• 28:14and we have multiple models for the data distribution.
• 28:20And then in the third step,
• 28:22we are sort of computing an imputation estimate
• 28:26of this marginal quantile.
• 28:32And this marginal quantile will be correctly estimated
• 28:42if the data distribution is correctly specified.
• 28:50Now for the key step,
• 28:53(coughs)
• 28:54Excuse me.
• 28:55Step four is a typical formulation of
• 28:59an empirical likelihood program.
• 29:03I will get back to this in the next slide,
• 29:08why it's an empirical likelihood program.
• 29:12And this is a key contribution of the methodology.
• 29:18Now, in step five, we have the structure of IPW, okay?
• 29:23For each complete case, we have a weight to correct the bias, okay?
• 29:32And this weight actually comes from two parts.
• 29:35One part is from the missingness probability.
• 29:41The other part is from the data distribution.
• 29:45Now, the weight actually does not distinguish
• 29:48the missingness probability and the data distribution.
• 29:54The weight treats them equally.
• 29:59And another note I want to say is step two and four
• 30:03are based on the complete case only.
• 30:12Now, let's look at step four.
• 30:15Okay? Let's look at step four.
• 30:18In step four, under the assumption of missing at random,
• 30:26it's easy to verify this, okay?
• 30:29That w(X), which is the inverse of the missingness probability,
• 30:34times b(X) - E{b(X)}, has conditional mean zero: E[w(X){b(X) - E(b(X))} | R = 1] = 0, okay?
• 30:43And in this case, we can let b(X) be the score function
• 30:48of the quantile loss function.
• 30:52And these quantities are the conditional expectation
• 30:56and the conditional probability under this density.
• 31:01And because of this, okay?
• 31:06We can easily write the sample version of this scenario.
• 31:14So, the scenario is like this.
• 31:16All the weights are non-negative.
• 31:19The weights sum to one,
• 31:22and this is the estimating equation part,
• 31:25estimation equation part.
• 31:29As you can see,
• 31:30this is a typical empirical likelihood scenario.
• 31:40So, this is a typical formulation for empirical likelihood.
• 31:47And the solution can actually be given, as in our formula
• 31:55from the previous slide, by this one, okay?
• 32:02The weights can be determined by this.
• 32:05And ρ-hat can be estimated by solving this equation.
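For readers who want to see the mechanics, here is a minimal sketch of a generic empirical likelihood weight computation with a single scalar constraint. It uses the standard result that maximizing the product of the weights subject to w_i >= 0, sum w_i = 1 and sum w_i g_i = 0 gives w_i = 1 / (m (1 + ρ g_i)), where ρ minimizes -sum log(1 + ρ g_i). The scalar g_i values and all function names are assumptions for illustration; in the talk the constraints are built from the fitted missingness and data distribution models, so ρ would be a vector.

```python
import math

def neg_log_el(rho, g):
    """Convex objective whose minimizer gives the Lagrange multiplier rho."""
    return -sum(math.log(1.0 + rho * gi) for gi in g)

def feasible(rho, g):
    """All implied weights must stay positive."""
    return all(1.0 + rho * gi > 0.0 for gi in g)

def el_weights(g, max_iter=100, tol=1e-12):
    """Return (rho, weights) solving the scalar empirical likelihood program."""
    m = len(g)
    if not (min(g) < 0.0 < max(g)):
        raise ValueError("zero must lie strictly inside the range of g")
    rho = 0.0
    for _ in range(max_iter):
        d = [1.0 + rho * gi for gi in g]
        grad = sum(gi / di for gi, di in zip(g, d))         # minus the derivative
        hess = sum((gi / di) ** 2 for gi, di in zip(g, d))  # second derivative
        step = grad / hess                                   # Newton step
        # Halve the step until it is feasible and does not increase the objective.
        while (not feasible(rho + step, g)
               or neg_log_el(rho + step, g) > neg_log_el(rho, g)):
            step /= 2.0
        rho += step
        if abs(step) < tol:
            break
    weights = [1.0 / (m * (1.0 + rho * gi)) for gi in g]
    return rho, weights

rho_hat, w = el_weights([-1.0, 0.5, 2.0])
print(round(rho_hat, 4), [round(wi, 4) for wi in w])
```

The resulting weights are positive, sum to one, and satisfy the constraint, which matches the three conditions described for the empirical likelihood formulation.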
• 32:16Okay?
• 32:19So, that's the key step for this methodology, okay?
• 32:29This, actually, is the formula we first wrote down
• 32:35on paper.
• 32:36And then we thought, "Okay, this might also be able
• 32:40to be applied to the other scenario."
• 32:44Indeed it can be applied in other scenarios.
• 32:48For example, in this quantile regression
• 32:52with missing covariates.
• 32:55In this scenario, our parameter of interest is β0.
• 33:01This β0 comes from this linear quantile regression.
• 33:05We want to estimate this β0.
• 33:10And our covariates have two parts, X1 and X2.
• 33:17The X1 part is always observed,
• 33:22while X2 may have some missingness.
• 33:27So, the observed data
• 33:31are i.i.d. copies of this format:
• 33:33the missingness indicator, the response, the completely observed covariates,
• 33:43and some covariates that are missing.
• 33:45Some covariates are missing, some covariates are observed, okay?
• 33:49So, in this setting, we want to estimate β0,
• 33:55as in the previous scenario.
• 33:59We have two sets of models, okay?
• 34:02One set of models is for π, the missingness probability.
• 34:08And the other set of models is for the data distribution.
• 34:15Here the distribution is that of X2,
• 34:19conditional on the response
• 34:21and the completely observed X1.
• 34:27Now, as before, we have five steps.
• 34:35Step one and step two are the same as in case one.
• 34:40In step one, we estimate the missingness probability.
• 34:45In step two, we estimate the data distribution.
• 34:53And then in step three,
• 34:55we get a preliminary imputation estimate of β0
• 35:03by solving this seemingly very complicated equation.
• 35:09And here there's X_l, which has two parts,
• 35:17the complete part and the missing part.
• 35:21The missing part is randomly drawn
• 35:24from the data distribution
• 35:28we estimated in step two.
• 35:32And then step four, okay?
• 35:35The key step is the empirical likelihood part,
• 35:39which we use to compute the weights.
• 35:46And these weights are for the complete cases.
• 35:50And as before, the weights depend on three parts.
• 35:59One is the missingness probability, α; one is the data distribution,
• 36:05gamma. Previously, it depended on the preliminary estimate
• 36:09of the marginal quantile.
• 36:11Now, it's related to the preliminary estimate of
• 36:18the linear quantile regression coefficient β.
• 36:22Okay?
• 36:23After we estimate these weights w_i,
• 36:27then we can go to the estimating equation part, okay?
• 36:35Those are the five steps. Those are the five steps.
• 36:40As you can see, step one, step two, step three
• 36:44are all preexisting methods we adopt, trying to estimate
• 36:56the missingness probability, the data distribution,
• 37:02and also impute to get a preliminary estimate
• 37:05of the parameter we are interested in.
• 37:08And then from all these,
• 37:10we pull all this information together to get
• 37:12good weights for the complete cases,
• 37:18using this empirical likelihood method.
• 37:25And then we adjust the complete cases with the
• 37:31estimated weights to get a final estimate,
• 37:34to get the final multiple robust estimate.
• 37:41Now, case three, okay?
• 37:45In case three, the parameter we are interested in
• 37:49is still β0.
• 37:51The linear quantile regression is here.
• 37:55The scenario is that the full-data vector is (Y, X).
• 38:02In this scenario, Y is missing at random, okay?
• 38:07Of course, the simple complete case analysis
• 38:10will lead to a consistent estimate,
• 38:14but it doesn't mean it will be optimal.
• 38:18Here we are trying to get a more complicated
• 38:21but still very practical method.
• 38:30We have some auxiliary variables.
• 38:33These auxiliary variables are
• 38:36usually not of main study interest,
• 38:40and thus do not enter the quantile regression model.
• 38:43However, we can use it to help us to explain
• 38:48the missingness mechanism
• 38:51and to help us to build a more plausible model
• 38:55for the conditional distribution of Y.
• 39:00Now, here is the observed data.
• 39:06So, we now have n i.i.d. copies of (R, RY, X, S):
• 39:12Y has some missingness, X is completely observed,
• 39:19and we have the auxiliary variable S.
• 39:23We have this missing-at-random scenario.
• 39:27We use π(X, S) to denote the probability,
• 39:34and we use f(Y| X, S) to denote the conditional density.
• 39:40As before, we have multiple models
• 39:43for the missingness probability,
• 39:46and we have multiple models for the data distribution.
• 39:56And then once again, we have all five steps.
• 40:00In the first step, we model the missingness probability.
• 40:05And here we have this additional auxiliary variable.
• 40:10The second step, we model the data distribution.
• 40:14Again, we have this auxiliary variable.
• 40:17And then step three,
• 40:18we get a preliminary estimate
• 40:21using this imputation method.
• 40:24We have our preliminary estimate of the parameter
• 40:28we are interested in,
• 40:30which is a linear regression coefficient here.
• 40:36And then after the preparation of step one,
• 40:39step two, and step three,
• 40:41we are finally able to estimate our weights, okay?
• 40:46Our weights are for the complete cases.
• 40:50And from the formula here,
• 40:52you can tell why I put this scenario as scenario three,
• 40:55because it got more and more complicated.
• 40:59Although the weight still depends on three parts,
• 41:02related to the first three steps.
• 41:05The missing probability related to this alpha,
• 41:08the data distribution related to this gamma,
• 41:12and the preliminary estimate made by using the imputation
• 41:19in step three.
• 41:25And once we get the weight through
• 41:28this empirical likelihood method,
• 41:30we then put it into this estimating equation.
• 41:34Adjusted by this weight, we can get our proposed estimator
• 41:39a multiple robust estimator of
• 41:41the linear regression coefficient.
• 41:48Okay.
• 41:50(coughs)
• 41:51Our method, our framework in general,
• 41:55has these five steps; the key thing is step four,
• 41:58the empirical likelihood method to estimate the weights.
• 42:03I'll estimate his probability
• 42:06and we illustrated our framework in these three scenarios.
• 42:13Of course there are some other scenarios,
• 42:15and you can easily adapt these five steps.
• 42:20Now, let's look at some theoretical properties.
• 42:23Why do we propose these seemingly complicated five steps?
• 42:30We first look at case one. There are two parts.
• 42:40The first theorem is about consistency, okay?
• 42:46So, under certain conditions, if...
• 42:51Remember we have two sets of models.
• 42:53One set of models, we model the missingness probability.
• 42:57The other set of models, we model the data distribution.
• 43:02So if any one model, from the set
• 43:07modeling the missingness probability
• 43:12or from the set modeling the data distribution,
• 43:15if any one is correctly specified, okay?
• 43:21Then, our estimate will be consistent.
• 43:26Our estimate will be consistent.
• 43:28So, our proposed method allows you to make mistakes, okay?
• 43:37But if you make at least one right decision,
• 43:44then you get a consistent result, okay?
• 43:49Of course if you make all the bad decisions,
• 43:52you didn't choose any correct model in
• 43:55these two sets of models, then you probably won't be able
• 43:59to get a consistent result.
• 44:01Right?
• 44:04And then the second theorem is about
• 44:07the asymptotic normality.
• 44:09Under certain conditions,
• 44:17our multiple robust estimate of the marginal quantile
• 44:20will have an asymptotic normal distribution
• 44:23with mean zero, and the variance here
• 44:28is related to this variable.
• 44:30The variance is related to this data one random variable.
• 44:38And as you can see, the variance of data one
• 44:46actually comes from these three parts,
• 44:50the estimate of the missingness probability,
• 44:53the estimate of these data distribution,
• 44:56and also the imputation process, okay?
• 45:00That's for case one.
• 45:02Similarly for case two, we have these two theorems.
• 45:08One is consistency.
• 45:11As long as one model is correctly specified,
• 45:14we would have this consistency.
• 45:17And then this asymptotic normality,
• 45:21we would have an asymptotic normal distribution.
• 45:23And also the variance, there too, as you can see,
• 45:28is related to the first three steps,
• 45:32which estimate the different components, okay?
• 45:38And then case three, two theorems.
• 45:43Consistency, we need at least one model.
• 45:47As long as one model is correctly specified,
• 45:50we have a consistent result.
• 45:53And we have this asymptotic normality,
• 45:56and the variance comes from the three parts. Okay?
• 46:02As you can see, this is a very complicated formula.
• 46:07The model is getting more and more complicated.
• 46:10And also, you can compare the variance
• 46:15in case three to the one from complete case analysis.
• 46:21Because for complete case analysis,
• 46:23we also get a consistent result, but like I said,
• 46:28it doesn't mean the variance would be optimal.
• 46:30And here, we actually can verify the variance in case three
• 46:34will be smaller if our models are correctly specified, okay?
• 46:43That's the theoretical property.
• 46:49Now, let's look at some simulation, okay?
• 46:54We did simulation for each scenario,
• 46:58but due to the time limit, I will only present two.
• 47:03Let's look at the second scenario.
• 47:05In the second scenario, we have four here.
• 47:09We have X1 following an exponential distribution, and X2
• 47:13follows a normal distribution.
• 47:16And so Y is discrete, one is continuous, okay?
• 47:20The model is the simple linear model
• 47:24and the error distribution,
• 47:28as you can see, is heteroscedastic,
• 47:32because this error distribution is related to X1.
• 47:38The missingness mechanism for X2:
• 47:42in the second scenario, part of X2 is missing
• 47:47through this logistic regression, okay?
• 47:50Now, the missingness rate is about 38%.
• 47:57Eventually, we have this conditional quantile regression,
• 48:00linear regression, and we have those coefficients.
• 48:04This is our simulation setup in the second scenario.
• 48:13Now, we consider two working models for π, okay?
• 48:19The first one is correct. The second one is incorrect.
• 48:24We also consider two models for the data distribution, okay?
• 48:32All right.
• 48:33This is the incorrect one,
• 48:35the ordinary least squares regression.
• 48:38And this is the correct one, with tau 0.25 and 0.75.
• 48:48We have 1,000 replications.
• 48:51We have sample size n equals 500, L is 10.
• 48:55This L is really related to the first step
• 48:59of the imputation.
• 49:03Okay.
• 49:03Now, here is all our simulation result, okay?
• 49:09All the results have been multiplied by 100,
• 49:14which is why, as you can see, the values are large.
• 49:15And also we denote our methods with four digits, like 0000, okay?
• 49:25The first two digits represent
• 49:28the missing probability model.
• 49:31The last two are for the data distribution.
• 49:35For example, for IPW 1000,
• 49:40that means we only use inverse probability method.
• 49:44And the weight is estimated based on
• 49:49this correct model, okay?
• 49:52And for the imputation,
• 49:56that means we only use this data distribution.
• 50:00And for this IM 0010, that means we use our first model,
• 50:08which is to model the data distribution.
• 50:14This is the second model for the data distribution.
• 50:18And in either case,
• 50:20the first one is always the correct model.
• 50:23The first one is the correct model.
• 50:24The second one is not, okay?
• 50:26That's just the notation.
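The four-digit labels can be read mechanically. The small helper below is a hypothetical illustration (the name `decode_label` and the wording of the returned strings are not from the talk); it assumes, as described above, that a 1 means the working model in that position is used, that the first two positions are missingness probability models, the last two are data distribution models, and that the first model of each pair is the correct one.

```python
def decode_label(digits):
    """Translate a label such as '1000' into the working models it switches on."""
    if len(digits) != 4 or set(digits) - {"0", "1"}:
        raise ValueError("expected four characters, each 0 or 1")
    names = [
        "missingness model 1 (correct)",
        "missingness model 2 (incorrect)",
        "data model 1 (correct)",
        "data model 2 (incorrect)",
    ]
    # Keep the name of every position whose digit is 1.
    return [name for digit, name in zip(digits, names) if digit == "1"]


for label in ["1000", "0010", "1111"]:
    print(label, decode_label(label))
```

Under this reading, 1000 uses only the correct missingness model, 0010 uses only the correct data distribution model, and 1111 uses all four working models at once.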
• 50:28As you can see here using IPW
• 50:31if the model is correctly specified,
• 50:34the bias is quite small
• 50:35and everything is quite good.
• 50:38However, if you misspecify the missingness probability,
• 50:42we see the estimate is quite out of control, okay?
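The idea behind inverse probability weighting can be seen in a deliberately simplified sketch. It is not the estimator from the talk: here the response itself is the variable that goes missing, the target is a marginal median rather than a regression quantile, and the true observation probabilities are plugged in rather than estimated. The function names and the data-generating choices are assumptions made for illustration.

```python
import math
import random


def weighted_quantile(values, weights, tau):
    """Smallest value whose cumulative weight reaches tau of the total weight."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    running = 0.0
    for v, w in pairs:
        running += w
        if running >= tau * total:
            return v
    return pairs[-1][0]


def compare_estimators(n=20000, tau=0.5, seed=0):
    """Return (complete-case, IPW) estimates of the tau-quantile of y."""
    rng = random.Random(seed)
    ys, ws = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)           # true median of y is 0
        p_obs = 1.0 / (1.0 + math.exp(-x))    # y is seen more often when x is large
        if rng.random() < p_obs:
            ys.append(y)
            ws.append(1.0 / p_obs)            # inverse probability weight
    complete_case = weighted_quantile(ys, [1.0] * len(ys), tau)
    ipw = weighted_quantile(ys, ws, tau)
    return complete_case, ipw


cc, ipw = compare_estimators()
print(f"complete case: {cc:.3f}, IPW: {ipw:.3f}")
```

In this toy setting the unweighted median of the observed responses is pulled upward because large responses are observed more often, while the weighted version stays close to the true median of 0. If the weights were built from a wrong probability model, that correction would no longer be guaranteed.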
• 50:47Let's see the IM imputation: if you specify correctly
• 50:53the data distribution, the result is good.
• 50:56If not, then it's not.
• 50:58Okay.
• 50:59Then there's the multiply robust method.
• 51:03In the multiply robust method,
• 51:08we look at, for example, this one:
• 51:12if we get the missing probability correctly specified,
• 51:15then we get a good result.
• 51:17If not, we get a bad result, as with IPW, okay?
• 51:22But anyway, if we can choose to use all these four models,
• 51:29as you can see, the result is quite good, okay?
• 51:33The take-home message from this simulation study is,
• 51:39if you have some ideas about the missingness probability
• 51:47or about this data distribution,
• 51:50and you think, "Okay, maybe this one is right,
• 51:53or maybe this one is also right," okay?
• 51:56Then our method just tells you,
• 51:58"Okay, you don't have to choose. Just put all these
• 52:04potential candidate models into our framework."
• 52:11Then we look at the result.
• 52:16This simulation is for scenario two.
• 52:22We also have a simulation for scenario three,
• 52:27but I will skip it here and go directly to the
• 52:35real data analysis.
• 52:38So, in this real data analysis, we look at this
• 52:43AIDS Clinical Trials Group Protocol 175, or ACTG 175, data.
• 52:52This study evaluated treatment with either a single
• 53:01nucleoside or two nucleosides in HIV-infected subjects
• 53:05whose CD4 cell counts
• 53:08were from 200 to 500 per cubic millimeter.
• 53:14So, we consider two arms of treatment.
• 53:17One is the standard treatment,
• 53:19and the other one is with three newer treatments.
• 53:24The two arms respectively,
• 53:29have about 500 and 1,600 subjects.
• 53:34Now, model we are looking at is
• 53:36the linear quantile regression model
• 53:39and with those kind of covariates inside.
• 53:43The data can be found in this package.
• 53:51Now for the data, the average age of the subjects is 35 years,
• 53:57the standard deviation is about nine,
• 54:01and the variable CD4 96 is missing for approximately 37%.
• 54:10It's quite similar to the simulation scenario.
• 54:16It actually matches part of the setup of the simulation scenario.
• 54:22However, at baseline and during the follow-up,
• 54:25full measurements on additional variables correlated
• 54:28with CD4 96 are obtained.
• 54:30So this would be the missing part. We get the missing part.
• 54:39Here we assume this CD4 96 is missing at random.
• 54:46And we also have other baseline, for example,
• 54:50CD4 80 and CD4 20, and so on.
• 54:56We will use these as auxiliary variables.
• 55:01So, we have our third scenario
• 55:07in this real data analysis.
• 55:12And why did we choose this data?
• 55:16If we look at this CD4 96, the histogram of this, okay?
• 55:24The left one is on its original scale.
• 55:32The right one is after we do the log transformation.
• 55:39So, as you can see, the left one is kind of truncated,
• 55:46and the right one is also truncated.
• 55:49So you may debate,
• 55:50"Okay, which one should I use?
• 55:52Do I take the log transformation or not?
• 55:59To be, or not to be."
• 56:03So there's no apparent reason to favor one of them
• 56:10for the imputation method.
• 56:13Now, what do we do?
• 56:17In our proposed method,
• 56:19we can put both of these models in our framework, okay?
• 56:26We don't need to make the choice.
• 56:29And because there's no apparent reason
• 56:31to take the log or not take the log,
• 56:33let's put the two together into our model, okay?
• 56:38So we can simultaneously accommodate both models.
• 56:44And then we have eight covariates and auxiliary variables.
• 56:49Then the missingness probability is modeled by
• 56:54a logistic regression containing all main effects of X and S.
• 57:02So, here is the result. Here is the result.
• 57:04This is a big table, but let me summarize this table.
• 57:10Okay.
• 57:11The three newer treatments significantly slow the progression.
• 57:16Our proposed method and the IPW method
• 57:19produce very similar results, okay?
• 57:23And the imputation estimator
• 57:27one failed to catch the difference in the
• 57:31treatment arm effect for different quantiles.
• 57:38The imputation estimator two gives
• 57:40an increasing estimated effect and covariance.
• 57:44In addition, the two imputation estimates
• 57:48are quite sensitive to the selection of the working models.
• 58:04Okay?
• 58:05And also, from these real data,
• 58:07we can tell that complete case analysis
• 58:11overestimates the treatment arm effects once again,
• 58:16so even though the complete case analysis is sometimes valid,
• 58:23there are also advantages to using our proposed method.
• 58:34All right, so here's the summary of my talk.
• 58:40We proposed a general framework for
• 58:44quantile estimation with missing data.
• 58:48And we actually applied this framework
• 58:52in different scenarios.
• 58:55Now, the take-home message is,
• 59:00our proposed methods are robust against
• 59:04possible model misspecification.
• 59:08So, we have two sets of models,
• 59:10one for the missing probability
• 59:12and one for the data distribution.
• 59:14As long as one model is correct,
• 59:17then we will get a good result.
• 59:19And also our method can easily be generalized
• 59:23to many other scenarios.
• 59:26And I think that's all of my talk,
• 59:32and thank you.
• 59:36- All right.
• 59:37Thank you, Linglong. This was very interesting.
• 59:39I think we're almost out of time, so if there's
• 59:43we have time probably for one question.
• 59:45So if there's any, if not
• 59:48Let's see if there are any questions.
• 59:52Feel free to write in the chat box or unmute yourselves.
• 01:00:12Okay.
• 01:00:13Just gonna ask one question
• 01:00:14and then I think I'm gonna ask all the questions
• 01:00:17when we meet.
• 01:00:19Just a quick question.
• 01:00:20Do you know why the complete case analysis has
• 01:00:24overestimation rather than underestimation?
• 01:00:27Like, do you have a feeling why that's the case and what?
• 01:00:33- Well, I don't know. No.
• 01:00:39- Yeah.
• 01:00:40I believe it will be interesting to see what cases,
• 01:00:42like what are the conditions for overestimation
• 01:00:45or underestimation for complete case analysis, I guess.
• 01:00:48I guess, it must depend on the data distribution
• 01:00:52and the missingness mechanism that's been made.
• 01:00:56But I'm not sure.
• 01:00:59- I agree with you.
• 01:01:01The reason I answered I don't know
• 01:01:05is because it's really hard to know how the data are missing,
• 01:01:11although we assume it's missing at random.
• 01:01:13- Yeah.
• 01:01:14- But, who knows the reality?
• 01:01:17- Right. Yeah, right.
• 01:01:19I guess, under your assumption of missing at random,
• 01:01:22then I guess there could be conditions for underestimation
• 01:01:27or overestimation under the assumption of MAR.
• 01:01:31But, I don't know.
• 01:01:32I was wondering if people have derived those or not.
• 01:01:36(laughs)
• 01:01:37That could be future work, right?
• 01:01:40(laughs)
• 01:01:42All right.
• 01:01:43Linglong, thank you.
• 01:01:44I'll see you in an hour for our one-on-one meeting,
• 01:01:47and I know other students and maybe faculty have
• 01:01:51signed up for it to meet with you.
• 01:01:53So, thank you very much.
• 01:01:55And I'll see you later. All right.
• 01:01:57- Thank you.
• 01:01:57- Bye-bye. Thank you everyone for joining.
• 01:01:58Bye.
• 01:01:59- Bye.
• 01:02:00- Bye.