BIS Seminar: A General Framework for Quantile Estimation with Incomplete Data
October 14, 2020
Linglong Kong
Department of Mathematical and Statistical Sciences
University of Alberta
October 13, 2020
- 00:00- Hi, everyone.
- 00:02Welcome to the departmental seminar of
- 00:04the Department of Biostatistics, Yale University.
- 00:09I'm pleased to introduce to you Linglong Kong.
- 00:12He is an associate professor in the Department of Mathematical
- 00:16and Statistical Sciences at the University of Alberta.
- 00:20His research interests are, and correct me if I'm wrong,
- 00:24in functional and neuroimaging data analysis,
- 00:27statistical machine learning,
- 00:29and robust statistics and quantile regression.
- 00:32So today, he is gonna talk about his work on
- 00:35a general framework for quantile estimation
- 00:38with incomplete data.
- 00:40Thank you, Linglong. And whenever you're ready.
- 00:44- Thank you, Laura, for the introduction.
- 00:47And also thanks to Professor John for the invitation.
- 00:52I'm very happy to be here, although it's way too early.
- 00:57So today I'm going to talk about a general framework for
- 01:01quantile estimation with incomplete data.
- 01:13So, this is joint work with Peisong Han from
- 01:20the University of Michigan, Jiwei Zhao from
- 01:23the University of Wisconsin-Madison, and Xingcai Zhou.
- 01:27And we started this work in the second year
- 01:33after I started my position at the University of Alberta.
- 01:37I have known Peisong for a long time, since he was a student,
- 01:44and at that time he had just started his position as an
- 01:48assistant professor at the University of Waterloo.
- 01:51And I invited him to visit me and afterwards,
- 01:56he invited me to visit him.
- 01:58And we felt that since we had visited each other already,
- 02:02we should get something done.
- 02:04And I remember sitting in his office
- 02:11at the University of Waterloo and thinking about
- 02:15what we could do together.
- 02:17And eventually we thought, "Okay, what am I good at?
- 02:21Well, my research area is quantile regression.
- 02:24And what is Peisong good at?
- 02:27One of Peisong's research areas is missing data."
- 02:31So we said maybe we can put them together,
- 02:34and then we wrote a couple of formulas on the paper.
- 02:41Then we felt like, "Okay, we have something already."
- 02:45Then we went to have dinner.
- 02:48And then one year later Peisong sent me like
- 02:52two pages of a draft and said maybe we should continue it.
- 02:57And that's the first scenario in this topic
- 03:03I'm gonna talk about.
- 03:04And then another half a year later, I sent him my feedback.
- 03:12I said, "Why don't we make it more general,
- 03:15make it a framework?"
- 03:17So that we would be able to apply it
- 03:20to other scenarios.
- 03:22And then we both felt it was a good idea,
- 03:26and we started working on it.
- 03:27At that time, Jiwei was a postdoc at the University of Waterloo
- 03:33and Xingcai was my postdoc.
- 03:35So, we worked together and started the project.
- 03:38Eventually, it became a project that I'm kind of proud of.
- 03:47So, what's missing data?
- 03:49Missing data arise in almost all
- 03:52serious statistical analyses.
- 03:56Missing values are representative of the
- 04:03messiness of the real world.
- 04:05Why would we have missing values?
- 04:08It could be for all kinds of reasons.
- 04:12For example, it may be due to social or natural processes.
- 04:17Like, for example, a student graduates and
- 04:20gets a job; or in a clinical trial, people die, and so on.
- 04:26And it could also happen in a survey.
- 04:29For example, for a certain question,
- 04:32only respondents who answer yes
- 04:35continue on to answer certain follow-up questions.
- 04:38Or maybe it's intentional missingness
- 04:41as part of the data collection process.
- 04:45Or some other scenarios, including random data collection
- 04:48issues, respondent refusal, or non-response.
- 04:56So, mathematically, how do we categorize these kinds of missingness?
- 05:01Here are the three scenarios.
- 05:05Now, the first scenario we call missing completely at random.
- 05:10What does that mean?
- 05:11That means the missingness has nothing to do with the
- 05:15person being studied.
- 05:17The values just happen to be missing;
- 05:19it's not related to any feature of this person.
- 05:23The second scenario is missing at random.
- 05:26The missingness has to do with the person, but can be predicted
- 05:30from other information about the person.
- 05:34Like, in certain scenarios,
- 05:39the missingness may be predictable from some
- 05:43auxiliary variables, some auxiliary information.
- 05:48The third one is a very hard one: missing not at random.
- 05:55The missingness depends on unobserved information,
- 05:59and sometimes even on the response itself.
- 06:05So, the missingness is specifically related to
- 06:08what is missing.
- 06:09For example, a person did not attend a drug test
- 06:13because the person took drugs the night before.
- 06:17And therefore, the next day,
- 06:18he couldn't make it to the drug test.
- 06:20We couldn't get that drug test result.
- 06:23These are the three missing-data mechanisms.
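To make the three mechanisms concrete, here is a minimal Python sketch (not from the talk) that generates MCAR, MAR, and MNAR missingness for a toy response; the models and coefficients are hypothetical.

```python
# Toy illustration of the three missing-data mechanisms.
# The response y and covariate x are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                      # fully observed covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)      # response that may go missing

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# MCAR: missingness is unrelated to anything in the data.
r_mcar = rng.random(n) < 0.7                # R = 1 means "observed"

# MAR: missingness depends only on the observed covariate x.
r_mar = rng.random(n) < sigmoid(1.0 - x)    # larger x -> more missing

# MNAR: missingness depends on the possibly unobserved y itself.
r_mnar = rng.random(n) < sigmoid(3.0 - y)   # larger y -> more missing

y_obs = np.where(r_mar, y, np.nan)          # observed data under MAR
print(np.isnan(y_obs).mean())               # realized missingness rate
```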
- 06:30How do we handle those missing data?
- 06:33There are many strategies.
- 06:35For example, the first one would be,
- 06:37well, let's try to get the missing data.
- 06:40That would be great.
- 06:42But in reality, that's usually impossible.
- 06:48But the second is: well, since we have incomplete cases,
- 06:52let's just discard them.
- 06:57Just analyze the complete cases, right?
- 07:02But this could cause some other problems.
- 07:05We will talk about it.
- 07:07And the third one is we replace the missing data
- 07:12with some conservative estimate.
- 07:14For example, using sample mean, sample median, and so on.
- 07:20The fourth one is we try to estimate the missing data
- 07:25from other data on the person.
- 07:27We use some sort of more sophisticated method to impute.
- 07:37Now in particular, mathematically speaking,
- 07:43these are the strategies we use today to deal
- 07:46with missing data.
- 07:48The first one is complete case analysis.
- 07:51This is very simple, okay?
- 07:52We just analyze the complete cases, okay?
- 07:56We only take into consideration the individuals with
- 08:01no missing data.
- 08:05Sometimes it can provide good results,
- 08:07but the estimates obtained from this complete case analysis
- 08:12may be biased if the excluded individuals are systematically
- 08:18different from those included.
- 08:20So hence, if the complete cases are a good
- 08:24representation of the missing cases,
- 08:28then this method will be fine.
- 08:34Otherwise, if the complete cases are quite different from
- 08:38those we miss, then the result can be biased.
- 08:44And then there's the inverse probability weighting method, IPW.
- 08:50This is a commonly used method to correct the bias from a
- 08:53complete case analysis.
- 08:56What does that mean?
- 08:57It means, okay, we give each complete case a weight.
- 09:03This weight is the inverse of the probability of
- 09:07being a complete case.
- 09:12Well, this can also cause some bias
- 09:16if the model for the probability of being a complete case is misspecified.
- 09:25The third strategy is more sophisticated:
- 09:29multiple imputation.
- 09:31It's quite a common method,
- 09:32especially nowadays in genetic studies.
- 09:35How do we do multiple imputation?
- 09:39We create multiple sets of imputations for
- 09:44the missing values, using an imputation process
- 09:48with a random component.
- 09:51Now, we have full data sets.
- 09:54Then we analyze each data set.
- 09:59Those full data sets can be a little bit different,
- 10:02can be slightly different, because of the randomness of
- 10:08the imputation process.
- 10:11Anyway, we analyze those data sets, the completed data sets,
- 10:14and then we get a set of parameter estimates for each.
- 10:17Then we can combine those results.
- 10:20We can combine these results,
- 10:22and hopefully we get a better result.
- 10:26Multiple imputation sometimes works quite well,
- 10:31but only if the missing-data mechanism can be ignored
- 10:36and we have good imputation models.
- 10:39And part of it depends on the nature of the data;
- 10:41the other part depends on what kind of imputation model
- 10:45we are going to use.
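As a rough sketch of this multiple-imputation workflow (not from the talk): impute with a random component, analyze each completed data set, and combine with Rubin's rules. The normal imputation model is an assumption, and this simplified "improper" version ignores parameter uncertainty.

```python
# Minimal multiple-imputation sketch with Rubin's combining rules.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
r = rng.random(n) < 1.0 / (1.0 + np.exp(-(1.0 - x)))   # MAR indicator

# imputation model for y | x, fitted on the complete cases
b1, b0 = np.polyfit(x[r], y[r], 1)
sigma = np.std(y[r] - (b0 + b1 * x[r]), ddof=2)

M = 20                                     # number of imputed data sets
est, var = [], []
for _ in range(M):
    y_imp = y.copy()
    miss = ~r
    # random component: draw imputations rather than plugging in the mean
    y_imp[miss] = b0 + b1 * x[miss] + rng.normal(scale=sigma, size=miss.sum())
    est.append(y_imp.mean())               # analyze each completed data set
    var.append(y_imp.var(ddof=1) / n)

est, var = np.array(est), np.array(var)
point = est.mean()                         # combined point estimate
total_var = var.mean() + (1 + 1 / M) * est.var(ddof=1)   # Rubin's rules
print(point, total_var)
```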
- 10:51Now, those are the ways we deal with missing data,
- 10:56the strategies we typically use to deal with missing data.
- 11:01But let's match them with the missing mechanisms.
- 11:06How do we use these strategies to deal with
- 11:11different missing mechanisms?
- 11:14For example, if the data is missing completely at random,
- 11:19now in this case, the complete case analysis is quite good.
- 11:25Multiple imputation or any other imputation method
- 11:29is also okay,
- 11:31is also valid.
- 11:32So, this missing completely at random is
- 11:36the easiest case to deal with.
- 11:40What if the data is missing at random?
- 11:43Then in this case, some complete case analyses are valid,
- 11:51and multiple imputation is usually okay too,
- 11:56if the bias is negligible.
- 12:00Now, in the third case,
- 12:02if the data is missing not at random,
- 12:05then we have to model the missingness explicitly.
- 12:11We need joint modeling:
- 12:15we need to jointly model the response
- 12:17and also the missingness.
- 12:22In practice, of course,
- 12:23we try to assume missing at random whenever possible
- 12:28and try to avoid dealing with the
- 12:32missing-not-at-random situation.
- 12:34But in reality, it's not something that we can control.
- 12:41Sometimes we have data that is missing not at random.
- 12:45I think there is even a journal special issue
- 12:53dedicated to the missing-not-at-random situation.
- 13:02Now, we have different strategies.
- 13:04And as I said, different strategies
- 13:07have different advantages and disadvantages.
- 13:12For example, multiple imputation is generally more efficient
- 13:17than IPW, but it's more complex.
- 13:23And the imputation and IPW approaches
- 13:28require us to model the data distribution
- 13:32and the missingness probability, respectively.
- 13:35For imputation, we need to model the data distribution.
- 13:39For IPW, we need to model the missingness probability.
- 13:45And also, for all these strategies,
- 13:48we will have good properties
- 13:52only if the corresponding model is correctly specified.
- 13:59Most existing methods are vulnerable to
- 14:03these model misspecifications.
- 14:06Of course, we can use nonparametric methods to reduce the risk
- 14:11of model misspecification, but it's often impractical
- 14:16due to the curse of dimensionality.
- 14:21So now, how do we deal with this model misspecification?
- 14:27We have some methods available.
- 14:30For example, we can use a double robust method.
- 14:37In particular, in the double robust method,
- 14:40we have this augmented IPW.
- 14:44We not only model the missingness probability,
- 14:49but also the distribution.
- 14:52Why is it double robust?
- 14:54Because the result will be consistent
- 14:58if either model is correct.
- 15:02If the way we model the missingness probability
- 15:07or the way we model the distribution is correct,
- 15:12then we will get a consistent result.
- 15:14And that's why it's called double robust.
- 15:18Well, now, if we are not satisfied with double robustness,
- 15:22what about a multiple guarantee?
- 15:25So, we have this multiple robustness.
- 15:27This was proposed by Peisong.
- 15:33The multiple robust method is proposed to accommodate
- 15:38multiple models for the missingness probability
- 15:42and the distribution.
- 15:45In double robust, we have only one model for the missingness
- 15:48probability and one model for the data distribution.
- 15:51Well, for multiple robust,
- 15:54we can have multiple models for the missingness probability,
- 15:59and we can have multiple models for the data distribution.
- 16:05The good thing is the estimation result will be consistent
- 16:11if any one of the models is correct.
- 16:19Now, let's look at these questions mathematically.
- 16:26So, we are looking at missing at random.
- 16:29We assume the observed data are i.i.d.
- 16:34So we have data (R, RY, X).
- 16:38R, we use to indicate missingness; and for the IPW estimator,
- 16:48essentially we are trying to solve this equation.
- 16:52And here, this π is the probability
- 16:58of being a complete case.
- 17:01And IPW is consistent
- 17:03only if this π(X) is correctly specified.
- 17:08And then, from the equation,
- 17:10we can get a consistent estimate of the
- 17:13parameter we are interested in.
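From this description, the slide's equation should be the IPW estimating equation for the marginal τ-quantile, Σ_i {R_i/π̂(X_i)} ψ_τ(Y_i − q) ≈ 0 with ψ_τ(r) = τ − I(r < 0). A minimal sketch (not from the talk; the logistic working model for π(X) is an assumption):

```python
# IPW sketch: solve sum_i (R_i / pi_hat(X_i)) * psi_tau(Y_i - q) ~= 0.
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_quantile(y, w, tau):
    # The weighted estimating equation is solved by the weighted
    # tau-quantile: the smallest y at which the weighted CDF reaches tau.
    order = np.argsort(y)
    cum = np.cumsum(w[order]) / np.sum(w)
    return y[order][np.searchsorted(cum, tau)]

def ipw_quantile(y, X, r, tau):
    # working model for pi(X) = P(R = 1 | X), fitted on all subjects
    pi_hat = LogisticRegression().fit(X, r).predict_proba(X)[:, 1]
    return weighted_quantile(y[r], 1.0 / pi_hat[r], tau)

# toy usage with the response missing at random given X
rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 1))
y = X[:, 0] + rng.normal(size=n)
r = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))
print(ipw_quantile(y, X, r, tau=0.5))
```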
- 17:17This is IPW. The other one is imputation.
- 17:23For imputation, we need to model the data distribution.
- 17:28And here we have a model for f(Y|X).
- 17:36And as you can see,
- 17:37we have our imputation for those missing data.
- 17:44This imputation is consistent
- 17:47only if the data distribution is correctly modeled,
- 17:52that is, if this f(Y|X) is correctly modeled.
- 17:58Now, for the augmented inverse probability weighted method,
- 18:05we actually combine these two together.
- 18:11We have the first part from IPW,
- 18:14and the second part from imputation.
- 18:17So the estimation result will be consistent
- 18:23if either the model for the missingness probability
- 18:28or the model for the data distribution is correctly specified.
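A sketch of this augmentation idea for the marginal quantile (not from the talk; the logistic model for π and the normal working model for f(Y|X) are assumptions, and E[ψ_τ(Y − q) | X] = τ − F(q | X) is evaluated under the working model):

```python
# AIPW sketch for the marginal tau-quantile q: solve
#   sum_i { R_i/pi(X_i) * psi_tau(Y_i - q)
#           + (1 - R_i/pi(X_i)) * (tau - F(q | X_i)) } = 0.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_quantile(y, X, r, tau):
    pi_hat = LogisticRegression().fit(X, r).predict_proba(X)[:, 1]
    reg = LinearRegression().fit(X[r], y[r])     # working model for E[Y|X]
    mu = reg.predict(X)
    sigma = np.std(y[r] - mu[r], ddof=X.shape[1] + 1)
    y_safe = np.where(r, y, 0.0)                 # placeholder where Y missing

    def ee(q):
        ipw_part = (r / pi_hat) * (tau - (y_safe < q))
        aug_part = (1 - r / pi_hat) * (tau - norm.cdf(q, loc=mu, scale=sigma))
        return np.sum(ipw_part + aug_part)

    lo, hi = y[r].min() - 1.0, y[r].max() + 1.0
    return brentq(ee, lo, hi)                    # ee decreases in q
```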
- 18:35Well, the multiple robust methods
- 18:38have a series of models for the missingness probability
- 18:44and a series of models for the data distribution.
- 18:49And the result will be consistent
- 18:53if any one model is correctly specified.
- 19:01Well, this was just
- 19:03a quick review of missing data.
- 19:07Like I said, this part is
- 19:12one of Peisong's research areas.
- 19:14For me, my research area is quantile regression.
- 19:18So, in terms of quantile regression, at that time
- 19:23we were thinking, "Okay, those methods,
- 19:26these IPW, AIPW or double robust methods,
- 19:32multiple robust methods, had been quite well studied
- 19:35for modeling the conditional mean.
- 19:39For the conditional quantile, there are not
- 19:41a lot of methods available."
- 19:44Why do we care about the quantile?
- 19:46Quantiles not only provide the central features
- 19:49of the distribution, but also capture the tail behavior.
- 19:57And also under very mild conditions,
- 20:01the quantile function can uniquely determine
- 20:05the underlying distribution.
- 20:07So, there are a lot of advantages to modeling the quantiles.
- 20:13Then, we decided to study this missingness
- 20:18in quantile estimation.
- 20:21In particular, we proposed a general framework
- 20:23for quantile estimation with missing data.
- 20:30So, our proposed model, this kind of framework,
- 20:35can handle a lot of estimation problems for
- 20:38quantile estimation with missingness.
- 20:43But in this paper,
- 20:46we particularly applied our proposed method to
- 20:50these three scenarios,
- 20:52okay, three commonly encountered situations.
- 20:56In the first one, we try to estimate
- 21:01the marginal quantile of the response.
- 21:04This response has some missingness,
- 21:09while there are fully observed covariates.
- 21:13That's the first scenario: the response has some missingness
- 21:16while the corresponding covariates are fully observed.
- 21:20In the second scenario, we are looking at
- 21:23the conditional quantile of a fully observed response.
- 21:28In this scenario, we look at the case where
- 21:31some covariates are only partially available.
- 21:36So, we have some missingness in the covariates.
- 21:39And then in the third scenario, we are still looking at
- 21:43the conditional quantile of a response.
- 21:47And in this case, the response has some missingness,
- 21:52and we have fully observed covariates
- 21:55and also extra auxiliary variables.
- 22:02Now, let's look at the first situation.
- 22:07We want to estimate the marginal quantile.
- 22:10In this scenario, the response has some missingness
- 22:18and we have the covariates fully observed.
- 22:22Now, let m be the number of subjects with
- 22:26data completely observed.
- 22:30Then our method consists of the following five steps.
- 22:38In the first step, we calculate this α, or estimate this α.
- 22:45This α is related to the missingness probability, okay?
- 22:52The way we estimate it is by maximizing
- 22:57the binomial likelihood.
- 23:01So, in the first step we estimate the α,
- 23:03and then we get an estimate of the missingness probability.
- 23:10Okay?
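So step 1 amounts to fitting each working model for the missingness probability by binomial maximum likelihood. A sketch for one hypothetical logistic working model π_α(X) (not from the talk):

```python
# Step 1 sketch: estimate alpha by maximizing the binomial likelihood
#   prod_i pi_alpha(X_i)^{R_i} * {1 - pi_alpha(X_i)}^{1 - R_i}.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_missingness_model(R, X):
    Z = np.column_stack([np.ones(len(R)), X])    # add intercept

    def neg_loglik(alpha):
        p = np.clip(expit(Z @ alpha), 1e-10, 1 - 1e-10)
        return -np.sum(R * np.log(p) + (1 - R) * np.log(1 - p))

    alpha_hat = minimize(neg_loglik, np.zeros(Z.shape[1]), method="BFGS").x
    return alpha_hat, expit(Z @ alpha_hat)       # alpha and fitted pi(X_i)
```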
- 23:11In the second step, we calculate γ.
- 23:16This γ is related to the data distribution.
- 23:21So, we maximize the likelihood under the data distribution model.
- 23:25This γ is a parameter related to the distribution.
- 23:33And then in the third step, we get a
- 23:39sort of preliminary estimate of the quantile,
- 23:43the marginal quantile, through an imputation process,
- 23:51by solving this equation.
- 23:55And as you can see, this is quite close to the AIPW scenario.
- 24:05Okay?
- 24:06And in this equation, this ψ_τ is the score function
- 24:13of the quantile loss function.
- 24:17This ψ_τ(r) is τ − I(r < 0).
- 24:23This is the generalized derivative
- 24:28of the quantile loss function, okay?
- 24:34Here, this equation cannot be made exactly zero.
- 24:39The reason is that this ψ_τ is a non-smooth function,
- 24:46so sometimes it won't be exactly zero.
- 24:53Basically, that's the third step, okay?
- 24:57Now, we have a preliminary estimator
- 25:01of the marginal quantile.
- 25:03The fourth step is the key step of the method;
- 25:08it is where the multiple robustness is coming from.
- 25:14Now, we calculate weights for the complete cases.
- 25:19In total, we have m complete cases.
- 25:21For each case, we calculate the weight.
- 25:24As you can see, the weight is determined by three parts.
- 25:32The first part is related to this α,
- 25:36which is related to the missingness probability, okay?
- 25:40The missingness probability.
- 25:43The second part is related to this γ.
- 25:47This is related to the data distribution.
- 25:52The third part is related to this q̂,
- 25:56the preliminary estimate of the marginal quantile,
- 26:02which is related to the third step.
- 26:07As you can see, in the first three steps,
- 26:10we are trying to get ready for this,
- 26:14to get the estimate of the weights for the complete cases,
- 26:18for these complete cases.
- 26:21And also, we have a parameter ρ,
- 26:23which is obtained through
- 26:27minimizing this equation, through minimizing this equation.
- 26:33Now, after we calculate the weights,
- 26:36we get our final estimate, our multiple robust estimate,
- 26:42by solving the following weighted estimating equation.
- 26:50This w_i is the weight.
- 26:52We estimate it from the first four steps.
- 26:58And this ψ_τ is the score function of the quantile loss, okay?
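In other words, once the weights ŵ_i from step 4 are in hand, step 5 solves Σ_i ŵ_i ψ_τ(Y_i − q) = 0 over the complete cases, which is just a weighted τ-quantile. A minimal sketch (not from the talk):

```python
# Step 5 sketch: the multiple robust estimate solves
#   sum_{i: complete} w_i * psi_tau(Y_i - q) = 0,
# i.e. it is the w-weighted tau-quantile of the complete cases.
import numpy as np

def mr_quantile(y_complete, w, tau):
    order = np.argsort(y_complete)
    cum = np.cumsum(w[order]) / np.sum(w)   # weighted empirical CDF
    return y_complete[order][np.searchsorted(cum, tau)]
```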
- 27:06Now, you may be wondering what's going on
- 27:10with these five steps.
- 27:14Let me try to explain them one by one, okay?
- 27:20In the first step, we get the estimate of α, okay?
- 27:24We get the estimate of α.
- 27:28In essence, we are trying to model the missingness probability, okay?
- 27:33The missingness probability.
- 27:35And of course, this missingness probability estimate is consistent
- 27:41only if the model is correctly specified, okay?
- 27:45So in the first step, we actually have multiple models
- 27:49to model the missingness probability.
- 27:52And we hope at least one model is correct.
- 27:57Otherwise, the missingness probability
- 28:00will not be correctly specified.
- 28:05Well, in the second step, we estimate γ.
- 28:09We are trying to model the data distribution,
- 28:14and we have multiple models for the data distribution.
- 28:20And then in the third step,
- 28:22we are sort of doing an imputation estimate
- 28:26of the marginal quantile.
- 28:32And this marginal quantile will be correctly estimated
- 28:42if the data distribution is correctly specified.
- 28:50Now for the key step,
- 28:53(coughs)
- 28:54Excuse me.
- 28:55Step four is a typical formulation of
- 28:59an empirical likelihood problem.
- 29:03I will get back to this in the next slide,
- 29:08why it's an empirical likelihood problem.
- 29:12And this is the key contribution of our methodology.
- 29:18Now, in step five, we have the structure of IPW, okay?
- 29:23For the complete cases, we have weights to correct the bias, okay?
- 29:32And this weight actually comes from two parts.
- 29:35One part is from the missingness probability.
- 29:41The other part is from the data distribution.
- 29:45Now, the weight actually does not distinguish between
- 29:48the missingness probability and the data distribution.
- 29:54It treats them equally.
- 29:59And another note I want to make is that steps two and four
- 30:03are based on the complete cases only.
- 30:12Now, let's look at step four.
- 30:15Okay? Let's look at step four.
- 30:18In step four, we use the assumption of missing at random.
- 30:26It's easy to verify this, okay:
- 30:29for w(X), which is the inverse of the missingness probability,
- 30:34E[w(X){b(X) − E[b(X)]} | R = 1] = 0, okay?
- 30:43And in this case, we can let b(X) be the score function
- 30:48of the quantile loss function,
- 30:52and these expectations are conditional expectations,
- 30:56conditional probabilities under the posited density.
- 31:01And because of this, okay,
- 31:06we can easily write the sample version, the empirical version.
- 31:14So, the formulation is like this:
- 31:16all the weights are non-negative,
- 31:19the weights sum to one,
- 31:22and this is the estimating equation part,
- 31:25the estimating equation constraint.
- 31:29As you can see,
- 31:30this is a typical empirical likelihood scenario.
- 31:40So, this is a typical formulation for empirical likelihood.
- 31:47And the solution actually can be given in a closed form,
- 31:55as previously; it can be given by this one, okay?
- 32:02The weights can be determined by this.
- 32:05And ρ̂ can be estimated by solving this equation.
- 32:16Okay?
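A sketch of that closed form (not from the talk): writing g_i for the stacked centered constraint functions built from the working models in steps 1-3, the weights are w_i = m⁻¹/(1 + ρ̂ᵀg_i), with ρ̂ solving Σ_i g_i/(1 + ρᵀg_i) = 0, the first-order condition of a convex minimization.

```python
# Empirical-likelihood weights for the m complete cases:
#   w_i = (1/m) / (1 + rho' g_i),  rho solving sum_i g_i/(1 + rho' g_i) = 0.
# g is an (m x k) matrix of constraint functions from steps 1-3; its exact
# construction follows the paper and is taken as given here.
import numpy as np
from scipy.optimize import minimize

def el_weights(g):
    m, k = g.shape

    def objective(rho):
        t = 1.0 + g @ rho
        if np.any(t <= 1e-8):       # stay inside the feasible region
            return np.inf
        return -np.sum(np.log(t))   # convex; its gradient is the EL equation

    rho_hat = minimize(objective, np.zeros(k), method="Nelder-Mead").x
    w = 1.0 / (m * (1.0 + g @ rho_hat))
    return w / w.sum()              # normalize so the weights sum to one
```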
- 32:19So, those are all the key steps of this methodology, okay?
- 32:29This actually is the formula we first wrote down
- 32:35on the paper.
- 32:36And then we thought, "Okay, this might also be
- 32:40applicable to other scenarios."
- 32:44Indeed it can be applied in other scenarios.
- 32:48For example, in this quantile regression
- 32:52with missing covariates.
- 32:55In this scenario, our parameter of interest is β0.
- 33:01This β0 comes from this linear quantile regression.
- 33:05We want to estimate this β0.
- 33:10And our covariates have two parts, X1 and X2.
- 33:17This X1 part is always observed,
- 33:22while this X2 part may have some missingness.
- 33:27So, the observed data:
- 33:31we have n i.i.d. copies of this format.
- 33:33The response is completely observed,
- 33:43and some covariates (X2) may be missing,
- 33:45while some covariates (X1) are always observed, okay?
- 33:49So, in this setting, we want to estimate β0,
- 33:55as in the previous scenario.
- 33:59We have two sets of models, okay?
- 34:02One set of models is for π, the missingness probability.
- 34:08And the other set of models is for the data distribution.
- 34:15Here the distribution is that of X2,
- 34:19conditional on the response
- 34:21and the completely observed X1.
- 34:27Now, as before, we have five steps.
- 34:35Steps one and two are the same as in case one.
- 34:40In step one, we estimate the missingness probability.
- 34:45In step two, we estimate the data distribution.
- 34:53And then in step three,
- 34:55we get a preliminary imputation estimate of β0
- 35:03by solving this seemingly very complicated equation.
- 35:09And here there's X_l, which has two parts:
- 35:17the complete cases and the missing part.
- 35:21The missing part is randomly drawn
- 35:24from the data distribution
- 35:28we estimated in step two.
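A sketch of this step-3 construction (not from the talk; the normal linear working model for X2 given (X1, Y) is an assumption, and statsmodels' QuantReg stands in for the estimating-equation solver): each missing X2 is replaced by L random draws, and complete cases are replicated L times so every subject carries the same total weight.

```python
# Case 2, step 3 sketch: preliminary imputation estimate of beta0.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

def step3_preliminary_beta(y, x1, x2, r, tau, L=10, seed=0):
    rng = np.random.default_rng(seed)
    # working model for X2 | X1, Y, fitted on the complete cases
    Zc = sm.add_constant(np.column_stack([x1[r], y[r]]))
    fit = sm.OLS(x2[r], Zc).fit()
    sigma = np.sqrt(fit.scale)

    rows_y, rows_x = [], []
    for i in range(len(y)):
        if r[i]:
            draws = np.repeat(x2[i], L)     # replicate the complete case
        else:
            mean_i = fit.predict([[1.0, x1[i], y[i]]])[0]
            draws = mean_i + sigma * rng.standard_normal(L)   # L random draws
        rows_y.append(np.repeat(y[i], L))
        rows_x.append(np.column_stack([np.repeat(x1[i], L), draws]))

    y_aug = np.concatenate(rows_y)
    X_aug = sm.add_constant(np.vstack(rows_x))
    return QuantReg(y_aug, X_aug).fit(q=tau).params   # preliminary beta
```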
- 35:32And then there is step four, okay?
- 35:35The key is the empirical likelihood part,
- 35:39which we use to compute the weights.
- 35:46And these weights are for the complete cases.
- 35:50As previously, this weight depends on three parts.
- 35:59One is the missingness probability, related to α;
- 36:05one is the distribution, related to γ.
- 36:09Previously, it depended on the preliminary estimate of the marginal quantile.
- 36:11Now, it's related to the preliminary estimate of
- 36:18the linear quantile coefficient β.
- 36:22Okay?
- 36:23After we estimate these weights w_i,
- 36:27then we can go to the estimating equation part, okay?
- 36:35So, those are the five steps.
- 36:40As you can see, steps one, two, and three
- 36:44are all pre-existing methods we adapt, trying to estimate
- 36:56the missingness probability, the data distribution,
- 37:02and also to impute to get a preliminary estimate
- 37:05of the parameter we are interested in.
- 37:08And then from all these,
- 37:10we pull all this information together to get
- 37:12a good weight for the complete cases.
- 37:18And then, using this empirical likelihood method,
- 37:25we adjust the complete cases with the
- 37:31estimated weights to get a final estimate,
- 37:34to get the final multiple robust estimate.
- 37:41Now, case three, okay?
- 37:45In case three, the parameter we are interested in
- 37:49is still β0.
- 37:51The linear quantile regression is here.
- 37:55The scenario is that the full-data vector is (Y, X).
- 38:02In this scenario, Y is missing at random, okay?
- 38:07Of course, the simple complete case analysis
- 38:10will lead to a consistent estimate,
- 38:14but it doesn't mean it will be optimal.
- 38:18Here we are trying to get a more sophisticated
- 38:21but still very practical method.
- 38:30We have some auxiliary variables.
- 38:33These auxiliary variables are
- 38:36usually not of main study interest,
- 38:40and thus do not enter the quantile regression model.
- 38:43However, we can use them to help us explain
- 38:48the missingness mechanism
- 38:51and to help us build a more plausible model
- 38:55for the conditional distribution of Y.
- 39:00Now, here are the observed data.
- 39:06So, we now have n i.i.d. copies of (R, RY, X, S):
- 39:12this Y may be missing, X is completely observed,
- 39:19and we have the auxiliary variable S.
- 39:23We have this missing at random scenario.
- 39:27We use π(X, S) to denote the missingness probability,
- 39:34and we use f(Y|X, S) to denote the conditional density.
- 39:40As previously, we have multiple models
- 39:43for the missingness probability,
- 39:46and we have multiple models for the data distribution.
- 39:56And then once again, we have the five steps.
- 40:00In the first step, we model the missingness probability.
- 40:05And here we have this additional auxiliary variable.
- 40:10In the second step, we model the data distribution.
- 40:14Again, we have this auxiliary variable.
- 40:17And then in step three,
- 40:18we get a preliminary estimate
- 40:21using this imputation method.
- 40:24We have our preliminary estimate of the parameter
- 40:28we are interested in,
- 40:30which is the linear quantile regression coefficient here.
- 40:36And then after the preparation of step one,
- 40:39step two, and step three,
- 40:41we are finally able to estimate our weights, okay?
- 40:46Our weights are for the complete cases.
- 40:50And from the formula here,
- 40:52you can tell why I put this scenario as scenario three:
- 40:55because it gets more and more complicated.
- 40:59Although the weight still depends on three parts,
- 41:02related to the first three steps:
- 41:05the missingness probability related to this α,
- 41:08the data distribution related to this γ,
- 41:12and the preliminary estimate made by using the imputation
- 41:19in step three.
- 41:25And once we get the weights through
- 41:28this empirical likelihood method,
- 41:30we then put them into this estimating equation.
- 41:34Adjusted by these weights, we can get our proposed estimator,
- 41:39the multiple robust estimator of
- 41:41the linear quantile regression coefficient.
- 41:48Okay.
- 41:50(coughs)
- 41:51Our method, our framework in general,
- 41:55has these five steps; the key thing is that step four
- 41:58is the empirical likelihood method to estimate the weights.
- 42:03And we have illustrated
- 42:06our framework in these three scenarios.
- 42:13Of course, there are some other scenarios,
- 42:15and you can easily adapt these five steps to them.
- 42:20Now, let's look at some theoretical properties:
- 42:23why we propose these seemingly complicated five steps.
- 42:30We first look at case one. There are two parts.
- 42:36The first theorem is about consistency.
- 42:40The second theorem is about asymptotic normality, okay?
- 42:46So, under certain conditions, if...
- 42:51Remember we have two sets of models.
- 42:53With one set of models, we model the missingness probability.
- 42:57With the other set of models, we model the data distribution.
- 43:02So if either one model from the set
- 43:07modeling the missingness probability
- 43:12or one model from the set modeling the data distribution,
- 43:15if either one is correctly specified, okay,
- 43:21then our estimate will be consistent.
- 43:26Our estimate will be consistent.
- 43:28So, our proposed method allows you to make mistakes, okay?
- 43:37As long as you make at least one right decision,
- 43:44then you get a consistent result, okay?
- 43:49Of course, if you make all the bad decisions,
- 43:52if you didn't choose any correct model in
- 43:55these two sets of models, then you probably won't be able
- 43:59to get a consistent result.
- 44:01Right?
- 44:04And then the second theorem is about
- 44:07the asymptotic normality.
- 44:09Under certain conditions, our estimate,
- 44:17the multiple robust estimate of the marginal quantile,
- 44:20will have an asymptotic normal distribution
- 44:23with mean zero, and the variance here
- 44:28is related to this variable.
- 44:30The variance is related to this random variable here.
- 44:38And as you can see, this variance
- 44:46actually comes from these three parts,
- 44:50the estimation of the missingness probability,
- 44:53the estimation of the data distribution,
- 44:56and also the imputation process, okay?
- 45:00That's for case one.
- 45:02Similarly for case two, we have these two theorems.
- 45:08One is consistency:
- 45:11as long as one model is correctly specified,
- 45:14we will have this consistency.
- 45:17And then there is asymptotic normality:
- 45:21we will have an asymptotic normal distribution.
- 45:23And also the variance, as you can see,
- 45:28is related to the first three steps
- 45:32that estimate the different components, okay?
- 45:38And then case three: two theorems.
- 45:43Consistency: we need at least one model.
- 45:47As long as one model is correctly specified,
- 45:50we have a consistent result.
- 45:53And we have the asymptotic normality,
- 45:56and the variance comes from the three parts. Okay?
- 46:02As you can see, this is a very complicated formula.
- 46:07The model is getting more and more complicated.
- 46:10And also, you can compare the variance
- 46:15of the case-three estimator to the situation with complete case analysis.
- 46:21Because for the complete case analysis,
- 46:23we also get a consistent result, but like I said,
- 46:28it doesn't mean the variance will be optimal.
- 46:30And here, we actually can verify that the variance of the case-three estimator
- 46:34will be smaller if our models are correctly specified, okay?
- 46:43So, those are the theoretical properties.
- 46:49Now, let's look at some simulations, okay?
- 46:54We did simulations for each scenario,
- 46:58but due to the time limit, I will not present all of them.
- 47:03Let's look at the second scenario.
- 47:05In the second scenario, we have the setup here.
- 47:09We have X1 following an exponential distribution, and X2
- 47:13follows a normal distribution.
- 47:16So the two covariates have quite different distributions, okay?
- 47:20The model is a simple linear model,
- 47:24and the error distribution,
- 47:28as you can see, is heteroscedastic,
- 47:32because the error distribution is related to X1.
- 47:38The missing mechanism for X2:
- 47:42in the second scenario, part of X2 is missing
- 47:47through this logistic regression model, okay?
- 47:50Now, the missingness rate is about 38%.
- 47:57Eventually, we have this conditional quantile regression,
- 48:00a linear quantile regression, with those coefficients.
- 48:04This is our simulation setup in the second scenario.
- 48:13Now, we consider two working models for π, okay?
- 48:19The first one is correct. The second one is incorrect.
- 48:24And we consider two models for the data distribution, okay?
- 48:32All right.
- 48:33This is the incorrect one,
- 48:35based on ordinary least squares regression.
- 48:38And this is the correct one, with τ = 0.25 and 0.75.
- 48:48We have 1,000 replications.
- 48:51We have sample size n = 500, and L = 10.
- 48:55This L is related to the third step,
- 48:59the imputation.
- 49:03Okay.
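A sketch of a data-generating process matching this description (not from the talk; all coefficients are hypothetical since the slide values are not in the transcript, with the logistic intercept tuned to give roughly 38% missingness):

```python
# Scenario 2 sketch: exponential X1, normal X2, heteroscedastic error
# depending on X1, and logistic missingness for X2 given (X1, Y).
import numpy as np

def generate_scenario2(n=500, seed=0):
    rng = np.random.default_rng(seed)
    x1 = rng.exponential(scale=1.0, size=n)
    x2 = rng.normal(size=n)
    eps = (1.0 + 0.5 * x1) * rng.normal(size=n)   # heteroscedastic in X1
    y = 1.0 + x1 + x2 + eps
    # logistic missingness model for X2; intercept tuned to ~38% missing
    p_obs = 1.0 / (1.0 + np.exp(-(1.6 - 0.5 * x1 - 0.3 * y)))
    r = rng.random(n) < p_obs
    return y, x1, np.where(r, x2, np.nan), r

y, x1, x2, r = generate_scenario2()
print(1 - r.mean())   # realized missingness rate for X2
```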
- 49:03Now, here are all our simulation results, okay?
- 49:09Note that the results have been multiplied by 100;
- 49:14that's why, as you can see, some values look very large.
- 49:15And also, we denote our methods as 0000, okay?
- 49:25The first two digits represent
- 49:28the missingness probability models.
- 49:31The last two are for the data distribution models.
- 49:35For example, for IPW 1000,
- 49:40that means we only use the inverse probability weighting method,
- 49:44and the weight estimation is based on
- 49:49the correct model, okay?
- 49:52And for the imputation methods,
- 49:56that means we only use the data distribution models.
- 50:00And for this IM 0010, that means we use our first model
- 50:08for the data distribution.
- 50:14And this one uses the second model for the data distribution.
- 50:18And in either case,
- 50:20the first one is always the correct model.
- 50:23The first one is the correct model;
- 50:24the second one is not, okay?
- 50:26That's just the notation.
- 50:28As you can see here, using IPW,
- 50:31if the model is correctly specified,
- 50:34the bias is quite small
- 50:35and everything is quite good.
- 50:38However, if you misspecify the missingness probability,
- 50:42you can see the estimate is quite out of control, okay?
- 50:47Similarly for IM, the imputation: if you correctly specify
- 50:53the data distribution, the result is good.
- 50:56If not, then it's not.
- 50:58Okay.
- 50:59Then there's the multiple robust method.
- 51:03For the multiple robust method,
- 51:08we look at, for example, this one:
- 51:12if we have the missingness probability correctly specified,
- 51:15then we get a good result.
- 51:17If not, we get a bad result, as with IPW, okay?
- 51:22But anyway, if we choose to use all these four models,
- 51:29as you can see, the result is quite good, okay?
- 51:33The take-home message from this simulation study is:
- 51:39if you have some candidate models for the missingness probability
- 51:47and for the data distribution,
- 51:50and you think, "Okay, maybe this one is right,
- 51:53or maybe this one is also right," okay,
- 51:56then our method just tells you,
- 51:58"Okay, you can just put all these
- 52:04potential candidate models into our framework."
- 52:11Then we can just look at the result.
- 52:16That was the simulation for scenario two.
- 52:22We also have a simulation for scenario three,
- 52:27but I will skip it here and go directly to the
- 52:35real data analysis.
- 52:38So, in this real data analysis, we look at the
- 52:43AIDS Clinical Trials Group Protocol 175, or ACTG 175, data.
- 52:52In this research, treatment with either a single
- 53:01nucleoside or two nucleosides was evaluated in HIV-infected subjects
- 53:05whose CD4 cell counts
- 53:08are from 200 to 500 per cubic millimeter.
- 53:14So, we consider two arms of treatment.
- 53:17One is the standard treatment,
- 53:19and the other one is with the three newer treatments.
- 53:24The two arms, respectively,
- 53:29have about 500 and 1,600 subjects.
- 53:34Now, the model we are looking at is
- 53:36the linear quantile regression model
- 53:39with those kinds of covariates inside.
- 53:43The data can be found in this R package.
- 53:51Now, for the data, the average subject is 35 years old,
- 53:57the standard deviation is about nine,
- 54:01and the variable CD4 96 is missing for approximately 37%.
- 54:10It's quite similar to the simulation scenario.
- 54:16It fits the setup of our simulation scenario.
- 54:22However, at baseline and during the follow-up,
- 54:25full measurements on additional variables that are correlated
- 54:28with CD4 96 are obtained.
- 54:30So these can help us with the missing part.
- 54:39Here we assume this CD4 96 is missing at random.
- 54:46And we also have other measurements, for example,
- 54:50CD4 and CD8 counts at baseline and at 20 weeks, and so on.
- 54:56We will use these as auxiliary variables.
- 55:01So, we have our third scenario
- 55:07in this real data analysis.
- 55:12And why did we choose this data?
- 55:16Let's look at this CD4 96, the histogram of it, okay?
- 55:24The left one is before any transformation, on its original scale.
- 55:32The right one is after we do a log transformation.
- 55:39So, as you can see, the left one is kind of truncated,
- 55:46and the right one is also truncated.
- 55:49So you may debate,
- 55:50"Okay, which one should I use?
- 55:52Do I take the log transformation or not?
- 55:59To be, or not to be."
- 56:03So there's no apparent reason to favor one of them
- 56:10for the imputation method.
- 56:13Now, what do we do?
- 56:17In our proposed method,
- 56:19we can put both of these two models into our framework, okay?
- 56:26We don't need to make the choice,
- 56:29because there's no apparent reason
- 56:31to take the log, or not take the log.
- 56:33Now, let's put the two together into our model, okay?
- 56:38So we can simultaneously accommodate both specifications.
- 56:44And then we have eight covariates and the auxiliary variables.
- 56:49Then we have the missingness probability modeled by
- 56:54a logistic regression containing all main effects of X and S.
- 57:02So, here is the result. Here is the result.
- 57:04This is a big table, but let me summarize this table.
- 57:10Okay.
- 57:11The three newer treatments significantly slow the progression.
- 57:16Our proposed method and the IPW method
- 57:19produce very similar results, okay?
- 57:23And the imputation estimator 1
- 57:27failed to catch the difference in the treatment,
- 57:31the treatment arm effect, for different quantiles.
- 57:38The imputation estimator 2 gives
- 57:40increasing estimated effects and variances.
- 57:44In addition, the two imputation estimators
- 57:48are quite sensitive to the selection of the working models.
- 58:04Okay?
- 58:05And also, from this real data,
- 58:07we can see that the complete case analysis
- 58:11overestimates the treatment arm effects once again.
- 58:16So even though sometimes the complete case analysis is valid,
- 58:23there are also advantages to using our proposed method.
- 58:34All right, so here's the summary of my talk.
- 58:40We proposed a general framework for
- 58:44quantile estimation with missing data.
- 58:48And we actually applied this framework
- 58:52in different scenarios.
- 58:55Now, the take-home message is:
- 59:00our proposed method is multiply robust against
- 59:04possible model misspecification.
- 59:08So, we have two sets of models,
- 59:10one for the missingness probability
- 59:12and one for the data distribution.
- 59:14As long as one model is correct,
- 59:17then we will get a good result.
- 59:19And also, our method can easily be generalized
- 59:23to many other scenarios.
- 59:26And I think that's all of my talk,
- 59:32and thank you.
- 59:36- All right.
- 59:37Thank you, Linglong. This was very interesting.
- 59:39I think we're almost out of time, so
- 59:43we have time probably for one question.
- 59:45So if there's any; if not...
- 59:48Let's see if there are any questions.
- 59:52Feel free to write in the chat box or just speak up.
- 01:00:12Okay.
- 01:00:13Just gonna ask one question
- 01:00:14and then I think I'm gonna ask all the questions
- 01:00:17when we meet.
- 01:00:19Just a quick question.
- 01:00:20Do you know why the complete case analysis gives
- 01:00:24overestimation rather than underestimation?
- 01:00:27Like, do you have a feeling for why that's the case?
- 01:00:33- Well, I don't know. No.
- 01:00:39- Yeah.
- 01:00:40I believe it would be interesting to see in what cases,
- 01:00:42like, what are the conditions for overestimation
- 01:00:45or underestimation for complete case analysis, I guess.
- 01:00:48I guess it must depend on the data distribution
- 01:00:52and the missingness mechanism that's assumed.
- 01:00:56But I'm not sure.
- 01:00:59- I agree with you.
- 01:01:01The reason I would answer "I don't know" is
- 01:01:05because it's really hard to know how the data is missing.
- 01:01:11Although we assume it's missing at random.
- 01:01:13- Yeah.
- 01:01:14- But, who knows the reality?
- 01:01:17- Right. Yeah, right.
- 01:01:19I guess, under your assumption of missing at random,
- 01:01:22then I guess there could be conditions for underestimation
- 01:01:27or overestimation under the MAR assumption.
- 01:01:31But, I don't know.
- 01:01:32I was wondering if people have derived those or not.
- 01:01:36(laughs)
- 01:01:37That could be future work, right?
- 01:01:40(laughs)
- 01:01:42All right.
- 01:01:43Linglong, thank you.
- 01:01:44I'll see you in an hour for the one-on-one meetings,
- 01:01:47and I know other students and maybe faculty have
- 01:01:51signed up to meet with you.
- 01:01:53So, thank you very much.
- 01:01:55And I'll see you later. All right.
- 01:01:57- Thank you.
- 01:01:57- Bye-bye. Thank you everyone for joining.
- 01:01:58Bye.
- 01:01:59- Bye.
- 01:02:00- Bye.