BIS Seminar: A General Framework for Quantile Estimation with Incomplete Data
October 14, 2020
Linglong Kong
Department of Mathematical and Statistical Sciences
University of Alberta
October 13, 2020
- 00:00- Hi, everyone.
- 00:02Welcome to the departmental seminar of
- 00:04the Department of Biostatistics, Yale University.
- 00:09I'm pleased to introduce to you Linglong Kong.
- 00:12He is an associate professor in the Department of Mathematical
- 00:16and Statistical Sciences at the University of Alberta.
- 00:20His research interests are, and correct me if I'm wrong,
- 00:24in functional and neuroimaging data analysis,
- 00:27statistical machine learning,
- 00:29and robust statistics and quantile regression.
- 00:32So today, he is gonna talk about his work on
- 00:35a general framework for quantile estimation
- 00:38with incomplete data.
- 00:40Thank you, Linglong. And whenever you're ready.
- 00:44- Thank you, Laura, for the introduction.
- 00:47And also thanks to Professor John for the invitation.
- 00:52I'm very happy to be here, although it's way too early.
- 00:57So today I'm going to talk about a general framework for
- 01:01quantile estimation with incomplete data.
- 01:13So, this is joint work with Peisong Han from
- 01:20the University of Michigan, Jiwei Zhao from
- 01:23the University of Wisconsin-Madison, and Xingcai Zhou.
- 01:27And we started this work in the second year
- 01:33after I started my position at the University of Alberta.
- 01:37I have known Peisong for a long time, since he was a student,
- 01:44and at that time he had just started his position as an
- 01:48assistant professor at the University of Waterloo.
- 01:51And I invited him to visit me and afterwards,
- 01:56he invited me to visit him.
- 01:58And we felt that since we had visited each other already,
- 02:02we should get something done.
- 02:04And I remember sitting in his office
- 02:11at the University of Waterloo and thinking about
- 02:15what we could do together.
- 02:17And eventually we thought, "Okay, what am I good at?
- 02:21Well, my research area is quantile regression.
- 02:24And what is Peisong good at?
- 02:27One of Peisong's research areas is missing data."
- 02:31So we said maybe we can put them together,
- 02:34and then we wrote a couple of formulas on the paper.
- 02:41Then we felt like, "Okay, we have something already."
- 02:45Then we went to have dinner.
- 02:48And then one year later Peisong sent me like
- 02:52two pages of a draft and said maybe we should continue it.
- 02:57And that's the first scenario in this topic
- 03:03I'm gonna talk about.
- 03:04And then another half a year later, I sent him my feedback.
- 03:12I said, "Why don't we make it more general,
- 03:15make it a framework?"
- 03:17So that we would be able to apply it
- 03:20to other scenarios.
- 03:22And then we both felt it was a good idea,
- 03:26and we started working on it.
- 03:27At that time, Jiwei was a postdoc at the University of Waterloo
- 03:33and Xingcai was my postdoc.
- 03:35So, we worked together and started the project.
- 03:38Eventually, it became a project that I'm kind of proud of.
- 03:47So, what's missing data?
- 03:49Missing data arise in almost all
- 03:52serious statistical analyses.
- 03:56Missing values are representative of the
- 04:03messiness of the real world.
- 04:05Why would we have missing values?
- 04:08It could be for all kinds of reasons.
- 04:12For example, it may be due to social or natural processes.
- 04:17Like, for example, a student graduates and
- 04:20gets a job; or in a clinical trial, people die, and so on.
- 04:26And it could also happen in a survey.
- 04:29For example, for a certain question,
- 04:32only respondents who answer yes
- 04:35continue on to answer certain follow-up questions.
- 04:38Or maybe it's intentional missingness
- 04:41as part of the data collection process.
- 04:45Or some other scenarios, including random data collection
- 04:48issues, respondent refusal, or non-response.
- 04:56So, mathematically, how do we categorize these kinds of missingness?
- 05:01Here are the three scenarios.
- 05:05Now, the first scenario we call missing completely at random.
- 05:10What does that mean?
- 05:11That means the missingness has nothing to do with the
- 05:15person being studied.
- 05:17The values just happen to be missing;
- 05:19it's not related to any feature of this person.
- 05:23The second scenario is missing at random.
- 05:26The missingness has to do with the person, but can be predicted
- 05:30from other information about the person.
- 05:34Like, in certain scenarios,
- 05:39the missingness may be predictable from some
- 05:43auxiliary variables, some auxiliary information.
- 05:48The third one is a very hard one: missing not at random.
- 05:55The missingness depends on unobserved information,
- 05:59and sometimes even on the response itself.
- 06:05So, the missingness is specifically related to
- 06:08what is missing.
- 06:09For example, a person did not attend a drug test
- 06:13because the person took drugs the night before.
- 06:17And therefore, the next day,
- 06:18he couldn't make it to the drug test.
- 06:20We couldn't get that drug test result.
- 06:23These are the three missing-data mechanisms.
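To make the three mechanisms concrete, here is a minimal Python sketch (not from the talk) that generates MCAR, MAR, and MNAR missingness for a toy response; the models and coefficients are hypothetical.

```python
# Toy illustration of the three missing-data mechanisms.
# The response y and covariate x are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                      # fully observed covariate
y = 1.0 + 2.0 * x + rng.normal(size=n)      # response that may go missing

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# MCAR: missingness is unrelated to anything in the data.
r_mcar = rng.random(n) < 0.7                # R = 1 means "observed"

# MAR: missingness depends only on the observed covariate x.
r_mar = rng.random(n) < sigmoid(1.0 - x)    # larger x -> more missing

# MNAR: missingness depends on the possibly unobserved y itself.
r_mnar = rng.random(n) < sigmoid(3.0 - y)   # larger y -> more missing

y_obs = np.where(r_mar, y, np.nan)          # observed data under MAR
print(np.isnan(y_obs).mean())               # realized missingness rate
```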
- 06:30How do we handle those missing data?
- 06:33There are many strategies.
- 06:35For example, the first one would be,
- 06:37well, let's try to get the missing data.
- 06:40That would be great.
- 06:42But in reality, that's usually impossible.
- 06:48But the second is: well, since we have incomplete cases,
- 06:52let's just discard them.
- 06:57Just analyze the complete cases, right?
- 07:02But this could cause some other problems.
- 07:05We will talk about it.
- 07:07And the third one is we replace the missing data
- 07:12with some conservative estimate.
- 07:14For example, using sample mean, sample median, and so on.
- 07:20The fourth one is we try to estimate the missing data
- 07:25from other data on the person.
- 07:27We use some sort of more sophisticated method to impute.
- 07:37Now in particular, mathematically speaking,
- 07:43these are the strategies we use today to deal
- 07:46with missing data.
- 07:48The first one is complete case analysis.
- 07:51This is very simple, okay?
- 07:52We just analyze the complete cases, okay?
- 07:56We only take into consideration the individuals with
- 08:01no missing data.
- 08:05Sometimes it can provide good results,
- 08:07but the estimates obtained from this complete case analysis
- 08:12may be biased if the excluded individuals are systematically
- 08:18different from those included.
- 08:20So hence, if the complete cases are a good
- 08:24representation of the missing cases,
- 08:28then this method will be fine.
- 08:34Otherwise, if the complete cases are quite different from
- 08:38those we miss, then the result can be biased.
- 08:44And then there's the inverse probability weighting method, IPW.
- 08:50This is a commonly used method to correct the bias from a
- 08:53complete case analysis.
- 08:56What does that mean?
- 08:57It means, okay, we give each complete case a weight.
- 09:03This weight is the inverse of the probability of
- 09:07being a complete case.
- 09:12Well, this can also cause some bias
- 09:16if the model for the probability of being a complete case is misspecified.
- 09:25The third strategy is more sophisticated:
- 09:29multiple imputation.
- 09:31It's quite a common method,
- 09:32especially nowadays in genetic studies.
- 09:35How do we do multiple imputation?
- 09:39We create multiple sets of imputations for
- 09:44the missing values, using an imputation process
- 09:48with a random component.
- 09:51Now, we have full data sets.
- 09:54Then we analyze each data set.
- 09:59Those full data sets can be a little bit different,
- 10:02can be slightly different, because of the randomness of
- 10:08the imputation process.
- 10:11Anyway, we analyze those data sets, the completed data sets,
- 10:14and then we get a set of parameter estimates for each.
- 10:17Then we can combine those results.
- 10:20We can combine these results,
- 10:22and hopefully we get a better result.
- 10:26Multiple imputation sometimes works quite well,
- 10:31but only if the missing-data mechanism can be ignored
- 10:36and we have good imputation models.
- 10:39And part of it depends on the nature of the data;
- 10:41the other part depends on what kind of imputation model
- 10:45we are going to use.
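As a rough sketch of this multiple-imputation workflow (not from the talk): impute with a random component, analyze each completed data set, and combine with Rubin's rules. The normal imputation model is an assumption, and this simplified "improper" version ignores parameter uncertainty.

```python
# Minimal multiple-imputation sketch with Rubin's combining rules.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
r = rng.random(n) < 1.0 / (1.0 + np.exp(-(1.0 - x)))   # MAR indicator

# imputation model for y | x, fitted on the complete cases
b1, b0 = np.polyfit(x[r], y[r], 1)
sigma = np.std(y[r] - (b0 + b1 * x[r]), ddof=2)

M = 20                                     # number of imputed data sets
est, var = [], []
for _ in range(M):
    y_imp = y.copy()
    miss = ~r
    # random component: draw imputations rather than plugging in the mean
    y_imp[miss] = b0 + b1 * x[miss] + rng.normal(scale=sigma, size=miss.sum())
    est.append(y_imp.mean())               # analyze each completed data set
    var.append(y_imp.var(ddof=1) / n)

est, var = np.array(est), np.array(var)
point = est.mean()                         # combined point estimate
total_var = var.mean() + (1 + 1 / M) * est.var(ddof=1)   # Rubin's rules
print(point, total_var)
```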
- 10:51Now, those are the ways we deal with missing data,
- 10:56the strategies we typically use to deal with missing data.
- 11:01But let's match them with the missing mechanisms.
- 11:06How do we use these strategies to deal with
- 11:11different missing mechanisms?
- 11:14For example, if the data is missing completely at random,
- 11:19now in this case, the complete case analysis is quite good.
- 11:25Multiple imputation or any other imputation method
- 11:29is also okay,
- 11:31is also valid.
- 11:32So, this missing completely at random is
- 11:36the easiest case to deal with.
- 11:40What if the data is missing at random?
- 11:43Then in this case, some complete case analyses are valid,
- 11:51and multiple imputation is usually okay too,
- 11:56if the bias is negligible.
- 12:00Now, in the third case,
- 12:02if the data is missing not at random,
- 12:05then we have to model the missingness explicitly.
- 12:11We need joint modeling:
- 12:15we need to jointly model the response
- 12:17and also the missingness.
- 12:22In practice, of course,
- 12:23we try to assume missing at random whenever possible
- 12:28and try to avoid dealing with the
- 12:32missing-not-at-random situation.
- 12:34But in reality, it's not something that we can control.
- 12:41Sometimes we have data that is missing not at random.
- 12:45I think there is even a journal special issue
- 12:53dedicated to the missing-not-at-random situation.
- 13:02Now, we have different strategies.
- 13:04And as I said, different strategies
- 13:07have different advantages and disadvantages.
- 13:12For example, multiple imputation is generally more efficient
- 13:17than IPW, but it's more complex.
- 13:23And the imputation and IPW approaches
- 13:28require us to model the data distribution
- 13:32and the missingness probability, respectively.
- 13:35For imputation, we need to model the data distribution.
- 13:39For IPW, we need to model the missingness probability.
- 13:45And also, for all these strategies,
- 13:48we will have good properties
- 13:52only if the corresponding model is correctly specified.
- 13:59Most existing methods are vulnerable to
- 14:03these model misspecifications.
- 14:06Of course, we can use nonparametric methods to reduce the risk
- 14:11of model misspecification, but it's often impractical
- 14:16due to the curse of dimensionality.
- 14:21So now, how do we deal with this model misspecification?
- 14:27We have some methods available.
- 14:30For example, we can use a double robust method.
- 14:37In particular, in the double robust method,
- 14:40we have this augmented IPW.
- 14:44We not only model the missingness probability,
- 14:49but also the distribution.
- 14:52Why is it double robust?
- 14:54Because the result will be consistent
- 14:58if either model is correct.
- 15:02If the way we model the missingness probability
- 15:07or the way we model the distribution is correct,
- 15:12then we will get a consistent result.
- 15:14And that's why it's called double robust.
- 15:18Well, now, if we are not satisfied with double robustness,
- 15:22what about a multiple guarantee?
- 15:25So, we have this multiple robustness.
- 15:27This was proposed by Peisong.
- 15:33The multiple robust method is proposed to accommodate
- 15:38multiple models for the missingness probability
- 15:42and the distribution.
- 15:45In double robust, we have only one model for the missingness
- 15:48probability and one model for the data distribution.
- 15:51Well, for multiple robust,
- 15:54we can have multiple models for the missingness probability,
- 15:59and we can have multiple models for the data distribution.
- 16:05The good thing is the estimation result will be consistent
- 16:11if any one of the models is correct.
- 16:19Now, let's look at these questions mathematically.
- 16:26So, we are looking at missing at random.
- 16:29We assume the observed data are i.i.d.
- 16:34So we have data (R, RY, X).
- 16:38R, we use to indicate missingness; and for the IPW estimator,
- 16:48essentially we are trying to solve this equation.
- 16:52And here, this π is the probability
- 16:58of being a complete case.
- 17:01And IPW is consistent
- 17:03only if this π(X) is correctly specified.
- 17:08And then, from the equation,
- 17:10we can get a consistent estimate of the
- 17:13parameter we are interested in.
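From this description, the slide's equation should be the IPW estimating equation for the marginal τ-quantile, Σ_i {R_i/π̂(X_i)} ψ_τ(Y_i − q) ≈ 0 with ψ_τ(r) = τ − I(r < 0). A minimal sketch (not from the talk; the logistic working model for π(X) is an assumption):

```python
# IPW sketch: solve sum_i (R_i / pi_hat(X_i)) * psi_tau(Y_i - q) ~= 0.
import numpy as np
from sklearn.linear_model import LogisticRegression

def weighted_quantile(y, w, tau):
    # The weighted estimating equation is solved by the weighted
    # tau-quantile: the smallest y at which the weighted CDF reaches tau.
    order = np.argsort(y)
    cum = np.cumsum(w[order]) / np.sum(w)
    return y[order][np.searchsorted(cum, tau)]

def ipw_quantile(y, X, r, tau):
    # working model for pi(X) = P(R = 1 | X), fitted on all subjects
    pi_hat = LogisticRegression().fit(X, r).predict_proba(X)[:, 1]
    return weighted_quantile(y[r], 1.0 / pi_hat[r], tau)

# toy usage with the response missing at random given X
rng = np.random.default_rng(2)
n = 2000
X = rng.normal(size=(n, 1))
y = X[:, 0] + rng.normal(size=n)
r = rng.random(n) < 1.0 / (1.0 + np.exp(-(0.5 + X[:, 0])))
print(ipw_quantile(y, X, r, tau=0.5))
```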
- 17:17This is IPW. The other one is imputation.
- 17:23For imputation, we need to model the data distribution.
- 17:28And here we have a model for f(Y|X).
- 17:36And as you can see,
- 17:37we have our imputation for those missing data.
- 17:44This imputation is consistent
- 17:47only if the data distribution is correctly modeled,
- 17:52that is, if this f(Y|X) is correctly modeled.
- 17:58Now, for the augmented inverse probability weighted method,
- 18:05we actually combine these two together.
- 18:11We have the first part from IPW,
- 18:14and the second part from imputation.
- 18:17So the estimation result will be consistent
- 18:23if either the model for the missingness probability
- 18:28or the model for the data distribution is correctly specified.
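A sketch of this augmentation idea for the marginal quantile (not from the talk; the logistic model for π and the normal working model for f(Y|X) are assumptions, and E[ψ_τ(Y − q) | X] = τ − F(q | X) is evaluated under the working model):

```python
# AIPW sketch for the marginal tau-quantile q: solve
#   sum_i { R_i/pi(X_i) * psi_tau(Y_i - q)
#           + (1 - R_i/pi(X_i)) * (tau - F(q | X_i)) } = 0.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_quantile(y, X, r, tau):
    pi_hat = LogisticRegression().fit(X, r).predict_proba(X)[:, 1]
    reg = LinearRegression().fit(X[r], y[r])     # working model for E[Y|X]
    mu = reg.predict(X)
    sigma = np.std(y[r] - mu[r], ddof=X.shape[1] + 1)
    y_safe = np.where(r, y, 0.0)                 # placeholder where Y missing

    def ee(q):
        ipw_part = (r / pi_hat) * (tau - (y_safe < q))
        aug_part = (1 - r / pi_hat) * (tau - norm.cdf(q, loc=mu, scale=sigma))
        return np.sum(ipw_part + aug_part)

    lo, hi = y[r].min() - 1.0, y[r].max() + 1.0
    return brentq(ee, lo, hi)                    # ee decreases in q
```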
- 18:35Well, the multiple robust methods
- 18:38have a series of models for the missingness probability
- 18:44and a series of models for the data distribution.
- 18:49And the result will be consistent
- 18:53if any one model is correctly specified.
- 19:01Well, this was just
- 19:03a quick review of missing data.
- 19:07Like I said, this part is
- 19:12one of Peisong's research areas.
- 19:14For me, my research area is quantile regression.
- 19:18So, in terms of quantile regression, at that time
- 19:23we were thinking, "Okay, those methods,
- 19:26these IPW, AIPW or double robust methods,
- 19:32multiple robust methods, had been quite well studied
- 19:35for modeling the conditional mean.
- 19:39For the conditional quantile, there are not
- 19:41a lot of methods available."
- 19:44Why do we care about the quantile?
- 19:46Quantiles not only provide the central features
- 19:49of the distribution, but also capture the tail behavior.
- 19:57And also under very mild conditions,
- 20:01the quantile function can uniquely determine
- 20:05the underlying distribution.
- 20:07So, there are a lot of advantages to modeling the quantiles.
- 20:13Then, we decided to study this missingness
- 20:18in quantile estimation.
- 20:21In particular, we proposed a general framework
- 20:23for quantile estimation with missing data.
- 20:30So, our proposed model, this kind of framework,
- 20:35can handle a lot of estimation problems for
- 20:38quantile estimation with missingness.
- 20:43But in this paper,
- 20:46we particularly applied our proposed method to
- 20:50these three scenarios,
- 20:52okay, three commonly encountered situations.
- 20:56In the first one, we try to estimate
- 21:01the marginal quantile of the response.
- 21:04This response has some missingness,
- 21:09while there are fully observed covariates.
- 21:13That's the first scenario: the response has some missingness
- 21:16while the corresponding covariates are fully observed.
- 21:20In the second scenario, we are looking at
- 21:23the conditional quantile of a fully observed response.
- 21:28In this scenario, we look at the case where
- 21:31some covariates are only partially available.
- 21:36So, we have some missingness in the covariates.
- 21:39And then in the third scenario, we are still looking at
- 21:43the conditional quantile of a response.
- 21:47And in this case, the response has some missingness,
- 21:52and we have fully observed covariates
- 21:55and also extra auxiliary variables.
- 22:02Now, let's look at the first situation.
- 22:07We want to estimate the marginal quantile.
- 22:10In this scenario, the response has some missingness
- 22:18and we have the covariates fully observed.
- 22:22Now, let m be the number of subjects with
- 22:26data completely observed.
- 22:30Then our method consists of the following five steps.
- 22:38In the first step, we calculate this α, or estimate this α.
- 22:45This α is related to the missingness probability, okay?
- 22:52The way we estimate it is by maximizing
- 22:57the binomial likelihood.
- 23:01So, in the first step we estimate the α,
- 23:03and then we get an estimate of the missingness probability.
- 23:10Okay?
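So step 1 amounts to fitting each working model for the missingness probability by binomial maximum likelihood. A sketch for one hypothetical logistic working model π_α(X) (not from the talk):

```python
# Step 1 sketch: estimate alpha by maximizing the binomial likelihood
#   prod_i pi_alpha(X_i)^{R_i} * {1 - pi_alpha(X_i)}^{1 - R_i}.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def fit_missingness_model(R, X):
    Z = np.column_stack([np.ones(len(R)), X])    # add intercept

    def neg_loglik(alpha):
        p = np.clip(expit(Z @ alpha), 1e-10, 1 - 1e-10)
        return -np.sum(R * np.log(p) + (1 - R) * np.log(1 - p))

    alpha_hat = minimize(neg_loglik, np.zeros(Z.shape[1]), method="BFGS").x
    return alpha_hat, expit(Z @ alpha_hat)       # alpha and fitted pi(X_i)
```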
- 23:11In the second step, we calculate γ.
- 23:16This γ is related to the data distribution.
- 23:21So, we maximize the likelihood under the data distribution model.
- 23:25This γ is a parameter related to the distribution.
- 23:33And then in the third step, we get a
- 23:39sort of preliminary estimate of the quantile,
- 23:43the marginal quantile, through an imputation process,
- 23:51by solving this equation.
- 23:55And as you can see, this is quite close to the AIPW scenario.
- 24:05Okay?
- 24:06And in this equation, this ψ_τ is the score function
- 24:13of the quantile loss function.
- 24:17This ψ_τ(r) is τ − I(r < 0).
- 24:23This is the generalized derivative
- 24:28of the quantile loss function, okay?
- 24:34Here, this equation cannot be made exactly zero.
- 24:39The reason is that this ψ_τ is a non-smooth function,
- 24:46so sometimes it won't be exactly zero.
- 24:53Basically, that's the third step, okay?
- 24:57Now, we have a preliminary estimator
- 25:01of the marginal quantile.
- 25:03The fourth step is the key step of the method;
- 25:08it is where the multiple robustness is coming from.
- 25:14Now, we calculate weights for the complete cases.
- 25:19In total, we have m complete cases.
- 25:21For each case, we calculate the weight.
- 25:24As you can see, the weight is determined by three parts.
- 25:32The first part is related to this α,
- 25:36which is related to the missingness probability, okay?
- 25:40The missingness probability.
- 25:43The second part is related to this γ.
- 25:47This is related to the data distribution.
- 25:52The third part is related to this q̂,
- 25:56the preliminary estimate of the marginal quantile,
- 26:02which is related to the third step.
- 26:07As you can see, in the first three steps,
- 26:10we are trying to get ready for this,
- 26:14to get the estimate of the weights for the complete cases,
- 26:18for these complete cases.
- 26:21And also, we have a parameter ρ,
- 26:23which is obtained through
- 26:27minimizing this equation, through minimizing this equation.
- 26:33Now, after we calculate the weights,
- 26:36we get our final estimate, our multiple robust estimate,
- 26:42by solving the following weighted estimating equation.
- 26:50This w_i is the weight.
- 26:52We estimate it from the first four steps.
- 26:58And this ψ_τ is the score function of the quantile loss, okay?
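In other words, once the weights ŵ_i from step 4 are in hand, step 5 solves Σ_i ŵ_i ψ_τ(Y_i − q) = 0 over the complete cases, which is just a weighted τ-quantile. A minimal sketch (not from the talk):

```python
# Step 5 sketch: the multiple robust estimate solves
#   sum_{i: complete} w_i * psi_tau(Y_i - q) = 0,
# i.e. it is the w-weighted tau-quantile of the complete cases.
import numpy as np

def mr_quantile(y_complete, w, tau):
    order = np.argsort(y_complete)
    cum = np.cumsum(w[order]) / np.sum(w)   # weighted empirical CDF
    return y_complete[order][np.searchsorted(cum, tau)]
```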
- 27:06Now, you may be wondering what's going on
- 27:10with these five steps.
- 27:14Let me try to explain them one by one, okay?
- 27:20In the first step, we get the estimate of α, okay?
- 27:24We get the estimate of α.
- 27:28In essence, we are trying to model the missingness probability, okay?
- 27:33The missingness probability.
- 27:35And of course, this missingness probability estimate is consistent
- 27:41only if the model is correctly specified, okay?
- 27:45So in the first step, we actually have multiple models
- 27:49to model the missingness probability.
- 27:52And we hope at least one model is correct.
- 27:57Otherwise, the missingness probability
- 28:00will not be correctly specified.
- 28:05Well, in the second step, we estimate γ.
- 28:09We are trying to model the data distribution,
- 28:14and we have multiple models for the data distribution.
- 28:20And then in the third step,
- 28:22we are sort of doing an imputation estimate
- 28:26of the marginal quantile.
- 28:32And this marginal quantile will be correctly estimated
- 28:42if the data distribution is correctly specified.
- 28:50Now for the key step,
- 28:53(coughs)
- 28:54Excuse me.
- 28:55Step four is a typical formulation of
- 28:59an empirical likelihood problem.
- 29:03I will get back to this in the next slide,
- 29:08why it's an empirical likelihood problem.
- 29:12And this is the key contribution of our methodology.
- 29:18Now, in step five, we have the structure of IPW, okay?
- 29:23For the complete cases, we have weights to correct the bias, okay?
- 29:32And this weight actually comes from two parts.
- 29:35One part is from the missingness probability.
- 29:41The other part is from the data distribution.
- 29:45Now, the weight actually does not distinguish between
- 29:48the missingness probability and the data distribution.
- 29:54It treats them equally.
- 29:59And another note I want to make is that steps two and four
- 30:03are based on the complete cases only.
- 30:12Now, let's look at step four.
- 30:15Okay? Let's look at step four.
- 30:18In step four, we use the assumption of missing at random.
- 30:26It's easy to verify this, okay:
- 30:29for w(X), which is the inverse of the missingness probability,
- 30:34E[w(X){b(X) − E[b(X)]} | R = 1] = 0, okay?
- 30:43And in this case, we can let b(X) be the score function
- 30:48of the quantile loss function,
- 30:52and these expectations are conditional expectations,
- 30:56conditional probabilities under the posited density.
- 31:01And because of this, okay,
- 31:06we can easily write the sample version, the empirical version.
- 31:14So, the formulation is like this:
- 31:16all the weights are non-negative,
- 31:19the weights sum to one,
- 31:22and this is the estimating equation part,
- 31:25the estimating equation constraint.
- 31:29As you can see,
- 31:30this is a typical empirical likelihood scenario.
- 31:40So, this is a typical formulation for empirical likelihood.
- 31:47And the solution actually can be given in a closed form,
- 31:55as previously; it can be given by this one, okay?
- 32:02The weights can be determined by this.
- 32:05And ρ̂ can be estimated by solving this equation.
- 32:16Okay?
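A sketch of that closed form (not from the talk): writing g_i for the stacked centered constraint functions built from the working models in steps 1-3, the weights are w_i = m⁻¹/(1 + ρ̂ᵀg_i), with ρ̂ solving Σ_i g_i/(1 + ρᵀg_i) = 0, the first-order condition of a convex minimization.

```python
# Empirical-likelihood weights for the m complete cases:
#   w_i = (1/m) / (1 + rho' g_i),  rho solving sum_i g_i/(1 + rho' g_i) = 0.
# g is an (m x k) matrix of constraint functions from steps 1-3; its exact
# construction follows the paper and is taken as given here.
import numpy as np
from scipy.optimize import minimize

def el_weights(g):
    m, k = g.shape

    def objective(rho):
        t = 1.0 + g @ rho
        if np.any(t <= 1e-8):       # stay inside the feasible region
            return np.inf
        return -np.sum(np.log(t))   # convex; its gradient is the EL equation

    rho_hat = minimize(objective, np.zeros(k), method="Nelder-Mead").x
    w = 1.0 / (m * (1.0 + g @ rho_hat))
    return w / w.sum()              # normalize so the weights sum to one
```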
- 32:19So, those are all the key steps of this methodology, okay?
- 32:29This actually is the formula we first wrote down
- 32:35on the paper.
- 32:36And then we thought, "Okay, this might also be
- 32:40applicable to other scenarios."
- 32:44Indeed it can be applied in other scenarios.
- 32:48For example, in this quantile regression
- 32:52with missing covariates.
- 32:55In this scenario, our parameter of interest is β0.
- 33:01This β0 comes from this linear quantile regression.
- 33:05We want to estimate this β0.
- 33:10And our covariates have two parts, X1 and X2.
- 33:17This X1 part is always observed,
- 33:22while this X2 part may have some missingness.
- 33:27So, the observed data:
- 33:31we have n i.i.d. copies of this format.
- 33:33The response is completely observed,
- 33:43and some covariates (X2) may be missing,
- 33:45while some covariates (X1) are always observed, okay?
- 33:49So, in this setting, we want to estimate β0,
- 33:55as in the previous scenario.
- 33:59We have two sets of models, okay?
- 34:02One set of models is for π, the missingness probability.
- 34:08And the other set of models is for the data distribution.
- 34:15Here the distribution is that of X2,
- 34:19conditional on the response
- 34:21and the completely observed X1.
- 34:27Now, as before, we have five steps.
- 34:35Steps one and two are the same as in case one.
- 34:40In step one, we estimate the missingness probability.
- 34:45In step two, we estimate the data distribution.
- 34:53And then in step three,
- 34:55we get a preliminary imputation estimate of β0
- 35:03by solving this seemingly very complicated equation.
- 35:09And here there's X_l, which has two parts:
- 35:17the complete cases and the missing part.
- 35:21The missing part is randomly drawn
- 35:24from the data distribution
- 35:28we estimated in step two.
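A sketch of this step-3 construction (not from the talk; the normal linear working model for X2 given (X1, Y) is an assumption, and statsmodels' QuantReg stands in for the estimating-equation solver): each missing X2 is replaced by L random draws, and complete cases are replicated L times so every subject carries the same total weight.

```python
# Case 2, step 3 sketch: preliminary imputation estimate of beta0.
import numpy as np
import statsmodels.api as sm
from statsmodels.regression.quantile_regression import QuantReg

def step3_preliminary_beta(y, x1, x2, r, tau, L=10, seed=0):
    rng = np.random.default_rng(seed)
    # working model for X2 | X1, Y, fitted on the complete cases
    Zc = sm.add_constant(np.column_stack([x1[r], y[r]]))
    fit = sm.OLS(x2[r], Zc).fit()
    sigma = np.sqrt(fit.scale)

    rows_y, rows_x = [], []
    for i in range(len(y)):
        if r[i]:
            draws = np.repeat(x2[i], L)     # replicate the complete case
        else:
            mean_i = fit.predict([[1.0, x1[i], y[i]]])[0]
            draws = mean_i + sigma * rng.standard_normal(L)   # L random draws
        rows_y.append(np.repeat(y[i], L))
        rows_x.append(np.column_stack([np.repeat(x1[i], L), draws]))

    y_aug = np.concatenate(rows_y)
    X_aug = sm.add_constant(np.vstack(rows_x))
    return QuantReg(y_aug, X_aug).fit(q=tau).params   # preliminary beta
```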
- 35:32And then there is step four, okay?
- 35:35The key is the empirical likelihood part,
- 35:39which we use to compute the weights.
- 35:46And these weights are for the complete cases.
- 35:50As previously, this weight depends on three parts.
- 35:59One is the missingness probability, related to α;
- 36:05one is the distribution, related to γ.
- 36:09Previously, it depended on the preliminary estimate of the marginal quantile.
- 36:11Now, it's related to the preliminary estimate of
- 36:18the linear quantile coefficient β.
- 36:22Okay?
- 36:23After we estimate these weights w_i,
- 36:27then we can go to the estimating equation part, okay?
- 36:35So, those are the five steps.
- 36:40As you can see, steps one, two, and three
- 36:44are all pre-existing methods we adapt, trying to estimate
- 36:56the missingness probability, the data distribution,
- 37:02and also to impute to get a preliminary estimate
- 37:05of the parameter we are interested in.
- 37:08And then from all these,
- 37:10we pull all this information together to get
- 37:12a good weight for the complete cases.
- 37:18And then, using this empirical likelihood method,
- 37:25we adjust the complete cases with the
- 37:31estimated weights to get a final estimate,
- 37:34to get the final multiple robust estimate.
- 37:41Now, case three, okay?
- 37:45In case three, the parameter we are interested in
- 37:49is still β0.
- 37:51The linear quantile regression is here.
- 37:55The scenario is that the full-data vector is (Y, X).
- 38:02In this scenario, Y is missing at random, okay?
- 38:07Of course, the simple complete case analysis
- 38:10will lead to a consistent estimate,
- 38:14but it doesn't mean it will be optimal.
- 38:18Here we are trying to get a more sophisticated
- 38:21but still very practical method.
- 38:30We have some auxiliary variables.
- 38:33These auxiliary variables are
- 38:36usually not of main study interest,
- 38:40and thus do not enter the quantile regression model.
- 38:43However, we can use them to help us explain
- 38:48the missingness mechanism
- 38:51and to help us build a more plausible model
- 38:55for the conditional distribution of Y.
- 39:00Now, here are the observed data.
- 39:06So, we now have n i.i.d. copies of (R, RY, X, S):
- 39:12this Y may be missing, X is completely observed,
- 39:19and we have the auxiliary variable S.
- 39:23We have this missing at random scenario.
- 39:27We use π(X, S) to denote the missingness probability,
- 39:34and we use f(Y|X, S) to denote the conditional density.
- 39:40As previously, we have multiple models
- 39:43for the missingness probability,
- 39:46and we have multiple models for the data distribution.
- 39:56And then once again, we have the five steps.
- 40:00In the first step, we model the missingness probability.
- 40:05And here we have this additional auxiliary variable.
- 40:10In the second step, we model the data distribution.
- 40:14Again, we have this auxiliary variable.
- 40:17And then in step three,
- 40:18we get a preliminary estimate
- 40:21using this imputation method.
- 40:24We have our preliminary estimate of the parameter
- 40:28we are interested in,
- 40:30which is the linear quantile regression coefficient here.
- 40:36And then after the preparation of step one,
- 40:39step two, and step three,
- 40:41we are finally able to estimate our weights, okay?
- 40:46Our weights are for the complete cases.
- 40:50And from the formula here,
- 40:52you can tell why I put this scenario as scenario three:
- 40:55because it gets more and more complicated.
- 40:59Although the weight still depends on three parts,
- 41:02related to the first three steps:
- 41:05the missingness probability related to this α,
- 41:08the data distribution related to this γ,
- 41:12and the preliminary estimate made by using the imputation
- 41:19in step three.
- 41:25And once we get the weights through
- 41:28this empirical likelihood method,
- 41:30we then put them into this estimating equation.
- 41:34Adjusted by these weights, we can get our proposed estimator,
- 41:39the multiple robust estimator of
- 41:41the linear quantile regression coefficient.
- 41:48Okay.
- 41:50(coughs)
- 41:51Our method, our framework in general,
- 41:55has these five steps; the key thing is that step four
- 41:58is the empirical likelihood method to estimate the weights.
- 42:03And we have illustrated
- 42:06our framework in these three scenarios.
- 42:13Of course, there are some other scenarios,
- 42:15and you can easily adapt these five steps to them.
- 42:20Now, let's look at some theoretical properties:
- 42:23why we propose these seemingly complicated five steps.
- 42:30We first look at case one. There are two parts.
- 42:36The first theorem is about consistency.
- 42:40The second theorem is about asymptotic normality, okay?
- 42:46So, under certain conditions, if...
- 42:51Remember we have two sets of models.
- 42:53With one set of models, we model the missingness probability.
- 42:57With the other set of models, we model the data distribution.
- 43:02So if either one model from the set
- 43:07modeling the missingness probability
- 43:12or one model from the set modeling the data distribution,
- 43:15if either one is correctly specified, okay,
- 43:21then our estimate will be consistent.
- 43:26Our estimate will be consistent.
- 43:28So, our proposed method allows you to make mistakes, okay?
- 43:37As long as you make at least one right decision,
- 43:44then you get a consistent result, okay?
- 43:49Of course, if you make all the bad decisions,
- 43:52if you didn't choose any correct model in
- 43:55these two sets of models, then you probably won't be able
- 43:59to get a consistent result.
- 44:01Right?
- 44:04And then the second theorem is about
- 44:07the asymptotic normality.
- 44:09Under certain conditions, our estimate,
- 44:17the multiple robust estimate of the marginal quantile,
- 44:20will have an asymptotic normal distribution
- 44:23with mean zero, and the variance here
- 44:28is related to this variable.
- 44:30The variance is related to this random variable here.
- 44:38And as you can see, this variance
- 44:46actually comes from these three parts,
- 44:50the estimation of the missingness probability,
- 44:53the estimation of the data distribution,
- 44:56and also the imputation process, okay?
- 45:00That's for case one.
- 45:02Similarly for case two, we have these two theorems.
- 45:08One is consistency:
- 45:11as long as one model is correctly specified,
- 45:14we will have this consistency.
- 45:17And then there is asymptotic normality:
- 45:21we will have an asymptotic normal distribution.
- 45:23And also the variance, as you can see,
- 45:28is related to the first three steps
- 45:32that estimate the different components, okay?
- 45:38And then case three: two theorems.
- 45:43Consistency: we need at least one model.
- 45:47As long as one model is correctly specified,
- 45:50we have a consistent result.
- 45:53And we have the asymptotic normality,
- 45:56and the variance comes from the three parts. Okay?
- 46:02As you can see, this is a very complicated formula.
- 46:07The model is getting more and more complicated.
- 46:10And also, you can compare the variance
- 46:15of the case-three estimator to the situation with complete case analysis.
- 46:21Because for the complete case analysis,
- 46:23we also get a consistent result, but like I said,
- 46:28it doesn't mean the variance will be optimal.
- 46:30And here, we actually can verify that the variance of the case-three estimator
- 46:34will be smaller if our models are correctly specified, okay?
- 46:43So, those are the theoretical properties.
- 46:49Now, let's look at some simulations, okay?
- 46:54We did simulations for each scenario,
- 46:58but due to the time limit, I will not present all of them.
- 47:03Let's look at the second scenario.
- 47:05In the second scenario, we have the setup here.
- 47:09We have X1 following an exponential distribution, and X2
- 47:13follows a normal distribution.
- 47:16So the two covariates have quite different distributions, okay?
- 47:20The model is a simple linear model,
- 47:24and the error distribution,
- 47:28as you can see, is heteroscedastic,
- 47:32because the error distribution is related to X1.
- 47:38The missing mechanism for X2:
- 47:42in the second scenario, part of X2 is missing
- 47:47through this logistic regression model, okay?
- 47:50Now, the missingness rate is about 38%.
- 47:57Eventually, we have this conditional quantile regression,
- 48:00a linear quantile regression, with those coefficients.
- 48:04This is our simulation setup in the second scenario.
- 48:13Now, we consider two working models for π, okay?
- 48:19The first one is correct. The second one is incorrect.
- 48:24And we consider two models for the data distribution, okay?
- 48:32All right.
- 48:33This is the incorrect one,
- 48:35based on ordinary least squares regression.
- 48:38And this is the correct one, with τ = 0.25 and 0.75.
- 48:48We have 1,000 replications.
- 48:51We have sample size n = 500, and L = 10.
- 48:55This L is related to the third step,
- 48:59the imputation.
- 49:03Okay.
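A sketch of a data-generating process matching this description (not from the talk; all coefficients are hypothetical since the slide values are not in the transcript, with the logistic intercept tuned to give roughly 38% missingness):

```python
# Scenario 2 sketch: exponential X1, normal X2, heteroscedastic error
# depending on X1, and logistic missingness for X2 given (X1, Y).
import numpy as np

def generate_scenario2(n=500, seed=0):
    rng = np.random.default_rng(seed)
    x1 = rng.exponential(scale=1.0, size=n)
    x2 = rng.normal(size=n)
    eps = (1.0 + 0.5 * x1) * rng.normal(size=n)   # heteroscedastic in X1
    y = 1.0 + x1 + x2 + eps
    # logistic missingness model for X2; intercept tuned to ~38% missing
    p_obs = 1.0 / (1.0 + np.exp(-(1.6 - 0.5 * x1 - 0.3 * y)))
    r = rng.random(n) < p_obs
    return y, x1, np.where(r, x2, np.nan), r

y, x1, x2, r = generate_scenario2()
print(1 - r.mean())   # realized missingness rate for X2
```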
- 49:03Now, here are all our simulation results, okay?
- 49:09Note that the results have been multiplied by 100;
- 49:14that's why, as you can see, some values look very large.
- 49:15And also, we denote our methods as 0000, okay?
- 49:25The first two digits represent
- 49:28the missingness probability models.
- 49:31The last two are for the data distribution models.
- 49:35For example, for IPW 1000,
- 49:40that means we only use the inverse probability weighting method,
- 49:44and the weight estimation is based on
- 49:49the correct model, okay?
- 49:52And for the imputation methods,
- 49:56that means we only use the data distribution models.
- 50:00And for this IM 0010, that means we use our first model
- 50:08for the data distribution.
- 50:14And this one uses the second model for the data distribution.
- 50:18And in either case,
- 50:20the first one is always the correct model.
- 50:23The first one is the correct model;
- 50:24the second one is not, okay?
- 50:26That's just the notation.
- 50:28As you can see here, using IPW,
- 50:31if the model is correctly specified,
- 50:34the bias is quite small
- 50:35and everything is quite good.
- 50:38However, if you misspecify the missingness probability,
- 50:42you can see the estimate is quite out of control, okay?
- 50:47Similarly for IM, the imputation: if you correctly specify
- 50:53the data distribution, the result is good.
- 50:56If not, then it's not.
- 50:58Okay.
- 50:59Then there's the multiple robust method.
- 51:03For the multiple robust method,
- 51:08we look at, for example, this one:
- 51:12if we have the missingness probability correctly specified,
- 51:15then we get a good result.
- 51:17If not, we get a bad result, as with IPW, okay?
- 51:22But anyway, if we choose to use all these four models,
- 51:29as you can see, the result is quite good, okay?
- 51:33The take-home message from this simulation study is:
- 51:39if you have some candidate models for the missingness probability
- 51:47and for the data distribution,
- 51:50and you think, "Okay, maybe this one is right,
- 51:53or maybe this one is also right," okay,
- 51:56then our method just tells you,
- 51:58"Okay, you can just put all these
- 52:04potential candidate models into our framework."
- 52:11Then we can just look at the result.
- 52:16That was the simulation for scenario two.
- 52:22We also have a simulation for scenario three,
- 52:27but I will skip it here and go directly to the
- 52:35real data analysis.
- 52:38So, in this real data analysis, we look at the
- 52:43AIDS Clinical Trials Group Protocol 175, or ACTG 175, data.
- 52:52In this research, treatment with either a single
- 53:01nucleoside or two nucleosides was evaluated in HIV-infected subjects
- 53:05whose CD4 cell counts
- 53:08are from 200 to 500 per cubic millimeter.
- 53:14So, we consider two arms of treatment.
- 53:17One is the standard treatment,
- 53:19and the other one is with the three newer treatments.
- 53:24The two arms, respectively,
- 53:29have about 500 and 1,600 subjects.
- 53:34Now, the model we are looking at is
- 53:36the linear quantile regression model
- 53:39with those kinds of covariates inside.
- 53:43The data can be found in this R package.
- 53:51Now, for the data, the average subject is 35 years old,
- 53:57the standard deviation is about nine,
- 54:01and the variable CD4 96 is missing for approximately 37%.
- 54:10It's quite similar to the simulation scenario.
- 54:16It fits the setup of our simulation scenario.
- 54:22However, at baseline and during the follow-up,
- 54:25full measurements on additional variables that are correlated
- 54:28with CD4 96 are obtained.
- 54:30So these can help us with the missing part.
- 54:39Here we assume this CD4 96 is missing at random.
- 54:46And we also have other measurements, for example,
- 54:50CD4 and CD8 counts at baseline and at 20 weeks, and so on.
- 54:56We will use these as auxiliary variables.
- 55:01So, we have our third scenario
- 55:07in this real data analysis.
- 55:12And why did we choose this data?
- 55:16Let's look at this CD4 96, the histogram of it, okay?
- 55:24The left one is before any transformation, on its original scale.
- 55:32The right one is after we do a log transformation.
- 55:39So, as you can see, the left one is kind of truncated,
- 55:46and the right one is also truncated.
- 55:49So you may debate,
- 55:50"Okay, which one should I use?
- 55:52Do I take the log transformation or not?
- 55:59To be, or not to be."
- 56:03So there's no apparent reason to favor one of them
- 56:10for the imputation method.
- 56:13Now, what do we do?
- 56:17In our proposed method,
- 56:19we can put both of these two models into our framework, okay?
- 56:26We don't need to make the choice,
- 56:29because there's no apparent reason
- 56:31to take the log, or not take the log.
- 56:33Now, let's put the two together into our model, okay?
- 56:38So we can simultaneously accommodate both specifications.
- 56:44And then we have eight covariates and the auxiliary variables.
- 56:49Then we have the missingness probability modeled by
- 56:54a logistic regression containing all main effects of X and S.
- 57:02So, here is the result. Here is the result.
- 57:04This is a big table, but let me summarize this table.
- 57:10Okay.
- 57:11The three newer treatments significantly slow the progression.
- 57:16Our proposed method and the IPW method
- 57:19produce very similar results, okay?
- 57:23And the imputation estimator 1
- 57:27failed to catch the difference in the treatment,
- 57:31the treatment arm effect, for different quantiles.
- 57:38The imputation estimator 2 gives
- 57:40increasing estimated effects and variances.
- 57:44In addition, the two imputation estimators
- 57:48are quite sensitive to the selection of the working models.
- 58:04Okay?
- 58:05And also, from this real data,
- 58:07we can see that the complete case analysis
- 58:11overestimates the treatment arm effects once again.
- 58:16So even though sometimes the complete case analysis is valid,
- 58:23there are also advantages to using our proposed method.
- 58:34All right, so here's the summary of my talk.
- 58:40We proposed a general framework for
- 58:44quantile estimation with missing data.
- 58:48And we actually applied this framework
- 58:52in different scenarios.
- 58:55Now, the take-home message is:
- 59:00our proposed method is multiply robust against
- 59:04possible model misspecification.
- 59:08So, we have two sets of models,
- 59:10one for the missingness probability
- 59:12and one for the data distribution.
- 59:14As long as one model is correct,
- 59:17then we will get a good result.
- 59:19And also, our method can easily be generalized
- 59:23to many other scenarios.
- 59:26And I think that's all of my talk,
- 59:32and thank you.
- 59:36- All right.
- 59:37Thank you, Linglong. This was very interesting.
- 59:39I think we're almost out of time, so
- 59:43we have time probably for one question.
- 59:45So if there's any; if not...
- 59:48Let's see if there are any questions.
- 59:52Feel free to write in the chat box or just speak up.
- 01:00:12Okay.
- 01:00:13Just gonna ask one question
- 01:00:14and then I think I'm gonna ask all the questions
- 01:00:17when we meet.
- 01:00:19Just a quick question.
- 01:00:20Do you know why the complete case analysis gives
- 01:00:24overestimation rather than underestimation?
- 01:00:27Like, do you have a feeling for why that's the case?
- 01:00:33- Well, I don't know. No.
- 01:00:39- Yeah.
- 01:00:40I believe it would be interesting to see in what cases,
- 01:00:42like, what are the conditions for overestimation
- 01:00:45or underestimation for complete case analysis, I guess.
- 01:00:48I guess it must depend on the data distribution
- 01:00:52and the missingness mechanism that's assumed.
- 01:00:56But I'm not sure.
- 01:00:59- I agree with you.
- 01:01:01The reason I would answer "I don't know" is
- 01:01:05because it's really hard to know how the data is missing.
- 01:01:11Although we assume it's missing at random.
- 01:01:13- Yeah.
- 01:01:14- But, who knows the reality?
- 01:01:17- Right. Yeah, right.
- 01:01:19I guess, under your assumption of missing at random,
- 01:01:22then I guess there could be conditions for underestimation
- 01:01:27or overestimation under the MAR assumption.
- 01:01:31But, I don't know.
- 01:01:32I was wondering if people have derived those or not.
- 01:01:36(laughs)
- 01:01:37That could be future work, right?
- 01:01:40(laughs)
- 01:01:42All right.
- 01:01:43Linglong, thank you.
- 01:01:44I'll see you in an hour for the one-on-one meetings,
- 01:01:47and I know other students and maybe faculty have
- 01:01:51signed up to meet with you.
- 01:01:53So, thank you very much.
- 01:01:55And I'll see you later. All right.
- 01:01:57- Thank you.
- 01:01:57- Bye-bye. Thank you everyone for joining.
- 01:01:58Bye.
- 01:01:59- Bye.
- 01:02:00- Bye.