YSPH Biostatistics Seminar: "Measures of Selection Bias for Proportions Estimated from Non-Probability Samples"
November 16, 2021
Rebecca Andridge, PhD, Associate Professor, Department of Biostatistics, The Ohio State University
- 00:00Hi.
- 00:01Hi everybody.
- 00:02(students) Hi.
- 00:03It's my pleasure today
- 00:03to introduce Professor Rebecca Andridge.
- 00:06Professor Andridge has a Bachelor's in Economics from Stanford
- 00:10and her Master's and PhD in Biostatistics
- 00:13from the University of Michigan.
- 00:16She's an expert in group randomized trials
- 00:18and methods for missing data,
- 00:19especially for that ever-so-tricky case
- 00:23where data is missing not at random.
- 00:26She's been faculty in Biostatistics at The Ohio State University
- 00:29since 2009.
- 00:31She's an award-winning educator
- 00:32and a 2020 Fellow of the American Statistical Association,
- 00:36and we're very honored to have her here today.
- 00:38Let's welcome professor Andridge.
- 00:40(students clapping)
- 00:43Thank you for the very generous introduction.
- 00:46I have to tell you,
- 00:47it's so exciting to see a room full of students.
- 00:51I am currently teaching an online class
- 00:52and the students don't all congregate in a room.
- 00:54So it's like been years since I've seen this.
- 00:58So I'm of course gonna share my slides.
- 01:01I want to warn everybody that I am working from home today.
- 01:06And while we will not be interrupted by my children
- 01:09we might be interrupted or I might be interrupted
- 01:11by the construction going on in my house,
- 01:13my cats or my fellow work-from-home husband.
- 01:16So I'm gonna try to keep the distractions to a minimum
- 01:18but that is the way of the world in 2021,
- 01:22in the pandemic life.
- 01:24So today I'm gonna be talking about some work
- 01:26I've done with some colleagues
- 01:27actually at the University of Michigan.
- 01:29Talking about selection bias
- 01:31in proportions estimated from non-probability samples.
- 01:36So I'm gonna start with some background and definitions
- 01:38and we'll start with kind of overview
- 01:40of what's the problem we're trying to address.
- 01:43So big data are everywhere, right?
- 01:45We all have heard that phrase being bandied about, big data.
- 01:49They're everywhere and they're cheap.
- 01:50You got Twitter data, internet search data, online surveys,
- 01:53things like predicting the flu using Instagram, right?
- 01:56All these massive sources of data.
- 01:59And these data often, I would say pretty much always,
- 02:03arise from what are called non-probability samples.
- 02:07So when we have a non-probability sample
- 02:08we can't use what are called design based methods
- 02:11for inference,
- 02:11you actually have to use model based approaches.
- 02:14So I'm not gonna assume that everybody knows
- 02:16all these words that I've thrown out here,
- 02:18so I'm gonna go into some definitions.
- 02:22So our goal is to develop an index of selection bias
- 02:25that lets us get at how bad the problem might be,
- 02:28how much bias might we have due to non-random selection
- 02:32into our sample?
- 02:34So a probability sample is a situation
- 02:38where you're collecting data
- 02:39where each unit in the population
- 02:41has a known positive probability of selection.
- 02:44And randomness is involved in the selection of which units
- 02:47come into the sample, right?
- 02:49So this is your stereotypical complex survey design
- 02:53or your sample survey.
- 02:55Large government sponsored surveys
- 02:57like the National Health and Nutrition Examination Survey,
- 03:00NHANES or NHIS or any number of large surveys
- 03:04that you've probably come across,
- 03:06you know, in application and your biostatistics courses.
- 03:09So for these large surveys
- 03:11we do what's called design-based inference.
- 03:14So that's where we rely on the design
- 03:16of the data collection mechanism
- 03:18in order for us to get unbiased estimates
- 03:20of population quantities,
- 03:21and we can do this without making any model assumptions.
- 03:24So we don't have to assume
- 03:26that let's say body mass index has a normal distribution.
- 03:29We literally don't have to specify distribution at all.
- 03:32It's all about the random selection into the sample
- 03:35that lets us get our estimates
- 03:36and be assured that we have unbiased estimates.
- 03:40So here's an example in case there are folks
- 03:43out in the audience who don't have experience
- 03:45with the sort of complex survey design or design features.
- 03:48So this is a really silly little example
- 03:49of a stratified sample.
- 03:51So here I have a population
- 03:53of two different types of animals.
- 03:55I have cats and I have dogs.
- 03:57And in this population I happen to have 12 cats and 8 dogs.
- 04:01And I have taken a stratified sample
- 04:03where I took two cats and two dogs.
- 04:07So in this design the selection probabilities
- 04:09are known for all of the units, right?
- 04:11Because I know that there's a two out of eight chance
- 04:14I pick a dog and a two out of 12 chance
- 04:16that I pick a cat, right?
- 04:18So the probability a cat is selected is 1/6
- 04:21then the probability of dog is selected is 1/4.
- 04:23Now, how do I estimate a proportion of interest?
- 04:26Let's say it's the proportion of orange animals
- 04:28in the population.
- 04:29Like here in my sample,
- 04:30I have one of four orange animals,
- 04:32but if I chose that as my estimator
- 04:34I'd be ignoring the fact that I know how I selected
- 04:37these animals into my sample.
- 04:39So what we do is we weight the sample units
- 04:42to produce design-unbiased estimates, right?
- 04:44Because this one dog kinda counts
- 04:48differently than one cat, right?
- 04:50Because there were only eight dogs
- 04:51to begin with but there were 12 cats.
- 04:54So if I want to estimate the proportion of orange animals
- 04:57I would say this cat has a weight of six
- 05:00because there's two of them and 12 total.
- 05:02So 12 divided by two is six.
- 05:04So there's six in the numerator.
- 05:06And then the denominator is the sum of the weights
- 05:08of all the selected units,
- 05:10the cats are each six and the dogs are each four.
- 05:12So I actually get an estimated proportion of 30%
- 05:15instead of 25%.
- 05:17So that kind of weighted estimator
- 05:18is what we do in probability sampling.
- 05:20And we don't have to say what the distribution
- 05:22of dogs or cats is in the sample
- 05:24or orangeness in the sample,
- 05:26we entirely rely on the selection mechanism.
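To make the arithmetic concrete, here is a minimal sketch of that design-weighted estimator in Python, using the cat-and-dog counts from the example (the variable names are mine, not from the talk):

```python
# Design-weighted estimate of the proportion of orange animals from the
# stratified sample above: 2 of 12 cats and 2 of 8 dogs were sampled.
sample = [
    (1, 12 / 2),  # the orange cat, weight 6
    (0, 12 / 2),  # the other sampled cat, weight 6
    (0, 8 / 2),   # first sampled dog, weight 4
    (0, 8 / 2),   # second sampled dog, weight 4
]

numerator = sum(y * w for y, w in sample)    # 6
denominator = sum(w for _, w in sample)      # 20
print(numerator / denominator)               # 0.30, versus the naive 1/4 = 0.25
```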
- 05:30What ended up happening in the real world
- 05:32a lot of the time is we don't actually get to use
- 05:35those kinds of complex designs.
- 05:36And instead we collect data
- 05:38through what's called a non-probability sample.
- 05:40So in a non-probability sample,
- 05:42it's pretty easy to define.
- 05:43You cannot calculate the probability of selection
- 05:46into the sample, right?
- 05:47So we simply don't know what the mechanism
- 05:49was that made a unit enter our sample.
- 05:53I know there's the biostatistics students in the audience,
- 05:55and you've all probably done a lot of data analysis.
- 05:57And I would venture a guess that a lot of the times
- 06:00your application datasets
- 06:01are non-probability samples, right?
- 06:03A lot of the times there are convenience samples.
- 06:05I work a lot with biomedical researchers
- 06:07studying cancer patients.
- 06:08Well guess what, it's almost always a convenience sample
- 06:12of cancer patients, right?
- 06:13It's who will agree to be in the study?
- 06:15Who can I find to be in my study?
- 06:17Other types of non-probability samples
- 06:19include things like voluntary or self-selection sampling,
- 06:22quota sampling, that's a really old,
- 06:24old school method from polling back many years ago.
- 06:28Judgment sampling or snowball sampling.
- 06:30So there's a lot of different ways
- 06:31you can get non-probability samples.
- 06:34So if we go back to the dog and cat example,
- 06:37if I didn't know anything about how these animals
- 06:39got into my sample and I just saw the four of them,
- 06:41and one of them was orange,
- 06:43I guess, I'm gonna guess 25% of my population is orange.
- 06:48Right?
- 06:49I don't have any other information
- 06:50I can't recreate the population
- 06:53like I could with the weighting.
- 06:54Where I knew how many cats in the population
- 06:57did each of my sampled cats represent
- 06:59and similarly for the dogs.
- 07:01So of course our best guess looking at these data
- 07:03would just be 25%, right?
- 07:05One out of the four animals is orange.
- 07:07So when you think about a non-probability sample,
- 07:10how much faith do you put in that estimate,
- 07:12that proportion?
- 07:15Hard to say, right?
- 07:16It depends on what you believe about the population
- 07:19and how you selected this non-probability sample
- 07:23but you do not have the safety net of the probability sample
- 07:26that guaranteed you're gonna get an unbiased estimate
- 07:28over repeated applications of the sampling.
- 07:32So I've already used the word selection bias
- 07:34a lot and have sort of been assuming that you know what I mean.
- 07:37So now I'm gonna come back to it and define it.
- 07:40So selection bias is bias arising
- 07:42when part of the target population
- 07:45is not in the sample population, right?
- 07:47So when there's a mismatch between who got into your sample
- 07:49and who was supposed to get into your sample, right?
- 07:51Who's the population?
- 07:53Or in a more general statistical kind of way,
- 07:56when some population units are sampled at a different rate
- 07:59than you meant.
- 08:00It's like you meant for there to be a certain selection
- 08:03probability for orange animals or for dogs
- 08:06but it didn't actually end up that way.
- 08:08That will lead you down the path of selection bias.
- 08:11And I will note that again, as you are biostats students
- 08:13you've probably had some epidemiology.
- 08:15And epidemiologists talk about selection bias as well.
- 08:17It's the same concept, right?
- 08:19That concept of who is ending up in your sample.
- 08:22And is there some sort of a bias in the mechanism?
- 08:26So selection bias is in fact the predominant
- 08:28concern with non-probability samples.
- 08:30In these non-probability samples,
- 08:32the units in the sample might be really different
- 08:36from the units not in the sample,
- 08:37but we can't tell how different they are.
- 08:40Whether we're talking about people, dogs, cats, hospitals,
- 08:43whatever we're talking about.
- 08:44However, these units got into my sample, I don't know.
- 08:47So I don't know if the people in my sample
- 08:49look like my population or not.
- 08:53And an important key thing to know
- 08:55is that probability samples
- 08:57when we have a low response rates, right?
- 08:59So when there are a lot of people not responding
- 09:01you're basically ending up back
- 09:03at a non-probability sample, right?
- 09:05Where we have this beautiful design,
- 09:07we know everybody's sampling weight, we draw a sample,
- 09:10oops, but then only 30% of people respond to my sample.
- 09:14You're basically injecting that bias back in again.
- 09:16Sort of undoing the beauty of the probability sample.
- 09:21So when we think about a selection
- 09:23bias and selection into a sample,
- 09:25we can categorize them in two ways.
- 09:28And Dr. McDougal, actually,
- 09:30when he was giving you my brief little bio
- 09:32used the words that I'm sure you've used
- 09:34in your classes before like ignorable and non-ignorable.
- 09:37These are usually are more commonly applied
- 09:39to missingness, right?
- 09:41So ignorable missingness mechanisms
- 09:43and non-ignorable missingness mechanisms.
- 09:45Missing at random, missing completely at random
- 09:48or missing not at random, right?
- 09:50Same exact framework here.
- 09:52But instead of talking about missingness
- 09:54we're talking about selection into the sample.
- 09:56So when we have an ignorable selection mechanism,
- 09:59that means the probability of selection
- 10:01depends on things I observed.
- 10:02Right, it depends on the observed characteristics.
- 10:05When I have a non-ignorable selection mechanism
- 10:08now that probability of selection depends
- 10:10on unobserved characteristics.
- 10:12Again, this is not really a new concept
- 10:14if you've learned about missing data,
- 10:15just apply to selection into the sample.
- 10:20So in a probability sample
- 10:22we might have different probabilities of selection
- 10:24for different types of units like for cats versus for dogs.
- 10:28But we know exactly how they differ, right?
- 10:31It's because I designed my survey
- 10:33based on this characteristic of dog versus cat
- 10:36and I know exactly the status of dog versus cat
- 10:38for my entire population in order to do that selection.
- 10:42So I absolutely can estimate the proportion of orange
- 10:45animals unbiasedly in the sense of taking repeated
- 10:49stratified samples and estimating that proportion.
- 10:52I am guaranteed that I'm gonna get an unbiased
- 10:54estimate, right?
- 10:55So this selection mechanism
- 10:57is definitely not non-ignorable, right?
- 11:00This is definitely an ignorable selection mechanism
- 11:02in the sense that it only depends
- 11:04on observed characteristics.
- 11:06But if my four animals had just come from,
- 11:09I don't know where?
- 11:10Convenience.
- 11:11Well now why did they end up in my sample?
- 11:14It could depend on something that we didn't observe.
- 11:16What breed of dog it was?
- 11:18The age of the dog, the color of the dog.
- 11:20It could have been pretty much anything, right?
- 11:22That's the problem with the convenience sample.
- 11:24You don't know why those units
- 11:25often self-selected to be into your sample.
- 11:29So now I'm gonna head into the kind of ugly statistical
- 11:32notation portion of this talk.
- 11:35So we'll start with estimated proportions.
- 11:37So we'll use Y as our binary indicator
- 11:41for the outcome, okay?
- 11:43But here I'm gonna talk about Y
- 11:45more generally as all the survey data.
- 11:49So we'll start with Y as all the survey data,
- 11:50then we're gonna narrow it down to Y
- 11:51as the binary indicator.
- 11:53So we can partition our survey data into the data
- 11:57for the units we got in the sample
- 11:58and the data for units that are not in the sample.
- 12:01So, selected into the sample versus
- 12:03not selected into the sample.
- 12:05But for everybody I have Z,
- 12:07I have some fully observed
- 12:09what are often called design variables.
- 12:11So this is where we are using information
- 12:14that we know about an entire population
- 12:16to select our sample in the world of probability sampling.
- 12:20And then S is the selection indicator.
- 12:23So these three variables have a joint distribution.
- 12:26And most of the time,
- 12:27what we care about is Y given Z.
- 12:30Right, we're interested in estimating
- 12:32some outcome characteristic
- 12:34conditional on some other characteristic, right?
- 12:37Average weight for dogs, average weight for cats, right?
- 12:40Y given Z.
- 12:42But Y given Z is only part of the issue,
- 12:45there's also a selection mechanism, right?
- 12:48So there's also this function
- 12:49of how do you predict selection S with Y and Z.
- 12:53And I'm using this additional Greek letter psi here
- 12:56to denote additional variables
- 12:58that might be involved, right?
- 13:00'Cause selection could depend on more than just Y and Z.
- 13:03It could depend on something outside
- 13:04of that set of variables.
- 13:07So when we have probability sampling,
- 13:08we have what's called
- 13:09an extremely ignorable selection mechanism,
- 13:12which means selection can depend on Z,
- 13:14like when we stratified on animal type
- 13:16but it cannot depend on Y.
- 13:18Either the selected units Y or the excluded units Y
- 13:22doesn't depend on either.
- 13:24Kind of vaguely like the MCAR of selection mechanisms.
- 13:27It doesn't depend on Y at all.
- 13:29Observed or unobserved.
- 13:31But it can depend on Z.
- 13:31So that makes it different than MCAR.
- 13:34So including a unit into the sample
- 13:36is independent of those survey outcomes Y
- 13:39and also any unobserved variables, right?
- 13:41That psi here, that psi goes away.
- 13:44So selection only depends on Z.
- 13:46So if I'm interested in this inference target
- 13:49I can ignore the selection mechanism.
- 13:51So this is kind of parallels that idea
- 13:54in the missingness, in the missing data literature, right?
- 13:56If I have an ignorable missingness mechanism
- 13:59I can ignore that part of it.
- 14:00I don't have to worry about modeling
- 14:02the probability that a unit is selected.
- 14:05But the bad news is, in our non-probability sampling,
- 14:08it's very, very arguably true
- 14:11that you could have non-ignorable selection, right?
- 14:13It's easy to make an argument for why the people
- 14:16who ended up into your sample,
- 14:18your convenience sample are different than the people
- 14:20who don't enter your sample.
- 14:22Think about some of these big data examples.
- 14:24Think about Twitter data.
- 14:26Well, I mean, you know,
- 14:27the people who use Twitter are different
- 14:29than the people who don't use Twitter, right?
- 14:31In lots of different ways.
- 14:32So if you're going to think about drawing
- 14:34some kind of inference about the population,
- 14:36you can't just ignore that selection mechanism.
- 14:39You need to think about how do they enter
- 14:41into your Twitter sample
- 14:42and how might they be different than the people
- 14:44who did not enter into your Twitter sample.
- 14:47So when we're thinking about the selection mechanism
- 14:49basically nothing goes away, right?
- 14:51We can't ignore this selection mechanism.
- 14:53But we have to think
- 14:55about it when we want to make inference,
- 14:56even when our inference is about Y given Z, right?
- 14:59Even when we don't actually care
- 15:00about the selection mechanism.
- 15:02So the problem with non-probability samples
- 15:04is that it's often very, very hard to model S
- 15:07or we don't really have a good set of data
- 15:10with which to model the probability
- 15:11someone ended up in your sample.
- 15:14And that's basically what you have to do to generalize
- 15:17to the population, right?
- 15:19There's methods that exist for non-probability samples
- 15:21that require you to do something along the lines
- 15:24of finding another dataset
- 15:26that has similar characteristics
- 15:27and model the probability of being in the non-probability
- 15:30sample, right?
- 15:31So that's doable in many situations
- 15:34but what we're looking for is a method
- 15:35that doesn't require you to do that
- 15:37but instead says, let's do a sensitivity analysis.
- 15:40Let's say, how big of a problem
- 15:43might selection bias be if we ignored
- 15:46the selection mechanism, right?
- 15:47If we just sort of took our sample on faith
- 15:49as if it were an SRS from the population.
- 15:52How wrong would we be
- 15:54depending on how bad our selection bias problem is?
- 15:59So there has been previous work done
- 16:00in this area, often in surveys,
- 16:03trying to think about how confident
- 16:06we are that we can generalize to the population
- 16:08even when we're doing a probability sample.
- 16:10So there's work on thinking about the representativeness
- 16:14of a sample.
- 16:15So that's again, the generalizability to the population.
- 16:18So there's something called an R-indicator,
- 16:21which is a function of response probabilities
- 16:25or propensities,
- 16:26but it doesn't involve the survey variables.
- 16:28So it's literally comparing the probability of response
- 16:32to a survey for different demographic,
- 16:34across different demographic characteristics, for example.
- 16:37Right.
- 16:38And seeing who is more likely to respond than someone else.
- 16:40And if there are those differences
- 16:41then adjustments need to be made.
- 16:44There's also something called the H1 indicator,
- 16:47which does bring Y into the equation
- 16:49but it assumes ignorable selection.
- 16:52So it's going to assume that the Y
- 16:54for the excluded units gets dropped out.
- 16:58The selection mechanism only depends
- 16:59on things that you observe, so you can ignore it, right?
- 17:03So it's ignorable.
- 17:04So that's not what we're interested in.
- 17:06'Cause we're really worried in the non probability space
- 17:09that we can't ignore the selection mechanism.
- 17:13And there's a relatively new indicator
- 17:15that they called the SMUB, S-M-U-B.
- 17:19That is an index that actually extends
- 17:21this idea of selection bias
- 17:23to allow for non ignorable selection.
- 17:25So it lets you say, well, what would my point estimate
- 17:29be for a mean if selection were in fact ignorable,
- 17:33and now let's go to the other extreme,
- 17:35suppose selection only depends on Y.
- 17:37And I'm trying to estimate average weight
- 17:39and whether or not you entered my sample
- 17:40is entirely dependent on your weight.
- 17:43That's really not ignorable.
- 17:45And then it kinda bounds the potential magnitude
- 17:47for the problem.
- 17:49So that SMUB, this estimator is really close
- 17:52to what we want, but we want it for proportions,
- 17:55especially because in survey work and in large datasets,
- 18:00we very often have categorical data
- 18:03or very, very often binary data.
- 18:05If you think about if you've ever participated
- 18:07in an online survey or filled out those kinds of things
- 18:10very often, right, You're checking a box.
- 18:11It's multiple choice, select all that apply.
- 18:13It's lots and lots of binary data floating around out there.
- 18:17And I'll show you a couple of examples.
- 18:19So that was a lot of kind of me talking
- 18:22at you about the framework.
- 18:23Now, let me bring this down to a solid example application.
- 18:27So I'm going to use the National Survey
- 18:30of Family Growth as a fake population.
- 18:32So I want you to pretend that I have a population
- 18:36of 19,800 people, right?
- 18:38It happens to be that I pulled it from the National Survey
- 18:40of Family Growth,
- 18:41that's not really important that that was the source.
- 18:43I've got this population of about 20,000 people.
- 18:46But let's pretend we're doing a study
- 18:48and I was only able to select
- 18:50into my sample smartphone users.
- 18:52Because I did some kind of a survey that was on their,
- 18:54you had to take it on your phone.
- 18:56So if you did not have a smartphone
- 18:57you could not be selected into my sample.
- 19:00In this particular case, in this fake population,
- 19:03it's a very high selection fraction.
- 19:04So about 80% of my population is in my sample.
- 19:07That in and of itself is very unusual, right?
- 19:11A non-probability sample is usually very,
- 19:13very small compared to the full population
- 19:15let's say of the United States
- 19:17if that's who we're trying to generalize to.
- 19:18But for the purposes of illustration
- 19:20it helps to have a pretty high selection fraction.
- 19:22And we'll assume that the outcome we're interested
- 19:24in is whether or not the individual has ever been married.
- 19:28So this is person level data, right?
- 19:29Ever been married.
- 19:31And it is...
- 19:32we wanna estimate it by gender,
- 19:34and I will note that the NSFG only captures
- 19:36gender as a binary variable.
- 19:39This is a very long standing survey,
- 19:41been going on since the seventies.
- 19:42We know our understanding of gender as a construct
- 19:45has grown a lot since the seventies
- 19:47but this survey, and in fact
- 19:48many governmental surveys still treat gender
- 19:51as a binary variable.
- 19:52So that's our limitation here
- 19:54but I just want to acknowledge that.
- 19:56So in this particular case,
- 19:58we know the true selection bias, right?
- 20:00Because I actually have all roughly 20,000 people
- 20:04so that therefore I can calculate what's the truth,
- 20:06and then I can use my smartphone sample and say,
- 20:08"Well, how much bias is there?"
- 20:11So it turns out that in the full sample
- 20:1346.8% of the females have never been married.
- 20:16And 56.6% of the males had never been married.
- 20:20But if I use my selected sample of smartphone users
- 20:23I'm getting a, well, very close,
- 20:25but slightly smaller estimate for females.
- 20:2846.6% never married.
- 20:30And for males it's like about a percentage
- 20:32point lower than the truth, 55.5%.
- 20:35So not a huge amount of bias here.
- 20:38My smartphone users are not all that non-representative
- 20:41with respect to the entire sample,
- 20:43at least with respect to whether
- 20:44or not they've ever been married.
- 20:47So when we have binary data,
- 20:49an important point of reference is what happens if we assume
- 20:53everybody not in my sample is a one, right?
- 20:55What if everybody not in my sample was never married
- 20:58or everyone not in my sample
- 21:01is a no to never married, right?
- 21:03So, like, has ever been married?
- 21:05And these are what's called the Manski bounds.
- 21:07When you fill in all zeros or fill in all ones
- 21:10for the missing values or the values
- 21:12for those non-selected folks.
- 21:14So we can bound the bias.
- 21:15So the bias of this estimate of 0.466, or 46.6%,
- 21:21has to be by definition
- 21:23between negative 0.098 and positive 0.085.
- 21:26Because those are the two ends of putting all zeros
- 21:29or all ones for the people who are not in my sample.
- 21:32So this is unlike a continuous variable, right?
- 21:35Where we can't actually put a finite bound on the bias.
- 21:38We can with a proportion, right?
- 21:40So this is why, for example,
- 21:42if any of you ever work on smoking cessation studies
- 21:45often they do sensitivity analysis.
- 21:47People who drop out assume they're all smoking, right?
- 21:50Or assume they're all not smoking.
- 21:51They're not calling it that
- 21:53but they're getting the Manski bounds.
- 21:56Okay.
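As a rough sketch of that calculation (my own illustration; the talk's exact selection fraction differs slightly, which is why the printed bounds are -0.098 and +0.085 rather than these):

```python
def manski_bias_bounds(p_sel, f_sel):
    """Bounds on the bias of the selected-sample proportion p_sel when a
    fraction (1 - f_sel) of the population is unobserved: fill the missing
    binary outcomes in as all ones (lower bound) or all zeros (upper)."""
    lower = p_sel - (f_sel * p_sel + (1 - f_sel))  # everyone missing is a 1
    upper = p_sel - f_sel * p_sel                  # everyone missing is a 0
    return lower, upper

# Roughly the NSFG smartphone illustration: 46.6% observed, ~80% selected.
print(manski_bias_bounds(0.466, 0.80))  # approximately (-0.107, +0.093)
```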
- 21:57So the question is, can we do better than the Manski bounds?
- 22:00Because these are actually pretty wide bounds,
- 22:02relative to the size of the true bias,
- 22:04and these are very wide.
- 22:06And imagine a survey where we didn't have 80% selected.
- 22:10What if we had 10% selected?
- 22:12Well, then the Manski bounds are gonna be useless, right?
- 22:14Plug in all zeros, plug in all ones,
- 22:16you're gonna get these insane estimates
- 22:17that are nowhere close to what you observed.
- 22:21So going back to the statistical notation,
- 22:23this is where I said we're going to use Y
- 22:24in a slightly different way.
- 22:26Now, Y, from now forward, is the binary variable of interest.
- 22:30So in this case, in this NSFG example
- 22:33it was never married.
- 22:35We have a bunch of auxiliary variables that we observed
- 22:38for everybody in the selected sample;
- 22:41age, race, education, et cetera,
- 22:43and I'm gonna call those Z.
- 22:48Assume also that we have summary statistics
- 22:51on Z for the non-selected cases.
- 22:53So I don't observe Z for everybody, right?
- 22:55All my non-smartphone users,
- 22:57I don't know for each one of them, what is their gender?
- 23:00What is their age? What is their race?
- 23:02But I don't actually observe that.
- 23:03But I observed some kinda summary statistic.
- 23:06Like a mean vector and a covariance matrix of Z.
- 23:09So I have some source of what does my population
- 23:12look like at an aggregate level?
- 23:14And in practice, this would come from something
- 23:16like census data or in a very large probability sample,
- 23:20something where we would be pretty confident
- 23:21is reflective of the population.
- 23:23I will note that if we have data for the population
- 23:27and not the non-selected,
- 23:29then we can kinda do subtraction, right?
- 23:30We can take the data for the population
- 23:32and aggregate and go backwards
- 23:35to figure out what it would be for the non-selected
- 23:36by effectively backing out the selected cases.
- 23:40And similarly, if another problem
- 23:42is that we don't have the variance,
- 23:43we could just assume it's what we observe
- 23:44in the selected cases.
- 23:46So how are we gonna use this in order
- 23:48to estimate selection bias?
- 23:52We're gonna come up
- 23:53with this measure of unadjusted bias for proportions
- 23:56called the MUBP.
- 23:59So the MUBP is an extension of the SMUB
- 24:02that was for means, for continuous variables
- 24:04to binary outcomes, right?
- 24:06To proportions.
- 24:07High-level, it's based on pattern-mixture models.
- 24:10It requires you to make explicit assumptions
- 24:13about the distribution of the selection mechanism,
- 24:15and it provides you a sensitivity analysis,
- 24:18basically make different assumptions on S,
- 24:20I don't know what that distribution is,
- 24:22and you're gonna get a range of bias.
- 24:24So that's that idea of how wrong might we be?
- 24:28So we're trying to just tighten those bounds
- 24:30compared to the Manski bounds.
- 24:31Where we don't wanna have to rely on plug in all zeros,
- 24:33plug in all ones,
- 24:35we wanna shrink that interval
- 24:36to give us something a little bit more meaningful.
- 24:38So the basic idea behind how this works
- 24:41before I show you the formulas is we can measure
- 24:44the degree of selection bias in Z, right?
- 24:47Because we observed Z for our selected sample,
- 24:50and we observed at an aggregate for the population.
- 24:53So I can see, for example, that if in my selected sample,
- 24:56I have 55% females but in the population it's 50% females.
- 25:01Well, I can see that bias.
- 25:03Right, I can do that comparison.
- 25:04So absolutely I can tell you how much selection bias
- 25:08there is for all of my auxiliary variables.
- 25:11So if my outcome Y is related to my Zs
- 25:16then knowing something about the selection bias in Z
- 25:19tells me something about the selection bias in Y.
- 25:22It doesn't tell me exactly the selection bias in Y
- 25:25but it gives me some information in the selection bias in Y.
- 25:28So in the extreme imagine if your Zs
- 25:32in your selected sample
- 25:33in aggregate looked exactly like the population.
- 25:36Well, then you'd be pretty confident, right?
- 25:40That there's not an enormous amount of selection bias
- 25:42in Y assuming that Y was related to the Z.
- 25:46So we're gonna use pattern-mixture models
- 25:48to explicitly model that distribution of S, right?
- 25:52And we're especially gonna focus on the case
- 25:54when selection depends on Y.
- 25:56It depends on our binary outcome of interest.
- 26:00So again, Y is that binary variable interest,
- 26:03we only have it for the selected sample.
- 26:05In the NSFG example it's whether the woman or man
- 26:08has ever been married.
- 26:10We have Z variables available for the selected cases
- 26:13in micro data and an aggregate for the non-selected sample,
- 26:16a demographic characteristics
- 26:18like age, education, marital status, et cetera.
- 26:22And the way that we're gonna go
- 26:24about doing this is we're gonna try
- 26:25to get back to the idea of normality,
- 26:27because then as you all know, when everything's normal
- 26:30it's great, right?
- 26:32It's easy to work with the normal distribution.
- 26:34So the way we can do that with a binary variable
- 26:37is we can think about latent variables.
- 26:39So we're going to think about a latent variable called U.
- 26:42That is an underlying, unobserved latent variables.
- 26:45So unobserved for everybody, including our selected sample.
- 26:48And it's basically thresholded.
- 26:50And when U crosses zero, well, then Y goes from zero to one.
- 26:54So I'm sure many, all of you have seen probit regression,
- 26:58or this is what happens
- 26:59and this is how probit regression is justified,
- 27:01via latent variables.
- 27:04So we're going to take our Zs
- 27:06that we have for the selected cases,
- 27:08and essentially reduce the dimensionality.
- 27:11We're gonna take the Zs,
- 27:13run a probit regression of Y on Z in the selected cases,
- 27:17and pull out the linear predictor
- 27:19from the regression, right?
- 27:20The X beta, right?
- 27:22Sorry, Z beta.
- 27:24And I'm gonna call that X.
- 27:25That is my proxy for Y or my Y hat, right?
- 27:30It's just the predicted value from the regression.
- 27:32And I can get that for every single observation
- 27:35in my selected sample, of course, right?
- 27:37Just plug in each individual's Z values
- 27:39and get out their Y hat.
- 27:40That's my proxy value.
- 27:42And it's called the proxy
- 27:44because it's the prediction, right?
- 27:45It's our sort of best guess at Y
- 27:47based on this model.
- 27:49So I can get it for every observation in my selected sample,
- 27:52but very importantly I can also get it on average
- 27:56for the non-selected sample.
- 27:57So I have all my beta hats for my probit regression,
- 28:01and I'm gonna plug in Z-bar.
- 28:03And I'm going to plug in the average value of my Zs.
- 28:06And that's going to give me the average value
- 28:08of X for the non-selected cases.
- 28:11I don't have an actual observed value
- 28:13for all those non-selected cases
- 28:15but I have the average, right?
- 28:16So I could think about comparing the average X value
- 28:19in the aggregate, in the non-selected cases
- 28:22to that average X among my selected cases.
- 28:24And that is of course
- 28:26exactly where we're gonna get this index from.
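A small sketch of that proxy construction (simulated data and illustrative names, not the NSFG variables; statsmodels' Probit supplies the fit):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
Z_sel = rng.normal(size=(n, 3))                   # auxiliary variables, selected cases
y_sel = (Z_sel @ [0.8, -0.5, 0.3] + rng.normal(size=n) > 0).astype(int)
zbar_nonsel = np.array([0.15, -0.05, 0.10])       # aggregate Z-bar, non-selected cases

# Probit regression of Y on Z in the selected sample.
fit = sm.Probit(y_sel, sm.add_constant(Z_sel)).fit(disp=0)

# The proxy X = Z * beta-hat: one value per selected unit, and -- via the
# aggregate mean of Z -- its average for the non-selected cases.
x_sel = sm.add_constant(Z_sel) @ fit.params
xbar_nonsel = np.concatenate(([1.0], zbar_nonsel)) @ fit.params
print(x_sel.mean(), xbar_nonsel)
```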
- 28:29So I have my selection indicator S,
- 28:31so in the smartphone example,
- 28:33that's S equals one for the smartphone users
- 28:35and S equals zero for the non-smartphone users
- 28:37who weren't in my sample.
- 28:39And importantly, I'm going to allow
- 28:40there to be some other covariates V
- 28:43floating around in here that are independent of Y and X
- 28:46but could be related to selection.
- 28:48Okay.
- 28:49So it could be related to how you got into my sample
- 28:51but importantly, not related to the outcome.
- 28:55So diving into the math here, the equations,
- 28:59we're gonna assume a proxy pattern-mixture model for U,
- 29:02the latent variable underlying Y
- 29:05and X given the selection indicator.
- 29:08So what a pattern-mixture model does is it says
- 29:11there's a totally separate distribution
- 29:14or joint distribution of U and X for the selected units
- 29:16and the non-selected units.
- 29:18Notice that all my mus, all my sigmas, my rho,
- 29:21they've all got a superscript of j, right?
- 29:23So that's whether your S equals zero or S equals one.
- 29:27So two totally different bivariate normal distributions
- 29:31for U and X,
- 29:33depending on if you're selected or non-selected.
- 29:35And then we have a marginal distribution
- 29:37just Bernoulli, for the selection indicator.
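In symbols, the model just described is roughly the following (my compact restatement of the talk's setup; the superscript j indexes selection status):

```latex
\begin{align*}
  (U_i, X_i) \mid S_i = j \;&\sim\; \mathcal{N}_2\!\left(
      \begin{pmatrix} \mu_U^{(j)} \\ \mu_X^{(j)} \end{pmatrix},\;
      \begin{pmatrix}
        \sigma_{UU}^{(j)} & \rho^{(j)}\sqrt{\sigma_{UU}^{(j)}\sigma_{XX}^{(j)}} \\
        \rho^{(j)}\sqrt{\sigma_{UU}^{(j)}\sigma_{XX}^{(j)}} & \sigma_{XX}^{(j)}
      \end{pmatrix}\right), \qquad j = 0, 1, \\
  S_i \;&\sim\; \mathrm{Bernoulli}(\pi), \qquad Y_i = \mathbb{1}\{U_i > 0\}.
\end{align*}
```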
- 29:40However, I'm sure you all immediately are thinking,
- 29:43"Well, that's great,
- 29:45"but I don't have any information to estimate
- 29:47"some of these parameters for the non-selected cases."
- 29:51Clearly, for the selected cases, right?
- 29:53S equals one.
- 29:54I can estimate all of these things.
- 29:55But I can't estimate them for the non-selected sample
- 29:58because I might observe X-bar
- 30:01but I don't observe anything having to do with U.
- 30:03'Cause I have no Y information.
- 30:06So in order to identify this model
- 30:08and be able to come up with estimates
- 30:09for all of these parameters,
- 30:10we have to make an assumption about the selection mechanism.
- 30:13So we assume that the probability of selection
- 30:16into my sample is a function of U.
- 30:19So we're allowing it to be not ignorable.
- 30:21Remember, U is underlying Y; and X,
- 30:23that proxy, which is a function of Z,
- 30:25so that's observed; and V, those other variables.
- 30:30And in particular, we're assuming
- 30:31that it's this funny looking form of combination
- 30:34of X and U.
- 30:35That depends on this sensitivity parameter phi.
- 30:38So it's one minus phi times X
- 30:41plus phi times U.
- 30:43So that's essentially weighting
- 30:45the contributions of those two pieces.
- 30:47How much of selection is dependent
- 30:49on the thing that I observe
- 30:50or the proxy builds off the auxiliary variables
- 30:53and how much of it is depending on the underlying latent U
- 30:56related to Y,
- 30:57that is definitely not observed
- 30:58for the non-selected.
- 31:00Okay.
- 31:01And there's a little X star here,
- 31:02that's sort of a technical detail.
- 31:03We're rescaling the proxy.
- 31:05So it has the same variance as U,
- 31:07very unimportant mathematical detail.
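Written out, that identifying assumption says selection depends on the data only through the stated combination (g(·) is some unspecified function; X* is the rescaled proxy):

```latex
\Pr(S = 1 \mid U, X, V) \;=\; g\!\left((1-\varphi)\,X^{*} + \varphi\, U,\; V\right),
\qquad 0 \le \varphi \le 1 .
```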
- 31:10So we have this joint distribution
- 31:13that is conditional on selection status.
- 31:16And in addition to, we need that one assumption
- 31:19to identify things.
- 31:20We also have the latent variable problem.
- 31:22So latent variables do not have separately identifiable
- 31:24mean and variance, right?
- 31:26So that's just...
- 31:27Outside of the scope of this talk
- 31:29that's just a fact, right?
- 31:30So without loss of generality
- 31:31we're gonna set the variance of the latent variable
- 31:34for the selected sample equal to one.
- 31:35So it's just the scale of the latent variable.
- 31:38So what we actually care about is a function of U, right?
- 31:42It's the probability Y equals one marginally
- 31:45in my entire population.
- 31:46And so the probability Y equals one,
- 31:48is a probability U is greater than zero.
- 31:50That's that relationship.
- 31:51And so it's a weighted average of the proportion
- 31:55in the selected sample
- 31:56and the proportion in the non-selected sample, right?
- 32:00These are just...
- 32:01If U has this normal distribution
- 32:02this is how we get down to the probability
- 32:04U is greater than zero.
- 32:05Like those are those two pieces.
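So in symbols, using the talk's notation (with the selected group's latent variance fixed at one):

```latex
\Pr(Y = 1) \;=\; \pi\,\Phi\!\big(\mu_U^{(1)}\big)
   \;+\; (1-\pi)\,\Phi\!\Big(\mu_U^{(0)} \big/ \sqrt{\sigma_{UU}^{(0)}}\Big).
```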
- 32:08So the key parameter that governs
- 32:10how this MUBP works is a correlation, right?
- 32:14It's the strength of the relationship between Y
- 32:17and your covariates.
- 32:18How good of a model do you have for Y, right?
- 32:22So remember we think back to that example
- 32:24of what if I had no bias in Z,
- 32:26or if Y wasn't related to Z,
- 32:28well, then who cares whether there is bias in Z.
- 32:32But we want there to be a strong relationship
- 32:34between Z and Y so that we can kind of infer from Z to Y.
- 32:40So that correlation in this latent variable framework
- 32:43is called the biserial correlation of the binary X
- 32:46and the continuous.
- 32:47I mean, sorry, the binary Y and the continuous X, right?
- 32:50There's lots of different flavors of correlation,
- 32:53biserial is the name for this one
- 32:55that's a binary Y and a continuous X
- 32:57when we're thinking about the latent variable framework.
- 33:00Importantly, you can estimate
- 33:01this in the selected sample, right?
- 33:04So I can estimate the correlation between U and X
- 33:06among the selected sample.
- 33:07I can't for the non-selected sample,
- 33:09of course, but I can for the selected sample.
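For intuition, here is one textbook way to turn the observed point-biserial correlation into a biserial estimate under the normal-latent-variable assumption (an illustration, not necessarily the exact two-step estimator in the MUBP paper):

```python
import numpy as np
from scipy.stats import norm

def biserial(y, x):
    """Convert the point-biserial correlation of binary y with continuous x
    into the biserial correlation of the latent normal U with x."""
    p = y.mean()
    r_pb = np.corrcoef(y, x)[0, 1]
    return r_pb * np.sqrt(p * (1 - p)) / norm.pdf(norm.ppf(p))
```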
- 33:12So the non-identifiable parameters
- 33:14of that pattern-mixture model, here they are.
- 33:15Like the mean for the latent variable,
- 33:17the variance for the latent variable
- 33:19and that correlation for the non-selected sample
- 33:22are in fact identified when we make this assumption
- 33:24on the selection mechanism.
- 33:26So let's think about some concrete scenarios.
- 33:30What if phi was zero?
- 33:32If phi is zero,
- 33:33we look up here at this part of the formula,
- 33:35well, then U drops out of it.
- 33:38So therefore selection only depends on X
- 33:40and those extra variables V that don't really matter
- 33:43because V isn't related to X or Y.
- 33:46This is an ignorable selection mechanism, okay.
- 33:50If on the other hand phi is one,
- 33:52well, then it entirely depends on U.
- 33:54X doesn't matter at all.
- 33:55This is your worst, worst, worst case scenario, right?
- 33:58Where whether or not you're in my sample only depends
- 34:00on U and therefore only depends on the value of Y.
- 34:04And so this is extremely not ignorable selection.
- 34:07And of course the truth is likely to lie
- 34:10somewhere in between, right?
- 34:11Some sort of non-ignorable mechanism,
- 34:13a phi between zero and one, so that U matters
- 34:16but it's not the only thing that matters.
- 34:18Right, that X matters as well.
- 34:20Okay.
- 34:21So this is a kind of moderate,
- 34:22non-ignorable selection.
- 34:23That's most likely the closest to reality
- 34:26with these non-probability samples.
- 34:30So for a specified value of phi.
- 34:33So we pick a value for our sensitivity parameter.
- 34:35There's no information in the data about it.
- 34:36We just pick it and we can actually estimate the mean of Y
- 34:40and compare that to the selected sample proportion.
- 34:43So we take this select a sample proportion,
- 34:45subtract what we get as the truth
- 34:47for that particular value of phi,
- 34:50and that's our measure of bias, right?
- 34:52So this second piece that's being subtracted
- 34:54here depends on phi.
- 34:55Right, it depends on what the value
- 34:57of your sensitivity parameter is,
- 34:58the parameter governing your assumed selection mechanism.
- 35:01So in a nutshell, pick a selection mechanism
- 35:03by specifying phi,
- 35:06estimate the overall proportion,
- 35:07and then subtract to get your measure of bias.
- 35:10And again, we don't know whether we're getting
- 35:12the right answer because it's depending
- 35:14on the sensitivity parameter
- 35:15but it's at least going to allow us to bound the problem.
- 35:19So the formula is quite messy,
- 35:21but it gives some insight into how this index works.
- 35:24So this measure of bias is the selected sample
- 35:27mean minus that estimator, right?
- 35:29This is the overall mean of Y
- 35:32based on those latent variables.
- 35:34And what gets plugged in here
- 35:36importantly for the mean
- 35:37and the variance for the non-selected cases
- 35:39depends on a component that I've got colored blue here,
- 35:42and a component that I've got colored red.
- 35:44So if we look at the red piece
- 35:46this is the comparison of the proxy mean for the unselected
- 35:49and the selected cases.
- 35:50This is that bias in Z.
- 35:52The selection bias in Z,
- 35:54and it's just been standardized
- 35:55by its estimated variance, right?
- 35:57So that's how much selection bias
- 35:59was present in Z via X, right.
- 36:02via using it to predict Y in the probit regression.
- 36:05Similarly, down here, how different is the variance
- 36:08of the selected and unselected cases for X.
- 36:10How much bias, selection bias is there in estimating
- 36:13the variance?
- 36:14So we're going to use that difference
- 36:16and scale the observed mean, right?
- 36:19There's the estimated mean of U
- 36:22in the selected sample, and how much it's gonna shift
- 36:24by depends on the selection,
- 36:26I mean, the sensitivity parameter phi,
- 36:29and also that biserial correlation.
- 36:31So this is why the biserial correlation is so important.
- 36:34It is gonna dominate how much of the bias
- 36:37in X we're going to transfer over into Y.
- 36:42So if phi were zero,
- 36:44so if we wanna assume
- 36:45that it is an ignorable selection mechanism,
- 36:48then this thing in blue here,
- 36:50think about plugging zero here, zero here, zero everywhere,
- 36:52is just gonna reduce down to that correlation.
- 36:55So we're gonna shift the mean of U
- 36:56for the non-selected cases
- 36:59based on the correlation times that difference in X.
- 37:03Whereas if we have phi equals one,
- 37:06this thing in blue turns into one over the correlation.
- 37:10So here is where thinking about the magnitude
- 37:12of the correlation helps.
- 37:13If the correlation is really big, right?
- 37:15If the correlation is 0.8, 0.9,
- 37:17something really large, then phi and...
- 37:20I mean, sorry, then rho and one over rho
- 37:22are very close, right?
- 37:230.8 and 1/0.8 are pretty close.
- 37:26So if we're thinking about bounding this between phi
- 37:29equals zero and equals one,
- 37:30our interval is gonna be relatively small.
- 37:33But if the correlation is small,
- 37:35the correlation were 0.2, uh-oh, right?
- 37:37We're gonna get a really big interval
- 37:39because with that correlation,
- 37:40we're gonna shift with a factor of 0.2 at one extreme
- 37:43but then one over 0.2 at the other.
- 37:44That's gonna be a really big shift
- 37:46in that mean of the latent variable U
- 37:48and therefore the mean of Y.
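Putting those pieces together, here is a stripped-down sketch of the MUBP(phi) computation (a simplification of the estimator described in the talk; the inputs are summary statistics of the standardized proxy X*, and the blue factor from the slide is the quantity g below, which is rho at phi = 0 and 1/rho at phi = 1):

```python
from math import sqrt
from scipy.stats import norm

def mubp(phi, p_sel, rho, xbar_sel, xbar_non, var_x_sel, var_x_non, f_sel):
    """Sketch of the MUBP index for a chosen sensitivity parameter phi.
    Shifts the selected-sample latent mean by the proxy's selection bias,
    scaled by g(phi, rho), then mixes the two groups' proportions."""
    g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))
    mu_u_sel = norm.ppf(p_sel)                      # latent mean, var fixed at 1
    mu_u_non = mu_u_sel + g * (xbar_non - xbar_sel)
    var_u_non = 1.0 + g**2 * (var_x_non - var_x_sel)
    p_non = norm.cdf(mu_u_non / sqrt(var_u_non))
    p_overall = f_sel * p_sel + (1 - f_sel) * p_non
    return p_sel - p_overall                        # bias of the naive estimate

# Sensitivity analysis with made-up summary statistics:
for phi in (0.0, 0.5, 1.0):
    print(phi, round(mubp(phi, 0.466, 0.75, 0.0, -0.3, 1.0, 1.0, 0.8), 4))
```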
- 37:51So how do we get these estimates?
- 37:53We have two possibilities. We can use what we call
- 37:55modified maximum likelihood estimation.
- 37:58It's not true
- 37:58maximum likelihood, because we estimate
- 38:00the biserial correlation with something
- 38:02called a two-step method, right?
- 38:04So instead of doing a full maximum likelihood,
- 38:07we kind of take this cheat in which we set that mean of X
- 38:12for the selected cases equal to what we observe,
- 38:15and then condition on that to estimate
- 38:16the biserial correlation.
- 38:18Yeah.
- 38:19And as a sensitivity analysis we would plug in zero, one,
- 38:22and maybe 0.5 in the middle
- 38:23as the values of the sensitivity parameter.
- 38:26Alternatively, and what we feel is a much more attractive
- 38:29approach, is to be Bayesian about this.
- 38:31So in this MML estimation,
- 38:34we are implicitly assuming that we know the betas
- 38:38from that probit regression.
- 38:39That we're essentially treating X like we know it.
- 38:42But we don't know X, right?
- 38:44That probit regression,
- 38:45those parameters have error associated with them.
- 38:47Right?
- 38:48And you can imagine that the bigger your selected sample,
- 38:49the more precisely you're estimating those betas,
- 38:51that's not being reflected
- 38:53at all in the modified maximum likelihood.
- 38:56So instead we can be Bayesian.
- 38:57Put non-informative priors on all the identified parameters.
- 39:01That's gonna let those,
- 39:02the error in those betas be propagated.
- 39:05And so we'll incorporate that uncertainty.
- 39:07And we can actually, additionally put a prior on phi, right?
- 39:11So we could just say
- 39:12let's have it be uniform across zero one.
- 39:14Right?
- 39:15So we can see what does it look like if we in totality,
- 39:18if we assume that phi is somewhere evenly distributed
- 39:20across that interval.
- 39:22We've done other things as well.
- 39:23We've taken, like, discrete priors.
- 39:26Oh, let's put a point mass on 0.5 and one,
- 39:29or other values, right?
- 39:30You can do whatever you want for that prior.
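As a toy version of the uniform-prior idea, one can reuse the mubp sketch above and integrate over phi by simulation (this ignores the posterior uncertainty in the betas and rho that the full Bayesian fit propagates):

```python
import numpy as np

rng = np.random.default_rng(1)
draws = [mubp(phi, 0.466, 0.75, 0.0, -0.3, 1.0, 1.0, 0.8)
         for phi in rng.uniform(0.0, 1.0, size=5000)]
print(np.percentile(draws, [2.5, 50.0, 97.5]))  # crude interval for the bias
```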
- 39:33So let's go back to the example
- 39:35see what it looks like.
- 39:36If we have the proportion ever married for females
- 39:38on the left and males on the right,
- 39:40the true bias is the black dot.
- 39:43And so the black is the true bias.
- 39:45The little tiny diamond is the MUBP for 0.5.
- 39:50And so that's plugging in that average value.
- 39:52Some selection mechanism that depends on Y somewhat,
- 39:56somewhere in the middle.
- 39:57So we're actually coming pretty close.
- 39:58That happens to be pretty close.
- 40:00And the intervals in green
- 40:02are the modified maximum likelihood intervals
- 40:04from phi equals zero to phi equals one,
- 40:06and the Bayesian intervals are longer, right?
- 40:08Naturally.
- 40:09We're incorporating the uncertainty.
- 40:11Essentially these MUBP,
- 40:13modified maximum likelihood intervals are too short.
- 40:15And we admit that these are too short.
- 40:18If we plug in all zeros and all ones
- 40:21for that small proportion of my NSFG population
- 40:25that wasn't selected into the sample,
- 40:27we get huge bounds relative to our indicator.
- 40:31Right?
- 40:32So remember when I showed you that slide, that bounded,
- 40:34we know the bias has to be between these two values.
- 40:37That's what's going on here.
- 40:38That's what these two values are.
- 40:39But using the information in Z
- 40:41we're able to much more narrowly
- 40:43make an estimate of where our selection bias is.
- 40:46So we got much tighter bounds.
- 40:48An important fact here
- 40:49is that we have pretty good predictors.
- 40:50Our correlation, the biserial correlation
- 40:53is about 0.7 or 0.8.
- 40:54So these things are pretty correlated
- 40:56with whether you've been married, age, education, right?
- 40:59Those things are pretty correlated.
- 41:01Another variable in the NSFG is income.
- 41:04So we can think about an indicator for having low income.
- 41:08Well, as it turns out those variables
- 41:10we have on everybody; age, education, gender,
- 41:14those things are not actually that good of predictors,
- 41:16of low income, very low correlation.
- 41:19So our index reflects that.
- 41:21You get much wider intervals,
- 41:23sort of closer to the Manski bounds.
- 41:26And in fact, it's exactly returning one of those bounds.
- 41:29The filling in all zeros bound is returned by this index.
- 41:33So that's actually an attractive feature.
- 41:35Right?
- 41:36We're sort of bounded at the worst possible case
- 41:38on one end of the bias
- 41:40but we are still capturing the truth.
- 41:42The Manski bounds are basically useless,
- 41:44right in this particular case.
- 41:47So that's a toy example.
- 41:50Just gonna quickly show you a real example,
- 41:53and I'm actually gonna skip
- 41:54over the incentive experiment,
- 41:55which well, very, very interesting
- 41:57is there's a lot to talk about,
- 41:59and I'd rather jump straight to the presidential polls.
- 42:03So there's very much in the news now,
- 42:08and over the past several years,
- 42:08this idea of failure of political polling
- 42:11and this recent high profile failure
- 42:12of pre-election polls in the US.
- 42:15So polls are probability samples
- 42:18but they have very, very, very low response rates.
- 42:20I don't know how much you know about how they're done,
- 42:21but they have very, very low response rates.
- 42:23But think about what we're getting at in a poll,
- 42:25a binary variable, are you going to vote for Donald Trump?
- 42:28Yes or no?
- 42:29Are you gonna vote for Joe Biden?
- 42:31Yes or no?
- 42:31These binary variables.
- 42:32We want to estimate proportions.
- 42:34That's what political polls aimed to do.
- 42:36Pre-election polls.
- 42:37So we have these political polls with these failures.
- 42:41So we're thinking, maybe it's a selection bias problem.
- 42:44And that there is some of this, that people
- 42:45are entering into this poll differentially,
- 42:49depending on who they're going to vote for.
- 42:52So think of it this way,
- 42:53and I'm gonna use Trump as the example
- 42:54'cause we're going to estimate,
- 42:55I'm gonna try to estimate
- 42:56the proportion of people who will vote
- 42:57for Former President Trump in the 2020 election.
- 43:02So, might Trump supporters
- 43:04just inherently be less likely to answer the call, right?
- 43:07To answer that poll or to refuse to answer the question
- 43:11even conditional demographic characteristics, right?
- 43:13So two people who otherwise look the same
- 43:16with respect to those Z variables, age, race, education,
- 43:20the one who's the Trump supporter, someone might argue,
- 43:22you might be more suspicious of the government
- 43:24and the polls, and not want to answer
- 43:26and not come into this poll, not be selected.
- 43:28That would be selection depending on Y.
- 43:31So the MUBP could be used to try to adjust poll estimates.
- 43:35Say, well, there's your estimate from the poll
- 43:38but what if selection were not ignorable?
- 43:40How different would our estimate
- 43:42of the proportion voting for Trump be?
- 43:45So in this example, our proportion of interest
- 43:48is the percent of people who are gonna vote for Trump.
- 43:51The sample that we used
- 43:53are publicly available data
- 43:54from seven different pre-election polls
- 43:56conducted in seven different states by ABC in 2020.
- 44:01And the way these polls work
- 44:03is it's a random digit dialing survey.
- 44:05So that's literally randomly dialing phone numbers.
- 44:08Many of whom get
- 44:09thrown out 'cause they're businesses, et cetera,
- 44:10very, very low response rates, 10% or lower.
- 44:13Very, very, very low response rates to these kinds of polls.
- 44:17They do, however, try to do some weighting.
- 44:19So it's not as if they just take that sample and say,
- 44:21there we go let's estimate the proportion for Trump.
- 44:23They do weighting adjustments
- 44:25and use what's called iterative proportional fitting
- 44:28or raking to get the distribution of key variables
- 44:33in the sample to look like the population.
- 44:36So they use census margins for, again,
- 44:38it's gender as binary, unfortunately,
- 44:40age, education, race, ethnicity, and party identification.
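For readers who haven't seen raking, here's a toy version of that margin-matching idea (my own illustration, not ABC's production weighting):

```python
import numpy as np

def rake(weights, groups, targets, iters=50):
    """Toy iterative proportional fitting (raking): repeatedly scale the
    weights so each variable's weighted margins match population targets.
    `groups` maps a variable name to a per-respondent category array;
    `targets` maps it to a dict of {category: population share}."""
    w = weights.astype(float).copy()
    for _ in range(iters):
        for var, cats in groups.items():
            for cat, share in targets[var].items():
                mask = cats == cat
                current = w[mask].sum() / w.sum()
                if current > 0:
                    w[mask] *= share / current
    return w

# Hypothetical margins: two raking variables on five respondents.
groups = {"gender": np.array(["f", "f", "m", "m", "m"]),
          "party":  np.array(["d", "r", "d", "r", "r"])}
targets = {"gender": {"f": 0.5, "m": 0.5},
           "party":  {"d": 0.5, "r": 0.5}}
print(rake(np.ones(5), groups, targets).round(3))
```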
- 44:45So, because we're doing this after the election
- 44:47we know the truth.
- 44:48We have access to the true official election outcomes
- 44:50in each state.
- 44:51So I know the actual proportion of Y.
- 44:54And my population is likely voters,
- 44:57because that's who we're trying to target
- 44:58with these pre-election polls.
- 44:59We wanna know what's the estimated proportion
- 45:02that would vote for Trump among the likely voters.
- 45:05So the tricky thing is that for that population
- 45:07it's hard to come by summary statistics.
- 45:10Likely voters, right?
- 45:11It's easy to get summary statistics from all people
- 45:13in the US or all people of voting age in the US
- 45:16but not likely voters.
- 45:18So here Y is an indicator for voting for Trump.
- 45:21Z is the set of auxiliary variables in the ABC poll.
- 45:24So all those variables I mentioned
- 45:25before gender age, et cetera.
- 45:27We actually have very strong predictors
- 45:29of Y, basically because of these political ideation,
- 45:32party identification variables, right?
- 45:34Not surprisingly the people who identify as Democrats,
- 45:37very unlikely to be voting for Trump.
- 45:41The data set that we found that can give us population level
- 45:44estimates of the mean of Z for the non-selected sample
- 45:48is a dataset from AP/NORC.
- 45:50It's called their VoteCast Data.
- 45:52And they conduct these large surveys
- 45:55and provide an indicator of likely voter.
- 45:58So we can basically use this dataset
- 46:00to describe the demographic characteristics
- 46:02of likely voters,
- 46:04instead of just all people who are 18 and older in the US.
- 46:09The subtle issue is of course,
- 46:10these AP VoteCast data are not without error,
- 46:13but we're going to pretend that they are without error.
- 46:15And that's like a whole other papers.
- 46:17How do we handle the fact
- 46:17that my population data have error?
- 46:19So we're gonna use the unweighted ABC poll data
- 46:23as the selected sample and estimate the MUBP
- 46:26with the Bayesian approach with phi
- 46:27from the uniform distribution.
- 46:29The poll selection fraction is very, very, very small.
- 46:32Right, these polls in each state
- 46:34have about a thousand people in them
- 46:36but we've got millions of voters in each state.
- 46:38So the selection fraction is very, very, very small,
- 46:40total opposite of the smartphone example.
- 46:43So we'll just jump straight into the answer,
- 46:46did it work?
- 46:47Right, this is really exciting.
- 46:48So the red circle is the true proportion,
- 46:52oh, sorry, the true bias,
- 46:53this should say bias down here.
- 46:55In each of the states.
- 46:56So these are the seven states
- 46:57we looked at Arizona, Florida, Michigan, Minnesota,
- 46:59North Carolina, Pennsylvania, and Wisconsin.
- 47:02So this horizontal line here at zero that's no bias, right?
- 47:06So it's estimated, the ABC poll estimate
- 47:08would have no bias.
- 47:09And we can see that in Arizona we sort of overestimated
- 47:13and in the rest of the states
- 47:14we underestimated the support for Trump.
- 47:16And so that was really the failure was the underestimation
- 47:19of the support for Trump.
- 47:20Notice that our Bayesian bounds
- 47:24cover the true bias everywhere except
- 47:26in Pennsylvania and Wisconsin.
- 47:28And so Wisconsin had an enormous bias;
- 47:30the polls way under-called the support for Trump
- 47:33in Wisconsin by 10 percentage points.
- 47:34Huge problem.
- 47:35So we're not getting there
- 47:37but notice that zero is not in our interval.
- 47:40So our bounds are suggesting
- 47:43that there was a negative bias from the poll.
- 47:46So even though we didn't capture the truth,
- 47:48we've at least crossed the threshold
- 47:49saying it's very likely that you are under-calling
- 47:52the support for Trump.
- 47:55So how do estimates using the MUBP compare to the ABC poll?
- 47:59Well, we can use the MUBP bounds to basically shift
- 48:03the ABC poll estimates.
- 48:05So we're calling those MUBP adjusted, right?
- 48:08So we've got the truth is...
- 48:10The true proportion who voted for Trump
- 48:12are now these red triangles
- 48:14and then the black circles are the point estimates
- 48:17from three different methods of estimation:
- 48:21the unweighted estimate from the poll,
- 48:25the weighted estimate from the poll,
- 48:28and the last one, adjusted by our measure
- 48:30of non-ignorable selection bias, the MUBP-adjusted estimate.
- 48:32So we can see that in some cases
- 48:35our adjustment and the polls are pretty similar, right?
- 48:39But look at, for example, Wisconsin,
- 48:41all the way over here on the right.
- 48:42So again, remember I said we didn't cover the truth,
- 48:44we didn't cover the true bias,
- 48:46but our indicator is the only one, right,
- 48:49that's got that much higher shift up towards Trump.
- 48:52So this is us saying, well,
- 48:53if there were an underlying selection mechanism
- 48:57saying that Trump supporters
- 48:59were inherently less likely to enter this poll,
- 49:03this is what would happen.
- 49:04Or this is what your estimated support for Trump would be.
- 49:07It's shifted up.
- 49:09We've got a similar sort of success story,
- 49:11I'll say, in Minnesota,
- 49:12where both of the ABC estimators did not cover the truth
- 49:16in these pre-election polls but ours did, right?
- 49:18We were able to sort of shift up and say,
- 49:21look, if there were selection bias
- 49:22that depended on whether or not you supported Trump,
- 49:25we would have captured that.
- 49:27So the important idea here is, you know,
- 49:29before the election, we wouldn't have these red triangles.
- 49:34But it's important to be able to see
- 49:36that this is saying you're under-calling
- 49:39the support for Trump
- 49:40if there were non-ignorable selection, right?
- 49:42So it's that idea of a sensitivity analysis:
- 49:44how bad would we be doing?
- 49:46And what we would say is in Minnesota and Wisconsin
- 49:49we'd be very worried
- 49:50about under-calling the support for Trump.
- 49:56So what have I just shown you?
- 49:59I'll summarize.
- 50:01The MUBP is a sensitivity analysis tool
- 50:04to assess the potential for non-ignorable selection bias.
- 50:08If we have phi equals zero, ignorable selection,
- 50:12we can adjust that away via weighting
- 50:14or some other method, right?
- 50:16If it is ignorable,
- 50:18we can ignore the selection mechanism.
- 50:21On the other extreme, if phi is one,
- 50:23totally non-ignorable,
- 50:24selection depends only on the outcome
- 50:26we're trying to measure.
- 50:28Somewhere in between we've got phi equals 0.5:
- 50:30if you really needed a point estimate
- 50:32of the bias, that would be the one to use.
- 50:34And in fact, that's what this black dot is.
- 50:37That's the adjustment at 0.5 for our adjusted estimator.
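Evaluating the same simplified interpolation sketched earlier at the three landmark values of phi makes this summary concrete; the inputs below are invented for illustration, not real poll numbers.

```python
import numpy as np
from scipy.stats import norm

# Invented values: raw proportion, proxy strength, proxy shift.
ybar_s, rho, d = 0.47, 0.6, 0.3
for phi in (0.0, 0.5, 1.0):
    # phi = 0 gives g = rho; phi = 0.5 gives g = 1; phi = 1 gives g = 1/rho.
    g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))
    p_adj = norm.cdf(norm.ppf(ybar_s) + g * d)
    print(f"phi={phi:.1f}: adjusted proportion {p_adj:.3f}, "
          f"implied bias {ybar_s - p_adj:+.3f}")
```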
- 50:41This MUBP is tailored to binary outcomes,
- 50:45and it is an improvement over the normal-based SMUB.
- 50:48I didn't show you the results
- 50:49from the simulations, but they basically show
- 50:52that if you use the normal method on a binary outcome
- 50:55you get these huge bounds.
- 50:56You go outside of the Manski bounds, right?
- 50:58'Cause it's not properly bounded between zero and one,
- 51:01or your proportion isn't properly bounded.
- 51:03And importantly, our measure only requires
- 51:06summary statistics for Z,
- 51:08for the population or for the non-selected sample.
- 51:11So I don't have to have a whole separate data set
- 51:14where I have everybody who didn't get selected
- 51:16into my sample,
- 51:16I just need to know the average of these covariates, right,
- 51:20I just need to know Z-bar in order to get my average
- 51:23proxy for the non-selected.
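Here is a one-line sketch of why Z-bar suffices, with invented numbers: the proxy is linear in Z on the probit scale, so its mean over the non-selected units is just their covariate means pushed through the fitted coefficients.

```python
import numpy as np

# Invented probit coefficients fit to Y ~ Z in the selected sample.
beta_hat = np.array([0.4, -1.2, 0.8])
# Invented covariate means (Z-bar) for the non-selected units,
# e.g. taken from published summary statistics.
zbar_ns = np.array([0.51, 0.35, 0.42])
# Linearity of the proxy means the covariate means are all we need.
xbar_ns = zbar_ns @ beta_hat
print(xbar_ns)
```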
- 51:26With weak information,
- 51:27so if my model is poor then my Manski bounds
- 51:30are gonna be what's returned.
- 51:32So that's a good feature of this index:
- 51:34it is naturally bounded,
- 51:36unlike the normal-model version.
- 51:38And we have done additional work to move
- 51:41beyond just estimating means and proportions
- 51:43into linear regression and probit regression.
- 51:46So we have indices of selection bias
- 51:48for regression coefficients.
- 51:50So instead of wanting to know the mean of Y
- 51:53or the proportion with Y equals one,
- 51:55what if you wanted to do a regression of Y
- 51:57on some covariates?
- 51:59So we have a paper out in the Annals of Applied Statistics
- 52:02that extends this to regression coefficients.
- 52:05So I believe I'm pretty much right on the time
- 52:07I was supposed to end, so I'll say thank you, everyone.
- 52:09And I'm happy to take questions.
- 52:11I'll put up my references
- 52:12in my teeny, tiny fonts, yes.
- 52:20Robert Does anybody have any questions?
- 52:26From the room?
- 52:33So.
- 52:36Dr. Rebecca Let me stop my share.
- 52:38Student Hey.
- 52:40I have a very basic one,
- 52:41mostly more of curiosity (indistinct)
- 52:44Sure, sure.
- 52:45What is it that caused the...
- 52:50We know after the fact, in your example,
- 52:54what the direction of the bias was,
- 52:57but why is it that it only shifted in the Trump direction?
- 53:03Why?
- 53:03You don't know in advance if something is more likely
- 53:06or less likely?
- 53:08Okay.
- 53:09So excellent question.
- 53:09So that is effectively...
- 53:11The direction of the shift is going to match...
- 53:15The direction of the shift in the mean of Y,
- 53:17or in the proportion, is going to match
- 53:18the shift in X, right?
- 53:20So if the mean you get for your proxy
- 53:25in the non-selected sample is bigger
- 53:28than in your selected sample,
- 53:30then your proportion is gonna get shifted
- 53:31in that direction.
- 53:32Right.
- 53:33It's only ever going to shift it to match the bias in X.
- 53:37Right?
- 53:37And so then, which way that shifts Y
- 53:39depends on what the relationship
- 53:41is between the covariates Z and X in the probit regression.
- 53:46But it will always shift it in a particular direction.
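A toy check of that direction claim, using the same simplified interpolation as above with invented numbers: because the phi multiplier is always positive, flipping the sign of the proxy shift flips the direction of the adjusted proportion.

```python
import numpy as np
from scipy.stats import norm

ybar_s, rho, phi = 0.47, 0.6, 0.5  # invented values
g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))  # always positive
for d in (+0.3, -0.3):  # proxy shift: non-selected minus selected
    p_adj = norm.cdf(norm.ppf(ybar_s) + g * d)
    print(f"d={d:+.1f}: adjusted {p_adj:.3f} "
          f"({'up' if p_adj > ybar_s else 'down'} from {ybar_s})")
```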
- 53:49I will note, and I fully admit,
- 53:52our index actually shifted the wrong direction
- 53:55in one particular case.
- 53:57Right?
- 53:57So actually in Florida,
- 54:00we actually shifted down when we shouldn't have.
- 54:02Right.
- 54:03So here the estimate is shifting down,
- 54:05but actually the truth is higher.
- 54:07So we're not always getting it right;
- 54:09we're getting it right when that X is shifting
- 54:13in the correct direction.
- 54:14Right?
- 54:15So it isn't true that we always...
- 54:17It's true that it always shifts in the direction of X,
- 54:19but it's not a hundred percent true that X
- 54:22always shifts in the exact same way as Y.
- 54:24Just most of the time.
- 54:25There was evidence of underestimating the Trump support,
- 54:29and that was in fact reflected in that probit regression,
- 54:32right, in that relationship.
- 54:33The people who replied to the poll were older,
- 54:36they were higher educated, right?
- 54:39And so those older,
- 54:40higher educated people in aggregate
- 54:43were less likely to vote for Trump.
- 54:45So that's why we ended up under-calling the support
- 54:48for Trump when we don't account
- 54:49for that potential non-ignorable selection bias.
- 54:52Good question though.
- 54:55Robert Got it, thank you.
- 54:56Any other questions (indistinct)
- 55:09Anybody?
- 55:16I know I talk fast and that was a lot of stuff
- 55:19so, you know, I get it.
- 55:21(indistinct)
- 55:23Alright.
- 55:24Well, Dr. Andridge, thank you again.
- 55:26And.
- 55:27(students clapping)
- 55:33Thank you.
- 55:34Thank you for having me.
- 55:35Robert Yeah.