YSPH Biostatistics Seminar: "Measures of Selection Bias for Proportions Estimated from Non-Probability Samples"
November 16, 2021
Rebecca Andridge, PhD, Associate Professor, Department of Biostatistics, The Ohio State University
- 00:00Hi.
- 00:01Hi everybody.
- 00:02(students) Hi.
- 00:03It's my pleasure today
- 00:03to introduce Professor Rebecca Andridge.
- 00:06Professor Andridge has a Bachelor's in Economics from Stanford
- 00:10and her Master's and PhD in Biostatistics
- 00:13from the University of Michigan.
- 00:16She's an expert in group randomized trials
- 00:18and methods for missing data,
- 00:19especially for that ever-so-tricky case
- 00:23where data is missing not at random.
- 00:26She's been faculty in Biostatistics at The Ohio State University
- 00:29since 2009.
- 00:31She's an award-winning educator
- 00:32and a 2020 Fellow of the American Statistical Association,
- 00:36and we're very honored to have her here today.
- 00:38Let's welcome professor Andridge.
- 00:40(students clapping)
- 00:43Thank you for the very generous introduction.
- 00:46I have to tell you,
- 00:47it's so exciting to see a room full of students.
- 00:51I am currently teaching an online class
- 00:52and the students don't all congregate in a room.
- 00:54So it's like been years since I've seen this.
- 00:58So I'm of course gonna share my slides.
- 01:01I want to warn everybody that I am working from home today.
- 01:06And while we will not be interrupted by my children
- 01:09we might be interrupted or I might be interrupted
- 01:11by the construction going on in my house,
- 01:13my cats or my fellow work-from-home husband.
- 01:16So I'm gonna try to keep the distractions to a minimum
- 01:18but that is the way of the world in 2021,
- 01:22in the pandemic life.
- 01:24So today I'm gonna be talking about some work
- 01:26I've done with some colleagues
- 01:27actually at the University of Michigan.
- 01:29Talking about selection bias
- 01:31in proportions estimated from non-probability samples.
- 01:36So I'm gonna start with some background and definitions
- 01:38and we'll start with kind of overview
- 01:40of what's the problem we're trying to address.
- 01:43So big data are everywhere, right?
- 01:45We all have heard that phrase being bandied about, big data.
- 01:49They're everywhere and they're cheap.
- 01:50You got Twitter data, internet search data, online surveys,
- 01:53things like predicting the flu using Instagram, right?
- 01:56All these massive sources of data.
- 01:59And these data often, I would say pretty much always,
- 02:03arise from what are called non-probability samples.
- 02:07So when we have a non-probability sample
- 02:08we can't use what are called design based methods
- 02:11for inference,
- 02:11you actually have to use model based approaches.
- 02:14So I'm not gonna assume that everybody knows
- 02:16all these words that I've thrown out here,
- 02:18so I'm gonna go into some definitions.
- 02:22So our goal is to develop an index of selection bias
- 02:25that lets us get at how bad the problem might be,
- 02:28how much bias might we have due to non-random selection
- 02:32into our sample?
- 02:34So a probability sample is a situation
- 02:38where you're collecting data
- 02:39where each unit in the population
- 02:41has a known positive probability of selection.
- 02:44And randomness is involved in the selection of which units
- 02:47come into the sample, right?
- 02:49So this is your stereotypical complex survey design
- 02:53or your sample survey.
- 02:55Large government sponsored surveys
- 02:57like the National Health and Nutrition Examination Survey,
- 03:00NHANES or NHIS or any number of large surveys
- 03:04that you've probably come across,
- 03:06you know, in application and your biostatistics courses.
- 03:09So for these large surveys
- 03:11we do what's called design-based inference.
- 03:14So that's where we rely on the design
- 03:16of the data collection mechanism
- 03:18in order for us to get unbiased estimates
- 03:20of population quantities,
- 03:21and we can do this without making any model assumptions.
- 03:24So we don't have to assume
- 03:26that let's say body mass index has a normal distribution.
- 03:29We literally don't have to specify distribution at all.
- 03:32It's all about the random selection into the sample
- 03:35that lets us get our estimates
- 03:36and be assured that we have unbiased estimates.
- 03:40So here's an example in case there are folks
- 03:43out in the audience who don't have experience
- 03:45with the sort of complex survey design or design features.
- 03:48So this is a really silly little example
- 03:49of a stratified sample.
- 03:51So here I have a population
- 03:53of two different types of animals.
- 03:55I have cats and I have dogs.
- 03:57And in this population I happen to have 12 cats and 8 dogs.
- 04:01And I have taken a stratified sample
- 04:03where I took two cats and two dogs.
- 04:07So in this design the selection probabilities
- 04:09are known for all of the units, right?
- 04:11Because I know that there's a two out of eight chance
- 04:14I pick a dog and a two out of 12 chance
- 04:16that I pick a cat, right?
- 04:18So the probability a cat is selected is 1/6
- 04:21then the probability of dog is selected is 1/4.
- 04:23Now, how do I estimate a proportion of interest?
- 04:26Let's say it's the proportion of orange animals
- 04:28in the population.
- 04:29Like here in my sample,
- 04:30I have one of four orange animals,
- 04:32but if I chose that as my estimator
- 04:34I'd be ignoring the fact that I know how I selected
- 04:37these animals into my sample.
- 04:39So what we do is we weight the sample units
- 04:42to produce design-unbiased estimates, right?
- 04:44Because this one dog kinda counts
- 04:48differently than one cat, right?
- 04:50Because there were only eight dogs
- 04:51to begin with but there were 12 cats.
- 04:54So if I want to estimate the proportion of orange animals
- 04:57I would say this cat has a weight of six
- 05:00because there's two of them and 12 total.
- 05:02So 12 divided by two is six.
- 05:04So there's six in the numerator.
- 05:06And then the denominator is the sum of the weights
- 05:08of all the selected units,
- 05:10the cats are each six and the dogs are each four.
- 05:12So I actually get an estimated proportion of 30%
- 05:15instead of 25%.
- 05:17So that kind of weighted estimator
- 05:18is what we do in probability sampling.
- 05:20And we don't have to say what the distribution
- 05:22of dogs or cats is in the sample
- 05:24or orangeness in the sample,
- 05:26we entirely rely on the selection mechanism.
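To make the arithmetic concrete, here is a minimal sketch of that design-weighted estimator in Python, using the cat-and-dog counts from the example (the variable names are mine, not from the talk):

```python
# Design-weighted estimate of the proportion of orange animals from the
# stratified sample above: 2 of 12 cats and 2 of 8 dogs were sampled.
sample = [
    (1, 12 / 2),  # the orange cat, weight 6
    (0, 12 / 2),  # the other sampled cat, weight 6
    (0, 8 / 2),   # first sampled dog, weight 4
    (0, 8 / 2),   # second sampled dog, weight 4
]

numerator = sum(y * w for y, w in sample)    # 6
denominator = sum(w for _, w in sample)      # 20
print(numerator / denominator)               # 0.30, versus the naive 1/4 = 0.25
```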
- 05:30What ended up happening in the real world
- 05:32a lot of the time is we don't actually get to use
- 05:35those kinds of complex designs.
- 05:36And instead we collect data
- 05:38through what's called a non-probability sample.
- 05:40So in a non-probability sample,
- 05:42it's pretty easy to define.
- 05:43You cannot calculate the probability of selection
- 05:46into the sample, right?
- 05:47So we simply don't know what the mechanism
- 05:49was that made a unit enter our sample.
- 05:53I know there's the biostatistics students in the audience,
- 05:55and you've all probably done a lot of data analysis.
- 05:57And I would venture a guess that a lot of the times
- 06:00your application datasets
- 06:01are non-probability samples, right?
- 06:03A lot of the times there are convenience samples.
- 06:05I work a lot with biomedical researchers
- 06:07studying cancer patients.
- 06:08Well guess what, it's almost always a convenience sample
- 06:12of cancer patients, right?
- 06:13It's who will agree to be in the study?
- 06:15Who can I find to be in my study?
- 06:17Other types of non-probability samples
- 06:19include things like voluntary or self-selection sampling,
- 06:22quota sampling, that's a really old,
- 06:24old school method from polling back many years ago.
- 06:28Judgment sampling or snowball sampling.
- 06:30So there's a lot of different ways
- 06:31you can get non-probability samples.
- 06:34So if we go back to the dog and cat example,
- 06:37if I didn't know anything about how these animals
- 06:39got into my sample and I just saw the four of them,
- 06:41and one of them was orange,
- 06:43I guess, I'm gonna guess 25% of my population is orange.
- 06:48Right?
- 06:49I don't have any other information
- 06:50I can't recreate the population
- 06:53like I could with the weighting.
- 06:54Where I knew how many cats in the population
- 06:57did each of my sampled cats represent
- 06:59and similarly for the dogs.
- 07:01So of course our best guess looking at these data
- 07:03would just be 25%, right?
- 07:05One out of the four animals is orange.
- 07:07So when you think about a non-probability sample,
- 07:10how much faith do you put in that estimate,
- 07:12that proportion?
- 07:15Hard to say, right?
- 07:16It depends on what you believe about the population
- 07:19and how you selected this non-probability sample
- 07:23but you do not have the safety net of the probability sample
- 07:26that guaranteed you're gonna get an unbiased estimate
- 07:28over repeated applications of the sampling.
- 07:32So I've already used the word selection bias
- 07:34a lot and have sort of been assuming that you know what I mean.
- 07:37So now I'm gonna come back to it and define it.
- 07:40So selection bias is bias arising
- 07:42when part of the target population
- 07:45is not in the sample population, right?
- 07:47So when there's a mismatch between who got into your sample
- 07:49and who was supposed to get into your sample, right?
- 07:51Who's the population?
- 07:53Or in a more general statistical kind of way,
- 07:56when some population units are sampled at a different rate
- 07:59than you meant.
- 08:00It's like you meant for there to be a certain selection
- 08:03probability for orange animals or for dogs
- 08:06but it didn't actually end up that way.
- 08:08That will lead you down the path of selection bias.
- 08:11And I will note that again, as you are biostats students
- 08:13you've probably had some epidemiology.
- 08:15And epidemiologists talk about selection bias as well.
- 08:17It's the same concept, right?
- 08:19That concept of who is ending up in your sample.
- 08:22And is there some sort of a bias in the mechanism?
- 08:26So selection bias is in fact the predominant
- 08:28concern with non-probability samples.
- 08:30In these non-probability samples,
- 08:32the units in the sample might be really different
- 08:36from the units not in the sample,
- 08:37but we can't tell how different they are.
- 08:40Whether we're talking about people, dogs, cats, hospitals,
- 08:43whatever we're talking about.
- 08:44However, these units got into my sample, I don't know.
- 08:47So I don't know if the people in my sample
- 08:49look like my population or not.
- 08:53And an important key thing to know
- 08:55is that probability samples
- 08:57when we have a low response rates, right?
- 08:59So when there are a lot of people not responding
- 09:01you're basically ending up back
- 09:03at a non-probability sample, right?
- 09:05Where we have this beautiful design,
- 09:07we know everybody's sampling weight, we draw a sample,
- 09:10oops, but then only 30% of people respond to my sample.
- 09:14You're basically injecting that bias back in again.
- 09:16Sort of undoing the beauty of the probability sample.
- 09:21So when we think about a selection
- 09:23bias and selection into a sample,
- 09:25we can categorize them in two ways.
- 09:28And Dr. McDougal, actually,
- 09:30when he was giving you my brief little bio
- 09:32used the words that I'm sure you've used
- 09:34in your classes before like ignorable and non-ignorable.
- 09:37These are usually are more commonly applied
- 09:39to missingness, right?
- 09:41So ignorable missingness mechanisms
- 09:43and non-ignorable missingness mechanisms.
- 09:45Missing at random, missing completely at random
- 09:48or missing not at random, right?
- 09:50Same exact framework here.
- 09:52But instead of talking about missingness
- 09:54we're talking about selection into the sample.
- 09:56So when we have an ignorable selection mechanism,
- 09:59that means the probability of selection
- 10:01depends on things I observed.
- 10:02Right, it depends on the observed characteristics.
- 10:05When I have a non-ignorable selection mechanism
- 10:08now that probability of selection depends
- 10:10on unobserved characteristics.
- 10:12Again, this is not really a new concept
- 10:14if you've learned about missing data,
- 10:15just apply to selection into the sample.
- 10:20So in a probability sample
- 10:22we might have different probabilities of selection
- 10:24for different types of units like for cats versus for dogs.
- 10:28But we know exactly how they differ, right?
- 10:31It's because I designed my survey
- 10:33based on this characteristic of dog versus cat
- 10:36and I know exactly the status of dog versus cat
- 10:38for my entire population in order to do that selection.
- 10:42So I absolutely can estimate the proportion of orange
- 10:45animals unbiasedly in the sense of taking repeated
- 10:49stratified samples and estimating that proportion.
- 10:52I am guaranteed that I'm gonna get an unbiased
- 10:54estimate, right?
- 10:55So this selection mechanism
- 10:57is definitely not non-ignorable, right?
- 11:00This is definitely an ignorable selection mechanism
- 11:02in the sense that it only depends
- 11:04on observed characteristics.
- 11:06But if my four animals had just come from,
- 11:09I don't know where?
- 11:10Convenience.
- 11:11Well now why did they end up in my sample?
- 11:14It could depend on something that we didn't observe.
- 11:16What breed of dog it was?
- 11:18The age of the dog, the color of the dog.
- 11:20It could have been pretty much anything, right?
- 11:22That's the problem with the convenience sample.
- 11:24You don't know why those units
- 11:25often self-selected to be into your sample.
- 11:29So now I'm gonna head into the kind of ugly statistical
- 11:32notation portion of this talk.
- 11:35So we'll start with estimated proportions.
- 11:37So we'll use Y as our binary indicator
- 11:41for the outcome, okay?
- 11:43But here I'm gonna talk about Y
- 11:45more generally as all the survey data.
- 11:49So we'll start with Y as all the survey data,
- 11:50then we're gonna narrow it down to Y
- 11:51as the binary indicator.
- 11:53So we can partition our survey data into the data
- 11:57for the units we got in the sample
- 11:58and the data for units that are not in the sample.
- 12:01So, selected into the sample versus
- 12:03not selected into the sample.
- 12:05But for everybody I have Z,
- 12:07I have some fully observed
- 12:09what are often called design variables.
- 12:11So this is where we are using information
- 12:14that we know about an entire population
- 12:16to select our sample in the world of probability sampling.
- 12:20And then S is the selection indicator.
- 12:23So these three variables have a joint distribution.
- 12:26And most of the time,
- 12:27what we care about is Y given Z.
- 12:30Right, we're interested in estimating
- 12:32some outcome characteristic
- 12:34conditional on some other characteristic, right?
- 12:37Average weight for dogs, average weight for cats, right?
- 12:40Y given Z.
- 12:42But Y given Z is only part of the issue,
- 12:45there's also a selection mechanism, right?
- 12:48So there's also this function
- 12:49of how do you predict selection S with Y and Z.
- 12:53And I'm using this additional Greek letter psi here
- 12:56to denote additional variables
- 12:58that might be involved, right?
- 13:00'Cause selection could depend on more than just Y and Z.
- 13:03It could depend on something outside
- 13:04of that set of variables.
- 13:07So when we have probability sampling,
- 13:08we have what's called
- 13:09an extremely ignorable selection mechanism,
- 13:12which means selection can depend on Z,
- 13:14like when we stratified on animal type
- 13:16but it cannot depend on Y.
- 13:18Either the selected units Y or the excluded units Y
- 13:22doesn't depend on either.
- 13:24Kind of vaguely like the MCAR of selection mechanisms.
- 13:27It doesn't depend on Y at all.
- 13:29Observed or unobserved.
- 13:31But it can depend on Z.
- 13:31So that makes it different than MCAR.
- 13:34So including a unit into the sample
- 13:36is independent of those survey outcomes Y
- 13:39and also any unobserved variables, right?
- 13:41That psi here, that psi goes away.
- 13:44So selection only depends on Z.
- 13:46So if I'm interested in this inference target
- 13:49I can ignore the selection mechanism.
- 13:51So this is kind of parallels that idea
- 13:54in the missingness, in the missing data literature, right?
- 13:56If I have an ignorable missingness mechanism
- 13:59I can ignore that part of it.
- 14:00I don't have to worry about modeling
- 14:02the probability that a unit is selected.
- 14:05But the bad news is, in our non-probability sampling,
- 14:08it's very, very arguably true
- 14:11that you could have non-ignorable selection, right?
- 14:13It's easy to make an argument for why the people
- 14:16who ended up into your sample,
- 14:18your convenience sample are different than the people
- 14:20who don't enter your sample.
- 14:22Think about some of these big data examples.
- 14:24Think about Twitter data.
- 14:26Well, I mean, you know,
- 14:27the people who use Twitter are different
- 14:29than the people who don't use Twitter, right?
- 14:31In lots of different ways.
- 14:32So if you're going to think about drawing
- 14:34some kind of inference about the population,
- 14:36you can't just ignore that selection mechanism.
- 14:39You need to think about how do they enter
- 14:41into your Twitter sample
- 14:42and how might they be different than the people
- 14:44who did not enter into your Twitter sample.
- 14:47So when we're thinking about the selection mechanism
- 14:49basically nothing goes away, right?
- 14:51We can't ignore this selection mechanism.
- 14:53But we have to think
- 14:55about it when we want to make inference,
- 14:56even when our inference is about Y given Z, right?
- 14:59Even when we don't actually care
- 15:00about the selection mechanism.
- 15:02So the problem with non-probability samples
- 15:04is that it's often very, very hard to model S
- 15:07or we don't really have a good set of data
- 15:10with which to model the probability
- 15:11someone ended up in your sample.
- 15:14And that's basically what you have to do to generalize
- 15:17to the population, right?
- 15:19There's methods that exist for non-probability samples
- 15:21that require you to do something along the lines
- 15:24of finding another dataset
- 15:26that has similar characteristics
- 15:27and model the probability of being in the non-probability
- 15:30sample, right?
- 15:31So that's doable in many situations
- 15:34but what we're looking for is a method
- 15:35that doesn't require you to do that
- 15:37but instead says, let's do a sensitivity analysis.
- 15:40Let's say, how big of a problem
- 15:43might selection bias be if we ignored
- 15:46the selection mechanism, right?
- 15:47If we just sort of took our sample on faith
- 15:49as if it were an SRS from the population.
- 15:52How wrong would we be
- 15:54depending on how bad our selection bias problem is?
- 15:59So there has been previous work done
- 16:00in this area, often in surveys,
- 16:03trying to think about how confident
- 16:06we are that we can generalize to the population
- 16:08even when we're doing a probability sample.
- 16:10So there's work on thinking about the representativeness
- 16:14of a sample.
- 16:15So that's again, the generalizability to the population.
- 16:18So there's something called an R-indicator,
- 16:21which is a function of response probabilities
- 16:25or propensities,
- 16:26but it doesn't involve the survey variables.
- 16:28So it's literally comparing the probability of response
- 16:32to a survey for different demographic,
- 16:34across different demographic characteristics, for example.
- 16:37Right.
- 16:38And seeing who is more likely to respond than someone else.
- 16:40And if there are those differences
- 16:41then adjustments need to be made.
- 16:44There's also something called the H1 indicator,
- 16:47which does bring Y into the equation
- 16:49but it assumes ignorable selection.
- 16:52So it's going to assume that the Y
- 16:54for the excluded units gets dropped out.
- 16:58The selection mechanism only depends
- 16:59on things that you observe, so you can ignore it, right?
- 17:03So it's ignorable.
- 17:04So that's not what we're interested in.
- 17:06'Cause we're really worried in the non probability space
- 17:09that we can't ignore the selection mechanism.
- 17:13And there's a relatively new indicator
- 17:15that they called the SMUB, S-M-U-B.
- 17:19That is an index that actually extends
- 17:21this idea of selection bias
- 17:23to allow for non ignorable selection.
- 17:25So it lets you say, well, what would my point estimate
- 17:29be for a mean if selection were in fact ignorable,
- 17:33and now let's go to the other extreme,
- 17:35suppose selection only depends on Y.
- 17:37And I'm trying to estimate average weight
- 17:39and whether or not you entered my sample
- 17:40is entirely dependent on your weight.
- 17:43That's really not ignorable.
- 17:45And then it kinda bounds the potential magnitude
- 17:47for the problem.
- 17:49So that SMUB, this estimator is really close
- 17:52to what we want, but we want it for proportions,
- 17:55especially because in survey work and in large datasets,
- 18:00we very often have categorical data
- 18:03or very, very often binary data.
- 18:05If you think about if you've ever participated
- 18:07in an online survey or filled out those kinds of things
- 18:10very often, right, You're checking a box.
- 18:11It's multiple choice, select all that apply.
- 18:13It's lots and lots of binary data floating around out there.
- 18:17And I'll show you a couple of examples.
- 18:19So that was a lot of kind of me talking
- 18:22at you about the framework.
- 18:23Now, let me bring this down to a solid example application.
- 18:27So I'm going to use the National Survey
- 18:30of Family Growth as a fake population.
- 18:32So I want you to pretend that I have a population
- 18:36of 19,800 people, right?
- 18:38It happens to be that I pulled it from the National Survey
- 18:40of Family Growth,
- 18:41that's not really important that that was the source.
- 18:43I've got this population of about 20,000 people.
- 18:46But let's pretend we're doing a study
- 18:48and I was only able to select
- 18:50into my sample smartphone users.
- 18:52Because I did some kind of a survey that was on their,
- 18:54you had to take it on your phone.
- 18:56So if you did not have a smartphone
- 18:57you could not be selected into my sample.
- 19:00In this particular case, in this fake population,
- 19:03it's a very high selection fraction.
- 19:04So about 80% of my population is in my sample.
- 19:07That in and of itself is very unusual, right?
- 19:11A non-probability sample is usually very,
- 19:13very small compared to the full population
- 19:15let's say of the United States
- 19:17if that's who we're trying to generalize to.
- 19:18But for the purposes of illustration
- 19:20it helps to have a pretty high selection fraction.
- 19:22And we'll assume that the outcome we're interested
- 19:24in is whether or not the individual has ever been married.
- 19:28So this is person level data, right?
- 19:29Ever been married.
- 19:31And it is...
- 19:32we wanna estimate it by gender,
- 19:34and I will note that the NSFG only captures
- 19:36gender as a binary variable.
- 19:39This is a very long standing survey,
- 19:41been going on since the seventies.
- 19:42We know our understanding of gender as a construct
- 19:45has grown a lot since the seventies
- 19:47but this survey, and in fact
- 19:48many governmental surveys still treat gender
- 19:51as a binary variable.
- 19:52So that's our limitation here
- 19:54but I just want to acknowledge that.
- 19:56So in this particular case,
- 19:58we know the true selection bias, right?
- 20:00Because I actually have all roughly 20,000 people
- 20:04so that therefore I can calculate what's the truth,
- 20:06and then I can use my smartphone sample and say,
- 20:08"Well, how much bias is there?"
- 20:11So it turns out that in the full sample
- 20:1346.8% of the females have never been married.
- 20:16And 56.6% of the males had never been married.
- 20:20But if I use my selected sample of smartphone users
- 20:23I'm getting a, well, very close,
- 20:25but slightly smaller estimate for females.
- 20:2846.6% never married.
- 20:30And for males it's like about a percentage
- 20:32point lower than the truth, 55.5%.
- 20:35So not a huge amount of bias here.
- 20:38My smartphone users are not all that non-representative
- 20:41with respect to the entire sample,
- 20:43at least with respect to whether
- 20:44or not they've ever been married.
- 20:47So when we have binary data,
- 20:49an important point of reference is what happens if we assume
- 20:53everybody not in my sample is a one, right?
- 20:55What if everybody not in my sample was never married
- 20:58or everyone not in my sample
- 21:01is a no to never married, right?
- 21:03So, like, has ever been married?
- 21:05And these are what's called the Manski bounds.
- 21:07When you fill in all zeros or fill in all ones
- 21:10for the missing values or the values
- 21:12for those non-selected folks.
- 21:14So we can bound the bias.
- 21:15So the bias of this estimate of 0.466, or 46.6%,
- 21:21has to be by definition
- 21:23between negative 0.098 and positive 0.085.
- 21:26Because those are the two ends of putting all zeros
- 21:29or all ones for the people who are not in my sample.
- 21:32So this is unlike a continuous variable, right?
- 21:35Where we can't actually put a finite bound on the bias.
- 21:38We can with a proportion, right?
- 21:40So this is why, for example,
- 21:42if any of you ever work on smoking cessation studies
- 21:45often they do sensitivity analysis.
- 21:47People who drop out assume they're all smoking, right?
- 21:50Or assume they're all not smoking.
- 21:51They're not calling it that
- 21:53but they're getting the Manski bounds.
- 21:56Okay.
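As a rough sketch of that calculation (my own illustration; the talk's exact selection fraction differs slightly, which is why the printed bounds are -0.098 and +0.085 rather than these):

```python
def manski_bias_bounds(p_sel, f_sel):
    """Bounds on the bias of the selected-sample proportion p_sel when a
    fraction (1 - f_sel) of the population is unobserved: fill the missing
    binary outcomes in as all ones (lower bound) or all zeros (upper)."""
    lower = p_sel - (f_sel * p_sel + (1 - f_sel))  # everyone missing is a 1
    upper = p_sel - f_sel * p_sel                  # everyone missing is a 0
    return lower, upper

# Roughly the NSFG smartphone illustration: 46.6% observed, ~80% selected.
print(manski_bias_bounds(0.466, 0.80))  # approximately (-0.107, +0.093)
```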
- 21:57So the question is, can we do better than the Manski bounds?
- 22:00Because these are actually pretty wide bounds,
- 22:02relative to the size of the true bias,
- 22:04and these are very wide.
- 22:06And imagine a survey where we didn't have 80% selected.
- 22:10What if we had 10% selected?
- 22:12Well, then the Manski bounds are gonna be useless, right?
- 22:14Plug in all zeros, plug in all ones,
- 22:16you're gonna get these insane estimates
- 22:17that are nowhere close to what you observed.
- 22:21So going back to the statistical notation,
- 22:23this is where I said we're going to use Y
- 22:24in a slightly different way.
- 22:26Now, Y, from now forward, is the binary variable of interest.
- 22:30So in this case, in this NSFG example
- 22:33it was never married.
- 22:35We have a bunch of auxiliary variables that we observed
- 22:38for everybody in the selected sample;
- 22:41age, race, education, et cetera,
- 22:43and I'm gonna call those Z.
- 22:48Assume also that we have summary statistics
- 22:51on Z for the non-selected cases.
- 22:53So I don't observe Z for everybody, right?
- 22:55All my non-smartphone users,
- 22:57I don't know for each one of them, what is their gender?
- 23:00What is their age? What is their race?
- 23:02But I don't actually observe that.
- 23:03But I observed some kinda summary statistic.
- 23:06Like a mean vector and a covariance matrix of Z.
- 23:09So I have some source of what does my population
- 23:12look like at an aggregate level?
- 23:14And in practice, this would come from something
- 23:16like census data or in a very large probability sample,
- 23:20something where we would be pretty confident
- 23:21is reflective of the population.
- 23:23I will note that if we have data for the population
- 23:27and not the non-selected,
- 23:29then we can kinda do subtraction, right?
- 23:30We can take the data for the population
- 23:32and aggregate and go backwards
- 23:35to figure out what it would be for the non-selected
- 23:36by effectively backing out the selected cases.
- 23:40And similarly, if another problem
- 23:42is that we don't have the variance,
- 23:43we could just assume it's what we observe
- 23:44in the selected cases.
- 23:46So how are we gonna use this in order
- 23:48to estimate selection bias?
- 23:52We're gonna come up
- 23:53with this measure of unadjusted bias for proportions
- 23:56called the MUBP.
- 23:59So the MUBP is an extension of the SMUB
- 24:02that was for means, for continuous variables
- 24:04to binary outcomes, right?
- 24:06To proportions.
- 24:07High-level, it's based on pattern-mixture models.
- 24:10It requires you to make explicit assumptions
- 24:13about the distribution of the selection mechanism,
- 24:15and it provides you a sensitivity analysis,
- 24:18basically make different assumptions on S,
- 24:20I don't know what that distribution is,
- 24:22and you're gonna get a range of bias.
- 24:24So that's that idea of how wrong might we be?
- 24:28So we're trying to just tighten those bounds
- 24:30compared to the Manski bounds.
- 24:31Where we don't wanna have to rely on plug in all zeros,
- 24:33plug in all ones,
- 24:35we wanna shrink that interval
- 24:36to give us something a little bit more meaningful.
- 24:38So the basic idea behind how this works
- 24:41before I show you the formulas is we can measure
- 24:44the degree of selection bias in Z, right?
- 24:47Because we observed Z for our selected sample,
- 24:50and we observed at an aggregate for the population.
- 24:53So I can see, for example, that if in my selected sample,
- 24:56I have 55% females but in the population it's 50% females.
- 25:01Well, I can see that bias.
- 25:03Right, I can do that comparison.
- 25:04So absolutely I can tell you how much selection bias
- 25:08there is for all of my auxiliary variables.
- 25:11So if my outcome Y is related to my Zs
- 25:16then knowing something about the selection bias in Z
- 25:19tells me something about the selection bias in Y.
- 25:22It doesn't tell me exactly the selection bias in Y
- 25:25but it gives me some information in the selection bias in Y.
- 25:28So in the extreme imagine if your Zs
- 25:32in your selected sample
- 25:33in aggregate looked exactly like the population.
- 25:36Well, then you'd be pretty confident, right?
- 25:40That there's not an enormous amount of selection bias
- 25:42in Y assuming that Y was related to the Z.
- 25:46So we're gonna use pattern-mixture models
- 25:48to explicitly model that distribution of S, right?
- 25:52And we're especially gonna focus on the case
- 25:54when selection depends on Y.
- 25:56It depends on our binary outcome of interest.
- 26:00So again, Y is that binary variable interest,
- 26:03we only have it for the selected sample.
- 26:05In the NSFG example it's whether the woman or man
- 26:08has ever been married.
- 26:10We have Z variables available for the selected cases
- 26:13in micro data and an aggregate for the non-selected sample,
- 26:16a demographic characteristics
- 26:18like age, education, marital status, et cetera.
- 26:22And the way that we're gonna go
- 26:24about doing this is we're gonna try
- 26:25to get back to the idea of normality,
- 26:27because then as you all know, when everything's normal
- 26:30it's great, right?
- 26:32It's easy to work with the normal distribution.
- 26:34So the way we can do that with a binary variable
- 26:37is we can think about latent variables.
- 26:39So we're going to think about a latent variable called U.
- 26:42That is an underlying, unobserved latent variables.
- 26:45So unobserved for everybody, including our selected sample.
- 26:48And it's basically thresholded.
- 26:50And when U crosses zero, well, then Y goes from zero to one.
- 26:54So I'm sure many, all of you have seen probit regression,
- 26:58or this is what happens
- 26:59and this is how probit regression is justified,
- 27:01via latent variables.
- 27:04So we're going to take our Zs
- 27:06that we have for the selected cases,
- 27:08and essentially reduce the dimensionality.
- 27:11We're gonna take the Zs,
- 27:13run a probit regression of Y on Z in the selected cases,
- 27:17and pull out the linear predictor
- 27:19from the regression, right?
- 27:20The X beta, right?
- 27:22Sorry, Z beta.
- 27:24And I'm gonna call that X.
- 27:25That is my proxy for Y or my Y hat, right?
- 27:30It's just the predicted value from the regression.
- 27:32And I can get that for every single observation
- 27:35in my selected sample, of course, right?
- 27:37Just plug in each individual's Z values
- 27:39and get out their Y hat.
- 27:40That's my proxy value.
- 27:42And it's called the proxy
- 27:44because it's the prediction, right?
- 27:45It's our sort of best guess at Y
- 27:47based on this model.
- 27:49So I can get it for every observation in my selected sample,
- 27:52but very importantly I can also get it on average
- 27:56for the non-selected sample.
- 27:57So I have all my beta hats for my probit regression,
- 28:01and I'm gonna plug in Z-bar.
- 28:03And I'm going to plug in the average value of my Zs.
- 28:06And that's going to give me the average value
- 28:08of X for the non-selected cases.
- 28:11I don't have an actual observed value
- 28:13for all those non-selected cases
- 28:15but I have the average, right?
- 28:16So I could think about comparing the average X value
- 28:19in the aggregate, in the non-selected cases
- 28:22to that average X among my selected cases.
- 28:24And that is of course
- 28:26exactly where we're gonna get this index from.
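A small sketch of that proxy construction (simulated data and illustrative names, not the NSFG variables; statsmodels' Probit supplies the fit):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
Z_sel = rng.normal(size=(n, 3))                   # auxiliary variables, selected cases
y_sel = (Z_sel @ [0.8, -0.5, 0.3] + rng.normal(size=n) > 0).astype(int)
zbar_nonsel = np.array([0.15, -0.05, 0.10])       # aggregate Z-bar, non-selected cases

# Probit regression of Y on Z in the selected sample.
fit = sm.Probit(y_sel, sm.add_constant(Z_sel)).fit(disp=0)

# The proxy X = Z * beta-hat: one value per selected unit, and -- via the
# aggregate mean of Z -- its average for the non-selected cases.
x_sel = sm.add_constant(Z_sel) @ fit.params
xbar_nonsel = np.concatenate(([1.0], zbar_nonsel)) @ fit.params
print(x_sel.mean(), xbar_nonsel)
```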
- 28:29So I have my selection indicator S,
- 28:31so in the smartphone example,
- 28:33that's S equals one for the smartphone users
- 28:35and S equals zero for the non-smartphone users
- 28:37who weren't in my sample.
- 28:39And importantly, I'm going to allow
- 28:40there to be some other covariates V
- 28:43floating around in here that are independent of Y and X
- 28:46but could be related to selection.
- 28:48Okay.
- 28:49So it could be related to how you got into my sample
- 28:51but importantly, not related to the outcome.
- 28:55So diving into the math here, the equations,
- 28:59we're gonna assume a proxy pattern-mixture model for U,
- 29:02the latent variable underlying Y
- 29:05and X given the selection indicator.
- 29:08So what a pattern-mixture model does is it says
- 29:11there's a totally separate distribution
- 29:14or joint distribution of U and X for the selected units
- 29:16and the non-selected units.
- 29:18Notice that all my mus, all my sigmas, my rho,
- 29:21they've all got a superscript of j, right?
- 29:23So that's whether your S equals zero or S equals one.
- 29:27So two totally different bivariate normal distributions
- 29:31for U and X,
- 29:33depending on if you're selected or non-selected.
- 29:35And then we have a marginal distribution
- 29:37just Bernoulli, for the selection indicator.
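In symbols, the model just described is roughly the following (my compact restatement of the talk's setup; the superscript j indexes selection status):

```latex
\begin{align*}
  (U_i, X_i) \mid S_i = j \;&\sim\; \mathcal{N}_2\!\left(
      \begin{pmatrix} \mu_U^{(j)} \\ \mu_X^{(j)} \end{pmatrix},\;
      \begin{pmatrix}
        \sigma_{UU}^{(j)} & \rho^{(j)}\sqrt{\sigma_{UU}^{(j)}\sigma_{XX}^{(j)}} \\
        \rho^{(j)}\sqrt{\sigma_{UU}^{(j)}\sigma_{XX}^{(j)}} & \sigma_{XX}^{(j)}
      \end{pmatrix}\right), \qquad j = 0, 1, \\
  S_i \;&\sim\; \mathrm{Bernoulli}(\pi), \qquad Y_i = \mathbb{1}\{U_i > 0\}.
\end{align*}
```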
- 29:40However, I'm sure you all immediately are thinking,
- 29:43"Well, that's great,
- 29:45"but I don't have any information to estimate
- 29:47"some of these parameters for the non-selected cases."
- 29:51Clearly, for the selected cases, right?
- 29:53S equals one.
- 29:54I can estimate all of these things.
- 29:55But I can't estimate them for the non-selected sample
- 29:58because I might observe X-bar
- 30:01but I don't observe anything having to do with U.
- 30:03'Cause I have no Y information.
- 30:06So in order to identify this model
- 30:08and be able to come up with estimates
- 30:09for all of these parameters,
- 30:10we have to make an assumption about the selection mechanism.
- 30:13So we assume that the probability of selection
- 30:16into my sample is a function of U.
- 30:19So we're allowing it to be not ignorable.
- 30:21Remember, U is underlying Y; and X,
- 30:23that proxy, which is a function of Z,
- 30:25so that's observed; and V, those other variables.
- 30:30And in particular, we're assuming
- 30:31that it's this funny looking form of combination
- 30:34of X and U.
- 30:35That depends on this sensitivity parameter phi.
- 30:38So it's one minus phi times X
- 30:41plus phi times U.
- 30:43So that's essentially weighting
- 30:45the contributions of those two pieces.
- 30:47How much of selection is dependent
- 30:49on the thing that I observe
- 30:50or the proxy builds off the auxiliary variables
- 30:53and how much of it is depending on the underlying latent U
- 30:56related to Y,
- 30:57that is definitely not observed
- 30:58for the non-selected.
- 31:00Okay.
- 31:01And there's a little X star here,
- 31:02that's sort of a technical detail.
- 31:03We're rescaling the proxy.
- 31:05So it has the same variance as U,
- 31:07very unimportant mathematical detail.
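Written out, that identifying assumption says selection depends on the data only through the stated combination (g(·) is some unspecified function; X* is the rescaled proxy):

```latex
\Pr(S = 1 \mid U, X, V) \;=\; g\!\left((1-\varphi)\,X^{*} + \varphi\, U,\; V\right),
\qquad 0 \le \varphi \le 1 .
```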
- 31:10So we have this joint distribution
- 31:13that is conditional on selection status.
- 31:16And in addition to, we need that one assumption
- 31:19to identify things.
- 31:20We also have the latent variable problem.
- 31:22So latent variables do not have separately identifiable
- 31:24mean and variance, right?
- 31:26So that's just...
- 31:27Outside of the scope of this talk
- 31:29that's just a fact, right?
- 31:30So without loss of generality
- 31:31we're gonna set the variance of the latent variable
- 31:34for the selected sample equal to one.
- 31:35So it's just the scale of the latent variable.
- 31:38So what we actually care about is a function of U, right?
- 31:42It's the probability Y equals one marginally
- 31:45in my entire population.
- 31:46And so the probability Y equals one,
- 31:48is a probability U is greater than zero.
- 31:50That's that relationship.
- 31:51And so it's a weighted average of the proportion
- 31:55in the selected sample
- 31:56and the proportion in the non-selected sample, right?
- 32:00These are just...
- 32:01If U has this normal distribution
- 32:02this is how we get down to the probability
- 32:04U is greater than zero.
- 32:05Like those are those two pieces.
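So in symbols, using the talk's notation (with the selected group's latent variance fixed at one):

```latex
\Pr(Y = 1) \;=\; \pi\,\Phi\!\big(\mu_U^{(1)}\big)
   \;+\; (1-\pi)\,\Phi\!\Big(\mu_U^{(0)} \big/ \sqrt{\sigma_{UU}^{(0)}}\Big).
```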
- 32:08So the key parameter that governs
- 32:10how this MUBP works is a correlation, right?
- 32:14It's the strength of the relationship between Y
- 32:17and your covariates.
- 32:18How good of a model do you have for Y, right?
- 32:22So remember we think back to that example
- 32:24of what if I had no bias in Z,
- 32:26or if Y wasn't related to Z,
- 32:28well, then who cares whether there is bias in Z.
- 32:32But we want there to be a strong relationship
- 32:34between Z and Y so that we can kind of infer from Z to Y.
- 32:40So that correlation in this latent variable framework
- 32:43is called the biserial correlation of the binary X
- 32:46and the continuous.
- 32:47I mean, sorry, the binary Y and the continuous X, right?
- 32:50There's lots of different flavors of correlation,
- 32:53biserial is the name for this one
- 32:55that's a binary Y and a continuous X
- 32:57when we're thinking about the latent variable framework.
- 33:00Importantly, you can estimate
- 33:01this in the selected sample, right?
- 33:04So I can estimate the correlation between U and X
- 33:06among the selected sample.
- 33:07I can't for the non-selected sample,
- 33:09of course, but I can for the selected sample.
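For intuition, here is one textbook way to turn the observed point-biserial correlation into a biserial estimate under the normal-latent-variable assumption (an illustration, not necessarily the exact two-step estimator in the MUBP paper):

```python
import numpy as np
from scipy.stats import norm

def biserial(y, x):
    """Convert the point-biserial correlation of binary y with continuous x
    into the biserial correlation of the latent normal U with x."""
    p = y.mean()
    r_pb = np.corrcoef(y, x)[0, 1]
    return r_pb * np.sqrt(p * (1 - p)) / norm.pdf(norm.ppf(p))
```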
- 33:12So the non-identifiable parameters
- 33:14of that pattern-mixture model, here they are.
- 33:15Like the mean for the latent variable,
- 33:17the variance for the latent variable
- 33:19and that correlation for the non-selected sample
- 33:22are in fact identified when we make this assumption
- 33:24on the selection mechanism.
- 33:26So let's think about some concrete scenarios.
- 33:30What if phi was zero?
- 33:32If phi is zero,
- 33:33we look up here at this part of the formula,
- 33:35well, then U drops out of it.
- 33:38So therefore selection only depends on X
- 33:40and those extra variables V that don't really matter
- 33:43because V isn't related to X or Y.
- 33:46This is an ignorable selection mechanism, okay.
- 33:50If on the other hand phi is one,
- 33:52well, then it entirely depends on U.
- 33:54X doesn't matter at all.
- 33:55This is your worst, worst, worst case scenario, right?
- 33:58Where whether or not you're in my sample only depends
- 34:00on U and therefore only depends on the value of Y.
- 34:04And so this is extremely not ignorable selection.
- 34:07And of course the truth is likely to lie
- 34:10somewhere in between, right?
- 34:11Some sort of non-ignorable mechanism,
- 34:13a phi between zero and one, so that U matters
- 34:16but it's not the only thing that matters.
- 34:18Right, that X matters as well.
- 34:20Okay.
- 34:21So this is a kind of moderate,
- 34:22non-ignorable selection.
- 34:23That's most likely the closest to reality
- 34:26with these non-probability samples.
- 34:30So for a specified value of phi.
- 34:33So we pick a value for our sensitivity parameter.
- 34:35There's no information in the data about it.
- 34:36We just pick it and we can actually estimate the mean of Y
- 34:40and compare that to the selected sample proportion.
- 34:43So we take this select a sample proportion,
- 34:45subtract what we get as the truth
- 34:47for that particular value of phi,
- 34:50and that's our measure of bias, right?
- 34:52So this second piece that's being subtracted
- 34:54here depends on phi.
- 34:55Right, it depends on what the value
- 34:57of your sensitivity parameter is,
- 34:58the parameter governing your assumed selection mechanism.
- 35:01So in a nutshell, pick a selection mechanism
- 35:03by specifying phi,
- 35:06estimate the overall proportion,
- 35:07and then subtract to get your measure of bias.
- 35:10And again, we don't know whether we're getting
- 35:12the right answer because it's depending
- 35:14on the sensitivity parameter
- 35:15but it's at least going to allow us to bound the problem.
- 35:19So the formula is quite messy,
- 35:21but it gives some insight into how this index works.
- 35:24So this measure of bias is the selected sample
- 35:27mean minus that estimator, right?
- 35:29This is the overall mean of Y
- 35:32based on those latent variables.
- 35:34And what gets plugged in here
- 35:36importantly for the mean
- 35:37and the variance for the non-selected cases
- 35:39depends on a component that I've got colored blue here,
- 35:42and a component that I've got colored red.
- 35:44So if we look at the red piece
- 35:46this is the comparison of the proxy mean for the unselected
- 35:49and the selected cases.
- 35:50This is that bias in Z.
- 35:52The selection bias in Z,
- 35:54and it's just been standardized
- 35:55by its estimated variance, right?
- 35:57So that's how much selection bias
- 35:59was present in Z via X, right.
- 36:02via using it to predict Y in the probit regression.
- 36:05Similarly, down here, how different is the variance
- 36:08of the selected and unselected cases for X.
- 36:10How much bias, selection bias is there in estimating
- 36:13the variance?
- 36:14So we're going to use that difference
- 36:16and scale the observed mean, right?
- 36:19There's the estimated mean of U
- 36:22in the selected sample, and how much it's gonna shift
- 36:24by depends on the selection,
- 36:26I mean, the sensitivity parameter phi,
- 36:29and also that biserial correlation.
- 36:31So this is why the biserial correlation is so important.
- 36:34It is gonna dominate how much of the bias
- 36:37in X we're going to transfer over into Y.
- 36:42So if phi were zero,
- 36:44so if we wanna assume
- 36:45that it is an ignorable selection mechanism,
- 36:48then this thing in blue here,
- 36:50think about plugging zero here, zero here, zero everywhere,
- 36:52is just gonna reduce down to that correlation.
- 36:55So we're gonna shift the mean of U
- 36:56for the non-selected cases
- 36:59based on the correlation times that difference in X.
- 37:03Whereas if we have phi equals one,
- 37:06this thing in blue turns into one over the correlation.
- 37:10So here is where thinking about the magnitude
- 37:12of the correlation helps.
- 37:13If the correlation is really big, right?
- 37:15If the correlation is 0.8, 0.9,
- 37:17something really large, then phi and...
- 37:20I mean, sorry, then rho and one over rho
- 37:22are very close, right?
- 37:230.8 and 1/0.8 are pretty close.
- 37:26So if we're thinking about bounding this between phi
- 37:29equals zero and equals one,
- 37:30our interval is gonna be relatively small.
- 37:33But if the correlation is small,
- 37:35the correlation were 0.2, uh-oh, right?
- 37:37We're gonna get a really big interval
- 37:39because with that correlation,
- 37:40we're gonna shift with a factor of 0.2 at one extreme
- 37:43but then one over 0.2 at the other.
- 37:44That's gonna be a really big shift
- 37:46in that mean of the latent variable U
- 37:48and therefore the mean of Y.
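Putting those pieces together, here is a stripped-down sketch of the MUBP(phi) computation (a simplification of the estimator described in the talk; the inputs are summary statistics of the standardized proxy X*, and the blue factor from the slide is the quantity g below, which is rho at phi = 0 and 1/rho at phi = 1):

```python
from math import sqrt
from scipy.stats import norm

def mubp(phi, p_sel, rho, xbar_sel, xbar_non, var_x_sel, var_x_non, f_sel):
    """Sketch of the MUBP index for a chosen sensitivity parameter phi.
    Shifts the selected-sample latent mean by the proxy's selection bias,
    scaled by g(phi, rho), then mixes the two groups' proportions."""
    g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))
    mu_u_sel = norm.ppf(p_sel)                      # latent mean, var fixed at 1
    mu_u_non = mu_u_sel + g * (xbar_non - xbar_sel)
    var_u_non = 1.0 + g**2 * (var_x_non - var_x_sel)
    p_non = norm.cdf(mu_u_non / sqrt(var_u_non))
    p_overall = f_sel * p_sel + (1 - f_sel) * p_non
    return p_sel - p_overall                        # bias of the naive estimate

# Sensitivity analysis with made-up summary statistics:
for phi in (0.0, 0.5, 1.0):
    print(phi, round(mubp(phi, 0.466, 0.75, 0.0, -0.3, 1.0, 1.0, 0.8), 4))
```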
- 37:51So how do we get these estimates?
- 37:53We have two possibilities. We can use what we call
- 37:55modified maximum likelihood estimation.
- 37:58It's not true
- 37:58maximum likelihood, because we estimate
- 38:00the biserial correlation with something
- 38:02called a two-step method, right?
- 38:04So instead of doing a full maximum likelihood,
- 38:07we kind of take this cheat in which we set that mean of X
- 38:12for the selected cases equal to what we observe,
- 38:15and then condition on that to estimate
- 38:16the biserial correlation.
- 38:18Yeah.
- 38:19And as a sensitivity analysis we would plug in zero, one,
- 38:22and maybe 0.5 in the middle
- 38:23as the values of the sensitivity parameter.
- 38:26Alternatively, and what we feel is a much more attractive
- 38:29approach, is to be Bayesian about this.
- 38:31So in this MML estimation,
- 38:34we are implicitly assuming that we know the betas
- 38:38from that probit regression.
- 38:39That we're essentially treating X like we know it.
- 38:42But we don't know X, right?
- 38:44That probit regression,
- 38:45those parameters have error associated with them.
- 38:47Right?
- 38:48And you can imagine that the bigger your selected sample,
- 38:49the more precisely you're estimating those betas,
- 38:51that's not being reflected
- 38:53at all in the modified maximum likelihood.
- 38:56So instead we can be Bayesian.
- 38:57Put non-informative priors on all the identified parameters.
- 39:01That's gonna let those,
- 39:02the error in those betas be propagated.
- 39:05And so we'll incorporate that uncertainty.
- 39:07And we can actually, additionally put a prior on phi, right?
- 39:11So we could just say
- 39:12let's have it be uniform across zero one.
- 39:14Right?
- 39:15So we can see what does it look like if we in totality,
- 39:18if we assume that phi is somewhere evenly distributed
- 39:20across that interval.
- 39:22We've done other things as well.
- 39:23We've taken, like, discrete priors.
- 39:26Oh, let's put a point mass on 0.5 and one,
- 39:29or other values, right?
- 39:30You can do whatever you want for that prior.
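As a toy version of the uniform-prior idea, one can reuse the mubp sketch above and integrate over phi by simulation (this ignores the posterior uncertainty in the betas and rho that the full Bayesian fit propagates):

```python
import numpy as np

rng = np.random.default_rng(1)
draws = [mubp(phi, 0.466, 0.75, 0.0, -0.3, 1.0, 1.0, 0.8)
         for phi in rng.uniform(0.0, 1.0, size=5000)]
print(np.percentile(draws, [2.5, 50.0, 97.5]))  # crude interval for the bias
```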
- 39:33So let's go back to the example
- 39:35see what it looks like.
- 39:36If we have the proportion ever married for females
- 39:38on the left and males on the right,
- 39:40the true bias is the black dot.
- 39:43And so the black is the true bias.
- 39:45The little tiny diamond is the MUBP for 0.5.
- 39:50And so that's plugging in that average value.
- 39:52Some selection mechanism that depends on Y somewhat,
- 39:56somewhere in the middle.
- 39:57So we're actually coming pretty close.
- 39:58That happens to be pretty close.
- 40:00And the intervals in green
- 40:02are the modified maximum likelihood intervals
- 40:04from phi equals zero to phi equals one,
- 40:06and the Bayesian intervals are longer, right?
- 40:08Naturally.
- 40:09We're incorporating the uncertainty.
- 40:11Essentially these MUBP,
- 40:13modified maximum likelihood intervals are too short.
- 40:15And we admit that these are too short.
- 40:18If we plug in all zeros and all ones
- 40:21for that small proportion of my NSFG population
- 40:25that wasn't selected into the sample,
- 40:27we get huge bounds relative to our indicator.
- 40:31Right?
- 40:32So remember when I showed you that slide, that bounded,
- 40:34we know the bias has to be between these two values.
- 40:37That's what's going on here.
- 40:38That's what these two values are.
- 40:39But using the information in Z
- 40:41we're able to much more narrowly
- 40:43make an estimate of where our selection bias is.
- 40:46So we got much tighter bounds.
- 40:48An important fact here
- 40:49is that we have pretty good predictors.
- 40:50Our correlation, the biserial correlation
- 40:53is about 0.7 or 0.8.
- 40:54So these things are pretty correlated
- 40:56with whether you've been married, age, education, right?
- 40:59Those things are pretty correlated.
- 41:01Another variable in the NSFG is income.
- 41:04So we can think about an indicator for having low income.
- 41:08Well, as it turns out those variables
- 41:10we have on everybody; age, education, gender,
- 41:14those things are not actually that good of predictors,
- 41:16of low income, very low correlation.
- 41:19So our index reflects that.
- 41:21You get much wider intervals,
- 41:23sort of closer to the Manski bounds.
- 41:26And in fact, it's exactly returning one of those bounds.
- 41:29The filling in all zeros bound is returned by this index.
- 41:33So that's actually an attractive feature.
- 41:35Right?
- 41:36We're sort of bounded at the worst possible case
- 41:38on one end of the bias
- 41:40but we are still capturing the truth.
- 41:42The Manski bounds are basically useless,
- 41:44right in this particular case.
- 41:47So that's a toy example.
- 41:50Just gonna quickly show you a real example,
- 41:53and I'm actually gonna skip
- 41:54over the incentive experiment,
- 41:55which well, very, very interesting
- 41:57is there's a lot to talk about,
- 41:59and I'd rather jump straight to the presidential polls.
- 42:03So there's very much in the news now,
- 42:08and over the past several years,
- 42:08this idea of failure of political polling
- 42:11and this recent high profile failure
- 42:12of pre-election polls in the US.
- 42:15So polls are probability samples
- 42:18but they have very, very, very low response rates.
- 42:20I don't know how much you know about how they're done,
- 42:21but they have very, very low response rates.
- 42:23But think about what we're getting at in a poll,
- 42:25a binary variable, are you going to vote for Donald Trump?
- 42:28Yes or no?
- 42:29Are you gonna vote for Joe Biden?
- 42:31Yes or no?
- 42:31These binary variables.
- 42:32We want to estimate proportions.
- 42:34That's what political polls aimed to do.
- 42:36Pre-election polls.
- 42:37So we have these political polls with these failures.
- 42:41So we're thinking, maybe it's a selection bias problem.
- 42:44And that there is some of this, that people
- 42:45are entering into this poll differentially,
- 42:49depending on who they're going to vote for.
- 42:52So think of it this way,
- 42:53and I'm gonna use Trump as the example
- 42:54'cause we're going to estimate,
- 42:55I'm gonna try to estimate
- 42:56the proportion of people who will vote
- 42:57for Former President Trump in the 2020 election.
- 43:02So, might Trump supporters
- 43:04just inherently be less likely to answer the call, right?
- 43:07To answer that poll or to refuse to answer the question
- 43:11even conditional demographic characteristics, right?
- 43:13So two people who otherwise look the same
- 43:16with respect to those Z variables, age, race, education,
- 43:20the one who's the Trump supporter, someone might argue,
- 43:22you might be more suspicious of the government
- 43:24and the polls, and not want to answer
- 43:26and not come into this poll, not be selected.
- 43:28That would be selection depending on Y.
- 43:31So the MUBP could be used to try to adjust poll estimates.
- 43:35Say, well, there's your estimate from the poll
- 43:38but what if selection were not ignorable?
- 43:40How different would our estimate
- 43:42of the proportion voting for Trump be?
- 43:45So in this example, our proportion of interest
- 43:48is the percent of people who are gonna vote for Trump.
- 43:51The sample that we used
- 43:53are publicly available data
- 43:54from seven different pre-election polls
- 43:56conducted in seven different states by ABC in 2020.
- 44:01And the way these polls work
- 44:03is it's a random digit dialing survey.
- 44:05So that's literally randomly dialing phone numbers.
- 44:08Many of whom get
- 44:09thrown out 'cause they're businesses, et cetera,
- 44:10very, very low response rates, 10% or lower.
- 44:13Very, very, very low response rates to these kinds of polls.
- 44:17They do, however, try to do some weighting.
- 44:19So it's not as if they just take that sample and say,
- 44:21there we go let's estimate the proportion for Trump.
- 44:23They do weighting adjustments
- 44:25and use what's called iterative proportional fitting
- 44:28or raking to get the distribution of key variables
- 44:33in the sample to look like the population.
- 44:36So they use census margins for, again,
- 44:38it's gender as binary, unfortunately,
- 44:40age, education, race, ethnicity, and party identification.
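For readers who haven't seen raking, here's a toy version of that margin-matching idea (my own illustration, not ABC's production weighting):

```python
import numpy as np

def rake(weights, groups, targets, iters=50):
    """Toy iterative proportional fitting (raking): repeatedly scale the
    weights so each variable's weighted margins match population targets.
    `groups` maps a variable name to a per-respondent category array;
    `targets` maps it to a dict of {category: population share}."""
    w = weights.astype(float).copy()
    for _ in range(iters):
        for var, cats in groups.items():
            for cat, share in targets[var].items():
                mask = cats == cat
                current = w[mask].sum() / w.sum()
                if current > 0:
                    w[mask] *= share / current
    return w

# Hypothetical margins: two raking variables on five respondents.
groups = {"gender": np.array(["f", "f", "m", "m", "m"]),
          "party":  np.array(["d", "r", "d", "r", "r"])}
targets = {"gender": {"f": 0.5, "m": 0.5},
           "party":  {"d": 0.5, "r": 0.5}}
print(rake(np.ones(5), groups, targets).round(3))
```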
- 44:45So, because we're doing this after the election
- 44:47we know the truth.
- 44:48We have access to the true official election outcomes
- 44:50in each state.
- 44:51So I know the actual proportion of Y.
- 44:54And my population is likely voters,
- 44:57because that's who we're trying to target
- 44:58with these pre-election polls.
- 44:59We wanna know what's the estimated proportion
- 45:02that would vote for Trump among the likely voters.
- 45:05So the tricky thing is that for that population
- 45:07it's hard to come by summary statistics.
- 45:10Likely voters, right?
- 45:11It's easy to get summary statistics from all people
- 45:13in the US or all people of voting age in the US
- 45:16but not likely voters.
- 45:18So here Y is an indicator for voting for Trump.
- 45:21Z is the set of auxiliary variables in the ABC poll.
- 45:24So all those variables I mentioned
- 45:25before gender age, et cetera.
- 45:27We actually have very strong predictors
- 45:29of Y, basically because of these political ideation,
- 45:32party identification variables, right?
- 45:34Not surprisingly the people who identify as Democrats,
- 45:37very unlikely to be voting for Trump.
- 45:41The data set that we found that can give us population level
- 45:44estimates of the mean of Z for the non-selected sample
- 45:48is a dataset from AP/NORC.
- 45:50It's called their VoteCast Data.
- 45:52And they conduct these large surveys
- 45:55and provide an indicator of likely voter.
- 45:58So we can basically use this dataset
- 46:00to describe the demographic characteristics
- 46:02of likely voters,
- 46:04instead of just all people who are 18 and older in the US.
- 46:09The subtle issue is of course,
- 46:10these AP VoteCast data are not without error,
- 46:13but we're going to pretend that they are without error.
- 46:15And that's like a whole other papers.
- 46:17How do we handle the fact
- 46:17that my population data have error?
- 46:19So we're gonna use the unweighted ABC poll data
- 46:23as the selected sample and estimate the MUBP
- 46:26with the Bayesian approach with phi
- 46:27from the uniform distribution.
- 46:29The poll selection fraction is very, very, very small.
- 46:32Right, these polls in each state
- 46:34have about a thousand people in them
- 46:36but we've got millions of voters in each state.
- 46:38So the selection fraction is very, very, very small,
- 46:40total opposite of the smartphone example.
- 46:43So we'll just jump straight into the answer,
- 46:46did it work?
- 46:47Right, this is really exciting.
- 46:48So the red circle is the true proportion,
- 46:52oh, sorry, the true bias,
- 46:53this should say bias down here.
- 46:55In each of the states.
- 46:56So these are the seven states
- 46:57we looked at Arizona, Florida, Michigan, Minnesota,
- 46:59North Carolina, Pennsylvania, and Wisconsin.
- 47:02So this horizontal line here at zero that's no bias, right?
- 47:06So it's estimated, the ABC poll estimate
- 47:08would have no bias.
- 47:09And we can see that in Arizona we sort of overestimated
- 47:13and in the rest of the states
- 47:14we underestimated the support for Trump.
- 47:16And so that was really the failure was the underestimation
- 47:19of the support for Trump.
- 47:20Notice that our Bayesian bounds
- 47:24cover the true bias everywhere except
- 47:26in Pennsylvania and Wisconsin.
- 47:28And so Wisconsin had an enormous bias;
- 47:30the polls way under-called the support for Trump
- 47:33in Wisconsin by 10 percentage points.
- 47:34Huge problem.
- 47:35So we're not getting there
- 47:37but notice that zero is not in our interval.
- 47:40So our bounds are suggesting
- 47:43that there was a negative bias from the poll.
- 47:46So even though we didn't capture the truth,
- 47:48we've at least crossed the threshold
- 47:49saying it's very likely that you are under-calling
- 47:52the support for Trump.
- 47:55So how do estimates using the MUBP compare to the ABC poll?
- 47:59Well, we can use the MUBP bounds to basically shift
- 48:03the ABC poll estimates.
- 48:05So we're calling those MUBP adjusted, right?
- 48:08So we've got the truth is...
- 48:10The true proportion who voted for Trump
- 48:12are now these red triangles
- 48:14and then the black circles are the point estimates
- 48:17from three different methods of estimation:
- 48:21the unweighted estimate from the poll,
- 48:25the weighted estimate from the poll,
- 48:28and the last one, adjusted by our measure
- 48:30of non-ignorable selection bias, the MUBP-adjusted estimate.
- 48:32So we can see that in some cases
- 48:35our adjustment and the polls are pretty similar, right?
- 48:39But look at, for example, Wisconsin,
- 48:41all the way over here on the right.
- 48:42So again, remember I said we didn't cover the truth,
- 48:44we didn't cover the true bias,
- 48:46but our indicator is the only one, right,
- 48:49that's got that much higher shift up towards Trump.
- 48:52So this is us saying, well,
- 48:53if there were an underlying selection mechanism
- 48:57saying that Trump supporters
- 48:59were inherently less likely to enter this poll,
- 49:03this is what would happen.
- 49:04Or this is what your estimated support for Trump would be.
- 49:07It's shifted up.
- 49:09We've got a similar sort of success story,
- 49:11I'll say, in Minnesota,
- 49:12where both of the ABC estimators did not cover the truth
- 49:16in these pre-election polls but ours did, right?
- 49:18We were able to sort of shift up and say,
- 49:21look, if there were selection bias
- 49:22that depended on whether or not you supported Trump,
- 49:25we would have captured that.
- 49:27So the important idea here is, you know,
- 49:29before the election, we wouldn't have these red triangles.
- 49:34But it's important to be able to see
- 49:36that this is saying you're under-calling
- 49:39the support for Trump
- 49:40if there were non-ignorable selection, right?
- 49:42So it's that idea of a sensitivity analysis:
- 49:44how bad would we be doing?
- 49:46And what we would say is in Minnesota and Wisconsin
- 49:49we'd be very worried
- 49:50about under-calling the support for Trump.
- 49:56So what have I just shown you?
- 49:59I'll summarize.
- 50:01The MUBP is a sensitivity analysis tool
- 50:04to assess the potential for non-ignorable selection bias.
- 50:08If we have phi equals zero, ignorable selection,
- 50:12we can adjust that away via weighting
- 50:14or some other method, right?
- 50:16If it is ignorable,
- 50:18we can ignore the selection mechanism.
- 50:21On the other extreme, if phi is one,
- 50:23totally non-ignorable,
- 50:24selection depends only on the outcome
- 50:26we're trying to measure.
- 50:28Somewhere in between we've got phi equals 0.5:
- 50:30if you really needed a point estimate
- 50:32of the bias, that would be the one to use.
- 50:34And in fact, that's what this black dot is.
- 50:37That's the adjustment at 0.5 for our adjusted estimator.
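Evaluating the same simplified interpolation sketched earlier at the three landmark values of phi makes this summary concrete; the inputs below are invented for illustration, not real poll numbers.

```python
import numpy as np
from scipy.stats import norm

# Invented values: raw proportion, proxy strength, proxy shift.
ybar_s, rho, d = 0.47, 0.6, 0.3
for phi in (0.0, 0.5, 1.0):
    # phi = 0 gives g = rho; phi = 0.5 gives g = 1; phi = 1 gives g = 1/rho.
    g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))
    p_adj = norm.cdf(norm.ppf(ybar_s) + g * d)
    print(f"phi={phi:.1f}: adjusted proportion {p_adj:.3f}, "
          f"implied bias {ybar_s - p_adj:+.3f}")
```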
- 50:41This MUBP is tailored to binary outcomes,
- 50:45and it is an improvement over the normal-based SMUB.
- 50:48I didn't show you the results
- 50:49from the simulations, but they basically show
- 50:52that if you use the normal method on a binary outcome
- 50:55you get these huge bounds.
- 50:56You go outside of the Manski bounds, right?
- 50:58'Cause it's not properly bounded between zero and one,
- 51:01or your proportion isn't properly bounded.
- 51:03And importantly, our measure only requires
- 51:06summary statistics for Z,
- 51:08for the population or for the non-selected sample.
- 51:11So I don't have to have a whole separate data set
- 51:14where I have everybody who didn't get selected
- 51:16into my sample,
- 51:16I just need to know the average of these covariates, right,
- 51:20I just need to know Z-bar in order to get my average
- 51:23proxy for the non-selected.
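Here is a one-line sketch of why Z-bar suffices, with invented numbers: the proxy is linear in Z on the probit scale, so its mean over the non-selected units is just their covariate means pushed through the fitted coefficients.

```python
import numpy as np

# Invented probit coefficients fit to Y ~ Z in the selected sample.
beta_hat = np.array([0.4, -1.2, 0.8])
# Invented covariate means (Z-bar) for the non-selected units,
# e.g. taken from published summary statistics.
zbar_ns = np.array([0.51, 0.35, 0.42])
# Linearity of the proxy means the covariate means are all we need.
xbar_ns = zbar_ns @ beta_hat
print(xbar_ns)
```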
- 51:26With weak information,
- 51:27so if my model is poor then my Manski bounds
- 51:30are gonna be what's returned.
- 51:32So that's a good feature of this index:
- 51:34it is naturally bounded,
- 51:36unlike the normal-model version.
- 51:38And we have done additional work to move
- 51:41beyond just estimating means and proportions
- 51:43into linear regression and probit regression.
- 51:46So we have indices of selection bias
- 51:48for regression coefficients.
- 51:50So instead of wanting to know the mean of Y
- 51:53or the proportion with Y equals one,
- 51:55what if you wanted to do a regression of Y
- 51:57on some covariates?
- 51:59So we have a paper out in the Annals of Applied Statistics
- 52:02that extends this to regression coefficients.
- 52:05So I believe I'm pretty much right on the time
- 52:07I was supposed to end, so I'll say thank you, everyone.
- 52:09And I'm happy to take questions.
- 52:11I'll put up my references
- 52:12in my teeny, tiny fonts, yes.
- 52:20Robert Does anybody have any questions?
- 52:26From the room?
- 52:33So.
- 52:36Dr. Rebecca Let me stop my share.
- 52:38Student Hey.
- 52:40I have a very basic one,
- 52:41mostly more of curiosity (indistinct)
- 52:44Sure, sure.
- 52:45What is it that caused the...
- 52:50We know after the fact, in your example,
- 52:54what the direction of the bias was,
- 52:57but why is it that it only shifted in the Trump direction?
- 53:03Why?
- 53:03You don't know in advance if something is more likely
- 53:06or less likely?
- 53:08Okay.
- 53:09So excellent question.
- 53:09So that is effectively...
- 53:11The direction of the shift is going to match...
- 53:15The direction of the shift in the mean of Y,
- 53:17or in the proportion, is going to match
- 53:18the shift in X, right?
- 53:20So if the mean you get for your proxy
- 53:25in the non-selected sample is bigger
- 53:28than in your selected sample,
- 53:30then your proportion is gonna get shifted
- 53:31in that direction.
- 53:32Right.
- 53:33It's only ever going to shift it to match the bias in X.
- 53:37Right?
- 53:37And so then, which way that shifts Y
- 53:39depends on what the relationship
- 53:41is between the covariates Z and X in the probit regression.
- 53:46But it will always shift it in a particular direction.
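A toy check of that direction claim, using the same simplified interpolation as above with invented numbers: because the phi multiplier is always positive, flipping the sign of the proxy shift flips the direction of the adjusted proportion.

```python
import numpy as np
from scipy.stats import norm

ybar_s, rho, phi = 0.47, 0.6, 0.5  # invented values
g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))  # always positive
for d in (+0.3, -0.3):  # proxy shift: non-selected minus selected
    p_adj = norm.cdf(norm.ppf(ybar_s) + g * d)
    print(f"d={d:+.1f}: adjusted {p_adj:.3f} "
          f"({'up' if p_adj > ybar_s else 'down'} from {ybar_s})")
```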
- 53:49I will note, and I fully admit,
- 53:52our index actually shifted the wrong direction
- 53:55in one particular case.
- 53:57Right?
- 53:57So actually in Florida,
- 54:00we actually shifted down when we shouldn't have.
- 54:02Right.
- 54:03So here the estimate is shifting down,
- 54:05but actually the truth is higher.
- 54:07So we're not always getting it right;
- 54:09we're getting it right when that X is shifting
- 54:13in the correct direction.
- 54:14Right?
- 54:15So it isn't true that we always...
- 54:17It's true that it always shifts in the direction of X,
- 54:19but it's not a hundred percent true that X
- 54:22always shifts in the exact same way as Y.
- 54:24Just most of the time.
- 54:25There was evidence of underestimating the Trump support,
- 54:29and that was in fact reflected in that probit regression,
- 54:32right, in that relationship.
- 54:33The people who replied to the poll were older,
- 54:36they were higher educated, right?
- 54:39And so those older,
- 54:40higher educated people in aggregate
- 54:43were less likely to vote for Trump.
- 54:45So that's why we ended up under-calling the support
- 54:48for Trump when we don't account
- 54:49for that potential non-ignorable selection bias.
- 54:52Good question though.
- 54:55Robert Got it, thank you.
- 54:56Any other questions (indistinct)
- 55:09Anybody?
- 55:16I know I talk fast and that was a lot of stuff
- 55:19so, you know, I get it.
- 55:21(indistinct)
- 55:23Alright.
- 55:24Well, Dr. Andridge, thank you again.
- 55:26And.
- 55:27(students clapping)
- 55:33Thank you.
- 55:34Thank you for having me.
- 55:35Robert Yeah.