
YSPH Biostatistics Seminar: "Measures of Selection Bias for Proportions Estimated from Non-Probability Samples"

November 16, 2021
  • 00:00Hi.
  • 00:01Hi everybody.
  • 00:02Students Hi.
  • 00:03It's my pleasure today
  • 00:03to introduce Professor Rebecca Andridge.
  • 00:06Professor Andridge has a Bachelor's in Economics from Stanford
  • 00:10and her Master's and PhD in Biostatistics
  • 00:13from the University of Michigan.
  • 00:16She is an expert in group randomized trials
  • 00:18and methods for missing data,
  • 00:19especially for that ever so tricky case
  • 00:23where data is missing not at random.
  • 00:26She's been faculty in Biostatistics at Ohio State University
  • 00:29since 2009.
  • 00:31She's an award-winning educator
  • 00:32and a 2020 Fellow of the American Statistical Association,
  • 00:36and we're very honored to have her here today.
  • 00:38Let's welcome professor Andridge.
  • 00:40(students clapping)
  • 00:43Thank you for the very generous introduction.
  • 00:46I have to tell you,
  • 00:47it's so exciting to see a room full of students.
  • 00:51I am currently teaching an online class
  • 00:52and the students don't all congregate in a room.
  • 00:54So it's like been years since I've seen this.
  • 00:58So I'm of course gonna share my slides.
  • 01:01I want to warn everybody that I am working from home today.
  • 01:06And while we will not be interrupted by my children
  • 01:09we might be interrupted or I might be interrupted
  • 01:11by the construction going on in my house,
  • 01:13my cats or my fellow work at home husband.
  • 01:16So I'm gonna try to keep the distractions to a minimum
  • 01:18but that is the way of the world in 2020,
  • 01:22in the pandemic life.
  • 01:24So today I'm gonna be talking about some work
  • 01:26I've done with some colleagues
  • 01:27actually at the University of Michigan.
  • 01:29Talking about selection bias
  • 01:31in proportions estimated from non-probability samples.
  • 01:36So I'm gonna start with some background and definitions
  • 01:38and we'll start with kind of an overview
  • 01:40of what's the problem we're trying to address.
  • 01:43So big data are everywhere, right?
  • 01:45We all have heard that phrase being bandied about, big data.
  • 01:49They're everywhere and they're cheap.
  • 01:50You got Twitter data, internet search data, online surveys,
  • 01:53things like predicting the flu using Instagram, right?
  • 01:56All these massive sources of data.
  • 01:59And these data often, I would say pretty much always,
  • 02:03arise from what are called non-probability samples.
  • 02:07So when we have a non-probability sample
  • 02:08we can't use what are called design based methods
  • 02:11for inference,
  • 02:11you actually have to use model based approaches.
  • 02:14So I'm not gonna assume that everybody knows
  • 02:16all these words that I've thrown out here,
  • 02:18so I'm gonna go into some definitions.
  • 02:22So our goal is to develop an index of selection bias
  • 02:25that lets us get at how bad the problem might be,
  • 02:28how much bias might we have due to non-random selection
  • 02:32into our sample?
  • 02:34So a probability sample is a situation
  • 02:38where you're collecting data
  • 02:39where each unit in the population
  • 02:41has a known positive probability of selection.
  • 02:44And randomness is involved in the selection of which units
  • 02:47come into the sample, right?
  • 02:49So this is your stereotypical complex survey design
  • 02:53or your sample survey.
  • 02:55Large government sponsored surveys
  • 02:57like the National Health and Nutrition Examination Survey,
  • 03:00NHANES or NHIS or any number of large surveys
  • 03:04that you've probably come across,
  • 03:06you know, in application and your biostatistics courses.
  • 03:09So for these large surveys
  • 03:11we do what's called design-based inference.
  • 03:14So that's where we rely on the design
  • 03:16of the data collection mechanism
  • 03:18in order for us to get unbiased estimates
  • 03:20of population quantities,
  • 03:21and we can do this without making any model assumptions.
  • 03:24So we don't have to assume
  • 03:26that let's say body mass index has a normal distribution.
  • 03:29We literally don't have to specify distribution at all.
  • 03:32It's all about the random selection into the sample
  • 03:35that lets us get our estimates
  • 03:36and be assured that we have unbiased estimates.
  • 03:40So here's an example in case there are folks
  • 03:43out in the audience who don't have experience
  • 03:45with the sort of complex survey design or design features.
  • 03:48So this is a really silly little example
  • 03:49of a stratified sample.
  • 03:51So here I have a population
  • 03:53of two different types of animals.
  • 03:55I have cats and I have dogs.
  • 03:57And in this population I happen to have 12 cats and 8 dogs.
  • 04:01And I have taken a sample.
  • 04:03Stratified sample where I took two cats and two dogs.
  • 04:07So in this design the selection probabilities
  • 04:09are known for all of the units, right?
  • 04:11Because I know that there's a two out of eight chance
  • 04:14I pick a dog and a two out of 12 chance
  • 04:16that I pick a cat, right?
  • 04:18So the probability a cat is selected is 1/6
  • 04:21and the probability a dog is selected is 1/4.
  • 04:23Now, how do I estimate a proportion of interest?
  • 04:26Let's say it's the proportion of orange animals
  • 04:28in the population.
  • 04:29Like here in my sample,
  • 04:30I have one of four orange animals,
  • 04:32but if I chose that as my estimator
  • 04:34I'd be ignoring the fact that I know how I selected
  • 04:37these animals into my sample.
  • 04:39So what we do is we weight the sample units
  • 04:42to produce design-unbiased estimates, right?
  • 04:44Because this one dog kinda counts
  • 04:48differently than one cat, right?
  • 04:50Because there were only eight dogs
  • 04:51to begin with but there were 12 cats.
  • 04:54So if I want to estimate the proportion of orange animals
  • 04:57I would say this cat gets a weight of six
  • 05:00because there's two of them and 12 total.
  • 05:02So 12 divided by two is six.
  • 05:04So there's six in the numerator.
  • 05:06And then the denominator is the sum of the weights
  • 05:08of all the selected units,
  • 05:10the cats are each six and the dogs are each four.
  • 05:12So I actually get my estimated proportion of 30%
  • 05:15instead of 25%.
  • 05:17So that kind of weighted estimator
  • 05:18is what we do in probability sampling.
  • 05:20And we don't have to say what the distribution
  • 05:22of dogs or cats is in the sample
  • 05:24or orangeness in the sample,
  • 05:26we entirely rely on the selection mechanism.
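A minimal sketch of the weighted, design-based estimator just described, using the numbers from the cat-and-dog example (one of the two sampled cats is the orange animal):

```python
# Design-based (weighted) estimate of the proportion of orange animals.
# Stratified sample: 2 of 12 cats sampled (weight 12/2 = 6 each),
# 2 of 8 dogs sampled (weight 8/2 = 4 each); the orange animal is a cat.
weights = [6, 6, 4, 4]   # population units each sampled animal represents
orange  = [1, 0, 0, 0]   # indicator of being orange

unweighted = sum(orange) / len(orange)                                  # 1/4 = 25%
weighted = sum(w * y for w, y in zip(weights, orange)) / sum(weights)   # 6/20 = 30%
print(unweighted, weighted)
```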
  • 05:30What ended up happening in the real world
  • 05:32a lot of the time is we don't actually get to use
  • 05:35those kinds of complex designs.
  • 05:36And instead we collect data
  • 05:38through what's called a non-probability sample.
  • 05:40So in a non-probability sample,
  • 05:42it's pretty easy to define.
  • 05:43You cannot calculate the probability of selection
  • 05:46into the sample, right?
  • 05:47So we simply don't know what the mechanism
  • 05:49was that made a unit enter our sample.
  • 05:53I know there's the biostatistics students in the audience,
  • 05:55and you've all probably done a lot of data analysis.
  • 05:57And I would venture a guess that a lot of the times
  • 06:00your application datasets
  • 06:01are non-probability samples, right?
  • 06:03A lot of the times there are convenience samples.
  • 06:05I work a lot with biomedical researchers
  • 06:07studying cancer patients.
  • 06:08Well guess what, it's almost always a convenience sample
  • 06:12of cancer patients, right?
  • 06:13It's who will agree to be in the study?
  • 06:15Who can I find to be in my study?
  • 06:17Other types of non-probability samples
  • 06:19include things like voluntary or self-selection sampling,
  • 06:22quota sampling, that's a really old,
  • 06:24old school method from polling back many years ago.
  • 06:28Judgment sampling or snowball sampling.
  • 06:30So there's a lot of different ways
  • 06:31you can get non-probability samples.
  • 06:34So if we go back to the dog and cat example,
  • 06:37if I didn't know anything about how these animals
  • 06:39got into my sample and I just saw the four of them,
  • 06:41and one of them was orange,
  • 06:43I guess, I'm gonna guess 25% of my population is orange.
  • 06:48Right?
  • 06:49I don't have any other information
  • 06:50I can't recreate the population
  • 06:53like I could with the weighting.
  • 06:54Where I knew how many cats in the population
  • 06:57did each of my sampled cats represent
  • 06:59and similarly for the dogs.
  • 07:01So of course our best guess looking at these data
  • 07:03would just be 25%, right?
  • 07:05One out of the four animals is orange.
  • 07:07So when you think about a non-probability sample,
  • 07:10how much faith do you put in that estimate,
  • 07:12that proportion?
  • 07:15Hard to say, right?
  • 07:16It depends on what you believe about the population
  • 07:19and how you selected this non-probability sample
  • 07:23but you do not have the safety net of the probability sample
  • 07:26that guaranteed you're gonna get an unbiased estimate
  • 07:28of repeated applications of the sampling.
  • 07:32So I've already used the word selection bias
  • 07:34a lot and sort of being assuming that, you know what I mean.
  • 07:37So now I'm gonna come back to it and define it.
  • 07:40So selection bias is bias arising
  • 07:42when part of the target population
  • 07:45is not in the sample population, right?
  • 07:47So when there's a mismatch between who got into your sample
  • 07:49and who was supposed to get into your sample, right?
  • 07:51Who's the population?
  • 07:53Or in a more general statistical kind of way,
  • 07:56when some population units are sampled at a different rate
  • 07:59than you meant.
  • 08:00It's like you meant for there to be a certain selection
  • 08:03probability for orange animals or for dogs
  • 08:06but it didn't actually end up that way.
  • 08:08This will end up down the path of selection bias.
  • 08:11And I will note that again, as you are biostats students
  • 08:13you've probably had some epidemiology.
  • 08:15And epidemiologists talk about selection bias as well.
  • 08:17It's the same concept, right?
  • 08:19That concept of who is ending up in your sample.
  • 08:22And is there some sort of a bias in the mechanism?
  • 08:26So selection bias is in fact the predominant
  • 08:28concern with non-probability samples.
  • 08:30In these non-probability samples,
  • 08:32the units in the sample might be really different
  • 08:36from the units not in the sample,
  • 08:37but we can't tell how different they are.
  • 08:40Whether we're talking about people, dogs, cats, hospitals,
  • 08:43whatever we're talking about.
  • 08:44However, these units got into my sample, I don't know.
  • 08:47So I don't know if the people in my sample
  • 08:49look like my population or not.
  • 08:53And an important key thing to know
  • 08:55is that probability samples
  • 08:57when we have a low response rates, right?
  • 08:59So when there are a lot of people not responding
  • 09:01you're basically ending up back
  • 09:03at a non-probability sample, right?
  • 09:05Where we have this beautiful design,
  • 09:07we know everybody's sampling weight, we draw a sample,
  • 09:10oops, but then only 30% of people respond to my sample.
  • 09:14You're basically injecting that bias back in again.
  • 09:16Sort of undoing the beauty of the probability sample.
  • 09:21So when we think about a selection
  • 09:23bias and selection into a sample,
  • 09:25we can categorize them in two ways.
  • 09:28And Dr. McDougal, actually,
  • 09:30when he was giving you my brief little bio
  • 09:32used the words that I'm sure you've used
  • 09:34in your classes before like ignorable and non-ignorable.
  • 09:37These are more commonly applied
  • 09:39to missingness, right?
  • 09:41So ignorable missingness mechanisms
  • 09:43and non-ignorable missingness mechanisms.
  • 09:45Missing at random, missing completely at random
  • 09:48or missing not at random, right?
  • 09:50Same exact framework here.
  • 09:52But instead of talking about missingness
  • 09:54we're talking about selection into the sample.
  • 09:56So when we have an ignorable selection mechanism,
  • 09:59that means the probability of selection
  • 10:01depends on things I observed.
  • 10:02Right, it depends on the observed characteristics.
  • 10:05When I have a non-ignorable selection mechanism
  • 10:08now that probability of selection depends
  • 10:10on unobserved characteristics.
  • 10:12Again, this is not really a new concept
  • 10:14if you've learned about missing data,
  • 10:15just applied to selection into the sample.
  • 10:20So in a probability sample
  • 10:22we might have different probabilities of selection
  • 10:24for different types of units like for cats versus for dogs.
  • 10:28But we know exactly how they differ, right?
  • 10:31It's because I designed my survey
  • 10:33based on this characteristic of dog versus cat
  • 10:36and I know exactly the status of dog versus cat
  • 10:38for my entire population in order to do that selection.
  • 10:42So I absolutely can estimate the proportion of orange
  • 10:45animals unbiasedly in the sense of taking repeated
  • 10:49stratified samples and estimating that proportion.
  • 10:52I am guaranteed that I'm gonna get an unbiased
  • 10:54estimate, right?
  • 10:55So this selection mechanism
  • 10:57is definitely not non-ignorable, right?
  • 11:00This is definitely an ignorable selection mechanism
  • 11:02in the sense that it only depends
  • 11:04on observed characteristics.
  • 11:06But if my four animals had just come from,
  • 11:09I don't know where?
  • 11:10Convenience.
  • 11:11Well now why did they end up in my sample?
  • 11:14It could depend on something that we didn't observe.
  • 11:16What breed of dog it was?
  • 11:18The age of the dog, the color of the dog.
  • 11:20It could have been pretty much anything, right?
  • 11:22That's the problem with the convenient sample.
  • 11:24You don't know why those units
  • 11:25often self-selected into your sample.
  • 11:29So now I'm gonna head into the kind of ugly statistical
  • 11:32notation portion of this talk.
  • 11:35So we'll start with estimated proportions.
  • 11:37So we'll use Y as our binary indicator
  • 11:41for the outcome, okay?
  • 11:43But here I'm gonna talk about Y
  • 11:45more generally as all the survey data.
  • 11:49So we'll start with Y as all the survey data,
  • 11:50then we're gonna narrow it down to Y
  • 11:51as the binary indicator?
  • 11:53So we can partition our survey data into the data
  • 11:57for the units we got in the sample
  • 11:58and the data for units that are not in the sample.
  • 12:01So, selected into the sample versus
  • 12:03not selected into the sample.
  • 12:05But for everybody I have Z,
  • 12:07I have some fully observed
  • 12:09what are often called design variables.
  • 12:11So this is where we are using information
  • 12:14that we know about an entire population
  • 12:16to select our sample in the world of probability sampling.
  • 12:20And then S is the selection indicator.
  • 12:23So these three variables have a joint distribution.
  • 12:26And most of the time,
  • 12:27what we care about is Y given Z.
  • 12:30Right, we're interested in estimating
  • 12:32some outcome characteristic
  • 12:34conditional on some other characteristic, right?
  • 12:37Average weight for dogs, average weight for cats, right?
  • 12:40Y given Z.
  • 12:42But Y given Z is only part of the issue,
  • 12:45there's also a selection mechanism, right?
  • 12:48So there's also this function
  • 12:49of how do you predict selection S with Y and Z.
  • 12:53And I'm using this additional Greek letter psi here
  • 12:56to denote additional variables
  • 12:58that might be involved, right?
  • 13:00'Cause selection could depend on more than just Y and Z.
  • 13:03It could depend on something outside
  • 13:04of that set of variables.
  • 13:07So when we have probability sampling,
  • 13:08we have what's called
  • 13:09an extremely ignorable selection mechanism,
  • 13:12which means selection can depend on Z,
  • 13:14like when we stratified on animal type
  • 13:16but it cannot depend on Y.
  • 13:18Either the selected units Y or the excluded units Y
  • 13:22doesn't depend on either.
  • 13:24Kind of vaguely like the MCAR of selection mechanisms.
  • 13:27It doesn't depend on Y at all.
  • 13:29Observed or unobserved.
  • 13:31But it can depend on Z.
  • 13:31So that makes it different than MCAR.
  • 13:34So including a unit into the sample
  • 13:36is independent of those survey outcomes Y
  • 13:39and also any unobserved variables, right?
  • 13:41That psi here, that psi goes away.
  • 13:44So selection only depends on Z.
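Putting that setup into symbols (my notation, not necessarily the slide's): Y splits into the data for selected and excluded units, Z are the fully observed design variables, S is the selection indicator, and psi stands for any other variables that might drive selection.

```latex
\underbrace{p(Y \mid Z)}_{\text{inference target}}
\qquad\qquad
\underbrace{p(S \mid Y, Z, \psi)}_{\text{selection mechanism}}

% Probability sampling (ignorable selection): S may depend on Z but not on Y or psi,
%   p(S \mid Y, Z, \psi) = p(S \mid Z),
% so the selection mechanism can be ignored when the target is p(Y | Z).
```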
  • 13:46So if I'm interested in this inference target
  • 13:49I can ignore the selection mechanism.
  • 13:51So this kind of parallels that idea
  • 13:54in the missingness, in the missing data literature, right?
  • 13:56If I have an ignorable missingness mechanism
  • 13:59I can ignore that part of it.
  • 14:00I don't have to worry about modeling
  • 14:02the probability that a unit is selected.
  • 14:05But the bad news is, in our non-probability sampling,
  • 14:08it's very, very arguably true
  • 14:11that you could have non-ignorable selection, right?
  • 14:13It's easy to make an argument for why the people
  • 14:16who ended up into your sample,
  • 14:18your convenient sample are different than the people
  • 14:20who don't enter your sample.
  • 14:22Think about some of these big data examples.
  • 14:24Think about Twitter data.
  • 14:26Well, I mean, you know,
  • 14:27the people who use Twitter are different
  • 14:29than the people who don't use Twitter, right?
  • 14:31In lots of different ways.
  • 14:32So if you're going to think about drawing
  • 14:34some kind of inference about the population,
  • 14:36you can't just ignore that selection mechanism.
  • 14:39You need to think about how do they enter
  • 14:41into your Twitter sample
  • 14:42and how might they be different than the people
  • 14:44who did not enter into your Twitter sample.
  • 14:47So when we're thinking about the selection mechanism
  • 14:49basically nothing goes away, right?
  • 14:51We can't ignore this selection mechanism.
  • 14:53But we have to think
  • 14:55about it when we want to make inference,
  • 14:56even when our inference is about Y given Z, right?
  • 14:59Even when we don't actually care
  • 15:00about the selection mechanism.
  • 15:02So the problem with non-probability samples
  • 15:04is that it's often very, very hard to model S
  • 15:07or we don't really have a good set of data
  • 15:10with which to model the probability
  • 15:11someone ended up in your sample.
  • 15:14And that's basically what you have to do to generalize
  • 15:17to the population, right?
  • 15:19The methods that exist for non-probability samples
  • 15:21require you to do something along the lines
  • 15:24of finding another dataset
  • 15:26that has similar characteristics
  • 15:27and model the probability of being in the probability
  • 15:30sample, right?
  • 15:31So that's doable in many situations
  • 15:34but what we're looking for is a method
  • 15:35that doesn't require you to do that
  • 15:37but instead says, let's do a sensitivity analysis.
  • 15:40Let's say, how big of a problem
  • 15:43might selection bias be if we ignored
  • 15:46the selection mechanism, right?
  • 15:47If we just sort of took our sample on faith
  • 15:49as if it were an SRS from the population.
  • 15:52How wrong would we be
  • 15:54depending on how bad our selection bias problem is?
  • 15:59So there has been previous work done
  • 16:00in this area, in surveys often.
  • 16:03Try to think about how confident
  • 16:06are we that we can generalize to the population
  • 16:08even when we're doing a probability sample.
  • 16:10So there's work on thinking about the representativeness
  • 16:14of a sample.
  • 16:15So that's again, the generalizability to the population.
  • 16:18So there's something called an R-indicator,
  • 16:21which is a function of response probabilities
  • 16:25or propensities,
  • 16:26but it doesn't involve the survey variables.
  • 16:28So it's literally comparing the probability of response
  • 16:32to a survey for different demographic,
  • 16:34across different demographic characteristics, for example.
  • 16:37Right.
  • 16:38And seeing who is more likely to respond than who else.
  • 16:40And if there are those differences
  • 16:41then adjustments need to be made.
  • 16:44There's also something called the H1 indicator,
  • 16:47which does bring Y into the equation
  • 16:49but it assumes ignorable selection.
  • 16:52So it's going to assume that the Y
  • 16:54excluded gets dropped out.
  • 16:58The selection mechanism is only depends
  • 16:59on things that you observe, so you can ignore it, right?
  • 17:03So it's ignorable.
  • 17:04So that's not what we're interested in.
  • 17:06'Cause we're really worried in the non probability space
  • 17:09that we can't ignore the selection mechanism.
  • 17:13And there is a relatively new indicator
  • 17:15that they called the SMUB, S-M-U-B.
  • 17:19That is an index that actually extends
  • 17:21this idea of selection bias
  • 17:23to allow for non ignorable selection.
  • 17:25So it lets you say, well, what would my point estimate
  • 17:29be for a mean if selection were in fact ignorable,
  • 17:33and now let's go to the other extreme,
  • 17:35suppose selection only depends on Y.
  • 17:37And I'm trying to estimate average weight
  • 17:39and whether or not you entered my sample
  • 17:40is entirely dependent on your weight.
  • 17:43That's really not ignorable.
  • 17:45And then it kinda bounds the potential magnitude
  • 17:47for the problem.
  • 17:49So that SMUB, this estimator is really close
  • 17:52to what we want, but we want it for proportions,
  • 17:55especially because in survey work and in large datasets,
  • 18:00we very often have categorical data
  • 18:03or very, very often binary data.
  • 18:05If you think about if you've ever participated
  • 18:07in an online survey or filled out those kinds of things
  • 18:10very often, right, You're checking a box.
  • 18:11It's multiple choice, select all that apply.
  • 18:13It's lots and lots of binary data floating around out there.
  • 18:17And I'll show you a couple of examples.
  • 18:19So that was a lot of kind of me talking
  • 18:22at you about the framework.
  • 18:23Now, let me bring this down to a solid example application.
  • 18:27So I'm going to use the National Survey
  • 18:30of Family Growth as a fake population.
  • 18:32So I want you to pretend that I have a population
  • 18:36of 19,800 people, right?
  • 18:38It happens to be that I pulled it from the National Survey
  • 18:40of Family Growth,
  • 18:41that's not really important that that was the source.
  • 18:43I've got this population of about 20,000 people.
  • 18:46But let's pretend we're doing a study
  • 18:48and I was only able to select
  • 18:50into my sample smartphone users.
  • 18:52Because I did some kind of a survey where
  • 18:54you had to take it on your phone.
  • 18:56So if you did not have a smartphone
  • 18:57you could not be selected into my sample.
  • 19:00In this particular case, in this fake population,
  • 19:03it's a very high selection fraction.
  • 19:04So about 80% of my population is in my sample.
  • 19:07That in and of itself is very unusual, right?
  • 19:11A non-probability sample is usually very,
  • 19:13very small compared to the full population
  • 19:15let's say of the United States
  • 19:17if that's who we're trying to generalize to.
  • 19:18But for the purposes of illustration
  • 19:20it helps to have a pretty high selection fraction.
  • 19:22And we'll assume that the outcome we're interested
  • 19:24in is whether or not the individual has ever been married.
  • 19:28So this is person level data, right?
  • 19:29Ever been married.
  • 19:31And it is...
  • 19:32we wanna estimate it by gender,
  • 19:34and I will note that the NSFG only calculates,
  • 19:36or only captures, gender as a binary variable.
  • 19:39This is a very long standing survey,
  • 19:41been going on since the seventies.
  • 19:42We know our understanding of gender as a construct
  • 19:45has grown a lot since the seventies
  • 19:47but this survey, and in fact
  • 19:48many governmental surveys still treat gender
  • 19:51as a binary variable.
  • 19:52So that's our limitation here
  • 19:54but I just want to acknowledge that.
  • 19:56So in this particular case,
  • 19:58we know the true selection bias, right?
  • 20:00Because I actually have all roughly 20,000 people
  • 20:04so that therefore I can calculate what's the truth,
  • 20:06and then I can use my smartphone sample and say,
  • 20:08"Well, how much bias is there?"
  • 20:11So it turns out that in the full sample
  • 20:1346.8% of the females have never been married.
  • 20:16And 56.6% of the males had never been married.
  • 20:20But if I use my selected sample of smartphone users
  • 20:23I'm getting a, well, very close,
  • 20:25but slightly smaller estimate for females.
  • 20:2846.6% never married.
  • 20:30And for males it's like about a percentage
  • 20:32point lower than the truth, 55.5%.
  • 20:35So not a huge amount of bias here.
  • 20:38My smartphone users are not all that non-representative
  • 20:41with respect to the entire sample,
  • 20:43at least with respect to whether
  • 20:44or not they've ever been married.
  • 20:47So when we have binary data,
  • 20:49an important point of reference is what happens if we assume
  • 20:53everybody not in my sample is a one, right?
  • 20:55What if everybody not in my sample was never married
  • 20:58or everyone not in my sample
  • 21:01is a no to never married, right?
  • 21:03So like has, has ever been married?
  • 21:05And these are what's called the Manski bounds.
  • 21:07When you fill in all zeros or fill in all ones
  • 21:10for the missing values or the values
  • 21:12for those non-selected folks.
  • 21:14So we can bound the bias.
  • 21:15So the bias of this estimate of 0.466, or 46.6%,
  • 21:21has to be by definition
  • 21:23between negative 0.098 and positive 0.085.
  • 21:26Because those are the two ends of putting all zeros
  • 21:29or all ones for the people who are not in my sample.
  • 21:32So this is unlike a continuous variable, right?
  • 21:35Where we can't actually put a finite bound on the bias.
  • 21:38We can with a proportion, right?
  • 21:40So this is why, for example,
  • 21:42if any of you ever work on smoking cessation studies
  • 21:45often they do sensitivity analysis.
  • 21:47People who drop out assume they're all smoking, right?
  • 21:50Or assume they're all not smoking.
  • 21:51They're not calling it that
  • 21:53but they're getting the Manski bounds.
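A small sketch of those fill-in-all-zeros / fill-in-all-ones bounds for a proportion; the selected-sample proportion and the selected fraction below are round stand-ins, not the exact NSFG-by-gender values:

```python
# Manski-style bounds on the selection bias of a proportion estimated
# from the selected sample only.
p_sel = 0.466   # proportion (e.g., never married) among selected units -- illustrative
f = 0.80        # selected fraction n / N -- illustrative

p_if_all_zeros = f * p_sel              # every non-selected unit is a 0
p_if_all_ones  = f * p_sel + (1 - f)    # every non-selected unit is a 1

# bias = (selected-sample estimate) - (true population proportion)
bias_lower = p_sel - p_if_all_ones      # most negative the bias can be
bias_upper = p_sel - p_if_all_zeros     # most positive the bias can be
print(bias_lower, bias_upper)           # the true bias must fall in this interval
```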
  • 21:56Okay.
  • 21:57So the question is, can we do better than the Manski bounds?
  • 22:00Because these are actually pretty wide bounds,
  • 22:02relative to the size of the true bias,
  • 22:04and these are very wide.
  • 22:06And imagine a survey where we didn't have 80% selected.
  • 22:10What if we had 10% selected?
  • 22:12Well, then the Manski bounds are gonna be useless, right?
  • 22:14Plug in all zeros, plug in all ones,
  • 22:16you're gonna get these insane estimates
  • 22:17that are nowhere close to what you observed.
  • 22:21So going back to the statistical notation,
  • 22:23this is where I said we're going to use Y
  • 22:24in a slightly different way.
  • 22:26Now, Y, from now forward, is the binary variable of interest.
  • 22:30So in this case, in this NSFG example
  • 22:33it was never married.
  • 22:35We have a bunch of auxiliary variables that we observed
  • 22:38for everybody in the selected sample;
  • 22:41age, race, education, et cetera,
  • 22:43and I'm gonna call those Z.
  • 22:48Assume also that we have summary statistics
  • 22:51on Z for the non-selected cases.
  • 22:53So I don't observe Z for everybody, right?
  • 22:55All my non-smartphone users,
  • 22:57I don't know for each one of them, what is their gender?
  • 23:00What is their age? What is their race?
  • 23:02But I don't actually observe that.
  • 23:03But I observed some kinda summary statistic.
  • 23:06Like a mean vector and a covariance matrix of Z.
  • 23:09So I have some source of what does my population
  • 23:12look like at an aggregate level?
  • 23:14And in practice, this would come from something
  • 23:16like census data or in a very large probability sample,
  • 23:20something where we would be pretty confident
  • 23:21this is reflective of the population.
  • 23:23I will note that if we have data for the population
  • 23:27and not the non-selected,
  • 23:29then we can kinda do subtraction, right?
  • 23:30We can take the data for the population
  • 23:32and aggregate and go backwards
  • 23:35to figure out what it would be for the non-selected
  • 23:36by effectively backing out the selected cases.
  • 23:40And similarly another problem
  • 23:42is that we don't have the variance.
  • 23:43We could just assume it's what we observe
  • 23:44in the selected cases.
  • 23:46So how are we gonna use this in order
  • 23:48to estimate selection bias?
  • 23:52We're gonna come up
  • 23:53with this measure of unadjusted bias for proportions
  • 23:56called the MUBP.
  • 23:59So the MUBP is an extension of the SMUB
  • 24:02that was for means, for continuous variables
  • 24:04to binary outcomes, right?
  • 24:06To proportions.
  • 24:07High-level, it's based on pattern-mixture models.
  • 24:10It requires you to make explicit assumptions
  • 24:13about the distribution of the selection mechanism,
  • 24:15and it provides you a sensitivity analysis,
  • 24:18basically make different assumptions on S,
  • 24:20I don't know what that distribution is,
  • 24:22and you're gonna get a range of bias.
  • 24:24So that's that idea of how wrong might we be?
  • 24:28So we're trying to just tighten those bounds
  • 24:30compared to the Manski bounds.
  • 24:31Where we don't wanna have to rely on plug in all zeros,
  • 24:33plug in all ones,
  • 24:35we wanna shrink that interval
  • 24:36to give us something a little bit more meaningful.
  • 24:38So the basic idea behind how this works
  • 24:41before I show you the formulas is we can measure
  • 24:44the degree of selection bias in Z, right?
  • 24:47Because we observed Z for our selected sample,
  • 24:50and we observed at an aggregate for the population.
  • 24:53So I can see, for example, that if in my selected sample,
  • 24:56I have 55% females but in the population it's 50% females.
  • 25:01Well, I can see that bias.
  • 25:03Right, I can do that comparison.
  • 25:04So absolutely I can tell you how much selection bias
  • 25:08there is for all of my auxiliary variables.
  • 25:11So if my outcome Y is related to my Zs
  • 25:16then knowing something about the selection bias in Z
  • 25:19tells me something about the selection bias in Y.
  • 25:22It doesn't tell me exactly the selection bias in Y
  • 25:25but it gives me some information in the selection bias in Y.
  • 25:28So in the extreme imagine if your Zs
  • 25:32in your selected sample
  • 25:33in aggregate looked exactly like the population.
  • 25:36Well, then you'd be pretty confident, right?
  • 25:40That there's not an enormous amount of selection bias
  • 25:42in Y assuming that Y was related to the Z.
  • 25:46So we're gonna use pattern-mixture models
  • 25:48to explicitly model that distribution of S, right?
  • 25:52And we're especially gonna focus on the case
  • 25:54when selection depends on Y.
  • 25:56It depends on our binary outcome of interest.
  • 26:00So again, Y is that binary variable of interest,
  • 26:03we only have it for the selected sample.
  • 26:05In the NSFG example it's whether the woman or man
  • 26:08has ever been married.
  • 26:10We have Z variables available for the selected cases
  • 26:13in micro data and an aggregate for the non-selected sample,
  • 26:16the demographic characteristics
  • 26:18like age, education, marital status, et cetera.
  • 26:22And the way that we're gonna go
  • 26:24about doing this is we're gonna try
  • 26:25to get back to the idea of normality,
  • 26:27because then as you all know, when everything's normal
  • 26:30it's great, right?
  • 26:32It's easy to work with the normal distribution.
  • 26:34So the way we can do that with a binary variable
  • 26:37is we can think about latent variables.
  • 26:39So we're going to think about a latent variable called U.
  • 26:42That is an underlying, unobserved latent variable.
  • 26:45So unobserved for everybody, including our selected sample.
  • 26:48And it's basically thresholded.
  • 26:50And when U crosses zero, well, then Y goes from zero to one.
  • 26:54So I'm sure many, all of you have seen probit regression,
  • 26:58or this is what happens
  • 26:59and this is how probit regression is justified,
  • 27:01via latent variables.
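In symbols, this is the standard latent-variable formulation of probit regression (textbook form, not specific to these slides):

```latex
U = Z\beta + \varepsilon, \qquad \varepsilon \sim N(0,1), \qquad Y = \mathbf{1}\{U > 0\}
\quad\Longrightarrow\quad \Pr(Y = 1 \mid Z) = \Phi(Z\beta).
```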
  • 27:04So we're going to take our Zs
  • 27:06that we have for the selected cases,
  • 27:08and essentially reduce the dimensionality.
  • 27:11We're gonna take the Zs,
  • 27:13run a probit regression of Y on Z in the selected cases,
  • 27:17and pull out the linear predictor
  • 27:19from the regression, right?
  • 27:20The X beta, right?
  • 27:22Sorry, Z beta.
  • 27:24And I'm gonna call that X.
  • 27:25That is my proxy for Y or my Y hat, right?
  • 27:30It's just the predicted value from the regression.
  • 27:32And I can get that for every single observation
  • 27:35in my selected sample, of course, right?
  • 27:37Just plug in each individual's Z values
  • 27:39and get out their Y hat.
  • 27:40That's my proxy value.
  • 27:42And it's called the proxy
  • 27:44because it's the prediction, right?
  • 27:45It's our sort of best guess at Y
  • 27:47based on this model.
  • 27:49So I can get it for every observation in my selected sample,
  • 27:52but very importantly I can also get it on average
  • 27:56for the non-selected sample.
  • 27:57So I have all my beta hats for my probit regression,
  • 28:01and I'm gonna plug in Z-bar.
  • 28:03And I'm going to plug in the average value of my Zs.
  • 28:06And that's going to give me the average value
  • 28:08of X for the non-selected cases.
  • 28:11I don't have an actual observed value
  • 28:13for all those non-selected cases
  • 28:15but I have the average, right?
  • 28:16So I could think about comparing the average Z value
  • 28:19in the aggregate, in the non-selected cases
  • 28:22to that average Z among my selected cases.
  • 28:24And that is of course
  • 28:26exactly where we're gonna get this index from.
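A rough sketch of that proxy construction, assuming you have the outcome and covariates for the selected cases plus a vector of covariate means for the non-selected cases; the function and variable names here are mine, not from the talk:

```python
import numpy as np
import statsmodels.api as sm

def make_proxy(y_sel, Z_sel, zbar_nonsel):
    """Probit proxy X = Z @ beta_hat for each selected case, plus the average
    proxy for the non-selected cases obtained by plugging in their Z means."""
    Z1 = sm.add_constant(np.asarray(Z_sel, dtype=float))   # add an intercept column
    fit = sm.Probit(np.asarray(y_sel), Z1).fit(disp=0)     # probit regression of Y on Z, selected cases only
    beta = np.asarray(fit.params)
    x_sel = Z1 @ beta                                       # linear predictor = proxy for each selected unit
    xbar_nonsel = np.concatenate(([1.0], np.asarray(zbar_nonsel, dtype=float))) @ beta
    return x_sel, xbar_nonsel                               # compare x_sel.mean() with xbar_nonsel
```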
  • 28:29So I have my selection indicator S,
  • 28:31so in the smartphone example,
  • 28:33that's S equals one for the smartphone users
  • 28:35and S equals zero for the non-smartphone users
  • 28:37who weren't in my sample.
  • 28:39And importantly, I'm going to allow
  • 28:40there to be some other covariates V
  • 28:43floating around in here that are independent of Y and X
  • 28:46but could be related to selection.
  • 28:48Okay.
  • 28:49So it could be related to how you got into my sample
  • 28:51but importantly, not related to the outcome.
  • 28:55So diving into the math here, the equations,
  • 28:59we're gonna assume a proxy pattern-mixture model for U,
  • 29:02the latent variable underlying Y
  • 29:05and X given the selection indicator.
  • 29:08So what a pattern-mixture model does is it says
  • 29:11there's a totally separate distribution
  • 29:14or joint distribution of Y and X for the selected units
  • 29:16and the non-selected units.
  • 29:18Notice that all my mus, all my sigmas, my rho,
  • 29:21they've all got a superscript of j, right?
  • 29:23So that's whether your S equals zero or S equals one.
  • 29:27So two totally different bi-variate normal distributions
  • 29:31for Y and X,
  • 29:33depending on if you're selected or non-selected.
  • 29:35And then we have a marginal distribution
  • 29:37just Bernoulli, for the selection indicator.
  • 29:40However, I'm sure you all immediately are thinking,
  • 29:43"Well, that's great,
  • 29:45"but I don't have any information to estimate
  • 29:47"some of these parameters for the non-selected cases."
  • 29:51Clearly, for the selected cases, right?
  • 29:53S equals one.
  • 29:54I can estimate all of these things.
  • 29:55But I can't estimate them for the non-selected sample
  • 29:58because I might observe X-bar
  • 30:01but I don't observe anything having to do with you.
  • 30:03'Cause I have no Y information.
  • 30:06So in order to identify this model
  • 30:08and be able to come up with estimates
  • 30:09for all of these parameters,
  • 30:10we have to make an assumption about the selection mechanism.
  • 30:13So we assume that the probability of selection
  • 30:16into my sample is a function of U.
  • 30:19So we're allowing it to be not ignorable.
  • 30:21Remember that's underlying Y and X,
  • 30:23that proxy which is a function of Z.
  • 30:25So that's observed and V, those other variables.
  • 30:30And in particular, we're assuming
  • 30:31that it's this funny looking form of combination
  • 30:34of X and U.
  • 30:35That depends on this sensitivity parameter phi.
  • 30:38So it's one minus phi times X
  • 30:41plus phi times U.
  • 30:43So that's essentially weighting
  • 30:45the contributions of those two pieces.
  • 30:47How much of selection is dependent
  • 30:49on the thing that I observe
  • 30:50or the proxy builds off the auxiliary variables
  • 30:53and how much of it is depending on the underlying latent U
  • 30:56related to Y,
  • 30:57that is definitely not observed
  • 30:58for the non-selected.
  • 31:00Okay.
  • 31:01And there's a little X star here,
  • 31:02that's sort of a technical detail.
  • 31:03We're rescaling the proxy.
  • 31:05So it has the same variance as U,
  • 31:07very unimportant mathematical detail.
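One way to write down the proxy pattern-mixture model and the selection assumption being described; this is my transcription of the structure, so the notation may differ from the slides. Superscript j indexes selection status (j = 1 selected, j = 0 not selected), and X* is the rescaled proxy:

```latex
\begin{aligned}
(U, X) \mid S = j &\sim N_2\!\left(
  \begin{pmatrix} \mu_U^{(j)} \\ \mu_X^{(j)} \end{pmatrix},
  \begin{pmatrix} \sigma_U^{(j)2} & \rho^{(j)}\sigma_U^{(j)}\sigma_X^{(j)} \\
                  \rho^{(j)}\sigma_U^{(j)}\sigma_X^{(j)} & \sigma_X^{(j)2} \end{pmatrix}\right),
\qquad S \sim \mathrm{Bernoulli}(\pi),\\[4pt]
\Pr(S = 1 \mid U, X^{*}, V) &= g\big((1-\phi)\,X^{*} + \phi\,U,\; V\big),
\qquad \phi \in [0, 1].
\end{aligned}
```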
  • 31:10So we have this joint distribution
  • 31:13that is conditional on selection status.
  • 31:16And so, in addition to needing that one assumption
  • 31:19to identify things,
  • 31:20we also have the latent variable problem.
  • 31:22So latent variables do not have separately identifiable
  • 31:24mean and variance, right?
  • 31:26So that's just...
  • 31:27Outside of the scope of this talk
  • 31:29that's just a fact, right?
  • 31:30So without loss of generality
  • 31:31we're gonna set the variance of the latent variable
  • 31:34for the selected sample equal to one.
  • 31:35So it's just the scale of the latent variable.
  • 31:38So what we actually care about is a function of U, right?
  • 31:42It's the probability Y equals one marginally
  • 31:45in my entire population.
  • 31:46And so the probability Y equals one,
  • 31:48is a probability U is greater than zero.
  • 31:50That's that relationship.
  • 31:51And so it's a weighted average of the proportion
  • 31:55in the selected sample
  • 31:56and the proportion in the non-selected sample, right?
  • 32:00These are just...
  • 32:01If U has this normal distribution
  • 32:02this is how we get down to the probability
  • 32:04U is greater than zero.
  • 32:05Like those are those two pieces.
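In symbols (same caveat about notation), with the selected-case latent variance fixed at one, the overall proportion is the selection-weighted average of two probit probabilities:

```latex
\Pr(Y = 1) = \Pr(U > 0)
 = \pi\,\Phi\!\big(\mu_U^{(1)}\big) + (1-\pi)\,\Phi\!\left(\frac{\mu_U^{(0)}}{\sigma_U^{(0)}}\right),
 \qquad \sigma_U^{(1)} = 1 .
```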
  • 32:08So the key parameter that governs
  • 32:10how this MUBP works is a correlation, right?
  • 32:14It's the strength of the relationship between Y
  • 32:17and your covariates.
  • 32:18How good of a model do you have for Y, right?
  • 32:22So remember we think back to that example
  • 32:24of what if I had no bias in Z.
  • 32:26Or if Y wasn't related to Z,
  • 32:28well, then who cares that there is no bias in Z.
  • 32:32But we want there to be a strong relationship
  • 32:34between Z and Y so that we can kind of infer from Z to Y.
  • 32:40So that correlation in this latent variable framework
  • 32:43is called the biserial correlation of the binary X
  • 32:46and the continuous.
  • 32:47I mean, sorry, the binary Y and the continuous X, right?
  • 32:50There's lots of different flavors of correlation,
  • 32:53biserial is the name for this one
  • 32:55that's a binary Y and a continuous X
  • 32:57when we're thinking about the latent variable framework.
  • 33:00Importantly, you can estimate
  • 33:01this in the selected sample, right?
  • 33:04So I can estimate the correlation between U and X
  • 33:06among the selected sample.
  • 33:07I can't for the non-selected sample,
  • 33:09of course, but I can for the selected sample.
  • 33:12So the non-identifiable parameters
  • 33:14of that pattern-mixture model, here they are.
  • 33:15Like the mean for the latent variable,
  • 33:17the variance for the latent variable
  • 33:19and that correlation for the non-selected sample
  • 33:22are in fact identified when we make this assumption
  • 33:24on the selection mechanism.
  • 33:26So let's think about some concrete scenarios.
  • 33:30What if phi was zero?
  • 33:32If phi is zero,
  • 33:33we look up here at this part of the formula,
  • 33:35well, then U drops out of it.
  • 33:38So therefore selection only depends on X
  • 33:40and those extra variables V that don't really matter
  • 33:43because V isn't related to X or Y.
  • 33:46This is an ignorable selection mechanism, okay.
  • 33:50If on the other hand phi is one,
  • 33:52well, then it entirely depends on U.
  • 33:54X doesn't matter at all.
  • 33:55This is your worst, worst, worst case scenario, right?
  • 33:58Where whether or not you're in my sample only depends
  • 34:00on U and therefore only depends on the value of Y.
  • 34:04And so this is extremely not ignorable selection.
  • 34:07And of course the truth is likely to lie
  • 34:10somewhere in between, right?
  • 34:11Some sort of non-ignorable mechanism,
  • 34:13a phi between zero and one, so that U matters
  • 34:16but it's not the only thing that matters.
  • 34:18Right, that X matters as well.
  • 34:20Okay.
  • 34:21So this is a kind of moderate,
  • 34:22non-ignorable selection.
  • 34:23That's most likely the closest to reality
  • 34:26with these non-probability samples.
  • 34:30So for a specified value of phi.
  • 34:33So we pick a value for our sensitivity parameter.
  • 34:35There's no information in the data about it.
  • 34:36We just pick it and we can actually estimate the mean of Y
  • 34:40and compare that to the selected sample proportion.
  • 34:43So we take this selected sample proportion,
  • 34:45subtract what we get as the truth
  • 34:47for that particular value of phi,
  • 34:50and that's our measure of bias, right?
  • 34:52So this second piece that's being subtracted
  • 34:54here depends on phi.
  • 34:55Right, it depends on what your value
  • 34:57of your selection parameter,
  • 34:58or rather your sensitivity parameter, is.
  • 35:01So in a nutshell, pick a selection mechanism
  • 35:03by specifying phi,
  • 35:06estimate the overall proportion,
  • 35:07and then subtract to get your measure of bias.
  • 35:10And again, we don't know whether we're getting
  • 35:12the right answer because it's depending
  • 35:14on the sensitivity parameter
  • 35:15but it's at least going to allow us to bound the problem.
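Here is my reconstruction of the structure of the index she walks through next, following the verbal description (the exact slide notation may differ): the selected-sample proportion minus the model-implied overall proportion, where the non-selected latent mean and variance get shifted by a factor that depends on phi and the selected-sample biserial correlation, times the standardized difference in the proxy:

```latex
\begin{aligned}
\mathrm{MUBP}(\phi) &= p^{(1)} - \left[\pi\,p^{(1)} + (1-\pi)\,
   \Phi\!\left(\frac{\hat\mu_U^{(0)}}{\hat\sigma_U^{(0)}}\right)\right],
   \qquad \hat\mu_U^{(1)} = \Phi^{-1}\!\big(p^{(1)}\big),\\[4pt]
\hat\mu_U^{(0)} &= \hat\mu_U^{(1)} +
   \underbrace{\frac{\phi + (1-\phi)\hat\rho^{(1)}}{\phi\,\hat\rho^{(1)} + (1-\phi)}}_{\text{the ``blue'' factor}}
   \;
   \underbrace{\frac{\bar x^{(0)} - \bar x^{(1)}}{\hat\sigma_X^{(1)}}}_{\text{the ``red'' factor: bias in the proxy}},\\[4pt]
\hat\sigma_U^{(0)2} &= 1 +
   \left(\frac{\phi + (1-\phi)\hat\rho^{(1)}}{\phi\,\hat\rho^{(1)} + (1-\phi)}\right)^{2}
   \frac{\hat\sigma_X^{(0)2} - \hat\sigma_X^{(1)2}}{\hat\sigma_X^{(1)2}} ,
\end{aligned}
```

where the "blue" factor equals rho when phi = 0 and 1/rho when phi = 1, which is exactly the behavior described below.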
  • 35:19So the formula is quite messy,
  • 35:21but it gives some insight into how this index works.
  • 35:24So this measure of bias is the selected sample
  • 35:27mean minus that estimator, right?
  • 35:29This is the overall mean of Y
  • 35:32based on those latent variables.
  • 35:34And what gets plugged in here
  • 35:36importantly for the mean
  • 35:37and the variance for the non-selected cases
  • 35:39depends on a component that I've got colored blue here,
  • 35:42and a component that I've got color red.
  • 35:44So if we look at the red piece
  • 35:46this is the comparison of the proxy mean for the unselected
  • 35:49and the selected cases.
  • 35:50This is that bias in Z.
  • 35:52The selection bias in Z,
  • 35:54and it's just been standardized
  • 35:55by its estimated variance, right?
  • 35:57So that's how much selection bias
  • 35:59was present in Z via X, right.
  • 36:02Via using it to predict in the appropriate regression.
  • 36:05Similarly, down here, how different is the variance
  • 36:08of the selected and unselected cases for X.
  • 36:10How much bias, selection bias is there in estimating
  • 36:13the variance?
  • 36:14So we're going to use that difference
  • 36:16and scale the observed mean, right?
  • 36:19There's the estimated mean of U
  • 36:22in the selected sample, and how much it's gonna shift
  • 36:24by depends on the selection,
  • 36:26I mean, the sensitivity parameter phi,
  • 36:29and also that by serial correlation.
  • 36:31So this is why the by serial correlation is so important.
  • 36:34It is gonna dominate how much of the bias
  • 36:37in X we're going to transfer over into Y.
  • 36:42So if phi were zero,
  • 36:44so if we wanna assume
  • 36:45that it is an ignorable selection mechanism,
  • 36:48then this thing in blue here,
  • 36:50think about plugging zero here, zero here, zero everywhere,
  • 36:52is just gonna reduce down to that correlation.
  • 36:55So we're gonna shift the mean of U
  • 36:56for the non-selected cases
  • 36:59based on the correlation times that difference in X.
  • 37:03Whereas if we have phi equals one,
  • 37:06this thing in blue turns into one over the correlation.
  • 37:10So here is where thinking about the magnitude
  • 37:12of the correlation helps.
  • 37:13If the correlation is really big, right?
  • 37:15If the correlation is 0.8, 0.9,
  • 37:17something really large than phi and...
  • 37:20I mean, sorry, then rho and one over rho
  • 37:22are very close, right?
  • 37:230.8 and 1/0.8 are pretty close.
  • 37:26So if we're thinking about bounding this between phi
  • 37:29equals zero and equals one,
  • 37:30our interval is gonna be relatively small.
  • 37:33But if the correlation is small,
  • 37:35the correlation were 0.2, oh, oh, right?
  • 37:37We're gonna get a really big interval
  • 37:39because that correlation,
  • 37:40we're gonna shift with the factor of multiplied by 0.2
  • 37:43but then one over 0.2.
  • 37:44That's gonna be a really big shift
  • 37:46in that mean of the latent variable U
  • 37:48and therefore the mean of Y.
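A small numerical sketch of that behavior, assuming the formula structure written out above and some made-up summary statistics; it just shows the phi = 0 and phi = 1 endpoints tightening when the correlation is strong and widening when it is weak:

```python
import numpy as np
from scipy.stats import norm

def mubp(phi, p_sel, rho, d_x, var_ratio=1.0, f_sel=0.8):
    """Sketch of MUBP(phi) from summary statistics (structure as described in the
    talk; details may differ from the published index).
      p_sel     : proportion of Y = 1 in the selected sample
      rho       : biserial correlation of Y and the proxy X, selected sample
      d_x       : standardized difference in proxy means, (xbar_ns - xbar_s) / sd_X_s
      var_ratio : ratio of non-selected to selected proxy variances
      f_sel     : selected fraction n / N
    """
    g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))  # rho at phi=0, 1/rho at phi=1
    mu_u1 = norm.ppf(p_sel)                                # latent mean, selected (variance fixed at 1)
    mu_u0 = mu_u1 + g * d_x                                # shifted latent mean, non-selected
    var_u0 = 1.0 + g**2 * (var_ratio - 1.0)                # shifted latent variance, non-selected
    p_overall = f_sel * p_sel + (1 - f_sel) * norm.cdf(mu_u0 / np.sqrt(var_u0))
    return p_sel - p_overall                               # bias of the unadjusted proportion

for rho in (0.8, 0.2):  # strong proxy vs weak proxy
    print(rho, [round(mubp(phi, p_sel=0.466, rho=rho, d_x=-0.10), 4) for phi in (0.0, 0.5, 1.0)])
```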
  • 37:51So how do we get these estimates?
  • 37:53We have two possibilities. We can use what we call
  • 37:55modified maximum likelihood estimation.
  • 37:58It's not true maximum likelihood
  • 37:58because we estimate
  • 38:00the biserial correlation with something
  • 38:02called a two step method, right?
  • 38:04So instead of doing a full, maximum likelihood,
  • 38:07we kind of take this cheat in which we set that mean of X
  • 38:12for the selected cases equal to what we observe,
  • 38:15and then conditional on that, estimate
  • 38:16the biserial correlation.
  • 38:18Yeah.
  • 38:19And as a sensitivity analysis we would plug in zero, one,
  • 38:22and maybe 0.5 in the middle
  • 38:23as the values of the sensitivity parameter.
  • 38:26Alternatively, and we feel is a much more attractive
  • 38:29approach is to be Bayesian about this.
  • 38:31So in this MML estimation,
  • 38:34we are implicitly assuming that we know the betas
  • 38:38from that probit regression.
  • 38:39That we're essentially treating X like we know it.
  • 38:42But we don't know X, right?
  • 38:44That probit regression,
  • 38:45those parameters have error associated with them.
  • 38:47Right?
  • 38:48And you can imagine that the bigger your selected sample,
  • 38:49the more precisely you're estimating those betas,
  • 38:51that's not being reflected
  • 38:53at all in the modified maximum likelihood.
  • 38:56So instead we can be Bayesian.
  • 38:57Put non-informative priors on all the identified parameters.
  • 39:01That's gonna let those,
  • 39:02the error in those betas be propagated.
  • 39:05And so we'll incorporate that uncertainty.
  • 39:07And we can actually, additionally put a prior on phi, right?
  • 39:11So we could just say
  • 39:12let's have it be uniform across zero one.
  • 39:14Right?
  • 39:15So we can see what does it look like if we in totality,
  • 39:18if we assume that phi is somewhere evenly distributed
  • 39:20across that interval.
  • 39:22We've done other things as well.
  • 39:23We've taken like discrete priors.
  • 39:26Oh, let's put a point mass on 0.5 and one
  • 39:29or other different, right?
  • 39:30You can do whatever you want for that prior.
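As a tiny illustration of the uniform-prior-on-phi idea only (this is not the full Bayesian fit described here, which also propagates the uncertainty in the probit betas and the other identified parameters): draw phi uniformly and look at the spread it induces in the phi-dependent adjustment factor for a fixed biserial correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.7                                  # biserial correlation, treated as known here for illustration
phi = rng.uniform(0.0, 1.0, size=10_000)   # phi ~ Uniform(0, 1) prior on the sensitivity parameter
g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))  # adjustment factor: rho at phi=0, 1/rho at phi=1

print(np.quantile(g, [0.025, 0.5, 0.975]))  # spread induced purely by uncertainty about phi
```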
  • 39:33So let's go back to the example
  • 39:35see what it looks like.
  • 39:36If we have the proportion ever married for females
  • 39:38on the left and males on the right,
  • 39:40the true bias is the black dot.
  • 39:43And so the black is the true bias.
  • 39:45The little tiny diamond is the MUBP for 0.5.
  • 39:50And so that's plugging in that average value.
  • 39:52Some selection mechanism that depends on Y somewhat,
  • 39:56somewhere in the middle.
  • 39:57So we're actually coming pretty close.
  • 39:58That happens to be, that's pretty close.
  • 40:00And the intervals in green
  • 40:02are the modified maximum likelihood intervals
  • 40:04from phi equals zero to phi equals one,
  • 40:06and the Bayesian intervals are longer, right?
  • 40:08Naturally.
  • 40:09We're incorporating the uncertainty.
  • 40:11Essentially these MUBP,
  • 40:13modified maximum likelihood intervals are too short.
  • 40:15And we admit that these are too short.
  • 40:18If we plug in all zeros and all ones
  • 40:21for that small proportion of my NSFG population
  • 40:25that we aren't selected into the sample,
  • 40:27we get huge bounds relative to our indicator.
  • 40:31Right?
  • 40:32So remember when I showed you that slide, that bounded,
  • 40:34we know the bias has to be between these two values.
  • 40:37That's what's going on here.
  • 40:38That's what these two values are.
  • 40:39But using the information in Z
  • 40:41we're able to make a much, much narrower
  • 40:43estimate of where our selection bias is.
  • 40:46So we got much tighter bounds.
  • 40:48An important fact here
  • 40:49is that we have pretty good predictors.
  • 40:50Our correlation, the biserial correlation
  • 40:53is about 0.7 or 0.8.
  • 40:54So these things are pretty correlated
  • 40:56with whether you've been married, age, education, right?
  • 40:59Those things are pretty correlated.
  • 41:01Another variable in the NSFG is income.
  • 41:04So we can think about an indicator for having low income.
  • 41:08Well, as it turns out those variables
  • 41:10we have on everybody; age, education, gender,
  • 41:14those things are not actually that good of predictors,
  • 41:16of low income, very low correlation.
  • 41:19So our index reflects that.
  • 41:21So you get much wider intervals,
  • 41:23sort of closer to the Manski bounds.
  • 41:26And in fact, it's exactly returning one of those bounds.
  • 41:29The filling in all zeros bound is returned by this index.
  • 41:33So that's actually an attractive feature.
  • 41:35Right?
  • 41:36We're sort of bounded at the worst possible case
  • 41:38on one end of the bias
  • 41:40but we are still capturing the truth.
  • 41:42The Manski bounds are basically useless,
  • 41:44right in this particular case.
  • 41:47So that's a toy example.
  • 41:50Just gonna quickly show you a real example,
  • 41:53and I'm actually gonna to skip
  • 41:54over the incentive experiment,
  • 41:55which well, very, very interesting
  • 41:57is there's a lot to talk about,
  • 41:59and I'd rather jump straight to the presidential polls.
  • 42:03So there's very much in the news now,
  • 42:08and over the past several years,
  • 42:08this idea of failure of political polling
  • 42:11and this recent high profile failure
  • 42:12of pre-election polls in the US.
  • 42:15So polls are probability samples
  • 42:18but they have very, very, very low response rates.
  • 42:20I don't know how much you know about how they're done,
  • 42:21but they're very, very low response rate.
  • 42:23But think about what we're getting at in a poll,
  • 42:25a binary variable, are you going to vote for Donald Trump?
  • 42:28Yes or no?
  • 42:29Are you gonna vote for Joe Biden?
  • 42:31Yes or no?
  • 42:31These binary variables.
  • 42:32We want to estimate proportions.
  • 42:34That's what political polls aimed to do.
  • 42:36Pre-election polls.
  • 42:37So we have these political polls with these failures.
  • 42:41So we're thinking, maybe it's a selection bias problem.
  • 42:44And that there is some of this where people
  • 42:45are entering into this poll differentially,
  • 42:49depending on who they're going to vote for.
  • 42:52So think of it this way,
  • 42:53and I'm gonna use Trump as the example
  • 42:54'cause we're going to estimate,
  • 42:55I'm gonna try to estimate
  • 42:56the proportion of people who will vote
  • 42:57for Former President Trump in the 2020 election.
  • 43:02So, might Trump supporters
  • 43:04just inherently be less likely to answer the call, right?
  • 43:07To answer that poll or to refuse to answer the question
  • 43:11even conditional on demographic characteristics, right?
  • 43:13So two people who otherwise look the same
  • 43:16with respect to those Z variables, age, race, education,
  • 43:20the one who's the Trump supporter, someone might argue,
  • 43:22might be more suspicious of the government
  • 43:24and the polls, and not want to answer
  • 43:26and not come into this poll, not be selected.
  • 43:28So selection would be depending on Y.
  • 43:31So the MUBP could be used to try to adjust poll estimates.
  • 43:35Say, well, there's your estimate from the poll
  • 43:38but what if selection were not ignorable?
  • 43:40How different would our estimate
  • 43:42of the proportion voting for Trump be?
  • 43:45So in this example, our proportion of interest
  • 43:48is the percent of people who are gonna vote for Trump.
  • 43:51The sample that we used
  • 43:53are publicly available data
  • 43:54from seven different pre-election polls
  • 43:56conducted in seven different states by ABC in 2020.
  • 44:01And the way these polls work
  • 44:03is it's a random digit dialing survey.
  • 44:05So that's literally randomly dialing phone numbers.
  • 44:08Many of which get
  • 44:09thrown out 'cause they're businesses, et cetera,
  • 44:10very, very low response rates, 10% or lower.
  • 44:13Very, very, very low response rates to these kinds of polls.
  • 44:17They do, however, try to do some weighting.
  • 44:19So it's not as if they just take that sample and say,
  • 44:21there we go let's estimate the proportion for Trump.
  • 44:23They do weighting adjustments
  • 44:25and they use what's called iterative proportional fitting
  • 44:28or raking to get the distribution of key variables
  • 44:33in the sample to look like the population.
  • 44:36So they use census margins for, again,
  • 44:38it's gender as binary, unfortunately,
  • 44:40age, education, race, ethnicity, and party identification.
  • 44:45So, because we're doing this after the election
  • 44:47we know the truth.
  • 44:48We have access to the true official election outcomes
  • 44:50in each state.
  • 44:51So I know the actual proportion for Y.
  • 44:54And my population is likely voters,
  • 44:57because that's who we're trying to target
  • 44:58with these pre-election polls.
  • 44:59We wanna know what's the estimated proportion
  • 45:02that would vote for Trump among the likely voters.
  • 45:05So the tricky thing is that for that population
  • 45:07it's hard to come by summary statistics.
  • 45:10Likely voters, right?
  • 45:11It's easy to get summary statistics from all people
  • 45:13in the US or all people of voting age in the US
  • 45:16but not likely voters.
  • 45:18So here Y is an indicator for voting for Trump.
  • 45:21Z is auxiliary variable in the ABC poll.
  • 45:24So all those variables I mentioned
  • 45:25before gender age, et cetera.
  • 45:27We actually have very strong predictors
  • 45:29of Y, basically because of these political ideation,
  • 45:32party identification variables, right?
  • 45:34Not surprisingly the people who identify as Democrats,
  • 45:37very unlikely to be voting for Trump.
  • 45:41The data set that we found that can give us population level
  • 45:44estimates of the mean of Z for the non-selected sample
  • 45:48is a dataset from AP/NORC.
  • 45:50It's called their VoteCast Data.
  • 45:52And they conduct these large surveys
  • 45:55and provide an indicator of likely voter.
  • 45:58So we can basically use this dataset
  • 46:00to describe the demographic characteristics
  • 46:02of likely voters,
  • 46:04instead of just all people who are 18 and older in the US.
  • 46:09The subtle issue is of course,
  • 46:10these AP VoteCast data are not without error,
  • 46:13but we're going to pretend that they are without error.
  • 46:15And that's like a whole other paper.
  • 46:17How do we handle the fact
  • 46:17that my population data have error?
  • 46:19So we're gonna use the unweighted ABC poll data
  • 46:23as the selected sample and estimate the MUBP
  • 46:26with the Bayesian approach with phi
  • 46:27from the uniform distribution.
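For intuition, here is a minimal sketch in Python of the general structure being described, under my own simplifying assumptions: it builds a proxy from a probit fit of Y on Z in the selected sample, uses only Z-bar on the population side, and runs a sensitivity loop with phi drawn from Uniform(0, 1). The interpolation inside `adjusted_proportion` is a normal-model-style stand-in rather than the actual MUBP calculation (which uses a probit latent-variable model and, in the full Bayesian version, also draws the model parameters from their posterior), and all variable names are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def proxy_and_means(y_sel, Z_sel, zbar_pop):
    """Probit of Y on Z in the selected sample gives a linear proxy X = Z'beta.
    The proxy mean on the other side needs only Z-bar (population or
    non-selected), not unit-level data for the non-selected."""
    Zc = sm.add_constant(np.asarray(Z_sel, dtype=float))
    beta = np.asarray(sm.Probit(np.asarray(y_sel), Zc).fit(disp=0).params)
    x_sel = Zc @ beta
    xbar_pop = beta[0] + np.asarray(zbar_pop, dtype=float) @ beta[1:]
    return x_sel, xbar_pop

def adjusted_proportion(y_sel, x_sel, xbar_pop, phi):
    """Normal-model-style stand-in for the adjustment at a given phi:
    phi = 0 reproduces the ordinary regression (ignorable) adjustment,
    phi = 1 attributes selection entirely to Y and inflates the shift.
    The real MUBP replaces this with a probit latent-variable calculation
    so the proportion stays properly bounded; here we simply clip."""
    y_sel = np.asarray(y_sel, dtype=float)
    rho = np.corrcoef(y_sel, x_sel)[0, 1]
    g = (phi + (1 - phi) * rho) / (phi * rho + (1 - phi))
    shift = g * (y_sel.std() / x_sel.std()) * (xbar_pop - x_sel.mean())
    return float(np.clip(y_sel.mean() + shift, 0.0, 1.0))

def phi_sensitivity(y_sel, Z_sel, zbar_pop, n_draws=2000, seed=2020):
    """Draw phi ~ Uniform(0,1) and summarize the implied adjusted proportions."""
    rng = np.random.default_rng(seed)
    x_sel, xbar_pop = proxy_and_means(y_sel, Z_sel, zbar_pop)
    draws = np.array([adjusted_proportion(y_sel, x_sel, xbar_pop, phi)
                      for phi in rng.uniform(0, 1, n_draws)])
    return {"2.5%": np.percentile(draws, 2.5),
            "phi=0.5": adjusted_proportion(y_sel, x_sel, xbar_pop, 0.5),
            "97.5%": np.percentile(draws, 97.5)}, draws
```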
  • 46:29The poll selection fraction is very, very, very small.
  • 46:32Right, these polls in each state
  • 46:34have about a thousand people in them
  • 46:36but we've got millions of voters in each state.
  • 46:38So the selection fraction is very, very, very small,
  • 46:40total opposite of the smartphone example.
  • 46:43So we'll just jump straight into the answer,
  • 46:46did it work?
  • 46:47Right, this is really exciting.
  • 46:48So the red circle is the true proportion,
  • 46:52oh, sorry, the true bias,
  • 46:53this should say bias down here.
  • 46:55In each of the states.
  • 46:56So these are the seven states
  • 46:57we looked at Arizona, Florida, Michigan, Minnesota,
  • 46:59North Carolina, Pennsylvania, and Wisconsin.
  • 47:02So this horizontal line here at zero that's no bias, right?
  • 47:06So if a point falls there, the ABC poll estimate
  • 47:08would have no bias.
  • 47:09And we can see that in Arizona we sort of overestimated
  • 47:13and in the rest of the states
  • 47:14we've underestimated the support for Trump.
  • 47:16And so that really was the failure, the underestimation
  • 47:19of the support for Trump.
  • 47:20Notice that our Bayesian bounds
  • 47:24cover the true bias everywhere except
  • 47:26in Pennsylvania and Wisconsin.
  • 47:28And so Wisconsin had an enormous bias,
  • 47:30where they way under-called the support for Trump
  • 47:33in Wisconsin by 10 percentage points.
  • 47:34Huge problem.
  • 47:35So we're not getting there
  • 47:37but notice that zero is not in our interval.
  • 47:40So our bounds are suggesting
  • 47:43that there was a negative bias from the poll.
  • 47:46So even though we didn't capture the truth,
  • 47:48we've at least crossed the threshold
  • 47:49saying very likely that you are under calling
  • 47:52the support for Trump.
  • 47:55So how do estimates using the MUBP compare to the ABC poll?
  • 47:59Well, we can use the MUBP bounds to basically shift
  • 48:03the ABC poll estimates.
  • 48:05So we're calling those MUBP adjusted, right?
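Continuing the illustrative sketch from earlier, the "MUBP adjusted" idea is then just shifting the raw poll proportion by the implied bias, with the phi draws providing an interval. The inputs below (`y_votes`, `Z_poll`, `zbar_likely_voters`) are placeholders, not the actual ABC poll or AP VoteCast data.

```python
# Placeholders for the selected-sample outcomes, their covariates, and the
# likely-voter covariate means; not the real ABC / AP VoteCast inputs.
summary, draws = phi_sensitivity(y_votes, Z_poll, zbar_likely_voters)

unweighted = y_votes.mean()                      # raw poll proportion
adjusted_point = summary["phi=0.5"]              # point estimate at phi = 0.5
adjusted_interval = (summary["2.5%"], summary["97.5%"])
implied_bias = unweighted - adjusted_point       # negative => poll under-calls
```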
  • 48:08So we've got the truth is...
  • 48:10The true proportion who voted for Trump
  • 48:12are now these red triangles
  • 48:14and then the black circles are the point estimates
  • 48:17from three different methods of estimation,
  • 48:20of obtaining an estimate:
  • 48:21unweighted from the poll, the weighted estimate from the poll,
  • 48:25and the one adjusted by our measure of selection bias,
  • 48:28the non-ignorable selection bias. That last one
  • 48:30is the MUBP adjusted estimate.
  • 48:32So we can see that in some cases
  • 48:35our adjustment and the polls are pretty similar, right?
  • 48:39But look at, for example, Wisconsin,
  • 48:41all the way over here on the right.
  • 48:42So again, remember I said, we didn't cover the truth
  • 48:44and we didn't cover the true bias
  • 48:46but our indicator is the only one, right?
  • 48:49That's got that much higher shift up towards Trump.
  • 48:52So this is us saying, well,
  • 48:53if there were an underlying selection mechanism
  • 48:57saying that Trump supporters
  • 48:59were inherently less likely to enter this poll,
  • 49:03this is what would happen.
  • 49:04Or this is what your estimated support for Trump would be.
  • 49:07It's shifted up.
  • 49:09We've got a similar sort of success story
  • 49:11I'll say in Minnesota,
  • 49:12where both of the ABC estimators did not cover the truth
  • 49:16in these pre-election polls but ours did, right.
  • 49:18We were able to sort of shift up and say,
  • 49:21look, if there were selection bias
  • 49:22that depended on whether or not you supported Trump
  • 49:25we would have captured that.
  • 49:27So the important idea here is, you know,
  • 49:29before the election, we wouldn't have these red triangles.
  • 49:34But it's important to be able to see
  • 49:36that this is saying you're under calling
  • 49:39the support for Trump
  • 49:40if there were non-ignorable selection, right?
  • 49:42So it's that idea of a sensitivity analysis?
  • 49:44How bad would we be doing?
  • 49:46And what we would say is in Minnesota and Wisconsin
  • 49:49we'd be very worried
  • 49:50about under calling the support for Trump.
  • 49:56So what have I just shown you?
  • 49:59I'll summarize.
  • 50:01The MUBP is a sensitivity analysis tool
  • 50:04to assess the potential for non-ignorable selection bias.
  • 50:08If we have phi equals zero, that's ignorable selection,
  • 50:12we can adjust that away via weighting
  • 50:14or some other method, right?
  • 50:16So if it is ignorable
  • 50:18we can ignore the selection mechanism.
  • 50:21On the other extreme if phi is one,
  • 50:23totally not ignorable,
  • 50:24selection is only depending on that outcome
  • 50:26we're trying to measure.
  • 50:28Somewhere in between we've got phi equals 0.5,
  • 50:30so that if you really needed a point estimate
  • 50:32of the bias, that would be the one to use.
  • 50:34And in fact, that's what this black dot is.
  • 50:37That's the adjustment at 0.5 for our adjusted estimator.
  • 50:41This MUBP is tailored to binary outcomes,
  • 50:45and it is an improvement over the normal-based SMUB.
  • 50:48I didn't show you them,
  • 50:49but the results from simulations basically show
  • 50:52if you use the normal method on a binary outcome
  • 50:55you get these huge bounds.
  • 50:56You go outside of the Manski bounds, right?
  • 50:58'Cause it's not properly bounded between zero and one,
  • 51:01or your proportion isn't properly bounded.
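For reference, the Manski (no-assumption) bounds for a proportion just allow the non-selected units' outcomes to be all zeros or all ones; a quick sketch, with a made-up selection fraction in the spirit of a 1,000-person poll among millions of voters:

```python
def manski_bounds(p_selected, f_selected):
    """Bounds on the overall proportion with no assumptions about the
    non-selected: their proportion could be anywhere from 0 to 1."""
    lower = f_selected * p_selected                      # non-selected all 0
    upper = f_selected * p_selected + (1 - f_selected)   # non-selected all 1
    return lower, upper

# With a tiny selection fraction the bounds are essentially [0, 1],
# so the index is only informative to the extent the proxy model is strong.
manski_bounds(p_selected=0.45, f_selected=1000 / 3_000_000)
```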
  • 51:03And importantly, our measure only requires
  • 51:06summary statistics for Z,
  • 51:08for the population or for the non-selected sample.
  • 51:11So I don't have to have a whole separate data set
  • 51:14where I have everybody who didn't get selected
  • 51:16into my sample,
  • 51:16I just need to know the average of these covariates, right.
  • 51:20I just need to know Z-bar in order to get my average
  • 51:23proxy for the non-selected.
  • 51:26With weak information,
  • 51:27so if my model is poor then my Manski bounds
  • 51:30are gonna be what's returned.
  • 51:32So that's a good feature of this index:
  • 51:34it is naturally bounded,
  • 51:36unlike the normal-model version.
  • 51:38And we have done additional work to move
  • 51:41beyond just estimating means and proportions
  • 51:43into linear regression and probit regression.
  • 51:46So we have indices of selection bias
  • 51:48for regression coefficients.
  • 51:50So instead of wanting to know the mean of Y
  • 51:53or the proportion with Y equals one,
  • 51:55what if you wanted to do a regression of Y
  • 51:57on some covariates?
  • 51:59So we have a paper out in the Annals of Applied Statistics
  • 52:02that extends these to regression coefficients.
  • 52:05So I believe I'm pretty much right on the time
  • 52:07I was supposed to end, so I'll say thank you, everyone.
  • 52:09And I'm happy to take questions.
  • 52:11I'll put up my references
  • 52:12in my teeny, tiny fonts, yes.
  • 52:20Robert Does anybody have any questions?
  • 52:26From the room?
  • 52:33So.
  • 52:36Dr. Rebecca Let me stop my share.
  • 52:38Student Hey.
  • 52:40I have a very basic one,
  • 52:41mostly more of curiosity (indistinct)
  • 52:44Sure, sure.
  • 52:45What is it that caused the...
  • 52:50We know after the fact that in your example
  • 52:54that there was the direction of the bias,
  • 52:57but why is it that it only shifted in the Trump direction?
  • 53:03Why?
  • 53:03You don't know in advance if something is more likely
  • 53:06or less likely?
  • 53:08Okay.
  • 53:09So excellent question.
  • 53:09So that is effectively,
  • 53:11the direction of the shift is going to match...
  • 53:15The direction of the shift in the mean of Y,
  • 53:17or the proportion, is going to match
  • 53:18the shift in X, right?
  • 53:20So if what you get as your mean for your proxy,
  • 53:25for the non-selected sample is bigger
  • 53:28than for your selected sample
  • 53:30then your proportion is gonna get shifted
  • 53:31in that direction?
  • 53:32Right.
  • 53:33It's only ever going to shift it to match the bias in X.
  • 53:37Right?
  • 53:37And so then, which way that shifts Y
  • 53:39depends on what the relationship
  • 53:41is between the covariates Z and X in the probit regression.
  • 53:46But it will always shift it in a particular direction.
  • 53:49I will note that, I fully admit,
  • 53:52our index actually shifted the wrong direction
  • 53:55in one particular case.
  • 53:57Right?
  • 53:57So actually in Florida,
  • 54:00we actually shifted down when we shouldn't.
  • 54:02Right.
  • 54:03So here's the poll estimate and we're shifting down,
  • 54:05but actually the truth is higher.
  • 54:07So we're not always getting it right
  • 54:09we're getting it right when that X is shifting
  • 54:13in the correct direction.
  • 54:14Right?
  • 54:15So it isn't true that we always...
  • 54:17It's true that it always shifts in the direction of X,
  • 54:19but it's not a hundred percent true that X
  • 54:22always shifts in the exact same way as Y.
  • 54:24Just most of the time.
  • 54:25There was evidence of underestimating the Trump support,
  • 54:29and that was in fact reflected in that probate regression,
  • 54:32right in that relationship.
  • 54:33The people who replied to the poll were older,
  • 54:36they were higher educated, right?
  • 54:39And so those older,
  • 54:40higher educated people in aggregate
  • 54:43were less likely to vote for Trump.
  • 54:45So that's why we ended up under calling the support
  • 54:48for Trump when we don't account
  • 54:49for that potential non-ignorable selection bias.
  • 54:52Good question though.
  • 54:55Robert Got it, thank you.
  • 54:56Any other questions (indistinct)
  • 55:09Anybody?
  • 55:16I know I talk fast and that was a lot of stuff
  • 55:19so you know, like, I get it.
  • 55:21(indistinct)
  • 55:23Alright.
  • 55:24Well, Professor Andridge, thank you again.
  • 55:26And.
  • 55:27(students clapping)
  • 55:33Thank you.
  • 55:34Thank you for having me.
  • 55:35Robert Yeah.