YSPH Biostatistics Virtual Seminar: “Optimal Doubly Robust Estimation of Heterogeneous Causal Effects"
November 04, 2020
Edward Kennedy, PhD
Assistant Professor, Department of Statistics & Data Science, Carnegie Mellon University
Abstract: Heterogeneous effect estimation plays a crucial role in causal inference, with applications across medicine and social science. Many methods for estimating conditional average treatment effects (CATEs) have been proposed in recent years, but there are important theoretical gaps in understanding if and when such methods are optimal. This is especially true when the CATE has nontrivial structure (e.g., smoothness or sparsity). Our work contributes in several main ways. First, we study a two-stage doubly robust CATE estimator and give a generic model-free error bound, which, despite its generality, yields sharper results than those in the current literature. We apply the bound to derive error rates in nonparametric models with smoothness or sparsity, and give sufficient conditions for oracle efficiency. Underlying our error bound is a general oracle inequality for regression with estimated or imputed outcomes, which is of independent interest; this is the second main contribution. The third contribution is aimed at understanding the fundamental statistical limits of CATE estimation. To that end, we propose and study a local polynomial adaptation of double-residual regression. We show that this estimator can be oracle efficient under even weaker conditions, if used with a specialized form of sample splitting and careful choices of tuning parameters. These are the weakest conditions currently found in the literature, and we conjecture that they are minimal in a minimax sense. We go on to give error bounds in the non-trivial regime where oracle rates cannot be achieved. Some finite-sample properties are explored with simulations.
- 00:00- So let's get started.
- 00:03Welcome everyone.
- 00:04It is my great pleasure to introduce our speaker today,
- 00:07Dr. Edward Kennedy, who is an assistant professor
- 00:11at the Department of Statistics and Data Science
- 00:13at Carnegie Mellon University.
- 00:16Dr. Kennedy got his MA in statistics
- 00:18and PhD in biostatistics from University of Pennsylvania.
- 00:21He's an expert in methods for causal inference,
- 00:24missing data and machine learning,
- 00:26especially in settings involving
- 00:27high dimensional and complex data structures.
- 00:31He has also been collaborating on statistical applications
- 00:34in criminal justice, health services,
- 00:36medicine and public policy.
- 00:38Today he's going to share with us his recent work
- 00:40in the space of heterogeneous causal effect estimation.
- 00:43Welcome Edward, the floor is yours.
- 00:46- [Edward] Thanks so much, (clears throat)
- 00:47yeah, thanks for the invitation.
- 00:49I'm happy to talk to everyone today about this work
- 00:51I've been thinking about for the last year or so.
- 00:55Sort of excited about it.
- 00:57Yeah, so it's all about doubly robust estimation
- 00:59of heterogeneous treatment effects.
- 01:03Maybe before I start,
- 01:04I don't know what the standard approach is for questions,
- 01:07but I'd be more than happy to take
- 01:08any questions throughout the talk
- 01:10and I can always sort of adapt and focus more on different parts of the talk, depending on what people are interested in.
- 01:17I'm also trying to get used to using Zoom,
- 01:21I've been teaching this big lecture course
- 01:23so I think I can keep an eye on the chat box too
- 01:26if people have questions that way,
- 01:27feel free to just type something in.
- 01:30Okay.
- 01:31So yeah, this is a sort of standard problem in causal inference
- 01:36but I'll give some introduction.
- 01:38The kind of classical target that people go after
- 01:41in causal inference problems is what's
- 01:44often called the average treatment effect.
- 01:46So this tells you the mean outcome if everyone
- 01:48was treated versus if everyone was untreated, for example.
- 01:53So this is, yeah, sort of the standard target.
- 01:57We know quite a bit about estimating this parameter
- 02:01under no unmeasured confounding kinds of assumptions.
- 02:05So just to sort of point this out:
- 02:10so a lot of my work is sort of focused
- 02:11on the statistics of causal inference,
- 02:12how to estimate causal parameters
- 02:15well in flexible non-parametric models.
- 02:17So we know quite a bit
- 02:18about this average treatment effect parameter.
- 02:20There are still some really interesting open problems,
- 02:23even for this sort of most basic parameter,
- 02:25which I'd be happy to talk to people about,
- 02:26but this is just one number, it's an overall summary
- 02:31of how people respond to treatment, on average.
- 02:34It can obscure potentially important heterogeneity.
- 02:38So for example, very extreme case would be where half
- 02:43the population is seeing a big benefit
- 02:45from treatment and half is seeing severe harm,
- 02:49then you would completely miss this
- 02:50by just looking at the average treatment effect.
- 02:53So this motivates going beyond this,
- 02:55maybe looking at how treatment effects can vary
- 02:58across subject characteristics.
- 03:01All right, so why should we care about this?
- 03:03Why should we care how treatment effects vary in this way?
- 03:06So often when I talk about this,
- 03:09people's minds go immediately to optimal treatment regimes,
- 03:12which is certainly an important part of this problem.
- 03:16So that means trying to find out who's benefiting
- 03:19from treatment and who is not or who's being harmed.
- 03:22And then just in developing
- 03:24a treatment policy based on this,
- 03:26where you treat the people who benefit,
- 03:27but not the people who don't.
- 03:29This is definitely an important part
- 03:30of understanding heterogeneity,
- 03:32but I don't think it's the whole story.
- 03:33So it can also be very useful just
- 03:36to understand heterogeneity from a theoretical perspective,
- 03:39just to understand the system
- 03:41that you're studying and not only that,
- 03:44but also to help inform future treatment development.
- 03:50So not just trying to optimally assign
- 03:53the current treatment that's available,
- 03:55but if you find, for example,
- 03:57that there are portions of the subject population
- 04:01that are not responding to the treatment,
- 04:03maybe you should then go off and try and develop
- 04:05a treatment that would better aim at these people.
- 04:10So lots of different reasons why you might care
- 04:12about heterogeneity,
- 04:12including devising optimal policies,
- 04:16but not just that.
- 04:19And this really plays a big role across lots
- 04:21of different fields as you can imagine.
- 04:24We might want to target policies based on how people
- 04:29are responding to a drug or a medical treatment.
- 04:33We'll see a sort of political science example here.
- 04:36So this is just a picture of what you should maybe think
- 04:39about as we're talking about this problem
- 04:42with heterogeneous treatment effects.
- 04:44So this is a timely example.
- 04:46So it's looking at the effect
- 04:47of canvassing on voter turnout.
- 04:50So this is the effect of being sort of reminded
- 04:53in a face-to-face way to vote
- 04:55that there's an election coming up
- 04:58and how this effect varies with age.
- 05:00And so I'll come back to where this plot came from
- 05:04and the exact sort of data structure and analysis,
- 05:07but just as a picture to sort of make things concrete.
- 05:11It looks like there might be some sort of positive effect
- 05:15of canvassing for younger people,
- 05:16but not for older people,
- 05:18there might be some non-linearity.
- 05:21So this might be useful for a number of reasons.
- 05:23You might not want to target the older population
- 05:27with canvassing, because it may not be doing anything,
- 05:30you might want to try and find some other way
- 05:32to increase turnout for this group right.
- 05:36Or you might just want to understand sort
- 05:38of from a psychological, sociological,
- 05:41theoretical perspective,
- 05:44what kinds of people are responding to this sort of thing?
- 05:49And so this is just one simple example
- 05:51you can keep in mind.
- 05:54So what's the state of the art for this problem?
- 05:58So in this talk, I'm going to focus
- 06:00on this conditional average treatment effect here.
- 06:02So it's the expected difference in outcomes if people of type X were treated versus not.
- 06:11This is kind of the classic or standard parameter
- 06:14that people think about now
- 06:16in the heterogeneous treatment effects literature. There are other options you could think about, risk ratios for example, if outcomes are binary.
- 06:25A lot of the methods that I talk about today
- 06:26will have analogs for these other versions,
- 06:30but there are lots of fun, open problems to explore here.
- 06:33How to characterize heterogeneous treatment effects
- 06:35when you have time-varying treatments, continuous treatments, lots of cool problems to think about.
- 06:40But anyways, this kind of effect where we have
- 06:44a binary treatment and some set of covariates,
- 06:48there's really been this proliferation of proposals
- 06:51in recent years for estimating this thing
- 06:53in a flexible way that goes beyond just fitting
- 06:57a linear model and looking at some interaction terms.
- 07:01(clears throat)
- 07:02So I guess I'll refer to the paper for a lot
- 07:07of these different papers that have thought about this.
- 07:12People have used sort of random forests and tree-based methods, Bayesian additive regression trees, lots of different variants
- 07:20for estimating this thing.
- 07:21So there've been lots of proposals,
- 07:23lots of methods for estimating this,
- 07:24but there's some really big theoretical gaps
- 07:28in this literature.
- 07:30So one, yeah, this is especially true
- 07:32when you can imagine that this conditional effect
- 07:36might be much more simple
- 07:38or sparse or smooth than the rest
- 07:41of the data generating process.
- 07:42So you can imagine you have some
- 07:45potentially complex propensity score describing
- 07:49the mechanism by which people are treated
- 07:50based on their covariates.
- 07:51You have some underlying regression functions
- 07:54that describe this outcome process,
- 07:56how their outcomes depend on covariates,
- 08:01whether they're treated or not.
- 08:02These could be very complex and messy objects,
- 08:05but this CATE might be simpler.
- 08:08And in this kind of regime, there's very little known.
- 08:12I'll talk more about exactly what I mean
- 08:14by this in just a bit.
- 08:17So one question is,
- 08:18how do we adapt to this kind of structure?
- 08:21And there are really no strong theoretical benchmarks
- 08:26in this world in the last few years,
- 08:30which means we have all these proposals,
- 08:33which is great, but we don't know which are optimal
- 08:36or when or if they can be improved in some way.
- 08:41What's the best possible performance
- 08:44that we could ever achieve at estimating
- 08:46this quantity in the non-parametric model
- 08:47without adding assumptions?
- 08:49So these kinds of questions are basically
- 08:50entirely open in this setup.
- 08:53So the point of this work is really to try
- 08:55and push forward to answer some of these questions.
- 09:00There are two kind of big parts of this work,
- 09:05which are in a paper on arXiv.
- 09:09So one is just to provide more flexible estimators
- 09:12of this guy and specifically to show,
- 09:18give stronger error guarantees on estimating this.
- 09:23So that we can use a really diverse set of methods
- 09:27for estimating this thing in a doubly robust way
- 09:29and still have some rigorous guarantees
- 09:32about how well we're doing.
- 09:34So that part is more practical.
- 09:35It's more about giving a method
- 09:37that people can actually implement
- 09:38in practice that's pretty straightforward,
- 09:41it looks like a two-stage regression procedure
- 09:44and being able to say something about this
- 09:46that's model free and agnostic about both
- 09:52the underlying data generating process
- 09:53and what methods we're using to construct the estimator.
- 09:57This was lacking in the previous literature.
- 09:59So that's one side of this work, which is more practical.
- 10:03I think I'll focus more on that today,
- 10:06but we can always adapt as we go,
- 10:09if people are interested in other stuff.
- 10:10I'm also going to talk a bit about an analysis of this,
- 10:12just to show you sort of how it would work in practice.
- 10:16So that's one part of this work.
- 10:17The second part is more theoretical and it says,
- 10:22so I don't want to just sort of construct
- 10:24an estimator that has the nice error guarantees,
- 10:27but I want to try and figure out what's
- 10:29the best possible performance I could ever get
- 10:31at estimating these heterogeneous effects.
- 10:36This turns out to be a really hard problem
- 10:39with a lot of nuance,
- 10:42but that's sort of the second part
- 10:43of the talk, which maybe I'll tackle
- 10:46in a bit less time.
- 10:50So that's kind of big picture.
- 10:51I like to give the punchline of the talk at the start,
- 10:53just so you have an idea of what I'm going to be covering.
- 10:57And yeah, so now let's go into some details.
- 11:01So we're going to think about this sort
- 11:03of classic causal inference data structure,
- 11:06where we have n iid observations, we have covariates X,
- 11:10which are D dimensional, binary treatment for now,
- 11:14all the methods that I'll talk about will work
- 11:17without any extra work in the discrete treatment setting
- 11:21if we have multiple values.
- 11:23The continuous treatment setting
- 11:24is more difficult it turns out.
- 11:27And some outcome Y that we care about.
- 11:31All right, so there are a couple of characters
- 11:33in this talk that will play really important roles.
- 11:37So we'll have some special notation for them.
- 11:39So PI of X, this is the propensity score.
- 11:42This is the chance of being treated,
- 11:44given your covariates.
- 11:47So some people might be more or less likely
- 11:49to be treated depending on their baseline covariates, X.
- 11:54Mu sub a, this will be an outcome regression function.
- 11:57So it's your expected outcome given your covariates
- 12:00and given your treatment level.
- 12:02And then we'll also later on in the talk use this eta,
- 12:04which is just the marginal outcome regression.
- 12:06So without thinking about treatment,
- 12:08just how the outcome varies on average as a function of X.
- 12:14And so those are the three main characters in this talk,
- 12:16we'll be using them throughout.
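In symbols, with data Z = (X, A, Y), the three characters just described are

\[
\pi(x) = \mathbb{P}(A = 1 \mid X = x), \qquad
\mu_a(x) = \mathbb{E}(Y \mid X = x, A = a), \qquad
\eta(x) = \mathbb{E}(Y \mid X = x).
\]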
- 12:18So under these standard causal assumptions
- 12:21of consistency, positivity, exchangeability,
- 12:24there's a really amazing group
- 12:27at Yale that are focused on dropping these assumptions.
- 12:31So lots of cool work to be done there,
- 12:34but we're going to be using them today.
- 12:36So consistency, we're roughly thinking
- 12:38this means there's no interference,
- 12:40this is a big problem in causal inference,
- 12:42but we're going to assume that my treatment can't affect your outcomes, for example.
- 12:47We're going to think about the case where everyone
- 12:48has some chance at receiving treatment,
- 12:51both treatment and control,
- 12:52and then we have no unmeasured confounding.
- 12:54So we've collected enough sufficiently relevant covariates
- 12:57that once we conditioned on them,
- 12:59look within levels of the covariates,
- 13:00the treatment is as good as randomized.
- 13:03So under these three assumptions,
- 13:06this conditional effect on the left-hand side here
- 13:09can just be written as a difference in regression functions.
- 13:11It's just the difference in the regression function
- 13:13under treatment versus control,
- 13:15sort of super simple parameter right.
- 13:18So I'm going to call this thing Tau.
- 13:20This is just the regression under treatment minus
- 13:22the regression under control.
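Written out, under the three assumptions above the conditional effect being described is

\[
\tau(x) \;=\; \mathbb{E}\,(Y^1 - Y^0 \mid X = x) \;=\; \mu_1(x) - \mu_0(x),
\]

the difference of the two outcome regression functions.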
- 13:27So you might think, we know a lot about
- 13:29how to estimate regression functions non-parametrically; there are really nice minimax lower bounds
- 13:36that say we can't do better uniformly across the model
- 13:41without adding some assumptions or some extra structure.
- 13:46The fact that we have a difference
- 13:47in regression doesn't seem like
- 13:48it would make things more complicated
- 13:50than just the initial regression problem,
- 13:53but it turns out it really does,
- 13:55it's super interesting,
- 13:56this is one of the parts of this problem
- 13:57that I think is really fascinating.
- 14:00So just by taking a difference in regressions,
- 14:03you completely change the nature of this problem
- 14:06from the standard non-parametric regression setup.
- 14:10So let's get some intuition for why this is the case.
- 14:14So why isn't it optimal just to estimate
- 14:17the two regression functions
- 14:18and take a difference, for example?
- 14:21So let's think about a simple data generating process
- 14:23where we have just a one dimensional covariate,
- 14:26it's uniform on minus one, one,
- 14:29we have a simple step function propensity score
- 14:32and then we're going to think
- 14:32about a regression function, both under treatment
- 14:35and control that looks like some kind
- 14:37of crazy polynomial from this Gyorfi textbook,
- 14:40I'll show you a picture in just a minute.
- 14:44The important thing about this polynomial
- 14:47is that it's non-smooth, it has a jump,
- 14:50has some kinks in it and so it will be hard to estimate,
- 14:56in general, but we're taking both
- 15:00the regression function under treatment
- 15:01and the regression function under control
- 15:03to be equal, they're equal to this same hard
- 15:06to estimate polynomial function.
- 15:07And so that means the difference is really simple,
- 15:10it's just zero, it's the simplest conditional effect
- 15:12you can imagine, not only constant, but zero.
- 15:15You can imagine this probably happens a lot in practice
- 15:18where we have treatments that are not extremely effective
- 15:22for everyone in some complicated way.
- 15:26So the simplest way you would estimate
- 15:29this conditional effect is just take an estimate
- 15:32of the two regression functions and take a difference.
- 15:35Sometimes I'll call this plugin estimator.
- 15:38There's this paper by Künzel and colleagues that calls it the T-learner.
- 15:43So for example, we can use smoothing splines,
- 15:46estimate the two regression functions and take a difference.
- 15:49And maybe you can already see what's going to go wrong here.
- 15:52So these individual regression functions
- 15:54by themselves are really hard to estimate.
- 15:58They have jumps and kinks, they're messy functions
- 16:01And so when we try and estimate these
- 16:03with smoothing splines, for example,
- 16:05we're going to get really complicated estimates
- 16:08that have some bumps. It's hard to choose
- 16:11the right tuning parameter, but even if we do,
- 16:14we're inheriting the sort of complexity
- 16:16of the individual regression functions.
- 16:18When we take the difference,
- 16:19we're going to see something
- 16:20that is equally complex here
- 16:22and so it's not doing a good job of exploiting
- 16:25this simple structure in the conditional effect.
- 16:30This is sort of analogous to this intuition
- 16:33that people have that interaction terms might
- 16:36be smaller or less worrisome than sort
- 16:41of main effects in a regression model.
- 16:43Or you can think of the mu's as sort of main effects
- 16:45and the differences as like an interaction.
- 16:49So here's a picture of this data
- 16:51in the simple motivating example.
- 16:53So we've got treated people on the left
- 16:55and untreated people on the right
- 16:57and this gray line is the true, that messy,
- 17:00weird polynomial function that we're thinking about.
- 17:03So here's a jump and there's a couple
- 17:06of kinks here and there's confounding.
- 17:09So treated people are more likely to have larger Xs,
- 17:13untreated people are more likely to have smaller Xs.
- 17:16So what happens here is the function is sort
- 17:19of a bit easier to estimate on the right side.
- 17:22And so for treated people, we're going to take a sort
- 17:24of larger bandwidth, get a smoother function.
- 17:28For untreated people, it's harder to estimate
- 17:30on the left side and so we're going to need
- 17:31a small bandwidth to try and capture this jump,
- 17:34for example, this discontinuity.
- 17:38And so what's going to happen is when you take a difference
- 17:40of these two regression estimates, these black lines
- 17:42are just the standard smoothing spline estimates that you get with one line of code,
- 17:48using the default bandwidth choices.
- 17:50When you take a difference,
- 17:51you're going to get something
- 17:52that's very complex and messy and it's not doing
- 17:55a good job of recognizing that the regression functions
- 17:58are the same under treatment and control.
- 18:03So what else could we do?
- 18:04This maybe points to this fact that
- 18:06the plugin estimator breaks,
- 18:09it doesn't do a good job of exploiting a structure,
- 18:11but what other options do we have?
- 18:13So let's say that we knew the propensity scores.
- 18:15So for just simplicity, say we were in a trial,
- 18:19for example, an experiment,
- 18:21where we randomized everyone to treat them
- 18:23with some probability that we knew.
- 18:26In that case, we could construct a pseudo outcome,
- 18:28which is just like an inverse probability weighted outcome,
- 18:31which has exactly the right conditional expectation,
- 18:35its conditional expectation is exactly equal
- 18:37to that conditional effect.
- 18:39And so when you did a non-parametric regression
- 18:41of the pseudo outcome on X,
- 18:43it would be like doing an oracle regression
- 18:45of the true difference in potential outcomes,
- 18:47it has exactly the same conditional expectation.
- 18:50And so this sort of turns this hard problem
- 18:53into a standard non-parametric regression problem.
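For example, with a known propensity score, one such pseudo outcome is the inverse-probability-weighted transformation

\[
Y_{\mathrm{IPW}} \;=\; \left\{ \frac{A}{\pi(X)} - \frac{1-A}{1-\pi(X)} \right\} Y,
\qquad
\mathbb{E}\,(Y_{\mathrm{IPW}} \mid X = x) \;=\; \tau(x)
\]

under the assumptions above, so regressing this pseudo outcome on X targets exactly the conditional effect.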
- 18:56Now this isn't a special case where we knew
- 18:58the propensity scores for the rest
- 18:59of the talk we're gonna think about what happens
- 19:01when we don't know these, what can we say?
- 19:04So here's just a picture of what we get in the setup.
- 19:08So this red line is this really messy plug in estimator
- 19:10that we get that's just inheriting that complexity
- 19:13of estimating the individual regression functions
- 19:15and then these black and blue lines are IPW
- 19:19and doubly robust versions that exploit
- 19:22this underlying smoothness and simplicity
- 19:25of the heterogeneous effects, the conditional effects.
- 19:32So this is just a motivating example
- 19:34to help us get some intuition for what's going on here.
- 19:39So these results are sort of standard in this problem,
- 19:41we'll come back to some simulations later on.
- 19:43And so now our goal is going to study the error
- 19:48of the sort of inverse weighted kind of procedure,
- 19:51but a doubly robust version.
- 19:53We're going to give some new model free error guarantees,
- 19:57which let us use very flexible methods
- 19:59and it turns out we'll actually get better error rates
- 20:03than what were achieved previously in literature,
- 20:08even when focusing specifically on some particular method.
- 20:12And then again, we're going to see,
- 20:13how well can we actually do estimating
- 20:15this conditional effect in this problem.
- 20:21Might be a good place to pause
- 20:23and see if people have any questions.
- 20:33Okay.
- 20:34(clears throat)
- 20:35Feel free to shout out any questions
- 20:37or stick them on the chat if any come up.
- 20:42So we're going to start by thinking about
- 20:45a pretty simple two-stage doubly robust estimator,
- 20:48which I'm going to call the DR-learner,
- 20:50this is following this nomenclature that's become kind
- 20:53of common in the heterogeneous effects literature
- 20:57where we have letters and then a learner.
- 21:01So I'm calling this the DR-Learner,
- 21:02but this is not a new procedure,
- 21:04but the version that I'm going to analyze
- 21:06has some variations, but it was actually first proposed by Mark van der Laan in 2013 and was used in 2016 by Alex Luedtke and Mark van der Laan.
- 21:16So they proposed this,
- 21:17but they didn't give specific error bounds.
- 21:21I think relatively few people know
- 21:23about these earlier papers because this approach
- 21:25was then sort of rediscovered in various ways
- 21:28after that in the following years,
- 21:30typically in these later versions,
- 21:32people use very specific methods for estimating,
- 21:37for constructing the estimator,
- 21:38which I'll talk about in detail in just a minute,
- 21:41for example, using kernel kind of methods,
- 21:44local polynomials, and this paper used a sort of series or spline regression.
- 21:52So.
- 21:53(clears throat)
- 21:54These papers are nice ways
- 21:56of doing doubly robust estimation,
- 21:58but they had a couple of drawbacks,
- 22:00which we're going to try and build on in this work.
- 22:03So one is, we're going to try not to commit
- 22:05to using any particular methods.
- 22:07We're going to see what we can say about error guarantees,
- 22:11just for generic regression procedures.
- 22:15And then we're going to see
- 22:16if we can actually weaken the sort of assumptions
- 22:19that we need to get oracle type behavior.
- 22:22So the behavior of an estimator that we would see
- 22:25if we actually observed the potential outcomes
- 22:28and it turns out we'll be able to do this,
- 22:29even though we're not committing to particular methods.
- 22:34There's also a really nice paper by Foster
- 22:35and Syrgkanis from last year,
- 22:38which also considered a version of this DR-learner
- 22:41and they had some really nice model agnostic results,
- 22:44but they weren't doubly robust.
- 22:46So, in this work we're going to try
- 22:48and doubly-robustify these results.
- 22:54So that's the sort of background and an overview.
- 22:57So let's think about what this estimator is actually doing.
- 23:01So here's the picture of this,
- 23:03what I'm calling the DR-learner.
- 23:05So we're going to do some interesting sample splitting
- 23:08here and later where we split our sample
- 23:10in the three different groups.
- 23:13So one's going to be used for nuisance training
- 23:16for estimating the propensity score.
- 23:19And then I'm also going to estimate
- 23:21the regression functions, but in a separate fold.
- 23:26So I'm separately estimating my propensity score
- 23:29and regression functions.
- 23:30This turns out to not be super crucial for this approach.
- 23:35It actually is crucial for something I'll talk
- 23:37about later in the talk,
- 23:39this is just to give a nicer error bound.
- 23:43So the first stage is we estimate these nuisance functions,
- 23:45the propensity scores and the regressions.
- 23:48And then we go to this new data that we haven't seen yet,
- 23:53our third fold of split data
- 23:56and we construct a pseudo outcome.
- 23:58Pseudo outcome looks like this, it's just some combination,
- 24:01it's like an inverse probability weighted residual term
- 24:04plus something like the plug-in estimator
- 24:07of the conditional effect.
- 24:09So it's just some function of the propensity score estimates
- 24:12and the regression estimates.
- 24:15If you've used doubly robust estimators
- 24:17before you'll recognize this as what
- 24:19we average when we construct
- 24:21a usual doubly robust estimator
- 24:24of the average treatment effect.
- 24:25And so intuitively, instead of averaging this here,
- 24:28we're just going to regress it on covariates,
- 24:30that's exactly how this procedure works.
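Concretely, the pseudo outcome being described here is the usual doubly robust (AIPW) transformation, writing Z = (X, A, Y):

\[
\hat\varphi(Z) \;=\; \frac{A - \hat\pi(X)}{\hat\pi(X)\,\{1 - \hat\pi(X)\}}\,\{Y - \hat\mu_A(X)\} \;+\; \hat\mu_1(X) - \hat\mu_0(X).
\]

Averaging \(\hat\varphi\) gives the familiar doubly robust ATE estimator; regressing it on the covariates in the held-out fold gives the DR-learner, \(\hat\tau_{\mathrm{dr}}(x) = \hat{\mathbb{E}}_n\{\hat\varphi(Z) \mid X = x\}\).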
- 24:33So it's pretty simple, construct the pseudo outcome,
- 24:36which we typically would average to estimate the ATE;
- 24:39now, we're just going to do a regression
- 24:41of this thing on covariates in our third sample.
- 24:45So we can write our estimator this way.
- 24:47This E-hat-n notation just means
- 24:49some generic regression estimator.
- 24:53So one of the crucial points in this work,
- 24:55is that I want to see what I can say
- 24:57about the error of this estimator without committing
- 25:01to a particular estimator.
- 25:02So if you want to use random forests in that last stage,
- 25:05I want to be able to tell you what kind
- 25:07of error to expect or if you want
- 25:09to use linear regression
- 25:10or whatever procedure you like,
- 25:13the goal would be to give you some nice error guarantee.
- 25:16So (indistinct), and you should think of it as just
- 25:18your favorite regression estimator.
- 25:21So we take the pseudo outcome,
- 25:22we regress it on covariates, super simple,
- 25:25just create a new column in your dataset,
- 25:27which looks like this pseudo outcome.
- 25:28And then treat that as the outcome
- 25:30in your second stage regression.
- 25:35So here, let's say we split our sample in half for the second stage regression; we would get errors in terms of n over two, since we'd be using half our sample for the second stage regression.
- 25:48You can actually just swap these samples and average, and you'll get back the full sample size errors.
- 25:54So it would be as if you had used
- 25:56the full sample size all at once.
- 25:59That's called Cross Fitting,
- 26:01it's becoming sort of popular in the last couple of years.
- 26:03So here's a schematic of what this thing is doing.
- 26:06So we split our data into thirds, use one third
- 26:09to estimate the propensity score,
- 26:10another third to estimate the regression functions,
- 26:12we use those to construct a pseudo outcome
- 26:14and then we do a second stage regression
- 26:16of that pseudo outcome on covariates.
- 26:19So pretty easy, you can do this in three lines of code.
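For concreteness, here is a minimal sketch of what those few lines might look like in R, assuming a single covariate x and using mgcv's gam as a stand-in for whatever regression methods you prefer; the function name dr_learner and this particular fold assignment are just illustrative, not the exact code from the paper.

```r
library(mgcv)  # smoothing splines via gam(); any learner could be swapped in

dr_learner <- function(x, a, y) {
  dat <- data.frame(x = x, a = a, y = y)
  dat$fold <- sample(rep(1:3, length.out = nrow(dat)))      # split into thirds

  # Fold 1: estimate the propensity score pi(x) = P(A = 1 | X = x)
  pi_fit <- gam(a ~ s(x), family = binomial, data = subset(dat, fold == 1))

  # Fold 2: estimate the outcome regressions mu_a(x) = E(Y | X = x, A = a)
  mu1_fit <- gam(y ~ s(x), data = subset(dat, fold == 2 & a == 1))
  mu0_fit <- gam(y ~ s(x), data = subset(dat, fold == 2 & a == 0))

  # Fold 3: form the doubly robust pseudo-outcome and regress it on x
  d3 <- subset(dat, fold == 3)
  pihat  <- predict(pi_fit,  newdata = d3, type = "response")
  mu1hat <- predict(mu1_fit, newdata = d3)
  mu0hat <- predict(mu0_fit, newdata = d3)
  muahat <- ifelse(d3$a == 1, mu1hat, mu0hat)
  d3$pseudo <- (d3$a - pihat) / (pihat * (1 - pihat)) * (d3$y - muahat) +
    mu1hat - mu0hat

  gam(pseudo ~ s(x), data = d3)            # second-stage CATE regression
}
```

Swapping the roles of the three folds and averaging the resulting fits is the cross-fitting step mentioned above, which recovers the full-sample error rates.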
- 26:25Okay.
- 26:26And now our goal is to say something
- 26:28about the error of this procedure,
- 26:29being completely agnostic about
- 26:30how we estimate these propensity scores,
- 26:32the regression functions, and what procedure we use in that second stage.
- 26:40And it turns out we can do this by exploiting
- 26:42the sample splitting, we can come up with a strong guarantee
- 26:46that actually gives you smaller errors than
- 26:48what appeared in the previous literature
- 26:50when people focused on specific methods.
- 26:52And the main thing is we're really exploiting
- 26:55the sample splitting.
- 26:58And then the other tool that we're using
- 26:59is we're assuming some stability condition
- 27:02on that second stage estimator,
- 27:03that's the only thing we assume here.
- 27:06It's really mild, I'll tell you what it is right now.
- 27:10So you say that regression estimator is stable,
- 27:14if when you add some constant to the outcome
- 27:18and then do a regression, you get something
- 27:21that's the same as if you do the regression
- 27:22and then add some constant.
- 27:25So it's pretty intuitive,
- 27:26if a method didn't satisfy this,
- 27:27it would be very weird
- 27:30and actually for the proof,
- 27:32we don't actually need this to be exactly equal.
- 27:35So adding a constant pre versus post regression
- 27:38shouldn't change things too much.
- 27:40You don't have to have it be exactly equal,
- 27:43it still works if it's just equal up
- 27:44to the error in the second stage regression.
- 27:52So that's the first stability condition.
- 27:55The second one is just that if you have
- 27:57two random variables with the same conditional expectation,
- 28:00then the mean squared error is going
- 28:01to be the same up to constants.
- 28:03Again, any procedure
- 28:05that didn't satisfy these two assumptions
- 28:07would be very bizarre.
- 28:11These are very mild stability conditions.
- 28:13And that's essentially all we need.
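Written loosely, the first condition says that for the second-stage regression estimator, adding a constant commutes with regressing, at least up to the second-stage error:

\[
\hat{\mathbb{E}}_n\{\hat f(Z) + c \mid X = x\} \;\approx\; \hat{\mathbb{E}}_n\{\hat f(Z) \mid X = x\} + c,
\]

and the second says that two outcomes with the same conditional expectation given X are estimated with mean squared errors that agree up to a constant factor. (This is a paraphrase of the conditions, not their exact statement in the paper.)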
- 28:15So now our benchmark here is going to be an oracle estimator
- 28:21that instead of doing a regression with the pseudo,
- 28:23it does a regression with the actual potential outcomes,
- 28:26Y, one, Y, zero.
- 28:30So we can think about the mean squared error
- 28:31of this estimator, so I'm using mean squared error,
- 28:34just sort of for simplicity and convention,
- 28:36you could think about translating this
- 28:38to other kinds of measures of risk.
- 28:40That would be an interesting area for future work.
- 28:44So this is the oracle; R-star is the oracle mean squared error.
- 28:48It's the mean squared error you'd get for estimating
- 28:50the conditional effect if you actually saw
- 28:52the potential outcomes.
- 28:55So we get this really nice, simple result,
- 28:57which says that the mean squared error
- 28:59of that DR-learner procedure that uses the pseudo outcomes,
- 29:03it just looks like the oracle mean squared error,
- 29:06plus a product of mean squared errors in estimating
- 29:08the propensity score and the regression function.
- 29:12It resembles the kind of doubly robust error results
- 29:16that you see for estimating average treatment effects,
- 29:18but now we have this for conditional effects.
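Schematically, writing \(R^*_n\) for the oracle mean squared error, the result has the doubly robust form (a loose rendering, suppressing constants and the precise norms used in the paper):

\[
\mathbb{E}\big\{(\hat\tau_{\mathrm{dr}} - \tau)^2\big\}
\;\lesssim\;
R^*_n \;+\; \|\hat\pi - \pi\|^2 \cdot \|\hat\mu - \mu\|^2 .
\]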
- 29:23The proof technique is very different here compared
- 29:25to what is done in the average effect case.
- 29:29But the proof is actually very, very straightforward.
- 29:32It's like a page long, you can take a look in the paper,
- 29:35it's really just leaning on this sample splitting
- 29:38and then using stability in a slightly clever way.
- 29:42But the most complicated tool uses is just
- 29:45some careful use of the components
- 29:49of the estimator and iterated expectation.
- 29:53So it's really a pretty simple proof, which I like.
- 29:57So yeah, this is the main result.
- 29:59And again, we're not assuming anything beyond
- 30:02this mild stability here, which is nice.
- 30:04So you can use whatever regression procedures you like.
- 30:07And this will tell you something about the error
- 30:09how it relates to the Oracle error that you would get
- 30:12if you actually observed the potential outcomes.
- 30:18So this is model free method-agnostic,
- 30:21it's also a finite sample bound,
- 30:23there's nothing asymptotic here.
- 30:25This means that the mean squared error is upper bounded up
- 30:28to some constant times this term on the right.
- 30:31So there's no end going to infinity or anything here either.
- 30:39So the other crucial point of this is
- 30:41because we have a product of mean squared errors,
- 30:44you have the kind of usual doubly robust story.
- 30:46So if one of these is small, the product will be small,
- 30:50potentially more importantly, if they're both kind
- 30:52of modest sized because both, maybe the propensity score
- 30:55and the regression functions are hard to estimate
- 30:57the product will be potentially quite a bit smaller
- 31:01than the individual pieces.
- 31:04And this is why this is showing you that that sort
- 31:08of plugin approach, which would really just be driven
- 31:10by the mean squared error for estimating
- 31:11the regression functions can be improved by quite a bit,
- 31:15especially if there's some structure to exploit
- 31:17in the propensity scores.
- 31:23Yeah, so in previous work people used specific methods.
- 31:26So they would say I'll use
- 31:28maybe series estimators or kernel estimators
- 31:31and then the error bound was actually bigger
- 31:34than what we get here.
- 31:36So it's a little surprising that you can get
- 31:38a smaller error bound under weaker assumptions,
- 31:40but this is a nice advantage
- 31:42of the sample splitting trick here.
- 31:49Now that you have this nice error bound you can plug
- 31:52in sort of results from any of your favorite estimators.
- 31:56So we know lots about mean squared error
- 31:59for estimating regression functions.
- 32:01And so you can just plug in what you get here.
- 32:03So for example, you think about smooth functions.
- 32:07So these are functions in Hölder classes; intuitively these are functions that are close to their Taylor approximations. The strict definition, maybe I'll pass on in the interest of time.
- 32:22Then you can say, for example, if PI is alpha smooth,
- 32:25so it has alpha partial derivatives
- 32:30with the highest order being Lipschitz, then we know that you can estimate the propensity score with a mean squared error that looks like
- 32:38n to the minus two alpha over two alpha plus D,
- 32:41this is the usual non-parametric regression
- 32:44mean squared error.
- 32:46You can say the same thing for the regression functions.
- 32:49If they're beta smooth, then we can estimate them
- 32:51at the usual non-parametric rate,
- 32:53n to the minus two beta over two beta plus D.
- 32:56Then we could say,
- 32:57okay, suppose the conditional effect,
- 32:59Tau is gamma smooth, and gamma, it can't be smaller
- 33:04than beta, it has to be at least as smooth
- 33:05as the regression functions and in practice,
- 33:08it could be much more smooth.
- 33:09So for example, in the case where the CATE is just zero
- 33:12or constant, Gamma's like infinity, infinitely smooth.
- 33:17Then if we use a second stage estimator that's optimal
- 33:20for estimating Gamma smooth functions,
- 33:24we can just plug in the error rates
- 33:25that we get and see
- 33:26that we get a mean squared error bound
- 33:28that looks like the Oracle rate.
- 33:30This is the rate we would get if we actually observed
- 33:33the potential outcomes.
- 33:35And then we get this product of mean squared errors.
- 33:37And so whenever this product, it means squared errors
- 33:40is smaller than the Oracle rate,
- 33:42then we're achieving the Oracle rate up to constants,
- 33:46the same rate that we would get
- 33:47if we actually saw Y one minus Y zero.
- 33:51And so you can work out the conditions,
- 33:53what you need to make this term smaller than this one,
- 33:56that's just some algebra
- 34:00and it has some interesting structure.
- 34:03So if the average smoothness of the two nuisance functions,
- 34:07the propensity score and the regression function
- 34:09is greater than D over two divided by some inflation factor,
- 34:14then you can say that you're achieving
- 34:18the same rate as this Oracle procedure.
- 34:22So the analog of this for the average treatment effect
- 34:27or the result you need
- 34:28for the standard doubly robust estimate,
- 34:30or the average treatment effect
- 34:31is that the average smoothness is greater than D over two.
- 34:34So here we don't have D over two,
- 34:36we have D over two over one plus D over gamma.
- 34:40So this is actually giving you a sort
- 34:44of a lower threshold for achieving Oracle rates
- 34:48in this problem.
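To make that algebra concrete, take the two nuisance smoothness levels equal, α = β = s; then, up to constants, the bound behaves like

\[
n^{-\frac{2\gamma}{2\gamma+d}}
\;+\;
n^{-\frac{2s}{2s+d}} \cdot n^{-\frac{2s}{2s+d}},
\]

and the product term is no bigger than the oracle term exactly when

\[
s \;\ge\; \frac{d/2}{\,1 + d/\gamma\,},
\]

which is the inflated threshold being described (a sketch of the calculation, not the general statement with α ≠ β).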
- 34:49So, because it's a harder problem,
- 34:51we need weaker conditions
- 34:52on the nuisance estimation to behave like an Oracle
- 34:56and how much weaker those conditions
- 34:58are, depends on the dimension of the covariates
- 35:00and the smoothness of the conditional effect.
- 35:04So if we think about the case where the conditional effect
- 35:06is like infinitely smooth,
- 35:07so this is almost like a parametric problem.
- 35:10Then we recover the usual condition that we need for the doubly robust estimator to be root-n consistent, smoothness greater than D over two. But for some non-trivial, finite CATE smoothness,
- 35:26then we're somewhere in between sort of when
- 35:28a plugin is optimal and this nice kind of parametric setup.
- 35:34So this is just a picture of the rates here
- 35:37which is useful to keep in mind.
- 35:39So here on the x-axis, we have the smoothness
- 35:43of the nuisance functions.
- 35:44You can think of this as the average smoothness
- 35:46of the propensity score in regression functions.
- 35:49And again, in this Hölder smooth model,
- 35:52which is a common model people use in non-parametrics,
- 35:55the more smooth things are
- 35:57the easier it is to estimate them.
- 36:00And then here we have the mean squared error
- 36:02for estimating the conditional effect.
- 36:06So here is the minimax lower bound;
- 36:09this is the best possible mean squared error
- 36:11that you can achieve for the average treatment effect.
- 36:14This is just to kind of anchor our results
- 36:16and think about what happens relative to this nicer,
- 36:19simpler parameter, which is just the overall average
- 36:21and not the conditional average.
- 36:24So once you hit a certain smoothness in this case,
- 36:26it's five, so this is looking at
- 36:28a 20 dimensional covariate case where
- 36:32the CATE smoothness is twice the dimension
- 36:35just to fix ideas.
- 36:38And so once we hit this smoothness of five,
- 36:42so we have five partial derivatives,
- 36:43then it's possible to achieve a root-n rate. So this is n to the one half for estimating the average treatment effect.
- 36:52Root-n rates are never possible for conditional effects.
- 36:55So here's the Oracle rate.
- 36:58This is the rate that we would achieve in this problem
- 37:00if we actually observed the potential outcomes.
- 37:02So it's slower than root-n, it's a bigger error.
- 37:08Here's what you would get with the plugin.
- 37:10This is just really inheriting the complexity
- 37:13and estimating the regression functions individually,
- 37:15it doesn't capture this CATE smoothness
- 37:18and so you need the regression functions
- 37:20to be sort of infinitely smooth, or as smooth as the CATE, to actually get Oracle efficiency
- 37:25with the plugin estimator.
- 37:28So this plugin has big errors;
- 37:29if we use this DR-learner approach,
- 37:32we close this gap substantially.
- 37:36So we can say that we're hitting this Oracle rate.
- 37:39Once we have a certain amount of smoothness
- 37:41of the nuisance functions and in between
- 37:44we get an error that looks something like this.
- 37:49So this is just a picture of these results showing,
- 37:53graphically, the improvement of the DR-learner approach
- 37:56here over a simple plugin estimator.
- 38:03So yeah, just the punchline here is
- 38:06this simple two-stage doubly robust approach
- 38:09can do a good job adapting to underlying structure
- 38:12in the conditional effect,
- 38:14even when the nuisance stuff,
- 38:16the propensity scores
- 38:17and the underlying regression functions
- 38:18are more complex or less smooth in this case.
- 38:24This is just talking about the relation
- 38:26to the average treatment effect conditions,
- 38:28which I mentioned before.
- 38:32So you can do the same thing for any generic
- 38:34regression methods you like.
- 38:35So in the paper, I do this for smooth models
- 38:38and sparse models, which are common
- 38:39in these non-parametric settings,
- 38:41where you have high dimensional Xs
- 38:43and you believe that some subset
- 38:45of them are the ones that matter.
- 38:48So I'll skip past this, if you're curious though,
- 38:50all the details are in the paper.
- 38:52So you can say what kind of sparsity you need in the propensity score and regression functions to be able
- 38:57to get something that behaves like an Oracle
- 38:59that actually saw the potential outcomes from the start.
- 39:04You can also do the same kind of game
- 39:05where you compare this to what you need
- 39:06for the average treatment effect.
- 39:11Yeah, happy to talk about this offline
- 39:13or afterwards people have questions.
- 39:18So there's also a nice kind of side result
- 39:21which I think I'll also go through quickly here.
- 39:24From all this, is just a general Oracle inequality
- 39:29for regression when you have some estimated outcomes.
- 39:31So in some sense, there isn't anything really special
- 39:34in our results that has to do
- 39:37with this particular pseudo outcome.
- 39:39So, the proof that we have here works
- 39:43for any second stage or any two-stage sort
- 39:46of regression procedure
- 39:48where you first estimate some nuisance stuff,
- 39:50create a pseudo outcome that depends
- 39:52on this estimated stuff and then do a regression
- 39:54of the pseudo outcome on some set of covariates.
- 39:58And so a nice by-product of this work,
- 40:00is that you get a kind of similar error bound
- 40:02for just generic regression with pseudo outcomes.
- 40:07This comes up in a lot of different problems, actually.
- 40:10So one is when you want just a partly conditional effect.
- 40:15So maybe I don't care about how effects vary
- 40:17with all the Xs, but just a subset of them,
- 40:19then you can apply this result.
- 40:20I have a paper with a great student, Amanda Coston,
- 40:23who studied a version of this
- 40:28regression with missing outcomes.
- 40:30Again, these look like nonparametric regression problems
- 40:33where you have to estimate some pseudo outcome
- 40:36dose response curve problems, conditional IV effects,
- 40:40partially linear IVs.
- 40:41So there are lots of different variants where you need
- 40:43to do some kind of two-stage regression procedure like this.
- 40:51Again, you just need a stability condition
- 40:52and you need some sample splitting
- 40:54and you can give a similar kind of a nice rate result
- 40:57that we got for the CATE specific problem,
- 41:01but in a generic pseudo outcome regression problem.
- 41:07So we've got about 15 minutes,
- 41:10I have some simulations,
- 41:12which I think I will go over quickly.
- 41:15So we did this in a couple simple models,
- 41:17one, a high dimensional linear model.
- 41:20It's actually a logistic model where
- 41:22we have 500 covariates and 50
- 41:25of them have non-zero coefficients.
- 41:28We just used the default lasso fitting in R
- 41:32and compared plugin estimators
- 41:34to the doubly robust approach that we talked
- 41:37about, and then also an X-learner, which is a sort of variant of the plug-in approach
- 41:44that was proposed in recent years.
- 41:47And the basic story is you get sort
- 41:49of what the theory predicts.
- 41:50So the DR-learner does better than these plug-in types
- 41:54of approaches in this setting.
- 41:58The nuisance functions are hard to estimate
- 42:00and so you don't see a massive gain over,
- 42:02for example, the X-Learner,
- 42:04you do see a pretty massive gain
- 42:05over the simple plugin.
- 42:08And we're a bit away
- 42:10from this Oracle DR-learner approach here,
- 42:13so that means the errors are still relatively different.
- 42:16This is telling us that the nuisance stuff is hard
- 42:19to estimate in this simulation set up.
- 42:22Here's another simulation based
- 42:24on that plot I showed you before.
- 42:28And so here, I'm actually estimating the propensity scores,
- 42:31but I'm constructing the estimates myself
- 42:33so that I can control the rate of convergence
- 42:35and see how things change across different error rates
- 42:39for estimating the propensity score.
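One simple way to mimic this kind of controlled convergence, purely as an illustration of the idea rather than the exact construction used in the paper, is to perturb the true propensity score on the logit scale with noise that shrinks at a chosen rate:

```r
# Hypothetical sketch: fake "estimates" of a known propensity score pi_true
# whose error shrinks at roughly the rate n^(-r), so r can be varied on the x-axis.
make_pi_hat <- function(pi_true, r) {
  n <- length(pi_true)
  plogis(qlogis(pi_true) + rnorm(n, sd = n^(-r)))
}
```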
- 42:41So here's what we see.
- 42:42So on the x-axis here,
- 42:44we have how well we're estimating the propensity score.
- 42:48So this is a convergence rate
- 42:50for the propensity score estimator.
- 42:52Y-axis, we have the mean squared error
- 42:54and then this red line is the plugin estimator,
- 42:56it's doing really poorly.
- 42:57It's not capturing this underlying simplicity
- 42:59of the conditional effects.
- 43:00It's really just inheriting that difficulty
- 43:03in estimating the regression functions.
- 43:05Here's the X-learner, it's doing a bit better
- 43:07than the plugin, but it's still not doing
- 43:10a great job capturing the underlying simplicity
- 43:12and the conditional effect.
- 43:14This dotted line is the Oracle.
- 43:16So this is what you would get
- 43:18if you actually observed the potential outcomes.
- 43:20And then the black line is the DR-learner,
- 43:23this two-stage procedure here,
- 43:24I'm just using smoothing splines everywhere,
- 43:26just defaults in R, it's like three lines of code,
- 43:29all the code's in the paper, too,
- 43:31if you want to play around with this.
- 43:33And here we see what we expect.
- 43:35So when it's really hard to estimate the propensity score,
- 43:38it's just a hard problem and we don't do
- 43:40much better than the X-learner.
- 43:44We still get some gain over the plugin in this case,
- 43:47but as soon as you can estimate the propensity score
- 43:50well at all, you start seeing some pretty big gains
- 43:54by doing this doubly robust approach
- 43:56and at some point we start to roughly match
- 43:58the Oracle actually.
- 44:02As soon as we're getting something like
- 44:03n to the one quarter rates in this case,
- 44:04we're getting close to the Oracle.
- 44:10So maybe I'll just show you an illustration
- 44:12and then I'll talk about the second part of the talk
- 44:15and very briefly if people have,
- 44:17want to talk about that,
- 44:20offline, I'd be more than happy to.
- 44:23So here's a study, which I actually learned about
- 44:25from Peter looking at effects of canvassing
- 44:27on voter turnout, so this is this timely study.
- 44:30Here's the paper, there are almost 20,000 voters
- 44:36across six cities here.
- 44:37They're randomly encouraged to vote
- 44:42in these local elections that people would go
- 44:45and talk to them face to face.
- 44:47You remember what that was like pre-pandemic.
- 44:50Here's a script of the sort of canvassing that they did,
- 44:55just saying, reminding them of the election,
- 44:58giving them a reminder to vote.
- 45:00Hopefully I'm doing this for you as well,
- 45:02if you haven't voted already.
- 45:04And so what's the data we have here?
- 45:07We have a number of covariates things like city,
- 45:10party affiliation, some measures
- 45:11of the past voting history, age, family size, race.
- 45:15Again, the treatment is whether they were randomly contacted; it's actually whether they were randomly assigned, since in some cases people couldn't be contacted in this setup.
- 45:25So we're just looking at intention
- 45:26to treat kinds of effects.
- 45:28And then the outcome is whether people voted
- 45:30in the local election or not.
- 45:32So just as kind of a proof of concept,
- 45:35I use this DR-learner approach,
- 45:37I just use two folds and use random forests separately
- 45:42for the first stage regressions and the second stage.
- 45:47And actually for one part of the analysis,
- 45:50I used generalized additive models in that second stage.
- 45:56So here's a histogram of the conditional effect estimates.
- 46:00So there's sort of a big chunk, a little bit above zero,
- 46:03but then there is some heterogeneity
- 46:04around that in this case.
- 46:07So there are some people
- 46:08who maybe seem especially responsive to canvassing,
- 46:12maybe some people who are going to know it
- 46:15and actually some are less likely to vote, potentially.
- 46:18This is a plot of the effect estimates
- 46:21from this DR-learner procedure,
- 46:22just to see what they look like,
- 46:24how this would work in practice across
- 46:27two potentially important covariates.
- 46:30So here's the age of the voter and then the party
- 46:34and the color here represents the size and direction
- 46:39of the CATE estimate, the conditional effect estimate; blue means canvassing is having a bigger effect
- 46:45on voting in the next local election.
- 46:50Red means less likely to vote due to canvassing.
- 46:55So you can see some interesting structure here just briefly,
- 46:59the independent people,
- 47:01it seems like the effects are closer to zero.
- 47:03Democrats maybe seem more likely to be positively affected,
- 47:07maybe more so among younger people.
- 47:11It's just an example of the kind of
- 47:13sort of graphical visualization stuff you could do
- 47:16with this sort of procedure.
- 47:18This is the plot I showed before, where here,
- 47:21we're looking at just how the conditional
- 47:22effect varies with age.
- 47:24And you can see some evidence
- 47:25that younger people are responding to canvassing.
- 47:33Older people, less evidence that there's any response.
- 47:43I should stop here and see if people have any questions.
- 47:51- So Edward, can I ask a question?
- 47:53- Of course yeah.
- 47:55- I think we've discussed about point estimation.
- 47:57Does this approach also allow
- 47:59for consistent variance estimation?
- 48:01- Yeah, that's a great question.
- 48:04Yeah, I haven't included any of that here,
- 48:08but if you think about this oracle result that we have:
- 48:17If these errors are small enough,
- 48:19so under the kinds of conditions that we talked about,
- 48:22then we're getting an estimate that looks like an oracle regression of the potential outcomes on the covariates.
- 48:29And that means that as long as these are small enough,
- 48:31we could just port over any inferential tools
- 48:33that we like from standard non-parametric regression
- 48:35treating our pseudo outcomes as if they were
- 48:38the true potential outcomes, yeah.
- 48:41That's a really important point,
- 48:43I'm glad you mentioned that.
- 48:44- Thanks.
- 48:45- So inference is more complicated
- 48:47and nuanced than non-parametric regression,
- 48:51but any inferential tool could be used here.
- 48:56- So operationally, just to think
- 48:57about how to operationalize the variance estimation
- 48:59also, does that require the cross fitting procedure
- 49:03where you're swapping your D one D two
- 49:06in the estimation process and then?
- 49:10- Yeah, that's a great question too.
- 49:11So not necessarily,
- 49:12so you could just use these folds
- 49:14for nuisance training and then go to this fold
- 49:17and then just forget that you ever used this data
- 49:19and just do variance estimation here.
- 49:21The drawback there would be,
- 49:22you're only using a third of your data.
- 49:25If you really want to make full use
- 49:26of the sample size using
- 49:28the cross fitting procedure would be ideal,
- 49:31but the inference doesn't change.
- 49:32So if you do cross fitting,
- 49:35you would at the end of the day,
- 49:36you'd get an out of sample CATE estimate
- 49:39for every single row in your data, every subject,
- 49:42but just where that CATE was built from other,
- 49:45the nuisance stuff for that estimate
- 49:47was built from other samples.
- 49:50But at the end of the day, you'd get one big column
- 49:52with all these out of sample CATE estimates
- 49:54and then you could just use
- 49:55whatever inferential tools you like there.
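As a hedged sketch of that last step, with the cross-fit pseudo outcomes stored as a column (here called pseudo, with age as the covariate of interest; these names and the data frame dat are just for illustration), ordinary regression machinery gives approximate pointwise bands:

```r
library(splines)
fit  <- lm(pseudo ~ ns(age, df = 4), data = dat)           # second-stage model
pred <- predict(fit, newdata = data.frame(age = 18:90), se.fit = TRUE)
upper <- pred$fit + 1.96 * pred$se.fit                      # approximate 95% band
lower <- pred$fit - 1.96 * pred$se.fit
```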
- 50:00- Thanks.
- 50:07- So, just got a few minutes.
- 50:10So maybe I'll just give you a high level kind of picture
- 50:12of the stuff in the second part of this talk
- 50:14which is really about pursuing the fundamental limits
- 50:19of conditional effect estimation.
- 50:20So what's the best we could possibly do here?
- 50:23This is completely unknown,
- 50:25which I think is really fascinating.
- 50:27So if you think about what we have so far,
- 50:29so far, we've given these sufficient conditions under
- 50:33which this DR-learner is Oracle efficient,
- 50:36but a natural question here is what happens
- 50:38when those mean squared error terms are too big
- 50:40and so we can't say that we're getting
- 50:42the Oracle rate anymore.
- 50:45Then you might say,
- 50:46okay, is this a bug with the DR-learner?
- 50:50Maybe I could have adapted this in some way
- 50:52to actually do better or maybe I've reached the limits
- 50:56of how well I can do for estimating the effect.
- 51:00It doesn't matter if I had gone to a different estimator,
- 51:03think I would've had the same kind of error.
- 51:07So this is the goal of this last part of the work.
- 51:12So here we use a very different estimator.
- 51:14It's built using this R-learner idea,
- 51:17which is an RKHS (reproducing kernel Hilbert space) extension of this
- 51:22classic double residual regression method
- 51:24of Robinson, which is really cool.
- 51:27This is actually from 1988, so it's a classic method.
- 51:33And so we study a non-parametric version
- 51:35of this built from local polynomial estimators.
- 51:38And I'll just give you a picture
- 51:40of what the estimator is doing.
- 51:41It's quite a bit more complicated
- 51:43than that DR-learner procedure.
- 51:45So we again use this triple sample splitting
- 51:47and here it's actually much more crucial.
- 51:50So if you didn't use that triple sample splitting
- 51:53for the DR-learner,
- 51:53you'd just get a slightly different error bound,
- 51:55but here it's actually really important.
- 51:57I'd be happy to talk to people about why specifically.
- 52:01So in one part of the sample we estimate propensity scores,
- 52:04and in another part of the sample
- 52:05we estimate propensity scores and regression functions,
- 52:08now the marginal regression functions.
- 52:10We combine these to get weights, kernel weights.
- 52:13We also combine them to get residuals.
- 52:15So treatment residuals and outcome residuals.
- 52:18This is like what you would get
- 52:19for this Robinson procedure from econometrics.
- 52:24Then, instead of a simple linear regression
- 52:26of outcome residuals on treatment residuals,
- 52:28we do a weighted nonparametric regression
- 52:32of the outcome residuals on the treatment residuals.
- 52:34So that's the procedure, a little bit more complicated.
- 52:38And again, this is,
- 52:39I think there are ways to make this work well practically,
- 52:42but the goal of this work is really to try
- 52:44and figure out what's the best possible
- 52:46mean squared error that we could achieve.
- 52:47It's less about a practical method,
- 52:51more about just understanding how hard
- 52:52the conditional effect estimation problem is.
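To make the residual-on-residual idea concrete, here is a rough Python sketch of a local-constant (kernel-weighted) simplification of that procedure. The full lp-R-learner uses local polynomials, separate propensity estimates from different folds, and carefully chosen tuning parameters, none of which are reproduced here; the nuisance learners, bandwidth, and split below are placeholders.

```python
# Residual-on-residual (Robinson / R-learner style) CATE sketch, local-constant version.
# tau_hat(x) solves a kernel-weighted least squares of outcome residuals on treatment
# residuals at each evaluation point x.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(1)
n = 3000
X = rng.uniform(-1, 1, size=(n, 1))
pi = 1 / (1 + np.exp(-X[:, 0]))
A = rng.binomial(1, pi)
tau = 1 + X[:, 0] ** 2
Y = np.sin(2 * X[:, 0]) + A * tau + rng.normal(size=n)

# Simple split: first half for nuisances, second half for the residual regression.
tr, te = np.arange(n // 2), np.arange(n // 2, n)
pi_hat = np.clip(
    RandomForestClassifier(n_estimators=200, random_state=0)
    .fit(X[tr], A[tr]).predict_proba(X[te])[:, 1],
    0.01, 0.99)
m_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[tr], Y[tr]).predict(X[te])

R_A = A[te] - pi_hat            # treatment residuals
R_Y = Y[te] - m_hat             # outcome residuals (marginal regression removed)

def tau_hat(x, h=0.15):
    """Kernel-weighted regression of outcome residuals on treatment residuals at x."""
    w = np.exp(-0.5 * ((X[te, 0] - x) / h) ** 2)           # Gaussian kernel weights
    return np.sum(w * R_A * R_Y) / np.sum(w * R_A ** 2)    # local least-squares slope

grid = np.linspace(-0.9, 0.9, 7)
print([round(tau_hat(x), 2) for x in grid])                # compare with 1 + x^2
```

At each point x this is the closed-form solution of a kernel-weighted least-squares fit of outcome residuals on treatment residuals, which is the local-constant analogue of the weighted nonparametric regression described above.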
- 52:56And so we actually show that for a generic version
- 52:59of this procedure,
- 53:02as long as you estimate the propensity scores
- 53:03and the regression functions with linear smoothers
- 53:05with particular bias and variance properties,
- 53:08which are standard in nonparametrics,
- 53:11you can actually get better mean squared error
- 53:13than for the DR-learner.
- 53:15I'll just give you a sense of what this looks like.
- 53:19So you get something that looks like an oracle rate plus
- 53:22something like the squared bias from the nuisances,
- 53:27from the propensity score and regression functions.
- 53:32So before you had the product of mean squared errors;
- 53:36now we have the product of the biases,
- 53:38rather than the mean squared errors,
- 53:40of the propensity score and the regression function.
- 53:43And this opens the door to undersmoothing.
- 53:46So this means that you can estimate the propensity score
- 53:49and the regression functions in a way that would be suboptimal
- 53:52if you actually just cared about
- 53:54these functions by themselves.
- 53:56So you drive down the bias, which blows up
- 53:59the variance a little bit,
- 54:00but it turns out not to affect the conditional effect
- 54:03estimate too much if you do it in the right way.
- 54:06And so if you do this,
- 54:07you get a rate that looks like this:
- 54:11you get an oracle rate
- 54:12plus n to the minus 2s over d.
- 54:15And this is strictly better than what we got
- 54:18with the DR-learner.
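Written out schematically (a paraphrase of the rates as stated in the talk, not the exact theorem: γ is the smoothness of the CATE, α and β the smoothness of the propensity score and regression function, s = (α + β)/2 their average, d the covariate dimension; rates are on the root-mean-squared-error scale, and the nuisance errors shown for the DR-learner assume each nuisance is estimated at its own smoothness-optimal rate):

```latex
% Schematic comparison, paraphrasing the talk; constants and exact conditions omitted.
\[
\text{DR-learner:}\qquad
\underbrace{n^{-\frac{\gamma}{2\gamma+d}}}_{\text{oracle rate}}
\;+\;
\underbrace{n^{-\frac{\alpha}{2\alpha+d}}\, n^{-\frac{\beta}{2\beta+d}}}_{\text{product of nuisance errors}}
\]
\[
\text{lp-R-learner (undersmoothed):}\qquad
\underbrace{n^{-\frac{\gamma}{2\gamma+d}}}_{\text{oracle rate}}
\;+\;
\underbrace{n^{-\frac{2s}{d}}}_{\text{driven by the product of nuisance biases}}
\]
```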
- 54:20(clears throat)
- 54:21You can play the same game where you see sort
- 54:23of when the oracle rate is achieved here: it's achieved
- 54:27if the average smoothness of the nuisance functions
- 54:29is greater than d over four.
- 54:31And then here, the inflation factor is also changing.
- 54:34So before,
- 54:35we needed the smoothness to be greater than d over two
- 54:38divided by one plus d over gamma;
- 54:40now it's d over four divided by one plus d over two gamma.
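In symbols, the two thresholds just described (again a paraphrase of the spoken statement, with s the average nuisance smoothness and γ, d as above):

```latex
\[
\text{DR-learner oracle efficient if}\quad s \;\gtrsim\; \frac{d/2}{1 + d/\gamma},
\qquad\qquad
\text{lp-R-learner oracle efficient if}\quad s \;\gtrsim\; \frac{d/4}{1 + d/(2\gamma)}.
\]
```

Under this reading, average nuisance smoothness above d/4 always suffices for the lp-R-learner, whereas the corresponding DR-learner threshold approaches d/2.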
- 54:45So this is a weaker condition.
- 54:46So this is telling us that there are settings
- 54:48where the DR-learner is not oracle efficient,
- 54:52but there exists an estimator which is,
- 54:53and it looks like this estimator
- 54:56I described here,
- 54:57this regression-on-residuals thing.
- 55:02So that's the story.
- 55:03You can actually beat this DR-learner.
- 55:05And now the question is, okay, what happens?
- 55:08One, what happens
- 55:09when we're not achieving the Oracle rate here,
- 55:11can you still do better?
- 55:13A second question is, yeah,
- 55:19can anything achieve the oracle rate
- 55:20under weaker conditions than this?
- 55:22And so I haven't proved anything about this yet.
- 55:25It turns out to be somewhat difficult,
- 55:29but I conjecture that this condition is minimax.
- 55:33So I don't think any estimator
- 55:34could ever be oracle efficient
- 55:36under weaker conditions than this estimator.
- 55:40So this is just a picture of the results again.
- 55:42So it's the same setting as before; here
- 55:45we have the plug-in estimator and the DR-learner,
- 55:48and here's what we get with this,
- 55:51I call it the lp-R-learner.
- 55:53It's a local polynomial version of the R-learner.
- 55:55And so we're actually getting quite a bit smaller rates.
- 55:58We're hitting the oracle rate under weaker conditions
- 56:02on the smoothness.
- 56:03Now, the question is whether we can fill this gap anymore,
- 56:08and this is unknown.
- 56:09This is one of the open questions in causal inference.
- 56:14So yeah, I think in the interest of time,
- 56:18I'll skip to the discussion section here.
- 56:21We can actually fill the gap a little bit
- 56:22with some extra tuning,
- 56:26which is just interesting.
- 56:29Okay.
- 56:30Yeah.
- 56:31So this last part is really about just pushing the limits,
- 56:32trying to figure out what the best possible performance is.
- 56:35Okay.
- 56:36So just to wrap things up,
- 56:39right, we gave some new results here
- 56:41that let you be very flexible with
- 56:43the kinds of methods that you want to use.
- 56:46They do a good job of exploiting this CATE structure
- 56:49when it's there and don't lose much when it's not.
- 56:54So we have this nice model-free error bound.
- 56:57We also kind of get for free
- 56:59this nice general oracle inequality, and we did
- 57:03some investigation of the best possible rates
- 57:06of convergence,
- 57:07the best possible mean squared error
- 57:08for estimating conditional effects,
- 57:11which again was unknown before.
- 57:14These are the weakest conditions
- 57:15that have appeared,
- 57:17but it's still not entirely known whether
- 57:19they are minimax optimal or not.
- 57:23So, yeah, big picture goals:
- 57:24we want some nice flexible tools
- 57:26with strong guarantees, and to push forward
- 57:28our understanding of this problem.
- 57:30I hope I've conveyed that there are lots of fun,
- 57:32open problems here to work out
- 57:34with important practical implications.
- 57:37Here's just a list of them.
- 57:38I'd be happy to talk more with people at any point,
- 57:42feel free to email me. A big part is applying
- 57:44these methods in real problems.
- 57:46And yeah, I should stop here,
- 57:49but feel free to email me; the papers are on arXiv here.
- 57:53I'd be happy to hear people's thoughts.
- 57:55Yeah.
- 57:55Thanks again for inviting me.
- 57:56It was fun.
- 57:58- Yeah.
- 57:59Thanks Edward.
- 57:59That's a very nice talk and I think we're hitting the hour,
- 58:02but I want to see in the audience
- 58:04if we have any questions.
- 58:05Huh.
- 58:13All right.
- 58:14If not, I do have one final question
- 58:16if that's okay.
- 58:17- Yeah, of course.
- 58:18- And so I think there is a whole host of literature
- 58:21on flexible outcome modeling
- 58:23to estimate conditional average causal effects,
- 58:26especially those Bayesian non-parametric tree models
- 58:28(laughs)
- 58:30that are getting popular.
- 58:32So I am just curious to see if you have ever thought
- 58:36about comparing their performances,
- 58:38or do you think there are some differences
- 58:40between those Bayesian
- 58:42non-parametric tree models versus
- 58:44the plug-in estimators
- 58:46we compared in the simulation study here?
- 58:48- Yeah.
- 58:49I think of them
- 58:50as really just versions of that plugin estimator
- 58:53that use a different regression procedure.
- 58:55There may be ways to tune plugins to try
- 58:58and exploit this special structure of the CATE.
- 59:01But if you're really just looking
- 59:02at the regression functions individually,
- 59:05I think these would be susceptible to the same kinds
- 59:07of issues that we see with the plugin.
- 59:09Yeah.
- 59:10That's a good one.
- 59:11- I see.
- 59:12Yep.
- 59:13So I want to see if there's any further questions
- 59:17from the audience to Dr. Kennedy.
- 59:21(indistinct)
- 59:23- I was just wondering if you could speak a little more to
- 59:26why the standard, like, Neyman orthogonality results
- 59:29can or can't be applicable in this setup?
- 59:32- [Edward] Yeah.
- 59:33(clears throat)
- 59:34Yeah.
- 59:35That's a great question.
- 59:36So one way to say it is that these effects,
- 59:42these conditional effects,
- 59:43are not pathwise differentiable.
- 59:46And so, there's some distinction
- 59:50between Neyman orthogonality
- 59:51and pathwise differentiability,
- 59:52but maybe we can think about them
- 59:53as being roughly the same for now.
- 59:57So yeah, all the standard semiparametric
- 59:59theory breaks down here
- 01:00:01because of this lack of pathwise differentiability, so
- 01:00:04all the efficiency bounds that
- 01:00:05we know and love don't apply,
- 01:00:09but it turns out that there's some kind
- 01:00:11of analogous version of this that works for these things.
- 01:00:15I think of them as like infinite dimensional functionals.
- 01:00:18So instead of like the ATE, which is just a number,
- 01:00:20this is like a curve,
- 01:00:22but it has the same kind of functional structure
- 01:00:25in the sense that it's combining regression functions
- 01:00:28or propensity scores in some way.
- 01:00:29And we don't care about the individual components.
- 01:00:33We care about their combination.
- 01:00:36So yeah, the standard stuff doesn't work just
- 01:00:38because we're outside of this
- 01:00:40root-n regime, roughly, but there are, yeah,
- 01:00:44there's analogous structure and there's tons
- 01:00:46of important work to be done,
- 01:00:48sort of formalizing this and extending it.
- 01:00:53That's a little vague, but hopefully that helps.
- 01:01:02- All right.
- 01:01:03So any further questions?
- 01:01:08- Thanks again.
- 01:01:09And yeah.
- 01:01:10If any questions come up, feel free to email.
- 01:01:12- Yeah.
- 01:01:14If not,
- 01:01:14let's all thank Dr. Kennedy again.
- 01:01:16And I'm sure he'll be happy
- 01:01:17to answer your questions offline.
- 01:01:19So thanks everyone.
- 01:01:20I'll see you.
- 01:01:21We'll see you next week.
- 01:01:22- Thanks a lot.