
YSPH Biostatistics Virtual Seminar: “Optimal Doubly Robust Estimation of Heterogeneous Causal Effects"

November 04, 2020
  • 00:00- So let's get started.
  • 00:03Welcome everyone.
  • 00:04It is my great pleasure to introduce our speaker today,
  • 00:07Dr. Edward Kennedy, who is an assistant professor
  • 00:11at the Department of Statistics and Data Science
  • 00:13at Carnegie Mellon University.
  • 00:16Dr. Kennedy got his MA in statistics
  • 00:18and PhD in biostatistics from University of Pennsylvania.
  • 00:21He's an expert in methods for causal inference,
  • 00:24missing data and machine learning,
  • 00:26especially in settings involving
  • 00:27high dimensional and complex data structures.
  • 00:31He has also been collaborating on statistical applications
  • 00:34in criminal justice, health services,
  • 00:36medicine and public policy.
  • 00:38Today he's going to share with us his recent work
  • 00:40in the space of heterogeneous causal effect estimation.
  • 00:43Welcome Edward, the floor is yours.
  • 00:46- [Edward] Thanks so much, (clears throat)
  • 00:47yeah, thanks for the invitation.
  • 00:49I'm happy to talk to everyone today about this work
  • 00:51I've been thinking about for the last year or so.
  • 00:55Sort of excited about it.
  • 00:57Yeah, so it's all about doubly robust estimation
  • 00:59of heterogeneous treatment effects.
  • 01:03Maybe before I start,
  • 01:04I don't know what the standard approach is for questions,
  • 01:07but I'd be more than happy to take
  • 01:08any questions throughout the talk
  • 01:10and I can always sort of adapt and focus more
  • 01:13on different parts of the room,
  • 01:14what people are interested in.
  • 01:17I'm also trying to get used to using Zoom,
  • 01:21I've been teaching this big lecture course
  • 01:23so I think I can keep an eye on the chat box too
  • 01:26if people have questions that way,
  • 01:27feel free to just type something in.
  • 01:30Okay.
  • 01:31So yeah, this is sort
  • 01:34of a standard problem in causal inference
  • 01:36but I'll give some introduction.
  • 01:38The kind of classical target that people go after
  • 01:41in causal inference problems is what's
  • 01:44often called the average treatment effect.
  • 01:46So this tells you the mean outcome if everyone
  • 01:48was treated versus if everyone was untreated, for example.
  • 01:53So this is, yeah, sort of the standard target.
  • 01:57We know quite a bit about estimating this parameter
  • 02:01under no unmeasured confounding kinds of assumptions.
  • 02:05So just to sort of point this out,
  • 02:10so a lot of my work is sort of focused
  • 02:11on the statistics of causal inference,
  • 02:12how to estimate causal parameters
  • 02:15well in flexible non-parametric models.
  • 02:17So we know quite a bit
  • 02:18about this average treatment effect parameter.
  • 02:20There are still some really interesting open problems,
  • 02:23even for this sort of most basic parameter,
  • 02:25which I'd be happy to talk to people about,
  • 02:26but this is just one number, it's an overall summary
  • 02:31of how people respond to treatment, on average.
  • 02:34It can obscure potentially important heterogeneity.
  • 02:38So for example, very extreme case would be where half
  • 02:43the population is seeing a big benefit
  • 02:45from treatment and half is seeing severe harm,
  • 02:49then you would completely miss this
  • 02:50by just looking at the average treatment effect.
  • 02:53So this motivates going beyond this,
  • 02:55maybe looking at how treatment effects can vary
  • 02:58across subject characteristics.
  • 03:01All right, so why should we care about this?
  • 03:03Why should we care how treatment effects vary in this way?
  • 03:06So often when I talk about this,
  • 03:09people's minds go immediately to optimal treatment regimes,
  • 03:12which is certainly an important part of this problem.
  • 03:16So that means trying to find out who's benefiting
  • 03:19from treatment and who is not or who's being harmed.
  • 03:22And then just in developing
  • 03:24a treatment policy based on this,
  • 03:26where you treat the people who benefit,
  • 03:27but not the people who don't.
  • 03:29This is definitely an important part
  • 03:30of understanding heterogeneity,
  • 03:32but I don't think it's the whole story.
  • 03:33So it can also be very useful just
  • 03:36to understand heterogeneity from a theoretical perspective,
  • 03:39just to understand the system
  • 03:41that you're studying and not only that,
  • 03:44but also to help inform future treatment development.
  • 03:50So not just trying to optimally assign
  • 03:53the current treatment that's available,
  • 03:55but if you find, for example,
  • 03:57that there are portions of the subject population
  • 04:01that are not responding to the treatment,
  • 04:03maybe you should then go off and try and develop
  • 04:05a treatment that would better aim at these people.
  • 04:10So lots of different reasons why you might care
  • 04:12about heterogeneity,
  • 04:12including devising optimal policies,
  • 04:16but not just that.
  • 04:19And this really plays a big role across lots
  • 04:21of different fields as you can imagine.
  • 04:24We might want to target policies based on how people
  • 04:29are responding to a drug or a medical treatment.
  • 04:33We'll see a sort of political science example here.
  • 04:36So this is just a picture of what you should maybe think
  • 04:39about as we're talking about this problem
  • 04:42with heterogeneous treatment effects.
  • 04:44So this is a timely example.
  • 04:46So it's looking at the effect
  • 04:47of canvassing on voter turnout.
  • 04:50So this is the effect of being sort of reminded
  • 04:53in a face-to-face way to vote
  • 04:55that there's an election coming up
  • 04:58and how this effect varies with age.
  • 05:00And so I'll come back to where this plot came from
  • 05:04and the exact sort of data structure and analysis,
  • 05:07but just as a picture to sort of make things concrete.
  • 05:11It looks like there might be some sort of positive effect
  • 05:15of canvassing for younger people,
  • 05:16but not for older people,
  • 05:18there might be some non-linearity.
  • 05:21So this might be useful for a number of reasons.
  • 05:23You might not want to target the older population
  • 05:27with canvassing, because it may not be doing anything,
  • 05:30you might want to try and find some other way
  • 05:32to increase turnout for this group right.
  • 05:36Or you might just want to understand sort
  • 05:38of from a psychological, sociological,
  • 05:41theoretical perspective,
  • 05:44what kinds of people are responding to this sort of thing?
  • 05:49And so this is just one simple example
  • 05:51you can keep in mind.
  • 05:54So what's the state of the art for this problem?
  • 05:58So in this talk, I'm going to focus
  • 06:00on this conditional average treatment effect here.
  • 06:02So it's the expected difference in outcomes
  • 06:05if people of type X were treated versus not.
  • 06:11This is kind of the classic or standard parameter
  • 06:14that people think about now
  • 06:16in the heterogeneous treatment effects literature,
  • 06:19there are other options you could think
  • 06:21about risk ratios, for example, if outcomes are binary.
  • 06:25A lot of the methods that I talk about today
  • 06:26will have analogs for these other estimands,
  • 06:30but there are lots of fun, open problems to explore here.
  • 06:33How to characterize heterogeneous treatment effects
  • 06:35when you have time-varying treatments, continuous treatments,
  • 06:39lots of cool problems to think about.
  • 06:40But anyways, this kind of effect where we have
  • 06:44a binary treatment and some set of covariates,
  • 06:48there's really been this proliferation of proposals
  • 06:51in recent years for estimating this thing
  • 06:53in a flexible way that goes beyond just fitting
  • 06:57a linear model and looking at some interaction terms.
  • 07:01(clears throat)
  • 07:02So I guess I'll refer to the paper for a lot
  • 07:07of these different papers that have thought about this.
  • 07:12People have used sort of random forests
  • 07:14and tree-based methods, Bayesian additive
  • 07:17regression trees, lots of different variants
  • 07:20for estimating this thing.
  • 07:21So there've been lots of proposals,
  • 07:23lots of methods for estimating this,
  • 07:24but there's some really big theoretical gaps
  • 07:28in this literature.
  • 07:30So one, yeah, this is especially true
  • 07:32when you can imagine that this conditional effect
  • 07:36might be much more simple
  • 07:38or sparse or smooth than the rest
  • 07:41of the data generating process.
  • 07:42So you can imagine you have some
  • 07:45potentially complex propensity score describing
  • 07:49the mechanism by which people are treated
  • 07:50based on their covariates.
  • 07:51You have some underlying regression functions
  • 07:54that describe this outcome process,
  • 07:56how their outcomes depend on covariates,
  • 08:01whether they're treated or not.
  • 08:02These could be very complex and messy objects,
  • 08:05but this CATE might be simpler.
  • 08:08And in this kind of regime, there's very little known.
  • 08:12I'll talk more about exactly what I mean
  • 08:14by this in just a bit.
  • 08:17So one question is,
  • 08:18how do we adapt to this kind of structure?
  • 08:21And there are really no strong theoretical benchmarks
  • 08:26in this world in the last few years,
  • 08:30which means we have all these proposals,
  • 08:33which is great, but we don't know which are optimal
  • 08:36or when or if they can be improved in some way.
  • 08:41What's the best possible performance
  • 08:44that we could ever achieve at estimating
  • 08:46this quantity in the non-parametric model
  • 08:47without adding assumptions?
  • 08:49So these kinds of questions are basically
  • 08:50entirely open in this setup.
  • 08:53So the point of this work is really to try
  • 08:55and push forward to answer some of these questions.
  • 09:00There are two kind of big parts of this work,
  • 09:05which are in a paper on arXiv.
  • 09:09So one is just to provide more flexible estimators
  • 09:12of this guy and specifically to show,
  • 09:18give stronger error guarantees on estimating this.
  • 09:23So that we can use a really diverse set of methods
  • 09:27for estimating this thing in a doubly robust way
  • 09:29and still have some rigorous guarantees
  • 09:32about how well we're doing.
  • 09:34So that part is more practical.
  • 09:35It's more about giving a method
  • 09:37that people can actually implement
  • 09:38in practice that's pretty straightforward,
  • 09:41it looks like a two stage regression procedure
  • 09:44and being able to say something about this
  • 09:46that's model free and agnostic about both
  • 09:52the underlying data generating process
  • 09:53and what methods we're using to construct the estimator.
  • 09:57This was lacking in the previous literature.
  • 09:59So that's one side of this work, which is more practical.
  • 10:03I think I'll focus more on that today,
  • 10:06but we can always adapt as we go,
  • 10:09if people are interested in other stuff.
  • 10:10I'm also going to talk a bit about an analysis of this,
  • 10:12just to show you sort of how it would work in practice.
  • 10:16So that's one part of this work.
  • 10:17The second part is more theoretical and it says,
  • 10:22so I don't want to just sort of construct
  • 10:24an estimator that has the nice error guarantees,
  • 10:27but I want to try and figure out what's
  • 10:29the best possible performance I could ever get
  • 10:31at estimating these heterogeneous effects.
  • 10:36This turns out to be a really hard problem
  • 10:39with a lot of nuance,
  • 10:42but that's sort of the second part
  • 10:43of the talk, which maybe I'll tackle
  • 10:46in a bit less time.
  • 10:50So that's kind of big picture.
  • 10:51I like to give the punchline of the talk at the start,
  • 10:53just so you have an idea of what I'm going to be covering.
  • 10:57And yeah, so now let's go into some details.
  • 11:01So we're going to think about this sort
  • 11:03of classic causal inference data structure,
  • 11:06where we have n iid observations, we have covariates X,
  • 11:10which are D dimensional, binary treatment for now,
  • 11:14all the methods that I'll talk about will work
  • 11:17without any extra work in the discrete treatment setting
  • 11:21if we have multiple values.
  • 11:23The continuous treatment setting
  • 11:24is more difficult it turns out.
  • 11:27And some outcome Y that we care about.
  • 11:31All right, so there are a couple of characters
  • 11:33in this talk that will play really important roles.
  • 11:37So we'll have some special notation for them.
  • 11:39So PI of X, this is the propensity score.
  • 11:42This is the chance of being treated,
  • 11:44given your covariates.
  • 11:47So some people might be more or less likely
  • 11:49to be treated depending on their baseline covariates, X.
  • 11:54Mu of a, this will be an outcome regression function.
  • 11:57So it's your expected outcome given your covariates
  • 12:00and given your treatment level.
  • 12:02And then we'll also later on in the talk use this eta,
  • 12:04which is just the marginal outcome regression.
  • 12:06So without thinking about treatment,
  • 12:08just how the outcome varies on average as a function of X.
  • 12:14And so those are the three main characters in this talk,
  • 12:16we'll be using them throughout.
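In symbols, writing an observation as Z = (X, A, Y), the three functions just described are

\[
\pi(x) = P(A = 1 \mid X = x), \qquad
\mu_a(x) = E(Y \mid X = x, A = a), \qquad
\eta(x) = E(Y \mid X = x).
\]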
  • 12:18So under these standard causal assumptions
  • 12:21of consistency, positivity, exchangeability,
  • 12:24there's a really amazing group
  • 12:27at Yale that are focused on dropping these assumptions.
  • 12:31So lots of cool work to be done there,
  • 12:34but we're going to be using them today.
  • 12:36So consistency, we're roughly thinking
  • 12:38this means there's no interference,
  • 12:40this is a big problem in causal inference,
  • 12:42but we're going to say
  • 12:43that my treatments can't affect your outcomes, for example.
  • 12:47We're going to think about the case where everyone
  • 12:48has some chance at receiving treatment,
  • 12:51both treatment and control,
  • 12:52and then we have no unmeasured confounding.
  • 12:54So we've collected enough sufficiently relevant covariates
  • 12:57that once we conditioned on them,
  • 12:59look within levels of the covariates,
  • 13:00the treatment is as good as randomized.
  • 13:03So under these three assumptions,
  • 13:06this conditional effect on the left-hand side here
  • 13:09can just be written as a difference in regression functions.
  • 13:11It's just the difference in the regression function
  • 13:13under treatment versus control,
  • 13:15sort of super simple parameter right.
  • 13:18So I'm going to call this thing Tau.
  • 13:20This is just the regression under treatment minus
  • 13:22the regression under control.
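Written out, under those three assumptions the conditional effect being described is

\[
\tau(x) \;=\; E(Y^1 - Y^0 \mid X = x) \;=\; \mu_1(x) - \mu_0(x),
\]

just the difference of the two regression functions.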
  • 13:27So you might think, we know a lot about
  • 13:29how to estimate regression functions non-parametrically,
  • 13:33there are really nice minimax lower bounds
  • 13:36that say we can't do better uniformly across the model
  • 13:41without adding some assumptions or some extra structure.
  • 13:46The fact that we have a difference
  • 13:47in regression doesn't seem like
  • 13:48it would make things more complicated
  • 13:50than just the initial regression problem,
  • 13:53but it turns out it really does,
  • 13:55it's super interesting,
  • 13:56this is one of the parts of this problem
  • 13:57that I think is really fascinating.
  • 14:00So just by taking a difference in regressions,
  • 14:03you completely change the nature of this problem
  • 14:06from the standard non-parametric regression setup.
  • 14:10So let's get some intuition for why this is the case.
  • 14:14So why isn't it optimal just to estimate
  • 14:17the two regression functions
  • 14:18and take a difference, for example?
  • 14:21So let's think about a simple data generating process
  • 14:23where we have just a one dimensional covariate,
  • 14:26it's uniform on minus one, one,
  • 14:29we have a simple step function propensity score
  • 14:32and then we're going to think
  • 14:32about a regression function, both under treatment
  • 14:35and control that looks like some kind
  • 14:37of crazy polynomial from this Gyorfi textbook,
  • 14:40I'll show you a picture in just a minute.
  • 14:44The important thing about this polynomial
  • 14:47is that it's non-smooth, it has a jump,
  • 14:50has some kinks in it and so it will be hard to estimate,
  • 14:56in general, but we're taking both
  • 15:00the regression function under treatment
  • 15:01and the regression function under control
  • 15:03to be equal, they're equal to this same hard
  • 15:06to estimate polynomial function.
  • 15:07And so that means the difference is really simple,
  • 15:10it's just zero, it's the simplest conditional effect
  • 15:12you can imagine, not only constant, but zero.
  • 15:15You can imagine this probably happens a lot in practice
  • 15:18where we have treatments that are not extremely effective
  • 15:22for everyone in some complicated way.
  • 15:26So the simplest way you would estimate
  • 15:29this conditional effect is just take an estimate
  • 15:32of the two regression functions and take a difference.
  • 15:35Sometimes I'll call this plugin estimator.
  • 15:38There's this paper by Künzel and colleagues
  • 15:41that calls it the T-learner.
  • 15:43So for example, we can use smoothing splines,
  • 15:46estimate the two regression functions and take a difference.
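As a rough sketch of this plug-in (T-learner) idea on a toy stand-in for the example just described — the step-function propensity score and the bumpy outcome function below are illustrative choices, not the exact ones from the paper:

```r
# Toy version of the motivating example: 1-d covariate, step-function propensity
# score, and the same hard-to-estimate outcome function in both arms (true CATE = 0).
set.seed(1)
n  <- 2000
x  <- runif(n, -1, 1)
ps <- ifelse(x < 0, 0.25, 0.75)            # step-function propensity score (stand-in)
m  <- function(x) sin(8 * x) + (x > 0.5)   # bumpy stand-in for mu_0 = mu_1
a  <- rbinom(n, 1, ps)
y  <- m(x) + rnorm(n, sd = 0.5)

# Plug-in / T-learner: fit each arm separately, then take the difference.
fit1 <- smooth.spline(x[a == 1], y[a == 1])
fit0 <- smooth.spline(x[a == 0], y[a == 0])
tau_plugin <- predict(fit1, x)$y - predict(fit0, x)$y   # inherits each fit's wiggliness
```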
  • 15:49And maybe you can already see what's going to go wrong here.
  • 15:52So these individual regression functions
  • 15:54by themselves are really hard to estimate.
  • 15:58They have jumps and kinks, they're messy functions
  • 16:01And so when we try and estimate these
  • 16:03with smoothing splines, for example,
  • 16:05we're going to get really complicated estimates
  • 16:08that have some bumps, It's hard to choose
  • 16:11the right tuning parameter, but even if we do,
  • 16:14we're inheriting the sort of complexity
  • 16:16of the individual regression functions.
  • 16:18When we take the difference,
  • 16:19we're going to see something
  • 16:20that is equally complex here
  • 16:22and so it's not doing a good job of exploiting
  • 16:25this simple structure in the conditional effect.
  • 16:30This is sort of analogous to this intuition
  • 16:33that people have that interaction terms might
  • 16:36be smaller or less worrisome than sort
  • 16:41of main effects in a regression model.
  • 16:43Or you can think of the muse as sort of main effects
  • 16:45and the differences as like an interaction.
  • 16:49So here's a picture of this data
  • 16:51in the simple motivating example.
  • 16:53So we've got treated people on the left
  • 16:55and untreated people on the right
  • 16:57and this gray line is the true, that messy,
  • 17:00weird polynomial function that we're thinking about.
  • 17:03So here's a jump and there's a couple
  • 17:06of kinks here and there's confounding.
  • 17:09So treated people are more likely to have larger Xs,
  • 17:13untreated people are more likely to have smaller Xs.
  • 17:16So what happens here is the function is sort
  • 17:19of a bit easier to estimate on the right side.
  • 17:22And so for treated people, we're going to take a sort
  • 17:24of larger bandwidth, get a smoother function.
  • 17:28For untreated people, it's harder to estimate
  • 17:30on the left side and so we're going to need
  • 17:31a small bandwidth to try and capture this jump,
  • 17:34for example, this discontinuity.
  • 17:38And so what's going to happen is when you take a difference
  • 17:40of these two regression estimates, these black lines
  • 17:42are just the standard smoothing spline estimates
  • 17:46that you get with one line of code,
  • 17:48using the default bandwidth choices.
  • 17:50When you take a difference,
  • 17:51you're going to get something
  • 17:52that's very complex and messy and it's not doing
  • 17:55a good job of recognizing that the regression functions
  • 17:58are the same under treatment and control.
  • 18:03So what else could we do?
  • 18:04This maybe points to this fact that
  • 18:06the plugin estimator breaks,
  • 18:09it doesn't do a good job of exploiting a structure,
  • 18:11but what other options do we have?
  • 18:13So let's say that we knew the propensity scores.
  • 18:15So for just simplicity, say we were in a trial,
  • 18:19for example, an experiment,
  • 18:21where we randomized everyone to treat them
  • 18:23with some probability that we knew.
  • 18:26In that case, we could construct a pseudo outcome,
  • 18:28which is just like an inverse probability weighted outcome,
  • 18:31which has exactly the right conditional expectation,
  • 18:35its conditional expectation is exactly equal
  • 18:37to that conditional effect.
  • 18:39And so when you did a non-parametric regression
  • 18:41of the pseudo outcome on X,
  • 18:43it would be like doing an oracle regression
  • 18:45of the true difference in potential outcomes,
  • 18:47it has exactly the same conditional expectation.
  • 18:50And so this sort of turns this hard problem
  • 18:53into a standard non-parametric regression problem.
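Concretely, with a known propensity score the pseudo outcome being described is something like

\[
\varphi(Z) \;=\; \left( \frac{A}{\pi(X)} - \frac{1 - A}{1 - \pi(X)} \right) Y,
\qquad
E\{\varphi(Z) \mid X = x\} \;=\; \tau(x),
\]

so a nonparametric regression of \(\varphi(Z)\) on X targets the same conditional expectation as the oracle regression of \(Y^1 - Y^0\) on X.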
  • 18:56Now this is a special case where we knew
  • 18:58the propensity scores; for the rest
  • 18:59of the talk we're gonna think about what happens
  • 19:01when we don't know these, what can we say?
  • 19:04So here's just a picture of what we get in the setup.
  • 19:08So this red line is this really messy plug in estimator
  • 19:10that we get that's just inheriting that complexity
  • 19:13of estimating the individual regression functions
  • 19:15and then these black and blue lines are IPW
  • 19:19and doubly robust versions that exploit
  • 19:22this underlying smoothness and simplicity
  • 19:25of the heterogeneous effects, the conditional effects.
  • 19:32So this is just a motivating example
  • 19:34to help us get some intuition for what's going on here.
  • 19:39So these results are sort of standard in this problem,
  • 19:41we'll come back to some simulations later on.
  • 19:43And so now our goal is going to be to study the error
  • 19:48of the sort of inverse weighted kind of procedure,
  • 19:51but a doubly robust version.
  • 19:53We're going to give some new model free error guarantees,
  • 19:57which let us use very flexible methods
  • 19:59and it turns out we'll actually get better error rates
  • 20:03than what were achieved previously in the literature,
  • 20:08even when focusing specifically on some particular method.
  • 20:12And then again, we're going to see,
  • 20:13how well can we actually do estimating
  • 20:15this conditional effect in this problem.
  • 20:21Might be a good place to pause
  • 20:23and see if people have any questions.
  • 20:33Okay.
  • 20:34(clears throat)
  • 20:35Feel free to shout out any questions
  • 20:37or stick them on the chat if any come up.
  • 20:42So we're going to start by thinking about
  • 20:45a pretty simple two-stage doubly robust estimator,
  • 20:48which I'm going to call the DR-learner,
  • 20:50this is following this nomenclature that's become kind
  • 20:53of common in the heterogeneous effects literature
  • 20:57where we have letters and then a learner.
  • 21:01So I'm calling this the DR-Learner,
  • 21:02but this is not a new procedure,
  • 21:04but the version that I'm going to analyze
  • 21:06has some variations, but it was actually first proposed
  • 21:08by Mark van der Laan in 2013, and was used in 2016
  • 21:13by Alex Luedtke and Mark van der Laan.
  • 21:16So they proposed this,
  • 21:17but they didn't give specific error bounds.
  • 21:21I think relatively few people know
  • 21:23about these earlier papers because this approach
  • 21:25was then sort of rediscovered in various ways
  • 21:28after that in the following years,
  • 21:30typically in these later versions,
  • 21:32people use very specific methods for estimating,
  • 21:37for constructing the estimator,
  • 21:38which I'll talk about in detail in just a minute,
  • 21:41for example, using kernel kind of methods,
  • 21:44local polynomials, and this paper used
  • 21:48a sort of series or spline regression.
  • 21:52So.
  • 21:53(clears throat)
  • 21:54These papers are nice ways
  • 21:56of doing doubly robust estimation,
  • 21:58but they had a couple of drawbacks,
  • 22:00which we're going to try and build on in this work.
  • 22:03So one is, we're going to try not to commit
  • 22:05to using any particular methods.
  • 22:07We're going to see what we can say about error guarantees,
  • 22:11just for generic regression procedures.
  • 22:15And then we're going to see
  • 22:16if we can actually weaken the sort of assumptions
  • 22:19that we need to get oracle type behavior.
  • 22:22So the behavior of an estimator that we would see
  • 22:25if we actually observed the potential outcomes
  • 22:28and it turns out we'll be able to do this,
  • 22:29even though we're not committing to particular methods.
  • 22:34There's also a really nice paper by Foster
  • 22:35and Syrgkanis from last year,
  • 22:38which also considered a version of this DR-learner
  • 22:41and they had some really nice model agnostic results,
  • 22:44but they weren't doubly robust.
  • 22:46So, in this work we're going to try
  • 22:48and doubly-robustify these results.
  • 22:54So that's the sort of background and an overview.
  • 22:57So let's think about what this estimator is actually doing.
  • 23:01So here's the picture of this,
  • 23:03what I'm calling the DR-learner.
  • 23:05So we're going to do some interesting sample splitting
  • 23:08here and later where we split our sample
  • 23:10into three different groups.
  • 23:13So one's going to be used for nuisance training
  • 23:16for estimating the propensity score.
  • 23:19And then I'm also going to estimate
  • 23:21the regression functions, but in a separate fold.
  • 23:26So I'm separately estimating my propensity score
  • 23:29and regression functions.
  • 23:30This turns out to not be super crucial for this approach.
  • 23:35It actually is crucial for something I'll talk
  • 23:37about later in the talk,
  • 23:39this is just to give a nicer error bound.
  • 23:43So the first stage is we estimate these nuisance functions,
  • 23:45the propensity scores and the regressions.
  • 23:48And then we go to this new data that we haven't seen yet,
  • 23:53our third fold of split data
  • 23:56and we construct a pseudo outcome.
  • 23:58Pseudo outcome looks like this, it's just some combination,
  • 24:01it's like an inverse probability weighted residual term
  • 24:04plus something like the plug-in estimator
  • 24:07of the conditional effect.
  • 24:09So it's just some function of the propensity score estimates
  • 24:12and the regression estimates.
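In symbols, this pseudo outcome is (up to notation) the usual uncentered doubly robust quantity

\[
\hat\varphi(Z) \;=\; \frac{A - \hat\pi(X)}{\hat\pi(X)\{1 - \hat\pi(X)\}}\,\big\{Y - \hat\mu_A(X)\big\} \;+\; \hat\mu_1(X) - \hat\mu_0(X),
\]

an inverse-probability-weighted residual plus the plug-in difference of the regression estimates.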
  • 24:15If you've used doubly robust estimators
  • 24:17before you'll recognize this as what
  • 24:19we average when we construct
  • 24:21a usual doubly robust estimator
  • 24:24of the average treatment effect.
  • 24:25And so intuitively, instead of averaging this here,
  • 24:28we're just going to regress it on covariates,
  • 24:30that's exactly how this procedure works.
  • 24:33So it's pretty simple: construct the pseudo outcome,
  • 24:36which we typically would average to estimate the ATE,
  • 24:39now, we're just going to do a regression
  • 24:41of this thing on covariates in our third sample.
  • 24:45So we can write our estimator this way.
  • 24:47This E hat n notation just means
  • 24:49some generic regression estimator.
  • 24:53So one of the crucial points in this work,
  • 24:55so I'm not going to, I want to see what I can say
  • 24:57about the error of this estimator without committing
  • 25:01to a particular estimator.
  • 25:02So if you want to use random forests in that last stage,
  • 25:05I want to be able to tell you what kind
  • 25:07of error to expect or if you want
  • 25:09to use linear regression
  • 25:10or whatever procedure you like,
  • 25:13the goal would be to give you some nice error guarantee.
  • 25:16So (indistinct), and you should think of it as just
  • 25:18your favorite regression estimator.
  • 25:21So we take the pseudo outcome,
  • 25:22we regress it on covariates, super simple,
  • 25:25just create a new column in your dataset,
  • 25:27which looks like this pseudo outcome.
  • 25:28And then treat that as the outcome
  • 25:30in your second stage regression.
  • 25:35So here, let's say we split
  • 25:39our sample into half for the second stage regression,
  • 25:42we would get an n-over-two kind of rate,
  • 25:43since we'd be using half our sample
  • 25:46for the second stage regression.
  • 25:48You can actually just swap these samples
  • 25:49and then you'll get back the full sample size errors.
  • 25:54So it would be as if you had used
  • 25:56the full sample size all at once.
  • 25:59That's called Cross Fitting,
  • 26:01it's becoming sort of popular in the last couple of years.
  • 26:03So here's a schematic of what this thing is doing.
  • 26:06So we split our data into thirds,
  • 26:07use one third
  • 26:09to estimate the propensity score,
  • 26:10another third to estimate the regression functions,
  • 26:12we use those to construct a pseudo outcome
  • 26:14and then we do a second stage regression
  • 26:16of that pseudo outcome on covariates.
  • 26:19So pretty easy, you can do this in three lines of code.
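A minimal sketch of that recipe, continuing the toy example above — the logistic propensity model, the fold assignment, and smoothing splines in every stage are illustrative choices here, and the exact implementation is in the paper:

```r
# DR-learner sketch: split into thirds, estimate nuisances on two folds,
# form the doubly robust pseudo-outcome on the third, then regress it on X.
fold <- sample(rep(1:3, length.out = n))

ps_fit  <- glm(a ~ x, family = binomial, subset = (fold == 1))          # propensity model
mu1_fit <- smooth.spline(x[fold == 2 & a == 1], y[fold == 2 & a == 1])  # outcome model, treated
mu0_fit <- smooth.spline(x[fold == 2 & a == 0], y[fold == 2 & a == 0])  # outcome model, control

idx    <- (fold == 3)
pihat  <- predict(ps_fit, newdata = data.frame(x = x[idx]), type = "response")
mu1hat <- predict(mu1_fit, x[idx])$y
mu0hat <- predict(mu0_fit, x[idx])$y
muahat <- ifelse(a[idx] == 1, mu1hat, mu0hat)

# Doubly robust pseudo-outcome: weighted residual plus plug-in difference.
pseudo <- (a[idx] - pihat) / (pihat * (1 - pihat)) * (y[idx] - muahat) + (mu1hat - mu0hat)

# Second-stage regression of the pseudo-outcome on the covariates.
dr_fit  <- smooth.spline(x[idx], pseudo)
tau_hat <- predict(dr_fit, x)$y
```

Cross fitting, mentioned above, would rotate the roles of the folds and combine the fits to recover the full sample size.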
  • 26:25Okay.
  • 26:26And now our goal is to say something
  • 26:28about the error of this procedure,
  • 26:29being completely agnostic about
  • 26:30how we estimate these propensity scores,
  • 26:32the regression functions and what procedure
  • 26:34we use in this third stage or second stage.
  • 26:40And it turns out we can do this by exploiting
  • 26:42the sample splitting and come up with a strong guarantee
  • 26:46that actually gives you smaller errors than
  • 26:48what appeared in the previous literature
  • 26:50when people focused on specific methods.
  • 26:52And the main thing is we're really exploiting
  • 26:55the sample splitting.
  • 26:58And then the other tool that we're using
  • 26:59is we're assuming some stability condition
  • 27:02on that second stage estimator,
  • 27:03that's the only thing we assume here.
  • 27:06It's really mild, I'll tell you what it is right now.
  • 27:10So you say that regression estimator is stable,
  • 27:14if when you add some constant to the outcome
  • 27:18and then do a regression, you get something
  • 27:21that's the same as if you do the regression
  • 27:22and then add some constant.
  • 27:25So it's pretty intuitive,
  • 27:26if a method didn't satisfy this,
  • 27:27it would be very weird
  • 27:30and actually for the proof,
  • 27:32we don't actually need this to be exactly equal.
  • 27:35So adding a constant pre versus post regression
  • 27:38shouldn't change things too much.
  • 27:40You don't have to have it be exactly equal,
  • 27:43it still works if it's just equal up
  • 27:44to the error in the second stage regression.
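As a rough statement, writing \(\hat E_n\{\cdot \mid X = x\}\) for the second stage regression estimator, the condition is that for any constant c,

\[
\hat E_n\{Y + c \mid X = x\} \;\approx\; \hat E_n\{Y \mid X = x\} + c,
\]

with equality needed only up to the order of the second stage regression error.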
  • 27:52So that's the first stability condition.
  • 27:55The second one is just that if you have
  • 27:57two random variables with the same conditional expectation,
  • 28:00then the mean squared error is going
  • 28:01to be the same up to constants.
  • 28:03Again, any procedure
  • 28:05that didn't satisfy these two assumptions
  • 28:07would be very bizarre.
  • 28:11These are very mild stability conditions.
  • 28:13And that's essentially all we need.
  • 28:15So now our benchmark here is going to be an oracle estimator
  • 28:21that instead of doing a regression with the pseudo,
  • 28:23it does a regression with the actual potential outcomes,
  • 28:26Y one and Y zero.
  • 28:30So we can think about the mean squared error
  • 28:31of this estimator, so I'm using mean squared error,
  • 28:34just sort of for simplicity and convention,
  • 28:36you could think about translating this
  • 28:38to other kinds of measures of risk.
  • 28:40That would be an interesting area for future work.
  • 28:44So this R star is the Oracle
  • 28:47mean squared error.
  • 28:48It's the mean squared error you'd get for estimating
  • 28:50the conditional effect if you actually saw
  • 28:52the potential outcomes.
  • 28:55So we get this really nice, simple result,
  • 28:57which says that the mean squared error
  • 28:59of that DR-learner procedure that uses the pseudo outcomes,
  • 29:03it just looks like the Oracle mean squared error,
  • 29:06plus a product of mean squared errors in estimating
  • 29:08the propensity score and the regression function.
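Schematically — see the paper for the precise statement and constants — the bound being described has the form

\[
E\big[\{\hat\tau_{\mathrm{DR}}(X) - \tau(X)\}^2\big]
\;\lesssim\;
E\big[\{\tilde\tau_{\mathrm{oracle}}(X) - \tau(X)\}^2\big]
\;+\;
E\big[\{\hat\pi(X) - \pi(X)\}^2\big] \cdot E\big[\{\hat\mu_a(X) - \mu_a(X)\}^2\big],
\]

the Oracle mean squared error plus a product of the nuisance mean squared errors.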
  • 29:12It resembles the kind of doubly robust error results
  • 29:16that you see for estimating average treatment effects,
  • 29:18but now we have this for conditional effects.
  • 29:23The proof technique is very different here compared
  • 29:25to what is done in the average effect case.
  • 29:29But the proof is actually very, very straightforward.
  • 29:32It's like a page long, you can take a look in the paper,
  • 29:35it's really just leaning on this sample splitting
  • 29:38and then using stability in a slightly clever way.
  • 29:42But the most complicated tool it uses is just
  • 29:45some careful use of the components
  • 29:49of the estimator and iterated expectation.
  • 29:53So it's really a pretty simple proof, which I like.
  • 29:57So yeah, this is the main result.
  • 29:59And again, we're not assuming anything beyond
  • 30:02this mild stability here, which is nice.
  • 30:04So you can use whatever regression procedures you like.
  • 30:07And this will tell you something about the error
  • 30:09how it relates to the Oracle error that you would get
  • 30:12if you actually observed the potential outcomes.
  • 30:18So this is model free method-agnostic,
  • 30:21it's also a finite sample bound,
  • 30:23there's nothing asymptotic here.
  • 30:25This means that the mean squared error is upper bounded up
  • 30:28to some constant times this term on the right.
  • 30:31So there's no n going to infinity or anything here either.
  • 30:39So the other crucial point of this is
  • 30:41because we have a product of mean squared errors,
  • 30:44you have the kind of usual doubly robust story.
  • 30:46So if one of these is small, the product will be small,
  • 30:50potentially more importantly, if they're both kind
  • 30:52of modest sized because both, maybe the propensity score
  • 30:55and the regression functions are hard to estimate
  • 30:57the product will be potentially quite a bit smaller
  • 31:01than the individual pieces.
  • 31:04And this is why this is showing you that that sort
  • 31:08of plug-in approach, which would really just be driven
  • 31:10by the mean squared error for estimating
  • 31:11the regression functions can be improved by quite a bit,
  • 31:15especially if there's some structure to exploit
  • 31:17in the propensity scores.
  • 31:23Yeah, so in previous work people used specific methods.
  • 31:26So they would say I'll use
  • 31:28maybe series estimators or kernel estimators
  • 31:31and then the error bound was actually bigger
  • 31:34than what we get here.
  • 31:36So it's a little surprising that you can get
  • 31:38a smaller error bound under weaker assumptions,
  • 31:40but this is a nice advantage
  • 31:42of the sample splitting trick here.
  • 31:49Now that you have this nice error bound you can plug
  • 31:52in sort of results from any of your favorite estimators.
  • 31:56So we know lots about mean squared error
  • 31:59for estimating regression functions.
  • 32:01And so you can just plug in what you get here.
  • 32:03So for example, you think about smooth functions.
  • 32:07So these are functions in Hölder classes, intuitively
  • 32:11these are functions that are close to their Taylor
  • 32:13approximations, the strict definition of
  • 32:17which maybe I'll pass over in the interest of time.
  • 32:22Then you can say, for example, if PI is alpha smooth,
  • 32:25so it has alpha partial derivatives
  • 32:30with the highest order derivative Lipschitz, then we know
  • 32:33that you can estimate the propensity score
  • 32:37with the mean squared error that looks like
  • 32:38n to the minus two alpha over two alpha plus D,
  • 32:41this is the usual non-parametric regression
  • 32:44mean squared error.
  • 32:46You can say the same thing for the regression functions.
  • 32:49If they're beta smooth, then we can estimate them
  • 32:51at the usual non-parametric rate,
  • 32:53n to the minus two beta over two beta plus D.
  • 32:56Then we could say,
  • 32:57okay, suppose the conditional effect,
  • 32:59Tau is gamma smooth, and gamma, it can't be smaller
  • 33:04than beta, it has to be at least as smooth
  • 33:05as the regression functions and in practice,
  • 33:08it could be much more smooth.
  • 33:09So for example, in the case where the CATE is just zero
  • 33:12or constant, Gamma's like infinity, infinitely smooth.
  • 33:17Then if we use a second stage estimator that's optimal
  • 33:20for estimating Gamma smooth functions,
  • 33:24we can just plug in the error rates
  • 33:25that we get and see
  • 33:26that we get a mean squared error bound
  • 33:28that looks like the Oracle rate.
  • 33:30This is the rate we would get if we actually observed
  • 33:33the potential outcomes.
  • 33:35And then we get this product of mean squared errors.
  • 33:37And so whenever this product of mean squared errors
  • 33:40is smaller than the Oracle rate,
  • 33:42then we're achieving the Oracle rate up to constants,
  • 33:46the same rate that we would get
  • 33:47if we actually saw Y one minus Y zero.
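Plugging in those standard nonparametric rates, the bound takes roughly the form

\[
E\big[\{\hat\tau_{\mathrm{DR}}(X) - \tau(X)\}^2\big]
\;\lesssim\;
n^{-\frac{2\gamma}{2\gamma + d}}
\;+\;
n^{-\frac{2\alpha}{2\alpha + d}}\, n^{-\frac{2\beta}{2\beta + d}},
\]

with the first term the Oracle rate for a \(\gamma\)-smooth CATE and the second the product of the nuisance mean squared errors, so the Oracle rate is attained (up to constants) whenever the product term is of smaller order.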
  • 33:51And so you can work out the conditions,
  • 33:53what you need to make this term smaller than this one,
  • 33:56that's just some algebra
  • 34:00and it has some interesting structure.
  • 34:03So if the average smoothness of the two nuisance functions,
  • 34:07the propensity score and the regression function
  • 34:09is greater than D over two divided by some inflation factor,
  • 34:14then you can say that you're achieving
  • 34:18the same rate as this Oracle procedure.
  • 34:22So the analog of this for the average treatment effect,
  • 34:27or the result you need
  • 34:28for the standard doubly robust estimator
  • 34:30of the average treatment effect,
  • 34:31is that the average smoothness is greater than D over two.
  • 34:34So here we don't have D over two,
  • 34:36we have D over two over one plus D over gamma.
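Roughly, as stated here, the condition for matching the Oracle works out to

\[
\frac{\alpha + \beta}{2} \;>\; \frac{d/2}{1 + d/\gamma},
\]

compared with the familiar average-smoothness condition \((\alpha + \beta)/2 > d/2\) for the average treatment effect.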
  • 34:40So this is actually giving you a sort
  • 34:44of a lower threshold for achieving Oracle rates
  • 34:48in this problem.
  • 34:49So, because it's a harder problem,
  • 34:51we need weaker conditions
  • 34:52on the nuisance estimation to behave like an Oracle
  • 34:56and how much weaker those conditions
  • 34:58are, depends on the dimension of the covariates
  • 35:00and the smoothness of the conditional effect.
  • 35:04So if we think about the case where the conditional effect
  • 35:06is like infinitely smooth,
  • 35:07so this is almost like a parametric problem.
  • 35:10Then we recover the usual condition that we need
  • 35:13for the doubly robust estimator to be root-n
  • 35:14consistent, average smoothness greater than D over two.
  • 35:20But for finite, non-trivial smoothness of the conditional effect,
  • 35:26then we're somewhere in between sort of when
  • 35:28a plugin is optimal and this nice kind of parametric setup.
  • 35:34So this is just a picture of the rates here
  • 35:37which is useful to keep in mind.
  • 35:39So here on the x-axis, we have the smoothness
  • 35:43of the nuisance functions.
  • 35:44You can think of this as the average smoothness
  • 35:46of the propensity score and regression functions.
  • 35:49And again, in this Hölder smooth model,
  • 35:52which is a common model people use in non-parametrics,
  • 35:55the more smooth things are
  • 35:57the easier it is to estimate them.
  • 36:00And then here we have the mean squared error
  • 36:02for estimating the conditional effect.
  • 36:06So here is the minimax lower bound,
  • 36:09this is the best possible mean squared error
  • 36:11that you can achieve for the average treatment effect.
  • 36:14This is just to kind of anchor our results
  • 36:16and think about what happens relative to this nicer,
  • 36:19simpler parameter, which is just the overall average
  • 36:21and not the conditional average.
  • 36:24So once you hit a certain smoothness in this case,
  • 36:26it's five, so this is looking at
  • 36:28a 20 dimensional covariate case where
  • 36:32the CATE smoothness is twice the dimension
  • 36:35just to fix ideas.
  • 36:38And so once we hit this smoothness of five,
  • 36:42so we have five partial derivatives,
  • 36:43then it's possible to achieve a root-n rate.
  • 36:47So this is n to the one half for estimating
  • 36:50the average treatment effect.
  • 36:52Root-n rates are never possible for conditional effects.
  • 36:55So here's the Oracle rate.
  • 36:58This is the rate that we would achieve in this problem
  • 37:00if we actually observed the potential outcomes.
  • 37:02So it's slower than root-n, it's a bigger error.
  • 37:08Here's what you would get with the plugin.
  • 37:10This is just really inheriting the complexity
  • 37:13of estimating the regression functions individually,
  • 37:15it doesn't capture this CATE smoothness
  • 37:18and so you need the regression functions
  • 37:20to be sort of infinitely smooth or as smooth
  • 37:22as the CATE to actually get Oracle efficiency
  • 37:25with the plugin estimator.
  • 37:28So this plugin has big errors;
  • 37:29if we use this DR-learner approach,
  • 37:32we close this gap substantially.
  • 37:36So we can say that we're hitting this Oracle rate.
  • 37:39Once we have a certain amount of smoothness
  • 37:41of the nuisance functions and in between
  • 37:44we get an error that looks something like this.
  • 37:49So this is just a picture of these rate results showing,
  • 37:53graphically, the improvement of the DR-learner approach
  • 37:56here over a simple plug-in estimator.
  • 38:03So yeah, just the punchline here is
  • 38:06this simple two-stage doubly robust approach
  • 38:09can do a good job adapting to underlying structure
  • 38:12in the conditional effect,
  • 38:14even when the nuisance stuff,
  • 38:16the propensity scores
  • 38:17and the underlying regression functions
  • 38:18are more complex or less smooth in this case.
  • 38:24This is just talking about the relation
  • 38:26to the average treatment effect conditions,
  • 38:28which I mentioned before.
  • 38:32So you can do the same thing for any generic
  • 38:34regression methods you like.
  • 38:35So in the paper, I do this for smooth models
  • 38:38and sparse models, which are common
  • 38:39in these non-parametric settings,
  • 38:41where you have high dimensional Xs
  • 38:43and you believe that some subset
  • 38:45of them are the ones that matter.
  • 38:48So I'll skip past this, if you're curious though,
  • 38:50all the details are in the paper.
  • 38:52So you can say, what kind of sparsity do
  • 38:53you need in the propensity score
  • 38:55and regression functions to be able
  • 38:57to get something that behaves like an Oracle
  • 38:59that actually saw the potential outcomes from the start.
  • 39:04You can also do the same kind of game
  • 39:05where you compare this to what you need
  • 39:06for the average treatment effect.
  • 39:11Yeah, happy to talk about this offline
  • 39:13or afterwards if people have questions.
  • 39:18So there's also a nice kind of side result
  • 39:21which I think I'll also go through quickly here.
  • 39:24From all this, is just a general Oracle inequality
  • 39:29for regression when you have some estimated outcomes.
  • 39:31So in some sense, there isn't anything really special
  • 39:34in our results that has to do
  • 39:37with this particular pseudo outcome.
  • 39:39So, the proof that we have here works
  • 39:43for any second stage or any two-stage sort
  • 39:46of regression procedure
  • 39:48where you first estimate some nuisance stuff,
  • 39:50create a pseudo outcome that depends
  • 39:52on this estimated stuff and then do a regression
  • 39:54of the pseudo outcome on some set of covariates.
  • 39:58And so a nice by-product of this work,
  • 40:00is you get a kind of similar error bound
  • 40:02for just generic regression with pseudo outcomes.
  • 40:07This comes up in a lot of different problems, actually.
  • 40:10So one is when you want just a partly conditional effect.
  • 40:15So maybe I don't care about how effects vary
  • 40:17with all the Xs, but just a subset of them,
  • 40:19then you can apply this result.
  • 40:20I have a paper with a great student, Amanda Coston,
  • 40:23who studied a version of this;
  • 40:28regression with missing outcomes,
  • 40:30again, these look like nonparametric regression problems
  • 40:33where you have to estimate some pseudo outcome;
  • 40:36dose response curve problems, conditional IV effects,
  • 40:40partially linear IVs.
  • 40:41So there are lots of different variants where you need
  • 40:43to do some kind of two-stage regression procedure like this.
  • 40:51Again, you just need a stability condition
  • 40:52and you need some sample splitting
  • 40:54and you can give a similar kind of a nice rate result
  • 40:57that we got for the CATE specific problem,
  • 41:01but for the generic pseudo outcome regression problem.
  • 41:07So we've got about 15 minutes,
  • 41:10I have some simulations,
  • 41:12which I think I will go over quickly.
  • 41:15So we did this in a couple simple models,
  • 41:17one, a high dimensional linear model.
  • 41:20It's actually a logistic model where
  • 41:22we have 500 covariates and 50
  • 41:25of them have non-zero coefficients.
  • 41:28We just used the default lasso fitting in R
  • 41:32and compared plugin estimators
  • 41:34to the doubly robust approach that we talked
  • 41:37about and then also an X-learner,
  • 41:39which is a sort of variant of the plug-in approach
  • 41:44that was proposed in recent years.
  • 41:47And the basic story is you get sort
  • 41:49of what the theory predicts.
  • 41:50So the DR-learner does better than these plug-in types
  • 41:54of approaches in this setting.
  • 41:58The nuisance functions are hard to estimate
  • 42:00and so you don't see a massive gain over,
  • 42:02for example, the X-Learner,
  • 42:04you do see a pretty massive gain
  • 42:05over the simple plugin.
  • 42:08And we're a bit away
  • 42:10from this Oracle DR-learner approach here,
  • 42:13so that means the errors are relatively different.
  • 42:16This is telling us that the nuisance stuff is hard
  • 42:19to estimate in this simulation set up.
  • 42:22Here's another simulation based
  • 42:24on that plot I showed you before.
  • 42:28And so here, I'm actually estimating the propensity scores,
  • 42:31but I'm constructing the estimates myself
  • 42:33so that I can control the rate of convergence
  • 42:35and see how things change across different error rates
  • 42:39for estimating the propensity score.
  • 42:41So here's what we see.
  • 42:42So on the x-axis here,
  • 42:44we have how well we're estimating the propensity score.
  • 42:48So this is a convergence rate
  • 42:50for the propensity score estimator.
  • 42:52Y-axis, we have the mean squared error
  • 42:54and then this red line is the plugin estimator,
  • 42:56it's doing really poorly.
  • 42:57It's not capturing this underlying simplicity
  • 42:59of the conditional effects.
  • 43:00It's really just inheriting that difficulty
  • 43:03in estimating the regression functions.
  • 43:05Here's the X-learner, it's doing a bit better
  • 43:07than the plugin, but it's still not doing
  • 43:10a great job capturing the underlying simplicity
  • 43:12in the conditional effect.
  • 43:14This dotted line is the Oracle.
  • 43:16So this is what you would get
  • 43:18if you actually observed the potential outcomes.
  • 43:20And then the black line is the DR-learner,
  • 43:23this two-stage procedure here,
  • 43:24I'm just using smoothing splines everywhere,
  • 43:26just defaults in R, it's like three lines of code,
  • 43:29all the code's in the paper, too,
  • 43:31if you want to play around with this.
  • 43:33And here we see what we expect.
  • 43:35So when it's really hard to estimate the propensity score,
  • 43:38it's just a hard problem and we don't do
  • 43:40much better than the X-learner.
  • 43:44We still get some gain over the plugin in this case,
  • 43:47but as soon as you can estimate the propensity score
  • 43:50well at all, you start seeing some pretty big gains
  • 43:54by doing this doubly robust approach
  • 43:56and at some point we start to roughly match
  • 43:58the Oracle actually.
  • 44:02As soon as we're getting something like
  • 44:03n to the one quarter rates in this case,
  • 44:04we're getting close to the Oracle.
  • 44:10So maybe I'll just show you an illustration
  • 44:12and then I'll talk about the second part of the talk
  • 44:15very briefly, and if people
  • 44:17want to talk about that
  • 44:20offline, I'd be more than happy to.
  • 44:23So here's a study, which I actually learned about
  • 44:25from Peter looking at effects of canvassing
  • 44:27on voter turnout, so this is this timely study.
  • 44:30Here's the paper, there are almost 20,000 voters
  • 44:36across six cities here.
  • 44:37They were randomly encouraged to vote
  • 44:42in these local elections, meaning people would go
  • 44:45and talk to them face to face.
  • 44:47You remember what that was like pre-pandemic.
  • 44:50Here's a script of the sort of canvassing that they did,
  • 44:55just saying, reminding them of the election,
  • 44:58giving them a reminder to vote.
  • 45:00Hopefully I'm doing this for you as well,
  • 45:02if you haven't voted already.
  • 45:04And so what's the data we have here?
  • 45:07We have a number of covariates things like city,
  • 45:10party affiliation, some measures
  • 45:11of the past voting history, age, family size, race.
  • 45:15Again, the treatment is whether they were randomly contacted,
  • 45:19or actually whether they were randomly assigned, since in some cases
  • 45:22people couldn't be contacted in the setup.
  • 45:25So we're just looking at intention
  • 45:26to treat kinds of effects.
  • 45:28And then the outcome is whether people voted
  • 45:30in the local election or not.
  • 45:32So just as kind of a proof of concept,
  • 45:35I use this DR-learner approach,
  • 45:37I just use two folds and use random forests separately
  • 45:42for the first stage regressions and the second stage.
  • 45:47And actually for one part of the analysis,
  • 45:50I used generalized additive models in that second stage.
  • 45:56So here's a histogram of the conditional effect estimates.
  • 46:00So there's sort of a big chunk, a little bit above zero,
  • 46:03but then there is some heterogeneity
  • 46:04around that in this case.
  • 46:07So there are some people
  • 46:08who maybe seem especially responsive to canvassing,
  • 46:12maybe some people it's not really doing anything for
  • 46:15and actually some are less likely to vote, potentially.
  • 46:18This is a plot of the effect estimates
  • 46:21from this DR-learner procedure,
  • 46:22just to see what they look like,
  • 46:24how this would work in practice across
  • 46:27two potentially important covariates.
  • 46:30So here's the age of the voter and then the party
  • 46:34and the color here represents the size and direction
  • 46:39of the CATE estimate of the conditional effect estimates,
  • 46:41so blue means canvassing is having a bigger effect
  • 46:45on voting in the next local election.
  • 46:50Red means less likely to vote due to canvassing.
  • 46:55So you can see some interesting structure here just briefly,
  • 46:59the independent people,
  • 47:01it seems like the effects are closer to zero.
  • 47:03Democrats maybe seem more likely to be positively affected,
  • 47:07maybe more so among younger people.
  • 47:11It's just an example of the kind of
  • 47:13sort of graphical visualization stuff you could do
  • 47:16with this sort of procedure.
  • 47:18This is the plot I showed before, where here,
  • 47:21we're looking at just how the conditional
  • 47:22effect varies with age.
  • 47:24And you can see some evidence
  • 47:25that younger people are responsive to canvassing.
  • 47:33Older people, less evidence that there's any response.
  • 47:43I should stop here and see if people have any questions.
  • 47:51- So Edward, can I ask a question?
  • 47:53- Of course yeah.
  • 47:55- I think we've discussed point estimation.
  • 47:57Does this approach also allow
  • 47:59for consistent variance estimation?
  • 48:01- Yeah, that's a great question.
  • 48:04Yeah, I haven't included any of that here,
  • 48:08but if you think about
  • 48:11this Oracle result that we have,
  • 48:17if these errors are small enough,
  • 48:19so under the kinds of conditions that we talked about,
  • 48:22then we're getting an estimator that looks like an Oracle
  • 48:26regression of the potential outcomes on the covariates.
  • 48:29And that means that as long as these are small enough,
  • 48:31we could just port over any inferential tools
  • 48:33that we like from standard non-parametric regression
  • 48:35treating our pseudo outcomes as if they were
  • 48:38the true potential outcomes, yeah.
  • 48:41That's a really important point,
  • 48:43I'm glad you mentioned that.
  • 48:44- Thanks.
  • 48:45- So inference is more complicated
  • 48:47and nuanced than non-parametric regression,
  • 48:51but any inferential tool could be used here.
  • 48:56- So operationally, just to think
  • 48:57about how to operationalize the variance estimation
  • 48:59also, does that require the cross fitting procedure
  • 49:03where you're swapping your D one D two
  • 49:06in the estimation process and then?
  • 49:10- Yeah, that's a great question too.
  • 49:11So not necessarily,
  • 49:12so you could just use these folds
  • 49:14for nuisance training and then go to this fold
  • 49:17and then just forget that you ever used this data
  • 49:19and just do variance estimation here.
  • 49:21The drawback there would be,
  • 49:22you're only using a third of your data.
  • 49:25If you really want to make full use
  • 49:26of the sample size using
  • 49:28the cross fitting procedure would be ideal,
  • 49:31but the inference doesn't change.
  • 49:32So if you do cross fitting,
  • 49:35you would at the end of the day,
  • 49:36you'd get an out of sample CATE estimate
  • 49:39for every single row in your data, every subject,
  • 49:42but just where the nuisance stuff for that estimate
  • 49:47was built from other samples.
  • 49:50But at the end of the day, you'd get one big column
  • 49:52with all these out of sample CATE estimates
  • 49:54and then you could just use
  • 49:55whatever inferential tools you like there.
  • 50:00- Thanks.
  • 50:07- So, just got a few minutes.
  • 50:10So maybe I'll just give you a high level kind of picture
  • 50:12of the stuff in the second part of this talk
  • 50:14which is really about pursuing the fundamental limits
  • 50:19of conditional effect estimation.
  • 50:20So what's the best we could possibly do here?
  • 50:23This is completely unknown,
  • 50:25which I think is really fascinating.
  • 50:27So if you think about what we have so far,
  • 50:29so far, we've given these sufficient conditions under
  • 50:33which this DR-learner is Oracle efficient,
  • 50:36but a natural question here is what happens
  • 50:38when those mean squared error terms are too big
  • 50:40and so we can't say that we're getting
  • 50:42the Oracle rate anymore.
  • 50:45Then you might say,
  • 50:46okay, is this a bug with the DR-learner?
  • 50:50Maybe I could have adapted this in some way
  • 50:52to actually do better or maybe I've reached the limits
  • 50:56of how well I can do for estimating the effect.
  • 51:00It doesn't matter if I had gone to a different estimator,
  • 51:03I think I would've had the same kind of error.
  • 51:07So this is the goal of this last part of the work.
  • 51:12So here we use a very different estimator.
  • 51:14It's built using this R-learner idea,
  • 51:17which is a reproducing kernel Hilbert space (RKHS) extension of this
  • 51:22classic double residual regression method
  • 51:24of Robinson, which is really cool.
  • 51:27This is actually from 1988, so it's a classic method.
  • 51:33And so we study a non-parametric version
  • 51:35of this built from local polynomial estimators.
  • 51:38And I'll just give you a picture
  • 51:40of what the estimator is doing.
  • 51:41It's quite a bit more complicated
  • 51:43than that DR-learner procedure.
  • 51:45So we again use this triple sample splitting
  • 51:47and here it's actually much more crucial.
  • 51:50So if you didn't use that triple sample splitting
  • 51:53for the DR-learner,
  • 51:53you'd just get a slightly different error bound,
  • 51:55but here it's actually really important.
  • 51:57I'd be happy to talk to people about why specifically.
  • 52:01So in one part of the sample we estimate propensity scores,
  • 52:04and in another part of the sample
  • 52:05we estimate propensity scores and regression functions,
  • 52:08now the marginal regression functions.
  • 52:10We combine these to get weights, kernel weights.
  • 52:13We also combine them to get residuals,
  • 52:15so treatment residuals and outcome residuals.
  • 52:18This is like what you would get
  • 52:19from the Robinson procedure from econ.
  • 52:24Then, instead of doing a regression
  • 52:26of outcome residuals on treatment residuals,
  • 52:28we do a weighted nonparametric regression
  • 52:32of the outcome residuals on the treatment residuals.
  • 52:34So that's the procedure, a little bit more complicated.
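A local-constant caricature of that residual-on-residual step might look like the sketch below; the Gaussian kernel, the single bandwidth, and the function names are simplifications of mine, whereas the actual lp-R-learner uses local polynomials and the specific weighting from the paper.

```python
import numpy as np

def residual_on_residual_cate(x0, X, A, Y, pi_hat, m_hat, h=0.5):
    """Local-constant sketch of the Robinson-style residual-on-residual step
    at a single evaluation point x0. pi_hat and m_hat are arrays of cross-fit
    nuisance predictions: propensity scores and marginal outcome regressions
    estimated on other folds, as in the triple sample splitting."""
    # Kernel weights localizing the regression around x0 (bandwidth h).
    K = np.exp(-0.5 * np.sum(((X - x0) / h) ** 2, axis=1))
    # Treatment and outcome residuals, as in the double residual regression.
    A_res = A - pi_hat
    Y_res = Y - m_hat
    # Weighted least squares of outcome residuals on treatment residuals:
    # the local slope is the conditional effect estimate at x0.
    return np.sum(K * A_res * Y_res) / np.sum(K * A_res ** 2)
```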
  • 52:38And again, this is,
  • 52:39I think there are ways to make this work well practically,
  • 52:42but the goal of this work is really to try
  • 52:44and figure out what's the best possible
  • 52:46mean squared error that we could achieve.
  • 52:47It's less about a practical method,
  • 52:51more about just understanding how hard
  • 52:52the conditional effect estimation problem is.
  • 52:56And so we actually show that a generic version
  • 52:59of this procedure,
  • 53:02as long as you estimate the propensity scores
  • 53:03and the regression functions with linear smoothers,
  • 53:05with particular bias and variance properties,
  • 53:08which are standard in nonparametrics,
  • 53:11you can actually get better mean squared error
  • 53:13than for the DR-learner.
  • 53:15I'll just give you a sense of what this looks like.
  • 53:19So you get something that looks like an Oracle rate plus
  • 53:22something like the squared bias from the nuisance estimates,
  • 53:27from the propensity score and regression functions.
  • 53:32So before you had the product of mean squared errors;
  • 53:36now we have the square of the bias,
  • 53:38rather than the mean squared error,
  • 53:40of the two procedures, the propensity score and the regression function.
  • 53:43And this opens the door to undersmoothing.
  • 53:46So this means that you can estimate the propensity score
  • 53:49and the regression functions in a way that's suboptimal
  • 53:52if you actually just care about
  • 53:54these functions by themselves.
  • 53:56So you drive down the bias, which blows up
  • 53:59the variance a little bit,
  • 54:00but it turns out not to affect the conditional effect
  • 54:03estimate too much if you do it in the right way.
  • 54:06And so if you do this,
  • 54:07you get a rate that looks like this:
  • 54:11you get an Oracle rate
  • 54:12plus n to the minus 2s over d.
  • 54:15And this is strictly better than what we got
  • 54:18with the DR-learner.
  • 54:20(clears throat)
  • 54:21You can do the same game where you see sort
  • 54:23of when the Oracle rate is achieved here; it's achieved
  • 54:27if the average smoothness of the nuisance functions
  • 54:29is greater than d over four,
  • 54:31and then here, the inflation factor is also changing.
  • 54:34So before,
  • 54:35we needed the smoothness to be greater than d over two
  • 54:38over one plus d over gamma;
  • 54:40now we need d over four over one plus d over two gamma.
  • 54:45So this is a weaker condition.
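Written out, the comparison just described looks roughly like the following (my transcription of the spoken rates, with s the average smoothness of the nuisance functions, gamma the smoothness of the CATE, and d the covariate dimension; see the paper on arXiv for the precise statements).

```latex
% DR-learner: oracle rate plus a product of nuisance errors,
% oracle-efficient when the average nuisance smoothness satisfies
\[
  s \;>\; \frac{d/2}{1 + d/\gamma}.
\]
% lp-R-learner: oracle rate plus a squared-bias-type remainder,
\[
  \mathrm{error} \;\lesssim\; n^{-\gamma/(2\gamma + d)} \;+\; n^{-2s/d},
\]
% which achieves the oracle rate under the weaker condition
\[
  s \;>\; \frac{d/4}{1 + d/(2\gamma)}.
\]
```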
  • 54:46So this is telling us that there are settings
  • 54:48where that DR-learner is not Oracle efficient,
  • 54:52but there exists an estimator, which is,
  • 54:53and it looks like this estimator
  • 54:56I had described here,
  • 54:57this regression on residuals thing.
  • 55:02So that's the story.
  • 55:03You can actually,
  • 55:03you can actually beat this DR-learner.
  • 55:05And now the question is, okay, what happens?
  • 55:08One, what happens
  • 55:09when we're not achieving the Oracle rate here,
  • 55:11can you still do better?
  • 55:13A second question is can anything, yeah.
  • 55:19Can anything achieve the Oracle rate
  • 55:20under weaker conditions than this?
  • 55:22And so I haven't proved anything about this yet.
  • 55:25It turns out to be somewhat difficult,
  • 55:29but I conjecture that this condition is minimax.
  • 55:33So I don't think
  • 55:34any estimator could ever be Oracle efficient
  • 55:36under weaker conditions than what this estimator is.
  • 55:40So this is just a picture of the results again.
  • 55:42So here it's the same setting as before;
  • 55:45we have the plugin estimator and the DR-learner,
  • 55:48and here's what we get with this.
  • 55:51I call it the lp-R-learner;
  • 55:53it's a local polynomial version of the R-learner.
  • 55:55And so we're actually getting quite a bit smaller rates.
  • 55:58We're hitting the Oracle rate under weaker conditions
  • 56:02on the smoothness.
  • 56:03Now, the question is whether we can fill this gap anymore,
  • 56:08and this is unknown.
  • 56:09This is one of the open questions in causal inference.
  • 56:14So yeah, I think in the interest of time,
  • 56:18I'll skip to the discussion section here.
  • 56:21We can actually fill the gap a little bit
  • 56:22with some extra tuning.
  • 56:26Just interesting.
  • 56:29Okay.
  • 56:30Yeah.
  • 56:31So this last part is really about just pushing the limits,
  • 56:32trying to figure out what the best possible performance is.
  • 56:35Okay.
  • 56:36So just to wrap things up,
  • 56:39right we gave some new results here
  • 56:41that let you be very flexible with
  • 56:43the kinds of methods that you want to use.
  • 56:46They do a good job of exploiting this CATE structure
  • 56:49when it's there and don't lose much when it's not.
  • 56:54So we have this nice model-free error bound.
  • 56:57We also kind of get, for free,
  • 56:59this nice general Oracle inequality.
  • 57:03We did some investigation of the best possible rates
  • 57:06of convergence,
  • 57:07the best possible mean squared error
  • 57:08for estimating conditional effects,
  • 57:11which again was unknown before.
  • 57:14These are the weakest conditions
  • 57:15that have appeared,
  • 57:17but it's still not entirely known whether
  • 57:19they are minimax optimal or not.
  • 57:23So, yeah, big picture goals.
  • 57:24We want some nice flexible tools,
  • 57:26strong guarantees, and to push forward
  • 57:28our understanding of this problem.
  • 57:30I hope I've conveyed that there are lots of fun,
  • 57:32open problems here to work out
  • 57:34with important practical implications.
  • 57:37Here's just a list of them.
  • 57:38I'd be happy to talk more with people at any point,
  • 57:42feel free to email me; a big part is applying
  • 57:44these methods in real problems.
  • 57:46And yeah, I should stop here,
  • 57:49but feel free to email me; the papers are on arXiv here.
  • 57:53I'd be happy to hear people's thoughts.
  • 57:55Yeah.
  • 57:55Thanks again for inviting me.
  • 57:56It was fun.
  • 57:58- Yeah.
  • 57:59Thanks Edward.
  • 57:59That's a very nice talk and I think we're hitting the hour,
  • 58:02but I want to see in the audience
  • 58:04if we have any questions.
  • 58:05Huh.
  • 58:13All right.
  • 58:14If not, I do have one final question
  • 58:16if that's okay.
  • 58:17- Yeah, of course.
  • 58:18- And so I think there is a whole host of literature
  • 58:21on flexible outcome modeling
  • 58:23to estimate conditional average causal effects,
  • 58:26especially those Bayesian non-parametric tree models
  • 58:28(laughs)
  • 58:30that are getting popular.
  • 58:32So I am just curious to see if you have ever thought
  • 58:36about comparing their performances,
  • 58:38or do you think there are some differences
  • 58:40between those Bayesian
  • 58:42non-parametric tree models versus
  • 58:44the plugin estimator
  • 58:46we compared in the simulation study here?
  • 58:48- Yeah.
  • 58:49I think of them
  • 58:50as really just versions of that plugin estimator
  • 58:53that use a different regression procedure.
  • 58:55There may be ways to tune plugins to try
  • 58:58and exploit this special structure of the CATE.
  • 59:01But if you're really just looking
  • 59:02at the regression functions individually,
  • 59:05I think these would be susceptible to the same kinds
  • 59:07of issues that we see with the plugin.
  • 59:09Yeah.
  • 59:10That's a good one.
  • 59:11- I see.
  • 59:12Yep.
  • 59:13So I want to see if there's any further questions
  • 59:17from the audience for Dr. Kennedy.
  • 59:21(indistinct)
  • 59:23- I was just wondering if you could speak a little more,
  • 59:26why the standard, like, Neyman orthogonality results
  • 59:29can or can't be applicable in this setup?
  • 59:32- [Edward] Yeah.
  • 59:33(clears throat)
  • 59:34Yeah.
  • 59:35That's a great question.
  • 59:36So one way to say it is that these effects,
  • 59:42these conditional effects
  • 59:43are not pathwise differentiable.
  • 59:46And so, there's some distinction
  • 59:50between Neyman orthogonality
  • 59:51and pathwise differentiability,
  • 59:52but maybe we can think about them
  • 59:53as being roughly the same for now.
  • 59:57So yeah, all the standard semiparametric
  • 59:59theory breaks down here
  • 01:00:01because of this lack of pathwise differentiability, so
  • 01:00:04all the efficiency bounds that
  • 01:00:05we know and love don't apply,
  • 01:00:09but it turns out that there's some kind
  • 01:00:11of analogous version of this that works for these things.
  • 01:00:15I think of them as like infinite-dimensional functionals.
  • 01:00:18So instead of like the ATE, which is just a number,
  • 01:00:20this is like a curve,
  • 01:00:22but it has the same kinds of like functional structure
  • 01:00:25in the sense that it's combining regression functions
  • 01:00:28or our propensity scores in some way.
  • 01:00:29And we don't care about the individual components.
  • 01:00:33We care about their combination.
  • 01:00:36So yeah, the standard stuff doesn't work just
  • 01:00:38because we're outside of this root-n
  • 01:00:40regime, roughly, but there are, yeah,
  • 01:00:44there's analogous structure and there's tons
  • 01:00:46of important work to be done,
  • 01:00:48sort of formalizing this and extending it.
  • 01:00:53That's a little vague, but hopefully that helps.
  • 01:01:02- All right.
  • 01:01:03So any further questions?
  • 01:01:08- Thanks again.
  • 01:01:09And yeah.
  • 01:01:10If any questions come up, feel free to email.
  • 01:01:12- Yeah.
  • 01:01:14If not,
  • 01:01:14let's thank Dr. Kennedy again.
  • 01:01:16And I'm sure he'll be happy
  • 01:01:17to answer your questions offline.
  • 01:01:19So thanks everyone.
  • 01:01:20I'll see you.
  • 01:01:21We'll see you next week.
  • 01:01:22- Thanks a lot.