
YSPH Biostatistics Virtual Seminar: “Optimal Doubly Robust Estimation of Heterogeneous Causal Effects"

November 04, 2020
  • 00:00- So let's get started.
  • 00:03Welcome everyone.
  • 00:04It is my great pleasure to introduce our speaker today,
  • 00:07Dr. Edward Kennedy, who is an assistant professor
  • 00:11at the Department of Statistics and Data Science
  • 00:13at Carnegie Mellon University.
  • 00:16Dr. Kennedy got his MA in statistics
  • 00:18and PhD in biostatistics from University of Pennsylvania.
  • 00:21He's an expert in methods for causal inference,
  • 00:24missing data and machine learning,
  • 00:26especially in settings involving
  • 00:27high dimensional and complex data structures.
  • 00:31He has also been collaborating on statistical applications
  • 00:34in criminal justice, health services,
  • 00:36medicine and public policy.
  • 00:38Today he's going to share with us his recent work
  • 00:40in the space of heterogeneous causal effect estimation.
  • 00:43Welcome Edward, the floor is yours.
  • 00:46- [Edward] Thanks so much, (clears throat)
  • 00:47yeah, thanks for the invitation.
  • 00:49I'm happy to talk to everyone today about this work
  • 00:51I've been thinking about for the last year or so.
  • 00:55Sort of excited about it.
  • 00:57Yeah, so it's all about doubly robust estimation
  • 00:59of heterogeneous treatment effects.
  • 01:03Maybe before I start,
  • 01:04I don't know what the standard approach is for questions,
  • 01:07but I'd be more than happy to take
  • 01:08any questions throughout the talk
  • 01:10and I can always sort of adapt and focus more
  • 01:13on different parts of the room,
  • 01:14what people are interested in.
  • 01:17I'm also trying to get used to using Zoom,
  • 01:21I've been teaching this big lecture course
  • 01:23so I think I can keep an eye on the chat box too
  • 01:26if people have questions that way,
  • 01:27feel free to just type something in.
  • 01:30Okay.
  • 01:31So yeah, this is sort
  • 01:34of a standard problem in causal inference
  • 01:36but I'll give some introduction.
  • 01:38The kind of classical target that people go after
  • 01:41in causal inference problems is what's
  • 01:44often called the average treatment effect.
  • 01:46So this tells you the mean outcome if everyone
  • 01:48was treated versus if everyone was untreated, for example.
  • 01:53So this is, yeah, sort of the standard target.
  • 01:57We know quite a bit about estimating this parameter
  • 02:01under no unmeasured confounding kinds of assumptions.
  • 02:05So just to sort of point this out,
  • 02:10so a lot of my work is sort of focused
  • 02:11on the statistics of causal inference,
  • 02:12how to estimate causal parameters
  • 02:15well in flexible non-parametric models.
  • 02:17So we know quite a bit
  • 02:18about this average treatment effect parameter.
  • 02:20There are still some really interesting open problems,
  • 02:23even for this sort of most basic parameter,
  • 02:25which I'd be happy to talk to people about,
  • 02:26but this is just one number, it's an overall summary
  • 02:31of how people respond to treatment, on average.
  • 02:34It can obscure potentially important heterogeneity.
  • 02:38So for example, very extreme case would be where half
  • 02:43the population is seeing a big benefit
  • 02:45from treatment and half is seeing severe harm,
  • 02:49then you would completely miss this
  • 02:50by just looking at the average treatment effect.
  • 02:53So this motivates going beyond this,
  • 02:55maybe looking at how treatment effects can vary
  • 02:58across subject characteristics.
  • 03:01All right, so why should we care about this?
  • 03:03Why should we care how treatment effects vary in this way?
  • 03:06So often when I talk about this,
  • 03:09people's minds go immediately to optimal treatment regimes,
  • 03:12which is certainly an important part of this problem.
  • 03:16So that means trying to find out who's benefiting
  • 03:19from treatment and who is not or who's being harmed.
  • 03:22And then just in developing
  • 03:24a treatment policy based on this,
  • 03:26where you treat the people who benefit,
  • 03:27but not the people who don't.
  • 03:29This is definitely an important part
  • 03:30of understanding heterogeneity,
  • 03:32but I don't think it's the whole story.
  • 03:33So it can also be very useful just
  • 03:36to understand heterogeneity from a theoretical perspective,
  • 03:39just to understand the system
  • 03:41that you're studying and not only that,
  • 03:44but also to help inform future treatment development.
  • 03:50So not just trying to optimally assign
  • 03:53the current treatment that's available,
  • 03:55but if you find, for example,
  • 03:57that there are portions of the subject population
  • 04:01that are not responding to the treatment,
  • 04:03maybe you should then go off and try and develop
  • 04:05a treatment that would better aim at these people.
  • 04:10So lots of different reasons why you might care
  • 04:12about heterogeneity,
  • 04:12including devising optimal policies,
  • 04:16but not just that.
  • 04:19And this really plays a big role across lots
  • 04:21of different fields as you can imagine.
  • 04:24We might want to target policies based on how people
  • 04:29are responding to a drug or a medical treatment.
  • 04:33We'll see a sort of political science example here.
  • 04:36So this is just a picture of what you should maybe think
  • 04:39about as we're talking about this problem
  • 04:42with heterogeneous treatment effects.
  • 04:44So this is a timely example.
  • 04:46So it's looking at the effect
  • 04:47of canvassing on voter turnout.
  • 04:50So this is the effect of being sort of reminded
  • 04:53in a face-to-face way to vote
  • 04:55that there's an election coming up
  • 04:58and how this effect varies with age.
  • 05:00And so I'll come back to where this plot came from
  • 05:04and the exact sort of data structure and analysis,
  • 05:07but just as a picture to sort of make things concrete.
  • 05:11It looks like there might be some sort of positive effect
  • 05:15of canvassing for younger people,
  • 05:16but not for older people,
  • 05:18there might be some non-linearity.
  • 05:21So this might be useful for a number of reasons.
  • 05:23You might not want to target the older population
  • 05:27with canvassing, because it may not be doing anything,
  • 05:30you might want to try and find some other way
  • 05:32to increase turnout for this group right.
  • 05:36Or you might just want to understand sort
  • 05:38of from a psychological, sociological,
  • 05:41theoretical perspective,
  • 05:44what kinds of people are responding to this sort of thing?
  • 05:49And so this is just one simple example
  • 05:51you can keep in mind.
  • 05:54So what's the state of the art for this problem?
  • 05:58So in this talk, I'm going to focus
  • 06:00on this conditional average treatment effect here.
  • 06:02So it's the expected difference in outcomes
  • 06:05if people of type X were treated versus not.
  • 06:11This is kind of the classic or standard parameter
  • 06:14that people think about now
  • 06:16in the heterogeneous treatment effects literature,
  • 06:19there are other options you could think
  • 06:21about risk ratios, for example, if outcomes are binary.
  • 06:25A lot of the methods that I talk about today
  • 06:26will have analogs for these other estimands,
  • 06:30but there are lots of fun, open problems to explore here.
  • 06:33How to characterize heterogeneous treatment effects
  • 06:35when you have time-varying treatments, continuous treatments,
  • 06:39lots of cool problems to think about.
  • 06:40But anyways, this kind of effect where we have
  • 06:44a binary treatment and some set of covariates,
  • 06:48there's really been this proliferation of proposals
  • 06:51in recent years for estimating this thing
  • 06:53in a flexible way that goes beyond just fitting
  • 06:57a linear model and looking at some interaction terms.
  • 07:01(clears throat)
  • 07:02So I guess I'll refer to the paper for a lot
  • 07:07of these different papers that have thought about this.
  • 07:12People have used sort of random forests
  • 07:14and tree-based methods, Bayesian additive
  • 07:17regression trees, lots of different variants
  • 07:20for estimating this thing.
  • 07:21So there've been lots of proposals,
  • 07:23lots of methods for estimating this,
  • 07:24but there's some really big theoretical gaps
  • 07:28in this literature.
  • 07:30So one, yeah, this is especially true
  • 07:32when you can imagine that this conditional effect
  • 07:36might be much more simple
  • 07:38or sparse or smooth than the rest
  • 07:41of the data generating process.
  • 07:42So you can imagine you have some
  • 07:45potentially complex propensity score describing
  • 07:49the mechanism by which people are treated
  • 07:50based on their covariates.
  • 07:51You have some underlying regression functions
  • 07:54that describe this outcome process,
  • 07:56how their outcomes depend on covariates,
  • 08:01whether they're treated or not.
  • 08:02These could be very complex and messy objects,
  • 08:05but this CATE might be simpler.
  • 08:08And in this kind of regime, there's very little known.
  • 08:12I'll talk more about exactly what I mean
  • 08:14by this in just a bit.
  • 08:17So one question is,
  • 08:18how do we adapt to this kind of structure?
  • 08:21And there are really no strong theoretical benchmarks
  • 08:26in this world in the last few years,
  • 08:30which means we have all these proposals,
  • 08:33which is great, but we don't know which are optimal
  • 08:36or when or if they can be improved in some way.
  • 08:41What's the best possible performance
  • 08:44that we could ever achieve at estimating
  • 08:46this quantity in the non-parametric model
  • 08:47without adding assumptions?
  • 08:49So these kinds of questions are basically
  • 08:50entirely open in this setup.
  • 08:53So the point of this work is really to try
  • 08:55and push forward to answer some of these questions.
  • 09:00There are two kind of big parts of this work,
  • 09:05which are in a paper on arXiv.
  • 09:09So one is just to provide more flexible estimators
  • 09:12of this guy and specifically to show,
  • 09:18give stronger error guarantees on estimating this.
  • 09:23So that we can use a really diverse set of methods
  • 09:27for estimating this thing in a doubly robust way
  • 09:29and still have some rigorous guarantees
  • 09:32about how well we're doing.
  • 09:34So that part is more practical.
  • 09:35It's more about giving a method
  • 09:37that people can actually implement
  • 09:38in practice that's pretty straightforward,
  • 09:41it looks like a two stage regression procedure
  • 09:44and being able to say something about this
  • 09:46that's model free and agnostic about both
  • 09:52the underlying data generating process
  • 09:53and what methods we're using to construct the estimator.
  • 09:57This was lacking in the previous literature.
  • 09:59So that's one side of this work, which is more practical.
  • 10:03I think I'll focus more on that today,
  • 10:06but we can always adapt as we go,
  • 10:09if people are interested in other stuff.
  • 10:10I'm also going to talk a bit about an analysis of this,
  • 10:12just to show you sort of how it would work in practice.
  • 10:16So that's one part of this work.
  • 10:17The second part is more theoretical and it says,
  • 10:22so I don't want to just sort of construct
  • 10:24an estimator that has the nice error guarantees,
  • 10:27but I want to try and figure out what's
  • 10:29the best possible performance I could ever get
  • 10:31at estimating these heterogeneous effects.
  • 10:36This turns out to be a really hard problem
  • 10:39with a lot of nuance,
  • 10:42but that's sort of the second part
  • 10:43of the talk, which maybe I'll tackle
  • 10:46in a bit less time.
  • 10:50So that's kind of big picture.
  • 10:51I like to give the punchline of the talk at the start,
  • 10:53just so you have an idea of what I'm going to be covering.
  • 10:57And yeah, so now let's go into some details.
  • 11:01So we're going to think about this sort
  • 11:03of classic causal inference data structure,
  • 11:06where we have n iid observations, we have covariates X,
  • 11:10which are D dimensional, binary treatment for now,
  • 11:14all the methods that I'll talk about will work
  • 11:17without any extra work in the discrete treatment setting
  • 11:21if we have multiple values.
  • 11:23The continuous treatment setting
  • 11:24is more difficult it turns out.
  • 11:27And some outcome Y that we care about.
  • 11:31All right, so there are a couple of characters
  • 11:33in this talk that will play really important roles.
  • 11:37So we'll have some special notation for them.
  • 11:39So PI of X, this is the propensity score.
  • 11:42This is the chance of being treated,
  • 11:44given your covariates.
  • 11:47So some people might be more or less likely
  • 11:49to be treated depending on their baseline covariates, X.
  • 11:54Mu of a, this will be an outcome regression function.
  • 11:57So it's your expected outcome given your covariates
  • 12:00and given your treatment level.
  • 12:02And then we'll also later on in the talk use this eta,
  • 12:04which is just the marginal outcome regression.
  • 12:06So without thinking about treatment,
  • 12:08just how the outcome varies on average as a function of X.
  • 12:14And so those are the three main characters in this talk,
  • 12:16we'll be using them throughout.
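In symbols, writing an observation as Z = (X, A, Y), the three functions just described are

\[
\pi(x) = P(A = 1 \mid X = x), \qquad
\mu_a(x) = E(Y \mid X = x, A = a), \qquad
\eta(x) = E(Y \mid X = x).
\]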
  • 12:18So under these standard causal assumptions
  • 12:21of consistency, positivity, exchangeability,
  • 12:24there's a really amazing group
  • 12:27at Yale that are focused on dropping these assumptions.
  • 12:31So lots of cool work to be done there,
  • 12:34but we're going to be using them today.
  • 12:36So consistency, we're roughly thinking
  • 12:38this means there's no interference,
  • 12:40this is a big problem in causal inference,
  • 12:42but we're going to say
  • 12:43that my treatments can't affect your outcomes, for example.
  • 12:47We're going to think about the case where everyone
  • 12:48has some chance at receiving treatment,
  • 12:51both treatment and control,
  • 12:52and then we have no unmeasured confounding.
  • 12:54So we've collected enough sufficiently relevant covariates
  • 12:57that once we conditioned on them,
  • 12:59look within levels of the covariates,
  • 13:00the treatment is as good as randomized.
  • 13:03So under these three assumptions,
  • 13:06this conditional effect on the left-hand side here
  • 13:09can just be written as a difference in regression functions.
  • 13:11It's just the difference in the regression function
  • 13:13under treatment versus control,
  • 13:15sort of super simple parameter right.
  • 13:18So I'm going to call this thing Tau.
  • 13:20This is just the regression under treatment minus
  • 13:22the regression under control.
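Written out, under those three assumptions the conditional effect being described is

\[
\tau(x) \;=\; E(Y^1 - Y^0 \mid X = x) \;=\; \mu_1(x) - \mu_0(x),
\]

just the difference of the two regression functions.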
  • 13:27So you might think, we know a lot about
  • 13:29how to estimate regression functions non-parametrically,
  • 13:33there are really nice minimax lower bounds
  • 13:36that say we can't do better uniformly across the model
  • 13:41without adding some assumptions or some extra structure.
  • 13:46The fact that we have a difference
  • 13:47in regression doesn't seem like
  • 13:48it would make things more complicated
  • 13:50than just the initial regression problem,
  • 13:53but it turns out it really does,
  • 13:55it's super interesting,
  • 13:56this is one of the parts of this problem
  • 13:57that I think is really fascinating.
  • 14:00So just by taking a difference in regressions,
  • 14:03you completely change the nature of this problem
  • 14:06from the standard non-parametric regression setup.
  • 14:10So let's get some intuition for why this is the case.
  • 14:14So why isn't it optimal just to estimate
  • 14:17the two regression functions
  • 14:18and take a difference, for example?
  • 14:21So let's think about a simple data generating process
  • 14:23where we have just a one dimensional covariate,
  • 14:26it's uniform on minus one, one,
  • 14:29we have a simple step function propensity score
  • 14:32and then we're going to think
  • 14:32about a regression function, both under treatment
  • 14:35and control that looks like some kind
  • 14:37of crazy polynomial from this Gyorfi textbook,
  • 14:40I'll show you a picture in just a minute.
  • 14:44The important thing about this polynomial
  • 14:47is that it's non-smooth, it has a jump,
  • 14:50has some kinks in it and so it will be hard to estimate,
  • 14:56in general, but we're taking both
  • 15:00the regression function under treatment
  • 15:01and the regression function under control
  • 15:03to be equal, they're equal to this same hard
  • 15:06to estimate polynomial function.
  • 15:07And so that means the difference is really simple,
  • 15:10it's just zero, it's the simplest conditional effect
  • 15:12you can imagine, not only constant, but zero.
  • 15:15You can imagine this probably happens a lot in practice
  • 15:18where we have treatments that are not extremely effective
  • 15:22for everyone in some complicated way.
  • 15:26So the simplest way you would estimate
  • 15:29this conditional effect is just take an estimate
  • 15:32of the two regression functions and take a difference.
  • 15:35Sometimes I'll call this plugin estimator.
  • 15:38There's this paper by Künzel and colleagues
  • 15:41that calls it the T-learner.
  • 15:43So for example, we can use smoothing splines,
  • 15:46estimate the two regression functions and take a difference.
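As a rough sketch of this plug-in (T-learner) idea on a toy stand-in for the example just described — the step-function propensity score and the bumpy outcome function below are illustrative choices, not the exact ones from the paper:

```r
# Toy version of the motivating example: 1-d covariate, step-function propensity
# score, and the same hard-to-estimate outcome function in both arms (true CATE = 0).
set.seed(1)
n  <- 2000
x  <- runif(n, -1, 1)
ps <- ifelse(x < 0, 0.25, 0.75)            # step-function propensity score (stand-in)
m  <- function(x) sin(8 * x) + (x > 0.5)   # bumpy stand-in for mu_0 = mu_1
a  <- rbinom(n, 1, ps)
y  <- m(x) + rnorm(n, sd = 0.5)

# Plug-in / T-learner: fit each arm separately, then take the difference.
fit1 <- smooth.spline(x[a == 1], y[a == 1])
fit0 <- smooth.spline(x[a == 0], y[a == 0])
tau_plugin <- predict(fit1, x)$y - predict(fit0, x)$y   # inherits each fit's wiggliness
```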
  • 15:49And maybe you can already see what's going to go wrong here.
  • 15:52So these individual regression functions
  • 15:54by themselves are really hard to estimate.
  • 15:58They have jumps and kinks, they're messy functions
  • 16:01And so when we try and estimate these
  • 16:03with smoothing splines, for example,
  • 16:05we're going to get really complicated estimates
  • 16:08that have some bumps, It's hard to choose
  • 16:11the right tuning parameter, but even if we do,
  • 16:14we're inheriting the sort of complexity
  • 16:16of the individual regression functions.
  • 16:18When we take the difference,
  • 16:19we're going to see something
  • 16:20that is equally complex here
  • 16:22and so it's not doing a good job of exploiting
  • 16:25this simple structure in the conditional effect.
  • 16:30This is sort of analogous to this intuition
  • 16:33that people have that interaction terms might
  • 16:36be smaller or less worrisome than sort
  • 16:41of main effects in a regression model.
  • 16:43Or you can think of the muse as sort of main effects
  • 16:45and the differences as like an interaction.
  • 16:49So here's a picture of this data
  • 16:51in the simple motivating example.
  • 16:53So we've got treated people on the left
  • 16:55and untreated people on the right
  • 16:57and this gray line is the true, that messy,
  • 17:00weird polynomial function that we're thinking about.
  • 17:03So here's a jump and there's a couple
  • 17:06of kinks here and there's confounding.
  • 17:09So treated people are more likely to have larger Xs,
  • 17:13untreated people are more likely to have smaller Xs.
  • 17:16So what happens here is the function is sort
  • 17:19of a bit easier to estimate on the right side.
  • 17:22And so for treated people, we're going to take a sort
  • 17:24of larger bandwidth, get a smoother function.
  • 17:28For untreated people, it's harder to estimate
  • 17:30on the left side and so we're going to need
  • 17:31a small bandwidth to try and capture this jump,
  • 17:34for example, this discontinuity.
  • 17:38And so what's going to happen is when you take a difference
  • 17:40of these two regression estimates, these black lines
  • 17:42are just the standard smoothing spline estimates
  • 17:46that you get with one line of code,
  • 17:48using the default bandwidth choices.
  • 17:50When you take a difference,
  • 17:51you're going to get something
  • 17:52that's very complex and messy and it's not doing
  • 17:55a good job of recognizing that the regression functions
  • 17:58are the same under treatment and control.
  • 18:03So what else could we do?
  • 18:04This maybe points to this fact that
  • 18:06the plugin estimator breaks,
  • 18:09it doesn't do a good job of exploiting a structure,
  • 18:11but what other options do we have?
  • 18:13So let's say that we knew the propensity scores.
  • 18:15So for just simplicity, say we were in a trial,
  • 18:19for example, an experiment,
  • 18:21where we randomized everyone to treat them
  • 18:23with some probability that we knew.
  • 18:26In that case, we could construct a pseudo outcome,
  • 18:28which is just like an inverse probability weighted outcome,
  • 18:31which has exactly the right conditional expectation,
  • 18:35its conditional expectation is exactly equal
  • 18:37to that conditional effect.
  • 18:39And so when you did a non-parametric regression
  • 18:41of the pseudo outcome on X,
  • 18:43it would be like doing an oracle regression
  • 18:45of the true difference in potential outcomes,
  • 18:47it has exactly the same conditional expectation.
  • 18:50And so this sort of turns this hard problem
  • 18:53into a standard non-parametric regression problem.
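Concretely, with a known propensity score the pseudo outcome being described is something like

\[
\varphi(Z) \;=\; \left( \frac{A}{\pi(X)} - \frac{1 - A}{1 - \pi(X)} \right) Y,
\qquad
E\{\varphi(Z) \mid X = x\} \;=\; \tau(x),
\]

so a nonparametric regression of \(\varphi(Z)\) on X targets the same conditional expectation as the oracle regression of \(Y^1 - Y^0\) on X.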
  • 18:56Now this is a special case where we knew
  • 18:58the propensity scores; for the rest
  • 18:59of the talk we're gonna think about what happens
  • 19:01when we don't know these, what can we say?
  • 19:04So here's just a picture of what we get in the setup.
  • 19:08So this red line is this really messy plug in estimator
  • 19:10that we get that's just inheriting that complexity
  • 19:13of estimating the individual regression functions
  • 19:15and then these black and blue lines are IPW
  • 19:19and doubly robust versions that exploit
  • 19:22this underlying smoothness and simplicity
  • 19:25of the heterogeneous effects, the conditional effects.
  • 19:32So this is just a motivating example
  • 19:34to help us get some intuition for what's going on here.
  • 19:39So these results are sort of standard in this problem,
  • 19:41we'll come back to some simulations later on.
  • 19:43And so now our goal is going to be to study the error
  • 19:48of the sort of inverse weighted kind of procedure,
  • 19:51but a doubly robust version.
  • 19:53We're going to give some new model free error guarantees,
  • 19:57which let us use very flexible methods
  • 19:59and it turns out we'll actually get better error rates
  • 20:03than what were achieved previously in the literature,
  • 20:08even when focusing specifically on some particular method.
  • 20:12And then again, we're going to see,
  • 20:13how well can we actually do estimating
  • 20:15this conditional effect in this problem.
  • 20:21Might be a good place to pause
  • 20:23and see if people have any questions.
  • 20:33Okay.
  • 20:34(clears throat)
  • 20:35Feel free to shout out any questions
  • 20:37or stick them on the chat if any come up.
  • 20:42So we're going to start by thinking about
  • 20:45a pretty simple two-stage doubly robust estimator,
  • 20:48which I'm going to call the DR-learner,
  • 20:50this is following this nomenclature that's become kind
  • 20:53of common in the heterogeneous effects literature
  • 20:57where we have letters and then a learner.
  • 21:01So I'm calling this the DR-Learner,
  • 21:02but this is not a new procedure,
  • 21:04but the version that I'm going to analyze
  • 21:06has some variations, but it was actually first proposed
  • 21:08by Mark van der Laan in 2013, and was used in 2016
  • 21:13by Alex Luedtke and Mark van der Laan.
  • 21:16So they proposed this,
  • 21:17but they didn't give specific error bounds.
  • 21:21I think relatively few people know
  • 21:23about these earlier papers because this approach
  • 21:25was then sort of rediscovered in various ways
  • 21:28after that in the following years,
  • 21:30typically in these later versions,
  • 21:32people use very specific methods for estimating,
  • 21:37for constructing the estimator,
  • 21:38which I'll talk about in detail in just a minute,
  • 21:41for example, using kernel kind of methods,
  • 21:44local polynomials, and this paper used
  • 21:48a sort of series or spline regression.
  • 21:52So.
  • 21:53(clears throat)
  • 21:54These papers are nice ways
  • 21:56of doing doubly robust estimation,
  • 21:58but they had a couple of drawbacks,
  • 22:00which we're going to try and build on in this work.
  • 22:03So one is, we're going to try not to commit
  • 22:05to using any particular methods.
  • 22:07We're going to see what we can say about error guarantees,
  • 22:11just for generic regression procedures.
  • 22:15And then we're going to see
  • 22:16if we can actually weaken the sort of assumptions
  • 22:19that we need to get oracle type behavior.
  • 22:22So the behavior of an estimator that we would see
  • 22:25if we actually observed the potential outcomes
  • 22:28and it turns out we'll be able to do this,
  • 22:29even though we're not committing to particular methods.
  • 22:34There's also a really nice paper by Foster
  • 22:35and Syrgkanis from last year,
  • 22:38which also considered a version of this DR-learner
  • 22:41and they had some really nice model agnostic results,
  • 22:44but they weren't doubly robust.
  • 22:46So, in this work we're going to try
  • 22:48and doubly-robustify these results.
  • 22:54So that's the sort of background and an overview.
  • 22:57So let's think about what this estimator is actually doing.
  • 23:01So here's the picture of this,
  • 23:03what I'm calling the DR-learner.
  • 23:05So we're going to do some interesting sample splitting
  • 23:08here and later where we split our sample
  • 23:10into three different groups.
  • 23:13So one's going to be used for nuisance training
  • 23:16for estimating the propensity score.
  • 23:19And then I'm also going to estimate
  • 23:21the regression functions, but in a separate fold.
  • 23:26So I'm separately estimating my propensity score
  • 23:29and regression functions.
  • 23:30This turns out to not be super crucial for this approach.
  • 23:35It actually is crucial for something I'll talk
  • 23:37about later in the talk,
  • 23:39this is just to give a nicer error bound.
  • 23:43So the first stage is we estimate these nuisance functions,
  • 23:45the propensity scores and the regressions.
  • 23:48And then we go to this new data that we haven't seen yet,
  • 23:53our third fold of split data
  • 23:56and we construct a pseudo outcome.
  • 23:58Pseudo outcome looks like this, it's just some combination,
  • 24:01it's like an inverse probability weighted residual term
  • 24:04plus something like the plug-in estimator
  • 24:07of the conditional effect.
  • 24:09So it's just some function of the propensity score estimates
  • 24:12and the regression estimates.
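In symbols, this pseudo outcome is (up to notation) the usual uncentered doubly robust quantity

\[
\hat\varphi(Z) \;=\; \frac{A - \hat\pi(X)}{\hat\pi(X)\{1 - \hat\pi(X)\}}\,\big\{Y - \hat\mu_A(X)\big\} \;+\; \hat\mu_1(X) - \hat\mu_0(X),
\]

an inverse-probability-weighted residual plus the plug-in difference of the regression estimates.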
  • 24:15If you've used doubly robust estimators
  • 24:17before you'll recognize this as what
  • 24:19we average when we construct
  • 24:21a usual doubly robust estimator
  • 24:24of the average treatment effect.
  • 24:25And so intuitively, instead of averaging this here,
  • 24:28we're just going to regress it on covariates,
  • 24:30that's exactly how this procedure works.
  • 24:33So it's pretty simple: construct the pseudo outcome,
  • 24:36which we typically would average to estimate the ATE,
  • 24:39now, we're just going to do a regression
  • 24:41of this thing on covariates in our third sample.
  • 24:45So we can write our estimator this way.
  • 24:47This E hat n notation just means
  • 24:49some generic regression estimator.
  • 24:53So one of the crucial points in this work,
  • 24:55so I'm not going to, I want to see what I can say
  • 24:57about the error of this estimator without committing
  • 25:01to a particular estimator.
  • 25:02So if you want to use random forests in that last stage,
  • 25:05I want to be able to tell you what kind
  • 25:07of error to expect or if you want
  • 25:09to use linear regression
  • 25:10or whatever procedure you like,
  • 25:13the goal would be to give you some nice error guarantee.
  • 25:16So (indistinct), and you should think of it as just
  • 25:18your favorite regression estimator.
  • 25:21So we take the pseudo outcome,
  • 25:22we regress it on covariates, super simple,
  • 25:25just create a new column in your dataset,
  • 25:27which looks like this pseudo outcome.
  • 25:28And then treat that as the outcome
  • 25:30in your second stage regression.
  • 25:35So here, let's say we split
  • 25:39our sample into half for the second stage regression,
  • 25:42we would get an n-over-two kind of rate,
  • 25:43since we'd be using half our sample
  • 25:46for the second stage regression.
  • 25:48You can actually just swap these samples
  • 25:49and then you'll get back the full sample size errors.
  • 25:54So it would be as if you had used
  • 25:56the full sample size all at once.
  • 25:59That's called Cross Fitting,
  • 26:01it's becoming sort of popular in the last couple of years.
  • 26:03So here's a schematic of what this thing is doing.
  • 26:06So we split our data into thirds,
  • 26:07use one third
  • 26:09to estimate the propensity score,
  • 26:10another third to estimate the regression functions,
  • 26:12we use those to construct a pseudo outcome
  • 26:14and then we do a second stage regression
  • 26:16of that pseudo outcome on covariates.
  • 26:19So pretty easy, you can do this in three lines of code.
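A minimal sketch of that recipe, continuing the toy example above — the logistic propensity model, the fold assignment, and smoothing splines in every stage are illustrative choices here, and the exact implementation is in the paper:

```r
# DR-learner sketch: split into thirds, estimate nuisances on two folds,
# form the doubly robust pseudo-outcome on the third, then regress it on X.
fold <- sample(rep(1:3, length.out = n))

ps_fit  <- glm(a ~ x, family = binomial, subset = (fold == 1))          # propensity model
mu1_fit <- smooth.spline(x[fold == 2 & a == 1], y[fold == 2 & a == 1])  # outcome model, treated
mu0_fit <- smooth.spline(x[fold == 2 & a == 0], y[fold == 2 & a == 0])  # outcome model, control

idx    <- (fold == 3)
pihat  <- predict(ps_fit, newdata = data.frame(x = x[idx]), type = "response")
mu1hat <- predict(mu1_fit, x[idx])$y
mu0hat <- predict(mu0_fit, x[idx])$y
muahat <- ifelse(a[idx] == 1, mu1hat, mu0hat)

# Doubly robust pseudo-outcome: weighted residual plus plug-in difference.
pseudo <- (a[idx] - pihat) / (pihat * (1 - pihat)) * (y[idx] - muahat) + (mu1hat - mu0hat)

# Second-stage regression of the pseudo-outcome on the covariates.
dr_fit  <- smooth.spline(x[idx], pseudo)
tau_hat <- predict(dr_fit, x)$y
```

Cross fitting, mentioned above, would rotate the roles of the folds and combine the fits to recover the full sample size.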
  • 26:25Okay.
  • 26:26And now our goal is to say something
  • 26:28about the error of this procedure,
  • 26:29being completely agnostic about
  • 26:30how we estimate these propensity scores,
  • 26:32the regression functions and what procedure
  • 26:34we use in this third stage or second stage.
  • 26:40And it turns out we can do this by exploiting
  • 26:42the sample splitting and come up with a strong guarantee
  • 26:46that actually gives you smaller errors than
  • 26:48what appeared in the previous literature
  • 26:50when people focused on specific methods.
  • 26:52And the main thing is we're really exploiting
  • 26:55the sample splitting.
  • 26:58And then the other tool that we're using
  • 26:59is we're assuming some stability condition
  • 27:02on that second stage estimator,
  • 27:03that's the only thing we assume here.
  • 27:06It's really mild, I'll tell you what it is right now.
  • 27:10So you say that regression estimator is stable,
  • 27:14if when you add some constant to the outcome
  • 27:18and then do a regression, you get something
  • 27:21that's the same as if you do the regression
  • 27:22and then add some constant.
  • 27:25So it's pretty intuitive,
  • 27:26if a method didn't satisfy this,
  • 27:27it would be very weird
  • 27:30and actually for the proof,
  • 27:32we don't actually need this to be exactly equal.
  • 27:35So adding a constant pre versus post regression
  • 27:38shouldn't change things too much.
  • 27:40You don't have to have it be exactly equal,
  • 27:43it still works if it's just equal up
  • 27:44to the error in the second stage regression.
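As a rough statement, writing \(\hat E_n\{\cdot \mid X = x\}\) for the second stage regression estimator, the condition is that for any constant c,

\[
\hat E_n\{Y + c \mid X = x\} \;\approx\; \hat E_n\{Y \mid X = x\} + c,
\]

with equality needed only up to the order of the second stage regression error.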
  • 27:52So that's the first stability condition.
  • 27:55The second one is just that if you have
  • 27:57two random variables with the same conditional expectation,
  • 28:00then the mean squared error is going
  • 28:01to be the same up to constants.
  • 28:03Again, any procedure
  • 28:05that didn't satisfy these two assumptions
  • 28:07would be very bizarre.
  • 28:11These are very mild stability conditions.
  • 28:13And that's essentially all we need.
  • 28:15So now our benchmark here is going to be an oracle estimator
  • 28:21that instead of doing a regression with the pseudo,
  • 28:23it does a regression with the actual potential outcomes,
  • 28:26Y one and Y zero.
  • 28:30So we can think about the mean squared error
  • 28:31of this estimator, so I'm using mean squared error,
  • 28:34just sort of for simplicity and convention,
  • 28:36you could think about translating this
  • 28:38to other kinds of measures of risk.
  • 28:40That would be an interesting area for future work.
  • 28:44So this R star is the Oracle
  • 28:47mean squared error.
  • 28:48It's the mean squared error you'd get for estimating
  • 28:50the conditional effect if you actually saw
  • 28:52the potential outcomes.
  • 28:55So we get this really nice, simple result,
  • 28:57which says that the mean squared error
  • 28:59of that DR-learner procedure that uses the pseudo outcomes,
  • 29:03it just looks like the Oracle mean squared error,
  • 29:06plus a product of mean squared errors in estimating
  • 29:08the propensity score and the regression function.
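Schematically — see the paper for the precise statement and constants — the bound being described has the form

\[
E\big[\{\hat\tau_{\mathrm{DR}}(X) - \tau(X)\}^2\big]
\;\lesssim\;
E\big[\{\tilde\tau_{\mathrm{oracle}}(X) - \tau(X)\}^2\big]
\;+\;
E\big[\{\hat\pi(X) - \pi(X)\}^2\big] \cdot E\big[\{\hat\mu_a(X) - \mu_a(X)\}^2\big],
\]

the Oracle mean squared error plus a product of the nuisance mean squared errors.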
  • 29:12It resembles the kind of doubly robust error results
  • 29:16that you see for estimating average treatment effects,
  • 29:18but now we have this for conditional effects.
  • 29:23The proof technique is very different here compared
  • 29:25to what is done in the average effect case.
  • 29:29But the proof is actually very, very straightforward.
  • 29:32It's like a page long, you can take a look in the paper,
  • 29:35it's really just leaning on this sample splitting
  • 29:38and then using stability in a slightly clever way.
  • 29:42But the most complicated tool it uses is just
  • 29:45some careful use of the components
  • 29:49of the estimator and iterated expectation.
  • 29:53So it's really a pretty simple proof, which I like.
  • 29:57So yeah, this is the main result.
  • 29:59And again, we're not assuming anything beyond
  • 30:02this mild stability here, which is nice.
  • 30:04So you can use whatever regression procedures you like.
  • 30:07And this will tell you something about the error
  • 30:09how it relates to the Oracle error that you would get
  • 30:12if you actually observed the potential outcomes.
  • 30:18So this is model free method-agnostic,
  • 30:21it's also a finite sample bound,
  • 30:23there's nothing asymptotic here.
  • 30:25This means that the mean squared error is upper bounded up
  • 30:28to some constant times this term on the right.
  • 30:31So there's no n going to infinity or anything here either.
  • 30:39So the other crucial point of this is
  • 30:41because we have a product of mean squared errors,
  • 30:44you have the kind of usual doubly robust story.
  • 30:46So if one of these is small, the product will be small,
  • 30:50potentially more importantly, if they're both kind
  • 30:52of modest sized because both, maybe the propensity score
  • 30:55and the regression functions are hard to estimate
  • 30:57the product will be potentially quite a bit smaller
  • 31:01than the individual pieces.
  • 31:04And this is why this is showing you that that sort
  • 31:08of plug-in approach, which would really just be driven
  • 31:10by the mean squared error for estimating
  • 31:11the regression functions can be improved by quite a bit,
  • 31:15especially if there's some structure to exploit
  • 31:17in the propensity scores.
  • 31:23Yeah, so in previous work people used specific methods.
  • 31:26So they would say I'll use
  • 31:28maybe series estimators or kernel estimators
  • 31:31and then the error bound was actually bigger
  • 31:34than what we get here.
  • 31:36So it's a little surprising that you can get
  • 31:38a smaller error bound under weaker assumptions,
  • 31:40but this is a nice advantage
  • 31:42of the sample splitting trick here.
  • 31:49Now that you have this nice error bound you can plug
  • 31:52in sort of results from any of your favorite estimators.
  • 31:56So we know lots about mean squared error
  • 31:59for estimating regression functions.
  • 32:01And so you can just plug in what you get here.
  • 32:03So for example, you think about smooth functions.
  • 32:07So these are functions in Hölder classes, intuitively
  • 32:11these are functions that are close to their Taylor
  • 32:13approximations, the strict definition of
  • 32:17which maybe I'll pass over in the interest of time.
  • 32:22Then you can say, for example, if PI is alpha smooth,
  • 32:25so it has alpha partial derivatives
  • 32:30with the highest order derivative Lipschitz, then we know
  • 32:33that you can estimate the propensity score
  • 32:37with the mean squared error that looks like
  • 32:38n to the minus two alpha over two alpha plus D,
  • 32:41this is the usual non-parametric regression
  • 32:44mean squared error.
  • 32:46You can say the same thing for the regression functions.
  • 32:49If they're beta smooth, then we can estimate them
  • 32:51at the usual non-parametric rate,
  • 32:53n to the minus two beta over two beta plus D.
  • 32:56Then we could say,
  • 32:57okay, suppose the conditional effect,
  • 32:59Tau is gamma smooth, and gamma, it can't be smaller
  • 33:04than beta, it has to be at least as smooth
  • 33:05as the regression functions and in practice,
  • 33:08it could be much more smooth.
  • 33:09So for example, in the case where the CATE is just zero
  • 33:12or constant, Gamma's like infinity, infinitely smooth.
  • 33:17Then if we use a second stage estimator that's optimal
  • 33:20for estimating Gamma smooth functions,
  • 33:24we can just plug in the error rates
  • 33:25that we get and see
  • 33:26that we get a mean squared error bound
  • 33:28that looks like the Oracle rate.
  • 33:30This is the rate we would get if we actually observed
  • 33:33the potential outcomes.
  • 33:35And then we get this product of mean squared errors.
  • 33:37And so whenever this product of mean squared errors
  • 33:40is smaller than the Oracle rate,
  • 33:42then we're achieving the Oracle rate up to constants,
  • 33:46the same rate that we would get
  • 33:47if we actually saw Y one minus Y zero.
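Plugging in those standard nonparametric rates, the bound takes roughly the form

\[
E\big[\{\hat\tau_{\mathrm{DR}}(X) - \tau(X)\}^2\big]
\;\lesssim\;
n^{-\frac{2\gamma}{2\gamma + d}}
\;+\;
n^{-\frac{2\alpha}{2\alpha + d}}\, n^{-\frac{2\beta}{2\beta + d}},
\]

with the first term the Oracle rate for a \(\gamma\)-smooth CATE and the second the product of the nuisance mean squared errors, so the Oracle rate is attained (up to constants) whenever the product term is of smaller order.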
  • 33:51And so you can work out the conditions,
  • 33:53what you need to make this term smaller than this one,
  • 33:56that's just some algebra
  • 34:00and it has some interesting structure.
  • 34:03So if the average smoothness of the two nuisance functions,
  • 34:07the propensity score and the regression function
  • 34:09is greater than D over two divided by some inflation factor,
  • 34:14then you can say that you're achieving
  • 34:18the same rate as this Oracle procedure.
  • 34:22So the analog of this for the average treatment effect,
  • 34:27or the result you need
  • 34:28for the standard doubly robust estimator
  • 34:30of the average treatment effect,
  • 34:31is that the average smoothness is greater than D over two.
  • 34:34So here we don't have D over two,
  • 34:36we have D over two over one plus D over gamma.
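Roughly, as stated here, the condition for matching the Oracle works out to

\[
\frac{\alpha + \beta}{2} \;>\; \frac{d/2}{1 + d/\gamma},
\]

compared with the familiar average-smoothness condition \((\alpha + \beta)/2 > d/2\) for the average treatment effect.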
  • 34:40So this is actually giving you a sort
  • 34:44of a lower threshold for achieving Oracle rates
  • 34:48in this problem.
  • 34:49So, because it's a harder problem,
  • 34:51we need weaker conditions
  • 34:52on the nuisance estimation to behave like an Oracle
  • 34:56and how much weaker those conditions
  • 34:58are, depends on the dimension of the covariates
  • 35:00and the smoothness of the conditional effect.
  • 35:04So if we think about the case where the conditional effect
  • 35:06is like infinitely smooth,
  • 35:07so this is almost like a parametric problem.
  • 35:10Then we recover the usual condition that we need
  • 35:13for the doubly robust estimator to be root-n
  • 35:14consistent, average smoothness greater than D over two.
  • 35:20But for finite, non-trivial smoothness of the conditional effect,
  • 35:26then we're somewhere in between sort of when
  • 35:28a plugin is optimal and this nice kind of parametric setup.
  • 35:34So this is just a picture of the rates here
  • 35:37which is useful to keep in mind.
  • 35:39So here on the x-axis, we have the smoothness
  • 35:43of the nuisance functions.
  • 35:44You can think of this as the average smoothness
  • 35:46of the propensity score and regression functions.
  • 35:49And again, in this Hölder smooth model,
  • 35:52which is a common model people use in non-parametrics,
  • 35:55the more smooth things are
  • 35:57the easier it is to estimate them.
  • 36:00And then here we have the mean squared error
  • 36:02for estimating the conditional effect.
  • 36:06So here is the minimax lower bound,
  • 36:09this is the best possible mean squared error
  • 36:11that you can achieve for the average treatment effect.
  • 36:14This is just to kind of anchor our results
  • 36:16and think about what happens relative to this nicer,
  • 36:19simpler parameter, which is just the overall average
  • 36:21and not the conditional average.
  • 36:24So once you hit a certain smoothness in this case,
  • 36:26it's five, so this is looking at
  • 36:28a 20 dimensional covariate case where
  • 36:32the CATE smoothness is twice the dimension
  • 36:35just to fix ideas.
  • 36:38And so once we hit this smoothness of five,
  • 36:42so we have five partial derivatives,
  • 36:43then it's possible to achieve a root-n rate.
  • 36:47So this is n to the one half for estimating
  • 36:50the average treatment effect.
  • 36:52Root-n rates are never possible for conditional effects.
  • 36:55So here's the Oracle rate.
  • 36:58This is the rate that we would achieve in this problem
  • 37:00if we actually observed the potential outcomes.
  • 37:02So it's slower than root-n, it's a bigger error.
  • 37:08Here's what you would get with the plugin.
  • 37:10This is just really inheriting the complexity
  • 37:13of estimating the regression functions individually,
  • 37:15it doesn't capture this CATE smoothness
  • 37:18and so you need the regression functions
  • 37:20to be sort of infinitely smooth or as smooth
  • 37:22as the CATE to actually get Oracle efficiency
  • 37:25with the plugin estimator.
  • 37:28So this plugin has big errors;
  • 37:29if we use this DR-learner approach,
  • 37:32we close this gap substantially.
  • 37:36So we can say that we're hitting this Oracle rate.
  • 37:39Once we have a certain amount of smoothness
  • 37:41of the nuisance functions and in between
  • 37:44we get an error that looks something like this.
  • 37:49So this is just a picture of these rate results showing,
  • 37:53graphically, the improvement of the DR-learner approach
  • 37:56here over a simple plug-in estimator.
  • 38:03So yeah, just the punchline here is
  • 38:06this simple two-stage doubly robust approach
  • 38:09can do a good job adapting to underlying structure
  • 38:12in the conditional effect,
  • 38:14even when the nuisance stuff,
  • 38:16the propensity scores
  • 38:17and the underlying regression functions
  • 38:18are more complex or less smooth in this case.
  • 38:24This is just talking about the relation
  • 38:26to the average treatment effect conditions,
  • 38:28which I mentioned before.
  • 38:32So you can do the same thing for any generic
  • 38:34regression methods you like.
  • 38:35So in the paper, I do this for smooth models
  • 38:38and sparse models, which are common
  • 38:39in these non-parametric settings,
  • 38:41where you have high dimensional Xs
  • 38:43and you believe that some subset
  • 38:45of them are the ones that matter.
  • 38:48So I'll skip past this, if you're curious though,
  • 38:50all the details are in the paper.
  • 38:52So you can say, what kind of sparsity do
  • 38:53you need in the propensity score
  • 38:55and regression functions to be able
  • 38:57to get something that behaves like an Oracle
  • 38:59that actually saw the potential outcomes from the start.
  • 39:04You can also do the same kind of game
  • 39:05where you compare this to what you need
  • 39:06for the average treatment effect.
  • 39:11Yeah, happy to talk about this offline
  • 39:13or afterwards if people have questions.
  • 39:18So there's also a nice kind of side result
  • 39:21which I think I'll also go through quickly here.
  • 39:24From all this, is just a general Oracle inequality
  • 39:29for regression when you have some estimated outcomes.
  • 39:31So in some sense, there isn't anything really special
  • 39:34in our results that has to do
  • 39:37with this particular pseudo outcome.
  • 39:39So, the proof that we have here works
  • 39:43for any second stage or any two-stage sort
  • 39:46of regression procedure
  • 39:48where you first estimate some nuisance stuff,
  • 39:50create a pseudo outcome that depends
  • 39:52on this estimated stuff and then do a regression
  • 39:54of the pseudo outcome on some set of covariates.
  • 39:58And so a nice by-product of this work,
  • 40:00is you get a kind of similar error bound
  • 40:02for just generic regression with pseudo outcomes.
  • 40:07This comes up in a lot of different problems, actually.
  • 40:10So one is when you want just a partly conditional effect.
  • 40:15So maybe I don't care about how effects vary
  • 40:17with all the Xs, but just a subset of them,
  • 40:19then you can apply this result.
  • 40:20I have a paper with a great student, Amanda Coston,
  • 40:23who studied a version of this;
  • 40:28regression with missing outcomes,
  • 40:30again, these look like nonparametric regression problems
  • 40:33where you have to estimate some pseudo outcome;
  • 40:36dose response curve problems, conditional IV effects,
  • 40:40partially linear IVs.
  • 40:41So there are lots of different variants where you need
  • 40:43to do some kind of two-stage regression procedure like this.
  • 40:51Again, you just need a stability condition
  • 40:52and you need some sample splitting
  • 40:54and you can give a similar kind of a nice rate result
  • 40:57that we got for the CATE specific problem,
  • 41:01but for the generic pseudo outcome regression problem.
  • 41:07So we've got about 15 minutes,
  • 41:10I have some simulations,
  • 41:12which I think I will go over quickly.
  • 41:15So we did this in a couple simple models,
  • 41:17one, a high dimensional linear model.
  • 41:20It's actually a logistic model where
  • 41:22we have 500 covariates and 50
  • 41:25of them have non-zero coefficients.
  • 41:28We just used the default lasso fitting in R
  • 41:32and compared plugin estimators
  • 41:34to the doubly robust approach that we talked
  • 41:37about and then also an X-learner,
  • 41:39which is a sort of variant of the plug-in approach
  • 41:44that was proposed in recent years.
  • 41:47And the basic story is you get sort
  • 41:49of what the theory predicts.
  • 41:50So the DR-learner does better than these plug-in types
  • 41:54of approaches in this setting.
  • 41:58The nuisance functions are hard to estimate
  • 42:00and so you don't see a massive gain over,
  • 42:02for example, the X-Learner,
  • 42:04you do see a pretty massive gain
  • 42:05over the simple plugin.
  • 42:08And we're a bit away
  • 42:10from this Oracle DR-learner approach here,
  • 42:13so that means the errors are relatively different.
  • 42:16This is telling us that the nuisance stuff is hard
  • 42:19to estimate in this simulation set up.
  • 42:22Here's another simulation based
  • 42:24on that plot I showed you before.
  • 42:28And so here, I'm actually estimating the propensity scores,
  • 42:31but I'm constructing the estimates myself
  • 42:33so that I can control the rate of convergence
  • 42:35and see how things change across different error rates
  • 42:39for estimating the propensity score.
  • 42:41So here's what we see.
  • 42:42So on the x-axis here,
  • 42:44we have how well we're estimating the propensity score.
  • 42:48So this is a convergence rate
  • 42:50for the propensity score estimator.
  • 42:52Y-axis, we have the mean squared error
  • 42:54and then this red line is the plugin estimator,
  • 42:56it's doing really poorly.
  • 42:57It's not capturing this underlying simplicity
  • 42:59of the conditional effects.
  • 43:00It's really just inheriting that difficulty
  • 43:03in estimating the regression functions.
  • 43:05Here's the X-learner, it's doing a bit better
  • 43:07than the plugin, but it's still not doing
  • 43:10a great job capturing the underlying simplicity
  • 43:12in the conditional effect.
  • 43:14This dotted line is the Oracle.
  • 43:16So this is what you would get
  • 43:18if you actually observed the potential outcomes.
  • 43:20And then the black line is the DR-learner,
  • 43:23this two-stage procedure here,
  • 43:24I'm just using smoothing splines everywhere,
  • 43:26just defaults in R, it's like three lines of code,
  • 43:29all the code's in the paper, too,
  • 43:31if you want to play around with this.
  • 43:33And here we see what we expect.
  • 43:35So when it's really hard to estimate the propensity score,
  • 43:38it's just a hard problem and we don't do
  • 43:40much better than the X-learner.
  • 43:44We still get some gain over the plugin in this case,
  • 43:47but as soon as you can estimate the propensity score
  • 43:50well at all, you start seeing some pretty big gains
  • 43:54by doing this doubly robust approach
  • 43:56and at some point we start to roughly match
  • 43:58the Oracle actually.
  • 44:02As soon as we're getting something like
  • 44:03n to the one quarter rates in this case,
  • 44:04we're getting close to the Oracle.
  • 44:10So maybe I'll just show you an illustration
  • 44:12and then I'll talk about the second part of the talk
  • 44:15very briefly, and if people
  • 44:17want to talk about that
  • 44:20offline, I'd be more than happy to.
  • 44:23So here's a study, which I actually learned about
  • 44:25from Peter looking at effects of canvassing
  • 44:27on voter turnout, so this is this timely study.
  • 44:30Here's the paper, there are almost 20,000 voters
  • 44:36across six cities here.
  • 44:37They were randomly encouraged to vote
  • 44:42in these local elections, meaning people would go
  • 44:45and talk to them face to face.
  • 44:47You remember what that was like pre-pandemic.
  • 44:50Here's a script of the sort of canvassing that they did,
  • 44:55just saying, reminding them of the election,
  • 44:58giving them a reminder to vote.
  • 45:00Hopefully I'm doing this for you as well,
  • 45:02if you haven't voted already.
  • 45:04And so what's the data we have here?
  • 45:07We have a number of covariates things like city,
  • 45:10party affiliation, some measures
  • 45:11of the past voting history, age, family size, race.
  • 45:15Again, the treatment is whether they were randomly contacted,
  • 45:19or actually whether they were randomly assigned, since in some cases
  • 45:22people couldn't be contacted in the setup.
  • 45:25So we're just looking at intention
  • 45:26to treat kinds of effects.
  • 45:28And then the outcome is whether people voted
  • 45:30in the local election or not.
  • 45:32So just as kind of a proof of concept,
  • 45:35I use this DR-learner approach,
  • 45:37I just use two folds and use random forests separately
  • 45:42for the first stage regressions and the second stage.
  • 45:47And actually for one part of the analysis,
  • 45:50I used generalized additive models in that second stage.
  • 45:56So here's a histogram of the conditional effect estimates.
  • 46:00So there's sort of a big chunk, a little bit above zero,
  • 46:03but then there is some heterogeneity
  • 46:04around that in this case.
  • 46:07So there are some people
  • 46:08who maybe seem especially responsive to canvassing,
  • 46:12maybe some people it's not really doing anything for
  • 46:15and actually some are less likely to vote, potentially.
  • 46:18This is a plot of the effect estimates
  • 46:21from this DR-learner procedure,
  • 46:22just to see what they look like,
  • 46:24how this would work in practice across
  • 46:27two potentially important covariates.
  • 46:30So here's the age of the voter and then the party
  • 46:34and the color here represents the size and direction
  • 46:39of the CATE estimate of the conditional effect estimates,
  • 46:41so blue means canvassing is having a bigger effect
  • 46:45on voting in the next local election.
  • 46:50Red means less likely to vote due to canvassing.
  • 46:55So you can see some interesting structure here just briefly,
  • 46:59the independent people,
  • 47:01it seems like the effects are closer to zero.
  • 47:03Democrats maybe seem more likely to be positively affected,
  • 47:07maybe more so among younger people.
  • 47:11It's just an example of the kind of
  • 47:13sort of graphical visualization stuff you could do
  • 47:16with this sort of procedure.
  • 47:18This is the plot I showed before, where here,
  • 47:21we're looking at just how the conditional
  • 47:22effect varies with age.
  • 47:24And you can see some evidence
  • 47:25that younger people are responsive to canvassing.
  • 47:33Older people, less evidence that there's any response.
  • 47:43I should stop here and see if people have any questions.
  • 47:51- So Edward, can I ask a question?
  • 47:53- Of course yeah.
  • 47:55- I think we've discussed point estimation.
  • 47:57Does this approach also allow
  • 47:59for consistent variance estimation?
  • 48:01- Yeah, that's a great question.
  • 48:04Yeah, I haven't included any of that here,
  • 48:08but if you think about
  • 48:11this Oracle result that we have,
  • 48:17if these errors are small enough,
  • 48:19so under the kinds of conditions that we talked about,
  • 48:22then we're getting an estimator that looks like an Oracle
  • 48:26regression of the potential outcomes on the covariates.
  • 48:29And that means that as long as these are small enough,
  • 48:31we could just port over any inferential tools
  • 48:33that we like from standard non-parametric regression
  • 48:35treating our pseudo outcomes as if they were
  • 48:38the true potential outcomes, yeah.
  • 48:41That's a really important point,
  • 48:43I'm glad you mentioned that.
  • 48:44- Thanks.
  • 48:45- So inference is more complicated
  • 48:47and nuanced than non-parametric regression,
  • 48:51but any inferential tool could be used here.
  • 48:56- So operationally, just to think
  • 48:57about how to operationalize the variance estimation
  • 48:59also, does that require the cross fitting procedure
  • 49:03where you're swapping your D one D two
  • 49:06in the estimation process and then?
  • 49:10- Yeah, that's a great question too.
  • 49:11So not necessarily,
  • 49:12so you could just use these folds
  • 49:14for nuisance training and then go to this fold
  • 49:17and then just forget that you ever used this data
  • 49:19and just do variance estimation here.
  • 49:21The drawback there would be,
  • 49:22you're only using a third of your data.
  • 49:25If you really want to make full use
  • 49:26of the sample size using
  • 49:28the cross fitting procedure would be ideal,
  • 49:31but the inference doesn't change.
  • 49:32So if you do cross fitting,
  • 49:35you would at the end of the day,
  • 49:36you'd get an out of sample CATE estimate
  • 49:39for every single row in your data, every subject,
  • 49:42but just where the nuisance stuff for that estimate
  • 49:47was built from other samples.
  • 49:50But at the end of the day, you'd get one big column
  • 49:52with all these out of sample CATE estimates
  • 49:54and then you could just use
  • 49:55whatever inferential tools you like there.
  • 50:00- Thanks.
  • 50:07- So, just got a few minutes.
  • 50:10So maybe I'll just give you a high level kind of picture
  • 50:12of the stuff in the second part of this talk
  • 50:14which is really about pursuing the fundamental limits
  • 50:19of conditional effect estimation.
  • 50:20So what's the best we could possibly do here?
  • 50:23This is completely unknown,
  • 50:25which I think is really fascinating.
  • 50:27So if you think about what we have so far,
  • 50:29so far, we've given these sufficient conditions under
  • 50:33which this DR-learner is Oracle efficient,
  • 50:36but a natural question here is what happens
  • 50:38when those mean squared error terms are too big
  • 50:40and so we can't say that we're getting
  • 50:42the Oracle rate anymore.
  • 50:45Then you might say,
  • 50:46okay, is this a bug with the DR-learner?
  • 50:50Maybe I could have adapted this in some way
  • 50:52to actually do better or maybe I've reached the limits
  • 50:56of how well I can do for estimating the effect.
  • 51:00It doesn't matter if I had gone to a different estimator,
  • 51:03I think I would've had the same kind of error.
  • 51:07So this is the goal of this last part of the work.
  • 51:12So here we use a very different estimator.
  • 51:14It's built using this R-learner idea,
  • 51:17which is a reproducing kernel Hilbert space (RKHS) extension of this
  • 51:22classic double residual regression method
  • 51:24of Robinson, which is really cool.
  • 51:27This is actually from 1988, so it's a classic method.
  • 51:33And so we study a non-parametric version
  • 51:35of this built from local polynomial estimators.
  • 51:38And I'll just give you a picture
  • 51:40of what the estimator is doing.
  • 51:41It's quite a bit more complicated
  • 51:43than that DR-learner procedure.
  • 51:45So we again use this triple sample splitting
  • 51:47and here it's actually much more crucial.
  • 51:50So if you didn't use that triple sample splitting
  • 51:53for the DR-learner,
  • 51:53you'd just get a slightly different error bound,
  • 51:55but here it's actually really important.
  • 51:57I'd be happy to talk to people about why specifically.
  • 52:01So in one part of the sample we estimate propensity scores,
  • 52:04and in another part of the sample
  • 52:05we estimate propensity scores and regression functions,
  • 52:08now the marginal regression functions.
  • 52:10We combine these to get weights, kernel weights.
  • 52:13We also combine them to get residuals,
  • 52:15so treatment residuals and outcome residuals.
  • 52:18This is like what you would get
  • 52:19from the Robinson procedure from econ.
  • 52:24Then, instead of doing a regression
  • 52:26of outcome residuals on treatment residuals,
  • 52:28we do a weighted nonparametric regression
  • 52:32of the outcome residuals on the treatment residuals.
  • 52:34So that's the procedure, a little bit more complicated.
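A local-constant caricature of that residual-on-residual step might look like the sketch below; the Gaussian kernel, the single bandwidth, and the function names are simplifications of mine, whereas the actual lp-R-learner uses local polynomials and the specific weighting from the paper.

```python
import numpy as np

def residual_on_residual_cate(x0, X, A, Y, pi_hat, m_hat, h=0.5):
    """Local-constant sketch of the Robinson-style residual-on-residual step
    at a single evaluation point x0. pi_hat and m_hat are arrays of cross-fit
    nuisance predictions: propensity scores and marginal outcome regressions
    estimated on other folds, as in the triple sample splitting."""
    # Kernel weights localizing the regression around x0 (bandwidth h).
    K = np.exp(-0.5 * np.sum(((X - x0) / h) ** 2, axis=1))
    # Treatment and outcome residuals, as in the double residual regression.
    A_res = A - pi_hat
    Y_res = Y - m_hat
    # Weighted least squares of outcome residuals on treatment residuals:
    # the local slope is the conditional effect estimate at x0.
    return np.sum(K * A_res * Y_res) / np.sum(K * A_res ** 2)
```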
  • 52:38And again, this is,
  • 52:39I think there are ways to make this work well practically,
  • 52:42but the goal of this work is really to try
  • 52:44and figure out what's the best possible
  • 52:46mean squared error that we could achieve.
  • 52:47It's less about a practical method,
  • 52:51more about just understanding how hard
  • 52:52the conditional effect estimation problem is.
  • 52:56And so we actually show that a generic version
  • 52:59of this procedure,
  • 53:02as long as you estimate the propensity scores
  • 53:03and the regression functions with linear smoothers,
  • 53:05with particular bias and variance properties,
  • 53:08which are standard in nonparametrics,
  • 53:11you can actually get better mean squared error
  • 53:13than for the DR-learner.
  • 53:15I'll just give you a sense of what this looks like.
  • 53:19So you get something that looks like an Oracle rate plus
  • 53:22something like the squared bias from the nuisance estimates,
  • 53:27from the propensity score and regression functions.
  • 53:32So before you had the product of mean squared errors;
  • 53:36now we have the square of the bias,
  • 53:38rather than the mean squared error,
  • 53:40of the two procedures, the propensity score and the regression function.
  • 53:43And this opens the door to undersmoothing.
  • 53:46So this means that you can estimate the propensity score
  • 53:49and the regression functions in a way that's suboptimal
  • 53:52if you actually just care about
  • 53:54these functions by themselves.
  • 53:56So you drive down the bias, which blows up
  • 53:59the variance a little bit,
  • 54:00but it turns out not to affect the conditional effect
  • 54:03estimate too much if you do it in the right way.
  • 54:06And so if you do this,
  • 54:07you get a rate that looks like this:
  • 54:11you get an Oracle rate
  • 54:12plus n to the minus 2s over d.
  • 54:15And this is strictly better than what we got
  • 54:18with the DR-learner.
  • 54:20(clears throat)
  • 54:21You can do the same game where you see sort
  • 54:23of when the Oracle rate is achieved here; it's achieved
  • 54:27if the average smoothness of the nuisance functions
  • 54:29is greater than d over four,
  • 54:31and then here, the inflation factor is also changing.
  • 54:34So before,
  • 54:35we needed the smoothness to be greater than d over two
  • 54:38over one plus d over gamma;
  • 54:40now we need d over four over one plus d over two gamma.
  • 54:45So this is a weaker condition.
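Written out, the comparison just described looks roughly like the following (my transcription of the spoken rates, with s the average smoothness of the nuisance functions, gamma the smoothness of the CATE, and d the covariate dimension; see the paper on arXiv for the precise statements).

```latex
% DR-learner: oracle rate plus a product of nuisance errors,
% oracle-efficient when the average nuisance smoothness satisfies
\[
  s \;>\; \frac{d/2}{1 + d/\gamma}.
\]
% lp-R-learner: oracle rate plus a squared-bias-type remainder,
\[
  \mathrm{error} \;\lesssim\; n^{-\gamma/(2\gamma + d)} \;+\; n^{-2s/d},
\]
% which achieves the oracle rate under the weaker condition
\[
  s \;>\; \frac{d/4}{1 + d/(2\gamma)}.
\]
```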
  • 54:46So this is telling us that there are settings
  • 54:48where that DR-learner is not Oracle efficient,
  • 54:52but there exists an estimator, which is,
  • 54:53and it looks like this estimator
  • 54:56I had described here,
  • 54:57this regression on residuals thing.
  • 55:02So that's the story.
  • 55:03You can actually,
  • 55:03you can actually beat this DR-learner.
  • 55:05And now the question is, okay, what happens?
  • 55:08One, what happens
  • 55:09when we're not achieving the Oracle rate here,
  • 55:11can you still do better?
  • 55:13A second question is can anything, yeah.
  • 55:19Can anything achieve the Oracle rate
  • 55:20under weaker conditions than this?
  • 55:22And so I haven't proved anything about this yet.
  • 55:25It turns out to be somewhat difficult,
  • 55:29but I conjecture that this condition is minimax.
  • 55:33So I don't think
  • 55:34any estimator could ever be Oracle efficient
  • 55:36under weaker conditions than what this estimator is.
  • 55:40So this is just a picture of the results again.
  • 55:42So here it's the same setting as before;
  • 55:45we have the plugin estimator and the DR-learner,
  • 55:48and here's what we get with this.
  • 55:51I call it the lp-R-learner;
  • 55:53it's a local polynomial version of the R-learner.
  • 55:55And so we're actually getting quite a bit smaller rates.
  • 55:58We're hitting the Oracle rate under weaker conditions
  • 56:02on the smoothness.
  • 56:03Now, the question is whether we can fill this gap anymore,
  • 56:08and this is unknown.
  • 56:09This is one of the open questions in causal inference.
  • 56:14So yeah, I think in the interest of time,
  • 56:18I'll skip to the discussion section here.
  • 56:21We can actually fill the gap a little bit
  • 56:22with some extra tuning.
  • 56:26Just interesting.
  • 56:29Okay.
  • 56:30Yeah.
  • 56:31So this last part is really about just pushing the limits,
  • 56:32trying to figure out what the best possible performance is.
  • 56:35Okay.
  • 56:36So just to wrap things up,
  • 56:39right we gave some new results here
  • 56:41that let you be very flexible with
  • 56:43the kinds of methods that you want to use.
  • 56:46They do a good job of exploiting this CATE structure
  • 56:49when it's there and don't lose much when it's not.
  • 56:54So we have this nice model-free error bound.
  • 56:57We also kind of get, for free,
  • 56:59this nice general Oracle inequality.
  • 57:03We did some investigation of the best possible rates
  • 57:06of convergence,
  • 57:07the best possible mean squared error
  • 57:08for estimating conditional effects,
  • 57:11which again was unknown before.
  • 57:14These are the weakest conditions
  • 57:15that have appeared,
  • 57:17but it's still not entirely known whether
  • 57:19they are minimax optimal or not.
  • 57:23So, yeah, big picture goals.
  • 57:24We want some nice flexible tools,
  • 57:26strong guarantees, and to push forward
  • 57:28our understanding of this problem.
  • 57:30I hope I've conveyed that there are lots of fun,
  • 57:32open problems here to work out
  • 57:34with important practical implications.
  • 57:37Here's just a list of them.
  • 57:38I'd be happy to talk more with people at any point,
  • 57:42feel free to email me; a big part is applying
  • 57:44these methods in real problems.
  • 57:46And yeah, I should stop here,
  • 57:49but feel free to email me; the papers are on arXiv here.
  • 57:53I'd be happy to hear people's thoughts.
  • 57:55Yeah.
  • 57:55Thanks again for inviting me.
  • 57:56It was fun.
  • 57:58- Yeah.
  • 57:59Thanks Edward.
  • 57:59That's a very nice talk and I think we're hitting the hour,
  • 58:02but I want to see in the audience
  • 58:04if we have any questions.
  • 58:05Huh.
  • 58:13All right.
  • 58:14If not, I do have one final question
  • 58:16if that's okay.
  • 58:17- Yeah, of course.
  • 58:18- And so I think there is a whole host of literature
  • 58:21on flexible outcome modeling
  • 58:23to estimate conditional average causal effects,
  • 58:26especially those Bayesian non-parametric tree models
  • 58:28(laughs)
  • 58:30that are getting popular.
  • 58:32So I am just curious to see if you have ever thought
  • 58:36about comparing their performances,
  • 58:38or do you think there are some differences
  • 58:40between those Bayesian
  • 58:42non-parametric tree models versus
  • 58:44the plugin estimator
  • 58:46we compared in the simulation study here?
  • 58:48- Yeah.
  • 58:49I think of them
  • 58:50as really just versions of that plugin estimator
  • 58:53that use a different regression procedure.
  • 58:55There may be ways to tune plugins to try
  • 58:58and exploit this special structure of the CATE.
  • 59:01But if you're really just looking
  • 59:02at the regression functions individually,
  • 59:05I think these would be susceptible to the same kinds
  • 59:07of issues that we see with the plugin.
  • 59:09Yeah.
  • 59:10That's a good one.
  • 59:11- I see.
  • 59:12Yep.
  • 59:13So I want to see if there's any further questions
  • 59:17from the audience for Dr. Kennedy.
  • 59:21(indistinct)
  • 59:23- I was just wondering if you could speak a little more,
  • 59:26why the standard, like, Neyman orthogonality results
  • 59:29can or can't be applicable in this setup?
  • 59:32- [Edward] Yeah.
  • 59:33(clears throat)
  • 59:34Yeah.
  • 59:35That's a great question.
  • 59:36So one way to say it is that these effects,
  • 59:42these conditional effects
  • 59:43are not pathwise differentiable.
  • 59:46And so, there's some distinction
  • 59:50between Neyman orthogonality
  • 59:51and pathwise differentiability,
  • 59:52but maybe we can think about them
  • 59:53as being roughly the same for now.
  • 59:57So yeah, all the standard semiparametric
  • 59:59theory breaks down here
  • 01:00:01because of this lack of pathwise differentiability, so
  • 01:00:04all the efficiency bounds that
  • 01:00:05we know and love don't apply,
  • 01:00:09but it turns out that there's some kind
  • 01:00:11of analogous version of this that works for these things.
  • 01:00:15I think of them as like infinite-dimensional functionals.
  • 01:00:18So instead of like the ATE, which is just a number,
  • 01:00:20this is like a curve,
  • 01:00:22but it has the same kinds of like functional structure
  • 01:00:25in the sense that it's combining regression functions
  • 01:00:28or our propensity scores in some way.
  • 01:00:29And we don't care about the individual components.
  • 01:00:33We care about their combination.
  • 01:00:36So yeah, the standard stuff doesn't work just
  • 01:00:38because we're outside of this root-n
  • 01:00:40regime, roughly, but there are, yeah,
  • 01:00:44there's analogous structure and there's tons
  • 01:00:46of important work to be done,
  • 01:00:48sort of formalizing this and extending it.
  • 01:00:53That's a little vague, but hopefully that helps.
  • 01:01:02- All right.
  • 01:01:03So any further questions?
  • 01:01:08- Thanks again.
  • 01:01:09And yeah.
  • 01:01:10If any questions come up, feel free to email.
  • 01:01:12- Yeah.
  • 01:01:14If not,
  • 01:01:14let's thank Dr. Kennedy again.
  • 01:01:16And I'm sure he'll be happy
  • 01:01:17to answer your questions offline.
  • 01:01:19So thanks everyone.
  • 01:01:20I'll see you.
  • 01:01:21We'll see you next week.
  • 01:01:22- Thanks a lot.