BIS Seminar: Adventures in sparsity and shrinkage with the normal means model
March 3, 2021
Matthew Stephens, PhD, Ralph W. Gerard Professor, Department of Statistics, Human Genetics and the College
The University of Chicago
March 2, 2021
- 00:04- So, hi everyone.
- 00:06Since we're still waiting for people to join,
- 00:08I will first give a brief introduction
- 00:11to Matthew here.
- 00:13First, it's my honor to introduce Dr. Matthew Stephens
- 00:16as our seminar speaker today.
- 00:18And Matthew is a professor of human genetics
- 00:21and statistics at the University of Chicago.
- 00:24And in the past, his research mainly focused
- 00:27on developing new statistical methods
- 00:29for especially genetic applications.
- 00:32Including, for example,
- 00:33genome-wide association studies, fine mapping,
- 00:36and population genetics.
- 00:38And today, he will give a talk
- 00:40on some recently developed empirical Bayes methods
- 00:43for estimation in normal means models
- 00:46that will introduce the (indistinct) properties
- 00:48such as shrinkage,
- 00:49sparsity or smoothness.
- 00:51And he will also discuss how to apply these methods
- 00:54to a range of practical applications.
- 00:59Okay, so let's wait for another minute
- 01:02and then I will hand it over to Matthew.
- 01:59So I will hand it over to Matthew from here.
- 02:02Let's welcome him.
- 02:05- Thank you very much.
- 02:05It's a great pleasure to be here
- 02:07and to get the opportunity
- 02:09to present my work to you today.
- 02:11So I guess just a little bit of background.
- 02:14So few years ago...
- 02:18Well, I guess, I've been teaching sparsity
- 02:20and shrinkage for a while,
- 02:22and it struck me that, in practice,
- 02:24people don't really use many of these ideas directly...
- 02:29At least not the empirical Bayes versions
- 02:30of these ideas
- 02:32directly in applications.
- 02:34And so, I'm wondering why that is.
- 02:37And partly, it's the lack of...
- 02:40User-friendly, convenient methods
- 02:43for applying these ideas.
- 02:44So I've been trying to think about
- 02:46how we can make these powerful ideas and methods
- 02:51more generally applicable or easily applicable
- 02:53in applications.
- 02:54These ideas have been around quite some time.
- 02:57But I think we've made some progress
- 02:59on actually just making them a bit simpler maybe
- 03:02and simpler to apply in practice,
- 03:04so I'm gonna tell you about those today.
- 03:08Oh, sorry.
- 03:09It's not advancing, let me see.
- 03:10Okay, so yeah, kind of related to that,
- 03:13the normal means problem is
- 03:15something we teach quite frequently.
- 03:18It's not hard to teach
- 03:20but it always struck me whenever I taught it
- 03:23that it looked like a kind of
- 03:24a toy model that statisticians kind of think up
- 03:28to teach students things
- 03:31but never actually use.
- 03:32And then suddenly, I had an epiphany
- 03:35and realized that it's super useful.
- 03:37And so now, I'm trying to...
- 03:38I'm not the only one but I'm trying to convince people
- 03:42that actually, this is a super useful thing
- 03:44that we should be using in practice.
- 03:46So here's the normal means model.
- 03:49The idea is that you've got
- 03:50a bunch of observations, XJ,
- 03:52that you can think of as noisy observations
- 03:54of theta J
- 03:55and they have some variance.
- 03:58I'm going to allow each variance to be different.
- 04:02The simplest version would be to assume
- 04:04that the variances are all the same
- 04:06but I'm going to allow them to be different.
- 04:10But an important point is
- 04:11that we're going to assume that the variances are known,
- 04:14which sounds a bit weird
- 04:15but in applications, we'll see that there are reasons
- 04:17why we might think that's an okay assumption
- 04:21in some applications.
- 04:22Okay, so the basic idea is
- 04:25you've got a bunch of measurements
- 04:26that are noisy measurements
- 04:27of some theta J
- 04:29and they have known variance,
- 04:31so they have known precision essentially,
- 04:33and you want to estimate the Theta Js.
- 04:36And, of course, the MLE is just
- 04:38to estimate theta J
- 04:39by its corresponding measurement, XJ.
- 04:43And really, it was a big surprise, I think.
- 04:46I wasn't around at the time
- 04:47but I believe it was a big surprise in 1956
- 04:49when Stein showed that you can do better
- 04:52than the MLE, at least in terms of
- 04:55average squared error, expected squared error.
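To make Stein's surprise concrete, here is a small simulation (an editor's sketch, not from the talk): the James-Stein estimator, which shrinks every observation toward zero by a common data-determined factor, typically beats the MLE in average squared error whenever there are three or more means.

```python
import numpy as np

rng = np.random.default_rng(0)

# Homoskedastic normal means model: x_j ~ N(theta_j, 1).
n = 500
theta = rng.normal(0.0, 2.0, size=n)   # arbitrary true means
x = theta + rng.normal(size=n)

# MLE: estimate each theta_j by its observation x_j.
mle = x

# James-Stein: shrink every observation toward zero by a common factor.
js = (1.0 - (n - 2) / np.sum(x**2)) * x

mse_mle = np.mean((mle - theta) ** 2)
mse_js = np.mean((js - theta) ** 2)
print(mse_mle, mse_js)   # the James-Stein error should come out smaller
```

The shrinkage factor here depends on all the observations at once, which is the same "borrow strength across the ensemble" idea the empirical Bayes view makes explicit below.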
- 05:01And so, there are many different ways to motivate or...
- 05:05To motivate this result.
- 05:07And I think many of them end up not being that intuitive.
- 05:11It is quite a surprising result in generality
- 05:15but I think...
- 05:16So the way I like to think about the intuition
- 05:18for why this might be true,
- 05:19it's not the only intuition
- 05:20but it's one intuition for why this might be true,
- 05:22is to bring empirical Bayes thinking to the problem.
- 05:27And so, to illustrate this idea,
- 05:30I use a well-worn device, at this point,
- 05:33which is baseball batting averages.
- 05:38Efron certainly has used this example before
- 05:40to motivate empirical Bayes ideas.
- 05:42This particular example comes from...
- 05:44The data come from this blog here,
- 05:46that I referenced at the bottom,
- 05:48which I quite like as an explanation
- 05:49of basic ideas behind empirical Bayes.
- 05:52So this histogram here shows a bunch
- 05:54of baseball batting averages
- 05:57for a particular season in 1900.
- 05:59You don't need to know very much about baseball
- 06:01to know what's going on here.
- 06:02Essentially, in baseball, you go and try and hit a ball
- 06:06and your batting average is
- 06:07what proportion of the time
- 06:08you as a bat person end up hitting the ball.
- 06:12And a good baseball batting average is around 0.3 or so.
- 06:17And in a professional baseball,
- 06:19no one's really going to have a batting average of zero
- 06:21'cause they wouldn't survive.
- 06:25But empirically, there were some individuals in this season
- 06:28who had a batting average of zero,
- 06:30that is they completely failed to hit the ball every time
- 06:33they went up to bat.
- 06:34And there were some people
- 06:36who had a batting average of above 0.4,
- 06:38which is also completely unheard of in baseball.
- 06:43Nobody has a batting average that high,
- 06:45so what's going on here?
- 06:47Well, it's a simple explanation is
- 06:49that these individuals at the tails are individuals
- 06:52who just had a few at-bats.
- 06:53They only went and attempted
- 06:55to hit the ball a small number of times.
- 06:57And so, maybe these individuals only had two at-bats
- 07:01and they missed it both times,
- 07:02they got injured or they weren't selected
- 07:04or, for whatever reason, they didn't hit
- 07:06the ball many times...
- 07:07They didn't go to at bat many times
- 07:09and so, their batting average was empirically zero.
- 07:12Think of that as the maximum likelihood estimate.
- 07:15But if you wanted to predict
- 07:17what they would do, say next season,
- 07:19if you gave them more at bats
- 07:21in the long run,
- 07:22zero would be a bad estimate for obvious reasons.
- 07:27And the same applies to these individuals up here
- 07:30with very big batting averages.
- 07:31They also had relatively few at-bats
- 07:34and they just happened to hit it above 0.4 of the time
- 07:38out of the at-bats.
- 07:39And the individuals who had lots of at-bats are
- 07:41all in the middle here.
- 07:44So these are binomial observations, basically,
- 07:45and the ones who have small N are
- 07:49more likely to be in the tails
- 07:50and the ones with big N are
- 07:51all going to be around in the middle here.
- 07:53So what would we do?
- 07:54What would we want to do
- 07:55if we wanted to estimate,
- 07:57for example, for this individual,
- 07:59their batting average for next season?
- 08:01If we were gonna predict what they were gonna get.
- 08:04Well, we would definitely want to estimate
- 08:08something closer to the average batting average than 0.4.
- 08:13That's the intuition.
- 08:14And one way to frame that problem is that...
- 08:18So sorry.
- 08:18So this is the basic idea of shrinkage.
- 08:20We would want to shrink these estimates towards,
- 08:23in this case, towards the mean.
- 08:25So how are we gonna do that?
- 08:27Well, one way to think about it is...
- 08:31Sorry, let me just...
- 08:36Sorry, just getting my slides.
- 08:37Okay, so here, the red line represents
- 08:40some underlying distribution
- 08:44of actual batting averages.
- 08:46So conceptually, some distribution
- 08:47of actual batting averages among individuals
- 08:53in this kind of league.
- 08:54So the red line, in a Bayesean point of view,
- 08:57kind of represent a sensible prior
- 09:00for any given individual's batting average
- 09:02before we saw that data.
- 09:03So think of the red line as representing
- 09:05the variation or the distribution
- 09:08of actual batting averages among players.
- 09:12And in fact, what we've done here is estimate
- 09:17that red line from the data.
- 09:21That's the empirical Bayse part
- 09:23of the empirical Bayse.
- 09:25The empirical part of empirical Bayse is that
- 09:27the red line which we're going to use
- 09:28as a prior for any given player was
- 09:30actually estimated from all the data.
- 09:33And the basic idea is
- 09:34because we know what the variance
- 09:36of a binomial distribution is,
- 09:38we can kind of estimate
- 09:40what the overall distribution
- 09:42of the underlying p's in these binomials looks like,
- 09:46taking account of the fact that the histogram is
- 09:49a noisy observations of that underlying P.
- 09:53Every bat...
- 09:54Basically, every estimated batting average is
- 09:58a noisy estimate of the true batting average
- 10:00with the noise depending on
- 10:01how many at-bats they have.
- 10:03So once we've estimated that red line,
- 10:07that prior,
- 10:08we can compute the posterior
- 10:10for each individual based on that prior.
- 10:12And when we do that,
- 10:13this is a histogram of the posterior means.
- 10:16So these are, if you like,
- 10:18shrunken estimates of the batting average
- 10:20for each individual.
- 10:21And you can see that the individuals
- 10:22who had a batting average of zero got shrunk
- 10:25all the way over somewhere here.
- 10:27And that's because their data really...
- 10:29Although, the point estimate was zero,
- 10:32they had very few at bats.
- 10:33So the information in that data are very slim,
- 10:38very little information.
- 10:40And so, the prior dominates
- 10:41when you're looking at the posterior distribution
- 10:43for these individuals.
- 10:44Whereas individuals in the middle
- 10:46who have more at-bats,
- 10:47will have estimates that are less shrunken.
- 10:52So that's gonna be a theme we'll come back to later.
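The two-step recipe just described can be sketched on simulated batting data. This is an editor's illustration: the Beta prior is fitted by a crude method of moments on the raw averages (which ignores the binomial noise the talk emphasizes, so a real maximum-likelihood fit would shrink somewhat more), but it shows the qualitative behavior.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy league: true batting averages drawn from Beta(81, 219) (mean ~0.27),
# observed through wildly different numbers of at-bats.
n_players = 500
p_true = rng.beta(81.0, 219.0, size=n_players)
at_bats = rng.choice([2, 10, 50, 500], size=n_players)
hits = rng.binomial(at_bats, p_true)
raw = hits / at_bats                      # the per-player MLEs

# Step 1: estimate a Beta(a, b) prior from all players at once
# (method of moments on the raw averages, a simple stand-in for MLE).
m, v = raw.mean(), raw.var()
common = m * (1.0 - m) / v - 1.0
a_hat, b_hat = m * common, (1.0 - m) * common

# Step 2: posterior mean for each player under the estimated prior.
shrunk = (a_hat + hits) / (a_hat + b_hat + at_bats)

# Players with few at-bats are pulled hard toward the league-wide mean;
# players with many at-bats barely move.
print(raw.min(), shrunk.min(), raw.max(), shrunk.max())
```

Note how the posterior mean is a weighted average of the raw average and the prior mean, with the weight on the data growing with the number of at-bats.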
- 10:55So how do we form...
- 10:57That's a picture.
- 10:59How do we formulate that?
- 11:00So those were binomial data,
- 11:03I'm gonna talk about normal data.
- 11:05So don't get confused by that.
- 11:08I'm just going to assume normality could do
- 11:10the same thing for a binomial,
- 11:12but I think the normals a more generally useful
- 11:16and convenient way to go.
- 11:20So here's a normal means model again
- 11:23and the idea is that
- 11:25we're going to assume that thetas come
- 11:26from some prior distribution, G,
- 11:29that was the red line in my example,
- 11:31and we're going to estimate G
- 11:33by maximum likelihood essentially.
- 11:34So we're going to use all the X's,
- 11:36integrating out theta
- 11:38to obtain a maximum likelihood estimate for G.
- 11:40That's stage one,
- 11:42that's estimating that red line.
- 11:43And then stage two is
- 11:44to compute the posterior distribution
- 11:46for each batting average,
- 11:48or whatever theta J we're interested in,
- 11:51taking into account that estimated prior
- 11:53and the data on the individual J.
- 11:56So that's the formalization of these ideas.
- 12:01And these posterior distributions are gonna be shrunk
- 12:03towards the prior, or the prior mean.
- 12:07So what kind of...
- 12:09So I guess I've left unspecified here,
- 12:12what family of priors should we consider for G?
- 12:17So a commonly used prior distribution is
- 12:21this so-called point-normal,
- 12:23or sometimes called spike and slab prior distribution.
- 12:28And these are...
- 12:28Sorry, I should say,
- 12:29I'm going to be thinking a lot about problems
- 12:31where we want to induce sparsity.
- 12:33So in baseball, we were shrinking towards the mean
- 12:37but in many applications,
- 12:38the natural point to shrink towards,
- 12:40the natural prior mean, if you like, is zero
- 12:42in situations where we expect effects
- 12:46to be sparse, for example.
- 12:47So I'm gonna be talking mostly about that situation,
- 12:50although the ideas are more general.
- 12:51And so, I'm going to be focusing
- 12:52on the sparsity inducing choices of prior family.
- 12:56And so, one commonly used one is this point normal
- 12:59where there's some mass pi zero
- 13:02exactly at zero,
- 13:04and then the rest of the mass is
- 13:06normally distributed about zero.
- 13:09So that's a commonly used one.
- 13:11In fact, it turns out,
- 13:13and this is kind of interesting I think,
- 13:15that it can be easier to do the computations
- 13:17for more general families.
- 13:20So for example,
- 13:22just take the non-parametric family
- 13:24that's the zero-centered scale mixture of normal,
- 13:27so we'll see that in a bit,
- 13:28which includes all these distributions as special cases.
- 13:31It's nonparametric.
- 13:33It includes a point-normal here.
- 13:35It also includes the T-distribution,
- 13:36the Laplace distribution,
- 13:38the horseshoe prior, if you've come across that,
- 13:40this zero-centered scale mixture of normals
- 13:42and the surprise is that it turns out
- 13:45to be easier, in some sense,
- 13:47to do the calculations for this family,
- 13:49this more general family,
- 13:50than this narrow family,
- 13:52partly because this family is convex.
- 13:53So you can think of this as a kind of a convex relaxation
- 13:57of the problem.
- 13:58So all the computations become...
- 13:59The optimizations you have to do in the simplest case
- 14:01become convex when you use this family.
- 14:05So let me say a bit more about that
- 14:07for the non-parametric.
- 14:09How do we actually do these non-parametric computations?
- 14:12Well, we actually approximate
- 14:14the non-parametric computation using a grid idea.
- 14:17So here's the idea.
- 14:20We modeled G, our prior,
- 14:22as a mixture of...
- 14:23I like to think of this K as being big.
- 14:25A large number of normal distributions.
- 14:28All of these normal distributions are centered at zero,
- 14:31that's this zero here,
- 14:32and they have a different variance.
- 14:34Some of them have very small variances,
- 14:36perhaps even one of them has a zero variance,
- 14:38so that's the point mass at zero.
- 14:39And the variance is sigma...
- 14:41Think of Sigma squared K getting gradually bigger
- 14:43until the last Sigma squared K is very big.
- 14:47So we're just gonna use a lot of them.
- 14:48Think of K as being, let's say 100
- 14:50or 1,000 for the...
- 14:52In practice, we find 20 is enough
- 14:54but just think of it as being big
- 14:57and spanning a lot of different variances,
- 14:59going from very, very small,
- 15:00to very, very big.
- 15:01And then, estimating G just comes down
- 15:05to estimating these pis,
- 15:06these mixture proportions.
- 15:09And that, then of course,
- 15:11is a finite dimensional optimization problem
- 15:13and in the normal means model,
- 15:15it's a convex...
- 15:16Well actually, for any mixture,
- 15:18it's a convex problem,
- 15:19and so there are efficient ways to find
- 15:22the MLE for pi, given the grid of variances.
- 15:27So let's just illustrate what's going on here.
- 15:30Here's a grid of just three normals.
- 15:33The one in the middle has the smallest variance,
- 15:35the one over here has the biggest variance.
- 15:37And we can get a mixture of those,
- 15:39looks like that.
- 15:40So you can see this is kind of a spiky distribution
- 15:42but also with a long tail,
- 15:44even with just a mixture of three distributions.
- 15:47And so, the idea is that you can get
- 15:49quite a flex...
- 15:51It's a flexible family
- 15:52by using a larger number of variances than three.
- 15:56You can imagine you can get distributions
- 15:59that have all sorts of spikiness
- 16:01and long-tailed behavior.
- 16:06So maybe just to fill in the details here;
- 16:09with that prior as a mixture of normals,
- 16:13the marginal distribution, P of X,
- 16:15integrating out theta is analytic
- 16:18because the sum of normals is normal.
- 16:21So if you have a normally distributed variable
- 16:23and then you have another variable
- 16:25that's a normal error on top of that,
- 16:27you get a normal.
- 16:28So the marginal is a mixture of normals
- 16:32that's very simple to work with
- 16:34and estimating pi is a convex optimization problem.
- 16:38You can do it.
- 16:39You can do an EM algorithm
- 16:40but convex methods,
- 16:41as pointed out by Koenker and Mizera,
- 16:43can be a lot more reliable and faster.
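Here is a minimal sketch of the grid idea on simulated data (an editor's illustration; it uses the simple EM update rather than the faster convex solvers of Koenker and Mizera that the talk recommends):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate sparse effects observed with known standard deviations s_j.
n = 2000
theta = np.where(rng.random(n) < 0.8, 0.0, rng.normal(0.0, 3.0, size=n))
s = rng.uniform(0.5, 1.5, size=n)
x = theta + s * rng.normal(size=n)

# A grid of prior variances, from a point mass at zero to very large.
sigma2 = np.array([0.0, 0.01, 0.1, 1.0, 4.0, 16.0, 64.0])
K = len(sigma2)

# Marginal likelihoods L[j, k] = N(x_j; 0, sigma2_k + s_j^2): analytic,
# because a normal prior plus normal noise gives a normal marginal.
var = sigma2[None, :] + (s**2)[:, None]
L = np.exp(-0.5 * x[:, None]**2 / var) / np.sqrt(2.0 * np.pi * var)

# EM for the mixture proportions pi (the only free parameters).
pi = np.full(K, 1.0 / K)
for _ in range(200):
    w = pi * L                            # shape (n, K)
    w = w / w.sum(axis=1, keepdims=True)  # posterior component memberships
    pi = w.mean(axis=0)

print(np.round(pi, 3))  # most weight should land on the small-variance components
```

Because the grid of variances is fixed, only the weights pi are estimated, which is exactly what makes the optimization convex.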
- 16:49Okay, so let's just illustrate those ideas again.
- 16:52Here's a potential prior distribution
- 16:58and here's a likelihood.
- 17:00So this is like a likelihood from a normal...
- 17:02This is an estimate...
- 17:04Think of this as a likelihood
- 17:05for theta J in a normal means model.
- 17:08So maybe XJ was one and a half or something
- 17:11and SJ was, I don't know,
- 17:12something like a half or something
- 17:14or a half squared.
- 17:17So this is meant to represent the likelihood.
- 17:19So what does the posterior look like
- 17:21when we combine this prior,
- 17:23the black line,
- 17:24with this likelihood, the red line?
- 17:26it looks like this green line here.
- 17:28So what you can see is going on here is
- 17:31that you get shrinkage towards the mean, right?
- 17:34But because the black line is long-tailed
- 17:37because of the prior in this case has a long tail,
- 17:40and because the red line...
- 17:41The likelihood lies quite a ways in the tail,
- 17:44the spiky bit at zero doesn't have very much impact
- 17:48because it's completely...
- 17:49Zero is basically inconsistent with the data
- 17:52and so the posterior looks approximately normal.
- 17:54It's actually a mixture of normals
- 17:56but it looks approximately normal 'cause the weight
- 17:58on zero is very, very small.
- 18:02Whereas if a...
- 18:04Here's a different example,
- 18:05the black line is covered
- 18:08by the green line this time because it's...
- 18:09So I plotted all three lines on the same plot here.
- 18:12The black line is...
- 18:13Think of it as pretty much the green line.
- 18:14It's still the same spiky prior
- 18:16but now the likelihood is much flatter.
- 18:18The XJ is the same.
- 18:20Actually, it's one and a half
- 18:21but we have an SJ that's much bigger.
- 18:23So what happens here is that
- 18:25the prior dominates because the likelihood's
- 18:27relatively flat,
- 18:29and so the posterior looks pretty much like the prior
- 18:31and you get very strong shrinkage.
- 18:33So think of this as corresponding
- 18:35to those individuals who had very few at-bats,
- 18:38their data are very imprecise,
- 18:41and so their posterior, the green line,
- 18:43looks very like the prior, the black line.
- 18:46Okay, so we're gonna shrink those observations more.
- 18:50So the key point here, I guess,
- 18:52is that the observations with larger standard error,
- 18:55larger SJ,
- 18:58get shrunk more.
- 19:00I should say "larger standard deviation" get shrunk more.
- 19:04Here's another intermediate example
- 19:07where the red line...
- 19:08The likelihood's kind of not quite enough.
- 19:11It illustrates the idea that the posterior could be bimodal
- 19:15because the prior and the likelihood
- 19:18have weight in different places.
- 19:20So you can get different kinds of shrinkage depending on
- 19:23how spiky the prior is,
- 19:24how long-tailed the prior is,
- 19:25how flat the likelihood is etc.
- 19:35So obviously the shrinkage,
- 19:36the amount of shrinkage you get,
- 19:37depends on the prior, G,
- 19:39which you're gonna estimate from the data.
- 19:41It also depends on the standard error
- 19:43or the standard deviation, SJ.
- 19:46And one way to summarize this kind of the behavior,
- 19:48the shrinkage behavior,
- 19:50is to focus on how the posterior mean changes with X.
- 19:55So we can define this operator here,
- 19:57S-G-S of X,
- 19:59as the posterior mean of theta J,
- 20:04given the prior and its variance or standard deviation
- 20:08and that we observed XJ is equal to X.
- 20:14I'm gonna call this the shrinkage operator
- 20:17for the prior, G,
- 20:18and standard deviation, S.
- 20:22Okay, so we could just plot
- 20:24some of these shrinkage operators.
- 20:25So the idea here is...
- 20:26Sorry, this slide has B instead of X.
- 20:30Sometimes I use B and sometimes I use X.
- 20:34I've got them mixed up here, sorry.
- 20:35So think of this as X
- 20:37and this is S of X.
- 20:39So these different lines here correspond
- 20:43to different priors.
- 20:44So the idea is that by using different priors,
- 20:47we can get different types of shrinkage behavior.
- 20:50So this prior here shrinks very strongly to zero.
- 20:54This green line shrinks very strongly to zero
- 20:57until B exceeds some value around five,
- 21:01at which point it hardly shrinks at all.
- 21:04So this is kind of a prior that has
- 21:07kind of a big spike near zero.
- 21:10But also a long tail,
- 21:11such that when you get far enough in the tail,
- 21:15you start to be convinced
- 21:17that there's a real signal here.
- 21:18So you can think of that
- 21:19as this kind of...
- 21:20This is sometimes called...
- 21:22This is local shrinkage
- 21:24and this is global.
- 21:26So you get very strong local shrinkage towards zero
- 21:29but very little shrinkage
- 21:31if the signal is strong enough.
- 21:33That kind of thing.
- 21:34But the real point here is that
- 21:35by using different priors,
- 21:37these different scale mixture of normal priors,
- 21:39you can get very different looking shrinkage behaviors.
- 21:43Ones that shrink very strongly to zero
- 21:45and then stop shrinking
- 21:46or ones that shrink a little bit all the way, etc.
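Because the posterior under a zero-centered scale mixture of normals is itself a mixture of normals, the shrinkage operator S_{g,s}(x) has a closed form. The sketch below (an editor's illustration with a hand-picked two-component prior) reproduces the behavior just described: strong shrinkage near zero, almost none far out in the tail.

```python
import numpy as np

def shrink(x, pi, sigma2, s):
    """Posterior mean S_{g,s}(x) when g is a zero-centered scale mixture of
    normals (weights pi, variances sigma2) and x ~ N(theta, s^2).
    Everything is closed-form: normal prior + normal noise stays normal."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    v = sigma2[None, :] + s**2
    # Posterior weight on each mixture component, given x.
    w = pi * np.exp(-0.5 * x[:, None]**2 / v) / np.sqrt(v)
    w = w / w.sum(axis=1, keepdims=True)
    # Each component shrinks x by the factor sigma2_k / (sigma2_k + s^2).
    return (w * (sigma2[None, :] / v) * x[:, None]).sum(axis=1)

# A spiky prior with a long tail: most weight near zero, a little very wide.
pi = np.array([0.9, 0.1])
sigma2 = np.array([0.01, 100.0])

small = shrink(1.0, pi, sigma2, s=1.0)[0]  # moderate x: shrunk hard toward 0
large = shrink(8.0, pi, sigma2, s=1.0)[0]  # far-tail x: barely shrunk
print(small, large)
```

Increasing s in the call above flattens the likelihood, so the same x gets shrunk more, matching the "few at-bats" intuition.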
- 21:52And so, if you're familiar with other ways
- 21:55of doing shrinkage analysis,
- 21:56and this is one of them,
- 21:58another way to do shrinkage
- 21:59is to use a penalized likelihood.
- 22:01Then you can try and draw a parallel
- 22:04and that's what I'm trying to do here.
- 22:05Draw a parallel between the Bayesian method
- 22:08and the penalized likelihood-based approaches
- 22:11to inducing shrinkage or sparsity.
- 22:15Another way to induce shrinkage is to essentially...
- 22:18This is the kind of normal log likelihood here
- 22:21and this is a penalty here
- 22:23that you add for this.
- 22:25This could be an L1 penalty
- 22:26or an L2 penalty or an L0 penalty,
- 22:28or some other kind of penalty.
- 22:30So there's a penalty function here.
- 22:32And you define the estimate
- 22:34as the value that minimizes
- 22:36this penalized log likelihood.
- 22:38Sorry, yeah, this is a negative log likelihood.
- 22:41Penalized least squares, I guess this would be.
- 22:45Okay, so now h is a penalty function here
- 22:49and Lambda is a tuning parameter
- 22:51that says how strong, in some sense, the penalty is.
- 22:55And these are also widely used
- 22:58to induce shrinkage,
- 22:59especially in regression contexts.
- 23:02And so, here are some commonly used shrinkage operators,
- 23:06corresponding to different penalty functions.
- 23:09So this green line is what's called
- 23:12the hard thresholding,
- 23:16which corresponds to an L0 penalty.
- 23:19If you don't know what that means, don't worry.
- 23:21But if you do, you make that connection.
- 23:25And the red line here is the L1 penalty
- 23:28or soft thresholding.
- 23:29And these two other ones here are particular instances
- 23:33of some non-convex penalties that are used
- 23:36in regression context,
- 23:37particularly in practice.
- 23:39And I guess that the point here is
- 23:42that, essentially, different prior distributions
- 23:45in the normal means model can lead
- 23:48to shrinkage operators, shrinkage behavior
- 23:51that looks kind of similar
- 23:53to each of these different types of penalty.
- 23:58So you can't actually mimic the behavior exactly.
- 24:03I've just...
- 24:05Or actually, my student, (indistinct) Kim,
- 24:07chose the priors to visually closely match these
- 24:11but you can't get...
- 24:12Some of these have kinks and stuff
- 24:13that you can't actually, formally, exactly mimic
- 24:18but you can get qualitatively similar shrinkage behavior
- 24:21from different priors
- 24:23as different penalty functions.
- 24:24So you should think about the different priors
- 24:26as being analogous to different penalty functions.
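For comparison, the soft- and hard-thresholding operators mentioned above can be written down directly as solutions of the L1- and L0-penalized least-squares problems (an editor's sketch):

```python
import numpy as np

def soft_threshold(x, lam):
    # Solves min_theta 0.5*(x - theta)**2 + lam*|theta|   (the L1 penalty).
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    # Solves min_theta 0.5*(x - theta)**2 + lam*1[theta != 0]  (the L0 penalty):
    # an observation either survives unchanged or is set exactly to zero.
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

x = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])
print(soft_threshold(x, 1.0))  # survivors are moved toward zero by lam
print(hard_threshold(x, 1.0))  # survivors are kept exactly as observed
```

The kink in the soft-threshold rule and the jump in the hard-threshold rule are exactly the features that a smooth mixture-of-normals prior can only approximate, not match exactly.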
- 24:30And so, the key...
- 24:33How does EB, empirical Bayes shrinkage,
- 24:35differ from, say, these kinds of penalty-based approaches,
- 24:40which I should say are maybe
- 24:41more widely used in practice?
- 24:44Well, so shrinkage is determined by the prior, G,
- 24:49which we estimate in an empirical Bayse context
- 24:52by maximum likelihood.
- 24:53Whereas in typical shrinkage...
- 24:57Sorry, typical penalty-based analyses,
- 25:00people use cross validation to estimate parameters.
- 25:03And the result is that cross-validation is fine
- 25:07for estimating one parameter
- 25:08but it becomes quite cumbersome
- 25:10to estimate two parameters,
- 25:12and really tricky to estimate three or four parameters
- 25:14'cause you have to go and do a grid of different values
- 25:17and do a lot of cross-validations
- 25:18and start estimating all these different parameters.
- 25:22So the point here is really
- 25:23because we estimate G by maximum likelihood,
- 25:26we can actually have a much more flexible family in practice
- 25:29that we can optimize over more easily.
- 25:33It's very flexible,
- 25:34you can mimic a range of penalty functions
- 25:36so you don't have to choose
- 25:37whether to use L1 or L2 or L0.
- 25:40You can essentially estimate
- 25:42over these non-parametric prior families.
- 25:44Think of that as kind of deciding automatically
- 25:47whether to use L0, L1, L2
- 25:49or some kind of non-convex penalty,
- 25:51or something in between.
- 25:53And the posterior distribution, of course then,
- 25:58another nice thing is that it gives not
- 25:59only the point estimates
- 26:01but, if you like, it also gives shrunken interval estimates
- 26:04which are not yielded by a penalty-based approach.
- 26:08So I guess I'm trying to say
- 26:09that there are potential advantages
- 26:11of the empirical Bayes approach
- 26:13over the penalty-based approach.
- 26:16And yeah, although I think,
- 26:19people have tried, particularly Efron has highlighted
- 26:21the potential for empirical Bayes
- 26:23to be used in practical applications,
- 26:25largely, in the practical applications
- 26:27I've seen, empirical Bayes shrinkage hasn't
- 26:29really been used very, very much.
- 26:32So that's the goal,
- 26:34is to change that.
- 26:37So before I talk about examples,
- 26:39I guess I will pause for a moment
- 26:41to see if there are any questions.
- 26:54And I can't see the chat for some reason
- 26:56so if anyone...
- 26:57So please unmute yourself
- 26:58if you have a question.
- 27:00- I don't think people are (indistinct)
- 27:04any question in the chat.
- 27:05At least, I didn't see any. - Good.
- 27:07Okay, thank you.
- 27:10It's all clear.
- 27:12I'm happy to go on but
- 27:14I just wanna...
- 27:25Okay, so we've been trying to...
- 27:27My group has been trying to think about
- 27:28how to use these ideas,
- 27:30make these ideas useful in practice
- 27:32for a range of practical applications.
- 27:35We've done work on multiple testing,
- 27:37on high-dimensional linear regression,
- 27:40and also some on matrix factorization.
- 27:43From previous experience,
- 27:44I'll probably get time to talk about the first two
- 27:45and maybe not the last one,
- 27:47but there's a pre-print on the archive.
- 27:49You can see if you're interested
- 27:50in matrix factorization.
- 27:51Maybe I'll get to talk about that briefly.
- 27:55But let me talk about multiple testing first.
- 27:59So the typical multiple testing setup,
- 28:02where you might typically, say,
- 28:05apply a Benjamini-Hochberg type procedure is
- 28:08you've got a large number of tests,
- 28:09So J equals one to N,
- 28:11and test J yields a P value, PJ,
- 28:15and then you reject all tests with
- 28:16some PJ less than a threshold gamma,
- 28:19where that threshold is chosen
- 28:21to control the FDR in a frequentist sense.
- 28:23So that's the typical setup.
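As a point of reference, the Benjamini-Hochberg step-up rule just described takes only a few lines (an editor's sketch):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of rejected tests at FDR level q (BH step-up rule)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Compare the i-th smallest p-value to q * i / n ...
    below = p[order] <= q * np.arange(1, n + 1) / n
    # ... and reject everything up to the largest i that passes.
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(n, dtype=bool)
    reject[order[:k]] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9]
print(benjamini_hochberg(pvals, q=0.05))
```

The data-dependent threshold is what the empirical Bayes treatment below replaces with an estimated prior and posterior quantities.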
- 28:26So how are we going to apply
- 28:27the normal means model to this problem?
- 28:33Okay, well, in many applications,
- 28:37not all but in many,
- 28:38the P values are derived from
- 28:40some kind of effect size estimate,
- 28:41which I'm going to call "Beta hat J,"
- 28:45which have standard errors, SJ,
- 28:47that satisfy approximately, at least,
- 28:49that Beta J hat is normally distributed
- 28:53about the true value Beta J
- 28:55with some variance given it by SJ.
- 28:58So in a lot...
- 29:01I work a lot in genetic applications.
- 29:03So in genetic applications,
- 29:05we're looking at different genes here.
- 29:07So Beta J hat might be the estimate
- 29:11of the difference in expression, let's say,
- 29:13of a gene, J, between, say, males and females.
- 29:18And Beta J would be the true difference
- 29:21at that gene.
- 29:22And you're interested in identifying
- 29:25which genes are truly different...
- 29:28Have a different mean expression
- 29:30between males and females here.
- 29:32And the reason that SJ is approximately known is
- 29:36because you've got multiple males
- 29:38and multiple females that you're using
- 29:41to estimate this difference.
- 29:43And so, you get an estimated standard error
- 29:45of that Beta hat as well.
- 29:48And so, once you've set the problem up like this,
- 29:50of course, it looks suddenly like a normal means problem
- 29:55and we can kind of apply
- 29:57the empirical Bayes normal means idea.
- 30:00We're gonna put a prior on the Beta Js
- 30:03that is sparsity inducing.
- 30:05That is, it's kind of centered at zero,
- 30:06maybe it's got a point mass at zero.
- 30:08But we're gonna estimate that prior from the data.
- 30:17And so, not only can you get posterior means for Beta,
- 30:23as I said, you can get posterior interval estimates.
- 30:25So you can kind of do things
- 30:27like compute the posterior 90% credible interval,
- 30:31given that prior and the likelihood
- 30:33for each Beta J
- 30:34and we could reject, for example,
- 30:35if the interval does not contain zero.
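That rejection rule can be sketched directly (an editor's illustration, using an assumed prior rather than one estimated from the data): the posterior for each effect under a mixture prior is again a mixture of normals, possibly with a point mass at zero, so a 90% credible interval just requires inverting that mixture CDF.

```python
import math
import numpy as np

def _mix_cdf(t, w, means, sds):
    # CDF of a mixture of normals; a point-mass component has sd ~ 0.
    return sum(wk * 0.5 * (1.0 + math.erf((t - mu) / (max(sd, 1e-12) * math.sqrt(2.0))))
               for wk, mu, sd in zip(w, means, sds))

def posterior_interval(x, s, pi, sigma2, level=0.90):
    """Credible interval for the effect given x ~ N(effect, s^2) and a
    zero-centered mixture-of-normals prior (weights pi, variances sigma2)."""
    v = sigma2 + s**2
    w = pi * np.exp(-0.5 * x**2 / v) / np.sqrt(v)  # posterior component weights
    w = w / w.sum()
    means = (sigma2 / v) * x                # per-component posterior means
    sds = np.sqrt(sigma2 * s**2 / v)        # per-component posterior sds

    def quantile(q):                        # invert the mixture CDF by bisection
        lo, hi = x - 10.0 * s - 10.0, x + 10.0 * s + 10.0
        for _ in range(100):
            mid = 0.5 * (lo + hi)
            if _mix_cdf(mid, w, means, sds) < q:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    a = (1.0 - level) / 2.0
    return quantile(a), quantile(1.0 - a)

# An assumed sparse prior: 80% point mass at zero plus a N(0, 4) slab.
pi = np.array([0.8, 0.2])
sigma2 = np.array([0.0, 4.0])

strong = posterior_interval(8.0, 1.0, pi, sigma2)  # a clear effect
weak = posterior_interval(1.0, 1.0, pi, sigma2)    # a noisy effect
print(strong, weak)  # reject only when the interval excludes zero
```

With these numbers, the strong effect's interval sits well away from zero while the weak effect's interval touches it, so only the first would be rejected.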
- 30:39And I'm not going to talk about this in detail
- 30:42because the details are in
- 30:44a biostatistics paper from 2017.
- 30:46I should say that the idea of using empirical Bayes for FDR
- 30:50actually dates back to before Benjamini and Hochberg.
- 30:54Duncan Thomas has a really nice paper
- 30:57that was pointed out to me by John Witty
- 30:59that actually contains these basic ideas
- 31:03but no nice software implementation,
- 31:07which maybe explains why it hasn't caught on
- 31:09in practice yet.
- 31:10Efron's also been a pioneer in this area.
- 31:16Okay, so I don't want to dwell on that
- 31:18because, actually, I think I'll just summarize
- 31:21what I think is true compared with Benjamini-Hochberg.
- 31:25You get a bit of an increase in power
- 31:27by using an empirical Bayes approach.
- 31:29The Benjamini-Hochberg approach is
- 31:32more robust to correlated tests though,
- 31:34so the empirical Bayes normal means model does assume
- 31:38that the tests are independent
- 31:39and, in practice, we have seen
- 31:43that correlations can cause problems.
- 31:45If you're interested in that,
- 31:46I have a pre-print with Lei Sun on my website.
- 31:49But the empirical Bayes normal means
- 31:52also provides these interval estimates,
- 31:55which is kind of nice.
- 31:56Benjamini-Hochberg does not.
- 31:58So there are some advantages
- 31:59of the empirical Bayes approach
- 32:01and maybe some disadvantages compared
- 32:04with Benjamini-Hochberg.
- 32:05But I think that the real benefit
- 32:06of the empirical Bayes approach
- 32:08actually comes when we look at multi-variate extensions
- 32:11of this idea.
- 32:12So I just wanted to briefly highlight those
- 32:14and spend some time on those.
- 32:16So here's the multi-variate version
- 32:19of the empirical Bayes normal means model.
- 32:24And now, each Beta J is a vector of observations.
- 32:29So I think of this as measuring, say,
- 32:32gene J in multiple different tissues.
- 32:34Think of different tissues.
- 32:36You look at heart,
- 32:37you look at lung,
- 32:38you look at brain,
- 32:39you look at the spleen.
- 32:42In fact, we've got 50 different tissues in the example
- 32:46I'm gonna show in a minute.
- 32:48So we've measured some kind of effect
- 32:52in each gene, in each of these 50 different tissues
- 32:56and we want to know where the effects are...
- 33:01Which genes show effects in which tissues.
- 33:04So Beta J is now a vector of length R,
- 33:08the number of tissues.
- 33:09R is 50 in our example.
- 33:11And so you've got...
- 33:12We're gonna assume that the estimates are
- 33:16normally distributed with mean,
- 33:17the true values and some variance,
- 33:19covariance matrix now,
- 33:21which we're going to assume, for now, is known.
- 33:23That's actually a little trickier
- 33:26but I'm gonna gloss over that for...
- 33:28If you want to see details,
- 33:29take a look at the paper.
- 33:31I just wanna get the essence of the idea across.
- 33:33We're still going to assume that Beta J comes
- 33:35from some prior, G,
- 33:36and we're still gonna use a mixture of normals,
- 33:38but now we're using a mixture
- 33:39of multi-variate normals.
- 33:41And unlike the univariate case,
- 33:43we can't use a grid of...
- 33:45We can't use a grid of values
- 33:47that span all possible covariance matrices here.
- 33:50It's just too much.
- 33:52So we have to do something to estimate
- 33:53these covariance matrices,
- 33:55as well as estimate the pis.
- 33:57And again, if you want to see the details,
- 34:00take a look at Urbut et al.
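The multivariate posterior computation just described can be sketched as follows, assuming the mixture weights pis and covariance matrices Us have already been estimated. This is an illustration of the model only, not the actual mashr implementation; the function name is hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def multivariate_posterior_mean(beta_hat, V, pis, Us):
    """Posterior mean of beta_j under the prior sum_k pi_k N(0, U_k),
    given beta_hat_j ~ N(beta_j, V). Us is a list of R x R covariance
    matrices; U = 0 gives the null component."""
    R = len(beta_hat)
    # Responsibility of component k: pi_k * N(beta_hat; 0, U_k + V).
    liks = np.array([multivariate_normal.pdf(beta_hat, np.zeros(R), U + V)
                     for U in Us])
    resp = np.asarray(pis) * liks
    resp /= resp.sum()
    # Mix the component-wise posterior means U (U + V)^{-1} beta_hat.
    post = np.zeros(R)
    for w, U in zip(resp, Us):
        post += w * (U @ np.linalg.solve(U + V, beta_hat))
    return post
```

With a "shared" component in Us (a constant, highly correlated matrix), an observation that is large in every tissue borrows strength across tissues, while an observation near zero gets shrunk strongly toward the null.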
- 34:03But let me just illustrate
- 34:04the idea of what's going on here,
- 34:06or what happens when you apply this method
- 34:08to some data.
- 34:10So this is...
- 34:10I said 50,
- 34:11we have 44 tissues in this particular example.
- 34:14So each row here is a tissue.
- 34:17These yellow ones here are brain tissues,
- 34:20different brain tissues,
- 34:22and I think we'll see one later that's blood.
- 34:25I think this one might be blood.
- 34:26Anyway, each one is a tissue;
- 34:28lung, blood, etc.
- 34:29You don't need to know which ones are which, for now.
- 34:32And so, what we've done here is plot
- 34:34the Beta hat and plus or minus two standard deviations
- 34:38for each tissue at a particular...
- 34:41In this case, a particular SNP, actually (indistinct).
- 34:44So this is an eQTL analysis,
- 34:46for those of you who know what that means.
- 34:48If you don't, don't worry about it.
- 34:50Just think of it as having an estimated effect
- 34:53plus or minus two standard deviations
- 34:55in 44 different tissues,
- 34:58and we want to know which ones are quote,
- 35:01"significantly different from zero."
- 35:05And so what happens...
- 35:11Didn't expect that to happen.
- 35:17Sorry, okay.
- 35:17Yeah, these are just...
- 35:18These are just two examples.
- 35:20So this is one example,
- 35:22here's another example
- 35:23where we've done the same thing.
- 35:25Estimated effects, plus or minus two standard deviations.
- 35:28So what you can see in this first one is
- 35:30that it looks like, at least,
- 35:31that the brain tissues have some kind of effect.
- 35:33That's what you're supposed to see here.
- 35:36And maybe there are some effects in other tissues.
- 35:38There's a tendency for effects to be positive,
- 35:40which might suggest that maybe everything has
- 35:44a small effect everywhere,
- 35:45but particularly strong in the brain.
- 35:47And whereas in this example,
- 35:50this one appears to have an effect
- 35:52in just one tissue.
- 35:53This is the blood actually.
- 35:54So this is an effect in blood
- 35:56but mostly, it doesn't look like
- 35:58there's an effect in other tissues.
- 36:00But these, just to emphasize,
- 36:01these are the raw data,
- 36:03in the sense that they're the Beta hats
- 36:04and the standard errors.
- 36:05No shrinkage has occurred yet.
- 36:07But the idea is that the empirical Bayes approach takes
- 36:11all these examples,
- 36:12examples like this and examples like this,
- 36:14to learn about what kinds of patterns are present
- 36:17in the data.
- 36:18That is, "What does G look like?"
- 36:19So it learns from these examples
- 36:21that there are some effects that look like
- 36:24they're shared among the brain tissues,
- 36:27and there are some effects that are...
- 36:28These are actually somewhat rare
- 36:31but rarely, there's an effect that's specific
- 36:33to one tissue like, in this case, blood.
- 36:35And it also learns, in this case actually,
- 36:39that there's a lot of null things,
- 36:41because there are a lot of null things as well.
- 36:44So it puts lots of mass on the null as well
- 36:46and that causes the shrinkage.
- 36:48And then, having estimated those patterns from the data,
- 36:51it computes posteriors.
- 36:53And so, here's the data and then the posterior intervals
- 36:58for the same...
- 36:59For that first example.
- 37:00And what you can see is that because of
- 37:02the combining information across tissues,
- 37:06you get standard errors that are getting smaller,
- 37:09the brain estimates all get shrunk towards one another,
- 37:12and all these...
- 37:14There's some borrowing strength of information,
- 37:16borrowing information across these tissues,
- 37:18to make these look like
- 37:20some of them are kind of borderline significant.
- 37:22Now, it looks like there's probably
- 37:23an effect in every tissue
- 37:25but a much stronger effect in brain.
- 37:26Whereas this example here,
- 37:29it recognizes that this looks like an effect
- 37:32that's specific to blood.
- 37:33And so, it shrinks everything else strongly towards zero
- 37:36because it knows that most things are null,
- 37:38it's learned that from the data,
- 37:40but the blood estimate gets hardly shrunk at all.
- 37:43We saw that kind of behavior where things
- 37:44that are near zero can get shrunk towards zero,
- 37:47whereas other things that are far
- 37:48away don't get shrunk as much.
- 37:50And it's really hard to do that kind of thing
- 37:53without doing some kind of model-based analysis.
- 37:57Doing a Benjamini-Hochberg type of non-model-based analysis,
- 38:01without making any assumptions
- 38:03or making minimal assumptions,
- 38:06it's very hard to capture this kind of thing, I think.
- 38:09So I think the empirical Bayes approach
- 38:11has big advantages in this setting.
- 38:17I'll pause before I talk about regression.
- 38:20Any questions there?
- 38:24- So Matthew, I have some basic questions.
- 38:26So in your multivariate multiple testing case,
- 38:30I guess for each of the plots,
- 38:32you are looking at maybe a particular gene's influence
- 38:35on some... - Good, yeah.
- 38:37Sorry, I did skip over it a bit.
- 38:39So these are eQTLs.
- 38:40So actually, what I'm plotting here is each...
- 38:43This is a single SNP associated
- 38:46with a single gene.
- 38:48And this is asking,
- 38:49"How associated is this SNP
- 38:52"with this gene's expression level
- 38:54"in the different brain tissues,
- 38:56"in the blood tissue, in lung and spleen, etc.?"
- 39:02The idea is that...
- 39:04What the scientific goal is to understand
- 39:07which genetic variants are impacting gene expression
- 39:10in different tissues,
- 39:12which might tell us something
- 39:13about the biology of the tissues
- 39:15and the regulation going on in the different tissues.
- 39:18- Got it.
- 39:19So in this case,
- 39:20I don't think I fully understand
- 39:23why it's multivariate multiple tests, not univariate
- 39:27because you are looking at each gene
- 39:28versus each SNP.
- 39:32- Right, so sorry.
- 39:33Think of J indexing eQTL.
- 39:36So we've got 2 million potential eQTLs,
- 39:39so that's the multiple part of it.
- 39:41For 2 million potential eQTLs, that's...
- 39:45And then each eQTL has data on 44 tissues,
- 39:50so that's the multi-variate part of it.
- 39:52(speaking over each other)
- 39:53If you thought about it
- 39:54in terms of say P values or maybe Z scores,
- 39:57you have a matrix of Z scores.
- 39:59There are two million rows
- 40:01and there are 44 columns
- 40:03and you have a Z score or a P value
- 40:05for each element in that matrix,
- 40:09and what we're assuming is that
- 40:11the rows are independent,
- 40:13which is not quite true but still,
- 40:15we're assuming that the rows are independent
- 40:17and the columns,
- 40:18we're assuming that they can be correlated.
- 40:20And in particular,
- 40:21we're assuming that the...
- 40:22Well, we're assuming that both
- 40:24the measurements can be correlated,
- 40:26so it's V,
- 40:27but also that the effects can be correlated.
- 40:30So that's to capture the idea
- 40:31that there might be some effects
- 40:33that are shared between say brain tissues--
- 40:36- I see.
- 40:37I see.
- 40:38So this multivariate is different
- 40:40from our usual notion, where multivariate
- 40:43means multivariate over SNPs.
- 40:44So here, it's multivariate over tissues.
- 40:46I guess, are the samples from the same cohort?
- 40:49- Yeah, so in this particular case,
- 40:52the samples are from the same individuals.
- 40:55So these different brain tissues...
- 40:56There's overlap anyway, let's say.
- 40:57And so, that's what causes this...
- 41:00That causes headaches, actually, for this--
- 41:02- Okay, got it, thanks. - Yeah.
- 41:04Just to emphasize,
- 41:05it doesn't have to be different tissues.
- 41:07The whole method works
- 41:08on any matrix of Z scores, basically.
- 41:11As long as you think that the rows correspond
- 41:14to different tests
- 41:15and the columns correspond
- 41:16to different, say, conditions for the same test.
- 41:21So examples might be
- 41:24you're looking at the same SNP
- 41:25across lots of different phenotypes,
- 41:27so looking at schizophrenia,
- 41:29looking at bipolar,
- 41:31looking at different diseases
- 41:33or different traits,
- 41:35and you can have a Beta hat for that SNP
- 41:37and a standard error for that SNP
- 41:39in every trait.
- 41:40And you could try to learn,
- 41:42"Oh look, there are some traits
- 41:43"that tend to share effects
- 41:45"and other traits that don't,"
- 41:46or, often in experiments,
- 41:49people treat their samples
- 41:51with different treatments.
- 41:52They challenge them with different viruses.
- 41:54They look to see which things are being changed
- 41:58when you challenge a cell with different viruses
- 42:00or different heat shock treatments
- 42:02or any kind of different treatment.
- 42:03So yeah, basically, the idea is very generic.
- 42:06The idea is if you've got a matrix of Z scores
- 42:09where the effects, say, look likely to be shared
- 42:12among columns sometimes
- 42:13and the rows are gonna be approximately independent,
- 42:17or at least you're willing to assume that,
- 42:20then you can apply the method.
- 42:21- Okay, got it, thanks.
- 42:22- So, actually, that's an important kind of...
- 42:25Also, something that I've been thinking about a lot is
- 42:28the benefits of modular or generic methods.
- 42:34So if you think about what methods are applied
- 42:36in statistics a lot,
- 42:37you think T-test, linear regression.
- 42:39These are all kind of very generic ideas.
- 42:42They don't...
- 42:44And Benjamini-Hochberg.
- 42:45The nice thing about Benjamini-Hochberg is
- 42:47you just need a set of P values
- 42:48and you can apply Benjamini-Hochberg.
- 42:50You don't have to worry too much
- 42:52about where those P values came from.
- 42:54So I think, for applications,
- 42:56it's really useful to try to think about
- 42:58what's the simplest type of data
- 43:01you could imagine inputting into the procedure
- 43:04in order to output something useful?
- 43:06And sometimes, that involves making compromises
- 43:09because to make a procedure generic enough,
- 43:12you have to compromise on what...
- 43:14On maybe the details of what's going in.
- 43:17So here, what we've compromised on is
- 43:18that we take a matrix of Z scores,
- 43:20or potentially Beta hats and their standard errors,
- 43:23we can do either,
- 43:24and that's the input.
- 43:26So that makes it relatively generic.
- 43:28You don't have to worry too much
- 43:30about whether those Beta hats
- 43:31and the standard errors, or the Z scores,
- 43:33are coming from logistic regression
- 43:35or linear regression,
- 43:36or whether they're controlling for some covariates
- 43:38or all sorts of...
- 43:39From a mixed model, etc.
- 43:41As long as they have the basic property that
- 43:44the Beta hat is normally distributed
- 43:47about the true Beta
- 43:48with some variance that you are willing to estimate,
- 43:51then you can go.
- 43:58- Sorry, (indistinct).
- 44:01A short question.
- 44:02So in practice, how do you choose...
- 44:05How many mix...
- 44:06How many distributions do you want in the mixture,
- 44:08like the (indistinct)? - (indistinct)
- 44:10Yeah, so great question.
- 44:12And my answer, generally,
- 44:14is just use as many as you want.
- 44:16So as many as you can stomach.
- 44:19The more you use, the slower it is.
- 44:25And so, you might worry about over-fitting,
- 44:27but it turns out that these procedures are
- 44:29very robust to over-fitting
- 44:30because of the fact that the mean is fixed at zero.
- 44:34So all the components have a mean zero
- 44:37and have some covariance
- 44:39and because of that,
- 44:40they have limited flexibility to overfit.
- 44:45They're just not that flexible.
- 44:48And in the univariate case,
- 44:50that's even more obvious, I think.
- 44:51That in the univariate case,
- 44:53every one of those distributions,
- 44:55any mixture of normals that are
- 44:57all centered at zero is unimodal at zero
- 45:01and has limited...
- 45:02Can't have wiggly distributions
- 45:04that are very spiky and overfit.
- 45:06So these methods are relatively immune
- 45:09to overfitting in practice.
- 45:12If you're worried about that,
- 45:13you can do a test-train type thing
- 45:15where you use half your tests to train,
- 45:18and then you look at the log likelihood
- 45:20out of sample on others,
- 45:22and then tweak the number to avoid overfitting.
- 45:27And we did do that early on in the methods
- 45:30but we don't do it very often now,
- 45:32or we only do it now when we're worried
- 45:34'cause generally it seems like overfitting doesn't seem
- 45:37to be a problem,
- 45:38but if we see results are a little bit weird
- 45:39or a bit concerning,
- 45:41we try it to make sure we're not overfitting.
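The test-train check just mentioned can be sketched like so: fit the mixture weights on half the tests by EM, then score the held-out half by its marginal log-likelihood. This is a simplified univariate sketch with a fixed grid of component standard deviations, not the actual ashr code; the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def fit_mixture_weights(beta_hat, s, sigmas, n_em=200):
    """EM over the mixture weights pi only, for the prior
    sum_k pi_k N(0, sigma_k^2) on a fixed grid (sigma = 0 is the null)."""
    sig = np.asarray(sigmas)
    # L[j, k] = marginal likelihood N(beta_hat_j; 0, s_j^2 + sigma_k^2).
    L = norm.pdf(beta_hat[:, None], 0.0,
                 np.sqrt(s[:, None] ** 2 + sig[None, :] ** 2))
    pis = np.full(len(sig), 1.0 / len(sig))
    for _ in range(n_em):
        resp = pis * L
        resp /= resp.sum(axis=1, keepdims=True)   # E-step: responsibilities
        pis = resp.mean(axis=0)                   # M-step: update weights
    return pis

def heldout_loglik(beta_hat, s, sigmas, pis):
    """Out-of-sample marginal log-likelihood under a fitted prior."""
    sig = np.asarray(sigmas)
    L = norm.pdf(beta_hat[:, None], 0.0,
                 np.sqrt(s[:, None] ** 2 + sig[None, :] ** 2))
    return np.log(L @ np.asarray(pis)).sum()
```

Comparing held-out log-likelihoods across different grids is then a direct way to check whether adding components is helping or overfitting.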
- 45:46- Okay, thank you.
- 45:47- I should say that, in the paper,
- 45:49we kind of outlined some procedures we use
- 45:51for estimating these variance-covariance matrices
- 45:54but they're not like...
- 45:55They're kind of like...
- 45:57The whole philosophy is that we could probably do better
- 46:01and we're continuing to try and work
- 46:03on better methods for estimating this
- 46:05as we go forward.
- 46:07So we're continually improving
- 46:08the ways we can estimate this.
- 46:16Okay, so briefly, I'll talk about linear regression.
- 46:19So here's your standard linear regression where,
- 46:21so we've got N observations,
- 46:23X is the matrix of covariates here,
- 46:25B are the regression coefficients.
- 46:27I'm kind of thinking of P as being big, potentially here.
- 46:32And the errors normal.
- 46:33And so, the empirical Bayes idea would be
- 46:36to assume that the Bs come from
- 46:38some prior distribution, G,
- 46:40which comes from some family, curly G.
- 46:42And what we'd like to do is estimate G
- 46:45and then shrink the estimates of B,
- 46:49using empirical Bayes type ideas
- 46:52and posterior computations.
- 46:55But it's not a simple normal means model here,
- 46:58so how do we end up applying
- 47:00the empirical Bayse methods to this problem?
- 47:04Well, let's just...
- 47:05I'm gonna explain our algorithm
- 47:07by analogy with penalized regression algorithms
- 47:11because the algorithm ends up looking very similar,
- 47:13and then I'll tell you
- 47:14what the algorithm is actually kind of doing.
- 47:16So a penalized regression would solve this problem.
- 47:19So if you've seen the Lasso before...
- 47:22I hope many of you might have.
- 47:23If you've seen the Lasso before,
- 47:24this would be solving this problem
- 47:26with H being the L1 penalty,
- 47:29absolute value of B, right?
- 47:31So this...
- 47:32So what algorithm...
- 47:34There are many, many algorithms to solve this problem
- 47:37but a very simple one is coordinate ascent.
- 47:40So essentially, for each coordinate...
- 47:42it just iterates the following.
- 47:43For each coordinate,
- 47:44you have some kind of current estimate for Bs.
- 47:48So what you do here is you form the residuals
- 47:50by taking away the effects of all the Bs
- 47:55except the one you're trying to update,
- 47:57the one you're trying to estimate.
- 47:59So X minus J here is all the covariates
- 48:01except covariate J.
- 48:02B minus J is all the corresponding coefficients.
- 48:06So this is the residual.
- 48:08RJ is the residual,
- 48:09after removing all the current estimated effects
- 48:12apart from the Jth one.
- 48:14And then you basically compute an estimate
- 48:18of the Jth effect by regressing those residuals on XJ.
- 48:24And then you shrink that using a shrinkage operator
- 48:29that we saw earlier.
- 48:31Just to remind you
- 48:33that a shrinkage operator is the one
- 48:34that minimizes this penalized least squares problem.
- 48:38And it turns out,
- 48:39it's not hard to show that this is coordinate ascent
- 48:45for minimizing this, penalized objective function.
- 48:49And so every iteration of this decreases
- 48:53that penalized objective function
- 48:55that we're minimizing.
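The coordinate ascent loop just described, with the L1 (Lasso) shrinkage operator, looks roughly like this. It is a plain-numpy sketch for illustration, not a production Lasso solver.

```python
import numpy as np

def soft_threshold(b, t):
    """The shrinkage operator for the L1 penalty."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2)||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    xtx = (X ** 2).sum(axis=0)   # X_j' X_j for each column
    for _ in range(n_iter):
        for j in range(p):
            # Residual with the j-th covariate's contribution put back.
            r_j = y - X @ b + X[:, j] * b[j]
            b_ols = X[:, j] @ r_j / xtx[j]   # regress residual on X_j
            b[j] = soft_threshold(b_ols, lam / xtx[j])
        # Each full sweep decreases the penalized objective.
    return b
```

On data with a few large true effects and the rest zero, the soft-threshold step sets most of the null coordinates exactly to zero.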
- 49:00Okay, so it turns...
- 49:01So our algorithm looks very similar.
- 49:03You still compute the residuals,
- 49:06you compute a Beta hat
- 49:08by regressing the residuals on XJ.
- 49:10You also, at the same time,
- 49:11compute a standard error,
- 49:12which is familiar form.
- 49:17And then you, instead of shrinking using
- 49:21that penalized regression operator,
- 49:23you use a...
- 49:25Sorry, I should say,
- 49:27this is assuming G is known.
- 49:28I'm starting with G.
- 49:29G is known.
- 49:31So you can shrink...
- 49:32Instead of using the penalty-based method,
- 49:35you use the posterior mean shrinkage operator here
- 49:38that I introduced earlier.
- 49:40So it's basically exactly the same algorithm,
- 49:41except replacing this penalty-based shrinkage operator
- 49:47with an empirical Bayes
- 49:48or a Bayesian shrinkage operator.
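Swapping that shrinkage operator into the same loop gives the empirical Bayes version. Here is a sketch with G taken to be a known point-normal prior, matching the "G is known" assumption at this point in the talk; the function and argument names are illustrative, and the full method (which also estimates G) is more involved.

```python
import numpy as np
from scipy.stats import norm

def posterior_mean_shrink(b_hat, s, pi0, sigma):
    """Posterior mean of beta given b_hat ~ N(beta, s^2) and the
    point-normal prior pi0*delta_0 + (1-pi0)*N(0, sigma^2)."""
    lik_null = norm.pdf(b_hat, 0.0, s)
    lik_slab = norm.pdf(b_hat, 0.0, np.hypot(s, sigma))
    w1 = (1 - pi0) * lik_slab / (pi0 * lik_null + (1 - pi0) * lik_slab)
    return w1 * b_hat * sigma**2 / (s**2 + sigma**2)

def eb_coordinate_ascent(X, y, pi0, sigma, resid_sd=1.0, n_iter=100):
    """Same coordinate updates as the penalized version, but with the
    Bayesian posterior-mean shrinkage operator (prior treated as known)."""
    n, p = X.shape
    b = np.zeros(p)
    xtx = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # residual without covariate j
            b_hat = X[:, j] @ r_j / xtx[j]     # univariate OLS estimate
            s_j = resid_sd / np.sqrt(xtx[j])   # its standard error
            b[j] = posterior_mean_shrink(b_hat, s_j, pi0, sigma)
    return b
```

Unlike soft-thresholding, the posterior mean never sets a coordinate exactly to zero, but null coordinates are shrunk very close to zero while large effects are left almost untouched.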
- 49:54And so, you could ask what that's doing
- 49:56and it turns out that what it's doing is
- 49:57it's minimizing the Kullback-Leibler Divergence
- 50:01between some approximate posterior, Q,
- 50:05and the true posterior, P, here
- 50:08under the constraint that this Q is factorized.
- 50:13So this is what's called
- 50:15a variational approximation
- 50:17or a mean-field,
- 50:17or fully factorized variational approximation.
- 50:21If you've seen that before,
- 50:22you'll know what's going on here.
- 50:24If you haven't seen it before,
- 50:25it's trying to find an approximation
- 50:27to the posterior.
- 50:29This is the true posterior,
- 50:30it's trying to find an approximation
- 50:31to that posterior that minimizes
- 50:33the Kullback-Leibler Divergence
- 50:35between the approximation
- 50:36and the true value, under a simplifying assumption
- 50:40that the posterior factorizes,
- 50:41which, of course, it doesn't,
- 50:42so that's why it's an approximation.
- 50:44So that algorithm I just said is
- 50:46a coordinate ascent algorithm
- 50:48for maximizing F or minimizing the KL divergence.
- 50:53So every iteration of that algorithm gets
- 50:55a better estimate
- 50:58of the posterior, essentially.
- 51:02Just to outline
- 51:03and just to give you the intuition
- 51:06for how you could maybe estimate G,
- 51:08this isn't actually quite what we do
- 51:10so the details get a bit more complicated,
- 51:13but just to give you an intuition
- 51:14for how you might think that you can estimate G;
- 51:18Every iteration of this algorithm computes a B hat
- 51:21and a corresponding standard error,
- 51:23so you could imagine...
- 51:25These two steps here, you could imagine storing these
- 51:28through the iterations
- 51:29and, at the end,
- 51:31you could apply the empirical Bayes normal means procedure
- 51:35to estimate G from these B hats
- 51:37and standard errors,
- 51:38and something close to that kind of works.
- 51:43The details are a bit more complicated than that.
- 51:47So let me give you some kind of intuition
- 51:51for what we're trying to achieve here based
- 51:54on simulation results.
- 51:55So these are some simulations we've done.
- 51:57The covariates are all independent here.
- 52:00The true prior is a point normal,
- 52:02that means that most of the effects are zero.
- 52:07Well, actually maybe here,
- 52:09one of the effects is nonzero,
- 52:11five of the effects are nonzero,
- 52:1350 of the effects are nonzero
- 52:15and 500 of the effects are nonzero.
- 52:17And actually, there are 500 effects,
- 52:18500 variables in this example.
- 52:23So the X-axis here just shows the number
- 52:25of non-zero coordinates
- 52:27and the results I've shown here are the prediction error,
- 52:30so we're focusing on prediction error,
- 52:32the out of sample prediction error,
- 52:33using three different penalty-based approaches.
- 52:37The Lasso, which is this line,
- 52:40the L0Learn, which is this line,
- 52:43which is L0 zero penalty,
- 52:45and Ridge, which is this penalty,
- 52:47the L2 penalty.
- 52:48So the important thing to know is that
- 52:51the L0 penalty is really designed, if you like,
- 52:55to do well under very sparse models.
- 52:58So that's why it's got the lowest prediction error
- 53:03when the model is very sparse,
- 53:05but when the model is completely dense,
- 53:07it does very poorly.
- 53:08Whereas Ridge is designed much more to...
- 53:14It's actually based on a prior
- 53:15that the effects are normally distributed.
- 53:17So it's much better at dense models
- 53:19than sparse models.
- 53:20And you can see that at least relative to L0Learn,
- 53:23Ridge is much better for the dense case
- 53:26but also much worse for the sparse case.
- 53:30And then Lasso has some kind of ability
- 53:32to deal with both scenarios,
- 53:35but it's not quite as good as the L0 penalty
- 53:37when things are very sparse,
- 53:39and it's not quite as good as the Ridge penalty
- 53:41when things are very dense.
- 53:45So our goal is that by learning the prior G
- 53:49from the data,
- 53:50we can adapt to each of these scenarios
- 53:53and get performance close to L0Learn
- 53:55when the truth is sparse
- 53:58and get performance close to Ridge regression
- 54:01when the truth is dense.
- 54:03And so, the red line here actually shows
- 54:06the performance of our method.
- 54:09And you can see, indeed,
- 54:10we do even slightly better than L0Learn
- 54:13in this part here
- 54:14and slightly better than cross-validated Ridge regression
- 54:18in this here.
- 54:19The difference between these two is just that
- 54:21the Ridge regression is doing cross-validation
- 54:23to estimate the tuning parameter
- 54:25and we're using empirical Bayes maximum likelihood
- 54:28to estimate it.
- 54:28So that's just that difference there.
- 54:30And the Oracle here is using the true...
- 54:32You can do the Oracle computation
- 54:35for the Ridge regression
- 54:36with the true tuning parameter here.
- 54:38I should say that maybe this...
- 54:42Maybe I should just show you the results.
- 54:45So here is a bunch of other penalties,
- 54:47including elastic net, for example, you might wonder,
- 54:49which is kind of a compromise
- 54:51between L1 and L2.
- 54:52And you can see, it does do the compromising
- 54:56but it doesn't do as well
- 54:57as the empirical Bayes approach.
- 54:59And here are some other non-convex methods
- 55:01that are more, again...
- 55:04They're kind of more tuned to the sparse case
- 55:06than to the dense case.
- 55:09As promised, I'm gonna skip over
- 55:12the matrix factorization
- 55:14and just summarize to give time for questions.
- 55:18So the summary is that
- 55:21the empirical Bayes normal means model provides
- 55:23a flexible and convenient way
- 55:25to induce shrinkage and sparsity
- 55:26in a range of applications.
- 55:28And we've been spending a lot of time trying
- 55:32to apply these methods
- 55:33and provide software to do
- 55:35some of these different things.
- 55:36And there's a bunch of things on my publications page
- 55:39and if you're interested in...
- 55:40If you can't find what you're looking for,
- 55:41just let me know, I'd be happy to point you to it.
- 55:44Thanks very much.
- 55:47- Thanks Matthew, that's a great talk.
- 55:49I wonder whether the audience has
- 55:51any questions for Matthew.
- 55:57So I do have some questions for you.
- 56:00So I think I really like the idea
- 56:02of applying empirical Bayes to a lot of applications
- 56:06and it really seems empirical Bayes has had great success.
- 56:10But I do have a question or some doubt
- 56:13about the inference part,
- 56:15especially in that linear regression model.
- 56:18So currently, for the current work you have been doing,
- 56:22you essentially shrink each
- 56:24of the coefficients based on, essentially,
- 56:26their estimated value,
- 56:28but in some applications,
- 56:31such as a GWAS study or fine mapping,
- 56:33different SNPs can have
- 56:35very different LD structure.
- 56:38So in this case,
- 56:39how much we can trust the inference,
- 56:43the P value, from this (indistinct)?
- 56:48- So, great question.
- 56:51So let me just first...
- 56:55First emphasize that the shrink...
- 56:57The estimate here is being done removing the effects,
- 57:01or the estimated effects,
- 57:03of all the other variables.
- 57:04So each iteration of this,
- 57:06when you're estimating the effect
- 57:07of SNP J, in your example,
- 57:10you're taking the estimated effects
- 57:11of the other variables into account.
- 57:14So the LD structure, as you mentioned,
- 57:18that's the correlation structure
- 57:20for those who don't know,
- 57:21between the Xs is formally taken into account.
- 57:24However, there is a problem with this approach
- 57:26for very highly correlated variables.
- 57:30So let's just suppose there are two variables
- 57:34that are completely correlated,
- 57:35what does this algorithm end up doing?
- 57:38It ends up basically choosing one of them
- 57:40and ignoring the other.
- 57:44The Lasso does the same in fact.
- 57:46So it ends up choosing one of them
- 57:50and ignoring the other.
- 57:52And if you look to the posterior distribution
- 57:55on its effect,
- 57:57it would be far too confident
- 57:58in the size of the effect
- 58:00because it would assume that the other one had zero effect.
- 58:04And so it would have a small credible
- 58:06interval, let's say, around the effect size
- 58:08when, really, it should be saying
- 58:09you don't know which one to include.
- 58:11And so, we've worked recently
- 58:14on a method for doing that.
- 58:17A different method,
- 58:19different work than what I've just described here
- 58:22for doing fine mapping using variational approximations
- 58:25that don't have this problem,
- 58:27and it's on my webpage.
- 58:29It's Wang et al in JRSS-B,
- 58:34just recently, this year.
- 58:362021, I guess. - That's awesome.
- 58:39Thanks, so any more question for Matthew from the audience?
- 58:47Okay, I think we're running out of time also.
- 58:51So if you have any question
- 58:53about the stuff to (indistinct)
- 58:54you want to use,
- 58:55I think you can contact either
- 58:56the authors of the paper or Matthew offline.
- 59:00And thank you again for agreeing to present your work here.
- 59:04It looks really useful and interesting.
- 59:07- Thank you for having me.