# BIS Seminar: Adventures in sparsity and shrinkage with the normal means model

March 03, 2021

## Information

Matthew Stephens, PhD, Ralph W. Gerard Professor, Departments of Statistics and Human Genetics, and the College

The University of Chicago

Biostatistics Seminar

March 2, 2021


- 00:04- So, hi everyone.
- 00:06Since we're still waiting for people to join,
- 00:08I will first give a brief introduction
- 00:11to Matthew here.
- 00:13First, it's my honor to introduce Dr. Matthew Stephens
- 00:16as our seminar speaker today.
- 00:18And Matthew is a professor of human genetics
- 00:21and statistics at the University of Chicago.
- 00:24And in the past, his research mainly focused
- 00:27on developing new statistical methods
- 00:29especially for genetic applications.
- 00:32Including, for example,
- 00:33GWAS association studies and fine mapping
- 00:36and population genetic variants.
- 00:38And today, he will give a talk
- 00:40on some recently developed empirical Bayes methods
- 00:43for estimation in normal means models
- 00:46that will induce the (indistinct) properties
- 00:48such as shrinkage,
- 00:49sparsity or smoothness.
- 00:51And he will also discuss how to apply these methods
- 00:54to a range of practical applications.
- 00:59Okay, so let's wait for another minute
- 01:02and then I will hand it over to Matthew.
- 01:59So I will hand it over to Matthew from here.
- 02:02Let's welcome him.
- 02:05- Thank you very much.
- 02:05It's a great pleasure to be here
- 02:07and to get the opportunity
- 02:09to present my work to you today.
- 02:11So I guess just a little bit of background.
- 02:14So a few years ago...
- 02:18Well, I guess, I've been teaching sparsity
- 02:20and shrinkage for a while,
- 02:22and it struck me that, in practice,
- 02:24people don't really use many of these ideas directly...
- 02:29At least not the empirical Bayes versions
- 02:30of these ideas
- 02:32directly in applications.
- 02:34And so, I'm wondering why that is.
- 02:37And partly, it's the lack of...
- 02:40User-friendly, convenient methods
- 02:43for applying these ideas.
- 02:44So I've been trying to think about
- 02:46how we can make these powerful ideas and methods
- 02:51more generally applicable or easily applicable
- 02:53in applications.
- 02:54These ideas have been around for quite some time.
- 02:57But I think we've made some progress
- 02:59on actually just making them a bit simpler maybe
- 03:02and simpler to apply in practice,
- 03:04so I'm gonna tell you about those today.
- 03:08Oh, sorry.
- 03:09It's not advancing, let me see.
- 03:10Okay, so yeah, kind of related to that,
- 03:13the normal means problem is
- 03:15something we teach quite frequently.
- 03:18It's not hard to teach
- 03:20but it always struck me whenever I was taught it
- 03:23that it looked like a kind of
- 03:24a toy model that statisticians kind of think up
- 03:28to teach students things
- 03:31but never actually use.
- 03:32And then suddenly, I had an epiphany
- 03:35and realized that it's super useful.
- 03:37And so now, I'm trying to...
- 03:38I'm not the only one but I'm trying to convince people
- 03:42that actually, this is a super useful thing
- 03:44that we should be using in practice.
- 03:46So here's the normal means model.
- 03:49The idea is that you've got
- 03:50a bunch of observations, XJ,
- 03:52that you can think of as noisy observations
- 03:54of theta J
- 03:55and they have some variance.
- 03:58I'm going to allow each variance to be different.
- 04:02The simplest version would be to assume
- 04:04that the variances are all the same
- 04:06but I'm going to allow them to be different.
- 04:10But an important point is
- 04:11that we're going to assume that the variances are known,
- 04:14which sounds a bit weird
- 04:15but in applications, we'll see that there are reasons
- 04:17why we might think that's an okay assumption
- 04:21in some applications.
- 04:22Okay, so the basic idea is
- 04:25you've got a bunch of measurements
- 04:26that are noisy measurements
- 04:27of some theta J
- 04:29and they have known variance,
- 04:31so they have known precision essentially,
- 04:33and you want to estimate the Theta Js.
- 04:36And, of course, the MLE is just
- 04:38to estimate each theta J
- 04:39by its corresponding measurement, XJ.
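To write the setup down compactly (a restatement of what was just described, with the standard deviations treated as known):

```latex
x_j \mid \theta_j \;\sim\; N(\theta_j,\, s_j^2), \qquad j = 1,\dots,n,
\qquad\quad
\hat\theta_j^{\text{MLE}} = x_j .
```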
- 04:43And really, it was a big surprise, I think.
- 04:46I wasn't around at the time
- 04:47but I believe it was a big surprise in 1956
- 04:49when Stein showed that you can do better
- 04:52than the MLE, at least in terms of
- 04:55average squared error, expected squared error.
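For the equal-variance case (all $s_j = \sigma$, known), the classic James-Stein estimator is the standard illustration of Stein's result; it dominates the MLE in total squared error whenever $n \ge 3$. It is included here only as an aside, not as the estimator discussed in this talk:

```latex
\hat\theta^{\text{JS}} \;=\; \left(1 - \frac{(n-2)\,\sigma^2}{\lVert x \rVert^2}\right) x .
```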
- 05:01And so, there are many different ways to motivate or...
- 05:05To motivate this result.
- 05:07And I think many of them end up not being that intuitive.
- 05:11It is quite a surprising result in its generality
- 05:15but I think...
- 05:16So the way I like to think about the intuition
- 05:18for why this might be true,
- 05:19it's not the only intuition
- 05:20but it's one intuition for why this might be true,
- 05:22is to bring empirical Bayes thinking to the problem.
- 05:27And so, to illustrate this idea,
- 05:30I use a well-worn device, at this point,
- 05:33which is baseball batting averages.
- 05:38Efron certainly has used this example before
- 05:40to motivate empirical Bayes ideas.
- 05:42This particular example comes from...
- 05:44The data come from this blog here,
- 05:46that I referenced at the bottom,
- 05:48which I quite like as an explanation
- 05:49of basic ideas behind empirical Bayes.
- 05:52So this histogram here shows a bunch
- 05:54of baseball batting averages
- 05:57for a particular season in 1900.
- 05:59You don't need to know very much about baseball
- 06:01to know what's going on here.
- 06:02Essentially, in baseball, you go and try and hit a ball
- 06:06and your batting average is
- 06:07what proportion of the time
- 06:08you as a bat person end up hitting the ball.
- 06:12And a good baseball batting average is around 0.3 or so.
- 06:17And in professional baseball,
- 06:19no one's really going to have a batting average of zero
- 06:21'cause they wouldn't survive.
- 06:25But empirically, there were some individuals in this season
- 06:28who had a batting average of zero,
- 06:30that is they completely failed to hit the ball every time
- 06:33they went up to bat.
- 06:34And there were some people
- 06:36who had a batting average of above 0.4,
- 06:38which is also completely unheard of in baseball.
- 06:43Nobody has a batting average that high,
- 06:45so what's going on here?
- 06:47Well, the simple explanation is
- 06:49that these individuals at the tails are individuals
- 06:52who just had a few at-bats.
- 06:53They only went and attempted
- 06:55to hit the ball a small number of times.
- 06:57And so, maybe these individuals only had two at-bats
- 07:01and they missed both times,
- 07:02they got injured or they weren't selected
- 07:04or, for whatever reason, they didn't hit
- 07:06the ball many times...
- 07:07They didn't go to bat many times
- 07:09and so, their batting average was empirically zero.
- 07:12Think of that as the maximum likelihood estimate.
- 07:15But if you wanted to predict
- 07:17what they would do, say next season,
- 07:19if you gave them more at bats
- 07:21in the long run,
- 07:22zero would be a bad estimate for obvious reasons.
- 07:27And the same applies to these individuals up here
- 07:30with very big batting averages.
- 07:31They also had relatively few at-bats
- 07:34and they just happened to hit it above 0.4 of the time
- 07:38out of the at-bats.
- 07:39And the individuals who had lots of at-bats are
- 07:41all in the middle here.
- 07:44So these are binomial observations, basically,
- 07:45and the ones who have small N are
- 07:49more likely to be in the tails
- 07:50and the ones with big N are
- 07:51all going to be around in the middle here.
- 07:53So what would we do?
- 07:54What would we want to do
- 07:55if we wanted to estimate,
- 07:57for example, for this individual,
- 07:59their batting average for next season?
- 08:01If we were gonna predict what they were gonna get.
- 08:04Well, we would definitely want to estimate
- 08:08something closer to the average batting average than 0.4.
- 08:13That's the intuition.
- 08:14And one way to frame that problem is that...
- 08:18So sorry.
- 08:18So this is the basic idea of shrinkage.
- 08:20We would want to shrink these estimates towards,
- 08:23in this case, towards the mean.
- 08:25So how are we gonna do that?
- 08:27Well, one way to think about it is...
- 08:31Sorry, let me just...
- 08:33Yes.
- 08:36Sorry, just getting my slides.
- 08:37Okay, so here, the red line represents
- 08:40some underlying distribution
- 08:44of actual batting averages.
- 08:46So conceptually, some distribution
- 08:47of actual batting averages among individuals
- 08:53in this kind of league.
- 08:54So the red line, in a Bayesean point of view,
- 08:57kind of represent a sensible prior
- 09:00for any given individual's batting average
- 09:02before we saw that data.
- 09:03So think of the red line as representing
- 09:05the variation or the distribution
- 09:08of actual batting averages among players.
- 09:12And in fact, what we've done here is estimate
- 09:17that red line from the data.
- 09:21That's the empirical part
- 09:23of the empirical Bayes.
- 09:25The empirical part of empirical Bayes is that
- 09:27the red line which we're going to use
- 09:28as a prior for any given player was
- 09:30actually estimated from all the data.
- 09:33And the basic idea is
- 09:34because we know what the variance
- 09:36of a binomial distribution is,
- 09:38we can kind of estimate
- 09:40what the overall distribution
- 09:42of the underlying Ps in this binomial looks like,
- 09:46taking account of the fact that the histogram is
- 09:49noisy observations of those underlying Ps.
- 09:53Every bat...
- 09:54Basically, every estimated batting average is
- 09:58a noisy estimate of the true batting average
- 10:00with the noise depending on
- 10:01how many at-bats they have.
- 10:03So once we've estimated that red line,
- 10:07that prior,
- 10:08we can compute the posterior
- 10:10for each individual based on that prior.
- 10:12And when we do that,
- 10:13this is a histogram of the posterior means.
- 10:16So these are, if you like,
- 10:18shrunken estimates of the batting average
- 10:20for each individual.
- 10:21And you can see that the individuals
- 10:22who had batting averages of zero got shrunk
- 10:25all the way over somewhere here.
- 10:27And that's because their data really...
- 10:29Although, the point estimate was zero,
- 10:32they had very few at-bats.
- 10:33So the information in that data are very slim,
- 10:38very little information.
- 10:40And so, the prior dominates
- 10:41when you're looking at the posterior distribution
- 10:43for these individuals.
- 10:44Whereas individuals in the middle
- 10:46who have more at-bats,
- 10:47will have the estimate that is less shrunken.
- 10:52So that's gonna be a theme we'll come back to later.
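As a rough illustration of the shrinkage being described, here is a minimal beta-binomial sketch with made-up numbers. It is not the analysis from the blog post referenced above (which accounts for the binomial noise when fitting the prior); a crude method-of-moments fit is used here just to show the shape of the calculation:

```python
import numpy as np

# Hypothetical (hits, at-bats) for a handful of players.
hits = np.array([0, 1, 30, 80, 2, 45])
at_bats = np.array([2, 3, 100, 250, 4, 150])

# "Estimate the red line": fit a Beta(a, b) prior to the batting averages.
# A crude method-of-moments fit on the raw averages, for brevity; it ignores
# the binomial-noise correction mentioned in the talk.
avg = hits / at_bats
m, v = avg.mean(), avg.var()
common = m * (1 - m) / v - 1
a, b = m * common, (1 - m) * common

# Posterior mean for each player: shrunk toward the prior mean a/(a+b),
# with less shrinkage for players with many at-bats.
shrunken = (hits + a) / (at_bats + a + b)
print(np.round(shrunken, 3))
```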
- 10:55So how do we form...
- 10:57That's a picture.
- 10:59How do we formulate that?
- 11:00So those were binomial data,
- 11:03I'm gonna talk about normal data.
- 11:05So don't get confused by that.
- 11:08I'm just going to assume normality; you could do
- 11:10the same thing for a binomial,
- 11:12but I think the normal is a more generally useful
- 11:16and convenient way to go.
- 11:20So here's a normal means model again
- 11:23and the idea is that
- 11:25we're going to assume that thetas come
- 11:26from some prior distribution, G,
- 11:29that was the red line in my example,
- 11:31and we're going to estimate G
- 11:33by maximum likelihood essentially.
- 11:34So we're going to use all the X's,
- 11:36integrating out theta
- 11:38to obtain a maximum likelihood estimate for G.
- 11:40That's stage one,
- 11:42that's estimating that red line.
- 11:43And then stage two is
- 11:44to compute the posterior distribution
- 11:46for each batting average,
- 11:48or whatever theta J we're interested in,
- 11:51taking into account that estimated prior
- 11:53and the data on the individual J.
- 11:56So that's the formalization of these ideas.
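In symbols, the two stages just described are (my restatement; $\mathcal{G}$ is the chosen family of priors):

```latex
\text{Stage 1:}\quad \hat g \;=\; \arg\max_{g \in \mathcal{G}} \sum_{j=1}^{n} \log \int N(x_j;\, \theta,\, s_j^2)\, g(d\theta),
\qquad
\text{Stage 2:}\quad p(\theta_j \mid x_j, \hat g, s_j) \;\propto\; \hat g(\theta_j)\, N(x_j;\, \theta_j,\, s_j^2).
```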
- 12:01And these posterior distributions are gonna be shrunk
- 12:03towards the prior, or the prior mean.
- 12:07So what kind of...
- 12:09So I guess I've left unspecified here,
- 12:12what family of priors should we consider for G?
- 12:17So a commonly used prior distribution is
- 12:21this so-called point-normal,
- 12:23or sometimes called spike and slab prior distribution.
- 12:28And these are...
- 12:28Sorry, I should say,
- 12:29I'm going to be thinking a lot about problems
- 12:31where we want to induce sparsity.
- 12:33So in baseball, we were shrinking towards the mean
- 12:37but in many applications,
- 12:38the natural point to shrink towards,
- 12:40the natural prior mean, if you like, is zero
- 12:42in situations where we expect effects
- 12:46to be sparse, for example.
- 12:47So I'm gonna be talking mostly about that situation,
- 12:50although the ideas are more general.
- 12:51And so, I'm going to be focusing
- 12:52on the sparsity inducing choices of prior family.
- 12:56And so, one commonly used one is this point normal
- 12:59where there's some mass pi zero
- 13:02exactly at zero,
- 13:04and then the rest of the mass is
- 13:06normally distributed about zero.
- 13:09So that's a commonly used one.
- 13:11In fact, it turns out,
- 13:13and this is kind of interesting I think,
- 13:15that it can be easier to do the computations
- 13:17for more general families.
- 13:20So for example,
- 13:22just take the non-parametric family
- 13:24that's the zero-centered scale mixture of normal,
- 13:27so we'll see that in a minute,
- 13:28which includes all these distributions as special cases.
- 13:31It's nonparametric.
- 13:33It includes a point-normal here.
- 13:35It also includes the T-distribution,
- 13:36the Laplace distribution,
- 13:38the horseshoe prior, if you've come across that,
- 13:40this zero-centered scale mixture of normals
- 13:42and the surprise is that it turns out
- 13:45to be easier, in some sense,
- 13:47to do the calculations for this family,
- 13:49this more general family,
- 13:50than this narrow family,
- 13:52partly because the family is convex.
- 13:53So you can think of this as a kind of a convex relaxation
- 13:57of the problem.
- 13:58So all the computations become...
- 13:59The optimizations you have to do in the simplest case
- 14:01become convex when you use this family.
- 14:05So let me say a bit more about that
- 14:07for the non-parametric.
- 14:09How do we actually do these non-parametric computations?
- 14:12Well, we actually approximate
- 14:14the non-parametric computation using a grid idea.
- 14:17So here's the idea.
- 14:20We modeled G, our prior,
- 14:22as a mixture of...
- 14:23I like to think of this K as being big.
- 14:25A large number of normal distributions.
- 14:28All of these normal distributions are centered at zero,
- 14:31that's this zero here,
- 14:32and they have a different variance.
- 14:34Some of them have very small variances,
- 14:36perhaps even one of them has a zero variance,
- 14:38so that's the point mass at zero.
- 14:39And the variance is sigma...
- 14:41Think of Sigma squared K getting gradually bigger
- 14:43until the last Sigma squared K is very big.
- 14:47So we're just gonna use a lot of them.
- 14:48Think of K as being, let's say 100
- 14:50or 1,000 for the...
- 14:52In practice, we find 20 is enough
- 14:54but just think of it as being big
- 14:57and spanning a lot of different variances,
- 14:59going from very, very small,
- 15:00to very, very big.
- 15:01And then, estimating G just comes down
- 15:05to estimating these pis,
- 15:06these mixture proportions.
- 15:09And that, then of course,
- 15:11is a finite dimensional optimization problem
- 15:13and in the normal means model,
- 15:15it's a convex...
- 15:16Well actually, for any mixture,
- 15:18it's a convex problem,
- 15:19and so there are efficient ways to find
- 15:22the MLE for pi, given the grid of variances.
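Concretely, the grid-based family and the weight estimation look like this (the grid $\sigma_1^2 \le \dots \le \sigma_K^2$ is fixed in advance; $\Delta_K$ denotes the simplex of mixture proportions):

```latex
g(\theta) \;=\; \sum_{k=1}^{K} \pi_k\, N(\theta;\, 0,\, \sigma_k^2),
\qquad
\hat\pi \;=\; \arg\max_{\pi \in \Delta_K} \sum_{j=1}^{n} \log \sum_{k=1}^{K} \pi_k\, N\!\left(x_j;\, 0,\, \sigma_k^2 + s_j^2\right),
```

and the optimization over $\pi$ is convex.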
- 15:27So let's just illustrate what's going on here.
- 15:30Here's a grid of just three normals.
- 15:33The one in the middle has the smallest variance,
- 15:35the one over here has the biggest variance.
- 15:37And we can get a mixture of those,
- 15:39looks like that.
- 15:40So you can see this is kind of a spiky distribution
- 15:42but also with a long tail,
- 15:44even with just a mixture of three distributions.
- 15:47And so, the idea is that you can get
- 15:49quite a flex...
- 15:51It's a flexible family
- 15:52by using a larger number of variances than three.
- 15:56You can imagine you can get distributions
- 15:59that have all sorts of spikiness
- 16:01and long-tailed behavior.
- 16:06So maybe just to fill in the details here;
- 16:09with that prior as a mixture of normals,
- 16:13the marginal distribution, P of X,
- 16:15integrating out theta is analytic
- 16:18because the sum of independent normals is normal.
- 16:21So if you have a normally distributed variable
- 16:23and then you have another variable
- 16:25that's a normal error on top of that,
- 16:27you get a normal.
- 16:28So the marginal is a mixture of normals
- 16:32that's very simple to work with
- 16:34and estimating pi is a convex optimization problem.
- 16:38You can do it.
- 16:39You can do an EM algorithm
- 16:40but convex methods,
- 16:41as pointed out by Koenker and Mizera,
- 16:43can be a lot more reliable and faster.
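Here is a minimal numerical sketch of that univariate computation, using plain EM updates for the mixture weights. Serious implementations use the convex solvers just mentioned, and the grid-construction rule below is made up for illustration:

```python
import numpy as np
from scipy.stats import norm

def ebnm_grid(x, s, K=20, n_iter=200):
    """Empirical Bayes normal means with a zero-centered scale-mixture-of-normals
    prior on a fixed grid of variances (illustrative sketch only)."""
    # Grid of prior sds: a point mass at zero plus sds spanning small to large.
    sigma = np.concatenate(([0.0],
                            np.geomspace(0.1 * s.min(), 2 * np.abs(x).max(), K - 1)))
    # Marginal likelihood of each x_j under component k: N(0, s_j^2 + sigma_k^2).
    L = norm.pdf(x[:, None], loc=0.0, scale=np.sqrt(s[:, None]**2 + sigma[None, :]**2))
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):                      # EM updates for the mixture weights
        W = L * pi
        W /= W.sum(axis=1, keepdims=True)        # responsibilities
        pi = W.mean(axis=0)
    W = L * pi
    W /= W.sum(axis=1, keepdims=True)
    # Posterior mean: mixture of component-wise shrinkage factors applied to x_j.
    shrink = sigma[None, :]**2 / (sigma[None, :]**2 + s[:, None]**2)
    return pi, (W * shrink * x[:, None]).sum(axis=1)
```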
- 16:49Okay, so let's just illustrate those ideas again.
- 16:52Here's a potential prior distribution
- 16:58and here's a likelihood.
- 17:00So this is like a likelihood from a normal...
- 17:02This is an estimate...
- 17:04Think of this as a likelihood
- 17:05for theta J in a normal means model.
- 17:08So maybe XJ was one and a half or something
- 17:11and SJ was, I don't know,
- 17:12something like a half or something
- 17:14or a half squared.
- 17:17So this is meant to represent the likelihood.
- 17:19So what does the posterior look like
- 17:21when we combine this prior,
- 17:23the black line,
- 17:24with this likelihood, the red line?
- 17:26It looks like this green line here.
- 17:28So what you can see is going on here is
- 17:31that you get shrinkage towards the mean, right?
- 17:34But because the black line is long-tailed
- 17:37because of the prior in this case has a long tail,
- 17:40and because the red line...
- 17:41The likelihood lies quite a ways in the tail,
- 17:44the spiky bit at zero doesn't have very much impact
- 17:48because it's completely...
- 17:49Zero is basically inconsistent with the data
- 17:52and so the posterior looks approximately normal.
- 17:54It's actually a mixture of normals
- 17:56but it looks approximately normal 'cause the weight
- 17:58on zero there is very, very small.
- 18:02Whereas if a...
- 18:04Here's a different example,
- 18:05the black line is covered
- 18:08by the green line this time because it's...
- 18:09So I plotted all three lines on the same plot here.
- 18:12The black line is...
- 18:13Think of it as pretty much the green line.
- 18:14It's still the same spiky prior
- 18:16but now the likelihood is much flatter.
- 18:18The XJ is the same.
- 18:20Actually, it's one and a half
- 18:21but we have an SJ that's much bigger.
- 18:23So what happens here is that
- 18:25the prior dominates because the likelihood's
- 18:27relatively flat,
- 18:29and so the posterior looks pretty much like the prior
- 18:31and you get very strong shrinkage.
- 18:33So think of this as corresponding
- 18:35to those individuals who had very few at-bats,
- 18:38their data are very imprecise,
- 18:41and so their posterior, the green line,
- 18:43looks very like the prior, the black line.
- 18:46Okay, so we're gonna shrink those observations more.
- 18:50So the key point here, I guess,
- 18:52is that the observations with larger standard error,
- 18:55larger SJ,
- 18:58get shrunk more.
- 19:00I should say "larger standard deviation" get shrunk more.
- 19:04Here's another intermediate example
- 19:07where the red line...
- 19:08The likelihood's kind of not quite enough.
- 19:11It illustrates the idea that the posterior could be bimodal
- 19:15because the prior and the likelihood
- 19:18have weight in different places.
- 19:20So you can get different kinds of shrinkage depending on
- 19:23how spiky the prior is,
- 19:24how long-tailed the prior is,
- 19:25how flat the likelihood is etc.
- 19:35So obviously the shrinkage,
- 19:36the amount of shrinkage you get,
- 19:37depends on the prior, G,
- 19:39which you're gonna estimate from the data.
- 19:41It also depends on the standard error
- 19:43or the standard deviation, SJ.
- 19:46And one way to summarize this kind of the behavior,
- 19:48the shrinkage behavior,
- 19:50is to focus on how the posterior mean changes with X.
- 19:55So we can define this operator here,
- 19:57S-G-S of X,
- 19:59as the X posterior mean of theta J,
- 20:04given the prior and its variance or standard deviation
- 20:08and that we observed XJ is equal to X.
- 20:14I'm gonna call this the shrinkage operator
- 20:17for the prior, G,
- 20:18and standard deviation, S.
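Written out, the operator being defined is:

```latex
S_{g,s}(x) \;:=\; \mathbb{E}\!\left[\,\theta_j \;\middle|\; g,\; s_j = s,\; x_j = x \,\right].
```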
- 20:22Okay, so we could just plot
- 20:24some of these shrinkage operators.
- 20:25So the idea here is...
- 20:26Sorry, this slide has B instead of X.
- 20:30Sometimes I use B and sometimes I use X.
- 20:34I've got them mixed up here, sorry.
- 20:35So think of this as X
- 20:37and this is S of X.
- 20:39So these different lines here correspond
- 20:43to different priors.
- 20:44So the idea is that by using different priors,
- 20:47we can get different types of shrinkage behavior.
- 20:50So this prior here shrinks very strongly to zero.
- 20:54This green line shrinks very strongly to zero
- 20:57until B exceeds some value around five,
- 21:01at which point it hardly shrinks at all.
- 21:04So this is kind of a prior that has
- 21:07kind of a big spike near zero.
- 21:10But also a long tail,
- 21:11such that when you get far enough in the tail,
- 21:15you start to be convinced
- 21:17that there's a real signal here.
- 21:18So you can think of that
- 21:19as this kind of...
- 21:20This is sometimes called...
- 21:22This is local shrinkage
- 21:24and this is global.
- 21:26So you get very strong local shrinkage towards zero
- 21:29but very little shrinkage
- 21:31if the signal is strong enough.
- 21:33That kind of thing.
- 21:34But the real point here is that
- 21:35by using different priors,
- 21:37these different scale mixture of normal priors,
- 21:39you can get very different looking shrinkage behaviors.
- 21:43Ones that shrink very strongly to zero
- 21:45and then stop shrinking
- 21:46or ones that shrink a little bit all the way, etc.
- 21:52And so, you may be familiar with other ways
- 21:55of doing shrinkage analysis,
- 21:56and this is one of them:
- 21:58another way of inducing shrinkage
- 21:59is to use a penalized likelihood.
- 22:01Then you can try and draw a parallel
- 22:04and that's what I'm trying to do here.
- 22:05Draw a parallel between the Bayesian method
- 22:08and the penalized likelihood-based approaches
- 22:11to inducing shrinkage or sparsity.
- 22:15Another way to induce shrinkage is to essentially...
- 22:18This is the kind of normal log likelihood here
- 22:21and this is a penalty here
- 22:23that you add for this.
- 22:25This could be an L1 penalty
- 22:26or an L2 penalty or an L0 penalty,
- 22:28or some other kind of penalty.
- 22:30So there's a penalty function here.
- 22:32And you define the estimate
- 22:34as the value that minimizes
- 22:36this penalized log likelihood.
- 22:38Sorry, yeah, this is a negative log likelihood.
- 22:41Penalized least squares, I guess this would be.
- 22:45Okay, so now H is a penalty function here
- 22:49and Lambda is a tuning parameter
- 22:51that says how strong, in some sense, the penalty is.
- 22:55And these are also widely used
- 22:58to induce shrinkage,
- 22:59especially in regression contexts.
- 23:02And so, here are some commonly used shrinkage operators,
- 23:06corresponding to different penalty functions.
- 23:09So this green line is what's called
- 23:12the hard thresholding,
- 23:16which corresponds to an L0 penalty.
- 23:19If you don't know what that means, don't worry.
- 23:21But if you do, you make that connection.
- 23:25And the red line here is the L1 penalty
- 23:28or soft thresholding.
- 23:29And these two other ones here are particular instances
- 23:33of some non-convex penalties that are used
- 23:36in regression context,
- 23:37particularly in practice.
- 23:39And I guess that the point here is
- 23:42that, essentially, different prior distributions
- 23:45in the normal means model can lead
- 23:48to shrinkage operators, shrinkage behavior
- 23:51that looks kind of similar
- 23:53to each of these different types of penalty.
- 23:58So you can't actually mimic the behavior exactly.
- 24:03I've just...
- 24:05Or actually, my student, (indistinct) Kim,
- 24:07chose the priors to visually closely match these
- 24:11but you can't get...
- 24:12Some of these have kinks and stuff
- 24:13that you can't actually, formally, exactly mimic
- 24:18but you can get qualitatively similar shrinkage behavior
- 24:21from different priors
- 24:23as different penalty functions.
- 24:24So you should think about the different priors
- 24:26as being analogous to different penalty functions.
- 24:30And so, the key...
- 24:33How does EB, empirical Bayes shrinkage,
- 24:35differ from, say, these kinds of penalty-based approaches,
- 24:40which I should say are maybe
- 24:41more widely used in practice?
- 24:44Well, so shrinkage is determined by the prior, G,
- 24:49which we estimate in an empirical Bayse context
- 24:52by maximum likelihood.
- 24:53Whereas in typical shrinkage...
- 24:57Sorry, typical penalty-based analyses,
- 25:00people use cross validation to estimate parameters.
- 25:03And the result is that cross-validation is fine
- 25:07for estimating one parameter
- 25:08but it becomes quite cumbersome
- 25:10to estimate two parameters,
- 25:12and really tricky to estimate three or four parameters
- 25:14'cause you have to go and do a grid of different values
- 25:17and do a lot of cross-validations
- 25:18and start estimating all these different parameters.
- 25:22So the point here is really
- 25:23because we estimate G by maximum likelihood,
- 25:26we can actually have a much more flexible family in practice
- 25:29that we can optimize over more easily.
- 25:33It's very flexible,
- 25:34you can mimic a range of penalty functions
- 25:36so you don't have to choose
- 25:37whether to use L1 or L2 or L0.
- 25:40You can essentially estimate
- 25:42over these non-parametric prior families.
- 25:44Think of that as kind of deciding automatically
- 25:47whether to use L0, L1, L2
- 25:49or some kind of non-convex penalty,
- 25:51or something in between.
- 25:53And the posterior distribution, of course then,
- 25:58another nice thing is that it gives not
- 25:59only the point estimates
- 26:01but, if you like, it also gives shrunken interval estimates
- 26:04which are not yielded by a penalty-based approach.
- 26:08So I guess I'm trying to say
- 26:09that there are potential advantages
- 26:11of the empirical Bayes approach
- 26:13over the penalty-based approach.
- 26:16And yeah, although I think
- 26:19people have tried, and particularly Efron has highlighted
- 26:21the potential for empirical Bayes
- 26:23to be used in practical applications,
- 26:25largely, in the practical applications
- 26:27I've seen, empirical Bayes shrinkage hasn't
- 26:29really been used very, very much.
- 26:32So that's the goal,
- 26:34is to change that.
- 26:37So before I talk about examples,
- 26:39I guess I will pause for a moment
- 26:41to see if there are any questions.
- 26:54And I can't see the chat for some reason
- 26:56so if anyone...
- 26:57So please unmute yourself
- 26:58if you have a question.
- 27:00- I don't think people are (indistinct)
- 27:04any question in the chat.
- 27:05At least, I didn't see any. - Good.
- 27:07Okay, thank you.
- 27:10It's all clear.
- 27:12I'm happy to go on but
- 27:14I just wanna...
- 27:25Okay, so we've been trying to...
- 27:27My group has been trying to think about
- 27:28how to use these ideas,
- 27:30make these ideas useful in practice
- 27:32for a range of practical applications.
- 27:35We've done work on multiple testing,
- 27:37on high dimensional linear aggression,
- 27:40and also some on matrix factorization.
- 27:43From previous experience,
- 27:44I'll probably get time to talk about the first two
- 27:45and maybe not the last one,
- 27:47but there's a pre-print on the archive.
- 27:49You can see if you're interested
- 27:50in matrix factorization.
- 27:51Maybe I'll get to get to talk about that briefly.
- 27:55But let me talk about multiple testing first.
- 27:59So the typical multiple testing setup,
- 28:02where you might typically, say,
- 28:05apply a Benjamini-Hochberg type procedure is
- 28:08you've got a large number of tests,
- 28:09So J equals one to N,
- 28:11and test J yields a P value, PJ,
- 28:15and then you reject all tests with
- 28:16some PJ less than a threshold gamma,
- 28:19where that threshold is chosen
- 28:21to control the FDR in a frequented sense.
- 28:23So that's the typical setup.
- 28:26So how are we going to apply
- 28:27the normal means model to this problem?
- 28:33Okay, well, in many applications,
- 28:37not all but in many,
- 28:38the P values are derived from
- 28:40some kind of effect size estimate,
- 28:41which I'm going to call "Beta hat J,"
- 28:45which have standard errors, SJ,
- 28:47that satisfy approximately, at least,
- 28:49that Beta J hat is normally distributed
- 28:53about the true value Beta J
- 28:55with some variance given it by SJ.
- 28:58So in a lot...
- 29:01I work a lot in genetic applications.
- 29:03So in genetic applications,
- 29:05we're looking at different genes here.
- 29:07So Beta J hat might be the estimate
- 29:11of the difference in expression, let's say,
- 29:13of a gene, J, between, say, males and females.
- 29:18And Beta J would be the true difference
- 29:21at that gene.
- 29:22And you're interested in identifying
- 29:25which genes are truly different...
- 29:28Have a different mean expression
- 29:30between males and females here.
- 29:32And the reason that SJ is approximately known is
- 29:36because you've got multiple males
- 29:38and multiple females that you're using
- 29:41to estimate this difference.
- 29:43And so, you get an estimated standard error
- 29:45of that Beta hat as well.
- 29:48And so, once you've set the problem up like this,
- 29:50of course, it looks suddenly like a normal means problem
- 29:55and we can kind of apply
- 29:57the empirical Bayes normal means idea.
- 30:00We're gonna put a prior on the Beta Js
- 30:03that is sparsity inducing.
- 30:05That is, it's kind of centered at zero,
- 30:06maybe it's got a point mass at zero.
- 30:08But we're gonna estimate that prior from the data.
- 30:14Okay.
- 30:17And so, not only can you get posterior means for Beta,
- 30:23as I said, you can get posterior interval estimates.
- 30:25So you can kind of do things
- 30:27like compute the posterior 90% credible interval,
- 30:31given that prior and the likelihood
- 30:33for each Beta J
- 30:34and we could reject, for example,
- 30:35if the interval does not contain zero.
- 30:39And I'm not going to talk about this in detail
- 30:42because the details are in
- 30:44a biostatistics paper from 2017.
- 30:46I should say that the idea of using empirical Bayes for FDR
- 30:50actually dates back to before Benjamini and Hochberg.
- 30:54Duncan Thomas has a really nice paper
- 30:57that was pointed out to me by John Witty
- 30:59that actually contains these basic ideas
- 31:03but not nice software implementation,
- 31:07which maybe explains why it hasn't caught on
- 31:09in practice yet.
- 31:10Efron's also been a pioneer in this area.
- 31:13So...
- 31:16Okay, so I don't want to dwell on that
- 31:18because, actually, I think I'll just summarize
- 31:21what I think is true compared with Benjamini-Hochberg.
- 31:25You get a bit of an increase in power
- 31:27by using an empirical Bayes approach.
- 31:29The Benjamini-Hochberg approach is
- 31:32more robust to correlated tests though,
- 31:34so the empirical Bayes normal means model does assume
- 31:38that the tests are independent
- 31:39and, in practice, we have seen
- 31:43that correlations can cause problems.
- 31:45If you're interested in that,
- 31:46I have a pre-print with Lei Sun on my website.
- 31:49But the empirical Bayes normal means
- 31:52also provides these interval estimates,
- 31:55which is kind of nice.
- 31:56Benjamini-Hochberg does not.
- 31:58So there are some advantages
- 31:59of the empirical Bayes approach
- 32:01and maybe some disadvantages compared
- 32:04with Benjamini-Hochberg.
- 32:05But I think that the real benefit
- 32:06of the empirical Bayes approach
- 32:08actually comes when we look at multi-variate extensions
- 32:11of this idea.
- 32:12So I just wanted to briefly highlight those
- 32:14and spend some time on those.
- 32:16So here's the multi-variate version
- 32:19of the empirical Bayse normal means models.
- 32:24And now, my Beta Js are a vector of observation.
- 32:29So I think of this as measuring, say,
- 32:32gene J in multiple different tissues.
- 32:34Think of different tissues.
- 32:36You look at at heart
- 32:37you look at lung,
- 32:38you look brain,
- 32:39you look at the spleen.
- 32:42In fact, we've got 50 different tissues in the example
- 32:46I'm gonna show in a minute.
- 32:48So we've measured some kind of effect
- 32:52in each gene, in each of these 50 different tissues
- 32:56and we want to know where the effects are...
- 33:01Which genes show effects in which tissues.
- 33:04So Beta J is now a vector of length R,
- 33:08the number of tissues.
- 33:09R is 50 in our example.
- 33:11And so you've got...
- 33:12We're gonna assume that the estimates are
- 33:16normally distributed with mean,
- 33:17the true values and some variance,
- 33:19covariance matrix now,
- 33:21which we're going to assume, for now, is known.
- 33:23That's actually a little trickier
- 33:26but I'm gonna gloss over that for...
- 33:28If you want to see details,
- 33:29take a look at the paper.
- 33:31I just wanna get the essence of the idea across.
- 33:33We're still going to assume that Beta J comes
- 33:35from some prior, G,
- 33:36and we're still gonna use a mixture of normals,
- 33:38but now we're using a mixture
- 33:39of multi-variate normals.
- 33:41And unlike the univariate case,
- 33:43we can't use a grid of...
- 33:45We can't use a grid of values
- 33:47that span all possible covariance matrices here.
- 33:50It's just too much.
- 33:52So we have to do something to estimate
- 33:53these covariance matrices,
- 33:55as well as estimate the pis.
- 33:57And again, if you want to see the details,
- 34:00take a look at a Urbut et al.
- 34:03But let me just illustrate
- 34:04the idea of what's going on here,
- 34:06or what happens when you apply this method
- 34:08to some data.
- 34:10So this is...
- 34:10I said 50,
- 34:11we have 44 tissues in this particular example.
- 34:14So each row here is a tissue.
- 34:17These yellow ones here are brain tissues,
- 34:20different brain tissues,
- 34:22and I think we'll see one later that's blood.
- 34:25I think this one might be blood.
- 34:26Anyway, each one is a tissue;
- 34:28lung, blood, etc.
- 34:29You don't need to know which ones are which, for now.
- 34:32And so, what we've done here is plot
- 34:34the Beta hat and plus or minus two standard deviations
- 34:38for each tissue at a particular...
- 34:41In this case, a particular snip, actually (indistinct).
- 34:44So this is an eQTL analysis,
- 34:46for those of you who know what that means.
- 34:48If you don't, don't worry about it.
- 34:50Just think of it as having an estimated effect
- 34:53plus or minus two standard deviations
- 34:55in 44 different tissues,
- 34:58and we want to know which ones are quote,
- 35:01"significantly different from zero."
- 35:05And so what happens...
- 35:10Sorry.
- 35:11Didn't expect that to happen.
- 35:17Sorry, okay.
- 35:17Yeah, these are just...
- 35:18These are just two examples.
- 35:20So this is one example,
- 35:22here's another example
- 35:23where we've done the same thing.
- 35:25Estimated effects, plus or minus two standard deviations.
- 35:28So what you can see in this first one is
- 35:30that it looks like, at least,
- 35:31that the brain tissues have some kind of effect.
- 35:33That's what you're supposed to see here.
- 35:36And maybe there are some effects in other tissues.
- 35:38There's a tendency for effects to be positive,
- 35:40which might suggest that maybe everything has
- 35:44a small effect to everywhere,
- 35:45but particularly strong in the brain.
- 35:47And whereas in this example,
- 35:50this one appears to have an effect
- 35:52in just one tissue.
- 35:53This is the blood actually.
- 35:54So this is an effect in blood
- 35:56but mostly, it doesn't look like
- 35:58there's an effect in other tissues.
- 36:00But these, just to emphasize,
- 36:01these are the raw data,
- 36:03in the sense that they're the Beta hats
- 36:04and the standard errors.
- 36:05There's no shrinkage occurred yet.
- 36:07But the idea is that the empirical Bayse approach takes
- 36:11all these examples,
- 36:12examples like this and examples like this,
- 36:14to learn about what kinds of patterns are present
- 36:17in the data.
- 36:18That is, "What does G look like?"
- 36:19So it learns from these examples
- 36:21that there are some effects that look like
- 36:24they're shared among the brain tissues,
- 36:27and there are some effects that are...
- 36:28These are actually somewhat rare
- 36:31but rarely, there's an effect that's specific
- 36:33to one tissue like, in this case, blood.
- 36:35And it also learns, in this case actually,
- 36:39that there's a lot of null things,
- 36:41because there are a lot of null things as well.
- 36:44So it puts lots of mass on the null as well
- 36:46and that causes the shrinkage.
- 36:48And then, having estimated those patterns from the data,
- 36:51it computes posteriors.
- 36:53And so, here's the data and then the posterior intervals
- 36:58for the same...
- 36:59For that first example.
- 37:00And what you can see is that because of
- 37:02the combining information across tissues,
- 37:06you get standard errors that are getting smaller,
- 37:09the brain estimates all get shrunk towards one another,
- 37:12and all these...
- 37:14There's some borrowing strength of information,
- 37:16borrowing information across these tissues,
- 37:18to make these look like
- 37:20some of them are kind of borderline significant.
- 37:22Now, it looks like there's probably
- 37:23an effect in every tissue
- 37:25but a much stronger effect in brain.
- 37:26Whereas this example here,
- 37:29it recognizes that this looks like an effect
- 37:32that's specific to blood.
- 37:33And so, it shrinks everything else strongly towards zero
- 37:36because it knows that most things are null,
- 37:38it's learned that from the data,
- 37:40but the blood estimate gets hardly shrunk at all.
- 37:43We saw that kind of behavior where things
- 37:44that are near zero can get shrunk towards zero,
- 37:47whereas other things that are far
- 37:48away don't get shrunk as much.
- 37:50And it's really hard to do that kind of thing
- 37:53without doing some kind of model-based analysis,
- 37:57doing a Benjamini-Hochberg-type, non-model-based analysis
- 38:01without making any assumptions
- 38:03or making minimal assumptions,
- 38:06very hard to capture this kind of thing, I think.
- 38:09So I think the empirical Bayes approach
- 38:11has big advantages in this setting.
- 38:17I'll pause before I talk about regression.
- 38:20Any questions there?
- 38:24- So Matthew, I have some basic questions.
- 38:26So in your means multivariate multiple testing case,
- 38:30I guess for each of the plot,
- 38:32you are looking at maybe a particular genes influence
- 38:35on some... - Good, yeah.
- 38:37Sorry, I did skip over it a bit.
- 38:39So these are eQTLs.
- 38:40So actually, what I'm plotting here is each...
- 38:43This is a single snip associated
- 38:46with a single gene.
- 38:48And this is it's,
- 38:49"How associated is this SNP
- 38:52"with this gene's expression level
- 38:54"in the different brain tissues,
- 38:56"in the blood tissue in lung and spleen, etc?"
- 39:02The idea is that...
- 39:04The scientific goal is to understand
- 39:07which genetic variants are impacting gene expression
- 39:10in different tissues,
- 39:12which might tell us something
- 39:13about the biology of the tissues
- 39:15and the regulation going on in the different tissues.
- 39:18- Got it.
- 39:19So in this case,
- 39:20I don't think I fully understand
- 39:23why it's multivariate multiple tests, not univariate
- 39:27because you are looking at each gene
- 39:28versus each SNP.
- 39:32- Right, so sorry.
- 39:33Think of J indexing eQTL.
- 39:36So we've got 2 million potential eQTLs,
- 39:39so that's the multiple part of it.
- 39:41For 2 million potential eQTLs, that's...
- 39:45And then each eQTL has data on 44 tissues,
- 39:50so that's the multi-variate part of it.
- 39:52(speaking over each other)
- 39:53If you thought about it
- 39:54in terms of say P values or maybe Z scores,
- 39:57you have a matrix of Z scores.
- 39:59There are two million rows
- 40:01and there are 44 columns
- 40:03and you have a Z score or a P value
- 40:05for each element in that matrix,
- 40:09and what we're assuming is that
- 40:11the rows are independent,
- 40:13which is not quite true but still,
- 40:15we're assuming that the rows are independent
- 40:17and the columns,
- 40:18we're assuming that they can be correlated.
- 40:20And in particular,
- 40:21we're assuming that the...
- 40:22Well, we're assuming that both
- 40:24the measurements can be correlated,
- 40:26so it's V,
- 40:27but also that the effects can be correlated.
- 40:30So that's to capture the idea
- 40:31that there might be some effects
- 40:33that are shared between say brain tissues--
- 40:36- I see.
- 40:37I see.
- 40:38So this multi-variate is different
- 40:40from our usual notion where the multivariate
- 40:43means multivariate SNPs.
- 40:44So there's multivariate tissue.
- 40:46I guess, are the samples from the same cohort?
- 40:49- Yeah, so in this particular case,
- 40:52the samples are from the same individuals.
- 40:55So these different brain tissues...
- 40:56There's overlap anyway, let's say.
- 40:57And so, that's what causes this...
- 41:00That causes headaches, actually, for this--
- 41:02- Okay, got it, thanks. - Yeah.
- 41:04Just to emphasize,
- 41:05it doesn't have to be different tissues.
- 41:07The whole method works
- 41:08on any matrix of Z scores, basically.
- 41:11As long as you think that the rows correspond
- 41:14to different tests
- 41:15and the columns correspond
- 41:16to different, say, conditions for the same test.
- 41:21So examples might be
- 41:24you're looking at the same SNP
- 41:25across lots of different phenotypes,
- 41:27so looking at schizophrenia,
- 41:29looking at bipolar,
- 41:31looking at different diseases
- 41:33or different traits,
- 41:35and you can have a Beta hat for that SNP
- 41:37and a standard error for that SNP
- 41:39in every trait.
- 41:40And you could try to learn,
- 41:42"Oh look, there are some traits
- 41:43"that tend to share effects
- 41:45"and other traits that don't,"
- 41:46or, often in experiments,
- 41:49people treat their samples
- 41:51with different treatments.
- 41:52They challenge them with different viruses.
- 41:54They look to see which things are being changed
- 41:58when you challenge a cell with different viruses
- 42:00or different heat shock treatments
- 42:02or any kind of different treatment.
- 42:03So yeah, basically, the idea is very generic.
- 42:06The idea is if you've got a matrix of Z scores
- 42:09where the effect, say, look likely to be shared
- 42:12among column sometimes
- 42:13and the rows are gonna be approximately independent,
- 42:17or at least you're willing to assume that,
- 42:20then you can apply the method.
- 42:21- Okay, got it, thanks.
- 42:22- So, actually, that's an important kind of...
- 42:25Also, something that I've been thinking about a lot is
- 42:28the benefits of modular or generic methods.
- 42:34So if you think about what methods are applied
- 42:36in statistics a lot,
- 42:37you think T-test, linear regression.
- 42:39These are all kind of very generic ideas.
- 42:42They don't...
- 42:44And Benjamini-Hochberg.
- 42:45The nice thing about Benjamini-Hochberg is
- 42:47you just need a set of P values
- 42:48and you can apply Benjamini-Hochberg.
- 42:50You don't have to worry too much
- 42:52about where those P values came from.
- 42:54So I think, for applications,
- 42:56it's really useful to try to think about
- 42:58what's the simplest type of data
- 43:01you could imagine inputting into the procedure
- 43:04in order to output something useful?
- 43:06And sometimes, that involves making compromises
- 43:09because to make a procedure generic enough,
- 43:12you have to compromise on what...
- 43:14On maybe the details of what's going in.
- 43:17So here, what we've compromised on is
- 43:18that we take a matrix of Z scores,
- 43:20or potentially Beta hats and their standard errors,
- 43:23we can do either,
- 43:24and that's the input.
- 43:26So that makes it relatively generic.
- 43:28You don't have to worry too much
- 43:30about whether those Beta hats
- 43:31and the standard errors, or the Z scores,
- 43:33are coming from logistic regression
- 43:35or linear regression,
- 43:36or whether they're controlling for some covariates
- 43:38or all sorts of...
- 43:39From a mixed model, etc.
- 43:41As long as they have the basic property that
- 43:44the Beta hat is normally distributed
- 43:47about the true Beta
- 43:48with some variance that you are willing to estimate,
- 43:51then you can go.
- 43:58- Sorry, (indistinct).
- 44:01A short question.
- 44:02So in practice, how do you choose...
- 44:05How many mix...
- 44:06How many distributions do you want in the mixture,
- 44:08like the (indistinct)? - (indistinct)
- 44:10Yeah, so great question.
- 44:12And my answer, generally,
- 44:14is just use as many as you want.
- 44:16So as many as you can stomach.
- 44:19The more you use, the slower it is.
- 44:25And so, you might worry about over-fitting,
- 44:27but it turns out that these procedures are
- 44:29very robust to over-fitting
- 44:30because of this fact that the mean is fixed at zero.
- 44:34So all the components have a mean zero
- 44:37and have some covariance
- 44:39and because of that,
- 44:40they have limited flexibility to overfit.
- 44:45They're just not that flexible.
- 44:48And in the univariate case,
- 44:50that's even more obvious, I think.
- 44:51That in the univariate case,
- 44:53every one of those distributions,
- 44:55any mixture of normals that are
- 44:57all centered at zero is unimodal at zero
- 45:01and has limited...
- 45:02Can't have wiggly distributions
- 45:04that are very spiky and overfitting.
- 45:06So these methods are relatively immune
- 45:09to overfitting in practice.
- 45:12If you're worried about that,
- 45:13you can do a test-train type thing
- 45:15where you use half your tests to train,
- 45:18and then you look at the log likelihood
- 45:20out of sample on others,
- 45:22and then tweak the number to avoid overfitting.
- 45:27And we did do that early on in the methods
- 45:30but we don't do it very often now,
- 45:32or we only do it now when we're worried
- 45:34'cause generally it seems like overfitting doesn't seem
- 45:37to be a problem,
- 45:38but if we see results are a little bit weird
- 45:39or a bit concerning,
- 45:41we try it to make sure we're not overfitting.
- 45:46- Okay, thank you.
- 45:47- I should say that, in the paper,
- 45:49we kind of outlined some procedures we use
- 45:51for estimating these variance-covariance matrices
- 45:54but they're not like...
- 45:55They're kind of like...
- 45:57The whole philosophy is that we could probably do better
- 46:01and we're continuing to try and work
- 46:03on better methods for estimating this
- 46:05as we go forward.
- 46:07So we're continually improving
- 46:08the ways we can estimate this.
- 46:16Okay, so briefly, I'll talk about linear regression.
- 46:19So here's your standard linear regression where,
- 46:21so we've got N observations,
- 46:23X is the matrix of covariates here,
- 46:25B are the regression coefficients.
- 46:27I'm kind of thinking of P as being big, potentially here.
- 46:32And the errors normal.
- 46:33And so, the empirical Bayes idea would be
- 46:36to assume that the Bs come from
- 46:38some prior distribution, G,
- 46:40which comes from some family, curly G.
- 46:42And what we'd like to do is estimate G
- 46:45and then shrink the estimates of B,
- 46:49using empirical Bayse type ideas
- 46:52and posterior count computations.
- 46:55But it's not a simple normal means model here,
- 46:58so how do we end up applying
- 47:00the empirical Bayes methods to this problem?
- 47:04Well, let's just...
- 47:05I'm gonna explain our algorithm
- 47:07by analogy with penalized regression algorithms
- 47:11because the algorithm ends up looking very similar,
- 47:13and then I'll tell you
- 47:14what the algorithm is actually kind of doing.
- 47:16So a penalized regression would solve this problem.
- 47:19So if you've seen the Lasso before...
- 47:22I hope many of you might have.
- 47:23If you've seen the Lasso before,
- 47:24this would be solving this problem
- 47:26with H being the L1 penalty,
- 47:29absolute value of B, right?
- 47:31So this...
- 47:32So what algorithm...
- 47:34There are many, many algorithms to solve this problem
- 47:37but a very simple one is coordinate ascent.
- 47:40So essentially, for each coordi...
- 47:42it just iterates the following.
- 47:43For each coordinate,
- 47:44you have some kind of current estimate for Bs.
- 47:47(indistinct)
- 47:48So what you do here is you form the residuals
- 47:50by taking away the effects of all the Bs
- 47:55except the one you're trying to update,
- 47:57the one you're trying to estimate.
- 47:59So X minus J here is all the covariates
- 48:01except covariate J.
- 48:02B minus J is all the corresponding coefficients.
- 48:06So this is the residual.
- 48:08RJ is the residual,
- 48:09after removing all the current estimated effects
- 48:12apart from the Jth one.
- 48:14And then you basically compute an estimate
- 48:18of the Jth effect by regressing those residuals on XJ.
- 48:24And then you shrink that using a shrinkage operator
- 48:29that we saw earlier.
- 48:31Just to remind you
- 48:33that a shrinkage operator is the one
- 48:34that minimizes this penalized least squares problem.
- 48:38And it turns out,
- 48:39it's not hard to show that this is coordinate descent
- 48:45for minimizing this penalized objective function.
- 48:49And so every iteration of this decreases
- 48:53that objective function
- 48:55or leaves it unchanged.
- 48:58Okay.
- 49:00Okay, so it turns...
- 49:01So our algorithm looks very similar.
- 49:03You still compute the residuals,
- 49:06you compute a Beta hat
- 49:08by regressing the residuals on XJ.
- 49:10You also, at the same time,
- 49:11compute a standard error,
- 49:12which is familiar form.
- 49:17And then you, instead of shrinking using
- 49:21that penalized regression operator,
- 49:23you use a...
- 49:25Sorry, I should say,
- 49:27this is assuming G is known.
- 49:28I'm starting with G.
- 49:29G is known.
- 49:31So you can shrink...
- 49:32Instead of using the penalty-based method,
- 49:35you use the posterior mean shrinkage operator here
- 49:38that I introduced earlier.
- 49:40So it's basically exactly the same algorithm,
- 49:41except replacing this penalty-based shrinkage operator
- 49:47with an empirical Bayes
- 49:48or a Bayesian shrinkage operator.
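A minimal sketch of that loop, with the shrinkage step passed in as a function so the one-line difference between the penalty-based and the posterior-mean versions is visible. Columns are assumed standardized and the residual variance is taken as one for simplicity; this is illustrative only, not the authors' implementation:

```python
import numpy as np

def cd_regression(X, y, shrink, n_iter=100):
    """Coordinate-wise sketch of the algorithm above: 'shrink(b_hat, s)' can be a
    penalty-based operator (e.g. soft thresholding) or an EB posterior-mean operator."""
    n, p = X.shape
    b = np.zeros(p)
    d = (X ** 2).sum(axis=0)         # column sums of squares
    r = y - X @ b                    # current residuals
    for _ in range(n_iter):
        for j in range(p):
            r = r + X[:, j] * b[j]            # put the j-th effect back into the residuals
            b_hat = X[:, j] @ r / d[j]        # regress residuals on X_j
            s = np.sqrt(1.0 / d[j])           # its standard error (residual variance taken as 1)
            b[j] = shrink(b_hat, s)           # shrink: penalty-based or posterior mean
            r = r - X[:, j] * b[j]            # remove the updated effect
    return b

# Example penalty-based operator (soft thresholding, i.e. an L1-style update);
# an EB version would instead return the posterior mean under an estimated prior g.
def soft_threshold(b_hat, s, lam=0.1):
    return np.sign(b_hat) * max(abs(b_hat) - lam, 0.0)
```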
- 49:54And so, you could ask what that's doing
- 49:56and it turns out that what it's doing is
- 49:57it's minimizing the Kullback-Leibler Divergence
- 50:01between some approximate posterior, Q,
- 50:05and the true posterior, P, here
- 50:08under the constraint that this Q is factorized.
- 50:13So this is what's called
- 50:15a variational approximation
- 50:17or a mean-field,
- 50:17or fully factorized variational approximation.
- 50:21If you've seen that before,
- 50:22you'll know what's going on here.
- 50:24If you haven't seen it before,
- 50:25it's trying to find an approximation
- 50:27to the posterior.
- 50:29This is the true posterior,
- 50:30it's trying to find an approximation
- 50:31to that posterior that minimizes
- 50:33the Kullback-Leibler divergence
- 50:35between the approximation
- 50:36and the true posterior, under a simplifying assumption
- 50:40that the posterior factorizes,
- 50:41which, of course, it doesn't,
- 50:42so that's why it's an approximation.
- 50:44So that algorithm I just said is
- 50:46a coordinate ascent algorithm
- 50:48for maximizing F or minimizing the KL divergence.
- 50:53So every iteration of that algorithm gets
- 50:55a better estimate
- 50:58of the posterior, essentially.
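In symbols, the optimization being described is (the standard mean-field form; my restatement):

```latex
\hat q \;=\; \arg\min_{\,q(b) = \prod_j q_j(b_j)} \; \mathrm{KL}\!\left(q(b)\,\big\|\; p(b \mid X, y, g)\right),
```

equivalently, coordinate ascent on the corresponding evidence lower bound $F(q, g)$.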
- 51:02Just to outline
- 51:03and just to give you the intuition
- 51:06for how you could maybe estimate G,
- 51:08this isn't actually quite what we do
- 51:10so the details get a bit more complicated,
- 51:13but just to give you an intuition
- 51:14for how you might think that you can estimate G;
- 51:18Every iteration of this algorithm computes a B hat
- 51:21and a corresponding standard error,
- 51:23so you could imagine...
- 51:25These two steps here, you could imagine storing these
- 51:28through the iterations
- 51:29and, at the end,
- 51:31you could apply the empirical Bayes normal means procedure
- 51:35to estimate G from these B hats
- 51:37and standard errors,
- 51:38and something close to that kind of works.
- 51:43The details are a bit more complicated than that.
- 51:47So let me give you some kind of intuition
- 51:51for what we're trying to achieve here based
- 51:54on simulation results.
- 51:55So these are some simulations we've done.
- 51:57The covariates are all independent here.
- 52:00The true prior is a point normal,
- 52:02that means that most of the effects are zero.
- 52:07Well, actually maybe here,
- 52:09one of the effects is nonzero,
- 52:11five of the effects is nonzero,
- 52:1350 of the effects are nonzero
- 52:15and 500 of the effects of nonzero.
- 52:17And actually, there are 500 effects,
- 52:18500 variables in this example.
- 52:23So the X-axis here just shows the number
- 52:25of non-zero coordinates
- 52:27and the results I've shown here are the prediction error,
- 52:30so we're focusing on prediction error,
- 52:32the out of sample prediction error,
- 52:33using three different penalty-based approaches.
- 52:37The Lasso, which is this line,
- 52:40the L0Learn, which is this line,
- 52:43which is L0 zero penalty,
- 52:45and Ridge, which is this penalty,
- 52:47the L2 penalty.
- 52:48So the important thing to know is that
- 52:51the L0 penalty is really designed, if you like,
- 52:55to do well under very sparse models.
- 52:58So that's why it's got the lowest prediction error
- 53:03when the model is very sparse,
- 53:05but when the model is completely dense,
- 53:07it does very poorly.
- 53:08Whereas Ridge is designed much more to...
- 53:14It's actually based on a prior
- 53:15that the effects are normally distributed.
- 53:17So it's much better at dense models
- 53:19than sparse models.
- 53:20And you can see that at least relative to L0Learn,
- 53:23Ridge is much better for the dense case
- 53:26but also much worse for the sparse case.
- 53:30And then Lasso has some kind of ability
- 53:32to deal with both scenarios,
- 53:35but it's not quite as good as the L0 penalty
- 53:37when things are very sparse,
- 53:39and it's not quite as good as the Ridge penalty
- 53:41when things are very dense.
- 53:45So our goal is that by learning the prior G
- 53:49from the data,
- 53:50we can adapt to each of these scenarios
- 53:53and get performance close to L0Learn
- 53:55when the truth is sparse
- 53:58and get performance close to Ridge regression
- 54:01when the truth is dense.
- 54:03And so, the red line here actually shows
- 54:06the performance of our method.
- 54:09And you can see, indeed,
- 54:10we do even slightly better than L0Learn
- 54:13in this part here
- 54:14and slightly better than cross-validated Ridge regression
- 54:18in this here.
- 54:19The difference between these two is just that
- 54:21the Ridge regression is doing cross-validation
- 54:23to estimate the tuning parameter
- 54:25and we're using empirical Bayes maximum likelihood
- 54:28to estimate it.
- 54:28So that's just that difference there.
- 54:30And the Oracle here is using the true...
- 54:32You can do the Oracle computation
- 54:35for the Ridge regression
- 54:36with the true tuning parameter here.
- 54:38I should say that may be that this...
- 54:42Maybe I should just show you the results.
- 54:45So here is a bunch of other penalties,
- 54:47including elastic net, for example, you might wonder,
- 54:49which is kind of a compromise
- 54:51between L1 and L2.
- 54:52And you can see, it does do the compromising
- 54:56but it doesn't do as well
- 54:57as the empirical Bayes approach.
- 54:59And here are some other non-convex methods
- 55:01that are more, again...
- 55:04They're kind of more tuned to the sparse case
- 55:06than to the dense case.
- 55:09As promised, I'm gonna skip over
- 55:12the matrix factorization
- 55:14and just summarize to give time for questions.
- 55:18So the summary is that
- 55:21the empirical Bayes normal means model provides
- 55:23a flexible and convenient way
- 55:25to induce shrinkage and sparsity
- 55:26in a range of applications.
- 55:28And we've been spending a lot of time trying
- 55:32to apply these methods
- 55:33and provide software to do
- 55:35some of these different things.
- 55:36And there's a bunch of things on my publications page
- 55:39and if you're interested in...
- 55:40If you can't find what you're looking for,
- 55:41just let me know, I'd be happy to point you to it.
- 55:44Thanks very much.
- 55:47- Thanks Matthew, that's a great talk.
- 55:49I wonder whether the audience have
- 55:51any questions for Matthew.
- 55:57So I do have some questions for you.
- 56:00So I think I really like the idea
- 56:02of applying empirical Bayes to a lot of applications
- 56:06and it really seems empirical Bayes has great success.
- 56:10But I do have a question or some doubt
- 56:13about the inference part,
- 56:15especially in that linear regression model.
- 56:18So currently, for the current work you have been doing,
- 56:22you essentially shrink each
- 56:24of the co-efficient that based on, essentially,
- 56:26their estimated value,
- 56:28but in some applications,
- 56:31such as a GWAS study or fine mapping,
- 56:33different snips can have
- 56:35very different LD score structure.
- 56:38So in this case,
- 56:39how much we can trust the inference,
- 56:43the P value, from this (indistinct)?
- 56:48- So, great question.
- 56:51So let me just first...
- 56:55First emphasize that the shrink...
- 56:57The estimate here is being done removing the effects,
- 57:01or the estimated effects,
- 57:03of all the other variables.
- 57:04So each iteration of this,
- 57:06when you're estimating the effect
- 57:07of SNP J, in your example,
- 57:10you're taking the estimated effects
- 57:11of the other variables into account.
- 57:14So the LD structure, as you mentioned,
- 57:18that's the correlation structure
- 57:20for those who don't know,
- 57:21between the Xs is formerly taken into account.
- 57:24However, there is a problem with this approach
- 57:26for very highly correlated variables.
- 57:30So let's just suppose there are two variables
- 57:34that are completely correlated,
- 57:35what does this algorithm end up doing?
- 57:38It ends up basically choosing one of them
- 57:40and ignoring the other.
- 57:44The Lasso does the same in fact.
- 57:46So it ends up choosing one of them
- 57:50and ignoring the other.
- 57:52And if you look to the posterior distribution
- 57:55on its effect,
- 57:57it would be far too confident
- 57:58in the size of the effect
- 58:00because it would assume that the other one had zero effect.
- 58:04And so it would have a small credible
- 58:06interval, let's say, around the effect size
- 58:08when, really, it should be saying
- 58:09you don't know which one to include.
- 58:11And so, we've worked recently
- 58:14on a method for doing that.
- 58:17A different method,
- 58:19different work than what I've just described here
- 58:22for doing fine mapping using variational approximations
- 58:25that don't have this problem,
- 58:27and it's on my webpage.
- 58:29It's Wang et al in JRSS-B,
- 58:34just recently, this year.
- 58:362021, I guess. - That's awesome.
- 58:39Thanks, so any more question for Matthew from the audience?
- 58:47Okay, I think we're running out of time also.
- 58:51So if you have any question
- 58:53about the stuff to (indistinct)
- 58:54you want to use,
- 58:55I think you can contact either
- 58:56the authors of the paper or Matthew offline.
- 59:00And thank you again for agreeing to present your work here.
- 59:04It looks really useful and interesting.
- 59:07- Thank you for having me.