
BIS Seminar: Adventures in sparsity and shrinkage with the normal means model

March 03, 2021
  • 00:04- So, hi everyone.
  • 00:06Since we're still waiting for people to join,
  • 00:08I will first give a brief introduction
  • 00:11to Matthew here.
  • 00:13First, it's my honor to introduce Dr. Matthew Stephens
  • 00:16as our seminar speaker today.
  • 00:18And Matthew is a professor of human genetics
  • 00:21and statistics at the University of Chicago.
  • 00:24And in the past, his research mainly focused
  • 00:27on developing new statistical methods
  • 00:29for especially genetic applications.
  • 00:32Including, for example,
  • 00:33genome-wide association studies and fine mapping
  • 00:36and population genetic variation.
  • 00:38And today, he will give a talk
  • 00:40on some recently developed empirical Bayes methods
  • 00:43for estimation in normal means models
  • 00:46that induce the (indistinct) properties
  • 00:48such as shrinkage,
  • 00:49sparsity or smoothness.
  • 00:51And he will also discuss how to apply these methods
  • 00:54to a range of practical applications.
  • 00:59Okay, so let's wait for another minute
  • 01:02and then I will hand it over to Matthew.
  • 01:59So I will hand it over to Matthew from here.
  • 02:02Let's welcome him.
  • 02:05- Thank you very much.
  • 02:05It's a great pleasure to be here
  • 02:07and to get the opportunity
  • 02:09to present my work to you today.
  • 02:11So I guess just a little bit of background.
  • 02:14So few years ago...
  • 02:18Well, I guess, I've been teaching sparsity
  • 02:20and shrinkage for a while,
  • 02:22and it struck me that, in practice,
  • 02:24people don't really use many of these ideas directly...
  • 02:29At least not the empirical Bayes versions
  • 02:30of these ideas
  • 02:32directly in applications.
  • 02:34And so, I'm wondering why that is.
  • 02:37And partly, it's the lack of...
  • 02:40User-friendly, convenient methods
  • 02:43for applying these ideas.
  • 02:44So I've been trying to think about
  • 02:46how we can make these powerful ideas and methods
  • 02:51more generally applicable or easily applicable
  • 02:53in applications.
  • 02:54These ideas have been around quite some time.
  • 02:57But I think we've made some progress
  • 02:59on actually just making them a bit simpler maybe
  • 03:02and simpler to apply in practice,
  • 03:04so I'm gonna tell you about those today.
  • 03:08Oh, sorry.
  • 03:09It's not advancing, let me see.
  • 03:10Okay, so yeah, kind of related to that,
  • 03:13the normal means problem is
  • 03:15something we teach quite frequently.
  • 03:18It's not hard to teach
  • 03:20but it always struck me whenever I was taught it
  • 03:23that it looked like a kind of
  • 03:24a toy model that statisticians kind of think up
  • 03:28to teach students things
  • 03:31but never actually use.
  • 03:32And then suddenly, I had an epiphany
  • 03:35and realized that it's super useful.
  • 03:37And so now, I'm trying to...
  • 03:38I'm not the only one but I'm trying to convince people
  • 03:42that actually, this is a super useful thing
  • 03:44that we should be using in practice.
  • 03:46So here's the normal means model.
  • 03:49The idea is that you've got
  • 03:50a bunch of observations, XJ,
  • 03:52that you can think of as noisy observations
  • 03:54of theta J
  • 03:55and they have some variance.
  • 03:58I'm going to allow each variance to be different.
  • 04:02The simplest version would be to assume
  • 04:04that the variances are all the same
  • 04:06but I'm going to allow them to be different.
  • 04:10But an important point is
  • 04:11that we're going to assume that the variances are known,
  • 04:14which sounds a bit weird
  • 04:15but in applications, we'll see that there are reasons
  • 04:17why we might think that's an okay assumption
  • 04:21in some applications.
  • 04:22Okay, so the basic idea is
  • 04:25you've got a bunch of measurements
  • 04:26that are noisy measurements
  • 04:27of some theta J
  • 04:29and they have known variance,
  • 04:31so they have known precision essentially,
  • 04:33and you want to estimate the Theta Js.
  • 04:36And, of course, the MLE is just
  • 04:38to estimate theta J
  • 04:39by its corresponding measurement, XJ.
  • 04:43And really, it was a big surprise, I think.
  • 04:46I wasn't around at the time
  • 04:47but I believe it was a big surprise in 1956
  • 04:49when Stein showed that you can do better
  • 04:52than the MLE, at least in terms of
  • 04:55average squared error, expected squared error.
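As a quick numerical illustration of Stein's point (a sketch added here, not part of the talk; it assumes the true means happen to be drawn from N(0, 1), so the oracle linear shrinkage rule is known):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
theta = rng.normal(0.0, 1.0, n)        # true means, with spread 1
s = 1.0                                # known noise standard deviation
x = theta + rng.normal(0.0, s, n)      # noisy observations x_j ~ N(theta_j, s^2)

mle = x                                # MLE: estimate each theta_j by x_j
shrunk = (1.0 / (1.0 + s**2)) * x      # shrink toward 0 (posterior mean if theta ~ N(0, 1))

print("MSE of MLE:      ", np.mean((mle - theta) ** 2))     # ~1.0
print("MSE of shrinkage:", np.mean((shrunk - theta) ** 2))  # ~0.5
```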
  • 05:01And so, there are many different ways to motivate or...
  • 05:05To motivate this result.
  • 05:07And I think many of them end up not being that intuitive.
  • 05:11It is quite a surprising result in generality
  • 05:15but I think...
  • 05:16So the way I like to think about the intuition
  • 05:18for why this might be true,
  • 05:19it's not the only intuition
  • 05:20but it's one intuition for why this might be true,
  • 05:22is to apply empirical Bayes thinking to the problem.
  • 05:27And so, to illustrate this idea,
  • 05:30I use a well-worn device, at this point,
  • 05:33which is baseball batting averages.
  • 05:38Efron certainly has used this example before
  • 05:40to motivate empirical Bayes ideas.
  • 05:42This particular example comes from...
  • 05:44The data come from this blog here,
  • 05:46that I referenced at the bottom,
  • 05:48which I quite like as an explanation
  • 05:49of basic ideas behind empirical Bayes.
  • 05:52So this histogram here shows a bunch
  • 05:54of baseball batting averages
  • 05:57for a particular season in 1900.
  • 05:59You don't need to know very much about baseball
  • 06:01to know what's going on here.
  • 06:02Essentially, in baseball, you go and try and hit a ball
  • 06:06and your batting average is
  • 06:07what proportion of the time
  • 06:08you as a bat person end up hitting the ball.
  • 06:12And a good baseball batting average is around 0.3 or so.
  • 06:17And in a professional baseball,
  • 06:19no one's really going to have a batting average of zero
  • 06:21'cause they wouldn't survive.
  • 06:25But empirically, there were some individuals in this season
  • 06:28who had a batting average of zero,
  • 06:30that is they completely failed to hit the ball every time
  • 06:33they went up to bat.
  • 06:34And there were some people
  • 06:36who had a batting average of above 0.4,
  • 06:38which is also completely unheard of in baseball.
  • 06:43Nobody has a batting average that high,
  • 06:45so what's going on here?
  • 06:47Well, it's a simple explanation is
  • 06:49that these individuals at the tails are individuals
  • 06:52who just had a few at-bats.
  • 06:53They only went and attempted
  • 06:55to hit the ball a small number of times.
  • 06:57And so, maybe these individuals only had two at-bats
  • 07:01and they missed it both times,
  • 07:02they got injured or they weren't selected
  • 07:04or, for whatever reason, they didn't hit
  • 07:06the ball many times...
  • 07:07They didn't go to at bat many times
  • 07:09and so, their batting average was empirically zero.
  • 07:12Think of that as the maximum likelihood estimate.
  • 07:15But if you wanted to predict
  • 07:17what they would do, say next season,
  • 07:19if you gave them more at bats
  • 07:21in the long run,
  • 07:22zero would be a bad estimate for obvious reasons.
  • 07:27And the same applies to these individuals up here
  • 07:30with very big batting averages.
  • 07:31They also had relatively few at-bats
  • 07:34and they just happened to hit it above 0.4 of the time
  • 07:38out of the at-bats.
  • 07:39And the individuals who had lots of at-bats are
  • 07:41all in the middle here.
  • 07:44So these are binomial observations, basically,
  • 07:45and the ones who have small N are
  • 07:49more likely to be in the tails
  • 07:50and the ones we're big N are
  • 07:51all going to be around in the middle here.
  • 07:53So what would we do?
  • 07:54What would we want to do
  • 07:55if we wanted to estimate,
  • 07:57for example, for this individual,
  • 07:59their batting average for next season?
  • 08:01If we were gonna predict what they were gonna get.
  • 08:04Well, we would definitely want to estimate
  • 08:08something closer to the average batting average than 0.4.
  • 08:13That's the intuition.
  • 08:14And one way to frame that problem is that...
  • 08:18So sorry.
  • 08:18So this is the basic idea of shrinkage.
  • 08:20We would want to shrink these estimates towards,
  • 08:23in this case, towards the mean.
  • 08:25So how are we gonna do that?
  • 08:27Well, one way to think about it is...
  • 08:31Sorry, let me just...
  • 08:33Yes.
  • 08:36Sorry, just getting my slides.
  • 08:37Okay, so here, the red line represents
  • 08:40some underlying distribution
  • 08:44of actual batting averages.
  • 08:46So conceptually, some distribution
  • 08:47of actual batting averages among individuals
  • 08:53in this kind of league.
  • 08:54So the red line, from a Bayesian point of view,
  • 08:57kind of represents a sensible prior
  • 09:00for any given individual's batting average
  • 09:02before we saw that data.
  • 09:03So think of the red line as representing
  • 09:05the variation or the distribution
  • 09:08of actual batting averages among players.
  • 09:12And in fact, what we've done here is estimate
  • 09:17that red line from the data.
  • 09:21That's the empirical part
  • 09:23of the empirical Bayes.
  • 09:25The empirical part of empirical Bayes is that
  • 09:27the red line which we're going to use
  • 09:28as a prior for any given player was
  • 09:30actually estimated from all the data.
  • 09:33And the basic idea is
  • 09:34because we know what the variance
  • 09:36of a binomial distribution is,
  • 09:38we can kind of estimate
  • 09:40what the overall distribution
  • 09:42of the underlying Ps in these binomials looks like,
  • 09:46taking account of the fact that the histogram is
  • 09:49a set of noisy observations of those underlying Ps.
  • 09:53Every bat...
  • 09:54Basically, every estimated batting average is
  • 09:58a noisy estimate of the true batting average
  • 10:00with the noise depending on
  • 10:01how many at-bats they have.
  • 10:03So once we've estimated that red line,
  • 10:07that prior,
  • 10:08we can compute the posterior
  • 10:10for each individual based on that prior.
  • 10:12And when we do that,
  • 10:13this is a histogram of the posterior means.
  • 10:16So these are, if you like,
  • 10:18shrunken estimates of the batting average
  • 10:20for each individual.
  • 10:21And you can see that the individuals
  • 10:22who had batting averages of zero got shrunk
  • 10:25all the way over somewhere here.
  • 10:27And that's because their data really...
  • 10:29Although, the point estimate was zero,
  • 10:32they had very few at bats.
  • 10:33So the information in that data are very slim,
  • 10:38very little information.
  • 10:40And so, the prior dominates
  • 10:41when you're looking at the posterior distribution
  • 10:43for these individuals.
  • 10:44Whereas individuals in the middle
  • 10:46who have more at-bats,
  • 10:47will have the estimate that is less shrunken.
  • 10:52So that's gonna be a theme we'll come back to later.
  • 10:55So how do we form...
  • 10:57That's a picture.
  • 10:59How do we formulate that?
  • 11:00So those were binomial data,
  • 11:03I'm gonna talk about normal data.
  • 11:05So don't get confused by that.
  • 11:08I'm just going to assume normality. You could do
  • 11:10the same thing for a binomial,
  • 11:12but I think the normal is a more generally useful
  • 11:16and convenient way to go.
  • 11:20So here's a normal means model again
  • 11:23and the idea is that
  • 11:25we're going to assume that the thetas come
  • 11:26from some prior distribution, G,
  • 11:29that was the red line in my example,
  • 11:31and we're going to estimate G
  • 11:33by maximum likelihood essentially.
  • 11:34So we're going to use all the X's,
  • 11:36integrating out theta
  • 11:38to obtain a maximum likelihood estimate for G.
  • 11:40That's stage one,
  • 11:42that's estimating that red line.
  • 11:43And then stage two is
  • 11:44to compute the posterior distribution
  • 11:46for each batting average,
  • 11:48or whatever theta J we're interested in,
  • 11:51taking into account that estimated prior
  • 11:53and the data on the individual J.
  • 11:56So that's the formalization of these ideas.
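In symbols, one compact way to write those two stages (notation added here, following the slides): first estimate the prior by marginal maximum likelihood,

$$\hat{G} \;=\; \arg\max_{G \in \mathcal{G}} \; \sum_{j=1}^{n} \log \int N(x_j \mid \theta, s_j^2)\, dG(\theta),$$

and then, for each $j$, compute the posterior under that estimated prior,

$$p(\theta_j \mid x_j, \hat{G}) \;\propto\; \hat{g}(\theta_j)\, N(x_j \mid \theta_j, s_j^2).$$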
  • 12:01And these posterior distributions are gonna be shrunk
  • 12:03towards the prior, or the prior mean.
  • 12:07So what kind of...
  • 12:09So I guess I've left unspecified here,
  • 12:12what family of priors should we consider for G?
  • 12:17So a commonly used prior distribution is
  • 12:21this so-called point-normal,
  • 12:23or sometimes called spike and slab prior distribution.
  • 12:28And these are...
  • 12:28Sorry, I should say,
  • 12:29I'm going to be thinking a lot about problems
  • 12:31where we want to induce sparsity.
  • 12:33So in baseball, we were shrinking towards the mean
  • 12:37but in many applications,
  • 12:38the natural point to shrink towards,
  • 12:40the natural prior mean, if you like, is zero
  • 12:42in situations where we expect effects
  • 12:46to be sparse, for example.
  • 12:47So I'm gonna be talking mostly about that situation,
  • 12:50although the ideas are more general.
  • 12:51And so, I'm going to be focusing
  • 12:52on the sparsity inducing choices of prior family.
  • 12:56And so, one commonly used one is this point normal
  • 12:59where there's some mass pi zero
  • 13:02exactly at zero,
  • 13:04and then the rest of the mass is
  • 13:06normally distributed about zero.
  • 13:09So the commonly used one.
  • 13:11In fact, it turns out,
  • 13:13and this is kind of interesting I think,
  • 13:15that it can be easier to do the computations
  • 13:17for more general families.
  • 13:20So for example,
  • 13:22just take the non-parametric family
  • 13:24that is the zero-centered scale mixtures of normals,
  • 13:27so we'll see that in a minute,
  • 13:28which includes all these distributions as special cases.
  • 13:31It's nonparametric.
  • 13:33It includes a point-normal here.
  • 13:35It also includes the T-distribution,
  • 13:36the Laplace distribution,
  • 13:38the horseshoe prior, if you've come across that,
  • 13:40this zero-centered scale mixture of normals
  • 13:42and the surprise is that it turns out
  • 13:45to be easier, in some sense,
  • 13:47to do the calculations for this family,
  • 13:49this more general family,
  • 13:50than this narrow family,
  • 13:52partly because it's a convex family.
  • 13:53So you can think of this as a kind of a convex relaxation
  • 13:57of the problem.
  • 13:58So all the computations become...
  • 13:59The optimizations you have to do in the simplest case
  • 14:01become convex when you use this family.
  • 14:05So let me say a bit more about that
  • 14:07for the non-parametric.
  • 14:09How do we actually do these non-parametric computations?
  • 14:12Well, we actually approximate
  • 14:14the non-parametric computation using a grid idea.
  • 14:17So here's the idea.
  • 14:20We modeled G, our prior,
  • 14:22as a mixture of...
  • 14:23I like to think of this K as being big.
  • 14:25A large number of normal distributions.
  • 14:28All of these normal distributions are centered at zero,
  • 14:31that's this zero here,
  • 14:32and they have a different variance.
  • 14:34Some of them have very small variances,
  • 14:36perhaps even one of them has a zero variance,
  • 14:38so that's the point mass at zero.
  • 14:39And the variance is sigma...
  • 14:41Think of Sigma squared K getting gradually bigger
  • 14:43until the last Sigma squared K is very big.
  • 14:47So we're just gonna use a lot of them.
  • 14:48Think of K as being, let's say 100
  • 14:50or 1,000 for the...
  • 14:52In practice, we find 20 is enough
  • 14:54but just think of it as being big
  • 14:57and spanning a lot of different variances,
  • 14:59going from very, very small,
  • 15:00to very, very big.
  • 15:01And then, estimating G just comes down
  • 15:05to estimating these pis,
  • 15:06these mixture proportions.
  • 15:09And that, then of course,
  • 15:11is a finite dimensional optimization problem
  • 15:13and in the normal means model,
  • 15:15it's a convex...
  • 15:16Well actually, for any mixture,
  • 15:18it's a convex problem,
  • 15:19and so there are efficient ways to find
  • 15:22the MLE for pi, given the grid of variances.
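In symbols, the grid prior being described is (notation added here):

$$g(\theta) \;=\; \sum_{k=1}^{K} \pi_k\, N(\theta \mid 0, \sigma_k^2), \qquad \sigma_1^2 \le \sigma_2^2 \le \cdots \le \sigma_K^2,$$

with $\sigma_1^2 = 0$ giving the point mass at zero if desired; the grid of variances is fixed in advance and only the weights $\pi_k$ are estimated.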
  • 15:27So let's just illustrate what's going on here.
  • 15:30Here's a grid of just three normals.
  • 15:33The one in the middle has the smallest variance,
  • 15:35the one over here has the biggest variance.
  • 15:37And we can get a mixture of those,
  • 15:39looks like that.
  • 15:40So you can see this is kind of a spiky distribution
  • 15:42but also with a long tail,
  • 15:44even with just a mixture of three distributions.
  • 15:47And so, the idea is that you can get
  • 15:49quite a flex...
  • 15:51It's a flexible family
  • 15:52by using a larger number of variances than three.
  • 15:56You can imagine you can get distributions
  • 15:59that have all sorts of spikiness
  • 16:01and long-tailed behavior.
  • 16:06So maybe just to fill in the details here;
  • 16:09with that prior as a mixture of normals,
  • 16:13the marginal distribution, P of X,
  • 16:15integrating out theta is analytic
  • 16:18because the sum of normals is normal.
  • 16:21So if you have a normally distributed variable
  • 16:23and then you have another variable
  • 16:25that's a normal error on top of that,
  • 16:27you get a normal.
  • 16:28So the marginal is a mixture of normals
  • 16:32that's very simple to work with
  • 16:34and estimating pi is a convex optimization problem.
  • 16:38You can do it.
  • 16:39You can do an EM algorithm
  • 16:40but convex methods,
  • 16:41as pointed out by Koenker and Mizera,
  • 16:43can be a lot more reliable and faster.
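Here is a minimal sketch of that fitting step, assuming a fixed grid of standard deviations and using a plain EM loop rather than the faster convex solvers mentioned above; the function and variable names are illustrative, not the actual ashr implementation:

```python
import numpy as np
from scipy.stats import norm

def fit_grid_mixture(x, s, sigma_grid, n_iter=200):
    """Estimate mixture weights pi for the prior sum_k pi_k N(0, sigma_k^2) by EM,
    using the analytic marginal likelihood x_j ~ sum_k pi_k N(0, sigma_k^2 + s_j^2)."""
    x, s, sigma_grid = np.asarray(x), np.asarray(s), np.asarray(sigma_grid)
    # L[j, k] = N(x_j; 0, sigma_k^2 + s_j^2), the marginal likelihood under component k
    sd = np.sqrt(sigma_grid[None, :] ** 2 + s[:, None] ** 2)
    L = norm.pdf(x[:, None], loc=0.0, scale=sd)
    pi = np.full(len(sigma_grid), 1.0 / len(sigma_grid))
    for _ in range(n_iter):
        W = L * pi                            # E-step: unnormalized responsibilities
        W /= W.sum(axis=1, keepdims=True)
        pi = W.mean(axis=0)                   # M-step: update the mixture weights
    return pi
```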
  • 16:49Okay, so let's just illustrate those ideas again.
  • 16:52Here's a potential prior distribution
  • 16:58and here's a likelihood.
  • 17:00So this is like a likelihood from a normal...
  • 17:02This is an estimate...
  • 17:04Think of this as a likelihood
  • 17:05for theta J in a normal means model.
  • 17:08So maybe XJ was one and a half or something
  • 17:11and SJ was, I don't know,
  • 17:12something like a half or something
  • 17:14or a half squared.
  • 17:17So this is meant to represent the likelihood.
  • 17:19So what does the posterior look like
  • 17:21when we combine this prior,
  • 17:23the black line,
  • 17:24with this likelihood, the red line?
  • 17:26it looks like this green line here.
  • 17:28So what you can see is going on here is
  • 17:31that you get shrinkage towards the mean, right?
  • 17:34But because the black line is long-tailed
  • 17:37because of the prior in this case has a long tail,
  • 17:40and because the red line...
  • 17:41The likelihood lies quite a ways in the tail,
  • 17:44the spiky bit at zero doesn't have very much impact
  • 17:48because it's completely...
  • 17:49Zero is basically inconsistent with the data
  • 17:52and so the posterior looks approximately normal.
  • 17:54It's actually a mixture of normals
  • 17:56but it looks approximately normal 'cause the weight
  • 17:58on the point mass at zero is very, very small.
  • 18:02Whereas if a...
  • 18:04Here's a different example,
  • 18:05the black line is covered
  • 18:08by the green line this time because it's...
  • 18:09So I plotted all three lines on the same plot here.
  • 18:12The black line is...
  • 18:13Think of it as pretty much the green line.
  • 18:14It's still the same spiky prior
  • 18:16but now the likelihood is much flatter.
  • 18:18The XJ is the same.
  • 18:20Actually, it's one and a half
  • 18:21but we have an SJ that's much bigger.
  • 18:23So what happens here is that
  • 18:25the prior dominates because the likelihood's
  • 18:27relatively flat,
  • 18:29and so the posterior looks pretty much like the prior
  • 18:31and you get very strong shrinkage.
  • 18:33So think of this as corresponding
  • 18:35to those individuals who had very few at-bats,
  • 18:38their data are very imprecise,
  • 18:41and so their posterior, the green line,
  • 18:43looks very like the prior, the black line.
  • 18:46Okay, so we're gonna shrink those observations more.
  • 18:50So the key point here, I guess,
  • 18:52is that the observations with larger standard error,
  • 18:55larger SJ,
  • 18:58get shrunk more.
  • 19:00I should say "larger standard deviation" get shrunk more.
  • 19:04Here's another intermediate example
  • 19:07where the red line...
  • 19:08The likelihood's kind of not quite enough.
  • 19:11It illustrates the idea that the posterior could be bimodal
  • 19:15because the prior and the likelihood
  • 19:18have weight in different places.
  • 19:20So you can get different kinds of shrinkage depending on
  • 19:23how spiky the prior is,
  • 19:24how long-tailed the prior is,
  • 19:25how flat the likelihood is etc.
  • 19:35So obviously the shrinkage,
  • 19:36the amount of shrinkage you get,
  • 19:37depends on the prior, G,
  • 19:39which you're gonna estimate from the data.
  • 19:41It also depends on the standard error
  • 19:43or the standard deviation, SJ.
  • 19:46And one way to summarize this kind of the behavior,
  • 19:48the shrinkage behavior,
  • 19:50is to focus on how the posterior mean changes with X.
  • 19:55So we can define this operator here,
  • 19:57S-G-S of X,
  • 19:59as the posterior mean of theta J,
  • 20:04given the prior and its variance or standard deviation
  • 20:08and that we observed XJ is equal to X.
  • 20:14I'm gonna call this the shrinkage operator
  • 20:17for the prior, G,
  • 20:18and standard deviation, S.
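A small sketch of that shrinkage operator for the grid prior described earlier, i.e. the posterior mean of theta given x ~ N(theta, s^2) and a zero-centered scale mixture of normals as the prior; names are illustrative, not from any package:

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(x, s, pi, sigma_grid):
    """S_{G,s}(x): posterior mean of theta, given x ~ N(theta, s^2) and the prior
    G = sum_k pi_k N(0, sigma_k^2), a zero-centered scale mixture of normals."""
    sigma2 = np.asarray(sigma_grid, dtype=float) ** 2
    # posterior weight on component k is proportional to pi_k * N(x; 0, sigma_k^2 + s^2)
    w = np.asarray(pi) * norm.pdf(x, loc=0.0, scale=np.sqrt(sigma2 + s**2))
    w /= w.sum()
    # within component k, the posterior mean of theta is sigma_k^2 / (sigma_k^2 + s^2) * x
    return float(np.sum(w * sigma2 / (sigma2 + s**2) * x))

# A spiky-but-long-tailed prior: most mass at or near zero, a little mass far out.
pi, grid = np.array([0.7, 0.2, 0.1]), np.array([0.0, 0.5, 4.0])
print(posterior_mean(1.5, 0.5, pi, grid))  # modest shrinkage: the likelihood is precise
print(posterior_mean(1.5, 3.0, pi, grid))  # strong shrinkage: the likelihood is flat
```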
  • 20:22Okay, so we could just plot
  • 20:24some of these shrinkage operators.
  • 20:25So the idea here is...
  • 20:26Sorry, this slide has B instead of X.
  • 20:30Sometimes I use B and sometimes I use X.
  • 20:34I've got them mixed up here, sorry.
  • 20:35So think of this as X
  • 20:37and this is S of X.
  • 20:39So these different lines here correspond
  • 20:43to different priors.
  • 20:44So the idea is that by using different priors,
  • 20:47we can get different types of shrinkage behavior.
  • 20:50So this prior here shrinks very strongly to zero.
  • 20:54This green line shrinks very strongly to zero
  • 20:57until B exceeds some value around five,
  • 21:01at which point it hardly shrinks at all.
  • 21:04So this is kind of a prior that has
  • 21:07kind of a big spike near zero.
  • 21:10But also a long tail,
  • 21:11such that when you get far enough in the tail,
  • 21:15you start to be convinced
  • 21:17that there's a real signal here.
  • 21:18So you can think of that
  • 21:19as this kind of...
  • 21:20This is sometimes called...
  • 21:22This is local shrinkage
  • 21:24and this is global.
  • 21:26So you get very strong local shrinkage towards zero
  • 21:29but very little shrinkage
  • 21:31if the signal is strong enough.
  • 21:33That kind of thing.
  • 21:34But the real point here is that
  • 21:35by using different priors,
  • 21:37these different scale mixture of normal priors,
  • 21:39you can get very different looking shrinkage behaviors.
  • 21:43Ones that shrink very strongly to zero
  • 21:45and then stop shrinking
  • 21:46or ones that shrink a little bit all the way, etc.
  • 21:52And so, if you're familiar with other ways
  • 21:55of doing shrinkage analysis,
  • 21:56and this is one of them,
  • 21:58or shrinkage,
  • 21:59is to use a penalized likelihood.
  • 22:01Then you can try and draw a parallel
  • 22:04and that's what I'm trying to do here.
  • 22:05Draw a parallel between the Bayesian method
  • 22:08and the penalized likelihood-based approaches
  • 22:11to inducing shrinkage or sparsity.
  • 22:15Another way to induce shrinkage is to essentially...
  • 22:18This is the kind of normal log likelihood here
  • 22:21and this is a penalty here
  • 22:23that you add for this.
  • 22:25This could be an L1 penalty
  • 22:26or an L2 penalty or an L0 penalty,
  • 22:28or some other kind of penalty.
  • 22:30So there's a penalty function here.
  • 22:32And you define the estimate
  • 22:34as the value that minimizes
  • 22:36this penalized log likelihood.
  • 22:38Sorry, yeah, this is a negative log likelihood.
  • 22:41Penalized least squares, I guess this would be.
  • 22:45Okay, so now H is a penalty function here
  • 22:49and Lambda is a tuning parameter
  • 22:51that says how strong, in some sense, the penalty is.
  • 22:55And these are also widely used
  • 22:58to induce shrinkage,
  • 22:59especially in regression contexts.
  • 23:02And so, here are some commonly used shrinkage operators,
  • 23:06corresponding to different penalty functions.
  • 23:09So this green line is what's called
  • 23:12the hard thresholding,
  • 23:16which corresponds to an L0 penalty.
  • 23:19If you don't know what that means, don't worry.
  • 23:21But if you do, you make that connection.
  • 23:25And the red line here is the L1 penalty
  • 23:28or soft thresholding.
  • 23:29And these two other ones here are particular instances
  • 23:33of some non-convex penalties that are used
  • 23:36in regression context,
  • 23:37particularly in practice.
  • 23:39And I guess that the point here is
  • 23:42that, essentially, different prior distributions
  • 23:45in the normal means model can lead
  • 23:48to shrinkage operators, shrinkage behavior
  • 23:51that looks kind of similar
  • 23:53to each of these different types of penalty.
  • 23:58So you can't actually mimic the behavior exactly.
  • 24:03I've just...
  • 24:05Or actually, my student, (indistinct) Kim,
  • 24:07chose the priors to visually closely match these
  • 24:11but you can't get...
  • 24:12Some of these have kinks and stuff
  • 24:13that you can't actually, formally, exactly mimic
  • 24:18but you can get qualitatively similar shrinkage behavior
  • 24:21from different priors
  • 24:23as different penalty functions.
  • 24:24So you should think about the different priors
  • 24:26as being analogous to different penalty functions.
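For reference, minimal versions of the two classic penalty-based operators mentioned above (soft thresholding for the L1 penalty, hard thresholding for the L0 penalty); a sketch added here, not from the talk:

```python
import numpy as np

def soft_threshold(x, lam):
    """L1 (lasso) shrinkage operator: argmin_t 0.5*(x - t)**2 + lam*|t|."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    """L0 shrinkage operator: keep x unchanged if |x| is large enough, else set it to 0."""
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

x = np.linspace(-5, 5, 11)
print(soft_threshold(x, 1.0))   # shrinks everything toward 0 by lam
print(hard_threshold(x, 1.0))   # all-or-nothing: no shrinkage once past the threshold
```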
  • 24:30And so, the key...
  • 24:33How does EB, empirical Bayes shrinkage,
  • 24:35differ from, say, these kinds of penalty-based approaches,
  • 24:40which I should say are maybe
  • 24:41more widely used in practice?
  • 24:44Well, so shrinkage is determined by the prior, G,
  • 24:49which we estimate in an empirical Bayes context
  • 24:52by maximum likelihood.
  • 24:53Whereas in typical shrinkage...
  • 24:57Sorry, typical penalty-based analyses,
  • 25:00people use cross validation to estimate parameters.
  • 25:03And the result is that cross-validation is fine
  • 25:07for estimating one parameter
  • 25:08but it becomes quite cumbersome
  • 25:10to estimate two parameters,
  • 25:12and really tricky to estimate three or four parameters
  • 25:14'cause you have to go and do a grid of different values
  • 25:17and do a lot of cross-validations
  • 25:18and start estimating all these different parameters.
  • 25:22So the point here is really
  • 25:23because we estimate G by maximum likelihood,
  • 25:26we can actually have a much more flexible family in practice
  • 25:29that we can optimize over more easily.
  • 25:33It's very flexible,
  • 25:34you can mimic a range of penalty functions
  • 25:36so you don't have to choose
  • 25:37whether to use L1 or L2 or L0.
  • 25:40You can essentially estimate
  • 25:42over these non-parametric prior families.
  • 25:44Think of that as kind of deciding automatically
  • 25:47whether to use L0, L1, L2
  • 25:49or some kind of non-convex penalty,
  • 25:51or something in between.
  • 25:53And the posterior distribution, of course then,
  • 25:58another nice thing is that it gives not
  • 25:59only the point estimates
  • 26:01but, if you like, it also gives shrunken interval estimates
  • 26:04which are not yielded by a penalty-based approach.
  • 26:08So I guess I'm trying to say
  • 26:09that there are potential advantages
  • 26:11of the empirical Bayes approach
  • 26:13over the penalty-based approach.
  • 26:16And yeah, although I think,
  • 26:19people have tried, particularly Efron has highlighted
  • 26:21the potential for empirical Bayes
  • 26:23to be used in practical applications,
  • 26:25largely, in the practical applications
  • 26:27I've seen, empirical Bayes shrinkage hasn't
  • 26:29really been used very, very much.
  • 26:32So that's the goal,
  • 26:34is to change that.
  • 26:37So before I talk about examples,
  • 26:39I guess I will pause for a moment
  • 26:41to see if there are any questions.
  • 26:54And I can't see the chat for some reason
  • 26:56so if anyone...
  • 26:57So please unmute yourself
  • 26:58if you have a question.
  • 27:00- I don't think people are (indistinct)
  • 27:04every question in the chat.
  • 27:05At least, I didn't see any. - Good.
  • 27:07Okay, thank you.
  • 27:10It's all clear.
  • 27:12I'm happy to go on but
  • 27:14I just wanna...
  • 27:25Okay, so we've been trying to...
  • 27:27My group has been trying to think about
  • 27:28how to use these ideas,
  • 27:30make these ideas useful in practice
  • 27:32for a range of practical applications.
  • 27:35We've done work on multiple testing,
  • 27:37on high dimensional linear regression,
  • 27:40and also some on matrix factorization.
  • 27:43From previous experience,
  • 27:44I'll probably get time to talk about the first two
  • 27:45and maybe not the last one,
  • 27:47but there's a pre-print on the arXiv.
  • 27:49You can see if you're interested
  • 27:50in matrix factorization.
  • 27:51Maybe I'll get to talk about that briefly.
  • 27:55But let me talk about multiple testing first.
  • 27:59So the typical multiple testing setup,
  • 28:02where you might typically, say,
  • 28:05apply a Benjamini-Hochberg type procedure is
  • 28:08you've got a large number of tests,
  • 28:09So J equals one to N,
  • 28:11and test J yields a P value, PJ,
  • 28:15and then you reject all tests with
  • 28:16some PJ less than a threshold gamma,
  • 28:19where that threshold is chosen
  • 28:21to control the FDR in a frequented sense.
  • 28:23So that's the typical setup.
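For readers who have not seen it, a minimal sketch of the Benjamini-Hochberg step-up rule just described (added here for concreteness; names are illustrative):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values, where k is
    the largest index with p_(k) <= (k / n) * alpha."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])       # largest index meeting the bound
        reject[order[: k + 1]] = True        # reject everything up to and including it
    return reject
```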
  • 28:26So how are we going to apply
  • 28:27the normal means model to this problem?
  • 28:33Okay, well, in many applications,
  • 28:37not all but in many,
  • 28:38the P values are derived from
  • 28:40some kind of effect size estimate,
  • 28:41which I'm going to call "Beta hat J,"
  • 28:45which have standard errors, SJ,
  • 28:47that satisfy approximately, at least,
  • 28:49that Beta J hat is normally distributed
  • 28:53about the true value Beta J
  • 28:55with some variance given it by SJ.
  • 28:58So in a lot...
  • 29:01I work a lot in genetic applications.
  • 29:03So in genetic applications,
  • 29:05we're looking at different genes here.
  • 29:07So Beta J hat might be the estimate
  • 29:11of the difference in expression, let's say,
  • 29:13of a gene, J, between, say, males and females.
  • 29:18And Beta J would be the true difference
  • 29:21at that gene.
  • 29:22And you're interested in identifying
  • 29:25which genes are truly different...
  • 29:28Have a different mean expression
  • 29:30between males and females here.
  • 29:32And the reason that SJ is approximately known is
  • 29:36because you've got multiple males
  • 29:38and multiple females that you're using
  • 29:41to estimate this difference.
  • 29:43And so, you get an estimated standard error
  • 29:45of that Beta hat as well.
  • 29:48And so, once you've set the problem up like this,
  • 29:50of course, it looks suddenly like a normal means problem
  • 29:55and we can kind of apply
  • 29:57the empirical Bayes normal means idea.
  • 30:00We're gonna put a prior on the Beta Js
  • 30:03that is sparsity inducing.
  • 30:05That is, it's kind of centered at zero,
  • 30:06maybe it's got a point mass at zero.
  • 30:08But we're gonna estimate that prior from the data.
  • 30:14Okay.
  • 30:17And so, not only can you get posterior means for Beta,
  • 30:23as I said, you can get posterior interval estimates.
  • 30:25So you can kind of do things
  • 30:27like compute the posterior in 90% credible interval,
  • 30:31given that prior and the likelihood
  • 30:33for each Beta J
  • 30:34and we could reject, for example,
  • 30:35if the interval does not contain zero.
  • 30:39And I'm not going to talk about this in detail
  • 30:42because the details are in
  • 30:44a biostatistics paper from 2017.
  • 30:46I should say that the idea of using empirical Bayse for FDR
  • 30:50actually dates back to before Benjamini and Hoffberg.
  • 30:54Duncan Thomas has a really nice paper
  • 30:57that was pointed out to me by John Witty
  • 30:59that actually contains these basic ideas
  • 31:03but not nice software implementation,
  • 31:07which maybe explains why it hasn't caught on
  • 31:09in practice yet.
  • 31:10Efron's also been a pioneer in this area.
  • 31:13So...
  • 31:16Okay, so I don't want to dwell on that
  • 31:18because, actually, I think I'll just summarize
  • 31:21what I think is true compared with Benjamini-Hochberg.
  • 31:25You get a bit of an increase in power
  • 31:27by using an empirical Bayse approach.
  • 31:29The Benjamini-Hochberg approach is
  • 31:32more robust to correlated tests though,
  • 31:34so the empirical Bayse normal means model does assume
  • 31:38that the tests are independent
  • 31:39and, in practice, we have seen
  • 31:43that correlations can cause problems.
  • 31:45If you're interested in that,
  • 31:46I have a pre-print with Lei Sun on my website.
  • 31:49But the empirical Bayse normal means
  • 31:52also provides these interval estimates,
  • 31:55which is kind of nice.
  • 31:56Benjamini-Hochberg does not.
  • 31:58So there are some advantages
  • 31:59of the empirical Bayes approach
  • 32:01and maybe some disadvantages compared
  • 32:04with Benjamini-Hochberg.
  • 32:05But I think that the real benefit
  • 32:06of the empirical Bayse approach
  • 32:08actually comes when we look at multi-variate extensions
  • 32:11of this idea.
  • 32:12So I just wanted to briefly highlight those
  • 32:14and spend some time on those.
  • 32:16So here's the multi-variate version
  • 32:19of the empirical Bayse normal means models.
  • 32:24And now, my Beta Js are a vector of observation.
  • 32:29So I think of this as measuring, say,
  • 32:32gene J in multiple different tissues.
  • 32:34Think of different tissues.
  • 32:36You look at at heart
  • 32:37you look at lung,
  • 32:38you look brain,
  • 32:39you look at the spleen.
  • 32:42In fact, we've got 50 different tissues in the example
  • 32:46I'm gonna show in a minute.
  • 32:48So we've measured some kind of effect
  • 32:52in each gene, in each of these 50 different tissues
  • 32:56and we want to know where the effects are...
  • 33:01Which genes show effects in which tissues.
  • 33:04So Beta J is now a vector of length R,
  • 33:08the number of tissues.
  • 33:09R is 50 in our example.
  • 33:11And so you've got...
  • 33:12We're gonna assume that the estimates are
  • 33:16normally distributed with mean,
  • 33:17the true values and some variance,
  • 33:19covariance matrix now,
  • 33:21which we're going to assume, for now, is known.
  • 33:23That's actually a little trickier
  • 33:26but I'm gonna gloss over that for...
  • 33:28If you want to see details,
  • 33:29take a look at the paper.
  • 33:31I just wanna get the essence of the idea across.
  • 33:33We're still going to assume that Beta J comes
  • 33:35from some prior, G,
  • 33:36and we're still gonna use a mixture of normals,
  • 33:38but now we're using a mixture
  • 33:39of multi-variate normals.
  • 33:41And unlike the univariate case,
  • 33:43we can't use a grid of...
  • 33:45We can't use a grid of values
  • 33:47that span all possible covariance matrices here.
  • 33:50It's just too much.
  • 33:52So we have to do something to estimate
  • 33:53these covariance matrices,
  • 33:55as well as estimate the pis.
  • 33:57And again, if you want to see the details,
  • 34:00take a look at Urbut et al.
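To make the multivariate computation concrete, here is a sketch of the per-observation posterior, assuming the mixture weights pi and covariances U_k have already been estimated (this is an illustration of the model, not the mashr package itself):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mv_posterior(bhat, V, pi, U_list):
    """Posterior for beta in the model bhat ~ N(beta, V), beta ~ sum_k pi_k N(0, U_k),
    with V and the U_k treated as known.  Returns the posterior component weights,
    the overall posterior mean, and the per-component means and covariances."""
    R = len(bhat)
    w, means, covs = [], [], []
    for p_k, U in zip(pi, U_list):
        w.append(p_k * multivariate_normal.pdf(bhat, mean=np.zeros(R), cov=U + V))
        A = U @ np.linalg.inv(U + V)        # "shrinkage matrix" for this component
        means.append(A @ bhat)              # posterior mean of beta given component k
        covs.append(U - A @ U)              # posterior covariance given component k
    w = np.array(w) / np.sum(w)
    post_mean = sum(wk * m for wk, m in zip(w, means))
    return w, post_mean, means, covs
```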
  • 34:03But let me just illustrate
  • 34:04the idea of what's going on here,
  • 34:06or what happens when you apply this method
  • 34:08to some data.
  • 34:10So this is...
  • 34:10I said 50,
  • 34:11we have 44 tissues in this particular example.
  • 34:14So each row here is a tissue.
  • 34:17These yellow ones here are brain tissues,
  • 34:20different brain tissues,
  • 34:22and I think we'll see one later that's blood.
  • 34:25I think this one might be blood.
  • 34:26Anyway, each one is a tissue;
  • 34:28lung, blood, etc.
  • 34:29You don't need to know which ones are which, for now.
  • 34:32And so, what we've done here is plot
  • 34:34the Beta hat and plus or minus two standard deviations
  • 34:38for each tissue at a particular...
  • 34:41In this case, a particular SNP, actually (indistinct).
  • 34:44So this is an eQTL analysis,
  • 34:46for those of you who know what that means.
  • 34:48If you don't, don't worry about it.
  • 34:50Just think of it as having an estimated effect
  • 34:53plus or minus two standard deviations
  • 34:55in 44 different tissues,
  • 34:58and we want to know which ones are quote,
  • 35:01"significantly different from zero."
  • 35:05And so what happens...
  • 35:10Sorry.
  • 35:11Didn't expect that to happen.
  • 35:17Sorry, okay.
  • 35:17Yeah, these are just...
  • 35:18These are just two examples.
  • 35:20So this is one example,
  • 35:22here's another example
  • 35:23where we've done the same thing.
  • 35:25Estimated effects, plus or minus two standard deviations.
  • 35:28So what you can see in this first one is
  • 35:30that it looks like, at least,
  • 35:31that the brain tissues have some kind of effect.
  • 35:33That's what you're supposed to see here.
  • 35:36And maybe there are some effects in other tissues.
  • 35:38There's a tendency for effects to be positive,
  • 35:40which might suggest that maybe everything has
  • 35:44a small effect to everywhere,
  • 35:45but particularly strong in the brain.
  • 35:47And whereas in this example,
  • 35:50this one appears to have an effect
  • 35:52in just one tissue.
  • 35:53This is the blood actually.
  • 35:54So this is an effect in blood
  • 35:56but mostly, it doesn't look like
  • 35:58there's an effect in other tissues.
  • 36:00But these, just to emphasize,
  • 36:01these are the raw data,
  • 36:03in the sense that they're the Beta hats
  • 36:04and the standard errors.
  • 36:05There's no shrinkage occurred yet.
  • 36:07But the idea is that the empirical Bayes approach takes
  • 36:11all these examples,
  • 36:12examples like this and examples like this,
  • 36:14to learn about what kinds of patterns are present
  • 36:17in the data.
  • 36:18That is, "What does G look like?"
  • 36:19So it learns from these examples
  • 36:21that there are some effects that look like
  • 36:24they're shared among the brain tissues,
  • 36:27and there are some effects that are...
  • 36:28These are actually somewhat rare
  • 36:31but rarely, there's an effect that's specific
  • 36:33to one tissue like, in this case, blood.
  • 36:35And it also learns, in this case actually,
  • 36:39that there's a lot of null things,
  • 36:41because there are a lot of null things as well.
  • 36:44So it puts lots of mass on the null as well
  • 36:46and that causes the shrinkage.
  • 36:48And then, having estimated those patterns from the data,
  • 36:51it computes posteriors.
  • 36:53And so, here's the data and then the posterior intervals
  • 36:58for the same...
  • 36:59For that first example.
  • 37:00And what you can see is that because of
  • 37:02the combining information across tissues,
  • 37:06you get standard errors that are getting smaller,
  • 37:09the brain estimates all get shrunk towards one another,
  • 37:12and all these...
  • 37:14There's some borrowing strength of information,
  • 37:16borrowing information across these tissues,
  • 37:18to make these these look like
  • 37:20some of them are kind of borderline significant.
  • 37:22Now, it looks like there's probably
  • 37:23an effect in every tissue
  • 37:25but a much stronger effect in brain.
  • 37:26Whereas this example here,
  • 37:29it recognizes that this looks like an effect
  • 37:32that's specific to blood.
  • 37:33And so, it shrinks everything else strongly towards zero
  • 37:36because it knows that most things are null,
  • 37:38it's learned that from the data,
  • 37:40but the blood estimate gets hardly shrunk at all.
  • 37:43We saw that kind of behavior where things
  • 37:44that are near zero can get shrunk towards zero,
  • 37:47whereas other things that are far
  • 37:48away don't get shrunk as much.
  • 37:50And it's really hard to do that kind of thing
  • 37:53without doing some kind of model-based analysis,
  • 37:57doing a Benjamini-Hochberg-type, non-model-based analysis
  • 38:01without making any assumptions
  • 38:03or making minimal assumptions,
  • 38:06it's very hard to capture this kind of thing, I think.
  • 38:09So I think the empirical Bayes approach
  • 38:11has big advantages in this setting.
  • 38:17I'll pause before I talk about regression.
  • 38:20Any questions there?
  • 38:24- So Matthew, I have some basic questions.
  • 38:26So in your multivariate multiple testing case,
  • 38:30I guess for each of the plots,
  • 38:32you are looking at maybe a particular gene's influence
  • 38:35on some... - Good, yeah.
  • 38:37Sorry, I did skip over it a bit.
  • 38:39So these are eQTLs.
  • 38:40So actually, what I'm plotting here is each...
  • 38:43This is a single SNP associated
  • 38:46with a single gene.
  • 38:48And this is,
  • 38:49"How associated is this SNP
  • 38:52"with this gene's expression level
  • 38:54"in the different brain tissues,
  • 38:56"in the blood tissue in lung and spleen, etc?"
  • 39:02The idea is that...
  • 39:04What the scientific goal is to understand
  • 39:07which genetic variants are impacting gene expression
  • 39:10in different tissues,
  • 39:12which might tell us something
  • 39:13about the biology of the tissues
  • 39:15and the regulation going on in the different tissues.
  • 39:18- Got it.
  • 39:19So in this case,
  • 39:20I don't think I fully understand
  • 39:23why it's multivariate multiple tests, not univariate
  • 39:27because you are looking at each gene
  • 39:28versus each SNP.
  • 39:32- Right, so sorry.
  • 39:33Think of J indexing eQTL.
  • 39:36So we've got 2 million potential eQTLs,
  • 39:39so that's the multiple part of it.
  • 39:41For 2 million potential eQTLs, that's...
  • 39:45And then each eQTL has data on 44 tissues,
  • 39:50so that's the multi-variate part of it.
  • 39:52(speaking over each other)
  • 39:53If you thought about it
  • 39:54in terms of say P values or maybe Z scores,
  • 39:57you have a matrix of Z scores.
  • 39:59There are two million rows
  • 40:01and there are 44 columns
  • 40:03and you have a Z score or a P value
  • 40:05for each element in that matrix,
  • 40:09and what we're assuming is that
  • 40:11the rows are independent,
  • 40:13which is not quite true but still,
  • 40:15we're assuming that the rows are independent
  • 40:17and the columns,
  • 40:18we're assuming that they can be correlated.
  • 40:20And in particular,
  • 40:21we're assuming that the...
  • 40:22Well, we're assuming that both
  • 40:24the measurements can be correlated,
  • 40:26so it's V,
  • 40:27but also that the effects can be correlated.
  • 40:30So that's to capture the idea
  • 40:31that there might be some effects
  • 40:33that are shared between say brain tissues--
  • 40:36- I see.
  • 40:37I see.
  • 39:38So this multivariate is different
  • 39:40from our usual notion where the multivariate
  • 39:43means multivariate SNPs.
  • 39:44So here, it's multivariate tissues.
  • 40:46I guess, are the samples from the same cohort?
  • 40:49- Yeah, so in this particular case,
  • 40:52the samples are from the same individuals.
  • 40:55So these different brain tissues...
  • 40:56There's overlap anyway, let's say.
  • 40:57And so, that's what causes this...
  • 41:00That causes headaches, actually, for this--
  • 41:02- Okay, got it, thanks. - Yeah.
  • 41:04Just to emphasize,
  • 41:05it doesn't have to be different tissues.
  • 41:07The whole method works
  • 41:08on any matrix of Z scores, basically.
  • 41:11As long as you think that the rows correspond
  • 41:14to different tests
  • 41:15and the columns correspond
  • 41:16to different, say, conditions for the same test.
  • 41:21So examples might be
  • 41:24you're looking at the same SNP
  • 41:25across lots of different phenotypes,
  • 41:27so looking at schizophrenia,
  • 41:29looking at bipolar,
  • 41:31looking at different diseases
  • 41:33or different traits,
  • 41:35and you can have a Beta hat for that SNP
  • 41:37and a standard error for that SNP
  • 41:39in every trait.
  • 41:40And you could try to learn,
  • 41:42"Oh look, there are some traits
  • 41:43"that tend to share effects
  • 41:45"and other traits that don't,"
  • 41:46or, often in experiments,
  • 41:49people treat their samples
  • 41:51with different treatments.
  • 41:52They challenge them with different viruses.
  • 41:54They look to see which things are being changed
  • 41:58when you challenge a cell with different viruses
  • 42:00or different heat shock treatments
  • 42:02or any kind of different treatment.
  • 42:03So yeah, basically, the idea is very generic.
  • 42:06The idea is if you've got a matrix of Z scores
  • 42:09where the effect, say, look likely to be shared
  • 42:12among column sometimes
  • 42:13and the rows are gonna be approximately independent,
  • 42:17or at least you're willing to assume that,
  • 42:20then you can apply the method.
  • 42:21- Okay, got it, thanks.
  • 42:22- So, actually, that's an important kind of...
  • 42:25Also, something that I've been thinking about a lot is
  • 42:28the benefits of modular or generic methods.
  • 42:34So if you think about what methods are applied
  • 42:36in statistics a lot,
  • 42:37you think T-test, linear regression.
  • 42:39These are all kind of very generic ideas.
  • 42:42They don't...
  • 42:44And Benjamini-Hochberg.
  • 42:45The nice thing about Benjamini-Hochberg is
  • 42:47you just need a set of P values
  • 42:48and you can apply Benjamini-Hochberg.
  • 42:50You don't have to worry too much
  • 42:52about where those P values came from.
  • 42:54So I think, for applications,
  • 42:56it's really useful to try to think about
  • 42:58what's the simplest type of data
  • 43:01you could imagine inputting into the procedure
  • 43:04in order to output something useful?
  • 43:06And sometimes, that involves making compromises
  • 43:09because to make a procedure generic enough,
  • 43:12you have to compromise on what...
  • 43:14On maybe what the details of what are going in.
  • 43:17So here, what we've compromised on is
  • 43:18that we take a matrix of Z scores,
  • 43:20or potentially Beta hats and their standard errors,
  • 43:23we can do either,
  • 43:24and that's the input.
  • 43:26So that makes it relatively generic.
  • 43:28You don't have to worry too much
  • 43:30about whether those Beta hats
  • 43:31and the standard errors, or the Z scores,
  • 43:33are coming from logistic regression
  • 43:35or linear regression,
  • 43:36or whether they're controlling for some covariates
  • 43:38or all sorts of...
  • 43:39From a mixed model, etc.
  • 43:41As long as they have the basic property that
  • 43:44the Beta hat is normally distributed
  • 43:47about the true Beta
  • 43:48with some variance that you are willing to estimate,
  • 43:51then you can go.
  • 43:58- Sorry, (indistinct).
  • 44:01A short question.
  • 44:02So in practice, how you choose...
  • 44:05How many mix...
  • 44:06How many distributions you want in the mixture
  • 44:08like the (indistinct)? - (indistinct)
  • 44:10Yeah, so great question.
  • 44:12And my answer, generally,
  • 44:14is just use as many as you want.
  • 44:16So as many as you can stomach.
  • 44:19The more you use, the slower it is.
  • 44:25And so, you might worry about over-fitting,
  • 44:27but it turns out that these procedures are
  • 44:29very robust to over-fitting
  • 44:30because of this fact that the mean is fixed at zero.
  • 44:34So all the components have a mean zero
  • 44:37and have some covariance
  • 44:39and because of that,
  • 44:40they have limited flexibility to overfit.
  • 44:45They're just not that flexible.
  • 44:48And in the univariate case,
  • 44:50that's even more obvious, I think.
  • 44:51That in the univariate case,
  • 44:53every one of those distributions,
  • 44:55any mixture of normals that are
  • 44:57all centered at zero is unimodal at zero
  • 45:01and has limited...
  • 45:02Can't have wiggly distributions
  • 45:04that are very spiky and overfitting.
  • 45:06So these methods are relatively immune
  • 45:09to overfitting in practice.
  • 45:12If you're worried about that,
  • 45:13you can do a test-train type thing
  • 45:15where you use half your tests to train,
  • 45:18and then you look at the log likelihood
  • 45:20out of sample on others,
  • 45:22and then tweak the number to avoid overfitting.
  • 45:27And we did do that early on in the methods
  • 45:30but we don't do it very often now,
  • 45:32or we only do it now when we're worried
  • 45:34'cause generally it seems like overfitting doesn't seem
  • 45:37to be a problem,
  • 45:38but if we see results are a little bit weird
  • 45:39or a bit concerning,
  • 45:41we try it to make sure we're not overfitting.
  • 45:46- Okay, thank you.
  • 45:47- I should say that, in the paper,
  • 45:49we kind of outlined some procedures we use
  • 45:51for estimating these variance, co-variance matrices
  • 45:54but they're not like...
  • 45:55They're kind of like...
  • 45:57The whole philosophy is that we could probably do better
  • 46:01and we're continuing to try and work
  • 46:03on better methods for estimating this
  • 46:05as we go forward.
  • 46:07So we're continually improving
  • 46:08the ways we can estimate this.
  • 46:16Okay, so briefly, I'll talk about linear regression.
  • 46:19So here's your standard linear regression where,
  • 46:21so we've N observations,
  • 46:23X is the matrix of covariates here,
  • 46:25B are the regression coefficients.
  • 46:27I'm kind of thinking of P as being big, potentially here.
  • 46:32And the errors normal.
  • 46:33And so, the empirical Bayes idea would be
  • 46:36to assume that the Bs come from
  • 46:38some prior distribution, G,
  • 46:40which comes from some family, curly G.
  • 46:42And what we'd like to do is estimate G
  • 46:45and then shrink the estimates of B,
  • 46:49using empirical Bayes type ideas
  • 46:52and posterior computations.
  • 46:55But it's not a simple normal means model here,
  • 46:58so how do we end up applying
  • 47:00the empirical Bayes methods to this problem?
  • 47:04Well, let's just...
  • 47:05I'm gonna explain our algorithm
  • 47:07by analogy with penalized regression algorithms
  • 47:11because the algorithm ends up looking very similar,
  • 47:13and then I'll tell you
  • 47:14what the algorithm is actually kind of doing.
  • 47:16So a penalized regression would solve this problem.
  • 47:19So if you've seen the Lasso before...
  • 47:22I hope many of you might have.
  • 47:23If you've seen the Lasso before,
  • 47:24this would be solving this problem
  • 47:26with H being the L1 penalty,
  • 47:29absolute value of B, right?
  • 47:31So this...
  • 47:32So what algorithm...
  • 47:34There are many, many algorithms to solve this problem
  • 47:37but a very simple one is coordinate ascent.
  • 47:40So essentially, for each coordi...
  • 47:42it just iterates the following.
  • 47:43For each coordinate,
  • 47:44you have some kind of current estimate for Bs.
  • 47:47(indistinct)
  • 47:48So what you do here is you form the residuals
  • 47:50by taking away the effects of all the Bs
  • 47:55except the one you're trying to update,
  • 47:57the one you're trying to estimate.
  • 47:59So X minus J here is all the covariates
  • 48:01except covariate J.
  • 48:02B minus J is all the corresponding coefficients.
  • 48:06So this is the residual.
  • 48:08RJ is the residual,
  • 48:09after removing all the current estimated effects
  • 48:12apart from the Jth one.
  • 48:14And then you basically compute an estimate
  • 48:18of the Jth effect by regressing those residuals on XJ.
  • 48:24And then you shrink that using a shrinkage operator
  • 48:29that we saw earlier.
  • 48:31Just to remind you
  • 48:33that a shrinkage operator is the one
  • 48:34that minimizes this penalized least squares problem.
  • 48:38And it turns out,
  • 48:39it's not hard to show that this is coordinate ascent
  • 48:45for minimizing this, penalized objective function.
  • 48:49And so every iteration of this decreases
  • 48:53that objective function,
  • 48:55or at least doesn't increase it.
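A minimal sketch of that coordinate-wise loop for the lasso (added here; it assumes the columns of X are standardized so that each x_j'x_j = n, which makes the update exact):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Coordinate-wise lasso: regress the partial residuals on each column of X in
    turn and soft-threshold the result (assumes each column satisfies x_j'x_j = n)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]                  # residuals without effect j
            ols = X[:, j] @ r_j / n                           # regress r_j on x_j
            b[j] = np.sign(ols) * max(abs(ols) - lam, 0.0)    # soft-threshold (L1 shrinkage)
    return b
```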
  • 48:58Okay.
  • 49:00Okay, so it turns...
  • 49:01So our algorithm looks very similar.
  • 49:03You still compute the residuals,
  • 49:06you compute a Beta hat
  • 49:08by regressing the residuals on XJ.
  • 49:10You also, at the same time,
  • 49:11compute a standard error,
  • 49:12which is familiar form.
  • 49:17And then you, instead of shrinking using
  • 49:21that penalized regression operator,
  • 49:23you use a...
  • 49:25Sorry, I should say,
  • 49:27this is assuming G is known.
  • 49:28I'm starting with G.
  • 49:29G is known.
  • 49:31So you can shrink...
  • 49:32Instead of using the penalty-based method,
  • 49:35you use the posterior mean shrinkage operator here
  • 49:38that I introduced earlier.
  • 49:40So it's basically exactly the same algorithm,
  • 49:41except replacing this penalty-based shrinkage operator
  • 49:47with an empirical Bayes
  • 49:48or a Bayesian shrinkage operator.
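And the empirical Bayes analogue, with G held fixed, is the same loop with the shrinkage step swapped out; this sketch reuses the posterior_mean helper from the earlier normal-means sketch and treats the residual variance as known:

```python
import numpy as np

def eb_coordinate_sweep(X, y, b, resid_var, pi, sigma_grid):
    """One sweep of the empirical Bayes analogue with G fixed: same residual and
    least-squares steps, but the shrinkage step is the posterior mean under the
    mixture prior (posterior_mean from the normal-means sketch above)."""
    n, p = X.shape
    for j in range(p):
        r_j = y - X @ b + X[:, j] * b[j]          # residuals excluding effect j
        d = X[:, j] @ X[:, j]
        bhat = X[:, j] @ r_j / d                  # simple-regression estimate of effect j
        se = np.sqrt(resid_var / d)               # its standard error
        b[j] = posterior_mean(bhat, se, pi, sigma_grid)   # Bayesian shrinkage step
    return b
```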
  • 49:54And so, you could ask what that's doing
  • 49:56and it turns out that what it's doing is
  • 49:57it's minimizing the Kullback-Leibler Divergence
  • 50:01between some approximate posterior, Q,
  • 50:05and the true posterior, P, here
  • 50:08under the constraint that this Q is factorized.
  • 50:13So this is what's called
  • 50:15a variational approximation
  • 50:17or a mean-field,
  • 50:17or fully factorized variational approximation.
  • 50:21If you've seen that before,
  • 50:22you'll know what's going on here.
  • 50:24If you haven't seen it before,
  • 50:25it's trying to find an approximation
  • 50:27to the posterior.
  • 50:29This is the true posterior,
  • 50:30it's trying to find an approximation
  • 50:31to that posterior that minimizes
  • 50:33the Kullback-Leibler Divergence
  • 50:35between the approximation
  • 50:36and the true value under in a simplifying assumption
  • 50:40that the posterior factorizes,
  • 50:41which, of course, it doesn't,
  • 50:42so that's why it's an approximation.
  • 50:44So that algorithm I just said is
  • 50:46a coordinate ascent algorithm
  • 50:48for maximizing F or minimizing the KL divergence.
  • 50:53So every iteration of that algorithm gets
  • 50:55a better estimate
  • 50:58of the posterior, essentially.
  • 51:02Just to outline
  • 51:03and just to give you the intuition
  • 51:06for how you could maybe estimate G,
  • 51:08this isn't actually quite what we do
  • 51:10so the details get a bit more complicated,
  • 51:13but just to give you an intuition
  • 51:14for how you might think that you can estimate G;
  • 51:18Every iteration of this algorithm computes a B hat
  • 51:21and a corresponding standard error,
  • 51:23so you could imagine...
  • 51:25These two steps here, you could imagine storing these
  • 51:28through the iterations
  • 51:29and, at the end,
  • 51:31you could apply the empirical Bayes normal means procedure
  • 51:35to estimate G from these B hats
  • 51:37and standard errors,
  • 51:38and something close to that kind of works.
  • 51:43The details are a bit more complicated than that.
  • 51:47So let me give you some kind of intuition
  • 51:51for what we're trying to achieve here based
  • 51:54on simulation results.
  • 51:55So these are some simulations we've done.
  • 51:57The covariates are all independent here.
  • 52:00The true prior is a point normal,
  • 52:02that means that most of the effects are zero.
  • 52:07Well, actually maybe here,
  • 52:09one of the effects is nonzero,
  • 52:11five of the effects are nonzero,
  • 52:1350 of the effects are nonzero
  • 52:15and 500 of the effects are nonzero.
  • 52:17And actually, there are 500 effects,
  • 52:18500 variables in this example.
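
A rough sketch of the kind of simulation being described; the sample size, effect scale, and noise level here are assumptions, not values from the talk:

```python
import numpy as np

def simulate(n=1000, p=500, s=5, effect_sd=1.0, noise_sd=1.0, seed=0):
    """Independent covariates, p = 500 variables, s of the effects nonzero."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                  # independent covariates
    beta = np.zeros(p)
    nonzero = rng.choice(p, size=s, replace=False)   # s in {1, 5, 50, 500}
    beta[nonzero] = rng.normal(0.0, effect_sd, size=s)
    y = X @ beta + rng.normal(0.0, noise_sd, size=n)
    return X, y, beta
```
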
  • 52:23So the X-axis here just shows the number
  • 52:25of non-zero coordinates
  • 52:27and the results I've shown here are the prediction error,
  • 52:30so we're focusing on prediction error,
  • 52:32the out of sample prediction error,
  • 52:33using three different penalty-based approaches.
  • 52:37The Lasso, which is this line,
  • 52:40the L0Learn, which is this line,
  • 52:43which uses the L0 penalty,
  • 52:45and Ridge,
  • 52:47which uses the L2 penalty.
  • 52:48So the important thing to know is that
  • 52:51the L0 penalty is really designed, if you like,
  • 52:55to do well under very sparse models.
  • 52:58So that's why it's got the lowest prediction error
  • 53:03when the model is very sparse,
  • 53:05but when the model is completely dense,
  • 53:07it does very poorly.
  • 53:08Whereas Ridge is designed much more to...
  • 53:14It's actually based on a prior
  • 53:15that the effects are normally distributed.
  • 53:17So it's much better at dense models
  • 53:19than sparse models.
  • 53:20And you can see that at least relative to L0Learn,
  • 53:23Ridge is much better for the dense case
  • 53:26but also much worse for the sparse case.
  • 53:30And then Lasso has some kind of ability
  • 53:32to deal with both scenarios,
  • 53:35but it's not quite as good as the L0 penalty
  • 53:37when things are very sparse,
  • 53:39and it's not quite as good as the Ridge penalty
  • 53:41when things are very dense.
  • 53:45So our goal is that by learning the prior G
  • 53:49from the data,
  • 53:50we can adapt to each of these scenarios
  • 53:53and get performance close to L0Learn
  • 53:55when the truth is sparse
  • 53:58and get performance close to Ridge regression
  • 54:01when the truth is dense.
  • 54:03And so, the red line here actually shows
  • 54:06the performance of our method.
  • 54:09And you can see, indeed,
  • 54:10we do even slightly better than L0Learn
  • 54:13in this part here
  • 54:14and slightly better than cross-validated Ridge regression
  • 54:18in this part here.
  • 54:19The difference between these two is just that
  • 54:21the Ridge regression is doing cross-validation
  • 54:23to estimate the tuning parameter
  • 54:25and we're using empirical Bayes maximum likelihood
  • 54:28to estimate it.
  • 54:28So that's just that difference there.
  • 54:30And the Oracle here is using the true...
  • 54:32You can do the Oracle computation
  • 54:35for the Ridge regression
  • 54:36with the true tuning parameter here.
  • 54:38I should say that maybe this...
  • 54:42Maybe I should just show you the results.
  • 54:45So here are a bunch of other penalties,
  • 54:47including elastic net, for example, which you might wonder about,
  • 54:49which is kind of a compromise
  • 54:51between L1 and L2.
  • 54:52And you can see, it does do the compromising
  • 54:56but it doesn't do as well
  • 54:57as the empirical Bayes approach.
  • 54:59And here are some other non-convex methods
  • 55:01that are more, again...
  • 55:04They're kind of more tuned to the sparse case
  • 55:06than to the dense case.
  • 55:09As promised, I'm gonna skip over
  • 55:12the matrix factorization
  • 55:14and just summarize to give time for questions.
  • 55:18So the summary is that
  • 55:21the empirical Bayes normal means model provides
  • 55:23a flexible and convenient way
  • 55:25to induce shrinkage and sparsity
  • 55:26in a range of applications.
  • 55:28And we've been spending a lot of time trying
  • 55:32to apply these methods
  • 55:33and provide software to do
  • 55:35some of these different things.
  • 55:36And there's a bunch of things on my publications page
  • 55:39and if you're interested in...
  • 55:40If you can't find what you're looking for,
  • 55:41just let me know, I'd be happy to point you to it.
  • 55:44Thanks very much.
  • 55:47- Thanks Matthew, that's a great talk.
  • 55:49I wonder whether the audience has
  • 55:51any questions for Matthew.
  • 55:57So I do have some questions for you.
  • 56:00So I think I really like the idea
  • 56:02of applying empirical Bayes to a lot of applications
  • 56:06and it really seems empirical Bayes has had great success.
  • 56:10But I do have a question or some doubt
  • 56:13about the inference part,
  • 56:15especially in that linear regression model.
  • 56:18So for the current work you have been doing,
  • 56:22you essentially shrink each
  • 56:24of the coefficients based on
  • 56:26their estimated values,
  • 56:28but in some applications,
  • 56:31such as a GWAS study or fine mapping,
  • 56:33different SNPs can have
  • 56:35very different LD structure.
  • 56:38So in this case,
  • 56:39how much can we trust the inference,
  • 56:43the p-values, from this (indistinct)?
  • 56:48- So, great question.
  • 56:51So let me just first...
  • 56:55First emphasize that the shrink...
  • 56:57The estimate here is being done removing the effects,
  • 57:01or the estimated effects,
  • 57:03of all the other variables.
  • 57:04So each iteration of this,
  • 57:06when you're estimating the effect
  • 57:07of SNP J, in your example,
  • 57:10you're taking the estimated effects
  • 57:11of the other variables into account.
  • 57:14So the LD structure, as you mentioned,
  • 57:18that's the correlation structure
  • 57:20for those who don't know,
  • 57:21between the Xs, is formally taken into account.
  • 57:24However, there is a problem with this approach
  • 57:26for very highly correlated variables.
  • 57:30So let's just suppose there are two variables
  • 57:34that are completely correlated,
  • 57:35what does this algorithm end up doing?
  • 57:38It ends up basically choosing one of them
  • 57:40and ignoring the other.
  • 57:44The Lasso does the same in fact.
  • 57:46So it ends up choosing one of them
  • 57:50and ignoring the other.
  • 57:52And if you look at the posterior distribution
  • 57:55on its effect,
  • 57:57it would be far too confident
  • 57:58in the size of the effect
  • 58:00because it would assume that the other one had zero effect.
  • 58:04And so it would have a small credible interval,
  • 58:06let's say, around the effect size
  • 58:08when, really, it should be saying
  • 58:09you don't know which one to include.
  • 58:11And so, we've worked recently
  • 58:14on a method for doing that.
  • 58:17A different method,
  • 58:19different work than what I've just described here
  • 58:22for doing fine mapping using variational approximations
  • 58:25that don't have this problem,
  • 58:27and it's on my webpage.
  • 58:29It's Wang et al in JRSS-B,
  • 58:34just recently, this year.
  • 58:362021, I guess. - That's awesome.
  • 58:39Thanks, so any more questions for Matthew from the audience?
  • 58:47Okay, I think we're running out of time also.
  • 58:51So if you have any questions
  • 58:53about the stuff to (indistinct)
  • 58:54you want to use,
  • 58:55I think you can contact either
  • 58:56the authors of the paper or Matthew offline.
  • 59:00And thank you again for agreeing to present your work here.
  • 59:04It looks really useful and interesting.
  • 59:07- Thank you for having me.