
BIS Seminar: Adventures in sparsity and shrinkage with the normal means model

March 03, 2021
  • 00:04- So, hi everyone.
  • 00:06Since we're still waiting for people to join,
  • 00:08I will first give a brief introduction
  • 00:11to Matthew here.
  • 00:13First, it's my honor to introduce Dr. Matthew Stephens
  • 00:16as our seminar speaker today.
  • 00:18And Matthew is a professor of human genetics
  • 00:21and statistics at the University of Chicago.
  • 00:24And in the past, his research mainly focused
  • 00:27on developing new statistical methods
  • 00:29for especially genetic applications.
  • 00:32Including, for example,
  • 00:33genome-wide association studies and fine mapping
  • 00:36and population genetic variation.
  • 00:38And today, he will give a talk
  • 00:40on some recently developed empirical Bayes methods
  • 00:43for estimation in normal means models
  • 00:46that induce the (indistinct) properties
  • 00:48such as shrinkage,
  • 00:49sparsity or smoothness.
  • 00:51And he will also discuss how to apply these methods
  • 00:54to a range of practical applications.
  • 00:59Okay, so let's wait for another minute
  • 01:02and then I will hand it over to Matthew.
  • 01:59So I will hand it over to Matthew from here.
  • 02:02Let's welcome him.
  • 02:05- Thank you very much.
  • 02:05It's a great pleasure to be here
  • 02:07and to get the opportunity
  • 02:09to present my work to you today.
  • 02:11So I guess just a little bit of background.
  • 02:14So few years ago...
  • 02:18Well, I guess, I've been teaching sparsity
  • 02:20and shrinkage for a while,
  • 02:22and it struck me that, in practice,
  • 02:24people don't really use many of these ideas directly...
  • 02:29At least not the empirical Bayes versions
  • 02:30of these ideas
  • 02:32directly in applications.
  • 02:34And so, I'm wondering why that is.
  • 02:37And partly, it's the lack of...
  • 02:40User-friendly, convenient methods
  • 02:43for applying these ideas.
  • 02:44So I've been trying to think about
  • 02:46how we can make these powerful ideas and methods
  • 02:51more generally applicable or easily applicable
  • 02:53in applications.
  • 02:54These ideas have been around quite some time.
  • 02:57But I think we've made some progress
  • 02:59on actually just making them a bit simpler maybe
  • 03:02and simpler to apply in practice,
  • 03:04so I'm gonna tell you about those today.
  • 03:08Oh, sorry.
  • 03:09It's not advancing, let me see.
  • 03:10Okay, so yeah, kind of related to that,
  • 03:13the normal means problem is
  • 03:15something we teach quite frequently.
  • 03:18It's not hard to teach
  • 03:20but it always struck me whenever I was taught it
  • 03:23that it looked like a kind of
  • 03:24a toy model that statisticians kind of think up
  • 03:28to teach students things
  • 03:31but never actually use.
  • 03:32And then suddenly, I had an epiphany
  • 03:35and realized that it's super useful.
  • 03:37And so now, I'm trying to...
  • 03:38I'm not the only one but I'm trying to convince people
  • 03:42that actually, this is a super useful thing
  • 03:44that we should be using in practice.
  • 03:46So here's the normal means model.
  • 03:49The idea is that you've got
  • 03:50a bunch of observations, XJ,
  • 03:52that you can think of as noisy observations
  • 03:54of theta J
  • 03:55and they have some variance.
  • 03:58I'm going to allow each variance to be different.
  • 04:02The simplest version would be to assume
  • 04:04that the variances are all the same
  • 04:06but I'm going to allow them to be different.
  • 04:10But an important point is
  • 04:11that we're going to assume that the variances are known,
  • 04:14which sounds a bit weird
  • 04:15but in applications, we'll see that there are reasons
  • 04:17why we might think that's an okay assumption
  • 04:21in some applications.
  • 04:22Okay, so the basic idea is
  • 04:25you've got a bunch of measurements
  • 04:26that are noisy measurements
  • 04:27of some theta J
  • 04:29and they have known variance,
  • 04:31so they have known precision essentially,
  • 04:33and you want to estimate the Theta Js.
  • 04:36And, of course, the MLE is just
  • 04:38to estimate theta J
  • 04:39by its corresponding measurement, XJ.
  • 04:43And really, it was a big surprise, I think.
  • 04:46I wasn't around at the time
  • 04:47but I believe it was a big surprise in 1956
  • 04:49when Stein showed that you can do better
  • 04:52than the MLE, at least in terms of
  • 04:55average squared error, expected squared error.
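As a quick numerical illustration of Stein's point (a sketch added here, not part of the talk; it assumes the true means happen to be drawn from N(0, 1), so the oracle linear shrinkage rule is known):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
theta = rng.normal(0.0, 1.0, n)        # true means, with spread 1
s = 1.0                                # known noise standard deviation
x = theta + rng.normal(0.0, s, n)      # noisy observations x_j ~ N(theta_j, s^2)

mle = x                                # MLE: estimate each theta_j by x_j
shrunk = (1.0 / (1.0 + s**2)) * x      # shrink toward 0 (posterior mean if theta ~ N(0, 1))

print("MSE of MLE:      ", np.mean((mle - theta) ** 2))     # ~1.0
print("MSE of shrinkage:", np.mean((shrunk - theta) ** 2))  # ~0.5
```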
  • 05:01And so, there are many different ways to motivate or...
  • 05:05To motivate this result.
  • 05:07And I think many of them end up not being that intuitive.
  • 05:11It is quite a surprising result in generality
  • 05:15but I think...
  • 05:16So the way I like to think about the intuition
  • 05:18for why this might be true,
  • 05:19it's not the only intuition
  • 05:20but it's one intuition for why this might be true,
  • 05:22is to apply empirical Bayes thinking to the problem.
  • 05:27And so, to illustrate this idea,
  • 05:30I use a well-worn device, at this point,
  • 05:33which is baseball batting averages.
  • 05:38Efron certainly has used this example before
  • 05:40to motivate empirical Bayes ideas.
  • 05:42This particular example comes from...
  • 05:44The data come from this blog here,
  • 05:46that I referenced at the bottom,
  • 05:48which I quite like as an explanation
  • 05:49of basic ideas behind empirical Bayes.
  • 05:52So this histogram here shows a bunch
  • 05:54of baseball batting averages
  • 05:57for a particular season in 1900.
  • 05:59You don't need to know very much about baseball
  • 06:01to know what's going on here.
  • 06:02Essentially, in baseball, you go and try and hit a ball
  • 06:06and your batting average is
  • 06:07what proportion of the time
  • 06:08you as a bat person end up hitting the ball.
  • 06:12And a good baseball batting average is around 0.3 or so.
  • 06:17And in a professional baseball,
  • 06:19no one's really going to have a batting average of zero
  • 06:21'cause they wouldn't survive.
  • 06:25But empirically, there were some individuals in this season
  • 06:28who had a batting average of zero,
  • 06:30that is they completely failed to hit the ball every time
  • 06:33they went up to bat.
  • 06:34And there were some people
  • 06:36who had a batting average of above 0.4,
  • 06:38which is also completely unheard of in baseball.
  • 06:43Nobody has a batting average that high,
  • 06:45so what's going on here?
  • 06:47Well, it's a simple explanation is
  • 06:49that these individuals at the tails are individuals
  • 06:52who just had a few at-bats.
  • 06:53They only went and attempted
  • 06:55to hit the ball a small number of times.
  • 06:57And so, maybe these individuals only had two at-bats
  • 07:01and they missed it both times,
  • 07:02they got injured or they weren't selected
  • 07:04or, for whatever reason, they didn't hit
  • 07:06the ball many times...
  • 07:07They didn't go to at bat many times
  • 07:09and so, their batting average was empirically zero.
  • 07:12Think of that as the maximum likelihood estimate.
  • 07:15But if you wanted to predict
  • 07:17what they would do, say next season,
  • 07:19if you gave them more at bats
  • 07:21in the long run,
  • 07:22zero would be a bad estimate for obvious reasons.
  • 07:27And the same applies to these individuals up here
  • 07:30with very big batting averages.
  • 07:31They also had relatively few at-bats
  • 07:34and they just happened to hit it above 0.4 of the time
  • 07:38out of the at-bats.
  • 07:39And the individuals who had lots of at-bats are
  • 07:41all in the middle here.
  • 07:44So these are binomial observations, basically,
  • 07:45and the ones who have small N are
  • 07:49more likely to be in the tails
  • 07:50and the ones we're big N are
  • 07:51all going to be around in the middle here.
  • 07:53So what would we do?
  • 07:54What would we want to do
  • 07:55if we wanted to estimate,
  • 07:57for example, for this individual,
  • 07:59their batting average for next season?
  • 08:01If we were gonna predict what they were gonna get.
  • 08:04Well, we would definitely want to estimate
  • 08:08something closer to the average batting average than 0.4.
  • 08:13That's the intuition.
  • 08:14And one way to frame that problem is that...
  • 08:18So sorry.
  • 08:18So this is the basic idea of shrinkage.
  • 08:20We would want to shrink these estimates towards,
  • 08:23in this case, towards the mean.
  • 08:25So how are we gonna do that?
  • 08:27Well, one way to think about it is...
  • 08:31Sorry, let me just...
  • 08:33Yes.
  • 08:36Sorry, just getting my slides.
  • 08:37Okay, so here, the red line represents
  • 08:40some underlying distribution
  • 08:44of actual batting averages.
  • 08:46So conceptually, some distribution
  • 08:47of actual batting averages among individuals
  • 08:53in this kind of league.
  • 08:54So the red line, from a Bayesian point of view,
  • 08:57kind of represents a sensible prior
  • 09:00for any given individual's batting average
  • 09:02before we saw that data.
  • 09:03So think of the red line as representing
  • 09:05the variation or the distribution
  • 09:08of actual batting averages among players.
  • 09:12And in fact, what we've done here is estimate
  • 09:17that red line from the data.
  • 09:21That's the empirical part
  • 09:23of the empirical Bayes.
  • 09:25The empirical part of empirical Bayes is that
  • 09:27the red line which we're going to use
  • 09:28as a prior for any given player was
  • 09:30actually estimated from all the data.
  • 09:33And the basic idea is
  • 09:34because we know what the variance
  • 09:36of a binomial distribution is,
  • 09:38we can kind of estimate
  • 09:40what the overall distribution
  • 09:42of the underlying Ps in these binomials looks like,
  • 09:46taking account of the fact that the histogram is
  • 09:49a set of noisy observations of those underlying Ps.
  • 09:53Every bat...
  • 09:54Basically, every estimated batting average is
  • 09:58a noisy estimate of the true batting average
  • 10:00with the noise depending on
  • 10:01how many at-bats they have.
  • 10:03So once we've estimated that red line,
  • 10:07that prior,
  • 10:08we can compute the posterior
  • 10:10for each individual based on that prior.
  • 10:12And when we do that,
  • 10:13this is a histogram of the posterior means.
  • 10:16So these are, if you like,
  • 10:18shrunken estimates of the batting average
  • 10:20for each individual.
  • 10:21And you can see that the individuals
  • 10:22who had batting averages of zero got shrunk
  • 10:25all the way over somewhere here.
  • 10:27And that's because their data really...
  • 10:29Although, the point estimate was zero,
  • 10:32they had very few at bats.
  • 10:33So the information in that data are very slim,
  • 10:38very little information.
  • 10:40And so, the prior dominates
  • 10:41when you're looking at the posterior distribution
  • 10:43for these individuals.
  • 10:44Whereas individuals in the middle
  • 10:46who have more at-bats,
  • 10:47will have the estimate that is less shrunken.
  • 10:52So that's gonna be a theme we'll come back to later.
  • 10:55So how do we form...
  • 10:57That's a picture.
  • 10:59How do we formulate that?
  • 11:00So those were binomial data,
  • 11:03I'm gonna talk about normal data.
  • 11:05So don't get confused by that.
  • 11:08I'm just going to assume normality. You could do
  • 11:10the same thing for a binomial,
  • 11:12but I think the normal is a more generally useful
  • 11:16and convenient way to go.
  • 11:20So here's a normal means model again
  • 11:23and the idea is that
  • 11:25we're going to assume that the thetas come
  • 11:26from some prior distribution, G,
  • 11:29that was the red line in my example,
  • 11:31and we're going to estimate G
  • 11:33by maximum likelihood essentially.
  • 11:34So we're going to use all the X's,
  • 11:36integrating out theta
  • 11:38to obtain a maximum likelihood estimate for G.
  • 11:40That's stage one,
  • 11:42that's estimating that red line.
  • 11:43And then stage two is
  • 11:44to compute the posterior distribution
  • 11:46for each batting average,
  • 11:48or whatever theta J we're interested in,
  • 11:51taking into account that estimated prior
  • 11:53and the data on the individual J.
  • 11:56So that's the formalization of these ideas.
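In symbols, one compact way to write those two stages (notation added here, following the slides): first estimate the prior by marginal maximum likelihood,

$$\hat{G} \;=\; \arg\max_{G \in \mathcal{G}} \; \sum_{j=1}^{n} \log \int N(x_j \mid \theta, s_j^2)\, dG(\theta),$$

and then, for each $j$, compute the posterior under that estimated prior,

$$p(\theta_j \mid x_j, \hat{G}) \;\propto\; \hat{g}(\theta_j)\, N(x_j \mid \theta_j, s_j^2).$$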
  • 12:01And these posterior distributions are gonna be shrunk
  • 12:03towards the prior, or the prior mean.
  • 12:07So what kind of...
  • 12:09So I guess I've left unspecified here,
  • 12:12what family of priors should we consider for G?
  • 12:17So a commonly used prior distribution is
  • 12:21this so-called point-normal,
  • 12:23or sometimes called spike and slab prior distribution.
  • 12:28And these are...
  • 12:28Sorry, I should say,
  • 12:29I'm going to be thinking a lot about problems
  • 12:31where we want to induce sparsity.
  • 12:33So in baseball, we were shrinking towards the mean
  • 12:37but in many applications,
  • 12:38the natural point to shrink towards,
  • 12:40the natural prior mean, if you like, is zero
  • 12:42in situations where we expect effects
  • 12:46to be sparse, for example.
  • 12:47So I'm gonna be talking mostly about that situation,
  • 12:50although the ideas are more general.
  • 12:51And so, I'm going to be focusing
  • 12:52on the sparsity inducing choices of prior family.
  • 12:56And so, one commonly used one is this point normal
  • 12:59where there's some mass pi zero
  • 13:02exactly at zero,
  • 13:04and then the rest of the mass is
  • 13:06normally distributed about zero.
  • 13:09So the commonly used one.
  • 13:11In fact, it turns out,
  • 13:13and this is kind of interesting I think,
  • 13:15that it can be easier to do the computations
  • 13:17for more general families.
  • 13:20So for example,
  • 13:22just take the non-parametric family
  • 13:24that is the zero-centered scale mixtures of normals,
  • 13:27so we'll see that in a minute,
  • 13:28which includes all these distributions as special cases.
  • 13:31It's nonparametric.
  • 13:33It includes a point-normal here.
  • 13:35It also includes the T-distribution,
  • 13:36the Laplace distribution,
  • 13:38the horseshoe prior, if you've come across that,
  • 13:40this zero-centered scale mixture of normals
  • 13:42and the surprise is that it turns out
  • 13:45to be easier, in some sense,
  • 13:47to do the calculations for this family,
  • 13:49this more general family,
  • 13:50than this narrow family,
  • 13:52partly because it's a convex family.
  • 13:53So you can think of this as a kind of a convex relaxation
  • 13:57of the problem.
  • 13:58So all the computations become...
  • 13:59The optimizations you have to do in the simplest case
  • 14:01become convex when you use this family.
  • 14:05So let me say a bit more about that
  • 14:07for the non-parametric.
  • 14:09How do we actually do these non-parametric computations?
  • 14:12Well, we actually approximate
  • 14:14the non-parametric computation using a grid idea.
  • 14:17So here's the idea.
  • 14:20We modeled G, our prior,
  • 14:22as a mixture of...
  • 14:23I like to think of this K as being big.
  • 14:25A large number of normal distributions.
  • 14:28All of these normal distributions are centered at zero,
  • 14:31that's this zero here,
  • 14:32and they have a different variance.
  • 14:34Some of them have very small variances,
  • 14:36perhaps even one of them has a zero variance,
  • 14:38so that's the point mass at zero.
  • 14:39And the variance is sigma...
  • 14:41Think of Sigma squared K getting gradually bigger
  • 14:43until the last Sigma squared K is very big.
  • 14:47So we're just gonna use a lot of them.
  • 14:48Think of K as being, let's say 100
  • 14:50or 1,000 for the...
  • 14:52In practice, we find 20 is enough
  • 14:54but just think of it as being big
  • 14:57and spanning a lot of different variances,
  • 14:59going from very, very small,
  • 15:00to very, very big.
  • 15:01And then, estimating G just comes down
  • 15:05to estimating these pis,
  • 15:06these mixture proportions.
  • 15:09And that, then of course,
  • 15:11is a finite dimensional optimization problem
  • 15:13and in the normal means model,
  • 15:15it's a convex...
  • 15:16Well actually, for any mixture,
  • 15:18it's a convex problem,
  • 15:19and so there are efficient ways to find
  • 15:22the MLE for pi, given the grid of variances.
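In symbols, the grid prior being described is (notation added here):

$$g(\theta) \;=\; \sum_{k=1}^{K} \pi_k\, N(\theta \mid 0, \sigma_k^2), \qquad \sigma_1^2 \le \sigma_2^2 \le \cdots \le \sigma_K^2,$$

with $\sigma_1^2 = 0$ giving the point mass at zero if desired; the grid of variances is fixed in advance and only the weights $\pi_k$ are estimated.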
  • 15:27So let's just illustrate what's going on here.
  • 15:30Here's a grid of just three normals.
  • 15:33The one in the middle has the smallest variance,
  • 15:35the one over here has the biggest variance.
  • 15:37And we can get a mixture of those,
  • 15:39looks like that.
  • 15:40So you can see this is kind of a spiky distribution
  • 15:42but also with a long tail,
  • 15:44even with just a mixture of three distributions.
  • 15:47And so, the idea is that you can get
  • 15:49quite a flex...
  • 15:51It's a flexible family
  • 15:52by using a larger number of variances than three.
  • 15:56You can imagine you can get distributions
  • 15:59that have all sorts of spikiness
  • 16:01and long-tailed behavior.
  • 16:06So maybe just to fill in the details here;
  • 16:09with that prior as a mixture of normals,
  • 16:13the marginal distribution, P of X,
  • 16:15integrating out theta is analytic
  • 16:18because the sum of normals is normal.
  • 16:21So if you have a normally distributed variable
  • 16:23and then you have another variable
  • 16:25that's a normal error on top of that,
  • 16:27you get a normal.
  • 16:28So the marginal is a mixture of normals
  • 16:32that's very simple to work with
  • 16:34and estimating pi is a convex optimization problem.
  • 16:38You can do it.
  • 16:39You can do an EM algorithm
  • 16:40but convex methods,
  • 16:41as pointed out by Koenker and Mizera,
  • 16:43can be a lot more reliable and faster.
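Here is a minimal sketch of that fitting step, assuming a fixed grid of standard deviations and using a plain EM loop rather than the faster convex solvers mentioned above; the function and variable names are illustrative, not the actual ashr implementation:

```python
import numpy as np
from scipy.stats import norm

def fit_grid_mixture(x, s, sigma_grid, n_iter=200):
    """Estimate mixture weights pi for the prior sum_k pi_k N(0, sigma_k^2) by EM,
    using the analytic marginal likelihood x_j ~ sum_k pi_k N(0, sigma_k^2 + s_j^2)."""
    x, s, sigma_grid = np.asarray(x), np.asarray(s), np.asarray(sigma_grid)
    # L[j, k] = N(x_j; 0, sigma_k^2 + s_j^2), the marginal likelihood under component k
    sd = np.sqrt(sigma_grid[None, :] ** 2 + s[:, None] ** 2)
    L = norm.pdf(x[:, None], loc=0.0, scale=sd)
    pi = np.full(len(sigma_grid), 1.0 / len(sigma_grid))
    for _ in range(n_iter):
        W = L * pi                            # E-step: unnormalized responsibilities
        W /= W.sum(axis=1, keepdims=True)
        pi = W.mean(axis=0)                   # M-step: update the mixture weights
    return pi
```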
  • 16:49Okay, so let's just illustrate those ideas again.
  • 16:52Here's a potential prior distribution
  • 16:58and here's a likelihood.
  • 17:00So this is like a likelihood from a normal...
  • 17:02This is an estimate...
  • 17:04Think of this as a likelihood
  • 17:05for theta J in a normal means model.
  • 17:08So maybe XJ was one and a half or something
  • 17:11and SJ was, I don't know,
  • 17:12something like a half or something
  • 17:14or a half squared.
  • 17:17So this is meant to represent the likelihood.
  • 17:19So what does the posterior look like
  • 17:21when we combine this prior,
  • 17:23the black line,
  • 17:24with this likelihood, the red line?
  • 17:26it looks like this green line here.
  • 17:28So what you can see is going on here is
  • 17:31that you get shrinkage towards the mean, right?
  • 17:34But because the black line is long-tailed
  • 17:37because of the prior in this case has a long tail,
  • 17:40and because the red line...
  • 17:41The likelihood lies quite a ways in the tail,
  • 17:44the spiky bit at zero doesn't have very much impact
  • 17:48because it's completely...
  • 17:49Zero is basically inconsistent with the data
  • 17:52and so the posterior looks approximately normal.
  • 17:54It's actually a mixture of normals
  • 17:56but it looks approximately normal 'cause the weight
  • 17:58on the point mass at zero is very, very small.
  • 18:02Whereas if a...
  • 18:04Here's a different example,
  • 18:05the black line is covered
  • 18:08by the green line this time because it's...
  • 18:09So I plotted all three lines on the same plot here.
  • 18:12The black line is...
  • 18:13Think of it as pretty much the green line.
  • 18:14It's still the same spiky prior
  • 18:16but now the likelihood is much flatter.
  • 18:18The XJ is the same.
  • 18:20Actually, it's one and a half
  • 18:21but we have an SJ that's much bigger.
  • 18:23So what happens here is that
  • 18:25the prior dominates because the likelihood's
  • 18:27relatively flat,
  • 18:29and so the posterior looks pretty much like the prior
  • 18:31and you get very strong shrinkage.
  • 18:33So think of this as corresponding
  • 18:35to those individuals who had very few at-bats,
  • 18:38their data are very imprecise,
  • 18:41and so their posterior, the green line,
  • 18:43looks very like the prior, the black line.
  • 18:46Okay, so we're gonna shrink those observations more.
  • 18:50So the key point here, I guess,
  • 18:52is that the observations with larger standard error,
  • 18:55larger SJ,
  • 18:58get shrunk more.
  • 19:00I should say "larger standard deviation" get shrunk more.
  • 19:04Here's another intermediate example
  • 19:07where the red line...
  • 19:08The likelihood's kind of not quite enough.
  • 19:11It illustrates the idea that the posterior could be bimodal
  • 19:15because the prior and the likelihood
  • 19:18have weight in different places.
  • 19:20So you can get different kinds of shrinkage depending on
  • 19:23how spiky the prior is,
  • 19:24how long-tailed the prior is,
  • 19:25how flat the likelihood is etc.
  • 19:35So obviously the shrinkage,
  • 19:36the amount of shrinkage you get,
  • 19:37depends on the prior, G,
  • 19:39which you're gonna estimate from the data.
  • 19:41It also depends on the standard error
  • 19:43or the standard deviation, SJ.
  • 19:46And one way to summarize this kind of the behavior,
  • 19:48the shrinkage behavior,
  • 19:50is to focus on how the posterior mean changes with X.
  • 19:55So we can define this operator here,
  • 19:57S-G-S of X,
  • 19:59as the posterior mean of theta J,
  • 20:04given the prior and its variance or standard deviation
  • 20:08and that we observed XJ is equal to X.
  • 20:14I'm gonna call this the shrinkage operator
  • 20:17for the prior, G,
  • 20:18and standard deviation, S.
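A small sketch of that shrinkage operator for the grid prior described earlier, i.e. the posterior mean of theta given x ~ N(theta, s^2) and a zero-centered scale mixture of normals as the prior; names are illustrative, not from any package:

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(x, s, pi, sigma_grid):
    """S_{G,s}(x): posterior mean of theta, given x ~ N(theta, s^2) and the prior
    G = sum_k pi_k N(0, sigma_k^2), a zero-centered scale mixture of normals."""
    sigma2 = np.asarray(sigma_grid, dtype=float) ** 2
    # posterior weight on component k is proportional to pi_k * N(x; 0, sigma_k^2 + s^2)
    w = np.asarray(pi) * norm.pdf(x, loc=0.0, scale=np.sqrt(sigma2 + s**2))
    w /= w.sum()
    # within component k, the posterior mean of theta is sigma_k^2 / (sigma_k^2 + s^2) * x
    return float(np.sum(w * sigma2 / (sigma2 + s**2) * x))

# A spiky-but-long-tailed prior: most mass at or near zero, a little mass far out.
pi, grid = np.array([0.7, 0.2, 0.1]), np.array([0.0, 0.5, 4.0])
print(posterior_mean(1.5, 0.5, pi, grid))  # modest shrinkage: the likelihood is precise
print(posterior_mean(1.5, 3.0, pi, grid))  # strong shrinkage: the likelihood is flat
```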
  • 20:22Okay, so we could just plot
  • 20:24some of these shrinkage operators.
  • 20:25So the idea here is...
  • 20:26Sorry, this slide has B instead of X.
  • 20:30Sometimes I use B and sometimes I use X.
  • 20:34I've got them mixed up here, sorry.
  • 20:35So think of this as X
  • 20:37and this is S of X.
  • 20:39So these different lines here correspond
  • 20:43to different priors.
  • 20:44So the idea is that by using different priors,
  • 20:47we can get different types of shrinkage behavior.
  • 20:50So this prior here shrinks very strongly to zero.
  • 20:54This green line shrinks very strongly to zero
  • 20:57until B exceeds some value around five,
  • 21:01at which point it hardly shrinks at all.
  • 21:04So this is kind of a prior that has
  • 21:07kind of a big spike near zero.
  • 21:10But also a long tail,
  • 21:11such that when you get far enough in the tail,
  • 21:15you start to be convinced
  • 21:17that there's a real signal here.
  • 21:18So you can think of that
  • 21:19as this kind of...
  • 21:20This is sometimes called...
  • 21:22This is local shrinkage
  • 21:24and this is global.
  • 21:26So you get very strong local shrinkage towards zero
  • 21:29but very little shrinkage
  • 21:31if the signal is strong enough.
  • 21:33That kind of thing.
  • 21:34But the real point here is that
  • 21:35by using different priors,
  • 21:37these different scale mixture of normal priors,
  • 21:39you can get very different looking shrinkage behaviors.
  • 21:43Ones that shrink very strongly to zero
  • 21:45and then stop shrinking
  • 21:46or ones that shrink a little bit all the way, etc.
  • 21:52And so, if you're familiar with other ways
  • 21:55of doing shrinkage analysis,
  • 21:56and this is one of them,
  • 21:58or shrinkage,
  • 21:59is to use a penalized likelihood.
  • 22:01Then you can try and draw a parallel
  • 22:04and that's what I'm trying to do here.
  • 22:05Draw a parallel between the Bayesian method
  • 22:08and the penalized likelihood-based approaches
  • 22:11to inducing shrinkage or sparsity.
  • 22:15Another way to induce shrinkage is to essentially...
  • 22:18This is the kind of normal log likelihood here
  • 22:21and this is a penalty here
  • 22:23that you add for this.
  • 22:25This could be an L1 penalty
  • 22:26or an L2 penalty or an L0 penalty,
  • 22:28or some other kind of penalty.
  • 22:30So there's a penalty function here.
  • 22:32And you define the estimate
  • 22:34as the value that minimizes
  • 22:36this penalized log likelihood.
  • 22:38Sorry, yeah, this is a negative log likelihood.
  • 22:41Penalized least squares, I guess this would be.
  • 22:45Okay, so now H is a penalty function here
  • 22:49and Lambda is a tuning parameter
  • 22:51that says how strong, in some sense, the penalty is.
  • 22:55And these are also widely used
  • 22:58to induce shrinkage,
  • 22:59especially in regression contexts.
  • 23:02And so, here are some commonly used shrinkage operators,
  • 23:06corresponding to different penalty functions.
  • 23:09So this green line is what's called
  • 23:12the hard thresholding,
  • 23:16which corresponds to an L0 penalty.
  • 23:19If you don't know what that means, don't worry.
  • 23:21But if you do, you make that connection.
  • 23:25And the red line here is the L1 penalty
  • 23:28or soft thresholding.
  • 23:29And these two other ones here are particular instances
  • 23:33of some non-convex penalties that are used
  • 23:36in regression context,
  • 23:37particularly in practice.
  • 23:39And I guess that the point here is
  • 23:42that, essentially, different prior distributions
  • 23:45in the normal means model can lead
  • 23:48to shrinkage operators, shrinkage behavior
  • 23:51that looks kind of similar
  • 23:53to each of these different types of penalty.
  • 23:58So you can't actually mimic the behavior exactly.
  • 24:03I've just...
  • 24:05Or actually, my student, (indistinct) Kim,
  • 24:07chose the priors to visually closely match these
  • 24:11but you can't get...
  • 24:12Some of these have kinks and stuff
  • 24:13that you can't actually, formally, exactly mimic
  • 24:18but you can get qualitatively similar shrinkage behavior
  • 24:21from different priors
  • 24:23as different penalty functions.
  • 24:24So you should think about the different priors
  • 24:26as being analogous to different penalty functions.
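For reference, minimal versions of the two classic penalty-based operators mentioned above (soft thresholding for the L1 penalty, hard thresholding for the L0 penalty); a sketch added here, not from the talk:

```python
import numpy as np

def soft_threshold(x, lam):
    """L1 (lasso) shrinkage operator: argmin_t 0.5*(x - t)**2 + lam*|t|."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    """L0 shrinkage operator: keep x unchanged if |x| is large enough, else set it to 0."""
    return np.where(np.abs(x) > np.sqrt(2.0 * lam), x, 0.0)

x = np.linspace(-5, 5, 11)
print(soft_threshold(x, 1.0))   # shrinks everything toward 0 by lam
print(hard_threshold(x, 1.0))   # all-or-nothing: no shrinkage once past the threshold
```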
  • 24:30And so, the key...
  • 24:33How does EB, empirical Bayes shrinkage,
  • 24:35differ from, say, these kinds of penalty-based approaches,
  • 24:40which I should say are maybe
  • 24:41more widely used in practice?
  • 24:44Well, so shrinkage is determined by the prior, G,
  • 24:49which we estimate in an empirical Bayes context
  • 24:52by maximum likelihood.
  • 24:53Whereas in typical shrinkage...
  • 24:57Sorry, typical penalty-based analyses,
  • 25:00people use cross validation to estimate parameters.
  • 25:03And the result is that cross-validation is fine
  • 25:07for estimating one parameter
  • 25:08but it becomes quite cumbersome
  • 25:10to estimate two parameters,
  • 25:12and really tricky to estimate three or four parameters
  • 25:14'cause you have to go and do a grid of different values
  • 25:17and do a lot of cross-validations
  • 25:18and start estimating all these different parameters.
  • 25:22So the point here is really
  • 25:23because we estimate G by maximum likelihood,
  • 25:26we can actually have a much more flexible family in practice
  • 25:29that we can optimize over more easily.
  • 25:33It's very flexible,
  • 25:34you can mimic a range of penalty functions
  • 25:36so you don't have to choose
  • 25:37whether to use L1 or L2 or L0.
  • 25:40You can essentially estimate
  • 25:42over these non-parametric prior families.
  • 25:44Think of that as kind of deciding automatically
  • 25:47whether to use L0, L1, L2
  • 25:49or some kind of non-convex penalty,
  • 25:51or something in between.
  • 25:53And the posterior distribution, of course then,
  • 25:58another nice thing is that it gives not
  • 25:59only the point estimates
  • 26:01but, if you like, it also gives shrunken interval estimates
  • 26:04which are not yielded by a penalty-based approach.
  • 26:08So I guess I'm trying to say
  • 26:09that there are potential advantages
  • 26:11of the empirical Bayes approach
  • 26:13over the penalty-based approach.
  • 26:16And yeah, although I think,
  • 26:19people have tried, particularly Efron has highlighted
  • 26:21the potential for empirical Bayes
  • 26:23to be used in practical applications,
  • 26:25largely, in the practical applications
  • 26:27I've seen, empirical Bayes shrinkage hasn't
  • 26:29really been used very, very much.
  • 26:32So that's the goal,
  • 26:34is to change that.
  • 26:37So before I talk about examples,
  • 26:39I guess I will pause for a moment
  • 26:41to see if there are any questions.
  • 26:54And I can't see the chat for some reason
  • 26:56so if anyone...
  • 26:57So please unmute yourself
  • 26:58if you have a question.
  • 27:00- I don't think people are (indistinct)
  • 27:04every question in the chat.
  • 27:05At least, I didn't see any. - Good.
  • 27:07Okay, thank you.
  • 27:10It's all clear.
  • 27:12I'm happy to go on but
  • 27:14I just wanna...
  • 27:25Okay, so we've been trying to...
  • 27:27My group has been trying to think about
  • 27:28how to use these ideas,
  • 27:30make these ideas useful in practice
  • 27:32for a range of practical applications.
  • 27:35We've done work on multiple testing,
  • 27:37on high dimensional linear regression,
  • 27:40and also some on matrix factorization.
  • 27:43From previous experience,
  • 27:44I'll probably get time to talk about the first two
  • 27:45and maybe not the last one,
  • 27:47but there's a pre-print on the arXiv.
  • 27:49You can see if you're interested
  • 27:50in matrix factorization.
  • 27:51Maybe I'll get to talk about that briefly.
  • 27:55But let me talk about multiple testing first.
  • 27:59So the typical multiple testing setup,
  • 28:02where you might typically, say,
  • 28:05apply a Benjamini-Hochberg type procedure is
  • 28:08you've got a large number of tests,
  • 28:09So J equals one to N,
  • 28:11and test J yields a P value, PJ,
  • 28:15and then you reject all tests with
  • 28:16some PJ less than a threshold gamma,
  • 28:19where that threshold is chosen
  • 28:21to control the FDR in a frequented sense.
  • 28:23So that's the typical setup.
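For readers who have not seen it, a minimal sketch of the Benjamini-Hochberg step-up rule just described (added here for concreteness; names are illustrative):

```python
import numpy as np

def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up rule: reject the k smallest p-values, where k is
    the largest index with p_(k) <= (k / n) * alpha."""
    p = np.asarray(pvals)
    n = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])       # largest index meeting the bound
        reject[order[: k + 1]] = True        # reject everything up to and including it
    return reject
```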
  • 28:26So how are we going to apply
  • 28:27the normal means model to this problem?
  • 28:33Okay, well, in many applications,
  • 28:37not all but in many,
  • 28:38the P values are derived from
  • 28:40some kind of effect size estimate,
  • 28:41which I'm going to call "Beta hat J,"
  • 28:45which have standard errors, SJ,
  • 28:47that satisfy approximately, at least,
  • 28:49that Beta J hat is normally distributed
  • 28:53about the true value Beta J
  • 28:55with some variance given it by SJ.
  • 28:58So in a lot...
  • 29:01I work a lot in genetic applications.
  • 29:03So in genetic applications,
  • 29:05we're looking at different genes here.
  • 29:07So Beta J hat might be the estimate
  • 29:11of the difference in expression, let's say,
  • 29:13of a gene, J, between, say, males and females.
  • 29:18And Beta J would be the true difference
  • 29:21at that gene.
  • 29:22And you're interested in identifying
  • 29:25which genes are truly different...
  • 29:28Have a different mean expression
  • 29:30between males and females here.
  • 29:32And the reason that SJ is approximately known is
  • 29:36because you've got multiple males
  • 29:38and multiple females that you're using
  • 29:41to estimate this difference.
  • 29:43And so, you get an estimated standard error
  • 29:45of that Beta hat as well.
  • 29:48And so, once you've set the problem up like this,
  • 29:50of course, it looks suddenly like a normal means problem
  • 29:55and we can kind of apply
  • 29:57the empirical Bayes normal means idea.
  • 30:00We're gonna put a prior on the Beta Js
  • 30:03that is sparsity inducing.
  • 30:05That is, it's kind of centered at zero,
  • 30:06maybe it's got a point mass at zero.
  • 30:08But we're gonna estimate that prior from the data.
  • 30:14Okay.
  • 30:17And so, not only can you get posterior means for Beta,
  • 30:23as I said, you can get posterior interval estimates.
  • 30:25So you can kind of do things
  • 30:27like compute the posterior in 90% credible interval,
  • 30:31given that prior and the likelihood
  • 30:33for each Beta J
  • 30:34and we could reject, for example,
  • 30:35if the interval does not contain zero.
  • 30:39And I'm not going to talk about this in detail
  • 30:42because the details are in
  • 30:44a biostatistics paper from 2017.
  • 30:46I should say that the idea of using empirical Bayse for FDR
  • 30:50actually dates back to before Benjamini and Hoffberg.
  • 30:54Duncan Thomas has a really nice paper
  • 30:57that was pointed out to me by John Witty
  • 30:59that actually contains these basic ideas
  • 31:03but not nice software implementation,
  • 31:07which maybe explains why it hasn't caught on
  • 31:09in practice yet.
  • 31:10Efron's also been a pioneer in this area.
  • 31:13So...
  • 31:16Okay, so I don't want to dwell on that
  • 31:18because, actually, I think I'll just summarize
  • 31:21what I think is true compared with Benjamini-Hochberg.
  • 31:25You get a bit of an increase in power
  • 31:27by using an empirical Bayse approach.
  • 31:29The Benjamini-Hochberg approach is
  • 31:32more robust to correlated tests though,
  • 31:34so the empirical Bayse normal means model does assume
  • 31:38that the tests are independent
  • 31:39and, in practice, we have seen
  • 31:43that correlations can cause problems.
  • 31:45If you're interested in that,
  • 31:46I have a pre-print with Lei Sun on my website.
  • 31:49But the empirical Bayse normal means
  • 31:52also provides these interval estimates,
  • 31:55which is kind of nice.
  • 31:56Benjamini-Hochberg does not.
  • 31:58So there are some advantages
  • 31:59of the empirical Bayes approach
  • 32:01and maybe some disadvantages compared
  • 32:04with Benjamini-Hochberg.
  • 32:05But I think that the real benefit
  • 32:06of the empirical Bayse approach
  • 32:08actually comes when we look at multi-variate extensions
  • 32:11of this idea.
  • 32:12So I just wanted to briefly highlight those
  • 32:14and spend some time on those.
  • 32:16So here's the multi-variate version
  • 32:19of the empirical Bayse normal means models.
  • 32:24And now, my Beta Js are a vector of observation.
  • 32:29So I think of this as measuring, say,
  • 32:32gene J in multiple different tissues.
  • 32:34Think of different tissues.
  • 32:36You look at at heart
  • 32:37you look at lung,
  • 32:38you look brain,
  • 32:39you look at the spleen.
  • 32:42In fact, we've got 50 different tissues in the example
  • 32:46I'm gonna show in a minute.
  • 32:48So we've measured some kind of effect
  • 32:52in each gene, in each of these 50 different tissues
  • 32:56and we want to know where the effects are...
  • 33:01Which genes show effects in which tissues.
  • 33:04So Beta J is now a vector of length R,
  • 33:08the number of tissues.
  • 33:09R is 50 in our example.
  • 33:11And so you've got...
  • 33:12We're gonna assume that the estimates are
  • 33:16normally distributed with mean,
  • 33:17the true values and some variance,
  • 33:19covariance matrix now,
  • 33:21which we're going to assume, for now, is known.
  • 33:23That's actually a little trickier
  • 33:26but I'm gonna gloss over that for...
  • 33:28If you want to see details,
  • 33:29take a look at the paper.
  • 33:31I just wanna get the essence of the idea across.
  • 33:33We're still going to assume that Beta J comes
  • 33:35from some prior, G,
  • 33:36and we're still gonna use a mixture of normals,
  • 33:38but now we're using a mixture
  • 33:39of multi-variate normals.
  • 33:41And unlike the univariate case,
  • 33:43we can't use a grid of...
  • 33:45We can't use a grid of values
  • 33:47that span all possible covariance matrices here.
  • 33:50It's just too much.
  • 33:52So we have to do something to estimate
  • 33:53these covariance matrices,
  • 33:55as well as estimate the pis.
  • 33:57And again, if you want to see the details,
  • 34:00take a look at Urbut et al.
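To make the multivariate computation concrete, here is a sketch of the per-observation posterior, assuming the mixture weights pi and covariances U_k have already been estimated (this is an illustration of the model, not the mashr package itself):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mv_posterior(bhat, V, pi, U_list):
    """Posterior for beta in the model bhat ~ N(beta, V), beta ~ sum_k pi_k N(0, U_k),
    with V and the U_k treated as known.  Returns the posterior component weights,
    the overall posterior mean, and the per-component means and covariances."""
    R = len(bhat)
    w, means, covs = [], [], []
    for p_k, U in zip(pi, U_list):
        w.append(p_k * multivariate_normal.pdf(bhat, mean=np.zeros(R), cov=U + V))
        A = U @ np.linalg.inv(U + V)        # "shrinkage matrix" for this component
        means.append(A @ bhat)              # posterior mean of beta given component k
        covs.append(U - A @ U)              # posterior covariance given component k
    w = np.array(w) / np.sum(w)
    post_mean = sum(wk * m for wk, m in zip(w, means))
    return w, post_mean, means, covs
```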
  • 34:03But let me just illustrate
  • 34:04the idea of what's going on here,
  • 34:06or what happens when you apply this method
  • 34:08to some data.
  • 34:10So this is...
  • 34:10I said 50,
  • 34:11we have 44 tissues in this particular example.
  • 34:14So each row here is a tissue.
  • 34:17These yellow ones here are brain tissues,
  • 34:20different brain tissues,
  • 34:22and I think we'll see one later that's blood.
  • 34:25I think this one might be blood.
  • 34:26Anyway, each one is a tissue;
  • 34:28lung, blood, etc.
  • 34:29You don't need to know which ones are which, for now.
  • 34:32And so, what we've done here is plot
  • 34:34the Beta hat and plus or minus two standard deviations
  • 34:38for each tissue at a particular...
  • 34:41In this case, a particular SNP, actually (indistinct).
  • 34:44So this is an eQTL analysis,
  • 34:46for those of you who know what that means.
  • 34:48If you don't, don't worry about it.
  • 34:50Just think of it as having an estimated effect
  • 34:53plus or minus two standard deviations
  • 34:55in 44 different tissues,
  • 34:58and we want to know which ones are quote,
  • 35:01"significantly different from zero."
  • 35:05And so what happens...
  • 35:10Sorry.
  • 35:11Didn't expect that to happen.
  • 35:17Sorry, okay.
  • 35:17Yeah, these are just...
  • 35:18These are just two examples.
  • 35:20So this is one example,
  • 35:22here's another example
  • 35:23where we've done the same thing.
  • 35:25Estimated effects, plus or minus two standard deviations.
  • 35:28So what you can see in this first one is
  • 35:30that it looks like, at least,
  • 35:31that the brain tissues have some kind of effect.
  • 35:33That's what you're supposed to see here.
  • 35:36And maybe there are some effects in other tissues.
  • 35:38There's a tendency for effects to be positive,
  • 35:40which might suggest that maybe everything has
  • 35:44a small effect to everywhere,
  • 35:45but particularly strong in the brain.
  • 35:47And whereas in this example,
  • 35:50this one appears to have an effect
  • 35:52in just one tissue.
  • 35:53This is the blood actually.
  • 35:54So this is an effect in blood
  • 35:56but mostly, it doesn't look like
  • 35:58there's an effect in other tissues.
  • 36:00But these, just to emphasize,
  • 36:01these are the raw data,
  • 36:03in the sense that they're the Beta hats
  • 36:04and the standard errors.
  • 36:05There's no shrinkage occurred yet.
  • 36:07But the idea is that the empirical Bayes approach takes
  • 36:11all these examples,
  • 36:12examples like this and examples like this,
  • 36:14to learn about what kinds of patterns are present
  • 36:17in the data.
  • 36:18That is, "What does G look like?"
  • 36:19So it learns from these examples
  • 36:21that there are some effects that look like
  • 36:24they're shared among the brain tissues,
  • 36:27and there are some effects that are...
  • 36:28These are actually somewhat rare
  • 36:31but rarely, there's an effect that's specific
  • 36:33to one tissue like, in this case, blood.
  • 36:35And it also learns, in this case actually,
  • 36:39that there's a lot of null things,
  • 36:41because there are a lot of null things as well.
  • 36:44So it puts lots of mass on the null as well
  • 36:46and that causes the shrinkage.
  • 36:48And then, having estimated those patterns from the data,
  • 36:51it computes posteriors.
  • 36:53And so, here's the data and then the posterior intervals
  • 36:58for the same...
  • 36:59For that first example.
  • 37:00And what you can see is that because of
  • 37:02the combining information across tissues,
  • 37:06you get standard errors that are getting smaller,
  • 37:09the brain estimates all get shrunk towards one another,
  • 37:12and all these...
  • 37:14There's some borrowing strength of information,
  • 37:16borrowing information across these tissues,
  • 37:18to make these these look like
  • 37:20some of them are kind of borderline significant.
  • 37:22Now, it looks like there's probably
  • 37:23an effect in every tissue
  • 37:25but a much stronger effect in brain.
  • 37:26Whereas this example here,
  • 37:29it recognizes that this looks like an effect
  • 37:32that's specific to blood.
  • 37:33And so, it shrinks everything else strongly towards zero
  • 37:36because it knows that most things are null,
  • 37:38it's learned that from the data,
  • 37:40but the blood estimate gets hardly shrunk at all.
  • 37:43We saw that kind of behavior where things
  • 37:44that are near zero can get shrunk towards zero,
  • 37:47whereas other things that are far
  • 37:48away don't get shrunk as much.
  • 37:50And it's really hard to do that kind of thing
  • 37:53without doing some kind of model-based analysis,
  • 37:57doing a Benjamini-Hochberg-type, non-model-based analysis
  • 38:01without making any assumptions
  • 38:03or making minimal assumptions,
  • 38:06it's very hard to capture this kind of thing, I think.
  • 38:09So I think the empirical Bayes approach
  • 38:11has big advantages in this setting.
  • 38:17I'll pause before I talk about regression.
  • 38:20Any questions there?
  • 38:24- So Matthew, I have some basic questions.
  • 38:26So in your multivariate multiple testing case,
  • 38:30I guess for each of the plots,
  • 38:32you are looking at maybe a particular gene's influence
  • 38:35on some... - Good, yeah.
  • 38:37Sorry, I did skip over it a bit.
  • 38:39So these are eQTLs.
  • 38:40So actually, what I'm plotting here is each...
  • 38:43This is a single SNP associated
  • 38:46with a single gene.
  • 38:48And this is,
  • 38:49"How associated is this SNP
  • 38:52"with this gene's expression level
  • 38:54"in the different brain tissues,
  • 38:56"in the blood tissue in lung and spleen, etc?"
  • 39:02The idea is that...
  • 39:04What the scientific goal is to understand
  • 39:07which genetic variants are impacting gene expression
  • 39:10in different tissues,
  • 39:12which might tell us something
  • 39:13about the biology of the tissues
  • 39:15and the regulation going on in the different tissues.
  • 39:18- Got it.
  • 39:19So in this case,
  • 39:20I don't think I fully understand
  • 39:23why it's multivariate multiple tests, not univariate
  • 39:27because you are looking at each gene
  • 39:28versus each SNP.
  • 39:32- Right, so sorry.
  • 39:33Think of J indexing eQTL.
  • 39:36So we've got 2 million potential eQTLs,
  • 39:39so that's the multiple part of it.
  • 39:41For 2 million potential eQTLs, that's...
  • 39:45And then each eQTL has data on 44 tissues,
  • 39:50so that's the multi-variate part of it.
  • 39:52(speaking over each other)
  • 39:53If you thought about it
  • 39:54in terms of say P values or maybe Z scores,
  • 39:57you have a matrix of Z scores.
  • 39:59There are two million rows
  • 40:01and there are 44 columns
  • 40:03and you have a Z score or a P value
  • 40:05for each element in that matrix,
  • 40:09and what we're assuming is that
  • 40:11the rows are independent,
  • 40:13which is not quite true but still,
  • 40:15we're assuming that the rows are independent
  • 40:17and the columns,
  • 40:18we're assuming that they can be correlated.
  • 40:20And in particular,
  • 40:21we're assuming that the...
  • 40:22Well, we're assuming that both
  • 40:24the measurements can be correlated,
  • 40:26so it's V,
  • 40:27but also that the effects can be correlated.
  • 40:30So that's to capture the idea
  • 40:31that there might be some effects
  • 40:33that are shared between say brain tissues--
  • 40:36- I see.
  • 40:37I see.
  • 39:38So this multivariate is different
  • 39:40from our usual notion where the multivariate
  • 39:43means multivariate SNPs.
  • 39:44So here, it's multivariate tissues.
  • 40:46I guess, are the samples from the same cohort?
  • 40:49- Yeah, so in this particular case,
  • 40:52the samples are from the same individuals.
  • 40:55So these different brain tissues...
  • 40:56There's overlap anyway, let's say.
  • 40:57And so, that's what causes this...
  • 41:00That causes headaches, actually, for this--
  • 41:02- Okay, got it, thanks. - Yeah.
  • 41:04Just to emphasize,
  • 41:05it doesn't have to be different tissues.
  • 41:07The whole method works
  • 41:08on any matrix of Z scores, basically.
  • 41:11As long as you think that the rows correspond
  • 41:14to different tests
  • 41:15and the columns correspond
  • 41:16to different, say, conditions for the same test.
  • 41:21So examples might be
  • 41:24you're looking at the same SNP
  • 41:25across lots of different phenotypes,
  • 41:27so looking at schizophrenia,
  • 41:29looking at bipolar,
  • 41:31looking at different diseases
  • 41:33or different traits,
  • 41:35and you can have a Beta hat for that SNP
  • 41:37and a standard error for that SNP
  • 41:39in every trait.
  • 41:40And you could try to learn,
  • 41:42"Oh look, there are some traits
  • 41:43"that tend to share effects
  • 41:45"and other traits that don't,"
  • 41:46or, often in experiments,
  • 41:49people treat their samples
  • 41:51with different treatments.
  • 41:52They challenge them with different viruses.
  • 41:54They look to see which things are being changed
  • 41:58when you challenge a cell with different viruses
  • 42:00or different heat shock treatments
  • 42:02or any kind of different treatment.
  • 42:03So yeah, basically, the idea is very generic.
  • 42:06The idea is if you've got a matrix of Z scores
  • 42:09where the effect, say, look likely to be shared
  • 42:12among column sometimes
  • 42:13and the rows are gonna be approximately independent,
  • 42:17or at least you're willing to assume that,
  • 42:20then you can apply the method.
  • 42:21- Okay, got it, thanks.
  • 42:22- So, actually, that's an important kind of...
  • 42:25Also, something that I've been thinking about a lot is
  • 42:28the benefits of modular or generic methods.
  • 42:34So if you think about what methods are applied
  • 42:36in statistics a lot,
  • 42:37you think T-test, linear regression.
  • 42:39These are all kind of very generic ideas.
  • 42:42They don't...
  • 42:44And Benjamini-Hochberg.
  • 42:45The nice thing about Benjamini-Hochberg is
  • 42:47you just need a set of P values
  • 42:48and you can apply Benjamini-Hochberg.
  • 42:50You don't have to worry too much
  • 42:52about where those P values came from.
  • 42:54So I think, for applications,
  • 42:56it's really useful to try to think about
  • 42:58what's the simplest type of data
  • 43:01you could imagine inputting into the procedure
  • 43:04in order to output something useful?
  • 43:06And sometimes, that involves making compromises
  • 43:09because to make a procedure generic enough,
  • 43:12you have to compromise on what...
  • 43:14On maybe what the details of what are going in.
  • 43:17So here, what we've compromised on is
  • 43:18that we take a matrix of Z scores,
  • 43:20or potentially Beta hats and their standard errors,
  • 43:23we can do either,
  • 43:24and that's the input.
  • 43:26So that makes it relatively generic.
  • 43:28You don't have to worry too much
  • 43:30about whether those Beta hats
  • 43:31and the standard errors, or the Z scores,
  • 43:33are coming from logistic regression
  • 43:35or linear regression,
  • 43:36or whether they're controlling for some covariates
  • 43:38or all sorts of...
  • 43:39From a mixed model, etc.
  • 43:41As long as they have the basic property that
  • 43:44the Beta hat is normally distributed
  • 43:47about the true Beta
  • 43:48with some variance that you are willing to estimate,
  • 43:51then you can go.
  • 43:58- Sorry, (indistinct).
  • 44:01A short question.
  • 44:02So in practice, how you choose...
  • 44:05How many mix...
  • 44:06How many distributions you want in the mixture
  • 44:08like the (indistinct)? - (indistinct)
  • 44:10Yeah, so great question.
  • 44:12And my answer, generally,
  • 44:14is just use as many as you want.
  • 44:16So as many as you can stomach.
  • 44:19The more you use, the slower it is.
  • 44:25And so, you might worry about over-fitting,
  • 44:27but it turns out that these procedures are
  • 44:29very robust to over-fitting
  • 44:30because of this fact that the mean is fixed at zero.
  • 44:34So all the components have a mean zero
  • 44:37and have some covariance
  • 44:39and because of that,
  • 44:40they have limited flexibility to overfit.
  • 44:45They're just not that flexible.
  • 44:48And in the univariate case,
  • 44:50that's even more obvious, I think.
  • 44:51That in the univariate case,
  • 44:53every one of those distributions,
  • 44:55any mixture of normals that are
  • 44:57all centered at zero is unimodal at zero
  • 45:01and has limited...
  • 45:02Can't have wiggly distributions
  • 45:04that are very spiky and overfitting.
  • 45:06So these methods are relatively immune
  • 45:09to overfitting in practice.
  • 45:12If you're worried about that,
  • 45:13you can do a test-train type thing
  • 45:15where you use half your tests to train,
  • 45:18and then you look at the log likelihood
  • 45:20out of sample on others,
  • 45:22and then tweak the number to avoid overfitting.
  • 45:27And we did do that early on in the methods
  • 45:30but we don't do it very often now,
  • 45:32or we only do it now when we're worried
  • 45:34'cause generally it seems like overfitting doesn't seem
  • 45:37to be a problem,
  • 45:38but if we see results are a little bit weird
  • 45:39or a bit concerning,
  • 45:41we try it to make sure we're not overfitting.
  • 45:46- Okay, thank you.
  • 45:47- I should say that, in the paper,
  • 45:49we kind of outlined some procedures we use
  • 45:51for estimating these variance, co-variance matrices
  • 45:54but they're not like...
  • 45:55They're kind of like...
  • 45:57The whole philosophy is that we could probably do better
  • 46:01and we're continuing to try and work
  • 46:03on better methods for estimating this
  • 46:05as we go forward.
  • 46:07So we're continually improving
  • 46:08the ways we can estimate this.
  • 46:16Okay, so briefly, I'll talk about linear regression.
  • 46:19So here's your standard linear regression where,
  • 46:21so we've N observations,
  • 46:23X is the matrix of covariates here,
  • 46:25B are the regression coefficients.
  • 46:27I'm kind of thinking of P as being big, potentially here.
  • 46:32And the errors normal.
  • 46:33And so, the empirical Bayes idea would be
  • 46:36to assume that the Bs come from
  • 46:38some prior distribution, G,
  • 46:40which comes from some family, curly G.
  • 46:42And what we'd like to do is estimate G
  • 46:45and then shrink the estimates of B,
  • 46:49using empirical Bayes type ideas
  • 46:52and posterior computations.
  • 46:55But it's not a simple normal means model here,
  • 46:58so how do we end up applying
  • 47:00the empirical Bayes methods to this problem?
  • 47:04Well, let's just...
  • 47:05I'm gonna explain our algorithm
  • 47:07by analogy with penalized regression algorithms
  • 47:11because the algorithm ends up looking very similar,
  • 47:13and then I'll tell you
  • 47:14what the algorithm is actually kind of doing.
  • 47:16So a penalized regression would solve this problem.
  • 47:19So if you've seen the Lasso before...
  • 47:22I hope many of you might have.
  • 47:23If you've seen the Lasso before,
  • 47:24this would be solving this problem
  • 47:26with H being the L1 penalty,
  • 47:29absolute value of B, right?
  • 47:31So this...
  • 47:32So what algorithm...
  • 47:34There are many, many algorithms to solve this problem
  • 47:37but a very simple one is coordinate ascent.
  • 47:40So essentially, for each coordi...
  • 47:42it just iterates the following.
  • 47:43For each coordinate,
  • 47:44you have some kind of current estimate for Bs.
  • 47:47(indistinct)
  • 47:48So what you do here is you form the residuals
  • 47:50by taking away the effects of all the Bs
  • 47:55except the one you're trying to update,
  • 47:57the one you're trying to estimate.
  • 47:59So X minus J here is all the covariates
  • 48:01except covariate J.
  • 48:02B minus J is all the corresponding coefficients.
  • 48:06So this is the residual.
  • 48:08RJ is the residual,
  • 48:09after removing all the current estimated effects
  • 48:12apart from the Jth one.
  • 48:14And then you basically compute an estimate
  • 48:18of the Jth effect by regressing those residuals on XJ.
  • 48:24And then you shrink that using a shrinkage operator
  • 48:29that we saw earlier.
  • 48:31Just to remind you
  • 48:33that a shrinkage operator is the one
  • 48:34that minimizes this penalized least squares problem.
  • 48:38And it turns out,
  • 48:39it's not hard to show that this is coordinate ascent
  • 48:45for minimizing this, penalized objective function.
  • 48:49And so every iteration of this decreases
  • 48:53that objective function,
  • 48:55or at least doesn't increase it.
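A minimal sketch of that coordinate-wise loop for the lasso (added here; it assumes the columns of X are standardized so that each x_j'x_j = n, which makes the update exact):

```python
import numpy as np

def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Coordinate-wise lasso: regress the partial residuals on each column of X in
    turn and soft-threshold the result (assumes each column satisfies x_j'x_j = n)."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]                  # residuals without effect j
            ols = X[:, j] @ r_j / n                           # regress r_j on x_j
            b[j] = np.sign(ols) * max(abs(ols) - lam, 0.0)    # soft-threshold (L1 shrinkage)
    return b
```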
  • 48:58Okay.
  • 49:00Okay, so it turns...
  • 49:01So our algorithm looks very similar.
  • 49:03You still compute the residuals,
  • 49:06you compute a Beta hat
  • 49:08by regressing the residuals on XJ.
  • 49:10You also, at the same time,
  • 49:11compute a standard error,
  • 49:12which is familiar form.
  • 49:17And then you, instead of shrinking using
  • 49:21that penalized regression operator,
  • 49:23you use a...
  • 49:25Sorry, I should say,
  • 49:27this is assuming G is known.
  • 49:28I'm starting with G.
  • 49:29G is known.
  • 49:31So you can shrink...
  • 49:32Instead of using the penalty-based method,
  • 49:35you use the posterior mean shrinkage operator here
  • 49:38that I introduced earlier.
  • 49:40So it's basically exactly the same algorithm,
  • 49:41except replacing this penalty-based shrinkage operator
  • 49:47with an empirical Bayes
  • 49:48or a Bayesian shrinkage operator.
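And the empirical Bayes analogue, with G held fixed, is the same loop with the shrinkage step swapped out; this sketch reuses the posterior_mean helper from the earlier normal-means sketch and treats the residual variance as known:

```python
import numpy as np

def eb_coordinate_sweep(X, y, b, resid_var, pi, sigma_grid):
    """One sweep of the empirical Bayes analogue with G fixed: same residual and
    least-squares steps, but the shrinkage step is the posterior mean under the
    mixture prior (posterior_mean from the normal-means sketch above)."""
    n, p = X.shape
    for j in range(p):
        r_j = y - X @ b + X[:, j] * b[j]          # residuals excluding effect j
        d = X[:, j] @ X[:, j]
        bhat = X[:, j] @ r_j / d                  # simple-regression estimate of effect j
        se = np.sqrt(resid_var / d)               # its standard error
        b[j] = posterior_mean(bhat, se, pi, sigma_grid)   # Bayesian shrinkage step
    return b
```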
  • 49:54And so, you could ask what that's doing
  • 49:56and it turns out that what it's doing is
  • 49:57it's minimizing the Kullback-Leibler Divergence
  • 50:01between some approximate posterior, Q,
  • 50:05and the true posterior, P, here
  • 50:08under the constraint that this Q is factorized.
  • 50:13So this is what's called
  • 50:15a variational approximation
  • 50:17or a mean-field,
  • 50:17or fully factorized variational approximation.
  • 50:21If you've seen that before,
  • 50:22you'll know what's going on here.
  • 50:24If you haven't seen it before,
  • 50:25it's trying to find an approximation
  • 50:27to the posterior.
  • 50:29This is the true posterior,
  • 50:30it's trying to find an approximation
  • 50:31to that posterior that minimizes
  • 50:33the Kullback-Leibler Divergence
  • 50:35between the approximation
  • 50:36and the true value under in a simplifying assumption
  • 50:40that the posterior factorizes,
  • 50:41which, of course, it doesn't,
  • 50:42so that's why it's an approximation.
  • 50:44So that algorithm I just said is
  • 50:46a coordinate ascent algorithm
  • 50:48for maximizing F or minimizing the KL divergence.
  • 50:53So every iteration of that algorithm gets
  • 50:55a better estimate
  • 50:58of the posterior, essentially.
  • 51:02Just to outline
  • 51:03and just to give you the intuition
  • 51:06for how you could maybe estimate G,
  • 51:08this isn't actually quite what we do
  • 51:10so the details get a bit more complicated,
  • 51:13but just to give you an intuition
  • 51:14for how you might think that you can estimate G;
  • 51:18Every iteration of this algorithm computes a B hat
  • 51:21and a corresponding standard error,
  • 51:23so you could imagine...
  • 51:25These two steps here, you could imagine storing these
  • 51:28through the iterations
  • 51:29and, at the end,
  • 51:31you could apply the empirical Bayes normal means procedure
  • 51:35to estimate G from these B hats
  • 51:37and standard errors,
  • 51:38and something close to that kind of works.
  • 51:43The details are a bit more complicated than that.
  • 51:47So let me give you some kind of intuition
  • 51:51for what we're trying to achieve here based
  • 51:54on simulation results.
  • 51:55So these are some simulations we've done.
  • 51:57The covariates are all independent here.
  • 52:00The true prior is a point normal,
  • 52:02that means that most of the effects are zero.
  • 52:07Well, actually maybe here,
  • 52:09one of the effects is nonzero,
  • 52:11five of the effects are nonzero,
  • 52:1350 of the effects are nonzero
  • 52:15and 500 of the effects are nonzero.
  • 52:17And actually, there are 500 effects,
  • 52:18500 variables in this example.
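
A rough sketch of the kind of simulation being described; the sample size, effect scale, and noise level here are assumptions, not values from the talk:

```python
import numpy as np

def simulate(n=1000, p=500, s=5, effect_sd=1.0, noise_sd=1.0, seed=0):
    """Independent covariates, p = 500 variables, s of the effects nonzero."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))                  # independent covariates
    beta = np.zeros(p)
    nonzero = rng.choice(p, size=s, replace=False)   # s in {1, 5, 50, 500}
    beta[nonzero] = rng.normal(0.0, effect_sd, size=s)
    y = X @ beta + rng.normal(0.0, noise_sd, size=n)
    return X, y, beta
```
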
  • 52:23So the X-axis here just shows the number
  • 52:25of non-zero coordinates
  • 52:27and the results I've shown here are the prediction error,
  • 52:30so we're focusing on prediction error,
  • 52:32the out of sample prediction error,
  • 52:33using three different penalty-based approaches.
  • 52:37The Lasso, which is this line,
  • 52:40the L0Learn, which is this line,
  • 52:43which uses the L0 penalty,
  • 52:45and Ridge,
  • 52:47which uses the L2 penalty.
  • 52:48So the important thing to know is that
  • 52:51the L0 penalty is really designed, if you like,
  • 52:55to do well under very sparse models.
  • 52:58So that's why it's got the lowest prediction error
  • 53:03when the model is very sparse,
  • 53:05but when the model is completely dense,
  • 53:07it does very poorly.
  • 53:08Whereas Ridge is designed much more to...
  • 53:14It's actually based on a prior
  • 53:15that the effects are normally distributed.
  • 53:17So it's much better at dense models
  • 53:19than sparse models.
  • 53:20And you can see that at least relative to L0Learn,
  • 53:23Ridge is much better for the dense case
  • 53:26but also much worse for the sparse case.
  • 53:30And then Lasso has some kind of ability
  • 53:32to deal with both scenarios,
  • 53:35but it's not quite as good as the L0 penalty
  • 53:37when things are very sparse,
  • 53:39and it's not quite as good as the Ridge penalty
  • 53:41when things are very dense.
  • 53:45So our goal is that by learning the prior G
  • 53:49from the data,
  • 53:50we can adapt to each of these scenarios
  • 53:53and get performance close to L0Learn
  • 53:55when the truth is sparse
  • 53:58and get performance close to Ridge regression
  • 54:01when the truth is dense.
  • 54:03And so, the red line here actually shows
  • 54:06the performance of our method.
  • 54:09And you can see, indeed,
  • 54:10we do even slightly better than L0Learn
  • 54:13in this part here
  • 54:14and slightly better than cross-validated Ridge regression
  • 54:18in this part here.
  • 54:19The difference between these two is just that
  • 54:21the Ridge regression is doing cross-validation
  • 54:23to estimate the tuning parameter
  • 54:25and we're using empirical Bayes maximum likelihood
  • 54:28to estimate it.
  • 54:28So that's just that difference there.
  • 54:30And the Oracle here is using the true...
  • 54:32You can do the Oracle computation
  • 54:35for the Ridge regression
  • 54:36with the true tuning parameter here.
  • 54:38I should say that maybe this...
  • 54:42Maybe I should just show you the results.
  • 54:45So here are a bunch of other penalties,
  • 54:47including elastic net, for example, which you might wonder about,
  • 54:49which is kind of a compromise
  • 54:51between L1 and L2.
  • 54:52And you can see, it does do the compromising
  • 54:56but it doesn't do as well
  • 54:57as the empirical Bayes approach.
  • 54:59And here are some other non-convex methods
  • 55:01that are more, again...
  • 55:04They're kind of more tuned to the sparse case
  • 55:06than to the dense case.
  • 55:09As promised, I'm gonna skip over
  • 55:12the matrix factorization
  • 55:14and just summarize to give time for questions.
  • 55:18So the summary is that
  • 55:21the empirical Bayes normal means model provides
  • 55:23a flexible and convenient way
  • 55:25to induce shrinkage and sparsity
  • 55:26in a range of applications.
  • 55:28And we've been spending a lot of time trying
  • 55:32to apply these methods
  • 55:33and provide software to do
  • 55:35some of these different things.
  • 55:36And there's a bunch of things on my publications page
  • 55:39and if you're interested in...
  • 55:40If you can't find what you're looking for,
  • 55:41just let me know, I'd be happy to point you to it.
  • 55:44Thanks very much.
  • 55:47- Thanks Matthew, that's a great talk.
  • 55:49I wonder whether the audience has
  • 55:51any questions for Matthew.
  • 55:57So I do have some questions for you.
  • 56:00So I think I really like the idea
  • 56:02of applying empirical Bayes to a lot of applications
  • 56:06and it really seems empirical Bayes has had great success.
  • 56:10But I do have a question or some doubt
  • 56:13about the inference part,
  • 56:15especially in that linear regression model.
  • 56:18So for the current work you have been doing,
  • 56:22you essentially shrink each
  • 56:24of the coefficients based on
  • 56:26their estimated values,
  • 56:28but in some applications,
  • 56:31such as a GWAS study or fine mapping,
  • 56:33different SNPs can have
  • 56:35very different LD structure.
  • 56:38So in this case,
  • 56:39how much can we trust the inference,
  • 56:43the p-values, from this (indistinct)?
  • 56:48- So, great question.
  • 56:51So let me just first...
  • 56:55First emphasize that the shrink...
  • 56:57The estimate here is being done removing the effects,
  • 57:01or the estimated effects,
  • 57:03of all the other variables.
  • 57:04So each iteration of this,
  • 57:06when you're estimating the effect
  • 57:07of SNP J, in your example,
  • 57:10you're taking the estimated effects
  • 57:11of the other variables into account.
  • 57:14So the LD structure, as you mentioned,
  • 57:18that's the correlation structure
  • 57:20for those who don't know,
  • 57:21between the Xs, is formally taken into account.
  • 57:24However, there is a problem with this approach
  • 57:26for very highly correlated variables.
  • 57:30So let's just suppose there are two variables
  • 57:34that are completely correlated,
  • 57:35what does this algorithm end up doing?
  • 57:38It ends up basically choosing one of them
  • 57:40and ignoring the other.
  • 57:44The Lasso does the same in fact.
  • 57:46So it ends up choosing one of them
  • 57:50and ignoring the other.
  • 57:52And if you look at the posterior distribution
  • 57:55on its effect,
  • 57:57it would be far too confident
  • 57:58in the size of the effect
  • 58:00because it would assume that the other one had zero effect.
  • 58:04And so it would have a small credible interval,
  • 58:06let's say, around the effect size
  • 58:08when, really, it should be saying
  • 58:09you don't know which one to include.
  • 58:11And so, we've worked recently
  • 58:14on a method for doing that.
  • 58:17A different method,
  • 58:19different work than what I've just described here
  • 58:22for doing fine mapping using variational approximations
  • 58:25that don't have this problem,
  • 58:27and it's on my webpage.
  • 58:29It's Wang et al in JRSS-B,
  • 58:34just recently, this year.
  • 58:362021, I guess. - That's awesome.
  • 58:39Thanks, so any more questions for Matthew from the audience?
  • 58:47Okay, I think we're running out of time also.
  • 58:51So if you have any questions
  • 58:53about the stuff to (indistinct)
  • 58:54you want to use,
  • 58:55I think you can contact either
  • 58:56the authors of the paper or Matthew offline.
  • 59:00And thank you again for agreeing to present your work here.
  • 59:04It looks really useful and interesting.
  • 59:07- Thank you for having me.