YSPH Virtual Biostatistics Seminar: Statistical analysis of single cell CRISPR screens

February 03, 2021

Information

Eugene Katsevich, Ph.D.

Assistant Professor in the Department of Statistics

The Wharton School at the University of Pennsylvania

Tuesday, 2/2/2

ID6158


00:03- First good afternoon, everyone,
00:05and I hope you somehow managed to enjoy your winter break
00:10in this special time.
00:11And this is our first talk, seminar talk this semester,
00:16and we have invited Dr. Eugene Katsevich
00:19from Wharton School at UPenn.
00:22And he's going to present something really exciting,
00:26I know, his original work on statistical analysis
00:32of single cell CRISPR screening.
00:34And I will hand it over to Eugene from now, from here.
00:40And, but if Eugene wanted to start or wait one
00:43or two minutes to start, it's up to you.
00:46- Yeah maybe, I mean, yeah, I don't know.
00:51If people will filter in, maybe I'll wait another minute
00:53or two, 'cause I think, I feel like the first part
00:57of the talk is very important.
00:58So I think if people missed the first part of the talk,
01:01then it'll be maybe hard to follow along later.
01:04So I'm happy to wait just another minute or two.
01:09I understand perfectly that it's a strange time
01:12for everyone, so for all those who were able
01:15to make it today, I really appreciate
01:17your adjusting the schedule.
01:22Also maybe one remark I can make is that,
01:25since it is a smaller audience,
01:27I think we can make this seminar just about
01:31as interactive as you want.
01:32So you should definitely feel free to stop me
01:37at any point.
01:39I don't know how many of you are familiar
01:40with the CRISPR screen stuff I'm gonna talk about,
01:43but I'm very happy to just make it very interactive.
01:54I will maybe start sharing my screen
01:58and maybe I'll start launching
02:00into some of the introductory things.
02:13So...
02:16Oh wow, wait, is this the...
02:21I greatly apologize.
02:24Clearly, the label on my slides is wrong.
02:28I have updated my slides since then,
02:31but I think the title page has not been updated,
02:33that's extremely embarrassing.
02:40Well, maybe then I should skip past this slide very quickly.
02:43So hello everyone, thank you so much
02:47for making it to my talk.
02:50Today, I'll be talking about some Statistical Analysis Tools
02:52for Single Cell CRISPR Screens.
02:55So the most important thing to take away
02:56from this slide are my collaborators here.
02:59So Tim Barry is a grad student
03:02of mine who was actually at CMU.
03:05I am jointly advising him with Kathryn Roeder,
03:08also at CMU, who used to be my postdoc advisor.
03:15So I'll skip quickly to the next slide.
03:20So here's the motivation.
03:22And by the way, if anyone has joined recently,
03:25please just stop me at any point.
03:29So here's the motivation.
03:30So we have done lots
03:31and lots of genome wide association studies to date.
03:35So we have a lot of little markers
03:37along the genome that we think are associated with diseases.
03:41And so the question is what's the next step?
03:43Like how do we actually translate these
03:46into insights into diseases?
03:49And hopefully later on things like,
03:51therapeutics and so on.
03:52So what we need to do is we need to understand
03:56basically the mechanisms,
03:58by what mechanism these associations are actually resulting
04:01in an increased disease risk.
04:03So here's a typical situation: here
04:05is our genome and here's a disease association,
04:08and frequently these disease associations
04:11they might not take place within genes.
04:13And so that makes them pretty hard to interpret.
04:18So what's hypothesized to be the case here is that instead
04:24of disrupting genes directly, these variants
04:29are disrupting regulatory elements such as enhancers.
04:33So let's just briefly review here
04:38that an enhancer is a region of the genome
04:41that could be a certain distance
04:42from the gene that actually folds
04:45in three-dimensional space to come
04:47in close proximity to the promoter of the gene.
04:52And essentially the enhancer's job is to recruit a lot
04:55of the machinery that actually is going to lead
04:57to the expression of this gene.
04:58So if you disrupt the enhancer
05:00then this will disrupt the recruitment
05:02of all of these different transcription factors
05:04which will then end up causing some trouble.
05:10And so, for example, in this case
05:15let's say that this disease association
05:17is disrupting enhancer one. Well, this might suggest,
05:21if enhancer one is regulating gene two, that
05:24the disease mechanism is actually proceeding essentially
05:28or being mediated by the expression of gene two.
05:32And so this would be a very great
05:38and clean way of interpreting GWAS hits.
05:41But the problem is that we don't actually know
05:44or we have a very hazy sense
05:46of which enhancers actually regulate which genes.
05:50So this is kind of a difficult problem
05:52for a few different reasons.
05:54The first reason is that there's potentially
05:57a many-to-many mapping between enhancers and genes.
06:00So an enhancer can regulate multiple genes
06:03and a single gene can be regulated by multiple enhancers.
06:08So the other thing is that enhancers don't even
06:10need to be all too close to the genes that they regulate.
06:13There could be situations like we saw here where
06:18the regulation can skip the adjacent gene
06:20and go to the next one.
06:22And so in general regulation
06:24is thought to happen within about a megabase distance
06:30in terms of the linear distance in the genome.
06:33So this is a hard problem, and it's basically
06:36the motivating problem for this talk
06:38which enhancers regulate which genes.
06:41This is a sort
06:42of a very fundamental and important problem in genomics.
06:47So in today's talk, I'm going to first talk about
06:53a new assay called a single cell CRISPR screen
06:56that allows us to get at this question,
07:03then I'm gonna talk about the challenges
07:06that previous methods have encountered
07:08in analyzing these single cell CRISPR screen
07:10datasets, then I'll propose a new methodology based
07:14on this idea of conditional resampling.
07:18And then I will show you how this works
07:20on real data and close with the discussion.
07:25So let me first introduce the biological assay here
07:28which is called the Single Cell CRISPR screen.
07:31So actually backing up a second,
07:34this is a very important problem
07:35and people have considered it before.
07:37So how do people typically approach gene-enhancer mapping?
07:41I think the most common approach is what I call here
07:46an indirect observational approach.
07:48And there are many of these.
07:50So what this picture is,
07:51is basically a more detailed picture
07:54of what happens when an enhancer, pictured here, comes
07:57into contact with the promoter of a gene.
08:00There are lots of kind
08:01of indirect signals of this regulation.
08:06Obviously you have just the actual expression
08:08of the gene, but you'll have the conformation
08:12of the chromatin in the vicinity of the promoter
08:16and in the enhancer
08:18you have basically transcription factor binding data.
08:22And all of these data are essentially indirect ways
08:25of trying to make a conclusion
08:28about which enhancers might be regulating which genes.
08:31So for example, using Hi-C data,
08:33if you find an enhancer to be in 3D contact
08:37with the promoter, then this could be a signal
08:39that there is some regulation going on.
08:43The issue is that these approaches have not
08:45proved very reliable at the end of the day.
08:47These are observational approaches,
08:49and basically even if you have
08:52contact in 3D space, this is not necessarily a signal.
08:57This doesn't necessarily mean that regulation
08:59is actually occurring,
09:00and so essentially we haven't gotten all too far
09:04with these indirect approaches.
09:05So the exciting thing is
09:07that recently with the development of CRISPR technology
09:13we can now actually go in and, instead of observationally
09:17just essentially taking a look inside a cell,
09:20we can actually go in and make modifications where we,
09:24for example, knock out enhancers using the system
09:28called CRISPR Interference.
09:30And then we try to look
09:31at what the results are for gene expression.
09:34So this shows you a little cartoon
09:38of the CRISPR interference system.
09:40And so the way that it works is
09:42that you have this Cas9 protein whose job is to attach
09:48to a certain segment of DNA.
09:51And the specific segment of DNA it attaches
09:53to is specified by this guide RNA.
09:57And so in this way,
09:59the attachment can be highly specific
10:01to the sequence of the enhancer.
10:04And then for CRISPR Interference
10:07the Cas9 brings along with it
10:09all of these repressive elements that essentially knock
10:13out this enhancer, meaning they prevent the enhancer
10:17from actually helping to regulate this gene.
10:22And so the idea, so firstly
10:24this is a promising solution because it allows us to
10:28interrogate these regulatory relationships
10:30in a much more direct way
10:32than we've been able to do until recently.
10:35And so the overall idea is that,
10:39the idea is simple: disrupt enhancers
10:41and see which genes' expression drops.
10:44And so just as a cartoon here, let's say we knock
10:46out this enhancer, then we would expect
10:49to see the gene that it regulates to be down-regulated.
10:54And then we can think
10:55about designing perturbations for multiple enhancers.
10:58And so if you perturb this enhancer
11:00then maybe you'll see a response in these two genes.
11:07- Very naive question, just to make sure I
11:09didn't misunderstand the notion here: is an enhancer always
11:14upregulating the gene it regulates?
11:18- I think enhancers specifically
11:20are thought to upregulate genes.
11:22However, it's a good question because there are other kinds
11:25of elements that can actually be silencers, for example.
11:28And so that's just another example of a kind
11:32of a regulatory element.
11:33So the effect could go in either direction.
11:36In this talk I'll primarily talk about enhancers
11:38but really everything I say goes through for other kinds
11:42of regulatory elements.
11:44- Thanks.
11:45- Yeah, very good question.
11:50So now the actual assay
11:54that allows you to do this at large scale.
11:59So the scale is the question here
12:00because you can do CRISPR experiments where
12:03you essentially like knock out one enhancer
12:05in a whole batch of cells, and then,
12:08maybe go enhancer by enhancer
12:10and this ends up not being a very scalable approach.
12:13So there has been proposed
12:16this new assay called the single cell CRISPR screen
12:19in which you basically pool a whole bunch
12:22of perturbations together,
12:23and then the readout that you get is single cell
12:26RNA sequencing, which allows you to also basically look
12:30at the impact of all
12:31of those different enhancer perturbations
12:32on the entire transcriptome.
12:35And so in the slide, I'm gonna give you a brief overview
12:38of how these screens work.
12:40So first what you do is you start
12:42with a library of CRISPR perturbations.
12:44So you just, let's say maybe you take,
12:4910,000 enhancers across the genome
12:52and then you basically design CRISPR guide
12:55RNAs targeting each of those enhancers.
12:58Once you have a library of these perturbations
13:00you then infect a big pool
13:03of cells with all of these perturbations.
13:06And so what's important to note here is
13:08that essentially these perturbations get randomly integrated
13:13into the different cells. They're delivered through
13:17like a virus system;
13:19the details aren't very important, but the important thing is
13:22that these perturbations get integrated
13:24into cells essentially at random.
13:26And so each cell gets its own collection
13:28of CRISPR perturbations.
13:30So now in order to basically actually read out what happened
13:35in our experiment, we use single cell RNA sequencing.
13:38And as a result of the sequencing experiment
13:40we get two pieces of information, firstly, by the way
13:44two pieces of information for every cell.
13:46So for every cell
13:47we first measure the perturbations that are present.
13:50So which of these guide RNAs did we detect,
13:52and then secondly
13:53the gene expression for the whole transcriptome.
13:57So this is essentially our data here.
13:59And then once we have this data
14:01we can now do the analysis component, which really ends
14:05up being a kind of differential expression analysis.
14:08So consider a particular gene-enhancer pair.
14:12So what we can do is we could take all of the cells
14:15and we can break them up into two groups.
14:17Those cells for which that enhancer was knocked out
14:20which are in orange here, and those cells
14:23for which that enhancer was not knocked out.
14:26We can then split, essentially look
14:29at the expression of the gene of interest
14:33and see whether there's a systematic difference
14:35between the expression of this gene
14:37in these two populations of cells.
14:39So, and then if there is a significant difference
14:43then we can make a conclusion that that particular
14:45enhancer is regulating that particular gene.
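The two-group comparison just described can be sketched in a few lines. This is only an illustration on made-up toy data: the function name `split_by_perturbation` and the counts are hypothetical, and a raw difference in means stands in for whatever test statistic an actual analysis would use.

```python
from statistics import mean

def split_by_perturbation(expression, perturbed):
    """Split one gene's per-cell counts into perturbed and unperturbed groups."""
    treated = [y for y, x in zip(expression, perturbed) if x == 1]
    control = [y for y, x in zip(expression, perturbed) if x == 0]
    return treated, control

# Hypothetical toy data: UMI counts of one gene in eight cells, and whether
# the guide RNA targeting the enhancer of interest was detected in each cell.
expression = [0, 1, 0, 2, 5, 4, 6, 3]
perturbed = [1, 1, 1, 1, 0, 0, 0, 0]

treated, control = split_by_perturbation(expression, perturbed)

# A negative difference suggests lower expression when the enhancer is knocked out.
difference = mean(treated) - mean(control)
print(difference)
```

Whether such a difference is statistically significant is exactly the calibration question the rest of the talk is about.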
14:48So it seems quite simple on first glance,
14:52but this analysis part actually turns
14:55out to be a challenging statistical problem.
14:59And so the analysis
15:02of these screens is actually the subject of this talk.
15:07Okay so, maybe one more slide
15:10and then I'll stop and see if people have questions.
15:12So just to make it a little bit more concrete
15:16there's a kind of a large data set that might be one
14:19of the largest out there right now by Gasperini et al.
14:22It was published in Cell last year.
14:24Oh wow, I guess two years ago now, in 2019,
14:27and so they were working with 200,000 K562 cells
15:31and they were looking at 6,000 candidate enhancers.
15:34And so they're looking at, I mean
15:35essentially the whole transcriptome, at least the part
15:38of it that has any expression in the cell type.
15:41And they identified 85,000 enhancer gene pairs
15:45that they essentially thought were plausible
15:50to have some regulation and in their experiment
15:53they had 28 perturbations on average per cell.
15:58And so the way that this data would look is, think
16:01about the rows as being the cells and then the columns.
16:05So you have two groups of columns.
16:06Firstly, you have the gene expressions,
16:08and so since these are single cell data
16:10we have these highly discrete counts
16:13of reads or UMIs for every gene.
16:18And then also we have the second bit of information
16:21which is a binary matrix, which tells you
16:23which cells received which perturbations.
16:27So in general, in this presentation,
16:29I'll denote gene expression by Y and perturbations by X.
16:36And so there's also a third and very important piece
16:38of information, which are technical factors per cell.
16:42Perhaps the main one that I'll talk
16:43about today is the sequencing depth.
16:46So this is just the total number
16:48of reads or UMIs measured from this cell.
16:53And so this basically just varies randomly across cells
16:57just as an artifact of your experiment.
16:58There are other technical factors
17:00like batch and so on and so forth.
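To make the data layout concrete, here is a minimal sketch of the three pieces of information per cell: the expression counts Y, the binary perturbation indicators X, and a technical factor Z (sequencing depth). The class name `ScreenData`, its fields, and the tiny numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ScreenData:
    """Toy container for a single cell CRISPR screen (all names hypothetical)."""
    expression: List[List[int]]    # cells x genes, UMI counts (Y)
    perturbation: List[List[int]]  # cells x guides, 0/1 detection (X)
    depth: List[int]               # per-cell sequencing depth (Z)

    def pair(self, gene: int, guide: int) -> List[Tuple[int, int, int]]:
        """Return (Y_i, X_i, Z_i) triples for one gene-guide pair."""
        return [
            (y_row[gene], x_row[guide], z)
            for y_row, x_row, z in zip(self.expression, self.perturbation, self.depth)
        ]

# Three cells, two genes, three guides.
data = ScreenData(
    expression=[[0, 3], [1, 0], [4, 2]],
    perturbation=[[1, 0, 0], [0, 0, 1], [0, 1, 0]],
    depth=[1200, 800, 2500],
)
triples = data.pair(gene=1, guide=2)
print(triples)
```

Each gene-enhancer test then only needs the triples for that one pair, one triple per cell.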
17:03Okay, so this brings me to the end
17:06of the first section where I tell you
17:08about the data and the assay.
17:11So are there any questions before I move on
17:14to talking more about the analysis of these types of data.
17:25I'm assuming there are no questions
17:27but do feel free to stop me if there are.
17:34So as I said, this actually turns out to be kind of
17:37like an annoyingly challenging statistical problem.
17:40And so to illustrate this to you, let me first
17:43give you a sense of what analysis methods there
17:45are out there.
17:47I should say, by the way that given the sort of the novelty
17:52of this assay, there hasn't been a lot of work in terms
17:55of designing methods specifically designed
17:59for this kind of data.
18:01So most of the existing analysis methods are basically
18:04proposed by the same people who are
18:07producing the single cell CRISPR screen data.
18:10So by the way, in this slide,
18:14and actually for the remainder of the talk,
18:17I'm actually going to essentially focus our attention
18:20on a certain gene and a certain enhancer
18:25and just consider the problem
18:27of figuring out whether that enhancer regulates that gene.
18:30And so I'm gonna use Y_i to denote the expression
18:34of that gene in cell i, X_i as the binary indicator
18:38for whether that enhancer was perturbed in that cell,
18:41and Z_i the vector of these extra technical covariates.
18:48So with that notation out of the way,
18:52the first kind of popular method for analyzing these data
18:58is negative binomial regression.
19:00For those of you familiar with bulk RNA-seq differential
19:04expression analysis, this is similar to the DESeq2
19:08methodology where you just run a negative binomial
19:11regression of the gene expression, Y on a linear combination
19:17of the perturbation indicator, as well as all
19:20of your technical covariates.
19:23And so Negative Binomial is a common model for these sort of
19:28over dispersed count data that you encounter
19:31in RNA sequencing data.
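Written out, the negative binomial regression described here could take the following form. The log link and the symbols \beta_0, \beta, \gamma and \theta are assumptions made for this sketch (they are the usual choices, but the talk does not spell them out); Y_i, X_i and Z_i are as defined above.

```latex
Y_i \mid X_i, Z_i \sim \mathrm{NB}(\mu_i, \theta), \qquad
\log \mu_i = \beta_0 + \beta X_i + \gamma^\top Z_i, \qquad
H_0 : \beta = 0 .
```

Here \theta is the dispersion parameter, and testing whether the enhancer regulates the gene amounts to testing whether the coefficient \beta on the perturbation indicator is zero.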
19:35Okay, next, there is a rank based approach.
19:38So this is non-parametric where it's actually much simpler.
19:43You just cross-tabulate your cells by two criteria.
19:48First, you see whether they have the perturbation or not.
19:51And second, you see whether they have essentially higher
19:55than median expression on this gene or lower
19:57than median expression on this gene.
19:59And then you do a two by two table test for independence.
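A rough sketch of this median-split test follows. It assumes a Pearson chi-square test without continuity correction as the "two by two table test"; the published method may use a different test, and the function name `median_split_test` and the toy data are hypothetical.

```python
import math
from statistics import median

def median_split_test(expression, perturbed):
    """Cross-tabulate cells by perturbation and above-median expression,
    then run a Pearson chi-square test of independence (1 degree of freedom)."""
    med = median(expression)
    table = [[0, 0], [0, 0]]  # table[perturbed][above_median]
    for y, x in zip(expression, perturbed):
        table[x][1 if y > med else 0] += 1

    n = len(expression)
    row = [sum(table[0]), sum(table[1])]
    col = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    if 0 in row or 0 in col:
        return table, 0.0, 1.0  # degenerate table: no evidence either way

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return table, chi2, p_value

expression = [0, 1, 0, 2, 5, 4, 6, 3]
perturbed = [1, 1, 1, 1, 0, 0, 0, 0]
table, chi2, p_value = median_split_test(expression, perturbed)
print(table, chi2, p_value)
```

Note that nothing in this test looks at the technical factors, which matters for the confounding issue discussed later in the talk.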
20:05And finally there are also permutation based approaches
20:08where the idea is to take some test statistic
20:12and then calibrate it under the null distribution
20:16by permuting this column right here
20:19the assignments of the perturbations to the cells.
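As an illustration of the permutation idea, the sketch below shuffles the perturbation column and recomputes a test statistic each time. The statistic (difference in mean expression), the left-sided p-value, and the name `permutation_p_value` are assumptions for this example, not details of any specific published method.

```python
import random
from statistics import mean

def mean_difference(expression, perturbed):
    """Mean expression in perturbed cells minus mean in unperturbed cells."""
    treated = [y for y, x in zip(expression, perturbed) if x == 1]
    control = [y for y, x in zip(expression, perturbed) if x == 0]
    return mean(treated) - mean(control)

def permutation_p_value(expression, perturbed, n_perm=2000, seed=0):
    """Left-sided permutation p-value: is expression lower in perturbed cells?"""
    rng = random.Random(seed)
    observed = mean_difference(expression, perturbed)
    shuffled = list(perturbed)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign perturbations to cells uniformly at random
        if mean_difference(expression, shuffled) <= observed:
            as_extreme += 1
    return (1 + as_extreme) / (1 + n_perm)

expression = [0, 1, 0, 2, 5, 4, 6, 3]
perturbed = [1, 1, 1, 1, 0, 0, 0, 0]
p_value = permutation_p_value(expression, perturbed)
print(p_value)
```

Shuffling treats every cell as equally likely to carry the perturbation, which is the exchangeability assumption that comes up again later.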
20:24So yes, that, I guess that's, what's written here.
20:28So okay, maybe all these methods sound
20:36reasonable at first, but the more you actually look
20:39at the existing literature
20:40the more there are various scattered signs
20:43that none of these methods are really doing the trick.
20:47And so here are
20:50the methods that I described on the previous slide.
20:54I don't know if I named them
20:55but so virtual FACS is the rank based one
20:57and scMAGeCK is one of the permutation based ones.
21:01And so you look at plots actually from
21:05the original papers themselves who propose these methods
21:10and you see some signs of miscalibration.
21:13And so like, for example, I'm gonna be talking mostly
21:16about this data and to a lesser extent
21:18about this data in my talk, but so looking here
21:23so I guess perhaps I should first talk
21:24about the concept of a Negative Control Perturbation.
21:27So a Negative Control Perturbation is a guide
21:30RNA that's actually not designed to
21:33target any particular sequence along the genome.
21:37So you don't expect cells that are infected
21:40with a negative control perturbation to look any different
21:42from cells that have no perturbation.
21:46And so in this Gasperini data
21:50they have 50 different negative control guide RNAs,
21:53and so what they did is they basically plotted a QQ plot
21:57of all of the negative control guide RNAs,
22:00paired with all of the genes in the genome,
22:06and what they found is and perhaps on this QQ plot
22:09this doesn't look like a severe inflation from uniformity
22:12but it's important to keep in mind the scale of this Y axis.
22:17And so essentially this amounts
22:21to a massive amount of deviation
22:24from the uniform distribution in those P-values.
22:27So in other words, negative control,
22:31gene-enhancer pairs are looking incredibly
22:33significant according to this analysis.
22:37So in this particular analysis
22:40they essentially found
22:42the same thing. Here it's portrayed as a Manhattan plot
22:47but you see a lot
22:50of things reaching significance when only
22:53the circled points are those that essentially were replicated
22:58in a bulk RNA sequencing experiment.
23:02And then this one finally looks like they perturbed
23:10lots of different enhancers and essentially looked
23:13at the effect on this one particular gene.
23:16And essentially what they found is that essentially all
23:19of the enhancers that they tested appeared to
23:22actually be per, like, have an effect on the expression
23:25of this gene, when in fact this is biologically impossible.
23:29So this is clearly an issue.
23:32Now, these original papers clearly knew
23:36that there was an issue, and so for each of the papers
23:39they kind of have a little bit
23:40of an ad hoc fix in order to basically correct their P-value
23:45distributions, so that they look a little bit
23:50closer to being calibrated.
23:52And so I'm, I think for the sake of time
23:55I'm probably not going to get into exactly how
23:58they propose to fix their P-value distributions.
24:02What I will say is that we looked in detail
24:06especially at the strategy that they use here
24:08and to a lesser extent at the strategies.
24:10Well, actually I think here
24:11they basically said just not to apply their method
24:14to data where there are, essentially,
24:18too many perturbations per cell.
24:21So in this case, they just said, don't apply this method.
24:24We looked into the kinds of fixes that they proposed
24:26in these two papers, and they essentially
24:28they don't quite work in the way that you would expect.
24:31And so what we thought is that,
24:33what we'd like to do is kind of look a little deeper
24:37into this problem and try to ask ourselves
24:39why are we seeing all of these issues?
24:41Why do people keep running into these miscalibration issues
24:44and let's try to basically address those underlying issues.
24:50So we thought about it a little bit
24:53and we thought about challenges
24:55for both parametric and non-parametric methods.
24:59So for parametric methods
25:01this actually shouldn't really come as a surprise probably
25:05to most people here, gene expression is known to
25:08be pretty hard to model in single cells.
25:11So of course we have these essentially highly discrete
25:16counts with lots of zeros that are overdispersed.
25:20Perhaps more importantly, given how sparse the data are,
25:23it's actually pretty hard to get a good estimate
25:26of that dispersion parameter.
25:28And so there's currently no standard way
25:30of estimating that dispersion parameter
25:32and basically every paper comes up
25:35with their own way of doing this.
25:40There are even just debates
25:41about what parametric models are appropriate for these data,
25:44should they be zero inflated,
25:46should they not be, and some genes have even been observed
25:50to have bi-modal expression patterns.
25:52So essentially all of these things are telling us
25:55that it's kind of hard to shoehorn
25:57single cell gene expression
25:58into a nice, neat parametric model.
26:01So obviously if you have misspecification of your model
26:04such as a bad estimate for a dispersion parameter
26:07that very well could cause miscalibration
26:09of the kind that we saw.
26:13So next we can think about non-parametric methods.
26:16So maybe, obviously if these data
26:19are hard to model parametrically
26:21maybe the non-parametric methods are going to save us.
26:25But the observation that we made that I think is
26:27quite important is that these technical factors
26:29that I mentioned before, like sequencing depth,
26:32they impact not only the expressions of genes
26:35but also the detection of these CRISPR guide RNAs.
26:38So I might have led you to believe
26:41in one of my early slides that we can basically
26:43perfectly measure which cell contains
26:46which CRISPR perturbations, but this is actually not true.
26:51So single cell RNA sequencing,
26:54it's essentially just this kind of sampling process.
26:59And so the more reads you sample from a cell
27:03the more likely you are to detect guide RNAs.
27:05And so we just essentially looked at, for example,
27:10this is for one of the datasets and we just made
27:13a scatterplot of the total number of guide RNAs detected
27:17per cell versus the total number of UMIs.
27:20So this is the sequencing depth
27:21and we found this extremely clear
27:24I guess I'm not showing you the P-value
27:25but this P-value was like absurdly significant
27:29to just basically confirm that
27:31if you have more sequencing depth in a cell,
27:33you're going to find more guide RNAs in that cell.
27:37And so the issue with this is
27:40that we basically have a confounding problem on our hands.
27:43So think about this graphical model that's illustrating
27:48what's going on
27:49in a single cell CRISPR screen experiment. In
27:52this gray box is kind of the underlying biological reality.
27:56Let's say we have this presence of this guide RNA
27:58and the expression of this gene and the guide RNA is
28:02or the, yeah, I guess the, the CRISPR knockdown
28:05of the enhancer is either affecting gene expression
28:08or it is not, but we read out
28:13some essentially imprecise measurement
28:18of the guide RNA presence.
28:19We also read out
28:20an imprecise measurement of the gene expression.
28:23And what's most important is that the technical factors such
28:27as sequencing depth, they're actually impacting both
28:30of these measurements, they're coming from the same cell.
28:33And so even if there is no association between the guide RNA
28:37and the gene, if you just basically naively look
28:42at the association between the measured guide RNA presence
28:45and the measured gene expression
28:46you're going to find some association.
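A small simulation can illustrate this confounding. In the sketch below the guide RNA has no effect on the gene at all, yet sequencing depth drives both the measured expression and the chance of detecting the guide. The specific rates (0.002 counts per unit of depth, a detection probability of 1 - exp(-depth/2000)) and all function names are made up for illustration.

```python
import math
import random
from statistics import mean

def poisson(rng, lam):
    """Draw a Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def simulate_null_cells(n_cells=5000, seed=1):
    """Simulate cells in which the guide has NO effect on the gene."""
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        depth = rng.uniform(500, 5000)                 # technical factor Z
        y = poisson(rng, 0.002 * depth)                # measured expression Y
        detect_prob = 1 - math.exp(-depth / 2000)      # deeper cells: more guides seen
        x = 1 if rng.random() < detect_prob else 0     # measured guide presence X
        cells.append((y, x, depth))
    return cells

cells = simulate_null_cells()
detected = [y for y, x, _ in cells if x == 1]
undetected = [y for y, x, _ in cells if x == 0]

# Clearly positive even though there is no biological effect in the simulation.
naive_difference = mean(detected) - mean(undetected)
print(naive_difference)
```

Any test that ignores depth would read this difference as a real association.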
28:50And so this is clearly an issue.
28:53And so essentially in order to correct
28:55for this confounding effect, it's very important
28:57to test instead of just testing independence between
29:01the perturbation and the expression.
29:04We want to test conditional independence, where
29:07we're conditioning on all of these technical factors.
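In symbols, using the X, Y, Z notation from earlier in the talk, the two null hypotheses being contrasted are the following. The labels "marginal" and "conditional" are added here for clarity and are not from the slides.

```latex
H_0^{\text{marginal}} : X \perp\!\!\!\perp Y
\qquad \text{versus} \qquad
H_0^{\text{conditional}} : X \perp\!\!\!\perp Y \mid Z .
```

With Z affecting both measurements, the marginal null can fail even when the conditional null holds, and the conditional null is the one of scientific interest.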
29:10And so this shows you why non-parametric methods tend to
29:14suffer is because when you do things like permute your data
29:18or rank your data, there's this underlying assumption
29:21that all of the cells are exchangeable and you're
29:24using that exchangeability to build your inference on.
29:27And so when you do those tests, they're implicitly
29:30actually testing just the direct independence
29:34the unconditional independence.
29:36And so this sort of inflation we saw
29:39in the non-parametric methods can be explained by this
29:43source of confounding.
29:47So that's actually it for that part of my talk
29:51any questions about the existing methods
29:53and the analysis challenges
29:54and why there's a need to think about new methodology
29:57for this problem.
30:06Okay, I will move on.
30:10So this is the part of the talk where I'm going to
30:13propose a new analysis method for this kind of data.
30:19And so the key kind of idea we're gonna use is
30:22conditional resampling, which was proposed not by us.
30:30So the idea of the conditional randomization test
30:34well, it's actually, depending on how you look at it
30:37it's quite an old idea and it has some connections to
30:40causal inference, but it was proposed also in Candès et al.
30:45And essentially the setup is that you want to
30:48test conditional independence and you're under
30:52the assumption that you have a decent estimate
30:55of the distribution of X given Z.
30:57So remember X is the perturbation.
31:00Y is the expression and Z are the,
31:02essentially the confounders.
31:04So one way of thinking about it from a causal inference
31:07standpoint is let's say we know the propensity score,
31:12can we test whether there's a causal relationship
31:15between X and Y sort of controlling for these Confounders?
31:20So the idea of the conditional randomization test
31:24is the following.
31:26First, you take any test statistic T of your data,
31:31and in order to calibrate this test statistic
31:34under the null hypothesis, instead of doing a permutation
31:39we're gonna do a slightly more sophisticated
31:41resampling operation, where we're going to go through,
31:45and for every cell, we are going to resample whether
31:50or not it received the given perturbation, but conditionally
31:55on the specific technical factors that were in that cell.
31:59And here we're using crucially the information that we have
32:03a handle on what this sort of propensity score is.
32:07And then we're just going to recompute
32:10the same test statistic on the resampled data.
32:14And then we're just gonna define a P-value
32:17in the usual way for a resampling based procedure.
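The resampling procedure just described can be sketched as follows. This is a simplified illustration, not the implementation used in the talk: the propensities P(X_i = 1 | Z_i) are taken as known (here, the true values from a simulation with made-up rates), the test statistic is a plain difference in means, and the p-value is left-sided. All function names are hypothetical.

```python
import math
import random
from statistics import mean

def poisson(rng, lam):
    """Draw a Poisson(lam) variate with Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def mean_difference(y, x):
    """Test statistic T: mean of y where x == 1 minus mean where x == 0."""
    treated = [yi for yi, xi in zip(y, x) if xi == 1]
    control = [yi for yi, xi in zip(y, x) if xi == 0]
    if not treated or not control:
        return 0.0
    return mean(treated) - mean(control)

def crt_p_value(y, x, propensity, statistic=mean_difference, n_resamples=300, seed=0):
    """Conditional randomization test with a left-sided p-value.

    propensity[i] is the (assumed known) probability that X_i = 1 given Z_i.
    """
    rng = random.Random(seed)
    observed = statistic(y, x)
    as_extreme = 0
    for _ in range(n_resamples):
        # Resample each cell's perturbation conditionally on its technical factors.
        x_resampled = [1 if rng.random() < p else 0 for p in propensity]
        if statistic(y, x_resampled) <= observed:
            as_extreme += 1
    return (1 + as_extreme) / (1 + n_resamples)

def simulate(effect, n_cells=1000, seed=2):
    """Toy cells: depth confounds X and Y; `effect` scales expression when X = 1."""
    rng = random.Random(seed)
    y, x, propensity = [], [], []
    for _ in range(n_cells):
        depth = rng.uniform(500, 5000)
        p = 1 - math.exp(-depth / 2000)
        xi = 1 if rng.random() < p else 0
        rate = 0.002 * depth * (effect if xi == 1 else 1.0)
        y.append(poisson(rng, rate))
        x.append(xi)
        propensity.append(p)
    return y, x, propensity

# A simulated knockdown that halves expression in perturbed cells.
y, x, propensity = simulate(effect=0.5)
p_knockdown = crt_p_value(y, x, propensity)
print(p_knockdown)
```

Because each resampled perturbation vector is still tied to depth through the propensities, the null distribution of the statistic carries the same confounding as the observed data, which is what a plain permutation would lose.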
32:20So one way of thinking about it is
32:23that it's kind of like a permutation test, but it's one
32:27in which the reassignments of the guide RNAs
32:31to the cells is one that respects
32:37the confounding that there is
32:40in the data instead of treating all the cells as exchangeable.
32:45So this is great because the CRT adjusts
32:51for confounders basically by construction and importantly
32:55it avoids assumptions on the gene expression distribution.
32:59And in fact, provably, the P-value you get
33:01out of the CRT is valid, even if essentially,
33:07even if the test statistic T is anything you want.
33:13So in that sense it kind of addresses
33:15the confounding issues, basically the Achilles heel
33:19of the non-parametric methods, while avoiding assumptions
33:23on the gene expression distribution,
33:25which was sort of the pitfall of the parametric methods.
33:27And it kind of seems to be doing something
33:30that's avoiding both of those issues.
33:33Now, of course, there's a trade-off in that the
33:37CRT does require you to have some estimate
33:40of this propensity score.
33:42So, and then secondly, the CRT is computationally expensive
33:48if you compare it to
33:51just a parametric regression. Here
33:53we're doing a parametric regression
33:55but we're doing it lots of times.
33:57And so how do we get around some of these issues?
34:01So, and in particular, how do we actually go
34:05about applying this idea to single cell CRISPR screens?
34:08And so, firstly, do we understand this distribution
34:13of the probability of observing a guide RNA
34:17given a set of technical factors?
34:20So what we're going to do in this particular method,
34:25well, first we're gonna observe that
34:27this is kind of a simpler phenomenon than gene expression;
34:31guide RNAs are not really subject
34:33to all of the complicated regulatory patterns of genes.
34:37And secondly, kind of under the hood,
34:40the actual assortment of guide RNAs
34:45to cells is, you know, fairly well modeled.
34:48It's just basically, in that sense
34:52the cells are pretty exchangeable.
34:53What's not exchangeable is just basically
34:55this measurement process.
34:56So this is just kind of a simpler object
34:59in the specific case of single cell CRISPR screens.
35:03So we can try to bring
35:04to bear various knowledge to try to get a good sense
35:08of this. In this case,
35:10we're just gonna do the easiest thing possible
35:12and fit it using a logistic regression.
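A minimal Python sketch of fitting perturbation probabilities with a logistic regression. It uses a single technical covariate and plain gradient ascent; the covariate, function names, step size, and iteration count are assumptions made for illustration, and a real analysis would use several covariates and a statistics library.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(z, x, n_iter=500, lr=0.5):
    """Fit P(x = 1 | z) = sigmoid(b0 + b1 * z) by gradient ascent
    on the mean log-likelihood. z is one (standardized) covariate."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for zi, xi in zip(z, x):
            r = xi - sigmoid(b0 + b1 * zi)  # residual on the probability scale
            g0 += r
            g1 += r * zi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def propensity_scores(z, b0, b1):
    """Fitted per-cell perturbation probabilities."""
    return [sigmoid(b0 + b1 * zi) for zi in z]
```

The fitted probabilities are what gets used to redraw the perturbation column in the resampling step.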
35:17The second thing we're going to do is think
35:19about what test statistic to use.
35:21So I had a separate paper about, essentially, the power
35:27of the conditional randomization test.
35:29And what we found is that the closer the test statistic is
35:33to the true conditional distribution of Y given X and Z,
35:38I guess I should say the true likelihood,
35:40the better the power will be.
35:41And so in that sense, what we wanna do is we
35:45wanna leverage existing models that people have used such
35:49as negative binomial regression.
35:51It's not going to matter whether the model is true or not
35:54for the sake of type one error control, but we hope
35:59that we can do a better job in terms of power
36:02by trying to get a good model for this.
36:07And finally, how do we mitigate the computational cost?
36:10And so we had a few ideas for this as well.
36:12So one of them is called the distilled CRT.
36:15And so, if time permits, which it might or might not,
36:19I'll give you a few more details
36:20about how you can use this to make
36:25every resample much quicker.
36:28And then we're also going to use this hack. Essentially,
36:32what we found is that the resampling distribution
36:36actually looks pretty reasonable.
36:40It kind of looks like a normal, but it sort
36:43of has some extra skew and maybe some extra heavy tails.
36:46And so what we're gonna do is we're going to
36:48fit a skew-t distribution to, essentially,
36:52the empirical distribution of the resampled test statistics.
36:55And in that way, we can get more accurate P-values
36:58without doing as many resamples.
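A small Python sketch of why fitting a parametric curve to the resampled statistics helps. With B resamples, a purely empirical p-value can never be smaller than 1/(B+1), while a fitted distribution can report much smaller tail areas. Note that this sketch swaps in a plain normal fit for the skew-t described in the talk, because the Python standard library has no skew-t fitting routine; the function names and the right-tailed convention are assumptions.

```python
import math
from statistics import mean, stdev


def empirical_p(observed, resampled):
    """Right-tailed empirical p-value; bounded below by 1 / (B + 1)."""
    exceed = sum(1 for t in resampled if t >= observed)
    return (1 + exceed) / (1 + len(resampled))


def fitted_normal_p(observed, resampled):
    """Right-tail area under a normal fitted to the resampled statistics.

    Stand-in for the skew-t fit: a normal cannot capture the extra skew
    or heavy tails, so this only illustrates the mechanics."""
    mu = mean(resampled)
    sigma = stdev(resampled)
    z = (observed - mu) / sigma
    # erfc keeps precision far out in the tail, unlike 1 - cdf.
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

The design point is that the fitted curve extrapolates beyond the range of the resamples, so a few hundred resamples can still yield p-values small enough to survive a multiple testing correction.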
37:01And so putting together all of these pieces
37:03we get this method, which we call Sceptre
37:06or single cell perturbation screen analysis
37:08via conditional resampling.
37:11And so essentially what we do is what I said
37:13on the previous slide.
37:15We first use a logistic regression to fit a probability
37:19for every cell that we would find a perturbation there.
37:23And then we're gonna use these perturbation probabilities
37:26and resample this particular column.
37:29And so now we have a whole bunch of resampled datasets.
37:32Now we're going to use a negative binomial regression,
37:35or more precisely a distilled negative binomial regression
37:38for speed, to get the test statistic
37:42for both the original data
37:43and for all of these resampled datasets.
37:47Then we're gonna put together all
37:48of these resampled test statistics into this gray histogram.
37:51And again, we're gonna fit this magenta curve,
37:54which is the skew-t distribution,
37:57which seems to fit pretty well in most cases.
37:59And then we're gonna compare the original test statistic
38:02against this skew-t distribution and get a P-value that way.
38:07And so this is represented by the shaded region here.
38:10And I think what's noteworthy is to compare this fitted
38:14null distribution
38:15to this standard normal null distribution.
38:19I guess I should have said here
38:20that the actual test statistics are Z values extracted
38:24from the negative binomial regression.
38:26So if your model were true, the Z values
38:30under the null would follow a standard normal distribution.
38:34And so what we find is that when we resample, we
38:37get something that's not the standard normal distribution.
38:40And so in that sense you can view it as
38:42a sort of measure of the departure from,
38:48or sort of the lack of model fit that went
38:51into, this negative binomial regression.
38:54So another way of putting this is that
38:57you can imagine that if you did happen
38:59to correctly specify your negative binomial regression model
39:03then you would sort of be getting back the same P-value
39:06that you would have gotten otherwise.
39:08So in that sense
39:09we're not really reinventing the wheel here
39:11if you do have a good parametric model, but if you don't
39:13then we can correct for it using this resampling strategy.
39:18So I guess this is an important slide
39:19so maybe I will stay here for a little bit and ask
39:23if anyone has questions about how our methodology works.
39:31- Hi, I have a question.
39:33So have you tried a hurdle model to deal
39:36with this kind of data, because
39:40of the weird distribution of the data?
39:45- Oh, so let's see.
39:48You mean to model, essentially, the gene
39:52expression, or do you mean to model the CRISPR perturbations?
39:58- From this page,
40:03so in the first step you use a logistic regression
40:05and then you use a negative binomial.
40:08So it's like a two-step model, but the hurdle model
40:13combines them together to deal with the overall dataset.
40:18- I see, I will admit that I'm not familiar with those
40:21models but I will definitely take a look
40:24at those and see if they might be applicable.
40:28Yeah, I guess in this sense
40:33the approach that I've proposed here is pretty flexible.
40:37I mean, really,
40:37this approach works well as long
40:42as you have a decent approximation
40:44to these perturbation probabilities;
40:46we're thinking about them as propensity scores.
40:48So aside from that,
40:52really what's standing behind this is the generality
40:55of the conditional randomization test, where
40:56you can basically use any test statistic you want.
40:58And so, definitely the method is flexible
41:02and can incorporate different choices,
41:05like the one that you've mentioned,
41:08But we haven't tried it.
41:10I'm not familiar with this model.
41:12Thank you though.
41:15Anyone else have any questions about the methodology?
41:24Okay, perhaps I'll okay.
41:27So yes, so this is kind of like a separate thing
41:31which I will not get
41:33into the details of for the sake of time, but we had
41:36a separate paper whose focus was just basically,
41:39the conditional randomization test is a cool test
41:42but everyone knows it's slow.
41:43So how can we essentially accelerate it
41:46while retaining a lot of its power advantages?
41:49And so what we found is that
41:51if you just ever so slightly modified the test statistic
41:54by sort of regressing Y first on the confounders,
41:59and then on X, instead
42:02of regressing it on both at the same time
42:04what we found is that this ends up being much, much faster
42:08because only the second step needs to be repeated
42:10upon resampling, and the second step is much cheaper.
42:14So what we did, in the context of Sceptre, is
42:19we built on this
42:20by accelerating the resampling steps even further
42:23by leveraging the sparsity
42:25of the CRISPR perturbation vector X.
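A minimal Python sketch of the distillation idea: do the expensive regression of Y on the confounders once, then make each resample cheap by computing a statistic that only touches the perturbed cells. This sketch swaps in ordinary least squares with a single confounder for the negative binomial regression, and the function names and two-sided comparison are assumptions made for illustration.

```python
import random


def ols_residuals(z, y):
    """Expensive step, done once: residuals of y after regressing
    on one confounder z (with intercept)."""
    n = len(y)
    zbar = sum(z) / n
    ybar = sum(y) / n
    sxx = sum((zi - zbar) ** 2 for zi in z)
    slope = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / sxx
    intercept = ybar - slope * zbar
    return [yi - intercept - slope * zi for zi, yi in zip(z, y)]


def sparse_statistic(perturbed_idx, residuals):
    """Cheap step, repeated per resample: mean residual over the
    perturbed cells only, so the cost scales with their number."""
    if not perturbed_idx:
        return 0.0
    return sum(residuals[i] for i in perturbed_idx) / len(perturbed_idx)


def distilled_crt_p_value(x, z, y, propensity, n_resamples=500, seed=0):
    rng = random.Random(seed)
    residuals = ols_residuals(z, y)  # computed once, reused for every resample
    observed_idx = [i for i, xi in enumerate(x) if xi == 1]
    observed = abs(sparse_statistic(observed_idx, residuals))
    extreme = 0
    for _ in range(n_resamples):
        # Drawing indices is still a pass over all cells in this sketch;
        # only the statistic itself exploits sparsity.
        idx = [i for i, p in enumerate(propensity) if rng.random() < p]
        if abs(sparse_statistic(idx, residuals)) >= observed:
            extreme += 1
    return (1 + extreme) / (1 + n_resamples)
```

Because the residuals do not depend on the perturbation column, nothing about the confounder regression has to be refit inside the loop.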
42:28And so perhaps the most important part is that the cost
42:31of the CRT for one gene-enhancer pair went
42:34down from 25 minutes down to 20 seconds
42:38as a result of these computational accelerations.
42:40And so for reference, a single negative binomial
42:43regression took three seconds.
42:45So still,
42:46we're a factor of six or seven more expensive than
42:50just the sort of vanilla single regression,
42:53but it's definitely, I think,
42:56within an order of magnitude,
42:58and hopefully, as you can tell, much better statistically.
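The timing figures quoted above can be checked with a few lines of Python. The numbers are the ones stated in the talk; the variable names are illustrative.

```python
crt_before_s = 25 * 60      # CRT for one gene-enhancer pair, before acceleration
crt_after_s = 20            # after the computational accelerations
single_regression_s = 3     # one negative binomial regression

speedup = crt_before_s / crt_after_s            # how much faster the CRT became
overhead = crt_after_s / single_regression_s    # cost relative to one regression

print(f"speedup: {speedup:.0f}x, overhead vs. single regression: {overhead:.1f}x")
```

The overhead ratio falls between six and seven, matching the "factor of six or seven" mentioned above.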
43:03So I will show you a few, so this is a simulation.
43:12I'm not gonna go through it in detail, but the idea is
43:14that what we're demonstrating here is that you can give
43:19Sceptre essentially negative binomial models
43:22that are misspecified in different ways.
43:25You can give it a dispersion
43:27that's too large, a dispersion that's too small
43:30or maybe the true model does have zero inflation
43:32but we're not accounting for it.
43:34And what we find is that Sceptre essentially
43:36is well calibrated, regardless,
43:40whereas if you just essentially took
43:43the wrong dispersion estimates at face value,
43:47you would encounter problems.
43:49And this scMAGeCK approach,
43:51which basically is a permutation approach,
43:54is just sort of not doing a great job accounting
43:56for the confounding, so we see this inflation.
44:01So perhaps more excitingly
44:03I'd like to show you an application to real data.
44:08So firstly,
44:11we wanna make sure the method is actually calibrated.
44:13So if you remember, the initial observation was
44:16that a lot of these methods aren't calibrated.
44:18So because I'm running a little short on time
44:20let's kind of maybe ignore this panel here
44:22and focus our attention here.
44:24So this is the Gasperini data that I introduced before.
44:28And so this red line here is actually the QQ plot you saw
44:34on one of my first slides
44:35of all of those negative control gene-enhancer pairs.
44:39It looks different here because
44:41I've sort of cut off the scale
44:43so we can actually visualize it.
44:45So we see a quite significant departure.
44:48What we actually did is we thought, okay, maybe
44:51they have a bad estimate of the dispersion
44:54but maybe we can use some more
44:55like state-of-the-art single cell sort of methods
45:00to improve our estimate of the dispersion.
45:03And so maybe we don't need to go
45:04to all the effort of doing the resampling.
45:06And so what we found is that
45:07when we use a state-of-the-art dispersion estimate
45:10we still have very substantial miscalibration.
45:14This is, I think, just a testament
45:15to the fact that it's just hard to estimate that parameter
45:18because there's not all that much data to estimate it.
45:21And then by comparison, we built Sceptre
45:24from the same exact negative binomial model
45:27which is this improved one,
45:29and we found that the negative control P-values
45:32are I think, excellently calibrated.
45:36So this shows you, again, the benefit
45:40of this different way of calibrating your test statistic
45:42and not relying on the parametric model for gene expression.
45:47So this figure just shows a few of the other methods
45:50but for the sake of time, I'm going to move on.
45:55This is looking at positive control data.
45:58So this basically is like, trying to get a sense of power.
46:01And so, again, maybe if we restrict our attention
46:04to this left panel here, what we found is that
46:06if we just plot our P-values
46:10versus the P-values... by the way, maybe I should say
46:12what a positive control is.
46:14A positive control
46:15in this case is a CRISPR perturbation that, instead
46:18of targeting an enhancer, is targeting the transcription
46:22start site of a gene.
46:25And so essentially, like we don't need any extra biology
46:29to know that, if you target a transcription start site
46:32that's really going to knock out the gene.
46:34And so you can still try to do your association test and see
46:37if you've picked up those positive control associations.
46:40And so what we find is that
46:41actually Sceptre not only is better calibrated
46:44but it also tends to have more significant P-values
46:48on those positive controls.
46:49So it apparently is boosting both the sensitivity
46:53and the specificity of this association test.
46:57- Eugene, here, the original empirical P-value, is this
47:00from the negative binomial test?
47:03So after you did the conditional randomization,
47:08you actually have better P-values
47:11for the positive control pairs?
47:13- Yes, so you would expect, it's like,
47:22aren't we just making the P-values
47:23less significant
47:25in a way to just help with the calibration?
47:27So how can it be boosting power?
47:29But the degree of inflation sort of varies.
47:34Essentially, and we'll see this
47:37on the next slide as well, I think,
47:40what Sceptre is doing is not
47:43a monotone transformation of things.
47:46Maybe just to illustrate it,
47:50I think this is an example where essentially what
47:59we would have gotten from the sort
48:01of vanilla negative binomial analysis is the area
48:04under this dotted or dashed curve here.
48:07And so Sceptre could, well, basically, whoops, sorry,
48:11it could have a lighter tail, as it has in this case.
48:15And so it could either make the P-values
48:20more significant or less significant.
48:22It's correcting the miscalibration
48:24but not necessarily in a way that's like conservative.
48:26And so this is encouraging.
48:32Yeah, that's a good question though.
48:35- I guess that depends on
48:37the confounding you included in the model.
48:41So then I would expect it would reduce the significance,
48:47but if you include other confounding,
48:50that's mostly contributing to the noise level, probably.
48:55- Yeah, sure, so I think I'm right.
48:59Yeah, let me think, let me see.
49:03I think in this case, we're correcting
49:05for approximately the same confounders here.
49:08So they already had some confounders
49:10that they were correcting for
49:11in the original negative binomial.
49:12So in that sense, it's a little bit more
49:14of maybe an apples to apples comparison.
49:16It's just a question of how do you calibrate
49:20that test statistic that is
49:21trying to correct for the confounders
49:23but I think what you're getting at
49:25I do think it can go either way.
49:27It's not obvious that Sceptre would make a P-value
49:29either more or less significant.
49:32I think I will say just as a small detail here
49:34that in addition to the negative binomial regression
49:38this P-value, it says,
49:40there's this strange word empirical here.
49:42What it means is that
49:43they've also applied a fix that they had,
49:47because they realized that they had the miscalibration,
49:48and then they kind of, like, smashed all
49:50of their P-values.
49:52So in that sense
49:55it's not an apples to apples comparison,
49:57but what we're doing is we're comparing
49:58to the P-values that were actually used
50:00for the analysis in this, in this paper.
50:02So maybe that makes it even harder to compare, but yes.
50:06So take this plot with a grain of salt, if you will.
50:10Perhaps I think the most exciting part is
50:14actually applying this to new gene-enhancer pairs
50:18where we don't know necessarily what the answer is.
50:21And so this plot
50:24is actually, I guess,
50:27similar to the plot we saw here,
50:30except now we're looking at the candidate enhancers.
50:33And so essentially the different colors.
50:36So firstly, this also just shows you
50:38that this is very much not a monotonic transformation.
50:42Like, if you look into this quadrant,
50:47this is an example where the original P-value was
50:51not at all significant, but according to Sceptre
50:54it can be very significant, and vice versa.
50:58So essentially I've just kind of highlighted
51:01those gene-enhancer pairs that were,
51:03found by one method and not the other.
51:05And so the upshot is that there's a total
51:09of roughly 500 or so found.
51:12Well, I guess Sceptre found 563;
51:15of those, 200 were new in the sense
51:18that they were not found by the original analysis.
51:21And then 107 were found by the original analysis
51:24but were not found by us.
51:26And we have strong reasons to believe
51:28that these could be false positives based
51:30on exactly the sorts of miscalibration that I presented.
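The discovery counts quoted above imply two further numbers by simple set arithmetic. The sketch below derives them in Python; the three inputs come from the talk, while the derived totals are not stated by the speaker and assume both analyses were run on the same candidate pairs.

```python
sceptre_total = 563   # pairs found by Sceptre
sceptre_only = 200    # found by Sceptre but not by the original analysis
original_only = 107   # found by the original analysis but not by Sceptre

shared = sceptre_total - sceptre_only    # pairs found by both analyses
original_total = shared + original_only  # implied total for the original analysis

print(f"shared: {shared}, implied original total: {original_total}")
```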
51:35We did look at a few specific new discoveries here
51:38and found that they were corroborated by eQTL data
51:43and, for those of you who are familiar,
51:45enhancer RNA correlation data. Since I'm running low
51:48on time, I don't have time to explain this to you,
51:51but these are all P-values
51:53of association based on orthogonal functional assays.
51:58Also, we found that our discoveries were more enriched
52:01for biological signals in a few different ways.
52:04One of them is that, and again,
52:06I'm sort of maybe going a little bit
52:08more quickly here 'cause I'm about to run out of time
52:11but there are these things called topologically
52:13associating domains, which are basically regions
52:16in the genome within which most
52:19of these regulatory interactions are thought to occur.
52:22And so what we find is that a greater fraction
52:25of the gene-enhancer pairs we found, compared
52:27to the original analysis, did lie
52:30in the same topologically associating domain.
52:32So in this case, 74% versus
52:35the 71% found in the original analysis.
52:37So in this sense, I mean, it's just kind of
52:39like a first order sense of biological plausibility.
52:44I think people are starting to think
52:46that there are interactions that are sort of
52:48outside of TADs as well.
52:49So I don't think this is a signal that
52:5226% of these things are false discoveries,
52:55but we definitely do expect a high degree
52:59of enrichment for within-TAD interactions.
53:05Also, if you do look
53:07at some of these more circumstantial pieces
53:09of evidence for regulation, such as things
53:14like transcription factor binding or histone modifications,
53:19we can use ChIP-seq to essentially assess,
53:23for any given enhancer,
53:27whether there are these kinds of signatures of regulation.
53:33And so what we found is that we did a little bit
53:35of an enrichment analysis where we looked at all
53:38of those enhancers that were found to be paired
53:40to genes by Sceptre versus the original method
53:43and looked to what extent they were enriched
53:46for these other signatures,
53:49these ChIP-seq based signatures of regulation.
53:52And what we found is that
53:53across eight of these ChIP-seq targets, and by the way
53:58these eight are not selected,
53:59these actually were the exact eight ChIP-seq targets
54:03that they examined in the original paper,
54:06we found greater enrichment.
54:08So in this sense, also, the enhancers being
54:13picked up by Sceptre are just more biologically
54:15plausible using these orthogonal kinds of assays.
54:19So I find this very exciting
54:21and I'm just gonna maybe make a few remarks
54:25and hopefully there's just a little bit
54:26of time for questions.
54:27I will also be around for a few minutes after the seminar
54:30if anyone wants to stick around and ask me questions.
54:33You also might have your next thing to go to,
54:35so I understand if not.
54:37But maybe the summary is that mapping gene-enhancer
54:41regulatory relationships is very important
54:43if we wanna translate GWAS hits into disease insights.
54:47And there's been this very exciting new technology
54:50that allows us to answer that question.
54:53This technology was proposed very recently,
54:56and so there aren't that many methods out there
55:00to analyze these kinds of data.
55:02And so what we did with Sceptre is we leveraged recent
55:05methodological advances in statistics to overcome the primary
55:08limitations of the parametric and non-parametric analysis
55:11methods that were available.
55:13And finally, we applied it to the largest existing
55:17data set of this kind.
55:19And what we get is a greater number of more biologically
55:22meaningful regulatory relationships.
55:25So I had a few other discussion slides, maybe I'll just
55:28read the title to you without getting into the details
55:31but this is a rapidly developing technology.
55:34And we do foresee that Sceptre will be applicable
55:37to future iterations of the technology.
55:40So that's promising.
55:42And secondly, this is more like the beginning
55:46of the road than the end of the road.
55:47There are lots of remaining challenges.
55:51This includes looking for interactions
55:53among enhancers, things like dealing
55:56with multiple guide RNAs in the same enhancer.
56:00There's just basically a whole, I would say, playground
56:02of statistical problems that have yet to be addressed.
56:07So maybe finally, if you'd like to learn more
56:11we have a preprint on bioRxiv.
56:13I wanna acknowledge my co-authors again.
56:16And finally, Tim has worked very hard
56:20on making this an R package, so
56:24you can find it on GitHub,
56:26and I'm very happy to take questions now
56:29but if you have any burning questions that come
56:32to you 30 minutes after my talk
56:35please feel free to email me at this address.
56:37So thank you, and I should have said at the top, thank you
56:40Lexi for the invitation.
56:42- Thank you for agreeing to present your work here.
56:45It's really a nice talk.
56:47- Yeah Thank you.
56:49- So I have a question that's maybe less related
56:52to your current work, but maybe interesting to consider.
56:56I am not sure.
56:57Have you looked at the correlation structure
56:59within the X matrix?
57:03- Yeah, so essentially my sense is that it's
57:07like a factor model where you have all
57:11of these sort of confounders that are inducing correlation
57:16among all the X's, but essentially, once you account
57:21for that confounding, it's independent.
57:25- I see (indistinct) correlation.
57:29- So it's fairly small correlation, and essentially
57:33this is very different from,
57:36for example, genome-wide association studies.
57:38So it's like, oh,
57:39is there some analog of linkage disequilibrium?
57:40And the key difference here is that
57:43it's essentially a designed experiment.
57:46So even though you're not controlling exactly
57:49which cells receive what perturbations, you are
57:51basically assigning them at random.
57:54So if it weren't
57:55for this sort of pesky measurement mechanism business,
57:58it would be an unconfounded problem.
58:01So essentially the only correlations are coming
58:05from this measurement.
58:08Yes, so that is a great question,
58:10because you can ask, well,
58:11how did I do the sleight of hand around,
58:13like, slide three, where all of a sudden I was working
58:15with like one enhancer,
58:17and where did all the rest of them go?
58:18And I think we're actually not losing all too much
58:21by doing this, especially
58:23since we are controlling for those technical factors.
58:25- Yeah thanks that makes sense to me.
58:28And another thing, maybe less
58:31statistical, is how many confounding factors
58:34you are controlling for
58:35and what are the important ones that you have identified?
58:39- Yeah, I mean, so in this case
58:41we're essentially following the lead of
58:45the original paper
58:46for which confounding factors we control for.
58:48So in addition to sequencing depth,
58:51yeah, they do have a batch effect,
58:52and there's also something called percent mitochondrial.
58:55So it's like what fraction of all the reads that you got
58:58in this particular cell came from mitochondrial DNA
59:02as opposed to regular DNA, and maybe a few others,
59:09like just the total number
59:10of genes expressed in the cell, things of this nature.
59:13So I think here we're correcting
59:15for about five, but you could think of other things
59:18like cell cycle. K562 is a pretty
59:25homogeneous cell line, but especially
59:27once you get to other kinds of tissue samples,
59:30you might need to think about cell type
59:33and things of this nature.
59:35So I think there are lots to consider here,
59:38we used kind of five easy ones.
59:42- Okay, thanks.
59:44Any more questions for Eugene?
59:48Yeah, I think we are approaching
59:52the end of the seminar.
59:55So thanks again for your great talk.
59:58And if you have any further questions
01:00:01you can just send emails to Eugene offline.
01:00:05- Yes, yes, definitely don't hesitate to reach out.