# YSPH Virtual Biostatistics Seminar: Statistical analysis of single cell CRISPR screens

February 03, 2021

## Information

Eugene Katsevich, Ph.D.

Assistant Professor in the Department of Statistics

The Wharton School at the University of Pennsylvania

Tuesday, 2/2/2

ID6158


- 00:03- First, good afternoon, everyone,
- 00:05and I hope you somehow managed to enjoy your winter break
- 00:10in this special time.
- 00:11And this is our first talk, seminar talk this semester,
- 00:16and we have invited Dr. Eugene Katsevich
- 00:19from Wharton School at UPenn.
- 00:22And he's going to present something really exciting,
- 00:26I know his original work on statistical analysis
- 00:32of single cell CRISPR screening.
- 00:34And I will hand it over to Eugene from now, from here.
- 00:40And, but if Eugene wanted to start or wait one
- 00:43or two minutes to start, it's up to you.
- 00:46- Yeah maybe, I mean, yeah, I don't know.
- 00:51If people will filter in, maybe I'll wait another minute
- 00:53or two, 'cause I think, I feel like the first part
- 00:57of the talk is very important.
- 00:58So I think if people missed the first part of the talk,
- 01:01then it'll be maybe hard to follow along later.
- 01:04So I'm happy to wait just another minute or two.
- 01:09I understand perfectly that it's a strange time
- 01:12for everyone, so for all those who were able
- 01:15to make it today, I really appreciate
- 01:17your adjusting the schedule.
- 01:22Also maybe one remark I can make is that,
- 01:25since it is a smaller audience,
- 01:27I think we can make this seminar just about
- 01:31as interactive as you want.
- 01:32So you should definitely feel free to stop me
- 01:37at any point.
- 01:39I don't know how many of you are familiar
- 01:40with the CRISPR screen stuff I'm gonna talk about,
- 01:43but I'm very happy to just make it very interactive.
- 01:54I will maybe start sharing my screen
- 01:58and maybe I'll start launching
- 02:00into some of the introductory things.
- 02:13So...
- 02:16Oh wow, wait, is this the...
- 02:21I greatly apologize.
- 02:24Clearly, the label on my slides is wrong.
- 02:28I have updated my slides since then,
- 02:31but I think the title page has not been updated,
- 02:33that's extremely embarrassing.
- 02:40Well, maybe then I should skip past this slide very quickly.
- 02:43So hello everyone, thank you so much
- 02:47for making it to my talk.
- 02:50Today, I'll be talking about some Statistical Analysis Tools
- 02:52for Single Cell CRISPR Screens.
- 02:55So the most important thing to take away
- 02:56from this slide are my collaborators here.
- 02:59So Tim Barry is a grad student
- 03:02of mine who is actually at CMU.
- 03:05I am jointly advising him with Kathryn Roeder,
- 03:08also at CMU, who used to be my postdoc advisor.
- 03:15So I'll skip quickly to the next slide.
- 03:20So here's the motivation.
- 03:22And by the way, if anyone has joined recently,
- 03:25please just stop me at any point.
- 03:29So here's the motivation.
- 03:30So we have done lots
- 03:31and lots of genome wide association studies to date.
- 03:35So we have a lot of little markers
- 03:37along the genome that we think are associated with diseases.
- 03:41And so the question is what's the next step?
- 03:43Like how do we actually translate these
- 03:46into insights into diseases?
- 03:49And hopefully later on things like,
- 03:51therapeutics and so on.
- 03:52So what we need to do is we need to understand how
- 03:56like basically the mechanisms
- 03:58by what mechanism are these associations actually resulting
- 04:01in an increased disease risk.
- 04:03So here's a typical situation: here
- 04:05is our genome and here's a disease association
- 04:08and frequently these disease associations
- 04:11they might not take place within genes.
- 04:13And so that makes them pretty hard to interpret.
- 04:18So what's hypothesized to be the case here is that instead
- 04:24of disrupting genes directly, these variants
- 04:29are disrupting regulatory elements such as enhancers.
- 04:33So let's just briefly review here
- 04:38that an enhancer is a region of the genome
- 04:41that could be a certain distance
- 04:42from the gene and that actually folds
- 04:45in three-dimensional space to come
- 04:47in close proximity to the promoter of the gene.
- 04:52And essentially the enhancer's job is to recruit a lot
- 04:55of the machinery that actually is going to lead
- 04:57to the expression of this gene.
- 04:58So if you disrupt the enhancer
- 05:00then this will disrupt the recruitment
- 05:02of all of these different transcription factors
- 05:04which will then end up causing some trouble.
- 05:10And so, for example, in this case
- 05:15let's say that this disease association
- 05:17is disrupting enhancer one. Well, this might suggest,
- 05:21if enhancer one is regulating gene two, that
- 05:24the disease mechanism is actually proceeding through,
- 05:28or being mediated by, the expression of gene two.
- 05:32And so this would be a very great
- 05:38and clean way of interpreting GWAS hits.
- 05:41But the problem is that we don't actually know
- 05:44or we have a very hazy sense
- 05:46of which enhancers actually regulate which genes.
- 05:50So this is kind of a difficult problem
- 05:52for a few different reasons.
- 05:54The first reason is that there's a potentially
- 05:57many-to-many mapping between enhancers and genes.
- 06:00So an enhancer can regulate multiple genes
- 06:03and a single gene can be regulated by multiple enhancers.
- 06:08So the other thing is that enhancers don't even
- 06:10need to be all too close to the genes that they regulate.
- 06:13There could be situations like we saw here where
- 06:18the regulation can skip the adjacent gene
- 06:20and go to the next one.
- 06:22And so in general, regulation
- 06:24is thought to happen within about a megabase distance
- 06:30in terms of the linear distance in the genome.
- 06:33So this is a hard problem, and it's basically
- 06:36the motivating problem for this talk:
- 06:38which enhancers regulate which genes.
- 06:41This is a sort
- 06:42of a very fundamental and important problem in genomics.
- 06:47So in today's talk, I'm going to first talk about
- 06:53a new assay called a single cell CRISPR screen
- 06:56that allows us to get at this question,
- 07:03then I'm gonna talk about the challenges
- 07:06that previous methods have encountered
- 07:08in analyzing these single cell CRISPR screen
- 07:10datasets, then I'll propose a new methodology based
- 07:14on this idea of conditional resampling.
- 07:18And then I will show you how this works
- 07:20on real data and close with the discussion.
- 07:25So let me first introduce the biological assay here
- 07:28which is called the Single Cell CRISPR screen.
- 07:31So actually backing up a second,
- 07:34this is a very important problem
- 07:35and people have considered it before.
- 07:37So how do people typically approach gene-enhancer mapping?
- 07:41I think the most common approach is what I call here
- 07:46an indirect observational approach.
- 07:48And there are many of these.
- 07:50So what this picture is,
- 07:51is basically a more detailed picture
- 07:54of what happens when an enhancer pictured here comes
- 07:57into contact with the promoter of a gene.
- 08:00There are lots of kind
- 08:01of indirect signals of this regulation.
- 08:06Obviously you have just the actual expression
- 08:08of the gene, but you'll have the conformation
- 08:12of the chromatin in the vicinity of the promoter
- 08:16and in the enhancer
- 08:18you have basically transcription factor binding data.
- 08:22And all of these data are essentially indirect ways
- 08:25of trying to make a conclusion
- 08:28about which enhancers might be regulating which genes.
- 08:31So for example, using Hi-C data,
- 08:33if you find an enhancer to be in 3D contact
- 08:37with the promoter, then this could be a signal
- 08:39that there is some regulation going on.
- 08:43The issue is that these approaches have not
- 08:45proved very reliable at the end of the day.
- 08:47These are observational approaches,
- 08:49and basically even if you have
- 08:52contact in 3D space, this is not necessarily a signal.
- 08:57This doesn't necessarily mean that regulation
- 08:59is actually occurring,
- 09:00and so essentially we haven't gotten all too far
- 09:04with these indirect approaches.
- 09:05So the exciting thing is
- 09:07that recently with the development of CRISPR technology
- 09:13we can now actually go in, and instead of observationally
- 09:17just essentially taking a look inside a cell,
- 09:20we can actually go in and make modifications where we,
- 09:24for example, knock out enhancers using the system
- 09:28called CRISPR Interference.
- 09:30And then we try to look
- 09:31at what the results are for gene expression.
- 09:34So this shows you a little cartoon
- 09:38of the CRISPR interference system.
- 09:40And so the way that it works is
- 09:42that you have this Cas9 protein whose job is to attach
- 09:48to a certain segment of DNA.
- 09:51And the specific segment of DNA it attaches
- 09:53to is specified by this guide RNA.
- 09:57And so in this way,
- 09:59the attachment can be highly specific
- 10:01to the sequence of the enhancer.
- 10:04And then, for CRISPR interference,
- 10:07the Cas9 brings along with it
- 10:09all of these repressive elements that essentially knock
- 10:13out this enhancer, meaning they prevent the enhancer
- 10:17from actually helping to regulate this gene.
- 10:22And so the idea, so firstly
- 10:24this is a promising solution because it allows us to
- 10:28interrogate these regulatory relationships
- 10:30in a much more direct way
- 10:32than we've been able to do until recently.
- 10:35And so the overall idea is that,
- 10:39the idea is simple: disrupt enhancers
- 10:41and see which genes' expression drops.
- 10:44And so just as a cartoon here, let's say we knock
- 10:46out this enhancer, then we would expect
- 10:49to see the gene that it regulates be down-regulated.
- 10:54And then we can think
- 10:55about designing perturbations for multiple enhancers.
- 10:58And so if you perturb this enhancer
- 11:00then maybe you'll see a response in these two genes.
- 11:07- Very naive question, just to make sure I
- 11:09didn't misunderstand the notion here: is an enhancer always
- 11:14upregulating the gene that it regulates?
- 11:18- I think enhancers specifically
- 11:20are thought to upregulate genes.
- 11:22However, it's a good question because there are other kinds
- 11:25of elements that can actually be silencers, for example.
- 11:28And so that's just another example of a kind
- 11:32of a regulatory element.
- 11:33So the effect could go in either direction
- 11:36and in this talk I'll primarily talk about enhancers,
- 11:38but really everything I say goes through for other kinds
- 11:42of regulatory elements.
- 11:44- Thanks.
- 11:45- Yeah, very good question.
- 11:50So now the actual assay
- 11:54that allows you to do this at a large scale.
- 11:59So the scale is the question here
- 12:00because you can do CRISPR experiments where
- 12:03you essentially like knock out one enhancer
- 12:05in a whole batch of cells, and then,
- 12:08maybe go enhancer by enhancer
- 12:10and this ends up not being a very scalable approach.
- 12:13So there has been proposed
- 12:16this new assay called the single cell CRISPR screen
- 12:19in which you basically pool a whole bunch
- 12:22of perturbations together,
- 12:23and then the readout that you get is single cell
- 12:26RNA sequencing, which allows you to also basically look
- 12:30at the impact of all
- 12:31of those different enhancer perturbations
- 12:32on the entire transcriptome.
- 12:35And so in this slide, I'm gonna give you a brief overview
- 12:38of how these screens work.
- 12:40So the first thing you do is you start
- 12:42with a library of CRISPR perturbations.
- 12:44So you just, let's say maybe you take,
- 12:4910,000 enhancers across the genome
- 12:52and then you basically design CRISPR guide
- 12:55RNAs targeting each of those enhancers.
- 12:58Once you have a library of these perturbations
- 13:00you then infect a big pool
- 13:03of cells with all of these perturbations.
- 13:06And so what's important to note here is
- 13:08that essentially these perturbations get randomly integrated
- 13:13into the different cells. They're delivered through
- 13:17like a virus system;
- 13:19the details aren't very important, but the important thing is
- 13:22that these perturbations get integrated
- 13:24into cells essentially at random.
- 13:26And so each cell gets its own collection
- 13:28of CRISPR perturbations.
- 13:30So now in order to basically actually read out what happened
- 13:35in our experiment, we use single cell RNA sequencing.
- 13:38And as a result of the sequencing experiment
- 13:40we get two pieces of information; by the way,
- 13:44two pieces of information for every cell.
- 13:46So for every cell
- 13:47we first measure the perturbations that are present.
- 13:50So which of these guide RNAs did we detect,
- 13:52and then secondly
- 13:53the gene expression for the whole transcriptome.
- 13:57So this is essentially our data here.
- 13:59And then once we have this data
- 14:01we can now do the analysis component, which really ends
- 14:05up being a kind of differential expression analysis.
- 14:08So consider a particular gene-enhancer pair.
- 14:12So what we can do is we could take all of the cells
- 14:15and we can break them up into two groups.
- 14:17Those cells for which that enhancer was knocked out
- 14:20which are in orange here, and those cells
- 14:23for which that enhancer was not knocked out.
- 14:26We can then essentially look
- 14:29at the expression of the gene of interest
- 14:33and see whether there's a systematic difference
- 14:35between the expression of this gene
- 14:37in these two populations of cells.
- 14:39So, and then if there is a significant difference
- 14:43then we can make a conclusion that that particular
- 14:45enhancer is regulating that particular gene.
- 14:48So it seems quite simple on first glance,
- 14:52but this analysis part actually turns
- 14:55out to be a challenging statistical problem.
- 14:59And so the analysis
- 15:02of these screens is actually the subject of this talk.
- 15:07Okay so, maybe one more slide
- 15:10and then I'll stop and see if people have questions.
- 15:12So just to make it a little bit more concrete
- 15:16there's a kind of a large data set that might be one
- 15:19of the largest out there right now, by Gasperini et al.
- 15:22It was published in Cell last year.
- 15:24Oh wow, I guess two years ago now, in 2019,
- 15:27and so they were working with 200,000 K562 cells
- 15:31and they were looking at 6,000 candidate enhancers.
- 15:34And so they're looking at, I mean
- 15:35essentially the whole transcriptome, at least the part
- 15:38of it that has any expression in the cell type.
- 15:41And they identified 85,000 enhancer gene pairs
- 15:45that they essentially thought were plausible
- 15:50to have some regulation and in their experiment
- 15:53they had 28 perturbations on average per cell.
- 15:58And so the way that this data would look is, think
- 16:01about the rows as being the cells and then the columns.
- 16:05So you have two groups of columns.
- 16:06Firstly, you have the gene expressions,
- 16:08and so since these are single cell data
- 16:10we have these highly discrete counts
- 16:13of reads or UMIs for every gene.
- 16:18And then also we have the second bit of information
- 16:21which is a binary matrix, which tells you
- 16:23which cells received which perturbations.
- 16:27So in general, in this presentation,
- 16:29I'll denote gene expression by Y and perturbations by X.
- 16:36And so there's also a third and very important piece
- 16:38of information, which are technical factors per cell.
- 16:42Perhaps the main one that I'll talk
- 16:43about today is the sequencing depth.
- 16:46So this is just the total number
- 16:48of reads or UMIs measured from this cell.
- 16:53And so this basically just varies randomly across cells
- 16:57just as an artifact of your experiment.
- 16:58There are other technical factors
- 17:00like batch and so on and so forth.
- 17:03Okay, so this brings me to the end
- 17:06of the first section where I tell you
- 17:08about the data and the assay.
- 17:11So are there any questions before I move on
- 17:14to talking more about the analysis of these types of data.
- 17:25I'm assuming there are no questions
- 17:27but do feel free to stop me if there are.
- 17:34So as I said, this actually turns out to be kind of
- 17:37like an annoyingly challenging statistical problem.
- 17:40And so to illustrate this to you, let me first
- 17:43give you a sense of what analysis methods there
- 17:45are out there.
- 17:47I should say, by the way that given the sort of the novelty
- 17:52of this assay, there hasn't been a lot of work in terms
- 17:55of designing methods specifically designed
- 17:59for this kind of data.
- 18:01So most of the existing analysis methods are basically
- 18:04proposed by the same people who are
- 18:07producing the single cell CRISPR screen data.
- 18:10So by the way, in this slide,
- 18:14and actually for the remainder of the talk,
- 18:17I'm actually going to essentially focus our attention
- 18:20on a certain gene and a certain enhancer
- 18:25and just consider the problem
- 18:27of figuring out whether that enhancer regulates that gene.
- 18:30And so I'm gonna use Y_i to denote the expression
- 18:34of that gene in cell i, X_i as the binary indicator
- 18:38for whether that enhancer was perturbed in that cell,
- 18:41and Z_i the vector of these extra technical covariates.
- 18:48So with that notation out of the way,
- 18:52the first kind of popular method for analyzing these data
- 18:58is negative binomial regression.
- 19:00For those of you familiar with bulk RNA-seq differential
- 19:04expression analysis, this is similar to the DESeq2
- 19:08methodology where you just run a negative binomial
- 19:11regression of the gene expression, Y on a linear combination
- 19:17of the perturbation indicator, as well as all
- 19:20of your technical covariates.
- 19:23And so negative binomial is a common model for these sort of
- 19:28overdispersed count data that you encounter
- 19:31in RNA sequencing data.
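
For reference, the negative binomial regression just described can be written as below. This is a sketch in the talk's notation; the coefficient names $\beta_0$, $\beta$, $\gamma$ and the dispersion symbol $\alpha$ are labels introduced here for illustration and are not taken from the slides.

```latex
Y_i \mid X_i, Z_i \sim \mathrm{NB}(\mu_i, \alpha), \qquad
\log \mu_i = \beta_0 + \beta X_i + \gamma^\top Z_i, \qquad
H_0 : \beta = 0 .
```

Under the common NB2 parameterization the variance is $\mu_i + \alpha \mu_i^2$, so $\alpha$ controls how overdispersed the counts are relative to a Poisson model.
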
- 19:35Okay, next, there is a rank based approach.
- 19:38So this is non-parametric, and it's actually much simpler.
- 19:43You just cross-tabulate your cells by two criteria.
- 19:48First, you see whether they have the perturbation or not.
- 19:51And second, you see whether they have essentially higher
- 19:55than median expression on this gene or lower
- 19:57than median expression on this gene.
- 19:59And then you do a two by two table test for independence.
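
A minimal sketch of the rank-based idea just described, assuming a chi-square test on the two-by-two table (the talk does not say which two-by-two test the original method uses, so that choice, like the function name `median_split_test`, is an assumption made here for illustration).

```python
import math
from statistics import median


def median_split_test(expression, perturbed):
    """Chi-square test of independence on the 2x2 table
    (perturbed or not) x (expression above the median or not)."""
    med = median(expression)
    # table[i][j]: i = perturbed (0/1), j = expression above median (0/1)
    table = [[0, 0], [0, 0]]
    for y, x in zip(expression, perturbed):
        table[int(bool(x))][int(y > med)] += 1

    n = len(expression)
    row = [sum(table[0]), sum(table[1])]
    col = [table[0][0] + table[1][0], table[0][1] + table[1][1]]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return table, chi2, p_value
```

Note that nothing in this test looks at the technical factors Z, which is the point the talk returns to later.
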
- 20:05And finally there are also permutation based approaches
- 20:08where the idea is to take some test statistic
- 20:12and then calibrate it under the null distribution
- 20:16by permuting this column right here
- 20:19the assignments of the perturbations to the cells.
- 20:24So yes, that, I guess that's, what's written here.
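
A minimal sketch of a permutation-based approach, assuming a difference in mean expression as the test statistic and a left-sided test (a knockdown is expected to lower expression). Both choices, and the function names, are assumptions made here for illustration; the published methods use their own statistics.

```python
import random


def mean_difference(expression, perturbed):
    """Mean expression in perturbed cells minus mean in unperturbed cells."""
    treated = [y for y, x in zip(expression, perturbed) if x]
    control = [y for y, x in zip(expression, perturbed) if not x]
    return sum(treated) / len(treated) - sum(control) / len(control)


def permutation_p_value(expression, perturbed, n_permutations=1000, seed=0):
    """Left-sided permutation p-value: shuffle the perturbation column,
    treating all cells as exchangeable, and recompute the statistic."""
    rng = random.Random(seed)
    observed = mean_difference(expression, perturbed)
    shuffled = list(perturbed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # keeps the number of perturbed cells fixed
        if mean_difference(expression, shuffled) <= observed:
            hits += 1
    return (1 + hits) / (1 + n_permutations)
```

The shuffle step is where the exchangeability assumption enters: every cell is treated as equally likely to carry the perturbation.
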
- 20:28So okay, there's like maybe all these methods sound
- 20:36reasonable at first, but the more you actually look
- 20:39at the existing literature
- 20:40the more there are various scattered signs
- 20:43like none of these methods are like really doing the trick.
- 20:47And so here are
- 20:50the methods that I described on the previous slide.
- 20:54I don't know if I named them
- 20:55but so virtual FACS is the rank based one
- 20:57and scMAGeCK is one of the permutation based ones.
- 21:01And so you look at plots actually from
- 21:05the original papers themselves that proposed these methods
- 21:10and you see some signs of miscalibration.
- 21:13And so like, for example, I'm gonna be talking mostly
- 21:16about this data and to a lesser extent
- 21:18about this data in my talk, but so looking here
- 21:23so I guess perhaps I should first talk
- 21:24about the concept of a Negative Control Perturbation.
- 21:27So a Negative Control Perturbation is a guide
- 21:30RNA, but it's actually not designed to
- 21:33target any particular sequence along the genome.
- 21:37So you don't expect cells that are infected
- 21:40with a negative control perturbation to look any different
- 21:42from cells that have no perturbation.
- 21:46And so in this Gasperini data
- 21:50they have 50 different negative control guide RNAs,
- 21:53and so what they did is they basically plotted a QQ plot
- 21:57of all of the negative control guide RNAs,
- 22:00paired with all of the genes in the genome,
- 22:06and what they found is and perhaps on this QQ plot
- 22:09this doesn't look like a severe inflation from uniformity
- 22:12but it's important to keep in mind the scale of this Y axis.
- 22:17And so essentially this amounts
- 22:21to a massive amount of deviation
- 22:24from the uniform distribution in those P-values.
- 22:27So in other words, negative control,
- 22:31gene-enhancer pairs are looking incredibly
- 22:33significant according to this analysis.
- 22:37So in this particular analysis
- 22:40they essentially found
- 22:42the same thing here it's portrayed as a Manhattan plot
- 22:47but you see a lot
- 22:50of things reaching significance when really only
- 22:53the circled points are those that essentially were replicated
- 22:58in a bulk RNA sequencing experiment.
- 23:02And then this one finally looks like they perturbed
- 23:10lots of different enhancers and essentially looked
- 23:13at the effect on this one particular gene.
- 23:16And essentially what they found is that essentially all
- 23:19of the enhancers that they tested appeared to
- 23:22actually have an effect on the expression
- 23:25of this gene, when in fact this is biologically impossible.
- 23:29So this is clearly an issue.
- 23:32Now, these original papers clearly knew
- 23:36that there was an issue, and so for each of the papers
- 23:39they kind of have a little bit
- 23:40of an ad hoc fix in order to basically correct their P-value
- 23:45distributions, so that they look a little bit
- 23:50closer to being calibrated.
- 23:52And so I'm, I think for the sake of time
- 23:55I'm probably not going to get into exactly how
- 23:58they propose to fix their P-value distributions.
- 24:02What I will say is that we looked in detail
- 24:06especially at the strategy that they use here
- 24:08and to a lesser extent at the strategies.
- 24:10Well, actually I think here
- 24:11they basically said just not to apply their method
- 24:14to data where there are, essentially,
- 24:18too many perturbations per cell.
- 24:21So in this case, they just said, don't apply this method.
- 24:24We looked into the kinds of fixes that they proposed
- 24:26in these two papers, and they essentially
- 24:28they don't quite work in the way that you would expect.
- 24:31And so what we thought is that,
- 24:33what we'd like to do is kind of look a little deeper
- 24:37into this problem and try to ask ourselves
- 24:39why are we seeing all of these issues?
- 24:41Why do people keep running into these miscalibration issues
- 24:44and let's try to basically address those underlying issues.
- 24:50So we thought about it a little bit
- 24:53and we thought about challenges
- 24:55for both parametric and non-parametric methods.
- 24:59So for parametric methods
- 25:01this actually shouldn't really come as a surprise probably
- 25:05to most people here, gene expression is known to
- 25:08be pretty hard to model in single cells.
- 25:11So of course we have these essentially highly discrete counts,
- 25:16with lots of zeros, that are overdispersed.
- 25:20Perhaps more importantly, given how sparse the data are,
- 25:23it's actually pretty hard to get a good estimate
- 25:26of that dispersion parameter.
- 25:28And so there's currently no standard way
- 25:30of estimating that dispersion parameter
- 25:32and basically every paper, comes up
- 25:35with their own way of doing this.
- 25:40There are even just debates
- 25:41about what parametric models are appropriate for these data,
- 25:44should they be zero inflated,
- 25:46should they not be, and some genes have even been observed
- 25:50to have bi-modal expression patterns.
- 25:52So essentially all of these things are telling us
- 25:55that it's kind of hard to shoehorn
- 25:57single cell gene expression,
- 25:58into a nice, neat parametric model.
- 26:01So obviously if you have misspecification of your model,
- 26:04such as a bad estimate for a dispersion parameter,
- 26:07that very well could cause miscalibration
- 26:09of the kind that we saw.
- 26:13So next we can think about non-parametric methods.
- 26:16So maybe, obviously if these data
- 26:19are hard to model parametrically
- 26:21maybe the non-parametric methods are going to save us.
- 26:25But the observation that we made that I think is
- 26:27quite important is that these technical factors
- 26:29that I mentioned before, like sequencing depth,
- 26:32they impact not only the expressions of genes
- 26:35but also the detection of these CRISPR guide RNAs.
- 26:38So I might have led you to believe
- 26:41in one of my early slides that we can basically
- 26:43perfectly measure which cell contains
- 26:46which CRISPR perturbations, but this is actually not true.
- 26:51So single cell RNA sequencing
- 26:54it's essentially just like this kind of a sampling process.
- 26:59And so the more reads you sample from a cell
- 27:03the more likely you are to detect a guide RNA.
- 27:05And so we just essentially looked at, for example,
- 27:10this is for one of the datasets and we just made
- 27:13a scatterplot of the total number of guide RNAs detected
- 27:17per cell versus the total number of UMIs.
- 27:20So this is the sequencing depth
- 27:21and we found this extremely clear
- 27:24I guess I'm not showing you the P-value
- 27:25but this P-value was like absurdly significant
- 27:29to just basically confirm that
- 27:31if you have more sequencing depth in a cell,
- 27:33you're going to find more guide RNAs in that cell.
- 27:37And so the issue with this is
- 27:40that we basically have a confounding problem on our hands.
- 27:43So think about this graphical model that's illustrating
- 27:48what's going on
- 27:49in a single cell CRISPR screen experiment. In
- 27:52this gray box is kind of the underlying biological reality.
- 27:56Let's say we have this presence of this guide RNA
- 27:58and the expression of this gene and the guide RNA is
- 28:02or the, yeah, I guess the, the CRISPR knockdown
- 28:05of the enhancer is either affecting gene expression
- 28:08or it is not, but we read out
- 28:13some essentially imprecise measurement
- 28:18of the guide RNA presence.
- 28:19We also read out
- 28:20an imprecise measurement of the gene expression.
- 28:23And what's most important is that the technical factors such
- 28:27as sequencing depth, they're actually impacting both
- 28:30of these measurements, they're coming from the same cell.
- 28:33And so even if there is no association between the guide RNA
- 28:37and the gene, if you just basically naively look
- 28:42at the association between the measured guide RNA presence
- 28:45and the measured gene expression
- 28:46you're going to find some association.
- 28:50And so this is clearly an issue.
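
A toy simulation of the confounding just described, under made-up assumptions: sequencing depth takes only two levels, guide RNA detection probability and transcript capture probability both scale with depth, and the perturbation has no true effect on the gene. All numeric parameters and function names here are hypothetical.

```python
import random


def simulate_null_screen(n_cells=4000, seed=1):
    """Simulate cells with NO true effect of the perturbation on the gene.

    Depth scales both the chance of detecting the guide RNA and the
    measured expression count, so it confounds the two measurements."""
    rng = random.Random(seed)
    cells = []
    for _ in range(n_cells):
        depth = rng.choice([0.2, 1.0])               # low- or high-depth cell
        x = 1 if rng.random() < 0.6 * depth else 0   # guide RNA detected?
        # Measured expression: 20 transcripts, each captured w.p. 0.3 * depth.
        y = sum(rng.random() < 0.3 * depth for _ in range(20))
        cells.append((depth, x, y))
    return cells


def mean_difference(cells):
    """Mean measured expression with the guide detected minus without."""
    treated = [y for _, x, y in cells if x == 1]
    control = [y for _, x, y in cells if x == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


cells = simulate_null_screen()
naive = mean_difference(cells)
within_high = mean_difference([c for c in cells if c[0] == 1.0])
within_low = mean_difference([c for c in cells if c[0] == 0.2])
print(f"naive difference:        {naive:.2f}")
print(f"within high-depth cells: {within_high:.2f}")
print(f"within low-depth cells:  {within_low:.2f}")
```

The naive comparison should show a clearly positive difference even though there is no effect, while the comparisons within a fixed depth level should be close to zero. That is the sense in which the association disappears once you condition on the technical factor.
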
- 28:53And so essentially in order to correct
- 28:55for this confounding effect, it's very important
- 28:57to test instead of just testing independence between
- 29:01the perturbation and the expression.
- 29:04We want to test conditional independence, where
- 29:07we're conditioning on all of these technical factors.
- 29:10And so this shows you why non-parametric methods tend to
- 29:14suffer is because when you do things like permute your data
- 29:18or rank your data, there's this underlying assumption
- 29:21that all of the cells are exchangeable and you're
- 29:24using that exchangeability to build your inference on.
- 29:27And so when you do those tests, they're implicitly
- 29:30actually testing just the direct independence
- 29:34the unconditional independence.
- 29:36And so this sort of inflation we saw
- 29:39in the non-parametric methods can be explained by this
- 29:43source of confounding.
- 29:47So that's actually it for that part of my talk
- 29:51any questions about the existing methods
- 29:53and the analysis challenges
- 29:54and why there's a need to think about new methodology
- 29:57for this problem.
- 30:06Okay, I will move on.
- 30:10So this is the part of the talk where I'm going to
- 30:13propose a new analysis method for this kind of data.
- 30:19And so the key kind of idea we're gonna use is
- 30:22conditional resampling, which was proposed not by us.
- 30:30So the idea of the conditional randomization test
- 30:34well, it's actually, depending on how you look at it
- 30:37it's quite an old idea and it has some connections to
- 30:40causal inference, but it was proposed also in Candès et al.
- 30:45And essentially the setup is that you want to
- 30:48test conditional independence and you're under
- 30:52the assumption that you have a decent estimate
- 30:55of the distribution of X given Z.
- 30:57So remember X is the perturbation.
- 31:00Y is the expression and Z are the,
- 31:02essentially the confounders.
- 31:04So one way of thinking about it from a causal inference
- 31:07standpoint is let's say we know the propensity score,
- 31:12can we test whether there's a causal relationship
- 31:15between X and Y sort of controlling for these Confounders?
- 31:20So the idea of the conditional randomization test
- 31:24is the following.
- 31:26First, you take any test statistic T of your data,
- 31:31and in order to calibrate this test statistic
- 31:34under the null hypothesis, instead of doing a permutation
- 31:39we're gonna do a slightly more sophisticated
- 31:41resampling operation, where we're going to go through,
- 31:45and for every cell, we are going to resample whether
- 31:50or not it received the given perturbation, but conditionally
- 31:55on the specific technical factors that were in that cell.
- 31:59And here we're using crucially the information that we have
- 32:03a handle on what this sort of propensity score is.
- 32:07And then we're just going to recompute the test
- 32:10the same test statistic on the resampled data.
- 32:14And then we're just gonna define a P-value
- 32:17in the usual way for a resampling based procedure.
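
The resampling steps above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes the per-cell probabilities P(X_i = 1 | Z_i) are already available in a list called `propensity`, uses a left-sided p-value with the usual plus-one correction, and plugs in a simple difference in means as the statistic. All names are hypothetical.

```python
import random


def crt_p_value(y, x, propensity, statistic, n_resamples=500, seed=0):
    """Conditional randomization test p-value (left-sided).

    propensity[i] is an estimate of P(X_i = 1 | Z_i). Each resample draws a
    new perturbation indicator for every cell from that conditional
    distribution and recomputes the same test statistic."""
    rng = random.Random(seed)
    observed = statistic(y, x)
    hits = 0
    for _ in range(n_resamples):
        x_new = [1 if rng.random() < p else 0 for p in propensity]
        if statistic(y, x_new) <= observed:
            hits += 1
    return (1 + hits) / (1 + n_resamples)


def mean_difference(y, x):
    """Example statistic; returns 0.0 if a resample leaves a group empty."""
    treated = [yi for yi, xi in zip(y, x) if xi]
    control = [yi for yi, xi in zip(y, x) if not xi]
    if not treated or not control:
        return 0.0
    return sum(treated) / len(treated) - sum(control) / len(control)
```

Compared with a permutation test, the only change is the resampling line: each cell gets its own probability of carrying the perturbation instead of all cells being shuffled as if exchangeable. With every propensity equal to the same constant, the two procedures behave very similarly.
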
- 32:20So one way of thinking about it is
- 32:23that it's kind of like a permutation test, but it's one
- 32:27in which the reassignments of the guide RNAs
- 32:31to the cells is one that respects
- 32:37the confounding that there is
- 32:40in the data instead of treating all the cells as exchangeable.
- 32:45So this is great because the CRT adjusts
- 32:51for confounders basically by construction and importantly
- 32:55it avoids assumptions on the gene expression distribution.
- 32:59And in fact, provably, the P-value you get
- 33:01out of the CRT is valid, even if essentially,
- 33:07even if the test statistic T is, anything you want.
- 33:13So in that sense it kind of addresses
- 33:15the confounding issues, basically the Achilles heel
- 33:19of the non-parametric methods, while avoiding assumptions
- 33:23on the gene expression distribution,
- 33:25which was sort of the pitfall of the parametric methods.
- 33:27And it kind of seems to be doing something
- 33:30that's avoiding both of those issues.
- 33:33Now, of course, there's a trade-off: the
- 33:37CRT does require you to have some estimate
- 33:40of this propensity score.
- 33:42So, and then secondly, the CRT is computationally expensive
- 33:48if you compare it to, like,
- 33:51just a parametric regression. Here
- 33:53we're doing a parametric regression
- 33:55but we're doing it lots of times.
- 33:57And so how do we get around some of these issues?
- 34:01So, and in particular, how do we actually go
- 34:05about applying this idea to single cell CRISPR screens?
- 34:08And so, firstly, do we understand this distribution,
- 34:13the probability of observing a guide RNA
- 34:17given a set of technical factors?
- 34:20So what we're going to do in this particular method,
- 34:25well, first we're gonna observe that
- 34:27this is kind of a simpler phenomenon than gene expression.
- 34:31Guide RNAs are not really subject
- 34:33to all of the complicated regulatory patterns of genes.
- 34:37And secondly, kind of under the hood,
- 34:40the actual assortment of guide RNAs
- 34:45to cells is, you know, fairly well modeled.
- 34:48Basically, in that sense
- 34:52the cells are pretty exchangeable.
- 34:53What's not exchangeable is just basically
- 34:55this measurement process.
- 34:56So this is just kind of a simpler object
- 34:59in the specific case of single cell CRISPR screens.
- 35:03So we can try to bring
- 35:04to bear various knowledge to try to get a good sense
- 35:08of this. In this case,
- 35:10we're just gonna sort of do the easiest thing possible
- 35:12and we're gonna fit it using a logistic regression.
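As a rough sketch of that logistic regression step, the block below fits the probability of observing a perturbation from a single technical factor using Newton's method in plain Python. The single covariate, the simulated data, and all names are assumptions for illustration; a real analysis would use several technical factors and a statistics library.

```python
import math
import random

def fit_logistic(z, x, n_iter=25):
    """Fit P(x = 1 | z) = sigmoid(a + b * z) by Newton's method (one covariate)."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        # Gradient (g) of the log-likelihood and negative Hessian (h).
        g0 = g1 = h00 = h01 = h11 = 0.0
        for zi, xi in zip(z, x):
            p = 1.0 / (1.0 + math.exp(-(a + b * zi)))
            w = p * (1.0 - p)
            g0 += xi - p
            g1 += (xi - p) * zi
            h00 += w
            h01 += w * zi
            h11 += w * zi * zi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system h * step = g.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: z stands in for a standardized technical factor.
rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(2000)]
true_a, true_b = -2.0, 1.0
x = [1 if rng.random() < 1.0 / (1.0 + math.exp(-(true_a + true_b * zi))) else 0
     for zi in z]

a_hat, b_hat = fit_logistic(z, x)
# Fitted per-cell perturbation probabilities, used later for resampling.
fitted = [1.0 / (1.0 + math.exp(-(a_hat + b_hat * zi))) for zi in z]
```

The `fitted` values play the role of the propensity scores that the resampling step conditions on.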
- 35:17The second thing we're going to do is think
- 35:19about what test statistic to use.
- 35:21So I had a separate paper about essentially the power
- 35:27of the conditional randomization test.
- 35:29And what we found is that the closer the test statistic is
- 35:33to the true conditional distribution of Y given X, Z,
- 35:38I guess I should say the true likelihood,
- 35:40the better the power will be.
- 35:41And so in that sense, what we wanna do is we
- 35:45wanna leverage existing models that people have used such
- 35:49as negative binomial regression.
- 35:51It's not going to matter whether the model is true or not
- 35:54for the sake of type one error control, but we hope
- 35:59that we can do a better job in terms of power
- 36:02by trying to get a good model for this.
- 36:07And finally, how do we mitigate the computational cost?
- 36:10And so we had a few ideas for this as well.
- 36:12So one of them is called the distilled CRT.
- 36:15And so, if time permits, which it might or might not,
- 36:19I'll give you a few more details
- 36:20about how you can use this to have a much faster test,
- 36:25for every resample to be quick.
- 36:28And then we're also going to use this hack, essentially
- 36:32that what we found is that the resampling distribution
- 36:36it actually kind of looks pretty reasonable.
- 36:40It kind of looks like a normal, but it sort
- 36:43of has some extra skew and maybe some extra heavy tails.
- 36:46And so what we're gonna do is we're going to
- 36:48fit a skew T distribution to essentially
- 36:52the empirical distribution of the resampled test statistics.
- 36:55And in that way, we can get more accurate P-values
- 36:58without doing as many resamples.
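To see why fitting a distribution to the resampled statistics helps, note that a purely empirical p-value from B resamples can never fall below 1/(B+1). The sketch below illustrates the idea with a normal fit from the Python standard library as a stand-in; the talk's method fits a skew-t distribution instead, and the simulated statistics and the observed value here are made up for the example.

```python
import random
from statistics import NormalDist, fmean, stdev

rng = random.Random(0)
n_resamples = 500
# Stand-in for resampled z-values: a null distribution wider than N(0, 1).
resampled = [rng.gauss(0.0, 1.5) for _ in range(n_resamples)]
observed = -7.5  # hypothetical observed z-value (left tail: reduced expression)

# Purely empirical left-sided p-value: floored at 1 / (n_resamples + 1).
n_below = sum(t <= observed for t in resampled)
p_empirical = (1 + n_below) / (1 + n_resamples)

# Tail probability from a distribution fitted to the resampled statistics
# (normal here for simplicity; a skew-t would also capture skew and heavy tails).
fitted_null = NormalDist(fmean(resampled), stdev(resampled))
p_fitted = fitted_null.cdf(observed)

# What the unadjusted standard normal reference would have reported.
p_standard_normal = NormalDist().cdf(observed)
```

In this toy setup the fitted tail can report p-values far below the empirical floor with only 500 resamples, while still being much less extreme than the standard normal reference that ignores the extra spread.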
- 37:01And so putting together all of these pieces
- 37:03we get this method, which we call Sceptre
- 37:06or single cell perturbation screen analysis
- 37:08via conditional resampling.
- 37:11And so essentially what we do is what I said
- 37:13on the previous slide.
- 37:15We first use a logistic regression to fit a probability
- 37:19for every cell that we would find a perturbation there.
- 37:23And then we're gonna use these perturbation probabilities
- 37:26and resample this particular column.
- 37:29And so now we have a whole bunch of resampled datasets.
- 37:32Now we're going to use a negative binomial regression
- 37:35or more precisely a distilled negative binomial regression
- 37:38for speed, to get the test statistic
- 37:42for both the original data
- 37:43and for all of these resampled datasets.
- 37:47Then we're gonna put together all
- 37:48of these resampled test statistics into this gray histogram.
- 37:51And again, we're gonna fit this magenta curve
- 37:54which is the skew T distribution
- 37:57which seems to fit pretty well in most cases.
- 37:59And then we're gonna compare the original test statistic
- 38:02against this skew T distribution and get a P-value that way.
- 38:07And so this is represented by the shaded region here.
- 38:10And I think what's noteworthy is to compare this fitted
- 38:14null distribution
- 38:15to this standard normal null distribution.
- 38:19I guess I should have said here
- 38:20that the actual test statistics are Z values extracted
- 38:24from the negative binomial regression.
- 38:26So if your model were true, the Z values
- 38:30under the null would follow a standard normal distribution.
- 38:34And so what we find is that when we resample we
- 38:37get something that's not the standard normal distribution.
- 38:40And so in that sense you can view it as
- 38:42a sort of measure of the departure from the model,
- 38:48or sort of the lack of model fit, that went
- 38:51into this negative binomial regression.
- 38:54So another way of putting this is that
- 38:57you can imagine that if you did happen
- 38:59to correctly specify your negative binomial regression model
- 39:03then you would sort of be getting back the same P-value
- 39:06that you would have gotten otherwise.
- 39:08So in that sense
- 39:09we're not really reinventing the wheel here
- 39:11if you do have a good parametric model, but if you don't
- 39:13then we can correct for it using this resampling strategy.
- 39:18So I guess this is an important slide
- 39:19so maybe I will stay here for a little bit and ask
- 39:23if anyone has questions about how our methodology works.
- 39:31- Hi, I have a question.
- 39:33So have you tried a hurdle model to deal
- 39:36with this kind of data, because
- 39:40of the weird distribution of the data?
- 39:45- Oh, so let's see.
- 39:48You mean to model the, essentially to model the gene
- 39:52expressions or do you mean to model the CRISPR perturbations?
- 39:58- From this page,
- 40:03so first step you use a logistic regression
- 40:05and then you use a negative binomial.
- 40:08So it's like a two-step model, but hurdle models,
- 40:13they combine them together to deal with the overall dataset.
- 40:18- I see, I will admit that I'm not familiar with those
- 40:21models but I will definitely take a look
- 40:24at those and see if they might be applicable.
- 40:28Yeah, I guess like in this sense
- 40:33the approach that I've proposed here is pretty flexible.
- 40:37I mean, really
- 40:37what makes this approach work well is that, as long
- 40:42as you have a decent approximation
- 40:44to these perturbation probabilities,
- 40:46we're thinking about them as propensity scores.
- 40:48So aside from that,
- 40:52really what's standing behind this is the generality
- 40:55of the conditional randomization test where
- 40:56you can basically use any test statistic you want.
- 40:58And so, definitely the method is flexible
- 41:02and can incorporate different choices,
- 41:05like the one that you've mentioned,
- 41:08But we haven't tried it.
- 41:10I'm not familiar with this model.
- 41:12Thank you though.
- 41:15Anyone else have any questions about the methodology?
- 41:24Okay, perhaps I'll okay.
- 41:27So yes, so this is kind of like a separate thing
- 41:31which I will not get
- 41:33into details of for the sake of time, but we had
- 41:36a separate paper whose focus was just basically,
- 41:39the conditional randomization test is a cool test
- 41:42but everyone knows it's slow.
- 41:43So how can we essentially accelerate it
- 41:46while retaining a lot of its power advantages?
- 41:49And so what we found is that
- 41:51if you just ever so slightly modify the test statistic
- 41:54by sort of regressing Y first on the confounders,
- 41:59and then on X, instead
- 42:02of regressing it on both at the same time
- 42:04what we found is that this ends up being much, much faster
- 42:08because only the second step needs to be repeated
- 42:10upon resampling, and the second step is much cheaper.
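The two-step idea just described can be sketched as follows. This is only an illustration under simplifying assumptions: ordinary least squares on one confounder replaces the negative binomial regression, the statistic is a residual cross-product, the p-value is left-sided, and all names and data are hypothetical.

```python
import random

def simple_ols(z, y):
    """Least-squares intercept and slope of y on a single covariate z."""
    n = len(z)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    sxy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    slope = sxy / sxx
    return y_bar - slope * z_bar, slope

def distilled_crt_p_value(x, y, z, propensity, n_resamples=500, seed=0):
    """Left-sided p-value; the expensive fit on z happens only once."""
    rng = random.Random(seed)
    # Step 1 (done once): regress y on the confounder and keep the residuals.
    intercept, slope = simple_ols(z, y)
    resid = [yi - intercept - slope * zi for yi, zi in zip(y, z)]
    # Constant part of sum(resid_i * (x_i - propensity_i)), also computed once.
    baseline = sum(r * p for r, p in zip(resid, propensity))

    # Step 2 (repeated): only perturbed cells contribute to the sum.
    def statistic(x_vec):
        return sum(r for r, xi in zip(resid, x_vec) if xi == 1) - baseline

    observed = statistic(x)
    n_below = 0
    for _ in range(n_resamples):
        x_new = [1 if rng.random() < p else 0 for p in propensity]
        if statistic(x_new) <= observed:
            n_below += 1
    return (1 + n_below) / (1 + n_resamples)

# Hypothetical toy data with a perturbation that lowers expression.
rng = random.Random(2)
z = [rng.random() for _ in range(400)]
propensity = [0.1 + 0.3 * zi for zi in z]
x = [1 if rng.random() < p else 0 for p in propensity]
y = [5.0 + 2.0 * zi - 2.0 * xi + rng.gauss(0.0, 1.0) for zi, xi in zip(z, x)]

p_distilled = distilled_crt_p_value(x, y, z, propensity)
```

Each resample costs one pass over the cells rather than a full regression fit, which is where the speed-up in this kind of scheme comes from.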
- 42:14So what we did is that, in the context of Sceptre,
- 42:19we built on this
- 42:20by accelerating the resampling steps even further
- 42:23by leveraging the sparsity
- 42:25of the CRISPR perturbation vector X.
- 42:28And so perhaps the most important part is that the cost
- 42:31of the CRT for one gene-enhancer pair went
- 42:34from 25 minutes down to 20 seconds
- 42:38as a result of these computational accelerations.
- 42:40And so for reference a single negative binomial
- 42:43regression took three seconds.
- 42:45So still,
- 42:46we're a factor of six or seven more expensive than
- 42:50just the sort of vanilla single regression,
- 42:53but it's definitely, I think,
- 42:56within an order of magnitude
- 42:58and, hopefully as you can tell, much better statistically.
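For reference, the figures quoted above work out as follows, taking the stated timings (25 minutes, 20 seconds, 3 seconds) at face value:

$$
\frac{25 \times 60\ \text{s}}{20\ \text{s}} = 75, \qquad \frac{20\ \text{s}}{3\ \text{s}} \approx 6.7
$$

That is, roughly a 75-fold speed-up from the accelerations, leaving the resampling test about 6.7 times the cost of a single negative binomial regression per gene-enhancer pair.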
- 43:03So I will show you a few, so this is a simulation.
- 43:12I'm not gonna go through it in detail, but the idea is
- 43:14that what we're demonstrating here is that you can give
- 43:19Sceptre essentially negative binomial models
- 43:22that are misspecified in different ways.
- 43:25You can give it a dispersion
- 43:27that's too large, a dispersion that's too small
- 43:30or maybe the true model does have zero inflation
- 43:32but we're not accounting for it.
- 43:34And what we find is that Sceptre essentially
- 43:36is well calibrated, regardless,
- 43:40whereas if you just essentially took
- 43:43the wrong dispersion estimates at face value
- 43:47you would encounter problems.
- 43:49And this scMAGeCK approach,
- 43:51which basically is a permutation approach,
- 43:54is just sort of not doing a great job accounting
- 43:56for the confounding, so we see this inflation.
- 44:01So perhaps more excitingly
- 44:03I'd like to show you an application to real data.
- 44:08So I guess this is the, so firstly
- 44:11we wanna make sure the method is actually calibrated.
- 44:13So if you remember, the initial observation was
- 44:16that a lot of these methods aren't calibrated.
- 44:18So because I'm running a little short on time
- 44:20let's kind of maybe ignore this panel here
- 44:22and focus our attention here.
- 44:24So this is the Gasperini data that I introduced before.
- 44:28And so this red line here is actually the QQ plot you saw
- 44:34on one of my first slides
- 44:35of all of those negative control gene-enhancer pairs.
- 44:39It looks different here because
- 44:41I've sort of cut off the scale
- 44:43so we can actually visualize it.
- 44:45So we see a quite significant departure.
- 44:48What we actually did is we thought, okay, maybe
- 44:51they have a bad estimate of the dispersion
- 44:54but maybe we can use some more
- 44:55like state-of-the-art single cell sort of methods
- 45:00to improve our estimate of the dispersion.
- 45:03And so maybe we don't need to go
- 45:04to all the effort of doing the resampling.
- 45:06And so what we found is that
- 45:07when we use a state-of-the-art dispersion estimate
- 45:10we still have very substantial miscalibration.
- 45:14This is, I think, just a testament
- 45:15to the fact that it's just hard to estimate that parameter
- 45:18because there's not all that much data to estimate it.
- 45:21And then by comparison, we built Sceptre
- 45:24from the same exact negative binomial model
- 45:27which is this improved one,
- 45:29and we found that the negative control P-values
- 45:32are I think, excellently calibrated.
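The calibration check on negative controls amounts to a QQ plot of the p-values against the uniform distribution. A minimal sketch of how those plot coordinates can be computed is below; the simulated p-values are invented for illustration and are not the Gasperini data.

```python
import math
import random

def qq_points(p_values):
    """Pairs (expected, observed) of -log10 p-values for a uniform QQ plot."""
    n = len(p_values)
    observed = sorted(p_values)
    # Under the null, the i-th smallest of n uniform p-values has mean i / (n + 1).
    expected = [(i + 1) / (n + 1) for i in range(n)]
    # Guard against log10(0) in case a p-value is exactly zero.
    return [(-math.log10(e), -math.log10(max(o, 1e-300)))
            for e, o in zip(expected, observed)]

rng = random.Random(0)
# Well-calibrated negative-control p-values are uniform on (0, 1).
calibrated = [rng.random() for _ in range(2000)]
# Hypothetical miscalibrated method: the same p-values pushed toward zero.
inflated = [p ** 2 for p in calibrated]

calibrated_qq = qq_points(calibrated)
inflated_qq = qq_points(inflated)
```

Points from a calibrated method lie near the diagonal, while an inflated method's smallest p-values sit well above it, which is the kind of departure the speaker describes for the original analysis.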
- 45:36So this shows you, again, the benefit
- 45:40of this different way of calibrating your test statistic
- 45:42and not relying on the parametric model for gene expression.
- 45:47So this figure just shows a few of the other methods
- 45:50but for the sake of time, I'm going to move on.
- 45:55This is looking at positive control data.
- 45:58So this basically is like, trying to get a sense of power.
- 46:01And so, again, maybe if we restrict our attention
- 46:04to this left panel here, what we found is that
- 46:06if we just plot our P-values
- 46:10versus the P-values, by the way, maybe I should say
- 46:12what is a positive control.
- 46:14A positive control
- 46:15in this case is a CRISPR perturbation that instead
- 46:18of targeting an enhancer is targeting the transcription
- 46:22start site of a gene.
- 46:25And so essentially, like we don't need any extra biology
- 46:29to know that, if you target a transcription start site
- 46:32that's really going to knock out the gene.
- 46:34And so you can still try to do your association test and see
- 46:37if you've picked up those positive control associations.
- 46:40And so what we find is that
- 46:41actually Sceptre not only is better calibrated
- 46:44but it also tends to have more significant P-values
- 46:48on those positive controls.
- 46:49So it apparently is boosting both the sensitivity
- 46:53and the specificity of these association tests.
- 46:57- Eugene, here the original empirical P-value, is this
- 47:00from the negative binomial test?
- 47:03So after you did the conditional randomization
- 47:08you actually have better P-values
- 47:11for the positive control pairs?
- 47:13- Yes, so you would expect, it's like,
- 47:22aren't we just making the P-values
- 47:23less significant
- 47:25in a way to just help with the calibration?
- 47:27So how can it be boosting power?
- 47:29But the degree of inflation sort of varies,
- 47:34and we'll see this
- 47:37I think on the next slide as well. Essentially,
- 47:40what Sceptre is doing is not
- 47:43like a monotone transformation of things.
- 47:46Just maybe to illustrate it,
- 47:50I think, this is just an example where essentially what
- 47:59we would have gotten from the sort
- 48:01of the vanilla negative binomial analysis is the area
- 48:04under this dotted or dashed curve here.
- 48:07And so Sceptre could, well, basically whoops sorry,
- 48:11it could have a, like a lighter tail as it has in this case.
- 48:15And so it could sort of make the P-values
- 48:20either more significant or less significant.
- 48:22It's correcting the miscalibration
- 48:24but not necessarily in a way that's like conservative.
- 48:26And so this is encouraging.
- 48:32Yeah, that's a good question though.
- 48:35- I guess that depends on
- 48:37the confounding you included in the model.
- 48:41So I would expect it will reduce the significance,
- 48:47but if you include other confounders
- 48:50that are mostly contributing to the noise level, probably.
- 48:55- Yeah, sure, so I think I'm right.
- 48:59Yeah, let me think we are, let me see
- 49:03I think in this case, we're correcting
- 49:05for approximately the same confounders here.
- 49:08So they already had some confounders
- 49:10that they were correcting for
- 49:11in the original negative binomial.
- 49:12So in that sense, it's a little bit more
- 49:14of maybe an apples to apples comparison.
- 49:16It's just a question of how do you calibrate
- 49:20that test statistic that is
- 49:21trying to correct for the confounders
- 49:23but I think what you're getting at
- 49:25I do think it can go either way.
- 49:27It's not obvious that Sceptre would make a P-value
- 49:29either more or less significant.
- 49:32I think I will say just as a small detail here
- 49:34that in addition to the negative binomial regression
- 49:38this P-value, it says,
- 49:40there's this strange word empirical here.
- 49:42What it means is that
- 49:43they've kind of also applied a fix that they had
- 49:47because they realized that they had the miscalibration
- 49:48and then they kind of like smashed all
- 49:50of their P-values sort of,
- 49:52so these are sort of like, so in that sense
- 49:55it's not an apples to apples comparison
- 49:57but what we're doing is we're comparing
- 49:58to the P-values that were actually used
- 50:00for the analysis in this paper.
- 50:02So maybe that makes it even harder to compare, but yes.
- 50:06So take this plot with a grain of salt, if you will.
- 50:10Perhaps I think the most exciting part is
- 50:14actually applying this to new gene-enhancer pairs
- 50:18where we don't know necessarily what the answer is.
- 50:21And so this plot just shows you
- 50:24we're just plotting it's actually, I guess
- 50:27similar to this plot we saw here
- 50:30except now we're looking at the candidate enhancers.
- 50:33And so essentially the different colors.
- 50:36So firstly, this also just shows you
- 50:38that this is very much not a monotonic transformation.
- 50:42Like you really can like, if you look into this quadrant
- 50:47this is an example where the original P-value was very
- 50:51not significant, but according to Sceptre
- 50:54it can be very significant and vice versa.
- 50:58So essentially I've just kind of highlighted
- 51:01those gene-enhancer pairs that were,
- 51:03found by one method and not the other.
- 51:05And so the upshot is that there's a total
- 51:09of roughly 500 or so found.
- 51:12Well, I guess Sceptre found 563;
- 51:15of those, 200 were new in the sense
- 51:18that they were not found by the original analysis.
- 51:21And then 107 were found by the original analysis
- 51:24but were not found by us.
- 51:26And we have strong reasons to believe
- 51:28that these could be false positives based
- 51:30on exactly the sorts of miscalibration that I presented.
- 51:35We did look at a few specific new discoveries here
- 51:38and found that they were corroborated by eQTL data
- 51:43and, for those of you who are familiar,
- 51:45enhancer RNA correlation data. Since I'm running low
- 51:48on time, I don't have time to explain this to you
- 51:51but these are all P-values
- 51:53of association based on orthogonal functional assays.
- 51:58Also, we found that our discoveries were more enriched
- 52:01for biological signals in a few different ways.
- 52:04One of them is that, and again,
- 52:06I'm sort of maybe going a little bit
- 52:08more quickly here 'cause I'm about to run out of time
- 52:11but there are these things called topologically
- 52:13associating domains, which are basically regions
- 52:16in the genome within which most
- 52:19of these regulatory interactions are thought to occur.
- 52:22And so what we find is that a greater fraction
- 52:25of the gene-enhancer pairs we found compared
- 52:27to the original analysis did lie
- 52:30in the same topologically associating domain.
- 52:32So in this case, 74% versus
- 52:35the 71% found in the original analysis.
- 52:37So in this sense, I mean, it's just kind of
- 52:39like a first order sense of biological plausibility.
- 52:44I think people are starting to think
- 52:46that there are interactions that are sort of
- 52:48outside of TADs as well.
- 52:49So I don't think this is a signal that
- 52:5226% of these things are false discoveries
- 52:55but we definitely do expect a high degree
- 52:59of enrichment for within-TAD interactions.
- 53:05Also if you do look
- 53:07at some of these more circumstantial pieces
- 53:09of evidence for regulation, such as things
- 53:14like transcription factor binding or histone modifications,
- 53:19we can use ChIP-seq to essentially assess
- 53:23for any given enhancer
- 53:27whether there are these kinds of signatures of regulation.
- 53:33And so what we found is that we did a little bit
- 53:35of an enrichment analysis where we looked at all
- 53:38of those enhancers that were found to be paired
- 53:40to genes by Sceptre versus the original method
- 53:43and looked to what extent they were enriched
- 53:46for these other signatures
- 53:49these ChIP-seq based signatures of regulation.
- 53:52And what we found is that
- 53:53across eight of these ChIP-seq targets, and by the way
- 53:58these eight are not selected.
- 53:59These actually were the exact eight ChIP-seq targets
- 54:03that they examined in the original paper,
- 54:06we found greater enrichment.
- 54:08So in this sense, also the enhancers being
- 54:13picked up by Sceptre are just more biologically
- 54:15plausible using these orthogonal kinds of assays.
- 54:19So I find this very exciting
- 54:21and I'm just gonna maybe make a few remarks
- 54:25and hopefully there's just a little bit
- 54:26of time for questions.
- 54:27I will also be around for a few minutes after the seminar.
- 54:30If anyone wants to stick around and ask me questions
- 54:33you also might have your next thing to go to.
- 54:35So I understand if not.
- 54:37But maybe the summary is that mapping gene-enhancer
- 54:41regulatory relationships is very important
- 54:43if we wanna translate GWAS hits into disease insights.
- 54:47And there's been this very exciting new technology
- 54:50that allows us to answer that question.
- 54:53This technology was proposed very recently,
- 54:56and so there aren't that many methods out there
- 55:00to analyze these kinds of data.
- 55:02And so what we did with Sceptre is we leveraged recent
- 55:05methodological advances in statistics to overcome the primary
- 55:08limitations of the parametric and non-parametric analysis
- 55:11methods that were available.
- 55:13And finally, we applied it to the largest existing
- 55:17data set of this kind.
- 55:19And what we get is a greater number of more biologically
- 55:22meaningful regulatory relationships.
- 55:25So I had a few other discussion slides, maybe I'll just
- 55:28read the title to you without getting into the details
- 55:31but this is a rapidly developing technology.
- 55:34And we do foresee that Sceptre will be applicable
- 55:37to future iterations of the technology.
- 55:40So that's promising.
- 55:42And secondly, this is more like the beginning
- 55:46of the road than the end of the road.
- 55:47There are lots of remaining challenges,
- 55:51this includes looking for interactions
- 55:53among enhancers, things like dealing
- 55:56with multiple guide RNAs in the same enhancer.
- 56:00There's just basically like a whole, I would say, playground
- 56:02of statistical problems that have yet to be addressed.
- 56:07So maybe finally, if you'd like to learn more
- 56:11we have a preprint on bioRxiv.
- 56:13I wanna acknowledge my co-authors again.
- 56:16And finally, so Tim has worked very hard
- 56:20on making this an R package, so
- 56:24you can find it on GitHub
- 56:26and I'm very happy to take questions now
- 56:29but if you have any burning questions that come
- 56:32to you 30 minutes after my talk
- 56:35please feel free to email me at this address.
- 56:37So thank you, and I should have said at the top, thank you
- 56:40Lexi for the invitation.
- 56:42- Thank you for agreeing to present your work here.
- 56:45It's really a nice talk.
- 56:47- Yeah Thank you.
- 56:49- So I have some, maybe less related question
- 56:52to your current work, but maybe interesting to consider.
- 56:56I am not sure.
- 56:57Have you looked at the correlation structure
- 56:59between the X matrix?
- 57:03- Yeah, so essentially my sense is that it's
- 57:07like a factor model where you have all
- 57:11of these sort of confounders that are inducing correlation
- 57:16among all the Xs, but essentially, once you account
- 57:21for that confounding, they're independent.
- 57:25- I see (indistinct) correlation.
- 57:29So it's fairly small correlation and essentially
- 57:33the reason for it, and this is very different from,
- 57:36for example, genome-wide association studies,
- 57:38so it's like, oh,
- 57:39is there some analog of linkage disequilibrium?
- 57:40The key difference here is that
- 57:43it's essentially a designed experiment.
- 57:46So even though you're not controlling exactly
- 57:49which cells receive what perturbations you are
- 57:51basically assigning them at random.
- 57:54So if it weren't
- 57:55for this sort of pesky measurement mechanism business
- 57:58it would be an unconfounded problem.
- 58:01But essentially, so the only correlations are coming
- 58:05from this measurement.
- 58:08Yes so that is a great question
- 58:10because you can ask, well
- 58:11how did I do the sleight of hand where,
- 58:13like, on slide three all of a sudden I was working
- 58:15with like one enhancer
- 58:17and where did all the rest of them go?
- 58:18And I think we're actually not losing all too much
- 58:21by doing this, especially
- 58:23since we are controlling for those technical factors.
- 58:25- Yeah thanks that makes sense to me.
- 58:28And another thing, maybe less
- 58:31statistical, is how many confounding factors
- 58:34you are controlling for
- 58:35and what are the important ones that you have identified?
- 58:39- Yeah, I mean, so in this case
- 58:41we're essentially following the lead of
- 58:45the original paper
- 58:46for which confounding factors we control for.
- 58:48So in addition to sequencing depth,
- 58:51yeah, so they do have a batch effect
- 58:52and there's also something called percent mitochondrial.
- 58:55So it's like what fraction of all the reads that you got
- 58:58in this particular cell came from mitochondrial DNA
- 59:02as opposed to regular DNA, maybe a few others
- 59:09like just total number
- 59:10of genes expressed in the cell, things of this nature.
- 59:13So I think here we're correcting
- 59:15for about five, but you could think of other things
- 59:18like cell cycle. K562 is a pretty
- 59:25homogeneous cell line, but especially
- 59:27once you get to other kinds of tissue samples
- 59:30you might need to think about cell type
- 59:33and things of this nature.
- 59:35So I think there are lots to consider here,
- 59:38we used kind of five easy ones.
- 59:42- Okay, thanks.
- 59:44Any more questions for Eugene?
- 59:48Yeah, I think we are approaching
- 59:52the end of the seminar.
- 59:55So thanks again for your great talk.
- 59:58And if you have any further questions
- 01:00:01you can just send emails to Eugene offline.
- 01:00:05- Yes, yes, definitely don't hesitate to reach out.