YSPH Virtual Biostatistics Seminar: Statistical analysis of single cell CRISPR screens

February 03, 2021
  • 00:03- First good afternoon, everyone,
  • 00:05and I hope you somehow managed to enjoy your winter break
  • 00:10in this special time.
  • 00:11And this is our first talk, seminar talk this semester,
  • 00:16and we have invited Dr. Eugene Katsevich
  • 00:19from Wharton School at UPenn.
  • 00:22And he's going to present something really exciting,
  • 00:26I know his original work on statistical analysis
  • 00:32of single cell CRISPR screening.
  • 00:34And I will hand it over to Eugene from now, from here.
  • 00:40And, but if Eugene wanted to start or wait one
  • 00:43or two minutes to start, it's up to you.
  • 00:46- Yeah maybe, I mean, yeah, I don't know.
  • 00:51If people will filter in, maybe I'll wait another minute
  • 00:53or two, 'cause I think, I feel like the first part
  • 00:57of the talk is very important.
  • 00:58So I think if people missed the first part of the talk,
  • 01:01then it'll be maybe hard to follow along later.
  • 01:04So I'm happy to wait just another minute or two.
  • 01:09I understand perfectly that it's a strange time
  • 01:12for everyone, so for all those who were able
  • 01:15to make it today, I really appreciate
  • 01:17your adjusting the schedule.
  • 01:22Also maybe one remark I can make is that,
  • 01:25since it is a smaller audience,
  • 01:27I think we can make this seminar just about
  • 01:31as interactive as you want.
  • 01:32So you should definitely feel free to stop me
  • 01:37at any point.
  • 01:39I don't know how many of you are familiar
  • 01:40with the CRISPR screen stuff I'm gonna talk about,
  • 01:43but I'm very happy to just make it very interactive.
  • 01:54I will maybe start sharing my screen
  • 01:58and maybe I'll start launching
  • 02:00into some of the introductory things.
  • 02:13So...
  • 02:16Oh wow, wait, is this the...
  • 02:21I greatly apologize.
  • 02:24Clearly, the label on my slides is wrong.
  • 02:28I have updated my slides since then,
  • 02:31but I think the title page has not been updated,
  • 02:33that's extremely embarrassing.
  • 02:40Well, maybe then I should skip past this slide very quickly.
  • 02:43So hello everyone, thank you so much
  • 02:47for making it to my talk.
  • 02:50Today, I'll be talking about some Statistical Analysis Tools
  • 02:52for Single Cell CRISPR Screens.
  • 02:55So the most important thing to take away
  • 02:56from this slide are my collaborators here.
  • 02:59So Tim Barry is a grad student
  • 03:02of mine who was actually at CMU.
  • 03:05I am jointly advising him with Kathryn Roeder,
  • 03:08also at CMU, who used to be my postdoc advisor.
  • 03:15So I'll skip quickly to the next slide.
  • 03:20So here's the motivation.
  • 03:22And by the way, if anyone has joined recently,
  • 03:25please just stop me at any point.
  • 03:29So here's the motivation.
  • 03:30So we have done lots
  • 03:31and lots of genome wide association studies to date.
  • 03:35So we have a lot of little markers
  • 03:37along the genome that we think are associated with diseases.
  • 03:41And so the question is what's the next step?
  • 03:43Like how do we actually translate these
  • 03:46into insights into diseases?
  • 03:49And hopefully later on things like,
  • 03:51therapeutics and so on.
  • 03:52So what we need to do is we need to understand how
  • 03:56like basically the mechanisms
  • 03:58by what mechanism are these associations actually resulting
  • 04:01in an increased disease risk.
  • 04:03So here's a typical situation: here
  • 04:05is our genome and here's a disease association
  • 04:08and frequently these disease associations
  • 04:11they might not take place within genes.
  • 04:13And so that makes them pretty hard to interpret.
  • 04:18So what's hypothesized to be the case here is that instead
  • 04:24of disrupting genes directly, these variants
  • 04:29are disrupting regulatory elements such as enhancers.
  • 04:33So let's just like briefly here review
  • 04:38that an enhancer is a region of the genome.
  • 04:41That could be a certain distance
  • 04:42from the gene that actually folds
  • 04:45in three-dimensional space to come
  • 04:47in close proximity to the promoter of the gene.
  • 04:52And essentially the enhancer's job is to recruit a lot
  • 04:55of the machinery that actually is going to lead
  • 04:57to the expression of this gene.
  • 04:58So if you disrupt the enhancer
  • 05:00then this will disrupt the recruitment
  • 05:02of all of these different transcription factors
  • 05:04which will then end up causing some trouble.
  • 05:10And so it's this sort of like, for example, in this case
  • 05:15let's say that this disease association
  • 05:17is disrupting enhancer one. Well, this might suggest,
  • 05:21if enhancer one is regulating gene two, that
  • 05:24the disease mechanism is actually proceeding essentially
  • 05:28or being mediated by the expression of gene two.
  • 05:32And so this would be a very great
  • 05:38and clean way of interpreting GWAS hits.
  • 05:41But the problem is that we don't actually know
  • 05:44or we have a very hazy sense
  • 05:46of which enhancers actually regulate which genes.
  • 05:50So this is kind of a difficult problem
  • 05:52for a few different reasons.
  • 05:54The first reason is that there's a potentially
  • 05:57many-to-many mapping between enhancers and genes.
  • 06:00So an enhancer can regulate multiple genes
  • 06:03and a single gene can be regulated by multiple enhancers.
  • 06:08So the other thing is that enhancers don't even
  • 06:10need to be all too close to the genes that they regulate.
  • 06:13There could be situations like we saw here where
  • 06:18the regulation can skip the adjacent gene
  • 06:20and go to the next one.
  • 06:22And so in general, regulation
  • 06:24is thought to happen within about a megabase distance
  • 06:30in terms of the linear distance in the genome.
  • 06:33So this is a hard problem, and it's basically
  • 06:36the motivating problem for this talk
  • 06:38which enhancers regulate which genes.
  • 06:41This is a sort
  • 06:42of a very fundamental and important problem in genomics.
  • 06:47So in today's talk, I'm going to first talk about
  • 06:53a new assay called a single cell CRISPR screen
  • 06:56that allows us to get at this question,
  • 07:03then I'm gonna talk about the challenges
  • 07:06that previous methods have encountered
  • 07:08in analyzing these single cell CRISPR screen
  • 07:10datasets, then propose a new methodology based
  • 07:14on this idea of conditional resampling.
  • 07:18And then I will show you how this works
  • 07:20on real data and close with the discussion.
  • 07:25So let me first introduce the biological assay here
  • 07:28which is called the Single Cell CRISPR screen.
  • 07:31So actually backing up a second,
  • 07:34this is a very important problem
  • 07:35and people have considered it before.
  • 07:37So how do people typically approach gene-enhancer mapping?
  • 07:41I think the most common approach is what I call here
  • 07:46an indirect observational approach.
  • 07:48And there are many of these.
  • 07:50So what this picture is,
  • 07:51is a basically a more detailed picture
  • 07:54of what happens when an enhancer, pictured here, comes
  • 07:57into contact with the promoter of a gene.
  • 08:00There are lots of kind
  • 08:01of indirect signals of this regulation.
  • 08:06Obviously you have just the actual expression
  • 08:08of the gene, but you'll have the conformation
  • 08:12of the chromatin in the vicinity of the promoter
  • 08:16and in the enhancer
  • 08:18you have basically transcription factor binding data.
  • 08:22And all of these data are essentially indirect ways
  • 08:25of trying to make a conclusion
  • 08:28about which enhancers might be regulating which genes.
  • 08:31So for example, using Hi-C data,
  • 08:33if you find an enhancer to be in 3D contact
  • 08:37with the promoter, then this could be a signal
  • 08:39that there is some regulation going on.
  • 08:43The issue is that these approaches have not
  • 08:45proved very reliable at the end of the day.
  • 08:47These are observational approaches,
  • 08:49and basically even if you have
  • 08:52contact in 3D space, this is not necessarily a signal.
  • 08:57This doesn't necessarily mean that regulation
  • 08:59is actually occurring,
  • 09:00and so essentially we haven't gotten all too far
  • 09:04with these indirect approaches.
  • 09:05So the exciting thing is
  • 09:07that recently with the development of CRISPR technology
  • 09:13we can now actually go in and instead of observationally
  • 09:17just essentially take a look inside a cell.
  • 09:20We can actually go in and make modifications where we
  • 09:24for example, knock out enhancers using the system
  • 09:28called CRISPR Interference.
  • 09:30And then we try to look
  • 09:31at what the results are for gene expression.
  • 09:34So this shows you a little cartoon
  • 09:38of the CRISPR interference system.
  • 09:40And so the way that it works is
  • 09:42that you have this Cas9 protein whose job is to attach
  • 09:48to a certain segment of DNA.
  • 09:51And the specific segment of DNA it attaches
  • 09:53to is specified by this guide RNA.
  • 09:57And so in this way,
  • 09:59the attachment can be highly specific
  • 10:01to the sequence of the enhancer.
  • 10:04And then, for CRISPR Interference,
  • 10:07the Cas9 brings along with it
  • 10:09all of these repressive elements that essentially knock
  • 10:13out this enhancer, meaning they prevent the enhancer
  • 10:17from actually helping to regulate this gene.
  • 10:22And so the idea, so firstly
  • 10:24this is a promising solution because it allows us to
  • 10:28interrogate these regulatory relationships
  • 10:30in a much more direct way
  • 10:32than we've been able to do until recently.
  • 10:35And so the overall idea is that,
  • 10:39the idea is simple: disrupt enhancers
  • 10:41and see which genes' expression drops.
  • 10:44And so just as a cartoon here, let's say we knock
  • 10:46out this enhancer, then we would expect
  • 10:49to see the gene that it regulates to be down-regulated.
  • 10:54And then we can think
  • 10:55about designing perturbations for multiple enhancers.
  • 10:58And so if you perturb this enhancer
  • 11:00then maybe you'll see a response in these two genes.
  • 11:07- Very naive question, just to make sure I
  • 11:09didn't misunderstand the notion here: is an enhancer always
  • 11:14upregulating the gene it regulates?
  • 11:18- I think enhancers specifically
  • 11:20are thought to upregulate genes.
  • 11:22However, it's a good question because there are other kinds
  • 11:25of elements that can actually be silencers, for example.
  • 11:28And so that's just another example of a kind
  • 11:32of a regulatory element.
  • 11:33So the effect could go in either direction
  • 11:36and in this talk I'll primarily talk about enhancers,
  • 11:38but really everything I say goes through for other kinds
  • 11:42of regulatory elements.
  • 11:44- Thanks.
  • 11:45- Yeah, very good question.
  • 11:50So now the actual assay
  • 11:54that allows you to do this at a large scale.
  • 11:59So the scale is the question here
  • 12:00because you can do CRISPR experiments where
  • 12:03you essentially like knock out one enhancer
  • 12:05in a whole batch of cells, and then,
  • 12:08maybe go enhancer by enhancer
  • 12:10and this ends up not being a very scalable approach.
  • 12:13So there has been proposed
  • 12:16this new assay called the single cell CRISPR screen
  • 12:19in which you basically pool a whole bunch
  • 12:22of perturbations together,
  • 12:23and then the readout that you get is single cell
  • 12:26RNA sequencing, which allows you to also basically look
  • 12:30at the impact of all
  • 12:31of those different enhancer perturbations
  • 12:32on the entire transcriptome.
  • 12:35And so in the slide, I'm gonna give you a brief overview
  • 12:38of how these screens work.
  • 12:40So the first thing you do is you start
  • 12:42with a library of CRISPR perturbations.
  • 12:44So you just, let's say maybe you take,
  • 12:4910,000 enhancers across the genome
  • 12:52and then you basically design CRISPR guide
  • 12:55RNAs targeting each of those enhancers.
  • 12:58Once you have a library of these perturbations
  • 13:00you then infect a big pool
  • 13:03of cells with all of these perturbations.
  • 13:06And so what's important to note here is
  • 13:08that essentially these perturbations get randomly integrated
  • 13:13into the different cells. They're delivered through a
  • 13:17like a virus system;
  • 13:19the details aren't very important, but the important thing is
  • 13:22that these perturbations get integrated
  • 13:24into cells essentially at random.
  • 13:26And so each cell gets its own collection
  • 13:28of CRISPR perturbations.
  • 13:30So now in order to basically actually read out what happened
  • 13:35in our experiment, we use single cell RNA sequencing.
  • 13:38And as a result of the sequencing experiment
  • 13:40we get two pieces of information. Firstly, by the way,
  • 13:44two pieces of information for every cell.
  • 13:46So for every cell
  • 13:47we first measure the perturbations that are present.
  • 13:50So which of these guide RNAs did we detect,
  • 13:52and then secondly
  • 13:53the gene expression for the whole transcriptome.
  • 13:57So this is essentially our data here.
  • 13:59And then once we have this data
  • 14:01we can now do the analysis component, which really ends
  • 14:05up being a kind of differential expression analysis.
  • 14:08So consider a particular gene-enhancer pair.
  • 14:12So what we can do is we could take all of the cells
  • 14:15and we can break them up into two groups.
  • 14:17Those cells for which that enhancer was knocked out
  • 14:20which are in orange here, and those cells
  • 14:23for which that enhancer was not knocked out.
  • 14:26We can then split, essentially look
  • 14:29at the expression of the gene of interest
  • 14:33and see whether there's a systematic difference
  • 14:35between the expression of this gene
  • 14:37in these two populations of cells.
  • 14:39So, and then if there is a significant difference
  • 14:43then we can make a conclusion that that particular
  • 14:45enhancer is regulating that particular gene.
  • 14:48So it seems quite simple on first glance,
  • 14:52but this analysis part actually turns
  • 14:55out to be a challenging statistical problem.
  • 14:59And so the analysis
  • 15:02of these screens is actually the subject of this talk.
  • 15:07Okay so, maybe one more slide
  • 15:10and then I'll stop and see if people have questions.
  • 15:12So just to make it a little bit more concrete
  • 15:16there's a kind of a large data set that might be one
  • 15:19of the largest out there right now, by Gasperini et al.
  • 15:22It was published in Cell last year.
  • 15:24Oh wow, I guess two years ago now, 2019,
  • 15:27and so they were working with 200,000 K562 cells
  • 15:31and they were looking at 6,000 candidate enhancers.
  • 15:34And so they're looking at, I mean
  • 15:35essentially the whole transcriptome, at least the part
  • 15:38of it that has any expression in the cell type.
  • 15:41And they identified 85,000 enhancer gene pairs
  • 15:45that they essentially thought were plausible
  • 15:50to have some regulation and in their experiment
  • 15:53they had 28 perturbations on average per cell.
  • 15:58And so the way that this data would look is, think
  • 16:01about the rows as being the cells and then the columns.
  • 16:05So you have two groups of columns.
  • 16:06Firstly, you have the gene expressions,
  • 16:08and so since these are single cell data
  • 16:10we have these highly discrete counts
  • 16:13of reads or UMIs for every gene.
  • 16:18And then also we have the second bit of information
  • 16:21which is a binary matrix, which tells you
  • 16:23which cells received which perturbations.
  • 16:27So in general, in this presentation,
  • 16:29I'll denote gene expression by Y and perturbations by X.
  • 16:36And so there's also a third and very important piece
  • 16:38of information, which are technical factors per cell.
  • 16:42Perhaps the main one that I'll talk
  • 16:43about today is the sequencing depth.
  • 16:46So this is just the total number
  • 16:48of reads or UMIs measured from this cell.
  • 16:53And so this basically just varies randomly across cells
  • 16:57just as an artifact of your experiment.
  • 16:58There are other technical factors
  • 17:00like batch and so on and so forth.
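
The data layout just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ScreenData`, its field names, the helper `split_by_perturbation`, and the toy numbers are all hypothetical and are not taken from the Gasperini et al. dataset.

```python
from dataclasses import dataclass

@dataclass
class ScreenData:
    """Toy container for one screen (all names are hypothetical)."""
    expression: list     # cells x genes matrix of UMI counts (Y)
    perturbations: list  # cells x guides binary matrix (X)
    depth: list          # per-cell total UMI count, one technical factor (Z)

def split_by_perturbation(data, gene, guide):
    """Return (perturbed, unperturbed) expression values for one gene-guide pair."""
    perturbed, unperturbed = [], []
    for y_row, x_row in zip(data.expression, data.perturbations):
        (perturbed if x_row[guide] == 1 else unperturbed).append(y_row[gene])
    return perturbed, unperturbed

# Four cells, two genes, two guides (made-up numbers).
toy = ScreenData(
    expression=[[0, 3], [2, 0], [0, 1], [5, 4]],
    perturbations=[[1, 0], [0, 1], [1, 1], [0, 0]],
    depth=[900, 1200, 700, 2500],
)
perturbed, unperturbed = split_by_perturbation(toy, gene=0, guide=0)
```

Splitting the cells this way gives the two groups that a differential expression comparison for a single gene-enhancer pair would start from.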
  • 17:03Okay, so this brings me to the end
  • 17:06of the first section where I tell you
  • 17:08about the data and the assay.
  • 17:11So are there any questions before I move on
  • 17:14to talking more about the analysis of these types of data.
  • 17:25I'm assuming there are no questions
  • 17:27but do feel free to stop me if there are.
  • 17:34So as I said, this actually turns out to be kind of
  • 17:37like an annoyingly challenging statistical problem.
  • 17:40And so to illustrate this to you, let me first
  • 17:43give you a sense of what analysis methods there
  • 17:45are out there.
  • 17:47I should say, by the way that given the sort of the novelty
  • 17:52of this assay, there hasn't been a lot of work in terms
  • 17:55of designing methods specifically designed
  • 17:59for this kind of data.
  • 18:01So most of the existing analysis methods are basically
  • 18:04proposed by the same people who are
  • 18:07producing the single cell CRISPR screen data.
  • 18:10So by the way, so in this slide
  • 18:14and actually for the remainder of the talk,
  • 18:17I'm actually going to essentially focus our attention
  • 18:20on a certain gene and a certain enhancer
  • 18:25and just consider the problem
  • 18:27of figuring out whether that enhancer regulates that gene.
  • 18:30And so I'm gonna use Y_i to denote the expression
  • 18:34of that gene in cell i, X_i as the binary indicator
  • 18:38for whether that enhancer was perturbed in that cell,
  • 18:41and Z_i the vector of these extra technical covariates.
  • 18:48So with that notation out of the way,
  • 18:52the first kind of popular method for analyzing these data
  • 18:58is negative binomial regression.
  • 19:00For those of you familiar with bulk RNA-seq differential
  • 19:04expression analysis, this is similar to the DESeq2
  • 19:08methodology where you just run a negative binomial
  • 19:11regression of the gene expression, Y on a linear combination
  • 19:17of the perturbation indicator, as well as all
  • 19:20of your technical covariates.
  • 19:23And so Negative Binomial is a common model for these sort of
  • 19:28overdispersed count data that you encounter
  • 19:31in RNA sequencing data.
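
Written out, the regression described above might look as follows. The mean and dispersion parameterization and the symbols $\beta_0$, $\beta$, $\gamma$, and $\alpha$ are assumptions made for illustration; the talk does not specify them.

```latex
Y_i \mid X_i, Z_i \sim \mathrm{NB}(\mu_i, \alpha), \qquad
\log \mu_i = \beta_0 + \beta X_i + \gamma^\top Z_i, \qquad
\operatorname{Var}(Y_i \mid X_i, Z_i) = \mu_i + \alpha \mu_i^2
```

Under this sketch, perturbing the enhancer has no effect on the gene when $\beta = 0$, so the test is of $H_0 : \beta = 0$, and $\alpha$ is the dispersion parameter that captures the extra variance beyond a Poisson model.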
  • 19:35Okay, next, there is a rank based approach.
  • 19:38So this is non-parametric where it's actually much simpler.
  • 19:43You just cross-tabulate your cells by two criteria.
  • 19:48First, you see whether they have the perturbation or not.
  • 19:51And second, you see whether they have essentially higher
  • 19:55than median expression on this gene or lower
  • 19:57than median expression on this gene.
  • 19:59And then you do a two by two table test for independence.
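
A minimal sketch of this cross-tabulation, assuming a Pearson chi-square test on the two by two table (the published method may use a different test for independence). The function name `median_split_test` and the toy data are hypothetical.

```python
import math
from statistics import median

def median_split_test(x, y):
    """Cross-tabulate perturbation status against above/below-median expression.

    x: 0/1 perturbation indicators, y: expression counts.
    Returns (table, chi2, p) where table[x][above] holds cell counts and p is
    the chi-square p-value with one degree of freedom.
    """
    med = median(y)
    table = [[0, 0], [0, 0]]
    for xi, yi in zip(x, y):
        table[xi][1 if yi > med else 0] += 1
    n = len(x)
    rows = [sum(table[0]), sum(table[1])]
    cols = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    if 0 in rows or 0 in cols:
        raise ValueError("every row and column of the table needs at least one cell")
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = rows[r] * cols[c] / n
            chi2 += (table[r][c] - expected) ** 2 / expected
    # Survival function of a chi-square variable with one degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return table, chi2, p

x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [0, 0, 1, 0, 5, 6, 4, 7]
table, chi2, p = median_split_test(x, y)
```

Note that nothing in this test uses the technical factors, which becomes relevant later in the talk.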
  • 20:05And finally there are also permutation based approaches
  • 20:08where the idea is to take some test statistic
  • 20:12and then calibrate it under the null distribution
  • 20:16by permuting this column right here
  • 20:19the assignments of the perturbations to the cells.
  • 20:24So yes, that, I guess that's, what's written here.
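
The permutation idea can be sketched as below. The choice of test statistic (difference in mean expression) and the left-sided p-value are assumptions for illustration; published permutation methods for these screens use their own statistics.

```python
import random

def mean_diff(x, y):
    """Mean expression of perturbed cells minus mean expression of the rest."""
    pert = [yi for xi, yi in zip(x, y) if xi == 1]
    ctrl = [yi for xi, yi in zip(x, y) if xi == 0]
    return sum(pert) / len(pert) - sum(ctrl) / len(ctrl)

def permutation_pvalue(x, y, n_perm=999, seed=0):
    """Left-sided permutation p-value: is expression lower in perturbed cells?"""
    rng = random.Random(seed)
    observed = mean_diff(x, y)
    x_perm = list(x)
    as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(x_perm)  # reassign perturbations to cells uniformly at random
        if mean_diff(x_perm, y) <= observed:
            as_extreme += 1
    return (1 + as_extreme) / (1 + n_perm)

x = [1] * 5 + [0] * 5
y = [0, 1, 0, 1, 0, 9, 8, 9, 8, 9]
p = permutation_pvalue(x, y)
```

Shuffling keeps the number of perturbed cells fixed but treats every cell as equally likely to carry the perturbation.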
  • 20:28So okay, there's like maybe all these methods sound
  • 20:36reasonable at first, but the more you actually look
  • 20:39at the existing literature
  • 20:40the more there are various scattered signs
  • 20:43that none of these methods are really doing the trick.
  • 20:47And so here are
  • 20:50the methods that I described on the previous slide.
  • 20:54I don't know if I named them
  • 20:55but so virtual FACS is the rank based one
  • 20:57and scMAGeCK is one of the permutation based ones.
  • 21:01And so you look at plots actually from
  • 21:05the original papers themselves who propose these methods
  • 21:10and you see some signs of miscalibration.
  • 21:13And so like, for example, I'm gonna be talking mostly
  • 21:16about this data and to a lesser extent
  • 21:18about this data in my talk, but so looking here
  • 21:23so I guess perhaps I should first talk
  • 21:24about the concept of a Negative Control Perturbation.
  • 21:27So a Negative Control Perturbation is a guide
  • 21:30RNA that's actually not designed to
  • 21:33target any particular sequence along the genome.
  • 21:37So you don't expect cells that are infected
  • 21:40with a negative control perturbation to look any different
  • 21:42from cells that have no perturbation.
  • 21:46And so in this Gasperini data
  • 21:50they have 50 different negative control guide RNAs,
  • 21:53and so what they did is they basically plotted a QQ plot
  • 21:57of all of the negative control guide RNAs,
  • 22:00paired with all of the genes in the genome,
  • 22:06and what they found is and perhaps on this QQ plot
  • 22:09this doesn't look like a severe inflation from uniformity
  • 22:12but it's important to keep in mind the scale of this Y axis.
  • 22:17And so essentially this amounts
  • 22:21to a massive amount of deviation
  • 22:24from the uniform distribution in those P-values.
  • 22:27So in other words, negative control,
  • 22:31gene-enhancer pairs are looking incredibly
  • 22:33significant according to this analysis.
  • 22:37So in this particular analysis
  • 22:40they essentially found
  • 22:42the same thing here it's portrayed as a Manhattan plot
  • 22:47but you see a lot
  • 22:50of things reaching significance when, right, only
  • 22:53the circled points are those that essentially were replicated
  • 22:58in a bulk RNA sequencing experiment.
  • 23:02And then this one finally looks like they perturbed
  • 23:10lots of different enhancers and essentially looked
  • 23:13at the effect on this one particular gene.
  • 23:16And essentially what they found is that essentially all
  • 23:19of the enhancers that they tested appeared to
  • 23:22actually be per, like, have an effect on the expression
  • 23:25of this gene, when in fact this is biologically impossible.
  • 23:29So this is clearly an issue.
  • 23:32Now, these original papers clearly knew
  • 23:36that there was an issue, and so for each of the papers
  • 23:39they kind of have a little bit
  • 23:40of an ad hoc fix in order to basically correct their P-value
  • 23:45distributions, so that they look a little bit
  • 23:50closer to being calibrated.
  • 23:52And so I'm, I think for the sake of time
  • 23:55I'm probably not going to get into exactly how
  • 23:58they propose to fix their P-value distributions.
  • 24:02What I will say is that we looked in detail
  • 24:06especially at the strategy that they use here
  • 24:08and to a lesser extent at the strategies.
  • 24:10Well, actually I think here
  • 24:11they basically said just not to apply their method
  • 24:14to data where, essentially,
  • 24:18there are too many perturbations per cell.
  • 24:21So in this case, they just said, don't apply this method.
  • 24:24We looked into the kinds of fixes that they proposed
  • 24:26in these two papers, and they essentially
  • 24:28they don't quite work in the way that you would expect.
  • 24:31And so what we thought is that,
  • 24:33what we'd like to do is kind of look a little deeper
  • 24:37into this problem and try to ask ourselves
  • 24:39why are we seeing all of these issues?
  • 24:41Why do people keep running into these miscalibration issues
  • 24:44and let's try to basically address those underlying issues.
  • 24:50So we thought about it a little bit
  • 24:53and we thought about challenges
  • 24:55for both parametric and non-parametric methods.
  • 24:59So for parametric methods
  • 25:01this actually shouldn't really come as a surprise probably
  • 25:05to most people here, gene expression is known to
  • 25:08be pretty hard to model in single cells.
  • 25:11So of course we have these essentially highly discrete,
  • 25:16lots-of-zeros counts that are overdispersed.
  • 25:20Perhaps more importantly, given how sparse the data are,
  • 25:23it's actually pretty hard to get a good estimate
  • 25:26of that dispersion parameter.
  • 25:28And so there's currently no standard way
  • 25:30of estimating that dispersion parameter
  • 25:32and basically every paper, comes up
  • 25:35with their own way of doing this.
  • 25:40There are even just debates
  • 25:41about what parametric models are appropriate for these data,
  • 25:44should they be zero inflated,
  • 25:46should they not be, and some genes have even been observed
  • 25:50to have bi-modal expression patterns.
  • 25:52So essentially all of these things are telling us
  • 25:55that it's kind of hard to shoehorn
  • 25:57single cell gene expression,
  • 25:58into a nice, neat parametric model.
  • 26:01So obviously if you have misspecification of your model,
  • 26:04such as a bad estimate for a dispersion parameter,
  • 26:07that very well could cause miscalibration
  • 26:09of the kind that we saw.
  • 26:13So next we can think about non-parametric methods.
  • 26:16So maybe, obviously if these data
  • 26:19are hard to model parametrically
  • 26:21maybe the non-parametric methods are going to save us.
  • 26:25But the observation that we made that I think is
  • 26:27quite important is that these technical factors
  • 26:29that I mentioned before, like sequencing depth,
  • 26:32they impact not only the expressions of genes
  • 26:35but also the detection of these CRISPR guide RNAs.
  • 26:38So I might have led you to believe
  • 26:41in one of my early slides that we can basically
  • 26:43perfectly measure which cell contains
  • 26:46which CRISPR perturbations, but this is actually not true.
  • 26:51So single cell RNA sequencing
  • 26:54it's essentially just like this kind of a sampling process.
  • 26:59And so the more reads you sample from a cell
  • 27:03the more likely you are to detect guide RNAs.
  • 27:05And so we just essentially looked at, for example,
  • 27:10this is for one of the datasets and we just made
  • 27:13a scatterplot of the total number of guide RNAs detected
  • 27:17per cell versus the total number of UMIs.
  • 27:20So this is the sequencing depth
  • 27:21and we found this extremely clear
  • 27:24I guess I'm not showing you the P-value
  • 27:25but this P-value was like absurdly significant
  • 27:29to just basically confirm that
  • 27:31if you have more sequencing depth in a cell,
  • 27:33you're going to find more guide RNAs in that cell.
  • 27:37And so the issue with this is
  • 27:40that we basically have a confounding problem on our hands.
  • 27:43So think about this graphical model that's illustrating
  • 27:48what's going on
  • 27:49in a single cell CRISPR screen experiment. In
  • 27:52this gray box is kind of the underlying biological reality.
  • 27:56Let's say we have this presence of this guide RNA
  • 27:58and the expression of this gene and the guide RNA is
  • 28:02or the, yeah, I guess the, the CRISPR knockdown
  • 28:05of the enhancer is either affecting gene expression
  • 28:08or it is not, but we read out
  • 28:13some essentially imprecise measurement
  • 28:18of the guide RNA presence.
  • 28:19We also read out
  • 28:20an imprecise measurement of the gene expression.
  • 28:23And what's most important is that the technical factors such
  • 28:27as sequencing depth, they're actually impacting both
  • 28:30of these measurements, they're coming from the same cell.
  • 28:33And so even if there is no association between the guide RNA
  • 28:37and the gene, if you just basically naively look
  • 28:42at the association between the measured guide RNA presence
  • 28:45and the measured gene expression
  • 28:46you're going to find some association.
  • 28:50And so this is clearly an issue.
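
A small simulation can illustrate this confounding. Everything here is an assumption chosen for illustration: two depth levels, the detection probabilities, and Poisson expression whose mean depends only on depth. The guide has no true effect on expression in this toy model.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def mean(values):
    return sum(values) / len(values)

def mean_difference(cells):
    """Mean expression of guide-detected cells minus that of the other cells."""
    detected = [y for x, y in cells if x == 1]
    undetected = [y for x, y in cells if x == 0]
    return mean(detected) - mean(undetected)

rng = random.Random(0)
low, high = [], []
for i in range(2000):
    deep = i % 2 == 1                     # half the cells are deeply sequenced
    detect_prob = 0.5 if deep else 0.1    # deeper cells: guide detected more often
    expr_mean = 10.0 if deep else 1.0     # deeper cells: more UMIs for the gene
    x = 1 if rng.random() < detect_prob else 0
    y = poisson(rng, expr_mean)           # y never depends on x: no true effect
    (high if deep else low).append((x, y))

naive = mean_difference(low + high)       # pooled comparison, ignores depth
within_low = mean_difference(low)         # comparison among shallow cells only
within_high = mean_difference(high)       # comparison among deep cells only
```

In this toy setup the pooled difference comes out clearly positive even though the guide does nothing, while the differences within each depth level stay near zero, which is the pattern the graphical model predicts.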
  • 28:53And so essentially in order to correct
  • 28:55for this confounding effect, it's very important
  • 28:57that, instead of just testing independence between
  • 29:01the perturbation and the expression,
  • 29:04we test conditional independence, where
  • 29:07we're conditioning on all of these technical factors.
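
In symbols, using the talk's notation (X for the perturbation indicator, Y for the expression, Z for the technical factors), the two null hypotheses being contrasted are roughly the following. The labels "marginal" and "conditional" are added here for illustration.

```latex
H_0^{\text{marginal}} : X \perp\!\!\!\perp Y
\qquad \text{versus} \qquad
H_0^{\text{conditional}} : X \perp\!\!\!\perp Y \mid Z
```

When Z drives both measurements, the marginal null can be false even though the conditional null is true, which is why a test of the marginal null rejects for negative control pairs.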
  • 29:10And so this shows you why non-parametric methods tend to
  • 29:14suffer: when you do things like permute your data
  • 29:18or rank your data, there's this underlying assumption
  • 29:21that all of the cells are exchangeable and you're
  • 29:24using that exchangeability to build your inference on.
  • 29:27And so when you do those tests, they're implicitly
  • 29:30actually testing just the direct independence
  • 29:34the unconditional independence.
  • 29:36And so this sort of inflation we saw
  • 29:39in the non-parametric methods can be explained by this
  • 29:43source of confounding.
  • 29:47So that's actually it for that part of my talk
  • 29:51any questions about the existing methods
  • 29:53and the analysis challenges
  • 29:54and why there's a need to think about new methodology
  • 29:57for this problem.
  • 30:06Okay, I will move on.
  • 30:10So this is the part of the talk where I'm going to
  • 30:13propose a new analysis method for this kind of data.
  • 30:19And so the key kind of idea we're gonna use is
  • 30:22conditional resampling, which was not proposed by us.
  • 30:30So the idea of the conditional randomization test
  • 30:34well, it's actually, depending on how you look at it
  • 30:37it's quite an old idea and it has some connections to
  • 30:40causal inference, but it was proposed also in Candès et al.
  • 30:45And essentially the setup is that you want to
  • 30:48test conditional independence and you're under
  • 30:52the assumption that you have a decent estimate
  • 30:55of the distribution of X given Z.
  • 30:57So remember X is the perturbation.
  • 31:00Y is the expression and Z are the,
  • 31:02essentially the confounders.
  • 31:04So one way of thinking about it from a causal inference
  • 31:07standpoint is let's say we know the propensity score,
  • 31:12can we test whether there's a causal relationship
  • 31:15between X and Y sort of controlling for these Confounders?
  • 31:20So the idea of the conditional randomization test
  • 31:24is the following.
  • 31:26First, you take any test statistic T of your data,
  • 31:31and in order to calibrate this test statistic
  • 31:34under the null hypothesis, instead of doing a permutation
  • 31:39we're gonna do a slightly more sophisticated
  • 31:41resampling operation, where we're going to go through,
  • 31:45and for every cell, we are going to resample whether
  • 31:50or not it received the given perturbation, but conditionally
  • 31:55on the specific technical factors that were in that cell.
  • 31:59And here we're using crucially the information that we have
  • 32:03a handle on what this sort of propensity score is.
  • 32:07And then we're just going to recompute
  • 32:10the same test statistic on the resampled data.
  • 32:14And then we're just gonna define a P-value
  • 32:17in the usual way for a resampling based procedure.
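
The resampling steps above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name `crt_pvalue`, the left-sided p-value, the propensity-centered test statistic, and the toy data are all assumptions, and the propensities P(X_i = 1 | Z_i) are simply taken as known.

```python
import random

def crt_pvalue(x, y, propensity, statistic, n_resamples=999, seed=0):
    """Left-sided conditional randomization test p-value.

    propensity[i] is the assumed P(X_i = 1 | Z_i); small values of
    statistic(x, y) count as evidence that the perturbation lowers expression.
    """
    rng = random.Random(seed)
    observed = statistic(x, y)
    as_extreme = 0
    for _ in range(n_resamples):
        # Redraw each cell's perturbation from its own conditional distribution.
        x_tilde = [1 if rng.random() < p else 0 for p in propensity]
        if statistic(x_tilde, y) <= observed:
            as_extreme += 1
    return (1 + as_extreme) / (1 + n_resamples)

# Toy data: 100 shallow cells (propensity 0.2) and 100 deep cells (propensity 0.6).
propensity = [0.2] * 100 + [0.6] * 100
x = ([1 if i % 5 == 0 else 0 for i in range(100)]
     + [1 if i % 5 < 3 else 0 for i in range(100)])

def statistic(x_vec, y_vec):
    """Propensity-centered association between perturbation and expression."""
    return sum((xi - pi) * yi for xi, pi, yi in zip(x_vec, propensity, y_vec))

y_effect = [0 if xi == 1 else 10 for xi in x]  # perturbed cells lose expression
y_null = [5] * 200                             # expression unrelated to x

p_effect = crt_pvalue(x, y_effect, propensity, statistic)
p_null = crt_pvalue(x, y_null, propensity, statistic)
```

The only difference from a permutation test is the line that draws `x_tilde`: each cell gets its own resampling probability instead of all cells being shuffled as if they were interchangeable.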
  • 32:20So one way of thinking about it is
  • 32:23that it's kind of like a permutation test, but it's one
  • 32:27in which the reassignments of the guide RNAs
  • 32:31to the cells is one that respects
  • 32:37the confounding that there is
  • 32:40in the data instead of treating all the cells as exchangeable.
  • 32:45So this is great because the CRT adjusts
  • 32:51for confounders basically by construction and importantly
  • 32:55it avoids assumptions on the gene expression distribution.
  • 32:59And in fact, provably, the P-value you get
  • 33:01out of the CRT is valid, even if essentially,
  • 33:07even if the test statistic T is, anything you want.
  • 33:13So in that sense it kind of addresses
  • 33:15the confounding issues, like basically the Achilles heel
  • 33:19of the non-parametric methods, while avoiding assumptions
  • 33:23on the gene expression distribution,
  • 33:25which was sort of the pitfall of the parametric methods.
  • 33:27And it kind of seems to be doing something
  • 33:30that's avoiding both of those issues.
  • 33:33Now, of course, there's a trade-off, in that the
  • 33:37CRT does require you to have some estimate
  • 33:40of this propensity score.
  • 33:42So, and then secondly, the CRT is computationally expensive
  • 33:48if you compare it to, like,
  • 33:51just a parametric regression. Here
  • 33:53we're doing a parametric regression
  • 33:55but we're doing it lots of times.
  • 33:57And so how do we get around some of these issues?
  • 34:01So, and in particular, how do we actually go
  • 34:05about applying this idea to single cell CRISPR screens?
  • 34:08And so, firstly, do we understand this distribution
  • 34:13of the probability of observing a guide RNA
  • 34:17given a set of technical factors?
  • 34:20So what we're going to do in this particular method,
  • 34:25well, first we're gonna observe that
  • 34:27this is kind of a simpler phenomenon than gene expression;
  • 34:31like, guide RNAs are not really subject
  • 34:33to all of the complicated regulatory patterns of genes.
  • 34:37And secondly, kind of under the hood,
  • 34:40the actual assortment of guide RNAs
  • 34:45to cells is, you know, like fairly well modeled.
  • 34:48It's just basically like in that sense
  • 34:52the cells are pretty exchangeable.
  • 34:53What's not exchangeable is just basically
  • 34:55this measurement process.
  • 34:56So this is just kind of a simpler object
  • 34:59in the specific case of single cell CRISPR screens.
  • 35:03So we can try to bring
  • 35:04to bear various knowledge to try to get a good sense
  • 35:08of this in this case,
  • 35:10we're just gonna sort of do the easiest thing possible
  • 35:12and we're gonna fit it using a logistic regression.
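As a rough sketch of that first step, a logistic regression of the perturbation indicator on the technical factors can be fit with a few Newton (IRLS) iterations. Everything here is hypothetical: the function name `fit_logistic`, the two simulated covariates, and the coefficients are stand-ins chosen for the example, not the covariates or software used in the talk.

```python
import numpy as np

def fit_logistic(Z, x, n_iter=25, ridge=1e-8):
    """Fit P(x = 1 | Z) by Newton's method; returns (coefficients, fitted probabilities)."""
    Z1 = np.column_stack([np.ones(len(x)), Z])  # add an intercept column
    beta = np.zeros(Z1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z1 @ beta))
        grad = Z1.T @ (x - p)                    # score vector
        weights = p * (1.0 - p)
        hess = Z1.T @ (Z1 * weights[:, None])    # Fisher information
        hess += ridge * np.eye(Z1.shape[1])      # tiny ridge for numerical stability
        beta = beta + np.linalg.solve(hess, grad)
    return beta, 1.0 / (1.0 + np.exp(-Z1 @ beta))

# Simulated technical factors (think of something like log sequencing depth
# and a second per-cell covariate) with assumed true coefficients.
rng = np.random.default_rng(0)
n = 5000
Z = rng.normal(size=(n, 2))
true_beta = np.array([-2.0, 1.0, -0.5])
true_p = 1.0 / (1.0 + np.exp(-(true_beta[0] + Z @ true_beta[1:])))
x = rng.binomial(1, true_p)

beta_hat, propensity_hat = fit_logistic(Z, x)
print(beta_hat)
```

The fitted probabilities play the role of the per-cell propensity scores that the resampling step draws from.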
  • 35:17The second thing we're going to do is think
  • 35:19about what test statistic to use.
  • 35:21So I had a separate paper about essentially the power
  • 35:27of the conditional randomization test.
  • 35:29And what we found is that the closer the test statistic is
  • 35:33to the true conditional distribution of Y given X, Z
  • 35:38I guess I should say the true likelihood,
  • 35:40the better the power will be.
  • 35:41And so in that sense, what we wanna do is we
  • 35:45wanna leverage existing models that people have used such
  • 35:49as negative binomial regression.
  • 35:51It's not going to matter whether the model is true or not
  • 35:54for the sake of type one error control, but we hope
  • 35:59that we can do a better job in terms of power
  • 36:02by trying to get a good model for this.
  • 36:07And finally, how do we mitigate the computational cost?
  • 36:10And so we had a few ideas for this as well.
  • 36:12So one of them is called the distilled CRT.
  • 36:15And so, if time permits, which it might or might not,
  • 36:19I'll give you a few more details
  • 36:20about how you can use this to make
  • 36:25every resample quick.
  • 36:28And then we're also going to use this hack, essentially
  • 36:32that what we found is that the resampling distribution
  • 36:36it actually kind of looks pretty reasonable.
  • 36:40It kind of looks like a normal, but it's sort
  • 36:43of has some extra skew and maybe some extra heavy tails.
  • 36:46And so what we're gonna do is we're going to
  • 36:48fit a skew T distribution to the essentially
  • 36:52the empirical distribution of the resample test statistics.
  • 36:55And in that way, we can get more accurate P-values
  • 36:58without doing as many resamples.
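The point of the parametric fit can be illustrated as follows. With B resamples, the plain resampling p-value can never be smaller than 1/(B+1), whereas a smooth distribution fitted to the same resampled statistics can be evaluated further out in the tail. Note the swap: this sketch uses SciPy's skew-normal (`scipy.stats.skewnorm`) as a readily available stand-in, not the skew-t distribution mentioned in the talk, and the simulated null statistics, the observed value, and the right-tailed direction are all assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_resamples = 500

# Hypothetical resampled test statistics: roughly normal with some right skew.
null_stats = rng.standard_normal(n_resamples) + 0.5 * rng.exponential(size=n_resamples)
t_obs = 6.0  # hypothetical observed statistic, far in the right tail

# Plain resampling p-value: bounded below by 1 / (n_resamples + 1).
p_empirical = (1 + np.sum(null_stats >= t_obs)) / (n_resamples + 1)

# Fit a skewed parametric family to the resampled statistics and read the
# tail probability off the fitted curve instead.
shape, loc, scale = stats.skewnorm.fit(null_stats)
p_fitted = stats.skewnorm.sf(t_obs, shape, loc, scale)

print(p_empirical, p_fitted)
```

The fitted tail is only as trustworthy as the fit itself, so in practice one would want to check the fitted curve against the histogram of resampled statistics, as the speaker does on the slide.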
  • 37:01And so putting together all of these pieces
  • 37:03we get this method, which we call Sceptre
  • 37:06or single cell perturbation screen analysis
  • 37:08via conditional resampling.
  • 37:11And so essentially what we do is what I said
  • 37:13on the previous slide.
  • 37:15We first use a logistic regression to fit a probability
  • 37:19for every cell that we would find a perturbation there.
  • 37:23And then we're gonna use these perturbation probabilities
  • 37:26and resample this particular column.
  • 37:29And so now we have a whole bunch of resampled datasets.
  • 37:32Now we're going to use a negative binomial regression
  • 37:35or more precisely a distilled negative binomial regression
  • 37:38for speed, to get the test statistic
  • 37:42for both the original data
  • 37:43and for all of these resampled datasets.
  • 37:47Then we're gonna put together all
  • 37:48of these resampled test statistics into this gray histogram.
  • 37:51And again, we're gonna fit this magenta curve
  • 37:54which is the skew T distribution
  • 37:57which seems to fit pretty well in most cases.
  • 37:59And then we're gonna compare the original test statistic
  • 38:02against this skew T distribution and get a P-value that way.
  • 38:07And so this is represented by the shaded region here.
  • 38:10And I think what's noteworthy is to compare this fitted
  • 38:14null distribution
  • 38:15to this standard normal null distribution.
  • 38:19I guess I should have said here
  • 38:20that the actual test statistics are Z values extracted
  • 38:24from the negative binomial regression.
  • 38:26So if your model were true, the Z values
  • 38:30under the null would follow a standard normal distribution.
  • 38:34And so what we find is that when we resample we
  • 38:37get something that's not the standard normal distribution.
  • 38:40And so in that sense you can view it as
  • 38:42a sort of measure of the departure sort of from,
  • 38:48or sort of the lack of model fit that went
  • 38:51into this negative binomial regression.
  • 38:54So another way of putting this is that
  • 38:57you can imagine that if you did happen
  • 38:59to correctly specify your negative binomial regression model
  • 39:03then you would sort of be getting back the same P-value
  • 39:06that you would have gotten otherwise.
  • 39:08So in that sense
  • 39:09we're not really reinventing the wheel here
  • 39:11if you do have a good parametric model, but if you don't
  • 39:13then we can correct for it using this resampling strategy.
  • 39:18So I guess this is an important slide
  • 39:19so maybe I will stay here for a little bit and ask
  • 39:23if anyone has questions about how our methodology works.
  • 39:31- Hi, I have a bunker question.
  • 39:33So have you tried a hurdle model to deal
  • 39:36with this kind of full data is the cause
  • 39:40of the weird distribution of the data?
  • 39:45- Oh, so let's see.
  • 39:48You mean to model the, essentially to model the gene
  • 39:52expressions or do you mean to model the CRISPR perturbations?
  • 39:58- From this page,
  • 40:03so first step you use a logistic regression
  • 40:05and then you use a negative binomial.
  • 40:08So it's like a two-step model, but the hurdle model,
  • 40:13they combine them together to deal with the overall dataset.
  • 40:18- I see, I will admit that I'm not familiar with those
  • 40:21models but I will definitely take a look
  • 40:24at those and see if they might be applicable.
  • 40:28Yeah, I guess like in this sense
  • 40:33the approach that I've proposed here is pretty flexible.
  • 40:37I mean, really
  • 40:37what makes this approach work well is as long
  • 40:42as you have a decent approximation
  • 40:44to these perturbation probabilities;
  • 40:46we're thinking about them as propensity scores.
  • 40:48So aside from that,
  • 40:52really what's standing behind this is the generality
  • 40:55of the conditional randomization test where
  • 40:56you can basically use any test statistic you want.
  • 40:58And so, definitely the method is flexible
  • 41:02and can incorporate different choices,
  • 41:05like the one that you've mentioned,
  • 41:08But we haven't tried it.
  • 41:10I'm not familiar with this model.
  • 41:12Thank you though.
  • 41:15Anyone else have any questions about the methodology?
  • 41:24Okay, perhaps I'll okay.
  • 41:27So yes, so this is kind of like a separate thing
  • 41:31which I will not get
  • 41:33into details of for the sake of time, but we had
  • 41:36a separate paper whose focus was just basically,
  • 41:39the conditional randomization test is a cool test
  • 41:42but everyone knows it's slow.
  • 41:43So how can we essentially accelerate it
  • 41:46while retaining a lot of its power advantages?
  • 41:49And so what we found is that
  • 41:51if you just ever so slightly modified the test statistic
  • 41:54by sort of regressing Y first on the confounders,
  • 41:59and then on X, instead
  • 42:02of regressing it on both at the same time
  • 42:04what we found is that this ends up being much, much faster
  • 42:08because only the second step needs to be repeated
  • 42:10upon resampling, and the second step is much cheaper.
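The two-step idea can be sketched as below. This is a simplified stand-in, not the speaker's method: ordinary least squares replaces the negative binomial regression for the "regress Y on the confounders" step, the statistic is a plain inner product between the centered perturbation and the residuals, and the function names and simulated data are hypothetical. What it does show is the structure: the expensive fit happens once, and each resample costs only a dot product.

```python
import numpy as np

def distill_response(y, Z):
    """Step 1 (done once): regress y on the confounders, keep the residuals."""
    Z1 = np.column_stack([np.ones(len(y)), Z])
    coef, *_ = np.linalg.lstsq(Z1, y, rcond=None)
    return y - Z1 @ coef

def dcrt_pvalue(x, resid_y, propensity, n_resamples=500, seed=0):
    """Step 2 (repeated per resample): cheap statistic on the distilled residuals."""
    rng = np.random.default_rng(seed)
    t_obs = np.dot(x - propensity, resid_y)
    # Draw all resampled perturbation vectors at once: shape (n_resamples, n_cells).
    x_tilde = rng.binomial(1, propensity, size=(n_resamples, len(x)))
    t_null = (x_tilde - propensity) @ resid_y
    return (1 + np.sum(np.abs(t_null) >= abs(t_obs))) / (n_resamples + 1)

# Simulated data with one confounder z and an assumed-known propensity.
rng = np.random.default_rng(2)
n = 500
z = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-(z - 1.0)))
x = rng.binomial(1, propensity)
y = 2.0 * z + 3.0 * x + rng.normal(size=n)  # true effect of x on y

resid_y = distill_response(y, z.reshape(-1, 1))
p_value = dcrt_pvalue(x, resid_y, propensity)
print(p_value)
```

When the perturbation vector is sparse, the inner product only needs the residuals of the cells where the resampled indicator equals one, which is one way the per-resample cost can be cut further.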
  • 42:14So what we did is that we, in the context of Sceptre,
  • 42:19we built on this
  • 42:20by accelerating the resampling steps even further
  • 42:23by leveraging the sparsity
  • 42:25of the CRISPR perturbation vector X.
  • 42:28And so perhaps the most important part is that the cost
  • 42:31of the CRT for one gene-enhancer pair went
  • 42:34down from 25 minutes down to 20 seconds
  • 42:38as a result of these computational accelerations.
  • 42:40And so for reference a single negative binomial
  • 42:43regression took three seconds.
  • 42:45So it's still,
  • 42:46we're a factor of six or seven, more expensive than the
  • 42:50just the sort of vanilla single regression
  • 42:53but it's definitely, I think sort of within,
  • 42:56definitely within an order of magnitude
  • 42:58and hopefully, as you can tell, much better statistically.
  • 43:03So I will show you a few, so this is a simulation.
  • 43:12I'm not gonna go through it in detail, but the idea is
  • 43:14that what we're demonstrating here is that you can give
  • 43:19Sceptre essentially negative binomial models
  • 43:22that are misspecified in different ways.
  • 43:25You can give it a dispersion
  • 43:27that's too large, a dispersion that's too small
  • 43:30or maybe the true model does have zero inflation
  • 43:32but we're not accounting for it.
  • 43:34And what we find is that Sceptre essentially
  • 43:36is well calibrated, regardless,
  • 43:40whereas if you just essentially took
  • 43:43the like the wrong dispersion estimates at face value
  • 43:47you would encounter problems.
  • 43:49And this scMAGeCK approach,
  • 43:51which basically is a permutation approach,
  • 43:54is just sort of not doing a great job accounting
  • 43:56for the confounding, so we see this inflation.
  • 44:01So perhaps more excitingly
  • 44:03I'd like to show you an application to real data.
  • 44:08So I guess this is the, so firstly
  • 44:11we wanna make sure the method is actually calibrated.
  • 44:13So if you remember, the initial observation was
  • 44:16that a lot of these methods aren't calibrated.
  • 44:18So because I'm running a little short on time
  • 44:20let's kind of maybe ignore this panel here
  • 44:22and focus our attention here.
  • 44:24So this is the Gasperini data that I introduced before.
  • 44:28And so this red line here is actually the QQ plot you saw
  • 44:34on one of my first slides
  • 44:35of all of those negative control gene-enhancer pairs.
  • 44:39It looks different here because
  • 44:41I've sort of cut off the scale
  • 44:43so we can actually visualize it.
  • 44:45So we see a quite significant departure.
  • 44:48What we actually did is we thought, okay, maybe
  • 44:51they have a bad estimate of the dispersion
  • 44:54but maybe we can use some more
  • 44:55like state-of-the-art single cell sort of methods
  • 45:00to improve our estimate of the dispersion.
  • 45:03And so maybe we don't need to go
  • 45:04to all the effort of doing the resampling.
  • 45:06And so what we found is that
  • 45:07when we use a state-of-the-art dispersion estimate
  • 45:10we still have very substantial miscalibration.
  • 45:14This is, I think, just a testament
  • 45:15to the fact that it's just hard to estimate that parameter
  • 45:18because there's not all that much data to estimate it.
  • 45:21And then by comparison, we built Sceptre
  • 45:24from the same exact negative binomial model
  • 45:27which is this improved one,
  • 45:29and we found that the negative control P-values
  • 45:32are I think, excellently calibrated.
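The calibration check being described is a QQ plot of negative-control P-values against the uniform distribution. A minimal sketch of how the plotted points could be computed is below; the function name `qq_points` and both simulated P-value sets are assumptions for illustration, not the Gasperini data.

```python
import numpy as np

def qq_points(p_values):
    """Return (expected, observed) -log10 P-values for a uniform QQ plot."""
    p_sorted = np.sort(np.asarray(p_values, dtype=float))
    n = p_sorted.size
    # Expected quantiles of n uniform draws, using midpoint plotting positions.
    expected = -np.log10((np.arange(1, n + 1) - 0.5) / n)
    observed = -np.log10(p_sorted)
    return expected, observed

rng = np.random.default_rng(0)
calibrated = rng.uniform(size=10_000)  # what negative controls should look like
inflated = calibrated ** 2             # toy miscalibration: P-values too small

exp_c, obs_c = qq_points(calibrated)
exp_i, obs_i = qq_points(inflated)

mid = len(exp_c) // 2
print(exp_c[mid], obs_c[mid], obs_i[mid])
```

For calibrated P-values the points hug the diagonal; for the inflated set the observed values sit above the expected ones, which is the kind of departure seen for the uncorrected negative binomial P-values.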
  • 45:36So this shows you, again, the benefit
  • 45:40of this different way of calibrating your test statistic
  • 45:42and not relying on the parametric model for gene expression.
  • 45:47So this figure just shows a few of the other methods
  • 45:50but for the sake of time, I'm going to move on.
  • 45:55This is looking at positive control data.
  • 45:58So this basically is like, trying to get a sense of power.
  • 46:01And so, again, maybe if we restrict our attention
  • 46:04to this left panel here, what we found is that
  • 46:06if we just plot the, our P-values
  • 46:10versus the P-values, by the way, maybe I should say
  • 46:12what is a positive control.
  • 46:14A positive control
  • 46:15in this case is a CRISPR perturbation that instead
  • 46:18of targeting an enhancer is targeting the transcription
  • 46:22start sites of a gene.
  • 46:25And so essentially, like we don't need any extra biology
  • 46:29to know that, if you target a transcription start site
  • 46:32that's really going to knock out the gene.
  • 46:34And so you can still try to do your association test and see
  • 46:37if you've picked up those positive control associations.
  • 46:40And so what we find is that
  • 46:41actually Sceptre not only is better calibrated
  • 46:44but it also tends to have more significant P-values
  • 46:48on those positive controls.
  • 46:49So it apparently is boosting both the sensitivity
  • 46:53and the specificity of this association test.
  • 46:57- Eugene, here the original empirical P-value, is this
  • 47:00from the negative binomial test?
  • 47:03So after you did the conditional randomization,
  • 47:08you actually have better P-values
  • 47:11for the positive control pairs?
  • 47:13- Yes, so you would expect, you would expect it's like
  • 47:22aren't we just making the P-values just,
  • 47:23like less significant
  • 47:25in a way to just help with the calibration.
  • 47:27So how can it be boosting power?
  • 47:29But, like, the degree of inflation sort of varies;
  • 47:34like essentially, and we'll see this
  • 47:37I think on the next slide as well, essentially
  • 47:40what Sceptre is doing is not
  • 47:43like a monotone transformation of things.
  • 47:46Actually, just maybe to illustrate it,
  • 47:50I think, this is just an example where essentially what
  • 47:59we would have gotten from the sort
  • 48:01of the vanilla negative binomial analysis is the area
  • 48:04under this dotted or dashed curve here.
  • 48:07And so Sceptre could, well, basically whoops sorry,
  • 48:11it could have a, like a lighter tail as it has in this case.
  • 48:15And so it could sort of either make the P-values
  • 48:20more significant or less significant.
  • 48:22It's correcting the miscalibration
  • 48:24but not necessarily in a way that's like conservative.
  • 48:26And so this is encouraging.
  • 48:32Yeah, that's a good question though.
  • 48:35- I guess that depends on
  • 48:37the confounding you included in the model.
  • 48:41So then I would expect it will reduce the significance,
  • 48:47but if you include other confounding
  • 48:50that's mostly contributing to the noise level probably.
  • 48:55- Yeah, sure, so I think I'm right.
  • 48:59Yeah, let me think we are, let me see
  • 49:03I think in this case, we're correcting
  • 49:05for approximately the same confounders here.
  • 49:08So they already had some confounders
  • 49:10that they were correcting for
  • 49:11in the original negative binomial.
  • 49:12So in that sense, it's a little bit more
  • 49:14of maybe an apples to apples comparison.
  • 49:16It's just a question of how do you calibrate
  • 49:20that test statistic that is
  • 49:21trying to correct for the confounders
  • 49:23but I think what you're getting at
  • 49:25I do think it can go either way.
  • 49:27It's not obvious that Sceptre would make a P-value
  • 49:29either more or less significant.
  • 49:32I think I will say just as a small detail here
  • 49:34that in addition to the negative binomial regression
  • 49:38this P-value, it says,
  • 49:40there's this strange word empirical here.
  • 49:42What it means is that
  • 49:43they've kind of also applied a fix that they had
  • 49:47because they realized that they had the miscalibration
  • 49:48and then they kind of like smashed all
  • 49:50of their P-values sort of,
  • 49:52so these are sort of like, so in that sense
  • 49:55it's not an apples to apples comparison
  • 49:57but what we're doing is we're comparing
  • 49:58to the P-values that were actually used
  • 50:00for the analysis in this, in this paper.
  • 50:02So maybe that makes it even harder to compare, but yes.
  • 50:06So take this plot with a grain of salt, if you will.
  • 50:10Perhaps I think the most exciting part is
  • 50:14actually applying this to new gene-enhancer pairs
  • 50:18where we don't know necessarily what the answer is.
  • 50:21And so this plot just shows you
  • 50:24we're just plotting it's actually, I guess
  • 50:27similar to this plot we saw here
  • 50:30except now we're looking at the candidate enhancers.
  • 50:33And so essentially the different colors.
  • 50:36So firstly, this also just shows you
  • 50:38that this is very much not a monotonic transformation.
  • 50:42Like you really can like, if you look into this quadrant
  • 50:47this is an example where the original P-value was very
  • 50:51not significant, but according to Sceptre
  • 50:54it can be very significant and vice versa.
  • 50:58So essentially I've just kind of highlighted
  • 51:01those gene-enhancer pairs that were,
  • 51:03found by one method and not the other.
  • 51:05And so the upshot is that there's a total
  • 51:09of about, roughly 500 or so found.
  • 51:12Well, I guess Sceptre found 563;
  • 51:15of those, 200 were new in the sense
  • 51:18that they were not found by the original analysis.
  • 51:21And then 107 were found by the original analysis
  • 51:24but were not found by us.
  • 51:26And we have strong reasons to believe
  • 51:28that these could be false positives based
  • 51:30on exactly the sorts of miscalibration that I presented.
  • 51:35We did look at a few specific new discoveries here
  • 51:38and found that they were corroborated by eQTL data
  • 51:43and, for those of you who are familiar,
  • 51:45enhancer RNA correlation data. Since I'm running low
  • 51:48on time, I don't have time to explain this to you,
  • 51:51but these are all P-values
  • 51:53of association based on orthogonal functional assays.
  • 51:58Also, we found that our discoveries were more enriched
  • 52:01for biological signals in a few different ways.
  • 52:04One of them is that, and again,
  • 52:06I'm sort of maybe going a little bit
  • 52:08more quickly here 'cause I'm about to run out of time
  • 52:11but there are these things called topologically
  • 52:13associating domains, which are basically regions
  • 52:16in the genome within which most
  • 52:19of these regulatory interactions are thought to occur.
  • 52:22And so what we find is that a greater fraction
  • 52:25of the gene-enhancer pairs we found compared
  • 52:27to the original analysis did lie
  • 52:30in the same topologically associating domain.
  • 52:32So in this case, 74% versus
  • 52:35the 71% found in the original analysis.
  • 52:37So in this sense, I mean, it's just kind of
  • 52:39like a first order sense of biological plausibility.
  • 52:44I think people are starting to think
  • 52:46that there are interactions that are sort of
  • 52:48outside of TADs as well.
  • 52:49So I don't think this is a signal that,
  • 52:5226% of these things are false discoveries
  • 52:55but we definitely do expect, a high degree
  • 52:59of enrichment for within-TAD interactions.
  • 53:05Also if you do look
  • 53:07at some of these more circumstantial pieces
  • 53:09of evidence for regulations, such as things
  • 53:14like transcription factor binding or histone modifications
  • 53:19so we can use ChIP-seq to essentially assess
  • 53:23for any given enhancer
  • 53:27whether there are these kinds of signatures of regulation.
  • 53:33And so what we found is that we did a little bit
  • 53:35of an enrichment analysis where we looked at all
  • 53:38of those enhancers that were found to be paired
  • 53:40to genes by Sceptre versus the original method
  • 53:43and looked to what extent they were enriched
  • 53:46for these other signatures,
  • 53:49these ChIP-seq based signatures of regulation.
  • 53:52And what we found is that
  • 53:53across eight of these ChIP-seq targets, and by the way
  • 53:58these eight are not selected.
  • 53:59These actually were the exact eight ChIP-seq targets
  • 54:03that they examined in the original paper,
  • 54:06we found greater enrichment.
  • 54:08So in this sense, also the enhancers being
  • 54:13picked up by Sceptre are just more biologically
  • 54:15plausible using these orthogonal kinds of assays.
  • 54:19So I find this very exciting
  • 54:21and I'm just gonna maybe make a few remarks
  • 54:25and hopefully there's just a little bit
  • 54:26of time for questions.
  • 54:27I will also be around for a few minutes after the seminar.
  • 54:30If anyone wants to stick around and ask me questions
  • 54:33you also might have your next thing to go to.
  • 54:35So I understand if not.
  • 54:37But maybe the summary is that, mapping gene-enhancer
  • 54:41regulatory relationships is very important
  • 54:43if we wanna translate GWAS hits into disease insights.
  • 54:47And there's been this very exciting new technology
  • 54:50that allows us to answer that question.
  • 54:53This technology was proposed very recently,
  • 54:56and so there aren't that many methods out there
  • 55:00to analyze these kinds of data.
  • 55:02And so what we did with Sceptre is we leveraged recent
  • 55:05methodological advances in statistics to overcome the primary
  • 55:08limitations of the parametric and non-parametric analysis
  • 55:11methods that were available.
  • 55:13And finally, we applied it to the largest existing
  • 55:17data set of this kind.
  • 55:19And what we get is a greater number of more biologically
  • 55:22meaningful regulatory relationships.
  • 55:25So I had a few other discussion slides, maybe I'll just
  • 55:28read the title to you without getting into the details
  • 55:31but this is a rapidly developing technology.
  • 55:34And we do foresee that Sceptre will be applicable
  • 55:37to future iterations of the technology.
  • 55:40So that's promising.
  • 55:42And secondly, this is more like the beginning
  • 55:46of the road than the end of the road.
  • 55:47There are lots of remaining challenges,
  • 55:51this includes looking for interactions
  • 55:53among enhancers, things like dealing
  • 55:56with multiple guide RNAs in the same enhancer;
  • 56:00there's just basically like a whole, I would say, playground
  • 56:02of statistical problems that have yet to be addressed.
  • 56:07So maybe finally, if you'd like to learn more
  • 56:11we have a preprint on bioRxiv.
  • 56:13I wanna acknowledge my co-authors again.
  • 56:16And finally, so Tim has worked very hard
  • 56:20on making this an R package, so
  • 56:24you can find it on GitHub
  • 56:26and I'm very happy to take questions now
  • 56:29but if you have any burning questions that come
  • 56:32to you 30 minutes after my talk
  • 56:35please feel free to email me at this address.
  • 56:37So thank you, and I should have said at the top, thank you
  • 56:40Lexi for the invitation.
  • 56:42- Thank you for agreeing to present your work here.
  • 56:45It's really a nice talk.
  • 56:47- Yeah Thank you.
  • 56:49- So I have some, maybe less related question
  • 56:52to your current work, but maybe interesting to consider.
  • 56:56I am not sure.
  • 56:57Have you looked at the correlation structure
  • 56:59between the X matrix?
  • 57:03- Yeah, so essentially my sense is that it's
  • 57:07like a factor model where you have all
  • 57:11of these sort of confounders that are inducing correlation
  • 57:16among all the X's, but essentially like once you account
  • 57:21for that confounding, it's independent.
  • 57:25- I see (indistinct) correlation.
  • 57:29- So it's fairly small correlation and essentially
  • 57:33the reason for, and this is very different from
  • 57:36for example, genome-wide association studies.
  • 57:38So it's like, Oh
  • 57:39is there some analog of linkage disequilibrium?
  • 57:40And the key difference here is that
  • 57:43it's essentially a designed experiment.
  • 57:46So even though you're not controlling exactly
  • 57:49which cells receive what perturbations you are
  • 57:51basically assigning them at random.
  • 57:54So if it weren't
  • 57:55for this sort of pesky measurement mechanism business
  • 57:58it would be an unconfounded problem.
  • 58:01But essentially, so the only correlations are coming
  • 58:05from this measurement.
  • 58:08Yes so that is a great question
  • 58:10because you can ask, well
  • 58:11how did I do the sleight of hand where,
  • 58:13like slide three, all of a sudden I was working
  • 58:15with like one enhancer,
  • 58:17and where did all the rest of them go?
  • 58:18And I think we're actually not losing all too much
  • 58:21by doing this, especially
  • 58:23since we are controlling for those technical factors.
  • 58:25- Yeah thanks that makes sense to me.
  • 58:28And another thing, maybe less
  • 58:31statistical, is how many confounding factors
  • 58:34you are controlling for,
  • 58:35and what are the important ones that you have identified?
  • 58:39- Yeah, I mean, so in this case
  • 58:41we're doing essentially we're following the lead of
  • 58:45the original paper
  • 58:46for which confounding factors we control for.
  • 58:48So in addition to sequencing depth,
  • 58:51yeah, so they do have a batch effect,
  • 58:52and there's also something called percent mitochondrial.
  • 58:55So it's like what fraction of all the reads that you got
  • 58:58in this particular cell came from mitochondrial DNA
  • 59:02as opposed to, regular DNA, maybe a few others
  • 59:09like just total number
  • 59:10of genes expressed in the cell, things of this nature.
  • 59:13So I think here we're correcting
  • 59:15for about five, but you could think of other things
  • 59:18like cell cycle. K562 is a pretty
  • 59:25homogeneous cell line, but especially
  • 59:27once you get to other kinds of, tissue samples
  • 59:30you might need to think about, cell type
  • 59:33and things of this nature.
  • 59:35So I think there's a lot to consider here;
  • 59:38we used kind of five easy ones.
  • 59:42- Okay, thanks.
  • 59:44Any more questions for Eugene?
  • 59:48Yeah, I think we are approaching
  • 59:52the end of the talk, the seminar.
  • 59:55So thanks again for your great talk.
  • 59:58And if you have any further questions
  • 01:00:01you can just send emails to Eugene offline.
  • 01:00:05- Yes, yes, definitely don't hesitate to reach out.