# YSPH Biostatistics Seminar: “Opening Black Boxes: Enhancing Statistical Rigor in Genomics Data Science”

October 14, 2022## Information

Speaker: Jingyi Jessica Li, PhD, Associate Professor, Department of Statistics, University of California, Los Angeles

October 11, 2022

ID8168

To CiteDCA Citation Guide

- 00:00(interference drowns out speaker)
- 00:08<v Announcer>Biostatistics, computational.</v>
- 00:11(interference drowns out speaker)
- 00:15So before joining UCLA in 2013.
- 00:19(interference drowns out speaker)
- 00:24Production work.
- 00:25(interference drowns out speaker)
- 00:29On the script is (interference drowns out speaker)
- 00:33include differentiation factors,
- 00:35asymmetric (indistinct) replication,
- 00:37p-value-free false discovery (indistinct),
- 00:40and a high dimensional variable selection.
- 00:43And on the bio (indistinct) application side,
- 00:46her research include all single cell (indistinct)
- 00:50for (indistinct) genomics and (indistinct).
- 00:52(interference drowns out speaker)
- 01:00Research published.
- 01:01(interference drowns out speaker)
- 01:142019
- 01:16She's an MIT Technology Review certified (indistinct)
- 01:21in 2020, and she has received from Harvard.
- 01:25(interference drowns out speaker)
- 01:31<v Jingyi>I couldn't wait for the introduction.</v>
- 01:33It's my honor here to present my work,
- 01:36and my sabbatical in this fellowship program
- 01:40at Harvard Radcliffe Institute.
- 01:41So it's my pleasure to talk about some of our recent work
- 01:46related to how statistic rigor is important in genomics.
- 01:51So I want to say that when I was a student,
- 01:54especially I think most of our audience here are students,
- 01:57I want to give you this motivation.
- 02:00When I was a student back in 2007,
- 02:02that was when I just started my PhD
- 02:05and I was interested in bioinformatics.
- 02:08I had a lot of questions about bioinformatics methods
- 02:11after I took statistics classes.
- 02:14I think some of the questions I listed here
- 02:16include are P values valid?
- 02:19Because P values are so widely used
- 02:21in genomics bioinformatics.
- 02:23And also, we have a lot
- 02:25of bio bioinformatics methods developed for data analysis.
- 02:28And I wonder why don't we use classical statistical methods
- 02:32in textbooks?
- 02:33And the third thing is,
- 02:34when we use statistical test to understand the question,
- 02:38to answer some pivot question,
- 02:40what is the proper null hypothesis?
- 02:42So you will see those questions in the topics
- 02:45I will talk about next.
- 02:47So this talk will focus on the multiple testing problem.
- 02:52See, multiple testing, what it means
- 02:53is that we have multiple hypothesis tests,
- 02:57and the criteria we use in this problem are P values,
- 03:01which we have one P value per test.
- 03:04So we know that the requirement for a valid P value
- 03:08is that P values should follow the uniform distribution
- 03:11between zero one under the null hypothesis.
- 03:14Or we may relax this to be super uniform.
- 03:18Just for your information,
- 03:19super uniform means that the P values
- 03:22have higher density toward one
- 03:25and lower density towards zero.
- 03:26So that's still okay for type one error control,
- 03:30even though you may have a larger
- 03:31than expected type two error.
- 03:33So given the many, many P values,
- 03:35we need one criterion to set a cutoff on the P values.
- 03:40And the most commonly used criterion
- 03:42for multiple testing correction
- 03:44is called a false discovery rate, short as FPR.
- 03:47So the definition here is the expectation of this ratio,
- 03:52and this ratio is the number of false discoveries
- 03:55over the number of discoveries.
- 03:57So this notation means the maximum
- 04:00between the number of discoveries and one.
- 04:02In other words, we don't allow the denominator to be zero,
- 04:06if we don't make any discovery.
- 04:07So this is to avoid the dividing zero issue.
- 04:10And this ratio has a name called false discovery proportion.
- 04:14In other words, we can have this proportion
- 04:16for one particular data set.
- 04:18However, as you know, we don't observe this ratio
- 04:22because we don't know which discoveries are false.
- 04:25So therefore, this ratio is only a hypothetical concept,
- 04:28but not really computable.
- 04:30And here, the expectation
- 04:32is taken over all possible data set
- 04:35from the same distribution as our data set.
- 04:38So this is the frequentist concept
- 04:40because we have imaginary potential data sets.
- 04:44So therefore, the phenomena paper
- 04:47by Benjamini and Hochburg gave us a way
- 04:50to control this expectation called FDR
- 04:53under a claimed level, say, 5%,
- 04:57even though we couldn't realize this ratio itself.
- 05:01But we could control its expectation.
- 05:02So that's the magic of statistics.
- 05:05So Benjamini Hochburg algorithm allows us
- 05:07to set a cutoff on the P values to control the FDR.
- 05:11But I want to emphasize that the FDS's only controlled
- 05:15when P values satisfy this assumption,
- 05:17otherwise, it may not be.
- 05:19So I want to say three common causes of ill-posed P values,
- 05:24which make P values don't satisfy this assumption
- 05:27in genomics, and I'll go through them one by one.
- 05:31The first issue is what I call the formulation
- 05:34of a two sample test problem as a one sample test.
- 05:38What does this mean?
- 05:39So I will use the common genomic analysis
- 05:42of ChIP-seq data as an example.
- 05:45So in ChIP-seq data,
- 05:46we want to measure where a protein binds in the genome.
- 05:50So you can consider the X axis as the genome
- 05:54and the Y axis as the protein binding intensity
- 05:57measured by ChIP-seq.
- 05:59So here, we have experimental sample,
- 06:02the condition of our interest, say, a certain cell line.
- 06:06And the background sample is what we know
- 06:08that there's no protein,
- 06:10so there should be no protein binding.
- 06:12But we still want to measure the noise from the experiment.
- 06:15So we need this contrast.
- 06:17And here, we want to say that the region in the red box,
- 06:22this interval, we want to call it as a peak,
- 06:26if we see the intensity in the experimental sample
- 06:30is much larger than the intensity in the background sample.
- 06:33So we do the comparison and we want to cut this at a peak.
- 06:36That's the purpose of this analysis.
- 06:39And I wanna say that, in the field,
- 06:42because ChIP-seq has become popular since 2008,
- 06:45Macs and Homer are probably the two most popular software
- 06:49for cutting peaks.
- 06:51Even though they have very complex procedures
- 06:54for processing the sequencing data
- 06:56that in a statistical part
- 06:58to call a region as a peak or not,
- 07:01I can say, their formulation is as follows.
- 07:04Given a region, we count its number of ChIP-seq reads
- 07:09in the background sample and in the experimental sample.
- 07:12So let's just summarize this intensity as a count,
- 07:15a count here, a count here, and both are now negative.
- 07:19So I call the background count as big X,
- 07:21experimental count as big Y.
- 07:24And in our data, we have the observations, right?
- 07:27We refer to them as small x, small y.
- 07:30Then, the P value in both software
- 07:33is essentially this probability, the probability
- 07:37that big Y is greater or equal than the observed small y,
- 07:42where the big Y follows upon some distribution
- 07:45with mean parameter as the small x.
- 07:48Now, when I look at this formula back in 2008,
- 07:51the Macs paper, I wonder whether this is correct.
- 07:56And I don't think so.
- 07:57Because the reason, if you look at it,
- 07:59is what is the null hypothesis?
- 08:01The null hypothesis is essentially, okay,
- 08:04let's assume the experimental count
- 08:06is our test statistic, okay?
- 08:09We assume it follows a Poisson distribution
- 08:11with mean lambda.
- 08:13And here, the null hypothesis is lambda is equal to small x.
- 08:18Alternative is lambda greater than small x.
- 08:21So what's the problem with here?
- 08:23Essentially, we are using small x as a fixed parameter
- 08:27instead of a random observation.
- 08:29So in other words,
- 08:30the randomness in the background count is ignored.
- 08:33We only consider experimental count as the random variable.
- 08:37So in other words, where use the two sample testing problem
- 08:40to a one sample testing problem
- 08:42because we only consider the randomness
- 08:44in the experimental sample.
- 08:46But this is not something our textbook teaches us.
- 08:50The reason is because if we consider background
- 08:53as one condition, experimental has another condition,
- 08:56under each condition, our sample size is only one.
- 09:00So therefore, the T test will not apply
- 09:02because a central limit here clearly doesn't apply.
- 09:05So how do we calculate P value, any ideas?
- 09:10I think one possibility that we could still assume
- 09:13Poisson distribution for both background X
- 09:16and experimental Y.
- 09:17You have two Poisson, under the independence,
- 09:21we can probably derive the distribution
- 09:23for Y minus X, right, and what's the null distribution.
- 09:26That's the only way.
- 09:27But, if you think about it, how can we verify
- 09:30whether the Poisson distribution is reasonable?
- 09:33You only have one observation from it.
- 09:35The distribution could be anything, right?
- 09:37So assuming a parametric distribution seems quite,
- 09:41I will say, aggressive.
- 09:42So I think P value calculation is challenging here.
- 09:45And also, I even wonder, in this case,
- 09:49for this one versus one comparison,
- 09:51should we use a P value?
- 09:53Or is this really a testing problem that's feasible?
- 09:57So I would say, over the years, I gradually realized
- 10:00that here we looked at many, many regions,
- 10:03not just one region.
- 10:04So the goal or the criterion that's ultimately used
- 10:08is actually FDR.
- 10:09And in this process,
- 10:12P values are just intermediate for FDR control,
- 10:16instead of our final target.
- 10:18So do we have to stick with P values?
- 10:21This motivated me to write this paper with my students
- 10:25to propose a way to achieve p-value-free FDR control
- 10:30by leveraging the theory in Barber and Candes paper,
- 10:34their knockoff paper,
- 10:36so we could actually doing FDR control
- 10:39in this example without using P value.
- 10:41So I will talk about this later in my talk,
- 10:43but this is one motivation for the Clipper paper.
- 10:47The second issue with P values is that we observe,
- 10:50sometimes, P values are not valid
- 10:52because the parametric model used may not fit the data well.
- 10:57So this is an example for this commonly used
- 11:01differential expression analysis on RNA sequencing data.
- 11:05So for this task,
- 11:07the two popular softwares are DESeq2 and edgeR.
- 11:10So the data usually looks like this.
- 11:12So we want to compare two conditions
- 11:15and seeing which genes are differentially expressed.
- 11:19So condition one, we have three samples,
- 11:22which we cause to replicate,
- 11:23condition two, three replicates.
- 11:25So every row is one replicate,
- 11:29while every column is one gene.
- 11:31So to call a gene as differentially expressed,
- 11:34we need to compare its three values
- 11:36from condition one, two, three values from condition two.
- 11:39So clearly, we can see the left one may be a D gene,
- 11:43the right one may not be a D gene, right?
- 11:45That's our intuition.
- 11:46And we want to make this more formal
- 11:49by doing a statistical test.
- 11:52But in both edgeR and DESeq2,
- 11:55you can see that to compensate the small sample size,
- 11:59like three versus three,
- 12:00they assume a gene follows a negative binomial distribution
- 12:05under each condition.
- 12:07So essentially, these three values are assumed
- 12:09to follow one negative binomial distribution.
- 12:12These three values
- 12:13follow another negative binomial distribution.
- 12:16And the null hypothesis
- 12:18is the two negative binomial distributions
- 12:21have the same mean, that's the problem.
- 12:24Okay, so we actually discovered an issue
- 12:27with popular methods from this data set.
- 12:30And thanks to my collaborator Dr. Wei Li
- 12:32who is a computation of biologist at UC Irvine.
- 12:36So actually, from this patient data,
- 12:39we have a much larger sample size, 51 patients
- 12:43before the treatment of some immunotherapy medicine,
- 12:4758 patients on treatment.
- 12:50So we want to compare the RNA sequencing data
- 12:53of these two groups of patients.
- 12:55So essentially, when we apply DESeq2 or edgeR to this data,
- 13:01the red dots indicate the number of D genes identified.
- 13:06To verify whether we can still identify D genes
- 13:12from permuted data,
- 13:13because the reason is that we want to see
- 13:15whether the permuted data is actually really,
- 13:19because we know the permuted data
- 13:20shouldn't give us any signals.
- 13:22If we just disrupt the two groups,
- 13:24we shouldn't expect any D genes.
- 13:26But surprisingly, we found that each method
- 13:29can identify sometimes even more D genes from permuted data.
- 13:34So the bar and the error bars show the distribution
- 13:37of D genes identified from permuted data.
- 13:40So this is something quite unexpected.
- 13:44And to look into the reason, our first thought
- 13:47is to check the negative binomial assumption.
- 13:50Because now, under each group,
- 13:52we have 51 and 58 sample sizes,
- 13:55so we could check the distribution, and here's what we get.
- 13:59You see that for the genes that are frequently identified
- 14:03from permuted data, if we run the goodness-of-fit test,
- 14:07we check the negative binomial distribution,
- 14:10these genes have very small P values,
- 14:12indicating that this fit is not good.
- 14:15Well, if you look at the genes
- 14:16that are rarely identified from permuted data,
- 14:20the P values are bigger and the goodness-of-fit is better.
- 14:23So we do see this relationship
- 14:25between the goodness-of-fit of negative binomial
- 14:29and the frequency that a gene is identified
- 14:32from permuted data.
- 14:33So negative binomial model seems to not fit well
- 14:36on this patient data.
- 14:39Because here, the 51 patients shouldn't be regarded
- 14:42as replicates, they're not experimental replicates,
- 14:45they are individuals.
- 14:46So therefore, the theory for deriving negative binomials
- 14:50usually assume as a Gamma-Poisson Mixture model,
- 14:52Gamma-Poisson Hierarchical model.
- 14:54That one may no longer hold,
- 14:57and that's why we think the parametric model
- 15:00is not applicable to this patient data.
- 15:04So what's the consequence, right?
- 15:06So we want to convince the scientist
- 15:08what's the consequence of doing this analysis
- 15:11in this problematic way.
- 15:12We show that if we just use the D genes
- 15:15found by DESeq2 and edgeR,
- 15:17which are the genes corresponding to the red dot,
- 15:20around the so called gene oncology analysis,
- 15:23that is to check which functional terms
- 15:26are enriched in those two gene sets,
- 15:29we can see many functional terms
- 15:31are related to immune functions.
- 15:34Which would suggest that if we trust
- 15:36these two methods' results, we may conclude that,
- 15:39yes, between the two groups of patients,
- 15:41there are differences in immune responses, right?
- 15:44That seems to confirm our scientific hypothesis.
- 15:48However, now, we see many of these genes
- 15:51were also identified from permuted data,
- 15:54then, that will make the results dubious.
- 15:57So what we tried is that, even the sample size is so large,
- 16:01we tried the classical Wilcoxon rank sign test,
- 16:04which everybody learned, right?
- 16:06So non parametric two sample test
- 16:08that doesn't assume a parametric distribution.
- 16:11And here, it's self consistent,
- 16:13it doesn't identify D genes from real data,
- 16:17but also, it doesn't identify D genes from permuted data.
- 16:20So there's no contradiction here.
- 16:23And this result motivated me to ask this question,
- 16:26which I had years ago,
- 16:29should we always use popular bioinformatics tools?
- 16:33Like, check the citation of these two methods,
- 16:35super highly cited.
- 16:37Should I reuse popular method
- 16:39or should we consider general statistical methods,
- 16:43like Wilcoxon.
- 16:45So our recommendation is sample size matters, right?
- 16:50We may have different methods
- 16:52suitable for different sample sizes,
- 16:55and essentially, why statistics has so many methods,
- 16:58paramedic, non parametric,
- 17:00is because we have different scenarios in our data.
- 17:03That's the first thing we should realize.
- 17:05It's not like one method can do all the things.
- 17:08And the second thing is sanity check.
- 17:10We should always consider doing some sanity check
- 17:13to make sure we trust the results
- 17:15instead of just take the results for granted.
- 17:17So these things were summarized in our paper
- 17:20published earlier this year.
- 17:23And since its publication,
- 17:25we have received a lot of discussions on Twitter,
- 17:28if you are interested.
- 17:29But anyway, so it means that many people are interested
- 17:32in this topic, especially many people, users believe
- 17:36that popular bioinformatics tools are the state-of-the-art,
- 17:39right, the way, standard methods (indistinct).
- 17:42But if you are bio statisticians, you may not like this.
- 17:45Because we want to develop new methods.
- 17:48Otherwise, what's our job, right?
- 17:50So in this case, we need to really find the loopholes,
- 17:53or the limitations, or the gap between current approach
- 17:57and the data scenarios,
- 17:58and try convinces people that, yes,
- 18:01we do need careful thoughts when we choose method.
- 18:04It's not always one method.
- 18:06And a related question is,
- 18:08in Wilcoxon, definitely doesn't have a strong assumption,
- 18:13and (indistinct) have a reasonable power
- 18:15when the sample size is large.
- 18:17But what if sample sizes are small, right?
- 18:20So when it's small, we know,
- 18:21non parametric tests like Wilcoxon doesn't have power.
- 18:25So in this case, we actually proposed Clipper again,
- 18:30so it can work as a downstream correction tool
- 18:34for DESeq2 and edgeR.
- 18:36Because they are supposed to be quite powerful,
- 18:39even though they find probably too many.
- 18:41So hopefully, we could use that to borrow their power,
- 18:44but help them improve the FDR control.
- 18:47So I'll show the results later in my talk.
- 18:50That's the second cause.
- 18:52And the third cause for ill-posed P values
- 18:54is a little more complicated.
- 18:56And this is the issue commonly observed in single cell data,
- 19:00single cell RNA-seq data.
- 19:01So I will use this analysis
- 19:03called pseudotime differentially expressed genes as example.
- 19:08What is a pseudotime?
- 19:10Pseudotime means it's not real time, it's pseudo, right?
- 19:13So it's something we inferred
- 19:16from single cell RNA-seq data,
- 19:18so those cells are measured all at once.
- 19:20But we want to infer some time trajectory from the cells.
- 19:26So I'll just use the screenshot from Slingshot,
- 19:29which is a method for inferring pseudotime for explanation.
- 19:34So here, this is a two-dimensional PCA plot of cells,
- 19:39and the cells are pre-clustered,
- 19:41so each color represents one cluster.
- 19:44So the Slingshot algorithm does the following,
- 19:47first, it takes the cluster means' centers,
- 19:51and connect them using the algorithm
- 19:53called minimum spanning tree.
- 19:54So if you're not familiar with that,
- 19:56it has an equivalence with hierarchical clustering actually.
- 19:59So with the minimum spanning tree, you get this tree,
- 20:02and then, they smooth out the tree using principle curves.
- 20:07So we have two curves,
- 20:08and then for every cell,
- 20:10we find the closest curve and project the cell to the curve.
- 20:13So therefore, in each curve,
- 20:15the projections are called pseudotime values.
- 20:18And usually, it's normalized between zero and one,
- 20:21so we need to find the root and call it zero,
- 20:24the other end is called one.
- 20:25So this whole process is called pseudotime inference.
- 20:28In other words, after it, we will give every cell
- 20:32a pseudotime value in each trajectory.
- 20:35Okay, so one thing I want to emphasize
- 20:38is that in this pseudotime inference
- 20:40we used gene expression values already.
- 20:43So it's not like we observe pseudotime as external variable,
- 20:47but it's from the same data.
- 20:49So I want to show what we could do after the pseudotime.
- 20:53So a typical analysis is to identify
- 20:56which genes are differentially expressed
- 20:58along the pseudotime.
- 20:59Like the left one, we see, it has this upward trajectory,
- 21:03so we may call it differentially expressed.
- 21:06And here, we want to say the pseudotime
- 21:09represent some cell immune response,
- 21:12and this is an immuno-related gene,
- 21:14so we expect to see the upward trajectory.
- 21:17For the right gene, we expect to see something constant,
- 21:20so we don't want to come right (indistinct) a D gene,
- 21:23that's the intuition.
- 21:25And I want to say that we must realize,
- 21:28pseudotime values are random
- 21:31simply because the cells is a random sample, right?
- 21:35We need to consider randomness,
- 21:37and we want to show this to people by doing subsampling.
- 21:41So you can see that sampling variation
- 21:43would get into pseudotime values.
- 21:46Here, every row is a cell.
- 21:48If I randomly subsample,
- 21:50say, 80% of cells from the left cells
- 21:53and redo the pseudotime trajectory inference,
- 21:57we can see that for the cells in the subsamples
- 22:01that include it, its values will vary to some degree.
- 22:04So it's not a constant.
- 22:07Okay, so realizing this, we should consider
- 22:10the randomness of pseudotime from the data.
- 22:13However, existing methods all treat pseudotime
- 22:16as an observed covariate.
- 22:18So our goal here is to fix this,
- 22:22and we proposed this method called PseudotimeDE,
- 22:25which actually does the inference,
- 22:27which infers whether one gene
- 22:29is differentially expressed along pseudotime,
- 22:32and by considering pseudotime inference uncertainty.
- 22:36So what we did exactly is that, here,
- 22:41to see whether a gene changes with pseudotime,
- 22:44what's the intuition?
- 22:45We should do regression, right, do a regression analysis
- 22:48by treating a gene's expression value as Y,
- 22:53pseudotime as X, and regular regression.
- 22:55Yeah, this is exactly what existing methods did.
- 22:58And to make sure the regression
- 22:59is not restricted to be linear,
- 23:02and also account for that the gene expression values
- 23:05are non negative counts.
- 23:07So actually, we choose the generalized additive model,
- 23:12which is also used in an existing method,
- 23:14which I will show very soon.
- 23:16So this is a very flexible and interpretable model.
- 23:20So generalized means Y can be non Gaussian
- 23:24and the other distribution,
- 23:25just like generalized linear model.
- 23:28But additive means
- 23:29that we make the linear model more general,
- 23:32so every feature can be non linearly transformed,
- 23:38but the features after transformations are still added.
- 23:42So that's additive, short as GAM.
- 23:44So essentially, once we have a set of cells,
- 23:47we first infer the pseudotime,
- 23:49so we order the cells along the pseudotime,
- 23:52and for gene J,
- 23:53we check how the gene changes with pseudotime,
- 23:57so we run the generalized additive model
- 24:00to obtain a test statistic.
- 24:02Please know that generalized additive model has its theory,
- 24:05so we could use the theory to calculate
- 24:08to use the null distribution and calculate P value.
- 24:12And that was done in an existing method.
- 24:15We want say that this may be problematic
- 24:18because this whole null distribution
- 24:20considers pseudotime to be fixed.
- 24:22So to address this,
- 24:24we need to consider pseudotime inference
- 24:26as part of our test statistic calculation.
- 24:30So to do this, we use the top part.
- 24:33We actually do subsapling of the cells.
- 24:36The reason we didn't do bootstrap
- 24:38is simply because we want the method to be flexible
- 24:41for pseudotime inference method.
- 24:43Like I show here, there are Slingshot, Monocle3,
- 24:47and a few others.
- 24:48We want it to be flexible,
- 24:50and some methods don't allow cells to be repetitive,
- 24:54so bootstrap doesn't apply here.
- 24:56And we use subsampling with percentage pretty high,
- 24:59like 80%, 90%, and we did a robustness analysis.
- 25:03And then, on each subsample, we do pseudotime inference.
- 25:08With this, how do we get a null distribution
- 25:11of the test statistic?
- 25:12What we did is to permute the cells,
- 25:15so any relationship between the gene J
- 25:18and the pseudotime is disrupted.
- 25:20So this can be considered from the null,
- 25:22and then, we did the same GAM model,
- 25:25and then, we calculate the values of the test statistic
- 25:30on these permuted subsamples,
- 25:32that gave us a null distribution.
- 25:33So together, we can get a P value, this is what we did.
- 25:37And we can show that this approach
- 25:38indeed can control the P values,
- 25:41make the P values uniformly distributed on the null,
- 25:45while the existing method that uses GAM,
- 25:47but only the theoretical distribution called tradeSeq,
- 25:50they have some distortion for P values.
- 25:53And then, you may wonder, what's the consequence?
- 25:56We can show that, oh, and I should say,
- 25:58Monocle3 uses generalized linear model and not uncertainty.
- 26:03So you can see that even though it's not as bad as tradeSeq,
- 26:07still, some distortion.
- 26:09So we wanna show
- 26:10that by calibrating the P value using our way
- 26:13we can actually discover more functional terms
- 26:17in our differentially expressed genes.
- 26:19It means that we can find some new biological functions
- 26:22that were missed by this new method.
- 26:24Which shows that FDR control not just help with FDR control
- 26:28of P value calibration,
- 26:29not just help with FDR control,
- 26:31but may also boost some power.
- 26:34So I just quickly talk about this PseudotimeDE,
- 26:37but I want to say that its computational time
- 26:40is the biggest limitation.
- 26:42Because here, our P value calculation requires many rounds
- 26:46of subsampling, pseudotime inference, and permutation.
- 26:50So let's say we want the P value with resolution 0.001,
- 26:55we need at least 1000 rounds of such things, right?
- 26:58That will take time.
- 27:00So the natural question
- 27:01is can we reduce the number of rounds, right,
- 27:04and still achieve FDR control?
- 27:06That becomes similar to my first goal.
- 27:08Can we get rid of the higher resolution P values,
- 27:11control the FDR, and then, we will use Clipper again.
- 27:14So you can see,
- 27:15Clipper is used throughout all the motivations,
- 27:18that's why we proposed it,
- 27:20and I'll talk about it in the next minute.
- 27:22And the second question we didn't address
- 27:24is that what if the cells don't follow a trajectory at all?
- 27:29So clearly in our null hypothesis,
- 27:32we are assuming there is a trajectory,
- 27:34it's just that gene J doesn't change with the trajectory.
- 27:38But what if the trajectory doesn't exist?
- 27:40So this whole idea of this trajectory pseudotime inference
- 27:45doesn't make sense, right?
- 27:46We need to consider that.
- 27:47But I don't think we have a good way to do it,
- 27:50unless we can change the cells to have a null
- 27:54where the cells don't follow a trajectory.
- 27:56So this motivated us to generate cells
- 27:59that don't follow a trajectory, and we used a simulator.
- 28:02So which it will be the last part I will talk about today.
- 28:06Okay, PseudotimeDE is one such a problem
- 28:09where pseudotime is inferred from the same data.
- 28:12Another common problem is to do clustering on single cells
- 28:17to identify cell clusters,
- 28:19and between cell clusters,
- 28:21we identify differentially expressed genes.
- 28:23We call this problem ClusterDE.
- 28:26But this is also using the data twice, right?
- 28:29So people have called this term double dipping,
- 28:32meaning that the same data used for twice.
- 28:36To tackle this problem,
- 28:38we need to consider the uncertainty in cell clustering,
- 28:41and there are three existing papers
- 28:44that try to address this problem
- 28:46that they either need to assume a distribution,
- 28:48like genes follow Gaussian distribution in every cluster
- 28:53or every gene follows a Poisson distribution here
- 28:57and they need to do count splitting.
- 28:59So I won't talk into the couple of details here,
- 29:02but I just want to say
- 29:02that the count splitting approach in my opinion
- 29:05tackles a different problem.
- 29:07It is conditional on the observed data matrix,
- 29:11rather than considered to be random.
- 29:13But I will not talk about the detail here.
- 29:15So motivated by the challenge in this problem,
- 29:17and we want to propose something not distribution-specific.
- 29:22We want to use our simulator to generate the null data
- 29:28and then use Clipper to achieve the FDR control.
- 29:31So we want to do this non parametrically.
- 29:34So I think the idea was motivated
- 29:36by two phenomenal statistical papers.
- 29:39One is the gap statistic paper,
- 29:42which was proposed to find the number of clusters
- 29:45in the clustering problem.
- 29:47And if you read a paper, I think the smart idea there
- 29:49is they try to generate data points without clusters
- 29:54as the negative control.
- 29:56Then, you can control your number of clusters
- 29:59with some statistic,
- 30:01versus what if there's no clusters, right,
- 30:03and do the comparison and find the gap.
- 30:05That's the gap statistic.
- 30:06And knockoffs gave the theoretical foundation
- 30:09for FDR control without using high resolution P values.
- 30:13Okay, so the halftime summary
- 30:16is that I talked about three common causes
- 30:18of ill-posed P values.
- 30:19Hopefully, I have convinced you
- 30:21that we need something to avoid this problem.
- 30:25So I talked about Clipper,
- 30:26the p-value-free FDR control for genomic feature screening.
- 30:30And as I said, it was motivated and enabled
- 30:33by the FDR control procedure from this paper.
- 30:36But the difference here is that we focus
- 30:39on marginal screening of interesting features.
- 30:42So in other words, we look at one feature at a time.
- 30:45In my previous examples,
- 30:47a feature could be a region or a gene.
- 30:51So in the original knockoff paper,
- 30:53their goal is to generate knockoff data
- 30:57just like fake data for multiple features jointly.
- 31:01And that's the very challenging part.
- 31:03But in our case, we don't need that
- 31:05because we are looking at one feature at a time,
- 31:07so it's not a multi-varied problem,
- 31:10but it's a marginal screening problem.
- 31:12So our goal is to get rid of high resolution P values.
- 31:15So the advantage of this
- 31:17is we don't need parametric distribution assumptions,
- 31:20or we don't need large sample sizes
- 31:22to enable non parametric tests, these are not needed.
- 31:26We just need to summarize every feature
- 31:29into a contrast score,
- 31:31and then, set a cutoff on the contrast scores.
- 31:35So what do I mean by contrast score?
- 31:37So every feature, say, I have total d features,
- 31:40they have C, D, sorry, d contrast scores
- 31:43shown as C1 to Cd,
- 31:45so I'm calling the histogram
- 31:47of the distribution of contrast scores.
- 31:50So if the theoretical assumption is satisfied,
- 31:54then the features that are null features
- 31:57should follow a symmetrical distribution
- 32:00around the zero, okay?
- 32:02And for the features that are interesting
- 32:04and should be discovered,
- 32:06should be large and positive on the right tail.
- 32:09So the theory of the FDR control just says,
- 32:12we can find the contrast score cutoff as t,
- 32:16such that this ratio is controlled under q.
- 32:20We ought to find the minimum t for this.
- 32:23What this means is can you can consider this ratio
- 32:26as a rough estimator of FDR.
- 32:30So the denominator is just the left tail,
- 32:33the red part plus one,
- 32:36sorry, the numerator is the right tail plus one,
- 32:39the denominator is the, sorry, the left tail is, sorry,
- 32:43the numerator is the left tail plus one,
- 32:45the denominator is the right tail with maximum with one.
- 32:49So in other words, still trying to avoid dividing zero.
- 32:52And the idea is that we want to find a threshold t,
- 32:56so that the right tail will be called discoveries
- 32:59and the left tail represent false discoveries.
- 33:03That's the intuition.
- 33:05Because we know, if the feature's null,
- 33:08then it will be randomly positive or negative.
- 33:11And the sign is independent of the absolute value.
- 33:15So that just replaces
- 33:18the uniform distribution requirement for P values,
- 33:22we change that to symmetry.
- 33:24And another thing is that the feature,
- 33:26if it's large positive, we want to discover it, right?
- 33:29So this will be the discovery set
- 33:31and this represents the negative, false discovery set.
- 33:36So that's the idea intuition behind this approach.
- 33:41But the theory to really prove it,
- 33:42we need to use Martingale in probability to prove it.
- 33:46And some of the technique was used
- 33:47for the Benjamini Hochburg procedure
- 33:49still based on Martingale.
- 33:50So anyway, this allows us to really control the FDR
- 33:54just using contrast scores.
- 33:56And another thing I found as appealing
- 33:58is that if you visually inspect the contract scores,
- 34:01you can see whether the assumption seems to be reasonable
- 34:05because you expect to see something symmetrical
- 34:08plus a heavy right tail.
- 34:10Okay, so we are currently writing to make this more formal,
- 34:14so we could actually check
- 34:15whether the assumption is reasonably holding.
- 34:18So with this approach,
- 34:19we can make a lot of the comparison analysis easier
- 34:26because the key is to find a reasonable contrast score
- 34:30that satisfies this assumption.
- 34:32And I can say that there may be multiple contrast scores
- 34:35that satisfy, not just the unique one.
- 34:37Then the difference is power, right?
- 34:40So we may have a better power
- 34:41if you have a heavier right tail.
- 34:44Okay, so for a ChIP-seq peak calling analysis,
- 34:47we can say that the contrast score
- 34:49will be comparing the target data
- 34:52from experimental condition to the null data,
- 34:55which is the background condition.
- 34:56They serve a natural pair of contrast,
- 34:59and we could apply any pipeline to each data,
- 35:03the same pipeline and then do the contrast, right?
- 35:06You can imagine, if there's no peak,
- 35:08then these two values will be,
- 35:10which one is bigger is equally likely.
- 35:13And for the RNA-seq analysis,
- 35:16here, I showed we could use permuted data as the null data
- 35:20actual data as a target data.
- 35:22So if we run some test on actual data
- 35:25to get a test statistic,
- 35:27we use the same test on permuted data
- 35:29to get a test statistic, and they serve as a contrast.
- 35:32And finally, for the PseudotimeDE and ClusterDE,
- 35:35the single cell problem,
- 35:37actual data will give us some comparison,
- 35:40either PseudotimeDE
- 35:42or the between ClusterDE test statistic.
- 35:46And if we have some similar data
- 35:48that represents the null,
- 35:49like null trajectory, null cluster,
- 35:52we could run the same pipeline and then do the contrast.
- 35:55So you see, this actually free us
- 35:57from saying we need to derive P values
- 36:00and we need to know the distribution
- 36:02by either theory or by numerical simulation, right?
- 36:05These are all relieved
- 36:06because we just need to do a contrast.
- 36:08And the power is gained from the many, many tests,
- 36:12we look at them together.
- 36:13So that's why
- 36:14this idea (background noise drowns out speaker).
- 36:16Okay, so as I said, we tried to implement Clipper
- 36:20as a way to improve FDR control,
- 36:22and we did achieve this
- 36:24for the popular software Macs and Homer
- 36:27for ChIP-seq peak calling
- 36:29and DESeq2 to edgeR for RNA-seq DEG identification.
- 36:33So you see that they did have inflated FDR,
- 36:37so the Y axis is the actual FDR,
- 36:39X axis is the target FDR threshold.
- 36:42There are inflations,
- 36:44but with our Clipper as an add-on
- 36:46to be used downstream of what they output
- 36:49and do the contrast,
- 36:50we can largely reduce the FDR to the target
- 36:54and still maintain quite good power.
- 36:56So that's the usage of Clipper as and add-on.
- 37:00And for the single cell part,
- 37:02I didn't finish about the null data generation.
- 37:05How do we do it?
- 37:06Our simulator was proposed partly for this reason,
- 37:10but it has more uses.
- 37:12So I just want to say that it's called scDesign3
- 37:15because we have scDesign and scDesign2 as two previous work.
- 37:19Now, focus on scDesign2
- 37:20because it is the direct predecessor of scDesign3.
- 37:24So what scDesign2 two does
- 37:26is it tries to fit a multi-gene probabilistic model
- 37:30for each cell type,
- 37:32and then, every gene assumes to follow
- 37:35a parametric distribution within the cell type.
- 37:39And the major contribution
- 37:41is that we capture gene-gene correlations
- 37:44using Gaussian copula.
- 37:45That will make the data more realistic.
- 37:47Here is the comparison.
- 37:49This is the real data used for fitting the model.
- 37:52This is the lab (indistinct) test data used for validation,
- 37:55and this is the synthetic cells using copula.
- 37:59If we remove the copula, the cells will look like this.
- 38:02So not realistic at all.
- 38:04And our data is more realistic than other simulators
- 38:08that did not explicitly capture gene-gene correlation.
- 38:12Although, they have some implicit mechanism,
- 38:14but the model is different.
- 38:17Okay, so we realize that scDesign2 is doing a good job
- 38:21for displaying cell types,
- 38:23but it cannot generate data like this
- 38:25from a continuous trajectory.
- 38:27What we could do is to force the cells to be divided
- 38:31and then use scDesign2.
- 38:32But then, you can see the cells are kind of in clusters,
- 38:35right, not in real data.
- 38:37But with our generalization to the version three,
- 38:41we now can generate cells from a continuous trajectory.
- 38:45And I can quickly say that we basically generalize this,
- 38:48this count distribution per cell type
- 38:51to a generalized additive model, which I already said.
- 38:55So we could make it more flexible in general,
- 38:57and scDesign2 becomes a special case of scDesign3.
- 39:02And one more thing we could do
- 39:03is we actually use the technique vine copula,
- 39:06so we could get the likelihood of how the model fits
- 39:11to the real data, so we can get the likelihood of the model,
- 39:15which can also give us more information.
- 39:18So besides the single cell trajectory data,
- 39:21we can also use the idea to generate spatial data.
- 39:25So here, the modification is that for every gene
- 39:28we assume a Gaussian process in the 2D space,
- 39:32so it can have a smooth function
- 39:34for (indistinct) expression (indistinct).
- 39:36And also, my other student help with making the simulator
- 39:40to generate reads, sequencing reads, not just counts.
- 39:44So we can go from counts to reads,
- 39:46and this will give us more functionality
- 39:48to benchmark some low level tools.
- 39:51So in short,
- 39:52the scDesign3 simulator has two functionalities.
- 39:56One is to do, of course, simulation.
- 39:59We can generate single cell data from cell types,
- 40:02discrete, continuous trajectories,
- 40:05or even in the spatial domain.
- 40:07We could generate feature modalities
- 40:09we call multi-omics, including RNA-seq,
- 40:12ATAC-seq, which is a technology
- 40:14for open chromatin measurement,
- 40:16CITE-seq, which includes both protein and RNA,
- 40:19and also DNA methylation.
- 40:21These are the examples we tried, but we could do even more.
- 40:24We could allow it to generate data with experimental designs
- 40:28including sample covariate, conditions, or even batches.
- 40:33So these can make us generate cases
- 40:36for more types of benchmarking.
- 40:39And for interpreting real data,
- 40:41scDesign3 can give us model parameters,
- 40:45so we can know whether a gene has different means
- 40:47in two cell types,
- 40:49whether a gene has a certain change on a pseudotime,
- 40:52or a gene has a certain change in two dimensional space.
- 40:55And also, as I said,
- 40:56we can output a likelihood that can give us a way
- 40:59to calculate the basic information criterion BIC,
- 41:03so we could evaluate
- 41:04whether some pseudotime describes data well,
- 41:07whether the algorithm for pseudotime inference
- 41:09does a good job,
- 41:11or whether the clusters explain data well.
- 41:13So these are the things we could do.
- 41:15And finally, to generate the null data
- 41:18for the Clipper (indistinct),
- 41:20we can alter the model parameters.
- 41:22Like this is what we fit from real data,
- 41:25we could change the model parameters
- 41:27to make the gene no longer differentially expressed,
- 41:30have the same mean in two subtypes.
- 41:33Or, after we fit a real data
- 41:35with two cell types or two clusters,
- 41:37we could change the cluster parameter
- 41:40to make sure the cells come from one cluster
- 41:42instead of two clusters.
- 41:44So these are the things we could do with the model.
- 41:47And so this is how our paper,
- 41:49but more details are in our paper, which has been posted,
- 41:52if you are interested.
- 41:54And I want to just quickly show
- 41:56how the ClusterDE analysis could be done.
- 41:59This is the real data with two clusters.
- 42:01I want to say that this is the case
- 42:03where permutation wouldn't work.
- 42:05If you just permute the cluster labels,
- 42:07the cells will look like the same cells, right?
- 42:10They're still two clusters.
- 42:11But if you use our simulator,
- 42:13we could generate cells from one cluster
- 42:15that reflects the complete null, there's no cluster.
- 42:19And the use of this can be shown in this example.
- 42:23There's only one cluster,
- 42:24but if we use clustering algorithms,
- 42:27like these two choices, Seurat is a popular pipeline,
- 42:31Kmeans is the standard classical algorithm,
- 42:34using either to force the cells into two clusters,
- 42:38we are using gene expression data.
- 42:40So no wonder that if you look at a gene's expression
- 42:44between the two clusters, you may call it DE,
- 42:46but that's not interesting, since there's no clusters.
- 42:50So if we use our scDesign3 to generate null data,
- 42:54in this case, null data should be very similar to real data.
- 42:58It still has only one cluster.
- 43:00Then, if we run Seurat or Kmeans,
- 43:04similarly, on null data,
- 43:06we would divide the cell in a similar way,
- 43:09and then, if you do a contrast of the two sets of results,
- 43:12you should see no big difference.
- 43:14That's the idea for controlling FDR.
- 43:16So indeed, in that example, if we're just naively wrong,
- 43:21the Seurat pipeline clustering followed by some tests
- 43:25like t, Wilcoxon, bimodal,
- 43:28yeah, you will see FDR is one.
- 43:30The reason is you keep finding D genes,
- 43:33even though there's no cluster.
- 43:34But using our approach,
- 43:35we could control the FDR reasonably well.
- 43:38So that's the predominant results for this purpose
- 43:42for this task, so that summarizes my talk today.
- 43:46And finally, I just want to make a few notes
- 43:48to give some messages.
- 43:50I talk about multiple testing,
- 43:52but in many scientific problems,
- 43:54I think the key is whether it should be formulated
- 43:57as a multiple testing problem.
- 43:59So actually, to address this question,
- 44:01I wrote a prospective article
- 44:03with my collaborator Xin Tong at USC.
- 44:06We try to clarify statistical hypothesis testing
- 44:10from machine learning binary classification.
- 44:13They seem similar
- 44:14because both would give you a binary decision, right?
- 44:17But I can say that testing is an inference problem,
- 44:20classification is a prediction problem.
- 44:23So if you really think about it,
- 44:25their fundamental concepts are different.
- 44:27So that's why we wrote this to really talk with biologists,
- 44:31for computational people who use this simultaneously.
- 44:35So if you're interested, you can check it out.
- 44:37And finally, I wanna say that,
- 44:40so if it's a multiple testing problem,
- 44:43I talked about three common causes of ill-posed P values,
- 44:48and I propose a solution, Clipper,
- 44:50for simplifying this problem by just using contrast scores,
- 44:55and then, set a cutoff.
- 44:56And the simulator, which we hope to be useful
- 44:59for the single cell and spatial omics field
- 45:01because this field is so popular,
- 45:03we have more than 1000 methods already.
- 45:05So benchmarking seems to be something quite necessary.
- 45:08Because if there's no benchmarking,
- 45:11then maybe new methods wouldn't have much of a chance
- 45:14because people may still use the older method
- 45:16that are better cited.
- 45:18Okay, so these are the papers related to my talk today.
- 45:23And so, finally, I want to say that,
- 45:26so if you're interested, you want to check them out,
- 45:29and let me know if you have any questions.
- 45:31So finally, I'll just say this,
- 45:33this is something quite interesting.
- 45:34It's another paper we just recently wrote,
- 45:37and I can say,
- 45:38you should be online in genome biology very soon.
- 45:40So we actually did this benchmark
- 45:43for the so called QTL analysis in genetics, right?
- 45:47Quantitative Trait Locus mapping.
- 45:50So in this analysis,
- 45:51a common procedure is to infer hidden variables
- 45:55from the data, like genes expression matrix,
- 45:57want to do hidden variable improvements.
- 46:00Besides the most part, (indistinct) has the classical PCA,
- 46:03several methods propose specific (indistinct).
- 46:07And my student Heather, actually gave her the full credit,
- 46:10she was so careful and she really wanted to understand
- 46:13the method before using it,
- 46:14then that lead to this project.
- 46:17She wants to see, huh, do I really see advantages
- 46:19of this new method even compared to PCA?
- 46:22But that's what she found, right?
- 46:24PCA still seems to be the most stable,
- 46:26robust, and also faster algorithm,
- 46:30but this is one of the reviewer's comments
- 46:32I wanna share with you.
- 46:34These results may come as a surprise to some,
- 46:37given the nearly un-contestable status
- 46:39that method A has achieved within the community.
- 46:42But sadly, they reflect the fact
- 46:44that computational biology methods can rise to fame
- 46:47almost by accident rather than by sound statistic arguments.
- 46:50So if you're interest,
- 46:52you can check out this paper, it's on bio archive.
- 46:54But anyway, I think it says how important it is
- 46:57for statisticians to convey our message, right?
- 47:00Why do we need statistical rigor, why does it matter?
- 47:04So for our students,
- 47:05if you want to know more about GAM and copulas,
- 47:08there are two books I want to recommend.
- 47:10So they're very good introductory textbooks,
- 47:12so you can know the (indistinct).
- 47:14Finally, I want to thank my collaborator at UC Irvine,
- 47:19my students for all their tremendous work I talk about today
- 47:23and also the funding agencies for giving us the support.
- 47:26So thank you very much.
- 47:36<v Attendee>A question?</v>
- 47:38<v Jingyi>Yes.</v>
- 47:39<v Attendee>So I was really curious</v>
- 47:40about the analysis of like the large patient sample.
- 47:45I know that there has in fact
- 47:46been extensive discussion on it.
- 47:48<v ->Yeah, yeah.</v> <v ->Which is</v>
- 47:52interesting, to say the least, how it's gone down.
- 47:55But I was kinda curious,
- 47:56the way that it was presented here made me think about like,
- 48:01apologies, if this is like a path that's already been tread,
- 48:07so, yeah, the bar graph.
- 48:11<v Jingyi>Yeah.</v>
- 48:12<v Attendee>Yeah, so it sort of,</v>
- 48:16it makes me wonder about the application
- 48:19of the term false discovery in different contexts.
- 48:23And taking patients, you can imagine,
- 48:26there can be like unintended structure
- 48:29within those populations.
- 48:32And by (interference drowns out speaker) chance,
- 48:34if there is 30,000 potential transcripts
- 48:38that you're looking at, there might actually be,
- 48:41between individuals who are not isogenic,
- 48:44truly differentially expressed genes
- 48:47between even permuted groups.
- 48:50And so I'm wondering if there's a useful distinction
- 48:53between a false discovery and a true,
- 48:56but uninteresting discovery.
- 49:00<v Jingyi>I think it depends on how you define truth.</v>
- 49:04I think that's the key.
- 49:05But what is the definition of D genes?
- 49:08I wanna say, to be exact,
- 49:10the definition of D genes in DESeq2,
- 49:14edgeR, and that of Wilcoxon is different.
- 49:18Because in Wilcoxon, the D gene is defined,
- 49:21okay, if a gene, it has two distributions,
- 49:24one under each condition,
- 49:27and if I randomly take one observation
- 49:29from each distribution from each condition,
- 49:32is the chance that one is bigger than the other
- 49:35equal to 0.5?
- 49:36That's the Wilcoxon question.
- 49:38While DESeq2 and edgeR, their D gene definition
- 49:42is the negative binomial means are different.
- 49:45But clearly, you can see, it only depends
- 49:48on that negative binomial is a reasonable distribution,
- 49:51that's the key.
- 49:52So that's why in theory,
- 49:53if negative binomial is no longer valid or reasonable,
- 49:58then why should we define a D gene
- 50:00based on negative binomial mean indifference?
- 50:03I think that's kind of my answer to your question.
- 50:06But the tricky thing about statistical inference
- 50:09compared to supervised learning
- 50:11is that we don't observe the truth, that's always the case.
- 50:14So we're making a guess.
- 50:16Frequentist people have one way to guess,
- 50:20Poisson people have another way of guess.
- 50:21And so one issue I've seen in the Twitter discussion
- 50:24is that several people try to,
- 50:26maybe not intentionally,
- 50:28confuse frequentist concept with Poisson concept,
- 50:32but they're not really comparable, right?
- 50:34You cannot talk about them in the same ground.
- 50:36That's a problem, and here, our criterion,
- 50:39false discovery rate is a frequentist criteria,
- 50:42it relies on P values, right?
- 50:44So therefore, you cannot use Poisson arguments
- 50:46to argue against such frequentist way.
- 50:49Because you are doing frequentist, right?
- 50:52But whether frequentist makes sense or not,
- 50:54that's a different topic.
- 50:55Hopefully, that answers your question.
- 50:57<v ->Yeah, thank you.</v> <v ->Thank you.</v>
- 51:01Yes. <v ->Hello,</v>
- 51:02thank you much for your talk,
- 51:03and I think that is very interesting.
- 51:05However, I have a question on slide 26 actually..
- 51:13It's about what you said that,
- 51:17maybe 26.
- 51:20<v Jingyi>26, okay, yeah.</v>
- 51:23<v Attendee>Yeah, you said</v>
- 51:24that like it is a multi-gene probabilistic model
- 51:28for cell type.
- 51:30However, I'm a little bit confused
- 51:33about how you define the cell type.
- 51:37But basically, from my own understandings,
- 51:40that after you get, for example, the single cell rise data,
- 51:45for example, you will use the route to get the cluster.
- 51:48<v Jingyi>Yeah.</v>
- 51:49<v Attendee>And you will annotate this cluster</v>
- 51:52based on the-
- 51:54<v ->Knowledge, yeah.</v> <v ->Gene.</v>
- 51:56And then, if this model based on your
- 52:04annotation of, okay.
- 52:08<v Jingyi>Yeah, I see you point.</v>
- 52:10Essentially, yeah, we need cell cluster to be pre-defined.
- 52:13So if it's not reasonable, then, yes,
- 52:15it will affect the results for sure.
- 52:17Because the key is that you need to make sure
- 52:20it is reasonable to assume a gene follows
- 52:23one of the four distribution within a cluster, right?
- 52:26So that's why there are methods
- 52:28that try to refine clustering
- 52:30by checking the negative binomial distribution.
- 52:34So there are several research on that,
- 52:36and they're trying to refine that.
- 52:37But basically, we are sitting on those methods
- 52:40to do the simulation, that's what we do.
- 52:43But again, so that's why this is the problem with scDesign2,
- 52:46but scDesign3 sort of tries to address this problem
- 52:50by providing the BIC.
- 52:52So if the input clusters are bad,
- 52:55then you can see that in the BIC.
- 52:57Because the likelihood will not be there, yeah.
- 52:59<v Attendee>A similar question.</v>
- 53:02I have another question
- 53:04is that basically I assumed (indistinct) about
- 53:09the experiments have duplicates,
- 53:13however, in some situations,
- 53:16maybe we do not have the replication.
- 53:20But in this situation, how could we control the FDR,
- 53:26if we do not have replicates,
- 53:28then we cannot get the P value.
- 53:30<v Jingyi>That's exactly the point of this talk.</v>
- 53:31The only part that has replicates is the RNA-seq part.
- 53:35The second part, that's the only part we have replicates.
- 53:37In the first part, when we do the ChIP-seq,
- 53:39it's just one replicate per condition, right?
- 53:42That's why I said P value calculation would be helpful.
- 53:45Right, so the reason we could control the FDR
- 53:47without using P values
- 53:49is just because we have many, many tests.
- 53:52So that's why we're doing this large scale testing.
- 53:56I think the idea, if you check it out,
- 53:59Bran Efron has talked about it extensively in his book,
- 54:04it's called, so his idea of Empirical Bayes
- 54:07is very similar to this.
- 54:08We try to borrow information across tests
- 54:11to set a threshold.
- 54:13Yeah, hopefully that answers your question.
- 54:15Yeah?
- 54:16(interference drowns out speaker)
- 54:21Yeah, sounds good, thank you.
- 54:23(interference drowns out speaker)
- 54:26Thank you.