
YSPH Biostatistics Seminar: “Opening Black Boxes: Enhancing Statistical Rigor in Genomics Data Science”

October 14, 2022
  • 00:00(interference drowns out speaker)
  • 00:08<v Announcer>Biostatistics, computational.</v>
  • 00:11(interference drowns out speaker)
  • 00:15So before joining UCLA in 2013.
  • 00:19(interference drowns out speaker)
  • 00:24Production work.
  • 00:25(interference drowns out speaker)
  • 00:29On the script is (interference drowns out speaker)
  • 00:33include differentiation factors,
  • 00:35asymmetric (indistinct) replication,
  • 00:37p-value-free false discovery rate control,
  • 00:40and high dimensional variable selection.
  • 00:43And on the bio (indistinct) application side,
  • 00:46her research includes single cell (indistinct)
  • 00:50for (indistinct) genomics and (indistinct).
  • 00:52(interference drowns out speaker)
  • 01:00Research published.
  • 01:01(interference drowns out speaker)
  • 01:142019
  • 01:16She's an MIT Technology Review certified (indistinct)
  • 01:21in 2020, and she has received a fellowship from Harvard.
  • 01:25(interference drowns out speaker)
  • 01:31<v Jingyi>I couldn't wait for the introduction.</v>
  • 01:33It's my honor to present my work here,
  • 01:36during my sabbatical in this fellowship program
  • 01:40at Harvard Radcliffe Institute.
  • 01:41So it's my pleasure to talk about some of our recent work
  • 01:46related to how statistical rigor is important in genomics.
  • 01:51So I want to say that when I was a student,
  • 01:54especially I think most of our audience here are students,
  • 01:57I want to give you this motivation.
  • 02:00When I was a student back in 2007,
  • 02:02that was when I just started my PhD
  • 02:05and I was interested in bioinformatics.
  • 02:08I had a lot of questions about bioinformatics methods
  • 02:11after I took statistics classes.
  • 02:14I think some of the questions I listed here
  • 02:16include are P values valid?
  • 02:19Because P values are so widely used
  • 02:21in genomics bioinformatics.
  • 02:23And also, we have a lot
  • 02:25of bioinformatics methods developed for data analysis.
  • 02:28And I wonder why don't we use classical statistical methods
  • 02:32in textbooks?
  • 02:33And the third thing is,
  • 02:34when we use a statistical test to understand the question,
  • 02:38to answer some pivotal question,
  • 02:40what is the proper null hypothesis?
  • 02:42So you will see those questions in the topics
  • 02:45I will talk about next.
  • 02:47So this talk will focus on the multiple testing problem.
  • 02:52See, multiple testing, what it means
  • 02:53is that we have multiple hypothesis tests,
  • 02:57and the criteria we use in this problem are P values,
  • 03:01which we have one P value per test.
  • 03:04So we know that the requirement for a valid P value
  • 03:08is that P values should follow the uniform distribution
  • 03:11between zero and one under the null hypothesis.
  • 03:14Or we may relax this to be super uniform.
  • 03:18Just for your information,
  • 03:19super uniform means that the P values
  • 03:22have higher density toward one
  • 03:25and lower density towards zero.
  • 03:26So that's still okay for type one error control,
  • 03:30even though you may have a larger
  • 03:31than expected type two error.
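
In symbols, this validity requirement can be written as follows (notation added here for reference; exact uniformity is the equality case):

```latex
% A valid p-value is super-uniform under the null hypothesis.
\Pr_{H_0}(P \le t) \;\le\; t \qquad \text{for all } t \in [0, 1].
```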
  • 03:33So given the many, many P values,
  • 03:35we need one criterion to set a cutoff on the P values.
  • 03:40And the most commonly used criterion
  • 03:42for multiple testing correction
  • 03:44is called the false discovery rate, short as FDR.
  • 03:47So the definition here is the expectation of this ratio,
  • 03:52and this ratio is the number of false discoveries
  • 03:55over the number of discoveries.
  • 03:57So this notation means the maximum
  • 04:00between the number of discoveries and one.
  • 04:02In other words, we don't allow the denominator to be zero,
  • 04:06if we don't make any discovery.
  • 04:07So this is to avoid the dividing zero issue.
  • 04:10And this ratio has a name called false discovery proportion.
  • 04:14In other words, we can have this proportion
  • 04:16for one particular data set.
  • 04:18However, as you know, we don't observe this ratio
  • 04:22because we don't know which discoveries are false.
  • 04:25So therefore, this ratio is only a hypothetical concept,
  • 04:28but not really computable.
  • 04:30And here, the expectation
  • 04:32is taken over all possible data sets
  • 04:35from the same distribution as our data set.
  • 04:38So this is the frequentist concept
  • 04:40because we have imaginary potential data sets.
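
Collecting the definitions just described, with V for the number of false discoveries and R for the number of discoveries:

```latex
% FDP is defined on one data set; FDR is its expectation over
% hypothetical data sets from the same distribution.
\mathrm{FDP} = \frac{V}{\max(R,\,1)},
\qquad
\mathrm{FDR} = \mathbb{E}\big[\mathrm{FDP}\big].
```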
  • 04:44So therefore, the phenomenal paper
  • 04:47by Benjamini and Hochberg gave us a way
  • 04:50to control this expectation called FDR
  • 04:53under a claimed level, say, 5%,
  • 04:57even though we couldn't realize this ratio itself.
  • 05:01But we could control its expectation.
  • 05:02So that's the magic of statistics.
  • 05:05So the Benjamini-Hochberg algorithm allows us
  • 05:07to set a cutoff on the P values to control the FDR.
  • 05:11But I want to emphasize that the FDR is only controlled
  • 05:15when P values satisfy this assumption,
  • 05:17otherwise, it may not be.
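
For reference, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure just mentioned; it is a toy implementation for intuition and, as emphasized above, it only controls the FDR when the input p-values are valid:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: return a boolean mask of discoveries,
    controlling the FDR at level q when the p-values are valid."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Find the largest k with p_(k) <= (k / m) * q,
    # then reject the k smallest p-values.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.flatnonzero(below)[-1] + 1
        rejected[order[:k]] = True
    return rejected

# Example: 95 null p-values plus 5 strong signals.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
print(benjamini_hochberg(pvals).sum(), "discoveries")
```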
  • 05:19So I want to describe three common causes of ill-posed P values,
  • 05:24which make P values fail to satisfy this assumption
  • 05:27in genomics, and I'll go through them one by one.
  • 05:31The first issue is what I call the formulation
  • 05:34of a two sample test problem as a one sample test.
  • 05:38What does this mean?
  • 05:39So I will use the common genomic analysis
  • 05:42of ChIP-seq data as an example.
  • 05:45So in ChIP-seq data,
  • 05:46we want to measure where a protein binds in the genome.
  • 05:50So you can consider the X axis as the genome
  • 05:54and the Y axis as the protein binding intensity
  • 05:57measured by ChIP-seq.
  • 05:59So here, we have experimental sample,
  • 06:02the condition of our interest, say, a certain cell line.
  • 06:06And the background sample is what we know
  • 06:08that there's no protein,
  • 06:10so there should be no protein binding.
  • 06:12But we still want to measure the noise from the experiment.
  • 06:15So we need this contrast.
  • 06:17And here, we want to say that the region in the red box,
  • 06:22this interval, we want to call it as a peak,
  • 06:26if we see the intensity in the experimental sample
  • 06:30is much larger than the intensity in the background sample.
  • 06:33So we do the comparison and we want to call this a peak.
  • 06:36That's the purpose of this analysis.
  • 06:39And I wanna say that, in the field,
  • 06:42because ChIP-seq has become popular since 2008,
  • 06:45MACS and HOMER are probably the two most popular software tools
  • 06:49for calling peaks.
  • 06:51Even though they have very complex procedures
  • 06:54for processing the sequencing data,
  • 06:56in the statistical part
  • 06:58that calls a region a peak or not,
  • 07:01I can say their formulation is as follows.
  • 07:04Given a region, we count its number of ChIP-seq reads
  • 07:09in the background sample and in the experimental sample.
  • 07:12So let's just summarize this intensity as a count,
  • 07:15a count here, a count here, and both are non-negative.
  • 07:19So I call the background count as big X,
  • 07:21experimental count as big Y.
  • 07:24And in our data, we have the observations, right?
  • 07:27We refer to them as small x, small y.
  • 07:30Then, the P value in both software packages
  • 07:33is essentially this probability, the probability
  • 07:37that big Y is greater than or equal to the observed small y,
  • 07:42where big Y follows a Poisson distribution
  • 07:45with mean parameter equal to the observed small x.
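
As a minimal sketch of that formulation (the actual software packages add many refinements, such as local background estimates, so this is only the statistical core being described):

```python
from scipy.stats import poisson

def one_sample_peak_pvalue(x, y):
    """One-sample formulation described above: treat the observed
    background count x as a FIXED Poisson mean and compute P(Y >= y)
    for the experimental count y. The randomness in x is ignored,
    which is exactly the issue raised next."""
    return poisson.sf(y - 1, mu=x)  # sf(y - 1) = P(Y >= y) for discrete Y

print(one_sample_peak_pvalue(x=10, y=25))  # small p-value: region called a peak
```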
  • 07:48Now, when I looked at this formula back in 2008,
  • 07:51in the MACS paper, I wondered whether it was correct.
  • 07:56And I don't think so.
  • 07:57Because the reason, if you look at it,
  • 07:59is what is the null hypothesis?
  • 08:01The null hypothesis is essentially, okay,
  • 08:04let's assume the experimental count
  • 08:06is our test statistic, okay?
  • 08:09We assume it follows a Poisson distribution
  • 08:11with mean lambda.
  • 08:13And here, the null hypothesis is lambda is equal to small x.
  • 08:18Alternative is lambda greater than small x.
  • 08:21So what's the problem with here?
  • 08:23Essentially, we are using small x as a fixed parameter
  • 08:27instead of a random observation.
  • 08:29So in other words,
  • 08:30the randomness in the background count is ignored.
  • 08:33We only consider experimental count as the random variable.
  • 08:37So in other words, we reduce the two sample testing problem
  • 08:40to a one sample testing problem
  • 08:42because we only consider the randomness
  • 08:44in the experimental sample.
  • 08:46But this is not something our textbook teaches us.
  • 08:50The reason is because if we consider background
  • 08:53as one condition, experimental has another condition,
  • 08:56under each condition, our sample size is only one.
  • 09:00So therefore, the T test will not apply
  • 09:02because the central limit theorem clearly doesn't apply here.
  • 09:05So how do we calculate P value, any ideas?
  • 09:10I think one possibility that we could still assume
  • 09:13Poisson distribution for both background X
  • 09:16and experimental Y.
  • 09:17You have two Poisson, under the independence,
  • 09:21we can probably derive the distribution
  • 09:23for Y minus X, right, and what's the null distribution.
  • 09:26That's the only way.
  • 09:27But, if you think about it, how can we verify
  • 09:30whether the Poisson distribution is reasonable?
  • 09:33You only have one observation from it.
  • 09:35The distribution could be anything, right?
  • 09:37So assuming a parametric distribution seems quite,
  • 09:41I will say, aggressive.
  • 09:42So I think P value calculation is challenging here.
  • 09:45And also, I even wonder, in this case,
  • 09:49for this one versus one comparison,
  • 09:51should we use a P value?
  • 09:53Or is this really a testing problem that's feasible?
  • 09:57So I would say, over the years, I gradually realized
  • 10:00that here we looked at many, many regions,
  • 10:03not just one region.
  • 10:04So the goal or the criterion that's ultimately used
  • 10:08is actually FDR.
  • 10:09And in this process,
  • 10:12P values are just intermediate for FDR control,
  • 10:16instead of our final target.
  • 10:18So do we have to stick with P values?
  • 10:21This motivated me to write this paper with my students
  • 10:25to propose a way to achieve p-value-free FDR control
  • 10:30by leveraging the theory in Barber and Candes paper,
  • 10:34their knockoff paper,
  • 10:36so we could actually doing FDR control
  • 10:39in this example without using P value.
  • 10:41So I will talk about this later in my talk,
  • 10:43but this is one motivation for the Clipper paper.
  • 10:47The second issue with P values is that we observe,
  • 10:50sometimes, P values are not valid
  • 10:52because the parametric model used may not fit the data well.
  • 10:57So this is an example for this commonly used
  • 11:01differential expression analysis on RNA sequencing data.
  • 11:05So for this task,
  • 11:07the two popular softwares are DESeq2 and edgeR.
  • 11:10So the data usually looks like this.
  • 11:12So we want to compare two conditions
  • 11:15and seeing which genes are differentially expressed.
  • 11:19So condition one, we have three samples,
  • 11:22which we cause to replicate,
  • 11:23condition two, three replicates.
  • 11:25So every row is one replicate,
  • 11:29while every column is one gene.
  • 11:31So to call a gene as differentially expressed,
  • 11:34we need to compare its three values
  • 11:36from condition one, two, three values from condition two.
  • 11:39So clearly, we can see the left one may be a DE gene,
  • 11:43the right one may not be a DE gene, right?
  • 11:45That's our intuition.
  • 11:46And we want to make this more formal
  • 11:49by doing a statistical test.
  • 11:52But in both edgeR and DESeq2,
  • 11:55you can see that, to compensate for the small sample size,
  • 11:59like three versus three,
  • 12:00they assume a gene follows a negative binomial distribution
  • 12:05under each condition.
  • 12:07So essentially, these three values are assumed
  • 12:09to follow one negative binomial distribution.
  • 12:12These three values
  • 12:13follow another negative binomial distribution.
  • 12:16And the null hypothesis
  • 12:18is the two negative binomial distributions
  • 12:21have the same mean; that's the problem formulation.
  • 12:24Okay, so we actually discovered an issue
  • 12:27with popular methods from this data set.
  • 12:30And thanks to my collaborator Dr. Wei Li
  • 12:32who is a computational biologist at UC Irvine.
  • 12:36So actually, from this patient data,
  • 12:39we have a much larger sample size, 51 patients
  • 12:43before the treatment of some immunotherapy medicine,
  • 12:4758 patients on treatment.
  • 12:50So we want to compare the RNA sequencing data
  • 12:53of these two groups of patients.
  • 12:55So essentially, when we apply DESeq2 or edgeR to this data,
  • 13:01the red dots indicate the number of DE genes identified.
  • 13:06We then checked whether we can still identify DE genes
  • 13:12from permuted data,
  • 13:19because we know the permuted data
  • 13:20shouldn't give us any signals.
  • 13:22If we just shuffle the two group labels,
  • 13:24we shouldn't expect any DE genes.
  • 13:26But surprisingly, we found that each method
  • 13:29can sometimes identify even more DE genes from permuted data.
  • 13:34So the bars and the error bars show the distribution
  • 13:37of DE genes identified from permuted data.
  • 13:40So this is something quite unexpected.
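
The permutation sanity check itself is easy to sketch in Python; here a t-test stands in for DESeq2 and edgeR, which are R packages, and any per-gene test could be plugged in instead:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def n_significant(counts, labels, alpha=0.05):
    """Count genes with p < alpha between two groups.
    counts: (samples x genes) matrix; labels: 0/1 group vector."""
    a, b = counts[labels == 0], counts[labels == 1]
    return int(np.sum(ttest_ind(a, b, axis=0).pvalue < alpha))

def permutation_check(counts, labels, n_perm=100):
    """Shuffle the group labels so no gene is truly differential,
    then see how many 'DE genes' are still found in each shuffle."""
    return [n_significant(counts, rng.permutation(labels))
            for _ in range(n_perm)]
```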
  • 13:44And to look into the reason, our first thought
  • 13:47is to check the negative binomial assumption.
  • 13:50Because now, under each group,
  • 13:52we have 51 and 58 sample sizes,
  • 13:55so we could check the distribution, and here's what we get.
  • 13:59You see that for the genes that are frequently identified
  • 14:03from permuted data, if we run the goodness-of-fit test,
  • 14:07we check the negative binomial distribution,
  • 14:10these genes have very small P values,
  • 14:12indicating that this fit is not good.
  • 14:15While, if you look at the genes
  • 14:16that are rarely identified from permuted data,
  • 14:20the P values are bigger and the goodness-of-fit is better.
  • 14:23So we do see this relationship
  • 14:25between the goodness-of-fit of negative binomial
  • 14:29and the frequency that a gene is identified
  • 14:32from permuted data.
  • 14:33So the negative binomial model seems to not fit well
  • 14:36on this patient data.
  • 14:39Because here, the 51 patients shouldn't be regarded
  • 14:42as replicates, they're not experimental replicates,
  • 14:45they are individuals.
  • 14:46So therefore, the theory for deriving the negative binomial
  • 14:50usually assumes a Gamma-Poisson mixture model,
  • 14:52a Gamma-Poisson hierarchical model.
  • 14:54That one may no longer hold,
  • 14:57and that's why we think the parametric model
  • 15:00is not applicable to this patient data.
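
For reference, the standard hierarchy behind the negative binomial is:

```latex
% Gamma-Poisson hierarchy (rate parameterization: r = shape, beta = rate):
Y \mid \lambda \sim \mathrm{Poisson}(\lambda),
\quad
\lambda \sim \mathrm{Gamma}(r, \beta)
\;\Longrightarrow\;
Y \sim \mathrm{NB}\!\left(r,\ \tfrac{\beta}{\beta + 1}\right).
```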
  • 15:04So what's the consequence, right?
  • 15:06So we want to convince the scientist
  • 15:08what's the consequence of doing this analysis
  • 15:11in this problematic way.
  • 15:12We show that if we just take the DE genes
  • 15:15found by DESeq2 and edgeR,
  • 15:17which are the genes corresponding to the red dots,
  • 15:20and run the so-called gene ontology analysis,
  • 15:23that is, check which functional terms
  • 15:26are enriched in those two gene sets,
  • 15:29we can see many functional terms
  • 15:31are related to immune functions.
  • 15:34Which would suggest that if we trust
  • 15:36these two methods' results, we may conclude that,
  • 15:39yes, between the two groups of patients,
  • 15:41there are differences in immune responses, right?
  • 15:44That seems to confirm our scientific hypothesis.
  • 15:48However, now, we see many of these genes
  • 15:51were also identified from permuted data,
  • 15:54and that makes the results dubious.
  • 15:57So what we tried is that, since the sample size is so large,
  • 16:01we tried the classical Wilcoxon rank-sum test,
  • 16:04which everybody learned, right?
  • 16:06So non parametric two sample test
  • 16:08that doesn't assume a parametric distribution.
  • 16:11And here, it's self-consistent:
  • 16:13it does identify DE genes from real data,
  • 16:17but it doesn't identify DE genes from permuted data.
  • 16:20So there's no contradiction here.
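
A minimal sketch of this Wilcoxon-based analysis, with a rank-sum test per gene followed by BH correction across genes (function and variable names here are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def wilcoxon_de_genes(counts_a, counts_b, q=0.05):
    """counts_a, counts_b: (samples x genes) matrices for two conditions.
    Returns indices of DE genes at FDR level q."""
    pvals = np.array([
        mannwhitneyu(counts_a[:, j], counts_b[:, j]).pvalue
        for j in range(counts_a.shape[1])
    ])
    reject = multipletests(pvals, alpha=q, method="fdr_bh")[0]
    return np.flatnonzero(reject)
```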
  • 16:23And this result motivated me to ask this question,
  • 16:26which I had years ago,
  • 16:29should we always use popular bioinformatics tools?
  • 16:33Like, check the citations of these two methods:
  • 16:35super highly cited.
  • 16:37Should we use popular methods,
  • 16:39or should we consider general statistical methods,
  • 16:43like Wilcoxon?
  • 16:45So our recommendation is sample size matters, right?
  • 16:50We may have different methods
  • 16:52suitable for different sample sizes,
  • 16:55and essentially, why statistics has so many methods,
  • 16:58parametric, non-parametric,
  • 17:00is because we have different scenarios in our data.
  • 17:03That's the first thing we should realize.
  • 17:05It's not like one method can do all the things.
  • 17:08And the second thing is sanity check.
  • 17:10We should always consider doing some sanity check
  • 17:13to make sure we trust the results
  • 17:15instead of just taking the results for granted.
  • 17:17So these things were summarized in our paper
  • 17:20published earlier this year.
  • 17:23And since its publication,
  • 17:25we have received a lot of discussions on Twitter,
  • 17:28if you are interested.
  • 17:29But anyway, so it means that many people are interested
  • 17:32in this topic; especially, many users believe
  • 17:36that popular bioinformatics tools are the state-of-the-art,
  • 17:39right, the standard methods (indistinct).
  • 17:42But if you are biostatisticians, you may not like this.
  • 17:45Because we want to develop new methods.
  • 17:48Otherwise, what's our job, right?
  • 17:50So in this case, we need to really find the loopholes,
  • 17:53or the limitations, or the gap between current approaches
  • 17:57and the data scenarios,
  • 17:58and try to convince people that, yes,
  • 18:01we do need careful thought when we choose a method.
  • 18:04It's not always one method.
  • 18:06And a related question is,
  • 18:08Wilcoxon definitely doesn't have a strong assumption,
  • 18:13and (indistinct) has reasonable power
  • 18:15when the sample size is large.
  • 18:17But what if sample sizes are small, right?
  • 18:20So when it's small, we know,
  • 18:21non-parametric tests like Wilcoxon don't have power.
  • 18:25So in this case, we actually proposed Clipper again,
  • 18:30so it can work as a downstream correction tool
  • 18:34for DESeq2 and edgeR.
  • 18:36Because they are supposed to be quite powerful,
  • 18:39even though they find probably too many.
  • 18:41So hopefully, we could use that to borrow their power,
  • 18:44but help them improve the FDR control.
  • 18:47So I'll show the results later in my talk.
  • 18:50That's the second cause.
  • 18:52And the third cause for ill-posed P values
  • 18:54is a little more complicated.
  • 18:56And this is the issue commonly observed in single cell data,
  • 19:00single cell RNA-seq data.
  • 19:01So I will use this analysis
  • 19:03called pseudotime differentially expressed genes as example.
  • 19:08What is a pseudotime?
  • 19:10Pseudotime means it's not real time, it's pseudo, right?
  • 19:13So it's something we inferred
  • 19:16from single cell RNA-seq data,
  • 19:18so those cells are measured all at once.
  • 19:20But we want to infer some time trajectory from the cells.
  • 19:26So I'll just use the screenshot from Slingshot,
  • 19:29which is a method for inferring pseudotime for explanation.
  • 19:34So here, this is a two-dimensional PCA plot of cells,
  • 19:39and the cells are pre-clustered,
  • 19:41so each color represents one cluster.
  • 19:44So the Slingshot algorithm does the following,
  • 19:47first, it takes the cluster centers,
  • 19:51and connects them using the algorithm
  • 19:53called minimum spanning tree.
  • 19:54So if you're not familiar with that,
  • 19:56it has an equivalence with hierarchical clustering actually.
  • 19:59So with the minimum spanning tree, you get this tree,
  • 20:02and then, they smooth out the tree using principal curves.
  • 20:07So we have two curves,
  • 20:08and then for every cell,
  • 20:10we find the closest curve and project the cell to the curve.
  • 20:13So therefore, in each curve,
  • 20:15the projections are called pseudotime values.
  • 20:18And usually, it's normalized between zero and one,
  • 20:21so we need to find the root and call it zero,
  • 20:24the other end is called one.
  • 20:25So this whole process is called pseudotime inference.
  • 20:28In other words, after it, we will give every cell
  • 20:32a pseudotime value in each trajectory.
  • 20:35Okay, so one thing I want to emphasize
  • 20:38is that in this pseudotime inference
  • 20:40we used gene expression values already.
  • 20:43So it's not like we observe pseudotime as external variable,
  • 20:47but it's from the same data.
  • 20:49So I want to show what we could do after the pseudotime.
  • 20:53So a typical analysis is to identify
  • 20:56which genes are differentially expressed
  • 20:58along the pseudotime.
  • 20:59Like the left one, we see, it has this upward trajectory,
  • 21:03so we may call it differentially expressed.
  • 21:06And here, we want to say the pseudotime
  • 21:09represents some cell immune response,
  • 21:12and this is an immuno-related gene,
  • 21:14so we expect to see the upward trajectory.
  • 21:17For the right gene, we expect to see something constant,
  • 21:20so we don't want to call it a DE gene,
  • 21:23that's the intuition.
  • 21:25And I want to say that we must realize,
  • 21:28pseudotime values are random
  • 21:31simply because the cells are a random sample, right?
  • 21:35We need to consider randomness,
  • 21:37and we want to show this to people by doing subsampling.
  • 21:41So you can see that sampling variation
  • 21:43would get into pseudotime values.
  • 21:46Here, every row is a cell.
  • 21:48If I randomly subsample,
  • 21:50say, 80% of cells from the left cells
  • 21:53and redo the pseudotime trajectory inference,
  • 21:57we can see that, for a given cell, across the subsamples
  • 22:01that include it, its pseudotime values will vary to some degree.
  • 22:04So it's not a constant.
  • 22:07Okay, so realizing this, we should consider
  • 22:10the randomness of pseudotime from the data.
  • 22:13However, existing methods all treat pseudotime
  • 22:16as an observed covariate.
  • 22:18So our goal here is to fix this,
  • 22:22and we proposed this method called PseudotimeDE,
  • 22:25which actually does the inference,
  • 22:27which infers whether one gene
  • 22:29is differentially expressed along pseudotime,
  • 22:32while considering pseudotime inference uncertainty.
  • 22:36So what we did exactly is that, here,
  • 22:41to see whether a gene changes with pseudotime,
  • 22:44what's the intuition?
  • 22:45We should do regression, right, do a regression analysis
  • 22:48by treating a gene's expression value as Y,
  • 22:53pseudotime as X, and running a regular regression.
  • 22:55Yeah, this is exactly what existing methods did.
  • 22:58And to make sure the regression
  • 22:59is not restricted to be linear,
  • 23:02and also account for that the gene expression values
  • 23:05are non negative counts.
  • 23:07So actually, we choose the generalized additive model,
  • 23:12which is also used in an existing method,
  • 23:14which I will show very soon.
  • 23:16So this is a very flexible and interpretable model.
  • 23:20So generalized means Y can be non Gaussian
  • 23:24and the other distribution,
  • 23:25just like generalized linear model.
  • 23:28But additive means
  • 23:29that we make the linear model more general,
  • 23:32so every feature can be non linearly transformed,
  • 23:38but the features after transformations are still added.
  • 23:42So that's additive, short as GAM.
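
As a minimal sketch in symbols, the model for gene j along pseudotime t has the form:

```latex
% g is a link function, f_j a smooth (e.g., spline) function, and
% Y_j follows a count distribution such as the negative binomial.
g\!\left(\mathbb{E}[Y_j \mid t]\right) \;=\; \beta_{j0} + f_j(t).
```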
  • 23:44So essentially, once we have a set of cells,
  • 23:47we first infer the pseudotime,
  • 23:49so we order the cells along the pseudotime,
  • 23:52and for gene J,
  • 23:53we check how the gene changes with pseudotime,
  • 23:57so we run the generalized additive model
  • 24:00to obtain a test statistic.
  • 24:02Please note that the generalized additive model has its theory,
  • 24:05so we could use the theory
  • 24:08to get the null distribution and calculate a P value.
  • 24:12And that was done in an existing method.
  • 24:15We want say that this may be problematic
  • 24:18because this whole null distribution
  • 24:20considers pseudotime to be fixed.
  • 24:22So to address this,
  • 24:24we need to consider pseudotime inference
  • 24:26as part of our test statistic calculation.
  • 24:30So to do this, we use the top part.
  • 24:33We actually do subsampling of the cells.
  • 24:36The reason we didn't do bootstrap
  • 24:38is simply because we want the method to be flexible
  • 24:41for pseudotime inference method.
  • 24:43Like I show here, there are Slingshot, Monocle3,
  • 24:47and a few others.
  • 24:48We want it to be flexible,
  • 24:50and some methods don't allow cells to be repetitive,
  • 24:54so bootstrap doesn't apply here.
  • 24:56And we use subsampling with percentage pretty high,
  • 24:59like 80%, 90%, and we did a robustness analysis.
  • 25:03And then, on each subsample, we do pseudotime inference.
  • 25:08With this, how do we get a null distribution
  • 25:11of the test statistic?
  • 25:12What we did is to permute the cells,
  • 25:15so any relationship between the gene J
  • 25:18and the pseudotime is disrupted.
  • 25:20So this can be considered from the null,
  • 25:22and then, we did the same GAM model,
  • 25:25and then, we calculate the values of the test statistic
  • 25:30on these permuted subsamples,
  • 25:32that gave us a null distribution.
  • 25:33So together, we can get a P value, this is what we did.
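
Here is a schematic Python version of that subsampling-plus-permutation recipe. The pseudotime inference (a first-principal-component projection) and the test statistic (absolute Spearman correlation) are simple stand-ins for Slingshot/Monocle3 and the GAM statistic, chosen only to keep the sketch self-contained:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def infer_pseudotime(X):
    """Stand-in for a pseudotime method: project cells onto the
    first principal component and rescale to [0, 1]."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    t = Xc @ Vt[0]
    return (t - t.min()) / (t.max() - t.min())

def test_stat(expr, t):
    """Stand-in for the GAM statistic: absolute Spearman correlation
    between one gene's expression and pseudotime."""
    return abs(spearmanr(expr, t)[0])

def pseudotime_de_pvalue(X, gene, n_sub=200, frac=0.8):
    """Subsampling + permutation null that propagates the uncertainty
    of pseudotime inference into the p-value."""
    n = X.shape[0]
    observed = test_stat(X[:, gene], infer_pseudotime(X))
    null = np.empty(n_sub)
    for b in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)  # subsample
        t_sub = infer_pseudotime(X[idx])      # re-infer pseudotime
        expr = rng.permutation(X[idx, gene])  # break gene-pseudotime link
        null[b] = test_stat(expr, t_sub)
    return (1 + np.sum(null >= observed)) / (1 + n_sub)
```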
  • 25:37And we can show that this approach
  • 25:38indeed can control the P values,
  • 25:41make the P values uniformly distributed on the null,
  • 25:45while the existing method that uses GAM,
  • 25:47but only the theoretical distribution called tradeSeq,
  • 25:50they have some distortion for P values.
  • 25:53And then, you may wonder, what's the consequence?
  • 25:56We can show that, oh, and I should say,
  • 25:58Monocle3 uses generalized linear model and not uncertainty.
  • 26:03So you can see that even though it's not as bad as tradeSeq,
  • 26:07still, some distortion.
  • 26:09So we wanna show
  • 26:10that by calibrating the P value using our way
  • 26:13we can actually discover more functional terms
  • 26:17in our differentially expressed genes.
  • 26:19It means that we can find some new biological functions
  • 26:22that were missed by this new method.
  • 26:24This shows that P value calibration
  • 26:29not just helps with FDR control,
  • 26:31but may also boost some power.
  • 26:34So I just quickly talk about this PseudotimeDE,
  • 26:37but I want to say that its computational time
  • 26:40is the biggest limitation.
  • 26:42Because here, our P value calculation requires many rounds
  • 26:46of subsampling, pseudotime inference, and permutation.
  • 26:50So let's say we want the P value with resolution 0.001,
  • 26:55we need at least 1000 rounds of such things, right?
  • 26:58That will take time.
  • 27:00So the natural question
  • 27:01is can we reduce the number of rounds, right,
  • 27:04and still achieve FDR control?
  • 27:06That becomes similar to my first goal.
  • 27:08Can we get rid of the higher resolution P values,
  • 27:11control the FDR, and then, we will use Clipper again.
  • 27:14So you can see,
  • 27:15Clipper is used throughout all the motivations,
  • 27:18that's why we proposed it,
  • 27:20and I'll talk about it in the next minute.
  • 27:22And the second question we didn't address
  • 27:24is that what if the cells don't follow a trajectory at all?
  • 27:29So clearly in our null hypothesis,
  • 27:32we are assuming there is a trajectory,
  • 27:34it's just that gene J doesn't change with the trajectory.
  • 27:38But what if the trajectory doesn't exist?
  • 27:40So this whole idea of this trajectory pseudotime inference
  • 27:45doesn't make sense, right?
  • 27:46We need to consider that.
  • 27:47But I don't think we have a good way to do it,
  • 27:50unless we can generate cells under a null
  • 27:54where the cells don't follow a trajectory.
  • 27:56So this motivated us to generate cells
  • 27:59that don't follow a trajectory, and we used a simulator.
  • 28:02So that will be the last part I talk about today.
  • 28:06Okay, PseudotimeDE is one such problem
  • 28:09where pseudotime is inferred from the same data.
  • 28:12Another common problem is to do clustering on single cells
  • 28:17to identify cell clusters,
  • 28:19and between cell clusters,
  • 28:21we identify differentially expressed genes.
  • 28:23We call this problem ClusterDE.
  • 28:26But this is also using the data twice, right?
  • 28:29So people have called this double dipping,
  • 28:32meaning that the same data is used twice.
  • 28:36To tackle this problem,
  • 28:38we need to consider the uncertainty in cell clustering,
  • 28:41and there are three existing papers
  • 28:44that try to address this problem,
  • 28:46but they either need to assume a distribution,
  • 28:48like genes follow a Gaussian distribution in every cluster,
  • 28:53or every gene follows a Poisson distribution here
  • 28:57and they need to do count splitting.
  • 28:59So I won't go into the details here,
  • 29:02but I just want to say
  • 29:02that the count splitting approach in my opinion
  • 29:05tackles a different problem.
  • 29:07It conditions on the observed data matrix,
  • 29:11rather than treating it as random.
  • 29:13But I will not talk about the detail here.
  • 29:15So motivated by the challenge in this problem,
  • 29:17we want to propose something not distribution-specific.
  • 29:22We want to use our simulator to generate the null data
  • 29:28and then use Clipper to achieve the FDR control.
  • 29:31So we want to do this non parametrically.
  • 29:34So I think the idea was motivated
  • 29:36by two phenomenal statistical papers.
  • 29:39One is the gap statistic paper,
  • 29:42which was proposed to find the number of clusters
  • 29:45in the clustering problem.
  • 29:47And if you read the paper, I think the smart idea there
  • 29:49is they try to generate data points without clusters
  • 29:54as the negative control.
  • 29:56Then, you can compute some statistic
  • 29:59for your number of clusters,
  • 30:01versus what it would be if there were no clusters, right,
  • 30:03and do the comparison and find the gap.
  • 30:05That's the gap statistic.
  • 30:06And knockoffs gave the theoretical foundation
  • 30:09for FDR control without using high resolution P values.
  • 30:13Okay, so the halftime summary
  • 30:16is that I talked about three common causes
  • 30:18of ill-posed P values.
  • 30:19Hopefully, I have convinced you
  • 30:21that we need something to avoid this problem.
  • 30:25So I talked about Clipper,
  • 30:26the p-value-free FDR control for genomic feature screening.
  • 30:30And as I said, it was motivated and enabled
  • 30:33by the FDR control procedure from this paper.
  • 30:36But the difference here is that we focus
  • 30:39on marginal screening of interesting features.
  • 30:42So in other words, we look at one feature at a time.
  • 30:45In my previous examples,
  • 30:47a feature could be a region or a gene.
  • 30:51So in the original knockoff paper,
  • 30:53their goal is to generate knockoff data
  • 30:57just like fake data for multiple features jointly.
  • 31:01And that's the very challenging part.
  • 31:03But in our case, we don't need that
  • 31:05because we are looking at one feature at a time,
  • 31:07so it's not a multivariate problem,
  • 31:10but it's a marginal screening problem.
  • 31:12So our goal is to get rid of high resolution P values.
  • 31:15So the advantage of this
  • 31:17is we don't need parametric distribution assumptions,
  • 31:20or we don't need large sample sizes
  • 31:22to enable non parametric tests, these are not needed.
  • 31:26We just need to summarize every feature
  • 31:29into a contrast score,
  • 31:31and then, set a cutoff on the contrast scores.
  • 31:35So what do I mean by contrast score?
  • 31:37So for every feature, say, I have d features in total,
  • 31:40we have d contrast scores,
  • 31:43shown as C1 to Cd,
  • 31:45so I'm drawing the histogram
  • 31:47of the distribution of contrast scores.
  • 31:50So if the theoretical assumption is satisfied,
  • 31:54then the features that are null features
  • 31:57should follow a symmetrical distribution
  • 32:00around zero, okay?
  • 32:02And for the features that are interesting
  • 32:04and should be discovered,
  • 32:06the scores should be large and positive, on the right tail.
  • 32:09So the theory of the FDR control just says,
  • 32:12we can find the contrast score cutoff as t,
  • 32:16such that this ratio is controlled under q.
  • 32:20We ought to find the minimum t for this.
  • 32:23What this means is you can consider this ratio
  • 32:26as a rough estimator of the FDR.
  • 32:43The numerator is the left tail count plus one,
  • 32:45and the denominator is the right tail count, maximum with one.
  • 32:49So in other words, we are still trying to avoid dividing by zero.
  • 32:52And the idea is that we want to find a threshold t,
  • 32:56so that the right tail will be called discoveries
  • 32:59and the left tail represents the false discoveries.
  • 33:03That's the intuition.
  • 33:05Because we know, if the feature's null,
  • 33:08then it will be randomly positive or negative.
  • 33:11And the sign is independent of the absolute value.
  • 33:15So that just replaces
  • 33:18the uniform distribution requirement for P values,
  • 33:22we change that to symmetry.
  • 33:24And another thing is that the feature,
  • 33:26if it's large positive, we want to discover it, right?
  • 33:29So this will be the discovery set
  • 33:31and this represents the negative, false discovery set.
  • 33:36So that's the intuition behind this approach.
  • 33:41But for the theory, to really prove it,
  • 33:42we need to use martingale techniques in probability.
  • 33:46And some of the same techniques were used
  • 33:47for the Benjamini-Hochberg procedure,
  • 33:49which is still based on martingales.
  • 33:50So anyway, this allows us to really control the FDR
  • 33:54just using contrast scores.
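
The thresholding rule itself fits in a few lines; this is a minimal sketch of the rule just described, not the packaged Clipper implementation:

```python
import numpy as np

def contrast_score_threshold(scores, q=0.05):
    """Smallest cutoff t with
    (1 + #{C_j <= -t}) / max(#{C_j >= t}, 1) <= q."""
    scores = np.asarray(scores, dtype=float)
    for t in np.sort(np.unique(np.abs(scores[scores != 0]))):
        fdr_hat = (1 + np.sum(scores <= -t)) / max(np.sum(scores >= t), 1)
        if fdr_hat <= q:
            return t
    return np.inf  # no cutoff achieves the target level

# Usage: features with contrast score >= t are the discoveries.
rng = np.random.default_rng(1)
scores = rng.normal(size=1000)   # symmetric null scores
scores[:50] += 4.0               # a heavy right tail of interesting features
t = contrast_score_threshold(scores, q=0.05)
discoveries = np.flatnonzero(scores >= t)
```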
  • 33:56And another thing I found as appealing
  • 33:58is that if you visually inspect the contrast scores,
  • 34:01you can see whether the assumption seems to be reasonable
  • 34:05because you expect to see something symmetrical
  • 34:08plus a heavy right tail.
  • 34:10Okay, so we are currently writing to make this more formal,
  • 34:14so we could actually check
  • 34:15whether the assumption is reasonably holding.
  • 34:18So with this approach,
  • 34:19we can make a lot of the comparison analysis easier
  • 34:26because the key is to find a reasonable contrast score
  • 34:30that satisfies this assumption.
  • 34:32And I can say that there may be multiple contrast scores
  • 34:35that satisfy the assumption, not just a unique one.
  • 34:37Then the difference is power, right?
  • 34:40So we may have a better power
  • 34:41if you have a heavier right tail.
  • 34:44Okay, so for a ChIP-seq peak calling analysis,
  • 34:47we can say that the contrast score
  • 34:49will be comparing the target data
  • 34:52from experimental condition to the null data,
  • 34:55which is the background condition.
  • 34:56They serve a natural pair of contrast,
  • 34:59and we could apply any pipeline to each data,
  • 35:03the same pipeline and then do the contrast, right?
  • 35:06You can imagine, if there's no peak,
  • 35:08then these two values will be,
  • 35:10which one is bigger is equally likely.
  • 35:13And for the RNA-seq analysis,
  • 35:16here, I showed we could use permuted data as the null data
  • 35:20actual data as a target data.
  • 35:22So if we run some test on actual data
  • 35:25to get a test statistic,
  • 35:27we use the same test on permuted data
  • 35:29to get a test statistic, and they serve as a contrast.
  • 35:32And finally, for the PseudotimeDE and ClusterDE,
  • 35:35the single cell problem,
  • 35:37actual data will give us some comparison,
  • 35:40either PseudotimeDE
  • 35:42or the between ClusterDE test statistic.
  • 35:46And if we have some similar data
  • 35:48that represents the null,
  • 35:49like null trajectory, null cluster,
  • 35:52we could run the same pipeline and then do the contrast.
  • 35:55So you see, this actually frees us
  • 35:57from saying we need to derive P values
  • 36:00and we need to know the distribution
  • 36:02by either theory or by numerical simulation, right?
  • 36:05These are all relieved
  • 36:06because we just need to do a contrast.
  • 36:08And the power is gained from the many, many tests,
  • 36:12we look at them together.
  • 36:13So that's why
  • 36:14this idea (background noise drowns out speaker).
  • 36:16Okay, so as I said, we tried to implement Clipper
  • 36:20as a way to improve FDR control,
  • 36:22and we did achieve this
  • 36:24for the popular software Macs and Homer
  • 36:27for ChIP-seq peak calling
  • 36:29and DESeq2 and edgeR for RNA-seq DEG identification.
  • 36:33So you see that they did have inflated FDR,
  • 36:37so the Y axis is the actual FDR,
  • 36:39X axis is the target FDR threshold.
  • 36:42There are inflations,
  • 36:44but with our Clipper as an add-on
  • 36:46to be used downstream of what they output
  • 36:49and do the contrast,
  • 36:50we can largely reduce the FDR to the target
  • 36:54and still maintain quite good power.
  • 36:56So that's the usage of Clipper as an add-on.
  • 37:00And for the single cell part,
  • 37:02I didn't finish about the null data generation.
  • 37:05How do we do it?
  • 37:06Our simulator was proposed partly for this reason,
  • 37:10but it has more uses.
  • 37:12So I just want to say that it's called scDesign3
  • 37:15because we have scDesign and scDesign2 as two previous works.
  • 37:19Now, let's focus on scDesign2
  • 37:20because it is the direct predecessor of scDesign3.
  • 37:24So what scDesign2 does
  • 37:26is it tries to fit a multi-gene probabilistic model
  • 37:30for each cell type,
  • 37:32and then, every gene is assumed to follow
  • 37:35a parametric distribution within the cell type.
  • 37:39And the major contribution
  • 37:41is that we capture gene-gene correlations
  • 37:44using Gaussian copula.
  • 37:45That will make the data more realistic.
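
A compact sketch of the copula idea (empirical marginals stand in for the parametric count models that scDesign2 actually fits per cell type):

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(1)

def simulate_with_gaussian_copula(counts, n_new):
    """counts: (cells x genes) matrix for one cell type.
    Returns n_new synthetic cells with gene-gene correlations preserved."""
    n, p = counts.shape
    # 1. Map each gene to approximate uniforms via its empirical CDF.
    u = (rankdata(counts, axis=0) - 0.5) / n
    # 2. Transform to Gaussian scores; estimate the copula correlation.
    corr = np.corrcoef(norm.ppf(u), rowvar=False)
    # 3. Sample correlated Gaussians, then invert each gene's empirical CDF.
    u_new = norm.cdf(rng.multivariate_normal(np.zeros(p), corr, size=n_new))
    srt = np.sort(counts, axis=0)
    idx = np.minimum((u_new * n).astype(int), n - 1)
    return srt[idx, np.arange(p)]
```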
  • 37:47Here is the comparison.
  • 37:49This is the real data used for fitting the model.
  • 37:52This is the lab (indistinct) test data used for validation,
  • 37:55and this is the synthetic cells using copula.
  • 37:59If we remove the copula, the cells will look like this.
  • 38:02So not realistic at all.
  • 38:04And our data is more realistic than that of other simulators
  • 38:08that did not explicitly capture gene-gene correlations.
  • 38:12Although they have some implicit mechanisms,
  • 38:14the model is different.
  • 38:17Okay, so we realize that scDesign2 is doing a good job
  • 38:21for discrete cell types,
  • 38:23but it cannot generate data like this
  • 38:25from a continuous trajectory.
  • 38:27What we could do is to force the cells to be divided
  • 38:31and then use scDesign2.
  • 38:32But then, you can see the simulated cells are kind of in clusters,
  • 38:35right, unlike the real data.
  • 38:37But with our generalization to the version three,
  • 38:41we now can generate cells from a continuous trajectory.
  • 38:45And I can quickly say that we basically generalize this,
  • 38:48this count distribution per cell type
  • 38:51to a generalized additive model, which I already said.
  • 38:55So we could make it more flexible in general,
  • 38:57and scDesign2 becomes a special case of scDesign3.
  • 39:02And one more thing we could do
  • 39:03is we actually use the vine copula technique,
  • 39:06so we could get the likelihood of how the model fits
  • 39:11the real data, the likelihood of the model,
  • 39:15which can also give us more information.
  • 39:18So besides the single cell trajectory data,
  • 39:21we can also use the idea to generate spatial data.
  • 39:25So here, the modification is that for every gene
  • 39:28we assume a Gaussian process in the 2D space,
  • 39:32so it can have a smooth function
  • 39:34for (indistinct) expression (indistinct).
  • 39:36And also, my other student helped with making the simulator
  • 39:40to generate reads, sequencing reads, not just counts.
  • 39:44So we can go from counts to reads,
  • 39:46and this will give us more functionality
  • 39:48to benchmark some low level tools.
  • 39:51So in short,
  • 39:52the scDesign3 simulator has two functionalities.
  • 39:56One is to do, of course, simulation.
  • 39:59We can generate single cell data from cell types,
  • 40:02discrete, continuous trajectories,
  • 40:05or even in the spatial domain.
  • 40:07We could generate multiple feature modalities,
  • 40:09what we call multi-omics, including RNA-seq,
  • 40:12ATAC-seq, which is a technology
  • 40:14for open chromatin measurement,
  • 40:16CITE-seq, which includes both protein and RNA,
  • 40:19and also DNA methylation.
  • 40:21These are the examples we tried, but we could do even more.
  • 40:24We could allow it to generate data with experimental designs
  • 40:28including sample covariate, conditions, or even batches.
  • 40:33So these can make us generate cases
  • 40:36for more types of benchmarking.
  • 40:39And for interpreting real data,
  • 40:41scDesign3 can give us model parameters,
  • 40:45so we can know whether a gene has different means
  • 40:47in two cell types,
  • 40:49whether a gene has a certain change along pseudotime,
  • 40:52or a gene has a certain change in two dimensional space.
  • 40:55And also, as I said,
  • 40:56we can output a likelihood that can give us a way
  • 40:59to calculate the Bayesian information criterion, BIC,
  • 41:03so we could evaluate
  • 41:04whether some pseudotime describes data well,
  • 41:07whether the algorithm for pseudotime inference
  • 41:09does a good job,
  • 41:11or whether the clusters explain data well.
  • 41:13So these are the things we could do.
  • 41:15And finally, to generate the null data
  • 41:18for the Clipper (indistinct),
  • 41:20we can alter the model parameters.
  • 41:22Like this is what we fit from real data,
  • 41:25we could change the model parameters
  • 41:27to make the gene no longer differentially expressed,
  • 41:30have the same mean in two subtypes.
  • 41:33Or, after we fit real data
  • 41:35with two cell types or two clusters,
  • 41:37we could change the cluster parameter
  • 41:40to make sure the cells come from one cluster
  • 41:42instead of two clusters.
  • 41:44So these are the things we could do with the model.
  • 41:47And so this is our paper;
  • 41:49more details are in the paper, which has been posted,
  • 41:52if you are interested.
  • 41:54And I want to just quickly show
  • 41:56how the ClusterDE analysis could be done.
  • 41:59This is the real data with two clusters.
  • 42:01I want to say that this is the case
  • 42:03where permutation wouldn't work.
  • 42:05If you just permute the cluster labels,
  • 42:07the cells will look like the same cells, right?
  • 42:10They're still two clusters.
  • 42:11But if you use our simulator,
  • 42:13we could generate cells from one cluster
  • 42:15that reflects the complete null, there's no cluster.
  • 42:19And the use of this can be shown in this example.
  • 42:23There's only one cluster,
  • 42:24but if we use clustering algorithms,
  • 42:27like these two choices, Seurat is a popular pipeline,
  • 42:31Kmeans is the standard classical algorithm,
  • 42:34using either to force the cells into two clusters,
  • 42:38we are using gene expression data.
  • 42:40So no wonder that if you look at a gene's expression
  • 42:44between the two clusters, you may call it DE,
  • 42:46but that's not interesting, since there's no clusters.
  • 42:50So if we use our scDesign3 to generate null data,
  • 42:54in this case, null data should be very similar to real data.
  • 42:58It still has only one cluster.
  • 43:00Then, if we run Seurat or Kmeans,
  • 43:04similarly, on null data,
  • 43:06we would divide the cell in a similar way,
  • 43:09and then, if you do a contrast of the two sets of results,
  • 43:12you should see no big difference.
  • 43:14That's the idea for controlling FDR.
  • 43:16So indeed, in that example, if we're just naively wrong,
  • 43:21the Seurat pipeline clustering followed by some tests
  • 43:25like t, Wilcoxon, bimodal,
  • 43:28yeah, you will see FDR is one.
  • 43:30The reason is you keep finding D genes,
  • 43:33even though there's no cluster.
  • 43:34But using our approach,
  • 43:35we could control the FDR reasonably well.
  • 43:38So that's the predominant results for this purpose
  • 43:42for this task, so that summarizes my talk today.
  • 43:46And finally, I just want to make a few notes
  • 43:48to give some messages.
  • 43:50I talk about multiple testing,
  • 43:52but in many scientific problems,
  • 43:54I think the key is whether it should be formulated
  • 43:57as a multiple testing problem.
  • 43:59So actually, to address this question,
  • 44:01I wrote a perspective article
  • 44:03with my collaborator Xin Tong at USC.
  • 44:06We try to distinguish statistical hypothesis testing
  • 44:10from machine learning binary classification.
  • 44:13They seem similar
  • 44:14because both would give you a binary decision, right?
  • 44:17But I can say that testing is an inference problem,
  • 44:20classification is a prediction problem.
  • 44:23So if you really think about it,
  • 44:25their fundamental concepts are different.
  • 44:27So that's why we wrote this, to really talk with biologists
  • 44:31and computational people who use the two simultaneously.
  • 44:35So if you're interested, you can check it out.
  • 44:37And finally, I wanna say that,
  • 44:40so if it's a multiple testing problem,
  • 44:43I talked about three common causes of ill-posed P values,
  • 44:48and I propose a solution, Clipper,
  • 44:50for simplifying this problem by just using contrast scores,
  • 44:55and then, set a cutoff.
  • 44:56And the simulator, which we hope to be useful
  • 44:59for the single cell and spatial omics field
  • 45:01because this field is so popular,
  • 45:03we have more than 1000 methods already.
  • 45:05So benchmarking seems to be something quite necessary.
  • 45:08Because if there's no benchmarking,
  • 45:11then maybe new methods wouldn't have much of a chance
  • 45:14because people may still use the older methods
  • 45:16that are better cited.
  • 45:18Okay, so these are the papers related to my talk today.
  • 45:23And so, finally, I want to say that,
  • 45:26so if you're interested, you want to check them out,
  • 45:29and let me know if you have any questions.
  • 45:31So finally, I'll just say this,
  • 45:33this is something quite interesting.
  • 45:34It's another paper we just recently wrote,
  • 45:37and I can say,
  • 45:38it should be online in Genome Biology very soon.
  • 45:40So we actually did this benchmark
  • 45:43for the so called QTL analysis in genetics, right?
  • 45:47Quantitative Trait Locus mapping.
  • 45:50So in this analysis,
  • 45:51a common procedure is to infer hidden variables
  • 45:55from the data, like the gene expression matrix,
  • 45:57to do hidden variable inference.
  • 46:00For the most part, (indistinct) the classical PCA,
  • 46:03and several methods propose specific (indistinct).
  • 46:07And my student Heather, I actually give her full credit,
  • 46:10she was so careful and she really wanted to understand
  • 46:13the method before using it,
  • 46:14and that led to this project.
  • 46:17She wanted to see, huh, do I really see advantages
  • 46:19of this new method even compared to PCA?
  • 46:22But that's what she found, right?
  • 46:24PCA still seems to be the most stable,
  • 46:26robust, and also fastest algorithm,
  • 46:30but this is one of the reviewer's comments
  • 46:32I wanna share with you.
  • 46:34These results may come as a surprise to some,
  • 46:37given the nearly un-contestable status
  • 46:39that method A has achieved within the community.
  • 46:42But sadly, they reflect the fact
  • 46:44that computational biology methods can rise to fame
  • 46:47almost by accident rather than by sound statistical arguments.
  • 46:50So if you're interested,
  • 46:52you can check out this paper, it's on bioRxiv.
  • 46:54But anyway, I think it says how important it is
  • 46:57for statisticians to convey our message, right?
  • 47:00Why do we need statistical rigor, why does it matter?
  • 47:04So for our students,
  • 47:05if you want to know more about GAM and copulas,
  • 47:08there are two books I want to recommend.
  • 47:10So they're very good introductory textbooks,
  • 47:12so you can know the (indistinct).
  • 47:14Finally, I want to thank my collaborator at UC Irvine,
  • 47:19my students for all their tremendous work I talked about today,
  • 47:23and also the funding agencies for giving us the support.
  • 47:26So thank you very much.
  • 47:36<v Attendee>A question?</v>
  • 47:38<v Jingyi>Yes.</v>
  • 47:39<v Attendee>So I was really curious</v>
  • 47:40about the analysis of like the large patient sample.
  • 47:45I know that there has in fact
  • 47:46been extensive discussion on it.
  • 47:48<v ->Yeah, yeah.</v> <v ->Which is</v>
  • 47:52interesting, to say the least, how it's gone down.
  • 47:55But I was kinda curious,
  • 47:56the way that it was presented here made me think about like,
  • 48:01apologies, if this is like a path that's already been tread,
  • 48:07so, yeah, the bar graph.
  • 48:11<v Jingyi>Yeah.</v>
  • 48:12<v Attendee>Yeah, so it sort of,</v>
  • 48:16it makes me wonder about the application
  • 48:19of the term false discovery in different contexts.
  • 48:23And taking patients, you can imagine,
  • 48:26there can be like unintended structure
  • 48:29within those populations.
  • 48:32And by (interference drowns out speaker) chance,
  • 48:34if there is 30,000 potential transcripts
  • 48:38that you're looking at, there might actually be,
  • 48:41between individuals who are not isogenic,
  • 48:44truly differentially expressed genes
  • 48:47between even permuted groups.
  • 48:50And so I'm wondering if there's a useful distinction
  • 48:53between a false discovery and a true,
  • 48:56but uninteresting discovery.
  • 49:00<v Jingyi>I think it depends on how you define truth.</v>
  • 49:04I think that's the key.
  • 49:05But what is the definition of DE genes?
  • 49:08I wanna say, to be exact,
  • 49:10the definition of DE genes in DESeq2
  • 49:14and edgeR, and that of Wilcoxon, are different.
  • 49:18Because in Wilcoxon, a DE gene is defined like this:
  • 49:21a gene has two distributions,
  • 49:24one under each condition,
  • 49:27and if I randomly take one observation
  • 49:29from each distribution from each condition,
  • 49:32is the chance that one is bigger than the other
  • 49:35equal to 0.5?
  • 49:36That's the Wilcoxon question.
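
In symbols, that null hypothesis is:

```latex
% X is one draw from condition 1, Y one draw from condition 2.
H_0:\ \Pr(X > Y) = \tfrac{1}{2}.
```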
  • 49:38While for DESeq2 and edgeR, their DE gene definition
  • 49:42is that the negative binomial means are different.
  • 49:45But clearly, you can see, it only makes sense
  • 49:48if the negative binomial is a reasonable distribution,
  • 49:51that's the key.
  • 49:52So that's why in theory,
  • 49:53if the negative binomial is no longer valid or reasonable,
  • 49:58then why should we define a DE gene
  • 50:00based on a negative binomial mean difference?
  • 50:03I think that's kind of my answer to your question.
  • 50:06But the tricky thing about statistical inference
  • 50:09compared to supervised learning
  • 50:11is that we don't observe the truth, that's always the case.
  • 50:14So we're making a guess.
  • 50:16Frequentist people have one way to guess,
  • 50:20Bayesian people have another way to guess.
  • 50:21And so one issue I've seen in the Twitter discussion
  • 50:24is that several people try to,
  • 50:26maybe not intentionally,
  • 50:28confuse frequentist concepts with Bayesian concepts,
  • 50:32but they're not really comparable, right?
  • 50:34You cannot talk about them in the same ground.
  • 50:36That's a problem, and here, our criterion,
  • 50:39the false discovery rate, is a frequentist criterion,
  • 50:42it relies on P values, right?
  • 50:44So therefore, you cannot use Bayesian arguments
  • 50:46to argue against such a frequentist approach.
  • 50:49Because you are doing frequentist, right?
  • 50:52But whether frequentist makes sense or not,
  • 50:54that's a different topic.
  • 50:55Hopefully, that answers your question.
  • 50:57<v ->Yeah, thank you.</v> <v ->Thank you.</v>
  • 51:01Yes. <v ->Hello,</v>
  • 51:02thank you much for your talk,
  • 51:03and I think that is very interesting.
  • 51:05However, I have a question on slide 26 actually.
  • 51:13It's about what you said that,
  • 51:17maybe 26.
  • 51:20<v Jingyi>26, okay, yeah.</v>
  • 51:23<v Attendee>Yeah, you said</v>
  • 51:24that like it is a multi-gene probabilistic model
  • 51:28for cell type.
  • 51:30However, I'm a little bit confused
  • 51:33about how you define the cell type.
  • 51:37But basically, from my own understandings,
  • 51:40that after you get, for example, the single cell RNA-seq data,
  • 51:45you will use Seurat to get the clusters.
  • 51:48<v Jingyi>Yeah.</v>
  • 51:49<v Attendee>And you will annotate this cluster</v>
  • 51:52based on the-
  • 51:54<v ->Knowledge, yeah.</v> <v ->Gene.</v>
  • 51:56And then, if this model based on your
  • 52:04annotation of, okay.
  • 52:08<v Jingyi>Yeah, I see your point.</v>
  • 52:10Essentially, yeah, we need cell cluster to be pre-defined.
  • 52:13So if it's not reasonable, then, yes,
  • 52:15it will affect the results for sure.
  • 52:17Because the key is that you need to make sure
  • 52:20it is reasonable to assume a gene follows
  • 52:23one of the four distribution within a cluster, right?
  • 52:26So that's why there are methods
  • 52:28that try to refine clustering
  • 52:30by checking the negative binomial distribution.
  • 52:34So there are several research on that,
  • 52:36and they're trying to refine that.
  • 52:37But basically, we are building on those methods
  • 52:40to do the simulation, that's what we do.
  • 52:43But again, so that's why this is the problem with scDesign2,
  • 52:46but scDesign3 sort of tries to address this problem
  • 52:50by providing the BIC.
  • 52:52So if the input clusters are bad,
  • 52:55then you can see that in the BIC.
  • 52:57Because the likelihood will not be there, yeah.
  • 52:59<v Attendee>A similar question.</v>
  • 53:02I have another question
  • 53:04is that basically I assumed (indistinct) that
  • 53:09the experiments have replicates;
  • 53:13however, in some situations,
  • 53:16maybe we do not have replication.
  • 53:20But in this situation, how could we control the FDR?
  • 53:26If we do not have replicates,
  • 53:28then we cannot get the P value.
  • 53:30<v Jingyi>That's exactly the point of this talk.</v>
  • 53:31The only part that has replicates is the RNA-seq part.
  • 53:35The second part, that's the only part we have replicates.
  • 53:37In the first part, when we do the ChIP-seq,
  • 53:39it's just one replicate per condition, right?
  • 53:42That's why I said P value calculation would be challenging.
  • 53:45Right, so the reason we could control the FDR
  • 53:47without using P values
  • 53:49is just because we have many, many tests.
  • 53:52So that's why we're doing this large scale testing.
  • 53:56I think the idea, if you check it out,
  • 53:59Brad Efron has talked about it extensively in his book
  • 54:04on large-scale inference, and his idea of Empirical Bayes
  • 54:07is very similar to this.
  • 54:08We try to borrow information across tests
  • 54:11to set a threshold.
  • 54:13Yeah, hopefully that answers your question.
  • 54:15Yeah?
  • 54:16(interference drowns out speaker)
  • 54:21Yeah, sounds good, thank you.
  • 54:23(interference drowns out speaker)
  • 54:26Thank you.