
YSPH Biostatistics Seminar: “Opening Black Boxes: Enhancing Statistical Rigor in Genomics Data Science”

October 14, 2022
  • 00:00(interference drowns out speaker)
  • 00:08<v Announcer>Biostatistics, computational.</v>
  • 00:11(interference drowns out speaker)
  • 00:15So before joining UCLA in 2013.
  • 00:19(interference drowns out speaker)
  • 00:24Production work.
  • 00:25(interference drowns out speaker)
  • 00:29On the script is (interference drowns out speaker)
  • 00:33include differentiation factors,
  • 00:35asymmetric (indistinct) replication,
  • 00:37p-value-free false discovery rate control,
  • 00:40and high dimensional variable selection.
  • 00:43And on the bio (indistinct) application side,
  • 00:46her research includes single cell (indistinct)
  • 00:50for (indistinct) genomics and (indistinct).
  • 00:52(interference drowns out speaker)
  • 01:00Research published.
  • 01:01(interference drowns out speaker)
  • 01:142019
  • 01:16She's an MIT Technology Review certified (indistinct)
  • 01:21in 2020, and she has received a fellowship from Harvard.
  • 01:25(interference drowns out speaker)
  • 01:31<v Jingyi>I couldn't wait for the introduction.</v>
  • 01:33It's my honor to present my work here,
  • 01:36during my sabbatical in this fellowship program
  • 01:40at Harvard Radcliffe Institute.
  • 01:41So it's my pleasure to talk about some of our recent work
  • 01:46related to how statistical rigor is important in genomics.
  • 01:51So I want to say that when I was a student,
  • 01:54especially I think most of our audience here are students,
  • 01:57I want to give you this motivation.
  • 02:00When I was a student back in 2007,
  • 02:02that was when I just started my PhD
  • 02:05and I was interested in bioinformatics.
  • 02:08I had a lot of questions about bioinformatics methods
  • 02:11after I took statistics classes.
  • 02:14I think some of the questions I listed here
  • 02:16include are P values valid?
  • 02:19Because P values are so widely used
  • 02:21in genomics bioinformatics.
  • 02:23And also, we have a lot
  • 02:25of bioinformatics methods developed for data analysis.
  • 02:28And I wonder why don't we use classical statistical methods
  • 02:32in textbooks?
  • 02:33And the third thing is,
  • 02:34when we use a statistical test to understand the question,
  • 02:38to answer some pivotal question,
  • 02:40what is the proper null hypothesis?
  • 02:42So you will see those questions in the topics
  • 02:45I will talk about next.
  • 02:47So this talk will focus on the multiple testing problem.
  • 02:52See, multiple testing, what it means
  • 02:53is that we have multiple hypothesis tests,
  • 02:57and the criteria we use in this problem are P values,
  • 03:01which we have one P value per test.
  • 03:04So we know that the requirement for a valid P value
  • 03:08is that P values should follow the uniform distribution
  • 03:11between zero and one under the null hypothesis.
  • 03:14Or we may relax this to be super uniform.
  • 03:18Just for your information,
  • 03:19super uniform means that the P values
  • 03:22have higher density toward one
  • 03:25and lower density towards zero.
  • 03:26So that's still okay for type one error control,
  • 03:30even though you may have a larger
  • 03:31than expected type two error.
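
In symbols, this validity requirement can be written as follows (notation added here for reference; exact uniformity is the equality case):

```latex
% A valid p-value is super-uniform under the null hypothesis.
\Pr_{H_0}(P \le t) \;\le\; t \qquad \text{for all } t \in [0, 1].
```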
  • 03:33So given the many, many P values,
  • 03:35we need one criterion to set a cutoff on the P values.
  • 03:40And the most commonly used criterion
  • 03:42for multiple testing correction
  • 03:44is called the false discovery rate, short as FDR.
  • 03:47So the definition here is the expectation of this ratio,
  • 03:52and this ratio is the number of false discoveries
  • 03:55over the number of discoveries.
  • 03:57So this notation means the maximum
  • 04:00between the number of discoveries and one.
  • 04:02In other words, we don't allow the denominator to be zero,
  • 04:06if we don't make any discovery.
  • 04:07So this is to avoid the dividing zero issue.
  • 04:10And this ratio has a name called false discovery proportion.
  • 04:14In other words, we can have this proportion
  • 04:16for one particular data set.
  • 04:18However, as you know, we don't observe this ratio
  • 04:22because we don't know which discoveries are false.
  • 04:25So therefore, this ratio is only a hypothetical concept,
  • 04:28but not really computable.
  • 04:30And here, the expectation
  • 04:32is taken over all possible data sets
  • 04:35from the same distribution as our data set.
  • 04:38So this is the frequentist concept
  • 04:40because we have imaginary potential data sets.
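
Collecting the definitions just described, with V for the number of false discoveries and R for the number of discoveries:

```latex
% FDP is defined on one data set; FDR is its expectation over
% hypothetical data sets from the same distribution.
\mathrm{FDP} = \frac{V}{\max(R,\,1)},
\qquad
\mathrm{FDR} = \mathbb{E}\big[\mathrm{FDP}\big].
```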
  • 04:44So therefore, the phenomenal paper
  • 04:47by Benjamini and Hochberg gave us a way
  • 04:50to control this expectation called FDR
  • 04:53under a claimed level, say, 5%,
  • 04:57even though we couldn't realize this ratio itself.
  • 05:01But we could control its expectation.
  • 05:02So that's the magic of statistics.
  • 05:05So the Benjamini-Hochberg algorithm allows us
  • 05:07to set a cutoff on the P values to control the FDR.
  • 05:11But I want to emphasize that the FDR is only controlled
  • 05:15when P values satisfy this assumption,
  • 05:17otherwise, it may not be.
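
For reference, here is a minimal Python sketch of the Benjamini-Hochberg step-up procedure just mentioned; it is a toy implementation for intuition and, as emphasized above, it only controls the FDR when the input p-values are valid:

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: return a boolean mask of discoveries,
    controlling the FDR at level q when the p-values are valid."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Find the largest k with p_(k) <= (k / m) * q,
    # then reject the k smallest p-values.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.flatnonzero(below)[-1] + 1
        rejected[order[:k]] = True
    return rejected

# Example: 95 null p-values plus 5 strong signals.
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=95), rng.uniform(0, 1e-4, size=5)])
print(benjamini_hochberg(pvals).sum(), "discoveries")
```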
  • 05:19So I want to describe three common causes of ill-posed P values,
  • 05:24which make P values fail to satisfy this assumption
  • 05:27in genomics, and I'll go through them one by one.
  • 05:31The first issue is what I call the formulation
  • 05:34of a two sample test problem as a one sample test.
  • 05:38What does this mean?
  • 05:39So I will use the common genomic analysis
  • 05:42of ChIP-seq data as an example.
  • 05:45So in ChIP-seq data,
  • 05:46we want to measure where a protein binds in the genome.
  • 05:50So you can consider the X axis as the genome
  • 05:54and the Y axis as the protein binding intensity
  • 05:57measured by ChIP-seq.
  • 05:59So here, we have experimental sample,
  • 06:02the condition of our interest, say, a certain cell line.
  • 06:06And the background sample is what we know
  • 06:08that there's no protein,
  • 06:10so there should be no protein binding.
  • 06:12But we still want to measure the noise from the experiment.
  • 06:15So we need this contrast.
  • 06:17And here, we want to say that the region in the red box,
  • 06:22this interval, we want to call it as a peak,
  • 06:26if we see the intensity in the experimental sample
  • 06:30is much larger than the intensity in the background sample.
  • 06:33So we do the comparison and we want to call this a peak.
  • 06:36That's the purpose of this analysis.
  • 06:39And I wanna say that, in the field,
  • 06:42because ChIP-seq has become popular since 2008,
  • 06:45MACS and HOMER are probably the two most popular software tools
  • 06:49for calling peaks.
  • 06:51Even though they have very complex procedures
  • 06:54for processing the sequencing data,
  • 06:56in the statistical part
  • 06:58that calls a region a peak or not,
  • 07:01I can say their formulation is as follows.
  • 07:04Given a region, we count its number of ChIP-seq reads
  • 07:09in the background sample and in the experimental sample.
  • 07:12So let's just summarize this intensity as a count,
  • 07:15a count here, a count here, and both are non-negative.
  • 07:19So I call the background count as big X,
  • 07:21experimental count as big Y.
  • 07:24And in our data, we have the observations, right?
  • 07:27We refer to them as small x, small y.
  • 07:30Then, the P value in both software packages
  • 07:33is essentially this probability, the probability
  • 07:37that big Y is greater than or equal to the observed small y,
  • 07:42where big Y follows a Poisson distribution
  • 07:45with mean parameter equal to the observed small x.
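
As a minimal sketch of that formulation (the actual software packages add many refinements, such as local background estimates, so this is only the statistical core being described):

```python
from scipy.stats import poisson

def one_sample_peak_pvalue(x, y):
    """One-sample formulation described above: treat the observed
    background count x as a FIXED Poisson mean and compute P(Y >= y)
    for the experimental count y. The randomness in x is ignored,
    which is exactly the issue raised next."""
    return poisson.sf(y - 1, mu=x)  # sf(y - 1) = P(Y >= y) for discrete Y

print(one_sample_peak_pvalue(x=10, y=25))  # small p-value: region called a peak
```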
  • 07:48Now, when I looked at this formula back in 2008,
  • 07:51in the MACS paper, I wondered whether it was correct.
  • 07:56And I don't think so.
  • 07:57Because the reason, if you look at it,
  • 07:59is what is the null hypothesis?
  • 08:01The null hypothesis is essentially, okay,
  • 08:04let's assume the experimental count
  • 08:06is our test statistic, okay?
  • 08:09We assume it follows a Poisson distribution
  • 08:11with mean lambda.
  • 08:13And here, the null hypothesis is lambda is equal to small x.
  • 08:18Alternative is lambda greater than small x.
  • 08:21So what's the problem with here?
  • 08:23Essentially, we are using small x as a fixed parameter
  • 08:27instead of a random observation.
  • 08:29So in other words,
  • 08:30the randomness in the background count is ignored.
  • 08:33We only consider experimental count as the random variable.
  • 08:37So in other words, we reduce the two sample testing problem
  • 08:40to a one sample testing problem
  • 08:42because we only consider the randomness
  • 08:44in the experimental sample.
  • 08:46But this is not something our textbook teaches us.
  • 08:50The reason is because if we consider background
  • 08:53as one condition, experimental has another condition,
  • 08:56under each condition, our sample size is only one.
  • 09:00So therefore, the T test will not apply
  • 09:02because the central limit theorem clearly doesn't apply here.
  • 09:05So how do we calculate P value, any ideas?
  • 09:10I think one possibility that we could still assume
  • 09:13Poisson distribution for both background X
  • 09:16and experimental Y.
  • 09:17You have two Poisson, under the independence,
  • 09:21we can probably derive the distribution
  • 09:23for Y minus X, right, and what's the null distribution.
  • 09:26That's the only way.
  • 09:27But, if you think about it, how can we verify
  • 09:30whether the Poisson distribution is reasonable?
  • 09:33You only have one observation from it.
  • 09:35The distribution could be anything, right?
  • 09:37So assuming a parametric distribution seems quite,
  • 09:41I will say, aggressive.
  • 09:42So I think P value calculation is challenging here.
  • 09:45And also, I even wonder, in this case,
  • 09:49for this one versus one comparison,
  • 09:51should we use a P value?
  • 09:53Or is this really a testing problem that's feasible?
  • 09:57So I would say, over the years, I gradually realized
  • 10:00that here we looked at many, many regions,
  • 10:03not just one region.
  • 10:04So the goal or the criterion that's ultimately used
  • 10:08is actually FDR.
  • 10:09And in this process,
  • 10:12P values are just intermediate for FDR control,
  • 10:16instead of our final target.
  • 10:18So do we have to stick with P values?
  • 10:21This motivated me to write this paper with my students
  • 10:25to propose a way to achieve p-value-free FDR control
  • 10:30by leveraging the theory in Barber and Candes paper,
  • 10:34their knockoff paper,
  • 10:36so we could actually doing FDR control
  • 10:39in this example without using P value.
  • 10:41So I will talk about this later in my talk,
  • 10:43but this is one motivation for the Clipper paper.
  • 10:47The second issue with P values is that we observe,
  • 10:50sometimes, P values are not valid
  • 10:52because the parametric model used may not fit the data well.
  • 10:57So this is an example for this commonly used
  • 11:01differential expression analysis on RNA sequencing data.
  • 11:05So for this task,
  • 11:07the two popular softwares are DESeq2 and edgeR.
  • 11:10So the data usually looks like this.
  • 11:12So we want to compare two conditions
  • 11:15and seeing which genes are differentially expressed.
  • 11:19So condition one, we have three samples,
  • 11:22which we cause to replicate,
  • 11:23condition two, three replicates.
  • 11:25So every row is one replicate,
  • 11:29while every column is one gene.
  • 11:31So to call a gene as differentially expressed,
  • 11:34we need to compare its three values
  • 11:36from condition one, two, three values from condition two.
  • 11:39So clearly, we can see the left one may be a DE gene,
  • 11:43the right one may not be a DE gene, right?
  • 11:45That's our intuition.
  • 11:46And we want to make this more formal
  • 11:49by doing a statistical test.
  • 11:52But in both edgeR and DESeq2,
  • 11:55you can see that, to compensate for the small sample size,
  • 11:59like three versus three,
  • 12:00they assume a gene follows a negative binomial distribution
  • 12:05under each condition.
  • 12:07So essentially, these three values are assumed
  • 12:09to follow one negative binomial distribution.
  • 12:12These three values
  • 12:13follow another negative binomial distribution.
  • 12:16And the null hypothesis
  • 12:18is the two negative binomial distributions
  • 12:21have the same mean; that's the problem formulation.
  • 12:24Okay, so we actually discovered an issue
  • 12:27with popular methods from this data set.
  • 12:30And thanks to my collaborator Dr. Wei Li
  • 12:32who is a computational biologist at UC Irvine.
  • 12:36So actually, from this patient data,
  • 12:39we have a much larger sample size, 51 patients
  • 12:43before the treatment of some immunotherapy medicine,
  • 12:4758 patients on treatment.
  • 12:50So we want to compare the RNA sequencing data
  • 12:53of these two groups of patients.
  • 12:55So essentially, when we apply DESeq2 or edgeR to this data,
  • 13:01the red dots indicate the number of DE genes identified.
  • 13:06We then checked whether we can still identify DE genes
  • 13:12from permuted data,
  • 13:19because we know the permuted data
  • 13:20shouldn't give us any signals.
  • 13:22If we just shuffle the two group labels,
  • 13:24we shouldn't expect any DE genes.
  • 13:26But surprisingly, we found that each method
  • 13:29can sometimes identify even more DE genes from permuted data.
  • 13:34So the bars and the error bars show the distribution
  • 13:37of DE genes identified from permuted data.
  • 13:40So this is something quite unexpected.
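
The permutation sanity check itself is easy to sketch in Python; here a t-test stands in for DESeq2 and edgeR, which are R packages, and any per-gene test could be plugged in instead:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def n_significant(counts, labels, alpha=0.05):
    """Count genes with p < alpha between two groups.
    counts: (samples x genes) matrix; labels: 0/1 group vector."""
    a, b = counts[labels == 0], counts[labels == 1]
    return int(np.sum(ttest_ind(a, b, axis=0).pvalue < alpha))

def permutation_check(counts, labels, n_perm=100):
    """Shuffle the group labels so no gene is truly differential,
    then see how many 'DE genes' are still found in each shuffle."""
    return [n_significant(counts, rng.permutation(labels))
            for _ in range(n_perm)]
```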
  • 13:44And to look into the reason, our first thought
  • 13:47is to check the negative binomial assumption.
  • 13:50Because now, under each group,
  • 13:52we have 51 and 58 sample sizes,
  • 13:55so we could check the distribution, and here's what we get.
  • 13:59You see that for the genes that are frequently identified
  • 14:03from permuted data, if we run the goodness-of-fit test,
  • 14:07we check the negative binomial distribution,
  • 14:10these genes have very small P values,
  • 14:12indicating that this fit is not good.
  • 14:15While, if you look at the genes
  • 14:16that are rarely identified from permuted data,
  • 14:20the P values are bigger and the goodness-of-fit is better.
  • 14:23So we do see this relationship
  • 14:25between the goodness-of-fit of negative binomial
  • 14:29and the frequency that a gene is identified
  • 14:32from permuted data.
  • 14:33So the negative binomial model seems to not fit well
  • 14:36on this patient data.
  • 14:39Because here, the 51 patients shouldn't be regarded
  • 14:42as replicates, they're not experimental replicates,
  • 14:45they are individuals.
  • 14:46So therefore, the theory for deriving the negative binomial
  • 14:50usually assumes a Gamma-Poisson mixture model,
  • 14:52a Gamma-Poisson hierarchical model.
  • 14:54That one may no longer hold,
  • 14:57and that's why we think the parametric model
  • 15:00is not applicable to this patient data.
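
For reference, the standard hierarchy behind the negative binomial is:

```latex
% Gamma-Poisson hierarchy (rate parameterization: r = shape, beta = rate):
Y \mid \lambda \sim \mathrm{Poisson}(\lambda),
\quad
\lambda \sim \mathrm{Gamma}(r, \beta)
\;\Longrightarrow\;
Y \sim \mathrm{NB}\!\left(r,\ \tfrac{\beta}{\beta + 1}\right).
```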
  • 15:04So what's the consequence, right?
  • 15:06So we want to convince the scientist
  • 15:08what's the consequence of doing this analysis
  • 15:11in this problematic way.
  • 15:12We show that if we just take the DE genes
  • 15:15found by DESeq2 and edgeR,
  • 15:17which are the genes corresponding to the red dots,
  • 15:20and run the so-called gene ontology analysis,
  • 15:23that is, check which functional terms
  • 15:26are enriched in those two gene sets,
  • 15:29we can see many functional terms
  • 15:31are related to immune functions.
  • 15:34Which would suggest that if we trust
  • 15:36these two methods' results, we may conclude that,
  • 15:39yes, between the two groups of patients,
  • 15:41there are differences in immune responses, right?
  • 15:44That seems to confirm our scientific hypothesis.
  • 15:48However, now, we see many of these genes
  • 15:51were also identified from permuted data,
  • 15:54and that makes the results dubious.
  • 15:57So what we tried is that, since the sample size is so large,
  • 16:01we tried the classical Wilcoxon rank-sum test,
  • 16:04which everybody learned, right?
  • 16:06So non parametric two sample test
  • 16:08that doesn't assume a parametric distribution.
  • 16:11And here, it's self-consistent:
  • 16:13it does identify DE genes from real data,
  • 16:17but it doesn't identify DE genes from permuted data.
  • 16:20So there's no contradiction here.
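
A minimal sketch of this Wilcoxon-based analysis, with a rank-sum test per gene followed by BH correction across genes (function and variable names here are illustrative):

```python
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def wilcoxon_de_genes(counts_a, counts_b, q=0.05):
    """counts_a, counts_b: (samples x genes) matrices for two conditions.
    Returns indices of DE genes at FDR level q."""
    pvals = np.array([
        mannwhitneyu(counts_a[:, j], counts_b[:, j]).pvalue
        for j in range(counts_a.shape[1])
    ])
    reject = multipletests(pvals, alpha=q, method="fdr_bh")[0]
    return np.flatnonzero(reject)
```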
  • 16:23And this result motivated me to ask this question,
  • 16:26which I had years ago,
  • 16:29should we always use popular bioinformatics tools?
  • 16:33Like, check the citations of these two methods:
  • 16:35super highly cited.
  • 16:37Should we use popular methods,
  • 16:39or should we consider general statistical methods,
  • 16:43like Wilcoxon?
  • 16:45So our recommendation is sample size matters, right?
  • 16:50We may have different methods
  • 16:52suitable for different sample sizes,
  • 16:55and essentially, why statistics has so many methods,
  • 16:58parametric, non-parametric,
  • 17:00is because we have different scenarios in our data.
  • 17:03That's the first thing we should realize.
  • 17:05It's not like one method can do all the things.
  • 17:08And the second thing is sanity check.
  • 17:10We should always consider doing some sanity check
  • 17:13to make sure we trust the results
  • 17:15instead of just taking the results for granted.
  • 17:17So these things were summarized in our paper
  • 17:20published earlier this year.
  • 17:23And since its publication,
  • 17:25we have received a lot of discussions on Twitter,
  • 17:28if you are interested.
  • 17:29But anyway, so it means that many people are interested
  • 17:32in this topic; especially, many users believe
  • 17:36that popular bioinformatics tools are the state-of-the-art,
  • 17:39right, the standard methods (indistinct).
  • 17:42But if you are biostatisticians, you may not like this.
  • 17:45Because we want to develop new methods.
  • 17:48Otherwise, what's our job, right?
  • 17:50So in this case, we need to really find the loopholes,
  • 17:53or the limitations, or the gap between current approaches
  • 17:57and the data scenarios,
  • 17:58and try to convince people that, yes,
  • 18:01we do need careful thought when we choose a method.
  • 18:04It's not always one method.
  • 18:06And a related question is,
  • 18:08Wilcoxon definitely doesn't have a strong assumption,
  • 18:13and (indistinct) has reasonable power
  • 18:15when the sample size is large.
  • 18:17But what if sample sizes are small, right?
  • 18:20So when it's small, we know,
  • 18:21non-parametric tests like Wilcoxon don't have power.
  • 18:25So in this case, we actually proposed Clipper again,
  • 18:30so it can work as a downstream correction tool
  • 18:34for DESeq2 and edgeR.
  • 18:36Because they are supposed to be quite powerful,
  • 18:39even though they find probably too many.
  • 18:41So hopefully, we could use that to borrow their power,
  • 18:44but help them improve the FDR control.
  • 18:47So I'll show the results later in my talk.
  • 18:50That's the second cause.
  • 18:52And the third cause for ill-posed P values
  • 18:54is a little more complicated.
  • 18:56And this is the issue commonly observed in single cell data,
  • 19:00single cell RNA-seq data.
  • 19:01So I will use this analysis
  • 19:03called pseudotime differentially expressed genes as example.
  • 19:08What is a pseudotime?
  • 19:10Pseudotime means it's not real time, it's pseudo, right?
  • 19:13So it's something we inferred
  • 19:16from single cell RNA-seq data,
  • 19:18so those cells are measured all at once.
  • 19:20But we want to infer some time trajectory from the cells.
  • 19:26So I'll just use the screenshot from Slingshot,
  • 19:29which is a method for inferring pseudotime for explanation.
  • 19:34So here, this is a two-dimensional PCA plot of cells,
  • 19:39and the cells are pre-clustered,
  • 19:41so each color represents one cluster.
  • 19:44So the Slingshot algorithm does the following,
  • 19:47first, it takes the cluster centers,
  • 19:51and connects them using the algorithm
  • 19:53called minimum spanning tree.
  • 19:54So if you're not familiar with that,
  • 19:56it has an equivalence with hierarchical clustering actually.
  • 19:59So with the minimum spanning tree, you get this tree,
  • 20:02and then, they smooth out the tree using principal curves.
  • 20:07So we have two curves,
  • 20:08and then for every cell,
  • 20:10we find the closest curve and project the cell to the curve.
  • 20:13So therefore, in each curve,
  • 20:15the projections are called pseudotime values.
  • 20:18And usually, it's normalized between zero and one,
  • 20:21so we need to find the root and call it zero,
  • 20:24the other end is called one.
  • 20:25So this whole process is called pseudotime inference.
  • 20:28In other words, after it, we will give every cell
  • 20:32a pseudotime value in each trajectory.
  • 20:35Okay, so one thing I want to emphasize
  • 20:38is that in this pseudotime inference
  • 20:40we used gene expression values already.
  • 20:43So it's not like we observe pseudotime as external variable,
  • 20:47but it's from the same data.
  • 20:49So I want to show what we could do after the pseudotime.
  • 20:53So a typical analysis is to identify
  • 20:56which genes are differentially expressed
  • 20:58along the pseudotime.
  • 20:59Like the left one, we see, it has this upward trajectory,
  • 21:03so we may call it differentially expressed.
  • 21:06And here, we want to say the pseudotime
  • 21:09represents some cell immune response,
  • 21:12and this is an immuno-related gene,
  • 21:14so we expect to see the upward trajectory.
  • 21:17For the right gene, we expect to see something constant,
  • 21:20so we don't want to call it a DE gene,
  • 21:23that's the intuition.
  • 21:25And I want to say that we must realize,
  • 21:28pseudotime values are random
  • 21:31simply because the cells are a random sample, right?
  • 21:35We need to consider randomness,
  • 21:37and we want to show this to people by doing subsampling.
  • 21:41So you can see that sampling variation
  • 21:43would get into pseudotime values.
  • 21:46Here, every row is a cell.
  • 21:48If I randomly subsample,
  • 21:50say, 80% of cells from the left cells
  • 21:53and redo the pseudotime trajectory inference,
  • 21:57we can see that, for a given cell, across the subsamples
  • 22:01that include it, its pseudotime values will vary to some degree.
  • 22:04So it's not a constant.
  • 22:07Okay, so realizing this, we should consider
  • 22:10the randomness of pseudotime from the data.
  • 22:13However, existing methods all treat pseudotime
  • 22:16as an observed covariate.
  • 22:18So our goal here is to fix this,
  • 22:22and we proposed this method called PseudotimeDE,
  • 22:25which actually does the inference,
  • 22:27which infers whether one gene
  • 22:29is differentially expressed along pseudotime,
  • 22:32while considering pseudotime inference uncertainty.
  • 22:36So what we did exactly is that, here,
  • 22:41to see whether a gene changes with pseudotime,
  • 22:44what's the intuition?
  • 22:45We should do regression, right, do a regression analysis
  • 22:48by treating a gene's expression value as Y,
  • 22:53pseudotime as X, and running a regular regression.
  • 22:55Yeah, this is exactly what existing methods did.
  • 22:58And to make sure the regression
  • 22:59is not restricted to be linear,
  • 23:02and also account for that the gene expression values
  • 23:05are non negative counts.
  • 23:07So actually, we choose the generalized additive model,
  • 23:12which is also used in an existing method,
  • 23:14which I will show very soon.
  • 23:16So this is a very flexible and interpretable model.
  • 23:20So generalized means Y can be non Gaussian
  • 23:24and the other distribution,
  • 23:25just like generalized linear model.
  • 23:28But additive means
  • 23:29that we make the linear model more general,
  • 23:32so every feature can be non linearly transformed,
  • 23:38but the features after transformations are still added.
  • 23:42So that's additive, short as GAM.
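
As a minimal sketch in symbols, the model for gene j along pseudotime t has the form:

```latex
% g is a link function, f_j a smooth (e.g., spline) function, and
% Y_j follows a count distribution such as the negative binomial.
g\!\left(\mathbb{E}[Y_j \mid t]\right) \;=\; \beta_{j0} + f_j(t).
```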
  • 23:44So essentially, once we have a set of cells,
  • 23:47we first infer the pseudotime,
  • 23:49so we order the cells along the pseudotime,
  • 23:52and for gene J,
  • 23:53we check how the gene changes with pseudotime,
  • 23:57so we run the generalized additive model
  • 24:00to obtain a test statistic.
  • 24:02Please note that the generalized additive model has its theory,
  • 24:05so we could use the theory
  • 24:08to get the null distribution and calculate a P value.
  • 24:12And that was done in an existing method.
  • 24:15We want say that this may be problematic
  • 24:18because this whole null distribution
  • 24:20considers pseudotime to be fixed.
  • 24:22So to address this,
  • 24:24we need to consider pseudotime inference
  • 24:26as part of our test statistic calculation.
  • 24:30So to do this, we use the top part.
  • 24:33We actually do subsampling of the cells.
  • 24:36The reason we didn't do bootstrap
  • 24:38is simply because we want the method to be flexible
  • 24:41for pseudotime inference method.
  • 24:43Like I show here, there are Slingshot, Monocle3,
  • 24:47and a few others.
  • 24:48We want it to be flexible,
  • 24:50and some methods don't allow cells to be repetitive,
  • 24:54so bootstrap doesn't apply here.
  • 24:56And we use subsampling with percentage pretty high,
  • 24:59like 80%, 90%, and we did a robustness analysis.
  • 25:03And then, on each subsample, we do pseudotime inference.
  • 25:08With this, how do we get a null distribution
  • 25:11of the test statistic?
  • 25:12What we did is to permute the cells,
  • 25:15so any relationship between the gene J
  • 25:18and the pseudotime is disrupted.
  • 25:20So this can be considered from the null,
  • 25:22and then, we did the same GAM model,
  • 25:25and then, we calculate the values of the test statistic
  • 25:30on these permuted subsamples,
  • 25:32that gave us a null distribution.
  • 25:33So together, we can get a P value, this is what we did.
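
Here is a schematic Python version of that subsampling-plus-permutation recipe. The pseudotime inference (a first-principal-component projection) and the test statistic (absolute Spearman correlation) are simple stand-ins for Slingshot/Monocle3 and the GAM statistic, chosen only to keep the sketch self-contained:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

def infer_pseudotime(X):
    """Stand-in for a pseudotime method: project cells onto the
    first principal component and rescale to [0, 1]."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    t = Xc @ Vt[0]
    return (t - t.min()) / (t.max() - t.min())

def test_stat(expr, t):
    """Stand-in for the GAM statistic: absolute Spearman correlation
    between one gene's expression and pseudotime."""
    return abs(spearmanr(expr, t)[0])

def pseudotime_de_pvalue(X, gene, n_sub=200, frac=0.8):
    """Subsampling + permutation null that propagates the uncertainty
    of pseudotime inference into the p-value."""
    n = X.shape[0]
    observed = test_stat(X[:, gene], infer_pseudotime(X))
    null = np.empty(n_sub)
    for b in range(n_sub):
        idx = rng.choice(n, size=int(frac * n), replace=False)  # subsample
        t_sub = infer_pseudotime(X[idx])      # re-infer pseudotime
        expr = rng.permutation(X[idx, gene])  # break gene-pseudotime link
        null[b] = test_stat(expr, t_sub)
    return (1 + np.sum(null >= observed)) / (1 + n_sub)
```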
  • 25:37And we can show that this approach
  • 25:38indeed can control the P values,
  • 25:41make the P values uniformly distributed on the null,
  • 25:45while the existing method that uses GAM,
  • 25:47but only the theoretical distribution called tradeSeq,
  • 25:50they have some distortion for P values.
  • 25:53And then, you may wonder, what's the consequence?
  • 25:56We can show that, oh, and I should say,
  • 25:58Monocle3 uses generalized linear model and not uncertainty.
  • 26:03So you can see that even though it's not as bad as tradeSeq,
  • 26:07still, some distortion.
  • 26:09So we wanna show
  • 26:10that by calibrating the P value using our way
  • 26:13we can actually discover more functional terms
  • 26:17in our differentially expressed genes.
  • 26:19It means that we can find some new biological functions
  • 26:22that were missed by this new method.
  • 26:24This shows that P value calibration
  • 26:29not just helps with FDR control,
  • 26:31but may also boost some power.
  • 26:34So I just quickly talk about this PseudotimeDE,
  • 26:37but I want to say that its computational time
  • 26:40is the biggest limitation.
  • 26:42Because here, our P value calculation requires many rounds
  • 26:46of subsampling, pseudotime inference, and permutation.
  • 26:50So let's say we want the P value with resolution 0.001,
  • 26:55we need at least 1000 rounds of such things, right?
  • 26:58That will take time.
  • 27:00So the natural question
  • 27:01is can we reduce the number of rounds, right,
  • 27:04and still achieve FDR control?
  • 27:06That becomes similar to my first goal.
  • 27:08Can we get rid of the higher resolution P values,
  • 27:11control the FDR, and then, we will use Clipper again.
  • 27:14So you can see,
  • 27:15Clipper is used throughout all the motivations,
  • 27:18that's why we proposed it,
  • 27:20and I'll talk about it in the next minute.
  • 27:22And the second question we didn't address
  • 27:24is that what if the cells don't follow a trajectory at all?
  • 27:29So clearly in our null hypothesis,
  • 27:32we are assuming there is a trajectory,
  • 27:34it's just that gene J doesn't change with the trajectory.
  • 27:38But what if the trajectory doesn't exist?
  • 27:40So this whole idea of this trajectory pseudotime inference
  • 27:45doesn't make sense, right?
  • 27:46We need to consider that.
  • 27:47But I don't think we have a good way to do it,
  • 27:50unless we can generate cells under a null
  • 27:54where the cells don't follow a trajectory.
  • 27:56So this motivated us to generate cells
  • 27:59that don't follow a trajectory, and we used a simulator.
  • 28:02So that will be the last part I talk about today.
  • 28:06Okay, PseudotimeDE is one such problem
  • 28:09where pseudotime is inferred from the same data.
  • 28:12Another common problem is to do clustering on single cells
  • 28:17to identify cell clusters,
  • 28:19and between cell clusters,
  • 28:21we identify differentially expressed genes.
  • 28:23We call this problem ClusterDE.
  • 28:26But this is also using the data twice, right?
  • 28:29So people have called this double dipping,
  • 28:32meaning that the same data is used twice.
  • 28:36To tackle this problem,
  • 28:38we need to consider the uncertainty in cell clustering,
  • 28:41and there are three existing papers
  • 28:44that try to address this problem,
  • 28:46but they either need to assume a distribution,
  • 28:48like genes follow a Gaussian distribution in every cluster,
  • 28:53or every gene follows a Poisson distribution here
  • 28:57and they need to do count splitting.
  • 28:59So I won't go into the details here,
  • 29:02but I just want to say
  • 29:02that the count splitting approach in my opinion
  • 29:05tackles a different problem.
  • 29:07It conditions on the observed data matrix,
  • 29:11rather than treating it as random.
  • 29:13But I will not talk about the detail here.
  • 29:15So motivated by the challenge in this problem,
  • 29:17we want to propose something not distribution-specific.
  • 29:22We want to use our simulator to generate the null data
  • 29:28and then use Clipper to achieve the FDR control.
  • 29:31So we want to do this non parametrically.
  • 29:34So I think the idea was motivated
  • 29:36by two phenomenal statistical papers.
  • 29:39One is the gap statistic paper,
  • 29:42which was proposed to find the number of clusters
  • 29:45in the clustering problem.
  • 29:47And if you read the paper, I think the smart idea there
  • 29:49is they try to generate data points without clusters
  • 29:54as the negative control.
  • 29:56Then, you can compute some statistic
  • 29:59for your number of clusters,
  • 30:01versus what it would be if there were no clusters, right,
  • 30:03and do the comparison and find the gap.
  • 30:05That's the gap statistic.
  • 30:06And knockoffs gave the theoretical foundation
  • 30:09for FDR control without using high resolution P values.
  • 30:13Okay, so the halftime summary
  • 30:16is that I talked about three common causes
  • 30:18of ill-posed P values.
  • 30:19Hopefully, I have convinced you
  • 30:21that we need something to avoid this problem.
  • 30:25So I talked about Clipper,
  • 30:26the p-value-free FDR control for genomic feature screening.
  • 30:30And as I said, it was motivated and enabled
  • 30:33by the FDR control procedure from this paper.
  • 30:36But the difference here is that we focus
  • 30:39on marginal screening of interesting features.
  • 30:42So in other words, we look at one feature at a time.
  • 30:45In my previous examples,
  • 30:47a feature could be a region or a gene.
  • 30:51So in the original knockoff paper,
  • 30:53their goal is to generate knockoff data
  • 30:57just like fake data for multiple features jointly.
  • 31:01And that's the very challenging part.
  • 31:03But in our case, we don't need that
  • 31:05because we are looking at one feature at a time,
  • 31:07so it's not a multivariate problem,
  • 31:10but it's a marginal screening problem.
  • 31:12So our goal is to get rid of high resolution P values.
  • 31:15So the advantage of this
  • 31:17is we don't need parametric distribution assumptions,
  • 31:20or we don't need large sample sizes
  • 31:22to enable non parametric tests, these are not needed.
  • 31:26We just need to summarize every feature
  • 31:29into a contrast score,
  • 31:31and then, set a cutoff on the contrast scores.
  • 31:35So what do I mean by contrast score?
  • 31:37So for every feature, say, I have d features in total,
  • 31:40we have d contrast scores,
  • 31:43shown as C1 to Cd,
  • 31:45so I'm drawing the histogram
  • 31:47of the distribution of contrast scores.
  • 31:50So if the theoretical assumption is satisfied,
  • 31:54then the features that are null features
  • 31:57should follow a symmetrical distribution
  • 32:00around zero, okay?
  • 32:02And for the features that are interesting
  • 32:04and should be discovered,
  • 32:06the scores should be large and positive, on the right tail.
  • 32:09So the theory of the FDR control just says,
  • 32:12we can find the contrast score cutoff as t,
  • 32:16such that this ratio is controlled under q.
  • 32:20We ought to find the minimum t for this.
  • 32:23What this means is you can consider this ratio
  • 32:26as a rough estimator of the FDR.
  • 32:43The numerator is the left tail count plus one,
  • 32:45and the denominator is the right tail count, maximum with one.
  • 32:49So in other words, we are still trying to avoid dividing by zero.
  • 32:52And the idea is that we want to find a threshold t,
  • 32:56so that the right tail will be called discoveries
  • 32:59and the left tail represents the false discoveries.
  • 33:03That's the intuition.
  • 33:05Because we know, if the feature's null,
  • 33:08then it will be randomly positive or negative.
  • 33:11And the sign is independent of the absolute value.
  • 33:15So that just replaces
  • 33:18the uniform distribution requirement for P values,
  • 33:22we change that to symmetry.
  • 33:24And another thing is that the feature,
  • 33:26if it's large positive, we want to discover it, right?
  • 33:29So this will be the discovery set
  • 33:31and this represents the negative, false discovery set.
  • 33:36So that's the intuition behind this approach.
  • 33:41But for the theory, to really prove it,
  • 33:42we need to use martingale techniques in probability.
  • 33:46And some of the same techniques were used
  • 33:47for the Benjamini-Hochberg procedure,
  • 33:49which is still based on martingales.
  • 33:50So anyway, this allows us to really control the FDR
  • 33:54just using contrast scores.
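
The thresholding rule itself fits in a few lines; this is a minimal sketch of the rule just described, not the packaged Clipper implementation:

```python
import numpy as np

def contrast_score_threshold(scores, q=0.05):
    """Smallest cutoff t with
    (1 + #{C_j <= -t}) / max(#{C_j >= t}, 1) <= q."""
    scores = np.asarray(scores, dtype=float)
    for t in np.sort(np.unique(np.abs(scores[scores != 0]))):
        fdr_hat = (1 + np.sum(scores <= -t)) / max(np.sum(scores >= t), 1)
        if fdr_hat <= q:
            return t
    return np.inf  # no cutoff achieves the target level

# Usage: features with contrast score >= t are the discoveries.
rng = np.random.default_rng(1)
scores = rng.normal(size=1000)   # symmetric null scores
scores[:50] += 4.0               # a heavy right tail of interesting features
t = contrast_score_threshold(scores, q=0.05)
discoveries = np.flatnonzero(scores >= t)
```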
  • 33:56And another thing I found as appealing
  • 33:58is that if you visually inspect the contrast scores,
  • 34:01you can see whether the assumption seems to be reasonable
  • 34:05because you expect to see something symmetrical
  • 34:08plus a heavy right tail.
  • 34:10Okay, so we are currently writing to make this more formal,
  • 34:14so we could actually check
  • 34:15whether the assumption is reasonably holding.
  • 34:18So with this approach,
  • 34:19we can make a lot of the comparison analysis easier
  • 34:26because the key is to find a reasonable contrast score
  • 34:30that satisfies this assumption.
  • 34:32And I can say that there may be multiple contrast scores
  • 34:35that satisfy the assumption, not just a unique one.
  • 34:37Then the difference is power, right?
  • 34:40So we may have a better power
  • 34:41if you have a heavier right tail.
  • 34:44Okay, so for a ChIP-seq peak calling analysis,
  • 34:47we can say that the contrast score
  • 34:49will be comparing the target data
  • 34:52from experimental condition to the null data,
  • 34:55which is the background condition.
  • 34:56They serve a natural pair of contrast,
  • 34:59and we could apply any pipeline to each data,
  • 35:03the same pipeline and then do the contrast, right?
  • 35:06You can imagine, if there's no peak,
  • 35:08then these two values will be,
  • 35:10which one is bigger is equally likely.
  • 35:13And for the RNA-seq analysis,
  • 35:16here, I showed we could use permuted data as the null data
  • 35:20actual data as a target data.
  • 35:22So if we run some test on actual data
  • 35:25to get a test statistic,
  • 35:27we use the same test on permuted data
  • 35:29to get a test statistic, and they serve as a contrast.
  • 35:32And finally, for the PseudotimeDE and ClusterDE,
  • 35:35the single cell problem,
  • 35:37actual data will give us some comparison,
  • 35:40either PseudotimeDE
  • 35:42or the between ClusterDE test statistic.
  • 35:46And if we have some similar data
  • 35:48that represents the null,
  • 35:49like null trajectory, null cluster,
  • 35:52we could run the same pipeline and then do the contrast.
  • 35:55So you see, this actually frees us
  • 35:57from saying we need to derive P values
  • 36:00and we need to know the distribution
  • 36:02by either theory or by numerical simulation, right?
  • 36:05These are all relieved
  • 36:06because we just need to do a contrast.
  • 36:08And the power is gained from the many, many tests,
  • 36:12we look at them together.
  • 36:13So that's why
  • 36:14this idea (background noise drowns out speaker).
  • 36:16Okay, so as I said, we tried to implement Clipper
  • 36:20as a way to improve FDR control,
  • 36:22and we did achieve this
  • 36:24for the popular software Macs and Homer
  • 36:27for ChIP-seq peak calling
  • 36:29and DESeq2 and edgeR for RNA-seq DEG identification.
  • 36:33So you see that they did have inflated FDR,
  • 36:37so the Y axis is the actual FDR,
  • 36:39X axis is the target FDR threshold.
  • 36:42There are inflations,
  • 36:44but with our Clipper as an add-on
  • 36:46to be used downstream of what they output
  • 36:49and do the contrast,
  • 36:50we can largely reduce the FDR to the target
  • 36:54and still maintain quite good power.
  • 36:56So that's the usage of Clipper as an add-on.
  • 37:00And for the single cell part,
  • 37:02I didn't finish about the null data generation.
  • 37:05How do we do it?
  • 37:06Our simulator was proposed partly for this reason,
  • 37:10but it has more uses.
  • 37:12So I just want to say that it's called scDesign3
  • 37:15because we have scDesign and scDesign2 as two previous works.
  • 37:19Now, let's focus on scDesign2
  • 37:20because it is the direct predecessor of scDesign3.
  • 37:24So what scDesign2 does
  • 37:26is it tries to fit a multi-gene probabilistic model
  • 37:30for each cell type,
  • 37:32and then, every gene is assumed to follow
  • 37:35a parametric distribution within the cell type.
  • 37:39And the major contribution
  • 37:41is that we capture gene-gene correlations
  • 37:44using Gaussian copula.
  • 37:45That will make the data more realistic.
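
A compact sketch of the copula idea (empirical marginals stand in for the parametric count models that scDesign2 actually fits per cell type):

```python
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(1)

def simulate_with_gaussian_copula(counts, n_new):
    """counts: (cells x genes) matrix for one cell type.
    Returns n_new synthetic cells with gene-gene correlations preserved."""
    n, p = counts.shape
    # 1. Map each gene to approximate uniforms via its empirical CDF.
    u = (rankdata(counts, axis=0) - 0.5) / n
    # 2. Transform to Gaussian scores; estimate the copula correlation.
    corr = np.corrcoef(norm.ppf(u), rowvar=False)
    # 3. Sample correlated Gaussians, then invert each gene's empirical CDF.
    u_new = norm.cdf(rng.multivariate_normal(np.zeros(p), corr, size=n_new))
    srt = np.sort(counts, axis=0)
    idx = np.minimum((u_new * n).astype(int), n - 1)
    return srt[idx, np.arange(p)]
```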
  • 37:47Here is the comparison.
  • 37:49This is the real data used for fitting the model.
  • 37:52This is the lab (indistinct) test data used for validation,
  • 37:55and this is the synthetic cells using copula.
  • 37:59If we remove the copula, the cells will look like this.
  • 38:02So not realistic at all.
  • 38:04And our data is more realistic than that of other simulators
  • 38:08that did not explicitly capture gene-gene correlations.
  • 38:12Although they have some implicit mechanisms,
  • 38:14the model is different.
  • 38:17Okay, so we realize that scDesign2 is doing a good job
  • 38:21for discrete cell types,
  • 38:23but it cannot generate data like this
  • 38:25from a continuous trajectory.
  • 38:27What we could do is to force the cells to be divided
  • 38:31and then use scDesign2.
  • 38:32But then, you can see the simulated cells are kind of in clusters,
  • 38:35right, unlike the real data.
  • 38:37But with our generalization to the version three,
  • 38:41we now can generate cells from a continuous trajectory.
  • 38:45And I can quickly say that we basically generalize this,
  • 38:48this count distribution per cell type
  • 38:51to a generalized additive model, which I already said.
  • 38:55So we could make it more flexible in general,
  • 38:57and scDesign2 becomes a special case of scDesign3.
  • 39:02And one more thing we could do
  • 39:03is we actually use the vine copula technique,
  • 39:06so we could get the likelihood of how the model fits
  • 39:11the real data, the likelihood of the model,
  • 39:15which can also give us more information.
  • 39:18So besides the single cell trajectory data,
  • 39:21we can also use the idea to generate spatial data.
  • 39:25So here, the modification is that for every gene
  • 39:28we assume a Gaussian process in the 2D space,
  • 39:32so it can have a smooth function
  • 39:34for (indistinct) expression (indistinct).
  • 39:36And also, my other student helped with making the simulator
  • 39:40to generate reads, sequencing reads, not just counts.
  • 39:44So we can go from counts to reads,
  • 39:46and this will give us more functionality
  • 39:48to benchmark some low level tools.
  • 39:51So in short,
  • 39:52the scDesign3 simulator has two functionalities.
  • 39:56One is to do, of course, simulation.
  • 39:59We can generate single cell data from cell types,
  • 40:02discrete, continuous trajectories,
  • 40:05or even in the spatial domain.
  • 40:07We could generate multiple feature modalities,
  • 40:09what we call multi-omics, including RNA-seq,
  • 40:12ATAC-seq, which is a technology
  • 40:14for open chromatin measurement,
  • 40:16CITE-seq, which includes both protein and RNA,
  • 40:19and also DNA methylation.
  • 40:21These are the examples we tried, but we could do even more.
  • 40:24We could allow it to generate data with experimental designs
  • 40:28including sample covariate, conditions, or even batches.
  • 40:33So these can make us generate cases
  • 40:36for more types of benchmarking.
  • 40:39And for interpreting real data,
  • 40:41scDesign3 can give us model parameters,
  • 40:45so we can know whether a gene has different means
  • 40:47in two cell types,
  • 40:49whether a gene has a certain change along pseudotime,
  • 40:52or a gene has a certain change in two dimensional space.
  • 40:55And also, as I said,
  • 40:56we can output a likelihood that can give us a way
  • 40:59to calculate the Bayesian information criterion, BIC,
  • 41:03so we could evaluate
  • 41:04whether some pseudotime describes data well,
  • 41:07whether the algorithm for pseudotime inference
  • 41:09does a good job,
  • 41:11or whether the clusters explain data well.
  • 41:13So these are the things we could do.
  • 41:15And finally, to generate the null data
  • 41:18for the Clipper (indistinct),
  • 41:20we can alter the model parameters.
  • 41:22Like this is what we fit from real data,
  • 41:25we could change the model parameters
  • 41:27to make the gene no longer differentially expressed,
  • 41:30have the same mean in two subtypes.
  • 41:33Or, after we fit real data
  • 41:35with two cell types or two clusters,
  • 41:37we could change the cluster parameter
  • 41:40to make sure the cells come from one cluster
  • 41:42instead of two clusters.
  • 41:44So these are the things we could do with the model.
  • 41:47And so this is our paper;
  • 41:49more details are in the paper, which has been posted,
  • 41:52if you are interested.
  • 41:54And I want to just quickly show
  • 41:56how the ClusterDE analysis could be done.
  • 41:59This is the real data with two clusters.
  • 42:01I want to say that this is the case
  • 42:03where permutation wouldn't work.
  • 42:05If you just permute the cluster labels,
  • 42:07the cells will look like the same cells, right?
  • 42:10They're still two clusters.
  • 42:11But if you use our simulator,
  • 42:13we could generate cells from one cluster
  • 42:15that reflects the complete null, there's no cluster.
  • 42:19And the use of this can be shown in this example.
  • 42:23There's only one cluster,
  • 42:24but if we use clustering algorithms,
  • 42:27like these two choices, Seurat is a popular pipeline,
  • 42:31Kmeans is the standard classical algorithm,
  • 42:34using either to force the cells into two clusters,
  • 42:38we are using gene expression data.
  • 42:40So no wonder that if you look at a gene's expression
  • 42:44between the two clusters, you may call it DE,
  • 42:46but that's not interesting, since there's no clusters.
  • 42:50So if we use our scDesign3 to generate null data,
  • 42:54in this case, null data should be very similar to real data.
  • 42:58It still has only one cluster.
  • 43:00Then, if we run Seurat or Kmeans,
  • 43:04similarly, on null data,
  • 43:06we would divide the cell in a similar way,
  • 43:09and then, if you do a contrast of the two sets of results,
  • 43:12you should see no big difference.
  • 43:14That's the idea for controlling FDR.
  • 43:16So indeed, in that example, if we're just naively wrong,
  • 43:21the Seurat pipeline clustering followed by some tests
  • 43:25like t, Wilcoxon, bimodal,
  • 43:28yeah, you will see FDR is one.
  • 43:30The reason is you keep finding D genes,
  • 43:33even though there's no cluster.
  • 43:34But using our approach,
  • 43:35we could control the FDR reasonably well.
  • 43:38So that's the predominant results for this purpose
  • 43:42for this task, so that summarizes my talk today.
  • 43:46And finally, I just want to make a few notes
  • 43:48to give some messages.
  • 43:50I talk about multiple testing,
  • 43:52but in many scientific problems,
  • 43:54I think the key is whether it should be formulated
  • 43:57as a multiple testing problem.
  • 43:59So actually, to address this question,
  • 44:01I wrote a perspective article
  • 44:03with my collaborator Xin Tong at USC.
  • 44:06We try to distinguish statistical hypothesis testing
  • 44:10from machine learning binary classification.
  • 44:13They seem similar
  • 44:14because both would give you a binary decision, right?
  • 44:17But I can say that testing is an inference problem,
  • 44:20classification is a prediction problem.
  • 44:23So if you really think about it,
  • 44:25their fundamental concepts are different.
  • 44:27So that's why we wrote this, to really talk with biologists
  • 44:31and computational people who use the two simultaneously.
  • 44:35So if you're interested, you can check it out.
  • 44:37And finally, I wanna say that,
  • 44:40so if it's a multiple testing problem,
  • 44:43I talked about three common causes of ill-posed P values,
  • 44:48and I propose a solution, Clipper,
  • 44:50for simplifying this problem by just using contrast scores,
  • 44:55and then, set a cutoff.
  • 44:56And the simulator, which we hope to be useful
  • 44:59for the single cell and spatial omics field
  • 45:01because this field is so popular,
  • 45:03we have more than 1000 methods already.
  • 45:05So benchmarking seems to be something quite necessary.
  • 45:08Because if there's no benchmarking,
  • 45:11then maybe new methods wouldn't have much of a chance
  • 45:14because people may still use the older methods
  • 45:16that are better cited.
  • 45:18Okay, so these are the papers related to my talk today.
  • 45:23And so, finally, I want to say that,
  • 45:26so if you're interested, you want to check them out,
  • 45:29and let me know if you have any questions.
  • 45:31So finally, I'll just say this,
  • 45:33this is something quite interesting.
  • 45:34It's another paper we just recently wrote,
  • 45:37and I can say,
  • 45:38it should be online in Genome Biology very soon.
  • 45:40So we actually did this benchmark
  • 45:43for the so called QTL analysis in genetics, right?
  • 45:47Quantitative Trait Locus mapping.
  • 45:50So in this analysis,
  • 45:51a common procedure is to infer hidden variables
  • 45:55from the data, like the gene expression matrix,
  • 45:57to do hidden variable inference.
  • 46:00For the most part, (indistinct) the classical PCA,
  • 46:03and several methods propose specific (indistinct).
  • 46:07And my student Heather, I actually give her full credit,
  • 46:10she was so careful and she really wanted to understand
  • 46:13the method before using it,
  • 46:14and that led to this project.
  • 46:17She wanted to see, huh, do I really see advantages
  • 46:19of this new method even compared to PCA?
  • 46:22But that's what she found, right?
  • 46:24PCA still seems to be the most stable,
  • 46:26robust, and also fastest algorithm,
  • 46:30but this is one of the reviewer's comments
  • 46:32I wanna share with you.
  • 46:34These results may come as a surprise to some,
  • 46:37given the nearly un-contestable status
  • 46:39that method A has achieved within the community.
  • 46:42But sadly, they reflect the fact
  • 46:44that computational biology methods can rise to fame
  • 46:47almost by accident rather than by sound statistical arguments.
  • 46:50So if you're interested,
  • 46:52you can check out this paper, it's on bioRxiv.
  • 46:54But anyway, I think it says how important it is
  • 46:57for statisticians to convey our message, right?
  • 47:00Why do we need statistical rigor, why does it matter?
  • 47:04So for our students,
  • 47:05if you want to know more about GAM and copulas,
  • 47:08there are two books I want to recommend.
  • 47:10So they're very good introductory textbooks,
  • 47:12so you can know the (indistinct).
  • 47:14Finally, I want to thank my collaborator at UC Irvine,
  • 47:19my students for all their tremendous work I talked about today,
  • 47:23and also the funding agencies for giving us the support.
  • 47:26So thank you very much.
  • 47:36<v Attendee>A question?</v>
  • 47:38<v Jingyi>Yes.</v>
  • 47:39<v Attendee>So I was really curious</v>
  • 47:40about the analysis of like the large patient sample.
  • 47:45I know that there has in fact
  • 47:46been extensive discussion on it.
  • 47:48<v ->Yeah, yeah.</v> <v ->Which is</v>
  • 47:52interesting, to say the least, how it's gone down.
  • 47:55But I was kinda curious,
  • 47:56the way that it was presented here made me think about like,
  • 48:01apologies, if this is like a path that's already been tread,
  • 48:07so, yeah, the bar graph.
  • 48:11<v Jingyi>Yeah.</v>
  • 48:12<v Attendee>Yeah, so it sort of,</v>
  • 48:16it makes me wonder about the application
  • 48:19of the term false discovery in different contexts.
  • 48:23And taking patients, you can imagine,
  • 48:26there can be like unintended structure
  • 48:29within those populations.
  • 48:32And by (interference drowns out speaker) chance,
  • 48:34if there is 30,000 potential transcripts
  • 48:38that you're looking at, there might actually be,
  • 48:41between individuals who are not isogenic,
  • 48:44truly differentially expressed genes
  • 48:47between even permuted groups.
  • 48:50And so I'm wondering if there's a useful distinction
  • 48:53between a false discovery and a true,
  • 48:56but uninteresting discovery.
  • 49:00<v Jingyi>I think it depends on how you define truth.</v>
  • 49:04I think that's the key.
  • 49:05But what is the definition of DE genes?
  • 49:08I wanna say, to be exact,
  • 49:10the definition of DE genes in DESeq2
  • 49:14and edgeR, and that of Wilcoxon, are different.
  • 49:18Because in Wilcoxon, a DE gene is defined like this:
  • 49:21a gene has two distributions,
  • 49:24one under each condition,
  • 49:27and if I randomly take one observation
  • 49:29from each distribution from each condition,
  • 49:32is the chance that one is bigger than the other
  • 49:35equal to 0.5?
  • 49:36That's the Wilcoxon question.
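
In symbols, that null hypothesis is:

```latex
% X is one draw from condition 1, Y one draw from condition 2.
H_0:\ \Pr(X > Y) = \tfrac{1}{2}.
```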
  • 49:38While for DESeq2 and edgeR, their DE gene definition
  • 49:42is that the negative binomial means are different.
  • 49:45But clearly, you can see, it only makes sense
  • 49:48if the negative binomial is a reasonable distribution,
  • 49:51that's the key.
  • 49:52So that's why in theory,
  • 49:53if the negative binomial is no longer valid or reasonable,
  • 49:58then why should we define a DE gene
  • 50:00based on a negative binomial mean difference?
  • 50:03I think that's kind of my answer to your question.
  • 50:06But the tricky thing about statistical inference
  • 50:09compared to supervised learning
  • 50:11is that we don't observe the truth, that's always the case.
  • 50:14So we're making a guess.
  • 50:16Frequentist people have one way to guess,
  • 50:20Bayesian people have another way to guess.
  • 50:21And so one issue I've seen in the Twitter discussion
  • 50:24is that several people try to,
  • 50:26maybe not intentionally,
  • 50:28confuse frequentist concepts with Bayesian concepts,
  • 50:32but they're not really comparable, right?
  • 50:34You cannot talk about them in the same ground.
  • 50:36That's a problem, and here, our criterion,
  • 50:39the false discovery rate, is a frequentist criterion,
  • 50:42it relies on P values, right?
  • 50:44So therefore, you cannot use Bayesian arguments
  • 50:46to argue against such a frequentist approach.
  • 50:49Because you are doing frequentist, right?
  • 50:52But whether frequentist makes sense or not,
  • 50:54that's a different topic.
  • 50:55Hopefully, that answers your question.
  • 50:57<v ->Yeah, thank you.</v> <v ->Thank you.</v>
  • 51:01Yes. <v ->Hello,</v>
  • 51:02thank you much for your talk,
  • 51:03and I think that is very interesting.
  • 51:05However, I have a question on slide 26 actually.
  • 51:13It's about what you said that,
  • 51:17maybe 26.
  • 51:20<v Jingyi>26, okay, yeah.</v>
  • 51:23<v Attendee>Yeah, you said</v>
  • 51:24that like it is a multi-gene probabilistic model
  • 51:28for cell type.
  • 51:30However, I'm a little bit confused
  • 51:33about how you define the cell type.
  • 51:37But basically, from my own understandings,
  • 51:40that after you get, for example, the single cell RNA-seq data,
  • 51:45you will use Seurat to get the clusters.
  • 51:48<v Jingyi>Yeah.</v>
  • 51:49<v Attendee>And you will annotate this cluster</v>
  • 51:52based on the-
  • 51:54<v ->Knowledge, yeah.</v> <v ->Gene.</v>
  • 51:56And then, if this model based on your
  • 52:04annotation of, okay.
  • 52:08<v Jingyi>Yeah, I see your point.</v>
  • 52:10Essentially, yeah, we need cell cluster to be pre-defined.
  • 52:13So if it's not reasonable, then, yes,
  • 52:15it will affect the results for sure.
  • 52:17Because the key is that you need to make sure
  • 52:20it is reasonable to assume a gene follows
  • 52:23one of the four distribution within a cluster, right?
  • 52:26So that's why there are methods
  • 52:28that try to refine clustering
  • 52:30by checking the negative binomial distribution.
  • 52:34So there are several research on that,
  • 52:36and they're trying to refine that.
  • 52:37But basically, we are building on those methods
  • 52:40to do the simulation, that's what we do.
  • 52:43But again, so that's why this is the problem with scDesign2,
  • 52:46but scDesign3 sort of tries to address this problem
  • 52:50by providing the BIC.
  • 52:52So if the input clusters are bad,
  • 52:55then you can see that in the BIC.
  • 52:57Because the likelihood will not be there, yeah.
  • 52:59<v Attendee>A similar question.</v>
  • 53:02I have another question
  • 53:04is that basically I assumed (indistinct) that
  • 53:09the experiments have replicates;
  • 53:13however, in some situations,
  • 53:16maybe we do not have replication.
  • 53:20But in this situation, how could we control the FDR?
  • 53:26If we do not have replicates,
  • 53:28then we cannot get the P value.
  • 53:30<v Jingyi>That's exactly the point of this talk.</v>
  • 53:31The only part that has replicates is the RNA-seq part.
  • 53:35The second part, that's the only part we have replicates.
  • 53:37In the first part, when we do the ChIP-seq,
  • 53:39it's just one replicate per condition, right?
  • 53:42That's why I said P value calculation would be challenging.
  • 53:45Right, so the reason we could control the FDR
  • 53:47without using P values
  • 53:49is just because we have many, many tests.
  • 53:52So that's why we're doing this large scale testing.
  • 53:56I think the idea, if you check it out,
  • 53:59Brad Efron has talked about it extensively in his book
  • 54:04on large-scale inference, and his idea of Empirical Bayes
  • 54:07is very similar to this.
  • 54:08We try to borrow information across tests
  • 54:11to set a threshold.
  • 54:13Yeah, hopefully that answers your question.
  • 54:15Yeah?
  • 54:16(interference drowns out speaker)
  • 54:21Yeah, sounds good, thank you.
  • 54:23(interference drowns out speaker)
  • 54:26Thank you.