
YSPH Biostatistics Seminar- Dr. Jingshu Wang

February 10, 2021
  • 00:04- Right, so I think while we're waiting,
  • 00:06I'll just give a very brief introduction about Jingshu.
  • 00:11Jingshu is an Assistant Professor from the Stats Department
  • 00:14at University of Chicago.
  • 00:17And today she's going to present some very exciting work
  • 00:22on trajectory inference for the single cell data.
  • 00:27I'm very excited to hear about her work.
  • 00:31I think...
  • 00:34Let's wait for two minutes
  • 00:35and then we'll start with Jingshu's work.
  • 00:40So if you have any other related questions about Jingshu
  • 00:45before the talk starts, you're free to ask as well.
  • 00:48(chuckles)
  • 00:51- Hi Lynn. Can you make me a co-host?
  • 00:53- Oh right. Thanks for reminding me.
  • 00:58Let me see.
  • 01:14So one more minute to start.
  • 01:42- So you can see my screen, right?
  • 01:43- Yes. I can see your screen. Looks good to me.
  • 01:48Maybe I'll hand it over to you now Jingshu I think,
  • 01:52if (indistinct) late I think,
  • 01:56they can ask questions if you miss any details.
  • 01:59- Yes. Okay.
  • 02:01So thanks everyone for coming
  • 02:04and thanks for the introduction and the invitation.
  • 02:08Today I will talk about the Single-Cell RNA Sequencing Data
  • 02:13and how we can learn the cell dynamics
  • 02:16from the single-cell RNA sequencing.
  • 02:23So single-cell RNA sequencing is a newly-developed
  • 02:30but also relatively mature technology
  • 02:33for measuring the RNA expression levels in the cells.
  • 02:38And the traditional microarrays or bulk RNA sequencing
  • 02:43measure the gene expressions
  • 02:45as the average across all cells in a tissue.
  • 02:49However, a tissue is made of many cells
  • 02:56and the cell population is typically not homogeneous
  • 03:00and the cells can have different functions
  • 03:02and different cell types.
  • 03:04So in contrast, in single-cell RNA sequencing,
  • 03:08we have measured the transcriptional profile
  • 03:12in each individual cell.
  • 03:14So we can expand this vector
  • 03:16of gene expressions for a tissue to a matrix
  • 03:19of the gene expressions in the cells.
  • 03:23And each entry is a measured RNA count
  • 03:28for a particular gene in a particular cell.
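The contrast just described (one averaged vector for a bulk sample versus a cells-by-genes matrix) can be sketched numerically; the counts below are made up for illustration:

```python
import numpy as np

# Hypothetical counts: 4 cells (rows) x 3 genes (columns).
# Real single-cell RNA-seq data has thousands of cells and
# tens of thousands of genes.
counts = np.array([
    [5, 0, 2],   # cell 1
    [4, 1, 0],   # cell 2
    [0, 9, 3],   # cell 3
    [1, 8, 2],   # cell 4
])

# Bulk RNA sequencing only sees the average expression across all
# cells in the tissue, collapsing the matrix to a single vector.
bulk_profile = counts.mean(axis=0)
```

Heterogeneity between cells (e.g., genes 1 and 2 marking different cell types above) is invisible in `bulk_profile`.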
  • 03:31So the benefit
  • 03:34is that we have a more detailed understanding of what is going on
  • 03:40in the tissue.
  • 03:41So the benefit of single-cell RNA sequencing,
  • 03:44is that it can give you a relatively unbiased
  • 03:48and complete picture of the cell population.
  • 03:51And this is particularly useful
  • 03:54when the cell population is complicated.
  • 03:57For example when the cells are experiencing dynamic changes.
  • 04:04And as an application of the method
  • 04:06that I will introduce today in this lecture, in this talk,
  • 04:11I will focus on the study of the mouse neocortex.
  • 04:18This is a cartoon showing the migration and generation
  • 04:23of the projection neurons in the mouse neocortex.
  • 04:28You can see that this is quite a complicated process
  • 04:33and there are still a lot of things that are unknown
  • 04:37about the neuronal diversity and the mechanism
  • 04:40of how the projection neurons are generated.
  • 04:45And the goal,
  • 04:45is that we want to use the single-cell RNA sequencing
  • 04:49so that we can have a more complete understanding
  • 04:52of this, the neuronal diversity and the neuron development.
  • 04:57So you can see in this cartoon
  • 05:03that there are different shapes and colors
  • 05:06to represent different cell types in the neocortex.
  • 05:12As the cells are experiencing continuous dynamic changes,
  • 05:17in the real cell population
  • 05:19it is much more complicated than that.
  • 05:22There are no clear boundaries
  • 05:25between different cell types
  • 05:28and there may not even be a clear definition of cell type.
  • 05:33So, what we hope,
  • 05:35is that we want to use single-cell RNA sequencing
  • 05:38to first recover the trajectory of the dynamic changes
  • 05:43or the developmental process
  • 05:45that the cells are experiencing.
  • 05:48So specifically we focus on two datasets.
  • 05:52One data set, we name it as data set A.
  • 05:56So this is a data set
  • 05:58that is recently collected by my collaborator.
  • 06:01And so we have samples...
  • 06:05The cells from the mouse neocortex
  • 06:08at six different embryonic days.
  • 06:11And before our data,
  • 06:13there is another dataset, which we name dataset B.
  • 06:17And this dataset is a smaller dataset than ours
  • 06:22but they have also sequenced
  • 06:23a very similar brain region
  • 06:28of the mouse, and they have sequenced cells
  • 06:32from four different embryonic days.
  • 06:34So you can see that
  • 06:37most of the days that are sequenced in our dataset
  • 06:41and in the other dataset B do not overlap.
  • 06:44And so it would be beneficial
  • 06:49if we can combine the two datasets
  • 06:51so that we can make use of the cells from both studies.
  • 06:56For instance, for our dataset,
  • 06:59we don't have cells from day 11,
  • 07:04which is quite an important day.
  • 07:06For example, day 11 is the day when
  • 07:11the projection neurons
  • 07:17begin to be generated.
  • 07:21And the E11 cells are sequenced in the other dataset.
  • 07:26So it would be beneficial
  • 07:28if we can perform a joint analysis of the two datasets
  • 07:32and learn a shared developmental trajectory
  • 07:35as these two datasets,
  • 07:36are actually sequencing the same mouse brain region.
  • 07:42So as you may have imagined, if we don't do anything,
  • 07:47if we just concatenate the cells from two datasets
  • 07:51and treat them as datasets from the same lab,
  • 07:54then the cells from the two datasets
  • 07:57will actually not merge
  • 07:59because of the batch effects between the two datasets.
  • 08:03Because these are from two labs
  • 08:05and they have different sequencing machines
  • 08:08so the cells become different,
  • 08:10though they are coming from the same brain region.
  • 08:14And this is a figure called the UMAP
  • 08:17which is a two-dimensional projection
  • 08:20of the high dimensional,
  • 08:21observed single-cell RNA sequencing data
  • 08:24so that we can have a visualization
  • 08:27of the cell population.
  • 08:28(clears throat)
  • 08:29And using our method, which is called VITAE,
  • 08:33that I will introduce later
  • 08:35we can merge the cells from the two different sources.
  • 08:41And as I will show later,
  • 08:44we can also keep the uniqueness, the unique characteristics
  • 08:49that only exist in one of the datasets.
  • 08:51So we can keep the biologically meaningful differences
  • 08:55between the two datasets.
  • 08:57And our method is actually not just a
  • 09:01data integration approach.
  • 09:03So what we can do, is that we can also simultaneously,
  • 09:08learn a shared trajectory structure
  • 09:12and we can at the same time do the data integration
  • 09:16or more generally correct for confounding effects
  • 09:19such as the data source and other variables like cell cycle.
  • 09:25And in the...
  • 09:27In this figure,
  • 09:28the arrows show the direction of the developmental process
  • 09:34and the line width represents the score for an edge.
  • 09:38So it shows how confident we are
  • 09:43about whether there is a transition
  • 09:45between the two states that the line connects.
  • 09:55So our method actually belongs to a larger group
  • 09:59of computational tools for single-cell RNA sequencing
  • 10:03which is called the trajectory inference.
  • 10:05So note that
  • 10:08trajectory inference here
  • 10:09is different from statistical inference.
  • 10:13It's a computational tool
  • 10:15so that we can understand the cell lineage
  • 10:20and the cell fate decisions in biological process,
  • 10:25such as cell differentiation
  • 10:27as what we have already seen in the mouse neocortex,
  • 10:31and some other biological process,
  • 10:33such as immune response, cancer expansion
  • 10:36and many more are using single-cell RNA sequencing data.
  • 10:41In general, the trajectory inference approaches
  • 10:44will either infer or start with
  • 10:50a type of underlying trajectory structure.
  • 10:55Many methods will assume a specific type of trajectory structure
  • 11:00for the underlying developmental process,
  • 11:03such as a linear structure or linear topology,
  • 11:06or a bifurcation or tree-like trajectory.
  • 11:13And as the cell populations
  • 11:14that we are trying to understand,
  • 11:16become more and more complicated,
  • 11:18recent methods also try to infer,
  • 11:21the type of the trajectory structure
  • 11:23from the observed single-cell RNA sequencing data.
  • 11:27And once we have learned the trajectory structure,
  • 11:30these trajectory inference approaches
  • 11:33will computationally project
  • 11:36and order the cells along the trajectory.
  • 11:39And the order of the cells along the trajectory
  • 11:43is called the pseudotime of the cells.
  • 11:46So the trajectory inference,
  • 11:48is also called the pseudotime analysis.
  • 11:53So the first trajectory inference method was proposed in 2014
  • 12:01and since then it has become a very popular tool
  • 12:05that is used in analyzing single-cell RNA sequencing data.
  • 12:09And this study summarizes the number
  • 12:15of single-cell RNA sequencing studies
  • 12:16that are published per month.
  • 12:19And you can see that in recent years,
  • 12:21more than half of the published
  • 12:24single-cell RNA sequencing studies
  • 12:26will have some investigation of the pseudotime
  • 12:29and trajectories in the cell population
  • 12:31that they are investigating.
  • 12:35And there have also been a lot of methods
  • 12:39for trajectory inference.
  • 12:41And there is
  • 12:45a comprehensive benchmarking paper,
  • 12:48recently in "Nature Biotech",
  • 12:50and it has summarized 70 different trajectory methods.
  • 12:55And in their paper they have compared
  • 12:57about 45 different trajectory inference methods
  • 13:00from different aspects.
  • 13:03So you may wonder,
  • 13:04since there are so many trajectory inference methods
  • 13:07that are already there, why do we still want to develop,
  • 13:11a new trajectory inference method?
  • 13:14So the first point is that,
  • 13:17although we have 70 different methods,
  • 13:19many trajectory inference methods,
  • 13:22they are assuming a specific type
  • 13:24of the trajectory structure.
  • 13:26So many methods only work for certain developmental processes.
  • 13:31If you consider the methods
  • 13:34that have the flexibility
  • 13:35to work for a variety of trajectory structures,
  • 13:39then we don't have that many methods that are available.
  • 13:43And another concern that I have is that most methods,
  • 13:48these trajectory inference methods,
  • 13:51do not have explicit statistical models.
  • 13:55So what I mean is that,
  • 13:56though people are kind of clear
  • 13:59about what's the biological signal
  • 14:01that we want to find in the trajectory inference,
  • 14:06many methods are actually pretty vague about,
  • 14:11from the perspective of the single-cell data matrix,
  • 14:16what can be the definition of the trajectory
  • 14:19that they want to infer,
  • 14:25and how the data can be modeled and generated
  • 14:28with the trajectory structure.
  • 14:30So as a statistician, I think it would be beneficial,
  • 14:34if we have a model-based trajectory inference approach,
  • 14:38so that we can better understand
  • 14:41how good our estimations are
  • 14:44and have some uncertainty quantification
  • 14:46of the trajectories or pseudotimes that we infer.
  • 14:54And the third point is that
  • 14:56as you have shown at the beginning,
  • 14:59there is also a growing need,
  • 15:01to efficiently align trajectories
  • 15:05or do a joint analysis
  • 15:07from multiple single-cell RNA sequencing datasets.
  • 15:10As the single-cell RNA sequencing datasets are expanding,
  • 15:18there have already been a lot of datasets
  • 15:21for the same tissue or for the same cell type.
  • 15:25And it will be...
  • 15:26(clears throat)
  • 15:27And we can learn a better picture of
  • 15:31the biological process in the tissue or for the cell type,
  • 15:35if we can use all available datasets.
  • 15:39And so there's a strong need,
  • 15:41an increasing need to do this joint trajectory analysis
  • 15:46for multiple datasets.
  • 15:50So because of these reasons,
  • 15:52we developed a new statistical framework and a new method,
  • 15:57and we call it VITAE,
  • 15:59which is short for Variational Inference
  • 16:02for Trajectory by AutoEncoders.
  • 16:05And it is a model-based trajectory inference approach.
  • 16:11So our model starts with a definition
  • 16:15of the trajectory backbone.
  • 16:18So we use a graph to define the trajectory backbone.
  • 16:22So we start with a complete graph G,
  • 16:25well the vertices are the distinct cell states and cell type
  • 16:31and an edge denotes a possible transition
  • 16:34between two cell states and or cell types.
  • 16:39And then we can define a cell position on the graph
  • 16:43which is a vector, which is a landscape vector.
  • 16:47And it's...
  • 16:48A K is the number of vertices on the graph,
  • 16:51in the graph.
  • 16:52So if a cell is exactly,
  • 16:55belongs to one cell state or cell type,
  • 16:58then it is on cell vertex.
  • 17:01And if the cell is experiencing a transition
  • 17:05between two cell states or cell types,
  • 17:08then we denote it as on the edge between two vertices.
  • 17:16And then we can define the trajectory backbone
  • 17:18as a subgraph of G.
  • 17:20So we only include an edge or vertex
  • 17:23if we really observe cells that are on the edge.
  • 17:27So though there are many possible transitions
  • 17:30between the cell types or cell states,
  • 17:33we include a transition only
  • 17:38when we do observe cells
  • 17:39that are experiencing such transitions.
  • 17:42So though there can be many edges in our complete graph,
  • 17:48on the sub graph, it's a sparse success of the other edges
  • 17:52that are possible on the graph.
  • 17:57And a benefit of the above definition,
  • 18:01is that we can allow any types of the trajectory structure.
  • 18:05So it can be either a linear structure,
  • 18:07a bifurcation, a tree-like structure or a cycle;
  • 18:12it completely depends on what the data shows.
  • 18:16And so we allow,
  • 18:18we want the data to automatically determine the other,
  • 18:24the trajectory structure or topology
  • 18:26of the underlying dynamic process.
  • 18:33And we can also define the pseudotime for each cell.
  • 18:38I have not written down the exact definition here
  • 18:43but the idea is that we first need a root vertex.
  • 18:46So a root vertex is the start of this dynamic process.
  • 18:51And it can be given by the user,
  • 18:54depending on looking at the marker genes
  • 18:57or other prior biological information.
  • 19:00And later we will also...
  • 19:02I will also show you that for some datasets,
  • 19:04we can automatically determine the root vertex.
  • 19:08And with a given root vertex,
  • 19:09then the graph becomes a directed graph.
  • 19:13And we define the pseudotime of the cell
  • 19:17as the shortest path from the root to a specific cell
  • 19:22along the trajectory, along the trajectory backbone.
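The pseudotime definition just described can be sketched with a toy example; the backbone, vertex names and edge lengths below are hypothetical, not from the talk's data. Given a root, pseudotime is the shortest-path distance along the backbone:

```python
import heapq

# A hypothetical trajectory backbone: vertices are cell states,
# edge weights are distances along the inferred trajectory.
backbone = {
    "root": {"A": 1.0, "B": 2.0},
    "A": {"root": 1.0, "C": 1.5},
    "B": {"root": 2.0},
    "C": {"A": 1.5},
}

def pseudotime_from(root, graph):
    """Dijkstra: shortest-path distance from the root vertex to
    every reachable vertex along the trajectory backbone."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[v].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

times = pseudotime_from("root", backbone)
# "C" is reached through "A": 1.0 + 1.5 = 2.5
```

A cell sitting partway along an edge would get the distance to its projected point rather than to a vertex; the vertex version above is the minimal sketch.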
  • 19:28So this graph defines the trajectory structure.
  • 19:33And the next step,
  • 19:34is that we want to link the trajectory structure
  • 19:38with the data generation model.
  • 19:41So the single-cell RNA sequencing data matrix,
  • 19:46is typically a high dimensional matrix
  • 19:50because for each cell,
  • 19:51we typically observe tens of thousands of genes
  • 19:54and there are also complicated dependency relationships
  • 19:59among the genes.
  • 20:01And what we assume,
  • 20:02is that we assume that these dependencies across genes,
  • 20:06can be explained by latent variables Z(i)
  • 20:10in a low dimensional space.
  • 20:13And we assume that these latent variables,
  • 20:18are following normal distributions
  • 20:20and they also have the graph structure.
  • 20:25So U here, are the positions of the vertices on the graph
  • 20:29in this low dimensional space,
  • 20:32and the mean of Z(i)
  • 20:34is a linear combination of the positions of these vertices,
  • 20:41depending on the cell's position on the graph.
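The latent mean just described can be written as a tiny numerical sketch; the vertex positions and cell positions below are made-up illustrations:

```python
import numpy as np

# Hypothetical low-dimensional vertex positions U: K = 3 vertices
# embedded in a 2-dimensional latent space (one row per vertex).
U = np.array([
    [0.0, 0.0],   # vertex 1
    [4.0, 0.0],   # vertex 2
    [0.0, 4.0],   # vertex 3
])

# A cell's position on the graph is a length-K vector that sums
# to one.  A cell sitting exactly on vertex 2:
w_on_vertex = np.array([0.0, 1.0, 0.0])

# A cell 25% of the way along the edge from vertex 1 to vertex 2:
w_on_edge = np.array([0.75, 0.25, 0.0])

# The mean of the latent variable Z(i) is the corresponding
# linear combination of the vertex positions.
mu_vertex = w_on_vertex @ U
mu_edge = w_on_edge @ U
```

So cells on a vertex have latent means at that vertex, and cells in transition have means interpolated along the straight line between the two vertex positions.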
  • 20:46And,
  • 20:49what I want to emphasize here is one point,
  • 20:52which is that we assume a non-linear mapping
  • 20:55from the latent space to the high dimensional observed data,
  • 20:59because we think that though in the low dimensional space,
  • 21:03we can represent the trajectory as these linear lines,
  • 21:07it is very likely a manifold on the observed data.
  • 21:10So this non-linear mapping,
  • 21:12can map these linear lines to curves
  • 21:16in the high dimensional space.
  • 21:18And now to consider,
  • 21:20to account for the confounding covariates,
  • 21:23such as the data source or cell cycle,
  • 21:27we also allow this non-linear mapping,
  • 21:30to depend on these covariates.
  • 21:33And here,
  • 21:35for the observed data counts,
  • 21:37we assume they follow a negative binomial distribution,
  • 21:41where the L(i) are the known library sizes
  • 21:44(sorry for the typo on the slide; it should say known library sizes),
  • 21:46and the dispersion parameters, one for each gene,
  • 21:51are unknown parameters.
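The negative binomial observation model can be sketched numerically. The mean parameterization below (known library size times a gene-level rate) and all numbers are illustrative assumptions, not the paper's exact model:

```python
from math import lgamma, log, exp

def nb_log_pmf(y, mean, dispersion):
    """Log probability of count y under a negative binomial
    distribution parameterized by its mean and a dispersion r,
    so that variance = mean + mean**2 / r."""
    r = dispersion
    return (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
            + r * log(r / (r + mean)) + y * log(mean / (r + mean)))

# A cell's expected count for gene g scales with its known
# library size L(i); the numbers here are hypothetical.
l_i, rate_g, r_g = 2000.0, 0.001, 5.0
mean_ig = l_i * rate_g   # expected count = 2.0
probs = [exp(nb_log_pmf(y, mean_ig, r_g)) for y in range(200)]
```

The over-dispersion relative to a Poisson (variance exceeding the mean) is what makes this distribution the common choice for RNA-seq counts.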
  • 21:53And so in the current model,
  • 21:55the unknown parameters we have
  • 21:57are the vertex positions U,
  • 22:00the cell positions on the graph W(i),
  • 22:03the non-linear mapping,
  • 22:06and these unknown dispersion parameters.
  • 22:08So we have a lot of parameters.
  • 22:11So to further simplify our estimation,
  • 22:15we assume that there is a mixture prior
  • 22:18on the cell positions.
  • 22:20So it's a fairly simple idea.
  • 22:23So we assume that, first,
  • 22:27there are some latent variables C(i) for each cell.
  • 22:30And so Ci determines which edge or vertex a cell chooses.
  • 22:35So the cell has some probability to choose a specific edge
  • 22:39or a specific vertex,
  • 22:42and if it chooses an edge,
  • 22:46then it also chooses
  • 22:51its relative location on the edge.
  • 22:55So that's what forms the mixture prior on the cell positions.
  • 23:00And
  • 23:02for these non-linear mapping functions,
  • 23:10we model them by a neural network.
  • 23:18So our unknown parameters now are
  • 23:22U, the positions of the vertices
  • 23:24in the low-dimensional space,
  • 23:27pi, the probability of each vertex and edge,
  • 23:34the non-linear mappings,
  • 23:37which are the weights in the neural network,
  • 23:39and the dispersion parameters.
  • 23:42And we
  • 23:43estimate these parameters by combining our mixture model
  • 23:47with a variational autoencoder.
  • 23:51So the variational autoencoder.
  • 23:53So the autoencoder has been a very popular model
  • 23:58in deep learning.
  • 23:59What it can do
  • 24:02is learn some non-linear mapping
  • 24:07of the observed data to a low-dimensional space,
  • 24:11and we want the low-dimensional space to best
  • 24:14recover our observed high-dimensional data.
  • 24:18And here, we also have such a task.
  • 24:21We have the low-dimensional space
  • 24:23and we want to best to recover our observed data.
  • 24:27And what's different is that we also have a prior
  • 24:31on the latent space, and because we have the prior,
  • 24:34we use the variational autoencoder model
  • 24:38in deep learning.
  • 24:39So the classical variational autoencoder
  • 24:43in deep learning will assume that the latent space
  • 24:46has the standard normal distribution as the prior.
  • 24:50And here we just modify it
  • 24:53so that the latent space has the mixture prior
  • 24:56that is assumed in our previous mixture model.
  • 25:02And we use the same approach
  • 25:04as the variational autoencoder, the variational approach,
  • 25:08which is
  • 25:10to approximate the posteriors of the latent space.
  • 25:13So because of the complicated priors
  • 25:16and non-linear mappings,
  • 25:19the posterior of the latent space conditional
  • 25:23on the observed data and the confounding covariates
  • 25:27can be complicated,
  • 25:28so we approximate it by normal distributions.
  • 25:32The means and the variances of the normal distributions,
  • 25:36which are functions of the observed data Y(i)
  • 25:40and the covariates X(i),
  • 25:41are also non-linear functions
  • 25:44and we model them by a neural network,
  • 25:46and that's the encoder.
  • 25:47So the decoder is the non-linear mapping function,
  • 25:53mapping the latent space to the observed data,
  • 25:56and the encoders are neural networks
  • 25:58that are approximating the posteriors of the latent space.
  • 26:03- Hi, Jingshu, can I ask a very quick question?
  • 26:07If I understand correctly,
  • 26:08up to now you have not used the time information,
  • 26:11is this true?
  • 26:13Or you have considered to include the time information
  • 26:16in the covariate X?
  • 26:17- Oh, oh, yes.
  • 26:19So we will not use time information
  • 26:24in our trajectory inference.
  • 26:26If I have time, I may have a last slide
  • 26:31which is a review of the trajectory inference literature.
  • 26:34So in trajectory inference,
  • 26:37the most commonly used and best performing methods,
  • 26:41they will tend to not use the time information
  • 26:44because there is...
  • 26:45Though the time information is typically correlated
  • 26:49with the developmental timing of the cells,
  • 26:54each collection time
  • 26:57is a mixture of cells at different developmental times.
  • 27:00So it's a complicated relationship,
  • 27:03and some methods use that information
  • 27:05but many methods do not use that.
  • 27:09And so our approach is among the methods that do not use
  • 27:14the collection time information.
  • 27:16And, we use it only when we decide
  • 27:22which vertex is the root.
  • 27:24And I will show that later.
  • 27:27- Thanks.
  • 27:30- So our...
  • 27:33So the loss function is composed of three parts.
  • 27:37The first part is this likelihood based reconstruction loss.
  • 27:42So this evaluates how good our latent space is
  • 27:48at reconstructing the high-dimensional observed data.
  • 27:54And the second part is the KL divergence
  • 27:57between the posterior distribution and the prior.
  • 28:02And you can think of it as a regularization term.
  • 28:06It regularizes the posterior
  • 28:07from being very far away from the prior.
  • 28:12Also, for variational autoencoders, when beta equals one,
  • 28:17the negative loss can also be viewed as a lower bound
  • 28:21of the marginal likelihood of the data.
  • 28:23So here
  • 28:27we add the tuning parameter beta,
  • 28:30and in practice we set beta larger than one,
  • 28:34so that we can strengthen the regularization,
  • 28:39so that the posteriors of the Z(i)
  • 28:41are more likely to align
  • 28:44along the edges and vertices.
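The beta-weighted loss just described can be sketched for a single cell. The closed-form KL divergence between diagonal Gaussians is standard, while the numbers and the stand-in reconstruction term below are illustrative, not the paper's actual values:

```python
import numpy as np

def kl_diag_gaussian(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Toy approximate posterior and prior for one cell's 2-d latent variable.
mu_q, var_q = np.array([0.5, -0.2]), np.array([0.8, 1.1])
mu_p, var_p = np.zeros(2), np.ones(2)

recon_loss = 3.7   # stand-in for the likelihood-based reconstruction term
beta = 2.0         # beta > 1 strengthens the regularization
loss = recon_loss + beta * kl_diag_gaussian(mu_q, var_q, mu_p, var_p)
# With beta = 1 this is the usual (negative) ELBO of a VAE.
```

The KL term is zero exactly when the posterior matches the prior, so larger beta pushes the posteriors harder toward the mixture prior concentrated on edges and vertices.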
  • 28:45And that's an idea that has been used in deep learning
  • 28:48which is called the beta-VAE.
  • 28:53And the third term
  • 28:55is a term for adjusting for the covariates.
  • 29:03We do not want our latent space Z
  • 29:08to be correlated with the covariates,
  • 29:14so we also maximize the reconstruction of the data
  • 29:18by the covariates alone.
  • 29:20And setting the tuning parameter alpha larger than zero
  • 29:24can help decorrelate Z(i) from X(i).
  • 29:31So another part of our training is that
  • 29:35we need a good initialization of the graph.
  • 29:38So specifically we need to determine,
  • 29:40how many vertices there are,
  • 29:42and also the positions of the vertices
  • 29:45in the low-dimensional space.
  • 29:47That's not an easy job.
  • 29:49And we cannot just do it randomly,
  • 29:52because our final graph depends on
  • 29:56the total number of vertices that we set at the beginning.
  • 30:00So how do we pretrain the model to get its initial values?
  • 30:05To get the initial values of the unknown parameters,
  • 30:08we first train with beta equal to zero,
  • 30:12so that we don't make use of
  • 30:14these prior distributions of the Z(i).
  • 30:17So it's only...
  • 30:18We're only looking at the reconstruction loss
  • 30:20from the likelihood of the data.
  • 30:22So it's like the normal, the classical autoencoder.
  • 30:27And from that we can get some initial estimates of Z(i),
  • 30:33the latent space variables.
  • 30:34And then we perform clustering on Z(i),
  • 30:38and we let the clustering algorithm
  • 30:41automatically determine the number of clusters
  • 30:43and we use that as the number of vertices.
  • 30:46And we also use the cluster centers
  • 30:49as the initialization, as the initial values of U.
  • 30:53So those are the key ideas
  • 30:55in our pre-training step,
  • 30:57so that we can have a good initialization
  • 31:01for the start of the training.
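The initialization step just described (cluster the pretrained latent variables, use the cluster centers as initial vertex positions U) can be sketched as follows. Note the talk lets the clustering algorithm choose the number of clusters automatically; this sketch fixes k = 2 and uses plain k-means as a stand-in, on purely synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for pretrained latent variables Z(i):
# two well-separated groups of cells in a 2-d latent space.
Z = np.vstack([
    rng.normal([0.0, 0.0], 0.1, size=(50, 2)),
    rng.normal([5.0, 5.0], 0.1, size=(50, 2)),
])

def kmeans(Z, init_centers, iters=20):
    """Plain Lloyd's k-means with a fixed number of clusters."""
    centers = init_centers.astype(float).copy()
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([Z[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return centers, labels

# Deterministic initialization for this sketch: one point per group.
# The fitted centers serve as the initial vertex positions U.
U_init, labels = kmeans(Z, Z[[0, 50]])
```

The number of clusters found becomes the number of vertices, and the centers become the initial values of U.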
  • 31:05And another trick that we have taken,
  • 31:08is that, in practice,
  • 31:10the best performing existing trajectory inference methods
  • 31:17are typically very fast.
  • 31:20And in order to have comparable computational costs
  • 31:24of these methods,
  • 31:25we also have an accelerated version of our algorithm
  • 31:28which simply reduces
  • 31:34the dimension of the input space,
  • 31:36so we can replace Y(i),
  • 31:39the high-dimensional vector of the gene expressions
  • 31:44with its principal scores,
  • 31:50which is a low-dimensional vector.
  • 31:53By default we will take the first 64 dimensions
  • 31:59for the principal scores.
  • 32:01And so we replace the negative binomial distribution
  • 32:05by a Gaussian distribution assumption
  • 32:09on these principal scores.
  • 32:11And as you will see later
  • 32:15in our evaluations with real and synthetic data,
  • 32:19we actually have comparable performance
  • 32:26between our standard likelihood-based method
  • 32:30and this accelerated version.
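The dimension reduction behind the accelerated version can be sketched with plain SVD-based PCA; the data below are random stand-ins, and only the "keep the first 64 principal scores" idea comes from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for (normalized) expression: 100 cells x 500 genes.
Y = rng.normal(size=(100, 500))

def pca_scores(Y, n_components=64):
    """Project each cell onto the top principal components,
    replacing the high-dimensional expression vector with a
    low-dimensional vector of principal scores."""
    Yc = Y - Y.mean(axis=0)            # center each gene
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:n_components].T    # shape (n_cells, n_components)

scores = pca_scores(Y, n_components=64)
```

The autoencoder then takes these scores as input under a Gaussian likelihood, which is much cheaper than modeling tens of thousands of genes with a negative binomial likelihood.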
  • 32:35So the final step is that,
  • 32:38after training the autoencoder,
  • 32:40we have approximate
  • 32:44posterior distributions of the latent space
  • 32:47and also of the cell positions,
  • 32:54and the posterior distribution of C(i),
  • 32:58which is which vertex or edge the cell is from.
  • 33:04And we need to use this information
  • 33:07to determine the trajectory backbone
  • 33:10and to project each cell
  • 33:13on our inferred trajectory backbone.
  • 33:17So how we do that is, first we calculate an edge score.
  • 33:22So the edge score
  • 33:30is determined
  • 33:32by looking at the posteriors of the cells:
  • 33:38how many cells choose to lie on that specific edge?
  • 33:43If there are a lot of cells, then that means
  • 33:45that it's very likely that the edge exists.
  • 33:47If there are very few cells,
  • 33:49then very likely that edge should not be
  • 33:52included in the trajectory backbone.
  • 33:58And with the denominator, we want to
  • 34:03give a relatively fair score
  • 34:06to the edges that connect to small clusters,
  • 34:11to small cell types,
  • 34:13because we want to also capture the transitions
  • 34:17between two rare cell types.
  • 34:20So that's why we have this regularization,
  • 34:25this weighting, in the denominator.
  • 34:29And we include an edge in the trajectory backbone
  • 34:32if its edge score is larger than some threshold.
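The edge-score idea can be sketched as follows. The talk does not give VITAE's exact formula, so both the square-root normalization in the denominator and the 0.5 threshold below are illustrative assumptions, as are the cell assignments:

```python
import numpy as np
from collections import Counter

# Hypothetical posterior assignments: for each cell, the edge or
# vertex it most likely lies on (edges written as vertex pairs).
assignments = (
    [("v1", "v2")] * 30      # many cells transitioning v1 -> v2
    + [("v2", "v3")] * 2     # almost no cells on this edge
    + [("v1",)] * 40 + [("v2",)] * 25 + [("v3",)] * 3
)

edge_counts = Counter(a for a in assignments if len(a) == 2)
vertex_counts = Counter(a[0] for a in assignments if len(a) == 1)

def edge_score(edge):
    """Cells observed on the edge, normalized so that edges touching
    small clusters are not unfairly penalized.  The sqrt denominator
    here is illustrative, not VITAE's exact weighting."""
    j, k = edge
    denom = np.sqrt(max(vertex_counts[j], 1) * max(vertex_counts[k], 1))
    return edge_counts[edge] / denom

scores = {e: edge_score(e) for e in edge_counts}
backbone = [e for e, s in scores.items() if s > 0.5]  # illustrative cutoff
```

A heavily used edge between two populated vertices passes the cutoff, while an edge supported by only a couple of cells is pruned from the backbone.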
  • 34:38And when we have an inferred trajectory backbone,
  • 34:43then the next step is,
  • 34:45we want to project the cells on the inferred trajectory,
  • 34:50and we do it
  • 34:54based on the approximated posterior distributions
  • 34:58of the cell positions.
  • 35:01We want to find the closest point on the inferred trajectory
  • 35:08for this cell based on the posterior distributions.
  • 35:12And this expected distance
  • 35:14we can also use as
  • 35:18some uncertainty quantification
  • 35:21of this projection
  • 35:24of the cell positions.
  • 35:27And the third thing we need to do,
  • 35:29because our final result is some kind of directed graph,
  • 35:33is that we need to determine the root vertex.
  • 35:36So the root vertex can be either given by the user, or,
  • 35:44for some datasets like the data at the beginning of my talk,
  • 35:48where the cells are collected in a time series,
  • 35:52we can make use of that time series
  • 35:54to determine the root vertex.
  • 35:58The rough idea is that for each vertex
  • 36:01we can calculate a collection time score,
  • 36:03which is a weighted average over the cells
  • 36:06that belong to the vertex
  • 36:07or are projected on an edge that connects to the vertex,
  • 36:11weighted depending on the distance from the cell to the vertex.
  • 36:15So we can have some
  • 36:20collection time score for each vertex.
  • 36:21And we choose the root vertex
  • 36:25as the vertex that has the smallest collection time score.
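The root-selection rule just described can be sketched as follows; the talk uses a distance-weighted average, while this sketch uses a plain average, and the vertex names and embryonic days are hypothetical:

```python
# Hypothetical per-cell records: (vertex the cell is assigned to,
# embryonic collection day of the cell).
cells = [
    ("progenitor", 10.5), ("progenitor", 11.0), ("progenitor", 10.5),
    ("intermediate", 12.5), ("intermediate", 13.5),
    ("neuron", 14.5), ("neuron", 15.5), ("neuron", 15.5),
]

# Average collection time per vertex; the root is the vertex whose
# cells were, on average, collected earliest.
totals, counts = {}, {}
for vertex, day in cells:
    totals[vertex] = totals.get(vertex, 0.0) + day
    counts[vertex] = counts.get(vertex, 0) + 1

score = {v: totals[v] / counts[v] for v in totals}
root = min(score, key=score.get)
```

With a distance-weighted version, cells projected close to a vertex would count more toward that vertex's score than cells far along its edges.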
  • 36:32And with the roots and with our inferred trajectory,
  • 36:35it's straightforward to calculate the pseudotimes
  • 36:38for each cell.
  • 36:40So that's the whole process
  • 36:42of our model-based methods for trajectory inference.
  • 36:48And now let's look at some benchmarking results.
  • 36:52So first we...
  • 36:54So our benchmarking includes both some real datasets
  • 36:58and some synthetic datasets.
  • 37:00And we follow the...
  • 37:02Most of the benchmarking follows the same framework
  • 37:05as this well known benchmarking paper
  • 37:09in "Nature Biotech" in 2019.
  • 37:12And our datasets are selected
  • 37:14as a subset of the datasets that they have tried
  • 37:18and our criterion is that these datasets
  • 37:19have a large enough number of cells, not too few cells.
  • 37:25And we wanted to cover different types of topologies.
  • 37:33And these are the benchmarking results.
  • 37:39So the rows are the different datasets
  • 37:43that I have mentioned,
  • 37:45and we compare five different methods.
  • 37:50We first compare two versions of our approach:
  • 37:53VITAE, the original negative binomial likelihood-based version,
  • 38:00and an accelerated version of VITAE
  • 38:03that replaces the gene expression vectors
  • 38:07by principal component scores.
  • 38:11Then we compare with three different
  • 38:13state-of-the-art trajectory inference methods:
  • 38:19Monocle 3, PAGA, and Slingshot.
  • 38:20So the Monocle series is from the lab
  • 38:24that developed the first trajectory inference method;
  • 38:30they started with Monocle 1
  • 38:32and now have Monocle 3,
  • 38:33and the Monocle series is very commonly used
  • 38:36in single-cell RNA sequencing papers.
  • 38:40PAGA and Slingshot are the two top-performing
  • 38:44trajectory inference methods in the benchmarking paper.
  • 38:51And all these methods
  • 38:55do not use the time information explicitly,
  • 38:59so it's a fair comparison between the methods.
  • 39:04And for all the methods,
  • 39:11we give them the true number
  • 39:13of clusters, or vertices, to start from,
  • 39:19and the true root vertex.
  • 39:22And each column is a metric
  • 39:30for evaluating the performance of each method.
  • 39:36So the first two columns
  • 39:37are the metrics for recovery of the trajectory topology,
  • 39:42or the trajectory structure.
  • 39:46The next two columns evaluate
  • 39:49the accuracy of the cell position estimates,
  • 39:53and the last metric
  • 39:55evaluates the pseudotime accuracy.
  • 39:58A larger score means better performance,
  • 40:01and a lower score means worse performance.
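The talk does not name the exact metric used for the pseudotime column; a standard choice for pseudotime accuracy is the rank (Spearman) correlation between true and estimated pseudotimes, which can be sketched as follows (ties are ignored for simplicity):

```python
import numpy as np

def spearman_pseudotime_accuracy(true_pt, est_pt):
    """Rank correlation between true and estimated pseudotime.

    Returns a value in [-1, 1]; 1 means the estimated ordering of
    the cells perfectly matches the true ordering.
    """
    def ranks(x):
        order = np.argsort(x)
        r = np.empty_like(order, dtype=float)
        r[order] = np.arange(len(x))
        return r
    rt, re = ranks(np.asarray(true_pt)), ranks(np.asarray(est_pt))
    return float(np.corrcoef(rt, re)[0, 1])
```

Because only the ordering matters, any monotone rescaling of the estimated pseudotime leaves the score unchanged.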
  • 40:06So you can see, first, that our approach
  • 40:11has much better performance
  • 40:13in recovering the trajectory topology.
  • 40:18We also have some benefit in the cell position estimates,
  • 40:23and because of both,
  • 40:24we have better performance in pseudotime accuracy.
  • 40:29The other thing you can see is that
  • 40:33our accelerated version has
  • 40:36slightly worse but comparable performance
  • 40:40compared to the likelihood-based VITAE,
  • 40:46though it has a much lower
  • 40:48computational cost.
  • 40:54So finally,
  • 40:56let's come back to the case study on mouse neocortex.
  • 41:02So this is the visualization
  • 41:05of merging the raw data,
  • 41:10and this is the performance of our method.
  • 41:14For comparison, we compare with another very popular
  • 41:21data integration method called Seurat.
  • 41:24Seurat is the most often used software
  • 41:29for single-cell RNA sequencing analysis;
  • 41:31their lab has developed
  • 41:32a series of computational tools
  • 41:35for analyzing single-cell RNA sequencing data.
  • 41:38And this is from their integration method.
  • 41:40So you can see that both methods
  • 41:44are able to integrate the two datasets,
  • 41:50but in some of the details, I think,
  • 41:52because we are assuming this trajectory structure,
  • 41:55we have a slightly better performance.
  • 41:57For example, this group of cells are the layer one neurons,
  • 42:01which are over here in Seurat.
  • 42:07So you can see that
  • 42:09the outer part and the inner part
  • 42:12come from two different datasets.
  • 42:16As I mentioned earlier, in dataset B
  • 42:20they collected cells at day 11,
  • 42:24so we can take a look
  • 42:27at the collection day of each cell.
  • 42:32You can see that the layer one part
  • 42:34comes from day 11,
  • 42:36and the rest is a mixture
  • 42:39of cells from both datasets,
  • 42:42and they all belong to layer one.
  • 42:43We know that they belong to layer one
  • 42:45by looking at the marker gene expression,
  • 42:48which I did not show here.
  • 42:50It's because we encourage the cells to align together
  • 42:55if they are similar cells.
  • 43:03And you can see here that
  • 43:06the two datasets have this interpolation
  • 43:10of the collection times.
  • 43:14For example,
  • 43:16for these projected cells,
  • 43:18we can see continuous positions,
  • 43:24an alignment of the cells from different days:
  • 43:27the darkest are the cells from day 10,
  • 43:34and the red ones are the cells from day 18;
  • 43:36even days are from dataset A,
  • 43:40and odd days are from dataset B.
  • 43:42So you can see that though they come
  • 43:43from two different sources,
  • 43:46we are able to align them in the right order.
  • 43:54And as another comparison,
  • 43:58we compare our estimated shared trajectory
  • 44:03with an alternative two-step approach,
  • 44:06which first does data integration with Seurat
  • 44:12and then uses Slingshot
  • 44:14to perform trajectory inference on the integrated data.
  • 44:19And you can see that
  • 44:21we have a much cleaner trajectory structure,
  • 44:24and we also have a comparable computational cost.
  • 44:29So Seurat and Slingshot
  • 44:31cannot be parallelized,
  • 44:34and with one CPU, it takes about 12 minutes.
  • 44:38And for our accelerated VITAE,
  • 44:42generating this figure
  • 44:45takes about three minutes
  • 44:47on one GPU, or about 10 minutes on eight CPU cores,
  • 44:52and eight CPU cores is what
  • 44:55most of our laptops
  • 44:57currently have.
  • 45:00So we have a comparable computational cost
  • 45:04with these state-of-the-art methods.
  • 45:07And in addition,
  • 45:11based on the approximated posterior distributions,
  • 45:14we also have some uncertainty quantification
  • 45:19of the cell positions.
  • 45:20For example, here it shows that for some of the cells,
  • 45:24the positions
  • 45:26along the trajectory are not very reliable,
  • 45:32and that helps us evaluate
  • 45:37how much we can trust the estimated pseudotime.
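As an illustration of this kind of uncertainty quantification (not VITAE's exact procedure): given an approximate posterior mean and standard deviation for each cell's latent representation, one can Monte-Carlo the induced spread in pseudotime. The Gaussian posterior and the `project` function below are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudotime_uncertainty(mu, sigma, project, n_samples=200):
    """Monte-Carlo uncertainty of each cell's position on the trajectory.

    mu, sigma : (n_cells, d) mean and std of the approximate posterior
                over each cell's latent representation
    project   : function mapping an (n_cells, d) latent array to
                per-cell pseudotimes along the inferred trajectory
    Returns the standard deviation of pseudotime across posterior draws.
    """
    draws = np.stack([
        project(mu + sigma * rng.standard_normal(mu.shape))
        for _ in range(n_samples)
    ])
    return draws.std(axis=0)
```

Cells whose draws land on very different positions get a large standard deviation, flagging their pseudotime estimates as unreliable.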
  • 45:42And finally,
  • 45:44this shows some gene expression changes
  • 45:49along the pseudotime order.
  • 45:50We look at some top marker genes
  • 45:55that change along the pseudotime order
  • 45:57for some paths in the whole trajectory structure.
  • 46:03And you can see,
  • 46:05here we separately fit the curves for the two datasets,
  • 46:11but you can see that they overlap
  • 46:13with each other quite well.
  • 46:15So that's also evidence showing
  • 46:18that we have a good performance
  • 46:23in aligning the two datasets.
  • 46:28So the take-home message is:
  • 46:31first, we perform this model-based trajectory inference
  • 46:36to understand cell dynamics.
  • 46:38And second, our method
  • 46:41is a model-based approach:
  • 46:43we combine the mixture model
  • 46:54that defines the trajectory structure
  • 46:58with the variational autoencoder,
  • 47:00so that we can efficiently
  • 47:05solve the mixture model
  • 47:08and have enough flexibility to fit the data well.
  • 47:12And our trajectory inference approach features
  • 47:17the analysis of integrating
  • 47:19multiple single-cell RNA sequencing datasets.
  • 47:22And if you are eager to know more details,
  • 47:26we have a manuscript
  • 47:29already available on bioRxiv,
  • 47:33and we also have our package code for VITAE.
  • 47:38And that's all. Thank you.
  • 47:40And if you have any questions, I'm happy to answer them.
  • 47:44- Thanks Jingshu for this excellent talk.
  • 47:49I wonder whether the audience
  • 47:50has any questions for Jingshu.
  • 47:58Okay. So Jingshu, I have maybe a minor question.
  • 48:02I recall that in the model,
  • 48:04you have this extra term encouraging X,
  • 48:12the confounding covariates,
  • 48:13to be orthogonal to the latent factor, right?
  • 48:19And then there is a penalty term alpha.
  • 48:21So I wonder,
  • 48:25what's the motivation for including this term
  • 48:28instead of first removing the effect of X directly,
  • 48:33and in practice, how should we set alpha?
  • 48:39- So the thing is a bit tricky here
  • 48:44from a statistical point of view.
  • 48:45We want to remove these confounding effects, right?
  • 48:51But the fact is that
  • 48:53these confounding effects, the X,
  • 48:56are not exactly orthogonal to Z,
  • 48:59because, for instance, for the two datasets that I have,
  • 49:03we cannot say that the signal is completely orthogonal
  • 49:06to which dataset they come from,
  • 49:09because there are true biological differences
  • 49:12between the two datasets.
  • 49:13So here,
  • 49:17the problem is not completely identifiable,
  • 49:19but people do it in practice a lot.
  • 49:22We want to decorrelate Z and X to some extent,
  • 49:29so that we can remove
  • 49:32the batch effects that we do not want
  • 49:36but keep the true biological differences.
  • 49:39So I think the underlying assumption,
  • 49:41the big assumption, is that
  • 49:44the true biological differences are large enough,
  • 49:48so that we remove the smaller differences, the batch effects,
  • 49:53but can still pick up the true biological differences
  • 49:56to some extent.
  • 49:57So there's no guarantee that it will work,
  • 50:01but in practice, it works on a lot of datasets.
  • 50:06I think that's...
  • 50:08So we inherited this idea from a paper
  • 50:11by Nancy Zhang and her students
  • 50:16on removing batch effects.
  • 50:19So I think the idea is that
  • 50:21we hope it can work for a lot of datasets.
  • 50:25And the reason that we want to have this penalty
  • 50:28is that, if we don't add any penalty,
  • 50:32then, because this autoencoder is trained
  • 50:36by stochastic gradient descent,
  • 50:39it sometimes may not find the global optimal solution.
  • 50:44So if we don't encourage X and Z to be decorrelated,
  • 50:49it sometimes may give you a solution
  • 50:54where Z is still highly correlated with X
  • 50:58and the batch effects are still there.
  • 50:59And this alpha, I think in practice we set it to be 0.0,
  • 51:06so it's a very small penalty,
  • 51:11just some regularization.
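As a sketch of the kind of penalty being discussed (the exact form in the paper may differ): one simple way to encourage the latent Z to be decorrelated from the confounders X is to add alpha times the squared cross-covariance to the training loss:

```python
import numpy as np

def decorrelation_penalty(Z, X, alpha):
    """Penalty encouraging the latent factors Z to be decorrelated
    from the confounding covariates X.

    Z : (n_cells, k) latent representations
    X : (n_cells, p) confounding covariates (e.g. batch indicators)
    Returns alpha times the squared Frobenius norm of the sample
    cross-covariance, to be added to the autoencoder's training loss.
    """
    Zc = Z - Z.mean(axis=0)
    Xc = X - X.mean(axis=0)
    cross_cov = Zc.T @ Xc / len(Z)  # (k, p) sample cross-covariance
    return alpha * float((cross_cov ** 2).sum())
```

With a small alpha the penalty only nudges the optimizer away from solutions where Z still carries the batch labels, without forcing exact orthogonality, which matches the non-identifiability caveat above.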
  • 51:16- A small value, I guess, you mentioned.
  • 51:18- Yes, yes.
  • 51:21And if that does not work,
  • 51:22then in practice you can choose another alpha, try it,
  • 51:26and see if it gives you the results that you want.
  • 51:30- Right. So I guess my question is,
  • 51:36since it's not entirely a supervised problem,
  • 51:42I'm not sure how to check
  • 51:45what a good alpha is.
  • 51:48But if you tell me a small alpha
  • 51:51will be fine,
  • 51:51then I'll just take a small alpha.
  • 51:54- Yeah. I think the way you check it,
  • 51:57for example the way we check it here, is that
  • 52:02sometimes we have some reference cell types,
  • 52:04so you know roughly what you are doing.
  • 52:09So here, for example,
  • 52:10these reference cell types
  • 52:14are not used in the modeling approach;
  • 52:18they are for evaluating the performance.
  • 52:21And for some datasets,
  • 52:23we can look at marker genes and do some clustering,
  • 52:26so you know roughly which cluster is which,
  • 52:32and for two datasets, you can see
  • 52:34whether you can correctly merge the cell types
  • 52:37that are shared among the datasets
  • 52:41but keep the cell types
  • 52:42that are unique to each dataset.
  • 52:46For our trajectory inference it's slightly more complicated
  • 52:49because these cell types are not well separated.
  • 52:52So another way that we check our performance
  • 52:55is that, as you can see here,
  • 53:00for these projected cells we can correctly
  • 53:03order the collection days in the right order,
  • 53:06so that we know that we keep
  • 53:08some biologically meaningful signal there.
  • 53:13I think there can still be some bias. Yeah.
  • 53:18- Okay, great. Thanks.
  • 53:21So, thanks Jingshu again for this excellent talk.
  • 53:25And if you have any question you want to ask Jingshu
  • 53:28that you can't think of right now,
  • 53:31you can always email her offline,
  • 53:33and if you want to use her software,
  • 53:36I think she'll be more than happy to answer your questions.
  • 53:39- Yes. Yes.
  • 53:41(chuckles)
  • 53:42- So I guess that's all for today.
  • 53:46Thank you everyone for joining,
  • 53:47thank you Jingshu for being here,
  • 53:49and have a nice rest of the day.