
YSPH Biostatistics Seminar- Dr. Jingshu Wang

February 10, 2021
  • 00:04- Right, so I think while we're waiting,
  • 00:06I'll just give a very brief introduction about Jingshu.
  • 00:11Jingshu is an Assistant Professor from the Stats Department
  • 00:14at University of Chicago.
  • 00:17And today she's going to present some very exciting work
  • 00:22on trajectory inference for the single cell data.
  • 00:27I'm very excited to hear about her work.
  • 00:31I think...
  • 00:34Let's wait for two minutes
  • 00:35and then we'll start with Jingshu's work.
  • 00:40So if you have any other related questions about Jingshu
  • 00:45before the talk starts, you're free to ask as well.
  • 00:48(chuckles)
  • 00:51- Hi Lynn. Can you make me a co-host?
  • 00:53- Oh right. Thanks for reminding me.
  • 00:58Let me see.
  • 01:14So one more minute to start.
  • 01:42- So you can see my screen, right?
  • 01:43- Yes. I can see your screen. Looks good to me.
  • 01:48Maybe I'll hand it over to you now Jingshu I think,
  • 01:52if (indistinct) late I think,
  • 01:56they can ask questions if you miss any details.
  • 01:59- Yes. Okay.
  • 02:01So thanks everyone for coming
  • 02:04and thanks for the introduction and the invitation.
  • 02:08Today I will talk about the Single-Cell RNA Sequencing Data
  • 02:13and how we can learn the cell dynamics
  • 02:16from the single-cell RNA sequencing.
  • 02:23So single-cell RNA sequencing is a newly-developed
  • 02:30but also relatively mature technology
  • 02:33for measuring the RNA expression levels in the cells.
  • 02:38And the traditional microarrays or bulk RNA sequencing
  • 02:43measure the gene expressions
  • 02:45as the average across all cells in a tissue.
  • 02:49However, a tissue is made of many cells
  • 02:56and the cell population is typically not homogeneous
  • 03:00and the cells can have different functions
  • 03:02and different cell types.
  • 03:04So in contrast, in single-cell RNA sequencing,
  • 03:08we have measured the transcriptional profile
  • 03:12in each individual cell.
  • 03:14So we can expand this vector
  • 03:16of gene expressions for a tissue to a matrix
  • 03:19of the gene expressions in the cells.
  • 03:23And each entry is a measured RNA count
  • 03:28for a particular gene in a particular cell.
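The contrast just described (one averaged vector for a bulk sample versus a cells-by-genes matrix) can be sketched numerically; the counts below are made up for illustration:

```python
import numpy as np

# Hypothetical counts: 4 cells (rows) x 3 genes (columns).
# Real single-cell RNA-seq data has thousands of cells and
# tens of thousands of genes.
counts = np.array([
    [5, 0, 2],   # cell 1
    [4, 1, 0],   # cell 2
    [0, 9, 3],   # cell 3
    [1, 8, 2],   # cell 4
])

# Bulk RNA sequencing only sees the average expression across all
# cells in the tissue, collapsing the matrix to a single vector.
bulk_profile = counts.mean(axis=0)
```

Heterogeneity between cells (e.g., genes 1 and 2 marking different cell types above) is invisible in `bulk_profile`.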
  • 03:31So the benefit
  • 03:34is that we have a more detailed understanding of what is going on
  • 03:40in the tissue.
  • 03:41So the benefit of single-cell RNA sequencing,
  • 03:44is that it can give you a relatively unbiased
  • 03:48and complete picture of the cell population.
  • 03:51And this is particularly useful
  • 03:54when the cell population is complicated.
  • 03:57For example when the cells are experiencing dynamic changes.
  • 04:04And as an application of the method
  • 04:06that I will introduce today in this lecture, in this talk,
  • 04:11I will focus on the study of the mouse neocortex.
  • 04:18This is a cartoon showing the migration and generation
  • 04:23of the projection neurons in the mouse neocortex.
  • 04:28You can see that this is quite a complicated process
  • 04:33and there are still a lot of things that are unknown
  • 04:37about the neuronal diversity and the mechanism
  • 04:40of how the projection neurons are generated.
  • 04:45And the goal,
  • 04:45is that we want to use the single-cell RNA sequencing
  • 04:49so that we can have a more complete understanding
  • 04:52of this, the neuronal diversity and the neuron development.
  • 04:57So you can see in this cartoon
  • 05:03that there are different shapes and colors
  • 05:06to represent different cell types in the neocortex.
  • 05:12As the cells are experiencing continuous dynamic changes,
  • 05:17in the real cell population
  • 05:19it is much more complicated than that.
  • 05:22There are no clear boundaries
  • 05:25between different cell types
  • 05:28and there may not even be a clear definition of cell type.
  • 05:33So, what we hope,
  • 05:35is that we want to use single-cell RNA sequencing
  • 05:38to first recover the trajectory of the dynamic changes
  • 05:43or the developmental process
  • 05:45that the cells are experiencing.
  • 05:48So specifically we focus on two datasets.
  • 05:52One data set, we name it as data set A.
  • 05:56So this is a data set
  • 05:58that is recently collected by my collaborator.
  • 06:01And so we have samples...
  • 06:05The cells from the mouse neocortex
  • 06:08at six different embryonic days.
  • 06:11And before our data,
  • 06:13there is another dataset, which we name dataset B.
  • 06:17And this dataset is a smaller dataset than ours
  • 06:22but they have also sequenced
  • 06:23a very similar brain region
  • 06:28of the mouse, and they have sequenced cells
  • 06:32from four different embryonic days.
  • 06:34So you can see that
  • 06:37most of the days that are sequenced in our dataset
  • 06:41and in the other dataset B do not overlap.
  • 06:44And so it would be beneficial
  • 06:49if we can combine the two datasets
  • 06:51so that we can make use of the cells from both studies.
  • 06:56For instance, for our dataset,
  • 06:59we don't have cells from day 11,
  • 07:04which is quite an important day.
  • 07:06For example, day 11 is the day when
  • 07:11the projection neurons
  • 07:17begin to be generated.
  • 07:21And the E11 cells are sequenced in the other dataset.
  • 07:26So it would be beneficial
  • 07:28if we can perform a joint analysis of the two datasets
  • 07:32and learn a shared developmental trajectory
  • 07:35as these two datasets,
  • 07:36are actually sequencing the same mouse brain region.
  • 07:42So as you may have imagined, if we don't do anything,
  • 07:47if we just concatenate the cells from two datasets
  • 07:51and treat them as datasets from the same lab,
  • 07:54then the cells from the two datasets
  • 07:57will actually not merge
  • 07:59because of the batch effects between the two datasets.
  • 08:03Because these are from two labs
  • 08:05and they have different sequencing machines
  • 08:08so the cells become different,
  • 08:10though they are coming from the same brain region.
  • 08:14And this is a figure called the UMAP
  • 08:17which is a two-dimensional projection
  • 08:20of the high dimensional,
  • 08:21observed single-cell RNA sequencing data
  • 08:24so that we can have a visualization
  • 08:27of the cell population.
  • 08:28(clears throat)
  • 08:29And using our method, which is called VITAE,
  • 08:33that I will introduce later
  • 08:35we can merge the cells from the two different sources.
  • 08:41And as I will show later,
  • 08:44we can also keep the uniqueness, the unique characteristics
  • 08:49that only exist in one of the datasets.
  • 08:51So we can keep the biologically meaningful differences
  • 08:55between the two datasets.
  • 08:57And our method is actually not just a
  • 09:01data integration approach.
  • 09:03So what we can do, is that we can also simultaneously,
  • 09:08learn a shared trajectory structure
  • 09:12and we can at the same time do the data integration
  • 09:16or more generally correct for confounding effects
  • 09:19such as the data source and other variables like cell cycle.
  • 09:25And in the...
  • 09:27In this figure,
  • 09:28the arrows show the direction of the developmental process
  • 09:34and the line width represents the score for an edge.
  • 09:38So it shows how confident we are
  • 09:43about whether there is a transition
  • 09:45between the two states that the line connects.
  • 09:55So our method actually belongs to a larger group
  • 09:59of computational tools for single-cell RNA sequencing
  • 10:03which is called the trajectory inference.
  • 10:05So note that
  • 10:08trajectory inference here
  • 10:09is different from statistical inference.
  • 10:13It's a computational tool
  • 10:15so that we can understand the cell lineage
  • 10:20and the cell fate decisions in biological process,
  • 10:25such as cell differentiation
  • 10:27as what we have already seen in the mouse neocortex,
  • 10:31and some other biological process,
  • 10:33such as immune response, cancer expansion
  • 10:36and many more are using single-cell RNA sequencing data.
  • 10:41In general, the trajectory inference approaches
  • 10:44will either infer or start with
  • 10:50a type of underlying trajectory structure.
  • 10:55Many methods will assume a specific type of trajectory structure
  • 11:00for the underlying developmental process,
  • 11:03such as a linear structure or linear topology,
  • 11:06or a bifurcation or tree-like trajectory.
  • 11:13And as the cell populations
  • 11:14that we are trying to understand,
  • 11:16become more and more complicated,
  • 11:18recent methods also try to infer,
  • 11:21the type of the trajectory structure
  • 11:23from the observed single-cell RNA sequencing data.
  • 11:27And once we have learned the trajectory structure,
  • 11:30these trajectory inference approaches
  • 11:33will computationally project
  • 11:36and order the cells along the trajectory.
  • 11:39And the order of the cells along the trajectory
  • 11:43is called the pseudotime of the cells.
  • 11:46So the trajectory inference,
  • 11:48is also called the pseudotime analysis.
  • 11:53So the first trajectory inference method was proposed in 2014
  • 12:01and since then it has become a very popular tool
  • 12:05that is used in analyzing single-cell RNA sequencing data.
  • 12:09And this study summarizes the number
  • 12:15of single-cell RNA sequencing studies
  • 12:16that are published per month.
  • 12:19And you can see that in recent years,
  • 12:21more than half of the published
  • 12:24single-cell RNA sequencing studies
  • 12:26will have some investigation of the pseudotime
  • 12:29and trajectories in the cell population
  • 12:31that they are investigating.
  • 12:35And there have also been a lot of methods
  • 12:39for trajectory inference.
  • 12:41And there is
  • 12:45a comprehensive benchmarking paper,
  • 12:48recently in "Nature Biotech",
  • 12:50and it has summarized 70 different trajectory methods.
  • 12:55And in their paper they have compared
  • 12:57about 45 different trajectory inference methods
  • 13:00from different aspects.
  • 13:03So you may wonder,
  • 13:04since there are so many trajectory inference methods
  • 13:07that are already there, why do we still want to develop,
  • 13:11a new trajectory inference method?
  • 13:14So the first point is that,
  • 13:17although we have 70 different methods,
  • 13:19many trajectory inference methods,
  • 13:22they are assuming a specific type
  • 13:24of the trajectory structure.
  • 13:26So many methods only work for certain developmental processes.
  • 13:31If you consider the methods
  • 13:34that have the flexibility
  • 13:35to work for a variety of trajectory structures,
  • 13:39then we don't have that many methods that are available.
  • 13:43And another concern that I have is that most methods,
  • 13:48these trajectory inference methods,
  • 13:51do not have explicit statistical models.
  • 13:55So what I mean is that,
  • 13:56though people are kind of clear
  • 13:59about what's the biological signal
  • 14:01that we want to find in the trajectory inference,
  • 14:06many methods are actually pretty vague about,
  • 14:11from the perspective of the single-cell data matrix,
  • 14:16what can be the definition of the trajectory
  • 14:19that they want to infer,
  • 14:25and how the data can be modeled and generated
  • 14:28with the trajectory structure.
  • 14:30So as a statistician, I think it would be beneficial,
  • 14:34if we have a model-based trajectory inference approach,
  • 14:38so that we can better understand
  • 14:41how good our estimations are
  • 14:44and have some uncertainty quantification
  • 14:46of the trajectories or pseudotimes that we infer.
  • 14:54And the third point is that
  • 14:56as you have shown at the beginning,
  • 14:59there is also a growing need,
  • 15:01to efficiently align trajectories
  • 15:05or do a joint analysis
  • 15:07from multiple single-cell RNA sequencing datasets.
  • 15:10As the single-cell RNA sequencing datasets are expanding,
  • 15:18there have already been a lot of datasets
  • 15:21for the same tissue or for the same cell type.
  • 15:25And it will be...
  • 15:26(clears throat)
  • 15:27And we can learn a better picture of
  • 15:31the biological process in the tissue or for the cell type,
  • 15:35if we can use all available datasets.
  • 15:39And so there's a strong need,
  • 15:41an increasing need to do this joint trajectory analysis
  • 15:46for multiple datasets.
  • 15:50So because of these reasons,
  • 15:52we developed a new statistical framework and a new method,
  • 15:57and we call it VITAE,
  • 15:59which is short for Variational Inference
  • 16:02for Trajectory by AutoEncoders.
  • 16:05And it is a model-based trajectory inference approach.
  • 16:11So our model starts with a definition
  • 16:15of the trajectory backbone.
  • 16:18So we use a graph to define the trajectory backbone.
  • 16:22So we start with a complete graph G,
  • 16:25well the vertices are the distinct cell states and cell type
  • 16:31and an edge denotes a possible transition
  • 16:34between two cell states and or cell types.
  • 16:39And then we can define a cell position on the graph
  • 16:43which is a vector, which is a landscape vector.
  • 16:47And it's...
  • 16:48A K is the number of vertices on the graph,
  • 16:51in the graph.
  • 16:52So if a cell is exactly,
  • 16:55belongs to one cell state or cell type,
  • 16:58then it is on cell vertex.
  • 17:01And if the cell is experiencing a transition
  • 17:05between two cell states or cell types,
  • 17:08then we denote it as on the edge between two vertices.
  • 17:16And then we can define the trajectory backbone
  • 17:18as a subgraph of G.
  • 17:20So we only include an edge or vertex
  • 17:23if we really observe cells that are on the edge.
  • 17:27So though there are many possible transitions
  • 17:30between the cell types or cell states,
  • 17:33we include a transition only
  • 17:38when we do observe cells
  • 17:39that are experiencing such transitions.
  • 17:42So though there can be many edges in our complete graph,
  • 17:48on the sub graph, it's a sparse success of the other edges
  • 17:52that are possible on the graph.
  • 17:57And a benefit of the above definition,
  • 18:01is that we can allow any types of the trajectory structure.
  • 18:05So it can be either a linear structure,
  • 18:07a bifurcation, a tree-like structure or a cycle;
  • 18:12it completely depends on what the data shows.
  • 18:16And so we allow,
  • 18:18we want the data to automatically determine the other,
  • 18:24the trajectory structure or topology
  • 18:26of the underlying dynamic process.
  • 18:33And we can also define the pseudotime for each cell.
  • 18:38I have not written down the exact definition here
  • 18:43but the idea is that we first need a root vertex.
  • 18:46So a root vertex is the start of this dynamic process.
  • 18:51And it can be given by the user,
  • 18:54depending on looking at the marker genes
  • 18:57or other prior biological information.
  • 19:00And later we will also...
  • 19:02I will also show you that for some datasets,
  • 19:04we can automatically determine the root vertex.
  • 19:08And with a given root vertex,
  • 19:09then the graph becomes a directed graph.
  • 19:13And we define the pseudotime of the cell
  • 19:17as the shortest path from the root to a specific cell
  • 19:22along the trajectory, along the trajectory backbone.
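The pseudotime definition just described can be sketched with a toy example; the backbone, vertex names and edge lengths below are hypothetical, not from the talk's data. Given a root, pseudotime is the shortest-path distance along the backbone:

```python
import heapq

# A hypothetical trajectory backbone: vertices are cell states,
# edge weights are distances along the inferred trajectory.
backbone = {
    "root": {"A": 1.0, "B": 2.0},
    "A": {"root": 1.0, "C": 1.5},
    "B": {"root": 2.0},
    "C": {"A": 1.5},
}

def pseudotime_from(root, graph):
    """Dijkstra: shortest-path distance from the root vertex to
    every reachable vertex along the trajectory backbone."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[v].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist

times = pseudotime_from("root", backbone)
# "C" is reached through "A": 1.0 + 1.5 = 2.5
```

A cell sitting partway along an edge would get the distance to its projected point rather than to a vertex; the vertex version above is the minimal sketch.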
  • 19:28So this graph defines the trajectory structure.
  • 19:33And the next step,
  • 19:34is that we want to link the trajectory structure
  • 19:38with the data generation model.
  • 19:41So the single-cell RNA sequencing data matrix,
  • 19:46is typically a high dimensional matrix
  • 19:50because for each cell,
  • 19:51we typically observe tens of thousands of genes
  • 19:54and there are also complicated dependency relationships
  • 19:59among the genes.
  • 20:01And what we assume,
  • 20:02is that we assume that these dependencies across genes,
  • 20:06can be explained by latent variables Z(i)
  • 20:10in a low dimensional space.
  • 20:13And we assume that these latent variables,
  • 20:18are following normal distributions
  • 20:20and they also have the graph structure.
  • 20:25So U here, are the positions of the vertices on the graph
  • 20:29in this low dimensional space,
  • 20:32and the mean of Z(i)
  • 20:34is a linear combination of the positions of these vertices,
  • 20:41depending on the cell's position on the graph.
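The latent mean just described can be written as a tiny numerical sketch; the vertex positions and cell positions below are made-up illustrations:

```python
import numpy as np

# Hypothetical low-dimensional vertex positions U: K = 3 vertices
# embedded in a 2-dimensional latent space (one row per vertex).
U = np.array([
    [0.0, 0.0],   # vertex 1
    [4.0, 0.0],   # vertex 2
    [0.0, 4.0],   # vertex 3
])

# A cell's position on the graph is a length-K vector that sums
# to one.  A cell sitting exactly on vertex 2:
w_on_vertex = np.array([0.0, 1.0, 0.0])

# A cell 25% of the way along the edge from vertex 1 to vertex 2:
w_on_edge = np.array([0.75, 0.25, 0.0])

# The mean of the latent variable Z(i) is the corresponding
# linear combination of the vertex positions.
mu_vertex = w_on_vertex @ U
mu_edge = w_on_edge @ U
```

So cells on a vertex have latent means at that vertex, and cells in transition have means interpolated along the straight line between the two vertex positions.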
  • 20:46And,
  • 20:49what I want to emphasize here is one point,
  • 20:52which is that we assume a non-linear mapping
  • 20:55from the latent space to the high dimensional observed data,
  • 20:59because we think that though in the low dimensional space,
  • 21:03we can represent the trajectory as these linear lines,
  • 21:07it is very likely a manifold on the observed data.
  • 21:10So this non-linear mapping,
  • 21:12can map these linear lines to curves
  • 21:16in the high dimensional space.
  • 21:18And now to consider,
  • 21:20to account for the confounding covariates,
  • 21:23such as the data source or cell cycle,
  • 21:27we also allow this non-linear mapping,
  • 21:30to depend on these covariates.
  • 21:33And here,
  • 21:35for the observed data counts,
  • 21:37we assume they follow a negative binomial distribution,
  • 21:41where the L(i) are the known library sizes
  • 21:44(sorry for the typo on the slide; it should say known library sizes),
  • 21:46and the dispersion parameters, one for each gene,
  • 21:51are unknown parameters.
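The negative binomial observation model can be sketched numerically. The mean parameterization below (known library size times a gene-level rate) and all numbers are illustrative assumptions, not the paper's exact model:

```python
from math import lgamma, log, exp

def nb_log_pmf(y, mean, dispersion):
    """Log probability of count y under a negative binomial
    distribution parameterized by its mean and a dispersion r,
    so that variance = mean + mean**2 / r."""
    r = dispersion
    return (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
            + r * log(r / (r + mean)) + y * log(mean / (r + mean)))

# A cell's expected count for gene g scales with its known
# library size L(i); the numbers here are hypothetical.
l_i, rate_g, r_g = 2000.0, 0.001, 5.0
mean_ig = l_i * rate_g   # expected count = 2.0
probs = [exp(nb_log_pmf(y, mean_ig, r_g)) for y in range(200)]
```

The over-dispersion relative to a Poisson (variance exceeding the mean) is what makes this distribution the common choice for RNA-seq counts.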
  • 21:53And so in the current model,
  • 21:55the unknown parameters we have
  • 21:57are the vertex positions U,
  • 22:00the cell positions on the graph W(i),
  • 22:03the non-linear mapping,
  • 22:06and these unknown dispersion parameters.
  • 22:08So we have a lot of parameters.
  • 22:11So to further simplify our estimation,
  • 22:15we assume that there is a mixture prior
  • 22:18on the cell positions.
  • 22:20So it's a fairly simple idea.
  • 22:23So we assume that, first,
  • 22:27there are some latent variables C(i) for each cell.
  • 22:30And so Ci determines which edge or vertex a cell chooses.
  • 22:35So the cell has some probability to choose a specific edge
  • 22:39or a specific vertex,
  • 22:42and if it chooses an edge,
  • 22:46then it also chooses
  • 22:51its relative location on the edge.
  • 22:55So that's what forms the mixture prior on the cell positions.
  • 23:00And
  • 23:02for these non-linear mapping functions,
  • 23:10we model them by a neural network.
  • 23:18So our unknown parameters now are
  • 23:22U, the positions of the vertices
  • 23:24in the low-dimensional space,
  • 23:27pi, the probability of each vertex and edge,
  • 23:34the non-linear mappings,
  • 23:37which are the weights in the neural network,
  • 23:39and the dispersion parameters.
  • 23:42And we
  • 23:43estimate these parameters by combining our mixture model
  • 23:47with a variational autoencoder.
  • 23:51So the variational autoencoder.
  • 23:53So the autoencoder has been a very popular model
  • 23:58in deep learning.
  • 23:59What it can do
  • 24:02is learn some non-linear mapping
  • 24:07of the observed data to a low-dimensional space,
  • 24:11and we want the low-dimensional space to best
  • 24:14recover our observed high-dimensional data.
  • 24:18And here, we also have such a task.
  • 24:21We have the low-dimensional space
  • 24:23and we want to best to recover our observed data.
  • 24:27And what's different is that we also have a prior
  • 24:31on the latent space, and because we have the prior,
  • 24:34we use the variational autoencoder model
  • 24:38in deep learning.
  • 24:39So the classical variational autoencoder
  • 24:43in deep learning will assume that the latent space
  • 24:46has the standard normal distribution as the prior.
  • 24:50And here we just modify it
  • 24:53so that the latent space has the mixture prior
  • 24:56that is assumed in our previous mixture model.
  • 25:02And we use the same approach
  • 25:04as the variational autoencoder, the variational approach,
  • 25:08which is
  • 25:10to approximate the posteriors of the latent space.
  • 25:13So because of the complicated priors
  • 25:16and non-linear mappings,
  • 25:19the posterior of the latent space conditional
  • 25:23on the observed data and the confounding covariates
  • 25:27can be complicated,
  • 25:28so we approximate it by normal distributions.
  • 25:32The means and the variances of the normal distributions,
  • 25:36which are functions of the observed data Y(i)
  • 25:40and the covariates X(i),
  • 25:41are also non-linear functions
  • 25:44and we model them by a neural network,
  • 25:46and that's the encoder.
  • 25:47So the decoder is the non-linear mapping function,
  • 25:53mapping the latent space to the observed data,
  • 25:56and the encoders are neural networks
  • 25:58that are approximating the posteriors of the latent space.
  • 26:03- Hi, Jingshu, can I ask a very quick question?
  • 26:07If I understand correctly,
  • 26:08up to now you have not used the time information,
  • 26:11is this true?
  • 26:13Or you have considered to include the time information
  • 26:16in the covariate X?
  • 26:17- Oh, oh, yes.
  • 26:19So we will not use time information
  • 26:24in our trajectory inference.
  • 26:26If I have time, I may have a last slide
  • 26:31which is a review of the trajectory inference literature.
  • 26:34So in trajectory inference,
  • 26:37the most commonly used and best performing methods,
  • 26:41they will tend to not use the time information
  • 26:44because there is...
  • 26:45Though the time information is typically correlated
  • 26:49with the developmental timing of the cells,
  • 26:54each collection time
  • 26:57is a mixture of cells at different developmental times.
  • 27:00So it's a complicated relationship,
  • 27:03and some methods use that information
  • 27:05but many methods do not use that.
  • 27:09And so our approach is among the methods that do not use
  • 27:14the collection time information.
  • 27:16And, we use it only when we decide
  • 27:22which vertex is the root.
  • 27:24And I will show that later.
  • 27:27- Thanks.
  • 27:30- So our...
  • 27:33So the loss function is composed of three parts.
  • 27:37The first part is this likelihood based reconstruction loss.
  • 27:42So this evaluates how good our latent space is
  • 27:48at reconstructing the high-dimensional observed data.
  • 27:54And the second part is the KL divergence
  • 27:57between the posterior distribution and the prior.
  • 28:02And you can think of it as a regularization term.
  • 28:06It regularizes the posterior
  • 28:07from being very far away from the prior.
  • 28:12Also, for variational autoencoders, when beta equals one,
  • 28:17the negative loss can also be viewed as a lower bound
  • 28:21of the marginal likelihood of the data.
  • 28:23So here
  • 28:27we add the tuning parameter beta,
  • 28:30and in practice we set beta larger than one,
  • 28:34so that we can strengthen the regularization,
  • 28:39so that the posteriors of the Z(i)
  • 28:41are more likely to align
  • 28:44along the edges and vertices.
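The beta-weighted loss just described can be sketched for a single cell. The closed-form KL divergence between diagonal Gaussians is standard, while the numbers and the stand-in reconstruction term below are illustrative, not the paper's actual values:

```python
import numpy as np

def kl_diag_gaussian(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Toy approximate posterior and prior for one cell's 2-d latent variable.
mu_q, var_q = np.array([0.5, -0.2]), np.array([0.8, 1.1])
mu_p, var_p = np.zeros(2), np.ones(2)

recon_loss = 3.7   # stand-in for the likelihood-based reconstruction term
beta = 2.0         # beta > 1 strengthens the regularization
loss = recon_loss + beta * kl_diag_gaussian(mu_q, var_q, mu_p, var_p)
# With beta = 1 this is the usual (negative) ELBO of a VAE.
```

The KL term is zero exactly when the posterior matches the prior, so larger beta pushes the posteriors harder toward the mixture prior concentrated on edges and vertices.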
  • 28:45And that's an idea that has been used in deep learning
  • 28:48which is called the beta-VAE.
  • 28:53And the third term
  • 28:55is a term for adjusting for the covariates.
  • 29:03We do not want our latent space Z
  • 29:08to be correlated with the covariates,
  • 29:14so we also maximize the reconstruction of the data
  • 29:18by the covariates alone.
  • 29:20And setting the tuning parameter alpha larger than zero
  • 29:24can help decorrelate Z(i) from X(i).
  • 29:31So another part of our training is that
  • 29:35we need a good initialization of the graph.
  • 29:38So specifically we need to determine,
  • 29:40how many vertices there are,
  • 29:42and also the positions of the vertices
  • 29:45in the low-dimensional space.
  • 29:47That's not an easy job.
  • 29:49And we cannot just do it randomly,
  • 29:52because our final graph depends on
  • 29:56the total number of vertices that we set at the beginning.
  • 30:00So how do we pretrain the model to get its initial values?
  • 30:05To get the initial values of the unknown parameters,
  • 30:08we first train with beta equal to zero,
  • 30:12so that we don't make use of
  • 30:14these prior distributions of the Z(i).
  • 30:17So it's only...
  • 30:18We're only looking at the reconstruction loss
  • 30:20from the likelihood of the data.
  • 30:22So it's like the normal, the classical autoencoder.
  • 30:27And from that we can get some initial estimates of Z(i),
  • 30:33the latent space variables.
  • 30:34And then we perform clustering on Z(i),
  • 30:38and we let the clustering algorithm
  • 30:41automatically determine the number of clusters
  • 30:43and we use that as the number of vertices.
  • 30:46And we also use the cluster centers
  • 30:49as the initialization, as the initial values of U.
  • 30:53So those are the key ideas
  • 30:55in our pre-training step,
  • 30:57so that we can have a good initialization
  • 31:01for the start of the training.
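The initialization step just described (cluster the pretrained latent variables, use the cluster centers as initial vertex positions U) can be sketched as follows. Note the talk lets the clustering algorithm choose the number of clusters automatically; this sketch fixes k = 2 and uses plain k-means as a stand-in, on purely synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for pretrained latent variables Z(i):
# two well-separated groups of cells in a 2-d latent space.
Z = np.vstack([
    rng.normal([0.0, 0.0], 0.1, size=(50, 2)),
    rng.normal([5.0, 5.0], 0.1, size=(50, 2)),
])

def kmeans(Z, init_centers, iters=20):
    """Plain Lloyd's k-means with a fixed number of clusters."""
    centers = init_centers.astype(float).copy()
    for _ in range(iters):
        # assign each point to its nearest center
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # move each center to the mean of its assigned points
        centers = np.array([Z[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return centers, labels

# Deterministic initialization for this sketch: one point per group.
# The fitted centers serve as the initial vertex positions U.
U_init, labels = kmeans(Z, Z[[0, 50]])
```

The number of clusters found becomes the number of vertices, and the centers become the initial values of U.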
  • 31:05And another trick that we have taken,
  • 31:08is that, in practice,
  • 31:10the best performing existing trajectory inference methods
  • 31:17are typically very fast.
  • 31:20And in order to have comparable computational costs
  • 31:24of these methods,
  • 31:25we also have an accelerated version of our algorithm
  • 31:28which simply reduces
  • 31:34the dimension of the input space,
  • 31:36so we can replace Y(i),
  • 31:39the high-dimensional vector of the gene expressions
  • 31:44with its principal scores,
  • 31:50which is a low-dimensional vector.
  • 31:53By default we will take the first 64 dimensions
  • 31:59for the principal scores.
  • 32:01And so we replace the negative binomial distribution
  • 32:05by a Gaussian distribution assumption
  • 32:09on these principal scores.
  • 32:11And as you will see later
  • 32:15in our evaluations with real and synthetic data,
  • 32:19we actually have comparable performance
  • 32:26between our standard likelihood-based method
  • 32:30and this accelerated version.
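The dimension reduction behind the accelerated version can be sketched with plain SVD-based PCA; the data below are random stand-ins, and only the "keep the first 64 principal scores" idea comes from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for (normalized) expression: 100 cells x 500 genes.
Y = rng.normal(size=(100, 500))

def pca_scores(Y, n_components=64):
    """Project each cell onto the top principal components,
    replacing the high-dimensional expression vector with a
    low-dimensional vector of principal scores."""
    Yc = Y - Y.mean(axis=0)            # center each gene
    U, S, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:n_components].T    # shape (n_cells, n_components)

scores = pca_scores(Y, n_components=64)
```

The autoencoder then takes these scores as input under a Gaussian likelihood, which is much cheaper than modeling tens of thousands of genes with a negative binomial likelihood.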
  • 32:35So the final step is that,
  • 32:38after training the autoencoder,
  • 32:40we have approximate
  • 32:44posterior distributions of the latent space
  • 32:47and also of the cell positions,
  • 32:54and the posterior distribution of C(i),
  • 32:58which is which vertex or edge the cell is from.
  • 33:04And we need to use this information
  • 33:07to determine the trajectory backbone
  • 33:10and to project each cell
  • 33:13on our inferred trajectory backbone.
  • 33:17So how we do that is, first we calculate an edge score.
  • 33:22So the edge score
  • 33:30is determined
  • 33:32by looking at the posteriors of the cells:
  • 33:38how many cells choose to lie on that specific edge?
  • 33:43If there are a lot of cells, then that means
  • 33:45that it's very likely that the edge exists.
  • 33:47If there are very few cells,
  • 33:49then very likely that edge should not be
  • 33:52included in the trajectory backbone.
  • 33:58And with the denominator, we want to
  • 34:03give a relatively fair score
  • 34:06to the edges that connect to small clusters,
  • 34:11to small cell types,
  • 34:13because we want to also capture the transitions
  • 34:17between two rare cell types.
  • 34:20So that's why we have this regularization,
  • 34:25this weighting, in the denominator.
  • 34:29And we include an edge in the trajectory backbone
  • 34:32if its edge score is larger than some threshold.
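The edge-score idea can be sketched as follows. The talk does not give VITAE's exact formula, so both the square-root normalization in the denominator and the 0.5 threshold below are illustrative assumptions, as are the cell assignments:

```python
import numpy as np
from collections import Counter

# Hypothetical posterior assignments: for each cell, the edge or
# vertex it most likely lies on (edges written as vertex pairs).
assignments = (
    [("v1", "v2")] * 30      # many cells transitioning v1 -> v2
    + [("v2", "v3")] * 2     # almost no cells on this edge
    + [("v1",)] * 40 + [("v2",)] * 25 + [("v3",)] * 3
)

edge_counts = Counter(a for a in assignments if len(a) == 2)
vertex_counts = Counter(a[0] for a in assignments if len(a) == 1)

def edge_score(edge):
    """Cells observed on the edge, normalized so that edges touching
    small clusters are not unfairly penalized.  The sqrt denominator
    here is illustrative, not VITAE's exact weighting."""
    j, k = edge
    denom = np.sqrt(max(vertex_counts[j], 1) * max(vertex_counts[k], 1))
    return edge_counts[edge] / denom

scores = {e: edge_score(e) for e in edge_counts}
backbone = [e for e, s in scores.items() if s > 0.5]  # illustrative cutoff
```

A heavily used edge between two populated vertices passes the cutoff, while an edge supported by only a couple of cells is pruned from the backbone.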
  • 34:38And when we have an inferred trajectory backbone,
  • 34:43then the next step is,
  • 34:45we want to project the cells on the inferred trajectory,
  • 34:50and we do it
  • 34:54based on the approximated posterior distributions
  • 34:58of the cell positions.
  • 35:01We want to find the closest point on the inferred trajectory
  • 35:08for this cell based on the posterior distributions.
  • 35:12And this expected distance
  • 35:14we can also use as
  • 35:18some uncertainty quantification
  • 35:21of this projection
  • 35:24of the cell positions.
  • 35:27And the third thing we need to do,
  • 35:29because our final result is some kind of directed graph,
  • 35:33is that we need to determine the root vertex.
  • 35:36So the root vertex can be either given by the user, or,
  • 35:44for some datasets like the data at the beginning of my talk,
  • 35:48where the cells are collected in a time series,
  • 35:52we can make use of that time series
  • 35:54to determine the root vertex.
  • 35:58The rough idea is that for each vertex
  • 36:01we can calculate a collection time score,
  • 36:03which is a weighted average over the cells
  • 36:06that belong to the vertex
  • 36:07or are projected on an edge that connects to the vertex,
  • 36:11weighted depending on the distance from the cell to the vertex.
  • 36:15So we can have some
  • 36:20collection time score for each vertex.
  • 36:21And we choose the root vertex
  • 36:25as the vertex that has the smallest collection time score.
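The root-selection rule just described can be sketched as follows; the talk uses a distance-weighted average, while this sketch uses a plain average, and the vertex names and embryonic days are hypothetical:

```python
# Hypothetical per-cell records: (vertex the cell is assigned to,
# embryonic collection day of the cell).
cells = [
    ("progenitor", 10.5), ("progenitor", 11.0), ("progenitor", 10.5),
    ("intermediate", 12.5), ("intermediate", 13.5),
    ("neuron", 14.5), ("neuron", 15.5), ("neuron", 15.5),
]

# Average collection time per vertex; the root is the vertex whose
# cells were, on average, collected earliest.
totals, counts = {}, {}
for vertex, day in cells:
    totals[vertex] = totals.get(vertex, 0.0) + day
    counts[vertex] = counts.get(vertex, 0) + 1

score = {v: totals[v] / counts[v] for v in totals}
root = min(score, key=score.get)
```

With a distance-weighted version, cells projected close to a vertex would count more toward that vertex's score than cells far along its edges.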
  • 36:32And with the roots and with our inferred trajectory,
  • 36:35it's straightforward to calculate the pseudotimes
  • 36:38for each cell.
  • 36:40So that's the whole process
  • 36:42of our model-based methods for trajectory inference.
  • 36:48And now let's look at some benchmarking results.
  • 36:52So first we...
  • 36:54So our benchmarking includes both some real datasets
  • 36:58and some synthetic datasets.
  • 37:00And we follow the...
  • 37:02Most of the benchmarking follows the same framework
  • 37:05as this well known benchmarking paper
  • 37:09in "Nature Biotech" in 2019.
  • 37:12And our datasets are selected
  • 37:14as a subset of the datasets that they have tried
  • 37:18and our criterion is that these datasets
  • 37:19have a large enough number of cells, not too few cells.
  • 37:25And we wanted to cover different types of topologies.
  • 37:33And these are the benchmarking results.
  • 37:39So the rows are the different datasets
  • 37:43that I have mentioned,
  • 37:45and we compare five different methods.
  • 37:50We first compare two versions of our approach:
  • 37:53VITAE, the original negative binomial likelihood-based version,
  • 38:00and an accelerated version of VITAE
  • 38:03that replaces the gene expression vectors
  • 38:07by principal component scores.
  • 38:11Then we compare with three different
  • 38:13state-of-the-art trajectory inference methods:
  • 38:19Monocle 3, PAGA, and Slingshot.
  • 38:20So the Monocle series is from the lab
  • 38:24that developed the first trajectory inference method;
  • 38:30they started with Monocle 1
  • 38:32and now have Monocle 3,
  • 38:33and the Monocle series is very commonly used
  • 38:36in single-cell RNA sequencing papers.
  • 38:40PAGA and Slingshot are the two top-performing
  • 38:44trajectory inference methods in the benchmarking paper.
  • 38:51And all these methods
  • 38:55do not use the time information explicitly,
  • 38:59so it's a fair comparison between the methods.
  • 39:04And for all the methods,
  • 39:11we give them the true number
  • 39:13of clusters, or vertices, to start from,
  • 39:19and the true root vertex.
  • 39:22And each column is a metric
  • 39:30for evaluating the performance of each method.
  • 39:36So the first two columns
  • 39:37are the metrics for recovery of the trajectory topology,
  • 39:42or the trajectory structure.
  • 39:46The next two columns evaluate
  • 39:49the accuracy of the cell position estimates,
  • 39:53and the last metric
  • 39:55evaluates the pseudotime accuracy.
  • 39:58A larger score means better performance,
  • 40:01and a lower score means worse performance.
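The talk does not name the exact metric used for the pseudotime column; a standard choice for pseudotime accuracy is the rank (Spearman) correlation between true and estimated pseudotimes, which can be sketched as follows (ties are ignored for simplicity):

```python
import numpy as np

def spearman_pseudotime_accuracy(true_pt, est_pt):
    """Rank correlation between true and estimated pseudotime.

    Returns a value in [-1, 1]; 1 means the estimated ordering of
    the cells perfectly matches the true ordering.
    """
    def ranks(x):
        order = np.argsort(x)
        r = np.empty_like(order, dtype=float)
        r[order] = np.arange(len(x))
        return r
    rt, re = ranks(np.asarray(true_pt)), ranks(np.asarray(est_pt))
    return float(np.corrcoef(rt, re)[0, 1])
```

Because only the ordering matters, any monotone rescaling of the estimated pseudotime leaves the score unchanged.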
  • 40:06So you can see, first, that our approach
  • 40:11has much better performance
  • 40:13in recovering the trajectory topology.
  • 40:18We also have some benefit in the cell position estimates,
  • 40:23and because of both,
  • 40:24we have better performance in pseudotime accuracy.
  • 40:29The other thing you can see is that
  • 40:33our accelerated version has
  • 40:36slightly worse but comparable performance
  • 40:40compared to the likelihood-based VITAE,
  • 40:46though it has a much lower
  • 40:48computational cost.
  • 40:54So finally,
  • 40:56let's come back to the case study on mouse neocortex.
  • 41:02So this is the visualization
  • 41:05of merging the raw data,
  • 41:10and this is the performance of our method.
  • 41:14For comparison, we compare with another very popular
  • 41:21data integration method called Seurat.
  • 41:24Seurat is the most often used software
  • 41:29for single-cell RNA sequencing analysis;
  • 41:31their lab has developed
  • 41:32a series of computational tools
  • 41:35for analyzing single-cell RNA sequencing data.
  • 41:38And this is from their integration method.
  • 41:40So you can see that both methods
  • 41:44are able to integrate the two datasets,
  • 41:50but in some of the details, I think,
  • 41:52because we are assuming this trajectory structure,
  • 41:55we have a slightly better performance.
  • 41:57For example, this group of cells are the layer one neurons,
  • 42:01which are over here in Seurat.
  • 42:07So you can see that
  • 42:09the outer part and the inner part
  • 42:12come from two different datasets.
  • 42:16As I mentioned earlier, in dataset B
  • 42:20they collected cells at day 11,
  • 42:24so we can take a look
  • 42:27at the collection day of each cell.
  • 42:32You can see that the layer one part
  • 42:34comes from day 11,
  • 42:36and the rest is a mixture
  • 42:39of cells from both datasets,
  • 42:42and they all belong to layer one.
  • 42:43We know that they belong to layer one
  • 42:45by looking at the marker gene expression,
  • 42:48which I did not show here.
  • 42:50It's because we encourage the cells to align together
  • 42:55if they are similar cells.
  • 43:03And you can see here that
  • 43:06the two datasets have this interpolation
  • 43:10of the collection times.
  • 43:14For example,
  • 43:16for these projected cells,
  • 43:18we can see continuous positions,
  • 43:24an alignment of the cells from different days:
  • 43:27the darkest are the cells from day 10,
  • 43:34and the red ones are the cells from day 18;
  • 43:36even days are from dataset A,
  • 43:40and odd days are from dataset B.
  • 43:42So you can see that though they come
  • 43:43from two different sources,
  • 43:46we are able to align them in the right order.
  • 43:54And as another comparison,
  • 43:58we compare our estimated shared trajectory
  • 44:03with an alternative two-step approach,
  • 44:06which first does data integration with Seurat
  • 44:12and then uses Slingshot
  • 44:14to perform trajectory inference on the integrated data.
  • 44:19And you can see that
  • 44:21we have a much cleaner trajectory structure,
  • 44:24and we also have a comparable computational cost.
  • 44:29So Seurat and Slingshot
  • 44:31cannot be parallelized,
  • 44:34and with one CPU, it takes about 12 minutes.
  • 44:38And for our accelerated VITAE,
  • 44:42generating this figure
  • 44:45takes about three minutes
  • 44:47on one GPU, or about 10 minutes on eight CPU cores,
  • 44:52and eight CPU cores is what
  • 44:55most of our laptops
  • 44:57currently have.
  • 45:00So we have a comparable computational cost
  • 45:04with these state-of-the-art methods.
  • 45:07And in addition,
  • 45:11based on the approximated posterior distributions,
  • 45:14we also have some uncertainty quantification
  • 45:19of the cell positions.
  • 45:20For example, here it shows that for some of the cells,
  • 45:24the positions
  • 45:26along the trajectory are not very reliable,
  • 45:32and that helps us evaluate
  • 45:37how much we can trust the estimated pseudotime.
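As an illustration of this kind of uncertainty quantification (not VITAE's exact procedure): given an approximate posterior mean and standard deviation for each cell's latent representation, one can Monte-Carlo the induced spread in pseudotime. The Gaussian posterior and the `project` function below are assumptions made for the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def pseudotime_uncertainty(mu, sigma, project, n_samples=200):
    """Monte-Carlo uncertainty of each cell's position on the trajectory.

    mu, sigma : (n_cells, d) mean and std of the approximate posterior
                over each cell's latent representation
    project   : function mapping an (n_cells, d) latent array to
                per-cell pseudotimes along the inferred trajectory
    Returns the standard deviation of pseudotime across posterior draws.
    """
    draws = np.stack([
        project(mu + sigma * rng.standard_normal(mu.shape))
        for _ in range(n_samples)
    ])
    return draws.std(axis=0)
```

Cells whose draws land on very different positions get a large standard deviation, flagging their pseudotime estimates as unreliable.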
  • 45:42And finally,
  • 45:44this shows some gene expression changes
  • 45:49along the pseudotime order.
  • 45:50We look at some top marker genes
  • 45:55that change along the pseudotime order
  • 45:57for some paths in the whole trajectory structure.
  • 46:03And you can see,
  • 46:05here we separately fit the curves for the two datasets,
  • 46:11but you can see that they overlap
  • 46:13with each other quite well.
  • 46:15So that's also evidence showing
  • 46:18that we have a good performance
  • 46:23in aligning the two datasets.
  • 46:28So the take-home message is:
  • 46:31first, we perform this model-based trajectory inference
  • 46:36to understand cell dynamics.
  • 46:38And second, our method
  • 46:41is a model-based approach:
  • 46:43we combine the mixture model
  • 46:54that defines the trajectory structure
  • 46:58with the variational autoencoder,
  • 47:00so that we can efficiently
  • 47:05solve the mixture model
  • 47:08and have enough flexibility to fit the data well.
  • 47:12And our trajectory inference approach features
  • 47:17the analysis of integrating
  • 47:19multiple single-cell RNA sequencing datasets.
  • 47:22And if you are eager to know more details,
  • 47:26we have a manuscript
  • 47:29already available on bioRxiv,
  • 47:33and we also have our package code for VITAE.
  • 47:38And that's all. Thank you.
  • 47:40And if you have any questions, I'm happy to answer them.
  • 47:44- Thanks Jingshu for this excellent talk.
  • 47:49I wonder whether the audience
  • 47:50has any questions for Jingshu.
  • 47:58Okay. So Jingshu, I have maybe a minor question.
  • 48:02I recall that in the model,
  • 48:04you have this extra term encouraging X,
  • 48:12the confounding covariates,
  • 48:13to be orthogonal to the latent factor, right?
  • 48:19And then there is a penalty term alpha.
  • 48:21So I wonder,
  • 48:25what's the motivation for including this term
  • 48:28instead of first removing the effect of X directly,
  • 48:33and in practice, how should we set alpha?
  • 48:39- So the thing is a bit tricky here
  • 48:44from a statistical point of view.
  • 48:45We want to remove these confounding effects, right?
  • 48:51But the fact is that
  • 48:53these confounding effects, the X,
  • 48:56are not exactly orthogonal to Z,
  • 48:59because, for instance, for the two datasets that I have,
  • 49:03we cannot say that the signal is completely orthogonal
  • 49:06to which dataset they come from,
  • 49:09because there are true biological differences
  • 49:12between the two datasets.
  • 49:13So here,
  • 49:17the problem is not completely identifiable,
  • 49:19but people do it in practice a lot.
  • 49:22We want to decorrelate Z and X to some extent,
  • 49:29so that we can remove
  • 49:32the batch effects that we do not want
  • 49:36but keep the true biological differences.
  • 49:39So I think the underlying assumption,
  • 49:41the big assumption, is that
  • 49:44the true biological differences are large enough,
  • 49:48so that we remove the smaller differences, the batch effects,
  • 49:53but can still pick up the true biological differences
  • 49:56to some extent.
  • 49:57So there's no guarantee that it will work,
  • 50:01but in practice, it works on a lot of datasets.
  • 50:06I think that's...
  • 50:08So we inherited this idea from a paper
  • 50:11by Nancy Zhang and her students
  • 50:16on removing batch effects.
  • 50:19So I think the idea is that
  • 50:21we hope it can work for a lot of datasets.
  • 50:25And the reason that we want to have this penalty
  • 50:28is that, if we don't add any penalty,
  • 50:32then, because this autoencoder is trained
  • 50:36by stochastic gradient descent,
  • 50:39it sometimes may not find the global optimal solution.
  • 50:44So if we don't encourage X and Z to be decorrelated,
  • 50:49it sometimes may give you a solution
  • 50:54where Z is still highly correlated with X
  • 50:58and the batch effects are still there.
  • 50:59And this alpha, I think in practice we set it to be 0.0,
  • 51:06so it's a very small penalty,
  • 51:11just some regularization.
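As a sketch of the kind of penalty being discussed (the exact form in the paper may differ): one simple way to encourage the latent Z to be decorrelated from the confounders X is to add alpha times the squared cross-covariance to the training loss:

```python
import numpy as np

def decorrelation_penalty(Z, X, alpha):
    """Penalty encouraging the latent factors Z to be decorrelated
    from the confounding covariates X.

    Z : (n_cells, k) latent representations
    X : (n_cells, p) confounding covariates (e.g. batch indicators)
    Returns alpha times the squared Frobenius norm of the sample
    cross-covariance, to be added to the autoencoder's training loss.
    """
    Zc = Z - Z.mean(axis=0)
    Xc = X - X.mean(axis=0)
    cross_cov = Zc.T @ Xc / len(Z)  # (k, p) sample cross-covariance
    return alpha * float((cross_cov ** 2).sum())
```

With a small alpha the penalty only nudges the optimizer away from solutions where Z still carries the batch labels, without forcing exact orthogonality, which matches the non-identifiability caveat above.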
  • 51:16- A small value, I guess, you mentioned.
  • 51:18- Yes, yes.
  • 51:21And if that does not work,
  • 51:22then in practice you can choose another alpha, try it,
  • 51:26and see if it gives you the results that you want.
  • 51:30- Right. So I guess my question is,
  • 51:36since it's not entirely a supervised problem,
  • 51:42I'm not sure how to check
  • 51:45what a good alpha is.
  • 51:48But if you tell me a small alpha
  • 51:51will be fine,
  • 51:51then I'll just take a small alpha.
  • 51:54- Yeah. I think the way you check it,
  • 51:57for example the way we check it here, is that
  • 52:02sometimes we have some reference cell types,
  • 52:04so you know roughly what you are doing.
  • 52:09So here, for example,
  • 52:10these reference cell types
  • 52:14are not used in the modeling approach;
  • 52:18they are for evaluating the performance.
  • 52:21And for some datasets,
  • 52:23we can look at marker genes and do some clustering,
  • 52:26so you know roughly which cluster is which,
  • 52:32and for two datasets, you can see
  • 52:34whether you can correctly merge the cell types
  • 52:37that are shared among the datasets
  • 52:41but keep the cell types
  • 52:42that are unique to each dataset.
  • 52:46For our trajectory inference it's slightly more complicated
  • 52:49because these cell types are not well separated.
  • 52:52So another way that we check our performance
  • 52:55is that, as you can see here,
  • 53:00for these projected cells we can correctly
  • 53:03order the collection days in the right order,
  • 53:06so that we know that we keep
  • 53:08some biologically meaningful signal there.
  • 53:13I think there can still be some bias. Yeah.
  • 53:18- Okay, great. Thanks.
  • 53:21So, thanks Jingshu again for this excellent talk.
  • 53:25And if you have any question you want to ask Jingshu
  • 53:28that you can't think of right now,
  • 53:31you can always email her offline,
  • 53:33and if you want to use her software,
  • 53:36I think she'll be more than happy to answer your questions.
  • 53:39- Yes. Yes.
  • 53:41(chuckles)
  • 53:42- So I guess that's all for today.
  • 53:46Thank you everyone for joining,
  • 53:47thank you Jingshu for being here,
  • 53:49and have a nice rest of the day.