# YSPH Biostatistics Seminar- Dr. Jingshu Wang

February 10, 2021## Information

Jingshu Wang, Ph.D.

Assistant Professor

Department of Statistics and the College

The University of Chicago

Tuesday, February 9, 2021

ID6188

To CiteDCA Citation Guide

- 00:04- Right, so I think while we're waiting,
- 00:06I'll just give a very brief introduction about Jingshu.
- 00:11Jingshu is an Assistant Professor from the Stats Department
- 00:14at University of Chicago.
- 00:17And today she's going to present some very exciting work
- 00:22on trajectory inference for the single cell data.
- 00:27I'm very excited to hear about her work.
- 00:31I think...
- 00:34Lets wait for two minutes
- 00:35and then we'll start with Jingshu's work.
- 00:40So if you have any other related questions about Jingshu
- 00:45before the talk start, you're free to ask as well.
- 00:48(chuckles)
- 00:51- Hi Lynn. Can you make me a co-host.
- 00:53- Oh right. Thanks for reminding me.
- 00:58Let me see.
- 01:14So one more minute to start.
- 01:42- So you can see my screen, right?
- 01:43- Yes. I can see your screen. Looks good to me.
- 01:48Maybe I'll hand it over to you now Jingshu I think,
- 01:52if (indistinct) late I think,
- 01:56they can ask questions if you miss any details.
- 01:59- Yes. Okay.
- 02:01So thanks everyone for coming
- 02:04and settling for the introduction invitation.
- 02:08Today I will talk about the Single-Cell RNA Sequencing Data
- 02:13and how we can learn the cell dynamics
- 02:16from the single-cell RNA sequencing.
- 02:23So the single-cell RNA sequencing is a relatively,
- 02:26is a newly development, newly-developed
- 02:30but also relatively mature technology
- 02:33for measuring the RNA expression levels in the cells.
- 02:38And the traditional microarrays or bulk RNA sequencing
- 02:43matches the gene expressions
- 02:45as the average across all cells in a tissue.
- 02:49However, a cell is made of many cells and, Oh, sorry.
- 02:54A tissue is made of many cells
- 02:56and the cell population is typically not homogenous
- 03:00and the cells can have different functions
- 03:02and different cell types.
- 03:04So in contrast, in single-cell RNA sequencing,
- 03:08we have measured the transcriptional profile
- 03:12in each individual cell.
- 03:14So we can expand this vector
- 03:16of gene expressions for a tissue to a matrix
- 03:19of the gene inspections in the cells.
- 03:23And each entry is a mattered RNA count
- 03:28for a particular gene or a particular cell.
- 03:31So that's...
- 03:32So the benefit is that we have no,
- 03:34we have a more detailed understanding of what is going on
- 03:40in the tissue.
- 03:41So the benefit of single-cell RNA sequencing,
- 03:44is that it can give you a relatively unbiased
- 03:48and complete picture of the cell population.
- 03:51And this is particularly useful
- 03:54when the cell population is complicated.
- 03:57For example when the cells are experiencing dynamic changes.
- 04:04And as an application of the method
- 04:06that I will introduce today in this lecture, in this talk,
- 04:11I will focus on the study of the mouse neocortex.
- 04:18This is a cartoon showing the migration and generation
- 04:23of the projection neurons in the mouse neocortex.
- 04:28Yeah, you guys see that this is quite a complicated process
- 04:33and there are still a lot of things that are unknown
- 04:37about the neuronal diversity and the mechanism
- 04:40of how the projection neurons are generated.
- 04:45And the goal,
- 04:45is that we want to use the single-cell RNA sequencing
- 04:49so that we can have a more complete understanding
- 04:52of this, the neuronal diversity and the neuron development.
- 04:57So you can see that here in this cartoon, this shapes,
- 05:03there are different shapes and colors,
- 05:06to represent different cell types in the neocortex,
- 05:12as the cells are experiencing the continuous dynamic changes
- 05:17actually in the real cell population,
- 05:19it is much complicated than that.
- 05:22There is not clear boundaries
- 05:25between different cell types and there may be...
- 05:28There even, it's not a clear definition of cell type.
- 05:33So, what we hope,
- 05:35is that we want to use single-cell RNA sequencing
- 05:38to first recover the trajectory of the dynamic changes
- 05:43or the developmental process
- 05:45that the cells are experiencing.
- 05:48So specifically we focus on two datasets.
- 05:52One data set, we name it as data set A.
- 05:56So this is a data set
- 05:58that is recently collected by my collaborator.
- 06:01And so we have samples...
- 06:05The cells from the mouse neocortex
- 06:08at six different embryonic days.
- 06:11And before our data,
- 06:13there is another dataset we call it, we name it data set B.
- 06:17And this dataset is a smaller dataset than ours
- 06:22but they are...
- 06:23They have also sequenced a very similar brain region
- 06:28of the mouses and they have a sequence of cells
- 06:32from four different embryonic days.
- 06:34So you can see that our,
- 06:37most of the days that are sequenced in our dataset
- 06:41and with the other dataset B, do not overlap.
- 06:44And so it would be beneficial if we can have with...
- 06:49If we can combine the two there datasets
- 06:51and so that we can make use of the cells from both studies.
- 06:56For instance, for our dataset,
- 06:59we don't have these cells from the day 11,
- 07:04which is quite important day.
- 07:06For example here, day 11 are the day that,
- 07:11there are projection neurons that are, beginning time,
- 07:17well, there are projection neurons that are generated.
- 07:21And so this E11 cells are sequenced from the other dataset.
- 07:26So it would be beneficial
- 07:28if we can perform a choice analysis of the two datasets
- 07:32and learn a shared developmental trajectory
- 07:35as these two datasets,
- 07:36are actually sequencing the same mouse brain region.
- 07:42So as you may have imagined, if we don't do anything,
- 07:47if we just concatenate the cells from two datasets
- 07:51and treat them as datasets from the same lab,
- 07:54then these two datasets actually will not,
- 07:57the cells will not merge
- 07:59because of the bash effects between the two datasets.
- 08:03Because these are from two labs
- 08:05and they have different sequencing machines
- 08:08so the cells become different,
- 08:10though they are coming from the same brain region.
- 08:14And this is a figure called the UMAP
- 08:17which is a two-dimensional projection
- 08:20of the high dimensional,
- 08:21observed single-cell RNA sequencing data
- 08:24so that we can have a visualization
- 08:27of the cell population.
- 08:28(clears throat)
- 08:29And using our marker which is called vitae
- 08:33that I will introduce later
- 08:35or we can merge the cells from two different sources.
- 08:41And as I will show later,
- 08:44we can also keep the uniqueness, the unique characteristics
- 08:49that only exist in one of the datasets.
- 08:51So we can keep the biological meaningful differences
- 08:55between the two datasets.
- 08:57And our method is actually not just,
- 09:01data integration approach.
- 09:03So what we can do, is that we can also simultaneously,
- 09:08learn a shared trajectory structure
- 09:12and we can at the same time do the disintegration
- 09:16or more generally correct for confounding effects
- 09:19such as the data source and other various like cell cycles.
- 09:25And in the...
- 09:27In this figure,
- 09:28the arrows show the direction of the developmental process
- 09:34and the line width represents the score for an edge.
- 09:38So it shows how confident we are in,
- 09:43in like whether there's a transition
- 09:45between the two states that the line connects.
- 09:55So our method actually belongs to a larger group
- 09:59of computational tools for single-cell RNA sequencing
- 10:03which is called the trajectory inference.
- 10:05So here we call it...
- 10:08So it is called trajectory inference
- 10:09that is different from statistical inference.
- 10:13So it's a computational tool
- 10:15so that we can understand in the, our cell lineage
- 10:20and the cell fate decisions in biological process,
- 10:25such as cell differentiation
- 10:27as what we have already seen in the mouse neocortex,
- 10:31and some other biological process,
- 10:33such as immune response, cancer expansion
- 10:36and many more are using single-cell RNA sequencing data.
- 10:41In general, the trajectory inference approaches,
- 10:44they will infer or they start with a,
- 10:50a type of the underlying trajectory structure
- 10:53and other methods,
- 10:55they will assume a specific type of the trajectory structure
- 11:00for the underlying developmental process,
- 11:03such as a linear structure, a linear topology
- 11:06or a bifurcating, a bifurcation or tree-like trajectory.
- 11:13And as the cell populations
- 11:14that we are trying to understand,
- 11:16become more and more complicated,
- 11:18recent methods also try to infer,
- 11:21the type of the trajectory structure
- 11:23from the observed single-cell RNA sequencing data.
- 11:27And whilst we have learned the trajectory structure,
- 11:30then this trajectory inference approaches,
- 11:33will computationally project
- 11:36and order the cells along the trajectory.
- 11:39And the right order of the cells along the trajectory,
- 11:43are called the pseudotime of the cells.
- 11:46So the trajectory inference,
- 11:48is also called the pseudotime analysis.
- 11:53And since...
- 11:55So the first trajectory inference method is proposed in 2014
- 12:01and since then it has become a very popular tool
- 12:05that are used in analyzing single-cell RNA sequencing data.
- 12:09And in this study, it calculates, it summarizes the number
- 12:15of single-cell RNA sequences studies
- 12:16that are published per month.
- 12:19And you can see that in recent years,
- 12:21more than half of the published
- 12:24single-cell RNA sequencing studies
- 12:26will have some investments of the pseudotime
- 12:29and trajectories in the cell population
- 12:31that they are investigating.
- 12:35And there has also been a lot of methods
- 12:39for trajectory inference.
- 12:41And in this,
- 12:45there is a comprehensive benchmarking paper,
- 12:48recently in "Nature Biotech"
- 12:50and it has summarized 70 different trajectory methods.
- 12:55And in your paper they have compared
- 12:57about 45 different trajectory inference methods
- 13:00from different aspects.
- 13:03So you may wonder,
- 13:04since there are so many trajectory inference methods
- 13:07that are already there, why do we still want to develop,
- 13:11a new trajectory inference method?
- 13:14So the first point is that,
- 13:17although we have 70 different methods,
- 13:19many trajectory inference methods,
- 13:22they are assuming a specific type
- 13:24of the trajectory structure.
- 13:26So many methods only work for a sound developmental process.
- 13:31If you consider the methods that can work for...
- 13:34They have the flexibility
- 13:35if you work for a variety of the trajectory structures
- 13:39then we don't have that many methods that are available.
- 13:43And another concern that I have is that most methods,
- 13:48these trajectory inference methods,
- 13:51do not have explicit statistical models.
- 13:55So what I mean is that,
- 13:56though people are kind of clear
- 13:59about what's the biological signal
- 14:01that we want to find in the trajectory inference,
- 14:06it is actually, many methods are actually pretty vague about
- 14:11from the aspect of like for the single-cell data matrix,
- 14:16what can be the definition of the trajectory
- 14:19that they want to infer.
- 14:21So, and how that they are generating,
- 14:25and how the data it can be modeled and generated
- 14:28with the trajectory structure.
- 14:30So as the statistician, I think it would be beneficial,
- 14:34if we have a model-based trajectory inference approach,
- 14:38so that we can better understand the profit,
- 14:41how good our estimations are
- 14:44and have some certain qualification
- 14:46of the trajectories or slow times that we infer.
- 14:54And the third point is that
- 14:56as you have shown at the beginning,
- 14:59there is also a growing need,
- 15:01to efficiently align trajectories
- 15:05or do a joint analysis
- 15:07from multiple single-cell RNA sequencing datasets.
- 15:10As the, as the studies...
- 15:14As the single-cell RNA sequencing datasets are expanding,
- 15:18there has already been a lot of studies for datasets,
- 15:21they are for the same tissue or for the same cell type.
- 15:25And it will be...
- 15:26(clears throat)
- 15:27And we can learn a better picture of, on this,
- 15:31the biological process in the tissue or for the cell time,
- 15:35if we can use all available datasets.
- 15:39And so there's a strong need,
- 15:41an increasing need to do this joint trajectory analysis
- 15:46for multiple datasets.
- 15:50So because of these reasons,
- 15:52we develop a new statistical framework and a new method,
- 15:57and we call it VITAE,
- 15:59which is short for variational inference
- 16:02for trajectory by autoencoders.
- 16:05And it is a model-based trajectory inference approach.
- 16:11So our model starts with a definition
- 16:15of the trajectory backbone.
- 16:18So we use a graph to define the trajectory backbone.
- 16:22So we start with a complete graph G,
- 16:25well the vertices are the distinct cell states and cell type
- 16:31and an edge denotes a possible transition
- 16:34between two cell states and or cell types.
- 16:39And then we can define a cell position on the graph
- 16:43which is a vector, which is a landscape vector.
- 16:47And it's...
- 16:48A K is the number of vertices on the graph,
- 16:51in the graph.
- 16:52So if a cell is exactly,
- 16:55belongs to one cell state or cell type,
- 16:58then it is on cell vertex.
- 17:01And if the cell is experiencing a transition
- 17:05between two cell states or cell types,
- 17:08then we denote it as on the edge between two vertices.
- 17:16And then we can define the trajectory backbone
- 17:18as a subgraph of G.
- 17:20So we only include an edge or vertex
- 17:23if we really observe cells that are on the edge.
- 17:27So though there are many possible transitions
- 17:30between the cell types or cell states,
- 17:33We believe, we include a transition only
- 17:38when we do observe cells
- 17:39that are experiencing such transitions.
- 17:42So though there can be many edges in our complete graph,
- 17:48on the sub graph, it's a sparse success of the other edges
- 17:52that are possible on the graph.
- 17:57And a benefit of the above definition,
- 18:01is that we can allow any types of the trajectory structure.
- 18:05So it can be either a linear structure,
- 18:07a bifurcate chain or a tree-like structure or a cycle,
- 18:12it completely depends on how the data shows.
- 18:16And so we allow,
- 18:18we want the data to automatically determine the other,
- 18:24the trajectory structure or topology
- 18:26of the underlying dynamic process.
- 18:33And we can also define the pseudotime for each cell.
- 18:38I have not written down the exact definition here
- 18:43but the idea is that we first need a root vertex.
- 18:46So a root vertex is the start of this dynamic process.
- 18:51And it can be given by the user,
- 18:54depending on looking at the marker genes
- 18:57or other side biological information.
- 19:00And later we will also...
- 19:02I will also show you that for some datasets,
- 19:04we can automatically determine the root vertex.
- 19:08And with a given root vertex,
- 19:09then the graph becomes a directed graph.
- 19:13And we had defined the pseudotime of the cell
- 19:17as the shortest path from the root to a specific cell
- 19:22along the trajectory, along the trajectory backbone.
- 19:28So this graph defines the trajectory structure.
- 19:33And the next step,
- 19:34is that we want to link the trajectory structure
- 19:38with the data generation model.
- 19:41So the single-cell RNA sequencing data matrix,
- 19:46is typically a high dimensional matrix
- 19:50because for each cell,
- 19:51we typically observe tens of thousands of genes
- 19:54and there are also complicated dependency relationships
- 19:59among the genes.
- 20:01And what we assume,
- 20:02is that we assume that these dependencies across genes,
- 20:06can be explained by a latent variables, Z(i)
- 20:10in a low dimensional space.
- 20:13And we assume that these latent variables,
- 20:18are following our normal distributions
- 20:20and they also have the graph structure.
- 20:25So U here, are the positions of the vertices on the graph
- 20:29in this low dimensional space,
- 20:32and the meaning of Z(i),
- 20:34is a linear combination of these vertices,
- 20:38depending on, of the positions of the vertices,
- 20:41depending on the cell's graphic position on the graph.
- 20:46And,
- 20:49what I want to emphasize here is one point,
- 20:52is that we assume a non-linear marking
- 20:55from the latent space to the high dimensional observed data,
- 20:59because we think that though in the low dimensional space,
- 21:03we can represent the trajectory as these linear lines,
- 21:07it is very likely a manifold on the observed data.
- 21:10So this non-linear mapping,
- 21:12can map this linear lines to hertz
- 21:16in the high dimensional space.
- 21:18And now to consider,
- 21:20to account for the confounding covariates,
- 21:23such as the data source or cell cycle,
- 21:27we also allow this non-linear mapping,
- 21:30to depend on this covariates.
- 21:33And here we are...
- 21:35Because the observed data count,
- 21:37we assume it follows an active binomial distribution,
- 21:40and L(i) are...
- 21:41Oh, sorry. L(i) here should be known library size.
- 21:44Sorry for the typo. It should be known library sizes.
- 21:46And CRJ, and the CRG, the dispersion parameter switch gene,
- 21:51are unknown parameters.
- 21:53And so in this, in the current model,
- 21:55the unknown parameters we have,
- 21:57are these cell, the vertex positions,
- 22:00U, the cell positions on the graph,
- 22:03the W(i) the non-linear mapping
- 22:06and this unknown dispersion parameters.
- 22:08So we have a lot of parameters.
- 22:11So to further simplify our estimation,
- 22:15we assume that there are a mixture prior
- 22:18on the cell positions.
- 22:20So it's a very tactical idea.
- 22:23So we assume that first the cell,
- 22:27there are some latent variables, Ci for each cell.
- 22:30And so Ci determines which edge or vertex a cell chooses.
- 22:35So the cell has some probability to choose a specific edge
- 22:39or a specific vertex,
- 22:42and if it chooses an edge,
- 22:46then, it then eventually choose the location of,
- 22:51the relative location on the edge..
- 22:55So that's what becomes a mixture prior on this diagram two.
- 23:00And,
- 23:02for these non-linear mapping functions,
- 23:06we're including non-linear mappings.
- 23:10We model these F(G(i)) functions by a neural network.
- 23:17So, Oh, do you?
- 23:18So our parameters now, our known parameters now are,
- 23:22this U the positions of the vertices
- 23:24on the low-dimensional space,
- 23:27this PI, the probability of each vertex and edge,
- 23:34and the non-linear mappings,
- 23:37which are the waste in the neural network
- 23:39and this, this dispersion parameters.
- 23:42And we...
- 23:43I space these parameters by combining our mixture model
- 23:47with a variational autoencoder.
- 23:51So the variational autoencoder.
- 23:53So the autoencoder has been a very popular model
- 23:58for in deep learning.
- 23:59So what it can do is, is that it can,
- 24:02can have some non-linear mapping
- 24:07of the observed data to a low-dimensional space
- 24:11and we want the low-dimensional space to best,
- 24:14recover our observed time rational data.
- 24:18And here, we also have such a task.
- 24:21We have the low-dimensional space
- 24:23and we want to best to recover our observed data.
- 24:27And what's different is that we also have a prior
- 24:31on the latent space, because we have the prior,
- 24:34so we use the variational autoencoder model
- 24:38in deep learning.
- 24:39So the classical variational autoencoder
- 24:43in deep learning, we'll assume that the latent space,
- 24:46has the standard normal distribution as the prior.
- 24:50And here we just modify it so that we have the...
- 24:53So that the latent space have the mixture prior
- 24:56that are assumed in our previous mixture models.
- 25:02And we use the same approach
- 25:04as the variational autoencoder, the variational path
- 25:08which is, though the,
- 25:10to approximate the posteriors of the latent space.
- 25:13So though, because of the complicated priors
- 25:16and non-linear mappings,
- 25:19this prior, the posterior of the latent space conditional
- 25:23on the observed data and the confounding covariates,
- 25:27it can be complicated,
- 25:28we approximate it by our normal distributions
- 25:32and the mean and the variances of the normal distributions
- 25:36which are functions of the observed data Y(i)
- 25:40and the covariance Xi,
- 25:41are also non-linear functions
- 25:44and we model them by the neural network
- 25:46and that's the encoder.
- 25:47So the decoder is the nominal mapping function F(G(i)),
- 25:53mapping the latent space to the observed data.
- 25:56And encoder are neural networks
- 25:58that are approximating the posteriors of the latent space.
- 26:03- Hi, Jingshu, can I ask a very quick question?
- 26:07If I understand correctly,
- 26:08up to now you have not used the time information,
- 26:11is this true?
- 26:13Or you have considered to include the time information
- 26:16in the covariate X?
- 26:17- Oh, oh, yes.
- 26:19So we will not use time information
- 26:24in our trajectory inference.
- 26:26If I have time, I may have a last slide which is a,
- 26:31which is a review of the literature into data inference.
- 26:34So into data inference,
- 26:37the most commonly used and best performing methods,
- 26:41they will tend to not use the time information
- 26:44because there is...
- 26:45Though the time information is typically correlated
- 26:49with developmental like, timing of the cells,
- 26:54but because at each collectional time,
- 26:57is a mixture of cells at different environmental time.
- 27:00So it's a big, complicated relation
- 27:03and some methods use that information
- 27:05but many methods do not use that.
- 27:09And so our approach is the methods that do not use the,
- 27:14the collection time information.
- 27:16And, we use it only when we decide
- 27:22which vertex is the root.
- 27:24And I will show that later.
- 27:27- Thanks.
- 27:30- So our...
- 27:33So the last function is, is composed of three parts.
- 27:37The first part is this likelihood based reconstruction loss.
- 27:42So this evaluates how goods are our latent spaces to,
- 27:48to reconstruct the high-dimensional observed data.
- 27:54And the second part is the KL divergence
- 27:57between the posterior distribution and the prior.
- 28:02And you can think of it as a regularization term.
- 28:06And so to regularize
- 28:07if the posterior is very far away from the prior,
- 28:12also when for variational autoencoders so beta equal to one,
- 28:17it can be also viewed as a lower bound
- 28:21of the marginalized data.
- 28:23So here we make the,
- 28:27we add the training parameter beta,
- 28:30and in practice we said, beta are larger than one,
- 28:34so that we can encourage the posterior, the regularization,
- 28:39so that the posteriors of the I will,
- 28:41are more likely to tend to align
- 28:44along the edges and vertices.
- 28:45And that's the idea that has been used in deep learning
- 28:48which is called the beta.
- 28:53And the third term,
- 28:55it's a term for adjusting for the covariance.
- 28:58So the covariance...
- 29:01So this covariance * covers.
- 29:03So we want our latent space C to be kind of,
- 29:08be correlated with the covariance,
- 29:10While...
- 29:11So we, certain...
- 29:14So we want to maximize the reconstruction of the data
- 29:18by only by the covariates.
- 29:20And setting the tuning parameter alpha larger than zero,
- 29:24we can help decorrelate Z(i) from Xi.
- 29:31So another art in our training is that,
- 29:35we need a good internalization of the graph.
- 29:38So specifically we need to determine,
- 29:40how many vertices there are,
- 29:42and also the positions of the vertices
- 29:45in the low-dimensional space.
- 29:47That's not an easy job.
- 29:49And if we just randomly,
- 29:52because our final graph depend on,
- 29:56the total number of vertices that we set at the beginning.
- 30:00So how we pretrain the model to return it's initial value?
- 30:05To get the initial values of the unknown parameters,
- 30:08is that we first trained with beta equal to zero,
- 30:12so that we don't make use of any,
- 30:14these prior distributions of the I.
- 30:17So it's only...
- 30:18We're only looking at the reconstruction loss
- 30:20from the likelihood of the data.
- 30:22So it's like the normal, the classical autoencoder.
- 30:27And from that we can get some initial estimate of Z(i),
- 30:33The latent space variables.
- 30:34And then we perform clustering on Z(i),
- 30:38and we let the clustering algorithm,
- 30:41to automatically determine the number of clusters
- 30:43and we use that as the number of vertices.
- 30:46And we also use the cluster centers
- 30:49as the initialization, as the initial values of U.
- 30:53So that's the main part,
- 30:55the key ideas in our pre-training start,
- 30:57so that we can have a good initial addition to,
- 31:01for the start of the training.
- 31:05And another trick that we have taken,
- 31:08is that in practice, sorry,
- 31:10the best performing existing trajectory inference methods.
- 31:16They will attempt...
- 31:17So they are typically very fast.
- 31:20And in order to have comparable computational costs
- 31:24of these methods,
- 31:25we also have accelerated version of our algorithm
- 31:28which is a simply to reduce the input,
- 31:34the dimension of the input space,
- 31:36so we can replace Y(i),
- 31:39the high-dimensional vector of the gene expressions
- 31:44with it's principal components.
- 31:46Now, principal component, principal scores,
- 31:50which is a low-dimensional vector L.
- 31:53We, by default we will take the first 64 dimensions
- 31:59for the principal scores.
- 32:01And so we replace the elected binomial distribution
- 32:05by a normal gaussian distribution assumption
- 32:09of these principal scores.
- 32:11And as you will see later in our,
- 32:15in our evaluations with real and synthetic data,
- 32:19we see that we actually have comparable performance
- 32:24with our previous likelihood,
- 32:26with our standard likelihood based methods
- 32:30for this accelerated version.
- 32:35So after the final step is that
- 32:38after the training the autoencoder,
- 32:40we have approximated distributions,
- 32:44posterior distributions of the latent space
- 32:47and also the cell positions that
- 32:52and which vertex,
- 32:54the vertex or the posterior distribution of Ci,
- 32:58well Ci* is which vertex or edge the cell is from.
- 33:04And we need to use those information,
- 33:07to determine the trajectory backbone
- 33:10and to project each cell
- 33:13on our inferred trajectory backbone.
- 33:17So how we do that is, first we calculate an edge score.
- 33:22So this edge score is...
- 33:27So we have different scores for an edge,
- 33:30and that is determined
- 33:32on looking at the posteriors of cells.
- 33:35How many cells from the posterior distribution?
- 33:38How many cells choose to lie on that specific edge?
- 33:43If there are a lot of cells then that means
- 33:45that it's very likely that edge exist.
- 33:47If there are very few cells
- 33:49then very likely that edge should not be,
- 33:52the edge that is included in the trajectory backbone.
- 33:58And the denominator is that we want to,
- 34:03give a relatively high fair score
- 34:06for the edges of that connecting to small clusters,
- 34:11to small cell types,
- 34:13such as we want to also capture the transition
- 34:17between two rare cell types.
- 34:20So that's why we have this regularization,
- 34:25waiting by with, in the denominator.
- 34:29And we include an edge in the trajectory backbone
- 34:32if it's edge score is larger than some*.
- 34:38And when we have an inferred trajectory backbone,
- 34:43then the next step is,
- 34:45we want to project the cells on the inferred trajectory,
- 34:50and we do it by looking at the...
- 34:54Based on the posterior, approximated posterior distributions
- 34:58of the cell positions that *.
- 35:01We want to find the closest point on the inferred trajectory
- 35:08for this cell based on the posterior distributions.
- 35:12And the distance,
- 35:14this expectation we can also use as a evaluation
- 35:18of the, some uncertainty quantification
- 35:21of this projection
- 35:24or the cell positions.
- 35:27And the third thing we need to,
- 35:29because our final results is some kind of directed graph.
- 35:33So we need to determine the root vertex.
- 35:36So the root vertex can be either given by the user, or if
- 35:43as I feel I asked,
- 35:44for some datasets like the data at the beginning of my talk,
- 35:48the cells are collected in the time series,
- 35:52and we can make use of that time series,
- 35:54to determine the root vertex.
- 35:58The rough idea is that for each vertex,
- 36:01we can calculate a fraction time score,
- 36:03which is an average of the cells
- 36:06that belong to the vertex
- 36:07or projected on the edge that connects to the vertex,
- 36:11depending on the distance from the cell to the vertex.
- 36:15And so we can have some vertex collection time,
- 36:20collection time score for each vertex.
- 36:21And we choose the root vertex
- 36:25as the vertex that has the smallest collection time score.
- 36:32And with the roots and with our inferred trajectory,
- 36:35it's straightforward to calculate the pseudotimes
- 36:38for each score.
- 36:40So that's the whole process
- 36:42of our model-based methods for trajectory inference.
- 36:48And now let's look at some benchmarking results.
- 36:52So first we...
- 36:54So our benchmarking includes both some real datasets
- 36:58and some synthetic datasets.
- 37:00And we follow the...
- 37:02Most of the benchmarking follows the same framework
- 37:05as this well known benchmarking paper
- 37:09in the "Nature of Biotech" in 2019.
- 37:12And our datasets are selected
- 37:14as a subset of the datasets that they have tried
- 37:18and our criteria is that these datasets,
- 37:19maybe have enough number of cells not too few cells.
- 37:25And we wanted to cover different types of topologies.
- 37:33And this is the benchmarking results.
- 37:37So the columns, sorry.
- 37:39So the rows are the different datasets
- 37:43that I have mentioned.
- 37:45And we compare five different methods.
- 37:50So we have, would come first compare two versions
- 37:52of our approach.
- 37:53Vitae one as the original elected binomial likelihood base.
- 38:00Vitae and accelerated version,
- 38:03replacing the gene expression vectors
- 38:07by principal scores.
- 38:11Then we compare it with three different,
- 38:13state of the arch trajectory inference methods.
- 38:19The monocle PAGA.
- 38:20So the monocle series is from the lab that,
- 38:24who developed the first trajectory inference methods,
- 38:27and there, and then they further,
- 38:30the first take monocle for one
- 38:32and now they have monocle three.
- 38:33So the monocle series are always commonly used
- 38:36in these single-cell RNA sequencing papers.
- 38:40And two, I expect the performing,
- 38:44trajectory inference methods in the benchmarking paper,
- 38:48the PAGA and Slingshot.
- 38:51And all these methods,
- 38:55do not use this time information explicitly.
- 38:59So it's a fair comparison between these methods.
- 39:04And so the...
- 39:07And for all the methods they are given to by those.
- 39:11We give them the two number
- 39:13of clusters or the vertices to start from,
- 39:19and the two root vertex.
- 39:22And we, and each column is,
- 39:25we compare it's measurement, it's metric
- 39:30for the evaluation of the performance of each method.
- 39:36So the first two columns,
- 39:37are the matrix for recovery of the trajectory topology
- 39:42or the trajectory structure.
- 39:46And next two columns are the evaluation
- 39:49of the cell position, estimation accuracy.
- 39:53And the last metric is
- 39:55for evaluating the pseudotime accuracy.
- 39:58And a larger score means a better performance,
- 40:01and a lower score means like, a worse performance.
- 40:06So you can see that, our approach first is,
- 40:11our approach has much better performance
- 40:13in recovery of the trajectory topology.
- 40:18We also have some benefits of the cell position estimates,
- 40:23and because of both,
- 40:24we have a better performance in the pseudotime accuracy.
- 40:29And the other thing you can see is that our,
- 40:33our accelerated version have comparable,
- 40:36slightly worse but comparable performance,
- 40:40compared to the, our likelihood based vitae.
- 40:46And though it has a much quicker,
- 40:48much less computational cost.
- 40:54So finally,
- 40:56let's come back to the case study on mouse neocortex.
- 41:02So this is the,
- 41:05the visualization of merging the raw data.
- 41:10And this is the performance of our methods.
- 41:14And for comparison, we compare is, another very popular use,
- 41:21data integration method called Seurat.
- 41:24So Seurat is the software, the most often used software,
- 41:29for single-cell RNA sequencing analysis.
- 41:31Their lab have different,
- 41:32have developed a series of computational tools
- 41:35for analyZ(i)ng the single-cell RNA sequencing data.
- 41:38And this is from their integration methods.
- 41:40So you can see that both methods can,
- 41:44is able to integrate the both two datasets,
- 41:50but for some details, I think,
- 41:52because we are assuming this trajectory structure,
- 41:55we have a slightly better performance.
- 41:57For example, this group of cells are the layer one neurons,
- 42:01where the group of here, are here in Seurat.
- 42:07So you can see that because they come from,
- 42:09because the outer layer parts and the layer parts,
- 42:12come from two datasets.
- 42:16Because as I mentioned earlier in dataset B,
- 42:20they have collected cells from, at day 11.
- 42:24So this are, we can take a look
- 42:27of the collection days of each cell.
- 42:32So you can see that these cells, they are,
- 42:34the layer one parts come from day 11,
- 42:36And the rest parts is a mixture
- 42:39of cells from both two datasets.
- 42:42And by the way they all belong to the layer one.
- 42:43So we know that they belong to layer one
- 42:45by looking at the marker genes expression
- 42:48which I did not show here.
- 42:50So it's because we encourage the cells to align together
- 42:55if they are along the address, if they are similar cells.
- 43:03And you can see here, that's,
- 43:06so the two datasets, they have this interpolation
- 43:10of the pseudo, of the collection time.
- 43:14And you can see for example,
- 43:16for these projected cells,
- 43:18we can see this continuous positions,
- 43:24like alignments of the cells of different days
- 43:27from so the most the dark is the cells from day ten.
- 43:34And the red ones are the cells from day 18
- 43:36and even days are, are from dataset A,
- 43:40and odd days are from dataset B.
- 43:42So you can see that though they're coming
- 43:43from two different sources,
- 43:46we can, we are able to align them in the right order.
- 43:54And,
- 43:56and as another comparison.
- 43:58So we compare our estimation of shared trajectory,
- 44:03with another partisan approach
- 44:06which is we're first to do data integration with Seurat
- 44:12and then we can use Slingshots,
- 44:14to perform trajectory inference on the integrated data.
- 44:19And you can see that this,
- 44:21we have a much cleaner trajectory structure.
- 44:24And we also have a comparable computational cost.
- 44:29So Seurat and Slingshots,
- 44:31they cannot be, they do not need regularization.
- 44:34And with one CPU, they, it takes about 12 minutes.
- 44:38And for our accelerated VITAE,
- 44:42generating this figure, we have,
- 44:45we can, we take about three minutes
- 44:47on one GPU port at about 10 minutes on eight CPU cores
- 44:52which is, the eight CPU cores are like currently,
- 44:55like most of our laptops,
- 44:57but we'll have such computational resources.
- 45:00So we have comparable computation cost
- 45:04with this state of our methods.
- 45:07And in addition, because we are...
- 45:11Based on this approximated posterior distributions
- 45:14we also have some kind of uncertainty quantifications
- 45:19on the cell positions.
- 45:20For example, here, it shall say some parts of the cells,
- 45:24these cell positions
- 45:26along the trajectory are not very reliable.
- 45:32And that will help us to evaluate our, the estimate,
- 45:37how we think our estimate in pseudotime.
- 45:42And finally,
- 45:44this is showing some gene expression change
- 45:49along the pseudotime order.
- 45:50And, and we look at some top markers
- 45:55that are changing along the pseudotime order
- 45:57for some trajectories in the whole trajectory structure.
- 46:03And you can see,
- 46:05here we separately fish the curve for two datasets,
- 46:11but you can see that they overlap
- 46:13with each other quite well.
- 46:15And so that's also an evidence showing
- 46:18that we can have a good,
- 46:23a good performance in aligning the two datasets.
- 46:28So the take home message is,
- 46:31first we perform this model-based trajectory inference,
- 46:36to understand cell dynamics.
- 46:38And our, the second is our methods.
- 46:41So our method is a model-based approach.
- 46:43We can combine the mixture prior model, Oh, sorry.
- 46:50We can combine the collected mixture structure
- 46:54for defining the trajectory structure
- 46:58with the variational autoencoders
- 47:00so that we can efficiently,
- 47:05efficiently solve the mixture model
- 47:08and have enough flexibility to fit the data well.
- 47:12And so our,
- 47:14trajectory inference approach features,
- 47:17the analysis of integrating
- 47:19multiple single-cell RNA sequencing datasets.
- 47:22And if you are anxious to know more details,
- 47:26we have our paper, a manuscript,
- 47:29a manuscript already available on bio archives
- 47:33and we also have our package codes on VITAE.
- 47:38And that's all. Thank you.
- 47:40And if you have any questions, I'm happy to answer them.
- 47:44- Thanks Jingshu for this excellent, excellent talk.
- 47:49I wonder whether the audience,
- 47:50have any questions for Jingshu.
- 47:58Okay. So Jingshu I have some maybe minor questions.
- 48:02I recall that in the model,
- 48:04you have this actual term encouraging X,
- 48:07the data explained by X the covariates,
- 48:12the confounding covariates,
- 48:13to be orthogonal to the leading factor, right?
- 48:19And then there is a penalty term alpha.
- 48:21So I wonder,
- 48:25what's the motivation for you to including this term
- 48:28instead of first removing the effect from the i directly.
- 48:33And in practice, how should we have set alpha?
- 48:39- So, so, so, so the thing is a bit tricky here
- 48:44from a statistical point of view.
- 48:45So we want to remove these confounding effects, right?
- 48:51But the other fact is that,
- 48:53these confounding effects, the X,
- 48:56are not exactly orthogonal with Z,
- 48:59because for instance for the two datasets that I have,
- 49:03we cannot say that the signal is completely orthogonal
- 49:06to each dataset they have come from
- 49:09because there are two biological differences
- 49:12between the two datasets.
- 49:13So, so here,
- 49:17so here, the problem is not completely identifiable
- 49:19but people do it in practice a lot.
- 49:22So we want to kind of decorrelate Z and X to some extent
- 49:29so that we can remove,
- 49:32remove the batch effects that we do not want
- 49:36but keep the two biological differences.
- 49:39So I think the underlying assumption
- 49:41the big assumption is that we are assuming
- 49:44that the two biological differences are large enough
- 49:48so that compared to the...
- 49:50So we remove the smaller differences, the batch effects
- 49:53but we can still pick the two biological differences,
- 49:56to some extent.
- 49:57So there's no guarantee that it will work,
- 50:01but in practice, it work on a lot of datasets.
- 50:06I think that's...
- 50:08So we will inherit this idea from this paper
- 50:11by Nancy Huang and her students.
- 50:16So I think in removing the batch effects.
- 50:19So I think the idea is that,
- 50:21we hope that it can work for a lot of datasets.
- 50:25And the reason that we want to have this penalty,
- 50:28is that if we don't add any penalty,
- 50:32then, because this autoencoder is trained by this,
- 50:36by this stochastic gradient descent.
- 50:39So sometime it may not find the optimal global solution.
- 50:44So if we don't encourage it, the X and Z to be decorrelated,
- 50:49it sometimes may not be able,
- 50:51it may give you a solution that is not,
- 50:54that's the Z still are highly correlated with X
- 50:58and the batch effects are still there.
- 50:59So then this alpha, I think in practice we set it to be 0.0,
- 51:06so it's a very small penalty so that we can put some,
- 51:11some kind of penalty to regularize that.
- 51:16- Small amount, I guess you mentioned.
- 51:18- Yes, yes.
- 51:21And it may be that does not work,
- 51:22and then in practice you can choose another alpha and try it
- 51:26and see if it gives you the best results that you want.
- 51:30- Right. So, so I want...
- 51:32I guess, right.
- 51:34So I guess my question is like,
- 51:36since it's not entirely a supervised problem,
- 51:40like how, right.
- 51:42So I'm not sure how to check,
- 51:45what is a good alpha in the sense,
- 51:48but if you tell me like a small alpha,
- 51:51well you're going to be fine
- 51:51then I just take it to be small alpha.
- 51:54- Yeah. I think the way you check it is that,
- 51:57for example the way we check it here is that,
- 52:02so sometimes we have some referenced cell types,
- 52:04so that, you know roughly what you are doing.
- 52:09So here, for example,
- 52:10here this reference cell types, these are,
- 52:14these are not used in the modeling approach,
- 52:18so these are for evaluation the performance.
- 52:21And for some datasets, you can't,
- 52:23we can mark our genes and do some class points,
- 52:26so, you know, roughly like which though,
- 52:29(indistinct)
- 52:32and for two datasets, you can see,
- 52:34like whether you can correctly merge the cell types
- 52:37that are shared among the datasets
- 52:41but keep the cell types
- 52:42that are unique to different cells.
- 52:46Then for our trajectory inference is slightly complicated
- 52:49because these cell types are not well separated.
- 52:52So another way that we can check our performance,
- 52:55is that you can see here that we can correctly,
- 53:00for these projected cells we can correctly like,
- 53:03order the wrong days in the right order.
- 53:06So that we know that we keep
- 53:08some biological meaningful signals there.
- 53:13I think there still can be some bias. Yeah.
- 53:18- Okay, great. Thanks.
- 53:21So, thanks Jingshu again for this excellent talk.
- 53:25And if you have any question you want to ask Jingshu
- 53:28that you cannot think about for now,
- 53:31you can always email her offline
- 53:33and if you want to use her software,
- 53:36I think she'll be more than happy to answer your question.
- 53:39- Yes. Yes.
- 53:41(chuckles)
- 53:42- So I guess that's all for today.
- 53:46Thank you everyone joining.
- 53:47Thank you Jingshu for being here,
- 53:49and have a nice remaining day.