
YSPH Biostatistics Seminar: "Exploring Space and Time for Identifying Gene Interactions Using Single-cell Transcriptomics"

October 05, 2021
  • 00:00<v ->Today it is my honor to introduce,</v>
  • 00:02Dr. Atul Deshpande.
  • 00:04Dr. Deshpande is a postdoctoral researcher
  • 00:07in the lab of Dr. Elana Fertig
  • 00:09in the department of oncology,
  • 00:11at Johns Hopkins University.
  • 00:13He has a PhD in electrical engineering
  • 00:15from the University of Wisconsin-Madison,
  • 00:17and his interests include
  • 00:18the use of time series analysis
  • 00:20and spatial statistics
  • 00:21for modeling biological processes.
  • 00:24He's currently developing analysis techniques
  • 00:26to use single cell and spatial multi-omics
  • 00:28for the characterization of
  • 00:30the tumor microenvironment
  • 00:32and intracellular signaling networks.
  • 00:34Welcome. (students applaud)
  • 00:40<v ->Well, thank you so much.</v>
  • 00:41And once I figure out my...
  • 00:48Where my PowerPoint window is,
  • 00:49we can start in earnest.
  • 00:52Okay, yeah, thank you for the kind introduction.
  • 00:55So, I'm Atul Deshpande,
  • 00:57and today the title of my talk is exploring time
  • 01:01and space for identifying gene interactions
  • 01:04using single cell transcriptomics.
  • 01:07So, what do time and space mean
  • 01:10in the context of this talk?
  • 01:13So, they refer to recent technological advances
  • 01:15and the algorithms, which are the foundation
  • 01:17for the projects I will be talking about.
  • 01:20And the first advance is the ability
  • 01:24to measure gene expression in individual cells.
  • 01:27This in turn inspired development
  • 02:29of algorithms that order these cells along
  • 02:32a biological trajectory.
  • 01:34Using these algorithms, we can observe changes
  • 01:37in gene expression in
  • 02:39a pseudo temporal reference, or pseudo time,
  • 01:43which is a measure of the progress
  • 01:45of the biological process.
  • 01:48The second is a more recent ability
  • 01:50to measure gene expression
  • 01:52within the spatial context of the tissue.
  • 01:54With this, we can analyze changes
  • 01:56in gene expression
  • 01:58as cellular neighborhoods change,
  • 02:00or as the tissue type changes.
  • 02:07So, before single cell transcriptomics,
  • 02:10we would usually get one measurement
  • 02:12of gene expression from a collected sample.
  • 02:15And this is now retroactively called
  • 02:18bulk RNA-seq.
  • 02:22However, this measurement would just be
  • 02:26an average of the population of cells
  • 02:27in the sample, and it would obscure information
  • 02:31about the different cell types, or different
  • 02:34cell states in the population.
  • 02:36With single-cell RNA-seq,
  • 02:37we can now measure gene expression
  • 02:40in individual cells.
  • 02:41Depending on technology, this can range
  • 02:43from a few hundred cells up to hundreds
  • 02:46of thousands of cells.
  • 02:49And this allows us to observe
  • 02:51the full heterogeneity of the cell population
  • 02:56represented by gene expression.
  • 02:59And using this high dimensional data
  • 03:02that we now have,
  • 03:03we can characterize different cell types
  • 03:05and cell states as gene expression vectors.
  • 03:12So, one drawback of this technique
  • 03:14is the issue of technical dropouts.
  • 03:17Now, this is characterized by observing,
  • 03:21as in us observing a lot
  • 03:23of false zeroes, or zero-inflated measurements,
  • 03:26because we are unable to reliably measure
  • 03:29the low RNA counts in individual cells.
  • 03:35Now, the first project
  • 03:39that I will discuss uses
  • 03:43single cell RNA-seq technology,
  • 03:46or rather, it's downstream of that.
  • 03:49And it is also downstream of algorithms
  • 03:53which order single cell data into trajectories,
  • 03:58which represent the biology
  • 04:01that they might be studying.
  • 04:02For example, let's say
  • 04:04you have a dataset which corresponds
  • 04:09to stem cell differentiation.
  • 04:11There are probably now 70 different
  • 04:15trajectory inference methods, depending on what
  • 04:17kind of dataset you are studying,
  • 04:21what biology you want to study,
  • 04:23how big the dataset is,
  • 04:25or what the expected trajectory
  • 04:28of the biology that you're studying may be.
  • 04:30And they attempt to order these cells based
  • 04:34on the expression of potentially
  • 04:37a few key marker genes, or on which genes
  • 04:40are differentially expressed along
  • 04:43the biological process.
  • 04:46So, anytime you collect,
  • 04:48let's say a single cell RNA-seq data,
  • 04:51you would find a mix of cells,
  • 04:54and that was the entire motivation
  • 04:56for doing this.
  • 04:57But that mix of cells would have
  • 05:01a range of cell states,
  • 05:03which could correspond to anything
  • 05:07from the beginning of the biological process
  • 05:09to the very end of the biological process.
  • 05:12And what these algorithms are trying to do
  • 05:14is they're trying to fit these cells
  • 05:18in their right place, in the biological process.
  • 05:23And once we do that, we can actually observe
  • 05:25the gene expression along this ordering.
  • 05:30And a lot of these methods also assign
  • 05:34a pseudo time to each cell,
  • 05:35which tells you how far along in the biology
  • 05:39they think, or they hypothesize that the cell is.
  • 05:43And so, the question that we wanted
  • 05:45to ask is: given this pseudo temporal ordering
  • 05:50of the cells, which gives us
  • 05:54gene expression dynamics
  • 05:55in the pseudo temporal reference,
  • 05:58can we use these dynamics
  • 06:03to infer gene regulatory networks,
  • 06:07or any directed networks from say,
  • 06:10sets of genes to their targets?
  • 06:14And the second question
  • 06:14was whether the assigned pseudo time values help
  • 06:19us in the network inference task.
  • 06:25So, to make the, I guess,
  • 06:31explanation more approachable,
  • 06:34I will just use an example dataset.
  • 06:37And as I explain the concepts,
  • 06:41we will just see what they mean
  • 06:44in terms of this dataset.
  • 06:45So, this is a dataset from Semrau et al,
  • 06:50and this is single cell data
  • 06:53from retinoic acid-driven differentiation.
  • 06:57And in this, mouse embryonic stem cells
  • 07:01differentiate into neuroectoderm
  • 07:02and extraembryonic endoderm cells.
  • 07:06Now the data as collected had nine samples,
  • 07:10one before the differentiation starts
  • 07:12and one after every six hours.
  • 07:15So, you have data collected over 96 hours
  • 07:19from nine samples, and each sample has 384 cells.
  • 07:24So overall, you can do the math;
  • 07:27I believe we have something
  • 07:29like 2,600 cells or something like that.
  • 07:33So, we chose to apply
  • 07:37two trajectory inference methods to this.
  • 07:39So, the first one is Monocle 2,
  • 07:42which is also called Monocle DDRTree, I believe.
  • 07:45And the second one is PAGA Tree.
  • 07:47So, both of these methods identify
  • 07:50a bifurcating trajectory from these cells.
  • 07:53And so, the first one is to the left
  • 07:56where the embryonic stem cells are actually
  • 08:01on the right of...
  • 08:03I'm not sure if people can see my mouse pointer,
  • 08:07but yeah, they're on the right of the trajectory.
  • 08:09And then, towards the bottom left,
  • 08:14you go into a neuroectoderm state
  • 08:16and towards the...
  • 08:19Right, top left, you go into an endoderm state.
  • 08:24And on the right side, the way PAGA Tree
  • 08:27infers trajectory is you have
  • 08:30the embryonic stem cells on the top left.
  • 08:33And then, it identifies
  • 08:35a few more branches than Monocle does.
  • 08:39But both of these
  • 08:40identify branching trajectories.
  • 08:43And in each case we selected
  • 08:48the two branches
  • 08:49whose markers
  • 08:54ended up being high for neuroectoderm.
  • 08:56So, the sub-trajectories
  • 09:00from each method that we wanted to study
  • 09:03were from the embryonic stem cells to neuroectoderm,
  • 09:08using these two methods.
  • 09:11So, this as in, so we had...
  • 09:14We have these two trajectory inference methods,
  • 09:16which assigned their own pseudo times,
  • 09:18and this is the pseudo temporal expression
  • 09:23dynamics for the same gene.
  • 09:25I did not mark which gene it was, but yeah,
  • 09:29so this was for the same gene.
  • 09:30And you can see that the dynamics
  • 09:33that each of these trajectories gives
  • 09:35us is different.
  • 09:37First of all, the main branch,
  • 09:40or sub part of the trajectory that
  • 09:42we are considering has
  • 09:44a different number of cells.
  • 09:46And these cells may not necessarily be common
  • 09:48to both.
  • 09:49There will be some which are common
  • 09:50to both of these trajectories,
  • 09:52but some others which are completely different.
  • 09:54But also, the cell ordering itself,
  • 09:57based on whatever mathematics
  • 10:01or algorithms each method uses,
  • 10:05would differ between these two methods.
  • 10:08So, as you see, Monocle has a higher expression
  • 10:12much earlier in the pseudo time,
  • 10:14as opposed to PAGA Tree, which has much later.
  • 10:18And the pseudo times here
  • 10:20were not originally 0 to 100, they're just normalized
  • 10:22to 100 to represent progress from 0%
  • 10:25of the biology to 100% of the biology,
  • 10:31as inferred by that method.
  • 10:34So, now what are the challenges associated
  • 10:37with ordered single-cell data?
  • 10:39So, the first one is that unlike say,
  • 10:43stock data, or say weather data,
  • 10:47or something like that, you don't necessarily
  • 10:49have a uniform distribution of cells.
  • 10:54And if you're going to do a time series analysis,
  • 10:56that would mean that you do not
  • 10:57have regularly spaced time series,
  • 11:00but you actually
  • 11:01have irregularly space time series.
  • 11:03On top of that, the pseudo time values
  • 11:05that are assigned to the cells
  • 11:07and ordering stem cells is uncertain.
  • 11:13Now, finally, we recall that we had the issue
  • 11:17of zero inflated measurements,
  • 11:19or false zeroes in the meter
  • 11:21because of technical dropouts.
  • 11:26So, the question is how to overcome all
  • 11:29of these drawbacks
  • 11:32to try and find
  • 11:36networks from this time series data.
  • 11:40So, the project that we had,
  • 11:43it resulted in basically
  • 11:45an algorithm called SINGE,
  • 11:46which is Single-cell Inference
  • 11:48of Networks using Granger Ensembles.
  • 11:50So, this was done at the Morgridge Institute
  • 11:53for Research in Madison, Wisconsin.
  • 11:55And these are my collaborators on this project.
  • 12:01And let's see, okay.
  • 12:03So, the main concept that we build on
  • 12:06is basically the Granger causality test.
  • 12:08It was introduced by Clive Granger in the 1960s.
  • 12:14And to give a very simple example
  • 12:16of what it's trying to say is, let's say
  • 12:17if you have two times series X and Y,
  • 12:22now Granger causality tests, whether
  • 12:26the prediction of current values of Y
  • 12:28improves by using past values of X,
  • 12:31in addition to past values of Y.
  • 12:34And if that happens, then we say
  • 12:36that X Granger causes Y.
  • 12:38So, this is basically a lag regression
  • 12:41between X and Y.
  • 12:42So, this has had applications
  • 12:44in econometrics and finance,
  • 12:46and is also being used
  • 12:47in computational neuroscience and biology,
  • 12:51as noted in these examples here.
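
A minimal sketch in Python of the bivariate test just described, assuming two equally spaced numpy series x and y; it compares a lag regression of y on its own past with one that also uses the past of x. This is illustrative only, not the SINGE implementation.

```python
import numpy as np

def granger_improvement(x, y, lags=2):
    """Compare predicting y from its own past vs. its past plus the past of x.

    Returns the residual sums of squares of the restricted and full lag
    regressions; a clear drop suggests x "Granger causes" y.
    """
    T = len(y)
    rows = range(lags, T)
    past_y = np.array([y[t - lags:t][::-1] for t in rows])   # own lags
    past_x = np.array([x[t - lags:t][::-1] for t in rows])   # candidate's lags
    target = y[lags:]

    def rss(X):
        X = np.column_stack([np.ones(len(target)), X])       # add intercept
        beta, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ beta
        return float(resid @ resid)

    return rss(past_y), rss(np.hstack([past_y, past_x]))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = np.concatenate([[0.0], 0.8 * x[:-1]]) + 0.1 * rng.normal(size=200)
print(granger_improvement(x, y))  # the full model should fit much better
```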
  • 12:55Now, the multivariate Granger causality test
  • 12:58can be thought of as setting up and solving
  • 13:00a vector autoregression model,
  • 13:02where you have say, P genes, T time points
  • 13:05and L lags,
  • 13:06where L is telling you how many
  • 13:11lags of the past expressions
  • 13:14you're trying to model relationships with.
  • 13:16And once you have that,
  • 13:17you could think of solving this
  • 13:21VAR model by just minimizing
  • 13:24this objective function here.
  • 13:27And that would give you, I guess,
  • 13:28a few edges between the past values
  • 13:31of all of the genes and your target gene.
  • 13:34Okay, maybe I should have explained
  • 13:36this figure first.
  • 13:37So, you have all the regular,
  • 13:39all the possible regulators of a gene,
  • 13:42and then you have a target gene,
  • 13:43and you're trying to identify
  • 13:46which past values
  • 13:49of any of these genes explain
  • 13:51the current values of the target gene.
  • 13:55And if you wanted to have
  • 13:59a sparse representation of this network,
  • 14:02or have it
  • 14:03count only a few of the edges,
  • 14:05you would introduce this sparsity parameter,
  • 14:08which would ensure that the edges from say,
  • 14:12all of these genes to your target
  • 14:15are not numerous,
  • 14:16and you can explain the biology in a few edges.
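
For reference, the Lasso Granger objective being described can be written roughly as follows; this is a reconstruction from the description above rather than a quote from the SINGE paper. Here x_p(t) is the expression of target gene p at time t, a^l_{p,q} is the coefficient from candidate regulator q at lag l, and lambda is the sparsity parameter; an edge from q to p is reported when some a^l_{p,q} is non-zero.

$$
\min_{a}\;\sum_{t=L+1}^{T}\Big(x_p(t)\;-\;\sum_{q=1}^{P}\sum_{l=1}^{L} a^{l}_{p,q}\,x_q(t-l)\Big)^{2}\;+\;\lambda\sum_{q=1}^{P}\sum_{l=1}^{L}\big|a^{l}_{p,q}\big|
$$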
  • 14:22Now, to counter the irregularity
  • 14:27of the time series, we use
  • 14:30an idea called Generalized Lasso Granger.
  • 14:33So, what this does is,
  • 14:36I'm not sure, maybe I have...
  • 14:39Yeah, okay, so just to recall, right?
  • 14:44So, you have a pseudo temporal data,
  • 14:46which has irregular time series,
  • 14:48and you have missing values,
  • 14:51which show up as zeros here, right?
  • 14:54So, we want to adapt the Lasso Granger test
  • 15:00for irregular time series.
  • 15:02So, what was previously,
  • 15:05basically coefficients from older samples
  • 15:07in regular time series,
  • 15:09now becomes coefficients from just timestamps
  • 15:15in the past.
  • 15:16Because you might not necessarily have
  • 15:18a sample at that point.
  • 15:21Furthermore, we can rethink, basically,
  • 15:28the objective function: originally,
  • 15:33it was a dot product between
  • 15:35the coefficients and the values
  • 15:38of the gene expression,
  • 15:41and we rethink that as a weighted dot product,
  • 15:45where basically we...
  • 15:48And this is the description
  • 15:49of the weighted dot product, where you use
  • 15:51a Gaussian kernel to weight the inputs
  • 15:56to the product based on their proximity
  • 15:59to the timestamps that
  • 16:02correspond to these coefficients.
  • 16:04So, these ellipses here show kernels,
  • 16:08I guess, they represent kernels.
  • 16:10They don't necessarily stop at these bandwidths,
  • 16:12but they just keep going
  • 16:13because they're Gaussian kernels.
  • 16:16But these just represent the kernels,
  • 16:18where basically, if you have
  • 16:20a timestamp corresponding to coefficient
  • 16:22and you have no sample at that timestamp,
  • 16:25that doesn't necessarily mean
  • 16:26that the input to the dot product is zero.
  • 16:31So, basically what you would do is
  • 16:33you would just look at a bin around
  • 16:36that timestamp, and weight input from regulators,
  • 16:42depending on their proximity to this timestamp.
  • 16:46So, if the sample is exactly at
  • 16:51the timestamp that you expect,
  • 16:52you would weight it highly based
  • 16:54on the Gaussian kernel, and the farther
  • 16:56you move away from the timestamp,
  • 16:58the weaker the weight of
  • 17:02that particular sample would be.
  • 17:05So, what this helps us do
  • 17:07is if there is, say, more than one cell
  • 17:10in close proximity, it would take input
  • 17:14from all of them.
  • 17:15If there are no cells in close proximity,
  • 17:18it would at least take input from some cells
  • 17:20which are farther away, and so on.
  • 17:25So, yeah, as in this works
  • 17:27with irregular time series,
  • 17:28because you don't necessarily have
  • 17:30to expect samples in the past at the timestamps
  • 17:34that you wanted them to.
  • 17:36And yeah, I think we already discussed this.
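
A minimal sketch of the kernel-weighted lagged input just described, for one candidate regulator on an irregular pseudotime axis; `pt` and `expr` are assumed to be the regulator's pseudotimes and expression values, and the Gaussian weighting follows the idea above rather than the exact SINGE/GLG code.

```python
import numpy as np

def kernel_lagged_input(pt, expr, t, lag, bandwidth):
    """Gaussian-kernel-weighted regulator input at timestamp (t - lag).

    pt, expr : pseudotimes and expression of the regulator (irregularly spaced)
    t        : pseudotime of the target observation
    lag      : how far back in pseudotime we look
    bandwidth: kernel width; wider kernels borrow from more distant cells
    """
    center = t - lag
    w = np.exp(-0.5 * ((pt - center) / bandwidth) ** 2)
    w = np.where(pt < t, w, 0.0)          # only cells in the target's past count
    if w.sum() == 0:
        return 0.0
    return float(np.dot(w, expr) / w.sum())

# Toy irregular series: cells at arbitrary pseudotimes
pt = np.array([0.0, 0.7, 1.1, 2.5, 3.0, 4.8])
expr = np.array([2.0, 1.8, 1.5, 0.9, 0.7, 0.1])
print(kernel_lagged_input(pt, expr, t=3.0, lag=1.0, bandwidth=0.5))
```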
  • 17:40So, now, as in going back to the case for...
  • 17:45So, we had these false zeroes, right?
  • 17:48So now, because of this kernel method,
  • 17:50we have an inherent imputation over missing data.
  • 17:54So, now we get what we could think of as,
  • 17:58instead of taking all of the zeros
  • 18:00as they are at face value,
  • 18:03we can treat them, or some of them
  • 18:04as dropouts, as just missing data.
  • 18:09And we just remove those samples now,
  • 18:11because we can now work
  • 18:13with irregular time series.
  • 18:15And because of this kernel method,
  • 18:17we can actually work with time series that are
  • 18:19all uniquely irregular.
  • 18:22We can work with...
  • 18:24We can remove the zero-valued samples
  • 18:26and get a differently irregular
  • 18:30time series for each of these genes.
  • 18:33And so, such an action can probably
  • 18:37be informed by imputation techniques like MAGIC,
  • 18:40which help you complete,
  • 18:42or impute, zeros in the dataset.
  • 18:44So, instead of imputing the dataset,
  • 18:46you could just use its output
  • 18:48to decide whether or not
  • 18:51to remove that zero from the input dataset.
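
A small sketch of the zero-removal step just described: treating (some) zeros as missing rather than as measurements, which leaves each gene with its own, differently irregular pseudotime series. The optional `keep_mask` stands in for external guidance such as a MAGIC-style imputation result; it is a hypothetical argument for illustration, not part of MAGIC itself.

```python
import numpy as np

def to_irregular_series(pseudotime, expression, keep_mask=None):
    """Drop zero measurements (treated as technical dropouts) for one gene.

    pseudotime : 1-D array of cell pseudotimes
    expression : matching 1-D array of that gene's expression
    keep_mask  : optional boolean array marking zeros believed to be real
                 (e.g. derived from an imputation method); those are kept
    """
    is_zero = expression == 0
    drop = is_zero if keep_mask is None else (is_zero & ~keep_mask)
    keep = ~drop
    order = np.argsort(pseudotime[keep])
    return pseudotime[keep][order], expression[keep][order]

pt = np.array([0.1, 0.4, 0.5, 0.9, 1.3])
ex = np.array([2.0, 0.0, 1.5, 0.0, 0.2])
print(to_irregular_series(pt, ex))   # zeros removed -> irregular series
```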
  • 18:58So, this is just an illustration
  • 19:00of a single generalized Lasso Granger test.
  • 19:04So, you have the POU5F1 gene, and basically,
  • 19:08you see the cells corresponding
  • 19:11to it, or rather its expression,
  • 19:16along pseudo time.
  • 19:18And what you also see is two trendlines
  • 19:23predicted using a Lambda of 0.1,
  • 19:27which is basically a sparsity constraint of 0.1.
  • 19:29So, it would have fewer edges
  • 19:32between the regulators and POU5F1.
  • 19:36And then a Lambda of 0.02,
  • 19:41which has far more regulators.
  • 19:43And you can see that both of these predict
  • 19:46the trends of POU5F1 when using
  • 19:49the past values quite well.
  • 19:54So, now that was just one GLG test.
  • 19:59Now, what SINGE does is it performs multiple
  • 20:01such GLG tests, where you sub-sample
  • 20:04the time series in different ways
  • 20:07to get different irregular time series again.
  • 20:12And you also use diverse hyper-parameters,
  • 20:14so that, using these two in combination,
  • 20:17you effectively slice the cake multiple ways
  • 20:20when trying to look at the data.
  • 20:22So, the hyper-parameters
  • 20:23that we use are Lambda, which determines
  • 20:25the sparsity of the network that we get,
  • 20:29or of the adjacency matrices that we get.
  • 20:31And we have Delta T, which gives us
  • 20:36a time resolution of the lags between say,
  • 20:40the past regulators
  • 20:41and the current target timestamps,
  • 20:45and the number of lags that you have.
  • 20:47So together, they will tell you how far behind
  • 20:51in pseudo time should you be looking to try
  • 20:54to predict the expression of the target.
  • 20:57And finally, the kernel width,
  • 20:59which tells how wide the kernel should be
  • 21:03around the timestamp that you are considering.
  • 21:08Now, once we get
  • 21:11adjacency matrices from all of these,
  • 21:13we consider them as partial networks,
  • 21:17and we get ranked lists from each of them.
  • 21:20And we aggregate these rank lists
  • 21:24using a modified Borda count.
  • 21:24So, Borda count is something
  • 21:25which has been used in elections.
  • 21:29It's basically an election, I guess,
  • 21:31result aggregation strategy,
  • 21:34where if you have five candidates,
  • 21:36you rank them from one to five,
  • 21:39and then the person who has, I guess,
  • 21:42the lowest total here over all
  • 21:44of the people that voted,
  • 21:47they would win the vote.
  • 21:49So, the modified Borda count
  • 21:52is basically the same concept,
  • 21:53but the only change that we did
  • 21:55was we wanted to place more weight
  • 22:03on a ranking which distinguishes
  • 22:05between, say, the first interaction
  • 22:10we find and the 10th interaction we find.
  • 22:12As opposed to say, the 10,000th interaction
  • 22:15we find with the 10,010th interaction
  • 22:18that we find.
  • 22:19So, that's why the weighting before adding
  • 22:23these Borda weights is one over N squared,
  • 22:26as opposed to, say, one over N here.
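
A minimal sketch of the modified Borda aggregation just described: each partial network contributes a weight of 1/rank² for every edge, so differences near the top of a ranking matter far more than differences deep in the list. This is illustrative; the actual SINGE code may differ in details such as tie handling.

```python
from collections import defaultdict

def modified_borda(ranked_edge_lists):
    """Aggregate ranked edge lists with 1/rank**2 weights.

    ranked_edge_lists: iterable of lists of (regulator, target) tuples,
    each ordered from most to least confident.
    Returns edges sorted by their aggregated score, highest first.
    """
    scores = defaultdict(float)
    for edges in ranked_edge_lists:
        for rank, edge in enumerate(edges, start=1):
            scores[edge] += 1.0 / rank ** 2   # top ranks dominate the sum
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

partial_a = [("GeneX", "GeneY"), ("GeneZ", "GeneY"), ("GeneX", "GeneZ")]
partial_b = [("GeneZ", "GeneY"), ("GeneX", "GeneY")]
for edge, score in modified_borda([partial_a, partial_b]):
    print(edge, round(score, 3))
```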
  • 22:33So, yeah, once we aggregate this,
  • 22:36we get a final rank list.
  • 22:39And so, we had two inferred trajectories,
  • 22:43we got gene dynamics from them,
  • 22:46and now that results in two different networks.
  • 22:49And this is just showing the top 100 edges
  • 22:53from Monocle 2 and PAGA Tree.
  • 22:55Now, you can obviously see
  • 22:56that they look very different.
  • 22:59Some of the edges I think, are common,
  • 23:02but they can be very, very different.
  • 23:05So, now the question is,
  • 23:08which of these is right, or better?
  • 23:12So, for that we would have
  • 23:14to first think of, okay,
  • 23:15how do we evaluate this?
  • 23:17So, one way to evaluate that would be
  • 23:20to do a precision recall evaluation.
  • 23:24So, let's say we have this rank list
  • 23:25of candidate gene interactions that we just got
  • 23:28from SINGE and a gold standard,
  • 23:31which knows the truth.
  • 23:33As we go down this rank list,
  • 23:34the precision metric tells us
  • 23:37what fraction of the predictions
  • 23:38so far have been correct.
  • 23:40And the recall metric tells us
  • 23:42how many of the total interactions
  • 23:44in the gold standard, which were correct
  • 23:46have so far been covered.
  • 23:48So, the figure on the right shows
  • 23:51a precision recall curve for two rank lists.
  • 23:54The ideal precision recall curve
  • 23:56would place all the edges in the gold standard
  • 23:58at the top of the list.
  • 23:59So, that's the dotted line that you see here,
  • 24:04and the area under that precision
  • 24:06recall curve (mumbles) the blue one.
  • 24:09A random list in expectation would be flat.
  • 24:13So, it would have a flat precision
  • 24:15recall curve, and the area under
  • 24:18that curve would be 0.5
  • 24:20here, I guess, for these make-believe orderings.
  • 24:27And in this example, we can see
  • 24:29that the precision recall curve of A,
  • 24:35which I guess, the predictor A is better
  • 24:40because it starts off with having more ones,
  • 24:45or as in a high precision, and then falls
  • 24:48as opposed to B, which rises
  • 24:50from a low precision.
  • 24:51What it means is that A gets more hits
  • 24:55at the top of its list as opposed to B,
  • 24:57and so on.
  • 24:58And so, one way to also evaluate
  • 25:01these precision recall curves is to just look
  • 25:03at the area under the curve, which here for A
  • 25:06is 0.7 and for B is 0.52.
  • 25:07And that tells us that on average,
  • 25:10A ranks edges better than B.
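
A small sketch of the precision-recall bookkeeping just described, for a ranked edge list scored against a (possibly partial) gold standard; `ranked` and `gold` are hypothetical toy inputs, and average precision is approximated as the mean precision at each true-positive hit.

```python
def precision_recall(ranked_edges, gold):
    """Walk down a ranked edge list and report precision/recall at each step."""
    gold = set(gold)
    hits, curve = 0, []
    for k, edge in enumerate(ranked_edges, start=1):
        if edge in gold:
            hits += 1
        curve.append((hits / k, hits / len(gold)))   # (precision, recall)
    # Average precision: mean of precision values at the positions of true hits.
    ap = sum(p for (p, _), e in zip(curve, ranked_edges) if e in gold) / len(gold)
    return curve, ap

ranked = [("A", "B"), ("C", "D"), ("A", "D"), ("B", "C")]
gold = [("A", "B"), ("A", "D")]
curve, ap = precision_recall(ranked, gold)
print(curve)
print("average precision:", ap)
```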
  • 25:16Now, we would like to use something like this,
  • 25:19and the question is what could we use as
  • 25:22a gold standard?
  • 25:24Now, this is real biological data
  • 25:26that we are using, and for that,
  • 25:29we would also need to look into
  • 25:32the literature to find validation.
  • 25:35So, one good source of information
  • 25:37is the ESCAPE database curated
  • 25:39by the Ma'ayan lab.
  • 25:41And this database includes the results
  • 25:44of loss-of-function and gain-of-function experiments
  • 25:47done on genes,
  • 25:49and also ChIP-seq experiments,
  • 25:50which identify binding sites
  • 25:52of transcription factors.
  • 25:54Now, the problem is that even this database
  • 25:58is incomplete, because gaps
  • 26:01in biological knowledge remain;
  • 26:04I guess, over time,
  • 26:06it will be filled in more and more.
  • 26:09But when we were doing this evaluation,
  • 26:12we had to deal with what was effectively
  • 26:14a partial gold standard,
  • 26:16or an incomplete gold standard.
  • 26:18So, the evaluation that we did was not
  • 26:20for all of the genes in the dataset,
  • 26:23but only a fraction of the genes.
  • 26:28So, we had these two methods
  • 26:33and two pseudo times, which we got from that.
  • 26:36So, what we did
  • 26:38is we compared the performance of SINGE
  • 26:43using, say, Monocle 2 with the pseudo time,
  • 26:46as well as Monocle 2 with only the ordering.
  • 26:49And similarly, PAGA Tree
  • 26:50with the pseudo time and PAGA Tree
  • 26:52with only the ordering.
  • 26:54And so, this is how the precision recall curves
  • 26:58of these four methods look.
  • 27:01So, we look at the average precision,
  • 27:04which is the same thing as the area under
  • 27:06the precision recall curve.
  • 27:08And we also look at the average precision
  • 27:10in the early part of the precision recall curve.
  • 27:14And the point for that being that,
  • 27:18in say, a usual workflow,
  • 27:21you would have a computational method,
  • 27:24which would point to some important edges,
  • 27:29and then, you would potentially tell
  • 27:31a collaborator to try
  • 27:34and experimentally validate that.
  • 27:36And in that sense, you would be giving
  • 27:38them results from the top of your list,
  • 27:40as opposed to trying to tell how well
  • 27:43the 10,000th edge in the list
  • 27:45is placed in the rankings.
  • 27:48So, with that in mind, we also look
  • 27:50at what's the average early precision
  • 27:53of these curves.
  • 27:55And for that, we basically ask
  • 27:59to what extent the precision is maintained
  • 28:04until 10% of the genes
  • 28:06in the gold standard are...
  • 28:08Or, interactions in the gold standard
  • 28:10are recovered in the list that we have.
  • 28:15So, the figure to the right shows
  • 28:18a scatterplot of these, the average precision
  • 28:20and the average early precision
  • 28:21for these four methods, for these four options.
  • 28:25And what we see is that the...
  • 28:27The best performing combination
  • 28:29is using Monocle's ordering,
  • 28:31but not its pseudo time, and applying
  • 28:36the pseudo time
  • 28:38that Monocle assigns to the cells
  • 28:41actually degrades the performance quite a bit.
  • 28:46And both of the PAGA Tree options
  • 28:49with, or without pseudo time,
  • 28:51are in between these.
  • 28:52So, now why would this happen?
  • 28:56For example, and let's take
  • 28:57an extreme case, right?
  • 28:58And okay, before that, there's not necessarily
  • 29:04something that's wrong with Monocle,
  • 29:06but it's basically that for this dataset,
  • 29:09in this instance, the pseudo time values
  • 29:12did not necessarily make a lot of sense.
  • 29:15So, let's say you have perfectly ordered cells.
  • 29:17And for the first half of the cells,
  • 29:19you just assign a value very close
  • 29:22to zero and the second half,
  • 29:23you assign a value very close to one.
  • 29:26So, even though the ordering of the cells
  • 29:27was quite nice and reliable, just because
  • 29:31we ended up assigning values
  • 29:34to the pseudo times
  • 29:36which are completely unrealistic,
  • 29:38we might end up losing
  • 29:41a lot of information
  • 29:42that we otherwise had in the dataset,
  • 29:44or in the ordering.
  • 29:49So, yeah, we extended
  • 29:53the ideas from this particular figure, right?
  • 29:56So, you have two methods,
  • 29:57they're giving you two different...
  • 30:00Okay, two methods with their orderings
  • 30:01and pseudo times, so basically four cases,
  • 30:05and they all give you different rankings,
  • 30:08which have different performances
  • 30:12in terms of network evaluation.
  • 30:14And in a sense, you could say
  • 30:19that each of these trajectory inference methods,
  • 30:22with all their inefficiencies
  • 30:25and efficiencies, is only partially looking
  • 30:28at the biological data.
  • 30:30So, from that perspective, each
  • 30:34of these orderings and pseudo time values
  • 30:37can be considered as sources
  • 30:39of noisy information,
  • 30:40or noisy sources of information.
  • 30:42So, instead of trying to just infer
  • 30:49one pseudo time trajectory from
  • 30:52the dataset and finding the network,
  • 30:55or say another, and finding
  • 30:56the network from that, we could think
  • 30:58of the trajectory inference method itself
  • 31:01as an additional hyper parameter
  • 31:03on top of the sparsity, and kernel widths,
  • 31:06and so on.
  • 31:08So, instead of aggregating at this point
  • 31:10after just one trajectory inference method,
  • 31:12we could just say that maybe
  • 31:14we have four trajectory inference methods
  • 31:19in the beginning.
  • 31:20And after that, we do all
  • 31:23of these sub sampling and application
  • 31:25of hyper-parameters, and multiple tests.
  • 31:28And then, we aggregate over all
  • 31:29of these results across
  • 31:31trajectory inference methods.
  • 31:33So, hopefully what that would do
  • 31:34is account for,
  • 31:39or counter, the inefficiencies
  • 31:40of individual trajectory inference methods,
  • 31:43and give us a more robust network at the end.
  • 31:49And I have not, I guess, shown
  • 31:52our comparisons against the other methods,
  • 31:56which obviously are in our paper;
  • 31:58we are doing better than them.
  • 32:00So, but you can have a look at
  • 32:03that in the paper if you're interested,
  • 32:05because I just wanted to conceptually focus
  • 32:08on these ideas a little bit more.
  • 32:11So, I guess, one problem with trying
  • 32:14to run four different, or five different
  • 32:17trajectory inference methods is depending on
  • 32:19what kind of data set you have
  • 32:20and what kind of biology you are studying,
  • 32:23you might not necessarily get away with
  • 32:27trying only four methods.
  • 32:29You will probably have
  • 32:30to try multiple methods before,
  • 32:32let's say, if you know
  • 32:34it's a branching trajectory,
  • 32:35you end up seeing a branching trajectory.
  • 32:38And each of these methods would have
  • 32:41their own input data format,
  • 32:44output data formats, visualizations,
  • 32:49and all of these other intricacies.
  • 32:52And that's where the dynverse project comes
  • 32:55to our rescue.
  • 32:56So, if anyone is looking to do
  • 33:00a lot of trajectory inference methods,
  • 33:01I would strongly encourage you to look at that.
  • 33:04So, in this project,
  • 33:06they have streamlined the use of, I think,
  • 33:1055 trajectory inference methods.
  • 33:12So, you don't necessarily need to install
  • 33:14each one of them.
  • 33:15You just install this project
  • 33:16and they run each
  • 33:18of these methods using a docker.
  • 33:21And so, what it also helps you do
  • 33:23is it helps you visualize
  • 33:26all of these trajectories and evaluate them using
  • 33:31the same, I guess, support scripts
  • 33:35and support functions, which they also provide.
  • 33:38And all of this would make
  • 33:42your lives quite easy.
  • 33:44And they also have basically a user,
  • 33:47a graphical user interface,
  • 33:48which helps you prioritize
  • 33:52what trajectory inference method to use,
  • 33:55depending on what biology you want to study.
  • 34:00How many cells you have, what compute power
  • 34:02you might have access to, and so on.
  • 34:12So, okay just some final comments on the use
  • 34:17of, I guess, the utility of trajectory inference
  • 34:19and pseudo times for further analysis.
  • 34:22And so, first of all, trajectories
  • 34:25look really nice; visually,
  • 34:28they give us a lot of information.
  • 34:31And so, based on what we saw,
  • 34:33we did see that
  • 34:39the ordering information
  • 34:41and the pseudo time values can help
  • 34:43in network inference.
  • 34:45The good pseudo times can help a little bit,
  • 34:49but if you have exceptionally bad pseudo times,
  • 34:51it can hurt a lot as opposed to ordering.
  • 34:55And not every dataset is really suitable
  • 34:59for trajectory inference.
  • 35:00What do I mean by that?
  • 35:01So, the dataset that I chose,
  • 35:04and I guess a lot of what
  • 35:08trajectory inference methods
  • 35:10are built around, is, say,
  • 35:11stem cell differentiation in general,
  • 35:14where the biology is quite neat
  • 35:19to begin with.
  • 35:20As in you start off from a single cell type,
  • 35:23and a lot of the biology is already known.
  • 35:27So, you don't have to worry, you know
  • 35:30that it's going to be a branching,
  • 35:32or bifurcating, or multifurcating trajectory.
  • 35:36So, you know the topology of the biology,
  • 35:38you know what cell states to expect,
  • 35:43and so on, and so forth.
  • 35:44You know the markers of each of those.
  • 35:46And so, studying something like that
  • 35:49is much easier using trajectory inference,
  • 35:53or pseudo time.
  • 35:54On the other hand, let's say,
  • 35:56if you had a sample from a cancer tumor
  • 35:59in that you would find cancer cells,
  • 36:02normal cells, a bunch of immune cells,
  • 36:06probably 10 to 20 kinds of immune cells,
  • 36:10and so on.
  • 36:12So, the trajectory inference method
  • 36:15usually tracks, or places,
  • 36:18cell states in context,
  • 36:20not cell types themselves.
  • 36:23So, you wouldn't necessarily be able
  • 36:25to reliably run a trajectory inference method
  • 36:29across as in using a mix of different cell types,
  • 36:33as opposed to cell states.
  • 36:35Now, with the stem cell differentiation,
  • 36:38the good thing is that the cell states
  • 36:41themselves after a point, transition
  • 36:43into different cell types,
  • 36:45because it's the same cell,
  • 36:47or same cell type, which transitions
  • 36:50through these cell states
  • 36:53into multiple cell types.
  • 36:56But that's not the case with cancer biology,
  • 36:58where you already start off
  • 37:01with a mix of cell types and trajectory inference
  • 37:06would not make sense for that mix.
  • 37:08What people have tried is to isolate,
  • 37:11say, just a T-cell type, and then try
  • 37:16to order, or find the trajectory only
  • 37:19for those T-cells.
  • 37:21And there has been some success in that.
  • 37:23So, you could run trajectory inference
  • 37:27for a subset of the dataset, but not necessarily
  • 37:30the entire dataset.
  • 37:32And so, depending on what biological processes
  • 37:38you want to study,
  • 37:41there are trajectory inference methods,
  • 37:43which may or may not be suitable for it.
  • 37:45For example, a number of methods
  • 37:47like Monocle and PAGA Tree,
  • 37:51they try to find tree-like structures
  • 37:56in the trajectories,
  • 37:58so they would not be suitable
  • 37:59for a cyclic biological process
  • 38:03like just maintenance processes in cells.
  • 38:07And then, there are other methods
  • 38:08which actually try to find cell cycles,
  • 38:11and they would not be appropriate
  • 38:12for branching processes.
  • 38:16And I guess, as in, no single
  • 38:19trajectory inference method
  • 38:23accurately represents the biology.
  • 38:25So, it's all basically
  • 38:27some mathematical abstraction
  • 38:29of what might be happening in the cells.
  • 38:35And yeah, as in,
  • 38:36if at the outset you know
  • 38:37what kind of trajectory to expect, then it helps
  • 38:41in trying to,
  • 38:45at least at first,
  • 38:46say whether the trajectory that you're getting
  • 38:50and the pseudo times that you get
  • 38:52are of any worth.
  • 38:55So, just to give you an example.
  • 38:58So, we started off with Monocle 2
  • 39:00as one of our examples in our paper,
  • 39:03and then we wanted to have another method
  • 39:05to compare the effects of different
  • 39:07trajectory inference methods.
  • 39:10And PAGA Tree was not necessarily the first one.
  • 39:13We tried a number of other ones,
  • 39:14which did not.
  • 39:16And we knew what to expect here.
  • 39:18We knew that there was stem cell
  • 39:21to ectoderm trajectory and endoderm trajectory,
  • 39:26or a branch of that.
  • 39:28And basically,
  • 39:35I think we tried four methods,
  • 39:38and PAGA Tree was basically the fourth method,
  • 39:39which gave us that kind of branching trajectory,
  • 39:42or branching topology for the biology.
  • 39:45And so, none of the methods you try
  • 39:49might necessarily mean anything,
  • 39:53unless you have some way of validating that.
  • 39:57So, at this point, I'm gonna switch
  • 39:59to spatial expression,
  • 40:04or spatial data and spatial analysis.
  • 40:06So, if you have any questions
  • 40:08about the pseudo time analysis,
  • 40:12should we take it now, or?
  • 40:19<v Lecturer>Does anybody have any questions</v>
  • 40:20on the first half of the presentation here?
  • 40:26<v Dr. Deshpande>Oh, we can continue on,</v>
  • 40:27then we can come back later.
  • 40:34Shall we go on?
  • 40:41<v Lecturer>Sounds good.</v>
  • 40:42<v Dr. Deshpande>Okay.</v>
  • 40:48Okay, so that was all about,
  • 40:52say how pseudo time is used in our analysis.
  • 40:57And so, the other end of,
  • 41:03I guess, not necessarily end,
  • 41:04the other perspective
  • 41:05is how is space important and how,
  • 41:10what kind of data do we have,
  • 41:13which give us information about space?
  • 41:16So, the spatial context of cells
  • 41:18is very important in many biological processes.
  • 41:22For example, when immune cells respond
  • 41:24to an infection, or a wound, they need
  • 41:27to be in physical proximity of their targets.
  • 41:31Similarly, I guess, cancer tumor growth
  • 41:34and the immune response to cancer
  • 41:38happen through intercellular signaling.
  • 41:40Either through cytokine secretion,
  • 41:42or through surface receptors on adjacent cells.
  • 41:48Just knowing the relative location
  • 41:50of different cell types can also
  • 41:52be very informative.
  • 41:53For example, in the figure here,
  • 41:57the information about the presence
  • 41:58of various immune cell types near the tumor,
  • 42:02and the extent of immune infiltration
  • 42:04in the tumor, are essential prognostic markers.
  • 42:08And so, single cell RNA-seq,
  • 42:13as good as it is, dissociates a cell
  • 42:15from its tissue, due to which
  • 42:18we lose the spatial context of the cell states.
  • 42:21But in recent years, we have been able
  • 42:24to develop both
  • 42:28spatial proteomics,
  • 42:30which helps you to image protein
  • 42:35densities of say, up to 30 markers
  • 42:40at single cell resolution in the tissue.
  • 42:43As well as spatial transcriptomics,
  • 42:46which can measure 20,000 genes at spots
  • 42:51in the tissue.
  • 42:53And this was named method of the year last year
  • 42:57in 2020, yeah, that was last year.
  • 43:01So, here's just a workflow
  • 43:05of the 10x Visium technology,
  • 43:06which is one of these
  • 43:07spatial transcriptomics technologies.
  • 43:10So, this includes 5,000 barcoded spots on a slide.
  • 43:16And these barcodes are added to the cells
  • 43:21which are located in those spots.
  • 43:24And this helps preserve the spatial context
  • 43:26of the cells through the actual sequencing.
  • 43:30Now, this technology is not exactly single cell.
  • 43:33It still provides a lot of useful spatial detail.
  • 43:41So yeah, for explaining this project,
  • 43:47I will use the 10x Visium sample,
  • 43:51provided by 10x Genomics
  • 43:53of a breast cancer tissue.
  • 43:55So, the figure on the left
  • 43:57is an H and E slide, a hematoxylin
  • 44:01and eosin stained slide,
  • 44:03which helps pathologists annotate
  • 44:08the sample for tumor, and lesions, and so on.
  • 44:14And the second image is that slide annotated
  • 44:19by a pathologist, and you can see
  • 44:22that there are different biologies
  • 44:25in this one slide.
  • 44:27And for example, the lesion on top
  • 44:29is an invasive cancer lesion, which means
  • 44:31that it can spread beyond the breast tissue,
  • 44:33but the other lesions correspond
  • 44:35to DCIS lesions,
  • 44:36which are not yet classified as invasive,
  • 44:39they could in the future be invasive.
  • 44:42Other important annotations are those
  • 44:43of immune cells and the stromal cells
  • 44:47in between these lesions.
  • 44:50For a good clinical outcome, you would hope
  • 44:52that immune cells can infiltrate these lesions.
  • 44:56And so the figure on the right shows
  • 44:59the same H and E slide
  • 45:02with overlaid Visium spots.
  • 45:05So, each of these spots correspond
  • 45:07to one measurement.
  • 45:10So, this slide shows a couple of examples
  • 45:14of spatial gene expression.
  • 45:17So, the figure to the left
  • 45:19is the same annotated H and E slide
  • 45:21that will help us keep track
  • 45:23of the biology in the slide.
  • 45:27And so, the first figure, the middle figure,
  • 45:30basically it shows the expression of CD8A,
  • 45:33which is a marker of cytotoxic T-cells.
  • 45:36Now, we see this gene expressed
  • 45:42in the blood near the invasive and DCIS lesions,
  • 45:42which means that the immune cells
  • 45:44are responding to a tumor.
  • 45:45However, we see that
  • 45:47there's not much infiltration of these cells
  • 45:49within the lesions.
  • 45:51The second marker is CD14, which is found
  • 45:54in macrophages and dendritic cells,
  • 45:57and its expression is much higher
  • 45:58inside the lesions, which could point
  • 46:00to successful infiltration of these cell types.
  • 46:04Now, just a reminder, the measurements
  • 46:08that we get from 10x Visium
  • 46:10are not exactly single cell, but they're near,
  • 46:13near single cell.
  • 46:15In a sense that each of these spots
  • 46:19is 55 micrometers wide.
  • 46:19And depending on what cell type
  • 46:22you might have in that spot,
  • 46:23it could have anywhere from one to 10 cells.
  • 46:27And immune cells are much smaller,
  • 46:28so there could be up to 10 immune cells in it,
  • 46:30but maybe only one cancer,
  • 46:32or epithelial cell in that spot.
  • 46:35So, as a result, the gene expression
  • 46:36of that spot is the average
  • 46:38of the cells inside it.
  • 46:42Now, our lab has a method called CoGAPS,
  • 46:47which is a Bayesian
  • 46:50Markov chain Monte Carlo method
  • 46:51for non-negative matrix factorization.
  • 46:54And so, as a result of say,
  • 46:58the 10x Visium measurement,
  • 47:01we now have a high dimensional matrix
  • 47:04with 20,000 genes and around 5,000 spots.
  • 47:08And what CoGAPS does is it helps
  • 47:12to factorize this matrix
  • 47:15into two low rank matrices,
  • 47:18both of which are non-negative,
  • 47:21which correspond to latent patterns in the data.
  • 47:25And in the past, we have seen
  • 47:27that these, too, correspond to biologies,
  • 47:31based on the pattern markers.
  • 47:34So, the two matrices that CoGAPS factorizes
  • 47:38the dataset into are the amplitude matrix,
  • 47:41which has say, 20,000 rows for 20,000 genes
  • 47:44and N columns for the N patterns.
  • 47:48And this helps us identify groups
  • 47:50of co-expressed genes,
  • 47:52which correspond to the patterns.
  • 47:54And the pattern matrix has N rows
  • 47:57and 5,000 columns, and they associate the spots
  • 48:01on the sample with patterns.
  • 48:04So, because of the nature of the CoGAPS
  • 48:08factorization, the columns
  • 48:11of the matrix here, or the rows of the matrix
  • 48:14here, are not really orthogonal.
  • 48:15They are independent, but not orthogonal.
  • 48:17So, they could co-exist in spots,
  • 48:20or a gene could be present in multiple processes,
  • 48:25and multiple patterns,
  • 48:25which correspond to processes.
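
As a rough stand-in for readers (CoGAPS itself is a Bayesian MCMC method distributed as an R/Bioconductor package, which this is not), a plain non-negative matrix factorization already illustrates the shape of the decomposition just described: a genes-by-spots matrix split into a non-negative amplitude matrix A (genes by patterns) and pattern matrix P (patterns by spots). The data and marker heuristic below are toy assumptions.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
# Toy stand-in for a Visium count matrix: 200 "genes" x 300 "spots"
expression = rng.poisson(1.0, size=(200, 300)).astype(float)

n_patterns = 5
model = NMF(n_components=n_patterns, init="nndsvda", max_iter=500, random_state=0)
A = model.fit_transform(expression)   # amplitude matrix: genes x patterns
P = model.components_                 # pattern matrix:   patterns x spots

# Crude "pattern marker" proxy: genes whose loading is largest for pattern 0.
pattern0_markers = np.argsort(A[:, 0] - A[:, 1:].max(axis=1))[::-1][:10]
print(A.shape, P.shape, pattern0_markers)
```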
  • 48:29So, when we apply CoGAPS to the Visium data,
  • 48:37so the first try was basically
  • 48:39just five patterns, and when we apply it
  • 48:43to try and find five patterns
  • 48:47after a factorization, we see that
  • 48:50a number of them correspond
  • 48:51to the pathology annotations
  • 48:54that we see on the figure on the left.
  • 48:57So, we find a pattern which corresponds
  • 48:59to the immune cells.
  • 49:01We find a pattern which corresponds
  • 49:04to invasive carcinoma on the top left here.
  • 49:07And we also find a pattern which corresponds
  • 49:12to the DCIS lesions.
  • 49:12And as we increase the dimensionality
  • 49:15of CoGAPS factorization, we start seeing more
  • 49:18and more tissue heterogeneity.
  • 49:20For example, we now see three patterns
  • 49:23which are associated with the invasive carcinoma,
  • 49:26and we can see that they correspond
  • 49:27to different regions in that lesion.
  • 49:31And this for example is completely internal,
  • 49:33which has no interaction with immune cells.
  • 49:37We have a pattern which corresponds
  • 49:39to immune cells, we have a pattern
  • 49:41which corresponds to the stromal cells.
  • 49:43And we also have different patterns
  • 49:47which highlight individual DCIS lesions.
  • 49:51So, one could say that potentially
  • 49:53it is finding biologies
  • 49:57which are unique to these DCIS lesions.
  • 50:04So, we can analyze the A matrix
  • 50:07to identify groups of genes associated
  • 50:09with each pattern, and we call these
  • 50:11the pattern markers.
  • 50:12And these help us identify pathways
  • 50:15that are likely expressed in these patterns,
  • 50:18or because now, especially in this sample,
  • 50:20we see a one to one association
  • 50:23between the pattern and the biology,
  • 50:25also in the biology, basically.
  • 50:29So, let's see, how long do we have.
  • 50:35I think we're close to...
  • 50:38I'll quickly rush through these.
  • 50:40So, the other analysis that we can do is given,
  • 50:45let's say two of these patterns,
  • 50:49we can try to see how these patterns interact.
  • 50:52So, you can see that these patterns
  • 50:54have a lot of spatial structure to them,
  • 50:58which CoGAPS was not told about.
  • 50:59The parameters that CoGAPS uses
  • 51:01have no spatial information,
  • 51:03and it still found these spatial structures.
  • 51:06So, and we also see that these patterns
  • 51:08are adjacent to each other and we want
  • 51:10to see how they interact.
  • 51:11So, what we do is we find,
  • 51:15basically we estimate the kernel density
  • 51:18of each of these patterns, which is a function
  • 51:22of both the pattern intensity at a spot,
  • 51:25as well as the spatial clustering
  • 51:28of high intensities.
  • 51:30And we compare that against
  • 51:32another distribution obtained by
  • 51:35the density estimation after randomizing
  • 51:37the locations of these pattern densities.
  • 51:40So, the intensities which are beyond
  • 51:43this null distribution are the ones that we...
  • 51:48The spots which correspond
  • 51:49to these outliers are the ones
  • 51:51that we count as hotspots of pattern activity.
  • 51:55Similarly, we can find the hotspots
  • 51:57of immune response.
  • 51:59And when we combine both of them,
  • 52:02we find regions where cancer is active,
  • 52:09regions where immune cells are active,
  • 52:11and regions where both of them are active.
  • 52:14And this is the interaction region.
  • 52:16And in this region, we are trying
  • 52:17to find genes which correspond
  • 52:19to this interaction between cancer and immune,
  • 52:25and which are not necessarily
  • 52:27the regular markers of cancer and immune cells.
  • 52:29So, genes which are specifically related
  • 52:32to the non-linear interactions
  • 52:34between these patterns.
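
A minimal sketch of the hotspot idea just described: smooth a pattern's weights over the spot coordinates with a Gaussian kernel, build a null distribution by recomputing the same smoothed values after shuffling the weights across locations, and call spots whose smoothed value exceeds a high quantile of that null "hotspots". This illustrates the permutation logic only, not the exact implementation used in the talk; the coordinates, weights, and cutoffs are toy assumptions.

```python
import numpy as np

def smooth(coords, weights, bandwidth):
    """Gaussian-kernel-smoothed pattern weight at every spot."""
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    k = np.exp(-0.5 * d2 / bandwidth ** 2)
    return k @ weights / k.sum(axis=1)

def hotspots(coords, weights, bandwidth=1.0, n_perm=200, q=0.95, seed=0):
    rng = np.random.default_rng(seed)
    observed = smooth(coords, weights, bandwidth)
    null = [smooth(coords, rng.permutation(weights), bandwidth) for _ in range(n_perm)]
    threshold = np.quantile(np.concatenate(null), q)
    return observed > threshold          # boolean mask of hotspot spots

rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(150, 2))          # spot x/y positions
weights = rng.gamma(1.0, size=150)
weights[coords[:, 0] < 3] += 3.0                     # a spatially clustered pattern
cancer_hot = hotspots(coords, weights)
print(cancer_hot.sum(), "hotspot spots")
# The interaction region would be cancer_hot & immune_hot for a second pattern.
```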
  • 52:38And to that end, basically we hypothesize
  • 52:40that since CoGAPS is already
  • 52:45an approximation of the dataset
  • 52:47with a linear combination of the patterns,
  • 52:50the residuals of CoGAPS,
  • 52:51of the CoGAPS estimate from the dataset
  • 52:55could point us to the non-linear interactions
  • 52:58between the patterns.
  • 53:01And we are only looking at the region
  • 53:06where both of the patterns are active
  • 53:09and comparing the residuals of CoGAPS
  • 53:13in that region to the residuals
  • 53:15in only the cancer region,
  • 53:16and only the immune region.
  • 53:18And now, this can be done for each of these,
  • 53:24I guess, pattern combinations,
  • 53:25and we can find what corresponds
  • 53:29to pattern interaction
  • 53:30between these pairs of patterns.
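
And a small sketch of the residual comparison described above: the reconstruction A·P is subtracted from the data, and each gene's residuals inside the interaction region are compared with its residuals in the cancer-only and immune-only regions, here using a simple rank-sum test as a placeholder statistic. The names `A`, `P`, and the region masks are assumptions carried over from the sketches above, not the actual pipeline.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def interaction_genes(expression, A, P, interaction_mask, cancer_mask, immune_mask):
    """Rank genes whose residuals are elevated specifically where both patterns overlap.

    expression : genes x spots matrix
    A, P       : non-negative factors (genes x patterns, patterns x spots)
    *_mask     : boolean spot masks for the interaction, cancer-only, and immune-only regions
    """
    residual = expression - A @ P                  # genes x spots reconstruction error
    pvals = []
    for g in range(residual.shape[0]):
        inter = residual[g, interaction_mask]
        other = residual[g, cancer_mask | immune_mask]
        # One-sided test: residuals larger in the interaction region than elsewhere.
        _, p = mannwhitneyu(inter, other, alternative="greater")
        pvals.append(p)
    return np.argsort(pvals)                       # genes, most "interaction-specific" first
```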
  • 53:32So, for future work, as part
  • 53:35of the data collection in clinical trials,
  • 53:37we're already collecting both spatial
  • 53:42and single cell transcriptomics
  • 53:44and proteomics from patients.
  • 53:47So, we are trying to integrate all
  • 53:50of this into one big dataset,
  • 53:54which would represent the tumor microenvironment,
  • 53:59which would help us characterize
  • 54:03the patient sample as a whole.
  • 54:05And we would also like
  • 54:07to infer intercellular signaling networks
  • 54:10the same way as we were trying to do
  • 54:12it using time, but now using space,
  • 54:14where intercellular signaling is a function
  • 54:18of the distance between the cells
  • 54:20and the types of neighboring cells
  • 54:21for a target cell.
  • 54:26And the learnings from these projects
  • 54:27would go into a spatial temporal model
  • 54:31of tumor growth and response to therapy,
  • 54:33which can be used in building
  • 54:36a digital patient or digital clone,
  • 54:39where we can try to test what therapies
  • 54:43might work on what patients.
  • 54:48So, these are the people who have been involved,
  • 54:50and of course, 10x Genomics,
  • 54:52who were kind enough to give us the sample
  • 54:54for studying, as well as my collaborators
  • 54:59on this project.
  • 55:01Thank you so much.
  • 55:02And I can take questions now,
  • 55:04sorry for overshooting the time.
  • 55:10<v Lecturer>Thank you so much.</v>
  • 55:11Do we have any questions to look at?
  • 55:22People on Zoom? Yeah, question (mumbles).
  • 55:25<v Female Student>Going back</v>
  • 55:25to the time series slides.
  • 55:28<v ->Mm-hmm.</v>
  • 55:29<v Female Student>Can you talk</v>
  • 55:29about how you know if you have good,
  • 55:31or bad pseudo times?
  • 55:32And is there a way to fix bad pseudo times?
  • 55:35<v ->So, yeah, as in what I've not shared on here</v>
  • 55:39is so, in our experiments,
  • 55:43we also, we knew for example,
  • 55:46that we were studying...
  • 55:47We wanted to study a trajectory which goes
  • 55:49from stem cells to neuroectoderm,
  • 55:56and we had markers.
  • 55:57And I think, some (mumbles) themselves.
  • 56:01They have identified markers
  • 56:03of stem cells, neuroectoderm, and endoderm cells.
  • 56:09So, if we're looking at the trajectories
  • 56:10of the markers along the pseudo time
  • 56:13to see if those make sense.
  • 56:15For example, a marker which is supposed
  • 56:18to be high in stem cells would,
  • 56:21should be tapering down to zero
  • 56:23along pseudo time, and a marker,
  • 56:26which is supposed to be high in neuroectoderm
  • 56:30should be increasing with pseudo time.
  • 56:34So, we had, I think, six or so markers
  • 56:38for each of stem cells, neuroectoderm
  • 56:41and endoderm cells.
  • 56:46And we were trying to confirm the combination
  • 56:49that neuroectoderm markers increase
  • 56:52with pseudo time, but the other two decrease,
  • 56:54or the endoderm shouldn't decrease necessarily,
  • 56:58but it shouldn't have
  • 57:00a monotonic increase like the neuroectoderm one.
  • 57:08And it should not be present in the initial.
  • 57:12Does that...
  • 57:14So, that was one way to do it, basically.
  • 57:21<v Lecturer>Thank you.</v>
  • 57:22Any other questions?
  • 57:40So, with the combination of many cells,
  • 57:41and the spatial stuff, is there any hope
  • 57:44of getting a temporal signal out of any of that,
  • 57:46or is that (indistinct)?
  • 57:50<v ->In spatial did you mean?</v>
  • 57:52<v Lecturer>Yeah.</v>
  • 57:53<v Dr. Deshpande>So, I think,</v>
  • 57:59the issue would be, I guess,
  • 58:00not in clinical, I suppose.
  • 58:04In a sense that, okay, are you thinking
  • 58:06about pseudo temporal, or just clinical?
  • 58:08<v Lecturer>Yeah.</v>
  • 58:10<v ->Pseudo temporal, I think there might</v>
  • 58:11be some possibility,
  • 58:12and I've been thinking of
  • 58:18as in, we would still have to isolate,
  • 58:20I guess, cell types, for example.
  • 58:22So, one of the problems with that
  • 58:24is that as I mentioned,
  • 58:27the spots are not exactly single cell, right?
  • 58:31So, especially, let's say if you're trying
  • 58:33to do a pseudo temporal ordering
  • 58:36of CD8 T-cells,
  • 58:39they are more,
  • 58:41more likely than not, co-localized
  • 58:44with other cell types, which would also,
  • 58:49I guess, corrupt the expression
  • 58:51that you are seeing.
  • 58:53So, that would make it slightly different.
  • 58:57We could think of ordering the spot
  • 59:01as a whole, basically.
  • 59:03And my...
  • 59:05I belong to a school of thought that basically,
  • 59:07if you have a...
  • 59:08And then, so what people try to do
  • 59:10with say, this kind of data,
  • 59:13this spacial Visium data,
  • 59:15where you have say, up to 10 cells,
  • 59:17they try to resolve this into cell types.
  • 59:23So, they would compare that to, there is I think,
  • 59:25one paper called RTCD, or RCTD.
  • 59:30RCTD robust cell type decomposition.
  • 59:33So, what they do is basically,
  • 59:36they take the spatial data,
  • 59:38they have a reference single cell data,
  • 59:41and they try to assign each spot,
  • 59:47or a resolve each spot into a mixture
  • 59:51of the cell types that might exist
  • 59:54in the single cell data.
  • 59:57And that could help you to say,
  • 01:00:01identify what the mixture in general is.
  • 01:00:04But my as in my thought is that we could
  • 01:00:09just think of each spot as some representation
  • 01:00:15of the biology in that neighborhood.
  • 01:00:17So, each spot could just represent
  • 01:00:20a neighborhood, as opposed to trying to find
  • 01:00:22what the individual cells are.
  • 01:00:25And that would basically abstract out
  • 01:00:30the representation and the biology to that
  • 01:00:33of the spots.
  • 01:00:35And we'll have to think about how to do that,
  • 01:00:37but I think there could be some ordering to that,
  • 01:00:40but we'll need to see what makes sense.
  • 01:00:45And then, for a lot of cells, cell states,
  • 01:00:49they are quite well-characterized.
  • 01:00:51For example, if you say that a T-cell
  • 01:00:53is activated, or a T-cell as naive,
  • 01:00:55or exhausted, you know what markers to expect.
  • 01:00:59But what would you be able to say
  • 01:01:02for spots instead?
  • 01:01:05The other thing to think of is,
  • 01:01:10especially with say, the proteomics as well,
  • 01:01:12where you can get actual single cell
  • 01:01:18and distributions, and neighborhood characterization.
  • 01:01:22You could think of it as can you,
  • 01:01:27so the same thing that...
  • 01:01:28The same ideas that were used
  • 01:01:31for pseudo temporal ordering of cells,
  • 01:01:34can they be used for pseudo temporal
  • 01:01:36ordering of neighborhoods?
  • 01:01:39For example, if you have a cell neighborhood,
  • 01:01:41which as they're presented as whatever,
  • 01:01:45the central cell, and it's five neighbors.
  • 01:01:49Now, depending on, are they all tumor?
  • 01:01:52Then maybe they have...
  • 01:01:53They're basically deep in the cancer,
  • 01:01:54which has never been visited by an immune cell,
  • 01:01:58is that a mix of tumor
  • 01:01:59and activated immune cells?
  • 01:02:02So, that is basically an active tumor
  • 01:02:04immune interaction that's happening.
  • 01:02:06Is that exhausted T-cells and tumor,
  • 01:02:10where basically the tumor
  • 01:02:11has fought back and tried to suppress the...
  • 01:02:16Or it's basically sent signals
  • 01:02:17to suppress the immune response, and so on.
  • 01:02:21So, perhaps there could be
  • 01:02:22a trajectory of neighborhoods,
  • 01:02:25where you could say that depending on all
  • 01:02:29the possible combinations that you expect
  • 01:02:31in cellular neighborhoods,
  • 01:02:35this current neighborhood is this far along
  • 01:02:40that process, or that branch of a process.
  • 01:02:44That was a long and winding answer.
  • 01:02:47(chuckles) I don't know if
  • 01:02:49that necessarily answered it. <v Lecturer>Thank you.</v>
  • 01:02:52Thank you, any last questions?
  • 01:02:54I wanna be mindful of time.
  • 01:02:56Any questions that come to you, or?
  • 01:03:06All right, well if not, thank you again.
  • 01:03:09(students applaud) We really appreciate that.
  • 01:03:11<v Dr. Deshpande>Thank you a lot.</v>
  • 01:03:15<v Lecturer>You have a wonderful (indistinct).</v>
  • 01:03:16<v ->Mm-hmm.</v>
  • 01:03:20(lecturer mumbles indistinctly)
  • 01:03:27(students chatter indistinctly)