YSPH Biostatistics Seminar: "SPRUCE and MAPLE: Bayesian Spatial Multivariate Mixture Model for High Throughput Spatial Transcriptomics Data"

December 01, 2021

Dongjun Chung, PhD, Associate Professor, Department of Biomedical Informatics, The Ohio State University

November 30, 2021

Information

ID: 7227

00:01<v ->College of Medicine and he is a member</v>
00:04of the Pelotonia Institute for Immuno-Oncology
00:08of Ohio State University as a candidate and a member.
00:12His research focuses on
00:15(mumbles)
00:18for integrative analysis of genetic and genomic data
00:22with biomedical real data.
00:26So welcome back Dongjun Chung.
00:29(audience member claps)
00:32<v ->Okay.</v>
00:34Thank you Wei, for the kind introduction
00:37and it's so great to come back.
00:39Although it's all virtual.
00:43I hope someday we can see in person.
00:46So today I will discuss our recent project
00:52about the SPRUCE and MAPLE: Bayesian Multivariate
00:57Mixture Models for Spatial Transcriptomics Data.
01:01Oh, by the way, can you hear me well?
01:03<v ->Ah yes, we can hear you.</v>
01:05<v ->Okay, great.</v>
01:07So, let me start with a quick introduction
01:12to single-cell genomics.
01:16So in some sense,
01:17we can say that the last decade was the era of single cell
01:21genomic experiments.
01:24So it changed science in many ways.
01:26And also a great amount of the data has been generated
01:32using the single cell genomic technology.
01:36Single cell genomic experiments
01:38provide high-dimensional data at the cell level.
01:42By doing so,
01:44it allows us to investigate cellular heterogeneity
01:48within each subject or patient,
01:52which was not possible previously
01:54with bulk genomic data,
01:56which means genomic data collected at the tissue level.
02:04So some kind of standard visualization
02:09of the single cell genomic data is called a UMAP.
02:13And here,
02:14this UMAP shows the distribution of the different clusters
02:18in the tumor,
02:20including the different immune cell type.
02:25And in this way,
02:27we can interrogate different types
02:29of the immune cell composition.
02:32And also there,
02:33we can look at what kind of gene features
02:37are enriched for each cell cluster.
02:40One of the recent (mumbles)
02:43is the emergence of the high-throughput
02:46spatial transcriptomics or the HST technology.
02:51So, with the emergence of the HST technology,
02:56we do not only look at the gene expression
02:59at the cell level or the close-to-cell level.
03:03We can now also know the corresponding
03:06spatial information.
03:08The figure at the bottom shows one example.
03:12And here it shows the mouse brain tissue,
03:16and each point
03:19here corresponds to one spot,
03:21which is a group of a smaller...
03:24small number of cells, like two to ten at most.
03:29And color here indicates the expression level of different genes.
03:34So the left one corresponds to the Hpca gene.
03:39The right one corresponds to the Ttr gene, for example.
03:48And with the HST data,
03:50we can do a lot of interesting science
03:54to improve the parity in current medication.
03:57So for example,
03:59we can now look at the spatial information
04:01of the tissue architecture at the transcriptomics level.
04:07And then we can also investigate
04:09the cell-cell communication with the spatial information
04:13in our hand.
04:15So the figure at the bottom left shows the UMAP.
04:19And here,
04:20the different colors indicate different cell clusters.
04:24And if you look at the figure on the right,
04:27then you can see that they cluster
04:30in a meaningful way on the tissue.
04:33So in this way, we do not only look at the different cell types
04:36within a tissue,
04:38but also look at their spatial information at the same time.
04:47And there are many exciting applications
04:49of the HST experiment, including the neuroscience.
04:57Including the brain cancer study such as the immuno-oncology
05:02and the developmental biology
05:04which looks at the changes of the cellular composition
05:08across the different stages of development.
05:12And here I specifically discuss the application
05:16in the cancer, especially the tumor microenvironment.
05:20And with the spatial information,
05:22we can now study the location of the immune cells
05:27and the tumor cells in the tumor tissue.
05:31We can also interrogate implication of distance
05:35on the tissue and their corresponding density.
05:39And we can also study the distribution
05:42of the immune regulator.
05:45And finally, the special spatial patterns
05:49such as the tertiary lymphoid structure.
05:56Then from the statistical point of view,
05:59what do the HST data look like?
06:05The first observation is that the HST data show spatial structure,
06:10reflecting the tissue architecture in a meaningful way.
06:13So as we discussed earlier,
06:16we can see that similar types of cell clusters
06:19are often located in close proximity in the tissue.
06:27And even after we exclude such kind of cell composition
06:32in the spatial location,
06:35we can start to see some spatial pattern in the expression
06:38on the tissue.
06:40So the figure on the top shows the expression pattern
06:43of the three genes,
06:46PCP4, MBP and MT-CO1,
06:50after regressing out the cell clusters.
06:56And as you can see, even after considering
06:58the cell cluster patterns,
07:00you can start to see some interesting spatial patterns.
07:05The figure at the bottom shows the distribution
07:09of each gene for each cell cluster.
07:14And you can see that sometimes it's symmetric,
07:17but also often we can see asymmetry
07:22in the distribution for each gene.
07:27So these are some of the key features
07:30of the HST data we want to consider
07:33in the modeling of the HST data.
07:37So if I briefly summarize,
07:39gene expression outcomes feature complex correlation,
07:43such as the spatial correlation
07:46and also gene-gene correlation,
07:49which mainly reflects the biological pathways.
07:52Spatial structure can be
07:55(mumbles)
07:56cellular clustering and gene expression patterns.
08:00And gene expression densities
08:02often feature skewness and/or heavy tails
08:06due to outlier cell spots.
08:09So ideally we seek to provide a model
08:13for identifying the tissue architecture
08:16while accommodating these challenging features.
08:24So, especially during the last two years,
08:28several statistical methods have been proposed
08:32to model HST data.
08:34And still many of them are network-based approaches,
08:39partially because of Seurat, the very famous package
08:43for single-cell genomic data analysis.
08:46And the network-based approach has been proven
08:49to be powerful in this context.
08:51So based on that, multiple network-based approaches
08:56have been proposed, including Giotto, Seurat and stLearn.
09:03In terms of the statistical model,
09:07recently BayesSpace was proposed by the group of the
09:12(mumbles)
09:13at the Fred Hutchinson.
09:15And essentially,
09:16it uses a multivariate-t mixture model
09:21to cluster cell spots.
09:24It implements spatial smoothing of clusters
09:27via a Potts model prior on cluster labels.
09:32And interestingly,
09:34they try to predict sub-spots to increase the resolution.
09:41In spite of such interesting features,
09:44it has also some number of drawbacks.
09:47For example,
09:49it assumes the symmetry of the gene expression densities,
09:52and it also relies on the approximate inference.
09:58And here our goal is to develop a statistical model
10:03that overcomes these limitations
10:06and also provides the optimal tissue architecture prediction
10:11using the HST data which we call SPRUCE
10:18or the spatial random effects-based clustering
10:20of the single cell data.
10:30So this is our SPRUCE model.
10:35So here we use the i as the index for the cell spot
10:40in the tissue sample.
10:42And then we denote y i
10:45as the length-g gene expression vector for spot i.
10:50And based on the y i, we assume a finite mixture model
10:56of the form.
10:59So here we assume K mixture components,
11:03or the cell spot clusters.
11:06Theta k indicates the set of the parameters
11:10specific to mixture component k.
11:13Pi k is the probability of the spot i
11:17belonging to the component k.
11:22We further introduce z1 to zn,
11:27which are the latent mixture component indicators
11:31for each spot.
11:33And z i can take a value from one to K.
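[Editor's sketch] The mixture model described in this passage can be written out as below. This is a reconstruction from the spoken description, not a copy of the slide; the symbols g (number of genes), K (number of components), and f (component density) are assumed notation.

```latex
\[
  f(\mathbf{y}_i \mid \boldsymbol{\theta}_1,\dots,\boldsymbol{\theta}_K)
    = \sum_{k=1}^{K} \pi_k \, f(\mathbf{y}_i \mid \boldsymbol{\theta}_k),
  \qquad \mathbf{y}_i \in \mathbb{R}^{g},
\]
\[
  \Pr(z_i = k) = \pi_k, \qquad z_i \in \{1,\dots,K\}, \qquad \sum_{k=1}^{K} \pi_k = 1 .
\]
```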
11:37And as I mentioned earlier,
11:40skewness and gene-gene correlation
11:42are key features of the HST data.
11:47So to account for skewness and gene-gene correlation,
11:51we assume a multivariate skew-normal distribution
11:56with these parameters.
11:59So the first one indicates the mean vector for spot i,
12:03and alpha k indicates gene-specific skewness parameters
12:07for mixture component k.
12:10And omega k is the g by g scale matrix that captures correlation
12:15among the gene expression features in component k.
12:24And then we further represent MSN distribution
12:28using a convenient conditional representation.
12:31We use mu k for the mean of component k,
12:36phi i for the spatial effect,
12:38and t i and xi k for the component-specific skewness
12:44of each gene.
12:47Epsilon i for the multivariate normal error.
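[Editor's sketch] The conditional representation just described can be written in the speaker's symbols as below. The half-normal distribution for t i and the name Sigma k for the error covariance are assumptions taken from the standard stochastic representation of the skew-normal distribution; the slide itself is not reproduced in the transcript.

```latex
\[
  \text{Given } z_i = k \text{ and } t_i: \qquad
  \mathbf{y}_i = \boldsymbol{\mu}_k + \boldsymbol{\phi}_i
               + t_i\,\boldsymbol{\xi}_k + \boldsymbol{\varepsilon}_i,
\]
\[
  t_i \sim \mathrm{N}_{[0,\infty)}(0, 1), \qquad
  \boldsymbol{\varepsilon}_i \sim \mathrm{N}_g(\mathbf{0}, \boldsymbol{\Sigma}_k).
\]
```

Under this form, a positive entry of xi k pushes the corresponding gene's density to the right, and setting xi k to zero recovers an ordinary multivariate normal component.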
12:53And then in order to further accommodate spatial dependence,
12:58we used the multivariate intrinsic
13:00conditionally autoregressive,
13:02or the CAR prior for phi i.
13:05So essentially,
13:08given all the spots except for spot i,
13:12we model phi i as a normal distribution
13:17with the mean of its neighbors.
13:21And with the covariance matrix denoted as the lambda.
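[Editor's sketch] A minimal illustration of the conditional distribution described above: the conditional mean of phi i is the average of its neighbors' effects. The function name, the neighbor dictionary, and the 1/m_i scaling of Lambda (the usual intrinsic CAR convention, with m_i the number of neighbors) are assumptions for illustration, not the SPRUCE implementation.

```python
def car_conditional(phi, neighbors, i):
    """Return (mean, scale) of the CAR conditional for spot i.

    phi       -- list of per-spot effect vectors (each a list of g floats)
    neighbors -- dict mapping a spot index to the list of its neighbor indices
    The conditional covariance is taken to be scale * Lambda, scale = 1 / m_i
    (assumed convention).
    """
    nbrs = neighbors[i]
    m_i = len(nbrs)
    if m_i == 0:
        raise ValueError("spot %d has no neighbors" % i)
    g = len(phi[i])
    # Average the neighbors' effects gene by gene.
    mean = [sum(phi[j][d] for j in nbrs) / m_i for d in range(g)]
    return mean, 1.0 / m_i


# Toy example: three spots on a line, two genes per spot.
phi = [[1.0, 0.0], [0.0, 0.0], [3.0, 2.0]]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
mean_1, scale_1 = car_conditional(phi, neighbors, 1)
print(mean_1, scale_1)
```

For the middle spot, the conditional mean is the average of the two end spots, so spots with more neighbors are smoothed with a tighter conditional covariance.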
13:33And as you saw earlier,
13:35we see two different levels of spatial patterns.
13:40One is the spatial pattern of the cell clustering.
13:44And another one is the spatial pattern
13:46of the gene expression.
13:48So for the spatial pattern of the cell clusters,
13:53we want to allow the probability pi
13:58of belonging to each mixture component
14:01to vary spatially as well.
14:03So in order to do so,
14:05we extend the model I showed previously
14:09using pi i k,
14:12which is spot-specific.
14:14And then here we modeled this one as the sigmoid
14:18of the two parameters.
14:21The first one is the intercept
14:23for the baseline propensity of the membership
14:28into component k shared by all cell spots.
14:32And the second term indicates the spatial random effects
14:35allowing the variation about the intercept.
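[Editor's sketch] One way to write the membership model described here is as a multinomial logit. The symbols delta k (intercept) and psi i k (spatial random effect) are assumed names, and the softmax form is an assumption; the transcript only says "sigmoid", and the slide is not reproduced. The reference-category constraint is also an assumption.

```latex
\[
  \pi_{ik} \;=\; \Pr(z_i = k)
    \;=\; \frac{\exp(\delta_k + \psi_{ik})}
               {\sum_{h=1}^{K} \exp(\delta_h + \psi_{ih})},
  \qquad k = 1,\dots,K,
  \qquad \delta_1 = \psi_{i1} = 0 .
\]
```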
14:42And again,
14:43to introduce the spatial association
14:46into the component membership model,
14:49we further assume the univariate intrinsic CAR prior.
14:53As you can see here.
14:55And here the one computational challenge,
15:01if we introduce this format,
15:04is that it does not allow us to...
15:06It does not provide a closed-form posterior distribution,
15:10which prevents Gibbs sampling.
15:12And in order to address this computation challenge,
15:17we extended our model
15:20using the results from Polson et al., 2013, JASA,
15:25on Polya-Gamma data augmentation to allow for Gibbs sampling
15:30of the mixing weight model parameters.
15:34And essentially,
15:34we could assume that this can be represented
15:38as the Polya-Gamma data augmentation.
15:42And by doing so,
15:43everything can be implemented as the Gibbs sampler.
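[Editor's sketch] The key identity behind Polya-Gamma data augmentation (Polson, Scott and Windle, 2013) is shown below for reference. Here eta stands for a generic linear predictor; how it maps onto the mixing-weight parameters of SPRUCE is not spelled out in the talk, so that mapping is left open.

```latex
\[
  \frac{\left(e^{\eta}\right)^{a}}{\left(1 + e^{\eta}\right)^{b}}
    \;=\; 2^{-b}\, e^{\kappa \eta}
          \int_{0}^{\infty} e^{-\omega \eta^{2}/2}\, p(\omega)\, d\omega,
  \qquad \kappa = a - \tfrac{b}{2}, \qquad \omega \sim \mathrm{PG}(b, 0).
\]
```

Conditional on omega, the logistic-type likelihood is a Gaussian kernel in eta, so normal priors stay conjugate, and omega given eta is again Polya-Gamma, PG(b, eta). That is what makes each update a closed-form Gibbs step.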
15:49In the case of the further outliers or heavy-tails,
15:53we can even further extend the model
15:56to the multivariate skew-t distribution
15:59that you can see here.
16:00Which can be very easily implemented
16:03given the existing model.
16:07To complete our model specification,
16:10we use weakly informative priors,
16:14and also conjugate priors.
16:16And by using these conjugate priors,
16:19we can do everything using a full Gibbs sampler
16:23in closed form,
16:24which provides fast computation.
16:29And some additional consideration.
16:33So here,
16:34one question is the optimal K,
16:38or the number of cell spot clusters.
16:41So for this purpose,
16:42we use model selection approaches,
16:46and specifically we use the WAIC,
16:49or the widely applicable information criterion.
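[Editor's sketch] For readers unfamiliar with WAIC, here is a minimal computation from a matrix of pointwise log-likelihoods evaluated at posterior draws. The function name and input layout are assumptions for illustration; this is the generic WAIC formula on the deviance scale, not code from the SPRUCE package.

```python
import math


def waic(log_lik):
    """WAIC on the deviance scale (lower is better).

    log_lik[s][i] is the log-likelihood of spot i under posterior draw s.
    """
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    if n_draws < 2:
        raise ValueError("need at least two posterior draws")
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        col = [log_lik[s][i] for s in range(n_draws)]
        # Log of the posterior-mean likelihood, computed stably (log-mean-exp).
        top = max(col)
        lppd += top + math.log(sum(math.exp(v - top) for v in col) / n_draws)
        # Effective number of parameters: sample variance of the log-likelihood.
        mean = sum(col) / n_draws
        p_waic += sum((v - mean) ** 2 for v in col) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)


# Toy input: two posterior draws, two spots.
example = waic([[-1.0, -2.0], [-1.5, -2.5]])
```

In a typical workflow one would fit the model for several candidate values of K and prefer the K with the smallest WAIC; whether SPRUCE automates this search is not stated in the talk.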
16:55In Bayesian mixture models it's very common
16:57to observe the label switching problem.
17:00So to protect against the label switching issue
17:03in the MCMC sampler, we use the canonical projection of z
17:08following Peng and Carvalho, 2016.
17:13And finally for the actual implementation,
17:17we use Rcpp
17:19to further improve the computation efficiency.
17:27We implemented the proposed model as an R package, SPRUCE,
17:33and it's currently available from our GitHub page.
17:38Here.
17:40And then the figure shows our GitHub page.
17:45When we developed our software,
17:49one of the popular software tools for pre-processing
17:53and analyzing the HST data
17:57was the Seurat workflow.
18:00So when we developed our software,
18:02we provided integration with the Seurat workflow
18:05so that our software can be embedded
18:10as part of the (mumbles) flow.
18:12So for example,
18:14the data can be loaded into R using Seurat,
18:19and then people can apply the pre-processing
18:23using the Seurat workflow.
18:26And then that object
18:26can be fed into the SPRUCE analysis workflow.
18:31And then the output from SPRUCE can, again,
18:34be fed into the Seurat workflow for the visualization
18:39and downstream analysis.
18:46So first for the simulation,
18:50the first for the simulation is about the...
18:54Has the two purposes.
18:56So first one is to assess the validity
18:59of the parameter estimation algorithm.
19:02And second is to quantify the effect
19:05of ignoring skewness and spatial information.
19:09So in order to make our simulation more realistic,
19:13we use the sagittal mouse brain data as the tissue shape
19:18and the spot location.
19:20And we simulated four clusters
19:23from the multivariate skew-normal distribution
19:27with the 16 genes.
19:31We considered the 26...
19:352696 spots.
19:38And then we considered three models,
19:41including the multivariate normal with no spatial effects,
19:44the multivariate skew-normal with no spatial effects,
19:45and the multivariate skew-normal with spatial effects.
19:49So the first one shows the implication
19:51of ignoring both skewness and spatial structure.
19:55The second shows the implication
19:58of ignoring the spatial structure.
20:00And the final one is our proposed model.
20:05And here the top left figure,
20:08shows the true cluster labels.
20:11And the top right shows
20:12the UMAP reduction of the gene expression pattern.
20:18And as you can see, we can make the orange and the green,
20:22which is far away from each other,
20:24similar in the gene expression,
20:26so that it can be more challenging in the prediction.
20:30And we evaluated the performance of each model
20:35using the ARI, where a value close to one
20:39indicates better performance.
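[Editor's sketch] The ARI (adjusted Rand index) compares a predicted clustering with reference labels and is invariant to how the clusters are numbered. Below is a small self-contained implementation of the standard formula; the function name is chosen for illustration, and the talk does not say which implementation the authors used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same spots."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        return 1.0
    # Contingency counts and their margins.
    pairs = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    index = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every spot in one cluster.
        return 1.0
    return (index - expected) / (max_index - expected)


# Same partition with swapped label names, and a maximally discordant one.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
discordant = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of one means the two partitions agree exactly, values near zero mean agreement no better than chance, and negative values are possible.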
20:42And as you can see here, when we ignore
20:47the skewness and the spatial pattern,
20:51there is the big loss of the ARI.
20:55And by considering the skewness,
20:57we gain some, but there is still some loss.
21:00And by further considering the spatial pattern,
21:04we can achieve a high level of the ARI.
21:11And for the real data application,
21:14we consider the two applications.
21:18So,
21:20to compare the performance of the SPRUCE to existing tools,
21:26we used the 10X Visium human brain data
21:30from the Maynard et al, 2021, Nature Neuroscience.
21:36Here we have about 3000 spots.
21:40And one of the good aspects of this data is
21:45it's very well annotated.
21:48So, the authors,
21:51using their expert knowledge,
21:54annotated the 3000 spots into the 5 brain layers,
22:00including the white matter and the frontal cortex layers.
22:05And as I mentioned earlier,
22:06we use the standard Seurat pre-processing pipeline,
22:11including the normalization using SCTransform
22:16and also selection of the most variable genes
22:20using the existing pipeline.
22:22We consider the top 16 most variable genes.
22:29And we also consider four other existing algorithms,
22:34including BayesSpace, stLearn, Seurat and Giotto,
22:40as the competing algorithms.
22:42And we use the default parameters for each of them.
22:49Here it shows the results,
22:52and top left figure shows the manual annotation
22:54provided by the author in the paper.
22:58And you can see the nice, five spatial clusters
23:03from inside out.
23:05And also there you can see
23:08that there is one, narrow cell cluster
23:11corresponding to the number four.
23:16Here we show the results for SPRUCE,
23:18BayesSpace, stLearn, Seurat and Giotto.
23:24And in this case, the network-based approaches,
23:28including stLearn, Seurat and Giotto,
23:32all showed a lower performance compared to the other algorithms.
23:38BayesSpace showed relatively higher performance,
23:42with an ARI of about 0.55.
23:46SPRUCE further improved the performance
23:49compared to the BayesSpace.
23:52And one thing I noted here is the...
23:58The narrow cell cluster
24:00could be identified by SPRUCE,
24:04which is interesting.
24:06And as the second example.
24:10So the first one is labeled data.
24:13We can compare our prediction to the existing annotation.
24:17And to further demonstrate the application of the SPRUCE
24:22to unlabeled data, we analyze the publicly available
24:26human invasive ductal carcinoma breast tissue.
24:31Again using the 10X Visium platform.
24:36And we essentially followed the similar workflow
24:38and we identify the top 16 most spatially variable genes.
24:45And those included several tumor associated antigens,
24:50TAA, including GFRA1 and CXCL14.
24:56And also there is the tumor suppressor gene,
25:00like MALAT1.
25:04And we use the SPRUCE to identify the 5 sub regions
25:10using these 16 features.
25:12This shows the 16 most variable genes.
25:16And you can see that there are very clear spatial patterns.
25:22For example, CXCL14 and GFRA1
25:28are expressed higher on the bottom right side,
25:30while MALAT1 is expressed higher on the top left side.
25:38And this is the cluster prediction
25:42made by the SPRUCE algorithm.
25:46And you can see that it identified
25:48cluster 2, which highly coincides with CXCL14
25:55and GFRA1.
25:59(mumbles)
26:01While cell cluster 1
26:05is MALAT1,
26:09which is more tumor suppressor.
26:13So here we can see that the SPRUCE can identify
26:17the different group of the tissue architecture,
26:20such as the tumor suppressor and then tumor related
26:25(mumbles)
26:33And we can also easily look at the
26:37within-cluster expression pattern
26:40and gene-gene correlation.
26:43As you could see earlier,
26:44cell cluster 2, which corresponds to the right,
26:49is higher in GFRA1 and CXCL14.
26:52Cluster 1, which corresponds to here, is high in MALAT1,
26:57and so on.
26:58And also, in the case of cell cluster 2,
27:02there's a very strong gene-gene correlation pattern.
27:06So this supports the proposed model that considers
27:11spatial pattern and also gene-gene correlation
27:14simultaneously.
27:20So,
27:21so far I discussed the method
27:26for our SPRUCE and its application.
27:33And then we essentially expanded our work a little bit more
27:38to MAPLE,
27:39which is the multi-sample spatial transcriptomics model.
27:44Why do we care about the multi-sample analysis of HST data?
27:49So currently most algorithms are designed in a way
27:53that focuses on a single sample.
27:57But even intuitively,
27:59joint analysis of the multiple HST data
28:03can potentially boost the signal
28:06by sharing the information amongst samples.
28:09And also the joint analysis of the different samples
28:13can allow the differential analysis of the HST data.
28:18So very often, each tissue is not our main interest.
28:24But we also want to compare tissue architecture
28:27between the different samples.
28:30For example, between the disease group versus the controls,
28:35responders versus non-responders to certain treatments,
28:39such as the cancer immunotherapy.
28:41So to address this limitation, we proposed MAPLE.
28:47And actually our existing SPRUCE framework
28:50already allows this one naturally.
28:55So, simply what it can do is
28:57instead of now analyzing each sample individually,
29:01we can jointly analyze all the samples together.
29:05And then by doing so,
29:06we can share information
29:08about the modeling of each cell spot cluster,
29:13and also their spatial pattern.
29:17But by introducing the sample-level covariates x i
29:22in the cell type composition,
29:27we can see the impact
29:29of the different sample-level covariate.
29:33Which I show more in detail in the coming slides.
29:41So the first application is the same mouse brain data,
29:45the human brain data...
29:47Sorry this should be the mouse brain,
29:49and here we see the two anterior parts,
29:54which look very similar.
29:56And then as you can see here,
29:57when we jointly analyze the two samples
30:01corresponding to the same part of the brain,
30:04it nicely identifies the corresponding parts
30:08between the two samples.
30:10Like one in the end, three on the top,
30:14five at the bottom and so on.
30:17And because this is a Bayesian framework,
30:21it can also provide uncertainty measures
30:25about our clustering prediction.
30:28And as you can see, usually there is more uncertainty
30:31about the clustering prediction
30:35around the boundary between different cell clusters.
30:38Which kind of makes sense,
30:40because we expect that maybe cell type
30:43might be more mixed together in the same cell spot.
30:48Also, there are some cell clusters
30:50with the higher level of the uncertainty
30:53of which we are still trying to understand more
30:56at this point.
30:58And this kind of figure shows the...
31:01the utility of this kind of joint analysis.
31:06So, for identifiability,
31:09we set the first cell cluster as the reference.
31:14And then here we see the two (mumbles)
31:16The top one shows the intercept,
31:20and then we can interpret this one as the relative size
31:25of each cell cluster.
31:26So then compared to the one,
31:29we can say three and the six are larger.
31:32So the three and the six are larger, compared to the one.
31:36While the four is smaller,
31:39well, just smaller compared to the one.
31:40So this is what you can see by eye
31:45from the tissue prediction region.
31:48But good thing is that this model allows us to quantify,
31:52what you see by eye.
31:55And what is more interesting is the second one.
31:57So this one,
31:58is about the difference between the two samples.
32:02So again,
32:04so basically if it's higher,
32:07then it means that a certain tissue spot cluster
32:12is getting bigger in the second sample.
32:14And if it's lower, it means it's kind of smaller
32:18in the second sample and so on.
32:20So in this way,
32:21we can quantify the change of the tissue architecture
32:26between different cell clusters.
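[Editor's sketch] A toy illustration of how an intercept and a sample effect could translate into cluster proportions when the first cluster is the reference. All numbers are made up, the 0/1 sample indicator x is an assumed coding, and spatial random effects are left out; this is not the MAPLE implementation.

```python
import math


def cluster_proportions(intercepts, sample_effects, x):
    """Softmax over linear predictors intercept_k + x * effect_k."""
    eta = [a + x * b for a, b in zip(intercepts, sample_effects)]
    top = max(eta)  # subtract the maximum for numerical stability
    weights = [math.exp(e - top) for e in eta]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical values; cluster 1 is the reference, so its entries are 0.
intercepts = [0.0, 0.5, -1.0]
sample_effects = [0.0, -0.8, 0.6]
first = cluster_proportions(intercepts, sample_effects, x=0)
second = cluster_proportions(intercepts, sample_effects, x=1)
print([round(p, 3) for p in first])
print([round(p, 3) for p in second])
```

A positive intercept means a cluster is larger than the reference in the first sample, and a positive sample effect means it grows relative to the reference in the second sample, which matches the reading of the two panels given in the talk.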
32:30And another interesting example is this one.
32:35So here, in addition to the two anterior samples,
32:39we now also look at the posterior sample as well.
32:44So because these are two parts of the brain,
32:48anterior and posterior,
32:50the tissue is kind of continuous between the two.
32:53And as you can see here,
32:54cell cluster three is connected to the posterior side here.
33:00Cell cluster one is connected to here and so on.
33:05And then this kind of pattern is not clear
33:08if you analyze each data independently.
33:12And our MAPLE framework nicely captures
33:16such kind of sharing pattern.
33:18And also the difference pattern
33:20between the different samples, interestingly.
33:23So at this point,
33:25we are working on more simulation study
33:27and the real data analysis
33:29to further show the performance
33:33and understand the properties of the MAPLE at this point.
33:39So then I can summarize my presentation today.
33:45So the high throughput spatial transcriptomics, or HST,
33:50provides unprecedented opportunities
33:54to investigate novel biological hypotheses,
33:57such as the tumor microenvironment and certain structure
34:05about the human brain and Alzheimer,
34:08and so on.
34:10And here we propose SPRUCE,
34:13a Bayesian multivariate mixture model
34:16for HST data analysis.
34:19SPRUCE has multiple strengths
34:23including the novel combination
34:26of the skew-normal density,
34:29Polya-Gamma data augmentation,
34:31and spatial random effect.
34:35Altogether, it allows to
34:37precisely infer spatially correlated mixture component
34:41membership probabilities.
34:44In our simulation study and real data analysis,
34:49we could see that SPRUCE outperforms the existing method,
34:53in the tissue architecture identification.
34:56And finally our recent extension of the MAPLE
35:01allows the joint clustering and differential analysis
35:05of multiple HST data.
35:09So at this point SPRUCE is under review in,
35:13(mumbles)
35:15in Biometrics.
35:17The corresponding manuscript is available on bioRxiv.
35:22And there are multiple ongoing work
35:25regarding the HST data modeling
35:28in our lab.
35:30So we are actually currently working on further improving
35:35the SPRUCE and the MAPLE
35:37by incorporating other characteristics
35:39of the HST data, such as the relationships among cells.
35:44For example,
35:46we know that there are some ligands and receptors,
35:50for example.
35:51Which we expect that they interact with each other
35:55in their cell structure.
35:57And then by incorporating different prior information,
36:01we can further improve the SPRUCE and MAPLE.
36:06We are also working on the other statistical models
36:10for somewhat relevant, but different tasks.
36:14For example,
36:15currently we are also working on the deep learning framework,
36:19especially the graph neural network,
36:22which is called RESEPT.
36:24And then using the GNN framework,
36:28we tried to come up with good embedding
36:30of the HST gene expression pattern.
36:34Our current results show that such a combination
36:38of the deep learning and the statistical model approach
36:41can provide nice prediction performance.
36:47For this purpose, we developed a framework called RESEPT,
36:52and the corresponding bioRxiv preprint
36:55is also available publicly.
36:57And then the corresponding paper
36:59is now under revision in Nature Communications.
37:06Regarding cell-cell communications,
37:09using network-based approaches has some benefit
37:12because the cell-cell communication can be nicely
37:16and naturally modeled using a graph or network.
37:21So we have the parallel work called Banyan
37:25to identify the cell-cell communication
37:27and tissue architecture using the network-based approaches.
37:31And finally, there are the multiple effort experimentally
37:37to generate the spatial multimodal data.
37:42For example,
37:43the effect to seek such as the single cell genomics,
37:48proteomics and the T-cell receptor at the same time.
37:53And very soon,
37:54everything is expected to be combined
37:57as the spatial transcriptomic structure.
38:01We are working on the direction
38:03to develop the statistical model
38:06for integration of the HST data with other matched data.
38:13So I would like to acknowledge my research team at OSU.
38:18Carter Allen is the main driver of this project,
38:23and also my pitch assistant
38:27Qin Ma and Yuzhou Chang are my close collaborators
38:32for the HST data modeling project.
38:36And Zihai Li,
38:38who is the director of the Immuno-Oncology Institute
38:42and also the expert in cancer.
38:46Won Chang at the University of Cincinnati
38:49who is the spatial statistics expert,
38:53and MUSC collaborator Brian Neelon
38:56and my grant support.
39:00So, and this is the end of my presentation,
39:03and you can find my manuscript
39:05and the software from the link here.
39:10If you have any questions and comment,
39:13please let me know by email at chung.911@osu.edu.
39:17So thank you for your attention.
39:29<v ->So thank you.</v>
39:32Do we have any questions from the audience in the classroom,
39:36or from the audience on zoom?
39:43<v ->Can I ask a question?</v>
39:45Can you hear me?
39:46<v ->Yes, mm-hm.</v>
39:47<v ->Right, Dongjun welcome back.</v>
39:50Great work, it's a nice presentation.
39:52I'm just wondering, like,
39:53when you do this from your own experience
39:56on the cell clustering,
39:57how much the spatial information contributes
40:00to the clustering.
40:03<v ->Sure.</v>
40:09So,
40:10(mumbles)
40:15If you're here,
40:16so if you look at the Seurat workflow,
40:18you can see there's a still lot of the, kind of,
40:22local boundary between different cell spot clusters.
40:28And when you analyze the same data using the SPRUCE,
40:32you can see much cleaner boundary.
40:34And often it will coincide with the
40:36expert analogy annotation.
40:40So given that, there is a significant contribution.
40:46Of course, even with the gene expression alone,
40:49we still get some big picture, as you can see here.
40:52But spatial information provides much cleaner prediction
40:57about the tissue architecture in general.
41:01<v ->I see.</v>
41:02And also the skewness.
41:05Do you estimate that, or is that like you have to
41:08pre-specify the skewness?
41:12<v ->You mean which one?</v>
41:14<v ->Alpha k, in the model.</v>
41:16Does your model need to specify the alpha k you have there?
41:19I missed that part.
41:20Like, do you need to specify the skewness?
41:24<v ->All learned from data.</v>
41:26<v ->Oh, I see.</v>
41:27But from the data, how skew?
41:29I mean, just in terms of how stable
41:31that alpha k can be estimated.
41:36<v ->So maybe I can answer it in two different ways.</v>
41:40So if there is this skewness in the data, I think yes.
41:44So I would say it depends on how you process the data as well.
41:49So usually there are three different approaches
41:51to model the HST data, including spatially variable genes.
41:57And so you can see here,
41:58some people use the principal components.
42:02Some people use deep learning
42:05as the embedding step.
42:08If you use deep learning or PCA,
42:12it's more likely symmetric in the real data.
42:16If you consider the spatially variable genes,
42:20we often happen to have the skewness, as you can see here.
42:24And then regarding your question, overall it works well.
42:30I don't have the exact quantification, but it works well.
42:34It's estimated stably in most cases.
42:37<v ->Yeah, I read the BayesSpace paper.</v>
42:39They seem to be working on the principal components, right?
42:41They do not work on individual genes, right?
42:43<v ->No, yeah.</v>
42:44They base this on the PCA.
42:46<v ->Yeah, that's why it's completely puzzling me</v>
42:47while you're doing that.
42:48But anyway, yeah.
42:49Thank you.
42:50<v ->Yeah so, so...</v>
42:51(mumbles)
42:54so they mainly target the PCA.
42:55So they only consider the multivariate t distribution.
42:59And also because of the same reason,
43:01their covariance matrix is less dense.
43:05<v ->I see.</v>
43:08Thank you.
43:09<v ->Thank you.</v>
43:22<v ->Do we have any questions from students in the classroom?</v>
43:32<v ->Wait, can I ask another question?</v>
43:36So, towards the end,
43:37you mentioned you tried
43:38to look at the cell-cell communication.
43:45That part.
43:46I'm very interested in that
43:49From our experience on the single cell spatial data are...
43:55Are you talking about you're learning from the single cell,
43:58or the spatial single cell?
44:02<v ->So, regarding the cell-cell communication</v>
44:05it's still very ongoing research at this point.
44:10I mean, not just our side but in general.
44:13Because most of the cell-cell communication prediction
44:17is based on the database.
44:20So based on, like, the ligand-receptor
44:22pair database, and checking
44:25their corresponding expression in the corresponding spot
44:28or the cell.
44:29And then by checking the corresponding pair
44:33of the expression pattern
44:34between the ligand and the receptor,
44:36they want to model cell-cell communication.
44:40It's not perfect, as you know,
44:42because it's like a computer.
44:45If you look at the chip, it's almost like
44:48(mumbles)
44:48but more like motif analysis.
44:51So there's some limitation,
44:53but it's a more likely general limitation at this point.
44:57<v ->Yeah,</v>
44:58I'm asking because we've been looking
44:59at some of the spatial single cell data
45:02that were too noisy for the ligand
45:04and receptor gene expression levels.
45:07Just couldn't make it too far.
45:09(mumbles)
45:11But for a single cell, may be different?
45:13I mean, probably there'll be more that, like...
45:17<v ->Yeah, three already.</v>
45:18I mean, so if you go to high-resolution,
45:22it's very noisy,
45:24so very often we need to do some simplification.
45:28Like looking at multi-modal or the cell cluster,
45:31rather than the cell.
45:34It's still very multiple experimental limitation,
45:38at this point.
45:39(mumbles)
45:40Thank you.
45:51(class teacher addresses classroom)
46:00<v ->On the data from multiple samples</v>
46:03So, if we have samples from...
46:06(mumbles)
46:18<v ->Oh yeah, that's a very good question.</v>
46:22So,
46:23actually we can answer in the two different ways.
46:29In some sense,
46:30good pre-processing is still important
46:35because it still depends on the expression patterns.
46:43But still regarding the differences
46:46between the different tissues.
46:48If there is a big difference,
46:49it can still detect the difference
46:51between the different sample.
46:54So, it can detect spots.
46:55But still like a main goal is more
46:59for the similar type of tissue.
47:01If it's too different,
47:02maybe it's different research project.
47:05So, for example,
47:07here our targets is more about, for example,
47:10like same breast tissue,
47:13but with a different responders and non-responders group,
47:18for example.
47:19Or like the same lung tissue, but tumor versus non-tumor,
47:24and so on.
47:25If it's like human and mouse,
47:29then it might be somewhat different story,
47:33which might need much more work.
47:38<v ->Do we have any more questions here?</v>
47:58Okay, do we have any other questions
47:59from the audience on Zoom?
48:21Okay, so it looks like we don't have any more questions.
48:26So Dr. Chung, thank you again for your nice presentation.
48:31Look forward to meeting in person sometime soon.
48:36<v ->And then thank you again Wei and Hongyu</v>
48:38for the invitation,
48:40and it's great to come back, although virtually.
48:43And I hope to see you again.
48:46<v ->Welcome back in person.</v>
48:49<v ->Hopefully someday soon.</v>
48:53Okay, thank you.