YSPH Biostatistics Seminar: "SPRUCE and MAPLE: Bayesian Spatial Multivariate Mixture Model for High Throughput Spatial Transcriptomics Data"
December 01, 2021
Dongjun Chung, PhD, Associate Professor, Department of Biomedical Informatics, The Ohio State University
November 30, 2021
Information
- ID
- 7227
Transcript
- 00:01<v ->College of Medicine and he is a member</v>
- 00:04of the Pelotonia Institute for Immuno-Oncology
- 00:08of Ohio State University as a candidate and a member.
- 00:12His research focuses on
- 00:15(mumbles)
- 00:15for integrative analysis on genetic and genomic data
- 00:22with biomedical real data.
- 00:26So welcome back Dongjun Chung.
- 00:29(audience member claps)
- 00:32<v ->Okay.</v>
- 00:34Thank you Wei, for the kind introduction
- 00:37and it's so great to come back.
- 00:39Although it's all virtual.
- 00:43I hope someday we can see in person.
- 00:46So today I will discuss our recent project
- 00:52about the SPRUCE and MAPLE: Bayesian Multivariate
- 00:57Mixture Models for Spatial Transcriptomics Data.
- 01:01Oh, by the way, can you hear me well?
- 01:03<v ->Ah yes, we can hear you.</v>
- 01:05<v ->Okay, great.</v>
- 01:07So, let me start with a quick introduction
- 01:12to single-cell genomics.
- 01:16So in some sense,
- 01:17we can say that the last decade was the era of single cell
- 01:21genomic experiments.
- 01:24So it changed science in many ways.
- 01:26And also a great amount of the data has been generated
- 01:32using the single cell genomic technology.
- 01:36Single cell genomic experiments
- 01:38provide high-dimensional data at the cell level.
- 01:42By doing so,
- 01:44it allows us to investigate cellular heterogeneity
- 01:48within each subject or patient,
- 01:52which was not possible previously
- 01:54with bulk genomic data,
- 01:56which means genomic data collected at the tissue level.
- 02:04So some kind of standard visualization
- 02:09of the single cell genomic data is called a UMAP.
- 02:13And here,
- 02:14this UMAP shows the distribution of the different clusters
- 02:18in the tumor,
- 02:20including the different immune cell type.
- 02:25And in this way,
- 02:27we can interrogate different types
- 02:29of the immune cell composition.
- 02:32And also there,
- 02:33we can look at what kind of gene features
- 02:37are enriched for each cell cluster.
- 02:40One of the recent (mumbles)
- 02:43is the emergence of the high-throughput
- 02:46spatial transcriptomics or the HST technology.
- 02:51So, with the emergence of the HST technology,
- 02:56we do not only look at the gene expression
- 02:59at the cell level or the close-to-cell level.
- 03:03We can now also know the corresponding
- 03:06spatial information.
- 03:08The figure at the bottom shows one example.
- 03:12And here it shows the mouse brain tissue,
- 03:16and each circle
- 03:19here corresponds to one spot,
- 03:21which is a group of the smaller...
- 03:24small number of cells, like two to ten at most.
- 03:29And the color here indicates the expression level of different genes.
- 03:34So the left one corresponds to the Hpca gene.
- 03:39The right one corresponds to the Ttr gene, for example.
- 03:48And with the HST data,
- 03:50we can do a lot of interesting science
- 03:54to improve the parity in current medication.
- 03:57So for example,
- 03:59we can now look at the spatial information
- 04:01of the tissue architecture at the transcriptomics level.
- 04:07And then we can also investigate
- 04:09the cell-cell communication with the spatial information
- 04:13in our hand.
- 04:15So the figure at the bottom left shows the UMAP.
- 04:19And here,
- 04:20the different color indicates a different cell cluster.
- 04:24And if you look at the figure on the right,
- 04:27then you can see that they are clustered
- 04:30in a meaningful way on the tissue.
- 04:33So in this way, we do not only look at the different cell types
- 04:36within a tissue,
- 04:38but also look at their spatial information at the same time.
- 04:47And there are many exciting applications
- 04:49of the HST experiment, including neuroscience,
- 04:57including the brain; cancer studies such as immuno-oncology;
- 05:02and developmental biology,
- 05:04which looks at the changes of the cellular composition
- 05:08across the different stages of development.
- 05:12And here I specifically discuss the application
- 05:16in the cancer, especially the tumor microenvironment.
- 05:20And with the spatial information,
- 05:22we can now study the location of the immune cells
- 05:27and the tumor cells in the tumor tissue.
- 05:31We can also interrogate implication of distance
- 05:35on the tissue and their corresponding density.
- 05:39And we can also study the distribution
- 05:42of the immune regulator.
- 05:45And finally, the special spatial patterns
- 05:49such as the tertiary lymphoid structure.
- 05:56Then from the statistical point of view,
- 05:59what does the HST data look like?
- 06:05The first observation is that the HST data has spatial structure,
- 06:10reflecting the tissue architecture in a meaningful way.
- 06:13So as I discussed earlier,
- 06:16we can see that similar types of cell clusters
- 06:19are often located in close proximity in the tissue.
- 06:27And even after we exclude such kind of cell composition
- 06:32in the spatial location,
- 06:35we can start to see some spatial pattern in the expression
- 06:38on the tissue.
- 06:40So the figure on the top shows the expression pattern
- 06:43of the three genes,
- 06:46PCP4, MBP and MT-CO1,
- 06:50after regressing out with respect to the cell clusters.
- 06:56And as you can see, even after considering
- 06:58the cell cluster patterns,
- 07:00you can start to see some interesting spatial patterns.
- 07:05And the figure at the bottom shows the distribution
- 07:09of each gene for each cell cluster.
- 07:14And you can see that sometimes it's symmetric,
- 07:17but also often we can see non-symmetry
- 07:22and a skewed distribution for each gene.
- 07:27So these are some of the key features
- 07:30of the HST data we want to consider
- 07:33in the modeling of the HST data.
- 07:37So if I briefly summarize,
- 07:39gene expression outcomes feature complex correlation
- 07:43such as the spatial correlation,
- 07:46and also gene-gene correlation,
- 07:49which mainly reflects the biological pathway.
- 07:52Spatial structure can be
- 07:55(mumbles)
- 07:56cellular clustering and gene expression patterns.
- 08:00And gene expression densities
- 08:02often feature skewness and/or heavy tails
- 08:06due to outlier cell spots.
- 08:09So ideally we seek to provide a model
- 08:13for identifying the tissue architecture
- 08:16while accommodating these challenging features.
- 08:24So, especially during the last two years,
- 08:28several statistical methods have been proposed
- 08:32to model HST data.
- 08:34And still many of them are network-based approaches.
- 08:39Partially because Seurat is a very famous package
- 08:43for single cell genomic data analysis,
- 08:46and the network-based approach has been proven
- 08:49to be powerful in this context.
- 08:51So based on that, multiple network-based approaches
- 08:56have been proposed, including Giotto, Seurat and stLearn.
- 09:03As for the statistical models,
- 09:07recently BayesSpace was proposed by the group of the
- 09:12(mumbles)
- 09:13at the Fred Hutchinson.
- 09:15And essentially,
- 09:16it uses a multivariate-t mixture model
- 09:21to cluster cell spots.
- 09:24It implements spatial smoothing of clusters
- 09:27via a Potts model prior on cluster labels.
- 09:32And interestingly,
- 09:34they try to predict sub-spots to increase the resolution.
- 09:41In spite of such interesting features,
- 09:44it has also some number of drawbacks.
- 09:47For example,
- 09:49it assumes the symmetry of the gene expression densities,
- 09:52and it also relies on the approximate inference.
- 09:58And here our goal is to develop a statistical model
- 10:03that overcomes these limitations
- 10:06and also provides the optimal tissue architecture prediction
- 10:11using the HST data which we call SPRUCE
- 10:18or the spatial random effects-based clustering
- 10:20of the single cell data.
- 10:30So this is our SPRUCE model.
- 10:35So here we use the i as the index for the cell spot
- 10:40in the tissue sample.
- 10:42And then we denote y i
- 10:45as the length-g gene expression vector for spot i.
- 10:50And based on the y i, we assume a finite mixture model
- 10:56of this form.
- 10:59So here we assume K mixture components,
- 11:03or cell spot clusters.
- 11:06Theta k indicates the set of the parameters
- 11:10specific to mixture component k.
- 11:13Pi k is the probability of the spot i
- 11:17belonging to the component k.
- 11:22We further introduce z1 to zn,
- 11:27which are the latent mixture component indicators
- 11:31for each spot.
- 11:33And zi can take a value from one to K.
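The mixture form described here can be written compactly as follows. This is a sketch reconstructed from the spoken description only; the symbols $K$, $\pi_k$, $\theta_k$ and the component density $f$ are notational assumptions, not copied from the slide.

```latex
f(\mathbf{y}_i \mid \Theta) \;=\; \sum_{k=1}^{K} \pi_k \, f(\mathbf{y}_i \mid \theta_k),
\qquad
\Pr(z_i = k) = \pi_k, \quad z_i \in \{1, \dots, K\}, \quad \sum_{k=1}^{K} \pi_k = 1 .
```

Conditional on $z_i = k$, spot $i$ is drawn from component $k$ alone, which is what makes the latent indicators convenient for sampling.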
- 11:37And as I mentioned earlier,
- 11:40skewness and the gene-gene correlation
- 11:42are key features of the HST data.
- 11:47So to account for skewness and gene-gene correlation,
- 11:51we assume a multivariate skew-normal distribution
- 11:56with these parameters.
- 11:59So the first one indicates the mean vector for spot i,
- 12:03and alpha k indicates gene-specific skewness parameters
- 12:07for mixture component k.
- 12:10And omega k is the g-by-g scale matrix that captures correlation
- 12:15among the gene expression features in component k.
- 12:24And then we further represent MSN distribution
- 12:28using a convenient conditional representation.
- 12:31We use mu k for the mean of component k,
- 12:36phi i for the spatial effect,
- 12:38and t i and xi k for the component-specific skewness
- 12:44of each gene.
- 12:47Epsilon i for the multivariate normal error.
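The conditional representation just described can be sketched in a few lines of Python. This is an illustration, not the SPRUCE implementation: it assumes the usual stochastic representation of the skew-normal, in which t i is a standard normal truncated to be non-negative, and all names (`simulate_msn_spot`, `sigma_k`, the toy parameter values) are hypothetical.

```python
import numpy as np

def simulate_msn_spot(mu_k, xi_k, sigma_k, phi_i, rng):
    """Draw one vector y_i = mu_k + phi_i + t_i * xi_k + eps_i (assumed form)."""
    # Assumption: t_i is N(0, 1) truncated to [0, inf), i.e. half-normal.
    t_i = abs(rng.standard_normal())
    # Multivariate normal error with component-specific covariance.
    eps_i = rng.multivariate_normal(np.zeros(len(mu_k)), sigma_k)
    return mu_k + phi_i + t_i * xi_k + eps_i

rng = np.random.default_rng(0)
g = 3                                   # number of genes (toy value)
mu_k = np.zeros(g)                      # component mean
xi_k = np.array([2.0, 0.0, -2.0])       # right-skewed, symmetric, left-skewed gene
sigma_k = 0.25 * np.eye(g)              # error covariance
phi_i = np.zeros(g)                     # spatial effect switched off for clarity

draws = np.array([simulate_msn_spot(mu_k, xi_k, sigma_k, phi_i, rng)
                  for _ in range(5000)])
print(draws.mean(axis=0))               # gene means shift in the direction of xi_k
```

Because t i is never negative, a positive entry of xi k pushes that gene's distribution to the right and a negative entry pushes it to the left, while a zero entry leaves the gene symmetric.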
- 12:53And then in order to further accommodate spatial dependence,
- 12:58we used the multivariate intrinsic
- 13:00conditionally autoregressive,
- 13:02or the CAR prior for phi i.
- 13:05So essentially,
- 13:08given all the spots except for spot i,
- 13:12we model phi i as the normal distribution
- 13:17with the mean of its neighbors.
- 13:21And with the covariance matrix denoted as the lambda.
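As a rough illustration of the CAR conditional described here, the sketch below computes the conditional mean and covariance of one spot's spatial effect given its neighbors. It assumes the standard intrinsic CAR form, in which the covariance lambda is divided by the number of neighbors; that scaling is not stated in the talk, and the function name and toy adjacency matrix are hypothetical.

```python
import numpy as np

def car_conditional(phi, adjacency, lam, i):
    """Conditional mean and covariance of phi[i] given all other spots.

    phi:       n x g array of spatial effects
    adjacency: n x n 0/1 neighbor matrix (each spot assumed to have >= 1 neighbor)
    lam:       g x g covariance matrix
    """
    neighbors = np.flatnonzero(adjacency[i])
    m_i = len(neighbors)
    cond_mean = phi[neighbors].mean(axis=0)   # average of the neighbors' effects
    cond_cov = lam / m_i                      # assumed intrinsic-CAR scaling
    return cond_mean, cond_cov

# Three spots on a line: 0 - 1 - 2
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]])
phi = np.array([[0.0, 0.0],
                [10.0, 10.0],
                [2.0, 4.0]])
lam = np.eye(2)

mean_1, cov_1 = car_conditional(phi, adjacency, lam, 1)
print(mean_1)   # average of spots 0 and 2
```

The prior therefore pulls each spot's effect toward the average of its neighbors, more tightly for spots with many neighbors.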
- 13:33And as you saw earlier,
- 13:35we see two different levels of spatial patterns.
- 13:40One is the spatial pattern of the cell clustering.
- 13:44And another one is the spatial pattern
- 13:46of the gene expression.
- 13:48So for the spatial pattern of the cell clusters,
- 13:53we want to allow the probability pi
- 13:58of belonging to each mixture component
- 14:01to vary spatially as well.
- 14:03So in order to do so,
- 14:05we extend the model I showed previously
- 14:09using the pi i k,
- 14:12which is specific to spot i.
- 14:14And then here we modeled this one as the sigmoid
- 14:18of the two parameters.
- 14:21And the first term is the intercept
- 14:23for the baseline propensity of the membership
- 14:28into component k shared by all cell spots.
- 14:32And the second term indicates the spatial random effects,
- 14:35allowing variation about the intercept.
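A minimal sketch of spot-specific mixing weights built from an intercept plus a spatial random effect is shown below. It assumes a multinomial-logit (softmax) link over K components with the first component as reference; the talk does not spell out the exact link, and the names `delta` and `psi` are hypothetical.

```python
import numpy as np

def mixing_weights(delta, psi):
    """Spot-specific membership probabilities pi[i, k].

    delta: length-K intercepts (baseline propensity per component)
    psi:   n x K spatial random effects (variation about the intercept)
    """
    eta = delta[None, :] + psi                  # n x K linear predictor
    eta = eta - eta.max(axis=1, keepdims=True)  # stabilize the exponentials
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

n, K = 4, 3
delta = np.array([0.0, 1.0, -1.0])   # first component fixed at 0 as reference
psi = np.zeros((n, K))
psi[0, 2] = 3.0                      # spot 0 sits in a region favoring component 3

pi = mixing_weights(delta, psi)
print(pi.round(3))
```

With all spatial effects at zero every spot shares the same weights, determined by the intercepts alone; a nonzero effect shifts the weights for that spot only.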
- 14:42And again,
- 14:43to introduce the spatial association
- 14:46into the component membership model,
- 14:49we further assume the univariate intrinsic CAR prior.
- 14:53As you can see here.
- 14:55And here, one computational challenge,
- 15:01if you're interested, is this form.
- 15:04It does not allow us to...
- 15:06It does not provide a closed-form posterior distribution,
- 15:10which prevents Gibbs sampling.
- 15:12And in order to address this computational challenge,
- 15:17we extended our model
- 15:20using the results from Polson et al., 2013, JASA,
- 15:25on Polya-Gamma data augmentation to allow for Gibbs sampling
- 15:30of the mixing weight model parameters.
- 15:34And essentially,
- 15:34we could assume that this can be represented
- 15:38as the Polya-Gamma data augmentation.
- 15:42And by doing so,
- 15:43everything can be implemented as the Gibbs sampler.
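For reference, the key identity from Polson et al. (2013) behind this augmentation is sketched below; how it is applied to each term of the mixing weight model is not shown in the talk, so only the general result is given.

```latex
\frac{\left(e^{\psi}\right)^{a}}{\left(1 + e^{\psi}\right)^{b}}
\;=\; 2^{-b}\, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega,
\qquad \kappa = a - \tfrac{b}{2}, \quad \omega \sim \mathrm{PG}(b, 0).
```

Conditional on the latent $\omega$, the logistic-type likelihood is Gaussian in $\psi$, so normal priors on the linear predictor give normal full conditionals, which is what allows Gibbs updates.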
- 15:49In the case of the further outliers or heavy-tails,
- 15:53we can even further extend the model
- 15:56to the multivariate skew-t distribution
- 15:59that you can see here.
- 16:00Which can be very easily implemented
- 16:03given the existing model.
- 16:07To complete our model specification,
- 16:10we use weakly informative priors,
- 16:14and these are conjugate priors.
- 16:16And by using these conjugate priors,
- 16:19we can do everything using a full Gibbs sampler
- 16:23in closed form,
- 16:24which provides the best computation.
- 16:29And some additional consideration.
- 16:33So here,
- 16:34one question is the optimal number K,
- 16:38which is the number of cell spot clusters.
- 16:41So for this purpose,
- 16:42we use a model selection approach,
- 16:46and specifically we use the WAIC,
- 16:49or the widely applicable information criterion.
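As an illustration of how WAIC can be computed from MCMC output, here is a minimal sketch using the common definition on the deviance scale (lower is better). The exact variant used in SPRUCE is not stated in the talk, and the function name and input layout are assumptions.

```python
import numpy as np

def waic(log_lik):
    """WAIC from an S x n array of pointwise log-likelihoods.

    Rows are posterior draws, columns are observations (cell spots).
    """
    n_draws = log_lik.shape[0]
    # Log pointwise predictive density, via a stable log-mean-exp per spot.
    col_max = log_lik.max(axis=0)
    lppd = np.sum(col_max
                  + np.log(np.exp(log_lik - col_max).sum(axis=0))
                  - np.log(n_draws))
    # Effective number of parameters: posterior variance of the log-likelihood.
    p_waic = np.sum(log_lik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

# Toy check: 10 draws, 5 spots, every log-likelihood equal to -1.
toy = np.full((10, 5), -1.0)
print(waic(toy))
```

In practice one would fit the model for each candidate K, compute WAIC from the saved draws, and prefer the K with the smallest value.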
- 16:55In Bayesian mixture models it's very common
- 16:57to observe the label switching problem.
- 17:00So to protect against the label switching issue
- 17:03in the MCMC sampler, we use the canonical projection of z
- 17:08using Peng and Carvalho, 2016.
- 17:13And finally for the actual implementation,
- 17:17we use Rcpp
- 17:19to further improve the computational efficiency.
- 17:27We implemented the proposed model as an R package, SPRUCE,
- 17:33and it's currently available from our GitHub page.
- 17:38Here.
- 17:40And the figure shows our GitHub page.
- 17:45When we developed our software,
- 17:49one of the popular software tools for pre-processing
- 17:53and analyzing the HST data
- 17:57is the Seurat workflow.
- 18:00So when we developed our software,
- 18:02we provided integration with the Seurat workflow
- 18:05so that our software can be embedded
- 18:10as part of the (mumbles) flow.
- 18:12So for example,
- 18:14the data can be loaded into R using Seurat,
- 18:19and then people can apply the pre-processing
- 18:23using the Seurat workflow.
- 18:26And then that object
- 18:26can be fed into the SPRUCE analysis workflow.
- 18:31And then the output from the SPRUCE can, again,
- 18:34be fed into the Seurat workflow for the visualization
- 18:39and downstream analysis.
- 18:46So first for the simulation,
- 18:50the first for the simulation is about the...
- 18:54Has the two purposes.
- 18:56So first one is to assess the validity
- 18:59of the parameter estimation algorithm.
- 19:02And second is to quantify the effect
- 19:05of ignoring skewness and spatial information.
- 19:09So in order to make our simulation more realistic,
- 19:13we use the sagittal mouse brain data as the tissue shape
- 19:18and the spot location.
- 19:20And we simulated four clusters
- 19:23from the multivariate skew-normal distribution
- 19:27with the 16 genes.
- 19:31We considered the 26...
- 19:352696 spots.
- 19:38And then we considered three models,
- 19:41including the multivariate normal with no spatial,
- 19:44multivariate skew-normal with no spatial,
- 19:45and multivariate skew-normal with spatial.
- 19:49So the first one shows the implication
- 19:51of ignoring both skewness and spatial.
- 19:55The second shows the implication
- 19:58of ignoring the spatial structure.
- 20:00And the final one is our proposed model.
- 20:05And here the top left figure,
- 20:08shows the true cluster labels.
- 20:11And the top right shows
- 20:12the UMAP reduction of the gene expression pattern.
- 20:18And as you can see, we made the orange and the green,
- 20:22which are far away from each other,
- 20:24similar in the gene expression,
- 20:26so that it can be more challenging in the prediction.
- 20:30And we evaluated the performance of each model
- 20:35using the ARI, where a value close to one
- 20:39indicates better performance.
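For readers unfamiliar with the ARI (adjusted Rand index), the sketch below computes it from two label vectors using the standard pair-counting formula. It is a self-contained illustration, not the code used in the study, and it assumes at least two labeled spots.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two clusterings of the same spots."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))  # contingency table cells
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:                     # degenerate case, e.g. one cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same partition with swapped label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Partitions that cut across each other: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because the index depends only on which spots are grouped together, it is unaffected by how cluster labels are numbered, which matters when comparing model output to manual annotation.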
- 20:42And as you can see here, when we ignore
- 20:47the skewness and the spatial pattern,
- 20:51there is the big loss of the ARI.
- 20:55And by considering the skewness,
- 20:57we gain some, but there is still a big loss.
- 21:00And by further considering the spatial pattern,
- 21:04we can achieve a high level of the ARI.
- 21:11And for the real data application,
- 21:14we consider the two applications.
- 21:18So,
- 21:20to compare the performance of the SPRUCE to existing tools,
- 21:26we used the 10X Visium human brain data
- 21:30from the Maynard et al, 2021, Nature Neuroscience.
- 21:36Here we have about 3000 spots.
- 21:40And one of the good aspects of this data is
- 21:45it's very well annotated.
- 21:48So, the authors,
- 21:51using their expert knowledge,
- 21:54annotated the 3000 spots into the 5 brain layers,
- 22:00including the white matter and the frontal cortex layers.
- 22:05And as I mentioned earlier,
- 22:06we use the standard Seurat pre-processing pipeline,
- 22:11including the normalization using SCTransform
- 22:16and also selection of the most variable genes
- 22:20using the existing pipeline.
- 22:22We consider the top 16 most variable genes.
- 22:29And we also consider four other existing algorithms,
- 22:34including BayesSpace, stLearn, Seurat and Giotto
- 22:40as the competing algorithms.
- 22:42And we use the default parameters for each of them.
- 22:49Here it shows the results,
- 22:52and top left figure shows the manual annotation
- 22:54provided by the author in the paper.
- 22:58And you can see the nice, five spatial clusters
- 23:03from inside out.
- 23:05And also there you can see
- 23:08that there is one, narrow cell cluster
- 23:11corresponding to the number four.
- 23:16Here we show the results for SPRUCE,
- 23:18BayesSpace, stLearn, Seurat and Giotto.
- 23:24And in this case, the network-based approaches,
- 23:28including stLearn, Seurat and Giotto,
- 23:32all showed lower performance compared to the other algorithms.
- 23:38BayesSpace showed relatively higher performance,
- 23:42with an ARI of about 0.55.
- 23:46SPRUCE further improved the performance
- 23:49compared to the BayesSpace.
- 23:52And one thing I noted here is the...
- 23:58The narrow cell cluster
- 24:00could be identified by SPRUCE,
- 24:04which is interesting.
- 24:06And as the second example.
- 24:10So first one is the more labeled data.
- 24:13We can compare our prediction to the existing annotation.
- 24:17And to further demonstrate the application of the SPRUCE
- 24:22to unlabeled data, we analyze the publicly available
- 24:26human invasive ductal carcinoma breast tissue.
- 24:31Again using the 10X Visium platform.
- 24:36And we essentially followed the similar workflow
- 24:38and we identified the top 16 most spatially variable genes.
- 24:45And those included several tumor associated antigens,
- 24:50TAAs, including GFRA1 and CXCL14.
- 24:56And also there is a tumor suppressor gene,
- 25:00like MALAT1.
- 25:04And we use the SPRUCE to identify the 5 sub regions
- 25:10using these 16 features.
- 25:12This shows the 16 most variable genes.
- 25:16And you can see that there are very clear spatial patterns.
- 25:22For example the CXCL14 and GFRA1,
- 25:28are expressed on the bottom right side,
- 25:30while MALAT1 is expressed higher on the top left side.
- 25:38And this is the cluster prediction
- 25:42made by the SPRUCE algorithm.
- 25:46And you can see that it identified
- 25:48cluster 2, which highly coincides with CXCL14
- 25:55and GFRA1 with a study on.
- 25:59(mumbles)
- 26:01While cell cluster 1
- 26:05is the MALAT1,
- 26:09which is more tumor suppressor.
- 26:13So here we can see that the SPRUCE can identify
- 26:17the different group of the tissue architecture,
- 26:20such as the tumor suppressor and then tumor related
- 26:25(mumbles)
- 26:33And we can also easily look at the
- 26:37within cluster expression pattern
- 26:40and gene-gene correlation.
- 26:43As you could see earlier,
- 26:44cell cluster 2, which corresponds to the right,
- 26:49is higher in GFRA1 and CXCL14.
- 26:52Cluster 1, which corresponds to here, is high in MALAT1
- 26:57and so on.
- 26:58And also, in the case of cell cluster 2,
- 27:02there's a very strong gene-gene correlation pattern.
- 27:06So this supports the proposed model, which considers
- 27:11spatial pattern and also gene-gene correlation
- 27:14simultaneously.
- 27:20So,
- 27:21so far I discussed the method
- 27:26for our SPRUCE and its application.
- 27:33And that we essentially expanded our work a little bit more
- 27:38to the MAPLE,
- 27:39which is the multi-sample spatial transcriptomics model.
- 27:44Why do we care about the multi-sample analysis of HST data?
- 27:49So currently most algorithms are designed in a way
- 27:53that they focus on a single sample.
- 27:57But even intuitively,
- 27:59joint analysis of the multiple HST data
- 28:03can potentially boost the signal
- 28:06by sharing the information amongst samples.
- 28:09And also the joint analysis of the different samples
- 28:13can allow the differential analysis of the HST data.
- 28:18So very often, each tissue is not our main interest.
- 28:24But we also want to compare tissue architecture
- 28:27between the different samples.
- 28:30For example, between the disease group versus the controls,
- 28:35responders versus non-responders to certain treatments,
- 28:39such as the cancer immuno-therapy.
- 28:41So to address this limitation, we proposed MAPLE.
- 28:47And actually our existing SPRUCE framework
- 28:50already allows this one naturally.
- 28:55So, simply what it can do is
- 28:57instead of now analyzing each sample individually,
- 29:01we can jointly analyze all the samples together.
- 29:05And then by doing so,
- 29:06we can share information
- 29:08about the modeling of each cell spot cluster,
- 29:13and also their spatial pattern.
- 29:17And by introducing the sample-level covariate x i
- 29:22in the cell type composition,
- 29:27we can see the impact
- 29:29of the different sample-level covariate.
- 29:33Which I show more in detail in the coming slides.
- 29:41So the first application is the same mouse brain data,
- 29:45the human brain data...
- 29:47Sorry this should be the mouse brain,
- 29:49and here we see the two anterior parts,
- 29:54which look very similar.
- 29:56And then as you can see here,
- 29:57when we jointly analyze the two samples
- 30:01corresponding to the same part of the brain,
- 30:04it nicely identifies the corresponding parts
- 30:08between the two samples.
- 30:10Like one in the end, three on the top,
- 30:14five at the bottom and so on.
- 30:17And because this is the Bayesian framework,
- 30:21it can also provide uncertainty measures
- 30:25about our clustering prediction.
- 30:28And as you can see, usually it is more uncertain
- 30:31about the clustering prediction
- 30:35around the boundary between different cell clusters.
- 30:38Which kind of makes sense,
- 30:40because we expect that maybe cell type
- 30:43might be more mixed together in the same cell spot.
- 30:48Also, there are some cell clusters
- 30:50with the higher level of the uncertainty
- 30:53of which we are still trying to understand more
- 30:56at this point.
- 30:58And this kind of figure shows the...
- 31:01the utility of this kind of joint analysis.
- 31:06So, for identifiability,
- 31:09we set the first cell cluster as the reference.
- 31:14And then here we see the two (mumbles)
- 31:16The top one shows the intercept,
- 31:20and then we can interpret this one as the relative size
- 31:25of each cell cluster.
- 31:26So then compared to the one,
- 31:29we can say three and the six are larger.
- 31:32So the three and the six are larger, compared to the one.
- 31:36while four is smaller,
- 31:39well, just smaller compared to the one.
- 31:40So this is what you can see by eye
- 31:45from the tissue prediction region.
- 31:48But good thing is that this model allows us to quantify,
- 31:52what you see by eye.
- 31:55And what is more interesting is the second one.
- 31:57So this one,
- 31:58is about the difference between the two samples.
- 32:02So again,
- 32:04so basically if it's higher,
- 32:07then it means that a certain tissue spot cluster
- 32:12is getting bigger in the second sample.
- 32:14And if it's lower, it means it is kind of smaller
- 32:18in the second sample and so on.
- 32:20So in this way,
- 32:21we can quantify the change of the tissue architecture
- 32:26between different cell clusters.
- 32:30And another interesting example is this one.
- 32:35So here, in addition to the two anterior samples,
- 32:39we now also look at the posterior sample as well.
- 32:44So because this is two parts of the brain
- 32:48anterior and the posterior,
- 32:50the tissue is kind of continuous between the two.
- 32:53And as you can see here,
- 32:54cell cluster three is connected to the posterior side here.
- 33:00Cell cluster one is connected to here and so on.
- 33:05And then this kind of pattern is not clear
- 33:08if you analyze each data independently.
- 33:12And our MAPLE framework nicely captures
- 33:16such kind of sharing pattern.
- 33:18And also the difference pattern
- 33:20between the different samples, interestingly.
- 33:23So at this point,
- 33:25we are working on more simulation study
- 33:27and the real data analysis
- 33:29to further show the performance
- 33:33and understand the properties of the MAPLE at this point.
- 33:39So then I can summarize my presentation today.
- 33:45So the high throughput spatial transcriptomics, or HST,
- 33:50provides unprecedented opportunities
- 33:54to investigate novel biological hypotheses,
- 33:57such as the tumor microenvironment and certain structure
- 34:05about the human brain and Alzheimer's,
- 34:08and so on.
- 34:10And here we propose SPRUCE,
- 34:13a Bayesian multivariate mixture model
- 34:16for HST data analysis.
- 34:19SPRUCE has multiple strengths
- 34:23including the novel combination
- 34:26of the skew-normal density,
- 34:29Polya-Gamma data augmentation,
- 34:31and spatial random effect.
- 34:35Altogether, it allows us to
- 34:37precisely infer spatially correlated mixture component
- 34:41membership probabilities.
- 34:44In our simulation study and real data analysis,
- 34:49we could see that SPRUCE outperforms the existing methods
- 34:53in the tissue architecture identification.
- 34:56And finally, our recent extension, MAPLE,
- 35:01allows the joint clustering and differential analysis
- 35:05of multiple HST data.
- 35:09So at this point SPRUCE is under review in,
- 35:13(mumbles)
- 35:15in Biometrics.
- 35:17The corresponding manuscript is available on bioRxiv.
- 35:22And there is multiple ongoing work
- 35:25regarding the HST data modeling
- 35:28in our lab.
- 35:30So we are actually currently working on further improving
- 35:35the SPRUCE and the MAPLE
- 35:37by incorporating other characteristics
- 35:39of the HST data, such as the relationships among cells.
- 35:44For example,
- 35:46we know that there are some ligands and receptors,
- 35:50for example.
- 35:51Which we expect that they interact with each other
- 35:55in their cell structure.
- 35:57And then by incorporating different prior information,
- 36:01we can further improve the SPRUCE and MAPLE.
- 36:06We are also working on the other statistical models
- 36:10for somewhat relevant, but different tasks.
- 36:14For example,
- 36:15currently we are also working on the deep learning framework,
- 36:19especially the graph neural network,
- 36:22which is called RESEPT.
- 36:24And then using the deep learning framework,
- 36:28we tried to come up with a good embedding
- 36:30of the HST gene expression pattern.
- 36:34Our current results show that such a combination
- 36:38of deep learning and the statistical modeling approach
- 36:41can provide nice prediction performance.
- 36:47For this purpose, we developed a framework called RESEPT,
- 36:52and the corresponding bioRxiv preprint
- 36:55is also available publicly.
- 36:57And the corresponding paper
- 36:59is now under revision at Nature Communications.
- 37:06Regarding cell-cell communications,
- 37:09using network-based approaches has some benefit
- 37:12because the cell-cell communication can be nicely
- 37:16and naturally modeled using a graph or network.
- 37:21So we have the parallel work called Banyan
- 37:25to identify the cell-cell communication
- 37:27and tissue architecture using the network-based approaches.
- 37:31And finally, there are multiple efforts experimentally
- 37:37to generate the spatial multimodal data.
- 37:42For example,
- 37:43ECCITE-seq, such as the single cell genomics,
- 37:48proteomics and the T-cell receptor at the same time.
- 37:53And very soon,
- 37:54everything is expected to be combined
- 37:57as the spatial transcriptomic structure.
- 38:01We are working on the direction
- 38:03to develop the statistical model
- 38:06for integration of the HST data with other matched data.
- 38:13So I would like to acknowledge my research team at OSU.
- 38:18Carter Allen is the main driver of this project,
- 38:23and also my PhD student.
- 38:27Qin Ma and Yuzhou Chang are my close collaborators
- 38:32for the HST data modeling project.
- 38:36And Zihai Li,
- 38:38who is the director of the Immuno-Oncology Institute
- 38:42and also the expert in cancer.
- 38:46Won Chang at the University of Cincinnati
- 38:49who is the spatial statistics expert,
- 38:53and MUSC collaborator Brian Neelon
- 38:56and my grant support.
- 39:00So, and this is the end of my presentation,
- 39:03and you can find my manuscript
- 39:05and the software from the link here.
- 39:10If you have any questions or comments,
- 39:13please let me know by email at chung.911@osu.edu.
- 39:17So thank you for your attention.
- 39:29<v ->So thank you.</v>
- 39:32Do we have any questions from the audience in the classroom,
- 39:36or from the audience on zoom?
- 39:43<v ->Can I ask a question?</v>
- 39:45Can you hear me?
- 39:46<v ->Yes, mm-hm.</v>
- 39:47<v ->Right, Dongjun welcome back.</v>
- 39:50Great work, it's a nice presentation.
- 39:52I'm just wondering, like,
- 39:53when you do this from your own experience
- 39:56on the cell clustering,
- 39:57how much the spatial information contributes
- 40:00to the clustering.
- 40:03<v ->Sure.</v>
- 40:09So,
- 40:10(mumbles)
- 40:15If you're here,
- 40:16so if you look at the Seurat workflow,
- 40:18you can see there's still a lot of the, kind of,
- 40:22local boundary between different cell spot clusters.
- 40:28And when you analyze the same data using the SPRUCE,
- 40:32you can see much cleaner boundary.
- 40:34And often it will coincide with the
- 40:36expert analogy annotation.
- 40:40So given that, there is a significant contribution.
- 40:46Of course, even with only the gene expression,
- 40:49we still get some big picture, as you can see here.
- 40:52But spatial information provides much cleaner prediction
- 40:57about the tissue architecture in general.
- 41:01<v ->I see.</v>
- 41:02And also the skewness.
- 41:05Do you estimate that, or is that, like,
- 41:08pre-specified, the skewness?
- 41:12<v ->You mean which one?</v>
- 41:14<v ->The alpha k in the model.</v>
- 41:16In your model specification, the alpha k you have there.
- 41:19I missed that part.
- 41:20Like, do you need to specify the skewness?
- 41:24<v ->Oh, it's learned from data.</v>
- 41:26<v ->Oh, I see.</v>
- 41:27But from the data, how skew?
- 41:29I mean, just in terms of how stable
- 41:31that alpha k can be estimated.
- 41:36<v ->So maybe I can answer it in two different ways.</v>
- 41:40So if there is this skewness in the data, I think yes.
- 41:44So I would say it depends on how we process the data as well.
- 41:49So usually there are three different approaches
- 41:51to model the HST data, including spatially variable genes.
- 41:57And as you can see here,
- 41:58or people use the principal components,
- 42:02or people use deep learning
- 42:05as the embedding step.
- 42:08If you use deep learning or PCA,
- 42:12it's more likely symmetric in the real data.
- 42:16If you consider the spatially variable genes,
- 42:20we often have the skewness, as you can see here.
- 42:24And then regarding your question, overall it works well.
- 42:30I don't have the exact quantification, but it works well.
- 42:34It's estimated stably in most cases.
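As a rough illustration of the symmetric-versus-skewed contrast described in this answer, the sketch below computes a moment-based sample skewness for a single feature. It is not how SPRUCE estimates alpha k; the function name `sample_skewness` and the toy values are assumptions made only for this example.

```python
def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5 (illustrative helper, not from SPRUCE)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mu) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant feature: treat as having no skew
    return m3 / m2 ** 1.5


# Toy features: one symmetric (as an embedding coordinate might be),
# one right-skewed (as a gene-level expression feature might be).
symmetric_feature = [-1.0, 0.0, 1.0]
skewed_feature = [0.0, 0.0, 0.0, 10.0]

print(sample_skewness(symmetric_feature))
print(sample_skewness(skewed_feature))
```

A value near zero suggests a symmetric feature, while a clearly positive value suggests a long right tail, which is the situation where a skewed mixture component may be worth considering.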
- 42:37<v ->Yeah, I read the BayesSpace paper.</v>
- 42:39They seem to be working on the principal components, right?
- 42:41They do not work on individual genes, right?
- 42:43<v ->No, yeah.</v>
- 42:44They base this on the PCA.
- 42:46<v ->Yeah, that's why it was puzzling me</v>
- 42:47why you're doing that.
- 42:48But anyway, yeah.
- 42:49Thank you.
- 42:50<v ->Yeah so, so...</v>
- 42:51(mumbles)
- 42:54so they mainly target the PCA.
- 42:55So they only consider the multivariate t distribution.
- 42:59And also because of the same reason,
- 43:01their covariance matrix is less dense.
- 43:05<v ->I see.</v>
- 43:08Thank you.
- 43:09<v ->Thank you.</v>
- 43:22<v ->Do we have any questions from students in the classroom?</v>
- 43:32<v ->Wait, can I ask another question?</v>
- 43:36So, towards the end,
- 43:37you mentioned you tried
- 43:38to look at the cell-cell communication.
- 43:45That part.
- 43:46I'm very interested in that
- 43:49From our experience on the single cell spatial data are...
- 43:55Are you talking about you're learning from the single cell,
- 43:58or the spatial single cell?
- 44:02<v ->So, regarding the cell-cell communication</v>
- 44:05it's still very ongoing research at this point.
- 44:10I mean, not just our side but in general.
- 44:13Because most of the cell-cell communication prediction
- 44:17is based on a database.
- 44:20So it is based on, like, a ligand-receptor
- 44:22pairing database, and checking
- 44:25the corresponding expression in the corresponding spot
- 44:28or the cell.
- 44:29And then by checking the corresponding pair
- 44:33of expression patterns
- 44:34between the ligand and the receptor,
- 44:36they want to model cell-cell communication.
- 44:40It's not perfect, as you know,
- 44:42because it's like a computer.
- 44:45If you look at ChIP-seq, it's almost like
- 44:48(mumbles)
- 44:48but more like motif analysis.
- 44:51So there's some limitation,
- 44:53but it's a more likely general limitation at this point.
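The database-driven approach described in this answer can be sketched as follows: take a list of ligand-receptor pairs and, for each pair of neighboring spots, multiply the ligand expression in the sender by the receptor expression in the receiver. This is a minimal sketch of one simple scoring rule, not the speaker's method or any specific tool's algorithm. The gene names `LIG_A` and `REC_A`, the function `lr_score`, and all expression values are hypothetical.

```python
from statistics import mean


def lr_score(expr, neighbors, ligand, receptor):
    """Average ligand(sender) * receptor(receiver) over directed neighbor pairs."""
    products = [
        expr[sender].get(ligand, 0.0) * expr[receiver].get(receptor, 0.0)
        for sender, receiver in neighbors
    ]
    return mean(products) if products else 0.0


# Hypothetical normalized expression per spot (toy values).
expr = {
    "spot1": {"LIG_A": 2.0, "REC_A": 0.0},
    "spot2": {"LIG_A": 0.0, "REC_A": 3.0},
    "spot3": {"LIG_A": 1.0, "REC_A": 1.0},
}

# Directed (sender, receiver) pairs for spatially adjacent spots.
neighbors = [
    ("spot1", "spot2"),
    ("spot2", "spot1"),
    ("spot2", "spot3"),
    ("spot3", "spot2"),
]

# Stand-in for a curated ligand-receptor pairing database.
lr_pairs = [("LIG_A", "REC_A")]

scores = {pair: lr_score(expr, neighbors, *pair) for pair in lr_pairs}
print(scores)
```

A score like this only reflects co-expression of database-listed pairs in adjacent spots, which is the limitation raised in the answer: it is indirect evidence of communication and is sensitive to noisy ligand and receptor measurements.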
- 44:57<v ->Yeah,</v>
- 44:58I'm asking because we've been looking
- 44:59at some of the spatial single cell data
- 45:02that were too noisy for the ligand
- 45:04and receptor gene expression levels.
- 45:07Just couldn't make it too far.
- 45:09(mumbles)
- 45:11But for a single cell, may be different?
- 45:13I mean, probably there'll be more that, like...
- 45:17<v ->Yeah, totally agree.</v>
- 45:18I mean, so if you go to high resolution,
- 45:22it's very noisy,
- 45:24so very often we need to do some simplification.
- 45:28Like looking at multi-modal or the cell cluster,
- 45:31rather than the cell.
- 45:34There are still multiple experimental limitations
- 45:38at this point.
- 45:39(mumbles)
- 45:40Thank you.
- 45:51(class teacher addresses classroom)
- 46:00<v ->On the data from multiple samples</v>
- 46:03So, if we have samples from...
- 46:06(mumbles)
- 46:18<v ->Oh yeah, that's a very good question.</v>
- 46:22So,
- 46:23actually we can answer in the two different ways.
- 46:29In some sense,
- 46:30good pre-processing is still important
- 46:35because it still depends on the expression patterns.
- 46:43But still regarding the differences
- 46:46between the different tissues.
- 46:48If there is a big difference,
- 46:49it can still detect the difference
- 46:51between the different samples.
- 46:54So, it can detect spots.
- 46:55But still, the main goal is more
- 46:59for similar types of tissue.
- 47:01If it's too different,
- 47:02maybe it's a different research project.
- 47:05So, for example,
- 47:07here our target is more about, for example,
- 47:10like the same breast tissue,
- 47:13but with different responder and non-responder groups,
- 47:18for example.
- 47:19Or like the same lung tissue, but tumor versus non-tumor,
- 47:24and so on.
- 47:25If it's like human and mouse,
- 47:29then it might be somewhat different story,
- 47:33which might need much more work.
- 47:38<v ->Do we have any more questions here?</v>
- 47:58Okay, do we have any questions
- 47:59from the audience on Zoom?
- 48:21Okay, so it looks like we don't have any more questions.
- 48:26So Dr. Chung, thank you again for your nice presentation.
- 48:31Look forward to meeting in person sometime soon.
- 48:36<v ->And then thank you again Wei and Hongyu</v>
- 48:38for the invitation
- 48:40and it's great to come back, although virtually.
- 48:43And I hope to see you again.
- 48:46<v ->We'll come by in person.</v>
- 48:49<v ->Hopefully someday soon.</v>
- 48:53Okay, thank you.