YSPH Biostatistics Seminar: "SPRUCE and MAPLE: Bayesian Spatial Multivariate Mixture Model for High Throughput Spatial Transcriptomics Data"

December 01, 2021
  • 00:01<v ->College of Medicine and he is a member</v>
  • 00:04of the Pelotonia Institute for Immuno-Oncology
  • 00:08of Ohio State University as a candidate and a member.
  • 00:12His research focuses on
  • 00:15(mumbles)
  • 00:18for integrative analysis of genetic and genomic data
  • 00:22with biomedical real data.
  • 00:26So welcome back Dongjun Chung.
  • 00:29(audience member claps)
  • 00:32<v ->Okay.</v>
  • 00:34Thank you Wei, for the kind introduction
  • 00:37and it's so great to come back.
  • 00:39Although it's all virtual.
  • 00:43I hope someday we can see in person.
  • 00:46So today I will discuss our recent project
  • 00:52about the SPRUCE and MAPLE: Bayesian Multivariate
  • 00:57Mixture Models for Spatial Transcriptomics Data.
  • 01:01Oh, by the way, can you hear me well?
  • 01:03<v ->Ah yes, we can hear you.</v>
  • 01:05<v ->Okay, great.</v>
  • 01:07So, let me start with a quick introduction
  • 01:12to single cell genomics.
  • 01:16So in some sense,
  • 01:17we can say that the last decade was the era of single cell
  • 01:21genomic experiments.
  • 01:24So it changed science in many ways.
  • 01:26And also a great amount of the data has been generated
  • 01:32using the single cell genomic technology.
  • 01:36Single cell genomic experiments
  • 01:38provide high-dimensional data at the cell level.
  • 01:42By doing so,
  • 01:44it allows us to investigate cellular heterogeneity
  • 01:48within each subject or the patient
  • 01:52which was not possible previously
  • 01:54with bulk genomic data,
  • 01:56which means genomic data collected at the tissue level.
  • 02:04So some kind of standard visualization
  • 02:09of the single cell genomic data is called a UMAP.
  • 02:13And here,
  • 02:14this UMAP shows the distribution of the different clusters
  • 02:18in the tumor,
  • 02:20including the different immune cell type.
  • 02:25And in this way,
  • 02:27we can interrogate different types
  • 02:29of the immune cell composition.
  • 02:32And also there,
  • 02:33we can look at what kind of gene features
  • 02:37emerge for each cell cluster.
  • 02:40One of the recent (mumbles)
  • 02:43is the emergence of the high-throughput
  • 02:46spatial transcriptomics or the HST technology.
  • 02:51So, with the emergence of the HST technology,
  • 02:56we do not only look at the gene expression
  • 02:59at the cell level or the close-to-cell level.
  • 03:03We now also have the corresponding
  • 03:06spatial information.
  • 03:08The figure at the bottom shows one example.
  • 03:12And here it shows the mouse brain tissue,
  • 03:16and each circle
  • 03:19here corresponds to one spot,
  • 03:21which is a group of a small...
  • 03:24small number of cells, like two to ten at most.
  • 03:29And the color here indicates the expression level of different genes.
  • 03:34So the left one corresponds to the Hpca gene.
  • 03:39The right one corresponds to the Ttr gene, for example.
  • 03:48And with the HST data,
  • 03:50we can do a lot of interesting science
  • 03:54to improve the parity in current medication.
  • 03:57So for example,
  • 03:59we can now look at the spatial information
  • 04:01of the tissue architecture at the transcriptomics level.
  • 04:07And then we can also investigate
  • 04:09the cell-cell communication with the spatial information
  • 04:13in our hand.
  • 04:15So the figure at the bottom left shows the UMAP.
  • 04:19And here,
  • 04:20the different color indicates a different cell cluster.
  • 04:24And if you look at the figure on the right,
  • 04:27then you can see that they cluster
  • 04:30in a meaningful way on the tissue.
  • 04:33So in this way, we do not only look at the different cell types
  • 04:36within a tissue,
  • 04:38but also look at their spatial information at the same time.
  • 04:47And there are many exciting applications
  • 04:49of the HST experiment, including the neuroscience.
  • 04:57Including the brain cancer study such as the immuno-oncology
  • 05:02and the developmental biology
  • 05:04which looks at the changes of the cellular composition
  • 05:08across the different stage of the development.
  • 05:12And here I specifically discuss the application
  • 05:16in the cancer, especially the tumor microenvironment.
  • 05:20And with the spatial information,
  • 05:22we can now study the location of the immune cells
  • 05:27and the tumor cells in the tumor tissue.
  • 05:31We can also interrogate implication of distance
  • 05:35on the tissue and their corresponding density.
  • 05:39And we can also study the distribution
  • 05:42of the immune regulator.
  • 05:45And finally, the special spatial patterns
  • 05:49such as the tertiary lymphoid structure.
  • 05:56Then from the statistical point of view,
  • 05:59what does the HST data look like?
  • 06:05The first observation is that the HST data reflect the spatial structure
  • 06:10of the tissue architecture in a meaningful way.
  • 06:13So as I discussed earlier,
  • 06:16we can see that similar types of cell clusters
  • 06:19are often located in close proximity in the tissue.
  • 06:27And even after we exclude such kind of cell composition
  • 06:32in the spatial location,
  • 06:35we can start to see some spatial pattern in the expression
  • 06:38on the tissue.
  • 06:40So the figure on the top shows the expression pattern
  • 06:43of the three genes,
  • 06:46PCP4, MBP and MT-CO1,
  • 06:50After regressing out, with respect to the cell clusters.
  • 06:56And as you can see, even after considering
  • 06:58the cell cluster patterns,
  • 07:00you can start to see some interesting spatial patterns.
  • 07:05And the figure at the bottom shows the distribution
  • 07:09of each gene for each cell cluster.
  • 07:14And you can see that sometimes it's symmetric,
  • 07:17but also often we can see non-symmetric
  • 07:22and skewed distributions for each gene.
  • 07:27So these are some of the key features
  • 07:30of the HST data we want to consider
  • 07:33in the modeling of the HST data.
  • 07:37So if I briefly summarize,
  • 07:39gene expression outcomes feature complex correlations
  • 07:43such as the spatial correlation,
  • 07:46and also gene-gene correlation,
  • 07:49which mainly reflects the biological pathways.
  • 07:52Spatial structure can be
  • 07:55(mumbles)
  • 07:56cellular clustering and gene expression patterns.
  • 08:00And gene expression densities
  • 08:02often feature skewness and/or heavy tails
  • 08:06due to outlier cell spots.
  • 08:09So ideally we seek to provide a model
  • 08:13for identifying the tissue architecture
  • 08:16while accommodating these challenging features.
  • 08:24So, especially during the last two years,
  • 08:28several statistical methods have been proposed
  • 08:32to model HST data.
  • 08:34And still many of them are network-based approaches.
  • 08:39Partially because of Seurat, the very famous package
  • 08:43for the single cell genomic data analysis.
  • 08:46And network-based approach has been proven
  • 08:49to be powerful in this context.
  • 08:51So based on that, multiple network-based approaches
  • 08:56have been proposed, including Giotto, Seurat and stLearn.
  • 09:03As for the statistical models,
  • 09:07recently BayesSpace was proposed by the group of the
  • 09:12(mumbles)
  • 09:13at the Fred Hutchinson.
  • 09:15And essentially,
  • 09:16it uses a multivariate-t mixture model
  • 09:21to cluster cell spots.
  • 09:24It implements spatial smoothing of clusters
  • 09:27via a Potts model prior on cluster labels.
  • 09:32And interestingly,
  • 09:34they try to predict sub-spots to increase the resolution.
  • 09:41In spite of such interesting features,
  • 09:44it has also some number of drawbacks.
  • 09:47For example,
  • 09:49it assumes the symmetry of the gene expression densities,
  • 09:52and it also relies on the approximate inference.
  • 09:58And here our goal is to develop a statistical model
  • 10:03that overcomes these limitations
  • 10:06and also provide the optimal tissue architecture prediction
  • 10:11using the HST data which we call SPRUCE
  • 10:18or the spatial random effects-based clustering
  • 10:20of the single cell data.
  • 10:30So this is our SPRUCE model.
  • 10:35So here we use the i as the index for the cell spot
  • 10:40in the tissue sample.
  • 10:42And then we denote y i
  • 10:45as the length-g gene expression vector for spot i.
  • 10:50And based on the y i, we assume a finite mixture model
  • 10:56of the form.
  • 10:59So here we assume K mixture components,
  • 11:03or the cell spot clusters.
  • 11:06Theta k indicates the set of the parameters
  • 11:10specific to mixture component k.
  • 11:13Pi k is the probability of the spot i
  • 11:17belonging to the component k.
  • 11:22We further introduce z1 to zn,
  • 11:27which are the latent mixture component indicators
  • 11:31for each spot.
  • 11:33And z i can take a value from one to K.
  • 11:37And as I mentioned earlier,
  • 11:40skewness and gene-gene correlation
  • 11:42are key features of the HST data.
  • 11:47So to account for skewness and gene-gene correlation,
  • 11:51we assume a multivariate skew-normal distribution
  • 11:56with these parameters.
  • 11:59So the first one indicates the mean vector for spot i,
  • 12:03and alpha k indicates gene-specific skewness parameters
  • 12:07for mixture component k.
  • 12:10And omega k is the g by g scale matrix that captures correlation
  • 12:15among the gene expression feature in the component k.
  • 12:24And then we further represent MSN distribution
  • 12:28using a convenient conditional representation.
  • 12:31We use mu k for the mean of component k,
  • 12:36phi i for the spatial effect,
  • 12:38and t i and ksi k for the component-specific skewness
  • 12:44of each gene.
  • 12:47Epsilon i for the multivariate normal error.
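Written out, the mixture model and the conditional representation described above might look as follows. This is a sketch reconstructed from the spoken description only: the truncated (half) normal distribution for t i and the name Sigma k for the error covariance are assumptions taken from the standard stochastic representation of the skew-normal, not read from the slides.

```latex
f(y_i) = \sum_{k=1}^{K} \pi_k \, f(y_i \mid \theta_k),
\qquad
y_i \mid z_i = k,\ t_i,\ \phi_i \;=\; \mu_k + \phi_i + t_i \, \xi_k + \varepsilon_i,
\qquad
t_i \sim \mathrm{N}_{[0,\infty)}(0, 1), \quad
\varepsilon_i \sim \mathrm{N}_g(0, \Sigma_k)
```

Conditional on t i, the outcome is multivariate normal, which is what makes this representation convenient for Gibbs sampling.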
  • 12:53And then in order to further accommodate spatial dependence,
  • 12:58we used the multivariate intrinsic
  • 13:00conditionally autoregressive,
  • 13:02or the CAR prior for phi i.
  • 13:05So essentially,
  • 13:08given all the spots except for spot i,
  • 13:12we model phi i as a normal distribution
  • 13:17with the mean of its neighbors.
  • 13:21And with the covariance matrix denoted as the lambda.
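The conditional distribution described here can be sketched in a few lines of Python. This is an illustration, not the SPRUCE implementation: the function name `car_conditional`, the neighbor dictionary, and the toy values are hypothetical, and scaling the covariance Lambda by the number of neighbors is an assumption taken from the usual intrinsic CAR form.

```python
import numpy as np

def car_conditional(phi, neighbors, i, Lambda):
    """Conditional mean and covariance of phi[i] given all other spots.

    Assumes the usual intrinsic CAR form: the mean is the average of the
    neighboring spatial effects and the covariance is Lambda / m_i,
    where m_i is the number of neighbors of spot i.
    """
    nbrs = neighbors[i]
    m_i = len(nbrs)
    cond_mean = phi[nbrs].mean(axis=0)   # average over neighboring spots
    cond_cov = Lambda / m_i              # more neighbors, tighter prior
    return cond_mean, cond_cov

# Toy example: 4 spots on a line, g = 2 genes (hypothetical values).
phi = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
Lambda = np.eye(2)

mean_1, cov_1 = car_conditional(phi, neighbors, 1, Lambda)
```

For spot 1, the conditional mean is the average of the effects at spots 0 and 2.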
  • 13:33And as you saw earlier,
  • 13:35we see two different levels of spatial patterns.
  • 13:40One is the spatial pattern of the cell clustering.
  • 13:44And another one is the spatial pattern
  • 13:46of the gene expression.
  • 13:48So for the spatial pattern of the cell clusters,
  • 13:53we want to allow the probability of pi
  • 13:58of belonging to each mixture component.
  • 14:01Also to vary spatially as well.
  • 14:03So in order to do so,
  • 14:05we extend the model I showed previously
  • 14:09using the pi i k,
  • 14:12which is the i specific.
  • 14:14And then here we modeled this one as the sigmoid
  • 14:18of the two parameters.
  • 14:21The first one is the intercept
  • 14:23for the baseline propensity of the membership
  • 14:28into component k shared by all cell spots.
  • 14:32And the second term indicates the spatial random effects
  • 14:35allowing the variation about the intercept.
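A minimal sketch of spot-specific mixing weights built from an intercept plus a spatial random effect is shown below. It assumes a softmax (multinomial logit) link with the first component as the reference; the names `mixing_weights`, `delta` and `psi` and the toy numbers are hypothetical and not taken from the SPRUCE code.

```python
import numpy as np

def mixing_weights(delta, psi):
    """Spot-specific membership probabilities pi[i, k].

    delta: length-K intercepts (baseline propensity of each component).
    psi:   n x K spatial random effects (variation about the intercepts).
    """
    eta = delta[None, :] + psi                   # n x K linear predictor
    eta = eta - eta.max(axis=1, keepdims=True)   # stabilize the exponentials
    w = np.exp(eta)
    return w / w.sum(axis=1, keepdims=True)

# Toy example: 4 spots, 3 components; the first component is the reference.
delta = np.array([0.0, 1.0, -1.0])
psi = np.zeros((4, 3))
psi[0, 1] = 2.0   # spot 0 is pushed toward component 2 by its spatial effect

pi = mixing_weights(delta, psi)
```

Spots with a zero random effect share the baseline weights, while spot 0 gets a higher probability for the component its spatial effect favors.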
  • 14:42And again,
  • 14:43to introduce the spatial association
  • 14:46into the component membership model,
  • 14:49we further assume the univariate intrinsic CAR prior.
  • 14:53As you can see here.
  • 14:55And here, one computational challenge,
  • 15:01if you're interested, is this form.
  • 15:04It does not allow us to...
  • 15:06It does not provide a closed-form posterior distribution,
  • 15:10which prevents Gibbs sampling.
  • 15:12And in order to address this computational challenge,
  • 15:17we extended our model
  • 15:20using the results from Polson et al. in 2013, JASA,
  • 15:25on Polya-Gamma data augmentation to allow for Gibbs sampling
  • 15:30of the mixing weight model parameters.
  • 15:34And essentially,
  • 15:34we could assume that this can be represented
  • 15:38as the Polya-Gamma data augmentation.
  • 15:42And by doing so,
  • 15:43everything can be implemented as the Gibbs sampler.
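For reference, the generic identity from Polson et al. (2013) behind this augmentation can be stated as follows. How a and b map onto the SPRUCE mixing weight model is not spelled out in the talk, so only the general form is given here.

```latex
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  = 2^{-b} \, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2} / 2} \, p(\omega) \, d\omega,
\qquad \kappa = a - \tfrac{b}{2}, \quad \omega \sim \mathrm{PG}(b, 0)
```

Conditional on the latent omega, the logistic likelihood is Gaussian in psi, and the conditional of omega given psi is PG(b, psi), so both updates are available in closed form.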
  • 15:49In the case of the further outliers or heavy-tails,
  • 15:53we can even further extend the model
  • 15:56to the multivariate skew-t distribution
  • 15:59that you can see here.
  • 16:00Which can be very easily implemented
  • 16:03given the existing model.
  • 16:07To complete our model specification,
  • 16:10we use weakly informative priors,
  • 16:14and also conjugate priors.
  • 16:16And by using these conjugate priors,
  • 16:19we can do everything using a full Gibbs sampler
  • 16:23in closed form,
  • 16:24which provides fast computation.
  • 16:29And some additional consideration.
  • 16:33So here,
  • 16:34one question is the optimal K,
  • 16:38or the number of cell spot clusters.
  • 16:41So for this purpose,
  • 16:42we use Bayesian model selection approaches,
  • 16:46and specifically we use the WAIC,
  • 16:49or the widely applicable information criterion.
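As an illustration of how WAIC can be computed from MCMC output, here is a small sketch. It uses the common definition on the deviance scale with the variance-based penalty; the function name `waic` and the input layout (posterior draws by spots) are assumptions, not the SPRUCE package API.

```python
import numpy as np

def waic(log_lik):
    """WAIC on the deviance scale (lower is better).

    log_lik: S x n array with log p(y_i | parameters at draw s),
             for S posterior draws and n cell spots.
    """
    S = log_lik.shape[0]
    # Log pointwise predictive density, via a stable log-mean-exp over draws.
    mx = log_lik.max(axis=0)
    lppd = np.sum(mx + np.log(np.exp(log_lik - mx).sum(axis=0)) - np.log(S))
    # Effective number of parameters: posterior variance of the log-likelihood.
    p_waic = np.sum(log_lik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)

# Toy check: if the log-likelihood does not vary across draws, the penalty
# is zero and WAIC reduces to -2 times the summed log-likelihood.
constant = np.full((5, 3), -1.0)
value = waic(constant)
```

In practice one would fit the model for several candidate values of K and keep the one with the smallest WAIC.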
  • 16:55In Bayesian mixture models, it's very common
  • 16:57to observe the label switching problem.
  • 17:00So to protect against the label switching issue
  • 17:03in the MCMC sampler, we use the canonical projection of z
  • 17:08using Peng and Carvalho, in 2016.
  • 17:13And finally for the actual implementation,
  • 17:17we use Rcpp
  • 17:19to further improve the computation efficiency.
  • 17:27We implemented the proposed model as an R package, SPRUCE,
  • 17:33and it's currently available from our GitHub page.
  • 17:38Here.
  • 17:40And the figure shows our GitHub page.
  • 17:45When we developed our software,
  • 17:49one of the popular software tools for pre-processing
  • 17:53and analyzing the HST data
  • 17:57was the Seurat workflow.
  • 18:00So when we developed our software,
  • 18:02we provide integration with the Seurat workflow
  • 18:05so that our software can be embedded
  • 18:10as part of the (mumbles) flow.
  • 18:12So for example,
  • 18:14the data can be loaded into R using Seurat,
  • 18:19and then people can apply the pre-processing
  • 18:23using the Seurat workflow.
  • 18:26And then that object
  • 18:26can be fed into the SPRUCE analysis workflow.
  • 18:31And then the output from SPRUCE can, again,
  • 18:34be fed into the Seurat workflow for the visualization
  • 18:39and downstream analysis.
  • 18:46So first for the simulation,
  • 18:50the first for the simulation is about the...
  • 18:54Has the two purposes.
  • 18:56So first one is to assess the validity
  • 18:59of the parameter estimation algorithm.
  • 19:02And second is to quantify the effect
  • 19:05of ignoring skewness and spatial information.
  • 19:09So in order to make our simulation more realistic,
  • 19:13we use the sagittal mouse brain data as the tissue shape
  • 19:18and the spot location.
  • 19:20And we simulated four clusters
  • 19:23from the multivariate skew-normal distribution
  • 19:27with the 16 genes.
  • 19:31We considered the 26...
  • 19:352696 spots.
  • 19:38And then we considered three models,
  • 19:41including the multivariate normal with no spatial effects,
  • 19:44the multivariate skew-normal with no spatial effects,
  • 19:45and the multivariate skew-normal with spatial effects.
  • 19:49So the first one shows the implication
  • 19:51of ignoring both skewness and spatial structure.
  • 19:55The second shows the implication
  • 19:58of ignoring the spatial structure.
  • 20:00And the final one is our proposed model.
  • 20:05And here the top left figure,
  • 20:08shows the true cluster labels.
  • 20:11And the top right shows
  • 20:12the UMAP reduction of the gene expression pattern.
  • 20:18And as you can see, we made the orange and the green,
  • 20:22which are far away from each other,
  • 20:24similar in the gene expression,
  • 20:26so that it can be more challenging in the prediction.
  • 20:30And we evaluated the performance of each model
  • 20:35using the ARI, where a value close to one
  • 20:39indicates better performance.
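For readers unfamiliar with the ARI (adjusted Rand index), a small self-contained sketch of how it compares a predicted clustering with true labels is given below. The function name and the toy labels are illustrative; the formula is the standard pair-counting definition, and the function assumes at least two spots.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same spots."""
    n = len(labels_true)
    # Pairs of spots placed together in both labelings, in each one alone.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:               # degenerate case: trivial labelings
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# The index ignores how clusters are named: a relabeled copy scores 1.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A labeling that splits every true cluster scores below zero here.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is corrected for chance, values near zero indicate a clustering no better than random assignment.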
  • 20:42And as you can see here, when we ignore
  • 20:47the skewness and the spatial pattern,
  • 20:51there is a big loss in the ARI.
  • 20:55And by considering the skewness,
  • 20:57we gain some, but there is still some loss.
  • 21:00And by further considering the spatial pattern,
  • 21:04we can achieve a high level of ARI.
  • 21:11And for the real data application,
  • 21:14we consider the two applications.
  • 21:18So,
  • 21:20to compare the performance of the SPRUCE to existing tools,
  • 21:26we used the 10X Visium human brain data
  • 21:30from Maynard et al., 2021, Nature Neuroscience.
  • 21:36Here we have about 3,000 spots.
  • 21:40And one of the good aspects of this data is
  • 21:45it's very well annotated.
  • 21:48So, the authors,
  • 21:51using their expert knowledge,
  • 21:54annotated the 3,000 spots into 5 brain layers,
  • 22:00including the white matter and the frontal cortex layers.
  • 22:05And as I mentioned earlier,
  • 22:06we use the standard Seurat pre-processing pipeline,
  • 22:11including the normalization using SCTransform
  • 22:16and also selection of the most variable genes
  • 22:20using the existing pipeline.
  • 22:22We consider the top 16 most variable genes.
  • 22:29And we also consider four other existing algorithms,
  • 22:34including BayesSpace, stLearn, Seurat and Giotto,
  • 22:40as the competing algorithms.
  • 22:42And we use the default parameters for each of them.
  • 22:49Here it shows the results,
  • 22:52and top left figure shows the manual annotation
  • 22:54provided by the author in the paper.
  • 22:58And you can see the nice, five spatial clusters
  • 23:03from inside out.
  • 23:05And also there you can see
  • 23:08that there is one, narrow cell cluster
  • 23:11corresponding to the number four.
  • 23:16Here we show the results for SPRUCE,
  • 23:18BayesSpace, stLearn, Seurat and Giotto.
  • 23:24And in this case, the network-based approaches,
  • 23:28including stLearn, Seurat and Giotto,
  • 23:32all showed a lower performance compared to the other algorithms.
  • 23:38The BayesSpace showed relatively higher performance
  • 23:42about the ARI of 0.55.
  • 23:46SPRUCE further improved the performance
  • 23:49compared to the BayesSpace.
  • 23:52And one thing I noted here is the...
  • 23:58the narrow cell cluster
  • 24:00could be identified by SPRUCE,
  • 24:04which is interesting.
  • 24:06And as the second example.
  • 24:10So the first one is the labeled data.
  • 24:13We can compare our prediction to the existing annotation.
  • 24:17And to further demonstrate the application of the SPRUCE
  • 24:22to unlabeled data, we analyze the publicly available
  • 24:26human invasive ductal carcinoma breast tissue.
  • 24:31Again using the 10X Visium platform.
  • 24:36And we essentially followed the similar workflow
  • 24:38and we identify the top 16 most spatially variable genes.
  • 24:45And those included several tumor associated antigens,
  • 24:50TAAs, including GFRA1 and CXCL14.
  • 24:56And also there is the tumor suppressor gene,
  • 25:00like MALAT1.
  • 25:04And we use the SPRUCE to identify the 5 sub regions
  • 25:10using these 16 features.
  • 25:12This shows the 16 most variable genes.
  • 25:16And you can see that there are very clear spatial patterns.
  • 25:22For example the CXCL14 and GFRA1,
  • 25:28are expressed higher on the bottom right side,
  • 25:30while MALAT1 is expressed higher on the top left side.
  • 25:38And this is the cluster prediction
  • 25:42made by the SPRUCE algorithm.
  • 25:46And you can see that it identified
  • 25:48cluster 2, which highly coincides with CXCL14
  • 25:55and GFRA1.
  • 25:59(mumbles)
  • 26:01While cell cluster 1
  • 26:05coincides with MALAT1,
  • 26:09which is more of a tumor suppressor.
  • 26:13So here we can see that the SPRUCE can identify
  • 26:17the different group of the tissue architecture,
  • 26:20such as the tumor suppressor and then tumor related
  • 26:25(mumbles)
  • 26:33And we can also easily look at the
  • 26:37within-cluster expression pattern
  • 26:40and gene-gene correlation.
  • 26:43As you could see earlier,
  • 26:44cell cluster 2, which corresponds to the right,
  • 26:49is higher in GFRA1 and CXCL14.
  • 26:52Cluster 1, which corresponds to here, is high in MALAT1,
  • 26:57and so on.
  • 26:58And also, in the case of cell cluster 2,
  • 27:02there's a very strong gene-gene correlation pattern.
  • 27:06So this supports the proposed model, which considers
  • 27:11spatial pattern and also gene-gene correlation
  • 27:14simultaneously.
  • 27:20So,
  • 27:21so far I discussed the method
  • 27:26for our SPRUCE and its application.
  • 27:33And that we essentially expanded our work a little bit more
  • 27:38to the MAPLE,
  • 27:39which is the multi-sample spatial transcriptomics model.
  • 27:44Why do we care about the multi-sample analysis of HST data?
  • 27:49So currently most algorithms are designed in a way
  • 27:53that focuses on a single sample.
  • 27:57But even intuitively,
  • 27:59joint analysis of the multiple HST data
  • 28:03can potentially boost the signal
  • 28:06by sharing the information amongst samples.
  • 28:09And also the joint analysis of the different samples
  • 28:13can allow the differential analysis of the HST data.
  • 28:18So very often, each tissue is not our main interest,
  • 28:24but we also want to compare tissue architecture
  • 28:27between the different samples.
  • 28:30For example, between the disease group versus the controls,
  • 28:35or responders versus non-responders to certain treatments,
  • 28:39such as the cancer immuno-therapy.
  • 28:41So to address this limitation, we proposed MAPLE.
  • 28:47And actually our existing SPRUCE framework
  • 28:50already allows this one naturally.
  • 28:55So, simply what it can do is
  • 28:57instead of now analyzing each sample individually,
  • 29:01we can jointly analyze all the samples together.
  • 29:05And then by doing so,
  • 29:06we can share information
  • 29:08about the modeling of each cell spot cluster,
  • 29:13and also their spatial pattern.
  • 29:17And by introducing the sample-level covariate x i
  • 29:22in the cell type composition,
  • 29:27we can see the impact
  • 29:29of the different sample-level covariate.
  • 29:33Which I show more in detail in the coming slides.
  • 29:41So the first application is the same mouse brain data,
  • 29:45the human brain data...
  • 29:47Sorry this should be the mouse brain,
  • 29:49and here we see the two anterior parts,
  • 29:54which look very similar.
  • 29:56And then as you can see here,
  • 29:57when we jointly analyze the two samples
  • 30:01corresponding to the same part of the brain,
  • 30:04it nicely identifies the corresponding parts
  • 30:08between the two samples.
  • 30:10Like one in the end, three on the top,
  • 30:14five at the bottom and so on.
  • 30:17And because this is the Bayesian framework,
  • 30:21it can also provide uncertainty measures
  • 30:25about our clustering prediction.
  • 30:28And as you can see, usually there is more uncertainty
  • 30:31about the clustering prediction
  • 30:35around the boundary between different cell clusters.
  • 30:38Which kind of makes sense,
  • 30:40because we expect that maybe cell type
  • 30:43might be more mixed together in the same cell spot.
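One simple way to turn MCMC output into cluster labels and a per-spot uncertainty score is sketched below. The talk does not say how the uncertainty measure is defined, so "one minus the largest posterior membership probability" is an assumption made for illustration, as are the function name and the toy draws. Labels are coded 0 to K-1 here and are assumed to be already aligned across draws (no label switching).

```python
import numpy as np

def membership_probabilities(z_draws, K):
    """Posterior membership probabilities from sampled cluster labels.

    z_draws: S x n integer array, z_draws[s, i] in {0, ..., K-1} is the
             label of spot i at posterior draw s.
    """
    S, n = z_draws.shape
    probs = np.zeros((n, K))
    for k in range(K):
        probs[:, k] = (z_draws == k).mean(axis=0)  # share of draws in cluster k
    return probs

# Toy draws: spots 0 and 1 are stable, spot 2 flips between two clusters,
# as one might expect for a spot sitting on a cluster boundary.
z_draws = np.array([[0, 1, 1],
                    [0, 1, 0],
                    [0, 1, 1],
                    [0, 1, 0]])

probs = membership_probabilities(z_draws, K=2)
labels = probs.argmax(axis=1)           # most probable cluster per spot
uncertainty = 1.0 - probs.max(axis=1)   # 0 means the label never changed
```

Mapping `uncertainty` back onto the spot coordinates gives the kind of map where boundary spots stand out.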
  • 30:48Also, there are some cell clusters
  • 30:50with the higher level of the uncertainty
  • 30:53of which we are still trying to understand more
  • 30:56at this point.
  • 30:58And this kind of figure shows the...
  • 31:01the utility of this kind of joint analysis.
  • 31:06So, for identifiability,
  • 31:09we set the first cell cluster as the reference.
  • 31:14And then here we see the two (mumbles)
  • 31:16The top one shows the intercept,
  • 31:20and then we can interpret this one as the relative size
  • 31:25of each cell cluster.
  • 31:26So then compared to the one,
  • 31:29we can say three and the six are larger.
  • 31:32So the three and the six are larger, compared to the one.
  • 31:36While four is smaller,
  • 31:39well, just smaller compared to one.
  • 31:40So this is what you can see by eye
  • 31:45from the tissue prediction region.
  • 31:48But good thing is that this model allows us to quantify,
  • 31:52what you see by eye.
  • 31:55And what is more interesting is the second one.
  • 31:57So this one,
  • 31:58is about the difference between the two samples.
  • 32:02So again,
  • 32:04so basically if it's higher,
  • 32:07then it means that a certain tissue spot cluster
  • 32:12is getting bigger in the second sample.
  • 32:14And if it's lower, it means that it is kind of smaller
  • 32:18in the second sample and so on.
  • 32:20So in this way,
  • 32:21we can quantify the change of the tissue architecture
  • 32:26between different cell clusters.
  • 32:30And another interesting example is this one.
  • 32:35So here, in addition to the two anterior samples,
  • 32:39we now also look at the posterior sample as well.
  • 32:44So because this is two parts of the brain
  • 32:48anterior and the posterior,
  • 32:50the tissue is kind of continuous between the two.
  • 32:53And as you can see here,
  • 32:54cell cluster three is connected to the posterior side here.
  • 33:00Cell cluster one is connected to here and so on.
  • 33:05And then this kind of pattern is not clear
  • 33:08if you analyze each data independently.
  • 33:12And our MAPLE framework nicely captures
  • 33:16such kind of sharing pattern.
  • 33:18And also the difference pattern
  • 33:20between the different samples, interestingly.
  • 33:23So at this point,
  • 33:25we are working on more simulation study
  • 33:27and the real data analysis
  • 33:29to further show the performance
  • 33:33and understand the properties of the MAPLE at this point.
  • 33:39So then let me summarize my presentation today.
  • 33:45So the high throughput spatial transcriptomics, or HST,
  • 33:50provides unprecedented opportunities
  • 33:54to investigate novel biological hypotheses,
  • 33:57such as the tumor microenvironment and certain structure
  • 34:05about the human brain and Alzheimer,
  • 34:08and so on.
  • 34:10And here we propose SPRUCE,
  • 34:13a Bayesian multivariate mixture model
  • 34:16for HST data analysis.
  • 34:19SPRUCE has multiple strengths
  • 34:23including the novel combination
  • 34:26of the skew-normal density,
  • 34:29Polya-Gamma data augmentation,
  • 34:31and spatial random effect.
  • 34:35Altogether, it allows us to
  • 34:37precisely infer spatially correlated mixture component
  • 34:41membership probabilities.
  • 34:44In our simulation study and real data analysis,
  • 34:49we could see that SPRUCE outperforms the existing method,
  • 34:53in the tissue architecture identification.
  • 34:56And finally our recent extension of the MAPLE
  • 35:01allows the joint clustering and differential analysis
  • 35:05of multiple HST data.
  • 35:09So at this point SPRUCE is under review in,
  • 35:13(mumbles)
  • 35:15in Biometrics.
  • 35:17The corresponding manuscript is available on bioRxiv.
  • 35:22And there is multiple ongoing work
  • 35:25regarding the HST data modeling
  • 35:28in our lab.
  • 35:30So we are actually currently working on further improving
  • 35:35the SPRUCE and the MAPLE
  • 35:37by incorporating other characteristics
  • 35:39of the HST data, such as the relationships among cells.
  • 35:44For example,
  • 35:46we know that there are some ligands and receptors,
  • 35:50for example.
  • 35:51Which we expect that they interact with each other
  • 35:55in their cell structure.
  • 35:57And then by incorporating different prior information,
  • 36:01we can further improve the SPRUCE and MAPLE.
  • 36:06We are also working on the other statistical models
  • 36:10for somewhat relevant, but different tasks.
  • 36:14For example,
  • 36:15currently we are also working on the deep learning framework,
  • 36:19especially the graph neural network,
  • 36:22which is called RESEPT.
  • 36:24And then using the deep learning framework,
  • 36:28we tried to come up with a good embedding
  • 36:30of the HST gene expression pattern.
  • 36:34Our current results show that such a combination
  • 36:38of the deep learning and the statistical model approach
  • 36:41can provide nice prediction performance.
  • 36:47For this purpose, we developed a framework called RESEPT,
  • 36:52and the corresponding bioRxiv preprint
  • 36:55is also available publicly.
  • 36:57And the corresponding paper
  • 36:59is now under revision in Nature Communications.
  • 37:06Regarding cell-cell communications,
  • 37:09using network-based approaches has some benefit
  • 37:12because the cell-cell communication can be nicely
  • 37:16and naturally modeled using a graph or network.
  • 37:21So we have the parallel work called Banyan
  • 37:25to identify the cell-cell communication
  • 37:27and tissue architecture using the network-based approaches.
  • 37:31And finally, there are multiple efforts experimentally
  • 37:37to generate the spatial multimodal data.
  • 37:42For example,
  • 37:43ECCITE-seq, which measures the single cell genomics,
  • 37:48proteomics and the T-cell receptor at the same time.
  • 37:53And very soon,
  • 37:54everything is expected to be combined
  • 37:57as the spatial transcriptomic structure.
  • 38:01We are working on the direction
  • 38:03to develop the statistical model
  • 38:06for integration of the HST data with other matched data.
  • 38:13So I would like to acknowledge my research team at OSU.
  • 38:18Carter Allen is the main driver of this project,
  • 38:23and also my PhD student.
  • 38:27Qin Ma and Yuzhou Chang are my close collaborators
  • 38:32for the HST data modeling project.
  • 38:36And Zihai Li,
  • 38:38who is the director of the Immuno-Oncology Institute
  • 38:42and also the expert in cancer.
  • 38:46Won Chang at the University of Cincinnati
  • 38:49who is the spatial statistics expert,
  • 38:53and MUSC collaborator Brian Neelon
  • 38:56and my grant support.
  • 39:00So, and this is the end of my presentation,
  • 39:03and you can find my manuscript
  • 39:05and the software from the link here.
  • 39:10If you have any questions or comments,
  • 39:13please let me know by email at chung.911@osu.edu.
  • 39:17So thank you for your attention.
  • 39:29<v ->So thank you.</v>
  • 39:32Do we have any questions from the audience in the classroom,
  • 39:36or from the audience on zoom?
  • 39:43<v ->Can I ask a question?</v>
  • 39:45Can you hear me?
  • 39:46<v ->Yes, mm-hm.</v>
  • 39:47<v ->Right, Dongjun welcome back.</v>
  • 39:50Great work, it's a nice presentation.
  • 39:52I'm just wondering, like,
  • 39:53when you do this from your own experience
  • 39:56on the cell clustering,
  • 39:57how much the spatial information contributes
  • 40:00to the clustering.
  • 40:03<v ->Sure.</v>
  • 40:09So,
  • 40:10(mumbles)
  • 40:15If you're here,
  • 40:16so if you look at the Seurat workflow,
  • 40:18you can see there's still a lot of, kind of,
  • 40:22local boundaries between different cell spot clusters.
  • 40:28And when you analyze the same data using the SPRUCE,
  • 40:32you can see much cleaner boundaries.
  • 40:34And often it will coincide with the
  • 40:36expert annotation.
  • 40:40So given that, there is a significant contribution.
  • 40:46Of course, even with only the gene expression,
  • 40:49we still get some big picture, as you can see here.
  • 40:52But spatial information provides a much cleaner prediction
  • 40:57about the tissue architecture in general.
  • 41:01<v ->I see.</v>
  • 41:02And also the skewness.
  • 41:05Do you estimate that, or is that, like,
  • 41:08pre-specified, the skewness?
  • 41:12<v ->You mean which one?</v>
  • 41:14<v ->Alpha k, in the model.</v>
  • 41:16In your model, the alpha k you have there.
  • 41:19I missed that part.
  • 41:20Like, do you need to specify the skewness?
  • 41:24<v ->Or learn from data.</v>
  • 41:26<v ->Oh, I see.</v>
  • 41:27But from the data, how skew?
  • 41:29I mean, just in terms of how stable
  • 41:31that alpha k can be estimated.
  • 41:36<v ->So maybe I can answer it in two different ways.</v>
  • 41:40So if there is this skewness in the data, I think yes.
  • 41:44So I would say it depends on how you process the data as well.
  • 41:49So usually there are three different approaches
  • 41:51to model the HST data, including spatially variable genes.
  • 41:57And so you can see here,
  • 41:58there are people using the principal components.
  • 42:02There are people who use deep learning
  • 42:05as the embedding step.
  • 42:08If you use deep learning or PCA,
  • 42:12it's more likely symmetric in the real data.
  • 42:16If you consider spatially variable genes,
  • 42:20we often expect to have the skewness, as you can see here.
  • 42:24And then regarding your question, overall it works well.
  • 42:30I don't have the exact quantification, but it works well.
  • 42:34It is estimated stably in most cases.
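Editor's illustration of the skewness point above: a minimal sketch, not the speaker's model, of how one might check whether a feature looks symmetric (as PCA or deep learning embeddings often do) or skewed (as gene-level expression often does). The function name and the two value lists are hypothetical, and the statistic is the plain moment-based sample skewness, not the skewness parameter estimated inside SPRUCE.

```python
def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (biased version)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant feature: treat as having no skewness
    return m3 / m2 ** 1.5


# Hypothetical feature values across five spots (illustration only).
symmetric_feature = [-2.0, -1.0, 0.0, 1.0, 2.0]    # embedding-like
right_skewed_feature = [0.0, 0.0, 0.0, 1.0, 9.0]   # gene-expression-like

print("symmetric feature skewness:", sample_skewness(symmetric_feature))
print("right-skewed feature skewness:", sample_skewness(right_skewed_feature))
```

A value near zero suggests a symmetric component distribution may be adequate, while a clearly positive value is the kind of case where a skewed distribution could help.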
  • 42:37<v ->Yeah, I read the BayesSpace paper.</v>
  • 42:39They seem to be working on the principal components, right?
  • 42:41They do not work on individual genes, right?
  • 42:43<v ->No, yeah.</v>
  • 42:44They base this on the PCA.
  • 42:46<v ->Yeah, that's why it's completely puzzling me</v>
  • 42:47while you're doing that.
  • 42:48But anyway, yeah.
  • 42:49Thank you.
  • 42:50<v ->Yeah so, so...</v>
  • 42:51(mumbles)
  • 42:54so they mainly target the PCA.
  • 42:55So they can only consider the symmetric multivariate distribution.
  • 42:59And also because of the same reason,
  • 43:01their covariance matrix is less dense.
  • 43:05<v ->I see.</v>
  • 43:08Thank you.
  • 43:09<v ->Thank you.</v>
  • 43:22<v ->Do we have any questions from students in the classroom?</v>
  • 43:32<v ->Wait, can I ask another question?</v>
  • 43:36So, towards the end,
  • 43:37you mentioned you tried
  • 43:38to look at the cell-cell communication.
  • 43:45That part.
  • 43:46I'm very interested in that.
  • 43:49From our experience on the single cell spatial data are...
  • 43:55Are you talking about you're learning from the single cell,
  • 43:58or the spatial single cell?
  • 44:02<v ->So, regarding the cell-cell communication</v>
  • 44:05it's still very ongoing research at this point.
  • 44:10I mean, not just our side but in general.
  • 44:13Because most of the cell-cell communication prediction
  • 44:17is based on the database.
  • 44:20So it is based on data like the ligand-receptor
  • 44:22pairing database, and checking
  • 44:25their corresponding expression in the corresponding spot
  • 44:28or the cell.
  • 44:29And then, by checking the corresponding pair
  • 44:33of the expression pattern
  • 44:34between the ligand and the receptor,
  • 44:36they want to model cell-cell communication.
  • 44:40It's not perfect, as you know,
  • 44:42because it's like a computer.
  • 44:45If you look at ChIP-seq, it's almost like
  • 44:48(mumbles)
  • 44:48but more like motif analysis.
  • 44:51So there's some limitation,
  • 44:53but it's more likely a general limitation at this point.
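Editor's illustration of the database-driven check described in this answer: a minimal sketch that looks up ligand-receptor pairs in a small list and scores a sender spot against a receiver spot by multiplying ligand expression in one by receptor expression in the other. The product score, the function name, the spot names, and the expression values are all assumptions made for illustration; this is not the speaker's method or any specific tool's algorithm.

```python
# A tiny stand-in for a ligand-receptor pairing database (illustration only).
LR_PAIRS = [("CXCL12", "CXCR4"), ("CCL19", "CCR7")]

# Hypothetical normalized expression per spot.
expression = {
    "spot_A": {"CXCL12": 4.0, "CXCR4": 0.0, "CCL19": 1.0, "CCR7": 0.0},
    "spot_B": {"CXCL12": 0.0, "CXCR4": 3.0, "CCL19": 0.0, "CCR7": 0.5},
}


def lr_scores(expression, sender, receiver, lr_pairs):
    """Score each pair as ligand expression (sender) x receptor expression (receiver)."""
    scores = {}
    for ligand, receptor in lr_pairs:
        ligand_level = expression[sender].get(ligand, 0.0)
        receptor_level = expression[receiver].get(receptor, 0.0)
        scores[(ligand, receptor)] = ligand_level * receptor_level
    return scores


scores = lr_scores(expression, "spot_A", "spot_B", LR_PAIRS)
for pair, score in scores.items():
    print(pair, score)
```

Because the score relies only on co-expression of database pairs, it inherits the limitation mentioned above: it suggests possible interactions rather than measuring them.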
  • 44:57<v ->Yeah,</v>
  • 44:58I'm asking because we've been looking
  • 44:59at some of the spatial single cell data
  • 45:02that were too noisy for the ligand
  • 45:04and receptor gene expression levels.
  • 45:07Just couldn't make it too far.
  • 45:09(mumbles)
  • 45:11But for a single cell, may be different?
  • 45:13I mean, probably there'll be more that, like...
  • 45:17<v ->Yeah, three already.</v>
  • 45:18I mean, so if you go to high-resolution,
  • 45:22it's very noisy,
  • 45:24so very often we need to do some simplification.
  • 45:28Like looking at multi-modal or the cell cluster,
  • 45:31rather than the cell.
  • 45:34There are still multiple experimental limitations,
  • 45:38at this point.
  • 45:39(mumbles)
  • 45:40Thank you.
  • 45:51(class teacher addresses classroom)
  • 46:00<v ->On the data from multiple samples</v>
  • 46:03So, if we have samples from...
  • 46:06(mumbles)
  • 46:18<v ->Oh yeah, that's a very good question.</v>
  • 46:22So,
  • 46:23actually we can answer in the two different ways.
  • 46:29In some sense,
  • 46:30good pre-processing is still important
  • 46:35because it still depends on the expression patterns.
  • 46:43But still regarding the differences
  • 46:46between the different tissues.
  • 46:48If there is a big difference,
  • 46:49it can still detect the difference
  • 46:51between the different samples.
  • 46:54So, it can detect spots.
  • 46:55But still, the main goal is more
  • 46:59for the similar type of tissue.
  • 47:01If it's too different,
  • 47:02maybe it's a different research project.
  • 47:05So, for example,
  • 47:07here our target is more about, for example,
  • 47:10like the same breast tissue,
  • 47:13but with different responder and non-responder groups,
  • 47:18for example.
  • 47:19Or like the same lung tissue, but tumor versus non-tumor
  • 47:24and so on.
  • 47:25If it's like human and mouse,
  • 47:29then it might be a somewhat different story,
  • 47:33which might need much more work.
  • 47:38<v ->Do we have any more questions here?</v>
  • 47:58Okay, do we have any other questions
  • 47:59from the audience on zoom?
  • 48:21Okay, so it looks like we don't have any more questions.
  • 48:26So Dr. Chung, thank you again for your nice presentation.
  • 48:31Look forward to meeting in person sometime soon.
  • 48:36<v ->And then thank you again Wei and Hongyou</v>
  • 48:38for the invitation
  • 48:40and it's great to come back, although virtually.
  • 48:43And I hope to see you again.
  • 48:46<v ->We'll come by in person.</v>
  • 48:49<v ->Hopefully someday soon.</v>
  • 48:53Okay, thank you.