
YSPH Biostatistics Seminar: "SPRUCE and MAPLE: Bayesian Spatial Multivariate Mixture Model for High Throughput Spatial Transcriptomics Data"

December 01, 2021

Dongjun Chung, PhD, Associate Professor, Department of Biomedical Informatics, The Ohio State University

November 30, 2021


Transcript

  • 00:01<v ->College of Medicine and he is a member</v>
  • 00:04of the Pelotonia Institute for Immuno-Oncology
  • 00:08of Ohio State University as a candidate and a member.
  • 00:12His research focuses on
  • 00:15(mumbles)
  • 00:18for integrative analysis of genetic and genomic data
  • 00:22with biomedical real data.
  • 00:26So welcome back Dongjun Chung.
  • 00:29(audience member claps)
  • 00:32<v ->Okay.</v>
  • 00:34Thank you Wei, for the kind introduction
  • 00:37and it's so great to come back.
  • 00:39Although it's all virtual.
  • 00:43I hope someday we can see in person.
  • 00:46So today I will discuss our recent project
  • 00:52about the SPRUCE and MAPLE: Bayesian Multivariate
  • 00:57Mixture Models for Spatial Transcriptomics Data.
  • 01:01Oh, by the way, can you hear me well?
  • 01:03<v ->Ah yes, we can hear you.</v>
  • 01:05<v ->Okay, great.</v>
  • 01:07So, let me start with a quick introduction
  • 01:12to single cell genomics.
  • 01:16So in some sense,
  • 01:17we can say that the last decade was the era of single cell
  • 01:21genomic experiments.
  • 01:24So it changed science in many ways.
  • 01:26And also a great amount of the data has been generated
  • 01:32using the single cell genomic technology.
  • 01:36Single cell genomic experiments
  • 01:38provide high-dimensional data at the cell level.
  • 01:42By doing so,
  • 01:44it allows us to investigate cellular heterogeneity
  • 01:48within each subject or patient,
  • 01:52which was not possible previously
  • 01:54with bulk genomic data,
  • 01:56meaning genomic data collected at the tissue level.
  • 02:04So some kind of standard visualization
  • 02:09of the single cell genomic data is called a UMAP.
  • 02:13And here,
  • 02:14this UMAP shows the distribution of the different clusters
  • 02:18in the tumor,
  • 02:20including the different immune cell type.
  • 02:25And in this way,
  • 02:27we can interrogate different types
  • 02:29of the immune cell composition.
  • 02:32And also there,
  • 02:33we can look at what kind of general feature
  • 02:37imaged for each cell cluster.
  • 02:40One of the recent (mumbles)
  • 02:43is the emergence of the high-throughput
  • 02:46spatial transcriptomics or the HST technology.
  • 02:51So, with the emergence of the HST technology,
  • 02:56we do not only look at the gene expression
  • 02:59at the cell level or the close-to-cell level.
  • 03:03We now also have the corresponding
  • 03:06spatial information.
  • 03:08The figure at the bottom shows one example.
  • 03:12And here it shows the mouse brain tissue,
  • 03:16and each circle
  • 03:19here corresponds to one spot,
  • 03:21which is a group of a...
  • 03:24small number of cells, like two to ten at most.
  • 03:29And the color here indicates the expression level of different genes.
  • 03:34So the left one corresponds to the Hpca gene.
  • 03:39The right one corresponds to the Ttr gene, for example.
  • 03:48And with the HST data,
  • 03:50we can do a lot of interesting science
  • 03:54to improve the parity in current medication.
  • 03:57So for example,
  • 03:59we can now look at the spatial information
  • 04:01of the tissue architecture at the transcriptomics level.
  • 04:07And then we can also investigate
  • 04:09the cell-cell communication with the spatial information
  • 04:13in our hand.
  • 04:15So the figure at the bottom left shows the UMAP.
  • 04:19And here,
  • 04:20the different color indicates a different cell cluster.
  • 04:24And if you look at the figure on the right,
  • 04:27then you can see that they cluster
  • 04:30in a meaningful way on the tissue.
  • 04:33So in this way, we do not only look at the different cell types
  • 04:36within a tissue,
  • 04:38but also look at their spatial information at the same time.
  • 04:47And there are many exciting applications
  • 04:49of the HST experiment, including the neuroscience.
  • 04:57Including the brain cancer study such as the immuno-oncology
  • 05:02and the developmental biology
  • 05:04which looks at the changes of the cellular composition
  • 05:08across the different stage of the development.
  • 05:12And here I specifically discuss the application
  • 05:16in the cancer, especially the tumor microenvironment.
  • 05:20And with the spatial information,
  • 05:22we can now study their location of the immune cell
  • 05:27and the tumor cell in the tumor tissue.
  • 05:31We can also interrogate implication of distance
  • 05:35on the tissue and their corresponding density.
  • 05:39And we can also study the distribution
  • 05:42of the immune regulator.
  • 05:45And finally, the special spatial patterns
  • 05:49such as the tertiary lymphoid structure.
  • 05:56Then from the statistical point of view,
  • 05:59what does the HST data look like?
  • 06:05The first observation is that the HST data reflects spatial structure
  • 06:10in the tissue architecture in a meaningful way.
  • 06:13So as we discussed earlier,
  • 06:16we can see a similar type of the cell cluster
  • 06:19often located in the close proximity in the tissue.
  • 06:27And even after we exclude this kind of cell composition effect
  • 06:32in the spatial location,
  • 06:35we can start to see some spatial pattern in the expression
  • 06:38on the tissue.
  • 06:40So the figure on the top shows the expression pattern
  • 06:43of the three genes,
  • 06:46PCP4, MBP and MT-CO1,
  • 06:50after regressing out the cell clusters.
  • 06:56And as you can see, even after considering
  • 06:58the cell cluster patterns,
  • 07:00you can start to see some interesting spatial patterns.
  • 07:05The figure at the bottom shows the distribution
  • 07:09of each gene for each cell cluster.
  • 07:14And you can see that sometimes it's symmetric,
  • 07:17but often we can also see asymmetry
  • 07:22and skewed distributions for each gene.
  • 07:27So these are some of the key features
  • 07:30of the HST data we want to consider
  • 07:33in the modeling of the HST data.
  • 07:37So if I briefly summarize,
  • 07:39gene expression outcomes feature complex correlations
  • 07:43such as the spatial correlation,
  • 07:46and also gene-gene correlation,
  • 07:49which mainly reflects the biological pathway.
  • 07:52Spatial structure can be
  • 07:55(mumbles)
  • 07:56cellular clustering and gene expression patterns.
  • 08:00And gene expression densities
  • 08:02often feature skewness and/or heavy tails
  • 08:06due to outlier cell spots.
  • 08:09So ideally we seek to provide a model
  • 08:13for identifying the tissue architecture
  • 08:16while accommodating these challenging features.
  • 08:24So, especially during the last two years,
  • 08:28several statistical methods have been proposed
  • 08:32to model HST data.
  • 08:34And still many of them are network-based approaches,
  • 08:39partially because of Seurat, the very famous package
  • 08:43for single cell genomic data analysis,
  • 08:46and the network-based approach has been proven
  • 08:49to be powerful in this context.
  • 08:51So based on that, multiple network-based approaches
  • 08:56have been proposed, including Giotto, Seurat and stLearn.
  • 09:03As for statistical models,
  • 09:07recently BayesSpace was proposed by the group of the
  • 09:12(mumbles)
  • 09:13at the Fred Hutchinson.
  • 09:15And essentially,
  • 09:16it uses a multivariate-t mixture model
  • 09:21to cluster cell spots.
  • 09:24It implements spatial smoothing of clusters
  • 09:27via a Potts model prior on cluster labels.
  • 09:32And interestingly,
  • 09:34they try to predict sub-spots to increase the resolution.
  • 09:41In spite of such interesting features,
  • 09:44it also has a number of drawbacks.
  • 09:47For example,
  • 09:49it assumes the symmetry of the gene expression densities,
  • 09:52and it also relies on the approximate inference.
  • 09:58And here our goal is to develop a statistical model
  • 10:03that overcomes these limitations
  • 10:06and also provide the optimal tissue architecture prediction
  • 10:11using the HST data which we call SPRUCE
  • 10:18or the spatial random effects-based clustering
  • 10:20of the single cell data.
  • 10:30So this is our SPRUCE model.
  • 10:35So here we use the i as the index for the cell spot
  • 10:40in the tissue sample.
  • 10:42And then we denote y i
  • 10:45as the length-g gene expression vector for spot i.
  • 10:50And based on the y i, we define a mixture model
  • 10:56of this form.
  • 10:59So here we assume K mixture components,
  • 11:03or cell spot clusters.
  • 11:06Theta k indicates the set of the parameters
  • 11:10specific to mixture component k.
  • 11:13Pi k is the probability of the spot i
  • 11:17belonging to the component k.
  • 11:22We further introduce z1 to zn,
  • 11:27which are the latent mixture component indicators
  • 11:31for each spot.
  • 11:33And z i can take a value from one to K.
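The mixture model described in this passage can be written out as follows. This is a sketch reconstructed from the spoken definitions only; the generic component density f and the explicit constraint that the weights sum to one are notational assumptions and are not taken from the slides.

```latex
f(\mathbf{y}_i) = \sum_{k=1}^{K} \pi_k \, f(\mathbf{y}_i \mid \theta_k),
\qquad \sum_{k=1}^{K} \pi_k = 1,
\qquad
P(z_i = k) = \pi_k, \quad z_i \in \{1, \dots, K\}, \quad i = 1, \dots, n .
```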
  • 11:37And as I mentioned earlier,
  • 11:40skewness and gene-gene correlation
  • 11:42are key features of the HST data.
  • 11:47So to account for skewness and gene-gene correlation,
  • 11:51we assume a multivariate skew-normal distribution.
  • 11:56with the following parameters.
  • 11:59So the first one indicates the mean vector for spot i,
  • 12:03and alpha k indicates gene-specific skewness parameters
  • 12:07for mixture component k.
  • 12:10And omega k is the g-by-g scale matrix that captures correlation
  • 12:15among the gene expression features in component k.
  • 12:24And then we further represent MSN distribution
  • 12:28using a convenient conditional representation.
  • 12:31We use mu k for the mean of component k,
  • 12:36phi i for the spatial effect,
  • 12:38and t i and xi k for the component-specific skewness
  • 12:44of each gene.
  • 12:47Epsilon i for the multivariate normal error.
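The conditional representation just described (mean, spatial effect, skewness term, normal error) can be illustrated with a small simulation. This is a minimal single-gene sketch, not the SPRUCE implementation: it assumes t i is a standard normal truncated to be non-negative and replaces the multivariate error with a univariate one; the function names and numeric values are hypothetical.

```python
import random


def simulate_skew_normal(mu_k, phi_i, xi_k, sigma, rng):
    """Draw one expression value as y = mu_k + phi_i + t_i * xi_k + eps_i."""
    # Assumption: t_i is a standard normal truncated to [0, inf).
    t_i = abs(rng.gauss(0.0, 1.0))
    # Univariate stand-in for the multivariate normal error.
    eps_i = rng.gauss(0.0, sigma)
    return mu_k + phi_i + t_i * xi_k + eps_i


def sample_skewness(values):
    """Moment-based sample skewness."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)
# A positive xi_k should give right skew, a negative xi_k left skew.
right = [simulate_skew_normal(1.0, 0.0, 2.0, 1.0, rng) for _ in range(20000)]
left = [simulate_skew_normal(1.0, 0.0, -2.0, 1.0, rng) for _ in range(20000)]
skew_right = sample_skewness(right)
skew_left = sample_skewness(left)
```

The sign of xi k controls the direction of the skew, and setting it to zero recovers an ordinary normal component.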
  • 12:53And then in order to further accommodate spatial dependence,
  • 12:58we used the multivariate intrinsic
  • 13:00conditionally autoregressive,
  • 13:02or the CAR prior for phi i.
  • 13:05So essentially,
  • 13:08given all the spots except for spot i,
  • 13:12we model phi i as a normal distribution
  • 13:17with the mean of its neighbors.
  • 13:21And with the covariance matrix denoted as the lambda.
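The idea of conditioning each spot's spatial effect on its neighbors can be sketched in a few lines. This is a univariate illustration under stated assumptions: the neighbor lists are hypothetical, and the conditional variance is taken to be lambda divided by the number of neighbors, which is the usual intrinsic CAR convention rather than something stated in the talk.

```python
def car_conditional(phi, neighbors, i, lam):
    """Conditional mean and variance of phi[i] given all other spots
    under a univariate intrinsic CAR prior (illustrative only)."""
    nbrs = neighbors[i]
    m_i = len(nbrs)
    # The conditional mean is the average of the neighboring effects.
    mean = sum(phi[j] for j in nbrs) / m_i
    # Assumption: the variance shrinks with the number of neighbors.
    var = lam / m_i
    return mean, var


# Four spots on a line: 0 - 1 - 2 - 3 (hypothetical layout).
phi = [0.0, 1.0, 2.0, 4.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
mean_1, var_1 = car_conditional(phi, neighbors, 1, lam=0.5)
```

In the multivariate version described in the talk, phi i is a vector over genes and the scalar variance becomes a covariance matrix.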
  • 13:33And as you can see earlier,
  • 13:35we see the two different levels of the spatial patterns.
  • 13:40One is the spatial pattern of the cell clustering.
  • 13:44And another one is the spatial pattern
  • 13:46of the gene expression.
  • 13:48So for the spatial pattern of the cell clusters,
  • 13:53we want to allow the probability pi
  • 13:58of belonging to each mixture component
  • 14:01to vary spatially as well.
  • 14:03So in order to do so,
  • 14:05we extend the model I showed previously
  • 14:09using the pi i k,
  • 14:12which is specific to spot i.
  • 14:14And then here we modeled this one as the sigmoid
  • 14:18of the two parameters.
  • 14:21The first term is the intercept
  • 14:23for the baseline propensity of membership
  • 14:28in component k, shared by all cell spots.
  • 14:32And the second term indicates the spatial random effects,
  • 14:35allowing variation about the intercept.
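A small sketch of spot-specific mixing weights built from an intercept plus a spatial random effect follows. It assumes a multinomial-logit (softmax) form with the first component as the reference, which reduces to the sigmoid mentioned in the talk when there are two components; the names `delta` and `psi_i` are hypothetical.

```python
import math


def mixing_weights(delta, psi_i):
    """Mixing weights pi_ik for one spot.

    delta[k] is the intercept and psi_i[k] the spatial random effect for
    each non-reference component; the reference component has its linear
    predictor fixed at 0 (an assumption made for identifiability).
    """
    eta = [0.0] + [d + p for d, p in zip(delta, psi_i)]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(eta)
    w = [math.exp(e - m) for e in eta]
    total = sum(w)
    return [x / total for x in w]


# Three components: the spatial effect pushes this spot toward component 3.
weights = mixing_weights(delta=[0.5, -1.0], psi_i=[0.0, 2.0])
```

Here the third component has a low baseline propensity, but the spot-level effect raises its weight above the others at this location.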
  • 14:42And again,
  • 14:43to introduce the spatial association
  • 14:46into the component membership model,
  • 14:49we further assume the univariate intrinsic CAR prior.
  • 14:53As you can see here.
  • 14:55And here, one computational challenge,
  • 15:01if you're interested, is that in this form
  • 15:04it does not allow us to...
  • 15:06It does not provide a closed-form posterior distribution,
  • 15:10which prevents Gibbs sampling.
  • 15:12And in order to address this computational challenge,
  • 15:17we extended our model
  • 15:20using the results from Polson et al., 2013, JASA,
  • 15:25on Polya-Gamma data augmentation to allow for Gibbs sampling
  • 15:30of the mixing weight model parameters.
  • 15:34And essentially,
  • 15:34we could assume that this can be represented
  • 15:38via the Polya-Gamma data augmentation.
  • 15:42And by doing so,
  • 15:43everything can be implemented as the Gibbs sampler.
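To give a feel for the auxiliary variables involved, here is a rough sketch of drawing from a Polya-Gamma distribution by truncating its infinite sum-of-gammas representation. This is an approximation for illustration only; it is not the sampler used in SPRUCE, and the truncation level `n_terms` is an arbitrary choice.

```python
import math
import random


def sample_polya_gamma(b, c, rng, n_terms=100):
    """Approximate draw from PG(b, c) using a truncated version of the
    sum-of-gammas representation in Polson et al. (2013)."""
    total = 0.0
    for k in range(1, n_terms + 1):
        g_k = rng.gammavariate(b, 1.0)  # Gamma with shape b and scale 1
        total += g_k / ((k - 0.5) ** 2 + c ** 2 / (4.0 * math.pi ** 2))
    return total / (2.0 * math.pi ** 2)


rng = random.Random(1)
draws = [sample_polya_gamma(1.0, 0.0, rng) for _ in range(2000)]
# The theoretical mean of PG(1, 0) is 1/4; the sample mean should be close.
pg_mean = sum(draws) / len(draws)
```

Conditional on such draws, the logistic likelihood becomes Gaussian in the linear predictor, which is what makes closed-form Gibbs updates possible.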
  • 15:49In the case of the further outliers or heavy-tails,
  • 15:53we can even further extend the model
  • 15:56to the multivariate skew-t distribution
  • 15:59that you can see here.
  • 16:00Which can be very easily implemented
  • 16:03given the existing model.
  • 16:07To complete our model specification,
  • 16:10we use weakly informative priors
  • 16:14and conjugate priors.
  • 16:16And by using these conjugate priors,
  • 16:19we can do everything using a full Gibbs sampler
  • 16:23with closed-form updates,
  • 16:24which provides fast computation.
  • 16:29And some additional consideration.
  • 16:33So here,
  • 16:34one question is the optimal number K,
  • 16:38in other words the number of cell spot clusters.
  • 16:41So for this purpose,
  • 16:42we use Bayesian model selection approaches,
  • 16:46and specifically we use the WAIC,
  • 16:49or the widely applicable information criterion.
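As a rough illustration of how WAIC can be computed from MCMC output, the sketch below takes a matrix of pointwise log-likelihoods (one row per posterior draw, one column per spot) and returns WAIC on the deviance scale, where lower is better. The input layout and function name are assumptions; in practice one would fit the model for several values of K and compare the resulting values.

```python
import math


def waic(log_lik):
    """WAIC from pointwise log-likelihoods log_lik[s][i]
    (draw s, observation i), reported on the deviance scale."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        col = [log_lik[s][i] for s in range(n_draws)]
        # Log of the posterior mean likelihood, via log-sum-exp.
        m = max(col)
        lppd += m + math.log(sum(math.exp(v - m) for v in col) / n_draws)
        # Effective number of parameters: posterior variance of the log-lik.
        mean = sum(col) / n_draws
        p_waic += sum((v - mean) ** 2 for v in col) / (n_draws - 1)
    return -2.0 * (lppd - p_waic)


# With no variation across draws the penalty is zero, so WAIC = -2 * (-3).
constant = [[-1.0, -2.0], [-1.0, -2.0], [-1.0, -2.0]]
waic_constant = waic(constant)
```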
  • 16:55In Bayesian mixture models it's very common
  • 16:57to observe the label switching problem.
  • 17:00So to protect against the label switching issue
  • 17:03in the MCMC sampler, we use the canonical projection of z
  • 17:08following Peng and Carvalho, 2016.
  • 17:13And finally for the actual implementation,
  • 17:17we use Rcpp
  • 17:19to further improve the computation efficiency.
  • 17:27We implemented the proposed model as an R package, SPRUCE,
  • 17:33and it's currently available from our GitHub page
  • 17:38here.
  • 17:40And the figure shows our GitHub page.
  • 17:45When we developed our software,
  • 17:49one of the popular software tools for pre-processing
  • 17:53and analyzing the HST data
  • 17:57is the Seurat workflow.
  • 18:00So when we developed our software,
  • 18:02we provided integration with the Seurat workflow
  • 18:05so that our software can be embedded
  • 18:10as part of the (mumbles) flow.
  • 18:12So for example,
  • 18:14the data can be loaded into R using Seurat,
  • 18:19and then people can apply the pre-processing
  • 18:23using the Seurat workflow.
  • 18:26And then that object
  • 18:26can be fed into the SPRUCE analysis workflow.
  • 18:31And then the output from SPRUCE can, again,
  • 18:34be fed into the Seurat workflow for visualization
  • 18:39and downstream analysis.
  • 18:46So first for the simulation,
  • 18:50the first for the simulation is about the...
  • 18:54Has the two purposes.
  • 18:56So first one is to assess the validity
  • 18:59of the parameter estimation algorithm.
  • 19:02And second is to quantify the effect
  • 19:05of ignoring skewness and spatial information.
  • 19:09So in order to make our simulation more realistic,
  • 19:13we use the sagittal mouse brain data as the tissue shape
  • 19:18and the spot location.
  • 19:20And we simulated the full clusters
  • 19:23from the multivariate skew-normal distribution
  • 19:27with the 16 genes.
  • 19:31We considered the 26...
  • 19:352696 spots.
  • 19:38And then we considered three models,
  • 19:41including the multivariate normal,
  • 19:44multivariate skew-normal,
  • 19:45and with no skew-normal with no spatial.
  • 19:49So first one shows the implication
  • 19:51of inadequate study of skewness and spatial.
  • 19:55Second shows the implication
  • 19:58of ignoring the spatial structure.
  • 20:00And the final was our proposed model.
  • 20:05And here the top left figure,
  • 20:08shows the true cluster labels.
  • 20:11And the top right shows
  • 20:12the UMAP reduction of the gene expression pattern.
  • 20:18And as you can see, we made the orange and the green,
  • 20:22which are far away from each other,
  • 20:24similar in gene expression,
  • 20:26so that the prediction is more challenging.
  • 20:30And we assessed the performance of each model
  • 20:35using the adjusted Rand index (ARI), where a value close to one
  • 20:39indicates better performance.
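For readers unfamiliar with the ARI, the following is a small self-contained implementation. It is a generic sketch of the standard pair-counting formula, not code from the talk; the example label vectors are made up.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same spots."""
    n = len(labels_true)
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    # Number of spot pairs placed together under both / each labeling.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both
    return (sum_pairs - expected) / (max_index - expected)


truth = [0, 0, 1, 1, 2, 2]
# Same partition with permuted label names: ARI ignores the names.
ari_same = adjusted_rand_index(truth, [2, 2, 0, 0, 1, 1])
# A partition that splits every true cluster scores below zero here.
ari_mixed = adjusted_rand_index(truth, [0, 1, 0, 1, 0, 1])
```

Because the index depends only on which spots are grouped together, it is unaffected by how clusters are numbered, which is convenient when comparing methods that label clusters arbitrarily.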
  • 20:42And as you can see here, when we ignore
  • 20:47the skewness and the spatial pattern,
  • 20:51there is a big loss in the ARI.
  • 20:55And by considering the skewness,
  • 20:57we gain some, but there is still a loss.
  • 21:00And by further considering the spatial pattern,
  • 21:04we achieve a high level of ARI.
  • 21:11And for the real data application,
  • 21:14we consider the two applications.
  • 21:18So,
  • 21:20to compare the performance of the SPRUCE to existing tools,
  • 21:26we used the 10X Visium human brain data
  • 21:30from Maynard et al., 2021, Nature Neuroscience.
  • 21:36Here we have about 3000 spots.
  • 21:40And one good aspect of this data is that
  • 21:45it's very well annotated.
  • 21:48So the authors,
  • 21:51using their expert knowledge,
  • 21:54annotated the 3000 spots into 5 brain layers,
  • 22:00including the white matter and the frontal cortex layers.
  • 22:05And as I mentioned earlier,
  • 22:06we use the standard Seurat pre-processing pipeline,
  • 22:11including normalization using SCTransform
  • 22:16and also selection of the most variable genes
  • 22:20using the existing pipeline.
  • 22:22We consider the top 16 most variable genes.
  • 22:29And we also consider the three other existing algorithms
  • 22:34including BayesSpace, stLearn, Seurat and Giotto
  • 22:40as the competing algorithms.
  • 22:42And we use the default parameters for each of them.
  • 22:49Here it shows the regions
  • 22:52and top left figure shows the manual annotation
  • 22:54provided by the author in the paper.
  • 22:58And you can see the nice, five spatial clusters
  • 23:03from inside out.
  • 23:05And also there you can see
  • 23:08that there is one, narrow cell cluster
  • 23:11corresponding to the number four.
  • 23:16Here we show the results for SPRUCE,
  • 23:18BayesSpace, stLearn, Seurat and the Giotto.
  • 23:24And in this case, the network-based approaches,
  • 23:28including the stLearn, Seurat and the Giotto,
  • 23:32all showed a lower performance compared to those algorithms.
  • 23:38The BayesSpace showed relatively higher performance
  • 23:42with an ARI of about 0.55.
  • 23:46SPRUCE further improved the performance
  • 23:49compared to the BayesSpace.
  • 23:52And one thing I noted here is that...
  • 23:58the narrow cell cluster
  • 24:00could be identified by SPRUCE,
  • 24:04which is interesting.
  • 24:06And as the second example:
  • 24:10the first one is labeled data, so
  • 24:13we can compare our prediction to the existing annotation.
  • 24:17And to further demonstrate the application of the SPRUCE
  • 24:22to unlabeled data, we analyze the publicly available
  • 24:26human invasive ductal carcinoma breast tissue.
  • 24:31Again using the 10 X Visium platform.
  • 24:36And we essentially followed the similar workflow
  • 24:38and we identify the top 16 most spatially variable genes.
  • 24:45And those included several tumor-associated antigens,
  • 24:50or TAAs, including GFRA1 and CXCL14.
  • 24:56And there is also a tumor suppressor gene,
  • 25:00like MALAT1.
  • 25:04And we use the SPRUCE to identify the 5 sub regions
  • 25:10using these 16 features.
  • 25:12This shows the 16 most variable genes.
  • 25:16And you can see that there are very clear spatial patterns.
  • 25:22For example, CXCL14 and GFRA1
  • 25:28are expressed on the bottom right side,
  • 25:30while MALAT1 is expressed higher on the top left side.
  • 25:38And this is the cluster prediction
  • 25:42made by the SPRUCE algorithm.
  • 25:46And you can see that it identified
  • 25:48cluster 2, which highly coincides with CXCL14
  • 25:55and GFRA1.
  • 25:59(mumbles)
  • 26:01While cell cluster 1
  • 26:05is MALAT1,
  • 26:09which is more of a tumor suppressor.
  • 26:13So here we can see that the SPRUCE can identify
  • 26:17the different group of the tissue architecture,
  • 26:20such as the tumor suppressor and then tumor related
  • 26:25(mumbles)
  • 26:33And we can also easily look at the
  • 26:37within-cluster expression pattern
  • 26:40and gene-gene correlation.
  • 26:43As you could see earlier,
  • 26:44cell cluster 2, which corresponds to the bottom right,
  • 26:49is higher in GFRA1 and CXCL14.
  • 26:52Cluster 1, which corresponds to here, is high in MALAT1,
  • 26:57and so on.
  • 26:58And also, in the case of cell cluster 2,
  • 27:02there's a very strong gene-gene correlation pattern.
  • 27:06So this supports the proposed model, which considers
  • 27:11spatial pattern and also gene-gene correlation
  • 27:14simultaneously.
  • 27:20So,
  • 27:21so far I discussed the method
  • 27:26for our SPRUCE and its application.
  • 27:33And we expanded our work a little bit more
  • 27:38to MAPLE,
  • 27:39which is the multi-sample spatial transcriptomics model.
  • 27:44Why do we care about the multi-sample analysis of HST data?
  • 27:49So currently most algorithms are designed in a way
  • 27:53that focuses on a single sample.
  • 27:57But even intuitively,
  • 27:59joint analysis of the multiple HST data
  • 28:03can potentially boost the signal
  • 28:06by sharing the information amongst samples.
  • 28:09And also the joint analysis of the different samples
  • 28:13can allow the differential analysis of the HST data.
  • 28:18So very often, each tissue is not our main interest.
  • 28:24But we also want to compare tissue architecture
  • 28:27between the different samples.
  • 28:30For example, between the disease group versus the controls,
  • 28:35responders versus non-responders to certain treatments,
  • 28:39such as the cancer immuno-therapy.
  • 28:41So to address this limitation, we proposed MAPLE.
  • 28:47And actually our existing SPRUCE framework
  • 28:50already allows this one naturally.
  • 28:55So, simply what it can do is
  • 28:57instead of now analyzing each sample individually,
  • 29:01we can jointly analyze all the samples together.
  • 29:05And then by doing so,
  • 29:06we can share information
  • 29:08about the modeling of each cell spot cluster,
  • 29:13and also their spatial pattern.
  • 29:17But by introducing the sample-level covariate x i
  • 29:22in the cell type composition,
  • 29:27we can see the impact
  • 29:29of the different sample-level covariate.
  • 29:33Which I show more in detail in the coming slides.
  • 29:41So the first application is the same mouse brain data,
  • 29:45the human brain data...
  • 29:47Sorry this should be the mouse brain,
  • 29:49and here we see the two anterior parts,
  • 29:54which look very similar.
  • 29:56And then as you can see here,
  • 29:57when we jointly analyze the two samples
  • 30:01corresponding to the same part of the brain,
  • 30:04it nicely identifies the corresponding parts
  • 30:08between the two samples.
  • 30:10Like one in the end, three on the top,
  • 30:14five at the bottom and so on.
  • 30:17And because this is a Bayesian framework,
  • 30:21it can also provide uncertainty measures
  • 30:25about our clustering prediction.
  • 30:28And as you can see, usually there is more uncertainty
  • 30:31about the clustering prediction
  • 30:35around the boundary between different cell clusters.
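One simple way to turn MCMC draws of the cluster labels into a per-spot uncertainty score is sketched below. It assumes that label switching has already been resolved and that the draws are stored as a list of label vectors; both the storage layout and the score (one minus the posterior probability of the most frequent label) are illustrative choices, not necessarily what MAPLE reports.

```python
from collections import Counter


def assignment_uncertainty(z_draws):
    """For each spot, return (most probable label, 1 - its posterior
    probability), where z_draws[s][i] is the label of spot i at draw s."""
    n_draws = len(z_draws)
    n_spots = len(z_draws[0])
    result = []
    for i in range(n_spots):
        counts = Counter(z_draws[s][i] for s in range(n_draws))
        label, count = counts.most_common(1)[0]
        result.append((label, 1.0 - count / n_draws))
    return result


# Two spots, four draws: spot 0 is always in cluster 1,
# spot 1 is in cluster 2 for three of the four draws.
z_draws = [[1, 2], [1, 2], [1, 3], [1, 2]]
summary = assignment_uncertainty(z_draws)
```

Spots near a boundary between clusters would tend to flip labels across draws and therefore receive a higher score.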
  • 30:38Which kind of makes sense,
  • 30:40because we expect that maybe cell type
  • 30:43might be more mixed together in the same cell spot.
  • 30:48Also, there are some cell clusters
  • 30:50with the higher level of the uncertainty
  • 30:53of which we are still trying to understand more
  • 30:56at this point.
  • 30:58And this kind of the figure is the...
  • 31:01what utility of this kind of joint analysis.
  • 31:06So, for identifiability,
  • 31:09we set the first cell cluster as the reference.
  • 31:14And then here we see the two (mumbles)
  • 31:16The top one shows the intercept,
  • 31:20and then we can interpret this one as the relative size
  • 31:25of each cell cluster.
  • 31:26So then compared to the one,
  • 31:29we can say three and the six are larger.
  • 31:32So the three and the six are larger, compared to the one.
  • 31:36While four is smaller,
  • 31:39well, just smaller compared to one.
  • 31:40So this is what you can see by eye
  • 31:45from the tissue prediction region.
  • 31:48But good thing is that this model allows us to quantify,
  • 31:52what you see by eye.
  • 31:55And what is more interesting is the second one.
  • 31:57So this one,
  • 31:58is about the difference between the two samples.
  • 32:02So again,
  • 32:04so basically if it's higher,
  • 32:07then it means that certain tissue spot cluster
  • 32:12getting bigger in the second sample.
  • 32:14And if it's lower, it means it is kind of smaller
  • 32:18in the second sample and so on.
  • 32:20So in this way,
  • 32:21we can quantify the change of the tissue architecture
  • 32:26between different cell clusters.
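The interpretation of the intercepts and the sample-level coefficients can be made concrete with a toy calculation. This sketch assumes a multinomial-logit form with the first cluster as the reference and a binary covariate marking the second sample; spatial random effects are left out for simplicity, and all numeric values are hypothetical rather than estimates from the talk.

```python
import math


def cluster_proportions(intercepts, slopes, x):
    """Expected cluster proportions for a sample with covariate x.
    Cluster 1 is the reference, with its linear predictor fixed at 0."""
    eta = [0.0] + [a + b * x for a, b in zip(intercepts, slopes)]
    m = max(eta)
    w = [math.exp(e - m) for e in eta]
    total = sum(w)
    return [v / total for v in w]


# Hypothetical values: cluster 2 is larger and cluster 3 smaller than
# the reference cluster 1; cluster 3 grows in the second sample.
intercepts = [0.8, -0.5]
slopes = [0.0, 1.2]
first_sample = cluster_proportions(intercepts, slopes, x=0)
second_sample = cluster_proportions(intercepts, slopes, x=1)
```

A positive intercept means the cluster is larger than the reference, and a positive slope means it takes up a larger share in the second sample, which matches the reading of the two panels described in the talk.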
  • 32:30And another interesting example is this one.
  • 32:35So here, in addition to the two anterior samples,
  • 32:39we now also look at the posterior sample as well.
  • 32:44So because these are two parts of the brain,
  • 32:48anterior and the posterior,
  • 32:50the tissue is kind of continuous between the two.
  • 32:53And as you can see here,
  • 32:54cell cluster three is connected to the posterior side here.
  • 33:00Cell cluster one is connected to here and so on.
  • 33:05And then this kind of pattern is not clear
  • 33:08if you analyze each data independently.
  • 33:12And our MAPLE framework nicely captures
  • 33:16such kind of sharing pattern.
  • 33:18And also the difference pattern
  • 33:20between the different samples, interestingly.
  • 33:23So at this point,
  • 33:25we are working on more simulation study
  • 33:27and the real data analysis
  • 33:29to further show the performance
  • 33:33and understand the properties of the MAPLE at this point.
  • 33:39So then I can summarize my presentation today.
  • 33:45So the high throughput spatial transcriptomics, or HST,
  • 33:50provides unprecedented opportunities
  • 33:54to investigate novel biological hypotheses,
  • 33:57such as the tumor microenvironment and certain structure
  • 34:05about the human brain and Alzheimer's,
  • 34:08and so on.
  • 34:10And here we propose SPRUCE,
  • 34:13a Bayesian multivariate mixture model
  • 34:16for HST data analysis.
  • 34:19SPRUCE has multiple strengths
  • 34:23including the novel combination
  • 34:26of the skew-normal density,
  • 34:29Polya-Gamma data augmentation,
  • 34:31and spatial random effect.
  • 34:35Altogether, it allows us to
  • 34:37precisely infer spatially correlated mixture component
  • 34:41membership probabilities.
  • 34:44In our simulation study and real data analysis,
  • 34:49we could see that SPRUCE outperforms the existing method,
  • 34:53in the tissue architecture identification.
  • 34:56And finally our recent extension of the MAPLE
  • 35:01allows the joint clustering and differential analysis
  • 35:05of multiple HST data.
  • 35:09So at this point SPRUCE is under review
  • 35:13(mumbles)
  • 35:15in Biometrics.
  • 35:17The corresponding manuscript is available on bioRxiv.
  • 35:22And there are multiple ongoing work
  • 35:25regarding the HST data modeling
  • 35:28in our lab.
  • 35:30So we are actually currently working on further improving
  • 35:35the SPRUCE and the MAPLE
  • 35:37by incorporating other characteristics
  • 35:39of the HST data, such as the relationships among cells.
  • 35:44For example,
  • 35:46we know that there are some ligands and receptors,
  • 35:50for example.
  • 35:51Which we expect that they interact with each other
  • 35:55in their cell structure.
  • 35:57And then by incorporating different prior information,
  • 36:01we can further improve the SPRUCE and MAPLE.
  • 36:06We are also working on the other statistical models
  • 36:10for somewhat relevant, but different tasks.
  • 36:14For example,
  • 36:15currently we are also working on a deep learning framework,
  • 36:19especially a graph neural network,
  • 36:22which is called RESEPT.
  • 36:24And then using the deep learning framework,
  • 36:28we try to come up with a good embedding
  • 36:30of the HST gene expression pattern.
  • 36:34Our current results show that such a combination
  • 36:38of deep learning and the statistical modeling approach
  • 36:41can provide nice prediction performance.
  • 36:47For this purpose, we developed a framework called RESEPT,
  • 36:52and the corresponding bioRxiv preprint
  • 36:55is also available publicly.
  • 36:57And the corresponding paper
  • 36:59is now under revision in Nature Communications.
  • 37:06Regarding cell-cell communications,
  • 37:09using network-based approaches has some benefit
  • 37:12because the cell-cell communication can be nicely
  • 37:16and naturally modeled using a graph or network.
  • 37:21So we have the parallel work called Banyan
  • 37:25to identify the cell-cell communication
  • 37:27and tissue architecture using the network-based approaches.
  • 37:31And finally, there are the multiple effort experimentally
  • 37:37to generate the spatial multimodal data.
  • 37:42For example,
  • 37:43the effect to seek such as the single cell genomics,
  • 37:48proteomics and the T-cell receptor at the same time.
  • 37:53And very soon,
  • 37:54everything is expected to be combined
  • 37:57as the spatial transcriptomic structure.
  • 38:01We are working on the direction
  • 38:03to develop the statistical model
  • 38:06for integration of the HST data with other matched data.
  • 38:13So I would like to acknowledge my research team at OSU.
  • 38:18Carter Allen is the main driver of this project,
  • 38:23and also my pitch assistant
  • 38:27Qin Ma and Yuzhou Chang are my close collaborators
  • 38:32for the HST data modeling project.
  • 38:36And Zihai Li,
  • 38:38who is the director of the Immuno-Oncology Institute
  • 38:42and also the expert in cancer.
  • 38:46Won Chang at the University of Cincinnati
  • 38:49who is the spatial statistics expert,
  • 38:53and MUSC collaborator Brian Neelon
  • 38:56and my grant support.
  • 39:00So, and this is the end of my presentation,
  • 39:03and you can find my manuscript
  • 39:05and the software from the link here.
  • 39:10If you have any questions or comments,
  • 39:13please let me know by email at chung.911@osu.edu.
  • 39:17So thank you for your attention.
  • 39:29<v ->So thank you.</v>
  • 39:32Do we have any questions from the audience in the classroom,
  • 39:36or from the audience on zoom?
  • 39:43<v ->Can I ask a question?</v>
  • 39:45Can you hear me?
  • 39:46<v ->Yes, mm-hm.</v>
  • 39:47<v ->Right, Dongjun welcome back.</v>
  • 39:50Great work, it's a nice presentation.
  • 39:52I'm just wondering, like,
  • 39:53when you do this from your own experience
  • 39:56on the cell clustering,
  • 39:57how much the spatial information contributes
  • 40:00to the clustering.
  • 40:03<v ->Sure.</v>
  • 40:09So,
  • 40:10(mumbles)
  • 40:15If you're here,
  • 40:16so if you look at the Seurat workflow,
  • 40:18you can see there's still a lot of the, kind of,
  • 40:22local boundary between different cell spot clusters.
  • 40:28And when you analyze the same data using the SPRUCE,
  • 40:32you can see much cleaner boundary.
  • 40:34And often it coincides with the
  • 40:36expert annotation.
  • 40:40So given that, there is a significant contribution.
  • 40:46Of course, even with the gene expression alone,
  • 40:49we still get the big picture, as you can see here.
  • 40:52But spatial information provides a much cleaner prediction
  • 40:57about the tissue architecture in general.
  • 41:01<v ->I see.</v>
  • 41:02And also the skewness.
  • 41:05Do you estimate that, or is the skewness,
  • 41:08like, pre-specified?
  • 41:12<v ->You mean which one?</v>
  • 41:14<v ->The alpha k in the model.</v>
  • 41:16In your model specification, the alpha k you have there.
  • 41:19I missed that part.
  • 41:20Like, do you need to specify the skewness?
  • 41:24<v ->Oh, it's learned from the data.</v>
  • 41:26<v ->Oh, I see.</v>
  • 41:27But from the data, how skewed is it?
  • 41:29I mean, just in terms of how stably
  • 41:31that alpha k can be estimated.
  • 41:36<v ->So maybe I can answer it in two different ways.</v>
  • 41:40So if there is this skewness in the data, I think yes.
  • 41:44So I would say it also depends on how the data are processed.
  • 41:49So usually there are three different approaches
  • 41:51to model the HST data, including spatially variable genes.
  • 41:57And so you can see here,
  • 41:58there are people using the principal components.
  • 42:02There are people who use deep learning
  • 42:05as the embedding step.
  • 42:08If you use deep learning or PCA,
  • 42:12it's more likely to be symmetric in the real data.
  • 42:16If you consider the spatially variable genes,
  • 42:20we often happen to have skewness, as you can see here.
  • 42:24And then regarding your question, overall it works well.
  • 42:30I don't have the exact quantification, but it works well.
  • 42:34It is estimated stably in most cases.
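The contrast drawn in this answer, between roughly symmetric embedding features and skewed gene-level features, can be checked with a simple moment-based skewness statistic. The following is a minimal sketch only: the function name `sample_skewness` and the toy values are hypothetical, and this is a descriptive diagnostic, not the way the model in the talk estimates its skewness parameter.

```python
def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (0 for symmetric data)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined for constant data")
    return m3 / m2 ** 1.5


# Hypothetical toy features (not data from the talk).
symmetric_feature = [-2.0, -1.0, 0.0, 1.0, 2.0]   # stands in for a PCA-like score
right_skewed_feature = [0.0, 0.0, 0.0, 0.0, 5.0]  # stands in for a sparse gene-level feature

print(sample_skewness(symmetric_feature))
print(sample_skewness(right_skewed_feature))
```

A value near zero suggests that symmetric mixture components may be adequate for that feature, while a clearly positive or negative value suggests that a skewed component distribution could fit better.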
  • 42:37<v ->Yeah, I read the BayesSpace paper.</v>
  • 42:39They seem to be working on the principal components, right?
  • 42:41They do not work on individual genes, right?
  • 42:43<v ->No, yeah.</v>
  • 42:44They base this on the PCA.
  • 42:46<v ->Yeah, that's why it was completely puzzling me</v>
  • 42:47why you're doing that.
  • 42:48But anyway, yeah.
  • 42:49Thank you.
  • 42:50<v ->Yeah so, so...</v>
  • 42:51(mumbles)
  • 42:54so they mainly target the PCA.
  • 42:55So they can only consider the multivariate distribution.
  • 42:59And also because of the same reason,
  • 43:01their equivalence metrics means less density.
  • 43:05<v ->I see.</v>
  • 43:08Thank you.
  • 43:09<v ->Thank you.</v>
  • 43:22<v ->Do we have any questions from students in the classroom?</v>
  • 43:32<v ->Wait, can I ask another question?</v>
  • 43:36So, towards the end,
  • 43:37you mentioned you tried
  • 43:38to look at the cell-cell communication.
  • 43:45That part.
  • 43:46I'm very interested in that.
  • 43:49From our experience on the single cell spatial data...
  • 43:55Are you talking about learning from the single cell,
  • 43:58or the spatial single cell?
  • 44:02<v ->So, regarding the cell-cell communication,</v>
  • 44:05it's still very much ongoing research at this point.
  • 44:10I mean, not just on our side but in general.
  • 44:13Because most of the cell-cell communication prediction
  • 44:17is based on databases.
  • 44:20So it is based on, like, a ligand-receptor
  • 44:22pair database, and checking
  • 44:25their corresponding expression in the corresponding spot
  • 44:28or the cell.
  • 44:29And then by checking the corresponding pair
  • 44:33of expression patterns
  • 44:34between the ligand and the receptor,
  • 44:36they want to model cell-cell communication.
  • 44:40It's not perfect, as you know,
  • 44:42because it's, like, computational.
  • 44:45If you look at ChIP, it's almost like
  • 44:48(mumbles)
  • 44:48but more like motif analysis.
  • 44:51So there are some limitations,
  • 44:53but it's more like a general limitation at this point.
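The database-driven approach described in this answer can be illustrated with a small scoring function: for one ligand-receptor pair taken from a curated database, it averages the product of ligand expression in a spot and receptor expression in each neighboring spot. This is a sketch under assumptions: `lr_coexpression_score`, the spot identifiers, and the expression values are all hypothetical, and the product-of-expression score is a simple illustrative choice rather than the method of any particular tool.

```python
def lr_coexpression_score(ligand_expr, receptor_expr, neighbors):
    """Average ligand(sender) * receptor(receiver) over neighboring spot pairs.

    ligand_expr / receptor_expr: dict mapping spot id -> expression level.
    neighbors: dict mapping spot id -> list of adjacent spot ids.
    """
    total = 0.0
    n_pairs = 0
    for sender, adjacent in neighbors.items():
        for receiver in adjacent:
            total += ligand_expr[sender] * receptor_expr[receiver]
            n_pairs += 1
    # With no neighboring pairs there is nothing to score.
    return total / n_pairs if n_pairs else 0.0


# Hypothetical two-spot example: spot "a" expresses the ligand,
# spot "b" expresses the receptor, and the two spots are adjacent.
ligand = {"a": 2.0, "b": 0.0}
receptor = {"a": 0.0, "b": 3.0}
adjacency = {"a": ["b"], "b": ["a"]}

score = lr_coexpression_score(ligand, receptor, adjacency)
print(score)
```

Because the score relies only on expression levels of database-listed pairs, noisy or sparse measurements of either gene directly weaken it, which is consistent with the limitation raised in the discussion.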
  • 44:57<v ->Yeah,</v>
  • 44:58I'm asking because we've been looking
  • 44:59at some of the spatial single cell data
  • 45:02that were too noisy for the ligand
  • 45:04and receptor gene expression levels.
  • 45:07Just couldn't make it too far.
  • 45:09(mumbles)
  • 45:11But for a single cell, may be different?
  • 45:13I mean, probably there'll be more that, like...
  • 45:17<v ->Yeah, totally agree.</v>
  • 45:18I mean, so if you go to high resolution,
  • 45:22it's very noisy,
  • 45:24so very often we need to do some simplification.
  • 45:28Like looking at multi-modal or the cell cluster,
  • 45:31rather than the cell.
  • 45:34There are still many experimental limitations
  • 45:38at this point.
  • 45:39(mumbles)
  • 45:40Thank you.
  • 45:51(class teacher addresses classroom)
  • 46:00<v ->On the data from multiple samples</v>
  • 46:03So, if we have samples from...
  • 46:06(mumbles)
  • 46:18<v ->Oh yeah, that's a very good question.</v>
  • 46:22So,
  • 46:23actually we can answer in two different ways.
  • 46:29In some sense,
  • 46:30good pre-processing is still important
  • 46:35because it still depends on the expression patterns.
  • 46:43But still, regarding the differences
  • 46:46between the different tissues,
  • 46:48if there is a big difference,
  • 46:49it can still detect the difference
  • 46:51between the different samples.
  • 46:54So, it can detect spots.
  • 46:55But still, the main goal is more
  • 46:59for similar types of tissue.
  • 47:01If it's too different,
  • 47:02maybe it's a different research project.
  • 47:05So, for example,
  • 47:07here our target is more about, for example,
  • 47:10like the same breast tissue,
  • 47:13but with different responder and non-responder groups,
  • 47:18for example.
  • 47:19Or like the same lung tissue, but tumor versus non-tumor,
  • 47:24and so on.
  • 47:25If it's like human and mouse,
  • 47:29then it might be a somewhat different story,
  • 47:33which might need much more work.
  • 47:38<v ->Do we have any more questions here?</v>
  • 47:58Okay, do we have any other questions
  • 47:59from the audience on Zoom?
  • 48:21Okay, so it looks like we don't have any more questions.
  • 48:26So Dr. Chung, thank you again for your nice presentation.
  • 48:31Look forward to meeting in person sometime soon.
  • 48:36<v ->And then thank you again Wei and Hongyou</v>
  • 48:38for the invitation
  • 48:40and it's great to come back, although virtually.
  • 48:43And I hope to see you again.
  • 48:46<v ->We'll come by in person.</v>
  • 48:49<v ->Hopefully someday soon.</v>
  • 48:53Okay, thank you.