# YSPH Biostatistics Seminar: “Feature Aggregation in Causal Discovery for High-dimensional Data: Application to Targeting the “Gut-Brain-Axis” via the Microbiome Diversity"

September 21, 2023

## Information

Jinyuan Liu, PhD, Assistant Professor, Department of Biostatistics, Vanderbilt University Medical Center

September 19, 2023

ID10729


- 00:01<v ->All right, I'm very excited</v>
- 00:03to introduce our speaker for today.
- 00:04We have Dr. Meghan Short.
- 00:06Dr. Short has completed fellowships
- 00:08at the Glenn Biggs Institute for Alzheimer's
- 00:10and Neurodegenerative Diseases,
- 00:12and at Harvard's Huttenhower Lab.
- 00:14Currently, Dr. Short is an assistant professor
- 00:17at Tufts University.
- 00:18Let's give a warm welcome to Dr. Short.
- 00:31<v ->Hi, everyone. Thank you for being here.</v>
- 00:34Can you all hear me, okay?
- 00:35<v ->Sign in if you're registered.</v>
- 00:38<v ->All right, so, today, I'm going to talk about a project</v>
- 00:41that I worked on as part of my postdoc
- 00:43down at UT Health San Antonio
- 00:46with the Glenn Biggs Institute for Alzheimer's
- 00:49and Neurodegenerative Diseases,
- 00:51and I wanted to talk about this as a...
- 00:56None of the sort of methods that I'm gonna talk about
- 00:58in this talk are particularly new.
- 01:01This wasn't sort of a methods development project.
- 01:04So the sort of main network method I'll talk about
- 01:08is about a decade old at this point, at least,
- 01:10but what's nice about it is that
- 01:13with increasing availability
- 01:15of high dimensional biomedical data,
- 01:17it's sort of seeing more use cases,
- 01:20and it's not something that, at least, I learned about
- 01:22in my graduate program in biostatistics,
- 01:24but it's something that I thought
- 01:26would be good to talk about today
- 01:28since it's such a useful method.
- 01:32So let's see if I advance.
- 01:35There we go.
- 01:36So I'll start just by giving a quick introduction.
- 01:39I know that when I was in grad school, I always wanted,
- 01:43I thought it was interesting
- 01:44to hear about people's career paths
- 01:46as I was considering my own.
- 01:48So I started in biology as a field.
- 01:53I studied salt marsh ecology as an undergrad,
- 01:56and then by the end of undergrad,
- 01:58I was interested in getting more into sort of a human,
- 02:00more directly human-focused environment,
- 02:02and so I considered public health.
- 02:05I learned about statistics
- 02:06as part of my research in undergrad
- 02:08and wanted to continue with that so I participated in SIBS,
- 02:11which is a program that you may be aware of,
- 02:14and that was my first intro to biostat.
- 02:16I was a graduate student at Boston University.
- 02:19I had the good fortune of working
- 02:21with the Framingham Heart Study,
- 02:22which is where the data comes from
- 02:24that I'll be talking to you about today,
- 02:26which is a really interesting study,
- 02:27and I'll give more details on it in a few slides.
- 02:29That was sort of my introduction
- 02:31to working with epidemiological data.
- 02:34After grad school, I continued on,
- 02:36again, to UT Health San Antonio,
- 02:38and then following that to a postdoc at Harvard
- 02:42looking at developing methods for microbiome analysis.
- 02:47So if you have any interest in that,
- 02:49feel free to approach me,
- 02:51although I'm not gonna talk about that today,
- 02:54and then as of March this year,
- 02:57I started as an assistant professor at Tufts Medicine
- 03:00where I'm working on a variety of projects
- 03:02but a lot related to sort of omics data
- 03:06and aging and longevity.
- 03:12So I'll start today's talk with a bit of motivation
- 03:15for why network-based analyses were a good fit
- 03:18for looking at sort of the proteome in Alzheimer's disease.
- 03:24So first of all, Alzheimer's disease
- 03:27is a very prevalent condition.
- 03:30Many of you may be like me and know some family members
- 03:33or people who have been affected by it.
- 03:36It's very common and expected to be more so
- 03:40as populations age, and it's a leading cause of mortality,
- 03:44disability, and poor health among seniors,
- 03:47and one interesting feature of this disease
- 03:49is that precursors of it can appear years to decades
- 03:52before symptoms manifest.
- 03:55So those precursors can include indicators
- 03:58that are visible on brain MRIs,
- 04:01performance on neurocognitive testing, changes in gait,
- 04:05even changes in sense of smell,
- 04:08and cerebrospinal fluid markers, such as tau and amyloid.
- 04:17Because of this, there's interest in being able to find
- 04:21plasma biomarkers for Alzheimer's disease
- 04:24and related dementias.
- 04:25ADRD is an acronym we'll be using sort of throughout.
- 04:30Since there are indicators
- 04:33of sort of pre-disease development
- 04:35years to decades before, being able to detect those,
- 04:37either earlier or in a less invasive or expensive way,
- 04:40is very useful,
- 04:44and so when I say invasive, I mentioned CSF markers,
- 04:50such as tau and amyloid, can predict dementia,
- 04:53but that involves doing a lumbar puncture
- 04:56versus something like a blood draw, which is easier to do.
- 05:01Another good aspect of trying to find biomarkers
- 05:04is that you can get a sense of biological processes
- 05:07that are involved in disease development,
- 05:10and that can hopefully lead to either preventative
- 05:13or therapeutic interventions.
- 05:19What makes this difficult?
- 05:21So in my case, I was looking at proteins.
- 05:24There are thousands and thousands to select from,
- 05:27and you get sort of this inherent trade off
- 05:30between trying to control a false positive rate
- 05:33for all these multiple tests that you may be performing,
- 05:36but if you effectively control the false positive rate,
- 05:39you're going to likely end up with low statistical power.
- 05:42There's this trade off between...
- 05:45It's sort of a needle in a haystack.
- 05:47Another thing that has tended to be true
- 05:50is that there is not very good replicability across studies.
- 05:53So one study may find 20 biomarkers
- 05:57and maybe one or two of them
- 05:59may replicate in a different study.
- 06:01So there's a lot of noise that ends up coming through.
- 06:08The approach that I took in this project
- 06:10was to use network analysis
- 06:13to analyze the protein data,
- 06:17and the motivation there is to try and capture
- 06:20subtle but consistent variation in groups of proteins.
- 06:23I'll refer to them as modules during this talk.
- 06:28It then does a few things, so first of all,
- 06:30it reduces the dimensionality
- 06:32of the statistical testing problem that you have.
- 06:35So rather than testing each protein individually
- 06:37and having to adjust for all of those multiple tests,
- 06:40you can sort of reduce the space
- 06:43to a smaller number of tests
- 06:46where the proteins within each group being tested
- 06:49are inter-correlated with one another,
- 06:52and unlike other dimensionality reduction methods,
- 06:55something like a principal components analysis
- 06:57that you may be familiar with,
- 06:59the network method has sort of a benefit of looking
- 07:03not just at, say, correlations
- 07:06or relationships between pairs of proteins,
- 07:09but, also, at sort of the correlational neighborhood
- 07:11of what common neighbors
- 07:13those proteins share in the network.
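To make the multiple-testing argument concrete, here is a minimal sketch comparing per-test Bonferroni thresholds when testing every protein versus testing a handful of modules. The counts (1,300 proteins, 4 modules) are the ones mentioned later in the talk; the 0.05 family-wise level is an assumption for illustration.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


ALPHA = 0.05  # assumed family-wise error rate

per_protein = bonferroni_threshold(ALPHA, 1300)  # one test per protein
per_module = bonferroni_threshold(ALPHA, 4)      # one test per module

print(f"per-protein threshold: {per_protein:.2e}")
print(f"per-module threshold:  {per_module:.4f}")
```

The per-module threshold is several hundred times less strict, which is where the gain in power comes from.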
- 07:18Another benefit of or sort of way
- 07:22that we try to get around some of the pitfalls
- 07:23of proteomic analysis is by focusing on biological pathways
- 07:29instead of on individual proteins themselves.
- 07:32So within groups of proteins that we find to be of interest
- 07:36or possibly associated with dementia outcomes,
- 07:40we use a tool called over-representation analysis,
- 07:43which I'll talk about later,
- 07:45but it essentially tries to pinpoint biological pathways
- 07:48that may be overrepresented by the proteins
- 07:51that are found to be associated with the outcome,
- 07:54and the hope there is to find,
- 07:56to get sort of insights that are more robust across studies
- 08:01and, hopefully, address some of the issues
- 08:03with replicability.
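Over-representation analysis is commonly implemented as an upper-tail hypergeometric test. The talk does not say which tool or test was used, so the sketch below is an assumption about the general technique, and the pathway size and overlap in the example are hypothetical.

```python
from math import comb


def ora_p_value(n_background: int, n_pathway: int, n_module: int, n_overlap: int) -> float:
    """Probability of seeing at least n_overlap pathway proteins in a module
    of size n_module drawn at random from n_background measured proteins."""
    total = comb(n_background, n_module)
    upper = min(n_pathway, n_module)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# 1,300 measured proteins and a 165-protein module come from the talk;
# the 40-protein pathway and the overlap of 15 are hypothetical.
p_example = ora_p_value(1300, 40, 165, 15)
print(p_example)
```

Using the measured proteins as the background, rather than the whole proteome, matters here because the assay panel is not an agnostic sample of proteins.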
- 08:08Okay, so that's sort of the motivation for this study,
- 08:11and, now, I'll talk a little bit about the data.
- 08:18The data for this study
- 08:19comes from the Framingham Heart Study,
- 08:22which has been going on for a very long time.
- 08:24It started in 1948 in the town of Framingham, Massachusetts,
- 08:29and at the time they enrolled,
- 08:31they reached out to two-thirds of the population of the town
- 08:34to try and enroll them in this epidemiological study.
- 08:36It was one of the first ones of its kind,
- 08:39and people would come in for exams every few years,
- 08:42and they would take all of this information about them,
- 08:45and then follow them for outcomes.
- 08:47Cardiovascular outcomes was really
- 08:49the sort of outcome of interest when it first started.
- 08:53Over the years, they've then enrolled offspring
- 08:57of the original cohort participants
- 08:59as well as grandchildren and third generation,
- 09:02and then as sort of the demographics
- 09:06of Framingham have changed over the years,
- 09:09if you're only enrolling descendants
- 09:10of people who lived there in 1948,
- 09:12you're not gonna capture that.
- 09:13So they also have been enrolling omni cohorts
- 09:15to reflect sort of more diverse populations (indistinct).
- 09:21Again, they were sort of aiming
- 09:23towards identifying risk factors
- 09:25and etiologies of cardiovascular disease,
- 09:29but as those populations age,
- 09:31brain health and cognition is also an important outcome,
- 09:34and so they've measured sort of cognitive outcomes
- 09:39and incidence of dementia as well, and, of course,
- 09:41those things are also related to cardiovascular.
- 09:48For our study in particular,
- 09:51we were using the offspring cohort,
- 09:53and at their examination cycle five,
- 09:55which was in the early 90s, they collected blood samples,
- 10:00and froze the plasma from those samples,
- 10:03and years later, when they sort of had
- 10:06these broader proteomic analysis assays available,
- 10:11they measured the plasma proteome,
- 10:14I'll talk about the methods for that on the next slide,
- 10:18but they did this in about 1,900 participants
- 10:21who were approximately aged 55 when the blood was drawn.
- 10:24So this is sort of a middle-aged cohort,
- 10:26generally, cognitively healthy
- 10:29and a little more than half women.
- 10:33The main outcomes that we looked at in this study
- 10:35are MRI-based measures, so brain MRIs were taken
- 10:41about 10 years or so, five to 10 years
- 10:45after the initial blood draws, and those had...
- 10:51The sort of outcomes that I looked at there are
- 10:54total brain volume as well as the volume of the hippocampus
- 10:57and then a measure called white matter hyperintensities,
- 11:01which is sort of a measure of vascular injury in the brain,
- 11:06and a reason to look at those outcomes is that
- 11:10I mentioned there are sort of precursors of dementia
- 11:13or risk factors for dementia that can be identified on MRI,
- 11:16those are some of the big ones.
- 11:19Especially since we had a middle-aged cohort,
- 11:22you may not see a lot of incident dementia,
- 11:24and so being able to detect proteins
- 11:27that are associated with some of those precursors
- 11:29is a way of getting at this issue.
- 11:34We did also look at incident dementia.
- 11:36So we had about 20 years of follow-up,
- 11:37which is one of the strengths of this,
- 11:40looking in this particular sample,
- 11:42and we had 128 incident cases of dementia
- 11:46of which 94 of them were classified
- 11:48as Alzheimer's type dementia.
- 11:53We also had a replication cohort.
- 11:55I mentioned the importance of replication,
- 11:58and so we worked with collaborators
- 12:00at the University of Washington and their cohort study
- 12:04called the Cardiovascular Health Study,
- 12:06which has sites, I think, four different sites around the US
- 12:09and has measures of the same proteomic platform
- 12:13and same outcomes that we're looking at in the study.
- 12:19The assay that we used to measure proteins
- 12:23is called SOMAScan.
- 12:24It's by this company called SomaLogic.
- 12:27They use these single-stranded DNA aptamers
- 12:29that are designed to specifically bind
- 12:31to different proteins, and you can sort of tag them
- 12:35that way and measure their concentrations.
- 12:38In our sample, the assay had 1,300 proteins,
- 12:42which is even sort of becoming dated now.
- 12:45I think the latest version
- 12:46has something like 7,000 proteins.
- 12:48So there's a lot that can be measured with this,
- 12:51but there is some sort of bias towards, I think,
- 12:57molecules that sort of have some evidence
- 12:59of being important in cardiovascular disease.
- 13:01So it's not an entirely sort of agnostic choice of proteins,
- 13:06but it does get a pretty wide range.
- 13:11Okay, so that's a description of the data,
- 13:15and, now, I want to dig in a bit
- 13:17to the network methods that we used.
- 13:20So this is sort of a graphical abstract
- 13:24from their original paper,
- 13:28describing this weighted gene
- 13:29correlation network analysis method.
- 13:32So that's what WGCNA stands for.
- 13:34I put gene in parentheses because they've started
- 13:37dropping that from the name when it gets used elsewhere
- 13:40because, originally, it was developed
- 13:42for gene expression data, but it's been found to have use
- 13:45in other high dimensional data sets as well,
- 13:48and so in our case, we're using it to analyze proteins,
- 13:52but the language here makes reference to gene expression.
- 13:57So just broadly, what this method does
- 14:01is you get a co-expression network,
- 14:04and I'll sort of give details on the next few slides,
- 14:08but the idea is that the network is based
- 14:10on co-occurrence or correlation in your sample.
- 14:14So there's not really information coming from outside.
- 14:17You're not even considering your outcome at all.
- 14:19It's just looking at the space of the proteins
- 14:21and which proteins are correlated with one another.
- 14:26Once you've identified this sort of network matrix,
- 14:29you use a hierarchical clustering algorithm
- 14:32to define modules.
- 14:34It's a little small here, but I'll show a bigger example.
- 14:37Basically, you have a dendrogram,
- 14:39and you see that sort of proteins
- 14:42are on this x-axis of this figure here.
- 14:44I'll do the mouse for people who are online.
- 14:48You get these sort of bands or groups of proteins
- 14:51that are highly correlated with one another
- 14:53and not correlated with other proteins.
- 14:58So that is where those sort of protein groups come from.
- 15:02Once you have those, you can use a numerical summary
- 15:06of each protein group as sort of a feature or a predictor
- 15:10in a regression or some sort of analysis
- 15:13to try and relate the modules or groups
- 15:15to external information.
- 15:16So that's how we relate our protein groups
- 15:20to dementia outcomes in this study.
- 15:24There's also the possibility
- 15:25of looking at relationships between modules.
- 15:28So I mentioned the modules in the network
- 15:32are highly inter-correlated
- 15:33within the proteins within themselves,
- 15:36but there may also be some correlation between modules,
- 15:38and that could be important to look at as well,
- 15:41and then within modules, you may have
- 15:44tens or hundreds of proteins, and so trying to figure out
- 15:47which proteins within those modules
- 15:50are driving any associations you see
- 15:52is sort of a final step that can be
- 15:55useful for getting sort of biological meaning
- 15:57out of these associations.
- 16:02So that's a broad overview.
- 16:03This is sort of a more graphical abstract from our study,
- 16:08and I'll sort of go through bit by bit
- 16:11the different pieces of the analysis.
- 16:15So, again, this WGCNA step is sort of the first step
- 16:18of getting from this protein expression matrix
- 16:20where you have sort of your proteins by participants,
- 16:24and using the sort of correlations in your sample
- 16:27to come up with these modules of co-expressed proteins.
- 16:33The first step in doing that
- 16:35is to make a pairwise correlation or similarity matrix.
- 16:39So if you have n proteins,
- 16:40then that becomes an n by n matrix
- 16:43where each cell is describing
- 16:45the similarity or correlation
- 16:47between protein i and protein j in your sample.
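As a minimal sketch of this first step, the similarity matrix can be built from pairwise Pearson correlations. The absolute value is used because that is the similarity the speaker describes for this study; the toy values are invented for illustration.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def similarity_matrix(proteins: list[list[float]]) -> list[list[float]]:
    """proteins[i] holds protein i's values across participants.
    Returns the n-by-n matrix of absolute pairwise correlations."""
    n = len(proteins)
    return [[abs(pearson(proteins[i], proteins[j])) for j in range(n)] for i in range(n)]


# Toy data: 3 proteins measured in 4 participants (values are made up).
toy = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
S = similarity_matrix(toy)
```

Because of the absolute value, proteins 0 and 2, which are perfectly negatively correlated, get a similarity of one.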
- 16:52You then use this to create
- 16:54what's called an adjacency matrix, which is,
- 16:56I'll talk about more in the next slide,
- 16:58but is sort of a more networky way
- 17:02of describing the association between proteins,
- 17:05and then a topological overlap matrix,
- 17:08which then takes into account
- 17:10not only the correlation between proteins
- 17:13but their shared neighborhood, and then, again,
- 17:16that is what is used to cluster the proteins.
- 17:23So to get into a bit more detail
- 17:25about sort of the network construction,
- 17:30again, you describe the network as an n by n matrix
- 17:33with the number of nodes or genes, proteins, et cetera,
- 17:36and, in our case, we use to describe the similarity,
- 17:39just a simple correlation,
- 17:42absolute value of the correlation,
- 17:44between a given node i and j.
- 17:48The adjacency is then a measure of whether or how strongly
- 17:51the nodes are connected in the network.
- 17:53So the idea being that
- 17:56nodes that have very high correlations
- 17:58are particularly interesting.
- 18:00Nodes that have moderate to low correlations
- 18:02are probably not informative
- 18:03is sort of the underlying idea,
- 18:07and so if you look at sort of this figure here,
- 18:12the correlation or similarity is on the x-axis,
- 18:15and then the adjacency is on the y, and so if you use
- 18:19what's called an unweighted network approach,
- 18:22you pick a threshold value, here, it's 0.8,
- 18:25and you say that anything with a similarity less than 0.8
- 18:28is considered to not be a connection in the network,
- 18:31and everything greater than 0.8
- 18:33is considered to be a connection.
- 18:34So it's sort of a binary yes or no.
- 18:38What WGCNA does that was novel
- 18:42was to introduce a weighting
- 18:45where sort of the downside of this unweighted metric is that
- 18:49if you have a correlation of 0.79,
- 18:52that could be useful to know, but it counts as a zero.
- 18:55So you're losing information,
- 18:57and so what the weighted network does
- 19:00is it uses a sort of power transformation
- 19:03to get from sort of the straight correlation
- 19:07shown in this red line,
- 19:08and sort of depending on this power value that you use,
- 19:12you weight more or less towards the higher correlations
- 19:16in your network, and when you fit this model
- 19:20or when you sort of build the network, your choice of beta
- 19:24is sort of one of the parameters that you choose going in,
- 19:27and there's ways to sort of measure
- 19:30which gives the best fit to the data.
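The two adjacency rules described here can be sketched in a few lines. The 0.8 threshold is the one from the slide; the power of 6 is only an assumed example value, and treating a similarity of exactly 0.8 as connected is also an assumption, since the talk only covers values below and above it.

```python
def hard_adjacency(similarity: float, threshold: float = 0.8) -> int:
    """Unweighted network: a connection either exists (1) or it does not (0)."""
    return 1 if similarity >= threshold else 0


def soft_adjacency(similarity: float, beta: float = 6.0) -> float:
    """Weighted network: raise the similarity to a power, so weak
    correlations shrink toward zero while strong ones stay close to one."""
    return similarity ** beta


for s in (0.30, 0.79, 0.81, 0.95):
    print(s, hard_adjacency(s), round(soft_adjacency(s), 4))
```

A correlation of 0.79 counts as zero under the hard threshold but keeps a nonzero weight under the power transformation, which is the information the weighted version retains.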
- 19:38So then once you have your sort of unweighted
- 19:41or weighted adjacency matrix,
- 19:45then is the part where you account for shared neighbors.
- 19:48So this is this topological overlap matrix that is created,
- 19:52so, basically, this measure omega of connectedness.
- 19:58The equation, I don't find super sort of intuitive,
- 20:01but the components are...
- 20:03This is the sum, so u are, basically,
- 20:05all of the nodes other than i and j
- 20:07that you're looking at the connectedness between,
- 20:10and so you're summing up
- 20:11the sort of common connection strength between i and u
- 20:15and j and u as a product.
- 20:18So if i and j both have a strong connection
- 20:22to this other node, then that's adding to this term l,
- 20:27and then these k terms here
- 20:29are just the individual connections between, no,
- 20:32each sort of the node i of interest
- 20:34and other nodes in the network,
- 20:37but I find sort of the easiest or most intuitive explanation
- 20:41from this original paper shows that for the unweighted case,
- 20:46omega is equal to one if the node with fewer connections
- 20:50has all of its neighbors
- 20:51also as connections of the other node.
- 20:53So the connections of node i
- 20:55are a subset of the connections of node j,
- 20:59and, also, i and j are directly connected.
- 21:01So that's sort of the most interconnected
- 21:03that those two nodes can be,
- 21:05and then the least interconnected they can be
- 21:08is if they are not connected to one another,
- 21:10and they don't share any neighbors.
- 21:11So that would be sort of the zero case.
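The slide's equation is not reproduced in the transcript. As a reconstruction, the topological overlap measure from the original WGCNA formulation matches the spoken description, with a for adjacency, l for the shared-neighbor term, and k for each node's connectivity:

```latex
\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}},
\qquad
l_{ij} = \sum_{u \neq i,j} a_{iu}\, a_{uj},
\qquad
k_i = \sum_{u \neq i} a_{iu}
```

In the unweighted case, if i and j are connected and every other neighbor of the less-connected node is also a neighbor of the other, the numerator and denominator are equal and the overlap is one. If they are not connected and share no neighbors, the numerator is zero.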
- 21:16So this a value can either take on
- 21:18the unweighted or the weighted case,
- 21:20and in our sample with WGCNA,
- 21:23we're using those sort of weighted network connections
- 21:26that just adds more information
- 21:28into this topological overlap matrix.
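A minimal sketch of computing the topological overlap matrix from an adjacency matrix, assuming the standard WGCNA definition. The five-node unweighted network is invented so that both limiting cases described above appear; weighted adjacencies between zero and one work with the same function.

```python
def topological_overlap(adj: list[list[float]]) -> list[list[float]]:
    """adj: symmetric adjacency matrix with entries in [0, 1]; the diagonal is ignored.
    Returns the topological overlap matrix, with ones on the diagonal."""
    n = len(adj)
    # Connectivity of each node: sum of its connection strengths to all other nodes.
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Shared-neighbor term: connection strength routed through every other node u.
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom


# Edges: 0-1, 0-2, 1-2, 1-3. Node 4 is isolated.
adj = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
tom = topological_overlap(adj)
```

Nodes 0 and 1 are connected and node 0's only other neighbor is shared with node 1, so their overlap is one. Nodes 0 and 4 are unconnected with no shared neighbors, so their overlap is zero. Nodes 0 and 3 are not directly connected but share node 1, which gives a nonzero overlap that a plain correlation threshold would miss.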
- 21:36Okay.
- 21:39So, now, once you have the topological overlap matrix,
- 21:45again, this measure of sort of interconnectedness
- 21:48accounting for shared neighbors,
- 21:51then you can use hierarchical clustering
- 21:53to divide those proteins
- 21:57into groups based on their similarity,
- 22:00and this is the results from our analysis.
- 22:03So sort of on the x-axis,
- 22:06you have the different proteins, you have the dendrogram,
- 22:09which represents the hierarchical clustering
- 22:11of the topological overlap matrix,
- 22:14and then you have this dynamic tree cut algorithm
- 22:20which then defines these clusters
- 22:22which are shown in colors on the bottom based on the tree.
- 22:26So you see this huge branch down here.
- 22:28That's gonna be this black cluster.
- 22:30There's this other cluster over here in green,
- 22:33and so there's, again, a few more parameters
- 22:37that you can use to decide how those cuts are made,
- 22:40and, in some cases, you can sort of merge branches
- 22:43that have correlation with one another,
- 22:45and my general advice
- 22:48for when you're doing this on real data
- 22:49is to try different values
- 22:51and see how robust the network is
- 22:53to choosing different values because, in our case,
- 22:56it tended to be pretty consistent
- 22:59where we saw four modules pretty much regardless.
- 23:02I think if we merged,
- 23:04if we really cranked up one of the merging parameters,
- 23:06we would get to three,
- 23:07but other than that it sort of stayed put.
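The clustering step can be sketched as agglomerative clustering on the dissimilarity 1 minus the topological overlap. This sketch assumes average linkage, which the talk does not specify, and it uses a fixed cut height as a simpler stand-in for the dynamic tree cut algorithm, which adapts the cut to the shape of each branch. The overlap values are invented.

```python
def average_linkage_clusters(dist: list[list[float]], cut_height: float) -> list[list[int]]:
    """Merge the closest pair of clusters repeatedly; stop once the closest
    pair is farther apart than cut_height. Returns lists of item indices."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a: list[int], b: list[int]) -> float:
        # Average pairwise distance between the members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        if d > cut_height:
            break
        merged = clusters[p] + clusters[q]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (p, q)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical overlap matrix: proteins 0-1 and 2-3 form two tight groups.
tom = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
dist = [[1.0 - v for v in row] for row in tom]
modules = average_linkage_clusters(dist, cut_height=0.5)
```

Rerunning with a range of cut heights is a cheap way to follow the speaker's advice about checking how stable the modules are to the parameter choices.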
- 23:13Okay.
- 23:15So the next step is trying to get
- 23:18a numerical summary measure of the groups of proteins
- 23:22that we've identified from our network.
- 23:25So from these modules of co-expressed proteins,
- 23:28we then use, basically, a principal components analysis
- 23:33to get what we call an eigenprotein
- 23:35or it was called an eigengene in the original paper.
- 23:39What it is is, essentially, a weighted sum
- 23:43of the values of each of the proteins in the module,
- 23:47and the weights correspond to sort of how well correlated
- 23:50that protein is with the overall module.
- 23:53So if a protein has a high weight in the module,
- 23:56it means that it's sort of the most interconnected
- 23:58in the module or sort of best represents the overall module.
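A minimal sketch of an eigenprotein as the first principal component of the standardized proteins in one module, computed by power iteration. This is not the WGCNA implementation. It assumes the module's proteins are mostly positively correlated (so a uniform starting vector works) and that no protein is constant; the toy values are invented.

```python
from math import sqrt


def standardize(values: list[float]) -> list[float]:
    """Center to mean zero and scale to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def eigenprotein(module: list[list[float]], n_iter: int = 200):
    """module[p] holds protein p's values across participants.
    Returns (weights, scores): one weight per protein, one score per participant."""
    z = [standardize(row) for row in module]
    n_prot, n_people = len(z), len(z[0])
    # Correlation matrix among the proteins in the module.
    corr = [
        [sum(a * b for a, b in zip(z[p], z[q])) / (n_people - 1) for q in range(n_prot)]
        for p in range(n_prot)
    ]
    # Power iteration converges to the leading eigenvector of corr.
    w = [1.0 / sqrt(n_prot)] * n_prot
    for _ in range(n_iter):
        w_new = [sum(corr[p][q] * w[q] for q in range(n_prot)) for p in range(n_prot)]
        norm = sqrt(sum(v * v for v in w_new))
        w = [v / norm for v in w_new]
    # Each participant's score is the weighted sum of their standardized protein values.
    scores = [sum(w[p] * z[p][s] for p in range(n_prot)) for s in range(n_people)]
    return w, scores


toy_module = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.0, 9.8],
    [1.2, 1.9, 3.4, 3.8, 5.1],
]
weights, scores = eigenprotein(toy_module)
```

The result is one weight per protein and one score per participant, and the scores are what go into the outcome regressions.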
- 24:04So each person is going to have
- 24:06an eigenprotein value for each module,
- 24:16and when we look at the sort of weights
- 24:18within each of the modules, so just to sort of orient us,
- 24:22on the x-axis are each of the module eigengenes
- 24:27or eigenproteins, and then each sort of bar
- 24:34on the y is a different protein.
- 24:36In this case, we're only including
- 24:39proteins that fall into one of the four modules.
- 24:42There were, also, if you notice on the last slide,
- 24:45plenty of proteins that didn't fall into any module
- 24:48and were sort of the extras, so to speak,
- 24:51and if you were to expand this down
- 24:54and include more rows with those,
- 24:56that would sort of show those, but for purposes of this,
- 25:00we're just including ones
- 25:01that fell into at least one of the four,
- 25:04and each of these bars represents a correlation
- 25:09between the individual protein
- 25:11and the overall eigenprotein.
- 25:13So for these blocks of red,
- 25:15it's sort of the higher weighted proteins
- 25:17that are within in this example module one,
- 25:21module two, three, and four, and then you can see,
- 25:24if you look sort of laterally from these proteins,
- 25:28it's the correlation of these proteins
- 25:30with the other modules.
- 25:31So the idea being we wanna see sort of blocks of red,
- 25:36and then not a lot of correlation
- 25:38between the blocks and other modules,
- 25:40which is what we see.
- 25:46All right, now that we've constructed our network,
- 25:50and we've come up with numerical summary measures
- 25:52for each of the protein groups that we've identified,
- 25:56that is sort of the input or the predictor
- 25:59for these associations with outcomes.
- 26:02So for the MRI measures, which, again,
- 26:04are total brain volume, hippocampal volume,
- 26:07and white matter hyperintensities,
- 26:09we use just a simple or, you know,
- 26:12linear regression with covariates,
- 26:14and then a Cox proportional hazards regression,
- 26:17we use to predict incident dementia
- 26:20and, specifically, Alzheimer's type dementia.
- 26:26These are the regression equations.
- 26:28Again, these eigenproteins are,
- 26:30they're sort of one for each module.
- 26:32So we'll run a separate regression analysis
- 26:35for modules one, two, three, and four.
- 26:37We adjust for age and age squared, sex, education.
- 26:42APOE is a gene that confers a lot of risk
- 26:44for Alzheimer's disease.
- 26:45So it's associated with the outcomes,
- 26:47and we include it as a covariate,
- 26:49and then a measure of time lag
- 26:51between when the blood was sampled
- 26:53and when the MRI was taken to account for any differences
- 26:57between people or the time difference,
- 27:01and for dementia, it's a slightly simpler regression equation.
- 27:06We only adjust for age, sex, and APOE status.
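The slide's equations are not in the transcript. Based on the covariates listed, the two models would look roughly like the following for module m and participant i. The symbols and the coding of each covariate are assumptions; EP stands for the module eigenprotein and lag for the time between blood draw and MRI.

```latex
% Linear regression for each MRI measure
\mathrm{MRI}_i = \beta_0 + \beta_1\,\mathrm{EP}_{m,i} + \beta_2\,\mathrm{age}_i + \beta_3\,\mathrm{age}_i^2
  + \beta_4\,\mathrm{sex}_i + \beta_5\,\mathrm{education}_i + \beta_6\,\mathrm{APOE}_i
  + \beta_7\,\mathrm{lag}_i + \varepsilon_i

% Cox proportional hazards model for incident dementia
h_i(t) = h_0(t)\,\exp\!\left(\gamma_1\,\mathrm{EP}_{m,i} + \gamma_2\,\mathrm{age}_i
  + \gamma_3\,\mathrm{sex}_i + \gamma_4\,\mathrm{APOE}_i\right)
```

The coefficient of interest is the one on the eigenprotein in each model, and each model is fitted once per module.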
- 27:13All right, so next, I will show
- 27:17the results in the Framingham Heart Study.
- 27:21So from the four modules that we tested,
- 27:24there were two that we identified to have
- 27:27some association with outcomes.
- 27:29The first is module two.
- 27:31I gave it sort of a name clearance and synaptic maintenance,
- 27:35and I'll talk about how I arrived
- 27:37at that name for the module in a bit.
- 27:40It has 165 proteins in it.
- 27:44Some of the top-weighted proteins sort of give an idea
- 27:47of which ones are sort of most highly weighted
- 27:51or sort of most correlated with the eigenprotein.
- 27:56I'll talk about how we got to these
- 27:59in another slide as well,
- 28:01but, basically, this is from that
- 28:02over-representation analysis
- 28:04where you're trying to identify biological pathways
- 28:06that are important or overrepresented
- 28:09by proteins in those modules.
- 28:12So we have the axon guidance pathway
- 28:14was most strongly associated with this module,
- 28:21and then in terms of relating to outcomes,
- 28:25total brain volume
- 28:26was the only significant association that we saw.
- 28:29So since this is a linear regression,
- 28:33effect greater than zero means a positive association.
- 28:37So we see that for larger values
- 28:40of the eigenprotein for module two,
- 28:42we saw larger total brain volume.
- 28:44So it's sort of a protective effect
- 28:47since brain atrophy is what is the risk factor for dementia,
- 28:53and then for incident dementia,
- 28:55we did not see a significant effect
- 28:56after correcting our p-values
- 28:58using a Bonferroni correction.
- 29:00You'll notice that the confidence interval excludes one,
- 29:04which would be the null value,
- 29:05and that's just because that's based
- 29:06on the non-Bonferroni corrected value,
- 29:10but after testing for or adjusting for the four modules
- 29:14that we tested, we didn't see a significant association.
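To see how a 95% confidence interval can exclude one while the Bonferroni-corrected result is not significant, here is a small sketch using a normal approximation. The log hazard ratio and standard error are hypothetical and are not the study's estimates; only the number of modules (four) comes from the talk.

```python
from math import exp
from statistics import NormalDist


def hazard_ratio_ci(log_hr: float, se: float, alpha: float) -> tuple[float, float]:
    """Two-sided (1 - alpha) confidence interval for a hazard ratio."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return exp(log_hr - z * se), exp(log_hr + z * se)


def two_sided_p(log_hr: float, se: float) -> float:
    """Two-sided Wald p-value for the null hypothesis log(HR) = 0."""
    return 2 * (1 - NormalDist().cdf(abs(log_hr / se)))


LOG_HR, SE = -0.22, 0.10  # hypothetical estimate and standard error
N_MODULES = 4             # number of modules tested in the talk

p_raw = two_sided_p(LOG_HR, SE)
p_bonf = min(1.0, p_raw * N_MODULES)
ci_95 = hazard_ratio_ci(LOG_HR, SE, 0.05)
ci_bonf = hazard_ratio_ci(LOG_HR, SE, 0.05 / N_MODULES)

print(f"raw p = {p_raw:.3f}, Bonferroni p = {p_bonf:.3f}")
print("95% CI:", ci_95)
print("Bonferroni-width CI:", ci_bonf)
```

With these made-up numbers, the ordinary 95% interval sits entirely below one, while the wider interval matched to the corrected threshold crosses it.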
- 29:18It is nice at least that the direction of effect
- 29:22is what we would expect
- 29:23based on our total brain volume association,
- 29:26which is that higher values of M2
- 29:31correspond to sort of a lower incident dementia occurrence.
- 29:38The second module that we found to be associated
- 29:41with total brain volume was this M4,
- 29:44which I will call sort of an inflammation-related module.
- 29:47It had 42 proteins in it.
- 29:50The highlighted pathway there
- 29:52was cytokine-cytokine receptor interactions,
- 29:55so these sort of immune signaling molecules,
- 29:57and in this case, the association
- 30:00was in the opposite direction
- 30:01where higher values of this module four eigenprotein
- 30:05are associated with lower total brain volume.
- 30:07So it's sort of a risk conferring module
- 30:10and, again, similar to what we saw here, not a significant,
- 30:14sort of an annoyingly borderline association
- 30:17between this and dementia, but, again,
- 30:20the direction of effect is what we would expect
- 30:24based on our observed association with brain volume,
- 30:29and, also, I'll just mention that I standardize
- 30:31the eigenprotein so that the effect sizes
- 30:34correspond to a standard deviation increase in eigenprotein.
- 30:37So it's a little bit...
- 30:39One sort of drawback I would say
- 30:40of these methods is the interpretation
- 30:43since a standard deviation increase, in this case,
- 30:47depends entirely on the sample that you're using.
- 30:49So it's really just sort of a direction of effect
- 30:54more than anything.
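A small sketch of why a per-standard-deviation effect depends on the sample: the same underlying slope gives a different standardized effect when the eigenprotein is more spread out in one cohort than in another. The slope and both cohorts are hypothetical.

```python
from statistics import stdev


def per_sd_effect(raw_slope: float, eigenprotein_values: list[float]) -> float:
    """Effect per one-standard-deviation increase = raw slope times the sample SD."""
    return raw_slope * stdev(eigenprotein_values)


RAW_SLOPE = 10.0  # hypothetical change in outcome per raw eigenprotein unit

cohort_a = [0.1, 0.2, 0.3, 0.4, 0.5]  # narrow spread
cohort_b = [0.1, 0.3, 0.5, 0.7, 0.9]  # twice the spread

effect_a = per_sd_effect(RAW_SLOPE, cohort_a)
effect_b = per_sd_effect(RAW_SLOPE, cohort_b)
print(effect_a, effect_b)
```

The standardized effect in the second cohort is twice as large even though the relationship per raw unit is identical, so the sign is easier to compare across cohorts than the magnitude.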
- 30:56So to try and get at some of, get a better understanding
- 31:00of how these modules relate to our data
- 31:03or sort of what may be responsible
- 31:06for some of the associations we see,
- 31:08this is a map of the correlations
- 31:12between different demographic variables
- 31:15and each of the modules, and I mentioned that we have
- 31:18a replication cohort as well, the CHS.
- 31:20So these two bars, sort of the two columns,
- 31:23show the two different cohorts that were included.
- 31:28So I put blue arrows to show the covariates
- 31:31that were included in our regression model,
- 31:34and you can see that there are some correlations
- 31:35between, say, sex and the modules,
- 31:38not really anything with APOE carrier status,
- 31:42maybe some education associations,
- 31:44and some associations with age.
- 31:46So it's good that we adjusted for those in our models.
- 31:49However, you can also see there are a lot of other factors,
- 31:53cardiovascular risk factors,
- 31:54such as systolic blood pressure, BMI,
- 31:58fasting glucose that have associations with these modules.
- 32:02So we wanted to see if any of those could perhaps explain
- 32:05the associations that we saw.
- 32:10So I'm repeating sort of our standard model here,
- 32:14which was what I showed results from previously.
- 32:17The expanded model that we considered
- 32:19included a bunch of these risk factors,
- 32:23basically, something representing BMI,
- 32:27hypertension, sort of lipid dysregulation, and diabetes,
- 32:33and I also included smoking as well,
- 32:37and we also included a measure of kidney function,
- 32:40which can also be an indicator of cardiovascular disease.
- 32:45So for module two,
- 32:48I'm repeating the sort of effects we saw
- 32:50from the standard model here,
- 32:53and when you adjust for the expanded set of covariates,
- 32:56your effect is attenuated by half,
- 32:58and it's no longer significantly associated.
- 33:01So what that says, it's either you have
- 33:04a sort of confounding issue
- 33:08where the association you're seeing between these proteins
- 33:12and total brain volume is really just an effect
- 33:16of sort of poor cardiovascular health
- 33:20or better cardiovascular health
- 33:22or you may think that it might be
- 33:25some sort of mediation effect
- 33:26where perhaps the risk associated
- 33:31between the proteins and the sort of total brain volume
- 33:34could be mediated
- 33:35by some poor cardiovascular health outcomes,
- 33:41and then for module four,
- 33:43again, this sort of inflammation module,
- 33:45we don't see any real effect attenuation.
- 33:48Regardless of whether you adjust
- 33:49for cardiovascular factors or not,
- 33:52it's still associated with total brain volume,
- 33:54which suggests it's a sort of different mechanism
- 33:57or lack of confounding between
- 33:59or based on cardiovascular health.
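As a rough illustration of the attenuation described above (not the speaker's actual analysis), the sketch below simulates a hypothetical case in which a cardiovascular factor drives both a module score and brain volume, then compares the module coefficient from a "standard" and an "expanded" model. All variable names, effect sizes, and the small `ols` helper are assumptions made for this example.

```python
import random

def ols(y, columns):
    """Least-squares coefficients for y on an intercept plus the given columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[j] * r[k] for r in X) for k in range(p)]
         + [sum(r[j] * yi for r, yi in zip(X, y))] for j in range(p)]
    for i in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(i, p), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(p):
            if r != i:
                f = A[r][i] / A[i][i]
                A[r] = [a - f * b for a, b in zip(A[r], A[i])]
    return [A[i][p] / A[i][i] for i in range(p)]

# Hypothetical simulated data: the cardiovascular factor affects both the
# module score and brain volume; the module has no direct effect on volume.
rng = random.Random(0)
n = 5000
cardio = [rng.gauss(0, 1) for _ in range(n)]
module = [0.8 * c + rng.gauss(0, 0.6) for c in cardio]
brain = [-0.5 * c + rng.gauss(0, 1) for c in cardio]

standard = ols(brain, [module])[1]          # module coefficient, unadjusted
expanded = ols(brain, [module, cardio])[1]  # module coefficient, adjusted
print(f"standard model: {standard:.3f}, expanded model: {expanded:.3f}")
```

The same attenuation would appear if the cardiovascular factor were a mediator rather than a confounder, so a comparison like this cannot, on its own, tell the two explanations apart.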
- 34:05Okay, so I mentioned
- 34:08in the sort of initial graphical abstract
- 34:12that once you find protein modules
- 34:14associated with your outcomes of interest,
- 34:16it can be good to look within the proteins of those modules
- 34:19to try and find sort of subsets
- 34:21or specific proteins that may be driving the associations.
- 34:26So for modules two and four,
- 34:27where we found associations with brain volume,
- 34:30we wanted to see if we removed proteins one at a time
- 34:35based on their sort of increasing weight,
- 34:37so remove the lowest weighted proteins in the modules first,
- 34:42what sort of happened to the strength of the associations.
- 34:46So these are both associations with total brain volume.
- 34:49It's sort of the p-value on the y-axis,
- 34:53and you can see that as you remove, say, from module two,
- 34:57the first 20 proteins or so,
- 34:59you're really not seeing a difference
- 35:01in the effect of the overall module with total brain volume,
- 35:05which suggests that those proteins
- 35:07aren't really impacting the association,
- 35:11whereas beyond that point, once you start removing proteins,
- 35:15the association becomes less strong,
- 35:17and so that's suggesting that those proteins
- 35:20may have more of an impact on sort of the overall module,
- 35:25and so for both of these modules, we identified the spot
- 35:29where sort of the based on the lowest p-value,
- 35:32which proteins were
- 35:35sort of the most important in the module.
- 35:37I wanna emphasize that we didn't use this to...
- 35:41So for things like dementia, if you were to run this,
- 35:44since we didn't see a strong association
- 35:47or a significant association beforehand,
- 35:50we didn't sort of use that to try and find a subset
- 35:52that were significantly associated
- 35:54because I would call that cheating.
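The removal procedure described above can be sketched as follows. This is a minimal illustration on simulated data, assuming the module score is a weighted sum of its proteins and tracking the absolute correlation with the outcome instead of the p-value shown on the slide. The weights, sample size, and function names are hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def removal_curve(proteins, weights, outcome):
    """|r| between module score and outcome after dropping the k
    lowest-weighted proteins, for k = 0 .. len(weights) - 1."""
    order = sorted(range(len(weights)), key=lambda j: abs(weights[j]))
    curve = []
    for k in range(len(order)):
        kept = order[k:]
        score = [sum(weights[j] * row[j] for j in kept) for row in proteins]
        curve.append(abs(pearson(score, outcome)))
    return curve

# Hypothetical module: only the three highest-weighted proteins carry signal.
rng = random.Random(1)
weights = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
proteins = [[rng.gauss(0, 1) for _ in weights] for _ in range(500)]
outcome = [row[3] + row[4] + row[5] + rng.gauss(0, 1) for row in proteins]

curve = removal_curve(proteins, weights, outcome)
for k, r in enumerate(curve):
    print(f"removed {k} lowest-weighted proteins: |r| = {r:.3f}")
```

In this toy setup the association barely changes while the low-weight proteins are removed and weakens once the informative ones start to go, which is the pattern the speaker describes for module two. As noted in the talk, a curve like this is only meaningful for modules that were already associated with the outcome.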
- 36:01Okay, so the last piece that I'll talk about
- 36:05in terms of teasing apart associations
- 36:09or sort of understanding protein within the modules
- 36:13is this functional enrichment
- 36:16or over-representation analysis within the modules.
- 36:20So based on the ones, sort of the significant modules
- 36:24or significantly associated modules with the outcomes,
- 36:28there is this software called STRING
- 36:31that does a few different things, but what I used it for
- 36:35is doing an over-representation analysis
- 36:38of biological pathways.
- 36:41So the idea is that there are annotation databases
- 36:45for proteins that sort of group them
- 36:48into biological functions
- 36:51or pathways that they're involved in,
- 36:53and the idea is that if you have a module
- 36:55that has more proteins than you would expect
- 36:58from a given pathway,
- 36:59then that's sort of the over-representation piece,
- 37:02and it indicates that that biological pathway
- 37:05might be important in whatever functions
- 37:08the module is carrying out.
- 37:12So this is just a screen grab of one example.
- 37:16So this is from module four.
- 37:18So you can see the annotation database is over on the left.
- 37:22So KEGG is one of them.
- 37:24Gene Ontology is another,
- 37:26and so you have these sort of observed proteins,
- 37:30and then the background is sort of the total number
- 37:33of proteins that are in the pathway,
- 37:36and the idea being that if you were to grab, I don't know,
- 37:39however many proteins out of the background,
- 37:41like how many would you expect to be in this module
- 37:45due to chance, and do we have sort of over-representation
- 37:49compared to what we would expect?
- 37:51And so for module four,
- 37:52the cytokine-cytokine receptor interaction
- 37:55was the strongest overrepresented pathway,
- 37:59and then you can sort of look at these others that
- 38:03have some sort of false discovery rate less than 0.05,
- 38:08and so I found the KEGG pathways, personally,
- 38:11to be the most informative.
- 38:12Gene Ontology tends to be a lot more specific,
- 38:15which may be more useful for targeting
- 38:18certain sort of therapeutic processes
- 38:21or something like that,
- 38:22but so depending on the scale that is important to you,
- 38:25you can sort of use different annotations.
- 38:31Okay, so the last thing I wanted to talk about,
- 38:33with the Framingham data in particular,
- 38:38was sort of getting back to our motivation
- 38:40for doing a network analysis in the first place.
- 38:43So the sort of contrast or comparator would be to do
- 38:47individual protein analyses where you're running
- 38:49a regression model for each protein that you're analyzing,
- 38:53and so we did that as a point of comparison.
- 38:55So for total brain volume, there were like a dozen proteins
- 38:59that were associated with total brain volume.
- 39:02One was associated with hippocampal volume,
- 39:04and two were associated with Alzheimer's disease
- 39:07at an FDR value of less than 0.1.
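For the individual protein analyses, each protein gets its own regression p-value, and the set of p-values is then adjusted to control the false discovery rate. The talk does not name the adjustment method. The sketch below assumes the Benjamini-Hochberg procedure, and the p-values in the example are invented.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-protein p-values from separate regression models.
pvalues = [0.001, 0.008, 0.039, 0.041, 0.6]
qvalues = benjamini_hochberg(pvalues)
significant = [i for i, q in enumerate(qvalues) if q < 0.1]
print(qvalues, significant)
```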
- 39:11So what was interesting,
- 39:14especially with the brain volume results,
- 39:16and, again, that was where we had seen
- 39:17associations with these modules,
- 39:19some of the proteins that were significantly associated
- 39:23were from module two and module four and others weren't.
- 39:29So what I get from that is a few things.
- 39:32One is that some proteins
- 39:34that are associated with the outcome
- 39:36are sort of individually associated
- 39:39but not sort of detectable
- 39:41within sort of a larger network of proteins
- 39:44that are associated with that outcome,
- 39:46and then the other is that
- 39:48for those that are within the modules,
- 39:51we would only be getting information
- 39:53about sort of a few of the proteins in the modules,
- 39:56whereas, as we see here,
- 40:00the associations tend or continue to get stronger
- 40:03with sort of looking at the broader network
- 40:06around sort of the most highly weighted proteins.
- 40:09So you're getting a bit more information
- 40:10about proteins that may be associated
- 40:13with total brain volume
- 40:14and maybe at some of the biological processes
- 40:17compared to if you're looking at things individually,
- 40:20but, again, because you're seeing associations
- 40:22that you don't catch with the modules,
- 40:23it's sort of important to look at both,
- 40:25and you get sort of complementary information
- 40:28from the two approaches.
- 40:33So a caveat,
- 40:36I mentioned issues with lack
- 40:37with sort of difficulties in replication.
- 40:40We replicated this analysis
- 40:42in the Cardiovascular Health Study,
- 40:44and we did so by taking the same module,
- 40:47so module two and module four,
- 40:50taking the same weights from those proteins
- 40:52and applying them to the protein concentrations
- 40:56in the Cardiovascular Health Study.
- 40:59So we didn't do a network reconstruction or anything
- 41:02in the different study.
- 41:03We were just seeing if these modules replicated
- 41:07in their associations with outcomes in a different cohort.
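The replication step described above, reusing the discovery weights rather than rebuilding the network, can be sketched roughly as below. Standardizing each protein within the replication cohort before weighting is an assumption of this example, and the protein names, weights, and values are placeholders.

```python
from statistics import mean, pstdev

def standardize(column):
    """Center and scale a list of values to mean 0 and SD 1."""
    mu, sd = mean(column), pstdev(column)
    return [(v - mu) / sd for v in column]

def module_scores(cohort, weights):
    """Weighted sum of standardized proteins for each participant.

    cohort:  dict mapping protein name -> list of values (one per participant)
    weights: dict mapping protein name -> weight taken from the discovery cohort
    """
    z = {p: standardize(cohort[p]) for p in weights}
    n = len(next(iter(z.values())))
    return [sum(weights[p] * z[p][i] for p in weights) for i in range(n)]

# Hypothetical discovery weights applied to a hypothetical replication cohort.
discovery_weights = {"protein_a": 0.7, "protein_b": 0.3}
replication = {
    "protein_a": [1.0, 2.0, 3.0, 4.0],
    "protein_b": [2.0, 2.0, 4.0, 4.0],
    "protein_c": [5.0, 1.0, 3.0, 2.0],  # not in the module, so ignored
}
scores = module_scores(replication, discovery_weights)
print(scores)
```

The resulting scores would then go into the same regression models as in the discovery cohort, with total brain volume or dementia as the outcome.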
- 41:10So in this case, it's really not seeing much
- 41:14in terms of association with both total brain volume
- 41:18and we also looked at dementia out of interest
- 41:22since things were sort of close in our cohort,
- 41:26but, really, we're not seeing much in terms of associations.
- 41:31Part of the reason for that,
- 41:33so there are not that many cohorts
- 41:36that are available that have a large proteomic panel
- 41:39with the same proteins that we were looking at
- 41:41as well as MRI and incident dementia outcomes,
- 41:45and, in this case, the demographics of the cohort
- 41:48are fairly different from (indistinct) Framingham.
- 41:51So about 20 years older on average.
- 41:56I'm just including the sort of first few rows
- 41:58of our table one, but you can see differences in education,
- 42:01systolic blood pressure, and the same is true
- 42:03of a lot of the other cardiovascular risk factors.
- 42:06So it's a very different cohort,
- 42:08and digging a bit into the literature
- 42:10about sort of proteins over the life course,
- 42:13it's not too surprising that we don't see
- 42:16the same associations, but it does sort of,
- 42:19it's a good cautionary message
- 42:20about drawing conclusions too far
- 42:23based on sort of one set of data
- 42:25or one set of demographics.
- 42:30Just to put these results in context,
- 42:32so our module four included
- 42:36a lot of immune-related signaling molecules
- 42:38like interleukins, TNF receptor proteins,
- 42:41which are both types of cytokines, and have been associated
- 42:44with Alzheimer's disease previously,
- 42:47in particular, interleukin-1 beta was in our module four,
- 42:52and it had been found to be elevated
- 42:53in AD cases in a meta-analysis.
- 42:56However, other biomarkers that have been sort of validated
- 43:00in other cohorts were not identified in our module.
- 43:08In module two, we saw axon guidance pathway proteins
- 43:11including ephrins, netrins, and semaphorins,
- 43:13which have been associated with AD in previous work,
- 43:17and complement cascades have also been associated
- 43:20with AD probably for the reason
- 43:22of inducing these immune cells called microglia
- 43:27in the brain to, basically, eat up
- 43:31cells in response to amyloid deposition.
- 43:35So there's some biologically plausible mechanisms
- 43:37that could be associated with these modules
- 43:42in Alzheimer's disease,
- 43:46and the last thing I'll say is talking about some sort
- 43:49of other ways of approaching this problem,
- 43:51so as I mentioned, the CHS cohort
- 43:54has different underlying characteristics,
- 43:56and so it may well have a different network structure.
- 43:59So one thing that could be good to do
- 44:02is to look at sort of consensus modules across the cohorts
- 44:07where you construct networks in each cohort,
- 44:09and then look at where the overlaps are,
- 44:12and you can get sort of a more,
- 44:14hopefully, more robust network across cohorts,
- 44:18and then there are other network-based approaches
- 44:20that can incorporate external information.
- 44:22So, again, our network approach
- 44:24was just based on correlation in our dataset,
- 44:27whereas other methods use sort of those annotation databases
- 44:33and that sort of thing to construct the networks
- 44:35and sort of decide how strong the similarities between nodes
- 44:39or the strength of connections will be.
- 44:41So that's another approach,
- 44:43and then the last thing I'll say is that
- 44:45I'm sort of still using this kind of method
- 44:48now in work with longevity and aging
- 44:51and trying to apply it to metabolomics,
- 44:54so metabolites data in cohorts related to those outcomes.
- 45:02So thank you all for being here.
- 45:04Thank you to my collaborators.
- 45:06This is the folks down at UT.
- 45:09I'll say that (indistinct).
- 45:11Thank you.
- 45:16<v ->Thank you for wonderful presentation.</v>
- 45:18We're open for questions.
- 45:20So let's start with people in the room.
- 45:21Any questions?
- 45:23<v ->Got one over here.</v> <v ->Perfect, thank you.</v>
- 45:25<v Audience>Yeah, so my research interest</v>
- 45:26is about the cancer, and, also,
- 45:28we're interested in your study.
- 45:30So I've got some technical issues about this project.
- 45:35So the first issue that,
- 45:36how do you do the normalization in your process?
- 45:41<v ->Yeah, great question.</v>
- 45:42So yeah, I totally glossed over
- 45:44all the pre-processing stuff.
- 45:47So before doing the network construction,
- 45:51I log transformed the protein concentrations
- 45:54to reduce skewness.
- 45:56There was a standardization within,
- 45:58there were sort of two phases of runs of protein modules,
- 46:02so I sort of standardized within those batches,
- 46:06and then after that, I did a rank normalized
- 46:11or inverse normal rank transformation to sort of-
- 46:16(audience speaks indistinctly) <v ->What's that?</v>
- 46:17<v ->(indistinct) normalization?</v> <v ->Basically.</v>
- 46:19Yeah, yeah, yeah.
- 46:20So that was sort of the data pre-processing.
- 46:23So I think I, you know,
- 46:26I've thought about sort of the pros and cons
- 46:28of those things as well and I think my biggest qualm
- 46:31with the way that I did it is sort of interpretability,
- 46:34because, yeah, sort of what does it mean
- 46:37to be at one quantile versus another
- 46:39where you have this huge dynamic range
- 46:40of protein concentrations?
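The three pre-processing steps mentioned in this answer (log transform, standardization within batch, inverse normal rank transformation) might look roughly like the sketch below for a single protein. The rank offset of 0.5, the handling of ties, and the example values are assumptions. The talk does not give those details.

```python
from math import log
from statistics import NormalDist, mean, pstdev

def inverse_normal_rank(values, offset=0.5):
    """Map values to standard normal quantiles by rank.
    Ties are broken by input order in this sketch."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - offset) / n)
    return out

def preprocess(concentrations, batches):
    """Log-transform, standardize within batch, then rank-normalize."""
    logged = [log(c) for c in concentrations]
    standardized = list(logged)
    for b in set(batches):
        idx = [i for i, label in enumerate(batches) if label == b]
        mu = mean([logged[i] for i in idx])
        sd = pstdev([logged[i] for i in idx])  # assumes each batch varies
        for i in idx:
            standardized[i] = (logged[i] - mu) / sd
    return inverse_normal_rank(standardized)

# Hypothetical concentrations for one protein measured in two batches.
concentrations = [1.0, 10.0, 100.0, 3.0, 20.0, 500.0]
batches = [1, 1, 1, 2, 2, 2]
out = preprocess(concentrations, batches)
print([round(v, 3) for v in out])
```

Because the final step depends only on ranks, the output has the same set of values for any protein measured on the same number of samples, which is the interpretability concern raised in the answer: the original dynamic range of concentrations is no longer visible.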
- 46:42<v Audience>So another question is that</v>
- 46:44I know that in your project,
- 46:46the modules identification is very important.
- 46:48So I wonder,
- 46:53you have talked a little bit
- 46:54about how to answer the modules,
- 46:57but so can you explain a little bit more
- 47:00about how you gonna bring modules from the data?
- 47:08<v ->I'm not sure, can you say a little bit more?</v>
- 47:11<v Audience>Yeah, so in your previous pages,</v>
- 47:13I think you talked a little bit about the clustering
- 47:17of the modules so that we know
- 47:18that there are four main modules.
- 47:22<v ->Yes.</v> <v ->In the whole dataset.</v>
- 47:24So what is the name of that algorithm
- 47:28and how it basically work?
- 47:31<v ->Yeah, so the clustering itself was done</v>
- 47:36using an algorithm called hclust.
- 47:41To be honest, I'm not too sure
- 47:43about sort of the details of it.
- 47:45It can use any dissimilarity measure,
- 47:48which, in our case, comes from the TOM matrix, but-
- 47:52<v Audience>So this is the algorithm that we separate</v>
- 47:55the whole proteins into four different modules
- 47:58so that we can analyze it one by one.
- 48:00<v ->Yeah, yeah, yeah, yeah.</v> <v ->Yeah,</v>
- 48:01so I also noticed that
- 48:07in the weighted protein expression network analysis,
- 48:13you talk about the beta values.
- 48:16<v ->Yes.</v> <v ->That you use that</v>
- 48:20like the soft threshold. <v ->Yeah.</v>
- 48:23<v Audience>To make the genes to be more important</v>
- 48:28if that is the thing that you wanna analyze.
- 48:31So in this process, I want to know how you would make sure
- 48:35the value of the data in this process.
- 48:39<v ->So sorry, we have to end 'cause it's 12:15.</v>
- 48:42I know others have classes and everything.
- 48:44Maybe you guys can discuss a little bit.
- 48:46<v ->Yeah, (indistinct), yeah.</v> <v ->Maybe if you have time.</v>
- 48:48Please, if you're registered,
- 48:49make sure you signed in on a sign in sheet.
- 48:51There's three of 'em.
- 48:52You only have to sign on one of them,
- 48:54and then one-fourth page reflections will be due
- 48:57before the next speaker's time to speak.
- 48:59(indistinct talking)