YSPH Biostatistics Seminar: “Feature Aggregation in Causal Discovery for High-dimensional Data: Application to Targeting the “Gut-Brain-Axis” via the Microbiome Diversity"

Uploaded: 2023-09-21T15:01:57.9Z
Duration: 49 min 29 s
Description: Jinyuan Liu, PhD, Assistant Professor, Department of Biostatistics, Vanderbilt University Medical Center September 19, 2023

Information

ID: 10729

00:01<v ->All right, I'm very excited</v>
00:03to introduce our speaker for today.
00:04We have Dr. Meghan Short.
00:06Dr. Short has completed fellowships
00:08at the Glenn Biggs Institute for Alzheimer's
00:10and Neurodegenerative Diseases,
00:12and at Harvard's Huttenhower Lab.
00:14Currently, Dr. Short is an assistant professor
00:17at Tufts University.
00:18Let's give a warm welcome to Dr. Short.
00:31<v ->Hi, everyone, Thank you for being here.</v>
00:34Can you all hear me, okay?
00:35<v ->Sign in if you're registered.</v>
00:38<v ->All right, so, today, I'm going to talk about a project</v>
00:41that I worked on as part of my postdoc
00:43down at UT Health San Antonio
00:46with the Glenn Biggs Institute for Alzheimer's
00:49and Neurodegenerative Diseases,
00:51and I wanted to talk about this as a...
00:56None of the sort of methods that I'm gonna talk about
00:58in this talk are particularly new.
01:01This wasn't sort of a methods development project.
01:04So the sort of main network method I'll talk about
01:08is about a decade old at this point, at least,
01:10but what's nice about it is that
01:13with increasing availability
01:15of high dimensional biomedical data,
01:17it's sort of seeing more use cases,
01:20and it's not something that, at least, I learned about
01:22in my graduate program in biostatistics,
01:24but it's something that I thought
01:26would be good to talk about today
01:28since it's such a useful method.
01:32So let's see if I advance.
01:35There we go.
01:36So I'll start just by giving a quick introduction.
01:39I know that when I was in grad school, I always wanted,
01:43I thought it was interesting
01:44to hear about people's career paths
01:46as I was considering my own.
01:48So I started in biology as a field.
01:53I studied salt marsh ecology as an undergrad,
01:56and then by the end of undergrad,
01:58I was interested in getting more into sort of a human,
02:00more directly human-focused environment,
02:02and so I considered public health.
02:05I learned about statistics
02:06as part of my research in undergrad
02:08and wanted to continue with that so I participated in SIBS,
02:11which is a program that you may be aware of,
02:14and that was my first intro to biostat.
02:16I was a graduate student at Boston University.
02:19I had the good fortune of working
02:21with the Framingham Heart Study,
02:22which is where the data comes from
02:24that I'll be talking to you about today,
02:26which is a really interesting study,
02:27and I'll give more details on it in a few slides.
02:29That was sort of my introduction
02:31to working with epidemiological data.
02:34After grad school, I continued on,
02:36again, to UT Health San Antonio,
02:38and then following that to postdoc at Harvard
02:42looking at developing methods for microbiome analysis.
02:47So if you have any interest in that,
02:49feel free to approach me,
02:51although I'm not gonna talk about that today,
02:54and then as of March this year,
02:57I started as an assistant professor at Tufts Medicine
03:00where I'm working on a variety of projects
03:02but a lot related to sort of omics data
03:06and aging and longevity.
03:12So I'll start today's talk with a bit of motivation
03:15for why network-based analyses were a good fit
03:18for looking at sort of the proteome in Alzheimer's disease.
03:24So first of all, Alzheimer's disease
03:27is a very prevalent condition.
03:30Many of you may be like me and know some family members
03:33or people who have been affected by it.
03:36It's very common and expected to be more so
03:40as populations age, and it's a leading cause of mortality,
03:44disability, and poor health among seniors,
03:47and one interesting feature of this disease
03:49is that precursors of it can appear years to decades
03:52before symptoms manifest.
03:55So those precursors can include indicators
03:58that are visible on brain MRIs,
04:01performance on neurocognitive testing, changes in gait,
04:05even changes in sense of smell,
04:08and cerebrospinal fluid markers, such as tau and amyloid.
04:17Because of this, there's interest in being able to find
04:21plasma biomarkers for Alzheimer's disease
04:24and related dementias.
04:25ADRD is an acronym we'll be using sort of throughout.
04:30Since there are indicators
04:33of sort of pre-disease development
04:35years to decades before, being able to detect those,
04:37either earlier or in a less invasive or expensive way,
04:40is very useful,
04:44and so when I say invasive, I mentioned CSF markers,
04:50such as tau and amyloid, can predict dementia,
04:53but that involves doing a lumbar puncture
04:56versus something like a blood draw, which is easier to do.
05:01Another good aspect of trying to find biomarkers
05:04is that you can get a sense of biological processes
05:07that are involved in disease development,
05:10and that can hopefully lead to either preventative
05:13or therapeutic interventions.
05:19What makes this difficult?
05:21So in my case, I was looking at proteins.
05:24There are thousands and thousands to select from,
05:27and you get sort of this inherent trade off
05:30between trying to control a false positive rate
05:33for all these multiple tests that you may be performing,
05:36but if you effectively control the false positive rate,
05:39you're going to likely end up with low statistical power.
05:42There's this trade off between...
05:45It's sort of a needle in a haystack.
05:47Another thing that has tended to be true
05:50is that there is not very good replicability across studies.
05:53So one study may find 20 biomarkers
05:57and maybe one or two of them
05:59may replicate in a different study.
06:01So there's a lot of noise that ends up coming through.
06:08The approach that I took in this project
06:10was to use network analysis
06:13to analyze the protein data,
06:17and the motivation there is to try and capture
06:20subtle but consistent variation in groups of proteins.
06:23I'll refer to them as modules during this talk.
06:28And it does a few things, so first of all,
06:30it reduces the dimensionality
06:32of the statistical testing problem that you have.
06:35So rather than testing each protein individually
06:37and having to adjust for all of those multiple tests,
06:40you can sort of reduce the space
06:43to a smaller number of tests
06:46where the proteins within each group being tested
06:49are inter-correlated with one another,
06:52and unlike other dimensionality reduction methods,
06:55something like a principal components analysis
06:57that you may be familiar with,
06:59the network method has sort of a benefit of looking
07:03not just at, say, correlations
07:06or relationships between pairs of proteins,
07:09but, also, at sort of the correlational neighborhood
07:11of what common neighbors
07:13those proteins share in the network.
07:18Another benefit of or sort of way
07:22that we try to get around some of the pitfalls
07:23of proteomic analysis is by focusing on biological pathways
07:29instead of on individual proteins themselves.
07:32So within groups of proteins that we find to be of interest
07:36or possibly associated with dementia outcomes,
07:40we use a tool called over-representation analysis,
07:43which I'll talk about later,
07:45but it essentially tries to pinpoint biological pathways
07:48that may be overrepresented by the proteins
07:51that are found to be associated with the outcome,
07:54and the hope there is to find,
07:56to get sort of insights that are more robust across studies
08:01and, hopefully, address some of the issues
08:03with replicability.
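
The over-representation idea described above can be sketched with a hypergeometric test, which is one common way such analyses are done. The talk does not say which test or tool was used, so the test choice, the function name `over_representation_p`, and all of the counts below are illustrative assumptions.

```python
from scipy.stats import hypergeom


def over_representation_p(n_background, n_pathway, n_module, n_overlap):
    """P(overlap >= n_overlap) when n_module proteins are drawn at random
    from n_background proteins, n_pathway of which belong to the pathway."""
    # sf(k) is P(X > k), so pass n_overlap - 1 to get P(X >= n_overlap).
    return hypergeom.sf(n_overlap - 1, n_background, n_pathway, n_module)


# Hypothetical counts: 1,300 measured proteins, a 40-protein pathway,
# a 165-protein module, and 15 pathway proteins found in the module.
p_enriched = over_representation_p(1300, 40, 165, 15)
print(p_enriched)
```

By chance alone, about 165 × 40 / 1300 ≈ 5 pathway proteins would be expected in the module, so an overlap of 15 gives a small p-value in this made-up example.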
08:08Okay, so that's sort of the motivation for this study,
08:11and, now, I'll talk a little bit about the data.
08:18The data for this study
08:19comes from the Framingham Heart Study,
08:22which has been going on for a very long time.
08:24It started in 1948 in the town of Framingham, Massachusetts,
08:29and at the time they enrolled,
08:31they reached out to two-thirds of the population of the town
08:34to try and enroll them in this epidemiological study.
08:36It was one of the first ones of its kind,
08:39and people would come in for exams every few years,
08:42and they would take all of this information about them,
08:45and then follow them for outcomes.
08:47Cardiovascular outcomes was really
08:49the sort of outcome of interest when it first started.
08:53Over the years, they've then enrolled offspring
08:57of the original cohort participants
08:59as well as grandchildren and third generation,
09:02and then as sort of the demographics
09:06of Framingham have changed over the years,
09:09if you're only enrolling descendants
09:10of people who lived there in 1948,
09:12you're not gonna capture that.
09:13So they also have been enrolling omni cohorts
09:15to reflect sort of more diverse populations (indistinct).
09:21Again, they were sort of aiming
09:23towards identifying risk factors
09:25and etiologies of cardiovascular disease,
09:29but as those populations age,
09:31brain health and cognition is also an important outcome,
09:34and so they've measured sort of cognitive outcomes
09:39and incidence of dementia as well, and, of course,
09:41those things are also related to cardiovascular.
09:48For our study in particular,
09:51we were using the offspring cohort,
09:53and at their examination cycle five,
09:55which was in the early 90s, they collected blood samples,
10:00and froze the plasma from those samples,
10:03and years later, when they sort of had
10:06these broader proteomic analysis assays available,
10:11they measured the plasma proteome,
10:14I'll talk about the methods for that on the next slide,
10:18but they did this in about 1,900 participants
10:21who were approximately aged 55 when the blood was drawn.
10:24So this is sort of a middle-aged cohort,
10:26generally, cognitively healthy
10:29and a little more than half women.
10:33The main outcomes that we looked at in this study
10:35are MRI-based measures, so brain MRIs were taken
10:41about 10 years or so, five to 10 years
10:45after the initial blood draws, and those had...
10:51The sort of outcomes that I looked at there are
10:54total brain volume as well as the volume of the hippocampus
10:57and then a measure called white matter hyperintensities,
11:01which is sort of a measure of vascular injury in the brain,
11:06and a reason to look at those outcomes is that
11:10I mentioned there are sort of precursors of dementia
11:13or risk factors for dementia that can be identified on MRI,
11:16those are some of the big ones.
11:19Especially since we had a middle-aged cohort,
11:22you may not see a lot of incident dementia,
11:24and so being able to detect proteins
11:27that are associated with some of those precursors
11:29is a way of getting at this issue.
11:34We did also look at incident dementia.
11:36So we had about 20 years of follow-up,
11:37which is one of the strengths of this,
11:40looking in this particular sample,
11:42and we had 128 incident cases of dementia
11:46of which 94 of them were classified
11:48as Alzheimer's type dementia.
11:53We also had a replication cohort.
11:55I mentioned the importance of replication,
11:58and so we worked with collaborators
12:00at the University of Washington and their cohort study
12:04called the Cardiovascular Health Study,
12:06which has sites, I think, four different sites around the US
12:09and has measures of the same proteomic platform
12:13and same outcomes that we're looking at in the study.
12:19The assay that we used to measure proteins
12:23is called SOMAScan.
12:24It's by this company called SomaLogic.
12:27They use these single-stranded DNA aptamers
12:29that are designed to specifically bind
12:31to different proteins, and you can sort of tag them
12:35that way and measure their concentrations.
12:38In our sample, the assay had 1,300 proteins,
12:42which that's even sort of becoming dated now.
12:45I think the latest version
12:46has something like 7,000 proteins.
12:48So there's a lot that can be measured with this,
12:51but there is some sort of bias towards, I think,
12:57molecules that sort of have some evidence
12:59of being important in cardiovascular disease.
13:01So it's not an entirely sort of agnostic choice of proteins,
13:06but it does get a pretty wide range.
13:11Okay, so that's a description of the data,
13:15and, now, I want to dig in a bit
13:17to the network methods that we used.
13:20So this is sort of a graphical abstract
13:24from their original paper,
13:28describing this weighted gene
13:29correlation network analysis method.
13:32So that's what WGCNA stands for.
13:34I put gene in parentheses because they've started
13:37dropping that from the name when it gets used elsewhere
13:40because, originally, it was developed
13:42for gene expression data, but it's been found to have use
13:45in other high dimensional data sets as well,
13:48and so in our case, we're using it to analyze proteins,
13:52but the language here makes reference to gene expression.
13:57So just broadly, what this method does
14:01is you get a co-expression network,
14:04and I'll sort of give details on the next few slides,
14:08but the idea is that the network is based
14:10on co-occurrence or correlation in your sample.
14:14So there's not really information coming from outside.
14:17You're not even considering your outcome at all.
14:19It's just looking at the space of the proteins
14:21and which proteins are correlated with one another.
14:26Once you've identified this sort of network matrix,
14:29you use a hierarchical clustering algorithm
14:32to define modules.
14:34It's a little small here, but I'll show a bigger example.
14:37Basically, you have a dendrogram,
14:39and you see that if sort of proteins
14:42are on this x-axis of this figure here.
14:44I'll do the mouse for people who are online.
14:48You get these sort of bands or groups of proteins
14:51that are highly correlated with one another
14:53and not correlated with other proteins.
14:58So that is where those sort of protein groups come from.
15:02Once you have those, you can use a numerical summary
15:06of each protein group as sort of a feature or a predictor
15:10in a regression or some sort of analysis
15:13to try and relate the modules or groups
15:15to external information.
15:16So that's how we relate our protein groups
15:20to dementia outcomes in this study.
15:24There's also the possibility
15:25of looking at relationships between modules.
15:28So I mentioned the modules in the network
15:32are highly inter-correlated
15:33within the proteins within themselves,
15:36but there may also be some correlation between modules,
15:38and that could be important to look at as well,
15:41and then within modules, you may have
15:44tens or hundreds of proteins, and so trying to figure out
15:47which proteins within those modules
15:50are driving any associations you see
15:52is sort of a final step that can be
15:55useful for getting sort of biological meaning
15:57out of these associations.
16:02So that's a broad overview.
16:03This is sort of a more graphical abstract from our study,
16:08and I'll sort of go through bit by bit
16:11the different pieces of the analysis.
16:15So, again, this WGCNA step is sort of the first step
16:18of getting from this protein expression matrix
16:20where you have sort of your proteins by participants,
16:24and using the sort of correlations in your sample
16:27to come up with these modules of co-expressed proteins.
16:33The first step in doing that
16:35is to make a pairwise correlation or similarity matrix.
16:39So if you have n proteins,
16:40then that becomes an n by n matrix
16:43where each cell is describing
16:45the similarity or correlation
16:47between protein i and protein j in your sample.
16:52You then use this to create
16:54what's called an adjacency matrix, which is,
16:56I'll talk about more in the next slide,
16:58but is sort of a more networky way
17:02of describing the association between proteins,
17:05and then a topological overlap matrix,
17:08which then takes into account
17:10not only the correlation between proteins
17:13but their shared neighborhood, and then, again,
17:16that is what is used to cluster the proteins.
17:23So to get into a bit more detail
17:25about sort of the network construction,
17:30again, you describe the network as an n by n matrix
17:33with the number of nodes or genes, proteins, et cetera,
17:36and, in our case, we use to describe the similarity,
17:39just a simple correlation,
17:42absolute value of the correlation,
17:44between a given node i and j.
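
A minimal sketch of the similarity step just described, assuming a participants-by-proteins expression matrix. The simulated data and the helper name `similarity_matrix` are illustrative, not taken from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 participants (rows) by 6 proteins (columns).
expr = rng.normal(size=(200, 6))
expr[:, 1] += expr[:, 0]  # make proteins 0 and 1 co-vary


def similarity_matrix(x):
    """Absolute Pearson correlation between every pair of columns."""
    return np.abs(np.corrcoef(x, rowvar=False))


s = similarity_matrix(expr)
print(s.shape)     # one row and one column per protein
print(s[0, 1])     # similarity between protein 0 and protein 1
```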
17:48The adjacency is then a measure of whether or how strongly
17:51the nodes are connected in the network.
17:53So the idea being that
17:56nodes that have very high correlations
17:58are particularly interesting.
18:00Nodes that have moderate to low correlations
18:02are probably not informative
18:03is sort of the underlying idea,
18:07and so if you look at sort of this figure here,
18:12the correlation or similarity is on the x-axis,
18:15and then the adjacency is on the y, and so if you use
18:19what's called an unweighted network approach,
18:22you pick a threshold value, here, it's 0.8,
18:25and you say that anything with a similarity less than 0.8
18:28is considered to not be a connection in the network,
18:31and everything greater than 0.8
18:33is considered to be a connection.
18:34So it's sort of a binary yes or no.
18:38What WGCNA does that was novel
18:42was to introduce a weighting
18:45where sort of the downside of this unweighted metric is that
18:49if you have a correlation of 0.79,
18:52that could be useful to know, but it counts as a zero.
18:55So you're losing information,
18:57and so what the weighted network does
19:00is it uses a sort of power transformation
19:03to get from sort of the straight correlation
19:07shown in this red line,
19:08and sort of depending on this power value that you use,
19:12you weight more or less towards the higher correlations
19:16in your network, and when you fit this model
19:20or when you sort of build the network, your choice of beta
19:24is sort of one of the parameters that you choose going in,
19:27and there's ways to sort of measure
19:30which gives the best fit to the data.
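
The contrast between the unweighted (hard threshold) and weighted (power transformation) adjacency can be sketched as below. The threshold of 0.8 follows the slide example. The power `beta=6` and the function names are illustrative assumptions, not values reported in the talk.

```python
import numpy as np


def hard_adjacency(s, tau=0.8):
    """Unweighted network: 1 if similarity reaches the threshold, else 0."""
    return (s >= tau).astype(float)


def soft_adjacency(s, beta=6):
    """Weighted network: raise similarity to a power so that high
    correlations are emphasized while lower ones shrink toward zero."""
    return s ** beta


s = np.array([0.30, 0.79, 0.80, 0.95])
hard = hard_adjacency(s)
soft = soft_adjacency(s)
print(hard)  # 0.79 is dropped entirely, 0.80 is kept
print(soft)  # 0.79 and 0.80 receive nearly the same weight
```

With the hard threshold, 0.79 and 0.80 land on opposite sides of a yes/no cutoff. With the power transformation they stay close together, which is the information the weighted approach preserves.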
19:38So then once you have your sort of unweighted
19:41or weighted adjacency matrix,
19:45then is the part where you account for shared neighbors.
19:48So this is this topological overlap matrix that is created,
19:52so, basically, this measure omega of connectedness.
19:58The equation, I don't find super sort of intuitive,
20:01but the components are...
20:03This is the sum, so u are, basically,
20:05all of the nodes other than i and j
20:07that you're looking at the connectedness between,
20:10and so you're summing up
20:11the sort of common connection strength between i and u
20:15and j and u as a product.
20:18So if i and j both have a strong connection
20:22to this other node, then that's adding to this term l,
20:27and then these k terms here
20:29are just the individual connections between, no,
20:32each sort of the node i of interest
20:34and other nodes in the network,
20:37but I find sort of the easiest or most intuitive explanation
20:41from this original paper shows that for the unweighted case,
20:46omega is equal to one if the node with fewer connections
20:50has all of its neighbors
20:51also as connections of the other node.
20:53So the connections of node i
20:55are a subset of the connections of node j,
20:59and, also, i and j are directly connected.
21:01So that's sort of the most interconnected
21:03that those two nodes can be,
21:05and then the least interconnected they can be
21:08is if they are not connected to one another,
21:10and they don't share any neighbors.
21:11So that would be sort of the zero case.
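
The slide equation is not captured in the transcript. For reference, the topological overlap measure as it is commonly written in the WGCNA literature matches the spoken description, with a the adjacency, l the shared-neighbor term, and k the connectivity of each node. The exact notation on the slide may differ.

```latex
\omega_{ij} = \frac{l_{ij} + a_{ij}}{\min(k_i, k_j) + 1 - a_{ij}},
\qquad
l_{ij} = \sum_{u \neq i,j} a_{iu}\, a_{uj},
\qquad
k_i = \sum_{u \neq i} a_{iu}
```

In the unweighted case, if i and j are connected and every neighbor of the less-connected node is also a neighbor of the other, the numerator equals the denominator and the overlap is one. If they are unconnected and share no neighbors, the numerator is zero.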
21:16So this a value can either take on
21:18the unweighted or the weighted case,
21:20and in our sample with WGCNA,
21:23we're using those sort of weighted network connections
21:26that just adds more information
21:28into this topological overlap matrix.
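
A minimal sketch of computing a topological overlap matrix from an adjacency matrix, assuming the commonly cited form (shared-neighbor sum plus direct adjacency, divided by the smaller connectivity plus one minus the adjacency). The function name and the small example network are illustrative.

```python
import numpy as np


def topological_overlap(a):
    """Topological overlap for a symmetric adjacency matrix with entries in [0, 1]."""
    a = np.array(a, dtype=float)           # work on a copy
    np.fill_diagonal(a, 0.0)               # ignore self-connections
    shared = a @ a                         # shared[i, j] = sum over u of a[i, u] * a[u, j]
    k = a.sum(axis=1)                      # connectivity of each node
    denom = np.minimum.outer(k, k) + 1.0 - a
    tom = (shared + a) / denom
    np.fill_diagonal(tom, 1.0)
    return tom


# Unweighted example with 5 nodes: edges 0-1, 0-2, 1-2, 1-3; node 4 is isolated.
adj = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (1, 3)]:
    adj[i, j] = adj[j, i] = 1.0

tom = topological_overlap(adj)
print(tom[0, 1])  # connected, and node 0's other neighbor is also node 1's neighbor
print(tom[0, 4])  # not connected and no shared neighbors
print(tom[2, 3])  # not connected, but they share node 1
```

The same function accepts a weighted adjacency matrix, in which case partial connection strengths contribute to the shared-neighbor term.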
21:36Okay.
21:39So, now, once you have the topological overlap matrix,
21:45again, this measure of sort of interconnectedness
21:48accounting for shared neighbors,
21:51then you can use hierarchical clustering
21:53to divide those proteins
21:57into groups based on their similarity,
22:00and this is the results from our analysis.
22:03So sort of on the x-axis,
22:06you have the different proteins, you have the dendrogram,
22:09which represents the hierarchical clustering
22:11of the topological overlap matrix,
22:14and then you have this dynamic tree cut algorithm
22:20which then defines these clusters
22:22which are shown in colors on the bottom based on the tree.
22:26So you see this huge branch down here.
22:28That's gonna be this black cluster.
22:30There's this other cluster over here in green,
22:33and so there's, again, a few more parameters
22:37that you can use to decide how those cuts are made,
22:40and, in some cases, you can sort of merge branches
22:43that have correlation with one another,
22:45and my general advice
22:48for when you're doing this on real data
22:49is to try different values
22:51and see how robust the network is
22:53to choosing different values because, in our case,
22:56it tended to be pretty consistent
22:59where we saw four modules pretty much regardless.
23:02I think if we merged,
23:04if we really cranked up one of the merging parameters,
23:06we would get to three,
23:07but other than that it sort of stayed put.
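
The clustering step can be sketched with SciPy by turning topological overlap into a dissimilarity (one minus overlap) and cutting the resulting tree. Note the substitution: this sketch cuts the dendrogram at a single fixed height, which is simpler than the dynamic tree cut algorithm mentioned in the talk. Average linkage, the cut height, and the toy matrix are all assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def modules_from_tom(tom, cut_height=0.5):
    """Assign module labels by hierarchical clustering on 1 - TOM."""
    dissim = 1.0 - np.asarray(tom, dtype=float)
    np.fill_diagonal(dissim, 0.0)
    condensed = squareform(dissim, checks=False)   # square matrix -> condensed vector
    tree = linkage(condensed, method="average")
    # Fixed-height cut; the study used a dynamic tree cut instead.
    return fcluster(tree, t=cut_height, criterion="distance")


# Toy overlap matrix with two obvious groups of three proteins each.
tom = np.full((6, 6), 0.1)
tom[:3, :3] = 0.9
tom[3:, 3:] = 0.9
np.fill_diagonal(tom, 1.0)

labels = modules_from_tom(tom)
print(labels)
```

Rerunning with different `cut_height` values is a simple version of the robustness check the speaker recommends.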
23:13Okay.
23:15So the next step is trying to get
23:18a numerical summary measure of the groups of proteins
23:22that we've identified from our network.
23:25So from these modules of co-expressed proteins,
23:28we then use, basically, a principal components analysis
23:33to get what we call an eigenprotein
23:35or it was called an eigen gene in the original paper.
23:39What it is is, essentially, a weighted sum
23:43of the values of each of the proteins in the module,
23:47and the weights correspond to sort of how well correlated
23:50that protein is with the overall module.
23:53So if a protein has a high weight in the module,
23:56it means that it's sort of the most interconnected
23:58in the module or sort of best represents the overall module.
24:04So each person is going to have
24:06an eigenprotein value for each module,
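
A minimal sketch of an eigenprotein as the first principal component of the standardized proteins in one module. The simulated module, the function name, and the sign convention (flipping the component so it follows average expression) are assumptions made for illustration.

```python
import numpy as np


def eigenprotein(x):
    """First principal component scores and loadings for one module.

    x has one row per participant and one column per protein."""
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    _, _, vt = np.linalg.svd(z, full_matrices=False)
    loadings = vt[0]                 # one weight per protein
    scores = z @ loadings            # one value per participant
    # The sign of a principal component is arbitrary; align it with mean expression.
    if np.corrcoef(scores, z.mean(axis=1))[0, 1] < 0:
        scores, loadings = -scores, -loadings
    return scores, loadings


rng = np.random.default_rng(0)
latent = rng.normal(size=300)                                   # hypothetical shared signal
module = latent[:, None] + 0.5 * rng.normal(size=(300, 5))      # 5 co-expressed proteins

scores, loadings = eigenprotein(module)
print(scores.shape, loadings.shape)
```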
24:16and when we look at the sort of weights
24:18within each of the modules, so just to sort of orient us,
24:22on the x-axis are each of the module eigen genes
24:27or eigenproteins, and then each sort of bar
24:34on the y is a different protein.
24:36In this case, we're only including
24:39proteins that fall into one of the four modules.
24:42There were, also, if you notice on the last slide,
24:45plenty of proteins that didn't fall into any module
24:48and were sort of the extras, so to speak,
24:51and if you were to expand this down
24:54and include more rows with those,
24:56that would sort of show those, but for purposes of this,
25:00we're just including ones
25:01that fell into at least one of the four,
25:04and each of these bars represents a correlation
25:09between the individual protein
25:11and the overall eigenprotein.
25:13So for these blocks of red,
25:15it's sort of the higher weighted proteins
25:17that are within in this example module one,
25:21module two, three, and four, and then you can see,
25:24if you look sort of laterally from these proteins,
25:28it's the correlation of these proteins
25:30with the other modules.
25:31So the idea being we wanna see sort of blocks of red,
25:36and then not a lot of correlation
25:38between the blocks and other modules,
25:40which is what we see.
25:46All right, now that we've constructed our network,
25:50and we've come up with numerical summary measures
25:52for each of the protein groups that we've identified,
25:56that is sort of the input or the predictor
25:59for these associations with outcomes.
26:02So for the MRI measures, which, again,
26:04are total brain volume, hippocampal volume,
26:07and white matter hyperintensities,
26:09we use just a simple or, you know,
26:12linear regression with covariates,
26:14and then a Cox proportional hazards regression,
26:17we use to predict incident dementia
26:20and, specifically, Alzheimer's type dementia.
26:26These are the regression equations.
26:28Again, these eigenproteins are,
26:30they're sort of one for each module.
26:32So we'll run a separate regression analysis
26:35for modules one, two, three, and four.
26:37We adjust for age and age squared, sex, education.
26:42APOE is a gene that confers a lot of risk
26:44for Alzheimer's disease.
26:45So it's associated with the outcomes,
26:47and we include it as a covariate,
26:49and then a measure of time lag
26:51between when the blood was sampled
26:53and when the MRI was taken to account for any differences
26:57between people or the time difference,
27:01and for dementia, it's a slightly simpler regression equation.
27:06We only adjust for age, sex, and APOE status.
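
The slide equations are not captured in the transcript. Based on the covariates listed, the two models could be written roughly as follows for module m, where ME is the module eigenprotein and Δt is the time between blood draw and MRI. The symbols and the way each covariate is coded are assumptions.

```latex
% Linear model for each MRI measure Y
Y = \beta_0 + \beta_1\,\mathrm{ME}_m + \beta_2\,\mathrm{age} + \beta_3\,\mathrm{age}^2
    + \beta_4\,\mathrm{sex} + \beta_5\,\mathrm{education} + \beta_6\,\mathrm{APOE}
    + \beta_7\,\Delta t + \varepsilon

% Cox proportional hazards model for incident dementia
h(t) = h_0(t)\,\exp\!\left(\gamma_1\,\mathrm{ME}_m + \gamma_2\,\mathrm{age}
    + \gamma_3\,\mathrm{sex} + \gamma_4\,\mathrm{APOE}\right)
```

Each model is fit separately for modules one through four, so the coefficient on the eigenprotein is the quantity being tested in each case.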
27:13All right, so next, I will show
27:17the results in the Framingham Heart Study.
27:21So from the four modules that we tested,
27:24there were two that we identified to have
27:27some association with outcomes.
27:29The first is module two.
27:31I gave it sort of a name clearance and synaptic maintenance,
27:35and I'll talk about how I arrived
27:37at that name for the module in a bit.
27:40It has 165 proteins in it.
27:44Some of the top weighted proteins sort of give an idea
27:47of which ones are sort of most highly weighted
27:51or sort of most correlated with the eigenprotein.
27:56I'll talk about how we got to these
27:59in another slide as well,
28:01but, basically, this is from that
28:02over-representation analysis
28:04where you're trying to identify biological pathways
28:06that are important or overrepresented
28:09by proteins in those modules.
28:12So we have the Axon guidance pathway
28:14was most strongly associated with this module,
28:21and then in terms of relating to outcomes,
28:25total brain volume
28:26was the only significant association that we saw.
28:29So since this is a linear regression,
28:33effect greater than zero means a positive association.
28:37So we see that for larger values
28:40of the eigenprotein for module two,
28:42we saw larger total brain volume.
28:44So it's sort of a protective effect
28:47since brain atrophy is what is the risk factor for dementia,
28:53and then for incident dementia,
28:55we did not see a significant effect
28:56after correcting our p-values
28:58using a Bonferroni correction.
29:00You'll notice that the confidence interval excludes one,
29:04which would be the null value,
29:05and that's just because that's based
29:06on the non-Bonferroni corrected value,
29:10but after testing for or adjusting for the four modules
29:14that we tested, we didn't see a significant association.
29:18It is nice at least that the direction of effect
29:22is what we would expect
29:23based on our total brain volume association,
29:26which is that higher values of M2
29:31correspond to sort of a lower incident dementia occurrence.
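
The point about the confidence interval excluding one while the result is not significant after correction can be illustrated with a small calculation. With four modules tested, the Bonferroni threshold is 0.05 / 4. The z statistic below is a made-up value, not a result from the study.

```python
import math


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


n_tests = 4                          # one test per module
alpha = 0.05
bonferroni_alpha = alpha / n_tests   # 0.0125

z = 2.2                              # hypothetical Wald statistic for a log hazard ratio
p = two_sided_p(z)

# An unadjusted 95% interval excludes the null whenever p < 0.05,
# but the corrected test requires p < 0.0125.
print(p < alpha, p < bonferroni_alpha)
```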
29:38The second module that we found to be associated
29:41with total brain volume was this M4,
29:44which I will call sort of an inflammation-related module.
29:47It had 42 proteins in it.
29:50The highlighted pathway there
29:52was cytokine-cytokine receptor interactions,
29:55so these sort of immune signaling molecules,
29:57and in this case, the association
30:00was in the opposite direction
30:01where higher values of this module four eigenprotein
30:05are associated with lower total brain volume.
30:07So it's sort of a risk conferring module
30:10and, again, similar to what we saw here, not a significant,
30:14sort of an annoyingly borderline association
30:17between this and dementia, but, again,
30:20the direction of effect is what we would expect
30:24based on our observed association with brain volume,
30:29and, also, I'll just mention that I standardize
30:31the eigenprotein so that the effect sizes
30:34correspond to a standard deviation increase in eigenprotein.
30:37So it's a little bit...
30:39One sort of drawback I would say
30:40of these methods is the interpretation
30:43since a standard deviation increase, in this case,
30:47depends entirely on the sample that you're using.
30:49So it's really just sort of a direction of effect
30:54more than anything.
30:56So to try and get at some of, get a better understanding
31:00of how these modules relate to our data
31:03or sort of what may be responsible
31:06for some of the associations we see,
31:08this is a map of the correlations
31:12between different demographic variables
31:15and each of the modules, and I mentioned that we have
31:18a replication cohort as well, the CHS.
31:20So these two bars, sort of the two columns,
31:23show the two different cohorts that were included.
31:28So I put blue arrows to show the covariates
31:31that were included in our regression model,
31:34and you can see that there are some correlations
31:35between, say, sex and the modules,
31:38not really anything with APOE carrier status,
31:42maybe some education associations,
31:44and some associations with age.
31:46So it's good that we adjusted for those in our models.
31:49However, you can also see there are a lot of other factors,
31:53cardiovascular risk factors,
31:54such as systolic blood pressure, BMI,
31:58fasting glucose that have associations with these modules.
32:02So we wanted to see if any of those could perhaps explain
32:05the associations that we saw.
32:10So I'm repeating sort of our standard model here
32:14which is what I showed results from previously.
32:17The expanded model that we considered
32:19included a bunch of these risk factors,
32:23basically, something representing BMI,
32:27hypertension, sort of lipid dysregulation, and diabetes,
32:33and I also included smoking as well,
32:37and we also included a measure of kidney function,
32:40which can also be an indicator of cardiovascular disease.
32:45So for module two,
32:48I'm repeating the sort of effects we saw
32:50from the standard model here,
32:53and when you adjust for the expanded set of covariates,
32:56your effect is attenuated by half,
32:58and it's no longer significantly associated.
33:01So what that says is, either you have
33:04a sort of confounding issue
33:08where the association you're seeing between these proteins
33:12and total brain volume is really just an effect
33:16of sort of poor cardiovascular health
33:20or better cardiovascular health
33:22or you may think that it might be
33:25some sort of mediation effect
33:26where perhaps the risk associated
33:31between the proteins and the sort of total brain volume
33:34could be mediated
33:35by some poor cardiovascular health outcomes,
33:41and then for module four,
33:43again, this sort of inflammation module,
33:45we don't see any real effect attenuation.
33:48Regardless of whether you adjust
33:49for cardiovascular factors or not,
33:52it's still associated with total brain volume,
33:54which suggests it's sort of a different mechanism
33:57or lack of confounding between
33:59or based on cardiovascular health.
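
A small simulation can illustrate the attenuation pattern described for module two. It is built under the confounding scenario only: a hypothetical cardiovascular health variable drives both the module and brain volume, and the module has no direct effect. All variable names and effect sizes are invented. Attenuation of this kind would look the same under mediation, so the comparison alone cannot tell the two explanations apart.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

cv_health = rng.normal(size=n)               # hypothetical confounder
module = cv_health + rng.normal(size=n)      # eigenprotein depends on the confounder
brain_vol = cv_health + rng.normal(size=n)   # outcome depends only on the confounder


def ols_coefs(y, *predictors):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


b_standard = ols_coefs(brain_vol, module)[1]             # module only
b_expanded = ols_coefs(brain_vol, module, cv_health)[1]  # module plus confounder
print(b_standard, b_expanded)
```

In this setup the coefficient on the module shrinks toward zero once the confounder is added, whereas a module like module four, whose association does not depend on the added covariates, would show little change.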
34:05Okay, so I mentioned
34:08in the sort of initial graphical abstract
34:12that once you find protein modules
34:14associated with your outcomes of interest,
34:16it can be good to look within the proteins of those modules
34:19to try and find sort of subsets
34:21or specific proteins that may be driving the associations.
34:26So for modules two and four,
34:27where we found associations with brain volume,
34:30we wanted to see if we removed proteins one at a time
34:35based on their sort of increasing weight,
34:37so remove the lowest weighted proteins in the modules first,
34:42what sort of happened to the strength of the associations.
34:46So these are both associations with total brain volume.
34:49It's sort of the p-value on the y-axis,
34:53and you can see that as you remove, say, from module two,
34:57the first 20 proteins or so,
34:59you're really not seeing a difference
35:01in the effect of the overall module with total brain volume,
35:05which suggests that those proteins
35:07aren't really impacting the association,
35:11whereas beyond that point, once you start removing proteins,
35:15the association becomes less strong,
35:17and so that's suggesting that those proteins
35:20may have more of an impact on sort of the overall module,
35:25and so for both of these modules, we identified the spot
35:29where, sort of based on the lowest p-value,
35:32which proteins were
35:35sort of the most important in the module.
35:37I wanna emphasize that we didn't use this to...
35:41So for things like dementia, if you were to run this,
35:44since we didn't see a strong association
35:47or a significant association beforehand,
35:50we didn't sort of use that to try and find a subset
35:52that were significantly associated
35:54because I would call that cheating.
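The remove-one-at-a-time procedure can be sketched as follows. This is a simplified illustration with assumptions called out: the module score is taken to be a weighted sum of protein values, the strength of association is measured by the absolute Pearson correlation rather than the regression p-value shown in the talk, and the protein names, weights, and data are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def removal_curve(weights, values, outcome):
    """Drop proteins from lowest to highest weight and track the association.

    weights: protein -> module weight
    values:  protein -> list of per-participant values
    Returns a list of (number_removed, abs_correlation) pairs.
    """
    order = sorted(weights, key=weights.get)  # lowest-weighted protein first
    curve = []
    for n_removed in range(len(order)):
        kept = order[n_removed:]
        score = [
            sum(weights[p] * values[p][i] for p in kept)
            for i in range(len(outcome))
        ]
        curve.append((n_removed, abs(pearson(score, outcome))))
    return curve


# Hypothetical module with three proteins measured on six participants.
weights = {"P1": 0.9, "P2": 0.5, "P3": 0.1}
values = {
    "P1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "P2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "P3": [3.0, -3.0, 3.0, -3.0, 3.0, -3.0],
}
outcome = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

curve = removal_curve(weights, values, outcome)
for n_removed, strength in curve:
    print(n_removed, round(strength, 3))
```

A flat stretch at the start of such a curve corresponds to the low-weight proteins that "aren't really impacting the association" in the plot described above.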
36:01Okay, so the last piece that I'll talk about
36:05in terms of teasing apart associations
36:09or sort of understanding proteins within the modules
36:13is this functional enrichment
36:16or over-representation analysis within the modules.
36:20So based on the ones, sort of the significant modules
36:24or significantly associated modules with the outcomes,
36:28there is this software called STRING
36:31that does a few different things, but what I used it for
36:35is doing an over-representation analysis
36:38of biological pathways.
36:41So the idea is that there are annotation databases
36:45for proteins that sort of group them
36:48into biological functions
36:51or pathways that they're involved in,
36:53and the idea is that if you have a module
36:55that has more proteins than you would expect
36:58from a given pathway,
36:59then that's sort of the over-representation piece,
37:02and it indicates that that biological pathway
37:05might be important in whatever functions
37:08the module is carrying out.
37:12So this is just a screen grab of one example.
37:16So this is from module four.
37:18So you can see the annotation database is over on the left.
37:22So KEGG is one of them.
37:24Gene Ontology is another,
37:26and so you have these sort of observed proteins,
37:30and then the background is sort of the total number
37:33of proteins that are in the pathway,
37:36and the idea being that if you were to grab, I don't know,
37:39however many proteins out of the background,
37:41like how many would you expect to be in this module
37:45due to chance, and do we have sort of over-representation
37:49compared to what we would expect?
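One common way to formalize "more proteins than you would expect due to chance" is a hypergeometric tail probability. The sketch below assumes that formulation. The talk does not say which test STRING applies internally, and the counts in the example are hypothetical.

```python
from math import comb


def overrep_pvalue(n_background, n_pathway, n_module, n_overlap):
    """P(X >= n_overlap) when a module of n_module proteins is drawn at random
    from n_background proteins, of which n_pathway belong to the pathway."""
    total = comb(n_background, n_module)
    upper = min(n_pathway, n_module)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_module - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 1000 measured proteins, 50 annotated to a pathway,
# a module of 100 proteins, 15 of which are in the pathway.
expected = 100 * 50 / 1000  # proteins expected in the module by chance
p_value = overrep_pvalue(1000, 50, 100, 15)
print(f"expected={expected:.1f} observed=15 p={p_value:.2e}")
```

Because one such p-value is computed per pathway, the results are usually reported after a multiple-testing correction, which is where a false discovery rate column comes from.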
37:51And so for module four,
37:52the cytokine-cytokine receptor interaction
37:55was the strongest overrepresented pathway,
37:59and then you can sort of look at these others that
38:03have some sort of false discovery rate less than 0.05,
38:08and so I found the KEGG pathways, personally,
38:11to be the most informative.
38:12Gene Ontology tends to be a lot more specific,
38:15which may be more useful for targeting
38:18certain sort of therapeutic processes
38:21or something like that,
38:22but so depending on the scale that is important to you,
38:25you can sort of use different annotations.
38:31Okay, so the last thing I wanted to talk about,
38:33with the Framingham data in particular,
38:38was sort of getting back to our motivation
38:40for doing a network analysis in the first place.
38:43So the sort of contrast or comparator would be to do
38:47individual protein analyses where you're running
38:49a regression model for each protein that you're analyzing,
38:53and so we did that as a point of comparison.
38:55So for total brain volume, there were like a dozen proteins
38:59that were associated with total brain volume.
39:02One was associated with hippocampal volume,
39:04and two were associated with Alzheimer's disease
39:07at an FDR value of less than 0.1.
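For the protein-by-protein comparison, each regression yields one p-value, and those are then adjusted to control the false discovery rate. The talk does not name the adjustment procedure. The sketch below assumes the Benjamini-Hochberg method and uses made-up p-values.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four single-protein regressions.
pvalues = [0.01, 0.04, 0.03, 0.5]
adjusted = benjamini_hochberg(pvalues)
hits = [i for i, q in enumerate(adjusted) if q < 0.1]  # FDR threshold from the talk
print(adjusted, hits)
```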
39:11So what was interesting,
39:14especially with the brain volume results,
39:16and, again, that was where we had seen
39:17associations with these modules,
39:19some of the proteins that were significantly associated
39:23were from module two and module four and others weren't.
39:29So what I get from that is a few things.
39:32One is that some proteins
39:34that are associated with the outcome
39:36are sort of individually associated
39:39but not sort of detectable
39:41within sort of a larger network of proteins
39:44that are associated with that outcome,
39:46and then the other is that
39:48for those that are within the modules,
39:51we would only be getting information
39:53about sort of a few of the proteins in the modules,
39:56whereas, as we see here,
40:00the associations tend or continue to get stronger
40:03with sort of looking at the broader network
40:06around sort of the most highly weighted proteins.
40:09So you're getting a bit more information
40:10about proteins that may be associated
40:13with total brain volume
40:14and maybe at some of the biological processes
40:17compared to if you're looking at things individually,
40:20but, again, because you're seeing associations
40:22that you don't catch with the modules,
40:23it's sort of important to look at both,
40:25and you get sort of complementary information
40:28from the two approaches.
40:33So a caveat,
40:36I mentioned issues with lack
40:37with sort of difficulties in replication.
40:40We replicated this analysis
40:42in the Cardiovascular Health Study,
40:44and we did so by taking the same module,
40:47so module two and module four,
40:50taking the same weights from those proteins
40:52and applying them to the protein concentrations
40:56in the Cardiovascular Health Study.
40:59So we didn't do a network reconstruction or anything
41:02in the different study.
41:03We were just seeing if these modules replicated
41:07in their associations with outcomes in a different cohort.
41:10So in this case, it's really not seeing much
41:14in terms of association with both total brain volume
41:18and we also looked at dementia out of interest
41:22since things were sort of close in our cohort,
41:26but, really, we're not seeing much in terms of associations.
41:31Part of the reason for that,
41:33so there are not that many cohorts
41:36that are available that have a large proteomic panel
41:39with the same proteins that we were looking at
41:41as well as MRI and incident dementia outcomes,
41:45and, in this case, the demographics of the cohort
41:48are fairly different from (indistinct) Framingham.
41:51So about 20 years older on average.
41:56I'm just including the sort of first few rows
41:58of our table one, but you can see differences in education,
42:01systolic blood pressure, and the same is true
42:03of a lot of the other cardiovascular risk factors.
42:06So it's a very different cohort,
42:08and digging a bit into the literature
42:10about sort of proteins over the life course,
42:13it's not too surprising that we don't see
42:16the same associations, but it does sort of,
42:19it's a good cautionary message
42:20about drawing conclusions too far
42:23based on sort of one set of data
42:25or one set of demographics.
42:30Just to put these results in context,
42:32so our module four included
42:36a lot of immune-related signaling molecules
42:38like interleukins, TNF receptor proteins,
42:41which are both types of cytokines, and have been associated
42:44with Alzheimer's disease previously,
42:47in particular, interleukin-1 beta was in our module four,
42:52and it had been found to be elevated
42:53in AD cases in a meta-analysis.
42:56However, other biomarkers that have been sort of validated
43:00in other cohorts were not identified in our module.
43:08In module two, we saw axon guidance pathway proteins
43:11including ephrins, netrins, and semaphorins,
43:13which have been associated with AD in previous work,
43:17and complement cascades also have been associated
43:20with AD probably for the reason
43:22of inducing these immune cells called microglia
43:27in the brain to, basically, eat up
43:31cells in response to amyloid deposition.
43:35So there's some biologically plausible mechanisms
43:37that could be associated with these modules
43:42in Alzheimer's disease,
43:46and the last thing I'll say is talking about some sort
43:49of other ways of approaching this problem,
43:51so as I mentioned, the CHS cohort
43:54has different underlying characteristics,
43:56and so it may well have a different network structure.
43:59So one thing that could be good to do
44:02is to look at sort of consensus modules across the cohorts
44:07where you construct networks in each cohort,
44:09and then look at where the overlaps are,
44:12and you can get sort of a more,
44:14hopefully, more robust network across cohorts,
44:18and then there are other network-based approaches
44:20that can incorporate external information.
44:22So, again, our network approach
44:24was just based on correlation in our dataset,
44:27whereas other methods use sort of those annotation databases
44:33and that sort of thing to construct the networks
44:35and sort of decide how strong the similarities between nodes
44:39or the strength of connections will be.
44:41So that's another approach,
44:43and then the last thing I'll say is that
44:45I'm sort of still using this kind of method
44:48now in work with longevity and aging
44:51and trying to apply it to metabolomics,
44:54so metabolites data in cohorts related to those outcomes.
45:02So thank you all for being here.
45:04Thank you, my collaborators.
45:06This is the folks down at UT.
45:09I'll say that (indistinct).
45:11Thank you.
45:16<v ->Thank you for the wonderful presentation.</v>
45:18We're open for questions.
45:20So let's start with people in the room.
45:21Any questions?
45:23<v ->Got one over here.</v> <v ->Perfect, thank you.</v>
45:25<v Audience>Yeah, so my research interest</v>
45:26is about the cancer, and, also,
45:28we're interested in your study.
45:30So I've got some technical issues about this project.
45:35So the first issue that,
45:36how do you do the normalization in your process?
45:41<v ->Yeah, great question.</v>
45:42So yeah, I totally glossed over
45:44all the pre-processing stuff.
45:47So before doing the network construction,
45:51I log transformed the protein concentrations
45:54to reduce skewness.
45:56There was a standardization within,
45:58there were sort of two phases of runs of protein modules,
46:02so I sort of standardized within those batches,
46:06and then after that, I did a rank normalized
46:11or inverse normal rank transformation to sort of-
46:16(audience speaks indistinctly) <v ->What's that?</v>
46:17<v ->(indistinct) normalization?</v> <v ->Basically.</v>
46:19Yeah, yeah, yeah.
46:20So that was sort of the data pre-processing.
46:23So I think I, you know,
46:26I've thought about sort of the pros and cons
46:28of those things as well and I think my biggest qualm
46:31with the way that I did it is sort of interpretability,
46:34because, yeah, sort of what does it mean
46:37to be at one quantile versus another
46:39where you have this huge dynamic range
46:40of protein concentrations?
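The three pre-processing steps described in this answer (log transform, standardization within batch, inverse normal rank transformation) can be sketched as below. The details are assumptions: a natural log, a population standard deviation, a rank offset of 0.5, no handling of tied values, and hypothetical concentrations and batch labels. The talk does not specify any of these choices.

```python
from math import log
from statistics import NormalDist, mean, pstdev


def standardize_within_batches(values, batches):
    """Center and scale values separately within each batch label."""
    out = [0.0] * len(values)
    for batch in set(batches):
        idx = [i for i, b in enumerate(batches) if b == batch]
        mu = mean(values[i] for i in idx)
        sd = pstdev([values[i] for i in idx])  # assumes the batch is not constant
        for i in idx:
            out[i] = (values[i] - mu) / sd
    return out


def inverse_normal_rank(values, offset=0.5):
    """Map ranks to standard normal quantiles; ties are broken by position."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        out[i] = NormalDist().inv_cdf((rank - offset) / n)
    return out


# Hypothetical concentrations for one protein, measured in two batches.
concentrations = [1.0, 10.0, 100.0, 2.0, 8.0, 200.0]
batches = [1, 1, 1, 2, 2, 2]

logged = [log(c) for c in concentrations]
standardized = standardize_within_batches(logged, batches)
transformed = inverse_normal_rank(standardized)
print([round(t, 3) for t in transformed])
```

After the last step only the ordering of participants survives, which is the interpretability concern raised above: a one-unit difference no longer corresponds to any fixed difference in concentration.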
46:42<v Audience>So another question is that</v>
46:44I know that in your project,
46:46the modules identification is very important.
46:48So I wonder,
46:53you have talked a little bit
46:54about how to answer the modules,
46:57but so can you explain a little bit more
47:00about how you gonna bring modules from the data?
47:08<v ->I'm not sure, can you say a little bit more?</v>
47:11<v Audience>Yeah, so in your previous pages,</v>
47:13I think you talked a little bit about the clustering
47:17of the modules so that we know
47:18that there are four main modules.
47:22<v ->Yes.</v> <v ->In the whole dataset.</v>
47:24So what is the name of that algorithm
47:28and how does it basically work?
47:31<v ->Yeah, so the clustering itself was done</v>
47:36using an algorithm called H+.
47:41To be honest, I'm not too sure
47:43about sort of the details of it.
47:45It can use any dissimilarity measure,
47:48which, in our case, comes from the TOM matrix, but-
47:52<v Audience>So this is the algorithm that we separate</v>
47:55the whole proteins into four different modules
47:58so that we can analyze it one by one.
48:00<v ->Yeah, yeah, yeah, yeah.</v> <v ->Yeah,</v>
48:01so I also noticed that
48:07in the weighted protein expression network analysis,
48:13you talk about the beta values.
48:16<v ->Yes.</v> <v ->That you use that</v>
48:20like the soft threshold. <v ->Yeah.</v>
48:23<v Audience>To make the genes to be more important</v>
48:28if that is the thing that you wanna analyze.
48:31So in this process, I want to know how you would make sure
48:35the value of the beta in this process.
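The soft threshold and the TOM matrix mentioned in this exchange can be sketched as follows, using the standard unsigned definitions from weighted correlation network analysis: adjacency is the absolute correlation raised to the power beta, and topological overlap combines the direct connection with shared neighbors. The correlation matrix and the value of beta here are arbitrary. The session ended before the speaker said how beta was chosen.

```python
def soft_threshold_adjacency(cor, beta):
    """Unsigned adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(cor)
    return [
        [0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
        for i in range(n)
    ]


def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij)."""
    n = len(adj)
    connectivity = [sum(row) for row in adj]  # k_i, total connection strength
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                shared = sum(
                    adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j
                )
                denom = min(connectivity[i], connectivity[j]) + 1.0 - adj[i][j]
                tom[i][j] = (shared + adj[i][j]) / denom
    return tom


# Hypothetical correlations among three proteins; beta chosen arbitrarily.
cor = [
    [1.0, 0.8, 0.6],
    [0.8, 1.0, 0.5],
    [0.6, 0.5, 1.0],
]
adj = soft_threshold_adjacency(cor, beta=2)
tom = topological_overlap(adj)
dissimilarity = [[1.0 - t for t in row] for row in tom]  # input to clustering
print(round(tom[0][1], 3))
```

Raising correlations to a power above 1 shrinks weak correlations much more than strong ones, so larger values of beta push the network toward its strongest connections without applying a hard cutoff.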
48:39<v ->So sorry, we have to end 'cause it's 12:15.</v>
48:42I know others have classes and everything.
48:44Maybe you guys can discuss a little bit.
48:46<v ->Yeah, (indistinct), yeah.</v> <v ->Maybe if you have time.</v>
48:48Please, if you're registered,
48:49make sure you signed in on a sign in sheet.
48:51There's three of 'em.
48:52You only have to sign on one of them,
48:54and then one-fourth page reflections will be due
48:57before the next speaker's time to speak.
48:59(indistinct talking)