YSPH Biostatistics Seminar: “Addressing the Replicability and Generalizability of Clinical Prediction Models”
September 08, 2021
Naim Rashid, PhD
Associate Professor, Department of Biostatistics
University of North Carolina at Chapel Hill
September 7, 2021
Information
- ID
- 6891
Transcript
- 00:00<v Robert>Hi, I'm Professor McDougal,</v>
- 00:06and Professor Wayne is also in the back.
- 00:08If you haven't signed in, please make sure that you pass
- 00:11this around and get a chance to sign the sign-in sheet.
- 00:15So today we are very, very privileged to be joined
- 00:19by Professor Naim Rashid
- 00:22from the University of North Carolina Chapel Hill,
- 00:25Professor Rashid got his bachelor's in biology from Duke,
- 00:30and his PhD in biostatistics from UNC Chapel Hill.
- 00:35He's the author of 34 publications, and he holds a patent
- 00:40on methods and compositions for prognostic
- 00:44and/or diagnostic subtyping of pancreatic cancer.
- 00:48He's currently an associate professor at UNC Chapel Hill's
- 00:51department of biostatistics, and he's also affiliated
- 00:54with their comprehensive cancer center there.
- 00:59With that, Professor Rashid, would you like to take it away?
- 01:04<v ->Sure.</v>
- 01:06It looks like it says host disabled screen sharing.
- 01:10(chuckling)
- 01:12<v Robert>All right, give me one second.</v>
- 01:14Thank you.
- 01:17I'm trying to do.
- 01:27(indistinct)
- 01:34Okay, you should be, you should be able to come on now.
- 01:36<v ->All right.</v>
- 01:39Can you guys see my screen?
- 01:44All right.
- 01:48Can you guys see this?
- 01:50<v Robert>There we go.</v>
- 01:52Perfect. Thank you.
- 01:53<v ->Okay, great.</v>
- 01:54So yes, thanks to the department for inviting me to speak
- 01:57today, and also thanks to Robert and Wayne for organizing.
- 02:01And today I'll be talking about issues regarding
- 02:04replicability in terms of clinical prediction models,
- 02:08specifically in the context of genomic prediction models,
- 02:12derived from clinical trials.
- 02:16So as an overview, we'll be talking first a little bit
- 02:18about the problems of replicability in general,
- 02:21in scientific research, and also about specific issues
- 02:24in genomics itself, and then I'll be moving on to talking
- 02:28about a method that we've proposed to assist
- 02:31with issues regarding data integration, and learning
- 02:34in this environment when you have heterogeneous data sets.
- 02:38I'll talk a little bit about a case study
- 02:40where we apply these practices to subtyping
- 02:43pancreatic cancer, touch on some current work
- 02:45that we're doing, and then end
- 02:47with some concluding thoughts.
- 02:48And feel free to interrupt, you know,
- 02:50as the talk is long, if you have any questions.
- 02:54So I'm now an associate professor in the department
- 02:56of biostatistics at UNC.
- 02:58My work generally involves problems
- 03:00surrounding cancer and genomics, and more recently
- 03:05we've been doing work regarding epigenomics.
- 03:07We just recently published a Bioconductor package called
- 03:09epigraHMM for consensus and differential peak calling,
- 03:13and we've also done some work in model-based clustering.
- 03:15We published a package called FSCseq,
- 03:18which helps you derive and discover clusters
- 03:22from RNA-seq data, while also determining
- 03:25cluster-specific genes.
- 03:26And today we'll be talking more about the topic
- 03:28of multi-study replicability, which is the topic
- 03:30of a paper that we published a year or two ago,
- 03:34and an R package that we've developed more recently,
- 03:37implementing some of the methods.
- 03:40So before I get deeper into the talk, one of the things
- 03:43I wanted to establish is this definition
- 03:45of what we mean by replicability.
- 03:47You might've heard the term reproducibility as well,
- 03:50and to make the distinction between the two terms,
- 03:52I'd like to define reproducibility in a way
- 03:54that Jeff Leek has defined in the past,
- 03:57where reproducibility is the ability to take
- 03:59code and data from a publication, and to rerun the code,
- 04:03and get the same results as the original publication.
- 04:06Whereas replicability we're defining as the ability to run
- 04:09an experiment, generating new data, and get results
- 04:11that are quote, unquote "consistent"
- 04:14with that of the original study.
- 04:16So in this sort of context, when it comes to replicability,
- 04:19you might've heard about publications that have come out
- 04:22in the past that talk about how there are issues
- 04:24regarding replicating the research that's been published
- 04:28in the scientific literature.
- 04:30This one paper in PLOS Medicine was published
- 04:32by Ioannidis in 2005, and there's been a number
- 04:36of publications that have come out since,
- 04:38talking about problems regarding replicability,
- 04:41and ways that we could potentially address it.
- 04:43And the problem has become large enough where it has
- 04:46its own Wikipedia entry talking about the crisis,
- 04:49and has a long list of examples that talks
- 04:51about issues regarding replicating results
- 04:54from the scientific studies.
- 04:55So this is something that has been a known issue
- 04:58for a while, and these problems also extend
- 05:00to situations where you want to, for example,
- 05:03develop clinical prediction models in genomics.
- 05:06So to give an example of this, let's say that we wanted to,
- 05:10in the population of metastatic breast cancer patients,
- 05:13we wanted to develop a model that predicts
- 05:16some clinical outcome Y, given a set
- 05:18of gene expression values X.
- 05:21And so the purpose of this sort of exercise is
- 05:23to hopefully translate this sort of model
- 05:26that we've developed, and apply it to the clinic,
- 05:28where we can use it for clinical decision-making.
- 05:31Now, if we have data from one particular trial
- 05:35that pertains to this patient population,
- 05:37and the same clinical outcome being measured,
- 05:39in addition to having gene expression data,
- 05:41let's say that we derived a model, let's say
- 05:43that we're modeling some sort of binary outcome,
- 05:44let's say tumor response.
- 05:46And in this model, we used a penalized
- 05:48logistic regression model
- 05:51that we fit to the data to try and predict the outcome,
- 05:54given the gene expression values.
- 05:56And here we obtained, let's say, 12 genes
- 05:59after the fitting process, and the internal model 1 AUC
- 06:04on the training subjects is 0.9.
- 06:07But then let's say there's another group at Duke
- 06:09that's using data from their clinical trial,
- 06:11and they have a larger sample size.
- 06:13They also found more genes, 65 genes,
- 06:16but have a slightly lower training AUC.
- 06:18However, we really need to use external validation
- 06:22to sort of get an independent assessment of how well
- 06:25each one of these alternative models is doing.
- 06:27So let's say we have data from a similar study from Harvard,
- 06:30and we applied both these train models
- 06:33to the genomic data from that study at Harvard.
- 06:35We have the outcome information for those patients as well,
- 06:38so we can calculate how well the model predicts
- 06:42on those validation subjects.
- 06:44And we find here in this data set,
- 06:46model 2 seems to be doing better than model 1,
- 06:49but if you try this again with another data set
- 06:51from Michigan, you might find that model 1 is doing
- 06:53better than model 2.
- 06:55So the problem here is where we have researchers
- 06:58that are pointing fingers at each other,
- 06:59and it's really hard to know, "Well, who's right?"
- 07:01And why is this even happening in the first place,
- 07:04in terms of why do we get different genes, numbers of genes,
- 07:06and each of the models derived from study 1 and study 2?
- 07:09And why are we seeing very low performance
- 07:12in some of these validation datasets?
- 07:15So here's an example from 2014,
- 07:17in the context of ovarian cancer.
- 07:20The authors basically collected 10 studies,
- 07:22all were microarray studies.
- 07:25The goal here was to predict overall survival
- 07:27in this population of ovarian cancer patients,
- 07:30given gene expression measurements
- 07:32from this microarray platform.
- 07:34So through a series
- 07:35of really complicated cross-study normalization approaches,
- 07:39the data was normalized, and harmonized
- 07:40across the studies, using a combination of ComBat
- 07:43and frozen RMA, and then they took
- 07:4614 published prediction models in the literature,
- 07:48and they applied each of those models to each
- 07:51of the subjects from these 10 studies, and they compared
- 07:53the model predictions across each subject.
- 07:58So each column here in this matrix is a patient,
- 08:00and each row is a different prediction model,
- 08:03and each cell represents the prediction
- 08:06from that model on that patient.
- 08:08So in an ideal scenario, where we have the models generalizing
- 08:12and replicating across each of these individuals,
- 08:14we would expect
- 08:16each column here to have the same color value,
- 08:19meaning that the predictions are consistent.
- 08:20But clearly we see here that the predictions are
- 08:22actually very inconsistent,
- 08:24and very different from each other.
- 08:27In addition, if you look
- 08:28at the individual risk prediction models
- 08:30that the authors used, there was also
- 08:32substantial differences in the genes
- 08:34that were selected in each of these models.
- 08:36So there's a max 2% overlap in terms of common genes
- 08:40between each of these approaches.
- 08:41And one thing to mention here is that each one
- 08:43of these risk-prediction models was derived
- 08:45from separate individual studies.
- 08:48So the question here is, you know, how exactly,
- 08:51if you were a clinician, you're eager to sort of take
- 08:54the results that you're seeing here,
- 08:57and extend to the clinic,
- 08:58which model do you use, which is right?
- 09:01Why are you seeing this level of variability?
- 09:03This is, of course, concerning if your goal is
- 09:06to move things towards the clinic, and this also has
- 09:08implications in terms of, you know, getting in the way
- 09:11of trying to approve the use of some
- 09:13of these assays for clinical use.
- 09:17So why is this happening?
- 09:19So there have been a lot of studies
- 09:22that have tied these issues to, obviously, sample size
- 09:24in the training studies, smaller sample sizes,
- 09:27and models trained on them may lead to more unstable models,
- 09:31or less accurate models.
- 09:32Between different studies, you might have
- 09:35different prevalences of the clinical outcome.
- 09:36In some studies, you might have higher levels of response,
- 09:39and other studies, you might have lower levels of response,
- 09:40for example, if you have this binary clinical outcome,
- 09:43and also there's issues regarding differences
- 09:46in lab conditions, where the genomic data was extracted.
- 09:49We've seen at Lineberger that, depending on the type
- 09:52of RNA extraction kit that you use,
- 09:55you might see differences in the expression of a gene,
- 09:58even from the same original tumor.
- 10:00And also the issue of batch effects,
- 10:02which has been widely talked about in the literature,
- 10:04where depending on the day you run the experiment,
- 10:06or the technician who's handling the data,
- 10:11you might see slight differences,
- 10:12technical differences in expression.
- 10:15There's also differences due to protocols.
- 10:17Some trials might have different inclusion
- 10:18and exclusion criteria, so they might be recruiting
- 10:21a slightly different patient population,
- 10:22even though they might be all
- 10:24in the context of metastatic breast cancer.
- 10:25All of these things can help impart heterogeneity
- 10:29in the genomic data and the outcome data
- 10:34across different studies.
- 10:36In the context of genomic data in particular,
- 10:39there's also this aspect of data preprocessing.
- 10:41The normalization technique that you use is very important,
- 10:45and we'll talk about that in a little bit.
- 10:47And it's a very critical part when it comes
- 10:48to training models, and trying to validate your model
- 10:52on other datasets, and depending on the type
- 10:54of normalization you use, this could also impact
- 10:58how well your model works.
- 11:00In addition, there's also differences in the potential way
- 11:03in which you measure gene expression.
- 11:04Some trials might use an older technology called microarray.
- 11:07I know other trials might use something
- 11:09relatively more recent called RNA-seq,
- 11:11or a particular trial might use
- 11:13a more targeted platform like NanoString.
- 11:15So the differences in platform also can lead to differences
- 11:19in your ability to help validate some of these studies.
- 11:21If you train something on microarray, it's very difficult
- 11:24to take that model, and apply it to RNA-seq,
- 11:26because the expression values are just different.
- 11:30And so, as I mentioned before, this also ties
- 11:32into the impact of normalization on model performance as well.
- 11:37So the main thing to remember here is that
- 11:40the traditional way in which prediction models
- 11:43based on genomic data are trained for clinical use is
- 11:46typically based on the results from a single study.
- 11:52To talk a little bit more about the question
- 11:54of between-study normalization, and the purpose of this is
- 11:57to put the expression data on basically an even scale,
- 12:00which helps facilitate training.
- 12:02If there's global shifts in some of the expression values
- 12:06in one sample versus another, it's very difficult to train
- 12:09an accurate model in that particular scenario.
- 12:11So normalization helps to align
- 12:13the expression you get from different samples,
- 12:16and hopefully across studies as well.
- 12:19And so the goal here is to eventually predict this outcome
- 12:23in a new patient, you plug in the genomic data
- 12:25from a new patient in order to get the predicted outcome
- 12:28for that patient based on that training model.
- 12:30So, in order to do that, you also have to normalize
- 12:34the new data to the training data, right?
- 12:36Because you also want to put the new data on the same scale
- 12:38as a training data, and in the ideal scenario,
- 12:41you would want to make sure that the training samples
- 12:44that you use to train your original model are untouched,
- 12:47because what some people try to do is they try
- 12:49to sort of sidestep this normalization issue,
- 12:52they would combine the new data with the old training data,
- 12:55and renormalize everything at once.
- 12:57And the problem with this is that this changes
- 12:59your training sample values, and in a sense,
- 13:01would necessitate the fact that you need to retrain
- 13:04your old model again.
- 13:04And this leads to instability
- 13:07over time in terms of the model itself.
- 13:10So in the prior example from ovarian cancer,
- 13:12this is not as big of an issue, because you have
- 13:15all the data you want to work with in hand.
- 13:18This is a retrospective study, you have 10 data sets,
- 13:20so you just normalize everything at the same time,
- 13:22using ComBat and frozen RMA.
- 13:24And so you can split up those studies into separate training
- 13:27and test studies, and they're all on the same scale.
- 13:31But the problem is that in practice, you're trying to do
- 13:34a prospective type of analysis, where when you train
- 13:37your model, you're normalizing all of the available studies
- 13:40you have, let's say, and then you use that to predict
- 13:44the outcome in a future patient, or a future study.
- 13:47And so the problem with that is that you have to find
- 13:51a good way to align, as I mentioned before,
- 13:55the data from that future study to your training samples,
- 13:57and that may not be an easy task to do,
- 14:00especially for some of the newer platforms like RNA-seq.
- 14:04So taking this problem a step further,
- 14:06what if there's no good cross study normalization approach
- 14:10that's available to begin with?
- 14:12This really is going to make things difficult in terms
- 14:15of training the model in the first place.
- 14:18Another more complicated problem is that you might have
- 14:21different types of platforms at that training time.
- 14:24For example, you might have the only type of data
- 14:26that's available from one study is NanoString in one case,
- 14:29and in another study it's only RNA-seq, so what do you do?
- 14:33And looking forward, as platforms change,
- 14:35as technology evolves, you have different ways
- 14:36of measuring gene expression, for example.
- 14:42So what do you do with the models that are trained
- 14:44on old data, because you can't apply them to the new data?
- 14:48So oftentimes you find this situation
- 14:50where you have to retrain new models on these new platforms,
- 14:53and the old models are not able to be applied
- 14:57directly to these new data types.
- 14:58So that leads to waste here.
- 15:01So if you take all of these problems together,
- 15:03regarding cross-study normalization,
- 15:07and changes in platform,
- 15:09and a lot of the other issues, you know,
- 15:11regarding replicability that I mentioned,
- 15:13it's no wonder that there's only a small handful
- 15:17of expression-based, clinically applicable assays that have been
- 15:21approved by the FDA, like Oncotype DX, MammaPrint,
- 15:24and Prosigna, because this is a very, very tough problem.
- 15:30So I want to move on with that, to an approach
- 15:33that we proposed to help tackle this sort of issue
- 15:36by using this idea of multi-study learning,
- 15:39where instead of just using, and deriving, and generating
- 15:43models from individual studies, we combine data
- 15:45from multiple studies together, and create a consensus model
- 15:48that we use for prediction, which will hopefully be
- 15:50more stable, and more accurate down the road.
- 15:54So this approach of combining data is called
- 15:56horizontal data integration, where we're merging data
- 15:59from let's say K different studies.
- 16:01And the pro of this approach is that we get increased power,
- 16:04and the ability to reach some sort of consensus
- 16:06across these different studies.
- 16:09The negative is that the effect of a gene
- 16:12and its relationship to outcome may actually vary
- 16:14across studies, and also, you know,
- 16:16the way that the genes were normalized may also vary
- 16:19across studies too if we're using published data
- 16:21from some prior publication.
- 16:24There's also this issue of sample size imbalance.
- 16:25You might have a study that has 500 subjects,
- 16:28and another one that might have 200 subjects.
- 16:30So there are some methods that were designed to account for
- 16:34between-study heterogeneity after you do
- 16:36horizontal data integration.
- 16:38One is called the meta-lasso, another is called
- 16:41the AW statistic, but these two methods don't really have
- 16:44any prediction aspect about them.
- 16:46They're more about feature selection.
- 16:48Ensembling is one approach that can directly account
- 16:50for between-study heterogeneity
- 16:52after horizontal data integration, but there's
- 16:54no explicit future selection step here.
- 16:57But all of these approaches assume
- 16:59that the data has been pre-normalized.
- 17:02As we talked about before,
- 17:03for prospective decision-making, based off a train model,
- 17:07that might be prohibitive in some cases,
- 17:10and we need a strategy also to easily predict
- 17:13and apply these models in new patients.
- 17:20Okay, so moving on, we're going to talk first
- 17:24about this issue of how do we integrate data,
- 17:27and sort of sidestep this normalization problem
- 17:30at training time, and also at test time
- 17:33when we try to predict in new subjects?
- 17:35So the approach that we put forth is to use
- 17:39what's called top scoring pairs, which you can think of
- 17:41as a rank-based transformation of the original set
- 17:45of gene expression values from a patient.
- 17:47So the idea here originally,
- 17:50when top scoring pairs were introduced,
- 17:51was you're trying to find a pair of genes
- 17:53where it's such that if the expression of gene A
- 17:56in the pair is greater than gene B, that would imply
- 17:59that the, let's say, the subtype for that individual is,
- 18:03say, subtype one, and if it's less,
- 18:05then that implies subtype zero with high probability.
- 18:09Now, in this case, this sort of approach was developed
- 18:12with when one has a binary outcome variable
- 18:14that you care about.
- 18:15In this case, we're talking about subtype,
- 18:17but it could also be tumor response or something else.
- 18:20So essentially what you're doing is that you're taking
- 18:22these continuous measurements in terms of gene expression,
- 18:25or integer, and you are converting that, transforming
- 18:31that into basically a binary predictor,
- 18:32which takes on the value of the zero or one.
- 18:34And the hope is that that particular transformed value is
- 18:38going to be associated with this binary outcome.
- 18:41So the simple assumption in this scenario is
- 18:44that the relative rank of these genes
- 18:46in a given sample is predictive of subtype, and that's it.
- 18:51And so the example here I have on the right is an example
- 18:54of two genes, GSTP1 and ESR1.
- 18:58And so you can see here that if you're
- 19:00in the upper left quadrant, this is where this gene is
- 19:02greater than this gene expression, it's implying
- 19:05the triangle subtype with high probability,
- 19:08and otherwise it implies the circle subtype.
- 19:11So that's the general idea of what we're going for here.
- 19:14It's a sort of a rank-based transformation
- 19:16of the original continuous predictor space.
- 19:21So the nice thing about this approach,
- 19:22because we're only based on the simple assumption, right?
- 19:25That we're only caring about the relative rank
- 19:27within a subject, this makes
- 19:29this particular new transformed predictor
- 19:32relatively invariant to batch effects, pre-normalization,
- 19:36and it also most importantly, simplifies merging data
- 19:39from different studies.
- 19:41Everything is now on the same scale, zero to one,
- 19:43so it's very easy to paste together the data
- 19:45from different studies, and we can sidestep this problem
- 19:50of trying to pick a cross-normalization approach,
- 19:53and then work in this sort of transformed space.
- 19:57The other nice thing is that this is easily computable
- 19:59for new patients as well.
- 20:01If you have a new patient that comes into clinic,
- 20:03you just check to see whether gene A is
- 20:04greater than gene B in terms of expression,
- 20:06and then you have your value for this top scoring pair,
- 20:11and we don't have to worry as much about normalizing
- 20:14this patient's raw gene expression data
- 20:18to the training sample expression values.
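As a concrete illustration of the rank rule just described, here is a minimal R sketch; the two gene names echo the GSTP1/ESR1 example from the slide, and the expression values are made-up placeholders.

```r
# Minimal sketch of the top-scoring-pair (TSP) rule for one new patient.
# 'expr' holds the patient's raw expression values; the numbers here are
# illustrative, not real measurements.
expr <- c(GSTP1 = 7.2, ESR1 = 5.9)

# The TSP indicator only asks whether one gene outranks the other within
# the same sample, so no between-study normalization is required.
tsp_value <- as.integer(expr["GSTP1"] > expr["ESR1"])
tsp_value  # 1 implies one subtype, 0 the other, with high probability
```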
- 20:21So essentially what we're doing here is
- 20:23enumerating all possible gene pairs
- 20:26from a set of candidate genes, and each column here
- 20:28in this matrix shown on the right pertains
- 20:31to the zero-one values for a particular gene pair j.
- 20:34And so this value takes the value of one if gene A is greater
- 20:38than gene B in sample i for pair j, and zero otherwise.
- 20:41And then we merge over the common top scoring pairs.
- 20:46So in this example have data from four different studies,
- 20:49each indicator by a different color here
- 20:50in the first track, and this data pertains to data
- 20:54from two different platforms,
- 20:55and three different cancer types.
- 20:56And so the clinical outcome here is binary subtype,
- 20:59which is given by the orange and the blue color here.
- 21:02So you can see here that we enumerated the TSPs,
- 21:05we merged the data together, and now we have
- 21:07this transformed predictor matrix.
- 21:09And the interesting thing is
- 21:10that you can definitely see some patterning here.
- 21:13Within any study, you have a particular set of TSPs
- 21:15that take a value of one when the subtype is blue,
- 21:19and flip when it's orange.
- 21:21And we see the same general pattern seems to replicate
- 21:24across different studies,
- 21:25but not every top scoring pair changes the same way
- 21:29across different studies.
- 21:32So if we cluster the rows here, we can also see
- 21:35some patterns sort of persist where we see
- 21:38some clustering by subtype,
- 21:40but also some clustering by study as well.
- 21:42And so what this implies is that there's a relationship
- 21:45between TSPs and subtypes, and that can vary across studies,
- 21:47which is not too different from what we've talked
- 21:50about regarding the issues we've seen
- 21:51in replicability in the past.
- 21:53So ideally we would like to see a particular gene pair,
- 21:57or TSP vector here take on a value of one,
- 22:01only when there's the orange subtype,
- 22:03and zero in the blue subtype, or vice versa.
- 22:05And we wanted to see this pattern replicated
- 22:07across patients in studies, but we see obviously
- 22:10that that's not the case.
- 22:12So now that we've sort of introduced,
- 22:15or proposed, this sort of approach to simplify
- 22:17data merging and normalization,
- 22:19the question that we're now dealing
- 22:20with is, well, how do we actually find
- 22:22features that are consistent across different studies
- 22:26in their relationship with outcome, and also estimate
- 22:29their study-level effect, and then use them for prediction?
- 22:33So that leads us to the second part of our paper,
- 22:35where we developed a model to help select
- 22:39these particular study-consistent features
- 22:42while accounting for study-level heterogeneity.
- 22:47So to sort of illustrate the idea behind this,
- 22:49let's just start with a simple simulation
- 22:52where we're not doing any normalization,
- 22:54we're not worrying about resuming, everything's fine
- 22:56in terms of the expression values,
- 22:59and we're not doing any selection,
- 23:00no TSP transmission either.
- 23:03So we're going to assimilate data pertaining
- 23:05to two, let's say, known biomarkers
- 23:06that are associated with binary subtype.
- 23:09We're going to generate K datasets,
- 23:11and we're going to try three different strategies
- 23:12for fitting a prediction model to these data sets.
- 23:15And at the end, we're going to validate each of those models
- 23:18on an externally-generated data set
- 23:19to compare their prediction performance.
- 23:22So to do this, we're going to assume for each study
- 23:25that we can fit a logistic regression model
- 23:28to model binary outcome with these two predictors,
- 23:31and in generating these K data sets,
- 23:32we're going to vary the number of data sets K.
- 23:35So we might generate two, five, or 10 training data sets,
- 23:38and also change the total sample size of each one,
- 23:40and make the sample sizes imbalanced
- 23:42across the different studies, and then assume
- 23:45values for the coefficients for each of these predictors
- 23:50to be these values here, and lastly, to induce some sort
- 23:53of heterogeneity across the different training datasets,
- 23:56we're gonna add in sort of like a random value drawn
- 23:59from a normal distribution, where we're assuming
- 24:03this level of variance for this value.
- 24:05So basically we're just injecting heterogeneity
- 24:07into this data generation process.
- 24:09So after we generate the training studies,
- 24:11then we're going to apply three different ways
- 24:13or strategies to the training data.
- 24:15The first is the individual study approach,
- 24:17which we've talked about before, where you train
- 24:20a generalized model separately for each study.
- 24:22The second approach is where you merge the data.
- 24:25Again, we're ignoring the normalization problem here
- 24:26in simulation, obviously, and then train a single GLMM
- 24:30for the combined data, and then lastly,
- 24:32we're going to merge the data, and train
- 24:34a generalized linear mixed model,
- 24:35where we explicitly account for a random intercept,
- 24:38and a random slope for each predictor,
- 24:41assuming, you know, a study-level random effect.
- 24:45So after we do that, we'll generate a validation dataset
- 24:48from the same approach above, and then predict outcome
- 24:52in this validation dataset with respect
- 24:55to the models derived from each of these three strategies.
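The following R sketch mirrors the three strategies just described, using lme4's glmer for the mixed model; K, the sample sizes, the coefficient value of 2, and the heterogeneity standard deviation are illustrative stand-ins for the talk's simulation settings.

```r
# Sketch: K studies, two biomarkers, study-level random shifts in the
# coefficients, then three fitting strategies.
library(lme4)

set.seed(1)
K <- 5; n_k <- 100
sim_study <- function(k, sd_re = 0.5) {
  x1 <- rnorm(n_k); x2 <- rnorm(n_k)
  b1 <- 2 + rnorm(1, 0, sd_re)   # study-perturbed coefficient for x1
  b2 <- 2 + rnorm(1, 0, sd_re)   # study-perturbed coefficient for x2
  y  <- rbinom(n_k, 1, plogis(b1 * x1 + b2 * x2))
  data.frame(y, x1, x2, study = factor(k))
}
dat <- do.call(rbind, lapply(1:K, sim_study))

# Strategy 1: one GLM per individual study
fits_ind <- lapply(split(dat, dat$study),
                   function(d) glm(y ~ x1 + x2, binomial, data = d))

# Strategy 2: merge the studies and fit a single GLM
fit_glm <- glm(y ~ x1 + x2, binomial, data = dat)

# Strategy 3: merge and fit a GLMM with a random intercept and random
# slopes per study, capturing the between-study heterogeneity
fit_glmm <- glmer(y ~ x1 + x2 + (x1 + x2 | study),
                  data = dat, family = binomial)
```

Each fitted model would then be applied to an independently generated validation data set to compare prediction error, as described above.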
- 24:59So if we look at the individual strategy performance,
- 25:01where we fit a GLM, a logistic regression model,
- 25:04separately for each study, and then apply it
- 25:06to this validation data set, we can check
- 25:08the prediction accuracy, we can find that,
- 25:11due to the induced level of heterogeneity
- 25:14between studies in predictor effects,
- 25:16in one study, we do really poorly,
- 25:18and another study we do really well,
- 25:20and this variation is entirely due to variations
- 25:24in the gene subtype relationship.
- 25:27And these predictions obviously vary as a result
- 25:29across the different studies.
- 25:30And this will reflect a little bit of what we see
- 25:32in some of the examples that we showed earlier,
- 25:35with models that were trained on different data sets.
- 25:40And then the second approach is where we combine
- 25:43the data sets, and train a single logistical question model
- 25:46to predict outcome.
- 25:46And so we see what the median prediction error is better
- 25:49than most of the models here, but if we fit the GLMM,
- 25:52the median prediction (indistinct) gets better
- 25:54than some of the other approaches here.
- 25:56So this is basically just one example.
- 25:58So we did this over and over a hundred times
- 26:00for every single possible simulation condition,
- 26:03varying K, and the heterogeneity across different studies.
- 26:07And some of the things that we found was that
- 26:10the individual study approach had, as you can see,
- 26:12the worst prediction error overall,
- 26:14combining the data improved this a little bit,
- 26:17but the estimates for the coefficients
- 26:21from the combined GLMM were still biased.
- 26:23There's supposed to be two in this extreme scenario.
- 26:27And accounting for heterogeneity with the GLMM had
- 26:31the best performance out of the rest,
- 26:32and also had the lowest bias in terms
- 26:35of the regression coefficients as well.
- 26:39So this is great, but we also have a lot
- 26:42of potential top scoring pairs.
- 26:44We can't really estimate them all
- 26:47with a GLMM, so we need to find a way
- 26:50where we can, at least in reasonable dimension,
- 26:52figure out which fixed effects are non-zero,
- 26:55while accounting for, you know,
- 26:56this sort of study-level heterogeneity for each effect.
- 27:00So this led us to develop a pGLMM, which is basically
- 27:05a high-dimensional generalized linear mixed model,
- 27:08where we are able to select fixed and random effects
- 27:11simultaneously using a penalization framework.
- 27:13So essentially here, for all the predictors
- 27:17in the model, we assume a random effect,
- 27:20a random slope for each one, and so we were aiming to select
- 27:23the features that have non-zero fixed effects
- 27:28in this particular approach, which indeed we're assuming
- 27:30are going to be study-consistent.
- 27:32And to do this, we're going to reorganize
- 27:35the linear predictor from the standard GLMM,
- 27:38so basically we're starting with the same general likelihood
- 27:41for, you know, the generalized mixed model.
- 27:44Here, Y is our outcome, X is our predictor,
- 27:49and alpha_k is the random effect
- 27:53for the k-th study, which is typically assumed to be
- 27:58multivariate normal, mean zero, with
- 28:02some sort of unstructured covariance matrix.
- 28:05And so to sort of simplify this, we factor out
- 28:09the random effects covariance matrix,
- 28:10and incorporate it into the linear predictor.
- 28:12And with some more reorganizing, now we're able to select
- 28:16the fixed effects and determine which random effects have
- 28:21truly non-zero variance, using this sort
- 28:24of joint penalization framework.
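In symbols, the reorganization just described can be sketched as follows; this is a best-effort reconstruction from the spoken description, with Γ denoting a factor of the random-effects covariance.

```latex
% Standard GLMM for subject i in study k:
g\big(\mathbb{E}[y_{ki} \mid \alpha_k]\big) = x_{ki}^\top \beta + x_{ki}^\top \alpha_k,
\qquad \alpha_k \sim N_q(0, \Sigma)

% Factoring Sigma = Gamma Gamma' and writing alpha_k = Gamma u_k with
% u_k ~ N_q(0, I_q) moves the covariance into the linear predictor,
g\big(\mathbb{E}[y_{ki} \mid u_k]\big) = x_{ki}^\top \beta + x_{ki}^\top \Gamma u_k,
% so the fixed effects (beta) and the random-effect variances (rows of
% Gamma) can be selected jointly by penalizing both.
```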
- 28:26If you want more detail, you can check out the publication
- 28:28that I linked above, and I also forgot to send out
- 28:31the link to this talk here.
- 28:33I'll do that right now, in case you want to check out
- 28:35some of the publications that I'm linking in this talk.
- 28:41Okay, so how do we do this estimation?
- 28:42And we use a penalized MCECM algorithm,
- 28:44where in each E-step we're drawing from the posterior
- 28:47of the random effects, given
- 28:48the current estimates of the parameters,
- 28:50and the observed data, using Metropolis-within-Gibbs.
- 28:55In the R package I'm going to talk about in a little bit,
- 28:58we updated this to use Hamiltonian Monte Carlo,
- 29:03but in the original version,
- 29:04we used Metropolis-within-Gibbs, where we skipped
- 29:07components that had zero variance from the M-step.
- 29:09And then we use, in the M-step,
- 29:12two conditional maximization steps
- 29:14where we first update beta, given the draws
- 29:17from the E-step, and the prior estimates for gamma here,
- 29:20and then update gamma using a group penalty.
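Putting the two conditional maximization steps together, the objective can be sketched as the penalized log-likelihood below; the penalty notation is a reconstruction, with gamma_j denoting the group of covariance parameters for predictor j.

```latex
% Sketch of the penalized objective (lambda_0, lambda_1 are tuning
% parameters; rho is a penalty function such as MCP or SCAD):
\ell_{\text{pen}}(\beta, \gamma) \;=\; \ell(\beta, \gamma)
\;-\; \sum_{j=1}^{p} \rho_{\lambda_0}\!\big(|\beta_j|\big)
\;-\; \sum_{j=1}^{p} \rho_{\lambda_1}\!\big(\lVert \gamma_j \rVert_2\big)
% The group penalty on ||gamma_j|| can zero out an entire row of the
% covariance factor, removing that predictor's random effect.
```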
- 29:24So we use a couple of other tricks
- 29:25to speed up performance here.
- 29:27I won't go too much into the details there,
- 29:29but you can check out the paper for more detail on that.
- 29:33But with this approach, one of the things
- 29:35that we were able to show was that we have
- 29:37similar conclusions regarding bias and prediction error,
- 29:39as in the simple setup we had before,
- 29:41where in this particular situation, we're simulating
- 29:43a bunch of predictors that do not have any association
- 29:47with outcome, adding 10 to 50 extra predictors,
- 29:51while there's only two that are actually truly relevant.
- 29:54And so the prediction error in this model
- 29:56after this penalized selection process is
- 29:59generally the same, if not a little bit worse.
- 30:01And one thing that we find here is that
- 30:03the individual study approach,
- 30:06now applying a penalized logistic regression model,
- 30:08has a low sensitivity to detect the true predictors,
- 30:10and a higher false positive rate in terms of selecting
- 30:13predictors that aren't associated
- 30:16with outcome in simulation.
- 30:19And what we find here also is that the approach
- 30:23that we developed had a much better sensitivity
- 30:26compared to other approaches for selecting
- 30:28the true predictors when accounting
- 30:30for study-level heterogeneity,
- 30:32and the lower false positive rate as well.
- 30:36For the example data sets that I talked about before,
- 30:39the four that I showed in a figure earlier,
- 30:43we did a holdout study analysis where we trained
- 30:45on three studies and held out one of the studies.
- 30:48We found that, you know, the approach that we put forward
- 30:51of combining the data using our TSP approach,
- 30:54and then training a model using the pGLMM, had
- 30:58the lowest overall holdout study error
- 31:00compared to the approach using just
- 31:02a regular generalized linear model,
- 31:06and then also the individual study approach as well.
- 31:09And we also compared it to another approach called
- 31:12the meta-lasso, which we were able to adapt
- 31:14to do prediction, and we didn't see that much improvement
- 31:16in performance there either.
- 31:17But in general, the result that we saw here was
- 31:21that the individual study approach had
- 31:23bad prediction error also across the different studies.
- 31:27So again, this sort of takes what we've already seen
- 31:29in the literature in terms of inconsistency,
- 31:31in terms of the number of genes that are being selected
- 31:33in each of these models, and also the variations
- 31:35in the prediction accuracy, this sort of reflects
- 31:38what we've been seeing in some of this prior work.
- 31:44So in order to implement this approach
- 31:46in a more systematic way, my student
- 31:49Hillary and I put together an R package called
- 31:51glmmPen.
- 31:54So this was just recently submitted
- 31:56to the Journal of Statistical Software, but if you want to check out
- 31:59the code, it's available on Github right here,
- 32:02and we're in the process of submitting this to CRAN as well.
- 32:05This was sort of like a nice starter project that I gave
- 32:08to Hillary to, you know, get her feet wet with coding,
- 32:12and she's done a really great job, you know,
- 32:15in terms of putting this together.
- 32:16And some of the distinct differences between this
- 32:19and what we put forth in the paper is the use
- 32:21of Hamiltonian Monte Carlo in the E-step,
- 32:24instead of Metropolis-within-Gibbs.
- 32:26It's much faster, much more efficient.
- 32:27We also have added helper functions
- 32:29for selecting tuning parameters, and also making
- 32:33some diagnostic plots as well, after convergence.
- 32:37And we've also implemented some speed
- 32:39and memory improvements as well, to help with usability.
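As a usage illustration, a call to the package might look like the sketch below; the lme4-style formula follows how the model was described above, but the exact argument names are assumptions, so check the package documentation on Github or CRAN.

```r
# Hedged usage sketch of the glmmPen package (argument names are
# assumptions; consult the package docs for the real interface).
library(glmmPen)

# 'dat' is assumed to hold a binary outcome y, candidate TSP predictors
# tsp1..tsp3, and a study ID factor. A random slope is specified for
# every candidate predictor so the penalization can decide which fixed
# and random effects survive.
fit <- glmmPen(y ~ tsp1 + tsp2 + tsp3 + (tsp1 + tsp2 + tsp3 | study),
               data = dat, family = "binomial")
```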
- 32:44Okay, so we talked about some issues
- 32:47regarding data integration, and then issues
- 32:50with normalization, how that impedes, or can impede
- 32:52validation in future patients, and then we introduced
- 32:56a way to sidestep the normalization problem,
- 32:59using this sort of rank-based transformation,
- 33:01and an approach to select consistent predictors
- 33:03in the presence of between-study heterogeneity.
- 33:07So next, I'm going to talk about a case study
- 33:09in pancreatic cancer, where we took a lot of these tools,
- 33:13and applied them to a problem that some collaborators
- 33:16of mine were having, you know, at the cancer center at UNC.
- 33:20And to give a brief overview of pancreatic cancer,
- 33:23it has a really poor prognosis.
- 33:26Five-year survival is very low, you know, typically 5%.
- 33:30The median survival tends to be less than 11 months,
- 33:32and the main reason why this is the case is that
- 33:35early detection is very difficult,
- 33:37and so when patients show up to the clinic,
- 33:40they're oftentimes in later stages, or gone metastatic.
- 33:44So for those reasons, it's really important to place
- 33:48patients on optimal therapies upfront, choosing
- 33:51the best therapy specifically for a patient, you know,
- 33:54after they're diagnosed.
- 33:56So breast and colorectal cancers have
- 33:59long-established subtyping systems that are oftentimes used.
- 34:02Again, a few of them in breast
- 34:04have actually been approved by the FDA
- 34:06for clinical use, but there's nothing available
- 34:09in terms of precision medicine for pancreatic cancer,
- 34:11except for a couple of targeted therapies
- 34:14for specific mutations.
- 34:17So in 2015, the Yeh Lab at UNC,
- 34:20using a combination of non-negative matrix factorization
- 34:24and consensus clustering, where it was able to discover
- 34:27two potentially clinically applicable subtypes
- 34:30in pancreatic cancer, which they call basal-like,
- 34:33the orange line here, which has a much worse survival
- 34:37compared to this classical subtype in blue,
- 34:41where patients seem to do a little bit better.
- 34:44And so with this approach, they used
- 34:45this unsupervised learning, set of learning techniques
- 34:48to derive these novel subtypes.
- 34:51And so when they took these subtypes and overlaid them
- 34:54from data from a clinical trial where they had
- 34:56treatment response information, they found that
- 34:58largely patients who with basal-like subtype tended to have
- 35:02tumors that did not respond
- 35:04to common first-line therapy, Folfirinox.
- 35:06Their tumors tended to grow from baseline.
- 35:08Whereas patients that were the classical subtype tended
- 35:12to respond better on average compared to the basal samples.
- 35:16So the implications here are that if you are,
- 35:20subtype is basal, you should avoid Folfirinox
- 35:23at baseline entry with an alternative type drug,
- 35:25typically Gemcitabine and nab-paclitaxel Abraxane.
- 35:27And then for classical patients,
- 35:29they should receive Folfirinox.
- 35:32But the problem here is that subtyping clearly is
- 35:34an unsupervised learning approach, right?
- 35:36It's not a prediction tool.
- 35:37So it's, this approach is quite limited if it,
- 35:42when you have to do, assign a subtype
- 35:45in a small number of patients, it just doesn't work.
- 35:48So what some people have done in the past,
- 35:50so they simply take new patients, and recluster them
- 35:52with existing, their existing training samples.
- 35:55The problem with that is that the subtype assignments
- 35:58for those original training samples might change
- 36:00when they recluster it.
- 36:01So there's not a stable, it's not really
- 36:03a stable approach to really do this.
- 36:05So the goal here was to leverage the existing training data
- 36:08that's available to the lab, which come
- 36:12from different platforms to come up with an approach,
- 36:15a classifier to predict subtype, given
- 36:18new subtypes information, genomic,
- 36:20a new patient's genomic data, to get subtype,
- 36:23a predicted subtype for that individual.
- 36:25So of course, in that scenario, we also want to make sure
- 36:28that that process is simplified, and that we make
- 36:31this prediction process as easy as possible,
- 36:33in the face of all these issues we talked about regarding
- 36:36normalization and the training data to each other,
- 36:40and also normalization of the new patient data
- 36:42to the existing training data.
- 36:45So using some of the techniques that we just talked about,
- 36:49we came up with a classifier that we call PurIST,
- 36:51which was published in the CCR last year,
- 36:53where essentially we were able to do that.
- 36:56We take in the genomic data for a previous patient,
- 36:59and able to predict subtype based off of that,
- 37:04the train model that we developed.
- 37:06And in this particular paper, we had nine data sets
- 37:09that we curated from the literature, three of which
- 37:11that we used for training,
- 37:13the rest we used for validation.
- 37:14And we did consensus clustering on all of them,
- 37:16using the gene list that was derived
- 37:18from the original publication,
- 37:21where the subtypes were discovered to get labels,
- 37:23subject labels for each one of the subjects
- 37:25in each one of these studies.
- 37:27So once we had those labels from consensus clustering,
- 37:30we then merged the data from our three largest studies,
- 37:33which are our training studies.
- 37:35We did some sample filtering based on quality,
- 37:37and we filtered some genes based off of, you know,
- 37:40expression levels and things like that.
- 37:42And then we applied our previous training approach
- 37:45to get a small subset of top scoring pairs from the data.
- 37:50And in this case, we have eight that we selected,
- 37:51each with their own study-level coefficient.
- 37:55And then for prediction, the process is very simple,
- 37:58we just check in that patient, whether gene A is greater
- 38:00than gene B for each of these pairs,
- 38:02and that gives us their binary vector of ones and zeros.
- 38:05We multiply that by the coefficients from the train model.
- 38:09This is basically just calculating a linear predictor
- 38:11from this logistic regression model.
- 38:14And then we can convert that
- 38:15to a predicted probability of being basal.
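The arithmetic of that prediction step is simple enough to show in a few lines of R; the eight indicator values and coefficients below are hypothetical placeholders, not the published PurIST model (see the paper and Github link mentioned below for the real ones).

```r
# Illustrative sketch of the PurIST-style prediction step.
tsp_vec   <- c(1, 0, 1, 1, 0, 1, 0, 1)  # gene A > gene B checks (made up)
coefs     <- c(0.9, -0.4, 1.1, 0.7, -0.2, 0.5, -0.8, 0.6)  # hypothetical
intercept <- -1.2                        # hypothetical

eta     <- intercept + sum(tsp_vec * coefs)  # linear predictor
p_basal <- plogis(eta)                       # predicted P(basal-like)
p_basal
```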
- 38:18So using this approach, we were able to select
- 38:2316 genes pertaining to eight top scoring pairs,
- 38:25but we can find here that the predictions
- 38:27from this model tend to coincide very strongly
- 38:31with the labels that were collected
- 38:33using consensus clustering.
- 38:34So that gives us some confidence that we're reproducing,
- 38:36in some way, you know, the result that we got
- 38:41using this clustering approach.
- 38:43You can also clearly see here that as the subtype changes,
- 38:46that you see flips in the expression in each one
- 38:49of the pairs of genes that we collected
- 38:52in this particular study.
- 38:54And then when we applied this model
- 38:55to six external validation datasets, we found that it had
- 38:59a very good performance in terms of recapitulating subtype,
- 39:01where we had a relatively good sensitivity
- 39:04and specificity in each case, which we owe part
- 39:07to the fact that we don't have to worry as much
- 39:08about this sort of cross-study normalization at training time
- 39:13or test time, and also the fact that we leveraged
- 39:17multiple data sets when selecting
- 39:21the predictors for this model.
- 39:22And so when we looked at the predicted subtypes
- 39:24in these holdout studies,
- 39:27we recapitulated the differences in survival
- 39:30that we observed in other studies as well,
- 39:32where basal-like patients do a lot worse
- 39:34compared to classical patients.
- 39:37If you want to look a little bit more at the details
- 39:39in this paper, you can check out this link here,
- 39:41and if you want to access the code that we used
- 39:44to make these predictions, that's available
- 39:45on this Github page at this link right here.
- 39:50Another thing that we were able to show is that for patients
- 39:53that had samples that are collected through different modes
- 39:56of collection, whether it was bulk, FNA, FFPE,
- 40:00we found that the predictions in these patients tend to be
- 40:03highly consistent, and this is basically deriving
- 40:06itself, again, from the simple assumption behind TSPs,
- 40:09where the relative rank within the subject of the expression
- 40:13of these genes is predictive.
- 40:15So as long as that is being preserved,
- 40:17then you should be able to have the model predict well
- 40:21in different scenarios.
- 40:23So when we also went through CLIA validation for this tool,
- 40:28we also confirmed 95% agreement between replicate runs
- 40:31across platforms, and we also confirmed concordance
- 40:38between NanoString and RNA-seq, also through different modes
- 40:43of sample collection.
- 40:44So right now this is the first clinically applicable test
- 40:47for prospective first-line treatment selection in PDAC.
- 40:51And right now we do have a study that just recently opened
- 40:54at the Medical College of Wisconsin that's using PurIST
- 40:56for prospective treatment selection,
- 40:58and we have another one opening at University of Rochester,
- 41:02and also at UNC soon as well.
- 41:06So this is just an example about how you can take
- 41:10a problem, you know, from the literature,
- 41:14from your collaborators, come up with a method,
- 41:18and some theory behind it, and really be able to come up
- 41:22with a good solution that is robust,
- 41:24and that can really help your collaborators
- 41:27at your institution and elsewhere.
- 41:32Okay, so that was the case study.
- 41:34To talk about some current work
- 41:35that we're doing just briefly.
- 41:36So we wanted to think about how we can also scale up
- 41:39this particular framework that we developed for the pGLMM,
- 41:42and one idea that we're pursuing right now
- 41:44with my student Hillary, is that we're thinking
- 41:48about borrowing ideas from factor analysis
- 41:50to do a deterministic decomposition
- 41:53of the random effects into a lower-dimensional space,
- 41:56where essentially, we can map
- 42:00between the lower-dimensional space of latent factors,
- 42:03which is r-dimensional, and this higher-dimensional space,
- 42:05using some matrix B, which is q by r,
- 42:12and essentially in doing so, this reduces the dimension
- 42:16of the integral in the Monte Carlo EM algorithm.
- 42:20So rather than having to approximate the integral
- 42:22in q dimensions, which can be difficult,
- 42:24you can work in a much lower dimension in terms of the integral,
- 42:27and then have this additional problem
- 42:29of trying to estimate this matrix B,
- 42:31and map back to the original dimension q.
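Written out, the decomposition being described might look like the following; this is a sketch from the spoken description, with q the number of random effects and r the number of latent factors.

```latex
% Factor-analytic reduction of the random effects:
\alpha_k = B\, z_k, \qquad z_k \sim N_r(0, I_r), \qquad
B \in \mathbb{R}^{q \times r}, \quad r \ll q
% The E-step integral is then over the r-dimensional z_k rather than the
% q-dimensional alpha_k, at the cost of also estimating B.
```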
- 42:33So that's something that we're just starting to work on
- 42:35right now, and another thing that we're starting to work on
- 42:39is the idea of trying to extend some of the work
- 42:41in variational autoencoders
- 42:43that my student David is working on now.
- 42:45His current work is trying to account for missing data
- 42:48when trying to train these sort of deep learning models,
- 42:51the VAEs unsupervised learning model's oftentimes used
- 42:55for dimensional reduction.
- 42:56You might've heard of it
- 42:57in single cells sequencing applications.
- 43:01But the question that we wanted to address is, well,
- 43:03what if you have missing data, you know,
- 43:05in your input features X, which might be (indistinct)?
- 43:10So essentially we were able to develop input.
- 43:14So we have a pre-print up right now, it's the code,
- 43:17and we're looking to extend this, where essentially,
- 43:20rather than worrying about this latent space Z,
- 43:23which we're assuming that that encodes a lot
- 43:25of the information in the original data,
- 43:27we replaced that with learning the posterior
- 43:29of the random effect, given the observed data.
- 43:32And then in the second portion here, we replaced
- 43:34this generative model with the general model of y given X
- 43:39in the random effects.
- 43:41So that's another avenue that can allow us
- 43:43to hopefully account for non-linearity,
- 43:45and arbitrator action between features as well.
- 43:47And also it might be an easier way to scale up
- 43:49some of the analysis we've done too,
- 43:53which I've already mentioned.
- 43:55Okay, so in terms of some concluding thoughts,
- 43:58I talked a lot about how the original subtypes were derived
- 44:03for this pancreatic cancer case study using NMF
- 44:06and consensus clustering to get two subtypes.
- 44:09But there were also other groups that are published,
- 44:12subtyping systems, that in one, they found
- 44:16three subtypes, and in another one they found four subtypes.
- 44:19So the question is, well, you know, well,
- 44:22which one do we use?
- 44:23Again, this is also confusing for practitioners
- 44:26about which approach might be more meaningful
- 44:29in the clinical setting.
- 44:30And each of these approaches were also derived
- 44:32using NMF and consensus clustering, and they were done
- 44:35separately on different patient cohorts
- 44:38at different institutions.
- 44:39So you can see that this is another reflection
- 44:41of heterogeneity in single-study learning,
- 44:45and how we can get these different or discrepant results
- 44:49from applying the same technique to heterogeneous datasets
- 44:52that were generated at different places.
- 44:54So of course this creates another problem, you know,
- 44:57who's right, which approach do we use?
- 45:00And it's kind of like a circular argument here.
- 45:03So in the paper that I mentioned before with PurIST,
- 45:07another thing that we did is we overlaid
- 45:09the others subtype system calls
- 45:12with the observed clinical outcomes
- 45:15for the studies that we collected.
- 45:17And one of the things that we found was that,
- 45:19in these other subtyping systems,
- 45:22each of them also had a subtype
- 45:24that was very similar to the basal-like subtype,
- 45:27and for the remaining subtypes, they had survival
- 45:30that was similar to the classical subtype.
- 45:33So one of the arguments that we made was that,
- 45:35well, if the clinical outcomes are the same
- 45:37for the other subtypes, you know,
- 45:40are they really necessary
- 45:42for clinical decision-making?
- 45:43That was one argument that we put forth.
- 45:46And when we looked at the response data, again,
- 45:48we saw that one of the subtypes in the other approaches
- 45:51also overlapped the basal-like subtype in terms of response.
- 45:56And then for the remaining subtypes,
- 45:57they were just kind of randomly dispersed at the other end,
- 46:01you know, of the spectrum here in terms of percent
- 46:05tumor change after treatment.
- 46:07So the takeaway here is that heterogeneity
- 46:09between studies also impacts tasks in unsupervised learning,
- 46:14like the NMF+ consensus clustering approach
- 46:16to discover subtypes.
- 46:18And what this also does is, as you can imagine,
- 46:21this injects a lot of confusion into the literature,
- 46:24and can also slow down the process of translating
- 46:27some of these approaches to the clinic.
- 46:30So this also underscores the need
- 46:32for replicable cross-study subtype discovery approaches,
- 46:35for replicable approaches to unsupervised learning.
- 46:41That's something that, you know,
- 46:43we hope to be working on in the future,
- 46:46and we hope to see more work on as well.
- 46:49So to summarize, one of the major points
- 46:53of this talk was to introduce and discuss, you know,
- 46:55replicability issues in genomic prediction models
- 46:58and supervised learning, that stem from technical,
- 47:01and also non-technical sources.
- 47:03We also introduced a new approach to facilitate
- 47:07data integration and multi-study learning
- 47:09in a way that captures between-study heterogeneity,
- 47:12and showed how this can be used for the prediction
- 47:15of subtype for pancreatic cancer, and also introduced
- 47:20some scalable methods and future direction
- 47:23in replicable subtype discovery.
- 47:26So that's it for me.
- 47:28I just want to thank some of my faculty
- 47:30collaborators, Quefeng Li, Junier Oliva
- 47:33from UNC computer science, Jen Jen Yeh
- 47:37from surgical oncology at Lineberger,
- 47:40Joe Ibrahim as well, UNC biostatistics,
- 47:43and also my students, Hillary, who's done a lot of work
- 47:45in this area, and also David Lim, who's doing
- 47:48some of the deep learning work in our group.
- 47:50And that's it, thank you.
- 47:58<v Robert>So does anybody here have</v>
- 47:59any questions for the professor?
- 48:09Or anybody on the, on Zoom, any questions you want to ask?
- 48:26<v ->It looks like I'm off the hook.</v>
- 48:29<v Robert>All right, well, thank you so much.</v>
- 48:30Really appreciated your talk.
- 48:33Have a good afternoon.
- 48:36<v ->All right, thank you for having me.</v>