YSPH Biostatistics Seminar: “Addressing the Replicability and Generalizability of Clinical Prediction Models”
September 08, 2021
Naim Rashid, PhD
Associate Professor, Department of Biostatistics
University of North Carolina at Chapel Hill
September 7, 2021
- 00:00<v Robert>Hi, I'm Professor McDougal,</v>
- 00:06and Professor Wayne is also in the back.
- 00:08If you haven't signed in, please make sure that you pass
- 00:11this, get a chance to sign the sign in sheet.
- 00:15So today we are very, very privileged to be joined
- 00:19by Professor Naim Rashid
- 00:22from the University of North Carolina Chapel Hill,
- 00:25Professor Rashid got his bachelor's in biology from Duke,
- 00:30and his PhD in biostatistics from UNC Chapel Hill.
- 00:35He's the author of 34 publications, and he holds a patent
- 00:40on methods and compositions for prognostic
- 00:44and/or diagnostic subtyping of pancreatic cancer.
- 00:48He's currently an associate professor at UNC Chapel Hill's
- 00:51department of biostatistics, and he's also affiliated
- 00:54with their comprehensive cancer center there.
- 00:59With that, Professor Rashid, would you like to take it away?
- 01:04<v ->Sure.</v>
- 01:06It looks like it says host disabled screen sharing.
- 01:12<v Robert>All right, give me one second.</v>
- 01:14Thank you.
- 01:17I'm trying to do.
- 01:34Okay, you should be, you should be able to come on now.
- 01:36<v ->All right.</v>
- 01:39Can you guys see my screen?
- 01:44All right.
- 01:48Can you guys see this?
- 01:50<v Robert>There we go.</v>
- 01:52Perfect. Thank you.
- 01:53<v ->Okay, great.</v>
- 01:54So yes, thanks to the department for inviting me to speak
- 01:57today, and also thanks to Robert and Wayne for organizing.
- 02:01And today I'll be talking about issues regarding
- 02:04replicability in terms of clinical prediction models,
- 02:08specifically in the context of genomic prediction models,
- 02:12derived from clinical trials.
- 02:16So as an overview, we'll be talking first a little bit
- 02:18about the problems of replicability in general,
- 02:21in scientific research, and also about specific issues
- 02:24in genomics itself, and then I'll be moving on to talking
- 02:28about a method that we've proposed to assist
- 02:31with issues regarding data integration, and learning
- 02:34in this environment when you have heterogeneous data sets.
- 02:38I'll talk a little bit about a case study
- 02:40where we apply these practices to subtyping
- 02:43pancreatic cancer, touch on some current work
- 02:45that we're doing, and then end
- 02:47with some concluding thoughts.
- 02:48And feel free to interrupt, you know,
- 02:50as the talk goes along, if you have any questions.
- 02:54So I'm now an associate professor in the department
- 02:56of biostatistics at UNC.
- 02:58My work generally involves problems
- 03:00surrounding cancer and genomics, and more recently
- 03:05we've been doing work regarding epigenomics.
- 03:07We just recently published a Bioconductor package called
- 03:09epigraHMM for consensus and differential peak calling,
- 03:13and we've also done some work in model-based clustering.
- 03:15We published a package called FSCseq,
- 03:18which helps you derive and discover clusters
- 03:22from RNA-seq data, while also determining
- 03:25cluster-specific genes.
- 03:26And today we'll be talking more about the topic
- 03:28of multi-study replicability, which is the topic
- 03:30of a paper that we published a year or two ago,
- 03:34and an R package that we've developed more recently,
- 03:37implementing some of the methods.
- 03:40So before I get deeper into the talk, one of the things
- 03:43I wanted to establish is this definition
- 03:45of what we mean by replicability.
- 03:47You might've heard the term reproducibility as well,
- 03:50and to make the distinction between the two terms,
- 03:52I'd like to define reproducibility in a way
- 03:54that Jeff Leek has defined in the past,
- 03:57where reproducibility is the ability to take
- 03:59code and data from a publication, and to rerun the code,
- 04:03and get the same results as the original publication.
- 04:06Whereas replicability we're defining as the ability to rerun
- 04:09an experiment, generating new data, and get results
- 04:11that are quote, unquote "consistent"
- 04:14with that of the original study.
- 04:16So in this sort of context, when it comes to replicability,
- 04:19you might've heard about publications that have come out
- 04:22in the past that talk about how there are issues
- 04:24regarding replicating the research that's been published
- 04:28in the scientific literature.
- 04:30This one paper in PLOS Medicine was published
- 04:32by Ioannidis in 2005, and there's been a number
- 04:36of publications that have come out since,
- 04:38talking about problems regarding replicability,
- 04:41and ways that we could potentially address it.
- 04:43And the problem has become large enough where it has
- 04:46its own Wikipedia entry talking about the crisis,
- 04:49and has a long list of examples that talks
- 04:51about issues regarding replicating results
- 04:54from the scientific studies.
- 04:55So this is something that has been a known issue
- 04:58for a while, and these problems also extend
- 05:00to situations where you want to, for example,
- 05:03develop clinical prediction models in genomics.
- 05:06So to give an example of this, let's say that we wanted to,
- 05:10in the population of metastatic breast cancer patients,
- 05:13we wanted to develop a model that predicts
- 05:16some clinical outcome Y, given a set
- 05:18of gene expression values X.
- 05:21And so the purpose of this sort of exercise is
- 05:23to hopefully translate this sort of model
- 05:26that we've developed, and apply it to the clinic,
- 05:28where we can use it for clinical decision-making.
- 05:31Now, if we have data from one particular trial
- 05:35that pertains to this patient population,
- 05:37and the same clinical outcome being measured,
- 05:39in addition to having gene expression data,
- 05:41let's say that we derived a model, let's say
- 05:43that we're modeling some sort of binary outcome,
- 05:44let's say tumor response.
- 05:46And in this model, we used a (indistinct),
- 05:48or penalized logistic regression model
- 05:51that we fit to the data to try and predict the outcome,
- 05:54given the gene expression values.
- 05:56And here we obtained, let's say, 12 genes
- 05:59after the fitting process, and the internal model 1 AUC
- 06:04on the training subjects is 0.9.
- 06:07But then let's say there's another group at Duke
- 06:09that's using data from their clinical trial,
- 06:11and they have a larger sample size.
- 06:13They also found more genes, 65 genes,
- 06:16but have a slightly lower training AUC.
- 06:18However, we really need to use external validation
- 06:22to sort of get an independent assessment of how well
- 06:25each one of these alternative models are doing.
- 06:27So let's say we have data from a similar study from Harvard,
- 06:30and we applied both these train models
- 06:33to the genomic data from that study at Harvard.
- 06:35We have the outcome information for those patients as well,
- 06:38so we can calculate how well the model predicts
- 06:42on those validation subjects.
- 06:44And we find here in this data set,
- 06:46model 2 seems to be doing better than model 1,
- 06:49but if you try this again with another data set
- 06:51from Michigan, you might find that model 1 is doing
- 06:53better than model 2.
- 06:55So the problem here is that we have researchers
- 06:58that are pointing fingers at each other,
- 06:59and it's really hard to know, "Well, who's right?"
- 07:01And why is this even happening in the first place,
- 07:04in terms of why do we get different genes, numbers of genes,
- 07:06in each of the models derived from study 1 and study 2?
- 07:09And why are we seeing very low performance
- 07:12in some of these validation datasets?
- 07:15So here's an example from 2014,
- 07:17in the context of ovarian cancer.
- 07:20The authors basically collected 10 studies,
- 07:22all were microarray studies.
- 07:25The goal here was to predict overall survival
- 07:27in this population of ovarian cancer patients,
- 07:30given gene expression measurements
- 07:32from this microarray platform.
- 07:34So through a series
- 07:35of really complicated cross-study normalization approaches,
- 07:39the data was normalized and harmonized
- 07:40across the studies, using a combination of ComBat
- 07:43and frozen RMA, and then they took
- 07:4614 published prediction models in the literature,
- 07:48and they applied each of those models to each
- 07:51of the subjects from these 10 studies, and they compared
- 07:53the model predictions across each subject.
- 07:58So each column here in this matrix is a patient,
- 08:00and each row is a different prediction model,
- 08:03and each cell represents the prediction
- 08:06from that model on that patient.
- 08:08So in an ideal scenario, where we have the models generalizing
- 08:12and replicating across each of these individuals,
- 08:14we would expect to see the column,
- 08:16each column here to have the same color value,
- 08:19meaning that the predictions are consistent.
- 08:20But clearly we see here that the predictions are
- 08:22actually very inconsistent,
- 08:24and very different from each other.
- 08:27In addition, if you look
- 08:28at the individual risk prediction models
- 08:30that the authors used, there were also
- 08:32substantial differences in the genes
- 08:34that were selected in each of these models.
- 08:36So there's a max 2% overlap in terms of common genes
- 08:40between each of these approaches.
- 08:41And one thing to mention here is that each one
- 08:43of these risk-prediction models was derived
- 08:45from separate individual studies.
- 08:48So the question here is, you know, how exactly,
- 08:51if you were a clinician, you're eager to sort of take
- 08:54the results that you're seeing here,
- 08:57and extend to the clinic,
- 08:58which model do you use, which is right?
- 09:01Why are you seeing this level of variability?
- 09:03This is, of course, concerning if your goal is
- 09:06to move things towards the clinic, and this also has
- 09:08implications in terms of, you know, getting in the way
- 09:11of trying to approve the use of some
- 09:13of these assays for clinical use.
- 09:17So why is this happening?
- 09:19So there's been a lot of studies that have been done
- 09:22that have tied issues to, obviously, sample size
- 09:24in the training studies, smaller sample sizes,
- 09:27and models trained on them may lead to more unstable models,
- 09:31or less accurate models.
- 09:32Between different studies, you might have
- 09:35different prevalences of the clinical outcome.
- 09:36In some studies, you might have higher levels of response,
- 09:39and other studies, you might have lower levels of response,
- 09:40for example, if you have this binary clinical outcome,
- 09:43and also there's issues regarding differences
- 09:46in lab conditions, where the genomic data was extracted.
- 09:49We've seen at Lineberger that, depending on the type
- 09:52of extraction, RNA extraction kit that you use,
- 09:55you might see differences in the expression of a gene,
- 09:58even from the same original tumor.
- 10:00And also the issue of batch effects,
- 10:02which has been widely talked about in the literature,
- 10:04where depending on the day you run the experiment,
- 10:06or the technician who's handling the data,
- 10:11you might see slight differences,
- 10:12technical differences in expression.
- 10:15There's also differences due to protocols.
- 10:17Some trials might have different inclusion
- 10:18and exclusion criteria, so they might be recruiting
- 10:21a slightly different patient population,
- 10:22even though they might be all
- 10:24in the context of metastatic breast cancer.
- 10:25All of these things can impart heterogeneity
- 10:29in the genomic data and the outcome data
- 10:34across different studies.
- 10:36In the context of genomic data in particular,
- 10:39there's also this aspect of data preprocessing.
- 10:41The normalization technique that you use is very important,
- 10:45and we'll talk about that in a little bit.
- 10:47And it's a very critical part when it comes
- 10:48to training models, and trying to validate your model
- 10:52on other datasets, and depending on the type
- 10:54of normalization you use, this could also impact
- 10:58how well your model works.
- 11:00In addition, there's also differences in the potential way
- 11:03in which you measure gene expression.
- 11:04Some trials might use an older technology called microarray.
- 11:07Other trials might use something
- 11:09relatively more recent called RNA-seq,
- 11:11or a particular trial might use
- 11:13a more targeted platform like NanoString.
- 11:15So the differences in platform also can lead to differences
- 11:19in your ability to help validate some of these studies.
- 11:21If you train something on microarray, it's very difficult
- 11:24to take that model, and apply it to RNA-seq,
- 11:26because the expression values are just different.
- 11:30And so, as I mentioned before, this also has an impact,
- 11:32through normalization, on model performance as well.
- 11:37So the main thing to remember here is that
- 11:40the traditional way in which prediction models
- 11:43based on genomic data for use in the clinic are trained is
- 11:46typically on the results from a single study.
- 11:52To talk a little bit more about the question
- 11:54of between-study normalization, the purpose of this is
- 11:57to put the expression data on basically an even scale,
- 12:00which helps facilitate training.
- 12:02If there's global shifts in some of the expression values
- 12:06in one sample versus another, it's very difficult to train
- 12:09an accurate model in that particular scenario.
- 12:11So normalization helps to align
- 12:13the expression you get from different samples,
- 12:16and hopefully across different studies as well.
- 12:19And so the goal here is to eventually predict this outcome
- 12:23in a new patient, you plug in the genomic data
- 12:25from a new patient in order to get the predicted outcome
- 12:28for that patient based on that training model.
- 12:30So in order to do that, you also have to normalize
- 12:34the new data to the training data, right?
- 12:36Because you also want to put the new data on the same scale
- 12:38as the training data, and in the ideal scenario,
- 12:41you would want to make sure that the training samples
- 12:44that you use to train your original model are untouched,
- 12:47because what some people try to do is they try
- 12:49to sort of sidestep this normalization issue,
- 12:52they would combine the new data with the old training data,
- 12:55and renormalize everything at once.
- 12:57And the problem with this is that this changes
- 12:59your training sample values, and in a sense,
- 13:01would require you to retrain
- 13:04your old model again.
- 13:04And this leads to a lack of stability
- 13:07over time in terms of the model itself.
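The contrast between keeping the training samples untouched and renormalizing everything at once can be illustrated with a small sketch. This is only a toy example under stated assumptions: it uses plain gene-wise z-scoring rather than ComBat or frozen RMA, and the arrays `train` and `new` are hypothetical simulated data, not data from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training expression matrix: 50 samples x 4 genes.
train = rng.normal(loc=5.0, scale=2.0, size=(50, 4))
# Hypothetical new study whose values sit on a globally shifted scale.
new = rng.normal(loc=7.0, scale=2.0, size=(10, 4))

# "Frozen" approach: estimate gene-wise parameters on the training data only,
# then apply those same parameters to the new samples.
mu = train.mean(axis=0)
sd = train.std(axis=0, ddof=1)
train_frozen = (train - mu) / sd
new_frozen = (new - mu) / sd

# Joint approach: pool old and new data and renormalize everything at once.
pooled = np.vstack([train, new])
pooled_norm = (pooled - pooled.mean(axis=0)) / pooled.std(axis=0, ddof=1)
train_joint = pooled_norm[: len(train)]

# The values the model was originally trained on have now moved.
max_change = np.abs(train_joint - train_frozen).max()
print(f"largest change in a training value after joint renormalization: {max_change:.3f}")
```

In the frozen version the normalized training values never depend on later data, so a model fit to them does not need to be retrained when a new patient arrives.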
- 13:10So in the prior example from ovarian cancer,
- 13:12this is not as big of an issue, because you have
- 13:15all the data you want to work with in hand.
- 13:18This is a retrospective study, you have 10 data sets,
- 13:20so you just normalize everything at the same time,
- 13:22using ComBat and frozen RMA.
- 13:24And so you can split up those studies into separate training
- 13:27and test studies, and they're all already on the same scale.
- 13:31But the problem is that in practice, you're trying to do
- 13:34a prospective type of analysis, where when you train
- 13:37your model, you're normalizing all of the available studies
- 13:40you have, let's say, and then you use that to predict
- 13:44the outcome in a future patient, or a future study.
- 13:47And so the problem with that is that you have to find
- 13:51a good way to align, as I mentioned before,
- 13:55the data from that future study to your training samples,
- 13:57and that may not be an easy task to do,
- 14:00especially for some of the newer platforms like RNA-seq.
- 14:04So taking this problem a step further,
- 14:06what if there's no good cross study normalization approach
- 14:10that's available to begin with?
- 14:12This really is going to make things difficult in terms
- 14:15of training the model in the first place.
- 14:18Another more complicated problem is that you might have
- 14:21different types of platforms at that training time.
- 14:24For example, you might have the only type of data
- 14:26that's available from one study is NanoString in one case,
- 14:29and in another study it's only RNA-seq, so what do you do?
- 14:33And looking forward, as platforms change,
- 14:35as technology evolves, you have different ways
- 14:36of measuring gene expression, for example.
- 14:42So what do you do with the models that are trained
- 14:44on old data, because you can't apply them to the new data?
- 14:48So oftentimes you find this situation
- 14:50where you have to retrain new models on these new platforms,
- 14:53and the old models are not able to be applied
- 14:57directly to these new data types.
- 14:58So that leads to waste here.
- 15:01So if you take all of these problems together,
- 15:03regarding cross-study normalization,
- 15:07and changes in platform,
- 15:09and a lot of the other issues, you know,
- 15:11regarding replicability that I mentioned,
- 15:13it's no wonder that there's only a small handful
- 15:17of expression-based clinically applicable assays that have been
- 15:21approved by the FDA, like Oncotype DX, MammaPrint,
- 15:24and Prosigna, because this is a very, very tough problem.
- 15:30So I want to move on with that, to an approach
- 15:33that we proposed to help tackle this sort of issue
- 15:36by using this idea of multi-study learning,
- 15:39where instead of just using, and deriving, and generating
- 15:43models from individual studies, we combine data
- 15:45from multiple studies together, and create a consensus model
- 15:48that we use for prediction, which will hopefully be
- 15:50more stable, and more accurate down the road.
- 15:54So this approach of combining data is called
- 15:56horizontal data integration, where we're merging data
- 15:59from let's say K different studies.
- 16:01And the pro of this approach is that we get increased power,
- 16:04and the ability to reach some sort of consensus
- 16:06across these different studies.
- 16:09The negative is that the effect of a gene
- 16:12and its relationship to outcome may actually vary
- 16:14across studies, and also
- 16:16the way that you normalize the genes may vary
- 16:19across studies too if we're using published data
- 16:21from some prior publication.
- 16:24There's also this issue of sample size imbalance.
- 16:25You might have a study that has 500 subjects,
- 16:28and another one that might have 200 subjects.
- 16:30So there are some methods that were designed to account for
- 16:34between-study heterogeneity after you do
- 16:36horizontal data integration.
- 16:38One is called the meta-lasso, another is called
- 16:41the AW statistic, but these two methods don't really have
- 16:44any prediction aspect about them.
- 16:46They're more about feature selection.
- 16:48Ensembling is one approach that can directly account
- 16:50for between-study heterogeneity
- 16:52after horizontal data integration, but there's
- 16:54no explicit feature selection step here.
- 16:57But all of these approaches assume
- 16:59that the data has been pre-normalized.
- 17:02As we talked about before,
- 17:03for prospective decision-making, based off a trained model,
- 17:07that might be prohibitive in some cases,
- 17:10and we need a strategy also to easily predict
- 17:13and apply these models in new patients.
- 17:20Okay, so moving on, we're going to talk first
- 17:24about this issue of how do we integrate data,
- 17:27and sort of sidestep this normalization problem
- 17:30at training time, and also at test time,
- 17:33when we try to predict in new subjects?
- 17:35So the approach that we put forth is to use
- 17:39what's called top scoring pairs, which you can think of
- 17:41as a rank-based transformation of the original set
- 17:45of gene expression values from a patient.
- 17:47So the idea here originally,
- 17:50when top scoring pairs were introduced,
- 17:51was you're trying to find a pair of genes
- 17:53where it's such that if the expression of gene A
- 17:56in the pair is greater than gene B, that would imply
- 17:59that the, let's say, the subtype for that individual is,
- 18:03say, subtype one, and if it's less,
- 18:05then that implies subtype zero with high probability.
- 18:09Now, in this case, this sort of approach was developed
- 18:12for when one has a binary outcome variable
- 18:14that you care about.
- 18:15In this case, we're talking about subtype,
- 18:17but it could also be tumor response or something else.
- 18:20So essentially what you're doing is that you're taking
- 18:22these continuous measurements in terms of gene expression,
- 18:25or integer, and you are converting that, transforming
- 18:31that into basically a binary predictor,
- 18:32which takes on the value of zero or one.
- 18:34And the hope is that that particular transformed value is
- 18:38going to be associated with this binary outcome.
- 18:41So the simple assumption in this scenario is
- 18:44that the relative rank of these genes
- 18:46in a given sample is predictive of subtype, and that's it.
- 18:51And so the example here I have on the right is an example
- 18:54of two genes, GSTP1 and ESR1.
- 18:58And so you can see here that if you're
- 19:00in the upper left quadrant, this is where this gene's expression is
- 19:02greater than this gene's expression, implying
- 19:05the triangle subtype with high probability,
- 19:08and otherwise it implies the circle subtype.
- 19:11So that's the general idea of what we're going for here.
- 19:14It's a sort of a rank-based transformation
- 19:16of the original continuous predictor space.
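A minimal sketch of this transformation for a single gene pair is shown below. The function name `tsp_indicator` and the expression values are hypothetical, chosen only to illustrate the idea; the per-sample shift and rescaling is a crude stand-in for a technical effect that moves all of a sample's values together.

```python
import numpy as np

def tsp_indicator(expr_a, expr_b):
    """Return 1 where gene A is expressed above gene B in the same sample, else 0."""
    return (np.asarray(expr_a) > np.asarray(expr_b)).astype(int)

# Hypothetical expression values for two genes across five samples.
gene_a = np.array([8.1, 2.3, 7.7, 1.9, 6.4])
gene_b = np.array([3.0, 6.5, 2.2, 5.8, 6.9])

tsp = tsp_indicator(gene_a, gene_b)

# Apply a sample-specific shift and positive rescaling to both genes.
shift = np.array([0.5, -1.0, 2.0, 0.0, 3.0])
scale = np.array([1.0, 2.0, 0.5, 1.5, 3.0])
tsp_shifted = tsp_indicator(scale * gene_a + shift, scale * gene_b + shift)

print(tsp)          # binary predictor derived from the raw values
print(tsp_shifted)  # identical, because the within-sample ordering is unchanged
```

Because only the within-sample ordering of the two genes matters, any transformation that preserves that ordering leaves the predictor unchanged.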
- 19:21So the nice thing about this approach,
- 19:22because we're only based on the simple assumption, right?
- 19:25That we're only caring about the relative rank
- 19:27within a subject, this makes
- 19:29this particular new transformed predictor
- 19:32relatively invariant to batch effects, pre-normalization,
- 19:36and it also most importantly, simplifies merging data
- 19:39from different studies.
- 19:41Everything is now on the same scale, zero to one,
- 19:43so it's very easy to paste together the data
- 19:45from different studies, and we can sidestep this problem
- 19:50of trying to pick a cross-normalization approach,
- 19:53and then work in this sort of transformed space.
- 19:57The other nice thing is that this is easily computable
- 19:59for new patients as well.
- 20:01If you have a new patient that comes into clinic,
- 20:03you just check to see whether the gene A is
- 20:04greater than gene B in terms of expression,
- 20:06and then you have your value for this top scoring pair,
- 20:11and we don't have to worry as much about normalizing
- 20:14this patient's raw gene expression data
- 20:18to the training sample expression values.
- 20:21So essentially what we're doing here is that we
- 20:23enumerate all possible gene pairs
- 20:26from a set of candidate genes, and each column here
- 20:28in this matrix shown on the right pertains
- 20:31to the zero-one values for a particular gene pair j.
- 20:34And so this value takes the value of one if gene A is greater
- 20:38than gene B in sample i for pair j, and zero otherwise.
- 20:41And then we merge over the common top scoring pairs.
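The enumerate-then-merge step can be sketched as follows, assuming each study provides a samples-by-genes matrix and a list of gene names. The helper `tsp_matrix`, the gene labels, and the two simulated studies are hypothetical and are not taken from the speaker's package.

```python
from itertools import combinations
import numpy as np

def tsp_matrix(expr, genes):
    """Binary samples x pairs matrix: 1 if the first gene of a pair exceeds the second."""
    pairs = list(combinations(sorted(genes), 2))  # sorted so pairs match across studies
    col = {g: i for i, g in enumerate(genes)}
    mat = np.column_stack(
        [(expr[:, col[a]] > expr[:, col[b]]).astype(int) for a, b in pairs]
    )
    return pairs, mat

rng = np.random.default_rng(1)
# Two hypothetical studies on different scales with partly different gene panels.
studies = {
    "study1": (["G1", "G2", "G3", "G4"], rng.normal(8, 2, size=(6, 4))),
    "study2": (["G2", "G3", "G4", "G5"], rng.poisson(50, size=(5, 4)).astype(float)),
}

per_study = {name: tsp_matrix(expr, genes) for name, (genes, expr) in studies.items()}

# Keep only the gene pairs that can be formed in every study.
common = sorted(set.intersection(*(set(pairs) for pairs, _ in per_study.values())))

blocks, labels = [], []
for name, (pairs, mat) in per_study.items():
    idx = [pairs.index(p) for p in common]
    blocks.append(mat[:, idx])
    labels += [name] * mat.shape[0]

merged = np.vstack(blocks)  # rows: samples from all studies; columns: common TSPs
print(common, merged.shape)
```

Every entry of `merged` is zero or one regardless of the platform the study used, which is why the studies can be stacked without choosing a cross-study normalization; the `labels` list keeps track of which study each row came from.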
- 20:46So in this example we have data from four different studies,
- 20:49each indicated by a different color here
- 20:50in the first track, and this data pertains to data
- 20:54from two different platforms,
- 20:55and three different cancer types.
- 20:56And so the clinical outcome here is binary subtype,
- 20:59which is given by the orange and the blue color here.
- 21:02So you can see here that we enumerated the TSPs,
- 21:05we merged the data together, and now we have
- 21:07this transformed predictor matrix.
- 21:09And the interesting thing is
- 21:10that you can definitely see some patterning here.
- 21:13Within any study, you have a particular set of TSPs
- 21:15that take on a value of one when the subtype is blue,
- 21:19and it flips when it's orange.
- 21:21And we see the same general pattern seem to replicate
- 21:24across different studies,
- 21:25but not every top scoring pair changes the same way
- 21:29across different studies.
- 21:32So if we cluster the rows here, we can also see
- 21:35some patterns sort of persist where we see
- 21:38some clustering by subtype,
- 21:40but also some clustering by study as well.
- 21:42And so what this implies is that there's a relationship
- 21:45between TSPs and subtypes, and that can vary across studies,
- 21:47which is not too different from what we've talked
- 21:50about regarding the issues we've seen
- 21:51in replicability in the past.
- 21:53So ideally we would like to see a particular gene pair,
- 21:57or TSP vector here take on a value of one,
- 22:01only when there's the orange subtype,
- 22:03and zero in the blue subtype, or vice versa.
- 22:05And we wanted to see this pattern replicated
- 22:07across patients in studies, but we see obviously
- 22:10that that's not the case.
- 22:12So what we've now introduced,
- 22:15or proposed, is this sort of approach to simplify
- 22:17data merging and normalization.
- 22:19The question now that we're sort of dealing
- 22:20with is well, how do we actually now find
- 22:22features that are consistent across different studies
- 22:26in their relationship with outcome, and also estimate
- 22:29their study-level effect, and then use them for prediction?
- 22:33So that leads us to the second part of our paper,
- 22:35where we developed a model to help select
- 22:39these particular study-consistent features
- 22:42while accounting for study-level heterogeneity.
- 22:47So to sort of illustrate the idea behind this,
- 22:49let's just start with a simple simulation
- 22:52where we're not doing any normalization,
- 22:54we're not worrying about it, assuming everything's fine
- 22:56in terms of the expression values,
- 22:59and we're not doing any selection,
- 23:00no TSP transformation either.
- 23:03So we're going to simulate data pertaining
- 23:05to two, let's say, known biomarkers
- 23:06that are associated with binary subtype.
- 23:09We're going to generate K datasets,
- 23:11and we're going to try three different strategies
- 23:12for learning a prediction model from these data sets.
- 23:15And at the end, we're going to validate each of those models
- 23:18on an externally-generated data set
- 23:19to compare their prediction performance.
- 23:22So to do this, we're going to assume for each study
- 23:25that we can fit a logistic regression model
- 23:28to model the binary outcome with these two predictors,
- 23:31and in generating these K data sets,
- 23:32we're going to vary the number of studies, K.
- 23:35So we might generate two training data sets, five, or 10,
- 23:38and also change the total sample size of each one,
- 23:40and make sure that the sample sizes are balanced
- 23:42across the different studies, and then assume
- 23:45values for the coefficients for each of these predictors
- 23:50to be these values here, and lastly, to induce some sort
- 23:53of heterogeneity across the different training datasets,
- 23:56we're gonna add in sort of like a random value drawn
- 23:59from a normal distribution, where we're assuming
- 24:03this level of variance for this value.
- 24:05So basically we're just injecting heterogeneity
- 24:07into this data generation process.
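Written out, the data-generating model described above might look like the following. The symbols are assumptions for illustration, since the talk points at slide values rather than naming them: y_{ki} is the binary subtype for subject i in study k, x_{ki1} and x_{ki2} are the two biomarkers, beta holds the shared coefficients, and a_{kj} is the study-specific perturbation with variance sigma^2. Whether the intercept is also perturbed is an assumption here.

```latex
\[
\operatorname{logit}\,\Pr(y_{ki} = 1 \mid x_{ki})
  = (\beta_0 + a_{k0}) + (\beta_1 + a_{k1})\,x_{ki1} + (\beta_2 + a_{k2})\,x_{ki2},
\qquad a_{kj} \sim N(0, \sigma^2),
\]
\[
k = 1, \dots, K, \quad i = 1, \dots, n_k .
\]
```

Setting sigma^2 to zero recovers the case where every study shares exactly the same gene-subtype relationship.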
- 24:09So after we generate the training studies,
- 24:11then we're going to apply three different ways
- 24:13or strategies to the training data.
- 24:15The first is the individual study approach,
- 24:17which we've talked about before, where you train
- 24:20a generalized linear model separately for each study.
- 24:22The second approach is where you merge the data.
- 24:25Again, we're ignoring the normalization problem here
- 24:26in simulation, obviously, and then train a single GLM
- 24:30for the combined data, and then lastly,
- 24:32we're going to merge the data, and train
- 24:34a generalized linear mixed model,
- 24:35where we explicitly account for a random intercept,
- 24:38and a random slope for each predictor,
- 24:41assuming, you know, a study-level random effect.
- 24:45So after we do that, we'll generate a validation dataset
- 24:48from the same approach above, and then predict outcome
- 24:52in this validation dataset with respect
- 24:55to the models derived from each of these three strategies.
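The first two strategies can be sketched in a few lines of NumPy. Everything numeric here (K, the sample sizes, the coefficients, the heterogeneity level) is an illustrative assumption rather than the talk's actual settings, the helper names are hypothetical, and the third strategy (the mixed model) is left out because it needs a dedicated mixed-model fitting routine.

```python
import numpy as np

rng = np.random.default_rng(2021)

def simulate_study(n, beta, sigma):
    """One study: two predictors, coefficients perturbed by N(0, sigma^2) noise."""
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    beta_k = beta + rng.normal(scale=sigma, size=beta.shape)
    p = 1.0 / (1.0 + np.exp(-X @ beta_k))
    return X, rng.binomial(1, p)

def fit_logistic(X, y, n_iter=50, ridge=1e-6):
    """Logistic regression by Newton-Raphson (tiny ridge term for numerical stability)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        w = p * (1.0 - p)
        grad = X.T @ (y - p) - ridge * b
        hess = (X * w[:, None]).T @ X + ridge * np.eye(X.shape[1])
        b = b + np.linalg.solve(hess, grad)
    return b

def accuracy(b, X, y):
    """Fraction of subjects whose predicted class (linear predictor > 0) matches y."""
    return float(np.mean(((X @ b) > 0).astype(int) == y))

# Illustrative settings only (assumptions, not the values used in the talk).
K, n, sigma = 5, 200, 1.0
beta = np.array([0.0, 1.0, 1.0])

train = [simulate_study(n, beta, sigma) for _ in range(K)]
X_val, y_val = simulate_study(1000, beta, sigma)  # external validation study

# Strategy 1: fit one GLM per study and validate each one externally.
individual_acc = [accuracy(fit_logistic(X, y), X_val, y_val) for X, y in train]

# Strategy 2: merge all studies and fit a single GLM.
X_all = np.vstack([X for X, _ in train])
y_all = np.concatenate([y for _, y in train])
merged_acc = accuracy(fit_logistic(X_all, y_all), X_val, y_val)

print("per-study accuracies:", np.round(individual_acc, 3))
print("merged-data accuracy:", round(merged_acc, 3))
```

Rerunning with different seeds or a larger `sigma` shows how much the per-study accuracies can spread when the gene-subtype relationship differs between studies.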
- 24:59So if we look at the individual strategy performance,
- 25:01where we fit a GLM logistic regression model
- 25:04separately for each study, and then apply it
- 25:06to this validation data set, we can check
- 25:08the prediction accuracy, we can find that,
- 25:11due to the induced level of heterogeneity
- 25:14between studies in predictor effects,
- 25:16in one study, we do really poorly,
- 25:18and another study we do really well,
- 25:20and this variation is entirely due to variations
- 25:24in the gene subtype relationship.
- 25:27And these predictions obviously vary as a result
- 25:29across the different studies.
- 25:30And this will reflect a little bit of what we see
- 25:32in some of the examples that we showed earlier,
- 25:35studies that were trained on different data sets.
- 25:40And then the second approach is where we combine
- 25:43the data sets, and train a single logistic regression model
- 25:46to predict outcome.
- 25:46And so we see that the median prediction error is better
- 25:49than most of the models here, but if we fit the GLMM,
- 25:52the median prediction (indistinct) gets better
- 25:54than some of the other approaches here.
- 25:56So this is basically just one example.
- 25:58So we did this over and over a hundred times
- 26:00for every single possible simulation condition,
- 26:03varying K, and the heterogeneity across different studies.
- 26:07And some of the things that we found was that
- 26:10the individual study approach had, as you can see,
- 26:12the worst prediction error overall,
- 26:14combining the data improved this a little bit,
- 26:17but the estimates for the coefficients
- 26:21from the combined GLM were still biased.
- 26:23They're supposed to be two in this extreme scenario.
- 26:27And accounting for heterogeneity with the GLMM had
- 26:31the best performance out of the rest,
- 26:32and also had the lowest bias in terms
- 26:35of the regression coefficients as well.
- 26:39So this is great, but we also have a lot
- 26:42of potential top scoring pairs.
- 26:44We can't really estimate them all
- 26:47with a GLMM, so we need to find a way
- 26:50where we can, at least in reasonable dimension,
- 26:52figure out which fixed effects are non-zero,
- 26:55while accounting for, you know,
- 26:56this sort of study-level heterogeneity for each effect.
- 27:00So this led us to develop a pGLMM, which is basically
- 27:05a high-dimensional generalized linear mixed model,
- 27:08where we are able to select fixed and random effects
- 27:11simultaneously using a penalization framework.
- 27:13So essentially here, for all the predictors
- 27:17in the model, we assume a random effect,
- 27:20a random slope for each one, and so we were aiming to select
- 27:23the features that have non-zero fixed effects
- 27:28in this particular approach, and indeed we're assuming
- 27:30these are going to be study-consistent.
- 27:32And to do this, we're going to reorganize
- 27:35the linear predictor from the standard GLMM,
- 27:38so basically we're starting with the same general likelihood
- 27:41for, you know, the generalized mixed model.
- 27:44Here, Y is our outcome, X is our predictor,
- 27:49alpha K is the random effect
- 27:53for the Kth study, phi here is typically assumed to be
- 27:58multivariate normal, mean zero, with
- 28:02some sort of unstructured covariance matrix typically.
- 28:05And so to sort of simplify this, we factor out
- 28:09the random effects covariance matrix,
- 28:10and incorporate into the linear predictor.
- 28:12And with some more reorganizing, now we're able to select
- 28:16the fixed effects and determine which random effects have
- 28:21truly non-zero variance, using this sort
- 28:24of joint penalization framework.
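For readers following along without the slides, the model described here can be written out roughly as below. This is a sketch in assumed notation, not a transcription of the slide: the indices $i$ and $k$, the design vectors $x_{ki}$ and $z_{ki}$, and the symbols $\Gamma$ and $\delta_k$ are introduced here for illustration, and the linked publication has the authors' exact formulation.

```latex
% Standard GLMM: study k = 1..K, subject i = 1..n_k, q random effects per study
\begin{align*}
\eta_{ki} &= x_{ki}^\top \beta + z_{ki}^\top \alpha_k,
  \qquad \alpha_k \sim N_q(0, \Sigma), \\
L(\beta, \Sigma) &= \prod_{k=1}^{K} \int
  \Big[ \prod_{i=1}^{n_k} f(y_{ki} \mid x_{ki}, z_{ki}, \alpha_k; \beta) \Big]
  \, \phi(\alpha_k; 0, \Sigma) \, d\alpha_k .
\end{align*}

% Factoring the covariance matrix out of the density and into the linear predictor
\begin{align*}
\Sigma = \Gamma \Gamma^\top, \quad
\alpha_k = \Gamma \delta_k, \quad
\delta_k \sim N_q(0, I_q)
\;\Longrightarrow\;
\eta_{ki} = x_{ki}^\top \beta + z_{ki}^\top \Gamma \delta_k .
\end{align*}
```

Under this parameterization, a row of $\Gamma$ that is entirely zero gives the corresponding random effect zero variance. That is why a penalty on $\beta$ together with a group penalty on the rows of $\Gamma$ can select fixed and random effects at the same time.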
- 28:26If you want more detail, you can check out the publication
- 28:28that I linked above, and I also forgot to send out
- 28:31the link to this talk here.
- 28:33I'll do that right now, in case you want to check out
- 28:35some of the publications that I'm linking in this talk.
- 28:41Okay, so how do we do this estimation?
- 28:42We use a penalized MCEM algorithm,
- 28:44where in each step we're drawing from the posterior
- 28:47with respect to the random effects, given
- 28:48the current estimates of the parameters,
- 28:50and the observed data, using Metropolis-within-Gibbs.
- 28:55In the R package I'm going to talk about in a little bit,
- 28:58we updated this to use Hamiltonian Monte Carlo,
- 29:03but in the original version,
- 29:04we used Metropolis-within-Gibbs, where we skipped
- 29:07components that had zero variance from the M-step.
- 29:09And then we use, in the M-step,
- 29:12two conditional maximization steps
- 29:14where we first update beta, given the draws
- 29:17from the E-step, and the prior estimates for gamma here,
- 29:20and then update gamma using a group penalty.
- 29:24So we use a couple of other tricks
- 29:25to speed up performance here.
- 29:27I won't go too much into the details there,
- 29:29but you can check out the paper for more detail on that.
- 29:33But with this approach, one of the things
- 29:35that we were able to show was that we have
- 29:37similar conclusions regarding bias and prediction error,
- 29:39as in the simple setup we had before,
- 29:41where in this particular situation, we're simulating
- 29:43a bunch of predictors that do not have any association
- 29:47with outcome, either 10 or 50 extra predictors,
- 29:51and there's only two that are actually truly relevant.
- 29:54And so the prediction error in this model
- 29:56after this penalized selection process is
- 29:59generally the same, if not a little bit worse.
- 30:01And one thing that we find here is that
- 30:03the individual study approach,
- 30:06where we're now applying
- 30:08a penalized logistic regression model, has
- 30:10a low sensitivity to detect the true predictors,
- 30:13and a higher false positive rate in terms of selecting
- 30:16predictors that aren't associated
- 30:17with outcome in simulation.
- 30:19And what we find here also is that the approach
- 30:23that we developed had a much better sensitivity
- 30:26compared to other approaches for selecting
- 30:28the true predictors when accounting
- 30:30for study-level heterogeneity,
- 30:32and a lower false positive rate as well.
- 30:36For the example data sets that I talked about before,
- 30:39the four that I showed a figure of earlier,
- 30:43we did a holdout study analysis where we trained
- 30:45on three studies and held out one of the studies.
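The holdout-study design described here (train on all but one study, evaluate on the held-out one) can be sketched as a simple index splitter. This is a minimal illustration in Python; the function name `leave_one_study_out` and the study labels are hypothetical, and it is not code from the speaker's analysis.

```python
def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for each study.

    study_ids: one study label per subject, in data order.
    """
    for held_out in sorted(set(study_ids)):
        train_idx = [i for i, s in enumerate(study_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# Hypothetical example: six subjects drawn from four studies.
study_ids = ["s1", "s1", "s2", "s3", "s3", "s4"]
splits = list(leave_one_study_out(study_ids))

for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by study rather than by subject is what makes the reported error a measure of cross-study generalization: no subject from the held-out study influences training.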
- 30:48We found that, you know, the approach that we put forward,
- 30:51combining the data using our TSP approach
- 30:54and then training a model using the pGLMM, had
- 30:58the lowest overall holdout study error
- 31:00compared to the approach using just
- 31:02a regular generalized linear model,
- 31:06and then also the individual study approach as well.
- 31:09And we also compared it to another approach called
- 31:12the Meta-Lasso, which we were able to adapt
- 31:14to do prediction, and we didn't see that much improvement
- 31:16of performance as well.
- 31:17But in general, the result that we saw here was
- 31:21that the individual study approach had
- 31:23bad prediction error also across the different studies.
- 31:27So again, this sort of takes what we've already seen
- 31:29in the literature in terms of inconsistency,
- 31:31in terms of the number of genes that are being selected
- 31:33in each of these models, and also the variations
- 31:35in the prediction accuracy, this sort of reflects
- 31:38what we've been seeing in some of this prior work.
- 31:44So in order to implement this approach
- 31:46in a more systematic way, my student Hillary and I
- 31:49put together an R package called
- 31:51the glmmPen R package.
- 31:54So this was just recently submitted
- 31:56to Journal of Statistical Software, but if you want to track
- 31:59the code, it's available on Github right here,
- 32:02and we're in the process of submitting this to CRAN as well.
- 32:05This was sort of like a nice starter project that I gave
- 32:08to Hillary to, you know, get her feet wet with coding,
- 32:12and she's done a really great job, you know,
- 32:15in terms of putting this together.
- 32:16And some of the distinct differences between this
- 32:19and what we put forth in the paper is the use
- 32:21of Hamiltonian Monte Carlo in the E-step,
- 32:24instead of the Metropolis-within-Gibbs.
- 32:26It's much faster, much more efficient.
- 32:27We also have added helper functions
- 32:29for the (indistinct) tuning parameters, and also making
- 32:33some diagnostic plots as well, after convergence.
- 32:37And we've also implemented some speed
- 32:39and memory improvements as well, to help with usability.
- 32:44Okay, so we talked about some issues
- 32:47regarding data integration, and then issues
- 32:50with normalization, how that impedes, or can impede
- 32:52validation in future patients, and then we introduced
- 32:56a way to sidestep the normalization problem,
- 32:59using this sort of rank-based transformation,
- 33:01and an approach to select consistent predictors
- 33:03in the presence of between-study heterogeneity.
- 33:07So next, I'm going to talk about a case study
- 33:09in pancreatic cancer, where we took a lot of these tools,
- 33:13and applied them to a problem that some collaborators
- 33:16of mine were having, you know, at the cancer center at UNC.
- 33:20And to give a brief overview of pancreatic cancer,
- 33:23it has a really poor prognosis.
- 33:26Five-year survival is very low, you know, typically 5%.
- 33:30The median survival tends to be less than 11 months,
- 33:32and the main reason why this is the case is that
- 33:35early detection is very difficult,
- 33:37and so when patients show up to the clinic,
- 33:40they're oftentimes in later stages, or have gone metastatic.
- 33:44So for those reasons, it's really important to place
- 33:48patients on optimal therapies upfront, and choosing
- 33:51the best therapies, specifically for a patient, you know,
- 33:54after they're diagnosed.
- 33:56So breast and colorectal cancers have
- 33:59long-established subtyping systems that are oftentimes used.
- 34:02Again, an example of a few of them in breast
- 34:04that have actually been approved by the FDA
- 34:06for clinical use, but there's nothing available for,
- 34:09in terms of precision medicine for pancreatic cancer,
- 34:11except for a couple of targeted therapies
- 34:14for specific mutations.
- 34:17So in 2015, the Yeh Lab at UNC,
- 34:20using a combination of non-negative matrix factorization
- 34:24and consensus clustering, was able to discover
- 34:27two potentially clinically applicable subtypes
- 34:30in pancreatic cancer, which they call basal-like,
- 34:33the orange line here, which has a much worse survival
- 34:37compared to this classical subtype in blue,
- 34:41where patients seem to do a little bit better.
- 34:44And so with this approach, they used
- 34:45this unsupervised learning, set of learning techniques
- 34:48to derive these novel subtypes.
- 34:51And so when they took these subtypes and overlaid them
- 34:54on data from a clinical trial where they had
- 34:56treatment response information, they found that
- 34:58largely patients with the basal-like subtype tended to have
- 35:02tumors that did not respond
- 35:04to common first-line therapy, FOLFIRINOX.
- 35:06Their tumors tended to grow from baseline.
- 35:08Whereas patients that were the classical subtype tended
- 35:12to respond better on average compared to the basal samples.
- 35:16So the implications here are that if your
- 35:20subtype is basal, you should avoid FOLFIRINOX
- 35:23at baseline and be treated with an alternative type of drug,
- 35:25typically gemcitabine and nab-paclitaxel (Abraxane).
- 35:27And then for classical patients,
- 35:29they should receive FOLFIRINOX.
- 35:32But the problem here is that subtyping clearly is
- 35:34an unsupervised learning approach, right?
- 35:36It's not a prediction tool.
- 35:37So this approach is quite limited
- 35:42when you have to assign a subtype
- 35:45in a small number of patients, it just doesn't work.
- 35:48So what some people have done in the past
- 35:50is they simply take new patients, and recluster them
- 35:52with their existing training samples.
- 35:55The problem with that is that the subtype assignments
- 35:58for those original training samples might change
- 36:00when they recluster it.
- 36:01So it's not really
- 36:03a stable approach to do this.
- 36:05So the goal here was to leverage the existing training data
- 36:08that's available to the lab, which come
- 36:12from different platforms to come up with an approach,
- 36:15a classifier to predict subtype, given
- 36:18a new patient's genomic data, to get
- 36:23a predicted subtype for that individual.
- 36:25So of course, in that scenario, we also want to make sure
- 36:28that that process is simplified, and that we make
- 36:31this prediction process as easy as possible,
- 36:33in the face of all these issues we talked about regarding
- 36:36normalization of the training data to each other,
- 36:40and also normalization of the new patient data
- 36:42to the existing training data.
- 36:45So using some of the techniques that we just talked about,
- 36:49we came up with a classifier that we call PurIST,
- 36:51which was published in the CCR last year,
- 36:53where essentially we were able to do that.
- 36:56We take in the genomic data for a new patient,
- 36:59and are able to predict subtype based off of
- 37:04the trained model that we developed.
- 37:06And in this particular paper, we had nine data sets
- 37:09that we curated from the literature, three of which
- 37:11we used for training,
- 37:13the rest we used for validation.
- 37:14And we did consensus clustering on all of them,
- 37:16using the gene list that was derived
- 37:18from the original publication,
- 37:21where the subtypes were discovered, to get
- 37:23subtype labels for each one of the subjects
- 37:25in each one of these studies.
- 37:27So once we had those labels from consensus clustering,
- 37:30we then merged the data from our three largest studies,
- 37:33which are our training studies.
- 37:35We did some sample filtering based on quality,
- 37:37and we filtered some genes based off of, you know,
- 37:40expression levels and things like that.
- 37:42And then we applied our previous training approach
- 37:45to get a small subset of top scoring pairs from the data.
- 37:50And in this case, we have eight that we selected,
- 37:51each with their own study-level coefficient.
- 37:55And then for prediction, the process is very simple,
- 37:58we just check in that patient, whether gene A is greater
- 38:00than gene B for each of these pairs,
- 38:02and that gives us a binary vector of ones and zeros.
- 38:05We multiply that by the coefficients from the trained model.
- 38:09This is basically just calculating a linear predictor
- 38:11from this logistic regression model.
- 38:14And then we can convert that
- 38:15to a predicted probability of being basal.
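The prediction step described here (pairwise comparisons within one patient, a linear predictor, then a logistic transform) can be sketched in a few lines. This is a minimal illustration only: the gene names, intercept, and coefficients below are made up for the example and are not the published PurIST genes or weights, and the inclusion of an intercept is an assumption.

```python
import math

# Hypothetical gene pairs and coefficients, for illustration only.
PAIRS = [("GENE_A1", "GENE_B1"), ("GENE_A2", "GENE_B2"), ("GENE_A3", "GENE_B3")]
INTERCEPT = -1.0
COEFS = [2.0, 1.5, -0.5]


def tsp_indicators(expression, pairs):
    """For each (A, B) pair, return 1 if gene A > gene B in this patient, else 0."""
    return [1 if expression[a] > expression[b] else 0 for a, b in pairs]


def predict_probability(expression, pairs=PAIRS, intercept=INTERCEPT, coefs=COEFS):
    """Logistic-regression probability computed from the binary pair indicators."""
    indicators = tsp_indicators(expression, pairs)
    linear_predictor = intercept + sum(c * z for c, z in zip(coefs, indicators))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# One hypothetical patient's expression values (any within-patient scale).
patient = {
    "GENE_A1": 8.2, "GENE_B1": 3.1,
    "GENE_A2": 5.0, "GENE_B2": 6.4,
    "GENE_A3": 2.2, "GENE_B3": 1.9,
}
indicators = tsp_indicators(patient, PAIRS)
probability = predict_probability(patient)

# A monotone rescaling of the patient's values leaves the indicators unchanged.
rescaled = {gene: 10.0 * value + 3.0 for gene, value in patient.items()}
rescaled_indicators = tsp_indicators(rescaled, PAIRS)
```

Because only the within-patient ordering of each pair enters the model, any monotone transformation of a patient's expression values gives the same prediction, which is the property that lets this kind of classifier sidestep cross-study normalization.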
- 38:18So using this approach, we were able to select
- 38:2316 genes pertaining to eight pairs,
- 38:25and we find here that the predictions
- 38:27from this model tend to coincide very strongly
- 38:31with the labels that were collected
- 38:33using consensus clustering.
- 38:34So that gives us some confidence that we're reproducing,
- 38:36in some way, you know, the result that we got
- 38:41using this clustering approach.
- 38:43You can also clearly see here that as the subtype changes,
- 38:46that you see flips in the expression in each one
- 38:49of the pairs of genes that we selected
- 38:52in this particular study.
- 38:54And then when we applied this model
- 38:55to six external validation datasets, we found that it had
- 38:59a very good performance in terms of recapitulating subtype,
- 39:01where we had a relatively good sensitivity
- 39:04and specificity in each case, which we owe in part
- 39:07to the fact that we don't have to worry as much
- 39:08about this sort of cross-study normalization at training time
- 39:13or test time, and also the fact that we leveraged
- 39:17multiple data sets when selecting
- 39:21the predictors for this model.
- 39:22And so when we looked at the predicted values
- 39:24in these holdout studies, the predicted subtypes,
- 39:27we recapitulated the differences in survival
- 39:30that we observed in other studies as well,
- 39:32where basal-like patients do a lot worse
- 39:34compared to classical patients.
- 39:37If you want to look a little bit more at the details
- 39:39in this paper, you can check out this link here,
- 39:41and if you want to access the code that we used
- 39:44to make these predictions, that's available
- 39:45on this Github page at this link right here.
- 39:50Another thing that we were able to show is that for patients
- 39:53that had samples that are collected through different modes
- 39:56of collection, whether it was bulk, FNA, FFPE,
- 40:00we found that the predictions in these patients tend to be
- 40:03highly consistent, and this is basically deriving
- 40:06itself, again, from the simple assumption behind TSPs,
- 40:09where the relative rank within the subject of the expression
- 40:13of these genes is predictive.
- 40:15So as long as that is being preserved,
- 40:17then you should be able to have the model predict well
- 40:21in different scenarios.
- 40:23So when we also went through CLIA validation for this tool,
- 40:28we also confirmed 95% agreement between replicated runs
- 40:31in other platforms, and we also confirmed concordance
- 40:38between NanoString and RNA-seq, also through different modes
- 40:43of sample collection.
- 40:44So right now this is the first clinically applicable test
- 40:47for prospective first-line treatment selection in PDAC.
- 40:51And right now we do have a study that just recently opened
- 40:54at the Medical College of Wisconsin that's using PurIST
- 40:56for prospective treatment selection,
- 40:58and we have another one opening at University of Rochester,
- 41:02and also at UNC soon as well.
- 41:06So this is just an example about how you can take
- 41:10a problem, you know, in, from the literature,
- 41:14from your collaborators, come up with a method,
- 41:18and some theory behind it, and really be able to come up
- 41:22with a good solution that is robust,
- 41:24and that can really help your collaborators
- 41:27at your institution and elsewhere.
- 41:32Okay, so that was the case study.
- 41:34To talk about some current work
- 41:35that we're doing just briefly.
- 41:36So we wanted to think about how we can also scale up the,
- 41:39this particular framework that we developed for the pGLMM,
- 41:42and one idea that we're pursuing right now
- 41:44with my student Hillary, is that we're thinking
- 41:48about using, borrowing ideas from factor analysis
- 41:50to do a deterministic decomposition
- 41:53of the random effects to a lower dimensional space,
- 41:56where essentially, we can essentially map
- 42:00between the lower dimensional space (indistinct) factors,
- 42:03which is r-dimensional, to this higher dimensional space,
- 42:05using some matrix B, which is q by r,
- 42:12and essentially in doing so, this reduces the dimension
- 42:16of the integral in the Monte Carlo EM algorithm.
- 42:20So rather than having to approximate an integral
- 42:22in q dimensions, which can be difficult,
- 42:24you can work in a much lower space in terms of the integral,
- 42:27and then have this additional problem
- 42:29of trying to estimate this matrix,
- 42:31and map back to the original dimension q.
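The dimension reduction described here can be written compactly as below. This is a sketch under assumptions: the symbol $u_k$ for the low-dimensional factors and their standard normal distribution are introduced here for illustration, while $B$, $q$, and $r$ follow the talk.

```latex
\begin{align*}
\alpha_k &= B u_k, \qquad u_k \sim N_r(0, I_r), \qquad
  B \in \mathbb{R}^{q \times r}, \quad r \ll q, \\
\operatorname{Cov}(\alpha_k) &= B B^\top, \\
L_k &= \int_{\mathbb{R}^r} \Big[ \prod_{i=1}^{n_k}
  f(y_{ki} \mid x_{ki}, z_{ki}, B u_k; \beta) \Big] \, \phi(u_k; 0, I_r) \, du_k .
\end{align*}
```

The integral that the Monte Carlo E-step has to approximate is then over $r$ dimensions instead of $q$, at the cost of estimating the $q \times r$ matrix $B$.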
- 42:33So that's something that we're just starting to work on
- 42:35right now, and another thing that we're starting to work on
- 42:39is the idea of trying to extend some of the work
- 42:41in variational autoencoders
- 42:43that my student David is working on now.
- 42:45His current work is trying to account for missing data
- 42:48when trying to train these sort of deep learning models,
- 42:51the VAE is an unsupervised learning model oftentimes used
- 42:55for dimension reduction.
- 42:56You might've heard of it
- 42:57in single-cell sequencing applications.
- 43:01But the question that we wanted to address is, well,
- 43:03what if you have missing data, you know,
- 43:05in your input features X, which might be (indistinct)?
- 43:10So essentially we were able to develop input.
- 43:14So we have a pre-print up right now, it's the code,
- 43:17and we're looking to extend this, where essentially,
- 43:20rather than worrying about this latent space Z,
- 43:23which we're assuming that that encodes a lot
- 43:25of the information in the original data,
- 43:27we replaced that with learning the posterior
- 43:29of the random effect, given the observed data.
- 43:32And then in the second portion here, we replaced
- 43:34this generative model with the general model of y given X
- 43:39and the random effects.
- 43:41So that's another avenue that can allow us
- 43:43to hopefully account for non-linearity,
- 43:45and arbitrary interactions between features as well.
- 43:47And also it might be an easier way to scale up
- 43:49some of the analysis we've done too,
- 43:53which I've already mentioned.
- 43:55Okay, so in terms of some concluding thoughts,
- 43:58I talked a lot about how the original subtypes were derived
- 44:03for this pancreatic cancer case study using NMF
- 44:06and consensus clustering to get two subtypes.
- 44:09But there were also other groups that have published
- 44:12subtyping systems, where in one, they found
- 44:16three subtypes, and in another one they found four subtypes.
- 44:19So the question is, well, you know, well,
- 44:22which one do we use?
- 44:23Again, this is also confusing for practitioners
- 44:26about which approach might be more meaningful
- 44:29in the clinical setting.
- 44:30And each of these approaches were also derived
- 44:32using NMF and consensus clustering, and they were done
- 44:35separately on different patient cohorts
- 44:38at different institutions.
- 44:39So you can see that this is another reflection
- 44:41of heterogeneity in single-study learning,
- 44:45and how we can get these different or discrepant results
- 44:49from applying the same technique to heterogeneous datasets
- 44:52that were generated at different places.
- 44:54So of course this creates another problem, you know,
- 44:57who's right, which approach do we use?
- 45:00And it's kind of like a circular argument here.
- 45:03So in the paper that I mentioned before with PurIST,
- 45:07another thing that we did is we overlaid
- 45:09the other subtyping systems' calls
- 45:12with the observed clinical outcomes
- 45:15for the studies that we collected.
- 45:17And one of the things that we found was that,
- 45:19in these other subtyping systems,
- 45:22each of them also had something,
- 45:24something that was very similar to the basal-like subtype,
- 45:27and for the remaining subtypes, they had survival
- 45:30that was similar to the classical subtype.
- 45:33So one of the arguments that we made was that,
- 45:35well, if the clinical outcomes are the same
- 45:37for the other subtypes, you know,
- 45:40are they actually really necessary
- 45:42for clinical decision-making?
- 45:43That was one argument that we put forth.
- 45:46And when we looked at the response data, again,
- 45:48we saw that one of the subtypes in the other approaches
- 45:51also overlapped the basal-like subtype in terms of response.
- 45:56And then for the remaining subtypes,
- 45:57they were just kind of randomly dispersed at the other end,
- 46:01you know, of the spectrum here in terms of tumor present,
- 46:05tumor change after treatment.
- 46:07So the takeaway here is that heterogeneity
- 46:09between studies also impacts tasks in unsupervised learning,
- 46:14like the NMF+ consensus clustering approach
- 46:16to discover subtypes.
- 46:18And what this also does is, as you can imagine,
- 46:21this injects a lot of confusion into the literature,
- 46:24and can also slow down the process of translating
- 46:27some of these approaches to the clinic.
- 46:30So this also underlies the need
- 46:32for replicable cross-study subtype discovery approaches,
- 46:35for replicable approaches for unsupervised learning.
- 46:41That's something that, you know, something that we might,
- 46:43we hope to be working on in the future,
- 46:46and we hope to see more work on as well.
- 46:49So to summarize the, one of the major points
- 46:53of this talk was to introduce and discuss, you know,
- 46:55replicability issues in genomic prediction models,
- 46:58supervised learning, that stems from technical,
- 47:01and also non-technical sources.
- 47:03We also introduced a new approach to facilitate
- 47:07data integration and multi-study learning
- 47:09in a way that captures between-study heterogeneity,
- 47:12and showed how this can be used for the prediction
- 47:15of subtype for pancreatic cancer, and also introduced
- 47:20some scalable methods and future direction
- 47:23in replicable subtype discovery.
- 47:26So that's it for me.
- 47:28I just want to thank some of my faculty
- 47:30collaborators, Quefeng Li, Junier Oliva
- 47:33from UNC computer science, Jen Jen Yeh
- 47:37from surgical oncology at Lineberger,
- 47:40Joe Ibrahim as well, UNC biostatistics,
- 47:43and also my students, Hillary, who's done a lot of work
- 47:45in this area, and also David Lim, who's doing
- 47:48some of the deep learning work in our group.
- 47:50And that's it, thank you.
- 47:58<v Robert>So does anybody here have</v>
- 47:59any questions for the professor?
- 48:09Or anybody on the, on Zoom, any questions you want to ask?
- 48:26<v ->It looks like I'm off the hook.</v>
- 48:29<v Robert>All right, well, thank you so much.</v>
- 48:30Really appreciated your talk.
- 48:33Have a good afternoon.
- 48:36<v ->All right, thank you for having me.</v>