# YSPH Biostatistics Seminar: “Addressing the Replicability and Generalizability of Clinical Prediction Models”

September 08, 2021

## Information

Naim Rashid, PhD

Associate Professor, Department of Biostatistics

University of North Carolina at Chapel Hill

September 7, 2021

ID6891

- 00:00<v Robert>Hi, I'm Professor McDougal,</v>
- 00:06and Professor Wayne is also in the back.
- 00:08If you haven't signed in, please make sure that you pass
- 00:11this, get a chance to sign the sign-in sheet.
- 00:15So today we are very, very privileged to be joined
- 00:19by Professor Naim Rashid
- 00:22from the University of North Carolina Chapel Hill,
- 00:25Professor Rashid got his bachelor's in biology from Duke,
- 00:30and his PhD in biostatistics from UNC Chapel Hill.
- 00:35He's the author of 34 publications, and he holds a patent
- 00:40on methods and compositions for prognostic
- 00:44and/or diagnostic subtyping of pancreatic cancer.
- 00:48He's currently an associate professor at UNC Chapel Hill's
- 00:51department of biostatistics, and he's also affiliated
- 00:54with their comprehensive cancer center there.
- 00:59With that, Professor Rashid, would you like to take it away?
- 01:04<v ->Sure.</v>
- 01:06It looks like it says host disabled screen sharing.
- 01:10(chuckling)
- 01:12<v Robert>All right, give me one second.</v>
- 01:14Thank you.
- 01:17I'm trying to do.
- 01:27(indistinct)
- 01:34Okay, you should be, you should be able to come on now.
- 01:36<v ->All right.</v>
- 01:39Can you guys see my screen?
- 01:44All right.
- 01:48Can you guys see this?
- 01:50<v Robert>There we go.</v>
- 01:52Perfect. Thank you.
- 01:53<v ->Okay, great.</v>
- 01:54So yes, thanks to the department for inviting me to speak
- 01:57today, and also thanks to Robert and Wayne for organizing.
- 02:01And today I'll be talking about issues regarding
- 02:04replicability in terms of clinical prediction models,
- 02:08specifically in the context of genomic prediction models,
- 02:12derived from clinical trials.
- 02:16So as an overview, we'll be talking first a little bit
- 02:18about the problems of replicability in general,
- 02:21in scientific research, and also about specific issues
- 02:24in genomics itself, and then I'll be moving on to talking
- 02:28about a method that we've proposed to assist
- 02:31with issues regarding data integration, and learning
- 02:34in this environment when you have heterogeneous data sets.
- 02:38I'll talk a little bit about a case study
- 02:40where we apply these practices to subtyping
- 02:43pancreatic cancer, touch on some current work
- 02:45that we're doing, and then end
- 02:47with some concluding thoughts.
- 02:48And feel free to interrupt, you know,
- 02:50as the talk goes along, if you have any questions.
- 02:54So I'm now an associate professor in the department
- 02:56of biostatistics at UNC.
- 02:58My work generally involves problems
- 03:00surrounding cancer and genomics, and more recently
- 03:05we've been doing work regarding epigenomics.
- 03:07We just recently published a Bioconductor package called
- 03:09epigraHMM for consensus and differential peak calling,
- 03:13and we've also done some work in model-based clustering.
- 03:15We published a package called FSCseq,
- 03:18which helps you derive and discover clusters
- 03:22from RNA-seq data, while also determining
- 03:25cluster-specific genes.
- 03:26And today we'll be talking more about the topic
- 03:28of multi-study replicability, which is the topic
- 03:30of a paper that we published a year or two ago,
- 03:34and an R package that we've developed more recently,
- 03:37implementing some of the methods.
- 03:40So before I get deeper into the talk, one of the things
- 03:43I wanted to establish is this definition
- 03:45of what we mean by replicability.
- 03:47You might've heard the term reproducibility as well,
- 03:50and to make the distinction between the two terms,
- 03:52I'd like to define reproducibility in a way
- 03:54that Jeff Leek has defined in the past,
- 03:57where reproducibility is the ability to take
- 03:59code and data from a publication, and to rerun the code,
- 04:03and get the same results as the original publication.
- 04:06Whereas replicability we're defining as the ability to rerun
- 04:09an experiment generating new data, and get results
- 04:11that are quote, unquote "consistent"
- 04:14with that of the original study.
- 04:16So in this sort of context, when it comes to replicability,
- 04:19you might've heard about publications that have come out
- 04:22in the past that talk about how there are issues
- 04:24regarding replicating the research that's been published
- 04:28in the scientific literature.
- 04:30This one paper in PLOS Medicine was published
- 04:32by Ioannidis in 2005, and there's been a number
- 04:36of publications that have come out since,
- 04:38talking about problems regarding replicability,
- 04:41and ways that we could potentially address it.
- 04:43And the problem has become large enough where it has
- 04:46its own Wikipedia entry talking about the crisis,
- 04:49and has a long list of examples that talks
- 04:51about issues regarding replicating results
- 04:54from the scientific studies.
- 04:55So this is something that has been a known issue
- 04:58for a while, and these problems also extend
- 05:00to situations where you want to, for example,
- 05:03develop clinical prediction models in genomics.
- 05:06So to give an example of this, let's say that we wanted to,
- 05:10in the population of metastatic breast cancer patients,
- 05:13we wanted to develop a model that predicts
- 05:16some clinical outcome Y, given a set
- 05:18of gene expression values X.
- 05:21And so the purpose of this sort of exercise is
- 05:23to hopefully translate this sort of model
- 05:26that we've developed, and apply it to the clinic,
- 05:28where we can use it for clinical decision-making.
- 05:31Now, if we have data from one particular trial
- 05:35that pertains to this patient population,
- 05:37and the same clinical outcome being measured,
- 05:39in addition to having gene expression data,
- 05:41let's say that we derived a model, let's say
- 05:43that we're modeling some sort of binary outcome,
- 05:44let's say tumor response.
- 05:46And in this model, we used a lasso-type,
- 05:48or penalized logistic regression model
- 05:51that we fit to the data to try and predict the outcome,
- 05:54given the gene expression values.
- 05:56And here we obtained, let's say, 12 genes
- 05:59after the fitting process, and the internal model 1 AUC
- 06:04on the sort of training subjects is 0.9.
- 06:07But then let's say there's another group at Duke
- 06:09that's using data from their clinical trial,
- 06:11and they have a larger sample size.
- 06:13They also found more genes, 65 genes,
- 06:16but have a slightly lower training AUC.
- 06:18However, we really need to use external validation
- 06:22to sort of get an independent assessment of how well
- 06:25each one of these alternative models is doing.
- 06:27So let's say we have data from a similar study from Harvard,
- 06:30and we applied both these trained models
- 06:33to the genomic data from that study at Harvard.
- 06:35We have the outcome information for those patients as well,
- 06:38so we can calculate how well the model predicts
- 06:42on those validation subjects.
- 06:44And we find here in this data set,
- 06:46model 2 seems to be doing better than model 1,
- 06:49but if you try this again with another data set
- 06:51from Michigan, you might find that model 1 is doing
- 06:53better than model 2.
- 06:55So the problem here is where we have researchers
- 06:58that are pointing fingers at each other,
- 06:59and it's really hard to know, "Well, who's right?"
- 07:01And why is this even happening in the first place,
- 07:04in terms of why do we get different genes, numbers of genes,
- 07:06in each of the models derived from study 1 and study 2?
- 07:09And why are we seeing very low performance
- 07:12in some of these validation datasets?
- 07:15So here's an example from 2014,
- 07:17in the context of ovarian cancer.
- 07:20The authors basically collected 10 studies,
- 07:22all were microarray studies.
- 07:25The goal here was to predict overall survival
- 07:27in this population of ovarian cancer patients,
- 07:30given gene expression measurements
- 07:32from this microarray platform.
- 07:34So through a series
- 07:35of really complicated cross-study normalization approaches,
- 07:39the data was normalized, and harmonized
- 07:40across the studies, using a combination of ComBat
- 07:43and frozen RMA, and then they took
- 07:4614 published prediction models in the literature,
- 07:48and they applied each of those models to each
- 07:51of the subjects from these 10 studies, and they compared
- 07:53the model predictions across each subject.
- 07:58So each column here in this matrix is a patient,
- 08:00and each row is a different prediction model,
- 08:03and each cell represents the prediction
- 08:06from that model on that patient.
- 08:08So in an ideal scenario, where we have the models generalizing
- 08:12and replicating across each of these individuals,
- 08:14we would expect to see the column,
- 08:16each column here to have the same color value,
- 08:19meaning that the predictions are consistent.
- 08:20But clearly we see here that the predictions are
- 08:22actually very inconsistent,
- 08:24and very different from each other.
- 08:27In addition, if you look
- 08:28at the individual risk prediction models
- 08:30that the authors used, there were also
- 08:32substantial differences in the genes
- 08:34that were selected in each of these models.
- 08:36So there's a max 2% overlap in terms of common genes
- 08:40between each of these approaches.
- 08:41And one thing to mention here is that each one
- 08:43of these risk-prediction models was derived
- 08:45from separate individual studies.
- 08:48So the question here is, you know, how exactly,
- 08:51if you were a clinician, you're eager to sort of take
- 08:54the results that you're seeing here,
- 08:57and extend to the clinic,
- 08:58which model do you use, which is right?
- 09:01Why are you seeing this level of variability?
- 09:03This is, of course, concerning, if your goal is
- 09:06to move things towards the clinic, and this also has
- 09:08implications in terms of, you know, getting in the way
- 09:11of trying to approve the use of some
- 09:13of these for clinical use.
- 09:17So why is this happening?
- 09:19So there's been a lot of studies that have been done
- 09:22that have tied issues to, obviously, sample size
- 09:24in the training studies, smaller sample sizes,
- 09:27and models trained on them may lead to more unstable models,
- 09:31or less accurate models.
- 09:32Between different studies, you might have
- 09:35different prevalences of the clinical outcome.
- 09:36In some studies, you might have higher levels of response,
- 09:39and other studies, you might have lower levels of response,
- 09:40for example, if you have this binary clinical outcome,
- 09:43and also there's issues regarding differences
- 09:46in lab conditions, where the genomic data was extracted.
- 09:49We've seen at Lineberger that, depending on the type
- 09:52of extraction, RNA extraction kit that you use,
- 09:55you might see differences in the expression of a gene,
- 09:58even from the same original tumor.
- 10:00And also the issue of batch effects,
- 10:02which has been widely talked about in the literature,
- 10:04where depending on the day you run the experiment,
- 10:06or the technician who's handling the data,
- 10:11you might see slight differences,
- 10:12technical differences in expression.
- 10:15There's also differences due to protocols.
- 10:17Some trials might have different inclusion
- 10:18and exclusion criteria, so they might be recruiting
- 10:21a slightly different patient population,
- 10:22even though they might be all
- 10:24in the context of metastatic breast cancer.
- 10:25All of these things can help impart heterogeneity
- 10:29in both the genomic data and the outcome data
- 10:34across different studies.
- 10:36In the context of genomic data in particular,
- 10:39there's also this aspect of data preprocessing.
- 10:41The normalization technique that you use is very important,
- 10:45and we'll talk about that in a little bit.
- 10:47And it's a very critical part when it comes
- 10:48to training models, and trying to validate your model
- 10:52on other datasets, and depending on the type
- 10:54of normalization you use, this could also impact
- 10:58how well your model works.
- 11:00In addition, there's also differences in the potential way
- 11:03in which you measure gene expression.
- 11:04Some trials might use an older technology called microarray.
- 11:07And other trials might use something
- 11:09relatively more recent called RNA-seq,
- 11:11or a particular trial might use
- 11:13a more targeted platform like NanoString.
- 11:15So the differences in platform also can lead to differences
- 11:19in your ability to help validate some of these studies.
- 11:21If you train something on microarray, it's very difficult
- 11:24to take that model, and apply it to RNA-seq,
- 11:26because the expression values are just different.
- 11:30And so, as I mentioned before, this also speaks
- 11:32to the impact of normalization on model performance as well.
- 11:37So the main thing to remember here is that
- 11:40the traditional way in which prediction models
- 11:43based on genomic data for use in the clinic are trained is
- 11:46typically on the results from a single study.
- 11:52To talk a little bit more about the question
- 11:54of between-study normalization: the purpose of this is
- 11:57to put the expression data on basically an even scale,
- 12:00which helps facilitate training.
- 12:02If there's global shifts, and some of the expression values
- 12:06in one sample versus another, it's very difficult to train
- 12:09an accurate model in that particular scenario.
- 12:11So normalization helps to align
- 12:13the expression you get from different samples,
- 12:16and hopefully across the between difference as well.
- 12:19And so the goal here is to eventually predict this outcome
- 12:23in a new patient, you plug in the genomic data
- 12:25from a new patient in order to get the predicted outcome
- 12:28for that patient based on that training model.
- 12:30So in order to do that, you also have to normalize
- 12:34the new data to the training data, right?
- 12:36Because you also want to put the new data on the same scale
- 12:38as the training data, and in the ideal scenario,
- 12:41you would want to make sure that the training samples
- 12:44that you use to train your original model are untouched,
- 12:47because what some people try to do is they try
- 12:49to sort of sidestep this normalization issue,
- 12:52they would combine the new data with the old training data,
- 12:55and renormalize everything at once.
- 12:57And the problem with this is that this changes
- 12:59your training sample values, and in a sense,
- 13:01would necessitate the fact that you need to retrain
- 13:04your old model again.
- 13:04And this leads to instability, and lack of stability
- 13:07over time in terms of the model itself.
- 13:10So in the prior example from ovarian cancer,
- 13:12this is not as big of an issue, because you have
- 13:15all the data you want to work with in hand.
- 13:18This is a retrospective study, you have 10 data sets,
- 13:20so you just normalize everything at the same time,
- 13:22using ComBat and frozen RMA.
- 13:24And so you can split up those studies into separate training
- 13:27and test studies, and they're all already on the same scale.
- 13:31But the problem is that in practice, you're trying to do
- 13:34a prospective type of analysis, where when you train
- 13:37your model, you're normalizing all of the available studies
- 13:40you have, let's say, and then you use that to predict
- 13:44the outcome in a future patient, or a future study.
- 13:47And so the problem with that is that you have to find
- 13:51a good way to align, as I mentioned before,
- 13:55the data from that future study to your training samples,
- 13:57and that may not be an easy task to do,
- 14:00especially for some of the newer platforms like RNA-seq.
- 14:04So taking this problem a step further,
- 14:06what if there's no good cross study normalization approach
- 14:10that's available to begin with?
- 14:12This really is going to make things difficult in terms
- 14:15of training the model in the first place.
- 14:18Another more complicated problem is that you might have
- 14:21different types of platforms at training time.
- 14:24For example, you might have the only type of data
- 14:26that's available from one study is NanoString in one case,
- 14:29and in another study it's only RNA-seq, so what do you do?
- 14:33And looking forward, as platforms change,
- 14:35as technology evolves, you have different ways
- 14:36of measuring gene expression, for example.
- 14:42So what do you do with the models that are trained
- 14:44on old data, because you can't apply them to the new data?
- 14:48So oftentimes you find this situation
- 14:50where you have to retrain new models on these new platforms,
- 14:53and the old models are not able to be applied
- 14:57directly to these new data types.
- 14:58So that leads to waste here.
- 15:01So if you take all of these problems together,
- 15:03regarding cross-study normalization,
- 15:07and changes in platform,
- 15:09and a lot of the other issues, you know,
- 15:11regarding replicability that I mentioned,
- 15:13it's no wonder that there's only a small handful
- 15:17of expression-based, clinically applicable assays that have been
- 15:21approved by the FDA, like Oncotype DX, MammaPrint,
- 15:24and Prosigna, because this is a very, very tough problem.
- 15:30So I want to move on with that, to an approach
- 15:33that we proposed to help tackle this sort of issue
- 15:36by using this idea of multi-study learning,
- 15:39where instead of just using, and deriving, and generating
- 15:43models from individual studies, we combine data
- 15:45from multiple studies together, and create a consensus model
- 15:48that we use for prediction, which will hopefully be
- 15:50more stable, and more accurate down the road.
- 15:54So this approach of combining data is called
- 15:56horizontal data integration, where we're merging data
- 15:59from let's say K different studies.
- 16:01And the pro of this approach is that we get increased power,
- 16:04and the ability to reach some sort of consensus
- 16:06across these different studies.
- 16:09The negative is that the effect of a gene
- 16:12and its relationship to outcome may actually vary
- 16:14across studies, and also by, you know, depending on,
- 16:16and also the way that you normalize the genes may also vary
- 16:19across studies too if we're using published data
- 16:21from some prior publication.
- 16:24There's also this issue of sample size imbalance.
- 16:25You might have a study that has 500 subjects,
- 16:28and another one that might have 200 subjects.
- 16:30So there are some methods that were designed to account for
- 16:34between-study heterogeneity after you do
- 16:36horizontal data integration.
- 16:38One is called the meta-lasso, another is called
- 16:41the AW statistic, but these two methods don't really have
- 16:44any prediction aspect about them.
- 16:46They're more about feature selection.
- 16:48Ensembling is one approach that can directly account
- 16:50for between-study heterogeneity
- 16:52after horizontal data integration, but there's
- 16:54no explicit feature selection step here.
- 16:57But all of these approaches assume
- 16:59that the data has been pre-normalized.
- 17:02As we talked about before,
- 17:03for prospective decision-making, based off a trained model,
- 17:07that might be prohibitive in some cases,
- 17:10and we need a strategy also to easily predict
- 17:13and apply these models in new patients.
- 17:20Okay, so moving on, we're going to talk first
- 17:24about this issue of how do we integrate data,
- 17:27and sort of sidestep this normalization problem
- 17:30at training time, and also at test time,
- 17:33when we try to predict in new subjects?
- 17:35So the approach that we put forth is to use
- 17:39what's called top scoring pairs, which you can think of
- 17:41as a rank-based transformation of the original set
- 17:45of gene expression values from a patient.
- 17:47So the idea here originally,
- 17:50when top scoring pairs were introduced,
- 17:51was you're trying to find a pair of genes
- 17:53where it's such that if the expression of gene A
- 17:56in the pair is greater than gene B, that would imply
- 17:59that the, let's say, the subtype for that individual is,
- 18:03say, subtype one, and if it's less,
- 18:05then that implies subtype zero with high probability.
- 18:09Now, in this case, this sort of approach was developed
- 18:12for when one has a binary outcome variable
- 18:14that you care about.
- 18:15In this case, we're talking about subtype,
- 18:17but it could also be tumor response or something else.
- 18:20So essentially what you're doing is that you're taking
- 18:22these continuous measurements in terms of gene expression,
- 18:25or integer, and you are converting that, transforming
- 18:31that into basically a binary predictor,
- 18:32which takes on the value of zero or one.
- 18:34And the hope is that that particular transformed value is
- 18:38going to be associated with this binary outcome.
- 18:41So the simple assumption in this scenario is
- 18:44that the relative rank of these genes
- 18:46in a given sample is predictive of subtype, and that's it.
- 18:51And so the example here I have on the right is an example
- 18:54of two genes, GSTP1 and ESR1.
- 18:58And so you can see here that if you're
- 19:00in the upper left quadrant, this is where this gene is
- 19:02greater than this gene expression, it's implying
- 19:05the triangle subtype with high probability,
- 19:08and otherwise it implies the circle subtype.
- 19:11So that's the general idea of what we're going for here.
- 19:14It's a sort of a rank-based transformation
- 19:16of the original continuous predictor space.
- 19:21So the nice thing about this approach,
- 19:22because we're only based on the simple assumption, right?
- 19:25That we're only caring about the relative rank
- 19:27within a subject, this makes
- 19:29this particular new transformed predictor
- 19:32relatively invariant to batch effects, pre-normalization,
- 19:36and it also most importantly, simplifies merging data
- 19:39from different studies.
- 19:41Everything is now on the same scale, zero to one,
- 19:43so it's very easy to paste together the data
- 19:45from different studies, and we can sidestep this problem
- 19:50of trying to pick a cross-study normalization approach,
- 19:53and then work in this sort of transformed space.
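Editor's illustration: the invariance described here can be checked with a minimal sketch. It assumes toy data and a hypothetical distortion (a per-sample scale factor followed by a log transform); the function name `tsp_indicator` and all values are illustrative, not taken from the speaker's software. Because the distortion is strictly increasing within each sample, the within-sample ordering of the two genes, and therefore the binary predictor, is unchanged.

```python
import numpy as np

def tsp_indicator(expr, gene_a, gene_b):
    """Return 1 where gene_a is expressed above gene_b in a sample, else 0.

    expr: array of shape (n_samples, n_genes); gene_a, gene_b: column indices.
    """
    return (expr[:, gene_a] > expr[:, gene_b]).astype(int)

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=2.0, sigma=1.0, size=(6, 4))  # toy expression values

# Hypothetical technical distortion: each sample gets its own positive scale
# factor, then a log transform. Both steps preserve ordering within a sample.
scale = rng.uniform(0.5, 3.0, size=(6, 1))
distorted = np.log2(raw * scale + 1.0)

before = tsp_indicator(raw, 0, 1)
after = tsp_indicator(distorted, 0, 1)
unchanged = bool(np.array_equal(before, after))
```

The sketch covers only ordering-preserving distortions applied to a whole sample; a gene-specific shift could still flip a pair, which is why the speaker says "relatively invariant".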
- 19:57The other nice thing is that this is easily computable
- 19:59for new patients as well.
- 20:01If you have a new patient that comes into clinic,
- 20:03you just check to see whether the gene A is
- 20:04greater than gene B in terms of expression,
- 20:06and then you have your value for this top scoring pair,
- 20:11and we don't have to worry as much about normalizing
- 20:14this patient's raw gene expression data
- 20:18to the training sample expression values.
- 20:21So essentially what we're doing here is that we
- 20:23enumerate all possible gene pairs
- 20:26from a set of candidate genes, and each column here
- 20:28in this matrix shown on the right pertains
- 20:31to the zero-one values for a particular gene pair j.
- 20:34And so this value takes the value of one if A is greater
- 20:38than B in sample i for pair j, and zero otherwise.
- 20:41And then we merge over the common top scoring pairs.
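Editor's illustration: the enumerate-then-merge step can be sketched as follows. This is a minimal sketch under stated assumptions, not the speaker's implementation: the helper names `tsp_matrix` and `merge_studies`, the toy expression values, and the placeholder genes `GENE_C` and `GENE_D` are all hypothetical (only GSTP1 and ESR1 are named in the talk). Each study is converted to a binary matrix over the gene pairs it shares with the other studies, and the rows are stacked with a study label.

```python
from itertools import combinations
import numpy as np

def tsp_matrix(expr, genes):
    """Binary matrix with one column per gene pair (A, B): 1 if A > B in that sample."""
    pairs = list(combinations(genes, 2))
    idx = {g: j for j, g in enumerate(genes)}
    cols = [(expr[:, idx[a]] > expr[:, idx[b]]).astype(int) for a, b in pairs]
    return pairs, np.column_stack(cols)

def merge_studies(studies):
    """Stack TSP matrices from several studies over the genes they share.

    studies: list of (expr, genes) tuples, expr of shape (n_samples, n_genes).
    """
    # Sorting the shared genes gives every study the same pair ordering.
    common = sorted(set.intersection(*(set(g) for _, g in studies)))
    blocks, study_id = [], []
    for k, (expr, genes) in enumerate(studies):
        cols = [genes.index(g) for g in common]
        pairs, z = tsp_matrix(expr[:, cols], common)
        blocks.append(z)
        study_id.extend([k] * expr.shape[0])
    return pairs, np.vstack(blocks), np.array(study_id)

# Toy studies on very different scales (hypothetical values).
study_1 = (np.array([[5.0, 2.0, 9.0], [1.0, 4.0, 3.0]]), ["GSTP1", "ESR1", "GENE_C"])
study_2 = (np.array([[100.0, 700.0, 50.0, 20.0]]), ["ESR1", "GSTP1", "GENE_C", "GENE_D"])
pairs, merged, study_id = merge_studies([study_1, study_2])
```

No between-study normalization is applied before stacking: each row depends only on comparisons made within a single sample.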
- 20:46So in this example we have data from four different studies,
- 20:49each indicated by a different color here
- 20:50in the first track, and this data pertains to data
- 20:54from two different platforms,
- 20:55and three different cancer types.
- 20:56And so the clinical outcome here is binary subtype,
- 20:59which is given by the orange and the blue color here.
- 21:02So you can see here that we enumerated the TSPs,
- 21:05we merged the data together, and now we have
- 21:07this transformed predictor matrix.
- 21:09And the interesting thing is
- 21:10that you can definitely see some patterning here.
- 21:13Within any study, you have a particular set of TSPs
- 21:15that take on a value of one when the subtype is blue,
- 21:19and it flips when it's orange.
- 21:21And we see the same general pattern seems to replicate
- 21:24across different studies,
- 21:25but not every top scoring pair changes the same way
- 21:29across different studies.
- 21:32So if we cluster the rows here, we can also see
- 21:35some patterns sort of persist where we see
- 21:38some clustering by subtype,
- 21:40but also some clustering by study as well.
- 21:42And so what this implies is that there's a relationship
- 21:45between TSPs and subtypes, and that can vary across studies,
- 21:47which is not too different from what we've talked
- 21:50about regarding the issues we've seen
- 21:51in replicability in the past.
- 21:53So ideally we would like to see a particular gene pair,
- 21:57or TSP vector here take on a value of one,
- 22:01only when there's the orange subtype,
- 22:03and zero in the blue subtype, or vice versa.
- 22:05And we wanted to see this pattern replicated
- 22:07across patients and studies, but we see obviously
- 22:10that that's not the case.
- 22:12So now that we've sort of introduced,
- 22:15or proposed, this sort of approach to simplify
- 22:17data merging and normalization,
- 22:19the question now that we're sort of dealing
- 22:20with is well, how do we actually now find
- 22:22features that are consistent across different studies
- 22:26in their relationship with outcome, and also estimate
- 22:29their study-level effect, and then use them for prediction?
- 22:33So that leads us to the second part of our paper,
- 22:35where we developed a model to help select
- 22:39these particular study-consistent features
- 22:42while accounting for study-level heterogeneity.
- 22:47So to sort of illustrate the idea behind this,
- 22:49let's just start with a simple simulation
- 22:52where we're not doing any normalization,
- 22:54we're not worrying about resuming, everything's fine
- 22:56in terms of the expression values,
- 22:59and we're not doing any selection,
- 23:00no TSP transformation either.
- 23:03So we're going to simulate data pertaining
- 23:05to two, let's say, known biomarkers
- 23:06that are associated with binary subtype.
- 23:09We're going to generate K datasets,
- 23:11and we're going to try three different strategies
- 23:12for learning a prediction model from these data sets.
- 23:15And at the end, we're going to validate each of those models
- 23:18on an externally-generated data set
- 23:19to compare their prediction performance.
- 23:22So to do this, we're going to fit and assume for each study
- 23:25that we can fit it with a logistic regression model
- 23:28to model our binary outcome with these two predictors,
- 23:31and in generating these K data sets,
- 23:32we're going to vary the number of studies, K.
- 23:35So we might generate two training data sets, five, or 10,
- 23:38and also change the total sample size of each one,
- 23:40and make sure that the sample sizes are in balanced
- 23:42across the different studies, and then assume
- 23:45values for the coefficients for each of these predictors
- 23:50to be these values here, and lastly, to induce some sort
- 23:53of heterogeneity across the different training datasets,
- 23:56we're gonna add in sort of like a random value drawn
- 23:59from a normal distribution, where we're assuming
- 24:03this level of variance for this value.
- 24:05So basically we're just injecting heterogeneity
- 24:07into this data generation process.
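Editor's illustration: the data-generating process described here can be sketched as below. This is a sketch under stated assumptions rather than the speaker's simulation code: the function name `simulate_studies`, the default coefficients `(0, 1, 1)`, the variance `sigma`, and the equal per-study sample sizes are placeholders, since the values shown on the slide are not in the transcript. Each study gets its own normally distributed perturbation of the intercept and the two slopes, which is what injects between-study heterogeneity.

```python
import numpy as np

def simulate_studies(n_studies, n_per_study, beta=(0.0, 1.0, 1.0), sigma=1.0, seed=0):
    """Simulate K studies with a binary outcome and two predictors.

    beta: (intercept, slope_1, slope_2) shared across studies (placeholder values).
    sigma: standard deviation of the study-level perturbation of each coefficient.
    """
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    studies = []
    for k in range(n_studies):
        x = rng.normal(size=(n_per_study, 2))            # two biomarkers
        design = np.column_stack([np.ones(n_per_study), x])
        alpha_k = rng.normal(scale=sigma, size=3)         # study-level random effects
        coef_k = beta + alpha_k                           # study-specific coefficients
        prob = 1.0 / (1.0 + np.exp(-(design @ coef_k)))   # logistic link
        y = rng.binomial(1, prob)
        studies.append({"study": k, "x": x, "y": y, "coef": coef_k})
    return studies

studies = simulate_studies(n_studies=5, n_per_study=100, sigma=1.0, seed=1)
```

Setting `sigma=0.0` recovers the homogeneous case in which every study shares the same coefficients, which is a useful baseline when comparing training strategies.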
- 24:09So after we generate the training studies,
- 24:11then we're going to apply three different ways
- 24:13or strategies to the training data.
- 24:15The first is the individual study approach,
- 24:17which we've talked about before, where you train
- 24:20a generalized linear model separately for each study.
- 24:22The second approach is where you merge the data.
- 24:25Again, we're ignoring the normalization problem here
- 24:26in simulation, obviously, and then train a single GLM
- 24:30for the combined data, and then lastly,
- 24:32we're going to merge the data, and train
- 24:34a generalized linear mixed model,
- 24:35where we explicitly account for a random intercept,
- 24:38and a random slope for each predictor,
- 24:41assuming, you know, a study-level random effect.
- 24:45So after we do that, we'll generate a validation dataset
- 24:48from the same approach above, and then predict outcome
- 24:52in this validation dataset with respect
- 24:55to the models derived from each of these three strategies.
- 24:59So if we look at the individual strategy performance,
- 25:01where we fit a GLM logistic regression model
- 25:04separately for each study, and then apply it
- 25:06to this validation data set, we can check
- 25:08the prediction accuracy, we can find that,
- 25:11due to the induced level of heterogeneity
- 25:14between studies in predictor effects,
- 25:16in one study, we do really poorly,
- 25:18and another study we do really well,
- 25:20and this variation is entirely due to variations
- 25:24in the gene subtype relationship.
- 25:27And these predictions obviously vary as a result
- 25:29across the different studies.
- 25:30And this will reflect a little bit of what we see
- 25:32in some of the examples that we showed earlier,
- 25:35studies that were trained on different data sets.
- 25:40And then the second approach is where we combine
- 25:43the data sets, and train a single logistic regression model
- 25:46to predict outcome.
- 25:46And so we see that the median prediction error is better
- 25:49than most of the models here, but if we fit the GLMM,
- 25:52the median prediction (indistinct) gets better
- 25:54than some of the other approaches here.
- 25:56So this is basically just one example.
- 25:58So we did this over and over a hundred times
- 26:00for every single possible simulation condition,
- 26:03varying K, and the heterogeneity across different studies.
- 26:07And some of the things that we found was that
- 26:10the individual study approach had, as you can see,
- 26:12the worst prediction error overall,
- 26:14combining the data improved this a little bit,
- 26:17but the estimates for the coefficients
- 26:21from the combined GLMM were still biased.
- 26:23They're supposed to be two in this extreme scenario.
- 26:27And accounting for heterogeneity with the GLMM had
- 26:31the best performance out of the rest,
- 26:32and also had the lowest bias in terms
- 26:35of the regression coefficients as well.
- 26:39So this is great, but we also have a lot
- 26:42of potential top scoring pairs.
- 26:44We can't really estimate them all
- 26:47with a GLMM mixed model, so we need to find a way
- 26:50where we can, at least in reasonable dimension,
- 26:52figure out which fixed effects are non-zero,
- 26:55while accounting for, you know,
- 26:56this sort of study-level heterogeneity for each effect.
- 27:00So this led us to develop a pGLMM, which is basically
- 27:05a high-dimensional generalized linear mixed model,
- 27:08where we are able to select fixed and random effects
- 27:11simultaneously using a penalization framework.
- 27:13So essentially here, we're assuming that all the predictors
- 27:17in the model, we assume a random effect,
- 27:20a random slope for each one, and so we were aiming to select
- 27:23the features that have non-zero fixed effects
- 27:28in this particular approach, and indeed we're assuming
- 27:30these are going to be study-consistent.
- 27:32And to do this, we're going to reorganize
- 27:35the linear predictor from the standard GLMM,
- 27:38so basically we're starting with the same general likelihood
- 27:41for, you know, the generalized mixed model.
- 27:44Here, Y is our outcome, X is our predictor,
- 27:49alpha K is the random effect
- 27:53for the k-th study, phi here is typically assumed to be
- 27:58multivariate normal, mean zero, with
- 28:02some sort of unstructured covariance matrix typically.
- 28:05And so to sort of simplify this, we factor out
- 28:09the random effects covariance matrix,
- 28:10and incorporate into the linear predictor.
- 28:12And with some more reorganizing, now we're able to select
- 28:16the fixed effects and determine which random effects have
- 28:21truly non-zero variance, using this sort
- 28:24of joint penalization framework.
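
The reorganization described here can be written out roughly as follows. This is only a sketch under stated assumptions: the symbols $z_{ki}$, $\Gamma$ and $\delta_k$ are notation introduced for illustration, and writing the covariance as $\Sigma = \Gamma\Gamma^\top$ is one standard way to "factor out" the random effects covariance matrix. The publication has the exact form used.

```latex
% Standard GLMM linear predictor for subject i in study k
% (alpha_k: study-level random effects, assumed mean-zero multivariate normal)
\eta_{ki} = x_{ki}^\top \beta + z_{ki}^\top \alpha_k,
\qquad \alpha_k \sim N_q(0, \Sigma)

% Assumed factorization of the covariance, moved into the linear predictor
\Sigma = \Gamma \Gamma^\top, \quad
\alpha_k = \Gamma \delta_k, \quad
\delta_k \sim N_q(0, I_q)
\;\Longrightarrow\;
\eta_{ki} = x_{ki}^\top \beta + z_{ki}^\top \Gamma \delta_k
```

Under this parameterization, a row of $\Gamma$ that is entirely zero corresponds to a random effect with zero variance, so penalizing $\beta$ and the rows of $\Gamma$ together is what allows fixed and random effects to be selected jointly.
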
- 28:26If you want more detail, you can check out the publication
- 28:28that I linked above, and I also forgot to send out
- 28:31the link to this talk here.
- 28:33I'll do that right now, in case you want to check out
- 28:35some of the publications that I'm linking in this talk.
- 28:41Okay, so how do we do this estimation?
- 28:42We use a penalized MCECM algorithm,
- 28:44where in each E-step we're drawing from the posterior
- 28:47with respect to the random effects, given
- 28:48the current estimates of the parameters,
- 28:50and the observed data, using Metropolis-within-Gibbs.
- 28:55In the R package I'm going to talk about in a little bit,
- 28:58we updated this to use Hamiltonian Monte Carlo,
- 29:03but in the original version,
- 29:04we use Metropolis-within-Gibbs, where we skipped
- 29:07components that had zero variance from the M-step.
- 29:09And then we use, in the M-step,
- 29:12two conditional maximization steps
- 29:14where we first update beta, given the draws
- 29:17from the E-step, and the prior estimates for gamma here,
- 29:20and then update gamma using a group penalty.
- 29:24So we use a couple of other tricks
- 29:25to speed up performance here.
- 29:27I won't go too much into the details there,
- 29:29but you can check out the paper for more detail on that.
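
The talk does not say which group penalty is used, so the following is only a sketch of how a group penalty can remove a whole block of coefficients at once. It uses the group lasso proximal step (group soft-thresholding) as a stand-in. The function name and the example numbers are hypothetical, and this is not the update from the paper or the R package.

```python
import math


def group_soft_threshold(group, threshold):
    """Shrink a coefficient group toward zero as a unit.

    Illustrative group-lasso proximal step (an assumption, not the
    paper's penalty): if the Euclidean norm of the group is at most
    `threshold`, the whole group is set to zero; otherwise every
    entry is scaled by the same factor (1 - threshold / norm).
    """
    norm = math.sqrt(sum(v * v for v in group))
    if norm <= threshold:
        return [0.0] * len(group)
    scale = 1.0 - threshold / norm
    return [scale * v for v in group]


# Hypothetical example: one group survives (shrunk), one is removed entirely.
kept = group_soft_threshold([3.0, 4.0], threshold=1.0)     # norm 5 -> scaled by 0.8
dropped = group_soft_threshold([0.3, 0.4], threshold=1.0)  # norm 0.5 -> all zeros
```

If each group is taken to be the set of gamma entries tied to one predictor's random effect, zeroing the group corresponds to deciding that the predictor has no study-level variation.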
- 29:33But with this approach, one of the things
- 29:35that we were able to show was that we have
- 29:37similar conclusions regarding bias and prediction error,
- 29:39as in the simple setup we had before,
- 29:41where in this particular situation, we're simulating
- 29:43a bunch of predictors that do not have any association
- 29:47with outcome, either 10 or 50 extra predictors,
- 29:51where there's only two that are actually truly relevant.
- 29:54And so the prediction error in this model
- 29:56after this penalized selection process is
- 29:59generally the same, if not a little bit worse.
- 30:01And one thing that we find here is that
- 30:03selecting parameters
- 30:06with the individual study approach, where we're now applying
- 30:08a penalized logistic regression model, has
- 30:10a low sensitivity to detect the true predictors,
- 30:13and a higher false positive rate in terms of selecting
- 30:16predictors that aren't associated
- 30:17with outcome in simulation.
- 30:19And what we find here also is that the approach
- 30:23that we developed had a much better sensitivity
- 30:26compared to other approaches for selecting
- 30:28the true predictors when accounting
- 30:30for study-level heterogeneity,
- 30:32and a lower false positive rate as well.
- 30:36For the example data sets that I talked about before,
- 30:39the four that I showed a figure of earlier,
- 30:43we did a hold-out study analysis where we trained
- 30:45on three studies and held out one of the studies.
- 30:48We found that, you know, the approach that we put forward,
- 30:51combining the data using our TSP approach,
- 30:54and then training a model using the pGLMM, had
- 30:58the lowest overall holdout study error
- 31:00compared to the approach using just
- 31:02a regular generalized linear model,
- 31:06and then also the individual study approach as well.
- 31:09And we also compared it to another approach called
- 31:12the Meta-Lasso, which we were able to adapt
- 31:14to do prediction, and we didn't see that much improvement
- 31:16of performance as well.
- 31:17But in general, the result that we saw here was
- 31:21that the individual study approach had
- 31:23bad prediction error also across the different studies.
- 31:27So again, this sort of takes what we've already seen
- 31:29in the literature in terms of inconsistency,
- 31:31in terms of the number of genes that are being selected
- 31:33in each of these models, and also the variations
- 31:35in the prediction accuracy, this sort of reflects
- 31:38what we've been seeing in some of this prior work.
- 31:44So in order to implement this approach
- 31:46in a more systematic way, my student Hillary and I
- 31:49put together an R package called
- 31:51the glmmPen R package.
- 31:54So this was just recently submitted
- 31:56to Journal of Statistical Software, but if you want to track
- 31:59the code, it's available on Github right here,
- 32:02and we're in the process of submitting this to CRAN as well.
- 32:05This was sort of like a nice starter project that I gave
- 32:08to Hillary to, you know, get her feet wet with coding,
- 32:12and she's done a really great job, you know,
- 32:15in terms of putting this together.
- 32:16And some of the distinct differences between this
- 32:19and what we put forth in the paper is the use
- 32:21of Hamiltonian Monte Carlo in the E-step,
- 32:24instead of Metropolis-within-Gibbs.
- 32:26It's much faster, much more efficient.
- 32:27We also have added helper functions
- 32:29for the (indistinct) tuning parameters, and also making
- 32:33some diagnostic plots as well, after convergence.
- 32:37And we've also implemented some speed
- 32:39and memory improvements as well, to help with usability.
- 32:44Okay, so we talked about some issues
- 32:47regarding data integration, and then issues
- 32:50with normalization, how that impedes, or can impede
- 32:52validation in future patients, and then we introduced
- 32:56a way to sidestep the normalization problem,
- 32:59using this sort of rank-based transformation,
- 33:01and an approach to select consistent predictors
- 33:03in the presence of between-study heterogeneity.
- 33:07So next, I'm going to talk about a case study
- 33:09in pancreatic cancer, where we took a lot of these tools,
- 33:13and applied them to a problem that some collaborators
- 33:16of mine were having, you know, at the cancer center at UNC.
- 33:20And to give a brief overview of pancreatic cancer,
- 33:23it has a really poor prognosis.
- 33:26Five-year survival is very low, you know, typically 5%.
- 33:30The median survival tends to be less than 11 months,
- 33:32and the main reason why this is the case is that
- 33:35early detection is very difficult,
- 33:37and so when patients show up to the clinic,
- 33:40they're oftentimes in later stages, or gone metastatic.
- 33:44So for those reasons, it's really important to place
- 33:48patients on optimal therapies upfront, and choosing
- 33:51the best therapies, specifically for a patient, you know,
- 33:54after they're diagnosed.
- 33:56So breast and colorectal cancers have
- 33:59long-established subtyping systems that are oftentimes used.
- 34:02Again, an example of a few of them in breast
- 34:04that have actually been approved by the FDA
- 34:06for clinical use, but there's nothing available for,
- 34:09in terms of precision medicine for pancreatic cancer,
- 34:11except for a couple of targeted therapies
- 34:14for specific mutations.
- 34:17So in 2015, the Yeh Lab at UNC,
- 34:20using a combination of non-negative matrix factorization
- 34:24and consensus clustering, was able to discover
- 34:27two potentially clinically applicable subtypes
- 34:30in pancreatic cancer, which they call basal-like,
- 34:33the orange line here, which has a much worse survival
- 34:37compared to this classical subtype in blue,
- 34:41where patients seem to do a little bit better.
- 34:44And so with this approach, they used
- 34:45this unsupervised learning, set of learning techniques
- 34:48to derive these novel subtypes.
- 34:51And so when they took these subtypes and overlaid them
- 34:54from data from a clinical trial where they had
- 34:56treatment response information, they found that
- 34:58largely patients with the basal-like subtype tended to have
- 35:02tumors that did not respond
- 35:04to common first-line therapy, Folfirinox.
- 35:06Their tumors tended to grow from baseline.
- 35:08Whereas patients that were the classical subtype tended
- 35:12to respond better on average compared to the basal samples.
- 35:16So the implications here are that if your
- 35:20subtype is basal, you should avoid Folfirinox
- 35:23at baseline and be treated with an alternative type of drug,
- 35:25typically gemcitabine and nab-paclitaxel (Abraxane).
- 35:27And then for classical patients,
- 35:29they should receive Folfirinox.
- 35:32But the problem here is that subtyping clearly is
- 35:34an unsupervised learning approach, right?
- 35:36It's not a prediction tool.
- 35:37So this approach is quite limited:
- 35:42when you have to assign a subtype
- 35:45in a small number of patients, it just doesn't work.
- 35:48So what some people have done in the past
- 35:50is they simply take new patients, and recluster them
- 35:52with their existing training samples.
- 35:55The problem with that is that the subtype assignments
- 35:58for those original training samples might change
- 36:00when they recluster it.
- 36:01So it's not really
- 36:03a stable approach to do this.
- 36:05So the goal here was to leverage the existing training data
- 36:08that's available to the lab, which come
- 36:12from different platforms to come up with an approach,
- 36:15a classifier to predict subtype, so that given
- 36:18a new patient's genomic data,
- 36:20we get
- 36:23a predicted subtype for that individual.
- 36:25So of course, in that scenario, we also want to make sure
- 36:28that that process is simplified, and that we make
- 36:31this prediction process as easy as possible,
- 36:33in the face of all these issues we talked about regarding
- 36:36normalization of the training data to each other,
- 36:40and also normalization of the new patient data
- 36:42to the existing training data.
- 36:45So using some of the techniques that we just talked about,
- 36:49we came up with a classifier that we call PurIST,
- 36:51which was published in the CCR last year,
- 36:53where essentially we were able to do that.
- 36:56We take in the genomic data for a new patient,
- 36:59and are able to predict subtype based off of
- 37:04the trained model that we developed.
- 37:06And in this particular paper, we had nine data sets
- 37:09that we curated from the literature, three of which
- 37:11that we used for training,
- 37:13the rest we used for validation.
- 37:14And we did consensus clustering on all of them,
- 37:16using the gene list that was derived
- 37:18from the original publication,
- 37:21where the subtypes were discovered to get labels,
- 37:23subject labels for each one of the subjects
- 37:25in each one of these studies.
- 37:27So once we had those labels from consensus clustering,
- 37:30we then merged the data from our three largest studies,
- 37:33which are our training studies.
- 37:35We did some sample filtering based on quality,
- 37:37and we filtered some genes based off of, you know,
- 37:40expression levels and things like that.
- 37:42And then we applied our previous training approach
- 37:45to get a small subset of top scoring pairs from the data.
- 37:50And in this case, we have eight that we selected,
- 37:51each with their own study-level coefficient.
- 37:55And then for prediction, the process is very simple,
- 37:58we just check in that patient whether gene A is greater
- 38:00than gene B for each of these pairs,
- 38:02and that gives us a binary vector of ones and zeros.
- 38:05We multiply that by the coefficients from the trained model.
- 38:09This is basically just calculating a linear predictor
- 38:11from this logistic regression model.
- 38:14And then we can convert that
- 38:15to a predicted probability of being basal.
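
The prediction step just described (pairwise comparisons, then a linear predictor, then a probability) can be sketched in a few lines. Everything specific here is a placeholder chosen for illustration: the gene names, the intercept and the coefficients are hypothetical and are not the published PurIST genes or weights.

```python
import math

# Hypothetical gene pairs and coefficients, for illustration only.
PAIRS = [("GENE_A1", "GENE_B1"), ("GENE_A2", "GENE_B2"), ("GENE_A3", "GENE_B3")]
INTERCEPT = -1.5
COEFS = [2.0, 1.0, 1.5]


def tsp_indicators(expression, pairs):
    """Return 1 for each pair whose first gene is expressed above the second, else 0."""
    return [1 if expression[a] > expression[b] else 0 for a, b in pairs]


def predict_basal_probability(expression, pairs=PAIRS, intercept=INTERCEPT, coefs=COEFS):
    """Logistic-regression prediction from the binary pair indicators."""
    indicators = tsp_indicators(expression, pairs)
    linear_predictor = intercept + sum(c * z for c, z in zip(coefs, indicators))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Made-up expression values for one patient.
patient = {
    "GENE_A1": 8.2, "GENE_B1": 3.1,
    "GENE_A2": 1.0, "GENE_B2": 4.0,
    "GENE_A3": 6.5, "GENE_B3": 6.0,
}
probability = predict_basal_probability(patient)

# Any increasing rescaling of this patient's values leaves the indicators unchanged.
rescaled = {gene: 10.0 * value + 3.0 for gene, value in patient.items()}
```

Because only the within-patient ordering of each pair enters the model, the rescaled patient gets the same indicators and the same predicted probability, which is the property that lets this kind of classifier sidestep cross-study normalization.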
- 38:18So using this approach, we were able to select
- 38:2316 genes pertaining to eight pairs,
- 38:25and we find here that the predictions
- 38:27from this model tend to coincide very strongly
- 38:31with the labels that were obtained
- 38:33using consensus clustering.
- 38:34So that gives us some confidence that we're reproducing
- 38:36in some way, you know, the result that we got
- 38:41using this clustering approach.
- 38:43You can also clearly see here that as the subtype changes,
- 38:46that you see flips in the expression in each one
- 38:49of the pairs of genes that we collected
- 38:52in this particular study.
- 38:54And then when we applied this model
- 38:55to six external validation datasets, we found that it had
- 38:59a very good performance in terms of recapitulating subtype,
- 39:01where we had a relatively good sensitivity
- 39:04and specificity in each case, which we owe in part
- 39:07to the fact that we don't have to worry as much
- 39:08about this sort of cross-study normalization at training time
- 39:13or test time, and also the fact that we leveraged
- 39:17multiple data sets when selecting
- 39:21the predictors for this model.
- 39:22And so when we looked at the predicted values
- 39:24in these holdout studies, the predicted subtypes,
- 39:27we recapitulated the differences in survival
- 39:30that we observed in other studies as well,
- 39:32where basal-like patients do a lot worse
- 39:34compared to classical patients.
- 39:37If you want to look a little bit more at the details
- 39:39in this paper, you can check out this link here,
- 39:41and if you want to access the code that we used
- 39:44to make these predictions, that's available
- 39:45on this Github page at this link right here.
- 39:50Another thing that we were able to show is that for patients
- 39:53that had samples that are collected through different modes
- 39:56of collection, whether it was bulk, FNA, FFPE,
- 40:00we found that the predictions in these patients tend to be
- 40:03highly consistent, and this is basically deriving
- 40:06itself, again, from the simple assumption behind TSPs,
- 40:09where the relative rank within the subject of the expression
- 40:13of these genes is predictive.
- 40:15So as long as that is being preserved,
- 40:17then you should be able to have the model predict well
- 40:21in different scenarios.
- 40:23So when we also went through CLIA validation for this tool,
- 40:28we also confirmed 95% agreement between replicated runs
- 40:31in other platforms, and we also confirmed concordance
- 40:38between NanoString and RNA-seq, also through different modes
- 40:43of sample collection.
- 40:44So right now this is the first clinically applicable test
- 40:47for prospective first-line treatment selection in PDAC.
- 40:51And right now we do have a study that just recently opened
- 40:54at the Medical College of Wisconsin that's using PurIST
- 40:56for prospective treatment selection,
- 40:58and we have another one opening at University of Rochester,
- 41:02and also at UNC soon as well.
- 41:06So this is just an example about how you can take
- 41:10a problem, you know, from the literature,
- 41:14from your collaborators, come up with a method,
- 41:18and some theory behind it, and really be able to come up
- 41:22with a good solution that is robust,
- 41:24and that can really help your collaborators
- 41:27at your institution and elsewhere.
- 41:32Okay, so that was the case study.
- 41:34To talk about some current work
- 41:35that we're doing just briefly.
- 41:36So we wanted to think about how we can also scale up
- 41:39this particular framework that we developed for the pGLMM,
- 41:42and one idea that we're pursuing right now
- 41:44with my student Hillary, is that we're thinking
- 41:48about using, borrowing ideas from factor analysis
- 41:50to decompose, do a deep, deterministic decomposition
- 41:53of the random effects to a lower dimensional space,
- 41:56where we can essentially map
- 42:00between the lower dimensional space (indistinct) factors,
- 42:03which is r-dimensional, to this higher dimensional space,
- 42:05using some matrix B, which is q by r,
- 42:12and essentially in doing so, this reduces the dimension
- 42:16of the integral in the Monte Carlo EM algorithm.
- 42:20So rather than having to approximate an integral
- 42:22in q dimensions, which can be difficult,
- 42:24you can work in a much lower-dimensional space in terms of the integral,
- 42:27and then have this additional problem
- 42:29of trying to estimate this matrix,
- 42:31and map back to the original dimension q.
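
The dimension reduction being described can be sketched as below. The talk only gives the shape of B (q by r) and the idea of mapping from an r-dimensional space of factors back to q dimensions. The symbol $u_k$, the standard normal assumption on the factors, and the resulting form of the covariance are assumptions added for illustration.

```latex
% Random effects for study k written through r latent factors, with r much smaller than q
\alpha_k = B\, u_k, \qquad
B \in \mathbb{R}^{q \times r}, \quad
u_k \in \mathbb{R}^{r}, \quad r \ll q

% Assumed: u_k \sim N_r(0, I_r), which implies
\operatorname{Var}(\alpha_k) = B B^\top

% The E-step integral is then over u_k in r dimensions instead of \alpha_k in q dimensions
```

The price of the smaller integral is that $B$ itself now has to be estimated.
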
- 42:33So that's something that we're just starting to work on
- 42:35right now, and another thing that we're starting to work on
- 42:39is the idea of trying to extend some of the work
- 42:41in variational autoencoders
- 42:43that my student David is working on now.
- 42:45His current work is trying to account for missing data
- 42:48when trying to train these sort of deep learning models,
- 42:51the VAE is an unsupervised learning model that's oftentimes used
- 42:55for dimension reduction.
- 42:56You might've heard of it
- 42:57in single-cell sequencing applications.
- 43:01But the question that we wanted to address is, well,
- 43:03what if you have missing data, you know,
- 43:05in your input features X, which might be (indistinct)?
- 43:10So essentially we were able to develop input.
- 43:14So we have a pre-print up right now, it's the code,
- 43:17and we're looking to extend this, where essentially,
- 43:20rather than worrying about this latent space Z,
- 43:23which we're assuming that that encodes a lot
- 43:25of the information in the original data,
- 43:27we replaced that with learning the posterior
- 43:29of the random effect, given the observed data.
- 43:32And then in the second portion here, we replaced
- 43:34this generative model with the general model of y given X
- 43:39and the random effects.
- 43:41So that's another avenue that can allow us
- 43:43to hopefully account for non-linearity,
- 43:45and arbitrary interactions between features as well.
- 43:47And also it might be an easier way to scale up
- 43:49some of the analysis we've done too,
- 43:53which I've already mentioned.
- 43:55Okay, so in terms of some concluding thoughts,
- 43:58I talked a lot about how the original subtypes were derived
- 44:03for this pancreatic cancer case study using NMF
- 44:06and consensus clustering to get two subtypes.
- 44:09But there were also other groups that have published
- 44:12subtyping systems, where in one, they found
- 44:16three subtypes, and in another one they found four subtypes.
- 44:19So the question is, well, you know, well,
- 44:22which one do we use?
- 44:23Again, this is also confusing for practitioners
- 44:26about which approach might be more meaningful
- 44:29in the clinical setting.
- 44:30And each of these approaches were also derived
- 44:32using NMF and consensus clustering, and they were done
- 44:35separately on different patient cohorts
- 44:38at different institutions.
- 44:39So you can see that this is another reflection
- 44:41of heterogeneity in single-study learning,
- 44:45and how we can get these different or discrepant results
- 44:49from applying the same technique to heterogeneous datasets
- 44:52that were generated at different places.
- 44:54So of course this creates another problem, you know,
- 44:57who's right, which approach do we use?
- 45:00And it's kind of like a circular argument here.
- 45:03So in the paper that I mentioned before with PurIST,
- 45:07another thing that we did is we overlaid
- 45:09the other subtype system calls
- 45:12with the observed clinical outcomes
- 45:15for the studies that we collected.
- 45:17And one of the things that we found was that,
- 45:19in these other subtyping systems,
- 45:22each of them also had something,
- 45:24something that was very similar to the basal-like subtype,
- 45:27and for the remaining subtypes, they had survival
- 45:30that was similar to the classical subtype.
- 45:33So one of the arguments that we made was that,
- 45:35well, if the clinical outcomes are the same
- 45:37for the other subtypes, you know,
- 45:40are they really necessary
- 45:42for clinical decision-making?
- 45:43That was one argument that we put forth.
- 45:46And when we looked at the response data, again,
- 45:48we saw that one of the subtypes in the other approaches
- 45:51also overlapped the basal-like subtype in terms of response.
- 45:56And then for the remaining subtypes,
- 45:57they were just kind of randomly dispersed at the other end,
- 46:01you know, of the spectrum here in terms of
- 46:05tumor change after treatment.
- 46:07So the takeaway here is that heterogeneity
- 46:09between studies also impacts tasks in unsupervised learning,
- 46:14like the NMF+ consensus clustering approach
- 46:16to discover subtypes.
- 46:18And what this also does is, as you can imagine,
- 46:21this injects a lot of confusion into the literature,
- 46:24and can also slow down the process of translating
- 46:27some of these approaches to the clinic.
- 46:30So this also underlies the need
- 46:32for replicable cross-study subtype discovery approaches,
- 46:35for replicable approaches for unsupervised learning.
- 46:41That's something that, you know, something that we might,
- 46:43we hope to be working on in the future,
- 46:46and we hope to see more work on as well.
- 46:49So to summarize the, one of the major points
- 46:53of this talk was to introduce and discuss, you know,
- 46:55replicability issues in genomic prediction models,
- 46:58supervised learning, that stems from technical,
- 47:01and also non-technical sources.
- 47:03We also introduced a new approach to facilitate
- 47:07data integration and multi-study learning
- 47:09in a way that captures between-study heterogeneity,
- 47:12and showed how this can be used for the prediction
- 47:15of subtype for pancreatic cancer, and also introduced
- 47:20some scalable methods and future direction
- 47:23in replicable subtype discovery.
- 47:26So that's it for me.
- 47:28I just want to thank some of my faculty
- 47:30collaborators, Quefeng Li, Junier Oliva
- 47:33from UNC computer science, Jen Jen Yeh
- 47:37from surgical oncology at Lineberger,
- 47:40Joe Ibrahim as well, UNC biostatistics,
- 47:43and also my students, Hillary, who's done a lot of work
- 47:45in this area, and also David Lim, who's doing
- 47:48some of the deep learning work in our group.
- 47:50And that's it, thank you.
- 47:58<v Robert>So does anybody here have</v>
- 47:59any questions for the professor?
- 48:09Or anybody on the, on Zoom, any questions you want to ask?
- 48:26<v ->It looks like I'm off the hook.</v>
- 48:29<v Robert>All right, well, thank you so much.</v>
- 48:30Really appreciated your talk.
- 48:33Have a good afternoon.
- 48:36<v ->All right, thank you for having me.</v>