
YSPH Biostatistics Seminar: “Addressing the Replicability and Generalizability of Clinical Prediction Models”

September 08, 2021
  • 00:00<v Robert>Hi, I'm Professor McDougal,</v>
  • 00:06and Professor Wayne is also in the back.
  • 00:08If you haven't signed in, please make sure that you pass
  • 00:11this around, and get a chance to sign the sign-in sheet.
  • 00:15So today we are very, very privileged to be joined
  • 00:19by Professor Naim Rashid
  • 00:22from the University of North Carolina Chapel Hill,
  • 00:25Professor Rashid got his bachelor's in biology from Duke,
  • 00:30and his PhD in biostatistics from UNC Chapel Hill.
  • 00:35He's the author of 34 publications, and he holds a patent
  • 00:40on methods and compositions for prognostic
  • 00:44and/or diagnostic subtyping of pancreatic cancer.
  • 00:48He's currently an associate professor at UNC Chapel Hill's
  • 00:51department of biostatistics, and he's also affiliated
  • 00:54with their comprehensive cancer center there.
  • 00:59With that, Professor Rashid, would you like to take it away?
  • 01:04<v ->Sure.</v>
  • 01:06It looks like it says host disabled screen sharing.
  • 01:10(chuckling)
  • 01:12<v Robert>All right, give me one second.</v>
  • 01:14Thank you.
  • 01:17I'm trying to do.
  • 01:27(indistinct)
  • 01:34Okay, you should be, you should be able to come on now.
  • 01:36<v ->All right.</v>
  • 01:39Can you guys see my screen?
  • 01:44All right.
  • 01:48Can you guys see this?
  • 01:50<v Robert>There we go.</v>
  • 01:52Perfect. Thank you.
  • 01:53<v ->Okay, great.</v>
  • 01:54So yes, thanks to the department for inviting me to speak
  • 01:57today, and also thanks to Robert and Wayne for organizing.
  • 02:01And today I'll be talking about issues regarding
  • 02:04replicability in terms of clinical prediction models,
  • 02:08specifically in the context of genomic prediction models,
  • 02:12derived from clinical trials.
  • 02:16So as an overview, we'll be talking first a little bit
  • 02:18about the problems of replicability in general,
  • 02:21in scientific research, and also about specific issues
  • 02:24in genomics itself, and then I'll be moving on to talking
  • 02:28about a method that we've proposed to assist
  • 02:31with issues regarding data integration, and learning
  • 02:34in this environment when you have a heterogeneous data sets.
  • 02:38I'll talk a little bit about a case study
  • 02:40where we apply these practices to subtyping
  • 02:43pancreatic cancer, touch on some current work
  • 02:45that we're doing, and then end
  • 02:47with some concluding thoughts.
  • 02:48And feel free to interrupt, you know,
  • 02:50as the talk is long, if you have any questions.
  • 02:54So I'm now an associate professor in the department
  • 02:56of biostatistics at UNC.
  • 02:58My work generally involves problems
  • 03:00surrounding cancer and genomics, and more recently
  • 03:05we've been doing work regarding epigenomics.
  • 03:07We just recently published a Bioconductor package called
  • 03:09epigraHMM for consensus and differential peak calling,
  • 03:13and we've also done some work in model-based clustering.
  • 03:15We published a package called FSCseq,
  • 03:18which helps you derive and discover clusters
  • 03:22from RNA-seq data, while also determining
  • 03:25cluster-specific genes.
  • 03:26And today we'll be talking more about the topic
  • 03:28of multi-study replicability, which is the topic
  • 03:30of a paper that we published a year or two ago,
  • 03:34and an R package that we've developed more recently,
  • 03:37implementing some of those methods.
  • 03:40So before I get deeper into the talk, one of the things
  • 03:43I wanted to establish is this definition
  • 03:45of what we mean by replicability.
  • 03:47You might've heard the term reproducibility as well,
  • 03:50and to make the distinction between the two terms,
  • 03:52I'd like to define reproducibility in a way
  • 03:54that Jeff Leek has defined it in the past,
  • 03:57where reproducibility is the ability to take
  • 03:59code and data from a publication, and to rerun the code,
  • 04:03and get the same results as the original publication.
  • 04:06Whereas replicability we're defining as the ability to rerun
  • 04:09an experiment, generating new data, and get results
  • 04:11that are quote, unquote "consistent"
  • 04:14with that of the original study.
  • 04:16So in this sort of context, when it comes to replicability,
  • 04:19you might've heard about publications that have come out
  • 04:22in the past that talk about how there are issues
  • 04:24regarding replicating the research that's been published
  • 04:28in the scientific literature.
  • 04:30This one paper in PLOS Medicine was published
  • 04:32by Ioannidis in 2005, and there's been a number
  • 04:36of publications that have come out since,
  • 04:38talking about problems regarding replicability,
  • 04:41and ways that we could potentially address it.
  • 04:43And the problem has become large enough where it has
  • 04:46its own Wikipedia entry talking about the crisis,
  • 04:49and has a long list of examples that talks
  • 04:51about issues regarding replicating results
  • 04:54from the scientific studies.
  • 04:55So this is something that has been a known issue
  • 04:58for a while, and these problems also extend
  • 05:00to situations where you want to, for example,
  • 05:03develop clinical prediction models in genomics.
  • 05:06So to give an example of this, let's say that we wanted to,
  • 05:10in the population of metastatic breast cancer patients,
  • 05:13we wanted to develop a model that predicts
  • 05:16some clinical outcome Y, given a set
  • 05:18of gene expression values X.
  • 05:21And so the purpose of this sort of exercise is
  • 05:23to hopefully translate this sort of model
  • 05:26that we've developed, and apply it to the clinic,
  • 05:28where we can use it for clinical decision-making.
  • 05:31Now, if we have data from one particular trial
  • 05:35that pertains to this patient population,
  • 05:37and the same clinical outcome being measured,
  • 05:39in addition to having gene expression data,
  • 05:41let's say that we derived a model, let's say
  • 05:43that we're modeling some sort of binary outcome,
  • 05:44let's say tumor response.
  • 05:46And in this model, we used a penalized
  • 05:48logistic regression model
  • 05:51that we fit to the data to try and predict the outcome,
  • 05:54given the gene expression values.
  • 05:56And here we obtained, let's say, 12 genes
  • 05:59after the fitting process, and the internal AUC of model 1
  • 06:04on the training subjects is 0.9.
  • 06:07But then let's say there's another group at Duke
  • 06:09that's using data from their clinical trial,
  • 06:11and they have a larger sample size.
  • 06:13They also found more genes, 65 genes,
  • 06:16but have a slightly lower training AUC.
  • 06:18However, we really need to use external validation
  • 06:22to sort of get an independent assessment of how well
  • 06:25each one of these alternative models is doing.
  • 06:27So let's say we have data from a similar study from Harvard,
  • 06:30and we applied both of these trained models
  • 06:33to the genomic data from that study at Harvard.
  • 06:35We have the outcome information for those patients as well,
  • 06:38so we can calculate how well the model predicts
  • 06:42on those validation subjects.
  • 06:44And we find here in this data set,
  • 06:46model 2 seems to be doing better than model 1,
  • 06:49but if you try this again with another data set
  • 06:51from Michigan, you might find that model 1 is doing
  • 06:53better than model 2.
  • 06:55So the problem here is where we have researchers
  • 06:58that are pointing fingers at each other,
  • 06:59and it's really hard to know, "Well, who's right?"
  • 07:01And why is this even happening in the first place,
  • 07:04in terms of why do we get different genes, and numbers of genes,
  • 07:06in each of the models derived from study 1 and study 2?
  • 07:09And why are we seeing very low performance
  • 07:12in some of these validation datasets?
  • 07:15So here's an example from 2014,
  • 07:17in the context of ovarian cancer.
  • 07:20The authors basically collected 10 studies,
  • 07:22all were microarray studies.
  • 07:25The goal here was to predict overall survival
  • 07:27in this population of ovarian cancer patients,
  • 07:30given gene expression measurements
  • 07:32from this microarray platform.
  • 07:34So through a series
  • 07:35of really complicated cross-study normalization approaches,
  • 07:39the data was normalized, and harmonized
  • 07:40across the studies, using a combination of ComBat
  • 07:43and frozen RMA, and then they took
  • 07:4614 published prediction models in the literature,
  • 07:48and they applied each of those models to each
  • 07:51of the subjects from these 10 studies, and they compared
  • 07:53the model predictions across each subject.
  • 07:58So each column here in this matrix is a patient,
  • 08:00and each row is a different prediction model,
  • 08:03and each cell represents the prediction
  • 08:06from that model on that patient.
  • 08:08So in an ideal scenario, where we have the models generalizing
  • 08:12and replicating across each of these individuals,
  • 08:14we would expect to see the column,
  • 08:16each column here to have the same color value,
  • 08:19meaning that the predictions are consistent.
  • 08:20But clearly we see here that the predictions are
  • 08:22actually very inconsistent,
  • 08:24and very different from each other.
  • 08:27In addition, if you look
  • 08:28at the individual risk prediction models
  • 08:30that the authors used, there was also
  • 08:32substantial differences in the genes
  • 08:34that were selected in each of these models.
  • 08:36So there's a max 2% overlap in terms of common genes
  • 08:40between each of these approaches.
  • 08:41And one thing to mention here is that each one
  • 08:43of these risk-prediction models were derived
  • 08:45from separate individual studies.
  • 08:48So the question here is, you know, how exactly,
  • 08:51if you were a clinician, you're eager to sort of take
  • 08:54the results that you're seeing here,
  • 08:57and extend to the clinic,
  • 08:58which model do you use, which is right?
  • 09:01Why are you seeing this level of variability?
  • 09:03This is, of course, concerning, if you, if your goal is
  • 09:06to move things towards the clinic, and this also has
  • 09:08implications in terms of, you know, getting in the way
  • 09:11of trying to approve the use of some
  • 09:13of these assays for clinical use.
  • 09:17So why is this happening?
  • 09:19So there have been a lot of studies done
  • 09:22that have tied these issues to, obviously, sample size
  • 09:24in the training studies, smaller sample sizes,
  • 09:27and models trained on them may lead to more unstable models,
  • 09:31or less accurate models.
  • 09:32Between different studies, you might have
  • 09:35different prevalences of the clinical outcome.
  • 09:36In some studies, you might have higher levels of response,
  • 09:39and other studies, you might have lower levels of response,
  • 09:40for example, if you have this binary clinical outcome,
  • 09:43and also there's issues regarding differences
  • 09:46in lab conditions, where the genomic data was extracted.
  • 09:49We've seen at Lineberger that, depending on the type
  • 09:52of extraction, RNA extraction kit that you use,
  • 09:55you might see differences in the expression of a gene,
  • 09:58even from the same original tumor.
  • 10:00And there's also the issue of batch effects,
  • 10:02which has been widely talked about in the literature,
  • 10:04where depending on the day you run the experiment,
  • 10:06or the technician who's handling the data,
  • 10:11you might see slight differences,
  • 10:12technical differences in expression.
  • 10:15There's also differences due to protocols.
  • 10:17Some trials might have different inclusion
  • 10:18and exclusion criteria, so they might be recruiting
  • 10:21a slightly different patient population,
  • 10:22even though they might be all
  • 10:24in the context of metastatic breast cancer.
  • 10:25All of these things can impart heterogeneity
  • 10:29in the genomic data and the outcome data
  • 10:34across different studies.
  • 10:36In the context of genomic data in particular,
  • 10:39there's also this aspect of data preprocessing.
  • 10:41The normalization technique that you use is very important,
  • 10:45and we'll talk about that in a little bit.
  • 10:47And it's a very critical part when it comes
  • 10:48to training models, and trying to validate your model
  • 10:52on other datasets, and depending on the type
  • 10:54of normalization you use, this could also impact
  • 10:58how well your model works.
  • 11:00In addition, there's also differences in the potential way
  • 11:03in which you measure gene expression.
  • 11:04Some trials might use an older technology called microarray.
  • 11:07Other trials might use something
  • 11:09relatively more recent called RNA-seq,
  • 11:11or a particular trial might use
  • 11:13a more targeted platform like NanoString.
  • 11:15So the differences in platform also can lead to differences
  • 11:19in your ability to help validate some of these studies.
  • 11:21If you train something on microarray, it's very difficult
  • 11:24to take that model, and apply it to RNA-seq,
  • 11:26because the expression values are just different.
  • 11:30And so, as I mentioned before, this also impacts,
  • 11:32through normalization, model performance as well.
  • 11:37So the main thing to remember here is that
  • 11:40the traditional way in which prediction models
  • 11:43based on genomic data are trained for clinical use is
  • 11:46typically based on the results from a single study.
  • 11:52To talk a little bit more about the question
  • 11:54of between-study normalization: the purpose of this is
  • 11:57to put the expression data on basically an even scale,
  • 12:00which helps facilitate training.
  • 12:02If there's global shifts, and some of the expression values
  • 12:06in one sample versus another, it's very difficult to train
  • 12:09an accurate model in that particular scenario.
  • 12:11So normalization helps to align
  • 12:13the expression you get from different samples,
  • 12:16and hopefully across the between-study differences as well.
  • 12:19And so the goal here is to eventually predict this outcome
  • 12:23in a new patient, you plug in the genomic data
  • 12:25from a new patient in order to get the predicted outcome
  • 12:28for that patient based on that training model.
  • 12:30So in order to do that, you also have to normalize
  • 12:34the new data to the training data, right?
  • 12:36Because you also want to put the new data on the same scale
  • 12:38as a training data, and in the ideal scenario,
  • 12:41you would want to make sure that the training samples
  • 12:44that you use to train your original model are untouched,
  • 12:47because what some people try to do is they try
  • 12:49to sort of sidestep this normalization issue,
  • 12:52they would combine the new data with the old training data,
  • 12:55and renormalize everything at once.
  • 12:57And the problem with this is that this changes
  • 12:59your training sample values, and in a sense,
  • 13:01would necessitate the fact that you need to retrain
  • 13:04your old model again.
  • 13:04And this leads to instability, and lack of stability
  • 13:07over time in terms of the model itself.
  • 13:10So in the prior example from ovarian cancer,
  • 13:12this is not as big of an issue, because you have
  • 13:15all the data you want to work with in hand.
  • 13:18This is a retrospective study, you have 10 data sets,
  • 13:20so you just normalize everything at the same time,
  • 13:22using ComBat and frozen RMA.
  • 13:24And so you can split up those studies into separate training
  • 13:27and test studies, and they're all already on the same scale.
  • 13:31But the problem is that in practice, you're trying to do
  • 13:34a prospective type of analysis, where when you train
  • 13:37your model, you're normalizing all of the available studies
  • 13:40you have, let's say, and then you use that to predict
  • 13:44the outcome in a future patient, or a future study.
  • 13:47And so the problem with that is that you have to find
  • 13:51a good way to align, as I mentioned before,
  • 13:55the data from that future study to your training samples,
  • 13:57and that may not be an easy task to do,
  • 14:00especially for some of the newer platforms like RNA-seq.
  • 14:04So taking this problem a step further,
  • 14:06what if there's no good cross study normalization approach
  • 14:10that's available to begin with?
  • 14:12This really is going to make things difficult in terms
  • 14:15of the training in the model in the first place.
  • 14:18Another more complicated problem is that you might have
  • 14:21different types of platforms at training time.
  • 14:24For example, the only type of data
  • 14:26that's available from one study might be NanoString,
  • 14:29and in another study it's only RNA-seq, so what do you do?
  • 14:33And looking forward, as platforms change,
  • 14:35as technology evolves, you have different ways
  • 14:36of measuring gene expression, for example.
  • 14:42So what do you do with the models that are trained
  • 14:44on old data, because you can't apply them to the new data?
  • 14:48So oftentimes you find this situation
  • 14:50where you have to retrain new models on these new platforms,
  • 14:53and the old models are not able to be applied
  • 14:57directly to these new data types.
  • 14:58So that leads to waste here.
  • 15:01So if you take all of these problems together,
  • 15:03regarding cross-study normalization,
  • 15:07and changes in platform,
  • 15:09and a lot of the other issues, you know,
  • 15:11regarding replicability that I mentioned,
  • 15:13it's no wonder that there's only a small handful
  • 15:17of expression-based clinically applicable assays that have been
  • 15:21approved by the FDA, like Oncotype DX, MammaPrint,
  • 15:24and Prosigna, because this is a very, very tough problem.
  • 15:30So I want to move on with that, to an approach
  • 15:33that we proposed to help tackle this sort of issue
  • 15:36by using this idea of multi-study learning,
  • 15:39where instead of just using, and deriving, and generating
  • 15:43models from individual studies, we combine data
  • 15:45from multiple studies together, and create a consensus model
  • 15:48that we use for prediction, which will hopefully be
  • 15:50more stable, and more accurate down the road.
  • 15:54So this approach of combining data is called
  • 15:56horizontal data integration, where we're merging data
  • 15:59from let's say K different studies.
  • 16:01And the pro of this approach is that we get increased power,
  • 16:04and the ability to reach some sort of consensus
  • 16:06across these different studies.
  • 16:09The negative is that the effect of a gene
  • 16:12and its relationship to outcome may actually vary
  • 16:14across studies, and also by, you know, depending on,
  • 16:16and also the way that you normalize the genes may also vary
  • 16:19across studies too if we're using published data
  • 16:21from some prior publication.
  • 16:24There's also this issue of sample size imbalance.
  • 16:25You might have a study that has 500 subjects,
  • 16:28and another one that might have 200 subjects.
  • 16:30So there are some methods that were designed to account for
  • 16:34between-study heterogeneity after you do
  • 16:36horizontal data integration.
  • 16:38One is called the meta-lasso, another is called
  • 16:41the AW statistic, but these two methods don't really have
  • 16:44any prediction aspect about them.
  • 16:46They're more about feature selection.
  • 16:48Ensembling is one approach that can directly account
  • 16:50for between-study heterogeneity
  • 16:52after horizontal data integration, but there's
  • 16:54no explicit feature selection step here.
  • 16:57But all of these approaches assume
  • 16:59that the data has been pre-normalized.
  • 17:02As we talked about before,
  • 17:03for prospective decision-making, based off a trained model,
  • 17:07that might be prohibitive in some cases,
  • 17:10and we need a strategy also to easily predict
  • 17:13and apply these models in new patients.
  • 17:20Okay, so moving on, we're going to talk first
  • 17:24about this issue of how do we integrate data,
  • 17:27and sort of sidestep this normalization problem
  • 17:30at training time, and also at test time,
  • 17:33when we try to predict in new subjects?
  • 17:35So the approach that we put forth is to use
  • 17:39what's called top scoring pairs, which you can think of
  • 17:41as a rank-based transformation of the original set
  • 17:45of gene expression values from a patient.
  • 17:47So the idea here originally,
  • 17:50when top scoring pairs were introduced,
  • 17:51was you're trying to find a pair of genes
  • 17:53where it's such that if the expression of gene A
  • 17:56in the pair is greater than gene B, that would imply
  • 17:59that the, let's say, the subtype for that individual is,
  • 18:03say, subtype one, and if it's less,
  • 18:05then that implies subtype zero with high probability.
  • 18:09Now, in this case, this sort of approach was developed
  • 18:12for when one has a binary outcome variable
  • 18:14that you care about.
  • 18:15In this case, we're talking about subtype,
  • 18:17but it could also be tumor response or something else.
  • 18:20So essentially what you're doing is that you're taking
  • 18:22these continuous measurements in terms of gene expression,
  • 18:25or integer counts, and you are transforming
  • 18:31that into basically a binary predictor,
  • 18:32which takes on the value of zero or one.
  • 18:34And the hope is that that particular transformed value is
  • 18:38going to be associated with this binary outcome.
  • 18:41So the simple assumption in this scenario is
  • 18:44that the relative rank of these genes
  • 18:46in a given sample is predictive of subtype, and that's it.
  • 18:51And so the example here I have on the right is an example
  • 18:54of two genes, GSTP1 and ESR1.
  • 18:58And so you can see here that if you're
  • 19:00in the upper left quadrant, this is where one gene's expression is
  • 19:02greater than the other's, and it's implying
  • 19:05the triangle subtype with high probability,
  • 19:08and otherwise it implies the circle subtype.
  • 19:11So that's the general idea of what we're going for here.
  • 19:14It's a sort of a rank-based transformation
  • 19:16of the original continuous predictor space.
  • 19:21So the nice thing about this approach,
  • 19:22because we're relying only on the simple assumption, right?
  • 19:25That we only care about the relative rank
  • 19:27within a subject, this makes
  • 19:29this particular new transformed predictor
  • 19:32relatively invariant to batch effects, pre-normalization,
  • 19:36and it also most importantly, simplifies merging data
  • 19:39from different studies.
  • 19:41Everything is now on the same scale, zero to one,
  • 19:43so it's very easy to paste together the data
  • 19:45from different studies, and we can sidestep this problem
  • 19:50of trying to pick a cross-study normalization approach,
  • 19:53and then work in this sort of transformed space.
  • 19:57The other nice thing is that this is easily computable
  • 19:59for new patients as well.
  • 20:01If you have a new patient that comes into clinic,
  • 20:03you just check to see whether the gene A is
  • 20:04greater than gene B in terms of expression,
  • 20:06and then you have your value for this top scoring pair,
  • 20:11and we don't have to worry as much about normalizing
  • 20:14this patient's raw gene expression data
  • 20:18to the training sample expression values.
  • 20:21So essentially what we're doing here is that we
  • 20:23enumerate all possible gene pairs
  • 20:26out of a set of candidate genes, and each column here
  • 20:28in this matrix shown on the right pertains
  • 20:31to the zero-one values for a particular gene pair j.
  • 20:34And so this value takes the value of one if gene A is greater
  • 20:38than gene B in sample i for pair j, and zero otherwise.
  • 20:41And then we merge over the common top scoring pairs.
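
To make the pair enumeration concrete, here is a minimal R sketch of the rank-based TSP transformation just described; the gene names and expression values are made up for illustration, and this is not the actual code used in the work.

```r
# Minimal sketch of the TSP transformation described above (illustrative only;
# gene names and data are made up, not the speaker's actual code).
set.seed(1)
expr <- matrix(rnorm(5 * 4), nrow = 5,
               dimnames = list(paste0("sample", 1:5), paste0("gene", 1:4)))

# Enumerate all gene pairs among the candidate genes
pairs <- t(combn(colnames(expr), 2))

# Binary TSP matrix: entry (i, j) is 1 if, within sample i,
# gene A of pair j is expressed higher than gene B, and 0 otherwise
tsp <- sapply(seq_len(nrow(pairs)), function(j) {
  as.numeric(expr[, pairs[j, 1]] > expr[, pairs[j, 2]])
})
colnames(tsp) <- paste(pairs[, 1], pairs[, 2], sep = ">")
tsp  # samples x pairs, all on the same 0/1 scale, so studies can simply be stacked
```

Because each entry only depends on a within-sample rank comparison, the same transformation can be applied to every study separately and the resulting 0/1 matrices merged directly.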
  • 20:46So in this example we have data from four different studies,
  • 20:49each indicated by a different color here
  • 20:50in the first track, and this data pertains to data
  • 20:54from two different platforms,
  • 20:55and three different cancer types.
  • 20:56And so the clinical outcome here is binary subtype,
  • 20:59which is given by the orange and the blue color here.
  • 21:02So you can see here that we enumerated the TSPs,
  • 21:05we merged the data together, and now we have
  • 21:07this transformed predictor matrix.
  • 21:09And the interesting thing is
  • 21:10that you can definitely see some patterning here.
  • 21:13Within any study you have a particular set of TSPs
  • 21:15that take a value of one when the subtype is blue,
  • 21:19and it flips when it's orange.
  • 21:21And we see the same general pattern seem to replicate
  • 21:24across different studies,
  • 21:25but not every top scoring pair changes the same way
  • 21:29across different studies.
  • 21:32So if we cluster the rows here, we can also see
  • 21:35some patterns sort of persist where we see
  • 21:38some clustering by subtype,
  • 21:40but also some clustering by study as well.
  • 21:42And so what this implies is that there's a relationship
  • 21:45between TSPs and subtypes, and that can vary across studies,
  • 21:47which is not too different from what we've talked
  • 21:50about regarding the issues we've seen
  • 21:51in replicability in the past.
  • 21:53So ideally we would like to see a particular gene pair,
  • 21:57or TSP vector here take on a value of one,
  • 22:01only when there's the orange subtype,
  • 22:03and zero in the blue subtype, or vice versa.
  • 22:05And we wanted to see this pattern replicated
  • 22:07across patients in studies, but we see obviously
  • 22:10that that's not the case.
  • 22:12So now that we've sort of introduced,
  • 22:15or proposed, this sort of approach to simplify
  • 22:17data merging and normalization,
  • 22:19the question that we're now dealing
  • 22:20with is, well, how do we actually find
  • 22:22features that are consistent across different studies
  • 22:26in their relationship with outcome, and also estimate
  • 22:29their study-level effect, and then use them for prediction?
  • 22:33So that leads us to the second part of our paper,
  • 22:35where we developed a model to help select
  • 22:39these particular study-consistent features
  • 22:42while accounting for study-level heterogeneity.
  • 22:47So to sort of illustrate the idea behind this,
  • 22:49let's just start with a simple simulation
  • 22:52where we're not doing any normalization,
  • 22:54we're not worrying about rescaling, everything's fine
  • 22:56in terms of the expression values,
  • 22:59and we're not doing any selection,
  • 23:00no TSP transformation either.
  • 23:03So we're going to simulate data pertaining
  • 23:05to two, let's say, known biomarkers
  • 23:06that are associated with binary subtype.
  • 23:09We're going to generate K datasets,
  • 23:11and we're going to try three different strategies
  • 23:12for learning a prediction model from these data sets.
  • 23:15And at the end, we're going to validate each of those models
  • 23:18on an externally-generated data set
  • 23:19to compare their prediction performance.
  • 23:22So to do this, we're going to assume for each study
  • 23:25that we can fit a logistic regression model
  • 23:28to model the binary outcome with these two predictors,
  • 23:31and in generating these K data sets,
  • 23:32we're going to vary the number of studies K.
  • 23:35So we might generate two training data sets, or five, or 10,
  • 23:38and also change the total sample size of each one,
  • 23:40and make sure that the sample sizes are imbalanced
  • 23:42across the different studies, and then assume
  • 23:45values for the coefficients for each of these predictors
  • 23:50to be these values here, and lastly, to induce some sort
  • 23:53of heterogeneity across the different training datasets,
  • 23:56we're gonna add in a random value drawn
  • 23:59from a normal distribution, where we're assuming
  • 24:03some level of variance for this value.
  • 24:05So basically we're just injecting heterogeneity
  • 24:07into this data generation process.
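
As a rough R sketch of the data-generating process just described; the coefficient values, sample sizes, and heterogeneity variance below are placeholders, not the values used in the actual simulations.

```r
# Rough sketch of the simulation described above (all numbers are placeholders).
set.seed(2021)
K      <- 5                           # number of training studies
n_k    <- c(50, 80, 120, 200, 350)    # imbalanced sample sizes
beta   <- c(-0.5, 2, 2)               # intercept and the two "true" predictor effects
sigma2 <- 1                           # variance of the study-level perturbations

sim_study <- function(n, beta, sigma2) {
  x  <- matrix(rnorm(n * 2), ncol = 2)         # two biomarkers
  bk <- beta + rnorm(3, sd = sqrt(sigma2))     # study-specific random effects
  p  <- plogis(bk[1] + x %*% bk[2:3])          # logistic model for the binary subtype
  data.frame(y = rbinom(n, 1, p), x1 = x[, 1], x2 = x[, 2])
}

studies <- lapply(seq_len(K), function(k) sim_study(n_k[k], beta, sigma2))
```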
  • 24:09So after we generate the training studies,
  • 24:11then we're going to apply three different ways
  • 24:13or strategies to the training data.
  • 24:15The first is the individual study approach,
  • 24:17which we've talked about before, where you train
  • 24:20a generalized linear model separately for each study.
  • 24:22The second approach is where you merge the data.
  • 24:25Again, we're ignoring the normalization problem here
  • 24:26in simulation, obviously, and then train a single GLM
  • 24:30for the combined data, and then lastly,
  • 24:32we're going to merge the data, and train
  • 24:34a generalized linear mixed model,
  • 24:35where we explicitly account for a random intercept,
  • 24:38and a random slope for each predictor,
  • 24:41assuming, you know, a study-level random effect.
  • 24:45So after we do that, we'll generate a validation dataset
  • 24:48from the same approach above, and then predict outcome
  • 24:52in this validation dataset with respect
  • 24:55to the models derived from each of these three strategies.
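
And here is a sketch of the three strategies applied to those simulated studies; lme4's glmer is used as a stand-in for fitting the GLMM, an assumption made for illustration rather than the exact software used in the paper.

```r
# Sketch of the three training strategies, continuing the previous sketch.
library(lme4)

merged <- do.call(rbind, Map(cbind, studies, study = seq_along(studies)))
merged$study <- factor(merged$study)

# 1. Individual study approach: one logistic regression per study
fits_indiv <- lapply(studies, function(d) glm(y ~ x1 + x2, data = d, family = binomial))

# 2. Merged approach: a single logistic regression on the pooled data
fit_merged <- glm(y ~ x1 + x2, data = merged, family = binomial)

# 3. Merged GLMM: random intercept and random slopes by study
fit_glmm <- glmer(y ~ x1 + x2 + (1 + x1 + x2 | study),
                  data = merged, family = binomial)

# Each model can then predict the outcome in an independently simulated
# validation study, e.g. predict(fit_merged, newdata = valid, type = "response")
```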
  • 24:59So if we look at the individual strategy performance,
  • 25:01where we fit a GLM, a logistic regression model,
  • 25:04separately for each study, and then apply it
  • 25:06to this validation data set, we can check
  • 25:08the prediction accuracy, we can find that,
  • 25:11due to the induced level of heterogeneity
  • 25:14between studies in predictor effects,
  • 25:16in one study, we do really poorly,
  • 25:18and another study we do really well,
  • 25:20and this variation is entirely due to variations
  • 25:24in the gene subtype relationship.
  • 25:27And these predictions obviously vary as a result
  • 25:29across the different studies.
  • 25:30And this will reflect a little bit of what we see
  • 25:32in some of the examples that we showed earlier,
  • 25:35with models that were trained on different data sets.
  • 25:40And then the second approach is where we combine
  • 25:43the data sets, and train a single logistic regression model
  • 25:46to predict outcome.
  • 25:46And so we see that the median prediction error is better
  • 25:49than most of the individual models here, but if we fit the GLMM,
  • 25:52the median prediction (indistinct) gets better
  • 25:54than some of the other approaches here.
  • 25:56So this is basically just one example.
  • 25:58So we did this over and over a hundred times
  • 26:00for every single possible simulation condition,
  • 26:03varying K, and the heterogeneity across different studies.
  • 26:07And some of the things that we found was that
  • 26:10the individual study approach had, as you can see,
  • 26:12the worst prediction error overall,
  • 26:14combining the data improved this a little bit,
  • 26:17but the estimates for the coefficients
  • 26:21from the combined GLM were still biased.
  • 26:23They're supposed to be two in this scenario.
  • 26:27And accounting for heterogeneity with the GLMM had
  • 26:31the best performance out of the rest,
  • 26:32and also had the lowest bias in terms
  • 26:35of the regression coefficients as well.
  • 26:39So this is great, but we also have a lot
  • 26:42of potential top scoring pairs.
  • 26:44We can't really estimate them all
  • 26:47with a GLMM, so we need to find a way
  • 26:50where we can, at least in a reasonable dimension,
  • 26:52figure out which fixed effects are non-zero,
  • 26:55while accounting for, you know,
  • 26:56this sort of study-level heterogeneity for each effect.
  • 27:00So this led us to develop a pGLMM, which is basically
  • 27:05a high-dimensional penalized generalized linear mixed model,
  • 27:08where we are able to select fixed and random effects
  • 27:11simultaneously using a penalization framework.
  • 27:13So essentially here, for all the predictors
  • 27:17in the model, we assume a random effect,
  • 27:20a random slope for each one, and so we are aiming to select
  • 27:23the features that have non-zero fixed effects
  • 27:28in this particular approach, and in turn we're assuming
  • 27:30these are going to be study-consistent.
  • 27:32And to do this, we're going to reorganize
  • 27:35the linear predictor from the standard GLMM,
  • 27:38so basically we're starting with the same general likelihood
  • 27:41for, you know, the generalized mixed model.
  • 27:44Here, Y is our outcome, X is our predictor,
  • 27:49alpha, or alpha k, is the random effect
  • 27:53for the k-th study, and here it's typically assumed to be
  • 27:58multivariate normal, with mean zero, and
  • 28:02some sort of unstructured covariance matrix.
  • 28:05And so to sort of simplify this, we factor out
  • 28:09the random effects covariance matrix,
  • 28:10and incorporate it into the linear predictor.
  • 28:12And with some more reorganizing, now we're able to select
  • 28:16the fixed effects and determine which random effects have
  • 28:21truly non-zero variance, using this sort
  • 28:24of joint penalization framework.
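
A sketch of that reparameterization, with notation assumed here rather than copied from the paper:

```latex
% Sketch of the reorganization described above (notation is an assumption).
% For subject i in study k, with link function g:
\[
  g\!\left(E[y_{ki} \mid \alpha_k]\right) = x_{ki}^\top \beta + z_{ki}^\top \alpha_k,
  \qquad \alpha_k \sim N_q(0, \Sigma).
\]
% Factor the covariance as \Sigma = \Gamma \Gamma^\top and write \alpha_k = \Gamma u_k
% with u_k \sim N_q(0, I_q), so the covariance parameters move into the linear predictor:
\[
  g\!\left(E[y_{ki} \mid u_k]\right)
    = x_{ki}^\top \beta + \left(u_k^\top \otimes z_{ki}^\top\right) \mathrm{vec}(\Gamma).
\]
% A group penalty on \beta together with the rows of \Gamma can then zero out a
% predictor's fixed effect and its random-effect variance jointly.
```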
  • 28:26If you want more detail, you can check out the publication
  • 28:28that I linked above, and I also forgot to send out
  • 28:31the link to this talk here.
  • 28:33I'll do that right now, in case you want to check out
  • 28:35some of the publications that I'm linking in this talk.
  • 28:41Okay, so how do we do this estimation?
  • 28:42And we use a penalized MCEM algorithm,
  • 28:44where in each E-step we're drawing from the posterior
  • 28:47of the random effects, given
  • 28:48the current estimates of the parameters,
  • 28:50and the observed data, using Metropolis-within-Gibbs.
  • 28:55In the R package, which I'm going to talk about in a little bit,
  • 28:58we updated this to use Hamiltonian Monte Carlo,
  • 29:03but in the original version,
  • 29:04we used Metropolis-within-Gibbs, where we skipped
  • 29:07components that had zero variance from the M-step.
  • 29:09And then we use, in the M-step,
  • 29:12two conditional maximization steps,
  • 29:14where we first update beta, given the draws
  • 29:17from the E-step and the prior estimates for gamma,
  • 29:20and then update gamma using a group penalty.
  • 29:24So we use a couple of other tricks
  • 29:25to speed up performance here.
  • 29:27I won't go too much into the details there,
  • 29:29but you can check out the paper for more detail on that.
  • 29:33But with this approach, one of the things
  • 29:35that we were able to show was that we have
  • 29:37similar conclusions regarding bias and prediction error,
  • 29:39as in the simple setup we had before,
  • 29:41where in this particular situation, we're simulating
  • 29:43a bunch of predictors that do not have any association
  • 29:47with outcome, adding either 10 to 50 extra predictors,
  • 29:51when there's only two that are actually truly relevant.
  • 29:54And so the prediction error in this model
  • 29:56after this penalized selection process is
  • 29:59generally the same, if not a little bit worse.
  • 30:01And one thing that we find here is that
  • 30:03the individual study approach,
  • 30:06where we're now applying
  • 30:08a penalized logistic regression model, has
  • 30:10a low sensitivity to detect the true predictors,
  • 30:13and a higher false positive rate in terms of selecting
  • 30:16predictors that aren't associated
  • 30:17with outcome in simulation.
  • 30:19And what we find here also is that the approach
  • 30:23that we developed had a much better sensitivity
  • 30:26compared to other approaches for selecting
  • 30:28the true predictors when accounting
  • 30:30for study-level heterogeneity,
  • 30:32and a lower false positive rate as well.
  • 30:36For the example data sets that I talked about before,
  • 30:39the four ones that I showed in a figure earlier,
  • 30:43we did a holdout study analysis where we trained
  • 30:45on three studies and held out one of the studies.
  • 30:48We found that, you know, the approach that we put forward,
  • 30:51combining the data using our TSP approach,
  • 30:54and then training a model using the pGLMM, had
  • 30:58the lowest overall holdout study error
  • 31:00compared to the approach using just
  • 31:02a regular generalized linear model,
  • 31:06and then also the individual study approach as well.
  • 31:09And we also compared it to another approach called
  • 31:12the Meta-Lasso, which we were able to adapt
  • 31:14to do prediction, and we didn't see that much improvement
  • 31:16in performance from it either.
  • 31:17But in general, the result that we saw here was
  • 31:21that the individual study approach had
  • 31:23bad prediction error also across the different studies.
  • 31:27So again, this echoes what we've already seen
  • 31:29in the literature in terms of inconsistency
  • 31:31in the number of genes that are being selected
  • 31:33in each of these models, and also the variations
  • 31:35in prediction accuracy; this sort of reflects
  • 31:38what we've been seeing in some of this prior work.
  • 31:44So in order to implement this approach
  • 31:46in a more systematic way, my student Hillary and I
  • 31:49put together an R package called
  • 31:51glmmPen.
  • 31:54So this was just recently submitted
  • 31:56to the Journal of Statistical Software, but if you want to check out
  • 31:59the code, it's available on Github right here,
  • 32:02and we're in the process of submitting this to CRAN as well.
  • 32:05This was sort of like a nice starter project that I gave
  • 32:08to Hillary to, you know, get her feet wet with coding,
  • 32:12and she's done a really great job, you know,
  • 32:15in terms of putting this together.
  • 32:16And some of the distinct differences between this
  • 32:19and what we put forth in the paper is the use
  • 32:21of Hamiltonian Monte Carlo in the E-step,
  • 32:24instead of Metropolis-within-Gibbs.
  • 32:26It's much faster, much more efficient.
  • 32:27We also have added helper functions
  • 32:29for the (indistinct) tuning parameters, and also making
  • 32:33some diagnostic plots as well, after convergence.
  • 32:37And we've also implemented some speed
  • 32:39and memory improvements as well, to help with usability.
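
As a rough idea of what fitting this kind of model looks like, here is a hypothetical usage sketch assuming an lme4-style formula interface; the exact function and argument names are my assumption and should be checked against the glmmPen documentation.

```r
# Hypothetical usage sketch for the glmmPen package discussed above -- the
# function and argument names below are assumptions, not confirmed from the
# package documentation; treat this as illustrative pseudocode in R syntax.
library(glmmPen)

# dat: one row per subject, with binary outcome y, TSP predictors tsp1..tspJ,
# and a study column identifying the K training studies.
fit <- glmmPen(y ~ tsp1 + tsp2 + tsp3 + (tsp1 + tsp2 + tsp3 | study),
               data   = dat,
               family = "binomial")  # penalized logistic mixed model
```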
  • 32:44Okay, so we talked about some issues
  • 32:47regarding data integration, and then issues
  • 32:50with normalization, how that impedes, or can impede
  • 32:52validation in future patients, and then we introduced
  • 32:56a way to sidestep the normalization problem,
  • 32:59using this sort of rank-based transformation,
  • 33:01and an approach to select consistent predictors
  • 33:03in the presence of between-study heterogeneity.
  • 33:07So next, I'm going to talk about a case study
  • 33:09in pancreatic cancer, where we took a lot of these tools,
  • 33:13and applied them to a problem that some collaborators
  • 33:16of mine were having, you know, at the cancer center at UNC.
  • 33:20And to give a brief overview of pancreatic cancer,
  • 33:23it has a really poor prognosis.
  • 33:26Five-year survival is very low, you know, typically 5%.
  • 33:30The median survival tends to be less than 11 months,
  • 33:32and the main reason why this is the case is that
  • 33:35early detection is very difficult,
  • 33:37and so when patients show up to the clinic,
  • 33:40they're oftentimes in later stages, or have gone metastatic.
  • 33:44So for those reasons, it's really important to place
  • 33:48patients on optimal therapies upfront, and choosing
  • 33:51the best therapies, specifically for a patient, you know,
  • 33:54after they're diagnosed.
  • 33:56So breast and colorectal cancers have
  • 33:59long-established subtyping systems that are oftentimes used.
  • 34:02Again, an example of a few of them in breast
  • 34:04that have actually been approved by the FDA
  • 34:06for clinical use, but there's nothing available for,
  • 34:09in terms of precision medicine for pancreatic cancer,
  • 34:11except for a couple of targeted therapies
  • 34:14for specific mutations.
  • 34:17So in 2015, the Yeh Lab at UNC,
  • 34:20using a combination of non-negative matrix factorization
  • 34:24and consensus clustering, was able to discover
  • 34:27two potentially clinically applicable subtypes
  • 34:30in pancreatic cancer, which they call basal-like,
  • 34:33the orange line here, which has a much worse survival
  • 34:37compared to this classical subtype in blue,
  • 34:41where patients seem to do a little bit better.
  • 34:44And so with this approach, they used
  • 34:45this unsupervised learning, set of learning techniques
  • 34:48to derive these novel subtypes.
  • 34:51And so when they took these subtypes and overlaid them
  • 34:54from data from a clinical trial where they had
  • 34:56treatment response information, they found that
  • 34:58largely patients who with basal-like subtype tended to have
  • 35:02tumors that did not respond
  • 35:04to common first-line therapy, Folfirinox.
  • 35:06Their tumors tended to grow from baseline.
  • 35:08Whereas patients that were the classical subtype tended
  • 35:12to respond better on average compared to the basal samples.
  • 35:16So the implications here are that if your
  • 35:20subtype is basal, you should avoid Folfirinox
  • 35:23at baseline in favor of an alternative drug,
  • 35:25typically gemcitabine and nab-paclitaxel (Abraxane).
  • 35:27And then for classical patients,
  • 35:29they should receive Folfirinox.
  • 35:32But the problem here is that subtyping clearly is
  • 35:34an unsupervised learning approach, right?
  • 35:36It's not a prediction tool.
  • 35:37So this approach is quite limited
  • 35:42when you have to assign a subtype
  • 35:45to a small number of patients; it just doesn't work.
  • 35:48So what some people have done in the past,
  • 35:50is they simply take new patients, and recluster them
  • 35:52with their existing training samples.
  • 35:55The problem with that is that the subtype assignments
  • 35:58for those original training samples might change
  • 36:00when they recluster it.
  • 36:01So it's not really
  • 36:03a stable approach to do this.
  • 36:05So the goal here was to leverage the existing training data
  • 36:08that's available to the lab, which comes
  • 36:12from different platforms, to come up with an approach,
  • 36:15a classifier to predict subtype, given
  • 36:20a new patient's genomic data, to get
  • 36:23a predicted subtype for that individual.
  • 36:25So of course, in that scenario, we also want to make sure
  • 36:28that that process is simplified, and that we make
  • 36:31this prediction process as easy as possible,
  • 36:33in the face of all these issues we talked about regarding
  • 36:36normalization of the training data to each other,
  • 36:40and also normalization of the new patient data
  • 36:42to the existing training data.
  • 36:45So using some of the techniques that we just talked about,
  • 36:49we came up with a classifier that we call PurIST,
  • 36:51which was published in the CCR last year,
  • 36:53where essentially we were able to do that.
  • 36:56We take in the genomic data for a new patient,
  • 36:59and are able to predict subtype based off of
  • 37:04the trained model that we developed.
  • 37:06And in this particular paper, we had nine data sets
  • 37:09that we curated from the literature, three of which
  • 37:11that we used for training,
  • 37:13the rest we used for validation.
  • 37:14And we did consensus clustering on all of them,
  • 37:16using the gene list that was derived
  • 37:18from the original publication,
  • 37:21where the subtypes were discovered, to get
  • 37:23subtype labels for each one of the subjects
  • 37:25in each one of these studies.
  • 37:27So once we had those labels from consensus clustering,
  • 37:30we then merged the data from our three largest studies,
  • 37:33which are our training studies.
  • 37:35We did some sample filtering based on quality,
  • 37:37and we filtered some genes based off of, you know,
  • 37:40expression levels and things like that.
  • 37:42And then we applied our previous training approach
  • 37:45to get a small subset of top scoring pairs from the data.
  • 37:50And in this case, we have eight that we selected,
  • 37:51each with their own study-level coefficient.
  • 37:55And then for prediction, the process is very simple,
  • 37:58we just check, in that patient, whether gene A is greater
  • 38:00than gene B for each of these pairs,
  • 38:02and that gives us a binary vector of ones and zeros.
  • 38:05We multiply that by the coefficients from the trained model.
  • 38:09This is basically just calculating a linear predictor
  • 38:11from this logistic regression model.
  • 38:14And then we can convert that
  • 38:15to a predicted probability of being basal.
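
To make that prediction step concrete, here is a minimal R sketch; the gene names, coefficients, and intercept below are made up, not the actual PurIST model, whose real code is at the GitHub link mentioned later in the talk.

```r
# Minimal sketch of the prediction step just described (illustrative only --
# the gene pairs, coefficients, and intercept are hypothetical, not PurIST's).
predict_basal_prob <- function(expr, pairs, coefs, intercept) {
  # expr: named vector of a single patient's expression values
  # pairs: data.frame with columns geneA and geneB (one row per selected TSP)
  tsp <- as.numeric(expr[pairs$geneA] > expr[pairs$geneB])  # 0/1 indicators
  eta <- intercept + sum(coefs * tsp)                       # linear predictor
  plogis(eta)                                               # probability of basal-like
}

# Toy example with two hypothetical pairs
pairs <- data.frame(geneA = c("G1", "G3"), geneB = c("G2", "G4"))
coefs <- c(1.2, 0.8)
expr  <- c(G1 = 5.1, G2 = 3.0, G3 = 2.2, G4 = 4.7)
predict_basal_prob(expr, pairs, coefs, intercept = -1)
```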
  • 38:18So using this approach, we were able to select
  • 38:2316 genes pertaining to eight gene pairs,
  • 38:25and we can see here that the predictions
  • 38:27from this model tend to coincide very strongly
  • 38:31with the labels that were derived
  • 38:33using consensus clustering.
  • 38:34So that gives us some confidence that we're reproducing,
  • 38:36in some way, you know, the result that we got
  • 38:41using this clustering approach.
  • 38:43You can also clearly see here that as the subtype changes,
  • 38:46that you see flips in the expression in each one
  • 38:49of the pairs of genes that we collected
  • 38:52in this particular study.
  • 38:54And then when we applied this model
  • 38:55to six external validation datasets, we found that it had
  • 38:59a very good performance in terms of recapitulating subtype,
  • 39:01where we had a relatively good sensitivity
  • 39:04and specificity in each case, which we owe in part
  • 39:07to the fact that we don't have to worry as much
  • 39:08about this sort of cross-study normalization at training time
  • 39:13or test time, and also the fact that we leveraged
  • 39:17multiple data sets when selecting
  • 39:21the predictors for this model.
  • 39:22And so when we looked at the predictive values
  • 39:24in these holdout studies, the predictive subtypes,
  • 39:27we recapitulated the differences in survival
  • 39:30that we observed in other studies as well,
  • 39:32where basal-like patients do a lot worse
  • 39:34compared to classical patients.
  • 39:37If you want to look a little bit more at the details
  • 39:39in this paper, you can check out this link here,
  • 39:41and if you want to access the code that we used
  • 39:44to make these predictions, that's available
  • 39:45on this Github page at this link right here.
  • 39:50Another thing that we were able to show is that for patients
  • 39:53that had samples that are collected through different modes
  • 39:56of collection, whether it was bulk, FNA, FFPE,
  • 40:00we found that the predictions in these patients tend to be
  • 40:03highly consistent, and this is basically deriving
  • 40:06itself, again, from the simple assumption behind TSPs,
  • 40:09where the relative rank within the subject of the expression
  • 40:13of these genes is what's predictive.
  • 40:15So as long as that is being preserved,
  • 40:17then you should be able to have the model predict well
  • 40:21in different scenarios.
  • 40:23So when we also went through CLIA validation for this tool,
  • 40:28we also confirmed 95% agreement between replicated runs
  • 40:31in other platforms, and we also confirmed concordance
  • 40:38between NanoString and RNA-seq, also across different modes
  • 40:43of sample collection.
  • 40:44So right now this is the first clinically applicable test
  • 40:47for prospective first-line treatment selection in PDAC.
  • 40:51And right now we do have a study that just recently opened
  • 40:54at the Medical College of Wisconsin that's using PurIST
  • 40:56for prospective treatment selection,
  • 40:58and we have another one opening at University of Rochester,
  • 41:02and also at UNC soon as well.
  • 41:06So this is just an example about how you can take
  • 41:10a problem, you know, from the literature,
  • 41:14from your collaborators, come up with a method,
  • 41:18and some theory behind it, and really be able to come up
  • 41:22with a good solution that is robust,
  • 41:24and that can really help your collaborators
  • 41:27at your institution and elsewhere.
  • 41:32Okay, so that was the case study.
  • 41:34To talk about some current work
  • 41:35that we're doing just briefly.
  • 41:36So we wanted to think about how we can also scale up
  • 41:39this particular framework that we developed for the pGLMM,
  • 41:42and one idea that we're pursuing right now
  • 41:44with my student Hillary is that we're thinking
  • 41:48about borrowing ideas from factor analysis
  • 41:50to do a deterministic decomposition
  • 41:53of the random effects to a lower dimensional space,
  • 41:56where essentially, we can map
  • 42:00between the lower dimensional space of latent factors,
  • 42:03which is r-dimensional, and this higher dimensional space,
  • 42:05using some matrix B, which is q by r,
  • 42:12and essentially in doing so, this reduces the dimension
  • 42:16of the integral in the Monte Carlo EM algorithm.
  • 42:20So rather than having to approximate the integral
  • 42:22in q dimensions, which can be difficult,
  • 42:24you can work in a much lower dimension for the integral,
  • 42:27and then have this additional problem
  • 42:29of trying to estimate this matrix
  • 42:31to map back to the original dimension q.
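
A sketch of that decomposition, with notation assumed here rather than taken from the work in progress:

```latex
% Sketch of the factor decomposition of the random effects described above
% (notation is an assumption). For study k:
\[
  \alpha_k = B\, z_k, \qquad z_k \sim N_r(0, I_r), \qquad
  B \in \mathbb{R}^{q \times r}, \quad r \ll q,
\]
% so the Monte Carlo E-step integrates over the r-dimensional z_k instead of
% the q-dimensional \alpha_k, at the added cost of estimating B.
```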
  • 42:33So that's something that we're just starting to work on
  • 42:35right now, and another thing that we're starting to work on
  • 42:39is the idea of trying to extend some of the work
  • 42:41in variational autoencoders
  • 42:43that my student David is working on now.
  • 42:45His current work is trying to account for missing data
  • 42:48when trying to train these sorts of deep learning models;
  • 42:51VAEs are unsupervised learning models oftentimes used
  • 42:55for dimension reduction.
  • 42:56You might've heard of them
  • 42:57in single-cell sequencing applications.
  • 43:01But the question that we wanted to address is, well,
  • 43:03what if you have missing data, you know,
  • 43:05in your input features X, which might be (indistinct)?
  • 43:10So essentially we were able to develop an approach for this.
  • 43:14We have a pre-print up right now, along with the code,
  • 43:17and we're looking to extend this, where essentially,
  • 43:20rather than worrying about this latent space Z,
  • 43:23which we're assuming that that encodes a lot
  • 43:25of the information in the original data,
  • 43:27we replaced that with learning the posterior
  • 43:29of the random effect, given the observed data.
  • 43:32And then in the second portion here, we replaced
  • 43:34this generative model with the model of y given X
  • 43:39and the random effects.
  • 43:41So that's another avenue that can allow us
  • 43:43to hopefully account for non-linearity,
  • 43:45and arbitrary interactions between features as well.
  • 43:47And also it might be an easier way to scale up
  • 43:49some of the analysis we've done too,
  • 43:53which I've already mentioned.
  • 43:55Okay, so in terms of some concluding thoughts,
  • 43:58I talked a lot about how the original subtypes were derived
  • 44:03for this pancreatic cancer case study using NMF
  • 44:06and consensus clustering to get two subtypes.
  • 44:09But there were also other groups that are published,
  • 44:12subtyping systems, that in one, they found
  • 44:16three subtypes, and in another one they found four subtypes.
  • 44:19So the question is, well, you know, well,
  • 44:22which one do we use?
  • 44:23Again, this is also confusing for practitioners
  • 44:26about which approach might be more meaningful
  • 44:29in the clinical setting.
  • 44:30And each of these approaches were also derived
  • 44:32using NMF and consensus clustering, and they were done
  • 44:35separately on different patient cohorts
  • 44:38at different institutions.
  • 44:39So you can see that this is another reflection
  • 44:41of heterogeneity in single-study learning,
  • 44:45and how we can get these different or discrepant results
  • 44:49from applying the same technique to different genomics datasets
  • 44:52that were generated at different places.
  • 44:54So of course this creates another problem, you know,
  • 44:57who's right, which approach do we use?
  • 45:00And it's kind of like a circular argument here.
  • 45:03So in the paper that I mentioned before with PurIST,
  • 45:07another thing that we did is we overlaid
  • 45:09the others subtype system calls
  • 45:12with the observed clinical outcomes
  • 45:15for the studies that we collected.
  • 45:17And one of the things that we found was that,
  • 45:22in these other subtyping systems,
  • 45:24each of them also had something
  • 45:27that was very similar to the basal-like subtype,
  • 45:30and the remaining subtypes had survival
  • 45:30that was similar to the classical subtype.
  • 45:33So one of the arguments that we made was that,
  • 45:35well, if the clinical outcomes are the same
  • 45:37for the other subtypes, you know,
  • 45:40are they really necessary
  • 45:42for clinical decision-making?
  • 45:43That was one argument that we put forth.
  • 45:46And when we looked at the response data, again,
  • 45:48we saw that one of the subtypes in the other approaches
  • 45:51also overlapped the basal-like subtype in terms of response.
  • 45:56And then for the remaining subtypes,
  • 45:57they were just kind of randomly dispersed at the other end,
  • 46:01you know, of the spectrum here in terms of percent
  • 46:05tumor change after treatment.
  • 46:07So the takeaway here is that heterogeneity
  • 46:09between studies also impacts tasks in unsupervised learning,
  • 46:14like the NMF+ consensus clustering approach
  • 46:16to discover subtypes.
  • 46:18And what this also does is, as you can imagine,
  • 46:21this injects a lot of confusion into the literature,
  • 46:24and can also slow down the process of translating
  • 46:27some of these approaches to the clinic.
  • 46:30So this also underscores the need
  • 46:32for replicable cross-study subtype discovery approaches,
  • 46:35for replicable approaches to unsupervised learning.
  • 46:41That's something that, you know, something that we might,
  • 46:43we hope to be working on in the future,
  • 46:46and we hope to see more work on as well.
  • 46:49So to summarize, one of the major points
  • 46:53of this talk was to introduce and discuss, you know,
  • 46:55replicability issues in genomic prediction models,
  • 46:58and supervised learning, that stem from technical
  • 47:01and also non-technical sources.
  • 47:03We also introduced a new approach to facilitate
  • 47:07data integration and multi-study learning
  • 47:09in a way that captures between-study heterogeneity,
  • 47:12and showed how this can be used for the prediction
  • 47:15of subtype for pancreatic cancer, and also introduced
  • 47:20some scalable methods and future direction
  • 47:23in replicable subtype discovery.
  • 47:26So that's it for me.
  • 47:28I just want to thank some of my faculty
  • 47:30collaborators, Quefeng Li, Junier Oliva
  • 47:33from UNC computer science, Jen Jen Yeh
  • 47:37from surgical oncology at Lineberger,
  • 47:40Joe Ibrahim as well, UNC biostatistics,
  • 47:43and also my students, Hillary, who's done a lot of work
  • 47:45in this area, and also David Lim, who's doing
  • 47:48some of the deep learning work in our group.
  • 47:50And that's it, thank you.
  • 47:58<v Robert>So does anybody here have</v>
  • 47:59any questions for the professor?
  • 48:09Or anybody on the, on Zoom, any questions you want to ask?
  • 48:26<v ->It looks like I'm off the hook.</v>
  • 48:29<v Robert>All right, well, thank you so much.</v>
  • 48:30Really appreciated your talk.
  • 48:33Have a good afternoon.
  • 48:36<v ->All right, thank you for having me.</v>