BIS Seminar - 6.23.2020 - Model-averaged estimation of molecular evolution and natural selection in SARS-COV-1 and SARS-CoV-2 coronaviruses during zoonosis

Uploaded: 2020-06-23T19:01:38.897Z
Duration: 1 h 9 min 13 s
Description: Jeffrey Townsend, PhD Elihu Professor of Biostatistics and Professor of Ecology and Evolutionary Biology

June 23, 2020

Information

ID: 5349

00:19- All right, I see more people joining
00:32Jeff, how long do you have, like an hour?
00:36Less than that?
00:36- I think I can probably finish in less than an hour.
00:40- Less than an hour, all right.
00:58I think we should get started.
01:02So hi, everyone.
01:03Welcome to our seminar series on COVID-19,
01:07organized by the Department of Biostatistics.
01:10I'm very pleased to have here today Jeff Townsend,
01:15Professor of Biostatistics, Ecology and Evolutionary Biology
01:20from the Yale School of Public Health.
01:23Thank you, Jeff, for being here today with us.
01:27As usual, you're welcome to write questions
01:30in the chat box or even unmute yourself, if you can,
01:35and other people are not talking.
01:38And, Jeff, why don't you take it from here?
01:42- Okay, thank you very much for the introduction, Laura.
01:45I'm really pleased to have an opportunity to talk
01:46about the work that we've been doing.
01:49I think like many speakers in this series, you know,
01:52we've been working very hard
01:54over a short period to try to make some progress on COVID-19.
01:58Ironically, this is the first work
01:59I think that I started in response to the COVID-19 epidemic
02:03and it's turned out to be a lot of work.
02:07So it's actually gotten the least far.
02:11So we've done a little bit of work, for instance,
02:13on epidemic modeling of COVID-19.
02:15That's already, it's actually been submitted,
02:18I actually have some other work on quarantine
02:20and stuff that turns out to be really interesting
02:24and far along in the research.
02:26And then this work, which I started early on,
02:28which is more evolutionary, and looking at the zoonotic
02:31process has gone a little bit slower.
02:32So what that means is that, consistent with
02:35many other speakers in this series,
02:35I'm gonna be talking a lot about
02:38the methods that we're going to be using,
02:40which are well developed, and what we're planning to do.
02:43I don't have a lot of results.
02:44But I think that's consistent with these talks in general.
02:47So hopefully, that will be of interest to you
02:49and also be illuminating in terms
02:53of possible research approaches towards this kind of work.
02:58So as Laura mentioned,
03:00I use a lot of evolutionary approaches
03:02to do my analyses of things.
03:04And the title of this talk is model-averaged estimation
03:08of molecular evolution and natural selection
03:12in SARS-CoV-1 and SARS-CoV-2
03:14coronaviruses during the zoonotic period.
03:18So what was attracting my interest in this particular case
03:21is that it's usually very difficult and challenging
03:25(and I'll get to this later in the talk) to figure
03:27out what's going on during the zoonotic period,
03:29because you don't usually get much sampling there.
03:32So, what I wanted to do was apply some techniques
03:35that I've developed to this problem.
03:38And I will get to those techniques
03:41and the application to this problem.
03:43But I first just wanna give a little bit of introduction,
03:46I think, maybe from a statistics point of view
03:47towards some of the methodologies that we're using,
03:49just so everyone can sort of get on board with
03:51at least how I see this as contributing
03:55to interesting statistical questions.
03:57So, in a broad sense, if I can get this to move forward.
04:00Here we go.
04:01I think one of the most intriguing
04:03and interesting and challenging areas of mathematics
04:05and statistics is understanding this border
04:08between the discrete and the continuous.
04:09So just one particular
04:13example you can pick out is, if you look at discrete
04:16and continuous distributions that are frequently
04:19in use in statistical probabilistic analyses,
04:21we have the geometric and negative binomial distributions.
04:25And we have the exponential and gamma distributions.
04:30These are essentially waiting for discrete events
04:32when you have a probability over time,
04:33or waiting for the rth event if you
04:35have a probability over time,
04:37and they correspond to the distributions in continuous
04:39time for the wait for the first event
04:42or the wait for the alpha-th event.
04:45So there's a real clear correspondence
04:46between these two distributions.
04:48And you can actually see in the mathematics,
04:50how they're similar as well.
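
A minimal Python sketch of the correspondence described here, covering only the first-event case (geometric versus exponential). It assumes time is chopped into small steps, each with event probability equal to the rate divided by the number of steps per unit time; the function names and the numbers are illustrative and do not come from the talk. As the steps get finer, the discrete waiting-time distribution approaches the continuous one.

```python
import math


def geometric_survival(p: float, k: int) -> float:
    """P(no event in the first k discrete trials), per-trial probability p."""
    return (1.0 - p) ** k


def exponential_survival(rate: float, t: float) -> float:
    """P(no event by continuous time t), constant event rate."""
    return math.exp(-rate * t)


rate, t = 2.0, 1.5  # illustrative values, not from the talk
for steps_per_unit in (10, 100, 10_000):
    p = rate / steps_per_unit       # chance of an event in one small time step
    k = round(t * steps_per_unit)   # number of steps that fit into time t
    gap = abs(geometric_survival(p, k) - exponential_survival(rate, t))
    print(f"{steps_per_unit:>6} steps per unit time: gap = {gap:.6f}")
```

The negative binomial and gamma distributions are related in the same way, with the wait running to the rth (or alpha-th) event instead of the first.
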
04:53And that correspondence is kind of interesting.
04:54And the reason why I say it's interesting is
04:56because often many of the biggest problems I think
04:59we wrestle with in statistics are when we're trying
05:01to deal with data that is some intermediate
05:04level between continuous and discrete,
05:07and where we're trying to figure out which
05:08approach is the best to use. Should we use some
05:11sort of parameterized distribution to address it?
05:13Or should we use some sort of nonparametric
05:17approach based on the discrete?
05:18I'm not sure in any particular case.
05:19But I just wanna mention
05:21that I think that's a very interesting area.
05:22And the technique I'm gonna tell you about
05:23is definitely wrestling with exactly this kind of question.
05:27So what kind of question do I mean?
05:29Well, I mean, questions that deal with state spaces,
05:32over time, or over any discrete or continuous axis.
05:36And you can see in this diagram just give you a picture
05:40of the kinds of problems that one deals with
05:43between discrete and continuous measures.
05:45You can have here it's depicted as time,
05:48you could have a discrete state space,
05:51state space you're measuring over time,
05:53you could have a continuous sorry,
05:56you're gonna have discrete measurements
05:59over where You've got discrete time
06:01in a discrete state space,
06:03you could also have discrete time
06:06and a continuous state space.
06:08You can have continuous, continuous
06:12or you can have discrete, continuous.
06:13And these two on the bottom, the two on the left,
06:15sorry, are the relevant ones for
06:17what I wanna talk to you about.
06:19In my research, which is largely focused
06:22on informatic data that we can obtain from sequencing
06:26or other approaches like that.
06:28A lot of what we're trying to do is look at these discrete
06:30linear sequences that have sites, DNA sites or amino acid
06:34sites, and try to understand whether there is some
06:37pattern in those sites that allows us to understand
06:40something about the biology of the organism
06:41or the biology that we want to know something more about?
06:45So what essentially I'm gonna be doing
06:48is telling you about an approach
06:50that takes essentially discrete items over some X axis
06:54here, which in my case is always going to be
06:56sequence space, like the nucleotides
06:58or the amino acids of a sequence.
07:01And turns it into these kinds of more discrete models.
07:04And then, in a procedure that I'm going to tell you
07:07about actually gives us more of a continuous measure
07:10over that space, it's not completely continuous,
07:13it actually is on every site.
07:14But when you work with hundreds of sites,
07:17it turns out to look very continuous
07:20in terms of how it appears.
07:22But it's done with a discrete model
07:23that looks over multiple sites.
07:24So well, I'll tell you how it works in a moment.
07:26And I hope it's of interest to you guys.
07:28So just to introduce that, in general,
07:31the lab has worked on a lot of different kinds of data,
07:34and including things like gene expression data
07:36that borders this discrete continuous measurement.
07:39The old microarrays we used to use gave us
07:43essentially continuous measures of gene expression.
07:44Now we get discrete counts
07:46from our sequencing approaches.
07:49Then all the sequence data we work with
07:51often ends up being essentially clusters
07:53of sites of various kinds.
07:56And then we also use a lot of phylogenetic inference,
07:59which is another kind of just discrete modeling
08:01in terms of the topology, but it borders
08:03between these two, because we have discrete modeling of the
08:07topology, there are certain topologies
08:10that the taxa that we're interested in looking at
08:12that show their relationship to each other.
08:13At the same time, there's also a continuous
08:15measure out of that, which is these branch lengths,
08:17or how diverged these different taxa
08:19are from each other and constructing the phylogeny.
08:22So this sort of border between discrete
08:24and continuous measures, always sort of plagues
08:28and intrigues me, I guess it would be the question.
08:30Okay, so what am I gonna do today?
08:32What I wanna do today is talk about
08:35maximum likelihood model averaging to profile clustering
08:37of site types across discrete linear sequences.
08:40So at the very base level,
08:41how do we take kind of these discrete sequences
08:44of amino acids or nucleotides
08:46and understand whether sites are closer to each other
08:50or farther apart from each other?
08:52This is the question: are they just uniformly
08:53distributed site types across a sequence?
08:55Are they clustered close together or far apart?
08:58Secondly, I'm gonna talk about how we can
09:01then use that approach to understand whether sites
09:04are under selection in a gene expressed in a sequence.
09:07And what I mean by under selection is that,
09:09in fact, sites are changing in a rapid
09:12or at a more rapid pace than you'd expect simply
09:14by mutation alone.
09:16So mutation, of course, is going to introduce
09:18variation into a genetic sequence.
09:19But when you see changes that are happening faster
09:21over time in a population
09:23than mutation alone would produce,
09:26that implies that every time that mutation is happening,
09:29it's spreading across the population.
09:30And that's why you see that uptick
09:31in the rate of change of those sites.
09:34So we can actually use this clustering approach
09:36to identify regions of the gene that have
09:38that sort of uptick and I'll explain how we do that.
09:41Now lastly, I'm just going to show you a very few slides
09:43on the title of the talk,
09:45which is this model-averaged estimation of the molecular
09:48evolution and natural selection in SARS Coronavirus one
09:51and SARS Coronavirus two during the zoonosis.
09:55So by the time we refer to these,
09:57I'll just let you know we're almost done with the talk.
09:59All right, so to talk about the first one,
10:01maximum likelihood model averaging to profile clustering
10:03of site types across discrete linear sequences.
10:09I just want to... (phone ringing)
10:11Sorry, emphasize that we wanna figure out
10:20whether site types are clustered within a linear sequence.
10:22This sounds like a very straightforward
10:24statistical question seems like something
10:27that should have been addressed many, many times
10:28in the statistical literature.
10:29Much to my surprise,
10:30it's actually not terribly well explored.
10:34You have a linear sequence,
10:36it's so long, and you have site types of one type
10:38or another. Are they clustered next to each other?
10:39Well, if you know the bounds of the region of interest,
10:42in other words, if you can say, oh,
10:43I'm interested in this domain right here,
10:46and it's from site to site 90 or some other description.
10:48If you know the bounds,
10:49it's very simple to analyze that kind of data.
10:52You can just quantify the site type proportions
10:55within and outside those bounds.
10:57Use something like a straightforward Fisher's exact
10:59test for significance. Extremely simple problem.
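
A minimal Python sketch of the known-bounds case, assuming the sequence is coded as 0s and 1s and the region of interest is the half-open slice from `start` to `end`. The function name, the data, and the bounds are hypothetical. It uses a one-sided version of Fisher's exact test (enrichment of 1s inside the region), computed from the hypergeometric tail with the standard library only; a two-sided test would be an equally reasonable choice.

```python
from math import comb


def fisher_exact_enrichment(seq, start, end):
    """One-sided Fisher's exact p-value for enrichment of 1s inside seq[start:end]."""
    inside = seq[start:end]
    inside_hits, inside_total = sum(inside), len(inside)
    total_hits, n = sum(seq), len(seq)
    outside_total = n - inside_total
    # Hypergeometric tail: probability of at least inside_hits variant sites
    # landing inside the region when total_hits are placed at random among n sites.
    upper = min(total_hits, inside_total)
    tail = sum(
        comb(inside_total, x) * comb(outside_total, total_hits - x)
        for x in range(inside_hits, upper + 1)
    )
    return tail / comb(n, total_hits)


seq = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # made-up data
p_value = fisher_exact_enrichment(seq, 3, 10)  # hypothetical known bounds
print(p_value)
```
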
11:01But what if you don't actually know those bounds?
11:04What if you don't know even what you're looking for exactly?
11:05You just know you're interested in concentrations
11:07of one site type compared to another site type
11:10across some discrete linear sequence,
11:12like this series of zeros and ones you see below.
11:15There's one, zero, zeros, there's one, zero, ones,
11:17there's periods where ones are closer to each other a series
11:20of ones are closer or farther apart from each other.
11:22How should we figure out whether things
11:24are actually clustered in that site?
11:26Or are they random?
11:27So if you don't know exactly where to describe,
11:31or what size you're looking for,
11:33the most common solution people use
11:35is some kind of sliding window,
11:36they take a window over the series,
11:38and they slide it across and say,
11:40"How many are in this window?"
11:41And then you can come up with based on the sliding window
11:44a sort of diagram of the clustering.
11:46And that's an approach that actually does
11:49give a good metric of the clustering
11:51in terms of like you see peaks where there's
11:53a lot of clustering and valleys where there is none.
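
A minimal Python sketch of the sliding-window idea, assuming a 0/1 sequence and a window that moves one site at a time. The function name and the data are made up for illustration. Printing the result for two window widths shows how the picture depends on a width the user has to choose.

```python
def sliding_window_counts(seq, window):
    """Count variant sites (1s) in every window of the given width, one-site steps."""
    if not 1 <= window <= len(seq):
        raise ValueError("window must be between 1 and len(seq)")
    count = sum(seq[:window])
    counts = [count]
    for i in range(window, len(seq)):
        # Slide by one site: add the entering site, drop the leaving site.
        count += seq[i] - seq[i - window]
        counts.append(count)
    return counts


seq = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0]  # made-up data
for width in (3, 5):
    print(width, sliding_window_counts(seq, width))
```

Neighboring windows share most of their sites, so neighboring counts are strongly correlated.
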
11:56However, significance testing with that kind of approach
11:59is often awkward to construct,
12:00due to strong autocorrelation
12:02among these overlapping windows.
12:04And of course, if you just sort of
12:06take windows arbitrarily from one location to another,
12:09then you're really instituting, (indistinct chatter)
12:13then that causes problems.
12:14Because what if the cluster is really on a border
12:16between two windows, so you have to slide it over and then
12:19you have the autocorrelation.
12:20And it becomes actually statistically
12:21quite challenging to sort of account
12:24for all of those autocorrelations.
12:25Secondly, the need to specify that window
12:27size itself presents a user with a procedural ambiguity
12:31that almost inevitably leads to post hoc selection of window
12:34size and can mislead inference. That is just the fact that
12:37you have to choose a window size.
12:39And if you don't actually have a good a priori
12:41outside reason to choose it.
12:43It's very hard not to choose a window size
12:44that ends up validating your hypothesis in some way.
12:49So it'd be better if we could just have an approach
12:51that does not require us to place in some
12:53arbitrary parameter that gives us a window size.
12:56So in order to address this question,
12:58a postdoc of mine, Zhang Zhang, who you see below, worked
13:01with me to address it.
13:03Oh, I wanted to say one other thing,
13:04which is that, yes, this has been addressed with some
13:07nonparametric methods that people have developed,
13:11including some rather famous people like Sam Karlin.
13:14And these are methods that do not assume prior knowledge.
13:17And they've been suggested to detect this clustering
13:20in discrete linear sequences.
13:21So you can do runs tests that look for
13:22the longest unbroken run, or the variance of the run
13:26lengths across the entire sequence.
13:27Both of these are indicators of clustering.
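
A minimal Python sketch of the two run-based statistics just mentioned, assuming a 0/1 sequence in which 1 marks the site type of interest. The function names and the two example sequences are made up. The example also shows the limitation of these statistics: the two sequences have the same runs in different positions and get identical values.

```python
from itertools import groupby
from statistics import pvariance


def run_lengths(seq, site_type=1):
    """Lengths of the unbroken runs of site_type, in left-to-right order."""
    return [sum(1 for _ in group) for value, group in groupby(seq) if value == site_type]


def longest_run(seq, site_type=1):
    return max(run_lengths(seq, site_type), default=0)


def run_length_variance(seq, site_type=1):
    lengths = run_lengths(seq, site_type)
    return pvariance(lengths) if lengths else 0.0


close_together = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]  # two runs next to each other
far_apart = [1, 1, 0, 0, 0, 0, 0, 1, 1, 0]       # same runs, spread out
for seq in (close_together, far_apart):
    print(run_lengths(seq), longest_run(seq), run_length_variance(seq))
```
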
13:30Unfortunately, both of those
13:32are not sufficient tests.
13:34That is, they don't use enough of the information
13:36for you to actually have as much power as you'd
13:39like to do the analysis.
13:40And that's because if you use
13:42the longest run length, for instance, of course,
13:44you're only really using a little bit
13:45of information about the entire sequence.
13:47And of course, you're really missing anything
13:49like a cluster of ones that is made up of a bunch of small
13:52clusters that are all next to each other, interspersed
13:54with a few of the other type,
13:56so the longest unbroken run doesn't work well.
13:59In terms of power,
14:01if you use the variance of run length,
14:04that gets rid of the fact that you're looking for just one.
14:05But unfortunately, a variance doesn't tell you anything
14:07about the relative position of sites
14:11that are of the same type across the sequence.
14:14So the fact that this one, one, one, one here is close
14:18to the one, one here, and to another one,
14:20the fact that these are all close to each other,
14:22does not give us the power that it should
14:25for understanding that this region may
14:27be clustered.
14:30So variance of run length is also an underpowered approach.
14:33The most powerful approach that's been published out there,
14:36aside from the ones we've been working on,
14:38is the empirical cumulative distribution function
14:41statistic. That's where you sort of go across the sequence
14:43and just say, "oh, okay, we're accumulating ones here,
14:47we're accumulating more."
14:49And there's fortunately a number
14:52of highly developed statistical approaches
14:53to look at the empirical distribution and figure
14:55out whether you see an increase beyond
15:00expected during some period of that ECDF.
15:03The power is better than either of the previous methods,
15:05but it's still not very strong.
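
A minimal Python sketch of an ECDF-style statistic, assuming a 0/1 sequence. The talk does not say which ECDF statistic was used, so this uses a Kolmogorov-Smirnov-style maximum gap between the accumulated fraction of variant sites and the straight line expected if they were spread uniformly; the function name and the data are illustrative.

```python
def ecdf_max_deviation(seq, site_type=1):
    """Largest gap between the ECDF of variant-site positions and a uniform spread."""
    n = len(seq)
    total = sum(1 for s in seq if s == site_type)
    if total == 0:
        return 0.0
    seen = 0
    max_gap = 0.0
    for i, s in enumerate(seq, start=1):
        if s == site_type:
            seen += 1
        # Fraction of variant sites accumulated so far vs. fraction of sequence covered.
        max_gap = max(max_gap, abs(seen / total - i / n))
    return max_gap


evenly_spread = [1, 0] * 10          # made-up data: no clustering
clustered = [1] * 5 + [0] * 15       # made-up data: one cluster at the left end
print(ecdf_max_deviation(evenly_spread), ecdf_max_deviation(clustered))
```
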
15:07It's not clear that it includes all the
15:08information that it should.
15:10And it can be affected.
15:12Research has shown that it can be affected
15:14by the location of the cluster, which is not desirable.
15:16So if you have a cluster on an end,
15:18the ECDF will have less power
15:21or more power compared to a cluster in the middle.
15:23It's also challenging to interpret in the end,
15:26for reasons I'm not gonna go into right away.
15:29So what did we do?
15:30What we did was develop a tripartite divide
15:32and conquer approach to model variant sites
15:35based on iterative sub clustering.
15:37And I'll describe it in detail right now.
15:39I'll just tell you the plus and the minus
15:40of this approach at the beginning,
15:42which is it's sort of a bioinformatics approach,
15:45or a bioinformatics statistician's approach,
15:48in that it uses intensive computation
15:50to solve the problem instead of giving
15:52a strict analytical result.
15:55And in fact, what it does is it just says,
15:58Well, if we're interested in clustering in any case,
16:00clusters should be represented by increases in
16:03the probability within some cluster central region
16:06compared to some side regions.
16:08And if we define CS and CE to be anything
16:11from the very beginning to the very end of the sequence,
16:14it encompasses all possible single clusters
16:17within a sequence.
16:19So, for instance, if the cluster were on the far left,
16:22we can just define CS to be at zero;
16:25the left-hand region is empty, and the right-hand region,
16:28the area that has depressed variant-type intensity,
16:35would be the other category.
16:38Anyway, so, what we can do is divide any sequence
16:42into three sections, just count up the number
16:44of site types in each one, estimate the maximum
16:46likelihood probability for the site type
16:50to be of the variant type of interest,
16:52say it's glycine amino acids within a protein
16:55or adenine nucleotides within a gene, whatever it is.
17:00So then you can just come up with a null hypothesis,
17:03which is the likelihood under the hypothesis
17:06that these things are located at random
17:09across the whole sequence.
17:11And then an alternate hypothesis
17:14that invokes a model which involves more parameters,
17:18which then separates the sequence into a clustered
17:21versus non-clustered state.
17:23So that would be fine if what we really
17:25expected in a sequence was one cluster,
17:27compared to nothing else,
17:29compared to the sort of baseline rate of clustering,
17:33sort of baseline rate of variant types.
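
A minimal Python sketch of the single-cluster comparison described here, assuming a 0/1 sequence, Bernoulli likelihoods, and one shared probability for the two flanking regions. The parameter counts (one for the null model; four for the cluster model: start, end, and two probabilities) are assumptions made for this illustration, as are the function names and the data. It tries every possible start and end and keeps the pair with the highest likelihood.

```python
import math


def max_loglik(hits, total):
    """Bernoulli log-likelihood at its maximum-likelihood estimate p = hits / total."""
    if total == 0 or hits in (0, total):
        return 0.0
    p = hits / total
    return hits * math.log(p) + (total - hits) * math.log(1 - p)


def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik


def best_single_cluster(seq):
    """Return (null log-likelihood, (cluster log-likelihood, cs, ce))."""
    n = len(seq)
    prefix = [0]
    for s in seq:
        prefix.append(prefix[-1] + s)
    null_ll = max_loglik(prefix[n], n)
    best = None
    for cs in range(n):
        for ce in range(cs + 1, n + 1):
            if (cs, ce) == (0, n):
                continue  # whole sequence as the cluster is just the null model
            inside = prefix[ce] - prefix[cs]
            outside = prefix[n] - inside
            ll = max_loglik(inside, ce - cs) + max_loglik(outside, n - (ce - cs))
            if best is None or ll > best[0]:
                best = (ll, cs, ce)
    return null_ll, best


seq = [0] * 10 + [1] * 5 + [0] * 10  # made-up data with one obvious cluster
null_ll, (alt_ll, cs, ce) = best_single_cluster(seq)
print("cluster spans sites", cs, "to", ce)
print("AIC null:", aic(null_ll, 1), "AIC cluster:", aic(alt_ll, 4))
```
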
17:35But what we really want is an approach
17:39that can take clustering at many, many levels.
17:42So what if there's a cluster within the cluster
17:43or a cluster within the left region?
17:45So what you can do is then take each
17:46of these sub clusters you've identified and actually
17:50do the same process on them looking for whether there's
17:53a higher likelihood of the data given another cluster
17:56somewhere within this sequence, et cetera, et cetera.
17:59So this sort of dictates a procedure,
18:04which is that you input the sequence,
18:07you start at, you know, the first site at
18:09the left and move all the way to the right,
18:11and essentially you find the most likely cluster
18:13among all the possible clusters.
18:15If the cluster is statistically significant,
18:17you then subsequence each of those three parts,
18:21the left-hand part, the central part
18:24and the right hand part, find the most
18:26likely clusters within each of them.
18:27And proceed doing this until you reach a point
18:30where you can no longer find any statistical evidence
18:32that there is continued clustering within it.
18:34And that's the point at which you stop.
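
A minimal Python sketch of the recursive procedure just described, assuming a 0/1 sequence and using an AIC comparison as the stopping rule. The talk speaks of statistical significance at this step, so the AIC rule, the parameter counts, and all names here are assumptions for illustration. Each call finds the best single cluster, keeps it only if it beats the one-rate model, and then repeats on the left, central, and right parts.

```python
import math


def max_loglik(hits, total):
    """Bernoulli log-likelihood at its maximum-likelihood estimate p = hits / total."""
    if total == 0 or hits in (0, total):
        return 0.0
    p = hits / total
    return hits * math.log(p) + (total - hits) * math.log(1 - p)


def best_split(seq):
    """Bounds (cs, ce) of the best cluster if it beats the one-rate model by AIC, else None."""
    n = len(seq)
    prefix = [0]
    for s in seq:
        prefix.append(prefix[-1] + s)
    best_ll, best_bounds = -math.inf, None
    for cs in range(n):
        for ce in range(cs + 1, n + 1):
            if (cs, ce) == (0, n):
                continue
            inside = prefix[ce] - prefix[cs]
            ll = max_loglik(inside, ce - cs) + max_loglik(prefix[n] - inside, n - (ce - cs))
            if ll > best_ll:
                best_ll, best_bounds = ll, (cs, ce)
    if best_bounds is None:
        return None
    null_aic = 2 * 1 - 2 * max_loglik(prefix[n], n)  # one probability
    alt_aic = 2 * 4 - 2 * best_ll                    # two bounds, two probabilities
    return best_bounds if alt_aic < null_aic else None


def segment(seq, offset=0):
    """Recursively split seq; return (start, end, estimated probability) pieces."""
    if not seq:
        return []
    split = best_split(seq)
    if split is None:
        return [(offset, offset + len(seq), sum(seq) / len(seq))]
    cs, ce = split
    pieces = []
    for lo, hi in ((0, cs), (cs, ce), (ce, len(seq))):
        if hi > lo:
            pieces.extend(segment(seq[lo:hi], offset + lo))
    return pieces


print(segment([0] * 10 + [1] * 5 + [0] * 10))  # made-up data
```

The result is a single stepwise profile: flat within each piece, with jumps at the inferred cluster bounds.
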
18:36And then what you can do.
18:37And this, I think, is sort of a key because
18:39at the end of that, what you get is one discrete diagram,
18:42kind of like that diagram I showed you initially,
18:44where it proceeds flat, goes up,
18:46proceeds flat goes down, et cetera.
18:47I'll show you an example of that in a moment.
18:50But what you really wanna do possibly,
18:53right, what I think is really appealing about
18:55this approach is that then you can take
18:56that as one model, the most likely model and you can look
18:59at all the other possible models
19:00that you could have constructed.
19:02And you can use AIC weighting to actually figure
19:05out how much you should believe each one, that is, what the weight is
19:11for every possible model.
19:13And then you can average across those models
19:14to give you a continuous description
19:17of how much clustering you see across the sequence.
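
A minimal Python sketch of AIC weighting and model averaging, assuming each candidate model supplies a per-site probability profile and an AIC value. It uses the standard Akaike-weight formula, in which each model's weight is proportional to exp(-delta/2), where delta is its AIC minus the smallest AIC. The function names, profiles, and AIC values are made up for illustration.

```python
import math


def akaike_weights(aics):
    """Akaike weights: relative support for each model, summing to 1."""
    best = min(aics)
    relative = [math.exp(-0.5 * (a - best)) for a in aics]
    total = sum(relative)
    return [r / total for r in relative]


def model_average(profiles, aics):
    """Weighted average of per-site probability profiles, one profile per model."""
    weights = akaike_weights(aics)
    n_sites = len(profiles[0])
    return [
        sum(w * profile[i] for w, profile in zip(weights, profiles))
        for i in range(n_sites)
    ]


# Two hypothetical stepwise models of the same four sites.
profiles = [[0.1, 0.1, 0.8, 0.8], [0.1, 0.5, 0.5, 0.8]]
aics = [10.0, 12.0]
print(akaike_weights(aics))
print(model_average(profiles, aics))
```

Averaging many stepwise profiles in this way is what smooths the jumps of any single model into a more continuous curve.
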
19:18And again, the advantage that I mentioned
19:20early on about this,
19:22from my standpoint is I haven't put in anything
19:24about how big a window, how big a cluster,
19:26I put in nothing about what I'm expecting
19:28to see out of the sequence.
19:30I'm just asking, what's the most likely description
19:32of this given the AIC penalty for parameterization
19:37and what the result gives me.
19:39So then we have a bunch of different weights
19:41for all our different models.
19:44And what it gives us is something like this.
19:45So on the top, I've shown you the AIC model selection
19:48which is the first thing I showed you
19:49if I just took the most likely description
19:51of this particular sequence.
19:53It's not important what it is; it's
19:55Adh, which has been widely studied in evolutionary biology.
19:59But if you take this model selection,
20:02the most likely description
20:05given that sub clustering looks something like this
20:07where we have a region with fairly high concentration
20:10of polymorphism, in this case, a valley,
20:14a region at an intermediate level,
20:16a point where we have a lot of polymorphism.
20:19And then it moves and changes across the sequence.
20:21Now, if you then instead take not just that one model,
20:25but a series of models and do the AIC model average,
20:28you get a much more continuous description across
20:30the sequence of what the probability
20:33of site types being different is.
20:36And that enables us to ask a question
20:37that's a little bit more interesting in many cases,
20:41and I'll show you how it enables us to ask questions
20:43about natural selection in a moment.
20:45So in particular, it allows us to get an estimate,
20:49you know of what the probability
20:50is across the entire sequence.
20:51Even though we don't have
20:52observations within the central region
20:54or this barren region here.
20:56We can still estimate what the model-averaged
21:00probabilities of a change appearing in different places
21:02of this gene are, and that enables us
21:05to ask questions that we otherwise could not.
21:08All right, so that's an introduction of MACML.
21:11I'll just mention, and I could give you more detail on this.
21:14This is actually published work,
21:16so you can find it.
21:17But compared to the ECDF statistics,
21:19that approach I just showed you has greater power
21:21to detect heterogeneous clusters
21:23it identifies clusters with greater accuracy and precision
21:26based on the Kullback-Leibler divergence between
21:28the actual distribution of the observed distribution,
21:31sorry, the actual distribution
21:34and the inferred distribution.
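
A minimal Python sketch of a Kullback-Leibler comparison between an actual and an inferred per-site probability profile. The talk does not say how the divergence was computed, so treating each site as a Bernoulli variable and averaging the per-site divergences is an assumption, as are the function names and the numbers.

```python
import math


def bernoulli_kl(p, q, eps=1e-12):
    """KL divergence from Bernoulli(p) to Bernoulli(q); q is clipped away from 0 and 1."""
    q = min(max(q, eps), 1 - eps)
    total = 0.0
    if p > 0:
        total += p * math.log(p / q)
    if p < 1:
        total += (1 - p) * math.log((1 - p) / (1 - q))
    return total


def profile_kl(actual, inferred):
    """Mean per-site divergence between two probability profiles of equal length."""
    return sum(bernoulli_kl(p, q) for p, q in zip(actual, inferred)) / len(actual)


actual = [0.1, 0.1, 0.8, 0.8]        # hypothetical true profile
good_fit = [0.1, 0.15, 0.75, 0.8]    # hypothetical inferred profiles
poor_fit = [0.4, 0.4, 0.4, 0.4]
print(profile_kl(actual, good_fit), profile_kl(actual, poor_fit))
```

A smaller value means the inferred profile sits closer to the actual one.
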
21:36It has better power and accuracy across
21:37different levels of clustering,
21:38better power and accuracy across
21:40different sequence lengths,
21:41and better power and accuracy in finding
21:43multiple clusters compared to a single cluster.
21:45The disadvantage is, it's extraordinarily
21:47computationally intensive, and it is prohibitively
21:49so for very long sequences.
21:51So for genes of very long length,
21:53we can't actually run it on the full-length gene
21:55and we have to do some more heuristic processes
21:58to crunch those genes into smaller size.
22:01Which we then can analyze and then build them up.
22:03Again, I won't go into those at the moment.
22:05But the point is that at certain lengths,
22:07it gets just computationally too intensive to go
22:09through all the possible models that could explain the data.
22:13Now, I've talked about the maximum-likelihood averaging
22:17to profile clustering of site types
22:19across discrete linear sequences
22:21and introduced that methodology. Now I'm gonna talk about
22:24how we can apply that methodology
22:26to get us a better idea of which sites are under selection
22:29using what's called a Poisson random fields approach.
22:32And don't worry about that terminology.
22:34You might know it from statistics,
22:37it has to do with a particular observation
22:40in molecular evolutionary biology,
22:42which is why they're using it
22:44and it's not really important for this talk,
22:46why it's called that.
22:48So let's go ahead and do that, and talk
22:51about the model-averaged site selection
22:53using Poisson random fields.
22:54So first, I need to give you a little bit of background
22:56in the evolutionary biology for those of you
22:59who haven't had a lot of biology,
23:00so you understand how this fits in with
23:02what we tend to do.
23:03Of course, evolutionary biologists
23:05are often very interested in understanding
23:06what things are under selection.
23:07And in the context of this talk,
23:09why is that important?
23:10Well, we'd really like to know what things
23:12are under selection in the COVID epidemic,
23:14because we'd like to know what sites
23:16are actually causing the COVID epidemic
23:18to spread more or not, and what sites may have
23:21been important in it prior to zoonosis,
23:24and then, perhaps especially in the context of this talk,
23:26what sites were selected during
23:28that zoonotic process that made this virus perhaps able
23:31to infect humans in the first place.
23:33So what we're doing is,
23:34so to give you an introduction,
23:36I just wanna mention that there are sort of ways
23:39to look at ancient times and understand
23:40whether selection was happening.
23:42And that's this approach
23:44that looks at phylogenetic divergence,
23:45looking at multiple sites and saying,
23:47"Oh, we have a phylogeny
23:49of how these organisms are related."
23:51And then we have a bunch of sites that are for each taxon.
23:55When we see sites like this, for instance,
23:57that have an A, and then a couple of C's, and then a G
24:00in another taxon, we know that this site changed twice
24:03on that phylogeny, at least, right?
24:05So it probably changed from C ancestrally
24:09to an A in this lineage and to a G
24:11in this lineage independently.
24:13And so the fact that it changed twice means
24:16that it's got an elevated rate of change.
24:18And that elevated rate of change is an indication
24:20that there's been positive selection for change.
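
A minimal Python sketch of counting the minimum number of changes at one site on a phylogeny, using Fitch parsimony. The talk does not name a counting method, so Fitch parsimony is swapped in here for illustration; the four-taxon tree shape and the taxon names are hypothetical, chosen to mirror the A, C, C, G site in the example.

```python
def fitch_min_changes(tree, states):
    """Return (possible root states, minimum number of changes) for one site.

    tree   -- nested 2-tuples whose leaves are taxon names
    states -- dict mapping each taxon name to its character at this site
    """
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # Children agree on at least one state: no change needed here.
            return common, left_cost + right_cost
        # Children disagree: one change is required on a branch below this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)


tree = (("taxon1", "taxon2"), ("taxon3", "taxon4"))  # hypothetical topology
states = {"taxon1": "A", "taxon2": "C", "taxon3": "C", "taxon4": "G"}
root_states, changes = fitch_min_changes(tree, states)
print(root_states, changes)
```

On this tree the site needs at least two changes, with C as the inferred ancestral state, which matches the reasoning in the example.
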
24:22It's especially likely in sort of pathogen-host
24:25interactions that high rates of change arise
24:28because pathogens are changing in order
24:30to not be recognizable by their hosts.
24:33And often the host has recognition proteins
24:35that are changing to still recognize the pathogen,
24:36even as the pathogen is changing.
24:38So these high rates of evolution
24:40are very strong indicators of selection
24:42in host pathogen situations.
24:45So this is one way to study a natural selection.
24:48It does depend, though, on having a lot of data going back
24:52in time because you're actually reliant on these changes
24:55occurring in multiple places on multiple lineages.
24:58Now, at a more recent level, and I'm going to go back
25:02to the middle in a moment,
25:05at a very recent time, you may have
25:07heard of selective sweep detection,
25:08a couple of methods people use are Tajima's D
25:11or iHS; there's a bunch of other methods that are out now.
25:14And the idea there is to look at polymorphism.
25:16And if you look at an individual, before selection,
25:20this is sort of just an idea diagram,
25:22not what you look at.
25:23But so if you look at an individual who has a variant,
25:26and what you see in a population is that
25:30one individual with variant, a variant that's important
25:33as somehow swept across the population.
25:35So if you see this would be before selection,
25:37there's a lot of variation at a particular locus
25:39in the genome after selection,
25:41that one individual's variant, which contributed
25:44to the reproductive fitness would then imply
25:46that they would spread across the population.
25:50And if they spread across the population,
25:52then the genetic variants that were present
25:54in that original individual spread across
25:56the population as well along with this selected site,
26:00and so you can look for this kind of partial or complete sweep
26:04while the selection is going on. Neither
26:07of the approaches that I just talked about
26:09is the approach that I'm doing today.
26:10So I just wanted to introduce those,
26:12so you knew those are different.
26:13And they're different because we're looking
26:15at a more intermediate timescale.
26:16That is, the sweep detection is purely
26:19dependent on polymorphism in the population,
26:21like what's happening in a population right now.
26:24The phylogenetic divergence is purely dependent
26:26on these ancient changes that you get from a phylogeny,
26:28understanding how different species are related
26:31to each other. At an intermediate level
26:33are methods that use both the polymorphism
26:35and the divergence.
26:37And the idea here in the McDonald-Kreitman approach,
26:40and the MASS-PRF approach I'm going to tell you
26:42about is that the polymorphism what you see generally
26:46in the population is sort of consistent with this.
26:48Sorry, if I go back to this slide.
26:51With this before selection, you know,
26:53all of these blue sites are assumed
26:55to not be under selection,
26:55and generally what we believe in evolutionary biology,
26:59because of empirical data that validates it
27:02is that most sites that you find varying in populations
27:05are not under strong selection.
27:05If they were under strong positive selection,
27:07they would probably fix, and everyone would have them.
27:11And if they were under negative selection,
27:13they wouldn't rise to a high frequency.
27:14So generally speaking sites that you actually see
27:17change differences between us and our genetics
27:18typically are not affecting anything.
27:20Of course, we spend in our...
27:23In the media, you only hear about the changes
27:24that actually affect things.
27:25And that's because those are important to us,
27:26the ones that don't change anything
27:28we don't really care about.
27:29So nobody talks about that much.
27:30But most of the changes within a population, or differences
27:33within a population, don't have much material effect.
27:35So under that hypothesis,
27:37then when you look at polymorphism,
27:39most polymorphism is just an indication
27:41of the underlying mutation rate,
27:43some mutation happened that didn't have any effect.
27:45It's drifting up and down in the population.
27:47And so the advantage of that is if you know
27:50that polymorphism is a signature
27:52of just random mutation, it gives us an estimate
27:54of the underlying mutation rate, which we can then compare
27:57to the divergence, and using that comparison,
28:00we can understand how organisms are related,
28:02and so whether organisms are under selection
28:05or not. If the divergence is high compared
28:07to the polymorphism, that indicates a lot of selection.
28:09That means (indistinct chatter)
28:12in the timescale of the analysis you're doing,
28:14we have a lot of change in the population.
28:17And if, on the other hand, you have a lot of polymorphism
28:20and not that much divergence, then that indicates
28:22you've got a lot of change going on,
28:23but it's not actually being directionally
28:26selected because the divergence is much lower.
28:27So how does that test work in practice?
28:30Well, just to step back for one moment,
28:32so we're gonna apply that kind of test.
28:35In this talk I'm applying that test
28:36to the emergence of COVID-19.
28:39I'm actually also applying it to SARS, the fairly
28:44closely related SARS coronavirus 1,
28:46because we have similar data and can apply
28:48the same test in the same way to that data set.
28:52And we're using in addition the SARS-like
28:55coronaviruses in a sample that has been sequenced,
28:58basically collected from bats
29:00over the past 20 years or so.
29:02So what you can see here is a phylogeny,
29:05which includes the COVID-19 epidemic ongoing now in humans,
29:09and the SARS epidemic, which caused some 400 deaths
29:13or so back in the early 2000s.
29:18And what we're doing is analyzing both and looking at,
29:21in particular, the very short internode here
29:25between the most closely related non-human infections
29:31and the human infection set that we can see.
29:33And this internode here, also,
29:36between these non-human infections and the human
29:39infections we can see here, because the changes
29:42that may have enabled, we don't know,
29:45there may be no changes that enabled it,
29:47maybe this virus throughout
29:49its entire history could have infected humans,
29:51but it just never managed to or never did.
29:53But if there are changes that are unique to this virus
29:56that happened during zoonosis, enabling it to infect us,
29:59they happened on this lineage,
30:00and so we're interested in seeing what those changes are.
30:04And so that's what we're gonna do is we're gonna run
30:06this polymorphism and divergence approach on this lineage.
30:10And what I just want to make (indistinct chatter)
30:13clear to you is the reason
30:14why the polymorphism and divergence approach is important:
30:18the phylogenetic approach, the ancient approach,
30:20relies on a large clade of data, which we don't have
30:22for that particular lineage here.
30:24We just have the human infection,
30:26which is no longer zoonotic.
30:26And we have this one lineage.
30:28And so what we can do is ancestrally reconstruct
30:30the ancestor of this lineage, which is right here,
30:33actually on the phylogeny,
30:34and also the ancestor right here,
30:37and then use MASS-PRF, this approach that's based
30:40on polymorphism and divergence that I'll explain to you,
30:43on the divergence between that ancestor
30:46and the first ancestor of all the human infections.
30:48And we can take that as the near zoonosis time
30:51and figure out what mutations might
30:53have happened during that time.
30:54All right, so we're gonna do that in both
30:56the COVID-19 and SARS cases.
30:59Now, how does this work in principle?
31:02Well, there's an old approach,
31:03which is not what we're using.
31:05But I have to compare it to in order to
31:06sort of reference it in terms of the literature.
31:09And that is that when you assume
31:11that polymorphism is neutral,
31:13we expect a different proportion of replacement
31:16to synonymous divergence compared to replacement
31:18to synonymous polymorphism in a gene.
31:21So it's just a two-by-two table here, again,
31:23very simple statistics, where we look at
31:25the number of replacement sites that are divergent and
31:28the number of synonymous sites. Replacement,
31:30again, is when an amino acid change
31:32occurs in a DNA sequence.
31:33DNA sequence changes can either change the amino acid
31:35or not, depending on what the sequence of the codon,
31:39the three-base-pair codon in the DNA sequence, is.
31:42So if there's a replacement, we tally it here.
31:44If it's a synonymous change, that doesn't change the amino
31:46acid, we tally it here.
31:48Synonymous changes are presumably neutral because
31:50they don't change anything about your protein.
31:52And then if it's a polymorphic replacement,
31:56then we see it here.
31:57And if it's a synonymous polymorphism we see it here.
31:59So under the hypothesis that I mentioned,
32:01all three of these cells should
32:04be sort of changing in exactly the same way,
32:06because polymorphic sites, whether they're replacement
32:09or synonymous, we're assuming are neutral, and
32:11synonymous sites, whether they're divergent
32:12or polymorphic, we're assuming are neutral.
32:15The only ones that under that
32:17assumption are not neutral are these
32:19changes at replacement divergence sites.
32:22So, if this replacement divergence, if the marginals
32:25add up so that this replacement divergence is sort of in
32:29line with all these others, then we assume nothing important
32:30is happening in that gene, it's probably not selected,
32:33it's just neutral changes that are happening there.
32:35If this divergence is higher, though,
32:38then we might conclude that it's under
32:39selection for changes at a rapid pace.
32:41So neutrality yields a DN over DS that's equal
32:44to the PN over PS. Positive selection means
32:46that the DN over DS is greater than the PN over PS, and negative
32:50selection, where changes are actually being selected against
32:53at a high level, indicates the DN over DS
32:56is gonna be less than the PN over PS.
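The two-by-two comparison described here can be sketched in a few lines of Python. This is only an illustration of the classic McDonald-Kreitman logic, not the speaker's software: the function names `fisher_exact_two_sided` and `mk_test` are made up for this sketch, and using Fisher's exact test on the table is an assumption (the talk does not say which test statistic was used).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    tail = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def mk_test(dn, ds, pn, ps):
    """Compare DN/DS with PN/PS for one gene (hypothetical helper).

    dn, ds: replacement and synonymous divergent (fixed) sites
    pn, ps: replacement and synonymous polymorphic sites
    """
    p_value = fisher_exact_two_sided(dn, ds, pn, ps)
    # Cross-multiplying compares the two ratios without dividing by zero.
    if dn * ps > pn * ds:
        direction = "DN/DS > PN/PS (excess replacement divergence)"
    elif dn * ps < pn * ds:
        direction = "DN/DS < PN/PS (deficit of replacement divergence)"
    else:
        direction = "DN/DS = PN/PS (as expected under neutrality)"
    return direction, p_value
```

Calling `mk_test` with the four tallies from the table returns the direction of the departure from neutrality and a p-value for the whole gene.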
32:59All right, now let's get to a little bit of the
33:01complexity on this thing that I mentioned that's called
33:04Poisson random field theory, which quantitatively estimates
33:05gene-wide selection intensity.
33:09So what you can do is take that
33:12same two by two table, and you can say under a model of
33:14selection, what do we actually think is happening here.
33:18And that gives us the ability to estimate the selection
33:20coefficient, which is basically the rate at which that
33:22change allows the virus to increase its reproductive ability
33:25or survival ability in the host.
33:27And that is this gamma term right here
33:32in these terms. And these look complicated,
33:34but essentially, these formulas are just saying
33:36that the expectation for a synonymous sorry,
33:39the synonymous and replacement have reversed
33:41on this chart compared to the last,
33:43so don't be confused by that.
33:45But the expectation under synonymous
33:45changes is essentially the mutation rate.
33:48And these terms are just about the sampling properties
33:50of when you sequence how many of these things you get,
33:52I don't need to go into the detail about that here.
33:55Similarly, the polymorphic sequence
33:57is just basically dependent on the mutation rate.
34:00However, the replacement sites are a little bit more
34:02complicated in that they have to account
34:07for kinds of selection that may be going on.
34:11For reasons that I don't wanna get into,
34:12selection affects the polymorphism too, so both of them depend
34:16on the mutation rate for replacement sites,
34:18and both of them depend on
34:20how much each variant is selected.
34:23Selection does impact the polymorphism
34:25to a certain degree, in the sense that if variants
34:27are moving through the population very fast,
34:30that can change how much polymorphism you see.
34:32But then if you use these sampling formulas, and the formula
34:36for the estimate of the strength of selection,
34:38given how many variants we see changing,
34:41you get these formulas for how much replacement
34:44divergence and polymorphism you expect to see.
34:47So this is population genetics that was worked
34:49out by Stan Sawyer and Dan Hartl in 1992.
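To give a feel for how the gamma term acts, here is a minimal Python sketch of one ingredient of the Poisson random field expectations: the standard diffusion-theory factor 2γ/(1 − e^(−2γ)), by which scaled selection γ raises or lowers the fixation rate of new mutations relative to neutral ones. The sampling terms mentioned in the talk are deliberately left out, and the function name `relative_fixation_rate` is an assumption made for this illustration.

```python
import math


def relative_fixation_rate(gamma):
    """Fixation rate of mutations with scaled selection gamma, relative to neutral."""
    if abs(gamma) < 1e-8:
        # Limit of 2*gamma / (1 - exp(-2*gamma)) as gamma approaches zero.
        return 1.0
    # -expm1(-2*gamma) equals 1 - exp(-2*gamma), computed accurately for small gamma.
    return 2.0 * gamma / (-math.expm1(-2.0 * gamma))


for g in (-2.0, 0.0, 2.0):
    print(g, relative_fixation_rate(g))
```

Positive gamma gives a factor above one (more replacement divergence than the mutation rate alone predicts), negative gamma gives a factor below one, and gamma of zero recovers the neutral expectation.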
34:52The only change I'm making in MASS-PRF is that,
34:56instead of using counts, which is how many variants
35:00you see, the way the McDonald-Kreitman test uses them,
35:04I'm taking the probabilities of replacement divergence
35:08and the probabilities of polymorphism
35:11and putting them in here.
35:12And the advantage here is that what
35:13I can do with that is what I mentioned earlier:
35:15I can go back to the old MACML
35:18approach, the sequence clustering approach
35:20that I mentioned before. Estimating those probabilities
35:25across the entire gene, I can then estimate selection across
35:27the entire gene by using these probabilities for single sites,
35:30even though I don't have counts of changes for a single site.
35:32So this allows
35:34us to estimate this gamma, maximizing the likelihood of what
35:38gamma is, given those probabilities.
35:42So this is a very complex diagram of how this all works.
35:46Again, it's a pretty elaborate method of computation.
35:50But again, it has the nice property that
35:53I'm not using assumptions
35:55and not putting in any parameters
35:56that go in.
35:58I just take the polymorphism and divergence data and analyze it for
36:01whether sites are clustered, in four different categories.
36:04Again, replacement polymorphism,
36:06that's this arc here,
36:07synonymous polymorphism, synonymous divergence, replacement divergence:
36:12we cluster within all four of those categories.
36:15We calculate the model-averaged probability across
36:17all those clusterings and merge the data together.
36:20I'm not going to go through the details.
36:22But if you were to do essentially the MACML-like
36:25clustering on those four categories
36:27for a particular gene, replacement polymorphism,
36:30synonymous polymorphism, synonymous and replacement divergence,
36:33and if you plug those in to the formulas I showed you before,
36:37you're basically plugging into these categories,
36:39you can estimate those formulas.
36:41And in the end, what you get is
36:42an estimate of gamma across nucleotide positions in a gene.
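The idea of averaging per-site estimates over many clustering models can be illustrated with a generic Python sketch. It assumes Akaike weights as the weighting scheme, which is a common choice for model averaging but is not stated in the talk; the function names and the toy inputs are hypothetical, and this is not the MACML or MASS-PRF implementation.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC scores into normalized model weights (assumed weighting scheme)."""
    best = min(aic_values)
    raw = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(raw)
    return [r / total for r in raw]


def model_average(per_model_estimates, aic_values):
    """Average per-site estimates across models, weighting each model by its Akaike weight.

    per_model_estimates: one list of per-site values for each candidate model
    aic_values: one AIC score for each candidate model
    """
    weights = akaike_weights(aic_values)
    n_sites = len(per_model_estimates[0])
    return [
        sum(w * estimates[i] for w, estimates in zip(weights, per_model_estimates))
        for i in range(n_sites)
    ]
```

Models that fit the data better receive more weight, so each site's final value reflects all plausible clusterings rather than a single best one.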
36:49I won't go into this result here;
36:51it's an interesting result for reasons
36:54that are of interest mostly to evolutionary
36:56biologists, but you can see here in this particular gene
36:58that there's a lot of variation in the selection
37:02intensity across the gene.
37:04Now, that is actually really
37:06consistent with what we'd expect
37:08from a sort of basic biology standpoint.
37:11Different parts of a gene are gonna either
37:13be very strongly selected to stay the same
37:15or they're gonna change; you shouldn't really expect
37:18that all parts of a gene are equally likely to change.
37:20And this gives a very nice diagram
37:22that allows you to understand how
37:23it's different across the gene.
37:25So if we compare this kind of approach
37:27to the McDonald-Kreitman test, which again
37:30is just putting the DN, DS, PN, PS values
37:33into this two-by-two table,
37:36and I went through that, the important difference is that
37:39the MK test assumes this intragenic homogeneous selection,
37:42that in fact a gene has the same selection
37:44across the entire sequence.
37:46The problem with that is if you have one small
37:48region that's under selection,
37:50the averaging-out process across that entire gene
37:53can mean that you don't detect the selection there,
37:54even though it may be very strong for that small region.
37:57And so the hope is that MASS-PRF can
38:01identify those regions much better
38:02than MK, for instance, would.
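The averaging-out problem can be shown with a toy Python example. All counts below are invented purely for illustration, and the pseudo-counted odds ratio is a simple stand-in chosen for this sketch; it is not how MASS-PRF estimates selection.

```python
def mk_odds_ratio(dn, ds, pn, ps):
    """(DN/DS) / (PN/PS), with a 0.5 pseudo-count so empty cells do not divide by zero."""
    return ((dn + 0.5) * (ps + 0.5)) / ((ds + 0.5) * (pn + 0.5))


# Hypothetical counts: a short region with many replacement substitutions,
# embedded in a long, constrained remainder of the gene.
hot_region = dict(dn=9, ds=2, pn=2, ps=4)
rest_of_gene = dict(dn=3, ds=30, pn=20, ps=40)
whole_gene = {key: hot_region[key] + rest_of_gene[key] for key in hot_region}

print("hot region:", mk_odds_ratio(**hot_region))
print("whole gene:", mk_odds_ratio(**whole_gene))
```

With these made-up numbers the short region has a ratio well above one, while the gene-wide tally falls below one, so a single gene-wide table would miss the locally selected region.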
38:04And in fact, I went through this already.
38:09I'll just skip past this because I went through it already.
38:13And it does do that.
38:18So this is an example of the McDonald-Kreitman
38:21test here applied to a Drosophila gene.
38:23What you see is a high level
38:27of replacement divergence, which turns out
38:30to indicate high selection.
38:33And you can see here that the DN over DS ratio
38:35is about eight to one, whereas the PN over PS ratio
38:38is almost even.
38:40So this is a gene that's under very strong selection
38:42based on the McDonald-Kreitman test.
38:45Now, interestingly, so this one works
38:47with a homogeneity.
38:49And then if you analyze the Acp26Aa gene
38:55and look for the probability of all four categories.
38:58These are the four categories and of course,
39:01the replacement divergence here is the one
39:04that's most likely to drive selection.
39:06What do you get when you estimate gamma using this?
39:09Well, interestingly, what you see is not something
39:10that's under very strong selection across the entire gene,
39:13but something that's under moderately strong selection,
39:15basically in the second half of the gene,
39:17and then one peak of very strong
39:19selection right around the middle of the gene.
39:21And this visibly occurs because
39:23of a number of changes that occur
39:26in one particular domain of the gene here.
39:28Now, if you look at just the replacement divergence,
39:30you wouldn't be able to figure this out.
39:32Because you see there are other
39:34peaks along here.
39:35Those don't turn out to be so important.
39:36And the reason why they don't turn out to be so important
39:39is that the synonymous divergence, synonymous polymorphism,
39:41and replacement polymorphism
39:42tell us more about the underlying mutation rate.
39:44That says those elevations probably have
39:47something to do with mutation rate, and not necessarily
39:49to do with added divergence.
39:52You can sort of see this elevation
39:54on the right hand side over here compared
39:56to the small dip right here and up here
39:59and the way it all works out mathematically
40:02is we can really see that there's strong selection here.
40:04We can also get what I call model intervals for this.
40:06If you look across all the models,
40:08what are the estimates of selection?
40:11What we get is the 95% model interval for this.
40:14And that's what these very faint gray lines you
40:17may be able to see are. Those allow us to detect whether
40:19these are significant, at least
40:22statistically significant, differences in selection.
40:24All right, I'm gonna skip through this
40:27just because I don't want to spend the time,
40:29but the point is, you can do this for other genes,
40:29and it shows similar results that allow us
40:32to understand where sites are under selection in that gene.
40:34I'll just cover a few more examples
40:37of how we've used this to give you an idea
40:39of what it can look like, in a comparison between humans
40:42and chimpanzees, where we've run this just to understand
40:44how we've diverged from chimpanzees.
40:47We see a bunch of different examples here.
40:50Again, doing a little bit of comparison between
40:52that traditional McDonald-Kreitman test
40:54and the MASS-PRF test.
40:56Here you see a gene which is statistically significant
41:00from the MK test's point of view,
41:01based on the four categories,
41:04the four tallies of which are indicated here.
41:07Here's the MASS-PRF profile, and it shows us again
41:10a particular region within this SLC AA
41:12one gene that is under selection.
41:14There are interesting stories behind all of these,
41:17but I'm not gonna take the time to go through them.
41:19Here's another example, and this is an example
41:22where the McDonald-Kreitman test
41:23comes out as not significant.
41:25There's just not that much divergence
41:26compared to the other categories.
41:28But if you do this spatially, with the MASS-PRF test,
41:32you actually see that a very central region there
41:34has very strong selection, and then the rest of the gene
41:37is under almost zero selection or almost no selection.
41:41So this is an example of what I talked about,
41:43where you could have some very small portion
41:45of the gene under very strong selection,
41:47and the McDonald-Kreitman test wouldn't detect it
41:49because it's averaging over the entire gene.
41:51Similarly, you'll get some genes.
41:52Oops, I didn't mean to do that.
41:54Some genes, here's M gamma over here, where there's a...
41:58Well, let me turn to that one last.
41:59Actually, let me look at TPH first.
42:02There's no statistically significant selection according to the MK test.
42:06And in fact, in our MASS-PRF,
42:08there's no significant selection either:
42:09the error bars are entirely overlapping zero here,
42:12which indicates no selection.
42:15Lastly, here's M gamma.
42:16This is one of the very few examples
42:18we were able to find where the McDonald-Kreitman test did detect
42:21selection where MASS-PRF didn't.
42:24As you can see, there's quite high tallies here,
42:26which means there's a lot of power
42:27to detect selection if it's there,
42:28but it's probably not very strong,
42:30because the numbers are not all that different
42:32from each other.
42:34And McDonald-Kreitman says it's statistically significant.
42:36Now the reason why McDonald-Kreitman is telling us
42:40it's statistically significant, compared to MASS-PRF,
42:41is that actually, I didn't explain this in detail to you,
42:44but McDonald-Kreitman doesn't actually assume
42:47that there's an elevation of rate here.
42:48And so the significance here is actually driven by
42:51the high polymorphic replacement level.
42:53So there's a lot of polymorphic replacements in there.
42:56And what that means is there's some other
43:00kind of selection that isn't a directional selection.
43:01I won't go into the details there.
43:02But the nice thing is that the examples
43:04where we find that McDonald-Kreitman is statistically
43:07significant and MASS-PRF isn't are examples
43:10where in fact MASS-PRF is not designed to detect
43:12that kind of selection and the MK test is.
43:15In general MASS-PRF turned out to be significant
43:18in almost every case that MK tests were not.
43:21Okay, so how can we apply this
43:24to instances like COVID-19, the point of this whole talk?
43:27I'm just gonna give you one example first
43:30to justify why we think it's a good idea,
43:32because we don't have results on doing it,
43:34at least not many results on doing it, for COVID-19
43:36yet, and that is that we applied this to influenza before,
43:39which has some similarities to COVID-19, as everyone knows.
43:43And in influenza, again, we're interested in looking across
43:46the gene: are there sites that are under selection?
43:48Because those sites that are under selection
43:50are candidates where we need to be aware that,
43:53in fact, like, every year they design
43:58a new influenza vaccine, right?
43:58And what they're trying to do is accommodate
44:00the fact that these changes occur on the sites
44:03that are actually susceptible
44:04to your immune system recognizing the influenza virus.
44:08So we need to understand those sites that are changing
44:11and where they are, in order to design
44:13more universal vaccines that maybe could target sites
44:16that won't change rapidly because they can't change
44:19because they're structurally constrained in the virus.
44:22So what we did was apply this MASS-PRF approach
44:25to influenza, similarly on a phylogeny
44:29like I described for coronavirus.
44:30I don't have the phylogeny in the slide set,
44:33but the point is just looking at the ancestral influenza
44:36and its divergent sites within a particular region.
44:40And what we were able to do is identify a set of sites
44:43that are under selection using MASS-PRF
44:46that are beyond what people had hypothesized
44:48as positive selection sites in the past.
44:50So there's a paper by Westgeest et al. 2012,
44:53which is essentially the gold standard for this,
44:55and they found a bunch of sites, which are all
44:58these circled sites in gray. MASS-PRF
45:00also found those; the orange diagram here
45:03is the MASS-PRF profile for this gene.
45:09And it also identified other sites
45:10that are under selection as well.
45:14And we're in the process of understanding
45:16better how those can be validated.
45:17But the ultimate point is that
45:20these are important selected sites that may be relevant
45:25to the design of vaccines for influenza.
45:28So similarly, we'd like to illuminate
45:31which sites might be changing rapidly
45:34and under positive selection in coronavirus,
45:37not only during the human epidemic,
45:39but again during the zoonotic time period.
45:41And so now we're finally coming to the final
45:43part of my talk, which is what we're doing
45:46in terms of the model-averaged estimation of molecular evolution
45:48and natural selection in SARS coronavirus
45:51one and SARS coronavirus two,
45:53coronaviruses during zoonosis.
45:53But the whole point here is really to
45:56explain to you what I've done, because the results I have,
45:57as I said, are just a few plots of some of the
46:01strongest selection we were able to detect,
46:03because we have to process through a lot more data
46:05before we get a more in-depth look at the lesser
46:07selected sites that are on these genes.
46:10And so we looked at this for coronavirus.
46:13This is just a coronavirus Getty image that Yale
46:17has used for coronavirus.
46:21And again, as I mentioned,
46:23we're looking at these two sites where COVID-19
46:26emergence occurred, and where SARS emergence occurred.
46:30And the question is, are there changes
46:33that happen there that are specifically
46:34responsible perhaps for those zoonoses? And the only results
46:38I have are just a few results, again, highlighting some of
46:40the strongest selection we saw.
46:42This is actually a diagram of the spike
46:44protein, which, if you've heard much about COVID-19
46:47molecular biology, you probably have heard about: the spike
46:50protein is what sticks out from the virus.
46:52It's what grabs onto the ACE2 receptor,
46:56and essentially is what most vaccines
46:58that one might design for the virus would target.
47:01And the point is that the receptor-binding
47:04domain, which has gotten a lot of press already, turns out
47:07to have the selected sites.
47:08You can see them here, here, here and here.
47:12These are sites that are selected,
47:13meaning they're changing rapidly
47:13during the pre-zoonotic phase.
47:17So these are sites that are changing, not in humans,
47:20but in the bats, in the pangolins,
47:22and whatever other animals that this virus
47:25is spreading among, or has been spreading among
47:27before the zoonosis to humans.
47:29So then the question is, are similar sites under
47:30selection during zoonosis
47:31and during post-zoonosis?
47:36And the answer right now is yes,
47:38it seems kind of similar,
47:39although we don't get the same sites.
47:40So we have to do a little bit
47:42more molecular, you know, staring at this and understanding
47:44it, because, literally,
47:46I got these results today, actually.
47:48So we have to sort of do more of this,
47:51and we can actually look in more depth
47:54and get more sites with other approaches
47:55that we haven't implemented at this moment.
47:57But during near-zoonosis what you see is, again,
47:58the selected sites, which are in bright red,
48:06are also on the sort of visible side
48:08of the receptor-binding domain
48:13of the spike protein, which is the tip,
48:17the outside portion of this gene.
48:23Lastly, if we look post-zoonosis, that's in
48:24the evolution in humans, we again see that
48:26the selected sites are sites that are at this tip region.
48:33Again, none of this is terribly surprising.
48:35The interesting thing is that
48:36it kind of indicates consistency.
48:38Again, there's a lot more to do before
48:40we can conclude anything like this,
48:42but the idea we have right now indicates
48:44a good deal of consistency between the selection
48:46that's ongoing in humans, during zoonosis, and pre-zoonosis.
48:51And what that implies is that this may
48:54well have been, as I said very briefly
48:56during this talk, an instance where there's a virus
49:00just circulating around in bats and pangolins
49:01that could have caused this disease at any time.
49:04It's just a matter of whether or not we actually
49:07have exposure to those organisms
49:11that allows the transmission to happen.
49:14Consistent with this, I'll just mention
49:17a couple of verbal points,
49:18which is that all the evidence that we have indicates
49:20that this virus spread extremely quickly
49:23from the moment of its zoonosis into humans.
49:26And in fact, in most cases of zoonosis,
49:28we find that that's true,
49:31which is somewhat counterintuitive.
49:33Obviously, it hasn't adapted to humans;
49:34it has adapted to the mammalian immune system.
49:37And so to the extent that our immune system is not
49:39tremendously different from that of bats or pangolins,
49:41it may be not surprising that it can infect us.
49:44But one of the things that is true is that
49:47if it did not spread very quickly,
49:48very easily from the very moment it transmitted to someone,
49:51it would probably lead to a dead end.
49:52In other words, if you don't have
49:55an ability to transmit and spread just from the get go,
49:57the first person who gets infected
50:00is very unlikely to transmit it to someone else.
50:02So it sort of has to be well pre-adapted
50:04for a zoonotic event to actually spread in humans.
50:07Now, we'd need more zoonotic events,
50:11God forbid that it actually happens,
50:13to really get a better picture of that.
50:15But the general result in the scientific
50:16literature does seem to show that when zoonosis happens,
50:18the disease is already well set to cause problems.
50:22And the examples that we have where
50:24it doesn't happen like that, like MERS,
50:27well, MERS is a good example.
50:30It's a really deadly disease,
50:31but it doesn't transmit well among humans.
50:32And so that's an example where maybe it's transmitting
50:35to humans, but it's not transmitting among humans.
50:37And it's very hard for that disease
50:40to catch on within the human population
50:43and do human transmission as opposed to zoonotic events.
50:45And that's because it doesn't transmit,
50:47and it doesn't usually evolve that ability
50:48to transmit over the short time that
50:51individuals might get infected
50:53when they get it, usually from camels.
50:57Okay, so I've shown you those examples.
50:59I just wanna mention what else we're gonna be doing.
51:02So what I just showed you was actually
51:05some sites
51:06that are under selection in SARS
51:08coronavirus 2 genes.
51:10This is the S gene right here.
51:12That's the spike gene.
51:13We're gonna be looking at that in SARS coronavirus
51:15one and two. We're also going to be looking
51:18at other genes in the genomes.
51:22These have other functions.
51:23The M gene, for instance, is a membrane gene.
51:26So it might be relevant to and the gene
51:28as well might be relevant to vaccine generation.
51:32Like if we could generate a vaccine that targeted
51:35those, maybe they would be unable to change at the same
51:41pace that the spike protein would; they might be more conserved.
51:44And that might be one approach towards developing a vaccine
51:46that would be a longer-term vaccine, because one thing we
51:49have to worry about, of course, with this coronavirus,
51:53and I have other research that we're doing on
51:55this question, which I'd love to talk about if anyone's
51:57curious, is that you can estimate
51:59what the actual waning immunity of it is,
52:00even though we don't have data on that, by looking
52:03at other related species and using the phylogeny
52:05to understand how the waning immunity
52:08has evolved across the species
52:09and what the projected or most likely
52:12waning immunity of SARS coronavirus is,
52:15and it looks like
52:16it's around 80 weeks or so.
52:18So if we get about 80 weeks of a period
52:21of immunity from this, that's not that
52:22much: every two years or so we're gonna have
52:25coronavirus coming around, in terms of we're going to
52:28be susceptible again to coronavirus.
52:30Not that we're going to get it every two years.
52:33And what that would mean is that
52:36it's likely to persist as a circulating virus.
52:38And if it remains as deadly as it is, that's a serious issue.
52:40So we're gonna really want a vaccine.
52:42And we're not necessarily going to wanna have another flu
52:44vaccine that we have to get every year.
52:49So what we really want to do is target
52:51some genes that may be under more constraint
52:53than the receptor-binding protein gene, the spike gene.
52:57So anyway, the point is that looking at multiple genes,
53:00trying to understand where conserved regions are and where
53:03regions that are under selection are, is important.
53:05And we'll be doing that.
53:07And hopefully some of those results will
53:11help to guide the kind of generation of vaccines,
53:15and also the generation of therapeutics,
53:16because sites that are under
53:19selection are functional.
53:20So if you actually design a therapeutic
53:21that interferes with the sites that are under selection,
53:22it's sort of the opposite of vaccines. With vaccines,
53:25we really want to target something that just doesn't change.
53:26With therapeutics, we may want to target
53:27the changing regions, if we can design something
53:30that generically does, because those changing
53:31regions are functional.
53:32In other words, those sites at the end of the spike protein
53:33are clearly ones that do bind the ACE2 receptor.
53:35It's just that they're flexible
53:38about what they are in order to bind it.
53:42So we need to include
53:43all of those changing sites if we wanna develop
53:46a therapeutic that, for instance, would somehow interfere
53:50with the binding of the spike protein to ACE2 receptors.
53:53So thank you very much for listening to the ongoing work
53:56we're doing on COVID-19.
53:59I would love to entertain any questions that you have.
54:03Let me just take one moment to acknowledge
54:05some of the people that I should acknowledge in this work,
54:09I already showed you a picture of John John who was earlier
54:11the the picture and the associated with the Mac ml approach
54:13that we developed many years ago 10 years ago basically
54:15Yinfei Wu has been taking the lead on this project.
54:18She's a master's student.
54:19Yano os Wang was a visiting
54:22assistant professor. Stephen Gaugham
54:24is in the EEB department and
54:26has been helping out with this analysis.
54:28Haley Hassler is in my lab and has been helping out
54:30with phylogenetics. Jayveer Singh is an undergrad
54:32who's been doing some of the research work,
54:35some of the actual literature research
54:37that has helped us to contextualize
54:39the work we're doing. Mofeed Najib
54:41produced those diagrams of the spike protein
54:44with the sites that we have identified
54:46as under selection so far.
54:48Zheng Wang is a long-term collaborator of mine who works
54:54on nearly all the phylogenetic projects
54:56that I do, who works with me.
54:59And then Alex Thornburg is a long-term collaborator of mine,
55:02now in North Carolina.
55:06Well, he's currently at the North Carolina
55:08Museum of sciences, but he works on a lot of phylogenetic
55:11projects with me as well.
55:13And by the way, all of this, fortunately,
55:16was recently awarded one of the NSF RAPID grants
55:19to do this research.
55:20So we're very pleased to have funding to
55:22continue to work on this as time goes on, which is good
55:25because it's taking quite a lot of work
55:27to do the sequence wrangling
55:29and the analyses themselves.
55:30As I mentioned, they're computationally intensive.
55:32So Alex and I were the PIs on that particular
55:36grant from the NSF.
55:37So we're excited to continue to do that work.
55:41And with that, I think I would
55:42like to entertain any questions you might have.
55:45- Thank you, Jeff, this was great.
55:48I'm sure we have a lot of questions.
55:49Who goes first?
55:54Again, you can type the questions in the
55:59chat box or just unmute.
56:13- I have a quick question.
56:14- Okay.
56:16- You mentioned or you touched a bit on this before,
56:20but how would this compare to site-wise estimates
56:24of omega that you would get from PAML
56:28or a similar program?
56:29- So I'm sorry, I sort of was rushing at the end.
56:32I didn't explain that, in fact, I'm using PAML for some.
56:35So I'm using PAML
56:36for the pre zoonosis analysis, and for the post zoonosis
56:40analysis, because as I mentioned during the talk,
56:44if you have a large phylogeny
56:46with multiple branches, et cetera, et cetera,
56:49where you can look over that entire phylogeny then you
56:51can get multiple changes at individual sites,
56:53which is what PAML actually uses to infer selection, right?
56:55You have to have the site change not just once
56:57but twice or three times.
57:02And then it says, oh, that's under selection because
57:07it keeps changing again and again and again.
57:12So PAML allows you to do that
57:13if you have this sort of deep time
57:15or large amount of time and multiple lineages that you're
57:17looking at. The MASS-PRF approach that I'm using enables
57:19you to do that on just a single lineage without needing
57:22multiple changes. I mean, multiple changes
57:23on a single lineage you can't even detect,
57:25because it just looks like one change.
57:26If you have the ancestral sequence, which is what we do,
57:28ancestral state estimation, to get the ancestral sequence,
57:31and if you have the descendant sequence, and A changes
57:33to T, you don't know if it changed A to G to C to T
57:35or if it just changed A to T. You have no idea; you can
57:36just say it changed once.
57:38And so there's no real way to run PAML.
57:40There is a way, but it's statistically
57:41a really underpowered, terrible thing
57:42to do to try to run PAML on a single lineage
57:44and figure out whether something's under selection.
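[Editor's illustration] The point about hidden changes on a single lineage can be shown with a tiny sketch. The function names and the two example paths below are hypothetical, chosen only to mirror the A-to-T example in the talk; this is not part of PAML or MASS-PRF.

```python
def count_actual_changes(path):
    """Number of substitutions that really happened along a lineage."""
    return sum(1 for a, b in zip(path, path[1:]) if a != b)


def count_observed_changes(path):
    """Changes visible when only the ancestor and descendant are known."""
    return int(path[0] != path[-1])


# Two hypothetical histories for one site, both starting at A and ending at T.
direct_path = ["A", "T"]            # one substitution
multi_path = ["A", "G", "C", "T"]   # three substitutions

for path in (direct_path, multi_path):
    print(path, count_actual_changes(path), count_observed_changes(path))
```

Both histories look identical from the endpoints (one observed change), which is why repeated change at a site cannot be counted on a single lineage.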
57:47The advantage of this approach is that it
57:49can use that polymorphism data, the data of, like, what's
57:51just circulating within populations, as a metric for how
57:54much mutation is occurring.
57:56You can essentially divide out by that
57:59and then again, because we're integrating over all
58:04these models of how these things change, we're essentially
58:07borrowing information from neighboring sites for what their
58:10rates of change are, et cetera et cetera
58:13to estimate what the possible amount
58:15of selection is on all these sites.
58:16So by using the polymorphism data, and by doing this model
58:19averaging approach, we're actually able
58:21to take individual lineages and estimate
58:23the selection on them.
58:25And that's what we're doing in the near-zoonosis analysis
58:29that I showed you in the middle here.
58:33So there are different ways of doing the analysis.
58:35And it's necessitated by the fact that we just have this
58:37one lineage and there's no way it won't be a single lineage
58:39in any dataset we look at because for zoonosis,
58:42we're going to have human sequences,
58:44we're gonna have some animal sequences,
58:45we're not going to know we're not going
58:48to have any information about the actual zoonosis.
58:50Even if we knew the first human,
58:52we could just take that as an estimate.
58:54We still probably need some data here.
58:56Maybe you could have the first human
58:58and the first animal that you got it from.
59:00That just doesn't exist.
59:01We don't have that data for any zoonosis.
59:04How would we? We would never be there at the moment.
59:07So we have to assume that there's a number
59:09of transmissions among humans
59:10and a number of transmissions among animals
59:13during that near zoonotic period.
59:14And it's just a single lineage.
59:16So we can't really run PAML on that,
59:19in summary, because PAML requires multiple
59:21changes on multiple lineages to have power
59:23to actually infer evolutionary change.
59:25MASS-PRF, fortunately, can do that,
59:27because you can look at single lineages.
59:28You can use MK tests as well on single lineages;
59:33they're basically designed to look at single lineages.
59:36But the problem with MK tests, as I mentioned,
59:37is that they're assuming the entire
59:39gene is under selection, which means they don't give you
59:41the scope for understanding whether receptor
59:44binding sites are under selection or something like that.
59:46They often will just give you a result that the gene's not under
59:47selection, which is not true.
59:51- Does that answer your question?
59:54- Yes.
59:55- Great.
01:00:00- Any other questions?
01:00:04- I have one more if no one else wants to.
01:00:05- Sure, go ahead.
01:00:07- So in B cells, we have mechanisms
01:00:10of mutation that are specifically
01:00:13biased towards replacement mutations.
01:00:17So in the absence of selection,
01:00:18the mutation mechanisms actually cause
01:00:21an omega greater than one.
01:00:24Would this have any way of correcting for that?
01:00:28- So the tricky part is, and I don't know how it might,
01:00:31the tricky part is not so much running the software,
01:00:33which you could certainly do on that.
01:00:37The tricky part would be identifying
01:00:39what polymorphism is, in the case of those cells.
01:00:43So if you could identify sets of cells that are undergoing
01:00:47the mutation but aren't under selection in some way, then
01:00:51you could use that as the proxy for what we use here,
01:00:54which is within-population polymorphism,
01:00:57and then estimate that.
01:00:59I just don't know whether you have a way of
01:01:01doing that. If you want to discuss
01:01:03it with me, we could.
01:01:05That's sort of always the key for detecting selection.
01:01:09And, you know, many of you may be familiar that I work
01:01:11on cancer in some of the work that I do.
01:01:13It's the same
01:01:18problem that I'm working on there all the time. I'm trying
01:01:21to understand what the baseline mutation rates
01:01:23in cancer and somatic evolution of cells are.
01:01:25Because if I understand the baseline rates,
01:01:27how often those things change
01:01:29just from mutation alone,
01:01:30then I can always estimate selection.
01:01:32And that's the thing we almost always want to
01:01:34know about in the analysis of sequence data.
01:01:37So, again, it's all about figuring out if there's some piece
01:01:42of the data that can be used to estimate that polymorphism.
01:01:46And the benefit of an approach like
01:01:48this would be, you know, maybe you can estimate that for
01:01:50some portions of the gene, but not others, you know, maybe
01:01:52then there's a way that you could use this sort of model
01:01:54averaging approach to get at the underlying rate that it's
01:01:56happening, even if you can't estimate
01:01:58for that particular site, for instance.
01:02:00So I think there might be potential to do it,
01:02:02but it just depends, you know, on whether
01:02:04there's a critical, you know, set of data in what you're
01:02:09looking at, which I haven't spent much time
01:02:12looking at back in the day.
01:02:13So I wouldn't know whether there's some way
01:02:15of getting that baseline polymorphism or baseline
01:02:19mutation rate, which essentially amounts to the same thing.
01:02:23It just depends on whether, you know, you're assuming the
01:02:26population sort of has, you know,
01:02:29it's just whether you're looking at a population level,
01:02:31or you have some sort of covariance matrix
01:02:34to better understand the mutation rates themselves.
01:02:36- I think there is a similar population of B cells.
01:02:38- Great, so I encourage you to look into that.
01:02:44- Jeff, I have a quick question.
01:02:47I'm not too familiar with genome sequencing.
01:02:50But I think the clustering problem,
01:02:53the issue and the solution you have
01:02:55can be applied to many types of data.
01:02:58So I'm kind of confused.
01:02:59So in the diagram where you describe
01:03:02the different steps, you said that you first pick the most
01:03:06likely cluster and then you essentially
01:03:07keep splitting the clusters, right?
01:03:09How do you get the first clusters? Like,
01:03:12is there some randomness in how you split the first?
01:03:16- Oh, sorry, I apologize.
01:03:19I didn't explain it in enough detail.
01:03:22The reason why it's so computationally intensive
01:03:24is we look at all possible,
01:03:27all possible models, exhaustively.
01:03:29Now, I actually spent a year of my life trying
01:03:31to find a way to develop a Bayesian approach
01:03:34or some approach that would allow me
01:03:38to not look at all possible, you know, like to
01:03:40make this, because if you could do that,
01:03:41this would be a great way for doing tons of different things
01:03:45on very large data sets, right, large, like,
01:03:47and what amazed me is, I found that
01:03:50it was just an impenetrable problem.
01:03:53If I didn't look at every possible model,
01:03:56I could not get it to work. I couldn't prove that
01:04:00that's true; like, I don't have any proof that that's true.
01:04:04And I would encourage anyone who really wants to dive
01:04:05in there, go ahead.
01:04:06But I'll warn you that I spent a year
01:04:07banging my head against that problem.
01:04:09And when I didn't
01:04:10exhaustively search all the models, I could not, I always
01:04:12caused these biases, like there was no way to sample them.
01:04:16I even have ways of sampling the models
01:04:17according to their probability.
01:04:24But even that causes a bias because sometimes
01:04:31there's a large number.
01:04:31So if you look at the, if you think
01:04:34about the set of models, it's a very large set of models.
01:04:35And there isn't actually a huge amount
01:04:38of likelihood differences between these models.
01:04:42That's the thing.
01:04:45So when you don't exhaustively sample the models,
01:04:49if you just sample some of the most likely models,
01:04:53you actually are sampling just
01:04:56one corner of the space.
01:04:57And it's possible for a bunch of
01:04:59not quite so likely models, but reasonable models
01:05:00that are not in that corner to sort of be actually
01:05:03highly influential on the model average.
01:05:04And so the bottom line is, like, sampling
01:05:05by trying to pick in the, you know, most likely space doesn't
01:05:06work; sampling by picking randomly doesn't work.
01:05:07And I could go into more detail about it.
01:05:09But it turned out that I couldn't do it
01:05:10any way other than exhaustive sampling.
01:05:12So, I say that... Sorry, I misstated that.
01:05:14I couldn't do it by any biased approach,
01:05:16as opposed to the exhaustive sampling,
01:05:18the approach that I'm showing you right here.
01:05:21Actually, there are two ways of doing it.
01:05:22One is to sample stochastically,
01:05:23according to likelihood, and the other is to sample exactly
01:05:27across all exhausted sampling significantly works.
01:05:30In fact, it's implemented in the approach that I
01:05:33was just showing, I'm sorry, I just sort of jumped too fast
01:05:35to say what I was saying.
01:05:37So sampling stochastically works
01:05:38and sampling exhaustively work sampling stochastically is
01:05:40still very computationally intensive.
01:05:42But there's no I couldn't
01:05:44find any way to sort of, you know, important sample or do
01:05:48some sort of approach that would allow me to get a smaller
01:05:50set of models. If we could do that,
01:05:53that could be really important,
01:05:55because then you could do this
01:05:57on more than, like, 2,000 sites.
01:05:59It's somewhere around 2,000 sites
01:06:00that you start running into real problems with
01:06:04just too much computation time
01:06:06to make it worthwhile.
01:06:07So we could extend this to 10,000, 100,000, you know,
01:06:11potentially really, really large numbers of sites,
01:06:13and really, really sparse sets of sites.
01:06:16If only we could find a way
01:06:19to bias the sampling towards models that are more likely
01:06:24without causing biases in the results.
01:06:26I couldn't find any way to do that.
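[Editor's illustration] The idea of averaging over every clustering model, rather than over only the most likely ones, can be sketched on a toy sequence. The code below is a simplified brute-force sketch under stated assumptions, not the MACML or MASS-PRF implementation: each model is a way of cutting a 0/1 event sequence into contiguous segments, each segment gets its own maximum-likelihood event rate, and models are weighted by an Akaike-style weight exp(log-likelihood minus number of rates). The function names, the weighting choice, and the example sequence are all hypothetical.

```python
from itertools import combinations
from math import exp, log


def segment_loglik(events):
    """Bernoulli log-likelihood of one segment at its MLE rate, plus the rate."""
    m, k = len(events), sum(events)
    p = k / m
    if k == 0 or k == m:
        return 0.0, p  # a rate of 0 or 1 fits the segment perfectly
    return k * log(p) + (m - k) * log(1 - p), p


def model_averaged_rates(seq):
    """Average per-site rates over ALL contiguous partitions of seq."""
    n = len(seq)
    totals = [0.0] * n
    weight_sum = 0.0
    n_models = 0
    for n_cuts in range(n):  # 0 .. n-1 breakpoints
        for cuts in combinations(range(1, n), n_cuts):
            bounds = (0, *cuts, n)
            loglik, rates = 0.0, []
            for start, end in zip(bounds, bounds[1:]):
                ll, p = segment_loglik(seq[start:end])
                loglik += ll
                rates.extend([p] * (end - start))
            # Akaike-style weight; one parameter per segment (a simplification).
            weight = exp(loglik - (n_cuts + 1))
            weight_sum += weight
            for i, p in enumerate(rates):
                totals[i] += weight * p
            n_models += 1
    return [t / weight_sum for t in totals], n_models


# Hypothetical sequence: a small cluster of events inside a quiet region.
toy = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]
rates, n_models = model_averaged_rates(toy)
print(n_models, [round(r, 3) for r in rates])
```

A sequence of n sites has 2**(n-1) such partitions, so this brute-force version is only usable at toy lengths. It shows why many individually unremarkable models can still shape the average, and why the computational cost discussed above grows so quickly with the number of sites.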
01:06:27- This seems very much related to tree-based
01:06:30methods where essentially you, like, split the space
01:06:36and then you average over models,
01:06:39like the random forest, for example,
01:06:41so it is very much related to that, right?
01:06:45- Yeah, I have to say I'm now familiar
01:06:47with those approaches.
01:06:49But when I was completely unfamiliar with them, yeah, I sort
01:06:52of thought about it that way.
01:06:54But you're absolutely right.
01:06:56Yeah, I guess the difference is that here
01:06:57you have a sequence, like one sequence;
01:07:00there you have a space.
01:07:01So you just split in
01:07:02different dimensions, but it is really good.
01:07:05- And I can mention, just to speculate,
01:07:10I'm kind of interested in a number of
01:07:13other ways of applying this.
01:07:15So for instance, the one I've been thinking about
01:07:18and actually worked on a little
01:07:20bit, but haven't gotten very far with, is, like,
01:07:21when you're dealing with event spaces over time,
01:07:22like if you have days, and you have individuals, like,
01:07:24prominent for us in public health,
01:07:27like individuals who are undergoing events,
01:07:29you end up with a very sparse matrix of events.
01:07:31And so we use these approaches like survival plots,
01:07:38all these approaches that we use to sort of understand
01:07:40how these rare events are happening,
01:07:42and how people are changing over this.
01:07:44That event space is actually really sparse.
01:07:45But it's kind of a matrix.
01:07:47And you could do this in two dimensions,
01:07:48not just one, right?
01:07:49So you could model average across two dimensions,
01:07:52and then you could get something.
01:07:53The thing that really appeals to me about that is that,
01:07:55again, this approach really
01:08:00only builds up from this binomial event,
01:08:04no event, stuff, a picture that's very continuous
01:08:09over the space and involves no assumptions
01:08:11about distribution whatsoever.
01:08:12So I'm just wondering if there aren't instances
01:08:14where, you know, we could come up
01:08:17with a better understanding of what's going on
01:08:19with individuals in a matrix such as
01:08:20that by using this approach.
01:08:22And it's an approach
01:08:23that still works even with these sparse spaces, because
01:08:26you can model average over these tremendously large numbers
01:08:29of models that all have fairly
01:08:33equal likelihood to get a result.
01:08:35So I don't know, that's just sort of a
01:08:37speculation that there might be some interesting
01:08:38ways to approach those problems using this kind
01:08:41of model averaging technique.
01:08:46- Great, I think we should wrap up.
01:08:49Thank you, Jeff, for this great presentation.
01:08:52And thank you all for joining today.
01:08:57See you next time. The next seminar
01:08:58is gonna be, I think, July 14.
01:09:01So we'll send out invites.
01:09:05All right, thank you, Jeff.
01:09:07Thank you all, bye, bye.