Biostatistics Seminar: BETS: The dangers of selection bias in early analyses of the coronavirus disease (COVID-19) pandemic

May 06, 2020

Information

Qingyuan Zhao, Statistical Laboratory, University of Cambridge

May 5, 2020

ID5158

00:03- All right, and it says the meeting is being recorded.
00:06Okay, so thanks everyone,
00:10for coming to this seminar.
00:13And I hope everyone is doing well.
00:17Today, I'm going to talk about some issues
00:21of selection bias in early analysis
00:24of the COVID-19 pandemic.
00:28You can find the manuscript online, on arXiv,
00:31and the slides of this talk are also available on my webpage.
00:38So, here are the three collaborators,
00:42involved in this project.
00:45So Nianqiao is a PhD student at Harvard,
00:48and we kind of only met online.
00:50We never met in person, and I sort of created
00:54a dataset in January, and I wanted some help,
00:58and somehow she saw this and she said: I could help you.
01:03And we kind of developed a collaboration.
01:08And Sergio and Rajen are both, ah,
01:12lecturers in the Stats Lab in Cambridge.
01:18And I'd like to thank many, many people
01:19who have given us very helpful suggestions.
01:23This is just some of them.
01:28I'd like to begin with just saying COVID-19
01:32is personal for everyone, and what I would share
01:37is partly my story, my personal story with COVID-19.
01:44So here is a photo of me and my parents,
01:50taken last September, when I went back to China,
01:56to see my family.
01:58So both myself and my parents,
02:01we all grew up in Wuhan, China.
02:06And on a sunny day in September, we went to,
02:10well, this is the Yellow Crane Tower,
02:13a sort of landmark building in Wuhan.
02:17And the funny thing is, I think I've never been there,
02:20on top of the tower, in my entire life.
02:24And this is actually the first time I went there.
02:27This is something like if you have a famous local attraction
02:32for tourists, you actually don't go, as a local.
02:39And so, on January 23, because the epidemic
02:43was growing so fast in Wuhan, it started a lockdown.
02:52So, if we went on top of the Yellow Crane Tower,
02:56this is what we would see on a typical day,
03:00before the lockdown.
03:02And on the right, so, there's sort of what happens
03:06after the lockdown, and I liked how the journalist
03:10used sort of this gloomy weather as the background,
03:13and certainly reflected everybody's mood,
03:17after the lockdown.
03:21So, this project begins on January 29.
03:26So I had a conversation with my parents over the phone,
03:30and they told me that a close relative of ours
03:35was just diagnosed with, quote/unquote, viral pneumonia.
03:41So, basically at that point, we all thought that must
03:45be COVID-19, but because there were not enough tests,
03:51this relative could not get confirmed.
03:54And this prompted me to start looking
03:56through the data available at the time.
03:59But I quickly realized that the epidemiological data
04:03from Wuhan are very unreliable.
04:07And here is some anecdotal evidence.
04:10The first evidence is about inadequate testing.
04:16So actually this relative of mine could not get
04:18an RT-PCR test until mid-February,
04:22and she actually developed symptoms on about January 20.
04:29So by mid-February, she was already recovering.
04:33And she took, I think, several tests.
04:36Her first test was actually negative,
04:38and a few days later she was tested again,
04:40and the result came back positive.
04:43So there's also a lot of false negative tests.
04:46I think, in general.
04:49And another problem with the epidemiological data from Wuhan
04:53is insufficient contact tracing.
04:56So, her husband, this relative of mine's husband,
05:03he also showed COVID symptoms, but he quickly recovered
05:08from that, and in the end he was never tested for COVID.
05:17So, you can also see the insufficient testing
05:19from this incidence plot.
05:22So this is the daily confirmed cases, up until mid-February,
05:29and this is when the travel ban started,
05:33or the lockdown started, January 23,
05:36and on February 12, there was a huge spike
05:41of over 10,000 cases, much more than the previous few weeks.
05:50And the reason for that was not suddenly because people
05:54were infected on that date.
05:57It's because of a change of diagnostic criterion.
06:01So before February 12,
06:04everybody needs to have a positive RT-PCR test
06:10to be confirmed a COVID-19 case.
06:13But since February 12, because there,
06:16the health system in Wuhan was so overwhelmed,
06:20the government decided to change diagnostic criterion.
06:23So without RT-PCR tests, you can still be diagnosed
06:28with COVID-19 if you satisfy several other criteria.
06:34And this sort of change in diagnostic criteria
06:37only happened in the Hubei Province
06:41and not elsewhere in China.
06:45So a solution, if we like to avoid these problems
06:50with data from Wuhan, so one clever solution
06:55is to use cases that are reported from, sorry,
06:58exported from Wuhan.
07:01So this has two benefits.
07:03First of all, testing and contact tracing
07:05were quite intensive in other locations.
07:09So, it's reasonable to expect that a lot of the bias
07:13due to sort of under-ascertainment will be less severe
07:16if we use data from elsewhere.
07:20And also, many locations, particularly in some cities
07:26in China, published detailed case reports,
07:31instead of just case counts.
07:34And if you look at these detailed case reports there is
07:36a lot of information that can be used for inference.
07:44This is not our idea.
07:47And I think one of the, at least one of the first persons
07:51to use this design was a report from Neil Ferguson's group
07:57in Imperial College, London,
07:59and they published a report on January 17,
08:03and what it did was a simple sort of division of the number
08:07of cases detected internationally, over the number
08:11of people who traveled from Wuhan, internationally.
08:15And they found that it could be
08:18over 1,700 cases by January 17, in Wuhan.
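
The division described here can be sketched in a few lines. This is only an illustration of the idea, not the report's actual calculation: the function name is ours, the inputs are hypothetical placeholders rather than figures quoted in the talk, and the sketch assumes travelers are infected at the same rate as the rest of the city and that every infected traveler is detected abroad.

```python
def estimate_total_cases(exported_cases, travelers, source_population):
    """Scale the infection rate seen among travelers up to the source population.

    Assumes travelers are infected at the same rate as everyone else and that
    every infected traveler is detected abroad (both are strong assumptions).
    """
    rate_among_travelers = exported_cases / travelers
    return rate_among_travelers * source_population


# Hypothetical placeholder inputs, not figures quoted in the talk.
estimate = estimate_total_cases(exported_cases=3,
                                travelers=30_000,
                                source_population=20_000_000)
print(round(estimate))  # 2000 with these placeholder inputs
```

A handful of exported cases can therefore imply thousands of cases at the source, provided the number of travelers is small relative to the population.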
08:26So, I started this on January 29,
08:30and within about two weeks, managed to put something online.
08:37in which we also used internationally confirmed cases
08:40to estimate epidemic growth.
08:44And what we used were 46 coronavirus cases
08:48who traveled from Wuhan and then were subsequently confirmed
08:53in six Asian countries and regions.
08:59And the main result was that the epidemic was doubling
09:02in size every 2.9 days.
09:06And we used a Bayesian analysis, and the 95 percent
09:10credible interval was two to 4.1.
09:14And of course, when I was writing this article,
09:17I was mostly just working on this dataset that we collected,
09:22very hard and (muttering), thinking about what model
09:27is suitable for this kind of data.
09:30And just before I posted this pre-print,
09:34I realized there was a similar article
09:38that was already published in The Lancet, on January 31.
09:45And what's really puzzling is they used almost the same data
09:51and very similar models, but somehow reached
09:54completely different conclusions.
09:58So they used data from December 31 to January 28,
10:02that are exported from Wuhan internationally.
10:05And they would like to infer the number
10:07of infections in Wuhan.
10:10And one of the main results,
10:12which was this epidemic doubling time, was 6.4 days,
10:16and the 95 percent credible interval was 5.8 to 7.1.
10:21So that's drastically different from ours.
10:24So again, ours was 2.9, within two to four,
10:29and this was 6.4.
10:33And this is talking about the doubling time.
10:36So the doubling time of six days versus three days,
10:40that's sort of really, really different.
10:43And the confidence intervals, the credible intervals
10:45didn't even overlap.
10:49So I was really puzzled by this.
10:52And before I tell you what I think,
10:58how the Lancet paper got it wrong,
11:01I'd like to just show you this plot.
11:03You probably have seen this many times before,
11:05in news articles, which is just sort of a logarithm
11:10of the total cases versus the days, ah,
11:16or some time, zero, for each country.
11:21And what you see is for both the total number of cases
11:26and the total number of deaths,
11:29it sort of grew about 100-fold in the first 20 days.
11:35At least among these countries
11:36that were most hard-hit by COVID-19.
11:42And if you just use that as a very rough estimate
11:45of the doubling time, that corresponds
11:47to a doubling time of three days.
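
As a quick check of that arithmetic, here is a minimal sketch. The function names are ours, the inputs are the round numbers quoted in the talk (100-fold growth in 20 days, and the 6.4-day doubling time), and pure exponential growth is assumed throughout.

```python
import math


def doubling_time(fold_growth, days):
    """Doubling time implied by `fold_growth`-fold growth over `days` days."""
    return days * math.log(2) / math.log(fold_growth)


def fold_growth(doubling_days, days):
    """Fold growth over `days` days for a given doubling time."""
    return 2 ** (days / doubling_days)


print(round(doubling_time(100, 20), 2))  # 3.01 days
print(round(fold_growth(6.4, 20), 1))    # 8.7-fold growth in 20 days
```

So a 6.4-day doubling time would give roughly 9-fold growth over 20 days, far short of the 100-fold growth seen in the plot.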
11:52Of course, this is sort of very kind of anecdotal,
11:56because this data were not collected in a very careful way,
12:01and a lot of cases were not reported,
12:04but this is just to show you that perhaps
12:07the doubling time of 6.4 days was just a bit too long.
12:14So, towards the end of the talk,
12:17I'll tell you what we think led
12:21to these very different results.
12:24Just some spoilers, so the crucial difference
12:30is that the Lancet study actually did not
12:33take into account the travel ban on January 23.
12:38And that actually had a very,
12:39very substantial selection effect on the data.
12:45And this will be made precise later on in the talk.
12:53So, for the rest of the talk,
12:54I'll first give you an overview of selection bias.
12:57So no math, just sort of an outline of what kind
13:01of selection bias you could encounter in COVID-19 studies.
13:05Then I'll talk about how we sort of overcome them,
13:08by sort of collecting the dataset very carefully
13:12and building a model very carefully.
13:17And then I'll talk about why
13:20the Lancet study I just mentioned
13:22and some other early analysis were severely biased.
13:26If there is time, I will tell you a little bit
13:29about our Bayesian nonparametric model.
13:33And then I'll give you some lessons
13:36I learned from this work.
13:40So selection bias.
13:42So we identified at least five kinds
13:46of selection bias in COVID-19 studies.
13:49So the first one is due to under-ascertainment.
13:53So this may occur if symptomatic patients
13:56do not seek healthcare, or could not be diagnosed.
14:00So essentially, all studies using cases confirmed
14:04when testing is insufficient,
14:08would be susceptible to this kind of bias.
14:11And there is no cure to this.
14:14It may lead to varied kind of direction and magnitude
14:21of bias, and basically what we can do is to,
14:27to think about a clever design to avoid this problem,
14:32to focus on locations where the testing is intensive.
14:42The second bias is due to non-random sample selection.
14:48So, basically this means that the cases included
14:51in the study are not representative of the population.
14:56So this essentially applies to all studies,
15:03because detailed information about COVID-19 cases
15:06are usually sparse; they're not always published.
15:11But especially for studies that do not have a clear
15:14inclusion criterion, and if they just sort of simply
15:19collect data out of convenience, then there could be
15:25a lot of non-random sample selection bias.
15:30And again, statistical models are not really gonna help you
15:33with this kind of bias.
15:35You'd use, you'd follow some protocol for data collection,
15:40and you would exclude some data that do not meet
15:44the sample inclusion criterion.
15:47Even when that may lead to inefficient estimates.
15:57The third bias is due to the travel ban.
16:00This is kind of my spoiler about that Lancet study.
16:06So basically, outbound travel from Wuhan
16:09to anywhere else was banned from January 23 to April 8.
16:16So if the study analyzed cases exported from Wuhan,
16:21then they're susceptible to this selection effect.
16:27And this would usually lead to underestimation
16:31of epidemic growth, and the reason is that, so,
16:35the epidemic is growing very fast,
16:37but then you essentially can't observe cases
16:41that were supposed to leave Wuhan after January 23.
16:44So if you just wait for a long time,
16:47and then look at the epidemic curve among the cases
16:50exported from Wuhan, it may appear that, ah,
16:55it sort of dies down a little bit,
16:58but that's not because of the epidemic being controlled.
17:01That's because of the travel ban.
17:04And fortunately this bias, you can correct for it
17:08by deriving some likelihood function
17:10tailored for the travel restrictions.
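
To make the travel-ban effect concrete, here is a toy simulation. It is not the model from the talk. It assumes infections grow exponentially with a 2.5-day doubling time up to a lockdown at day 54, and that each person has a planned departure time drawn uniformly over a hypothetical 108-day horizon, independent of infection. All names and parameter values are illustrative assumptions.

```python
import math
import random


def week_ratio_doubling_time(times, end, week=7.0):
    """Doubling time implied by comparing counts in the last week before `end`
    with the week before that, assuming pure exponential growth."""
    last = sum(1 for t in times if end - week <= t < end)
    prev = sum(1 for t in times if end - 2 * week <= t < end - week)
    return week * math.log(2) / math.log(last / prev)


def simulate(n=200_000, doubling=2.5, lockdown=54.0, horizon=108.0, seed=0):
    rng = random.Random(seed)
    r = math.log(2) / doubling
    infected, exported = [], []
    for _ in range(n):
        # Infection time with density proportional to exp(r * t) on [0, lockdown].
        t = math.log(1.0 + rng.random() * (math.exp(r * lockdown) - 1.0)) / r
        # Hypothetical planned departure time, uniform and independent of infection.
        e = rng.uniform(0.0, horizon)
        infected.append(t)
        if t <= e <= lockdown:  # infected in Wuhan AND left before the travel ban
            exported.append(t)
    return (week_ratio_doubling_time(infected, lockdown),
            week_ratio_doubling_time(exported, lockdown))


all_cases_dt, exported_dt = simulate()
print(f"all infections: {all_cases_dt:.1f} days; exported only: {exported_dt:.1f} days")
```

In this toy setup the estimate from all infections recovers roughly the assumed doubling time, while the same calculation on exported cases alone gives a much longer one, because people infected late rarely manage to leave before the ban.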
17:15The fourth bias is ignoring, is due to ignoring
17:20the epidemic growth, and basically if you think about people
17:25who have been in Wuhan before January 23,
17:29they're much more likely to be infected
17:31towards the end of their exposure period than early,
17:37and that's because the epidemic was growing quickly.
17:42So, there are many studies, or I should say
17:45there are several studies of the incubation period
17:48that simply treat infections as uniformly distributed
17:52over the patients' exposure period to Wuhan.
17:57And this will lead to overestimation
17:59of the incubation period.
18:02Because actually, the infection time is much,
18:04much closer to sort of the end of their exposure.
18:11And this is also a bias that can be corrected for,
18:15by doing statistical analysis carefully.
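
A small worked example shows the size of this effect. It compares the expected gap between infection and the end of exposure under a uniform assumption with the gap when infections follow a density proportional to exp(r t) within the exposure window. The function names, the 10-day window, and the 2.5-day doubling time are illustrative assumptions, not values from the study.

```python
import math


def mean_gap_uniform(window):
    """Expected time from infection to end of exposure if infection is uniform."""
    return window / 2.0


def mean_gap_growing(window, r):
    """Same expectation when infection density is proportional to exp(r * t).

    The gap then follows an exponential distribution with rate r,
    truncated to [0, window].
    """
    return 1.0 / r - window / math.expm1(r * window)


r = math.log(2) / 2.5  # growth rate for a 2.5-day doubling time
print(mean_gap_uniform(10.0))               # 5.0 days
print(round(mean_gap_growing(10.0, r), 2))  # 2.94 days
```

Under these assumptions, treating infection as uniform places it about two days too early on average, and that error is added to the estimated incubation period.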
18:21The fifth and last bias is due to right-truncation.
18:25So this happens in early analysis because,
18:30to sort of win time to battle for this epidemic,
18:36and to publish sort of fast.
18:38So as you all know, there's a race for publications
18:43about COVID-19; a lot of people sort of truncated
18:48the dataset before a certain time,
18:52but by that time the epidemic maybe
18:54was still quickly growing or evolving.
18:58And this could lead to some right-truncation bias.
19:03And this generally would lead to underestimation
19:07of the incubation period.
19:10So this is, so incubation period, I forgot to mention,
19:13is just the time between infection to showing symptoms.
19:20So, right-truncation would lead to underestimation
19:22of incubation period, because people with longer
19:26incubation period may not have shown symptoms
19:31by the time that these datasets were collected.
19:38So the solution to this is we need to both collect cases
19:45that meet the selection criterion, and continue
19:48that data collection until a sufficiently long time.
19:54Or, you derive some likelihood function to correct
19:59for the right-truncation.
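
The right-truncation effect can be illustrated with a toy simulation. It is a sketch under assumed values, not the analysis from the talk: infections grow exponentially with a 2.5-day doubling time, incubation periods are gamma-distributed with a hypothetical shape of 2 and scale of 2.6 days, and a case enters the dataset only if symptoms appear before the data cutoff.

```python
import math
import random


def simulate_truncation(n=100_000, doubling=2.5, cutoff=54.0,
                        shape=2.0, scale=2.6, seed=0):
    rng = random.Random(seed)
    r = math.log(2) / doubling
    all_incubations, observed = [], []
    for _ in range(n):
        # Infection time with density proportional to exp(r * t) on [0, cutoff].
        t = math.log(1.0 + rng.random() * (math.exp(r * cutoff) - 1.0)) / r
        # Incubation period; random.gammavariate takes (shape, scale).
        u = rng.gammavariate(shape, scale)
        all_incubations.append(u)
        if t + u <= cutoff:  # symptoms appeared before the dataset was frozen
            observed.append(u)
    return (sum(all_incubations) / len(all_incubations),
            sum(observed) / len(observed))


true_mean, observed_mean = simulate_truncation()
print(f"true mean: {true_mean:.1f} days; mean among observed cases: {observed_mean:.1f} days")
```

Because most infections happen shortly before the cutoff, only the cases with short incubation periods have had time to show symptoms, so the observed mean falls well below the true one.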
20:00So we'll go over this later.
20:04So just to recap,
20:07so on a very high level, there are at least five
20:11kinds of biases in COVID-19 analysis.
20:15And if you read sort of preprints or news articles,
20:20I think you will find some kind, I mean,
20:24some resemblance of these biases in many studies.
20:30And the key to avoiding selection bias is basically,
20:35I mean, this is simple in words,
20:38but you just do everything carefully.
20:40You design the study carefully,
20:42and collect the sample carefully,
20:45and analyze the data carefully.
20:47But the reality, of course, is not that simple.
20:51And what I will show below, it's an example
20:55of our attempt to eliminate or to reduce selection bias,
21:02as much as possible.
21:06So, let me tell you the dataset we collected.
21:10So we found 14 locations in Asia,
21:16some are international, so Japan, South Korea, Taiwan,
21:21Hong Kong, Macau, Singapore.
21:23Some are sort of in mainland China.
21:27So there are several cities in mainland China.
21:31So all these locations have published detailed case reports
21:36from their first local case.
21:40So, most of the Chinese locations, I mean,
21:43they were done with the first wave of the epidemic
21:46by the end of February.
21:49So Japan, Korea and Singapore saw some resurgence
21:54of the epidemic later on, and eventually,
21:57they did not publish detailed case reports.
22:02But for our purposes, these locations all published
22:07detailed reports before mid-February,
22:11and that's about three weeks after the lockdown of Wuhan.
22:15So it's pretty much enough to find out
22:19all the Wuhan exported cases.
22:24So just to give you a sense of the kind of data
22:28that we collected, this is sort of all
22:32the important columns in the dataset,
22:36and the particularly important columns are marked in red.
22:42So, we collected, there was a case ID,
22:49where the case lived, the gender, the age,
22:54whether they had known epidemiological contact
22:57with other confirmed cases, whether it has
23:02known relationship with other confirmed cases.
23:07This is sort of an interesting column
23:09that basically we like to find out what cases were
23:15exported from Wuhan, but that's, of course, not recorded.
23:20I mean you can only infer that from what has been published.
23:27So this is an attempt to do that.
23:28So this column, outside column means that,
23:32whether we think the data collector thinks
23:35this case is transmitted outside Wuhan.
23:39So most of the time, this is relatively easy to fill.
23:45For example, if you've never been to Wuhan,
23:47this entry must be yes.
23:50But sometimes, this can be a little bit tricky.
23:52For example, this person, the fifth case in Hong Kong,
23:56is the husband of the fourth case in Hong Kong,
24:00and they traveled together from Wuhan to Hong Kong.
24:04So it's unclear if this case is transmitted
24:11in or outside Wuhan, so we put a "likely" there.
24:16And the other information are some dates,
24:20the beginning of stay in Wuhan, the end of stay in Wuhan,
24:26the period of exposure, which would equal to
24:30beginning to the end of stay in Wuhan,
24:33for Wuhan exported cases,
24:35but can be different for other cases.
24:41When the person, when the case arrived at a final location
24:44where they are confirmed a COVID-19 case.
24:48When the person showed symptoms.
24:51When did they first go to a hospital,
24:54and when were they confirmed a COVID-19 case.
24:59So we collected about 1,400 cases with all this information.
25:05And overall, I think our dataset was relatively high
25:11in quality, and most of the cases had known symptom onset
25:18dates; only nine percent of them have that entry missing.
25:27So,
25:30so one important step after this is to find out
25:33which cases are actually exported from Wuhan.
25:37So I've been using this terminology from the beginning
25:41of the talk, but basically the case is Wuhan exported
25:45if they are infected, if they were infected in Wuhan.
25:50And then confirmed elsewhere.
25:53So we had a sample selection criterion
25:58to discern a Wuhan exported case.
26:03I'm not going to go over it in detail,
26:06but basically the principle we followed
26:09is that we would only consider a case as Wuhan exported
26:14if it passed a "beyond a reasonable doubt" test.
26:19So basically, if we think there is a reasonable doubt
26:21that the case could be infected elsewhere,
26:26then we would say: let's exclude that from the dataset.
26:31So this eventually gives us 378 cases.
26:39Next I'm gonna talk about the model we used.
26:46So the model is called: BETS.
26:48It's named after sort of four key epidemiological events.
26:53The beginning of exposure, the end of exposure,
26:56time of transmission, which is usually unobserved,
27:01and the time of symptom onset, S.
27:06So what we will do below is we'll first define the support
27:13of these variables, so we call that P.
27:17Which basically represents the Wuhan exposed population.
27:24So this is the population we would like to study.
27:28We will then construct a generative model
27:31for these random variables.
27:34Basically, for everyone in the Wuhan exposed population.
27:39Then, to consider the sample selection,
27:42we'll define a sample selection set, D,
27:46that corresponds to cases that are exported from Wuhan.
27:51Then finally we will derive likelihood functions
27:54to adjust for the sample selection.
27:57So essentially, what we're trying to infer is
28:01the disease dynamics in the population, P,
28:05but we only have data from this sample, D.
28:11So there's a lot of work that needs to be done
28:14to correct for that sample selection.
28:20So intuitively, this population P are just all people
28:23who have stayed in Wuhan, between December first
28:29and January 24, so anyone who has been in Wuhan
28:36for maybe even just a few hours,
28:39they would count as someone exposed to Wuhan.
28:45And I'm going to make some conventions to simplify
28:51this set, P, a little bit.
28:54So B equals to zero has a special meaning.
28:59So, so zero is the time zero,
29:02which is 12 AM of December one.
29:06And it means that they actually started their stay in Wuhan
29:11before time zero, so they live in Wuhan essentially.
29:16And B greater than zero means these other cases
29:21visited Wuhan sometime in the middle of this period,
29:25and then they left Wuhan.
29:29So E equals to infinity means that the case did not arrive
29:33in the 14 locations we are considering
29:36before this lockdown time, L.
29:41So for the purpose of our study,
29:42we did not need to differentiate between people who
29:45have always stayed in Wuhan past time L,
29:49or people who left Wuhan before time L,
29:52but went to a different location
29:55other than the ones we are considering.
29:58So T equals to infinity means that the cases
30:02were not infected during their stay in Wuhan.
30:06So this could be infected outside Wuhan,
30:08or it could be they were never infected.
30:12And S equals to infinity means that the case
30:16did not show symptoms of COVID-19,
30:19and it can simply be, they were never infected.
30:22Or the case was actually tested positive for COVID-19,
30:27but never showed symptoms, so it's, they're asymptomatic.
30:34So under these conventions, this is the set,
30:38this is the support for this population, P.
30:41So B is between zero and L,
30:44E is between B and L or infinity,
30:47T is between B and E, which means that they are
30:51infected in Wuhan, or infinity.
30:54And S is between T and infinity,
30:56and S can be equal to infinity.
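
Written out as a set, the support just described would look roughly as follows. The notation is ours, and whether the interval endpoints are included is an assumption, since the talk only says "between".

```latex
\mathcal{P} = \bigl\{ (b, e, t, s) \;:\;
  b \in [0, L],\;
  e \in [b, L] \cup \{\infty\},\;
  t \in [b, e] \cup \{\infty\},\;
  s \in [t, \infty] \bigr\}
```

Note that when t is infinite (no infection in Wuhan), the last condition forces s to be infinite as well.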
31:00So now we have defined this population, P.
31:04And now let's look at a generative model,
31:09a data-generating model for this population.
31:15So, by the basic law of probability,
31:18we could decompose the joint distribution
31:21of BETS into these four, and the first two
31:25are the distribution of B and E.
31:27They are related to travel.
31:30The second one, sorry, the third one is the distribution
31:32of T given B and E.
31:35So that's about the disease transmission.
31:38And the last one is the distribution of S,
31:41given BET, and that's related to disease progression.
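
In symbols, the decomposition described above is the chain rule applied in the order B, E, T, S. The subscripted density names are our notation, not necessarily the one on the slides.

```latex
f(b, e, t, s) \;=\;
  \underbrace{f_B(b)\, f_E(e \mid b)}_{\text{travel}}
  \;\cdot\;
  \underbrace{f_T(t \mid b, e)}_{\text{disease transmission}}
  \;\cdot\;
  \underbrace{f_S(s \mid b, e, t)}_{\text{disease progression}}
```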
31:47So we need to make two basic assumptions,
31:50and they are important because we would like to infer
31:54what's going on in the population P,
31:57from the sample D, from these Wuhan exported cases.
32:02So we need to sort of make assumptions
32:05so we can make that extrapolation.
32:08So the first assumption, we assume it's about
32:11this disease transmission, and it basically means
32:14that the disease transmission is independent of travel.
32:18So there is a basic sort of function that's independent
32:22of the travel that's growing over time.
32:27And then there's the rest of the points mass at infinity.
32:33This g function, so, it will appear later on.
32:37It's the epidemic growth function.
32:40The second assumption is that the disease progression
32:43is also independent of travel.
32:46So, what's assumed here is basically
32:49that there is one minus mu of the infections,
32:56that are asymptomatic in that they didn't show symptoms.
33:00Among people who showed symptoms,
33:03the incubation period, which is just S minus T,
33:07follows this distribution, H.
33:11Okay, so H is the density of the incubation period,
33:14for symptomatic cases.
33:17And this whole distribution does not depend on B and E.
33:24So these are sort of the two basic assumptions
33:26that we relied on.
33:30There are two further parametric assumptions
33:32that were useful to simplify the interpretation,
33:37but they can be relaxed.
33:41So the next, one assumption is the epidemic
33:45was growing exponentially before the lockdown.
33:51And then that, the other assumption is that the incubation
33:54period is gamma-distributed, okay?
33:58So there's some parameters, kappa, R and alpha, beta.
34:05So, don't worry about nuisance parameter mu,
34:09which is the proportion of asymptomatic cases.
34:12And kappa, which is some baseline transmission.
34:16So it turns out that they would be canceled
34:19in the likelihood function, so they won't appear
34:23in the likelihood function.
34:25And (muttering) these parametric assumptions,
34:28they can be relaxed and they will be relaxed
34:32in the Bayesian nonparametric analysis, if I can get to there.
34:38But essentially, these are very useful assumptions
34:42that allow us to derive formulas explicitly.
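
The two basic assumptions and the two parametric assumptions can be summarized as below. This is a sketch in our own notation: g stands for the epidemic growth function, h for the incubation period density, a fraction 1 - mu of infections is taken to be asymptomatic as stated when the second assumption was introduced, and treating beta as the rate (rather than scale) parameter of the gamma distribution is an assumption.

```latex
\begin{aligned}
f_T(t \mid b, e) &=
  \begin{cases}
    g(t), & b \le t \le e, \\
    1 - \int_b^e g(x)\,dx, & t = \infty,
  \end{cases}
\\
f_S(s \mid b, e, t) &=
  \begin{cases}
    \mu\, h(s - t), & t \le s < \infty, \\
    1 - \mu, & s = \infty,
  \end{cases}
  \qquad \text{for } t < \infty,
\\
g(t) &= \kappa\, e^{r t}, \qquad
h(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}, \quad x > 0 .
\end{aligned}
```

Neither conditional density depends on b and e except through the range of t, which is what "independent of travel" means here.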
34:50So I have covered the full data BETS model
34:56for the population P.
34:58Now we need to look at what we can observe.
35:02So what we can observe are people in P
35:07that satisfy three additional restrictions.
35:12The first restriction is that the transmission
35:15is during their exposure to Wuhan.
35:23The second restriction is that the case needs to leave
35:27Wuhan before the lockdown time, L.
35:31The third restriction is that the case
35:33needs to show symptoms.
35:36So S is less than infinity.
35:39So some of the locations we considered
35:41did report a few asymptomatic cases, but overall,
35:46asymptomatic ascertainment was very inconsistent.
35:50So we only considered cases who showed symptoms.
35:56So this gives us the set of samples
36:01that we can observe in our data.
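
The three restrictions can be written as a simple predicate. This is a minimal sketch: the function name is ours, times are in days since time zero, infinity encodes "never" as in the conventions above, and the lockdown value of 54 used in the examples is an illustrative assumption.

```python
import math


def is_observable(b, e, t, s, lockdown):
    """True if a member of P with times (b, e, t, s) would fall in the sample D."""
    infected_during_stay = b <= t <= e      # restriction 1
    left_before_lockdown = e <= lockdown    # restriction 2
    showed_symptoms = s < math.inf          # restriction 3
    return infected_during_stay and left_before_lockdown and showed_symptoms


L = 54.0  # illustrative lockdown time
print(is_observable(0.0, 50.0, 45.0, 52.0, L))      # True: exported symptomatic case
print(is_observable(0.0, math.inf, 45.0, 52.0, L))  # False: did not leave before lockdown
print(is_observable(0.0, 50.0, 45.0, math.inf, L))  # False: never showed symptoms
print(is_observable(0.0, 50.0, math.inf, 52.0, L))  # False: not infected in Wuhan
```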
36:09So, which likelihood function should we use?
36:15For a moment, let's just pretend that the time
36:17of transmission, T, is observed.
36:20So if we had samples, i.i.d. samples from the population, P,
36:25then we could just use this product of the density
36:29of BETS as a likelihood function.
36:34But this is not something we should use,
36:36because we actually don't have samples from P.
36:40We have samples from D, so what we should do is to condition
36:46on the selection set, D, and use this likelihood function,
36:52which is basically just the density divided by the
36:56probability that someone is selected in this set, D.
37:04Okay, this is called unconditional likelihood,
37:07to contrast with the conditional likelihood.
37:11So, in unconditional likelihood,
37:14we consider the joint distribution of B, E, T, and S.
37:18But in the conditional likelihood,
37:20we consider the conditional distribution of T and S,
37:25given B and E.
37:26So this is the conditional distribution of the disease
37:29transmission and progression, given the travel.
37:32So this treats travel as fixed.
37:35So to compute this conditional likelihood,
37:38we need to further condition on B and E, okay?
37:48But in reality, the time of transmission, T, is unobserved,
37:52so we cannot directly use the likelihood function,
37:55as on the last slide, so one possibility is to treat T
38:01as a latent variable and use, for example, an EM algorithm.
38:07The way we chose is to use an integrated likelihood.
38:11That just sort of marginalizes
38:14over this unobserved variable, T.
38:19So, the unconditional likelihood is the product
38:23over the cases of the integral
38:26of the density function over T.
38:31And the conditional likelihood is just a product
38:34of the integral of the conditional distribution of T and S,
38:40over T.
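
In symbols, the two integrated likelihoods described above can be written as follows. This is a sketch in our notation: theta collects the model parameters, i indexes the n observed cases, and each integral runs over the transmission times compatible with case i.

```latex
\begin{aligned}
L_{\text{uncond}}(\theta)
  &= \prod_{i=1}^{n} \int f\bigl(b_i, e_i, t, s_i \mid \mathcal{D}\bigr)\, dt
   = \prod_{i=1}^{n}
     \frac{\int f(b_i, e_i, t, s_i)\, dt}{P\bigl((B, E, T, S) \in \mathcal{D}\bigr)}, \\
L_{\text{cond}}(\theta)
  &= \prod_{i=1}^{n} \int f\bigl(t, s_i \mid b_i, e_i, \mathcal{D}\bigr)\, dt .
\end{aligned}
```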
38:45So, the reason we sort of considered both
38:49the unconditional likelihood and conditional likelihood
38:51is that the unconditional likelihood is a little bit
38:55more efficient, because it also uses information
39:00in this density of B and E, given you're selected.
39:06So that contains a little bit of information.
39:09But a conditional likelihood is more robust.
39:12So, because it does not need to specify how people traveled,
39:18so it is robust to misspecifying those distributions.
39:24So I'll stop here and take any questions up to now.
39:36Is this clear to everyone?
39:40If so, I'm gonna proceed.
39:45Okay, so under these four assumptions
39:49that I introduced earlier, you can sort of compute
39:53the explicit forms of the conditional likelihood functions.
39:57I'm not gonna go over the detailed forms,
39:59but I just want to point out that first of all,
40:02as I mentioned earlier, this does not depend on
40:04the two nuisance parameters, mu and kappa.
40:08And second of all, this actually reduces to a likelihood
40:12function that's previously derived in this paper in 2009
40:19by setting this R equals to zero.
40:22So R equals to zero means that the epidemic
40:24was not growing, so it's mostly a stationary epidemic.
40:30So that's reasonable for maybe influenza, but not for COVID.
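
For readers who want to experiment, here is a numerical sketch of one case's contribution to the conditional likelihood. It is derived directly from the assumptions stated in the talk (exponential growth, gamma incubation period) rather than copied from the paper's closed form. The function names are ours, and the gamma distribution is parametrized by shape and rate, which is an assumption.

```python
import math


def gamma_pdf(x, shape, rate):
    """Density of a gamma distribution with the given shape and rate."""
    if x <= 0:
        return 0.0
    return math.exp(shape * math.log(rate) + (shape - 1) * math.log(x)
                    - rate * x - math.lgamma(shape))


def conditional_likelihood(b, e, s, r, shape, rate, steps=2000):
    """Integral of exp(r t) h(s - t) over t in [b, min(e, s)],
    divided by the integral of exp(r t) over [b, e].

    kappa and mu cancel in this ratio, so they do not appear.
    """
    upper = min(e, s)
    if upper <= b:
        return 0.0
    width = (upper - b) / steps
    numerator = 0.0
    for k in range(steps):  # midpoint rule
        t = b + (k + 0.5) * width
        numerator += math.exp(r * t) * gamma_pdf(s - t, shape, rate) * width
    if r == 0:
        denominator = e - b
    else:
        denominator = (math.exp(r * e) - math.exp(r * b)) / r
    return numerator / denominator


# With r = 0 the weight exp(r t) disappears: infection is uniform over the stay.
print(conditional_likelihood(b=0.0, e=10.0, s=12.0, r=0.0, shape=1.0, rate=0.5))
print(conditional_likelihood(b=0.0, e=10.0, s=12.0, r=0.3, shape=1.0, rate=0.5))
```

Setting r to zero gives the stationary-epidemic case, where the contribution reduces to a difference of incubation period distribution functions divided by the length of stay.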
40:40So for unconditional likelihood, we need to make
40:42further assumptions about how people traveled,
40:46the assumption we used was just a very simple,
40:49sort of a uniform assumption,
40:51uniform distribution assumption,
40:52that assumes that the travel was stable
40:55in the period that we considered.
40:59And if we use those assumptions,
41:00we can derive the closed form unconditional likelihood.
41:06There's a little bit of approximation that's needed,
41:09but that's very, very reasonable in this case.
41:18So, I'd like to show you the results
41:22from fitting these parametric models.
41:24So what we did is we obtained point estimates
41:28of the parameters by maximizing the likelihood functions
41:32I just showed you, and then we obtained 95 percent
41:36confidence intervals, by a likelihood ratio test.
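
The idea of getting a confidence interval by inverting a likelihood ratio test can be sketched on a one-parameter toy problem. This is not the study's model: the data below are made up, the likelihood is a plain exponential one, and a model with several parameters would need a profile likelihood instead.

```python
import math

CHI2_95_DF1 = 3.841  # 95 percent quantile of the chi-square distribution, 1 degree of freedom


def likelihood_ratio_interval(loglik, grid):
    """Return (mle, lower, upper) over `grid`: the interval keeps every value
    whose likelihood ratio statistic 2 * (max loglik - loglik) is below the cutoff."""
    values = [loglik(theta) for theta in grid]
    best = max(values)
    inside = [theta for theta, v in zip(grid, values)
              if 2.0 * (best - v) <= CHI2_95_DF1]
    return grid[values.index(best)], min(inside), max(inside)


# Made-up waiting times, modeled as exponential with an unknown rate.
data = [1.2, 0.4, 2.5, 0.9, 3.1, 0.7, 1.8, 0.3]


def loglik(rate):
    return len(data) * math.log(rate) - rate * sum(data)


grid = [0.01 * k for k in range(1, 501)]
mle, lower, upper = likelihood_ratio_interval(loglik, grid)
print(f"MLE {mle:.2f}, 95% interval ({lower:.2f}, {upper:.2f})")
```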
41:41So, what you can see is broadly, over different locations,
41:46the estimated doubling time was very consistent.
41:52Also across conditional and unconditional likelihood,
41:55so the doubling time was about two to 2.5 days.
42:01And the median incubation period is about four days,
42:07but there is a little bit of variability
42:11in the estimates.
42:14It turns out that the variability is mostly
42:16because of the parametric assumptions that we used.
42:21And then the 95 percent quantile is about,
42:2712 to 14 days.
42:29Or if you consider the sampling variability,
42:31that is about 11 to 15 days.
42:35Okay, but broadly speaking, across the different locations,
42:40they seem to suggest very similar answers.
42:47So, just to summarize, the initial doubling time
42:51seems to be between two to 2.5 days.
42:55Median incubation period is about four days,
42:57and 95 percent quantile is about 11 to 15 days.
43:03So, those sort of were our results,
43:05using the parametric model.
43:08And next I'm going to compare it with some other
43:12earlier analysis, and give you a demonstration,
43:18or an argument of why some of the other early analysis
43:21were severely biased.
43:23So first, let's look at this Lancet paper that I mentioned
43:27in the beginning of the talk that estimated doubling time.
43:30So the doubling time they estimated was 6.4 days.
43:37So, what happened is these authors used a modified
43:44SEIR model, so the SEIR model is very common
43:48in epidemic modeling, so they modified that model
43:51to account for traveling, but they did not account
43:55for the travel ban.
43:58So, basically to sort of simplify what's going on,
44:05what they essentially did is they used the density
44:09of the symptom onset as in the population P,
44:15so they fitted this density, but they fit it using, ah,
44:20samples from the set D.
44:26So it is quite reasonable to assume that the incidence
44:29of symptom onset was growing exponentially in the population
44:34that is exposed to Wuhan.
44:37So given P, this distribution, marginal distribution of S,
44:42was perhaps growing exponentially before the lockdown.
44:47But we don't actually have samples from P.
44:49We have a sample from D.
44:52So, we actually can derive the density of S in D,
44:59and that looked very different from exponential growth.
45:03So, basically the intuition is that if you look at
45:06the distribution of the transmission, T,
45:09it is growing exponentially, but it also has this effect,
45:13this exponential of RT, times L minus T.
45:17So basically, if you are infected at time T,
45:20then you only have the time between T to L
45:25to leave Wuhan and be observed by us.
45:29Okay, so that's why it's not only exponential growth,
45:32but there's also a decreasing trend, L minus T,
45:39for the distribution of the time of transmission.
45:43So for the time of symptom onset,
45:45it's just the time of transmission,
45:48convolved with the distribution of the incubation period.
45:52And that has this form that is approximately
45:56an exponential growth, and then times this term,
46:00that is L plus some quantity that depends
46:03on the incubation period and the epidemic growth, minus S.
46:10So this is a term that is not considered,
46:13in this simple exponential growth model.
46:18Which is basically what's used in that Lancet paper.
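The argument above can be written out as a short derivation. This is a sketch under stated assumptions, not a quotation of the slide: r is the growth exponent (spoken as "R"), L is the lockdown time, the incubation period X is assumed to be Gamma-distributed with shape α and rate β and independent of the transmission time T, and any lower limit on T is ignored.

```latex
% Transmission time T among the exported cases (the set D), for t <= L:
\[
  f_T(t \mid \mathcal{D}) \;\propto\; e^{rt}\,(L - t).
\]
% Symptom onset S = T + X with X ~ Gamma(alpha, beta); for s <= L:
\[
  f_S(s \mid \mathcal{D})
  \;\propto\; \int_0^\infty e^{r(s-x)}\,(L - s + x)\,
      \frac{\beta^{\alpha} x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}\,dx
  \;=\; \Big(\frac{\beta}{\beta + r}\Big)^{\alpha}\, e^{rs}\,
      \Big(L + \frac{\alpha}{\beta + r} - s\Big).
\]
```

Under these assumptions the "quantity that depends on the incubation period and the epidemic growth" is α/(β + r), and the factor in the last parentheses is what a pure exponential growth model leaves out.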
46:23Okay, so to illustrate this,
46:26what I'm showing you here is a histogram
46:29of the symptom onset of all the Wuhan exported cases,
46:35who are also residents of Wuhan.
46:37So they stayed from December first to January 23.
46:43What you see is that it was kind of growing very fast,
46:46perhaps exponentially in the beginning,
46:49but then it slows down around the time of the lockdown.
46:55Okay, so the orange curve is the theoretical fit
47:00that we obtained in the last slide,
47:04using the maximum likelihood estimator of the parameters.
47:08So it fits the data quite well.
47:12So what happened, I think, with the Lancet paper is,
47:17so they basically stopped about January 28th,
47:20so it's about here, and they essentially tried to fit
47:23an exponential growth from the beginning to January 28.
47:29And that would lead to much slower growth
47:33than fitting the whole model to account for the selection.
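To see the direction of this bias, here is a minimal sketch that fits a straight line on the log scale (which is what a pure exponential growth model amounts to) to the density shape exp(rt)(L − t) described earlier for the time of transmission. The values of `r_true` and `L` are illustrative assumptions, and this is not a reproduction of the Lancet analysis.

```python
import math

def log_density(t, r, L):
    """Log of the unnormalised density exp(r * t) * (L - t), valid for t < L."""
    return r * t + math.log(L - t)

def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

r_true = math.log(2) / 2.5   # illustrative: true doubling time of 2.5 days
L = 54.0                     # illustrative lockdown time, in days after day 0
days = [float(t) for t in range(54)]   # t = 0, 1, ..., 53
log_f = [log_density(t, r_true, L) for t in days]

# Ignoring the (L - t) factor means estimating a single slope on the log scale.
r_naive = least_squares_slope(days, log_f)

print(f"true r = {r_true:.3f}, naive r = {r_naive:.3f}")
print(f"true doubling time = {math.log(2) / r_true:.2f} days, "
      f"naive doubling time = {math.log(2) / r_naive:.2f} days")
```

Because the log of (L − t) decreases in t, the fitted slope is always below the true exponent, so the naive doubling time comes out longer than the true one.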
47:41Okay.
47:44So that's about epidemic growth.
47:46Next I will talk about several studies
47:49of the incubation period.
47:52So, these studies are susceptible to two kinds of biases.
47:57One is that some of them ignore the epidemic growth,
48:01so instead of using this likelihood function,
48:04this conditional likelihood function,
48:06they just set R equal to zero,
48:08and then they use the likelihood function
48:10that was derived in an earlier paper.
48:15The other bias is sort of right-truncation.
48:20And basically, they kind of stopped
48:22the data collection early and only used cases
48:24confirmed by then, so people with long incubation periods
48:29are less likely to be included in the data,
48:33so that leads to underestimation of the incubation period.
48:38And a solution to this is you can actually derive
48:40the likelihood with additional conditioning events,
48:43that S is equal, sorry,
48:45less than or equal to some threshold, M.
48:48Suppose you stop the data collection a week after M,
48:52and you say: perhaps we have found out all the cases
48:56who showed symptoms beforehand.
48:59We can use this likelihood function.
49:02I'm not gonna show you the exact form,
49:04but basically you need to further divide by, ah,
49:10the probability of S less than or equal to M,
49:14and you can obtain a closed-form expression for this
49:18under our parametric assumptions,
49:22using integration by parts.
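The idea of dividing by the probability of the truncation event can be seen in a toy setting. The sketch below is not the BETS likelihood: it assumes, purely for illustration, an exponential incubation period that is observed only when it is at most M days, and all names and numbers are hypothetical.

```python
import math
import random

def truncated_exp_loglik(rate, n, total, M):
    """Log-likelihood of n exponential observations (sum = total), each seen only if <= M.

    The last term is the extra division by P(X <= M) = 1 - exp(-rate * M).
    """
    return n * math.log(rate) - rate * total - n * math.log(1.0 - math.exp(-rate * M))

# Simulated data only: true mean of 5 days, but values above M = 10 are never observed.
rng = random.Random(0)
true_mean, M = 5.0, 10.0
xs = []
while len(xs) < 2000:
    x = rng.expovariate(1.0 / true_mean)
    if x <= M:
        xs.append(x)

n, total = len(xs), sum(xs)

# Ignoring truncation: the exponential MLE of the mean is the sample mean.
naive_mean = total / n

# Adjusting for truncation: maximise the truncated likelihood over a grid of rates.
grid = [0.001 * k for k in range(1, 1001)]
best_rate = max(grid, key=lambda lam: truncated_exp_loglik(lam, n, total, M))
adjusted_mean = 1.0 / best_rate

print(f"naive mean = {naive_mean:.2f} days, adjusted mean = {adjusted_mean:.2f} days")
```

The naive estimate is pulled down because long values are missing from the sample; the adjusted estimate corrects for that at the price of more uncertainty.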
49:25So, I'd like to show you an experiment
49:29to illustrate this selection bias.
49:33So in this experiment, we kind of stop the data collection
49:38on any day between January 23 and February 18,
49:43and we fitted sort of this parametric BETS model,
49:48using one of the following likelihoods.
49:51So this is the likelihood that treats R as equal to zero,
49:54so it's adjusted for nothing,
49:56and this is the likelihood derived earlier
49:59and used in other studies.
50:02This is the likelihood function that adjusts for the growth,
50:05so R is treated as an unknown parameter.
50:08And this is the likelihood on the last slide that adjusted
50:12for both the growth and the right-truncation,
50:16S less than or equal to M.
50:21So the point estimates are obtained by MLEs,
50:23and the confidence intervals are obtained
50:25by nonparametric bootstrap,
50:28and we compared our results with three previous studies.
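For readers unfamiliar with the nonparametric bootstrap, here is a minimal sketch of a percentile interval. It is a simplification: the talk bootstraps maximum likelihood estimates, whereas this example uses the sample median, and the data are simulated purely for illustration. The function name is hypothetical.

```python
import random
import statistics

def bootstrap_percentile_ci(data, estimator, n_boot=2000, level=0.95, seed=0):
    """Percentile confidence interval from resampling `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    replicates = sorted(
        estimator([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lower = replicates[int(n_boot * (1 - level) / 2)]
    upper = replicates[int(n_boot * (1 + level) / 2) - 1]
    return lower, upper

# Synthetic incubation periods in days, simulated only for illustration.
rng = random.Random(1)
sample = [rng.gammavariate(2.0, 2.5) for _ in range(200)]

point = statistics.median(sample)
lower, upper = bootstrap_percentile_ci(sample, statistics.median)
print(f"median = {point:.2f} days, 95% bootstrap interval = ({lower:.2f}, {upper:.2f})")
```

Replacing `statistics.median` with a function that refits a likelihood on each resample would give the kind of interval described in the talk.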
50:36So this basically summarizes this experiment.
50:42This is a little bit complicated,
50:43so let me walk you through slowly.
50:48So there are three likelihood functions we used.
50:50One adjusts for nothing; that's the orange.
50:54One is adjusted only for growth,
50:57and one is adjusted for both growth and truncation.
51:02Okay, so what you can immediately see
51:04is that if we adjusted for, ah,
51:08if we adjusted for nothing, then the estimate is much larger
51:14than the other estimates.
51:18So actually, if you adjusted for nothing,
51:20and if you sort of used our entire data set,
51:23the median incubation period would be about nine days.
51:27And the 95 percent quantile would be about 25 days.
51:31So that's just way too large.
51:35And if you ignored right-truncation, for example,
51:38if you used this likelihood function we derived earlier,
51:43that only accounts for growth, you underestimate
51:48the incubation period in the beginning, as expected,
51:51but you slowly converge to this final estimate.
51:57And if you use this likelihood function and adjust for both
52:00growth and truncation, you actually get
52:03some quite sensible results by the end of January.
52:09So, it has large uncertainty, but it's roughly unbiased,
52:14and it kind of eventually converges to that estimate.
52:18The same estimate that we obtained
52:23using the blue curve, but using the full data.
52:28Okay.
52:30So, for the sake of time, I think I'll skip the part
52:36about Bayesian nonparametric inference.
52:40One thing that's a little bit interesting, I think,
52:43is there seems to be some difference between men
52:48and women in their incubation period.
52:51So these are sort of the posterior mean
52:54and posterior credible intervals for nonparametric
53:01incubation period, and you can see that men
53:04seem to develop symptoms quicker than women.
53:11So, that's a little bit interesting,
53:14and maybe, I mean, I'm not a doctor,
53:18but it could be related to the observation
53:22that men seem to be more susceptible,
53:24and die more frequently than women.
53:31So let's, let me conclude this talk.
53:34So these are some conclusions we found about COVID-19,
53:40using our dataset and our model.
53:43Initial doubling time in Wuhan was about two to 2.5 days.
53:50The median incubation period is about four days,
53:52and the proportion of incubation periods
53:55above 14 days is about five percent.
54:00There are a number of limitations for our study.
54:03For example, we used the symptom onsets reported
54:07by the patients, and these are not always accurate.
54:11There could be behavioral reasons for people
54:13to report a later symptom onset.
54:18Even though these locations are intensive in their testing
54:21and contact tracing, some degree of under-ascertainment
54:25is perhaps inevitable.
54:28As I have shown you, in our dataset collection,
54:34discerning the Wuhan exported case
54:36is not a black and white decision.
54:39We used this beyond a reasonable doubt kind of criterion,
54:43but that's one criterion you can apply.
54:47And the crucial assumptions are the first
54:51two assumptions, which mean that the travel
54:53and disease are independent, and that can be violated,
54:57for example, if people tend to cancel
55:02their travel plans when feeling sick.
55:09Nevertheless, I think I have demonstrated some very
55:12compelling evidence for selection bias in early studies.
55:17Some of the biases you can correct by designing the study
55:25more carefully, some require more sophisticated
55:29statistical adjustments.
55:33And basically, I think the conclusion is:
55:37you should not make uncalculated BETS.
55:41So, we should always carefully design the study
55:44and adhere to our sample inclusion criteria.
55:48And the statistical inference should not be based
55:53on some intuitive calculations,
55:55but should be based on first principles.
55:58So in this study, we kind of went back all the way
56:00to defining the support of random variables.
56:05So that's sort of statistics 101.
56:08But that's actually, it's extremely important.
56:11So I found it really helpful to start all the way
56:15from the beginning and develop a generative model.
56:20And that avoids a lot of potential selection biases.
56:25So the final lesson I'd like to share from this whole study
56:29is that I think this demonstrates that data quality
56:34and better design are much more important
56:38than data quantity and better modeling,
56:42in many real data studies.
56:46Thanks for the attention,
56:47and I'll take any questions from here.
56:51- Thank you for the nice talk.
56:54Does anyone have questions for Qingyuan?
57:00So Qing, I think someone, ah,
57:04yeah, Joe sent you a question.
57:07- Okay.
57:09- Is there any information in the datasets on whether a patient
57:12is a healthcare worker?
57:15- No, these are not usually healthcare workers.
57:19These are exported from Wuhan, so they're usually
57:21just people who traveled maybe for sightseeing,
57:24or for the Chinese New Year, they traveled from Wuhan
57:28to other places and were diagnosed there.
57:34- Right, so also he has another question,
57:38Joe has another question also: how can we evaluate
57:41the effectiveness of social distancing and mask guidelines?
57:49- I think this study we did was not designed
57:53to answer those questions.
57:57We did have a very, ah,
58:01sort of preliminary analysis.
58:03So we broke the study period into two parts.
58:08So on January 20, it was confirmed publicly
58:12that the disease was human-to-human transmissible,
58:16so we broke the period into two parts:
58:20those before January 20 and those after January 20.
58:25But the after period is just three days.
58:27So January 21, 22, 23, and we found that if we fit
58:33different growth rates to these two periods, in the second period
58:36it seemed that the growth was substantially slower.
58:42The growth, the exponent R is not quite zero,
58:48but it's close.
58:50So it seems that the knowledge of sort
58:52of human-to-human transmissibility and the fact that,
58:58I think, masks are probably much more,
59:01were much more available in Wuhan,
59:03and people started to do some social distancing
59:08right after January 20.
59:11I think that seemed to play a role.
59:14But that's very, very preliminary,
59:17and I think there are a lot of good studies about this now.
59:25- Donna has a question.
59:26Donna, do you want to say what your question is?
59:32- [Donna] Yeah, sure, thanks.
59:33That was a very interesting and clear talk.
59:36I really appreciated the way you carefully went through,
59:40step by step, to show-- (audio distorting)
59:47Who aren't doing that, I feel.
59:50But my question was, it was still hard for me to tell
59:54to what extent your estimates were identifiable
59:59due to assumptions and to what extent the data
01:00:04made the estimates fairly identifiable.
01:00:09- Yeah so essentially, I mean, selection bias,
01:00:12you usually cannot avoid it, unless you
01:00:17make some kind of missing at random type of assumption.
01:00:22Here, we don't have a random selection.
01:00:25It's more like a deterministic selection,
01:00:27and we can quantify that selection event,
01:00:30but still, as you said, I think these are great questions
01:00:37to sort of disentangle the nonparametric assumptions
01:00:41needed for identification and the parametric assumptions
01:00:44needed for sort of better and easier inference.
01:00:51I don't have a formal result,
01:00:53but my feeling is the first two assumptions
01:00:56that are assumed, sort of the independence of travel
01:01:00and disease, that's sort of essential to the identification.
01:01:07And then later on, the assumptions are perhaps relaxable.
01:01:14So we did try to relax those
01:01:15in the Bayesian nonparametric analysis.
01:01:19But that's not a proof, so that's my, ah,
01:01:25best guess at this point.
01:01:28- [Donna] Thank you.
01:01:32- From Casey, said, ah, the estimates,
01:01:39people have estimated about five to 80 percent
01:01:41of asymptomatic infections, and isn't that a limitation
01:01:46of your model that you did not account
01:01:48for asymptomatic carriers?
01:01:50And if so, how can we possibly model for it,
01:01:52given the large range of estimates?
01:01:55So this is actually a feature of our study,
01:01:59because we actually had a, let's see,
01:02:05we had a term for the asymptomatic transmission.
01:02:14But that parameter just got canceled.
01:02:19So this parameter, mu, or one minus mu,
01:02:22is the proportion of asymptomatic infections.
01:02:29But then because we only observed cases who are,
01:02:36who showed symptoms, so actually in the likelihood,
01:02:39this parameter mu got canceled.
01:02:43So, of course the reason we could cancel that mu
01:02:46is because of this assumption two,
01:02:48that S is independent of the travel.
01:02:53So that's important.
01:02:55But once you assume that, you actually, ah,
01:03:00sort of don't need to worry about asymptomatic transmission,
01:03:04and on the other hand, this dataset, or this whole method,
01:03:08also provides no information about the proportion
01:03:11of asymptomatic infection.
01:03:15Hopefully that'll answer your question.
01:03:17- [Casey] Yeah, thanks; so you account for it
01:03:19by saying it's not really significant, in your estimate?
01:03:24- Yeah, so in the likelihood, mu will get canceled.
01:03:26So it doesn't appear in the likelihood.
01:03:28So the likelihood of the data does not depend
01:03:30on how much are asymptomatic, because we only look
01:03:36at cases who are symptomatic.
01:03:39So this incubation period that we estimated
01:03:41is also the incubation period among
01:03:44those people who showed symptoms.
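The cancellation can be sketched schematically. The notation here is an assumption made for illustration rather than the exact form on the slides: μ is the probability that an infected person develops symptoms (so 1 − μ is the asymptomatic proportion), x is the observed data for one case, θ collects the remaining parameters, and g_θ is the part of the density that does not involve μ. That μ does not depend on x is where the independence assumption enters.

```latex
% Every observed case is symptomatic, so each contributes a factor mu.
% Conditioning on being observed (symptomatic and in the set D) divides
% by the same factor:
\[
  L(\theta)
  \;=\; \frac{\mu \, g_\theta(x)}{\int_{\mathcal{D}} \mu \, g_\theta(x')\,dx'}
  \;=\; \frac{g_\theta(x)}{\int_{\mathcal{D}} g_\theta(x')\,dx'} .
\]
```

The likelihood is therefore the same for every value of μ, which is why the design is robust to asymptomatic infection and also uninformative about it.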
01:03:47- [Casey] So it's an elegant way of sidestepping
01:03:49the question, (laughing) in a way.
01:03:52- Well, it's not a sidestep, it's sort of,
01:03:56it's a limitation of this design.
01:04:00So the whole design should be robust
01:04:04to asymptomatic transmission, and it also gives
01:04:07no information about asymptomatic transmission.
01:04:12- [Casey] Yeah, I was really impressed at the way
01:04:13you took on that Lancet article and just really, ah,
01:04:18it was really impressive; what a great talk.
01:04:20Thank you so much.
01:04:22- Well thank you.
01:04:27- Hi Qing I have a question.
01:04:29So you mentioned before that because the measurements
01:04:33inside of Wuhan are the, or the, ah,
01:04:37the measurements that we have inside Wuhan,
01:04:39the numbers aren't very accurate due to various reasons.
01:04:42So I'm wondering that if you calculate the doubling time
01:04:46using the data for Wuhan city,
01:04:50and then take into, that uses the measurements
01:04:53before they changed the criterion for when it's counted
01:04:58as a confirmed case, and using the data before, say,
01:05:02the lockdown, but taking into consideration
01:05:04that the data, you only looked at data.
01:05:07So you only looked at the confirmed cases before that date.
01:05:10Will you get a similar measurement,
01:05:13a similar estimate as if you're using the traveling data,
01:05:16or is it much worse?
01:05:19- Yeah, people have done an analysis on the data from Wuhan.
01:05:26What I would like to point out is that this figure
01:05:29is only the number of new, confirmed cases.
01:05:34So what is usually done in epidemic analysis
01:05:36is they don't look at the number of confirmed cases,
01:05:40but the number of cases who showed symptoms on a certain day
01:05:45because that's usually less variable, less noisy,
01:05:51than this sort of confirmation,
01:05:55because of the problem about confirmation.
01:05:59So people have done that, and I don't see a doubling time
01:06:06estimation from that; there was a journal paper on that.
01:06:13And there was also a very interesting comment on it
01:06:18that criticized some of its methodology.
01:06:21I didn't see a doubling time estimate.
01:06:25So they seemed to focus on the R-naught of the epidemic.
01:06:31I actually had thought about that as well,
01:06:34and in this study I have presented,
01:06:37I intentionally avoided estimating R-naught.
01:06:41Because I think there were a lot of issues with, ah,
01:06:47finding an unbiased estimate of the serial interval,
01:06:52which is very important in estimating R-naught.
01:06:56So, this estimate we found is not directly comparable
01:07:05to that journal paper, I guess.
01:07:08But so what happened, I think, is around late January,
01:07:12early February, a lot of people tried to estimate
01:07:17the R-naught and the doubling time of the epidemic,
01:07:21and what I've found interesting was
01:07:23there were kind of two modes.
01:07:25There were several papers that estimated that the doubling time
01:07:29was about six to seven days, and there were several papers
01:07:31that estimated doubling times of about two to four days.
01:07:37And I think, ah,
01:07:41at least I have shown that the Lancet paper,
01:07:45that their whole method seems to be very flawed.
01:07:50But whether this means that our estimate is very close
01:07:54to the truth, it doesn't necessarily mean so.
01:07:58Because we also have a lot of limitations.
01:08:02- Okay, thanks.
01:08:09Any more questions for Qingyuan?
01:08:14Okay, thanks Qing.
01:08:16I guess that's all for today, and it's a great talk.
01:08:20If you have any more questions for Qing,
01:08:21you can send him an email, and you can find his email
01:08:25on his website, okay?
01:08:29- Okay.
01:08:30(muttering)
01:08:32All right, okay, thank you everyone.
01:08:35- Thank you, oh, we got a new message?
01:08:38(muttering)
01:08:40- It's just a, Keyong said thank you.
01:08:43- Okay, okay, bye!
01:08:45- [Qingyuan] All right, bye.