Biostatistics Seminar: BETS: The dangers of selection bias in early analyses of the coronavirus disease (COVID-19) pandemic
May 06, 2020
Qingyuan Zhao, Statistical Laboratory, University of Cambridge
May 5, 2020
- 00:03- All right, and it says the meeting is being recorded.
- 00:06Okay, so thanks everyone,
- 00:10for coming to this seminar.
- 00:13And I hope everyone is doing well.
- 00:17Today, I'm going to talk about some issues
- 00:21of selection bias in early analysis
- 00:24of the COVID-19 pandemic.
- 00:28You can find the manuscript online, on arXiv,
- 00:31and the slides of this talk are also available on my webpage.
- 00:38So, here are the three collaborators,
- 00:42involved in this project.
- 00:45So Nianqiao is a PhD student at Harvard,
- 00:48and we kind of only met online.
- 00:50We never met in person, and I sort of created
- 00:54a dataset in January, and I wanted some help,
- 00:58and somehow she saw this and she said: I could help you.
- 01:03And we kind of developed a collaboration.
- 01:08And Sergio and Rajen are both
- 01:12lecturers in the Stats Lab in Cambridge.
- 01:18And I'd like to thank many, many people
- 01:19who have given us very helpful suggestions.
- 01:23This is just some of them.
- 01:28I'd like to begin with just saying COVID-19
- 01:32is personal for everyone, and what I would share
- 01:37is partly my story, my personal story with COVID-19.
- 01:44So here is a photo of me and my parents,
- 01:50taken last September, when I went back to China,
- 01:56to see my family.
- 01:58So both myself and my parents,
- 02:01we all grew up in Wuhan, China.
- 02:06And on a sunny day in September, we went to,
- 02:10well, this is the Yellow Crane Tower,
- 02:13a sort of landmark building in Wuhan.
- 02:17And the funny thing is, I think I've never been there,
- 02:20on top of the tower, in my entire life.
- 02:24And this is actually the first time I went there.
- 02:27It's the usual thing: if you have a famous local attraction
- 02:32for tourists, as a local you actually never go.
- 02:39And so, on January 23, because the epidemic
- 02:43was growing so fast in Wuhan, the city started a lockdown.
- 02:52So, if we went on top of the Yellow Crane Tower,
- 02:56this is what we would see on a typical day,
- 03:00before the lockdown.
- 03:02And on the right is sort of what happened
- 03:06after the lockdown, and I liked how the journalist
- 03:10used sort of this gloomy weather as the background,
- 03:13and certainly reflected everybody's mood,
- 03:17after the lockdown.
- 03:21So, this project began on January 29.
- 03:26I had a conversation with my parents over the phone,
- 03:30and they told me that a close relative of ours
- 03:35was just diagnosed with, quote/unquote, viral pneumonia.
- 03:41So, basically at that point, we all thought it must
- 03:45be COVID-19, but because there were not enough tests,
- 03:51this relative could not get confirmed.
- 03:54And this prompted me to start looking
- 03:56through the data available at the time.
- 03:59But I quickly realized that the epidemiological data
- 04:03from Wuhan are very unreliable.
- 04:07And here is some anecdotal evidence.
- 04:10The first evidence is about inadequate testing.
- 04:16So actually this relative of mine could not get
- 04:18an RT-PCR test until mid-February,
- 04:22and she actually developed symptoms on about January 20.
- 04:29So by mid-February, she was already recovering.
- 04:33And she took, I think, several tests.
- 04:36Her first test was actually negative,
- 04:38and a few days later she was tested again,
- 04:40and the result came back positive.
- 04:43So there are also a lot of false negative tests,
- 04:46I think, in general.
- 04:49And another problem with the epidemiological data from Wuhan
- 04:53is insufficient contact tracing.
- 04:56So, her husband, this relative of mine's husband,
- 05:03he also showed COVID symptoms, but he quickly recovered
- 05:08from that, and in the end he was never tested for COVID.
- 05:17So, you can also see the insufficient testing
- 05:19from this incidence plot.
- 05:22So this is the daily confirmed cases, up until mid-February,
- 05:29and this is when the travel ban started,
- 05:33or the lockdown started, January 23,
- 05:36and on February 12, there was a huge spike
- 05:41of over 10,000 cases, much more than the previous few weeks.
- 05:50And the reason for that was not suddenly because people
- 05:54were infected on that date.
- 05:57It's because of a change of diagnostic criterion.
- 06:01So before February 12,
- 06:04everybody needed to have a positive RT-PCR test
- 06:10to be confirmed a COVID-19 case.
- 06:13But since February 12, because there,
- 06:16the health system in Wuhan was so overwhelmed,
- 06:20the government decided to change the diagnostic criteria.
- 06:23So without RT-PCR tests, you can still be diagnosed
- 06:28with COVID-19 if you satisfy several other criteria.
- 06:34And this sort of change in diagnostic criteria
- 06:37only happened in the Hubei Province
- 06:41and not elsewhere in China.
- 06:45So if we would like to avoid these problems
- 06:50with data from Wuhan, one clever solution
- 06:55is to use cases that were
- 06:58exported from Wuhan.
- 07:01So this has two benefits.
- 07:03First of all, testing and contact tracing
- 07:05were quite intensive in other locations.
- 07:09So, it's reasonable to expect that a lot of the bias
- 07:13due to sort of under-ascertainment will be less severe
- 07:16if we use data from elsewhere.
- 07:20And also, many locations, particularly in some cities
- 07:26in China, published detailed case reports,
- 07:31instead of just case counts.
- 07:34And if you look at these detailed case reports, there is
- 07:36a lot of information that can be used for inference.
- 07:44This is not our idea.
- 07:47And I think one of the first uses
- 07:51of this design was a report from Neil Ferguson's group
- 07:57at Imperial College London,
- 07:59and they published a report on January 17,
- 08:03and what it did was a simple sort of division of the number
- 08:07of cases detected internationally, over the number
- 08:11of people who traveled from Wuhan internationally.
- 08:15And they found that there could be
- 08:18over 1,700 cases in Wuhan by January 17.
- 08:26So, I started this on January 29,
- 08:30and within about two weeks, managed to put something online,
- 08:37in which we also used internationally confirmed cases
- 08:40to estimate epidemic growth.
- 08:44And what we used were 46 coronavirus cases
- 08:48who traveled from Wuhan and then were subsequently confirmed
- 08:53in six Asian countries and regions.
- 08:59And the main result was that the epidemic was doubling
- 09:02in size every 2.9 days.
- 09:06And we used a Bayesian analysis, and the 95 percent
- 09:10credible interval was 2 to 4.1.
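A doubling time can be converted to an exponential growth rate and back with one line of arithmetic. This is a minimal Python sketch (not the paper's code); the relationship is r = ln 2 / td:

```python
import math

# A minimal sketch (not the paper's code): under exponential growth exp(r*t),
# the doubling time is ln(2) / r, and vice versa.
def doubling_time(r):
    """Doubling time (days) implied by growth rate r (per day)."""
    return math.log(2) / r

def growth_rate(td):
    """Growth rate (per day) implied by doubling time td (days)."""
    return math.log(2) / td

# A doubling time of 2.9 days corresponds to r of about 0.24 per day.
r = growth_rate(2.9)
print(round(r, 2))
```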
- 09:14And of course, when I was writing this article,
- 09:17I was mostly just working very hard on this dataset
- 09:22that we collected, thinking about what model
- 09:27is suitable for this kind of data.
- 09:30And just before I posted this pre-print,
- 09:34I realized there was a similar article
- 09:38that had already been published in The Lancet, on January 31.
- 09:45And what's really puzzling is they used almost the same data
- 09:51and very similar models, but somehow reached
- 09:54completely different conclusions.
- 09:58So they used data from December 31 to January 28,
- 10:02that are exported from Wuhan internationally.
- 10:05And they would like to infer the number
- 10:07of infections in Wuhan.
- 10:10And one of the main results,
- 10:12which was this epidemic doubling time, was 6.4 days,
- 10:16and the 95 percent credible interval was 5.8 to 7.1.
- 10:21So that's drastically different from ours.
- 10:24So again, ours was 2.9, within 2 to 4.1,
- 10:29and this was 6.4.
- 10:33And this is talking about the doubling time.
- 10:36So the doubling time of six days versus three days,
- 10:40that's sort of really, really different.
- 10:43And the credible intervals
- 10:45didn't even overlap.
- 10:49So I was really puzzled by this.
- 10:52And before I tell you what I think,
- 10:58how the Lancet paper got it wrong,
- 11:01I'd like to just show you this plot.
- 11:03You probably have seen this many times before,
- 11:05in news articles, which is just sort of a logarithm
- 11:10of the total cases versus the days
- 11:16since some time zero, for each country.
- 11:21And what you see is for both the total number of cases
- 11:26and the total number of deaths,
- 11:29it sort of grew about 100-fold in the first 20 days.
- 11:35At least among these countries
- 11:36that were most hard-hit by COVID-19.
- 11:42And if you just use that as a crude estimate
- 11:45of the doubling time, that corresponds
- 11:47to a doubling time of three days.
- 11:52Of course, this is sort of very anecdotal,
- 11:56because these data were not collected in a very careful way,
- 12:01and a lot of cases were not reported,
- 12:04but this is just to show you that perhaps
- 12:07the doubling time of 6.4 days was just a bit too long.
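The eyeballed figure can be checked directly. Assuming pure exponential growth (a back-of-the-envelope sketch, not from the talk's code), 100-fold growth over 20 days implies:

```python
import math

# Back-of-the-envelope: 100-fold growth in 20 days under exponential growth.
r = math.log(100) / 20    # implied growth rate, about 0.23 per day
td = math.log(2) / r      # implied doubling time, about 3 days
print(round(td, 2))
```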
- 12:14So, towards the end of the talk,
- 12:17I'll tell you what we think led
- 12:21to these very different results.
- 12:24Just some spoilers, so the crucial difference
- 12:30is that the Lancet study actually did not
- 12:33take into account the travel ban on January 23.
- 12:38And that actually had a very,
- 12:39very substantial selection effect on the data.
- 12:45And this will be made precise later on in the talk.
- 12:53So, for the rest of the talk,
- 12:54I'll first give you an overview of selection bias.
- 12:57So no math, just sort of an outline of what kind
- 13:01of selection bias you could encounter in COVID-19 studies.
- 13:05Then I'll talk about how we sort of overcome them,
- 13:08by sort of collecting the dataset very carefully
- 13:12and building a model very carefully.
- 13:17And then I'll talk about why
- 13:20the Lancet study I just mentioned
- 13:22and some other early analyses were severely biased.
- 13:26If there is time, I will tell you a little bit
- 13:29about our Bayesian nonparametric model.
- 13:33And then I'll give you some lessons
- 13:36I learned from this work.
- 13:40So selection bias.
- 13:42So we identified at least five kinds
- 13:46of selection bias in COVID-19 studies.
- 13:49So the first one is due to under-ascertainment.
- 13:53So this may occur if symptomatic patients
- 13:56do not seek healthcare, or could not be diagnosed.
- 14:00So essentially, all studies using cases confirmed
- 14:04when testing is insufficient,
- 14:08would be susceptible to this kind of bias.
- 14:11And there is no cure for this.
- 14:14It may lead to bias of varying direction and magnitude,
- 14:21and basically what we can do is
- 14:27think about a clever design to avoid this problem:
- 14:32focus on locations where the testing is intensive.
- 14:42The second bias is due to non-random sample selection.
- 14:48So, basically this means that the cases included
- 14:51in the study are not representative of the population.
- 14:56So this essentially applies to all studies,
- 15:03because detailed information about COVID-19 cases
- 15:06are usually sparse; they're not always published.
- 15:11But especially for studies that do not have a clear
- 15:14inclusion criterion, and if they just sort of simply
- 15:19collect data out of convenience, then there could be
- 15:25a lot of non-random sample selection bias.
- 15:30And again, statistical models are not really gonna help you
- 15:33with this kind of bias.
- 15:35You'd follow some protocol for data collection,
- 15:40and you would exclude data that do not meet
- 15:44the sample inclusion criterion,
- 15:47even when that may lead to inefficient estimates.
- 15:57The third bias is due to the travel ban.
- 16:00This is kind of my spoiler about that Lancet study.
- 16:06So basically, outbound travel from Wuhan
- 16:09to anywhere else was banned from January 23 to April 8.
- 16:16So if the study analyzed cases exported from Wuhan,
- 16:21then they're susceptible to this selection effect.
- 16:27And this would usually lead to underestimation
- 16:31of epidemic growth, and the reason is that, so,
- 16:35the epidemic is growing very fast,
- 16:37but then you essentially can't observe cases
- 16:41that were supposed to leave Wuhan after January 23.
- 16:44So if you just wait for a long time,
- 16:47and then look at the epidemic curve among the cases
- 16:50exported from Wuhan, it may appear that, ah,
- 16:55it sort of dies down a little bit,
- 16:58but that's not because of the epidemic being controlled.
- 17:01That's because of the travel ban.
- 17:04And fortunately this bias, you can correct for it
- 17:08by deriving some likelihood function
- 17:10tailored for the travel restrictions.
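The direction of this bias can be illustrated with a small simulation under made-up parameter values (a sketch, not the paper's model): infections grow exponentially up to a lockdown day L, each person has an independent uniform departure day, and only people infected before departing are observed as exported cases. A naive growth-rate estimate from the exported counts then comes out too low:

```python
import math
import random

random.seed(0)
r_true, L, n = 0.24, 53.0, 100000   # hypothetical growth rate, lockdown day

def sample_infection():
    # Inverse-CDF draw from density proportional to exp(r_true * t) on [0, L].
    u = random.random()
    return math.log(1 + u * (math.exp(r_true * L) - 1)) / r_true

exported = []
for _ in range(n):
    t = sample_infection()
    depart = random.uniform(0, L)   # day this person leaves Wuhan
    if t < depart:                  # infected before leaving => exported case
        exported.append(t)

# Naive growth rate from exported counts in two 10-day windows before L,
# ignoring the selection effect of the travel ban:
w = 10.0
early = sum(1 for t in exported if L - 2 * w <= t < L - w)
late = sum(1 for t in exported if L - w <= t < L)
r_naive = math.log(late / early) / w
print(r_naive < r_true)   # True: the selection flattens the observed curve
```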
- 15:57The fourth bias is due to ignoring
- 17:20the epidemic growth, and basically if you think about people
- 17:25who have been in Wuhan before January 23,
- 17:29they're much more likely to be infected
- 17:31towards the end of their exposure period than early,
- 17:37and that's because the epidemic was growing quickly.
- 17:42So, there are many studies, or I should say
- 17:45there are several studies of the incubation period
- 17:48that simply treat infections as uniformly distributed
- 17:52over the patients' exposure period in Wuhan.
- 17:57And this will lead to overestimation
- 17:59of the incubation period.
- 18:02Because actually, the infection time is much,
- 18:04much closer to sort of the end of their exposure.
- 18:11And this is also a bias that can be corrected for,
- 18:15by doing statistical analysis carefully.
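A minimal simulation makes the direction of this bias concrete. The numbers here are hypothetical throughout: a growth rate of 0.24 per day, a 14-day exposure window, and a gamma incubation period with mean four days:

```python
import math
import random

random.seed(1)
r, w, n = 0.24, 14.0, 100000   # hypothetical growth rate and exposure window

def sample_infection():
    # Infection time within the window, density proportional to exp(r * t).
    u = random.random()
    return math.log(1 + u * (math.exp(r * w) - 1)) / r

infection = [sample_infection() for _ in range(n)]
incubation = [random.gammavariate(2.0, 2.0) for _ in range(n)]  # true mean: 4 days
onset = [t + x for t, x in zip(infection, incubation)]

# Uniform assumption: impute the infection time as the window midpoint w/2.
est_uniform = sum(s - w / 2 for s in onset) / n
true_mean = sum(incubation) / n
print(est_uniform > true_mean)   # True: uniform imputation overestimates
```

Here the uniform-imputation estimate comes out well above the true four-day mean, because the infection times actually concentrate near the end of the window.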
- 18:21The fifth and last bias is due to right-truncation.
- 18:25So this happens in early analyses because,
- 18:30to sort of buy time in the battle against this epidemic,
- 18:36people wanted to publish fast.
- 18:38So as you all know, there's a race for publications
- 18:43about COVID-19; a lot of people sort of truncated
- 18:48the dataset before a certain time,
- 18:52but by that time the epidemic maybe
- 18:54was still quickly growing or evolving.
- 18:58And this could lead to some right-truncation bias.
- 19:03And this generally would lead to underestimation
- 19:07of the incubation period.
- 19:10The incubation period, I forgot to mention,
- 19:13is just the time from infection to showing symptoms.
- 19:20So, right-truncation would lead to underestimation
- 19:22of incubation period, because people with longer
- 19:26incubation period may not have showed symptoms
- 19:31by the time that these datasets were collected.
- 19:38So the solution to this is we need to both collect cases
- 19:45that meet the selection criterion, and continue
- 19:48the data collection for a sufficiently long time.
- 19:54Or, you derive some likelihood function to correct
- 19:59for the right-truncation.
- 20:00So we'll go over this later.
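The right-truncation effect is easy to see in a toy version (hypothetical numbers: a gamma incubation period with mean four days, and a data freeze seven days after infection):

```python
import random

random.seed(2)
n, cutoff = 100000, 7.0   # hypothetical cutoff: days from infection to data freeze

incubation = [random.gammavariate(2.0, 2.0) for _ in range(n)]  # true mean: 4 days
observed = [x for x in incubation if x <= cutoff]  # symptomatic before the freeze

mean_all = sum(incubation) / n
mean_obs = sum(observed) / len(observed)
print(mean_obs < mean_all)   # True: right-truncation shortens observed incubation
```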
- 20:04So just to recap,
- 20:07so on a very high level, there are at least five
- 20:11kinds of biases in COVID-19 analysis.
- 20:15And if you read pre-prints or news articles,
- 20:20I think you will find some kind, I mean,
- 20:24some resemblance of these biases in many studies.
- 20:30And the keys to avoid selection bias is basically,
- 20:35I mean, this is simple in words,
- 20:38but you just do everything carefully.
- 20:40You design the study carefully,
- 20:42and collect the sample carefully,
- 20:45and analyze the data carefully.
- 20:47But the reality, of course, is not that simple.
- 20:51And what I will show below is an example
- 20:55of our attempt to eliminate, or reduce, selection bias
- 21:02as much as possible.
- 21:06So, let me tell you the dataset we collected.
- 21:10So we found 14 locations in Asia,
- 21:16some are international, so Japan, South Korea, Taiwan,
- 21:21Hong Kong, Macau, Singapore.
- 21:23Some are sort of in mainland China.
- 21:27So there are several cities in mainland China.
- 21:31So all these locations have published detailed case reports
- 21:36from their first local case.
- 21:40So, most of the Chinese locations, I mean,
- 21:43they were done with the first wave of the epidemic
- 21:46by the end of February.
- 21:49So Japan, Korea and Singapore saw some resurgence
- 21:54of the epidemic later on, and eventually,
- 21:57they stopped publishing detailed case reports.
- 22:02But for our purposes, these locations all published
- 22:07detailed reports before mid-February,
- 22:11and that's about three weeks after the lockdown of Wuhan.
- 22:15So it's pretty much enough to find out
- 22:19all the Wuhan exported cases.
- 22:24So just to give you a sense of the kind of data
- 22:28that we collected, this is sort of all
- 22:32the important columns in the dataset,
- 22:36and the particularly important columns are marked in red.
- 22:42So, we collected, there was a case ID,
- 22:49where the case lived, the gender, the age,
- 22:54whether they had known epidemiological contact
- 22:57with other confirmed cases, and whether they had a
- 23:02known relationship with other confirmed cases.
- 23:07This is sort of an interesting column
- 23:09that basically we'd like to find out which cases were
- 23:15exported from Wuhan, but that's, of course, not recorded.
- 23:20I mean you can only infer that from what has been published.
- 23:27So this is an attempt to do that.
- 23:28So this column, outside column means that,
- 23:32whether we think the data collector thinks
- 23:35this case is transmitted outside Wuhan.
- 23:39So most of the time, this is relatively easy to fill.
- 23:45For example, if you've never been to Wuhan,
- 23:47this entry must be yes.
- 23:50But sometimes, this can be a little bit tricky.
- 23:52For example, this person, the fifth case in Hong Kong,
- 23:56is the husband of the fourth case in Hong Kong,
- 24:00and they traveled together from Wuhan to Hong Kong.
- 24:04So it's unclear if this case is transmitted
- 24:11in or outside Wuhan, so we put a "likely" there.
- 24:16And the other information are some dates,
- 24:20the beginning of stay in Wuhan, the end of stay in Wuhan,
- 24:26the period of exposure, which equals the span from the
- 24:30beginning to the end of the stay in Wuhan
- 24:33for Wuhan-exported cases,
- 24:35but can be different for other cases.
- 24:41When the case arrived at the final location
- 24:44where they were confirmed as a COVID-19 case.
- 24:48When the person showed symptoms.
- 24:51When did they first go to a hospital,
- 24:54and when they were confirmed as a COVID-19 case.
- 24:59So we collected about 1,400 cases with all this information.
- 25:05And overall, I think our dataset was relatively high
- 25:11in quality, and most of the cases had known symptom onset
- 25:18dates; only nine percent of them had that entry missing.
- 25:30So one important step after this is to find out
- 25:33which cases are actually exported from Wuhan.
- 25:37So I've been using this terminology from the beginning
- 25:41of the talk, but basically a case is Wuhan-exported
- 25:45if they were infected in Wuhan
- 25:50and then confirmed elsewhere.
- 25:53So we had a sample selection criterion
- 25:58to discern a Wuhan exported case.
- 26:03I'm not going to go over it in detail,
- 26:06but basically the principle we followed
- 26:09is that we would only consider a case as Wuhan exported
- 26:14if it passed a "beyond a reasonable doubt" test.
- 26:19So basically, if we think there is a reasonable doubt
- 26:21that the case could be infected elsewhere,
- 26:26then we would say: let's exclude that from the dataset.
- 26:31So this eventually gives us 378 cases.
- 26:39Next I'm gonna talk about the model we used.
- 26:46So the model is called: BETS.
- 26:48It's named after sort of four key epidemiological events.
- 26:53The beginning of exposure, the end of exposure,
- 26:56time of transmission, which is usually unobserved,
- 27:01and the time of symptom onset, S.
- 27:06So what we will do below is we'll first define the support
- 27:13of these variables, so we call that P.
- 27:17Which is basically represents the Wuhan exposed population.
- 27:24So this is the population we would like to study.
- 27:28We will then construct a generative model
- 27:31for these random variables.
- 27:34Basically, for everyone in the Wuhan exposed population.
- 27:39Then, to consider the sample selection,
- 27:42we'll define a sample selection set, D,
- 27:46that corresponds to cases that are exported from Wuhan.
- 27:51Then finally we will derive likelihood functions
- 27:54to adjust for the sample selection.
- 27:57So essentially, what we're trying to infer is
- 28:01the disease dynamics in the population, P,
- 28:05but we only have data from this sample, D.
- 28:11So here's a lot of work that needs to be done
- 28:14to correct for that sample selection.
- 28:20So intuitively, this population P is just all people
- 28:23who have stayed in Wuhan between December 1
- 28:29and January 24, so anyone who has been in Wuhan
- 28:36for maybe even just a few hours
- 28:39would count as someone exposed to Wuhan.
- 28:45And I'm going to make some conventions to simplify
- 28:51this set, P, a little bit.
- 28:54So B equal to zero has a special meaning.
- 28:59Zero is time zero,
- 29:02which is 12 AM on December 1.
- 29:06And it means that they actually started their stay in Wuhan
- 29:11before time zero, so essentially they live in Wuhan.
- 29:16And B greater than zero means these other cases
- 29:21visited Wuhan sometime in the middle of this period,
- 29:25and then they left Wuhan.
- 29:29So E equal to infinity means that the case did not arrive
- 29:33in the 14 locations we are considering
- 29:36before this lockdown time, L.
- 29:41So for the purpose of our study,
- 29:42we did not need to differentiate between people who
- 29:45have always stayed in Wuhan past time L,
- 29:49or people who left Wuhan before time L,
- 29:52but went to a different location
- 29:55other than the ones we are considering.
- 29:58So T equal to infinity means that the case
- 30:02was not infected during their stay in Wuhan.
- 30:06So they could have been infected outside Wuhan,
- 30:08or never infected at all.
- 30:12And S equal to infinity means that the case
- 30:16did not show symptoms of COVID-19.
- 30:19It can simply be that they were never infected,
- 30:22or the case actually tested positive for COVID-19
- 30:27but never showed symptoms, so they're asymptomatic.
- 30:34So under these conventions, this is the set,
- 30:38this is the support for this population, P.
- 30:41So B is between zero and L,
- 30:44E is between B and L or infinity,
- 30:47T is between B and E, which means that they are,
- 30:51in fact, in Wuhan, or infinity.
- 30:54And S is between T and infinity,
- 30:56and S can be equal to infinity.
- 31:00So now we have defined this population, P.
- 31:04And now let's look at a generative model,
- 31:09a data-generating model, for this population.
- 31:15So, by the basic law of probability,
- 31:18we could decompose the joint distribution
- 31:21of BETS into these four, and the first two
- 31:25are the distribution of B and E.
- 31:27They are related to travel.
- 31:30The third one is the distribution
- 31:32of T given B and E.
- 31:35So that's about the disease transmission.
- 31:38And the last one is the distribution of S,
- 31:41given BET, and that's related to disease progression.
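In symbols, the decomposition just described is the chain-rule factorization of the joint density (my notation; the slides may write it differently):

```latex
f(b, e, t, s) = \underbrace{f(b)\, f(e \mid b)}_{\text{travel}}
\times \underbrace{f(t \mid b, e)}_{\text{transmission}}
\times \underbrace{f(s \mid b, e, t)}_{\text{progression}}
```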
- 31:47So we need to make two basic assumptions,
- 31:50and they are important because we would like to infer
- 31:54what's going on in the population P,
- 31:57from the sample T, from these Wuhan exported cases.
- 32:02So we need to sort of make assumptions
- 32:05so we can make that extrapolation.
- 32:08So the first assumption, we assume it's about
- 32:11this disease transmission, and it basically means
- 32:14that the disease transmission is independent of travel.
- 32:18So there is a basic sort of function that's independent
- 32:22of the travel that's growing over time.
- 32:27And then the rest is a point mass at infinity.
- 32:33This T function, so, it will appear later on.
- 32:37It's the epidemic growth function.
- 32:40The second assumption is that the disease progression
- 32:43is also independent of travel.
- 32:46So, what's assumed here is basically
- 32:49that there is one minus mu of the infections,
- 32:56that are asymptomatic in that they didn't show symptoms.
- 33:00The amount of people who showed symptoms,
- 33:03the incubation period, which is just S minus T,
- 33:07follows this distribution, H.
- 33:11Okay, so H is the density of the incubation period,
- 33:14for symptomatic cases.
- 33:17And this whole distribution does not depend on B and E.
- 33:24So these are sort of the two basic assumptions
- 33:26that we relied on.
- 33:30There are two further parametric assumptions
- 33:32that were useful to simplify the interpretation,
- 33:37but they can be relaxed.
- 33:41So the next assumption is that the epidemic
- 33:45was growing exponentially before the lockdown.
- 33:51And the other assumption is that the incubation
- 33:54period is gamma-distributed, okay?
- 33:58So there's some parameters, kappa, R and alpha, beta.
- 34:05So, don't worry about nuisance parameter mu,
- 34:09which is the proportion of asymptomatic cases.
- 34:12And kappa, which is some baseline transmission.
- 34:16So it turns out that they would be canceled
- 34:19in the likelihood function, so they won't appear
- 34:23in the likelihood function.
- 34:25And (muttering) these parametric assumptions,
- 34:28they can be relaxed and they will be relaxed
- 34:32in the Bayesian parametric analysis, if I can get to there.
- 34:38But essentially, these are very useful assumptions
- 34:42that allow us to derive formulas explicitly.
- 34:50So I have covered the full data BETS model
- 34:56for the population P.
- 34:58Now we need to look at what we can observe.
- 35:02So what we can observe are people in P
- 35:07that satisfy three additional restrictions.
- 35:12The first restriction is that the transmission
- 35:15happened during their exposure in Wuhan.
- 35:23The second restriction is that the case needs to leave
- 35:27Wuhan before the lockdown time, L.
- 35:31The third restriction is that the case
- 35:33needs to show symptoms.
- 35:36So S is less than infinity.
- 35:39So some of the locations we considered
- 35:41did report a few asymptomatic cases, but overall,
- 35:46asymptomatic ascertainment was very inconsistent.
- 35:50So we only considered cases who showed symptoms.
- 35:56So this gives us the set of samples
- 36:01that we can observe in our data.
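The support of P and the selection set D can be written as simple membership checks (a hypothetical encoding of the constraints above; the lockdown time L = 53 and the day units are placeholder choices, not the paper's code):

```python
import math

L = 53.0           # lockdown time, days since time zero (hypothetical units)
INF = math.inf

def in_population(b, e, t, s):
    """(B, E, T, S) lies in the support of the Wuhan-exposed population P."""
    ok_b = 0 <= b <= L
    ok_e = (b < e <= L) or e == INF   # left Wuhan by L, or never arrived here
    ok_t = (b <= t <= e) or t == INF  # infected during the Wuhan stay, or not
    ok_s = t <= s <= INF              # symptoms after infection, or never
    return ok_b and ok_e and ok_t and ok_s

def in_sample(b, e, t, s):
    """The selection set D: infected in Wuhan, left before the lockdown,
    and symptomatic."""
    return (in_population(b, e, t, s)
            and b <= t <= e           # transmission during the Wuhan stay
            and e < INF               # left Wuhan before the lockdown L
            and s < INF)              # showed symptoms

print(in_sample(0.0, 20.0, 10.0, 15.0))   # an exported, symptomatic case
print(in_sample(0.0, INF, 10.0, 15.0))    # never left Wuhan: excluded
```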
- 36:09So, which likelihood function should we use?
- 36:15For a moment, let's just pretend that the time
- 36:17of transmission, T, is observed.
- 36:20So if we had i.i.d. samples from the population, P,
- 36:25then we could just use this product of the density
- 36:29of BETS as a likelihood function.
- 36:34But this is not something we should use,
- 36:36because we actually don't have samples from P.
- 36:40We have samples from D, so what we should do is to condition
- 36:46on the selection set, D, and use this likelihood function,
- 36:52which is basically just the density divided by the
- 36:56probability that someone is selected in this set, D.
- 37:04Okay, this is called the unconditional likelihood,
- 37:07in contrast with the conditional likelihood.
- 37:11So, in unconditional likelihood,
- 37:14we consider the joint distribution of B, E, T, and S.
- 37:18But in the conditional likelihood,
- 37:20we consider the conditional distribution of T and S,
- 37:25given B and E.
- 37:26So this is the conditional distribution of the disease
- 37:29transmission and progression, given the travel.
- 37:32So this treats travel as fixed.
- 37:35So to compute this conditional likelihood,
- 37:38we further condition on B and E, okay?
- 37:48But in reality, the time of transmission, T, is unobserved,
- 37:52so we cannot directly use the likelihood function,
- 37:55as on the last slide, so one possibility is to treat T
- 38:01as a latent variable and use, for example, an EM algorithm.
- 38:07The way we chose is to use an integrated likelihood
- 38:11that just sort of marginalizes
- 38:14over this unobserved variable, T.
- 38:19So, the unconditional likelihood is the product
- 38:23over the cases of the integral
- 38:26of the density function over T.
- 38:31And the conditional likelihood is just a product
- 38:34of the integral of the conditional distribution of T and S,
- 38:40over T.
- 38:45So, the reason we sort of considered both
- 38:49the unconditional likelihood and conditional likelihood
- 38:51is that the unconditional likelihood is a little bit
- 38:55more efficient, because it also uses information
- 39:00in the density of B and E, given that you're selected.
- 39:06So that contains a little bit of information.
- 39:09But the conditional likelihood is more robust:
- 39:12because it does not need to specify how people traveled,
- 39:18it is robust to misspecifying those distributions.
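Under the parametric assumptions (exponential growth at rate r, gamma incubation), the integrated conditional likelihood for a single case can be sketched by numerically marginalizing over the unobserved transmission time T. This is a simplified hypothetical implementation, not the paper's code; among other things it ignores right-truncation:

```python
import math

def gamma_pdf(x, alpha, beta):
    """Density of the gamma(alpha, rate beta) incubation period."""
    if x <= 0:
        return 0.0
    return beta ** alpha * x ** (alpha - 1) * math.exp(-beta * x) / math.gamma(alpha)

def conditional_likelihood(b, e, s, r, alpha, beta, L=53.0, grid=2000):
    """Density of the symptom-onset time s given (b, e) and selection into D,
    with T integrated out over [b, min(e, L)] by the midpoint rule."""
    hi = min(e, L)
    dt = (hi - b) / grid
    ts = [b + (i + 0.5) * dt for i in range(grid)]
    num = sum(math.exp(r * t) * gamma_pdf(s - t, alpha, beta) for t in ts) * dt
    den = sum(math.exp(r * t) for t in ts) * dt   # selection probability factor
    return num / den

# Density at s = 20 days for a hypothetical case exposed on days [0, 14]:
val = conditional_likelihood(0.0, 14.0, 20.0, 0.24, 2.0, 0.5)
```

Note that the nuisance constants mu and kappa would multiply both the numerator and the denominator here, which is one way to see why they cancel, as mentioned in the talk.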
- 39:24So I'll stop here and take any questions up to now.
- 39:36Is this clear to everyone?
- 39:40If so, I'm gonna proceed.
- 39:45Okay, so under these four assumptions
- 39:49that I introduced earlier, you can sort of compute
- 39:53the explicit forms of the conditional likelihood functions.
- 39:57I'm not gonna go over the detailed forms,
- 39:59but I just want to point out that first of all,
- 40:02as I mentioned earlier, this does not depend on
- 40:04the two nuisance parameters, mu and kappa.
- 40:08And second of all, this actually reduces to a likelihood
- 40:12function that was previously derived in this 2009 paper,
- 40:19by setting R equal to zero.
- 40:22So R equal to zero means that the epidemic
- 40:24was not growing, so it's a stationary epidemic.
- 40:30So that's reasonable for maybe influenza, but not for COVID.
- 40:40So for unconditional likelihood, we need to make
- 40:42further assumptions about how people traveled,
- 40:46and the assumption we used was just a very simple
- 40:51sort of uniform distribution assumption
- 40:52that assumes that the travel was stable
- 40:55in the period that we considered.
- 40:59And with those assumptions,
- 41:00we can derive the closed-form unconditional likelihood.
- 41:06There's a little bit of approximation that's needed,
- 41:09but that's very, very reasonable in this case.
- 41:18So, I'd like to show you the results
- 41:22from fitting these parametric models.
- 41:24So what we did is we obtained point estimates
- 41:28of the parameters by maximizing the likelihood functions
- 41:32I just showed you, and then we obtained 95 percent
- 41:36confidence intervals, by a likelihood ratio test.
- 41:41So, what you can see is broadly, over different locations,
- 41:46the estimated doubling time was very consistent.
- 41:52Also across conditional and unconditional likelihoods,
- 41:55so the doubling time was about two to 2.5 days.
- 42:01And the median incubation period is about four days,
- 42:07but there is a little bit of variability
- 42:11in the estimates.
- 42:14It turns out that the variability is mostly
- 42:16because of the parametric assumptions that we used.
- 42:21And then the 95 percent quantile is about,
- 42:2712 to 14 days.
- 42:29Or if you consider the sampling variability,
- 42:31that is about 11 to 15 days.
- 42:35Okay, but broadly speaking, across the different locations,
- 42:40they seem to suggest very similar answers.
- 42:47So, just to summarize, the initial doubling time
- 42:51seems to be between two to 2.5 days.
- 42:55Median incubation period is about four days,
- 42:57and 95 percent quantile is about 11 to 15 days.
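These summary numbers are mutually consistent under a gamma incubation distribution. With a hypothetical shape parameter of 1.5 (an illustrative choice, not the fitted value), rescaling so the median is four days puts the 95 percent quantile near 13 days:

```python
import random

random.seed(3)
n, shape = 200000, 1.5   # hypothetical gamma shape, not the paper's estimate

draws = sorted(random.gammavariate(shape, 1.0) for _ in range(n))
scale = 4.0 / draws[n // 2]        # rescale so the median equals 4 days
q95 = draws[int(0.95 * n)] * scale
print(11 < q95 < 15)   # True: within the 11-to-15-day range from the talk
```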
- 43:03So, those sort of were our results,
- 43:05using the parametric model.
- 43:08And next I'm going to compare it with some other
- 43:12earlier analysis, and give you a demonstration,
- 43:18or an argument of why some of the other early analysis
- 43:21were severely biased.
- 43:23So first, let's look at this Lancet paper that I mentioned
- 43:27in the beginning of the talk that estimated doubling time.
- 43:30So the doubling time they estimated was 6.4 days.
- 43:37So, what happened is these authors used a modified
- 43:44SEIR model, so the SEIR model is very common
- 43:48in epidemic modeling, and they modified that model
- 43:51to account for traveling, but they did not account
- 43:55for the travel ban.
- 43:58So, basically, to simplify what's going on,
- 44:05what they essentially did is they used the density
- 44:09of the symptom onset S in the population P,
- 44:15so they fitted this density, but they fitted it using
- 44:20samples from the selected set D.
- 44:26So it is quite reasonable to assume that the incidence
- 44:29of symptom onset was growing exponentially in the population
- 44:34that is exposed to Wuhan.
- 44:37So given P, this marginal distribution of S
- 44:42was perhaps growing exponentially before the lockdown.
- 44:47But we don't actually have samples from P.
- 44:49We have a sample from D.
- 44:52So, we can actually derive the density of S in D,
- 44:59and that looks very different from exponential growth.
- 45:03So, basically the intuition is that if you look at
- 45:06the distribution of the transmission time, T,
- 45:09it is growing exponentially, but it also has this
- 45:13extra factor: exp(rt) times (L minus t).
- 45:17So basically, if you are infected at time t,
- 45:20then you only have the time between t and L
- 45:25to leave Wuhan and be observed by us.
- 45:29Okay, so that's why it's not only exponential growth,
- 45:32but there's also a decreasing trend, L minus t,
- 45:39in the distribution of the time of transmission.
- 45:43So the distribution of the time of symptom onset
- 45:45is just that of the time of transmission,
- 45:48convolved with the distribution of the incubation period.
- 45:52And that has this form that is approximately
- 45:56an exponential growth, and then times this term,
- 46:00that is L plus some quantity that depends
- 46:03on the incubation period and the epidemic growth, minus S.
- 46:10So this is a term that is not considered,
- 46:13in this simple exponential growth model.
- 46:18Which is basically what's used in that Lancet paper.
- 46:23Okay, so to illustrate this,
- 46:26what I'm showing you here is a histogram
- 46:29of the symptom onset of all the Wuhan exported cases,
- 46:35who are also residents of Wuhan,
- 46:37covering December 1 to January 23.
- 46:43What you see is that it was kind of growing very fast,
- 46:46perhaps exponentially in the beginning,
- 46:49but then it slows down around the time of the lockdown.
- 46:55Okay, so the orange curve is the theoretical fit
- 47:00that we obtained in the last slide,
- 47:04using the maximum likelihood estimator of the parameters.
- 47:08So it fits the data quite well.
- 47:12So what happened, I think, with the Lancet paper is,
- 47:17they basically stopped around January 28th,
- 47:20so it's about here, and they essentially tried to fit
- 47:23an exponential growth from the beginning to January 28.
- 47:29And that leads to a much slower estimated growth
- 47:33than fitting the whole model that accounts for the selection.
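To make this concrete, here is a small simulation I am adding, not taken from the paper: transmission times are drawn from the selected density proportional to exp(r·t)·(L − t), incubation is ignored for simplicity, and then a naive truncated-exponential fit that ignores the (L − t) selection factor recovers a substantially slower growth rate, hence a much longer doubling time. That is the direction of the bias in the Lancet-style analysis.

```python
import math
import random

random.seed(1)
true_r, L = 0.30, 50.0  # growth rate per day; lockdown at day L

def sample_selected(n):
    # Transmission times among cases exported before the lockdown have
    # density proportional to exp(r*t) * (L - t): you must be infected
    # AND still have time left to leave Wuhan before day L.
    out = []
    while len(out) < n:
        # proposal: truncated exponential exp(r*t) on [0, L], by inverse CDF
        t = math.log1p(random.random() * math.expm1(true_r * L)) / true_r
        if random.random() < (L - t) / L:  # rejection step for the (L - t) factor
            out.append(t)
    return out

def naive_exp_mle(ts):
    # MLE that pretends the data are plain truncated-exponential exp(r*t),
    # ignoring the (L - t) selection factor, as in the naive analyses.
    S, n = sum(ts), len(ts)
    return max((r * S + n * (math.log(r) - math.log(math.expm1(r * L))), r)
               for r in (0.01 * k for k in range(1, 100)))[1]

ts = sample_selected(1000)
r_naive = naive_exp_mle(ts)
print(round(math.log(2) / true_r, 1), "days: true doubling time")
print(round(math.log(2) / r_naive, 1), "days: naive estimate, too slow")
```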
- 47:44So that's about epidemic growth.
- 47:46Next I will talk about several studies
- 47:49of the incubation period.
- 47:52So, these studies are susceptible to two kinds of biases.
- 47:57One is that some of them ignored the epidemic growth:
- 48:01so instead of using this conditional likelihood function
- 48:04with R as an unknown parameter,
- 48:06they effectively set R equal to zero,
- 48:08and then they used this likelihood function
- 48:10that was derived in an earlier paper.
- 48:15The other bias is sort of right-truncation.
- 48:20And basically, they kind of stopped
- 48:22the data collection early and only used cases
- 48:24confirmed by then, so people with long incubation periods
- 48:29are less likely to be included in the data,
- 48:33so that leads to underestimation of the incubation period.
- 48:38And a solution to this is you can actually derive
- 48:40the likelihood with an additional conditioning event,
- 48:43that S is
- 48:45less than or equal to some threshold, M.
- 48:48Suppose you stop the data collection a week after M,
- 48:52and you say: perhaps by then we have found all the cases
- 48:56who showed symptoms before M.
- 48:59Then we can use this likelihood function.
- 49:02I'm not gonna show you the exact form,
- 49:04but basically you need to further divide by
- 49:10the probability of S less than or equal to M,
- 49:14and you can obtain closed-form expression for this
- 49:18under our parametric assumptions.
- 49:22Using integration by parts.
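Here is a small simulation of that right-truncation adjustment. It is my own sketch, not the paper's code: infection times grow exponentially up to a cutoff M, incubation periods are lognormal, and a case is observed only if symptom onset happens before M. The adjusted likelihood divides each case's density by the inclusion probability P(X ≤ M − t); to keep it one-dimensional I fix the lognormal shape parameter at its true value and estimate only the median.

```python
import math
import random
from statistics import NormalDist

random.seed(2)
ND = NormalDist()
r, M = 0.30, 30.0                    # growth rate per day; collection stops at day M
mu_true, sigma = math.log(4.0), 0.7  # lognormal incubation, median 4 days (toy choice)

# Simulate: infection times grow like exp(r*t) on [0, M]; a case enters the
# dataset only if its symptom onset t + x happens before the cutoff M.
cases = []
for _ in range(10000):
    t = math.log1p(random.random() * math.expm1(r * M)) / r
    x = random.lognormvariate(mu_true, sigma)
    if t + x <= M:                   # right truncation
        cases.append((t, x))

def fit_median(adjust):
    # With sigma fixed, profile the log-likelihood over the median only.
    # The truncation adjustment divides each case's density by
    # P(X <= M - t), the probability the case could be observed at all.
    def ll(mu):
        total = 0.0
        for t, x in cases:
            z = (math.log(x) - mu) / sigma
            total += -0.5 * z * z
            if adjust:
                total -= math.log(ND.cdf((math.log(M - t) - mu) / sigma))
        return total
    return math.exp(max((ll(0.02 * k), 0.02 * k) for k in range(100))[1])

m_naive = fit_median(adjust=False)
m_adj = fit_median(adjust=True)
print(round(m_naive, 2), "days: naive median, biased downward")
print(round(m_adj, 2), "days: truncation-adjusted median, near the true 4.0")
```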
- 49:25So, I'd like to show you an experiment
- 49:29to illustrate this selection bias.
- 49:33So in this experiment, we artificially stopped the data
- 49:38collection on each day from January 23 to February 18,
- 49:43and we fitted this parametric BETS model,
- 49:48using one of the following likelihoods.
- 49:51So this is the likelihood that sets R equal to zero,
- 49:54so it's adjusted for nothing,
- 49:56and this is the likelihood derived earlier
- 49:59and used in other studies.
- 50:02This is the likelihood function that adjusts for the growth,
- 50:05so R is treated as an unknown parameter.
- 50:08And this is the likelihood on the last slide that adjusts
- 50:12for both the growth and the right-truncation,
- 50:16S less than or equal to M.
- 50:21So the point estimates are obtained by MLE,
- 50:23and the confidence intervals are obtained
- 50:25by the nonparametric bootstrap,
- 50:28and we compared our results with three previous studies.
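The nonparametric bootstrap mentioned here works by resampling cases with replacement and re-fitting. A generic sketch with made-up data; the estimator is just a sample median, standing in for the BETS maximum likelihood estimator:

```python
import math
import random

random.seed(3)

# Toy data: 200 incubation periods (days), lognormal with median 4
# (made-up numbers, standing in for the real case data).
data = [math.exp(math.log(4.0) + 0.7 * random.gauss(0, 1)) for _ in range(200)]

def estimate(sample):
    # the point estimate of interest; here simply the sample median
    s = sorted(sample)
    return s[len(s) // 2]

# Nonparametric bootstrap: resample cases with replacement, re-estimate,
# and take empirical 2.5% / 97.5% percentiles as the confidence interval.
boots = sorted(estimate([random.choice(data) for _ in data])
               for _ in range(2000))
lo, hi = boots[50], boots[1949]
print(round(estimate(data), 2), "days, 95% CI",
      (round(lo, 2), round(hi, 2)))
```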
- 50:36So this is, basically summarizes this experiment.
- 50:42This is a little bit complicated,
- 50:43so let me walk you through slowly.
- 50:48So there are three likelihood functions we used.
- 50:50One adjusts for nothing; that's the orange.
- 50:54One adjusts only for growth,
- 50:57and one adjusts for both growth and truncation.
- 51:02Okay, so what you can immediately see
- 51:04is that if we adjusted for nothing,
- 51:08then the estimate is much larger
- 51:14than the other estimates.
- 51:18So actually, if you adjusted for nothing,
- 51:20and if you sort of used our entire data set,
- 51:23the median incubation period would be about nine days.
- 51:27And the 95 percent quantile would be about 25 days.
- 51:31So that's just way too large.
- 51:35And if you ignored right-truncation, for example,
- 51:38if you used this likelihood function we derived earlier,
- 51:43that only accounts for growth, you underestimate
- 51:48the incubation period in the beginning, as expected,
- 51:51but you slowly converge to this final estimate.
- 51:57And if you use this likelihood function and adjust for both
- 52:00growth and truncation, you actually get
- 52:03some quite sensible results by the end of January.
- 52:09So, it has large uncertainty, but it's roughly unbiased,
- 52:14and it kind of eventually converges to that estimate.
- 52:18The same estimate that we obtained
- 52:23using the blue curve, but using the full data.
- 52:30So, for the sake of time, I think I'll skip the part
- 52:36about Bayesian nonparametric inference.
- 52:40One thing that's a little bit interesting, I think,
- 52:43is there seems to be some difference between men
- 52:48and women in their incubation period.
- 52:51So these are sort of the posterior mean
- 52:54and posterior credible intervals for the nonparametric
- 53:01incubation period distribution, and you can see that men
- 53:04seem to develop symptoms quicker than women.
- 53:11So, that's a little bit interesting,
- 53:14and maybe, I mean, I'm not a doctor,
- 53:18but it could be related to the observation
- 53:22that men seem to be more susceptible,
- 53:24and die more frequently than women.
- 53:31So let's, let me conclude this talk.
- 53:34So these are some conclusions we found about COVID-19,
- 53:40using our dataset and our model.
- 53:43Initial doubling time in Wuhan was about two to 2.5 days.
- 53:50The median incubation period is about four days,
- 53:52and the proportion of incubation period
- 53:55above 14 days is about five percent.
- 54:00There are a number of limitations for our study.
- 54:03For example, we used the symptom onsets reported
- 54:07by the patients, and they are not always accurate.
- 54:11There could be behavioral reasons for people
- 54:13to report a later symptom onset.
- 54:18Even though these locations are intensive in their testing
- 54:21and contact tracing, some degree of under-ascertainment
- 54:25is perhaps inevitable.
- 54:28As I have shown you, in our dataset collection,
- 54:34discerning the Wuhan-exported cases
- 54:36is not a black-and-white decision.
- 54:39We used this "beyond a reasonable doubt" kind of criterion,
- 54:43but that's only one criterion you could apply.
- 54:47And the crucial assumptions are the first
- 54:51two assumptions, which mean that the travel
- 54:53and disease are independent, and that can be violated.
- 54:57For example, if people tend to cancel
- 55:02their travel plans when feeling sick.
- 55:09Nevertheless, I think I have demonstrated some very
- 55:12compelling evidence for selection bias in early studies.
- 55:17Some of the biases you can correct by designing the study
- 55:25more carefully, some require more sophisticated
- 55:29statistical adjustments.
- 55:33And basically, I think the conclusion is:
- 55:37you should not make uncalculated BETS.
- 55:41So, we should always carefully design the study
- 55:44and adhere to our sample inclusion criteria.
- 55:48And the statistical inference should not be based
- 55:53on some intuitive calculations,
- 55:55but should be based on first principles.
- 55:58So in this study, we kind of went back all the way
- 56:00to defining the support of random variables.
- 56:05So that's sort of statistics 101.
- 56:08But that's actually, it's extremely important.
- 56:11So I found it really helpful to start all the way
- 56:15from the beginning and develop a generative model.
- 56:20And that avoids a lot of potential selection biases.
- 56:25So the final lesson I'd like to share from this whole study
- 56:29is that I think this demonstrates that data quality
- 56:34and better design are much more important
- 56:38than data quantity and better modeling,
- 56:42in many real data studies.
- 56:46Thanks for the attention,
- 56:47and I'll take any questions from here.
- 56:51- Thanks to you for the nice talk.
- 56:54Does anyone have questions for Qingyuan?
- 57:00So Qing, I think someone, ah,
- 57:04yeah, Joe sent you a question.
- 57:07- Okay.
- 57:09- Is there any information in the datasets on whether
- 57:12a patient is a healthcare worker?
- 57:15- No, these are not usually healthcare workers.
- 57:19These are exported from Wuhan, so they're usually
- 57:21just people who traveled maybe for sightseeing,
- 57:24or for the Chinese New Year, they traveled from Wuhan
- 57:28to other places and were diagnosed there.
- 57:34- Right, so also he has another question,
- 57:38Joe has another question also: how can we evaluate
- 57:41the effectiveness of social distancing and mask guidelines?
- 57:49- I think this study we did was not designed
- 57:53to answer those questions.
- 57:57We did have a very,
- 58:01sort of, preliminary analysis.
- 58:03So we broke the study period into two parts.
- 58:08So on January 20, it was confirmed publicly
- 58:12that the disease was human-to-human transmissible,
- 58:16so we broke the period into two parts:
- 58:20those before January 20 and those after January 20.
- 58:25But the after period is just three days.
- 58:27So January 21, 22, 23, and we found that if we fit
- 58:33different growths to these two periods, the second period,
- 58:36it seemed that the growth was substantially slower.
- 58:42The growth, the exponent R is not quite zero,
- 58:48but it's close.
- 58:50So it seems that the knowledge of sort
- 58:52of human-to-human transmissibility and the fact that,
- 58:58I think, masks are probably much more,
- 59:01were much more available in Wuhan,
- 59:03people started to do some social distancing
- 59:08right after January 20.
- 59:11I think that seemed to play a role.
- 59:14But that's very, very preliminary,
- 59:17and I think there are a lot of good studies about this now.
- 59:25- Donna has a question.
- 59:26Donna, do you want to say what your question is?
- 59:32- [Donna] Yeah, sure, thanks.
- 59:33That was a very interesting and clear talk.
- 59:36I really appreciated the way you carefully went through,
- 59:40step by step, to show-- (audio distorting)
- 59:47Who aren't doing that, I feel.
- 59:50But my question was, it was still hard for me to tell
- 59:54to what extent your estimates were identifiable
- 59:59due to assumptions and to what extent the data
- 01:00:04made the estimates fairly identifiable.
- 01:00:09- Yeah so essentially, I mean, selection bias,
- 01:00:12usually you cannot always avoid it, unless you
- 01:00:17make some kind of missing at random type of assumption.
- 01:00:22Here, we don't have a random selection.
- 01:00:25It's more like a deterministic selection,
- 01:00:27and we can quantify that selection event,
- 01:00:30but still, as you said, I think these are great questions
- 01:00:37to sort of disentangle the nonparametric assumptions
- 01:00:41needed for identification and the parametric assumptions
- 01:00:44needed for sort of better and easier inference.
- 01:00:51I don't have a formal result,
- 01:00:53but my feeling is the first two assumptions
- 01:00:56that are assumed, sort of the independence of travel
- 01:01:00and disease, that's sort of essential to the identification.
- 01:01:07And then later on, the assumptions are perhaps relaxable.
- 01:01:14So we did try to relax those
- 01:01:15in the Bayesian nonparametric analysis.
- 01:01:19But that's not a proof, so that's my, ah,
- 01:01:25best guess at this point.
- 01:01:28- [Donna] Thank you.
- 01:01:32- From Casey: people have estimated
- 01:01:39that about five to 80 percent
- 01:01:41of infections are asymptomatic, and isn't that a limitation
- 01:01:46of your model that you did not account
- 01:01:48for asymptomatic carriers?
- 01:01:50And if so, how can we possibly model it,
- 01:01:52given the large range of estimates?
- 01:01:55So this is actually a feature of our study,
- 01:01:59because we actually had, let's see,
- 01:02:05we had a term for the asymptomatic transmission.
- 01:02:14It's just that that parameter was canceled.
- 01:02:19So this parameter, mu, or one minus mu,
- 01:02:22is the proportion of asymptomatic infections.
- 01:02:29But then, because we only observed cases
- 01:02:36who showed symptoms, actually in the likelihood,
- 01:02:39this parameter mu got canceled.
- 01:02:43So, of course the reason we could cancel that mu
- 01:02:46is because of this assumption, too,
- 01:02:48that S is independent of the travel.
- 01:02:53So that's important.
- 01:02:55But once you assume that, you actually
- 01:03:00don't need to worry about asymptomatic transmission,
- 01:03:04and on the other hand, this dataset, or this whole method,
- 01:03:08also provides no information about the proportion
- 01:03:11of asymptomatic infection.
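Schematically, the cancellation works as follows; this is my simplified notation, not the paper's exact likelihood. If $\mu$ is the probability that an infection becomes symptomatic, then every factor of the conditional likelihood carries $\mu$ in both numerator and denominator:

```latex
L(\theta) \;\propto\; \prod_{i}
  \frac{\mu \, f_\theta(b_i, t_i, s_i)}
       {\mu \, \Pr_\theta(\text{symptomatic and sampled})}
\;=\; \prod_{i}
  \frac{f_\theta(b_i, t_i, s_i)}
       {\Pr_\theta(\text{symptomatic and sampled})},
```

so the fit carries no information about $\mu$: the design is robust to, but also uninformative about, asymptomatic transmission.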
- 01:03:15Hopefully that'll answer your question.
- 01:03:17- [Casey] Yeah, thanks; so you account for it
- 01:03:19by saying it's not really significant, in your estimate?
- 01:03:24- Yeah, so in the likelihood, you will get canceled.
- 01:03:26So it doesn't appear in the likelihood.
- 01:03:28So the likelihood of the data does not depend
- 01:03:30on how many are asymptomatic, because we only look
- 01:03:36at cases who are symptomatic.
- 01:03:39So the incubation period that we estimated
- 01:03:41is also the incubation period among
- 01:03:44those people who showed symptoms.
- 01:03:47- [Casey] So it's an elegant way of sidestepping
- 01:03:49the question, (laughing) in a way.
- 01:03:52- Well, it's not a sidestep, it's sort of,
- 01:03:56it's a limitation of this design.
- 01:04:00So the whole design should be robust
- 01:04:04to asymptomatic transmission, and it also gives
- 01:04:07no information about asymptomatic transmission.
- 01:04:12- [Casey] Yeah, I was really impressed at the way
- 01:04:13you took on that Lancet article and just really, ah,
- 01:04:18it was really impressive; what a great talk.
- 01:04:20Thank you so much.
- 01:04:22- Well thank you.
- 01:04:27- Hi Qing I have a question.
- 01:04:29So you mentioned before that the measurements
- 01:04:33inside of Wuhan, or the
- 01:04:37measurements that we have inside Wuhan,
- 01:04:39the numbers, aren't very accurate, due to various reasons.
- 01:04:42So I'm wondering: if you calculate the doubling time
- 01:04:46using the data for Wuhan city,
- 01:04:50using the measurements
- 01:04:53from before they changed the criterion for what counts
- 01:04:58as a confirmed case, and using the data from before, say,
- 01:05:02the lockdown, but taking into consideration
- 01:05:04that you only looked at
- 01:05:07the confirmed cases before that date,
- 01:05:10will you get a similar measurement,
- 01:05:13a similar estimate as when using the traveling data,
- 01:05:16or is it much worse?
- 01:05:19- Yeah, people have done an analysis on the data from Wuhan.
- 01:05:26What I would like to point out is that this figure
- 01:05:29is only the number of new, confirmed cases.
- 01:05:34So what is usually done in epidemic analysis
- 01:05:36is they don't look at the number of confirmed cases,
- 01:05:40but the number of cases who showed symptoms on a certain day
- 01:05:45because that's usually less variable, less noisy,
- 01:05:51than this sort of confirmation,
- 01:05:55because of the problem about confirmation.
- 01:05:59So people have done that, and I don't see a doubling time
- 01:06:06estimation from that; there was a journal paper on that.
- 01:06:13And there was also a very interesting comment on it
- 01:06:18that criticized some of its methodology.
- 01:06:21I didn't see a doubling time estimate.
- 01:06:25So they seemed to focus on the R-naught of the epidemic.
- 01:06:31I actually had thought about that as well,
- 01:06:34and in this study I have presented,
- 01:06:37I intentionally avoided estimating R-naught.
- 01:06:41Because I think there were a lot of issues with
- 01:06:47finding an unbiased estimate of the serial interval,
- 01:06:52which is very important in estimating R-naught.
- 01:06:56So, this estimate we found is not directly comparable
- 01:07:05to that journal paper, I guess.
- 01:07:08But so what happened, I think, is around late January,
- 01:07:12early February, a lot of people have tried to estimate
- 01:07:17the R-naught and the doubling time of the epidemic,
- 01:07:21and what I've found interesting was
- 01:07:23there were kind of two modes.
- 01:07:25There were several papers that estimated the doubling time
- 01:07:29was about six to seven days, and there were several papers
- 01:07:31that estimated doubling times of about two to four days.
- 01:07:37And I think, ah,
- 01:07:41at least I have shown, for the Lancet paper,
- 01:07:45that their whole method seems to be very flawed.
- 01:07:50But whether this means that our estimate is very close
- 01:07:54to the truth, it doesn't necessarily mean so.
- 01:07:58Because we also have a lot of limitations.
- 01:08:02- Okay, thanks.
- 01:08:09Any more questions for Qingyuan?
- 01:08:14Okay, thanks Qing.
- 01:08:16I guess that's all for today, and it was a great talk.
- 01:08:20If you have any more questions for Qing,
- 01:08:21you can send him an email, and you can find his email
- 01:08:25on his website, okay?
- 01:08:29- Okay.
- 01:08:32All right, okay, thank you everyone.
- 01:08:35- Thank you, oh, we got a new message?
- 01:08:40- It's just a, Keyong said thank you.
- 01:08:43- Okay, okay, bye!
- 01:08:45- [Qingyuan] All right, bye.