YSPH Biostatistics Seminar: "Three Challenges Confronting Spatiotemporal Hawkes Models"
Andrew Holbrook, PhD, Assistant Professor, Department of Biostatistics, University of California, Los Angeles
September 28, 2021
- 00:00<v Man>Good afternoon, everybody.</v>
- 00:02Good morning, Professor Holbrook.
- 00:05Today I'm honored to introduce Professor Andrew Holbrook.
- 00:08So Professor Holbrook earned his bachelor's from UC Berkeley
- 00:11and a statistics master's and PhD from UC Irvine.
- 00:15His research touches a number of areas
- 00:17of biomedical interest,
- 00:18including Alzheimer's and epidemiology.
- 00:22He's currently an assistant professor
- 00:24of biostatistics at UCLA, where he teaches their advanced
- 00:27Bayesian computing course.
- 00:29And he's the author of several pieces
- 00:30of scientific software.
- 00:32All of it, I think, shows that he's very fond of parallelization,
- 00:37and he also has packages, including one for studying
- 00:40Hawkes processes, which he's going to tell us...
- 00:44Well, he's gonna tell us about the biological phenomenon
- 00:46and what's going on today.
- 00:48So Professor Holbrook, thank you so much.
- 00:52<v ->Okay, great.</v>
- 00:53Thank you so much for the kind invitation,
- 00:57and thanks for having me this morning slash afternoon.
- 01:02So today I'm actually gonna be kind of trying to present
- 01:06more of a high level talk that's gonna just focus on
- 01:10a couple of different problems that have
- 01:14come up when modeling Hawkes processes
- 01:18for public health data, and in particular
- 01:21for large scale public health data.
- 01:24So, today I'm interested in spatiotemporal data
- 01:28in public health, and this can take a number
- 01:30of different forms.
- 01:33So a great example of this is in Washington D.C.
- 01:39Here, I've got about 4,000 gunshots.
- 01:42You'll see this figure again,
- 01:44and I'll explain the colors to you
- 01:46and everything like that.
- 01:49But I just want you to see that in the year 2018 alone,
- 01:53there were 4,000 gunshots recorded in Washington DC.
- 01:57And this is just one example of really a gun violence
- 02:01problem in the U.S. of epidemic proportions.
- 02:07But spatiotemporal public health data
- 02:10can take on many forms.
- 02:11So here, for example, I have almost 3,000 wildfires
- 02:18in Alaska between the years 2015 and 2019.
- 02:24And this is actually just one piece of a larger
- 02:30trend that's going on in the American west.
- 02:35And then finally, another example of spatiotemporal public
- 02:39health data is, and I believe that we don't need to spend
- 02:44too much time on this motivation,
- 02:46but it's the global spread of viruses.
- 02:48So for example, here, I've got 5,000 influenza cases
- 02:52recorded from 2000 through 2012.
- 02:58So if I want to model this data,
- 03:00what I'm doing is I'm modeling event data.
- 03:02And one of the classic models for doing so,
- 03:06really the canonical stochastic process
- 03:12in this context, is the Poisson process.
- 03:14And I hope that you'll bear with me if we do just a little
- 03:18bit of review for our probability 101.
- 03:21But we say that a counting process
- 03:24is a homogeneous Poisson process, a point process
- 03:28with rate parameter lambda,
- 03:32which is greater than zero.
- 03:34If this process is equal to zero at time zero,
- 03:39if it has independent increments, that is,
- 03:43if its increments over non-overlapping intervals
- 03:48are independent random variables.
- 03:50And then finally, if its increments
- 03:52are Poisson distributed with mean given
- 03:57by that rate parameter lambda
- 04:00times the difference in the times.
- 04:04So we can make this model
- 04:07just a very little bit more complex.
- 04:09We can create an inhomogeneous Poisson point process,
- 04:13simply by saying that that rate parameter
- 04:16is no longer fixed, but itself is a function
- 04:20over the positive real line.
- 04:22And here everything is the exact same,
- 04:24except now we're saying that its increments,
- 04:28its differences over two different time periods,
- 04:30are Poisson distributed, where now the mean is simply given
- 04:35by the definite integral over that interval.
- 04:40So we just integrate that rate function.
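For reference, the two definitions just given can be written compactly as below; N(t) is an assumed name for the counting process, which the slides may label differently.

```latex
% Homogeneous Poisson process with rate \lambda > 0:
N(0) = 0, \qquad N(t) - N(s) \;\sim\; \mathrm{Poisson}\big(\lambda\,(t - s)\big), \quad 0 \le s < t,
% with increments over non-overlapping intervals independent.
%
% Inhomogeneous Poisson process with rate function \lambda(t) \ge 0 (same first two conditions):
N(t) - N(s) \;\sim\; \mathrm{Poisson}\!\left(\int_s^t \lambda(u)\,du\right).
```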
- 04:44Okay.
- 04:45So then how do we choose our rate function for the problems
- 04:48that we're interested in?
- 04:49Well, if we return to say the gun violence example,
- 04:53then it is plausible that at least sometimes some gun
- 04:58violence might precipitate more gun violence.
- 05:03So here we would say that having observed an event,
- 05:09having observed gunshots at a certain location
- 05:12at a certain time, we might expect that the probability
- 05:15of observing gunshots nearby and soon after is elevated,
- 05:23and the same could plausibly go for wildfires as well.
- 05:28It's that having observed a wildfire in a certain location,
- 05:33this could directly contribute to the existence
- 05:39or to the observation of other wildfires.
- 05:42So for example, this could happen by natural means.
- 05:45So we could have embers that are blown by the wind,
- 05:51or there could be a human that is in fact
- 05:54causing these wildfires, which is also quite common.
- 06:01And then it's not a stretch at all
- 06:03to believe that viral observation,
- 06:08so a child sick with influenza could precipitate
- 06:12another child that becomes sick with influenza
- 06:16in the same classroom and perhaps on the next day.
- 06:23So then, the solution to building this kind of dynamic into
- 06:27an inhomogeneous Poisson process is simply to craft
- 06:33the rate function in a way that is asymmetric in time.
- 06:37So here is just a regular temporal Hawkes process.
- 06:43And what we do is we divide this rate function, lambda T,
- 06:48which I'm showing you in the bottom of the equation,
- 06:51into a background portion, which here
- 06:55I denote nu, and this nu can be a function itself.
- 07:00And then we also have this self excitatory component C of T.
- 07:04And this self excitatory component for time T,
- 07:08it depends exclusively on observations
- 07:13that occur before time T.
- 07:17So each tn, where tn is less than T,
- 07:22is able to contribute information
- 07:25in some way to this process.
- 07:29And typically G is our triggering function.
- 07:32G is non-increasing.
- 07:37And then the only other thing that we ask
- 07:40is that the different events contribute
- 07:42in an additive manner to the rate.
- 07:45So here, we've got the background rate in this picture.
- 07:49We have observation T1.
- 07:50The rate increases.
- 07:53It slowly decreases.
- 07:55We have another observation, the rate increases.
- 07:57And what you see is actually that after T1,
- 08:00we have a nice little bit of self excitation as it's termed,
- 08:04where we observe more observations.
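In symbols, the temporal Hawkes rate just described looks roughly like this, keeping the transcript's nu for the background and C(t) for the self-excitatory component, with g the non-increasing triggering function:

```latex
\lambda(t) \;=\; \nu(t) \;+\; C(t),
\qquad
C(t) \;=\; \sum_{n \,:\, t_n < t} g(t - t_n).
```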
- 08:09This model itself can be made just a little bit more complex
- 08:13if we add a spatial component.
- 08:14So here now, is the spatiotemporal Hawkes process
- 08:18where I'm simply showing you the background process,
- 08:22which now I'm allowing to be described
- 08:26by a rate function over space.
- 08:29And then, we also have the self excitatory component,
- 08:32which again, although it also involves
- 08:35a spatial component in it,
- 08:37it still has this asymmetry in time.
- 08:40So in this picture, we have these,
- 08:42what are often called immigrant events
- 08:44or parent events in black.
- 08:49And then we have the child events,
- 08:50the offspring from these events described in blue.
- 08:53So this appears to be a pretty good stochastic process model,
- 08:58which is not overly complex, but is simply complex enough
- 09:02to capture contagion dynamics.
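The spatiotemporal version simply gives both components a spatial argument while keeping the asymmetry in time; a sketch of the form being described, with notation as above and x_n the location of event n:

```latex
\lambda(t, \mathbf{x}) \;=\; \nu(t, \mathbf{x}) \;+\; C(t, \mathbf{x}),
\qquad
C(t, \mathbf{x}) \;=\; \sum_{n \,:\, t_n < t} g\big(t - t_n,\; \mathbf{x} - \mathbf{x}_n\big).
```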
- 09:08So for this talk, I'm gonna be talking about some major
- 09:11challenges that are confronting real data analysis
- 09:17using the Hawkes process.
- 09:19So very applied in nature, and these challenges persist
- 09:23despite the use of a very simple model.
- 09:26So basically, all the models that I'm showing you today
- 09:29are variations on this extremely simple model,
- 09:33as far as the Hawkes process literature goes.
- 09:35So we assume an exponential decay triggering function.
- 09:40So here in this self excitatory component,
- 09:42what this looks like is the triggering function
- 09:47is simply the exponentiation of negative omega
- 09:52times T minus tn, where one over omega
- 09:56is some sort of length scale.
- 09:58Again, T minus tn is that difference between a time T
- 10:01and preceding event times.
- 10:05And then we're also assuming Gaussian kernel
- 10:07spatial smoothers, very simple.
- 10:09And then finally, another simplifying assumption
- 10:12that we're making is separability.
- 10:14So, in these individual components of the rate function,
- 10:20we always have separation between the temporal component,
- 10:25here on the left, and the spatial component
- 10:28on the right, and this is a simplifying assumption.
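Putting the three assumptions together, the self-excitatory component used in this talk looks roughly as follows; theta, omega, and sigma_x are assumed names for the self-excitatory weight, the temporal decay rate, and the spatial length scale, and separability shows up as the product of a temporal factor and a spatial factor:

```latex
C(t, \mathbf{x}) \;=\; \theta \sum_{n \,:\, t_n < t}
\underbrace{\omega\, e^{-\omega\,(t - t_n)}}_{\text{exponential decay in time}}
\;\times\;
\underbrace{\frac{1}{2\pi \sigma_x^2}
\exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^2}{2\sigma_x^2}\right)}_{\text{Gaussian kernel in space}}.
```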
- 10:34So what are the challenges that I'm gonna present today?
- 10:37The first challenge is big data, because when we are modeling
- 10:42many events, what we see is that the computational complexity
- 10:46of actually carrying out inference,
- 10:51whether using maximum likelihood or using, say,
- 10:54Markov chain Monte Carlo,
- 10:56is actually gonna explode quickly.
- 10:59Something else is the spatial data precision.
- 11:02And this is actually related to big data.
- 11:06As we accrue more data,
- 11:08it's harder to guarantee data quality,
- 11:11but then also the tools that I'm gonna offer up to actually
- 11:15deal with poor spatial data precision are actually
- 11:18gonna also suffer under a big data setting.
- 11:21And then finally, big models.
- 11:24So, you know, when we're trying to draw very specific
- 11:27scientific conclusions from our model, then what happens?
- 11:31And all these data, excuse me,
- 11:33all these challenges are intertwined,
- 11:34and I'll try to express that.
- 11:39Finally today, I am interested in scientifically
- 11:43interpretable inference.
- 11:46So, I'm not gonna talk about prediction,
- 11:48but if you have questions about prediction,
- 11:51then we can talk about that afterward.
- 11:53I'm happy to.
- 11:57Okay.
- 11:58So I've shown you this figure before,
- 12:00and it's not the last time that you'll see it.
- 12:02But again, this is 4,000 gunshots in 2018.
- 12:05This is part of a larger dataset that's made available
- 12:07by the Washington DC Police Department.
- 12:12And in fact, from 2006 to 2018,
- 12:15we have over 85,000 potential gunshots recorded.
- 12:20How are they recorded?
- 12:21They're recorded with the help of an acoustic gunshot
- 12:24locator system that uses the actual acoustics
- 12:28to triangulate the time and the location
- 12:32of the individual gunshots.
- 12:35So in a 2018 paper, Charles Loeffler and Seth Flaxman,
- 12:40they used a subset of this data in a paper entitled
- 12:44"Is Gun Violence Contagious?"
- 12:46And they in fact apply a Hawkes process model
- 12:49to try to determine
- 12:51the answer to their question.
- 12:53But in order to do so, though,
- 12:55they had to significantly subset the data.
- 12:57They took roughly 10% of the data.
- 13:00So their conclusions were in fact in the affirmative:
- 13:02they were able to detect
- 13:06this kind of contagion dynamics.
- 13:11But the question is, do their results hold
- 13:14when we analyze the complete data set?
- 13:18So for likelihood based inference,
- 13:20which we're going to need to use in order to learn,
- 13:25in order to apply the Hawkes process to real-world data,
- 13:30the first thing to see is that the likelihood
- 13:34takes on the form of an integral term on the left.
- 13:39And then we have a simple product of the rate function
- 13:43evaluated at our individual events, observed events.
- 13:50And when we consider the log likelihood,
- 13:53then it in fact will involve this term that I'm showing you
- 13:58on the bottom line, where it's the sum
- 14:00of the log of the, again, the rate function evaluated
- 14:04at the individual events. (background ringing)
- 14:07I'm sorry.
- 14:08You might be hearing a little bit of the sounds
- 14:10of Los Angeles in the background, and there's very little
- 14:14that I can do about Los Angeles.
- 14:16So moving on.
- 14:19So this summation in the log likelihood
- 14:25actually involves a double summation.
- 14:28So it is the sum over all of our observations,
- 14:32of the log of the rate function.
- 14:34And then, again, the rate function because of the very
- 14:37specific form taken by the self excitatory component
- 14:41is also gonna involve this summation.
- 14:45So the upshot is that,
- 14:49every time we evaluate the log likelihood,
- 14:53we're going to need to evaluate N choose two terms,
- 14:59where N is the number of data points,
- 15:01in this summation right here,
- 15:06and then we're gonna need to sum them together.
- 15:09And then the gradient also features this,
- 15:16quadratic computational complexity.
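Written out, the log-likelihood for events (t_1, x_1), ..., (t_N, x_N) observed over a window takes roughly the following form; the inner sum sitting inside the log is the double summation just described, and evaluating all of its terms costs on the order of N choose 2 operations:

```latex
\ell \;=\; -\int_0^T\!\!\int_{\mathcal{X}} \lambda(t, \mathbf{x})\, d\mathbf{x}\, dt
\;+\; \sum_{n=1}^{N} \log\!\Big(
\nu(t_n, \mathbf{x}_n) \;+\;
\sum_{m \,:\, t_m < t_n} g\big(t_n - t_m,\; \mathbf{x}_n - \mathbf{x}_m\big)
\Big).
```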
- 15:21So the solution, the first solution that I'm gonna offer up
- 15:23is not a statistical solution.
- 15:25It's a parallel computing solution.
- 15:27And the basic idea is, well, all of these terms that we need
- 15:31to sum over, evaluate and sum over, let's do it all at once
- 15:36and thereby speed up our inference.
- 15:41I do so using multiple computational tools.
- 15:44So the first one is just multi-core CPUs.
- 15:50These can have anywhere from two to 100 cores.
- 15:54And then I combine this with something called SIMD,
- 15:58single instruction multiple data, which is vectorization.
- 16:02So the idea, the basic idea is that I can apply a function,
- 16:09the same function, the same instruction set to an extended
- 16:13register or vector of input data, and thereby speed up
- 16:20my computing by a factor that is proportional
- 16:24to the size of the vector that I'm evaluating
- 16:27my function over.
- 16:29And then, I actually can do something better than this.
- 16:33I can use a graphics processing unit,
- 16:35which instead of hundreds of cores, has thousands of cores.
- 16:39And instead of SIMD, or it can be interpreted as SIMD,
- 16:42but Nvidia likes to call it single instruction
- 16:45multiple threads, or SIMT.
- 16:47And here, the major difference
- 16:50is the scale at which it's occurring.
- 16:54And then, the other difference is that actually
- 16:58individual threads or small working groups of threads
- 17:01on my GPU can work together.
- 17:03So actually the tools that I have available are very complex
- 17:07and require a lot of care.
- 17:10There's a lot of need to carefully code this up.
- 17:13The solution is not statistical, but it's very much
- 17:18an engineering solution.
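A minimal sketch of the computation being parallelized, in Python with NumPy standing in for the hand-tuned C++/SIMD/CUDA code; the parameter names omega, sigma, and nu are placeholders for illustration, not the software's actual interface:

```python
import numpy as np

def hawkes_event_loglik_sum(times, locs, omega, sigma, nu=1.0):
    """Evaluate sum_n log( nu + sum_{m: t_m < t_n} g(t_n - t_m, x_n - x_m) ).

    times : (N,) event times; locs : (N, 2) event locations.
    The N-by-N pairwise terms built here are the same terms that the talk's
    software spreads across SIMD lanes, CPU threads, or GPU threads before reducing.
    """
    dt = times[None, :] - times[:, None]                        # dt[m, n] = t_n - t_m
    dt = np.where(dt > 0.0, dt, np.inf)                         # only earlier events trigger
    dx2 = ((locs[None, :, :] - locs[:, None, :]) ** 2).sum(-1)  # squared spatial distances
    trig = (omega * np.exp(-omega * dt)
            * np.exp(-dx2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2))
    rates = nu + trig.sum(axis=0)                               # total rate at each observed event
    return np.log(rates).sum()
```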
- 17:19But the results are really, really impressive
- 17:24from my standpoint.
- 17:27So on the left, I'm comparing relative speed-ups against
- 17:32a very fast single-core SIMD implementation.
- 17:40So my baseline right here is the bottom of this blue curve.
- 17:44The X axis is giving me the number of CPU threads
- 17:48that I'm using, between one and 18.
- 17:52And then, the top line is not using CPU threads,
- 17:55so I just draw a top line that's flat.
- 17:58These are the GPU results.
- 18:01If I don't use SIMD, if I use non-vectorized
- 18:04single-core computing, of course, this is still
- 18:06a pre-compiled C++ implementation.
- 18:08So it's fast, or at least faster than R,
- 18:11and I'll show you that on the next slide.
- 18:13If I do that, then AVX is twice as fast.
- 18:17As I increased the number of cores,
- 18:21my relative speed up increases,
- 18:24but I also suffer diminishing returns.
- 18:28And then that is actually all these simulations
- 18:31on the left-hand plot.
- 18:33That's for a fixed amount of data.
- 18:34That's 75,000 randomly generated data points
- 18:38at each iteration of my simulation.
- 18:42But I can also just look at the seconds per evaluation.
- 18:45So that's my Y axis on the right-hand side.
- 18:49So ideally I want this to be as low as possible.
- 18:53And then I'm increasing the number of data points
- 18:56on the Y axis, on the X axis, excuse me.
- 19:00And then as the number of threads that I use,
- 19:03as I increased the number of threads,
- 19:05then my implementation is much faster.
- 19:08But again, you're seeing this quadratic computational
- 19:12complexity at play, right.
- 19:14All of these lines are looking rather parabolic.
- 19:18Finally, I go down all the way to the bottom,
- 19:21where I've got my GPU curve,
- 19:22again, suffering from that quadratic
- 19:25computational complexity,
- 19:27which we can't get past, but doing a much better job
- 19:31than the CPU computing.
- 19:32Now you might ask, well, you might say,
- 19:35well, a 100 fold speed up is not that great.
- 19:38So I'd put this in perspective and say, well,
- 19:41what does this mean for R, which I use every day?
- 19:45Well, what it amounts to,
- 19:49and here, I'll just focus on the relative speed up
- 19:51over the R implementation on the right.
- 19:55The GPU is reliably over 1000 times faster.
- 20:04So the way that Charles Loeffler and Seth Flaxman
- 20:12obtained a subset of their data was actually
- 20:16by thinning the data.
- 20:21They needed to do so because of the sheer computational
- 20:24complexity of using the Hawkes model.
- 20:27So, I'm not criticizing this in any way,
- 20:30but I'm simply pointing out why our results
- 20:34using the full data set, differ.
- 20:36So on the left, on the top left,
- 20:40we have the posterior density for the spatial length scale
- 20:44of the self excitatory component.
- 20:46And when we use the full data set,
- 20:48then we believe that we're operating more at around 70
- 20:51meters instead of the 126 inferred in the original paper.
- 20:56So one thing that you might notice is our posterior
- 21:01densities, in blue, are much more concentrated
- 21:08than the original analysis, in salmon.
- 21:12And this of course makes sense.
- 21:14We're using 10 times the amount of the data.
- 21:18Our temporal length scale
- 21:20is also, we believe, much smaller, in fact.
- 21:24So now it's down to one minute instead of 10 minutes.
- 21:28Again, this could be interpreted
- 21:29as the simple result of thinning.
- 21:32And then finally, I just want to focus on
- 21:35the green posterior density.
- 21:41This is the proportion of events that we interpret
- 21:45as arising from self-excitation or contagion dynamics.
- 21:50Experts believe that anywhere between 10 and 18% of gun
- 21:56violence events are retaliatory in nature.
- 21:59So actually our inference kind of agrees with this;
- 22:07it's safely within the band suggested by the experts.
- 22:15Actually, another thing that we can do,
- 22:18and that is also pretty computationally demanding,
- 22:22with quadratic computational complexity
- 22:27again, is post-processing.
- 22:30So if, for example, for individual events,
- 22:32we want to know the probability that the event arose
- 22:38from retaliatory gun violence,
- 22:41then we could look at the self excitatory component
- 22:46of the rate function divided by the total rate function.
- 22:49And then we can just look at the posterior
- 22:51distribution of this statistic.
- 22:55And this will give us our posterior probability
- 22:58that the event arose from contagion dynamics at least.
- 23:04And you can see that we can actually observe
- 23:06a very wide variety of values.
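In symbols, the per-event statistic being post-processed is the self-excitatory share of the rate at that event, computed for each posterior sample of the parameters (notation as above, with C the self-excitatory component):

```latex
p_n \;=\; \frac{C(t_n, \mathbf{x}_n)}{\lambda(t_n, \mathbf{x}_n)}
\;=\; \frac{C(t_n, \mathbf{x}_n)}{\nu(t_n, \mathbf{x}_n) + C(t_n, \mathbf{x}_n)}.
```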
- 23:23So the issue of big data is actually not gonna go away,
- 23:28as we move on to discussing spatial data precision.
- 23:33Now, I'll tell you a little bit more about this data.
- 23:38All the data that we access, which is freely accessible online,
- 23:42is rounded to the nearest 100 meters
- 23:48by the DC Police Department.
- 23:51And the reason that they do this is for reasons of privacy.
- 23:58So one immediate question that we can ask is, well,
- 24:01how does this rounding actually affect our inference?
- 24:10Now we actually observed wildfires
- 24:13of wildly different sizes.
- 24:16And the question is, well, how does...
- 24:23If we want to model the spread of wildfires,
- 24:28then it would be useful to know
- 24:30where the actual ignition site,
- 24:33the site of ignition was.
- 24:37Where did the fire occur originally?
- 24:41And many of these fires are actually discovered
- 24:44out in the wild, far away from humans.
- 24:48And there's a lot of uncertainty.
- 24:50There are actually large swaths of land that are involved.
- 24:57Finally, this, this global influenza data
- 25:00is very nice for certain reasons.
- 25:03For example, all of the observations
- 25:07actually provide viral genome data.
- 25:10So we can perform other more complex
- 25:12analyses on the data.
- 25:14And in fact, I'll do that in the third section
- 25:17for related data.
- 25:21But the actual spatial precision for this data is very poor.
- 25:25So, for some of these viral cases,
- 25:29we know the city in which it occurred.
- 25:32For some of them, we know the region
- 25:34or the state in which it occurred.
- 25:35And for some of them, we know the country
- 25:37in which it occurred.
- 25:40So I'm gonna start with the easy problem,
- 25:42which is analyzing the DC gun violence, the DC gunshot data.
- 25:48And here again, the police department rounds the data
- 25:50to the nearest hundred meters.
- 25:52So what do we do?
- 25:53We take that at face value and we simply
- 25:57place a uniform prior over the 10,000-square-meter box
- 26:04that is centered at each one of our observations.
- 26:06So here I'm denoting our actual data,
- 26:10our observed data with this kind of Gothic X,
- 26:15and then I'm placing a prior over the location
- 26:17at which the gunshot actually occurred.
- 26:19And this is a uniform prior over a box centered at my data.
- 26:23And using this prior actually has another interpretation
- 26:28similar to some other concepts
- 26:33from the missing data literature.
- 26:36And use of this prior actually corresponds to using
- 26:40something called the grouped-data likelihood.
- 26:43And it's akin to the expected complete-data likelihood,
- 26:48if you're familiar with the missing data literature.
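A sketch of the construction: writing x-tilde_n for the released, rounded location, x_n for the latent true location, and theta standing in for the Hawkes model parameters, the prior and the grouped-data likelihood it induces look roughly like

```latex
\mathbf{x}_n \mid \tilde{\mathbf{x}}_n \;\sim\;
\mathrm{Uniform}\big(\tilde{\mathbf{x}}_n + [-50\,\mathrm{m},\, 50\,\mathrm{m}]^2\big),
\qquad
L(\boldsymbol{\theta} \mid \tilde{\mathbf{x}}_{1:N}) \;=\;
\int L(\boldsymbol{\theta} \mid \mathbf{x}_{1:N})
\prod_{n=1}^{N} p(\mathbf{x}_n \mid \tilde{\mathbf{x}}_n)\, d\mathbf{x}_{1:N}.
```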
- 26:53So what we do, and I'm not gonna get too much into
- 26:57the inference at this point, but we actually use MCMC
- 27:00to simultaneously infer the locations,
- 27:04and the Hawkes model parameters,
- 27:08the rate function parameters at the same time.
- 27:12So here, I'm just showing you a couple of examples
- 27:15of what this looks like.
- 27:16For each one of our observations colored yellow,
- 27:20we then have 100 posterior samples.
- 27:25So these dynamics can take on different forms
- 27:28and they take on different forms in very complex ways,
- 27:32simply because what we're essentially doing when we're...
- 27:38I'm going to loosely use the word impute.
- 27:41When we're imputing this data, when we're actually inferring
- 27:44these locations, we're basically simulating
- 27:47from a very complex n-body problem.
- 27:53So on the left, how can we interpret this?
- 27:57Well, we've got these four points and the model believes
- 28:01that actually they are farther away
- 28:02from each other than observed.
- 28:04Why is that?
- 28:05Well, right in the middle here, we have a shopping center,
- 28:09where there's actually many less gunshots.
- 28:13And then we've got residential areas
- 28:15where there are many more gunshots on the outside.
- 28:18And in the bottom right, with all of these,
- 28:26we believe that the actual locations of these gunshots
- 28:30collect closer together, kind of toward a very high
- 28:34intensity region in Washington, DC.
- 28:39And then we can just think about
- 28:41the general posterior displacement.
- 28:44So the mean posterior displacement.
- 28:46So in general, are there certain points that,
- 28:50where the model believes that the gunshots occurred
- 28:53further away from the observed events?
- 28:58And in general, there's not really.
- 29:01It's hard to come up with any steadfast rules.
- 29:04For example, in the bottom, right, we have some shots,
- 29:08some gunshots that show a very large posterior displacement,
- 29:13and they're in a very high density region.
- 29:15Whereas on the top, we also get large displacement
- 29:19and we're not surrounded by very many gunshots at all.
- 29:21So it is a very complex n-body problem
- 29:24that we're solving.
- 29:27And the good news is, for this problem,
- 29:30it doesn't matter much anyway.
- 29:32The results that we get are pretty much the same.
- 29:37I mean, so from the standpoint of statistical significance,
- 29:42we do get some statistically significant results.
- 29:45So in this figure, on the top,
- 29:47I'm showing you 95% credible intervals,
- 29:51and this is the self excitatory spatial length scale.
- 29:56We believe that it's smaller,
- 29:57but from a practical standpoint, it's not much smaller.
- 30:01It's a difference between 60 meters
- 30:03and maybe 72 or 73 meters.
- 30:13But we shouldn't take too much comfort
- 30:16because actually as we increase the spatial prec-
- 30:19excuse me, as we decrease the spatial precision,
- 30:22we find that the model that does not take account
- 30:26of the rounding, performs much worse.
- 30:29So for example, if you look in the table,
- 30:33then we have the fixed locations model,
- 30:37where I'm not actually inferring the locations.
- 30:40And I just want to see, what's the empirical coverage
- 30:45of the 95% credible intervals?
- 30:48And let's just focus on the 95%
- 30:53credible intervals, specifically,
- 30:55simply because actually the other intervals,
- 30:59the 50% credible interval, the 80% credible interval,
- 31:03they showed a similar dynamic.
- 31:10So if we start on the right-hand side,
- 31:13we have precision down to 0.1.
- 31:16This is a unitless example.
- 31:19So we have higher precision, actually.
- 31:22Then we see that we have very good coverage,
- 31:24even if we don't take this locational
- 31:31coarsening into account.
- 31:33But as we increase the size of our error box,
- 31:38then we actually lose coverage,
- 31:41and we deviate from that 95% coverage.
- 31:44And then finally, if we increase too much,
- 31:46then we're never actually going to be
- 31:51capturing the true spatial length scale,
- 31:57whereas if we actually do sample the locations,
- 31:59we perform surprisingly well,
- 32:01even when we have a very high amount of spatial coarsening.
- 32:08Well, how else can we break the model?
- 32:11Another way that we can break this model,
- 32:13and by break the model, I mean, my naive model
- 32:16where I'm not inferring the locations.
- 32:18Another way that we can break this model
- 32:22is simply by considering data
- 32:24where we have variable spatial coarsening.
- 32:28That is where different data points
- 32:32are coarsened different amounts,
- 32:34so we have a variable precision.
- 32:40So considering the wildfire data,
- 32:43we actually see something with the naive approach
- 32:48where we're not inferring the locations.
- 32:51We actually see something that is actually recorded
- 32:56elsewhere in the Hawkes process literature.
- 33:00And that is that when we try to use a flexible
- 33:05background function, as we are trying to do,
- 33:07then we get this multimodal posterior distribution.
- 33:13And that's fine.
- 33:14We can also talk about it
- 33:17from the frequentist standpoint,
- 33:19because it's observed there as well
- 33:21in the maximum likelihood context, which is,
- 33:25we still see this multimodality.
- 33:29What specific form does this multimodality take?
- 33:34So what we see is that we get modes around the places
- 33:40where the background rate parameters,
- 33:47the background length scale parameters are equal
- 33:50to the temporal, excuse me, the self excitatory
- 33:54length scale parameters.
- 33:56So for the naive model, in mode A,
- 34:00it believes that the background spatial length scale
- 34:03is about 24 kilometers, and that the spatial length scale
- 34:07of the self-excitatory dynamics
- 34:09is also roughly 24 kilometers.
- 34:14And then for the other mode,
- 34:16we get equal temporal length scales.
- 34:20So here, it believes 10 days, and 10 days
- 34:24for the self-excitatory and the background components.
- 34:27And this can be very bad indeed.
- 34:29So for example, for mode A,
- 34:31it completely, the Hawkes model completely fails
- 34:36to capture seasonal dynamics, which is the first thing
- 34:40that you would want it to pick up on.
- 34:43The first thing that you would want it to understand
- 34:47is that wildfires...
- 34:49Okay, I need to be careful here
- 34:51because I'm not an expert on wildfires.
- 34:55I'll go out on a limb and say,
- 34:56wildfires don't happen in Alaska during the winter.
- 35:03On the other hand, when we use the full model
- 35:05and we're actually simultaneously inferring the locations,
- 35:08then we get this kind of Goldilocks effect,
- 35:11where here, the spatial length scale
- 35:14is somewhere around 35 kilometers,
- 35:17which is between the 23 kilometers and 63 kilometers
- 35:21for modes A and B, and we see that reliably.
- 35:33I can stop for some questions because I'm making good time.
- 35:44<v Man>Does anybody have any questions, if you want to ask?</v>
- 35:52<v Student>What's the interpretation</v>
- 35:53of the spatial length scale and the temporal length scale?
- 35:56What do those numbers actually mean?
- 35:59<v ->Yeah, thank you.</v>
- 36:02So, the interpretation of the...
- 36:06I think that the most useful interpretation,
- 36:11so just to give you an idea of how they can be interpreted.
- 36:15So for example, for the self excitatory component, right,
- 36:20that's describing the contagion dynamics.
- 36:23What this is saying is that if we see a wildfire,
- 36:29then we expect to observe another wildfire
- 36:34with a mean waiting time of one day.
- 36:41So the temporal length scale is in units of days.
- 36:46So in the full model, after observing the wildfire,
- 36:50we expect to see another wildfire with mean, you know,
- 36:54on average, the next day.
- 36:56And this of course, you know, we have this model
- 37:02that's taking space and time into account.
- 37:05So the idea though, is that because of the separability
- 37:10in our model, we're basically simply
- 37:12expecting to see it somewhere.
- 37:19<v Student>Thank you.</v>
- 37:24<v Man>Any other questions?</v>
- 37:26(man speaking indistinctly)
- 37:31<v Student>Hi, can I have one question?</v>
- 37:35<v ->Go ahead.</v>
- 37:36<v Student>Okay.</v>
- 37:38I'm curious.
- 37:38What is the main difference between
- 37:39the naive model A and the naive model B?
- 37:43<v ->Okay.</v>
- 37:44So, sorry.
- 37:45This is...
- 37:47I think I could have presented
- 37:49this aspect better within the table itself.
- 37:52So this is the same exact model.
- 37:58But all that I'm doing is I'm applying
- 38:01the model multiple times.
- 38:03So in this case, I'm using Markov chain Monte Carlo.
- 38:07So one question that you might ask is,
- 38:10well, what happens when I run MCMC multiple times?
- 38:16Sometimes I get trapped in one mode.
- 38:20Sometimes I get trapped in another mode.
- 38:22You can just for, you know, a mental cartoon,
- 38:25we can think of like a (indistinct)
- 38:27a mixture of Gaussian distributions, right.
- 38:30Sometimes I can get trapped in this Gaussian component.
- 38:34Sometimes I could get trapped in this Gaussian component.
- 38:38So there's nothing intrinsically wrong with multimodality.
- 38:44We prefer to avoid it as best we can simply because it makes
- 38:47interpretation much more difficult.
- 38:52In this case, if I only perform inference
- 38:56and only see mode A, then I'm never actually gonna be
- 39:00picking up on seasonal dynamics.
- 39:07Does that (indistinct)?
- 39:10<v Woman>Yeah, it's clear.</v>
- 39:12<v Instructor>Okay.</v>
- 39:13<v Woman>Okay, and I also (indistinct).</v>
- 39:16So for the full model, you can capture
- 39:18the spatial dynamic property.
- 39:21So how to do that?
- 39:23So I know for the Hawkes process, you need
- 39:25to specify the baseline.
- 39:28So how do you estimate the baseline part?
- 39:32<v ->Oh, okay, great.</v>
- 39:35In the exact same way.
- 39:37<v Student>Okay, I see.</v>
- 39:39<v ->So I'm jointly, simultaneously performing inference</v>
- 39:45over all of the model parameters.
- 39:47And I can go all the way back.
- 39:53Right.
- 39:54'Cause it's actually a very similar model.
- 39:58Yes.
- 39:59So this is my baseline.
- 40:02And so, for example, when we're talking about that temporal
- 40:06smooth that you saw on that last figure,
- 40:09where I'm supposed to be capturing seasonal dynamics.
- 40:13Well, if tau T, which I'm just calling
- 40:18my temporal length scale, if that is too large,
- 40:22then I'm never going to be capturing
- 40:24those seasonal dynamics, which I would be hoping to capture
- 40:28precisely using this background smoother.
- 40:33<v Student>Okay, I see.</v>
- 40:34So it looks like you assume the formula for the baseline,
- 40:38and then you estimate some parameters in these formulas.
- 40:42<v ->Yes.</v>
- 40:43<v Student>In my understanding,</v>
- 40:44in the current Hawkes literature,
- 40:47somebody uses (indistinct) function
- 40:49to approximate the baseline also.
- 40:52<v ->Yes.</v>
- 40:52<v Student>This is also interesting.</v>
- 40:54Thank you. <v ->Yes.</v>
- 40:55Okay, okay, great.
- 40:56I'm happy to show another, you know.
- 40:59And of course I did not invent this.
- 41:00This is just another tack that you can take.
- 41:03<v Student>Yeah, yeah, yeah, yeah.</v>
- 41:04That's interesting.
- 41:05Thanks
- 41:06<v ->Yup.</v>
- 41:10<v Student>As just a quick follow up on</v>
- 41:13when you were showing the naive model,
- 41:16and this may be a naive question on my part.
- 41:20Did you choose naive model A to be the one
- 41:24that does the type seasonality or is that approach
- 41:27just not (indistinct) seasonality?
- 41:33<v ->So I think that the point</v>
- 41:38is that sometimes based on, you know,
- 41:42I'm doing MCMC.
- 41:44It's random in nature, right.
- 41:46So just sometimes when I do that,
- 41:49I get trapped in that mode A,
- 41:53and sometimes I get trapped in that mode B.
- 42:00The label that I apply to it is just arbitrary,
- 42:04but maybe I'm not getting your question.
- 42:11<v Student>No, I think you did.</v>
- 42:14So, it's possible that we detect it.
- 42:17It's possible that we don't.
- 42:20<v ->Exactly.</v>
- 42:21And that's, you know,
- 42:22<v Student>That's what it is.</v>
- 42:23<v ->multimodality.</v>
- 42:25So this is kind of nice though,
- 42:27that this can actually give you,
- 42:30that actually inferring the locations can somehow,
- 42:35at least in this case, right,
- 42:37I mean, this is a case study, really,
- 42:40that this can help resolve that multimodality.
- 42:47<v Student>Thank you.</v>
- 42:48Yeah.
- 42:49<v Student>So back to the comparison between CPU and GPU.</v>
- 42:55Let's say, if we increase the threads of the CPU,
- 43:00say like to infinity, will it be possible that the speed-up
- 43:06of the CPU matches the speed-up of the GPU?
- 43:12<v ->So.</v>
- 43:15You're saying if we increase.
- 43:17So, can I ask you one more time?
- 43:19Can I just ask for clarification?
- 43:21You're saying if we increase what to infinity?
- 43:25<v Student>The threads of the CPU.</v>
- 43:28I think in the graph you're increasing the threads
- 43:32of CPU from like one to 80.
- 43:35And the speed-up increases as the number
- 43:39of threads increases.
- 43:42So just say like, let's say the threads of CPU
- 43:45increase to infinity, will the speed up match,
- 43:51because GPU with like (indistinct).
- 43:54Very high, right. <v ->Yeah, yeah.</v>
- 43:57Let me show you another figure,
- 44:00and then we can return to that.
- 44:03I think it's a good segue into the next section.
- 44:07So, let me answer that in a couple slides.
- 44:10<v Student>Okay, sounds good.</v>
- 44:12<v ->Okay.</v>
- 44:13So, questions about.
- 44:15I've gotten some good questions about how do we interpret
- 44:18the length scales and then this makes me think about,
- 44:23well, if all that we're doing is interpreting
- 44:26the length scales, how much is that telling us about
- 44:29the phenomenon that we're interested in?
- 44:32And can we actually craft more complex hierarchical models
- 44:37so that we can actually learn something perhaps
- 44:41even biologically interpretable?
- 44:43So here, I'm looking at 2014 to 2016
- 44:47Ebola virus outbreak data.
- 44:50This is almost 22,000 cases.
- 44:54And of these cases, we have about 1600
- 45:00that are providing us genome data.
- 45:08And then of those 1600, we have a smaller subset
- 45:12that provide us genome data, as well as spatiotemporal data.
- 45:20So often people use genome data, say RNA sequences in order
- 45:27to try to infer the way that different viral cases
- 45:29are related to each other.
- 45:31And the question is, can we pull together sequenced
- 45:34and unsequenced data at the same time?
- 45:39So what I'm doing here is, again,
- 45:42I'm not inventing this.
- 45:44This is something that already exists.
- 45:47So all that I'm doing is modifying my triggering function G,
- 45:52and giving it this little N,
- 45:54this little subscript right there,
- 45:57which is denoting the fact that I'm allowing different viral
- 46:01observations to contribute to the rate function
- 46:05in different manners.
- 46:07And the exact form that that's gonna take on
- 46:09for my specific simple model that I'm using,
- 46:12is I'm going to give this a theta n.
- 46:17And I'm gonna include this theta n parameter
- 46:20in my self excitatory component.
- 46:22And this theta n is restricted to be greater than zero.
- 46:28So if it is greater than one,
- 46:30I'm gonna assume that actually, this self excite,
- 46:34excuse me, that this particular observation,
- 46:37little N is somehow more contagious.
- 46:41And if theta n is less than one,
- 46:43then I'm going to assume that it's less contagious.
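Concretely, for the simple model used here, each case n carries its own positive multiplier theta_n on its contribution to the self-excitatory component; a sketch of the form, with notation as above:

```latex
C(t, \mathbf{x}) \;=\; \sum_{n \,:\, t_n < t}
\theta_n\, g\big(t - t_n,\; \mathbf{x} - \mathbf{x}_n\big),
\qquad \theta_n > 0.
```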
- 46:48And this is an entirely unsatisfactory part of my talk,
- 46:52where I'm gonna gloss over a massive part of my model.
- 46:58And all that I'm gonna say is that
- 47:02this Phylogenetic Hawkes process, which I'm gonna be telling
- 47:05you about in the context of big modeling,
- 47:09and that challenge is that we start
- 47:13with the phylogenetic tree, which is simply the family tree
- 47:16that is uniting my 1600 sequenced cases.
- 47:22And then based on that, actually conditioned on that tree,
- 47:25we're gonna allow that tree to inform the larger
- 47:28covariance of my model parameters, which are then going to
- 47:33contribute to the overall Hawkes rate function
- 47:37in a differential manner, although it's still additive.
- 47:45Now, let's see.
- 47:49Do I get to go till 10 or 9:50?
- 47:57<v Man>So you can go till 10.</v>
- 47:59<v ->Okay, great.</v>
- 48:00So then, I'll quickly say that if I'm inferring
- 48:06all of these rates, then I'm inferring over 1300 rates.
- 48:13So that is actually the dimensionality
- 48:15of my posterior distribution.
- 48:21So a tool that I can use,
- 48:23a classic tool over 50 years old at this point,
- 48:26that I can use, is the random walk Metropolis
- 48:29algorithm, which is actually going to sample
- 48:32from the posterior distribution of these rates.
- 48:36And it's gonna do so in a manner that is effective
- 48:40in low dimensions, but not effective in high dimensions.
- 48:46And the way that it works is say,
- 48:47we start at negative three, negative three.
- 48:49What we want to do is we want to explore this high density
- 48:52region of this bi-variate Gaussian,
- 48:55and we slowly amble forward, and eventually we get there.
- 49:03But this algorithm breaks down in moderate dimensions.
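For concreteness, a bare-bones random-walk Metropolis sketch in Python; log_post, the step size, and the iteration count are assumptions for illustration rather than the talk's actual implementation:

```python
import numpy as np

def random_walk_metropolis(log_post, x0, n_iter=10_000, step=0.1, rng=None):
    """Isotropic Gaussian random-walk Metropolis.

    Works fine in a handful of dimensions but mixes very slowly as the
    dimension grows, which is the breakdown described in the talk.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = np.empty((n_iter, x.size))
    for i in range(n_iter):
        proposal = x + step * rng.standard_normal(x.size)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            x, lp = proposal, lp_prop
        samples[i] = x
    return samples
```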
- 49:07So.
- 49:11An algorithm that I think many of us are aware of
- 49:14at this point, that is kind of a workhorse
- 49:16in high dimensional Bayesian inference
- 49:18is Hamiltonian Monte Carlo.
- 49:20And this works by using actual gradient information about
- 49:24our log posterior in order to intelligently guide
- 49:28the MCMC proposals that we're making.
- 49:32So, again, let's just pretend that we start
- 49:34at negative three, negative three,
- 49:36but within a small number of steps,
- 49:38we're actually effectively exploring
- 49:40that high density region, and we're doing so
- 49:45because we're using that gradient information
- 49:47of the log posterior.
- 49:51I'm not going to go too deep right now into the formulation
- 49:56of Hamiltonian Monte Carlo, for the sake of time.
- 50:00But what I would like to point out,
- 50:04is that after constructing this kind of physical system
- 50:13that is based on our target distribution
- 50:20on the posterior distribution, in some manner,
- 50:24we actually obtain our proposals within the MCMC.
- 50:30We obtain the proposals by simulating, by forward simulating
- 50:35the physical system, according to Hamilton's equations.
- 50:40Now,
- 50:43what this simulation involves is a massive number
- 50:48of repeated gradient evaluations.
- 50:53Moreover, if the posterior distribution is an ugly one,
- 51:00that is, if it is ill-conditioned, which we interpret as
- 51:06the log posterior Hessian has eigenvalues
- 51:09that are all over the place.
- 51:12Then we can also use a mass matrix, M, which is gonna allow
- 51:17us to condition our dynamics, and make sure that we are
- 51:24exploring all the dimensions of our model in an even manner.
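A sketch of one preconditioned HMC proposal in Python, leapfrog-integrating Hamilton's equations with a diagonal mass matrix; the function names, step size, and number of leapfrog steps are assumptions for illustration, and every leapfrog step calls the gradient once, which is why the repeated gradient evaluations dominate the cost:

```python
import numpy as np

def hmc_proposal(log_post, grad_log_post, x, mass_diag, step=0.01, n_leapfrog=50, rng=None):
    """One HMC proposal: draw a momentum, leapfrog-integrate, accept or reject."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    p = rng.standard_normal(x.size) * np.sqrt(mass_diag)     # momentum ~ N(0, M)
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step * grad_log_post(x_new)                # half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += step * p_new / mass_diag                     # full step for position
        p_new += step * grad_log_post(x_new)                  # full step for momentum
    x_new += step * p_new / mass_diag
    p_new += 0.5 * step * grad_log_post(x_new)                # final half step
    # Accept or reject using the Hamiltonian (potential energy plus kinetic energy).
    h_cur = -log_post(x) + 0.5 * np.sum(p ** 2 / mass_diag)
    h_new = -log_post(x_new) + 0.5 * np.sum(p_new ** 2 / mass_diag)
    return x_new if np.log(rng.uniform()) < h_cur - h_new else x
```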
- 51:29So the benefit of Hamiltonian Monte-Carlo is that it scales
- 51:32to tens of thousands of parameters.
- 51:34But the challenge is that HMC necessitates repeated
- 51:38computation of the log likelihood,
- 51:42its gradient, and then preconditioning.
- 51:46And the best way that I know to precondition actually
- 51:49involves evaluating the log likelihood Hessian as well.
- 51:55And I told you that the challenges that I'm talking about
- 51:57today are intertwined.
- 51:58So what does this look like in a big data setting?
- 52:02Well, we've already managed to speed up the log likelihood
- 52:06computations that are quadratic in computational complexity.
- 52:11Well, it turns out that the log likelihood gradient
- 52:14and the log likelihood Hessian
- 52:17are also quadratic in computational complexity.
- 52:21So this means that as the size of our data set grows,
- 52:24we're going to...
- 52:27HMC, which is good at scaling to high dimensional models
- 52:31is going to break down because it's just gonna take too long
- 52:35to evaluate the quantities that we need to evaluate.
- 52:43Let me show you exactly how these parallel
- 52:45gradient calculations can work.
- 52:51So, what am I gonna do?
- 52:53I'm gonna parallelize again on a GPU
- 52:55or a multi-core CPU implementation,
- 53:00and I'm interested in evaluating or obtaining
- 53:04the quantities in the red box.
- 53:06These are simply the gradient of the log likelihood
- 53:09with respect to the individual rate parameters.
- 53:13And because of the summation that it involves,
- 53:17what we have in the top left
- 53:21is the contribution of the first observation
- 53:25to that gradient term.
- 53:28Then we have the contribution of the second observation
- 53:31all the way up to the big Nth observation,
- 53:35that contribution to the gradient term.
- 53:37And these all need to be evaluated and summed over.
- 53:41So what do we do?
- 53:42We just do a running total, very simple.
- 53:45We start by getting the first contribution.
- 53:49We keep that stored in place.
- 53:53We evaluate the second contribution,
- 53:56all at the same time in parallel,
- 53:57and we simply increment our total observat-
- 54:01excuse me, our total gradient by that value.
- 54:05Very simple.
- 54:06We do this again and again.
- 54:08Kind of complicated to program, to be honest.
- 54:11But it's simple.
- 54:16It's simple when you think about it from the high level.
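A sketch of the running-total idea in Python: each observation's contribution to the gradient is evaluated (in the real software, by many threads at once) and then added into the accumulator; contribution here is a placeholder function, not the actual model gradient:

```python
import numpy as np

def accumulate_gradient(contribution, n_events, dim):
    """Running-total accumulation of per-observation gradient contributions.

    contribution(n) returns observation n's length-`dim` contribution to the
    log-likelihood gradient; in the GPU implementation each such contribution
    is itself computed by many threads in parallel before being added in.
    """
    total = np.zeros(dim)
    for n in range(n_events):
        total += contribution(n)   # evaluate contribution n and increment the total
    return total
```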
- 54:19So I showed you this figure before.
- 54:21And well, a similar figure before,
- 54:24and the interpretations are the same,
- 54:26but here I'll just focus on the question that I received.
- 54:30In the top left, we have the gradient.
- 54:32In the bottom left, excuse me,
- 54:34top row, we have the gradient.
- 54:35Bottom row, we have the Hessian,
- 54:37and here I'm increasing to 104 cores.
- 54:42So this is not infinite cores, right.
- 54:46It's 104.
- 54:47But I do want you to see that there's diminishing returns.
- 54:54And to give a little bit more technical
- 54:57response to that question,
- 55:02the thing to bear in mind is that
- 55:04it's not just about the number of threads that we use.
- 55:08It's having a lot of RAM very close
- 55:12to where the computing is being done.
- 55:15And that is something that GPUs,
- 55:18modern gigantic GPUs do very well.
- 55:26So why is it important to do all this parallelization?
- 55:28Well, this is really, I want to kind of communicate
- 55:32this fact because it is so important.
- 55:36This slide underlines almost the entire challenge
- 55:40of big modeling using the spatiotemporal Hawkes process.
- 55:44The computing to apply this model to the 20,000 plus
- 55:49data points took about a month
- 55:54using a very large Nvidia GV100 GPU.
- 56:00Why?
- 56:01Because we had to generate 100 million Markov chain states
- 56:04at a rate of roughly three and a half million each day.
- 56:11After 100 million Markov chain states,
- 56:15after generating 100 million Markov chain states,
- 56:20this is the empirical distribution on the left
- 56:23of the effective sample sizes across,
- 56:28across all of the individual rates that we're inferring,
- 56:31actually all the model parameters.
- 56:34The minimum is 222, and that's right above my typical
- 56:39threshold of 200, because in general, we want the effective
- 56:43sample size to be as large as possible.
- 56:48Well, why was it so difficult?
- 56:50Well, a lot of the posterior,
- 56:53a lot of the marginal posteriors
- 56:55for our different parameters were very complex.
- 57:01So for example, here, I just have one individual rate,
- 57:05and this is the posterior that we learned from it.
- 57:08It's bi-modal.
- 57:10And not only is it bi-modal,
- 57:11but the modes exist on very different scales.
- 57:16Well, why else is it a difficult posterior to sample from?
- 57:19Well, because actually, as you might imagine,
- 57:22these rates have a very complex correlation structure.
- 57:28This is kind of repeating something that I said earlier
- 57:30when we were actually inferring locations,
- 57:33which is that what this amounts to is really simulating
- 57:36a very large n-body problem.
- 57:44But what's the upshot?
- 57:45Well, we can actually capture these individual rates,
- 57:51which could give us hints at where to look for certain
- 57:55mutations that are allowing, say in this example,
- 58:01the Ebola virus to spread more effectively.
- 58:05And here, red is generally the highest,
- 58:09whereas blue is the lowest.
- 58:13We can get credible intervals,
- 58:15which can give us another way of thinking about, you know,
- 58:18where should I be looking
- 58:22in this collection of viral samples, for the next big one?
- 58:29And then I can also ask, well, how do these rates actually
- 58:32distribute along the phylogenetic tree?
- 58:37So I can look for clades or groups of branches
- 58:41that are in general, more red in this case than others.
- 58:53So, something that I...
- 58:55Okay, so it's 10 o'clock, and I will finish in one slide.
- 59:03The challenges that I'm talking about today,
- 59:05they're complex and they're intertwined,
- 59:08but they're not the only challenges.
- 59:10There are many challenges in the application
- 59:14of spatiotemporal Hawkes models,
- 59:16and there's actually a very large literature.
- 59:21So some other challenges that we might consider,
- 59:25and that will also be extremely challenging to overcome
- 59:31in a big data setting.
- 59:32So, kind of the first challenge is flexible modeling.
- 59:38So here, we want to use as flexible
- 59:41of a Hawkes model as possible.
- 59:44And this challenge kind of encapsulates one of the great
- 59:49ironies of model-based nonparametrics, which is that,
- 59:55the precise time that we actually want to use
- 59:58a flexible model, is the big data setting.
- 01:00:03I mean, I don't know if you recall my earlier slide
- 01:00:07where I was showing the posterior distribution
- 01:00:10of some of the length scales associated with
- 01:00:13the Washington DC data, and they're extremely tight.
- 01:00:19But this is actually exactly where we'd want to be able
- 01:00:24to use a flexible model, because no matter what,
- 01:00:28if I apply my model to 85,000 data points,
- 01:00:32I'm going to be very certain in my conclusion,
- 01:00:36conditioned on the specific model that I'm using.
- 01:00:41There's also boundary issues, right.
- 01:00:43This is a huge, a huge thing.
- 01:00:45So for those of you that are aware
- 01:00:47of the survival literature, which I'm sure many of you are,
- 01:00:52you know, they're censoring.
- 01:00:54So what about gunshots that occurred right outside
- 01:00:57of the border of Washington DC, and it occurred as a result
- 01:01:01of gunshots that occurred within the border?
- 01:01:03And then we can flip that on its head.
- 01:01:05What about parent events outside of Washington DC
- 01:01:10that precipitated gun violence within Washington DC.
- 01:01:13And then finally, sticking with the same example,
- 01:01:16differential sampling.
- 01:01:20You can be certain that those acoustic gunshot
- 01:01:27locator system sensors are not planted
- 01:01:30all over Washington DC.
- 01:01:34And how does their distribution affect things?
- 01:01:41Okay.
- 01:01:42This is joint work with Mark Suchard, also at UCLA.
- 01:01:45And then my very good friend,
- 01:01:47my very dear friend, Xiang Ji at Tulane.
- 01:01:50It's funded by the K Award, Big Data Predictive Phylogenetics
- 01:01:54with Bayesian Learning, from the NIH.
- 01:01:58And that's it.
- 01:01:59Thank you.
- 01:02:06<v Man>All right.</v>
- 01:02:07Thank you so much, Professor Holbrook.
- 01:02:08Does anybody have any other questions?
- 01:02:11(people speaking indistinctly)
- 01:02:18Yeah.
- 01:02:21Any other questions from the room here, or from Zoom?
- 01:02:25(people speaking indistinctly)