YSPH Biostatistics Seminar: "Three Challenges Confronting Spatiotemporal Hawkes Models"
Andrew Holbrook, PhD, Assistant Professor, Department of Biostatistics, University of California, Los Angeles
September 28, 2021
- 00:00<v Man>Good afternoon, everybody.</v>
- 00:02Good morning, Professor Holbrook.
- 00:05Today I'm honored to introduce Professor Andrew Holbrook.
- 00:08So Professor Holbrook earned his bachelor's from UC Berkeley
- 00:11and a statistics master's and PhD from UC Irvine.
- 00:15His research touches a number of areas
- 00:17of biomedical interest,
- 00:18including Alzheimer's and epidemiology.
- 00:22He's currently an assistant professor
- 00:24of biostatistics at UCLA, where he teaches their advanced
- 00:27Bayesian computing course.
- 00:29And he's the author of several pieces
- 00:30of scientific software.
- 00:32All of it, I think, shows that he's very fond of parallelization,
- 00:37and he also has packages, including one for studying
- 00:40Hawkes processes, which he's going to tell us...
- 00:44Well, he's gonna tell us about the biological phenomenon
- 00:46and what's going on today.
- 00:48So Professor Holbrook, thank you so much.
- 00:52<v ->Okay, great.</v>
- 00:53Thank you so much for the kind invitation,
- 00:57and thanks for having me this morning slash afternoon.
- 01:02So today I'm actually gonna be kind of trying to present
- 01:06more of a high level talk that's gonna just focus on
- 01:10a couple of different problems that have
- 01:14come up when modeling Hawkes processes
- 01:18for public health data, and in particular
- 01:21for large scale public health data.
- 01:24So, today I'm interested in spatiotemporal data
- 01:28in public health, and this can take a number
- 01:30of different forms.
- 01:33So a great example of this is in Washington D.C.
- 01:39Here, I've got about 4,000 gunshots.
- 01:42You'll see this figure again,
- 01:44and I'll explain the colors to you
- 01:46and everything like that.
- 01:49But I just want you to see that in the year 2018 alone,
- 01:53there were 4,000 gunshots recorded in Washington DC.
- 01:57And this is just one example of really a gun violence
- 02:01problem in the U.S. of epidemic proportions.
- 02:07But spatiotemporal public health data
- 02:10can take on many forms.
- 02:11So here, for example, I have almost 3,000 wildfires
- 02:18in Alaska between the years 2015 and 2019.
- 02:24And this is actually just one piece of a larger
- 02:30trend that's going on in the American west.
- 02:35And then finally, another example of spatiotemporal public
- 02:39health data is, and I believe that we don't need to spend
- 02:44too much time on this motivation,
- 02:46but it's the global spread of viruses.
- 02:48So for example, here, I've got 5,000 influenza cases
- 02:52recorded from 2000 through 2012.
- 02:58So if I want to model this data,
- 03:00what I'm doing is I'm modeling event data.
- 03:02And one of the classic models for doing so,
- 03:06really the canonical stochastic process
- 03:12in this context, is the Poisson process.
- 03:14And I hope that you'll bear with me if we do just a little
- 03:18bit of review for our probability 101.
- 03:21But we say that a counting process
- 03:24is a homogeneous Poisson process, a point process
- 03:28with rate parameter lambda,
- 03:32which is greater than zero.
- 03:34If this process is equal to zero at time zero,
- 03:39if it has independent increments, that is,
- 03:43if its increments over non-overlapping intervals
- 03:48are independent random variables.
- 03:50And then finally, if its increments
- 03:52are Poisson distributed with mean given
- 03:57by that rate parameter lambda
- 04:00times the difference in the times.
- 04:04So we can make this model
- 04:07just a very little bit more complex.
- 04:09We can create an inhomogeneous Poisson point process,
- 04:13simply by saying that that rate parameter
- 04:16is no longer fixed, but itself is a function
- 04:20over the positive real line.
- 04:22And here everything is the exact same,
- 04:24except now we're saying that its increments,
- 04:28its differences over two different time periods,
- 04:30are Poisson distributed, where now the mean is simply given
- 04:35by the definite integral over that interval.
- 04:40So we just integrate that rate function.
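For reference, the two definitions just given can be written compactly as below; N(t) is an assumed name for the counting process, which the slides may label differently.

```latex
% Homogeneous Poisson process with rate \lambda > 0:
N(0) = 0, \qquad N(t) - N(s) \;\sim\; \mathrm{Poisson}\big(\lambda\,(t - s)\big), \quad 0 \le s < t,
% with increments over non-overlapping intervals independent.
%
% Inhomogeneous Poisson process with rate function \lambda(t) \ge 0 (same first two conditions):
N(t) - N(s) \;\sim\; \mathrm{Poisson}\!\left(\int_s^t \lambda(u)\,du\right).
```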
- 04:44Okay.
- 04:45So then how do we choose our rate function for the problems
- 04:48that we're interested in?
- 04:49Well, if we return to say the gun violence example,
- 04:53then it is plausible that at least sometimes some gun
- 04:58violence might precipitate more gun violence.
- 05:03So here we would say that having observed an event,
- 05:09having observed gunshots at a certain location
- 05:12at a certain time, we might expect that the probability
- 05:15of observing gunshots nearby and soon after is elevated,
- 05:23and the same could plausibly go for wildfires as well.
- 05:28It's that having observed a wildfire in a certain location,
- 05:33this could directly contribute to the existence
- 05:39or to the observation of other wildfires.
- 05:42So for example, this could happen by natural means.
- 05:45So we could have embers that are blown by the wind,
- 05:51or there could be a human that is in fact
- 05:54causing these wildfires, which is also quite common.
- 06:01And then it's not a stretch at all
- 06:03to believe that viral observation,
- 06:08so a child sick with influenza could precipitate
- 06:12another child that becomes sick with influenza
- 06:16in the same classroom and perhaps on the next day.
- 06:23So then, the solution to building this kind of dynamic into
- 06:27an inhomogeneous Poisson process is simply to craft
- 06:33the rate function in a way that is asymmetric in time.
- 06:37So here is just a regular temporal Hawkes process.
- 06:43And what we do is we divide this rate function, lambda T,
- 06:48which I'm showing you in the bottom of the equation,
- 06:51into a background portion, which here
- 06:55I denote nu, and this nu can be a function itself.
- 07:00And then we also have this self excitatory component C of T.
- 07:04And this self excitatory component for time T,
- 07:08it depends exclusively on observations
- 07:13that occur before time T.
- 07:17So each tn, where tn is less than T,
- 07:22is able to contribute information
- 07:25in some way to this process.
- 07:29And typically G is our triggering function.
- 07:32G is non-increasing.
- 07:37And then the only other thing that we ask
- 07:40is that the different events contribute
- 07:42in an additive manner to the rate.
- 07:45So here, we've got the background rate in this picture.
- 07:49We have observation T1.
- 07:50The rate increases.
- 07:53It slowly decreases.
- 07:55We have another observation, the rate increases.
- 07:57And what you see is actually that after T1,
- 08:00we have a nice little bit of self excitation as it's termed,
- 08:04where we observe more observations.
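In symbols, the temporal Hawkes rate just described looks roughly like this, keeping the transcript's nu for the background and C(t) for the self-excitatory component, with g the non-increasing triggering function:

```latex
\lambda(t) \;=\; \nu(t) \;+\; C(t),
\qquad
C(t) \;=\; \sum_{n \,:\, t_n < t} g(t - t_n).
```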
- 08:09This model itself can be made just a little bit more complex
- 08:13if we add a spatial component.
- 08:14So here now, is the spatiotemporal Hawkes process
- 08:18where I'm simply showing you the background process,
- 08:22which now I'm allowing to be described
- 08:26by a rate function over space.
- 08:29And then, we also have the self excitatory component,
- 08:32which again, although it also involves
- 08:35a spatial component in it,
- 08:37it still has this asymmetry in time.
- 08:40So in this picture, we have these,
- 08:42what are often called immigrant events
- 08:44or parent events in black.
- 08:49And then we have the child events,
- 08:50the offspring from these events described in blue.
- 08:53So this appears to be a pretty good stochastic process model,
- 08:58which is not overly complex, but is simply complex enough
- 09:02to capture contagion dynamics.
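The spatiotemporal version simply gives both components a spatial argument while keeping the asymmetry in time; a sketch of the form being described, with notation as above and x_n the location of event n:

```latex
\lambda(t, \mathbf{x}) \;=\; \nu(t, \mathbf{x}) \;+\; C(t, \mathbf{x}),
\qquad
C(t, \mathbf{x}) \;=\; \sum_{n \,:\, t_n < t} g\big(t - t_n,\; \mathbf{x} - \mathbf{x}_n\big).
```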
- 09:08So for this talk, I'm gonna be talking about some major
- 09:11challenges that are confronting real data analysis
- 09:17using the Hawkes process.
- 09:19So very applied in nature, and these challenges persist
- 09:23despite the use of a very simple model.
- 09:26So basically, all the models that I'm showing you today
- 09:29are variations on this extremely simple model,
- 09:33as far as the Hawkes process literature goes.
- 09:35So we assume an exponential decay triggering function.
- 09:40So here in this self excitatory component,
- 09:42what this looks like is the triggering function
- 09:47is simply the exponentiation of negative omega
- 09:52times T minus tn, where one over omega
- 09:56is some sort of length scale.
- 09:58Again, T minus tn is that difference between a time T
- 10:01and preceding event times.
- 10:05And then we're also assuming Gaussian kernel
- 10:07spatial smoothers, very simple.
- 10:09And then finally, another simplifying assumption
- 10:12that we're making is separability.
- 10:14So, in these individual components of the rate function,
- 10:20we always have separation between the temporal component,
- 10:25here on the left, and the spatial component
- 10:28on the right, and this is a simplifying assumption.
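Putting the three assumptions together, the self-excitatory component used in this talk looks roughly as follows; theta, omega, and sigma_x are assumed names for the self-excitatory weight, the temporal decay rate, and the spatial length scale, and separability shows up as the product of a temporal factor and a spatial factor:

```latex
C(t, \mathbf{x}) \;=\; \theta \sum_{n \,:\, t_n < t}
\underbrace{\omega\, e^{-\omega\,(t - t_n)}}_{\text{exponential decay in time}}
\;\times\;
\underbrace{\frac{1}{2\pi \sigma_x^2}
\exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_n \rVert^2}{2\sigma_x^2}\right)}_{\text{Gaussian kernel in space}}.
```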
- 10:34So what are the challenges that I'm gonna present today?
- 10:37The first challenge is big data, because when we are modeling
- 10:42many events, what we see is that the computational complexity
- 10:46of actually carrying out inference,
- 10:51whether using maximum likelihood or using, say,
- 10:54Markov chain Monte Carlo,
- 10:56is actually gonna explode quickly.
- 10:59Something else is the spatial data precision.
- 11:02And this is actually related to big data.
- 11:06As we accrue more data,
- 11:08it's harder to guarantee data quality,
- 11:11but then also the tools that I'm gonna offer up to actually
- 11:15deal with poor spatial data precision are actually
- 11:18gonna also suffer under a big data setting.
- 11:21And then finally, big models.
- 11:24So, you know, when we're trying to draw very specific
- 11:27scientific conclusions from our model, then what happens?
- 11:31And all these data, excuse me,
- 11:33all these challenges are intertwined,
- 11:34and I'll try to express that.
- 11:39Finally today, I am interested in scientifically
- 11:43interpretable inference.
- 11:46So, I'm not gonna talk about prediction,
- 11:48but if you have questions about prediction,
- 11:51then we can talk about that afterward.
- 11:53I'm happy to.
- 11:57Okay.
- 11:58So I've shown you this figure before,
- 12:00and it's not the last time that you'll see it.
- 12:02But again, this is 4,000 gunshots in 2018.
- 12:05This is part of a larger dataset that's made available
- 12:07by the Washington DC Police Department.
- 12:12And in fact, from 2006 to 2018,
- 12:15we have over 85,000 potential gunshots recorded.
- 12:20How are they recorded?
- 12:21They're recorded with the help of an acoustic gunshot
- 12:24locator system that uses the actual acoustics
- 12:28to triangulate the time and the location
- 12:32of the individual gunshots.
- 12:35So in a 2018 paper, Charles Loeffler and Seth Flaxman,
- 12:40they used a subset of this data in a paper entitled
- 12:44"Is Gun Violence Contagious?"
- 12:46And they in fact apply a Hawkes process model
- 12:49to try to determine
- 12:51the answer to their question.
- 12:53But in order to do so, though,
- 12:55they had to significantly subset the data.
- 12:57They took roughly 10% of the data.
- 13:00So their conclusions were in fact in the affirmative:
- 13:02they were able to detect
- 13:06this kind of contagion dynamics.
- 13:11But the question is, do their results hold
- 13:14when we analyze the complete data set?
- 13:18So for likelihood based inference,
- 13:20which we're going to need to use in order to learn,
- 13:25in order to apply the Hawkes process to real-world data,
- 13:30the first thing to see is that the likelihood
- 13:34takes on the form of an integral term on the left.
- 13:39And then we have a simple product of the rate function
- 13:43evaluated at our individual events, observed events.
- 13:50And when we consider the log likelihood,
- 13:53then it in fact will involve this term that I'm showing you
- 13:58on the bottom line, where it's the sum
- 14:00of the log of the, again, the rate function evaluated
- 14:04at the individual events. (background ringing)
- 14:07I'm sorry.
- 14:08You might be hearing a little bit of the sounds
- 14:10of Los Angeles in the background, and there's very little
- 14:14that I can do about Los Angeles.
- 14:16So moving on.
- 14:19So this summation in the log likelihood
- 14:25actually involves a double summation.
- 14:28So it is the sum over all of our observations,
- 14:32of the log of the rate function.
- 14:34And then, again, the rate function because of the very
- 14:37specific form taken by the self excitatory component
- 14:41is also gonna involve this summation.
- 14:45So the upshot is that,
- 14:49every time we evaluate the log likelihood,
- 14:53we're going to need to evaluate N choose two terms,
- 14:59where N is the number of data points,
- 15:01in this summation right here,
- 15:06and then we're gonna need to sum them together.
- 15:09And then the gradient also features this,
- 15:16quadratic computational complexity.
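Written out, the log-likelihood for events (t_1, x_1), ..., (t_N, x_N) observed over a window takes roughly the following form; the inner sum sitting inside the log is the double summation just described, and evaluating all of its terms costs on the order of N choose 2 operations:

```latex
\ell \;=\; -\int_0^T\!\!\int_{\mathcal{X}} \lambda(t, \mathbf{x})\, d\mathbf{x}\, dt
\;+\; \sum_{n=1}^{N} \log\!\Big(
\nu(t_n, \mathbf{x}_n) \;+\;
\sum_{m \,:\, t_m < t_n} g\big(t_n - t_m,\; \mathbf{x}_n - \mathbf{x}_m\big)
\Big).
```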
- 15:21So the solution, the first solution that I'm gonna offer up
- 15:23is not a statistical solution.
- 15:25It's a parallel computing solution.
- 15:27And the basic idea is, well, all of these terms that we need
- 15:31to sum over, evaluate and sum over, let's do it all at once
- 15:36and thereby speed up our inference.
- 15:41I do so using multiple computational tools.
- 15:44So the first one is just multi-core CPUs.
- 15:50These can have anywhere from two to 100 cores.
- 15:54And then I combine this with something called SIMD,
- 15:58single instruction multiple data, which is vectorization.
- 16:02So the idea, the basic idea is that I can apply a function,
- 16:09the same function, the same instruction set to an extended
- 16:13register or vector of input data, and thereby speed up
- 16:20my computing by a factor that is proportional
- 16:24to the size of the vector that I'm evaluating
- 16:27my function over.
- 16:29And then, I actually can do something better than this.
- 16:33I can use a graphics processing unit,
- 16:35which instead of hundreds of cores, has thousands of cores.
- 16:39And instead of SIMD, or it can be interpreted as SIMD,
- 16:42but Nvidia likes to call it single instruction
- 16:45multiple threads, or SIMT.
- 16:47And here, the major difference
- 16:50is the scale at which it's occurring.
- 16:54And then, the other difference is that actually
- 16:58individual threads or small working groups of threads
- 17:01on my GPU can work together.
- 17:03So actually the tools that I have available are very complex
- 17:07and require a lot of care.
- 17:10There's a lot of need to carefully code this up.
- 17:13The solution is not statistical, but it's very much
- 17:18an engineering solution.
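A minimal sketch of the computation being parallelized, in Python with NumPy standing in for the hand-tuned C++/SIMD/CUDA code; the parameter names omega, sigma, and nu are placeholders for illustration, not the software's actual interface:

```python
import numpy as np

def hawkes_event_loglik_sum(times, locs, omega, sigma, nu=1.0):
    """Evaluate sum_n log( nu + sum_{m: t_m < t_n} g(t_n - t_m, x_n - x_m) ).

    times : (N,) event times; locs : (N, 2) event locations.
    The N-by-N pairwise terms built here are the same terms that the talk's
    software spreads across SIMD lanes, CPU threads, or GPU threads before reducing.
    """
    dt = times[None, :] - times[:, None]                        # dt[m, n] = t_n - t_m
    dt = np.where(dt > 0.0, dt, np.inf)                         # only earlier events trigger
    dx2 = ((locs[None, :, :] - locs[:, None, :]) ** 2).sum(-1)  # squared spatial distances
    trig = (omega * np.exp(-omega * dt)
            * np.exp(-dx2 / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2))
    rates = nu + trig.sum(axis=0)                               # total rate at each observed event
    return np.log(rates).sum()
```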
- 17:19But the results are really, really impressive
- 17:24from my standpoint.
- 17:27So on the left, I'm comparing relative speed-ups against
- 17:32a very fast single-core SIMD implementation.
- 17:40So my baseline right here is the bottom of this blue curve.
- 17:44The X axis is giving me the number of CPU threads
- 17:48that I'm using, between one and 18.
- 17:52And then, the top line is not using CPU threads,
- 17:55so I just draw a top line that's flat.
- 17:58These are the GPU results.
- 18:01If I don't use SIMD, if I use non-vectorized
- 18:04single-core computing, of course, this is still
- 18:06a pre-compiled C++ implementation.
- 18:08So it's fast, or at least faster than R,
- 18:11and I'll show you that on the next slide.
- 18:13If I do that, then AVX is twice as fast.
- 18:17As I increased the number of cores,
- 18:21my relative speed up increases,
- 18:24but I also suffer diminishing returns.
- 18:28And then that is actually all these simulations
- 18:31on the left-hand plot.
- 18:33That's for a fixed amount of data.
- 18:34That's 75,000 randomly generated data points
- 18:38at each iteration of my simulation.
- 18:42But I can also just look at the seconds per evaluation.
- 18:45So that's my Y axis on the right-hand side.
- 18:49So ideally I want this to be as low as possible.
- 18:53And then I'm increasing the number of data points
- 18:56on the Y axis, on the X axis, excuse me.
- 19:00And then as the number of threads that I use,
- 19:03as I increased the number of threads,
- 19:05then my implementation is much faster.
- 19:08But again, you're seeing this quadratic computational
- 19:12complexity at play, right.
- 19:14All of these lines are looking rather parabolic.
- 19:18Finally, I go down all the way to the bottom,
- 19:21where I've got my GPU curve,
- 19:22again, suffering from that quadratic
- 19:25computational complexity,
- 19:27which we can't get past, but doing a much better job
- 19:31than the CPU computing.
- 19:32Now you might ask, well, you might say,
- 19:35well, a 100 fold speed up is not that great.
- 19:38So I'd put this in perspective and say, well,
- 19:41what does this mean for R, which I use every day?
- 19:45Well, what it amounts to,
- 19:49and here, I'll just focus on the relative speed up
- 19:51over the R implementation on the right.
- 19:55The GPU is reliably over 1000 times faster.
- 20:04So the way that Charles Loeffler and Seth Flaxman
- 20:12obtained a subset of their data was actually
- 20:16by thinning the data.
- 20:21They needed to do so because of the sheer computational
- 20:24complexity of using the Hawkes model.
- 20:27So, I'm not criticizing this in any way,
- 20:30but I'm simply pointing out why our results
- 20:34using the full data set, differ.
- 20:36So on the left, on the top left,
- 20:40we have the posterior density for the spatial length scale
- 20:44of the self excitatory component.
- 20:46And when we use the full data set,
- 20:48then we believe that we're operating more at around 70
- 20:51meters instead of the 126 inferred in the original paper.
- 20:56So one thing that you might notice is our posterior
- 21:01densities, in blue, are much more concentrated
- 21:08than the original analysis, in salmon.
- 21:12And this of course makes sense.
- 21:14We're using 10 times the amount of the data.
- 21:18Our temporal length scale
- 21:20is also, we believe, much smaller, in fact.
- 21:24So now it's down to one minute instead of 10 minutes.
- 21:28Again, this could be interpreted
- 21:29as the simple result of thinning.
- 21:32And then finally, I just want to focus on
- 21:35the green posterior density.
- 21:41This is the proportion of events that we interpret
- 21:45as arising from self-excitation or contagion dynamics.
- 21:50Experts believe that anywhere between 10 and 18% of gun
- 21:56violence events are retaliatory in nature.
- 21:59So actually our inference kind of agrees with this;
- 22:07it's safely within the band suggested by the experts.
- 22:15Actually, another thing that we can do,
- 22:18and that is also pretty computationally demanding,
- 22:22with quadratic computational complexity
- 22:27again, is post-processing.
- 22:30So if, for example, for individual events,
- 22:32we want to know the probability that the event arose
- 22:38from retaliatory gun violence,
- 22:41then we could look at the self excitatory component
- 22:46of the rate function divided by the total rate function.
- 22:49And then we can just look at the posterior
- 22:51distribution of this statistic.
- 22:55And this will give us our posterior probability
- 22:58that the event arose from contagion dynamics at least.
- 23:04And you can see that we can actually observe
- 23:06a very wide variety of values.
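In symbols, the per-event statistic being post-processed is the self-excitatory share of the rate at that event, computed for each posterior sample of the parameters (notation as above, with C the self-excitatory component):

```latex
p_n \;=\; \frac{C(t_n, \mathbf{x}_n)}{\lambda(t_n, \mathbf{x}_n)}
\;=\; \frac{C(t_n, \mathbf{x}_n)}{\nu(t_n, \mathbf{x}_n) + C(t_n, \mathbf{x}_n)}.
```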
- 23:23So the issue of big data is actually not gonna go away,
- 23:28as we move on to discussing spatial data precision.
- 23:33Now, I'll tell you a little bit more about this data.
- 23:38All the data that we access, which is freely accessible online,
- 23:42is rounded to the nearest 100 meters
- 23:48by the DC Police Department.
- 23:51And the reason that they do this is for reasons of privacy.
- 23:58So one immediate question that we can ask is, well,
- 24:01how does this rounding actually affect our inference?
- 24:10Now we actually observed wildfires
- 24:13of wildly different sizes.
- 24:16And the question is, well, how does...
- 24:23If we want to model the spread of wildfires,
- 24:28then it would be useful to know
- 24:30where the actual ignition site,
- 24:33the site of ignition was.
- 24:37Where did the fire occur originally?
- 24:41And many of these fires are actually discovered
- 24:44out in the wild, far away from humans.
- 24:48And there's a lot of uncertainty.
- 24:50There are actually large swaths of land that are involved.
- 24:57Finally, this, this global influenza data
- 25:00is very nice for certain reasons.
- 25:03For example, all of the observations
- 25:07actually provide viral genome data.
- 25:10So we can perform other more complex
- 25:12analyses on the data.
- 25:14And in fact, I'll do that in the third section
- 25:17for related data.
- 25:21But the actual spatial precision for this data is very poor.
- 25:25So, for some of these viral cases,
- 25:29we know the city in which it occurred.
- 25:32For some of them, we know the region
- 25:34or the state in which it occurred.
- 25:35And for some of them, we know the country
- 25:37in which it occurred.
- 25:40So I'm gonna start with the easy problem,
- 25:42which is analyzing the DC gun violence, the DC gunshot data.
- 25:48And here again, the police department rounds the data
- 25:50to the nearest hundred meters.
- 25:52So what do we do?
- 25:53We take that at face value and we simply
- 25:57place a uniform prior over the 10,000-square-meter box
- 26:04that is centered at each one of our observations.
- 26:06So here I'm denoting our actual data,
- 26:10our observed data with this kind of Gothic X,
- 26:15and then I'm placing a prior over the location
- 26:17at which the gunshot actually occurred.
- 26:19And this is a uniform prior over a box centered at my data.
- 26:23And using this prior actually has another interpretation
- 26:28similar to some other concepts
- 26:33from the missing data literature.
- 26:36And use of this prior actually corresponds to using
- 26:40something called the grouped-data likelihood.
- 26:43And it's akin to the expected complete-data likelihood,
- 26:48if you're familiar with the missing data literature.
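A sketch of the construction: writing x-tilde_n for the released, rounded location, x_n for the latent true location, and theta standing in for the Hawkes model parameters, the prior and the grouped-data likelihood it induces look roughly like

```latex
\mathbf{x}_n \mid \tilde{\mathbf{x}}_n \;\sim\;
\mathrm{Uniform}\big(\tilde{\mathbf{x}}_n + [-50\,\mathrm{m},\, 50\,\mathrm{m}]^2\big),
\qquad
L(\boldsymbol{\theta} \mid \tilde{\mathbf{x}}_{1:N}) \;=\;
\int L(\boldsymbol{\theta} \mid \mathbf{x}_{1:N})
\prod_{n=1}^{N} p(\mathbf{x}_n \mid \tilde{\mathbf{x}}_n)\, d\mathbf{x}_{1:N}.
```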
- 26:53So what we do, and I'm not gonna get too much into
- 26:57the inference at this point, but we actually use MCMC
- 27:00to simultaneously infer the locations,
- 27:04and the Hawkes model parameters,
- 27:08the rate function parameters at the same time.
- 27:12So here, I'm just showing you a couple of examples
- 27:15of what this looks like.
- 27:16For each one of our observations colored yellow,
- 27:20we then have 100 posterior samples.
- 27:25So these dynamics can take on different forms
- 27:28and they take on different forms in very complex ways,
- 27:32simply because what we're essentially doing when we're...
- 27:38I'm going to loosely use the word impute.
- 27:41When we're imputing this data, when we're actually inferring
- 27:44these locations, we're basically simulating
- 27:47from a very complex n-body problem.
- 27:53So on the left, how can we interpret this?
- 27:57Well, we've got these four points and the model believes
- 28:01that actually they are farther away
- 28:02from each other than observed.
- 28:04Why is that?
- 28:05Well, right in the middle here, we have a shopping center,
- 28:09where there's actually many less gunshots.
- 28:13And then we've got residential areas
- 28:15where there are many more gunshots on the outside.
- 28:18And in the bottom right, with all of these,
- 28:26we believe that the actual locations of these gunshots
- 28:30collect closer together, kind of toward a very high
- 28:34intensity region in Washington, DC.
- 28:39And then we can just think about
- 28:41the general posterior displacement.
- 28:44So the mean posterior displacement.
- 28:46So in general, are there certain points that,
- 28:50where the model believes that the gunshots occurred
- 28:53further away from the observed events?
- 28:58And in general, there's not really.
- 29:01It's hard to come up with any steadfast rules.
- 29:04For example, in the bottom, right, we have some shots,
- 29:08some gunshots that show a very large posterior displacement,
- 29:13and they're in a very high density region.
- 29:15Whereas on the top, we also get large displacement
- 29:19and we're not surrounded by very many gunshots at all.
- 29:21So it is a very complex n-body problem
- 29:24that we're solving.
- 29:27And the good news is, for this problem,
- 29:30it doesn't matter much anyway.
- 29:32The results that we get are pretty much the same.
- 29:37I mean, so from the standpoint of statistical significance,
- 29:42we do get some statistically significant results.
- 29:45So in this figure, on the top,
- 29:47I'm showing you 95% credible intervals,
- 29:51and this is the self excitatory spatial length scale.
- 29:56We believe that it's smaller,
- 29:57but from a practical standpoint, it's not much smaller.
- 30:01It's a difference between 60 meters
- 30:03and maybe 72 or 73 meters.
- 30:13But we shouldn't take too much comfort
- 30:16because actually as we increase the spatial prec-
- 30:19excuse me, as we decrease the spatial precision,
- 30:22we find that the model that does not take account
- 30:26of the rounding, performs much worse.
- 30:29So for example, if you look in the table,
- 30:33then we have the fixed locations model,
- 30:37where I'm not actually inferring the locations.
- 30:40And I just want to see, what's the empirical coverage
- 30:45of the 95% credible intervals?
- 30:48And let's just focus on the 95%
- 30:53credible intervals, specifically,
- 30:55simply because actually the other intervals,
- 30:59the 50% credible interval, the 80% credible interval,
- 31:03they showed a similar dynamic.
- 31:10So if we start on the right-hand side,
- 31:13we have precision down to 0.1.
- 31:16This is a unitless example.
- 31:19So we have higher precision, actually.
- 31:22Then we see that we have very good coverage,
- 31:24even if we don't take this locational
- 31:31coarsening into account.
- 31:33But as we increase the size of our error box,
- 31:38then we actually lose coverage,
- 31:41and we deviate from that 95% coverage.
- 31:44And then finally, if we increase too much,
- 31:46then we're never actually going to be
- 31:51capturing the true spatial length scale,
- 31:57whereas if we actually do sample the locations,
- 31:59we perform surprisingly well,
- 32:01even when we have a very high amount of spatial coarsening.
- 32:08Well, how else can we break the model?
- 32:11Another way that we can break this model,
- 32:13and by break the model, I mean, my naive model
- 32:16where I'm not inferring the locations.
- 32:18Another way that we can break this model
- 32:22is simply by considering data
- 32:24where we have variable spatial coarsening.
- 32:28That is where different data points
- 32:32are coarsened different amounts,
- 32:34so we have a variable precision.
- 32:40So considering the wildfire data,
- 32:43we actually see something with the naive approach
- 32:48where we're not inferring the locations.
- 32:51We actually see something that is actually recorded
- 32:56elsewhere in the Hawkes process literature.
- 33:00And that is that when we try to use a flexible
- 33:05background function, as we are trying to do,
- 33:07then we get this multimodal posterior distribution.
- 33:13And that's fine.
- 33:14We can also talk about it
- 33:17from the frequentist standpoint,
- 33:19because it's observed there as well
- 33:21in the maximum likelihood context, which is,
- 33:25we still see this multimodality.
- 33:29What specific form does this multimodality take?
- 33:34So what we see is that we get modes around the places
- 33:40where the background rate parameters,
- 33:47the background length scale parameters are equal
- 33:50to the temporal, excuse me, the self excitatory
- 33:54length scale parameters.
- 33:56So for the naive model, in mode A,
- 34:00it believes that the background spatial length scale
- 34:03is about 24 kilometers, and that the spatial length scale
- 34:07of the self-excitatory dynamics
- 34:09is also roughly 24 kilometers.
- 34:14And then for the other mode,
- 34:16we get equal temporal length scales.
- 34:20So here, it believes 10 days, and 10 days
- 34:24for the self-excitatory and the background components.
- 34:27And this can be very bad indeed.
- 34:29So for example, for mode A,
- 34:31it completely, the Hawkes model completely fails
- 34:36to capture seasonal dynamics, which is the first thing
- 34:40that you would want it to pick up on.
- 34:43The first thing that you would want it to understand
- 34:47is that wildfires...
- 34:49Okay, I need to be careful here
- 34:51because I'm not an expert on wildfires.
- 34:55I'll go out on a limb and say,
- 34:56wildfires don't happen in Alaska during the winter.
- 35:03On the other hand, when we use the full model
- 35:05and we're actually simultaneously inferring the locations,
- 35:08then we get this kind of Goldilocks effect,
- 35:11where here, the spatial length scale
- 35:14is somewhere around 35 kilometers,
- 35:17which is between the 23 kilometers and 63 kilometers
- 35:21for modes A and B, and we see that reliably.
- 35:33I can stop for some questions because I'm making good time.
- 35:44<v Man>Does anybody have any questions, if you want to ask?</v>
- 35:52<v Student>What's the interpretation</v>
- 35:53of the spatial length scale and the temporal length scale?
- 35:56What do those numbers actually mean?
- 35:59<v ->Yeah, thank you.</v>
- 36:02So, the interpretation of the...
- 36:06I think that the most useful interpretation,
- 36:11so just to give you an idea of how they can be interpreted.
- 36:15So for example, for the self excitatory component, right,
- 36:20that's describing the contagion dynamics.
- 36:23What this is saying is that if we see a wildfire,
- 36:29then we expect to observe another wildfire
- 36:34with a mean waiting time of one day.
- 36:41So the temporal length scale is in units of days.
- 36:46So in the full model, after observing the wildfire,
- 36:50we expect to see another wildfire with mean, you know,
- 36:54on average, the next day.
- 36:56And this of course, you know, we have this model
- 37:02that's taking space and time into account.
- 37:05So the idea though, is that because of the separability
- 37:10in our model, we're basically simply
- 37:12expecting to see it somewhere.
- 37:19<v Student>Thank you.</v>
- 37:24<v Man>Any other questions?</v>
- 37:26(man speaking indistinctly)
- 37:31<v Student>Hi, can I have one question?</v>
- 37:35<v ->Go ahead.</v>
- 37:36<v Student>Okay.</v>
- 37:38I'm curious.
- 37:38What is the main difference between
- 37:39the naive model A and the naive model B?
- 37:43<v ->Okay.</v>
- 37:44So, sorry.
- 37:45This is...
- 37:47I think I could have presented
- 37:49this aspect better within the table itself.
- 37:52So this is the same exact model.
- 37:58But all that I'm doing is I'm applying
- 38:01the model multiple times.
- 38:03So in this case, I'm using Markov chain Monte Carlo.
- 38:07So one question that you might ask is,
- 38:10well, what happens when I run MCMC multiple times?
- 38:16Sometimes I get trapped in one mode.
- 38:20Sometimes I get trapped in another mode.
- 38:22You can just for, you know, a mental cartoon,
- 38:25we can think of like a (indistinct)
- 38:27a mixture of Gaussian distributions, right.
- 38:30Sometimes I can get trapped in this Gaussian component.
- 38:34Sometimes I could get trapped in this Gaussian component.
- 38:38So there's nothing intrinsically wrong with multimodality.
- 38:44We prefer to avoid it as best we can simply because it makes
- 38:47interpretation much more difficult.
- 38:52In this case, if I only perform inference
- 38:56and only see mode A, then I'm never actually gonna be
- 39:00picking up on seasonal dynamics.
- 39:07Does that (indistinct)?
- 39:10<v Woman>Yeah, it's clear.</v>
- 39:12<v Instructor>Okay.</v>
- 39:13<v Woman>Okay, and I also (indistinct).</v>
- 39:16So for the full model, you can capture
- 39:18the spatial dynamic property.
- 39:21So how to do that?
- 39:23So I know for the Hawkes process, you need
- 39:25to specify the baseline.
- 39:28So how do you estimate the baseline part?
- 39:32<v ->Oh, okay, great.</v>
- 39:35In the exact same way.
- 39:37<v Student>Okay, I see.</v>
- 39:39<v ->So I'm jointly, simultaneously performing inference</v>
- 39:45over all of the model parameters.
- 39:47And I can go all the way back.
- 39:53Right.
- 39:54'Cause it's actually a very similar model.
- 39:58Yes.
- 39:59So this is my baseline.
- 40:02And so, for example, when we're talking about that temporal
- 40:06smooth that you saw on that last figure,
- 40:09where I'm supposed to be capturing seasonal dynamics.
- 40:13Well, if tau T, which I'm just calling
- 40:18my temporal length scale, if that is too large,
- 40:22then I'm never going to be capturing
- 40:24those seasonal dynamics, which I would be hoping to capture
- 40:28precisely using this background smoother.
- 40:33<v Student>Okay, I see.</v>
- 40:34So it looks like you assume the formula for the baseline,
- 40:38and then you estimate some parameters in these formulas.
- 40:42<v ->Yes.</v>
- 40:43<v Student>In my understanding,</v>
- 40:44in the current Hawkes literature,
- 40:47somebody uses (indistinct) function
- 40:49to approximate the baseline also.
- 40:52<v ->Yes.</v>
- 40:52<v Student>This is also interesting.</v>
- 40:54Thank you. <v ->Yes.</v>
- 40:55Okay, okay, great.
- 40:56I'm happy to show another, you know.
- 40:59And of course I did not invent this.
- 41:00This is just another tack that you can take.
- 41:03<v Student>Yeah, yeah, yeah, yeah.</v>
- 41:04That's interesting.
- 41:05Thanks
- 41:06<v ->Yup.</v>
- 41:10<v Student>As just a quick follow up on</v>
- 41:13when you were showing the naive model,
- 41:16and this may be a naive question on my part.
- 41:20Did you choose naive model A to be the one
- 41:24that does the type seasonality or is that approach
- 41:27just not (indistinct) seasonality?
- 41:33<v ->So I think that the point</v>
- 41:38is that sometimes based on, you know,
- 41:42I'm doing MCMC.
- 41:44It's random in nature, right.
- 41:46So just sometimes when I do that,
- 41:49I get trapped in that mode A,
- 41:53and sometimes I get trapped in that mode B.
- 42:00The label that I apply to it is just arbitrary,
- 42:04but maybe I'm not getting your question.
- 42:11<v Student>No, I think you did.</v>
- 42:14So, it's possible that we detect it.
- 42:17It's possible that we don't.
- 42:20<v ->Exactly.</v>
- 42:21And that's, you know,
- 42:22<v Student>That's what it is.</v>
- 42:23<v ->multimodality.</v>
- 42:25So this is kind of nice though,
- 42:27that this can actually give you,
- 42:30that actually inferring the locations can somehow,
- 42:35at least in this case, right,
- 42:37I mean, this is a case study, really,
- 42:40that this can help resolve that multimodality.
- 42:47<v Student>Thank you.</v>
- 42:48Yeah.
- 42:49<v Student>So back to the comparison between CPU and GPU.</v>
- 42:55Let's say, if we increase the threads of the CPU,
- 43:00say like to infinity, will it be possible that the speed-up
- 43:06of the CPU matches the speed-up of the GPU?
- 43:12<v ->So.</v>
- 43:15You're saying if we increase.
- 43:17So, can I ask you one more time?
- 43:19Can I just ask for clarification?
- 43:21You're saying if we increase what to infinity?
- 43:25<v Student>The threads of the CPU.</v>
- 43:28I think in the graph you're increasing the threads
- 43:32of CPU from like one to 80.
- 43:35And the speed-up increases as the number
- 43:39of threads increases.
- 43:42So just say like, let's say the threads of CPU
- 43:45increase to infinity, will the speed up match,
- 43:51because GPU with like (indistinct).
- 43:54Very high, right. <v ->Yeah, yeah.</v>
- 43:57Let me show you another figure,
- 44:00and then we can return to that.
- 44:03I think it's a good segue into the next section.
- 44:07So, let me answer that in a couple slides.
- 44:10<v Student>Okay, sounds good.</v>
- 44:12<v ->Okay.</v>
- 44:13So, questions about.
- 44:15I've gotten some good questions about how do we interpret
- 44:18the length scales and then this makes me think about,
- 44:23well, if all that we're doing is interpreting
- 44:26the length scales, how much is that telling us about
- 44:29the phenomenon that we're interested in?
- 44:32And can we actually craft more complex hierarchical models
- 44:37so that we can actually learn something perhaps
- 44:41even biologically interpretable?
- 44:43So here, I'm looking at 2014 to 2016
- 44:47Ebola virus outbreak data.
- 44:50This is almost 22,000 cases.
- 44:54And of these cases, we have about 1600
- 45:00that are providing us genome data.
- 45:08And then of those 1600, we have a smaller subset
- 45:12that provide us genome data, as well as spatiotemporal data.
- 45:20So often people use genome data, say RNA sequences in order
- 45:27to try to infer the way that different viral cases
- 45:29are related to each other.
- 45:31And the question is, can we pull together sequenced
- 45:34and unsequenced data at the same time?
- 45:39So what I'm doing here is, again,
- 45:42I'm not inventing this.
- 45:44This is something that already exists.
- 45:47So all that I'm doing is modifying my triggering function G,
- 45:52and giving it this little N,
- 45:54this little subscript right there,
- 45:57which is denoting the fact that I'm allowing different viral
- 46:01observations to contribute to the rate function
- 46:05in different manners.
- 46:07And the exact form that that's gonna take on
- 46:09for my specific simple model that I'm using,
- 46:12is I'm going to give this a theta n.
- 46:17And I'm gonna include this theta n parameter
- 46:20in my self excitatory component.
- 46:22And this theta n is restricted to be greater than zero.
- 46:28So if it is greater than one,
- 46:30I'm gonna assume that actually, this self excite,
- 46:34excuse me, that this particular observation,
- 46:37little N is somehow more contagious.
- 46:41And if theta n is less than one,
- 46:43then I'm going to assume that it's less contagious.
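Concretely, for the simple model used here, each case n carries its own positive multiplier theta_n on its contribution to the self-excitatory component; a sketch of the form, with notation as above:

```latex
C(t, \mathbf{x}) \;=\; \sum_{n \,:\, t_n < t}
\theta_n\, g\big(t - t_n,\; \mathbf{x} - \mathbf{x}_n\big),
\qquad \theta_n > 0.
```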
- 46:48And this is an entirely unsatisfactory part of my talk,
- 46:52where I'm gonna gloss over a massive part of my model.
- 46:58And all that I'm gonna say is that
- 47:02this Phylogenetic Hawkes process, which I'm gonna be telling
- 47:05you about in the context of big modeling,
- 47:09and that challenge is that we start
- 47:13with the phylogenetic tree, which is simply the family tree
- 47:16that is uniting my 1600 sequenced cases.
- 47:22And then based on that, actually conditioned on that tree,
- 47:25we're gonna allow that tree to inform the larger
- 47:28covariance of my model parameters, which are then going to
- 47:33contribute to the overall Hawkes rate function
- 47:37in a differential manner, although it's still additive.
- 47:45Now, let's see.
- 47:49Do I get to go till 10 or 9:50?
- 47:57<v Man>So you can go till 10.</v>
- 47:59<v ->Okay, great.</v>
- 48:00So then, I'll quickly say that if I'm inferring
- 48:06all of these rates, then I'm inferring over 1300 rates.
- 48:13So that is actually the dimensionality
- 48:15of my posterior distribution.
- 48:21So a tool that I can use,
- 48:23a classic tool over 50 years old at this point,
- 48:26that I can use, is the random walk Metropolis
- 48:29algorithm, which is actually going to sample
- 48:32from the posterior distribution of these rates.
- 48:36And it's gonna do so in a manner that is effective
- 48:40in low dimensions, but not effective in high dimensions.
- 48:46And the way that it works is say,
- 48:47we start at negative three, negative three.
- 48:49What we want to do is we want to explore this high density
- 48:52region of this bi-variate Gaussian,
- 48:55and we slowly amble forward, and eventually we get there.
- 49:03But this algorithm breaks down in moderate dimensions.
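For concreteness, a bare-bones random-walk Metropolis sketch in Python; log_post, the step size, and the iteration count are assumptions for illustration rather than the talk's actual implementation:

```python
import numpy as np

def random_walk_metropolis(log_post, x0, n_iter=10_000, step=0.1, rng=None):
    """Isotropic Gaussian random-walk Metropolis.

    Works fine in a handful of dimensions but mixes very slowly as the
    dimension grows, which is the breakdown described in the talk.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    lp = log_post(x)
    samples = np.empty((n_iter, x.size))
    for i in range(n_iter):
        proposal = x + step * rng.standard_normal(x.size)
        lp_prop = log_post(proposal)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            x, lp = proposal, lp_prop
        samples[i] = x
    return samples
```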
- 49:07So.
- 49:11An algorithm that I think many of us are aware of
- 49:14at this point, that is kind of a workhorse
- 49:16in high dimensional Bayesian inference
- 49:18is Hamiltonian Monte Carlo.
- 49:20And this works by using actual gradient information about
- 49:24our log posterior in order to intelligently guide
- 49:28the MCMC proposals that we're making.
- 49:32So, again, let's just pretend that we start
- 49:34at negative three, negative three,
- 49:36but within a small number of steps,
- 49:38we're actually effectively exploring
- 49:40that high density region, and we're doing so
- 49:45because we're using that gradient information
- 49:47of the log posterior.
- 49:51I'm not going to go too deep right now into the formulation
- 49:56of Hamiltonian Monte Carlo, for the sake of time.
- 50:00But what I would like to point out,
- 50:04is that after constructing this kind of physical system
- 50:13that is based on our target distribution
- 50:20on the posterior distribution, in some manner,
- 50:24we actually obtain our proposals within the MCMC.
- 50:30We obtain the proposals by simulating, by forward simulating
- 50:35the physical system, according to Hamilton's equations.
- 50:40Now,
- 50:43what this simulation involves is a massive number
- 50:48of repeated gradient evaluations.
- 50:53Moreover, if the posterior distribution is an ugly one,
- 51:00that is, if it is ill-conditioned, which we interpret as
- 51:06the log posterior Hessian has eigenvalues
- 51:09that are all over the place.
- 51:12Then we can also use a mass matrix, M, which is gonna allow
- 51:17us to condition our dynamics, and make sure that we are
- 51:24exploring all the dimensions of our model in an even manner.
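A sketch of one preconditioned HMC proposal in Python, leapfrog-integrating Hamilton's equations with a diagonal mass matrix; the function names, step size, and number of leapfrog steps are assumptions for illustration, and every leapfrog step calls the gradient once, which is why the repeated gradient evaluations dominate the cost:

```python
import numpy as np

def hmc_proposal(log_post, grad_log_post, x, mass_diag, step=0.01, n_leapfrog=50, rng=None):
    """One HMC proposal: draw a momentum, leapfrog-integrate, accept or reject."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    p = rng.standard_normal(x.size) * np.sqrt(mass_diag)     # momentum ~ N(0, M)
    x_new, p_new = x.copy(), p.copy()
    p_new += 0.5 * step * grad_log_post(x_new)                # half step for momentum
    for _ in range(n_leapfrog - 1):
        x_new += step * p_new / mass_diag                     # full step for position
        p_new += step * grad_log_post(x_new)                  # full step for momentum
    x_new += step * p_new / mass_diag
    p_new += 0.5 * step * grad_log_post(x_new)                # final half step
    # Accept or reject using the Hamiltonian (potential energy plus kinetic energy).
    h_cur = -log_post(x) + 0.5 * np.sum(p ** 2 / mass_diag)
    h_new = -log_post(x_new) + 0.5 * np.sum(p_new ** 2 / mass_diag)
    return x_new if np.log(rng.uniform()) < h_cur - h_new else x
```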
- 51:29So the benefit of Hamiltonian Monte-Carlo is that it scales
- 51:32to tens of thousands of parameters.
- 51:34But the challenge is that HMC necessitates repeated
- 51:38computation of the log likelihood,
- 51:42its gradient, and then preconditioning.
- 51:46And the best way that I know to precondition actually
- 51:49involves evaluating the log likelihood Hessian as well.
- 51:55And I told you that the challenges that I'm talking about
- 51:57today are intertwined.
- 51:58So what does this look like in a big data setting?
- 52:02Well, we've already managed to speed up the log likelihood
- 52:06computations that are quadratic in computational complexity.
- 52:11Well, it turns out that the log likelihood gradient
- 52:14and the log likelihood Hessian
- 52:17are also quadratic in computational complexity.
- 52:21So this means that as the size of our data set grows,
- 52:24we're going to...
- 52:27HMC, which is good at scaling to high dimensional models
- 52:31is going to break down because it's just gonna take too long
- 52:35to evaluate the quantities that we need to evaluate.
- 52:43Let me show you exactly how these parallel
- 52:45gradient calculations can work.
- 52:51So, what am I gonna do?
- 52:53I'm gonna parallelize again on a GPU
- 52:55or a multi-core CPU implementation,
- 53:00and I'm interested in evaluating or obtaining
- 53:04the quantities in the red box.
- 53:06These are simply the gradient of the log likelihood
- 53:09with respect to the individual rate parameters.
- 53:13And because of the summation that it involves,
- 53:17what we have in the top left
- 53:21is the contribution of the first observation
- 53:25to that gradient term.
- 53:28Then we have the contribution of the second observation
- 53:31all the way up to the big Nth observation,
- 53:35that contribution to the gradient term.
- 53:37And these all need to be evaluated and summed over.
- 53:41So what do we do?
- 53:42We just do a running total, very simple.
- 53:45We start by getting the first contribution.
- 53:49We keep that stored in place.
- 53:53We evaluate the second contribution,
- 53:56all at the same time in parallel,
- 53:57and we simply increment our total observat-
- 54:01excuse me, our total gradient by that value.
- 54:05Very simple.
- 54:06We do this again and again.
- 54:08Kind of complicated to program, to be honest.
- 54:11But it's simple.
- 54:16It's simple when you think about it from the high level.
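A sketch of the running-total idea in Python: each observation's contribution to the gradient is evaluated (in the real software, by many threads at once) and then added into the accumulator; contribution here is a placeholder function, not the actual model gradient:

```python
import numpy as np

def accumulate_gradient(contribution, n_events, dim):
    """Running-total accumulation of per-observation gradient contributions.

    contribution(n) returns observation n's length-`dim` contribution to the
    log-likelihood gradient; in the GPU implementation each such contribution
    is itself computed by many threads in parallel before being added in.
    """
    total = np.zeros(dim)
    for n in range(n_events):
        total += contribution(n)   # evaluate contribution n and increment the total
    return total
```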
- 54:19So I showed you this figure before.
- 54:21And well, a similar figure before,
- 54:24and the interpretations are the same,
- 54:26but here I'll just focus on the question that I received.
- 54:30In the top left, we have the gradient.
- 54:32In the bottom left, excuse me,
- 54:34top row, we have the gradient.
- 54:35Bottom row, we have the Hessian,
- 54:37and here I'm increasing to 104 cores.
- 54:42So this is not infinite cores, right.
- 54:46It's 104.
- 54:47But I do want you to see that there's diminishing returns.
- 54:54And to give a little bit more technical
- 54:57response to that question,
- 55:02the thing to bear in mind is that
- 55:04it's not just about the number of threads that we use.
- 55:08It's having a lot of RAM very close
- 55:12to where the computing is being done.
- 55:15And that is something that GPUs,
- 55:18modern gigantic GPUs do very well.
- 55:26So why is it important to do all this parallelization?
- 55:28Well, this is really, I want to kind of communicate
- 55:32this fact because it is so important.
- 55:36This slide underlines almost the entire challenge
- 55:40of big modeling using the spatiotemporal Hawkes process.
- 55:44The computing to apply this model to the 20,000 plus
- 55:49data points took about a month
- 55:54using a very large Nvidia GV100 GPU.
- 56:00Why?
- 56:01Because we had to generate 100 million Markov chain states
- 56:04at a rate of roughly three and a half million each day.
- 56:11After 100 million Markov chain states,
- 56:15after generating 100 million Markov chain states,
- 56:20this is the empirical distribution on the left
- 56:23of the effective sample sizes across,
- 56:28across all of the individual rates that we're inferring,
- 56:31actually all the model parameters.
- 56:34The minimum is 222, and that's right above my typical
- 56:39threshold of 200, because in general, we want the effective
- 56:43sample size to be as large as possible.
- 56:48Well, why was it so difficult?
- 56:50Well, a lot of the posterior,
- 56:53a lot of the marginal posteriors
- 56:55for our different parameters were very complex.
- 57:01So for example, here, I just have one individual rate,
- 57:05and this is the posterior that we learned from it.
- 57:08It's bi-modal.
- 57:10And not only is it bi-modal,
- 57:11but the modes exist on very different scales.
- 57:16Well, why else is it a difficult posterior to sample from?
- 57:19Well, because actually, as you might imagine,
- 57:22these rates have a very complex correlation structure.
- 57:28This is kind of repeating something that I said earlier
- 57:30when we were actually inferring locations,
- 57:33which is that what this amounts to is really simulating
- 57:36a very large n-body problem.
- 57:44But what's the upshot?
- 57:45Well, we can actually capture these individual rates,
- 57:51which could give us hints at where to look for certain
- 57:55mutations that are allowing, say in this example,
- 58:01the Ebola virus to spread more effectively.
- 58:05And here, red is generally the highest,
- 58:09whereas blue is the lowest.
- 58:13We can get credible intervals,
- 58:15which can give us another way of thinking about, you know,
- 58:18where should I be looking
- 58:22in this collection of viral samples, for the next big one?
- 58:29And then I can also ask, well, how do these rates actually
- 58:32distribute along the phylogenetic tree?
- 58:37So I can look for clades or groups of branches
- 58:41that are in general, more red in this case than others.
- 58:53So, something that I...
- 58:55Okay, so it's 10 o'clock, and I will finish in one slide.
- 59:03The challenges that I'm talking about today,
- 59:05they're complex and they're intertwined,
- 59:08but they're not the only challenges.
- 59:10There are many challenges in the application
- 59:14of spatiotemporal Hawkes models,
- 59:16and there's actually a very large literature.
- 59:21So some other challenges that we might consider,
- 59:25and that will also be extremely challenging to overcome
- 59:31in a big data setting.
- 59:32So, kind of the first challenge is flexible modeling.
- 59:38So here, we want to use as flexible
- 59:41of a Hawkes model as possible.
- 59:44And this challenge kind of encapsulates one of the great
- 59:49ironies of model-based nonparametrics, which is that,
- 59:55the precise time that we actually want to use
- 59:58a flexible model, is the big data setting.
- 01:00:03I mean, I don't know if you recall my earlier slide
- 01:00:07where I was showing the posterior distribution
- 01:00:10of some of the length scales associated with
- 01:00:13the Washington DC data, and they're extremely tight.
- 01:00:19But this is actually exactly where we'd want to be able
- 01:00:24to use a flexible model, because no matter what,
- 01:00:28if I apply my model to 85,000 data points,
- 01:00:32I'm going to be very certain in my conclusion,
- 01:00:36conditioned on the specific model that I'm using.
- 01:00:41There's also boundary issues, right.
- 01:00:43This is a huge, a huge thing.
- 01:00:45So for those of you that are aware
- 01:00:47of the survival literature, which I'm sure many of you are,
- 01:00:52you know, they're censoring.
- 01:00:54So what about gunshots that occurred right outside
- 01:00:57of the border of Washington DC, and it occurred as a result
- 01:01:01of gunshots that occurred within the border?
- 01:01:03And then we can flip that on its head.
- 01:01:05What about parent events outside of Washington DC
- 01:01:10that precipitated gun violence within Washington DC.
- 01:01:13And then finally, sticking with the same example,
- 01:01:16differential sampling.
- 01:01:20You can be certain that those acoustic gunshot
- 01:01:27locator system sensors are not planted
- 01:01:30all over Washington DC.
- 01:01:34And how does their distribution affect things?
- 01:01:41Okay.
- 01:01:42This is joint work with Mark Suchard, also at UCLA.
- 01:01:45And then my very good friend,
- 01:01:47my very dear friend, Xiang Ji at Tulane.
- 01:01:50It's funded by the K Award, Big Data Predictive Phylogenetics
- 01:01:54with Bayesian Learning, from the NIH.
- 01:01:58And that's it.
- 01:01:59Thank you.
- 01:02:06<v Man>All right.</v>
- 01:02:07Thank you so much, Professor Holbrook.
- 01:02:08Does anybody have any other questions?
- 01:02:11(people speaking indistinctly)
- 01:02:18Yeah.
- 01:02:21Any other questions from the room here, or from Zoom?
- 01:02:25(people speaking indistinctly)