YSPH Biostatistics Seminar: "Three Challenges Confronting Spatiotemporal Hawkes Models"
September 28, 2021
Andrew Holbrook, PhD, Assistant Professor, Department of Biostatistics, University of California, Los Angeles
Transcript
- 00:00<v Man>Good afternoon, everybody.</v>
- 00:02Good morning, Professor Holbrook.
- 00:05Today I'm honored to introduce Professor Andrew Holbrook.
- 00:08So Professor Holbrook earned his bachelor's from UC Berkeley
- 00:11and a statistics masters and PhD from UC Irvine.
- 00:15His research touches a number of areas
- 00:17of biomedical interest,
- 00:18including Alzheimer's and epidemiology.
- 00:22He's currently an assistant professor
- 00:24of biostatistics at UCLA, where he teaches their advanced
- 00:27Bayesian computing course.
- 00:29And he's the author of several pieces
- 00:30of scientific software.
- 00:32In all of it, I think, he's very fond of parallelization,
- 00:37and he also has packages, including one for studying
- 00:40Hawkes processes, which he's going to tell us...
- 00:44Well, he's gonna tell us about the biological phenomenon
- 00:46and what's going on today.
- 00:48So Professor Holbrook, thank you so much.
- 00:52<v ->Okay, great.</v>
- 00:53Thank you so much for the kind invitation,
- 00:57and thanks for having me this morning slash afternoon.
- 01:02So today I'm actually gonna be kind of trying to present
- 01:06more of a high level talk that's gonna just focus on
- 01:10a couple of different problems that have
- 01:14come up when modeling Hawkes processes
- 01:18for public health data, and in particular
- 01:21for large scale public health data.
- 01:24So, today I'm interested in spatiotemporal data
- 01:28in public health, and this can take a number
- 01:30of different forms.
- 01:33So a great example of this is in Washington D.C.
- 01:39Here, I've got about 4,000 gunshots.
- 01:42You'll see this figure again,
- 01:44and I'll explain the colors to you
- 01:46and everything like that.
- 01:49But I just want you to see that in the year 2018 alone,
- 01:53there were 4,000 gunshots recorded in Washington DC.
- 01:57And this is just one example of really a gun violence
- 02:01problem in the U.S. of epidemic proportions.
- 02:07But spatiotemporal public health data
- 02:10can take on many forms.
- 02:11So here, for example, I have almost 3,000 wildfires
- 02:18in Alaska between the years 2015 and 2019.
- 02:24And this is actually just one piece of a larger
- 02:30trend that's going on in the American west.
- 02:35And then finally, another example of spatiotemporal public
- 02:39health data is, and I believe that we don't need to spend
- 02:44too much time on this motivation,
- 02:46but it's the global spread of viruses.
- 02:48So for example, here, I've got 5,000 influenza cases
- 02:52recorded from 2000 to 2012.
- 02:58So if I want to model this data,
- 03:00what I'm doing is I'm modeling event data.
- 03:02And one of the classic models for doing so,
- 03:06really the canonical stochastic process
- 03:12in this context, is the Poisson process.
- 03:14And I hope that you'll bear with me if we do just a little
- 03:18bit of review for our probability 101.
- 03:21But we say that a counting process
- 03:24is a homogeneous Poisson point process
- 03:28with rate parameter, excuse me, parameter lambda,
- 03:32which is greater than zero,
- 03:34if this process is equal to zero at time zero,
- 03:39if it has independent increments, excuse me,
- 03:43if its increments over non-overlapping intervals
- 03:48are independent random variables,
- 03:50and then finally, if its increments
- 03:52are Poisson distributed with mean given
- 03:57by that rate parameter lambda
- 04:00times the difference in the times.
- 04:04So we can make this model
- 04:07just a very little bit more complex.
- 04:09We can create an inhomogeneous Poisson point process,
- 04:13simply by saying that that rate parameter
- 04:16is no longer fixed, but itself is a function
- 04:20over the positive real line.
- 04:22And here everything is the exact same,
- 04:24except now we're saying that its increments,
- 04:28its differences over two different time periods,
- 04:30are Poisson distributed, where now the mean is simply given
- 04:35by the definite integral over that interval.
- 04:40So we just integrate that rate function.
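For reference, the two definitions just described can be written compactly, with N(t) denoting the counting process; this is a standard restatement added for readers skimming the transcript, not a reproduction of the talk's slides:

```latex
% Homogeneous Poisson process with rate \lambda > 0:
%   N(0) = 0, increments over non-overlapping intervals are independent, and
N(t_2) - N(t_1) \sim \operatorname{Poisson}\bigl(\lambda\,(t_2 - t_1)\bigr), \qquad 0 \le t_1 < t_2 .
% Inhomogeneous Poisson process with rate function \lambda(\cdot):
N(t_2) - N(t_1) \sim \operatorname{Poisson}\!\left(\int_{t_1}^{t_2} \lambda(u)\,du\right).
```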
- 04:44Okay.
- 04:45So then how do we choose our rate function for the problems
- 04:48that we're interested in?
- 04:49Well, if we return to say the gun violence example,
- 04:53then it is plausible that at least sometimes some gun
- 04:58violence might precipitate more gun violence.
- 05:03So here we would say that having observed an event,
- 05:09having observed gunshots at a certain location
- 05:12at a certain time, we might expect that the probability
- 05:15of observing gunshots nearby and soon after is elevated,
- 05:23and the same could plausibly go for wildfires as well.
- 05:28That is, having observed a wildfire in a certain location,
- 05:33this could directly contribute to the existence
- 05:39or to the observation of other wildfires.
- 05:42So for example, this could happen by natural means.
- 05:45So we could have embers that are blown by the wind,
- 05:51or there could be a human that is in fact
- 05:54causing these wildfires, which is also quite common.
- 06:01And then it's not a stretch at all
- 06:03to believe that one viral observation,
- 06:08so a child sick with influenza, could precipitate
- 06:12another child that becomes sick with influenza
- 06:16in the same classroom and perhaps on the next day.
- 06:23So then, the solution to building this kind of dynamic into
- 06:27an inhomogeneous Poisson process is simply to craft
- 06:33the rate function in a way that is asymmetric in time.
- 06:37So here is just a regular temporal Hawkes process.
- 06:43And what we do is we divide this rate function, lambda T,
- 06:48which I'm showing you in the bottom of the equation,
- 06:51into a background portion, which here
- 06:55I denote nu, and this nu can be a function itself.
- 07:00And then we also have this self excitatory component C of T.
- 07:04And this self excitatory component for time T,
- 07:08it depends exclusively on observations
- 07:13that occur before time T.
- 07:17So each tn, where tn is less than T,
- 07:22is able to contribute information
- 07:25in some way to this process.
- 07:29And typically G is our triggering function.
- 07:32G is non-increasing.
- 07:37And then the only other thing that we ask
- 07:40is that the different events contribute
- 07:42in an additive manner to the rate.
- 07:45So here, we've got the background rate in this picture.
- 07:49We have observation T1.
- 07:50The rate increases.
- 07:53It slowly decreases.
- 07:55We have another observation, the rate increases.
- 07:57And what you see is actually that after T1,
- 08:00we have a nice little bit of self excitation as it's termed,
- 08:04where we observe more observations.
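One plausible way to write the rate function being described, with nu the background rate and g the non-increasing triggering function; this uses generic notation and is not necessarily the notation on the speaker's slide:

```latex
\lambda(t) \;=\; \nu(t) \;+\; \xi(t), \qquad
\xi(t) \;=\; \sum_{n \,:\, t_n < t} g(t - t_n), \qquad g \text{ non-increasing}.
```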
- 08:09This model itself can be made just a little bit more complex
- 08:13if we add a spatial component.
- 08:14So here now, is the spatiotemporal Hawkes process
- 08:18where I'm simply showing you the background process,
- 08:22which now I'm allowing to be described
- 08:26by a rate function over space.
- 08:29And then, we also have the self excitatory component,
- 08:32which again, although it also involves
- 08:35a spatial component in it,
- 08:37it still has this asymmetry in time.
- 08:40So in this picture, we have these,
- 08:42what are often called immigrant events
- 08:44or parent events in black.
- 08:49And then we have the child events,
- 08:50the offspring from these events described in blue.
- 08:53So this appears to be a pretty good stochastic process model,
- 08:58which is not overly complex, but is simply complex enough
- 09:02to capture contagion dynamics.
- 09:08So for this talk, I'm gonna be talking about some major
- 09:11challenges that are confronting, really, the data analysis
- 09:17using the Hawkes process.
- 09:19So very applied in nature, and these challenges persist
- 09:23despite the use of a very simple model.
- 09:26So basically, all the models that I'm showing you today
- 09:29are variations on this extremely simple model,
- 09:33as far as the Hawkes process literature goes.
- 09:35So we assume an exponential decay triggering function.
- 09:40So here in this self excitatory component,
- 09:42what this looks like is the triggering function
- 09:47is simply the exponentiation of negative omega
- 09:52times T minus tn, where one over omega is some sort of length scale.
- 09:56And T minus tn is, again,
- 09:58that difference between a time T
- 10:01and the preceding event times.
- 10:05And then we're also assuming Gaussian kernel
- 10:07spatial smoothers, very simple.
- 10:09And then finally, another simplifying assumption
- 10:12that we're making is separability.
- 10:14So, in these individual components of the rate function,
- 10:20we always have separation between the temporal component,
- 10:25here on the left, and the spatial component
- 10:28on the right, and this is a simplifying assumption.
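A minimal sketch of such a separable rate, with an exponential temporal trigger and Gaussian spatial kernels; the function and parameter names here are illustrative stand-ins, not the speaker's code or notation:

```python
import numpy as np

def separable_hawkes_rate(t, x, event_times, event_locs,
                          mu0, tau_x, theta0, omega, sigma_x):
    """Minimal sketch of a separable spatiotemporal Hawkes rate at (t, x).

    Background: a constant level mu0 times a Gaussian kernel smoother
    (bandwidth tau_x) over the observed locations.
    Self-excitation: exponential temporal decay (rate omega, so 1/omega is
    the temporal length scale) times a Gaussian spatial kernel (bandwidth
    sigma_x), summed over events strictly before t and weighted by theta0.
    All names are assumptions made for illustration only.
    """
    d2 = np.sum((event_locs - x) ** 2, axis=1)            # squared distances to x
    background = mu0 * np.mean(
        np.exp(-d2 / (2.0 * tau_x ** 2)) / (2.0 * np.pi * tau_x ** 2)
    )
    past = event_times < t                                 # only earlier events excite
    dt = t - event_times[past]
    self_excitation = theta0 * np.sum(
        omega * np.exp(-omega * dt)
        * np.exp(-d2[past] / (2.0 * sigma_x ** 2)) / (2.0 * np.pi * sigma_x ** 2)
    )
    return background + self_excitation
```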
- 10:34So what are the challenges that I'm gonna present today?
- 10:37The first challenge is big data because when we are modeling
- 10:42many events, what we see is the computational complexity
- 10:46of actually carrying out inference,
- 10:51whether using maximum likelihood or using say,
- 10:54Markov chain Monte Carlo,
- 10:56well, that's actually gonna explode quickly,
- 10:57the computational complexity.
- 10:59Something else is the spatial data precision.
- 11:02And this is actually related to big data.
- 11:06As we accrue more data,
- 11:08it's harder to guarantee data quality,
- 11:11but then also the tools that I'm gonna offer up to actually
- 11:15deal with poor spatial data precision are actually
- 11:18gonna also suffer under a big data setting.
- 11:21And then finally, big models.
- 11:24So, you know, when we're trying to draw very specific
- 11:27scientific conclusions from our model, then what happens?
- 11:31And all these data, excuse me,
- 11:33all these challenges are intertwined,
- 11:34and I'll try to express that.
- 11:39Finally today, I am interested in scientifically
- 11:43interpretable inference.
- 11:46So, I'm not gonna talk about prediction,
- 11:48but if you have questions about prediction,
- 11:51then we can talk about that afterward.
- 11:53I'm happy to.
- 11:57Okay.
- 11:58So I've shown you this figure before,
- 12:00and it's not the last time that you'll see it.
- 12:02But again, this is 4,000 gunshots in 2018.
- 12:05This is part of a larger dataset that's made available
- 12:07by the Washington DC Police Department.
- 12:12And in fact, from 2006 to 2018,
- 12:15we have over 85,000 potential gunshots recorded.
- 12:20How are they recorded?
- 12:21They're recorded using the help of an acoustic gunshot
- 12:24locator system that uses the actual acoustics
- 12:28to triangulate the time and the location
- 12:32of the individual gunshots.
- 12:35So in a 2018 paper, Charles Loeffler and Seth Flaxman,
- 12:40they used a subset of this data in a paper entitled
- 12:44"Is Gun Violence Contagious?"
- 12:46And they in fact apply a Hawkes process model
- 12:49to try to determine their question,
- 12:51the answer to their question.
- 12:53But in order to do so, though,
- 12:55they had to significantly subset the data.
- 12:57They took roughly 10% of the data.
- 13:00So the question is whether their conclusions,
- 13:02which were, in fact, in the affirmative,
- 13:06they were able to detect this kind of contagion dynamics.
- 13:11But the question is, do their results hold
- 13:14when we analyze the complete data set?
- 13:18So for likelihood based inference,
- 13:20which we're going to need to use in order to learn,
- 13:25in order to apply the Hawkes process to real-world data,
- 13:30the first thing to see is that the likelihood
- 13:34takes on the form of an integral term on the left.
- 13:39And then we have a simple product of the rate function
- 13:43evaluated at our individual events, observed events.
- 13:50And when we consider the log likelihood,
- 13:53then it in fact will involve this term that I'm showing you
- 13:58on the bottom line, where it's the sum
- 14:00of the log of the, again, the rate function evaluated
- 14:04at the individual events. (background ringing)
- 14:07I'm sorry.
- 14:08You might be hearing a little bit of the sounds
- 14:10of Los Angeles in the background, and there's very little
- 14:14that I can do about Los Angeles.
- 14:16So moving on.
- 14:19So this summation in the log likelihood occurs.
- 14:25It actually involves a double summation.
- 14:28So it is the sum over all of our observations,
- 14:32of the log of the rate function.
- 14:34And then, again, the rate function because of the very
- 14:37specific form taken by the self excitatory component
- 14:41is also gonna involve this summation.
- 14:45So the upshot is that, every time we evaluate the log likelihood,
- 14:49we're going to need to evaluate
- 14:53N choose two terms,
- 14:59where N is the number of data points,
- 15:01in this summation right here,
- 15:06and then we're gonna need to sum them together.
- 15:09And then the gradient also features this
- 15:16quadratic computational complexity.
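To make that quadratic cost concrete, here is a naive sketch of the double summation for a purely temporal version of the model; the constant background and parameter names are assumptions, the spatial factors and the integral term of the log likelihood are omitted, and this is not the speaker's implementation:

```python
import numpy as np

def self_exciting_log_terms_naive(times, theta0, omega, nu=0.1):
    """Naive O(N^2) evaluation of the event-wise sums in a Hawkes log likelihood.

    For each event i we add up the exponential-decay triggering contributions
    of all earlier events j < i, then take log(background + excitation).
    Illustrative only: the point is the quadratic structure.
    """
    n = len(times)
    total = 0.0
    for i in range(n):                             # outer sum over events
        excitation = 0.0
        for j in range(i):                         # inner sum over earlier events
            excitation += theta0 * omega * np.exp(-omega * (times[i] - times[j]))
        total += np.log(nu + excitation)           # about N*(N-1)/2 inner terms overall
    return total
```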
- 15:21So the solution, the first solution that I'm gonna offer up
- 15:23is not a statistical solution.
- 15:25It's a parallel computing solution.
- 15:27And the basic idea is, well, all of these terms that we need
- 15:31to sum over, evaluate and sum over, let's do it all at once
- 15:36and thereby speed up our inference.
- 15:41I do so, using multiple computational tools.
- 15:44So the first one is, I use CPUs, just multi-core CPUs.
- 15:50These can have anywhere from two to 100 cores.
- 15:54And then I combine this with something called SIMD,
- 15:58single instruction multiple data, which is vectorization.
- 16:02So the idea, the basic idea is that I can apply a function,
- 16:09the same function, the same instruction set to an extended
- 16:13register or vector of input data, and thereby speed up
- 16:20my computing by a factor that is proportional
- 16:24to the size of the vector that I'm evaluating
- 16:27my function over.
- 16:29And then, I actually can do something better than this.
- 16:33I can use a graphics processing unit,
- 16:35which instead of hundreds of cores, has thousands of cores.
- 16:39And instead of SIMD, or it can be interpreted as SIMD,
- 16:42but Nvidia likes to call it a single instruction
- 16:45multiple threads or SIMT.
- 16:47And here, the major difference
- 16:50is the scale at which it's occurring.
- 16:54And then, the other difference is that actually
- 16:58individual threads or small working groups of threads
- 17:01on my GPU can work together.
- 17:03So actually the tools that I have available are very complex
- 17:07and need a lot of care.
- 17:10There's a lot of need to carefully code this up.
- 17:13The solution is not statistical, but it's very much
- 17:18an engineering solution.
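As a rough illustration of the single-instruction, multiple-data idea, the same double sum can be phrased as array operations so that one instruction stream processes many pairwise terms at once; this numpy sketch mimics the data-parallel pattern only, not the speaker's C++/CUDA implementation, and its names are illustrative:

```python
import numpy as np

def self_exciting_log_terms_vectorized(times, theta0, omega, nu=0.1):
    """Same double sum as the naive loop, expressed as array operations.

    Broadcasting forms all pairwise time differences at once, so the
    arithmetic over the roughly N^2/2 triggering terms is handed to
    vectorized (SIMD-style) kernels; on a GPU array library the same idea
    maps onto thousands of threads. Memory here is O(N^2), so this is a
    sketch of the pattern, not a production implementation.
    """
    dt = times[:, None] - times[None, :]          # t_i - t_j for every ordered pair
    mask = dt > 0                                  # keep only events j strictly before i
    safe_dt = np.where(mask, dt, 0.0)              # avoid overflow in exp for masked pairs
    trig = np.where(mask, theta0 * omega * np.exp(-omega * safe_dt), 0.0)
    excitation = trig.sum(axis=1)                  # total triggering felt by each event
    return np.sum(np.log(nu + excitation))         # log of background plus excitation, summed
```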
- 17:19But the results are really, really impressive
- 17:24from my standpoint, because if I compare.
- 17:27So on the left, I'm comparing relative speed ups against
- 17:32a very fast single core SIMD implementation.
- 17:40So my baseline right here is the bottom of this blue curve.
- 17:44The X axis is giving me the number of CPU threads
- 17:48that I'm using, between one and 18.
- 17:52And then, the top line is not using CPU threads.
- 17:55So I just create a top-line that's flat.
- 17:58These are the GPU results.
- 18:01If I don't use SIMD, if I use non-vectorized
- 18:04single core computing, of course, this is still
- 18:06a pre-compiled C++ implementation.
- 18:08So it's fast or at least faster than R,
- 18:11and I'll show you that on the next slide.
- 18:13If I do that, then AVX is twice as fast.
- 18:17As I increased the number of cores,
- 18:21my relative speed up increases,
- 18:24but I also suffer diminishing returns.
- 18:28And then that is actually all these simulations
- 18:31on the left-hand plot.
- 18:33That's for a fixed amount of data.
- 18:34That's 75,000 randomly generated data points
- 18:38at each iteration of my simulation.
- 18:42But I can also just look at the seconds per evaluation.
- 18:45So that's my Y axis on the right-hand side.
- 18:49So ideally I want this to be as low as possible.
- 18:53And then I'm increasing the number of data points
- 18:56on the Y axis, on the X axis, excuse me.
- 19:00And then as the number of threads that I use,
- 19:03as I increased the number of threads,
- 19:05then my implementation is much faster.
- 19:08But again, you're seeing this quadratic computational
- 19:12complexity at play, right.
- 19:14All of these lines are looking rather parabolic.
- 19:18Finally, I go down all the way to the bottom,
- 19:21where I've got my GPU curve,
- 19:22again suffering the
- 19:25quadratic computational complexity,
- 19:27which we can't get past, but doing a much better job
- 19:31than the CPU computing.
- 19:32Now you might ask, well, you might say,
- 19:35well, a 100 fold speed up is not that great.
- 19:38So I'd put this in perspective and say, well,
- 19:41what does this mean for R, which I use every day?
- 19:45Well, what it amounts to,
- 19:49and here, I'll just focus on the relative speed up
- 19:51over the R implementation on the right.
- 19:55The GPU is reliably over 1000 times faster.
- 20:04So the way that Charles Loeffler and Seth Flaxman
- 20:12obtained a subset of their data was actually
- 20:16by thinning the data.
- 20:21They needed to do so because of the sheer computational
- 20:24complexity of using the Hawkes model.
- 20:27So, I'm not criticizing this in any way,
- 20:30but I'm simply pointing out why our results
- 20:34using the full data set, differ.
- 20:36So on the left, on the top left,
- 20:40we have the posterior density for the spatial length scale
- 20:44of the self excitatory component.
- 20:46And when we use the full data set,
- 20:48then we believe that we're operating more at around 70
- 20:51meters instead of the 126 inferred in the original paper.
- 20:56So one thing that you might notice is our posterior
- 21:01densities, in blue, are much more concentrated
- 21:08than the original analysis, in salmon.
- 21:12And this of course makes sense.
- 21:14We're using 10 times the amount of the data.
- 21:18Our temporal length scale is also,
- 21:20we believe, much smaller, in fact.
- 21:24So now it's down to one minute instead of 10 minutes.
- 21:28Again, this could be interpreted
- 21:29as the simple result of thinning.
- 21:32And then finally, I just want to focus on
- 21:35the green posterior density.
- 21:41This is the proportion of events that we're interpreting
- 21:45that arise from self excitation or contagion dynamics.
- 21:50Experts believe that anywhere between 10 and 18% of gun
- 21:56violence events are retaliatory in nature.
- 21:59So actually our inference is kind of agreeing with that;
- 22:07it's safely within the band suggested by the experts.
- 22:15Actually, another thing that we can do,
- 22:18and that is also pretty computationally demanding,
- 22:22so this is also quadratic computational complexity
- 22:27again, is post-processing.
- 22:30So if, for example, for individual events,
- 22:32we want to know the probability that the event arose
- 22:38from retaliatory gun violence,
- 22:41then we could look at the self excitatory component
- 22:46of the rate function divided by the total rate function.
- 22:49And then we can just look at the posterior
- 22:51distribution of this statistic.
- 22:55And this will give us our posterior probability
- 22:58that the event arose from contagion dynamics at least.
- 23:04And you can see that we can actually observe
- 23:06a very wide variety of values.
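A small sketch of that post-processing statistic: given posterior draws of the background and self-excitatory rate components evaluated at one event (hypothetical inputs, named here for illustration), the event's probability of arising from contagion dynamics is the self-excitatory share of the total rate, summarized over draws:

```python
import numpy as np

def prob_self_excited(background_draws, excitation_draws):
    """Posterior probability that one event arose from self-excitation.

    background_draws and excitation_draws are assumed arrays of posterior
    draws of the two rate components at that event's time and location.
    """
    ratio = excitation_draws / (background_draws + excitation_draws)
    return ratio.mean(), np.percentile(ratio, [2.5, 97.5])   # point estimate and 95% interval
```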
- 23:23So the issue of big data is actually not gonna go away,
- 23:28as we move on to discussing spatial data precision.
- 23:33Now, I'll tell you a little bit more about this data.
- 23:38All the data that we access, which is freely accessible online,
- 23:42is rounded to the nearest 100 meters
- 23:48by the DC Police Department.
- 23:51And the reason that they do this is for reasons of privacy.
- 23:58So one immediate question that we can ask is, well,
- 24:01how does this rounding actually affect our inference?
- 24:10Now we actually observed wildfires
- 24:13of wildly different sizes.
- 24:16And the question is, well, how does...
- 24:23If we want to model the spread of wildfires,
- 24:28then it would be useful to know
- 24:30where the actual ignition site,
- 24:33the site of ignition was.
- 24:37Where did the fire occur originally?
- 24:41And many of these fires are actually discovered
- 24:44out in the wild, far away from humans.
- 24:48And there's a lot of uncertainty.
- 24:50There are actually large swaths of land that are involved.
- 24:57Finally, this, this global influenza data
- 25:00is very nice for certain reasons.
- 25:03For example, all of the observations
- 25:07actually provide viral genome data.
- 25:10So we can perform other more complex
- 25:12analyses on the data.
- 25:14And in fact, I'll do that in the third section
- 25:17for related data.
- 25:21But the actual spatial precision for this data is very poor.
- 25:25So, for some of these viral cases,
- 25:29we know the city in which it occurred.
- 25:32For some of them, we know the region
- 25:34or the state in which it occurred.
- 25:35And for some of them, we know the country
- 25:37in which it occurred.
- 25:40So I'm gonna start with the easy problem,
- 25:42which is analyzing the DC gun violence, the DC gunshot data.
- 25:48And here again, the police department rounds the data
- 25:50to the nearest hundred meters.
- 25:52So what do we do?
- 25:53We take that at face value and we simply use,
- 25:57place a uniform prior over the 10,000-square-meter box
- 26:04that is centered at each one of our observations.
- 26:06So here I'm denoting our actual data,
- 26:10our observed data with this kind of Gothic X,
- 26:15and then I'm placing a prior over the location
- 26:17at which the gunshot actually occurred.
- 26:19And this is a uniform prior over a box centered at my data.
- 26:23And using this prior actually has another interpretation
- 26:28similar to some other concepts
- 26:33from the missing data literature.
- 26:36And use of this prior actually corresponds to using
- 26:40something called the grouped data likelihood.
- 26:43And it's akin to the expected complete-data likelihood
- 26:48if you're familiar with the missing data literature.
- 26:53So what we do, and I'm not gonna get too much into
- 26:57the inference at this point, but we actually use MCMC
- 27:00to simultaneously infer the locations,
- 27:04and the Hawkes model parameters,
- 27:08the rate function parameters at the same time.
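As a sketch of what one such update could look like, here is a single Metropolis step for one latent location under a uniform prior on a 100 meter by 100 meter box; `log_rate_fn` is a hypothetical stand-in for the conditional Hawkes log likelihood as a function of that one location, and none of this is the speaker's implementation:

```python
import numpy as np

def update_location(x_obs, x_current, log_rate_fn,
                    half_width=50.0, step=10.0, rng=None):
    """One Metropolis step for a single latent gunshot location (in meters).

    The prior is uniform on the square of side 2 * half_width centered at the
    rounded observation x_obs; the proposal is a small Gaussian jitter.
    log_rate_fn(x) is an assumed placeholder for the conditional Hawkes log
    likelihood as a function of this one location, all else held fixed.
    """
    rng = rng or np.random.default_rng()
    proposal = x_current + rng.normal(scale=step, size=2)
    if np.any(np.abs(proposal - x_obs) > half_width):      # falls outside the prior box
        return x_current                                    # prior density zero: reject
    log_accept = log_rate_fn(proposal) - log_rate_fn(x_current)
    if np.log(rng.uniform()) < log_accept:
        return proposal
    return x_current
```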
- 27:12So here, I'm just showing you a couple of examples
- 27:15of what this looks like.
- 27:16For each one of our observations colored yellow,
- 27:20we then have 100 posterior samples.
- 27:25So these dynamics can take on different forms
- 27:28and they take on different forms in very complex ways,
- 27:32simply because what we're essentially doing when we're...
- 27:38I'm going to loosely use the word impute.
- 27:41When we're imputing this data, when we're actually inferring
- 27:44these locations, we're basically simulating
- 27:47from a very complex n-body problem.
- 27:53So on the left, how can we interpret this?
- 27:57Well, we've got these four points and the model believes
- 28:01that actually they are farther away
- 28:02from each other than observed.
- 28:04Why is that?
- 28:05Well, right in the middle here, we have a shopping center,
- 28:09where there's actually many less gunshots.
- 28:13And then we've got residential areas
- 28:15where there are many more gunshots on the outside.
- 28:18And in the bottom right, we actually have all of these;
- 28:26we believe that the actual locations of these gunshots
- 28:30collect closer together, kind of toward a very high
- 28:34intensity region in Washington, DC.
- 28:39And then we can just think about
- 28:41the general posterior displacement.
- 28:44So the mean posterior displacement.
- 28:46So in general, are there certain points that,
- 28:50where the model believes that the gunshots occurred
- 28:53further away from the observed events?
- 28:58And in general, there's not really.
- 29:01It's hard to come up with any steadfast rules.
- 29:04For example, in the bottom, right, we have some shots,
- 29:08some gunshots that show a very large posterior displacement,
- 29:13and they're in a very high density region.
- 29:15Whereas on the top, we also get large displacement
- 29:19and we're not surrounded by very many gunshots at all.
- 29:21So it is a very complex n-body problem
- 29:24that we're solving.
- 29:27And the good news is, for this problem,
- 29:30it doesn't matter much anyway.
- 29:32The results that we get are pretty much the same.
- 29:37I mean, so from the standpoint of statistical significance,
- 29:42we do get some statistically significant results.
- 29:45So in this figure, on the top,
- 29:47I'm showing you 95% credible intervals,
- 29:51and this is the self excitatory spatial length scale.
- 29:56We believe that it's smaller,
- 29:57but from a practical standpoint, it's not much smaller.
- 30:01It's a difference between 60 meters
- 30:03and maybe it's at 73 meters, 72 meters.
- 30:13But we shouldn't take too much comfort
- 30:16because actually as we increase the spatial prec-
- 30:19excuse me, as we decrease the spatial precision,
- 30:22we find that the model that does not take account
- 30:26of the rounding, performs much worse.
- 30:29So for example, if you look in the table,
- 30:33then we have the fixed locations model,
- 30:37where I'm not actually inferring the locations.
- 30:40And I just want to see, what's the empirical coverage
- 30:45of the 95% credible intervals?
- 30:48And let's just focus on the 95%
- 30:53credible intervals, specifically,
- 30:55simply because actually the other intervals,
- 30:59the 50% credible interval, the 80% credible interval,
- 31:03they showed a similar dynamic, which is that as we,
- 31:10so if we start on the right-hand side,
- 31:13we have precision down to 0.1.
- 31:16This is a unitless example.
- 31:19So we have higher precision, actually.
- 31:22Then we see that we have very good coverage,
- 31:24even if we don't take this locational
- 31:31coarsening into account.
- 31:33But as we increase the size of our error box,
- 31:38then we actually lose coverage,
- 31:41and we deviate from that 95% coverage.
- 31:44And then finally, if we increase too much,
- 31:46then we're never actually going to be
- 31:51capturing the true spatial length scale,
- 31:57whereas if we actually do sample the locations,
- 31:59we perform surprisingly well,
- 32:01even when we have a very high amount of spatial coarsening.
- 32:08Well, how else can we break the model?
- 32:11Another way that we can break this model,
- 32:13and by break the model, I mean, my naive model
- 32:16where I'm not inferring the locations.
- 32:18Another way that we can break this model
- 32:22is simply by considering data
- 32:24where we have variable spatial coarsening.
- 32:28That is where different data points
- 32:32are coarsened different amounts,
- 32:34so we have a variable precision.
- 32:40So considering the wildfire data,
- 32:43we actually see something with the naive approach
- 32:48where we're not inferring the locations.
- 32:51We actually see something that is actually recorded
- 32:56elsewhere in the Hawkes process literature.
- 33:00And that is that when we try to use a flexible
- 33:05background function, as we are trying to do,
- 33:07then we get this multimodal posterior distribution.
- 33:13And that's fine.
- 33:14We can also talk about it in a frequentist,
- 33:17from the frequentist standpoint,
- 33:19because it's observed there as well
- 33:21in the maximum likelihood context, which is,
- 33:25we still see this multimodality.
- 33:29What specific form does this multimodality take?
- 33:34So what we see is that we get modes around the places
- 33:40where the background rate parameters,
- 33:47the background length scale parameters are equal
- 33:50to the temporal, excuse me, the self excitatory
- 33:54length scale parameters.
- 33:56So for the naive model, in mode A,
- 34:00it believes that the spatial length scale
- 34:03is about 24 kilometers, and that the spatial length scale
- 34:07of the self excitatory dynamics
- 34:09are also roughly 24 kilometers.
- 34:14And then for the other mode,
- 34:16we get equal temporal length scales.
- 34:20So here, it believes 10 days, and 10 days
- 34:24for the self excitatory and the background components.
- 34:27And this can be very bad indeed.
- 34:29So for example, for mode A,
- 34:31it completely, the Hawkes model completely fails
- 34:36to capture seasonal dynamics, which is the first thing
- 34:40that you would want it to pick up on.
- 34:43The first thing that you would want it to understand
- 34:47is that wildfires...
- 34:49Okay, I need to be careful here
- 34:51because I'm not an expert on wildfires.
- 34:55I'll go out on a limb and say,
- 34:56wildfires don't happen in Alaska during the winter.
- 35:03On the other hand, when we use the full model
- 35:05and we're actually simultaneously inferring the locations,
- 35:08then we get this kind of Goldilocks effect,
- 35:11where here, the spatial length scale
- 35:14is somewhere around 35 kilometers,
- 35:17which is between the 23 kilometers and 63 kilometers
- 35:21for modes A and B, and we see that reliably.
- 35:33I can stop for some questions because I'm making good time.
- 35:44<v Man>Does anybody have any questions, if you want to ask?</v>
- 35:52<v Student>What's the interpretation</v>
- 35:53of the spatial length scale and the temporal length scale?
- 35:56What do those numbers actually mean?
- 35:59<v ->Yeah, thank you.</v>
- 36:02So, the interpretation of the...
- 36:06I think that the most useful interpretation,
- 36:11so just to give you an idea of how they can be interpreted.
- 36:15So for example, for the self excitatory component, right,
- 36:20that's describing the contagion dynamics.
- 36:23What this is saying is that if we see a wildfire,
- 36:29then we expect to observe another wildfire
- 36:34with a mean waiting time of one day.
- 36:41So the temporal length scale is in units of days.
- 36:46So in the full model, after observing the wildfire,
- 36:50we expect to see another wildfire with mean, you know,
- 36:54on average, the next day.
- 36:56And this of course, you know, we have this model
- 37:02that's taking space and time into account.
- 37:05So the idea though, is that because of the separability
- 37:10in our model, we're basically simply
- 37:12expecting to see it somewhere.
- 37:19<v Student>Thank you.</v>
- 37:24<v Man>Any other questions?</v>
- 37:26(man speaking indistinctly)
- 37:31<v Student>Hi, can I have one question?</v>
- 37:35<v ->Go ahead.</v>
- 37:36<v Student>Okay.</v>
- 37:38I'm curious.
- 37:38What is a main difference between
- 37:39the naive model A and the naive model B?
- 37:43<v ->Okay.</v>
- 37:44So, sorry.
- 37:45This is...
- 37:47I think I could have presented
- 37:49this aspect better within the table itself.
- 37:52So this is the same exact model.
- 37:58But all that I'm doing is I'm applying
- 38:01the model multiple times.
- 38:03So in this case, I'm using Markov chain Monte Carlo.
- 38:07So one question that you might ask is,
- 38:10well, what happens when I run MCMC multiple times?
- 38:16Sometimes I get trapped in one mode.
- 38:20Sometimes I get trapped in another mode.
- 38:22You can just for, you know, a mental cartoon,
- 38:25we can think of like a (indistinct)
- 38:27a mixture of Gaussian distribution, right.
- 38:30Sometimes I can get trapped in this Gaussian component.
- 38:34Sometimes I could get trapped in this Gaussian component.
- 38:38So there's nothing intrinsically wrong with multimodality.
- 38:44We prefer to avoid it as best we can simply because it makes
- 38:47interpretation much more difficult.
- 38:52In this case, if I only perform inference
- 38:56and only see mode A, then I'm never actually gonna be
- 39:00picking up on seasonal dynamics.
- 39:07Does that (indistinct)?
- 39:10<v Woman>Yeah, it's clear.</v>
- 39:12<v Instructor>Okay.</v>
- 39:13<v Woman>Okay, and I also (indistinct).</v>
- 39:16So for the full model, you can capture
- 39:18the spatial dynamic property.
- 39:21So how to do that?
- 39:23So I know for the Hawkes process you need to
- 39:25specify the baseline.
- 39:28So how do you estimate the baseline part?
- 39:32<v ->Oh, okay, great.</v>
- 39:35In the exact same way.
- 39:37<v Student>Okay, I see.</v>
- 39:39<v ->So I'm jointly, simultaneously performing inference</v>
- 39:45over all of the model parameters.
- 39:47And I can go all the way back.
- 39:53Right.
- 39:54'Cause it's actually a very similar model.
- 39:58Yes.
- 39:59So this is my baseline.
- 40:02And so, for example, when we're talking about that temporal
- 40:06smooth that you saw on that last figure,
- 40:09where I'm supposed to be capturing seasonal dynamics.
- 40:13Well, if tau T, which I'm just calling
- 40:18my temporal length scale, if that is too large,
- 40:22then I'm never going to be capturing
- 40:24those seasonal dynamics, which I would be hoping to capture
- 40:28precisely using this background smoother.
- 40:33<v Student>Okay, I see.</v>
- 40:34So it looks like you assume the formula for the baseline,
- 40:38and then you estimate some parameters in these formulas.
- 40:42<v ->Yes.</v>
- 40:43<v Student>In my understanding,</v>
- 40:44in the current Hawkes literature,
- 40:47somebody uses (indistinct) function
- 40:49to approximate baseline also.
- 40:52<v ->Yes.</v>
- 40:52<v Student>This is also interesting.</v>
- 40:54Thank you. <v ->Yes.</v>
- 40:55Okay, okay, great.
- 40:56I'm happy to show another, you know.
- 40:59And of course I did not invent this.
- 41:00This is just another tack that you can take.
- 41:03<v Student>Yeah, yeah, yeah, yeah.</v>
- 41:04That's interesting.
- 41:05Thanks
- 41:06<v ->Yup.</v>
- 41:10<v Student>As just a quick follow up on</v>
- 41:13when you were showing the naive model,
- 41:16and this may be a naive question on my part.
- 41:20Did you choose naive model A to be the one
- 41:24that does the type seasonality or is that approach
- 41:27just not (indistinct) seasonality?
- 41:33<v ->So I think that the point</v>
- 41:38is that sometimes based on, you know,
- 41:42I'm doing MCMC.
- 41:44It's random in nature, right.
- 41:46So just sometimes when I do that,
- 41:49I get trapped in that mode A,
- 41:53and sometimes I get trapped in that mode B.
- 42:00The label that I apply to it is just arbitrary,
- 42:04but maybe I'm not getting your question.
- 42:11<v Student>No, I think you did.</v>
- 42:14So, it's possible that we detect it.
- 42:17It's possible that we don't.
- 42:20<v ->Exactly.</v>
- 42:21And that's, you know,
- 42:22<v Student>That's what it is.</v>
- 42:23<v ->multimodality.</v>
- 42:25So this is kind of nice though,
- 42:27that this can actually give you,
- 42:30that actually inferring the locations can somehow,
- 42:35at least in this case, right,
- 42:37I mean, this is a case study, really,
- 42:40that this can help resolve that multimodality.
- 42:47<v Student>Thank you.</v>
- 42:48Yeah.
- 42:49<v Student>So back to the comparison between CPU and GPU.</v>
- 42:55Let's say, if we increase the thread of CPU,
- 43:00say like to infinity, will it be possible that the speed
- 43:06of CPU match the speed up of GPU?
- 43:12<v ->So.</v>
- 43:15You're saying if we increase.
- 43:17So, can I ask you one more time?
- 43:19Can I just ask for clarification?
- 43:21You're saying if we increase what to infinity?
- 43:25<v Student>The thread of CPU.</v>
- 43:28I think in the graph you're increasing the threads
- 43:32of CPU from like one to 80.
- 43:35And the speed up increases as the number
- 43:39of threads increases.
- 43:42So just say like, let's say the threads of CPU
- 43:45increase to infinity, will the speed up match,
- 43:51because GPU with like (indistinct).
- 43:54Very high, right. <v ->Yeah, yeah.</v>
- 43:57Let me show you another figure,
- 44:00and then we can return to that.
- 44:03I think it's a good segue into the next section.
- 44:07So, let me answer that in a couple slides.
- 44:10<v Student>Okay, sounds good.</v>
- 44:12<v ->Okay.</v>
- 44:13So, questions about.
- 44:15I've gotten some good questions about how do we interpret
- 44:18the length scales and then this makes me think about,
- 44:23well, if all that we're doing is interpreting
- 44:26the length scales, how much is that telling us about
- 44:29the phenomenon that we're interested in?
- 44:32And can we actually craft more complex hierarchical models
- 44:37so that we can actually learn something perhaps
- 44:41even biologically interpretable?
- 44:43So here, I'm looking at 2014 to 2016
- 44:47Ebola virus outbreak data.
- 44:50This is over almost 22,000 cases.
- 44:54And of these cases, we have about 1600
- 45:00that are providing us genome data.
- 45:08And then of those 1600, we have a smaller subset
- 45:12that provide us genome data, as well as spatiotemporal data.
- 45:20So often people use genome data, say RNA sequences in order
- 45:27to try to infer the way that different viral cases
- 45:29are related to each other.
- 45:31And the question is, can we pull together sequenced
- 45:34and unsequenced data at the same time?
- 45:39So what I'm doing here is, again,
- 45:42I'm not inventing this.
- 45:44This is something that already exists.
- 45:47So all that I'm doing is modifying my triggering function G,
- 45:52and giving it this little N,
- 45:54this little subscript right there,
- 45:57which is denoting the fact that I'm allowing different viral
- 46:01observations to contribute to the rate function
- 46:05in different manners.
- 46:07And the exact form that that's gonna take on
- 46:09for my specific simple model that I'm using,
- 46:12is I'm going to give it this theta n.
- 46:17And I'm gonna include this theta n parameter
- 46:20in my self excitatory component.
- 46:22And this theta n is restricted to be greater than zero.
- 46:28So if it is greater than one,
- 46:30I'm gonna assume that actually, this self excite,
- 46:34excuse me, that this particular observation,
- 46:37little n is somehow more contagious.
- 46:41And if theta n is less than one,
- 46:43then I'm going to assume that it's less contagious.
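In generic notation (not necessarily the talk's symbols), the modification being described amounts to an event-specific weight in the self-excitatory sum:

```latex
\lambda(t, \mathbf{x}) \;=\; \mu(t, \mathbf{x})
  \;+\; \sum_{n \,:\, t_n < t} \theta_n \, g\bigl(t - t_n,\; \mathbf{x} - \mathbf{x}_n\bigr),
  \qquad \theta_n > 0 .
```

So a theta n greater than one marks case n as exciting more follow-on events than average, and a theta n less than one marks it as exciting fewer.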
- 46:48And this is an entirely unsatisfactory part of my talk,
- 46:52where I'm gonna gloss over a massive part of my model.
- 46:58And all that I'm gonna say is that
- 47:02this Phylogenetic Hawkes process, which I'm gonna be telling
- 47:05you about in the context of big modeling,
- 47:09and that challenge is that we start
- 47:13with the phylogenetic tree, which is simply the family tree
- 47:16that is uniting my 1600 sequenced cases.
- 47:22And then based on that, actually conditioned on that tree,
- 47:25we're gonna allow that tree to inform the larger
- 47:28covariance of my model parameters, which are then going to
- 47:33contribute to the overall Hawkes rate function
- 47:37in a differential manner, although it's still additive.
- 47:45Now, let's see.
- 47:49Do I get to go till 10 or 9:50?
- 47:57<v Man>So you can go till 10.</v>
- 47:59<v ->Okay, great.</v>
- 48:00So then, I'll quickly say that if I'm inferring
- 48:06all of these rates, then I'm inferring over 1300 rates.
- 48:13So that is actually the dimensionality
- 48:15of my posterior distribution.
- 48:21So a tool that I can use,
- 48:23a classic tool over 50 years old at this point,
- 48:26that I can use, is I can use the random walk metropolis
- 48:29algorithm, which is actually going to sample
- 48:32from the posterior distribution of these rates.
- 48:36And it's gonna do so in a manner that is effective
- 48:40in low dimensions, but not effective in high dimensions.
- 48:46And the way that it works is say,
- 48:47we start at negative three, negative three.
- 48:49What we want to do is we want to explore this high density
- 48:52region of this bi-variate Gaussian,
- 48:55and we slowly amble forward, and eventually we get there.
- 49:03But this algorithm breaks down in moderate dimensions.
- 49:07So.
- 49:11An algorithm that I think many of us are aware of
- 49:14at this point, that is kind of a workhorse
- 49:16in high dimensional Bayesian inference
- 49:18is Hamiltonian Monte Carlo.
- 49:20And this works by using actual gradient information about
- 49:24our log posterior in order to intelligently guide
- 49:28the MCMC proposals that we're making.
- 49:32So, again, let's just pretend that we start
- 49:34at negative three, negative three,
- 49:36but within a small number of steps,
- 49:38we're actually effectively exploring
- 49:40that high density region, and we're doing so
- 49:45because we're using that gradient information
- 49:47of the log posterior.
- 49:51I'm not going to go too deep right now into the formulation
- 49:56of Hamiltonian Monte Carlo, for the sake of time.
- 50:00But what I would like to point out,
- 50:04is that after constructing this kind of physical system
- 50:13that is based on our target distribution
- 50:20on the posterior distribution, in some manner,
- 50:24we actually obtain our proposals within the MCMC.
- 50:30We obtain the proposals by simulating, by forward simulating
- 50:35the physical system, according to Hamilton's equations.
- 50:40Now,
- 50:43what this simulation involves is a massive number
- 50:48of repeated gradient evaluations.
- 50:53Moreover, if the posterior distribution is an ugly one,
- 51:00that is if it is ill-conditioned, which we interpret as,
- 51:06the log posterior Hessian has eigenvalues
- 51:09that are all over the place.
- 51:12Then we can also use a mass matrix, M, which is gonna allow
- 51:17us to condition our dynamics, and make sure that we are
- 51:24exploring all the dimensions of our model in an even manner.
- 51:29So the benefit of Hamiltonian Monte-Carlo is that it scales
- 51:32to tens of thousands of parameters.
- 51:34But the challenge is that HMC necessitates repeated
- 51:38computation of the log likelihood,
- 51:42its gradient, and then preconditioning.
- 51:46And the best way that I know to precondition actually
- 51:49involves evaluating the log likelihood Hessian as well.
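For readers who want the mechanics, here is a textbook-style sketch of one HMC proposal with leapfrog integration and a diagonal mass matrix; `log_post` and `log_post_grad` are user-supplied placeholders, the tuning constants are arbitrary, and this is not the speaker's sampler:

```python
import numpy as np

def hmc_step(theta, log_post, log_post_grad,
             step_size=0.01, n_leapfrog=20, mass_diag=None, rng=None):
    """One Hamiltonian Monte Carlo proposal with leapfrog integration.

    mass_diag acts as a diagonal mass matrix that preconditions poorly
    scaled posteriors. Note that every leapfrog step needs a fresh gradient,
    which is what makes the quadratic-cost gradient so painful here.
    """
    rng = rng or np.random.default_rng()
    m = np.ones_like(theta) if mass_diag is None else mass_diag
    p = rng.normal(scale=np.sqrt(m))                       # draw momentum
    theta_new, p_new = theta.copy(), p.copy()
    p_new = p_new + 0.5 * step_size * log_post_grad(theta_new)
    for _ in range(n_leapfrog - 1):
        theta_new = theta_new + step_size * p_new / m
        p_new = p_new + step_size * log_post_grad(theta_new)
    theta_new = theta_new + step_size * p_new / m
    p_new = p_new + 0.5 * step_size * log_post_grad(theta_new)
    # Metropolis correction on the Hamiltonian (potential + kinetic energy)
    h_current = -log_post(theta) + 0.5 * np.sum(p ** 2 / m)
    h_proposed = -log_post(theta_new) + 0.5 * np.sum(p_new ** 2 / m)
    return theta_new if np.log(rng.uniform()) < h_current - h_proposed else theta
```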
- 51:55And I told you that the challenges that I'm talking about
- 51:57today are intertwined.
- 51:58So what does this look like in a big data setting?
- 52:02Well, we've already managed to speed up the log likelihood
- 52:06computations that are quadratic in computational complexity.
- 52:11Well, it turns out that the log likelihood gradient
- 52:14and the log likelihood Hessian
- 52:17are also quadratic in computational complexity.
- 52:21So this means that as the size of our data set grows,
- 52:24we're going to...
- 52:27HMC, which is good at scaling to high dimensional models
- 52:31is going to break down because it's just gonna take too long
- 52:35to evaluate the quantities that we need to evaluate.
- 52:43To show you exactly how these parallel
- 52:45gradient calculations can work.
- 52:51So, what am I gonna do?
- 52:53I'm gonna parallelize again on a GPU
- 52:55or a multi-core CPU implementation,
- 53:00and I'm interested in evaluating or obtaining
- 53:04the quantities in the red box.
- 53:06These are simply the gradient of the log likelihood
- 53:09with respect to the individual rate parameters.
- 53:13And because of the summation that it involves,
- 53:17we actually obtain in the left, top left,
- 53:21we have the contribution of the first observation
- 53:25to that gradient term.
- 53:28Then we have the contribution of the second observation
- 53:31all the way up to the big Nth observation,
- 53:35that contribution to the gradient term.
- 53:37And these all need to be evaluated and summed over.
- 53:41So what do we do?
- 53:42We just do a running total, very simple.
- 53:45We start by getting the first contribution.
- 53:49We keep that stored in place.
- 53:53We evaluate the second contribution,
- 53:56all at the same time in parallel,
- 53:57and we simply increment our total observat-
- 54:01excuse me, our total gradient by that value.
- 54:05Very simple.
- 54:06We do this again and again.
- 54:08Kind of complicated to program, to be honest.
- 54:11But it's simple.
- 54:16It's simple when you think about it from the high level.
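The accumulation pattern itself is easy to write down; in this sketch the loop over events is serial, and `contribution_fn` is a hypothetical placeholder for the per-event gradient contribution that the CPU or GPU threads would compute in parallel in the talk's setting:

```python
import numpy as np

def accumulate_gradient(contribution_fn, n_events, dim):
    """Running-total accumulation of per-event gradient contributions.

    contribution_fn(n) is an assumed placeholder returning event n's
    contribution to the gradient of the log likelihood with respect to all
    `dim` rate parameters. The point here is only the increment-in-place
    pattern, not a parallel implementation.
    """
    grad = np.zeros(dim)                    # running total, kept in place
    for n in range(n_events):
        grad += contribution_fn(n)          # add event n's contribution
    return grad
```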
- 54:19So I showed you this figure before.
- 54:21And well, a similar figure before,
- 54:24and the interpretations are the same,
- 54:26but here I'll just focus on the question that I received.
- 54:30In the top left, we have the gradient.
- 54:32In the bottom left, excuse me,
- 54:34top row, we have the gradient.
- 54:35Bottom row, we have the Hessian,
- 54:37and here I'm increasing to 104 cores.
- 54:42So this is not infinite cores, right.
- 54:46It's 104.
- 54:47But I do want you to see that there's diminishing returns.
- 54:54And to give a little bit more technical
- 54:57response to that question,
- 55:02the thing to bear in mind is that
- 55:04it's not just about the number of threads that we use.
- 55:08It's having a lot of RAM very close
- 55:12to where the computing is being done.
- 55:15And that is something that GPUs,
- 55:18modern gigantic GPUs do very well.
- 55:26So why is it important to do all this parallelization?
- 55:28Well, this is really, I want to kind of communicate
- 55:32this fact because it is so important.
- 55:36This slide underlines almost the entire challenge
- 55:40of big modeling using the spatiotemporal Hawkes process.
- 55:44The computing to apply this model to the 20,000 plus
- 55:49data points took about a month
- 55:54using a very large Nvidia GV100 GPU.
- 56:00Why?
- 56:01Because we had to generate 100 million Markov chain states
- 56:04at a rate of roughly three and a half million each day.
- 56:11After 100 million Markov chain states,
- 56:15after generating 100 million Markov chain states,
- 56:20this is the empirical distribution on the left
- 56:23of the effective sample sizes across,
- 56:28across all of the individual rates that we're inferring,
- 56:31actually all the model parameters.
- 56:34The minimum is 222, and that's right above my typical
- 56:39threshold of 200, because in general, we want the effective
- 56:43sample size to be as large as possible.
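For reference, effective sample size can be estimated from the lag-autocorrelations of a chain; this crude sketch only illustrates the diagnostic being quoted and is not a substitute for a proper MCMC package:

```python
import numpy as np

def effective_sample_size(chain):
    """Crude effective sample size for one scalar MCMC chain.

    Uses the initial positive sequence of lag-autocorrelations, truncating
    at the first non-positive lag; illustrative only.
    """
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    n = len(x)
    acf = np.correlate(x, x, mode="full")[n - 1:] / (np.arange(n, 0, -1) * x.var())
    tau = 1.0                                  # integrated autocorrelation time
    for k in range(1, n):
        if acf[k] <= 0:
            break
        tau += 2.0 * acf[k]
    return n / tau                             # compare against a threshold like 200
```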
- 56:48Well, why was it so difficult?
- 56:50Well, a lot of the posterior,
- 56:53a lot of the marginal posteriors
- 56:55for our different parameters were very complex.
- 57:01So for example, here, I just have one individual rate,
- 57:05and this is the posterior that we learned from it.
- 57:08It's bi-modal.
- 57:10And not only is it bi-modal,
- 57:11but the modes exist on very different scales.
- 57:16Well, why else is it a difficult posterior to sample from?
- 57:19Well, because actually, as you might imagine,
- 57:22these rates have a very complex correlation structure.
- 57:28This is kind of repeating something that I said earlier
- 57:30when we were actually inferring locations,
- 57:33which is that what this amounts to is really simulating
- 57:36a very large n-body problem.
- 57:44But what's the upshot?
- 57:45Well, we can actually capture these individual rates,
- 57:51which could give us hints at where to look for certain
- 57:55mutations that are allowing, say in this example,
- 58:01the Ebola virus to spread more effectively.
- 58:05And here, red is generally the highest,
- 58:09whereas blue is the lowest.
- 58:13We can get credible intervals,
- 58:15which can give us another way of thinking about, you know,
- 58:18where should I be looking
- 58:22in this collection of viral samples, for the next big one?
- 58:29And then I can also ask, well, how do these rates actually
- 58:32distribute along the phylogenetic tree?
- 58:37So I can look for clades or groups of branches
- 58:41that are in general, more red in this case than others.
- 58:53So, something that I...
- 58:55Okay, so it's 10 o'clock, and I will finish in one slide.
- 59:03The challenges that I'm talking about today,
- 59:05they're complex and they're intertwined,
- 59:08but they're not the only challenges.
- 59:10There are many challenges in the application
- 59:14of spatiotemporal Hawkes models,
- 59:16and there's actually a very large literature.
- 59:21So some other challenges that we might consider,
- 59:25and that will also be extremely challenging to overcome
- 59:31in a big data setting.
- 59:32So, kind of the first challenge is flexible modeling.
- 59:38So here, we want to use as flexible
- 59:41of a Hawkes model as possible.
- 59:44And this challenge kind of encapsulates one of the great
- 59:49ironies of model-based nonparametrics, which is that,
- 59:55the precise time that we actually want to use
- 59:58a flexible model, is the big data setting.
- 01:00:03I mean, I don't know if you recall my earlier slide
- 01:00:07where I was showing the posterior distribution
- 01:00:10of some of the length scales associated with
- 01:00:13the Washington DC data, and they're extremely tight.
- 01:00:19But this is actually exactly where we'd want to be able
- 01:00:24to use a flexible model, because no matter what,
- 01:00:28if I apply my model to 85,000 data points,
- 01:00:32I'm going to be very certain in my conclusion,
- 01:00:36conditioned on the specific model that I'm using.
- 01:00:41There are also boundary issues, right.
- 01:00:43This is a huge, a huge thing.
- 01:00:45So for those of you that are aware
- 01:00:47of the survival literature, which I'm sure many of you are,
- 01:00:52you know, there's censoring.
- 01:00:54So what about gunshots that occurred right outside
- 01:00:57of the border of Washington DC, and it occurred as a result
- 01:01:01of gunshots that occurred within the border?
- 01:01:03And then we can flip that on its head.
- 01:01:05What about parent events outside of Washington DC
- 01:01:10that precipitated gun violence within Washington DC.
- 01:01:13And then finally, sticking with the same example,
- 01:01:16differential sampling.
- 01:01:20You can be certain that those acoustic gunshot locator
- 01:01:27system sensors are not planted
- 01:01:30all over Washington DC.
- 01:01:34And how does their distribution affect things?
- 01:01:41Okay.
- 01:01:42This is joint work with Mark Suchard, also at UCLA.
- 01:01:45And then my very good friend,
- 01:01:47my very dear friend, Xiang Ji at Tulane.
- 01:01:50It's funded by the K-Award Big Data Predictive Phylogenetics
- 01:01:54with Bayesian learning, funded by the NIH.
- 01:01:58And that's it.
- 01:01:59Thank you.
- 01:02:06<v Man>All right.</v>
- 01:02:07Thank you so much, Professor Holbrook.
- 01:02:08Does anybody have any other questions?
- 01:02:11(people speaking indistinctly)
- 01:02:18Yeah.
- 01:02:21Any other questions from the room here, or from Zoom?
- 01:02:25(people speaking indistinctly)