
YSPH Biostatistics Seminar: “A Model-free Variable Screening Method Based on Leverage Score”

September 30, 2022
  • 00:01<v Robert>Hey everybody, I've got noon,</v>
  • 00:05so let's get started.
  • 00:06So today I'm pleased to introduce Professor Yiwen Liu.
  • 00:11Professor Liu earned her BS and MS in Statistics
  • 00:13from the Central University of Finance and Economics
  • 00:16in China and her PhD in Statistics
  • 00:19from the University of Georgia.
  • 00:21Today, she's an Assistant Professor of Practice
  • 00:23in the Department of Epidemiology and Biostatistics
  • 00:26at the Mel and Enid Zuckerman
  • 00:29College of Public Health at the University of Arizona.
  • 00:32Her research primarily focuses on developing
  • 00:35statistical methods and theory
  • 00:37to address a variety of issues in analyzing
  • 00:39high-dimensional data and complex data sets.
  • 00:43More specifically, her research interests
  • 00:46include developing model-free dimension reduction methods
  • 00:48for high-dimensional data regression
  • 00:50and integration methods for multiple-source data.
  • 00:53Today, she's gonna talk to us
  • 00:54about a model-free variable screening method
  • 00:57based on leverage score.
  • 00:59Let's welcome Professor Liu.
  • 01:04<v Professor Liu>Thank you, Robert</v>
  • 01:05for your nice introduction and it's my great honor
  • 01:08to be invited and present my work here.
  • 01:12So in today's talk, I will introduce a model-free
  • 01:16variable screening method based on leverage score,
  • 01:19and we named the method as the weighted leverage score.
  • 01:22So as we know, this is a joint work
  • 01:24with Dr. Wenxuan Zhong from the University of Georgia
  • 01:27and Dr. Peng Zeng from Auburn University.
  • 01:32So as we know, as we've heard, in this big data era,
  • 01:36there are numerous data produced in almost every field
  • 01:40of science, including biology.
  • 01:42So we are facing data of extremely high dimensionality
  • 01:48and also data with really complex structures.
  • 01:56Thank you.
  • 02:04And how to effectively extract information
  • 02:07from such large and complex data
  • 02:10poses new statistical challenges.
  • 02:12So to motivate my research, let us see an example first.
  • 02:18So currently cancer has become
  • 02:20one of the primary causes of death across the world.
  • 02:24Nowadays cancer is diagnosed by an expert
  • 02:28who has to look at the tissue samples under the microscope.
  • 02:31You can imagine that there are millions
  • 02:33of new cancer cases each year
  • 02:36and this often means that those doctors
  • 02:39will find themselves looking at hundreds of images each day.
  • 02:44And this is really tedious work.
  • 02:48And you may find that,
  • 02:52because of the shortage of qualified doctors,
  • 02:55there could be a huge lag time
  • 02:57before those doctors can even figure out
  • 02:59what is going on with the patient.
  • 03:02So detecting cancer using only manpower,
  • 03:05looking at images, is not enough.
  • 03:08And we intend to build a statistical
  • 03:12and mathematical model to identify, detect cancer
  • 03:16in a more accurate, less expensive way.
  • 03:22Okay, so second-generation sequencing
  • 03:25makes this possible and promising.
  • 03:29And so a typical research question,
  • 03:33a critical inference, is to find the markers
  • 03:36that are related to cancer.
  • 03:38Right now there's a new sequencing technology
  • 03:39called spatial transcriptomics.
  • 03:42You know that for bulk RNA sequencing data,
  • 03:46it just sequences the whole tissue
  • 03:47and generates averaged gene expression data.
  • 03:51But with this new technology called spatial transcriptomics,
  • 03:55this kind of cancer tissue will be sliced
  • 03:57into several thin sections.
  • 04:01And within each section, the grid points in the section
  • 04:06will be sequenced simultaneously.
  • 04:08So you can see that here we have two areas
  • 04:11of invasive cancer, okay,
  • 04:14all the grid points within these two areas
  • 04:17will be invasive cancer regions.
  • 04:21The other six areas, they are noninvasive cancer areas.
  • 04:25The grid points in these locations
  • 04:29will be noninvasive cancer regions,
  • 04:32and the other parts are normal tissue, okay?
  • 04:35And the data that we will have,
  • 04:37because this new technology
  • 04:40will sequence the whole tissue,
  • 04:43all those grid points, simultaneously,
  • 04:45is a data matrix in which
  • 04:47each row corresponds to a location
  • 04:50within the section and
  • 04:53each column corresponds to the expression
  • 04:56of a certain gene.
  • 04:59And in a data matrix like this,
  • 05:01the Y's are the labels for those locations,
  • 05:05normal, noninvasive, or invasive.
  • 05:07And we will get gene expressions for all those P genes
  • 05:12for each location, okay?
  • 05:15So this is the data that we have
  • 05:17and our goal then comes to identify marker genes
  • 05:23for those noninvasive and invasive cancer areas.
  • 05:27As shown in this figure, these are the tissue sections.
  • 05:33There are colored dots here,
  • 05:37with the color of the dots showing the expression levels.
  • 05:42Okay?
  • 05:44So the dots with a yellow color
  • 05:47show a higher expression.
  • 05:50We intend to build models
  • 05:52to identify such genes.
  • 05:55These genes show remarkable
  • 05:58differential expression levels across tissue sections, okay?
  • 06:03These two genes have higher expression
  • 06:06in invasive cancer areas.
  • 06:10Okay, we intend to build a statistical model
  • 06:13to identify such genes
  • 06:14but there exist several challenges here.
  • 06:17Usually in the data that we have,
  • 06:20the number of samples, the locations here,
  • 06:22our labeled data, is only around hundreds,
  • 06:25but the number of genes could be tens of thousands.
  • 06:30This is the so-called large-p-small-n problem.
  • 06:35Usually for any traditional methods,
  • 06:38there's no way to utilize those traditional methods
  • 06:41to solve this problem.
  • 06:44Not to mention that there is a further layer
  • 06:46of complication between the gene expression levels
  • 06:49and the cancer or normal types, okay?
  • 06:53Usually how the gene expression levels
  • 06:56would influence, could affect different types of cancer,
  • 07:00this mechanism is largely unknown
  • 07:02and the association between them is beyond linear.
  • 07:06So these are the two challenges.
  • 07:09That means that what we need
  • 07:12is a statistical method
  • 07:15that can do variable screening
  • 07:19in a more general model setup.
  • 07:26So, to achieve this goal,
  • 07:29we choose to build our method
  • 07:32under this so-called general index model.
  • 07:35A general index model describes a scenario in which Y-i,
  • 07:40which is the response,
  • 07:41is related to K linear combinations of X-i,
  • 07:47that is, beta one transpose X-i
  • 07:49to beta K transpose X-i,
  • 07:51through some unknown function F.
  • 07:55So this is the general index model
  • 07:57and we know that here, X-i is a P-dimensional vector,
  • 08:03and if K is a value that is much smaller than P,
  • 08:09then we actually achieve the goal of dimension reduction
  • 08:13because the original P-dimensional vector
  • 08:16is projected onto a K-dimensional space,
  • 08:22spanned by beta one transpose X-i
  • 08:24to beta K transpose X-i.
  • 08:27And we choose this general index model
  • 08:30because it actually is a very general model framework.
  • 08:34If we map this general index model to our problem here,
  • 08:38the Y-i could be the label for location i.
  • 08:43And, for example, the non-invasive location.
  • 08:48And then X-i is a P-dimensional vector
  • 08:50and could be the gene expression levels
  • 08:52for location i of those P genes,
  • 08:57and then beta one transpose X-i
  • 08:58to beta K transpose X-i
  • 09:01could be those K groups
  • 09:05of coregulated genes.
  • 09:07And those K groups of coregulated genes
  • 09:10will affect the response through some unknown function F.
  • 09:16Okay so this is our general model setup.
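To fix notation, here is a compact way of writing the general index model just described (a sketch in standard notation; the placement of the error term inside F follows the general formulation that covers the linear, nonparametric, and single index special cases discussed next):

```latex
Y_i \;=\; f\!\left(\beta_1^{\top} X_i,\; \ldots,\; \beta_K^{\top} X_i,\; \varepsilon_i\right),
\qquad X_i \in \mathbb{R}^{p}, \quad K \ll p .
```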
  • 09:24We utilize the general index model
  • 09:26because it's a general model framework
  • 09:29that encompasses many different model types.
  • 09:33There are three special cases;
  • 09:35for example, the linear model is one special case.
  • 09:40Here, that is when K equals one
  • 09:45and the unknown function F takes an identity form.
  • 09:50Okay so this is the linear model
  • 09:52and the error term is additive.
  • 09:55So the linear model is one special case of it.
  • 09:58The nonparametric model is another special case
  • 10:01of the general index model.
  • 10:03That is when K equals P and beta one to beta P
  • 10:07form an identity matrix.
  • 10:12Thank you.
  • 10:14And then the third one,
  • 10:15the single index model
  • 10:16is another special case for the general index model
  • 10:21that is when K equals one
  • 10:22and the error term is additive.
  • 10:26So the reason that I show these three special cases
  • 10:28is just to let everyone know that the general index model
  • 10:32is a very general model framework.
  • 10:36In this case, using this model framework
  • 10:39to do variable screening or variable selection,
  • 10:50whether we should screen out
  • 10:51or retain certain variables
  • 10:54is determined by the coefficients here.
  • 10:57Say, for a specific variable,
  • 11:02if its coefficients across those K
  • 11:04different directions are all zero,
  • 11:08then we say this variable is redundant.
  • 11:13Okay so this is how we utilize the model
  • 11:16and the estimated coefficients
  • 11:18to do variable screening or, say, variable selection.
  • 11:27So the question becomes how we can estimate beta
  • 11:30under this model framework, right?
  • 11:32Just like estimating beta
  • 11:34in a simple linear regression model.
  • 11:37So let's see a simple case.
  • 11:39That is when the function F is invertible.
  • 11:44So that is to say,
  • 11:45the model becomes F inverse of Y-i
  • 11:49equal to beta transpose X-i plus epsilon-i.
  • 11:53Okay and this looks like a familiar model, right?
  • 11:57And if we want to estimate beta,
  • 11:59we can just simply maximize the correlation
  • 12:03between F inverse of Y-i and beta transpose X-i.
  • 12:07Using this optimization problem we can recover beta, okay,
  • 12:13given that F is invertible and F function is known.
  • 12:19But we know in real cases, the function F is unknown
  • 12:24and sometimes it is not invertible.
  • 12:26And then what can we do to estimate beta
  • 12:31when F function is unknown?
  • 12:34So when the function F is unknown,
  • 12:37we can consider all the transformations of Y-i
  • 12:40and we can solve for beta
  • 12:41through the following optimization problem.
  • 12:45We consider all transformations of Y-i
  • 12:51and we define R square of eta,
  • 12:54which is a function of eta, as the maximized correlation
  • 12:59between a transformation of Y-i and eta transpose X-i.
  • 13:05And this maximization is taken over
  • 13:07all such transformations, okay?
  • 13:13So using this function, beta basically is the solution
  • 13:17for this maximization problem
  • 13:20and with certain conditions satisfied,
  • 13:23we can simplify this objective function
  • 13:27with respect to eta; we can transform
  • 13:34R square of eta into this really nice quadratic form.
  • 13:38Okay, in the numerator, it's eta transpose times, okay,
  • 13:43this conditional variance matrix times eta,
  • 13:47and in the denominator, it's eta transpose times
  • 13:50the variance of X-i times eta.
  • 13:52This is a very nice projected form.
  • 13:56Basically, the solution of this, the solution beta,
  • 14:02is just the eigenvectors, okay,
  • 14:06corresponding to the K largest eigenvalues
  • 14:09of this matrix in the middle.
  • 14:13That's how we solve beta in this case.
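Written out, the objective just described takes the following form, assuming the standard maximal-correlation identity (the transformation symbol T and the regularity conditions are my notation and are not spelled out in the talk):

```latex
R^{2}(\eta)
\;=\; \max_{T}\; \operatorname{corr}^{2}\!\bigl\{\, T(Y_i),\; \eta^{\top} X_i \,\bigr\}
\;=\; \frac{\eta^{\top}\, \mathrm{Var}\!\bigl\{\mathbb{E}\!\left(X_i \mid Y_i\right)\bigr\}\, \eta}
           {\eta^{\top}\, \mathrm{Var}\!\left(X_i\right)\, \eta},
```

so that beta one through beta K are recovered as the eigenvectors associated with the K largest eigenvalues of Var(X_i)^{-1} Var{E(X_i | Y_i)}.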
  • 14:16But as we know, as I mentioned,
  • 14:18there are really big challenges here.
  • 14:21One is that we have a really large P here:
  • 14:24Sigma X is a P by P matrix,
  • 14:27and Sigma X given Y is also a P by P matrix.
  • 14:31And we know that we are dealing with a case
  • 14:34where P is larger than N; in this scenario,
  • 14:38it would be really difficult
  • 14:40to generate a consistent estimate
  • 14:45when P is larger than N.
  • 14:48And we also have this inverse here; for a very large matrix,
  • 14:53it would be really time consuming
  • 14:55to produce the inverse of the matrix,
  • 14:58let alone that this matrix is not a consistent estimate, okay?
  • 15:02So this matrix in the middle,
  • 15:04if we want to estimate that
  • 15:06in this large-P scenario,
  • 15:08it would be really problematic, right?
  • 15:13And in the following, I will show how we're gonna use
  • 15:18the weighted leverage score, the method that we proposed,
  • 15:21to bypass the estimation of these two matrices
  • 15:25and then perform the variable selection
  • 15:28and dimension reduction under the general index model.
  • 15:35So we call our method the weighted leverage score.
  • 15:39Let us first take a look at what is leverage score
  • 15:43and what is weighted leverage score, okay?
  • 15:46So let's consider a simple case,
  • 15:49that is, the linear regression model,
  • 15:51and write the rank-D singular value decomposition of X
  • 15:55as X equal to U Lambda V-transpose.
  • 15:59This is the singular value decomposition of X.
  • 16:03And then, so in statistics,
  • 16:05the leverage score basically is defined as
  • 16:08the diagonal element of the hat matrix.
  • 16:14And then the hat matrix,
  • 16:15if we use the singular value decomposition,
  • 16:18can be simplified to U U-transpose,
  • 16:22which means that the diagonal element
  • 16:25of the hat matrix basically is the row norm
  • 16:29of the U matrix, okay?
  • 16:34And then actually this leverage score
  • 16:37has a very good interpretation.
  • 16:40It is the partial derivative of Y-i hat with respect to Y-i.
  • 16:45Okay, which means that if the leverage score
  • 16:48is larger and closer to one,
  • 16:51the observation would be more influential in predicting Y-i hat.
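A minimal numerical sketch of this definition (illustrative only, not the speaker's code): from the thin SVD X = U diag(s) V^T, the hat matrix equals U U^T, so the i-th leverage score is the squared row norm of U.

```python
import numpy as np

def leverage_scores(X):
    # Thin SVD: X = U diag(s) V^T, with U an n x r matrix of orthonormal columns.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Diagonal of the hat matrix H = U U^T: h_ii = ||U_i||^2.
    return np.sum(U**2, axis=1)

# Quick check: scores lie in [0, 1] and sum to the rank of X.
X = np.random.randn(100, 5)
h = leverage_scores(X)
print(h.min(), h.max(), h.sum())   # the sum is approximately 5
```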
  • 16:57So there is recent work of Dr. Ping Ma,
  • 17:01who is using this leverage score to do sub-sampling
  • 17:05in big data.
  • 17:07As you can see, again,
  • 17:09the message here is that if the U-i norm is larger,
  • 17:13if the leverage score is larger,
  • 17:14then we say this point is more influential.
  • 17:19The motivating example is like this:
  • 17:21in the first figure, these black dots,
  • 17:25they are the original data
  • 17:29and the solid black line is the actual model.
  • 17:35And if we want to do a linear regression,
  • 17:37usually, sometimes with really big data,
  • 17:39when the data is really large,
  • 17:41it is hardly possible to utilize all the data points
  • 17:44to generate the line here.
  • 17:49So a typical strategy is just to do sub-sampling
  • 17:54from such big data
  • 17:56and then fit a linear regression model.
  • 17:59Right now, you can see that a random sub-sample
  • 18:02from the population,
  • 18:08those data represented by the green crosses,
  • 18:12will generate a linear regression line
  • 18:17that largely deviates from the true model.
  • 18:21So that is where random sampling does not work
  • 18:27in this case.
  • 18:28However, if we do sub-sampling
  • 18:33according to the leverage scores, okay,
  • 18:36you will see, in the second graph,
  • 18:38that these red crosses are the data sub-sampled
  • 18:44utilizing the so-called leverage score,
  • 18:48and the red dashed line is the model
  • 18:52fitted on those sub-samples, okay?
  • 18:55So we can see that using the leverage score
  • 18:59to sub-sample can help us to generate a line
  • 19:03that can approximate the true model very well.
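A hedged sketch of the subsampling idea in those two figures, in the spirit of the leveraging work mentioned above but not the exact algorithm from it: draw rows with probability proportional to their leverage scores, then fit weighted least squares on the subsample.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 200                          # full-data size and subsample size
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
h = np.sum(U**2, axis=1)                    # leverage scores of the rows
prob = h / h.sum()                          # sampling probabilities proportional to leverage

idx = rng.choice(n, size=m, replace=True, p=prob)
w = 1.0 / (m * prob[idx])                   # inverse-probability weights
beta_hat, *_ = np.linalg.lstsq(X[idx] * np.sqrt(w)[:, None],
                               y[idx] * np.sqrt(w), rcond=None)
print(beta_hat)                             # close to the coefficients (1, 2) of the true line
```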
  • 19:09So what I want to say using these graphs
  • 19:10is that the leverage score, the U-i norm,
  • 19:14can be used, say, as an indicator
  • 19:19of how influential the data point is to the prediction.
  • 19:25Okay so U-i is the row norm of the left singular matrix,
  • 19:32but we are talking about variable selection.
  • 19:34So the U-i norm can be used to select the rows.
  • 19:38Intuitively, to select the columns of X,
  • 19:43we can just take the transpose of X.
  • 19:45So X transpose equal to V Lambda U transpose.
  • 19:49To select the columns of X
  • 19:52basically is to select the rows of X transpose.
  • 19:55Intuitively we can just use the row norm of the V matrix,
  • 19:59which is the right singular matrix, to do the selection,
  • 20:03to select the influential columns of X, okay, right?
  • 20:08And then, so we call
  • 20:13the U the left singular matrix,
  • 20:16V the right singular matrix.
  • 20:18And we call the row norm of U the left leverage score,
  • 20:24the row norm of V the right leverage score.
  • 20:30So what I want to say, using the previous two slides,
  • 20:34is that basically, the row information,
  • 20:36intuitively, the row information is contained in the U matrix,
  • 20:40the column information of X
  • 20:41is contained in the V matrix,
  • 20:44and we know that there is a further complication
  • 20:47between X and Y, which is the unknown link function F.
  • 20:50So how do we utilize the information from the columns,
  • 20:54from the rows, and also the unknown function
  • 20:57to generate a screening method
  • 21:01that can help us do variable selection,
  • 21:07that is, to select influential columns of X?
  • 21:13Okay so, let us get back to the matrix
  • 21:16we derived in the previous slides.
  • 21:18We have the conditional variance in the numerator
  • 21:22and we have
  • 21:25the variance of X in the denominator.
  • 21:27And with simple statistics,
  • 21:29this can be simplified to the variance of the expectation
  • 21:33of Z given Y, where Z is the standardized X.
  • 21:38Okay, further, if we use the singular value decomposition,
  • 21:41Z can be simplified to U V transpose,
  • 21:46and then its ij element basically
  • 21:48is the inner product of U-i and V-j,
  • 21:52so it basically contains both row information
  • 21:56and column information.
  • 21:58And then we proposed the weighted leverage score,
  • 22:01which is defined in this equation.
  • 22:06And the interpretation of W-j,
  • 22:09which is the weighted leverage score for the j-th predictor,
  • 22:12is threefold.
  • 22:14So first of all you can see it contains
  • 22:16both the column information and the row information of X.
  • 22:21And second,
  • 22:27you can see the matrix in the middle basically contains
  • 22:30the information from the unknown function F,
  • 22:33because we have the conditional expectation here,
  • 22:36the expectation of U-i given Y;
  • 22:38basically it's a kind of a reflection
  • 22:41of the unknown function F.
  • 22:46And third, this method is built
  • 22:49under the general index model
  • 22:51and it is model-free,
  • 22:53in the sense that the general index model
  • 22:55encompasses many different model types.
  • 22:59Okay so this is kind of a population version
  • 23:04of the weighted leverage score.
  • 23:06In terms of estimation,
  • 23:07you will see we only need to estimate
  • 23:09the matrix in the middle,
  • 23:11which is the variance of the expectation of Ui given Y.
  • 23:17To estimate this matrix,
  • 23:20we can see that U-i is actually D-dimensional,
  • 23:24because this is a rank-D singular value decomposition.
  • 23:27U-i is a D-dimensional vector.
  • 23:29Y is only one dimensional.
  • 23:31This is a function of a one-dimensional variable.
  • 23:33So it can be easily approximated by dividing, okay,
  • 23:38the range of Y into H slices,
  • 23:43and within each slice, okay,
  • 23:46within each slice we calculate the slice mean
  • 23:49over the rows of U.
  • 23:54As illustrated in this graph,
  • 23:55we can first slice Y into H slices
  • 23:59and then within each slice, for those Y-i
  • 24:02that fall into the same slice,
  • 24:03we find out their corresponding U rows
  • 24:06and then calculate their mean.
  • 24:14And in that way, we can simply
  • 24:16estimate the expectation of U-i given Y.
  • 24:21And then further we can estimate the variance
  • 24:23of those averages,
  • 24:27basically just taking the variance of U one bar
  • 24:30to U H bar.
  • 24:37So this is the way we estimate the variance
  • 24:40of the expectation of U-i given Y.
  • 24:45Okay so in that way we actually generate our estimated
  • 24:48weighted leverage score,
  • 24:50which we define as the right leverage score
  • 24:53weighted by the matrix in the middle.
  • 24:58Okay so first of all, this weighted leverage score
  • 25:02is built, say, upon the general index model;
  • 25:08it is considered as model-free
  • 25:10because Y-i, the response, is connected with
  • 25:14linear combinations of X through the unknown function F.
  • 25:19And this model is general enough to encompass
  • 25:21many different model types.
  • 25:23So we can consider it as model-free.
  • 25:26And second, to generate this weighted leverage score,
  • 25:31there is no need to estimate the covariance matrix
  • 25:36and there is no need to estimate the unknown function F.
  • 25:40So we can bypass all those procedures
  • 25:44to calculate this weighted leverage score.
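A minimal sketch of the estimator just described, assuming the weighted leverage score of column j takes the quadratic form v_j^T M v_j, where v_j is the j-th row of the right singular matrix V and M is the sliced estimate of Var{E(U | Y)}; the paper's exact definition (for instance how the singular values or the standardization of X enter) may differ.

```python
import numpy as np

def weighted_leverage_scores(X, y, d, n_slices=10):
    # Column-wise standardization as a stand-in for the talk's standardized Z;
    # the paper may use a different (e.g. whitening) transformation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    U, V = U[:, :d], Vt[:d, :].T            # rank-d truncation: U is n x d, V is p x d

    # Slice the response into n_slices groups (for class labels, slice by class
    # instead) and average the rows of U within each slice: an estimate of E(U | Y).
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    slice_means = np.vstack([U[idx].mean(axis=0) for idx in slices])

    M = np.cov(slice_means, rowvar=False)   # sliced estimate of Var{E(U | Y)}, d x d
    # Weighted leverage score of each column: W_j = v_j^T M v_j.
    return np.einsum('jd,de,je->j', V, M, V)
```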
  • 25:52So this weighted leverage score
  • 25:53actually has a very good feature,
  • 25:56that is, it is an indicator
  • 25:57of how influential the columns are,
  • 26:01and we can basically rank our predictors
  • 26:04according to the weighted leverage score.
  • 26:06The higher the score is,
  • 26:07the more influential the predictor will be.
  • 26:10And later I will show why this ranking property
  • 26:14holds for the weighted leverage score.
  • 26:17So this is the basic procedure
  • 26:20for using the weighted leverage score
  • 26:22to do variable selection or variable screening.
  • 26:25So given that we have this matrix,
  • 26:28which is an N by P matrix, and we have the responses,
  • 26:31the labels, for each of the locations,
  • 26:34we only need a one-time singular value decomposition, okay?
  • 26:39This is a rank-D singular value decomposition of X.
  • 26:43And then we can just calculate the weighted leverage scores
  • 26:45according to the equations,
  • 26:48rank those weighted leverage scores
  • 26:50from the highest to the lowest,
  • 26:52and select the predictors with
  • 26:54the highest weighted leverage scores.
  • 26:57This is the basic screening procedure
  • 26:59using the weighted leverage score,
  • 27:01and there are still implementation issues
  • 27:03that we will address later.
  • 27:05The first one: how can we determine the number D?
  • 27:08So given the data, which is N by P, how can we determine,
  • 27:12say, how many spiked singular values
  • 27:17to be included in the model?
  • 27:20And then the second implementation issue
  • 27:23is to determine the number of variables
  • 27:25to be selected in the model.
  • 27:29So you can see that the weighted leverage score
  • 27:32screening procedure only includes
  • 27:34a one-time singular value decomposition, so it is quite efficient.
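Hypothetical end-to-end usage of the weighted_leverage_scores sketch above, on made-up data (this is not one of the talk's simulation settings): one SVD, one score per column, then rank and keep the top-scoring predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2000))        # n = 500 "locations", p = 2000 "genes"
beta = np.zeros(2000)
beta[:6] = 2.0                              # six active predictors
y = np.tanh(X @ beta) + 0.3 * rng.standard_normal(500)

w = weighted_leverage_scores(X, y, d=5, n_slices=10)
ranking = np.argsort(w)[::-1]               # columns ranked from most to least influential
print(sorted(ranking[:10]))                 # ideally the six active columns appear near the top
```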
  • 27:41Okay so, next, let us use,
  • 27:44basically, just one slide
  • 27:47to discuss the ranking properties
  • 27:49of the weighted leverage score.
  • 27:51So as I mentioned, the weighted leverage score
  • 27:53has a very nice property.
  • 27:55So it is guaranteed by the theorem here.
  • 27:58We show that, given certain conditions are satisfied,
  • 28:02the minimum value of the weighted leverage score
  • 28:07of the true predictors
  • 28:09will always rank higher than the maximum value,
  • 28:12the maximum weighted leverage score,
  • 28:14of those redundant predictors.
  • 28:18And this holds for the population weighted leverage score.
  • 28:24In terms of the estimated weighted leverage score,
  • 28:27we utilize a two-step procedure.
  • 28:29So first of all,
  • 28:30we show that the estimated weighted leverage score
  • 28:34is very close to the population version
  • 28:37of the weighted leverage score.
  • 28:39Okay, and then the estimated weighted leverage score
  • 28:44will also have the ranking property,
  • 28:48that is, the estimated weighted leverage score
  • 28:50of the active predictors, or important predictors,
  • 28:54ranks higher than the estimated weighted leverage score
  • 28:58of the redundant predictors with probability tending to one.
  • 29:05Okay so the ranking property
  • 29:08of the weighted leverage score
  • 29:09is basically guaranteed by these two results.
  • 29:16And then further, we also know
  • 29:18that there are two implementation issues.
  • 29:21The first one is to determine
  • 29:22the number of spiked singular values D.
  • 29:25So how many singular values do we need to include in our model?
  • 29:29This question is quite crucial
  • 29:32because we need to know how many signals
  • 29:37the data contains, and we need to remove
  • 29:40all the redundant or noise information.
  • 29:43Okay so here I develop a criterion
  • 29:46based on the properties of those eigenvalues.
  • 29:52D(R) is a function of R, where R is the number of singular values
  • 29:56to be included in the model,
  • 29:59and theta-i hat
  • 30:02is the ratio between the i-th eigenvalue
  • 30:08and the largest eigenvalue, lambda-one hat.
  • 30:12And then, you can see that as we include more, say,
  • 30:17singular values in the model,
  • 30:19the first term will decrease,
  • 30:22until, at some point,
  • 30:24the decrease of the first term
  • 30:28is smaller than the increase of the second term.
  • 30:30Then D(R) starts to increase.
  • 30:34And then we can use the criterion to find D hat;
  • 30:38we show that D hat is very close to the true D.
  • 30:41Okay, using this criterion,
  • 30:43we can select the true number of signals in the model.
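The exact formula for D(R) is on the slide rather than in the transcript, so the sketch below is only a generic illustration of the same idea: a term built from the eigenvalue ratios theta_i that decreases as components are added, plus a penalty that grows with r, minimized over r (both the functional form and the penalty here are hypothetical, not the paper's).

```python
import numpy as np

def choose_d(singular_values, penalty=None):
    s = np.asarray(singular_values, dtype=float)
    theta = s / s[0]                          # ratios relative to the largest singular value
    if penalty is None:
        penalty = 1.0 / len(s)                # hypothetical penalty rate, not from the paper
    # D(r): unexplained share of the ratios plus a penalty growing with r.
    D = np.array([1.0 - theta[:r].sum() / theta.sum() + penalty * r
                  for r in range(1, len(s) + 1)])
    return int(np.argmin(D)) + 1              # d_hat = argmin_r D(r)
```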
  • 30:51And the second implementation issue
  • 30:53is about how many predictors, how many true predictors
  • 30:58we need to include in our model.
  • 31:01Okay again, we're ranking our weighted leverage scores
  • 31:05and here we utilize the criterion here
  • 31:10based on the properties of the weighted leverage score.
  • 31:12Okay, as we include more predictors into the active set,
  • 31:17the first term, okay,
  • 31:20the summation of the weighted leverage scores, will increase,
  • 31:22but the increment will decrease,
  • 31:25and then the second term will increase, okay?
  • 31:27So, as we include more predictors in the model,
  • 31:33there's some changing point
  • 31:35where the increment is smaller than the increment
  • 31:43of the second penalty term.
  • 31:46We show that, using this criterion,
  • 31:48the set we select using this criterion,
  • 31:52which is A, will, we'll say, include
  • 31:56all the true predictors with probability tending to one.
  • 32:03Okay so that's how we use this criterion
  • 32:05to determine the number of predictors
  • 32:07to be selected in the model.
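Again, the talk's exact criterion lives on the slide; the sketch below only illustrates the cutoff idea described above, adding predictors in decreasing score order until the marginal gain in cumulative weighted leverage score no longer beats a per-variable penalty (the penalty choice here is hypothetical, not from the paper).

```python
import numpy as np

def choose_model_size(scores, penalty=None):
    w = np.sort(np.asarray(scores, dtype=float))[::-1]   # scores, largest first
    if penalty is None:
        penalty = w.mean()                    # hypothetical per-variable penalty
    objective = np.cumsum(w) - penalty * np.arange(1, len(w) + 1)
    return int(np.argmax(objective)) + 1      # number of top-ranked predictors to keep
```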
  • 32:10Okay in the next step,
  • 32:11let me show some empirical study results
  • 32:14of using the weighted leverage score
  • 32:17to do the variable selection in the model.
  • 32:22In example one, as I mentioned,
  • 32:24we are proposing our method
  • 32:28under the general index model framework,
  • 32:31so the first model is a general index model.
  • 32:35In Y, there are two directions.
  • 32:37The first one is in the numerator,
  • 32:40so this is the so-called beta one transpose X.
  • 32:43The second direction, beta two transpose X,
  • 32:45is in the variance term,
  • 32:49the term within the variance;
  • 32:50this is beta two transpose X.
  • 32:54Okay so this is called a general index model,
  • 32:56and we assume that X is generated from
  • 33:03a multivariate normal distribution,
  • 33:05and we set the mean to zero and the covariance structure
  • 33:09like this, okay?
  • 33:11We let rho equal to 0.5
  • 33:13which will generate a matrix with moderate correlations.
  • 33:21So in this way, we generate both X and Y;
  • 33:24let's see the performance of variable selection
  • 33:29using the weighted leverage score.
  • 33:31In our scenarios, we let N equal to 1000
  • 33:34and the rho equal 0.5.
  • 33:37In example one, there are four different scenarios.
  • 33:42For scenario one, we let P be 200,
  • 33:46and then we increase P to 2,200 and 2,500.
  • 33:54We also increase the variance of the error term
  • 33:58as well, now to 1.3.
  • 34:02Okay, there are three criteria we used
  • 34:04to evaluate the performance of the method.
  • 34:08The first one is the false positive.
  • 34:10So it means how many of the variables are falsely selected?
  • 34:15The false negative shows how many variables
  • 34:18are falsely excluded, that is, how many true predictors
  • 34:21are falsely excluded.
  • 34:24The last one, the last criterion,
  • 34:26is basically the model size,
  • 34:30because we have this ranking property
  • 34:33for all the methods here, okay?
  • 34:36All those methods have ranking properties.
  • 34:38So I want to know how many variables I need to include
  • 34:43in the model so that all true predictors
  • 34:45are included because we have this ranking property, okay?
  • 34:50You will see that the weighted leverage score
  • 34:54basically has a better performance
  • 34:56in terms of the false positives and the false negatives
  • 35:01and also the model size.
  • 35:05I want to say a bit more about model size.
  • 35:09We can see that when N is 1000 and P is 200,
  • 35:13the minimum model size is 6.14, meaning that, since in total
  • 35:18we only have six true predictors, okay,
  • 35:21we only need to include
  • 35:23about 6.14 variables to encompass all true predictors.
  • 35:29Basically, all those six variables rank higher
  • 35:32than all the other null variables, right?
  • 35:35In the second scenario, when P increases to 2,200,
  • 35:40we only need seven predictors,
  • 35:43on average, in order to include
  • 35:45all the true predictors.
  • 35:46Meaning that overall, all the true predictors
  • 35:51rank higher than those redundant predictors.
  • 35:55And then when sigma increases and when P increases,
  • 35:58we still only need close to the minimum number of variables
  • 36:03to include all true predictors.
  • 36:07Okay so this is for the first example
  • 36:09of the general index model.
  • 36:12The second is more challenging,
  • 36:14it's called a heteroscedastic model.
  • 36:17You will find that, in the first model,
  • 36:19the X, those active predictors, only influence
  • 36:24the mean of the response, okay?
  • 36:26They only influence Y in its mean.
  • 36:28But here, you can see that, for those variables,
  • 36:31the average of Y,
  • 36:33the mean of Y, is zero,
  • 36:34because the error term is in the numerator,
  • 36:37and those X, those active predictors, will influence Y
  • 36:42in its variance.
  • 36:45So it's a much more challenging case.
  • 36:49In this case, we also assume that X follows
  • 36:52a multivariate normal distribution,
  • 36:54with mean zero and a covariance structure like this.
  • 37:00In our scenarios, we let N equal to 1,000.
  • 37:03So let's see, in a heteroscedastic model,
  • 37:06what are the behaviors of our methods?
  • 37:11Okay, sorry, I forgot to introduce the methods
  • 37:14that we are comparing with: sure independence ranking
  • 37:18and screening, and the distance correlation.
  • 37:20So both of these methods can be utilized
  • 37:23to measure, say, the association between the response
  • 37:28and the predictors.
  • 37:30And both of the methods have the ranking properties.
  • 37:34So we can compare the minimum model size
  • 37:36for all the methods.
  • 37:39Regarding the false positives of these two methods,
  • 37:42there's no criterion proposed
  • 37:43to select the number of predictors,
  • 37:46so for all those methods,
  • 37:48I just use a hard thresholding
  • 37:50to select the number of predictors
  • 37:52to be included in the model.
  • 37:55Okay let's see, in a heteroscedastic model,
  • 37:58what are the behaviors of those three methods?
  • 38:03Okay, when we have N greater than the number of predictors
  • 38:08P, here you will find that there is a slightly larger
  • 38:12false negative for the weighted leverage score.
  • 38:16This is because the other methods, with the hard threshold,
  • 38:20select around 140, more than 140,
  • 38:25predictors out of the 200 in the model.
  • 38:29So that's why they have very small false negative,
  • 38:33but if you look at the minimum model size,
  • 38:36you will find that our weighted leverage score
  • 38:39still maintains a very good performance.
  • 38:42Okay, it has a smaller value of the minimum model size.
  • 38:46Okay, in general, we only need 46 variables
  • 38:49in order to include all true predictors in the model.
  • 38:55And then as P diverges, as P increases to 2,500,
  • 39:01basically the weighted leverage score
  • 39:03will miss, on average, about 1.3 true predictors from the model,
  • 39:07and every method has a really hard time
  • 39:11identifying all the true predictors.
  • 39:14They have really large minimum model size.
  • 39:20So this is the basic performance of using
  • 39:25the weighted leverage score to perform variable screening
  • 39:30under the general index model.
  • 39:31So I only present two examples here;
  • 39:34if you are interested in other scenarios,
  • 39:36we can talk about that after the talk.
  • 39:42Okay so let's get back to our real data example.
  • 39:47So in a motivating example, as I mentioned,
  • 39:49we utilize this spatial transcriptomics data.
  • 39:53We are sequencing the grid points within each section, okay?
  • 39:57Basically these locations are invasive cancer areas,
  • 40:02the other areas are the noninvasive cancer areas,
  • 40:06and then these are the normal areas.
  • 40:08So how to determine the invasive, non-invasive
  • 40:11and the normal area?
  • 40:13These are determined by qualified doctors
  • 40:17and they're utilizing some pathological information
  • 40:20of these locations.
  • 40:22Okay in general, for these two sections,
  • 40:24we have identified 518 locations:
  • 40:2864 are invasive areas and 73 are noninvasive areas,
  • 40:34and the rest of the locations, 381, are normal.
  • 40:40And we have about 3,572
  • 40:46gene expressions across the sections.
  • 40:51Okay so in general, basically we have a data matrix
  • 40:54that is about 518 by 3,572.
  • 41:00So we are trying to identify biomarkers, okay,
  • 41:04among those 3,572 genes,
  • 41:06that can help us discriminate between invasive cancer,
  • 41:10non-invasive cancer, and normal areas.
  • 41:14So we utilize the weighted leverage score,
  • 41:18we apply the weighted leverage score screening procedure
  • 41:20to this data set.
  • 41:22And we identified around 225 genes
  • 41:27among all those P genes.
  • 41:31In the plot, a heat map here shows the results.
  • 41:34Just for the ease of presentation,
  • 41:37I only printed around 20 genes here,
  • 41:40those with the top, say, weighted leverage scores,
  • 41:45and you can see that there are certain patterns here.
  • 41:48This group of genes is more highly expressed
  • 41:51in the non-invasive cancer areas.
  • 41:54There are certain group of genes right here.
  • 41:57They are more highly expressed
  • 41:59in the invasive cancer areas.
  • 42:03Okay so this is the gene expression patterns
  • 42:07of those top 20 genes.
  • 42:13And then we also plot the expressions of those genes
  • 42:20in these sections.
  • 42:22Again, these are invasive areas,
  • 42:25noninvasive and the normal areas.
  • 42:27So we plot a group of genes,
  • 42:30I can't remember exactly which genes they are,
  • 42:33but these genes, you can see,
  • 42:37have a higher expression in those noninvasive cancer areas.
  • 42:42And we plot another group of genes,
  • 42:43basically these three genes;
  • 42:45in the sections, the expression is higher,
  • 42:48meaning that these three genes have higher expression,
  • 42:52okay, in the invasive cancer areas, okay?
  • 42:57Basically this means that the genes that we selected
  • 42:59show remarkable spatially differential expression patterns
  • 43:06across the tissue sections.
  • 43:11And later we do a, say, pathway analysis
  • 43:19and see that there are 47 functional classes
  • 43:24for all those 225 genes that we have identified.
  • 43:28And there are several cancer hallmarks; for example,
  • 43:3238 of the genes that we identified
  • 43:35are enriched in the regulation of apoptotic process.
  • 43:41This is a kind of cancer hallmark.
  • 43:44And then another 41 genes that we have identified
  • 43:47are involved in the regulation of cell death.
  • 43:52More specifically, because we are really interested
  • 43:54in the invasive cancer,
  • 43:57we identified these three,
  • 43:59there are three genes, for example,
  • 44:02in one particular regulation process,
  • 44:05that have many relations with breast cancer, okay?
  • 44:11And later we can further investigate, say,
  • 44:17those genes that are enriched
  • 44:20in the regulation of apoptotic process.
  • 44:32So, in summary, the weighted leverage score
  • 44:36that we have developed is a variable screening method
  • 44:40and it is developed under the general index model,
  • 44:45which is a very general model framework.
  • 44:49It can be used to address the curse of dimensionality
  • 44:52in regression, and also,
  • 44:55because we utilize both leverage scores,
  • 44:59the left leverage score and the right leverage score,
  • 45:01to evaluate a predictor's importance
  • 45:04in the general index model,
  • 45:07we provide a theoretical underpinning
  • 45:10to justify that, with both
  • 45:14the left and right leverage scores,
  • 45:16we can evaluate a predictor's importance.
  • 45:19Okay so this is kind of a new framework
  • 45:21for analyzing those numerical properties,
  • 45:25especially for the singular matrices,
  • 45:29under the general index model.
  • 45:32Okay so, this is basically a summary
  • 45:34of the weighted leverage score
  • 45:36and I wanna stop here and to see if anyone has any questions
  • 45:42or comments about weighted leverage score.
  • 45:56<v Robert>Questions?</v>
  • 46:00Anybody on Zoom have questions?
  • 46:04<v Student>Can I ask a quick question</v>
  • 46:05regarding this weighted leverage score?
  • 46:07So when we look at results,
  • 46:09this weighted leverage score
  • 46:10has much better performance
  • 46:11than the sliced inverse regression, the original one, right?
  • 46:14So I wonder, is this correct,
  • 46:17that the reason why it improves so much
  • 46:20is because it utilizes the information within X?
  • 46:23Maybe it would totally make sense in a lot of applications
  • 46:26that those important features
  • 46:28may be contributing more to,
  • 46:31or also leading to, the variation of other features,
  • 46:35and as a result, maybe could show up in the top
  • 46:39singular vectors of the design matrix.
  • 46:44Is this correct?
  • 46:46<v Professor Liu>Yeah, thank you very much</v>
  • 46:47for your question, it's a very good question.
  • 46:49So first of all, I want to clarify,
  • 46:53maybe I'm not very clear about SIRS.
  • 46:56Basically this is representing
  • 47:01the sure independence ranking and screening.
  • 47:01So it's also a method
  • 47:02that is based on the slicing versus regression.
  • 47:06So yeah, and the reason why the weighted leverage score
  • 47:10has a much better performance
  • 47:12compared to these two methods is basically,
  • 47:15one of the reasons is, because
  • 47:19the sure independence ranking and screening and the distance correlation
  • 47:23both just utilize the marginal correlation
  • 47:27between X and Y.
  • 47:30Okay so it does not utilize any of the information within X.
  • 47:34It's kind of a marginal correlation
  • 47:38between each X variable and Y, okay?
  • 47:41However, the weighted leverage score
  • 47:43will utilize both the row information
  • 47:46and also the variance information,
  • 47:49the correlation structure within the predictors;
  • 47:54the V matrix, as I mentioned,
  • 47:56is derived from the covariance structure of X.
  • 48:01The V matrix basically contains the eigenvectors
  • 48:03of the covariance structure, the covariance of X.
  • 48:06So it utilizes, say, a kind of correlation
  • 48:11among all the X variables in the variable screening.
  • 48:15So I'm not sure if this answers your question.
  • 48:18<v Student>Yeah, thank you.</v>
  • 48:19I think it is, you answered my question.
  • 48:22Essentially, I'm thinking that if those important features
  • 48:25are actually not the top contributors
  • 48:28to the top singular vectors,
  • 48:30then we wouldn't expect the weighted leverage score
  • 48:33to gain in the same way.
  • 48:36<v Professor Liu>Thank you.</v>
  • 48:42<v Robert>Any other questions</v>
  • 48:43anyone wants to bring up right now?
  • 48:51(students mumbling)
  • 48:56<v Vince>Can I ask a naive question?</v>
  • 48:58<v Professor Liu>Yes, Vince.</v>
  • 49:00<v Vince>So I'm wondering kind of, you know,</v>
  • 49:04when I think about doing SVD on data,
  • 49:06the first thing that I think of is easier
  • 49:09and I keep coming back to that,
  • 49:13and I can't tell if there's a relationship?
  • 49:17<v Professor Liu>Yeah, basically, right,</v>
  • 49:19we can generate U and V in many ways, right?
  • 49:22Using a regression model,
  • 49:24we can generate U and V as well, right?
  • 49:27We can generate, we can do,
  • 49:30so basically, I think a lot of the intuition
  • 49:32is that the weighted score uses the left and right
  • 49:34singular vectors, and we can use many different ways
  • 49:37to generate them.
  • 49:39Yeah, it's not really.
  • 49:40<v Robert>Thank you for that.</v>
  • 49:41All right, well then, if there's nothing further,
  • 49:43let's thank the teacher again.
  • 49:45<v Professor Liu>Thank you everyone for having me on.</v>
  • 49:51(students overlapping chatter)