
YSPH Biostatistics Seminar: “A Model-free Variable Screening Method Based on Leverage Score”

September 30, 2022
  • 00:01<v Robert>Hey everybody, I've got noon,</v>
  • 00:05so let's get started.
  • 00:06So today I'm pleased to introduce Professor Yiwen Liu.
  • 00:11Professor Liu earned her BS and MS in Statistics
  • 00:13from the Central University of Finance and Economics
  • 00:16in China and her PhD in Statistics
  • 00:19from the University of Georgia.
  • 00:21Today, she's an Assistant Professor of Practice
  • 00:23in the Department of Epidemiology and Biostatistics
  • 00:26at the Mel and Enid Zuckerman
  • 00:29College of Public Health at the University of Arizona.
  • 00:32Her research primarily focuses on developing
  • 00:35statistical methods and theory
  • 00:37to address a variety of issues in analyzing
  • 00:39high-dimensional data and complex data sets.
  • 00:43More specifically, her research interests
  • 00:46include developing model-free dimension reduction methods
  • 00:48for high-dimensional data regression
  • 00:50and integration methods for multiple-source data.
  • 00:53Today, she's gonna talk to us
  • 00:54about a model-free variable screening method
  • 00:57based on leverage score.
  • 00:59Let's welcome Professor Liu.
  • 01:04<v Professor Liu>Thank you, Robert</v>
  • 01:05for your nice introduction and it's my great honor
  • 01:08to be invited and present my work here.
  • 01:12So in today's talk, I will introduce a model-free
  • 01:16variable screening method based on leverage score,
  • 01:19and we named the method as the weighted leverage score.
  • 01:22So as we know, this is a joint work
  • 01:24with Dr. Wenxuan Zhong from the University of Georgia
  • 01:27and Dr. Peng Zeng from Auburn University.
  • 01:32So as we know, as we've heard, in this big data era,
  • 01:36there are numerous data produced in almost every field
  • 01:40of science, including biology.
  • 01:42So we are facing data of extremely high dimensionality
  • 01:48and also data with really complex structures.
  • 01:56Thank you.
  • 02:04And how to effectively extract information
  • 02:07from such large and complex data
  • 02:10poses new statistical challenges.
  • 02:12So to motivate my research, let us see an example first.
  • 02:18So currently cancer has become
  • 02:20one of the primary causes of death across the world.
  • 02:24Nowadays cancer is diagnosed by an expert
  • 02:28who has to look at the tissue samples under the microscope.
  • 02:31You can imagine that there are millions
  • 02:33of new cancer cases each year
  • 02:36and this often means that those doctors
  • 02:39will find themselves looking at hundreds of images each day.
  • 02:44And this is really tedious work.
  • 02:48And you may find that,
  • 02:52because of the shortage of qualified doctors,
  • 02:55there could be a huge lag time
  • 02:57before those doctors can even figure out
  • 02:59what is going on with the patient.
  • 03:02So detecting cancer using only manpower,
  • 03:05looking at images, is not enough.
  • 03:08And we intend to build a statistical
  • 03:12and mathematical model to identify, detect cancer
  • 03:16in a more accurate, less expensive way.
  • 03:22Okay, so second-generation sequencing
  • 03:25makes this possible and promising.
  • 03:29And so a typical research question,
  • 03:33a critical inference, is to find the markers
  • 03:36that are related to cancer.
  • 03:38Right now there's a new sequencing technology
  • 03:39called spatial transcriptomics.
  • 03:42You know that for bulk RNA sequencing data,
  • 03:46it just sequences the whole tissue
  • 03:47and generates averaged gene expression data.
  • 03:51But with this new technology called spatial transcriptomics,
  • 03:55this kind of cancer tissue will be sliced
  • 03:57into several thin sections.
  • 04:01And within each section, the grid points in the section
  • 04:06will be sequenced simultaneously.
  • 04:08So you can see that here we have two areas
  • 04:11of invasive cancer, okay,
  • 04:14all the grid points within these two areas
  • 04:17will be invasive cancer regions.
  • 04:21The other six areas, they are noninvasive cancer areas.
  • 04:25The grid points in these locations
  • 04:29will be noninvasive cancer regions,
  • 04:32and the other parts are normal tissue, okay?
  • 04:35And the data that we will have,
  • 04:37because this new technology
  • 04:40will sequence the whole tissue,
  • 04:43all those grid points, simultaneously,
  • 04:45is a data matrix in which
  • 04:47each row corresponds to a location
  • 04:50within the section and
  • 04:53each column corresponds to the expression
  • 04:56of a certain gene.
  • 04:59And in a data matrix like this,
  • 05:01the Y's are the labels for those locations,
  • 05:05normal, noninvasive, or invasive.
  • 05:07And we will get gene expressions for all those P genes
  • 05:12for each location, okay?
  • 05:15So this is the data that we have
  • 05:17and our goal then comes to identify marker genes
  • 05:23for those noninvasive and invasive cancer areas.
  • 05:27As shown in this figure, these are the tissue sections.
  • 05:33There are colored dots here,
  • 05:37with the color of the dots showing the expression levels.
  • 05:42Okay?
  • 05:44So the dots with a yellow color
  • 05:47show a higher expression.
  • 05:50We intend to build models
  • 05:52to identify such genes.
  • 05:55These genes show remarkable
  • 05:58differential expression levels across tissue sections, okay?
  • 06:03These two genes have higher expression
  • 06:06in invasive cancer areas.
  • 06:10Okay, we intend to build a statistical model
  • 06:13to identify such genes
  • 06:14but there exist several challenges here.
  • 06:17Usually in the data that we have,
  • 06:20the number of samples, the locations here,
  • 06:22our labeled data, is only around hundreds,
  • 06:25but the number of genes could be tens of thousands.
  • 06:30This is the so-called large-p-small-n problem.
  • 06:35Usually for any traditional methods,
  • 06:38there's no way to utilize those traditional methods
  • 06:41to solve this problem.
  • 06:44Not to mention that there is a further layer
  • 06:46of complication between the gene expression levels
  • 06:49and the cancer or normal types, okay?
  • 06:53Usually how the gene expression levels
  • 06:56would influence, could affect different types of cancer,
  • 07:00this mechanism is largely unknown
  • 07:02and the association between them is beyond linear.
  • 07:06So these are the two challenges.
  • 07:09That means that what we need
  • 07:12is a statistical method
  • 07:15that can do variable screening
  • 07:19in a more general model setup.
  • 07:26So, to achieve this goal,
  • 07:29we choose to build our method
  • 07:32under this so-called general index model.
  • 07:35A general index model describes a scenario in which Y-i,
  • 07:40which is the response,
  • 07:41is related to K linear combinations of X-i,
  • 07:47that is, beta one transpose X-i
  • 07:49to beta K transpose X-i,
  • 07:51through some unknown function F.
  • 07:55So this is the general index model
  • 07:57and we know that here, X-i is a P-dimensional vector,
  • 08:03and if K is a value that is much smaller than P,
  • 08:09then we actually achieve the goal of dimension reduction
  • 08:13because the original P-dimensional vector
  • 08:16is projected onto a K-dimensional space,
  • 08:22spanned by beta one transpose X-i
  • 08:24to beta K transpose X-i.
  • 08:27And we choose this general index model
  • 08:30because it actually is a very general model framework.
  • 08:34If we map this general index model to our problem here,
  • 08:38the Y-i could be the label for location i.
  • 08:43And, for example, the non-invasive location.
  • 08:48And then X-i is a P-dimensional vector
  • 08:50and could be the gene expression levels
  • 08:52for location i of those P genes,
  • 08:57and then beta one transpose X-i
  • 08:58to beta K transpose X-i
  • 09:01could be those K groups
  • 09:05of coregulated genes.
  • 09:07And those K groups of coregulated genes
  • 09:10will affect the response through some unknown function F.
  • 09:16Okay so this is our general model setup.
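To fix notation, here is a compact way of writing the general index model just described (a sketch in standard notation; the placement of the error term inside F follows the general formulation that covers the linear, nonparametric, and single index special cases discussed next):

```latex
Y_i \;=\; f\!\left(\beta_1^{\top} X_i,\; \ldots,\; \beta_K^{\top} X_i,\; \varepsilon_i\right),
\qquad X_i \in \mathbb{R}^{p}, \quad K \ll p .
```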
  • 09:24We utilize the general index model
  • 09:26because it's a general model framework
  • 09:29that encompasses many different model types.
  • 09:33There are three special cases;
  • 09:35for example, the linear model is one special case.
  • 09:40Here, that is when K equals one
  • 09:45and the unknown function F takes an identity form.
  • 09:50Okay so this is the linear model
  • 09:52and the error term is additive.
  • 09:55So the linear model is one special case of it.
  • 09:58The nonparametric model is another special case
  • 10:01of the general index model.
  • 10:03That is when K equals P and beta one to beta P
  • 10:07form an identity matrix.
  • 10:12Thank you.
  • 10:14And then the third one,
  • 10:15the single index model
  • 10:16is another special case for the general index model
  • 10:21that is when K equals one
  • 10:22and the error term is additive.
  • 10:26So the reason that I show these three special cases
  • 10:28is just to let everyone know that the general index model
  • 10:32is a very general model framework.
  • 10:36In this case, using this model framework
  • 10:39to do variable screening or variable selection,
  • 10:50whether we should screen out
  • 10:51or retain certain variables
  • 10:54is determined by the coefficients here.
  • 10:57Say, for a specific variable,
  • 11:02if its coefficients across those K
  • 11:04different directions are all zero,
  • 11:08then we say this variable is redundant.
  • 11:13Okay so this is how we utilize the model
  • 11:16and the estimated coefficients
  • 11:18to do variable screening or, say, variable selection.
  • 11:27So the question becomes how we can estimate beta
  • 11:30under this model framework, right?
  • 11:32Just like estimating beta
  • 11:34in a simple linear regression model.
  • 11:37So let's see a simple case.
  • 11:39That is when the function F is invertible.
  • 11:44So that is to say,
  • 11:45the model becomes F inverse of Y-i
  • 11:49equal to beta transpose X-i plus epsilon-i.
  • 11:53Okay and this looks like a familiar model, right?
  • 11:57And if we want to estimate beta,
  • 11:59we can just simply maximize the correlation
  • 12:03between F inverse of Y-i and beta transpose X-i.
  • 12:07Using this optimization problem we can recover beta, okay,
  • 12:13given that F is invertible and F function is known.
  • 12:19But we know in real cases, the function F is unknown
  • 12:24and sometimes it is not invertible.
  • 12:26And then what can we do to estimate beta
  • 12:31when F function is unknown?
  • 12:34So when the function F is unknown,
  • 12:37we can consider all the transformations of Y-i
  • 12:40and we can solve for beta
  • 12:41through the following optimization problem.
  • 12:45We consider all transformations of Y-i
  • 12:51and we define R square of eta,
  • 12:54which is a function of eta, as the maximized correlation
  • 12:59between a transformation of Y-i and eta transpose X-i.
  • 13:05And this maximization is taken over
  • 13:07all such transformations, okay?
  • 13:13So using this function, beta basically is the solution
  • 13:17for this maximization problem
  • 13:20and with certain conditions satisfied,
  • 13:23we can simplify this objective function
  • 13:27with respect to eta; we can transform
  • 13:34R square of eta into this really nice quadratic form.
  • 13:38Okay, in the numerator, it's eta transpose times, okay,
  • 13:43this conditional variance matrix times eta,
  • 13:47and in the denominator, it's eta transpose times
  • 13:50the variance of X-i times eta.
  • 13:52This is a very nice projected form.
  • 13:56Basically, the solution of this, the solution beta,
  • 14:02is just the eigenvectors, okay,
  • 14:06corresponding to the K largest eigenvalues
  • 14:09of this matrix in the middle.
  • 14:13That's how we solve beta in this case.
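Written out, the objective just described takes the following form, assuming the standard maximal-correlation identity (the transformation symbol T and the regularity conditions are my notation and are not spelled out in the talk):

```latex
R^{2}(\eta)
\;=\; \max_{T}\; \operatorname{corr}^{2}\!\bigl\{\, T(Y_i),\; \eta^{\top} X_i \,\bigr\}
\;=\; \frac{\eta^{\top}\, \mathrm{Var}\!\bigl\{\mathbb{E}\!\left(X_i \mid Y_i\right)\bigr\}\, \eta}
           {\eta^{\top}\, \mathrm{Var}\!\left(X_i\right)\, \eta},
```

so that beta one through beta K are recovered as the eigenvectors associated with the K largest eigenvalues of Var(X_i)^{-1} Var{E(X_i | Y_i)}.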
  • 14:16But as we know, as I mentioned,
  • 14:18there are really big challenges here.
  • 14:21One is that we have a really large P here:
  • 14:24Sigma X is a P by P matrix,
  • 14:27and Sigma X given Y is also a P by P matrix.
  • 14:31And we know that we are dealing with a case
  • 14:34where P is larger than N; in this scenario,
  • 14:38it would be really difficult
  • 14:40to generate a consistent estimate
  • 14:45when P is larger than N.
  • 14:48And we also have this inverse here; for a very large matrix,
  • 14:53it would be really time consuming
  • 14:55to produce the inverse of the matrix,
  • 14:58let alone that this matrix is not a consistent estimate, okay?
  • 15:02So this matrix in the middle,
  • 15:04if we want to estimate that
  • 15:06in this large-P scenario,
  • 15:08it would be really problematic, right?
  • 15:13And in the following, I will show how we're gonna use
  • 15:18the weighted leverage score, the method that we proposed,
  • 15:21to bypass the estimation of these two matrices
  • 15:25and then perform the variable selection
  • 15:28and dimension reduction under the general index model.
  • 15:35So we call our method the weighted leverage score.
  • 15:39Let us first take a look at what is leverage score
  • 15:43and what is weighted leverage score, okay?
  • 15:46So let's consider a simple case,
  • 15:49that is, the linear regression model,
  • 15:51and write the rank-D singular value decomposition of X
  • 15:55as X equal to U Lambda V-transpose.
  • 15:59This is the singular value decomposition of X.
  • 16:03And then, so in statistics,
  • 16:05the leverage score basically is defined as
  • 16:08the diagonal element of the hat matrix.
  • 16:14And then the hat matrix,
  • 16:15if we use the singular value decomposition,
  • 16:18can be simplified to U U-transpose,
  • 16:22which means that the diagonal element
  • 16:25of the hat matrix basically is the row norm
  • 16:29of the U matrix, okay?
  • 16:34And then actually this leverage score
  • 16:37has a very good interpretation.
  • 16:40It is the partial derivative of Y-i hat with respect to Y-i.
  • 16:45Okay, which means that if the leverage score
  • 16:48is larger and closer to one,
  • 16:51the observation would be more influential in predicting Y-i hat.
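A minimal numerical sketch of this definition (illustrative only, not the speaker's code): from the thin SVD X = U diag(s) V^T, the hat matrix equals U U^T, so the i-th leverage score is the squared row norm of U.

```python
import numpy as np

def leverage_scores(X):
    # Thin SVD: X = U diag(s) V^T, with U an n x r matrix of orthonormal columns.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Diagonal of the hat matrix H = U U^T: h_ii = ||U_i||^2.
    return np.sum(U**2, axis=1)

# Quick check: scores lie in [0, 1] and sum to the rank of X.
X = np.random.randn(100, 5)
h = leverage_scores(X)
print(h.min(), h.max(), h.sum())   # the sum is approximately 5
```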
  • 16:57So there is recent work of Dr. Ping Ma,
  • 17:01who is using this leverage score to do sub-sampling
  • 17:05in big data.
  • 17:07As you can see, again,
  • 17:09the message here is that if the U-i norm is larger,
  • 17:13if the leverage score is larger,
  • 17:14then we say this point is more influential.
  • 17:19The motivating example is like this:
  • 17:21in the first figure, these black dots,
  • 17:25they are the original data
  • 17:29and the solid black line is the actual model.
  • 17:35And if we want to do a linear regression,
  • 17:37usually, sometimes with really big data,
  • 17:39when the data is really large,
  • 17:41it is hardly possible to utilize all the data points
  • 17:44to generate the line here.
  • 17:49So a typical strategy is just to do sub-sampling
  • 17:54from such big data
  • 17:56and then fit a linear regression model.
  • 17:59Right now, you can see that a random sub-sample
  • 18:02from the population,
  • 18:08those data represented by the green crosses,
  • 18:12will generate a linear regression line
  • 18:17that largely deviates from the true model.
  • 18:21So that is where random sampling does not work
  • 18:27in this case.
  • 18:28However, if we do sub-sampling
  • 18:33according to the leverage scores, okay,
  • 18:36you will see, in the second graph,
  • 18:38that these red crosses are the data sub-sampled
  • 18:44utilizing the so-called leverage score,
  • 18:48and the red dashed line is the model
  • 18:52fitted on those sub-samples, okay?
  • 18:55So we can see that using the leverage score
  • 18:59to sub-sample can help us to generate a line
  • 19:03that can approximate the true model very well.
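A hedged sketch of the subsampling idea in those two figures, in the spirit of the leveraging work mentioned above but not the exact algorithm from it: draw rows with probability proportional to their leverage scores, then fit weighted least squares on the subsample.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 200                          # full-data size and subsample size
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(n)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
h = np.sum(U**2, axis=1)                    # leverage scores of the rows
prob = h / h.sum()                          # sampling probabilities proportional to leverage

idx = rng.choice(n, size=m, replace=True, p=prob)
w = 1.0 / (m * prob[idx])                   # inverse-probability weights
beta_hat, *_ = np.linalg.lstsq(X[idx] * np.sqrt(w)[:, None],
                               y[idx] * np.sqrt(w), rcond=None)
print(beta_hat)                             # close to the coefficients (1, 2) of the true line
```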
  • 19:09So what I want to say using these graphs
  • 19:10is that the leverage score, the U-i norm,
  • 19:14can be used, say, as an indicator
  • 19:19of how influential the data point is to the prediction.
  • 19:25Okay so U-i is the row norm of the left singular matrix,
  • 19:32but we are talking about variable selection.
  • 19:34So the U-i norm can be used to select the rows.
  • 19:38Intuitively, to select the columns of X,
  • 19:43we can just take the transpose of X.
  • 19:45So X transpose equal to V Lambda U transpose.
  • 19:49To select the columns of X
  • 19:52basically is to select the rows of X transpose.
  • 19:55Intuitively we can just use the row norm of the V matrix,
  • 19:59which is the right singular matrix, to do the selection,
  • 20:03to select the influential columns of X, okay, right?
  • 20:08And then, so we call
  • 20:13the U the left singular matrix,
  • 20:16V the right singular matrix.
  • 20:18And we call the row norm of U the left leverage score,
  • 20:24the row norm of V the right leverage score.
  • 20:30So what I want to say, using the previous two slides,
  • 20:34is that basically, the row information,
  • 20:36intuitively, the row information is contained in the U matrix,
  • 20:40the column information of X
  • 20:41is contained in the V matrix,
  • 20:44and we know that there is a further complication
  • 20:47between X and Y, which is the unknown link function F.
  • 20:50So how do we utilize the information from the columns,
  • 20:54from the rows, and also the unknown function
  • 20:57to generate a screening method
  • 21:01that can help us do variable selection,
  • 21:07that is, to select influential columns of X?
  • 21:13Okay so, let us get back to the matrix
  • 21:16we derived in the previous slides.
  • 21:18We have the conditional variance in the numerator
  • 21:22and we have
  • 21:25the variance of X in the denominator.
  • 21:27And with simple statistics,
  • 21:29this can be simplified to the variance of the expectation
  • 21:33of Z given Y, where Z is the standardized X.
  • 21:38Okay, further, if we use the singular value decomposition,
  • 21:41Z can be simplified to U V transpose,
  • 21:46and then its ij element basically
  • 21:48is the inner product of U-i and V-j,
  • 21:52so it basically contains both row information
  • 21:56and column information.
  • 21:58And then we proposed the weighted leverage score,
  • 22:01which is defined in this equation.
  • 22:06And the interpretation of W-j,
  • 22:09which is the weighted leverage score for the j-th predictor,
  • 22:12is threefold.
  • 22:14So first of all you can see it contains
  • 22:16both the column information and the row information of X.
  • 22:21And second,
  • 22:27you can see the matrix in the middle basically contains
  • 22:30the information from the unknown function F,
  • 22:33because we have the conditional expectation here,
  • 22:36the expectation of U-i given Y;
  • 22:38basically it's a kind of a reflection
  • 22:41of the unknown function F.
  • 22:46And third, this method is built
  • 22:49under the general index model
  • 22:51and it is model-free,
  • 22:53in the sense that the general index model
  • 22:55encompasses many different model types.
  • 22:59Okay so this is kind of a population version
  • 23:04of the weighted leverage score.
  • 23:06In terms of estimation,
  • 23:07you will see we only need to estimate
  • 23:09the matrix in the middle,
  • 23:11which is the variance of the expectation of Ui given Y.
  • 23:17To estimate this matrix,
  • 23:20we can see that U-i is actually D-dimensional,
  • 23:24because this is a rank-D singular value decomposition.
  • 23:27U-i is a D-dimensional vector.
  • 23:29Y is only one dimensional.
  • 23:31This is a function of a one-dimensional variable.
  • 23:33So it can be easily approximated by dividing, okay,
  • 23:38the range of Y into H slices,
  • 23:43and within each slice, okay,
  • 23:46within each slice we calculate the slice mean
  • 23:49over the rows of U.
  • 23:54As illustrated in this graph,
  • 23:55we can first slice Y into H slices
  • 23:59and then within each slice, for those Y-i
  • 24:02that fall into the same slice,
  • 24:03we find out their corresponding U rows
  • 24:06and then calculate their mean.
  • 24:14And in that way, we can simply
  • 24:16estimate the expectation of U-i given Y.
  • 24:21And then further we can estimate the variance
  • 24:23of those averages,
  • 24:27basically just taking the variance of U one bar
  • 24:30to U H bar.
  • 24:37So this is the way we estimate the variance
  • 24:40of the expectation of U-i given Y.
  • 24:45Okay so in that way we actually generate our estimated
  • 24:48weighted leverage score,
  • 24:50which we define as the right leverage score
  • 24:53weighted by the matrix in the middle.
  • 24:58Okay so first of all, this weighted leverage score
  • 25:02is built, say, upon the general index model;
  • 25:08it is considered as model-free
  • 25:10because Y-i, the response, is connected with
  • 25:14linear combinations of X through the unknown function F.
  • 25:19And this model is general enough to encompass
  • 25:21many different model types.
  • 25:23So we can consider it as model-free.
  • 25:26And second, to generate this weighted leverage score,
  • 25:31there is no need to estimate the covariance matrix
  • 25:36and there is no need to estimate the unknown function F.
  • 25:40So we can bypass all those procedures
  • 25:44to calculate this weighted leverage score.
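A minimal sketch of the estimator just described, assuming the weighted leverage score of column j takes the quadratic form v_j^T M v_j, where v_j is the j-th row of the right singular matrix V and M is the sliced estimate of Var{E(U | Y)}; the paper's exact definition (for instance how the singular values or the standardization of X enter) may differ.

```python
import numpy as np

def weighted_leverage_scores(X, y, d, n_slices=10):
    # Column-wise standardization as a stand-in for the talk's standardized Z;
    # the paper may use a different (e.g. whitening) transformation.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    U, V = U[:, :d], Vt[:d, :].T            # rank-d truncation: U is n x d, V is p x d

    # Slice the response into n_slices groups (for class labels, slice by class
    # instead) and average the rows of U within each slice: an estimate of E(U | Y).
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    slice_means = np.vstack([U[idx].mean(axis=0) for idx in slices])

    M = np.cov(slice_means, rowvar=False)   # sliced estimate of Var{E(U | Y)}, d x d
    # Weighted leverage score of each column: W_j = v_j^T M v_j.
    return np.einsum('jd,de,je->j', V, M, V)
```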
  • 25:52So this weighted leverage score
  • 25:53actually has a very good feature,
  • 25:56that is, it is an indicator
  • 25:57of how influential the columns are,
  • 26:01and we can basically rank our predictors
  • 26:04according to the weighted leverage score.
  • 26:06The higher the score is,
  • 26:07the more influential the predictor will be.
  • 26:10And later I will show why this ranking property
  • 26:14holds for the weighted leverage score.
  • 26:17So this is the basic procedure
  • 26:20for using the weighted leverage score
  • 26:22to do variable selection or variable screening.
  • 26:25So given that we have this matrix,
  • 26:28which is an N by P matrix, and we have the responses,
  • 26:31the labels, for each of the locations,
  • 26:34we only need a one-time singular value decomposition, okay?
  • 26:39This is a rank-D singular value decomposition of X.
  • 26:43And then we can just calculate the weighted leverage scores
  • 26:45according to the equations,
  • 26:48rank those weighted leverage scores
  • 26:50from the highest to the lowest,
  • 26:52and select the predictors with
  • 26:54the highest weighted leverage scores.
  • 26:57This is the basic screening procedure
  • 26:59using the weighted leverage score,
  • 27:01and there are still implementation issues
  • 27:03that we will address later.
  • 27:05The first one: how can we determine the number D?
  • 27:08So given the data, which is N by P, how can we determine,
  • 27:12say, how many spiked singular values
  • 27:17to be included in the model?
  • 27:20And then the second implementation issue
  • 27:23is to determine the number of variables
  • 27:25to be selected in the model.
  • 27:29So you can see that the weighted leverage score
  • 27:32screening procedure only includes
  • 27:34a one-time singular value decomposition, so it is quite efficient.
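Hypothetical end-to-end usage of the weighted_leverage_scores sketch above, on made-up data (this is not one of the talk's simulation settings): one SVD, one score per column, then rank and keep the top-scoring predictors.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 2000))        # n = 500 "locations", p = 2000 "genes"
beta = np.zeros(2000)
beta[:6] = 2.0                              # six active predictors
y = np.tanh(X @ beta) + 0.3 * rng.standard_normal(500)

w = weighted_leverage_scores(X, y, d=5, n_slices=10)
ranking = np.argsort(w)[::-1]               # columns ranked from most to least influential
print(sorted(ranking[:10]))                 # ideally the six active columns appear near the top
```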
  • 27:41Okay so, next, let us use,
  • 27:44basically, just one slide
  • 27:47to discuss the ranking properties
  • 27:49of the weighted leverage score.
  • 27:51So as I mentioned, the weighted leverage score
  • 27:53has a very nice property.
  • 27:55So it is guaranteed by the theorem here.
  • 27:58We show that, given certain conditions are satisfied,
  • 28:02the minimum value of the weighted leverage score
  • 28:07of the true predictors
  • 28:09will always rank higher than the maximum value,
  • 28:12the maximum weighted leverage score,
  • 28:14of those redundant predictors.
  • 28:18And this holds for the population weighted leverage score.
  • 28:24In terms of the estimated weighted leverage score,
  • 28:27we utilize a two-step procedure.
  • 28:29So first of all,
  • 28:30we show that the estimated weighted leverage score
  • 28:34is very close to the population version
  • 28:37of the weighted leverage score.
  • 28:39Okay, and then the estimated weighted leverage score
  • 28:44will also have the ranking property,
  • 28:48that is, the estimated weighted leverage score
  • 28:50of the active predictors, or important predictors,
  • 28:54ranks higher than the estimated weighted leverage score
  • 28:58of the redundant predictors with probability tending to one.
  • 29:05Okay so the ranking property
  • 29:08of the weighted leverage score
  • 29:09is basically guaranteed by these two results.
  • 29:16And then further, we also know
  • 29:18that there are two implementation issues.
  • 29:21The first one is to determine
  • 29:22the number of spiked singular values D.
  • 29:25So how many singular values do we need to include in our model?
  • 29:29This question is quite crucial
  • 29:32because we need to know how many signals
  • 29:37the data contains, and we need to remove
  • 29:40all the redundant or noise information.
  • 29:43Okay so here I develop a criterion
  • 29:46based on the properties of those eigenvalues.
  • 29:52D(R) is a function of R, where R is the number of singular values
  • 29:56to be included in the model,
  • 29:59and theta-i hat
  • 30:02is the ratio between the i-th eigenvalue
  • 30:08and the largest eigenvalue, lambda-one hat.
  • 30:12And then, you can see that as we include more, say,
  • 30:17singular values in the model,
  • 30:19the first term will decrease,
  • 30:22until, at some point,
  • 30:24the decrease of the first term
  • 30:28is smaller than the increase of the second term.
  • 30:30Then D(R) starts to increase.
  • 30:34And then we can use the criterion to find D hat;
  • 30:38we show that D hat is very close to the true D.
  • 30:41Okay, using this criterion,
  • 30:43we can select the true number of signals in the model.
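The exact formula for D(R) is on the slide rather than in the transcript, so the sketch below is only a generic illustration of the same idea: a term built from the eigenvalue ratios theta_i that decreases as components are added, plus a penalty that grows with r, minimized over r (both the functional form and the penalty here are hypothetical, not the paper's).

```python
import numpy as np

def choose_d(singular_values, penalty=None):
    s = np.asarray(singular_values, dtype=float)
    theta = s / s[0]                          # ratios relative to the largest singular value
    if penalty is None:
        penalty = 1.0 / len(s)                # hypothetical penalty rate, not from the paper
    # D(r): unexplained share of the ratios plus a penalty growing with r.
    D = np.array([1.0 - theta[:r].sum() / theta.sum() + penalty * r
                  for r in range(1, len(s) + 1)])
    return int(np.argmin(D)) + 1              # d_hat = argmin_r D(r)
```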
  • 30:51And the second implementation issue
  • 30:53is about how many predictors, how many true predictors
  • 30:58we need to include in our model.
  • 31:01Okay again, we're ranking our weighted leverage scores
  • 31:05and here we utilize the criterion here
  • 31:10based on the properties of the weighted leverage score.
  • 31:12Okay, as we include more predictors into the active set,
  • 31:17the first term, okay,
  • 31:20the summation of the weighted leverage scores, will increase,
  • 31:22but the increment will decrease,
  • 31:25and then the second term will increase, okay?
  • 31:27So, as we include more predictors in the model,
  • 31:33there's some changing point
  • 31:35where the increment is smaller than the increment
  • 31:43of the second penalty term.
  • 31:46We show that, using this criterion,
  • 31:48the set we select using this criterion,
  • 31:52which is A, will, we'll say, include
  • 31:56all the true predictors with probability tending to one.
  • 32:03Okay so that's how we use this criterion
  • 32:05to determine the number of predictors
  • 32:07to be selected in the model.
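Again, the talk's exact criterion lives on the slide; the sketch below only illustrates the cutoff idea described above, adding predictors in decreasing score order until the marginal gain in cumulative weighted leverage score no longer beats a per-variable penalty (the penalty choice here is hypothetical, not from the paper).

```python
import numpy as np

def choose_model_size(scores, penalty=None):
    w = np.sort(np.asarray(scores, dtype=float))[::-1]   # scores, largest first
    if penalty is None:
        penalty = w.mean()                    # hypothetical per-variable penalty
    objective = np.cumsum(w) - penalty * np.arange(1, len(w) + 1)
    return int(np.argmax(objective)) + 1      # number of top-ranked predictors to keep
```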
  • 32:10Okay in the next step,
  • 32:11let me show some empirical study results
  • 32:14of using the weighted leverage score
  • 32:17to do the variable selection in the model.
  • 32:22In example one, as I mentioned,
  • 32:24we are proposing our method
  • 32:28under the general index model framework,
  • 32:31so the first model is a general index model.
  • 32:35In Y, there are two directions.
  • 32:37The first one is in the numerator,
  • 32:40so this is the so-called beta one transpose X.
  • 32:43The second direction, beta two transpose X,
  • 32:45is in the variance term,
  • 32:49the term within the variance;
  • 32:50this is beta two transpose X.
  • 32:54Okay so this is called a general index model,
  • 32:56and we assume that X is generated from
  • 33:03a multivariate normal distribution,
  • 33:05and we set the mean to zero and the covariance structure
  • 33:09like this, okay?
  • 33:11We let rho equal to 0.5
  • 33:13which will generate a matrix with moderate correlations.
  • 33:21So in this way, we generate both X and Y;
  • 33:24let's see the performance of variable selection
  • 33:29using the weighted leverage score.
  • 33:31In our scenarios, we let N equal to 1000
  • 33:34and the rho equal 0.5.
  • 33:37In example one, there are four different scenarios.
  • 33:42For scenario one, we let P be 200,
  • 33:46and then we increase P to 2,200 and 2,500.
  • 33:54We also increase the variance of the error term
  • 33:58as well, now to 1.3.
  • 34:02Okay, there are three criteria we used
  • 34:04to evaluate the performance of the method.
  • 34:08The first one is the false positive.
  • 34:10So it means how many of the variables are falsely selected?
  • 34:15The false negative shows how many variables
  • 34:18are falsely excluded, that is, how many true predictors
  • 34:21are falsely excluded.
  • 34:24The last one, the last criterion,
  • 34:26is basically the model size,
  • 34:30because we have this ranking property
  • 34:33for all the methods here, okay?
  • 34:36All those methods have ranking properties.
  • 34:38So I want to know how many variables I need to include
  • 34:43in the model so that all true predictors
  • 34:45are included because we have this ranking property, okay?
  • 34:50You will see that the weighted leverage score
  • 34:54basically has a better performance
  • 34:56in terms of the false positives and the false negatives
  • 35:01and also the model size.
  • 35:05I want to say a bit more about model size.
  • 35:09We can see that when N is 1000 and P is 200,
  • 35:13the minimum model size is 6.14, meaning that, since in total
  • 35:18we only have six true predictors, okay,
  • 35:21we only need to include
  • 35:23about 6.14 variables to encompass all true predictors.
  • 35:29Basically, all those six variables rank higher
  • 35:32than all the other null variables, right?
  • 35:35In the second scenario, when P increases to 2,200,
  • 35:40we only need seven predictors,
  • 35:43on average, in order to include
  • 35:45all the true predictors.
  • 35:46Meaning that overall, all the true predictors
  • 35:51rank higher than those redundant predictors.
  • 35:55And then when sigma increases and when P increases,
  • 35:58we still only need close to the minimum number of variables
  • 36:03to include all true predictors.
  • 36:07Okay so this is for the first example
  • 36:09of the general index model.
  • 36:12The second is more challenging,
  • 36:14it's called a heteroscedastic model.
  • 36:17You will find that, in the first model,
  • 36:19the X, those active predictors, only influence
  • 36:24the mean of the response, okay?
  • 36:26They only influence Y in its mean.
  • 36:28But here, you can see that, for those variables,
  • 36:31the average of Y,
  • 36:33the mean of Y, is zero,
  • 36:34because the error term is in the numerator,
  • 36:37and those X, those active predictors, will influence Y
  • 36:42in its variance.
  • 36:45So it's a much more challenging case.
  • 36:49In this case, we also assume that X follows
  • 36:52a multivariate normal distribution,
  • 36:54with mean zero and a covariance structure like this.
  • 37:00In our scenarios, we let N equal to 1,000.
  • 37:03So let's see, in a heteroscedastic model,
  • 37:06what are the behaviors of our methods?
  • 37:11Okay, sorry, I forgot to introduce the methods
  • 37:14that we are comparing with: sure independence ranking
  • 37:18and screening, and the distance correlation.
  • 37:20So both of these methods can be utilized
  • 37:23to measure, say, the association between the response
  • 37:28and the predictors.
  • 37:30And both of the methods have the ranking properties.
  • 37:34So we can compare the minimum model size
  • 37:36for all the methods.
  • 37:39Regarding the false positives of these two methods,
  • 37:42there's no criterion proposed
  • 37:43to select the number of predictors,
  • 37:46so for all those methods,
  • 37:48I just use a hard thresholding
  • 37:50to select the number of predictors
  • 37:52to be included in the model.
  • 37:55Okay let's see, in a heteroscedastic model,
  • 37:58what are the behaviors of those three methods?
  • 38:03Okay, when we have N greater than the number of predictors
  • 38:08P, here you will find that there is a slightly larger
  • 38:12false negative for the weighted leverage score.
  • 38:16This is because the other methods, with the hard threshold,
  • 38:20select around 140, more than 140,
  • 38:25predictors out of the 200 in the model.
  • 38:29So that's why they have very small false negative,
  • 38:33but if you look at the minimum model size,
  • 38:36you will find that our weighted leverage score
  • 38:39still maintains a very good performance.
  • 38:42Okay, it has a smaller value of the minimum model size.
  • 38:46Okay, in general, we only need 46 variables
  • 38:49in order to include all true predictors in the model.
  • 38:55And then as P diverges, as P increases to 2,500,
  • 39:01basically the weighted leverage score
  • 39:03will miss, on average, about 1.3 true predictors from the model,
  • 39:07and every method has a really hard time
  • 39:11identifying all the true predictors.
  • 39:14They have really large minimum model size.
  • 39:20So this is the basic performance of using
  • 39:25the weighted leverage score to perform variable screening
  • 39:30under the general index model.
  • 39:31So I only present two examples here;
  • 39:34if you are interested in other scenarios,
  • 39:36we can talk about that after the talk.
  • 39:42Okay so let's get back to our real data example.
  • 39:47So in a motivating example, as I mentioned,
  • 39:49we utilize this spatial transcriptomics data.
  • 39:53We are sequencing the grid points within each section, okay?
  • 39:57Basically these locations are invasive cancer areas,
  • 40:02the other areas are the noninvasive cancer areas,
  • 40:06and then these are the normal areas.
  • 40:08So how to determine the invasive, non-invasive
  • 40:11and the normal area?
  • 40:13These are determined by qualified doctors
  • 40:17and they're utilizing some pathological information
  • 40:20of these locations.
  • 40:22Okay in general, for these two sections,
  • 40:24we have identified 518 locations:
  • 40:2864 are invasive areas and 73 are noninvasive areas,
  • 40:34and the rest of the locations, 381, are normal.
  • 40:40And we have about 3,572
  • 40:46gene expressions across the sections.
  • 40:51Okay so in general, basically we have a data matrix
  • 40:54that is about 518 by 3,572.
  • 41:00So we are trying to identify biomarkers, okay,
  • 41:04among those 3,572 genes,
  • 41:06that can help us discriminate between invasive cancer,
  • 41:10non-invasive cancer, and normal areas.
  • 41:14So we utilize the weighted leverage score,
  • 41:18we apply the weighted leverage score screening procedure
  • 41:20to this data set.
  • 41:22And we identified around 225 genes
  • 41:27among all those P genes.
  • 41:31In the plot, a heat map here shows the results.
  • 41:34Just for the ease of presentation,
  • 41:37I only printed around 20 genes here,
  • 41:40those with the top, say, weighted leverage scores,
  • 41:45and you can see that there are certain patterns here.
  • 41:48This group of genes is more highly expressed
  • 41:51in the non-invasive cancer areas.
  • 41:54There are certain group of genes right here.
  • 41:57They are more highly expressed
  • 41:59in the invasive cancer areas.
  • 42:03Okay so this is the gene expression patterns
  • 42:07of those top 20 genes.
  • 42:13And then we also plot the expressions of those genes
  • 42:20in these sections.
  • 42:22Again, these are invasive areas,
  • 42:25noninvasive and the normal areas.
  • 42:27So we plot a group of genes,
  • 42:30I can't remember exactly which genes they are,
  • 42:33but these genes, you can see,
  • 42:37have a higher expression in those noninvasive cancer areas.
  • 42:42And we plot another group of genes,
  • 42:43basically these three genes;
  • 42:45in the sections, the expression is higher,
  • 42:48meaning that these three genes have higher expression,
  • 42:52okay, in the invasive cancer areas, okay?
  • 42:57Basically this means that the genes that we selected
  • 42:59show remarkable spatially differential expression patterns
  • 43:06across the tissue sections.
  • 43:11And later we do a, say, pathway analysis
  • 43:19and see that there are 47 functional classes
  • 43:24for all those 225 genes that we have identified.
  • 43:28And there are several cancer hallmarks; for example,
  • 43:3238 of the genes that we identified
  • 43:35are enriched in the regulation of apoptotic process.
  • 43:41This is a kind of cancer hallmark.
  • 43:44And then another 41 genes that we have identified
  • 43:47are involved in the regulation of cell death.
  • 43:52More specifically, because we are really interested
  • 43:54in the invasive cancer,
  • 43:57we identified these three,
  • 43:59there are three genes, for example,
  • 44:02in one particular regulation process,
  • 44:05that have many relations with breast cancer, okay?
  • 44:11And later we can further investigate, say,
  • 44:17those genes that are enriched
  • 44:20in the regulation of apoptotic process.
  • 44:32So, in summary, the weighted leverage score
  • 44:36that we have developed is a variable screening method
  • 44:40and it is developed under the general index model,
  • 44:45which is a very general model framework.
  • 44:49It can be used to address the curse of dimensionality
  • 44:52in regression, and also,
  • 44:55because we utilize both leverage scores,
  • 44:59the left leverage score and the right leverage score,
  • 45:01to evaluate a predictor's importance
  • 45:04in the general index model,
  • 45:07we provide a theoretical underpinning
  • 45:10to justify that, with both
  • 45:14the left and right leverage scores,
  • 45:16we can evaluate a predictor's importance.
  • 45:19Okay so this is kind of a new framework
  • 45:21for analyzing those numerical properties,
  • 45:25especially for the singular matrices,
  • 45:29under the general index model.
  • 45:32Okay so, this is basically a summary
  • 45:34of the weighted leverage score
  • 45:36and I wanna stop here and to see if anyone has any questions
  • 45:42or comments about weighted leverage score.
  • 45:56<v Robert>Questions?</v>
  • 46:00Anybody on Zoom have questions?
  • 46:04<v Student>Can I ask a quick question</v>
  • 46:05regarding this weighted leverage score?
  • 46:07So when we look at results,
  • 46:09this weighted leverage score
  • 46:10has much better performance
  • 46:11than the sliced inverse regression, the original one, right?
  • 46:14So I wonder, is this correct,
  • 46:17that the reason why it improves so much
  • 46:20is because it utilizes the information within X?
  • 46:23Maybe it would totally make sense in a lot of applications
  • 46:26that those important features
  • 46:28may be contributing more to,
  • 46:31or also leading to, the variation of other features,
  • 46:35and as a result, maybe could show up in the top
  • 46:39singular vectors of the design matrix.
  • 46:44Is this correct?
  • 46:46<v Professor Liu>Yeah, thank you very much</v>
  • 46:47for your question, it's a very good question.
  • 46:49So first of all, I want to clarify,
  • 46:53maybe I'm not very clear about SIRS.
  • 46:56Basically this is representing
  • 47:01the sure independence ranking and screening.
  • 47:01So it's also a method
  • 47:02that is based on the slicing versus regression.
  • 47:06So yeah, and the reason why the weighted leverage score
  • 47:10has a much better performance
  • 47:12compared to these two methods is basically,
  • 47:15one of the reasons is, because
  • 47:19the sure independence ranking and screening and the distance correlation
  • 47:23both just utilize the marginal correlation
  • 47:27between X and Y.
  • 47:30Okay so it does not utilize any of the information within X.
  • 47:34It's kind of a marginal correlation
  • 47:38between each X variable and Y, okay?
  • 47:41However, the weighted leverage score
  • 47:43will utilize both the row information
  • 47:46and also the variance information,
  • 47:49the correlation structure within the predictors;
  • 47:54the V matrix, as I mentioned,
  • 47:56is derived from the covariance structure of X.
  • 48:01The V matrix basically contains the eigenvectors
  • 48:03of the covariance structure, the covariance of X.
  • 48:06So it utilizes, say, a kind of correlation
  • 48:11among all the X variables in the variable screening.
  • 48:15So I'm not sure if this answers your question.
  • 48:18<v Student>Yeah, thank you.</v>
  • 48:19I think it is, you answered my question.
  • 48:22Essentially, I'm thinking that if those important features
  • 48:25are actually not the top contributors
  • 48:28to the top singular vectors,
  • 48:30then we wouldn't expect the weighted leverage score
  • 48:33to gain in the same way.
  • 48:36<v Professor Liu>Thank you.</v>
  • 48:42<v Robert>Any other questions</v>
  • 48:43anyone wants to bring up right now?
  • 48:51(students mumbling)
  • 48:56<v Vince>Can I ask a naive question?</v>
  • 48:58<v Professor Liu>Yes, Vince.</v>
  • 49:00<v Vince>So I'm wondering kind of, you know,</v>
  • 49:04when I think about doing SVD on data,
  • 49:06the first thing that I think of is easier
  • 49:09and I keep coming back to that,
  • 49:13and I can't tell if there's a relationship?
  • 49:17<v Professor Liu>Yeah, basically, right,</v>
  • 49:19we can generate U and V in many ways, right?
  • 49:22Using a regression model,
  • 49:24we can generate U and V as well, right?
  • 49:27We can generate, we can do,
  • 49:30so basically, I think a lot of the intuition
  • 49:32is that the weighted score uses the left and right
  • 49:34singular vectors, and we can use many different ways
  • 49:37to generate them.
  • 49:39Yeah, it's not really.
  • 49:40<v Robert>Thank you for that.</v>
  • 49:41All right, well then, if there's nothing further,
  • 49:43let's thank the teacher again.
  • 49:45<v Professor Liu>Thank you everyone for having me on.</v>
  • 49:51(students overlapping chatter)