YSPH Biostatistics Seminar: “A Model-free Variable Screening Method Based on Leverage Score”
September 30, 2022
Speaker: Yiwen Liu, PhD, Assistant Professor, Department of Epidemiology and Biostatistics, University of Arizona
- 00:01<v Robert>Hey everybody, it's noon,</v>
- 00:05so let's get started.
- 00:06So today I'm pleased to introduce Professor Yiwen Liu.
- 00:11Professor Liu earned her BS and MS in Statistics
- 00:13from the Central University of Finance and Economics
- 00:16in China and her PhD in Statistics
- 00:19from the University of Georgia.
- 00:21Today, she's an Assistant Professor of Practice
- 00:23in the Department of Epidemiology and Biostatistics
- 00:26at the Mel and Enid Zuckerman
- 00:29College of Public Health at the University of Arizona.
- 00:32Her research primarily focuses on developing
- 00:35statistical methods and theory
- 00:37to address a variety of issues in analyzing
- 00:39high dimensional data or complex data sets.
- 00:43More specifically, her research interests
- 00:46include developing model-free dimension reduction methods
- 00:48for high dimensional data regression
- 00:50and integration methods for multiple-source data.
- 00:53Today, she's gonna talk to us
- 00:54about a model-free variable screening method
- 00:57based on leverage score.
- 00:59Let's welcome Professor Liu.
- 01:04<v Professor Liu>Thank you, Robert</v>
- 01:05for your nice introduction and it's my great honor
- 01:08to be invited and present my work here.
- 01:12So in today's talk, I will introduce a model-free
- 01:16variable screening method based on leverage score,
- 01:19and we named the method the weighted leverage score.
- 01:22This is joint work
- 01:24with Dr. Wenxuan Zhong from the University of Georgia
- 01:27and Dr. Peng Zeng from Auburn University.
- 01:32So as we've heard, this is the big data era:
- 01:36numerous data are produced in almost every field
- 01:40of science, including biology.
- 01:42So we are facing data of extremely high dimensionality
- 01:48and also data with really complex structures,
- 02:04and how to effectively extract information
- 02:07from such large and complex data
- 02:10poses new statistical challenges.
- 02:12So to motivate my research, let us see an example first.
- 02:18So currently cancer has become
- 02:20one of the primary causes of death across the world.
- 02:24Nowadays cancer is diagnosed by an expert
- 02:28who has to look at the tissue samples under the microscope.
- 02:31You can imagine that there are millions
- 02:33of new cancer cases each year
- 02:36and this often means that those doctors
- 02:39will find themselves looking at hundreds of images each day.
- 02:44And this is really tedious work.
- 02:48And you may find that,
- 02:52because of the shortage of qualified doctors,
- 02:55there could be a huge lag time
- 02:57before those doctors can even figure out
- 02:59what is going on with the patient.
- 03:02So detecting cancer using only manpower,
- 03:05looking at images, is not enough.
- 03:08And we intend to build statistical
- 03:12and mathematical models to detect cancer
- 03:16in a more accurate, less expensive way.
- 03:22Okay, so second-generation sequencing
- 03:25makes this possible and promising.
- 03:29And so a typical research question,
- 03:33a critical inference, is to find the markers
- 03:36that are related to cancer.
- 03:38Right now there's a new sequencing technology
- 03:39called spatial transcriptomics.
- 03:42You know that for bulk RNA sequencing data,
- 03:46it just sequences the whole tissue
- 03:47and generates averaged gene expression data.
- 03:51But with this new technology, spatial transcriptomics,
- 03:55the cancer tissue will be sliced
- 03:57into several thin sections,
- 04:01and within each section, the grid points
- 04:06will be sequenced simultaneously.
- 04:08So you can see that here we have two areas
- 04:11of invasive cancer, okay;
- 04:14all the grid points within these two areas
- 04:17are invasive cancer areas.
- 04:21The other six areas are noninvasive cancer areas,
- 04:25so the grid points in these locations
- 04:29will be noninvasive cancer areas,
- 04:32and the other parts are normal, okay?
- 04:35And for the data that we will have,
- 04:37because this new technology
- 04:40will sequence the whole tissue,
- 04:43all those grid points, simultaneously,
- 04:45in the data matrix that we will have,
- 04:47each row corresponds to a location
- 04:50within the section,
- 04:53and each column corresponds to the expressions
- 04:56of a certain gene.
- 04:59And with a data matrix like this,
- 05:01the Y's are labels for those locations,
- 05:05normal, noninvasive or invasive,
- 05:07and we will get gene expressions for all those P genes
- 05:12for each location, okay?
- 05:15So this is the data that we have
- 05:17and our goal then is to identify marker genes
- 05:23for those noninvasive and invasive cancer areas.
- 05:27As shown in this figure, these are the tissue sections.
- 05:33There are points, colored dots, here,
- 05:37with the color of the dots showing the expression levels.
- 05:42Okay?
- 05:44So the dots with a yellow color
- 05:47show a higher expression.
- 05:50We intend to build models
- 05:52to identify such genes,
- 05:55genes that show remarkable
- 05:58differential expression levels across tissue sections, okay?
- 06:03These two genes have higher expression
- 06:06in invasive cancer areas.
- 06:10Okay, we intend to build a statistical model
- 06:13to identify such genes,
- 06:14but there exist several challenges here.
- 06:17Usually in the data that we have,
- 06:20the number of samples, or the locations here,
- 06:22our labeled data, is only around hundreds,
- 06:25but the number of genes could be tens of thousands.
- 06:30This is the so-called large-P-small-N problem.
- 06:35Usually, for traditional methods,
- 06:38there's no way to utilize them
- 06:41to solve this problem.
- 06:44And as I mentioned, there is a further layer
- 06:46of complication between the gene expression levels
- 06:49and the cancer or normal types, okay?
- 06:53How the gene expression levels
- 06:56could affect different types of cancer,
- 07:00this mechanism is largely unknown,
- 07:02and the association between them could be beyond linear.
- 07:06So these are the two challenges.
- 07:09That means that
- 07:12what we need is a statistical method
- 07:15that can do variable screening
- 07:19in a more general model setup.
- 07:26So to achieve this goal,
- 07:29we choose to build our method
- 07:32under this so-called general index model.
- 07:35The general index model describes a scenario where Y-i,
- 07:40which is the response,
- 07:41is related to K linear combinations of X-i,
- 07:47that is, beta one transpose X-i
- 07:49to beta K transpose X-i,
- 07:51through some unknown function F.
- 07:57and we know that here, X-i is a P directional vector
- 08:03and if K is a value that is much smaller than P,
- 08:09then we actually achieved the goal of vanishing reduction
- 08:13because the original P directional vector
- 08:16is projected onto a space of a pay dimensional,
- 08:22pay beta one X, beta one transpose X-i
- 08:24to beta eight transpose X-i.
- 08:27And we choose this general index model
- 08:30because it actually is a very general model framework.
- 08:34If we map this general index model to our problem here,
- 08:38the Y-i could be the label for location i.
- 08:43And, for example, the non-invasive location.
- 08:48And then X-i is a P dimensional vector
- 08:50and could be the gene expression levels
- 08:52for location i of those P genes,
- 08:57and then beta one transpose X-i
- 08:58to beta K transpose X-i
- 09:01could be those K groups
- 09:05of coregulated genes.
- 09:07And those K groups of coregulated genes
- 09:10will affect the response through some unknown function F.
- 09:16Okay so this is our general model setup.
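For reference, the model just described can plausibly be written as follows (standard multi-index notation; the slide's exact symbols may differ):

```latex
% General index model: Y depends on X only through K indices
% and an unknown function f.
Y_i = f\big(\beta_1^\top X_i,\ \ldots,\ \beta_K^\top X_i,\ \varepsilon_i\big),
\qquad X_i \in \mathbb{R}^{P},\quad K \ll P .
```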
- 09:24We utilize the general index model
- 09:26because it's a general model framework
- 09:29that encompasses many different model types.
- 09:33There are three special cases.
- 09:35For example, the linear model is one special case:
- 09:40here, that is when K equals one,
- 09:45the unknown function F acquires an identity form,
- 09:50okay, so this is the linear model,
- 09:52and the error term is additive.
- 09:55So the linear model is one special case of it.
- 09:58The nonparametric model is another special case
- 10:01of the general index model;
- 10:03that is where K equals P and beta one to beta P
- 10:07form an identity matrix.
- 10:14And then the third one,
- 10:15the single index model,
- 10:16is another special case of the general index model;
- 10:21that is when K equals one
- 10:22and the error term is additive.
- 10:26So the reason that I show these three special cases
- 10:28is just to let everyone know that the general index model
- 10:32is a very general model framework.
- 10:36In this case, using this model framework
- 10:39to do variable screening or variable selection,
- 10:50whether we should keep
- 10:51or whether we should remove certain variables
- 10:54is determined by the coefficients here.
- 10:57Say, for a specific variable,
- 11:02if its coefficients across those K
- 11:04different directions are all zero,
- 11:08then we say this variable is redundant.
- 11:13Okay, so this is how we utilize the model
- 11:16and the estimated coefficients
- 11:18to do variable screening or, say, variable selection.
- 11:27So the question becomes how can we estimate beta
- 11:30under this model framework, right?
- 11:32Just like estimating beta
- 11:34in a simple linear regression model.
- 11:37So let's see a simple case,
- 11:39that is, when the F function is invertible.
- 11:44So that is to say,
- 11:45the model becomes F inverse of Y-i
- 11:49equal to beta transpose X-i plus epsilon-i.
- 11:53Okay, and this looks like a linear model, right?
- 11:57And if we want to estimate beta,
- 11:59we can just simply maximize the correlation
- 12:03between F inverse of Y-i and beta transpose X-i.
- 12:07Using this optimization problem we can recover beta, okay,
- 12:13given that F is invertible and the F function is known.
- 12:19But we know that in real cases, the F function is unknown
- 12:24and sometimes it is not invertible.
- 12:26And then what can we do to estimate beta
- 12:31when the F function is unknown?
- 12:34So when the F function is unknown,
- 12:37we can consider all the transformations of Y-i,
- 12:40and we can solve for beta
- 12:41through the following optimization problem.
- 12:45We consider all transformations of Y-i,
- 12:48denoted as T of Y-i,
- 12:51and we define R square of eta,
- 12:54which is a function of eta, as the maximized correlation
- 12:59between T of Y-i and eta transpose X-i,
- 13:05and this maximization is taken over
- 13:07any transformations T, okay?
- 13:13So using this function, beta basically is the solution
- 13:17for this maximization problem,
- 13:20and with certain conditions satisfied,
- 13:23we can simplify this objective function
- 13:27with respect to eta: we can transform
- 13:34R square of eta into this really nice quadratic form.
- 13:38Okay, in the numerator, it's eta transpose times, okay,
- 13:43this variance of the conditional expectation times eta,
- 13:47and in the denominator, it's eta transpose times
- 13:50the variance of X-i times eta.
- 13:52This is a very nice form.
- 13:56Basically, the solution, beta,
- 14:02is just taking the eigenvectors, okay,
- 14:06corresponding to the K largest eigenvalues
- 14:09of this matrix in the middle.
- 14:13That's how we solve for beta in this case.
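Written out, that objective plausibly takes the following form (my reconstruction, following the maximal-correlation and sliced inverse regression literature; the slide's exact notation may differ):

```latex
R^2(\eta)
  = \max_{T}\ \operatorname{corr}^2\!\big(T(Y_i),\ \eta^\top X_i\big)
  = \frac{\eta^\top \operatorname{Var}\!\big(\mathrm{E}[X_i \mid Y_i]\big)\,\eta}
         {\eta^\top \operatorname{Var}(X_i)\,\eta},
```

so beta one to beta K come out as the top-K eigenvectors of Var(X) inverse times Var(E[X given Y]), which is exactly where the two P by P matrices and the matrix inverse discussed next come in.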
- 14:16But as we know, as I mentioned,
- 14:18there are some really big challenges here.
- 14:21One is that we have a really large P here:
- 14:24the sigma X is a P by P matrix,
- 14:27and sigma of X given Y is also a P by P matrix.
- 14:31And we know that we are dealing with a case
- 14:34where P is larger than N; in this scenario,
- 14:38it would be really difficult
- 14:40to generate a consistent estimate
- 14:45when P is larger than N.
- 14:48And we also have this inverse here; for a very large matrix,
- 14:53it would be really time consuming
- 14:55to produce the inverse of the matrix,
- 14:58let alone that this matrix is not a consistent estimate, okay?
- 15:02So this matrix in the middle,
- 15:04if we want to estimate it
- 15:06in the P larger than N scenario,
- 15:08it would be really problematic, right?
- 15:13And in the following, I will show how we're gonna use
- 15:18the weighted leverage score, the method that we proposed,
- 15:21to bypass the estimation of these two matrices
- 15:25and then perform the variable selection
- 15:28and dimension reduction under the general index model.
- 15:35So we call our method the weighted leverage score.
- 15:39Let us first take a look at what is leverage score
- 15:43and what is weighted leverage score, okay?
- 15:46So let's consider a simple case,
- 15:49that is, the linear regression model,
- 15:51and write the rank-D singular value decomposition of X
- 15:55as X equal to U Lambda V-transpose.
- 15:59This is the singular value decomposition of X.
- 16:03And then, so in statistics,
- 16:05the leverage score basically is defined as
- 16:08the diagonal element of the hat matrix.
- 16:14And then the hat matrix,
- 16:15if we plug in the singular value decomposition,
- 16:18can be simplified to UU transpose,
- 16:22which means that the diagonal element
- 16:25of the hat matrix basically is the row norm
- 16:29of the U matrix, okay?
- 16:34And then actually this leverage score
- 16:37has a very good interpretation:
- 16:40it is the partial derivative of Y-i hat with respect to Y-i.
- 16:45Okay, which means that if the leverage score
- 16:48is larger and closer to one,
- 16:51the point would be more influential in predicting Y-i hat.
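As a concrete illustration, here is a minimal numpy sketch (my own, not the speaker's code) showing that the leverage scores, the squared row norms of U, equal the diagonal of the hat matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))      # n = 100 observations, p = 5 predictors

# Thin SVD: X = U @ np.diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Leverage score of observation i: squared row norm of U
leverage = np.sum(U**2, axis=1)

# Same thing via the hat matrix H = X (X'X)^{-1} X' = U U'
H = X @ np.linalg.solve(X.T @ X, X.T)
assert np.allclose(leverage, np.diag(H))
```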
- 16:57So there is a recent work by Dr. Ping Ma,
- 17:01who is using this leverage score to do sub-sampling
- 17:05in big data.
- 17:07As you can see, again,
- 17:09the message here is that if the U-i norm is larger,
- 17:13if the leverage score is larger,
- 17:14then we say this point is more influential.
- 17:19The motivating example is like this:
- 17:21in the first figure, these black dots,
- 17:25they are the original data,
- 17:29and the solid black line is the actual model.
- 17:35And if we want to do a linear regression,
- 17:37usually, sometimes with really big data,
- 17:39when the data is really large,
- 17:41it is hardly possible to utilize all the data points
- 17:44to generate the line here.
- 17:49So a typical strategy is just to do sub-sampling
- 17:54from such big data
- 17:56and then fit a linear regression model.
- 17:59Here, you can see that the regression line
- 18:02produced by a random subsample from the population,
- 18:08those data represented by the green crosses,
- 18:12is a linear regression line
- 18:17that largely deviates from the true model.
- 18:21So that is a case where random sampling does not work.
- 18:28However, if we do the sub-sampling
- 18:33according to the leverage scores, okay,
- 18:36you will see that, in the second graph,
- 18:38these red crosses are the data subsampled
- 18:44utilizing the so-called leverage score,
- 18:48and the red dashed line is the model
- 18:52fitted on those subsamples, okay?
- 18:55So we can see that using the leverage score
- 18:59to subsample can help us to generate a line
- 19:03that can approximate the true model very well.
- 19:09So what I want to say using these graphs
- 19:10is that the leverage score, the U-i norm,
- 19:14can be used, say, as an indicator
- 19:19of how influential the data point is to the prediction.
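A toy version of that subsampling idea, as a sketch (the data and the plain unweighted fit are my assumptions; practical leveraging estimators often also reweight the subsample):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
x = rng.standard_normal(n)
y = 2.0 * x + rng.standard_normal(n)          # true slope is 2

X = np.column_stack([np.ones(n), x])          # design with intercept
U, _, _ = np.linalg.svd(X, full_matrices=False)
lev = np.sum(U**2, axis=1)

# Keep 100 points, sampled with probability proportional to leverage
idx = rng.choice(n, size=100, replace=False, p=lev / lev.sum())
beta_hat, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
print(beta_hat)                               # roughly [0, 2]
```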
- 19:25Okay, so the U-i norm is the row norm of the left singular matrix,
- 19:32but we are talking about variable selection,
- 19:34so the U-i norm can be used to select the rows.
- 19:38Intuitively, to select the columns of X,
- 19:43we can just take the transpose of X,
- 19:45so X transpose equals V Lambda U transpose.
- 19:49To select the columns of X
- 19:52basically is to select the rows of X transpose.
- 19:55Intuitively we can just use the row norm of the V matrix,
- 19:59which is the right singular matrix, to do the selection,
- 20:03to select the influential columns of X, okay, right?
- 20:08And then, so we call
- 20:13U the left singular matrix,
- 20:16V the right singular matrix,
- 20:18and we call the row norm of U the left leverage score,
- 20:24and the row norm of V the right leverage score.
- 20:30So what I want to say using the previous two slides
- 20:34is that basically, the row information,
- 20:36intuitively, the row information is contained in the U matrix,
- 20:40the column information of X
- 20:41is contained in the V matrix,
- 20:44and we know that there is a further complication
- 20:47between X and Y, which is the unknown link function F.
- 20:50So how do we utilize the information from the column,
- 20:54from the row, and also the unknown function
- 20:57to generate a single method
- 21:01that can help us do the variable selection,
- 21:07that is, to select influential columns of X?
- 21:13Okay so, let us get back to the matrix
- 21:16we derived in the previous slides.
- 21:18We have the variance of the conditional expectation
- 21:22in the numerator
- 21:25and the variance of X in the denominator.
- 21:27And with simple statistics,
- 21:29this can be simplified to the variance of the expectation
- 21:33of Z given Y, where Z is the standardized X.
- 21:38Okay, further, if we use the singular value decomposition,
- 21:41Z can be simplified to UV transpose,
- 21:46and then its ij-th element basically
- 21:48is the inner product of U-i and V-j,
- 21:52so it basically contains both row information
- 21:56and column information.
- 21:58And then we proposed the weighted leverage score,
- 22:01which is defined in this equation.
- 22:06And the interpretation of the W-j,
- 22:09which is the weighted leverage score for the j-th predictor,
- 22:12is threefold.
- 22:14First of all, you can see it contains
- 22:16both the column information and the row information of X.
- 22:21And then, second,
- 22:27you can see the matrix in the middle basically contains
- 22:30the information from the unknown function F,
- 22:33because we have the conditional expectation here,
- 22:36the expectation of U-i given Y;
- 22:38basically it's a kind of reflection
- 22:41of the unknown function F.
- 22:46And third, this method is built
- 22:49under the general index model,
- 22:51and it is model free,
- 22:53in the sense that the general index model
- 22:55encompasses many different model types.
- 22:59Okay, so this is kind of a population version
- 23:04of the weighted leverage score.
- 23:06In terms of estimation,
- 23:07you will see we only need to estimate
- 23:09the matrix in the middle,
- 23:11which is the variance of the expectation of U-i given Y.
- 23:17To estimate this matrix,
- 23:20we can see that U-i is actually D dimensional,
- 23:24because this is a rank-D singular value decomposition;
- 23:27U-i is a D dimensional vector.
- 23:29Y is only one dimensional,
- 23:31so this is a function of a one dimensional variable.
- 23:33So it can be easily approximated by dividing, okay,
- 23:38the range of Y into H slices,
- 23:43and within each slice, okay,
- 23:46within each slice we calculate the slice mean
- 23:49over the rows of U.
- 23:54As illustrated in this graph,
- 23:55we can first slice Y into H slices,
- 23:59and then within each slice, for those Y-i
- 24:02that fall into the same slice,
- 24:03we find out their corresponding U's
- 24:06and then calculate the mean of those U's.
- 24:14And in that way,
- 24:16we can simply estimate the expectation of U-i given Y,
- 24:21and then further we can estimate the variance
- 24:23of those averages,
- 24:27basically just taking the variance of U-one-bar
- 24:30to U-H-bar.
- 24:37So this is how we estimate the variance
- 24:40of the expectation of U-i given Y.
- 24:45Okay, so in that way we actually generate our estimated
- 24:48weighted leverage score,
- 24:50which we define as the right leverage score
- 24:53weighted by the matrix in the middle.
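Putting those steps together, here is a minimal sketch of how such an estimate could be computed (my own illustration based on the description in this talk; the paper's exact estimator, scaling, and slicing scheme may differ, and the singular values are dropped as in the Z approximately UV-transpose simplification above):

```python
import numpy as np

def weighted_leverage_scores(X, y, d, n_slices=10):
    """Illustrative weighted leverage scores: the right leverage
    weighted by a sliced estimate of Var(E[U | Y])."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize X

    # Rank-d SVD of the standardized matrix
    U, _, Vt = np.linalg.svd(Z, full_matrices=False)
    U, V = U[:, :d], Vt[:d].T                  # U: n x d, V: p x d

    # Slice y into groups and average the rows of U within each slice
    order = np.argsort(y)
    slice_means = np.array([U[idx].mean(axis=0)
                            for idx in np.array_split(order, n_slices)])
    M = np.cov(slice_means, rowvar=False, bias=True)   # d x d middle matrix

    # w_j = V_j' M V_j: right leverage weighted by M
    return np.einsum('jd,de,je->j', V, M, V)

# Screening: rank predictors by score and keep the top ones, e.g.
# w = weighted_leverage_scores(X, y, d=5)
# top_predictors = np.argsort(w)[::-1]
```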
- 24:58Okay, so first of all, this weighted leverage score
- 25:02is built, say, upon the general index model.
- 25:08It is considered as model free
- 25:10because Y-i, the response, is connected with
- 25:14linear combinations of X through the unknown function F,
- 25:19and this model is general enough to encompass
- 25:21many different model types.
- 25:23So we can consider it as model free.
- 25:26And second, to generate this weighted leverage score,
- 25:31there is no need to estimate the covariance matrix
- 25:36and there is no need to estimate the unknown function F.
- 25:40So we can bypass all those procedures
- 25:44to calculate this weighted leverage score.
- 25:52So this weighted leverage score
- 25:53actually has a very good feature,
- 25:56that is, it is an indicator
- 25:57of how influential the columns are,
- 26:01and we can basically rank our predictors
- 26:04according to the weighted leverage score.
- 26:06The higher the score is,
- 26:07the more influential the predictor will be.
- 26:10And later I will show why this ranking property
- 26:14holds for the weighted leverage score.
- 26:17So this is the basic procedure
- 26:20for using the weighted leverage score
- 26:22to do variable selection or variable screening.
- 26:25So given that we have this data matrix,
- 26:28which is an N by P matrix, and we have the responses,
- 26:31the labels, for each of the locations,
- 26:34we only need one singular value decomposition, okay?
- 26:39This is a rank-D singular value decomposition of X.
- 26:43And then we can just calculate the weighted leverage scores
- 26:45according to the equations,
- 26:48rank those weighted leverage scores
- 26:50from the highest to the lowest,
- 26:52and select the predictors
- 26:54with the highest weighted leverage scores.
- 26:57This is the basic screening procedure
- 26:59using the weighted leverage score,
- 27:01and there are still implementation issues
- 27:03that we will address later.
- 27:05The first one: how can we determine the number D?
- 27:08So given the data, which is N by P, how can we determine,
- 27:12say, how many spiked singular values
- 27:17to be included in the model?
- 27:20And then the second implementation issue
- 27:23is to determine the number of variables
- 27:25to be selected in the model.
- 27:29So you can see that the weighted leverage score
- 27:32screening procedure only includes
- 27:34one singular value decomposition, so it is quite efficient.
- 27:41Okay so, next, let us use one slide
- 27:47to discuss the ranking property
- 27:49of the weighted leverage score.
- 27:51So as I mentioned, the weighted leverage score
- 27:53has a very nice property,
- 27:55and it is guaranteed by the theorem here.
- 27:58We show that, given certain conditions are satisfied,
- 28:02the minimum value of the weighted leverage score
- 28:07among the true predictors
- 28:09will always rank higher than the maximum value,
- 28:12the maximum weighted leverage score,
- 28:14of those redundant predictors.
- 28:18And this holds for the population weighted leverage score.
- 28:24In terms of the estimated weighted leverage score,
- 28:27we utilize a two step procedure.
- 28:29First of all,
- 28:30we show that the estimated weighted leverage score
- 28:34is very close to the population version
- 28:37of the weighted leverage score.
- 28:39Okay, and then the estimated weighted leverage scores
- 28:44will also have the ranking property,
- 28:48that is, the estimated weighted leverage score
- 28:50of the active predictors, or important predictors,
- 28:54ranks higher than the estimated weighted leverage score
- 28:58of the redundant predictors with probability tending to one.
- 29:05Okay, so the ranking property
- 29:08of the weighted leverage score
- 29:09basically is guaranteed by these two properties.
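In symbols, the two statements just described can be written roughly as follows (my paraphrase of the stated theorem, with A the set of true predictors, and w and w-hat the population and estimated scores):

```latex
\min_{j \in \mathcal{A}} w_j \;>\; \max_{j \notin \mathcal{A}} w_j,
\qquad
\Pr\Big( \min_{j \in \mathcal{A}} \hat{w}_j \;>\; \max_{j \notin \mathcal{A}} \hat{w}_j \Big) \;\longrightarrow\; 1 .
```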
- 29:16And then further, we also know
- 29:18that there are two implementation issues.
- 29:21The first one is to determine
- 29:22the number of spiked singular values D.
- 29:25So how many singular values do we need to include in our model?
- 29:29This is a question and it's quite crucial,
- 29:32because we need to know how many of the signals
- 29:37are contained in the data, and we need to remove
- 29:40all the redundant or noise information.
- 29:43Okay, so here I develop a criterion
- 29:46based on the properties of those eigenvalues.
- 29:52D of R is a function of R, where R is the number of singular values
- 29:56to be included in the model,
- 29:59and theta-i hat
- 30:02is the ratio between the i-th eigenvalue
- 30:08and the largest eigenvalue, lambda-one hat.
- 30:12And then, you can see that as we include more, say,
- 30:17singular values in the model,
- 30:19the first term will decrease,
- 30:22and then, at some point,
- 30:24the decrease of the first term
- 30:28is smaller than the increase of the second term,
- 30:30and D of R starts to increase.
- 30:34And then we can use the criterion to find D hat,
- 30:38and we show that D hat is very close to the true D.
- 30:41Okay, so using this criterion,
- 30:43we can select the true number of signals in the model.
- 30:51And the second implementation issue
- 30:53is about how many predictors, how many true predictors,
- 30:58we need to include in our model.
- 31:01Okay, again, we rank our weighted leverage scores,
- 31:05and here we utilize a criterion
- 31:10based on the properties of the weighted leverage score.
- 31:12Okay, as we include more predictors into the active set,
- 31:20the summation of the weighted leverage scores will increase,
- 31:22but the increment will decrease,
- 31:25and then the second term will increase, okay?
- 31:27So, as we include more predictors in the model,
- 31:33there's some change point
- 31:35where the increment is smaller than the increment
- 31:43of the second penalty term.
- 31:46We show that the set we selected using this criterion,
- 31:52which is A hat, will always, we can say, include
- 31:56all the true predictors with probability tending to one.
- 32:03Okay, so that's how we use this criterion
- 32:05to determine the number of predictors
- 32:07to be selected in the model.
- 32:10Okay in the next step,
- 32:11let me show some empirical study results
- 32:14of using the weighted leverage score
- 32:17to do the variable selection in the model.
- 32:22In example one, as I mentioned,
- 32:24we are proposing our method
- 32:28under the general index model framework,
- 32:31so the first model is a general index model.
- 32:35Here Y involves two directions.
- 32:37The first one is in the numerator,
- 32:40so this is the so-called beta one transpose X.
- 32:43The second direction, beta two transpose X,
- 32:45is in the denominator,
- 32:49the term within the parentheses;
- 32:50this is beta two transpose X.
- 32:54Okay, so this is called a general index model,
- 32:56and we assume that X is generated from
- 33:03a multivariate normal distribution
- 33:05with mean zero and a covariance structure
- 33:09like this, okay?
- 33:11We let rho equal to 0.5,
- 33:13which will generate a matrix with moderate correlations.
- 33:21So in this way, we generate both X and Y;
- 33:24let's see the performance of variable selection
- 33:29using the weighted leverage score.
- 33:31In our scenarios, we let N equal to 1000
- 33:34and rho equal to 0.5.
- 33:37In example one, there are four different scenarios:
- 33:42for scenario one, we let P be 200,
- 33:46and then we increase P to 2,200 and 2,500.
- 33:54We also increase the variance of the error term
- 33:58across the scenarios.
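To make the setup concrete, here is one way data of this shape could be generated (my sketch: the AR(1)-type covariance, the specific active predictors, coefficients, and link are assumptions; the ratio form, one direction in the numerator and one inside the denominator, resembles a well-known benchmark from the sliced inverse regression literature):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, rho = 1000, 200, 0.5

# Covariance with moderate correlation: Sigma_ij = rho^|i-j|
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Two true directions over six active predictors (hypothetical choice)
beta1 = np.zeros(p); beta1[:3] = 1.0
beta2 = np.zeros(p); beta2[3:6] = 1.0

# General index model: beta1'X in the numerator, beta2'X inside
# the denominator, plus additive noise
y = (X @ beta1) / (0.5 + (X @ beta2 + 1.5) ** 2) \
    + 0.5 * rng.standard_normal(n)
```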
- 34:02Okay, there are three criteria we used
- 34:04to evaluate the performance of the methods.
- 34:08The first one is the false positive,
- 34:10which means how many of the variables are falsely selected.
- 34:15The false negative shows how many variables
- 34:18are falsely excluded, that is, how many true predictors
- 34:21are falsely excluded.
- 34:24The last criterion
- 34:26is basically the minimum model size,
- 34:30because we have this ranking property
- 34:33for all the methods here, okay?
- 34:36All those methods have ranking properties,
- 34:38so I want to know how many variables I need to include
- 34:43in the model so that all true predictors
- 34:45are included, because we have this ranking property, okay?
- 34:50You will see that the weighted leverage score
- 34:54basically has a better performance
- 34:56in terms of the false positive and the false negative
- 35:01and also the model size.
- 35:05I want to say a bit more about the model size.
- 35:09We can see that when N is 1000 and P is 200,
- 35:13the minimum model size is 6.14, meaning that, since in total
- 35:18we only have six true predictors, okay,
- 35:21we only need to include around 6.14 variables
- 35:23on average to encompass all true predictors.
- 35:29Basically all those six variables rank higher
- 35:32than all the other null variables, right?
- 35:35In the second scenario, when P increases to 2,200,
- 35:40we only need around seven predictors,
- 35:43on average, in order to include
- 35:45all the true predictors,
- 35:46meaning that, overall, all the true predictors
- 35:51rank higher than those redundant predictors.
- 35:55And then when sigma increases and when P increases,
- 35:58we still need only a minimal number of variables
- 36:03to include all true predictors.
- 36:07Okay, so this is for the first example,
- 36:09the general index model.
- 36:12The second example is more challenging;
- 36:14it's called a heteroscedastic model.
- 36:17You will find that, in the first model,
- 36:19those active predictors only influence
- 36:24the mean response, okay,
- 36:26they only influence Y in its mean.
- 36:28But here, you can see,
- 36:31for those variables, the average of Y,
- 36:33the mean of Y, is zero,
- 36:34because the error term is in the numerator,
- 36:37and those active predictors will influence Y
- 36:42in its variance.
- 36:45So it's a much more challenging case.
- 36:49In this case, we also assume that X follows
- 36:52a multivariate normal distribution
- 36:54with mean zero and a covariance structure like this.
- 37:00In our scenarios, we let N equal to 1,000.
- 37:03So let's see, in a heteroscedastic model,
- 37:06what are the behaviors of our methods?
- 37:11Okay, sorry, I forgot to introduce the methods
- 37:14that we are comparing with: sure independence ranking
- 37:18and screening, and the distance correlation.
- 37:20So both of these methods can be utilized
- 37:23to measure, say, the association between the response
- 37:28and the predictors,
- 37:30and both of the methods have the ranking property,
- 37:34so we can compare the minimum model size
- 37:36for all the methods.
- 37:39Regarding the false positive, for these two methods
- 37:42there's no criterion proposed
- 37:43to select the number of predictors,
- 37:46so in all those methods,
- 37:48I just use a hard thresholding
- 37:50to select the number of predictors
- 37:52to be included in the model.
- 37:55Okay let's see, in a heteroscedastic model,
- 37:58what are the behaviors of those three methods?
- 38:03Okay, when we have N greater than the number of predictors
- 38:08P here, you will find that there is a slightly larger
- 38:12false negative for the weighted leverage score.
- 38:16This is because, with the hard threshold, both of the other methods
- 38:20select around 140, more than 140,
- 38:25predictors out of 200 in the model.
- 38:29So that's why they have very small false negatives,
- 38:33but if you look at the minimum model size,
- 38:36you will find that our weighted leverage score
- 38:39still maintains a very good performance.
- 38:42Okay, it has a smaller value of the minimum model size;
- 38:46in general, we only need 46 variables
- 38:49in order to include all the true predictors in the model.
- 38:55And then as P diverges, say as P increases to 2,500,
- 39:01basically the weighted leverage score
- 39:03will miss around 1.3 true predictors on average,
- 39:07and every method has a really hard time
- 39:11identifying all the true predictors;
- 39:14they have really large minimum model sizes.
- 39:20So this is the basic performance of using
- 39:25the weighted leverage score to perform variable screening
- 39:30under the general index model.
- 39:31I only present two examples here;
- 39:34if you are interested in other scenarios,
- 39:36we can talk about that after the talk.
- 39:42Okay, so let's get back to our real data example.
- 39:47So in the motivating example, as I mentioned,
- 39:49we utilize this spatial transcriptomics data.
- 39:53We are sequencing the grid points within each section, okay?
- 39:57Basically these locations are invasive cancer areas,
- 40:02the other areas are the noninvasive cancer areas,
- 40:06and then these are the normal areas.
- 40:08So how do we determine the invasive, noninvasive
- 40:11and the normal areas?
- 40:13These are determined by qualified doctors
- 40:17utilizing some pathological information
- 40:20of these locations.
- 40:22Okay, in general, for these two sections,
- 40:24we have identified 518 locations:
- 40:2864 are invasive areas and 73 are noninvasive areas,
- 40:34and the rest of the locations, 381, are normal.
- 40:40And we have about 3,572
- 40:46gene expressions across the sections.
- 40:51Okay, so in general, basically we have our data matrix,
- 40:54which is about 518 times 3,572.
- 41:00So we are trying to identify biomarkers, okay,
- 41:04among those 3,572 genes,
- 41:06that can help us discriminate between invasive cancer,
- 41:10noninvasive cancer and normal areas.
- 41:14So we utilized the weighted leverage score:
- 41:18we applied the weighted leverage score screening procedure
- 41:20to this data set,
- 41:22and we identified around 225 genes
- 41:27among all those P genes.
- 41:31In this plot, a heat map here shows the results.
- 41:34Just for the ease of presentation,
- 41:37I only printed around 20 genes here,
- 41:40those with the top, say, weighted leverage scores.
- 41:45You can see that there are certain patterns here.
- 41:48This group of genes is more highly expressed
- 41:51in the noninvasive cancer areas,
- 41:54and there is a certain group of genes right here
- 41:57that are more highly expressed
- 41:59in the invasive cancer areas.
- 42:03Okay, so these are the gene expression patterns
- 42:07of those top 20 genes.
- 42:13And then we also plot the expressions of those genes
- 42:20in these sections.
- 42:22Again, these are invasive areas,
- 42:25noninvasive and the normal areas.
- 42:27So we plot a group of genes,
- 42:30I can't remember exactly which genes they are,
- 42:33but these genes, you can see,
- 42:37have a higher expression in those noninvasive cancer areas.
- 42:42And we plot another group of genes,
- 42:43basically these three genes,
- 42:45in the sections, and the expression shows
- 42:48that these three genes have higher expression,
- 42:52okay, in the invasive cancer areas, okay?
- 42:57Basically this means that the genes that we selected
- 42:59show remarkable spatially differential expression patterns
- 43:06across the tissue sections.
- 43:11And later we did a, say, pathway analysis
- 43:19and saw that there are 47 functional classes
- 43:24for all those 225 genes that we have identified,
- 43:28and there are several cancer hallmarks. For example,
- 43:3238 of the genes that we identified
- 43:35are enriched in the regulation of the apoptotic process;
- 43:41this is a kind of cancer hallmark.
- 43:44And then another 41 genes that we have identified
- 43:47are involved in the regulation of cell death.
- 43:52More specifically, because we are really interested
- 43:54in the invasive cancer,
- 43:57we identified these three genes, for example,
- 44:02in the regulation of the apoptotic process;
- 44:05they have many relations with breast cancer, okay?
- 44:11And later we can further investigate, say,
- 44:17even the annotation of those genes
- 44:20that are enriched in the regulation of the apoptotic process.
- 44:32So, in summary, the weighted leverage score
- 44:36that we have developed is a variable screening method,
- 44:40and it is developed under the general index model,
- 44:45which is a very general model framework.
- 44:49It can be used to address the curse of dimensionality
- 44:52in regression. And also,
- 44:55because we utilize both leverage scores,
- 44:59the left leverage score and the right leverage score,
- 45:01to evaluate a predictor's importance
- 45:04in the general index model,
- 45:07we provide a theoretical underpinning
- 45:10to justify that, using both
- 45:14the left and right leverage scores,
- 45:16we can evaluate the predictors' importance.
- 45:19Okay, so this is kind of a new framework
- 45:21for analyzing those numerical properties,
- 45:25especially of the singular matrices,
- 45:29under the general index model.
- 45:32Okay so, this is basically a summary
- 45:34of the weighted leverage score,
- 45:36and I wanna stop here to see if anyone has any questions
- 45:42or comments about the weighted leverage score.
- 45:56<v Robert>Questions?</v>
- 46:00Anybody on Zoom have questions?
- 46:04<v Student>Can I ask a quick question</v>
- 46:05regarding this weighted leverage score?
- 46:07So when we look at the results,
- 46:09this weighted leverage score
- 46:10has much better performance
- 46:11than the sliced inverse regression one, right?
- 46:14So I wonder, is it correct
- 46:17that the reason why it improves so much
- 46:20is because it utilizes the information within the design?
- 46:23Maybe it would totally make sense in a lot of applications
- 46:26that those important features
- 46:28may be, like, contributing more,
- 46:31also leading to, like, variation of other features,
- 46:35and as a result, maybe they could show up in the top
- 46:39singular vectors of the design matrix.
- 46:44Is this correct?
- 46:46<v Professor Liu>Yeah, thank you very much</v>
- 46:47for your question, it's a very good question.
- 46:49So first of all, I want to clarify,
- 46:53maybe I was not very clear about SIRS.
- 46:56Basically this stands for
- 46:58sure independence ranking and screening,
- 47:01so it's also a method
- 47:02that is based on sliced inverse regression.
- 47:06So yeah, and the other part, why the weighted leverage score
- 47:10has a much better performance
- 47:12compared to these two methods, is basically
- 47:15because, one of the reasons is because
- 47:19sure independence screening and the distance correlation
- 47:23both just utilize the marginal correlation,
- 47:27so between X and Y.
- 47:30Okay, so it does not utilize any of the information within X.
- 47:34It's kind of a marginal correlation
- 47:38between each variable X-j and Y, okay?
- 47:41However, the weighted leverage score
- 47:43will utilize both the row information
- 47:46and also the variance information,
- 47:49the correlation structure within the model,
- 47:54which is the V matrix; as I mentioned,
- 47:56it is derived from the covariance structure of X.
- 48:01The V matrix basically is the eigenvectors
- 48:03of the covariance structure, the covariance of X.
- 48:06So it utilizes, say, a kind of correlation
- 48:11between all the X variables in the variable screening.
- 48:15So I'm not sure if this answers your question.
- 48:18<v Student>Yeah, thank you.</v>
- 48:19I think it is, you answered my question.
- 48:22Essentially, I'm thinking, like, if those important features
- 48:25are actually not the top contributors
- 48:28to the top eigenvectors,
- 48:30then we wouldn't expect the weighted leverage score
- 48:33to behave the same way.
- 48:36<v Professor Liu>Thank you.</v>
- 48:42<v Robert>Any other questions</v>
- 48:43anyone wants to bring up right now?
- 48:51(students mumbling)
- 48:56<v Vince>Can I ask a naive question?</v>
- 48:58<v Professor Liu>Yes, Vince.</v>
- 49:00<v Vince>So I'm wondering kind of, you know,</v>
- 49:04when I think about doing SVD on data,
- 49:06the first thing that I think of is PCA,
- 49:09and I keep coming back to that,
- 49:13and I can't tell if there's a relationship?
- 49:17<v Professor Liu>Yeah, basically, right,</v>
- 49:19we can generate U and V in many ways, right?
- 49:22Using a regression model,
- 49:24we can generate the U and V as well, right?
- 49:27We can generate, we can do a,
- 49:30so basically, I think the intuition
- 49:32is that what we need are the left and right
- 49:34singular vectors, so we can use many different ways
- 49:37to generate them.
- 49:39Yeah, it's not really restricted.
- 49:40<v Robert>Thank you for that.</v>
- 49:41All right, well then, if there's nothing further,
- 49:43let's thank the speaker again.
- 49:45<v Professor Liu>Thank you everyone for having me on.</v>
- 49:51(students overlapping chatter)