Program
Tsinghua University Pao-Lu Hsu Distinguished Lecture
Student No.: 50
Time: Tue 16:30-17:30, 2016-7-5 / Thu 10:30-11:30, 2016-7-7
Instructor: Lawrence D. Brown [University of Pennsylvania]
Place: Lecture Hall, Floor 3, Jin Chun Yuan West Building
Starting Date: 2016-7-5
Ending Date: 2016-7-7

Lecture 1: Regression Analysis in an Assumption-lean Framework

Time: Tue 16:30-17:30, 2016-7-5

Abstract:

Statistical analysis via linear models and generalized linear models involves covariates. In conventional notation these are the X-values, and the observations are the Y-values. In many conventional discussions of these models and their applications, the X-values are treated as fixed constants, even though the data are more realistically modeled as coming from a population that yields random covariates.

Is it really OK to condition on the observed values of the randomly distributed covariates and treat them as if they were fixed? The short answer is that it is sometimes OK and sometimes not. A key part of the answer depends on whether the analytical linear model or GLM is an accurate representation of the stochastic nature of the data or is “mis-specified”. This talk will characterize answers to this basic question and describe valid alternative forms and targets of inference for situations involving random covariates. A key feature of the development is the imposition of only minimal assumptions on the true distributions in the stochastic model for the data; in that sense the framework is “assumption-lean”.
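
As an informal illustration of the kind of contrast at stake, the sketch below compares classical fixed-X standard errors with heteroscedasticity-consistent “sandwich” standard errors for a straight-line fit to data with random covariates and a nonlinear true mean. The sandwich estimator is one standard assumption-lean device; the specific data-generating setup here is invented for illustration and is not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Random covariate and a deliberately mis-specified situation:
# the true mean is nonlinear in x, but we fit a straight line.
x = rng.uniform(-1.0, 3.0, n)
y = np.exp(x / 2.0) + rng.normal(0.0, 0.5, n)

X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta_hat
XtX_inv = np.linalg.inv(X.T @ X)

# Classical standard errors: treat X as fixed and trust the linear model.
sigma2_hat = resid @ resid / (n - 2)
se_classical = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# Sandwich standard errors: target the best linear approximation,
# remaining valid when the model is wrong and the covariates are random.
meat = X.T @ (X * (resid ** 2)[:, None])
se_sandwich = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SEs:", se_classical)
print("sandwich  SEs:", se_sandwich)
```

Under mis-specification the two sets of standard errors can differ noticeably, which is one way of seeing that conditioning on the observed covariates is not always innocuous.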

Consequences related to this issue in linear models lead to alternative inference for the Average Treatment Effect in randomized clinical trials, to alternative forms of the popular Cp criterion for model selection, and to improved estimates and predictions in semi-supervised learning. Most of the current talk will be devoted to an exposition of the main issue – the role of random covariates in standard methodology. Some of the consequences for specific modes of application will be discussed as time permits. Details related to Cp will be discussed in the next lecture.
This is joint research of the Wharton Linear Models Research Group, whose members include Buja, A.; Berk, R. A.; Brown, L. D.; George, E.; Pitkin, E.; Traskin, M.; Zhang, K.; and Zhao, L.

Lecture 2: Mallows Cp for Realistic Out-of-sample Prediction

Time: Thu 10:30-11:30, 2016-7-7

Abstract:

Mallows’ Cp is a frequently used tool for variable selection in linear models. (For the original discussion see Mallows (1973), building on Mallows (1964, 1966).) In practice it may be used in conjunction with forward stepwise selection, all-subsets selection, or some other selection scheme. It can be derived and interpreted as an estimate of (normalized) predictive squared error in a very special situation. Two key features of that situation are: 1) the observed covariates and the covariates for the predictive population are “not to be regarded as being sampled randomly from some population, but rather are taken as fixed design variables” (Mallows (1973)); and 2) the observations in the sample and in the predictive universe follow a homoscedastic linear model. Assumption 1) does not accord with most of the common statistical settings in which Cp is employed, and assumption 2) is often undesirably optimistic in practical settings.
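
For reference, the classical criterion takes the following standard form, which the abstract does not spell out: for a candidate submodel with p fitted coefficients, residual sum of squares SSE_p, full-model error-variance estimate, and sample size n,

```latex
C_p \;=\; \frac{\mathrm{SSE}_p}{\hat{\sigma}^2} \;-\; n \;+\; 2p,
\qquad\text{equivalently}\qquad
\hat{\sigma}^2\,(C_p + n) \;=\; \mathrm{SSE}_p + 2p\,\hat{\sigma}^2 .
```

Under the two assumptions listed above, SSE_p + 2p(sigma-hat)^2 is an unbiased estimate of the total squared error of prediction at the n fixed design points; the point of the abstract is precisely that this interpretation leans on those two assumptions.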

We derive an easily computed variant of Mallows’ expression that does not rely on either of these assumptions. The new variant estimates the predictive squared error for future observations drawn from the same population as the one that provided the observed statistical sample. The candidate estimators are linear estimators based on selected variables, but there are virtually no assumptions on the true sampling distribution.
Use of this variant will be demonstrated via simulations in a simple regression setting that enables easy visualization and also exact computation of some relevant quantities. For a more practical demonstration, we also apply the methodology to variable selection in a data set involving criminal sentencing.
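
A toy simulation in the spirit of the setting described above (the data-generating mechanism here is invented for illustration, not taken from the lecture) contrasts the classical fixed-X Cp-type estimate of predictive error with the actual out-of-sample squared prediction error under random covariates:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw(n):
    """Random-X data from a nonlinear, heteroscedastic truth (hypothetical)."""
    x = rng.uniform(0.0, 2.0, n)
    y = np.sin(2.0 * x) + rng.normal(0.0, 0.3 + 0.3 * x, n)
    return x, y

n = 200
x, y = draw(n)

# Error-variance estimate from the largest candidate model (degree-5 polynomial).
X_full = np.vander(x, 6, increasing=True)
beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0]
sigma2_hat = np.sum((y - X_full @ beta_full) ** 2) / (n - 6)

# Large fresh draw from the same population to measure true out-of-sample error.
x_new, y_new = draw(50_000)

for degree in range(1, 6):
    p = degree + 1                                   # number of fitted coefficients
    X = np.vander(x, p, increasing=True)             # columns 1, x, ..., x^degree
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((y - X @ beta) ** 2)

    # Classical Cp-style (per-observation) estimate of predictive squared error.
    cp_estimate = (sse + 2.0 * p * sigma2_hat) / n

    # Out-of-sample mean squared prediction error on the fresh sample.
    X_new = np.vander(x_new, p, increasing=True)
    oos_error = np.mean((y_new - X_new @ beta) ** 2)

    print(f"degree {degree}: Cp-type estimate {cp_estimate:.3f}, "
          f"out-of-sample {oos_error:.3f}")
```

With random covariates and heteroscedastic noise, the two columns need not agree; closing that gap is what the assumption-lean variant described above is designed to do.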