All images, data, and R scripts can be found here
This is a homework assignment from the DSO 530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC. I completed this project with two classmates, He Liu and Kurshal Bhatia. In this assignment, we practice linear model selection and regularization: we get to know subset selection, shrinkage methods, and dimension reduction regression.
Prompt
The data set was taken from the StatLib data repository which is administered and maintained at Carnegie Mellon University. The data set was used in the ASA Statistical Graphics Section's Data Analysis Exposition. This data set consists of 777 observations and 19 variables:
College = Name of a university
Private = A factor with levels No and Yes indicating private or public university
Apps = Number of applications received
Accept = Number of applications accepted
Enroll = Number of new students enrolled
Top10perc = Pct. new students from top 10% of H.S. class
Top25perc = Pct. new students from top 25% of H.S. class
F.Undergrad = Number of fulltime undergraduates
P.Undergrad = Number of parttime undergraduates
Outstate = Out-of-state tuition
Room.Board = Room and board costs
Books = Estimated book costs
Personal = Estimated personal spending
PhD = Pct. of faculty with Ph.D.'s
Terminal = Pct. of faculty with terminal degree
S.F.Ratio = Student/faculty ratio
perc.alumni = Pct. alumni who donate
Expend = Instructional expenditure per student
Grad.Rate = Graduation rate
Objective of the analysis: build a predictive model for the number of applications received, using the other features in the College data set.
We will use several models and compare their predictive accuracy.
Prepare Data
college <- read.csv("college.csv")
head(college)
The first column is just the name of each university. We don't really want R to treat this as data; however, it may be handy to have these names later, so we use them as row names.
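A minimal sketch of that step (the exact code is in the linked script):

rownames(college) <- college[, 1]   # keep the university names as row labels
college <- college[, -1]            # drop the name column from the data itself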
We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %. The decision is recorded in a dummy variable called Elite.
college$Elite <- rep("No", nrow(college))
college$Elite[college$Top10perc > 50] <- "Yes"
college$Elite <- as.factor(college$Elite)

# Change Private and Elite into 0/1 dummy variables (1 = Yes)
college$Elite <- as.numeric(college$Elite == "Yes")
college$Private <- as.numeric(college$Private == "Yes")
To evaluate each model's predictive accuracy, we create functions for RMSE and MAPE for convenient calls later. We could also include other measures of error, such as cross-validated error.
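Below is a minimal sketch of these helpers, together with the train/test split that the later code assumes; the function names, the 70/30 proportion, and the seed are our assumptions rather than the original script's:

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))        # root mean squared error
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))  # mean absolute percentage error

set.seed(1)                                                   # fixed seed for reproducibility (assumption)
train.id <- sample(nrow(college), floor(0.7 * nrow(college)))
train <- college[train.id, ]
test  <- college[-train.id, ]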
We will now use the leaps package to evaluate all the best-subset models. We try three subset selection approaches: "exhaustive", "backward", and "forward".
Exhaustive approach
We will wrap our code between a pair of Sys.time() calls to measure how long it takes.
library(leaps)

starting.time <- Sys.time()
M.ex <- regsubsets(Apps ~ ., data = train, nvmax = 18, really.big = T)
finishing.time <- Sys.time()
time_consumed[1, 1] <- round(finishing.time - starting.time, 4)  # results table defined in the full script

which.max(summary(M.ex)$adjr2)   # subset size with the highest adjusted R-squared
plot(summary(M.ex)$adjr2, xlab = "Number of Variables", ylab = "Adjusted R Square")
points(14, summary(M.ex)$adjr2[14], pch = 20, col = "red")
Adjusted R-squared is largest for the 14-variable model when we train on the training set. Besides adjusted R-squared, we can use AIC or BIC as the criterion for choosing among models. We now use the models generated from the training set to make predictions on the testing set and record the errors for each model; there are 18 models, one per subset size. A sketch of this scoring loop is below.
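Since regsubsets() does not come with a predict() method, a common pattern for scoring each subset size on the testing set looks like this (our sketch, reusing the rmse() helper from earlier, not the post's exact code):

test.mat <- model.matrix(Apps ~ ., data = test)   # test design matrix (includes intercept)
rmse.test <- rep(NA, 18)
for (i in 1:18) {
  coefi <- coef(M.ex, id = i)                     # coefficients of the best i-variable model
  pred  <- test.mat[, names(coefi)] %*% coefi     # manual prediction
  rmse.test[i] <- rmse(test$Apps, pred)
}
which.min(rmse.test)                              # subset size with the lowest test RMSE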
For the backward approach, adjusted R-squared is again largest for the 14-variable model when we train the model. However, if we select the model using the validation set, the test-set RMSE is smallest for the 12-variable model.
For the forward approach, adjusted R-squared is likewise largest for the 14-variable model on the training data, but validation-set selection picks the 15-variable model, where the test-set RMSE is smallest.
Here we can compare the behavior of the test-set RMSE with that of the training set (the left chart) and zoom in on the test-set RMSE (the right chart) as we increase the number of explanatory variables.
Exhaustive approach
Forward approach
Backward approach
As we expect, the training error decreases monotonically as we add more explanatory variables, but the validation error does not. The test-set RMSE fluctuates considerably: increasing the size of the model does not guarantee higher accuracy.
OLS Regression Model - Evaluate the accuracy of the model using Cross Validation
We need the cv.glm() function from the boot package to calculate the K-fold cross-validation prediction error.
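A minimal sketch of that call; note that glm() with its default gaussian family fits the same model as lm(), which is the form cv.glm() expects:

library(boot)

M.ols  <- glm(Apps ~ ., data = college)             # OLS fit via glm()
cv.err <- cv.glm(college, M.ols, K = 10)$delta[1]   # 10-fold CV estimate of the MSE
sqrt(cv.err)                                        # on the RMSE scale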
There are two potential problems with this OLS regression when we include all explanatory variables:
1. Prediction accuracy: when n (the number of observations) is close to p (the number of predictors), the least squares fit can have high variance, which may result in overfitting and poor predictions on unseen observations. When n < p, the variability of the least squares fit increases dramatically, and the variance of these estimates is infinite.
2. Model interpretability: when we have a large number of variables X in the model, there will generally be many that have little or no effect on Y, which makes it harder to see the big picture or recognize the effect of the "important" variables.
A solution for this is subset selection, which was done at the beginning of this post. We will include some more solutions below.
We can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.
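For reference, in standard notation (ours, not from the original post), ridge regression and the lasso estimate the coefficients by minimizing, respectively:

\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|

where lambda >= 0 is the tuning parameter controlling the strength of the shrinkage.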
Ridge Regression
We need to pick the tuning parameter lambda, and we do this by cross-validation.
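A minimal sketch using the glmnet package, assuming the train/test split from earlier (glmnet takes a numeric predictor matrix rather than a formula):

library(glmnet)

x.train <- model.matrix(Apps ~ ., data = train)[, -1]  # predictor matrix, intercept column dropped
y.train <- train$Apps

cv.ridge <- cv.glmnet(x.train, y.train, alpha = 0)     # alpha = 0 selects the ridge penalty
plot(cv.ridge)                                         # CV MSE against log(lambda)
cv.ridge$lambda.min                                    # lambda that minimizes the CV error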
Because we let the function pick the range of lambda automatically, this plot is not very useful. The MSE (and its variance) seems to increase as lambda increases, but this is not actually true: it is an artifact of the automatically chosen range. If we do it again, we had better set a wider range of lambda, e.g. lambdarange <- (exp(1))^seq(10, -2, length = 100), and pass this variable to the function. The range of log(lambda) would then run from -2 to 10, instead of 6 to 14 as in the graph. We will demonstrate this in the lasso model below.
The Lasso
The lasso works in a similar way to ridge regression, except that it uses a different penalty term. With this penalty, it can be proven mathematically that some coefficients end up being set to exactly zero.
With the lasso, we can produce a model that has high predictive power and is simple to interpret.
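A sketch of the lasso fit with the wider lambda grid suggested above; alpha = 1 selects the lasso penalty in glmnet, and x.train/y.train are reused from the ridge sketch:

lambdarange <- (exp(1))^seq(10, -2, length = 100)       # log(lambda) runs from 10 down to -2
cv.lasso <- cv.glmnet(x.train, y.train, alpha = 1, lambda = lambdarange)
plot(cv.lasso)                                          # now covers the full lambda range
coef(cv.lasso, s = "lambda.min")                        # several coefficients are exactly zero

x.test <- model.matrix(Apps ~ ., data = test)[, -1]
pred.lasso <- predict(cv.lasso, newx = x.test, s = "lambda.min")
rmse(test$Apps, pred.lasso)                             # test-set RMSE for the lasso model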
The methods that we have discussed so far in this post have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, X1, X2, … , Xp.
We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
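A minimal sketch of principal components regression with the pls package; the call and settings are our assumptions, and the post's exact code may differ:

library(pls)

set.seed(1)
M.pcr <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")  # 10-fold CV by default
validationplot(M.pcr, val.type = "MSEP")       # CV MSE versus the number of components
pred.pcr <- predict(M.pcr, newdata = test, ncomp = 5)
rmse(test$Apps, pred.pcr)                      # test-set RMSE using 5 components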
Based on the graph, we may pick 5 PCs to build our models, because the reduction in MSE is not substantial when we increase the number of PCs from 5 to 18.