Model Selection in R

All images, data and R script can be found here.
This is a homework assignment from the DSO_530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC. I completed this project with two classmates, He Liu and Kurshal Bhatia. In this assignment, we practice Linear Model Selection and Regularization, getting to know Subset Selection, Shrinkage Methods, and Dimension Reduction Regression.
Prompt
The data set was taken from the StatLib data repository which is administered and maintained at Carnegie Mellon University. The data set was used in the ASA Statistical Graphics Section's Data Analysis Exposition. This data set consists of 777 observations and 19 variables:
College = Name of a university
Private = A factor with levels No and Yes indicating private or public university
Apps = Number of applications received
Accept = Number of applications accepted
Enroll = Number of new students enrolled
Top10perc = Pct. new students from top 10% of H.S. class
Top25perc = Pct. new students from top 25% of H.S. class
F.Undergrad = Number of fulltime undergraduates
P.Undergrad = Number of parttime undergraduates
Outstate = Out-of-state tuition
Room.Board = Room and board costs
Books = Estimated book costs
Personal = Estimated personal spending
PhD = Pct. of faculty with Ph.D.'s
Terminal = Pct. of faculty with terminal degree
S.F.Ratio = Student/faculty ratio
perc.alumni = Pct. alumni who donate
Expend = Instructional expenditure per student
Grad.Rate = Graduation rate
Objective of the analysis: build a predictive model to predict the number of applications received using given features in the College data set.
We will use several models and compare their predictive accuracy.
Prepare Data
college <-read.csv("college.csv")
head(college)
The first column is just the name of each university, which we don't want R to treat as data. However, these names may be handy later, so we use them as row names.
rownames(college) <- college[,1]
college <- college[,-1]
str(college)
summary(college)
We are going to divide universities into two groups based on whether or not the proportion of students coming from the top 10 % of their high school classes exceeds 50 %. The decision is recorded in a dummy variable called Elite.
college$Elite=rep("No",nrow(college))
college$Elite[college$Top10perc>50]="Yes"
college$Elite=as.factor(college$Elite)
# Convert the Private and Elite factors into 0/1 dummy variables (1 = Yes)
college$Elite=as.numeric(college$Elite=="Yes")
college$Private=as.numeric(college$Private=="Yes")
Let's create a training set and a testing set
set.seed(1)
test_id<-sample(dim(college)[1],200)
y.train=college[-(test_id),2]
x.train=college[-(test_id),-2]
y.test=college[(test_id),2]
x.test=college[(test_id),-2]
test=college[test_id,]
train=college[-(test_id),]
To evaluate the models' predictive accuracy, we create functions for RMSE and MAPE for convenient calls later. We could also include other measures of error, such as cross-validated error (a sketch follows the two functions below).
RMSE=function(y,y.hat) sqrt(mean((y-y.hat)^2))
MAPE=function(y,y.hat) mean(abs((y-y.hat)/y))
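As a sketch of what a cross-validated error measure could look like (this helper is illustrative only and is not used in the analysis below; the name CV.RMSE and its arguments are our own):
# Illustrative k-fold cross-validated RMSE for a linear model formula (not part of the original analysis)
CV.RMSE=function(formula,data,k=10){
  folds=sample(rep(1:k,length.out=nrow(data)))
  errs=sapply(1:k,function(i){
    fit=lm(formula,data=data[folds!=i,])
    pred=predict(fit,newdata=data[folds==i,])
    RMSE(y=data[folds==i,all.vars(formula)[1]],y.hat=pred)
  })
  mean(errs)
}
# Example call: CV.RMSE(Apps~.,data=college,k=10)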
We also make an RMSE report table and a time-consumption report table for easy viewing and comparison.
time_consumed <-data.frame(NA,NA,NA,NA,NA,NA,NA,NA,NA)
colnames(time_consumed) <- c("Sub_Ex", "Sub_For", "Sub_Back","Linear Reg","Ridge","Lasso","PCR","PLS","Elastic Net")
rmse_comparison <-data.frame(NA,NA,NA,NA,NA,NA,NA,NA,NA)
colnames(rmse_comparison) <- c("Sub_Ex", "Sub_For", "Sub_Back","Linear Reg","Ridge","Lasso","PCR","PLS","Elastic Net")
Best Subset Regression
We will now use the leaps package to evaluate best-subset models. We try three subset selection approaches: "exhaustive", "forward", and "backward".
Exhaustive approach
We wrap the fitting code between two Sys.time() calls to record how long it takes.
library(leaps)
starting.time=Sys.time()
M.ex=regsubsets(Apps~.,data=train,nvmax=18,really.big=T)
finishing.time=Sys.time()
time_consumed[1,1]=round(finishing.time-starting.time,4)
which.max(summary(M.ex)$adjr2)
plot(summary(M.ex)$adjr2, xlab = "Number of Variables", ylab = "Adjusted R Square")
points(14, summary(M.ex)$adjr2[14], pch = 20, col = "red")
Adjusted R-squared is largest for the 14-variable model fitted on the training set. Besides adjusted R-squared, we could use criteria such as Cp (a proxy for AIC) or BIC to choose among the models. We now use the models fitted on the training set to make predictions on the testing set and record the errors for each model; there are 18 models.
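A quick sketch of choosing the model size by those criteria from the same summary object (illustration only, not used below):
# Pick the model size by Mallows' Cp or BIC instead of adjusted R-squared
which.min(summary(M.ex)$cp)
which.min(summary(M.ex)$bic)
plot(summary(M.ex)$bic, xlab = "Number of Variables", ylab = "BIC")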
M.ex.accuracy <- data.frame(rep(NA,18),rep("",18),rep(NA,18),rep("",18),rep(NA,18),rep("",18),rep(NA,18),rep("",18),rep(NA,18),rep("",18),rep(NA,18),rep("",18), stringsAsFactors = FALSE)
colnames(M.ex.accuracy)=c("Train.RMSE","","Test.RMSE","","Train.MAPE","","Test.MAPE","","Train.Corr","","Test.Corr","")
for (i in 1:18) {
M.ex.y.hat<-t(coef(M.ex,id=i)) %*% t(cbind(1,x.test[,names(coef(M.ex,id=i))[-1]]));
M.ex.fitted<-t(coef(M.ex,id=i)) %*% t(cbind(1,x.train[,names(coef(M.ex,id=i))[-1]]));
M.ex.accuracy$Train.RMSE[i] = RMSE(y=y.train,y.hat=M.ex.fitted);
M.ex.accuracy$Test.RMSE[i] = RMSE(y=y.test,y.hat=M.ex.y.hat);
M.ex.accuracy$Train.MAPE[i] = MAPE(y=y.train,y.hat=M.ex.fitted);
M.ex.accuracy$Test.MAPE[i] = MAPE(y=y.test,y.hat=M.ex.y.hat);
M.ex.accuracy$Train.Corr[i] = cor(as.vector(M.ex.fitted),y.train);
M.ex.accuracy$Test.Corr[i] = cor(as.vector(M.ex.y.hat),y.test)
}
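# Flag the three best models in each column: *** = best, ** = second best, * = third best
# (lowest RMSE/MAPE, highest correlation)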
for (i in 1:3){
M.ex.accuracy[match(sort(M.ex.accuracy$Train.RMSE)[i],M.ex.accuracy$Train.RMSE),2]=paste(replicate(4-i, '*'), collapse = "");
M.ex.accuracy[match(sort(M.ex.accuracy$Test.RMSE)[i],M.ex.accuracy$Test.RMSE),4]=paste(replicate(4-i, '*'), collapse = "");
M.ex.accuracy[match(sort(M.ex.accuracy$Train.MAPE)[i],M.ex.accuracy$Train.MAPE),6]=paste(replicate(4-i, '*'), collapse = "");
M.ex.accuracy[match(sort(M.ex.accuracy$Test.MAPE)[i],M.ex.accuracy$Test.MAPE),8]=paste(replicate(4-i, '*'), collapse = "");
M.ex.accuracy[match(sort(M.ex.accuracy$Train.Corr)[19-i],M.ex.accuracy$Train.Corr),10]=paste(replicate(4-i, '*'), collapse = "");
M.ex.accuracy[match(sort(M.ex.accuracy$Test.Corr)[19-i],M.ex.accuracy$Test.Corr),12]=paste(replicate(4-i, '*'), collapse = "");
}
M.ex.accuracy
The summary table is now like this:
> M.ex.accuracy
Train.RMSE Test.RMSE Train.MAPE Test.MAPE Train.Corr Test.Corr
1 1215.3447 1483.424 0.2642744 *** 0.2725806 0.9350495 0.9576987
2 1058.5431 1317.831 0.3685517 0.3191913 0.9511341 0.9676984
3 1021.3586 1353.455 0.2657766 ** 0.2359127 *** 0.9545873 0.9647422
4 1003.9375 1360.159 0.3691235 0.3429179 0.9561585 0.9662017
5 995.3905 1336.878 0.3656009 0.3226051 0.9569186 0.9670741
6 989.8745 1331.956 0.3353175 0.2854917 0.9574053 0.9672007
7 985.2096 1353.994 0.3190749 * 0.2886800 0.9578147 0.9668531
8 978.2238 1327.493 0.3266173 0.2771585 0.9584237 0.9678347
9 972.3983 1344.641 0.3454567 0.2642254 0.9589280 0.9673834
10 969.0885 1346.711 0.3416160 0.2776933 0.9592131 0.9671804
11 965.3341 1313.723 0.3472304 0.2778655 0.9595351 0.9691684
12 962.6936 1316.779 0.3443276 0.2567660 ** 0.9597609 0.9689705
13 961.1050 1312.251 0.3471107 0.2602167 0.9598963 0.9691584
14 960.1505 1302.293 ** 0.3472487 0.2583202 * 0.9599776 0.9697018 ***
15 959.4776 1302.273 *** 0.3470724 0.2644870 0.9600348 0.9696671 **
16 959.2712 * 1303.679 * 0.3480086 0.2657612 0.9600524 * 0.9696123 *
17 959.2130 ** 1304.623 0.3475276 0.2676477 0.9600573 ** 0.9695919
18 959.2129 *** 1304.605 0.3475254 0.2676874 0.9600573 *** 0.9695929
Test-set RMSE is lowest for the 15-variable model.
which.min(M.ex.accuracy$Test.RMSE)
We apply the same procedure to the other two approaches.
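The forward and backward fits use the same regsubsets() call with a different method argument; the timing and accuracy tables are built exactly as for the exhaustive model:
M.for=regsubsets(Apps~.,data=train,nvmax=18,method="forward")
M.back=regsubsets(Apps~.,data=train,nvmax=18,method="backward")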
Forward approach
> M.for.accuracy
Train.RMSE Test.RMSE Train.MAPE Test.MAPE Train.Corr Test.Corr
1 1215.3447 1483.424 0.2642744 *** 0.2725806 0.9350495 0.9576987
2 1058.5431 1317.831 0.3685517 0.3191913 0.9511341 0.9676984
3 1025.6655 1340.797 0.3393837 * 0.3152921 0.9541943 0.9657877
4 1003.9375 1360.159 0.3691235 0.3429179 0.9561585 0.9662017
5 995.3905 1336.878 0.3656009 0.3226051 0.9569186 0.9670741
6 989.8745 1331.956 0.3353175 ** 0.2854917 0.9574053 0.9672007
7 985.9823 1333.205 0.3487048 0.2736176 0.9577470 0.9672856
8 981.9279 1341.610 0.3504910 0.2723669 0.9581014 0.9672037
9 978.1941 1347.694 0.3479662 0.2851037 0.9584263 0.9668737
10 975.8946 1342.996 0.3587785 0.2903442 0.9586258 0.9673159
11 970.8823 1320.889 0.3545895 0.2645969 0.9590587 0.9686026
12 968.8510 1283.129 *** 0.3553727 0.2677575 0.9592335 0.9698657 ***
13 961.1050 1312.251 0.3471107 0.2602167 ** 0.9598963 0.9691584
14 960.1505 1302.293 * 0.3472487 0.2583202 *** 0.9599776 0.9697018 **
15 959.4776 1302.273 ** 0.3470724 0.2644870 * 0.9600348 0.9696671 *
16 959.2712 * 1303.679 0.3480086 0.2657612 0.9600524 * 0.9696123
17 959.2130 ** 1304.623 0.3475276 0.2676477 0.9600573 ** 0.9695919
18 959.2129 *** 1304.605 0.3475254 0.2676874 0.9600573 *** 0.9695929
Adjusted R-squared is again largest for the 14-variable model on the training set. However, if we select the model using the validation set, test-set RMSE is lowest for the 12-variable model.
Backward approach
> M.back.accuracy
Train.RMSE Test.RMSE Train.MAPE Test.MAPE Train.Corr Test.Corr
1 1215.3447 1483.424 0.2642744 *** 0.2725806 0.9350495 0.9576987
2 1058.5431 1317.831 0.3685517 0.3191913 0.9511341 0.9676984
3 1025.6655 1340.797 0.3393837 0.3152921 0.9541943 0.9657877
4 1003.9375 1360.159 0.3691235 0.3429179 0.9561585 0.9662017
5 996.4307 1350.950 0.3335815 0.2990086 0.9568265 0.9664878
6 994.3908 1384.976 0.3215884 * 0.2898493 0.9570070 0.9653612
7 985.2096 1353.994 0.3190749 ** 0.2886800 0.9578147 0.9668531
8 978.2238 1327.493 0.3266173 0.2771585 0.9584237 0.9678347
9 972.3983 1344.641 0.3454567 0.2642254 0.9589280 0.9673834
10 969.6505 1318.688 0.3373336 0.2705813 0.9591647 0.9685162
11 965.4591 1312.280 0.3470725 0.2586248 * 0.9595244 0.9692669
12 962.6936 1316.779 0.3443276 0.2567660 *** 0.9597609 0.9689705
13 961.1050 1312.251 0.3471107 0.2602167 0.9598963 0.9691584
14 960.1505 1302.293 ** 0.3472487 0.2583202 ** 0.9599776 0.9697018 ***
15 959.4776 1302.273 *** 0.3470724 0.2644870 0.9600348 0.9696671 **
16 959.2712 * 1303.679 * 0.3480086 0.2657612 0.9600524 * 0.9696123 *
17 959.2130 ** 1304.623 0.3475276 0.2676477 0.9600573 ** 0.9695919
18 959.2129 *** 1304.605 0.3475254 0.2676874 0.9600573 *** 0.9695929
Adjusted R-squared is again largest for the 14-variable model on the training set. However, if we select the model using the validation set, test-set RMSE is lowest for the 15-variable model.
Here we can compare how the test-set RMSE behaves relative to the training-set RMSE (left chart), and zoom in on the test-set RMSE (right chart), as the number of explanatory variables increases.
Exhaustive approach

Forward approach

Backward approach

As we expect, the training error decreases monotonically as we add more explanatory variables, but the validation error does not. The test-set RMSE fluctuates considerably: increasing the size of the model does not guarantee higher accuracy.
OLS Regression Model - Evaluate the accuracy of the model using Cross Validation
We need the function cv.glm() from the boot package to calculate the K-fold cross-validation prediction error.
library(boot)
starting.time=Sys.time()
M.linear=glm(Apps~.,data=college)
finishing.time=Sys.time()
time_consumed[1,4]=round(finishing.time-starting.time,4)
cv.error=cv.glm(college,M.linear,K=10)$delta[1]
summary(M.linear)
cv.error # = 1262112
cv.error^.5 # = 1123.438
M.linear.y.hat=predict(M.linear,x.test)
rmse_comparison[4]<-RMSE(y=y.test,y.hat=M.linear.y.hat)
There are two potential problems with this OLS regression when we include all explanatory variables:
1. Prediction accuracy: when n (the number of observations) is not much larger than p (the number of predictors), the least squares fit can have high variance, leading to overfitting and poor predictions on unseen observations; when n < p, the variability of the least squares fit increases dramatically and the variance of the estimates is infinite.
2. Model interpretability: when the model includes many variables X, there will generally be many that have little or no effect on Y, which makes it harder to see the big picture or recognize the effect of the "important" variables.
A solution for this is subset selection, which was done at the beginning of this post. We include some more solutions below.
We can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, shrinks them towards zero. Shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression, which adds the penalty lambda * sum(beta_j^2) to the residual sum of squares, and the lasso, which instead adds lambda * sum(|beta_j|).
Ridge Regression
We need to pick the tuning parameter lambda, and we do this by cross-validation.
library(glmnet)
set.seed(1)
cv.out=cv.glmnet(as.matrix(x.train),y.train,alpha=0)
cv.out$cvsd
min(cv.out$cvsd) # = 230181.
min(cv.out$cvsd)^.5 # = 479.7724
par(mfrow=c(1,1))
plot(cv.out)
best.lambda=cv.out$lambda.min
best.lambda # = 351.804
Because we let the function automatically pick the range of lambda, this plot is not very useful: the variance of the MSE seems to increase as lambda increases, but this impression is misleading. If we did it again, we would set a wider range of lambda by defining a variable lambdarange <- (exp(1))^seq(10, -2, length = 100) and passing it to the function; the range of log(lambda) would then run from -2 to 10 instead of 6 to 14 as in the graph. We will demonstrate this in the lasso model below; a sketch of the equivalent ridge call follows here.
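A sketch of the ridge cross-validation with an explicit lambda grid (mirroring the lasso code below; not re-run here, and the name cv.out.wide is our own):
# Ridge CV over an explicit lambda grid so that log(lambda) runs from -2 to 10 (illustration only)
lambdarange <- (exp(1))^seq(10, -2, length = 100)
cv.out.wide=cv.glmnet(as.matrix(x.train),y.train,alpha=0,lambda=lambdarange)
plot(cv.out.wide)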

starting.time=Sys.time()
M.ridge=glmnet(as.matrix(x.train),y.train,alpha=0,lambda=best.lambda)
finishing.time=Sys.time()
time_consumed[1,5]=round(finishing.time-starting.time,4)
By checking the coefficients of the model, we can see that some coefficients are shrunk to nearly zero.
> coef(M.ridge)
19 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -1.669029e+03
Private -6.119370e+02
Accept 7.975664e-01
Enroll 5.218250e-01
Top10perc 1.056295e+01
Top25perc 2.583510e+00
F.Undergrad 1.116647e-01
P.Undergrad 2.887601e-02
Outstate 4.787920e-03
Room.Board 1.663263e-01
Books 1.959874e-01
Personal -3.918161e-03
PhD -1.728731e+00
Terminal -2.953220e+00
S.F.Ratio 1.747877e+01
perc.alumni -9.499477e+00
Expend 7.908724e-02
Grad.Rate 1.240990e+01
Elite 6.587377e+02
M.ridge.y.hat=predict(M.ridge,s=best.lambda,newx=as.matrix(x.test))
rmse_comparison[5]<-RMSE(y=y.test,y.hat=M.ridge.y.hat)
Lasso Regression
The lasso works in a similar way to ridge regression, except that it uses a different penalty term. With this penalty, it can be shown mathematically that some coefficients end up being set to exactly zero.
With the lasso, we can therefore produce a model that has high predictive power and is simple to interpret.
set.seed(1)
lambdarange <- (exp(1))^seq(6, -2, length = 100)
cv.out=cv.glmnet(as.matrix(x.train), y.train,alpha=1, lambda = lambdarange)
cv.out$cvsd
min(cv.out$cvsd) # = 229225.6
min(cv.out$cvsd)^.5 # = 478.7751
par(mfrow=c(1,1))
plot(cv.out)
best.lambda=cv.out$lambda.min
best.lambda # = 9.043264

starting.time=Sys.time()
M.lasso=glmnet(as.matrix(x.train),y.train,alpha=1,lambda=best.lambda)
finishing.time=Sys.time()
time_consumed[1,6]=round(finishing.time-starting.time,4)
Looking at the coefficients of the model, we can see that some coefficients are shrunk to exactly zero.
> coef(M.lasso)
19 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) -1.004706e+03
Private -4.813969e+02
Accept 1.292032e+00
Enroll -2.237505e-01
Top10perc 1.775103e+01
Top25perc .
F.Undergrad 5.653075e-02
P.Undergrad 3.715501e-02
Outstate -3.094770e-02
Room.Board 9.904828e-02
Books 8.435703e-02
Personal .
PhD -4.495032e+00
Terminal -4.348351e+00
S.F.Ratio 2.030395e+01
perc.alumni -2.016228e+00
Expend 8.920336e-02
Grad.Rate 8.123156e+00
Elite 5.553816e+02
Adding the RMSE to the report table
M.lasso.y.hat=predict(M.lasso,s=best.lambda,newx=as.matrix(x.test))
rmse_comparison[6]<-RMSE(y=y.test,y.hat=M.lasso.y.hat)
Elastic Net
Elastic net blends the lasso and ridge penalties; setting alpha = 0.5 in glmnet gives an equal mix of the two. Here we fix alpha at 0.5; a sketch of tuning alpha over a grid follows at the end of this subsection.
set.seed(1)
lambdarange <- (exp(1))^seq(6, -2, length = 100)
cv.out= cv.glmnet(as.matrix(x.train),y.train,alpha=.5, lambda = lambdarange)
cv.out$cvsd
min(cv.out$cvsd) #229001.9
min(cv.out$cvsd)^.5 #478.5414
plot(cv.out)
best.lambda=cv.out$lambda.min
best.lambda # = 4.030697

starting.time=Sys.time()
M.elastic=glmnet(as.matrix(x.train),y.train,alpha=.5,lambda=best.lambda)
finishing.time=Sys.time()
time_consumed[1,9]=round(finishing.time-starting.time,4)
coef(M.elastic)
M.elastic.y.hat=predict(M.elastic,s=best.lambda,newx=as.matrix(x.test))
rmse_comparison[9]<-RMSE(y=y.test,y.hat=M.elastic.y.hat)
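As noted above, alpha was fixed at 0.5. The mixing parameter alpha could itself be tuned by cross-validating over a small grid, for example (a sketch, not run in this analysis; the names alphas and cv.errors are our own):
# Illustrative grid search over alpha; keeps the value with the lowest cross-validated MSE
alphas=seq(0,1,by=0.25)
cv.errors=sapply(alphas,function(a){
  set.seed(1)
  min(cv.glmnet(as.matrix(x.train),y.train,alpha=a,lambda=lambdarange)$cvm)
})
alphas[which.min(cv.errors)]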
The methods that we have discussed so far in this post have involved fitting linear regression models, via least squares or a shrunken approach, using the original predictors, X1, X2, … , Xp.
We now explore a class of approaches that transform the predictors and then fit a least squares model using the transformed variables. We will refer to these techniques as dimension reduction methods.
PCR Model
library(pls)
set.seed(1)
starting.time=Sys.time()
M.pcr=pcr(Apps~.,data=train, scale=TRUE,validation="CV")
finishing.time=Sys.time()
time_consumed[1,7]=round(finishing.time-starting.time,4)
summary(M.pcr)
validationplot(M.pcr,val.type=c("MSEP"))

Based on the graph, we pick 5 principal components to build our model, because the reduction in MSE is not substantial when we increase the number of components from 5 to 18.
M.pcr.y.hat = predict(M.pcr,x.test,ncomp = 5)
rmse_comparison[7]<-RMSE(y=y.test,y.hat=M.pcr.y.hat)
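Instead of reading the number of components off the plot, the pls package can also report the cross-validated error numerically and suggest a number of components (a sketch, assuming a reasonably recent version of pls; not used in the analysis above):
# Cross-validated RMSEP per number of components, plus a one-sigma-rule suggestion (illustration only)
RMSEP(M.pcr,estimate="CV")
selectNcomp(M.pcr,method="onesigma")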
PLS Model
starting.time=Sys.time()
M.pls=plsr(Apps~.,data=train, scale=TRUE,validation="CV")
finishing.time=Sys.time()
time_consumed[1,8]=round(finishing.time-starting.time,4)
summary(M.pls)
validationplot(M.pls,val.type=c("MSEP"))

Based on the graph, we again pick 5 components, because the reduction in MSE is not substantial when we increase the number of components from 5 to 18.
M.pls.y.hat= predict(M.pls,x.test,ncomp = 5)
rmse_comparison[8]<-RMSE(y=y.test,y.hat=M.pls.y.hat)
Finally, we compare the performance of all models on the test set.
RMSE
OLS regression shows the lowest test RMSE, though note that M.linear was fitted on the full data set (including the test observations), so its error estimate is likely optimistic. Subset selection with the forward approach also shows impressive accuracy.
> rmse_comparison
Sub_Ex Sub_For Sub_Back Linear Reg Ridge Lasso PCR PLS Elastic Net
1 1302.273 1283.129 1302.273 1150.511 1801.388 1345.767 2271.287 1515.119 1337.154
Time Consumed
Ridge and lasso are the most efficient.
> time_consumed
Sub_Ex Sub_For Sub_Back Linear Reg Ridge Lasso PCR PLS Elastic Net
1 0.0299 0.0519 0.1396 0.0249 0.017 0.0179 0.0908 0.1047 0.0349