Multiple Linear Regression with R

All images, data, and the R script can be found here.
This is a project I did for the DSO_530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC.
Prompt
Dr. Sam Parameter, author of the textbook Statistics for Poets, has been contemplating starting a new magazine, Popular Statistics, and needs to know if it will be profitable enough. You have augmented the original dataset of 50 magazines by measuring more characteristics of the magazines and their audiences that may be useful in understanding the one-page advertisement costs better. The variables used below are pagecost (cost of a one-page advertisement), circ (projected circulation), percmale (percentage of male readers), medianincome (median income of readers), and market (domestic or international).
Your goal is to analyze the data with R using Multiple Linear Regression methods and choose the best model to explain the differences in advertising costs between the different titles and then to predict what Sam should be able to charge for the advertisements in the new magazine.
Steps
Data Preparation
We need to transform the market variable into a dummy variable, domestic.
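A minimal sketch of this recoding in R. The toy data frame and the labels "Domestic" and "International" are assumptions for illustration; the real values in market may be spelled differently.

```r
# Toy stand-in for the magazine data; the real labels in `market` may differ.
magazines <- data.frame(
  market = c("Domestic", "International", "Domestic", "Domestic")
)

# domestic = 1 for domestic titles, 0 for international ones
magazines$domestic <- ifelse(magazines$market == "Domestic", 1, 0)

print(magazines)
```

Coding the dummy as 0/1 keeps the regression coefficient readable: it is the expected difference in the response between domestic and international titles, other predictors held fixed.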
Data Exploration
Histogram
Skewness is noticeable in three variables: median income, one-page ad cost, and circulation. All three are skewed right, with a long right tail and the peak of the histogram shifted to the left. More magazines have a one-page ad cost below the sample average than above it, and the same holds for median income and circulation: most magazines have lower circulation and readers with lower median income.
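Right skew can also be checked numerically. The sketch below uses simulated log-normal values as a stand-in for pagecost (not the real data) and a small hand-written skewness helper, which is hypothetical and not part of the original script.

```r
# Simulated right-skewed values standing in for pagecost (not the real data).
set.seed(530)
pagecost <- rlnorm(50, meanlog = 11, sdlog = 0.5)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

skewness(pagecost)

# In a right-skewed sample the mean is usually pulled above the median.
c(mean = mean(pagecost), median = median(pagecost))

# Bin counts without opening a graphics device; drop plot = FALSE to draw it.
h <- hist(pagecost, plot = FALSE)
h$counts
```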
Some insights from the descriptive statistics for the variables
The cost of one ad page mostly ranges from $50,000 to $100,000
Projected circulation mostly ranges from 3,500 to 15,000, but over 50% of the observations have a circulation below 6,000
Over 50% of the magazines in the sample target readers with a median income between $40,000 and $60,000
A sizable number of magazines target largely a single gender (25% of the observations have less than 15% male readers; 25% have more than 65% male readers)
70% of the observations are domestic magazines
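Figures like the ones above come from quartiles and simple proportions. A minimal sketch, using a made-up ten-row data frame with the article's column names (the values are invented and will not reproduce the numbers above):

```r
# Toy data frame with the article's column names; all values are made up.
magazines <- data.frame(
  pagecost     = c(52000, 61000, 58000, 75000, 98000,
                   67000, 83000, 140000, 55000, 72000),
  circ         = c(3600, 4200, 3900, 5500, 9800, 5100, 7000, 15000, 3800, 5900),
  percmale     = c(5, 12, 48, 70, 10, 55, 90, 35, 14, 66),
  medianincome = c(42000, 58000, 47000, 61000, 39000,
                   52000, 75000, 36000, 44000, 56000),
  domestic     = c(1, 1, 0, 1, 1, 0, 1, 1, 0, 1)
)

# Min, quartiles, mean, and max for every column
summary(magazines)

# Quartiles of the share of male readers
quartiles <- quantile(magazines$percmale, probs = c(0.25, 0.5, 0.75))
quartiles

# The mean of a 0/1 dummy is the proportion of domestic titles
share_domestic <- mean(magazines$domestic)
share_domestic
```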
Correlation Matrix
The relationship between pagecost and circ is strong and looks linear. This makes economic sense: the higher the circulation, the stronger the impact of the ad campaign, and the higher the ad cost.
The relationship between percmale and medianincome is also noticeable, although it does not look linear. Unlike magazines aimed mainly at women, magazines aimed mainly at men also tend to target readers with higher incomes, and probably a higher social class.
There is also a slight relationship between medianincome and pagecost, as well as between medianincome and circ. Magazines with a higher ad page cost or higher circulation tend to target readers with lower median income.
The correlation table confirms these insights. The cost of one ad page is highly correlated with circulation. Median income is strongly correlated with the percentage of male readers, and it has a slight negative correlation with both ad page cost and circulation.
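A correlation table like the one described above can be produced with cor(). The sketch below runs on simulated data in which pagecost is built to depend on circ; the numbers are stand-ins, not the magazine dataset.

```r
# Simulated stand-in data: pagecost is constructed to depend on circ.
set.seed(530)
n <- 50
circ <- runif(n, 3500, 15000)
sim <- data.frame(
  circ     = circ,
  percmale = runif(n, 0, 100),
  pagecost = 7000 + 4 * circ + rnorm(n, sd = 5000)
)

# Pearson correlations between every pair of columns
cor_mat <- cor(sim)
round(cor_mat, 2)

# pairs(sim) would draw the matching scatterplot matrix.
```

Pearson correlation only measures linear association, so a curved relationship such as the one between percmale and medianincome is better judged from the scatterplot than from the coefficient alone.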
Model 1 - Multiple Linear Regression analysis using all the predictor variables
This regression model is useful, because the p-value of the overall F-test is 3.436e-13. An R-squared of 80.13% is decent, but there is room for further improvement. At a 5% significance level, three of the four explanatory variables are insignificant. The p-value for circulation is small, so circulation contributes significantly to the cost and I would recommend keeping it. The p-values of all other variables are greater than 0.05, so they are not significant at our chosen significance level and are candidates for removal. I would still keep the dummy variable domestic.
The VIF for each explanatory variable is low (<5), so there does not appear to be multicollinearity.
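A minimal sketch of fitting Model 1 and computing VIFs in base R. The data are simulated stand-ins (so the output will not match the numbers above), and the VIF is computed by hand from its definition, 1 / (1 - R²_j), rather than with a package function such as car::vif.

```r
# Simulated stand-in data with the article's column names (not the real data).
set.seed(530)
n <- 50
sim <- data.frame(
  circ         = runif(n, 3500, 15000),
  percmale     = runif(n, 0, 100),
  medianincome = runif(n, 30000, 90000),
  domestic     = rbinom(n, 1, 0.7)
)
sim$pagecost <- 7000 + 4 * sim$circ + rnorm(n, sd = 8000)

# Model 1: all predictors
model1 <- lm(pagecost ~ circ + percmale + medianincome + domestic, data = sim)
summary(model1)  # overall F-test, R-squared, per-coefficient p-values

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors.
predictors <- c("circ", "percmale", "medianincome", "domestic")
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = sim))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; the cutoff of 5 used above is a common rule of thumb rather than a formal test.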
Predict the pagecost for one observation
Predict what Sam should be able to charge for advertisements in the new magazine, if it has a projected audience of 1,800,000 readers, 60 percent of whom are male, with a median income of $80,000, and it will be sold internationally.
We are 95% confident that the cost of a one-page ad falls between $41,879.83 and $94,865.15.
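The interval above can be obtained with predict(). The sketch below refits the model on simulated stand-in data, so its numbers will differ from the ones reported. It assumes that circ is recorded in thousands of readers (1,800,000 readers becomes circ = 1800) and that "sold internationally" means domestic = 0; both are assumptions about the original script.

```r
# Simulated stand-in data (not the real magazine data).
set.seed(530)
n <- 50
sim <- data.frame(
  circ         = runif(n, 500, 15000),
  percmale     = runif(n, 0, 100),
  medianincome = runif(n, 30000, 90000),
  domestic     = rbinom(n, 1, 0.7)
)
sim$pagecost <- 7000 + 4 * sim$circ + rnorm(n, sd = 8000)
model1 <- lm(pagecost ~ circ + percmale + medianincome + domestic, data = sim)

# Assumed encoding of Sam's magazine: circ in thousands, international = 0.
new_magazine <- data.frame(
  circ = 1800, percmale = 60, medianincome = 80000, domestic = 0
)

# 95% interval for the cost of one new magazine
pred <- predict(model1, newdata = new_magazine,
                interval = "prediction", level = 0.95)
pred

# 95% interval for the average cost of all magazines with these traits
conf <- predict(model1, newdata = new_magazine,
                interval = "confidence", level = 0.95)
conf
```

For a single new title the prediction interval is the relevant one; it is always wider than the confidence interval because it also accounts for the noise around the regression line.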
Residual Analysis
The residual plot shows homoscedasticity: the variance of the residuals seems constant. The QQ plot suggests the residuals are normally distributed, and the Shapiro-Wilk test gives no evidence to reject the hypothesis that they are normally distributed.
The Durbin-Watson test shows a p-value of 0.7356, so we cannot reject the null hypothesis of zero autocorrelation. The ACF plot likewise does not show any autocorrelation. However, when we plot the residuals against the predicted values, we see a non-random pattern, and probably outliers. This violates the linearity assumption. The non-random pattern in the residuals indicates that the deterministic portion of the model (the predictor variables) is not capturing some explanatory information, which is “leaking” into the residuals. There are some possible solutions for this problem:
Add another explanatory variable that helps explain the pattern in the residuals
Transform a variable in the model to capture the curvature
Add interactions between explanatory variables
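The residual checks discussed above can be sketched in base R as follows. The model is fitted to simulated stand-in data, and the Durbin-Watson statistic is computed by hand from its definition; the p-value quoted above would come from a test function such as dwtest in the lmtest package.

```r
# Simulated stand-in data and a simple fit (not the real magazine data).
set.seed(530)
n <- 50
circ <- runif(n, 3500, 15000)
pagecost <- 7000 + 4 * circ + rnorm(n, sd = 8000)
fit <- lm(pagecost ~ circ)
res <- residuals(fit)

# Shapiro-Wilk test. H0: the residuals are normally distributed,
# so a large p-value means no evidence against normality.
sw <- shapiro.test(res)
sw$p.value

# Durbin-Watson statistic: close to 2 when there is no lag-1 autocorrelation.
dw_stat <- sum(diff(res)^2) / sum(res^2)
dw_stat

# Autocorrelations of the residuals; drop plot = FALSE for the ACF plot.
res_acf <- acf(res, plot = FALSE)

# plot(fitted(fit), res) and qqnorm(res) would draw the residual and QQ plots.
```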
Often, when the dependent variable represents financial data (income, price, etc.), using its natural log helps to alleviate these problems. We therefore re-run the multiple regression analysis using the natural log of the page cost variable.
Model 2 - Transforming response variable
With a p-value of 1.573e-08, this model is useful. However, it does not seem better than the previous one, because:
The R-squared and adjusted R-squared are lower, at only 0.6521 and 0.6164
The log transformation makes the model harder to interpret
The overall p-value is larger
However, the medianincome variable is relatively useful, with a p-value of 0.08. I recommend keeping three variables: circ, medianincome, and the dummy variable domestic.
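A minimal sketch of Model 2 on simulated stand-in data, showing one way to read coefficients once the response is logged. The data-generating numbers are invented for illustration.

```r
# Simulated stand-in data where log(pagecost) is linear in circ.
set.seed(530)
n <- 50
sim <- data.frame(
  circ         = runif(n, 3500, 15000),
  percmale     = runif(n, 0, 100),
  medianincome = runif(n, 30000, 90000),
  domestic     = rbinom(n, 1, 0.7)
)
sim$pagecost <- exp(10.5 + 0.00008 * sim$circ + rnorm(n, sd = 0.2))

# Model 2: natural log of the response, predictors untransformed
model2 <- lm(log(pagecost) ~ circ + percmale + medianincome + domestic,
             data = sim)
r2_log_scale <- summary(model2)$r.squared

# A coefficient b acts multiplicatively on the original scale:
# raising circ by 1000 multiplies the predicted cost by exp(1000 * b).
pct_per_1000_circ <- (exp(1000 * coef(model2)["circ"]) - 1) * 100
pct_per_1000_circ

# Back-transform fitted values to dollars (ignores retransformation bias).
fitted_dollars <- exp(fitted(model2))
head(fitted_dollars)
```

Note that the R-squared of a log-response model is measured on the log scale, so it is not strictly comparable with the R-squared of Model 1, which is measured in dollars.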
Residual Analysis
The residual plot does not show heteroskedasticity, so the assumption of equal error variances is reasonable. The QQ plot suggests the residuals are normally distributed. The Durbin-Watson test shows a p-value of 0.7915, so we cannot reject the null hypothesis of zero autocorrelation, and the ACF plot likewise does not show autocorrelation. However, the residual plot still shows a non-random pattern, and probably outliers. This violates the linearity assumption.
The scatterplots again suggest a high correlation between lncost and circulation: the higher the circulation, the higher the lncost. It is also noticeable that the relationship between the two is not linear, which suggests a further transformation. All other plots show no pattern and look random.
Model 3 - Transforming response variables and explanatory variables
In this model we also take the natural log of the circ variable.
This model has a greater R-squared than all our previous models. The p-value is smaller, and the model is useful. Besides lncirc, the medianincome variable, which was not that significant in our previous models, is statistically significant in this model. We recommend keeping three variables: lncirc, medianincome, and domestic.
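A minimal sketch of the log-log specification on simulated stand-in data. The true slope of 0.6 is an invented value used only to show how the coefficient on log(circ) is read.

```r
# Simulated stand-in data where log(pagecost) is linear in log(circ).
set.seed(530)
n <- 50
sim <- data.frame(
  circ         = runif(n, 500, 15000),
  percmale     = runif(n, 0, 100),
  medianincome = runif(n, 30000, 90000),
  domestic     = rbinom(n, 1, 0.7)
)
sim$pagecost <- exp(5 + 0.6 * log(sim$circ) + rnorm(n, sd = 0.2))

# Model 3: log response and log circulation
model3 <- lm(log(pagecost) ~ log(circ) + percmale + medianincome + domestic,
             data = sim)
summary(model3)

# In a log-log model the slope is an elasticity: a 1% increase in circ
# goes with roughly a (slope)% increase in pagecost.
elasticity <- coef(model3)["log(circ)"]
elasticity
```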
Residual Analysis
Outliers
I created a function to automatically remove outliers based on Cook's distance. The function first fits a linear model with the first column as the response variable and the remaining columns as explanatory variables. After removing the outliers, it fits the linear model again and extracts the summary. In practice, we should reassess the model every time we remove a single outlier. However, for the sake of time, and partly out of laziness, I remove them all at once here. This function lets us assess a new model quickly without having to step through each line of code again and again. Let's try it with Model 3.
After removing outliers, the R-squared improves to 0.845.
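A sketch of what such a function could look like. This is not the original function: the name remove_outliers_refit is hypothetical, and the cutoff of 4/n for Cook's distance is an assumed rule of thumb, since the threshold actually used is not stated. The demo data are simulated, with one outlier planted on purpose.

```r
# Hypothetical helper: response in column 1, predictors in the other columns.
# The 4 / n cutoff is an assumed rule of thumb, not taken from the original.
remove_outliers_refit <- function(df, threshold = 4 / nrow(df)) {
  form <- reformulate(names(df)[-1], response = names(df)[1])
  first_fit <- lm(form, data = df)
  keep <- cooks.distance(first_fit) <= threshold
  refit <- lm(form, data = df[keep, , drop = FALSE])
  list(removed = which(!keep), summary = summary(refit))
}

# Simulated stand-in data with one planted outlier in row 10.
set.seed(530)
n <- 50
lncirc <- runif(n, 6, 10)
sim <- data.frame(
  lncost   = 5 + 0.6 * lncirc + rnorm(n, sd = 0.1),
  lncirc   = lncirc,
  domestic = rbinom(n, 1, 0.7)
)
sim$lncost[10] <- sim$lncost[10] + 10

result <- remove_outliers_refit(sim)
result$removed            # row positions that were dropped
result$summary$r.squared  # fit quality after the refit
```

Dropping every flagged point in one pass is quick, but Cook's distances change once a point is removed, so removing one point at a time and refitting is the more careful procedure.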
Further Transformation
In this step we try different transformations and assess the R-squared of each. Below are the major steps that lead to the model with the highest R-squared.
Model 4 - Adding interactions
Model 5 - Squaring percmale
After adding some interaction variables and transforming the percmale variable, we arrive at a new model with an improved R-squared and adjusted R-squared.
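A minimal sketch of how interaction and squared terms are written in an R formula. The particular terms chosen here (lncirc by domestic, and percmale squared) are assumptions for illustration, since the exact interactions used in Models 4 and 5 are not listed; the data are simulated stand-ins.

```r
# Simulated stand-in data with a built-in interaction and curvature.
set.seed(530)
n <- 50
sim <- data.frame(
  lncirc       = runif(n, 6, 10),
  percmale     = runif(n, 0, 100),
  medianincome = runif(n, 30000, 90000),
  domestic     = rbinom(n, 1, 0.7)
)
sim$lncost <- 5 + 0.6 * sim$lncirc + 0.05 * sim$lncirc * sim$domestic +
  0.00004 * (sim$percmale - 50)^2 + rnorm(n, sd = 0.1)

# Baseline: main effects only
base_fit <- lm(lncost ~ lncirc + percmale + medianincome + domestic,
               data = sim)

# a * b expands to a + b + a:b; I() protects arithmetic inside a formula
full_fit <- lm(lncost ~ lncirc * domestic + percmale + I(percmale^2) +
                 medianincome, data = sim)

# Adjusted R-squared penalizes the extra terms
c(base = summary(base_fit)$adj.r.squared,
  full = summary(full_fit)$adj.r.squared)

# F-test of whether the added terms improve the fit
anova(base_fit, full_fit)
```

Plain R-squared can never decrease when terms are added to a nested model, so adjusted R-squared or the F-test is the safer basis for deciding whether the extra terms earn their place.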
Summary
The last model fits the data best, since it explains over 90% of the variation in the one-page ad cost.
Model 3 is also a good candidate, with an R-squared of about 85% after removing outliers (0.845).
Model 1 is also a decent model, with an R-squared of over 80%. Its main advantage is its simple form. The regression equation pagecost = 7179.4408 + 3.9633 × circ - 32.4809 × percmale + 0.7001 × medianincome - 640.5486 × domestic is easy to explain. For example, holding the other predictors fixed, Dr. Sam could raise the ad cost by almost $4,000 by increasing circ by 1,000 units.
Model 3 and Model 6 are relatively complicated, with more involved relationships between the variables. With these models, we cannot easily tell what would happen to the ad cost if we changed a predictor. Besides, Model 6 produces a wider interval for the prediction than Model 3, which should also be considered.
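The simplicity of Model 1 can be seen by plugging numbers into its equation directly. The sketch below uses the coefficients reported above and assumes that circ is recorded in thousands of readers and that an internationally sold magazine has domestic = 0. Under those assumptions, the point estimate for Sam's magazine lands on the midpoint of the 95% interval reported earlier, which suggests the encoding is right.

```r
# Coefficients of Model 1 as reported in the regression equation above.
coefs <- c(intercept = 7179.4408, circ = 3.9633, percmale = -32.4809,
           medianincome = 0.7001, domestic = -640.5486)

# Sam's magazine. Assumed encoding: circ in thousands, international = 0.
new_magazine <- c(intercept = 1, circ = 1800, percmale = 60,
                  medianincome = 80000, domestic = 0)

# Point estimate: sum of coefficient times value
point_estimate <- sum(coefs * new_magazine[names(coefs)])
point_estimate  # about 68372.53

# Effect of raising circ by 1000 units, everything else fixed
effect_1000_circ <- unname(coefs["circ"] * 1000)
effect_1000_circ  # 3963.3

# Midpoint of the reported interval ($41,879.83 to $94,865.15)
interval_midpoint <- (41879.83 + 94865.15) / 2
interval_midpoint  # 68372.49
```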