This image may not relate to this project at all. Source: www.childcarseats.com.au. All images, data and R Script can be found here
This is a short homework assignment for the DSO_530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC. I completed this project with two classmates, He Liu and Kurshal Bhatia. In this assignment, we compare the predictive power of KNN and Logistic Regression.
Prompt
A child car seat company is interested in understanding what factors contribute to sales for one of its products. They have sales data on a particular model of child car seats at different stores inside and outside the United States.
To simplify the analysis, the company considers sales at a store “Satisfactory” if they cover at least 115% of costs at that location (i.e., roughly 15% profit or more) and “Unsatisfactory” if they cover less than 115% of costs at that location (i.e., less than 15% profit).
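As a small illustration of the labeling rule above, the sketch below codes a store as 1 when revenue covers at least 115% of costs and 0 otherwise. The helper name `label_sales` and the revenue and cost figures are hypothetical; the data set itself only contains the resulting 0/1 Sales label.

```r
# Hypothetical helper: the data set ships only the final 0/1 label,
# so the revenue and cost inputs here are made up for illustration.
label_sales <- function(revenue, cost, threshold = 1.15) {
  # 1 = Satisfactory (revenue covers at least 115% of cost), 0 = Unsatisfactory
  as.integer(revenue >= threshold * cost)
}

revenue <- c(120, 110, 240)
cost    <- c(100, 100, 200)
label_sales(revenue, cost)
```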
The data set consists of 11 variables and 400 observations. Each observation corresponds to one of the stores.
| Variable | Description |
| --- | --- |
| Sales | Sales at each store (Satisfactory = 1 or Unsatisfactory = 0) |
| CompPrice | Price charged by competitor’s equivalent product at each store |
| Income | Local community income level (in thousands of dollars) |
| Advertising | Local advertising budget for company at each store (in thousands of dollars) |
| Population | Population size of local community (in thousands) |
| Price | Price company charges for its own product at the store |
| ShelveLoc | A factor with levels (Good=1 and Bad=0) indicating the quality of the shelving location for the car seats at each store |
| Age | Average age of the local community |
| Education | Average education level in the local community |
| Urban | A factor with levels (Yes=1 and No=0) to indicate whether the store is in an urban or rural location |
| US | A factor with levels (Yes=1 and No=0) to indicate whether the store is in the US or not |
Load data
> carseat.data=read.csv("carseat.txt")
> head(carseat.data)
  Sales CompPrice Income Advertising Population Price ShelveLoc Age Education Urban US
1     1       138     73          11        276   120         0  42        17     1  1
2     1       111     48          16        260    83         0  65        10     1  1
3     1       113     35          10        269    80         1  59        12     1  1
4     0       117    100           4        466    97         1  55        14     1  1
5     0       141     64           3        340   128         0  38        13     1  0
6     1       124    113          13        501    72         0  78        16     0  1
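The model fit that follows uses an object named training_data, and the step that creates it is not shown here. Its null deviance has 299 degrees of freedom, which implies 300 training observations out of the 400 stores. The sketch below shows one way such a split could be made. The random sampling scheme, the seed, the helper name `split_train_test` and the name `testing_data` are all assumptions, not the original code, and a stand-in data frame is used so the example runs without carseat.txt.

```r
# Assumption: a simple random 300/100 split. The original split code is not
# shown, so the seed, helper name and sampling scheme are illustrative only.
split_train_test <- function(df, n_train, seed = 1) {
  set.seed(seed)
  train_idx <- sample(nrow(df), n_train)  # row indices drawn without replacement
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# Stand-in data frame with 400 rows; in practice pass carseat.data instead.
toy <- data.frame(id = 1:400, Sales = rep(c(0, 1), 200))
parts <- split_train_test(toy, n_train = 300)
training_data <- parts$train
testing_data  <- parts$test
dim(training_data)
dim(testing_data)
```

Fixing the seed makes the split reproducible, which matters when comparing two classifiers on the same held-out stores.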
> logistic_model=glm(Sales~.,data=training_data, family="binomial")
> summary(logistic_model)

Call:
glm(formula = Sales ~ ., family = "binomial", data = training_data)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.1090  -0.6056  -0.1899   0.4225   2.6784

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.3913119  2.0272717  -0.686 0.492525
CompPrice    0.1133453  0.0183759   6.168 6.91e-10 ***
Income       0.0153623  0.0061970   2.479 0.013175 *
Advertising  0.1481729  0.0391825   3.782 0.000156 ***
Population  -0.0002905  0.0012774  -0.227 0.820126
Price       -0.1175769  0.0150529  -7.811 5.68e-15 ***
ShelveLoc    2.6133149  0.4955569   5.273 1.34e-07 ***
Age         -0.0568264  0.0116597  -4.874 1.09e-06 ***
Education   -0.0451832  0.0641827  -0.704 0.481447
Urban       -0.5172575  0.3888039  -1.330 0.183393
US           0.3059033  0.5093854   0.601 0.548150
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 402.98  on 299  degrees of freedom
Residual deviance: 222.16  on 289  degrees of freedom
AIC: 244.16

Number of Fisher Scoring iterations: 6
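Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to read. The sketch below is an added illustration rather than part of the original assignment. It copies the estimates from the summary above into a named vector (the names `coefs` and `odds_ratios` are just for this example) and converts them. An odds ratio above 1 means the odds of Satisfactory sales rise with a one-unit increase in that variable, holding the others fixed; a value below 1 means they fall.

```r
# Estimates copied from the summary output above (intercept omitted).
coefs <- c(CompPrice = 0.1133453, Income = 0.0153623, Advertising = 0.1481729,
           Population = -0.0002905, Price = -0.1175769, ShelveLoc = 2.6133149,
           Age = -0.0568264, Education = -0.0451832, Urban = -0.5172575,
           US = 0.3059033)

# Odds ratio per one-unit increase in each predictor.
odds_ratios <- exp(coefs)
round(odds_ratios, 3)

# With the fitted object available, exp(coef(logistic_model)) returns the
# same quantities (plus the intercept) directly.
```

Only CompPrice, Income, Advertising, Price, ShelveLoc and Age are significant at the 5% level in the summary, so the odds ratios for the remaining variables should be read with caution.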