KNN vs Logistic Regression in R

Image source: www.childcarseats.com.au (the image may not relate to this project at all). All images, data, and the R script can be found here
This is a short homework assignment from the DSO 530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC. I completed this project with two classmates, He Liu and Kurshal Bhatia. In this assignment, we compare the predictive power of KNN and logistic regression.
Prompt
A child car seat company is interested in understanding what factors contribute to sales for one of its products. They have sales data on a particular model of child car seats at different stores inside and outside the United States.
To simplify the analysis, the company considers sales at a store to be “Satisfactory” if they are able to cover 115% of their costs at that location (i.e., roughly 15% profit) and “Unsatisfactory” if sales cover less than 115% of costs at that location (i.e., less than 15% profit).
The data set consists of 11 variables and 400 observations. Each observation corresponds to one of the stores.
| Variable | Description |
| --- | --- |
| Sales | Sales at each store (Satisfactory = 1, Unsatisfactory = 0) |
| CompPrice | Price charged by the competitor's equivalent product at each store |
| Income | Local community income level (in thousands of dollars) |
| Advertising | Local advertising budget for the company at each store (in thousands of dollars) |
| Population | Population size of the local community (in thousands) |
| Price | Price the company charges for its own product at the store |
| ShelveLoc | A factor with levels (Good = 1, Bad = 0) indicating the quality of the shelving location for the car seats at each store |
| Age | Average age of the local community |
| Education | Average education level in the local community |
| Urban | A factor with levels (Yes = 1, No = 0) indicating whether the store is in an urban or rural location |
| US | A factor with levels (Yes = 1, No = 0) indicating whether the store is in the US or not |
Load data
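A minimal sketch of loading the data, assuming it sits in a CSV file named carseats.csv (the file name and location are assumptions):

```r
# Read the car seat sales data; the file name is an assumption
carseats <- read.csv("carseats.csv")

# Quick sanity checks: expect 400 observations and 11 variables
dim(carseats)
str(carseats)
```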
Create the validation set and training set
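One way the split might look; the 70/30 proportion and the random seed are assumptions rather than the exact split used in the assignment:

```r
set.seed(1)  # assumed seed, for reproducibility

n <- nrow(carseats)
train_idx <- sample(1:n, size = round(0.7 * n))  # assumed 70/30 split

train <- carseats[train_idx, ]   # training set
test  <- carseats[-train_idx, ]  # validation (testing) set
```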
Train the logistic regression model
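A sketch of fitting the logistic regression on the training set, here using all predictors (the exact model formula used in the assignment may differ):

```r
# Logistic regression of Sales (1 = Satisfactory, 0 = Unsatisfactory) on all other variables
logit_model <- glm(Sales ~ ., data = train, family = binomial)
summary(logit_model)

# Predicted probabilities of "Satisfactory" on the training set
train_prob <- predict(logit_model, type = "response")
```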
Find the cutoff value
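A sketch of a simple grid search over cutoff values on the training set; the grid of candidate cutoffs is an assumption, and logit_model and train_prob come from the previous step:

```r
cutoffs <- seq(0.05, 0.95, by = 0.05)  # assumed grid of candidate cutoffs

# Training misclassification rate for each cutoff
misclass <- sapply(cutoffs, function(c) {
  pred <- ifelse(train_prob > c, 1, 0)
  mean(pred != train$Sales)
})

# Cutoff with the lowest training misclassification rate
best_cutoff <- cutoffs[which.min(misclass)]
best_cutoff
min(misclass)
```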
The misclassification rate is lowest, at 14.3%, when the cutoff value is 0.55. We will use this value to predict on the testing set.
Create confusion matrix on testing set
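A sketch of scoring the testing set with the 0.55 cutoff and tabulating predicted against actual classes:

```r
# Predicted probabilities and classes on the testing set, using cutoff 0.55
test_prob <- predict(logit_model, newdata = test, type = "response")
test_pred <- ifelse(test_prob > 0.55, 1, 0)

# Confusion matrix: rows = predicted class, columns = actual class
conf_mat <- table(Predicted = test_pred, Actual = test$Sales)
conf_mat

# Testing-set misclassification rate
mean(test_pred != test$Sales)
```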
The misclassification error for the testing set is 20%, somewhat higher than the 14.3% training error, which is expected since the cutoff was chosen on the training set. This is still a solid result.
False positive and false negative rates
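From the confusion matrix above, the two rates can be computed roughly as follows, treating 1 = Satisfactory as the positive class (that convention is an assumption):

```r
# False positive rate: unsatisfactory stores predicted as satisfactory
fpr <- conf_mat["1", "0"] / sum(conf_mat[, "0"])

# False negative rate: satisfactory stores predicted as unsatisfactory
fnr <- conf_mat["0", "1"] / sum(conf_mat[, "1"])

fpr
fnr
```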
KNN Model
First we need the "class" package to run k-nearest neighbour classification. It requires the response variable to be a factor.
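A sketch of the setup, assuming the same train/test split as above; standardizing the predictors before calling knn() is an extra assumption, since KNN is sensitive to variable scales:

```r
library(class)

# knn() requires the class labels to be a factor
train_y <- as.factor(train$Sales)
test_y  <- as.factor(test$Sales)

# Numeric predictor matrices, standardized with the training-set means and SDs
predictors <- setdiff(names(carseats), "Sales")
train_x <- scale(data.matrix(train[, predictors]))
test_x  <- scale(data.matrix(test[, predictors]),
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# Example call with k = 5; k is tuned in the next step
knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
```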
Find k to minimize misclassification rate
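A sketch of trying a range of k values and keeping the one with the lowest testing-set misclassification rate; the candidate range of 1 to 20 is an assumption:

```r
ks <- 1:20  # assumed range of candidate k values

knn_misclass <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred != test_y)
})

# k with the lowest misclassification rate on the testing set
best_k <- ks[which.min(knn_misclass)]
best_k
min(knn_misclass)
```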
Misclassification rate and confusion matrix
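With the chosen k, the confusion matrix and misclassification rate for KNN are computed the same way as for logistic regression:

```r
# KNN predictions on the testing set with the selected k
knn_best <- knn(train = train_x, test = test_x, cl = train_y, k = best_k)

knn_conf <- table(Predicted = knn_best, Actual = test_y)
knn_conf

# Testing-set misclassification rate for KNN
mean(knn_best != test_y)
```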
False positive and false negative rates
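And the corresponding rates for KNN, again treating 1 = Satisfactory as the positive class:

```r
knn_fpr <- knn_conf["1", "0"] / sum(knn_conf[, "0"])  # false positive rate
knn_fnr <- knn_conf["0", "1"] / sum(knn_conf[, "1"])  # false negative rate
knn_fpr
knn_fnr
```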
Compared with logistic regression, KNN has a higher misclassification rate. In particular, the false negative rate is substantially high, at 62.2%.