KNN vs Logistic Regression in R


This image may not relate to this project at all. Source: www.childcarseats.com.au. All images, data and R Script can be found here

This is a short homework assignment for the DSO_530 Applied Modern Statistical Learning Methods class taught by Professor Robertas Gabrys at USC. I completed this project with two classmates, He Liu and Kurshal Bhatia. In this assignment, we compare the predictive power of KNN and logistic regression.

Prompt

A child car seat company is interested in understanding what factors contribute to sales for one of its products. They have sales data on a particular model of child car seats at different stores inside and outside the United States.

To simplify the analysis, the company considers sales at a store to be “Satisfactory” if they are able to cover 115% of their costs at that location (i.e., roughly 15% profit) and “Unsatisfactory” if sales cover less than 115% of costs at that location (i.e., less than 15% profit).

The data set consists of 11 variables and 400 observations. Each observation corresponds to one of the stores.

| Variables | Description |
| --- | --- |
| Sales | Sales at each store (Satisfactory = 1 or Unsatisfactory = 0) |
| CompPrice | Price charged by competitor’s equivalent product at each store |
| Income | Local community income level (in thousands of dollars) |
| Advertising | Local advertising budget for company at each store (in thousands of dollars) |
| Population | Population size of local community (in thousands) |
| Price | Price company charges for its own product at the store |
| ShelveLoc | A factor with levels (Good=1 and Bad=0) indicating the quality of the shelving location for the car seats at each store |
| Age | Average age of the local community |
| Education | Average Education level in the local community |
| Urban | A factor with levels (Yes=1 and No=0) to indicate whether the store is in an urban or rural location |
| US | A factor with levels (Yes=1 and No=0) to indicate whether the store is in the US or not |

Load data

Create the validation set and training set
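A minimal sketch of a random split, since the original script is not reproduced here. The 70/30 proportion, the seed, and the placeholder data frame `carseats` (with only an `id` column) are assumptions for illustration; only the total of 400 stores comes from the prompt.

```r
set.seed(530)                            # arbitrary seed, for reproducibility only
n <- 400                                 # number of stores, as stated in the prompt
carseats <- data.frame(id = seq_len(n))  # placeholder standing in for the real data

# Assumed 70/30 split: draw row indices for the training set without replacement
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- carseats[train_idx, , drop = FALSE]
test  <- carseats[-train_idx, , drop = FALSE]

nrow(train)  # 280
nrow(test)   # 120
```

Negative indexing with `-train_idx` guarantees that no store appears in both sets.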

Train the logistic regression model
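The fitting step could look roughly like the sketch below. Because the real data is not included here, it runs on simulated data: the data frame `sim`, its coefficients, the choice of three predictors, and the 280-row training set are all assumptions, so the estimates will not match the assignment's results.

```r
set.seed(530)
n <- 400
# Simulated stand-in for three of the predictors described in the prompt
sim <- data.frame(
  Price       = rnorm(n, mean = 115, sd = 24),
  Advertising = rpois(n, lambda = 6),
  ShelveLoc   = rbinom(n, 1, 0.5)
)
# Made-up relationship: lower price, more advertising, good shelf raise the odds
eta <- 4 - 0.05 * sim$Price + 0.15 * sim$Advertising + 1.5 * sim$ShelveLoc
sim$Sales <- rbinom(n, 1, plogis(eta))

train_idx <- sample(seq_len(n), size = 280)
train <- sim[train_idx, ]

# Logistic regression: binomial family with the default logit link
fit <- glm(Sales ~ Price + Advertising + ShelveLoc, data = train, family = binomial)
summary(fit)$coefficients

# Fitted probabilities of "Satisfactory" (Sales = 1) for the training stores
train_prob <- predict(fit, type = "response")
range(train_prob)
```

With the real data, the formula would list whichever of the ten predictors the model keeps.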

Find the cutoff value

The misclassification rate is lowest, at 14.3%, when the cutoff value is 0.55. We will use this value to predict on the testing set.
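The cutoff search described above can be sketched as a simple grid search. This is an illustration on simulated data (the `sim` data frame, its coefficients, and the grid of cutoffs from 0.05 to 0.95 are assumptions), so the best cutoff it finds will generally differ from 0.55.

```r
set.seed(530)
n <- 400
sim <- data.frame(
  Price       = rnorm(n, mean = 115, sd = 24),
  Advertising = rpois(n, lambda = 6),
  ShelveLoc   = rbinom(n, 1, 0.5)
)
eta <- 4 - 0.05 * sim$Price + 0.15 * sim$Advertising + 1.5 * sim$ShelveLoc
sim$Sales <- rbinom(n, 1, plogis(eta))   # simulated response, not the real data

train <- sim[sample(seq_len(n), size = 280), ]
fit <- glm(Sales ~ Price + Advertising + ShelveLoc, data = train, family = binomial)
train_prob <- predict(fit, type = "response")

# For each candidate cutoff, classify as 1 when the fitted probability exceeds it
cutoffs <- seq(0.05, 0.95, by = 0.05)
misclass <- sapply(cutoffs, function(cut) {
  mean(as.integer(train_prob > cut) != train$Sales)
})

data.frame(cutoff = cutoffs, misclass = round(misclass, 3))
best_cutoff <- cutoffs[which.min(misclass)]  # first cutoff with the lowest error
best_cutoff
```

Note that `which.min` returns the first minimum, so ties between cutoffs are resolved in favour of the smaller value.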

Create confusion matrix on testing set

The misclassification error for the testing set is 20%, higher than the 14.3% obtained on the training set at the chosen cutoff. Some increase is expected on data the model has not seen, so this is still a good result.
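A sketch of how the test-set confusion matrix and error rate can be computed. The cutoff of 0.55 is the value reported above; everything else (the simulated `sim` data, the three predictors, the 280/120 split) is assumed for illustration, so the numbers printed will not be the 20% quoted in the text.

```r
set.seed(530)
n <- 400
sim <- data.frame(
  Price       = rnorm(n, mean = 115, sd = 24),
  Advertising = rpois(n, lambda = 6),
  ShelveLoc   = rbinom(n, 1, 0.5)
)
eta <- 4 - 0.05 * sim$Price + 0.15 * sim$Advertising + 1.5 * sim$ShelveLoc
sim$Sales <- rbinom(n, 1, plogis(eta))   # simulated response, not the real data

train_idx <- sample(seq_len(n), size = 280)
train <- sim[train_idx, ]
test  <- sim[-train_idx, ]
fit <- glm(Sales ~ Price + Advertising + ShelveLoc, data = train, family = binomial)

cutoff <- 0.55                            # cutoff reported in the text
test_prob <- predict(fit, newdata = test, type = "response")
test_pred <- as.integer(test_prob > cutoff)

# Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted
conf_mat <- table(Predicted = factor(test_pred, levels = 0:1),
                  Actual    = factor(test$Sales, levels = 0:1))
conf_mat

test_error <- mean(test_pred != test$Sales)  # share of misclassified test stores
test_error
```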

False positive and false negative rates
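The two rates can be read off a confusion matrix with a small helper like the one below. The function name `error_rates` is hypothetical, the toy vectors are made up, and the definitions used here (false positives over actual negatives, false negatives over actual positives) are the common convention, assumed rather than taken from the original script.

```r
# Hypothetical helper: false positive and false negative rates from 0/1 vectors
error_rates <- function(actual, predicted) {
  tab <- table(Predicted = factor(predicted, levels = 0:1),
               Actual    = factor(actual, levels = 0:1))
  fp <- tab["1", "0"]  # predicted Satisfactory, actually Unsatisfactory
  fn <- tab["0", "1"]  # predicted Unsatisfactory, actually Satisfactory
  tn <- tab["0", "0"]
  tp <- tab["1", "1"]
  c(false_positive_rate = fp / (fp + tn),
    false_negative_rate = fn / (fn + tp))
}

# Toy example with made-up labels: 4 actual negatives, 6 actual positives
actual    <- c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1)
predicted <- c(0, 0, 0, 1, 1, 1, 1, 0, 0, 1)
rates <- error_rates(actual, predicted)
rates  # false_positive_rate = 0.25, false_negative_rate = 1/3
```

Here one of the four actual negatives is flagged as positive (0.25), and two of the six actual positives are missed (about 0.33).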

KNN Model

First, we need the "class" package to run k-nearest neighbour classification. Its knn() function requires the response variable to be a factor.
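A minimal sketch of a `knn()` call from the "class" package, with the response converted to a factor. The simulated `sim` data, the value `k = 5`, and the standardisation of predictors are assumptions for illustration; the original script may not have scaled the inputs.

```r
library(class)  # provides knn()

set.seed(530)
n <- 400
sim <- data.frame(
  Price       = rnorm(n, mean = 115, sd = 24),
  Advertising = rpois(n, lambda = 6),
  ShelveLoc   = rbinom(n, 1, 0.5)
)
eta <- 4 - 0.05 * sim$Price + 0.15 * sim$Advertising + 1.5 * sim$ShelveLoc
sim$Sales <- rbinom(n, 1, plogis(eta))   # simulated response, not the real data

predictors <- c("Price", "Advertising", "ShelveLoc")
train_idx <- sample(seq_len(n), size = 280)

# Assumption: standardise with training means and SDs, because KNN uses distances
train_x <- scale(sim[train_idx, predictors])
test_x  <- scale(sim[-train_idx, predictors],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))

# knn() takes the training labels as a factor
train_y <- factor(sim$Sales[train_idx], levels = c(0, 1))

knn_pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
is.factor(knn_pred)
table(Predicted = knn_pred, Actual = sim$Sales[-train_idx])
```

The predictions come back as a factor with the same levels as `train_y`.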

Find k to minimize misclassification rate
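One way to pick k is to loop over candidate values and keep the one with the lowest error on held-out data. The sketch below uses simulated data and a range of k from 1 to 20; both are assumptions, as is the choice of which held-out set to score against.

```r
library(class)

set.seed(530)
n <- 400
sim <- data.frame(
  Price       = rnorm(n, mean = 115, sd = 24),
  Advertising = rpois(n, lambda = 6),
  ShelveLoc   = rbinom(n, 1, 0.5)
)
eta <- 4 - 0.05 * sim$Price + 0.15 * sim$Advertising + 1.5 * sim$ShelveLoc
sim$Sales <- rbinom(n, 1, plogis(eta))   # simulated response, not the real data

predictors <- c("Price", "Advertising", "ShelveLoc")
train_idx <- sample(seq_len(n), size = 280)
train_x <- scale(sim[train_idx, predictors])
valid_x <- scale(sim[-train_idx, predictors],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
train_y <- factor(sim$Sales[train_idx], levels = c(0, 1))
valid_y <- factor(sim$Sales[-train_idx], levels = c(0, 1))

# Misclassification rate on the held-out stores for each candidate k
ks <- 1:20
knn_error <- sapply(ks, function(k) {
  pred <- knn(train = train_x, test = valid_x, cl = train_y, k = k)
  mean(pred != valid_y)
})

data.frame(k = ks, misclass = round(knn_error, 3))
best_k <- ks[which.min(knn_error)]
best_k
```

Because `knn()` breaks ties at random, setting a seed keeps the error curve reproducible.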

Misclassification rate and confusion matrix

False positive and false negative rates

Compared with logistic regression, KNN has a higher misclassification rate. In particular, its false negative rate is very high, at 62.2%.
