Cleaning Data in R, Average daily attendance (ADA) USA

Data screenshot

All images, data and R Script can be found here

This is a case in Importing & Cleaning Data in R course of DataCamp. The course has its own solution, but I made some adjustments to make the solution scalable.

A lot of times, we have to import data from Excel and make it ready for analysis. Very often, the cleaning step takes a lot more time than the analysis itself. The data structure in Excel with high level of complexity is very helpful for reading the spreadsheet, but it makes life harder when it comes to cleaning and organizing. This short case is a good practice to get myself familiar to this kind of problem.

Import Data

When you're importing a messy spreadsheet into R, it's good practice to compare the original spreadsheet with what you've imported. By comparing with the screenshot, I have these observations:

  • The data is surely messy

  • read_xls() function actually imported the first row of the original data frame as the variable name, and that's not what you want in this case

  • The names of variables have to be added manually

  • Many rows and columns contain useless or missing data

Cleaning

I first removed rows with NA values, and columns with useless data. I also assigned names to varibles.

I also noticed that state names are all stored as the same number of characters, with periods padding the ends of the shorter states. This technique is helpful in many cases, but here, we need to get rid of it by removing all dot (.) in the state names.

Make sure you apply names before you run str_replace_all function, because if you call str_replace_all on att2[1], instead of att2$state, things will be messy. att2[1] is a dataframe, while att2$state is a vector.

In many cases, a single data frame stores multiple "tables" of information. In this data frame, columns 1, 6, and 7 represent attendance data for US elementary schools, columns 1, 8, and 9 represent data for secondary schools, and columns 1 through 5 represent data for all schools in the US.

Each of these should be stored as its own separate data frame, so I split them up here.

It's also noticeable that numerical data has come in as character strings. I can think of two ways to coerce strings into numbers.

And now the data is good for analysis.

Last updated