This is a case from the Importing & Cleaning Data in R course on DataCamp. The course provides its own solution, but I made some adjustments to make the solution scalable.
We often have to import data from Excel and make it ready for analysis, and very often the cleaning step takes far more time than the analysis itself. A highly structured Excel layout is helpful for reading the spreadsheet, but it makes life harder when it comes to cleaning and organizing the data. This short case is good practice for getting familiar with this kind of problem.
Import Data
> att <- read_xls("attendance.xls")
> head(att, n = 15)
# A tibble: 15 x 17
   `Table 43. Average d~` X__1       X__2 X__3     X__4
   <chr>                  <chr>     <dbl> <chr>   <dbl>
 1 NA                     Total e~     NA NA         NA
 2 NA                     ADA as ~     NA Avera~     NA
 3 1                      2            NA 3          NA
 4 United States ......~  93.0789~  0.219 6.644~ 0.0176
 5 Alabama ............~  93.8123~  1.24  7.028~ 0.0656
 6 Alaska .............~  89.9175~  1.22  6.476~ 0.0499
 7 Arizona ............~  89.0369~  2.95  6.433~ 0.0919
 8 Arkansas ...........~  91.8271~  1.35  6.885~ 0.0615
 9 California .........~  93.2410~  0.712 6.240~ 0.0664
10 NA                     NA           NA NA         NA
11 Colorado ...........~  93.8871~  0.440 7.007~ 0.0520
12 Connecticut ........~  87.9258~  2.98  6.466~ 0.0878
13 Delaware ...........~  89.7705~  1.75  6.678~ 0.0896
14 District of Columbia~  91.2208~  1.27  6.913~ 0.214
15 Florida ............~  92.6781~  0.741 6.427~ 0.0841
# ... with 12 more variables: X__5 <chr>, X__6 <dbl>, X__7 <chr>,
#   X__8 <dbl>, X__9 <chr>, X__10 <dbl>, X__11 <chr>, X__12 <dbl>,
#   X__13 <chr>, X__14 <chr>, X__15 <chr>, X__16 <chr>
> tail(att, n = 10)
# A tibble: 10 x 17
   `Table 43. Average dai~` X__1     X__2 X__3     X__4
   <chr>                    <chr>   <dbl> <chr>   <dbl>
 1 Vermont ..............~  92.72~  3.39  6.656~ 0.0740
 2 Virginia .............~  94.67~  0.460 6.617~ 0.0502
 3 Washington ...........~  82.85~  3.06  6.222~ 0.0804
 4 West Virginia ........~  94.00~  0.988 6.873~ 0.0701
 5 Wisconsin ............~  94.95~  0.566 6.911~ 0.0427
 6 Wyoming ..............~  92.35~  1.15  6.858~ 0.0458
 7 †Not applicable.         NA        NA  NA        NA
 8 ‡Reporting standards n~  NA        NA  NA        NA
 9 NOTE: Averages reflect~  NA        NA  NA        NA
10 "SOURCE: U.S. Departme~  NA        NA  NA        NA
# ... with 12 more variables: X__5 <chr>, X__6 <dbl>, X__7 <chr>,
#   X__8 <dbl>, X__9 <chr>, X__10 <dbl>, X__11 <chr>, X__12 <dbl>,
#   X__13 <chr>, X__14 <chr>, X__15 <chr>, X__16 <chr>
When you're importing a messy spreadsheet into R, it's good practice to compare the original spreadsheet with what you've imported. By comparing with the screenshot, I have these observations:
The data is surely messy
The read_xls() function imported the first row of the original spreadsheet as variable names, which is not what we want in this case
The names of variables have to be added manually
Many rows and columns contain useless or missing data
Cleaning
First, I removed rows dominated by NA values and columns containing useless data, then assigned names to the variables.
> # Remove rows with more than 5 NA values
> # Remove columns 3, 5, 7, 9, 11, 13, 15, 17; they contain useless data
> nacount <- is.na(att)
> removerow <- rowSums(nacount) > 5
> removecolumn <- seq(3, 17, 2)
> att2 <- att[!removerow, -removecolumn]
> # Define cnames vector
> cnames <- c("state", "avg_attend_pct", "avg_hr_per_day", "avg_day_per_yr",
+             "avg_hr_per_yr", "avg_attend_pct", "avg_hr_per_day",
+             "avg_attend_pct", "avg_hr_per_day")
> # Assign column names of att2
> colnames(att2) <- cnames
> head(att2)
# A tibble: 6 x 9
  state      avg_attend_pct  avg_hr_per_day  avg_day_per_yr avg_hr_per_yr
  <chr>      <chr>           <chr>           <chr>          <chr>
1 United St~ 93.07896200000~ 6.644700000000~ 180            1192.647200000~
2 Alabama .~ 93.81237099999~ 7.028520000000~ 180            1266.6205
3 Alaska ..~ 89.91759700000~ 6.476880000000~ 180            1162.9084
4 Arizona .~ 89.03696100000~ 6.433690000000~ 181            1159.114399999~
5 Arkansas ~ 91.82711100000~ 6.885419999999~ 179            1228.888099999~
6 Californi~ 93.24101699999~ 6.24064         181            1128.769399999~
# ... with 4 more variables: avg_attend_pct <chr>, avg_hr_per_day <chr>,
#   avg_attend_pct <chr>, avg_hr_per_day <chr>
I also noticed that the state names are all stored as the same number of characters, with periods padding the ends of the shorter names. This technique is helpful in many cases, but here we need to get rid of it by removing all dots (.) from the state names.
Make sure you call str_replace_all on att2$state rather than att2[1], because otherwise things will get messy: att2[1] is a one-column data frame, while att2$state is a character vector, which is what str_replace_all expects.
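On a small vector, the dot-removal step looks like the sketch below (the sample names are made up; wrapping the pattern in fixed() is my choice so that the dot is matched literally rather than as the regex "any character" wildcard):

```r
library(stringr)

# Made-up sample mimicking the period-padded state names
states <- c("Alabama .............", "Alaska ..............")

# Remove every literal dot, then trim the leftover trailing space
states_clean <- str_trim(str_replace_all(states, fixed("."), ""))
states_clean
# [1] "Alabama" "Alaska"
```

Without fixed() (or the equivalent escaped pattern "\\."), str_replace_all would treat "." as matching every character and wipe out the names entirely.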
In many cases, a single data frame stores multiple "tables" of information. In this data frame, columns 1, 6, and 7 represent attendance data for US elementary schools, columns 1, 8, and 9 represent data for secondary schools, and columns 1 through 5 represent data for all schools in the US.
Each of these should be stored as its own separate data frame, so I split them up here.
# Subset just elementary schools: att_elem
att_elem <- att2[, c(1, 6, 7)]
# Subset just secondary schools: att_sec
att_sec <- att2[, c(1, 8, 9)]
# Subset all schools: att4
att4 <- att2[, 1:5]
It's also noticeable that the numerical data came in as character strings. I can think of two ways to coerce these strings into numbers.
> att4 <- mutate_at(att4, vars(-state), funs(as.numeric))
> str(att4)
Classes 'tbl_df', 'tbl' and 'data.frame': 52 obs. of 5 variables:
 $ state         : chr  "United States" "Alabama" "Alaska" "Arizona" ...
 $ avg_attend_pct: num  93.1 93.8 89.9 89 91.8 ...
 $ avg_hr_per_day: num  6.64 7.03 6.48 6.43 6.89 ...
 $ avg_day_per_yr: num  180 180 180 181 179 181 171 181 181 181 ...
 $ avg_hr_per_yr : num  1193 1267 1163 1159 1229 ...
> # You can also use sapply to do this:
> # cols <- 2:ncol(att4)
> # att4[, cols] <- sapply(att4[, cols], as.numeric)
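A side note: funs() has since been deprecated in dplyr (from version 0.8), and mutate_at() itself has been superseded by across(). A sketch of the same coercion with current dplyr, on a toy tibble standing in for att4 (the column names and values here are made up for illustration):

```r
library(dplyr)

# Toy frame mimicking att4: numbers stored as character strings
att4 <- tibble(
  state          = c("Alabama", "Alaska"),
  avg_attend_pct = c("93.8", "89.9"),
  avg_hr_per_day = c("7.03", "6.48")
)

# Modern dplyr: coerce every column except state to numeric
att4 <- att4 %>% mutate(across(-state, as.numeric))
str(att4)
```

The across() form takes the same tidyselect specification (-state) but lives inside an ordinary mutate() call, so one verb covers both single-column and many-column transformations.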