Cleaning Data with stringr

08 Oct 2017

Introduction

Occassionally, when we are using a set of data, there may be mistakes. Before we can really make any decisions or visualizations using that data, we must ensure it is as accurate as possible. In this post, we will look at using the R library stringr to filter out unwanted data.

Basic Example

For this example, we will be using a data set from (Kaggle.com)[https://www.kaggle.com/orangutan/exploratory-analysis-of-the-korean-war] that contains all the US deaths from the Korean War. First, we import our libraries and look at the first few rows of data.

library(dplyr)
library(stringr)

#read in the data
deaths<-read.csv("KoreanConflict.csv", header=TRUE, stringsAsFactors=FALSE)

head(deaths)

##   SERVICE_TYPE SERVICE_CODE             ENROLLMENT    BRANCH RANK
## 1            V            L ACTIVE - GUARD/RESERVE AIR FORCE CAPT
## 2            R            K       ACTIVE - REGULAR      ARMY  PVT
## 3            R            K       ACTIVE - REGULAR      ARMY  PFC
## 4            V            L ACTIVE - GUARD/RESERVE      ARMY  2LT
## 5            R            K       ACTIVE - REGULAR      ARMY  CPL
## 6            R            K       ACTIVE - REGULAR      ARMY  PFC
##   PAY_GRADE                      POSITION BIRTH_YEAR SEX HOME_CITY
## 1       O03                                     1917   M  NEW YORK
## 2       E02       FOOD SERVICE APPRENTICE       1927   M   UNKNOWN
## 3       E03     HEAVY WEAPONS INFANTRYMAN       1932   M   UNKNOWN
## 4       O01       INFANTRY UNIT COMMANDER       1929   M   UNKNOWN
## 5       E04 LIGHT WEAPONS INFANTRY LEADER       1932   M   UNKNOWN
## 6       E03     LIGHT WEAPONS INFANTRYMAN       1929   M   UNKNOWN
##   HOME_COUNTY NATIONALITY STATE_CODE   HOME_STATE MARITAL_STATUS ETHNICITY
## 1    NEW YORK          US         NY     NEW YORK        MARRIED     WHITE
## 2      OCONEE          US         GA      GEORGIA        UNKNOWN     WHITE
## 3        BIBB          US         GA      GEORGIA        UNKNOWN     WHITE
## 4     COAHOMA          US         MS  MISSISSIPPI        UNKNOWN     WHITE
## 5      DICKEY          US         ND NORTH DAKOTA        UNKNOWN     WHITE
## 6      NELSON          US         ND NORTH DAKOTA        UNKNOWN     WHITE
##     ETHNICITY_1 ETHNICITY_2              DIVISION INCIDENT_DATE
## 1 NOT SPECIFIED       WHITE 93 BOMB SQ 19 BOMB GP      19510412
## 2 NOT SPECIFIED       WHITE      29 RGT CMBT TEAM      19500727
## 3 NOT SPECIFIED       WHITE       5 RGT 1 CAV DIV      19510316
## 4 NOT SPECIFIED       WHITE          32 INF 7 DIV      19530122
## 5 NOT SPECIFIED       WHITE         14 INF 25 DIV      19530529
## 6 NOT SPECIFIED       WHITE       5 RGT 1 CAV DIV      19510606
##   FATALITY_YEAR FATALITY_DATE HOSTILITY_CONDITIONS         FATALITY
## 1          1951      20010402                    H    DECLARED DEAD
## 2          1950      19500727                    H KILLED IN ACTION
## 3          1951      19510316                    H KILLED IN ACTION
## 4          1953      19530122                    H KILLED IN ACTION
## 5          1953      19530529                    H KILLED IN ACTION
## 6          1951      19510606                    H KILLED IN ACTION
##   BURIAL_STATUS
## 1             Y
## 2             Y
## 3             Y
## 4             Y
## 5             Y
## 6             Y

It appears that the INCIDENT_DATE field is formatted like so: YYYYMMDD. To check this, let’s compare the total number of rows in the database with how many rows fit that format. To keep things simple, we will use a regular expression to check that the data contains only 8 numbers.

#Get the total number of records int the data
count(deaths) #36,574

## # A tibble: 1 x 1
##       n
##   <int>
## 1 36574

#compare INCIDENT_DATE against the regular expression and get the count of records
fullDateRegEx = "^\\d{8}$"

df<-deaths%>%
  filter(str_detect(deaths$INCIDENT_DATE, fullDateRegEx)==TRUE)
count(df) #33,370

## # A tibble: 1 x 1
##       n
##   <int>
## 1 33370

Looking further into the data, we can see that the FATALITY column also appears to hold a date in the format we want, and only shows that when INCIDENT_DATE is invalid. Using a simple for loop, we can check to see if FATALITY is valid when INCIDENT_DATE is not, and store the date in INCIDENT_DATE

for (i in 1:dim(deaths)) {
  #if INCIDENT_DATE is bad, but FATALITY is good, replace it
  if (!str_detect(deaths$INCIDENT_DATE[i], fullDateRegEx) & str_detect(deaths$FATALITY[i], fullDateRegEx)) {
    #print(paste(i, "There is a mistake here"))
    deaths$INCIDENT_DATE[i]<-deaths$FATALITY[i]
  }
}

## Warning in 1:dim(deaths): numerical expression has 2 elements: only the
## first used

#Check out counts again
df<-deaths%>%
  filter(str_detect(deaths$INCIDENT_DATE, fullDateRegEx)==TRUE)
count(df) #36,511

## # A tibble: 1 x 1
##       n
##   <int>
## 1 36511

This is an acceptable number of valid records (36,511 out of 36,574), so we can just filter out the remaining bad data and begin to use our data for visualizations and decision making.

deaths<-deaths%>%
  filter(str_detect(deaths$INCIDENT_DATE, fullDateRegEx)==TRUE)

Visualization

Now we can plot out a simple line chart of deaths by date using this newly cleaned data set.

library(ggplot2)
library(lubridate)  #for parsing and dealing with dates

deathsDf<-deaths%>%
  group_by(INCIDENT_DATE)%>%
  summarize(num_deaths=n())%>%
  mutate(date=ymd(INCIDENT_DATE))%>%
  filter(date<='1953-07-27')%>%
  select(INCIDENT_DATE, date, num_deaths)

#look at number of deaths by date
ggplot()+
  geom_line(data=deathsDf, aes(x=date,y=num_deaths))+
  ggtitle("Korean War Deaths by Date")+
  xlab("Date")+
  ylab("Number of Deaths")+
  scale_x_date(date_breaks="3 months", date_labels="%b %y")