Sunday, September 15, 2013

Cleaning Address Fields with R String Functions

I have been surprised by the lack of open source solutions for cleaning addresses, so I decided to look into the R statistical programming environment. R, like other statistical programs, has string/character functions that allow for splitting fields. The stringr R package is also very helpful.
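As a rough illustration of splitting a field, here is a minimal base R sketch that separates a leading house number from the rest of the address line. The sample addresses, the pattern, and the variable names are all made up for this example, and real data will have cases (missing numbers, unit numbers first, and so on) that this simple pattern does not handle.

```r
# Hypothetical sample addresses, invented for illustration only
addresses <- c("123 Main St", "4567 N Oak Avenue", "89 Elm Rd Apt 2")

# Assumed layout: optional leading spaces, a run of digits, then everything else
pattern <- "^[[:space:]]*([0-9]+)[[:space:]]+(.*)$"

# regexec() records the full match plus each captured group;
# regmatches() extracts them as one character vector per address
parts <- regmatches(addresses, regexec(pattern, addresses))

# Take group 1 (house number) and group 2 (street); NA when the pattern fails
house_number <- vapply(parts, function(p) if (length(p) == 3) p[2] else NA_character_, character(1))
street       <- vapply(parts, function(p) if (length(p) == 3) p[3] else NA_character_, character(1))

split_addresses <- data.frame(house_number, street, stringsAsFactors = FALSE)
print(split_addresses)
```

The stringr package offers similar pattern-matching helpers if you prefer its interface; base R is used here only so the sketch runs without extra packages.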

Dirty address fields can be a symptom of problems with data collection (lack of defined fields, lack of standardization, minor errors) or of something as simple as typographical errors, which can be compounded over time.

These mistakes can affect the matching of addresses to reference datasets and, ultimately, any analysis that is performed. If addresses are collected poorly enough, analysis may not be possible at all, or the results may be too unreliable to interpret.

Before geocoding addresses, it is best to get the data as "clean" as possible. If you have a database set up properly, with validation rules or warning messages about potential conflicts in place whether data are entered by automation or by hand, then you should be in relatively good shape.
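When the data did not come through that kind of validation, a small normalizing function is one place to start. The sketch below is only an assumption about what "clean" might mean for a given reference dataset: it uppercases the text, replaces punctuation with spaces, collapses repeated whitespace, and abbreviates a few street suffixes. The function name clean_address and the tiny suffix lookup are hypothetical, and stripping all punctuation would be wrong for addresses that rely on it (hyphenated house numbers, fractions such as 1/2).

```r
# Hypothetical helper: normalize free-text addresses before matching or geocoding
clean_address <- function(x) {
  x <- toupper(x)                         # make case consistent
  x <- gsub("[[:punct:]]", " ", x)        # replace punctuation with a space
  x <- gsub("[[:space:]]+", " ", x)       # collapse runs of whitespace
  x <- trimws(x)                          # drop leading/trailing spaces

  # Illustrative suffix lookup only; a real list would be much longer
  suffixes <- c(STREET = "ST", AVENUE = "AVE", ROAD = "RD")
  for (full in names(suffixes)) {
    # \\b limits the replacement to whole words
    x <- gsub(paste0("\\b", full, "\\b"), suffixes[[full]], x, perl = TRUE)
  }
  x
}

# Made-up inputs showing the kinds of inconsistencies described above
dirty <- c("  123  main street. ", "45 Oak Avenue,  Apt #2")
cleaned <- clean_address(dirty)
print(cleaned)
```

Keeping the original field alongside the cleaned one makes it easier to check what the function changed.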

Hopefully, in the coming weeks, I will have some sample data and R code posted illustrating common problems and solutions.