Monday, January 19, 2015

Using R to Prepare a Case File for SatScan

SaTScan requires several different types of files for analysis: 1)  A case file with a column for the geographic unit. day, month or year (see documentation), and number of cases.  You can aggregate the data into any geographic unit--large or small. 2) A geographic coordinate file (cartesian or lat/long) with the name of the unit (i.e. census tract), x and y for centroids of the geographic units, and 3) population file with the estimated population over the time period-- by year.

In this post, I will describe creating a case file using code in R.  The goal is to create a sum of homicides by month, year (just 2013 for this example), and police beat/post.  We won't worry about any other specifics (i.e. degree) or related types of crimes, i.e. shootings.

To ready yourself for data preparation, read Richard Block's tutorial or the more extensive SatScan manual.

I use crime data from Chicago's Open Data Portal.  The same code can be applied to other types of data, health data, etc.  A few key points: 1) the data contains victim-based data--which we want to convert into incidents. 2) not every post has a homicide, and 3) the reference post list contains 275 post.  So, we will end up with a data set with 3300 rows (275 x 12 months) or simply a row for each post-month.

If you want to skip ahead and just look at the code, go to: http://goo.gl/pmOi1u.


At the top: What you start with.  Bottom: After processing in R

Overview of Steps: See the code for further details

Step #1: Two files are imported: 1) a victim-based file of all crimes, which is narrowed down to just homicides (you could also add in shootings) and 2) a 'reference' file or simply a list of the police beats/posts in Chicago.

Step #2:  The data are summed up so that each row contains the total number of victims, then grouped again into incidents by using two different count variables.

Step #3: The list of police beats get column variables for each month in the year and expanded by reshaping data from wide to long.  This serves as a 'reference list' for matching purposes.

Step #4:  The two data sets are matched the 'unmatched' records are also kept.  These are post-months that don't have a homicide, so each count value is replaced with a zero.

Step #5: To ensure the code has worked, I check the total number of rows (3300) and spot check various posts to make sure the data has been grouped in to incidents and posts correctly.  

Whether in R or using for-fee software (i.e. SAS, STATA), preparing data for SaTScan is relatively straightforward but there are a number of steps.

Update #1 (2/18/15)
Scan statistics can also be implemented in R's Spatial Epi Package and rsatscan .

2 comments:

  1. You will be pleased to know that you can now run SaTScan directly in R:

    http://cran.r-project.org/web/packages/rsatscan/vignettes/rsatscan.html

    ReplyDelete
  2. Thanks for that catch! I updated the post with a link to the Spatial Epi package in R.
    -Jon

    ReplyDelete