The City and County of San Francisco launched its official open data portal, SF OpenData, in 2009 as a product of its open data program, DataSF. The portal contains hundreds of city datasets for use by developers, analysts, residents and others. Under the Public Safety category, the portal lists SFPD incidents since January 1, 2003.

In this post I carry out an exploratory time-series analysis of the crime incidents dataset to see whether any patterns emerge.

The data for this analysis was downloaded from SF OpenData, the City and County of San Francisco's public open data website. The crime incidents dataset covers the period from 2003 to date. I downloaded the full dataset and restricted my analysis to 2003 through 2015, filtering out the records from 2016. That leaves nearly 1.9 million crime incidents.

I performed only minimal processing on the downloaded raw data to facilitate the analysis.

- Data source location: https://data.sfgov.org/data
- Data source: https://data.sfgov.org/Public-Safety/SFPD-Incidents-from-1-January-...
- Number of SFPD crime incidents processed: 1,859,850
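
A minimal sketch of the filtering step described above, written in Python with only the standard library rather than the R and dplyr used for the analysis itself. The column names `Date` and `Time`, their formats, and the sample rows are assumptions made for illustration; check them against the actual export before reusing this.

```python
from datetime import datetime

# Made-up sample rows; the real export has many more columns and rows.
rows = [
    {"Date": "01/01/2003", "Time": "00:01"},
    {"Date": "12/31/2015", "Time": "23:30"},
    {"Date": "01/02/2016", "Time": "08:15"},
]

def parse_timestamp(row):
    """Combine the assumed Date and Time columns into one datetime."""
    return datetime.strptime(row["Date"] + " " + row["Time"], "%m/%d/%Y %H:%M")

def in_study_period(row, first_year=2003, last_year=2015):
    """Keep only incidents whose year falls inside the study period."""
    return first_year <= parse_timestamp(row).year <= last_year

kept = [row for row in rows if in_study_period(row)]
print(len(kept), "of", len(rows), "sample rows kept")
```

Parsing the date and time once up front also makes the later groupings (by year, month, day, hour and weekday) simple attribute lookups on the resulting `datetime`.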

The following plot depicts the number of crimes recorded each year from 2003 through 2015.

The horizontal line represents the average number of crimes during those years, which is just below 150,000 crimes per year. From 2003 to 2007 the number of crime incidents decreased steadily, but in 2008 and 2009 there was a slight increase. Those two years are when the United States went through the financial and subprime mortgage crisis, resulting in what is called the Great Recession. According to the US National Bureau of Economic Research, the recession began in December 2007 and ended in June 2009. As most statisticians say, "correlation does not imply causation", and I too want to emphasize that without additional data and analysis it may not be possible to relate these two events. Nevertheless, it is an interesting observation. Following that period, crime incidents decreased slightly over the next two years, but they have increased since 2012, ending up above average from 2013 to 2015.

The following plot depicts the mean crimes for each month from January to December. The mean for each month is more or less around the monthly average, which is just below 12,000 (horizontal line). One interesting observation is that the mean is significantly below the monthly average for February, November and December. Possible reasons are that February has fewer days than the other months, and that November and December fall in the festive and holiday season.
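
One way to test the February explanation is to divide each month's total by the number of days that month contributes over 2003 to 2015. The sketch below, in Python with only the standard library, counts those days. The per-day normalization and the helper name `incidents_per_day` are illustrations added here, not part of the analysis above.

```python
import calendar

def days_per_calendar_month(first_year=2003, last_year=2015):
    """Total number of days each calendar month contributes over the period."""
    totals = {month: 0 for month in range(1, 13)}
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of the 1st, number of days in month)
            totals[month] += calendar.monthrange(year, month)[1]
    return totals

def incidents_per_day(monthly_totals, days):
    """Hypothetical helper: divide each month's incident total by its days."""
    return {month: monthly_totals[month] / days[month] for month in monthly_totals}

days = days_per_calendar_month()
print("February days:", days[2])
print("January days:", days[1])
print("February relative to January:", round(days[2] / days[1], 3))
```

If February's per-day rate turns out close to that of the other months, the shorter month alone accounts for the dip. The same check cannot explain November and December, since they have 30 and 31 days.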

The following plot depicts the mean crimes for the different days of the month. The mean for each day is close to the daily average, which is just below 400 (horizontal line), from the 2nd of the month to the 28th. The mean for the first day of the month is significantly above average. One possible reason is that the first day of the month is often payday. Again, correlation does not imply causation, and without additional related data and analysis we cannot be sure. The 29th and 30th are also below average, possibly because February never has a 30th and has a 29th only in leap years. The mean for the 31st is around half of the daily average, which might be because only seven of the twelve months have a 31st day.
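
To see how much of the dip on the 29th, 30th and 31st could come from the calendar alone, the sketch below counts how often each day of the month occurs between 2003 and 2015. It is a Python illustration using only the standard library. Dividing by the total number of months is an assumption about how such a mean might be computed, since the normalization behind the plot is not stated.

```python
from collections import Counter
from datetime import date, timedelta

def day_of_month_occurrences(start=date(2003, 1, 1), end=date(2015, 12, 31)):
    """Count how many times each day-of-month number appears in the period."""
    counts = Counter()
    current = start
    while current <= end:
        counts[current.day] += 1
        current += timedelta(days=1)
    return counts

occurrences = day_of_month_occurrences()
months_in_period = 13 * 12  # 2003 through 2015

for day in (1, 29, 30, 31):
    share = occurrences[day] / months_in_period
    print(day, occurrences[day], round(share, 2))
```

Under that assumption, a day that appears in only some months would show a proportionally lower mean even if crime on that day were entirely typical. Dividing by `occurrences[day]` instead would remove the calendar effect.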

The following plot depicts the mean crimes by hour of the day. This plot is very different from the others: the hourly values stray far from the hourly average, which is around 16 (horizontal line). Still, some interesting patterns stand out. Crime incidents are well above average around midnight, then decline steadily, staying significantly below the hourly average until around 5 AM. From 6 AM they increase steadily and spike around noon. From noon onward they remain well above average, peaking around 6 PM and declining after that.
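
The grouping behind a plot like this can be sketched as follows, in Python with only the standard library. The timestamps are made up for illustration and the timestamp format is an assumption. The mean for each hour is taken as the incident count in that hour divided by the number of days covered.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps for illustration only.
timestamps = [
    "01/01/2015 00:01", "01/01/2015 00:30", "01/01/2015 12:00",
    "01/02/2015 18:15", "01/02/2015 18:45", "01/02/2015 05:10",
]

parsed = [datetime.strptime(t, "%m/%d/%Y %H:%M") for t in timestamps]
days_covered = len({p.date() for p in parsed})

# Count incidents per hour, then average over the days covered.
per_hour = Counter(p.hour for p in parsed)
mean_per_hour = {hour: per_hour.get(hour, 0) / days_covered for hour in range(24)}

# The horizontal line: the average of the 24 hourly means.
overall_mean = sum(mean_per_hour.values()) / 24
above_average = [hour for hour in range(24) if mean_per_hour[hour] > overall_mean]
print("Hours above the hourly average:", above_average)
```

The same pattern, with `p.year`, `p.month`, `p.day` or `p.weekday()` in place of `p.hour`, covers the other groupings in this post.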

The following plot depicts the mean crimes by day of the week. Sunday has the fewest crime incidents, well below the daily average, which is just below 400 (vertical line), and Friday has the most, well above the daily average.

The following plot depicts the mean crimes on a few key days, such as holidays in the United States. The number of crime incidents is significantly high on New Year's Day, well above the daily average, which is just below 400 (horizontal line). On the other holidays the number of crime incidents is more or less the same as the daily average, but on Christmas Eve and Christmas Day it is significantly lower. Since Thanksgiving Day falls on a different date each year (the fourth Thursday of November), I chose November 24 as an approximation. I was expecting to see significantly fewer crime incidents on that date, but that does not seem to be the case.
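
Instead of a fixed date, the actual Thanksgiving date for each year can be computed as the fourth Thursday of November. The sketch below does this in Python with only the standard library. The function name `thanksgiving` is made up for this example, and the fixed November 24 approximation is what the analysis above actually used.

```python
from datetime import date, timedelta

def thanksgiving(year):
    """Return the fourth Thursday of November for the given year."""
    first = date(year, 11, 1)
    # date.weekday(): Monday is 0, so Thursday is 3.
    days_to_first_thursday = (3 - first.weekday()) % 7
    return first + timedelta(days=days_to_first_thursday + 21)

dates = {year: thanksgiving(year) for year in range(2003, 2016)}
on_nov_24 = [year for year, day in dates.items() if day.day == 24]

for year, day in dates.items():
    print(year, day.isoformat())
print("Years where Thanksgiving fell on November 24:", on_nov_24)
```

Averaging the incident counts on these computed dates, rather than on November 24 of every year, would give a cleaner test of whether crime drops on Thanksgiving Day.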

Based on the observations above, we can see some patterns in the crime incidents and arrive at the following conclusions:

- The average number of crime incidents happening daily in the City and County of San Francisco is around 400.
- The number of crime incidents is high around midnight, peaks around 6 PM and is lowest in the early morning hours.
- The number of crime incidents is usually lower on Christmas Eve and Christmas Day.
- The number of crime incidents has been slowly increasing since 2012.
- The number of crime incidents is high on New Year's Day and at the beginning of every month.
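
As a rough check on the averages quoted throughout, the arithmetic below divides the processed incident total by the number of years, months, days and hours in 2003 through 2015. It is a Python sketch using only the standard library. It assumes each horizontal line is simply the total divided by the number of periods, which may differ slightly from how the plotted averages were computed.

```python
from datetime import date

total_incidents = 1_859_850  # incidents processed, as listed above

years = 2015 - 2003 + 1
months = years * 12
# Number of days from January 1, 2003 through December 31, 2015.
days = (date(2016, 1, 1) - date(2003, 1, 1)).days
hours = days * 24

per_year = total_incidents / years
per_month = total_incidents / months
per_day = total_incidents / days
per_hour = total_incidents / hours

print("per year: ", round(per_year))
print("per month:", round(per_month))
print("per day:  ", round(per_day, 1))
print("per hour: ", round(per_hour, 1))
```

The results line up with the figures in the text: just below 150,000 per year, just below 12,000 per month, just below 400 per day and around 16 per hour.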

The above is just a high-level exploratory time-series analysis. With further in-depth analysis it is possible to arrive at more insights. In my future posts I will try to perform those analyses.

This analysis was performed entirely using RStudio version 0.99 and R Version 3.2.0.

The data processing and plots were done using the R libraries ggplot2 and dplyr.

© 2019 Data Science Central ®
