Crime is a matter of public record. But it has taken our team over two years to acquire this data, convert it to useful and universal formats, identify clerical errors and duplicates, and classify the offenses into broad and specific categories. We are now opening it up so that everyone has the opportunity to detect and understand the patterns of crime.
Our long-term goal is to steer social policy in an evidence-based manner. Legal policy is often driven by intuition and politics more than by data analysis. Large-scale data analysis has the potential to reveal patterns that help assess the efficacy of legislation. Using millions of criminal records from multiple states, we mine patterns of crime and recidivism to help shape a more effective criminal justice policy. Which policies over the past few decades have effectively reduced crime? Which types of crime respond to which types of policies? Are there “gateway crimes” that lead offenders to commit other crimes in the future? What patterns correlate with re-offense? Which crime types cluster, and which are rarely committed by the same individual? When does sentencing effectively prevent offenders from reoffending?
Funding for this tool was provided by the National Science Foundation SBE Office of Multidisciplinary Activities under Grant No. 1439453.
New York City, NY, is the most populous city in the United States. The data set consists of 9.8 million records spanning from 1977 to 2013. The data contains 19 variables and was obtained from the New York State Division of Criminal Justice Services in 2013. It currently contains only the most serious charge in a given arrest and does not yet contain identifiers.
Miami-Dade County, FL, is the 7th most populous county in the United States; its county seat is Miami, FL. The data set consists of 5.7 million records spanning from 1971 to 2012. The data contains 21 variables and was obtained from the Miami-Dade County Clerks Criminal Justice Information System on December 3, 2013.
The state of New Mexico is the 5th largest state in the United States by area. The data set consists of 3.8 million records spanning from 1979 to April 2014. The data contains 23 variables and was obtained from the New Mexico Courts - The Judicial Branch of New Mexico in April 2014. The current version of the data set is preliminary; as such, it may contain duplicate records, and offenders in the data set may have multiple Defendant System IDs. This version of the data set is for review purposes only and will be superseded when a new version with data up to 2018 is obtained from the New Mexico Courts - The Judicial Branch of New Mexico.
Harris County, TX, is the 3rd most populous county in the United States; its county seat is Houston, TX. The data set consists of 3.1 million records spanning from 1977 to April 2012. The data contains 39 variables and was obtained from the Harris County District Clerk's Office in September 2013.