
SAN FRANCISCO CRIME RATE

CRIME ANALYSIS & PREDICTION

Exploratory Data Analysis | Data Cleaning | Data Visualization | Classification | Feature Engineering | Python

SFPD Incident Analysis with PySpark

Introduction

In recent years, the rise in urban crime has become a critical concern for both residents and policymakers. San Francisco, a major metropolitan hub known for its cultural diversity and technological innovation, has experienced fluctuating crime patterns that pose challenges to public safety and urban planning. This project leverages data analytics and machine learning techniques to analyze historical crime data from San Francisco with the goal of predicting future crime trends. By uncovering hidden patterns and forecasting crime rates, this analysis aims to support data-driven decision-making for law enforcement, city officials, and community stakeholders, ultimately contributing to safer and smarter city environments.

Problem Statement

Urban crime prediction presents significant challenges due to the complexity of human behavior, environmental variables, and temporal factors. In San Francisco, understanding when and where specific crimes are likely to occur can greatly improve resource allocation and public safety initiatives. The key problem this project addresses is how to use historical crime data to develop an accurate and interpretable predictive model that informs real-time decision-making and proactive intervention strategies.

Process to Develop the Project

To address this problem, I followed a complete data science lifecycle, from data cleaning and exploratory analysis through feature engineering, classification, and visualization.

Tools Used

PySpark
Python
SQL
Scikit-learn
Matplotlib
Geospatial Analysis
Jupyter Notebooks
GitHub

Live Data Integration

This project uses crime incident data from the San Francisco Police Department's Open Data Portal, which is refreshed daily by 10:00 AM PT. Using PySpark, I processed and analyzed this large dataset to uncover current patterns, emerging crime trends, and geographical hotspots across the city. Because the source is refreshed every day, the analysis stays current instead of describing a one-time snapshot, and the pipeline has to cope with a dataset that keeps growing. This mirrors the demands of real-world data environments and shows the practical value of near real-time analytics for public safety and resource allocation.
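As a rough illustration of pulling a daily-refreshed dataset, the sketch below pages through a Socrata-style JSON endpoint using the `$limit`, `$offset`, and `$order` query parameters. The dataset URL, the page size, and the helper names `build_page_url` and `fetch_all` are assumptions made for this example; the write-up does not state which endpoint or access method the project actually used.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: endpoint of the incident reports dataset on data.sfgov.org.
# Swap in whichever dataset the analysis is actually based on.
BASE_URL = "https://data.sfgov.org/resource/wg3w-h783.json"


def build_page_url(offset, limit=50000, since=None):
    """Build the URL for one page of results (hypothetical helper)."""
    params = {
        "$limit": limit,    # rows per page
        "$offset": offset,  # rows to skip
        "$order": ":id",    # stable ordering so pages do not overlap
    }
    if since is not None:
        # Assumption: the timestamp column is named incident_datetime.
        params["$where"] = f"incident_datetime >= '{since}'"
    return f"{BASE_URL}?{urlencode(params, safe='$')}"


def fetch_all(page_size=50000, since=None):
    """Yield incident records page by page until the source is exhausted."""
    offset = 0
    while True:
        with urlopen(build_page_url(offset, page_size, since)) as resp:
            page = json.load(resp)
        yield from page
        if len(page) < page_size:
            break  # a short (or empty) page means there is nothing left
        offset += page_size
```

Passing a `since` timestamp limits each daily run to records added after the previous run, which avoids downloading the full history every time.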

Key Objectives

Data & Insights

I ingested over 1 million rows of crime incident data with PySpark’s DataFrame API and used groupBy aggregations and window functions for temporal analysis. The final outputs include heatmaps, bar charts by police district, and day-of-week breakdowns.

Skills Used

PySpark
Big Data Processing
Data Cleaning
Crime Analytics
API Data Access
Exploratory Visualization
Model Evaluation
Classification Models
Real-Time Data Handling

Google Colab & Documentation

Google Colab Notebook
Decision Tree with AdaBoost Classifier
Download Project Report (PDF)