Data Analysis · Machine Learning

San Francisco
Crime Analysis
& Prediction

Using PySpark, Python, and SQL to process over 1 million crime records from a daily-updated feed, identify patterns and hotspots, and build a predictive classification model.

1M+ records processed
150k+ incidents analyzed
6 pipeline stages
Live daily data feed
Overview

The Problem

Context

San Francisco, a major metropolitan hub known for its cultural diversity and technological innovation, has experienced fluctuating crime patterns that pose challenges to public safety and urban planning. This project uses data analytics and ML to analyze historical crime data and predict future trends.

Problem Statement

Understanding when and where specific crimes are likely to occur can greatly improve resource allocation and public safety initiatives. The key question: how can historical data power an accurate, interpretable model that informs real-time decision-making and proactive intervention?

Methodology

Data Science Pipeline

01

Data Wrangling

Assessed dataset quality and applied cleaning processes to ensure data integrity.

02

EDA

Analyzed and visualized variables to uncover initial patterns and trends.

03

Feature Engineering

Generated new meaningful features from existing data to enhance model performance.

04

Normalization

Scaled and transformed variables to prepare data for ML algorithms.

05

Train / Test Split

Split the data into training and test sets so that hyperparameters are tuned on one portion and model performance is measured on held-out records.

06

Model Evaluation

Assessed the candidate classifiers and identified the best performer for crime type prediction.
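
As a rough illustration of stages 03 to 05, the sketch below derives time features from a timestamp, min-max scales the numeric columns, and makes a seeded train/test split. It uses only the Python standard library. The record fields (`incident_datetime`, `latitude`, `longitude`, `category`), the sample values, and the helper names are all assumptions made for this example; they are not the project's actual schema or code.

```python
import random
from datetime import datetime

# Hypothetical records: field names and values are illustrative only,
# not the real SFPD schema or real incidents.
RECORDS = [
    {"incident_datetime": "2023-01-06 23:15", "latitude": 37.7749, "longitude": -122.4194, "category": "theft"},
    {"incident_datetime": "2023-01-07 01:40", "latitude": 37.7841, "longitude": -122.4075, "category": "assault"},
    {"incident_datetime": "2023-01-07 14:05", "latitude": 37.7599, "longitude": -122.4148, "category": "vandalism"},
    {"incident_datetime": "2023-01-08 09:30", "latitude": 37.8024, "longitude": -122.4058, "category": "theft"},
    {"incident_datetime": "2023-01-09 18:20", "latitude": 37.7694, "longitude": -122.4862, "category": "theft"},
    {"incident_datetime": "2023-01-10 12:00", "latitude": 37.7786, "longitude": -122.3893, "category": "assault"},
    {"incident_datetime": "2023-01-11 21:45", "latitude": 37.7272, "longitude": -122.4031, "category": "vandalism"},
    {"incident_datetime": "2023-01-12 07:10", "latitude": 37.7955, "longitude": -122.3937, "category": "theft"},
]


def engineer_features(record):
    """Stage 03: derive time-of-day and day-of-week features from the timestamp."""
    ts = datetime.strptime(record["incident_datetime"], "%Y-%m-%d %H:%M")
    return {
        "hour": ts.hour,
        "day_of_week": ts.weekday(),           # Monday = 0, Sunday = 6
        "is_weekend": int(ts.weekday() >= 5),  # Saturday or Sunday
        "latitude": record["latitude"],
        "longitude": record["longitude"],
    }


def min_max_scale(rows, columns):
    """Stage 04: rescale each listed column to the range [0, 1]."""
    scaled = [dict(row) for row in rows]  # copy so the input is not mutated
    for col in columns:
        values = [row[col] for row in rows]
        low, span = min(values), max(values) - min(values)
        for row in scaled:
            row[col] = (row[col] - low) / span if span else 0.0
    return scaled


def _pick(seq, indices):
    return [seq[i] for i in indices]


def train_test_split(rows, labels, test_fraction=0.25, seed=42):
    """Stage 05: shuffle with a fixed seed, then hold out a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (_pick(rows, train_idx), _pick(rows, test_idx),
            _pick(labels, train_idx), _pick(labels, test_idx))


features = [engineer_features(r) for r in RECORDS]
labels = [r["category"] for r in RECORDS]
scaled = min_max_scale(features, ["hour", "day_of_week", "latitude", "longitude"])
X_train, X_test, y_train, y_test = train_test_split(scaled, labels)
```

The sketch follows the stage order listed above. A common variation is to compute the scaling ranges from the training portion only, so that test-set values do not influence the transformation.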

Goals

Key Objectives

Live Data Feed

Ingested data from the SFPD Open Data Portal, which is refreshed daily by 10:00 AM PT.

Temporal Patterns

Tracked changes in crime volume over time, identifying day-of-week and seasonal trends.

Geospatial Hotspots

Identified high-crime neighborhoods and time periods using geospatial heatmaps and district breakdowns.

Crime Categories

Grouped incidents by type, such as assault, theft, and vandalism, to enable category-level analysis.

Predictive Modeling

Built classification models to predict the likelihood of specific crime types based on location and time.

Real-Time Processing

Used PySpark's DataFrame API, with groupBy and window functions, for scalable temporal analysis.
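
The temporal and category objectives above come down to two operations: grouped counts and a trailing window over date-ordered counts. The sketch below shows that logic in plain Python rather than PySpark so it runs without a cluster. The sample incidents, the three-row window size, and the function names are assumptions for illustration only.

```python
from collections import Counter, deque
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical (incident_date, category) pairs, not real SFPD data.
INCIDENTS = [
    (date(2023, 1, 2), "theft"),
    (date(2023, 1, 2), "assault"),
    (date(2023, 1, 3), "theft"),
    (date(2023, 1, 4), "vandalism"),
    (date(2023, 1, 4), "theft"),
    (date(2023, 1, 4), "theft"),
    (date(2023, 1, 7), "assault"),
    (date(2023, 1, 9), "theft"),
]


def count_by_day_of_week(incidents):
    """Grouped count keyed by weekday name (the groupBy-and-count step)."""
    return Counter(DAY_NAMES[day.weekday()] for day, _ in incidents)


def count_by_category(incidents):
    """Grouped count keyed by crime category."""
    return Counter(category for _, category in incidents)


def rolling_mean(daily_counts, window=3):
    """Trailing mean over date-ordered daily counts.

    The window is row-based: it covers the current row and up to
    window - 1 preceding rows. Dates with no incidents have no row,
    so they are skipped rather than counted as zero.
    """
    recent = deque(maxlen=window)
    result = {}
    for day in sorted(daily_counts):
        recent.append(daily_counts[day])
        result[day] = sum(recent) / len(recent)
    return result


by_day_of_week = count_by_day_of_week(INCIDENTS)
by_category = count_by_category(INCIDENTS)
daily_counts = Counter(day for day, _ in INCIDENTS)
rolling = rolling_mean(daily_counts)
```

In a DataFrame setting, the grouped counts correspond to a groupBy followed by a count, and the trailing mean corresponds to an average taken over a window ordered by date.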

Stack

Tools & Technologies

PySpark · Python · SQL · Scikit-learn · Matplotlib · Geospatial Analysis · Jupyter Notebooks · GitHub · Decision Tree · AdaBoost · SFPD Open Data API

Data & Insights

Crime Map — San Francisco

Interactive map showing incident density across SF districts. The final output included heatmaps, bar graphs by district, and day-of-week breakdowns.

Resources

Notebooks & Documentation