The Problem
Context
San Francisco, a major metropolitan hub known for its cultural diversity and technological innovation, has experienced fluctuating crime patterns that pose challenges to public safety and urban planning. This project uses data analytics and machine learning (ML) to analyze historical crime data and predict future trends.
Problem Statement
Understanding when and where specific crimes are likely to occur can greatly improve resource allocation and public safety initiatives. The key question: how can historical data power an accurate, interpretable model that informs real-time decision-making and proactive intervention?
Data Science Pipeline
Data Wrangling
Assessed dataset quality and applied cleaning processes to ensure data integrity.
Exploratory Data Analysis (EDA)
Analyzed and visualized variables to uncover initial patterns and trends.
Feature Engineering
Generated new meaningful features from existing data to enhance model performance.
Normalization
Scaled and transformed variables to prepare data for ML algorithms.
Train / Test Split
Divided the data into training and test sets to evaluate model performance and tune hyperparameters.
Model Evaluation
Identified and assessed the best-performing classifier for crime type prediction.
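The feature engineering, normalization, split, and evaluation steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the project's actual code: the record fields (`timestamp`, `district`, `category`), the sample rows, and the helper names are all hypothetical, and the majority-class baseline stands in for whichever classifier the project selected.

```python
import random
from collections import Counter
from datetime import datetime

# Hypothetical incident records; field names are assumptions, not the SFPD schema.
incidents = [
    {"timestamp": "2023-01-02 08:15", "district": "Mission", "category": "theft"},
    {"timestamp": "2023-01-02 23:40", "district": "Tenderloin", "category": "assault"},
    {"timestamp": "2023-01-03 14:05", "district": "Mission", "category": "theft"},
    {"timestamp": "2023-01-04 02:30", "district": "Central", "category": "vandalism"},
    {"timestamp": "2023-01-05 18:50", "district": "Tenderloin", "category": "theft"},
    {"timestamp": "2023-01-06 21:10", "district": "Mission", "category": "assault"},
    {"timestamp": "2023-01-07 11:25", "district": "Central", "category": "theft"},
    {"timestamp": "2023-01-08 16:45", "district": "Tenderloin", "category": "vandalism"},
]


def min_max_scale(value, lo, hi):
    """Map value from the known range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)


def engineer_features(record):
    """Derive time-based features from the raw timestamp string."""
    ts = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M")
    return {
        "hour": ts.hour,
        "hour_scaled": min_max_scale(ts.hour, 0, 23),  # hours always span 0..23
        "day_of_week": ts.weekday(),                   # Monday = 0, Sunday = 6
        "is_weekend": int(ts.weekday() >= 5),
        "district": record["district"],
        "category": record["category"],                # prediction target
    }


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows reproducibly and hold out a test fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


features = [engineer_features(r) for r in incidents]
train, test = train_test_split(features)

# Majority-class baseline: always predict the most common training category.
majority = Counter(r["category"] for r in train).most_common(1)[0][0]
baseline_accuracy = sum(r["category"] == majority for r in test) / len(test)
```

Scaling the hour against its fixed 0 to 23 range means no statistics are learned from the data, so applying it before the split does not leak information from the test set. A baseline like this gives a floor that any candidate classifier should beat.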
Key Objectives
Live Data Feed
Ingested incident data from the SFPD Open Data Portal, which is refreshed daily by 10:00 AM PT.
Temporal Patterns
Tracked changes in crime volume over time, identifying day-of-week and seasonal trends.
Geospatial Hotspots
Identified high-crime neighborhoods and time periods using geospatial heatmaps and district breakdowns.
Crime Categories
Grouped incidents by type (for example assault, theft, and vandalism) to enable category-level analysis.
Predictive Modeling
Built classification models to predict the likelihood of specific crime types based on location and time.
Real-Time Processing
Leveraged PySpark's DataFrame API with groupBy and window functions for scalable temporal analysis.
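The temporal and district-level aggregations described above follow a group-then-window pattern. The sketch below shows the shape of that computation in plain Python so it runs without a Spark session; it is not the project's PySpark code, and the sample incidents and the `trailing_average` helper are hypothetical. The counters play the role of grouped counts, and the trailing average behaves like a rows-based window ordered by date.

```python
from collections import Counter
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical (incident_date, district) pairs, invented for illustration.
incidents = [
    (date(2023, 1, 2), "Mission"),
    (date(2023, 1, 2), "Tenderloin"),
    (date(2023, 1, 3), "Mission"),
    (date(2023, 1, 4), "Mission"),
    (date(2023, 1, 4), "Central"),
    (date(2023, 1, 4), "Tenderloin"),
    (date(2023, 1, 5), "Tenderloin"),
]

# Grouped counts: one counter per grouping key.
daily_counts = Counter(day for day, _ in incidents)
district_counts = Counter(district for _, district in incidents)
day_of_week_counts = Counter(DAY_NAMES[day.weekday()] for day, _ in incidents)


def trailing_average(counts, window=3):
    """Average each day's count with up to window - 1 preceding observed days."""
    days = sorted(counts)
    averages = {}
    for i, day in enumerate(days):
        recent = days[max(0, i - window + 1): i + 1]
        averages[day] = sum(counts[d] for d in recent) / len(recent)
    return averages


rolling = trailing_average(daily_counts)
```

The window here counts observed days rather than calendar days, so a date with no incidents is skipped instead of contributing a zero. Filling in missing dates before averaging would be needed if calendar-based smoothing is the goal.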
Tools & Technologies
Crime Map — San Francisco
Interactive map showing incident density across SF districts. The final output included visual heatmaps, bar graphs by district, and day-of-week breakdowns.
