SAN FRANCISCO CRIME RATE

CRIME ANALYSIS & PREDICTION

Exploratory Data Analysis | Data Cleaning | Data Visualization | Classification | Feature Engineering | Python

🚓 SFPD Incident Analysis with PySpark

Introduction

In recent years, the rise in urban crime has become a critical concern for both residents and policymakers. San Francisco, a major metropolitan hub known for its cultural diversity and technological innovation, has experienced fluctuating crime patterns that pose challenges to public safety and urban planning. This project leverages data analytics and machine learning techniques to analyze historical crime data from San Francisco with the goal of predicting future crime trends. By uncovering hidden patterns and forecasting crime rates, this analysis aims to support data-driven decision-making for law enforcement, city officials, and community stakeholders, ultimately contributing to safer and smarter city environments.

Problem Statement

To address this specific problem, I followed a complete Data Science lifecycle consisting of the following key steps:

Tools Used:

PySpark
Python
SQL
Scikit-learn
Matplotlib
Geospatial Analysis
Jupyter Notebooks
GitHub

Live Data Integration:

This project uses live crime incident data from the San Francisco Police Department's Open Data Portal, which is updated daily by 10:00 AM PT. Using PySpark, I processed and analyzed this large dataset to surface recent patterns, emerging crime trends, and geographic hotspots across the city. Working with data that refreshes every day demonstrates the ability to handle a changing, high-volume source and shows the practical value of near real-time analytics for public safety and resource allocation. It also keeps the analysis current and mirrors the demands of real-world data environments.
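As a rough illustration of how a daily-refreshed feed like this can be pulled in pages, here is a minimal sketch using only the Python standard library. It assumes the portal exposes a Socrata-style endpoint; the dataset URL, the `incident_date` field, and the helper names `build_page_url` and `fetch_page` are assumptions for illustration, not details taken from the project.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the SFPD incident reports dataset (not confirmed by the write-up).
BASE_URL = "https://data.sfgov.org/resource/wg3w-h783.json"


def build_page_url(since_date, limit=50000, offset=0):
    """Build a paged query URL for incidents on or after `since_date` (YYYY-MM-DD)."""
    params = {
        "$where": f"incident_date >= '{since_date}'",  # assumed field name
        "$order": ":id",      # stable ordering so pages do not overlap
        "$limit": limit,      # rows per page
        "$offset": offset,    # rows to skip
    }
    return f"{BASE_URL}?{urlencode(params)}"


def fetch_page(url, timeout=30):
    """Download one page and return it as a list of dicts (performs a network call)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building URLs needs no network access; fetching is left to the caller.
first_page_url = build_page_url("2024-01-01", limit=1000, offset=0)
second_page_url = build_page_url("2024-01-01", limit=1000, offset=1000)
```

A caller would loop over increasing offsets, calling `fetch_page` until an empty page comes back, and then hand the collected records to Spark for processing.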

๐Ÿ” Key Objectives

📈 Data & Insights

I ingested over 1 million rows of daily crime data using PySpark's DataFrame API, applying groupBy aggregations and window functions for temporal analysis. The final output included visual heatmaps, bar graphs by district, and day-of-week breakdowns.

🧠 Skills Demonstrated

PySpark
Big Data Processing
Data Cleaning
Crime Analytics
API Data Access
Exploratory Visualization
Model Evaluation
Classification Models
Real-Time Data Handling

📎 Google Colab & Documentation

Google Colab Notebook
Decision Tree with AdaBoost Classifier
Download Project Report (PDF)