Exploratory Data Analysis | Data Cleaning | Data Visualization | Classification | Feature Engineering | Python
In recent years, the rise in urban crime has become a critical concern for both residents and policymakers. San Francisco, a major metropolitan hub known for its cultural diversity and technological innovation, has experienced fluctuating crime patterns that pose challenges to public safety and urban planning. This project leverages data analytics and machine learning techniques to analyze historical crime data from San Francisco with the goal of predicting future crime trends. By uncovering hidden patterns and forecasting crime rates, this analysis aims to support data-driven decision-making for law enforcement, city officials, and community stakeholders, ultimately contributing to safer and smarter city environments.
To address this specific problem, I followed a complete Data Science lifecycle consisting of the following key steps:
This project uses live crime incident data sourced from the San Francisco Police Department's Open Data Portal, which is updated daily by 10:00 AM PT. Using PySpark, I processed and analyzed these large-scale datasets to uncover current patterns, emerging crime trends, and geographical hotspots across the city. By working with continuously refreshed data, the project demonstrates both the ability to handle dynamic, high-volume data and the practical value of near real-time analytics in supporting public safety and resource allocation. This integration of live data adds relevance and timeliness to the analysis, simulating the demands of real-world data environments.
I ingested over 1 million rows of daily crime data using PySpark's DataFrame API, applying groupBy aggregations and window functions for temporal analysis. The final output included visual heatmaps, bar graphs by district, and day-of-week breakdowns.
Google Colab Notebook
Decision Tree with AdaBoost Classifier
Download Project Report (PDF)