An intelligent taxi-dispatch system using Data Mining techniques
Data: Taxi trip records of yellow taxicabs in New York City in 2015. The data contains around 3 million records and 19 features.
Additional data: Plane arrival schedules at nearby airports, weather information, and holiday information (used to support the decision-support framework).
Tools: MATLAB
Objectives:
The main objective of this project is to create an ML model that predicts taxi demand around Manhattan and the airports region in New York City.
The demand is divided into three categories: High, Medium, and Low. The model will assign each region to one of the three categories on an hourly basis.
A key focus of the model is to accurately distinguish the high- and medium-demand regions from the low-demand regions, enabling taxicabs to concentrate on the former and avoid the latter, thereby improving efficiency and increasing profits.
To explore the revenue (Fare) and cost (Duration and Distance) of each region to see if certain regions of the city are more profitable than others.
To analyze the model's performance and quantify how much revenue is lost due to its erroneous predictions.
Target: New York City is divided into 6 regions for this project. The model predicts the level of demand in each region in 1-hour periods throughout the entire year.
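The hourly High/Medium/Low labelling described above can be sketched as follows. The project itself uses MATLAB, so this Python fragment on synthetic trips is only an illustration; the column names (pickup_datetime, region) are assumptions, not the actual schema of the NYC taxi data.

```python
import pandas as pd
import numpy as np

# Synthetic stand-in for the trip records: one week of pickups
# across 6 hypothetical regions.
rng = np.random.default_rng(0)
trips = pd.DataFrame({
    "pickup_datetime": pd.to_datetime("2015-01-01")
                       + pd.to_timedelta(rng.integers(0, 24 * 7, 5000), unit="h"),
    "region": rng.integers(1, 7, 5000),  # regions 1..6
})

# Aggregate pickups into region-hour counts.
hourly = (trips
          .groupby(["region", pd.Grouper(key="pickup_datetime", freq="h")])
          .size()
          .rename("trips")
          .reset_index())

# Label each region-hour by demand tercile. Ranking first guarantees
# unique bin edges even when many hours share the same trip count.
hourly["demand"] = pd.qcut(hourly["trips"].rank(method="first"),
                           3, labels=["Low", "Medium", "High"])
print(hourly.head())
```

In the actual project the thresholds separating High/Medium/Low would be chosen from the demand distribution (and features such as flight arrivals, weather, and holidays would feed a classifier); the tercile split here is just the simplest placeholder.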
Weather Event Patterns in the USA: A Data-Driven Study
Topic: A data science project on analyzing the occurrence of different weather events in the USA from 2011-2018.
Data: The dataset contains detailed records of weather events, including event type, date, location, fatalities, injuries, property damage, etc.
Each year's data contains over 60,000 records; in total, the 8 years of data amount to around half a million records.
Tools: MATLAB
Objectives:
This project aims to perform an in-depth analysis of various weather events that occurred across the United States between 2011 and 2018.
The project will focus on identifying patterns, trends, and the impact of weather events such as floods, hurricanes, wildfires, and tornadoes.
By leveraging data science techniques, this analysis will provide insights into the frequency, distribution, and severity of these events over the given period.
Tasks:
Visualize the frequency and distribution of different types of weather events across regions and years.
Analyze the seasonal patterns of specific weather events.
Study changes in weather patterns over time to identify any shifts in the frequency or intensity of extreme weather events.
Identify weather-prone areas and their vulnerability to specific types of weather events (e.g., tornado, hurricane zones).
Use machine learning models to predict the likelihood of certain weather events based on historical data and regional factors.
Assess the economic and social impact of weather events by analyzing trends in property damage, fatalities, and injuries.
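The seasonal-pattern task above can be sketched as a month-by-event-type count. The project's tool is MATLAB, so this Python snippet on fabricated events is purely illustrative; the column names (BEGIN_DATE, EVENT_TYPE) are assumptions modelled on NOAA-style storm-event exports and may not match the actual dataset.

```python
import pandas as pd
import numpy as np

# Fabricated events spread over 2011-2018 for illustration only.
rng = np.random.default_rng(1)
events = pd.DataFrame({
    "BEGIN_DATE": pd.to_datetime("2011-01-01")
                  + pd.to_timedelta(rng.integers(0, 8 * 365, 2000), unit="D"),
    "EVENT_TYPE": rng.choice(["Tornado", "Flood", "Hail", "Wildfire"], 2000),
})

# Seasonal pattern: how many events of each type occur in each month.
events["month"] = events["BEGIN_DATE"].dt.month
seasonal = events.pivot_table(index="month", columns="EVENT_TYPE",
                              aggfunc="size", fill_value=0)
print(seasonal)
```

The same pivot, with year on the index instead of month, would serve the trend-over-time task; summing damage or casualty columns instead of counting rows would serve the impact task.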
Applications:
Risk assessment for insurance companies
Government policy recommendations.
Insights into Fatal Police Shootings in the USA
Synopsis: The project seeks to explore and analyze fatal shootings in the USA. Using a combination of data science techniques and statistical analysis, the project aims to identify key demographic, socioeconomic, and geographic variables that may correlate with these incidents.
Data: The data has been collected from the Washington Post. It holds over 10,000 records from 2015 to the present.
Link to the dataset - Fatal Force Dataset
Tools: MATLAB
Objectives:
The analysis will focus on examining racial and ethnic disparities, regional variations, and situational factors. The study aims to uncover any potential biases and systemic issues in the application of lethal force by police. It also examines extreme cases, such as shootings of underage or unarmed individuals who posed no threat.
The insights derived from this analysis could inform future policy changes, training programs, and law enforcement strategies to minimize the occurrence of fatal encounters and promote more equitable policing practices.
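The disparity and extreme-case analyses described above can be sketched in a few lines of pandas. The project lists MATLAB as its tool, so this is only an illustrative Python fragment on a fabricated toy frame; the column names (race, armed, age) are assumptions loosely modelled on the Washington Post dataset and the values are invented.

```python
import pandas as pd

# Toy stand-in for the fatal-shootings records (fabricated values).
df = pd.DataFrame({
    "race":  ["W", "B", "B", "H", "W", "B"],
    "armed": ["gun", "unarmed", "knife", "unarmed", "gun", "gun"],
    "age":   [34, 17, 25, 41, 16, 29],
})

# Share of incidents by racial group (a starting point for the
# disparity analysis; a real study would normalize by population).
share = df["race"].value_counts(normalize=True)

# Extreme cases: victims who were underage or unarmed.
extreme = df[(df["age"] < 18) | (df["armed"] == "unarmed")]
print(share)
print(extreme)
```

A real analysis would join these raw counts against census population shares before drawing any conclusions about disparity.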
Note: A similar analysis of school shootings in the USA is currently being conducted. The dataset is also available in the GitHub repository of the Washington Post.
Exploratory Data Analysis (EDA)
Synopsis: This project is focused on performing EDA to understand patterns, spot anomalies, test hypotheses, and check assumptions, all before applying more complex modeling or hypothesis testing.
Data:
The Titanic dataset, one of the most well-known datasets in data science, is used for this analysis. Dataset link: https://www.kaggle.com/c/titanic.
KDDCup-99 dataset, a benchmark dataset used for evaluating ML models in network intrusion detection.
Dataset link: https://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Dataset Information:
Titanic dataset: The training and test sets contain 891 and 418 passenger records, respectively, with 12 features.
KDDCup-99 dataset: The original dataset holds approximately 4.9 million connection records. A reduced 10% subset is used for the analysis, with 41 features.
Tools: Python (Pandas, Seaborn, etc.)
Workflow:
Analyze data using Pandas
Identify issues in the dataset (e.g., missing values, outliers, unrelated features)
Data visualization for analysis
Analysis using pivot tables
Removing irrelevant features
Imputation - missing data handling
Feature engineering
Correlation analysis
Hypothesis testing
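Several of the workflow steps above (issue identification, imputation, pivot-table analysis, correlation analysis) can be sketched together on a Titanic-like toy frame. The column names (Survived, Pclass, Sex, Age, Fare) come from the Kaggle dataset, but the rows here are fabricated for illustration.

```python
import pandas as pd
import numpy as np

# Tiny Titanic-like frame with a deliberately missing Age column.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass":   [3, 1, 2, 3, 1, 2],
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Age":      [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Fare":     [7.25, 71.28, 13.0, 8.05, 53.1, 26.0],
})

# Identify issues: count missing values per column.
print(df.isna().sum())

# Imputation: fill missing Age with the median.
df["Age"] = df["Age"].fillna(df["Age"].median())

# Pivot-table analysis: survival rate by sex and passenger class.
pivot = df.pivot_table(values="Survived", index="Sex", columns="Pclass")

# Correlation analysis on the numeric features.
corr = df[["Survived", "Pclass", "Age", "Fare"]].corr()
print(pivot)
print(corr)
```

On the real dataset the same calls apply unchanged; median imputation is only one of several reasonable strategies (group-wise medians or model-based imputation are common alternatives).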