
Top 10 Datasets to Elevate Your Game in Data Analysis
Looking to sharpen your data analysis skills? Working with varied, real-world datasets is one of the most effective ways to build them. Here are ten datasets that will challenge you and help you develop expertise across multiple domains.
1. Marketing Channel Analytics
This dataset tracks conversions and clicks across multiple marketing channels, allowing you to determine optimal budget allocation and timing for campaigns. Perfect for ROI analysis, attribution modeling, and marketing mix optimization. I covered this dataset in a tutorial in a previous blog post.
Marketing Dataset Schema
    date        : STRING   # Date of marketing activity
    channel     : STRING   # Marketing channel used
    campaign    : STRING   # Specific campaign name
    client      : STRING   # Client identifier
    spend       : FLOAT    # Amount spent on campaign
    impressions : INTEGER  # Number of impressions generated
    clicks      : INTEGER  # Number of clicks received
    conversions : INTEGER  # Number of conversions achieved
    revenue     : FLOAT    # Revenue generated from campaign
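As a starting point for the ROI analysis mentioned above, here is a minimal sketch that aggregates the schema's spend, clicks, conversions, and revenue fields per channel and derives three common ratios. The sample rows and the function name `channel_summary` are made up for illustration; a real analysis would load rows from the dataset instead.

```python
from collections import defaultdict

# Hypothetical sample rows; only the field names come from the schema.
rows = [
    {"channel": "search", "spend": 100.0, "clicks": 50, "conversions": 5, "revenue": 300.0},
    {"channel": "search", "spend": 100.0, "clicks": 50, "conversions": 5, "revenue": 200.0},
    {"channel": "social", "spend": 200.0, "clicks": 80, "conversions": 4, "revenue": 300.0},
]


def channel_summary(rows):
    """Sum the raw metrics per channel, then derive ratios from the totals."""
    totals = defaultdict(lambda: {"spend": 0.0, "clicks": 0, "conversions": 0, "revenue": 0.0})
    for row in rows:
        agg = totals[row["channel"]]
        for key in agg:
            agg[key] += row[key]

    summary = {}
    for channel, agg in totals.items():
        summary[channel] = {
            # Return on ad spend: revenue earned per unit of spend.
            "roas": agg["revenue"] / agg["spend"] if agg["spend"] else None,
            "cost_per_conversion": agg["spend"] / agg["conversions"] if agg["conversions"] else None,
            "conversion_rate": agg["conversions"] / agg["clicks"] if agg["clicks"] else None,
        }
    return summary


summary = channel_summary(rows)
for channel, metrics in summary.items():
    print(channel, metrics)
```

Computing the ratios from summed totals, rather than averaging per-row ratios, keeps small campaigns from skewing the channel-level picture.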
2. NBA Player Performance Metrics for the 2017-20 Seasons
Real performance data spanning three professional basketball seasons. Ideal for sports analytics, player evaluation, team composition analysis, and predicting performance trends based on historical statistics. I covered this dataset in a tutorial in a previous blog post.
NBA Performance Schema, first 14 columns
    player_name    : STRING   # Name of the player
    team           : STRING   # Team name
    games_played   : INTEGER  # Number of games played
    minutes        : FLOAT    # Minutes played
    points         : FLOAT    # Points scored per game
    rebounds       : FLOAT    # Rebounds per game
    assists        : FLOAT    # Assists per game
    steals         : FLOAT    # Steals per game
    blocks         : FLOAT    # Blocks per game
    turnovers      : FLOAT    # Turnovers per game
    field_goal_pct : FLOAT    # Field goal percentage
    three_pt_pct   : FLOAT    # Three-point percentage
    free_throw_pct : FLOAT    # Free throw percentage
    plus_minus     : FLOAT    # Plus/minus rating
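One simple player-evaluation exercise is ranking players by assist-to-turnover ratio while ignoring players with too few games to judge. The sketch below assumes the per-game `assists` and `turnovers` columns from the schema; the player rows, the 20-game cutoff, and the function name are illustrative choices, not part of the dataset.

```python
# Hypothetical sample rows; only the field names come from the schema.
players = [
    {"player_name": "Player A", "games_played": 70, "assists": 8.0, "turnovers": 4.0},
    {"player_name": "Player B", "games_played": 65, "assists": 3.0, "turnovers": 1.0},
    {"player_name": "Player C", "games_played": 5, "assists": 6.0, "turnovers": 1.0},
]


def rank_by_assist_turnover(players, min_games=20):
    """Rank players by assists per turnover, skipping small samples."""
    eligible = [
        p for p in players
        if p["games_played"] >= min_games and p["turnovers"] > 0
    ]
    ranked = sorted(eligible, key=lambda p: p["assists"] / p["turnovers"], reverse=True)
    return [(p["player_name"], round(p["assists"] / p["turnovers"], 2)) for p in ranked]


ranking = rank_by_assist_turnover(players)
print(ranking)
```

Player C has the best raw ratio but is excluded by the games filter, which is the kind of sample-size judgment this dataset is good for practicing.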
3. Manufacturing Incident Investigation
The GlobalShop dataset contains product manufacturing data with return information, allowing analysts to investigate patterns in product failures. Perfect for root cause analysis, quality control improvements, and predictive maintenance modeling. I covered this dataset in a tutorial in a previous blog post.
Manufacturing Dataset Schema
    sale_date           : STRING   # Date of sale
    region              : STRING   # Sales region
    product_category    : STRING   # Product category
    manufacturing_batch : STRING   # Batch identifier
    price               : FLOAT    # Product price
    is_returned         : BOOLEAN  # Whether product was returned
    return_date         : STRING   # Date of return (if applicable)
    return_reason       : STRING   # Reason for return
    humidity_at_return  : FLOAT    # Humidity level when returned
    humidity_tolerance  : STRING   # Product's humidity tolerance rating
    sensor_supplier     : STRING   # Supplier of humidity sensor
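A common first step in root cause analysis is comparing return rates across a candidate dimension such as batch or supplier. The following sketch assumes `is_returned` has already been parsed into a Python boolean; the sample rows and the helper name `return_rate_by` are hypothetical.

```python
from collections import Counter

# Hypothetical sample rows; only the field names come from the schema.
sales = [
    {"manufacturing_batch": "B1", "sensor_supplier": "S1", "is_returned": False},
    {"manufacturing_batch": "B1", "sensor_supplier": "S1", "is_returned": False},
    {"manufacturing_batch": "B1", "sensor_supplier": "S1", "is_returned": False},
    {"manufacturing_batch": "B1", "sensor_supplier": "S2", "is_returned": True},
    {"manufacturing_batch": "B2", "sensor_supplier": "S2", "is_returned": True},
    {"manufacturing_batch": "B2", "sensor_supplier": "S2", "is_returned": False},
]


def return_rate_by(sales, key):
    """Share of sold units that were returned, grouped by the given column."""
    sold = Counter()
    returned = Counter()
    for sale in sales:
        sold[sale[key]] += 1
        if sale["is_returned"]:
            returned[sale[key]] += 1
    return {group: returned[group] / sold[group] for group in sold}


by_batch = return_rate_by(sales, "manufacturing_batch")
by_supplier = return_rate_by(sales, "sensor_supplier")
print(by_batch)
print(by_supplier)
```

Running the same function over several columns makes it easy to see which dimension separates returned from non-returned products most sharply.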
4. Bike Rental Business
This well-known public dataset tracks daily bike rentals along with weather conditions, holidays, and seasonal information. Excellent for feature importance analysis, demand forecasting, and developing actionable business recommendations.
Bike Rental Schema
    obs               : INTEGER  # Observation ID
    date              : STRING   # Date of observation
    year              : INTEGER  # Year
    month             : INTEGER  # Month
    season            : INTEGER  # Season (1:winter, 2:spring, 3:summer, 4:fall)
    holiday           : INTEGER  # Whether day is holiday (0, 1)
    working_day       : INTEGER  # Whether day is working day (0, 1)
    weather_condition : INTEGER  # Weather condition code
    temp              : FLOAT    # Temperature in Celsius
    feel_temp         : FLOAT    # "Feels like" temperature
    humidity          : FLOAT    # Humidity percentage
    wind_speed        : FLOAT    # Wind speed
    occasional        : INTEGER  # Count of occasional/casual users
    members           : INTEGER  # Count of registered members
    rental            : INTEGER  # Total rentals (target variable)
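Before any forecasting, it helps to sanity-check the target and look at a single relationship. The sketch below does both: it flags days where `occasional + members` does not equal `rental` (an assumption about how the total is defined, worth verifying on the real data) and computes a Pearson correlation between `temp` and `rental`. The sample rows are invented for illustration.

```python
import math

# Hypothetical sample rows; only the field names come from the schema.
days = [
    {"temp": 5.0, "occasional": 10, "members": 90, "rental": 100},
    {"temp": 10.0, "occasional": 30, "members": 170, "rental": 200},
    {"temp": 15.0, "occasional": 60, "members": 240, "rental": 300},
    {"temp": 20.0, "occasional": 100, "members": 250, "rental": 350},
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Assumed invariant: total rentals are the sum of the two user groups.
inconsistent = [d for d in days if d["occasional"] + d["members"] != d["rental"]]

temp_vs_rental = pearson([d["temp"] for d in days], [d["rental"] for d in days])
print(len(inconsistent), round(temp_vs_rental, 3))
```

If the invariant holds on the real data, `occasional` and `members` must be left out of any model that predicts `rental`, since together they give away the answer.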
5. Healthcare Patient Outcomes
This dataset contains patient treatment data across multiple hospitals with demographics and outcomes. Develop skills in cohort analysis, treatment efficacy evaluation, and identifying factors that influence recovery rates and complications.
Healthcare Dataset Schema
    patient_id     : STRING   # Anonymized patient identifier
    hospital_id    : INTEGER  # Hospital identifier
    admission_date : DATE     # Date of admission
    discharge_date : DATE     # Date of discharge
    length_of_stay : INTEGER  # Length of stay in days
    age            : INTEGER  # Patient age
    gender         : STRING   # Patient gender
    diagnosis_code : STRING   # Primary diagnosis code (ICD-10)
    treatment_code : STRING   # Treatment procedure code
    comorbidities  : STRING   # Existing conditions (comma separated)
    readmission    : BOOLEAN  # Whether readmitted within 30 days
    mortality      : BOOLEAN  # Whether patient deceased during treatment
    complications  : STRING   # Complications during treatment
    insurance_type : STRING   # Type of insurance
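A simple cohort comparison here is per-hospital readmission rate alongside average length of stay. This sketch recomputes the stay from the two date columns rather than trusting `length_of_stay`, which doubles as a data-quality check. The rows and the function name `hospital_stats` are hypothetical, and the dates are assumed to be parsed into `datetime.date` objects already.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; only the field names come from the schema.
admissions = [
    {"hospital_id": 1, "admission_date": date(2023, 1, 1),
     "discharge_date": date(2023, 1, 5), "readmission": True},
    {"hospital_id": 1, "admission_date": date(2023, 2, 10),
     "discharge_date": date(2023, 2, 12), "readmission": False},
    {"hospital_id": 2, "admission_date": date(2023, 3, 1),
     "discharge_date": date(2023, 3, 8), "readmission": False},
]


def hospital_stats(admissions):
    """Per-hospital patient count, mean stay in days, and 30-day readmission rate."""
    grouped = defaultdict(list)
    for admission in admissions:
        grouped[admission["hospital_id"]].append(admission)

    stats = {}
    for hospital_id, group in grouped.items():
        stays = [(a["discharge_date"] - a["admission_date"]).days for a in group]
        stats[hospital_id] = {
            "patients": len(group),
            "avg_length_of_stay": sum(stays) / len(stays),
            # True counts as 1 and False as 0 when summed.
            "readmission_rate": sum(a["readmission"] for a in group) / len(group),
        }
    return stats


stats = hospital_stats(admissions)
print(stats)
```

Raw rates like these do not adjust for patient mix, so differences between hospitals should be read alongside age, diagnosis, and comorbidity distributions.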
6. E-commerce Customer Journey
This dataset tracks detailed customer interactions from browsing to purchase and returns. Perfect for building customer segmentation models, analyzing conversion funnels, developing recommendation algorithms, and calculating lifetime value metrics.
E-commerce Dataset Schema
    event_time       : TIMESTAMP  # Time when event happened
    event_type       : STRING     # Event type (view, cart, purchase, return)
    product_id       : STRING     # Product ID
    category_id      : STRING     # Product category ID
    category_name    : STRING     # Product category name
    brand            : STRING     # Brand name
    price            : FLOAT      # Product price
    user_id          : STRING     # User ID
    user_session     : STRING     # User session ID
    device           : STRING     # User device (mobile, desktop, tablet)
    referrer         : STRING     # Referring site or source
    location         : STRING     # User location
    is_returning     : BOOLEAN    # Whether user is returning customer
    days_since_first : INTEGER    # Days since first visit
    days_since_last  : INTEGER    # Days since last visit
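A conversion funnel can be approximated by counting distinct sessions that reach each event type. The sketch below uses the `event_type` values listed in the schema (view, cart, purchase); the sample events and the function name `funnel` are made up, and the step order is an assumption about how customers move through the site.

```python
# Hypothetical sample events; only the field names come from the schema.
events = [
    {"user_session": "s1", "event_type": "view"},
    {"user_session": "s1", "event_type": "cart"},
    {"user_session": "s1", "event_type": "purchase"},
    {"user_session": "s2", "event_type": "view"},
    {"user_session": "s2", "event_type": "cart"},
    {"user_session": "s3", "event_type": "view"},
    {"user_session": "s4", "event_type": "view"},
]

FUNNEL_STEPS = ["view", "cart", "purchase"]  # assumed ordering of the funnel


def funnel(events, steps=FUNNEL_STEPS):
    """Return (step, distinct sessions, rate vs. previous step) for each step."""
    sessions = {step: set() for step in steps}
    for event in events:
        if event["event_type"] in sessions:
            sessions[event["event_type"]].add(event["user_session"])

    counts = [len(sessions[step]) for step in steps]
    result = []
    for i, step in enumerate(steps):
        previous = counts[i - 1] if i > 0 else counts[0]
        rate = counts[i] / previous if previous else 0.0
        result.append((step, counts[i], rate))
    return result


funnel_report = funnel(events)
for step, count, rate in funnel_report:
    print(f"{step:<10}{count:>4}{rate:>8.0%}")
```

This version counts each step independently, so a session that purchases without a recorded cart event still counts as a purchase. A stricter funnel would require the earlier steps to appear first in `event_time` order.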
7. Financial Market Time Series
Historical market data with various economic indicators. Ideal for time series analysis, volatility modeling, trend detection, and developing trading strategies based on historical patterns.
Financial Market Schema
    date           : DATE     # Trading date
    ticker         : STRING   # Asset ticker symbol
    open           : FLOAT    # Opening price
    high           : FLOAT    # Highest price during day
    low            : FLOAT    # Lowest price during day
    close          : FLOAT    # Closing price
    adj_close      : FLOAT    # Adjusted closing price
    volume         : INTEGER  # Trading volume
    dividend       : FLOAT    # Dividend payment if any
    split          : FLOAT    # Stock split ratio if any
    interest_rate  : FLOAT    # Benchmark interest rate
    inflation_rate : FLOAT    # Monthly inflation rate
    unemployment   : FLOAT    # Monthly unemployment rate
    gdp_growth     : FLOAT    # Quarterly GDP growth rate
    sector         : STRING   # Market sector
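Volatility modeling usually starts from simple returns and a rolling standard deviation. The sketch below assumes a list of `adj_close` values for a single ticker, already sorted by date; the prices, the three-day window, and the function names are illustrative.

```python
import statistics

# Hypothetical adjusted closing prices for one ticker, oldest first.
adj_close = [100.0, 102.0, 101.0, 105.0, 104.0, 108.0]


def daily_returns(prices):
    """Simple day-over-day returns: price_t / price_(t-1) - 1."""
    return [current / previous - 1.0 for previous, current in zip(prices, prices[1:])]


def rolling_volatility(returns, window):
    """Sample standard deviation of returns over each trailing window."""
    return [
        statistics.stdev(returns[end - window:end])
        for end in range(window, len(returns) + 1)
    ]


returns = daily_returns(adj_close)
volatility = rolling_volatility(returns, window=3)
print([round(r, 4) for r in returns])
print([round(v, 4) for v in volatility])
```

Using `adj_close` rather than `close` keeps dividends and splits from showing up as artificial price jumps in the return series.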
8. Climate and Weather Patterns
Long-term climate records containing temperature, precipitation, and extreme weather events. Excellent for time series decomposition, anomaly detection, seasonal pattern identification, and predictive modeling.
Climate Data Schema
    date           : DATE     # Date of observation
    station_id     : STRING   # Weather station identifier
    latitude       : FLOAT    # Station latitude
    longitude      : FLOAT    # Station longitude
    max_temp       : FLOAT    # Maximum temperature (Celsius)
    min_temp       : FLOAT    # Minimum temperature (Celsius)
    avg_temp       : FLOAT    # Average temperature (Celsius)
    precipitation  : FLOAT    # Precipitation amount (mm)
    snowfall       : FLOAT    # Snowfall amount (mm)
    humidity       : FLOAT    # Average relative humidity (%)
    wind_speed     : FLOAT    # Average wind speed (km/h)
    wind_direction : INTEGER  # Wind direction (degrees)
    pressure       : FLOAT    # Atmospheric pressure (hPa)
    sunshine_hours : FLOAT    # Hours of sunshine
    cloud_cover    : FLOAT    # Cloud cover percentage
    event_type     : STRING   # Weather event type (if any)
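One basic form of anomaly detection is flagging readings that sit far from the mean in standard-deviation terms. The sketch below applies a z-score rule to a short list of `avg_temp` values from one station; the readings and the threshold of 2.0 are assumptions chosen for illustration, and real climate series would normally be de-seasonalized first.

```python
import statistics

# Hypothetical avg_temp readings (Celsius) for one station, in date order.
avg_temps = [14.8, 15.1, 15.0, 14.9, 15.2, 15.0, 24.0, 15.1]


def zscore_anomalies(values, threshold=2.0):
    """Indices of values more than `threshold` population std devs from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, nothing stands out
    return [i for i, value in enumerate(values) if abs(value - mean) / stdev > threshold]


anomalies = zscore_anomalies(avg_temps)
print(anomalies)
```

Because the outlier itself inflates the mean and standard deviation, this rule gets less sensitive as anomalies get larger or more frequent. Median-based alternatives are worth trying on the full dataset.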
9. Urban Transportation and Traffic Flow
City mobility data showing traffic patterns and public transportation usage. Practice spatial data analysis, optimization problems, and infrastructure planning through real-world urban movement patterns.
Transportation Dataset Schema
    timestamp               : TIMESTAMP  # Time of measurement
    segment_id              : STRING     # Road segment identifier
    start_lat               : FLOAT      # Starting latitude
    start_lon               : FLOAT      # Starting longitude
    end_lat                 : FLOAT      # Ending latitude
    end_lon                 : FLOAT      # Ending longitude
    length_km               : FLOAT      # Segment length in kilometers
    vehicle_count           : INTEGER    # Number of vehicles
    avg_speed               : FLOAT      # Average speed (km/h)
    congestion_level        : INTEGER    # Congestion level (0-4)
    travel_time             : FLOAT      # Average travel time (minutes)
    day_type                : STRING     # Type of day (weekday, weekend, holiday)
    weather_condition       : STRING     # Weather condition
    accident_nearby         : BOOLEAN    # Whether accident reported nearby
    public_transport_routes : INTEGER    # Number of public transit routes
    public_transport_freq   : FLOAT      # Public transit frequency (vehicles/hour)
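For a first taste of spatial analysis, the segment endpoints can be turned into a straight-line distance with the haversine formula and compared against `length_km`. The sketch also derives a speed from `length_km` and `travel_time` as a cross-check on `avg_speed`. The sample segment, the Earth radius constant, and the function names are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def implied_speed_kmh(length_km, travel_time_min):
    """Speed implied by segment length and travel time in minutes."""
    return length_km * 60.0 / travel_time_min


# Hypothetical segment; only the field names come from the schema.
segment = {
    "start_lat": 52.52, "start_lon": 13.40,
    "end_lat": 52.52, "end_lon": 13.41,
    "length_km": 0.8, "travel_time": 2.0,
}

straight_line_km = haversine_km(
    segment["start_lat"], segment["start_lon"], segment["end_lat"], segment["end_lon"]
)
detour_ratio = segment["length_km"] / straight_line_km
speed_kmh = implied_speed_kmh(segment["length_km"], segment["travel_time"])
print(round(straight_line_km, 3), round(detour_ratio, 2), speed_kmh)
```

A detour ratio well below 1 would mean the recorded road length is shorter than the straight line between its endpoints, which usually points to a coordinate or unit error in the data.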
10. Social Media Engagement
Anonymized social media interaction data across content types and audience segments. Develop skills in sentiment analysis, engagement prediction, content optimization, and audience targeting strategies.
Social Media Dataset Schema
    post_id              : STRING     # Unique post identifier
    timestamp            : TIMESTAMP  # Post creation time
    platform             : STRING     # Social media platform
    content_type         : STRING     # Type of content (text, image, video, link)
    post_category        : STRING     # Category of post content
    hashtags             : STRING     # Hashtags used (comma separated)
    user_id              : STRING     # Anonymized creator ID
    follower_count       : INTEGER    # Creator's follower count
    likes                : INTEGER    # Number of likes/reactions
    shares               : INTEGER    # Number of shares/retweets
    comments             : INTEGER    # Number of comments
    impressions          : INTEGER    # Total impressions
    click_count          : INTEGER    # Number of clicks (if link)
    video_views          : INTEGER    # Number of video views (if video)
    audience_age_bracket : STRING     # Predominant age bracket of engagers
    audience_gender      : STRING     # Predominant gender of engagers
    sentiment_score      : FLOAT      # Sentiment analysis score of comments
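Content optimization often begins with an engagement rate per content type. The sketch below defines engagement as likes plus shares plus comments divided by impressions, which is one common convention rather than anything the dataset prescribes. The sample posts and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample posts; only the field names come from the schema.
posts = [
    {"content_type": "video", "likes": 80, "shares": 10, "comments": 10, "impressions": 1000},
    {"content_type": "video", "likes": 40, "shares": 5, "comments": 5, "impressions": 1000},
    {"content_type": "text", "likes": 10, "shares": 0, "comments": 10, "impressions": 1000},
]


def engagement_rate_by_type(posts):
    """Total interactions divided by total impressions, per content type."""
    interactions = defaultdict(int)
    impressions = defaultdict(int)
    for post in posts:
        content_type = post["content_type"]
        interactions[content_type] += post["likes"] + post["shares"] + post["comments"]
        impressions[content_type] += post["impressions"]
    return {
        content_type: interactions[content_type] / impressions[content_type]
        for content_type in impressions
        if impressions[content_type] > 0
    }


engagement = engagement_rate_by_type(posts)
print(engagement)
```

Dividing by impressions rather than `follower_count` measures how content performs with the people who actually saw it, which makes posts from small and large creators easier to compare.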
Getting Started
Each dataset presents unique analytical challenges that mirror real-world business problems. Start by defining clear questions, then upload your data to PlotsALot and try answering:
- What patterns exist in the data?
- Which variables have the strongest relationships?
- How can the insights drive actionable business decisions?
Remember: the most valuable analysis isn't just technically sound; it delivers insights that enable better decision-making. These ten datasets are a solid training ground for developing both the technical skills and the business acumen that data analysis requires.