Top 10 Datasets to Elevate Your Game in Data Analysis

April 7, 2025
8 min read

Looking to sharpen your data analysis skills? Working with varied, real-world datasets is one of the most effective ways to build them. Here are ten datasets that will challenge you and help you develop expertise across multiple domains.

1. Marketing Channel Analytics

This dataset tracks conversions and clicks across multiple marketing channels, so you can work out how to allocate budget and when to run campaigns. It suits ROI analysis, attribution modeling, and marketing mix optimization. I covered this dataset in a tutorial in a previous blog post.

Marketing Dataset Schema

date          : STRING      # Date of marketing activity
channel       : STRING      # Marketing channel used
campaign      : STRING      # Specific campaign name
client        : STRING      # Client identifier
spend         : FLOAT       # Amount spent on campaign
impressions   : INTEGER     # Number of impressions generated
clicks        : INTEGER     # Number of clicks received
conversions   : INTEGER     # Number of conversions achieved
revenue       : FLOAT       # Revenue generated from campaign
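
As a starting point for the ROI analysis mentioned above, here is a minimal sketch that aggregates rows per channel and derives click-through rate, conversion rate, and ROI. The sample rows are made up, the function name channel_summary is hypothetical, and ROI is assumed to mean (revenue - spend) / spend; only the column names come from the schema.

```python
from collections import defaultdict

# Made-up sample rows; only the column names come from the schema above.
rows = [
    {"channel": "search", "spend": 100.0, "impressions": 1000, "clicks": 50, "conversions": 5, "revenue": 300.0},
    {"channel": "search", "spend": 100.0, "impressions": 1000, "clicks": 50, "conversions": 5, "revenue": 100.0},
    {"channel": "social", "spend": 50.0, "impressions": 2000, "clicks": 20, "conversions": 1, "revenue": 25.0},
]

METRICS = ("spend", "impressions", "clicks", "conversions", "revenue")

def channel_summary(rows):
    """Sum the raw metrics per channel, then derive CTR, conversion rate, and ROI."""
    totals = defaultdict(lambda: dict.fromkeys(METRICS, 0.0))
    for row in rows:
        for key in METRICS:
            totals[row["channel"]][key] += row[key]
    summary = {}
    for channel, t in totals.items():
        summary[channel] = {
            # Ratios are None when the denominator is zero.
            "ctr": t["clicks"] / t["impressions"] if t["impressions"] else None,
            "conversion_rate": t["conversions"] / t["clicks"] if t["clicks"] else None,
            # Assumed definition: ROI = (revenue - spend) / spend.
            "roi": (t["revenue"] - t["spend"]) / t["spend"] if t["spend"] else None,
        }
    return summary

summary = channel_summary(rows)
for channel, metrics in summary.items():
    print(channel, metrics)
```

Summing first and dividing afterwards weights each campaign by its volume, which usually reads better than averaging per-row ratios.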

2. NBA Player Performance Metrics for the 2017-20 Seasons

Real performance data spanning three professional basketball seasons. Ideal for sports analytics, player evaluation, team composition analysis, and predicting performance trends from historical statistics. I covered this dataset in a tutorial in a previous blog post.

NBA Performance Schema (first 14 columns)

player_name   : STRING      # Name of the player
team          : STRING      # Team name
games_played  : INTEGER     # Number of games played
minutes       : FLOAT       # Minutes played
points        : FLOAT       # Points scored per game
rebounds      : FLOAT       # Rebounds per game
assists       : FLOAT       # Assists per game
steals        : FLOAT       # Steals per game
blocks        : FLOAT       # Blocks per game
turnovers     : FLOAT       # Turnovers per game
field_goal_pct: FLOAT       # Field goal percentage
three_pt_pct  : FLOAT       # Three-point percentage
free_throw_pct: FLOAT       # Free throw percentage
plus_minus    : FLOAT       # Plus/minus rating

3. Manufacturing Incident Investigation

The GlobalShop dataset contains product manufacturing data with return information, allowing analysts to investigate patterns in product failures. It suits root cause analysis, quality control improvements, and predictive maintenance modeling. I covered this dataset in a tutorial in a previous blog post.

Manufacturing Dataset Schema

sale_date          : STRING      # Date of sale
region             : STRING      # Sales region
product_category   : STRING      # Product category
manufacturing_batch: STRING      # Batch identifier
price              : FLOAT       # Product price
is_returned        : BOOLEAN     # Whether product was returned
return_date        : STRING      # Date of return (if applicable)
return_reason      : STRING      # Reason for return
humidity_at_return : FLOAT       # Humidity level when returned
humidity_tolerance : STRING      # Product's humidity tolerance rating
sensor_supplier    : STRING      # Supplier of humidity sensor
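
A common first step in root cause analysis is to compare return rates across a candidate dimension such as batch or supplier. The sketch below does that with made-up rows; return_rate_by is a hypothetical helper name, and only the column names come from the schema.

```python
from collections import Counter

# Made-up sample rows using column names from the schema above.
sales = [
    {"manufacturing_batch": "B1", "sensor_supplier": "S1", "is_returned": True},
    {"manufacturing_batch": "B1", "sensor_supplier": "S1", "is_returned": True},
    {"manufacturing_batch": "B1", "sensor_supplier": "S2", "is_returned": False},
    {"manufacturing_batch": "B2", "sensor_supplier": "S2", "is_returned": False},
    {"manufacturing_batch": "B2", "sensor_supplier": "S2", "is_returned": False},
]

def return_rate_by(rows, column):
    """Return (value, return_rate) pairs for one column, highest rate first."""
    sold = Counter()
    returned = Counter()
    for row in rows:
        sold[row[column]] += 1
        if row["is_returned"]:
            returned[row[column]] += 1
    rates = {value: returned[value] / sold[value] for value in sold}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

by_batch = return_rate_by(sales, "manufacturing_batch")
by_supplier = return_rate_by(sales, "sensor_supplier")
print(by_batch)
print(by_supplier)
```

A dimension where one value stands far above the rest is a candidate worth investigating further; it is a lead, not proof of cause.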

4. Bike Rental Business

This well-known open dataset tracks daily bike rentals along with weather conditions, holidays, and seasonal information. Excellent for feature importance analysis, demand forecasting, and developing actionable business recommendations.

Bike Rental Schema

obs              : INTEGER     # Observation ID
date             : STRING      # Date of observation
year             : INTEGER     # Year
month            : INTEGER     # Month
season           : INTEGER     # Season (1:winter, 2:spring, 3:summer, 4:fall)
holiday          : INTEGER     # Whether day is holiday (0, 1)
working_day      : INTEGER     # Whether day is working day (0, 1)
weather_condition: INTEGER     # Weather condition code
temp             : FLOAT       # Temperature in Celsius
feel_temp        : FLOAT       # "Feels like" temperature
humidity         : FLOAT       # Humidity percentage
wind_speed       : FLOAT       # Wind speed
occasional       : INTEGER     # Count of occasional/casual users
members          : INTEGER     # Count of registered members
rental           : INTEGER     # Total rentals (target variable)
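
Before fitting a forecasting model, a quick way to gauge which features matter is to correlate each one with the target. The sketch below computes the Pearson correlation by hand on made-up values; the column names follow the schema, and pearson is a hypothetical helper. Correlation only captures linear relationships, so treat it as a first look rather than a full feature importance analysis.

```python
import math

# Made-up daily observations using column names from the schema above.
days = {
    "temp": [5.0, 10.0, 15.0, 20.0, 25.0],
    "humidity": [80.0, 70.0, 75.0, 60.0, 50.0],
    "rental": [100, 200, 300, 400, 500],
}

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

correlations = {
    feature: pearson(values, days["rental"])
    for feature, values in days.items()
    if feature != "rental"
}
for feature, r in correlations.items():
    print(f"{feature}: {r:+.2f}")
```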

5. Healthcare Patient Outcomes

This dataset contains patient treatment data across multiple hospitals with demographics and outcomes. Develop skills in cohort analysis, treatment efficacy evaluation, and identifying factors that influence recovery rates and complications.

Healthcare Dataset Schema

patient_id    : STRING      # Anonymized patient identifier
hospital_id   : INTEGER     # Hospital identifier
admission_date: DATE        # Date of admission
discharge_date: DATE        # Date of discharge
length_of_stay: INTEGER     # Length of stay in days
age           : INTEGER     # Patient age
gender        : STRING      # Patient gender
diagnosis_code: STRING      # Primary diagnosis code (ICD-10)
treatment_code: STRING      # Treatment procedure code
comorbidities : STRING      # Existing conditions (comma separated)
readmission   : BOOLEAN     # Whether readmitted within 30 days
mortality     : BOOLEAN     # Whether patient deceased during treatment
complications : STRING      # Complications during treatment
insurance_type: STRING      # Type of insurance
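
Cohort analysis on this schema can start as simply as grouping patients and comparing an outcome rate across the groups. The sketch below uses made-up patients and assumed age bands (under 40, 40-64, 65+); age_band and readmission_rate_by_cohort are hypothetical names.

```python
from collections import defaultdict

# Made-up patients using column names from the schema above.
patients = [
    {"age": 30, "readmission": False},
    {"age": 35, "readmission": True},
    {"age": 50, "readmission": False},
    {"age": 66, "readmission": False},
    {"age": 70, "readmission": True},
    {"age": 80, "readmission": True},
]

def age_band(age):
    """Map an age to a cohort label. The cut-offs are assumptions for illustration."""
    if age < 40:
        return "under 40"
    if age < 65:
        return "40-64"
    return "65+"

def readmission_rate_by_cohort(patients):
    """Return the share of patients readmitted within 30 days, per age band."""
    counts = defaultdict(lambda: [0, 0])  # band -> [readmitted, total]
    for patient in patients:
        band = age_band(patient["age"])
        counts[band][1] += 1
        if patient["readmission"]:
            counts[band][0] += 1
    return {band: readmitted / total for band, (readmitted, total) in counts.items()}

rates = readmission_rate_by_cohort(patients)
print(rates)
```

The same grouping works for hospital_id, diagnosis_code, or insurance_type by swapping the function that assigns the cohort label.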

6. E-commerce Customer Journey

This dataset tracks detailed customer interactions from browsing to purchase and returns. Perfect for building customer segmentation models, analyzing conversion funnels, developing recommendation algorithms, and calculating lifetime value metrics.

E-commerce Dataset Schema

event_time      : TIMESTAMP   # Time when event happened
event_type      : STRING      # Event type (view, cart, purchase, return)
product_id      : STRING      # Product ID
category_id     : STRING      # Product category ID
category_name   : STRING      # Product category name
brand           : STRING      # Brand name
price           : FLOAT       # Product price
user_id         : STRING      # User ID
user_session    : STRING      # User session ID
device          : STRING      # User device (mobile, desktop, tablet)
referrer        : STRING      # Referring site or source
location        : STRING      # User location
is_returning    : BOOLEAN     # Whether user is returning customer
days_since_first: INTEGER     # Days since first visit
days_since_last : INTEGER     # Days since last visit
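
A conversion funnel over these events can be sketched by counting how many sessions reach each stage. The events below are made up, funnel is a hypothetical helper, and the stage order view, cart, purchase is an assumption based on the event types listed in the schema. For simplicity the sketch checks only that a session contains each stage, not that the events occurred in that order.

```python
# Made-up events using column names from the schema above.
events = [
    {"user_session": "s1", "event_type": "view"},
    {"user_session": "s1", "event_type": "cart"},
    {"user_session": "s1", "event_type": "purchase"},
    {"user_session": "s2", "event_type": "view"},
    {"user_session": "s2", "event_type": "cart"},
    {"user_session": "s3", "event_type": "view"},
    {"user_session": "s4", "event_type": "view"},
]

def funnel(events, stages=("view", "cart", "purchase")):
    """Count sessions that contain every stage up to and including each one."""
    seen = {}  # session id -> set of event types observed in that session
    for event in events:
        seen.setdefault(event["user_session"], set()).add(event["event_type"])
    remaining = set(seen)
    counts = []
    for stage in stages:
        remaining = {session for session in remaining if stage in seen[session]}
        counts.append((stage, len(remaining)))
    return counts

steps = funnel(events)
for (stage, count), (_, previous) in zip(steps, [steps[0]] + steps[:-1]):
    rate = count / previous if previous else 0.0
    print(f"{stage}: {count} sessions ({rate:.0%} of previous stage)")
```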

7. Financial Market Time Series

Historical market data with various economic indicators. Ideal for time series analysis, volatility modeling, trend detection, and developing trading strategies based on historical patterns.

Financial Market Schema

date          : DATE        # Trading date
ticker        : STRING      # Asset ticker symbol
open          : FLOAT       # Opening price
high          : FLOAT       # Highest price during day
low           : FLOAT       # Lowest price during day
close         : FLOAT       # Closing price
adj_close     : FLOAT       # Adjusted closing price
volume        : INTEGER     # Trading volume
dividend      : FLOAT       # Dividend payment if any
split         : FLOAT       # Stock split ratio if any
interest_rate : FLOAT       # Benchmark interest rate
inflation_rate: FLOAT       # Monthly inflation rate
unemployment  : FLOAT       # Monthly unemployment rate
gdp_growth    : FLOAT       # Quarterly GDP growth rate
sector        : STRING      # Market sector
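
For volatility modeling, a typical first step is to turn prices into returns and measure their rolling standard deviation. The sketch below assumes simple daily returns on adj_close and the sample standard deviation; the prices are made up, and daily_returns and rolling_volatility are hypothetical helper names.

```python
import statistics

# Made-up adjusted closing prices for one ticker, in date order.
adj_close = [100.0, 110.0, 99.0, 99.0]

def daily_returns(prices):
    """Simple return between each pair of consecutive prices."""
    return [current / previous - 1 for previous, current in zip(prices, prices[1:])]

def rolling_volatility(returns, window):
    """Sample standard deviation of returns over each full window (window >= 2)."""
    return [
        statistics.stdev(returns[end - window:end])
        for end in range(window, len(returns) + 1)
    ]

returns = daily_returns(adj_close)
volatility = rolling_volatility(returns, window=2)
print([round(r, 4) for r in returns])
print([round(v, 4) for v in volatility])
```

Using adj_close rather than close keeps dividends and splits from showing up as artificial jumps in the return series.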

8. Climate and Weather Patterns

Long-term climate records containing temperature, precipitation, and extreme weather events. Excellent for time series decomposition, anomaly detection, seasonal pattern identification, and predictive modeling.

Climate Data Schema

date          : DATE        # Date of observation
station_id    : STRING      # Weather station identifier
latitude      : FLOAT       # Station latitude
longitude     : FLOAT       # Station longitude
max_temp      : FLOAT       # Maximum temperature (Celsius)
min_temp      : FLOAT       # Minimum temperature (Celsius)
avg_temp      : FLOAT       # Average temperature (Celsius)
precipitation : FLOAT       # Precipitation amount (mm)
snowfall      : FLOAT       # Snowfall amount (mm)
humidity      : FLOAT       # Average relative humidity (%)
wind_speed    : FLOAT       # Average wind speed (km/h)
wind_direction: INTEGER     # Wind direction (degrees)
pressure      : FLOAT       # Atmospheric pressure (hPa)
sunshine_hours: FLOAT       # Hours of sunshine
cloud_cover   : FLOAT       # Cloud cover percentage
event_type    : STRING      # Weather event type (if any)
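
One simple form of anomaly detection is to flag readings that sit far from the mean in units of standard deviation. The sketch below applies that z-score rule to made-up max_temp values; zscore_anomalies is a hypothetical helper and the threshold of 2.0 is an arbitrary assumption. On a real multi-year series you would normally remove the seasonal pattern first, otherwise every summer day looks unusual.

```python
import statistics

# Made-up daily maximum temperatures (Celsius) for one station.
max_temp = [20.0, 21.0, 19.0, 20.0, 22.0, 20.0, 21.0, 19.0, 20.0, 35.0]

def zscore_anomalies(values, threshold=2.0):
    """Return (index, value) pairs whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []  # A constant series has no outliers under this rule.
    return [
        (index, value)
        for index, value in enumerate(values)
        if abs(value - mean) / std > threshold
    ]

anomalies = zscore_anomalies(max_temp)
print(anomalies)
```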

9. Urban Transportation and Traffic Flow

City mobility data showing traffic patterns and public transportation usage. Practice spatial data analysis, optimization problems, and infrastructure planning through real-world urban movement patterns.

Transportation Dataset Schema

timestamp              : TIMESTAMP   # Time of measurement
segment_id             : STRING      # Road segment identifier
start_lat              : FLOAT       # Starting latitude
start_lon              : FLOAT       # Starting longitude
end_lat                : FLOAT       # Ending latitude
end_lon                : FLOAT       # Ending longitude
length_km              : FLOAT       # Segment length in kilometers
vehicle_count          : INTEGER     # Number of vehicles
avg_speed              : FLOAT       # Average speed (km/h)
congestion_level       : INTEGER     # Congestion level (0-4)
travel_time            : FLOAT       # Average travel time (minutes)
day_type               : STRING      # Type of day (weekday, weekend, holiday)
weather_condition      : STRING      # Weather condition
accident_nearby        : BOOLEAN     # Whether accident reported nearby
public_transport_routes: INTEGER     # Number of public transit routes
public_transport_freq  : FLOAT       # Public transit frequency (vehicles/hour)

10. Social Media Engagement

Anonymized social media interaction data across content types and audience segments. Develop skills in sentiment analysis, engagement prediction, content optimization, and audience targeting strategies.

Social Media Dataset Schema

post_id             : STRING      # Unique post identifier
timestamp           : TIMESTAMP   # Post creation time
platform            : STRING      # Social media platform
content_type        : STRING      # Type of content (text, image, video, link)
post_category       : STRING      # Category of post content
hashtags            : STRING      # Hashtags used (comma separated)
user_id             : STRING      # Anonymized creator ID
follower_count      : INTEGER     # Creator's follower count
likes               : INTEGER     # Number of likes/reactions
shares              : INTEGER     # Number of shares/retweets
comments            : INTEGER     # Number of comments
impressions         : INTEGER     # Total impressions
click_count         : INTEGER     # Number of clicks (if link)
video_views         : INTEGER     # Number of video views (if video)
audience_age_bracket: STRING      # Predominant age bracket of engagers
audience_gender     : STRING      # Predominant gender of engagers
sentiment_score     : FLOAT       # Sentiment analysis score of comments
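
For content optimization, a useful baseline is the engagement rate per content type. The sketch below assumes engagement rate means (likes + shares + comments) / impressions, which is one common definition among several; the posts are made up and engagement_rate_by_type is a hypothetical helper.

```python
from collections import defaultdict

# Made-up posts using column names from the schema above.
posts = [
    {"content_type": "video", "likes": 80, "shares": 10, "comments": 10, "impressions": 1000},
    {"content_type": "image", "likes": 30, "shares": 5, "comments": 5, "impressions": 2000},
    {"content_type": "image", "likes": 10, "shares": 0, "comments": 10, "impressions": 1000},
]

def engagement_rate_by_type(posts):
    """Total interactions divided by total impressions, per content type."""
    totals = defaultdict(lambda: [0, 0])  # content_type -> [interactions, impressions]
    for post in posts:
        bucket = totals[post["content_type"]]
        bucket[0] += post["likes"] + post["shares"] + post["comments"]
        bucket[1] += post["impressions"]
    # Content types with zero impressions are skipped to avoid dividing by zero.
    return {
        content_type: interactions / impressions
        for content_type, (interactions, impressions) in totals.items()
        if impressions
    }

engagement = engagement_rate_by_type(posts)
print(engagement)
```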

Getting Started

Each dataset presents unique analytical challenges that mirror real-world business problems. Start by defining clear questions, then upload your data to PlotsALot and try answering:

  • What patterns exist in the data?
  • Which variables have the strongest relationships?
  • How can the insights drive actionable business decisions?

Remember: the most valuable analysis is not only technically sound, it also delivers insights that enable better decisions. These ten datasets give you a training ground for both the technical skills and the business acumen that data analysis demands.
