Public Data Sources: Complete Guide for Data Science

Data is the lifeblood of data science and machine learning. Accessing high-quality, public datasets is crucial for practice, research, and building models. Below is a curated list of reliable online data sources, categorized for ease of use.

Why Public Data Matters

Public datasets enable:

- Learning: Practice data analysis and visualization skills
- Research: Reproduce and build upon existing studies
- Machine Learning: Train and validate models
- Portfolio Projects: Build impressive data science portfolios
- Competition: Participate in data science competitions

Global Development and Statistics

Gapminder

Gapminder Data Browser
- Gapminder collects global development statistics
- Covers human development, health, education, economy
- Perfect for visualizing trends over time
- Hans Rosling’s famous visualizations use this data

World Bank

World Bank Open Data
- Global economic, social, and environmental data
- 200+ countries, 50+ years of data
- GDP, population, trade, education indicators
- Free API access available

United Nations

UN Data
- UN statistics on various topics
- Population, trade, environment
- Multiple file formats available

Government Data

United States

Data.gov
- U.S. government open data portal
- Thousands of datasets across categories
- Agriculture, climate, energy, health

U.S. Census Bureau
- Demographic data
- Economic surveys
- Geographic data

United Kingdom

UK Data Service
- Social and economic data
- Large-scale surveys

European Union

Eurostat
- EU statistics
- Economy, population, trade

Machine Learning Datasets

UCI Machine Learning Repository

UCI ML Repository
- Classic datasets for machine learning research
- Includes Iris, Wine, Adult, and handwritten-digit datasets
- Well-documented, clean datasets
- Perfect for beginners

Kaggle

Kaggle Datasets
- Vast collection of user-uploaded datasets
- Competition datasets
- Community ratings and documentation
- Covers every domain imaginable

Google

Google Dataset Search
- Search engine for datasets across the web
- Indexes academic and government sources
- Finds datasets from various publishers

Hugging Face Datasets Hub

The Hugging Face Datasets library provides programmatic access to thousands of datasets with a unified API:

from datasets import load_dataset

# Load a dataset directly in Python
dataset = load_dataset("squad", split="train")
print(f"SQuAD: {len(dataset)} examples")
print(dataset[0])
# {'id': '5733be284776f41900661182', 'title': '...', 'context': '...', 'question': '...', 'answers': {...}}

# Stream large datasets without downloading fully
# (C4 is published on the Hub under the allenai namespace)
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(streamed):
    if i >= 5:
        break
    print(f"Example {i}: {example['text'][:100]}...")

# Filter and transform with built-in methods
filtered = dataset.filter(lambda x: len(x["context"]) > 500)
print(f"Long context examples: {len(filtered)}")

# Map transformations
def add_length(example):
    example["question_length"] = len(example["question"])
    return example

dataset = dataset.map(add_length)

Dataset	Domain	Size	Use Case
SQuAD 2.0	Reading comprehension	150K Q&A	Question answering models
C4	Web text	750GB	LLM pretraining
GLUE/SuperGLUE	NLU	Various	Benchmark evaluation
Common Crawl	Web pages	Billions of pages	Large-scale pretraining
ImageNet	Computer vision	14M images	Image classification
COCO	Vision + captions	330K images	Object detection, captioning
LibriSpeech	Audio	1000 hours	Speech recognition
Wikipedia	Text	6M articles	Knowledge base, embeddings

Curated Top ML Datasets for 2026

Dataset	Type	Size	Best For	Source
FineWeb	Text	15T tokens	LLM training	Hugging Face
DCLM-Baseline	Text	4T tokens	LLM benchmarking	Hugging Face
OpenAssistant 2	Conversations	1M+ dialogues	Instruction tuning	LAION
MathQA	Mathematics	37K problems	Math reasoning	AI2
GSM8K	Math word problems	8.5K problems	Arithmetic reasoning	OpenAI
HumanEval	Code	164 problems	Code generation	OpenAI
SWE-bench	Software engineering	2,294 tasks	Agentic coding	Princeton
Dolma	Text	3T tokens	Language modeling	AI2
RedPajama-V2	Text	30T tokens	LLM research	Together
Anthropic HH-RLHF	Preferences	170K comparisons	RLHF training	Anthropic

Specialized Data Sources

Finance and Economics

OpenSecrets
- Campaign contributions
- Lobbying data
- Political spending

FRED - Federal Reserve Economic Data
- U.S. economic data
- Interest rates, employment, GDP

Yahoo Finance
- Stock market data
- Historical prices
- Company fundamentals

Entertainment

Box Office Mojo
- Movie revenue data
- Budgets and performance metrics
- Film industry trends

IMDb
- Movie and TV database
- Ratings, cast, crew

Sports

FiveThirtyEight
- Sports data
- Election predictions
- Methodology available

Health

WHO Global Health Observatory
- Global health statistics
- Disease prevalence
- Health systems

CDC Data
- U.S. health data
- Disease tracking
- Public health statistics

Climate

NASA Climate Data
- Climate change data
- Temperature records
- CO2 levels

NOAA Climate Data
- Weather data
- Oceanographic data
- Climate records

Social Data

Social Media

Twitter API
- Tweet data
- Sentiment analysis
- Trend tracking

Reddit Data
- User-submitted datasets
- API access

Academic

arXiv
- Preprints in physics, math, CS
- Full text available

Semantic Scholar
- Academic paper data
- Citations
- Research trends

Data Portals by Country

China

National Bureau of Statistics
- Official Chinese statistics
- English version available

Japan

Statistics Bureau
- Japanese government statistics

India

Open Government Data
- Indian government datasets

Tips for Using Public Data

Data Quality

- Check the source: Prefer official and well-documented sources
- Verify licensing: Some datasets have restrictions
- Check for updates: Data may be outdated

Data Processing

- Clean data: Handle missing values
- Validate: Cross-check with other sources
- Document: Track data transformations
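
As a rough illustration of the cleaning and documentation tips above, the sketch below drops incomplete and duplicate rows from a small CSV and records each step in a log. The sample data, the column names, and the `clean_rows` helper are made up for this example; real pipelines would usually do the same with pandas.

```python
import csv
import io

# Hypothetical sample data standing in for a downloaded CSV file.
RAW_CSV = """country,year,gdp_per_capita
Norway,2020,67000
Chad,2020,
Peru,2020,6100
Peru,2020,6100
"""

def clean_rows(raw_text: str) -> tuple[list[dict], list[str]]:
    """Drop rows with a missing value, remove exact duplicates, and log each step."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    log = [f"loaded {len(rows)} rows"]

    # Keep only rows where every field has a value
    complete = [r for r in rows if all(v not in ("", None) for v in r.values())]
    log.append(f"dropped {len(rows) - len(complete)} rows with missing values")

    # Remove exact duplicates while preserving the original order
    seen, unique = set(), []
    for r in complete:
        key = tuple(r.items())
        if key not in seen:
            seen.add(key)
            unique.append(r)
    log.append(f"dropped {len(complete) - len(unique)} duplicate rows")
    return unique, log

cleaned, transformation_log = clean_rows(RAW_CSV)
for step in transformation_log:
    print(step)
```

Keeping the log next to the cleaned data makes it easier to explain later why row counts changed between the raw and processed versions.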

Ethical Considerations

- Privacy: Don’t use personally identifiable information inappropriately
- Bias: Be aware of dataset biases
- Citation: Always credit data sources
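
One common way to limit exposure of personal identifiers is to replace them with keyed hashes before analysis. The sketch below is only an illustration: the `pseudonymize` helper, the key, and the sample records are hypothetical, and pseudonymized data can still be re-identified through other columns, so this does not make a dataset anonymous on its own.

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest so records stay linkable."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

# Hypothetical key; in practice it would be stored outside the dataset and the code
key = b"replace-with-a-secret-key"

# Made-up records: the same person appears twice with different formatting
records = [
    {"email": "Ada@example.com", "age": 36},
    {"email": "ada@example.com ", "age": 36},
]
safe = [{"user": pseudonymize(r["email"], key), "age": r["age"]} for r in records]

# Both rows map to the same pseudonym, so joins and group-bys still work
print(safe[0]["user"] == safe[1]["user"])
```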

For Machine Learning

- Start simple: Begin with UCI datasets
- Understand data: Explore before modeling
- Split properly: Train/validation/test splits
- Reproduce: Document preprocessing steps
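
A train/validation/test split is easy to reproduce when the shuffle uses a fixed seed. The following is a minimal sketch using only the standard library; the `split_dataset` name, the 80/10/10 proportions, and the seed are illustrative choices, not a recommendation for every dataset.

```python
import random

def split_dataset(items: list, val_frac: float = 0.1, test_frac: float = 0.1, seed: int = 42):
    """Shuffle a copy of items with a fixed seed and cut it into train/validation/test."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(items)
    # A dedicated Random instance keeps the split independent of global random state
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))
```

For time series or grouped data (for example several rows per patient), a random shuffle like this can leak information between splits, so a split by date or by group is usually the safer choice.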

Programmatic Data Access

Modern data science requires automated data ingestion. Here are patterns for accessing the major sources programmatically:

World Bank API

import requests

def get_world_bank_indicator(indicator: str, country: str = "all") -> list[dict]:
    """Fetch the first page of observations for an indicator from the World Bank API."""
    url = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    params = {"format": "json", "per_page": 100, "date": "2000:2025"}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    payload = response.json()
    # A successful response is [metadata, rows]; errors come back as a single-element list
    if len(payload) < 2 or payload[1] is None:
        return []
    return payload[1]

# Example: GDP per capita, PPP (constant international $); "all" also returns regional aggregates
gdp_data = get_world_bank_indicator("NY.GDP.PCAP.PP.KD")
for entry in gdp_data[:5]:
    print(f"{entry['country']['value']}: {entry['value']}")
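
The World Bank API returns results in pages, with a metadata element that reports the total page count in its `pages` field and a `page` query parameter to select a page. The sketch below shows one way to collect every page. To keep it runnable offline, the page fetcher is passed in as a function: `fetch_all_pages` and `fake_fetch` are hypothetical names and the values are made up. In real use the fetcher would call `requests.get` with `page` added to the query parameters.

```python
from typing import Callable

def fetch_all_pages(fetch_page: Callable[[int], list], max_pages: int = 50) -> list[dict]:
    """Collect rows from every page of a World Bank-style [metadata, rows] response."""
    rows: list[dict] = []
    page = 1
    while page <= max_pages:
        payload = fetch_page(page)
        # Stop on an error-style payload or an empty page
        if len(payload) < 2 or not payload[1]:
            break
        metadata, page_rows = payload[0], payload[1]
        rows.extend(page_rows)
        # The metadata reports how many pages exist in total
        if page >= int(metadata.get("pages", 1)):
            break
        page += 1
    return rows

# Stand-in fetcher with made-up values, used here instead of a network call
def fake_fetch(page: int) -> list:
    data = {1: [{"value": 1.0}, {"value": 2.0}], 2: [{"value": 3.0}]}
    return [{"page": page, "pages": 2}, data.get(page, [])]

all_rows = fetch_all_pages(fake_fetch)
print(len(all_rows))
```

The `max_pages` cap is a safety limit so that an unexpected response cannot cause an endless loop.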

FRED Economic Data

import pandas as pd
from fredapi import Fred

fred = Fred(api_key="your_fred_api_key")

# Fetch unemployment rate
unemployment = fred.get_series("UNRATE", observation_start="2020-01-01")
print(unemployment.head())

# Search for series
search = fred.search("GDP", limit=5)
print(search[["title", "id", "frequency"]])

Hugging Face Dataset Search and Loading

from datasets import get_dataset_config_names, load_dataset

# List the configurations a dataset exposes (this does not search the Hub)
configs = get_dataset_config_names("squad_v2")
print(f"SQuAD v2 configs: {configs}")

# Load the training split and convert it to a pandas DataFrame
dataset = load_dataset("squad_v2", split="train")
df = dataset.to_pandas()
print(f"Loaded {len(df)} examples")

# Filter for questions containing "why" (substring match, case-insensitive)
why_questions = df[df["question"].str.contains("why", case=False, na=False)]
print(f"Why questions: {len(why_questions)}")

Kaggle API

# Install: pip install kagglehub
import os

import kagglehub

# Download latest version of a dataset
path = kagglehub.dataset_download("datasnaek/youtube-new")
print(f"Dataset downloaded to: {path}")

# List files
for file in os.listdir(path):
    print(file)

Data Versioning with DVC

DVC (Data Version Control) brings Git-like versioning to datasets, essential for reproducible ML:

# Initialize DVC
pip install dvc
dvc init

# Track a dataset
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc .gitignore
git commit -m "Add raw dataset"

# Switch between versions
git checkout <commit-hash>
dvc checkout

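The hash recorded in a .dvc file can also be checked from Python. DVC records an MD5 digest for tracked files by default, so the check uses MD5 rather than SHA-256:

import hashlib

def verify_dataset_integrity(filepath: str, expected_md5: str) -> bool:
    """Check that a file's MD5 digest matches the md5 value in its .dvc file."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        # Read in chunks so large datasets do not need to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5

# 'abc123...' stands in for the md5 value recorded in data/raw/dataset.csv.dvc
print(f"Dataset integrity: {verify_dataset_integrity('data/raw/dataset.csv', 'abc123...')}")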

DVC supports remote storage backends (S3, GCS, SSH) for sharing datasets across teams, using `dvc push` and `dvc pull` commands that mirror Git's push/pull workflow.

Data Quality Validation with Great Expectations

Automated data quality checks should run before any ML pipeline:

import great_expectations as ge

# Legacy pandas-style API: ge.read_csv is not available in Great Expectations 1.0+,
# so this snippet assumes an older release.
df = ge.read_csv("data/dataset.csv")

# Each expect_* call validates immediately and returns a result object
results = [
    df.expect_column_values_to_not_be_null("id"),
    df.expect_column_values_to_be_between("age", 0, 120),
    df.expect_column_values_to_be_in_set("category", ["A", "B", "C"]),
    df.expect_column_median_to_be_between("salary", 30000, 200000),
    df.expect_column_values_to_be_unique("email"),
]

# Count the expectations that passed
passed = sum(1 for r in results if r.success)
print(f"Passed: {passed}/{len(results)} expectations")

# Save the expectations recorded on the DataFrame for reuse
# df.save_expectation_suite("my_suite.json")

Data Quality Checklist

Check	Tool/Method	Frequency
Missing values	Pandas `.isnull().sum()`	Every pipeline run
Type consistency	Pandas `.dtypes`	On schema change
Range validation	Custom bounds check	Per dataset version
Uniqueness constraints	Pandas `.duplicated()`	On data load
Distribution drift	Kolmogorov-Smirnov test	Weekly
Schema validation	Great Expectations	Every pipeline run
Freshness check	Compare max date to current	Daily
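
The distribution drift row can be illustrated without any dependencies. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical CDFs of a reference sample and a new sample. The `ks_statistic` name, the sample values, and the 0.5 alert threshold are assumptions made for this example; `scipy.stats.ks_2samp` is the usual choice when a p-value is needed.

```python
from bisect import bisect_right

def ks_statistic(sample_a: list[float], sample_b: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Made-up samples: last week's values versus this week's values
reference = [1.0, 2.0, 3.0, 4.0]
current = [3.0, 4.0, 5.0, 6.0]

DRIFT_THRESHOLD = 0.5  # hypothetical alert level, tune per dataset
statistic = ks_statistic(reference, current)
print(f"KS statistic: {statistic:.2f}, drift flagged: {statistic >= DRIFT_THRESHOLD}")
```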

Synthetic Data Generation

When public data is insufficient or privacy-restricted, synthetic data fills the gap:

from faker import Faker
import pandas as pd
import random

fake = Faker()
# Seed both Faker and the standard random module so runs are reproducible
Faker.seed(42)
random.seed(42)

def generate_customer_data(n: int = 1000) -> pd.DataFrame:
    """Generate synthetic customer records."""
    data = []
    for _ in range(n):
        data.append({
            "customer_id": fake.uuid4(),
            "name": fake.name(),
            "email": fake.email(),
            "age": random.randint(18, 80),
            "income": round(random.gammavariate(alpha=5, beta=10000), 2),
            "signup_date": fake.date_between(start_date="-3y", end_date="today"),
            "country": fake.country(),
            "is_active": random.random() > 0.2,
        })
    return pd.DataFrame(data)

df = generate_customer_data(5000)
print(f"Generated {len(df)} synthetic records")
print(df.describe())

Advanced Synthetic Data with SDV

# pip install sdv
from sdv.datasets.demo import download_demo
from sdv.evaluation.single_table import evaluate_quality
from sdv.single_table import GaussianCopulaSynthesizer

# Load a demo dataset and its metadata, then train the synthesizer
real_data, metadata = download_demo(
    modality="single_table",
    dataset_name="fake_hotel_guests"
)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=500)

# Evaluate how closely the synthetic data matches the real data
quality_report = evaluate_quality(
    real_data, synthetic_data, metadata
)
print(f"Quality score: {quality_report.get_score():.2f}")

Use Cases for Synthetic Data

Use Case	Approach	Tool
Privacy-preserving ML	Differential privacy + generation	SDV, Gretel.ai
Class imbalance	SMOTE, GAN-based oversampling	imbalanced-learn, CTGAN
Data augmentation	Rule-based transformation	Albumentations, nlpaug
Testing/CI pipelines	Deterministic generation	Faker, factory_boy
Stress testing	Edge case generation	Hypothesis, custom rules

Modern Data Collection Tools (2026)

Tool	Type	Best For	Pricing
Apify	Web scraping	Large-scale structured data extraction	Pay-per-use
Bright Data	Proxy + scraping	Enterprise web data collection	Subscription
Common Crawl	Web crawl archive	Historical web data for NLP	Free
Zyte	Scraping API	Reliable data extraction	Per-record
Diffbot	Knowledge graph	Structured entity extraction	API credits
Octoparse	Visual scraping	Non-technical users	Freemium

Web Scraping with Python

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_table(url: str, table_class: str) -> pd.DataFrame:
    """Scrape the first HTML table with the given class into a DataFrame."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return pd.DataFrame()

    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])
        rows.append([cell.get_text(strip=True) for cell in cells])

    # The first row is treated as the header; this assumes every row has the same
    # number of cells, so tables with merged cells may need pandas.read_html instead
    return pd.DataFrame(rows[1:], columns=rows[0]) if len(rows) > 1 else pd.DataFrame()

# Example: scrape a Wikipedia table
# df = scrape_table("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)", "wikitable")
# print(df.head())
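
Before scraping a site, it is worth checking what its robots.txt allows. The sketch below uses the standard library's `urllib.robotparser` on a made-up robots.txt so it runs without network access. The user agent name and the URLs are hypothetical; a real crawler would load the file from the target site with `set_url` and `read`.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would download it from the site
ROBOTS_TXT = """User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

def build_parser(robots_text: str) -> RobotFileParser:
    """Parse robots.txt rules from text instead of fetching them over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
agent = "example-research-bot"  # hypothetical user agent name

print(parser.can_fetch(agent, "https://example.com/data/table.html"))
print(parser.can_fetch(agent, "https://example.com/private/table.html"))
print(parser.crawl_delay(agent))
```

When a crawl delay is declared, pausing for that many seconds between requests (for example with `time.sleep`) keeps the scraper within the site's stated limits.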

Conclusion

These sources provide a great starting point for data analysis projects. Whether you’re learning data science, building machine learning models, or conducting research, quality data is essential.

Explore these resources to fuel your learning and innovation! Start with well-known sources like Kaggle and UCI, then branch into specialized domains as needed.
