Public Data Sources: Complete Guide for Data Science

Internet Public Data Sources for Data Analysis

Published: January 3, 2019 Updated: May 24, 2026 Larry Qu 8 min read

Data is the lifeblood of data science and machine learning. Accessing high-quality, public datasets is crucial for practice, research, and building models. Below is a curated list of reliable online data sources, categorized for ease of use.

Why Public Data Matters

Public datasets enable:

  • Learning: Practice data analysis and visualization skills
  • Research: Reproduce and build upon existing studies
  • Machine Learning: Train and validate models
  • Portfolio Projects: Build impressive data science portfolios
  • Competition: Participate in data science competitions

Global Development and Statistics

Gapminder

  • Gapminder Data Browser
    • Gapminder collects global development statistics
    • Covers human development, health, education, economy
    • Perfect for visualizing trends over time
    • Hans Rosling’s famous visualizations use this data

World Bank

  • World Bank Open Data
    • Global economic, social, and environmental data
    • 200+ countries, 50+ years of data
    • GDP, population, trade, education indicators
    • Free API access available

United Nations

  • UN Data
    • UN statistics on various topics
    • Population, trade, environment
    • Multiple file formats available

Government Data

United States

  • Data.gov

    • U.S. government open data portal
    • Thousands of datasets across categories
    • Agriculture, climate, energy, health
  • U.S. Census Bureau

    • Demographic data
    • Economic surveys
    • Geographic data

United Kingdom

European Union

  • Eurostat
    • EU statistics
    • Economy, population, trade

Machine Learning Datasets

UCI Machine Learning Repository

  • UCI ML Repository
    • Classic datasets for machine learning research
    • Includes: Iris, Wine, Adult, and several handwritten-digit datasets
    • Well-documented, clean datasets
    • Perfect for beginners

Kaggle

  • Kaggle Datasets
    • Vast collection of user-uploaded datasets
    • Competition datasets
    • Community ratings and documentation
    • Covers every domain imaginable

Google

  • Google Dataset Search
    • Search engine for datasets across the web
    • Indexes academic and government sources
    • Finds datasets from various publishers

Hugging Face Datasets Hub

The Hugging Face Datasets library provides programmatic access to thousands of datasets with a unified API:

from datasets import load_dataset

# Load a dataset directly in Python
dataset = load_dataset("squad", split="train")
print(f"SQuAD: {len(dataset)} examples")
print(dataset[0])
# {'id': '5733be284776f41900661182', 'title': '...', 'context': '...', 'question': '...', 'answers': {...}}

# Stream large datasets without downloading fully
# (C4 is published on the Hub under the allenai namespace)
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(streamed):
    if i >= 5:
        break
    print(f"Example {i}: {example['text'][:100]}...")

# Filter and transform with built-in methods
filtered = dataset.filter(lambda x: len(x["context"]) > 500)
print(f"Long context examples: {len(filtered)}")

# Map transformations
def add_length(example):
    example["question_length"] = len(example["question"])
    return example

dataset = dataset.map(add_length)

| Dataset | Domain | Size | Use Case |
|---|---|---|---|
| SQuAD 2.0 | Reading comprehension | 150K Q&A | Question answering models |
| C4 | Web text | 750GB | LLM pretraining |
| GLUE/SuperGLUE | NLU | Various | Benchmark evaluation |
| Common Crawl | Web pages | Billions of pages | Large-scale pretraining |
| ImageNet | Computer vision | 14M images | Image classification |
| COCO | Vision + captions | 330K images | Object detection, captioning |
| LibriSpeech | Audio | 1000 hours | Speech recognition |
| Wikipedia | Text | 6M articles | Knowledge base, embeddings |

Curated Top ML Datasets for 2026

| Dataset | Type | Size | Best For | Source |
|---|---|---|---|---|
| FineWeb | Text | 15T tokens | LLM training | Hugging Face |
| DCLM-Baseline | Text | 4T tokens | LLM benchmarking | Hugging Face |
| OpenAssistant 2 | Conversations | 100K+ messages | Instruction tuning | LAION |
| MathQA | Mathematics | 37K problems | Math reasoning | AI2 / Univ. of Washington |
| GSM8K | Math word problems | 8.5K problems | Arithmetic reasoning | OpenAI |
| HumanEval | Code | 164 problems | Code generation | OpenAI |
| SWE-bench | Software engineering | 2,294 tasks | Agentic coding | Princeton |
| Dolma | Text | 3T tokens | Language modeling | AI2 |
| RedPajama-V2 | Text | 30T tokens | LLM research | Together |
| Anthropic HH-RLHF | Preferences | 170K comparisons | RLHF training | Anthropic |

Specialized Data Sources

Finance and Economics

Entertainment

  • Box Office Mojo

    • Movie revenue data
    • Budgets and performance metrics
    • Film industry trends
  • IMDb

    • Movie and TV database
    • Ratings, cast, crew

Sports

Health

Climate


Social Data

Social Media

Academic

  • arXiv

    • Preprints in physics, math, CS
    • Full text available
  • Semantic Scholar

    • Academic paper data
    • Citations
    • Research trends

Data Portals by Country

China

Japan

India


Tips for Using Public Data

Data Quality

  1. Check the source: Prefer official and well-documented sources
  2. Verify licensing: Some datasets have restrictions
  3. Check for updates: Data may be outdated
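
The "check for updates" tip can be automated with a simple freshness test. The sketch below assumes the dataset exposes observation dates as ISO-formatted strings; the function names, the sample dates, and the 90-day threshold are illustrative choices rather than part of any particular data source.

```python
from datetime import date, datetime

def days_since_latest(date_strings, today=None, fmt="%Y-%m-%d"):
    """Return the number of days between the newest record and `today`."""
    latest = max(datetime.strptime(s, fmt).date() for s in date_strings)
    today = today or date.today()
    return (today - latest).days

def is_stale(date_strings, max_age_days=365, today=None):
    """True when the newest record is older than `max_age_days`."""
    return days_since_latest(date_strings, today=today) > max_age_days

# Hypothetical observation dates taken from a downloaded dataset
observed = ["2023-01-15", "2024-06-30", "2022-11-01"]
reference_day = date(2025, 1, 1)  # fixed so the example is reproducible

print(days_since_latest(observed, today=reference_day))
print(is_stale(observed, max_age_days=90, today=reference_day))
```

Passing `today` explicitly keeps the check deterministic in tests; in a pipeline it can be left out so the current date is used.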

Data Processing

  1. Clean data: Handle missing values
  2. Validate: Cross-check with other sources
  3. Document: Track data transformations
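
One way to combine the three processing tips is to run cleaning steps through a small wrapper that records what each step did. This is a minimal sketch using only the standard library; the helper names and the sample rows are hypothetical, and a real project would more likely apply the same idea to pandas DataFrames.

```python
def drop_missing(rows, required):
    """Keep only rows where every required field is present and non-empty."""
    return [r for r in rows if all(r.get(k) not in (None, "") for k in required)]

def in_range(rows, field, low, high):
    """Keep rows whose numeric field falls inside [low, high]."""
    return [r for r in rows if low <= float(r[field]) <= high]

def run_pipeline(rows, steps):
    """Apply each (name, function) step and log row counts before and after."""
    log = []
    for name, step in steps:
        before = len(rows)
        rows = step(rows)
        log.append({"step": name, "rows_in": before, "rows_out": len(rows)})
    return rows, log

# Hypothetical raw records with one missing and one implausible value
raw = [
    {"country": "A", "gdp": "1200"},
    {"country": "B", "gdp": ""},
    {"country": "C", "gdp": "-50"},
    {"country": "D", "gdp": "860"},
]
steps = [
    ("drop_missing_gdp", lambda rs: drop_missing(rs, ["country", "gdp"])),
    ("gdp_non_negative", lambda rs: in_range(rs, "gdp", 0, float("inf"))),
]
clean, log = run_pipeline(raw, steps)
for entry in log:
    print(entry)
```

The log doubles as documentation: saving it next to the cleaned file shows how many rows each transformation removed.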

Ethical Considerations

  1. Privacy: Don’t use personally identifiable information inappropriately
  2. Bias: Be aware of dataset biases
  3. Citation: Always credit data sources
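
For the privacy point, a common precaution is to replace direct identifiers with keyed hashes before analysis. The sketch below uses HMAC-SHA256 from the standard library; the key, field names, and records are placeholders. Note that this is pseudonymization, not anonymization: other columns may still identify a person.

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

# Placeholder key; in practice, store it outside the dataset and the repository
key = b"replace-with-a-secret-kept-elsewhere"

# Hypothetical records containing an email address
records = [
    {"email": "Ana@example.com", "age": 34},
    {"email": "ana@example.com ", "age": 35},
]
safe = [{"user": pseudonymize(r["email"], key), "age": r["age"]} for r in records]

# The same person maps to the same token, so joins and group-bys still work
print(safe[0]["user"] == safe[1]["user"])
```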

For Machine Learning

  1. Start simple: Begin with UCI datasets
  2. Understand data: Explore before modeling
  3. Split properly: Train/validation/test splits
  4. Reproduce: Document preprocessing steps
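
A reproducible train/validation/test split needs only a fixed seed and a single shuffle. The following is a minimal standard-library sketch; the 80/10/10 proportions and the seed are assumptions, and libraries such as scikit-learn offer equivalent helpers with stratification.

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle a copy of `items` with a fixed seed and cut it into three parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Because the three slices come from one shuffled list, no example can appear in more than one split, and rerunning with the same seed returns the same partition.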

Programmatic Data Access

Modern data science requires automated data ingestion. Here are patterns for accessing the major sources programmatically:

World Bank API

import requests

def get_world_bank_indicator(indicator: str, country: str = "all") -> list[dict]:
    """Fetch the first page of an indicator series from the World Bank API."""
    url = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    params = {"format": "json", "per_page": 100, "date": "2000:2025"}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    payload = response.json()
    # A successful response is [metadata, rows]; error responses carry only a message
    if len(payload) < 2 or payload[1] is None:
        return []
    return payload[1]

# Example: GDP per capita (PPP, constant international $) for all countries
gdp_data = get_world_bank_indicator("NY.GDP.PCAP.PP.KD")
for entry in gdp_data[:5]:
    print(f"{entry['country']['value']} ({entry['date']}): {entry['value']}")
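
The function above requests a single page of 100 rows, while a query for all countries over 25 years returns far more. The sketch below shows one way to follow the API's pagination, assuming the metadata element reports the total page count under "pages" and that a "page" query parameter selects the page. The `fetch_page` and `fetch_all` names are illustrative, and the fetcher is injectable so the loop can be exercised without network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.worldbank.org/v2"

def fetch_page(indicator, country, page, per_page=100, date="2000:2025"):
    """Fetch one page of results and return the parsed JSON payload."""
    query = urlencode({"format": "json", "per_page": per_page,
                       "date": date, "page": page})
    url = f"{BASE_URL}/country/{country}/indicator/{indicator}?{query}"
    with urlopen(url, timeout=30) as response:
        return json.load(response)

def fetch_all(indicator, country="all", fetch=fetch_page):
    """Collect rows from every page, using the page count in the metadata."""
    rows, page = [], 1
    while True:
        payload = fetch(indicator, country, page)
        # Assumed shape: [metadata, rows]; anything else is treated as "no data"
        if not isinstance(payload, list) or len(payload) < 2 or payload[1] is None:
            break
        meta, data = payload
        rows.extend(data)
        if page >= int(meta.get("pages", 1)):
            break
        page += 1
    return rows

# Usage (performs real HTTP requests):
# all_rows = fetch_all("NY.GDP.PCAP.PP.KD")
# print(len(all_rows))
```

For large pulls, raising `per_page` reduces the number of requests; adding a short pause between pages is a courteous default.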

FRED Economic Data

import os

from fredapi import Fred

# Read the API key from the environment instead of hard-coding it
fred = Fred(api_key=os.environ["FRED_API_KEY"])

# Fetch unemployment rate
unemployment = fred.get_series("UNRATE", observation_start="2020-01-01")
print(unemployment.head())

# Search for series
search = fred.search("GDP", limit=5)
print(search[["title", "id", "frequency"]])

Hugging Face Dataset Search and Loading

from datasets import get_dataset_config_names, load_dataset

# List the configurations a dataset offers
configs = get_dataset_config_names("squad")
print(f"SQuAD configs: {configs}")

# Load SQuAD 2.0 and convert it to a pandas DataFrame
dataset = load_dataset("squad_v2", split="train")
df = dataset.to_pandas()
print(f"Loaded {len(df)} examples")

# Filter for questions containing "why"
why_questions = df[df["question"].str.contains("why", case=False)]
print(f"Why questions: {len(why_questions)}")

Kaggle API

# Install: pip install kagglehub
import os

import kagglehub

# Download latest version of a dataset
path = kagglehub.dataset_download("datasnaek/youtube-new")
print(f"Dataset downloaded to: {path}")

# List files
for file in os.listdir(path):
    print(file)

Data Versioning with DVC

DVC (Data Version Control) brings Git-like versioning to datasets, essential for reproducible ML:

# Initialize DVC
pip install dvc
dvc init

# Track a dataset
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc data/raw/.gitignore
git commit -m "Add raw dataset"

# Switch between versions
git checkout <commit-hash>
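dvc checkout

By default, the .dvc file records an MD5 hash of the tracked file. To spot-check a local copy against that value in Python (older DVC releases normalized line endings in text files before hashing, so a mismatch there does not always mean corruption):

import hashlib

def verify_dataset_integrity(filepath: str, expected_md5: str) -> bool:
    """Check that a file's MD5 matches the md5 value in its .dvc file."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        # Read in chunks so large datasets do not have to fit in memory
        for chunk in iter(lambda: f.read(1 << 20), b""):
            md5.update(chunk)
    return md5.hexdigest() == expected_md5

# Replace the placeholder with the md5 value from data/raw/dataset.csv.dvc
print(f"Dataset integrity: {verify_dataset_integrity('data/raw/dataset.csv', 'abc123...')}")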

DVC supports remote storage backends (S3, GCS, SSH) for sharing datasets across teams, with dvc push and dvc pull commands that mirror Git's workflow.


Data Quality Validation with Great Expectations

Automated data quality checks should run before any ML pipeline. The example below uses the legacy Pandas dataset API (ge.read_csv), which Great Expectations 1.0 and later no longer provide; newer releases use a context-based workflow instead:

import great_expectations as ge

# Load a CSV as a Great Expectations dataset (legacy API)
df = ge.read_csv("data/dataset.csv")

# Each expectation is evaluated immediately and returns a validation result
results = [
    df.expect_column_values_to_not_be_null("id"),
    df.expect_column_values_to_be_between("age", 0, 120),
    df.expect_column_values_to_be_in_set("category", ["A", "B", "C"]),
    df.expect_column_median_to_be_between("salary", 30000, 200000),
    df.expect_column_values_to_be_unique("email"),
]

# Summarize the outcome
passed = sum(1 for r in results if r["success"])
print(f"Passed: {passed}/{len(results)} expectations")

# Optionally persist the expectations for reuse
# df.save_expectation_suite("my_suite.json")

Data Quality Checklist

| Check | Tool/Method | Frequency |
|---|---|---|
| Missing values | Pandas .isnull().sum() | Every pipeline run |
| Type consistency | Pandas .dtypes | On schema change |
| Range validation | Custom bounds check | Per dataset version |
| Uniqueness constraints | Pandas .duplicated() | On data load |
| Distribution drift | Kolmogorov-Smirnov test | Weekly |
| Schema validation | Great Expectations | Every pipeline run |
| Freshness check | Compare max date to current | Daily |
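
The distribution drift row is the least self-explanatory item in the checklist. As a rough illustration, the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of a reference sample and a new sample. The sketch below computes only that statistic in plain Python; the sample values and the 0.3 alert threshold are arbitrary assumptions, and scipy.stats.ks_2samp is the usual choice when a p-value is needed.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The largest gap always occurs at one of the observed values
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Hypothetical feature values from last month and this week
reference = [1.0, 2.0, 3.0, 4.0]
current = [3.0, 4.0, 5.0, 6.0]

stat = ks_statistic(reference, current)
print(f"KS statistic: {stat:.2f}")
if stat > 0.3:  # arbitrary threshold chosen for this example
    print("Possible distribution drift; inspect the feature before retraining")
```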

Synthetic Data Generation

When public data is insufficient or privacy-restricted, synthetic data fills the gap:

from faker import Faker
import pandas as pd
import random

fake = Faker()
Faker.seed(42)
random.seed(42)  # also seed the stdlib generator used for age, income, is_active

def generate_customer_data(n: int = 1000) -> pd.DataFrame:
    """Generate synthetic customer records."""
    data = []
    for _ in range(n):
        data.append({
            "customer_id": fake.uuid4(),
            "name": fake.name(),
            "email": fake.email(),
            "age": random.randint(18, 80),
            "income": round(random.gammavariate(alpha=5, beta=10000), 2),
            "signup_date": fake.date_between(start_date="-3y", end_date="today"),
            "country": fake.country(),
            "is_active": random.random() > 0.2,
        })
    return pd.DataFrame(data)

df = generate_customer_data(5000)
print(f"Generated {len(df)} synthetic records")
print(df.describe())

Advanced Synthetic Data with SDV

# pip install sdv
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.datasets.demo import download_demo

# Load real data and train synthesizer
real_data, metadata = download_demo(
    modality="single_table",
    dataset_name="fake_hotel_guests"
)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)

# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=500)

# Evaluate quality
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data, synthetic_data, metadata
)
print(f"Quality score: {quality_report.get_score():.2f}")

Use Cases for Synthetic Data

| Use Case | Approach | Tool |
|---|---|---|
| Privacy-preserving ML | Differential privacy + generation | SDV, Gretel.ai |
| Class imbalance | SMOTE, GAN-based oversampling | imbalanced-learn (SMOTE), CTGAN |
| Data augmentation | Rule-based transformation | Albumentations, nlpaug |
| Testing/CI pipelines | Deterministic generation | Faker, factory_boy |
| Stress testing | Edge case generation | Hypothesis, custom rules |
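
To make the class imbalance row concrete, the core idea behind SMOTE is to create new minority-class points by interpolating between an existing point and one of its nearest minority neighbors. The sketch below is a simplified, SMOTE-style illustration in plain Python, not the reference algorithm or the imbalanced-learn implementation; the function name, the sample points, and the parameters are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        t = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, neighbor)))
    return synthetic

# Hypothetical two-feature minority-class samples
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_like(minority, n_new=5, k=2, seed=1)
for point in new_points:
    print(point)
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the region the minority class already occupies. Oversampling should be applied to the training split only.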

Modern Data Collection Tools (2026)

| Tool | Type | Best For | Pricing |
|---|---|---|---|
| Apify | Web scraping | Large-scale structured data extraction | Pay-per-use |
| Bright Data | Proxy + scraping | Enterprise web data collection | Subscription |
| Common Crawl | Web crawl archive | Historical web data for NLP | Free |
| Zyte | Scraping API | Reliable data extraction | Per-record |
| Diffbot | Knowledge graph | Structured entity extraction | API credits |
| Octoparse | Visual scraping | Non-technical users | Freemium |

Web Scraping with Python

import requests
from bs4 import BeautifulSoup
import pandas as pd

def scrape_table(url: str, table_class: str) -> pd.DataFrame:
    """Scrape the first HTML table with the given class into a DataFrame."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return pd.DataFrame()  # no matching table on the page

    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])
        rows.append([cell.get_text(strip=True) for cell in cells])

    return pd.DataFrame(rows[1:], columns=rows[0]) if len(rows) > 1 else pd.DataFrame()

# Example: scrape a Wikipedia table
# df = scrape_table("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)", "wikitable")
# print(df.head())
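
Before scraping a site, it is good practice to check its robots.txt rules and to pace requests. The sketch below uses the standard library's urllib.robotparser on robots.txt text that has already been downloaded; the rules, URLs, and the "my-research-bot" user agent are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed_urls(parser, user_agent, urls):
    """Return only the URLs that the rules allow this user agent to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]

# Hypothetical robots.txt content
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = build_robot_rules(EXAMPLE_ROBOTS)
urls = [
    "https://example.com/data/table.html",
    "https://example.com/private/report.html",
]
# Use the site's requested delay, falling back to one second if none is given
delay = rules.crawl_delay("my-research-bot") or 1

for url in allowed_urls(rules, "my-research-bot", urls):
    # With real requests, pause `delay` seconds between fetches (time.sleep)
    print(f"OK to fetch {url}; wait {delay}s before the next request")
```

For a live site, `RobotFileParser.set_url` followed by `read()` downloads and parses the rules directly. A site's terms of service may impose limits beyond robots.txt.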

Conclusion

These sources provide a great starting point for data analysis projects. Whether you’re learning data science, building machine learning models, or conducting research, quality data is essential.

Explore these resources to fuel your learning and innovation! Start with well-known sources like Kaggle and UCI, then branch into specialized domains as needed.
