Data is the lifeblood of data science and machine learning. Accessing high-quality, public datasets is crucial for practice, research, and building models. Below is a curated list of reliable online data sources, categorized for ease of use.
Why Public Data Matters
Public datasets enable:
- Learning: Practice data analysis and visualization skills
- Research: Reproduce and build upon existing studies
- Machine Learning: Train and validate models
- Portfolio Projects: Build impressive data science portfolios
- Competition: Participate in data science competitions
Global Development and Statistics
Gapminder
- Gapminder Data Browser
- Gapminder collects global development statistics
- Covers human development, health, education, economy
- Perfect for visualizing trends over time
- Hans Rosling’s famous visualizations use this data
World Bank
- World Bank Open Data
- Global economic, social, and environmental data
- 200+ countries, 50+ years of data
- GDP, population, trade, education indicators
- Free API access available
United Nations
- UN Data
- UN statistics on various topics
- Population, trade, environment
- Multiple file formats available
Government Data
United States
- Data.gov
- U.S. government open data portal
- Thousands of datasets across categories
- Agriculture, climate, energy, health
- U.S. Census Bureau
- Demographic data
- Economic surveys
- Geographic data
United Kingdom
- UK Data Service
- Social and economic data
- Large-scale surveys
European Union
- Eurostat
- EU statistics
- Economy, population, trade
Machine Learning Datasets
UCI Machine Learning Repository
- UCI ML Repository
- Classic datasets for machine learning research
- Includes: Iris, Wine, Adult, MNIST variants
- Well-documented, clean datasets
- Perfect for beginners
Kaggle
- Kaggle Datasets
- Vast collection of user-uploaded datasets
- Competition datasets
- Community ratings and documentation
- Covers every domain imaginable
Google Dataset Search
- Google Dataset Search
- Search engine for datasets across the web
- Indexes academic and government sources
- Finds datasets from various publishers
Hugging Face Datasets Hub
The Hugging Face Datasets library provides programmatic access to thousands of datasets with a unified API:
from datasets import load_dataset
# Load a dataset directly in Python
dataset = load_dataset("squad", split="train")
print(f"SQuAD: {len(dataset)} examples")
print(dataset[0])
# {'id': '5733be284776f41900661182', 'title': '...', 'context': '...', 'question': '...', 'answers': {...}}
# Stream large datasets without downloading fully
# (C4 is published on the Hub under the "allenai/c4" repository)
streamed = load_dataset("allenai/c4", "en", split="train", streaming=True)
for i, example in enumerate(streamed):
    if i >= 5:
        break
    print(f"Example {i}: {example['text'][:100]}...")
# Filter and transform with built-in methods
filtered = dataset.filter(lambda x: len(x["context"]) > 500)
print(f"Long context examples: {len(filtered)}")
# Map transformations: add a derived column to every example
def add_length(example):
    example["question_length"] = len(example["question"])
    return example
dataset = dataset.map(add_length)
| Dataset | Domain | Size | Use Case |
|---|---|---|---|
| SQuAD 2.0 | Reading comprehension | 150K Q&A | Question answering models |
| C4 | Web text | 750GB | LLM pretraining |
| GLUE/SuperGLUE | NLU | Various | Benchmark evaluation |
| Common Crawl | Web pages | Billions of pages | Large-scale pretraining |
| ImageNet | Computer vision | 14M images | Image classification |
| COCO | Vision + captions | 330K images | Object detection, captioning |
| LibriSpeech | Audio | 1000 hours | Speech recognition |
| Wikipedia | Text | 6M articles | Knowledge base, embeddings |
Curated Top ML Datasets for 2026
| Dataset | Type | Size | Best For | Source |
|---|---|---|---|---|
| FineWeb | Text | 15T tokens | LLM training | Hugging Face |
| DCLM-Baseline | Text | 4T tokens | LLM benchmarking | Hugging Face |
| OpenAssistant 2 | Conversations | 1M+ dialogues | Instruction tuning | LAION |
| MathQA | Mathematics | 37K problems | Math reasoning | |
| GSM8K | Math word problems | 8.5K problems | Arithmetic reasoning | OpenAI |
| HumanEval | Code | 164 problems | Code generation | OpenAI |
| SWE-bench | Software engineering | 2,294 tasks | Agentic coding | Princeton |
| Dolma | Text | 3T tokens | Language modeling | AI2 |
| RedPajama-V2 | Text | 30T tokens | LLM research | Together |
| Anthropic HH-RLHF | Preferences | 170K comparisons | RLHF training | Anthropic |
Specialized Data Sources
Finance and Economics
-
- Campaign contributions
- Lobbying data
- Political spending
- FRED (Federal Reserve Economic Data)
- U.S. economic data
- Interest rates, employment, GDP
-
- Stock market data
- Historical prices
- Company fundamentals
Entertainment
-
- Movie revenue data
- Budgets and performance metrics
- Film industry trends
-
- Movie and TV database
- Ratings, cast, crew
Sports
- FiveThirtyEight
- Sports data
- Election predictions
- Methodology available
Health
-
- Global health statistics
- Disease prevalence
- Health systems
-
- U.S. health data
- Disease tracking
- Public health statistics
Climate
-
- Climate change data
- Temperature records
- CO2 levels
-
- Weather data
- Oceanographic data
- Climate records
Social Data
Social Media
-
- Tweet data
- Sentiment analysis
- Trend tracking
-
- User-submitted datasets
- API access
Academic
- arXiv
- Preprints in physics, math, CS
- Full text available
-
- Academic paper data
- Citations
- Research trends
Data Portals by Country
China
- National Bureau of Statistics
- Official Chinese statistics
- English version available
Japan
- Statistics Bureau
- Japanese government statistics
India
- Open Government Data
- Indian government datasets
Tips for Using Public Data
Data Quality
- Check the source: Prefer official and well-documented sources
- Verify licensing: Some datasets have restrictions
- Check for updates: Data may be outdated
Data Processing
- Clean data: Handle missing values
- Validate: Cross-check with other sources
- Document: Track data transformations
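One lightweight way to follow the "Document" tip is to record what each cleaning step did to the row count. The sketch below is illustrative only: `run_with_log`, `drop_missing_age`, and `keep_adults` are hypothetical helper names, and the rows are made-up sample data.

```python
from typing import Callable

Row = dict
Step = tuple[str, Callable[[list[Row]], list[Row]]]

def run_with_log(rows: list[Row], steps: list[Step]) -> tuple[list[Row], list[dict]]:
    """Apply each step in order and record row counts before and after."""
    log = []
    for name, func in steps:
        before = len(rows)
        rows = func(rows)
        log.append({"step": name, "rows_in": before, "rows_out": len(rows)})
    return rows, log

def drop_missing_age(rows: list[Row]) -> list[Row]:
    # Remove rows where the age field is missing
    return [r for r in rows if r.get("age") is not None]

def keep_adults(rows: list[Row]) -> list[Row]:
    # Keep only rows with age 18 or above
    return [r for r in rows if r["age"] >= 18]

raw = [{"id": 1, "age": 34}, {"id": 2, "age": None}, {"id": 3, "age": 15}]
clean, log = run_with_log(
    raw,
    [("drop_missing_age", drop_missing_age), ("keep_adults", keep_adults)],
)
for entry in log:
    print(entry)
```

Saving the log next to the processed file gives a simple audit trail of which step removed which rows.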
Ethical Considerations
- Privacy: Don’t use personally identifiable information inappropriately
- Bias: Be aware of dataset biases
- Citation: Always credit data sources
For Machine Learning
- Start simple: Begin with UCI datasets
- Understand data: Explore before modeling
- Split properly: Train/validation/test splits
- Reproduce: Document preprocessing steps
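The "split properly" and "reproduce" tips can be combined in a seeded split. This is a minimal sketch using only the standard library; the function name `train_val_test_split` and the 80/10/10 proportions are assumptions chosen for illustration, not a recommendation for every dataset.

```python
import random

def train_val_test_split(items, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle with a fixed seed, then cut into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

A purely random split like this is not appropriate for time series or grouped data, where splitting by date or by group avoids leakage.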
Programmatic Data Access
Modern data science requires automated data ingestion. Here are patterns for accessing the major sources programmatically:
World Bank API
import requests
def get_world_bank_indicator(indicator: str, country: str = "all") -> list[dict]:
    """Fetch one page of indicator data from the World Bank API."""
    url = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    params = {"format": "json", "per_page": 100, "date": "2000:2025"}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    return response.json()[1]  # First element is metadata, second is the data
# Example: GDP per capita (PPP, constant international $), first 100 observations
gdp_data = get_world_bank_indicator("NY.GDP.PCAP.PP.KD")
for entry in gdp_data[:5]:
    print(f"{entry['country']['value']}: {entry['value']}")
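A single request returns only one page of results, so a query across all countries usually needs paging. The sketch below assumes the response is a two-element list of paging metadata and rows, that the metadata carries a `pages` count, and that the page is selected with a `page` query parameter. `collect_all_pages` and `fake_fetch` are hypothetical names; `fake_fetch` stands in for the HTTP call so the example runs offline.

```python
from typing import Callable

def collect_all_pages(fetch_page: Callable[[int], list]) -> list[dict]:
    """Collect rows from every page of a [metadata, rows] style response."""
    rows = []
    page = 1
    while True:
        meta, data = fetch_page(page)
        rows.extend(data or [])  # a page may carry no rows
        if page >= int(meta.get("pages", 1)):
            break
        page += 1
    return rows

def fake_fetch(page: int) -> list:
    # Stand-in for a real HTTP request: two pages holding three rows in total
    pages = {1: [{"value": 1}, {"value": 2}], 2: [{"value": 3}]}
    return [{"page": page, "pages": 2}, pages[page]]

all_rows = collect_all_pages(fake_fetch)
print(len(all_rows))
```

In real use, `fetch_page` would wrap `requests.get` with `page` added to the query parameters and return `response.json()`.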
FRED Economic Data
import pandas as pd
from fredapi import Fred
fred = Fred(api_key="your_fred_api_key")
# Fetch unemployment rate
unemployment = fred.get_series("UNRATE", observation_start="2020-01-01")
print(unemployment.head())
# Search for series
search = fred.search("GDP", limit=5)
print(search[["title", "id", "frequency"]])
Hugging Face Dataset Search and Loading
from datasets import get_dataset_config_names, load_dataset
# List the configurations available for a dataset
configs = get_dataset_config_names("squad")
print(f"SQuAD configs: {configs}")
# Load SQuAD 2.0 and convert it to a pandas DataFrame
dataset = load_dataset("squad_v2", split="train")
df = dataset.to_pandas()
print(f"Loaded {len(df)} examples")
# Filter for questions containing "why"
why_questions = df[df['question'].str.contains('why', case=False)]
print(f"Why questions: {len(why_questions)}")
Kaggle API
# Install: pip install kagglehub
import kagglehub
# Download latest version of a dataset
path = kagglehub.dataset_download("datasnaek/youtube-new")
print(f"Dataset downloaded to: {path}")
# List files
import os
for file in os.listdir(path):
    print(file)
Data Versioning with DVC
DVC (Data Version Control) brings Git-like versioning to datasets, essential for reproducible ML:
# Initialize DVC
pip install dvc
dvc init
# Track a dataset
dvc add data/raw/dataset.csv
git add data/raw/dataset.csv.dvc .gitignore
git commit -m "Add raw dataset"
# Switch between versions
git checkout <commit-hash>
dvc checkout
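# Verify a file against a known SHA-256 checksum
# Note: DVC itself records MD5 hashes in .dvc files by default, so this is an independent check
import hashlib
def verify_dataset_integrity(filepath: str, expected_hash: str) -> bool:
    """Return True if the file's SHA-256 digest matches expected_hash."""
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        # Read in chunks so large files are never loaded fully into memory
        for chunk in iter(lambda: f.read(4096), b""):
            sha256.update(chunk)
    actual_hash = sha256.hexdigest()
    return actual_hash == expected_hash
# Replace the placeholder hash with the checksum recorded for this file
print(f"Dataset integrity: {verify_dataset_integrity('data/raw/dataset.csv', 'abc123...')}")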
DVC supports remote storage backends (S3, GCS, SSH) for sharing datasets across teams, using dvc push and dvc pull commands that mirror git push and git pull.
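To compare a file with the hash stored in its .dvc file, the digest has to use the same algorithm. The sketch below assumes the .dvc file records an MD5 digest, which is the DVC default as far as I know; `file_md5` is a hypothetical helper, and older DVC versions may normalize line endings in text files before hashing, so values can differ there.

```python
import hashlib
import os
import tempfile

def file_md5(filepath: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Demonstrate on a temporary file so the sketch runs anywhere
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
digest = file_md5(tmp.name)
os.remove(tmp.name)
print(digest)
```

The printed value can then be compared by eye or in code with the `md5` entry listed under `outs` in the corresponding .dvc file.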
Data Quality Validation with Great Expectations
Automated data quality checks should run before any ML pipeline:
# Note: this uses the legacy Pandas-style API (ge.read_csv) found in older Great Expectations releases
import great_expectations as ge
# Load a dataset as a Great Expectations DataFrame
df = ge.read_csv("data/dataset.csv")
# Define expectations; in this API each call validates immediately and returns a result
results = [
    df.expect_column_values_to_not_be_null("id"),
    df.expect_column_values_to_be_between("age", 0, 120),
    df.expect_column_values_to_be_in_set("category", ["A", "B", "C"]),
    df.expect_column_median_to_be_between("salary", 30000, 200000),
    df.expect_column_values_to_be_unique("email"),
]
# Count the expectations that passed
passed = sum(1 for r in results if r["success"])
print(f"Passed: {passed}/{len(results)} expectations")
# Optionally persist the expectation suite for reuse (legacy API)
# df.save_expectation_suite("my_suite.json")
# HTML reports are produced through Data Docs; the exact calls depend on the installed version
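For small projects, or where installing a validation framework is not an option, the same kinds of checks can be approximated in plain Python. This is a rough sketch and not a replacement for Great Expectations: the `check_*` helpers are hypothetical names and the rows are invented sample data.

```python
def check_not_null(rows, column):
    """True if no row has a missing value in the column."""
    return all(r.get(column) is not None for r in rows)

def check_between(rows, column, low, high):
    """True if every non-missing value lies in [low, high]."""
    return all(low <= r[column] <= high for r in rows if r.get(column) is not None)

def check_in_set(rows, column, allowed):
    """True if every non-missing value is one of the allowed values."""
    return all(r[column] in allowed for r in rows if r.get(column) is not None)

def check_unique(rows, column):
    """True if non-missing values in the column never repeat."""
    values = [r[column] for r in rows if r.get(column) is not None]
    return len(values) == len(set(values))

rows = [
    {"id": 1, "age": 34, "category": "A", "email": "a@example.com"},
    {"id": 2, "age": 150, "category": "B", "email": "b@example.com"},
    {"id": 3, "age": 27, "category": "C", "email": "a@example.com"},
]
results = {
    "id not null": check_not_null(rows, "id"),
    "age in [0, 120]": check_between(rows, "age", 0, 120),
    "category in set": check_in_set(rows, "category", {"A", "B", "C"}),
    "email unique": check_unique(rows, "email"),
}
passed = sum(results.values())
print(f"Passed: {passed}/{len(results)} checks")
```

Here the age of 150 and the repeated email address are planted so that two of the four checks fail.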
Data Quality Checklist
| Check | Tool/Method | Frequency |
|---|---|---|
| Missing values | Pandas .isnull().sum() | Every pipeline run |
| Type consistency | Pandas .dtypes | On schema change |
| Range validation | Custom bounds check | Per dataset version |
| Uniqueness constraints | Pandas .duplicated() | On data load |
| Distribution drift | Kolmogorov-Smirnov test | Weekly |
| Schema validation | Great Expectations | Every pipeline run |
| Freshness check | Compare max date to current | Daily |
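The distribution drift row relies on the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical cumulative distribution functions of two samples. The sketch below computes only that statistic in pure Python, under the assumption that a full test with p-values would come from a statistics library; `ks_statistic` is a hypothetical name and the sample values are invented.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Return the maximum distance between the empirical CDFs of two samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    # Both ECDFs are step functions, so the largest gap occurs at an observed value
    for x in set(a_sorted) | set(b_sorted):
        fa = bisect_right(a_sorted, x) / len(a_sorted)
        fb = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(fa - fb))
    return d

reference = [1, 2, 3, 4]
current = [3, 4, 5, 6]
print(f"KS statistic: {ks_statistic(reference, current):.2f}")
```

A value near 0 means the two samples look alike, and a value near 1 means they barely overlap; the alert threshold is a project-specific choice.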
Synthetic Data Generation
When public data is insufficient or privacy-restricted, synthetic data fills the gap:
from faker import Faker
import pandas as pd
import random
fake = Faker()
Faker.seed(42)
random.seed(42)  # seed the standard library RNG too, so age and income are reproducible
def generate_customer_data(n: int = 1000) -> pd.DataFrame:
    """Generate synthetic customer records."""
    data = []
    for _ in range(n):
        data.append({
            "customer_id": fake.uuid4(),
            "name": fake.name(),
            "email": fake.email(),
            "age": random.randint(18, 80),
            "income": round(random.gammavariate(alpha=5, beta=10000), 2),
            "signup_date": fake.date_between(start_date="-3y", end_date="today"),
            "country": fake.country(),
            "is_active": random.random() > 0.2,
        })
    return pd.DataFrame(data)
df = generate_customer_data(5000)
print(f"Generated {len(df)} synthetic records")
print(df.describe())
Advanced Synthetic Data with SDV
# pip install sdv
from sdv.datasets.demo import download_demo
from sdv.evaluation.single_table import evaluate_quality
from sdv.single_table import GaussianCopulaSynthesizer
# Load a demo dataset (standing in for real data) and train the synthesizer
real_data, metadata = download_demo(
    modality="single_table",
    dataset_name="fake_hotel_guests"
)
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_data)
# Generate synthetic data
synthetic_data = synthesizer.sample(num_rows=500)
# Evaluate how closely the synthetic data matches the real data
quality_report = evaluate_quality(
    real_data, synthetic_data, metadata
)
print(f"Quality score: {quality_report.get_score():.2f}")
Use Cases for Synthetic Data
| Use Case | Approach | Tool |
|---|---|---|
| Privacy-preserving ML | Differential privacy + generation | SDV, Gretel.ai |
| Class imbalance | SMOTE, GAN-based oversampling | imbalanced-learn, CTGAN |
| Data augmentation | Rule-based transformation | Albumentations, nlpaug |
| Testing/CI pipelines | Deterministic generation | Faker, factory_boy |
| Stress testing | Edge case generation | Hypothesis, custom rules |
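For the testing and CI row, the key property is determinism: the same seed should always yield the same records. The sketch below shows the idea with the standard library only; `make_records` and its fields are hypothetical and chosen for illustration.

```python
import random

def make_records(n: int, seed: int = 0) -> list[dict]:
    """Generate n synthetic records that depend only on the seed."""
    rng = random.Random(seed)  # local generator, independent of global random state
    return [
        {"customer_id": i, "age": rng.randint(18, 80), "is_active": rng.random() > 0.2}
        for i in range(n)
    ]

first = make_records(5, seed=42)
second = make_records(5, seed=42)
print(first == second)
```

Using a local `random.Random` instance keeps fixtures stable even when other code reseeds or consumes the global generator.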
Modern Data Collection Tools (2026)
| Tool | Type | Best For | Pricing |
|---|---|---|---|
| Apify | Web scraping | Large-scale structured data extraction | Pay-per-use |
| Bright Data | Proxy + scraping | Enterprise web data collection | Subscription |
| Common Crawl | Web crawl archive | Historical web data for NLP | Free |
| Zyte | Scraping API | Reliable data extraction | Per-record |
| Diffbot | Knowledge graph | Structured entity extraction | API credits |
| Octoparse | Visual scraping | Non-technical users | Freemium |
Web Scraping with Python
import requests
from bs4 import BeautifulSoup
import pandas as pd
def scrape_table(url: str, table_class: str) -> pd.DataFrame:
    """Scrape the first HTML table with the given class into a DataFrame."""
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.content, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return pd.DataFrame()  # no matching table on the page
    rows = []
    for tr in table.find_all("tr"):
        cells = tr.find_all(["td", "th"])
        rows.append([cell.get_text(strip=True) for cell in cells])
    # The first row is treated as the header; assumes every row has the same number of cells
    return pd.DataFrame(rows[1:], columns=rows[0]) if len(rows) > 1 else pd.DataFrame()
# Example: scrape a Wikipedia table
# df = scrape_table("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)", "wikitable")
# print(df.head())
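Before scraping a site, it is worth checking whether its robots.txt permits the request. The sketch below uses the standard library's `urllib.robotparser` and feeds it rules directly so that it runs offline; the rules, the `my-scraper` user agent, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

def build_parser(robots_lines: list[str]) -> RobotFileParser:
    """Create a robots.txt parser from already-downloaded lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser

# Hypothetical rules: everything is allowed except paths under /private/
rules = ["User-agent: *", "Disallow: /private/"]
parser = build_parser(rules)

blocked = parser.can_fetch("my-scraper", "https://example.com/private/report.html")
allowed = parser.can_fetch("my-scraper", "https://example.com/public/table.html")
print(blocked, allowed)
```

Against a live site, `set_url()` followed by `read()` on the parser downloads the real robots.txt instead of using hand-written rules.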
Conclusion
These sources provide a great starting point for data analysis projects. Whether you’re learning data science, building machine learning models, or conducting research, quality data is essential.
Explore these resources to fuel your learning and innovation! Start with well-known sources like Kaggle and UCI, then branch into specialized domains as needed.