Web scraping—programmatically extracting data from websites—is invaluable for data collection, price monitoring, research, and competitive analysis. Rust’s performance, memory safety, and excellent HTTP libraries make it ideal for building scraping tools. This article covers techniques from basic HTML parsing to handling JavaScript-rendered content.
Why Rust for Web Scraping?
Compared to Python or JavaScript:
- Speed: compiled native code is typically much faster than interpreted Python or JavaScript for CPU-bound parsing
- Memory efficiency: Rust’s ownership model prevents memory leaks and bloat
- Concurrency: Handle thousands of concurrent requests safely
- Type safety: Catch parsing errors at compile time
- Single binary: Distribute without runtime dependencies
Setup and Dependencies
[dependencies]
reqwest = { version = "0.11", features = ["json", "cookies"] }
tokio = { version = "1", features = ["full"] }
scraper = "0.17"
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
regex = "1"
url = "2"
chrono = "0.4"
futures = "0.3"
rand = "0.8"
thiserror = "1"
Basic HTML Parsing
Simple GET Request
// filepath: src/basic_scraping.rs
use reqwest::Client;
use scraper::Html;
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let client = Client::new();
// Fetch a webpage
let response = client
.get("https://example.com")
.header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
.send()
.await?;
let body = response.text().await?;
// Parse HTML
let document = Html::parse_document(&body);
// Select elements
let selector = scraper::Selector::parse("h1").unwrap();
for element in document.select(&selector) {
println!("Title: {}", element.inner_html());
}
Ok(())
}
CSS Selector Extraction
use scraper::{Html, Selector};
use serde::Serialize;
#[derive(Debug, Serialize)]
pub struct Article {
pub title: String,
pub url: String,
pub date: String,
pub author: String,
}
pub fn scrape_articles(html: &str) -> Result<Vec<Article>, Box<dyn std::error::Error>> {
let document = Html::parse_document(html);
// More complex selectors
let article_selector = Selector::parse("article.post").unwrap();
let title_selector = Selector::parse("h2.title").unwrap();
let url_selector = Selector::parse("a.permalink").unwrap();
let date_selector = Selector::parse("time").unwrap();
let author_selector = Selector::parse(".author").unwrap();
let mut articles = Vec::new();
for article_elem in document.select(&article_selector) {
let title = article_elem
.select(&title_selector)
.next()
.map(|elem| elem.inner_html())
.unwrap_or_default();
let url = article_elem
.select(&url_selector)
.next()
.and_then(|elem| elem.value().attr("href"))
.unwrap_or_default()
.to_string();
let date = article_elem
.select(&date_selector)
.next()
.and_then(|elem| elem.value().attr("datetime"))
.unwrap_or_default()
.to_string();
let author = article_elem
.select(&author_selector)
.next()
.map(|elem| elem.inner_html())
.unwrap_or_default();
articles.push(Article {
title,
url,
date,
author,
});
}
Ok(articles)
}
Concurrent Scraping
Parallel Requests
// filepath: src/concurrent_scraping.rs
use reqwest::Client;
use futures::stream::{self, StreamExt};
pub async fn scrape_multiple_urls(
urls: Vec<&str>,
concurrency: usize,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
let client = Client::new();
let results = stream::iter(urls)
.map(|url| {
let client = client.clone();
async move {
client
.get(url)
.send()
.await
.ok()?
.text()
.await
.ok()
}
})
.buffered(concurrency)
.collect::<Vec<_>>()
.await;
Ok(results.into_iter().flatten().collect())
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let urls = vec![
"https://example.com/page1",
"https://example.com/page2",
"https://example.com/page3",
];
// Limit to 5 concurrent requests
let results = scrape_multiple_urls(urls, 5).await?;
println!("Scraped {} pages", results.len());
Ok(())
}
Rate Limiting
use std::time::Duration;
use tokio::time::{interval, sleep};
pub async fn scrape_with_rate_limit(
urls: Vec<&str>,
requests_per_second: u32,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
let client = reqwest::Client::new();
// Pace requests with a fixed-interval ticker: N requests per second
let mut ticker = interval(Duration::from_secs(1) / requests_per_second);
let mut results = Vec::new();
for url in urls {
// Wait for the next tick before sending
ticker.tick().await;
let response = client.get(url).send().await?;
results.push(response.text().await?);
println!("Fetched: {}", url);
}
Ok(results)
}
// Alternative: Manual rate limiting
pub async fn scrape_with_delay(
urls: Vec<&str>,
delay_ms: u64,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
let client = reqwest::Client::new();
let mut results = Vec::new();
for url in urls {
let response = client.get(url).send().await?;
results.push(response.text().await?);
// Wait between requests
sleep(Duration::from_millis(delay_ms)).await;
}
Ok(results)
}
Handling JavaScript-Rendered Content
Using Headless Browser (Chromium)
For JavaScript-heavy sites, you need a browser engine:
// filepath: src/js_rendering.rs
use headless_chrome::Browser;
pub fn scrape_javascript_content(url: &str) -> Result<String, Box<dyn std::error::Error>> {
// Launch headless Chrome
let browser = Browser::default()?;
let tab = browser.new_tab()?;
// Navigate to URL
tab.navigate_to(url)?;
// Wait for an element to appear, with a custom timeout
tab.wait_for_element_with_custom_timeout("body", std::time::Duration::from_secs(5))?;
// Get rendered HTML
let html = tab.get_content()?;
Ok(html)
}
fn main() -> Result<(), Box<dyn std::error::Error>> {
let html = scrape_javascript_content("https://example.com")?;
println!("Rendered HTML:\n{}", html);
Ok(())
}
Add to Cargo.toml:
headless_chrome = "1.0"
Puppeteer-style API with Chromium
use std::time::Duration;
pub fn scrape_with_actions(url: &str) -> Result<String, Box<dyn std::error::Error>> {
// More advanced: interact with the page (headless_chrome is synchronous)
let browser = headless_chrome::Browser::default()?;
let tab = browser.new_tab()?;
tab.navigate_to(url)?;
// Wait for dynamic content
tab.wait_for_element_with_custom_timeout(".dynamic-content", Duration::from_secs(10))?;
// Optionally: click elements, fill forms, etc.
// tab.find_element("button.load-more")?.click()?;
// tab.wait_for_element(".new-content")?;
let html = tab.get_content()?;
Ok(html)
}
Data Extraction Patterns
Structured Data Extraction
// filepath: src/data_extraction.rs
use scraper::{Html, Selector};
use serde::{Deserialize, Serialize};
use regex::Regex;
#[derive(Debug, Serialize, Deserialize)]
pub struct Product {
pub id: String,
pub name: String,
pub price: f64,
pub rating: f64,
pub in_stock: bool,
}
pub fn extract_products(html: &str) -> Result<Vec<Product>, Box<dyn std::error::Error>> {
let document = Html::parse_document(html);
let product_selector = Selector::parse("div.product-item")?;
let name_selector = Selector::parse("h3.product-name")?;
let price_selector = Selector::parse("span.price")?;
let rating_selector = Selector::parse("div.rating")?;
let stock_selector = Selector::parse("span.stock-status")?;
let id_regex = Regex::new(r"product-(\d+)")?;
let price_regex = Regex::new(r"\$([0-9.]+)")?;
let rating_regex = Regex::new(r"(\d+\.?\d*)/5")?;
let mut products = Vec::new();
for product_elem in document.select(&product_selector) {
// Extract ID from class or attribute
let id = product_elem
.value()
.attr("id")
.and_then(|id_str| {
id_regex.captures(id_str)
.and_then(|caps| caps.get(1))
.map(|m| m.as_str())
})
.unwrap_or("unknown")
.to_string();
// Extract name
let name = product_elem
.select(&name_selector)
.next()
.map(|elem| elem.inner_html())
.unwrap_or_default();
// Extract and parse price
let price = product_elem
.select(&price_selector)
.next()
.and_then(|elem| {
let text = elem.inner_html();
price_regex
.captures(&text)
.and_then(|caps| caps.get(1))
.and_then(|m| m.as_str().parse::<f64>().ok())
})
.unwrap_or(0.0);
// Extract rating
let rating = product_elem
.select(&rating_selector)
.next()
.and_then(|elem| {
let text = elem.inner_html();
rating_regex
.captures(&text)
.and_then(|caps| caps.get(1))
.and_then(|m| m.as_str().parse::<f64>().ok())
})
.unwrap_or(0.0);
// Extract stock status
let in_stock = product_elem
.select(&stock_selector)
.next()
.map(|elem| {
let text = elem.inner_html().to_lowercase();
text.contains("in stock") || text.contains("available")
})
.unwrap_or(false);
products.push(Product {
id,
name,
price,
rating,
in_stock,
});
}
Ok(products)
}
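One caveat with the price pattern above: `\$([0-9.]+)` stops at a thousands separator, so "$1,299.99" would parse as 1.0. A small stdlib-only helper that strips separators first can be more forgiving (a sketch; the function name is illustrative):

```rust
/// Parse a display price like "$1,299.99" into a float.
/// Returns None if the text contains no digits.
fn parse_price(text: &str) -> Option<f64> {
    // Keep only digits and the decimal point, dropping "$", commas, whitespace
    let cleaned: String = text
        .chars()
        .filter(|c| c.is_ascii_digit() || *c == '.')
        .collect();
    if cleaned.is_empty() {
        return None;
    }
    cleaned.parse::<f64>().ok()
}

fn main() {
    assert_eq!(parse_price("$1,299.99"), Some(1299.99));
    assert_eq!(parse_price("$5"), Some(5.0));
    assert_eq!(parse_price("Out of stock"), None);
    println!("price parsing ok");
}
```

Malformed strings such as "1.2.3" simply fail the final `parse` and come back as `None`, which keeps the scraper from recording garbage values.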
Advanced Techniques
Pagination Handling
// filepath: src/pagination.rs
use reqwest::Client;
use scraper::Html;
pub async fn scrape_paginated(
base_url: &str,
total_pages: usize,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
let client = Client::new();
let mut all_data = Vec::new();
for page in 1..=total_pages {
// Construct page URL
let url = if base_url.contains("?") {
format!("{}&page={}", base_url, page)
} else {
format!("{}?page={}", base_url, page)
};
println!("Scraping page {} of {}", page, total_pages);
let response = client.get(&url).send().await?;
let html = response.text().await?;
// Extract data
let document = Html::parse_document(&html);
// Store extracted data
// all_data.push(extract_page_data(&document)?);
// Polite delay
tokio::time::sleep(tokio::time::Duration::from_millis(500)).await;
}
Ok(all_data)
}
// Alternative: Auto-detect next page
pub async fn scrape_paginated_auto(
start_url: &str,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
let client = Client::new();
let mut all_data = Vec::new();
let mut current_url = start_url.to_string();
loop {
println!("Scraping: {}", current_url);
let response = client.get(&current_url).send().await?;
let html = response.text().await?;
let document = Html::parse_document(&html);
// Extract data from current page
// all_data.push(extract_page_data(&document)?);
// Find next page link
let next_selector = scraper::Selector::parse("a.next-page").unwrap();
match document
.select(&next_selector)
.next()
.and_then(|elem| elem.value().attr("href"))
{
Some(next_link) => {
current_url = if next_link.starts_with("http") {
next_link.to_string()
} else {
format!("{}{}", start_url.split('/').take(3).collect::<Vec<_>>().join("/"), next_link)
};
}
None => break, // No more pages
}
tokio::time::sleep(tokio::time::Duration::from_millis(500)).await;
}
Ok(all_data)
}
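The origin-joining trick above only covers root-relative links. Since the `url` crate is already a dependency, `url::Url::join` is the robust choice; for illustration, here is a stdlib-only sketch of the same idea that also handles links relative to the current page (names are illustrative):

```rust
/// Resolve a scraped href against the page it came from.
/// Simplified: handles absolute URLs, root-relative paths, and
/// page-relative paths. For full RFC 3986 resolution use `url::Url::join`.
fn resolve_link(page_url: &str, href: &str) -> String {
    if href.starts_with("http://") || href.starts_with("https://") {
        return href.to_string();
    }
    // Origin = scheme + "://" + host (the first three '/'-separated parts)
    let origin: String = page_url.split('/').take(3).collect::<Vec<_>>().join("/");
    if href.starts_with('/') {
        format!("{}{}", origin, href)
    } else {
        // Relative to the directory of the current page
        let base = page_url.rsplit_once('/').map(|(b, _)| b).unwrap_or(page_url);
        format!("{}/{}", base, href)
    }
}

fn main() {
    assert_eq!(
        resolve_link("https://example.com/news/page1", "/news/page2"),
        "https://example.com/news/page2"
    );
    assert_eq!(
        resolve_link("https://example.com/news/page1", "page2"),
        "https://example.com/news/page2"
    );
    println!("link resolution ok");
}
```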
Session and Cookie Handling
// filepath: src/sessions.rs
use reqwest::{Client, cookie::Jar};
use std::sync::Arc;
use url::Url;
pub async fn scrape_with_login(
login_url: &str,
username: &str,
password: &str,
target_url: &str,
) -> Result<String, Box<dyn std::error::Error>> {
// Create client with cookie jar
let jar = Arc::new(Jar::default());
let client = Client::builder()
.cookie_provider(jar.clone())
.build()?;
// Login first
let login_page = client.get(login_url).send().await?;
println!("Login page status: {}", login_page.status());
// Submit login form
let params = [("username", username), ("password", password)];
let response = client
.post(login_url)
.form(&params)
.send()
.await?;
println!("Login response status: {}", response.status());
// Now access protected content
let target_response = client.get(target_url).send().await?;
let html = target_response.text().await?;
Ok(html)
}
Proxy and User-Agent Rotation
// filepath: src/proxies.rs
use reqwest::Client;
use rand::Rng;
const USER_AGENTS: &[&str] = &[
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0",
];
pub fn get_random_user_agent() -> &'static str {
let mut rng = rand::thread_rng();
let idx = rng.gen_range(0..USER_AGENTS.len());
USER_AGENTS[idx]
}
pub async fn scrape_with_rotation(
urls: Vec<&str>,
proxies: Vec<&str>,
) -> Result<Vec<String>, Box<dyn std::error::Error>> {
let mut rng = rand::thread_rng();
let mut results = Vec::new();
for url in urls {
// Rotate user agent
let user_agent = get_random_user_agent();
// Rotate proxy
let proxy = proxies[rng.gen_range(0..proxies.len())];
let client = Client::builder()
.user_agent(user_agent)
.proxy(reqwest::Proxy::https(proxy)?)
.build()?;
let response = client.get(url).send().await?;
results.push(response.text().await?);
println!("Fetched {} with UA: {}", url, user_agent);
}
Ok(results)
}
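Random selection can hand out the same user agent several times in a row. A deterministic round-robin rotation spreads requests evenly across the pool; a stdlib-only sketch using an atomic counter (so it stays correct across concurrent tasks):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

const USER_AGENTS: &[&str] = &[
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
];

static NEXT_UA: AtomicUsize = AtomicUsize::new(0);

/// Return the next user agent in round-robin order.
/// The atomic fetch_add makes this safe to call from many tasks at once.
fn next_user_agent() -> &'static str {
    let idx = NEXT_UA.fetch_add(1, Ordering::Relaxed);
    USER_AGENTS[idx % USER_AGENTS.len()]
}

fn main() {
    // Consecutive calls cycle through the pool in order
    assert_ne!(next_user_agent(), next_user_agent());
    println!("rotation ok");
}
```

The same counter pattern works for rotating proxies, keeping each proxy's request rate predictable instead of bursty.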
Complete Scraping Project
// filepath: src/main.rs
use reqwest::Client;
use scraper::{Html, Selector};
use serde::{Deserialize, Serialize};
use std::time::Duration;
use tokio::time::sleep;
#[derive(Debug, Serialize, Deserialize)]
struct NewsItem {
title: String,
link: String,
date: String,
}
pub struct Scraper {
client: Client,
base_url: String,
}
impl Scraper {
pub fn new(base_url: &str) -> Self {
let client = Client::builder()
.user_agent("Mozilla/5.0 (Rust Web Scraper)")
.timeout(Duration::from_secs(10))
.build()
.expect("Failed to create client");
Scraper {
client,
base_url: base_url.to_string(),
}
}
pub async fn scrape_news(&self) -> Result<Vec<NewsItem>, Box<dyn std::error::Error>> {
let response = self.client.get(&self.base_url).send().await?;
let html = response.text().await?;
let document = Html::parse_document(&html);
let item_selector = Selector::parse("article.news-item").unwrap();
let title_selector = Selector::parse("h2.title").unwrap();
let link_selector = Selector::parse("a.link").unwrap();
let date_selector = Selector::parse("time").unwrap();
let mut items = Vec::new();
for item in document.select(&item_selector) {
let title = item
.select(&title_selector)
.next()
.map(|e| e.inner_html())
.unwrap_or_default();
let link = item
.select(&link_selector)
.next()
.and_then(|e| e.value().attr("href"))
.unwrap_or_default()
.to_string();
let date = item
.select(&date_selector)
.next()
.and_then(|e| e.value().attr("datetime"))
.unwrap_or_default()
.to_string();
items.push(NewsItem { title, link, date });
}
Ok(items)
}
}
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
let scraper = Scraper::new("https://example.com/news");
let items = scraper.scrape_news().await?;
for item in items {
println!("{} - {} ({})", item.title, item.link, item.date);
}
Ok(())
}
Ethical Scraping Guidelines
Respecting Robots.txt
pub async fn check_robots_txt(domain: &str) -> Result<String, Box<dyn std::error::Error>> {
let client = reqwest::Client::new();
let response = client
.get(&format!("https://{}/robots.txt", domain))
.send()
.await?;
let content = response.text().await?;
println!("robots.txt content:\n{}", content);
Ok(content)
}
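Fetching robots.txt is only half the job; the scraper also has to honor it. A minimal, stdlib-only check of `Disallow` rules in the `User-agent: *` group (a sketch only — a production parser should also handle `Allow`, wildcards, case-insensitive field names, and agent-specific groups):

```rust
/// Return true if `path` is allowed for `User-agent: *`
/// according to the given robots.txt body.
fn is_allowed(robots_txt: &str, path: &str) -> bool {
    let mut in_star_group = false;
    for line in robots_txt.lines() {
        // Strip comments and surrounding whitespace
        let line = line.split('#').next().unwrap_or("").trim();
        if let Some(agent) = line.strip_prefix("User-agent:") {
            in_star_group = agent.trim() == "*";
        } else if in_star_group {
            if let Some(rule) = line.strip_prefix("Disallow:") {
                let rule = rule.trim();
                // An empty Disallow line means everything is allowed
                if !rule.is_empty() && path.starts_with(rule) {
                    return false;
                }
            }
        }
    }
    true
}

fn main() {
    let robots = "User-agent: *\nDisallow: /admin\nDisallow: /private/\n";
    assert!(!is_allowed(robots, "/admin/users"));
    assert!(is_allowed(robots, "/blog/post-1"));
    println!("robots check ok");
}
```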
Best Practices
- Check robots.txt before scraping
- Respect User-Agent rules in robots.txt
- Use appropriate delays between requests (500ms-1s)
- Identify yourself with a descriptive User-Agent
- Don’t overload servers - use connection pooling
- Cache responses to avoid repeated requests
- Handle 429 (Too Many Requests) gracefully
- Check Terms of Service before scraping
// Respectful scraper
pub async fn respectful_scraper(url: &str) -> Result<String, Box<dyn std::error::Error>> {
let client = reqwest::Client::builder()
.user_agent("MyBot/1.0 (+https://example.com/bot)")
.timeout(Duration::from_secs(10))
.build()?;
let mut response = client.get(url).send().await?;
// Check for rate limiting; back off and retry once
if response.status() == 429 {
eprintln!("Rate limited! Waiting before retry...");
sleep(Duration::from_secs(60)).await;
response = client.get(url).send().await?;
}
Ok(response.text().await?)
}
Performance Optimization
Caching Responses
use std::collections::HashMap;
use std::time::{Duration, Instant};
pub struct Cache {
data: HashMap<String, (String, Instant)>,
ttl: Duration,
}
impl Cache {
pub fn new(ttl_secs: u64) -> Self {
Cache {
data: HashMap::new(),
ttl: Duration::from_secs(ttl_secs),
}
}
pub fn get(&self, key: &str) -> Option<String> {
self.data.get(key).and_then(|(value, time)| {
if time.elapsed() < self.ttl {
Some(value.clone())
} else {
None
}
})
}
pub fn insert(&mut self, key: String, value: String) {
self.data.insert(key, (value, Instant::now()));
}
}
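To see the TTL logic in action, here is a standalone sketch (it redeclares a trimmed-down copy of the cache so the example compiles on its own):

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Trimmed-down copy of the cache above so this example stands alone
struct Cache {
    data: HashMap<String, (String, Instant)>,
    ttl: Duration,
}

impl Cache {
    fn new(ttl: Duration) -> Self {
        Cache { data: HashMap::new(), ttl }
    }
    fn get(&self, key: &str) -> Option<String> {
        self.data.get(key).and_then(|(value, time)| {
            // Only return entries younger than the TTL
            (time.elapsed() < self.ttl).then(|| value.clone())
        })
    }
    fn insert(&mut self, key: String, value: String) {
        self.data.insert(key, (value, Instant::now()));
    }
}

fn main() {
    // Fresh entries are returned
    let mut cache = Cache::new(Duration::from_secs(60));
    cache.insert("https://example.com".into(), "<html>…</html>".into());
    assert!(cache.get("https://example.com").is_some());

    // With a zero TTL, every entry is already stale
    let mut stale = Cache::new(Duration::ZERO);
    stale.insert("k".into(), "v".into());
    assert!(stale.get("k").is_none());
    println!("cache ok");
}
```

Note that expired entries are skipped but never evicted; a long-running scraper would want a periodic sweep or an LRU bound on top of this.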
Error Handling
use thiserror::Error;
#[derive(Error, Debug)]
pub enum ScraperError {
#[error("Network error: {0}")]
NetworkError(#[from] reqwest::Error),
#[error("Parse error: {0}")]
ParseError(String),
#[error("Not found: {0}")]
NotFound(String),
#[error("Rate limited, retry after {0} seconds")]
RateLimited(u64),
}
pub async fn robust_fetch(url: &str) -> Result<String, ScraperError> {
let client = reqwest::Client::new();
let response = client.get(url)
.send()
.await
.map_err(ScraperError::NetworkError)?;
match response.status() {
reqwest::StatusCode::OK => {
response.text().await.map_err(ScraperError::NetworkError)
}
reqwest::StatusCode::TOO_MANY_REQUESTS => {
Err(ScraperError::RateLimited(60))
}
reqwest::StatusCode::NOT_FOUND => {
Err(ScraperError::NotFound(url.to_string()))
}
status => {
Err(ScraperError::ParseError(format!("HTTP {}", status)))
}
}
}
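When `RateLimited` or a transient network error comes back, retrying immediately just burns the request budget. A common policy is exponential backoff with a cap; the delay schedule itself is plain arithmetic (a sketch — the constants are illustrative, not prescribed by any API):

```rust
use std::time::Duration;

/// Delay before the nth retry (0-based): base * 2^n, capped at `max`.
/// Saturating ops prevent overflow for large attempt counts.
fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    let exp = base.saturating_mul(2u32.saturating_pow(attempt));
    exp.min(max)
}

fn main() {
    let base = Duration::from_millis(500);
    let max = Duration::from_secs(30);
    // Schedule: 0.5s, 1s, 2s, 4s, ... capped at 30s
    assert_eq!(backoff_delay(0, base, max), Duration::from_millis(500));
    assert_eq!(backoff_delay(2, base, max), Duration::from_secs(2));
    assert_eq!(backoff_delay(10, base, max), Duration::from_secs(30));
    println!("backoff ok");
}
```

A retry loop around `robust_fetch` would sleep for `backoff_delay(attempt, …)` after each `RateLimited` result; adding a little random jitter on top helps avoid synchronized retry storms.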
Further Resources
Libraries
- reqwest: HTTP client - https://crates.io/crates/reqwest
- scraper: CSS selector parsing - https://crates.io/crates/scraper
- headless_chrome: Browser control - https://crates.io/crates/headless_chrome
- regex: Pattern matching - https://crates.io/crates/regex
- serde: Serialization - https://crates.io/crates/serde
Tools
- curl: Test URLs
- Browser DevTools: Inspect HTML structure
- Postman: Test API endpoints
- Charles Proxy: Monitor network traffic
Reading
- Web Scraping Best Practices: https://www.imperva.com/learn/application-security/web-scraping-techniques/
- Robots.txt Specification: https://www.robotstxt.org/
- RESTful API Design: https://restfulapi.net/
Scraping Checklist
- Check robots.txt and Terms of Service
- Use appropriate User-Agent headers
- Implement rate limiting/delays
- Handle errors gracefully
- Respect server load
- Cache responses when possible
- Handle JavaScript rendering if needed
- Test with real data before full run
- Monitor rate limiting (429 responses)
- Log progress and errors
- Document data sources
Conclusion
Rust is excellent for web scraping because of its:
- Performance - Handle large datasets efficiently
- Safety - Memory-safe concurrent scraping
- Control - Low-level HTTP and parsing control
- Concurrency - Scale to thousands of parallel requests
- Reliability - Robust error handling
Always scrape responsibly and ethically!
Related Articles
- HTTP fundamentals
- Async/Await in Rust - Concurrent request handling
- Error Handling in Rust - Robust error management
- Working with JSON - Parse API responses
Scrape responsibly! 🕷️