Introduction
Web scraping is easy to start and hard to scale. A one-file script built on requests may work for a single page, but real scraping workloads need retries, deduplication, rate limiting, an anti-blocking strategy, data quality checks, and maintainable pipelines. See Python Guide for more context.
This guide covers a production-minded scraping stack in Python using Scrapy and Splash.
Choose the Right Tool by Complexity
- Small static pages: requests + BeautifulSoup.
- Medium crawls with pagination and structure: Scrapy.
- JavaScript-heavy pages that require rendering: Scrapy + Splash, with browser automation as a fallback.
The key is to pick the minimum complexity your workload requires.
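For the simplest tier, a few lines are enough. A minimal sketch, assuming a static page whose listing uses article/h2 markup (the URL and selectors here are illustrative):

import requests
from bs4 import BeautifulSoup

# Fetch one static page; raise on HTTP errors instead of parsing an error page.
resp = requests.get("https://example.com/blog", timeout=10)
resp.raise_for_status()

# Parse with the standard-library HTML parser (no extra dependency).
soup = BeautifulSoup(resp.text, "html.parser")
titles = [h2.get_text(strip=True) for h2 in soup.select("article h2")]
print(titles)

If you find yourself wrapping a script like this in pagination loops, retry logic, and queues, that is the signal to move up to Scrapy.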
Scrapy Architecture Refresher
Core components:
- Spider: defines where to crawl and how to parse.
- Scheduler: queues requests.
- Downloader middleware: modifies requests and responses.
- Item pipeline: validates, cleans, deduplicates, stores.
- Engine: coordinates everything.
When a scraper grows, design around these boundaries early; the wiring sketch below shows where each piece is registered.
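As an illustration of those boundaries, a minimal settings.py fragment; the component names are hypothetical placeholders (ValidatePipeline and RotateUserAgentMiddleware are sketched later in this guide):

# Downloader middlewares: lower numbers run earlier on outgoing requests.
DOWNLOADER_MIDDLEWARES = {
    "news_spider.middlewares.RotateUserAgentMiddleware": 543,
}

# Item pipelines: lower numbers run first on each scraped item.
ITEM_PIPELINES = {
    "news_spider.pipelines.ValidatePipeline": 300,
    "news_spider.pipelines.DuplicatesPipeline": 400,
}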
Basic Scrapy Project Skeleton
scrapy startproject news_spider
cd news_spider
scrapy genspider articles example.com
Typical structure:
news_spider/
    spiders/
    items.py
    pipelines.py
    middlewares.py
    settings.py
Practical Spider Example
import scrapy

class ArticleSpider(scrapy.Spider):
    name = "articles"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/blog"]

    def parse(self, response):
        # Extract one item per article card on the listing page.
        for card in response.css("article.card"):
            yield {
                "title": card.css("h2::text").get(default="").strip(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        # Follow pagination until no "next" link remains.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
XPath vs CSS Selectors
Use whichever is clearer for the target DOM.
XPath strengths
- Complex tree navigation.
- Conditional filtering.
- Parent/ancestor traversal.
CSS strengths
- Readability for common selectors.
- Short syntax.
- Easier onboarding for frontend-familiar developers.
XPath example: excluding code blocks
content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()
Handling JavaScript-Rendered Pages
Some pages return skeleton HTML and load data via JS.
Options:
- Call the underlying API directly if discoverable.
- Use Splash for lightweight JS rendering.
- Use Playwright/Selenium only when needed.
Splash workflow (a spider sketch follows this list):
- Run the Splash service.
- Add the Splash middleware.
- Use SplashRequest in the spider.
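A minimal sketch of the spider side, assuming the scrapy-splash package is installed and a Splash instance is listening on localhost:8050 (the URL, selector, and wait time are illustrative):

import scrapy
from scrapy_splash import SplashRequest

class RenderedSpider(scrapy.Spider):
    name = "rendered"

    def start_requests(self):
        # Render the page in Splash and wait briefly for JS to populate the DOM.
        yield SplashRequest(
            "https://example.com/app",
            callback=self.parse,
            args={"wait": 1.0},
        )

    def parse(self, response):
        # response.text now contains the rendered HTML.
        yield {"title": response.css("h1::text").get()}

You also need to set SPLASH_URL and register the scrapy-splash middlewares in settings.py, per that project's README.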
Anti-Blocking Basics
Ethical scraping is not stealthy abuse; even so, responsible crawlers need resilience.
Recommended controls:
- Respect robots.txt where applicable.
- Set conservative concurrency.
- Add download delays.
- Rotate the user-agent thoughtfully (see the middleware sketch after the settings example).
- Retry transient network failures.
Example settings:
ROBOTSTXT_OBEY = True        # honor robots.txt
CONCURRENT_REQUESTS = 8      # conservative global concurrency
DOWNLOAD_DELAY = 0.5         # seconds between requests to the same domain
RETRY_TIMES = 3              # retry transient failures
AUTOTHROTTLE_ENABLED = True  # adapt delay to observed server latency
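For the user-agent rotation mentioned above, a minimal downloader-middleware sketch; the class name and agent strings are illustrative, and the class would be registered in DOWNLOADER_MIDDLEWARES as in the wiring example earlier:

import random

class RotateUserAgentMiddleware:
    # Illustrative desktop UA strings; use a maintained list in production.
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]

    def process_request(self, request, spider):
        # Pick a User-Agent per request; returning None lets processing continue.
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None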
Data Cleaning and Validation Pipeline
Scraping quality is mostly pipeline quality.
Typical pipeline tasks:
- Normalize whitespace and encoding.
- Parse and standardize dates.
- Drop duplicates by URL/hash (see the sketch after the validation example).
- Validate required fields.
- Route invalid items to quarantine.
from scrapy.exceptions import DropItem

class ValidatePipeline:
    def process_item(self, item, spider):
        # Reject items missing required fields instead of storing bad rows.
        if not item.get("title") or not item.get("url"):
            raise DropItem("missing required fields")
        item["title"] = item["title"].strip()
        return item
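Deduplication fits the same pattern. A minimal sketch keyed on the item URL; note that the in-memory set only covers a single run in a single process, so incremental crawls need the keys persisted somewhere durable:

from scrapy.exceptions import DropItem

class DuplicatesPipeline:
    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = item.get("url")
        if key in self.seen:
            raise DropItem(f"duplicate item: {key}")
        self.seen.add(key)
        return item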
Storage Strategy
Start simple, evolve as needed.
- JSON/CSV for prototypes.
- PostgreSQL for structured analytics.
- Object storage for raw snapshots.
- Search index for retrieval workloads.
Avoid coupling spider logic directly to database writes in parse methods.
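One way to keep that separation is a dedicated storage pipeline. A sketch assuming psycopg2, a local database, and an articles table with a unique constraint on url (the DSN and schema are illustrative):

import psycopg2

class PostgresPipeline:
    def open_spider(self, spider):
        # Connect once per crawl, not once per item.
        self.conn = psycopg2.connect("dbname=scraping user=scraper")
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Idempotent insert: re-crawled URLs are skipped, not duplicated.
        self.cur.execute(
            "INSERT INTO articles (title, url) VALUES (%s, %s)"
            " ON CONFLICT (url) DO NOTHING",
            (item["title"], item["url"]),
        )
        return item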
Observability for Crawlers
Track crawler health like any production service.
Useful metrics:
- Requests per minute.
- Success/error ratio.
- Parse failure count.
- Duplicate ratio.
- Freshness lag.
Without metrics, you discover breakage too late.
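Scrapy's built-in stats collector is the cheapest place to start. A sketch of a pipeline that counts item outcomes (the custom/ key names are arbitrary); Scrapy dumps all collected stats to the log when the crawl finishes:

class MetricsPipeline:
    def process_item(self, item, spider):
        stats = spider.crawler.stats
        # Count complete vs. degraded items so alerts can fire on the ratio.
        if item.get("title") and item.get("url"):
            stats.inc_value("custom/items_complete")
        else:
            stats.inc_value("custom/items_incomplete")
        return item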
Legal and Ethical Boundaries
Always evaluate:
- Terms of service.
- robots directives.
- Jurisdictional data regulations.
- Copyright and redistribution limits.
- PII handling constraints.
Scraping is an engineering activity and a compliance activity.
Common Failure Modes
- Brittle selectors tied to CSS class names.
- No retry/backoff strategy.
- No deduplication keys.
- Unbounded concurrency causing blocks.
- No schema validation in pipeline.
Practical Project Checklist
- Define schema first.
- Build resilient selectors.
- Add retries and throttling.
- Add validation pipeline.
- Add monitoring and alerts.
- Document legal assumptions.
Conclusion
Scraping success is less about extracting one page and more about maintaining a reliable data collection system over time. Scrapy gives the right architecture for that journey, and Splash helps when JavaScript rendering is required.
Build for maintainability from day one: clean selectors, controlled crawl behavior, validated pipelines, and clear compliance boundaries.