
Web Scraping with Python: Scrapy, Splash, and Production Workflows

Created: April 24, 2026 · CalmOps

Introduction

Web scraping is easy to start and hard to scale. A one-file script with requests may work for one page, but real scraping workloads need retries, deduplication, rate limiting, anti-blocking strategy, data quality checks, and maintainable pipelines. See Python Guide for more context.

This guide covers a production-minded scraping stack in Python using Scrapy and Splash.

Choose the Right Tool by Complexity

Small static pages

Use requests + BeautifulSoup.

Medium crawl with pagination and structure

Use Scrapy.

JavaScript-heavy rendering required

Use Scrapy + Splash or browser automation fallback.

The key is to pick the minimum complexity your workload actually requires.
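
For the smallest tier, a short requests + BeautifulSoup script is usually enough. A minimal sketch, assuming a static page with article cards (the URL and selectors are illustrative):

import requests
from bs4 import BeautifulSoup

# Fetch a single static page; production code should also check status codes.
response = requests.get("https://example.com/blog", timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull the title and link out of each article card (selectors are assumptions).
for card in soup.select("article.card"):
    title = card.select_one("h2")
    link = card.select_one("a")
    if title and link:
        print(title.get_text(strip=True), link.get("href"))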

Scrapy Architecture Refresher

Core components:

  1. Spider: defines where to crawl and how to parse.
  2. Scheduler: queues requests.
  3. Downloader middleware: modifies requests and responses.
  4. Item pipeline: validates, cleans, deduplicates, stores.
  5. Engine: coordinates everything.

When scraping grows, design around these boundaries early.
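
One way to make those boundaries concrete early is to declare the item schema in items.py instead of yielding bare dicts. A minimal sketch (field names are illustrative):

import scrapy


class ArticleItem(scrapy.Item):
    # Declared fields document the schema and fail fast on misspelled keys.
    title = scrapy.Field()
    url = scrapy.Field()
    published_at = scrapy.Field()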

Basic Scrapy Project Skeleton

scrapy startproject news_spider
cd news_spider
scrapy genspider articles example.com

Typical structure:

news_spider/
  spiders/
  items.py
  pipelines.py
  middlewares.py
  settings.py

Practical Spider Example

import scrapy


class ArticleSpider(scrapy.Spider):
    name = "articles"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/blog"]

    def parse(self, response):
        for card in response.css("article.card"):
            yield {
                "title": card.css("h2::text").get(default="").strip(),
                "url": response.urljoin(card.css("a::attr(href)").get()),
            }

        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
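
To try the spider locally, run it from the project directory and let the feed exporter write the results (the -O flag overwrites the output file in Scrapy 2.1+):

scrapy crawl articles -O articles.json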

XPath vs CSS Selectors

Use whichever is clearer for the target DOM.

XPath strengths

  1. Complex tree navigation.
  2. Conditional filtering.
  3. Parent/ancestor traversal.

CSS strengths

  1. Readability for common selectors.
  2. Short syntax.
  3. Easier onboarding for frontend-familiar developers.

XPath example excluding code blocks

content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()
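
For the common case, the CSS form stays compact. A rough equivalent that grabs all text inside the article body (it cannot express the "exclude highlight divs" condition, which is exactly where XPath earns its keep):

content = response.css("div.article-content *::text").getall()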

Handling JavaScript-Rendered Pages

Some pages return skeleton HTML and load data via JS.

Options:

  1. Call the underlying API directly if discoverable.
  2. Use Splash for lightweight JS rendering.
  3. Use Playwright/Selenium only when needed.

Splash workflow:

  1. Run Splash service.
  2. Add Splash middleware.
  3. Use SplashRequest in spider.
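
A minimal sketch of that workflow with the scrapy-splash package, assuming a Splash instance at localhost:8050 (settings abbreviated; see the scrapy-splash documentation for the full middleware list):

# settings.py (abbreviated)
SPLASH_URL = "http://localhost:8050"
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

# in the spider
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        # "wait" gives client-side JS time to render before the HTML is returned
        yield SplashRequest(url, callback=self.parse, args={"wait": 1.0})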

Anti-Blocking Basics

Ethical scraping is not about stealth or abuse; even so, a responsible crawler needs resilience against transient failures and accidental blocks.

Recommended controls:

  1. Respect robots.txt where applicable.
  2. Set conservative concurrency.
  3. Add download delays.
  4. Rotate user-agent thoughtfully.
  5. Retry transient network failures.

Example settings:

ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 8
DOWNLOAD_DELAY = 0.5
RETRY_TIMES = 3
AUTOTHROTTLE_ENABLED = True
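
For user-agent rotation, one lightweight option is a small downloader middleware that picks a user agent per request. A hypothetical sketch (the class name and UA list are assumptions; register it in DOWNLOADER_MIDDLEWARES):

import random

# A short, honest list of real browser strings you are willing to identify as.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ...",
]


class RotateUserAgentMiddleware:
    def process_request(self, request, spider):
        # Downloader middleware hook: runs for every outgoing request.
        request.headers["User-Agent"] = random.choice(USER_AGENTS)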

Data Cleaning and Validation Pipeline

Scraping quality is mostly pipeline quality.

Typical pipeline tasks:

  1. Normalize whitespace and encoding.
  2. Parse and standardize dates.
  3. Drop duplicates by URL/hash.
  4. Validate required fields.
  5. Route invalid items to quarantine.

Example validation pipeline:

from scrapy.exceptions import DropItem


class ValidatePipeline:
    def process_item(self, item, spider):
        # Drop items that are missing required fields before they reach storage.
        if not item.get("title") or not item.get("url"):
            raise DropItem("missing required fields")
        item["title"] = item["title"].strip()
        return item
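
Deduplication follows the same pattern. A minimal sketch keyed on URL (in-memory only, so it resets between runs):

from scrapy.exceptions import DropItem


class DedupPipeline:
    def __init__(self):
        self.seen_urls = set()

    def process_item(self, item, spider):
        url = item.get("url")
        if url in self.seen_urls:
            raise DropItem(f"duplicate url: {url}")
        self.seen_urls.add(url)
        return item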

Storage Strategy

Start simple, evolve as needed.

  1. JSON/CSV for prototypes.
  2. PostgreSQL for structured analytics.
  3. Object storage for raw snapshots.
  4. Search index for retrieval workloads.

Avoid coupling spider logic directly to database writes in parse methods.
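
Scrapy's feed exports cover the prototype stage without touching parse methods; heavier backends belong in an item pipeline. A minimal settings sketch (the file path is illustrative):

# settings.py
FEEDS = {
    "exports/articles.jsonl": {
        "format": "jsonlines",
        "encoding": "utf8",
        "overwrite": True,
    },
}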

Observability for Crawlers

Track crawler health like any production service.

Useful metrics:

  1. Requests per minute.
  2. Success/error ratio.
  3. Parse failure count.
  4. Duplicate ratio.
  5. Freshness lag.

Without metrics, you discover breakage too late.
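
Scrapy's stats collector already tracks request and response counts, and custom counters can sit next to them. A small sketch that counts quality problems from a pipeline (the stat key names are assumptions):

from scrapy.exceptions import DropItem


class QualityStatsPipeline:
    def process_item(self, item, spider):
        # The crawler's stats collector appears in the end-of-crawl stats dump
        # and can be shipped to an external monitoring system.
        if not item.get("title"):
            spider.crawler.stats.inc_value("quality/missing_title")
            raise DropItem("missing title")
        spider.crawler.stats.inc_value("quality/items_ok")
        return item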

Legal and Compliance Considerations

Before scraping a target, always evaluate:

  1. Terms of service.
  2. robots directives.
  3. Jurisdictional data regulations.
  4. Copyright and redistribution limits.
  5. PII handling constraints.

Scraping is an engineering activity and a compliance activity.

Common Failure Modes

  1. Brittle selectors tied to CSS class names.
  2. No retry/backoff strategy.
  3. No deduplication keys.
  4. Unbounded concurrency causing blocks.
  5. No schema validation in pipeline.

Practical Project Checklist

  1. Define schema first.
  2. Build resilient selectors.
  3. Add retries and throttling.
  4. Add validation pipeline.
  5. Add monitoring and alerts.
  6. Document legal assumptions.

Conclusion

Scraping success is less about extracting one page and more about maintaining a reliable data collection system over time. Scrapy gives the right architecture for that journey, and Splash helps when JavaScript rendering is required.

Build for maintainability from day one: clean selectors, controlled crawl behavior, validated pipelines, and clear compliance boundaries.
