Web Scraping with Python using Scrapy and Splash

Python Web Spider

1. Introduction

What can you scrape?

  • Text
  • Images
  • Videos
  • Emails

For simple projects, requests and BeautifulSoup are sufficient, but for complex or large-scale scraping, use Scrapy.
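For the simple case, a sketch of the requests + BeautifulSoup approach (the HTML fragment and class names here are made up for illustration; in practice you would fetch the page with requests first):

```python
from bs4 import BeautifulSoup

# In a real project you would fetch the HTML over the network, e.g.:
#   import requests
#   html = requests.get("https://example.com/blog/post").text
html = """
<html><body>
  <h1>Post Title</h1>
  <div class="article-content"><p>First paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.h1.get_text()
paragraphs = [p.get_text() for p in soup.select("div.article-content p")]
print(title)       # Post Title
print(paragraphs)  # ['First paragraph.']
```

This works well for a handful of pages, but offers none of the scheduling, retrying, or pipeline machinery that Scrapy provides out of the box.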

Scrapy Components

  • Spiders: Define how to crawl and parse pages.
  • Pipelines: Clean, deduplicate, and store scraped data.
  • Middlewares: Modify requests/responses, inject custom headers, handle proxies, etc.
  • Engine: Coordinates all components.
  • Scheduler: Manages the order of requests.

Types of Spiders

  • scrapy.Spider
  • CrawlSpider
  • XMLFeedSpider
  • CSVFeedSpider
  • SitemapSpider

Pipelines

  • Data cleaning
  • Removing duplicates
  • Storing data (e.g., to database, file)

Middlewares

  • Modify requests and responses
  • Inject custom headers
  • Use proxies

Engine

  • Coordinates the data flow between spiders, pipelines, middlewares, and the scheduler.

Scheduler

  • Manages the queue of pending requests and the order in which they are dispatched.
  • Note: robots.txt rules (User-agent, Allow, Disallow) are enforced by Scrapy's RobotsTxtMiddleware when ROBOTSTXT_OBEY is enabled, not by the scheduler itself.
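A sketch of the politeness-related settings in a project's settings.py (the user-agent string and delay value are illustrative assumptions):

```python
# settings.py (fragment)
ROBOTSTXT_OBEY = True  # honor robots.txt Allow/Disallow rules
USER_AGENT = "MyScraper/1.0 (+https://example.com/bot)"  # identify your bot
DOWNLOAD_DELAY = 1.0   # seconds between requests to the same domain
```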

2. XPath Selectors

XPath is a query language for selecting nodes from XML or HTML documents.

  • XML Path Language
  • Uses path-like syntax
  • Operates on the tree structure of XML/HTML

Example: Extract the title and content of a blog page, excluding code blocks.

# Example URL: https://yongqiang.live/databases/m201-mongodb-performance-ch3-index-operations/
# Select all text under the article body, skipping <div class="highlight"> code blocks
content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()

Reference:
Exclude certain elements from selection in XPath (Stack Overflow)

Additional Tips

  • Use Splash with Scrapy for JavaScript-rendered pages.
  • Always respect website terms of service and robots.txt.
  • Use Scrapy’s logging and debugging tools for troubleshooting.
  • Store scraped data in formats like JSON, CSV, or databases using Scrapy pipelines.

References