Web Scraping with Python using Scrapy and Splash

Python Web Spider

1. Introduction

What can you scrape?

  • Text
  • Images
  • Videos
  • Emails

For simple projects, requests and BeautifulSoup are sufficient, but for complex or large-scale scraping, use Scrapy.
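For the simple case, a sketch of the requests + BeautifulSoup approach (the HTML fragment and class names here are made up for illustration; in practice you would fetch the page with requests first):

```python
from bs4 import BeautifulSoup

# In a real project you would fetch the HTML over the network, e.g.:
#   import requests
#   html = requests.get("https://example.com/blog/post").text
html = """
<html><body>
  <h1>Post Title</h1>
  <div class="article-content"><p>First paragraph.</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.h1.get_text()
paragraphs = [p.get_text() for p in soup.select("div.article-content p")]
print(title)       # Post Title
print(paragraphs)  # ['First paragraph.']
```

This works well for a handful of pages, but offers none of the scheduling, retrying, or pipeline machinery that Scrapy provides out of the box.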

Scrapy Components

  • Spiders: Define how to crawl and parse pages.
  • Pipelines: Clean, deduplicate, and store scraped data.
  • Middlewares: Modify requests/responses, inject custom headers, handle proxies, etc.
  • Engine: Coordinates all components.
  • Scheduler: Manages the order of requests.

Types of Spiders

  • scrapy.Spider
  • CrawlSpider
  • XMLFeedSpider
  • CSVFeedSpider
  • SitemapSpider

Pipelines

  • Data cleaning
  • Removing duplicates
  • Storing data (e.g., to database, file)

Middlewares

  • Modify requests and responses
  • Inject custom headers
  • Use proxies

Engine

  • Coordinates the data flow between spiders, pipelines, middlewares, and the scheduler.

Scheduler

  • Manages the queue of pending requests and the order in which they are dispatched.
  • Note: robots.txt rules (User-agent, Allow, Disallow) are enforced by Scrapy's RobotsTxtMiddleware when ROBOTSTXT_OBEY is enabled, not by the scheduler itself.
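A sketch of the politeness-related settings in a project's settings.py (the user-agent string and delay value are illustrative assumptions):

```python
# settings.py (fragment)
ROBOTSTXT_OBEY = True  # honor robots.txt Allow/Disallow rules
USER_AGENT = "MyScraper/1.0 (+https://example.com/bot)"  # identify your bot
DOWNLOAD_DELAY = 1.0   # seconds between requests to the same domain
```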

2. XPath Selectors

XPath is a query language for selecting nodes from XML or HTML documents.

  • XML Path Language
  • Uses path-like syntax
  • Operates on the tree structure of XML/HTML

Example: Extract the title and content of a blog page, excluding code blocks.

# Example URL: https://yongqiang.live/databases/m201-mongodb-performance-ch3-index-operations/
# Select all text under the article body, skipping <div class="highlight"> code blocks
content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()

Reference:
Exclude certain elements from selection in XPath (Stack Overflow)

Additional Tips

  • Use Splash with Scrapy for JavaScript-rendered pages.
  • Always respect website terms of service and robots.txt.
  • Use Scrapy’s logging and debugging tools for troubleshooting.
  • Store scraped data in formats like JSON, CSV, or databases using Scrapy pipelines.

References