1. Introduction
What can you scrape?
- Text
- Images
- Videos
- Emails
For simple projects, requests and BeautifulSoup are sufficient, but for complex or large-scale scraping, use Scrapy.
Scrapy Components
- Spiders: Define how to crawl and parse pages.
- Pipelines: Clean, deduplicate, and store scraped data.
- Middlewares: Modify requests/responses, inject custom headers, handle proxies, etc.
- Engine: Coordinates all components.
- Scheduler: Manages the order of requests.
Types of Spiders
- scrapy.Spider
- CrawlSpider
- XMLFeedSpider
- CSVFeedSpider
- SitemapSpider
Pipelines
- Data cleaning
- Removing duplicates
- Storing data (e.g., to database, file)
Middlewares
- Modify requests and responses
- Inject custom headers
- Use proxies
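Header injection is a typical downloader-middleware job. A sketch that rotates the User-Agent from a hypothetical pool (the UA strings are truncated placeholders):

```python
import random

class RotatingHeaderMiddleware:
    # Downloader middleware sketch: sets a User-Agent on each outgoing
    # request. The pool below is illustrative, not a real UA list.
    USER_AGENTS = [
        "Mozilla/5.0 (X11; Linux x86_64) ...",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    ]

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.USER_AGENTS)
        return None  # None means: continue normal processing
```

Middlewares are registered through the DOWNLOADER_MIDDLEWARES setting; proxy support works the same way by setting `request.meta["proxy"]`.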
Engine
- Coordinates spiders, pipelines, middlewares, and scheduler.
Scheduler
- Manages the queue of requests.
- Respects robots.txt (User-Agent, Allow, Disallow rules).
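In Scrapy, robots.txt compliance is enforced by a built-in downloader middleware and switched on in the project settings. ROBOTSTXT_OBEY, USER_AGENT, and DOWNLOAD_DELAY are real Scrapy settings; the values shown are illustrative:

```python
# settings.py (fragment)
ROBOTSTXT_OBEY = True   # fetch and honor each site's robots.txt
USER_AGENT = "mybot (+https://example.com/bot)"  # identify your crawler
DOWNLOAD_DELAY = 1.0    # seconds between requests to the same site
```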
2. XPath Selectors
XPath is a query language for selecting nodes from XML or HTML documents.
- XML Path Language
- Uses path-like syntax
- Operates on the tree structure of XML/HTML
Example: extract the text content of a blog article while excluding code blocks (the highlight divs).
```python
# Example URL: https://yongqiang.live/databases/m201-mongodb-performance-ch3-index-operations/
content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()
```
Reference:
Exclude certain elements from selection in XPath (Stack Overflow)
Additional Tips
- Use Splash with Scrapy for JavaScript-rendered pages.
- Always respect website terms of service and robots.txt.
- Use Scrapy’s logging and debugging tools for troubleshooting.
- Store scraped data in formats like JSON, CSV, or databases using Scrapy pipelines.
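A storage pipeline can be as small as a file writer. A sketch that appends each item as one JSON line, using the standard pipeline hooks `open_spider`, `process_item`, and `close_spider` (the output filename is illustrative):

```python
import json

class JsonWriterPipeline:
    # Storage pipeline sketch: one JSON object per line (JSON Lines).
    def open_spider(self, spider):
        self.file = open("items.jl", "w", encoding="utf8")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item  # pass the item on to any later pipelines
```

For JSON/CSV output without custom code, Scrapy’s built-in feed exports (the FEEDS setting) cover the common cases.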