1. Introduction
What to scrape?
- Text
- Images
- Videos
- Emails
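Requests and BeautifulSoup work well for small scripts, but they offer no built-in crawling, scheduling, or data pipelines, so they are a poor fit for complex projects.
Scrapy is a framework built from the following components: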
- Spiders
- Pipelines
- Middlewares
- Engine
- Scheduler
Spiders
- scrapy.Spider
- CrawlSpider
- XMLFeedSpider
- CSVFeedSpider
- SitemapSpider
Pipelines
- Cleaning the data
- Removing duplicates
- Storing data
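The three pipeline duties above can be sketched in one small class. This is only an illustration: the class name, the "url" field used as the duplicate key, and the in-memory list standing in for real storage are assumptions. The fallback DropItem class exists only so the sketch also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Fallback so the sketch also runs where Scrapy is not installed.
    class DropItem(Exception):
        pass


class CleanDedupStorePipeline:
    """Hypothetical pipeline: clean fields, drop repeated URLs, keep items in memory."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.seen_urls = set()
        self.stored = []

    def process_item(self, item, spider):
        # Cleaning: trim surrounding whitespace from every string field.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()
        }
        # Removing duplicates: drop any item whose URL was already seen.
        if cleaned["url"] in self.seen_urls:
            raise DropItem(f"duplicate item: {cleaned['url']}")
        self.seen_urls.add(cleaned["url"])
        # Storing: a real pipeline would write to a file or database here.
        self.stored.append(cleaned)
        return cleaned
```

In a real project the three steps are often split into separate pipeline classes and enabled through the ITEM_PIPELINES setting, so that each one can be ordered and switched off independently.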
Middlewares
- Request/Response
- Injecting custom headers
- Proxying
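A downloader middleware covering the header and proxy cases above might look like the sketch below. The class name, the proxy address, and the header values are placeholders. The code relies on Scrapy's convention that the proxy for a request is read from request.meta["proxy"].

```python
class CustomHeadersProxyMiddleware:
    """Hypothetical downloader middleware: adds headers and sets a proxy."""

    # Placeholder values, not taken from any real deployment.
    PROXY_URL = "http://127.0.0.1:8080"
    EXTRA_HEADERS = {"Accept-Language": "en-US,en;q=0.9"}

    def process_request(self, request, spider):
        for name, value in self.EXTRA_HEADERS.items():
            # setdefault keeps any header the spider already set itself.
            request.headers.setdefault(name, value)
        # Scrapy's built-in HttpProxyMiddleware reads the proxy from this meta key.
        request.meta["proxy"] = self.PROXY_URL
        # Returning None lets Scrapy continue processing the request.
        return None
```

Such a class is enabled through the DOWNLOADER_MIDDLEWARES setting in the project's settings.py.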
Engine
The engine coordinates the data flow between all the other components.
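Scheduler
The scheduler receives requests from the engine, queues them, and hands them back when the engine asks for the next request.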
Robots.txt
- User-Agent
- Allow
- Disallow
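To see how the three directives above interact, the following sketch checks two URLs against a small rule set using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the "my-crawler" user agent name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Allow: /blog/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "my-crawler" is an assumed user agent name; it falls under the "*" rules.
blog_allowed = parser.can_fetch("my-crawler", "https://example.com/blog/post-1")
admin_allowed = parser.can_fetch("my-crawler", "https://example.com/admin/login")

print(blog_allowed, admin_allowed)
```

In Scrapy itself this check does not need to be written by hand: setting ROBOTSTXT_OBEY = True in settings.py makes the crawler respect robots.txt.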
2. XPath Selectors
- A query language for selecting nodes from an XML or HTML document.
- Stands for XML Path Language
- Uses a path-like syntax
- Based on the tree representation of the XML/HTML document
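The path-like syntax and the tree model can be tried out without Scrapy. The sketch below uses Python's standard xml.etree.ElementTree, which implements only a limited subset of XPath, on a made-up XML document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only to demonstrate the syntax.
DOC = """
<blog>
  <post id="1"><title>First post</title><tag>python</tag></post>
  <post id="2"><title>Second post</title><tag>scrapy</tag></post>
</blog>
"""

root = ET.fromstring(DOC)

# Path steps: every <title> that is a direct child of a <post>, at any depth.
titles = [title.text for title in root.findall(".//post/title")]

# Predicate on an attribute: the title of the <post> whose id is "2".
second = root.find(".//post[@id='2']/title").text

print(titles, second)
```

Scrapy's response.xpath() supports full XPath 1.0, so expressions such as not(...) and self:: are available there but not in ElementTree.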
Exclude certain elements from selection in XPath
For example, when extracting the title and content of a blog page, the content query can skip the syntax-highlighted code blocks:
# test url: https://yongqiang.live/databases/m201-mongodb-performance-ch3-index-operations/
# Select the text of every child of the article body except <div class="highlight"> blocks.
content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()