Web Scraping with Python using Scrapy and Splash

1. Introduction

What to scrape?

Text
Images
Videos
Emails

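Requests and BeautifulSoup work well for small scripts, but they offer no built-in crawling, scheduling, or pipeline support, so they are a poor fit for complex projects.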

Scrapy
Spiders
Pipelines
Middlewares
Engine
Scheduler

Spiders

scrapy.Spider
CrawlSpider
XMLFeedSpider
CSVFeedSpider
SitemapSpider

Pipelines

Cleaning the data
Remove duplication
Storing data

Middlewares

Request/Response
Injecting custom headers
Proxying
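
A downloader middleware covering the header and proxy cases above might look like the following sketch. The class name, the header, and the proxy address are placeholders invented for this example.

```python
class CustomHeadersProxyMiddleware:
    """Hypothetical downloader middleware that adjusts outgoing requests."""

    # Placeholder values; replace with real ones.
    EXTRA_HEADERS = {"X-Example-Client": "demo-scraper"}
    PROXY_URL = "http://127.0.0.1:8080"

    def process_request(self, request, spider):
        # Add each custom header unless the request already sets it.
        for name, value in self.EXTRA_HEADERS.items():
            request.headers.setdefault(name, value)
        # Scrapy reads the proxy to use from request.meta["proxy"].
        request.meta.setdefault("proxy", self.PROXY_URL)
        # Returning None lets Scrapy continue processing the request.
        return None
```

Such a class is registered through the DOWNLOADER_MIDDLEWARES setting.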

Engine

The engine coordinates all the other components.

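Scheduler

The scheduler queues the requests it receives from the engine and hands them back when the engine asks for the next one to crawl.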

Robots.txt

User-Agent
Allow
Disallow
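
To see how the three directives above interact, the sketch below parses a made-up robots.txt with Python's standard urllib.robotparser. The rules, the bot name, and the URLs are invented for illustration. Scrapy applies such rules itself when the ROBOTSTXT_OBEY setting is enabled.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="example-bot"):
    """Return True if the rules above let user_agent fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/blog/post-1"))
print(is_allowed("https://example.com/private/secret.html"))
print(is_allowed("https://example.com/private/public-page.html"))
```

This parser applies the first rule that matches, so the more specific Allow line is listed before the broader Disallow line.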

2. XPath Selectors

A query language used to select nodes from an XML or HTML document.
Stands for XML Path Language
Uses path-like syntax
Based on the tree representation of the XML/HTML document
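
The path-like syntax and the tree model can be tried out with the standard library alone. The sketch below uses xml.etree.ElementTree, which supports only a limited subset of XPath, on a made-up XML document; Scrapy selectors accept full XPath 1.0 expressions.

```python
import xml.etree.ElementTree as ET

# Made-up document for illustration.
XML = """
<library>
  <book lang="en"><title>Dune</title></book>
  <book lang="fr"><title>Candide</title></book>
</library>
"""

root = ET.fromstring(XML)

# Walk the tree step by step: library -> book -> title.
all_titles = [t.text for t in root.findall("./book/title")]

# A predicate in square brackets filters nodes by attribute value.
english_titles = [t.text for t in root.findall("./book[@lang='en']/title")]

print(all_titles)
print(english_titles)
```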

Exclude certain elements from selection in XPath

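For example, when extracting the title and content of a blog page, the content query can skip highlighted code blocks (div elements with class highlight):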

# test url: https://yongqiang.live/databases/m201-mongodb-performance-ch3-index-operations/

# Collect the text of every child of the article body except <div class="highlight"> blocks.
content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()