Web Scraping with Python using Scrapy and Splash

Python Web Spider

1. Introduction

What to scrape?

  • Text
  • Images
  • Videos
  • Emails

Requests and BeautifulSoup are not well suited to complex projects. Scrapy is a framework built from the following components:

  • Spiders
  • Pipelines
  • Middlewares
  • Engine
  • Scheduler

Spiders

  • scrapy.Spider
  • CrawlSpider
  • XMLFeedSpider
  • CSVFeedSpider
  • SitemapSpider

Pipelines

  • Cleaning the data
  • Removing duplicates
  • Storing data
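
The three pipeline duties above can be sketched in one small class. This is a minimal sketch, not a definitive implementation: the class name, the in-memory `stored` list, and the use of a `url` field as the unique key are all assumptions made for illustration. Scrapy calls `process_item(item, spider)` for each scraped item, and raising `DropItem` discards the item.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        pass


class CleanDedupeStorePipeline:
    """Hypothetical pipeline: clean, drop duplicates, store in memory."""

    def __init__(self):
        self.seen_urls = set()
        self.stored = []

    def process_item(self, item, spider):
        # Cleaning the data: strip whitespace from every string field.
        cleaned = {
            key: value.strip() if isinstance(value, str) else value
            for key, value in item.items()
        }
        # Removing duplicates: "url" is an assumed unique key.
        url = cleaned.get("url")
        if url in self.seen_urls:
            raise DropItem(f"duplicate item: {url}")
        self.seen_urls.add(url)
        # Storing data: a real pipeline would write to a file or database.
        self.stored.append(cleaned)
        return cleaned
```

In a Scrapy project, a pipeline like this would be enabled through the `ITEM_PIPELINES` setting.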

Middlewares

  • Request/Response
  • Injecting custom headers
  • Proxying
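
Header injection and proxying can both be done in a downloader middleware's `process_request` hook. The sketch below is hedged: the class name, the header values, and the proxy URL are placeholders, not values from any real project. It relies on two Scrapy behaviors: `request.headers` supports `setdefault`, and Scrapy honors a proxy set in `request.meta["proxy"]`.

```python
class CustomHeadersProxyMiddleware:
    """Hypothetical downloader middleware; all values are placeholders."""

    def __init__(self, proxy_url=None):
        self.proxy_url = proxy_url

    def process_request(self, request, spider):
        # Injecting custom headers: set them only if not already present.
        request.headers.setdefault("X-Requested-With", "XMLHttpRequest")
        request.headers.setdefault("Accept-Language", "en-US,en;q=0.9")
        # Proxying: route the request through the configured proxy, if any.
        if self.proxy_url:
            request.meta["proxy"] = self.proxy_url
        # Returning None tells Scrapy to continue processing this request.
        return None
```

A middleware like this would be registered through the `DOWNLOADER_MIDDLEWARES` setting.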

Engine

The engine is responsible for coordinating all the other components.

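Scheduler

The scheduler receives requests from the engine and queues them until the engine asks for the next request.

Robots.txt

A site's robots.txt file tells crawlers which paths they may fetch, using these directives: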

  • User-Agent
  • Allow
  • Disallow
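
To see how the three directives interact, the sketch below parses a small robots.txt with Python's standard `urllib.robotparser`. The file content, the `example.com` URLs, the `my-spider` user agent, and the `is_allowed` helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """
User-agent: *
Allow: /blog/
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.strip().splitlines())


def is_allowed(url, user_agent="my-spider"):
    """Return True if the rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/blog/post"))    # matches Allow
print(is_allowed("https://example.com/admin/login"))  # matches Disallow
```

In a Scrapy project, the `ROBOTSTXT_OBEY` setting controls whether the crawler respects these rules.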

2. XPath Selectors

  • A query language used to select nodes from an XML or HTML document
  • Stands for XML Path Language
  • Uses a path-like syntax
  • Based on the tree representation of the XML/HTML document
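
The path-like syntax and the tree model can be tried without Scrapy. The sketch below uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, whereas Scrapy's selectors accept full XPath 1.0 expressions. The document and its class names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed document used only for illustration.
DOC = """
<html>
  <body>
    <div class="article-content">
      <h1>Post title</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
    <div class="sidebar"><p>Unrelated.</p></div>
  </body>
</html>
"""

root = ET.fromstring(DOC.strip())

# Each path step walks the tree: any <div> with the given class,
# then its direct <p> children.
paragraphs = root.findall(".//div[@class='article-content']/p")
texts = [p.text for p in paragraphs]

# A single node: the first <h1> inside the article.
title = root.find(".//div[@class='article-content']/h1").text

print(title)
print(texts)
```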

Exclude certain elements from selection in XPath

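For example, when extracting the title and content of a blog page, the content selector below skips the highlighted code blocks:

# Test URL: https://yongqiang.live/databases/m201-mongodb-performance-ch3-index-operations/

# Collect every text node under the article body, skipping any direct
# child <div class="highlight"> (the syntax-highlighted code blocks).
content = response.xpath(
    "//div[@class='article-content']/*[not(self::div[@class='highlight'])]//text()"
).getall()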