Scrapy + Splash for Crawling Dynamic Web Pages

Requirements

This is a follow-up to a credential-stuffing incident. The previously written Suricata - login_audit script successfully audited every account that logged in to the site. The accounts that showed suspicious behavior during analysis now need a reverse lookup, mainly to determine whether each account has already been flagged as a leaked account.

Pitfalls

Scrapy has no JS engine, so it can only crawl static pages; pages whose content is generated by JavaScript are not supported. Scrapy-Splash can be used to crawl such dynamic pages.
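
As a reference point, this is roughly what the switch looks like: the spider yields a SplashRequest instead of a plain scrapy.Request, so the page is rendered by Splash before it reaches the parse callback. This is only a minimal sketch; the spider name, target URL and wait time are placeholders, and it assumes scrapy-splash is installed and a Splash instance is reachable.

# Minimal sketch: render a JS-driven page through Splash before parsing.
import scrapy
from scrapy_splash import SplashRequest


class JsPageSpider(scrapy.Spider):
    name = 'js_page_demo'  # hypothetical spider name

    def start_requests(self):
        # render.html returns the HTML after JavaScript has executed;
        # 'wait' gives scripts a couple of seconds to finish first.
        yield SplashRequest('https://example.com', self.parse,
                            endpoint='render.html', args={'wait': 2})

    def parse(self, response):
        # response.text now holds the JS-rendered HTML
        self.logger.info(response.text[:200])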

Deployment

1. Scrapy-Splash

$ pip install scrapy-splash --user

2. Splash Instance

Scrapy-Splash talks to the Splash HTTP API, so it needs a **Splash instance; Splash is usually run in Docker**.

$ more docker-compose.yml
version: "2.0"

services:
  splash:
    restart: always
    image: scrapinghub/splash
    tty: true
    ports:
      - "8050:8050"
    network_mode: "bridge"
    container_name: "Splash"
    hostname: "Splash"
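
Once the container is up, the instance can be sanity-checked through the same HTTP API that Scrapy-Splash will use later. A quick sketch, assuming the requests library is available and using httpbin.org only as an example target:

# Verify the Splash instance answers on port 8050.
import requests

resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'https://httpbin.org/get', 'wait': 2},
    timeout=30,
)
print(resp.status_code)   # 200 means Splash rendered the page
print(resp.text[:200])    # start of the rendered HTML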

3. Configure the Splash service (everything below goes in settings.py)

3.1 Add the Splash server address

SPLASH_URL = 'http://localhost:8050'

3.2 Enable the Splash downloader middlewares

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.

3.3 Enable SplashDeduplicateArgsMiddleware

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

3.4 Set a custom DUPEFILTER_CLASS

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

3.5 Use the Scrapy HTTP cache

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
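
For reference, after steps 3.1 to 3.5 the Splash-related part of settings.py ends up looking like this (the SPLASH_URL assumes the Docker instance above is running locally):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'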

4. Code

Note: once Scrapy-Splash is in use, the crawlera middleware can no longer be used directly; the proxy has to be wired in manually through an external Lua script, as in the spider below.

# -*- coding: utf-8 -*-
import scrapy
from haveibeenpwned.items import feed

import re
import json
import pandas as pd
from scrapy_splash import SplashRequest


"""
from redis crawl haveibeenpwned
"""

# Lua script executed by Splash: routes every request through the Crawlera
# proxy and reuses the X-Crawlera-Session id returned by the proxy.
LUA_SOURCE = """
function main(splash)
    local host = "proxy.crawlera.com"
    local port = 8010
    local user = "api_key"        -- Crawlera API key
    local password = ""
    local session_header = "X-Crawlera-Session"
    local session_id = "create"

    splash:on_request(function (request)
        request:set_header("X-Crawlera-UA", "desktop")
        request:set_header(session_header, session_id)
        request:set_proxy{host, port, username=user, password=password}
    end)

    splash:on_response_headers(function (response)
        if response.headers[session_header] ~= nil then
            session_id = response.headers[session_header]
        end
    end)

    splash:go(splash.args.url)
    return splash:html()
end
"""


class CheckSpider(scrapy.Spider):
    name = 'scrapy_demo'
    start_urls = ['https://httpbin.org/get']

    def start_requests(self):
        for url in self.start_urls:
            # The 'execute' endpoint runs LUA_SOURCE inside Splash, so the
            # page is fetched through Crawlera and rendered before parsing.
            yield SplashRequest(url, self.parse, endpoint='execute',
                                args={'wait': 3, 'lua_source': LUA_SOURCE})

    def parse(self, response):
        print(response.text)
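
To try it out, run the spider from the project root with scrapy crawl scrapy_demo; the JSON echoed by httpbin.org should be printed to the console, confirming that the request went through Splash.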
