Crawling Dynamic Web Pages with Scrapy + Splash

Posted on 2019-10-08, updated on 2019-10-16, category: Spider


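Requirements

This is a follow-up to a credential-stuffing incident. The script I wrote earlier, Suricata - login_audit, successfully audited every account that logged in to the site. The accounts that showed suspicious behavior after analysis now need a reverse lookup, mainly to determine whether each account has already been flagged as leaked.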

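Pitfalls

Scrapy has no JS engine, so it can only crawl static pages; pages generated dynamically by JavaScript are not supported. Scrapy-Splash fills this gap and makes dynamic pages crawlable.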
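To make the idea concrete, here is a minimal sketch of how a page can be fetched through Splash instead of directly. It only builds the URL for Splash's `render.html` endpoint, which returns the page HTML after JavaScript has run. The function name `render_html_url`, the default base URL, and the example target are assumptions for illustration, not part of the original project.

```python
from urllib.parse import urlencode


def render_html_url(target_url, splash_base="http://localhost:8050", wait=0.5):
    """Build a Splash render.html URL for target_url.

    splash_base is assumed to point at a running Splash instance;
    wait is the number of seconds Splash waits after the page loads.
    """
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


# Hypothetical target page, used only to show the resulting URL.
print(render_html_url("https://example.com/app", wait=2))
```

Requesting the resulting URL with any HTTP client returns the rendered HTML, which is roughly what Scrapy-Splash arranges for each request.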

Deployment

1. Scrapy-Splash

$ pip install scrapy-splash --user

2. Splash Instance

Scrapy-Splash talks to the Splash HTTP API, so a running **Splash instance** is required. Splash is usually run with Docker.

$ more docker-compose.yml
version: "2.0"

services:
  splash:
    restart: always
    image: scrapinghub/splash
    tty: true
    ports:
      - "8050:8050"
    network_mode: "bridge"
    container_name: "Splash"
    hostname: "Splash"
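Once the container is up, it is worth confirming that Splash answers before pointing Scrapy at it. The following is a small sketch that queries Splash's `/_ping` endpoint with the standard library; the helper names `parse_ping` and `splash_is_up` and the default address are assumptions, chosen to match the port mapping above.

```python
import json
import urllib.request


def parse_ping(body):
    """Return True when a /_ping response body reports status 'ok'."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def splash_is_up(base_url="http://localhost:8050", timeout=3.0):
    """Return True if the Splash instance at base_url answers /_ping."""
    try:
        url = f"{base_url.rstrip('/')}/_ping"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_ping(resp.read())
    except OSError:
        # Connection refused, timeout, HTTP error status, and similar.
        return False
```

Calling `splash_is_up()` after `docker-compose up -d` should return `True` once the service has finished starting.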

3. Configure the Splash service (all of the following goes in settings.py)

3.1 Add the Splash server address

SPLASH_URL = 'http://localhost:8050'

3.2 Enable the Splash middleware

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.
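The ordering note above can be turned into a quick sanity check. This is only a sketch: the helper `check_splash_order` is hypothetical, and it assumes that what matters is the relative order shown in the settings (cookies middleware, then Splash middleware, both before HttpProxyMiddleware at 750).

```python
HTTP_PROXY_ORDER = 750  # order of HttpProxyMiddleware in Scrapy's default settings

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}


def check_splash_order(middlewares, proxy_order=HTTP_PROXY_ORDER):
    """Return a list of ordering problems; an empty list means none were found."""
    cookies = middlewares.get('scrapy_splash.SplashCookiesMiddleware')
    splash = middlewares.get('scrapy_splash.SplashMiddleware')
    if cookies is None or splash is None:
        return ["both Splash downloader middlewares must be enabled"]
    problems = []
    if not cookies < splash:
        problems.append("SplashCookiesMiddleware should come before SplashMiddleware")
    if not splash < proxy_order:
        problems.append("SplashMiddleware should come before HttpProxyMiddleware")
    return problems


print(check_splash_order(DOWNLOADER_MIDDLEWARES))
```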

3.3 Enable SplashDeduplicateArgsMiddleware

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

3.4 Set a custom DUPEFILTER_CLASS

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

3.5 Use the Scrapy HTTP cache

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

4. Code

Note: once Scrapy-Splash is in use, the crawlera middleware can no longer be used directly. The proxy has to be set up by hand in an external Lua script.

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest


"""
from redis crawl haveibeenpwned
"""

LUA_SOURCE = """
    function main(splash)
        local host = "proxy.crawlera.com"
        local port = 8010
        local user = "api_key"
        local password = ""
        local session_header = "X-Crawlera-Session"
        local session_id = "create"

        splash:on_request(function (request)
            request:set_header("X-Crawlera-UA", "desktop")
            request:set_header(session_header, session_id)
            request:set_proxy{host, port, username=user, password=password}
        end)

        splash:on_response_headers(function (response)
            if response.headers[session_header] ~= nil then
                session_id = response.headers[session_header]
            end
        end)

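        -- Load the page, then wait the number of seconds passed in as args.wait.
        assert(splash:go(splash.args.url))
        splash:wait(splash.args.wait)
        return splash:html()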
    end
"""

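class CheckSpider(scrapy.Spider):
    name = 'scrapy_demo'
    start_urls = ['https://httpbin.org/get']

    def start_requests(self):
        # Send every start URL through Splash's `execute` endpoint so the Lua script runs.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='execute',
                                args={'wait': 3, 'lua_source': LUA_SOURCE})

    def parse(self, response):
        # httpbin echoes the request back, which makes the proxy's effect visible.
        print(response.text)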
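For debugging the Lua script outside Scrapy, it can help to see roughly what a request to the `execute` endpoint looks like on the wire. The sketch below builds (but does not send) a JSON POST to Splash's `/execute` endpoint using only the standard library. The helper `build_execute_request` and the shortened `DEMO_LUA` script are assumptions for illustration; they approximate, rather than reproduce, what Scrapy-Splash sends.

```python
import json
import urllib.request

# Shortened stand-in for LUA_SOURCE, without the proxy setup.
DEMO_LUA = """
function main(splash)
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    return splash:html()
end
"""


def build_execute_request(target_url, lua_source,
                          splash_base="http://localhost:8050", wait=3):
    """Build a POST request for Splash's /execute endpoint.

    Every key in the JSON payload is exposed to the script as splash.args.<key>.
    """
    payload = {"url": target_url, "wait": wait, "lua_source": lua_source}
    return urllib.request.Request(
        f"{splash_base.rstrip('/')}/execute",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_execute_request("https://httpbin.org/get", DEMO_LUA)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` against a running Splash instance should return the HTML produced by the script, which is a quick way to test Lua changes before running the spider.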

References:

https://github.com/scrapy-plugins/scrapy-splash/issues/117
