First published in the column 数据烹饪 (Data Cooking)
Selenium ChromeDriver and Anti-Bot Detection

@Date    : 2018-09-03
@Author  : lmingzhi (lmingzhi@gmail.com)


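0. Preface

Crawlers and anti-crawler measures have always been natural adversaries. During a recent data collection job I ran into a question: how can Chrome, when driven by Selenium, avoid being detected by anti-bot checks?

I had long heard from others that a browser controlled by Selenium can be detected, but I had never run into it myself. Recently I finally got some hands-on experience: site A performs anti-bot detection at login.

That is how the following notes came about.

Goal of this tutorial:
Pass the test at https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html so that every row of its table turns green.

The main steps are:

  1. Install mitmproxy
  2. Set up the mitmproxy certificate
  3. Write a mitmproxy script that injects JavaScript
  4. Install Google Chrome
  5. Download chromedriver and patch the file
  6. Test against the target site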

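1. Installing mitmproxy on CentOS

1.1. Using the precompiled Linux binary package

step0. References

Cui Qingcai, Python3网络爬虫开发实战 (Python 3 Web Crawler Development in Practice), section 1.7.2, installing mitmproxy



step1. Download link



step2. Implementation

I used mitmproxy 2.0.2.

# Download the precompiled binary package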
$ wget https://path/to/mitmproxy-2.0.2-linux.tar.gz
$ tar -zxvf mitmproxy-2.0.2-linux.tar.gz
$ sudo mv mitmproxy mitmdump mitmweb /usr/bin
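
As a quick sanity check after moving the binaries, a minimal Python sketch along the following lines can confirm that the three commands resolve on PATH. The helper name `find_tools` is hypothetical and not part of mitmproxy; it relies only on the standard-library `shutil.which`.

```python
import shutil


def find_tools(names=("mitmproxy", "mitmdump", "mitmweb")):
    """Map each command name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Example use (hypothetical): print where each binary was found.
# for name, path in find_tools().items():
#     print(name, "->", path or "NOT FOUND")
```

If any entry comes back as None, the `mv` step above probably did not land in a directory that is on PATH.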

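1.2. Installing mitmproxy with conda (an alternative)

Reference: centos 安装mitmproxy (installing mitmproxy on CentOS)

System information:

  1. CentOS Linux 7
  2. Miniconda

# Install gcc and g++
$ yum install gcc
$ yum install gcc-c++

# List the existing conda environments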
$ conda env list

    # conda environments:
    #
    py35                     /root/miniconda3/envs/py35
    root                  *  /root/miniconda3


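# Create a conda environment named py36 (python=3.6 is given explicitly so the environment gets its own Python and pip)
$ conda create --name py36 python=3.6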

    Fetching package metadata .............
    Solving package specifications: 
    Package plan for installation in environment /root/miniconda3/envs/py36:

    Proceed ([y]/n)? y

    #
    # To activate this environment, use:
    # > source activate py36
    #
    # To deactivate an active environment, use:
    # > source deactivate
    #


# Activate the py36 environment and install mitmproxy
$ source activate py36
(py36) $ pip install mitmproxy  

# Messages at the end of the install
  Found existing installation: pyasn1 0.2.3
Cannot uninstall 'pyasn1'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.
You are using pip version 10.0.1, however version 18.0 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


# The install above failed; pip has to be downgraded
(py36) $ pip install --upgrade --force-reinstall pip==9.0.3



# Install mitmproxy
(py36) $ pip install mitmproxy --disable-pip-version-check


# Check that it starts
(py36) $ mitmproxy 
# Press `Ctrl+C` to exit

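1.3. Certificate configuration on CentOS Linux 7

step0. mitmproxy certificate configuration

Quoted from: Python3网络爬虫开发实战, Cui Qingcai

Reference: MitmProxy的安装 (Installing MitmProxy)


step1. Official mitmproxy documentation

Reference: Installing the mitmproxy CA certificate manually


step2. Stack Overflow: converting a PEM certificate to CRT

Reference: Installing a root/CA Certificate


step3. Installing the certificate on CentOS

Reference: Adding trusted root certificates to the server


step4. Implementation

System: CentOS Linux 7.4.1708

# a. Check the mitmproxy version and system information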
$ mitmproxy --version

Mitmproxy version: 2.0.2 (release version) Precompiled Binary
Python version: 3.5.2
Platform: Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
SSL version: OpenSSL 1.1.0e  16 Feb 2017
Linux distro: CentOS Linux 7.4.1708 Core

# b. Install ca-certificates
$ yum install ca-certificates


# c. Convert the certificate, pem → crt
$ cd ~/.mitmproxy/
$ openssl x509 -in mitmproxy-ca-cert.pem -inform PEM -out mitmproxy-ca-cert.crt

$ ll
total 28
-rw-r--r-- 1 root root 1318 Aug 31 16:19 mitmproxy-ca-cert.cer
-rw-r--r-- 1 root root 1318 Aug 31 18:52 mitmproxy-ca-cert.crt
-rw-r--r-- 1 root root 1140 Aug 31 16:19 mitmproxy-ca-cert.p12
-rw-r--r-- 1 root root 1318 Aug 31 16:19 mitmproxy-ca-cert.pem
-rw-r--r-- 1 root root 2529 Aug 31 16:19 mitmproxy-ca.p12
-rw-r--r-- 1 root root 3022 Aug 31 16:19 mitmproxy-ca.pem
-rw-r--r-- 1 root root  770 Aug 31 16:19 mitmproxy-dhparam.pem

# Import the certificate into the system trust store
$ update-ca-trust force-enable
$ cp mitmproxy-ca-cert.crt /etc/pki/ca-trust/source/anchors/
$ update-ca-trust extract

# Directory that holds the added certificates
$ ll /etc/pki/ca-trust/source/anchors/
total 4
-rw-r--r-- 1 root root 1318 Aug 31 19:41 mitmproxy-ca-cert.crt
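
To double-check that the file copied into the anchors directory is the same certificate mitmproxy generated, the two PEM files can be compared by fingerprint. The following is a minimal sketch; `pem_sha256` and `same_certificate` are hypothetical helper names, and it assumes each file holds exactly one PEM-encoded certificate.

```python
import hashlib
import ssl


def pem_sha256(pem_text):
    """Return the SHA-256 fingerprint (hex) of a single-certificate PEM string."""
    der = ssl.PEM_cert_to_DER_cert(pem_text.strip())
    return hashlib.sha256(der).hexdigest()


def same_certificate(path_a, path_b):
    """True if both PEM files decode to the same certificate bytes."""
    with open(path_a) as file_a, open(path_b) as file_b:
        return pem_sha256(file_a.read()) == pem_sha256(file_b.read())


# Example use with the paths from the steps above:
# same_certificate("/root/.mitmproxy/mitmproxy-ca-cert.pem",
#                  "/etc/pki/ca-trust/source/anchors/mitmproxy-ca-cert.crt")
```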

2. Injecting JavaScript with mitmproxy

2.1. References

References:

  1. JAVASCRIPT INJECTION WITH SELENIUM, PUPPETEER, AND MARIONETTE IN CHROME AND FIREFOX
  2. IT IS NOT POSSIBLE TO DETECT AND BLOCK CHROME
  3. Selenium webdriver: firefox headless inject javascript to modify browser property

Notes:

  • The JavaScript below is injected so that sites do not detect that Chrome is being driven by Selenium.
  • mitmproxy's default proxy port is 8080.

2.2. Implementation

# Edit the injection script
(py36) $ vim indject_js_proxy.py


indject_js_proxy.py
from mitmproxy import ctx
injected_javascript = '''
// overwrite the `languages` property to use a custom getter
Object.defineProperty(navigator, "languages", {
  get: function() {
    return ["zh-CN","zh","zh-TW","en-US","en"];
  }
});

// Overwrite the `plugins` property to use a custom getter.
Object.defineProperty(navigator, 'plugins', {
  get: () => [1, 2, 3, 4, 5],
});

// Pass the Webdriver test
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});


// Pass the Chrome Test.
// We can mock this in as much depth as we need for the test.
window.navigator.chrome = {
  runtime: {},
  // etc.
};

// Pass the Permissions Test.
const originalQuery = window.navigator.permissions.query;
window.navigator.permissions.query = (parameters) => (
  parameters.name === 'notifications' ?
    Promise.resolve({ state: Notification.permission }) :
    originalQuery(parameters)
);
'''

def response(flow):
    # Only process 200 responses of HTML content.
    if flow.response.status_code != 200:
        return
    if 'text/html' not in flow.response.headers.get('content-type', ''):
        return

    # Inject a script tag containing the JavaScript right after <head>.
    html = flow.response.text
    if '<head>' not in html:
        return
    html = html.replace('<head>', '<head><script>%s</script>' % injected_javascript, 1)
    flow.response.text = html
    ctx.log.info('>>>> JS injected successfully <<<<')

    # If the URL starts with target, replace the page content with the current URL
    # target = 'https://target-url.com'
    # if flow.request.url.startswith(target):
    #     flow.response.text = flow.request.url
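
The replacement step can also be tried in isolation, without a running proxy. The following is a minimal sketch and not part of the original script: `inject_script` and `HEAD_TAG` are hypothetical names, and the regular expression is an assumption that also tolerates `<head>` tags written in upper case or carrying attributes, which a plain `str.replace('<head>', ...)` would miss.

```python
import re

# Matches <head> or <head attr="...">, but not <header>.
HEAD_TAG = re.compile(r"<head(\s[^>]*)?>", re.IGNORECASE)


def inject_script(html, javascript):
    """Insert a <script> block right after the first <head ...> tag.

    Returns the HTML unchanged when no <head> tag is present.
    """
    match = HEAD_TAG.search(html)
    if match is None:
        return html
    script = "<script>%s</script>" % javascript
    return html[:match.end()] + script + html[match.end():]
```

Placing the script at the very start of `<head>` matters because the overrides have to run before the page's own detection code does.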

Start mitmproxy together with the JS injection script

# Start the script
(py36) $ mitmdump -s indject_js_proxy.py   

Loading script indject_js_proxy.py
Proxy server listening at http://*:8080
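
Before pointing Chrome at the proxy, it can help to confirm that something is actually listening on port 8080. The following is a small sketch using only the standard library; `is_listening` is a hypothetical helper, and the default host and port are simply the values used in this post.

```python
import socket


def is_listening(host="localhost", port=8080, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only shows that the port accepts connections; it does not prove that the process behind it is mitmdump or that the injection script loaded.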

3. Installing google-chrome

3.1. Install commands

Installing Google Chrome takes only two commands:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
sudo yum install google-chrome-stable_current_x86_64.rpm

3.2. Installation log

# a. Download
$ wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

--2018-08-31 16:37:05--  https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm
Resolving dl.google.com (dl.google.com)... 203.208.40.68, 203.208.40.64, 203.208.40.67, ...
Connecting to dl.google.com (dl.google.com)|203.208.40.68|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 54333686 (52M) [application/x-rpm]
Saving to: ‘google-chrome-stable_current_x86_64.rpm’

100%[ =======================================================================================================>] 54,333,686  6.63MB/s   in 6.9s   

2018-08-31 16:37:12 (7.56 MB/s) - ‘google-chrome-stable_current_x86_64.rpm’ saved [54333686/54333686]



# b. Install the browser and its dependencies
$ sudo yum install google-chrome-stable_current_x86_64.rpm

4. Installing chromedriver

4.1. Installing some required tools

# Install zip and unzip
$ yum install zip unzip

# Install the hex editor hexedit
$ yum install hexedit

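4.2 Downloading chromedriver

# a. Download (saved under the file name used in the unzip step)
$ wget -O chromedriver_linux64_2.40.zip https://chromedriver.storage.googleapis.com/2.40/chromedriver_linux64.zip

# b. Unpack the archive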
$ unzip chromedriver_linux64_2.40.zip 

    Archive:  chromedriver_linux64_2.40.zip
      inflating: chromedriver
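
If `unzip` is not available, the same step could be done from Python. The following is a minimal sketch, with `extract_driver` as a hypothetical helper name. Python's `zipfile` does not restore permission bits, so the sketch sets the executable bit explicitly.

```python
import os
import stat
import zipfile


def extract_driver(zip_path, dest_dir, member="chromedriver"):
    """Extract one member from the archive, mark it executable, and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        extracted = archive.extract(member, path=dest_dir)
    mode = os.stat(extracted).st_mode
    os.chmod(extracted, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return extracted


# Example use (paths are assumptions):
# extract_driver("chromedriver_linux64_2.40.zip", ".")
```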

4.3 Patching the chromedriver binary

Reference: Can a website detect when you are using selenium with chromedriver?


# Patch chromedriver
$ hexedit chromedriver 

        # Steps
        1. Press Tab to switch to the string (ASCII) column
        2. Ctrl+S: search for var key = '$cdc_asdjflasutopfhvcZLmcfl_' (for version 2.40)
        3. Overwrite '$cdc_asdjflasutopfhvcZLmcfl_' with any other string of the same length
        4. Ctrl+X: save

# Move chromedriver to /usr/bin
$ mv chromedriver /usr/bin
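
The same patch could be scripted instead of done by hand in hexedit. The following is a minimal sketch under stated assumptions: `patch_marker` and `patch_file` are hypothetical helpers, the replacement string `$abc_...` is an arbitrary example, and the marker is the one quoted above for version 2.40 (other versions may use a different string). The replacement must keep the same length so that no offsets in the binary shift.

```python
MARKER = b"$cdc_asdjflasutopfhvcZLmcfl_"        # marker quoted above for 2.40
REPLACEMENT = b"$abc_asdjflasutopfhvcZLmcfl_"   # arbitrary same-length example


def patch_marker(data, old=MARKER, new=REPLACEMENT):
    """Replace every occurrence of `old` with the same-length `new`.

    Returns (patched_bytes, number_of_replacements).
    """
    if len(old) != len(new):
        raise ValueError("replacement must have the same length as the marker")
    return data.replace(old, new), data.count(old)


def patch_file(src_path, dst_path):
    """Write a patched copy of the binary at src_path to dst_path; return the count."""
    with open(src_path, "rb") as src:
        patched, count = patch_marker(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(patched)
    return count
```

A count of 0 means the marker was not found, in which case the binary is left as it was. A file written by `patch_file` is a new file, so it needs `chmod +x` before use.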

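5. Anti-bot detection test

5.0. Connecting to the server's Jupyter Notebook from the local machine

Because the code will be deployed on a remote server, the tests have to run directly on that server. I prefer working in Jupyter Notebook, so here is a quick note on how to connect to a remote Jupyter Notebook.

5.0.1. Reference links


5.0.2. Implementation

5.0.2.1 Remote server

The remote server only has miniconda3 installed, so jupyter has to be installed as well.

$ conda install jupyter


# Start jupyter notebook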
$ jupyter notebook                                                                    
[I 11:39:00.645 NotebookApp] Writing notebook server cookie secret to /run/user/0/jupyter/notebook_cookie_secret
[C 11:39:00.665 NotebookApp] Running as root is not recommended. Use --allow-root to bypass.


# The message says --allow-root is required, because I am logged in with the root account
$ jupyter notebook --allow-root

[I 11:39:18.954 NotebookApp] Serving notebooks from local directory: /data/scrapy_lmz/tg_scrapy_eleme_mt
[I 11:39:18.954 NotebookApp] 0 active kernels 
[I 11:39:18.954 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e
[I 11:39:18.954 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[W 11:39:18.955 NotebookApp] No web browser found: could not locate runnable browser.
[C 11:39:18.955 NotebookApp] 

    Copy/paste this URL into your browser when you connect for the first time,
    to login with a token:
        http://localhost:8888/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e

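Jupyter Notebook can also be started with the command below, which runs it in the background and writes its log to /tmp/ipynb.logs.

# Run jupyter notebook in the background on the remote server
$ nohup jupyter notebook --allow-root > /tmp/ipynb.logs &

# View the log
$ tail -f /tmp/ipynb.logs

# List the jupyter notebook processes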
$ ps aux | grep jupyter  

root     29106  0.0  0.1 475080 61068 pts/1    Sl   14:05   0:04 /root/miniconda3/bin/python /root/miniconda3/bin/jupyter-notebook --allow-root                                                                                             
root     29151  0.0  0.1 771448 55556 ?        Ssl  14:08   0:02 /root/miniconda3/bin/python -m ipykernel_launcher -f /run/user/0/jupyter/kernel-afbd8de5-efed-4b18-94ac-d27866219a2c.json                                                  
root     30714  0.0  0.1 744156 44668 ?        Ssl  17:24   0:00 /root/miniconda3/bin/python -m ipykernel_launcher -f /run/user/0/jupyter/kernel-b6c35685-bd4e-4f18-ba3c-b7680706ad1b.json                                                  
root     31028  0.0  0.0 112708   976 pts/1    S+   18:18   0:00 grep --color=auto jupyter 


# Stop the jupyter process
$ kill 29106

# Or kill all jupyter processes at once
$ ps aux|grep jupyter|awk '{print $2}'|xargs kill -9

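5.0.2.2 Local machine setup

The remote Jupyter Notebook listens on port 8888; its address including the token is: localhost:8888/?

# Port forwarding
$ ssh -L8008:localhost:8888 root@192.168.100.251
# Enter the remote server's password

# Open the following address in the local browser to reach the remote jupyter notebook; note that the local port is 8008
http://localhost:8008/?token=1478c6364d77b3a74edd529250a1d01c36d79eb079f93e5e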
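
Since the URL printed by the remote server still carries port 8888, the port has to be swapped for the forwarded local one. The following is a tiny sketch of that rewrite; `to_local_url` is a hypothetical helper built on the standard-library `urllib.parse`.

```python
from urllib.parse import urlsplit, urlunsplit


def to_local_url(remote_url, local_port):
    """Swap the port in a URL printed by the remote server for the forwarded local port."""
    parts = urlsplit(remote_url)
    netloc = "%s:%d" % (parts.hostname, local_port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

The token in the query string is passed through unchanged, so the login still works through the tunnel.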

5.1. Test site

Test link: chrome-headless-test

Result in a normal browser:


5.2. Test code

5.2.0. Importing modules

# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time, sys
import json, io, re, os, random
from datetime import datetime, timedelta

# Set up the logging module
import logging
# create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)
# create console handler and set its level to INFO
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)
# create formatter
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
# add formatter to ch
ch.setFormatter(formatter)
# add ch to logger
logger.addHandler(ch)

# fh = logging.FileHandler(r'/tmp/crawl_logs.log')
# fh.setFormatter(formatter)
# logger.addHandler(fh)

5.2.1 test1: without JS injection

That is, without going through the mitmproxy proxy.

# mitmproxy port 8080
proxy_host = 'localhost'
proxy_port = 8080
options = webdriver.ChromeOptions()
options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')
options.add_argument('--headless')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')
options.add_argument("--no-sandbox")
options.add_argument('--disable-gpu')

# test1: no JS injection, so the mitmproxy proxy is not used
# Specify the proxy.
# options.add_argument('--proxy-server=%s:%s' % (proxy_host, proxy_port))

# Start the browser
driver = webdriver.Chrome(chrome_options=options)
url = 'https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html'
# Visit the test site
driver.get(url)
# Take a screenshot
driver.save_screenshot('test1.png')

Screenshot test1.png


5.2.2 Resetting the browser

driver.close()

# List background processes
!ps aux|grep chrome

root     30727  0.0  0.0 206520  5400 ?        Sl   17:28   0:00 chromedriver --port=28011
root     30844  0.0  0.0 113172  1208 pts/3    Ss+  17:32   0:00 /usr/bin/sh -c ps aux|grep chrome
root     30846  0.0  0.0 112708   944 pts/3    S+   17:32   0:00 grep chrome

###################################################################
# Kill the leftover chromedriver --port process
import subprocess
cmd = "ps aux|grep chrome|awk '{print $2}'|xargs kill -9"
p = subprocess.Popen(args=cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, close_fds=True)
z = p.wait()
logging.warning('kill chrome process %s' % str(z))
###################################################################

# List background processes again
!ps aux|grep chrome

root     30727  0.0  0.0      0     0 ?        Z    17:28   0:00 [chromedriver] <defunct>
root     30853  0.0  0.0 113172  1208 pts/3    Ss+  17:33   0:00 /usr/bin/sh -c ps aux|grep chrome
root     30855  0.0  0.0 112708   944 pts/3    S+   17:33   0:00 grep chrome
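
The `ps aux|grep chrome|...|xargs kill -9` pipeline also matches the `grep` and shell helper processes themselves. A narrower selection can be sketched by parsing the `ps aux` text in Python. `chrome_pids` below is a hypothetical helper that assumes the standard 11-column `ps aux` layout shown above. Separately, Selenium's `driver.quit()`, unlike `driver.close()`, is meant to end the chromedriver process as well, which may make the manual kill unnecessary.

```python
def chrome_pids(ps_output, keyword="chrome"):
    """Return PIDs from `ps aux` text whose command contains keyword.

    Lines for grep/shell helpers and zombie (defunct) processes are skipped.
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 10)  # 11 columns; the last one is the command
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or malformed line
        state, command = fields[7], fields[10]
        if keyword not in command:
            continue
        if "grep" in command or state.startswith("Z"):
            continue
        pids.append(int(fields[1]))
    return pids
```

Applied to the first listing above, this selects only PID 30727 (the `chromedriver --port=28011` process).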

5.2.3 With JS injection

Restart the browser, route it through mitmproxy, and inject the JS code.

# mitmproxy port 8080
proxy_host = 'localhost'
proxy_port = 8080
options = webdriver.ChromeOptions()
options.add_argument('lang=zh-CN,zh,zh-TW,en-US,en')
options.add_argument('--headless')
options.add_argument('user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36')
options.add_argument("--no-sandbox")
options.add_argument('--disable-gpu')

# test2: inject the JS code by routing traffic through the mitmproxy proxy
# Specify the proxy.
options.add_argument('--proxy-server=%s:%s' % (proxy_host, proxy_port))

# Start the browser
driver = webdriver.Chrome(chrome_options=options)
url = 'https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html'
driver.get(url)
# Take a screenshot
driver.save_screenshot('test2.png')
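
The two test runs differ only in whether the proxy argument is added, so the shared options could be factored into one helper. The following is a minimal sketch; `build_chrome_args` is a hypothetical name, and the argument strings are copied from the two tests in this section.

```python
USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
)


def build_chrome_args(proxy=None, headless=True):
    """Return the Chrome arguments used in both tests.

    proxy: optional (host, port) tuple; when given, traffic goes through that proxy.
    """
    args = [
        "lang=zh-CN,zh,zh-TW,en-US,en",
        "user-agent=%s" % USER_AGENT,
        "--no-sandbox",
        "--disable-gpu",
    ]
    if headless:
        args.append("--headless")
    if proxy is not None:
        args.append("--proxy-server=%s:%s" % proxy)
    return args


# Example use with Selenium (not executed here):
# options = webdriver.ChromeOptions()
# for arg in build_chrome_args(proxy=("localhost", 8080)):
#     options.add_argument(arg)
```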

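Screenshot test2.png


The chrome item still does not seem to pass, but my target site now gets past its anti-bot detection. This item needs a closer look; one possible cause is that the injected script sets window.navigator.chrome, whereas a check of this kind may read window.chrome instead.

Edited on 2022-01-13 10:14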