Scraping with BeautifulSoup and Selenium Together

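When scraping with Python, BeautifulSoup is probably the most common choice.
You fetch DOM elements with selectors and parse them, so it is not particularly difficult to use.

Selenium drives a real browser and takes rendering into account, so it can retrieve the DOM after content has been lazily loaded by JavaScript.

The point of this post: if you are going to scrape, combining BeautifulSoup with Selenium will make your life easier.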

What I Was Trying to Do
Fetching with BeautifulSoup
Adding Selenium
Summary

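What I Was Trying to Do

It all started with a request: "I want to tally, on a daily basis, the download counts of plugins registered in the official WordPress repository." The task was simple: open the relevant page and grab the values.

It looks like the values can be picked up from the DOWNLOADS HISTORY table in the Advanced View.

The HTML looks like this.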

<table id="plugin-download-history-stats" class="download-history-stats">
        <tbody>
            <tr>
                <th scope="row">Today</th>
                <td>100</td>
            </tr>
            <tr>
                <th scope="row">Yesterday</th>
                <td>200</td>
            </tr>
            <tr>
                <th scope="row">Last 7 Days</th>
                <td>300</td>
            </tr>
            <tr>
                <th scope="row">All Time</th>
                <td>1000</td>  <!-- this is the main value I want -->
            </tr>
        </tbody>
</table>

Fetching with BeautifulSoup

First, let's try fetching it with BeautifulSoup alone.

Start by installing the required libraries.

$ pip install requests
$ pip install beautifulsoup4

Here is the code I wrote.

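import requests
from bs4 import BeautifulSoup

# Replace {plugin-slug} with the slug of the target plugin.
load_url = 'https://wordpress.org/plugins/{plugin-slug}/advanced/'
response = requests.get(load_url)
# Parse the raw response body with Python's built-in HTML parser.
soup = BeautifulSoup(response.content, 'html.parser')
# select() returns a list of every element matching the CSS selector.
print(soup.select('table#plugin-download-history-stats'))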

And the result...

$ python plugin-download-rate-bs.py
[<table class="download-history-stats" id="plugin-download-history-stats">
<tbody></tbody>
</table>]

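The rows of the table were not retrieved.
Apparently the tbody of this table is lazily loaded, so requests, which only downloads the initial HTML and does not execute JavaScript, cannot get it.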
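A quick way to notice this situation in code is to check whether the table actually contains rows before trusting the result. The following is a minimal sketch; the helper name `has_rows` and the inline sample HTML are assumptions made for illustration and are not part of the original script.

```python
from bs4 import BeautifulSoup

TABLE_SELECTOR = 'table#plugin-download-history-stats'


def has_rows(html, selector=TABLE_SELECTOR):
    """Return True if the table matched by selector contains at least one <tr>."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.select_one(selector)
    # False when the table is missing or its rows have not been filled in yet.
    return table is not None and table.find('tr') is not None


# Sample HTML mimicking the response obtained with requests (empty tbody).
empty_html = '''
<table class="download-history-stats" id="plugin-download-history-stats">
<tbody></tbody>
</table>
'''

# Sample HTML mimicking the table after JavaScript has filled it in.
filled_html = '''
<table class="download-history-stats" id="plugin-download-history-stats">
<tbody><tr><th scope="row">Today</th><td>100</td></tr></tbody>
</table>
'''

print(has_rows(empty_html))   # False
print(has_rows(filled_html))  # True
```

A check like this could be used to decide whether the lightweight requests approach is enough or whether a real browser is needed.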

Adding Selenium

Selenium was originally developed as a testing framework, but today it is also used as a browser automation tool for tasks such as crawling websites.

This time I will use Selenium for the scraping.

First, install Selenium and a WebDriver (Chrome in this case).

$ pip install selenium
$ pip install chromedriver-binary=={Chrome version}

Note: the WebDriver version must match your Chrome version.

Here is the code.

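import chromedriver_binary  # noqa: F401  (importing it adds chromedriver to PATH)
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
# Run Chrome without a visible window. The opts.headless attribute was
# deprecated and later removed in Selenium 4, so pass the flag directly.
opts.add_argument('--headless')
driver = webdriver.Chrome(options=opts)

# Replace {plugin-slug} with the slug of the target plugin.
load_url = 'https://wordpress.org/plugins/{plugin-slug}/advanced/'
driver.get(load_url)

# page_source returns the HTML of the page as currently loaded in the browser.
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('table#plugin-download-history-stats'))

driver.quit()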

Instead of requests, the webdriver is used to access the target URL.
It should also work without headless mode.

And the result...
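Because the rows are loaded lazily, reading the page source immediately after the page opens may occasionally still return the empty table. One way to guard against that is to poll until the rows appear. The sketch below is only an illustration under that assumption: `wait_for_rows`, its parameters, and the fake page sequence are hypothetical names introduced here, and the HTML source is passed in as a callable so the example runs without a browser.

```python
import time

from bs4 import BeautifulSoup


def wait_for_rows(get_html, selector, timeout=10.0, interval=0.5):
    """Poll get_html() until the selected table has a <tr>, then return the soup.

    Raises TimeoutError if no row appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        soup = BeautifulSoup(get_html(), 'html.parser')
        if soup.select_one(f'{selector} tr') is not None:
            return soup
        if time.monotonic() >= deadline:
            raise TimeoutError(f'no rows found for {selector!r}')
        time.sleep(interval)


# Stand-in for driver.page_source: empty on the first two reads, filled afterwards.
pages = iter([
    '<table id="t"><tbody></tbody></table>',
    '<table id="t"><tbody></tbody></table>',
    '<table id="t"><tbody><tr><td>1000</td></tr></tbody></table>',
])
soup = wait_for_rows(lambda: next(pages), 'table#t', timeout=5.0, interval=0.01)
print(soup.select_one('table#t td').get_text())  # 1000
```

With a real driver, the callable would be `lambda: driver.page_source` and the selector `'table#plugin-download-history-stats'`. Selenium also ships its own explicit-wait mechanism (`WebDriverWait` together with `expected_conditions`), which is the more usual choice when a driver is available.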

$ python3 plugin-download-rate-bs.py
[<table class="download-history-stats" id="plugin-download-history-stats">
<tbody><tr><th scope="row">Today</th><td>100</td></tr><tr><th scope="row">Yesterday</th><td>200</td></tr><tr><th scope="row">Last 7 Days</th><td>300</td></tr><tr><th scope="row">All Time</th><td>400</td></tr></tbody>
</table>]

The DOM I was after has been retrieved.
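Once the table is available, the individual values can be pulled out of it. The following is a minimal sketch that uses the output above as sample input; the function name `parse_download_stats` is hypothetical, and stripping commas is an assumption in case the real page formats numbers with thousands separators.

```python
from bs4 import BeautifulSoup

# Sample input copied from the output shown above.
html = '''
<table class="download-history-stats" id="plugin-download-history-stats">
<tbody><tr><th scope="row">Today</th><td>100</td></tr><tr><th scope="row">Yesterday</th><td>200</td></tr><tr><th scope="row">Last 7 Days</th><td>300</td></tr><tr><th scope="row">All Time</th><td>400</td></tr></tbody>
</table>
'''


def parse_download_stats(html):
    """Map each row label (th) to its numeric value (td)."""
    soup = BeautifulSoup(html, 'html.parser')
    stats = {}
    for row in soup.select('table#plugin-download-history-stats tr'):
        label = row.find('th')
        value = row.find('td')
        if label is None or value is None:
            continue  # skip rows that lack the expected th/td pair
        # Assumption: remove thousands separators before converting to int.
        number = int(value.get_text(strip=True).replace(',', ''))
        stats[label.get_text(strip=True)] = number
    return stats


stats = parse_download_stats(html)
print(stats['All Time'])  # 400
```

Returning a dictionary keyed by the row label keeps the lookup readable and does not depend on the order of the rows.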