Web Scraping Cheat Sheet

This cheat sheet summarizes the basic concepts of web scraping and how to use the main Python libraries for it: Beautiful Soup, Selenium, and Scrapy.

HTML Basics

Before you start scraping, it is important to understand the basic structure of HTML. An HTML document is a tree of nodes: elements, attributes, and text nodes.

<article class="main-article">
  <h1> 地面師たち </h1>
  <p class="plot"> One year later </p>
  <div class="full-script"> Inspired by a real incident, </div>
</article>

<article>: an element (tag)
class="main-article": an attribute and its value
<h1> 地面師たち </h1>: the h1 element and its text node
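To make the three node types concrete, the sketch below walks a small HTML string with Python's standard-library `html.parser` and labels each node it meets. The sample markup, the `NodeLister` class, and the `list_nodes` helper are illustrative names invented for this example, not part of any scraping library.

```python
from html.parser import HTMLParser

# Illustrative markup with the same shape as the example above
HTML = """
<article class="main-article">
  <h1>Sample title</h1>
  <p class="plot">One year later</p>
</article>
"""


class NodeLister(HTMLParser):
    """Collects a (kind, value) tuple for each element, attribute, and text node."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value) pairs
        self.nodes.append(("element", tag))
        for name, value in attrs:
            self.nodes.append(("attribute", f"{name}={value}"))

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only text between tags
            self.nodes.append(("text", text))


def list_nodes(html):
    """Parse an HTML string and return its nodes in document order."""
    parser = NodeLister()
    parser.feed(html)
    parser.close()
    return parser.nodes


nodes = list_nodes(HTML)
for kind, value in nodes:
    print(kind, value)
```

Each opening tag is reported as an element, each `name="value"` pair inside it as an attribute, and the characters between tags as text.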

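XPath Basics

XPath is a query language for navigating the elements and attributes of an XML document; it can also be applied to HTML.

//article[@class="main-article"]: selects article elements whose class attribute is exactly "main-article", anywhere in the document
//h1: selects every h1 element in the document
//div[@class="full-script"]: selects div elements whose class attribute is exactly "full-script", anywhere in the document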
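The three expressions above can be tried without installing anything by using the standard-library `xml.etree.ElementTree`, as sketched below. Two assumptions apply: ElementTree implements only a subset of XPath, and it expects paths relative to an element, so `//` is written as `.//`. The sample document is made up for the example and is well-formed XML, which ElementTree requires.

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed document
DOC = """
<html>
  <body>
    <article class="main-article">
      <h1>Sample title</h1>
      <p class="plot">One year later</p>
      <div class="full-script">Inspired by a real incident</div>
    </article>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# Paths are evaluated relative to `root`, hence the leading "."
articles = root.findall(".//article[@class='main-article']")
headings = root.findall(".//h1")
scripts = root.findall(".//div[@class='full-script']")

print(len(articles))
print(headings[0].text)
print(scripts[0].text)
```

For real-world HTML, which is often not well-formed, a full XPath engine such as the one in lxml or Scrapy's selectors is the more usual choice.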

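Main XPath Symbols

/: selects from the root when it starts an expression; between steps, selects direct children
//: selects matching nodes at any depth (anywhere in the document when it starts an expression)
.: the current node
..: the parent node
*: any element
@: selects an attribute
[n]: the n-th matching element (indexing starts at 1)
[@attribute="value"]: elements whose attribute equals the given value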
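The symbols listed above can be exercised one by one with the standard-library `xml.etree.ElementTree`, under the same caveats as before: it supports only a subset of XPath and uses paths relative to the starting element. The markup below is a made-up example.

```python
import xml.etree.ElementTree as ET

# Illustrative document: a list with three items, followed by a paragraph
root = ET.fromstring(
    "<body>"
    "<ul id='menu'>"
    "<li class='item'>first</li>"
    "<li class='special'>second</li>"
    "<li>third</li>"
    "</ul>"
    "<p>note</p>"
    "</body>"
)

direct_children = root.findall("./*")        # "*": every direct child element
all_items = root.findall(".//li")            # "//": li elements at any depth
second_item = root.find(".//li[2]")          # "[n]": positions start at 1
with_class = root.findall(".//li[@class]")   # "@": li elements that have a class attribute
special = root.find(".//li[@class='special']")        # attribute equals a value
parent = root.find(".//li[@class='special']/..")      # "..": parent of the match

print([child.tag for child in direct_children])
print(second_item.text, special.text, parent.get("id"))
```

Note that `[2]` picks the second `li` among its siblings, not the element at Python index 2.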

Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.

Example workflow

from bs4 import BeautifulSoup
import requests

# Fetch the web page
result = requests.get("https://www.google.com")
content = result.text

# Create a BeautifulSoup object (uses the lxml parser, which must be installed separately)
soup = BeautifulSoup(content, "lxml")

# Print the HTML with indentation
print(soup.prettify())

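Getting elements

Goal	Code	Description
Get an element by ID	`soup.find(id="specific_id")`	Returns the first element with the given ID, or `None` if there is none.
Get all elements by tag name	`soup.find_all("a")`	Returns every `<a>` element.
Get elements by class name	`soup.find_all("a", class_="my_class")`	Returns every `<a>` element whose `class` includes `"my_class"`.
Get text	`sample = element.get_text()`	Returns all the text inside the element, including its descendants.
Get an attribute value	`sample = element.get('href')`	Returns the value of the given attribute, or `None` if it is absent.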
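The calls in the table can be tried offline on an inline HTML string, as in the sketch below. The markup is made up for the example, and the parser is switched to Python's built-in `"html.parser"` so that lxml does not have to be installed; Beautiful Soup itself (`bs4`) is assumed to be available.

```python
from bs4 import BeautifulSoup

# Illustrative markup; the id and class names mirror the ones used in the table
HTML = """
<div id="specific_id">
  <a class="my_class" href="/first">First link</a>
  <a href="/second">Second link</a>
</div>
"""

# "html.parser" ships with Python, so no extra parser is required
soup = BeautifulSoup(HTML, "html.parser")

box = soup.find(id="specific_id")                      # first element with this ID
all_links = soup.find_all("a")                         # every <a> element
styled_links = soup.find_all("a", class_="my_class")   # only <a class="my_class">

link_texts = [a.get_text(strip=True) for a in all_links]
first_href = styled_links[0].get("href")
missing = styled_links[0].get("title")   # absent attribute, so .get() returns None

print(box.name, link_texts, first_href, missing)
```

Because `find` and `.get` return `None` when nothing matches, it is worth checking the result before chaining further calls on it.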

Scrapy

Scrapy is a fast web crawling framework for extracting structured data from websites.

Creating a project

scrapy startproject my_first_spider
cd my_first_spider
scrapy genspider example example.com

Example spider

my_first_spider/spiders/example.py

import scrapy

class ExampleSpider(scrapy.Spider):
    # Name of the spider (used with `scrapy crawl`)
    name = 'example'
    # Domains the spider is allowed to crawl
    allowed_domains = ['example.com']
    # URLs where the crawl starts
    start_urls = ['http://example.com/']

    # Method that handles each response
    def parse(self, response):
        # Use XPath to get the text of the first h1 tag
        title = response.xpath('//h1/text()').get()
        # Yield the data as a dictionary
        yield {'titles': title}

Running and exporting data

# Run the spider
scrapy crawl example

# Export the data to a CSV file
scrapy crawl example -o name_of_file.csv

# Export the data to a JSON file
scrapy crawl example -o name_of_file.json

Selenium

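Selenium is a tool for automating web browsers. It is especially useful for sites that generate content dynamically with JavaScript and for pages that require user interaction.

Basic operations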

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Target URL
web = "https://www.google.com"
# Path to ChromeDriver (change this to match your environment)
path = 'path/to/chromedriver' # e.g. '/usr/local/bin/chromedriver'

# Create a Service object
service = Service(executable_path=path)
# Initialize the Chrome driver
driver = webdriver.Chrome(service=service)
# Open the web page
driver.get(web)

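Getting elements, waiting, and quitting

Goal	Code	Description
Get a single element	`driver.find_element(by="id", value="...")`	Returns the first element matching the locator.
Get multiple elements	`driver.find_elements(by="xpath", value="...")`	Returns a list of all elements matching the locator.
Get text	`data = element.text`	Returns the element's visible text.
Quit the browser	`driver.quit()`	Ends the WebDriver session and closes the browser.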
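The calls in the table are usually combined: find the elements, read their text, and always quit the session afterwards. The sketch below shows that pattern. `collect_texts` is a hypothetical helper, and `FakeDriver` and `FakeElement` are stand-ins invented here so the example runs without a browser; with Selenium installed, a real `webdriver.Chrome` instance could be passed in instead.

```python
def collect_texts(driver, xpath):
    """Return the text of every element matching `xpath`, then end the session."""
    try:
        elements = driver.find_elements(by="xpath", value=xpath)
        return [element.text for element in elements]
    finally:
        # Always end the session, even if the lookup raised an exception
        driver.quit()


# Stand-ins used only to demonstrate the call pattern without a browser
class FakeElement:
    def __init__(self, text):
        self.text = text


class FakeDriver:
    def __init__(self):
        self.closed = False

    def find_elements(self, by, value):
        # A real driver would query the page; this returns canned elements
        return [FakeElement("Home"), FakeElement("About")]

    def quit(self):
        self.closed = True


fake = FakeDriver()
texts = collect_texts(fake, "//nav//a")
print(texts, fake.closed)
```

Putting `driver.quit()` in a `finally` block avoids leaving orphaned browser processes behind when a lookup fails.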

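Waiting

There are two kinds of waits: implicit and explicit.

Implicit wait

Sets the maximum time WebDriver waits for an element to be found whenever it looks one up.

import time

# Wait up to 2 seconds for an element to appear on each lookup
driver.implicitly_wait(2)

# For comparison: time.sleep() is not an implicit wait; it always pauses for the full duration
time.sleep(2)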

Explicit wait

WebDriver waits until a specific condition is met, up to a timeout.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 5 seconds for the element with ID 'id_name' to become clickable
WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.ID, 'id_name')))
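Conceptually, an explicit wait is a polling loop: check a condition, return as soon as it holds, and give up once the timeout has passed. The sketch below illustrates that idea using only the standard library. `wait_until` is a hypothetical function written for this example and is not Selenium's implementation; `WebDriverWait` itself polls every 0.5 seconds by default and raises `TimeoutException` rather than `TimeoutError`.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the "element" only becomes available on the third check
attempts = {"count": 0}


def element_ready():
    attempts["count"] += 1
    return "button" if attempts["count"] >= 3 else None


found = wait_until(element_ready, timeout=2.0, poll_interval=0.01)
print(found, attempts["count"])
```

Unlike a fixed `time.sleep()`, this returns as soon as the condition holds, so the script only waits as long as the page actually needs.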

Headless mode

Runs the browser without a GUI, which is convenient in environments such as servers.

from selenium.webdriver.chrome.options import Options

options = Options()
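options.add_argument('--headless=new') # Enable headless mode (the options.headless attribute was removed in recent Selenium releases)
options.add_argument('--window-size=1920,1080') # Set the window size (optional but recommended)

# Initialize the Chrome driver in headless mode
driver = webdriver.Chrome(service=service, options=options)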