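HTML Parser Primer: Basic BeautifulSoup Operations

I noticed that I was looking up the difference between find() and find_all() every single time, so I decided to sit down and organize it properly once. These are my notes on the basic operations of BeautifulSoup.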

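What is BeautifulSoup?

Figure 1: BeautifulSoup parser selection and HTML tree parsing flow

BeautifulSoup is a library that parses HTML or XML text into a tree of Python objects. Once parsed, tag names, attributes, and text can be read with intuitive, Pythonic notation.

It is a standard library for "scraping", that is, collecting data from web pages. It is also used to tidy up messy HTML and to read HTML tables embedded in logs or report files as data.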

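BeautifulSoup4

The focus of this article. Parses HTML/XML with an intuitive API. The parser is swappable.

lxml

A fast parser implemented in C. Used as a BeautifulSoup backend. lxml itself also supports XPath, although not through the BeautifulSoup API.

html.parser

Part of the Python standard library. No extra install needed, but slower and less forgiving of broken markup than lxml.

requests

An HTTP client for fetching the HTML of a web page. Used together with BeautifulSoup.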

Installation

Run the following with your virtual environment activated (see PART 01 for how to create one).

Shell: installation

# BeautifulSoup4 itself
pip install beautifulsoup4

# Fast parser (recommended)
pip install lxml

# For fetching web pages (when scraping)
pip install requests

# To install everything at once
pip install beautifulsoup4 lxml requests

# Confirm the installation
pip show beautifulsoup4

Example output of pip show beautifulsoup4

Name: beautifulsoup4
Version: 4.12.3
Summary: Screen-scraping library
Requires: soupsieve
Location: /path/to/.venv/lib/python3.12/site-packages

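⚠️ Note on the package name

The correct command is pip install beautifulsoup4 (with the digit 4). pip install beautifulsoup (without the 4) refers to the old BeautifulSoup 3 series, which is a different package. When importing, use the module name bs4: from bs4 import BeautifulSoup.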
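One way to confirm from Python that the BeautifulSoup 4 package is the one being imported is to print the version of the bs4 module. This is a minimal sketch that assumes beautifulsoup4 is installed in the active environment.

```python
import bs4
from bs4 import BeautifulSoup

# bs4.__version__ holds the installed BeautifulSoup version string.
version = bs4.__version__
print(version)

# The major version should be 4 when the beautifulsoup4 package is in use.
major = int(version.split(".")[0])
print(major)
```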

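Parser types and how to choose

BeautifulSoup is designed so that the HTML parsing engine (the parser) can be swapped.

Parser	String to pass	Speed	Characteristics	Extra install
lxml HTML	`"lxml"`	Fast	Good at repairing broken HTML. Recommended for production.	`pip install lxml`
html.parser	`"html.parser"`	Moderate	Python standard library. Nothing extra to install. Enough for simple scraping.	Not needed
lxml XML	`"xml"`	Fast	XML only. Case-sensitive.	`pip install lxml`
html5lib	`"html5lib"`	Slow	Parses the way a browser does. The most lenient, but slow.	`pip install html5lib`

✅ When in doubt, choose "lxml"

It offers the best balance of speed, accuracy, and tolerance for broken HTML, so use lxml unless you have a specific reason not to.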
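Because the parser is just a string argument, a script can prefer lxml and still run where it is not installed. The sketch below tries each parser in turn; make_soup and its preferred parameter are hypothetical names used only for illustration. It relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html.parser")):
    """Return (soup, parser_name) using the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # Raised by bs4 when the requested parser is not installed.
            continue
    raise RuntimeError("none of the requested parsers is installed")

# Deliberately broken HTML: the <p> and <b> tags are never closed.
broken = "<html><head><title>demo</title></head><body><p>first<p>second <b>bold</body></html>"
soup, used = make_soup(broken)
print(used)
print(soup.title.string)
print(len(soup.find_all("p")))
```

Different parsers may build different trees from invalid markup like this, so it is worth fixing the parser name explicitly once a script depends on a particular tree shape.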

Loading HTML

There are three ways to create a BeautifulSoup object.

① Create directly from an HTML string (for tests and learning)

Python

from bs4 import BeautifulSoup

html = """
<html>
  <head><title>サンプルページ</title></head>
  <body>
    <h1 id="main-title">見出し</h1>
    <p class="intro">はじめての <strong>BeautifulSoup</strong></p>
    <a href="https://example.com">リンク</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "lxml")
print(soup.title.text)   # サンプルページ

② Load from an HTML file

Python

from bs4 import BeautifulSoup

with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "lxml")

print(soup.title.text)

③ Fetch from a web page with requests

Python

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url, timeout=10)
response.encoding = response.apparent_encoding   # guard against garbled text (mojibake)

soup = BeautifulSoup(response.text, "lxml")
print(soup.title.text)

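💡 About encodings

On Japanese sites, forgetting response.encoding = response.apparent_encoding can produce garbled text. This typically happens when the Content-Type header is text/html without a charset, in which case requests assumes ISO-8859-1. apparent_encoding guesses the encoding from the response body using the character-detection library that requests depends on (charset_normalizer in recent versions, chardet in older ones), so a separate install is normally not needed. Because it is a guess, it can be wrong, especially for short pages.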
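An alternative worth knowing is to hand BeautifulSoup the raw bytes and let it work out the encoding itself, or to state the encoding with from_encoding when it is known. The sketch below simulates a Shift_JIS page with a locally encoded string instead of a real HTTP response; with requests, the bytes would come from response.content.

```python
from bs4 import BeautifulSoup

# Simulated Shift_JIS page; in real code these bytes would be response.content.
page = (
    '<html><head><meta charset="shift_jis"><title>日本語ページ</title></head>'
    "<body><p>こんにちは</p></body></html>"
)
raw = page.encode("shift_jis")

# Option 1: pass bytes and let BeautifulSoup detect the encoding
# (here it can read the declaration in the <meta> tag).
soup_auto = BeautifulSoup(raw, "html.parser")
print(soup_auto.title.string)
print(soup_auto.original_encoding)

# Option 2: state the encoding explicitly when you already know it.
soup_explicit = BeautifulSoup(raw, "html.parser", from_encoding="shift_jis")
print(soup_explicit.p.string)
```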

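find / find_all: searching for tags

These are the most frequently used methods in BeautifulSoup.

find(): returns the first element that matches the conditions (None if nothing matches)
find_all(): returns every matching element as a list (an empty list if nothing matches)

Python: search basics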

from bs4 import BeautifulSoup

html = """
<html><body>
  <h2 class="section">セクション A</h2>
  <p class="desc">Aの説明です</p>
  <h2 class="section">セクション B</h2>
  <p class="desc important">Bの説明です</p>
  <p id="note">補足事項</p>
</body></html>
"""
soup = BeautifulSoup(html, "lxml")

# --- Search by tag name ---
first_h2 = soup.find("h2")
print(first_h2.text)        # セクション A

all_h2 = soup.find_all("h2")
for h in all_h2:
    print(h.text)           # セクション A / セクション B

# --- Search by class attribute ---
desc_tags = soup.find_all("p", class_="desc")
print(len(desc_tags))       # 2 (class_="desc" also matches class="desc important")

# --- Search by id attribute ---
note = soup.find(id="note")
print(note.text)            # 補足事項

# --- Search for several tag names at once ---
results = soup.find_all(["h2", "p"])
print(len(results))         # 5

Output

セクション A
セクション A
セクション B
2
補足事項
5
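Since find() returns None when nothing matches, calling .text on its result directly raises AttributeError for a missing tag. A small guard avoids that. In this sketch, first_text is a hypothetical helper name, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = '<div><h2 class="section">セクション A</h2></div>'
soup = BeautifulSoup(html, "html.parser")

def first_text(root, name, default=""):
    """Return the stripped text of the first <name> tag, or default if there is none."""
    tag = root.find(name)
    return tag.get_text(strip=True) if tag is not None else default

print(first_text(soup, "h2"))             # the tag exists
print(first_text(soup, "h3", "(none)"))   # a missing tag falls back to the default

# find_all() never returns None; an empty result is simply an empty list.
missing = soup.find_all("h3")
print(len(missing))
```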

Ways to specify search conditions

Python: various search conditions

import re
from bs4 import BeautifulSoup

html = """
<ul>
  <li data-type="fruit">りんご</li>
  <li data-type="fruit">みかん</li>
  <li data-type="veggie">にんじん</li>
  <li class="special">特別品</li>
</ul>
"""
soup = BeautifulSoup(html, "lxml")

# ① Specify attributes with a dict
fruits = soup.find_all("li", attrs={"data-type": "fruit"})
for f in fruits:
    print(f.text)           # りんご / みかん

# ② Match tag names with a regular expression
headers = soup.find_all(re.compile(r"^h[1-6]$"))   # all of h1 to h6 (empty here: this sample has no headings)

# ③ Elements whose text matches exactly
carrots = soup.find_all("li", string="にんじん")
print(carrots[0].text)      # にんじん

# ④ Search text with a regular expression (returns the strings, not the tags)
items = soup.find_all(string=re.compile("んご"))
print(items)                # ['りんご']

# ⑤ Cap the number of results (limit)
top2 = soup.find_all("li", limit=2)
print(len(top2))            # 2

# ⑥ Search direct children only (recursive=False)
direct_li = soup.ul.find_all("li", recursive=False)
print(len(direct_li))       # 4
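Two more filter forms complement the conditions above: passing True as an attribute value to match any tag that has the attribute, and passing a function that decides tag by tag. The sketch below uses a shortened version of the same list; is_untyped_li is a hypothetical name chosen for illustration.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li data-type="fruit">りんご</li>
  <li data-type="veggie">にんじん</li>
  <li class="special">特別品</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# True as an attribute value matches any tag that has the attribute at all.
typed = soup.find_all("li", attrs={"data-type": True})
print([li.get_text() for li in typed])

# A function filter receives each Tag and returns True to keep it.
def is_untyped_li(tag):
    return tag.name == "li" and not tag.has_attr("data-type")

untyped = soup.find_all(is_untyped_li)
print([li.get_text() for li in untyped])
```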

Getting attributes and text

Once you have a tag, here is how to pull out its text and attribute values.

Python: getting attributes and text

from bs4 import BeautifulSoup

html = """
<div class="card" data-id="42">
  <a href="https://example.com/item" class="link">
    詳細を <strong>見る</strong>
  </a>
  <p>  空白を含む  テキスト  </p>
</div>
"""
soup = BeautifulSoup(html, "lxml")

div = soup.find("div")

# ── Getting text ──────────────────────────
# .text / .get_text()  → concatenates the text of all descendants
print(div.text)                          # all descendant text, with the original newlines and indentation kept

# strip=True trims the leading/trailing whitespace of each text node
# (whitespace inside a node is kept)
print(div.get_text(strip=True))          # 詳細を見る空白を含む  テキスト

# separator sets the string placed between text nodes
print(div.get_text(separator=" | ", strip=True))
# 詳細を | 見る | 空白を含む  テキスト

# ── Getting attributes ────────────────────
# Direct access with [] (KeyError if the attribute is missing)
print(div["class"])                      # ['card']
print(div["data-id"])                    # 42

# Safe access with .get() (None if the attribute is missing)
print(div.get("data-id"))               # 42
print(div.get("nonexistent"))           # None

# .attrs returns all attributes as a dict
print(div.attrs)                         # {'class': ['card'], 'data-id': '42'}

# ── Getting a link ────────────────────────
a_tag = soup.find("a")
print(a_tag["href"])                     # https://example.com/item
print(a_tag.get_text(strip=True))        # 詳細を見る

Output (excerpt)

詳細を見る空白を含む  テキスト
詳細を | 見る | 空白を含む  テキスト
['card']
42
https://example.com/item

⚠️ Difference between .text and .get_text()

.text is shorthand for .get_text() called with no arguments. When you want to control whitespace or the separator, use .get_text(strip=True, separator="...").
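Note that strip=True only trims the outer whitespace of each text node, so runs of spaces inside a node survive. When fully normalized text is needed, one common approach is to split on whitespace and rejoin. The sketch below uses a compact version of the card markup from this section.

```python
from bs4 import BeautifulSoup

html = '<div><a href="#">詳細を <strong>見る</strong></a><p>  空白を含む  テキスト  </p></div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# stripped_strings yields each text node with its outer whitespace removed.
parts = list(div.stripped_strings)
print(parts)

# Collapse every run of whitespace (including inside a node) to a single space.
normalized = " ".join(div.get_text(separator=" ").split())
print(normalized)
```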

Searching with CSS selectors (select)

If you have front-end development experience, select() will feel intuitive. Most CSS selector syntax works as-is.

Python: select / select_one

from bs4 import BeautifulSoup

html = """
<div class="container">
  <section id="main">
    <h2>タイトル 1</h2>
    <ul class="list">
      <li class="item active">りんご</li>
      <li class="item">みかん</li>
    </ul>
  </section>
  <section id="sub">
    <h2>タイトル 2</h2>
    <ul class="list">
      <li class="item">にんじん</li>
    </ul>
  </section>
</div>
"""
soup = BeautifulSoup(html, "lxml")

# select_one → first match (equivalent to find)
h2 = soup.select_one("h2")
print(h2.text)                          # タイトル 1

# select → list of all matches (equivalent to find_all)
all_li = soup.select("li")
print([li.text for li in all_li])       # ['りんご', 'みかん', 'にんじん']

# Class selector
active = soup.select("li.active")
print(active[0].text)                   # りんご

# id selector
main_h2 = soup.select_one("#main h2")
print(main_h2.text)                     # タイトル 1

# Descendant combinator (space)
items_in_sub = soup.select("#sub .item")
print([i.text for i in items_in_sub])  # ['にんじん']

# Child combinator (>)
direct = soup.select(".list > li")
print(len(direct))                      # 3

# Attribute selectors
# [attr]        → the attribute exists
# [attr=val]    → the attribute equals val
# [attr^=val]   → starts with val
# [attr$=val]   → ends with val
# [attr*=val]   → contains val
links = soup.select("a[href^='https']")   # [] here: this sample has no <a> tags

# :nth-of-type (n-th sibling of the same tag)
second_li = soup.select("ul.list:first-of-type li:nth-of-type(2)")
print(second_li[0].text)               # みかん

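Selector	Meaning	Example
`tag`	Tag name	`"li"`
`.class`	Class name	`".item"`
`#id`	id	`"#main"`
`A B`	B that is a descendant of A	`"div p"`
`A > B`	B that is a direct child of A	`"ul > li"`
`A + B`	B that is the sibling immediately after A	`"h2 + p"`
`[attr=val]`	Attribute value equals val	`"a[href='#']"`
`[attr*=val]`	Attribute value contains val	`"a[href*='example']"`
`:nth-of-type(n)`	n-th element among siblings of the same tag (1-based)	`"li:nth-of-type(1)"`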
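The adjacent-sibling combinator and the attribute selectors in the table are easier to see on markup that actually contains headings and links. The following sketch uses a made-up fragment; the URLs are placeholders.

```python
from bs4 import BeautifulSoup

html = """
<div>
  <h2>資料</h2>
  <p>直後の段落</p>
  <p>2つ目の段落</p>
  <a href="https://example.com/report.pdf">PDF</a>
  <a href="http://example.com/page">HTTP</a>
  <a name="anchor-only">no href</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A + B: only the <p> immediately following the <h2>.
after_h2 = [p.get_text() for p in soup.select("h2 + p")]
print(after_h2)

# [attr^=val] and [attr$=val]: match the start or the end of the value.
https_links = [a.get_text() for a in soup.select("a[href^='https']")]
pdf_links = [a.get_text() for a in soup.select("a[href$='.pdf']")]
print(https_links, pdf_links)

# [attr]: the attribute merely has to exist.
with_href = soup.select("a[href]")
print(len(with_href))
```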

Moving up and down the tree

BeautifulSoup objects form a tree, so you can move to parent, child, and sibling elements.

Python: getting parents, children, and siblings

from bs4 import BeautifulSoup

html = """
<article>
  <h2>タイトル</h2>
  <p class="lead">リード文</p>
  <p>本文1</p>
  <p>本文2</p>
</article>
"""
soup = BeautifulSoup(html, "lxml")

p_lead = soup.find("p", class_="lead")

# ── Parent ──────────────────────────────
print(p_lead.parent.name)              # article

# ── Children ────────────────────────────
article = soup.find("article")
# children → iterator (includes text nodes too)
for child in article.children:
    if child.name:                     # tags only (text nodes have name None)
        print(child.name, child.get_text(strip=True))

# contents → list (same items as children)
print(article.contents[1].text)        # タイトル (index 0 is the whitespace text node before <h2>)

# ── Siblings ─────────────────────────────
# next_sibling / previous_sibling include text nodes
# next_siblings / previous_siblings are iterators
h2 = soup.find("h2")
for sib in h2.next_siblings:
    if sib.name:
        print(sib.get_text(strip=True))
# リード文 / 本文1 / 本文2

# ── Walking up the ancestors ─────────────
for parent in p_lead.parents:
    print(parent.name)                 # article → body → html → [document]

# ── Chaining tag names with dots ────────
print(soup.article.h2.text)            # タイトル
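Because the plain sibling attributes also return whitespace text nodes, the find_next_sibling() and find_parent() family is often more convenient: those methods skip straight to matching tags. A minimal sketch on a shortened version of the article markup:

```python
from bs4 import BeautifulSoup

html = """
<article>
  <h2>タイトル</h2>
  <p class="lead">リード文</p>
  <p>本文1</p>
</article>
"""
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# .next_sibling is the whitespace text node between the tags...
gap = h2.next_sibling
print(repr(gap))

# ...while find_next_sibling() skips ahead to the next matching Tag.
lead = h2.find_next_sibling("p")
print(lead.get_text())

# find_next_siblings() collects all following sibling tags that match.
following = [p.get_text() for p in h2.find_next_siblings("p")]
print(following)

# find_parent() walks up to the nearest matching ancestor.
owner = lead.find_parent("article")
print(owner.name)
```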

Fetching from the web with requests

Here is an end-to-end pattern for scraping real web pages.

Python: basic web scraping pattern

import time
import requests
from bs4 import BeautifulSoup

# ── Constants ──────────────────────────────────────────────
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    )
}

def fetch_soup(url: str, encoding: str | None = None) -> BeautifulSoup:
    """Helper that fetches a URL and returns a BeautifulSoup object."""
    resp = requests.get(url, headers=HEADERS, timeout=15)
    resp.raise_for_status()                        # raise an exception on 4xx/5xx responses
    if encoding:
        resp.encoding = encoding
    else:
        resp.encoding = resp.apparent_encoding     # auto-detect
    return BeautifulSoup(resp.text, "lxml")

# ── Usage ───────────────────────────────────────────────────
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
]

results = []

for url in urls:
    try:
        soup = fetch_soup(url)
        title = soup.find("h1")
        results.append({
            "url":   url,
            "title": title.get_text(strip=True) if title else ""
        })
    except requests.RequestException as e:
        print(f"Fetch failed: {url}: {e}")
    finally:
        time.sleep(1)                              # be considerate to the server (wait 1 second)

for r in results:
    print(r)

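⚠️ Scraping precautions

① Always check the target site's robots.txt and terms of service. ② Put a sleep (time.sleep()) between consecutive requests so you do not overload the server. ③ Do not collect or reuse data in ways that may violate copyright law or the Unfair Competition Prevention Act.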
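The robots.txt check in ① can be automated with the standard library's urllib.robotparser. The sketch below parses a hypothetical robots.txt held in a string so that it runs without network access; the rules and the user agent name "my-scraper" are assumptions for illustration. In a real script, set_url() followed by read() would download the site's actual file instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would download it from the site.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

USER_AGENT = "my-scraper"  # hypothetical user agent name

def allowed(url):
    """Return True if the parsed rules allow USER_AGENT to fetch url."""
    return parser.can_fetch(USER_AGENT, url)

print(allowed("https://example.com/page1"))
print(allowed("https://example.com/private/data"))
```

Passing this check only covers robots.txt. The terms of service still have to be read by a person.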

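Summary: method cheat sheet

Goal	Method	Return value
Get the first element	`soup.find("tag", class_="class")`	Tag or None
Get all elements	`soup.find_all("tag")`	list[Tag]
First element by CSS selector	`soup.select_one("selector")`	Tag or None
All elements by CSS selector	`soup.select("selector")`	list[Tag]
Get text	`tag.get_text(strip=True)`	str
Get an attribute value	`tag["href"]` / `tag.get("href")`	str (a list for multi-valued attributes such as class). When the attribute is missing, `[]` raises KeyError and `.get()` returns None
Get all attributes	`tag.attrs`	dict
Parent element	`tag.parent`	Tag
Following siblings (multiple)	`tag.next_siblings`	iterator
List of children	`tag.contents`	list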

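✅ In the next chapter

PART 03 focuses on working with HTML tables (<table>). It covers how to pick out a specific table on a page that contains several, and how to convert rows, columns, and cells into a 2D list or a pandas DataFrame.

→ To PART 03: Working with tables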

PART 02: HTML Parser Primer
Installing BeautifulSoup and basic operations
