Implementation ①: Extracting from Static Pages with requests + BeautifulSoup

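I once tried to link label elements to their inputs using the for attribute alone, and missed nearly every field in forms that wrap their inputs inside the label. This part records the trial and error that led to a lookup with a fixed priority order, together with the implementation code.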

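Design policy for the implementation

Figure 1: Static extraction implementation with requests + BeautifulSoup

Implement a function extract_static(url) that takes a URL and returns a list of InputEntity
Keep the excluded types (submit / button / reset / image) in a single set so that it is easy to extend
Look up the label in the priority order for → wrapping label → aria-label
Leave error handling to the caller; inside the function, only raise_for_status() is called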
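The design above returns a list of InputEntity, but extractor/models.py itself is not shown in this part. The following is only a sketch of what that model might look like: the field names are taken from the keyword arguments that static.py passes, while the types and default values are assumptions.

```python
import dataclasses
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class InputEntity:
    """Hypothetical model: one form control found on a page."""
    tag: str                          # "input", "select" or "textarea"
    type: str                         # input type, or "select" / "textarea"
    name: Optional[str] = None
    id: Optional[str] = None
    label: Optional[str] = None
    placeholder: Optional[str] = None
    required: bool = False
    maxlength: Optional[str] = None   # HTML attribute values arrive as strings
    minlength: Optional[str] = None
    min: Optional[str] = None
    max: Optional[str] = None
    pattern: Optional[str] = None
    value: Optional[str] = None
    options: List[str] = field(default_factory=list)  # only used for select
    page_url: str = ""


# A dataclass converts directly to a dict, which keeps JSON output simple
entity = InputEntity(tag="input", type="email", name="custemail", required=True)
print(dataclasses.asdict(entity))
```

Giving every optional attribute a default lets the select and textarea branches pass only the fields that apply to them.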

Implementing static.py

Python: extractor/static.py (complete version)

"""static.py: static-page INPUT extraction with requests + BeautifulSoup"""
from __future__ import annotations

import requests
from bs4 import BeautifulSoup, Tag
from typing import Optional

from .models import InputEntity

# input types to skip
SKIP_TYPES = {"submit", "button", "reset", "image"}

# BeautifulSoup parser (use lxml if it is installed, otherwise fall back to html.parser)
try:
    import lxml  # noqa
    _PARSER = "lxml"
except ImportError:
    _PARSER = "html.parser"


def _get_label(soup: BeautifulSoup, tag: Tag) -> Optional[str]:
    """Find the label text that corresponds to an INPUT element."""
    # Priority 1: a label whose for attribute matches the element's id
    tag_id = tag.get("id")
    if tag_id:
        label = soup.find("label", attrs={"for": tag_id})
        if label:
            return label.get_text(strip=True)

    # Priority 2: the element is wrapped in a label tag
    parent = tag.parent
    while parent:
        if parent.name == "label":
            # Collect the label's text, skipping strings that belong to the
            # element itself (e.g. the option text of a wrapped select)
            own = {id(s) for s in tag.strings}
            texts = [s.strip() for s in parent.strings if id(s) not in own and s.strip()]
            return " ".join(texts) if texts else None
        parent = parent.parent

    # Priority 3: aria-label attribute
    aria = tag.get("aria-label")
    if aria:
        return aria.strip()

    return None


def _extract_tag(soup: BeautifulSoup, tag: Tag, url: str) -> Optional[InputEntity]:
    """Convert one tag into an InputEntity (returns None for skipped elements)."""
    tag_name = tag.name.lower()

    if tag_name == "input":
        input_type = (tag.get("type") or "text").lower()
        if input_type in SKIP_TYPES:
            return None
        return InputEntity(
            tag=tag_name,
            type=input_type,
            name=tag.get("name"),
            id=tag.get("id"),
            label=_get_label(soup, tag),
            placeholder=tag.get("placeholder"),
            required=tag.has_attr("required"),
            maxlength=tag.get("maxlength"),
            minlength=tag.get("minlength"),
            min=tag.get("min"),
            max=tag.get("max"),
            pattern=tag.get("pattern"),
            value=tag.get("value"),
            page_url=url,
        )

    elif tag_name == "select":
        options = [
            opt.get_text(strip=True)
            for opt in tag.find_all("option")
            if opt.get("value") != ""  # skip empty placeholders such as "Please select"
        ]
        return InputEntity(
            tag=tag_name,
            type="select",
            name=tag.get("name"),
            id=tag.get("id"),
            label=_get_label(soup, tag),
            required=tag.has_attr("required"),
            options=options,
            page_url=url,
        )

    elif tag_name == "textarea":
        return InputEntity(
            tag=tag_name,
            type="textarea",
            name=tag.get("name"),
            id=tag.get("id"),
            label=_get_label(soup, tag),
            placeholder=tag.get("placeholder"),
            required=tag.has_attr("required"),
            maxlength=tag.get("maxlength"),
            page_url=url,
        )

    return None


def extract_static(url: str, timeout: int = 10) -> list[InputEntity]:
    """
    Extract INPUT / SELECT / TEXTAREA elements from a static page and return them as a list of InputEntity.

    Args:
        url: URL of the page to extract from
        timeout: requests timeout in seconds

    Returns:
        A list of InputEntity
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()

    # Pass the raw bytes so that BeautifulSoup detects the encoding itself;
    # setting response.encoding would have no effect on response.content
    soup = BeautifulSoup(response.content, _PARSER)
    results: list[InputEntity] = []

    for tag in soup.find_all(["input", "select", "textarea"]):
        entity = _extract_tag(soup, tag, url)
        if entity is not None:
            results.append(entity)

    return results
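Because extract_static() only calls raise_for_status(), network and HTTP errors surface at the call site. The following is a minimal sketch of how a caller might handle them. The wrapper name extract_or_empty and the decision to return an empty list on failure are assumptions for illustration, not part of the original module.

```python
from typing import Callable, List

import requests


def extract_or_empty(extract: Callable[[str], List], url: str) -> List:
    """Hypothetical wrapper: run an extractor and return [] if the request fails."""
    try:
        return extract(url)
    except requests.exceptions.HTTPError as exc:
        # raised by raise_for_status() on 4xx / 5xx responses
        print(f"HTTP error for {url}: {exc}")
    except requests.exceptions.RequestException as exc:
        # base class for connection errors, timeouts, invalid URLs and so on
        print(f"Request failed for {url}: {exc}")
    return []
```

A caller would pass the extractor itself, for example extract_or_empty(extract_static, url). Whether to swallow the error, retry, or re-raise depends on the application, which is why the extractor leaves that decision outside.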

The label lookup logic in detail

_get_label() searches for the label text in three priority levels. Let's check what each pattern looks like in real HTML.

Python: label lookup logic (excerpt)

# Priority 1: <label for="username"> ... </label> + <input id="username">
label = soup.find("label", attrs={"for": tag_id})

# Priority 2: <label>User name<input ...></label>
#   walk up through parent to find a label tag
while parent:
    if parent.name == "label": ...
    parent = parent.parent

# Priority 3: <input aria-label="Search terms">
aria = tag.get("aria-label")
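To see the three patterns side by side, the sketch below runs a condensed version of the lookup against a small inline form. The HTML snippet and the helper name find_label are made up for this example, and it uses find_parent() in place of the manual parent loop.

```python
from typing import Optional

from bs4 import BeautifulSoup, Tag

# Hypothetical form covering each pattern, plus one input with no label at all
HTML = """
<form>
  <label for="username">User name</label><input id="username" name="username">
  <label>Email <input name="email" type="email"></label>
  <input name="q" aria-label="Search terms">
  <input name="orphan">
</form>
"""


def find_label(soup: BeautifulSoup, tag: Tag) -> Optional[str]:
    """Condensed for -> wrapping label -> aria-label lookup."""
    tag_id = tag.get("id")
    if tag_id:
        label = soup.find("label", attrs={"for": tag_id})
        if label:
            return label.get_text(strip=True)
    wrapper = tag.find_parent("label")  # nearest enclosing <label>, if any
    if wrapper:
        return wrapper.get_text(" ", strip=True) or None
    aria = tag.get("aria-label")
    return aria.strip() if aria else None


soup = BeautifulSoup(HTML, "html.parser")
labels = {tag["name"]: find_label(soup, tag) for tag in soup.find_all("input")}
print(labels)
```

The input named orphan matches none of the three patterns, so its label comes back as None.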

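Handling select / textarea

📌 Getting the options of a select

Empty-valued choices such as <option value="">Please select</option> are excluded. Only the choices that can actually be selected go into the options field, stored as their display text.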
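The effect of that filter can be checked on a small inline select. This is a sketch with made-up HTML. It applies the same condition as static.py, opt.get("value") != "".

```python
from bs4 import BeautifulSoup

# Hypothetical select: one empty placeholder, two valued options,
# and one option without a value attribute
HTML = """
<select name="size">
  <option value="">Please select</option>
  <option value="s">Small</option>
  <option value="m">Medium</option>
  <option>Large</option>
</select>
"""

soup = BeautifulSoup(HTML, "html.parser")
select = soup.find("select")

# Drop only options whose value is the empty string.
# An option without a value attribute (get() returns None) is kept.
options = [
    opt.get_text(strip=True)
    for opt in select.find_all("option")
    if opt.get("value") != ""
]
print(options)
```

Note that the list holds the display text ("Small"), not the submitted value ("s"). If the submitted values matter, collecting opt.get("value") alongside the text would be a natural extension.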

JSON output

Python: JSON output helper

import json
import dataclasses
from pathlib import Path

def save_json(entities: list, output_path: str) -> None:
    """Save a list of InputEntity to a JSON file."""
    data = [dataclasses.asdict(e) for e in entities]
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create the output directory if it is missing
    path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2),
        encoding="utf-8"
    )
    print(f"Saved: {output_path} ({len(data)} entries)")

Verifying the behavior

Python: examples/run_extract.py

from extractor.static import extract_static
from extractor.utils import save_json  # the save_json above, placed in utils.py

url = "https://httpbin.org/forms/post"  # form page for testing
entities = extract_static(url)

print(f"Extracted: {len(entities)} entries")
for e in entities:
    print(f"  [{e.type:10}] name={e.name!r:20}  label={e.label!r}")

save_json(entities, "output/result.json")

Example output (illustrative; the actual entries depend on the page's markup)

Extracted: 6 entries
  [text      ] name='custname'          label='Customer name'
  [tel       ] name='custtel'           label='Telephone'
  [email     ] name='custemail'         label='E-mail address'
  [select    ] name='size'              label='Pizza Size'
  [checkbox  ] name='topping'           label=None
  [textarea  ] name='comments'          label='Any comments?'

✅ In the next part

PART 07 covers an implementation that uses Playwright to handle dynamic pages rendered by JavaScript, focusing on the differences from the static version.


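Common mistakes in static-page implementations

This section summarizes the problems that tend to occur when extracting entities from static HTML forms with requests and BeautifulSoup.

Mistake / failure pattern	What happens	Correct handling / prevention
Fetching a single form with find("form") and missing the other forms on the page	When a page contains several forms, the second and later forms are never processed	Fetch every form with find_all("form") and pick the target form by its action or id
Extracting inputs without regard to how label tags map to them	The label (meaning) of each field is lost, so the entities cannot be given a meaning	Match the label's for attribute against the input's id, then fall back to a wrapping label and aria-label
Sending a request while the form's action is still a relative URL	Calling requests with a path-only URL raises a MissingSchema error	Join it with the base URL using urllib.parse.urljoin to get an absolute URL before calling requests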
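The first and third rows can be illustrated together. The sketch below collects every form on a page and resolves each action against the page URL. The page URL, the form ids, and the action values are all hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://example.com/shop/order.html"  # hypothetical page URL
HTML = """
<form id="search" action="/search"><input name="q"></form>
<form id="order" action="submit.php" method="post"><input name="custname"></form>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find_all("form") returns every form; find("form") would return only the first
forms = {form.get("id"): form for form in soup.find_all("form")}

# Resolve each relative action against the page URL before sending anything
actions = {
    form_id: urljoin(PAGE_URL, form.get("action") or "")
    for form_id, form in forms.items()
}
print(actions)
```

urljoin handles both root-relative paths ("/search") and document-relative paths ("submit.php"), and an empty action resolves to the page URL itself.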

PART 06: Implementation ①: Extracting from Static Pages
Extract INPUT elements with requests + BeautifulSoup and output them as JSON