pytilpack.htmlrag

Required extra

pip install pytilpack[htmlrag]

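`pytilpack.htmlrag`

Utilities related to HtmlRAG.

The upstream HtmlRAG toolkit has demanding dependency requirements if all you need is `clean_html`, so this module provides an extracted copy of that function. It also adds some extensions of its own.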

https://github.com/plageon/HtmlRAG/blob/main/toolkit/README.md
https://github.com/plageon/HtmlRAG/blob/main/toolkit/LICENSE
https://github.com/plageon/HtmlRAG/blob/main/toolkit/htmlrag/html_utils.py

`DEFAULT_ACCEPT = 'text/markdown,text/plain;q=0.9,text/html,application/xhtml+xml,application/xml;q=0.8,*/*;q=0.7'` `module-attribute`

Default value for the `Accept` header.
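
To see how the default value ranks media types, the header can be split into media ranges and their `q` weights. The following is a minimal sketch and not part of pytilpack: `parse_accept` is a hypothetical helper, and it assumes the usual HTTP convention that a media range without a `q` parameter has weight 1.

```python
# The value documented above, copied here so the sketch is self-contained.
DEFAULT_ACCEPT = (
    "text/markdown,text/plain;q=0.9,text/html,"
    "application/xhtml+xml,application/xml;q=0.8,*/*;q=0.7"
)


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Return (media_range, q) pairs sorted by descending q.

    Hypothetical helper for illustration; entries with equal q keep header order.
    """
    entries: list[tuple[str, float]] = []
    for item in header.split(","):
        media_range, *params = (part.strip() for part in item.split(";"))
        q = 1.0  # assumption: a missing q parameter means weight 1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value)
        entries.append((media_range, q))
    # sorted() is stable, so ties stay in their original order.
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


for media_range, q in parse_accept(DEFAULT_ACCEPT):
    print(f"{q:.1f}  {media_range}")
```

Under that reading, `text/markdown`, `text/html`, and `application/xhtml+xml` all carry the top weight, because each `q` value applies only to the media range it is attached to. How a given server breaks the tie is up to the server.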

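`fetch_url(url, no_verify=False, accept=DEFAULT_ACCEPT, user_agent=None)`

Fetches content from a URL. HTML responses are simplified with `clean_html`; Markdown, plain-text, XML, and JSON responses are returned unchanged.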

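Parameters:

Name	Type	Description	Default
`url`	`str`	URL to fetch	required
`no_verify`	`bool`	Whether to disable SSL certificate verification	`False`
`accept`	`str`	Content types to accept (value of the `Accept` header)	`DEFAULT_ACCEPT`
`user_agent`	`str \| None`	`User-Agent` header; when omitted, the value from `get_default_user_agent()` is used	`None`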

Returns:

Type	Description
`str`	The simplified HTML content

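Raises:

Type	Description
`Exception`	If the HTTP fetch or HTML parsing fails. The source listing raises `RuntimeError` for a non-200 status and for an unsupported `Content-Type`.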

Source code location: pytilpack/htmlrag.py

def fetch_url(
    url: str,
    no_verify: bool = False,
    accept: str = DEFAULT_ACCEPT,
    user_agent: str | None = None,
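) -> str:
    """Fetch HTML from a URL and return it in simplified form.

    Args:
        url: URL to fetch
        no_verify: Whether to disable SSL certificate verification
        accept: Content types to accept
        user_agent: User-Agent header (the default value is used when omitted)

    Returns:
        The simplified HTML content

    Raises:
        Exception: If an error occurs during the HTTP fetch or HTML parsing
    """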
    if user_agent is None:
        user_agent = get_default_user_agent()

    r = httpx.get(
        url,
        headers={
            "Accept": accept,
            "User-Agent": user_agent,
        },
        verify=not no_verify,
        follow_redirects=True,
    )

    if r.status_code != 200:
        # Message in English: "Failed to fetch URL {url}. Status: {status_code}"
        raise RuntimeError(f"URL {url} の取得に失敗しました。Status: {r.status_code}\n{r.text}")

    content_type = r.headers.get("Content-Type", "text/html")
    if (
        "text/markdown" in content_type
        or "text/plain" in content_type
        or "text/xml" in content_type
        or "application/xml" in content_type
        or "application/json" in content_type
    ):
        return r.text

    if "html" not in content_type:
        # Message in English: "URL {url} is not HTML. Content-Type: {content_type}"
        raise RuntimeError(f"URL {url} はHTMLではありません。Content-Type: {content_type}\n{r.text[:100]}...")

    content = r.text
    output = clean_html(
        content,
        aggressive=True,
        keep_title=True,
        keep_href=True,
    )
    return output
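
The `Content-Type` branching in the listing above can be isolated as a pure function, which makes the three outcomes easier to see. This is a sketch for illustration only: `classify_content_type` and `PASSTHROUGH_MARKERS` are hypothetical names, not part of pytilpack.

```python
from typing import Literal

# Substrings that the listing above treats as "return the body unchanged".
PASSTHROUGH_MARKERS = (
    "text/markdown",
    "text/plain",
    "text/xml",
    "application/xml",
    "application/json",
)


def classify_content_type(
    content_type: str,
) -> Literal["passthrough", "html", "unsupported"]:
    """Mirror the order of checks in fetch_url (hypothetical helper).

    passthrough: the body is returned as-is
    html:        the body is passed through clean_html
    unsupported: fetch_url raises RuntimeError
    """
    if any(marker in content_type for marker in PASSTHROUGH_MARKERS):
        return "passthrough"
    if "html" in content_type:
        return "html"
    return "unsupported"


for value in ("text/html; charset=utf-8", "application/json", "image/png"):
    print(value, "->", classify_content_type(value))
```

As in the listing, these are case-sensitive substring checks, so parameters such as `charset` do not affect the result, while a header with unusual capitalization would fall through to the unsupported branch.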

`get_default_user_agent()`

Returns the default `User-Agent` header value.

Source code location: pytilpack/htmlrag.py

def get_default_user_agent():
    """Return the default User-Agent header value."""
    version = importlib.metadata.version("pytilpack")
    user_agent = f"pytilpack/{version} (+https://github.com/ak110/pytilpack)"
    return user_agent
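
The same `name/version (+url)` pattern can be reused for other tools. The sketch below is a hypothetical variant, not the library's function: `build_user_agent` is an invented name, and the `"unknown"` fallback for a package without installed metadata is an assumption that the listing above does not make.

```python
import importlib.metadata


def build_user_agent(package: str, url: str) -> str:
    """Build a 'package/version (+url)' User-Agent string (hypothetical helper)."""
    try:
        version = importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Assumption: use a placeholder instead of failing when the
        # distribution metadata is missing (e.g. running from a source tree).
        version = "unknown"
    return f"{package}/{version} (+{url})"


print(build_user_agent("pytilpack", "https://github.com/ak110/pytilpack"))
```

Including a contact URL in the header gives site operators a way to identify the client, which is the apparent intent of the default value.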

`clean_html(html, aggressive=False, keep_title=None, keep_href=None, remove_span=None)`

Removes tags from HTML that are unnecessary for LLM input.

Parameters:

Name	Type	Description	Default
`html`	`str \| bytes`	HTML string	required
`aggressive`	`bool`	Whether to strip more aggressively.	`False`
`keep_title`	`bool \| None`	Whether to keep the title tag. When `None`, resolves to `not aggressive`.	`None`
`keep_href`	`bool \| None`	Whether to keep href attributes. When `None`, resolves to `not aggressive`.	`None`
`remove_span`	`bool \| None`	Whether to remove span tags. Deprecated; use `aggressive=True` instead.	`None`

Returns:

Type	Description
`str`	The processed HTML string

Source code location: pytilpack/htmlrag.py

def clean_html(
    html: str | bytes,
    aggressive: bool = False,
    keep_title: bool | None = None,
    keep_href: bool | None = None,
    remove_span: bool | None = None,
) -> str:
    """Remove tags from HTML that are unnecessary for LLM input.

    Args:
        html: HTML string
        aggressive: Whether to strip more aggressively. Defaults to False.
        keep_title: Whether to keep the title tag. Defaults to 'not aggressive'.
        keep_href: Whether to keep href attributes. Defaults to 'not aggressive'.
        remove_span: Whether to remove span tags. (deprecated)

    Returns:
        The processed HTML string

    """
    if remove_span is not None:
        warnings.warn(
            "remove_span is deprecated. Use aggressive=True to remove span tags.",
            DeprecationWarning,
            stacklevel=2,
        )
        aggressive = remove_span
    if keep_title is None:
        keep_title = not aggressive
    if keep_href is None:
        keep_href = not aggressive

    soup = bs4.BeautifulSoup(html, "html.parser")
    html = _simplify_html(soup, aggressive=aggressive, keep_title=keep_title, keep_href=keep_href)
    html = _clean_xml(html)
    return html
