HTML processing

HTML actions¶

conatus.actions.preloaded.html_actions ¶

Standard Actions for HTML processing.

Note that these actions require beautifulsoup4 and lxml to be installed. If that's not the case, you can install them with:

uv add "conatus[html]"
# or: pip install "conatus[html]"

html_full_view `conatus-action` ¶

html_full_view(
    elements: HTMLElements, nth_element: int = 0
) -> str

This function is an Action

You can call html_full_view just like a regular function, but note that it is actually an Action object.

This means that:

html_full_view has additional properties and methods that you can use (see the Action documentation for more information);
but it also means that operations like issubclass and isinstance will not work as expected.

Get the full view of the nth element in the result set.

Despite the name, the output is capped at the first 1500 characters of the element.

PARAMETER	DESCRIPTION
`elements`	The HTMLElements object. TYPE: `HTMLElements`
`nth_element`	The index of the element to get the full view of. TYPE: `int` DEFAULT: `0`

RETURNS	DESCRIPTION
`str`	The full view of the nth element in the result set.

Source code in conatus/actions/preloaded/html_actions.py

@action
def html_full_view(elements: "HTMLElements", nth_element: int = 0) -> str:
    """Get the full view of the nth element in the result set.

    (Actually, we still return only the first 1500 characters of the
    element.)

    Args:
        elements (HTMLElements): The HTMLElements object.
        nth_element (int): The index of the element to get the full view of.

    Returns:
        (str): The full view of the nth element in the result set.
    """
    return elements._full_view(nth_element)  # noqa: SLF001

html_view_slice `conatus-action` ¶

html_view_slice(
    elements: HTMLElements, start: int, end: int
) -> str

This function is an Action

You can call html_view_slice just like a regular function, but note that it is actually an Action object.

This means that:

html_view_slice has additional properties and methods that you can use (see the Action documentation for more information);
but it also means that operations like issubclass and isinstance will not work as expected.

Get a slice of the HTMLElements object.

Don't worry about going out of bounds. We'll handle it.

We enforce a maximum of 100 characters per element, and a maximum of 30 elements at a time, with a total of 500 characters.

PARAMETER	DESCRIPTION
`elements`	The HTMLElements object. TYPE: `HTMLElements`
`start`	The start index. TYPE: `int`
`end`	The end index. TYPE: `int`

RETURNS	DESCRIPTION
`str`	The slice of the HTMLElements object.
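
How the three limits interact is not spelled out here. One plausible reading, sketched with plain strings standing in for elements: clamp the indices, take at most 30 elements, cut each one to 100 characters, and stop once 500 characters have been collected. The name `view_slice`, its constants, and the list return value are hypothetical; the real `_view_slice` method returns a string and may truncate or format differently.

```python
MAX_CHARS_PER_ELEMENT = 100
MAX_ELEMENTS = 30
MAX_TOTAL_CHARS = 500


def view_slice(elements: list[str], start: int, end: int) -> list[str]:
    """Return a bounded preview of elements[start:end] (illustrative only)."""
    # Clamp the indices so out-of-bounds values are harmless.
    start = max(0, min(start, len(elements)))
    end = max(start, min(end, len(elements), start + MAX_ELEMENTS))

    preview: list[str] = []
    budget = MAX_TOTAL_CHARS
    for element in elements[start:end]:
        if budget <= 0:
            break
        # Respect both the per-element cap and the remaining total budget.
        piece = element[: min(MAX_CHARS_PER_ELEMENT, budget)]
        preview.append(piece)
        budget -= len(piece)
    return preview


# 50 fake elements of 120 characters each.
fake_elements = ["<p>" + "x" * 117 for _ in range(50)]
result = view_slice(fake_elements, start=40, end=1000)
print(len(result), sum(len(piece) for piece in result))  # 5 500
```

Under this reading, the 500-character total is the limit that bites first: five elements of 100 characters exhaust it long before the 30-element cap is reached.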

Source code in conatus/actions/preloaded/html_actions.py

@action
def html_view_slice(
    elements: HTMLElements,
    start: int,
    end: int,
) -> str:
    """Get a slice of the HTMLElements object.

    Don't worry about going out of bounds. We'll handle it.

    We enforce a maximum of 100 characters per element, and a maximum
    of 30 elements at a time, with a total of 500 characters.

    Args:
        elements: The HTMLElements object.
        start: The start index.
        end: The end index.

    Returns:
        The slice of the HTMLElements object.
    """
    return elements._view_slice(start, end)  # noqa: SLF001

html_find_all `conatus-action` ¶

html_find_all(
    html: HTML,
    *,
    class_: str | None = None,
    href: str | None = None,
    id: str | None = None,
    tag: str | None = None,
    attrs_pairs: list[str] | None = None,
    treat_as_regex: list[str] | None = None
) -> HTMLElements

This function is an Action

You can call html_find_all just like a regular function, but note that it is actually an Action object.

This means that:

html_find_all has additional properties and methods that you can use (see the Action documentation for more information);
but it also means that operations like issubclass and isinstance will not work as expected.

Find all HTML elements that match the given criteria.

Note that all the attribute arguments are optional. By default, we match the exact value of the attribute. To match a regex pattern instead, pass the attribute name in the treat_as_regex argument.

A regex pattern matches anywhere inside the attribute value. If you need it to match the whole value, add '^' and '$' at the beginning and end of the pattern.
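
The effect of the anchors can be checked with Python's `re` module alone. Beautiful Soup filters against a compiled pattern with its `search()` method, so an unanchored pattern succeeds on any substring. A minimal sketch, in which the class names are invented for illustration:

```python
import re

# Hypothetical class values, invented for this example.
class_values = ["btn", "btn-primary", "nav-btn"]

unanchored = re.compile("btn")
anchored = re.compile("^btn$")

# search() succeeds as soon as the pattern occurs anywhere in the value.
loose_matches = [v for v in class_values if unanchored.search(v)]
# With '^' and '$', only the full value matches.
strict_matches = [v for v in class_values if anchored.search(v)]

print(loose_matches)   # ['btn', 'btn-primary', 'nav-btn']
print(strict_matches)  # ['btn']
```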

PARAMETER	DESCRIPTION
`html`	The HTML object. TYPE: `HTML`
`class_`	The class of the elements to find. TYPE: `str \| None` DEFAULT: `None`
`href`	The href of the elements to find. TYPE: `str \| None` DEFAULT: `None`
`id`	The id of the elements to find. TYPE: `str \| None` DEFAULT: `None`
`tag`	The tag (or name) of the elements to find. TYPE: `str \| None` DEFAULT: `None`
`attrs_pairs`	The attributes to find. These are pairs of attribute name and value. TYPE: `list[str] \| None` DEFAULT: `None`
`treat_as_regex`	The attributes to treat as regex patterns. TYPE: `list[str] \| None` DEFAULT: `None`

RETURNS	DESCRIPTION
`HTMLElements`	The HTML elements that match the given criteria.

RAISES	DESCRIPTION
`ValueError`	If the number of elements in attrs_pairs is odd.
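
`attrs_pairs` is a flat list that alternates attribute names and values. The pairing step can be sketched on its own, mirroring the loop in the source listing; `pairs_to_dict` is a hypothetical helper name, not part of conatus, and the attribute names are illustrative.

```python
def pairs_to_dict(attrs_pairs: list[str]) -> dict[str, str]:
    """Turn ["name1", "value1", "name2", "value2"] into a dictionary."""
    if len(attrs_pairs) % 2 != 0:
        msg = "attrs_pairs must have an even number of elements"
        raise ValueError(msg)
    # Even positions hold names, odd positions hold the matching values.
    return {
        attrs_pairs[i]: attrs_pairs[i + 1]
        for i in range(0, len(attrs_pairs), 2)
    }


pairs = pairs_to_dict(["data-role", "card", "aria-hidden", "true"])
print(pairs)  # {'data-role': 'card', 'aria-hidden': 'true'}
```

A flat list keeps the argument easy to express as a simple list of strings, at the cost of the odd-length check shown here.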

Source code in conatus/actions/preloaded/html_actions.py

@action
def html_find_all(
    html: HTML,
    *,
    class_: str | None = None,
    href: str | None = None,
    id: str | None = None,  # noqa: A002
    tag: str | None = None,
    attrs_pairs: list[str] | None = None,
    treat_as_regex: list[str] | None = None,
) -> HTMLElements:
    """Find all HTML elements that match the given criteria.

    Note that all the attribute arguments are optional. By default,
    we will match the exact value of the attribute. If you want to match
    a regex pattern, you need to pass the attribute name in the
    `treat_as_regex` argument.

    If you want to treat a value as a regex pattern, don't forget to add
    '^' and '$' at the beginning and end of the value.

    Args:
        html: The HTML object.
        class_: The class of the elements to find.
        href: The href of the elements to find.
        id: The id of the elements to find.
        tag: The tag (or name) of the elements to find.
        attrs_pairs: The attributes to find. These are pairs of attribute
            name and value.
        treat_as_regex: The attributes to treat as regex patterns.

    Returns:
        The HTML elements that match the given criteria.

    Raises:
        ValueError: If the number of elements in attrs_pairs is odd.
    """
    soup = html.soup
    kwargs: dict[str, re.Pattern[str] | str | None] = {
        "class_": class_,
        "href": href,
        "id": id,
        "name": tag,
    }
    # Deconstruct the pairs into a dictionary.
    # Returns error if the number of elements is odd.
    if attrs_pairs is not None:
        if len(attrs_pairs) % 2 != 0:
            msg = "attrs_pairs must have an even number of elements"
            raise ValueError(msg)
        for i in range(0, len(attrs_pairs), 2):
            kwargs[attrs_pairs[i]] = attrs_pairs[i + 1]

    if treat_as_regex is not None:
        for attr in treat_as_regex:
            if attr not in kwargs:
                # TODO(lemeb): Add logging for this fn # noqa: TD003
                print("WARNING: Attribute not found in kwargs")  # noqa: T201
            kwargs[attr] = re.compile(str(kwargs[attr]))

    # Remove None values from the kwargs
    for k in list(kwargs.keys()):
        if kwargs[k] is None:
            del kwargs[k]

    result_set = soup.find_all(recursive=True, limit=None, **kwargs)
    return HTMLElements(result_set=result_set)

get_attributes_from_list `conatus-action` ¶

get_attributes_from_list(
    elements: HTMLElements, attr: HTMLElementsAttributes
) -> list[str | list[str] | None]

This function is an Action

You can call get_attributes_from_list just like a regular function, but note that it is actually an Action object.

This means that:

get_attributes_from_list has additional properties and methods that you can use (see the Action documentation for more information);
but it also means that operations like issubclass and isinstance will not work as expected.

Get the attributes from the list of elements.

PARAMETER	DESCRIPTION
`elements`	The list of elements. TYPE: `HTMLElements`
`attr`	The attribute to get from the elements, among the following: "href", "id", "tag", "class". For "name", use "tag" instead. TYPE: `HTMLElementsAttributes`

RETURNS	DESCRIPTION
`list[str \| list[str] \| None]`	The list of attributes from the elements.
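
The return type mixes `str`, `list[str]`, and `None` because Beautiful Soup treats `class` as a multi-valued attribute and returns it as a list, while a missing attribute yields `None`. A sketch with plain dictionaries standing in for tags; `get_attributes` and the attribute values are invented for the example:

```python
# Each dict mimics the attribute mapping of one tag.
fake_elements = [
    {"href": "/home", "class": ["nav", "active"]},
    {"href": "/about"},
    {"id": "footer"},
]


def get_attributes(elements, attr):
    """Collect one attribute per element, with None where it is absent."""
    return [element.get(attr) for element in elements]


hrefs = get_attributes(fake_elements, "href")
classes = get_attributes(fake_elements, "class")
print(hrefs)    # ['/home', '/about', None]
print(classes)  # [['nav', 'active'], None, None]
```

The result always has one entry per element, so positions line up with the original result set even when some elements lack the attribute.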

Source code in conatus/actions/preloaded/html_actions.py

@action
def get_attributes_from_list(
    elements: HTMLElements,
    attr: HTMLElementsAttributes,
) -> list[str | list[str] | None]:
    """Get the attributes from the list of elements.

    Args:
        elements: The list of elements.
        attr: The attribute to get from the elements, among the following:
            "href", "id", "tag", "class". For "name", use "tag" instead.

    Returns:
        The list of attributes from the elements.
    """
    return [elem.get(attr) for elem in elements.result_set]

get_html_from_url `conatus-action` ¶

get_html_from_url(url: str) -> HTML

This function is an Action

You can call get_html_from_url just like a regular function, but note that it is actually an Action object.

This means that:

get_html_from_url has additional properties and methods that you can use (see the Action documentation for more information);
but it also means that operations like issubclass and isinstance will not work as expected.

Get the HTML content from a URL.

PARAMETER	DESCRIPTION
`url`	The URL to get the HTML content from. TYPE: `str`

RETURNS	DESCRIPTION
`HTML`	The HTML object.

Source code in conatus/actions/preloaded/html_actions.py

@action
def get_html_from_url(url: str) -> HTML:
    """Get the HTML content from a URL.

    Args:
        url: The URL to get the HTML content from.

    Returns:
        The HTML object.
    """
    r = httpx.get(url)
    return HTML(soup=BeautifulSoup(r.text, "html.parser"), url=url)

HTML helpers¶

conatus.actions.preloaded.html_helpers ¶

Helper classes for HTML processing.

We separate the helper classes from the actions to clarify the API.

HTML `dataclass` ¶

HTML(
    soup: BeautifulSoup,
    url: str | None = None,
    is_file: bool = False,
    hash: str | None = None,
)

HTML object.

repr_html_body ¶

repr_html_body() -> str

Get a preview of the HTML body.

We return the first 500 characters of the HTML body.

RETURNS	DESCRIPTION
`str`	A preview of the HTML body.

Source code in conatus/actions/preloaded/html_helpers.py

def repr_html_body(self) -> str:
    """Get a preview of the HTML body.

    We return the first 500 characters of the HTML body.

    Returns:
        A preview of the HTML body.
    """
    html_body = self.soup.body
    if html_body is None:
        return "No body found"
    # Remove SVG elements for clarity
    for svg in html_body.find_all("svg"):  # pyright: ignore[reportAny]
        cast("Tag", svg).decompose()
    return html_body.prettify()[:500]

from_file `staticmethod` ¶

from_file(
    file_path: str, *, encoding: str = "utf-8"
) -> HTML

Create an HTML object from a file.

PARAMETER	DESCRIPTION
`file_path`	The path to the file. TYPE: `str`
`encoding`	The encoding of the file. Default is "utf-8". TYPE: `str` DEFAULT: `'utf-8'`

RETURNS	DESCRIPTION
`HTML`	The HTML object.

Source code in conatus/actions/preloaded/html_helpers.py

@staticmethod
def from_file(file_path: str, *, encoding: str = "utf-8") -> "HTML":
    """Create an HTML object from a file.

    Args:
        file_path: The path to the file.
        encoding: The encoding of the file. Default is "utf-8".

    Returns:
        The HTML object.
    """
    with Path(file_path).open(encoding=encoding) as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, "html.parser")
    hsh = hashlib.sha256(html_content.encode()).hexdigest()[:10]
    return HTML(soup=soup, url=file_path, is_file=True, hash=hsh)
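
As the source above shows, the `hash` field set by `from_file` is the first 10 hexadecimal characters of the SHA-256 digest of the file's text. The same value can be reproduced with the standard library alone; the file name and contents below are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path

html_content = "<html><body><p>Hello</p></body></html>"

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "page.html"
    file_path.write_text(html_content, encoding="utf-8")

    # Same steps as from_file: read the text, hash it, keep 10 hex characters.
    with file_path.open(encoding="utf-8") as file:
        read_back = file.read()
    short_hash = hashlib.sha256(read_back.encode()).hexdigest()[:10]

print(len(short_hash))  # 10
```

Because the digest is computed from the text, two files with identical contents get the same short hash regardless of their paths.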

__repr__ ¶

__repr__() -> str

Get the string representation of the HTML object.

RETURNS	DESCRIPTION
`str`	The string representation of the HTML object.

Source code in conatus/actions/preloaded/html_helpers.py

@override
def __repr__(self) -> str:
    """Get the string representation of the HTML object.

    Returns:
        The string representation of the HTML object.
    """
    url = self.url if not self.is_file else f"file://{self.url}"
    return (
        f"HTML(url={url}, hash={self.hash}, soup={self.repr_html_body()})"
    )

__str__ ¶

__str__() -> str

Get the string representation of the HTML object.

RETURNS	DESCRIPTION
`str`	The string representation of the HTML object.

Source code in conatus/actions/preloaded/html_helpers.py

@override
def __str__(self) -> str:
    """Get the string representation of the HTML object.

    Returns:
        The string representation of the HTML object.
    """
    url = self.url if not self.is_file else f"file://{self.url}"
    return (
        f"HTML(url={url}, hash={self.hash}, soup={self.repr_html_body()})"
    )

HTMLElements `dataclass` ¶

HTMLElements(result_set: ResultSet[Tag])

HTML elements. Not quite a list.

__str__ ¶

__str__() -> str

Get the string representation of the HTMLElements object.

RETURNS	DESCRIPTION
`str`	The string representation of the HTMLElements object.

Source code in conatus/actions/preloaded/html_helpers.py

@override
def __str__(self) -> str:
    """Get the string representation of the HTMLElements object.

    Returns:
        The string representation of the HTMLElements object.
    """
    preview_limit = 15
    length_set = len(self.result_set)
    preview_head = [
        str(elem)[:100] for elem in self.result_set[:preview_limit]
    ]
    return (
        f"HTMLElements(length={length_set},"
        f" preview_limit={preview_limit},"
        f" preview_head={preview_head})"
    )
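
The string form is therefore a bounded preview: the total count, followed by the first 15 elements cut to 100 characters each. A sketch that mirrors the listing above, with plain strings standing in for tags; `preview` and the sample items are hypothetical.

```python
PREVIEW_LIMIT = 15
MAX_CHARS = 100


def preview(elements: list[str]) -> str:
    """Build a bounded summary in the same shape as HTMLElements.__str__."""
    head = [element[:MAX_CHARS] for element in elements[:PREVIEW_LIMIT]]
    return (
        f"HTMLElements(length={len(elements)},"
        f" preview_limit={PREVIEW_LIMIT},"
        f" preview_head={head})"
    )


fake_items = [f"<li>item {i}</li>" for i in range(40)]
text = preview(fake_items)
print(text[:45])
```

The reported length stays at 40 even though only the first 15 items appear in the preview.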

__repr__ ¶

__repr__() -> str

Get the string representation of the HTMLElements object.

RETURNS	DESCRIPTION
`str`	The string representation of the HTMLElements object.

Source code in conatus/actions/preloaded/html_helpers.py

@override
def __repr__(self) -> str:
    """Get the string representation of the HTMLElements object.

    Returns:
        The string representation of the HTMLElements object.
    """
    return str(self)
