HTML processing

HTML actions

conatus.actions.preloaded.html_actions

Standard Actions for HTML processing.

Note that these actions require beautifulsoup4 and lxml to be installed. If that's not the case, you can install them with:

uv add "conatus[html]"
# or: pip install "conatus[html]"

html_full_view conatus-action

html_full_view(
    elements: HTMLElements, nth_element: int = 0
) -> str
This function is an Action

You can call html_full_view just like a regular function, but note that it is actually an Action object.

This means that:

  • html_full_view has additional properties and methods that you can use (see the Action documentation for more information);
  • but it also means that operations like issubclass and isinstance will not work as expected.
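To make the two points above concrete, here is a minimal sketch of a callable wrapper object. `ActionSketch` is a hypothetical stand-in written for illustration; it is not the conatus `Action` class, whose implementation is not shown on this page. It only demonstrates why an object that wraps a function can be called normally and still fail function-oriented type checks.

```python
import types


class ActionSketch:
    """Hypothetical stand-in for an Action: a callable object wrapping a function."""

    def __init__(self, fn):
        self.fn = fn
        # An example of an "additional property" carried by the wrapper.
        self.name = fn.__name__

    def __call__(self, *args, **kwargs):
        # Delegating to the wrapped function makes the object callable.
        return self.fn(*args, **kwargs)


def shout(text: str) -> str:
    return text.upper()


wrapped = ActionSketch(shout)

# Calling the wrapper behaves like calling the plain function.
result = wrapped("hello")

# The wrapper is an object, not a function, so this check is False.
is_plain_function = isinstance(wrapped, types.FunctionType)
```

If your code branches on whether something is a plain function, checking `callable(obj)` is usually the safer test for wrapper objects of this kind.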

Get the full view of the nth element in the result set.

(In practice, the output is still truncated to the first 1500 characters of the element.)

PARAMETER DESCRIPTION
elements

The HTMLElements object.

TYPE: HTMLElements

nth_element

The index of the element to get the full view of.

TYPE: int DEFAULT: 0

RETURNS DESCRIPTION
str

The full view of the nth element in the result set.

Source code in conatus/actions/preloaded/html_actions.py
@action
def html_full_view(elements: "HTMLElements", nth_element: int = 0) -> str:
    """Get the full view of the nth element in the result set.

    (Actually, we still return only the first 1500 characters of the
    element.)

    Args:
        elements (HTMLElements): The HTMLElements object.
        nth_element (int): The index of the element to get the full view of.

    Returns:
        (str): The full view of the nth element in the result set.
    """
    return elements._full_view(nth_element)  # noqa: SLF001

html_view_slice conatus-action

html_view_slice(
    elements: HTMLElements, start: int, end: int
) -> str
This function is an Action

You can call html_view_slice just like a regular function, but note that it is actually an Action object.

This means that:

  • html_view_slice has additional properties and methods that you can use (see the Action documentation for more information);
  • but it also means that operations like issubclass and isinstance will not work as expected.

Get a slice of the HTMLElements object.

Out-of-bounds indices are handled for you, so there is no need to clamp start or end yourself.

We enforce a maximum of 100 characters per element, and a maximum of 30 elements at a time, with a total of 500 characters.
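The limits above can be sketched as follows. This is an illustration only: the real `HTMLElements._view_slice` is not shown on this page, so the function name `view_slice_sketch`, the use of plain strings in place of tags, the newline-joined output, and the order in which the three limits are applied are all assumptions.

```python
MAX_CHARS_PER_ELEMENT = 100
MAX_ELEMENTS = 30
MAX_TOTAL_CHARS = 500


def view_slice_sketch(elements: list[str], start: int, end: int) -> str:
    """Preview elements[start:end] under the three limits described above."""
    # Clamp both indices so out-of-range values never raise.
    start = max(0, min(start, len(elements)))
    end = max(start, min(end, len(elements), start + MAX_ELEMENTS))
    # Truncate each element, then cap the combined preview.
    lines = [text[:MAX_CHARS_PER_ELEMENT] for text in elements[start:end]]
    return "\n".join(lines)[:MAX_TOTAL_CHARS]


# Fifty long elements, requested with deliberately out-of-range indices.
elements = [f"<li>{'x' * 200}</li>" for _ in range(50)]
preview = view_slice_sketch(elements, start=-5, end=1000)
```

Because the total cap is smaller than 30 elements of 100 characters each, the total limit is usually the one that decides how much you see.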

PARAMETER DESCRIPTION
elements

The HTMLElements object.

TYPE: HTMLElements

start

The start index.

TYPE: int

end

The end index.

TYPE: int

RETURNS DESCRIPTION
str

The slice of the HTMLElements object.

Source code in conatus/actions/preloaded/html_actions.py
@action
def html_view_slice(
    elements: HTMLElements,
    start: int,
    end: int,
) -> str:
    """Get a slice of the HTMLElements object.

    Don't worry about going out of bounds. We'll handle it.

    We enforce a maximum of 100 characters per element, and a maximum
    of 30 elements at a time, with a total of 500 characters.

    Args:
        elements: The HTMLElements object.
        start: The start index.
        end: The end index.

    Returns:
        The slice of the HTMLElements object.
    """
    return elements._view_slice(start, end)  # noqa: SLF001

html_find_all conatus-action

html_find_all(
    html: HTML,
    *,
    class_: str | None = None,
    href: str | None = None,
    id: str | None = None,
    tag: str | None = None,
    attrs_pairs: list[str] | None = None,
    treat_as_regex: list[str] | None = None
) -> HTMLElements
This function is an Action

You can call html_find_all just like a regular function, but note that it is actually an Action object.

This means that:

  • html_find_all has additional properties and methods that you can use (see the Action documentation for more information);
  • but it also means that operations like issubclass and isinstance will not work as expected.

Find all HTML elements that match the given criteria.

All attribute arguments are optional. By default, an attribute matches only if its value is exactly the one given. To match a regex pattern instead, pass the attribute name in the treat_as_regex argument.

When a value is treated as a regex pattern, remember to add '^' and '$' at the beginning and end of the value if the pattern should match the whole attribute value.
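The reason for the anchors can be shown with the standard `re` module alone. The sketch below assumes Beautiful Soup's documented behavior of filtering with the compiled pattern's `search()` method, which succeeds if the pattern occurs anywhere in the value. The attribute value `"item-list"` is a made-up example.

```python
import re

# A made-up attribute value to test the patterns against.
class_value = "item-list"

unanchored = re.compile("item")
anchored = re.compile("^item$")

# search() finds "item" inside "item-list", so the unanchored pattern matches.
matches_unanchored = unanchored.search(class_value) is not None

# With '^' and '$' the pattern must cover the whole value, so it does not match.
matches_anchored = anchored.search(class_value) is not None

# The anchored pattern still matches the exact value.
matches_exact = anchored.search("item") is not None
```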

PARAMETER DESCRIPTION
html

The HTML object.

TYPE: HTML

class_

The class of the elements to find.

TYPE: str | None DEFAULT: None

href

The href of the elements to find.

TYPE: str | None DEFAULT: None

id

The id of the elements to find.

TYPE: str | None DEFAULT: None

tag

The tag (or name) of the elements to find.

TYPE: str | None DEFAULT: None

attrs_pairs

Additional attributes to match, given as a flat list of alternating attribute names and values (name, value, name, value, and so on).

TYPE: list[str] | None DEFAULT: None

treat_as_regex

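The names of the attributes whose values should be treated as regex patterns. Per the source below, names are looked up among the keyword names built there ("class_", "href", "id", "name", plus any names from attrs_pairs), so use "class_" for the class and "name" for the tag.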

TYPE: list[str] | None DEFAULT: None

RETURNS DESCRIPTION
HTMLElements

The HTML elements that match the given criteria.

RAISES DESCRIPTION
ValueError

If the number of elements in attrs_pairs is odd.
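As an illustration of the attrs_pairs format and the ValueError above, here is a small standalone sketch of the pairing step. The helper name `pairs_to_dict` and the sample attribute names are hypothetical; the even-length check and the error message mirror the source listing below.

```python
def pairs_to_dict(attrs_pairs: list[str]) -> dict[str, str]:
    """Turn [name, value, name, value, ...] into {name: value, ...}."""
    if len(attrs_pairs) % 2 != 0:
        msg = "attrs_pairs must have an even number of elements"
        raise ValueError(msg)
    # Even positions hold names, odd positions hold the matching values.
    return {
        attrs_pairs[i]: attrs_pairs[i + 1]
        for i in range(0, len(attrs_pairs), 2)
    }


attrs = pairs_to_dict(["data-id", "42", "role", "button"])

# An odd number of items cannot be paired up and is rejected.
try:
    pairs_to_dict(["data-id"])
except ValueError as exc:
    error_message = str(exc)
```

A flat list of strings is easier to express in a simple tool-call schema than a nested mapping, which may be why the argument is shaped this way; that motivation is a guess, not something this page states.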

Source code in conatus/actions/preloaded/html_actions.py
@action
def html_find_all(
    html: HTML,
    *,
    class_: str | None = None,
    href: str | None = None,
    id: str | None = None,  # noqa: A002
    tag: str | None = None,
    attrs_pairs: list[str] | None = None,
    treat_as_regex: list[str] | None = None,
) -> HTMLElements:
    """Find all HTML elements that match the given criteria.

    Note that all the attributes arguments are optional. By default,
    we will match the exact value of the attribute. If you want to match
    a regex pattern, you need to pass the attribute name in the
    `treat_as_regex` argument.

    If you want to treat a value as a regex pattern, don't forget to add
    '^' and '$' at the beginning and end of the value.

    Args:
        html: The HTML object.
        class_: The class of the elements to find.
        href: The href of the elements to find.
        id: The id of the elements to find.
        tag: The tag (or name) of the elements to find.
        attrs_pairs: The attributes to find. These are pairs of attribute
            name and value.
        treat_as_regex: The attributes to treat as regex patterns.

    Returns:
        The HTML elements that match the given criteria.

    Raises:
        ValueError: If the number of elements in attrs_pairs is odd.
    """
    soup = html.soup
    kwargs: dict[str, re.Pattern[str] | str | None] = {
        "class_": class_,
        "href": href,
        "id": id,
        "name": tag,
    }
    # Deconstruct the pairs into a dictionary.
    # Raises ValueError if the number of elements is odd.
    if attrs_pairs is not None:
        if len(attrs_pairs) % 2 != 0:
            msg = "attrs_pairs must have an even number of elements"
            raise ValueError(msg)
        for i in range(0, len(attrs_pairs), 2):
            kwargs[attrs_pairs[i]] = attrs_pairs[i + 1]

    if treat_as_regex is not None:
        for attr in treat_as_regex:
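            if kwargs.get(attr) is None:
                # TODO(lemeb): Add logging for this fn # noqa: TD003
                print("WARNING: Attribute not found in kwargs")  # noqa: T201
                # Skip: there is no value to compile. This avoids a KeyError
                # for unknown names and a literal "None" pattern otherwise.
                continue
            kwargs[attr] = re.compile(str(kwargs[attr]))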

    # Remove None values from the kwargs
    for k in list(kwargs.keys()):
        if kwargs[k] is None:
            del kwargs[k]

    result_set = soup.find_all(recursive=True, limit=None, **kwargs)
    return HTMLElements(result_set=result_set)

get_attributes_from_list conatus-action

get_attributes_from_list(
    elements: HTMLElements, attr: HTMLElementsAttributes
) -> list[str | list[str] | None]
This function is an Action

You can call get_attributes_from_list just like a regular function, but note that it is actually an Action object.

This means that:

  • get_attributes_from_list has additional properties and methods that you can use (see the Action documentation for more information);
  • but it also means that operations like issubclass and isinstance will not work as expected.

Get the attributes from the list of elements.

PARAMETER DESCRIPTION
elements

The list of elements.

TYPE: HTMLElements

attr

The attribute to get from the elements, among the following: "href", "id", "tag", "class". For "name", use "tag" instead.

TYPE: HTMLElementsAttributes

RETURNS DESCRIPTION
list[str | list[str] | None]

The list of attributes from the elements.
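The return type mixes `str`, `list[str]` and `None` because Beautiful Soup returns multi-valued attributes such as class as a list, and `Tag.get` returns `None` when an attribute is absent. The sketch below mimics that shape with plain dictionaries standing in for tags; the dictionaries, their contents and the helper name `attributes_sketch` are assumptions made for illustration.

```python
# Plain dicts stand in for bs4 Tag objects, which expose a similar .get().
elements = [
    {"href": "/home", "class": ["nav", "active"]},
    {"href": "/about", "class": ["nav"]},
    {"id": "footer"},
]


def attributes_sketch(elems: list[dict], attr: str) -> list:
    """Collect one attribute from every element, None where it is missing."""
    return [elem.get(attr) for elem in elems]


hrefs = attributes_sketch(elements, "href")
classes = attributes_sketch(elements, "class")
```

The result always has one entry per element, so positions line up with the original result set even when some elements lack the attribute.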

Source code in conatus/actions/preloaded/html_actions.py
@action
def get_attributes_from_list(
    elements: HTMLElements,
    attr: HTMLElementsAttributes,
) -> list[str | list[str] | None]:
    """Get the attributes from the list of elements.

    Args:
        elements: The list of elements.
        attr: The attribute to get from the elements, among the following:
            "href", "id", "tag", "class". for "name", use "tag" instead.

    Returns:
        The list of attributes from the elements.
    """
    return [elem.get(attr) for elem in elements.result_set]

get_html_from_url conatus-action

get_html_from_url(url: str) -> HTML
This function is an Action

You can call get_html_from_url just like a regular function, but note that it is actually an Action object.

This means that:

  • get_html_from_url has additional properties and methods that you can use (see the Action documentation for more information);
  • but it also means that operations like issubclass and isinstance will not work as expected.

Get the HTML content from a URL.

PARAMETER DESCRIPTION
url

The URL to get the HTML content from.

TYPE: str

RETURNS DESCRIPTION
HTML

The HTML object.

Source code in conatus/actions/preloaded/html_actions.py
@action
def get_html_from_url(url: str) -> HTML:
    """Get the HTML content from a URL.

    Args:
        url: The URL to get the HTML content from.

    Returns:
        The HTML object.
    """
    r = httpx.get(url)
    return HTML(soup=BeautifulSoup(r.text, "html.parser"), url=url)

HTML helpers

conatus.actions.preloaded.html_helpers

Helper classes for HTML processing.

We separate the helper classes from the actions to clarify the API.

HTML dataclass

HTML(
    soup: BeautifulSoup,
    url: str | None = None,
    is_file: bool = False,
    hash: str | None = None,
)

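HTML object: a parsed BeautifulSoup document together with its source URL or file path and an optional content hash.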

repr_html_body

repr_html_body() -> str

Get a preview of the HTML body.

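We return the first 500 characters of the prettified HTML body, with SVG elements removed for clarity. If the document has no body, we return "No body found". As the source below shows, the SVG elements are removed from the soup itself, not from a copy.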

RETURNS DESCRIPTION
str

A preview of the HTML body.

Source code in conatus/actions/preloaded/html_helpers.py
def repr_html_body(self) -> str:
    """Get a preview of the HTML body.

    We return the first 500 characters of the HTML body.

    Returns:
        A preview of the HTML body.
    """
    html_body = self.soup.body
    if html_body is None:
        return "No body found"
    # Remove SVG elements for clarity
    for svg in html_body.find_all("svg"):  # pyright: ignore[reportAny]
        cast("Tag", svg).decompose()
    return html_body.prettify()[:500]

from_file staticmethod

from_file(
    file_path: str, *, encoding: str = "utf-8"
) -> HTML

Create an HTML object from a file.

PARAMETER DESCRIPTION
file_path

The path to the file.

TYPE: str

encoding

The encoding of the file. Default is "utf-8".

TYPE: str DEFAULT: 'utf-8'

RETURNS DESCRIPTION
HTML

The HTML object.

Source code in conatus/actions/preloaded/html_helpers.py
@staticmethod
def from_file(file_path: str, *, encoding: str = "utf-8") -> "HTML":
    """Create an HTML object from a file.

    Args:
        file_path: The path to the file.
        encoding: The encoding of the file. Default is "utf-8".

    Returns:
        The HTML object.
    """
    with Path(file_path).open(encoding=encoding) as file:
        html_content = file.read()
    soup = BeautifulSoup(html_content, "html.parser")
    hsh = hashlib.sha256(html_content.encode()).hexdigest()[:10]
    return HTML(soup=soup, url=file_path, is_file=True, hash=hsh)
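The hash computed in the source above is the first 10 hexadecimal characters of the SHA-256 digest of the file's text. The sketch below reproduces just that step with the standard library, using a temporary file. The helper name `short_hash`, the file name and the sample markup are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def short_hash(text: str) -> str:
    """First 10 hex characters of the SHA-256 digest of the text."""
    return hashlib.sha256(text.encode()).hexdigest()[:10]


sample = "<html><body><p>Hello</p></body></html>"

# Write a sample file, then read it back the same way from_file does.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "page.html"
    path.write_text(sample, encoding="utf-8")
    with path.open(encoding="utf-8") as file:
        content = file.read()

digest = short_hash(content)
```

A short prefix like this is convenient for telling file versions apart at a glance, but it is not meant as a collision-resistant identifier.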

__repr__

__repr__() -> str

Get the string representation of the HTML object.

RETURNS DESCRIPTION
str

The string representation of the HTML object.

Source code in conatus/actions/preloaded/html_helpers.py
@override
def __repr__(self) -> str:
    """Get the string representation of the HTML object.

    Returns:
        The string representation of the HTML object.
    """
    url = self.url if not self.is_file else f"file://{self.url}"
    return (
        f"HTML(url={url}, hash={self.hash}, soup={self.repr_html_body()})"
    )

__str__

__str__() -> str

Get the string representation of the HTML object.

RETURNS DESCRIPTION
str

The string representation of the HTML object.

Source code in conatus/actions/preloaded/html_helpers.py
@override
def __str__(self) -> str:
    """Get the string representation of the HTML object.

    Returns:
        The string representation of the HTML object.
    """
    url = self.url if not self.is_file else f"file://{self.url}"
    return (
        f"HTML(url={url}, hash={self.hash}, soup={self.repr_html_body()})"
    )

HTMLElements dataclass

HTMLElements(result_set: ResultSet[Tag])

HTML elements: a wrapper around a BeautifulSoup ResultSet of tags. Not quite a list.

__str__

__str__() -> str

Get the string representation of the HTMLElements object.

RETURNS DESCRIPTION
str

The string representation of the HTMLElements object.

Source code in conatus/actions/preloaded/html_helpers.py
@override
def __str__(self) -> str:
    """Get the string representation of the HTMLElements object.

    Returns:
        The string representation of the HTMLElements object.
    """
    preview_limit = 15
    length_set = len(self.result_set)
    preview_head = [
        str(elem)[:100] for elem in self.result_set[:preview_limit]
    ]
    return (
        f"HTMLElements(length={length_set},"
        f" preview_limit={preview_limit},"
        f" preview_head={preview_head})"
    )
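To show what the summary built above looks like, here is a standalone sketch that applies the same two limits (15 elements, 100 characters each) to a list of plain strings. The function name `preview_sketch` and the sample elements are hypothetical; strings stand in for the tags of a real result set.

```python
PREVIEW_LIMIT = 15
MAX_CHARS = 100


def preview_sketch(elements: list[str]) -> str:
    """Summarize a list of elements the way the __str__ above does."""
    # Keep only the first elements, each cut to a fixed width.
    head = [text[:MAX_CHARS] for text in elements[:PREVIEW_LIMIT]]
    return (
        f"HTMLElements(length={len(elements)},"
        f" preview_limit={PREVIEW_LIMIT},"
        f" preview_head={head})"
    )


elements = [f"<p>{i}</p>" for i in range(40)]
summary = preview_sketch(elements)
```

The total length is reported in full, so a reader of the summary can tell how many elements were left out of the preview.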

__repr__

__repr__() -> str

Get the string representation of the HTMLElements object.

RETURNS DESCRIPTION
str

The string representation of the HTMLElements object.

Source code in conatus/actions/preloaded/html_helpers.py
@override
def __repr__(self) -> str:
    """Get the string representation of the HTMLElements object.

    Returns:
        The string representation of the HTMLElements object.
    """
    return str(self)