HTML processing
HTML actions
conatus.actions.preloaded.html_actions
Standard Actions for HTML processing.
Note that these actions require beautifulsoup4 and lxml to be installed.
If that's not the case, you can install them with:
html_full_view
conatus-action
html_full_view(
elements: HTMLElements, nth_element: int = 0
) -> str
This function is an Action
You can call html_full_view just like a regular function, but note that it is
actually an Action object.
This means that:
- html_full_view has additional properties and methods that you can use (see the Action documentation for more information);
- but it also means that operations like issubclass and isinstance will not work as expected.
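To make the note above concrete, here is a minimal sketch of why a callable wrapper object behaves like a function when called but fails function type checks. `SketchAction` and `shout` are hypothetical names invented for this illustration; they are not the library's `Action` class, whose internals are not shown on this page.

```python
import types
from typing import Any, Callable


class SketchAction:
    """Hypothetical stand-in for an Action: a callable object wrapping a function."""

    def __init__(self, func: Callable[..., Any]) -> None:
        self.func = func
        # An extra attribute that a plain function object would not carry.
        self.name = func.__name__

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        # Calling the object simply forwards to the wrapped function.
        return self.func(*args, **kwargs)


@SketchAction
def shout(text: str) -> str:
    return text.upper()


# The wrapper can be called exactly like the original function...
result = shout("hello")

# ...but it is an instance of SketchAction, not a function object.
is_function = isinstance(shout, types.FunctionType)
```

In this sketch `result` is the upper-cased string, while `is_function` is `False`, which is the kind of surprise the note warns about.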
Get the full view of the nth element in the result set.
(Actually, we still return only the first 1500 characters of the element.)
| PARAMETER | DESCRIPTION |
|---|---|
| elements | The HTMLElements object. TYPE: HTMLElements |
| nth_element | The index of the element to get the full view of. TYPE: int |

| RETURNS | DESCRIPTION |
|---|---|
| str | The full view of the nth element in the result set. |
Source code in conatus/actions/preloaded/html_actions.py
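The 1500-character cap described above can be pictured with a small sketch. It assumes the elements are plain strings, which is a simplification; `sketch_full_view` and its `limit` parameter are hypothetical and only mirror the documented behavior, not the actual implementation.

```python
def sketch_full_view(elements: list, nth_element: int = 0, limit: int = 1500) -> str:
    """Return at most `limit` characters of the selected element (assumed to be a string)."""
    return elements[nth_element][:limit]


elements = ["<p>short</p>", "<div>" + "x" * 5000 + "</div>"]

# A short element comes back unchanged.
short_view = sketch_full_view(elements)

# A long element is cut off at the limit.
long_view = sketch_full_view(elements, nth_element=1)
```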
html_view_slice
conatus-action
html_view_slice(
elements: HTMLElements, start: int, end: int
) -> str
This function is an Action
You can call html_view_slice just like a regular function, but note that it is
actually an Action object.
This means that:
- html_view_slice has additional properties and methods that you can use (see the Action documentation for more information);
- but it also means that operations like issubclass and isinstance will not work as expected.
Get a slice of the HTMLElements object.
Don't worry about going out of bounds. We'll handle it.
We enforce a maximum of 100 characters per element, and a maximum of 30 elements at a time, with a total of 500 characters.
| PARAMETER | DESCRIPTION |
|---|---|
| elements | The HTMLElements object. TYPE: HTMLElements |
| start | The start index. TYPE: int |
| end | The end index. TYPE: int |

| RETURNS | DESCRIPTION |
|---|---|
| str | The slice of the HTMLElements object. |
Source code in conatus/actions/preloaded/html_actions.py
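One way to read the limits above is sketched below, again treating elements as plain strings. The order in which the three caps are applied, the newline separator, and the name `sketch_view_slice` are all assumptions made for illustration; the real action may format its output differently.

```python
MAX_CHARS_PER_ELEMENT = 100
MAX_ELEMENTS = 30
MAX_TOTAL_CHARS = 500


def sketch_view_slice(elements: list, start: int, end: int) -> str:
    """Join a bounded slice of string elements, applying the three documented caps."""
    # Python slicing tolerates indices past the end, so no bounds check is needed.
    window = elements[max(start, 0):max(end, 0)][:MAX_ELEMENTS]
    # Cap each element, then cap the joined result.
    clipped = [text[:MAX_CHARS_PER_ELEMENT] for text in window]
    return "\n".join(clipped)[:MAX_TOTAL_CHARS]


small = ["<a>one</a>", "<a>two</a>", "<a>three</a>"]
big = ["<p>" + "y" * 300 + "</p>"] * 50

all_small = sketch_view_slice(small, 0, 1000)  # end is far out of bounds
nothing = sketch_view_slice(small, 10, 20)     # start is out of bounds
capped = sketch_view_slice(big, 0, 50)         # hits the total-length cap
```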
html_find_all
conatus-action
html_find_all(
html: HTML,
*,
class_: str | None = None,
href: str | None = None,
id: str | None = None,
tag: str | None = None,
attrs_pairs: list[str] | None = None,
treat_as_regex: list[str] | None = None
) -> HTMLElements
This function is an Action
You can call html_find_all just like a regular function, but note that it is
actually an Action object.
This means that:
- html_find_all has additional properties and methods that you can use (see the Action documentation for more information);
- but it also means that operations like issubclass and isinstance will not work as expected.
Find all HTML elements that match the given criteria.
Note that all the attribute arguments are optional. By default,
we match the exact value of the attribute. If you want to match
a regex pattern instead, pass the attribute name in the
treat_as_regex argument.
If you treat a value as a regex pattern and want it to match the whole attribute value, don't forget to add '^' and '$' at the beginning and end of the value.
| PARAMETER | DESCRIPTION |
|---|---|
| html | The HTML object. TYPE: HTML |
| class_ | The class of the elements to find. TYPE: str \| None |
| href | The href of the elements to find. TYPE: str \| None |
| id | The id of the elements to find. TYPE: str \| None |
| tag | The tag (or name) of the elements to find. TYPE: str \| None |
| attrs_pairs | The attributes to find. These are pairs of attribute name and value. |
| treat_as_regex | The attributes to treat as regex patterns. |

| RETURNS | DESCRIPTION |
|---|---|
| HTMLElements | The HTML elements that match the given criteria. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the number of elements in attrs_pairs is odd. |
Source code in conatus/actions/preloaded/html_actions.py
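The sketch below illustrates the two rules described for html_find_all: an odd-length attrs_pairs raises ValueError, and names listed in treat_as_regex are matched as patterns rather than exact strings. It assumes attrs_pairs is a flat list alternating name and value, and that patterns are applied with a search (as Beautiful Soup does for compiled regexes), which is why the '^' and '$' anchors matter. `sketch_build_filters` and `sketch_matches` are hypothetical helpers, not part of the library.

```python
from __future__ import annotations

import re


def sketch_build_filters(
    attrs_pairs: list[str] | None = None,
    treat_as_regex: list[str] | None = None,
) -> dict[str, str | re.Pattern[str]]:
    """Turn a flat [name, value, name, value, ...] list into a filter mapping."""
    pairs = attrs_pairs or []
    if len(pairs) % 2 != 0:
        raise ValueError("attrs_pairs must contain an even number of items")
    regex_names = set(treat_as_regex or [])
    filters: dict[str, str | re.Pattern[str]] = {}
    for name, value in zip(pairs[0::2], pairs[1::2]):
        # Only the names listed in treat_as_regex are compiled as patterns.
        filters[name] = re.compile(value) if name in regex_names else value
    return filters


def sketch_matches(expected: str | re.Pattern[str], actual: str) -> bool:
    """Exact comparison for strings, substring-style search for patterns."""
    if isinstance(expected, re.Pattern):
        return expected.search(actual) is not None
    return expected == actual


filters = sketch_build_filters(
    ["data-role", "nav", "title", "^Home$"], treat_as_regex=["title"]
)

anchored_hit = sketch_matches(filters["title"], "Home")
anchored_miss = sketch_matches(filters["title"], "Homepage")
# Without anchors, a search also matches longer values.
unanchored_hit = sketch_matches(re.compile("Home"), "Homepage")
```

Here the anchored pattern rejects "Homepage" while the unanchored one accepts it, which is the pitfall the '^' and '$' advice is guarding against.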
get_attributes_from_list
conatus-action
get_attributes_from_list(
elements: HTMLElements, attr: HTMLElementsAttributes
) -> list[str | list[str] | None]
This function is an Action
You can call get_attributes_from_list just like a regular function, but note that it is
actually an Action object.
This means that:
- get_attributes_from_list has additional properties and methods that you can use (see the Action documentation for more information);
- but it also means that operations like issubclass and isinstance will not work as expected.
Get the attributes from the list of elements.
| PARAMETER | DESCRIPTION |
|---|---|
| elements | The list of elements. TYPE: HTMLElements |
| attr | The attribute to get from the elements, among the following: "href", "id", "tag", "class". For "name", use "tag" instead. TYPE: HTMLElementsAttributes |

| RETURNS | DESCRIPTION |
|---|---|
| list[str \| list[str] \| None] | The list of attributes from the elements. |
Source code in conatus/actions/preloaded/html_actions.py
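The return type above suggests that each entry can be a string, a list of strings, or None. A plausible reading, sketched here with plain dictionaries standing in for elements, is that multi-valued attributes such as "class" come back as lists (as Beautiful Soup returns them) and missing attributes come back as None. `sketch_get_attributes` and the dictionary layout are assumptions for illustration only.

```python
from __future__ import annotations


def sketch_get_attributes(
    elements: list[dict], attr: str
) -> list[str | list[str] | None]:
    """Collect one attribute per element; absent attributes become None."""
    return [element.get(attr) for element in elements]


elements = [
    {"tag": "a", "href": "/home", "class": ["nav", "active"]},
    {"tag": "p", "id": "intro"},
]

hrefs = sketch_get_attributes(elements, "href")
classes = sketch_get_attributes(elements, "class")
tags = sketch_get_attributes(elements, "tag")
```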
get_html_from_url
conatus-action
This function is an Action
You can call get_html_from_url just like a regular function, but note that it is
actually an Action object.
This means that:
- get_html_from_url has additional properties and methods that you can use (see the Action documentation for more information);
- but it also means that operations like issubclass and isinstance will not work as expected.
Get the HTML content from a URL.
| PARAMETER | DESCRIPTION |
|---|---|
| url | The URL to get the HTML content from. |

| RETURNS | DESCRIPTION |
|---|---|
| HTML | The HTML object. |
Source code in conatus/actions/preloaded/html_actions.py
HTML helpers
conatus.actions.preloaded.html_helpers
Helper classes for HTML processing.
We separate the helper classes from the actions to clarify the API.
HTML
dataclass
HTML(
soup: BeautifulSoup,
url: str | None = None,
is_file: bool = False,
hash: str | None = None,
)
HTML object.
repr_html_body
repr_html_body() -> str
Get a preview of the HTML body.
We return the first 500 characters of the HTML body.
| RETURNS | DESCRIPTION |
|---|---|
| str | A preview of the HTML body. |
Source code in conatus/actions/preloaded/html_helpers.py
from_file
staticmethod
Create an HTML object from a file.
| PARAMETER | DESCRIPTION |
|---|---|
| file_path | The path to the file. |
| encoding | The encoding of the file. Default is "utf-8". |

| RETURNS | DESCRIPTION |
|---|---|
| HTML | The HTML object. |
Source code in conatus/actions/preloaded/html_helpers.py
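As a rough illustration of how from_file and repr_html_body could fit together, the sketch below loads a file with an explicit encoding and previews the first 500 characters. `SketchHTML` is a hypothetical, stdlib-only stand-in: it stores raw text instead of a BeautifulSoup object, omits the hash field, and slices the whole document rather than just the body.

```python
from __future__ import annotations

import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SketchHTML:
    """Hypothetical stand-in for the HTML dataclass, holding raw text only."""

    text: str
    url: str | None = None
    is_file: bool = False

    @staticmethod
    def from_file(file_path: str, encoding: str = "utf-8") -> SketchHTML:
        # Read the file with the requested encoding and mark its origin.
        content = Path(file_path).read_text(encoding=encoding)
        return SketchHTML(text=content, is_file=True)

    def repr_html_body(self) -> str:
        # Preview: the first 500 characters only.
        return self.text[:500]


with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<html><body>" + "z" * 2000 + "</body></html>", encoding="utf-8")
    doc = SketchHTML.from_file(str(page))

preview = doc.repr_html_body()
```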
__repr__
__repr__() -> str
Get the string representation of the HTML object.
| RETURNS | DESCRIPTION |
|---|---|
| str | The string representation of the HTML object. |
Source code in conatus/actions/preloaded/html_helpers.py
HTMLElements
dataclass
HTML elements. Not quite a list.
__str__
__str__() -> str
Get the string representation of the HTMLElements object.
| RETURNS | DESCRIPTION |
|---|---|
| str | The string representation of the HTMLElements object. |