Chrome DOM
conatus.utils.browser.dom.chrome
Various types to process Chrome DOM data.
🤓 Nerd Alert: As explained on our Browser concept page, you probably don't need to use these classes
directly.
- Instead, you should use either the ProcessedDOM class or its sibling class, DOMNode.
But if you need to process the Chrome DOM data directly, follow along!
Why use the DOM?
The DOM is a representation of the page by the browser. We use it for basically everything related to the processing of the page:
- Defining the input and clickable elements;
- Extracting the text from the page;
- Extracting the layout of the page and using it to draw nodes around the relevant elements;
- etc.
The DOM is not HTML: it is a representation of the page that is more structured and easier to process. Think of it like the internals of the browser that are used to render the page.
Chrome DOM Snapshots
From Chrome to Pydantic: We ask Chrome to give us a snapshot of the DOM when the page is loaded. The specification of Chrome DOM snapshots is a little complex. We have hardcoded that specification in Pydantic to ensure that what we get is stable.
- The ChromeDOM class, as well as the other classes it relies on, should in theory give a comprehensive overview of the DOM snapshot structure.
Look at the raw JSON: In practice, a good way to understand the structure
of the Chrome DOM is to look at a simple DOM snapshot stored in JSON. You can
find a few in the conatus/utils/browser/dom/fixtures directory. If you
want to retrieve a JSON file like these from an arbitrary webpage, you can do
this:
import json
from conatus.utils.browser import Browser
from conatus.utils.browser.dom.chrome import ChromeDOM
url = "https://example.com"
file_name = "tests/tmp/example.com_dom.json"
# Open the page, then ask Chrome for a raw DOM snapshot over the CDP session.
browser = Browser()
browser.goto(url)
dom = ChromeDOM.get_snapshot(browser.b.current_page.cdp_session)
# The raw snapshot is a dictionary, so it can be dumped to JSON as-is.
with open(file_name, "w") as f:
    json.dump(dom, f)
- We could spill a lot of ink explaining the structure of the DOM snapshot. Honestly -- look at it yourself. This documentation will make much more sense if you look at the raw data in parallel.
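As a rough orientation before opening a fixture, the sketch below builds a tiny hand-written dictionary shaped like a raw snapshot and resolves the page title from it. Only the `documents`, `strings`, and `title` keys come from this page; the values are invented for illustration, and a real snapshot contains many more fields.

```python
# A hand-written, heavily trimmed stand-in for a raw DOM snapshot.
# Real snapshots carry many more keys; these values are invented.
snapshot = {
    "documents": [{"title": 1}],
    "strings": ["", "Example Domain"],
}

# The document does not hold its title directly: it holds an index
# into the global list of strings.
first_document = snapshot["documents"][0]
title_index = first_document["title"]
page_title = snapshot["strings"][title_index]
print(page_title)
```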
DOMs are hard to read: DOM snapshots are designed to be parsed quickly by a
computer, not to be easily understood by us humans. This means that they
contain a lot of references. For instance, instead of telling you the value of
a node, a snapshot gives you the index of the value in a global list of strings.
Sometimes, that index is -1, which means that the node doesn't have a value.
- We keep these references in the ChromeDOM class to keep the data as close to the Chrome DevTools Protocol as possible. This makes it easier to process the data in one pass. If you want to use the ChromeDOM class, you'll need to check for -1s yourself.
- We have a method, ChromeDOM.process_nodes(), that takes care of resolving these references and creating a list of DOMNodes.
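If you do resolve references by hand, the -1 check described above can be wrapped in a small helper. This is only a minimal sketch: `resolve_string` is a hypothetical name and not part of the Conatus API, and the sample strings are invented.

```python
from typing import List, Optional


def resolve_string(strings: List[str], index: int) -> Optional[str]:
    """Return the string at `index`, or None when the index is -1."""
    if index == -1:
        # -1 is the "no value" marker, not Python's "last element".
        return None
    return strings[index]


# Invented sample data: a global list of strings and three references.
strings = ["#document", "HTML", "Example Domain"]
references = [0, 1, -1]
resolved = [resolve_string(strings, ref) for ref in references]
print(resolved)
```

The explicit check matters because `strings[-1]` is valid Python and would silently return the last string instead of signalling a missing value.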
Additional references
- Chrome DevTools Protocol: The specification of the Chrome DOM snapshot.
- API: Processed DOM: The ProcessedDOM class, which is a cleaned-up version of the Chrome DOM.
- API: DOM nodes: The DOMNode class, which is a representation of a node in the DOM.
ChromeDOMTextBoxes
Bases: BaseModel
Chrome DOM text boxes data.
This class is largely unused in this repository.
As a general note: attributes are imported directly from the Chrome DevTools Protocol. Not all attributes are used. You might want to refer to the Chrome DevTools Protocol documentation: DOMSnapshot.TextBoxSnapshot
Note from Chrome: 'The exact layout should not be regarded as stable, and may change between versions.'
| ATTRIBUTE | DESCRIPTION |
|---|---|
| layout_index | Index into the LayoutTreeSnapshot nodes. Unused by the rest of the repository. |
| bounds | The absolute position bounding box. Unused by the rest of the repository. |
| start | The starting index in characters, for this post layout text. Unused by the rest of the repository. |
| length | The number of characters in this post layout text. Unused by the rest of the repository. |
ChromeDOMIndexValue
Bases: BaseModel
Chrome DOM index value.
This stores the index and value of a DOM node. The value of instances of this type cannot be understood without the context of the values: they could be the text value of the node, the input value of the node, the URL of the script that generated the node, etc.
In other words:
- index: The index of the node in the DOM. This is a universal node reference.
- value: The value of [CONTEXT] of the node, which is generally a reference to a string in the strings attribute of the ChromeDOM class.
As an example (in raw form):
It means here that:
- The node with index 101 has no input value.
- The node with index 184 has an input value of DOM.strings[214].
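To make the index/value pairing concrete, here is a minimal sketch that resolves such a pair of lists by hand. The dictionary shape (an `index` list next to a `value` list) is an assumption based on the description above, and the contents of the strings list are invented.

```python
# Invented global strings list; only position 214 matters here.
strings = [""] * 215
strings[214] = "hello@example.com"

# Assumed raw shape: two parallel lists, node indexes and string references.
input_value = {"index": [101, 184], "value": [-1, 214]}

# Pair each node index with its reference, treating -1 as "no value".
resolved_inputs = {}
for node_idx, string_ref in zip(input_value["index"], input_value["value"]):
    resolved_inputs[node_idx] = None if string_ref == -1 else strings[string_ref]

print(resolved_inputs)
```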
ChromeDOMIndexNoValue
Bases: BaseModel
Stores the index of a node in the DOM.
Generally used for nodes that have a boolean value, such as input_checked.
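For boolean fields of this kind, presence in the list is the value. A minimal sketch, assuming the raw data is a dictionary with a single `index` list (the node indexes below are invented):

```python
# Assumed raw shape: only the indexes of the nodes for which the flag is true.
input_checked = {"index": [12, 40]}

# A set gives constant-time membership tests while traversing the DOM.
checked_nodes = set(input_checked["index"])

# Nodes 12 and 40 are checked; node 11 is not listed, so it is not checked.
node_is_checked = [node_idx in checked_nodes for node_idx in (11, 12, 40)]
print(node_is_checked)
```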
ChromeDOMLayout
Bases: BaseModel
Chrome DOM layout data.
This class stores the layout of the DOM. It stores the data only for nodes
with a layout object. The first attribute of the class, node_index, is a
list of indexes that point to the nodes that have a layout object. The other
attributes are lists of the same size, which store either the values of the
layout objects or references to the ChromeDOM.strings attribute.
For example:
"layout": {
    "node_index": [
        2,
        ...
    ],
    "styles": [
        [],
        ...
    ],
    "bounds": [
        [
            0,
            0,
            2200,
            1400
        ],
        ...
    ],
    ...
}
This indicates that node #2 has no styles and a bounding box of (0, 0, 2200, 1400).
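Because the layout lists all have the same length, they can be walked in parallel to regroup the data per node. The sketch below does this on a plain dictionary rather than on a ChromeDOMLayout instance; the second entry is invented to make the pairing visible.

```python
# Trimmed layout data: three parallel lists of the same length.
layout = {
    "node_index": [2, 5],
    "styles": [[], []],
    "bounds": [[0, 0, 2200, 1400], [10, 20, 300, 40]],
}

# Position i in every list describes the same layout object.
layout_by_node = {
    node_idx: {"styles": styles, "bounds": bounds}
    for node_idx, styles, bounds in zip(
        layout["node_index"], layout["styles"], layout["bounds"]
    )
}
print(layout_by_node[2])
```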
General note for all ChromeDOM classes: attributes are imported directly from the Chrome DevTools Protocol. Not all attributes are used. You might want to refer to the Chrome DevTools Protocol documentation: DOMSnapshot.LayoutTreeSnapshot.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| node_index | Index of corresponding node in the NodeTreeSnapshot array. |
| bounds | The absolute position bounding box. |
| styles | Array of indexes specifying computed style strings. This is a reference to |
| text | Contents of the LayoutText, if any. The value is a reference to |
| stacking_contexts | Stacking context information. Unused by the rest of the repository. |
| paint_orders | Global paint order index. Unused by the rest of the repository. |
| offset_rects | The offset rect of nodes. Unused by the rest of the repository. |
| scroll_rects | The scroll rect of nodes. Unused by the rest of the repository. |
| client_rects | The client rect of nodes. Unused by the rest of the repository. |
get_flipped_index
Get a flipped index of nodes that are in the layout.
The original node index of the ChromeDOMLayout looks something like this:
This means that the first node in the layout is the node at index 0 in
the DOM, and its bounds are [0, 0, 2200, 1400].
To make it easier to traverse the DOM, we flip the index. Here, the
flipped index would be {0: 0, 2: 1}.
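The flipping step can be sketched in a few lines. This is an illustration on plain lists, not the method's actual source; the node index `[0, 2]` is inferred from the flipped result quoted above, and the second bounding box is invented.

```python
# Layout node index: position in the layout lists -> node index in the DOM.
node_index = [0, 2]
bounds = [[0, 0, 2200, 1400], [16, 16, 600, 48]]

# Flipped index: node index in the DOM -> position in the layout lists.
flipped_index = {node_idx: position for position, node_idx in enumerate(node_index)}

# While traversing the DOM, a node's layout data is now a dictionary lookup.
bounds_of_node_2 = bounds[flipped_index[2]]
node_1_has_layout = 1 in flipped_index
print(flipped_index, bounds_of_node_2, node_1_has_layout)
```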
| RETURNS | DESCRIPTION |
|---|---|
| dict[int, int] | Flipped index of the node index. |
Source code in conatus/utils/browser/dom/chrome.py
ChromeDOMNodeData
Bases: BaseModel
Chrome DOM node data. It stores the data of all nodes in the DOM.
This class is made of three types of fields:
- Lists of equivalent size, which store the data of the nodes. If these
lists are traversed in parallel, the data of the nodes can be
reconstructed.
  * The fields in question are parent_index, node_type, node_name, node_value, attributes, and backend_node_id.
- Key-value pairs, storing data for specific nodes in the DOM. Generally,
the key is the index of the node in the DOM (corresponding to the index in
the lists). For more information, see the class ChromeDOMIndexValue.
  * The fields in question are input_value, shadow_root_type, text_value, content_document_index, pseudo_type, pseudo_identifier, current_source_url, and origin_url.
- Lists of indices, which store the index of the nodes in the DOM. For more
information, see the class ChromeDOMIndexNoValue.
  * The fields in question are input_checked, option_selected, and is_clickable.
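The first kind of field, the parallel lists, can be turned back into per-node records as sketched below. The data is invented, the sketch works on a plain dictionary rather than a ChromeDOMNodeData instance, and it assumes that the root node's parent index is -1.

```python
# Invented data for three nodes: a document, an <html> element, a text node.
# Standard DOM nodeType constants: 1 = element, 3 = text, 9 = document.
strings = ["#document", "HTML", "#text", "Hello"]
node_data = {
    "parent_index": [-1, 0, 1],  # assumption: -1 marks the root
    "node_type": [9, 1, 3],
    "node_name": [0, 1, 2],      # references into `strings`
    "node_value": [-1, -1, 3],   # references into `strings`, -1 = no value
}


def resolve(ref):
    """Resolve a string reference, treating -1 as 'no value'."""
    return None if ref == -1 else strings[ref]


# Position i in every list describes node i.
nodes = []
for parent, node_type, name_ref, value_ref in zip(
    node_data["parent_index"],
    node_data["node_type"],
    node_data["node_name"],
    node_data["node_value"],
):
    nodes.append(
        {
            "parent": parent,
            "type": node_type,
            "name": resolve(name_ref),
            "value": resolve(value_ref),
        }
    )

print(nodes[2])
```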
General note for all ChromeDOM classes: attributes are imported directly from the Chrome DevTools Protocol. Not all attributes are used. You might want to refer to the Chrome DevTools Protocol documentation: DOMSnapshot.NodeTreeSnapshot.
| ATTRIBUTE | DESCRIPTION |
|---|---|
| parent_index | List of parent node indexes (one for each node). |
| node_type | List of DOM node types (one for each node). See |
| node_name | List of names (one for each node). The value is a reference to |
| node_value | List of values (one for each node). Varies depending on the |
| attributes | List of attributes of an Element node (one for each node). Each list is necessarily of even length. Each pair of elements is a key-value pair. Values might contain |
| input_value | Only for input elements: input's associated text value. The value is a reference to |
| input_checked | Indices of radio and checkbox elements that are checked. |
| option_selected | Indices of option elements that are selected. |
| is_clickable | Indices for nodes with a click event listener. |
| shadow_root_type | For specific nodes: Type of the shadow root the node is in. For more information, see MDN Web Docs. Unused by the rest of the repository. |
| backend_node_id | List of more stable IDs (one for each node). For more information, see DevTools Protocol GitHub. Unused by the rest of the repository. |
| text_value | Only set for textarea elements, contains the text value. The value is a reference to |
| content_document_index | Index of document in the list of snapshot docs. Unused by the rest of the repository. |
| pseudo_type | Type of a pseudo element node. Unused by the rest of the repository. |
| pseudo_identifier | Pseudo element identifier for this node. Unused by the rest of the repository. |
| current_source_url | The selected url for nodes with a srcset attribute. Unused by the rest of the repository. |
| origin_url | The url of the script (if any) that generates this node. Unused by the rest of the repository. |
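The even-length layout of the attributes field means that each node's list can be read two entries at a time. A minimal sketch, assuming that both names and values are stored as indexes into the global strings list (the data is invented):

```python
# Invented strings list and one node's flattened attribute list.
strings = ["href", "https://example.com", "class", "link"]
attribute_refs = [0, 1, 2, 3]  # name, value, name, value

# Step through the list two entries at a time to rebuild key-value pairs.
attributes = {
    strings[attribute_refs[i]]: strings[attribute_refs[i + 1]]
    for i in range(0, len(attribute_refs), 2)
}
print(attributes)
```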
get_flipped_input_value_index
Create a flipped index of nodes that have input values.
The original input value index of the ChromeDOMNodeData looks something like this:
This means that the first node in the input value is the node at index 101 in the DOM, and its value is 7.
To make it easier to traverse the DOM, we flip the index. So, the
flipped index would be: {101: 0, 184: 1}.
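A minimal sketch of how such a flipped index might be used during traversal. The index list and the value 7 come from the description above; the dictionary shape, the second value, and the helper name `input_value_ref` are assumptions made for illustration.

```python
# Assumed raw shape: node indexes next to their string references.
input_value = {"index": [101, 184], "value": [7, -1]}

# Node index in the DOM -> position in the input value lists.
flipped = {node_idx: position for position, node_idx in enumerate(input_value["index"])}


def input_value_ref(node_idx):
    """Return the string reference for a node, or None if it has no entry."""
    position = flipped.get(node_idx)
    if position is None:
        return None
    return input_value["value"][position]


print(flipped, input_value_ref(101), input_value_ref(5))
```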
| RETURNS | DESCRIPTION |
|---|---|
| dict[int, int] | Flipped index of the input value. |
Source code in conatus/utils/browser/dom/chrome.py
ChromeDOMDocument
Bases: BaseModel
Chrome DOM document snapshot data.
Most of the attributes are not used. We only use the nodes, layout, and
text_boxes attributes. We also use content_width and content_height to
calculate the device pixel ratio later on.
General note for all ChromeDOM classes: attributes are imported directly from the Chrome DevTools Protocol. Not all attributes are used. You might want to refer to the Chrome DevTools Protocol documentation: DOMSnapshot.DocumentSnapshot
| ATTRIBUTE | DESCRIPTION |
|---|---|
| nodes | A table with dom nodes. |
| layout | The nodes in the layout tree. |
| title | Document title. |
| content_width | Document content width. |
| content_height | Document content height. |
| document_url | Document URL that Document node points to. Generally, this is the URL of the page that the DOM snapshot was taken from. |
| scroll_offset_x | Horizontal scroll offset. |
| scroll_offset_y | Vertical scroll offset. |
| base_url | Base URL that Document node uses for URL completion. Unused by the rest of the repository. |
| content_language | Contains the document's content language. Unused by the rest of the repository. |
| encoding_name | Contains the document's character set encoding. Unused by the rest of the repository. |
| public_id | DocumentType node's publicId. Unused by the rest of the repository. |
| system_id | DocumentType node's systemId. Unused by the rest of the repository. |
| frame_id | Frame ID for document node. Unused by the rest of the repository. |
| text_boxes | The post-layout inline text nodes. Unused by the rest of the repository. |
ChromeDOM
Bases: BaseModel
Chrome DOM snapshot data.
The DOM consists of a document (technically a list of documents, but
usually there's only one) and a list of "strings" (e.g. text data).
For more detail, in particular relating to the node structure, see the class
ChromeDOMNodeData.
General note for all ChromeDOM classes: attributes are imported directly from the Chrome DevTools Protocol. Not all attributes are used. You might want to refer to the Chrome DevTools Protocol documentation: DOMSnapshot.captureSnapshot
| ATTRIBUTE | DESCRIPTION |
|---|---|
| documents | List of DOM documents. (There is usually only one document, and we only use the first one. See the |
| strings | List of DOM strings. |
| snapshot | Raw DOM snapshot, as passed by the Chrome DevTools Protocol. |
document
property
document: ChromeDOMDocument
Get the document of the Chrome DOM.
Important: There is usually only one document in the Chrome DOM. We only use the first one.
from conatus.utils.browser.dom.fixtures import (
example_chrome_dom
)
dom = example_chrome_dom()
dom_document = dom.document
title_idx = dom.document.title
assert dom.strings[title_idx] == "Example Domain"
| RETURNS | DESCRIPTION |
|---|---|
| ChromeDOMDocument | The document of the Chrome DOM. |
node_data
property
node_data: ChromeDOMNodeData
Get the node data of the Chrome DOM.
NOT a list of nodes: This is a representation of the
data as it appears in the Chrome DOM snapshot. Please see the class
ChromeDOMNodeData for
more information.
from conatus.utils.browser.dom.fixtures import (
example_chrome_dom_inputtypes
)
dom = example_chrome_dom_inputtypes()
assert len(dom.node_data.node_name) == 198
| RETURNS | DESCRIPTION |
|---|---|
| ChromeDOMNodeData | The node data of the Chrome DOM. |
layout
property
layout: ChromeDOMLayout
Get the layout of the Chrome DOM.
from conatus.utils.browser.dom.fixtures import (
example_chrome_dom
)
dom = example_chrome_dom()
assert len(dom.layout.node_index) == 11
| RETURNS | DESCRIPTION |
|---|---|
| ChromeDOMLayout | The layout of the Chrome DOM. |
page_title
property
page_title: str
Get the title of the page.
from conatus.utils.browser.dom.fixtures import (
example_chrome_dom
)
dom = example_chrome_dom()
assert dom.page_title == "Example Domain"
| RETURNS | DESCRIPTION |
|---|---|
| str | Title of the page. |
page_url
property
page_url: str
Get the URL of the page.
from conatus.utils.browser.dom.fixtures import (
example_chrome_dom
)
dom = example_chrome_dom()
assert dom.page_url == "https://example.com/"
| RETURNS | DESCRIPTION |
|---|---|
| str | URL of the page. |
from_page_async
async
classmethod
Get the Chrome DOM from a Playwright or Conatus page.
Note: The expected type for a Playwright page is
playwright.async_api._generated.Page.
from conatus.utils.browser import Browser
from conatus.utils.browser.dom.chrome import ChromeDOM
browser = Browser()
browser.goto("https://example.com")
dom = ChromeDOM.from_page(browser.page)
assert dom.strings[dom.document.title] == "Example Domain"
| PARAMETER | DESCRIPTION |
|---|---|
| page | Page object (Playwright or Conatus). |

| RETURNS | DESCRIPTION |
|---|---|
| ChromeDOM | Chrome DOM of the page. |
Source code in conatus/utils/browser/dom/chrome.py
from_page
classmethod
Get the Chrome DOM from a Playwright or Conatus page.
More information can be found in the docstring of from_page_async, the async sibling of
this function.
| PARAMETER | DESCRIPTION |
|---|---|
| page | The Page object (either a Playwright Page or a Conatus Page). |

| RETURNS | DESCRIPTION |
|---|---|
| ChromeDOM | Chrome DOM of the page. |
Source code in conatus/utils/browser/dom/chrome.py
get_snapshot_async
async
classmethod
Get the raw Chrome DOM snapshot of the page.
Note: The expected type for the client argument is
playwright.async_api._generated.CDPSession.
from conatus.utils.browser import Browser
from conatus.utils.browser.dom.chrome import ChromeDOM
browser = Browser()
browser.goto("https://example.com")
snapshot = ChromeDOM.get_snapshot(browser.page.cdp_session)
title_ptr = snapshot["documents"][0]["title"]
assert snapshot["strings"][title_ptr] == "Example Domain"
# You can then pass this snapshot to ChromeDOM.from_snapshot:
# dom: ChromeDOM = ChromeDOM.from_snapshot(snapshot)
| PARAMETER | DESCRIPTION |
|---|---|
| client | CDPSession object. |

| RETURNS | DESCRIPTION |
|---|---|
| dict | Raw Chrome DOM of the page. |
Source code in conatus/utils/browser/dom/chrome.py
get_snapshot
classmethod
Get the raw Chrome DOM snapshot of the page.
More information can be found in the docstring of get_snapshot_async, the async sibling of
this function.
| PARAMETER | DESCRIPTION |
|---|---|
| client | CDPSession object. |

| RETURNS | DESCRIPTION |
|---|---|
| dict | Raw Chrome DOM of the page. |
Source code in conatus/utils/browser/dom/chrome.py
from_snapshot
staticmethod
Get a Chrome DOM object from a DOM snapshot.
import json
from conatus.utils.browser.dom.chrome import ChromeDOM
file = "tests/fixtures/example.com_dom.json"
# Load the raw snapshot; the context manager closes the file afterwards.
with open(file) as f:
    snapshot = json.load(f)
dom = ChromeDOM.from_snapshot(snapshot)
assert dom.strings[dom.document.title] == "Example Domain"
| PARAMETER | DESCRIPTION |
|---|---|
| snapshot | Chrome DOM snapshot of the page. |

| RETURNS | DESCRIPTION |
|---|---|
| ChromeDOM | Chrome DOM of the page. |
Source code in conatus/utils/browser/dom/chrome.py
from_json
staticmethod
Get a ChromeDOM object from a snapshot JSON file.
Note: If you want to pass a raw JSON object, you should use
ChromeDOM.from_snapshot.
from conatus.utils.browser.dom.chrome import ChromeDOM
file = "tests/fixtures/example.com_dom.json"
dom = ChromeDOM.from_json(file)
assert dom.strings[dom.document.title] == "Example Domain"
| PARAMETER | DESCRIPTION |
|---|---|
| path_or_json | Path to the JSON file. |

| RETURNS | DESCRIPTION |
|---|---|
| ChromeDOM | Chrome DOM of the page. |

| RAISES | DESCRIPTION |
|---|---|
| FileNotFoundError | If the file is not found. |
| TypeError | If the JSON file is not a dictionary. |
Source code in conatus/utils/browser/dom/chrome.py
get_device_pixel_ratio
Get the device pixel ratio of the page.
The device pixel ratio is the ratio between two types of pixels: CSS pixels and device pixels. Playwright knows the device pixels, and the Chrome DOM snapshot knows the CSS pixels. By reconciling the two, we can get the device pixel ratio.
from conatus.utils.browser.dom.fixtures import (
example_chrome_dom,
)
dom = example_chrome_dom()
width = 1100
assert dom.get_device_pixel_ratio(width) == 2.0
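The arithmetic behind this can be sketched as a single division. This is an assumption about the calculation, not the method's actual source: it supposes that the ratio is the document's content width from the snapshot divided by the width reported by Playwright, and that the fixture's content width is 2200, matching the bounding box shown in the layout example. `device_pixel_ratio` is a hypothetical helper.

```python
def device_pixel_ratio(content_width: float, page_width: float) -> float:
    """Divide the snapshot's content width by the page width."""
    if page_width <= 0:
        raise ValueError("page_width must be positive")
    return content_width / page_width


# Assumed values: 2200 from the snapshot, 1100 from Playwright.
ratio = device_pixel_ratio(2200, 1100)
print(ratio)
```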
| PARAMETER | DESCRIPTION |
|---|---|
| width | The actual width of the page (generally obtained by Playwright's Page object). Will work with an int. |

| RETURNS | DESCRIPTION |
|---|---|
| float | Device pixel ratio of the page. |
Source code in conatus/utils/browser/dom/chrome.py
process_nodes
Process the nodes of the Chrome DOM to a list of DOMNode objects.
from conatus.utils.browser.dom.fixtures import (
example_chrome_dom_inputtypes,
)
dom = example_chrome_dom_inputtypes()
width = 1100
nodes = dom.process_nodes(width)
assert len(nodes) == 198
| PARAMETER | DESCRIPTION |
|---|---|
| width | The width of the page. We need it to calculate the bounds of the nodes. |

| RETURNS | DESCRIPTION |
|---|---|
| list[DOMNode] | The nodes of the Chrome DOM. |
Source code in conatus/utils/browser/dom/chrome.py