Commit Graph

257 Commits

Author SHA1 Message Date
Adrià Arrufat
3d6d669a50 testing: add LogFilter utility for scoped log suppression 2026-03-12 13:56:53 +09:00
Adrià Arrufat
60699229ca Merge branch 'main' into semantic-tree 2026-03-11 20:52:39 +09:00
Adrià Arrufat
af803da5c8 cdp.lp: use enum for getSemanticTree format param
Leverages std.json.parse to automatically validate the format param into a type-safe enum.
2026-03-11 16:21:43 +09:00
Karl Seguin
2a103fc94a Use Session as a container for cross-frame resources
The introduction of frames means that data is no longer tied to a specific Page
or Context. 255b9a91cc introduced Origins for
v8 values shared across frames of the same origin. The commit highlighted the
lifetime mismatched that we now have with data that can outlive 1 frame. A
specific issue with that commit was the finalizers were still Context-owned.
But like any other piece of data, that isn't right; aside from modules, nothing
should be context-owned.

This commit continues where the last left off and moves finalizers from Context
to Origin. This is done in a separate commit because it introduces significant
changes. Currently, finalizers take a *Page, but that's no longer correct. A
value created in one Page, can outlive the Page. We need another container. I
original thought to use Origin, but that isn't know to CDP/MCP. Instead, I
decide to enhance the Session.

Session is now the owner of the page.arena, the page.factory and the
page.arena_pool. Finalizers are given a *Session which they can use to release
their arena.
2026-03-11 08:44:49 +08:00
Karl Seguin
94ce5edd20 Frames on the same origin share v8 data
Depends on: https://github.com/lightpanda-io/zig-v8-fork/pull/153

In some ways this is an extension of
https://github.com/lightpanda-io/browser/pull/1635 but it has more implications
with respect to correctness.

A js.Context wraps a v8::Context. One of the important thing it adds is the
identity_map so that, given a Zig instance we always return the same v8::Object.

But imagine code running in a frame. This frame has its own Context, and thus
its own identity_map. What happens when that frame does:

```js
window.top.frame_loaded = true;
```

From Zig's point of view, `Window.getTop` will return the correct Zig instance.
It will return the *Window references by the "root" page. When that instance is
passed to the bridge, we'll look for the v8::Object in the Context's
`identity_map` but wont' find it. The mapping exists in the root context
`identity_map`, but not within this frame. So we create a new v8::Object and now
our 1 zig instance has N v8::Objects for every page/frame that tries to access
it.

This breaks cross-frame scripting which should work, at least to some degree,
even when frames are on the same origin.

This commit adds a `js.Origin` which contains the `identity_map`, along with our
other `v8::Global` storage. The `Env` now contains a `*js.Origin` lookup,
mapping an origin string (e.g. lightpanda.io:443) to an *Origin. When a Page's
URL is changed, we call `self.js.setOrigin(new_url)` which will then either get
or create an origin from the Env's origin lookup map.

js.Origin is reference counted so that it remains valid so long as at least 1
frame references them.

There's some special handling for null-origins (i.e. about:blank). At the root,
null origins get a distinct/isolated Origin. For a frame, the parent's origin
is used.

Above, we talked about `identity_map`, but a `js.Context` has 8 other fields
to track v8 values, e.g. `global_objects`, `global_functions`,
`global_values_temp`, etc. These all must be shared by frames on the same
origin. So all of these have also been moved to js.Origin. They've also been
merged so that we now have 3 fields: `identity_map`, `globals` and `temps`.

Finally, when the origin of a context is changed, we set the v8::Context's
SecurityToken (to that origin). This is a key part of how v8 allows cross-
context access.
2026-03-11 08:43:40 +08:00
Nikolay Govorov
3626f70d3e Merge pull request #1759 from lightpanda-io/wp/mrdimidum/net-poll-runtime
Network poll runtime
2026-03-10 23:38:07 +00:00
Pierre Tachoire
1ebf7460fe Merge pull request #1768 from lightpanda-io/inspector_cleanup
Call `resetContextGroup` on page removal
2026-03-10 15:32:47 +01:00
Karl Seguin
11fb5f990e Call resetContextGroup on page removal
Calling it here ensures that the inspector gets reset on internal page
navigation. We were seeing intermittent segfaults on a problematic WPT tests
(/encoding/legacy-mb-japanese/euc-jp/) which I believe this solves.

(The tests are still broken. Because we don't support form targets, they cause
the root page to reload in a tight cycle, causing a lot of context creation /
destruction, which I thin was the issue. This commit doesn't fix the broken test
but it hopefully fixes the crash).

Also, clear out the Inspector's default_context when the default context is
destroyed. (This was the first thing I did to try to fix the crash, it didn't
work, but I believe it's correct).
2026-03-10 20:50:58 +08:00
Adrià Arrufat
d1ee0442ea Merge branch 'main' into semantic-tree 2026-03-10 21:48:49 +09:00
Adrià Arrufat
62f31ea24a Merge pull request #1765 from egrs/lp-get-structured-data
add LP.getStructuredData CDP command
2026-03-10 21:48:18 +09:00
Pierre Tachoire
12c5bcd24f cdp: reszie the screenshot to 1920x1080
To be consistent w/ layout size returned
2026-03-10 10:09:53 +01:00
Adrià Arrufat
56f47ee574 Merge branch 'main' into semantic-tree 2026-03-10 17:26:34 +09:00
egrs
74f0436ac7 merge main, resolve conflicts with getInteractiveElements 2026-03-10 09:25:12 +01:00
egrs
22d31b1527 add LP.getStructuredData CDP command 2026-03-10 09:19:51 +01:00
Karl Seguin
9f3bca771a Merge pull request #1755 from lightpanda-io/cdp-page-layout-metrics
cdp: add a dummy Page.getLayoutMetrics
2026-03-10 16:16:17 +08:00
Adrià Arrufat
4e16d90a81 Merge pull request #1757 from egrs/lp-get-interactive-elements
add LP.getInteractiveElements CDP command
2026-03-10 17:15:18 +09:00
Pierre Tachoire
d669d5c153 cdp: add a dummy Page.getLayoutMetrics 2026-03-10 08:54:48 +01:00
egrs
dc3958356d address review feedback
- TreeWalker.Full instead of FullExcludeSelf so querying a specific
  nodeId evaluates the root element itself
- resolve href to absolute URL via URL.resolve
- isDisabled checks ancestor <fieldset disabled> with legend exemption
- parameter order: allocator before *Page per convention
2026-03-10 08:13:01 +01:00
Nikolay Govorov
8e59ce9e9f Prepare global NetworkRuntime module 2026-03-10 03:00:47 +00:00
egrs
a417c73bf7 add LP.getInteractiveElements CDP command
Returns a structured list of all interactive elements on a page:
buttons, links, inputs, ARIA widgets, contenteditable regions, and
elements with event listeners. Includes accessible names, roles,
listener types, and key attributes.

Event listener introspection (both addEventListener and inline
handlers) is unique to LP — no other browser exposes this to
automation code.
2026-03-09 19:46:12 +01:00
Pierre Tachoire
8672232ee2 cdp: add dummy page.captureScreenshot 2026-03-09 17:38:57 +01:00
Adrià Arrufat
b8a3135835 SemanticTree: add pruning support and move logic to walk 2026-03-09 13:02:03 +09:00
Adrià Arrufat
b674c2e448 CDP/MCP: add highly compressed text format for semantic tree 2026-03-08 22:42:00 +09:00
Adrià Arrufat
248851701f Refactor: move SemanticTree to core and expose via MCP tools 2026-03-06 15:44:03 +09:00
Adrià Arrufat
0f46277b1f CDP: implement LP.getSemanticTree for native semantic DOM extraction 2026-03-06 15:29:32 +09:00
Pierre Tachoire
6a8174a15c cdp: don't dispatch executionContextsCleared on frame navigation 2026-03-04 14:45:21 +01:00
Karl Seguin
01fab5c92a Merge pull request #1706 from lightpanda-io/cdp-attach-to-browser
cdp: fix send CDP raw command with Playwright
2026-03-04 07:40:05 +08:00
Karl Seguin
6f0cd87d1c Merge pull request #1703 from lightpanda-io/client_and_script_manager
Fix a few issues in Client
2026-03-04 07:32:14 +08:00
Pierre Tachoire
9ca5188e12 cdp: set consistent target's default
with about:blank for url and empty title.
2026-03-03 17:24:08 +01:00
Pierre Tachoire
56cc881ac0 Fcdp: fix attachtToTarget and attachToBrowserTarget resp 2026-03-03 15:01:53 +01:00
Pierre Tachoire
06ef6d3e6a cdp: attachToTarget must add the session id 2026-03-03 12:58:00 +01:00
Pierre Tachoire
14b58e8062 add target.attachToBrowserTarget 2026-03-03 12:58:00 +01:00
Pierre Tachoire
eee232c12c cdp: allow multiple calls to attachToTarget
Playwright, when creating a new CDPSession, sends an
attachToBrowserTarget followed by another attachToTarget to re-attach
itself to the existing target.

see playwright/axtree.js from demo/ repository.
2026-03-03 12:58:00 +01:00
Karl Seguin
523efbd85a Fix a few issues in Client
Most significantly, if removing from the multi fails, the connection
is added to a "dirty" list for the removal to be retried later. Looking at
the curl source code, remove fails on a recursive call, and we've struggled with
recursive calls before, so I _think_ this might be happening (it fails in other
cases, but I suspect if it _is_ happening, it's for this reason). The retry
happens _after_ `perform`, so it cannot fail for due to recursiveness. If it
fails at this point, we @panic. This is harsh, but it isn't easily recoverable
and before putting effort into it, I'd like to know that it's actually happening.

Fix potential use of undefined when a 401-407 request is received, but no
'WWW-Authenticate' or 'Proxy-Authenticate' header is received.

Don't call `curl_multi_remove_handle` on an easy that hasn't been added yet do
to error. Specifically, if `makeRequest` fails during setup, transfer_conn is
nulled so that `transfer.deinit()` doesn't try to remove the connection. And the
conn is removed from the `in_use` queue and made `available` again.

On Abort, if getting the private fails (extremely unlikely), we now still try
to remove the connection from the multi.

Added a few more fields to the famous "ScriptManager.Header recall" assertion.
2026-03-03 18:02:06 +08:00
Adrià Arrufat
b2e301418f cdp.lp: use page.document instead of window._document 2026-03-03 17:11:16 +09:00
Adrià Arrufat
334a2e44a1 lp: simplify dom_node resolution in getMarkdown 2026-03-03 17:08:43 +09:00
Adrià Arrufat
c9121a03d2 cdp: move LP.getMarkdown test to lp domain 2026-03-03 16:39:31 +09:00
Adrià Arrufat
cc93180d57 cdp: add LP domain and getMarkdown method
This PR introduces a custom CDP domain 'LP' (Lightpanda) to expose browser-specific tools. The first method, 'LP.getMarkdown', allows retrieving a Markdown representation of the DOM or a specific node by its 'nodeId'. This is optimized for AI agents and LLM-based scraping tasks.
2026-03-03 16:35:48 +09:00
Karl Seguin
10ad5d763e Rename page.id to page._frame_id
This field was recently added and is used to generate correct frameIds in CDP
messages. They remain the same during a navigation event, so calling them
page.id might cause surprises since navigation events create new pages, but
retain the original id. Hence, frame_id is more accurate and hopefully less
surprising.

(This is a small cleanup prior to doing some iframe navigation work).
2026-03-02 16:21:29 +08:00
Karl Seguin
e65667963f Correctly JSON encode URL
I think this code comes from some serialization tweak from when everything was
an std.Uri and by switch to [:0]const u8 everywhere not only was the tweak
unecessary, it was also wrong - possibly resulting in the generation of
invalid JSON.
2026-02-28 12:48:45 +08:00
Karl Seguin
315c9a2d92 Add RC support to NodeList
Most importantly, this allows the Selector.List to be self-contained with
an arena from the ArenaPool. Selector.List can be both relatively large
and relatively common, so moving it off the page.arena is a nice win.

Also applied this to ChildNodes, which is much smaller but could also be
called often.

I was initially going to hook into the v8::Object's internal fields to store
the referencing v8::Object. So the v8::Object representing the Iterator
would store the v8::Object representing the NodeList inside of its internal
field - which the GC would trace/detect/respect. And that is probably the
fastest and most v8-ish solution, but I couldn't come up with an elegant
solution. The best I had was having a "afterCreate" callback which passed
the v8 object (this is similar to the old postAttach callback we had, but
used for a different purpose). However, since "acquireRef" was recently
added to events, re-using that was much simpler and worked well.
2026-02-27 10:29:46 +08:00
Karl Seguin
21be3db51f Callers to page.navigate ensure URL is properly encoded.
Follow up to https://github.com/lightpanda-io/browser/pull/1646

The encodeURL (renamed to ensureEncoded and exposed in this commit) already
handled already-encoded URLs, so this was largely a matter of exposing the
functionality.

The reason this isn't baked directly into Page.navigate is that, in some places
e.g. internal navigation, the URL is already know to be encoded. So it's up
to every caller to make sure they are passing a valid URL to navigate.
2026-02-26 12:22:06 +08:00
Karl Seguin
71d34592d9 add frame created cdp messages 2026-02-19 23:47:33 +08:00
Karl Seguin
db2927eea7 cleanup a not-so-great rebase 2026-02-19 23:47:33 +08:00
Karl Seguin
bb01a5cb31 Make CDP frame-aware 2026-02-19 23:47:33 +08:00
Karl Seguin
938cd5e136 Merge pull request #1582 from lightpanda-io/cdp_per_page_frame_id
Rework CDP frameIds (and loaderIds and requestIds and interceptorIds)
2026-02-19 22:16:52 +08:00
Karl Seguin
e2a1ce623c Rework CDP frameIds (and loaderIds and requestIds and interceptorIds)
Our BrowsingContext currently supports 1 target. So we have a per-BC target_id.
Previously, our target had 1 "frame" - our page. So we often treated the
targetId as the frameId. But to work with frames, we need page-specific
frameIds and loaderIds.

This tries to clean up our ids (a little). frameIds are now ids derived from
a new incrementing page.id. This page.id has to be passed around (via http
Requests and through notifications) in order to properly generate messages with
a frameId.
2026-02-19 13:01:41 +08:00
Karl Seguin
645da2e307 Reduce cost of various Element render-related properties.
Added a get-only `getStyle` which doesn't lazily create a new style if none
exists. This can be used in the (frequently used) `checkVisibility` to avoid
an allocation. Added a specialized getBoundingClientRectForVisible which
skips the checkVisibility check, since a few callers have already done their
own visibility check.

DOMRect is now off the heap. This avoids _a lot_ of allocation when a DOMRect
is only needed for internal calculation, e.g. in Document.elementFromPoint.
2026-02-19 09:45:56 +08:00
egrs
628049cfd7 add cookie size to CDP response
Compute and include the cookie size field (name.len + value.len)
in Storage.getCookies and Network.getCookies CDP responses,
matching Chrome's behavior.
2026-02-17 13:08:02 +01:00
Nikolay Govorov
6553bb8147 Remove los interceptors feature 2026-02-16 15:48:18 +00:00