There are two risks to this change. The first is that I made a mistake. The
second is that one of the APIs that doesn't currently return an error starts
returning one in the future.
This further reduces the bouncing between page and server for loop polling: if
there is a page, the page polls; if there isn't, the server polls. Simpler.
There's a flaky performance test that I wanted to fix (1). This led to a couple
of changes.
1 - Add timestamp() and milliTimestamp() to datetime.zig. This reduces some
code duplication and uses better clock_ids where available (sketched below).
2 - Change the Performance API to use milliTimestamp and store a u64 instead of
an f64. While the spec says a float, Firefox uses a u64, and implicit
conversion is always available. This makes our APIs simpler.
(1) - https://github.com/lightpanda-io/browser/actions/runs/17313296490/job/49151366798#step:4:131
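A minimal sketch of what the two helpers could look like, assuming a recent
(0.14-style) Zig posix API; the clock selection and error handling are my
illustration of "better clock_ids", not the actual implementation:
```
const std = @import("std");
const builtin = @import("builtin");
const posix = std.posix;

// Assumption: CLOCK_REALTIME_COARSE is cheaper on Linux and millisecond
// precision is all the callers need; elsewhere, fall back to CLOCK_REALTIME.
const clock_id: posix.clockid_t = if (builtin.os.tag == .linux)
    .REALTIME_COARSE
else
    .REALTIME;

pub fn milliTimestamp() u64 {
    const ts = posix.clock_gettime(clock_id) catch unreachable;
    const secs: u64 = @intCast(ts.sec);
    const millis: u64 = @intCast(@divTrunc(ts.nsec, std.time.ns_per_ms));
    return secs * std.time.ms_per_s + millis;
}

pub fn timestamp() u64 {
    return milliTimestamp() / std.time.ms_per_s;
}
```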
Rather than stack-allocating MAX_MESSAGE_SIZE upfront, we now allocate 32KB
and grow the buffer as needed for larger messages, up to MAX_MESSAGE_SIZE.
This will reduce memory usage for drivers that, unlike Playwright, don't send
huge payloads.
While not implemented, this would also enable us to set the MAX_MESSAGE_SIZE
at runtime (e.g. via a command line option).
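A rough sketch of the idea; the names, the initial size, and the doubling
growth policy are illustrative assumptions, not the actual implementation:
```
const std = @import("std");

const MAX_MESSAGE_SIZE = 512 * 1024; // placeholder cap for illustration
const INITIAL_SIZE = 32 * 1024;

const Reader = struct {
    buf: []u8,
    allocator: std.mem.Allocator,

    fn init(allocator: std.mem.Allocator) !Reader {
        return .{
            .buf = try allocator.alloc(u8, INITIAL_SIZE),
            .allocator = allocator,
        };
    }

    // Grow (by doubling) only when a message actually needs it,
    // never past the cap.
    fn ensureCapacity(self: *Reader, needed: usize) ![]u8 {
        if (needed > MAX_MESSAGE_SIZE) return error.MessageTooLarge;
        if (needed <= self.buf.len) return self.buf;
        var new_len = self.buf.len;
        while (new_len < needed) new_len *= 2;
        self.buf = try self.allocator.realloc(self.buf, @min(new_len, MAX_MESSAGE_SIZE));
        return self.buf;
    }
};
```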
Previously, the IO loop was doing three things:
1 - Managing timeouts (either from scripts or for our own needs)
2 - Handling browser IO events (page/script/xhr)
3 - Handling CDP events (accept, read, write, timeout)
With the libcurl merge, 1 was moved to an in-process scheduler and 2 was moved
to libcurl's own event loop. That meant the entire loop code, including
the dependency on tigerbeetle-io, existed to handle a single TCP client.
Not only is that a lot of code, there was also friction between the two loops
(the libcurl one and our IO loop), which would result in latency: while one
loop is waiting for events, any events on the other loop go unprocessed.
This PR removes our IO loop. To accomplish this:
1 - The main accept loop is blocking. This is simpler and works perfectly well,
given that we only allow 1 active connection.
2 - The client socket is passed to libcurl - yes, libcurl's loop can take
arbitrary FDs and poll them along with its own (see the sketch below).
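The waitfd mechanism is real libcurl API (curl_multi_poll with a
struct curl_waitfd); the code around it is a hedged sketch of how the CDP
socket could be polled alongside libcurl's own transfers:
```
const c = @cImport(@cInclude("curl/curl.h"));

// Wait (up to 100ms) on libcurl's transfers AND the CDP client socket,
// whichever becomes ready first. Returns true if the client is readable.
fn pollOnce(multi: *c.CURLM, client_fd: c.curl_socket_t) !bool {
    var extra = c.struct_curl_waitfd{
        .fd = client_fd,
        .events = @intCast(c.CURL_WAIT_POLLIN),
        .revents = 0,
    };
    var num_fds: c_int = 0;
    if (c.curl_multi_poll(multi, &extra, 1, 100, &num_fds) != c.CURLM_OK) {
        return error.CurlPollFailed;
    }
    return extra.revents != 0;
}
```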
In addition to having one less dependency, the CDP code is quite a bit simpler,
especially around shutdowns and writes. This also removes _some_ of the latency
caused by the friction between page processing and CDP processing. Specifically,
when CDP now blocks for input, HTTP page events (script loading, XHR, ...) will
still be processed.
There's still friction. For one, the reverse isn't true: when the page is
waiting for events, CDP events aren't going to be processed. But page.wait
already has some sensitivity to this (e.g. the page.request_intercepted flag).
Also, when CDP waits, while we will process network events, page timeouts are
still not processed. Because of both of these remaining issues, we still need
to jump between the two loops - but being able to block on CDP (even for a
short time) WITHOUT stopping the page's network I/O should reduce some latency.
1 - Make log_level a runtime option (not a build-time one)
2 - Make log_format a runtime option
3 - In Debug mode, allow for log scope filtering
Improve the general usability of scopes. Previously, the scope was more or less
based on the file the log call was in. Now they are grouped more logically.
Consider the case where you want to silence HTTP request information: previously,
you'd have to filter out the `page`, `xhr` and `http_client` scopes, but that
would also eliminate other page, xhr and http_client logs. Now, you can just
filter out the `http` scope.
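With Zig's std.log, runtime scope filtering can be done along these lines;
the filter list and wiring here are illustrative, not the actual
implementation:
```
const std = @import("std");

// Scopes to silence; imagine this being populated at startup in Debug mode.
var filtered: []const []const u8 = &.{"http"};

pub fn logFn(
    comptime level: std.log.Level,
    comptime scope: @TypeOf(.enum_literal),
    comptime format: []const u8,
    args: anytype,
) void {
    // Drop the message if its scope was filtered out at runtime.
    for (filtered) |name| {
        if (std.mem.eql(u8, name, @tagName(scope))) return;
    }
    std.log.defaultLog(level, scope, format, args);
}

// Wired up via: pub const std_options: std.Options = .{ .logFn = logFn };
```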
Output is logfmt in release builds and a "pretty" print in debug mode. The
format, along with the log level, will become arguments to the binary at some
point in the future.
When the browser microtask was added, Zig-specific timeout functions were
added to the loop. This was necessary for two reasons:
1 - The existing functions were JS specific
2 - We wanted a different reset counter for JS and Zig
Like we did in https://github.com/lightpanda-io/browser/pull/577, the loop is
now JS-agnostic. It gets a Zig callback, and the Zig callback can execute JS
(or do whatever). An intrusive node, like with events, is used to minimize
allocations.
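Roughly, and with made-up names, the intrusive pattern looks like this
(0.14-style std.DoublyLinkedList):
```
const std = @import("std");

const List = std.DoublyLinkedList(void);

// The list node is embedded in the timeout itself: scheduling one costs no
// allocation, and the callback is plain Zig (which may run JS, or not).
const Timeout = struct {
    node: List.Node = .{ .data = {} },
    fire_at_ms: u64,
    callback: *const fn (*Timeout) void,
};

fn runExpired(list: *List, now_ms: u64) void {
    var it = list.first;
    while (it) |node| {
        const next = node.next; // grab before the node is removed
        const timeout: *Timeout = @fieldParentPtr("node", node);
        if (timeout.fire_at_ms <= now_ms) {
            list.remove(node);
            timeout.callback(timeout);
        }
        it = next;
    }
}
```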
Also, because the microtask was recently moved to the page, there is no longer
a need for separate event counters. All timeouts are scoped to the page.
The new timeout callback can now be used to efficiently reschedule a task. This
reuses the IO.completion and Context, avoiding 2 allocations. More importantly,
it keeps the internal timer_id stable for the lifetime of an "interval". This
is important for window.setInterval, where the callback can itself clear the
interval - something we would otherwise need to detect in the callback handler
to avoid re-scheduling. With a stable timer_id, the existing cancel mechanism
works as expected.
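Building on the sketch above, a self-rescheduling interval could look like
this; the cancelled flag and the scheduler hook are assumptions:
```
// Hypothetical interval: firing re-arms the SAME object, so its timer_id
// never changes and clearInterval-from-within-the-callback just works.
const Interval = struct {
    timer_id: u32,
    fire_at_ms: u64,
    interval_ms: u64,
    cancelled: bool = false,
    callback: *const fn (*Interval) void,

    fn fired(self: *Interval, schedule: *const fn (*Interval) void) void {
        self.callback(self); // may set self.cancelled via the stable timer_id
        if (self.cancelled) return;
        self.fire_at_ms += self.interval_ms;
        schedule(self); // reuses the node/completion: no new allocation
    }
};
```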
The loop no longer has a cbk_error. Callback code is expected to try/catch
callbacks (or use callback.tryCall) and handle errors accordingly.
Previously, we could have multiple in-flight messages from the server to a
single client. This isn't safe and can lead to message interleaving. While
write / send are atomic, they are only atomic for the N bytes which they write,
which may not be the entire buffer. Consider this writeAll function:
```
const std = @import("std");
const posix = std.posix;

pub fn writeAll(socket: posix.socket_t, bytes: []const u8) !void {
    var index: usize = 0;
    while (index < bytes.len) {
        // posix.write may accept fewer bytes than requested; keep going
        // until the whole buffer has been handed to the kernel.
        index += try posix.write(socket, bytes[index..]);
    }
}
```
If we're trying to send "abc123", this could take anywhere from 1 to 6 calls
to posix.write (it would take 6 calls, for example, if every call to
posix.write wrote only a single byte). If you're trying to write other data to
the same socket at the same time, the messages _will_ get interleaved.
To make this work, the client now has a send_queue (a doubly linked list).
When one message finishes sending, the next is sent.
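A sketch of the queue (names hypothetical, 0.14-style std.DoublyLinkedList):
only the head message is ever in flight, and the write-completion callback
kicks off the next one.
```
const std = @import("std");

const SendQueue = std.DoublyLinkedList([]const u8);

const Client = struct {
    allocator: std.mem.Allocator,
    send_queue: SendQueue = .{},

    fn send(self: *Client, msg: []const u8) !void {
        const node = try self.allocator.create(SendQueue.Node);
        node.* = .{ .data = msg };
        self.send_queue.append(node);
        // Nothing else queued means nothing in flight: start writing.
        if (self.send_queue.len == 1) self.submitWrite(node);
    }

    fn onWriteComplete(self: *Client, node: *SendQueue.Node) void {
        self.send_queue.remove(node);
        self.allocator.destroy(node);
        if (self.send_queue.first) |next| self.submitWrite(next);
    }

    fn submitWrite(self: *Client, node: *SendQueue.Node) void {
        // hand node.data to the async write machinery
        _ = self;
        _ = node;
    }
};
```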
In addition to the above change, the Client is now self-contained with respect
to its lifetime. This is necessary so that completions which come in AFTER our
concept of its lifetime ends can still be processed. I think all types that
receive completions need to follow this model. This relies on the fact that
kqueue (which I know for a fact) and io_uring (which people seem to imply)
handle socket shutdown properly. It's still a bit messy because of timeouts:
we don't want to wait until the timeout to accept new connections, but we do
need to wait until the timeout to clean up the client.
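The pattern, roughly (names are made up): the Client counts in-flight
completions and only tears itself down once the kernel can no longer deliver
one.
```
const Client = struct {
    pending: usize = 0, // completions submitted but not yet delivered
    shutdown: bool = false,

    fn submitted(self: *Client) void {
        self.pending += 1;
    }

    fn completed(self: *Client) void {
        self.pending -= 1;
        self.maybeDestroy();
    }

    fn startShutdown(self: *Client) void {
        self.shutdown = true;
        self.maybeDestroy();
    }

    // Safe to free only when no completion can still reference us.
    fn maybeDestroy(self: *Client) void {
        if (self.shutdown and self.pending == 0) {
            // close the socket, free buffers, release the Client
        }
    }
};
```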
The self-contained nature of Client makes it difficult to test as a generic,
so I removed Client(T). Tests now use real sockets. Some tests had to be
removed because they're too difficult to run over a real connection :(
Add CDP testing helpers (mock Browser, Session, Page and Client). These are
placeholders until tests are added that use them.
Also added a couple of CDP tests.
For the time being, given that we only allow 1 client at a time, I took a
shortcut to implement this. The server has an incrementing "current_client_id"
which is part of every completion. In the completion callback, we just check
whether its client_id is still equal to the server's current_client_id.
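The guard is simple; sketched with hypothetical names:
```
const Server = struct {
    current_client_id: u32 = 0,
};

const Completion = struct {
    client_id: u32,
    // ... operation-specific state ...
};

fn onCompletion(server: *Server, completion: *Completion) void {
    // A completion issued for a previous client can land after that client
    // is gone; the id check turns it into a no-op.
    if (completion.client_id != server.current_client_id) return;
    // ... process the completion for the still-active client ...
}
```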