We currently set request._keepalive prematurely. There are [error cases] where
the request could be abandoned before being fully drained. While we do try to
drain in some cases, it isn't always possible. For this reason,
request._keepalive is now only set at the end of the request lifecycle, at which
point we know the connection is ready to be re-used.
Test speed has been improved only slightly by tweaking tests that took 2 seconds to run.
Build has been improved by:
1 - moving logFunctionCallError out of js.Caller and to a standalone function
2 - removing some non-generic code from the generic portions of the logger
Caller.getter and Caller.setter have been removed in favor of calling
Caller.method. This wasn't previously possible - prior to our v8 upgrade, they
had different signatures.
Also removed a largely unused parser/str.zig file.
The mix of sync and async HTTP requests requires care to avoid deadlocks.
Previously, it was possible for async requests to use up all available HTTP
state objects during a navigation flow (either directly, or via an internal
redirect (e.g. click, submit, ...)). This would block the navigation, which,
because everything is single-threaded, would block the I/O loop, resulting in a
deadlock.
The correct solution seems to be to remove all synchronous I/O. And I tried to
do that, but I ran into a wall with module-loading, which is initiated from V8.
V8 says "give me the source for this module", and I don't see a great way to
tell it: wait a bit.
So I went back to trying to make this work with the hybrid model, despite last
week's failures to get it to work. I changed two things:
1 - The http client will only directly initiate an async request if there's
at least 2 free state objects available (1 for the request, and leaving 1
free for any synchronous requests)
2 - Delayed navigation now retries until there's at least 1 free http state object
available.
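
The two rules above can be sketched with a toy pool. This is a minimal
illustration in Python, not the client's actual code: `StatePool`,
`try_acquire_async` and `try_acquire_sync` are hypothetical names, and the real
pool hands out state objects rather than counting them.

```python
class StatePool:
    """Toy stand-in for the HTTP client's pool of state objects (hypothetical)."""

    def __init__(self, size):
        self.free = size

    def try_acquire_async(self):
        # Rule 1: an async request needs 2 free states, so that 1 is always
        # left over for a synchronous request.
        if self.free < 2:
            return False
        self.free -= 1
        return True

    def try_acquire_sync(self):
        # Rule 2: a synchronous request (e.g. navigation) needs only 1 free
        # state. The caller is expected to retry later when this returns False.
        if self.free < 1:
            return False
        self.free -= 1
        return True

    def release(self):
        self.free += 1


pool = StatePool(size=3)
# Three async requests are attempted; the third is refused because it would
# take the last free state.
async_results = [pool.try_acquire_async() for _ in range(3)]
# The synchronous request can still proceed.
sync_ok = pool.try_acquire_sync()
```

With a pool of 3, the async path can never hold more than 2 states, so the
synchronous path cannot be starved by async requests alone.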
Commits from last week did help with this. First, we're now guaranteed to have
a single sync-request at a time (previously, we could have had 2). Secondly,
the async connection is now async end-to-end (previously, it could have blocked
on an empty state pool).
We could probably make this a bit more obvious by reserving 1 state object
for synchronous requests. But, since the long term solution is probably having
no synchronous requests, I'm happy with anything that lets me move past this
issue.
CDP translates this into a Network.loadingFailed. This is necessary to make sure
every Network.requestWillBeSent is paired with either a Network.loadingFailed
or a Network.responseReceived.
We currently keep the main request open during loadHTMLDoc and processHTMLDoc.
It _has_ to be open during loadHTMLDoc, since that streams the body. But it
does not have to be open during processHTMLDoc, which can be long and itself
could make use of that same connection if it were released. Reorganized the
navigate flow to limit the scope of the request.
Also, just like we track pending_write and pending_read, we now also track
pending_connect and only shut down when none are pending.
The HTTP Client has a state pool. It blocks when we've exceeded max_concurrency.
This can block processing forever. A simple way to reproduce this is to go into
the demo cdp.js, and execute the XHR request 5 times (loading json/product.json).
To some degree, I think this is a result of weird / non-intuitive execution
flow. If you execute a script with 100 XHR requests, it'll call our XHR _send function
but none of these will execute until the loop is run (after the script is done
being executed). This can result in poor utilization of our connection and
state pool.
For an async request, getting the *Request object is itself now asynchronous.
If no state is available, we use the Loop's timeout (at 20ms) to keep checking
for an available state.
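
A rough sketch of that retry, in Python for brevity. The `Loop` class,
`acquire_async` and the list-based pool are hypothetical stand-ins; only the
20ms interval comes from the description above.

```python
import heapq

RETRY_MS = 20  # retry interval mentioned above


class Loop:
    """Minimal single-threaded timer loop (hypothetical, for illustration)."""

    def __init__(self):
        self.now = 0
        self._timers = []
        self._seq = 0  # tie-breaker so callbacks are never compared

    def timeout(self, delay_ms, callback):
        heapq.heappush(self._timers, (self.now + delay_ms, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._timers:
            when, _, callback = heapq.heappop(self._timers)
            self.now = when
            callback()


def acquire_async(loop, pool, on_ready):
    """Hand a state to on_ready now, or check again in RETRY_MS."""
    if pool:
        on_ready(pool.pop())
    else:
        loop.timeout(RETRY_MS, lambda: acquire_async(loop, pool, on_ready))


loop = Loop()
pool = []    # no free state objects at first
events = []
acquire_async(loop, pool, lambda state: events.append((loop.now, state)))
loop.timeout(50, lambda: pool.append("state-1"))  # a state is released at t=50
loop.run()
```

The caller never blocks: the checks at t=20 and t=40 find nothing, and the one
at t=60 picks up the state that was released at t=50.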
Async HTTP requests work by emitting a "Progress" object to a callback. This
object has a "done" flag which, when `true`, indicates that all data has been
emitted and no future "Progress" objects will be sent.
Callers like XHR buffer the response and wait for "done = true" before
processing it.
The HTTP client relies on two important object pools: the connection and the
state (with all the buffers for reading/writing).
In its current implementation, the async flow does not release these pooled
objects until the final callback has returned. At best, this is inefficient:
we're keeping the connection and state objects checked out for longer than they
have to be. At worst, it can lead to a deadlock. If the calling code issues a
new request when done == true, we'll eventually run out of state objects in the
pool.
This commit now releases the state objects before emitting the final "done" Progress
message. For this to work, this final message will always have null data and
an empty header object.
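
To illustrate why the ordering matters, here is a small Python sketch with a
pool of exactly 1 state. `Client`, `free_states` and the dict-shaped progress
message are hypothetical; the point is only that the state is returned before
the final callback runs.

```python
class Client:
    """Toy client with a fixed number of pooled state objects (hypothetical)."""

    def __init__(self, states):
        self.free_states = states

    def request(self, chunks, on_progress):
        if self.free_states == 0:
            raise RuntimeError("state pool exhausted")
        self.free_states -= 1
        for chunk in chunks:
            on_progress({"data": chunk, "header": {"x": "y"}, "done": False})
        # Release the state *before* the final callback, so the callback can
        # issue a new request right away. The final message therefore carries
        # no data and an empty header object.
        self.free_states += 1
        on_progress({"data": None, "header": {}, "done": True})


client = Client(states=1)
log = []


def on_first(progress):
    log.append(("first", progress["data"], progress["done"]))
    if progress["done"]:
        # Chained request issued from inside the final callback.
        client.request(
            [b"b"], lambda p: log.append(("second", p["data"], p["done"]))
        )


client.request([b"a"], on_first)
```

If the release happened after the final callback instead, the chained request
would find the pool empty and fail (or, in a blocking client, deadlock).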
1 - Make log_level a runtime option (not a build-time)
2 - Make log_format a runtime option
3 - In Debug mode, allow for log scope filtering
Improve the general usability of scopes. Previously, the scope was more or less
based on the file that the log was in. Now they are more logically grouped.
Consider the case where you want to silence HTTP request information. Previously,
you'd have to filter out the `page`, `xhr` and `http_client` scopes, but that
would also eliminate other page, xhr and http_client logs. Now, you can just
filter out the `http` scope.
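
A minimal sketch of scope filtering, in Python. `make_logger` and the message
strings are made up for illustration; only the scope names come from the text
above.

```python
def make_logger(filtered_scopes, sink):
    """Return a log function that drops messages from filtered scopes."""
    filtered = frozenset(filtered_scopes)

    def log(scope, message):
        if scope in filtered:
            return
        sink.append(f"{scope}: {message}")

    return log


lines = []
log = make_logger({"http"}, lines)
log("http", "request started")     # silenced
log("page", "navigate")            # kept
log("xhr", "readyState changed")   # kept
```

Because HTTP request logs share one logical scope, a single filter entry
silences them without touching unrelated `page` or `xhr` messages.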
The Request object now exists on the heap, allowing it to outlive whatever is
making the request (e.g. the XHR object). We can now wait until all inflight IO
events are completed before clearing the memory.
This change fixes the crash observed in:
https://github.com/lightpanda-io/browser/issues/667
Outputs in logfmt in release and a "pretty" print in debug mode. The format
along with the log level will become arguments to the binary at some point in
the future.
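
For reference, logfmt is a flat `key=value` line format. The Python sketch
below shows the general shape only; the key names (`level`, `scope`, `msg`) and
the quoting rules are assumptions, not necessarily what the logger emits.

```python
def logfmt(level, scope, msg, **fields):
    """Format one log line as space-separated key=value pairs."""

    def quote(value):
        text = str(value)
        # Quote values that are empty or contain characters that would make
        # the line ambiguous to parse.
        if text == "" or any(c in text for c in ' "='):
            escaped = text.replace("\\", "\\\\").replace('"', '\\"')
            return '"' + escaped + '"'
        return text

    pairs = {"level": level, "scope": scope, "msg": msg, **fields}
    return " ".join(f"{key}={quote(value)}" for key, value in pairs.items())


line = logfmt("info", "http", "request complete", status=200)
```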
This might not be specific to network notification, but the issue happens all
the time testing scenarios that rely on network notification, so it's hard
to ignore.
- Add 2 internal notifications
1 - http_request_start
2 - http_request_complete
- When Network.enable CDP message is received, browser context registers for
these 2 events (when Network.disable is called, it unregisters)
- On http_request_start, CDP will emit a Network.requestWillBeSent message.
This _does not_ include all the fields, but what we have appears to be enough
for puppeteer.waitForNetworkIdle.
- On http_request_complete, CDP will emit a Network.responseReceived message.
This _does not_ include all the fields, but what we have appears to be enough
for puppeteer.waitForNetworkIdle.
We currently don't emit any other new events, including any network-specific
lifecycleEvent (e.g. Chrome will emit a networkIdle and networkAlmostIdle).
To support this, the following other things were done:
- CDP now has a `notification_arena` which is re-used between browser contexts.
Normally, CDP code runs based on a "cmd" which has its own message_arena, but
these notifications happen out-of-band, so we needed a new arena which is
valid for handling 1 notification.
- HTTP Client is notification-aware. The SessionState no longer includes the
*http.Client directly. It instead includes an http.RequestFactory which is
the combination of the client + a specific configuration (i.e. *Notification).
This ensures that all requests made from that factory have the same settings.
- However, despite the above, _some_ requests do not appear to emit CDP events,
such as loading a <script src="X">. So the page still deals directly with the
*http.Client.
- Playwright and Puppeteer (but Playwright in particular) are very sensitive to
event ordering. These new events have introduced additional sensitivity.
The result sent to Page.navigate had to be moved to inside the navigate event
handler, which meant passing some cdp-specific data (the input.id) into the
NavigateOpts. This is the only way I found to keep both happy - the sequence
of events is closer to (but still pretty far from) what Chrome does.
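
The register / unregister flow for the two internal notifications can be
sketched as follows. This is Python for illustration only: the class and method
names are hypothetical, and the emitted CDP messages are reduced to a
(method, request id) pair.

```python
class Notification:
    """Tiny event registry (hypothetical stand-in for the internal one)."""

    def __init__(self):
        self._listeners = {}

    def register(self, event, listener):
        self._listeners.setdefault(event, []).append(listener)

    def unregister(self, event, listener):
        self._listeners.get(event, []).remove(listener)

    def dispatch(self, event, payload):
        # Copy the list so listeners may unregister while being notified.
        for listener in list(self._listeners.get(event, [])):
            listener(payload)


class BrowserContext:
    def __init__(self, notification):
        self.notification = notification
        self.sent = []  # CDP messages that would go out to the client

    def network_enable(self):
        self.notification.register("http_request_start", self._on_start)
        self.notification.register("http_request_complete", self._on_complete)

    def network_disable(self):
        self.notification.unregister("http_request_start", self._on_start)
        self.notification.unregister("http_request_complete", self._on_complete)

    def _on_start(self, request):
        self.sent.append(("Network.requestWillBeSent", request["id"]))

    def _on_complete(self, request):
        self.sent.append(("Network.responseReceived", request["id"]))


notification = Notification()
ctx = BrowserContext(notification)
notification.dispatch("http_request_start", {"id": 1})     # not enabled yet
ctx.network_enable()
notification.dispatch("http_request_start", {"id": 2})
notification.dispatch("http_request_complete", {"id": 2})
ctx.network_disable()
notification.dispatch("http_request_start", {"id": 3})     # disabled again
```

Only request 2 produces CDP messages, since it is the only one that starts and
completes while the context is registered.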
Synchronous body reader now exposes a peek() function to get the first few bytes
from the response body. This will be no less than 100 bytes (assuming the body
is that big), but could be more. Streaming API, via res.next() continues to work
as-is even if peek() is called.
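
One way to get this behavior is to buffer on peek and replay the buffer on the
first read. The Python sketch below assumes that approach; `BodyReader` and its
chunk iterator are hypothetical, and only the 100-byte minimum comes from the
description above.

```python
class BodyReader:
    """Toy synchronous body reader over an iterator of chunks (hypothetical)."""

    PEEK_MIN = 100  # peek() returns at least this many bytes if the body has them

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buffered = b""

    def peek(self):
        # Pull chunks until we hold at least PEEK_MIN bytes or the body ends.
        while len(self._buffered) < self.PEEK_MIN:
            chunk = next(self._chunks, None)
            if chunk is None:
                break
            self._buffered += chunk
        return self._buffered

    def next(self):
        # Replay any peeked bytes first, then continue streaming as usual.
        if self._buffered:
            data, self._buffered = self._buffered, b""
            return data
        return next(self._chunks, None)


reader = BodyReader([b"a" * 60, b"b" * 60, b"c" * 10])
head = reader.peek()   # two chunks are needed to reach 100 bytes

body = b""
while (chunk := reader.next()) is not None:
    body += chunk
```

Since whole chunks are buffered, peek() can return more than 100 bytes, and the
streamed body is unchanged whether or not peek() was called.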
Introduce Mime.sniff() that detects a few common types - the ones that we care
about right now - from the body content.
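
As an illustration of content sniffing, the Python sketch below matches a few
well-known file signatures and an HTML prefix. The set of types is an
assumption made for the example; it is not the list that Mime.sniff() actually
covers.

```python
# Well-known leading byte signatures ("magic numbers").
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"%PDF-", "application/pdf"),
]


def sniff(body):
    """Guess a MIME type from the first bytes of a body, or return None."""
    for magic, mime in SIGNATURES:
        if body.startswith(magic):
            return mime
    # HTML may be preceded by whitespace and is matched case-insensitively.
    head = body.lstrip(b" \t\r\n\x0c")[:15].lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return None
```

A caller would pass the bytes returned by a peek-style call, for example
`sniff(b"%PDF-1.7 ...")`, and fall back to the Content-Type header when the
result is None.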
The HTTP response values _are_ mutable, but because we're using std.http.Header
the type is a `[]const u8`. This introduces a custom `Header` type where the
value is `[]u8`.
The goal is largely to allow more efficient value-comparison, by allowing
calling code to lower-case in-place. I specifically have the Mime parser in
mind:
25dcae7648/src/browser/mime.zig (L134)
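
The idea of lower-casing in place, rather than allocating a lower-cased copy
for each comparison, looks roughly like this. The sketch uses a Python
`bytearray` as a stand-in for a mutable `[]u8`; `lower_in_place` is a
hypothetical helper.

```python
def lower_in_place(value):
    """ASCII lower-case a mutable byte buffer without allocating a copy."""
    for i, byte in enumerate(value):
        if 0x41 <= byte <= 0x5A:     # 'A'..'Z'
            value[i] = byte | 0x20   # set the bit that maps to 'a'..'z'


header_value = bytearray(b"Text/HTML; Charset=UTF-8")
lower_in_place(header_value)
is_html = header_value.startswith(b"text/html")
```

After the single in-place pass, every later comparison can be a plain byte
comparison against lower-case literals.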
Combine uri + rawuri into single struct.
Try to improve ownership around URIs and URI-like things.
- cookie & request can take *const std.Uri
(TODO: make them aware of the new URL struct?)
- Location (web api) should own its URL (web api URL)
- Window should own its Location
Most of these changes result in (a) a cleaner Page and (b) not having to carry
around 2 nullable objects (URI and rawuri).
When we only have 1 or 2 bytes missing from a chunk (i.e. the trailing \n or
\r\n), don't emit an empty chunk if we have more data available to process.
When set, this disables the host verification of all HTTP requests. Available
for both the fetch and serve mode.
Also introduced an App.Config, for future command line options which need to
be passed more deeply into the code.