15 KiB
Design: Slow-request isolation for the gthread worker (predictive dual-queue)
Status: proposal / draft
Author: (ankush)
Scope: gunicorn/workers/gthread.py, gunicorn/config.py
1. Problem
The gthread worker runs synchronous WSGI applications on a single
ThreadPoolExecutor sized to cfg.threads (gthread.py:95-97). Every accepted
connection is submitted to that one pool (enqueue_req, gthread.py:117-121).
Because the pool has a fixed number of threads and an unbounded work queue, a
flood of slow requests occupies every thread and all fast requests starve behind
them in the queue — head-of-line blocking.
Goal: route requests that are predicted to be slow into a separate, dedicated lane so they can never occupy the threads reserved for fast requests, even under a flood. Fast requests go to a fast lane; slow requests go to a slow lane. The slow lane may help drain fast work when its own queue is empty, but the fast lane never touches slow work.
This supersedes the earlier "demotion-only" proposal, which could not stop slow work from entering the fast pool and therefore could not survive a flood.
2. Why prediction is required (and its hard limit)
You cannot preempt a running Python thread executing WSGI code (gthread.py:352):
once a slow request is on a thread, that thread is committed until the app
returns. So isolation has to happen before a request is handed to a worker —
i.e. at routing time. That means we must decide "fast or slow" from the request
before running it.
The only information available pre-execution is the request itself (method, path, headers) plus what we have learned from prior requests. So the design is:
- A predictor that, given a request's route, answers "slow?" using learned per-route timing statistics (plus optional operator-seeded patterns).
- Routing at accept time based on that prediction, into one of two pools.
- Learning: every completed request — and any request that crosses the slow threshold mid-flight — updates the predictor, so a slow route is recognized after its first occurrence(s) and all subsequent traffic to it is routed to the slow lane.
Hard limit to state up front: a route that has never been seen cannot be predicted slow on its very first request(s); those first occurrences run in the fast lane until learning kicks in. We minimize this window (§5.4) and let operators pre-seed known-slow routes (§5.1). For repeated/flooding slow routes — the actual failure mode — prediction is effective after the first sample.
3. Architecture overview
┌─────────────────────────────────────────┐
listener ──accept──▶ │ main loop: poller-driven classification │
│ peek request line ▶ predictor.is_slow? │
└───────────────┬───────────────┬───────────┘
│ fast │ slow
▼ ▼
┌───────────┐ ┌───────────┐
│ fast_pool │ │ slow_pool │
│ F threads │ │ S threads │
└─────┬─────┘ └─────┬─────┘
└───────┬───────┘
on completion: predictor.update(route, duration)
- Fast lane: a
ThreadPoolExecutorofF = ceil(cfg.threads / 2)threads. Only ever runs fast-classified work. - Slow lane: a separate
ThreadPoolExecutorofS = cfg.threads // 2threads. Only ever runs slow-classified work. - Total OS threads per worker stays at
cfg.threads— adaptive-queueing mode splits the existing budget, it does not expand it.cfg.threadsmust be at least 2 for the split to be meaningful; otherwise the worker logs a warning and runs with a single pool. - Both lanes share the existing
worker_connectionsadmission like today; the slow lane is not separately bounded.
Why two plain pools (and not a custom dual-queue scheduler)
An earlier revision used a single custom scheduler with two queues and
one-directional work stealing (idle slow threads draining the fast queue).
Two independent ThreadPoolExecutors are dramatically simpler and rely on
well-tested stdlib machinery. The one capability given up is work stealing: the
S slow threads sit idle when there is no slow work, even if fast work is
queued. For the common case (S small, e.g. 1) this is a negligible amount of
parked capacity, and the simplicity is worth it. If maximizing throughput under
pure-fast load ever matters more than simplicity, the custom scheduler can be
reintroduced behind the same enqueue_req interface without touching routing.
4. Routing point: classify before threading
Today, parsing happens inside the worker thread (handle → next(conn.parser),
gthread.py:295), which is too late — the request is already on a thread. We
move classification only (not full parsing) into the main loop.
4.1 Restructured connection lifecycle
Both freshly accepted connections and keepalive connections flow through one
poller-driven classification step (this unifies accept/reuse_connection and
also moves slow-client header reads off the worker threads — a side benefit
against slowloris):
-
accept(gthread.py:123): accept socket, createTConn, set non-blocking, register it in the poller forEVENT_READwith aclassify_and_dispatchcallback. Do not submit to any pool yet.nr_conns += 1. -
When the socket becomes readable,
classify_and_dispatch(conn):- Peek the buffered bytes with
recv(n, socket.MSG_PEEK)(plaintext) — this reads without consuming, so the worker's parser still sees the full byte stream unchanged. No parser changes required. - Parse just the request line (
METHOD SP PATH SP VERSION CRLF) from the peeked buffer. If the line has not fully arrived yet, return and wait for the next readable event (bounded by the existing keepalive/header timeout so a stalled client is eventually closed, not left forever).
Why peek the request line, not fully read/parse the request here? Classification only needs method + path. Doing a full read/parse of the request in the main loop is actively harmful: the main loop is a single thread serving every connection (accepts, keepalive, the poller). A blocking full read lets one slow client — slowloris, slow network, or a large/chunked body — stall the entire worker, which is strictly worse than the thread-pool starvation we are fixing (there is no pool to absorb it). Peeking only inspects already-buffered bytes and defers to the poller if the line is incomplete, so it never blocks. It also avoids having to read the body in the main loop (WSGI streams
wsgi.inputlazily) and keeps header parsing, parse-error responses (400/414), andwsgi.inputwiring in the worker where they already live.- Compute
route_key(default:METHOD + " " + path, query string stripped; overridable via hook, §5.1). slow = predictor.is_slow(route_key)(or matches a seeded slow pattern).- Unregister the socket from the poller and submit the connection to the
slow pool if
slowelse the fast pool.
- Peek the buffered bytes with
-
The worker's
handle/handle_requestrun unchanged. On completion, the measuredrequest_time(already computed atgthread.py:362) is fed topredictor.update(route_key, duration). -
Keepalive: after a kept-alive request, re-register the connection in the poller with the same
classify_and_dispatchcallback (instead of the oldreuse_connection), so the next request on the connection is re-classified independently (it may hit a different route).
4.2 SSL connections
Plaintext peek does not work through TLS — the request line is encrypted until the handshake completes. For SSL connections in this first cut:
- They cannot be pre-classified at the socket level, so they default to the fast lane and rely on mid-flight + completion learning (§5.4) — meaning an SSL-only deployment does not get full flood protection.
- Note in docs that the common production layout terminates TLS upstream (e.g. nginx) so gunicorn sees plaintext and gets full protection.
- Phase 2 (deferred): drive a non-blocking TLS handshake from the poller and
buffer the decrypted request line (feeding it back via
Unreader.unread,unreader.py:51) to classify SSL the same way.
5. Components
5.1 Config (gunicorn/config.py)
New settings, mirroring WorkerThreads (config.py:697):
enable_adaptive_queueing— boolean; when true, thegthreadworker splits itscfg.threadsbudget between a fast and a slow lane and routes by prediction. DefaultFalse(single pool, today's behavior). Requirescfg.threads >= 2; otherwise the worker logs a warning and falls back to the single pool.slow_request_threshold— float seconds; a route whose learned timing meets/ exceeds this is "slow". Default1.0. Only consulted whenenable_adaptive_queueingis enabled.
A slow_route_key hook to customize the route key (e.g. collapse
/users/<id>) is a possible future addition; the default key is method + path
with the query string stripped.
5.2 Two thread pools
init_process builds two plain ThreadPoolExecutors when enable_adaptive_queueing is
enabled — fast_pool with F = ceil(cfg.threads / 2) workers and slow_pool
with S = cfg.threads // 2 workers — and falls back to the single
get_thread_pool() executor when it is disabled. enqueue_req(conn, slow)
submits to the matching pool; both produce ordinary
concurrent.futures.Futures, so _wrap_future, add_done_callback,
self.futures tracking, and futures.wait all keep working unchanged.
- The slow lane is not separately bounded; its executor's internal queue is
capped indirectly by the worker's existing
worker_connectionsadmission, the same as the single-pool path today. - Shutdown drains both pools via a
_shutdown_poolshelper, replacing the singletpool.shutdowncalls; thegraceful_timeoutfutures.waitis unchanged.
5.3 Predictor
A small, self-contained, thread-safe object:
- State: bounded LRU map
route_key -> {ewma_seconds, samples, last_seen}. Bounding caps memory under high route cardinality. update(route_key, duration): EWMA with decay so a route that becomes fast again eventually returns to the fast lane (avoids permanent misclassification after a one-off slow spike). Called on every completion.is_slow(route_key):Trueif itsewma_seconds >= slow_request_threshold. Unknown routes ⇒False(fast) by default.- Optional hysteresis (separate promote/demote thresholds) to avoid flapping around the boundary.
5.4 Learning signals
- Completion (primary): feed
request_time(gthread.py:362) intopredictor.update. After a slow route's first request completes, it is known. - Mid-flight observation (catches simultaneous first-bursts): the main loop
already sweeps
self.futuresfor the hard timeout (gthread.py:245-250). In that sweep, for any in-flight request whose elapsed time exceedsslow_request_threshold, callpredictor.updatewith that elapsed time immediately (do not wait for completion, and do not move the running request — we can't). This shortens the learning window when many requests to a brand-new slow route arrive at once: subsequent ones in the burst route to the slow lane after one threshold interval instead of after a full slow request.
6. Behavior under load (the cases that matter)
- Flood of a previously-seen slow route: every such request is routed to
the slow pool. The
Ffast threads are never given this work and keep serving fast traffic at full capacity. Excess slow requests sit in the slow pool's queue, gated overall byworker_connections. - Flood of a never-seen slow route: the first occurrence(s) run in the fast lane; mid-flight learning (§5.4.2) flips the route to slow after one threshold interval, so the flood is contained quickly.
- Mixed fast traffic, idle slow lane: the
Sslow threads stay parked (no work stealing in this design — see §3), so fast throughput isF, notF + S. This is the cost of splitting a fixedcfg.threadsbudget. - Misprediction (route marked slow but now fast): handled gracefully — it runs in the slow lane, and EWMA decay restores it to the fast lane over time.
7. Implementation checklist (touch points)
Implemented:
config.py—enable_adaptive_queueing,slow_request_threshold, plusvalidate_pos_float.gthread.pyinit_process/get_thread_pool— buildfast_poolandslow_pool(split fromcfg.threads) whenenable_adaptive_queueingis on, or the single legacy pool when off;_shutdown_pools.gthread.pyenqueue_req— route to the matching pool.gthread.pyaccept/park_for_request/classify_and_dispatch/_peek_request_line/_route_key— poller-driven request-line peek + routing.gthread.pyfinish_request—predictor.update, routing-aware keepalive re-park.gthread.pyrun-loop sweep — mid-flight learning.gthread_routing.py—SlowRoutePredictor.
8. Backward compatibility
enable_adaptive_queueing = False(the default) ⇒ feature off: single pool, no classification — byte-for-byte current behavior.- Hard per-request timeout (
gthread.py:243-250) preserved unchanged; this adds a softer, non-fatal classification on top. - Worker
handle/handle_request, keepalive semantics, and the future/finish_requestcontract are preserved (MSG_PEEK leaves the byte stream intact, so the parser is untouched).
9. Test plan
- Predictor unit: unknown ⇒ fast; after
updatewith a slow duration ⇒ slow; EWMA decay restores fast; seeded patterns are slow from first call; LRU bound holds under many keys. - Routing unit:
classify_and_dispatchextracts the rightroute_keyfrom partial vs complete peeked buffers; incomplete line defers; complete line dispatches to the expected lane. - Integration — flood isolation: app with a known-slow route flooded concurrently; assert fast-route latency stays low and slow requests never occupy fast workers.
- Integration — cold start: never-seen slow route burst ⇒ confirm the lane flips to slow within ~one threshold interval via mid-flight learning.
- Regression:
enable_adaptive_queueing = False⇒ current behavior; keepalive, SSL, and graceful shutdown paths still pass existing tests.