gunicorn/docs/design/gthread-slow-request-isolation.md
Ankush Menat 7dc4d184be refactor: Simpler implementation
Just do multiple queues, nothing else.
2026-05-28 16:33:54 +05:30

Design: Slow-request isolation for the gthread worker (predictive dual-queue)

Status: proposal / draft
Author: (ankush)
Scope: gunicorn/workers/gthread.py, gunicorn/config.py

1. Problem

The gthread worker runs synchronous WSGI applications on a single ThreadPoolExecutor sized to cfg.threads (gthread.py:95-97). Every accepted connection is submitted to that one pool (enqueue_req, gthread.py:117-121). Because the pool has a fixed number of threads and an unbounded work queue, a flood of slow requests occupies every thread and all fast requests starve behind them in the queue — head-of-line blocking.
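The starvation described above can be reproduced with the standard library alone. The following is a minimal sketch, not gunicorn code: the function name and the sleep durations are illustrative assumptions, and time.sleep stands in for a slow WSGI request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_fast_latency(threads=2, slow_seconds=0.2):
    """Return how long a trivial task waits behind `threads` slow tasks."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Occupy every thread with slow work first.
        slow = [pool.submit(time.sleep, slow_seconds) for _ in range(threads)]
        submitted = time.monotonic()
        # The fast task does no work; it only records when it started running.
        fast = pool.submit(time.monotonic)
        started = fast.result()
        for future in slow:
            future.result()
    return started - submitted


if __name__ == "__main__":
    waited = measure_fast_latency()
    print(f"fast task waited {waited:.2f}s behind slow tasks")
```

The fast task's wait is roughly one full slow-task duration, even though it needs no thread time of its own.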

Goal: route requests that are predicted to be slow into a separate, dedicated lane so they can never occupy the threads reserved for fast requests, even under a flood. Fast requests go to a fast lane; slow requests go to a slow lane. The lanes are strictly separate: the fast lane never touches slow work, and in this revision idle slow threads do not pick up fast work either (§3 explains the trade-off).

This supersedes the earlier "demotion-only" proposal, which could not stop slow work from entering the fast pool and therefore could not survive a flood.

2. Why prediction is required (and its hard limit)

You cannot preempt a running Python thread executing WSGI code (gthread.py:352): once a slow request is on a thread, that thread is committed until the app returns. So isolation has to happen before a request is handed to a worker — i.e. at routing time. That means we must decide "fast or slow" from the request before running it.

The only information available pre-execution is the request itself (method, path, headers) plus what we have learned from prior requests. So the design is:

  1. A predictor that, given a request's route, answers "slow?" using learned per-route timing statistics (plus optional operator-seeded patterns).
  2. Routing at accept time based on that prediction, into one of two pools.
  3. Learning: every completed request — and any request that crosses the slow threshold mid-flight — updates the predictor, so a slow route is recognized after its first occurrence(s) and all subsequent traffic to it is routed to the slow lane.

Hard limit to state up front: a route that has never been seen cannot be predicted slow on its very first request(s); those first occurrences run in the fast lane until learning kicks in. We minimize this window (§5.4) and let operators pre-seed known-slow routes (§5.1). For repeated/flooding slow routes — the actual failure mode — prediction is effective after the first sample.

3. Architecture overview

                          ┌─────────────────────────────────────────┐
   listener ──accept──▶   │  main loop: poller-driven classification  │
                          │  peek request line ▶ predictor.is_slow?   │
                          └───────────────┬───────────────┬───────────┘
                                          │ fast          │ slow
                                          ▼               ▼
                                   ┌───────────┐   ┌───────────┐
                                   │ fast_pool │   │ slow_pool │
                                   │ F threads │   │ S threads │
                                   └─────┬─────┘   └─────┬─────┘
                                         └───────┬───────┘
                              on completion: predictor.update(route, duration)
  • Fast lane: a ThreadPoolExecutor of F = ceil(cfg.threads / 2) threads. Only ever runs fast-classified work.
  • Slow lane: a separate ThreadPoolExecutor of S = cfg.threads // 2 threads. Only ever runs slow-classified work.
  • Total OS threads per worker stays at cfg.threads — adaptive-queueing mode splits the existing budget, it does not expand it. cfg.threads must be at least 2 for the split to be meaningful; otherwise the worker logs a warning and runs with a single pool.
  • Both lanes share the existing worker_connections admission limit, exactly as the single pool does today; the slow lane is not separately bounded.
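The split of the thread budget can be sketched as follows. The function name split_threads is hypothetical; the formulas and the fallback for cfg.threads < 2 are the ones stated above.

```python
import math


def split_threads(threads):
    """Return (fast, slow) thread counts, or None when no split is possible."""
    if threads < 2:
        # The caller would log a warning and keep the single pool.
        return None
    fast = math.ceil(threads / 2)
    slow = threads // 2
    return fast, slow


if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(n, split_threads(n))
```

For any threads >= 2 the two counts sum to the original budget, and an odd budget gives the extra thread to the fast lane.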

Why two plain pools (and not a custom dual-queue scheduler)

An earlier revision used a single custom scheduler with two queues and one-directional work stealing (idle slow threads draining the fast queue). Two independent ThreadPoolExecutors are dramatically simpler and rely on well-tested stdlib machinery. The one capability given up is work stealing: the S slow threads sit idle when there is no slow work, even if fast work is queued. For the common case (S small, e.g. 1) this is a negligible amount of parked capacity, and the simplicity is worth it. If maximizing throughput under pure-fast load ever matters more than simplicity, the custom scheduler can be reintroduced behind the same enqueue_req interface without touching routing.

4. Routing point: classify before threading

Today, parsing happens inside the worker thread (handle calls next(conn.parser), gthread.py:295), which is too late: the request is already on a thread. We move classification only (not full parsing) into the main loop.

4.1 Restructured connection lifecycle

Both freshly accepted connections and keepalive connections flow through one poller-driven classification step. This unifies accept/reuse_connection and also keeps worker threads from waiting on clients that are slow to send their request line, a partial side benefit against slowloris:

  1. accept (gthread.py:123): accept socket, create TConn, set non-blocking, register it in the poller for EVENT_READ with a classify_and_dispatch callback. Do not submit to any pool yet. nr_conns += 1.

  2. When the socket becomes readable, classify_and_dispatch(conn):

    • Peek the buffered bytes with recv(n, socket.MSG_PEEK) (plaintext) — this reads without consuming, so the worker's parser still sees the full byte stream unchanged. No parser changes required.
    • Parse just the request line (METHOD SP PATH SP VERSION CRLF) from the peeked buffer. If the line has not fully arrived yet, return and wait for the next readable event (bounded by the existing keepalive/header timeout so a stalled client is eventually closed, not left forever).

    Why peek the request line, not fully read/parse the request here? Classification only needs method + path. Doing a full read/parse of the request in the main loop is actively harmful: the main loop is a single thread serving every connection (accepts, keepalive, the poller). A blocking full read lets one slow client — slowloris, slow network, or a large/chunked body — stall the entire worker, which is strictly worse than the thread-pool starvation we are fixing (there is no pool to absorb it). Peeking only inspects already-buffered bytes and defers to the poller if the line is incomplete, so it never blocks. It also avoids having to read the body in the main loop (WSGI streams wsgi.input lazily) and keeps header parsing, parse-error responses (400/414), and wsgi.input wiring in the worker where they already live.

    • Compute route_key (default: METHOD + " " + path, query string stripped; a hook to customize it is a possible future addition, §5.1).
    • slow = predictor.is_slow(route_key) (or matches a seeded slow pattern).
    • Unregister the socket from the poller and submit the connection to the slow pool if slow else the fast pool.
  3. The worker's handle/handle_request run unchanged. On completion, the measured request_time (already computed at gthread.py:362) is fed to predictor.update(route_key, duration).

  4. Keepalive: after a kept-alive request, re-register the connection in the poller with the same classify_and_dispatch callback (instead of the old reuse_connection), so the next request on the connection is re-classified independently (it may hit a different route).
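The peek in step 2 could look roughly like the sketch below. It assumes a non-blocking plaintext socket; the function name, the MAX_PEEK cap, and the choice to raise on EOF or a malformed line (so the caller can close the connection or hand it to the worker's parser for the error response) are illustrative assumptions, not part of the proposal.

```python
import socket

MAX_PEEK = 8192  # assumed cap on how many buffered bytes to inspect


def peek_request_line(sock, limit=MAX_PEEK):
    """Return (method, target) if a full request line is buffered, else None.

    MSG_PEEK leaves the bytes in the socket buffer for the real parser.
    """
    try:
        data = sock.recv(limit, socket.MSG_PEEK)
    except BlockingIOError:
        return None  # nothing buffered yet; wait for the next readable event
    if not data:
        raise ConnectionError("peer closed before sending a request line")
    line, sep, _ = data.partition(b"\r\n")
    if not sep:
        return None  # request line incomplete; defer to the poller
    parts = line.split(b" ")
    if len(parts) != 3:
        raise ValueError("malformed request line")
    method, target, _version = parts
    return method.decode("latin-1"), target.decode("latin-1")


if __name__ == "__main__":
    client, server = socket.socketpair()
    server.setblocking(False)
    client.sendall(b"GET /report?year=2026 HTTP/1.1\r\nHost: x\r\n\r\n")
    print(peek_request_line(server))
    # The request is still unread, so a later parser sees every byte.
    print(server.recv(1024))
    client.close()
    server.close()
```

Because the function never consumes data and never waits, it is safe to call from the single-threaded main loop on every readable event until it returns a result.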

4.2 SSL connections

Plaintext peek does not work through TLS — the request line is encrypted until the handshake completes. For SSL connections in this first cut:

  • They cannot be pre-classified at the socket level, so they default to the fast lane and rely on mid-flight + completion learning (§5.4) — meaning an SSL-only deployment does not get full flood protection.
  • Note in docs that the common production layout terminates TLS upstream (e.g. nginx) so gunicorn sees plaintext and gets full protection.
  • Phase 2 (deferred): drive a non-blocking TLS handshake from the poller and buffer the decrypted request line (feeding it back via Unreader.unread, unreader.py:51) to classify SSL the same way.

5. Components

5.1 Config (gunicorn/config.py)

New settings, mirroring WorkerThreads (config.py:697):

  • enable_adaptive_queueing: boolean; when true, the gthread worker splits its cfg.threads budget between a fast and a slow lane and routes by prediction. Default False (single pool, today's behavior). Requires cfg.threads >= 2; otherwise the worker logs a warning and falls back to the single pool.
  • slow_request_threshold: float seconds; a route whose learned timing meets or exceeds this is "slow". Default 1.0. Only consulted when enable_adaptive_queueing is enabled.
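As an illustration, a gunicorn.conf.py enabling the feature might look like this. The two new names are the settings proposed above and would only be accepted once this design is implemented; the thread count is an arbitrary example.

```python
# gunicorn.conf.py (sketch; the last two settings exist only under this proposal)
worker_class = "gthread"
threads = 4                      # split into 2 fast + 2 slow threads
enable_adaptive_queueing = True  # proposed setting, default False
slow_request_threshold = 1.0     # proposed setting, seconds
```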

A slow_route_key hook to customize the route key (e.g. collapse /users/<id>) is a possible future addition; the default key is method + path with the query string stripped.
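A minimal sketch of the default key, plus one way a future hook might collapse numeric path segments. Both function names are hypothetical, and the `<id>` collapsing rule is only an example of what an operator could supply.

```python
import re


def route_key(method, target):
    """Default key: METHOD + " " + path, with the query string stripped."""
    path = target.split("?", 1)[0]
    return f"{method} {path}"


def collapse_ids(key):
    """Hypothetical hook: replace purely numeric path segments with <id>."""
    return re.sub(r"/\d+(?=/|$)", "/<id>", key)


if __name__ == "__main__":
    key = route_key("GET", "/users/42/posts/7?full=1")
    print(key)
    print(collapse_ids(key))
```

Without such a hook, /users/1 and /users/2 are learned as separate routes, so each new id starts as an unknown (fast) route.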

5.2 Two thread pools

init_process builds two plain ThreadPoolExecutors when enable_adaptive_queueing is enabled — fast_pool with F = ceil(cfg.threads / 2) workers and slow_pool with S = cfg.threads // 2 workers — and falls back to the single get_thread_pool() executor when it is disabled. enqueue_req(conn, slow) submits to the matching pool; both produce ordinary concurrent.futures.Futures, so _wrap_future, add_done_callback, self.futures tracking, and futures.wait all keep working unchanged.

  • The slow lane is not separately bounded; its executor's internal queue is capped indirectly by the worker's existing worker_connections admission, the same as the single-pool path today.
  • Shutdown drains both pools via a _shutdown_pools helper, replacing the single tpool.shutdown calls; the graceful_timeout futures.wait is unchanged.
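The two-pool arrangement can be sketched outside gunicorn as below. TwoLanePools and its method names are hypothetical stand-ins for the worker's fast_pool, slow_pool, enqueue_req, and _shutdown_pools; the thread name prefixes are added only to make the lanes visible.

```python
import math
import threading
from concurrent.futures import ThreadPoolExecutor


class TwoLanePools:
    """Hypothetical holder for the fast and slow executors."""

    def __init__(self, threads):
        if threads < 2:
            raise ValueError("need at least 2 threads to split into two lanes")
        self.fast_pool = ThreadPoolExecutor(
            max_workers=math.ceil(threads / 2), thread_name_prefix="fast")
        self.slow_pool = ThreadPoolExecutor(
            max_workers=threads // 2, thread_name_prefix="slow")

    def enqueue(self, fn, *args, slow=False):
        # Both pools return ordinary Futures, so callers need no special cases.
        pool = self.slow_pool if slow else self.fast_pool
        return pool.submit(fn, *args)

    def shutdown(self, wait=True):
        for pool in (self.fast_pool, self.slow_pool):
            pool.shutdown(wait=wait)


if __name__ == "__main__":
    pools = TwoLanePools(4)

    def current_name():
        return threading.current_thread().name

    print(pools.enqueue(current_name).result())
    print(pools.enqueue(current_name, slow=True).result())
    pools.shutdown()
```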

5.3 Predictor

A small, self-contained, thread-safe object:

  • State: bounded LRU map route_key -> {ewma_seconds, samples, last_seen}. Bounding caps memory under high route cardinality.
  • update(route_key, duration): EWMA with decay so a route that becomes fast again eventually returns to the fast lane (avoids permanent misclassification after a one-off slow spike). Called on every completion.
  • is_slow(route_key): True if its ewma_seconds >= slow_request_threshold. Unknown routes ⇒ False (fast) by default.
  • Optional hysteresis (separate promote/demote thresholds) to avoid flapping around the boundary.
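One possible shape for the predictor is sketched below. The smoothing factor alpha and the max_routes bound are assumed parameters that the proposal does not fix, the samples and last_seen fields are omitted for brevity, and hysteresis is left out. In this sketch only update refreshes LRU order.

```python
import threading
from collections import OrderedDict


class SlowRoutePredictor:
    """Sketch of a bounded, thread-safe, EWMA-based route classifier."""

    def __init__(self, threshold=1.0, alpha=0.3, max_routes=1024):
        self.threshold = threshold
        self.alpha = alpha            # assumed smoothing factor
        self.max_routes = max_routes  # assumed LRU bound
        self._ewma = OrderedDict()    # route_key -> ewma_seconds
        self._lock = threading.Lock()

    def update(self, route_key, duration):
        with self._lock:
            prev = self._ewma.pop(route_key, None)
            if prev is None:
                value = duration  # the first sample seeds the average
            else:
                value = self.alpha * duration + (1 - self.alpha) * prev
            self._ewma[route_key] = value  # re-insert as most recently used
            while len(self._ewma) > self.max_routes:
                self._ewma.popitem(last=False)  # evict least recently used

    def is_slow(self, route_key):
        with self._lock:
            value = self._ewma.get(route_key)
        # Unknown routes are treated as fast.
        return value is not None and value >= self.threshold
```

Seeding the average with the first sample is what lets a route be classified slow after a single observation; a smaller alpha makes the return to the fast lane slower.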

5.4 Learning signals

  1. Completion (primary): feed request_time (gthread.py:362) into predictor.update. After a slow route's first request completes, it is known.
  2. Mid-flight observation (catches simultaneous first-bursts): the main loop already sweeps self.futures for the hard timeout (gthread.py:245-250). In that sweep, for any in-flight request whose elapsed time exceeds slow_request_threshold, call predictor.update with that elapsed time immediately (do not wait for completion, and do not move the running request — we can't). This shortens the learning window when many requests to a brand-new slow route arrive at once: subsequent ones in the burst route to the slow lane after one threshold interval instead of after a full slow request.
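The mid-flight signal could be implemented along these lines. This is a sketch under assumptions: the inflight mapping, the reported set, and the function name are hypothetical. Reporting each request at most once is a choice made here so that repeated sweeps do not feed the same request into the average several times; the proposal does not specify this.

```python
import time


def sweep_inflight(inflight, update, threshold, reported, now=None):
    """Report in-flight requests that have already run for `threshold` seconds.

    inflight: dict of request id -> (route_key, start_time)
    update:   callable(route_key, elapsed), e.g. predictor.update
    reported: set of request ids already reported (mutated in place)
    """
    now = time.monotonic() if now is None else now
    for req_id, (route_key, started) in inflight.items():
        elapsed = now - started
        if elapsed >= threshold and req_id not in reported:
            update(route_key, elapsed)
            reported.add(req_id)


if __name__ == "__main__":
    seen = []
    inflight = {1: ("GET /report", 0.0), 2: ("GET /ping", 9.5)}
    sweep_inflight(inflight, lambda k, d: seen.append((k, d)), 1.0, set(), now=10.0)
    print(seen)
```

The running request itself is left alone; only the predictor's view of its route changes.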

6. Behavior under load (the cases that matter)

  • Flood of a previously-seen slow route: every such request is routed to the slow pool. The F fast threads are never given this work and keep serving fast traffic at full capacity. Excess slow requests sit in the slow pool's queue, gated overall by worker_connections.
  • Flood of a never-seen slow route: the first occurrence(s) run in the fast lane; mid-flight learning (§5.4.2) flips the route to slow after one threshold interval. Requests already running on fast threads stay there until they return (§2), but every later request for the route goes to the slow lane, so fast-lane exposure is limited to that first burst.
  • Mixed fast traffic, idle slow lane: the S slow threads stay parked (no work stealing in this design; see §3), so fast work runs on at most F threads, not F + S. This is the cost of splitting a fixed cfg.threads budget.
  • Misprediction (route marked slow but now fast): handled gracefully — it runs in the slow lane, and EWMA decay restores it to the fast lane over time.
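To give a feel for how long a misprediction lasts, the sketch below counts how many fast completions a route needs before its EWMA falls back under the threshold. The smoothing factor alpha = 0.2 and the sample durations are illustrative assumptions; the proposal does not fix these values.

```python
def samples_until_fast(ewma, fast_duration, threshold=1.0, alpha=0.2):
    """Count fast completions needed before the EWMA drops below threshold."""
    if fast_duration >= threshold:
        raise ValueError("fast_duration must be below threshold")
    count = 0
    while ewma >= threshold:
        ewma = alpha * fast_duration + (1 - alpha) * ewma
        count += 1
    return count


if __name__ == "__main__":
    # A route learned at 5 s that now answers in 50 ms.
    print(samples_until_fast(5.0, 0.05))  # 8
```

With these numbers the route spends eight more requests in the slow lane before returning to the fast lane; a larger alpha shortens that at the cost of more flapping.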

7. Implementation checklist (touch points)

Implemented:

  • config.py: enable_adaptive_queueing, slow_request_threshold, plus validate_pos_float.
  • gthread.py init_process/get_thread_pool: build fast_pool and slow_pool (split from cfg.threads) when enable_adaptive_queueing is on, or the single legacy pool when off; _shutdown_pools.
  • gthread.py enqueue_req: route to the matching pool.
  • gthread.py accept/park_for_request/classify_and_dispatch/_peek_request_line/_route_key: poller-driven request-line peek + routing.
  • gthread.py finish_request: predictor.update, routing-aware keepalive re-park.
  • gthread.py run-loop sweep: mid-flight learning.
  • gthread_routing.py: SlowRoutePredictor.

8. Backward compatibility

  • enable_adaptive_queueing = False (the default) ⇒ feature off: single pool, no classification — byte-for-byte current behavior.
  • Hard per-request timeout (gthread.py:243-250) preserved unchanged; this adds a softer, non-fatal classification on top.
  • Worker handle/handle_request, keepalive semantics, and the future/finish_request contract are preserved (MSG_PEEK leaves the byte stream intact, so the parser is untouched).

9. Test plan

  • Predictor unit: unknown ⇒ fast; after update with a slow duration ⇒ slow; EWMA decay restores fast; seeded patterns are slow from first call; LRU bound holds under many keys.
  • Routing unit: classify_and_dispatch extracts the right route_key from partial vs complete peeked buffers; incomplete line defers; complete line dispatches to the expected lane.
  • Integration — flood isolation: app with a known-slow route flooded concurrently; assert fast-route latency stays low and slow requests never occupy fast workers.
  • Integration — cold start: never-seen slow route burst ⇒ confirm the lane flips to slow within ~one threshold interval via mid-flight learning.
  • Regression: enable_adaptive_queueing = False ⇒ current behavior; keepalive, SSL, and graceful shutdown paths still pass existing tests.