251 Commits

Author SHA1 Message Date
Tanmoy Sarkar
7f39839a8c fix(companion): Back off manager respawn and quiet expected exits
The arbiter respawned the companion manager on every main-loop tick once
its pid cleared, so a manager that could not boot would busy-spin. It also
logged every manager exit as an error, including the deliberate exits from
shutdown and reload.

Track whether a manager exit was on purpose: stop_companion_manager marks
it expected and clears any backoff, so the reaper logs it as info and a
reload respawns without delay. An unexpected exit now arms an exponential
crash backoff (2^(n-1)s, capped at 30s) that the main loop waits out before
respawning, and is logged as an error.
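
As an illustration of the backoff schedule described above, here is a minimal sketch. The function name `crash_backoff_delay` and its parameters are hypothetical and not taken from the patch; only the 2^(n-1) growth and the 30s cap come from the message.

```python
def crash_backoff_delay(consecutive_crashes: int, cap_seconds: float = 30.0) -> float:
    """Seconds to wait before respawning after the n-th unexpected exit in a row."""
    if consecutive_crashes < 1:
        # No crash recorded (or the exit was expected): respawn without delay.
        return 0.0
    # 1s, 2s, 4s, 8s, 16s, then pinned at the cap.
    return float(min(2 ** (consecutive_crashes - 1), cap_seconds))


for n in range(1, 8):
    print(n, crash_backoff_delay(n))
```

Taking `min` before converting to float keeps the function safe for very large crash counts.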

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-13 11:42:02 +05:30
Tanmoy Sarkar
220bbfe150 fix(companion): Only reload companion manager when its config changed
A SIGHUP web reload recycles HTTP workers and re-reads config, but with
--preload it does not re-import application code: the WSGI callable is
loaded once and cached, so running companions are already current.
Restarting the manager on every reload bounced all companions for
nothing, slowing the common fast-reload path Frappe relies on.

reload_companion_manager now rebuilds the companion configs and compares
their sorted config_hash against the running manager's. Unchanged ->
leave the manager and its companions running. Changed (field, added, or
removed name) -> restart the manager from the fresh cfg, as before.
Per-companion reread via the control socket is unchanged.
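
The comparison could look roughly like the sketch below. `manager_needs_restart` is a hypothetical helper, and it assumes each config_hash also covers the companion name, so that a rename shows up as a changed hash.

```python
def manager_needs_restart(running_hashes, fresh_hashes) -> bool:
    """True when the rebuilt companion configs differ from the running set."""
    # Sorting makes the check independent of declaration order.
    return sorted(running_hashes) != sorted(fresh_hashes)


running = ["a1f3", "9c2e"]
print(manager_needs_restart(running, ["9c2e", "a1f3"]))          # same set, reordered
print(manager_needs_restart(running, ["9c2e", "a1f3", "77b0"]))  # one companion added
```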

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 23:44:32 +05:30
Tanmoy Sarkar
672c45a9c7 fix(companion): Harden manager and control against runtime errors
- _safe_kill: a companion can exit between the manager deciding to
  signal it and the kill landing; swallow ProcessLookupError at the three
  os.kill sites so the resulting race cannot take the manager down.
- _redirect_output: close the opened log fd after dup2 so a long-lived
  companion does not leak a descriptor per start.
- serve_connection: drop a control connection whose line grows past
  MAX_LINE_BYTES without a newline, so a client cannot pin unbounded
  memory in the manager.
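
The serve_connection guard can be pictured with the sketch below. `split_lines`, `LineTooLong`, and the 64 KiB value for MAX_LINE_BYTES are assumptions made for illustration; the message does not state the real limit or the function layout.

```python
MAX_LINE_BYTES = 64 * 1024  # assumed value, not taken from the patch


class LineTooLong(Exception):
    """Raised when a client sends too many bytes without a newline."""


def split_lines(buffer: bytes, chunk: bytes, max_line_bytes: int = MAX_LINE_BYTES):
    """Append chunk to buffer and return (complete_lines, unterminated_rest)."""
    data = buffer + chunk
    # Everything before the last newline is complete; the tail is still pending.
    *lines, rest = data.split(b"\n")
    if len(rest) > max_line_bytes:
        # The caller is expected to drop the connection at this point.
        raise LineTooLong(f"line exceeds {max_line_bytes} bytes without a newline")
    return lines, rest


lines, rest = split_lines(b"", b'{"cmd": "status"}\n{"cmd": "st')
print(lines, rest)
```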

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 23:30:48 +05:30
Tanmoy Sarkar
3b972fe310 fix(companion): Validate stop_signal and harden control dispatch
A typo'd companion_stop_signal (e.g. "SIGTRM") passed validate_string
but raised ValueError in _signal_number when the manager later tried to
send it -- propagating past handle_line and killing the run loop.

Validate stop_signal at config-build time so a bad value fails loudly
on load and reread. As defense-in-depth, catch unexpected exceptions in
ControlServer.handle_line so no handler bug can escape and kill the
manager; they now return an error envelope.
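
A minimal sketch of validating the signal name up front, assuming the setting holds a name such as "SIGTERM". `validate_stop_signal` is a hypothetical helper, not the patch's own validator.

```python
import signal


def validate_stop_signal(name: str) -> signal.Signals:
    """Resolve a signal name at config-build time, failing loudly on a typo."""
    try:
        return signal.Signals[name]
    except KeyError:
        raise ValueError(f"unknown stop signal: {name!r}") from None


print(validate_stop_signal("SIGTERM"))
try:
    validate_stop_signal("SIGTRM")
except ValueError as exc:
    print(exc)
```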

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 23:25:26 +05:30
Tanmoy Sarkar
68ac2e4bb2 fix(companion): Cancel pending restart when stopping a companion
A stop issued while a restart was in flight (state STOPPING,
restart_pending set) was ignored: handle_exit checked restart_pending
first and respawned the companion the user had just stopped. Clear
restart_pending in stop_process so manual stop wins.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 23:23:18 +05:30
Tanmoy Sarkar
642387dd0e fix(companion): Apply dead config settings and validate specs
Three companion settings were documented and configurable but never had any
effect. companion_restart_delay was ignored because CompanionProcess hardcoded
a 5s delay; it is now read from config and kept out of config_hash, since it
does not affect the spawned process and so must not trigger a restart on
reread. companion_config_file was never read; the manager now loads its
companion settings from that dedicated file when set, instead of always reading
the main gunicorn config. companion_manager_stop_timeout was unused, so
shutdown waited only graceful_timeout before SIGKILLing the manager and cut
short long-draining companions; stop now waits the larger of graceful_timeout
and the manager stop timeout, derived from the slowest companion stop_timeout
plus the buffer when not set explicitly.
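
The shutdown wait described above can be sketched as follows. `manager_stop_wait` and its parameter names are hypothetical, and the buffer is passed in as an argument because its actual value is not stated here.

```python
def manager_stop_wait(graceful_timeout, companion_stop_timeouts, buffer_seconds,
                      manager_stop_timeout=None):
    """Seconds the arbiter waits for the manager before resorting to SIGKILL."""
    if manager_stop_timeout is None:
        # Not set explicitly: derive it from the slowest companion plus the buffer.
        manager_stop_timeout = max(companion_stop_timeouts, default=0) + buffer_seconds
    return max(graceful_timeout, manager_stop_timeout)


print(manager_stop_wait(30, [10, 60], 5))
```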

Worker specs now reject unknown keys so a typo fails loudly instead of silently
falling back to a default. Also correct the spawn_companion_manager docstring,
drop its unused return value, and fix the README config-file description.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-12 22:51:19 +05:30
Tanmoy Sarkar
88e8ef0f36 feat(companion): Add gunicorn-companion control CLI
Add a command-line client for the companion control socket so operators do not
have to hand-craft JSON. gunicorn.companion.ctl speaks the manager's
newline-delimited JSON protocol: status, start, stop, restart, and reread. The
socket path comes from --socket or . Exit status is 0
when the manager reports ok, 1 when it reports a failure, and 2 for a usage
error or an unreachable socket. Registered as the gunicorn-companion console
script.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:22:41 +05:30
Tanmoy Sarkar
a90eba0c17 fix(companion): Correct manager pid and reset companion signals 2026-06-09 23:18:54 +05:30
Tanmoy Sarkar
1827667cb2 test(companion): Assert HTTP worker path is unchanged
Add two arbiter regression tests. A worker exit is still reaped normally (tmp
closed, child_exit called) while a companion manager pid is registered, so the
companion reap branch does not swallow worker exits. And an HTTP worker is
still spawned and recorded as before when companions are configured, so the
companion config never touches the worker path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:08:29 +05:30
Tanmoy Sarkar
1fc57b22a8 test(companion): Add transactional reread test
Add the missing transactionality assertion to the existing reread coverage: a
batch that would change one companion and add another but also contains a
duplicate name is rejected as a whole. The process set and configs stay
untouched and no fork or kill happens, proving nothing is applied on validation
failure.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:06:28 +05:30
Tanmoy Sarkar
4d554c2fac test(companion): Add control command tests
Wire ControlServer.handle_line to a real CompanionManager.handle_command and
assert the full decode/dispatch/encode round trip: status returns the companion
list, start routes through and reports a message, and unknown command, missing
name, and reread without a loader each return an error envelope.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:05:41 +05:30
Tanmoy Sarkar
e15dd583b9 test(companion): Add state transition tests
Add end-to-end chains over the per-unit tests: spawn to STARTING, promote to
RUNNING, unexpected exit to BACKOFF, retry back to STARTING; the stop path
ending in manual STOPPED; and the restart path that respawns immediately when
the old child exits.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:04:40 +05:30
Tanmoy Sarkar
e780484d24 test(companion): Add config validation tests
Cover validate_companion_workers (None becomes empty, non-list and non-dict
items rejected) and CompanionConfig.config_hash (stable for equal configs,
changes when a field changes, callable target keyed by qualified name and
hashed stably).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:03:43 +05:30
Tanmoy Sarkar
465aff870d feat(companion): Add lifecycle logs for companion transitions
Fill the gaps in the manager's lifecycle logging. Every reaped companion now
logs how it exited (signal vs status) before its fate is decided, and
handle_exit logs the decision: restarting, stopped when stopped on purpose, or
backing off with the retry delay. stop_all brackets shutdown with 'stopping all
companions' and 'all companions stopped', run() logs when the manager stops,
and reread_config logs an added/removed/restarted/unchanged summary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 23:01:20 +05:30
Tanmoy Sarkar
9c4d81726d feat(companion): Add parent-death cleanup for manager and companions
Stop orphaned processes from lingering when their parent dies.

set_parent_death_signal arms Linux prctl(PR_SET_PDEATHSIG) so a process is
signalled the moment its parent exits, returning False off Linux so callers
fall back to polling getppid.
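
One way to arm the parent-death signal from Python is through ctypes, as sketched below. This is an assumption about the approach, not a copy of the patch; the constant PR_SET_PDEATHSIG = 1 comes from <linux/prctl.h>.

```python
import ctypes
import signal
import sys

PR_SET_PDEATHSIG = 1  # from <linux/prctl.h>


def set_parent_death_signal(sig: int = signal.SIGTERM) -> bool:
    """Ask the kernel to send sig to this process when its parent exits.

    Returns False when the call is unavailable so the caller can fall back
    to polling os.getppid().
    """
    if not sys.platform.startswith("linux"):
        return False
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        return libc.prctl(PR_SET_PDEATHSIG, int(sig), 0, 0, 0) == 0
    except (OSError, AttributeError):
        return False
```

Because the parent may already have died between fork and the prctl call, a child would still compare os.getppid() against the recorded parent pid right after arming.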

The manager records its parent pid, arms a SIGTERM parent-death signal, and
checks getppid each tick: if the arbiter dies, the manager stops its companions
and exits instead of running on under a dead arbiter. Each companion arms the
same parent-death signal and rechecks getppid right after the fork, exiting if
the manager already died before the signal was armed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 22:56:21 +05:30
Tanmoy Sarkar
f21d0310be feat(companion): Close manager-only fds in the companion child
spawn_process now closes the manager's control socket listener and wakeup
self-pipe in the forked companion before running its target. Both are
inherited across the fork; closing them stops a companion from holding the
control listener (and possibly answering control requests) or the manager's
private signal pipe. Guarded so direct spawns without a control socket or
running loop are a no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 22:52:17 +05:30
Tanmoy Sarkar
31e08aac58 feat(companion): Close Gunicorn-only fds in the manager child
The forked companion manager inherits the arbiter's HTTP listening sockets,
its wakeup pipe, and the worker heartbeat files, none of which the manager
uses. Close them in the child before running so the manager and the companions
it forks do not pin the arbiter's fds. The manager creates its own signal pipe
and control socket after the fork.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 22:50:52 +05:30
Tanmoy Sarkar
22431f24e6 feat(companion): Restart the manager on Gunicorn reload
Arbiter.reload (SIGHUP) now calls reload_companion_manager. A running manager
is sent SIGTERM so it drains its companions; the SIGCHLD reaper clears its pid
and manage_companion_manager respawns it from the freshly reloaded cfg. If
companions were added where none ran, a new manager starts immediately.

Restarting reuses the existing stop and respawn path; transactional
per-companion reread stays available separately through the control socket.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 22:32:59 +05:30
Tanmoy Sarkar
073a0b2e7d feat(companion): Shut down the manager from the arbiter
Arbiter.stop now signals the companion manager alongside the workers. It sends
the same SIGTERM (graceful) or SIGQUIT (immediate), waits the graceful_timeout
for both the workers and the manager to exit, then SIGKILLs whatever remains.
A graceful SIGTERM lets the manager stop its own companions before exiting.

stop_companion_manager(sig) signals the manager pid when it is running and
clears the pid on ESRCH; the SIGCHLD reaper clears it on a normal exit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 22:30:58 +05:30
Tanmoy Sarkar
457bc5a69a feat(companion): Spawn and reap the manager from the arbiter
Run the companion manager as a single arbiter child with its own
supervision loop, and host the config model with its loader.

config.py holds CompanionConfig (moved from process.py) and
build_companion_configs(cfg), which expands each companion_workers entry into
a CompanionConfig, filling omitted fields from the global companion_* settings.
It is also the reread config_loader. process.py keeps State and CompanionProcess.

CompanionManager.run() is the forked-child body: installs SIGCHLD/SIGTERM/SIGINT
via a self-pipe, brings up the control socket, starts every companion, then
select-waits on the socket and the pipe. Each tick reaps exits, retries backoff,
promotes past startsecs, and SIGKILLs companions past their stop deadline.
SIGTERM/SIGINT stop all companions and return.

Arbiter gains companion_manager_pid, manage_companion_manager (respawns the
manager when it is gone and companions are configured), spawn_companion_manager
(fork; child runs the loop), and reap detection that clears the pid on exit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 22:24:53 +05:30
Tanmoy Sarkar
9f3762d6b6 refactor(companion): Spell out abbreviated identifiers
No behaviour change.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 22:03:13 +05:30
Tanmoy Sarkar
5db503295c feat(companion): Implement transactional reread
Add CompanionManager.reread_config(new_configs): diffs the running set against
a fresh, validated config list by config_hash -- a new name is added and
started, a missing name stopped and removed, a changed hash stores the config
and restarts (a manually stopped companion keeps STOPPED with the new config
ready), and an unchanged hash is left alone. Returns {ok, added, removed,
restarted, unchanged}. Validation runs first via _index_configs (duplicate-name
check), so a bad config mutates nothing and returns {ok: false, error,
kept_old_config: true}.
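
The diff can be sketched as a pure function, shown below. `diff_configs` is a hypothetical helper that works on plain names and hashes rather than real CompanionConfig objects, and it only computes the plan without starting or stopping anything.

```python
def diff_configs(running: dict, fresh: list) -> dict:
    """Compare running {name: config_hash} with a fresh list of (name, config_hash)."""
    indexed = {}
    for name, config_hash in fresh:
        if name in indexed:
            # Validation runs before any change, so a bad batch mutates nothing.
            return {"ok": False,
                    "error": f"duplicate companion name: {name}",
                    "kept_old_config": True}
        indexed[name] = config_hash
    both = [name for name in indexed if name in running]
    return {
        "ok": True,
        "added": sorted(name for name in indexed if name not in running),
        "removed": sorted(name for name in running if name not in indexed),
        "restarted": sorted(name for name in both if running[name] != indexed[name]),
        "unchanged": sorted(name for name in both if running[name] == indexed[name]),
    }


print(diff_configs({"mailer": "h1", "indexer": "h2"},
                   [("mailer", "h1"), ("indexer", "h9"), ("reaper", "h3")]))
```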

Wire the reread command to a config_loader hook on the manager -- the seam
between process supervision and config-file loading, set by the arbiter
(default None raises CommandError). A loader that raises returns the
kept-old-config error envelope.

Add tests for add/remove/restart-changed/manual-stop/unchanged/duplicate and
the reread no-loader, runs-loader, and bad-config paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 21:55:51 +05:30
Tanmoy Sarkar
ef6e42ecc1 feat(companion): Implement status, start, stop, and restart commands 2026-06-09 21:44:27 +05:30
Tanmoy Sarkar
104bfcebdd feat(companion): Add Unix control socket and JSON command protocol
Add gunicorn/companion/control.py with ControlServer, the manager's control
endpoint. It owns the Unix socket lifecycle (create unlinks any stale socket,
binds, chmods 0o600, and listens; close cleans up) and the newline-delimited
JSON framing: serve_connection buffers reads and answers each complete line.
decode_command parses a request into a JSON object carrying a string cmd, and
encode_response writes a newline-terminated JSON line; malformed input becomes
a CommandError rendered as an {ok: false, error: ...} reply so a bad client
can't take the manager down. Turning a command into an action is delegated to a
dispatch callable, wired up in the later command tasks.
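
A minimal sketch of the framing described above. The function names mirror the ones in the message, but the bodies are an illustration under that description, not the code in control.py.

```python
import json


class CommandError(Exception):
    """A malformed or unsupported control request."""


def decode_command(line: bytes) -> dict:
    """Parse one request line into a JSON object carrying a string 'cmd'."""
    try:
        request = json.loads(line.decode("utf-8"))
    except ValueError as exc:  # covers bad UTF-8 and bad JSON
        raise CommandError(f"invalid JSON: {exc}") from None
    if not isinstance(request, dict):
        raise CommandError("request must be a JSON object")
    if not isinstance(request.get("cmd"), str):
        raise CommandError("request must carry a string 'cmd'")
    return request


def encode_response(response: dict) -> bytes:
    """Serialize a reply as one newline-terminated JSON line."""
    return json.dumps(response).encode("utf-8") + b"\n"


def handle_line(line: bytes, dispatch) -> bytes:
    """Decode, dispatch, and encode; bad input becomes an error envelope."""
    try:
        return encode_response(dispatch(decode_command(line)))
    except CommandError as exc:
        return encode_response({"ok": False, "error": str(exc)})


print(handle_line(b'{"cmd": "status"}', lambda request: {"ok": True, "companions": []}))
print(handle_line(b"not json", lambda request: {"ok": True}))
```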

The socket is 0o600 and owned by the non-root user gunicorn runs as; no group
switching.

Add tests/test_companion_control.py covering decode, encode, handle_line
dispatch and error envelopes, and socket create/close.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:23:03 +05:30
Tanmoy Sarkar
c82df2ab94 feat(companion): Make manual_stop ownership explicit
spawn_process no longer clears manual_stop; spawning is now policy-neutral.
Clearing the flag is owned by start_process and restart_process (which already
do it), and the respawn paths (retry_backoff, restart_pending) only run when
the flag is already false. A manually stopped companion now keeps manual_stop
set through its exit, so it settles in STOPPED and is not auto-restarted.

Add tests: manual_stop preserved through exit, start clears it, spawn leaves
it untouched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:17:44 +05:30
Tanmoy Sarkar
8e0ca34277 feat(companion): Implement restart_process control command
Add restart_process(name) following supervisor's restart rules: it always
clears manual_stop. RUNNING/STARTING are sent their stop_signal and enter
STOPPING with restart_pending set and a deadline from reload_timeout; the
reaper respawns them immediately once the old child exits. BACKOFF and STOPPED
start again right away. STOPPING is rejected. It never rereads config.

handle_exit now honors restart_pending first, respawning immediately (bumping
restart_count) instead of going to STOPPED or BACKOFF. Add a restart_pending
field on CompanionProcess.

Add tests for the running, pending-reap, stopped, backoff, and stopping cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:10:40 +05:30
Tanmoy Sarkar
8d9eb76e3d feat(companion): Implement stop_process control command
Add stop_process(name) following supervisor's stop rules: it always sets
manual_stop so the companion will not auto-restart. RUNNING/STARTING are sent
their stop_signal and moved to STOPPING with a stop_deadline (now +
stop_timeout) for the run loop to reap or SIGKILL; BACKOFF cancels its pending
retry and settles in STOPPED; STOPPED and STOPPING are success no-ops. Add
_signal_number to resolve a signal name and a stop_deadline field on
CompanionProcess.

Add tests for the running, backoff, already-stopped, unknown, and signal-name
cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 18:06:58 +05:30
Tanmoy Sarkar
8c9aa962ae feat(companion): Implement start_process control command
Add start_process(name) following supervisor's start rules: STOPPED and
BACKOFF clear manual_stop, drop any pending retry, and spawn now; RUNNING and
STARTING report success without acting; STOPPING is rejected so the caller
retries. Returns (ok, message).

Add tests for the stopped, backoff, running, stopping, and unknown cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 17:59:35 +05:30
Tanmoy Sarkar
87bc4cf70e feat(companion): Implement BACKOFF with fixed restart delay
Reaping now transitions each exited companion via handle_exit: a manually
stopped one settles in STOPPED, any other exit enters BACKOFF with
next_retry_at = now + restart_delay (fixed, no exponential backoff or cap).
Add retry_backoff to re-fork BACKOFF companions once their delay elapses,
bumping restart_count and returning them to STARTING.

Add tests for backoff on unexpected exit, manual-stop staying stopped, retry
timing, and reap-to-backoff.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 17:56:22 +05:30
Tanmoy Sarkar
84d69c46fd feat(companion): Promote companions from STARTING to RUNNING after startsecs
Add promote_running to CompanionManager: scans STARTING companions and moves
any that have stayed alive at least their startsecs window to RUNNING, logging
the pid and returning the promoted ones. Companions that die inside the window
are left to reaping.
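
The promotion scan amounts to something like the sketch below. It uses plain dicts with assumed keys (`state`, `started_at`, `startsecs`) in place of CompanionProcess objects and leaves out the logging.

```python
def promote_running(processes, now):
    """Move STARTING companions that outlived their startsecs window to RUNNING."""
    promoted = []
    for process in processes:
        if process["state"] != "STARTING":
            continue
        if now - process["started_at"] >= process["startsecs"]:
            process["state"] = "RUNNING"
            promoted.append(process)
    return promoted


companions = [
    {"name": "mailer", "state": "STARTING", "started_at": 100.0, "startsecs": 5},
    {"name": "indexer", "state": "STARTING", "started_at": 103.0, "startsecs": 5},
]
print([p["name"] for p in promote_running(companions, now=106.0)])
```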

Add tests for promotion after the window, too-early no-op, and non-STARTING.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 17:52:25 +05:30
Tanmoy Sarkar
bd8a91f656 feat(companion): Reap exited companion processes
Add reap_processes to CompanionManager: drains waitpid(WNOHANG), matches each
dead pid back to its companion, and records the exit via _record_exit (signal
number or exit code, exited_at, exit_count) while freeing the pid. Returns the
reaped companions; the restart decision stays with the run loop.

Add tests for exit-code, signal, and no-children cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 17:49:27 +05:30
Tanmoy Sarkar
2bf7e1b1fb feat(companion): Redirect companion stdout and stderr
Child calls _redirect_output after env setup: each configured log path is
opened append-mode and dup2'd onto fd 1/2. None/inherit keeps the inherited
fd; stderr stdout shares stdout's fd. Rotation stays external.

Add tests for inherit, append flags, file dup2, and stderr-to-stdout.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 17:30:18 +05:30
Tanmoy Sarkar
ea2748a209 feat(companion): Apply cwd and env in spawned companion child
Child runs _apply_environment before the target: os.chdir(cwd) then
os.environ.update(env).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 17:06:42 +05:30
Tanmoy Sarkar
5639d467f3 feat(companion): Add CompanionManager skeleton and single-companion spawn
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-09 16:57:18 +05:30
Ankush Menat
ec6af68013 fix: Remove hardcoded paths for slow prediction 2026-05-28 16:01:02 +05:30
Ankush Menat
ee9bf1e950 feat: Adaptive queueing of slow/fast requests 2026-05-27 11:58:54 +05:30
Benoit Chesneau
e75c3533e3
Merge pull request #3189 from pajod/patch-py36
chore: eat Python 2 leftovers
2024-08-10 10:40:40 +02:00
Benoit Chesneau
3f56d76548
Merge pull request #3192 from pajod/patch-allowed-script-name
22.0.0 regression: We need a better default treatment of SCRIPT_NAME
2024-08-09 09:05:57 +02:00
Paul J. Dorn
ffa48b581d test: default change was intentional 2024-08-08 18:37:32 +02:00
Paul J. Dorn
3e042e8269 Configurable list of forwarder headers 2024-08-07 20:15:13 +02:00
Paul J. Dorn
2bc931e7d9 whitespace handling in header field values
Strip whitespace also *after* header field value.
Simply refuse obsolete header folding (a default-off
option to revert is temporarily provided).
While we are at it, explicitly handle recently
introduced http error classes with intended status code.
2024-08-07 19:42:16 +02:00
Benoit Chesneau
ad7c1de132
Merge pull request #3080 from odyfatouros/Fix-#3079-worker_class-parameter-accepts-class
Fix for issue #3079, worker_class parameter accepts a class
2024-08-07 08:47:20 +02:00
benoitc
555d2fa27f don't tolerate wrong te headers
changes:

- Just follow the new TE specification (https://datatracker.ietf.org/doc/html/rfc9112#name-transfer-encoding)
 here and accept to introduce a breaking change.
- handle multiple TE on one line

** breaking changes ** : invalid  headers and position will now return
an error.
2024-08-06 23:47:01 +02:00
Benoit Chesneau
9a96e75808
Merge pull request #3253 from pajod/patch-rfc9110-section5.5
Refuse requests with invalid and dangerous CR/LF/NUL in header field value, as demanded by rfc9110 section 5.5
2024-08-06 22:25:12 +02:00
Paul J. Dorn
cabc666277 chunked encoding: example invalid requests 2024-07-31 19:21:07 +02:00
Paul J. Dorn
eda9d456d3 forbid lone CR/LF and NUL in headers
New parser rule: refuse HTTP requests where a header field value
contains characters that
a) should never appear there in the first place,
b) might have led to incorrect treatment in a proxy in front, and
c) might lead to unintended behaviour in applications.

From RFC 9110 section 5.5:
"Field values containing CR, LF, or NUL characters are invalid and
dangerous, due to the varying ways that implementations might parse
and interpret those characters; a recipient of CR, LF, or NUL within
a field value MUST either reject the message or replace each of those
characters with SP before further processing or forwarding of that
message."
2024-07-31 01:28:30 +02:00
Vaclav Rehak
97f87ec13e Fix InvalidHTTPVersion exception str method
Fixes: #3195
2024-04-26 13:58:10 +02:00
Paul J. Dorn
422b18acea class Name(object): -> class Name: 2024-04-22 03:33:30 +02:00
Paul J. Dorn
4323027b1e drop long-default - coding: utf-8 2024-04-22 03:33:14 +02:00
Odysseas Fatouros
08364f0365 Issue #3079, add unit test 2024-01-02 14:21:26 +01:00