spawn_process now closes the manager's control socket listener and wakeup
self-pipe in the forked companion before running its target. Both are
inherited across the fork; closing them stops a companion from holding the
control listener (and possibly answering control requests) or the manager's
private signal pipe. Both closes are guarded, so a direct spawn without a
control socket or a running loop skips them.
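
A minimal sketch of the guarded close, not the actual spawn_process code.
close_inherited and its argument are hypothetical names; a None entry stands
for "no control socket" or "no running loop".

```python
import os


def close_inherited(fds):
    """Close inherited descriptors in a forked child, skipping unset ones.

    `fds` may contain None (for example when no control socket exists);
    those entries are ignored, so the call does nothing when there is
    nothing to close. Returns the descriptors that were actually closed.
    """
    closed = []
    for fd in fds:
        if fd is None:
            continue
        try:
            os.close(fd)
        except OSError:
            # Already closed or never valid in this process.
            continue
        closed.append(fd)
    return closed
```
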
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The forked companion manager inherits the arbiter's HTTP listening sockets,
its wakeup pipe, and the worker heartbeat files, none of which the manager
uses. Close them in the child before running so the manager and the companions
it forks do not pin the arbiter's fds. The manager creates its own signal pipe
and control socket after the fork.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Arbiter.reload (SIGHUP) now calls reload_companion_manager. A running manager
is sent SIGTERM so it drains its companions; the SIGCHLD reaper clears its pid
and manage_companion_manager respawns it from the freshly reloaded cfg. If
companions were added where none ran, a new manager starts immediately.
Restarting reuses the existing stop and respawn path; transactional
per-companion reread stays available separately through the control socket.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Arbiter.stop now signals the companion manager alongside the workers. It sends
the same SIGTERM (graceful) or SIGQUIT (immediate), waits the graceful_timeout
for both the workers and the manager to exit, then SIGKILLs whatever remains.
A graceful SIGTERM lets the manager stop its own companions before exiting.
stop_companion_manager(sig) signals the manager pid when it is running and
clears the pid on ESRCH; the SIGCHLD reaper clears it on a normal exit.
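
The pid handling can be sketched as below. This is an illustration under
assumptions, not the arbiter's method: the function returns the pid to keep
tracking (or None), and the injectable `kill` parameter is hypothetical and
exists only so the sketch can run without a real manager process.

```python
import os
import signal


def stop_companion_manager(pid, sig=signal.SIGTERM, kill=os.kill):
    """Signal the manager; return the pid to keep tracking, or None."""
    if not pid:
        return None  # manager not running: nothing to signal
    try:
        kill(pid, sig)
    except ProcessLookupError:
        # ESRCH: the manager is already gone, so forget its pid.
        return None
    return pid
```
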
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Run the companion manager as a single arbiter child with its own
supervision loop, and host the config model with its loader.
config.py holds CompanionConfig (moved from process.py) and
build_companion_configs(cfg), which expands each companion_workers entry into
a CompanionConfig, filling omitted fields from the global companion_* settings.
build_companion_configs also serves as the config_loader for reread.
process.py keeps State and CompanionProcess.
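
A rough sketch of the fallback idea only. SketchConfig, expand_entries, the
field names, and the "companion_<field>" setting keys are all assumptions
made for illustration; the real CompanionConfig and build_companion_configs
may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchConfig:
    # Hypothetical stand-in for CompanionConfig with a few example fields.
    name: str
    command: str
    restart_delay: float
    startsecs: float


def _pick(entry, settings, field):
    """Use the entry's own value, else the global companion_<field> value."""
    if field in entry:
        return entry[field]
    return settings["companion_" + field]


def expand_entries(entries, settings):
    """Expand raw per-companion dicts into SketchConfig objects."""
    return [
        SketchConfig(
            name=entry["name"],
            command=entry["command"],
            restart_delay=_pick(entry, settings, "restart_delay"),
            startsecs=_pick(entry, settings, "startsecs"),
        )
        for entry in entries
    ]
```
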
CompanionManager.run() is the forked-child body: installs SIGCHLD/SIGTERM/SIGINT
via a self-pipe, brings up the control socket, starts every companion, then
select-waits on the socket and the pipe. Each tick reaps exits, retries backoff,
promotes past startsecs, and SIGKILLs companions past their stop deadline.
SIGTERM/SIGINT stop all companions and return.
Arbiter gains companion_manager_pid, manage_companion_manager (respawns the
manager when it is gone and companions are configured), spawn_companion_manager
(fork; child runs the loop), and reap detection that clears the pid on exit.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add CompanionManager.reread_config(new_configs): diffs the running set against
a fresh, validated config list by config_hash -- a new name is added and
started, a missing name stopped and removed, a changed hash stores the config
and restarts (a manually stopped companion keeps STOPPED with the new config
ready), and an unchanged hash is left alone. Returns {ok, added, removed,
restarted, unchanged}. Validation runs first via _index_configs (duplicate-name
check), so a bad config mutates nothing and returns {ok: false, error,
kept_old_config: true}.
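
The diff can be sketched as follows. This is a simplified illustration, not
the manager's code: configs are reduced to (name, config_hash) pairs, the
running set to a {name: config_hash} dict, and reread, index_configs, and
diff_configs are hypothetical helper names. Starting and stopping processes
is left out.

```python
def index_configs(configs):
    """Index (name, config_hash) pairs by name, rejecting duplicates."""
    indexed = {}
    for name, config_hash in configs:
        if name in indexed:
            raise ValueError("duplicate companion name: %s" % name)
        indexed[name] = config_hash
    return indexed


def diff_configs(running, new):
    """Compare two {name: config_hash} mappings; names are sorted."""
    return {
        "ok": True,
        "added": sorted(n for n in new if n not in running),
        "removed": sorted(n for n in running if n not in new),
        "restarted": sorted(
            n for n in new if n in running and new[n] != running[n]),
        "unchanged": sorted(
            n for n in new if n in running and new[n] == running[n]),
    }


def reread(running, new_configs):
    """Validate first, so a bad config changes nothing."""
    try:
        new = index_configs(new_configs)
    except ValueError as exc:
        return {"ok": False, "error": str(exc), "kept_old_config": True}
    return diff_configs(running, new)
```
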
Wire the reread command to a config_loader hook on the manager -- the seam
between process supervision and config-file loading, set by the arbiter
(default None raises CommandError). A loader that raises returns the
kept-old-config error envelope.
Add tests for add/remove/restart-changed/manual-stop/unchanged/duplicate and
the reread no-loader, runs-loader, and bad-config paths.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add gunicorn/companion/control.py with ControlServer, the manager's control
endpoint. It owns the Unix socket lifecycle (create unlinks any stale socket,
binds, chmods 0o600, and listens; close cleans up) and the newline-delimited
JSON framing: serve_connection buffers reads and answers each complete line.
decode_command parses a request into a JSON object carrying a string cmd, and
encode_response writes a newline-terminated JSON line; malformed input becomes
a CommandError rendered as an {ok: false, error: ...} reply so a bad client
can't take the manager down. Turning a command into an action is delegated to a
dispatch callable, wired up in the later command tasks.
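
A minimal sketch of the framing and error envelope, assuming bytes in and
bytes out. The function names follow the description above, but the bodies
and the `feed` helper (a stand-in for the buffering in serve_connection) are
illustrative guesses, not the code in control.py.

```python
import json


class CommandError(Exception):
    """Raised for a request that cannot be decoded into a command."""


def decode_command(line):
    """Parse one request line into a dict carrying a string "cmd"."""
    try:
        obj = json.loads(line)
    except ValueError as exc:  # covers bad JSON and bad UTF-8
        raise CommandError("invalid JSON: %s" % exc)
    if not isinstance(obj, dict):
        raise CommandError("request must be a JSON object")
    if not isinstance(obj.get("cmd"), str):
        raise CommandError("request needs a string 'cmd'")
    return obj


def encode_response(obj):
    """Serialize a reply as one newline-terminated JSON line."""
    return json.dumps(obj).encode("utf-8") + b"\n"


def handle_line(line, dispatch):
    """Decode, dispatch, and encode; bad input becomes an error reply."""
    try:
        reply = dispatch(decode_command(line))
    except CommandError as exc:
        reply = {"ok": False, "error": str(exc)}
    return encode_response(reply)


def feed(buffer, data, dispatch):
    """Append data to buffer and answer each complete line.

    Returns (remaining_buffer, list_of_encoded_replies).
    """
    buffer += data
    replies = []
    while b"\n" in buffer:
        line, buffer = buffer.split(b"\n", 1)
        if line.strip():
            replies.append(handle_line(line, dispatch))
    return buffer, replies
```
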
The socket is 0o600 and owned by the non-root user gunicorn runs as; no group
switching.
Add tests/test_companion_control.py covering decode, encode, handle_line
dispatch and error envelopes, and socket create/close.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
spawn_process no longer clears manual_stop; spawning is now policy-neutral.
Clearing the flag is owned by start_process and restart_process (which already
do it), and the respawn paths (retry_backoff, restart_pending) only run when
the flag is already false. A manually stopped companion now keeps manual_stop
set through its exit, so it settles in STOPPED and is not auto-restarted.
Add tests: manual_stop preserved through exit, start clears it, spawn leaves
it untouched.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add restart_process(name) following supervisor's restart rules: it always
clears manual_stop. RUNNING/STARTING are sent their stop_signal and enter
STOPPING with restart_pending set and a deadline from reload_timeout; the
reaper respawns them immediately once the old child exits. BACKOFF and STOPPED
start again right away. STOPPING is rejected. It never rereads config.
handle_exit now honors restart_pending first, respawning immediately (bumping
restart_count) instead of going to STOPPED or BACKOFF. Add a restart_pending
field on CompanionProcess.
Add tests for the running, pending-reap, stopped, backoff, and stopping cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add stop_process(name) following supervisor's stop rules: it always sets
manual_stop so the companion will not auto-restart. RUNNING/STARTING are sent
their stop_signal and moved to STOPPING with a stop_deadline (now +
stop_timeout) for the run loop to reap or SIGKILL; BACKOFF cancels its pending
retry and settles in STOPPED; STOPPED and STOPPING are success no-ops. Add
_signal_number to resolve a signal name and a stop_deadline field on
CompanionProcess.
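
One way a signal-name resolver might look. This is a sketch, not the actual
_signal_number: accepting ints, bare names such as "TERM", and any letter
case are assumptions about its behavior.

```python
import signal


def signal_number(value):
    """Resolve an int, "TERM", or "SIGTERM" (any case) to a signal number."""
    if isinstance(value, int):
        return int(value)
    name = str(value).strip().upper()
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError("unknown signal: %r" % (value,))
```
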
Add tests for the running, backoff, already-stopped, unknown, and signal-name
cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add start_process(name) following supervisor's start rules: STOPPED and
BACKOFF clear manual_stop, drop any pending retry, and spawn now; RUNNING and
STARTING report success without acting; STOPPING is rejected so the caller
retries. Returns (ok, message).
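
The start rules can be sketched as a small function over plain dicts. The
dict layout, the `spawn` callback, and the message strings are assumptions
for illustration; only the per-state outcomes come from the description
above.

```python
def start_process(procs, name, spawn):
    """Apply the start rules to procs[name]; return (ok, message)."""
    proc = procs.get(name)
    if proc is None:
        return (False, "unknown companion: %s" % name)
    state = proc["state"]
    if state in ("RUNNING", "STARTING"):
        return (True, "already started")  # success without acting
    if state == "STOPPING":
        return (False, "stopping; retry later")
    # STOPPED or BACKOFF: clear the flag, drop the retry, spawn now.
    proc["manual_stop"] = False
    proc["next_retry_at"] = None
    spawn(proc)
    proc["state"] = "STARTING"
    return (True, "started")
```
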
Add tests for the stopped, backoff, running, stopping, and unknown cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reaping now transitions each exited companion via handle_exit: a manually
stopped one settles in STOPPED, any other exit enters BACKOFF with
next_retry_at = now + restart_delay (fixed, no exponential backoff or cap).
Add retry_backoff to re-fork BACKOFF companions once their delay elapses,
bumping restart_count and returning them to STARTING.
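
The two transitions can be sketched as below. The Companion dataclass and
the `spawn` callback are hypothetical simplifications of CompanionProcess
and the real fork; the clock is passed in so the timing is easy to follow.

```python
from dataclasses import dataclass
from typing import Optional

STARTING, BACKOFF, STOPPED = "STARTING", "BACKOFF", "STOPPED"


@dataclass
class Companion:
    name: str
    restart_delay: float = 1.0
    state: str = STARTING
    manual_stop: bool = False
    next_retry_at: Optional[float] = None
    restart_count: int = 0


def handle_exit(proc, now):
    """Manual stop settles in STOPPED; any other exit enters BACKOFF."""
    if proc.manual_stop:
        proc.state = STOPPED
        proc.next_retry_at = None
    else:
        proc.state = BACKOFF
        # Fixed delay: no exponential growth and no cap.
        proc.next_retry_at = now + proc.restart_delay


def retry_backoff(procs, now, spawn):
    """Respawn BACKOFF companions whose delay has elapsed."""
    retried = []
    for proc in procs:
        due = proc.next_retry_at is not None and now >= proc.next_retry_at
        if proc.state == BACKOFF and due:
            spawn(proc)
            proc.restart_count += 1
            proc.state = STARTING
            proc.next_retry_at = None
            retried.append(proc)
    return retried
```
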
Add tests for backoff on unexpected exit, manual-stop staying stopped, retry
timing, and reap-to-backoff.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add promote_running to CompanionManager: scans STARTING companions and moves
any that have stayed alive at least their startsecs window to RUNNING, logging
the pid and returning the promoted ones. Companions that die inside the window
are left to reaping.
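
A small sketch of the promotion scan, assuming each companion is a dict with
"state", "started_at", and "startsecs" keys (a hypothetical layout) and that
the current time is passed in. Logging is omitted.

```python
def promote_running(procs, now):
    """Move STARTING companions that outlived startsecs to RUNNING."""
    promoted = []
    for proc in procs:
        if proc["state"] != "STARTING":
            continue  # only STARTING companions are candidates
        if now - proc["started_at"] >= proc["startsecs"]:
            proc["state"] = "RUNNING"
            promoted.append(proc)
    return promoted
```
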
Add tests for promotion after the window, too-early no-op, and non-STARTING.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add reap_processes to CompanionManager: drains waitpid(WNOHANG), matches each
dead pid back to its companion, and records the exit via _record_exit (signal
number or exit code, exited_at, exit_count) while freeing the pid. Returns the
reaped companions; the restart decision stays with the run loop.
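
The non-blocking drain might look like this on a POSIX system. It is a
sketch only: decode_status and reap_exited are hypothetical names, and the
bookkeeping done by _record_exit (exited_at, exit_count, matching the pid to
a companion) is left out.

```python
import os


def decode_status(status):
    """Return ("signal", n) or ("exit", code) for a waitpid status."""
    if os.WIFSIGNALED(status):
        return ("signal", os.WTERMSIG(status))
    return ("exit", os.WEXITSTATUS(status))


def reap_exited():
    """Drain every already-exited child without blocking."""
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid,) + decode_status(status))
    return reaped
```
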
Add tests for exit-code, signal, and no-children cases.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The child calls _redirect_output after env setup: each configured log path is
opened in append mode and dup2'd onto fd 1/2. None/inherit keeps the inherited
fd; a stderr setting of stdout shares stdout's fd. Rotation stays external.
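
A sketch of the redirection, not the actual _redirect_output. The parameter
names, the "inherit" and "stdout" literals, the 0o644 mode, and the
overridable target fds (added so the sketch can be tried without touching
the real fd 1/2) are all assumptions.

```python
import os

_APPEND = os.O_WRONLY | os.O_CREAT | os.O_APPEND


def _point_at(path, target_fd):
    """Open path in append mode and make target_fd refer to it."""
    fd = os.open(path, _APPEND, 0o644)
    if fd != target_fd:
        os.dup2(fd, target_fd)
        os.close(fd)


def redirect_output(stdout=None, stderr=None, out_fd=1, err_fd=2):
    """Redirect out_fd/err_fd; None or "inherit" keeps the inherited fd."""
    if stdout not in (None, "inherit"):
        _point_at(stdout, out_fd)
    if stderr == "stdout":
        os.dup2(out_fd, err_fd)  # stderr shares whatever stdout points at
    elif stderr not in (None, "inherit"):
        _point_at(stderr, err_fd)
```
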
Add tests for inherit, append flags, file dup2, and stderr-to-stdout.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Also strip whitespace *after* the header field value.
Refuse obsolete header line folding outright (a default-off
option to restore the old behavior is provided temporarily).
While we are at it, explicitly handle the recently
introduced HTTP error classes with their intended status codes.
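
A simplified sketch of the two header rules, not the parser's actual code.
parse_headers works on already-split text lines, and the `permit_obs_fold`
flag is a hypothetical name standing in for the default-off option.

```python
def parse_headers(lines, permit_obs_fold=False):
    """Parse "Name: value" lines into (name, value) pairs.

    Whitespace is stripped on both sides of the value. A line starting
    with a space or tab is obsolete line folding and is refused unless
    permit_obs_fold is set, in which case it extends the previous value.
    """
    headers = []
    for line in lines:
        if line[:1] in (" ", "\t"):
            if not permit_obs_fold or not headers:
                raise ValueError("obsolete line folding refused")
            name, value = headers[-1]
            folded = (value + " " + line.strip(" \t")).strip(" \t")
            headers[-1] = (name, folded)
            continue
        name, sep, value = line.partition(":")
        if not sep or not name or name != name.strip(" \t"):
            raise ValueError("invalid header line: %r" % line)
        headers.append((name, value.strip(" \t")))
    return headers
```
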