Cover validate_companion_workers (None becomes empty, non-list and non-dict items rejected) and CompanionConfig.config_hash (stable for equal configs, changes when a field changes, callable target keyed by qualified name and hashed stably). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
19 KiB
Status: proposal / draft
Author: Tanmoy Sarkar
Scope: gunicorn/arbiter.py, gunicorn/config.py, gunicorn/companion/
1. Problem
A Frappe deployment is not only HTTP workers.
Alongside Gunicorn, we usually run persistent non-HTTP processes:
- RQ worker pools
- scheduler
- socket.io / websocket server
- custom background daemons
Today these are usually managed separately through supervisor/systemd.
That causes:
- repeated app memory usage
- separate lifecycle for web and side processes
- reload drift between HTTP workers and background processes
- inconsistent shutdown behavior
- harder production process control
With preload_app=True, Gunicorn workers already share preloaded app memory using copy-on-write. The goal is to give non-HTTP processes the same lifecycle and memory-sharing benefit without making them HTTP workers.
2. Goal
Gunicorn manages one extra child process: the Companion Manager.
The Companion Manager manages all configured companion processes.
gunicorn master
├── HTTP worker
├── HTTP worker
└── companion manager
├── rq-default
├── rq-long
├── scheduler
└── socketio
Core rule:
Gunicorn Arbiter manages one Companion Manager.
Companion Manager manages companion processes.
Each companion process manages its own internals.
3. Architecture
gunicorn master
preload_app=True
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
HTTP worker HTTP worker companion manager
serves HTTP serves HTTP manages companions
│
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
rq-default scheduler socketio
Memory sharing still works:
gunicorn master preloads app
└── forks companion manager
└── forks rq / scheduler / socketio
The manager is forked from the preloaded master. Companion processes are forked from the manager, so they can inherit preloaded application memory.
4. Responsibility Boundary
Gunicorn Arbiter
The Arbiter should:
- start the Companion Manager
- restart it if it crashes
- stop it during Gunicorn shutdown
- ask it to
rereadconfig when needed - avoid per-companion process logic
Companion Manager
The manager should:
- load and validate companion config
- spawn/reap companions
- stop/start/restart companions
- restart unexpected exits after a fixed delay
- track state and expose
status - expose a Unix control socket
- redirect stdout/stderr
- apply env and cwd
- log lifecycle events
Companion Process
A companion runs the actual service, such as RQ, scheduler, socket.io, or a custom daemon.
The companion process owns its own internals:
- signal handling
- job draining
- child workers
- sockets
- event loops
5. Companion Is Not an HTTP Worker
A companion must not:
- serve Gunicorn HTTP traffic
- use Gunicorn listener sockets
- use Gunicorn worker heartbeat files
- trigger HTTP worker boot-error halt behavior
- call HTTP worker lifecycle hooks
If a companion exits with WORKER_BOOT_ERROR or APP_LOAD_ERROR, the web tier must not halt. The manager treats it as a normal companion exit.
6. Configuration
Use dict-based config.
preload_app = True
companion_config_file = "/home/frappe/frappe-bench/companion.conf.py"
companion_control_socket = "/run/gunicorn/companion.sock"
companion_workers = [
{
"name": "rq-default",
"target": "frappe_companions:start_rq_default",
"cwd": "/home/frappe/frappe-bench",
"env": {"QUEUE": "default"},
"stop_signal": "SIGTERM",
"stop_timeout": 300,
"reload_timeout": 60,
"stdout": "/var/log/frappe/rq-default.log",
"stderr": "/var/log/frappe/rq-default.error.log",
},
{
"name": "socketio",
"target": "frappe_companions:start_socketio",
"cwd": "/home/frappe/frappe-bench",
"stop_signal": "SIGTERM",
"stop_timeout": 60,
"reload_timeout": 30,
"stdout": "/var/log/frappe/socketio.log",
"stderr": "/var/log/frappe/socketio.error.log",
},
]
Global defaults:
companion_stop_signal = "SIGTERM"
companion_stop_timeout = 60
companion_reload_timeout = 60
companion_stdout = None
companion_stderr = None
companion_cwd = None
companion_env = {}
companion_startsecs = 1
companion_restart_delay = 5
# seconds; used when manager timeout is computed dynamically
companion_manager_shutdown_buffer = 10
companion_manager_stop_timeout = None
companion_manager_reload_timeout = None
companion_control_socket_mode = 0o600
If manager timeouts are unset, compute them dynamically:
manager_stop_timeout = max(companion.stop_timeout) + companion_manager_shutdown_buffer
manager_reload_timeout = max(companion.reload_timeout) + companion_manager_shutdown_buffer
7. Config Fields
Required:
| Field | Meaning |
|---|---|
name |
Unique process name |
target |
Zero-argument callable or import string |
Optional:
| Field | Meaning |
|---|---|
cwd |
Working directory before target |
env |
Extra environment variables |
stop_signal |
Signal used on stop |
stop_timeout |
Max wait during shutdown |
reload_timeout |
Max wait during restart/reread |
stdout |
Stdout log file or inherit |
stderr |
Stderr log file, stdout, or inherit |
startsecs |
Seconds process must survive before RUNNING; makes STARTING meaningful |
Validation must reject unknown keys, duplicate names, invalid signals/timeouts, invalid stdout/stderr values, and targets that are not zero-argument callables/import strings.
Not supported: groups, disable/fatal state, max restart count, exponential backoff, process groups, per-companion user switching, HTTP/TCP health checks, process-specific RQ/socket.io behavior.
8. Public States
Status should mimic supervisorctl status.
STOPPED
STARTING
RUNNING
BACKOFF
STOPPING
| State | Meaning |
|---|---|
STOPPED |
Manually stopped or not started |
STARTING |
Forked, but has not survived startsecs |
RUNNING |
Alive and survived startsecs |
BACKOFF |
Exited unexpectedly; will restart after companion_restart_delay |
STOPPING |
Stop is in progress, from first signal through optional SIGKILL until exit |
No public EXITED, UNKNOWN, or FATAL.
Exit metadata is tracked separately:
last_exit_code
last_exit_signal
last_exited_at
exit_count
9. State Transitions
STOPPED
└─ start
→ STARTING
STARTING
├─ survives startsecs
│ → RUNNING
├─ exits unexpectedly
│ → BACKOFF
└─ stop / restart / removed-by-reread
→ STOPPING
RUNNING
├─ exits unexpectedly
│ → BACKOFF
└─ stop / restart / removed-by-reread
→ STOPPING
BACKOFF
├─ retry timer expires
│ → STARTING
└─ stop
→ STOPPED
STOPPING
├─ process exits
│ → STOPPED
└─ timeout exceeded
→ SIGKILL
→ STOPPED
When waitpid reaps a child, the manager records exit metadata and immediately moves to the next public state.
Early exit during STARTING and unexpected exit after RUNNING both use the same fixed restart delay.
10. Restart Behavior
Configured companions are expected to stay running.
Unexpected exit:
record exit metadata
state = BACKOFF
next_retry_at = now + companion_restart_delay
restart after companion_restart_delay
Default:
companion_restart_delay = 5
There is no exponential backoff, max restart count, disable state, or fatal state.
A configured process restarts forever unless:
- manually stopped
- removed from config by
reread - Gunicorn is stopping/reloading
11. Control Socket
The manager exposes a Unix domain socket:
companion_control_socket = "/run/gunicorn/companion.sock"
Default permissions:
companion_control_socket_mode = 0o600
Gunicorn runs as a non-root user, so the socket is owned by that user and no group ownership switching is supported.
Protocol: newline-delimited JSON.
Commands:
status
reread
start <name>
stop <name>
restart <name>
The manager creates the socket before entering the main loop. During full manager replacement, clients should retry on ENOENT, ECONNREFUSED, or timeout.
12. Command Semantics
status
Request:
{"cmd": "status"}
Human output should mimic supervisorctl status:
rq-default RUNNING pid 1234, uptime 2 days, 03:12:44
rq-long BACKOFF exited with status 1, retrying in 3s
scheduler STOPPED stopped manually
JSON response:
{
"ok": true,
"companions": [
{
"name": "rq-default",
"state": "RUNNING",
"pid": 1234,
"description": "pid 1234, uptime 2 days, 03:12:44"
},
{
"name": "rq-long",
"state": "BACKOFF",
"pid": null,
"description": "exited with status 1, retrying in 3s",
"next_retry_at": 1730000000,
"restart_delay": 5,
"last_exit_code": 1
}
]
}
start <name>
{"cmd": "start", "name": "rq-default"}
Uses latest validated config.
STOPPED -> clear manual_stop, start now
BACKOFF -> cancel pending retry, clear manual_stop, start now
RUNNING -> success: already running
STARTING -> success: already starting
STOPPING -> error: process is stopping; poll status and retry
stop <name>
{"cmd": "stop", "name": "rq-default"}
RUNNING -> send stop_signal, wait stop_timeout, SIGKILL if needed, STOPPED
STARTING -> send stop_signal, wait stop_timeout, SIGKILL if needed, STOPPED
BACKOFF -> cancel pending retry, STOPPED
STOPPED -> success: already stopped
STOPPING -> success: already stopping
stop sets manual_stop = True.
If stopping while STARTING, stop_timeout governs the stop window, not startsecs.
restart <name>
{"cmd": "restart", "name": "rq-default"}
RUNNING -> clear manual_stop, stop using reload_timeout, start
STARTING -> enter STOPPING, stop current child using reload_timeout, start
BACKOFF -> cancel pending retry, clear manual_stop, start immediately
STOPPED -> clear manual_stop, start immediately
STOPPING -> error: process is stopping; poll status and retry
restart does not reread config.
reread
{"cmd": "reread"}
Transactional config reload:
new process -> add and start
removed process -> stop and remove
changed process -> update config; restart unless manual_stop=True
unchanged process -> keep current state
If a manually stopped process changes config:
update stored config
keep STOPPED
next start uses latest config
Success:
{
"ok": true,
"added": ["new-worker"],
"removed": ["old-worker"],
"restarted": ["rq-default"],
"unchanged": ["socketio"]
}
unchanged means no process action was taken. It may include manually stopped companions whose config changed; the new config is accepted and stored, and the next start <name> uses it.
Failure:
{
"ok": false,
"error": "invalid config: duplicate companion name rq-default",
"kept_old_config": true
}
kept_old_config=true means no running process was changed and previous validated config remains active.
13. Reread Diff
Use one stable config hash per companion.
new name -> add/start
missing name -> stop/remove
hash changed -> update config; restart unless manual_stop=True
hash unchanged -> no process action
This intentionally restarts even if only stop_timeout, stdout, or env changes. Simpler and easier to test.
reread flow:
- Read config file.
- Extract companion settings.
- Validate full config.
- Compute one config hash per companion.
- Diff old/new config.
- Apply only if validation succeeds.
Prefer a dedicated config file:
companion_config_file = "/home/frappe/frappe-bench/companion.conf.py"
If unset, the manager may fall back to Gunicorn config file, but must read only companion settings.
14. stdout/stderr, env, cwd
stdout/stderr
"stdout": "/var/log/frappe/rq-default.log",
"stderr": "/var/log/frappe/rq-default.error.log",
Allowed:
None
"inherit"
"stdout" # only for stderr
"/path/to/file.log"
The companion child opens stdout/stderr after fork and before target().
Files are opened in append mode.
Log rotation is external:
copytruncateworks without restartcreate/rename rotation needs companion restart- live fd reopen for already-running companions is out of scope
env/cwd
Before target():
os.chdir(cwd)
os.environ.update(env)
Changing stdout/stderr/env/cwd changes the config hash and causes restart unless manually stopped.
15. File Descriptors
Manager child must close Gunicorn-only fds:
- master signal pipe
- HTTP listener sockets
- worker heartbeat tmp files
Companion children must close manager-only fds before running target.
Companions must not keep Gunicorn HTTP listener sockets open.
16. Parent Death / Orphan Cleanup
Manager exits if Gunicorn master dies.
Linux:
prctl(PR_SET_PDEATHSIG, SIGTERM)
Non-Linux fallback:
manager records parent pid
manager checks os.getppid() every 5 seconds
if os.getppid() returns 1, manager exits
Companion children should also use parent-death signal where available. Without Linux prctl, cleanup after manager death is best-effort because target code takes over.
17. Internal State
Maintain enough state for status:
- name
- state
- pid
- uptime
- restart count
- exit count
- last exit code/signal
- last started/exited time
- next retry time
- stop timeout kills
- manual stop flag
- stdout/stderr path
No Prometheus exporter inside the manager.
18. Implementation Layout
gunicorn/companion/
__init__.py
config.py
process.py
manager.py
control.py
config.py:
- load config
- validate config
- normalize defaults
- compute config hash
process.py:
CompanionConfigCompanionProcess- state model
manager.py:
- run loop
- spawn/reap
- start/stop/restart
- fixed restart delay
- state transitions
- stdout/stderr/env/cwd setup
control.py:
- Unix socket server
- JSON command parser
- JSON response writer
19. Arbiter Changes
Keep Arbiter changes small:
- manager state
- spawn manager
- reap manager
- stop manager
- reload/reread manager
- helper to call control socket if needed
No per-companion logic in Arbiter.
20. Implementation Tasks
- Add companion config settings in
gunicorn/config.py. - Add config validation for
companion_workers. - Add
CompanionConfigand config hash generation. - Add public process states.
- Add
CompanionProcessruntime state. - Add status description helpers.
- Add
CompanionManagerskeleton. - Spawn one companion process from the manager.
- Apply
cwdandenvbefore target. - Redirect
stdoutandstderr. - Reap exited companion processes.
- Implement
STARTING -> RUNNINGusingstartsecs. - Implement
BACKOFFwith fixedcompanion_restart_delay. - Implement
start_process. - Implement
stop_process. - Implement
restart_process. - Preserve and clear
manual_stopcorrectly. - Add Unix control socket.
- Implement JSON command protocol.
- Implement
status. - Implement
start. - Implement
stop. - Implement
restart. - Implement transactional
reread. - Add manager spawn/reap logic in Arbiter.
- Add manager shutdown handling in Arbiter.
- Wire Gunicorn reload to manager
rereador restart. - Close Gunicorn-only fds in manager child.
- Close manager-only fds in companion child.
- Add parent-death cleanup.
- Add lifecycle logs.
- Add tests for config validation.
- Add tests for state transitions.
- Add tests for control commands.
- Add tests for transactional reread.
- Add tests that HTTP worker behavior is unchanged.
21. Test Plan
Test:
- config validation
- config hash diff
- transactional reread
rereadsuccess/failure response- manual stop + reread behavior
start,stop,restarton all public states- control socket commands and permissions
- control socket unavailable retry behavior
- supervisord-like status output
- state transitions
- manager lifecycle from Arbiter
- companion spawn/reap
- fixed 5s restart delay
startsecsbehavior- stdout/stderr redirection
- env and cwd
- fd cleanup
- parent-death cleanup
- HTTP worker behavior unchanged
22. Out of Scope
Not supported:
- groups
- dependency ordering
- process group killing
- disable/fatal state
- max restart count
- exponential backoff
- CLI config for companion specs
- RQ/socket.io/scheduler-specific behavior
- per-companion user switching
- HTTP/TCP/custom health checks
- live log fd reopen for already-running companions
23. Summary
Use a Companion Manager, not direct companion management inside Arbiter.
This gives:
- shared memory through
preload_app=True - small Arbiter changes
- supervisord-like process management and status
- controlled
start,stop,restart,reread,status - transactional config reread
- fixed restart delay
- simple process-running health
- per-companion env/cwd/stdout/stderr
- simple public state machine
- safer shutdown/reload behavior