feat: Add plan for companion process manager

This commit is contained in:
Tanmoy Sarkar 2026-06-09 14:25:20 +05:30
parent 54b59ca884
commit 6fd0c2b236

View File

@ -0,0 +1,764 @@
Status: proposal / draft
Author: Tanmoy Sarkar
Scope: `gunicorn/arbiter.py`, `gunicorn/config.py`, `gunicorn/companion/`
## 1. Problem
A Frappe deployment is not only HTTP workers.
Alongside Gunicorn, we usually run persistent non-HTTP processes:
- RQ worker pools
- scheduler
- socket.io / websocket server
- custom background daemons
Today these are usually managed separately through supervisor/systemd.
That causes:
- repeated app memory usage
- separate lifecycle for web and side processes
- reload drift between HTTP workers and background processes
- inconsistent shutdown behavior
- harder production process control
With `preload_app=True`, Gunicorn workers already share preloaded app memory using copy-on-write. The goal is to give non-HTTP processes the same lifecycle and memory-sharing benefit without making them HTTP workers.
## 2. Goal
Gunicorn manages one extra child process: the **Companion Manager**.
The Companion Manager manages all configured companion processes.
```text
gunicorn master
├── HTTP worker
├── HTTP worker
└── companion manager
├── rq-default
├── rq-long
├── scheduler
└── socketio
```
Core rule:
```text
Gunicorn Arbiter manages one Companion Manager.
Companion Manager manages companion processes.
Each companion process manages its own internals.
```
## 3. Architecture
```text
gunicorn master
preload_app=True
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
HTTP worker HTTP worker companion manager
serves HTTP serves HTTP manages companions
┌──────────────────────┼──────────────────────┐
│ │ │
▼ ▼ ▼
rq-default scheduler socketio
```
Memory sharing still works:
```text
gunicorn master preloads app
└── forks companion manager
└── forks rq / scheduler / socketio
```
The manager is forked from the preloaded master. Companion processes are forked from the manager, so they can inherit preloaded application memory.
## 4. Responsibility Boundary
### Gunicorn Arbiter
The Arbiter should:
- start the Companion Manager
- restart it if it crashes
- stop it during Gunicorn shutdown
- ask it to `reread` config when needed
- avoid per-companion process logic
### Companion Manager
The manager should:
- load and validate companion config
- spawn/reap companions
- stop/start/restart companions
- restart unexpected exits after a fixed delay
- track state and expose `status`
- expose a Unix control socket
- redirect stdout/stderr
- apply env and cwd
- log lifecycle events
### Companion Process
A companion runs the actual service, such as RQ, scheduler, socket.io, or a custom daemon.
The companion process owns its own internals:
- signal handling
- job draining
- child workers
- sockets
- event loops
## 5. Companion Is Not an HTTP Worker
A companion must not:
- serve Gunicorn HTTP traffic
- use Gunicorn listener sockets
- use Gunicorn worker heartbeat files
- trigger HTTP worker boot-error halt behavior
- call HTTP worker lifecycle hooks
If a companion exits with `WORKER_BOOT_ERROR` or `APP_LOAD_ERROR`, the web tier must not halt. The manager treats it as a normal companion exit.
## 6. Configuration
Use dict-based config.
```python
preload_app = True
companion_config_file = "/home/frappe/frappe-bench/companion.conf.py"
companion_control_socket = "/run/gunicorn/companion.sock"
companion_workers = [
{
"name": "rq-default",
"target": "frappe_companions:start_rq_default",
"cwd": "/home/frappe/frappe-bench",
"env": {"QUEUE": "default"},
"stop_signal": "SIGTERM",
"stop_timeout": 300,
"reload_timeout": 60,
"stdout": "/var/log/frappe/rq-default.log",
"stderr": "/var/log/frappe/rq-default.error.log",
},
{
"name": "socketio",
"target": "frappe_companions:start_socketio",
"cwd": "/home/frappe/frappe-bench",
"stop_signal": "SIGTERM",
"stop_timeout": 60,
"reload_timeout": 30,
"stdout": "/var/log/frappe/socketio.log",
"stderr": "/var/log/frappe/socketio.error.log",
},
]
```
Global defaults:
```python
companion_stop_signal = "SIGTERM"
companion_stop_timeout = 60
companion_reload_timeout = 60
companion_stdout = None
companion_stderr = None
companion_cwd = None
companion_env = {}
companion_startsecs = 1
companion_restart_delay = 5
# seconds; used when manager timeout is computed dynamically
companion_manager_shutdown_buffer = 10
companion_manager_stop_timeout = None
companion_manager_reload_timeout = None
companion_control_socket_mode = 0o600
companion_control_socket_group = None
```
If manager timeouts are unset, compute them dynamically:
```text
manager_stop_timeout = max(companion.stop_timeout) + companion_manager_shutdown_buffer
manager_reload_timeout = max(companion.reload_timeout) + companion_manager_shutdown_buffer
```
## 7. Config Fields
Required:
| Field | Meaning |
| -------- | --------------------------------------- |
| `name` | Unique process name |
| `target` | Zero-argument callable or import string |
Optional:
| Field | Meaning |
| ---------------- | -------------------------------------------------------------------------- |
| `cwd` | Working directory before target |
| `env` | Extra environment variables |
| `stop_signal` | Signal used on stop |
| `stop_timeout` | Max wait during shutdown |
| `reload_timeout` | Max wait during restart/reread |
| `stdout` | Stdout log file or inherit |
| `stderr` | Stderr log file, `stdout`, or inherit |
| `startsecs` | Seconds process must survive before `RUNNING`; makes `STARTING` meaningful |
Validation must reject unknown keys, duplicate names, invalid signals/timeouts, invalid stdout/stderr values, and targets that are not zero-argument callables/import strings.
Not supported: groups, disable/fatal state, max restart count, exponential backoff, process groups, per-companion user switching, HTTP/TCP health checks, process-specific RQ/socket.io behavior.
## 8. Public States
Status should mimic `supervisorctl status`.
```text
STOPPED
STARTING
RUNNING
BACKOFF
STOPPING
```
| State | Meaning |
| ---------- | ---------------------------------------------------------------------------- |
| `STOPPED` | Manually stopped or not started |
| `STARTING` | Forked, but has not survived `startsecs` |
| `RUNNING` | Alive and survived `startsecs` |
| `BACKOFF` | Exited unexpectedly; will restart after `companion_restart_delay` |
| `STOPPING` | Stop is in progress, from first signal through optional `SIGKILL` until exit |
No public `EXITED`, `UNKNOWN`, or `FATAL`.
Exit metadata is tracked separately:
```text
last_exit_code
last_exit_signal
last_exited_at
exit_count
```
## 9. State Transitions
```text
STOPPED
└─ start
→ STARTING
STARTING
├─ survives startsecs
│ → RUNNING
├─ exits unexpectedly
│ → BACKOFF
└─ stop / restart / removed-by-reread
→ STOPPING
RUNNING
├─ exits unexpectedly
│ → BACKOFF
└─ stop / restart / removed-by-reread
→ STOPPING
BACKOFF
├─ retry timer expires
│ → STARTING
└─ stop
→ STOPPED
STOPPING
├─ process exits
│ → STOPPED
└─ timeout exceeded
→ SIGKILL
→ STOPPED
```
When `waitpid` reaps a child, the manager records exit metadata and immediately moves to the next public state.
Early exit during `STARTING` and unexpected exit after `RUNNING` both use the same fixed restart delay.
## 10. Restart Behavior
Configured companions are expected to stay running.
Unexpected exit:
```text
record exit metadata
state = BACKOFF
next_retry_at = now + companion_restart_delay
restart after companion_restart_delay
```
Default:
```python
companion_restart_delay = 5
```
There is no exponential backoff, max restart count, disable state, or fatal state.
A configured process restarts forever unless:
- manually stopped
- removed from config by `reread`
- Gunicorn is stopping/reloading
## 11. Control Socket
The manager exposes a Unix domain socket:
```python
companion_control_socket = "/run/gunicorn/companion.sock"
```
Default permissions:
```python
companion_control_socket_mode = 0o600
```
Optional group access:
```python
companion_control_socket_mode = 0o660
companion_control_socket_group = "frappe-ops"
```
Protocol: newline-delimited JSON.
Commands:
```text
status
reread
start <name>
stop <name>
restart <name>
```
The manager creates the socket before entering the main loop. During full manager replacement, clients should retry on `ENOENT`, `ECONNREFUSED`, or timeout.
## 12. Command Semantics
### `status`
Request:
```json
{"cmd": "status"}
```
Human output should mimic `supervisorctl status`:
```text
rq-default RUNNING pid 1234, uptime 2 days, 03:12:44
rq-long BACKOFF exited with status 1, retrying in 3s
scheduler STOPPED stopped manually
```
JSON response:
```json
{
"ok": true,
"companions": [
{
"name": "rq-default",
"state": "RUNNING",
"pid": 1234,
"description": "pid 1234, uptime 2 days, 03:12:44"
},
{
"name": "rq-long",
"state": "BACKOFF",
"pid": null,
"description": "exited with status 1, retrying in 3s",
"next_retry_at": 1730000000,
"restart_delay": 5,
"last_exit_code": 1
}
]
}
```
### `start <name>`
```json
{"cmd": "start", "name": "rq-default"}
```
Uses latest validated config.
```text
STOPPED -> clear manual_stop, start now
BACKOFF -> cancel pending retry, clear manual_stop, start now
RUNNING -> success: already running
STARTING -> success: already starting
STOPPING -> error: process is stopping; poll status and retry
```
### `stop <name>`
```json
{"cmd": "stop", "name": "rq-default"}
```
```text
RUNNING -> send stop_signal, wait stop_timeout, SIGKILL if needed, STOPPED
STARTING -> send stop_signal, wait stop_timeout, SIGKILL if needed, STOPPED
BACKOFF -> cancel pending retry, STOPPED
STOPPED -> success: already stopped
STOPPING -> success: already stopping
```
`stop` sets `manual_stop = True`.
If stopping while `STARTING`, `stop_timeout` governs the stop window, not `startsecs`.
### `restart <name>`
```json
{"cmd": "restart", "name": "rq-default"}
```
```text
RUNNING -> clear manual_stop, stop using reload_timeout, start
STARTING -> enter STOPPING, stop current child using reload_timeout, start
BACKOFF -> cancel pending retry, clear manual_stop, start immediately
STOPPED -> clear manual_stop, start immediately
STOPPING -> error: process is stopping; poll status and retry
```
`restart` does not reread config.
### `reread`
```json
{"cmd": "reread"}
```
Transactional config reload:
```text
new process -> add and start
removed process -> stop and remove
changed process -> update config; restart unless manual_stop=True
unchanged process -> keep current state
```
If a manually stopped process changes config:
```text
update stored config
keep STOPPED
next start uses latest config
```
Success:
```json
{
"ok": true,
"added": ["new-worker"],
"removed": ["old-worker"],
"restarted": ["rq-default"],
"unchanged": ["socketio"]
}
```
`unchanged` means no process action was taken. It may include manually stopped companions whose config changed; the new config is accepted and stored, and the next `start <name>` uses it.
Failure:
```json
{
"ok": false,
"error": "invalid config: duplicate companion name rq-default",
"kept_old_config": true
}
```
`kept_old_config=true` means no running process was changed and previous validated config remains active.
## 13. Reread Diff
Use one stable config hash per companion.
```text
new name -> add/start
missing name -> stop/remove
hash changed -> update config; restart unless manual_stop=True
hash unchanged -> no process action
```
This intentionally restarts even if only `stop_timeout`, `stdout`, or `env` changes. Simpler and easier to test.
`reread` flow:
1. Read config file.
2. Extract companion settings.
3. Validate full config.
4. Compute one config hash per companion.
5. Diff old/new config.
6. Apply only if validation succeeds.
Prefer a dedicated config file:
```python
companion_config_file = "/home/frappe/frappe-bench/companion.conf.py"
```
If unset, the manager may fall back to Gunicorn config file, but must read only companion settings.
## 14. stdout/stderr, env, cwd
### stdout/stderr
```python
"stdout": "/var/log/frappe/rq-default.log",
"stderr": "/var/log/frappe/rq-default.error.log",
```
Allowed:
```python
None
"inherit"
"stdout" # only for stderr
"/path/to/file.log"
```
The companion child opens stdout/stderr after fork and before `target()`.
Files are opened in append mode.
Log rotation is external:
- `copytruncate` works without restart
- `create`/rename rotation needs companion restart
- live fd reopen for already-running companions is out of scope
### env/cwd
Before `target()`:
```python
os.chdir(cwd)
os.environ.update(env)
```
Changing stdout/stderr/env/cwd changes the config hash and causes restart unless manually stopped.
## 15. File Descriptors
Manager child must close Gunicorn-only fds:
- master signal pipe
- HTTP listener sockets
- worker heartbeat tmp files
Companion children must close manager-only fds before running target.
Companions must not keep Gunicorn HTTP listener sockets open.
## 16. Parent Death / Orphan Cleanup
Manager exits if Gunicorn master dies.
Linux:
```python
prctl(PR_SET_PDEATHSIG, SIGTERM)
```
Non-Linux fallback:
```text
manager records parent pid
manager checks os.getppid() every 5 seconds
if os.getppid() returns 1, manager exits
```
Companion children should also use parent-death signal where available. Without Linux `prctl`, cleanup after manager death is best-effort because target code takes over.
## 17. Internal State
Maintain enough state for `status`:
- name
- state
- pid
- uptime
- restart count
- exit count
- last exit code/signal
- last started/exited time
- next retry time
- stop timeout kills
- manual stop flag
- stdout/stderr path
No Prometheus exporter inside the manager.
## 18. Implementation Layout
```text
gunicorn/companion/
__init__.py
config.py
process.py
manager.py
control.py
```
`config.py`:
- load config
- validate config
- normalize defaults
- compute config hash
`process.py`:
- `CompanionConfig`
- `CompanionProcess`
- state model
`manager.py`:
- run loop
- spawn/reap
- start/stop/restart
- fixed restart delay
- state transitions
- stdout/stderr/env/cwd setup
`control.py`:
- Unix socket server
- JSON command parser
- JSON response writer
## 19. Arbiter Changes
Keep Arbiter changes small:
- manager state
- spawn manager
- reap manager
- stop manager
- reload/reread manager
- helper to call control socket if needed
No per-companion logic in Arbiter.
## 20. Implementation Tasks
- [ ] Add companion config settings in `gunicorn/config.py`.
- [ ] Add config validation for `companion_workers`.
- [ ] Add `CompanionConfig` and config hash generation.
- [ ] Add public process states.
- [ ] Add `CompanionProcess` runtime state.
- [ ] Add status description helpers.
- [ ] Add `CompanionManager` skeleton.
- [ ] Spawn one companion process from the manager.
- [ ] Apply `cwd` and `env` before target.
- [ ] Redirect `stdout` and `stderr`.
- [ ] Reap exited companion processes.
- [ ] Implement `STARTING -> RUNNING` using `startsecs`.
- [ ] Implement `BACKOFF` with fixed `companion_restart_delay`.
- [ ] Implement `start_process`.
- [ ] Implement `stop_process`.
- [ ] Implement `restart_process`.
- [ ] Preserve and clear `manual_stop` correctly.
- [ ] Add Unix control socket.
- [ ] Implement JSON command protocol.
- [ ] Implement `status`.
- [ ] Implement `start`.
- [ ] Implement `stop`.
- [ ] Implement `restart`.
- [ ] Implement transactional `reread`.
- [ ] Add manager spawn/reap logic in Arbiter.
- [ ] Add manager shutdown handling in Arbiter.
- [ ] Wire Gunicorn reload to manager `reread` or restart.
- [ ] Close Gunicorn-only fds in manager child.
- [ ] Close manager-only fds in companion child.
- [ ] Add parent-death cleanup.
- [ ] Add lifecycle logs.
- [ ] Add tests for config validation.
- [ ] Add tests for state transitions.
- [ ] Add tests for control commands.
- [ ] Add tests for transactional reread.
- [ ] Add tests that HTTP worker behavior is unchanged.
## 21. Test Plan
Test:
- config validation
- config hash diff
- transactional reread
- `reread` success/failure response
- manual stop + reread behavior
- `start`, `stop`, `restart` on all public states
- control socket commands and permissions
- control socket unavailable retry behavior
- supervisord-like status output
- state transitions
- manager lifecycle from Arbiter
- companion spawn/reap
- fixed 5s restart delay
- `startsecs` behavior
- stdout/stderr redirection
- env and cwd
- fd cleanup
- parent-death cleanup
- HTTP worker behavior unchanged
## 22. Out of Scope
Not supported:
- groups
- dependency ordering
- process group killing
- disable/fatal state
- max restart count
- exponential backoff
- CLI config for companion specs
- RQ/socket.io/scheduler-specific behavior
- per-companion user switching
- HTTP/TCP/custom health checks
- live log fd reopen for already-running companions
## 23. Summary
Use a Companion Manager, not direct companion management inside Arbiter.
This gives:
- shared memory through `preload_app=True`
- small Arbiter changes
- supervisord-like process management and status
- controlled `start`, `stop`, `restart`, `reread`, `status`
- transactional config reread
- fixed restart delay
- simple process-running health
- per-companion env/cwd/stdout/stderr
- simple public state machine
- safer shutdown/reload behavior