Streaming Chat Example
A FastAPI-based chat demo that simulates LLM token-by-token streaming, powered by Gunicorn's dirty workers for efficient long-running operations.
Overview
This example demonstrates how to build a streaming chat API that:
- Streams tokens word-by-word like ChatGPT (Server-Sent Events)
- Uses dirty workers for the "inference" workload
- Includes a browser-based chat UI for testing
- Requires no ML dependencies (simulated responses)
Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Browser/curl   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│  SSE stream     │     │  - /chat (SSE)   │     │  - ChatApp          │
│                 │◄────│  - /chat/sync    │◄────│  - Token generator  │
└─────────────────┘     └──────────────────┘     └─────────────────────┘
         │
         ▼
  text/event-stream
  data: {"token": "Hello"}
  data: {"token": " "}
  data: {"token": "world"}
  data: [DONE]
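The wire format in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the example's actual code: `sse_events` is a hypothetical helper name, and the only details taken from the diagram are the `data: {"token": ...}` payload shape and the `[DONE]` terminator. The blank line after each `data:` line is how the Server-Sent Events format separates one event from the next.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Frame each token as one SSE event, then emit the [DONE] sentinel.

    Hypothetical helper: the name and signature are assumptions made for
    illustration; only the payload shape comes from the diagram.
    """
    for token in tokens:
        # json.dumps escapes quotes and newlines, so a token can never
        # spill out of its single "data:" line.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Sentinel that tells the client the stream is complete.
    yield "data: [DONE]\n\n"
```

Because the helper is a generator, each event can be written to the response as soon as its token exists, rather than after the whole reply has been built.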
Why streaming with dirty workers?
- Real LLM inference is slow (seconds to minutes)
- Users expect to see responses appear gradually
- Dirty workers keep the "model" loaded between requests
- HTTP workers remain responsive during streaming
Quick Start
With Docker (recommended)
cd examples/streaming_chat
docker compose up --build
Then open http://localhost:8000 in your browser.
Local Development
# Install dependencies
pip install fastapi pydantic
# Run with gunicorn
gunicorn examples.streaming_chat.main:app \
  -c examples/streaming_chat/gunicorn_conf.py
API Reference
POST /chat
Stream a chat response using Server-Sent Events.
Request:
{
  "prompt": "hello",
  "thinking": false
}
Response: text/event-stream
data: {"token": "Hello"}
data: {"token": "!"}
data: {"token": " "}
data: {"token": "I'm"}
...
data: [DONE]
Example with curl:
curl -N http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "hello"}'
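The same request can be made from Python. The sketch below uses only the standard library; `iter_tokens` and `chat` are hypothetical names chosen for illustration, and the parser assumes every event is a single `data:` line, as in the sample response above.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield token strings from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)["token"]


def chat(prompt: str, url: str = "http://localhost:8000/chat") -> str:
    """POST a prompt and return the joined reply (needs the server running)."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        # The response object iterates over the body line by line, as bytes.
        lines = (line.decode("utf-8") for line in response)
        return "".join(iter_tokens(lines))
```

To display tokens as they arrive instead of joining them, loop over `iter_tokens(lines)` inside the `with` block and print each token with `end=""` and `flush=True`.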
POST /chat/sync
Non-streaming version that returns the complete response.
Request:
{
  "prompt": "hello"
}
Response:
{
  "response": "Hello! I'm a simulated AI assistant..."
}
GET /health
Health check endpoint.
Response:
{"status": "ok"}
GET /
Browser-based chat UI for testing.
Configuration
Edit gunicorn_conf.py to adjust:
| Setting | Default | Description |
|---|---|---|
| workers | 2 | Number of HTTP workers |
| dirty_workers | 1 | Number of dirty workers |
| dirty_timeout | 60 | Max seconds per request |
| bind | 0.0.0.0:8000 | Listen address |
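As a rough sketch, the four settings in the table would look like this in a Gunicorn config file. Only the names and defaults listed above are shown; the real gunicorn_conf.py presumably contains further settings (for example, which dirty app to load) that are not reproduced here.

```python
# gunicorn_conf.py (sketch limited to the four settings listed in the table)

bind = "0.0.0.0:8000"  # listen address
workers = 2            # number of HTTP workers
dirty_workers = 1      # number of dirty workers running the ChatApp
dirty_timeout = 60     # maximum seconds a single dirty request may run
```

Gunicorn config files are plain Python modules, so values can also be computed, for instance by reading them from environment variables.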
Prompts
The simulated chat app responds to these keywords:
| Keyword | Response |
|---|---|
| hello, hi, hey | Greeting message |
| explain | Explanation of dirty workers |
| streaming | How streaming works |
| code | Example code snippet |
| (default) | Generic thoughtful response |
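One way such keyword routing could work is sketched below. This is an assumption about the approach, not the example's actual code: the names are hypothetical, the labels stand in for the canned response text, and matching on whole words (so that "this" does not trigger "hi") and checking groups in table order are both design choices of the sketch.

```python
import re

# Keyword groups from the table, checked in order. The labels are
# placeholders standing in for the canned response text.
KEYWORD_RESPONSES = [
    (("hello", "hi", "hey"), "greeting"),
    (("explain",), "dirty_workers_explanation"),
    (("streaming",), "streaming_explanation"),
    (("code",), "code_snippet"),
]
DEFAULT_RESPONSE = "generic"


def pick_response(prompt: str) -> str:
    """Return the label of the first keyword group found in the prompt."""
    # Compare whole lowercase words so substrings do not match by accident.
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    for keywords, label in KEYWORD_RESPONSES:
        if words.intersection(keywords):
            return label
    return DEFAULT_RESPONSE
```

With this ordering, a prompt that contains keywords from several groups gets the response of the earliest matching row.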
Features Demonstrated
- Token streaming - Word-by-word output via generators
- SSE protocol - Browser-compatible event streaming
- Async generators - Using stream_async() from the dirty client
- Thinking mode - Multi-phase streaming with visible "thinking"
- Browser UI - Interactive chat with cursor animation
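Token streaming with generators can be illustrated without any server. The sample stream in the API reference shows words, punctuation, and spaces arriving as separate tokens; the sketch below reproduces that split with a regular expression and emits the tokens from an async generator with a small delay. The function names, the regular expression, and the 0.05 second default delay are assumptions made for illustration, not the example's actual implementation.

```python
import asyncio
import re
from typing import AsyncIterator, List

# Words (with inner apostrophes), runs of whitespace, or a single
# punctuation character. Every character of the input lands in some token.
TOKEN_PATTERN = re.compile(r"\w[\w']*|\s+|[^\w\s]")


def split_tokens(text: str) -> List[str]:
    """Split text into tokens that join back to the original string."""
    return TOKEN_PATTERN.findall(text)


async def stream_tokens(text: str, delay: float = 0.05) -> AsyncIterator[str]:
    """Yield tokens one at a time, pausing to simulate inference latency."""
    for token in split_tokens(text):
        await asyncio.sleep(delay)
        yield token


async def collect(text: str) -> str:
    """Consume the stream and rebuild the full response."""
    parts = []
    async for token in stream_tokens(text, delay=0.0):
        parts.append(token)
    return "".join(parts)
```

Keeping whitespace as its own token means the client can append tokens verbatim and never has to guess where spaces belong.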
Testing
Run the integration tests:
# Start the service first
docker compose up -d
# Run tests
pip install requests
python test_streaming.py
Adapting for Real LLMs
To use a real LLM instead of simulated responses:
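# chat_app.py
from threading import Thread

from gunicorn.dirty.app import DirtyApp


class ChatApp(DirtyApp):
    def init(self):
        # Load the model once per dirty worker so it stays in memory between requests
        from transformers import pipeline
        self.generator = pipeline("text-generation", model="gpt2")

    def generate(self, prompt):
        from transformers import TextIteratorStreamer
        # A plain pipeline call returns only the finished text, so decode
        # incrementally with a streamer while generation runs in a thread
        streamer = TextIteratorStreamer(self.generator.tokenizer, skip_prompt=True)
        kwargs = {"max_new_tokens": 100, "do_sample": True, "streamer": streamer}
        Thread(target=self.generator, args=(prompt,), kwargs=kwargs).start()
        # Yield text pieces as they are generated
        yield from streamer

    def close(self):
        del self.generator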
Or with an API-based LLM:
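from gunicorn.dirty.app import DirtyApp


class ChatApp(DirtyApp):
    def init(self):
        import openai
        # Use the async client so the async generator below does not block
        self.client = openai.AsyncOpenAI()

    async def generate(self, prompt):
        stream = await self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        async for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip them
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content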
Production Considerations
- Real LLM: Replace ChatApp with actual model inference
- GPU Support: Add CUDA to Dockerfile for faster inference
- Rate Limiting: Add FastAPI middleware for rate limiting
- Authentication: Add API key validation
- Monitoring: Add Prometheus metrics endpoint
- Timeouts: Adjust dirty_timeout based on max response length
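For the timeout item, a back-of-the-envelope estimate is usually enough: multiply the longest response you allow by the expected time per token, add any startup cost, and pad the result. The helper below is a hypothetical sketch of that arithmetic; the function name, its parameters, and the 1.5 safety factor are assumptions, and the per-token latency has to be measured on your own model and hardware.

```python
import math


def estimate_dirty_timeout(
    max_tokens: int,
    seconds_per_token: float,
    startup_seconds: float = 0.0,
    safety_factor: float = 1.5,
) -> int:
    """Worst-case streaming time, padded and rounded up to whole seconds."""
    worst_case = startup_seconds + max_tokens * seconds_per_token
    return math.ceil(worst_case * safety_factor)


# 100 tokens at 0.05 s each is 5 s; padded by 1.5 and rounded up, that is 8 s.
short_reply_timeout = estimate_dirty_timeout(100, 0.05)
```

The result is in seconds, the same unit as dirty_timeout in the configuration table.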