┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Browser/curl   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│  SSE stream     │     │  - /chat (SSE)   │     │  - ChatApp          │
│                 │◄────│  - /chat/sync    │◄────│  - Token generator  │
└─────────────────┘     └──────────────────┘     └─────────────────────┘
                              │
                              ▼
                        text/event-stream
                        data: {"token": "Hello"}
                        data: {"token": " "}
                        data: {"token": "world"}
                        data: [DONE]

Why streaming with dirty workers?

Real LLM inference is slow (seconds to minutes)
Users expect to see responses appear gradually
Dirty workers keep the "model" loaded between requests
HTTP workers remain responsive during streaming

Quick Start

With Docker (recommended)

cd examples/streaming_chat
docker compose up --build

Then open http://localhost:8000 in your browser.

Local Development

# Install dependencies
pip install fastapi pydantic

# Run with gunicorn
gunicorn examples.streaming_chat.main:app \
  -c examples/streaming_chat/gunicorn_conf.py

API Reference

POST /chat

Stream a chat response using Server-Sent Events.

Request:

{
  "prompt": "hello",
  "thinking": false
}

Response: text/event-stream

data: {"token": "Hello"}

data: {"token": "!"}

data: {"token": " "}

data: {"token": "I'm"}

...

data: [DONE]

Example with curl:

curl -N http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "hello"}'

POST /chat/sync

Non-streaming version that returns the complete response.

Request:

{
  "prompt": "hello"
}

Response:

{
  "response": "Hello! I'm a simulated AI assistant..."
}

GET /health

Health check endpoint.

Response:

{"status": "ok"}

GET /

Browser-based chat UI for testing.

Configuration

Edit gunicorn_conf.py to adjust:

Setting	Default	Description
`workers`	2	Number of HTTP workers
`dirty_workers`	1	Number of dirty workers
`dirty_timeout`	60	Max seconds per request
`bind`	0.0.0.0:8000	Listen address

Prompts

The simulated chat app responds to these keywords:

Keyword	Response
`hello`, `hi`, `hey`	Greeting message
`explain`	Explanation of dirty workers
`streaming`	How streaming works
`code`	Example code snippet
(default)	Generic thoughtful response

Features Demonstrated

Token streaming - Word-by-word output via generators
SSE protocol - Browser-compatible event streaming
Async generators - Using stream_async() from dirty client
Thinking mode - Multi-phase streaming with visible "thinking"
Browser UI - Interactive chat with cursor animation

Testing

Run the integration tests:

# Start the service first
docker compose up -d

# Run tests
pip install requests
python test_streaming.py

Adapting for Real LLMs

To use a real LLM instead of simulated responses:

# chat_app.py
from gunicorn.dirty.app import DirtyApp

class ChatApp(DirtyApp):
    def init(self):
        from transformers import pipeline
        self.generator = pipeline("text-generation", model="gpt2")

    def generate(self, prompt):
        for output in self.generator(prompt, max_new_tokens=100, do_sample=True):
            # Yield tokens as they're generated
            yield output["generated_text"]

    def close(self):
        del self.generator

Or with an API-based LLM:

class ChatApp(DirtyApp):
    def init(self):
        import openai
        self.client = openai.OpenAI()

    async def generate(self, prompt):
        stream = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

Production Considerations

Real LLM: Replace ChatApp with actual model inference
GPU Support: Add CUDA to Dockerfile for faster inference
Rate Limiting: Add FastAPI middleware for rate limiting
Authentication: Add API key validation
Monitoring: Add Prometheus metrics endpoint
Timeouts: Adjust dirty_timeout based on max response length