gunicorn/examples/streaming_chat/README.md

# Streaming Chat Example

A FastAPI-based chat demo that simulates LLM token-by-token streaming, powered
by Gunicorn's dirty workers for efficient long-running operations.

## Overview

This example demonstrates how to build a streaming chat API that:
- Streams tokens word-by-word like ChatGPT (Server-Sent Events)
- Uses dirty workers for the "inference" workload
- Includes a browser-based chat UI for testing
- Requires no ML dependencies (simulated responses)

## Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  Browser/curl   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│  SSE stream     │     │  - /chat (SSE)   │     │  - ChatApp          │
│                 │◄────│  - /chat/sync    │◄────│  - Token generator  │
└─────────────────┘     └──────────────────┘     └─────────────────────┘
                              │
                              ▼
                        text/event-stream
                        data: {"token": "Hello"}
                        data: {"token": " "}
                        data: {"token": "world"}
                        data: [DONE]
```

**Why streaming with dirty workers?**
- Real LLM inference is slow (seconds to minutes)
- Users expect to see responses appear gradually
- Dirty workers keep the "model" loaded between requests
- HTTP workers remain responsive during streaming

## Quick Start

### With Docker (recommended)

```bash
cd examples/streaming_chat
docker compose up --build
```

Then open http://localhost:8000 in your browser.

### Local Development

```bash
# Install dependencies
pip install fastapi pydantic

# Run with gunicorn
gunicorn examples.streaming_chat.main:app \
  -c examples/streaming_chat/gunicorn_conf.py
```

## API Reference

### POST /chat

Stream a chat response using Server-Sent Events.

**Request:**
```json
{
  "prompt": "hello",
  "thinking": false
}
```

**Response:** `text/event-stream`
```
data: {"token": "Hello"}

data: {"token": "!"}

data: {"token": " "}

data: {"token": "I'm"}

...

data: [DONE]
```

**Example with curl:**
```bash
curl -N http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"prompt": "hello"}'
```

### POST /chat/sync

Non-streaming version that returns the complete response.

**Request:**
```json
{
  "prompt": "hello"
}
```

**Response:**
```json
{
  "response": "Hello! I'm a simulated AI assistant..."
}
```

### GET /health

Health check endpoint.

**Response:**
```json
{"status": "ok"}
```

### GET /

Browser-based chat UI for testing.

## Configuration

Edit `gunicorn_conf.py` to adjust:

| Setting | Default | Description |
|---------|---------|-------------|
| `workers` | 2 | Number of HTTP workers |
| `dirty_workers` | 1 | Number of dirty workers |
| `dirty_timeout` | 60 | Max seconds per request |
| `bind` | 0.0.0.0:8000 | Listen address |

## Prompts

The simulated chat app responds to these keywords:

| Keyword | Response |
|---------|----------|
| `hello`, `hi`, `hey` | Greeting message |
| `explain` | Explanation of dirty workers |
| `streaming` | How streaming works |
| `code` | Example code snippet |
| (default) | Generic thoughtful response |

## Features Demonstrated

1. **Token streaming** - Word-by-word output via generators
2. **SSE protocol** - Browser-compatible event streaming
3. **Async generators** - Using `stream_async()` from dirty client
4. **Thinking mode** - Multi-phase streaming with visible "thinking"
5. **Browser UI** - Interactive chat with cursor animation

## Testing

Run the integration tests:

```bash
# Start the service first
docker compose up -d

# Run tests
pip install requests
python test_streaming.py
```

## Adapting for Real LLMs

To use a real LLM instead of simulated responses:

```python
# chat_app.py
from gunicorn.dirty.app import DirtyApp

class ChatApp(DirtyApp):
    def init(self):
        from transformers import pipeline
        self.generator = pipeline("text-generation", model="gpt2")

    def generate(self, prompt):
        for output in self.generator(prompt, max_new_tokens=100, do_sample=True):
            # Yield tokens as they're generated
            yield output["generated_text"]

    def close(self):
        del self.generator
```

Or with an API-based LLM:

```python
class ChatApp(DirtyApp):
    def init(self):
        import openai
        self.client = openai.OpenAI()

    async def generate(self, prompt):
        stream = self.client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True
        )
        for chunk in stream:
            if chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
```

## Production Considerations

1. **Real LLM**: Replace `ChatApp` with actual model inference
2. **GPU Support**: Add CUDA to Dockerfile for faster inference
3. **Rate Limiting**: Add FastAPI middleware for rate limiting
4. **Authentication**: Add API key validation
5. **Monitoring**: Add Prometheus metrics endpoint
6. **Timeouts**: Adjust `dirty_timeout` based on max response length