# Embedding Service Example

A FastAPI-based text embedding service built on sentence-transformers.
Gunicorn's dirty workers load the model once and keep it in memory, so the
HTTP workers never pay the model-loading cost.

## Overview

This example demonstrates how to build a production-ready embedding API that:
- Keeps ML models loaded in memory across requests (dirty workers)
- Handles HTTP efficiently with async FastAPI (ASGI workers)
- Provides batch embedding for multiple texts
- Includes Docker-based deployment and testing

## Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  HTTP Clients   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│                 │     │  - /embed        │     │  - sentence-        │
│                 │◄────│  - /health       │◄────│    transformers     │
└─────────────────┘     └──────────────────┘     │  - Model in memory  │
                                                  └─────────────────────┘
```

**Why dirty workers?**
- ML models are expensive to load (several seconds)
- Dirty workers load the model once at startup
- HTTP workers remain lightweight and responsive
- Model stays in memory, serving many requests
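
The load-once idea behind these points can be sketched in plain Python. This is an illustration only, not gunicorn's dirty worker API: `EmbeddingWorker`, `init`, `embed`, and `slow_fake_loader` are hypothetical names, and the "model" is a toy function standing in for sentence-transformers.

```python
import time


class EmbeddingWorker:
    """Stand-in for a long-lived worker process that owns the model."""

    def __init__(self, loader):
        self._loader = loader      # callable that performs the expensive load
        self.model = None
        self.load_count = 0

    def init(self):
        # Called once when the worker starts, before any request arrives.
        self.model = self._loader()
        self.load_count += 1

    def embed(self, texts):
        # Every request reuses the model that is already in memory.
        return [self.model(text) for text in texts]


def slow_fake_loader():
    # Hypothetical loader: sleeps briefly to imitate an expensive model load,
    # then returns a toy "model" that maps a text to a one-element vector.
    time.sleep(0.05)
    return lambda text: [float(len(text))]


worker = EmbeddingWorker(slow_fake_loader)
worker.init()
for _ in range(3):
    worker.embed(["Hello world"])
print(worker.load_count)  # 1: three requests, one load
```

The loading cost is paid in `init`, so request latency depends only on inference.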

## Quick Start

### With Docker (recommended)

```bash
cd examples/embedding_service
docker compose up --build
```

### Local Development

```bash
# Install dependencies
pip install sentence-transformers fastapi pydantic

# Run with gunicorn (from the repository root, so the module path resolves)
gunicorn examples.embedding_service.main:app \
  -c examples/embedding_service/gunicorn_conf.py
```

## API Reference

### POST /embed

Generate embeddings for a list of texts.

**Request:**
```json
{
  "texts": ["Hello world", "Another sentence"]
}
```

**Response:**
```json
{
  "embeddings": [
    [0.123, -0.456, ...],
    [0.789, -0.012, ...]
  ]
}
```

**Example:**
```bash
curl -X POST http://localhost:8000/embed \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world"]}'
```
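
The same call can be made from Python with only the standard library. This is a minimal client sketch: the helper names `build_embed_request` and `embed` are made up for illustration, and the base URL assumes the default bind address on localhost.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8000"):
    """Build the POST /embed request without sending it."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url="http://localhost:8000", timeout=60):
    """Send the request and return the list of embedding vectors."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embeddings"]
```

With the service running, `embed(["Hello world"])` should return a list containing one vector per input text.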

### GET /health

Health check endpoint.

**Response:**
```json
{"status": "ok"}
```

## Configuration

Edit `gunicorn_conf.py` to adjust:

| Setting | Default | Description |
|---------|---------|-------------|
| `workers` | 2 | Number of HTTP workers |
| `dirty_workers` | 1 | Number of ML model workers |
| `dirty_timeout` | 60 | Max seconds per inference |
| `bind` | 0.0.0.0:8000 | Listen address |
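
As a rough sketch, the four settings in the table would look like this in a gunicorn config file. Only the names and defaults listed above are shown. The example's actual `gunicorn_conf.py` may contain further settings (for instance the worker class or the registration of the embedding app), which are deliberately left out here rather than guessed.

```python
# gunicorn_conf.py (sketch): only the settings listed in the table above.
bind = "0.0.0.0:8000"   # listen address
workers = 2             # HTTP workers serving the FastAPI app
dirty_workers = 1       # workers that hold the model in memory
dirty_timeout = 60      # max seconds allowed per inference call
```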

## Model

The service uses [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2):
- 384-dimensional embeddings
- Fast inference (~14K sentences/sec on a V100 GPU, per the sentence-transformers benchmarks)
- Good quality for semantic search
- ~90MB download

To use a different model, edit `embedding_app.py`:
```python
self.model = SentenceTransformer('your-model-name')
```
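
For semantic search, embeddings are usually compared with cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real 384-dimensional ones. The helper names `cosine_similarity` and `rank_by_similarity` are illustrative and not part of this example's code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, candidates):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors stand in for real 384-dimensional embeddings.
query = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [-1.0, 0.0, 0.0]]
print(rank_by_similarity(query, candidates))  # [1, 0, 2]
```

In practice the query and candidate vectors would come from the `/embed` endpoint.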

## Testing

Run the integration tests:

```bash
# Start the service first
docker compose up -d

# Run tests
pip install requests numpy
python test_embedding.py
```
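
Because the model has to be downloaded and loaded on first start, the service may not answer immediately after `docker compose up -d`. One way to handle this is to poll `/health` before running the tests. This is a sketch under assumptions: `http_health_probe` and `wait_until_healthy` are hypothetical helpers, and it is not known whether `test_embedding.py` already does something similar.

```python
import time
import urllib.error
import urllib.request


def http_health_probe(url="http://localhost:8000/health", timeout=2):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(probe=http_health_probe, attempts=30, delay=1.0):
    """Call probe until it succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

Calling `wait_until_healthy()` returns `True` as soon as the service responds, or `False` after the attempts are used up.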

## Production Considerations

1. **GPU Support**: Add CUDA to the Dockerfile for faster inference
2. **Scaling**: Increase `dirty_workers` for more concurrent embeddings; each additional worker loads its own copy of the model, so memory use grows accordingly
3. **Caching**: Add Redis caching for repeated texts
4. **Rate Limiting**: Add FastAPI middleware for rate limiting
5. **Monitoring**: Add Prometheus metrics endpoint
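
The caching idea from item 3 can be sketched with a cache-aside wrapper. An in-process dict stands in for Redis here, and `CachingEmbedder`, `embed_fn`, and `fake_embed` are hypothetical names. A real deployment would also need expiry and a key scheme that includes the model name.

```python
class CachingEmbedder:
    """Cache-aside wrapper: only texts not seen before reach the model."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn   # hypothetical callable: list of texts -> list of vectors
        self._cache = {}            # in-process dict standing in for Redis

    def embed(self, texts):
        # Deduplicate misses while keeping first-seen order.
        missing = [t for t in dict.fromkeys(texts) if t not in self._cache]
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]


calls = []


def fake_embed(texts):
    # Toy embedding function that records every batch it receives.
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]


embedder = CachingEmbedder(fake_embed)
embedder.embed(["a", "bb", "a"])
embedder.embed(["bb", "ccc"])
print(calls)  # [['a', 'bb'], ['ccc']]
```

Only the texts missing from the cache are sent to the model, so repeated inputs cost a lookup instead of an inference call.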