mirror of https://github.com/frappe/gunicorn.git synced 2026-07-01 18:21:30 +08:00

Benoit Chesneau 0e05c824e9 feat(examples): add FastAPI embedding service with Docker testing

Add a complete example demonstrating dirty workers with sentence-transformers
for text embeddings via FastAPI:

- EmbeddingApp DirtyApp that loads and manages the ML model
- FastAPI endpoints for /embed and /health
- Docker and docker-compose configuration
- Integration tests with numpy similarity checks
- GitHub Actions CI workflow

2026-01-25 10:23:25 +01:00

Files:

- __init__.py
- docker-compose.yml
- Dockerfile
- embedding_app.py
- gunicorn_conf.py
- main.py
- README.md
- requirements.txt
- test_embedding.py

README.md

Embedding Service Example

A FastAPI-based text embedding service using sentence-transformers, powered by gunicorn's dirty workers for efficient ML model management.

Overview

This example demonstrates how to build a production-ready embedding API that:

- Keeps ML models loaded in memory across requests (dirty workers)
- Handles HTTP efficiently with async FastAPI (ASGI workers)
- Provides batch embedding for multiple texts
- Includes Docker-based deployment and testing

Architecture

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  HTTP Clients   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│                 │     │  - /embed        │     │  - sentence-        │
│                 │◄────│  - /health       │◄────│    transformers     │
└─────────────────┘     └──────────────────┘     │  - Model in memory  │
                                                  └─────────────────────┘

Why dirty workers?

- ML models are expensive to load (several seconds)
- Dirty workers load the model once at startup
- HTTP workers remain lightweight and responsive
- Model stays in memory, serving many requests
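
The load-once pattern behind these points can be sketched in plain Python. This is only an illustration, not the code in embedding_app.py: `ModelHolder`, `setup`, and `fake_loader` are hypothetical names, and the toy loader stands in for the real sentence-transformers model.

```python
class ModelHolder:
    """Hypothetical stand-in for a dirty worker app: load once, serve many calls."""

    def __init__(self, loader):
        self._loader = loader
        self.model = None
        self.load_count = 0

    def setup(self):
        # The expensive step: runs once when the worker starts.
        self.model = self._loader()
        self.load_count += 1

    def embed(self, texts):
        # Every request reuses the model that is already in memory.
        if self.model is None:
            raise RuntimeError("setup() must run before embed()")
        return [self.model(text) for text in texts]


def fake_loader():
    # Placeholder for loading a real model; returns a toy "model" that maps
    # a text to [length, number of spaces].
    return lambda text: [float(len(text)), float(text.count(" "))]


holder = ModelHolder(fake_loader)
holder.setup()
for _ in range(3):
    holder.embed(["Hello world"])  # three requests, still a single load
```

However many requests arrive, `holder.load_count` stays at 1, which is the property the dirty worker provides for the real model.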

Quick Start

With Docker (recommended)

cd examples/embedding_service
docker compose up --build

Local Development

# Install dependencies
pip install sentence-transformers fastapi pydantic

# Run with gunicorn
gunicorn examples.embedding_service.main:app \
  -c examples/embedding_service/gunicorn_conf.py

API Reference

POST /embed

Generate embeddings for a list of texts.

Request:

{
  "texts": ["Hello world", "Another sentence"]
}

Response:

{
  "embeddings": [
    [0.123, -0.456, ...],
    [0.789, -0.012, ...]
  ]
}
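
A client may want to sanity-check the response shape before using it. The sketch below assumes one vector per input text, in order, and 384 dimensions (the default model's size); `check_embed_response` is a hypothetical helper, not part of the example code.

```python
def check_embed_response(payload, n_texts, dim=384):
    """Return the embeddings list, or raise ValueError if the shape is off."""
    embeddings = payload.get("embeddings") if isinstance(payload, dict) else None
    if not isinstance(embeddings, list):
        raise ValueError("response has no 'embeddings' list")
    if len(embeddings) != n_texts:
        raise ValueError(f"expected {n_texts} vectors, got {len(embeddings)}")
    for i, vec in enumerate(embeddings):
        # dim=384 is an assumption tied to all-MiniLM-L6-v2; pass another
        # value if the service runs a different model.
        if not isinstance(vec, list) or len(vec) != dim:
            raise ValueError(f"vector {i} does not have {dim} dimensions")
        if not all(isinstance(x, (int, float)) for x in vec):
            raise ValueError(f"vector {i} contains non-numeric values")
    return embeddings
```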

Example:

curl -X POST http://localhost:8000/embed \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world"]}'
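
The same call can be made from Python with only the standard library. This is a minimal sketch that mirrors the curl command above; the helper names `build_embed_request` and `embed` are illustrative, and the base URL assumes the default bind address.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8000"):
    """Build the POST /embed request without sending it."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url="http://localhost:8000", timeout=60):
    """Send the request and return the list of embedding vectors."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embeddings"]
```

With the service running, `embed(["Hello world"])` should return a list containing one vector.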

GET /health

Health check endpoint.

Response:

{"status": "ok"}
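
Because the model takes a few seconds to load, scripts that start the service may want to poll /health before sending work. A hedged sketch follows: `wait_until_healthy` and its parameters are hypothetical, and the probe is passed in so the loop can be exercised without a running server.

```python
import time


def wait_until_healthy(probe, timeout=30.0, interval=0.5,
                       sleep=time.sleep, clock=time.monotonic):
    """Call probe() until it returns {"status": "ok"} or the timeout expires."""
    deadline = clock() + timeout
    while True:
        try:
            if probe() == {"status": "ok"}:
                return True
        except OSError:
            # Connection refused and similar errors while the server starts.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

A real probe could be as simple as `lambda: json.load(urllib.request.urlopen("http://localhost:8000/health"))`; the connection errors raised by `urllib` are subclasses of `OSError`, so they are retried.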

Configuration

Edit gunicorn_conf.py to adjust:

| Setting | Default | Description |
|---|---|---|
| `workers` | 2 | Number of HTTP workers |
| `dirty_workers` | 1 | Number of ML model workers |
| `dirty_timeout` | 60 | Max seconds per inference |
| `bind` | 0.0.0.0:8000 | Listen address |
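
Expressed as a Python config file, the defaults above would look roughly like this. It is a sketch covering only the four settings in the table; the real gunicorn_conf.py presumably also selects the ASGI worker class and registers the dirty app, which this README does not show and which are left out here rather than guessed.

```python
# gunicorn_conf.py (sketch): values mirror the documented defaults.
bind = "0.0.0.0:8000"   # listen address
workers = 2             # HTTP workers
dirty_workers = 1       # ML model workers
dirty_timeout = 60      # max seconds per inference
```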

Model

Uses all-MiniLM-L6-v2:

- 384-dimensional embeddings
- Fast inference (~14K sentences/sec on GPU)
- Good quality for semantic search
- ~90MB download

To use a different model, edit embedding_app.py:

self.model = SentenceTransformer('your-model-name')
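
One way to avoid editing the source for every model change is to read the name from the environment. This is a sketch only: the `EMBEDDING_MODEL` variable and the `resolve_model_name` helper are hypothetical and not part of the example.

```python
import os

DEFAULT_MODEL = "all-MiniLM-L6-v2"


def resolve_model_name(env=None):
    """Return the model name from EMBEDDING_MODEL (hypothetical), else the default."""
    env = os.environ if env is None else env
    name = env.get("EMBEDDING_MODEL", "").strip()
    return name or DEFAULT_MODEL
```

The result would then be passed to `SentenceTransformer(...)`. Keep in mind that other models can produce vectors of a different size than 384, so anything downstream that assumes that dimension needs updating too.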

Testing

Run the integration tests:

# Start the service first
docker compose up -d

# Run tests
pip install requests numpy
python test_embedding.py
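
The kind of similarity check such tests perform can be illustrated without numpy. The function below is a standard-library sketch of cosine similarity, not the code in test_embedding.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

A typical assertion embeds two related sentences and one unrelated sentence, then checks that the related pair scores higher than either sentence does against the unrelated one.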

Production Considerations

- GPU Support: Add CUDA to the Dockerfile for faster inference
- Scaling: Increase dirty_workers for more concurrent embeddings
- Caching: Add Redis caching for repeated texts
- Rate Limiting: Add FastAPI middleware for rate limiting
- Monitoring: Add Prometheus metrics endpoint
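
As an illustration of the caching idea, the sketch below memoizes embeddings in a plain dictionary and sends only unseen texts to the model in a single batch. It is an in-process stand-in, not a Redis integration; `CachingEmbedder` and `embed_batch` are hypothetical names, and a production cache would also need eviction or expiry.

```python
import hashlib


class CachingEmbedder:
    """Wrap a batch embedding function with an in-memory cache keyed by text hash."""

    def __init__(self, embed_batch):
        self._embed_batch = embed_batch  # callable: list[str] -> list[vector]
        self._cache = {}

    @staticmethod
    def _key(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        keys = [self._key(text) for text in texts]
        # Collect each unseen text once, preserving first-seen order.
        missing = {}
        for key, text in zip(keys, texts):
            if key not in self._cache and key not in missing:
                missing[key] = text
        if missing:
            vectors = self._embed_batch(list(missing.values()))
            for key, vector in zip(missing, vectors):
                self._cache[key] = vector
        # Return results in the caller's original order.
        return [self._cache[key] for key in keys]
```

Swapping the dictionary for a Redis client would keep the same structure: look up all keys, embed the misses in one batch, then write them back.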