Embedding Service Example
A FastAPI-based text embedding service using sentence-transformers, powered by gunicorn's dirty workers for efficient ML model management.
Overview
This example demonstrates how to build a production-ready embedding API that:
- Keeps ML models loaded in memory across requests (dirty workers)
- Handles HTTP efficiently with async FastAPI (ASGI workers)
- Provides batch embedding for multiple texts
- Includes Docker-based deployment and testing
Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  HTTP Clients   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│                 │     │  - /embed        │     │  - sentence-        │
│                 │◄────│  - /health       │◄────│    transformers     │
└─────────────────┘     └──────────────────┘     │  - Model in memory  │
                                                 └─────────────────────┘
Why dirty workers?
- ML models are expensive to load (several seconds)
- Dirty workers load the model once at startup
- HTTP workers remain lightweight and responsive
- Model stays in memory, serving many requests
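The load-once idea can be illustrated without gunicorn at all. The sketch below is a plain-Python stand-in, not the dirty worker API: `FakeModel` and `ModelWorker` are hypothetical names, and the "model" only counts how often it is constructed.

```python
class FakeModel:
    """Hypothetical stand-in for an expensive ML model."""

    load_count = 0  # how many times a model has been constructed

    def __init__(self):
        # A real model would spend several seconds here reading weights.
        FakeModel.load_count += 1

    def encode(self, texts):
        # Toy "embedding": one single-element vector per input text.
        return [[float(len(t))] for t in texts]


class ModelWorker:
    """Long-lived worker: loads the model once, then serves many calls."""

    def __init__(self):
        self.model = FakeModel()  # loaded once, at startup

    def embed(self, texts):
        # Every request reuses the model that is already in memory.
        return self.model.encode(texts)


worker = ModelWorker()
for _ in range(3):
    worker.embed(["hello", "world"])

print(FakeModel.load_count)
```

Three requests are served but the model is constructed only once, which is the property the dirty worker is meant to provide for the real sentence-transformers model.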
Quick Start
With Docker (recommended)
cd examples/embedding_service
docker compose up --build
Local Development
# Install dependencies
pip install sentence-transformers fastapi pydantic
# Run with gunicorn
gunicorn examples.embedding_service.main:app \
-c examples/embedding_service/gunicorn_conf.py
API Reference
POST /embed
Generate embeddings for a list of texts.
Request:
{
"texts": ["Hello world", "Another sentence"]
}
Response:
{
"embeddings": [
[0.123, -0.456, ...],
[0.789, -0.012, ...]
]
}
Example:
curl -X POST http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello world"]}'
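The same call can be made from Python. This is a minimal client sketch using only the standard library; the helper names (`build_embed_request`, `parse_embed_response`, `embed`) are illustrative and not part of the example service, and the base URL assumes the default bind address.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8000"):
    """Build a POST /embed request carrying the documented JSON body."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_embed_response(raw):
    """Extract the list of embedding vectors from a raw JSON response."""
    return json.loads(raw)["embeddings"]


def embed(texts, base_url="http://localhost:8000"):
    """Send the request to a running service and return the embeddings."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request) as response:
        return parse_embed_response(response.read())
```

With the service running, `embed(["Hello world"])` should return one vector per input text, in the same order as the request.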
GET /health
Health check endpoint.
Response:
{"status": "ok"}
Configuration
Edit gunicorn_conf.py to adjust:
| Setting | Default | Description |
|---|---|---|
| workers | 2 | Number of HTTP workers |
| dirty_workers | 1 | Number of ML model workers |
| dirty_timeout | 60 | Max seconds per inference |
| bind | 0.0.0.0:8000 | Listen address |
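Expressed as a config file, those defaults would look roughly like the sketch below. It covers only the four settings listed in the table; the actual gunicorn_conf.py in the example may contain more (for instance registering the dirty app or choosing a worker class), which is left out here rather than guessed.

```python
# gunicorn_conf.py (sketch: only the settings documented in the table)

bind = "0.0.0.0:8000"  # listen address
workers = 2            # number of HTTP workers
dirty_workers = 1      # number of ML model workers
dirty_timeout = 60     # max seconds per inference
```

Each dirty worker presumably holds its own copy of the model, so memory use should be checked before raising `dirty_workers`.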
Model
Uses all-MiniLM-L6-v2:
- 384-dimensional embeddings
- Fast inference (~14K sentences/sec on GPU)
- Good quality for semantic search
- ~90MB download
To use a different model, edit embedding_app.py:
self.model = SentenceTransformer('your-model-name')
Testing
Run the integration tests:
# Start the service first
docker compose up -d
# Run tests
pip install requests numpy
python test_embedding.py
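One kind of check an integration test for an embedding service can make is that related sentences score higher on cosine similarity than unrelated ones. The sketch below shows the similarity computation with the standard library only; it is an illustration, not the contents of test_embedding.py, and the toy vectors stand in for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings returned by POST /embed.
query = [1.0, 0.0, 1.0]
similar = [0.9, 0.1, 1.1]
unrelated = [0.0, 1.0, 0.0]

# A similarity test would expect the related pair to score higher.
assert cosine_similarity(query, similar) > cosine_similarity(query, unrelated)
```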
Production Considerations
- GPU Support: Add CUDA to the Dockerfile for faster inference
- Scaling: Increase dirty_workers for more concurrent embeddings
- Caching: Add Redis caching for repeated texts
- Rate Limiting: Add FastAPI middleware for rate limiting
- Monitoring: Add Prometheus metrics endpoint
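The caching idea can be sketched with an in-process dictionary standing in for Redis. `EmbeddingCache` is a hypothetical helper, and `embed_fn` is assumed to be any callable that maps a list of texts to a list of vectors; a Redis-backed version would replace the dictionary with get/set calls against the server.

```python
import hashlib


class EmbeddingCache:
    """Caches embeddings by text hash; a dict stands in for Redis here."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # callable: list[str] -> list[vector]
        self._store = {}

    @staticmethod
    def _key(text):
        # Hash the text so keys have a fixed size regardless of input length.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed(self, texts):
        # Collect unique texts that are not cached yet, preserving order.
        missing = [t for t in dict.fromkeys(texts)
                   if self._key(t) not in self._store]
        if missing:
            # One batched call for everything that is not cached.
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._store[self._key(text)] = vector
        # Return vectors in the same order as the request.
        return [self._store[self._key(t)] for t in texts]


calls = []

def fake_embed(texts):
    # Hypothetical embedder that records each batch it receives.
    calls.append(list(texts))
    return [[float(len(t))] for t in texts]

cache = EmbeddingCache(fake_embed)
cache.embed(["a", "bb", "a"])
cache.embed(["bb", "ccc"])
```

In this run the second call only sends `"ccc"` to the embedder, since `"bb"` was cached by the first call.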