# Embedding Service Example

A FastAPI-based text embedding service using sentence-transformers, powered by
gunicorn's dirty workers for efficient ML model management.

## Overview

This example demonstrates how to build a production-ready embedding API that:
- Keeps ML models loaded in memory across requests (dirty workers)
- Handles HTTP efficiently with async FastAPI (ASGI workers)
- Provides batch embedding for multiple texts
- Includes Docker-based deployment and testing

## Architecture

```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  HTTP Clients   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│                 │     │  - /embed        │     │  - sentence-        │
│                 │◄────│  - /health       │◄────│    transformers     │
└─────────────────┘     └──────────────────┘     │  - Model in memory  │
                                                 └─────────────────────┘
```

**Why dirty workers?**
- ML models are expensive to load (several seconds)
- Dirty workers load the model once at startup
- HTTP workers remain lightweight and responsive
- Model stays in memory, serving many requests

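The load-once pattern behind these points can be sketched in plain Python. This is only an illustration and does not use gunicorn's dirty worker API: `SlowModel`, `ModelWorker`, and their methods are hypothetical stand-ins, and the load delay is simulated with a short sleep.

```python
import time


class SlowModel:
    """Hypothetical stand-in for an ML model that is expensive to construct."""

    load_count = 0  # how many times a model was loaded in this process

    def __init__(self, load_seconds: float = 0.05):
        time.sleep(load_seconds)  # simulate reading weights from disk
        SlowModel.load_count += 1

    def encode(self, texts):
        # Toy "embedding": text length and vowel count (not a real model).
        return [
            [float(len(t)), float(sum(c in "aeiou" for c in t.lower()))]
            for t in texts
        ]


class ModelWorker:
    """Hypothetical long-lived worker: loads the model once, serves many calls."""

    def __init__(self):
        self.model = None

    def init(self):
        # Runs once at startup, so individual requests never pay the load cost.
        self.model = SlowModel()

    def embed(self, texts):
        return self.model.encode(texts)


worker = ModelWorker()
worker.init()
results = [worker.embed(["Hello world"]) for _ in range(100)]
print(SlowModel.load_count)  # one load served all 100 requests
```

If the model were instead constructed inside `embed`, every request would pay the load cost, which is the situation dirty workers are meant to avoid.
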
## Quick Start

### With Docker (recommended)

```bash
cd examples/embedding_service
docker compose up --build
```

### Local Development

```bash
# Install dependencies
pip install sentence-transformers fastapi pydantic

# Run with gunicorn
gunicorn examples.embedding_service.main:app \
  -c examples/embedding_service/gunicorn_conf.py
```

## API Reference

### POST /embed

Generate embeddings for a list of texts.

**Request:**
```json
{
  "texts": ["Hello world", "Another sentence"]
}
```

**Response:**
```json
{
  "embeddings": [
    [0.123, -0.456, ...],
    [0.789, -0.012, ...]
  ]
}
```

**Example:**
```bash
curl -X POST http://localhost:8000/embed \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world"]}'
```

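The same call can be made from Python. The following is a minimal client sketch using only the standard library; it assumes the service listens on `localhost:8000` as in the curl example, and the helper names `build_embed_request` and `embed` are illustrative, not part of the example's code.

```python
import json
import urllib.request

EMBED_URL = "http://localhost:8000/embed"  # address taken from the curl example


def build_embed_request(texts, url=EMBED_URL):
    """Build (but do not send) a POST /embed request for a list of texts."""
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, url=EMBED_URL, timeout=30.0):
    """Send the request and return the list of embedding vectors."""
    request = build_embed_request(texts, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["embeddings"]
```

With the service running, `embed(["Hello world", "Another sentence"])` should return one vector per input text, in the same order as the request.
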
### GET /health

Health check endpoint.

**Response:**
```json
{"status": "ok"}
```

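A script that depends on the service can poll this endpoint before sending work. The sketch below is one way to do that with the standard library; the function names, the retry count, and the delay are assumptions chosen for illustration, and the URL assumes the default bind address.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:8000/health"  # assumed default address


def fetch_health(url=HEALTH_URL, timeout=2.0):
    """Return the raw response body of GET /health."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def is_healthy(body):
    """True only if the body is a JSON object with "status" equal to "ok"."""
    try:
        return json.loads(body).get("status") == "ok"
    except (ValueError, TypeError, AttributeError):
        return False


def wait_until_healthy(fetch=fetch_health, attempts=30, delay=1.0):
    """Poll until the service reports healthy or the attempts run out."""
    for _ in range(attempts):
        try:
            if is_healthy(fetch()):
                return True
        except OSError:
            pass  # connection refused or timed out: service not up yet
        time.sleep(delay)
    return False
```

The `fetch` parameter is injectable so the polling logic can be exercised without a running server.
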
## Configuration

Edit `gunicorn_conf.py` to adjust:

| Setting | Default | Description |
|---------|---------|-------------|
| `workers` | 2 | Number of HTTP workers |
| `dirty_workers` | 1 | Number of ML model workers |
| `dirty_timeout` | 60 | Max seconds per inference |
| `bind` | 0.0.0.0:8000 | Listen address |

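Written out as Python assignments, the defaults in the table would look roughly like this. It is a sketch covering only the four settings listed above; the example's actual `gunicorn_conf.py` probably contains more (for instance the worker class and the registration of the dirty app), which are not reproduced here because this README does not list them.

```python
# gunicorn_conf.py (sketch): only the settings listed in the table above.
bind = "0.0.0.0:8000"  # listen address
workers = 2            # number of HTTP workers
dirty_workers = 1      # number of ML model workers
dirty_timeout = 60     # max seconds per inference
```
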
## Model

Uses [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2):
- 384-dimensional embeddings
- Fast inference (~14K sentences/sec on GPU)
- Good quality for semantic search
- ~90MB download

To use a different model, edit `embedding_app.py`:
```python
self.model = SentenceTransformer('your-model-name')
```

## Testing

Run the integration tests:

```bash
# Start the service first
docker compose up -d

# Run tests
pip install requests numpy
python test_embedding.py
```

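A typical check on embeddings is that related texts score a higher cosine similarity than unrelated ones. The sketch below shows that kind of check in plain Python; it is not taken from `test_embedding.py`, which uses numpy, and the three short vectors are made-up stand-ins for real 384-dimensional embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings returned by /embed.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related texts should be closer to each other than unrelated ones.
assert cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice)
```
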
## Production Considerations

1. **GPU Support**: Add CUDA to the Dockerfile for faster inference
2. **Scaling**: Increase `dirty_workers` for more concurrent embeddings
3. **Caching**: Add Redis caching for repeated texts
4. **Rate Limiting**: Add FastAPI middleware for rate limiting
5. **Monitoring**: Add Prometheus metrics endpoint

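The caching idea in item 3 can be sketched without Redis. The class below is a small in-process LRU cache used as a stand-in for an external store; `EmbeddingCache` and its `embed_fn` parameter are hypothetical names, and `embed_fn` is assumed to be any callable that maps a list of texts to a list of vectors.

```python
from collections import OrderedDict


class EmbeddingCache:
    """Small in-process LRU cache; a stand-in for an external store such as Redis."""

    def __init__(self, embed_fn, max_items=1024):
        self._embed_fn = embed_fn  # callable: list[str] -> list[list[float]]
        self._max_items = max_items
        self._store = OrderedDict()

    def embed(self, texts):
        # Embed only texts not cached yet, in one batch, without duplicates.
        missing = list(dict.fromkeys(t for t in texts if t not in self._store))
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._store[text] = vector
        results = []
        for text in texts:
            self._store.move_to_end(text)  # mark as most recently used
            results.append(self._store[text])
        # Evict least recently used entries once the cache is over capacity.
        while len(self._store) > self._max_items:
            self._store.popitem(last=False)
        return results
```

Repeated texts then reach the model only once per cache lifetime, at the cost of memory proportional to `max_items`. A shared store such as Redis would additionally let several HTTP workers reuse each other's results.
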