Benoit Chesneau 0e05c824e9 feat(examples): add FastAPI embedding service with Docker testing
Add a complete example demonstrating dirty workers with sentence-transformers
for text embeddings via FastAPI:

- EmbeddingApp DirtyApp that loads and manages the ML model
- FastAPI endpoints for /embed and /health
- Docker and docker-compose configuration
- Integration tests with numpy similarity checks
- GitHub Actions CI workflow
2026-01-25 10:23:25 +01:00

# Embedding Service Example
A FastAPI-based text embedding service using sentence-transformers, powered by
gunicorn's dirty workers for efficient ML model management.
## Overview
This example demonstrates how to build a production-ready embedding API that:
- Keeps ML models loaded in memory across requests (dirty workers)
- Handles HTTP efficiently with async FastAPI (ASGI workers)
- Provides batch embedding for multiple texts
- Includes Docker-based deployment and testing
## Architecture
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────────┐
│  HTTP Clients   │────►│  FastAPI (ASGI)  │────►│  DirtyWorker        │
│                 │     │  - /embed        │     │  - sentence-        │
│                 │◄────│  - /health       │◄────│    transformers     │
└─────────────────┘     └──────────────────┘     │  - Model in memory  │
                                                 └─────────────────────┘
```
**Why dirty workers?**
- ML models are expensive to load (several seconds)
- Dirty workers load the model once at startup
- HTTP workers remain lightweight and responsive
- Model stays in memory, serving many requests
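
The load-once idea can be illustrated without any gunicorn-specific code. The sketch below is a toy stand-in and does not use the dirty worker API: `FakeModel`, `ModelHolder`, and the simulated load delay are hypothetical names chosen for illustration.

```python
import time


class FakeModel:
    """Hypothetical stand-in for an expensive ML model."""

    def __init__(self, load_seconds=0.05):
        time.sleep(load_seconds)  # simulate a slow model load

    def encode(self, texts):
        # Return one dummy vector per input text.
        return [[float(len(t))] for t in texts]


class ModelHolder:
    """Loads the model once and reuses it for every request."""

    def __init__(self):
        self.load_count = 0
        self.model = None

    def setup(self):
        # Called once at startup, before any request is served.
        self.model = FakeModel()
        self.load_count += 1

    def embed(self, texts):
        return self.model.encode(texts)


holder = ModelHolder()
holder.setup()
for _ in range(3):
    holder.embed(["Hello world"])  # no reload between requests
print(holder.load_count)  # prints 1
```

The load cost is paid once in `setup()`, so each later call to `embed()` only pays for inference.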
## Quick Start
### With Docker (recommended)
```bash
cd examples/embedding_service
docker compose up --build
```
### Local Development
```bash
# Install dependencies
pip install sentence-transformers fastapi pydantic
# Run with gunicorn (paths are relative to the repository root)
gunicorn examples.embedding_service.main:app \
-c examples/embedding_service/gunicorn_conf.py
```
## API Reference
### POST /embed
Generate embeddings for a list of texts.
**Request:**
```json
{
"texts": ["Hello world", "Another sentence"]
}
```
**Response:**
```json
{
"embeddings": [
[0.123, -0.456, ...],
[0.789, -0.012, ...]
]
}
```
**Example:**
```bash
curl -X POST http://localhost:8000/embed \
-H "Content-Type: application/json" \
-d '{"texts": ["Hello world"]}'
```
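The same call can be made from Python. This is a minimal client sketch using only the standard library; the helper names `build_embed_request` and `embed` are hypothetical, and the base URL assumes the default bind address on localhost.

```python
import json
import urllib.request


def build_embed_request(texts, base_url="http://localhost:8000"):
    """Build a POST request for /embed without sending it."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, base_url="http://localhost:8000", timeout=30):
    """Send the request and return the list of embedding vectors."""
    request = build_embed_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["embeddings"]
```

With the service running, `embed(["Hello world"])` should return a list containing one vector.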
### GET /health
Health check endpoint.
**Response:**
```json
{"status": "ok"}
```
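Because the model takes a few seconds to load, a script may want to poll this endpoint before sending work. The following is one possible readiness check, using only the standard library; `is_healthy` and `wait_until_healthy` are hypothetical helper names, and the attempt count and delay are arbitrary defaults.

```python
import json
import time
import urllib.request


def is_healthy(base_url="http://localhost:8000", timeout=2):
    """Return True if GET /health answers with {"status": "ok"}."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors and timeouts; ValueError covers bad JSON.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"


def wait_until_healthy(check=is_healthy, attempts=30, delay=1.0):
    """Poll `check` until it returns True or the attempts run out."""
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
    return False
```

Passing the check as a parameter keeps the polling loop testable without a running service.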
## Configuration
Edit `gunicorn_conf.py` to adjust:

| Setting | Default | Description |
|---------|---------|-------------|
| `workers` | 2 | Number of HTTP workers |
| `dirty_workers` | 1 | Number of ML model workers |
| `dirty_timeout` | 60 | Max seconds per inference |
| `bind` | 0.0.0.0:8000 | Listen address |
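
As a sketch, a `gunicorn_conf.py` carrying the defaults from the table could look like the following. Only the four listed settings are shown. The actual file in the example likely contains more (for instance, whatever registers the dirty app and selects the ASGI worker), and those lines are omitted here rather than guessed.

```python
# gunicorn_conf.py (sketch: only the settings from the table)

bind = "0.0.0.0:8000"   # listen address
workers = 2             # number of HTTP workers
dirty_workers = 1       # number of ML model workers
dirty_timeout = 60      # max seconds per inference
```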
## Model
Uses [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2):
- 384-dimensional embeddings
- Fast inference (~14K sentences/sec on GPU)
- Good quality for semantic search
- ~90MB download
To use a different model, edit `embedding_app.py` (the embedding dimension may then differ from 384):
```python
self.model = SentenceTransformer('your-model-name')
```
## Testing
Run the integration tests:
```bash
# Start the service first
docker compose up -d
# Run tests
pip install requests numpy
python test_embedding.py
```
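The tests install numpy, presumably to compare embeddings by similarity. As a rough illustration of that kind of check, the sketch below computes cosine similarity in plain Python so it runs without extra dependencies. The function name and the tiny hand-written vectors are assumptions for illustration and are not taken from `test_embedding.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real embeddings from this service have 384 dimensions.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # close to 1.0
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # exactly 0.0
```

A test built on this would typically assert that related sentences score higher than unrelated ones, rather than checking exact values.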
## Production Considerations
1. **GPU Support**: Add CUDA to the Dockerfile for faster inference
2. **Scaling**: Increase `dirty_workers` for more concurrent embeddings
3. **Caching**: Add Redis caching for repeated texts
4. **Rate Limiting**: Add FastAPI middleware for rate limiting
5. **Monitoring**: Add Prometheus metrics endpoint
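
The caching point can be sketched with an in-process dictionary standing in for Redis. `CachedEmbedder` and the `embed_fn` callable are hypothetical names. A real deployment would swap the dictionary for a Redis client and add expiry.

```python
class CachedEmbedder:
    """Caches embeddings per text so repeated texts skip inference."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn   # callable: list[str] -> list[list[float]]
        self._cache = {}            # text -> embedding; stand-in for Redis

    def embed(self, texts):
        # Collect uncached texts once each, preserving first-seen order.
        missing = list(dict.fromkeys(t for t in texts if t not in self._cache))
        if missing:
            for text, vector in zip(missing, self._embed_fn(missing)):
                self._cache[text] = vector
        return [self._cache[t] for t in texts]
```

Only the uncached texts are sent for inference, and they go in a single batch, which fits the batch-oriented `/embed` endpoint.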