Voice Upload & Library - travisvn/chatterbox-tts-api GitHub Wiki
Voice Library Management
Overview
The Chatterbox TTS API now includes a comprehensive voice library management system that allows users to upload, manage, and use custom voices across all speech generation endpoints. This feature enables you to create a persistent collection of voices that can be referenced by name in API calls.
Key Features
- Persistent Voice Storage: Uploaded voices are stored persistently and survive container restarts
- Voice Selection by Name: Reference uploaded voices by name in any speech generation endpoint
- Multiple Audio Formats: Support for MP3, WAV, FLAC, M4A, and OGG files
- RESTful Voice Management: Full CRUD operations for voice management
- Docker & Local Support: Works seamlessly with both Docker and direct Python installations
- Frontend Integration: Complete voice management UI in the web frontend
Getting Started
For Docker Users
The voice library is automatically configured when using Docker. Voices are stored in a persistent volume:
# Start with voice library enabled
docker-compose up -d
# Your voices will be persisted in the "chatterbox-voices" Docker volume
For Local Python Users
Create a voice library directory (default: ./voices):
# Create voices directory
mkdir voices
# Or set custom location
export VOICE_LIBRARY_DIR="/path/to/your/voices"
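A script that works next to a local install may want to resolve the same directory the API uses. The sketch below assumes only what is stated above: VOICE_LIBRARY_DIR overrides the default ./voices. The helper name resolve_voice_dir is hypothetical and is not part of the API.

```python
import os
from pathlib import Path


def resolve_voice_dir(env=None, default="./voices"):
    """Return the voice library directory, preferring VOICE_LIBRARY_DIR.

    Hypothetical helper for client-side scripts; not part of the API.
    """
    env = os.environ if env is None else env
    # Fall back to the documented local default when the variable is unset or empty.
    return Path(env.get("VOICE_LIBRARY_DIR") or default)


print(resolve_voice_dir({"VOICE_LIBRARY_DIR": "/path/to/your/voices"}))
```

Passing a plain dict instead of os.environ keeps the helper easy to test without touching the real environment.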
API Endpoints
List Voices
GET /v1/voices
Get a list of all voices in the library.
curl -X GET "http://localhost:4123/v1/voices"
Response:
{
"voices": [
{
"name": "sarah_professional",
"filename": "sarah_professional.mp3",
"original_filename": "sarah_recording.mp3",
"file_extension": ".mp3",
"file_size": 1024768,
"upload_date": "2024-01-15T10:30:00Z",
"path": "/voices/sarah_professional.mp3"
}
],
"count": 1
}
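As a rough illustration of consuming this response on the client, the sketch below turns the body into (name, size) pairs. It assumes file_size is a byte count, which the response does not state explicitly; summarize_voices is a hypothetical helper name.

```python
import json


def summarize_voices(payload):
    """Return (name, size in MiB) pairs from a GET /v1/voices response body.

    Assumes `file_size` is expressed in bytes.
    """
    return [
        (voice["name"], round(voice["file_size"] / (1024 * 1024), 2))
        for voice in payload.get("voices", [])
    ]


# Sample body trimmed from the response shown above.
sample = json.loads("""
{
  "voices": [
    {
      "name": "sarah_professional",
      "filename": "sarah_professional.mp3",
      "file_size": 1024768
    }
  ],
  "count": 1
}
""")

for name, size_mib in summarize_voices(sample):
    print(f"{name}: {size_mib} MiB")
```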
Upload Voice
POST /v1/voices
Upload a new voice to the library.
curl -X POST "http://localhost:4123/v1/voices" \
-F "voice_name=sarah_professional" \
-F "voice_file=@/path/to/voice.mp3"
Parameters:
- voice_name (string): Name for the voice (used in API calls)
- voice_file (file): Audio file (MP3, WAV, FLAC, M4A, or OGG; max 10MB)
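Checking a file on the client before uploading can save a round trip. This is a minimal sketch based only on the formats and size limit listed above; it assumes "10MB" means 10 MiB, and check_voice_file is a hypothetical helper. The server remains the authority on what it accepts.

```python
from pathlib import Path

# Formats and size limit as listed in the parameters above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def check_voice_file(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {size_bytes} bytes")
    return problems


print(check_voice_file("sarah_recording.mp3", 1024768))
print(check_voice_file("notes.txt", 2048))
```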
Delete Voice
DELETE /v1/voices/{voice_name}
Delete a voice from the library.
curl -X DELETE "http://localhost:4123/v1/voices/sarah_professional"
Rename Voice
PUT /v1/voices/{voice_name}
Rename an existing voice.
curl -X PUT "http://localhost:4123/v1/voices/sarah_professional" \
-F "new_name=sarah_business"
Get Voice Info
GET /v1/voices/{voice_name}
Get detailed information about a specific voice.
curl -X GET "http://localhost:4123/v1/voices/sarah_professional"
Download Voice
GET /v1/voices/{voice_name}/download
Download the original voice file.
curl -X GET "http://localhost:4123/v1/voices/sarah_professional/download" \
--output voice.mp3
Using Voices in Speech Generation
JSON API (Recommended)
Use the voice name in the voice parameter:
curl -X POST "http://localhost:4123/v1/audio/speech" \
-H "Content-Type: application/json" \
-d '{
"input": "Hello! This is using my custom voice.",
"voice": "sarah_professional",
"exaggeration": 0.7,
"temperature": 0.8
}' \
--output speech.wav
Form Data API
curl -X POST "http://localhost:4123/v1/audio/speech/upload" \
-F "input=Hello! This is using my custom voice." \
-F "voice=sarah_professional" \
-F "exaggeration=0.7" \
--output speech.wav
Streaming API
curl -X POST "http://localhost:4123/v1/audio/speech/stream" \
-H "Content-Type: application/json" \
-d '{
"input": "This will stream with my custom voice.",
"voice": "sarah_professional"
}' \
--output stream.wav
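The JSON and streaming calls above can also be prepared from Python with the standard library alone. The sketch below only builds the request; the base URL comes from the curl examples, and build_speech_request is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URL = "http://localhost:4123"  # host and port taken from the curl examples


def build_speech_request(text, voice, stream=False, **options):
    """Build (but do not send) a JSON speech request that names a library voice."""
    path = "/v1/audio/speech/stream" if stream else "/v1/audio/speech"
    body = {"input": text, "voice": voice, **options}
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request(
    "Hello! This is using my custom voice.",
    "sarah_professional",
    exaggeration=0.7,
)
print(request.get_method(), request.full_url)

# To send it against a running server:
# with urllib.request.urlopen(request) as response, open("speech.wav", "wb") as out:
#     out.write(response.read())
```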
Configuration
Environment Variables
# Voice library directory (default: ./voices for local, /voices for Docker)
VOICE_LIBRARY_DIR=/path/to/voices
# For Docker, this is typically set to /voices and mounted as a volume
Docker Configuration
The voice library is automatically configured in Docker with a persistent volume:
volumes:
- chatterbox-voices:/voices
Voice Naming Guidelines
Valid Characters
- Letters (a-z, A-Z)
- Numbers (0-9)
- Underscores (_)
- Hyphens (-)
- Spaces (converted to underscores)
Invalid Characters
- Forward/backward slashes (/, \)
- Colons (:)
- Asterisks (*)
- Question marks (?)
- Quotes (", ')
- Angle brackets (<, >)
- Pipes (|)
Examples
Good names:
- "sarah_professional"
- "john-voice-2024"
- "female_american"
- "narration_style"
Invalid names:
- "sarah/professional" # Contains slash
- "voice:sample" # Contains colon
- "my voice?" # Contains question mark
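These rules can be mirrored on the client to catch bad names before a request is made. The sketch below is one interpretation of the guidelines (an allow-list of the valid characters, with spaces converted to underscores); the server's own sanitization may differ, and normalize_voice_name is a hypothetical helper.

```python
import re

# Allow-list built from the "Valid Characters" section: letters, digits,
# underscores, hyphens, and spaces.
VALID_NAME = re.compile(r"[A-Za-z0-9_\- ]+")


def normalize_voice_name(name):
    """Validate a voice name and convert spaces to underscores."""
    name = name.strip()
    if not VALID_NAME.fullmatch(name):
        raise ValueError(f"invalid voice name: {name!r}")
    return name.replace(" ", "_")


print(normalize_voice_name("female american"))
```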
Best Practices
Voice Quality
- Use high-quality audio samples (16-48kHz sample rate)
- Aim for 10-30 seconds of clean speech
- Avoid background noise and music
- Choose samples with consistent volume
File Management
- Use descriptive voice names
- Keep file sizes reasonable (< 10MB)
- Organize voices by speaker or style
- Clean up unused voices periodically
API Usage
- Use the JSON API for better performance
- Cache voice lists on the client side
- Handle voice-not-found errors gracefully
- Test voices before production use
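One way to cache the voice list on the client is a small time-based cache. The sketch below is an assumption-laden illustration: VoiceListCache and the 60-second default are hypothetical, and the fetch callable stands in for whatever code calls GET /v1/voices.

```python
import time


class VoiceListCache:
    """Cache the result of a voice-list fetch for `ttl_seconds`."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch  # callable returning an iterable of voice names
        self._ttl = ttl_seconds
        self._clock = clock
        self._names = None
        self._expires_at = 0.0

    def names(self):
        """Return cached names, refetching when the cache is empty or expired."""
        now = self._clock()
        if self._names is None or now >= self._expires_at:
            self._names = list(self._fetch())
            self._expires_at = now + self._ttl
        return self._names

    def invalidate(self):
        """Call after an upload, rename, or delete so the next read refetches."""
        self._names = None


cache = VoiceListCache(lambda: ["sarah_professional"])
print(cache.names())
```

Invalidating after every write keeps the cache from serving a list that the client itself just made stale.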
Troubleshooting
Voice Not Found
{
"error": {
"message": "Voice 'my_voice' not found in voice library. Use /voices endpoint to list available voices.",
"type": "voice_not_found_error"
}
}
Solution: Check available voices with GET /v1/voices or upload the voice first.
Upload Failed
{
"error": {
"message": "Unsupported audio format: .txt. Supported formats: .mp3, .wav, .flac, .m4a, .ogg",
"type": "invalid_request_error"
}
}
Solution: Use a supported audio format and ensure the file is valid.
Voice Already Exists
{
"error": {
"message": "Voice 'sarah_professional' already exists",
"type": "voice_exists_error"
}
}
Solution: Use a different name or delete the existing voice first.
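A client can map the error types shown above to the matching solutions. This is a minimal sketch that relies only on the three type values in these examples; suggest_fix is a hypothetical helper, and other error types may exist.

```python
def suggest_fix(error_body):
    """Map an error response body to a short next step."""
    error = error_body.get("error", {})
    hints = {
        "voice_not_found_error": "List voices with GET /v1/voices or upload the voice first.",
        "voice_exists_error": "Use a different name or delete the existing voice first.",
        "invalid_request_error": "Use a supported audio format and check that the file is valid.",
    }
    # Unknown types fall back to the server's own message.
    return hints.get(error.get("type"), error.get("message", "Unknown error"))


print(suggest_fix({
    "error": {
        "message": "Voice 'sarah_professional' already exists",
        "type": "voice_exists_error",
    }
}))
```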
Frontend Integration
The web frontend includes a complete voice library management interface:
- Voice Library Panel: Browse and manage voices
- Upload Modal: Easy voice upload with drag-and-drop
- Voice Selection: Choose voices in the TTS interface
- Preview Playback: Listen to voice samples before use
- Rename/Delete: Manage voice metadata
Migration from Client-Side Storage
If you were previously using the client-side voice library (localStorage), you'll need to re-upload your voices to the new server-side library for persistence and cross-device access.
API Aliases
All voice endpoints support multiple URL formats:
- /v1/voices (recommended)
- /voices
- /voice-library
- /voice_library
OpenAI Compatibility
The voice parameter also accepts OpenAI voice names for compatibility:
alloy, echo, fable, onyx, nova, shimmer
These will use the default configured voice sample, while custom names will use uploaded voices from the library.
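The distinction can be expressed as a simple lookup. The sketch below assumes an exact, case-sensitive match on the six names listed; how the server treats a library voice that shares one of these names is not stated here, and voice_source is a hypothetical helper.

```python
# OpenAI voice names listed above.
OPENAI_VOICE_NAMES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def voice_source(voice):
    """Return "default" for an OpenAI name, "library" for any other name."""
    return "default" if voice in OPENAI_VOICE_NAMES else "library"


print(voice_source("alloy"))
print(voice_source("sarah_professional"))
```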
Security Considerations
- Voice files are stored on the server filesystem
- File uploads are validated for type and size
- Voice names are sanitized to prevent path traversal
- No authentication required (same as other endpoints)
Performance Notes
- Voice library operations are fast (< 100ms typical)
- Voice files are loaded on-demand for TTS generation
- Large voice files may increase TTS processing time
- Consider voice file size vs. quality trade-offs
Future Enhancements
Planned features for future releases:
- Voice categorization and tagging
- Bulk voice operations
- Voice sharing between users
- Advanced voice metadata
- Voice quality analysis
- Automatic voice optimization
Voice Upload Feature Implementation Summary
Overview
Successfully implemented voice file upload functionality for the Chatterbox TTS API, allowing users to upload custom voice samples per request while maintaining full backward compatibility.
Changes Made
1. Core Dependencies Added
python-multipart>=0.0.6 - Required for FastAPI multipart/form-data support
Files Updated:
- requirements.txt - Added python-multipart dependency
- pyproject.toml - Added python-multipart to project dependencies
- All Docker files - Added python-multipart to pip install commands
2. Enhanced Speech Endpoint (app/api/endpoints/speech.py)
New Features:
- Voice file upload support - Optional voice_file parameter
- Multiple endpoint formats - Both JSON and form data support
- File validation - Format, size, and content validation
- Temporary file handling - Secure file processing with automatic cleanup
- Backward compatibility - Existing JSON requests continue to work
Supported File Formats:
- MP3 (.mp3)
- WAV (.wav)
- FLAC (.flac)
- M4A (.m4a)
- OGG (.ogg)
- Maximum size: 10MB
New Endpoints:
- POST /v1/audio/speech - Multipart form data (supports voice upload)
- POST /v1/audio/speech/json - Legacy JSON endpoint (backward compatibility)
3. Comprehensive Testing
New Test Files:
- tests/test_voice_upload.py - Dedicated voice upload testing
- Updated tests/test_api.py - Tests both JSON and form data endpoints
Test Coverage:
- Default voice (both endpoints)
- Custom voice upload
- File format validation
- Error handling
- Parameter validation
- Backward compatibility
4. Updated Documentation
README.md Updates:
- Added voice upload examples
- Documented supported file formats
- Provided usage examples in multiple languages (Python, cURL)
- Added file requirements and best practices
Usage Examples
Basic Usage (Default Voice)
# JSON (legacy)
curl -X POST http://localhost:4123/v1/audio/speech/json \
-H "Content-Type: application/json" \
-d '{"input": "Hello world!"}' \
--output output.wav
# Form data (new)
curl -X POST http://localhost:4123/v1/audio/speech \
-F "input=Hello world!" \
--output output.wav
Custom Voice Upload
curl -X POST http://localhost:4123/v1/audio/speech \
-F "input=Hello with my custom voice!" \
-F "exaggeration=0.8" \
-F "voice_file=@my_voice.mp3" \
--output custom_voice.wav
Python Example
import requests
# Upload a custom voice sample along with the text to synthesize
with open("my_voice.mp3", "rb") as voice_file:
    response = requests.post(
        "http://localhost:4123/v1/audio/speech",
        data={
            "input": "Hello with my custom voice!",
            "exaggeration": 0.8,
            "temperature": 1.0,
        },
        files={
            "voice_file": ("my_voice.mp3", voice_file, "audio/mpeg"),
        },
    )
# Fail loudly on an error response instead of writing it to disk
response.raise_for_status()
# Save the generated audio
with open("output.wav", "wb") as f:
    f.write(response.content)
Docker Support
All Docker files updated with python-multipart:
- docker/Dockerfile - Standard Docker image
- docker/Dockerfile.cpu - CPU-only image
- docker/Dockerfile.gpu - GPU-enabled image
- docker/Dockerfile.uv - uv-optimized image
- docker/Dockerfile.uv.gpu - uv + GPU image
Docker Usage:
# Build and run with voice upload support
docker compose -f docker/docker-compose.yml up -d
# Test voice upload
curl -X POST http://localhost:4123/v1/audio/speech \
-F "input=Hello from Docker!" \
-F "voice_file=@my_voice.mp3" \
--output docker_test.wav
Technical Implementation
File Processing Flow
- Upload - Receive multipart form data with optional voice file
- Validate - Check file format, size, and content
- Store - Create temporary file with secure naming
- Process - Use uploaded file or default voice sample for TTS
- Cleanup - Automatically remove temporary files
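The flow above can be sketched with the standard library as follows. This is an illustration under stated assumptions, not the code in app/api/endpoints/speech.py: process_upload and the synthesize callable are hypothetical, "10MB" is read as 10 MiB, and the sketch raises on invalid input rather than falling back to the default voice.

```python
import os
import tempfile
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def process_upload(filename, content, synthesize):
    """Validate, store temporarily, process, and always clean up.

    `synthesize` is a hypothetical callable that takes the path of the voice
    sample and returns audio bytes; it stands in for the real TTS call.
    """
    # Validate: check format and size before touching the filesystem.
    suffix = Path(filename).suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported audio format: {suffix}")
    if len(content) > MAX_UPLOAD_BYTES:
        raise ValueError("File too large")

    # Store: mkstemp creates the file with a unique, unpredictable name.
    fd, temp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as temp_file:
            temp_file.write(content)
        # Process: hand the stored sample to the TTS step.
        return synthesize(temp_path)
    finally:
        # Cleanup: runs on success and on error, so no temporary file is left behind.
        if os.path.exists(temp_path):
            os.remove(temp_path)


audio = process_upload("my_voice.mp3", b"fake audio bytes", lambda path: b"wav-bytes")
print(len(audio))
```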
Memory Management
- Temporary files are automatically cleaned up in finally blocks
- File validation prevents oversized uploads
- Secure temporary file creation with unique names
Error Handling
- File format validation with helpful error messages
- File size limits (10MB maximum)
- Graceful fallback to default voice on upload errors
- Comprehensive error responses with error codes
Testing
Quick Test
# Start the API
python main.py
# Run comprehensive tests
python tests/test_voice_upload.py
python tests/test_api.py
Test Results Expected
- Health check
- API documentation endpoints
- Legacy JSON endpoint compatibility
- New form data endpoint
- Voice file upload functionality
- Error handling and validation
API Documentation
The API documentation is automatically updated and available at:
- Swagger UI: http://localhost:4123/docs
- ReDoc: http://localhost:4123/redoc
- OpenAPI Schema: http://localhost:4123/openapi.json
The documentation now includes:
- Multipart form data support
- File upload parameters
- Example requests and responses
- Error codes and descriptions
Backward Compatibility
100% backward compatible:
- Existing JSON requests work unchanged
- All previous API behavior preserved
- Legacy endpoint (/v1/audio/speech/json) maintains exact same interface
- No breaking changes to existing functionality
Security Considerations
- File type validation prevents malicious uploads
- File size limits prevent DoS attacks
- Temporary files use secure random naming
- Automatic cleanup prevents file system bloat
- No persistent storage of uploaded files
Performance Impact
- Minimal overhead for JSON requests (unchanged code path)
- Temporary file I/O only when voice files are uploaded
- Efficient memory management with automatic cleanup
- FastAPI's built-in multipart handling is highly optimized
Status: Complete and Production Ready
The voice upload feature is fully implemented, tested, and documented. Users can now upload custom voice files for personalized text-to-speech generation while maintaining full backward compatibility with existing implementations.