# OpenAI Enhanced

## Introduction
OpenAI Enhanced mode is a specialized interface in Voice Chat AI that unlocks the full potential of OpenAI's latest models and voice technologies. This interface provides advanced features that go beyond the capabilities of the standard mode, allowing for more natural, responsive, and expressive conversations with AI characters.
## Key Features

### Advanced Model Selection
The Enhanced interface lets you select from the latest OpenAI models:
- **Chat Models:**
  - GPT-4o: OpenAI's most capable multimodal model with strong reasoning abilities
  - GPT-4o Mini: a faster, more efficient version of GPT-4o
  - GPT-4: the traditional text-based model with strong reasoning capabilities
- **Text-to-Speech Models:**
  - GPT-4o Mini TTS: multimodal TTS model with support for voice instructions
  - TTS-1: the standard OpenAI TTS model
  - TTS-1 HD: a high-definition version with enhanced audio quality
- **Transcription Models:**
  - GPT-4o Transcribe: the latest OpenAI speech recognition model
  - GPT-4o Mini Transcribe: a smaller, more efficient speech recognition model
  - Whisper-1: the legacy speech recognition model
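For reference, these options correspond to the public OpenAI API model identifiers below:

```python
# Public OpenAI API identifiers for the models listed above.
CHAT_MODELS = ["gpt-4o", "gpt-4o-mini", "gpt-4"]
TTS_MODELS = ["gpt-4o-mini-tts", "tts-1", "tts-1-hd"]
TRANSCRIPTION_MODELS = ["gpt-4o-transcribe", "gpt-4o-mini-transcribe", "whisper-1"]
```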
### Character-Specific Voice Instructions
One of the most powerful features in Enhanced mode is support for Voice Instructions - a revolutionary technology that allows for much more expressive and emotional text-to-speech:
- **Base Voice Instructions**: Each character file includes a set of base voice instructions that define how the character should sound by default.
- **Mood-Adaptive Voice**: The system analyzes the mood of the conversation and dynamically adjusts the voice characteristics to match the emotional context, making interactions feel more natural and responsive.
- **Structured Voice Formatting**: Voice instructions follow a specific format (see the example below) that controls:
  - Voice Quality: tone, timbre, and character of the voice
  - Pacing: speed, rhythm, and pauses
  - Pronunciation: emphasis on specific sounds or words
  - Emotion: emotional undertones in the voice
  - Inflection: rising or falling patterns in speech
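For illustration, here is a hypothetical set of base voice instructions following this structure. The field labels mirror the categories above; the exact schema in a given character file may differ:

```text
Voice Quality: Warm, slightly gravelly baritone
Pacing: Unhurried, with deliberate pauses before key points
Pronunciation: Crisp consonants, light emphasis on proper names
Emotion: Calm confidence with an undercurrent of dry humor
Inflection: Gentle falling intonation at the end of statements
```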
### Real-Time Voice Visualization
The Enhanced interface includes a dynamic voice visualization that appears when the AI is speaking, providing visual feedback that makes the conversation feel more interactive and engaging.
### WebSocket Communication
Enhanced mode uses WebSockets for real-time communication between the client and server, enabling:
- Instant response updates
- Real-time audio status indicators
- Seamless transitions between user input and AI responses
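As a rough illustration, a server can push these status updates over the socket as shown below. This is a minimal sketch assuming a FastAPI-style backend; the endpoint path and JSON payload shape are assumptions, not the project's actual handler:

```python
# Minimal sketch of server-side status events over a WebSocket, assuming a
# FastAPI backend. The endpoint path and payload shape are illustrative.
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/enhanced")  # hypothetical endpoint path
async def enhanced_session(ws: WebSocket):
    await ws.accept()
    # Signal that the system is ready for user input.
    await ws.send_json({"type": "waiting_for_speech"})
    while True:
        msg = await ws.receive_json()
        # ...dispatch on msg and emit recording/playback status events...
```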
## How It Works
1. **Character Selection**: Choose any available character from the dropdown menu.
2. **Model Configuration**: Select your preferred chat model, TTS model, and transcription model.
3. **Voice Selection**: Choose from 10 different base voices, which the voice instructions then modify.
4. **Start Conversation**: Click the "Start" button to begin your conversation.
5. **Natural Interaction**: Speak naturally, and the system will (see the pipeline sketch after this list):
   - Record your voice
   - Transcribe your speech to text using the selected transcription model
   - Analyze the emotional context of your message
   - Generate an AI response using the selected chat model
   - Apply character-specific voice instructions based on the detected mood
   - Convert the response to speech using the selected TTS model
   - Play back the audio with the appropriate voice characteristics
6. **Visual Feedback**: The interface provides real-time feedback:
   - The microphone icon changes color based on recording status
   - Animated dots appear while the system is listening
   - A voice wave animation displays while the AI is speaking
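Here is a condensed sketch of one such turn using the official `openai` Python SDK. Mood analysis and instruction merging are elided, and the specific model, voice, and function names are example choices, not the project's exact code:

```python
from openai import OpenAI

client = OpenAI()

def run_turn(audio_path: str, voice_instructions: str) -> bytes:
    # Transcribe the recorded speech with the selected transcription model.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="gpt-4o-transcribe", file=f
        )

    # Generate the character's reply with the selected chat model.
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # Synthesize speech; only gpt-4o-mini-tts accepts the instructions field.
    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="alloy",
        input=reply.choices[0].message.content,
        response_format="wav",
        instructions=voice_instructions,  # mood-adjusted in the real app
    )
    return speech.content  # raw audio bytes, ready for playback
```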
## Technical Details

### Voice Instructions Implementation
The Enhanced mode uses a sophisticated pipeline to process voice instructions:
- Parsing: Extracts structured voice instructions from character files
- Mood Detection: Analyzes user input to determine emotional context
- Instruction Merging: Combines base and mood-specific instructions
- API Integration: Sends formatted instructions to the GPT-4o Mini TTS model
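A minimal sketch of the merging step, assuming mood-specific overrides are stored per character as dictionaries; the field names and merge order here are hypothetical:

```python
# Hypothetical base instructions and per-mood overrides for one character.
BASE = {
    "voice_quality": "warm baritone",
    "pacing": "unhurried",
}
MOOD_OVERRIDES = {
    "excited": {"pacing": "quick, energetic bursts"},
    "sad": {"voice_quality": "soft, subdued", "pacing": "slow"},
}

def merge_instructions(mood: str) -> str:
    # Mood-specific fields take precedence over the base instructions.
    merged = {**BASE, **MOOD_OVERRIDES.get(mood, {})}
    # Flatten into the free-text instructions string the TTS API expects.
    return "\n".join(
        f"{key.replace('_', ' ').title()}: {value}"
        for key, value in merged.items()
    )

print(merge_instructions("sad"))
```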
### Audio Processing
The system uses optimized audio processing for lower latency:
- WAV format for faster processing
- Buffered audio streams for smoother playback
- Proper resource cleanup to prevent memory leaks
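A sketch of buffered WAV playback with explicit resource cleanup, assuming the PyAudio library; the project's actual audio stack may differ:

```python
import wave
import pyaudio

CHUNK = 1024  # frames per buffer; smaller chunks reduce latency

def play_wav(path: str) -> None:
    pa = pyaudio.PyAudio()
    try:
        with wave.open(path, "rb") as wf:
            stream = pa.open(
                format=pa.get_format_from_width(wf.getsampwidth()),
                channels=wf.getnchannels(),
                rate=wf.getframerate(),
                output=True,
            )
            try:
                data = wf.readframes(CHUNK)
                while data:
                    stream.write(data)  # blocking, buffered write
                    data = wf.readframes(CHUNK)
            finally:
                stream.stop_stream()
                stream.close()  # release the audio device
    finally:
        pa.terminate()  # prevent resource leaks
```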
### WebSocket Protocol
The WebSocket connection handles various event types:
- `waiting_for_speech`: System is ready for user input
- `recording_started`: User is speaking
- `recording_stopped`: Processing user input
- `audio_actually_playing`: AI response is being played
- `ai_stop_speaking`: Audio playback has completed
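For illustration, a client could dispatch on these event types as follows, using the third-party `websockets` package; the URL and payload shape are assumptions:

```python
import asyncio
import json
import websockets

async def listen(url: str = "ws://localhost:8000/ws/enhanced"):
    async with websockets.connect(url) as ws:
        async for raw in ws:
            event = json.loads(raw)
            # React to the status events enumerated above.
            if event["type"] == "waiting_for_speech":
                print("Ready - start speaking")
            elif event["type"] == "audio_actually_playing":
                print("AI is speaking...")
            elif event["type"] == "ai_stop_speaking":
                print("Playback finished")

if __name__ == "__main__":
    asyncio.run(listen())
```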
## Comparison to Standard Mode
| Feature | Standard Mode | Enhanced Mode |
|---|---|---|
| Voice Options | Basic voice selection | Advanced voice instructions with emotional adaptation |
| Model Selection | Limited options | Full suite of latest OpenAI models |
| Conversation Flow | Basic start/stop | Real-time status indicators and feedback |
| Audio Quality | Standard | Optimized for lower latency and better expression |
| Visual Feedback | Minimal | Dynamic microphone and speaking animations |