ElevenLabs
Overview
ElevenLabs is a specialized audio provider for text-to-speech and speech-to-text. Call it through DeepIntShield using the standard /v1/audio/speech and /v1/audio/transcriptions endpoints. Capabilities you can use:
- Voice configuration - control stability, similarity boost, speaker boost, speed, and style
- Speech formats - MP3, Opus, and PCM/WAV output
- Timestamps - character-level timing alignment for TTS
- Transcription - word- and character-level timing, speaker diarization, and additional output formats (SRT, etc.)
- Pronunciation - custom pronunciation dictionaries
Supported Operations
| Operation | Non-Streaming | Streaming | Provider Endpoint |
|---|---|---|---|
| Speech (TTS) | ✅ | ✅ | /v1/text-to-speech/{voice_id} |
| Transcriptions (STT) | ✅ | - | /v1/speech-to-text |
| List Models | ✅ | - | /v1/models |
| Chat Completions | ❌ | ❌ | - |
| Responses API | ❌ | ❌ | - |
| Text Completions | ❌ | ❌ | - |
| Embeddings | ❌ | ❌ | - |
| Image Generation | ❌ | ❌ | - |
1. Speech (Text-to-Speech)
Request Parameters
Core Parameters
| Parameter | Notes |
|---|---|
| input | The text to convert to speech (required) |
| model | Model identifier (e.g., "eleven_multilingual_v2") |
| voice | Voice ID (required) |
| response_format | Speech format (see Response Format) |
Voice Configuration
Voice settings are optional:
| Parameter | Default | Range |
|---|---|---|
| speed | 1.0 | 0.5-2.0 |
| stability | 0.5 | 0-1.0 |
| similarity_boost | 0.75 | 0-1.0 |
| use_speaker_boost | true | boolean |
| style | 0 | 0-1.0 |
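The ranges in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of DeepIntShield: `validate_voice_settings` is a hypothetical helper, and treating the range bounds as inclusive is an assumption.

```python
from typing import Any, Dict, List

# Allowed ranges copied from the Voice Configuration table.
# ASSUMPTION: both ends of each range are inclusive.
NUMERIC_RANGES = {
    "speed": (0.5, 2.0),
    "stability": (0.0, 1.0),
    "similarity_boost": (0.0, 1.0),
    "style": (0.0, 1.0),
}


def validate_voice_settings(settings: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    for name, value in settings.items():
        if name == "use_speaker_boost":
            if not isinstance(value, bool):
                problems.append(f"{name} must be a boolean")
        elif name in NUMERIC_RANGES:
            low, high = NUMERIC_RANGES[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{name} must be a number")
            elif not low <= value <= high:
                problems.append(f"{name} must be between {low} and {high}")
        else:
            problems.append(f"{name} is not a voice setting")
    return problems


if __name__ == "__main__":
    print(validate_voice_settings({"speed": 1.0, "stability": 0.5}))
    print(validate_voice_settings({"speed": 3.0, "use_speaker_boost": "yes"}))
```

Returning a list of messages rather than raising on the first problem lets a caller report every invalid setting at once.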
Advanced Parameters
ElevenLabs-specific TTS features can be passed directly in the request body:

    curl -X POST https://app.deepintshield.com/v1/audio/speech \
      -H "Authorization: Bearer sk-bf-your-virtual-key" \
      -H "Content-Type: application/json" \
      -d '{
        "model": "eleven_multilingual_v2",
        "input": "Hello, how are you?",
        "voice": "21m00Tcm4TlvDq8ikWAM",
        "response_format": "mp3",
        "stability": 0.5,
        "similarity_boost": 0.75,
        "use_speaker_boost": true,
        "style": 0,
        "speed": 1.0,
        "language_code": "en",
        "seed": 42,
        "previous_text": "Context text",
        "next_text": "Future context",
        "apply_text_normalization": "auto"
      }'

Advanced TTS Parameters
| Parameter | Type | Description |
|---|---|---|
| language_code | string | Language code (e.g., "en", "es") |
| seed | integer | Reproducible output (0-4294967295) |
| previous_text | string | Previous text context for consistency |
| next_text | string | Next text context for consistency |
| previous_request_ids | string[] | Previous request IDs for continuity |
| next_request_ids | string[] | Next request IDs for continuity |
| apply_text_normalization | string | Text normalization mode: "auto", "on", "off" |
| apply_language_text_normalization | boolean | Apply language-specific text normalization |
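For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a sketch under stated assumptions: `build_speech_request` is a hypothetical helper, the base URL and key format are copied from the curl example, and defaulting `model` to "eleven_multilingual_v2" is a choice made here for illustration. The function only builds the request; it does not send it.

```python
import json
import urllib.request
from typing import Any

# Base URL copied from the curl example in this section.
BASE_URL = "https://app.deepintshield.com"


def build_speech_request(api_key: str, voice: str, text: str,
                         model: str = "eleven_multilingual_v2",
                         **extra: Any) -> urllib.request.Request:
    """Build (but do not send) a POST request for /v1/audio/speech."""
    if not voice:
        raise ValueError("voice is required")
    if not text:
        raise ValueError("input is required")
    body = {"model": model, "input": text, "voice": voice}
    # Optional fields such as seed or previous_text go in the same JSON body.
    body.update(extra)
    return urllib.request.Request(
        f"{BASE_URL}/v1/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_speech_request(
        "sk-bf-your-virtual-key", "21m00Tcm4TlvDq8ikWAM", "Hello, how are you?",
        seed=42, previous_text="Context text", apply_text_normalization="auto",
    )
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To send it: urllib.request.urlopen(req).read()
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any network call happens.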
Response Format
| Format | Output | Quality | Bitrate or Bit Depth @ Sample Rate |
|---|---|---|---|
| mp3 | MP3 | High | 128 kbps @ 44100 Hz |
| opus | Opus | High | 128 kbps @ 48000 Hz |
| wav / pcm | PCM WAV | Lossless | 16-bit @ 44100 Hz |
Timestamps Support
To get character-level timing alignment, enable with_timestamps:

    { "with_timestamps": true }

When enabled, the endpoint /v1/text-to-speech/{voice_id}/with-timestamps is used and the response includes:
- audio_base64 - Audio data as a base64-encoded string
- alignment.char_start_times_ms - Character start times in milliseconds
- alignment.char_end_times_ms - Character end times in milliseconds
- alignment.characters - Array of characters
- normalized_alignment - Same as alignment, but for the normalized text
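A client usually wants the raw audio bytes plus one timing record per character. The sketch below pairs up the three alignment arrays listed above; `decode_timestamped_speech` is a hypothetical helper, and the sample dictionary is hand-written in the documented shape rather than real API output.

```python
import base64
from typing import Any, Dict, List, Tuple


def decode_timestamped_speech(
    response: Dict[str, Any],
) -> Tuple[bytes, List[Tuple[str, int, int]]]:
    """Return the audio bytes and a list of (character, start_ms, end_ms)."""
    audio = base64.b64decode(response["audio_base64"])
    alignment = response["alignment"]
    chars = alignment["characters"]
    starts = alignment["char_start_times_ms"]
    ends = alignment["char_end_times_ms"]
    # The three arrays are parallel, so their lengths must match.
    if not len(chars) == len(starts) == len(ends):
        raise ValueError("alignment arrays have different lengths")
    return audio, list(zip(chars, starts, ends))


if __name__ == "__main__":
    # Hand-written sample in the documented shape, not real API output.
    sample = {
        "audio_base64": base64.b64encode(b"fake-audio").decode("ascii"),
        "alignment": {
            "characters": ["H", "i"],
            "char_start_times_ms": [0, 150],
            "char_end_times_ms": [150, 280],
        },
    }
    audio, timings = decode_timestamped_speech(sample)
    print(len(audio), timings)
```

The same function works on normalized_alignment if that object is passed in place of alignment, since both share the same three arrays.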
Response
Non-Timestamp Response

    { "audio": "<binary audio data>" }

Timestamp Response

    {
      "audio_base64": "<base64 encoded audio>",
      "alignment": {
        "char_start_times_ms": [0, 150, 280, ...],
        "char_end_times_ms": [150, 280, 420, ...],
        "characters": ["H", "e", "l", "l", "o", ...]
      },
      "normalized_alignment": {
        "char_start_times_ms": [...],
        "char_end_times_ms": [...],
        "characters": [...]
      }
    }

Streaming
Streaming speech returns audio in chunks as they are generated:

    { "type": "audio.delta", "audio": "<binary audio chunk>" }

Final chunk:

    { "type": "audio.done" }

2. Transcription (Speech-to-Text)
Request Parameters
Input Source
Choose exactly one of the following (they are mutually exclusive):
| Parameter | Type | Description |
|---|---|---|
| file | bytes | Audio file content (WAV, MP3, etc.) |
| cloud_storage_url | string | URL to cloud-hosted audio file |
Error: Providing both or neither results in an error.
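The exactly-one rule is cheap to enforce before the request leaves the client. A minimal sketch, assuming a hypothetical helper named `transcription_source`; the error message is invented here and is not the gateway's own wording.

```python
from typing import Dict, Optional


def transcription_source(file_bytes: Optional[bytes] = None,
                         cloud_storage_url: Optional[str] = None) -> Dict[str, object]:
    """Return the single input-source field, or raise if zero or two are set."""
    has_file = file_bytes is not None
    has_url = cloud_storage_url is not None
    # Equal flags mean either both were given or neither was.
    if has_file == has_url:
        raise ValueError("provide exactly one of file or cloud_storage_url")
    if has_file:
        return {"file": file_bytes}
    return {"cloud_storage_url": cloud_storage_url}


if __name__ == "__main__":
    print(transcription_source(cloud_storage_url="https://example.com/audio.wav"))
```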
Core Parameters
| Parameter | Description |
|---|---|
| model | Model identifier (required) |
| language | Language code (ISO 639-1, e.g., "en") |
Advanced Parameters
Transcription-specific features can be passed directly as form fields:

    curl -X POST https://app.deepintshield.com/v1/audio/transcriptions \
      -H "Authorization: Bearer sk-bf-your-virtual-key" \
      -F "file=@audio.wav" \
      -F "model=eleven_latest" \
      -F "language_code=en" \
      -F "tag_audio_events=true" \
      -F "num_speakers=2" \
      -F "timestamps_granularity=word" \
      -F "diarize=true" \
      -F "diarization_threshold=0.5" \
      -F "temperature=0.1" \
      -F "seed=42" \
      -F "use_multi_channel=true" \
      -F "webhook=true" \
      -F "webhook_id=webhook-123"

Transcription Options
| Parameter | Type | Description |
|---|---|---|
| tag_audio_events | boolean | Tag audio events (background noise, music, etc.) |
| num_speakers | integer | Expected number of speakers (for diarization) |
| timestamps_granularity | string | Timestamp level: "none", "word", "character" |
| diarize | boolean | Identify different speakers |
| diarization_threshold | float | Speaker diarization sensitivity (0.0-1.0) |
| file_format | string | Input format: "pcm_s16le_16", "other" |
| temperature | float | Transcription temperature (0.0-1.0) |
| seed | integer | Reproducible transcription |
| use_multi_channel | boolean | Process multi-channel audio separately |
| webhook | boolean | Enable webhook for async processing |
| webhook_id | string | Webhook endpoint ID |
| webhook_metadata | object/string | Additional webhook metadata |
| cloud_storage_url | string | URL to cloud-hosted audio (alternative to file) |
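Because transcription parameters travel as form fields rather than JSON, a Python client without a third-party HTTP library has to encode the multipart body itself. The sketch below shows one way to do that with the standard library. `form_value` and `encode_transcription_form` are hypothetical helpers, and sending the file part as application/octet-stream is an assumption.

```python
import uuid
from typing import Any, Dict, Tuple


def form_value(value: Any) -> str:
    """Render a Python value the way the curl example writes form fields."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def encode_transcription_form(fields: Dict[str, Any], file_name: str,
                              file_bytes: bytes) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data request."""
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        header = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
        )
        chunks.append(header.encode("utf-8") + form_value(value).encode("utf-8") + b"\r\n")
    # ASSUMPTION: a generic binary content type is acceptable for the audio part.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    # The closing delimiter carries two trailing hyphens.
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


if __name__ == "__main__":
    body, content_type = encode_transcription_form(
        {"model": "eleven_latest", "diarize": True, "num_speakers": 2},
        "audio.wav", b"RIFF....",
    )
    print(content_type)
    print(len(body), "bytes")
```

The returned content type must be sent as the Content-Type header, since it carries the boundary the server needs to split the parts.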
Additional Formats
Request multiple output formats simultaneously:

    {
      "additional_formats": [
        {
          "format": "segmented_json",
          "include_speakers": true,
          "include_timestamps": true,
          "segment_on_silence_longer_than_s": 1.0,
          "max_segment_duration_s": 30.0
        },
        {
          "format": "srt",
          "max_segment_duration_s": 30.0
        }
      ]
    }

Supported formats: segmented_json, docx, pdf, txt, html, srt
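Since only six format names are accepted, it can help to build each entry through a small function that rejects anything else. This is an illustrative sketch: `additional_format` is a hypothetical helper, and how the resulting array is attached to the multipart request is not specified in this page, so the sketch stops at producing the JSON structure.

```python
import json
from typing import Any, Dict

# Format names listed under "Supported formats" in this section.
SUPPORTED_FORMATS = {"segmented_json", "docx", "pdf", "txt", "html", "srt"}


def additional_format(fmt: str, **options: Any) -> Dict[str, Any]:
    """Build one entry for the additional_formats array."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    return {"format": fmt, **options}


if __name__ == "__main__":
    payload = {
        "additional_formats": [
            additional_format("segmented_json", include_speakers=True,
                              include_timestamps=True),
            additional_format("srt", max_segment_duration_s=30.0),
        ]
    }
    print(json.dumps(payload, indent=2))
```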
Response
Basic Transcription

    {
      "transcript": {
        "language_code": "en",
        "language_probability": 0.95,
        "text": "Full transcribed text...",
        "words": [
          {
            "text": "Hello",
            "start": 0.0,
            "end": 0.5,
            "type": "word",
            "speaker_id": "speaker_1",
            "logprob": -0.05
          }
        ]
      }
    }

With Diarization
When diarize: true, the response includes speaker identification:

    {
      "transcript": {
        "text": "Hello how are you?",
        "words": [
          { "text": "Hello", "speaker_id": "speaker_1" },
          { "text": "how", "speaker_id": "speaker_2" }
        ]
      }
    }

With Timestamps
When timestamps_granularity is "character", each word includes character-level timing:

    {
      "words": [
        {
          "text": "Hello",
          "characters": [
            {"text": "H", "start": 0.0, "end": 0.1},
            {"text": "e", "start": 0.1, "end": 0.2}
          ]
        }
      ]
    }

With Additional Formats

    {
      "transcript": { ... },
      "additional_formats": [
        {
          "requested_format": "srt",
          "file_extension": "srt",
          "content_type": "text/plain",
          "is_base64_encoded": false,
          "content": "1\n00:00:00,000 --> 00:00:01,000\nHello\n\n2\n..."
        }
      ]
    }

Caveats
Voice ID Required
Severity: High
Behavior: Voice ID must be provided for TTS requests
Impact: Request fails without voice configuration
File or URL Required for Transcription
Severity: High
Behavior: Exactly one of file or cloud_storage_url must be provided
Impact: Request fails if both or neither are supplied
Audio Format Conversion
Severity: Low
Behavior: response_format values (mp3, opus, wav/pcm) are mapped to a provider format string
Impact: The format is passed to the provider endpoint as a query-string parameter
Timestamps as Separate Endpoint
Severity: Low
Behavior: Timestamp requests use /with-timestamps endpoint variant
Impact: Switches endpoint based on with_timestamps flag
Multipart Form Data for Transcription
Severity: Low
Behavior: Transcription uses multipart/form-data, not JSON
Impact: File and parameters sent as form fields
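The routing behavior described in the TTS caveats (required voice ID, format passed as a query string, and the /with-timestamps variant) can be pictured as a small path-building function. This is an illustrative sketch only, not the gateway's implementation: `provider_tts_path` is a hypothetical helper, the `output_format` query parameter name is an assumption, and the mapped format strings are guesses derived from the Response Format table.

```python
from urllib.parse import quote, urlencode

# ASSUMPTION: illustrative mapping based on the Response Format table; the
# exact strings sent to the provider are not given in this page.
OUTPUT_FORMATS = {
    "mp3": "mp3_44100_128",
    "opus": "opus_48000_128",
    "wav": "pcm_44100",
    "pcm": "pcm_44100",
}


def provider_tts_path(voice_id: str, response_format: str = "mp3",
                      with_timestamps: bool = False) -> str:
    """Return the provider path and query string for one TTS request."""
    if not voice_id:
        raise ValueError("voice ID is required for TTS requests")
    if response_format not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    path = f"/v1/text-to-speech/{quote(voice_id, safe='')}"
    # The with_timestamps flag switches to the endpoint variant.
    if with_timestamps:
        path += "/with-timestamps"
    # ASSUMPTION: the query parameter is named output_format.
    query = urlencode({"output_format": OUTPUT_FORMATS[response_format]})
    return f"{path}?{query}"


if __name__ == "__main__":
    print(provider_tts_path("21m00Tcm4TlvDq8ikWAM"))
    print(provider_tts_path("21m00Tcm4TlvDq8ikWAM", "opus", with_timestamps=True))
```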
3. List Models
Request Parameters
| Parameter | Type | Description |
|---|---|---|
| (none) | - | No parameters required |
Returns available models with their capabilities and language support.
Response

    {
      "models": [
        {
          "model_id": "eleven_multilingual_v2",
          "name": "Eleven Multilingual v2",
          "description": "Multilingual speech synthesis",
          "serves_pro_voices": true,
          "token_cost_factor": 1.0,
          "can_do_text_to_speech": true,
          "can_do_voice_conversion": true,
          "can_use_style": true,
          "can_use_speaker_boost": true,
          "languages": [
            {"language_id": "en", "name": "English"},
            {"language_id": "es", "name": "Spanish"}
          ],
          "requires_alpha_access": false,
          "max_characters_request_free_user": 1000,
          "max_characters_request_subscribed_user": 100000,
          "maximum_text_length_per_request": 5000,
          "model_rates": {
            "character_cost_multiplier": 1.0
          }
        }
      ]
    }
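A typical use of this response is picking a model that supports text-to-speech in a given language. The sketch below filters on the can_do_text_to_speech and languages fields shown in the response; `tts_models_for_language` is a hypothetical helper and the sample data is trimmed from the example response, not fetched from the API.

```python
from typing import Any, Dict, List


def tts_models_for_language(response: Dict[str, Any], language_id: str) -> List[str]:
    """Return model_ids that can do text-to-speech in the given language."""
    matches = []
    for model in response.get("models", []):
        languages = {lang["language_id"] for lang in model.get("languages", [])}
        if model.get("can_do_text_to_speech") and language_id in languages:
            matches.append(model["model_id"])
    return matches


if __name__ == "__main__":
    # Trimmed copy of the example response in this section.
    sample = {
        "models": [
            {
                "model_id": "eleven_multilingual_v2",
                "can_do_text_to_speech": True,
                "languages": [
                    {"language_id": "en", "name": "English"},
                    {"language_id": "es", "name": "Spanish"},
                ],
            }
        ]
    }
    print(tts_models_for_language(sample, "es"))
    print(tts_models_for_language(sample, "de"))
```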