Echō
Enterprise-grade closed caption translation and natural voice dubbing. Upload media, get perfectly timed translations and AI-cloned voice dubs in minutes.
Localization pipelines at major studios involve 5+ fragmented tools, weeks of turnaround, and expensive manual linguist work for isometric dubbing — rewriting translations to match original speech timing. Voice casting alone can take days. The result: content launches in 1-2 languages and takes months to reach global audiences.
Drop in any video or audio file. Echō accepts MP4, MOV, MKV, WAV, MP3, and more.
ElevenLabs Scribe extracts speech with timestamps and speaker identification. Claude translates with cultural adaptation, regional dialect, and timing constraints.
ElevenLabs clones each speaker's voice and synthesizes the translation. Optionally detect on-screen text (signs, captions, UI) for VFX handoff. Export captions, dubbed audio, or dubbed video.
High-accuracy speech-to-text with word-level timestamps, speaker diarization, and 99+ language support. Powers the transcription pipeline.
Context-aware translation that preserves idioms, tone, and cultural nuance. Handles isometric adaptation — rewriting translated text to fit original timing windows.
Server-side vocal separation, automatic voice cloning per speaker, multilingual synthesis, and professional mixing — all in one API call. Background audio stays intact.
The secret sauce. Claude rewrites translated text to match the duration of each original speech segment, then ElevenLabs' Dubbing API handles timing and synthesis server-side — the same thing major studios pay linguists to do manually.
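For illustration, here is a minimal sketch of what an adaptation request can look like from the browser. The prompt wording, model name, and segment shape are assumptions, not Echō's exact implementation:

```typescript
// Hypothetical segment shape; Echō's internal types may differ.
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  translation: string;
}

// Ask Claude to rewrite a translation so it can be spoken within the
// original segment's duration, via Anthropic's Messages API.
async function adaptToTiming(seg: Segment, apiKey: string): Promise<string> {
  const duration = (seg.end - seg.start).toFixed(2);
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      // Required when calling the Anthropic API directly from a browser.
      "anthropic-dangerous-direct-browser-access": "true",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-20250514", // assumed model ID
      max_tokens: 300,
      messages: [{
        role: "user",
        content:
          `Rewrite this translation so it can be spoken naturally in ` +
          `${duration} seconds while preserving meaning and tone. ` +
          `Return only the rewritten text.\n\n${seg.translation}`,
      }],
    }),
  });
  const data = await res.json();
  return data.content[0].text;
}
```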
Optional on-screen text detection. Claude scans video keyframes to identify signs, captions, messages, and UI text, translating each for VFX compositing handoff — the forced-narrative workflow that usually requires dedicated VFX teams.
Studios currently juggle 5+ separate vendors and tools: transcription services, translation agencies, voice casting, recording studios, and manual QC. Echō consolidates everything into two APIs: ElevenLabs for transcription (Scribe), voice cloning, and full dubbing with background audio preserved, and Claude for intelligent translation and isometric adaptation. Voice cloning eliminates casting and recording for 80% of use cases, isometric adaptation replaces weeks of manual linguist work, and real-time preview eliminates the back-and-forth between translation and audio teams.
MP4, MOV, MKV, AVI, WebM, MP3, WAV, AAC, FLAC, OGG, M4A
SRT, VTT (WebVTT), SBV (YouTube), TTML, DFXP, JSON
MP3, WAV, AAC — individual segments or full mixed track
32 languages supported: English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Chinese, Japanese, Korean, Dutch, Russian, Turkish, Swedish, Indonesian, Filipino, Malay, Tamil, Ukrainian, Greek, Czech, Finnish, Croatian, Slovak, Danish, Bulgarian, Romanian, Hungarian, Norwegian, Vietnamese.
Follow these steps to go from raw media to fully translated captions and natural-sounding dubbed audio.
Echō connects to two AI services. Enter your keys in the sub-bar at the top of the page. Each key lights up green when configured.
Your keys are stored in your browser's localStorage only. They are never sent to BuilderBias servers — each key is sent directly to its respective API (Anthropic or ElevenLabs) over HTTPS.
Drag and drop a video or audio file onto the upload zone, or click to browse your files. Echō accepts all major formats:
Once uploaded, you'll see the file name, size, and type. Select your source language and target language from the dropdowns. All 32 supported languages are available for both source and target.
Click "Start Transcription" to send your audio to ElevenLabs Scribe. Scribe will:
- Extract all spoken words from the audio track with high accuracy
- Generate precise start and end timestamps for each word and segment
- Identify different speakers (speaker diarization)
- Detect the source language automatically if set to "Auto-detect"
When complete, you'll see the full transcript in a table with timecodes, speaker labels, and the original text. A stats bar shows total segments, duration, speakers detected, and word count. Video files have their audio automatically extracted before transcription.
Tip: For best results, use audio with minimal background noise. Scribe handles accents and multiple speakers well, but heavy music or sound effects can reduce accuracy.
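For reference, a minimal sketch of the transcription call. Endpoint, header, and field names follow ElevenLabs' public speech-to-text API as of this writing; verify against current docs:

```typescript
// Send a media file to ElevenLabs Scribe for transcription with
// word-level timestamps and speaker diarization.
async function transcribe(file: File, apiKey: string) {
  const form = new FormData();
  form.append("file", file);
  form.append("model_id", "scribe_v1"); // Scribe speech-to-text model
  form.append("diarize", "true");       // label individual speakers

  const res = await fetch("https://api.elevenlabs.io/v1/speech-to-text", {
    method: "POST",
    headers: { "xi-api-key": apiKey },
    body: form,
  });
  if (!res.ok) throw new Error(`Scribe error: ${res.status}`);
  return res.json(); // text, detected language, word-level timing data
}
```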
Click "Translate & Adapt" to send all segments to Claude. This is a two-part process:
The transcript table updates with translations shown in green and adapted versions in amber. The "Fit" column shows whether the adapted text fits the timing window — OK means it fits, a percentage shows how much longer it runs.
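The fit check itself can be approximated with a speaking-rate heuristic. The characters-per-second constant below is an assumption for illustration, not Echō's actual metric:

```typescript
// Rough speaking-rate heuristic: ~15 characters per second is a common
// ballpark for natural speech. This constant is an assumption.
const CHARS_PER_SECOND = 15;

// Returns "OK" if the adapted text should fit the timing window,
// otherwise the percentage by which it runs long (e.g. "+15%").
function fitLabel(text: string, startSec: number, endSec: number): string {
  const windowSec = endSec - startSec;
  const estimatedSec = text.length / CHARS_PER_SECOND;
  if (estimatedSec <= windowSec) return "OK";
  const overrun = Math.round((estimatedSec / windowSec - 1) * 100);
  return `+${overrun}%`;
}
```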
Click "Generate Voice Dub" to send your media to ElevenLabs' Dubbing API. This is the same technology used by professional studios:
- Vocal separation — ElevenLabs' ML models surgically remove original voices while keeping all background music, SFX, and ambience intact
- Automatic voice cloning — Each speaker's voice is cloned server-side so the dubbed voices match the original speakers
- Professional mixing — Dubbed voices are mixed back with the preserved background audio for a seamless final result
- Progress tracking — Watch the dubbing status update in real-time as ElevenLabs processes your media
This step is optional — if you only need translated captions, click "Skip to Export" to jump ahead.
Tip: The Dubbing API handles voice cloning automatically from your uploaded media. No need to upload voice samples separately — it extracts and clones each speaker's voice directly from the original audio.
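A sketch of the dubbing round trip (create job, poll status, download audio), assuming ElevenLabs' public Dubbing API paths; confirm parameters and status values against current docs:

```typescript
// Start a dubbing job, poll until done, then download the dubbed audio.
async function dubFile(file: File, targetLang: string, apiKey: string): Promise<Blob> {
  const form = new FormData();
  form.append("file", file);
  form.append("target_lang", targetLang);

  const start = await fetch("https://api.elevenlabs.io/v1/dubbing", {
    method: "POST",
    headers: { "xi-api-key": apiKey },
    body: form,
  });
  const { dubbing_id } = await start.json();

  // Poll job status until ElevenLabs reports the dub is ready.
  while (true) {
    const status = await fetch(
      `https://api.elevenlabs.io/v1/dubbing/${dubbing_id}`,
      { headers: { "xi-api-key": apiKey } },
    ).then((r) => r.json());
    if (status.status === "dubbed") break;
    if (status.status === "failed") throw new Error("Dubbing failed");
    await new Promise((r) => setTimeout(r, 5000)); // wait 5s between polls
  }

  // Fetch the dubbed track for the target language.
  const audio = await fetch(
    `https://api.elevenlabs.io/v1/dubbing/${dubbing_id}/audio/${targetLang}`,
    { headers: { "xi-api-key": apiKey } },
  );
  return audio.blob();
}
```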
After translation, click "👁 Detect On-Screen Text" to identify text that appears visually in the video (not spoken dialogue):
- Keyframe extraction — Echō samples up to 20 frames across the video duration to control API costs (see the sketch after this section)
- Vision analysis — Claude inspects each frame and identifies signs, captions, text messages, labels, chyrons, lower thirds, UI elements, and other on-screen text
- Translation & positioning — each text element gets translated, with its approximate screen position and text type (sign, caption, message, etc.) recorded
- Review table — edit any translation inline, delete false positives, and refine before exporting
- VFX handoff — export as an On-Screen Text Manifest (CSV + JSON) with timecodes and positions so your compositing team can do the final text replacement
This step is Phase 1 of on-screen text localization. Echō detects and translates; actual text replacement (inpainting the original + compositing the translation with matching style) happens in downstream VFX tools. Skip this step for audio-only files or content without on-screen text.
Tip: Cultural adaptation matters here too. A "Stop" sign doesn't just become "Alto" — for some markets it stays "Stop" because that's what the sign actually reads locally. Review each detection in context.
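A minimal sketch of the keyframe extraction step, done entirely in the browser with standard media APIs. The 20-frame cap matches the description above; everything else is illustrative:

```typescript
// Sample up to `maxFrames` evenly spaced frames from a video file and
// return them as JPEG data URLs, ready to send to a vision model.
async function sampleKeyframes(file: File, maxFrames = 20): Promise<string[]> {
  const video = document.createElement("video");
  video.src = URL.createObjectURL(file);
  video.muted = true;
  await new Promise((r) => (video.onloadedmetadata = r));

  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d")!;

  const frames: string[] = [];
  const step = video.duration / maxFrames;
  for (let i = 0; i < maxFrames; i++) {
    video.currentTime = i * step + step / 2; // mid-point of each interval
    await new Promise((r) => (video.onseeked = r));
    ctx.drawImage(video, 0, 0);
    frames.push(canvas.toDataURL("image/jpeg", 0.7));
  }
  URL.revokeObjectURL(video.src);
  return frames;
}
```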
The preview player appears after translation completes. Use it to review your work before exporting:
Use the scrubber to jump to any point in the timeline. Captions update in real-time as you scrub through the waveform.
Click any export card to download your translated content. Available formats:
If you only need translated captions (no voice dubbing), skip step 4 entirely. After translation, go straight to export. Both API keys are still needed — ElevenLabs for transcription and Claude for translation.
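For reference, the SRT mapping is mechanical once segments carry timestamps. A sketch of a caption exporter, with the segment shape assumed:

```typescript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(sec: number): string {
  const total = Math.round(sec * 1000);
  const ms = total % 1000;
  const s = Math.floor(total / 1000);
  const hh = String(Math.floor(s / 3600)).padStart(2, "0");
  const mm = String(Math.floor((s % 3600) / 60)).padStart(2, "0");
  const ss = String(s % 60).padStart(2, "0");
  return `${hh}:${mm}:${ss},${String(ms).padStart(3, "0")}`;
}

// Build an SRT document from translated segments.
function toSrt(segments: { start: number; end: number; text: string }[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text}\n`)
    .join("\n");
}
```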
ElevenLabs' Dubbing API automatically clones each speaker's voice from the original audio — no manual setup needed. For best results, use clear audio with minimal crosstalk between speakers.
As you progress through the pipeline, each completed step's action button turns into a green "✓ Complete" indicator. You can see at a glance what's been done without scrolling up. If you want to redo a step, remove the file and start fresh.
The on-screen text detection step appears after translation completes (video files only). It's entirely optional — skip it if your content has no significant on-screen text to localize. If you run it, use the results as a VFX handoff manifest rather than expecting automatic visual replacement.
Watch the "Fit" column after translation. Green "OK" means the adapted text fits the timing. A percentage like "+15%" means it runs slightly long — the voice synthesis will speak slightly faster to compensate, but you may want to manually shorten the text for the most natural result.
Your API keys and media files never touch BuilderBias servers. All API calls go directly from your browser to Anthropic and ElevenLabs over encrypted HTTPS connections. Keys are saved in localStorage so you don't have to re-enter them.
Traditional localization pipelines involve multiple vendors, weeks of coordination, and significant per-minute costs. Echō consolidates the entire workflow.
For a 90-minute feature film localized into 10 languages (900 localized minutes), traditional cost is approximately $135K–$225K, or roughly $150–$250 per localized minute, over 20–40 weeks. Echō estimates $450–$1,800 in API costs, roughly $0.50–$2 per localized minute, deliverable in days. Human QC review time is additional but integrated into the workflow.
AI dubbing is powerful but not infallible. Echō builds human review into every stage of the pipeline — not as an afterthought, but as a core design principle. Every automated decision has a checkpoint where a human can verify, override, or refine.
After Scribe transcribes, review the full transcript table before proceeding. Correct speaker labels, fix misheard words, and verify timecodes. The transcript is the foundation — errors here cascade downstream.
Each translated segment is displayed alongside the original. Click the edit button (✎) on any segment to review, modify, or provide instructions for re-translation. Critical for: cultural idioms, humor, brand-specific terminology, character names, and legal/compliance language that AI may handle inconsistently.
The "Fit" column flags segments where the translated text may not fit the original timing window. Green means it fits; a percentage shows how much longer it runs. Human reviewers can shorten or rephrase flagged segments before dubbing to ensure lip-sync quality.
After dubbing, click the play button (▶) on any segment to hear exactly how it sounds. If a specific segment sounds unnatural, has wrong emphasis, or mispronounces a proper noun, edit and re-dub just that segment without reprocessing the entire file.
The preview player lets you listen to the complete dubbed result with Original, Dubbed, and Side-by-Side modes before committing to an export. QC teams can scrub through the timeline, compare translations in context, and approve or flag issues.
AI translation can produce confident but incorrect results in edge cases. Always have a native speaker verify:
- Cultural references and humor (jokes don't translate literally)
- Legal/regulatory language (compliance text must be exact)
- Brand terminology (character names, product names, proprietary terms)
- Songs and poetry (rhythm and rhyme require creative adaptation)
- Sensitive content (violence, political, and religious references that vary by market)
- Proper nouns and place names (transliteration varies by target language)
Echō is architected to scale from proof-of-concept to enterprise deployment. The following capabilities are on the product roadmap for enterprise customers.
SAML/OIDC single sign-on integration. Role-based permissions for translators, reviewers, and admins with approval workflows.
Queue 100+ files for processing across multiple target languages simultaneously. Priority scheduling and progress dashboards.
API connectors for Media Asset Management systems (Dalet, Avid, Frame.io). Ingest from and deliver to existing content pipelines.
Preserve content ratings metadata through the pipeline. Profanity filtering and regional compliance rules per target market.
Full history of every translation decision, edit, and approval. Version control for dubbed assets with rollback capability.
Build and reuse glossaries of brand-specific terms, character names, and recurring phrases across projects for consistency.
Self-hosted option for studios with strict data sovereignty requirements. Content never leaves your infrastructure.
Dub a single piece of content into all 32 supported languages in one batch. Regional voice profiles and dialect support for key markets.
Security is built into Echō at every layer — from how API keys are handled to how content is transmitted.
No BuilderBias server sits between you and the APIs. All calls go directly from your browser to Anthropic and ElevenLabs over HTTPS. Your media files and translations never touch our infrastructure.
Keys are validated against each API on page load. Sessions automatically expire after 4 hours of inactivity — keys are cleared from localStorage to prevent stale credential exposure.
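An illustrative sketch of that expiry pattern. The storage schema is an assumption; only the 4-hour window comes from the description above:

```typescript
const FOUR_HOURS_MS = 4 * 60 * 60 * 1000;

// Store a key alongside a last-used timestamp.
function saveKey(name: string, value: string): void {
  localStorage.setItem(name, JSON.stringify({ value, lastUsed: Date.now() }));
}

// Read a key, clearing it if the inactivity window has elapsed.
function loadKey(name: string): string | null {
  const raw = localStorage.getItem(name);
  if (!raw) return null;
  const { value, lastUsed } = JSON.parse(raw);
  if (Date.now() - lastUsed > FOUR_HOURS_MS) {
    localStorage.removeItem(name); // stale credential: purge
    return null;
  }
  saveKey(name, value); // refresh the inactivity timer
  return value;
}
```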
API keys are automatically redacted from browser console output. If someone opens dev tools, sensitive credentials are masked in all log statements.
Right-click, view-source, and dev-tools keyboard shortcuts are disabled, and a dev-tools detector raises an alert when inspection is attempted.
Planned: AES-256 encryption for stored assets, DRM watermarking for dubbed content, IP allowlisting, and SOC 2 Type II compliance certification.
Following the 2023 SAG-AFTRA strike, studios have specific contractual obligations around AI voice use. Any enterprise deployment of Echō will implement a compliance framework covering consent, compensation, disclosure, and revocation. This is a roadmap requirement for production use with unionized talent.
Before any voice is cloned, the original actor (or their designated representative) must provide written consent specifying: the scope of use (which content, which languages, which territories), the term (finite time period, not perpetual), and the permitted derivative uses. Echō's enterprise version will include a consent management UI that captures, stores, and verifies these agreements before a clone can be created. No consent record → no clone.
Every voice in the character library will carry rights metadata: original actor name, consent document reference, union membership status, expiration date, permitted territories, and any performance restrictions. When a voice is used for dubbing, these rights are checked in real-time. Expired or out-of-scope uses are blocked at the API layer.
Each synthesis event (every time a cloned voice is used to generate dubbed content) is logged with full metadata: voice ID, project, duration generated, territory, and date. This usage log feeds into payroll/royalty systems to trigger residual payments per the SAG-AFTRA AI framework. Studios can generate quarterly reports for guild reporting requirements.
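Because this is roadmap functionality, the shapes below are purely illustrative of the checks described, not a shipped API:

```typescript
// Hypothetical rights record attached to each cloned voice.
interface VoiceRights {
  actorName: string;
  consentDocRef: string;
  unionMember: boolean;
  expiresAt: Date;
  permittedTerritories: string[]; // e.g. ["US", "EU"]
}

// Hypothetical synthesis log entry for royalty reporting.
interface SynthesisEvent {
  voiceId: string;
  project: string;
  secondsGenerated: number;
  territory: string;
  date: Date;
}

// Block synthesis when the rights record is expired or out of scope.
function assertUsable(rights: VoiceRights, territory: string): void {
  if (new Date() > rights.expiresAt) {
    throw new Error("Consent expired: clone blocked");
  }
  if (!rights.permittedTerritories.includes(territory)) {
    throw new Error(`Territory ${territory} not covered by consent`);
  }
}
```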
All Echō-generated dubbed content is marked with an internal "AI-generated" flag in metadata. For jurisdictions that require consumer disclosure (California AB 2602, EU AI Act, etc.), the flag surfaces in exported files (TTML metadata, sidecar files, and platform-specific disclosure markers). Studios can apply custom disclosure language per region.
Actors can revoke consent at any time. When a revocation is received, the voice profile is marked as archived — the clone can no longer be used for new synthesis, though existing delivered content remains per contract. The character voice library supports versioning, so historical uses are traceable and auditable.
The framework distinguishes between two use cases: performance extension (using an actor's voice for additional takes, languages, or ADR with their consent and payment) vs. replacement (using an AI voice in place of hiring an actor). The former is contemplated by the SAG-AFTRA framework; the latter requires explicit contractual provisions and generally higher compensation tiers. Echō's tooling supports both modes with distinct audit trails.
⚠ Current browser-based Echō is a proof-of-concept and does not enforce these compliance controls. Before any production use with unionized talent or pre-release content, the enterprise version with the full compliance framework must be deployed. Studios should work with their guild counsel and talent rights teams to operationalize consent and compensation workflows.
Songs are explicitly out of scope for AI dubbing at broadcast/theatrical quality. Echō flags musical content and routes it to a human specialist workflow — the same process major studios use for animated musicals and feature films.
During translation, Claude identifies segments that are sung or musically performed (lyrics, chanting, rhymed/metered delivery) and flags them with an isSong marker.
Reviewers can toggle the song flag on any segment using the music note button (🎵) in the transcript or the checkbox in the segment edit modal. Useful for edge cases like rap, spoken word, or instrumental score with voiceover.
Song segments are visually highlighted in the transcript and excluded from AI voice synthesis. The Dubbing API still processes the file, but song segments are marked "skip" and should be preserved in their original form or replaced via the human specialist track.
Export a Song Manifest (CSV + JSON) listing every flagged musical segment with timecode, original lyrics, and machine translation for reference. This manifest is handed off to native-speaking lyricists and performing artists in the target market who re-write the lyrics preserving meter, rhyme, and cultural resonance — then record performances that are mixed back into the final deliverable.
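A sketch of the CSV half of that manifest. Column names are assumptions; escaping is kept minimal for brevity:

```typescript
interface SongSegment {
  start: number;       // seconds
  end: number;         // seconds
  original: string;    // original lyrics
  translation: string; // machine translation, for reference only
  isSong: boolean;     // set by Claude or toggled by a reviewer
}

// Build the CSV half of the Song Manifest from flagged segments.
function songManifestCsv(segments: SongSegment[]): string {
  const esc = (s: string) => `"${s.replace(/"/g, '""')}"`;
  const rows = segments
    .filter((s) => s.isSong)
    .map((s) => [s.start, s.end, esc(s.original), esc(s.translation)].join(","));
  return ["start_sec,end_sec,original_lyrics,reference_translation", ...rows].join("\n");
}
```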
| Time | Speaker | Original | Translation | Fit | Dub | Actions |
|---|---|---|---|---|---|---|
Text detected in video frames (signs, captions, messages, labels). Review and edit translations before export. Final text replacement requires VFX compositing downstream.
| Time | Type | Original Text | Translation | Actions |
|---|---|---|---|---|