Echō
Enterprise-grade closed caption translation and natural voice dubbing. Upload media, get perfectly timed translations and AI-cloned voice dubs in minutes.
Localization pipelines at major studios involve 5+ fragmented tools, weeks of turnaround, and expensive manual linguist work for isometric dubbing — rewriting translations to match original speech timing. Voice casting alone can take days. The result: content launches in 1-2 languages and takes months to reach global audiences.
Drop in any video or audio file. Echō accepts MP4, MOV, MKV, WAV, MP3, and more.
ElevenLabs Scribe extracts speech with timestamps and speaker identification. Claude translates with cultural adaptation, regional dialect, and timing constraints.
ElevenLabs clones each speaker's voice and synthesizes the translation. Optionally detect on-screen text (signs, captions, UI) for VFX handoff. Export captions, dubbed audio, or dubbed video.
High-accuracy speech-to-text with word-level timestamps, speaker diarization, and 99+ language support. Powers the transcription pipeline.
Context-aware translation that preserves idioms, tone, and cultural nuance. Handles isometric adaptation — rewriting translated text to fit original timing windows.
Server-side vocal separation, automatic voice cloning per speaker, multilingual synthesis, and professional mixing — all in one API call. Background audio stays intact.
The secret sauce. Claude rewrites translated text to match the duration of each original speech segment, then ElevenLabs' Dubbing API handles timing and synthesis server-side — the same thing major studios pay linguists to do manually.
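For illustration, here is a minimal sketch of what an adaptation request can look like from the browser. The prompt wording, model name, and segment shape are assumptions, not Echō's exact implementation:

```typescript
// Hypothetical segment shape; Echō's internal types may differ.
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  translation: string;
}

// Ask Claude to rewrite a translation so it can be spoken within the
// original segment's duration, via Anthropic's Messages API.
async function adaptToTiming(seg: Segment, apiKey: string): Promise<string> {
  const duration = (seg.end - seg.start).toFixed(2);
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": apiKey,
      "anthropic-version": "2023-06-01",
      // Required when calling the Anthropic API directly from a browser.
      "anthropic-dangerous-direct-browser-access": "true",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-sonnet-4-20250514", // assumed model ID
      max_tokens: 300,
      messages: [{
        role: "user",
        content:
          `Rewrite this translation so it can be spoken naturally in ` +
          `${duration} seconds while preserving meaning and tone. ` +
          `Return only the rewritten text.\n\n${seg.translation}`,
      }],
    }),
  });
  const data = await res.json();
  return data.content[0].text;
}
```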
Optional on-screen text detection. Claude scans video keyframes to identify signs, captions, messages, and UI text, translating each for VFX compositing handoff — the forced-narrative workflow that usually requires dedicated VFX teams.
Studios currently juggle 5+ separate vendors and tools: transcription services, translation agencies, voice casting, recording studios, and manual QC. Echō consolidates everything into two APIs: ElevenLabs for transcription (Scribe), voice cloning, and full dubbing with background audio preserved, and Claude for intelligent translation and isometric adaptation. Voice cloning eliminates casting and recording for 80% of use cases, isometric adaptation replaces weeks of manual linguist work, and real-time preview eliminates the back-and-forth between translation and audio teams.
MP4, MOV, MKV, AVI, WebM, MP3, WAV, AAC, FLAC, OGG, M4A
SRT, VTT (WebVTT), SBV (YouTube), TTML, DFXP, JSON
MP3, WAV, AAC — individual segments or full mixed track
32 languages supported: English, Spanish, French, German, Italian, Portuguese, Polish, Hindi, Arabic, Chinese, Japanese, Korean, Dutch, Russian, Turkish, Swedish, Indonesian, Filipino, Malay, Tamil, Ukrainian, Greek, Czech, Finnish, Croatian, Slovak, Danish, Bulgarian, Romanian, Hungarian, Norwegian, Vietnamese.
Follow these steps to go from raw media to fully translated captions and natural-sounding dubbed audio.
Echō connects to two AI services. Enter your keys in the sub-bar at the top of the page. Each key lights up green when configured.
Your keys are stored in your browser's localStorage only. They are never sent to BuilderBias servers — each key is sent directly to its respective API (Anthropic or ElevenLabs) over HTTPS.
Drag and drop a video or audio file onto the upload zone, or click to browse your files. Echō accepts all major formats:
Once uploaded, you'll see the file name, size, and type. Select your source language and target language from the dropdowns. All 32 supported languages are available for both source and target.
Click "Start Transcription" to send your audio to ElevenLabs Scribe. Scribe will:
- Extract all spoken words from the audio track with high accuracy
- Generate precise start and end timestamps for each word and segment
- Identify different speakers (speaker diarization)
- Detect the source language automatically if set to "Auto-detect"
When complete, you'll see the full transcript in a table with timecodes, speaker labels, and the original text. A stats bar shows total segments, duration, speakers detected, and word count. Video files have their audio automatically extracted before transcription.
Tip: For best results, use audio with minimal background noise. Scribe handles accents and multiple speakers well, but heavy music or sound effects can reduce accuracy.
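For reference, a minimal sketch of the transcription call. Endpoint, header, and field names follow ElevenLabs' public speech-to-text API as of this writing; verify against current docs:

```typescript
// Send a media file to ElevenLabs Scribe for transcription with
// word-level timestamps and speaker diarization.
async function transcribe(file: File, apiKey: string) {
  const form = new FormData();
  form.append("file", file);
  form.append("model_id", "scribe_v1"); // Scribe speech-to-text model
  form.append("diarize", "true");       // label individual speakers

  const res = await fetch("https://api.elevenlabs.io/v1/speech-to-text", {
    method: "POST",
    headers: { "xi-api-key": apiKey },
    body: form,
  });
  if (!res.ok) throw new Error(`Scribe error: ${res.status}`);
  return res.json(); // text, detected language, word-level timing data
}
```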
Click "Translate & Adapt" to send all segments to Claude. This is a two-part process:
The transcript table updates with translations shown in green and adapted versions in amber. The "Fit" column shows whether the adapted text fits the timing window — OK means it fits, a percentage shows how much longer it runs.
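The fit check itself can be approximated with a speaking-rate heuristic. The characters-per-second constant below is an assumption for illustration, not Echō's actual metric:

```typescript
// Rough speaking-rate heuristic: ~15 characters per second is a common
// ballpark for natural speech. This constant is an assumption.
const CHARS_PER_SECOND = 15;

// Returns "OK" if the adapted text should fit the timing window,
// otherwise the percentage by which it runs long (e.g. "+15%").
function fitLabel(text: string, startSec: number, endSec: number): string {
  const windowSec = endSec - startSec;
  const estimatedSec = text.length / CHARS_PER_SECOND;
  if (estimatedSec <= windowSec) return "OK";
  const overrun = Math.round((estimatedSec / windowSec - 1) * 100);
  return `+${overrun}%`;
}
```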
Click "Generate Voice Dub" to send your media to ElevenLabs' Dubbing API. This is the same technology used by professional studios:
- Vocal separation — ElevenLabs' ML models surgically remove original voices while keeping all background music, SFX, and ambience intact
- Automatic voice cloning — Each speaker's voice is cloned server-side so the dubbed voices match the original speakers
- Professional mixing — Dubbed voices are mixed back with the preserved background audio for a seamless final result
- Progress tracking — Watch the dubbing status update in real-time as ElevenLabs processes your media
This step is optional — if you only need translated captions, click "Skip to Export" to jump ahead.
Tip: The Dubbing API handles voice cloning automatically from your uploaded media. No need to upload voice samples separately — it extracts and clones each speaker's voice directly from the original audio.
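A sketch of the dubbing round trip (create job, poll status, download audio), assuming ElevenLabs' public Dubbing API paths; confirm parameters and status values against current docs:

```typescript
// Start a dubbing job, poll until done, then download the dubbed audio.
async function dubFile(file: File, targetLang: string, apiKey: string): Promise<Blob> {
  const form = new FormData();
  form.append("file", file);
  form.append("target_lang", targetLang);

  const start = await fetch("https://api.elevenlabs.io/v1/dubbing", {
    method: "POST",
    headers: { "xi-api-key": apiKey },
    body: form,
  });
  const { dubbing_id } = await start.json();

  // Poll job status until ElevenLabs reports the dub is ready.
  while (true) {
    const status = await fetch(
      `https://api.elevenlabs.io/v1/dubbing/${dubbing_id}`,
      { headers: { "xi-api-key": apiKey } },
    ).then((r) => r.json());
    if (status.status === "dubbed") break;
    if (status.status === "failed") throw new Error("Dubbing failed");
    await new Promise((r) => setTimeout(r, 5000)); // wait 5s between polls
  }

  // Fetch the dubbed track for the target language.
  const audio = await fetch(
    `https://api.elevenlabs.io/v1/dubbing/${dubbing_id}/audio/${targetLang}`,
    { headers: { "xi-api-key": apiKey } },
  );
  return audio.blob();
}
```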
After translation, click "👁 Detect On-Screen Text" to identify text that appears visually in the video (not spoken dialogue):
- Keyframe extraction — Echō samples up to 20 frames across the video duration to control API costs (see the sketch after this section)
- Vision analysis — Claude inspects each frame and identifies signs, captions, text messages, labels, chyrons, lower thirds, UI elements, and other on-screen text
- Translation & positioning — each text element gets translated, with its approximate screen position and text type (sign, caption, message, etc.) recorded
- Review table — edit any translation inline, delete false positives, and refine before exporting
- VFX handoff — export as an On-Screen Text Manifest (CSV + JSON) with timecodes and positions so your compositing team can do the final text replacement
This step is Phase 1 of on-screen text localization. Echō detects and translates; actual text replacement (inpainting the original + compositing the translation with matching style) happens in downstream VFX tools. Skip this step for audio-only files or content without on-screen text.
Tip: Cultural adaptation matters here too. A "Stop" sign doesn't just become "Alto" — for some markets it stays "Stop" because that's what the sign actually reads locally. Review each detection in context.
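A minimal sketch of the keyframe extraction step, done entirely in the browser with standard media APIs. The 20-frame cap matches the description above; everything else is illustrative:

```typescript
// Sample up to `maxFrames` evenly spaced frames from a video file and
// return them as JPEG data URLs, ready to send to a vision model.
async function sampleKeyframes(file: File, maxFrames = 20): Promise<string[]> {
  const video = document.createElement("video");
  video.src = URL.createObjectURL(file);
  video.muted = true;
  await new Promise((r) => (video.onloadedmetadata = r));

  const canvas = document.createElement("canvas");
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const ctx = canvas.getContext("2d")!;

  const frames: string[] = [];
  const step = video.duration / maxFrames;
  for (let i = 0; i < maxFrames; i++) {
    video.currentTime = i * step + step / 2; // mid-point of each interval
    await new Promise((r) => (video.onseeked = r));
    ctx.drawImage(video, 0, 0);
    frames.push(canvas.toDataURL("image/jpeg", 0.7));
  }
  URL.revokeObjectURL(video.src);
  return frames;
}
```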
The preview player appears after translation completes. Use it to review your work before exporting:
Use the scrubber to jump to any point in the timeline. Captions update in real-time as you scrub through the waveform.
Click any export card to download your translated content. Available formats:
If you only need translated captions (no voice dubbing), skip step 4 entirely. After translation, go straight to export. Both API keys are still needed — ElevenLabs for transcription and Claude for translation.
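For reference, the SRT mapping is mechanical once segments carry timestamps. A sketch of a caption exporter, with the segment shape assumed:

```typescript
// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(sec: number): string {
  const total = Math.round(sec * 1000);
  const ms = total % 1000;
  const s = Math.floor(total / 1000);
  const hh = String(Math.floor(s / 3600)).padStart(2, "0");
  const mm = String(Math.floor((s % 3600) / 60)).padStart(2, "0");
  const ss = String(s % 60).padStart(2, "0");
  return `${hh}:${mm}:${ss},${String(ms).padStart(3, "0")}`;
}

// Build an SRT document from translated segments.
function toSrt(segments: { start: number; end: number; text: string }[]): string {
  return segments
    .map((seg, i) =>
      `${i + 1}\n${srtTime(seg.start)} --> ${srtTime(seg.end)}\n${seg.text}\n`)
    .join("\n");
}
```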
ElevenLabs' Dubbing API automatically clones each speaker's voice from the original audio — no manual setup needed. For best results, use clear audio with minimal crosstalk between speakers.
As you progress through the pipeline, each completed step's action button turns into a green "✓ Complete" indicator. You can see at a glance what's been done without scrolling up. If you want to redo a step, remove the file and start fresh.
The on-screen text detection step appears after translation completes (video files only). It's entirely optional — skip it if your content has no significant on-screen text to localize. If you run it, use the results as a VFX handoff manifest rather than expecting automatic visual replacement.
Watch the "Fit" column after translation. Green "OK" means the adapted text fits the timing. A percentage like "+15%" means it runs slightly long — the voice synthesis will speak slightly faster to compensate, but you may want to manually shorten the text for the most natural result.
Your API keys and media files never touch BuilderBias servers. All API calls go directly from your browser to Anthropic and ElevenLabs over encrypted HTTPS connections. Keys are saved in localStorage so you don't have to re-enter them.
Traditional localization pipelines involve multiple vendors, weeks of coordination, and significant per-minute costs. Echō consolidates the entire workflow.
For a 90-minute feature film localized into 10 languages (900 localized minutes), traditional cost is approximately $135K–$225K, or roughly $150–$250 per localized minute, over 20–40 weeks. Echō estimates $450–$1,800 in API costs, roughly $0.50–$2 per localized minute, deliverable in days. Human QC review time is additional but integrated into the workflow.
AI dubbing is powerful but not infallible. Echō builds human review into every stage of the pipeline — not as an afterthought, but as a core design principle. Every automated decision has a checkpoint where a human can verify, override, or refine.
After Scribe transcribes, review the full transcript table before proceeding. Correct speaker labels, fix misheard words, and verify timecodes. The transcript is the foundation — errors here cascade downstream.
Each translated segment is displayed alongside the original. Click the edit button (✎) on any segment to review, modify, or provide instructions for re-translation. Critical for: cultural idioms, humor, brand-specific terminology, character names, and legal/compliance language that AI may handle inconsistently.
The "Fit" column flags segments where the translated text may not fit the original timing window. Green means it fits; a percentage shows how much longer it runs. Human reviewers can shorten or rephrase flagged segments before dubbing to ensure lip-sync quality.
After dubbing, click the play button (▶) on any segment to hear exactly how it sounds. If a specific segment sounds unnatural, has wrong emphasis, or mispronounces a proper noun, edit and re-dub just that segment without reprocessing the entire file.
The preview player lets you listen to the complete dubbed result with Original, Dubbed, and Side-by-Side modes before committing to an export. QC teams can scrub through the timeline, compare translations in context, and approve or flag issues.
AI translation can produce confident but incorrect results in edge cases. Always have a native speaker verify:
- Cultural references and humor (jokes don't translate literally)
- Legal/regulatory language (compliance text must be exact)
- Brand terminology (character names, product names, proprietary terms)
- Songs and poetry (rhythm and rhyme require creative adaptation)
- Sensitive content (violence, political, and religious references that vary by market)
- Proper nouns and place names (transliteration varies by target language)
Echō is architected to scale from proof-of-concept to enterprise deployment. The following capabilities are on the product roadmap for enterprise customers.
SAML/OIDC single sign-on integration. Role-based permissions for translators, reviewers, and admins with approval workflows.
Queue 100+ files for processing across multiple target languages simultaneously. Priority scheduling and progress dashboards.
API connectors for Media Asset Management systems (Dalet, Avid, Frame.io). Ingest from and deliver to existing content pipelines.
Preserve content ratings metadata through the pipeline. Profanity filtering and regional compliance rules per target market.
Full history of every translation decision, edit, and approval. Version control for dubbed assets with rollback capability.
Build and reuse glossaries of brand-specific terms, character names, and recurring phrases across projects for consistency.
Self-hosted option for studios with strict data sovereignty requirements. Content never leaves your infrastructure.
Dub a single piece of content into all 32 supported languages in one batch. Regional voice profiles and dialect support for key markets.
Security is built into Echō at every layer — from how API keys are handled to how content is transmitted.
No BuilderBias server sits between you and the APIs. All calls go directly from your browser to Anthropic and ElevenLabs over HTTPS. Your media files and translations never touch our infrastructure.
Keys are validated against each API on page load. Sessions automatically expire after 4 hours of inactivity — keys are cleared from localStorage to prevent stale credential exposure.
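An illustrative sketch of that expiry pattern. The storage schema is an assumption; only the 4-hour window comes from the description above:

```typescript
const FOUR_HOURS_MS = 4 * 60 * 60 * 1000;

// Store a key alongside a last-used timestamp.
function saveKey(name: string, value: string): void {
  localStorage.setItem(name, JSON.stringify({ value, lastUsed: Date.now() }));
}

// Read a key, clearing it if the inactivity window has elapsed.
function loadKey(name: string): string | null {
  const raw = localStorage.getItem(name);
  if (!raw) return null;
  const { value, lastUsed } = JSON.parse(raw);
  if (Date.now() - lastUsed > FOUR_HOURS_MS) {
    localStorage.removeItem(name); // stale credential: purge
    return null;
  }
  saveKey(name, value); // refresh the inactivity timer
  return value;
}
```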
API keys are automatically redacted from browser console output. If someone opens dev tools, sensitive credentials are masked in all log statements.
Right-click, view-source, and dev-tools keyboard shortcuts are disabled, and a dev-tools detector raises an alert when inspection is attempted.
Planned: AES-256 encryption for stored assets, DRM watermarking for dubbed content, IP allowlisting, and SOC 2 Type II compliance certification.
Following the 2023 SAG-AFTRA strike, studios have specific contractual obligations around AI voice use. Any enterprise deployment of Echō will implement a compliance framework covering consent, compensation, disclosure, and revocation. This is a roadmap requirement for production use with unionized talent.
Before any voice is cloned, the original actor (or their designated representative) must provide written consent specifying: the scope of use (which content, which languages, which territories), the term (finite time period, not perpetual), and the permitted derivative uses. Echō's enterprise version will include a consent management UI that captures, stores, and verifies these agreements before a clone can be created. No consent record → no clone.
Every voice in the character library will carry rights metadata: original actor name, consent document reference, union membership status, expiration date, permitted territories, and any performance restrictions. When a voice is used for dubbing, these rights are checked in real-time. Expired or out-of-scope uses are blocked at the API layer.
Each synthesis event (every time a cloned voice is used to generate dubbed content) is logged with full metadata: voice ID, project, duration generated, territory, and date. This usage log feeds into payroll/royalty systems to trigger residual payments per the SAG-AFTRA AI framework. Studios can generate quarterly reports for guild reporting requirements.
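Because this is roadmap functionality, the shapes below are purely illustrative of the checks described, not a shipped API:

```typescript
// Hypothetical rights record attached to each cloned voice.
interface VoiceRights {
  actorName: string;
  consentDocRef: string;
  unionMember: boolean;
  expiresAt: Date;
  permittedTerritories: string[]; // e.g. ["US", "EU"]
}

// Hypothetical synthesis log entry for royalty reporting.
interface SynthesisEvent {
  voiceId: string;
  project: string;
  secondsGenerated: number;
  territory: string;
  date: Date;
}

// Block synthesis when the rights record is expired or out of scope.
function assertUsable(rights: VoiceRights, territory: string): void {
  if (new Date() > rights.expiresAt) {
    throw new Error("Consent expired: clone blocked");
  }
  if (!rights.permittedTerritories.includes(territory)) {
    throw new Error(`Territory ${territory} not covered by consent`);
  }
}
```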
All Echō-generated dubbed content is marked with an internal "AI-generated" flag in metadata. For jurisdictions that require consumer disclosure (California AB 2602, EU AI Act, etc.), the flag surfaces in exported files (TTML metadata, sidecar files, and platform-specific disclosure markers). Studios can apply custom disclosure language per region.
Actors can revoke consent at any time. When a revocation is received, the voice profile is marked as archived — the clone can no longer be used for new synthesis, though existing delivered content remains per contract. The character voice library supports versioning, so historical uses are traceable and auditable.
The framework distinguishes between two use cases: performance extension (using an actor's voice for additional takes, languages, or ADR with their consent and payment) vs. replacement (using an AI voice in place of hiring an actor). The former is contemplated by the SAG-AFTRA framework; the latter requires explicit contractual provisions and generally higher compensation tiers. Echō's tooling supports both modes with distinct audit trails.
⚠ Current browser-based Echō is a proof-of-concept and does not enforce these compliance controls. Before any production use with unionized talent or pre-release content, the enterprise version with the full compliance framework must be deployed. Studios should work with their guild counsel and talent rights teams to operationalize consent and compensation workflows.
Songs are explicitly out of scope for AI dubbing at broadcast/theatrical quality. Echō flags musical content and routes it to a human specialist workflow — the same process major studios use for animated musicals and feature films.
During translation, Claude identifies segments that are sung or musically performed (lyrics, chanting, rhymed/metered delivery) and flags them with an isSong marker.
Reviewers can toggle the song flag on any segment using the music note button (🎵) in the transcript or the checkbox in the segment edit modal. Useful for edge cases like rap, spoken word, or instrumental score with voiceover.
Song segments are visually highlighted in the transcript and excluded from AI voice synthesis. The Dubbing API still processes the file, but song segments are marked "skip" and should be preserved in their original form or replaced via the human specialist track.
Export a Song Manifest (CSV + JSON) listing every flagged musical segment with timecode, original lyrics, and machine translation for reference. This manifest is handed off to native-speaking lyricists and performing artists in the target market who re-write the lyrics preserving meter, rhyme, and cultural resonance — then record performances that are mixed back into the final deliverable.
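A sketch of the CSV half of that manifest. Column names are assumptions; escaping is kept minimal for brevity:

```typescript
interface SongSegment {
  start: number;       // seconds
  end: number;         // seconds
  original: string;    // original lyrics
  translation: string; // machine translation, for reference only
  isSong: boolean;     // set by Claude or toggled by a reviewer
}

// Build the CSV half of the Song Manifest from flagged segments.
function songManifestCsv(segments: SongSegment[]): string {
  const esc = (s: string) => `"${s.replace(/"/g, '""')}"`;
  const rows = segments
    .filter((s) => s.isSong)
    .map((s) => [s.start, s.end, esc(s.original), esc(s.translation)].join(","));
  return ["start_sec,end_sec,original_lyrics,reference_translation", ...rows].join("\n");
}
```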
| Time | Speaker | Original | Translation | Fit | Dub | Actions |
|---|---|---|---|---|---|---|
Text detected in video frames (signs, captions, messages, labels). Review and edit translations before export. Final text replacement requires VFX compositing downstream.
| Time | Type | Original Text | Translation | Actions |
|---|---|---|---|---|