Avatar Video from Text
Generate a talking-head video from:
- Text — what the character says
- Character photo — how they look (generated or real)
- Voice — any ElevenLabs voice (by ID or name)
Pipeline
- ElevenLabs V3 TTS — text → speech audio
- OmniHuman 1.5 — image + audio → lipsync video (lip movements match the speech)
- Kling v3 pro motion control — image + OmniHuman video → enhanced quality video (optional, improves realism)
Prerequisites
- python3, ffmpeg
- fal-client Python package (uv run --with fal-client or pip install fal-client)
- ELEVENLABS_API_KEY in ~/.secrets/elevenlabs.env
- FAL_AI_KEY or FAL_KEY in environment
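
The prerequisites above can be verified before any paid API call is made. The following is a minimal sketch, not part of the shipped scripts: the helper names `parse_env_file` and `missing_prerequisites` are hypothetical, and it assumes `~/.secrets/elevenlabs.env` holds plain `KEY=VALUE` lines, which the document does not state.

```python
import os
import shutil
from pathlib import Path


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines (assumed file format); skip blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_prerequisites(env: dict, secrets_text: str, which=shutil.which) -> list:
    """Return a list of missing tools or keys; an empty list means ready."""
    missing = [tool for tool in ("python3", "ffmpeg") if which(tool) is None]
    if not parse_env_file(secrets_text).get("ELEVENLABS_API_KEY"):
        missing.append("ELEVENLABS_API_KEY")
    # Either variable name is accepted, as listed in the prerequisites.
    if not (env.get("FAL_AI_KEY") or env.get("FAL_KEY")):
        missing.append("FAL_AI_KEY or FAL_KEY")
    return missing


def check_local_setup() -> list:
    """Check this machine using the secrets path given in the prerequisites."""
    secrets = Path.home() / ".secrets" / "elevenlabs.env"
    text = secrets.read_text() if secrets.exists() else ""
    return missing_prerequisites(dict(os.environ), text)
```

Calling `check_local_setup()` returns the names of anything missing, so a wrapper can stop early instead of failing partway through the pipeline.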
Workflow
1. Generate speech
python3 scripts/tts_elevenlabs_v3.py \
--text "Your script text here" \
--voice-name "Celestia" \
--output /path/to/speech.mp3
Or from a file: --text-file /path/to/script.txt
Or by voice ID: --voice-id VaKkxizh5XgA7ihroKqO
Model defaults to eleven_v3 (recommended). Voice settings tuned from production:
--stability 0.34 --similarity-boost 0.91 --style 0.49
For more expressive/less clone-like output: --stability 0.15 --similarity-boost 0.6 --style 0.85
Use --list-voices to see all available voices.
Audio Tags (v3 only)
ElevenLabs v3 supports emotion control via tags in the text:
[excited] This is amazing!
[sigh] I can't believe it...
[serious] Stop doing that.
[whisper] Don't tell anyone.
Available: [excited], [sad], [angry], [nervous], [sigh], [whisper], [happily], [serious], [tired], [frustrated]
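
A script can be checked for misspelled tags before it is sent to TTS. The sketch below only compares bracketed tags against the list given in this section; the function name `unknown_tags` is hypothetical, and ElevenLabs may accept tags that are not listed here, so a reported tag is a prompt to double-check rather than a definite error.

```python
import re

# Tags this document lists as available for eleven_v3.
KNOWN_TAGS = {
    "excited", "sad", "angry", "nervous", "sigh",
    "whisper", "happily", "serious", "tired", "frustrated",
}

# Matches text inside a single pair of square brackets, e.g. "[sigh]".
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unknown_tags(script: str) -> list:
    """Return bracketed tags in the script that are not in KNOWN_TAGS, sorted."""
    found = TAG_PATTERN.findall(script)
    return sorted({tag for tag in found if tag.strip().lower() not in KNOWN_TAGS})
```

For example, `unknown_tags("[excitd] This is amazing!")` returns `["excitd"]`, which would otherwise be passed to the model as an unrecognized tag.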
Important: Image size for OmniHuman
OmniHuman rejects images >5MB. If using Nano Banana 2K images, resize first:
ffmpeg -y -i big.png -vf "scale=1080:-1" small.png
2. Generate lipsync video
uv run --with fal-client python3 scripts/omnihuman_lipsync.py \
--image /path/to/character.png \
--audio /path/to/speech.mp3 \
--output /path/to/lipsync.mp4
OmniHuman 1.5 costs ~$0.16/sec. A 60s video ≈ $9.60.
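
The 5MB image limit noted before this step can be checked ahead of the upload, so an oversized image does not fail the request. This is a rough sketch with hypothetical helper names; it assumes "5MB" means 5,000,000 bytes (the stricter reading) and reuses the ffmpeg resize command shown earlier.

```python
import os

# Assumption: the stricter decimal reading of "5MB".
MAX_IMAGE_BYTES = 5 * 1000 * 1000


def needs_resize(size_bytes: int, limit: int = MAX_IMAGE_BYTES) -> bool:
    """True when the image is larger than the assumed OmniHuman limit."""
    return size_bytes > limit


def resize_command(src: str, dst: str, width: int = 1080) -> list:
    """Build the ffmpeg resize command from this document as an argv list."""
    return ["ffmpeg", "-y", "-i", src, "-vf", f"scale={width}:-1", dst]


def prepare_image_command(path: str, dst: str):
    """Return a resize command if the file is too large, otherwise None."""
    if needs_resize(os.path.getsize(path)):
        return resize_command(path, dst)
    return None
```

A returned command can be run with `subprocess.run(cmd, check=True)`, after which the resized file is passed to `--image`.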
3. Enhance with Kling motion control (optional)
uv run --with fal-client python3 scripts/kling_motion_enhance.py \
--image /path/to/character.png \
--video /path/to/lipsync.mp4 \
--face-image /path/to/character_face.png \
--prompt "Young woman presenting a product to camera, expressive gestures, photorealistic" \
--output /path/to/final.mp4
Kling v3 pro takes the OmniHuman output as a motion reference and re-renders it at higher quality. It uses elements to preserve face identity.
Note: Kling has video length limits. For longer content, split into chunks ≤15s.
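
Splitting into chunks of at most 15 seconds could be planned as follows. This is a sketch under assumptions: the function names and the `chunk_NN.mp4` naming are hypothetical, and the ffmpeg commands re-encode each chunk rather than stream-copying, which trades speed for frame-accurate cuts.

```python
import math


def chunk_spans(total_seconds: float, max_chunk: float = 15.0) -> list:
    """Return (start, length) pairs covering the video in chunks <= max_chunk."""
    if total_seconds <= 0 or max_chunk <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_chunk)
    return [
        (i * max_chunk, min(max_chunk, total_seconds - i * max_chunk))
        for i in range(count)
    ]


def split_commands(video: str, total_seconds: float, prefix: str = "chunk") -> list:
    """Build one ffmpeg argv list per chunk (hypothetical output naming)."""
    return [
        ["ffmpeg", "-y", "-ss", str(start), "-i", video,
         "-t", str(length), f"{prefix}_{i:02d}.mp4"]
        for i, (start, length) in enumerate(chunk_spans(total_seconds))
    ]
```

A 60-second lipsync video yields four chunks, which matches the lower end of the chunk count used in the cost estimate.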
Cost estimate (60s video)
| Step | Model | Cost |
|---|---|---|
| TTS | ElevenLabs V3 | ~$0.25 |
| Lipsync | OmniHuman 1.5 | ~$9.60 |
| Motion enhance | Kling v3 pro (×4-6 chunks) | ~$4-5 |
| Total | | ~$14-15 |
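
The table can be turned into a rough estimator for other durations. The sketch below takes the OmniHuman rate (~$0.16/sec) directly from this document; scaling the TTS and Kling figures linearly from their 60-second values is an assumption, since actual pricing for those steps may not be proportional to duration.

```python
OMNIHUMAN_PER_SEC = 0.16        # stated in this document
TTS_PER_60S = 0.25              # from the table; linear scaling is assumed
KLING_PER_60S = (4.0, 5.0)      # low/high from the table; linear scaling is assumed


def estimate_cost(seconds: float, enhance: bool = True) -> tuple:
    """Return a (low, high) cost estimate in USD for a video of this length."""
    scale = seconds / 60.0
    base = TTS_PER_60S * scale + OMNIHUMAN_PER_SEC * seconds
    if not enhance:
        return (round(base, 2), round(base, 2))
    low, high = KLING_PER_60S
    return (round(base + low * scale, 2), round(base + high * scale, 2))
```

For a 60-second video this gives a range of about $13.85 to $14.85, in line with the ~$14-15 total in the table; passing `enhance=False` shows the cost of stopping after OmniHuman.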
Guardrails
- Always review the TTS audio before running OmniHuman, since OmniHuman is the most expensive step.
- For long texts, split into segments and generate separately.
- Kling motion enhance is optional — OmniHuman alone may be sufficient for some use cases.
- Compare OmniHuman output vs Kling-enhanced output before committing to the full pipeline.
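
Splitting a long text into segments, as suggested in the guardrails, might look like the sketch below. It breaks at sentence boundaries so that a leading audio tag stays with its sentence. The function name and the `max_chars` default are hypothetical; this document does not specify a segment length.

```python
import re


def split_script(text: str, max_chars: int = 500) -> list:
    """Group whole sentences into segments of at most max_chars characters.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Each segment can then be passed to the TTS script separately and reviewed before the more expensive lipsync step.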
Files
- scripts/tts_elevenlabs_v3.py: ElevenLabs V3 text-to-speech (any voice)
- scripts/omnihuman_lipsync.py: OmniHuman 1.5 image + audio → lipsync video
- scripts/kling_motion_enhance.py: Kling v3 pro motion control enhancement
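
The three scripts can be chained from a single driver. The sketch below only builds the command lines, using the flags shown in the Workflow section, so they can be inspected before anything is run. The function names and the fixed output filenames inside `workdir` are hypothetical and not part of the scripts themselves.

```python
import shlex


def pipeline_commands(text, voice_name, image, workdir,
                      enhance=False, face_image=None, prompt=None) -> list:
    """Build argv lists for the TTS, lipsync and optional Kling steps."""
    speech = f"{workdir}/speech.mp3"
    lipsync = f"{workdir}/lipsync.mp4"
    commands = [
        ["python3", "scripts/tts_elevenlabs_v3.py",
         "--text", text, "--voice-name", voice_name, "--output", speech],
        ["uv", "run", "--with", "fal-client", "python3",
         "scripts/omnihuman_lipsync.py",
         "--image", image, "--audio", speech, "--output", lipsync],
    ]
    if enhance:
        # Mirror the documented invocation, which passes both of these flags.
        if not face_image or not prompt:
            raise ValueError("enhance=True needs face_image and prompt")
        commands.append(
            ["uv", "run", "--with", "fal-client", "python3",
             "scripts/kling_motion_enhance.py",
             "--image", image, "--video", lipsync,
             "--face-image", face_image, "--prompt", prompt,
             "--output", f"{workdir}/final.mp4"])
    return commands


def dry_run(commands) -> list:
    """Render each argv list as a shell-quoted string for review."""
    return [shlex.join(cmd) for cmd in commands]
```

Running each command with `subprocess.run(cmd, check=True)` one step at a time leaves room to review the TTS audio before the lipsync step, as the guardrails recommend.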