Avatar Video from Text

Generate a talking-head video from:

Text — what the character says
Character photo — how they look (generated or real)
Voice — any ElevenLabs voice (by ID or name)

Pipeline

ElevenLabs V3 TTS — text → speech audio
OmniHuman 1.5 — image + audio → lipsync video (lip movements match the speech)
Kling v3 pro motion control — image + OmniHuman video → enhanced quality video (optional, improves realism)
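The three stages hand files to one another: the speech file feeds OmniHuman, and the lipsync video feeds Kling. The sketch below only builds the command lines in order, using the script names and flags from the Workflow section. The function name `build_pipeline_commands` and the intermediate file names are assumptions made for illustration, not part of the skill.

```python
import posixpath


def build_pipeline_commands(text, voice_name, image, workdir,
                            prompt=None, face_image=None):
    """Return argv lists for each stage, in execution order.

    Hypothetical helper: intermediate file names are illustrative.
    The Kling stage is added only when a prompt is given.
    """
    speech = posixpath.join(workdir, "speech.mp3")
    lipsync = posixpath.join(workdir, "lipsync.mp4")
    final = posixpath.join(workdir, "final.mp4")

    commands = [
        # Stage 1: text -> speech audio
        ["python3", "scripts/tts_elevenlabs_v3.py",
         "--text", text, "--voice-name", voice_name, "--output", speech],
        # Stage 2: image + audio -> lipsync video
        ["uv", "run", "--with", "fal-client", "python3",
         "scripts/omnihuman_lipsync.py",
         "--image", image, "--audio", speech, "--output", lipsync],
    ]
    if prompt is not None:
        # Stage 3 (optional): image + lipsync video -> enhanced video
        kling = ["uv", "run", "--with", "fal-client", "python3",
                 "scripts/kling_motion_enhance.py",
                 "--image", image, "--video", lipsync]
        if face_image is not None:
            kling += ["--face-image", face_image]
        kling += ["--prompt", prompt, "--output", final]
        commands.append(kling)
    return commands
```

Returning the commands instead of running them leaves room to listen to the speech file before starting the paid lipsync stage.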

Prerequisites

python3, ffmpeg
fal-client Python package (uv run --with fal-client or pip install fal-client)
ELEVENLABS_API_KEY in ~/.secrets/elevenlabs.env
FAL_AI_KEY or FAL_KEY in environment
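A quick preflight check can catch a missing tool or key before any paid call is made. The sketch below assumes that `elevenlabs.env` holds plain `KEY=VALUE` lines (dotenv style), which this document does not state. Both function names are hypothetical.

```python
import shutil


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments.

    Assumption: the env file is dotenv-style; an optional leading
    'export ' and surrounding quotes are tolerated.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("export "):
            key = key[len("export "):].strip()
        values[key] = value.strip().strip('"').strip("'")
    return values


def missing_prerequisites(env, env_file_text, which=shutil.which):
    """Return a list of human-readable problems; empty means ready."""
    problems = []
    for tool in ("python3", "ffmpeg"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    if not parse_env_file(env_file_text).get("ELEVENLABS_API_KEY"):
        problems.append("ELEVENLABS_API_KEY missing from elevenlabs.env")
    if not (env.get("FAL_AI_KEY") or env.get("FAL_KEY")):
        problems.append("neither FAL_AI_KEY nor FAL_KEY is set")
    return problems
```

In practice `env` would be `os.environ` and `env_file_text` the contents of `~/.secrets/elevenlabs.env`.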

Workflow

1. Generate speech

python3 scripts/tts_elevenlabs_v3.py \
  --text "Your script text here" \
  --voice-name "Celestia" \
  --output /path/to/speech.mp3

Or from a file: --text-file /path/to/script.txt

Or by voice ID: --voice-id VaKkxizh5XgA7ihroKqO

Model defaults to eleven_v3 (recommended). Voice settings tuned from production:

--stability 0.34
--similarity-boost 0.91
--style 0.49

For more expressive/less clone-like output: --stability 0.15 --similarity-boost 0.6 --style 0.85

Use --list-voices to see all available voices.
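For orientation, the two presets above can be written as an ElevenLabs text-to-speech request body. The field names `text`, `model_id`, `voice_settings`, `stability`, `similarity_boost`, and `style` come from the public ElevenLabs API. Whether `tts_elevenlabs_v3.py` sends them in exactly this shape is an assumption, and the preset names are made up for this sketch.

```python
# Preset values are taken from this document; the names are hypothetical.
PRESETS = {
    "production": {"stability": 0.34, "similarity_boost": 0.91, "style": 0.49},
    "expressive": {"stability": 0.15, "similarity_boost": 0.6, "style": 0.85},
}


def build_tts_payload(text, preset="production", model_id="eleven_v3"):
    """Build a JSON-serialisable request body for a TTS call.

    Assumption: the script forwards its CLI flags as voice_settings.
    """
    if not text.strip():
        raise ValueError("text must not be empty")
    settings = dict(PRESETS[preset])
    for name, value in settings.items():
        # Each setting is expected to lie in the 0..1 range.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    return {"text": text, "model_id": model_id, "voice_settings": settings}
```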

Audio Tags (v3 only)

ElevenLabs v3 supports emotion control via tags in the text:

[excited] This is amazing!
[sigh] I can't believe it...
[serious] Stop doing that.
[whisper] Don't tell anyone.

Available: [excited], [sad], [angry], [nervous], [sigh], [whisper], [happily], [serious], [tired], [frustrated]
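A misspelled tag is easy to miss until the audio is rendered. The sketch below flags bracketed tags that are not in the list above. It checks only against this document's list, so a flagged tag is not necessarily unsupported by ElevenLabs. The function name `unknown_tags` is hypothetical.

```python
import re

# Tags listed in this document; the real set may be larger.
KNOWN_TAGS = {
    "excited", "sad", "angry", "nervous", "sigh",
    "whisper", "happily", "serious", "tired", "frustrated",
}

# Matches anything of the form [tag] without nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unknown_tags(text):
    """Return the sorted, de-duplicated tags not found in KNOWN_TAGS."""
    found = TAG_PATTERN.findall(text)
    return sorted({tag for tag in found if tag.strip().lower() not in KNOWN_TAGS})
```

For example, `unknown_tags("[exited] This is amazing!")` returns `["exited"]`, which points at the typo before any credits are spent.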

Important: Image size for OmniHuman

OmniHuman rejects images larger than 5 MB. If you are using Nano Banana 2K images, resize them first:

ffmpeg -y -i big.png -vf "scale=1080:-1" small.png
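The size check can be automated so that only oversized images are resized. This sketch assumes the limit means 5,000,000 bytes, the stricter reading of "5MB". The helper names are hypothetical, and the ffmpeg arguments mirror the command above.

```python
# Assumption: "5MB" is read as 5,000,000 bytes (the conservative reading).
MAX_IMAGE_BYTES = 5_000_000


def needs_resize(size_bytes, limit=MAX_IMAGE_BYTES):
    """True when a file of this size would exceed the upload limit."""
    return size_bytes > limit


def resize_command(src, dst, width=1080):
    """Build the ffmpeg argv that scales to `width`, keeping aspect ratio."""
    return ["ffmpeg", "-y", "-i", src, "-vf", f"scale={width}:-1", dst]
```

A caller would pass `os.path.getsize(path)` to `needs_resize` and, if it returns true, run the resulting command with `subprocess.run(..., check=True)`. It is worth checking the output size again, since a fixed width does not guarantee a file under the limit.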

2. Generate lipsync video

uv run --with fal-client python3 scripts/omnihuman_lipsync.py \
  --image /path/to/character.png \
  --audio /path/to/speech.mp3 \
  --output /path/to/lipsync.mp4

OmniHuman 1.5 costs ~$0.16/sec. A 60s video ≈ $9.60.

3. Enhance with Kling motion control (optional)

uv run --with fal-client python3 scripts/kling_motion_enhance.py \
  --image /path/to/character.png \
  --video /path/to/lipsync.mp4 \
  --face-image /path/to/character_face.png \
  --prompt "Young woman presenting a product to camera, expressive gestures, photorealistic" \
  --output /path/to/final.mp4

Kling v3 pro takes the OmniHuman output as a motion reference and re-renders it at higher quality. It uses elements to preserve face identity.

Note: Kling limits video length. For longer content, split the video into chunks of at most 15 s.
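One way to plan the split is to divide the video into the smallest number of equal chunks that each stay within 15 s, which avoids a very short final chunk. The sketch below is an illustration under that assumption. `plan_chunks` and `cut_command` are hypothetical helpers, and re-encoding with libx264 and aac is one choice among several for getting cuts at exact timestamps.

```python
import math


def plan_chunks(total_seconds, max_chunk=15.0):
    """Split a duration into equal (start, end) ranges, each <= max_chunk."""
    if total_seconds <= 0 or max_chunk <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_seconds / max_chunk)
    length = total_seconds / count
    chunks = []
    for i in range(count):
        start = i * length
        # Pin the last end to the total so rounding never drops a tail.
        end = total_seconds if i == count - 1 else (i + 1) * length
        chunks.append((start, end))
    return chunks


def cut_command(src, dst, start, end):
    """Build an ffmpeg argv that extracts [start, end) with re-encoding."""
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-i", src,
            "-t", f"{end - start:.3f}",
            "-c:v", "libx264", "-c:a", "aac", dst]
```

A 60 s video yields four 15 s chunks, which matches the lower end of the chunk count in the cost table.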

Cost estimate (60s video)

Step	Model	Cost
TTS	ElevenLabs V3	~$0.25
Lipsync	OmniHuman 1.5	~$9.60
Motion enhance	Kling v3 pro (×4-6 chunks)	~$4-5
Total		~$14-15
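The table's arithmetic can be reproduced for other durations. The sketch below uses the OmniHuman rate of about $0.16/sec from step 2 and assumes, purely for illustration, that the TTS and Kling figures scale linearly from the 60 s numbers in the table. Actual billing may differ.

```python
# Figures from this document; linear scaling beyond 60 s is an assumption.
OMNIHUMAN_USD_PER_SEC = 0.16
TTS_USD_PER_60S = 0.25
KLING_USD_PER_60S = (4.0, 5.0)  # low and high estimate


def estimate_cost(seconds, enhance=True):
    """Return a (low, high) USD estimate for a video of `seconds` length."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    scale = seconds / 60.0
    base = TTS_USD_PER_60S * scale + OMNIHUMAN_USD_PER_SEC * seconds
    if not enhance:
        return (round(base, 2), round(base, 2))
    low = base + KLING_USD_PER_60S[0] * scale
    high = base + KLING_USD_PER_60S[1] * scale
    return (round(low, 2), round(high, 2))
```

For 60 s this gives roughly $13.85 to $14.85 with the Kling step, in line with the ~$14-15 total, and roughly $9.85 without it.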

Guardrails

Always review the TTS audio before running OmniHuman, because OmniHuman is the most expensive step.
For long texts, split into segments and generate separately.
Kling motion enhance is optional — OmniHuman alone may be sufficient for some use cases.
Compare OmniHuman output vs Kling-enhanced output before committing to the full pipeline.
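For the long-text guardrail, one simple approach is to split the script at sentence boundaries and pack sentences into segments up to a size limit. The sketch below assumes a limit of 800 characters, which is an arbitrary illustrative value and not a documented ElevenLabs or OmniHuman limit. The function name `split_script` is hypothetical.

```python
import re


def split_script(text, max_chars=800):
    """Group whole sentences into segments of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than
    cut mid-sentence, so a segment can exceed the limit in that case.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Because the split happens after sentence-ending punctuation, an audio tag such as `[sigh]` stays attached to the sentence it introduces.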

Files

scripts/tts_elevenlabs_v3.py — ElevenLabs V3 text-to-speech (any voice)
scripts/omnihuman_lipsync.py — OmniHuman 1.5 image + audio → lipsync video
scripts/kling_motion_enhance.py — Kling v3 pro motion control enhancement
