Avatar Video from Text

Generate talking-head avatar videos from text. Pipeline: ElevenLabs V3 TTS → OmniHuman 1.5 lipsync → Kling v3 motion enhancement.

Костянтин@Latand

Files

Skill

1.1K

Size

15.2 KB

Entrypoint

SKILL.md

Format

folder

SKILL.md

---
name: avatar-video-from-text
description: "Generate a talking-head avatar video from text, a character photo, and a voice. Pipeline: ElevenLabs V3 TTS → OmniHuman 1.5 lipsync → Kling v3 pro motion enhancement. Use when you need to create a presenter video from a script without recording."
---

# Avatar Video from Text

Generate a talking-head video from:
- **Text** — what the character says
- **Character photo** — how they look (generated or real)
- **Voice** — any ElevenLabs voice (by ID or name)

## Pipeline

1. **ElevenLabs V3 TTS** — text → speech audio
2. **OmniHuman 1.5** — image + audio → lipsync video (lip movements match the speech)
3. **Kling v3 pro motion control** — image + OmniHuman video → enhanced quality video (optional, improves realism)

## Prerequisites

- `python3`, `ffmpeg`
- `fal-client` Python package (`uv run --with fal-client` or `pip install fal-client`)
- `ELEVENLABS_API_KEY` in `~/.secrets/elevenlabs.env`
- `FAL_AI_KEY` or `FAL_KEY` in environment
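
The prerequisites above can be checked before spending any credits. The following is a minimal sketch, not part of the skill's scripts: it assumes the secrets file holds plain `KEY=VALUE` lines (optionally prefixed with `export`), and the helper names `parse_env_text` and `missing_prerequisites` are hypothetical.

```python
import shutil
from typing import Callable, Mapping, Optional


def parse_env_text(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks and # comments (assumed file format)."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_prerequisites(
    env: Mapping[str, str],
    env_file_text: str,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the names of prerequisites that could not be found."""
    missing = [tool for tool in ("python3", "ffmpeg") if which(tool) is None]
    # The document places the ElevenLabs key in ~/.secrets/elevenlabs.env.
    if not parse_env_text(env_file_text).get("ELEVENLABS_API_KEY"):
        missing.append("ELEVENLABS_API_KEY")
    # Either fal variable name is accepted, per the list above.
    if not (env.get("FAL_KEY") or env.get("FAL_AI_KEY")):
        missing.append("FAL_KEY or FAL_AI_KEY")
    return missing
```

A typical call would pass `os.environ` and the text of `Path("~/.secrets/elevenlabs.env").expanduser()`; an empty result means everything listed was found.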

## Workflow

### 1. Generate speech

```bash
python3 scripts/tts_elevenlabs_v3.py \
  --text "Your script text here" \
  --voice-name "Celestia" \
  --output /path/to/speech.mp3
```

Or from a file: `--text-file /path/to/script.txt`

Or by voice ID: `--voice-id VaKkxizh5XgA7ihroKqO`

Model defaults to `eleven_v3` (recommended). Voice settings tuned from production:
- `--stability 0.34`
- `--similarity-boost 0.91`
- `--style 0.49`

For more expressive/less clone-like output: `--stability 0.15 --similarity-boost 0.6 --style 0.85`

Use `--list-voices` to see all available voices.
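
When the TTS step is driven from another Python program, the two setting profiles above can be kept as named presets. This is a hedged sketch that only assembles the command line from flags shown in this document. The preset names, the `build_tts_command` helper, and the rule that exactly one text source and one voice selector is passed are assumptions.

```python
# Flag values copied from the two profiles described above.
VOICE_PRESETS = {
    "production": {"--stability": 0.34, "--similarity-boost": 0.91, "--style": 0.49},
    "expressive": {"--stability": 0.15, "--similarity-boost": 0.6, "--style": 0.85},
}


def build_tts_command(output, *, text=None, text_file=None,
                      voice_name=None, voice_id=None, preset="production"):
    """Return an argv list for scripts/tts_elevenlabs_v3.py (hypothetical helper)."""
    if (text is None) == (text_file is None):
        raise ValueError("pass exactly one of text or text_file")
    if (voice_name is None) == (voice_id is None):
        raise ValueError("pass exactly one of voice_name or voice_id")
    cmd = ["python3", "scripts/tts_elevenlabs_v3.py"]
    cmd += ["--text", text] if text is not None else ["--text-file", text_file]
    cmd += ["--voice-name", voice_name] if voice_name is not None else ["--voice-id", voice_id]
    cmd += ["--output", output]
    for flag, value in VOICE_PRESETS[preset].items():
        cmd += [flag, str(value)]
    return cmd
```

The returned list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems in the script text.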

### Audio Tags (v3 only)

ElevenLabs v3 supports emotion control via tags in the text:
```
[excited] This is amazing!
[sigh] I can't believe it...
[serious] Stop doing that.
[whisper] Don't tell anyone.
```
Available: `[excited]`, `[sad]`, `[angry]`, `[nervous]`, `[sigh]`, `[whisper]`, `[happily]`, `[serious]`, `[tired]`, `[frustrated]`
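
A script can be checked against the tag list above before it is sent to TTS, so that a typo such as `[exited]` is caught early. This is a small sketch: the `unknown_audio_tags` helper is hypothetical, and it treats the list in this document as the full set, which may be narrower than what ElevenLabs v3 actually accepts.

```python
import re

# Tags listed in this document; assumed here to be the complete set.
KNOWN_TAGS = {
    "excited", "sad", "angry", "nervous", "sigh",
    "whisper", "happily", "serious", "tired", "frustrated",
}

# Matches anything in square brackets that does not contain nested brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def unknown_audio_tags(script: str) -> list[str]:
    """Return bracketed tags in the script that are not in KNOWN_TAGS, sorted."""
    found = TAG_PATTERN.findall(script)
    return sorted({tag for tag in found if tag.strip().lower() not in KNOWN_TAGS})
```

An empty list means every bracketed tag in the script appears in the list above.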

### Important: Image size for OmniHuman

OmniHuman rejects images larger than 5 MB. If you are using Nano Banana 2K images, resize them first:
```bash
ffmpeg -y -i big.png -vf "scale=1080:-1" small.png
```
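
The size check can be automated so the resize only runs when needed. The sketch below is illustrative: it reads "5MB" conservatively as 5,000,000 bytes (an assumption, since the exact limit is not specified here), and the helper names are hypothetical. The ffmpeg arguments mirror the command above.

```python
MAX_IMAGE_BYTES = 5 * 1000 * 1000  # assumption: conservative reading of "5MB"


def needs_resize(size_bytes: int, limit: int = MAX_IMAGE_BYTES) -> bool:
    """True when the image is over the size limit stated above."""
    return size_bytes > limit


def resize_command(src: str, dst: str, width: int = 1080) -> list[str]:
    """Build the same ffmpeg call as above: scale to `width`, keep aspect ratio."""
    return ["ffmpeg", "-y", "-i", src, "-vf", f"scale={width}:-1", dst]
```

In practice the size would come from `os.path.getsize(path)`, and the command list would be passed to `subprocess.run(..., check=True)`.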

### 2. Generate lipsync video

```bash
uv run --with fal-client python3 scripts/omnihuman_lipsync.py \
  --image /path/to/character.png \
  --audio /path/to/speech.mp3 \
  --output /path/to/lipsync.mp4
```

OmniHuman 1.5 costs ~$0.16/sec, so a 60s video costs ≈ $9.60.

### 3. Enhance with Kling motion control (optional)

```bash
uv run --with fal-client python3 scripts/kling_motion_enhance.py \
  --image /path/to/character.png \
  --video /path/to/lipsync.mp4 \
  --face-image /path/to/character_face.png \
  --prompt "Young woman presenting a product to camera, expressive gestures, photorealistic" \
  --output /path/to/final.mp4
```

Kling v3 pro takes the OmniHuman output as a motion reference and re-renders it at higher quality. It uses `elements` to preserve face identity.

Note: Kling has video length limits. For longer content, split the video into chunks of ≤15s.
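
The chunking rule can be sketched as a small planner that cuts a video into equal parts of at most 15 seconds. This is an illustration rather than part of the skill: the helper names are hypothetical, and re-encoding with `libx264` and `aac` is an assumption chosen so cuts do not depend on keyframe positions.

```python
import math


def plan_chunks(duration_s: float, max_chunk_s: float = 15.0) -> list[tuple[float, float]]:
    """Return (start, length) pairs covering the video, each length <= max_chunk_s."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    count = math.ceil(duration_s / max_chunk_s)
    length = duration_s / count  # equal parts avoid a very short final chunk
    return [(i * length, length) for i in range(count)]


def chunk_command(src: str, dst: str, start: float, length: float) -> list[str]:
    """Build an ffmpeg call that extracts one chunk (re-encoding is an assumption)."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",   # seek to the chunk start
        "-i", src,
        "-t", f"{length:.3f}",   # keep only `length` seconds
        "-c:v", "libx264", "-c:a", "aac",
        dst,
    ]
```

For a 60-second lipsync video this yields four 15-second chunks, each of which would be enhanced separately and then joined again.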

## Cost estimate (60s video)

| Step | Model | Cost |
|------|-------|------|
| TTS | ElevenLabs V3 | ~$0.25 |
| Lipsync | OmniHuman 1.5 | ~$9.60 |
| Motion enhance | Kling v3 pro (×4-6 chunks) | ~$4-5 |
| **Total** | | **~$14-15** |
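
The table's arithmetic can be reproduced for other durations. The sketch below uses only the figures from this document and assumes, as a simplification, that the TTS and Kling costs scale linearly with duration. The function name is hypothetical and real pricing may differ.

```python
OMNIHUMAN_USD_PER_SEC = 0.16     # per-second rate quoted in this document
TTS_USD_PER_60S = 0.25           # table value for a 60s video
KLING_USD_PER_60S = (4.0, 5.0)   # low and high table values for a 60s video


def estimate_cost(duration_s: float, with_kling: bool = True) -> tuple[float, float]:
    """Return a (low, high) cost estimate in USD, assuming linear scaling."""
    scale = duration_s / 60.0
    base = OMNIHUMAN_USD_PER_SEC * duration_s + TTS_USD_PER_60S * scale
    if not with_kling:
        return (round(base, 2), round(base, 2))
    low = base + KLING_USD_PER_60S[0] * scale
    high = base + KLING_USD_PER_60S[1] * scale
    return (round(low, 2), round(high, 2))
```

For 60 seconds this gives a range of 13.85 to 14.85 USD, which the table rounds to ~$14-15. Skipping Kling brings it down to 9.85 USD.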

## Guardrails

- Always review the TTS audio before running OmniHuman, which is the most expensive step.
- For long texts, split into segments and generate separately.
- Kling motion enhance is optional — OmniHuman alone may be sufficient for some use cases.
- Compare OmniHuman output vs Kling-enhanced output before committing to the full pipeline.
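
The advice to split long texts can be sketched as a sentence-based splitter. This is a minimal illustration: the `split_script` helper and the 800-character default are hypothetical, and a good segment size depends on how long each clip should be.

```python
import re


def split_script(text: str, max_chars: int = 800) -> list[str]:
    """Group whole sentences into segments of at most max_chars where possible."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?…])\s+", text.strip())
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

A single sentence longer than `max_chars` is kept whole rather than cut mid-sentence, so a segment can exceed the limit in that case.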

## Files

- `scripts/tts_elevenlabs_v3.py` — ElevenLabs V3 text-to-speech (any voice)
- `scripts/omnihuman_lipsync.py` — OmniHuman 1.5 image + audio → lipsync video
- `scripts/kling_motion_enhance.py` — Kling v3 pro motion control enhancement
