---
name: tiktok-promo-video
description: "Create a TikTok promo video combining real recorded clips, AI-generated talking head scenes (Nano Banana + OmniHuman + Kling), AI avatar presenter, TTS voiceover, and styled captions. Full pipeline from script to final video."
---

# TikTok Promo Video Pipeline

End-to-end pipeline for creating TikTok-style promo videos that combine real footage with AI-generated content.

## Pipeline Overview

```
Script → TTS (ElevenLabs v3) → Face images (Nano Banana) → Lipsync (OmniHuman 1.5) → Enhance (Kling v3) → Captions → Final
```

## Prerequisites

- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (or in a `.env` file)
- Dependent skills (install via Forgedemy): `nano-banana-fal`, `avatar-video-from-text`, `kling-motion-control`, `record-transcribe-revoice`

## Step 1: Script & Structure

Plan scenes by type:

- **Real video**: record with `record-transcribe-revoice/assets/camera-recorder.html`
- **AI lipsync**: generated face + TTS voice → OmniHuman → Kling
- **AI avatar**: a different character with different backgrounds/sets

## Step 2: TTS with ElevenLabs v3

```bash
python3 {avatar-video-from-text}/scripts/tts_elevenlabs_v3.py \
  --voice-name "Voice Name" \
  --text "[excited] Your text here with audio tags!" \
  --stability 0.15 --similarity-boost 0.6 --style 0.85 \
  --output scene_audio.mp3
```

### Audio Tags (ElevenLabs v3)

Control emotion with tags in square brackets:

- `[excited]`, `[sad]`, `[angry]`, `[nervous]`, `[frustrated]`, `[tired]`
- `[sigh]`, `[whisper]`, `[happily]`, `[serious]`

### Voice Settings for Expressiveness

| Setting | Clone-like | Expressive (recommended) |
|---------|-----------|--------------------------|
| stability | 0.34 | 0.15-0.2 |
| similarity-boost | 0.91 | 0.6 |
| style | 0.49 | 0.7-0.85 |

Use `--list-voices` to see available voices.

## Step 3: Face-Consistent Images (Nano Banana)

```bash
uv run --with fal-client {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face ./face_reference.png \
  --prompt "Description of scene" \
  --output scene_image.png
```

### Critical: image size must be <5 MB for OmniHuman/Kling

```bash
# Downscale to 1080x1920 if the file is over the 5 MB limit
size=$(stat -c%s image.png)
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" image_resized.png
  mv image_resized.png image.png
fi
```

### Different backgrounds for variety

Generate multiple images with the same face but different environments (office, cafe, studio, etc.) for scene changes.

## Step 4: OmniHuman 1.5 Lipsync

```bash
uv run --with fal-client python3 {avatar-video-from-text}/scripts/omnihuman_lipsync.py \
  --image scene_image.png \
  --audio scene_audio.mp3 \
  --output scene_lipsync.mp4
```

Cost: ~$0.16/sec.
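Because each scene's lipsync job is independent, the step above can be fanned out across a thread pool. A minimal sketch, assuming the `omnihuman_lipsync.py` flags shown above and a hypothetical `<scene>_image.png` / `<scene>_audio.mp3` naming convention; the script path uses the same `{skill}` placeholder notation as the rest of this document:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder path, matching the {skill} notation used in this document
SCRIPT = "{avatar-video-from-text}/scripts/omnihuman_lipsync.py"

def lipsync_cmd(scene: str) -> list[str]:
    """Build the OmniHuman command for one scene (same flags as above)."""
    return [
        "uv", "run", "--with", "fal-client", "python3", SCRIPT,
        "--image", f"{scene}_image.png",
        "--audio", f"{scene}_audio.mp3",
        "--output", f"{scene}_lipsync.mp4",
    ]

def lipsync_all(scenes: list[str], workers: int = 4) -> None:
    """Run all scene jobs concurrently; raises if any job fails."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(lambda s: subprocess.run(lipsync_cmd(s), check=True), scenes):
            pass
```

At ~$0.16/sec the total cost is the same either way; parallelism only cuts wall-clock time.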
Run multiple scenes in parallel.

## Step 5: Kling v3 Motion Enhancement

```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image scene_image.png \
  --video scene_lipsync.mp4 \
  --orientation video \
  --prompt "description matching the scene" \
  --out scene_kling.mp4
```

### Important constraints

- **Minimum video duration: 3 seconds** — Kling rejects shorter clips.
- **Kling strips audio** — mux the original audio back in afterwards:

```bash
ffmpeg -y -i scene_kling.mp4 -i scene_audio.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k -shortest \
  scene_final.mp4
```

### For long audio (>15s): split into chunks

Kling has duration limits. Split the audio into ≤15s chunks, generate a separate image for each, run OmniHuman + Kling on each chunk, then concat.

## Step 6: Assembly

### Normalize all clips

Every clip must have an identical format before concat:

```bash
ffmpeg -y -i input.mp4 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  output_ready.mp4
```

**Critical**: all clips must be **stereo** (`-ac 2`). Mixing mono and stereo clips breaks concat audio.

### Concat

```bash
printf "file 's1.mp4'\nfile 's2.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy final.mp4
```

### PiP overlay (e.g., product screenshot)

```bash
ffmpeg -y -i video.mp4 -i overlay.png \
  -filter_complex "[1:v]scale=700:-1[pip];[0:v][pip]overlay=(W-w)/2:H-h-80:enable='gte(t,START_TIME)'" \
  -c:v libx264 -c:a copy output.mp4
```

## Step 7: Captions (Forgedemy Style)

### Transcribe with word timestamps

```bash
ELEVENLABS_API_KEY=... python3 \
  {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input final_audio.wav --out-dir ./
```

### Caption style

- Regular text: light gray `#CCCCCC`
- Key words highlighted: orange `#E8720C`
- Font: bold, size 62-68
- Black outline (borderwidth 4)
- Position: bottom third of screen
- Groups of 3-5 words per caption

### Important: always transcribe the FINAL assembled video

Transcribing individual clips and then offsetting timestamps causes sync issues. Transcribe after full assembly.

## Cost Estimate (60s video, 5 AI scenes)

| Step | Cost |
|------|------|
| ElevenLabs TTS (5 clips) | ~$0.50 |
| Nano Banana images (5-8) | ~$1.00 |
| OmniHuman lipsync (~40s) | ~$6.40 |
| Kling enhancement (~40s) | ~$4.00 |
| **Total** | **~$12** |

## Step 8: Background Music (ElevenLabs Music API)

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Modern tech startup promo, upbeat minimal electronic, inspiring, instrumental only",
    "duration_seconds": 65,
    "instrumental": true
  }' --output bgm.mp3
```

### Mixing music with video (approach from the digitalsamba toolkit)

```bash
# Calculate where the fade-out should start (3s before the end)
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out_start=$(python3 -c "print(round($dur - 3, 2))")

ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
[0:a]volume=1.0[voice];\
[1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out_start:d=3[bgm];\
[voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" \
  -c:v copy -c:a aac -b:a 192k output.mp4
```

Key settings:

- **Music volume**: `0.056` ≈ -25 dB (barely audible, doesn't overpower the voice)
- **Fade-in**: 2s at the start
- **Fade-out**: 3s at the end
- **dropout_transition=0**: prevents volume pumping when mixing

### Sound Effects (ElevenLabs SFX V2 API)

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/sound-generation" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "description", "duration_seconds": 1.5}' \
  --output sfx.mp3
```

Mix SFX at -25 dB (`volume=0.056`) and use `adelay=MILLISECONDS|MILLISECONDS` for precise timing.

## Step 9: Captions (ElevenLabs Transcription)

1. Extract audio: `ffmpeg -i final.mp4 -vn -ac 1 -ar 16000 audio.wav`
2. Transcribe with word timestamps using the `record-transcribe-revoice` skill
3. Build ASS subtitles from the word data (groups of 3-5 words)
4. Burn in: `ffmpeg -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -c:a copy output.mp4`

### Caption styling (Forgedemy brand)

- Regular text: `&H00CCCCCC` (light gray)
- Key words: `&H000C72E8` (orange #E8720C in ASS BGR format)
- Font: Arial Bold, size 62-68
- Black outline: borderwidth 4
- Position: bottom third

### Important: transcription quirks

- "Forgedemy" may be transcribed as "Forge to me" — fix it in the ASS file with sed
- Always transcribe the FINAL assembled video, not individual clips

## Recording & Cutting Takes

### Workflow: record once, pick the best takes

1. Record all phrases in one continuous take — multiple attempts per phrase are fine
2. Transcribe with ElevenLabs word timestamps:

   ```bash
   ffmpeg -i recording.webm -vn -ac 1 -ar 16000 audio.wav
   ELEVENLABS_API_KEY=... python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
     --input audio.wav --out-dir ./
   ```

3. Read `sentences.json` to find all takes of each phrase
4. Pick the cleanest take (no stutters, good energy)
5. Use word timestamps for precise cuts — trim right after the last word ends (+0.3s buffer)
6. Cut with ffmpeg: `ffmpeg -i recording.mp4 -ss START -to END -vf "..." output.mp4`

### Cutting rules

- Cut right after the last word ends (+0.3s max buffer) — don't leave pauses where you look away
- If the speaker stutters at the start of a phrase ("So three to se-- so three skills"), skip to the clean start
- Trim pauses >0.5s between scenes
- Always check the last frame — if the speaker looks away or down, trim earlier

## Audio Normalization

Normalize all clips to -16 LUFS (the TikTok/Instagram standard) before concat:

```bash
ffmpeg -i clip.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" -c:v copy -c:a aac -b:a 192k clip_norm.mp4
```

## Notes

- Record at your target FPS (30 or 60) — upscaling doesn't help
- Always encode all clips with `-ac 2` (stereo) before concat
- Resize images to <5 MB before OmniHuman/Kling
- Normalize audio to -16 LUFS before concat — clips from different sources have very different levels
- Transcribe the final assembled video for captions, not individual clips
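The ASS-building step in Step 9 (grouping word timestamps into 3-5 word captions with keyword highlights) can be sketched in Python. The word-timestamp shape (`{"text": ..., "start": ..., "end": ...}`) is an assumption about the transcription output, not the skill's actual schema, and the `[V4+ Styles]` section is abbreviated (a real file lists every V4+ field); colors and sizing follow the Forgedemy style above:

```python
# Sketch: group word timestamps into 3-5 word ASS caption events.
# Assumed input shape: [{"text": str, "start": float, "end": float}, ...]
# NOTE: abbreviated Style section; a full file lists all V4+ Format fields.

ASS_HEADER = """[Script Info]
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, Outline, Alignment, MarginV
Style: Caption,Arial Bold,64,&H00CCCCCC,&H00000000,4,2,400

[Events]
Format: Layer, Start, End, Style, Text
"""

def ass_time(seconds: float) -> str:
    """Format seconds as an ASS H:MM:SS.cc timestamp."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

def build_ass(words, keywords=(), group_size=4) -> str:
    """Emit one Dialogue line per word group; keywords get the orange override."""
    lines = [ASS_HEADER]
    for i in range(0, len(words), group_size):
        group = words[i:i + group_size]
        text = " ".join(
            # Inline override: switch to orange (#E8720C → BGR 0C72E8), then back to gray
            f"{{\\c&H0C72E8&}}{w['text']}{{\\c&HCCCCCC&}}" if w["text"].lower() in keywords
            else w["text"]
            for w in group
        )
        lines.append(
            f"Dialogue: 0,{ass_time(group[0]['start'])},{ass_time(group[-1]['end'])},Caption,{text}"
        )
    return "\n".join(lines) + "\n"
```

Write the result to `captions.ass` and burn it in with the `ass=` filter shown in Step 9.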
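The SFX timing note in Step 8 (`adelay` in milliseconds, `volume=0.056`) can be scripted. This sketch only builds the `-filter_complex` string for dropping SFX clips onto a video's audio track; the helper itself and the input layout (input 0 is the video, inputs 1..N are SFX files) are assumptions, not part of the skill:

```python
# Sketch: build an ffmpeg filter_complex that mixes SFX files into a video's
# audio track at given offsets, each attenuated to -25 dB (volume=0.056).

def sfx_filter(offsets_s, volume=0.056):
    """offsets_s: start times in seconds for SFX inputs 1..N (input 0 is the video).

    Returns (filter_complex, output_label) for use as:
      ffmpeg -i video.mp4 -i sfx1.mp3 ... -filter_complex FC -map 0:v -map "[aout]" ...
    """
    parts, labels = [], ["[0:a]"]
    for i, start in enumerate(offsets_s, start=1):
        ms = int(round(start * 1000))
        # adelay takes one delay per channel, hence "ms|ms" for stereo
        parts.append(f"[{i}:a]volume={volume},adelay={ms}|{ms}[sfx{i}]")
        labels.append(f"[sfx{i}]")
    n = len(labels)
    parts.append(f"{''.join(labels)}amix=inputs={n}:duration=first:dropout_transition=0[aout]")
    return ";".join(parts), "[aout]"
```

Example: `sfx_filter([1.5, 8.0])` yields a graph that delays the first SFX by 1500 ms, the second by 8000 ms, and mixes both with the voice track using the same `amix` settings as the music mix above.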