# TikTok Promo Video Pipeline

End-to-end pipeline for creating TikTok-style promo videos that combine real footage with AI-generated content.
## Pipeline Overview

Script → TTS (ElevenLabs v3) → Face images (Nano Banana) → Lipsync (OmniHuman 1.5) → Enhance (Kling v3) → Captions → Final

## Prerequisites

- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (or in a `.env` file)
- Dependent skills (install via Forgedemy): `nano-banana-fal`, `avatar-video-from-text`, `kling-motion-control`, `record-transcribe-revoice`
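A minimal sanity check before spending API credits (a sketch; adjust to your setup):

```bash
# Verify required tools, env vars, and the fal-client package are present
for tool in python3 ffmpeg ffprobe; do
  command -v "$tool" >/dev/null || { echo "missing: $tool"; exit 1; }
done
[ -n "${FAL_AI_KEY:-${FAL_KEY:-}}" ] || { echo "missing: FAL_AI_KEY/FAL_KEY"; exit 1; }
[ -n "${ELEVENLABS_API_KEY:-}" ] || { echo "missing: ELEVENLABS_API_KEY"; exit 1; }
python3 -c "import fal_client" 2>/dev/null || { echo "missing: fal-client package"; exit 1; }
```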
## Step 1: Script & Structure

Plan scenes with types:

- Real video: record with `record-transcribe-revoice/assets/camera-recorder.html`
- AI lipsync: generated face + TTS voice → OmniHuman → Kling
- AI avatar: a different character with different backgrounds/sets
## Step 2: TTS with ElevenLabs v3

```bash
python3 {avatar-video-from-text}/scripts/tts_elevenlabs_v3.py \
  --voice-name "Voice Name" \
  --text "[excited] Your text here with audio tags!" \
  --stability 0.15 --similarity-boost 0.6 --style 0.85 \
  --output scene_audio.mp3
```

### Audio Tags (ElevenLabs v3)
Control emotion with tags in square brackets:

`[excited]`, `[sad]`, `[angry]`, `[nervous]`, `[frustrated]`, `[tired]`, `[sigh]`, `[whisper]`, `[happily]`, `[serious]`
### Voice Settings for Expressiveness
| Setting | Clone-like | Expressive (recommended) |
|---|---|---|
| stability | 0.34 | 0.15-0.2 |
| similarity-boost | 0.91 | 0.6 |
| style | 0.49 | 0.7-0.85 |
Use `--list-voices` to see available voices.
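For example (assuming the script reads `ELEVENLABS_API_KEY` from the environment, as in the TTS call above):

```bash
# Print the voices available on your ElevenLabs account
python3 {avatar-video-from-text}/scripts/tts_elevenlabs_v3.py --list-voices
```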
## Step 3: Face-Consistent Images (Nano Banana)

```bash
uv run --with fal-client {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face ./face_reference.png \
  --prompt "Description of scene" \
  --output scene_image.png
```

**Critical:** image size must be <5 MB for OmniHuman/Kling.
```bash
# Resize if needed (GNU stat shown; on macOS use `stat -f%z`)
size=$(stat -c%s image.png)
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" image_resized.png
  mv image_resized.png image.png
fi
```

### Different backgrounds for variety

Generate multiple images with the same face but different environments (office, cafe, studio, etc.) for scene changes.
## Step 4: OmniHuman 1.5 Lipsync

```bash
uv run --with fal-client python3 {avatar-video-from-text}/scripts/omnihuman_lipsync.py \
  --image scene_image.png \
  --audio scene_audio.mp3 \
  --output scene_lipsync.mp4
```

Cost: ~$0.16/sec. Run multiple scenes in parallel.
## Step 5: Kling v3 Motion Enhancement

```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image scene_image.png \
  --video scene_lipsync.mp4 \
  --orientation video \
  --prompt "description matching the scene" \
  --out scene_kling.mp4
```

### Important constraints
- Minimum video duration: 3 seconds — Kling rejects shorter clips
- Kling strips audio — must mux original audio back afterwards:

```bash
ffmpeg -y -i scene_kling.mp4 -i scene_audio.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k -shortest \
  scene_final.mp4
```

### For long audio (>15s): split into chunks
Kling has duration limits. Split audio into ≤15s chunks, generate separate images for each, run OmniHuman+Kling on each, then concat.
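One way to do the split is ffmpeg's segment muxer (stream copy, so chunk boundaries land on the nearest frame):

```bash
# Split narration into ~15 s chunks without re-encoding
ffmpeg -y -i long_audio.mp3 -f segment -segment_time 15 -c copy chunk_%03d.mp3
```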
## Step 6: Assembly

### Normalize all clips

Every clip must have an identical format before concat:
```bash
ffmpeg -y -i input.mp4 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  output_ready.mp4
```

**Critical:** all clips must be stereo (`-ac 2`). Mixing mono and stereo clips breaks concat audio.
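To apply this to every scene in one pass (a sketch, assuming the hypothetical `s*.mp4` naming used in the concat example below):

```bash
# Re-encode each clip to the shared format, writing *_ready.mp4 alongside
for f in s*.mp4; do
  ffmpeg -y -i "$f" \
    -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
    -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
    "${f%.mp4}_ready.mp4"
done
```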
### Concat

```bash
printf "file 's1.mp4'\nfile 's2.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy final.mp4
```
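If the clips sort correctly by name, the list file can also be generated (same hypothetical naming as above):

```bash
# Emit one `file '...'` line per normalized clip, in shell-glob order
for f in s*_ready.mp4; do printf "file '%s'\n" "$f"; done > concat.txt
```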
### PiP overlay (e.g., product screenshot)

```bash
ffmpeg -y -i video.mp4 -i overlay.png \
  -filter_complex "[1:v]scale=700:-1[pip];[0:v][pip]overlay=(W-w)/2:H-h-80:enable='gte(t,START_TIME)'" \
  -c:v libx264 -c:a copy output.mp4
```

## Step 7: Captions (Forgedemy Style)
### Transcribe with word timestamps

```bash
ELEVENLABS_API_KEY=... python3 \
  {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input final_audio.wav --out-dir ./
```

### Caption style
- Regular text: light gray `#CCCCCC`
- Key words highlighted: orange `#E8720C`
- Font: bold, size 62-68
- Black outline (borderwidth 4)
- Position: bottom third of screen
- Groups of 3-5 words per caption
### Important: Always transcribe the FINAL assembled video

Transcribing individual clips and then offsetting timestamps causes sync issues. Transcribe after full assembly.
## Cost Estimate (60s video, 5 AI scenes)
| Step | Cost |
|---|---|
| ElevenLabs TTS (5 clips) | ~$0.50 |
| Nano Banana images (5-8) | ~$1.00 |
| OmniHuman lipsync (~40s) | ~$6.40 |
| Kling enhancement (~40s) | ~$4.00 |
| Total | ~$12 |
## Step 8: Background Music (ElevenLabs Music API)

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Modern tech startup promo, upbeat minimal electronic, inspiring, instrumental only",
    "duration_seconds": 65,
    "instrumental": true
  }' --output bgm.mp3
```

### Mixing music with video (approach from the digitalsamba toolkit)
```bash
# Calculate fade-out start
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out_start=$(python3 -c "print(round($dur - 3, 2))")

ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
[0:a]volume=1.0[voice];\
[1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out_start:d=3[bgm];\
[voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" \
  -c:v copy -c:a aac -b:a 192k output.mp4
```

Key settings:
- Music volume: `0.056` ≈ -25 dB (barely audible, doesn't overpower the voice)
- Fade-in: 2s at start
- Fade-out: 3s at end
- `dropout_transition=0`: prevents volume pumping when mixing
### Sound Effects (ElevenLabs SFX V2 API)
```bash
curl -s -X POST "https://api.elevenlabs.io/v1/sound-generation" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "description", "duration_seconds": 1.5}' \
  --output sfx.mp3
```

Mix SFX at -25 dB (`volume=0.056`) and use `adelay=MILLISECONDS|MILLISECONDS` for precise timing.
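For instance, to drop one effect in at t = 3.5 s (a sketch; the 3500 ms offset is illustrative):

```bash
# adelay takes milliseconds per channel; amix keeps the voice track's duration
ffmpeg -y -i video.mp4 -i sfx.mp3 \
  -filter_complex "[1:a]adelay=3500|3500,volume=0.056[sfx];\
[0:a][sfx]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" -c:v copy -c:a aac -b:a 192k output_sfx.mp4
```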
## Step 9: Captions (ElevenLabs Transcription)

- Extract audio: `ffmpeg -i final.mp4 -vn -ac 1 -ar 16000 audio.wav`
- Transcribe with word timestamps using the `record-transcribe-revoice` skill
- Build ASS subtitles from the word data (groups of 3-5 words)
- Burn: `ffmpeg -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -c:a copy output.mp4`
### Caption styling (Forgedemy brand)

- Regular text: `&H00CCCCCC` (light gray)
- Key words: `&H000C72E8` (orange #E8720C in ASS BGR format)
- Font: Arial Bold, size 62-68
- Black outline: borderwidth 4
- Position: bottom third
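A minimal `captions.ass` sketch matching these specs (margins, timings, and the sample text are hypothetical; the highlight uses an inline `\c` override with the BGR value above):

```bash
# Write a minimal ASS file; Alignment=2 is bottom-center, MarginV pushes it into the bottom third
cat > captions.ass <<'EOF'
[Script Info]
ScriptType: v4.00+
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,64,&H00CCCCCC,&H00CCCCCC,&H00000000,&H00000000,1,0,0,0,100,100,0,0,1,4,0,2,60,60,420,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.00,0:00:01.20,Default,,0,0,0,,Ship videos {\c&H000C72E8&}10x faster{\c}
EOF
```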
### Important: Transcription quirks

- "Forgedemy" may be transcribed as "Forge to me" — fix it in the ASS file with `sed`, as shown below
- Always transcribe the FINAL assembled video, not individual clips
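A sketch of the fix (GNU sed shown; BSD/macOS sed wants `-i ''`):

```bash
# Replace every mis-transcribed brand mention in place
sed -i 's/Forge to me/Forgedemy/g' captions.ass
```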
## Recording & Cutting Takes

### Workflow: record once, pick best takes
- Record all phrases in one continuous take — multiple attempts per phrase are fine
- Transcribe with ElevenLabs word timestamps:

  ```bash
  ffmpeg -i recording.webm -vn -ac 1 -ar 16000 audio.wav
  ELEVENLABS_API_KEY=... python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
    --input audio.wav --out-dir ./
  ```

- Read `sentences.json` to find all takes of each phrase
- Pick the cleanest take (no stutters, good energy)
- Use word timestamps for precise cuts — trim right after the last word ends (+0.3s buffer)
- Cut with ffmpeg (see the sketch after this list):

  ```bash
  ffmpeg -i recording.mp4 -ss START -to END -vf "..." output.mp4
  ```
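For one take, that might look like this (the timestamps are hypothetical values read from `sentences.json`):

```bash
# Cut a single take; end = last word's end time + 0.3 s buffer
start=12.40
end=$(python3 -c "print(round(14.82 + 0.3, 2))")
ffmpeg -y -i recording.mp4 -ss "$start" -to "$end" \
  -c:v libx264 -pix_fmt yuv420p -c:a aac take_01.mp4
```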
### Cutting rules
- Cut right after last word ends (+0.3s max buffer) — don't leave pauses where you look away
- If speaker stutters at start of a phrase ("So three to se-- so three skills"), skip to the clean start
- Pauses >0.5s between scenes should be trimmed
- Always check the last frame — if speaker looks away or down, trim earlier
## Audio Normalization
Normalize all clips to -16 LUFS (TikTok/Instagram standard) before concat:
```bash
ffmpeg -i clip.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" -c:v copy -c:a aac -b:a 192k clip_norm.mp4
```
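To run it across all clips in one pass (same hypothetical `s*_ready.mp4` naming as in Step 6):

```bash
# Loudness-normalize each clip's audio; video is stream-copied untouched
for f in s*_ready.mp4; do
  ffmpeg -y -i "$f" -af "loudnorm=I=-16:LRA=11:TP=-1" \
    -c:v copy -c:a aac -b:a 192k "${f%.mp4}_norm.mp4"
done
```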
## Notes

- Record at your target FPS (30 or 60) — upscaling doesn't help
- Always encode all clips with `-ac 2` (stereo) before concat
- Resize images to <5 MB before OmniHuman/Kling
- Normalize audio to -16 LUFS before concat — clips from different sources have very different levels
- Transcribe the final assembled video for captions, not individual clips