# TikTok Promo Video Pipeline

End-to-end pipeline for creating TikTok-style promo videos that combine real footage with AI-generated content.
## Pipeline Overview

Script → TTS (ElevenLabs v3) → Face images (Nano Banana) → Lipsync (OmniHuman 1.5) → Enhance (Kling v3) → Captions → Final

## Prerequisites

- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (or in a `.env` file)
- Dependent skills (install via Forgedemy): `nano-banana-fal`, `avatar-video-from-text`, `kling-motion-control`, `record-transcribe-revoice`
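A minimal sanity check before spending API credits (a sketch; adjust to your setup):

```bash
# Verify required tools, env vars, and the fal-client package are present
for tool in python3 ffmpeg ffprobe; do
  command -v "$tool" >/dev/null || { echo "missing: $tool"; exit 1; }
done
[ -n "${FAL_AI_KEY:-${FAL_KEY:-}}" ] || { echo "missing: FAL_AI_KEY/FAL_KEY"; exit 1; }
[ -n "${ELEVENLABS_API_KEY:-}" ] || { echo "missing: ELEVENLABS_API_KEY"; exit 1; }
python3 -c "import fal_client" 2>/dev/null || { echo "missing: fal-client package"; exit 1; }
```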
## Step 1: Script & Structure

Plan scenes with types:

- Real video: record with `record-transcribe-revoice/assets/camera-recorder.html`
- AI lipsync: generated face + TTS voice → OmniHuman → Kling
- AI avatar: a different character with different backgrounds/sets
## Step 2: TTS with ElevenLabs v3

```bash
python3 {avatar-video-from-text}/scripts/tts_elevenlabs_v3.py \
  --voice-name "Voice Name" \
  --text "[excited] Your text here with audio tags!" \
  --stability 0.15 --similarity-boost 0.6 --style 0.85 \
  --output scene_audio.mp3
```

### Audio Tags (ElevenLabs v3)
Control emotion with tags in square brackets:

`[excited]`, `[sad]`, `[angry]`, `[nervous]`, `[frustrated]`, `[tired]`, `[sigh]`, `[whisper]`, `[happily]`, `[serious]`
### Voice Settings for Expressiveness
| Setting | Clone-like | Expressive (recommended) |
|---|---|---|
| stability | 0.34 | 0.15-0.2 |
| similarity-boost | 0.91 | 0.6 |
| style | 0.49 | 0.7-0.85 |
Use `--list-voices` to see available voices.
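For example (assuming the script reads `ELEVENLABS_API_KEY` from the environment, as in the TTS call above):

```bash
# Print the voices available on your ElevenLabs account
python3 {avatar-video-from-text}/scripts/tts_elevenlabs_v3.py --list-voices
```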
## Step 3: Face-Consistent Images (Nano Banana)

```bash
uv run --with fal-client {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face ./face_reference.png \
  --prompt "Description of scene" \
  --output scene_image.png
```

**Critical:** image size must be <5 MB for OmniHuman/Kling.
```bash
# Resize if needed (GNU stat shown; on macOS use `stat -f%z`)
size=$(stat -c%s image.png)
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" image_resized.png
  mv image_resized.png image.png
fi
```

### Different backgrounds for variety

Generate multiple images with the same face but different environments (office, cafe, studio, etc.) for scene changes.
## Step 4: OmniHuman 1.5 Lipsync

```bash
uv run --with fal-client python3 {avatar-video-from-text}/scripts/omnihuman_lipsync.py \
  --image scene_image.png \
  --audio scene_audio.mp3 \
  --output scene_lipsync.mp4
```

Cost: ~$0.16/sec. Run multiple scenes in parallel.
## Step 5: Kling v3 Motion Enhancement

```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image scene_image.png \
  --video scene_lipsync.mp4 \
  --orientation video \
  --prompt "description matching the scene" \
  --out scene_kling.mp4
```

### Important constraints
- Minimum video duration: 3 seconds — Kling rejects shorter clips
- Kling strips audio — must mux original audio back afterwards:

```bash
ffmpeg -y -i scene_kling.mp4 -i scene_audio.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k -shortest \
  scene_final.mp4
```

### For long audio (>15s): split into chunks
Kling has duration limits. Split audio into ≤15s chunks, generate separate images for each, run OmniHuman+Kling on each, then concat.
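One way to do the split is ffmpeg's segment muxer (stream copy, so chunk boundaries land on the nearest frame):

```bash
# Split narration into ~15 s chunks without re-encoding
ffmpeg -y -i long_audio.mp3 -f segment -segment_time 15 -c copy chunk_%03d.mp3
```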
## Step 6: Assembly

### Normalize all clips

Every clip must have an identical format before concat:
```bash
ffmpeg -y -i input.mp4 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  output_ready.mp4
```

**Critical:** all clips must be stereo (`-ac 2`). Mixing mono and stereo clips breaks concat audio.
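To apply this to every scene in one pass (a sketch, assuming the hypothetical `s*.mp4` naming used in the concat example below):

```bash
# Re-encode each clip to the shared format, writing *_ready.mp4 alongside
for f in s*.mp4; do
  ffmpeg -y -i "$f" \
    -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
    -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
    "${f%.mp4}_ready.mp4"
done
```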
### Concat

```bash
printf "file 's1.mp4'\nfile 's2.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy final.mp4
```
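If the clips sort correctly by name, the list file can also be generated (same hypothetical naming as above):

```bash
# Emit one `file '...'` line per normalized clip, in shell-glob order
for f in s*_ready.mp4; do printf "file '%s'\n" "$f"; done > concat.txt
```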
### PiP overlay (e.g., product screenshot)

```bash
ffmpeg -y -i video.mp4 -i overlay.png \
  -filter_complex "[1:v]scale=700:-1[pip];[0:v][pip]overlay=(W-w)/2:H-h-80:enable='gte(t,START_TIME)'" \
  -c:v libx264 -c:a copy output.mp4
```

## Step 7: Captions (Forgedemy Style)
### Transcribe with word timestamps

```bash
ELEVENLABS_API_KEY=... python3 \
  {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input final_audio.wav --out-dir ./
```

### Caption style
- Regular text: light gray `#CCCCCC`
- Key words highlighted: orange `#E8720C`
- Font: bold, size 62-68
- Black outline (borderwidth 4)
- Position: bottom third of screen
- Groups of 3-5 words per caption
### Important: Always transcribe the FINAL assembled video

Transcribing individual clips and then offsetting timestamps causes sync issues. Transcribe after full assembly.
## Cost Estimate (60s video, 5 AI scenes)
| Step | Cost |
|---|---|
| ElevenLabs TTS (5 clips) | ~$0.50 |
| Nano Banana images (5-8) | ~$1.00 |
| OmniHuman lipsync (~40s) | ~$6.40 |
| Kling enhancement (~40s) | ~$4.00 |
| Total | ~$12 |
## Step 8: Background Music (ElevenLabs Music API)

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Modern tech startup promo, upbeat minimal electronic, inspiring, instrumental only",
    "duration_seconds": 65,
    "instrumental": true
  }' --output bgm.mp3
```

### Mixing music with video (approach from the digitalsamba toolkit)
```bash
# Calculate fade-out start
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out_start=$(python3 -c "print(round($dur - 3, 2))")

ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
[0:a]volume=1.0[voice];\
[1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out_start:d=3[bgm];\
[voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" \
  -c:v copy -c:a aac -b:a 192k output.mp4
```

Key settings:
- Music volume: `0.056` ≈ -25 dB (barely audible, doesn't overpower the voice)
- Fade-in: 2s at start
- Fade-out: 3s at end
- `dropout_transition=0`: prevents volume pumping when mixing
### Sound Effects (ElevenLabs SFX V2 API)
```bash
curl -s -X POST "https://api.elevenlabs.io/v1/sound-generation" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "description", "duration_seconds": 1.5}' \
  --output sfx.mp3
```

Mix SFX at -25 dB (`volume=0.056`) and use `adelay=MILLISECONDS|MILLISECONDS` for precise timing.
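For instance, to drop one effect in at t = 3.5 s (a sketch; the 3500 ms offset is illustrative):

```bash
# adelay takes milliseconds per channel; amix keeps the voice track's duration
ffmpeg -y -i video.mp4 -i sfx.mp3 \
  -filter_complex "[1:a]adelay=3500|3500,volume=0.056[sfx];\
[0:a][sfx]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" -c:v copy -c:a aac -b:a 192k output_sfx.mp4
```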
## Step 9: Captions (ElevenLabs Transcription)

- Extract audio: `ffmpeg -i final.mp4 -vn -ac 1 -ar 16000 audio.wav`
- Transcribe with word timestamps using the `record-transcribe-revoice` skill
- Build ASS subtitles from the word data (groups of 3-5 words)
- Burn: `ffmpeg -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -c:a copy output.mp4`
### Caption styling (Forgedemy brand)

- Regular text: `&H00CCCCCC` (light gray)
- Key words: `&H000C72E8` (orange #E8720C in ASS BGR format)
- Font: Arial Bold, size 62-68
- Black outline: borderwidth 4
- Position: bottom third
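A minimal `captions.ass` sketch matching these specs (margins, timings, and the sample text are hypothetical; the highlight uses an inline `\c` override with the BGR value above):

```bash
# Write a minimal ASS file; Alignment=2 is bottom-center, MarginV pushes it into the bottom third
cat > captions.ass <<'EOF'
[Script Info]
ScriptType: v4.00+
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Default,Arial,64,&H00CCCCCC,&H00CCCCCC,&H00000000,&H00000000,1,0,0,0,100,100,0,0,1,4,0,2,60,60,420,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.00,0:00:01.20,Default,,0,0,0,,Ship videos {\c&H000C72E8&}10x faster{\c}
EOF
```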
### Important: Transcription quirks

- "Forgedemy" may be transcribed as "Forge to me" — fix it in the ASS file with `sed`, as shown below
- Always transcribe the FINAL assembled video, not individual clips
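A sketch of the fix (GNU sed shown; BSD/macOS sed wants `-i ''`):

```bash
# Replace every mis-transcribed brand mention in place
sed -i 's/Forge to me/Forgedemy/g' captions.ass
```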
## Recording & Cutting Takes

### Workflow: record once, pick best takes
- Record all phrases in one continuous take — multiple attempts per phrase are fine
- Transcribe with ElevenLabs word timestamps:

  ```bash
  ffmpeg -i recording.webm -vn -ac 1 -ar 16000 audio.wav
  ELEVENLABS_API_KEY=... python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
    --input audio.wav --out-dir ./
  ```

- Read `sentences.json` to find all takes of each phrase
- Pick the cleanest take (no stutters, good energy)
- Use word timestamps for precise cuts — trim right after the last word ends (+0.3s buffer)
- Cut with ffmpeg (see the sketch after this list):

  ```bash
  ffmpeg -i recording.mp4 -ss START -to END -vf "..." output.mp4
  ```
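For one take, that might look like this (the timestamps are hypothetical values read from `sentences.json`):

```bash
# Cut a single take; end = last word's end time + 0.3 s buffer
start=12.40
end=$(python3 -c "print(round(14.82 + 0.3, 2))")
ffmpeg -y -i recording.mp4 -ss "$start" -to "$end" \
  -c:v libx264 -pix_fmt yuv420p -c:a aac take_01.mp4
```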
### Cutting rules
- Cut right after last word ends (+0.3s max buffer) — don't leave pauses where you look away
- If speaker stutters at start of a phrase ("So three to se-- so three skills"), skip to the clean start
- Pauses >0.5s between scenes should be trimmed
- Always check the last frame — if speaker looks away or down, trim earlier
## Audio Normalization
Normalize all clips to -16 LUFS (TikTok/Instagram standard) before concat:
```bash
ffmpeg -i clip.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" -c:v copy -c:a aac -b:a 192k clip_norm.mp4
```
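To run it across all clips in one pass (same hypothetical `s*_ready.mp4` naming as in Step 6):

```bash
# Loudness-normalize each clip's audio; video is stream-copied untouched
for f in s*_ready.mp4; do
  ffmpeg -y -i "$f" -af "loudnorm=I=-16:LRA=11:TP=-1" \
    -c:v copy -c:a aac -b:a 192k "${f%.mp4}_norm.mp4"
done
```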
## Notes

- Record at your target FPS (30 or 60) — upscaling doesn't help
- Always encode all clips with `-ac 2` (stereo) before concat
- Resize images to <5 MB before OmniHuman/Kling
- Normalize audio to -16 LUFS before concat — clips from different sources have very different levels
- Transcribe the final assembled video for captions, not individual clips