---
name: tiktok-promo-video
description: "Create a TikTok promo video combining real recorded clips, AI-generated talking head scenes (Nano Banana + OmniHuman + Kling), AI avatar presenter, TTS voiceover, and styled captions. Full pipeline from script to final video."
---

# TikTok Promo Video Pipeline

End-to-end pipeline for creating TikTok-style promo videos that combine real footage with AI-generated content.

## Pipeline Overview

```
Script → TTS (ElevenLabs v3) → Face images (Nano Banana) → Lipsync (OmniHuman 1.5) → Enhance (Kling v3) → Captions → Final
```

## Prerequisites

- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (or in a `.env` file)
- Dependent skills (install via Forgedemy): `nano-banana-fal`, `avatar-video-from-text`, `kling-motion-control`, `record-transcribe-revoice`

## Step 1: Script & Structure

Plan scenes by type:

- **Real video**: record with `record-transcribe-revoice/assets/camera-recorder.html`
- **AI lipsync**: generated face + TTS voice → OmniHuman → Kling
- **AI avatar**: a different character with different backgrounds/sets

## Step 2: TTS with ElevenLabs v3

```bash
python3 {avatar-video-from-text}/scripts/tts_elevenlabs_v3.py \
  --voice-name "Voice Name" \
  --text "[excited] Your text here with audio tags!" \
  --stability 0.15 --similarity-boost 0.6 --style 0.85 \
  --output scene_audio.mp3
```

### Audio Tags (ElevenLabs v3)

Control emotion with tags in square brackets:

- `[excited]`, `[sad]`, `[angry]`, `[nervous]`, `[frustrated]`, `[tired]`
- `[sigh]`, `[whisper]`, `[happily]`, `[serious]`

### Voice Settings for Expressiveness

| Setting | Clone-like | Expressive (recommended) |
|---------|-----------|--------------------------|
| stability | 0.34 | 0.15-0.2 |
| similarity-boost | 0.91 | 0.6 |
| style | 0.49 | 0.7-0.85 |

Use `--list-voices` to see available voices.

## Step 3: Face-Consistent Images (Nano Banana)

```bash
uv run --with fal-client {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face ./face_reference.png \
  --prompt "Description of scene" \
  --output scene_image.png
```

### Critical: image size must be <5 MB for OmniHuman/Kling

```bash
# Downscale to 1080x1920 if the file is over the 5 MB limit
size=$(stat -c%s image.png)
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" image_resized.png
  mv image_resized.png image.png
fi
```

### Different backgrounds for variety

Generate multiple images with the same face but different environments (office, cafe, studio, etc.) for scene changes.

## Step 4: OmniHuman 1.5 Lipsync

```bash
uv run --with fal-client python3 {avatar-video-from-text}/scripts/omnihuman_lipsync.py \
  --image scene_image.png \
  --audio scene_audio.mp3 \
  --output scene_lipsync.mp4
```

Cost: ~$0.16/sec.
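Because each scene's lipsync job is independent, the step above can be fanned out across a thread pool. A minimal sketch, assuming the `omnihuman_lipsync.py` flags shown above and a hypothetical `<scene>_image.png` / `<scene>_audio.mp3` naming convention; the script path uses the same `{skill}` placeholder notation as the rest of this document:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder path, matching the {skill} notation used in this document
SCRIPT = "{avatar-video-from-text}/scripts/omnihuman_lipsync.py"

def lipsync_cmd(scene: str) -> list[str]:
    """Build the OmniHuman command for one scene (same flags as above)."""
    return [
        "uv", "run", "--with", "fal-client", "python3", SCRIPT,
        "--image", f"{scene}_image.png",
        "--audio", f"{scene}_audio.mp3",
        "--output", f"{scene}_lipsync.mp4",
    ]

def lipsync_all(scenes: list[str], workers: int = 4) -> None:
    """Run all scene jobs concurrently; raises if any job fails."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in pool.map(lambda s: subprocess.run(lipsync_cmd(s), check=True), scenes):
            pass
```

At ~$0.16/sec the total cost is the same either way; parallelism only cuts wall-clock time.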
Run multiple scenes in parallel.

## Step 5: Kling v3 Motion Enhancement

```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image scene_image.png \
  --video scene_lipsync.mp4 \
  --orientation video \
  --prompt "description matching the scene" \
  --out scene_kling.mp4
```

### Important constraints

- **Minimum video duration: 3 seconds** — Kling rejects shorter clips.
- **Kling strips audio** — mux the original audio back in afterwards:

```bash
ffmpeg -y -i scene_kling.mp4 -i scene_audio.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k -shortest \
  scene_final.mp4
```

### For long audio (>15s): split into chunks

Kling has duration limits. Split the audio into ≤15s chunks, generate a separate image for each, run OmniHuman + Kling on each chunk, then concat.

## Step 6: Assembly

### Normalize all clips

Every clip must have an identical format before concat:

```bash
ffmpeg -y -i input.mp4 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  output_ready.mp4
```

**Critical**: all clips must be **stereo** (`-ac 2`). Mixing mono and stereo clips breaks concat audio.

### Concat

```bash
printf "file 's1.mp4'\nfile 's2.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy final.mp4
```

### PiP overlay (e.g., product screenshot)

```bash
ffmpeg -y -i video.mp4 -i overlay.png \
  -filter_complex "[1:v]scale=700:-1[pip];[0:v][pip]overlay=(W-w)/2:H-h-80:enable='gte(t,START_TIME)'" \
  -c:v libx264 -c:a copy output.mp4
```

## Step 7: Captions (Forgedemy Style)

### Transcribe with word timestamps

```bash
ELEVENLABS_API_KEY=... python3 \
  {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input final_audio.wav --out-dir ./
```

### Caption style

- Regular text: light gray `#CCCCCC`
- Key words highlighted: orange `#E8720C`
- Font: bold, size 62-68
- Black outline (borderwidth 4)
- Position: bottom third of screen
- Groups of 3-5 words per caption

### Important: always transcribe the FINAL assembled video

Transcribing individual clips and then offsetting timestamps causes sync issues. Transcribe after full assembly.

## Cost Estimate (60s video, 5 AI scenes)

| Step | Cost |
|------|------|
| ElevenLabs TTS (5 clips) | ~$0.50 |
| Nano Banana images (5-8) | ~$1.00 |
| OmniHuman lipsync (~40s) | ~$6.40 |
| Kling enhancement (~40s) | ~$4.00 |
| **Total** | **~$12** |

## Step 8: Background Music (ElevenLabs Music API)

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Modern tech startup promo, upbeat minimal electronic, inspiring, instrumental only",
    "duration_seconds": 65,
    "instrumental": true
  }' --output bgm.mp3
```

### Mixing music with video (approach from the digitalsamba toolkit)

```bash
# Calculate where the fade-out should start (3s before the end)
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out_start=$(python3 -c "print(round($dur - 3, 2))")

ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
[0:a]volume=1.0[voice];\
[1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out_start:d=3[bgm];\
[voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" \
  -c:v copy -c:a aac -b:a 192k output.mp4
```

Key settings:

- **Music volume**: `0.056` ≈ -25 dB (barely audible, doesn't overpower the voice)
- **Fade-in**: 2s at the start
- **Fade-out**: 3s at the end
- **dropout_transition=0**: prevents volume pumping when mixing

### Sound Effects (ElevenLabs SFX V2 API)

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/sound-generation" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text": "description", "duration_seconds": 1.5}' \
  --output sfx.mp3
```

Mix SFX at -25 dB (`volume=0.056`) and use `adelay=MILLISECONDS|MILLISECONDS` for precise timing.

## Step 9: Captions (ElevenLabs Transcription)

1. Extract audio: `ffmpeg -i final.mp4 -vn -ac 1 -ar 16000 audio.wav`
2. Transcribe with word timestamps using the `record-transcribe-revoice` skill
3. Build ASS subtitles from the word data (groups of 3-5 words)
4. Burn in: `ffmpeg -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -c:a copy output.mp4`

### Caption styling (Forgedemy brand)

- Regular text: `&H00CCCCCC` (light gray)
- Key words: `&H000C72E8` (orange #E8720C in ASS BGR format)
- Font: Arial Bold, size 62-68
- Black outline: borderwidth 4
- Position: bottom third

### Important: transcription quirks

- "Forgedemy" may be transcribed as "Forge to me" — fix it in the ASS file with sed
- Always transcribe the FINAL assembled video, not individual clips

## Recording & Cutting Takes

### Workflow: record once, pick the best takes

1. Record all phrases in one continuous take — multiple attempts per phrase are fine
2. Transcribe with ElevenLabs word timestamps:

   ```bash
   ffmpeg -i recording.webm -vn -ac 1 -ar 16000 audio.wav
   ELEVENLABS_API_KEY=... python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
     --input audio.wav --out-dir ./
   ```

3. Read `sentences.json` to find all takes of each phrase
4. Pick the cleanest take (no stutters, good energy)
5. Use word timestamps for precise cuts — trim right after the last word ends (+0.3s buffer)
6. Cut with ffmpeg: `ffmpeg -i recording.mp4 -ss START -to END -vf "..." output.mp4`

### Cutting rules

- Cut right after the last word ends (+0.3s max buffer) — don't leave pauses where you look away
- If the speaker stutters at the start of a phrase ("So three to se-- so three skills"), skip to the clean start
- Trim pauses >0.5s between scenes
- Always check the last frame — if the speaker looks away or down, trim earlier

## Audio Normalization

Normalize all clips to -16 LUFS (the TikTok/Instagram standard) before concat:

```bash
ffmpeg -i clip.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" -c:v copy -c:a aac -b:a 192k clip_norm.mp4
```

## Notes

- Record at your target FPS (30 or 60) — upscaling doesn't help
- Always encode all clips with `-ac 2` (stereo) before concat
- Resize images to <5 MB before OmniHuman/Kling
- Normalize audio to -16 LUFS before concat — clips from different sources have very different levels
- Transcribe the final assembled video for captions, not individual clips
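The ASS-building step in Step 9 (grouping word timestamps into 3-5 word captions with keyword highlights) can be sketched in Python. The word-timestamp shape (`{"text": ..., "start": ..., "end": ...}`) is an assumption about the transcription output, not the skill's actual schema, and the `[V4+ Styles]` section is abbreviated (a real file lists every V4+ field); colors and sizing follow the Forgedemy style above:

```python
# Sketch: group word timestamps into 3-5 word ASS caption events.
# Assumed input shape: [{"text": str, "start": float, "end": float}, ...]
# NOTE: abbreviated Style section; a full file lists all V4+ Format fields.

ASS_HEADER = """[Script Info]
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, Outline, Alignment, MarginV
Style: Caption,Arial Bold,64,&H00CCCCCC,&H00000000,4,2,400

[Events]
Format: Layer, Start, End, Style, Text
"""

def ass_time(seconds: float) -> str:
    """Format seconds as an ASS H:MM:SS.cc timestamp."""
    cs = int(round(seconds * 100))
    h, rem = divmod(cs, 360000)
    m, rem = divmod(rem, 6000)
    s, cs = divmod(rem, 100)
    return f"{h}:{m:02d}:{s:02d}.{cs:02d}"

def build_ass(words, keywords=(), group_size=4) -> str:
    """Emit one Dialogue line per word group; keywords get the orange override."""
    lines = [ASS_HEADER]
    for i in range(0, len(words), group_size):
        group = words[i:i + group_size]
        text = " ".join(
            # Inline override: switch to orange (#E8720C → BGR 0C72E8), then back to gray
            f"{{\\c&H0C72E8&}}{w['text']}{{\\c&HCCCCCC&}}" if w["text"].lower() in keywords
            else w["text"]
            for w in group
        )
        lines.append(
            f"Dialogue: 0,{ass_time(group[0]['start'])},{ass_time(group[-1]['end'])},Caption,{text}"
        )
    return "\n".join(lines) + "\n"
```

Write the result to `captions.ass` and burn it in with the `ass=` filter shown in Step 9.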
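The SFX timing note in Step 8 (`adelay` in milliseconds, `volume=0.056`) can be scripted. This sketch only builds the `-filter_complex` string for dropping SFX clips onto a video's audio track; the helper itself and the input layout (input 0 is the video, inputs 1..N are SFX files) are assumptions, not part of the skill:

```python
# Sketch: build an ffmpeg filter_complex that mixes SFX files into a video's
# audio track at given offsets, each attenuated to -25 dB (volume=0.056).

def sfx_filter(offsets_s, volume=0.056):
    """offsets_s: start times in seconds for SFX inputs 1..N (input 0 is the video).

    Returns (filter_complex, output_label) for use as:
      ffmpeg -i video.mp4 -i sfx1.mp3 ... -filter_complex FC -map 0:v -map "[aout]" ...
    """
    parts, labels = [], ["[0:a]"]
    for i, start in enumerate(offsets_s, start=1):
        ms = int(round(start * 1000))
        # adelay takes one delay per channel, hence "ms|ms" for stereo
        parts.append(f"[{i}:a]volume={volume},adelay={ms}|{ms}[sfx{i}]")
        labels.append(f"[sfx{i}]")
    n = len(labels)
    parts.append(f"{''.join(labels)}amix=inputs={n}:duration=first:dropout_transition=0[aout]")
    return ";".join(parts), "[aout]"
```

Example: `sfx_filter([1.5, 8.0])` yields a graph that delays the first SFX by 1500 ms, the second by 8000 ms, and mixes both with the voice track using the same `amix` settings as the music mix above.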