---
name: face-swap-tiktok
description: "Replace yourself in filmed TikTok scenes with an AI-generated character using Nano Banana face-consistent image generation, Kling v3 motion transfer, ElevenLabs speech-to-speech voice swap, and dynamic cut editing. Full pipeline from recording to final video."
---

# Face-Swap TikTok Pipeline

Take filmed TikTok scenes of yourself and replace the presenter with a consistent AI-generated character, preserving the original motion with a new voice.

## Prerequisites

- Dependent skills: `nano-banana-fal`, `kling-motion-control`, `record-transcribe-revoice`, `tiktok-promo-video`
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (find it in the user's `.env` files)
- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package (auto-installed via `uv run --with fal-client`)

## Pipeline Overview

```
Record → Transcribe (word timestamps) → Cut stutters → Combine draft
→ Review draft → Speech-to-speech (voice swap) → Nano Banana (face swap per scene)
→ Review images → Kling v3 (motion transfer) → Mux voice audio
→ Cut pauses → Re-transcribe → Burn captions → Mix music → Final
```

## Step 0: Ask the User

Before starting, ask:

- **Which voice** to use for speech-to-speech (list the available ones with `--list-voices`)
- **Which face reference** image to use for the character
- **Captions**: burned into the video, or rely on TikTok's auto-captions?
- The user should confirm the script/lines before recording

### CRITICAL: Gate every phase with user review

Never proceed to the next step without user confirmation. Open all outputs with the default viewer at every phase:

- Draft combined video → user reviews cuts
- Nano Banana images → user reviews all before Kling
- Final assembled video → user reviews before captions/music
- Final with captions + music → user confirms before posting

Use `python3 -c "import webbrowser; webbrowser.open('<path>')"` or the platform-native open command.

## Step 1: Record & Transcribe

Record all lines in one continuous take — multiple attempts per phrase are fine.

```bash
# Open the camera recorder (OS-agnostic)
# macOS: open {record-transcribe-revoice}/assets/camera-recorder.html
# Linux: xdg-open {record-transcribe-revoice}/assets/camera-recorder.html
# Windows: start {record-transcribe-revoice}/assets/camera-recorder.html
python3 -c "import webbrowser; webbrowser.open('{record-transcribe-revoice}/assets/camera-recorder.html')"

# After recording, extract the audio and transcribe with word timestamps
ffmpeg -y -i recording.webm -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./ --env-file <path-to-.env>
```

## Step 2: Cut Stutters & Build Draft

Use word-level timestamps to identify and remove stutters.
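Stutter candidates can also be flagged programmatically before the manual pass — a minimal sketch, assuming each transcript word entry has `text`, `start`, and `end` fields (adjust to the real transcript JSON):

```python
# Sketch: flag likely stutters in a word-timestamp list.
# The "text"/"start"/"end" field names are assumptions about the transcript JSON.

def find_stutter_candidates(words):
    flags = []
    prev = None
    for w in words:
        text = w["text"].strip()
        if text.endswith(("--", "...", "-")):
            # false start, word-level stutter, or trailing off
            flags.append((w["start"], text, "false start / trail-off"))
        if prev is not None and text.lower() == prev["text"].strip().lower():
            # speaker retried the same word back-to-back
            flags.append((prev["start"], text, "repeated word"))
        prev = w
    return flags

words = [
    {"text": "the", "start": 0.0, "end": 0.2},
    {"text": "Forge--", "start": 0.3, "end": 0.6},
    {"text": "the", "start": 0.9, "end": 1.1},
    {"text": "the", "start": 1.2, "end": 1.3},
    {"text": "Forgedemy", "start": 1.4, "end": 2.0},
]
for start, text, reason in find_stutter_candidates(words):
    print(f"{start:5.2f}s  {text!r}  {reason}")
```

This only surfaces candidates — review each before cutting, since legitimate hyphenated or trailing words will false-positive.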
When multiple takes exist, **use the last/best take**.

### CRITICAL: Stutter detection

Look for these patterns in the transcript:

- Words ending with `--` (e.g., "finish--", "Forge--") — false starts
- Repeated phrases — the speaker retrying a line
- Words ending with `...` — trailing off
- Partial words (e.g., "re-", "de-") — word-level stutters

### Cutting rules

- Cut right after the last clean word before the stutter
- Resume at the start of the clean retake
- When two takes exist, prefer the **second/later** take (better delivery)
- Check the last scene for trailing audio bleeding in from abandoned next takes
- Leave at most a 0.3s buffer around cuts

### CRITICAL: Always back up before overwriting

```bash
cp source.mp4 backup_source.mp4   # ALWAYS before destructive edits
```

### Combine into a draft and open for review

```bash
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy draft_combined.mp4
xdg-open draft_combined.mp4
```

## Step 3: Speech-to-Speech (Voice Swap)

Convert your voice to the target character's voice using ElevenLabs speech-to-speech.

```bash
python3 {record-transcribe-revoice}/scripts/speech_to_speech_elevenlabs.py \
  --input-audio scene_XX.wav \
  --output scene_XX_voice.mp3 \
  --voice-id <VOICE_ID> \
  --env-file <path-to-.env> \
  --file-format other \
  --stability 0.15 --similarity-boost 0.6 --style 0.85
```

### CRITICAL: The API requires `--file-format other`

The ElevenLabs STS API changed — `mp3_44100_128` is no longer valid. Always use `--file-format other`.

### CRITICAL: Speech-to-speech preserves stutters

STS converts the voice but keeps the speech rhythm.
**Cut ALL stutters from the source audio BEFORE running STS.** If stutters survive into the STS output, you'll need to recut the final video.

### Long audio (>15s) may produce the wrong voice

Split scenes longer than ~15s into two halves, process each separately, then concat:

```bash
ffmpeg -y -i long_scene.wav -to 10.0 part_a.wav
ffmpeg -y -i long_scene.wav -ss 10.0 part_b.wav
# Process each half, then concat
```

### CRITICAL: STS and Kling must use the SAME source cut

If you recut a scene after generating Kling motion, the audio timing won't match the lip movements. Always ensure both the STS audio and the Kling video come from the same draft scene file. If you recut, redo BOTH.

## Step 4: Generate Face-Swapped Images (Nano Banana 2)

### Generate scene 1 FIRST

Generate scene 1 alone, review it, then provide scene 1's generated image as additional context in the prompts for scenes 2+. Reference the same room, lighting, outfit, and character details from the approved scene 1 output. This dramatically improves cross-scene consistency.

### Prompt structure

Always include:

- Character description (hair, accessories, outfit)
- **The same posture** as the original frame (describe exactly what the person is doing)
- Setting description (match the original room exactly)
- Camera angle (match the original)
- End with **"vertical portrait photo"**

### CRITICAL: Avoid TikTok UI hallucination

Nano Banana will generate fake TikTok UI elements (hearts, share buttons, usernames, LIVE badges) if:

1. The **prompt** contains "TikTok style" or social media references — use "vertical portrait photo" instead
2. The **face reference image** has UI overlays — the model picks up visual context from the reference

Fix: crop the face reference down to just the face, or add "clean photo, no text overlays, no UI elements" to the prompt.

### Run ALL scenes in parallel

```bash
uv run --with fal-client python3 {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face <face_reference.png> \
  --prompt "<description>" \
  --output <scene_XX.png>
```

### Resize if needed (must be <5MB for Kling)

```bash
size=$(stat -c%s image.png)   # GNU stat; on macOS use: stat -f%z image.png
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" resized.png
  mv resized.png image.png
fi
```

### Open ALL images for review before proceeding

## Step 5: Kling v3 Motion Transfer

```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image <scene_XX.png> \
  --video <draft_scene_XX.mp4> \
  --orientation video \
  --prompt "<description>" \
  --out <scene_XX_kling.mp4>
```

### Run ALL in parallel

### Always use the latest Kling model

Check fal.ai for the latest Kling motion-control endpoint before running. Do not hardcode model versions — they update frequently.

### Constraints

- Minimum video duration: 3 seconds
- Kling strips audio — you must re-mux it afterwards
- For clips >15s, split into chunks first

## Step 6: Mux Voice Audio & Normalize

### CRITICAL: Never use `-shortest` — it truncates the audio

STS audio is often slightly longer than the Kling video. `-shortest` cuts off the end of the audio, losing the final words (e.g., "does it" gets dropped).
Instead, extend the video with `tpad` to match the audio length:

```bash
# Mux STS audio onto Kling video — extend video to match audio, NOT truncate audio
ffmpeg -y -i scene_XX_kling.mp4 -i scene_XX_voice.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black,tpad=stop_mode=clone:stop_duration=1" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  -map 0:v -map 1:a scene_XX_muxed.mp4

# Normalize to -16 LUFS
ffmpeg -y -i scene_XX_muxed.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" \
  -c:v copy -c:a aac -b:a 192k scene_XX_final.mp4
```

## Step 7: Assemble & Dynamic Cut

### Concat scenes

```bash
printf "file 'scene_01_final.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy assembled.mp4
```

### CRITICAL: Cut pauses for dynamic pacing

Transcribe the assembled video, then find all pauses >0.3s and cut them to ~0.15s:

1. Transcribe with word timestamps
2. Use the pauses JSON to find gaps
3. Present pauses to the user in `...word || word...` format with timestamps for review
4. Build keep-segments skipping the dead air
5. Re-encode and concat segments

This typically saves 5-10 seconds and makes the video much more dynamic.

### CRITICAL: Also cut stutters and repeated words/phrases

STS often preserves or introduces:

- Partial word stutters ("ju-", "de-", "re-")
- Repeated words ("you use, you use,")
- Trailing words that cut off ("just..." instead of "just does it")

Always do a **word-level review** of the transcription after assembly. Search for:

- Words ending in `-`, `--`, or `...`
- Consecutive identical words or short phrases
- Phrases that seem incomplete

### CRITICAL: Verify pause timestamps match the actual source

When cutting pauses, always verify the timestamps come from the **current** version of the file. Timestamps shift after every cut.
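The keep-segment step above can be sketched as follows — a minimal sketch, assuming each pause entry in the pauses JSON carries `start` and `end` times:

```python
# Sketch: turn detected pauses into keep-segments for ffmpeg trimming.
# Gaps longer than max_pause are shrunk to keep_pause; shorter gaps are left alone.

def build_keep_segments(duration, pauses, max_pause=0.3, keep_pause=0.15):
    segments = []
    cursor = 0.0
    for p in pauses:
        if p["end"] - p["start"] > max_pause:
            # keep speech up to slightly into the pause, then skip the dead air
            segments.append((cursor, p["start"] + keep_pause))
            cursor = p["end"]
    segments.append((cursor, duration))
    return segments

# A 1.0s gap gets cut down to 0.15s; a 0.2s gap is kept as-is.
pauses = [{"start": 2.0, "end": 3.0}, {"start": 5.0, "end": 5.2}]
print(build_keep_segments(10.0, pauses))
```

Each resulting `(start, end)` pair then becomes one `-ss`/`-to` re-encode, and the pieces are concatenated as in the concat step above.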
Re-transcribe before each cut pass.

### End card

After the last spoken word, cut the silence and append an end card:

```bash
ffmpeg -y -f lavfi -i "color=c=#333333:s=1080x1920:d=2:r=30" \
  -f lavfi -i "anullsrc=r=44100:cl=stereo" \
  -vf "drawtext=text='forgedemy.org':fontcolor=white:fontsize=72:font=Arial:x=(w-text_w)/2:y=(h-text_h)/2" \
  -c:v libx264 -pix_fmt yuv420p -c:a aac -b:a 192k -shortest endcard.mp4
```

Trim the main video right after the last word ends (+0.15s max), then concat it with the end card.

### TikTok description + hashtags

After the final video is ready, generate a short description (2-3 sentences summarizing the content) plus 5 relevant hashtags. Keep it punchy and action-oriented. Ask the user to confirm before posting.

## Step 8: Captions

### CRITICAL: Always transcribe the FINAL cut

Never reuse earlier transcriptions — timing shifts after cuts make them wrong.

```bash
ffmpeg -y -i final.mp4 -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./
```

### Caption style (Forgedemy brand)

- Regular text: light gray `#CCCCCC` / ASS `&H00CCCCCC`
- Key words highlighted: orange `#E8720C` / ASS `&H000C72E8` (ASS colors are BGR)
- Font: Arial Bold, size 64
- Black outline: border width 4
- Position: bottom third
- Groups of 3-5 words per caption

### CRITICAL: Fix transcription errors

ElevenLabs commonly mis-transcribes brand names and technical terms. Always review the generated ASS file for garbled words and sed-fix them before burning.
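A minimal sketch of that fix-up pass — the misheard phrases below are hypothetical examples, not a real correction list:

```shell
# Apply known transcription fixes to the ASS file before burning.
# The corrections here are hypothetical examples — build the real list per project.
fix_captions() {
  sed -i.bak \
    -e 's/Forge to me/Forgedemy/g' \
    -e 's/Eleven Labs/ElevenLabs/g' \
    "$1"
}

# Usage: fix_captions captions.ass   (keeps captions.ass.bak as a backup)
```

`sed -i.bak` works on both GNU and BSD sed, and the `.bak` copy doubles as the pre-fix backup.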
Keep a project-specific list of known corrections.

### Burn captions

```bash
ffmpeg -y -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -pix_fmt yuv420p -c:a copy captioned.mp4
```

## Step 9: Background Music

### Generate with the ElevenLabs Music API

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Dynamic energetic tech promo, fast electronic beat, inspiring, instrumental only",
       "duration_seconds": 85, "instrumental": true}' \
  --output bgm.mp3
```

### Mix at -25dB

```bash
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out=$(python3 -c "print(round($dur - 3, 2))")

ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
[0:a]volume=1.0[voice];\
[1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out:d=3[bgm];\
[voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" -c:v copy -c:a aac -b:a 192k final.mp4
```

## File Organization

```
generated/<project-name>/
  recording/
    source.webm               # Original camera recording
    backup_source.mp4         # Always keep backups
  transcripts/
    *.json                    # Word timestamps, sentences, pauses
  draft/
    scene_01.mp4 ... 06.mp4   # Clean-cut draft scenes
    draft_combined.mp4        # Combined draft for review
  avatar-version/
    images/
      frame_01.jpg ...        # Extracted original frames
      avatar_scene_01.png ... # Nano Banana outputs
    audio/
      scene_XX_original.wav   # Extracted audio per scene
      scene_XX_<voice>.mp3    # Speech-to-speech outputs
    kling/
      scene_XX_kling.mp4      # Kling motion outputs
    final/
      scene_XX_final.mp4      # Per-scene finals
      final_avatar.mp4        # Final output with captions + music
      backup_*.mp4            # Backups before destructive edits
```

## Step 10: Speed Up (Optional)

Apply a 1.2x speedup without pitch shift for more dynamic pacing:

```bash
ffmpeg -y -i video.mp4 \
  -filter_complex "[0:v]setpts=PTS/1.2[v];[0:a]atempo=1.2[a]" \
  -map "[v]" -map "[a]" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -c:a aac -b:a 192k \
  video_fast.mp4
```

`atempo` preserves pitch; `setpts=PTS/1.2` speeds up the video to match. Re-transcribe and rebuild the captions after a speedup.

**CRITICAL: Never use the same file for input and output** — ffmpeg silently fails. Always write to a new file, then `mv`.

## Lessons Learned (Hard-Won)

1. **Always back up before overwriting source files** — destructive cuts are irreversible
2. **STS preserves stutters** — clean the audio BEFORE voice conversion, not after
3. **STS can drop words at the end** — "just does it" became "just..." when the clip was too long. Split long clips into halves before STS
4. **STS + Kling must share the same source cut** — mismatched timings = broken lip sync
5. **Never use `-shortest` when muxing STS onto Kling** — STS audio is often longer than the Kling video, and `-shortest` truncates the final words. Use `tpad=stop_mode=clone:stop_duration=1` to extend the video instead
6. **Long STS clips (>15s) can produce the wrong voice** — split them into halves
7. **The ElevenLabs STS API requires `--file-format other`** — old format strings are rejected with a 422
8. **"TikTok style" in prompts = fake UI in images** — use "vertical portrait photo"
9. **Generate scene 1 first** — use it as the reference for scenes 2+ for better consistency
10. **Cut pauses to 0.15s for dynamic pacing** — saves 5-10s, makes the video snappy
11. **Always re-transcribe the final cut** — timestamps shift after every cut
12. **Check for repeated words/phrases in STS output** — "you use, you use," happens often
13. **Check the last scene for trailing audio** — the next take's words can bleed in
14. **Fix transcription errors for brand names** — ElevenLabs commonly garbles proper nouns; keep a per-project correction list
15. **Trim trailing silence after the last word** — long sustained vowels ("build.") create dead air; cut +0.15s after the voice stops, then append the end card
16. **Verify pause timestamps against the CURRENT file** — timestamps from a previous cut are invalid after re-encoding; always re-transcribe first
17. **Do one comprehensive cut pass, not iterative ones** — multiple rounds of cuts compound timestamp drift and make debugging harder
18. **The ElevenLabs music API caps at ~45s** — for longer videos, loop the track with `ffmpeg -stream_loop 2 -i bgm.mp3 -t <dur> -c:a copy bgm_full.mp3`
19. **Complex ffmpeg filter chains can silently fail** — always verify the output file exists and has the expected duration before proceeding

## Cost Estimate (6 scenes, ~80s video)

| Step | Cost |
|------|------|
| Nano Banana images (6) | ~$1.20 |
| Kling v3 motion (6 clips) | ~$4.50 |
| ElevenLabs STS (6 clips) | ~$1.00 |
| ElevenLabs music (1 track) | ~$0.50 |
| ElevenLabs transcription (3x) | ~$0.30 |
| **Total** | **~$7.50** |
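Lesson 19 above can be guarded mechanically — a minimal sketch, where `check_duration` is an illustrative helper and the commented ffprobe call assumes `ffprobe` is on PATH:

```shell
# Fail (non-zero exit) if a duration differs from the expected value by more than the tolerance.
check_duration() {  # usage: check_duration <actual> <expected> <tolerance>
  awk -v a="$1" -v e="$2" -v t="$3" \
    'BEGIN { d = a - e; if (d < 0) d = -d; exit (d > t) }'
}

# After each ffmpeg step, something like:
# dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 out.mp4)
# check_duration "$dur" 85 0.5 || echo "WARNING: out.mp4 duration is off"
```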