---
name: face-swap-tiktok
description: "Replace yourself in filmed TikTok scenes with an AI-generated character using Nano Banana face-consistent image generation, Kling v3 motion transfer, ElevenLabs speech-to-speech voice swap, and dynamic cut editing. Full pipeline from recording to final video."
---

# Face-Swap TikTok Pipeline

Take filmed TikTok scenes of yourself and replace the presenter with a consistent AI-generated character, preserving the original motion with a new voice.

## Prerequisites

- Dependent skills: `nano-banana-fal`, `kling-motion-control`, `record-transcribe-revoice`, `tiktok-promo-video`
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (find it in the user's `.env` files)
- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package (auto-installed via `uv run --with fal-client`)

## Pipeline Overview

```
Record → Transcribe (word timestamps) → Cut stutters → Combine draft
→ Review draft → Speech-to-speech (voice swap) → Nano Banana (face swap per scene)
→ Review images → Kling v3 (motion transfer) → Mux voice audio
→ Cut pauses → Re-transcribe → Burn captions → Mix music → Final
```

## Step 0: Ask the User

Before starting, ask:

- **Which voice** to use for speech-to-speech (list the available ones with `--list-voices`)
- **Which face reference** image to use for the character
- **Captions**: burned into the video, or rely on TikTok's auto-captions?
- The user should confirm the script/lines before recording

### CRITICAL: Gate every phase with user review

Never proceed to the next step without user confirmation. Open all outputs with the default viewer at every phase:

- Draft combined video → user reviews cuts
- Nano Banana images → user reviews all before Kling
- Final assembled video → user reviews before captions/music
- Final with captions + music → user confirms before posting

Use `python3 -c "import webbrowser; webbrowser.open('<path>')"` or the platform-native open command.

## Step 1: Record & Transcribe

Record all lines in one continuous take — multiple attempts per phrase are fine.

```bash
# Open the camera recorder (OS-agnostic)
# macOS: open {record-transcribe-revoice}/assets/camera-recorder.html
# Linux: xdg-open {record-transcribe-revoice}/assets/camera-recorder.html
# Windows: start {record-transcribe-revoice}/assets/camera-recorder.html
python3 -c "import webbrowser; webbrowser.open('{record-transcribe-revoice}/assets/camera-recorder.html')"

# After recording, extract the audio and transcribe with word timestamps
ffmpeg -y -i recording.webm -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./ --env-file <path-to-.env>
```

## Step 2: Cut Stutters & Build Draft

Use word-level timestamps to identify and remove stutters.
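Stutter candidates can also be flagged programmatically before the manual pass — a minimal sketch, assuming each transcript word entry has `text`, `start`, and `end` fields (adjust to the real transcript JSON):

```python
# Sketch: flag likely stutters in a word-timestamp list.
# The "text"/"start"/"end" field names are assumptions about the transcript JSON.

def find_stutter_candidates(words):
    flags = []
    prev = None
    for w in words:
        text = w["text"].strip()
        if text.endswith(("--", "...", "-")):
            # false start, word-level stutter, or trailing off
            flags.append((w["start"], text, "false start / trail-off"))
        if prev is not None and text.lower() == prev["text"].strip().lower():
            # speaker retried the same word back-to-back
            flags.append((prev["start"], text, "repeated word"))
        prev = w
    return flags

words = [
    {"text": "the", "start": 0.0, "end": 0.2},
    {"text": "Forge--", "start": 0.3, "end": 0.6},
    {"text": "the", "start": 0.9, "end": 1.1},
    {"text": "the", "start": 1.2, "end": 1.3},
    {"text": "Forgedemy", "start": 1.4, "end": 2.0},
]
for start, text, reason in find_stutter_candidates(words):
    print(f"{start:5.2f}s  {text!r}  {reason}")
```

This only surfaces candidates — review each before cutting, since legitimate hyphenated or trailing words will false-positive.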
When multiple takes exist, **use the last/best take**.

### CRITICAL: Stutter detection

Look for these patterns in the transcript:

- Words ending with `--` (e.g., "finish--", "Forge--") — false starts
- Repeated phrases — the speaker retrying a line
- Words ending with `...` — trailing off
- Partial words (e.g., "re-", "de-") — word-level stutters

### Cutting rules

- Cut right after the last clean word before the stutter
- Resume at the start of the clean retake
- When two takes exist, prefer the **second/later** take (better delivery)
- Check the last scene for trailing audio bleeding in from abandoned next takes
- Leave at most a 0.3s buffer around cuts

### CRITICAL: Always back up before overwriting

```bash
cp source.mp4 backup_source.mp4   # ALWAYS before destructive edits
```

### Combine into a draft and open for review

```bash
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy draft_combined.mp4
xdg-open draft_combined.mp4
```

## Step 3: Speech-to-Speech (Voice Swap)

Convert your voice to the target character's voice using ElevenLabs speech-to-speech.

```bash
python3 {record-transcribe-revoice}/scripts/speech_to_speech_elevenlabs.py \
  --input-audio scene_XX.wav \
  --output scene_XX_voice.mp3 \
  --voice-id <VOICE_ID> \
  --env-file <path-to-.env> \
  --file-format other \
  --stability 0.15 --similarity-boost 0.6 --style 0.85
```

### CRITICAL: The API requires `--file-format other`

The ElevenLabs STS API changed — `mp3_44100_128` is no longer valid. Always use `--file-format other`.

### CRITICAL: Speech-to-speech preserves stutters

STS converts the voice but keeps the speech rhythm.
**Cut ALL stutters from the source audio BEFORE running STS.** If stutters survive into the STS output, you'll need to recut the final video.

### Long audio (>15s) may produce the wrong voice

Split scenes longer than ~15s into two halves, process each separately, then concat:

```bash
ffmpeg -y -i long_scene.wav -to 10.0 part_a.wav
ffmpeg -y -i long_scene.wav -ss 10.0 part_b.wav
# Process each half, then concat
```

### CRITICAL: STS and Kling must use the SAME source cut

If you recut a scene after generating Kling motion, the audio timing won't match the lip movements. Always ensure both the STS audio and the Kling video come from the same draft scene file. If you recut, redo BOTH.

## Step 4: Generate Face-Swapped Images (Nano Banana 2)

### Generate scene 1 FIRST

Generate scene 1 alone, review it, then provide scene 1's generated image as additional context in the prompts for scenes 2+. Reference the same room, lighting, outfit, and character details from the approved scene 1 output. This dramatically improves cross-scene consistency.

### Prompt structure

Always include:

- Character description (hair, accessories, outfit)
- **The same posture** as the original frame (describe exactly what the person is doing)
- Setting description (match the original room exactly)
- Camera angle (match the original)
- End with **"vertical portrait photo"**

### CRITICAL: Avoid TikTok UI hallucination

Nano Banana will generate fake TikTok UI elements (hearts, share buttons, usernames, LIVE badges) if:

1. The **prompt** contains "TikTok style" or social media references — use "vertical portrait photo" instead
2. The **face reference image** has UI overlays — the model picks up visual context from the reference

Fix: crop the face reference down to just the face, or add "clean photo, no text overlays, no UI elements" to the prompt.

### Run ALL scenes in parallel

```bash
uv run --with fal-client python3 {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face <face_reference.png> \
  --prompt "<description>" \
  --output <scene_XX.png>
```

### Resize if needed (must be <5MB for Kling)

```bash
size=$(stat -c%s image.png)   # GNU stat; on macOS use: stat -f%z image.png
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" resized.png
  mv resized.png image.png
fi
```

### Open ALL images for review before proceeding

## Step 5: Kling v3 Motion Transfer

```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image <scene_XX.png> \
  --video <draft_scene_XX.mp4> \
  --orientation video \
  --prompt "<description>" \
  --out <scene_XX_kling.mp4>
```

### Run ALL in parallel

### Always use the latest Kling model

Check fal.ai for the latest Kling motion-control endpoint before running. Do not hardcode model versions — they update frequently.

### Constraints

- Minimum video duration: 3 seconds
- Kling strips audio — you must re-mux it afterwards
- For clips >15s, split into chunks first

## Step 6: Mux Voice Audio & Normalize

### CRITICAL: Never use `-shortest` — it truncates the audio

STS audio is often slightly longer than the Kling video. `-shortest` cuts off the end of the audio, losing the final words (e.g., "does it" gets dropped).
Instead, extend the video with `tpad` to match the audio length:

```bash
# Mux STS audio onto Kling video — extend video to match audio, NOT truncate audio
ffmpeg -y -i scene_XX_kling.mp4 -i scene_XX_voice.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black,tpad=stop_mode=clone:stop_duration=1" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  -map 0:v -map 1:a scene_XX_muxed.mp4

# Normalize to -16 LUFS
ffmpeg -y -i scene_XX_muxed.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" \
  -c:v copy -c:a aac -b:a 192k scene_XX_final.mp4
```

## Step 7: Assemble & Dynamic Cut

### Concat scenes

```bash
printf "file 'scene_01_final.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy assembled.mp4
```

### CRITICAL: Cut pauses for dynamic pacing

Transcribe the assembled video, then find all pauses >0.3s and cut them to ~0.15s:

1. Transcribe with word timestamps
2. Use the pauses JSON to find gaps
3. Present pauses to the user in `...word || word...` format with timestamps for review
4. Build keep-segments skipping the dead air
5. Re-encode and concat segments

This typically saves 5-10 seconds and makes the video much more dynamic.

### CRITICAL: Also cut stutters and repeated words/phrases

STS often preserves or introduces:

- Partial word stutters ("ju-", "de-", "re-")
- Repeated words ("you use, you use,")
- Trailing words that cut off ("just..." instead of "just does it")

Always do a **word-level review** of the transcription after assembly. Search for:

- Words ending in `-`, `--`, or `...`
- Consecutive identical words or short phrases
- Phrases that seem incomplete

### CRITICAL: Verify pause timestamps match the actual source

When cutting pauses, always verify the timestamps come from the **current** version of the file. Timestamps shift after every cut.
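The keep-segment step above can be sketched as follows — a minimal sketch, assuming each pause entry in the pauses JSON carries `start` and `end` times:

```python
# Sketch: turn detected pauses into keep-segments for ffmpeg trimming.
# Gaps longer than max_pause are shrunk to keep_pause; shorter gaps are left alone.

def build_keep_segments(duration, pauses, max_pause=0.3, keep_pause=0.15):
    segments = []
    cursor = 0.0
    for p in pauses:
        if p["end"] - p["start"] > max_pause:
            # keep speech up to slightly into the pause, then skip the dead air
            segments.append((cursor, p["start"] + keep_pause))
            cursor = p["end"]
    segments.append((cursor, duration))
    return segments

# A 1.0s gap gets cut down to 0.15s; a 0.2s gap is kept as-is.
pauses = [{"start": 2.0, "end": 3.0}, {"start": 5.0, "end": 5.2}]
print(build_keep_segments(10.0, pauses))
```

Each resulting `(start, end)` pair then becomes one `-ss`/`-to` re-encode, and the pieces are concatenated as in the concat step above.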
Re-transcribe before each cut pass.

### End card

After the last spoken word, cut the silence and append an end card:

```bash
ffmpeg -y -f lavfi -i "color=c=#333333:s=1080x1920:d=2:r=30" \
  -f lavfi -i "anullsrc=r=44100:cl=stereo" \
  -vf "drawtext=text='forgedemy.org':fontcolor=white:fontsize=72:font=Arial:x=(w-text_w)/2:y=(h-text_h)/2" \
  -c:v libx264 -pix_fmt yuv420p -c:a aac -b:a 192k -shortest endcard.mp4
```

Trim the main video right after the last word ends (+0.15s max), then concat it with the end card.

### TikTok description + hashtags

After the final video is ready, generate a short description (2-3 sentences summarizing the content) plus 5 relevant hashtags. Keep it punchy and action-oriented. Ask the user to confirm before posting.

## Step 8: Captions

### CRITICAL: Always transcribe the FINAL cut

Never reuse earlier transcriptions — timing shifts after cuts make them wrong.

```bash
ffmpeg -y -i final.mp4 -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./
```

### Caption style (Forgedemy brand)

- Regular text: light gray `#CCCCCC` / ASS `&H00CCCCCC`
- Key words highlighted: orange `#E8720C` / ASS `&H000C72E8` (ASS colors are BGR)
- Font: Arial Bold, size 64
- Black outline: border width 4
- Position: bottom third
- Groups of 3-5 words per caption

### CRITICAL: Fix transcription errors

ElevenLabs commonly mis-transcribes brand names and technical terms. Always review the generated ASS file for garbled words and sed-fix them before burning.
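A minimal sketch of that fix-up pass — the misheard phrases below are hypothetical examples, not a real correction list:

```shell
# Apply known transcription fixes to the ASS file before burning.
# The corrections here are hypothetical examples — build the real list per project.
fix_captions() {
  sed -i.bak \
    -e 's/Forge to me/Forgedemy/g' \
    -e 's/Eleven Labs/ElevenLabs/g' \
    "$1"
}

# Usage: fix_captions captions.ass   (keeps captions.ass.bak as a backup)
```

`sed -i.bak` works on both GNU and BSD sed, and the `.bak` copy doubles as the pre-fix backup.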
Keep a project-specific list of known corrections.

### Burn captions

```bash
ffmpeg -y -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -pix_fmt yuv420p -c:a copy captioned.mp4
```

## Step 9: Background Music

### Generate with the ElevenLabs Music API

```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Dynamic energetic tech promo, fast electronic beat, inspiring, instrumental only",
       "duration_seconds": 85, "instrumental": true}' \
  --output bgm.mp3
```

### Mix at -25dB

```bash
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out=$(python3 -c "print(round($dur - 3, 2))")

ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
[0:a]volume=1.0[voice];\
[1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out:d=3[bgm];\
[voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" -c:v copy -c:a aac -b:a 192k final.mp4
```

## File Organization

```
generated/<project-name>/
  recording/
    source.webm               # Original camera recording
    backup_source.mp4         # Always keep backups
  transcripts/
    *.json                    # Word timestamps, sentences, pauses
  draft/
    scene_01.mp4 ... 06.mp4   # Clean-cut draft scenes
    draft_combined.mp4        # Combined draft for review
  avatar-version/
    images/
      frame_01.jpg ...        # Extracted original frames
      avatar_scene_01.png ... # Nano Banana outputs
    audio/
      scene_XX_original.wav   # Extracted audio per scene
      scene_XX_<voice>.mp3    # Speech-to-speech outputs
    kling/
      scene_XX_kling.mp4      # Kling motion outputs
    final/
      scene_XX_final.mp4      # Per-scene finals
      final_avatar.mp4        # Final output with captions + music
      backup_*.mp4            # Backups before destructive edits
```

## Step 10: Speed Up (Optional)

Apply a 1.2x speedup without pitch shift for more dynamic pacing:

```bash
ffmpeg -y -i video.mp4 \
  -filter_complex "[0:v]setpts=PTS/1.2[v];[0:a]atempo=1.2[a]" \
  -map "[v]" -map "[a]" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -c:a aac -b:a 192k \
  video_fast.mp4
```

`atempo` preserves pitch; `setpts=PTS/1.2` speeds up the video to match. Re-transcribe and rebuild the captions after a speedup.

**CRITICAL: Never use the same file for input and output** — ffmpeg silently fails. Always write to a new file, then `mv`.

## Lessons Learned (Hard-Won)

1. **Always back up before overwriting source files** — destructive cuts are irreversible
2. **STS preserves stutters** — clean the audio BEFORE voice conversion, not after
3. **STS can drop words at the end** — "just does it" became "just..." when the clip was too long. Split long clips into halves before STS
4. **STS + Kling must share the same source cut** — mismatched timings = broken lip sync
5. **Never use `-shortest` when muxing STS onto Kling** — STS audio is often longer than the Kling video, and `-shortest` truncates the final words. Use `tpad=stop_mode=clone:stop_duration=1` to extend the video instead
6. **Long STS clips (>15s) can produce the wrong voice** — split them into halves
7. **The ElevenLabs STS API requires `--file-format other`** — old format strings are rejected with a 422
8. **"TikTok style" in prompts = fake UI in images** — use "vertical portrait photo"
9. **Generate scene 1 first** — use it as the reference for scenes 2+ for better consistency
10. **Cut pauses to 0.15s for dynamic pacing** — saves 5-10s, makes the video snappy
11. **Always re-transcribe the final cut** — timestamps shift after every cut
12. **Check for repeated words/phrases in STS output** — "you use, you use," happens often
13. **Check the last scene for trailing audio** — the next take's words can bleed in
14. **Fix transcription errors for brand names** — ElevenLabs commonly garbles proper nouns; keep a per-project correction list
15. **Trim trailing silence after the last word** — long sustained vowels ("build.") create dead air; cut +0.15s after the voice stops, then append the end card
16. **Verify pause timestamps against the CURRENT file** — timestamps from a previous cut are invalid after re-encoding; always re-transcribe first
17. **Do one comprehensive cut pass, not iterative ones** — multiple rounds of cuts compound timestamp drift and make debugging harder
18. **The ElevenLabs music API caps at ~45s** — for longer videos, loop the track with `ffmpeg -stream_loop 2 -i bgm.mp3 -t <dur> -c:a copy bgm_full.mp3`
19. **Complex ffmpeg filter chains can silently fail** — always verify the output file exists and has the expected duration before proceeding

## Cost Estimate (6 scenes, ~80s video)

| Step | Cost |
|------|------|
| Nano Banana images (6) | ~$1.20 |
| Kling v3 motion (6 clips) | ~$4.50 |
| ElevenLabs STS (6 clips) | ~$1.00 |
| ElevenLabs music (1 track) | ~$0.50 |
| ElevenLabs transcription (3x) | ~$0.30 |
| **Total** | **~$7.50** |
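Lesson 19 above can be guarded mechanically — a minimal sketch, where `check_duration` is an illustrative helper and the commented ffprobe call assumes `ffprobe` is on PATH:

```shell
# Fail (non-zero exit) if a duration differs from the expected value by more than the tolerance.
check_duration() {  # usage: check_duration <actual> <expected> <tolerance>
  awk -v a="$1" -v e="$2" -v t="$3" \
    'BEGIN { d = a - e; if (d < 0) d = -d; exit (d > t) }'
}

# After each ffmpeg step, something like:
# dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 out.mp4)
# check_duration "$dur" 85 0.5 || echo "WARNING: out.mp4 duration is off"
```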