# Face-Swap TikTok Pipeline
Take filmed TikTok scenes of yourself and replace the presenter with a consistent AI-generated character, preserving the original motion and swapping in a new voice.
## Prerequisites
- Dependent skills: `nano-banana-fal`, `kling-motion-control`, `record-transcribe-revoice`, `tiktok-promo-video`
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (find in user's `.env` files)
- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package (auto-installed via `uv run --with fal-client`)
## Pipeline Overview
```
Record → Transcribe (word timestamps) → Cut stutters → Combine draft
→ Review draft → Speech-to-speech (voice swap) → Nano Banana (face swap per scene)
→ Review images → Kling v3 (motion transfer) → Mux voice audio
→ Cut pauses → Re-transcribe → Burn captions → Mix music → Final
```

## Step 0: Ask the User
Before starting, ask:
- Which voice to use for speech-to-speech (list available with `--list-voices`)
- Which face reference image to use for the character
- Captions: burned into the video, or rely on TikTok's auto-captions?
- The user should confirm the script/lines before recording
**CRITICAL: Gate every phase with user review**
Never proceed to the next step without user confirmation. Open all outputs with the default viewer at every phase:
- Draft combined video → user reviews cuts
- Nano Banana images → user reviews all before Kling
- Final assembled video → user reviews before captions/music
- Final with captions + music → user confirms before posting
Use `python3 -c "import webbrowser; webbrowser.open('<path>')"` or a platform-native open command.
## Step 1: Record & Transcribe
Record all lines in one continuous take — multiple attempts per phrase are fine.
```bash
# Open camera recorder (OS-agnostic)
# macOS: open {record-transcribe-revoice}/assets/camera-recorder.html
# Linux: xdg-open {record-transcribe-revoice}/assets/camera-recorder.html
# Windows: start {record-transcribe-revoice}/assets/camera-recorder.html
python3 -c "import webbrowser; webbrowser.open('{record-transcribe-revoice}/assets/camera-recorder.html')"

# After recording, extract audio and transcribe with word timestamps
ffmpeg -y -i recording.webm -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./ --env-file <path-to-.env>
```

## Step 2: Cut Stutters & Build Draft
Use word-level timestamps to identify and remove stutters. When multiple takes exist, use the last/best take.
**CRITICAL: Stutter detection**
Look for these patterns in the transcript:
- Words ending with `--` (e.g., "finish--", "Forge--") — false starts
- Repeated phrases — speaker retrying a line
- Words ending with `...` — trailing off
- Partial words (e.g., "re-", "de-") — word-level stutters
### Cutting rules
- Cut right after the last clean word before the stutter
- Resume at the start of the clean retake
- When two takes exist, prefer the second/later take (better delivery)
- Check the last scene for trailing audio bleeding from abandoned next takes
- Leave max 0.3s buffer around cuts
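A minimal sketch of one such cut, applying these rules — the timestamps come from the word-level transcript, and the times and segment filenames below are invented for illustration:

```bash
# Hypothetical transcript reading: clean words end at 3.42s, a "finish--"
# false start runs to 4.08s, the clean retake begins at 4.10s.
ffmpeg -y -i recording.webm -to 3.42 -c:v libx264 -pix_fmt yuv420p -c:a aac seg_01.mp4
ffmpeg -y -i recording.webm -ss 4.10 -c:v libx264 -pix_fmt yuv420p -c:a aac seg_02.mp4
printf "file 'seg_01.mp4'\nfile 'seg_02.mp4'\n" > concat.txt
```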
**CRITICAL: Always backup before overwriting**

```bash
cp source.mp4 backup_source.mp4   # ALWAYS before destructive edits
```

### Combine into draft and open for review

```bash
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy draft_combined.mp4
xdg-open draft_combined.mp4   # macOS: open draft_combined.mp4
```

## Step 3: Speech-to-Speech (Voice Swap)
Convert your voice to the target character's voice using ElevenLabs speech-to-speech.
```bash
python3 {record-transcribe-revoice}/scripts/speech_to_speech_elevenlabs.py \
  --input-audio scene_XX.wav \
  --output scene_XX_voice.mp3 \
  --voice-id <VOICE_ID> \
  --env-file <path-to-.env> \
  --file-format other \
  --stability 0.15 --similarity-boost 0.6 --style 0.85
```

**CRITICAL: API requires `--file-format other`**

The ElevenLabs STS API changed — `mp3_44100_128` is no longer valid. Always use `--file-format other`.
**CRITICAL: Speech-to-speech preserves stutters**
STS converts voice but keeps the speech rhythm. Cut ALL stutters from source audio BEFORE running STS. If stutters survive into STS output, you'll need to recut the final video.
**Long audio (>15s) may produce the wrong voice**

Split scenes longer than ~15s into two halves, process each separately, then concat:

```bash
ffmpeg -y -i long_scene.wav -to 10.0 part_a.wav
ffmpeg -y -i long_scene.wav -ss 10.0 part_b.wav
# Run STS on each half, then rejoin (output filenames assumed):
printf "file 'part_a_voice.mp3'\nfile 'part_b_voice.mp3'\n" > parts.txt
ffmpeg -y -f concat -safe 0 -i parts.txt -c copy scene_voice.mp3
```

**CRITICAL: STS and Kling must use the SAME source cut**
If you recut a scene after generating Kling motion, the audio timing won't match the lip movements. Always ensure both STS audio and Kling video come from the same draft scene file. If you recut, redo BOTH.
## Step 4: Generate Face-Swapped Images (Nano Banana 2)
### Generate scene 1 FIRST
Generate scene 1 alone, review it, then provide scene 1's generated image as additional context in prompts for scenes 2+. Reference the same room, lighting, outfit, and character details from the approved scene 1 output. This dramatically improves cross-scene consistency.
### Prompt structure
Always include:
- Character description (hair, accessories, outfit)
- Same posture as the original frame (describe exactly what the person is doing)
- Setting description (match the original room exactly)
- Camera angle (match original)
- End with "vertical portrait photo"
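A hypothetical prompt following this structure (character and setting details invented for illustration):

```
Woman with short silver hair, round glasses, and an orange hoodie, sitting
at a desk and gesturing toward the camera with her right hand, in a bright
home office with a bookshelf behind her, eye-level medium shot, clean photo,
no text overlays, vertical portrait photo
```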
**CRITICAL: Avoid TikTok UI hallucination**
Nano Banana will generate fake TikTok UI elements (hearts, share buttons, usernames, LIVE badges) if:
- The prompt contains "TikTok style" or social media references — use "vertical portrait photo" instead
- The face reference image has UI overlays — the model picks up visual context from the reference
Fix: crop face reference to just the face, or add "clean photo, no text overlays, no UI elements" to prompt.
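A hypothetical crop, assuming the face sits roughly top-center of the reference (width:height:x:y values are placeholders to adjust per image):

```bash
ffmpeg -y -i face_reference.png -vf "crop=600:600:240:100" face_cropped.png
```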
### Run ALL scenes in parallel

```bash
uv run --with fal-client python3 {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face <face_reference.png> \
  --prompt "<description>" \
  --output <scene_XX.png>
```

### Resize if needed (must be <5MB for Kling)
```bash
size=$(stat -c%s image.png)   # GNU stat; on macOS use: stat -f%z image.png
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" resized.png
  mv resized.png image.png
fi
```

**Open ALL images for review before proceeding.**
## Step 5: Kling v3 Motion Transfer
```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image <scene_XX.png> \
  --video <draft_scene_XX.mp4> \
  --orientation video \
  --prompt "<description>" \
  --out <scene_XX_kling.mp4>
```

**Run ALL scenes in parallel.**
**Always use the latest Kling model**
Check fal.ai for the latest Kling motion control endpoint before running. Do not hardcode model versions — they update frequently.
### Constraints
- Minimum video duration: 3 seconds
- Kling strips audio — must re-mux after
- For clips >15s, split into chunks first
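For example, an ~18s scene could be split into two Kling-sized chunks before motion transfer (split point and filenames hypothetical):

```bash
ffmpeg -y -i draft_scene_03.mp4 -to 9.0 -c:v libx264 -pix_fmt yuv420p -c:a aac chunk_a.mp4
ffmpeg -y -i draft_scene_03.mp4 -ss 9.0 -c:v libx264 -pix_fmt yuv420p -c:a aac chunk_b.mp4
# Run Kling on each chunk, then concat the outputs before muxing the voice audio
```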
## Step 6: Mux Voice Audio & Normalize
**CRITICAL: Never use `-shortest` — it truncates audio**

STS audio is often slightly longer than Kling video. `-shortest` cuts off the end of the audio, losing final words (e.g., "does it" gets dropped). Instead, extend the video with `tpad` to match audio length:
```bash
# Mux STS audio onto Kling video — extend video to match audio, NOT truncate audio
ffmpeg -y -i scene_XX_kling.mp4 -i scene_XX_voice.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black,tpad=stop_mode=clone:stop_duration=1" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  -map 0:v -map 1:a scene_XX_muxed.mp4

# Normalize to -16 LUFS
ffmpeg -y -i scene_XX_muxed.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" \
  -c:v copy -c:a aac -b:a 192k scene_XX_final.mp4
```

## Step 7: Assemble & Dynamic Cut
### Concat scenes

```bash
printf "file 'scene_01_final.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy assembled.mp4
```

**CRITICAL: Cut pauses for dynamic pacing**

Transcribe the assembled video, then find all pauses >0.3s and cut them to ~0.15s:
- Transcribe with word timestamps
- Use the pauses JSON to find gaps
- Present pauses to the user in `...word || word...` format with timestamps for review
- Build keep-segments skipping the dead air
- Re-encode and concat segments
This typically saves 5-10 seconds and makes the video much more dynamic.
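A sketch of the pause pass, assuming the transcription step wrote a pauses JSON shaped like `[{"start": 12.40, "end": 13.10}, ...]` — the field names, times, and filenames are assumptions:

```bash
# List gaps longer than 0.3s for user review
python3 -c '
import json
for p in json.load(open("pauses.json")):
    gap = p["end"] - p["start"]
    if gap > 0.3:
        print(round(p["start"], 2), "->", round(p["end"], 2), round(gap, 2))
'

# For an approved pause from 12.40s to 13.10s, keep ~0.15s of it:
ffmpeg -y -i assembled.mp4 -to 12.55 -c:v libx264 -pix_fmt yuv420p -c:a aac keep_01.mp4
ffmpeg -y -i assembled.mp4 -ss 13.10 -c:v libx264 -pix_fmt yuv420p -c:a aac keep_02.mp4
printf "file 'keep_01.mp4'\nfile 'keep_02.mp4'\n" > keep.txt
ffmpeg -y -f concat -safe 0 -i keep.txt -c copy recut.mp4
```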
**CRITICAL: Also cut stutters and repeated words/phrases**
STS often preserves or introduces:
- Partial word stutters ("ju-", "de-", "re-")
- Repeated words ("you use, you use,")
- Trailing words that cut off ("just..." instead of "just does it")
Always do a word-level review of the transcription after assembly. Search for:
- Words ending in `-`, `--`, or `...`
- Consecutive identical words or short phrases
- Phrases that seem incomplete
**CRITICAL: Verify pause timestamps match the actual source**
When cutting pauses, always verify the timestamps come from the current version of the file. Timestamps shift after every cut. Re-transcribe before each cut pass.
### End card
After the last spoken word, cut silence and append an end card:
```bash
ffmpeg -y -f lavfi -i "color=c=#333333:s=1080x1920:d=2:r=30" \
  -f lavfi -i "anullsrc=r=44100:cl=stereo" \
  -vf "drawtext=text='forgedemy.org':fontcolor=white:fontsize=72:font=Arial:x=(w-text_w)/2:y=(h-text_h)/2" \
  -c:v libx264 -pix_fmt yuv420p -c:a aac -b:a 192k -shortest endcard.mp4
```

Trim the main video right after the last word ends (+0.15s max), then concat with the end card.
### TikTok description + hashtags

After the final video is ready, generate a short description (2-3 sentences summarizing the content) plus 5 relevant hashtags. Keep it punchy and action-oriented. Ask the user to confirm before posting.
## Step 8: Captions
**CRITICAL: Always transcribe the FINAL cut**
Never reuse earlier transcriptions — timing shifts after cuts make them wrong.
```bash
ffmpeg -y -i final.mp4 -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./
```

### Caption style (Forgedemy brand)
- Regular text: light gray `#CCCCCC` / ASS `&H00CCCCCC`
- Key words highlighted: orange `#E8720C` / ASS `&H000C72E8`
- Font: Arial Bold, size 64
- Black outline: border width 4
- Position: bottom third
- Groups of 3-5 words per caption
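A minimal sketch of a matching ASS header — PlayRes, margins, and the sample dialogue line are assumptions; keyword highlights use inline `\c` overrides:

```bash
cat > captions.ass <<'EOF'
[Script Info]
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, Bold, BorderStyle, Outline, Alignment, MarginV
Style: Regular,Arial,64,&H00CCCCCC,&H00000000,-1,1,4,2,640
Style: Highlight,Arial,64,&H000C72E8,&H00000000,-1,1,4,2,640

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.00,0:00:01.20,Regular,,0,0,0,,you just {\c&H000C72E8&}build{\c} it
EOF
```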
**CRITICAL: Fix transcription errors**
ElevenLabs commonly mis-transcribes brand names and technical terms. Always review the generated ASS file for garbled words and sed-fix them before burning. Keep a project-specific list of known corrections.
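For example (the corrections themselves are hypothetical):

```bash
# GNU sed; on macOS use: sed -i '' 's/.../.../g' captions.ass
sed -i -e 's/forge to me/Forgedemy/g' -e 's/Cling/Kling/g' captions.ass
```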
### Burn captions

```bash
ffmpeg -y -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -pix_fmt yuv420p -c:a copy captioned.mp4
```

## Step 9: Background Music
### Generate with ElevenLabs Music API
```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Dynamic energetic tech promo, fast electronic beat, inspiring, instrumental only",
       "duration_seconds": 85, "instrumental": true}' \
  --output bgm.mp3
```

### Mix at -25dB
```bash
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out=$(python3 -c "print(round($dur - 3, 2))")
ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
    [0:a]volume=1.0[voice];\
    [1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out:d=3[bgm];\
    [voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" -c:v copy -c:a aac -b:a 192k final.mp4
```

## File Organization
```
generated/<project-name>/
  recording/
    source.webm               # Original camera recording
    backup_source.mp4         # Always keep backups
  transcripts/
    *.json                    # Word timestamps, sentences, pauses
  draft/
    scene_01.mp4 ... 06.mp4   # Clean-cut draft scenes
    draft_combined.mp4        # Combined draft for review
  avatar-version/
    images/
      frame_01.jpg ...        # Extracted original frames
      avatar_scene_01.png ... # Nano Banana outputs
    audio/
      scene_XX_original.wav   # Extracted audio per scene
      scene_XX_<voice>.mp3    # Speech-to-speech outputs
    kling/
      scene_XX_kling.mp4      # Kling motion outputs
    final/
      scene_XX_final.mp4      # Per-scene finals
      final_avatar.mp4        # Final output with captions + music
      backup_*.mp4            # Backups before destructive edits
```

## Step 10: Speed Up (Optional)
Apply 1.2x speedup without pitch shift for more dynamic pacing:
```bash
ffmpeg -y -i video.mp4 \
  -filter_complex "[0:v]setpts=PTS/1.2[v];[0:a]atempo=1.2[a]" \
  -map "[v]" -map "[a]" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -c:a aac -b:a 192k \
  video_fast.mp4
```

`atempo` preserves pitch. `setpts=PTS/1.2` speeds up the video to match. Re-transcribe and rebuild captions after the speedup.

**CRITICAL:** Never use the same file for input and output — ffmpeg silently fails. Always write to a new file, then `mv`.
## Lessons Learned (Hard-Won)
- Always backup before overwriting source files — destructive cuts are irreversible
- STS preserves stutters — clean audio BEFORE voice conversion, not after
- STS can drop words at the end — "just does it" became "just..." when the clip was too long. Split long clips into halves before STS
- STS + Kling must share the same source cut — mismatched timings = broken lip sync
- Never use `-shortest` when muxing STS onto Kling — STS audio is often longer than Kling video, and `-shortest` truncates final words. Use `tpad=stop_mode=clone:stop_duration=1` to extend the video instead
- Long STS clips (>15s) can produce the wrong voice — split into halves
- ElevenLabs STS API requires `--file-format other` — old format strings are rejected with 422
- "TikTok style" in prompts = fake UI in images — use "vertical portrait photo"
- Generate scene 1 first — use as reference for scenes 2+ for better consistency
- Cut pauses to 0.15s for dynamic pacing — saves 5-10s, makes video snappy
- Always re-transcribe the final cut — timestamps shift after every cut
- Check for repeated words/phrases in STS output — "you use, you use," happens often
- Check last scene for trailing audio — next take's words can bleed in
- Fix transcription errors for brand names — ElevenLabs commonly garbles proper nouns; keep a per-project correction list
- Trim trailing silence after last word — long sustained vowels ("build.") create dead air; cut +0.15s after voice stops, then append end card
- Verify pause timestamps against CURRENT file — timestamps from a previous cut are invalid after re-encoding; always re-transcribe first
- Do one comprehensive cut pass, not iterative — multiple rounds of cuts compound timestamp drift and make debugging harder
- ElevenLabs music API caps at ~45s — for longer videos, loop the track with `ffmpeg -stream_loop 2 -i bgm.mp3 -t <dur> -c:a copy bgm_full.mp3`
- Complex ffmpeg filter chains can silently fail — always verify the output file exists and has the expected duration before proceeding
## Cost Estimate (6 scenes, ~80s video)
| Step | Cost |
|---|---|
| Nano Banana images (6) | ~$1.20 |
| Kling v3 motion (6 clips) | ~$4.50 |
| ElevenLabs STS (6 clips) | ~$1.00 |
| ElevenLabs music (1 track) | ~$0.50 |
| ElevenLabs transcription (3x) | ~$0.30 |
| Total | ~$7.50 |