# Face-Swap TikTok Pipeline
Take filmed TikTok scenes of yourself and replace the presenter with a consistent AI-generated character, preserving the original motion and swapping in a new voice.
## Prerequisites
- Dependent skills: `nano-banana-fal`, `kling-motion-control`, `record-transcribe-revoice`, `tiktok-promo-video`
- `FAL_AI_KEY` or `FAL_KEY` env var
- `ELEVENLABS_API_KEY` env var (find in user's `.env` files)
- `python3`, `ffmpeg`, `ffprobe`
- `fal-client` Python package (auto-installed via `uv run --with fal-client`)
## Pipeline Overview
```
Record → Transcribe (word timestamps) → Cut stutters → Combine draft
→ Review draft → Speech-to-speech (voice swap) → Nano Banana (face swap per scene)
→ Review images → Kling v3 (motion transfer) → Mux voice audio
→ Cut pauses → Re-transcribe → Burn captions → Mix music → Final
```

## Step 0: Ask the User
Before starting, ask:
- Which voice to use for speech-to-speech (list available with `--list-voices`)
- Which face reference image to use for the character
- Captions: burned into the video, or rely on TikTok's auto-captions?
- The user should confirm the script/lines before recording
**CRITICAL: Gate every phase with user review**
Never proceed to the next step without user confirmation. Open all outputs with the default viewer at every phase:
- Draft combined video → user reviews cuts
- Nano Banana images → user reviews all before Kling
- Final assembled video → user reviews before captions/music
- Final with captions + music → user confirms before posting
Use `python3 -c "import webbrowser; webbrowser.open('<path>')"` or a platform-native open command.
## Step 1: Record & Transcribe
Record all lines in one continuous take — multiple attempts per phrase are fine.
```bash
# Open camera recorder (OS-agnostic)
# macOS: open {record-transcribe-revoice}/assets/camera-recorder.html
# Linux: xdg-open {record-transcribe-revoice}/assets/camera-recorder.html
# Windows: start {record-transcribe-revoice}/assets/camera-recorder.html
python3 -c "import webbrowser; webbrowser.open('{record-transcribe-revoice}/assets/camera-recorder.html')"

# After recording, extract audio and transcribe with word timestamps
ffmpeg -y -i recording.webm -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./ --env-file <path-to-.env>
```

## Step 2: Cut Stutters & Build Draft
Use word-level timestamps to identify and remove stutters. When multiple takes exist, use the last/best take.
**CRITICAL: Stutter detection**
Look for these patterns in the transcript:
- Words ending with `--` (e.g., "finish--", "Forge--") — false starts
- Repeated phrases — speaker retrying a line
- Words ending with `...` — trailing off
- Partial words (e.g., "re-", "de-") — word-level stutters
### Cutting rules
- Cut right after the last clean word before the stutter
- Resume at the start of the clean retake
- When two takes exist, prefer the second/later take (better delivery)
- Check the last scene for trailing audio bleeding from abandoned next takes
- Leave max 0.3s buffer around cuts
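A minimal sketch of one such cut, applying these rules — the timestamps come from the word-level transcript, and the times and segment filenames below are invented for illustration:

```bash
# Hypothetical transcript reading: clean words end at 3.42s, a "finish--"
# false start runs to 4.08s, the clean retake begins at 4.10s.
ffmpeg -y -i recording.webm -to 3.42 -c:v libx264 -pix_fmt yuv420p -c:a aac seg_01.mp4
ffmpeg -y -i recording.webm -ss 4.10 -c:v libx264 -pix_fmt yuv420p -c:a aac seg_02.mp4
printf "file 'seg_01.mp4'\nfile 'seg_02.mp4'\n" > concat.txt
```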
**CRITICAL: Always backup before overwriting**

```bash
cp source.mp4 backup_source.mp4   # ALWAYS before destructive edits
```

### Combine into draft and open for review

```bash
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy draft_combined.mp4
xdg-open draft_combined.mp4   # macOS: open draft_combined.mp4
```

## Step 3: Speech-to-Speech (Voice Swap)
Convert your voice to the target character's voice using ElevenLabs speech-to-speech.
```bash
python3 {record-transcribe-revoice}/scripts/speech_to_speech_elevenlabs.py \
  --input-audio scene_XX.wav \
  --output scene_XX_voice.mp3 \
  --voice-id <VOICE_ID> \
  --env-file <path-to-.env> \
  --file-format other \
  --stability 0.15 --similarity-boost 0.6 --style 0.85
```

**CRITICAL: API requires `--file-format other`**

The ElevenLabs STS API changed — `mp3_44100_128` is no longer valid. Always use `--file-format other`.
**CRITICAL: Speech-to-speech preserves stutters**
STS converts voice but keeps the speech rhythm. Cut ALL stutters from source audio BEFORE running STS. If stutters survive into STS output, you'll need to recut the final video.
**Long audio (>15s) may produce the wrong voice**

Split scenes longer than ~15s into two halves, process each separately, then concat:

```bash
ffmpeg -y -i long_scene.wav -to 10.0 part_a.wav
ffmpeg -y -i long_scene.wav -ss 10.0 part_b.wav
# Run STS on each half, then rejoin (output filenames assumed):
printf "file 'part_a_voice.mp3'\nfile 'part_b_voice.mp3'\n" > parts.txt
ffmpeg -y -f concat -safe 0 -i parts.txt -c copy scene_voice.mp3
```

**CRITICAL: STS and Kling must use the SAME source cut**
If you recut a scene after generating Kling motion, the audio timing won't match the lip movements. Always ensure both STS audio and Kling video come from the same draft scene file. If you recut, redo BOTH.
## Step 4: Generate Face-Swapped Images (Nano Banana 2)
### Generate scene 1 FIRST
Generate scene 1 alone, review it, then provide scene 1's generated image as additional context in prompts for scenes 2+. Reference the same room, lighting, outfit, and character details from the approved scene 1 output. This dramatically improves cross-scene consistency.
### Prompt structure
Always include:
- Character description (hair, accessories, outfit)
- Same posture as the original frame (describe exactly what the person is doing)
- Setting description (match the original room exactly)
- Camera angle (match original)
- End with "vertical portrait photo"
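A hypothetical prompt following this structure (character and setting details invented for illustration):

```
Woman with short silver hair, round glasses, and an orange hoodie, sitting
at a desk and gesturing toward the camera with her right hand, in a bright
home office with a bookshelf behind her, eye-level medium shot, clean photo,
no text overlays, vertical portrait photo
```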
**CRITICAL: Avoid TikTok UI hallucination**
Nano Banana will generate fake TikTok UI elements (hearts, share buttons, usernames, LIVE badges) if:
- The prompt contains "TikTok style" or social media references — use "vertical portrait photo" instead
- The face reference image has UI overlays — the model picks up visual context from the reference
Fix: crop face reference to just the face, or add "clean photo, no text overlays, no UI elements" to prompt.
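A hypothetical crop, assuming the face sits roughly top-center of the reference (width:height:x:y values are placeholders to adjust per image):

```bash
ffmpeg -y -i face_reference.png -vf "crop=600:600:240:100" face_cropped.png
```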
### Run ALL scenes in parallel

```bash
uv run --with fal-client python3 {nano-banana-fal}/scripts/nano_banana_edit.py \
  --face <face_reference.png> \
  --prompt "<description>" \
  --output <scene_XX.png>
```

### Resize if needed (must be <5MB for Kling)
```bash
size=$(stat -c%s image.png)   # GNU stat; on macOS use: stat -f%z image.png
if [ "$size" -gt 5000000 ]; then
  ffmpeg -y -i image.png -vf "scale=1080:1920:force_original_aspect_ratio=decrease" resized.png
  mv resized.png image.png
fi
```

**Open ALL images for review before proceeding.**
## Step 5: Kling v3 Motion Transfer
```bash
FAL_KEY="$FAL_AI_KEY" uv run --with fal-client \
  python3 {kling-motion-control}/scripts/kling_motion_control.py \
  --image <scene_XX.png> \
  --video <draft_scene_XX.mp4> \
  --orientation video \
  --prompt "<description>" \
  --out <scene_XX_kling.mp4>
```

**Run ALL scenes in parallel.**
**Always use the latest Kling model**
Check fal.ai for the latest Kling motion control endpoint before running. Do not hardcode model versions — they update frequently.
### Constraints
- Minimum video duration: 3 seconds
- Kling strips audio — must re-mux after
- For clips >15s, split into chunks first
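For example, an ~18s scene could be split into two Kling-sized chunks before motion transfer (split point and filenames hypothetical):

```bash
ffmpeg -y -i draft_scene_03.mp4 -to 9.0 -c:v libx264 -pix_fmt yuv420p -c:a aac chunk_a.mp4
ffmpeg -y -i draft_scene_03.mp4 -ss 9.0 -c:v libx264 -pix_fmt yuv420p -c:a aac chunk_b.mp4
# Run Kling on each chunk, then concat the outputs before muxing the voice audio
```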
## Step 6: Mux Voice Audio & Normalize
**CRITICAL: Never use `-shortest` — it truncates audio**

STS audio is often slightly longer than Kling video. `-shortest` cuts off the end of the audio, losing final words (e.g., "does it" gets dropped). Instead, extend the video with `tpad` to match audio length:
```bash
# Mux STS audio onto Kling video — extend video to match audio, NOT truncate audio
ffmpeg -y -i scene_XX_kling.mp4 -i scene_XX_voice.mp3 \
  -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2:black,tpad=stop_mode=clone:stop_duration=1" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -ac 2 -ar 44100 -c:a aac -b:a 192k \
  -map 0:v -map 1:a scene_XX_muxed.mp4

# Normalize to -16 LUFS
ffmpeg -y -i scene_XX_muxed.mp4 -af "loudnorm=I=-16:LRA=11:TP=-1" \
  -c:v copy -c:a aac -b:a 192k scene_XX_final.mp4
```

## Step 7: Assemble & Dynamic Cut
### Concat scenes

```bash
printf "file 'scene_01_final.mp4'\n..." > concat.txt
ffmpeg -y -f concat -safe 0 -i concat.txt -c copy assembled.mp4
```

**CRITICAL: Cut pauses for dynamic pacing**

Transcribe the assembled video, then find all pauses >0.3s and cut them to ~0.15s:
- Transcribe with word timestamps
- Use the pauses JSON to find gaps
- Present pauses to the user in `...word || word...` format with timestamps for review
- Build keep-segments skipping the dead air
- Re-encode and concat segments
This typically saves 5-10 seconds and makes the video much more dynamic.
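A sketch of the pause pass, assuming the transcription step wrote a pauses JSON shaped like `[{"start": 12.40, "end": 13.10}, ...]` — the field names, times, and filenames are assumptions:

```bash
# List gaps longer than 0.3s for user review
python3 -c '
import json
for p in json.load(open("pauses.json")):
    gap = p["end"] - p["start"]
    if gap > 0.3:
        print(round(p["start"], 2), "->", round(p["end"], 2), round(gap, 2))
'

# For an approved pause from 12.40s to 13.10s, keep ~0.15s of it:
ffmpeg -y -i assembled.mp4 -to 12.55 -c:v libx264 -pix_fmt yuv420p -c:a aac keep_01.mp4
ffmpeg -y -i assembled.mp4 -ss 13.10 -c:v libx264 -pix_fmt yuv420p -c:a aac keep_02.mp4
printf "file 'keep_01.mp4'\nfile 'keep_02.mp4'\n" > keep.txt
ffmpeg -y -f concat -safe 0 -i keep.txt -c copy recut.mp4
```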
**CRITICAL: Also cut stutters and repeated words/phrases**
STS often preserves or introduces:
- Partial word stutters ("ju-", "de-", "re-")
- Repeated words ("you use, you use,")
- Trailing words that cut off ("just..." instead of "just does it")
Always do a word-level review of the transcription after assembly. Search for:
- Words ending in `-`, `--`, or `...`
- Consecutive identical words or short phrases
- Phrases that seem incomplete
**CRITICAL: Verify pause timestamps match the actual source**
When cutting pauses, always verify the timestamps come from the current version of the file. Timestamps shift after every cut. Re-transcribe before each cut pass.
### End card
After the last spoken word, cut silence and append an end card:
```bash
ffmpeg -y -f lavfi -i "color=c=#333333:s=1080x1920:d=2:r=30" \
  -f lavfi -i "anullsrc=r=44100:cl=stereo" \
  -vf "drawtext=text='forgedemy.org':fontcolor=white:fontsize=72:font=Arial:x=(w-text_w)/2:y=(h-text_h)/2" \
  -c:v libx264 -pix_fmt yuv420p -c:a aac -b:a 192k -shortest endcard.mp4
```

Trim the main video right after the last word ends (+0.15s max), then concat with the end card.
### TikTok description + hashtags

After the final video is ready, generate a short description (2-3 sentences summarizing the content) plus 5 relevant hashtags. Keep it punchy and action-oriented. Ask the user to confirm before posting.
## Step 8: Captions
**CRITICAL: Always transcribe the FINAL cut**
Never reuse earlier transcriptions — timing shifts after cuts make them wrong.
```bash
ffmpeg -y -i final.mp4 -vn -ac 1 -ar 16000 audio.wav
python3 {record-transcribe-revoice}/scripts/transcribe_with_elevenlabs.py \
  --input audio.wav --out-dir ./
```

### Caption style (Forgedemy brand)
- Regular text: light gray `#CCCCCC` / ASS `&H00CCCCCC`
- Key words highlighted: orange `#E8720C` / ASS `&H000C72E8`
- Font: Arial Bold, size 64
- Black outline: border width 4
- Position: bottom third
- Groups of 3-5 words per caption
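A minimal sketch of a matching ASS header — PlayRes, margins, and the sample dialogue line are assumptions; keyword highlights use inline `\c` overrides:

```bash
cat > captions.ass <<'EOF'
[Script Info]
PlayResX: 1080
PlayResY: 1920

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, OutlineColour, Bold, BorderStyle, Outline, Alignment, MarginV
Style: Regular,Arial,64,&H00CCCCCC,&H00000000,-1,1,4,2,640
Style: Highlight,Arial,64,&H000C72E8,&H00000000,-1,1,4,2,640

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:00.00,0:00:01.20,Regular,,0,0,0,,you just {\c&H000C72E8&}build{\c} it
EOF
```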
**CRITICAL: Fix transcription errors**
ElevenLabs commonly mis-transcribes brand names and technical terms. Always review the generated ASS file for garbled words and sed-fix them before burning. Keep a project-specific list of known corrections.
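For example (the corrections themselves are hypothetical):

```bash
# GNU sed; on macOS use: sed -i '' 's/.../.../g' captions.ass
sed -i -e 's/forge to me/Forgedemy/g' -e 's/Cling/Kling/g' captions.ass
```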
### Burn captions

```bash
ffmpeg -y -i video.mp4 -vf "ass=captions.ass" -c:v libx264 -pix_fmt yuv420p -c:a copy captioned.mp4
```

## Step 9: Background Music
### Generate with ElevenLabs Music API
```bash
curl -s -X POST "https://api.elevenlabs.io/v1/music" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Dynamic energetic tech promo, fast electronic beat, inspiring, instrumental only",
       "duration_seconds": 85, "instrumental": true}' \
  --output bgm.mp3
```

### Mix at -25dB
```bash
dur=$(ffprobe -v quiet -show_entries format=duration -of csv=p=0 video.mp4)
fade_out=$(python3 -c "print(round($dur - 3, 2))")
ffmpeg -y -i video.mp4 -i bgm.mp3 \
  -filter_complex "\
    [0:a]volume=1.0[voice];\
    [1:a]atrim=0:$dur,volume=0.056,afade=t=in:d=2,afade=t=out:st=$fade_out:d=3[bgm];\
    [voice][bgm]amix=inputs=2:duration=first:dropout_transition=0[aout]" \
  -map 0:v -map "[aout]" -c:v copy -c:a aac -b:a 192k final.mp4
```

## File Organization
```
generated/<project-name>/
  recording/
    source.webm               # Original camera recording
    backup_source.mp4         # Always keep backups
  transcripts/
    *.json                    # Word timestamps, sentences, pauses
  draft/
    scene_01.mp4 ... 06.mp4   # Clean-cut draft scenes
    draft_combined.mp4        # Combined draft for review
  avatar-version/
    images/
      frame_01.jpg ...        # Extracted original frames
      avatar_scene_01.png ... # Nano Banana outputs
    audio/
      scene_XX_original.wav   # Extracted audio per scene
      scene_XX_<voice>.mp3    # Speech-to-speech outputs
    kling/
      scene_XX_kling.mp4      # Kling motion outputs
    final/
      scene_XX_final.mp4      # Per-scene finals
      final_avatar.mp4        # Final output with captions + music
      backup_*.mp4            # Backups before destructive edits
```

## Step 10: Speed Up (Optional)
Apply 1.2x speedup without pitch shift for more dynamic pacing:
```bash
ffmpeg -y -i video.mp4 \
  -filter_complex "[0:v]setpts=PTS/1.2[v];[0:a]atempo=1.2[a]" \
  -map "[v]" -map "[a]" \
  -c:v libx264 -pix_fmt yuv420p -r 30 -c:a aac -b:a 192k \
  video_fast.mp4
```

`atempo` preserves pitch. `setpts=PTS/1.2` speeds up the video to match. Re-transcribe and rebuild captions after the speedup.

**CRITICAL:** Never use the same file for input and output — ffmpeg silently fails. Always write to a new file, then `mv`.
## Lessons Learned (Hard-Won)
- Always backup before overwriting source files — destructive cuts are irreversible
- STS preserves stutters — clean audio BEFORE voice conversion, not after
- STS can drop words at the end — "just does it" became "just..." when the clip was too long. Split long clips into halves before STS
- STS + Kling must share the same source cut — mismatched timings = broken lip sync
- Never use `-shortest` when muxing STS onto Kling — STS audio is often longer than Kling video, and `-shortest` truncates final words. Use `tpad=stop_mode=clone:stop_duration=1` to extend the video instead
- Long STS clips (>15s) can produce the wrong voice — split into halves
- ElevenLabs STS API requires `--file-format other` — old format strings are rejected with 422
- "TikTok style" in prompts = fake UI in images — use "vertical portrait photo"
- Generate scene 1 first — use as reference for scenes 2+ for better consistency
- Cut pauses to 0.15s for dynamic pacing — saves 5-10s, makes video snappy
- Always re-transcribe the final cut — timestamps shift after every cut
- Check for repeated words/phrases in STS output — "you use, you use," happens often
- Check last scene for trailing audio — next take's words can bleed in
- Fix transcription errors for brand names — ElevenLabs commonly garbles proper nouns; keep a per-project correction list
- Trim trailing silence after last word — long sustained vowels ("build.") create dead air; cut +0.15s after voice stops, then append end card
- Verify pause timestamps against CURRENT file — timestamps from a previous cut are invalid after re-encoding; always re-transcribe first
- Do one comprehensive cut pass, not iterative — multiple rounds of cuts compound timestamp drift and make debugging harder
- ElevenLabs music API caps at ~45s — for longer videos, loop the track with `ffmpeg -stream_loop 2 -i bgm.mp3 -t <dur> -c:a copy bgm_full.mp3`
- Complex ffmpeg filter chains can silently fail — always verify the output file exists and has the expected duration before proceeding
## Cost Estimate (6 scenes, ~80s video)
| Step | Cost |
|---|---|
| Nano Banana images (6) | ~$1.20 |
| Kling v3 motion (6 clips) | ~$4.50 |
| ElevenLabs STS (6 clips) | ~$1.00 |
| ElevenLabs music (1 track) | ~$0.50 |
| ElevenLabs transcription (3x) | ~$0.30 |
| Total | ~$7.50 |