# Record, Transcribe, Revoice
Use this skill for a practical creator pipeline:
- record a short take with camera + mic
- transcribe it with word timestamps
- cut obvious repeated words or stutters
- generate a voice-to-voice pass
- lay the new voice back onto the cleaned video
## Prerequisites
You need:
- `ffmpeg` and `ffprobe`
- `python3`
- an `ELEVENLABS_API_KEY`
- a browser for the bundled recorder UI

If `ffmpeg` is missing, install it first:

- macOS: `brew install ffmpeg`
- Ubuntu/Debian: `sudo apt update && sudo apt install -y ffmpeg`
- Arch: `sudo pacman -S ffmpeg`
- Windows: `winget install Gyan.FFmpeg`
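The checks above can be scripted before a run. A minimal preflight sketch (the tool list and environment-variable name mirror the prerequisites above; nothing else is assumed):

```python
import os
import shutil

def missing_prereqs(env=os.environ, which=shutil.which):
    """Return the names of required tools / variables that are not available."""
    tools = ("ffmpeg", "ffprobe", "python3")
    missing = [tool for tool in tools if which(tool) is None]
    if not env.get("ELEVENLABS_API_KEY"):
        missing.append("ELEVENLABS_API_KEY")
    return missing

# e.g. missing_prereqs() -> [] once everything is installed and the key is set
```

Injecting `env` and `which` keeps the helper testable without touching the real machine.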
## Capture
Use the bundled recorder page at `assets/camera-recorder.html.txt`.

Because browser camera access needs a local origin, serve the folder first (copy the page to a `.html` name so the browser renders it instead of showing plain text):

```
cd assets
cp camera-recorder.html.txt camera-recorder.html
python3 -m http.server 8765
```

Then open:

```
http://127.0.0.1:8765/camera-recorder.html
```

Recorder expectations:
- preview the camera feed
- choose camera and microphone
- record with audio enabled
- save each take locally
## Workflow
### 1. Transcribe with word timings
Run:
```
python3 scripts/transcribe_with_elevenlabs.py \
  --input /path/to/take.webm \
  --out-dir /path/to/output
```

This produces:

- `*.elevenlabs.transcript.json`
- `*.clean.txt`
- `*.sentences.json`
- `*.pauses.json`
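The word-level transcript is the input to every later step. A minimal sketch of consuming it, assuming the JSON carries a list of word entries with `text`, `start`, and `end` fields in seconds (check the actual file for the exact schema) — here, finding the long silences that a pauses artifact would record:

```python
def long_pauses(words, min_gap=0.75):
    """Return (gap_start, gap_end) for silences longer than `min_gap` seconds."""
    pauses = []
    for prev, cur in zip(words, words[1:]):
        gap = cur["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["end"], cur["start"]))
    return pauses

words = [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "again", "start": 2.0, "end": 2.5},  # 1.6 s pause before this word
]
print(long_pauses(words))  # → [(0.4, 2.0)]
```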
### 2. Render a de-stutter preview
Run:
```
python3 scripts/build_destutter_preview.py \
  --media /path/to/take.webm \
  --transcript /path/to/take.elevenlabs.transcript.json \
  --output /path/to/take.destutter-preview.mp4
```

This only targets immediate doubled words or obvious stutters. It is a preview pass, not a full editorial cut.
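The doubled-word detection can be sketched like this. It is a simplified illustration, not the script's actual logic, and it assumes word entries with `text`, `start`, and `end` fields in seconds:

```python
def doubled_word_spans(words, max_gap=0.5):
    """Return (start, end) spans of the first copy of an immediately repeated word."""
    spans = []
    for prev, cur in zip(words, words[1:]):
        same_word = prev["text"].lower().strip(".,!?") == cur["text"].lower().strip(".,!?")
        if same_word and cur["start"] - prev["end"] <= max_gap:
            spans.append((prev["start"], prev["end"]))  # cut the earlier copy
    return spans

words = [
    {"text": "I", "start": 0.0, "end": 0.2},
    {"text": "I", "start": 0.3, "end": 0.5},
    {"text": "went", "start": 0.6, "end": 0.9},
]
print(doubled_word_spans(words))  # → [(0.0, 0.2)]
```

Cutting the earlier copy keeps the speaker's final, usually cleaner, delivery of the word.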
### 3. Run speech-to-speech
First extract aligned audio from the de-stutter preview:
```
ffmpeg -i /path/to/take.destutter-preview.mp4 -vn -ac 1 /path/to/take.destutter-preview.wav
```

Then run:

```
python3 scripts/speech_to_speech_elevenlabs.py \
  --input-audio /path/to/take.destutter-preview.wav \
  --voice-name "Celestia 6" \
  --output /path/to/take.v2v.mp3
```

Use `--voice-id` if you already know the exact ElevenLabs voice.
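Under the hood, a speech-to-speech call looks roughly like this. This is a sketch against the ElevenLabs HTTP API; the endpoint path, header name, and model id are assumptions to verify against the current API docs, and the bundled script may do it differently:

```python
def sts_request(voice_id, api_key, model_id="eleven_multilingual_sts_v2"):
    """Build the pieces of an ElevenLabs speech-to-speech HTTP request.

    The source audio itself is sent as multipart form data (the `audio` field).
    """
    url = f"https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key}
    form = {"model_id": model_id}
    return url, headers, form

url, headers, form = sts_request("abc123", "sk-demo")
print(url)  # → https://api.elevenlabs.io/v1/speech-to-speech/abc123
```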
### 4. Lay the new voice back onto the video
Run:
```
python3 scripts/mux_audio_to_video.py \
  --video /path/to/take.destutter-preview.mp4 \
  --audio /path/to/take.v2v.mp3 \
  --output /path/to/take.final.mp4
```

## Guardrails
- Do not trust speech-to-speech timing blindly. Compare generated audio duration against the source before muxing.
- Treat the de-stutter pass as a conservative cleanup only. Larger retakes still need manual editorial judgment.
- Keep the original take, the transcript JSON, the de-stutter preview, and the speech-to-speech result as separate artifacts.
- If the generated voice drifts badly in pacing, switch voice or settings rather than forcing a bad output into the video.
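The first guardrail is easy to automate before muxing. A sketch using `ffprobe` (assumes it is on PATH; `drift_ok` itself is pure arithmetic, and the 2% tolerance is an illustrative default, not a value from the pipeline):

```python
import json
import subprocess

def media_duration(path):
    """Probe a file's duration in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(json.loads(out)["format"]["duration"])

def drift_ok(src_seconds, gen_seconds, tolerance=0.02):
    """True if the generated audio is within `tolerance` (relative) of the source."""
    return abs(gen_seconds - src_seconds) / src_seconds <= tolerance

# e.g. drift_ok(media_duration("take.destutter-preview.wav"),
#               media_duration("take.v2v.mp3"))
```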
## Files
- `assets/camera-recorder.html.txt`: local browser recorder with camera + mic controls
- `scripts/transcribe_with_elevenlabs.py`: extract/transcribe and emit word-level transcript artifacts
- `scripts/build_destutter_preview.py`: find immediate repeated words and render a cleaned preview
- `scripts/speech_to_speech_elevenlabs.py`: call ElevenLabs speech-to-speech
- `scripts/mux_audio_to_video.py`: combine cleaned video with generated audio