# Record, Transcribe, Revoice
Use this skill for a practical creator pipeline:
- record a short take with camera + mic
- transcribe it with word timestamps
- cut obvious repeated words or stutters
- generate a voice-to-voice pass
- lay the new voice back onto the cleaned video
## Prerequisites
You need:
- `ffmpeg` and `ffprobe`
- `python3`
- an `ELEVENLABS_API_KEY`
- a browser for the bundled recorder UI

If `ffmpeg` is missing, install it first:

- macOS: `brew install ffmpeg`
- Ubuntu/Debian: `sudo apt update && sudo apt install -y ffmpeg`
- Arch: `sudo pacman -S ffmpeg`
- Windows: `winget install Gyan.FFmpeg`
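The checks above can be scripted before a run. A minimal preflight sketch (the tool list and environment-variable name mirror the prerequisites above; nothing else is assumed):

```python
import os
import shutil

def missing_prereqs(env=os.environ, which=shutil.which):
    """Return the names of required tools / variables that are not available."""
    tools = ("ffmpeg", "ffprobe", "python3")
    missing = [tool for tool in tools if which(tool) is None]
    if not env.get("ELEVENLABS_API_KEY"):
        missing.append("ELEVENLABS_API_KEY")
    return missing

# e.g. missing_prereqs() -> [] once everything is installed and the key is set
```

Injecting `env` and `which` keeps the helper testable without touching the real machine.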
## Capture
Use the bundled recorder page at `assets/camera-recorder.html.txt`.

Because browser camera access needs a local origin, serve the folder first (copy the page to a `.html` name so the browser renders it instead of showing plain text):

```
cd assets
cp camera-recorder.html.txt camera-recorder.html
python3 -m http.server 8765
```

Then open:

```
http://127.0.0.1:8765/camera-recorder.html
```

Recorder expectations:
- preview the camera feed
- choose camera and microphone
- record with audio enabled
- save each take locally
## Workflow
### 1. Transcribe with word timings
Run:
```
python3 scripts/transcribe_with_elevenlabs.py \
  --input /path/to/take.webm \
  --out-dir /path/to/output
```

This produces:

- `*.elevenlabs.transcript.json`
- `*.clean.txt`
- `*.sentences.json`
- `*.pauses.json`
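The word-level transcript is the input to every later step. A minimal sketch of consuming it, assuming the JSON carries a list of word entries with `text`, `start`, and `end` fields in seconds (check the actual file for the exact schema) — here, finding the long silences that a pauses artifact would record:

```python
def long_pauses(words, min_gap=0.75):
    """Return (gap_start, gap_end) for silences longer than `min_gap` seconds."""
    pauses = []
    for prev, cur in zip(words, words[1:]):
        gap = cur["start"] - prev["end"]
        if gap >= min_gap:
            pauses.append((prev["end"], cur["start"]))
    return pauses

words = [
    {"text": "hello", "start": 0.0, "end": 0.4},
    {"text": "again", "start": 2.0, "end": 2.5},  # 1.6 s pause before this word
]
print(long_pauses(words))  # → [(0.4, 2.0)]
```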
### 2. Render a de-stutter preview
Run:
```
python3 scripts/build_destutter_preview.py \
  --media /path/to/take.webm \
  --transcript /path/to/take.elevenlabs.transcript.json \
  --output /path/to/take.destutter-preview.mp4
```

This only targets immediate doubled words or obvious stutters. It is a preview pass, not a full editorial cut.
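The doubled-word detection can be sketched like this. It is a simplified illustration, not the script's actual logic, and it assumes word entries with `text`, `start`, and `end` fields in seconds:

```python
def doubled_word_spans(words, max_gap=0.5):
    """Return (start, end) spans of the first copy of an immediately repeated word."""
    spans = []
    for prev, cur in zip(words, words[1:]):
        same_word = prev["text"].lower().strip(".,!?") == cur["text"].lower().strip(".,!?")
        if same_word and cur["start"] - prev["end"] <= max_gap:
            spans.append((prev["start"], prev["end"]))  # cut the earlier copy
    return spans

words = [
    {"text": "I", "start": 0.0, "end": 0.2},
    {"text": "I", "start": 0.3, "end": 0.5},
    {"text": "went", "start": 0.6, "end": 0.9},
]
print(doubled_word_spans(words))  # → [(0.0, 0.2)]
```

Cutting the earlier copy keeps the speaker's final, usually cleaner, delivery of the word.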
### 3. Run speech-to-speech
First extract aligned audio from the de-stutter preview:
```
ffmpeg -i /path/to/take.destutter-preview.mp4 -vn -ac 1 /path/to/take.destutter-preview.wav
```

Then run:

```
python3 scripts/speech_to_speech_elevenlabs.py \
  --input-audio /path/to/take.destutter-preview.wav \
  --voice-name "Celestia 6" \
  --output /path/to/take.v2v.mp3
```

Use `--voice-id` if you already know the exact ElevenLabs voice.
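Under the hood, a speech-to-speech call looks roughly like this. This is a sketch against the ElevenLabs HTTP API; the endpoint path, header name, and model id are assumptions to verify against the current API docs, and the bundled script may do it differently:

```python
def sts_request(voice_id, api_key, model_id="eleven_multilingual_sts_v2"):
    """Build the pieces of an ElevenLabs speech-to-speech HTTP request.

    The source audio itself is sent as multipart form data (the `audio` field).
    """
    url = f"https://api.elevenlabs.io/v1/speech-to-speech/{voice_id}"
    headers = {"xi-api-key": api_key}
    form = {"model_id": model_id}
    return url, headers, form

url, headers, form = sts_request("abc123", "sk-demo")
print(url)  # → https://api.elevenlabs.io/v1/speech-to-speech/abc123
```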
### 4. Lay the new voice back onto the video
Run:
```
python3 scripts/mux_audio_to_video.py \
  --video /path/to/take.destutter-preview.mp4 \
  --audio /path/to/take.v2v.mp3 \
  --output /path/to/take.final.mp4
```

## Guardrails
- Do not trust speech-to-speech timing blindly. Compare generated audio duration against the source before muxing.
- Treat the de-stutter pass as a conservative cleanup only. Larger retakes still need manual editorial judgment.
- Keep the original take, the transcript JSON, the de-stutter preview, and the speech-to-speech result as separate artifacts.
- If the generated voice drifts badly in pacing, switch voice or settings rather than forcing a bad output into the video.
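The first guardrail is easy to automate before muxing. A sketch using `ffprobe` (assumes it is on PATH; `drift_ok` itself is pure arithmetic, and the 2% tolerance is an illustrative default, not a value from the pipeline):

```python
import json
import subprocess

def media_duration(path):
    """Probe a file's duration in seconds via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json", "-show_format", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(json.loads(out)["format"]["duration"])

def drift_ok(src_seconds, gen_seconds, tolerance=0.02):
    """True if the generated audio is within `tolerance` (relative) of the source."""
    return abs(gen_seconds - src_seconds) / src_seconds <= tolerance

# e.g. drift_ok(media_duration("take.destutter-preview.wav"),
#               media_duration("take.v2v.mp3"))
```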
## Files
- `assets/camera-recorder.html.txt`: local browser recorder with camera + mic controls
- `scripts/transcribe_with_elevenlabs.py`: extract/transcribe and emit word-level transcript artifacts
- `scripts/build_destutter_preview.py`: find immediate repeated words and render a cleaned preview
- `scripts/speech_to_speech_elevenlabs.py`: call ElevenLabs speech-to-speech
- `scripts/mux_audio_to_video.py`: combine cleaned video with generated audio