Files
2026-03-17 16:53:22 -07:00

3.2 KiB

CLI reference (scripts/text_to_speech.py)

This file contains the "command catalog" for the bundled speech generation CLI. Keep SKILL.md as overview-first; put verbose CLI details here.

What this CLI does

  • speak: generate a single audio file
  • speak-batch: run many jobs from a JSONL file (one job per line)
  • list-voices: list supported voices

Real API calls require network access + OPENAI_API_KEY. --dry-run does not.

Quick start (works from any repo)

Set a stable path to the skill CLI (default CODEX_HOME is ~/.codex):

export CODEX_HOME="${CODEX_HOME:-$HOME/.codex}"
export TTS_GEN="$CODEX_HOME/skills/speech/scripts/text_to_speech.py"

Dry-run (no API call; no network required; does not require the openai package):

python "$TTS_GEN" speak --input "Test" --dry-run

Generate (requires OPENAI_API_KEY + network):

uv run --with openai python "$TTS_GEN" speak \
  --input "Today is a wonderful day to build something people love!" \
  --voice cedar \
  --instructions "Voice Affect: Warm and composed. Tone: upbeat and encouraging." \
  --response-format mp3 \
  --out speech.mp3

No uv installed? Use your active Python env:

python "$TTS_GEN" speak --input "Hello" --voice cedar --out speech.mp3

Guardrails (important)

  • Use python "$TTS_GEN" ... (or equivalent full path) for all TTS work.
  • Do not create one-off runners (e.g., gen_audio.py) unless the user explicitly asks.
  • Never modify scripts/text_to_speech.py. If something is missing, ask the user before doing anything else.

Defaults (unless overridden by flags)

  • Model: gpt-4o-mini-tts-2025-12-15
  • Voice: cedar
  • Response format: mp3
  • Speed: 1.0
  • Batch rpm cap: 50

Input limits

  • Input text must be <= 4096 characters per request.
  • For longer text, split into smaller chunks (manual or via batch JSONL).

Instructions compatibility

  • instructions are supported for GPT-4o mini TTS models.
  • tts-1 and tts-1-hd ignore instructions (the CLI will warn and drop them).

Common recipes

List voices:

python "$TTS_GEN" list-voices

Generate with explicit pacing:

python "$TTS_GEN" speak \
  --input "Welcome to the demo. We'll show how it works." \
  --instructions "Tone: friendly and confident. Pacing: steady and moderate." \
  --out demo.mp3

Batch generation (JSONL):

mkdir -p tmp/speech
cat > tmp/speech/jobs.jsonl << 'JSONL'
{"input":"Thank you for calling. Please hold.","voice":"cedar","response_format":"mp3","out":"hold.mp3"}
{"input":"For sales, press 1. For support, press 2.","voice":"marin","instructions":"Tone: clear and neutral. Pacing: slow.","response_format":"wav"}
JSONL

python "$TTS_GEN" speak-batch --input tmp/speech/jobs.jsonl --out-dir out --rpm 50

# Cleanup (recommended)
rm -f tmp/speech/jobs.jsonl

Notes:

  • Use --rpm to control rate limiting (default 50, max 50).
  • Per-job overrides are supported in JSONL (model, voice, response_format, speed, instructions, out).
  • Treat the JSONL file as temporary: write it under tmp/ and delete it after the run (do not commit it).

See also

  • API parameter quick reference: references/audio-api.md
  • Instruction patterns and examples: references/voice-directions.md