---
name: vox
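description: Vox Chinese TTS. Synthesizes Chinese (or mixed Chinese-English) text into natural, human-sounding speech and returns a public mp3 URL; it can also clone a custom voice from a reference audio clip and synthesize with it. Invoke when the user asks for tasks such as 「文本转语音 / TTS / 配音 / 朗读 / 生成中文音频 / Chinese TTS / synthesize speech / 声音克隆 / 用我的声音说话 / voice cloning」. Built on the Vox HTTP API (vox.timor419.com); curl is enough, no extra tools to install.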
license: MIT
allowed-tools:
  - Bash
---

# Vox Chinese TTS skill

You can synthesize natural-sounding Chinese speech via the Vox HTTP API.
The response is a JSON object whose `url` field points to a publicly
playable mp3 hosted on Vox's CDN. You hand that URL to the user — no
local file management, no playback gymnastics.

## When to use this skill

Trigger when the user asks to:

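- Turn Chinese (or mixed Chinese-English) text into speech, a voice-over (配音), a read-aloud (朗读), or natural-sounding narration
- "TTS", "text-to-speech", "synthesize", "generate audio for ..."
- Batch-generate audio from scripts, subtitles, or articles
- Synthesize 「用我的声音 / 用这段录音的声音」 ("in my voice / in the voice from this recording"): clone via the voice-clone endpoint first, then synthesize as usual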

## Setup (run this once before first use)

This skill calls the Vox service. The user needs an API key (a string
starting with `JK`, ~40 chars) before any call works.

**If the user does NOT yet have a key**, point them to:

> **https://vox.timor419.com/get-key**

That page is the single source of truth for current onboarding (sales
channel, login flow, key copy step). Do not memorize or quote its content
— it changes as Vox evolves. Just send the user there.

**Once the user has a `JK` key**, instruct them to set it as an env var
in the shell where their AI agent runs:

```bash
export VOX_API_KEY='JK********'  # placeholder: substitute the real key
```

Verify it's available before making the first call:

```bash
test -n "$VOX_API_KEY" && echo "ok" || echo "VOX_API_KEY not set"
```
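
As an optional extra check, the key's shape can be validated without ever printing it. This is a minimal sketch that assumes only what is stated above (the key starts with `JK`); the function name `vox_key_looks_valid` is hypothetical, and passing this check says nothing about whether the server will accept the key.

```bash
# Hypothetical helper: returns 0 if the argument starts with "JK" and has at
# least one more character, 1 otherwise. It never echoes the key itself.
vox_key_looks_valid() {
  case "$1" in
    JK?*) return 0 ;;
    *)    return 1 ;;
  esac
}

if vox_key_looks_valid "${VOX_API_KEY:-}"; then
  echo "key format ok"
else
  echo "VOX_API_KEY is unset or does not start with JK"
fi
```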

**Security**: never ask the user to paste the API key into chat — it
would be retained in conversation history. The env-var path keeps it
local to their shell.

## How to call the API

All endpoints live at `https://vox.timor419.com/api/v1/` and accept
`Authorization: Bearer $VOX_API_KEY`. All responses are JSON.

### 1. List voices (auth optional, for picking a voice_id)

```bash
curl -s https://vox.timor419.com/api/v1/voices | head -c 4000
```

Response shape:

```json
{
  "voices": [
    { "id": "云楚灵", "display_name": "云楚灵", "demo_url": "https://...", "featured": true },
    ...
  ],
  "total": 147
}
```

Voices come back **featured-first** — the first ~10 are the operator's
curated picks. When the user doesn't specify a voice, you can either
print 5-10 candidate `id` values and ask them to choose, or just omit
`voice_id` from the request — the server falls back to the first voice
in this list (the platform's current top recommendation), and the
response's `voice_id` field tells you which one was used.

**Tip**: calling this endpoint WITH `Authorization: Bearer $VOX_API_KEY`
returns the account's *actual* allowed list — some accounts have a
restricted voice package, and authenticated calls also append the
account's own cloned voices (flagged `"owned": true`). Prefer the
authenticated form in integrations.
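
To show the user a short list of candidates, the `id` values can be pulled out of the voices response. The sketch below uses only `grep` and `sed`, in keeping with the no-extra-tools premise; `vox_voice_ids` is a hypothetical helper, and it assumes the response shape shown above with no escaped quotes inside `id` values. A real JSON parser is more robust if one is installed.

```bash
# Hypothetical helper: read a /api/v1/voices response on stdin and print the
# first N voice ids, one per line (N defaults to 10). Order is preserved, so
# featured voices come first.
vox_voice_ids() {
  grep -o '"id"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -e 's/^"id"[[:space:]]*:[[:space:]]*"//' -e 's/"$//' \
    | head -n "${1:-10}"
}

# Usage (network call, shown commented out):
# curl -s https://vox.timor419.com/api/v1/voices \
#   -H "Authorization: Bearer $VOX_API_KEY" | vox_voice_ids 5
```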

### 2. Synthesize text → mp3 URL

`voice_id` is **optional**: omit it to let the server pick the first
voice from `/api/v1/voices`. Include it when you want a specific voice.

```bash
# Default voice (first in /api/v1/voices)
curl -s -X POST https://vox.timor419.com/api/v1/tts \
  -H "Authorization: Bearer $VOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"你好，世界"}'

# Pick a specific voice
curl -s -X POST https://vox.timor419.com/api/v1/tts \
  -H "Authorization: Bearer $VOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"你好，世界","voice_id":"云楚灵"}'
```
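
The inline `-d '{...}'` payloads above break as soon as the text contains a double quote, a backslash, or a line break. One way to build the body safely with standard tools is sketched below. `json_escape` and `vox_tts_body` are hypothetical helpers, not part of Vox, and the escaping covers only backslashes, double quotes, and newlines.

```bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslashes, double quotes, and newlines only; other control
# characters (tabs, etc.) are NOT handled in this sketch.
json_escape() {
  printf '%s' "$1" \
    | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
    | awk 'NR > 1 { printf "\\n" } { printf "%s", $0 }'
}

# Hypothetical helper: build the /api/v1/tts request body.
# $1 = text (required), $2 = voice_id (optional).
vox_tts_body() {
  if [ -n "${2:-}" ]; then
    printf '{"text":"%s","voice_id":"%s"}' \
      "$(json_escape "$1")" "$(json_escape "$2")"
  else
    printf '{"text":"%s"}' "$(json_escape "$1")"
  fi
}

# Usage (network call, shown commented out):
# curl -s -X POST https://vox.timor419.com/api/v1/tts \
#   -H "Authorization: Bearer $VOX_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$(vox_tts_body 'She said "hi"' '云楚灵')"
```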

Optional `format` (default `mp3`): pass `"format":"wav"` to get a 16-bit
PCM, mono, 44.1 kHz wav instead (same billing; `url` ends in `.wav`).
Useful when the consumer requires a raw `.wav` file (e.g. car head-unit
chimes).

Response (JSON):

```json
{
  "url": "https://oss.timor419.com/tts/output/abc123def456789abcdef0123456789a.mp3",
  "request_id": "abc123def456789abcdef0123456789a",
  "voice_id": "云楚灵",
  "units": 4,
  "remaining": 999996,
  "duration_ms": 3245
}
```

Parse the JSON, extract `.url`, and **give that URL to the user as the
final answer** — they can click it to play in any browser. Do **not**
download the mp3 to disk unless the user explicitly asks for an offline
file (the URL is public and stable; downloading is unnecessary friction
in the common case).
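
The parsing step can be sketched with `sed` alone. `vox_json_string` and `vox_json_number` are hypothetical helpers that assume the flat response shape shown above, with no escaped quotes inside values; they are not a general JSON parser, and a tool such as `jq` is the sturdier choice where it is available.

```bash
# Hypothetical helper: print the value of a string field from JSON on stdin.
vox_json_string() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Hypothetical helper: print the value of a non-negative integer field.
vox_json_number() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Sample response copied from above (abridged); in practice this would be
# the output of the curl call.
response='{"url":"https://oss.timor419.com/tts/output/abc123def456789abcdef0123456789a.mp3","voice_id":"云楚灵","units":4,"remaining":999996}'

url=$(printf '%s\n' "$response" | vox_json_string url)
remaining=$(printf '%s\n' "$response" | vox_json_number remaining)
echo "url=$url remaining=$remaining"
```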

A typical reply to the user looks like:

> 已合成。点这里听：https://oss.timor419.com/tts/output/abc123…a.mp3
>
> 音色：云楚灵 · 字符数：4 · 剩余余额：999996

### 3. Check balance (zero cost)

```bash
curl -s https://vox.timor419.com/api/v1/balance \
  -H "Authorization: Bearer $VOX_API_KEY"
```

Returns `{ balance, unit, account_number, app: { slug, name } }`.

### 4. Clone a voice from the user's own audio (voice cloning)

When the user wants TTS in **their own voice** (or any voice they have a
recording of), clone it first, then synthesize with the returned `voice_id`:

```bash
# From a local file:
curl -s -X POST https://vox.timor419.com/api/v1/voices/clone \
  -H "Authorization: Bearer $VOX_API_KEY" \
  -F "file=@my-voice.wav" \
  -F "name=我的声音" \
  --max-time 120

# Or from a publicly reachable URL (server downloads it for you):
curl -s -X POST https://vox.timor419.com/api/v1/voices/clone \
  -H "Authorization: Bearer $VOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/my-voice.wav","name":"我的声音"}' \
  --max-time 120
```

Requirements for the reference audio: **≤ 15 MB, ≤ 15 seconds**; format
can be **wav / mp3 / flac / m4a / aac / ogg** (the server transcodes
automatically). A clean, single-speaker recording without background
music clones best.

The first clone of a recording takes 20-40s because a demo line is
rendered, so pass `--max-time 120` to curl. Prefer the URL form when the
audio is already hosted somewhere: repeated clones of the same URL are
served from cache without re-downloading.
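
A local size check before uploading avoids a round trip that would end in `file_too_large`. The sketch below is an assumption-laden convenience: `vox_clone_size_ok` is a hypothetical helper, and "15 MB" is read as 15 × 1024 × 1024 bytes because the server's exact cutoff is not stated here. It does not check the 15-second duration limit, which would need an audio tool.

```bash
# Hypothetical helper: succeed if the file exists and is no larger than the
# limit. $1 = path, $2 = max bytes (optional, defaults to 15 MiB, which is
# an assumed reading of the documented "15 MB").
vox_clone_size_ok() {
  [ -f "$1" ] || return 1
  vox_size=$(wc -c < "$1" | tr -d ' ') || return 1
  [ "$vox_size" -le "${2:-$((15 * 1024 * 1024))}" ]
}

# Usage:
# vox_clone_size_ok my-voice.wav || echo "my-voice.wav is missing or too large"
```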

Response:

```json
{
  "ok": true,
  "duplicate": false,
  "voice": {
    "id": "cv-3f9a1b2c4d5e",
    "display_name": "我的声音",
    "demo_url": "https://oss.timor419.com/tts/clone-demos/….mp3",
    "duration_sec": 8.4,
    "created_at": "2026-06-06T08:00:00.000Z"
  },
  "units": 300,
  "remaining": 999700
}
```

Then synthesize with it exactly like any catalog voice:

```bash
curl -s -X POST https://vox.timor419.com/api/v1/tts \
  -H "Authorization: Bearer $VOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"text":"你好，这是我自己的声音","voice_id":"cv-3f9a1b2c4d5e"}'
```

Notes:

- **Pricing**: cloning costs a one-time fee (currently 300 units — same
  unit as TTS characters). Synthesis with a cloned voice costs the same
  per-character rate as catalog voices.
- **Dedup**: cloning the byte-identical file again is free and returns the
  existing voice (`"duplicate": true`). Don't worry about retries.
- **Privacy**: cloned voices are private to the API key's account. They
  appear in `GET /api/v1/voices` (flagged `"owned": true`) only when that
  endpoint is called WITH the Authorization header.
- **Delete**: `curl -X DELETE ".../api/v1/voices/clone?voice_id=cv-…" -H
  "Authorization: Bearer $VOX_API_KEY"`.
- Clone-specific errors: `unsupported_audio` / `file_too_large` /
  `url_too_large` / `duration_exceeded` (fix the file), `invalid_url` /
  `url_fetch_failed` (check the URL is publicly reachable),
  `render_failed` (retry — not charged).

## Optional: download the mp3 locally

Only if the user explicitly asks for a local file ("save it as
greeting.mp3", "I want to attach it to email", etc.):

```bash
curl -fsSL "$URL" -o greeting.mp3 && echo "saved to $(pwd)/greeting.mp3"
```

`$URL` is the value of `.url` from the synthesis response. The file at
that URL is public and currently has no expiry, but retention policy may
change, so do not treat it as permanent.

## Common errors and how to recover

| HTTP | error code | What it means | What to do |
|---|---|---|---|
| 401 | `missing_api_key` / `invalid_api_key` | env var unset or wrong | Walk user back through Setup (point to /get-key) |
| 402 | `insufficient_balance` | Balance exhausted (no character units left) | Tell user to buy a new activation code at https://vox.timor419.com/get-key |
| 403 | `account_disabled` / `api_key_revoked` | Account was disabled or key was revoked | Tell user to log in to dashboard for a new key, or contact support |
| 403 | `voice_disabled` | The voice was retired by admin | Re-run the voices endpoint and pick another |
| 403 | `voice_not_allowed` | This account has a restricted voice package and the voice_id is outside it | Re-fetch `/api/v1/voices` WITH the Authorization header — it returns the account's actual allowed list — and pick from that |
| 429 | `rate_limited` | 60 req/min/key cap hit | Wait the number of seconds given in the `Retry-After` header, then retry |
| 502 | `synthesis_failed` | ComfyUI render hiccup | Retry once or twice; if persistent, surface the error to the user |
| 502 | `storage_failed` | mp3 generated but upload to Vox CDN failed | Retry; if persistent, contact support |
| 504 | `synthesis_failed` (timeout) | Render exceeded timeout | Retry, optionally with shorter text |

Error response shape (uniform across all 4xx/5xx):

```json
{ "error": "<code>", "message": "<human-readable>", "detail": "..." }
```
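
The recovery table can be condensed into a small decision function. This is a sketch under assumptions: `vox_error_action` is a hypothetical helper, the action names are invented for illustration, and `render_failed` is matched by error code because its HTTP status is not documented here.

```bash
# Hypothetical helper: map an HTTP status and error code to a coarse action.
# Prints one of: ok, wait, retry, fail.
vox_error_action() {
  vox_status=$1
  vox_code=${2:-}
  case "$vox_status" in
    200)     echo ok ;;
    429)     echo wait ;;    # sleep for Retry-After seconds, then retry
    502|504) echo retry ;;   # synthesis_failed / storage_failed / timeout
    *)
      if [ "$vox_code" = "render_failed" ]; then
        echo retry           # clone render failure: not charged
      else
        echo fail            # 401/402/403 etc. need user action
      fi ;;
  esac
}

# Usage sketch (network call, shown commented out). curl's -w prints the
# status code, -o saves the body, -D saves headers such as Retry-After.
# status=$(curl -s -o body.json -D headers.txt -w '%{http_code}' \
#   -X POST https://vox.timor419.com/api/v1/tts \
#   -H "Authorization: Bearer $VOX_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d '{"text":"你好，世界"}')
# vox_error_action "$status"
```

Keeping retries to "once or twice", as the table advises, is left to the caller.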

## Constraints to respect

- `text` is 1-2000 characters. For longer scripts, split at sentence
  boundaries into chunks, synthesize each chunk, and give the user the
  resulting URLs as an ordered list, OR download the parts and merge
  them with `ffmpeg -f concat` if the user wants one file.
- Cost is per-character and billed only on a successful `200` response
  with a URL. Failed calls are not charged, but there is no free re-run:
  repeating a call that already succeeded charges again. So validate the
  user's text before sending (e.g. confirm long inputs).
- The voice catalog can change. Don't hardcode voice IDs — always fetch
  fresh via the voices endpoint when the user wants to pick.
- The output `url` is currently kept indefinitely (no TTL). That may
  change as storage policy evolves; do not assume forever.
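
For the single-file case, ffmpeg's concat demuxer reads a list file with one `file '...'` line per part. The sketch below only generates that list; `vox_concat_list` is a hypothetical helper, the part names are placeholders, and it assumes the parts are already downloaded, share the same encoding, and have no single quotes in their names.

```bash
# Hypothetical helper: print an ffmpeg concat-demuxer list for the given
# local files, in the order they should play.
vox_concat_list() {
  for vox_part in "$@"; do
    printf "file '%s'\n" "$vox_part"
  done
}

# Usage (requires ffmpeg; shown commented out):
# vox_concat_list part1.mp3 part2.mp3 > list.txt
# ffmpeg -f concat -safe 0 -i list.txt -c copy merged.mp3
```

`-c copy` joins the parts without re-encoding, which is why they need matching formats.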

## Reference

- Web docs (HTTP API + examples in curl/Python/Node): https://vox.timor419.com/docs
- Get-API-key flow: https://vox.timor419.com/get-key
- Login + dashboard: https://vox.timor419.com/login