# Trelis Chorus v1 — GGML / whisper.cpp

GGML quantizations of Trelis/Chorus-v1 for CPU inference with whisper.cpp.
**Why GGML, not GGUF?** whisper.cpp still uses the legacy GGML `.bin` format — magic `0x67676d6c`, written by `convert-h5-to-ggml.py`. llama.cpp moved to GGUF years ago; whisper.cpp has not. An HF-safetensors → GGUF converter for Whisper is tracked in ggml-org/whisper.cpp#3316 but is not yet merged. The same constraint explains why there is no native Q4_K_M: whisper.cpp's model loader allocates every tensor from a single global `wtype`, so mixed-type quants aren't loadable.
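As a quick sanity check on a downloaded file, the magic can be read straight off the first four bytes. A minimal sketch — the field order (magic, then `int32` hparams starting with `n_vocab`) is assumed from the layout `convert-h5-to-ggml.py` writes:

```python
import struct

GGML_MAGIC = 0x67676D6C  # the ASCII bytes "ggml" packed as an int32


def read_ggml_header(path):
    """Read the magic and the first hparam (n_vocab) of a whisper.cpp GGML .bin.

    Assumes convert-h5-to-ggml.py's layout: int32 magic, then int32 hparams
    beginning with n_vocab.
    """
    with open(path, "rb") as f:
        magic, n_vocab = struct.unpack("<ii", f.read(8))
    if magic != GGML_MAGIC:
        raise ValueError(f"not a GGML file: magic=0x{magic:08x}")
    return n_vocab
```

For the Chorus files, `n_vocab` should read 51868 — the 51866-token Whisper vocab plus the two added speaker tokens.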
| File | Size | Use |
|---|---|---|
| `ggml-chorus-v1-f16.bin` | 1.6 GB | reference / parity baseline |
| `ggml-chorus-v1-q4_k.bin` | 452 MB | recommended — 3.6× smaller than f16, CER ≲ 7% vs f32 |
| `ggml-chorus-v1-q5_k.bin` | 547 MB | higher fidelity than q4_k, still a solid ~3× reduction |
| `ggml-chorus-v1-q8_0.bin` | 834 MB | near-lossless, useful if you want to verify a fix quickly |
Speaker-conditioned two-speaker transcription: run once per speaker with --speaker 1 or --speaker 2 (see below). Input is 16 kHz mono, ≤ 30 s per call.
Long audio. The model handles up to 30 s per call and has no notion of global speaker identity beyond that window. Stitching longer audio (chunking + keeping speaker labels consistent across chunks) is left to the caller; the Trelis Router hosted endpoint handles this end-to-end.
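One way to sketch that caller-side stitching: split the waveform into ≤30 s windows with a small overlap (so boundary words aren't clipped), run each window once per speaker, and concatenate per-speaker text. A minimal, library-free sketch — the window and overlap sizes are illustrative, and `transcribe_window` is a hypothetical stand-in for one `whisper-cli` / `whisper_full` call with `--speaker` set:

```python
SAMPLE_RATE = 16_000  # model expects 16 kHz mono
WINDOW_S = 30         # hard per-call limit for Chorus v1
OVERLAP_S = 2         # illustrative overlap to avoid clipping boundary words


def chunk_bounds(n_samples, sr=SAMPLE_RATE, window_s=WINDOW_S, overlap_s=OVERLAP_S):
    """Yield (start, end) sample indices for overlapping <=30 s windows."""
    step = (window_s - overlap_s) * sr
    start = 0
    while start < n_samples:
        end = min(start + window_s * sr, n_samples)
        yield start, end
        if end == n_samples:
            break
        start += step


def transcribe_long(samples, transcribe_window):
    """transcribe_window(chunk, speaker) -> text is the per-call model run."""
    out = {1: [], 2: []}
    for s, e in chunk_bounds(len(samples)):
        for spk in (1, 2):
            out[spk].append(transcribe_window(samples[s:e], spk))
    return {spk: " ".join(parts) for spk, parts in out.items()}
```

Note this naive concatenation does nothing to keep speaker identity consistent across chunks — that reconciliation is exactly the part the hosted Trelis Router endpoint handles for you.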
## Quickstart

```sh
# 1. Build patched whisper.cpp (adds --speaker flag, see patches/)
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
git apply ../patches/whisper.cpp-chorus-v1.patch
cmake -B build -DGGML_METAL=ON && cmake --build build -j --config Release

# 2. Run Chorus
./build/bin/whisper-cli \
    -m ../ggml-chorus-v1-q4_k.bin \
    -f your_clip.wav -l en --speaker 1 -nfa
./build/bin/whisper-cli \
    -m ../ggml-chorus-v1-q4_k.bin \
    -f your_clip.wav -l en --speaker 2 -nfa
```
## Python usage

```sh
python scripts/run_chorus.py your_clip.wav                  # both speakers
python scripts/run_chorus.py your_clip.wav --json out.json  # structured output
```
## HTTP server

```sh
uv run --with fastapi --with uvicorn --with python-multipart \
    python scripts/serve_chorus.py --port 8000
curl -s -F "file=@your_clip.wav" http://localhost:8000/transcribe | jq
```
## How it works
Chorus v1 is a LoRA fine-tune of whisper-large-v3-turbo with two extra special tokens:
| token | id |
|---|---|
| `<\|speaker1\|>` | 51866 |
| `<\|speaker2\|>` | 51867 |
The decoder prefix is <\|startoftranscript\|> <\|en\|> <\|transcribe\|> <\|speakerN\|>, then the usual timestamp/text stream. Vanilla whisper.cpp has no way to inject that 4th token, so the patch in patches/whisper.cpp-chorus-v1.patch adds:
- `--speaker {1,2}` CLI flag on `whisper-cli`. Looks up `<|speakerN|>` in the vocab and inserts it after `<|transcribe|>` in the decoder init prompt.
- `speaker_token` field on `whisper_full_params` for programmatic use.
- `whisper_token_lookup` C API — direct string→id lookup (the BPE `whisper_tokenize` can't resolve added special tokens atomically).
- Vocab-based offset correction when `n_vocab` exceeds 51866. Stock whisper.cpp assumes any tokens beyond the standard Whisper vocab are extra language codes; Chorus's two speaker tokens would otherwise shift `<|transcribe|>`, `<|notimestamps|>`, etc., by +40.
- `--max-initial-ts` flag (auto-set to 30.0 when `--speaker` is used). Whisper normally caps the first emitted timestamp at 1.0 s; Chorus's speaker 2 often starts several seconds in.
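The resulting init prompt is just a four-token id sequence. The sketch below hard-codes ids per the standard large-v3 vocab layout (`<|startoftranscript|>` = 50258, `<|en|>` = 50259, `<|transcribe|>` = 50360 are assumptions from that layout; the speaker ids come from the table above) — the patch itself resolves them by name via `whisper_token_lookup` at runtime rather than hard-coding:

```python
# Illustrative ids following the large-v3 vocabulary layout.
SOT = 50258          # <|startoftranscript|>
LANG_EN = 50259      # <|en|>
TRANSCRIBE = 50360   # <|transcribe|>
SPEAKER = {1: 51866, 2: 51867}  # <|speaker1|>, <|speaker2|>


def chorus_init_prompt(speaker):
    """Decoder init prompt: the usual Whisper triple plus the speaker token."""
    return [SOT, LANG_EN, TRANSCRIBE, SPEAKER[speaker]]
```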
The Q4_K quant ships as `ftype=12` and loads with the standard whisper.cpp loader. A matching Q4_K_M (mixed Q4_K / Q6_K) was attempted — the converter (`quantize-k-mixed`) writes per-tensor types correctly, but whisper.cpp's model loader allocates all tensors from a single global `wtype`, so it rejects mixed types. True Q4_K_M needs a loader patch that collects per-tensor types ahead of allocation; Q4_K-only ships for now. See `patches/whisper.cpp-chorus-v1.patch` for the converter source if you want to experiment.
## Building from the upstream model

```sh
# Convert HF → f16 GGML (handles Chorus's expanded vocab; stock converter drops added_tokens)
python scripts/convert_chorus_to_ggml.py Trelis/Chorus-v1 ./openai-whisper ./models

# Quantize to Q4_K
./whisper.cpp/build/bin/whisper-quantize \
    models/ggml-chorus-v1-f16.bin models/ggml-chorus-v1-q4_k.bin q4_k
```
`./openai-whisper` is a checkout of https://github.com/openai/whisper (only `whisper/assets/mel_filters.npz` is actually read).
## Parity vs transformers (reference f32 on CPU)

Evaluated on the HF Space sample_podcast.wav clip (30 s, two speakers, overlapping):

| | CER vs transformers f32 |
|---|---|
| Q4_K speaker 1 | 3.36 % |
| Q4_K speaker 2 | 6.74 % |
Differences are word-level substitutions in noisy overlaps (e.g. "big area" ↔ "bad area") rather than structural failures. Benchmark reproduction in scripts/parity_check.py of the source repo.
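CER here means the standard character-level metric: edit distance divided by reference length. A minimal sketch of that definition (the actual benchmark lives in `scripts/parity_check.py` of the source repo and may differ in normalization details):

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance / len(ref)."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))  # DP row: distances against empty prefix of ref
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / m if m else 0.0
```

On the substitution example above, `cer("big area", "bad area")` gives 0.25: two substituted characters over an eight-character reference.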
## Limitations

Same as the base model (Trelis/Chorus-v1): two speakers, English, ≤30 s per call. For long audio, use the Trelis Router hosted endpoint.
## License

Apache 2.0.