Trelis Chorus v1 — GGML / whisper.cpp

GGML quantizations of Trelis/Chorus-v1 for CPU inference with whisper.cpp.

Why GGML, not GGUF? whisper.cpp still uses the legacy GGML .bin format — magic 0x67676d6c, written by convert-h5-to-ggml.py. llama.cpp moved to GGUF years ago; whisper.cpp has not. A HF-safetensors→GGUF converter for Whisper is tracked in ggml-org/whisper.cpp#3316 but is not yet merged. A related loader limitation explains why there is no native Q4_K_M: whisper.cpp's model loader allocates every tensor from a single global wtype, so mixed-type quants aren't loadable.
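As a quick sanity check, the legacy magic can be verified directly; a minimal sketch (the magic value is the one stated above, written as a little-endian int32 by convert-h5-to-ggml.py):

```python
import struct

GGML_MAGIC = 0x67676D6C  # "ggml" in ASCII, stored as a little-endian int32

def is_whisper_ggml(path: str) -> bool:
    """Check that a .bin file starts with the legacy GGML magic."""
    with open(path, "rb") as f:
        header = f.read(4)
    return len(header) == 4 and struct.unpack("<i", header)[0] == GGML_MAGIC
```

A GGUF file starts with the ASCII bytes "GGUF" instead, so this check also distinguishes the two formats.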

| File | Size | Use |
| --- | --- | --- |
| ggml-chorus-v1-f16.bin | 1.6 GB | reference / parity baseline |
| ggml-chorus-v1-q4_k.bin | 452 MB | recommended — 3.6× smaller than f16, CER ≲ 7% vs f32 |
| ggml-chorus-v1-q5_k.bin | 547 MB | higher fidelity than q4_k, still a solid ~3× reduction |
| ggml-chorus-v1-q8_0.bin | 834 MB | near-lossless, useful if you want to verify a fix quickly |

Speaker-conditioned two-speaker transcription: run once per speaker with --speaker 1 or --speaker 2 (see below). Input is 16 kHz mono, ≤ 30 s per call.

Long audio. The model handles up to 30 s per call and has no notion of global speaker identity beyond that window. Stitching longer audio (chunking + keeping speaker labels consistent across chunks) is left to the caller; the Trelis Router hosted endpoint handles this end-to-end.
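For callers doing their own stitching, the chunking half is straightforward with the stdlib wave module; a sketch under the 16 kHz mono WAV assumption (reconciling speaker labels across chunks is still up to you):

```python
import wave

def chunk_wav(path: str, chunk_s: float = 30.0):
    """Yield raw PCM frames for consecutive ≤30 s windows of a WAV file.

    Each yielded chunk can be fed to one per-speaker transcription call;
    keeping speaker 1/2 labels consistent across chunks (e.g. by matching
    each chunk's text against the previous chunk's tail) is the caller's job.
    """
    with wave.open(path, "rb") as w:
        frames_per_chunk = int(w.getframerate() * chunk_s)
        while True:
            frames = w.readframes(frames_per_chunk)
            if not frames:
                break
            yield frames
```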

Quickstart

# 1. Build patched whisper.cpp (adds --speaker flag, see patches/)
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
git apply ../patches/whisper.cpp-chorus-v1.patch
cmake -B build -DGGML_METAL=ON && cmake --build build -j --config Release

# 2. Run Chorus
./build/bin/whisper-cli \
    -m ../ggml-chorus-v1-q4_k.bin \
    -f your_clip.wav -l en --speaker 1 -nfa
./build/bin/whisper-cli \
    -m ../ggml-chorus-v1-q4_k.bin \
    -f your_clip.wav -l en --speaker 2 -nfa
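The two CLI passes above can be driven from Python with subprocess; a sketch assuming the patched whisper-cli binary and the flags shown above (chorus_cmd and transcribe_both are illustrative names, not part of the repo's scripts):

```python
import subprocess

def chorus_cmd(model: str, wav: str, speaker: int) -> list[str]:
    """Build the whisper-cli invocation for one speaker pass."""
    if speaker not in (1, 2):
        raise ValueError("Chorus v1 supports --speaker 1 or 2")
    return [
        "./build/bin/whisper-cli",
        "-m", model,
        "-f", wav,
        "-l", "en",
        "--speaker", str(speaker),
        "-nfa",
    ]

def transcribe_both(model: str, wav: str) -> dict[int, str]:
    """Run the patched CLI once per speaker and collect stdout."""
    return {
        s: subprocess.run(chorus_cmd(model, wav, s),
                          capture_output=True, text=True, check=True).stdout
        for s in (1, 2)
    }
```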

Python usage

python scripts/run_chorus.py your_clip.wav                  # both speakers
python scripts/run_chorus.py your_clip.wav --json out.json  # structured output

HTTP server

uv run --with fastapi --with uvicorn --with python-multipart \
    python scripts/serve_chorus.py --port 8000

curl -s -F "file=@your_clip.wav" http://localhost:8000/transcribe | jq
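The curl -F upload is an ordinary multipart/form-data POST; a stdlib-only client sketch (the /transcribe path and the file field name mirror the curl command above; multipart_body is a hypothetical helper, not part of the repo):

```python
import urllib.request
import uuid

def multipart_body(field: str, filename: str, data: bytes,
                   content_type: str = "audio/wav"):
    """Encode one file as a multipart/form-data body, as curl -F does."""
    boundary = uuid.uuid4().hex
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"

def transcribe(path: str, url: str = "http://localhost:8000/transcribe") -> bytes:
    """POST a WAV file to the server and return the raw JSON response."""
    with open(path, "rb") as f:
        body, ctype = multipart_body("file", path, f.read())
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": ctype})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```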

How it works

Chorus v1 is a LoRA fine-tune of whisper-large-v3-turbo with two extra special tokens:

| token | id |
| --- | --- |
| <\|speaker1\|> | 51866 |
| <\|speaker2\|> | 51867 |

The decoder prefix is <|startoftranscript|> <|en|> <|transcribe|> <|speakerN|>, then the usual timestamp/text stream. Vanilla whisper.cpp has no way to inject that fourth token, so the patch in patches/whisper.cpp-chorus-v1.patch adds:

  1. --speaker {1,2} CLI flag on whisper-cli. Looks up <|speakerN|> in the vocab and inserts it after <|transcribe|> in the decoder init prompt.
  2. speaker_token field on whisper_full_params for programmatic use.
  3. whisper_token_lookup C API — direct string→id lookup (the BPE whisper_tokenize can't resolve added special tokens atomically).
  4. Vocab-based offset correction when n_vocab exceeds 51866. Stock whisper.cpp assumes any tokens beyond the standard Whisper vocab are extra language codes; Chorus's two speaker tokens would otherwise shift <|transcribe|>, <|notimestamps|>, etc., by +40.
  5. --max-initial-ts flag (auto-set to 30.0 when --speaker is used). Whisper normally caps the first emitted timestamp to 1.0 s; Chorus's speaker 2 often starts several seconds in.
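Putting the prefix together as token ids: the speaker ids are the ones in the table above, while the base Whisper ids are the standard large-v3 tokenizer values and should be treated as assumptions of this sketch, not values taken from the patch:

```python
# Base Whisper special-token ids (standard large-v3 tokenizer; an
# assumption of this sketch). Speaker ids are from the table above.
SOT, EN, TRANSCRIBE = 50258, 50259, 50360
SPEAKER = {1: 51866, 2: 51867}

def decoder_prefix(speaker: int) -> list[int]:
    """<|startoftranscript|> <|en|> <|transcribe|> <|speakerN|>"""
    return [SOT, EN, TRANSCRIBE, SPEAKER[speaker]]
```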

The Q4_K quant ships as ftype=12 and loads with the standard whisper.cpp loader. A matching Q4_K_M (mixed Q4_K / Q6_K) was attempted — the converter (quantize-k-mixed) writes per-tensor types correctly, but whisper.cpp's model loader allocates all tensors from a single global wtype, so it rejects mixed types. True Q4_K_M would need a loader patch that collects per-tensor types ahead of allocation; only Q4_K ships for now. See patches/whisper.cpp-chorus-v1.patch for the converter source if you want to experiment.
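For reference, the on-disk ftype field can be decoded without loading the model; a sketch assuming ggml's usual encoding (quantization version × 1000 plus the GGML_FTYPE code, with Q4_K = 12 as noted above):

```python
# ggml GGML_FTYPE_MOSTLY_* codes (assumed from ggml.h; Q4_K = 12 as above).
FTYPES = {0: "f32", 1: "f16", 2: "q4_0", 3: "q4_1", 7: "q8_0",
          8: "q5_0", 9: "q5_1", 10: "q2_k", 11: "q3_k",
          12: "q4_k", 13: "q5_k", 14: "q6_k"}

QNT_VERSION_FACTOR = 1000  # header stores qnt_version * 1000 + ftype

def decode_ftype(raw: int):
    """Split the header's raw ftype field into (quant version, type name)."""
    return raw // QNT_VERSION_FACTOR, FTYPES.get(raw % QNT_VERSION_FACTOR)
```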

Building from the upstream model

# Convert HF → f16 GGML (handles Chorus's expanded vocab; stock converter drops added_tokens)
python scripts/convert_chorus_to_ggml.py Trelis/Chorus-v1 ./openai-whisper ./models
# Quantize to Q4_K
./whisper.cpp/build/bin/whisper-quantize \
    models/ggml-chorus-v1-f16.bin models/ggml-chorus-v1-q4_k.bin q4_k

openai-whisper is a checkout of https://github.com/openai/whisper (only whisper/assets/mel_filters.npz is actually read).

Parity vs transformers (reference f32 on CPU)

Evaluated on the HF Space sample_podcast.wav clip (30 s, two speakers, overlapping):

| quant / pass | CER vs transformers f32 |
| --- | --- |
| Q4_K, speaker 1 | 3.36 % |
| Q4_K, speaker 2 | 6.74 % |

Differences are word-level substitutions in noisy overlaps (e.g. "big area" ↔ "bad area") rather than structural failures. The benchmark can be reproduced with scripts/parity_check.py in the source repo.
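CER here is the usual character-level edit distance normalized by reference length; a minimal sketch (scripts/parity_check.py in the source repo is the authoritative version):

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance / len(ref)."""
    if not ref:
        raise ValueError("empty reference")
    # One-row dynamic-programming edit distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)
```

For the example above, "big area" vs "bad area" is two substitutions over eight reference characters, i.e. a CER of 0.25.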

Limitations

Same as the base model (Trelis/Chorus-v1): two speakers, English, ≤30 s per call. For long audio, use the Trelis Router hosted endpoint.

License

Apache 2.0.
