Q8 quantization produces garbled output in thinking channel
#1
by ahoybrotherbear - opened
Running the Q8 model via mlx-vlm (installed from the GitHub main branch) produces
garbled CJK/Unicode characters in the `<channel>thought` block. The same issue
occurs with a self-converted MXFP8 quantization. BF16 from google/gemma-4-31B-it
works perfectly, with clean output.
Tested on a Mac Studio (M-series, 512 GB RAM), mlx-vlm from git main (post-0.4.2).
BF16: WORKS
- Clean text output
- 10 tok/s, 62.8 GB peak memory

Q8: BROKEN
- Garbled Unicode in the thinking channel
- 18 tok/s, 34 GB peak memory

MXFP8 (self-converted): ALSO BROKEN
- Same garbled output pattern
- 18 tok/s, 33 GB peak memory
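For anyone comparing quantized runs against the BF16 baseline: a simple heuristic for catching this failure mode automatically is to measure the density of CJK codepoints in output that should be Latin-script. This is a minimal sketch, not part of the report; the function names and the 0.3 threshold are my own choices.

```python
import unicodedata

def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK codepoints."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # unicodedata.name() includes "CJK" for CJK Unified Ideographs
    cjk = sum(1 for c in chars if "CJK" in unicodedata.name(c, ""))
    return cjk / len(chars)

def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Flag output whose CJK density exceeds the threshold.

    Only meaningful when the prompt is expected to produce
    Latin-script text, as in the runs described above.
    """
    return cjk_ratio(text) > threshold
```

Running this over the `<channel>thought` block of each quantization makes the comparison scriptable instead of eyeballing transcripts.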
This is likely a quantization issue specific to Gemma 4's new thinking-token format.
Will fix
prince-canuma changed discussion status to closed