Q8 quantization produces garbled output in thinking channel

#1
by ahoybrotherbear - opened
MLX Community org

Running the Q8 model via mlx-vlm (installed from GitHub main branch) produces
garbled CJK/Unicode characters in the <channel>thought block. The same issue
occurs with a self-converted MXFP8 quantization, while BF16 from
google/gemma-4-31B-it produces clean output.

Tested on Mac Studio M-series, 512GB RAM, mlx-vlm from git main (post-0.4.2).

BF16: WORKS

  • Clean text output
  • 10 tok/s, 62.8GB peak memory

Q8: BROKEN

  • Garbled Unicode in thinking channel
  • 18 tok/s, 34GB peak memory

MXFP8 self-converted: ALSO BROKEN

  • Same garbled output pattern
  • 18 tok/s, 33GB peak memory
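For anyone reproducing this, a quick programmatic check beats eyeballing the thinking text. This is a minimal sketch, not part of mlx-vlm — the function name and threshold are illustrative. It flags output whose share of unexpected CJK/unassigned/private-use codepoints is high, which matches the garbling pattern described above when the prompt and expected reasoning are English:

```python
import unicodedata

def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Heuristic: flag text whose fraction of CJK ideographs, unassigned,
    or private-use codepoints exceeds `threshold` (value is illustrative)."""
    if not text:
        return False
    suspect = 0
    for ch in text:
        cat = unicodedata.category(ch)       # e.g. "Cn" = unassigned, "Co" = private use
        name = unicodedata.name(ch, "")
        if cat in ("Cn", "Co") or name.startswith("CJK UNIFIED IDEOGRAPH"):
            suspect += 1
    return suspect / len(text) > threshold

# An English thinking block should pass; a run of stray ideographs should not.
print(looks_garbled("The model reasons step by step."))  # False
print(looks_garbled("遠鍵遠鍵遠鍵遠鍵遠鍵"))              # True
```

Running this over the <channel>thought block of each quantization makes the BF16-vs-Q8 comparison mechanical.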

Likely a quantization issue specific to Gemma 4's new thinking token format.
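If it is a quantization artifact, re-converting locally makes it easy to bisect against the quantization settings. A rough sketch using mlx-vlm's converter — the output path and flag values here are placeholders, so check `python -m mlx_vlm.convert --help` on main for the exact options:

```shell
# Re-quantize from the BF16 checkpoint that is known to work.
# Varying --q-bits / --q-group-size shows whether the garbling
# tracks the quantization parameters or is independent of them.
python -m mlx_vlm.convert \
  --hf-path google/gemma-4-31B-it \
  --mlx-path ./gemma-4-31B-it-8bit \
  -q --q-bits 8 --q-group-size 64
```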


Will fix

prince-canuma changed discussion status to closed
