Q8 quantization produces garbled output in thinking channel
#1
by ahoybrotherbear - opened
Running the Q8 model via mlx-vlm (installed from the GitHub main branch) produces
garbled CJK/Unicode characters in the `<channel>thought` block. The same issue
occurs with a self-converted MXFP8 quantization. BF16 from google/gemma-4-31B-it
works perfectly, with clean output.
Tested on a Mac Studio (M-series, 512 GB RAM), mlx-vlm from git main (post-0.4.2).
BF16: WORKS
- Clean text output
- 10 tok/s, 62.8 GB peak memory

Q8: BROKEN
- Garbled Unicode in the thinking channel
- 18 tok/s, 34 GB peak memory

MXFP8 (self-converted): ALSO BROKEN
- Same garbled output pattern
- 18 tok/s, 33 GB peak memory
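For anyone comparing quantized runs against the BF16 baseline: a simple heuristic for catching this failure mode automatically is to measure the density of CJK codepoints in output that should be Latin-script. This is a minimal sketch, not part of the report; the function names and the 0.3 threshold are my own choices.

```python
import unicodedata

def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK codepoints."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # unicodedata.name() includes "CJK" for CJK Unified Ideographs
    cjk = sum(1 for c in chars if "CJK" in unicodedata.name(c, ""))
    return cjk / len(chars)

def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Flag output whose CJK density exceeds the threshold.

    Only meaningful when the prompt is expected to produce
    Latin-script text, as in the runs described above.
    """
    return cjk_ratio(text) > threshold
```

Running this over the `<channel>thought` block of each quantization makes the comparison scriptable instead of eyeballing transcripts.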
This is likely a quantization issue specific to Gemma 4's new thinking-token format.
Will fix
prince-canuma changed discussion status to closed