A distilled 4B code/reasoning model in GGUF format, optimized for local inference via llama.cpp.
Quick start (`-hf` downloads the model from Hugging Face on first run):

```shell
llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled-GGUF --jinja
```
To serve the model over llama-server's OpenAI-compatible HTTP API:

```shell
llama-server \
  -m Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled.Q4_K_M.gguf \
  --port 8001 --alias qwen3.5-4b-opus \
  -c 65536 -n 8192 \
  --temp 0.6 --top-p 0.95 --top-k 40 --repeat-penalty 1.05 \
  --flash-attn on --ctk q8_0 --ctv q8_0 \
  --jinja --chat-template-kwargs '{"enable_thinking": true}' \
  -ngl -1
```
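Once the server is up, any OpenAI-compatible client can talk to it on port 8001. A minimal sketch of the request payload, assuming the alias and sampling settings from the command above (the prompt text is just an example); sending it requires the server to actually be running:

```python
import json

def build_chat_request(prompt: str, thinking: bool = True) -> dict:
    """Build an OpenAI-style chat completion request for the local server."""
    return {
        "model": "qwen3.5-4b-opus",       # matches --alias
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,               # matches --temp 0.6
        "top_p": 0.95,                    # matches --top-p 0.95
        "max_tokens": 8192,               # matches -n 8192
        # forwarded to the chat template, as with --chat-template-kwargs
        "chat_template_kwargs": {"enable_thinking": thinking},
    }

payload = json.dumps(build_chat_request("Write a binary search in Python."))
# POST this to http://localhost:8001/v1/chat/completions with
# Content-Type: application/json when the server is running.
```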
Recommended presets (swap these in for the sampling and context flags above):

| Mode | Flags |
|---|---|
| Coding | `--temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1` |
| Reasoning | `--temp 0.6 --top-p 0.95 --top-k 40` |
| Low VRAM | `-c 32768 -n 4096 --flash-attn off -ngl 20` |
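For example, the Low VRAM preset slots into the server invocation like so (a sketch, assuming the same model file and port as above):

```shell
# Low-VRAM launch: smaller context and output budget, flash attention off,
# and only 20 layers offloaded to the GPU.
llama-server \
  -m Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled.Q4_K_M.gguf \
  --port 8001 --alias qwen3.5-4b-opus \
  -c 32768 -n 4096 \
  --flash-attn off -ngl 20 \
  --jinja
```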
Available quantizations: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, and 16-bit.
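A rough way to choose among these is file size: for a model with ~4B parameters, size ≈ parameters × bits / 8. A back-of-envelope sketch (this ignores per-block scales and metadata, so real GGUF files run somewhat larger):

```python
PARAMS = 4e9  # ~4B parameters

def approx_size_gib(bits: int) -> float:
    """Approximate weight storage in GiB at a given bit-width."""
    return PARAMS * bits / 8 / 2**30

# Print an estimate for each offered quantization level.
for bits in (2, 3, 4, 5, 6, 8, 16):
    print(f"{bits:>2}-bit ~ {approx_size_gib(bits):.1f} GiB")
```

By this estimate the Q4 file used in the server command above is roughly 1.9 GiB of weights, which is why 4-bit and 5-bit variants are the usual sweet spot for local inference.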