Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled-GGUF

Tags: Question Answering · GGUF · qwen3_5 · llama.cpp · vision-language-model · conversational

A distilled 4B code/reasoning model in GGUF format, optimized for local inference via llama.cpp.


Quick Start

llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled-GGUF --jinja
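
llama-cli also accepts a prompt directly for non-interactive use; a minimal sketch (the prompt text and token budget are illustrative):

llama-cli -hf nphearum/Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled-GGUF --jinja \
  -p "Write a binary search function in Python." -n 512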

Server Launch

llama-server \
  -m Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled.Q4_K_M.gguf \
  --port 8001 --alias qwen3.5-4b-opus \
  -c 65536 -n 8192 \
  --temp 0.6 --top-p 0.95 --top-k 40 --repeat-penalty 1.05 \
  --flash-attn on --ctk q8_0 --ctv q8_0 \
  --jinja --chat-template-kwargs '{"enable_thinking": true}' \
  -ngl -1
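
Once running, llama-server exposes an OpenAI-compatible API on the chosen port; a minimal request sketch (prompt and field values are illustrative, the model name matches the --alias above):

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-4b-opus",
    "messages": [{"role": "user", "content": "Explain binary search."}],
    "temperature": 0.6,
    "max_tokens": 1024
  }'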

Preset Configs

Mode        Flags
Coding      --temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1
Reasoning   --temp 0.6 --top-p 0.95 --top-k 40
Low VRAM    -c 32768 -n 4096 --flash-attn off -ngl 20
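
For example, the Coding preset swapped into the server launch above (a sketch; all other flags unchanged):

llama-server \
  -m Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled.Q4_K_M.gguf \
  --port 8001 --alias qwen3.5-4b-opus \
  -c 65536 -n 8192 \
  --temp 0.4 --top-p 0.9 --top-k 50 --repeat-penalty 1.1 \
  --flash-attn on --ctk q8_0 --ctv q8_0 \
  --jinja -ngl -1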

Key Specs

  • Base: Qwen3.5 4B · Format: Q4_K_M GGUF
  • Context: 32K–64K practical · Output: up to 8K tokens
  • KV cache: 8-bit (q8_0) · GPU: full offload (-ngl -1)
  • Thinking mode: optional — improves reasoning, adds latency
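
Thinking mode is controlled by the template flag shown in the server launch above; a minimal sketch disabling it for latency-sensitive use (other flags omitted for brevity):

llama-server \
  -m Qwen3.5-4BxOpus-4.7-Code-Reasoning-Distilled.Q4_K_M.gguf \
  --port 8001 --jinja \
  --chat-template-kwargs '{"enable_thinking": false}'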

Caveats

  • Quality drops beyond 64K context
  • 4B class; sensitive to sampling parameters (see the request sketch after this list)
  • Thinking mode can produce unstable output if the chat template diverges from the one used in training; keep --jinja enabled
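
Given the sampling sensitivity, it can help to pin sampling per request instead of relying on server defaults. Recent llama-server builds accept llama.cpp-specific sampling fields alongside the standard OpenAI ones; a sketch using the Coding preset values (prompt is illustrative):

curl http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-4b-opus",
    "messages": [{"role": "user", "content": "Refactor this loop into a list comprehension."}],
    "temperature": 0.4,
    "top_p": 0.9,
    "top_k": 50,
    "repeat_penalty": 1.1
  }'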