# Bonsai 1-Bit TurboQuant
This repository bundles the Bonsai-8B GGUF checkpoint and the launcher scripts needed to run it with the TurboQuant-enabled llama.cpp fork.
TurboQuant does not change the 1-bit model weights. It changes the KV cache format at runtime, which is what reduces VRAM usage.
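To make the runtime behavior concrete, here is a hedged sketch of what the wrapper scripts presumably do under the hood: `--cache-type-k` and `--cache-type-v` are standard llama.cpp server flags, but the `tbq*` cache types only exist in the TurboQuant fork, and the exact invocation below is an assumption, not copied from the scripts.

```shell
# Hypothetical direct invocation (an assumption; the repo's scripts wrap this).
# --cache-type-k / --cache-type-v select the KV cache element format at runtime;
# the model weights in the GGUF file are untouched.
../llama.cpp-1bit-turboquant/build-tbq-cuda/bin/llama-server \
  -m models/gguf/8B/Bonsai-8B.gguf \
  --cache-type-k tbq4_0 \
  --cache-type-v tbq3_0
```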
## What is included

- `models/gguf/8B/Bonsai-8B.gguf`
- `scripts/common.sh`
- `scripts/run_llama.sh`
- `scripts/start_llama_server.sh`
- `scripts/start_openwebui.sh`
- `scripts/download_models.sh`
## Quick Start
Clone this repo and the TurboQuant llama.cpp fork side by side:
```shell
git clone https://github.com/Apothic-AI/bonsai-turboquant.git
git clone https://github.com/Apothic-AI/llama.cpp-1bit-turboquant.git
cd bonsai-turboquant
```
Build the forked llama.cpp checkout with CUDA:
```shell
cd ../llama.cpp-1bit-turboquant
cmake -S . -B build-tbq-cuda -DGGML_CUDA=ON -DLLAMA_BUILD_SERVER=ON -DLLAMA_BUILD_TOOLS=ON
cmake --build build-tbq-cuda -j
cd ../bonsai-turboquant
```
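If the build succeeded, the binaries the launcher scripts expect should be under `build-tbq-cuda/bin` (path assumed from the cmake commands above):

```shell
# Quick sanity check that the CUDA build produced the expected binaries;
# llama-server and llama-cli should appear among the listed files.
ls ../llama.cpp-1bit-turboquant/build-tbq-cuda/bin/
```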
Run Bonsai through the TurboQuant-enabled server:
```shell
BONSAI_LLAMA_BIN_DIR=../llama.cpp-1bit-turboquant/build-tbq-cuda/bin \
BONSAI_CACHE_TYPE_K=tbq4_0 \
BONSAI_CACHE_TYPE_V=tbq3_0 \
./scripts/start_llama_server.sh
```
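Once the server is up, you can smoke-test it with a plain HTTP request. llama.cpp's server exposes an OpenAI-compatible chat endpoint; port 8080 is the llama.cpp default and is an assumption here, so check the script's output for the actual address.

```shell
# Assumes the server listens on the llama.cpp default 127.0.0.1:8080;
# adjust if start_llama_server.sh configures a different host or port.
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Who are you?"}],
        "max_tokens": 64
      }'
```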
The same override works for direct prompts:
```shell
BONSAI_LLAMA_BIN_DIR=../llama.cpp-1bit-turboquant/build-tbq-cuda/bin \
BONSAI_CACHE_TYPE_K=tbq4_0 \
BONSAI_CACHE_TYPE_V=tbq3_0 \
./scripts/run_llama.sh -p "Who are you?"
```
If you want the browser UI, start `./scripts/start_openwebui.sh` after the server is running.

To fetch the 4B or 1.7B checkpoints later, use `./scripts/download_models.sh`.
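After downloading another checkpoint, `BONSAI_MODEL` selects which one the scripts load (see Notes). The exact value accepted for the 4B checkpoint is an assumption based on the naming used for `8B`:

```shell
# Assumed usage: point the launcher at the 4B checkpoint after fetching it
# with download_models.sh (BONSAI_MODEL otherwise stays at the bundled 8B).
BONSAI_MODEL=4B ./scripts/run_llama.sh -p "Who are you?"
```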
## Notes
- This repo includes the 8B GGUF only.
- Set `BONSAI_MODEL` to `8B` unless you have added additional model downloads locally.
- The `llama.cpp` fork contains the TurboQuant runtime support.