---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---

# llamita.cpp

> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that enables [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) to compile and run with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).

## Results

| Model | Size on disk | RAM used | Prompt | Generation | Board |
|-------|--------------|----------|--------|------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 tok/s | 1.1 tok/s | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 tok/s | 1.6 tok/s | Jetson Nano 4GB |

An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.

## What Was Changed

27 files modified, ~3,200 lines of patches across 7 categories:

1. **C++17 to C++14** – `if constexpr`, `std::is_same_v`, structured bindings, fold expressions
2. **CUDA 10.2 API stubs** – `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF`
3. **SM 5.3 Maxwell** – warp-size macros, MMQ params, flash attention disabled with stubs
4. **ARM NEON on GCC 8** – custom struct types for the broken `vld1q_*_x*` intrinsics
5. **Linker** – `-lstdc++fs` for `std::filesystem`
6. **Critical correctness fix** – a `binbcast.cu` fold expression that silently computed nothing
7. **Build system** – `CUDA_STANDARD 14`, flash attention template exclusion

## The Bug That Broke Everything

During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke ALL binary operations (add, multiply, subtract, divide). The model loaded, allocated memory, ran inference – and produced complete garbage. The fix was one line.

## Links

- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)

## Credits

- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) – original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) – Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) – 1-bit LLMs (Apache 2.0)