---
tags:
- llama-cpp
- jetson-nano
- cuda-10
- 1-bit
- bonsai
- edge-ai
- gguf
- nvidia
- tegra
- maxwell
- quantization
- arm64
license: mit
pipeline_tag: text-generation
---

# llamita.cpp

> Run an 8-billion-parameter 1-bit LLM in 1.1 GB on a $99 Jetson Nano.

**llamita.cpp** is a patched fork of [PrismML's llama.cpp](https://github.com/PrismML-Eng/llama.cpp) that enables [Bonsai](https://huggingface.co/collections/prism-ml/bonsai) 1-bit models (Q1_0_g128) to compile and run with **CUDA 10.2** on the **NVIDIA Jetson Nano** (SM 5.3 Maxwell, 4 GB RAM).

## Results

| Model | Size on disk | RAM used | Prompt | Generation | Board |
|-------|--------------|----------|--------|------------|-------|
| [Bonsai-8B](https://huggingface.co/prism-ml/Bonsai-8B-gguf) | 1.1 GB | 2.5 GB | 2.1 tok/s | 1.1 tok/s | Jetson Nano 4GB |
| [Bonsai-4B](https://huggingface.co/prism-ml/Bonsai-4B-gguf) | 546 MB | ~1.5 GB | 3.6 tok/s | 1.6 tok/s | Jetson Nano 4GB |

An 8-billion-parameter model running on a board with 128 CUDA cores and 4 GB of shared RAM.

## What Was Changed

27 files modified, ~3,200 lines of patches across 7 categories:

1. **C++17 to C++14** – `if constexpr`, `std::is_same_v`, structured bindings, fold expressions
2. **CUDA 10.2 API stubs** – `nv_bfloat16` type stub, `cooperative_groups/reduce.h`, `CUDA_R_16BF`
3. **SM 5.3 Maxwell** – warp-size macros, MMQ params, flash attention disabled with stubs
4. **ARM NEON on GCC 8** – custom struct types for the broken `vld1q_*_x*` intrinsics
5. **Linker** – `-lstdc++fs` for `std::filesystem`
6. **Critical correctness fix** – a `binbcast.cu` fold expression that silently computed nothing
7. **Build system** – `CUDA_STANDARD 14`, flash attention template exclusion

## The Bug That Broke Everything

During the C++14 port, a fold expression in `binbcast.cu` was replaced with `(void)0`. This silently broke ALL binary operations (add, multiply, subtract, divide). The model loaded, allocated memory, ran inference – and produced complete garbage. The fix was one line.

## Links

- **GitHub**: [coverblew/llamita.cpp](https://github.com/coverblew/llamita.cpp)
- **Blog post**: [An 8B Model on a $99 Board](https://coverblew.github.io/llamita.cpp/)
- **Patch documentation**: [PATCHES.md](https://github.com/coverblew/llamita.cpp/blob/main/PATCHES.md)
- **Build guide**: [BUILD-JETSON.md](https://github.com/coverblew/llamita.cpp/blob/main/BUILD-JETSON.md)
- **Benchmarks**: [jetson-nano-4gb.md](https://github.com/coverblew/llamita.cpp/blob/main/benchmarks/jetson-nano-4gb.md)

## Credits

- [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp) – original llama.cpp (MIT)
- [PrismML-Eng/llama.cpp](https://github.com/PrismML-Eng/llama.cpp) – Q1_0_g128 support (MIT)
- [PrismML Bonsai models](https://huggingface.co/collections/prism-ml/bonsai) – 1-bit LLMs (Apache 2.0)