drdraq committed · Commit 40d9f20 · verified · 1 Parent(s): b2e6314

Upload README.md with huggingface_hub

Files changed (1): README.md added (+281 lines)

---
license: apache-2.0
base_model: Qwen/Qwen3.5-0.8B
language:
- en
- zh
- multilingual
library_name: rust
tags:
- text-generation
- image-text-to-text
- video-text-to-text
- multimodal
- vision
- rust
- pure-rust
- no-python
- quantized
- deltanet
- hybrid-attention
- mobile
pipeline_tag: image-text-to-text
model-index:
- name: QORA-0.8B
  results: []
---

# QORA-0.8B

Pure Rust multimodal inference engine based on [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B). No Python, no CUDA, no external ML frameworks. A single executable plus the model weights gives you portable AI that runs on any machine.

**Designed for mobile and edge devices**: the model file is only 600 MB, loads in under 1 second, and runs at ~4 tok/s on a standard CPU. **Smart system awareness**: the engine detects your hardware (RAM, CPU threads) on Windows, Linux, and macOS and automatically adjusts generation parameters so the model runs well even on constrained systems.

## License

This project is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0). The base model, [Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B), is released by the Qwen team under Apache 2.0.

## What It Does

QORA-0.8B is a 0.8-billion-parameter language model with built-in vision. It can handle:

- **Text generation**: answer questions, write code, summarize text
- **Image understanding**: describe photos, answer questions about images
- **Video understanding**: analyze frame sequences, describe motion and temporal changes
- **Thinking mode**: chain-of-thought reasoning with a configurable token budget

## Architecture

QORA-0.8B uses a hybrid architecture combining two attention mechanisms:

| Component | Details |
|-----------|---------|
| **Parameters** | 0.8B total |
| **Hidden dim** | 1024 |
| **Layers** | 24 (18 DeltaNet + 6 Full Attention) |
| **Layer pattern** | 3x DeltaNet + 1x Full Attention, repeated 6 times |
| **Vocabulary** | 248,320 tokens |
| **Context** | 262K tokens natively |

### DeltaNet Layers (18 of 24)
- Gated linear attention with delta-rule state updates
- 16 QK heads + 16 V heads, head_dim=128
- Causal Conv1d (kernel=4) + SiLU activation
- O(1) memory per token (recurrent state, no KV cache needed)

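The delta-rule update can be sketched in plain Rust. This is a simplified single-head illustration of the general technique, not the QORA kernel: the function name `delta_step` and the exact gating convention (decay applied before the write) are our assumptions.

```rust
/// One gated delta-rule step for a single head.
/// `state` is a d_k x d_v matrix (row-major); q/k have length d_k, v length d_v.
/// `decay` is the per-token gate, `beta` the write strength.
/// The state has fixed size, which is why memory stays O(1) per token.
fn delta_step(state: &mut [f32], q: &[f32], k: &[f32], v: &[f32], decay: f32, beta: f32) -> Vec<f32> {
    let (dk, dv) = (k.len(), v.len());
    // Retrieve the value currently bound to k: v_old = S^T k
    let mut v_old = vec![0.0f32; dv];
    for i in 0..dk {
        for j in 0..dv {
            v_old[j] += state[i * dv + j] * k[i];
        }
    }
    // Decay the state, then write the correction (v - v_old) outer k
    for i in 0..dk {
        for j in 0..dv {
            state[i * dv + j] = decay * state[i * dv + j] + beta * k[i] * (v[j] - v_old[j]);
        }
    }
    // Read the updated state with the query
    let mut out = vec![0.0f32; dv];
    for i in 0..dk {
        for j in 0..dv {
            out[j] += state[i * dv + j] * q[i];
        }
    }
    out
}
```
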
### Full Attention Layers (6 of 24)
- Grouped Query Attention (8 Q / 2 KV heads), head_dim=256
- QK-norm + partial RoPE (64 of 256 dims rotated), theta=10M
- Output gating (sigmoid gate on the attention output)
- Standard KV cache

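Partial RoPE means only the first 64 of the 256 head dims get rotated; the rest pass through unchanged. A minimal sketch of the idea, where the adjacent-pair rotation layout and frequency schedule are assumptions rather than the exact QORA convention:

```rust
/// Rotate only the first `rot` dims of a head vector in adjacent pairs,
/// leaving dims rot.. untouched. `theta` mirrors the 10M base above.
fn partial_rope(x: &mut [f32], pos: usize, rot: usize, theta: f32) {
    for i in (0..rot).step_by(2) {
        // Per-pair frequency, decreasing across the rotated dims
        let freq = theta.powf(-(i as f32) / rot as f32);
        let (sin, cos) = (pos as f32 * freq).sin_cos();
        let (a, b) = (x[i], x[i + 1]);
        x[i] = a * cos - b * sin;
        x[i + 1] = a * sin + b * cos;
    }
}
```
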
### Vision Encoder
- 12-layer ViT, hidden=768, 12 heads
- Conv3d patch embedding [768, 3, 2, 16, 16] (temporal_patch_size=2)
- Learned positional embedding with bilinear interpolation from a 48x48 grid
- 2D spatial RoPE (dim=32, theta=10000)
- 2x2 spatial merger: LayerNorm → concat → MLP(3072 → 1024)
- **Images**: a single frame duplicated along the temporal axis
- **Video**: actual Conv3d over consecutive frame pairs (N frames → N/2 temporal patches)

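The 2x2 spatial merger concatenates each 2x2 neighborhood of patch features (hidden=768) into one 3072-dim vector, which the merger MLP then projects to 1024. A shape-level sketch; the traversal order within each block is an assumption:

```rust
/// Merge a w x h grid of patch features into (w/2)*(h/2) concatenated
/// vectors, each 4x the original feature dim. Grid dims must be even.
fn merge_2x2(feats: &[Vec<f32>], w: usize, h: usize) -> Vec<Vec<f32>> {
    assert!(w % 2 == 0 && h % 2 == 0);
    let mut out = Vec::with_capacity(w * h / 4);
    for y in (0..h).step_by(2) {
        for x in (0..w).step_by(2) {
            let mut v = Vec::with_capacity(4 * feats[0].len());
            // Row-major order within the 2x2 block (assumed)
            for (dy, dx) in [(0, 0), (0, 1), (1, 0), (1, 1)] {
                v.extend_from_slice(&feats[(y + dy) * w + (x + dx)]);
            }
            out.push(v);
        }
    }
    out
}
```
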
## Smart System Awareness

QORA-0.8B detects your system at startup and automatically adjusts generation limits:

```
QORA-0.8B - Pure Rust Multimodal Inference Engine
System: 16384 MB RAM (9856 MB free), 12 threads
```

| Available RAM | Think Budget | Max Tokens | Behavior |
|---------------|--------------|------------|----------|
| < 4 GB | 128 (cap 256) | 256 (cap 512) | Minimal generation, warning displayed |
| 4-8 GB | 256 (cap 1024) | 512 (cap 1024) | Constrained, warning displayed |
| 8-12 GB | 1024 (cap 2048) | 1024 (cap 2048) | Normal operation |
| >= 12 GB | 2048 (cap 8192) | 2048 (cap 8192) | Full capability |

**Hard caps apply even to explicit user values**: if you pass `--max-tokens 5000` on a system with 6 GB of free RAM, the value is clamped to 1024 automatically. This prevents the model from running for too long on weak systems.

Supports **Windows** (wmic), **Linux** (/proc/meminfo), and **macOS** (sysctl/vm_stat).
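
The clamping behavior in the table can be expressed directly. A sketch of the `--max-tokens` tiers; function names are illustrative and the real `system.rs` logic may differ in detail:

```rust
/// Map free RAM (MB) to (default, hard cap) for max_tokens,
/// following the tiers in the table above.
fn max_tokens_limits(free_mb: u64) -> (u32, u32) {
    match free_mb {
        0..=4095 => (256, 512),        // < 4 GB
        4096..=8191 => (512, 1024),    // 4-8 GB
        8192..=12287 => (1024, 2048),  // 8-12 GB
        _ => (2048, 8192),             // >= 12 GB
    }
}

/// Clamp any user-supplied value to the hard cap for this tier.
fn effective_max_tokens(free_mb: u64, user: Option<u32>) -> u32 {
    let (default, cap) = max_tokens_limits(free_mb);
    user.unwrap_or(default).min(cap)
}
```

With 6 GB free, `effective_max_tokens(6 * 1024, Some(5000))` yields 1024, matching the example above.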

## Weight Format

| Format | Size | Quality | Speed |
|--------|------|---------|-------|
| **Q4** (default) | ~600 MB | Good | ~3.9 tok/s |

Q4 uses 4-bit symmetric quantization with group_size=32 and LUT-optimized dequantization. Large matrices use multi-threaded GEMV/GEMM via rayon.
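
The quantization math is easy to state. A sketch of 4-bit symmetric quantization with group_size=32; the scale storage and nibble packing of the real .qor08b layout are not shown, and the function names are ours:

```rust
const GROUP: usize = 32;

/// Quantize one group of 32 f32 weights to 4-bit symmetric ints
/// (range -8..=7) plus an f32 scale.
fn quantize_group(w: &[f32; GROUP]) -> (f32, [i8; GROUP]) {
    let amax = w.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if amax == 0.0 { 1.0 } else { amax / 7.0 };
    let mut q = [0i8; GROUP];
    for (qi, &x) in q.iter_mut().zip(w) {
        *qi = (x / scale).round().clamp(-8.0, 7.0) as i8;
    }
    (scale, q)
}

/// Dequantize: multiply each int by the group scale. In the real
/// engine this multiply is replaced by a 16-entry LUT lookup.
fn dequantize_group(scale: f32, q: &[i8; GROUP]) -> [f32; GROUP] {
    let mut w = [0.0f32; GROUP];
    for (wi, &qi) in w.iter_mut().zip(q) {
        *wi = qi as f32 * scale;
    }
    w
}
```
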

### AVX-512 SIMD Acceleration

On CPUs with AVX-512 support (Intel 11th gen+, AMD Zen 4+), QORA-0.8B automatically uses hand-written AVX-512 SIMD kernels for a significant CPU speedup:

| Kernel | Technique | Speedup |
|--------|-----------|---------|
| **Q4 GEMV** | `permutexvar_ps` 16-entry LUT lookup, nibble extract via `cvtepu8_epi32` | ~2.5x |
| **F16 GEMV** | `cvtph_ps` f16 → f32 conversion + `fmadd_ps` FMA accumulation | ~2.5x |
| **DeltaNet state** | Vectorized decay/retrieve/delta/output over 128-dim heads | ~3x |
| **Fused gate+up** | Parallel gate & up SIMD LUT decode in the MLP | ~2.5x |

Detection is automatic at runtime: the engine falls back to scalar code on non-AVX-512 CPUs with zero overhead.

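Runtime dispatch of this kind is standard Rust via `is_x86_feature_detected!`. A sketch of the fallback pattern; the scalar body stands in for both paths here, since the real SIMD kernels live in `simd.rs`:

```rust
/// True when the CPU supports AVX-512F at runtime; always false on
/// non-x86_64 targets, where the detection macro does not exist.
fn use_avx512() -> bool {
    #[cfg(target_arch = "x86_64")]
    return is_x86_feature_detected!("avx512f");
    #[cfg(not(target_arch = "x86_64"))]
    return false;
}

/// Dispatch point: the real engine swaps in an AVX-512 kernel here.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    let scalar = |a: &[f32], b: &[f32]| a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>();
    if use_avx512() {
        scalar(a, b) // placeholder for the SIMD kernel
    } else {
        scalar(a, b)
    }
}
```
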
The model is small enough that Q4 is the only format needed: it loads in under 1 second and uses minimal RAM.

## Platform Support

| Platform | Binary | Status |
|----------|--------|--------|
| **Windows x86_64** | `qor08b.exe` | Tested |
| **Linux x86_64** | `qor08b` | Supported |
| **macOS aarch64** | `qor08b` | Supported |

CPU-only by design: the 0.8B model is small and fast enough that a GPU is not needed. Pre-built binaries are available on the [Releases](https://github.com/qora-protocol/QPRA-LLM-0.8B/releases) page.


## Quick Start

1. Download from the [Releases](https://github.com/qora-protocol/QPRA-LLM-0.8B/releases) page:
   - `model.qor08b` (600 MB)
   - `tokenizer.json` (13 MB)
   - `qor08b.exe` (Windows) or build from source (Linux/macOS)
2. Run:

```bash
# Text generation
qor08b --prompt "Explain quantum computing" --max-tokens 500

# Image understanding
qor08b --prompt "What's in this image?" --image photo.jpg

# Video understanding (directory of frame images)
qor08b --prompt "What happens in this video?" --video frames_dir/

# Thinking mode (default, extended reasoning)
qor08b --prompt "What is the capital of France?" --think-budget 512

# No-think mode (faster, direct answers)
qor08b --prompt "What is 2+2?" --no-think

# Greedy decoding (deterministic output)
qor08b --prompt "Hello" --greedy
```

### CLI Flags

| Flag | Description |
|------|-------------|
| `--prompt TEXT` | Input prompt (default: "Hello, how are you?") |
| `--image PATH` | Path to an image file (PNG/JPG) |
| `--video PATH` | Path to a directory of frame images (PNG/JPG, sorted by name) |
| `--max-tokens N` | Max tokens to generate (default: 1024) |
| `--think-budget N` | Max thinking tokens before forcing an answer (default: 1024) |
| `--no-think` | Disable thinking mode (direct answers) |
| `--show-think` | Display thinking tokens on stderr |
| `--greedy` | Greedy decoding (temperature=0; not recommended with thinking mode) |

### Sampling Defaults

| Parameter | Think mode | No-think mode |
|-----------|-----------|---------------|
| temperature | 1.0 | 0.7 |
| top_k | 20 | 20 |
| top_p | 0.95 | 0.95 |
| presence_penalty | 1.5 | 1.5 |

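These defaults correspond to the usual top-k / top-p filtering pipeline. A sketch of the filtering step only; this is illustrative, and the real sampler also applies temperature and presence_penalty before it:

```rust
/// Keep the `top_k` highest logits, then the smallest prefix of those
/// whose softmax mass reaches `top_p`; returns surviving token indices.
fn filter_top_k_top_p(logits: &[f32], top_k: usize, top_p: f32) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..logits.len()).collect();
    idx.sort_by(|&a, &b| logits[b].partial_cmp(&logits[a]).unwrap());
    idx.truncate(top_k);
    // Softmax over the surviving set (max-subtracted for stability)
    let max = logits[idx[0]];
    let exps: Vec<f32> = idx.iter().map(|&i| (logits[i] - max).exp()).collect();
    let z: f32 = exps.iter().sum();
    let mut kept = Vec::new();
    let mut mass = 0.0;
    for (pos, &i) in idx.iter().enumerate() {
        kept.push(i);
        mass += exps[pos] / z;
        if mass >= top_p {
            break;
        }
    }
    kept
}
```
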
### Video Input

Video is provided as a directory of frame images (not a video file). Extract frames however you like:

```bash
# Example: extract 4 frames from a video with ffmpeg
ffmpeg -i video.mp4 -vf "select=not(mod(n\,30))" -frames:v 4 frames/frame_%02d.png

# Then run
qor08b --prompt "Describe what happens" --video frames/
```

Frames are loaded in alphabetical order, resized to uniform dimensions (max 768 px, divisible by 32), and processed as temporal pairs via Conv3d. Odd frame counts are padded by duplicating the last frame.

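The pairing and padding rule is simple to state. A sketch, generic over the frame type; the function name is ours:

```rust
/// Pad an odd-length frame list by duplicating the last frame, then
/// group frames into temporal pairs for the Conv3d patch embedding.
fn temporal_pairs<T: Clone>(mut frames: Vec<T>) -> Vec<(T, T)> {
    if frames.len() % 2 == 1 {
        let last = frames.last().expect("at least one frame").clone();
        frames.push(last);
    }
    frames.chunks(2).map(|p| (p[0].clone(), p[1].clone())).collect()
}
```
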
## Building from Source

```bash
# All platforms (CPU-only, no GPU needed)
cargo build --release
```

### Dependencies

- **Language**: Pure Rust (2024 edition)
- `cortex`: Rust deep learning framework (used for binary format types only)
- `rayon`: thread pool for parallel GEMV and attention
- `half`: F16 support
- `image`: image loading (PNG/JPG)
- `tokenizers`: Hugging Face tokenizer
- `memmap2`: memory-mapped I/O for the converter
- `serde_json`: config parsing
- **No ML framework** for inference: all matrix ops are hand-written Rust

### Cross-Platform Releases

Pre-built binaries are automatically built via GitHub Actions for:
- **Windows x86_64**
- **Linux x86_64**
- **macOS aarch64**

Create a git tag (e.g. `v0.1.0`) and push it to trigger a release build.

## File Structure

```
src/
  main.rs       – CLI entry point, argument parsing
  config.rs     – model architecture configuration
  gemv.rs       – GEMV/GEMM kernels (F16 + Q4), hybrid forward pass, prefill
  simd.rs       – AVX-512 SIMD kernels (Q4/F16 GEMV, DeltaNet, fused MLP)
  generate.rs   – text generation loop (text, image, video modes)
  tokenizer.rs  – tokenizer wrapper and chat templates
  vision.rs     – vision encoder (ViT + merger), image/video loading
  save.rs       – binary model format (.qor08b) save/load
  convert.rs    – one-time safetensors → .qor08b converter
  system.rs     – system awareness (RAM detection, smart limits)
  lib.rs        – module exports
```

## Model Binary Format (.qor08b)

Custom binary format for fast loading:

```
Header:  "QR08" magic + version (u32) + format (u8: 0=F16, 1=Q4)
Config:  architecture params (vocab, hidden, layers, heads, etc.)
Layers:  24 layers, each with a type byte + layer-specific weights
Global:  embedding + final norm + precomputed RoPE tables
Vision:  Conv3d patch embed + pos_embed + 12 ViT blocks + merger MLP
```

Loading takes ~500 ms for the Q4 model (~600 MB) via buffered sequential reads.

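The header layout described above can be parsed in a few lines. A sketch; the little-endian field encoding is an assumption not stated in the format description:

```rust
use std::io::{self, Read};

/// Parse the .qor08b header: "QR08" magic, a u32 version, and a
/// format byte (0 = F16, 1 = Q4). Returns (version, format).
fn read_header<R: Read>(r: &mut R) -> io::Result<(u32, u8)> {
    let mut magic = [0u8; 4];
    r.read_exact(&mut magic)?;
    if &magic != b"QR08" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad magic"));
    }
    let mut version = [0u8; 4];
    r.read_exact(&mut version)?;
    let mut format = [0u8; 1];
    r.read_exact(&mut format)?;
    Ok((u32::from_le_bytes(version), format[0]))
}
```
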
## Performance

Tested on an i5-11500 (6C/12T) with 16 GB RAM:

| Task | Speed |
|------|-------|
| Text decode | ~3.9 tok/s |
| Text prefill | ~13 tok/s (batched DeltaNet) |
| Model load | ~500 ms (Q4, 600 MB) |
| RAM usage | ~791 MB |

CPU-only by design: the 0.8B model is small enough that CPU inference is fast and efficient, making it ideal for mobile and edge deployment without GPU dependencies. Batched DeltaNet prefill processes all GEMM projections in parallel across tokens, with only the lightweight Conv1d and recurrent state update running sequentially.

## Comparison with QORA-4B

| | QORA-0.8B | QORA-4B |
|---|-----------|---------|
| Parameters | 0.8B | 4B |
| Model size (Q4) | 600 MB | 2.9 GB |
| Load time | ~500 ms | ~30 s |
| Decode speed | ~3.9 tok/s | ~3.3 tok/s (GPU), ~1.3 tok/s (CPU) |
| RAM usage | ~791 MB | ~3.5 GB |
| GPU support | No (not needed) | Yes (Vulkan) |
| Vision | 12-layer ViT (768) | 24-layer ViT (1024) |
| Best for | Mobile, edge, quick tasks | Desktop, complex reasoning |