---
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- ru
- ar
- hi
- ko
- zh
library_name: transformers
base_model:
- arcee-ai/Trinity-Mini
base_model_relation: quantized
tags:
- moe
- nvfp4
- modelopt
- blackwell
- vllm
---
<div align="center">
<picture>
<img
src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/i-v1KyAMOW_mgVGeic9WJ.png"
alt="Arcee Trinity Mini"
style="max-width: 100%; height: auto;"
>
</picture>
</div>

# Trinity Mini NVFP4

**This repository contains the NVFP4 quantized weights of Trinity-Mini for deployment on NVIDIA Blackwell GPUs.**

Trinity Mini is an Arcee AI 26B MoE model with 3B active parameters. It is the medium-sized model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.

This model is tuned for reasoning, but in testing it uses a total token count similar to that of competitive instruction-tuned models.

***

Trinity Mini was trained on 10T tokens gathered and curated through a key partnership with [Datology](https://www.datologyai.com/), building upon the excellent dataset we used for [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) with additional math and code.

Training was performed on a cluster of 512 H200 GPUs powered by [Prime Intellect](https://www.primeintellect.ai/) using HSDP parallelism.

More details, including key architecture decisions, can be found on our blog [here](https://www.arcee.ai/blog/the-trinity-manifesto).

***

## Model Details

* **Model Architecture:** AfmoeForCausalLM
* **Parameters:** 26B total, 3B active
* **Experts:** 128 total, 8 active, 1 shared
* **Context length:** 128k
* **Training Tokens:** 10T
* **License:** [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Mini#license)
* **Recommended settings:**
  * temperature: 0.15
  * top_k: 50
  * top_p: 0.75
  * min_p: 0.06
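The recommended settings above can be supplied per request through vLLM's OpenAI-compatible API. A minimal sketch of building such a request body (the model name assumes you serve this repo under its default name; `top_k` and `min_p` are vLLM extensions to the OpenAI schema):

```python
import json

# Recommended sampling settings from this model card.
SAMPLING = {
    "temperature": 0.15,
    "top_k": 50,
    "top_p": 0.75,
    "min_p": 0.06,
}

def build_request(prompt: str, model: str = "arcee-ai/Trinity-Mini-NVFP4") -> str:
    """Build a /v1/chat/completions request body as a JSON string."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING,
    }
    return json.dumps(payload)

body = build_request("Explain NVFP4 in one sentence.")
```

POST the body to `http://localhost:8000/v1/chat/completions` once a server from the commands below is running; vLLM accepts `top_k` and `min_p` as extra sampling parameters in the request body.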
***

## Benchmarks

![](https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/UMV0OZh_H1JfvgzBTXh6u.png)

<div align="center">
<picture>
<img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
</picture>
</div>

## Quantization Details

- **Scheme:** NVFP4 (`nvfp4_mlp_only` — MLP/expert weights only; attention remains BF16)
- **Tool:** [NVIDIA ModelOpt](https://github.com/NVIDIA/Model-Optimizer)
- **Calibration:** 512 samples, seq_length=2048, all-expert calibration enabled
- **KV cache:** not quantized
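A minimal sketch of what the MLP-only scheme implies for individual modules. The name-matching rules below are illustrative only (they are not ModelOpt's actual config format), using common transformer module-name substrings:

```python
# Illustrative only: which weights an MLP-only NVFP4 scheme would quantize,
# keyed on common transformer module-name substrings.
MLP_KEYS = ("mlp", "experts", "gate_proj", "up_proj", "down_proj")
ATTN_KEYS = ("attn", "q_proj", "k_proj", "v_proj", "o_proj")

def weight_dtype(module_name: str) -> str:
    """Return the storage format a module's weights would use."""
    name = module_name.lower()
    if any(k in name for k in ATTN_KEYS):
        return "bf16"   # attention is left unquantized
    if any(k in name for k in MLP_KEYS):
        return "nvfp4"  # MLP/expert weights are quantized
    return "bf16"       # embeddings, norms, lm_head, etc.
```

For example, `weight_dtype("model.layers.0.self_attn.q_proj")` stays BF16 while an expert's `down_proj` is NVFP4, matching the scheme description above.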
## Running with vLLM

Requires [vLLM](https://github.com/vllm-project/vllm) >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.

### Blackwell GPUs (B200/B300/GB300) — Docker (recommended)

```bash
docker run --runtime nvidia --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:v0.18.0-cu130 \
  arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

### Hopper GPUs (H100/H200) and others

```bash
vllm serve arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
```

**Note (Blackwell pip installs):** If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:

```bash
export VLLM_NVFP4_GEMM_BACKEND=marlin

vllm serve arcee-ai/Trinity-Mini-NVFP4 \
  --trust-remote-code \
  --moe-backend marlin \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --host 0.0.0.0 \
  --port 8000
```

Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not the native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
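The compression figure can be sanity-checked from NVFP4's storage layout. Assuming 4-bit values with one FP8 scale per 16-element block (the usual NVFP4 layout; per-tensor scales are negligible), each quantized weight costs about 4.5 bits versus 16 for BF16:

```python
# Back-of-the-envelope NVFP4 storage cost. Assumes one FP8 scale per
# 16-element block; per-tensor scale factors are ignored as negligible.
BITS_FP4 = 4
BLOCK_SIZE = 16
SCALE_BITS = 8  # FP8 block scale

bits_per_weight = BITS_FP4 + SCALE_BITS / BLOCK_SIZE  # 4.5 bits
ratio = 16 / bits_per_weight                          # vs 16-bit BF16
print(f"{bits_per_weight=} {ratio=:.2f}")             # ~3.6x per quantized tensor
```

This simple model gives roughly 3.6× per quantized tensor; the whole-checkpoint figure depends on the unquantized attention and embedding weights, so the quoted ~3.7× reflects a different accounting of the full model.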

## License

Trinity-Mini-NVFP4 is released under the Apache-2.0 license.