---
license: mit
language:
- ru
- en
pipeline_tag: text-generation
tags:
- instruct
- moe
- multilingual
- fp8
- long-context
- tool-use
---

# GigaChat 3.1 Ultra
GigaChat 3.1 Ultra is the flagship instruct model of the GigaChat family. It is a large-scale Mixture-of-Experts (MoE) model with 702B total parameters and 36B active parameters, designed for multilingual assistant workloads, reasoning, code, tool use, and large-cluster deployment.
This version is intended for high-performance inference in FP8; the BF16 version of the model is GigaChat3.1-702B-A36B-bf16.
A GGUF version is also available: GigaChat3.1-702B-A36B-GGUF.
More details can be found in the Habr article.
## Model architecture
GigaChat 3.1 Ultra uses a custom MoE architecture with the following key components.
### Mixture-of-Experts (MoE)
The model has 702B total parameters with 36B active parameters at inference time. This allows it to scale model capacity aggressively while keeping the active compute budget much lower than that of an equally large dense model.
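As an illustration of why only a small fraction of the parameters does work for any given token, here is a generic top-k routed MoE layer in PyTorch. This is a minimal sketch with invented dimensions and expert counts, not GigaChat's actual router.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic top-k routed MoE feed-forward layer (toy sizes, not GigaChat's)."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):           # only top_k of n_experts run per token
            for e in idx[:, slot].unique():
                rows = idx[:, slot] == e
                out[rows] += weights[rows, slot].unsqueeze(-1) * self.experts[int(e)](x[rows])
        return out
```

In this toy setup with 64 experts and top_k=4, only a small slice of the expert parameters participates in each token's forward pass; the same effect is what keeps the active compute at 36B parameters out of 702B total.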
### Multi-head Latent Attention (MLA)
Instead of standard multi-head attention, the model uses MLA, which compresses the KV cache into a latent representation. This reduces memory usage and improves inference throughput, especially in long-context settings.
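The following toy sketch shows the caching idea only; it is not the model's actual MLA code, and the dimensions are invented. Instead of storing full per-head keys and values, the layer caches a small latent per token and re-expands it when attention is computed.

```python
import torch
import torch.nn as nn

class LatentKVCache(nn.Module):
    """Toy latent KV compression: cache a small latent instead of full K/V."""

    def __init__(self, d_model=1024, d_latent=128, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.down = nn.Linear(d_model, d_latent, bias=False)   # compress to latent
        self.up_k = nn.Linear(d_latent, d_model, bias=False)   # expand latent to keys
        self.up_v = nn.Linear(d_latent, d_model, bias=False)   # expand latent to values

    def forward(self, hidden, cache=None):
        # hidden: (batch, new_tokens, d_model); cache: (batch, past_tokens, d_latent)
        latent = self.down(hidden)
        cache = latent if cache is None else torch.cat([cache, latent], dim=1)
        b, t, _ = cache.shape
        k = self.up_k(cache).view(b, t, self.n_heads, self.d_head)
        v = self.up_v(cache).view(b, t, self.n_heads, self.d_head)
        return k, v, cache   # only `cache` (d_latent floats per token) persists between steps
```

In the toy numbers above, the per-token cache shrinks from 2 * d_model floats (full K and V) to d_latent floats, which is where the long-context memory savings come from.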
### Multi-Token Prediction (MTP)
The model is trained with MTP, which allows it to predict multiple tokens per forward pass. In production systems, this can be used with speculative or parallel decoding techniques to improve throughput.
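A heavily simplified sketch of the training-time idea (module names and the loss weighting are assumptions for illustration, not the actual recipe): alongside the usual next-token head, an extra head is trained to predict the token two positions ahead from the same hidden states.

```python
import torch.nn as nn
import torch.nn.functional as F

class MTPHeads(nn.Module):
    """Toy multi-token prediction: main head predicts t+1, extra head predicts t+2."""

    def __init__(self, d_model=1024, vocab_size=32000):
        super().__init__()
        self.main_head = nn.Linear(d_model, vocab_size, bias=False)
        self.mtp_head = nn.Linear(d_model, vocab_size, bias=False)

    def loss(self, hidden, targets):
        # hidden: (batch, seq, d_model); targets: (batch, seq) token ids
        logits_next = self.main_head(hidden[:, :-1])   # predict token t+1
        logits_skip = self.mtp_head(hidden[:, :-2])    # predict token t+2
        loss_next = F.cross_entropy(logits_next.flatten(0, 1), targets[:, 1:].flatten())
        loss_skip = F.cross_entropy(logits_skip.flatten(0, 1), targets[:, 2:].flatten())
        return loss_next + 0.5 * loss_skip             # 0.5 is an arbitrary illustrative weight
```

At inference time, predictions of this kind can serve as draft tokens for speculative decoding, which is the sort of setup the speculative-decoding flags in the deployment example below exercise.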
## Training data
The base GigaChat 3 training corpus spans 10 languages and includes books, academic material, code datasets, and mathematics datasets. All data goes through deduplication, language filtering, and automatic quality checks based on heuristics and classifiers.
Synthetic data remains a major contributor to quality. Across the broader training corpus, we used approximately 5.5 trillion synthetic tokens, including:
- question-answer data generated from source texts,
- reverse-prompt chains for structured data generation,
- model-authored notes embedded inside texts,
- millions of synthetic tasks with solutions in mathematics and olympiad-style programming,
- synthetic tests for code and reasoning tasks.
For the 3.1 release, we made major data improvements:
- Hard-domain expansion at Stage 1.5: stronger coverage of mathematics, finance, physics, engineering, biology, chemistry, and medicine.
- Stricter quality validation: our internal Revisor pipeline was extended with stronger checks for Markdown, LaTeX, and answer-format correctness.
- LLM-judge validation: SFT and DPO data is validated with judges selected for the task type and response structure.
- On-policy DPO data: preference pairs were generated from preview-model behavior, making them better aligned with real model failure modes.
- Better product-oriented data: we expanded data for search-and-citation scenarios, file-aware code interpretation, personalization, and agentic dialogues with executable tool calls.
- Improved answer style: we also revised formatting and writing guidelines to improve readability, correctness, and overall response quality.
## Post-training improvements
### DPO in native FP8
Unlike the preview release, GigaChat 3.1 Ultra includes a full DPO stage. This stage was redesigned for the MoE setup and trained in native FP8, not just quantized after training.
Important changes include:
- MTP heads trained during DPO for better consistency between main-model predictions and MTP predictions,
- weighted gamma with exponential decay over long sequences (one possible reading is sketched after this list),
- stronger tuning of batch size and DPO contribution,
- better robustness against loop-inducing failure modes.
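The weighted-gamma item is described only briefly above, so the sketch below is just one possible reading, with all names and the exact weighting invented for illustration: per-token log-probability contributions are down-weighted exponentially with position before the standard DPO objective is applied.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_decay(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
                        mask_chosen, mask_rejected, beta=0.1, gamma=0.999):
    """DPO-style loss with hypothetical exponentially decaying per-token weights.

    *_chosen / *_rejected: per-token log-probs, shape (batch, seq_len)
    mask_*: 1.0 on response tokens, 0.0 on prompt/padding tokens
    gamma:  decay base; tokens late in long sequences contribute less
    """
    def weighted_logp(logps, mask):
        pos = torch.arange(logps.shape[1], device=logps.device, dtype=logps.dtype)
        return (logps * mask * gamma ** pos).sum(dim=-1)

    chosen = weighted_logp(policy_chosen, mask_chosen) - weighted_logp(ref_chosen, mask_chosen)
    rejected = weighted_logp(policy_rejected, mask_rejected) - weighted_logp(ref_rejected, mask_rejected)
    # standard DPO objective applied to the weighted sequence-level log-ratios
    return -F.logsigmoid(beta * (chosen - rejected)).mean()
```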
In our experiments, native FP8 DPO not only recovered the quality that could be lost with post-training FP8 quantization, but in some cases even exceeded the BF16 result while using substantially less memory.
### Faster post-training
We also optimized the SFT pipeline with a combination of sequence packing, dynamic sequence parallelism, and additional pipeline optimizations. This reduced training cost significantly and improved GPU utilization, especially on long-context workloads.
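Of these, sequence packing is the easiest to show in isolation: several short samples are concatenated into one fixed-length row so that almost no compute is spent on padding, while segment ids let the attention mask keep samples from attending to each other. The sketch below is generic and simplified, not our training code.

```python
def pack_sequences(samples, max_len, pad_id=0):
    """Greedily pack token lists into fixed-length rows, tracking segment boundaries.

    samples: list of token-id lists; returns (rows, segment_ids), where segment_ids
    marks which original sample each position belongs to (0 = padding) so the
    attention mask can block cross-sample attention inside a packed row.
    """
    rows, segment_ids = [], []
    cur, cur_seg, seg = [], [], 0
    for tokens in samples:
        if cur and len(cur) + len(tokens) > max_len:        # row is full: flush it
            rows.append(cur + [pad_id] * (max_len - len(cur)))
            segment_ids.append(cur_seg + [0] * (max_len - len(cur_seg)))
            cur, cur_seg, seg = [], [], 0
        seg += 1
        cur.extend(tokens[:max_len])
        cur_seg.extend([seg] * min(len(tokens), max_len))
    if cur:                                                 # flush the last partial row
        rows.append(cur + [pad_id] * (max_len - len(cur)))
        segment_ids.append(cur_seg + [0] * (max_len - len(cur_seg)))
    return rows, segment_ids
```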
## Benchmark Results
| Domain | Metric | GigaChat-2-Max | GigaChat-3-Ultra-Preview | GigaChat-3.1-Ultra | DeepSeek V3-0324 | Qwen3-235B-A22B (Non-Thinking) |
|---|---|---|---|---|---|---|
| General Knowledge | MMLU RU | 0.7999 | 0.7914 | 0.8267 | 0.8392 | 0.7953 |
| General Knowledge | RUBQ | 0.7473 | 0.7634 | 0.7986 | 0.7871 | 0.6577 |
| General Knowledge | MEPA | 0.6630 | 0.6830 | 0.7130 | 0.6770 | - |
| General Knowledge | MMLU PRO | 0.6660 | 0.7280 | 0.7668 | 0.7610 | 0.7370 |
| General Knowledge | MMLU EN | 0.8600 | 0.8430 | 0.8422 | 0.8820 | 0.8610 |
| General Knowledge | BBH | 0.5070 | - | 0.7027 | - | 0.6530 |
| General Knowledge | SuperGPQA | - | 0.4120 | 0.4892 | 0.4665 | 0.4406 |
| Math | T-Math | 0.1299 | 0.1450 | 0.2961 | 0.1450 | 0.2477 |
| Math | Math 500 | 0.7160 | 0.7840 | 0.8920 | 0.8760 | 0.8600 |
| Math | AIME | 0.0833 | 0.1333 | 0.3333 | 0.2667 | 0.3500 |
| Math | GPQA Five Shot | 0.4400 | 0.4220 | 0.4597 | 0.4980 | 0.4690 |
| Coding | HumanEval | 0.8598 | 0.9024 | 0.9085 | 0.9329 | 0.9268 |
| Total | Mean | 0.6021 | 0.6115 | 0.6764 | 0.6482 | 0.6398 |
## Arena Results
| Arena | GigaChat-2-Max | GigaChat-3-Ultra-Preview | GigaChat-3.1-Ultra | DeepSeek V3-0324 |
|---|---|---|---|---|
| Arena Hard Logs V3 | 64.9 | 50.5 | 90.2 | 80.1 |
| Validator SBS Pollux | 54.4 | 40.1 | 83.3 | 74.5 |
| RU LLM Arena | 55.4 | 44.9 | 70.9 | 72.1 |
| Arena Hard RU | 61.7 | 39.0 | 82.1 | 70.7 |
| Average | 59.1 | 43.6 | 81.6 | 74.4 |
## Example evaluation setup

```bash
export HF_ALLOW_CODE_EVAL=1
# Example: launch SGLang server for a multi-node deployment
python -m sglang.launch_server \
--model-path <path_to_model> \
--host 127.0.0.1 \
--port 30000 \
--nnodes 2 \
--node-rank <0_or_1> \
--tp 16 \
--ep 16 \
--dtype auto \
--mem-fraction-static 0.7 \
--trust-remote-code \
--allow-auto-truncate \
--speculative-algorithm EAGLE \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--dist-init-addr <master_node_ip>:50000
# Example: run lm-eval on MMLU-Pro
python -m lm_eval \
--model sglang-generate \
--output_path <path_to_output> \
--batch_size 16 \
--model_args base_url=http://127.0.0.1:30000/generate,num_concurrent=16,tokenized_requests=True,max_length=131072,tokenizer=<path_to_model> \
--trust_remote_code \
--confirm_run_unsafe_code \
--num_fewshot 5 \
--tasks mmlu_pro
```
## Inference and Deployment
GigaChat 3.1 Ultra is targeted at cluster and on-premise scenarios with significant infrastructure.
Key highlights:
- support for popular inference engines (vLLM, SGLang, LMDeploy, TensorRT-LLM, etc.);
- BF16 and FP8 modes (FP8 comes as a separate build with GPU configuration recommendations);
- use of MLA and MTP to reduce the KV cache and accelerate generation;
- a proxy and gateway layer for integration with external services, tools, and agent frameworks.
For configuration, you can refer to the published guides for models of a similar scale:
- DeepSeek-V3: the "How to run locally" section in the official model card;
- Kimi-K2-Instruct: deployment recommendations (vLLM / SGLang / LMDeploy).
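As a quick smoke test of a running deployment, you can query the server directly. The snippet below assumes the SGLang instance from the evaluation setup above and uses its native /generate endpoint; the host, port, and sampling parameters are placeholders rather than recommended settings.

```python
import requests

# Query the SGLang server launched in the evaluation setup above.
# Host, port, and sampling values are placeholders, not recommendations.
response = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "Briefly introduce yourself.",
        "sampling_params": {"max_new_tokens": 256, "temperature": 0.3},
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["text"])
```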