Performance report for Q6_K on RTX 3090 and RTX 4090D
#6
by SlavikF - opened
RTX 3090, 24 GB VRAM
With the full 262144-token context allocated it uses 17 GB of VRAM.
PP starts at ~4000 t/s on a short prompt (~1100 tokens), then averages ~250 t/s on long context (192000 tokens).
TG:
- 190 t/s on short context (~1100 tokens)
- 48 t/s on long context (192000 tokens)
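These throughput figures can be cross-checked against the raw timings in the log: tokens divided by elapsed seconds. A quick sketch:

```python
def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Convert a llama.cpp timing (elapsed milliseconds, token count) into tokens/s."""
    return n_tokens / (elapsed_ms / 1000.0)

# Long-context prompt eval on the RTX 3090: 192000 tokens in 765917.20 ms.
print(round(tokens_per_second(765917.20, 192000), 2))  # → 250.68
# Long-context generation: 663 tokens in 13802.61 ms.
print(round(tokens_per_second(13802.61, 663), 2))      # → 48.03
```

Both values match the "tokens per second" column llama.cpp prints itself.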
Log:
```
[51961] Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
[51961] load_backend: loaded CUDA backend from /app/libggml-cuda.so
[51961] load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so
[51961] build: 8576 (afe65aa28) with GNU 11.4.0 for Linux x86_64
[51961] system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
[51961] system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
[51961] srv load_model: loading model '/root/.cache/huggingface/hub/models--ai-sage--GigaChat3.1-10B-A1.8B-GGUF/snapshots/97045b260251cfa86f5ad25638fa2dd074153446/GigaChat3.1-10B-A1.8B-q6_K.gguf'
[51961] print_info: file size = 8.17 GiB (6.58 BPW)
[51961] load_tensors: offloaded 27/27 layers to GPU
[51961] llama_context: n_seq_max = 4
[51961] llama_context: n_ctx = 262144
[51961] llama_context: n_ctx_seq = 262144
[51961] llama_context: n_batch = 2048
[51961] llama_context: n_ubatch = 512
[51961] sched_reserve: Flash Attention was auto, set to enabled
[51961] sched_reserve: resolving fused Gated Delta Net support:
[51961] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[51961] sched_reserve: fused Gated Delta Net (chunked) enabled
# long context:
[51961] prompt eval time = 765917.20 ms / 192000 tokens ( 3.99 ms per token, 250.68 tokens per second)
[51961] eval time = 13802.61 ms / 663 tokens ( 20.82 ms per token, 48.03 tokens per second)
# short context:
[51961] prompt eval time = 292.62 ms / 1176 tokens ( 0.25 ms per token, 4018.91 tokens per second)
[51961] eval time = 2026.23 ms / 384 tokens ( 5.28 ms per token, 189.51 tokens per second)
```
On the RTX 4090D:
```
# long context:
prompt eval time = 314152.47 ms / 192000 tokens ( 1.64 ms per token, 611.17 tokens per second)
eval time = 6503.02 ms / 555 tokens ( 11.72 ms per token, 85.34 tokens per second)
# short context:
prompt eval time = 326.13 ms / 2267 tokens ( 0.14 ms per token, 6951.24 tokens per second)
eval time = 2112.52 ms / 505 tokens ( 4.18 ms per token, 239.05 tokens per second)
```
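For anyone collecting numbers like these across runs, the timing lines can be scraped straight from the server log. A minimal sketch (the regex is written against the llama.cpp timing-line format shown above, nothing more):

```python
import re

# Matches llama.cpp timing lines such as:
#   prompt eval time = 765917.20 ms / 192000 tokens ( 3.99 ms per token, 250.68 tokens per second)
TIMING_RE = re.compile(
    r"(prompt eval|eval) time\s*=\s*([\d.]+) ms / (\d+) tokens.*?([\d.]+) tokens per second"
)

def parse_timing(line: str):
    """Extract phase, elapsed ms, token count, and tokens/s from one timing line."""
    m = TIMING_RE.search(line)
    if not m:
        return None
    kind, ms, n_tokens, tps = m.groups()
    return {"kind": kind, "ms": float(ms), "tokens": int(n_tokens), "tps": float(tps)}

line = "prompt eval time = 314152.47 ms / 192000 tokens ( 1.64 ms per token, 611.17 tokens per second)"
print(parse_timing(line))  # → {'kind': 'prompt eval', 'ms': 314152.47, 'tokens': 192000, 'tps': 611.17}
```

This is enough to diff PP/TG throughput between the two GPUs from their raw logs.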
Docker command:
```shell
docker run -d --rm --name llama1 --gpus "device=0" \
  -v /home/slavik/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  -p 8080:8080 \
  --entrypoint ./llama-server \
  ghcr.io/ggml-org/llama.cpp:server-cuda12-b8576 --host 0.0.0.0 --port 8080 --hf-repo ai-sage/GigaChat3.1-10B-A1.8B-GGUF:q6_k
```
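Once the container is up, a quick smoke test against it might look like the following (a hypothetical example, not part of the original report; `/completion` is llama-server's native endpoint, and port 8080 matches the docker command above):

```shell
# Send a short prompt to the server started above; requires the container
# to be running and the model loaded.
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Hello,", "n_predict": 16}'
```

The JSON response includes a `timings` object, which is where per-second throughput figures like the ones in this report can be read programmatically.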