Performance report for Q6_K on RTX 3090 and RTX 4090D

#6
by SlavikF - opened

RTX 3090, VRAM: 24GB

With the full 262144-token context allocated, it uses 17 GB of VRAM.

Prompt processing (PP) starts at ~4000 t/s on short context (~1100 tokens),
then drops to ~250 t/s on long context (192000 tokens).

Token generation (TG):

  • 190 t/s on short context ( 1100 tokens )
  • 48 t/s on long context ( 192000 tokens )

Log:

[51961]   Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
[51961] load_backend: loaded CUDA backend from /app/libggml-cuda.so
[51961] load_backend: loaded CPU backend from /app/libggml-cpu-sapphirerapids.so
[51961] build: 8576 (afe65aa28) with GNU 11.4.0 for Linux x86_64
[51961] system info: n_threads = 12, n_threads_batch = 12, total_threads = 12
[51961] system_info: n_threads = 12 (n_threads_batch = 12) / 12 | CUDA : ARCHS = 500,610,700,750,800,860,890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | AMX_INT8 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
[51961] srv    load_model: loading model '/root/.cache/huggingface/hub/models--ai-sage--GigaChat3.1-10B-A1.8B-GGUF/snapshots/97045b260251cfa86f5ad25638fa2dd074153446/GigaChat3.1-10B-A1.8B-q6_K.gguf'
[51961] print_info: file size   = 8.17 GiB (6.58 BPW) 
[51961] load_tensors: offloaded 27/27 layers to GPU
[51961] llama_context: n_seq_max     = 4
[51961] llama_context: n_ctx         = 262144
[51961] llama_context: n_ctx_seq     = 262144
[51961] llama_context: n_batch       = 2048
[51961] llama_context: n_ubatch      = 512
[51961] sched_reserve: Flash Attention was auto, set to enabled
[51961] sched_reserve: resolving fused Gated Delta Net support:
[51961] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[51961] sched_reserve: fused Gated Delta Net (chunked) enabled

# long context:
[51961] prompt eval time =  765917.20 ms / 192000 tokens (    3.99 ms per token,   250.68 tokens per second)
[51961]        eval time =   13802.61 ms /   663 tokens (   20.82 ms per token,    48.03 tokens per second)

# short context:
[51961] prompt eval time =     292.62 ms /  1176 tokens (    0.25 ms per token,  4018.91 tokens per second)
[51961]        eval time =    2026.23 ms /   384 tokens (    5.28 ms per token,   189.51 tokens per second)

On the RTX 4090D:

long context:

prompt eval time =  314152.47 ms / 192000 tokens (    1.64 ms per token,   611.17 tokens per second)
       eval time =    6503.02 ms /   555 tokens (   11.72 ms per token,    85.34 tokens per second)

short context:

prompt eval time =     326.13 ms /  2267 tokens (    0.14 ms per token,  6951.24 tokens per second)
       eval time =    2112.52 ms /   505 tokens (    4.18 ms per token,   239.05 tokens per second)
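The tokens-per-second figures in the logs are simply token count divided by elapsed seconds, so the two cards are easy to compare directly. A quick sanity check of the long-context numbers above, and the resulting 4090D-over-3090 speedups:

```python
# Sanity-check the throughput figures from the llama-server logs:
# tokens/s = tokens / (elapsed_ms / 1000)

def tps(tokens: int, elapsed_ms: float) -> float:
    """Throughput in tokens per second from a token count and elapsed milliseconds."""
    return tokens / (elapsed_ms / 1000.0)

# RTX 3090, long context (numbers taken from the log above)
pp_3090 = tps(192000, 765917.20)   # ~250.7 t/s prompt eval
tg_3090 = tps(663, 13802.61)       # ~48.0 t/s generation

# RTX 4090D, long context
pp_4090d = tps(192000, 314152.47)  # ~611.2 t/s prompt eval
tg_4090d = tps(555, 6503.02)       # ~85.3 t/s generation

print(f"4090D speedup over 3090: PP {pp_4090d / pp_3090:.2f}x, TG {tg_4090d / tg_3090:.2f}x")
```

So at 192k tokens the 4090D is roughly 2.4x faster on prompt processing but only about 1.8x faster on generation, which matches the pattern of PP being compute-bound and TG being more bandwidth-bound.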

Docker command:

docker run -d --rm --name llama1 --gpus "device=0" \
  -v /home/slavik/.cache/huggingface/hub:/root/.cache/huggingface/hub \
  -p 8080:8080 \
  --entrypoint ./llama-server \
  ghcr.io/ggml-org/llama.cpp:server-cuda12-b8576 \
  --host 0.0.0.0 --port 8080 \
  --hf-repo ai-sage/GigaChat3.1-10B-A1.8B-GGUF:q6_k
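Once the container is up, the server can be smoke-tested with plain curl against llama-server's built-in endpoints (this assumes port 8080 as mapped above; the prompt and max_tokens values are just placeholders):

```shell
# Check that the server is up and the model is loaded
curl -s http://localhost:8080/health

# Send a small request via the OpenAI-compatible chat endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```

The response JSON includes a `timings` object with prompt-eval and generation speeds, which is where the t/s numbers above come from.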
