A single CPU core doing some important work seems to be the performance bottleneck

#3
by rpeinl - opened

I'm running the model on a Kubernetes cluster with a Pro 6000 GPU and 16 CPU cores. However, generating a single 1024x1024 image with flash-attn installed takes 1:25 min for creation and another 6-7 s to decode using the turbo-decoder with 8 steps. Image quality is fine, but the GPU shows only 35-40% utilization while a single CPU core sits at 100%. It therefore seems the model could run much faster if whatever is running on the CPU were parallelized.
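For reference, a minimal check of the CPU thread settings, assuming a standard PyTorch stack underneath (the calls below are standard PyTorch; the count of 16 matches my node):

```python
import torch

# Inside a container, PyTorch's intra-op thread pool can end up
# smaller than the node's core count, leaving most CPUs idle.
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# Let CPU-bound ops use all 16 cores of the node.
torch.set_num_threads(16)
```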
With reasoning turned on, it runs for several minutes without any progress indicator and without even reaching the image-creation stage. In theory, the model should be quite fast on my hardware, but in practice it is slower than Flux 2 dev and other heavier models.
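In case it helps to narrow this down, here is a rough profiling sketch I would use to find the single-threaded hot spot (hedged: `pipe` is a placeholder for the actual generation call, which may differ for this model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # `pipe` stands in for whatever entry point runs generation.
    image = pipe("a test prompt", num_inference_steps=8).images[0]

# Sorting by self CPU time should surface the op that pins one core.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```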
