A single CPU core doing some important work seems to be the performance bottleneck

#3
by rpeinl - opened

I'm running the model on a Kubernetes cluster with a Pro 6000 GPU and 16 CPU cores. However, generating a single 1024x1024 image with flash-attn installed takes 1:25 min for creation and another 6-7 s to decode using the turbo-decoder with 8 steps. Image quality is fine, but the GPU shows only 35-40% utilization while a single CPU core sits at 100%. It therefore seems the model could run much faster if whatever is running on the CPU were parallelized.
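For reference, a minimal check of the CPU thread settings, assuming a standard PyTorch stack underneath (the calls below are standard PyTorch; the count of 16 matches my node):

```python
import torch

# Inside a container, PyTorch's intra-op thread pool can end up
# smaller than the node's core count, leaving most CPUs idle.
print("intra-op threads:", torch.get_num_threads())
print("inter-op threads:", torch.get_num_interop_threads())

# Let CPU-bound ops use all 16 cores of the node.
torch.set_num_threads(16)
```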
With reasoning turned on, it runs for several minutes without any progress indicator and without even reaching the image-creation stage. In theory, the model should be quite fast on my hardware, but in practice it is slower than Flux 2 dev and other heavier models.
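In case it helps to narrow this down, here is a rough profiling sketch I would use to find the single-threaded hot spot (hedged: `pipe` is a placeholder for the actual generation call, which may differ for this model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    # `pipe` stands in for whatever entry point runs generation.
    image = pipe("a test prompt", num_inference_steps=8).images[0]

# Sorting by self CPU time should surface the op that pins one core.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))
```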
