Training stability on AMD ROCm (distributed backend, gloo vs nccl)

#3
by mrs83 - opened
ethicalabs.ai org

Use gloo, NOT nccl, for the distributed backend on single-GPU ROCm. nccl is designed for multi-GPU NVIDIA communication.

On AMD ROCm (HIP), the NCCL comm teardown races with HIP device release during process exit, causing a `ProcessGroupNCCL::abortCommsFromMap` → `getDevice` HIP error (exit code 134).

The fix is to call `dist.init_process_group(backend="gloo", ...)` instead.

Gloo handles single-process groups cleanly on both CUDA and ROCm with no teardown race.
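A minimal sketch of the gloo-based initialization for a single-process run (the rank, world size, and localhost rendezvous address here are assumptions for the single-GPU case, not values from the original post):

```python
import os
import torch.distributed as dist

def init_single_process_group():
    # Rendezvous settings for a one-process group; any free port works.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    # gloo avoids the NCCL/HIP teardown race on ROCm at process exit.
    dist.init_process_group(backend="gloo", rank=0, world_size=1)

init_single_process_group()
print(dist.get_backend())  # "gloo"
dist.destroy_process_group()
```

Calling `destroy_process_group()` before exit keeps teardown explicit, which is also good hygiene on NVIDIA/NCCL setups.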
