Training stability on AMD ROCm (distributed backend, gloo vs nccl)
#3
by mrs83 - opened
Use gloo, NOT nccl, for the distributed backend on single-GPU ROCm. nccl is designed for multi-GPU NVIDIA communication.
On AMD ROCm (HIP), the NCCL comm teardown races with HIP device release during process exit, causing a ProcessGroupNCCL::abortCommsFromMap → getDevice HIP error (exit code 134).
The fix is to call dist.init_process_group(backend="gloo", ...).
Gloo handles single-process groups cleanly on both CUDA and ROCm with no teardown race.
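A minimal sketch of the fix, assuming a single-process setup; pick_backend is a hypothetical helper for illustration, not part of any library:

```python
import os

def pick_backend(is_rocm: bool, world_size: int) -> str:
    # Hypothetical helper: on a ROCm (HIP) build with a single process,
    # gloo avoids the NCCL comm-teardown race at exit; nccl remains the
    # usual choice for multi-GPU NVIDIA setups.
    if is_rocm and world_size == 1:
        return "gloo"
    return "nccl"

if __name__ == "__main__":
    import torch
    import torch.distributed as dist

    # Single-process rendezvous over env vars (adjust port as needed).
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")

    # torch.version.hip is a string on ROCm builds, None otherwise.
    is_rocm = torch.version.hip is not None
    dist.init_process_group(
        backend=pick_backend(is_rocm, world_size=1),
        rank=0,
        world_size=1,
    )
    # ... training loop ...
    dist.destroy_process_group()  # explicit teardown before process exit
```

Calling dist.destroy_process_group() before exit is good hygiene on any backend; with gloo there is no HIP teardown race even if it is skipped.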