# Performance Optimizations (vs. main)
Summary of optimizations on branch `perf/pipelined-distributed-muon-clean` relative to `main`.
---
## 1. Batched Momentum (`core.py`)
**Before:** Per-param `update_g()` – one `torch.add` + optional `torch.add_` per parameter.
**After:** `_batch_pre_ortho()` – `_foreach_mul_` and `_foreach_add_` on lists of local tensors (unwrapped from DTensor). One fused kernel per batch instead of N individual kernels.
**Impact:** Eliminates per-param Python-loop overhead and N small kernel launches; savings scale with parameter count.
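A minimal sketch of the batched form, assuming the momentum buffers and grads have already been unwrapped to plain local tensors (the function name and signature here are illustrative, not the branch's actual API):

```python
import torch

def batch_pre_ortho(momenta, grads, beta=0.95, nesterov=True):
    """Sketch of a batched momentum update (hypothetical signature).

    momenta / grads are lists of plain local tensors (DTensors already
    unwrapped). Each _foreach_* call is one fused launch over the whole
    list, replacing N per-param kernels.
    """
    # buf = beta * buf + grad, fused across the list
    torch._foreach_mul_(momenta, beta)
    torch._foreach_add_(momenta, grads)
    if nesterov:
        # g = grad + beta * buf, again one fused launch
        return torch._foreach_add(grads, momenta, alpha=beta)
    return momenta
```

The per-param loop version computes exactly the same values; only the number of kernel launches changes.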
---
## 2. Pipeline Buffer Packing (`pipeline.py`)
### Gather send buffer
**Before:** Per-param `.to(COMM_DTYPE).contiguous()` followed by per-destination `append` to list, then `torch.cat` on the per-dst lists.
**After:** Collect all grad slices in destination order in a single pass, then one `torch.cat` call. Avoids intermediate per-destination lists and redundant dtype conversions.
### Scatter send buffer
**Before:** Per-param, per-destination-rank: index `u_full[indices].flatten()`, append to per-dst list, then flatten+cat.
**After:** Cache `u_full` conversions (avoid redundant `.to()` per dst_rank). Collect all slices in dst order in one pass, single `torch.cat`.
**Impact:** Fewer kernel launches, less Python overhead, reduced intermediate allocations.
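The single-pass packing pattern for both buffers looks roughly like this (sketch only; `pack_send_buffer`, its input layout, and the `COMM_DTYPE` value are assumptions, not the branch's actual code):

```python
import torch

COMM_DTYPE = torch.bfloat16  # assumed communication dtype

def pack_send_buffer(slices_by_dst):
    """Pack tensor slices for all destination ranks into one flat buffer.

    slices_by_dst: list (ordered by destination rank) of lists of tensors.
    All slices are collected in destination order in a single pass, then
    one torch.cat + one dtype cast is issued, instead of a cat/cast per
    destination rank.
    """
    flat, splits = [], []
    for slices in slices_by_dst:
        n = 0
        for t in slices:
            flat.append(t.reshape(-1))
            n += t.numel()
        splits.append(n)  # per-rank split sizes, e.g. for all_to_all_single
    send_buf = torch.cat(flat).to(COMM_DTYPE)
    return send_buf, splits
```

The per-rank `splits` list is what a split-size-aware collective would consume to carve the flat buffer back up on the receiving side.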
---
## 3. Zero-Copy Scatter (`pipeline.py`)
**Before:** `_launch_scatter` pre-allocates `torch.empty_like(p.to_local())` for every param. `_complete_scatter` copies from recv_buf into these pre-allocated tensors via `copy_()`.
**After:** `_complete_scatter` assigns **views** into `recv_buf` directly (via `recv_buf.narrow(...).view_as(...)`). No pre-allocation, no copy. The recv_buf storage stays alive through the views until `_update_params` consumes them.
**Impact:** Eliminates N `empty_like` allocations + N `copy_` kernel launches per scatter stage.
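The zero-copy idea can be sketched as follows (hypothetical helper; the real code carves views per param out of the recv buffer in the same way):

```python
import torch

def views_from_recv_buf(recv_buf, shapes):
    """Return zero-copy views into a flat recv buffer, one per local shape.

    No empty_like allocation and no copy_: each result is a narrow()+view
    into recv_buf, so its storage stays alive until the consumers
    (the parameter-update step) are done with the views.
    """
    outs, offset = [], 0
    for shape in shapes:
        n = 1
        for d in shape:
            n *= d
        outs.append(recv_buf.narrow(0, offset, n).view(shape))
        offset += n
    return outs
```

Because the results are views, mutating `recv_buf` is visible through them, which is why the buffer must not be reused until `_update_params` has consumed the views.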
---
## 4. Batched Parameter Update (`pipeline.py`)
**Before:** Per-param loop calling `update_p()` (which unwraps DTensor, applies weight decay, applies update individually).
**After:** Batched using `_foreach_mul_` (weight decay) and `_foreach_add_` (Muon update), grouped by `adjusted_lr` to preserve float32 alpha precision. Single kernel per group instead of per param.
**Impact:** Reduces N per-param kernel launches to 1-2 batched kernel launches.
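A sketch of the grouped update, assuming plain local tensors and a per-param `adjusted_lr` (names and signature are illustrative):

```python
from collections import defaultdict

import torch

def batched_update(params, updates, lrs, weight_decay=0.0):
    """Apply p <- p * (1 - lr*wd) - lr * u, batched per distinct lr.

    Grouping by adjusted_lr keeps a single float32 alpha per _foreach_
    call, matching the per-param path's precision; in practice there are
    typically only 1-2 distinct lr groups, hence 1-2 batched launches.
    """
    groups = defaultdict(list)
    for p, u, lr in zip(params, updates, lrs):
        groups[lr].append((p, u))
    for lr, pairs in groups.items():
        ps = [p for p, _ in pairs]
        us = [u for _, u in pairs]
        if weight_decay:
            # decoupled weight decay, fused over the group
            torch._foreach_mul_(ps, 1.0 - lr * weight_decay)
        # Muon update, fused over the group
        torch._foreach_add_(ps, us, alpha=-lr)
```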
---
## 5. Parallel Metadata Caching (`muon.py`)
**Before:** `init_state_and_assign_params()` called every step – sorts params by FLOP cost, assigns ownership via round-robin, precomputes per-rank indices/numels for all-to-all.
**After:** `_parallel_cache` keyed by `tuple(names)`. The first call computes and caches `ordered_names`, `name_to_state`, `rank`, and `chunk_size`. Subsequent calls reuse the cached metadata and only rebuild `param_to_state` with current `id(p)` keys (param names are stable across steps, but param object ids may change after QK clip updates).
**Impact:** Eliminates repeated sorting, mesh construction, and index precomputation on every step.
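The caching pattern, reduced to its essentials (the `build_fn` callback stands in for the expensive sort/ownership/index work and is an assumption of this sketch):

```python
_parallel_cache = {}

def get_parallel_metadata(names, params, build_fn):
    """Cache expensive per-step metadata, keyed by the tuple of param names.

    build_fn (assumed) performs the one-time sort / ownership assignment
    and returns (ordered_names, name_to_state, rank, chunk_size). Only the
    id(p) -> state map is rebuilt each step, since param object ids can
    change while the name order stays stable.
    """
    key = tuple(names)
    if key not in _parallel_cache:
        _parallel_cache[key] = build_fn(names, params)
    ordered_names, name_to_state, rank, chunk_size = _parallel_cache[key]
    by_name = dict(zip(names, params))
    # cheap per-step rebuild: map current param ids to cached states
    param_to_state = {id(by_name[n]): name_to_state[n] for n in ordered_names}
    return ordered_names, param_to_state, rank, chunk_size
```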
---
## 6. Expert Param Expansion Caching (`muon.py`)
**Before:** `_expand_expert_params()` called every step – for each expert param `(E, out, in)`, creates E `nn.Parameter` wrappers (triggers `aten::detach`), indexes data and grad (`aten::select`), and wraps in DTensor for TP.
**After:** `_expert_expand_cache` keyed by `tuple(id(p) for p in params)`. Cold path runs `_expand_expert_params` once and caches:
- `expanded_names` / `expanded_params` – the `nn.Parameter` wrappers with stable data views
- `grad_info` – per-expert-group metadata (orig param index, num experts, expanded start index, DTensor flag, TP mesh/placements)
Hot path reuses cached nn.Parameter objects (data views are stable since optimizer updates happen in-place on the same storage). Only updates `.grad` on each cached expert param by slicing the current step's gradient.
**Eliminated on hot path:**
- `nn.Parameter()` construction – removes `aten::detach`
- `local_data[i]` data slicing – removes half of `aten::select` + `aten::as_strided`
- `DTensor.from_local()` for data – only needed for grad now
- `is_expert_param()` name matching per step
**Still required per step:**
- `local_grad[i]` – the grad tensor changes each step (Nesterov)
- `DTensor.from_local(slice_grad, ...)` – for TP expert grads
- `p.grad = None` – frees the original 3D grad storage
**Impact:** ~8ms CPU overhead reduction per step at production scale (64 GPUs, 48 local experts).
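A stripped-down sketch of the cache's hot/cold split. Here `expand_fn` stands in for the expensive cold path and `grad_info` is simplified to `(orig_index, num_experts)` pairs; the real metadata carries more fields (start index, DTensor flag, TP mesh/placements):

```python
import torch

_expert_expand_cache = {}

def expand_experts(params, expand_fn):
    """Reuse cached per-expert nn.Parameter views; refresh only .grad.

    expand_fn (assumed) is the cold path: it splits each (E, out, in)
    expert param into E nn.Parameter wrappers whose .data are stable views
    into the original storage, plus grad-slicing metadata.
    """
    key = tuple(id(p) for p in params)
    entry = _expert_expand_cache.get(key)
    if entry is None:
        entry = expand_fn(params)  # -> (expanded_params, grad_info)
        _expert_expand_cache[key] = entry
    expanded, grad_info = entry
    i = 0
    for pi, num_experts in grad_info:
        grad = params[pi].grad
        for e in range(num_experts):
            # hot path: only the grad slice is refreshed each step;
            # .data views are stable because updates are in-place
            expanded[i].grad = grad[e]
            i += 1
        params[pi].grad = None  # free the original 3D grad storage
    return expanded
```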
---
## 7. Newton-Schulz Compile + CUDA Graph (`newton_schulz.py`)
**Before:** `_zeropower_via_newtonschulz5()` called directly every time.
**After:** `zeropower_via_newtonschulz5()` wrapper with per-shape `torch.compile` caching + CUDA graph (`triton.cudagraphs=True`). Each unique shape gets its own compiled function stored in `_ns_per_shape`. Toggled via `set_ns_compile(enabled)`.
**Impact:** After warmup, NS iterations run as CUDA graphs – eliminates per-step compilation overhead and CPU-GPU synchronization.
---
## 8. Removed `small_param_numel_threshold` (`muon.py`)
**Before:** Small sharded DTensors (below threshold, default 65536) fell back to `distributed_muon()` which used per-param `full_tensor()` + redistribute.
**After:** All sharded DTensors go to `parallel()`. `distributed_muon()` is retained as a test-only reference implementation. Uneven shard splits (e.g., MoE gate weights with fewer rows than shard ranks) are handled inline via `full_tensor()` fallback within the batched distributed_muon path.
**Impact:** Simpler routing, no silent fallback to slower path.
---
## Summary Table
| Optimization | Location | Category | Kernel Launches Saved |
|---|---|---|---|
| Batched momentum | `core.py` | CPU + GPU | N per-param → 2-3 batched |
| Buffer packing (gather) | `pipeline.py` | CPU + GPU | N cat+cast → 1 cat+cast |
| Buffer packing (scatter) | `pipeline.py` | CPU + GPU | N cat → 1 cat |
| Zero-copy scatter | `pipeline.py` | GPU memory | N alloc+copy → 0 |
| Batched param update | `pipeline.py` | CPU + GPU | N update → 1-2 batched |
| Parallel metadata cache | `muon.py` | CPU | Sort+index per step → once |
| Expert expand cache | `muon.py` | CPU | N detach+select → grad-only |
| NS compile + CUDA graph | `newton_schulz.py` | GPU | JIT warmup → graph replay |
| Remove small_param_threshold | `muon.py` | Routing | Simpler, unified path |