MemCode-VLA v11
Memory-conditioned visuomotor policy for robot manipulation.
Architecture: SmolVLM2-2.2B VLM → MoT dual-path 24-layer denoiser (LaST0 pattern) → Flow Matching action head → DeltaMem online memory → World Model Expert (LeWM-style) → Cascade Anchor Decoder (DiffusionDrive/BridgeDrive)
Training: 8×H100-80GB DDP (steps 0-50K) → 4×H100-80GB DDP (steps 50K-100K), checkpoints at 5K intervals
Config:
- B_ep=32, W=48, 24 MoT layers
- VLM: single layer 14/24 (GR00T N1 pattern)
- Anchor: 512 anchors, Sinkhorn+centering+focal KL, cosine distance
- DeltaMem: rank-8, per-layer delta-rule associative memory
- World Model: LeWM-style ARPredictor, H=8 history, S=2 stride (wm_min_frames=2, wm_align_weight=0.3)
- CoT: LaST0 <|latent_pad|> pattern, 4 latent reasoning tokens
- Augmentations (enabled at step 50K): obs_dropout=0.10, memory_dropout=0.05
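
To make the "delta-rule associative memory" entry in the config concrete, here is a minimal sketch of a generic delta-rule write and read in plain Python. It is not taken from the repository: the function names `read` and `delta_write`, the `beta` write strength, and the toy dimensions are all hypothetical, and the rank-8 and per-layer aspects of DeltaMem are not modeled.

```python
def read(memory, key):
    """Return memory @ key for a row-major list-of-lists matrix."""
    return [sum(m * k for m, k in zip(row, key)) for row in memory]


def delta_write(memory, key, value, beta=1.0):
    """Generic delta-rule update: M <- M + beta * (v - M k) k^T.

    Returns a new matrix; `beta` is a hypothetical write strength.
    """
    predicted = read(memory, key)
    error = [v - p for v, p in zip(value, predicted)]
    return [
        [m + beta * e * k for m, k in zip(row, key)]
        for row, e in zip(memory, error)
    ]


# Toy setup (assumed sizes): 2-dim values, 3-dim unit-norm keys.
memory = [[0.0] * 3 for _ in range(2)]
key_a = [1.0, 0.0, 0.0]
key_b = [0.0, 1.0, 0.0]

memory = delta_write(memory, key_a, [0.5, -1.0])
memory = delta_write(memory, key_b, [2.0, 3.0])
# Writing to key_a again replaces its old value instead of adding to it.
memory = delta_write(memory, key_a, [4.0, 4.0])

recalled_a = read(memory, key_a)
recalled_b = read(memory, key_b)
```

With unit-norm keys and `beta=1.0`, a write followed by a read of the same key returns the written value exactly, and writes to orthogonal keys do not interfere. That overwrite behavior is what distinguishes the delta rule from a purely additive Hebbian update.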
Checkpoints:
| Step | Action Loss | Anchor Eff Rank | WM Active | Notes |
|---|---|---|---|---|
| 5000 | - | - | - | |
| 10000 | - | - | - | |
| 15000 | - | - | - | |
| 20000 | - | - | - | |
| 25000 | - | - | - | |
| 30000 | - | - | - | |
| 35000 | - | - | - | |
| 40000 | 0.020 | 326/512 | 0.145 | |
| 45000 | - | - | - | |
| 50000 | 0.018 | 334/512 | 0.145 | Dropout + WM enhancements enabled after this step |
| 55000 | 0.035 | 444/512 | 0.145 | Initial dropout adaptation |
| 60000 | 0.030 | 430/512 | 0.145 | |
| 65000 | 0.031 | 387/512 | 0.145 | |
| 70000 | 0.025 | 384/512 | 0.145 | |
| 75000 | 0.025 | 392/512 | 0.145 | |
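
The table does not say how "Anchor Eff Rank" is computed. One common way to get a figure of the form N/512 is the exponential of the Shannon entropy of the anchor-usage distribution, sketched below. This is an assumption about the metric, not the repository's definition, and `effective_count` and the usage counts are hypothetical.

```python
import math


def effective_count(usage_counts):
    """exp(Shannon entropy) of a usage histogram.

    Equals len(usage_counts) when usage is uniform and 1.0 when a single
    anchor receives every assignment.
    """
    total = sum(usage_counts)
    if total <= 0:
        raise ValueError("usage_counts must contain at least one positive count")
    entropy = -sum(
        (c / total) * math.log(c / total) for c in usage_counts if c > 0
    )
    return math.exp(entropy)


# Hypothetical histograms over 512 anchors.
uniform = effective_count([10] * 512)            # every anchor used equally
collapsed = effective_count([1000] + [0] * 511)  # one anchor takes everything
half_used = effective_count([10] * 256 + [0] * 256)
```

Under this reading, a value such as 326/512 would mean the assignments are spread about as widely as a uniform distribution over 326 anchors.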
Resume (8-GPU):
PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
torchrun --standalone --nnodes=1 --nproc_per_node=8 \
-m xq_memcodevla.training.train train pretraining --resume
Resume (4-GPU, with adjusted gradient accumulation):
PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=4,5,6,7 \
torchrun --standalone --nnodes=1 --nproc_per_node=4 \
-m xq_memcodevla.training.train train pretraining --resume --accum-steps 4
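
The arithmetic behind adjusting gradient accumulation when the GPU count changes can be sketched as follows. Under DDP the effective batch per optimizer step is per-GPU batch × number of GPUs × accumulation steps. The helper names and the example numbers are hypothetical: the per-GPU batch of 32 and the 8-GPU accumulation value of 2 are assumptions chosen for illustration, not values confirmed by the training config.

```python
def effective_batch(per_gpu_batch, world_size, accum_steps):
    """Samples contributing to one optimizer step under DDP."""
    return per_gpu_batch * world_size * accum_steps


def accum_steps_for(target_effective_batch, per_gpu_batch, world_size):
    """Accumulation steps needed to reach a target effective batch."""
    per_micro_step = per_gpu_batch * world_size
    if target_effective_batch % per_micro_step != 0:
        raise ValueError("target batch is not a multiple of per_gpu_batch * world_size")
    return target_effective_batch // per_micro_step


# Assumed values for illustration only.
per_gpu_batch = 32
target = effective_batch(per_gpu_batch, world_size=8, accum_steps=2)

# Halving the GPU count doubles the accumulation steps needed.
accum_on_4_gpus = accum_steps_for(target, per_gpu_batch, world_size=4)
```

If the 8-GPU phase did use 2 accumulation steps, `--accum-steps 4` on 4 GPUs would keep the effective batch unchanged; check the actual 8-GPU setting before relying on that.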
Code: https://github.com/guohetian/XQ-MemCodeVLA (branch: dev)
Papers: MemCode-VLA (memory + planning) + TokenAct (efficient execution)