MemCode-VLA v11

Memory-conditioned visuomotor policy for robot manipulation.

Architecture: SmolVLM2-2.2B VLM → MoT dual-path 24-layer denoiser (LaST0 pattern) → Flow Matching action head → DeltaMem online memory → World Model Expert (LeWM-style) → Cascade Anchor Decoder (DiffusionDrive/BridgeDrive)
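
For readers new to the Flow Matching action head, here is a minimal sketch of the training target and a sampler. It assumes the common linear-interpolation convention (noise at t=0, action at t=1). The repository's actual convention, network, and solver are not stated here, and the names `flow_matching_pair` and `euler_sample` are illustrative, not taken from the codebase.

```python
def flow_matching_pair(action, noise, t):
    """Build one training pair for linear-path flow matching.

    Assumed convention: x_t = (1 - t) * noise + t * action, so the
    regression target for the velocity network is (action - noise).
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, target_velocity


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

At training time the denoiser would regress `target_velocity` from `x_t`, `t`, and the conditioning; at inference `euler_sample` is called with the trained network as `velocity_fn`.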

Training: 8×H100-80GB DDP (steps 0-50K) → 4×H100-80GB DDP (steps 50K-100K), checkpoints at 5K intervals

Config:

  • B_ep=32, W=48, 24 MoT layers
  • VLM: features taken from a single layer, 14 of 24 (GR00T N1 pattern)
  • Anchor: 512 anchors, Sinkhorn+centering+focal KL, cosine distance
  • DeltaMem: rank-8, per-layer delta-rule associative memory
  • World Model: LeWM-style ARPredictor, H=8 history, S=2 stride (wm_min_frames=2, wm_align_weight=0.3)
  • CoT: LaST0 <|latent_pad|> pattern, 4 latent reasoning tokens
  • Augmentations (enabled at step 50K): obs_dropout=0.10, memory_dropout=0.05
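
The delta-rule associative memory named in the DeltaMem entry can be illustrated with the textbook update M ← M + β (v − M k) kᵀ. This is a sketch of that rule only, using a full matrix in pure Python. How DeltaMem applies the rank-8 constraint, and how keys, values, and β are produced per layer, is not described here, so none of that is modeled. The function names are illustrative.

```python
def memory_read(M, key):
    """Retrieve the value currently associated with `key`: M @ key."""
    return [sum(m * k for m, k in zip(row, key)) for row in M]


def delta_rule_update(M, key, value, beta=1.0):
    """Return M + beta * (value - M @ key) key^T.

    The memory is corrected by its own prediction error, so writing
    to an existing key overwrites the old value instead of adding to it.
    """
    error = [v - p for v, p in zip(value, memory_read(M, key))]
    return [
        [m + beta * e * k for m, k in zip(row, key)]
        for row, e in zip(M, error)
    ]


# Usage: a 2x3 memory (value dim 2, key dim 3), initially empty.
M = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
M = delta_rule_update(M, key=[1.0, 0.0, 0.0], value=[2.0, -1.0])
M = delta_rule_update(M, key=[0.0, 1.0, 0.0], value=[5.0, 5.0])
M = delta_rule_update(M, key=[1.0, 0.0, 0.0], value=[0.0, 3.0])  # overwrite
```

With unit-norm keys and β=1, a read immediately after a write returns the written value exactly, and writes to orthogonal keys do not disturb each other.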

Checkpoints:

Step    Action Loss   Anchor Eff Rank   WM Active   Notes
5000    -             -                 -
10000   -             -                 -
15000   -             -                 -
20000   -             -                 -
25000   -             -                 -
30000   -             -                 -
35000   -             -                 -
40000   0.020         326/512           0.145
45000   -             -                 -
50000   0.018         334/512           0.145       Dropout + WM enhancements enabled after this step
55000   0.035         444/512           0.145       Initial dropout adaptation
60000   0.030         430/512           0.145
65000   0.031         387/512           0.145
70000   0.025         384/512           0.145
75000   0.025         392/512           0.145

Resume (8-GPU):

PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  -m xq_memcodevla.training.train train pretraining --resume

Resume (4-GPU, with adjusted gradient accumulation):

PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=4,5,6,7 \
  torchrun --standalone --nnodes=1 --nproc_per_node=4 \
  -m xq_memcodevla.training.train train pretraining --resume --accum-steps 4
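
The reason for adjusting gradient accumulation when dropping from 8 to 4 GPUs is to keep the effective batch size (and so the optimizer dynamics) unchanged. The arithmetic can be sketched as below. The per-GPU batch of 4 and the 8-GPU accumulation of 2 are hypothetical values chosen only so that the result matches `--accum-steps 4`; the actual 8-GPU settings are not stated here.

```python
def effective_batch(per_gpu_batch, num_gpus, accum_steps):
    """Samples contributing to one optimizer step under DDP."""
    return per_gpu_batch * num_gpus * accum_steps


def accum_steps_for(target_effective_batch, per_gpu_batch, num_gpus):
    """Accumulation steps needed to reach a target effective batch size."""
    per_step = per_gpu_batch * num_gpus
    if target_effective_batch % per_step != 0:
        raise ValueError("target batch is not a multiple of per_gpu_batch * num_gpus")
    return target_effective_batch // per_step


# Hypothetical 8-GPU setting: per-GPU batch 4, accumulation 2.
target = effective_batch(per_gpu_batch=4, num_gpus=8, accum_steps=2)
# Halving the GPU count doubles the accumulation needed for the same target.
steps_on_4_gpus = accum_steps_for(target, per_gpu_batch=4, num_gpus=4)
```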

Code: https://github.com/guohetian/XQ-MemCodeVLA (branch: dev)

Papers: MemCode-VLA (memory + planning) + TokenAct (efficient execution)
