MemCode-VLA v11

Memory-conditioned visuomotor policy for robot manipulation.

Architecture: SmolVLM2-2.2B VLM → MoT dual-path 24-layer denoiser (LaST0 pattern) → Flow Matching action head → DeltaMem online memory → World Model Expert (LeWM-style) → Cascade Anchor Decoder (DiffusionDrive/BridgeDrive)
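For readers unfamiliar with the Flow Matching action head named above, the following is a minimal sketch of the generic technique, not this repository's implementation. It assumes the common linear (rectified-flow) path with noise at t=0 and the target action at t=1; the card does not state which convention is used. All function names are hypothetical, and an oracle velocity stands in for the trained network.

```python
# Generic flow-matching sketch (assumed linear path; not the repo's code).

def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to action x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity network on the linear path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

noise = [0.3, -1.2]    # illustrative 2-D sample
action = [0.5, 0.25]   # illustrative 2-D target action

# Oracle velocity: what a perfectly trained network would output for this pair.
oracle = lambda x, t: target_velocity(noise, action)
sampled = euler_sample(oracle, noise, num_steps=10)
midpoint = interpolate(noise, action, 0.5)
```

With the oracle velocity, Euler integration lands on the target action up to floating-point error. A trained head would replace `oracle` with a network conditioned on observations and memory.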

Training: 8×H100-80GB DDP (steps 0-50K) → 4×H100-80GB DDP (steps 50K-100K), checkpoints at 5K intervals

Config:

B_ep=32, W=48, 24 MoT layers
VLM: single layer 14/24 (GR00T N1 pattern)
Anchor: 512 anchors, Sinkhorn+centering+focal KL, cosine distance
DeltaMem: rank-8, per-layer delta-rule associative memory
World Model: LeWM-style ARPredictor, H=8 history, S=2 stride (wm_min_frames=2, wm_align_weight=0.3)
CoT: LaST0 <|latent_pad|> pattern, 4 latent reasoning tokens
Augmentations (enabled at step 50K): obs_dropout=0.10, memory_dropout=0.05
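The DeltaMem entry above refers to a delta-rule associative memory. As a rough illustration of that update rule only, here is a pure-Python sketch. It assumes unit-norm keys and a write strength `beta`. How rank-8 enters the actual parameterization is not described in this card, so it is not modeled here, and the helper names are hypothetical.

```python
# Generic delta-rule associative memory sketch (not the repo's DeltaMem code).

def matvec(M, k):
    """Read from memory: returns M @ k for a list-of-rows matrix M."""
    return [sum(m_ij * k_j for m_ij, k_j in zip(row, k)) for row in M]

def delta_write(M, k, v, beta=1.0):
    """One delta-rule step, in place: M <- M + beta * (v - M k) k^T."""
    pred = matvec(M, k)
    for i, row in enumerate(M):
        err = v[i] - pred[i]          # prediction error for value dim i
        for j in range(len(row)):
            row[j] += beta * err * k[j]
    return M

# 2-D values stored against 3-D unit-norm keys (illustrative sizes).
M = [[0.0] * 3 for _ in range(2)]
k1 = [1.0, 0.0, 0.0]
k2 = [0.0, 1.0, 0.0]

delta_write(M, k1, [0.5, -0.5])
delta_write(M, k2, [2.0, 1.0])
delta_write(M, k1, [1.0, 1.0])   # overwrites the value bound to k1

recalled_k1 = matvec(M, k1)
recalled_k2 = matvec(M, k2)
```

Because the update writes only the error between the stored and the new value, rebinding `k1` replaces its old association while leaving the orthogonal key `k2` untouched, which is the property that distinguishes the delta rule from a purely additive Hebbian write.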

Checkpoints:

Step	Action Loss	Anchor Eff Rank	WM Active	Notes
5000	-	-	-
10000	-	-	-
15000	-	-	-
20000	-	-	-
25000	-	-	-
30000	-	-	-
35000	-	-	-
40000	0.020	326/512	0.145
45000	-	-	-
50000	0.018	334/512	0.145	Dropout + WM enhancements enabled after this step
55000	0.035	444/512	0.145	Initial dropout adaptation
60000	0.030	430/512	0.145
65000	0.031	387/512	0.145
70000	0.025	384/512	0.145
75000	0.025	392/512	0.145
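The card does not say how the Anchor Eff Rank column is computed. One common definition, shown here purely as an assumption, is the exponential of the Shannon entropy of the normalized anchor-usage distribution, which ranges from 1 (all assignments on one anchor) to 512 (perfectly uniform usage). The function name and the count inputs are hypothetical.

```python
import math

def effective_rank(counts):
    """exp(entropy) of the usage distribution given per-anchor counts.

    Assumed definition; the repository may use a different one
    (for example, one based on singular values).
    """
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must contain at least one positive entry")
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

NUM_ANCHORS = 512  # from the config above

uniform = effective_rank([10] * NUM_ANCHORS)                 # every anchor used equally
collapsed = effective_rank([1000] + [0] * (NUM_ANCHORS - 1)) # one anchor takes everything
half_used = effective_rank([5] * 256 + [0] * 256)            # half the anchors used equally
```

Under this definition, a value such as 326/512 would mean usage is spread about as evenly as a uniform distribution over 326 anchors.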

Resume (8-GPU):

PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  torchrun --standalone --nnodes=1 --nproc_per_node=8 \
  -m xq_memcodevla.training.train train pretraining --resume

Resume (4-GPU, with adjusted gradient accumulation):

PYTORCH_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES=4,5,6,7 \
  torchrun --standalone --nnodes=1 --nproc_per_node=4 \
  -m xq_memcodevla.training.train train pretraining --resume --accum-steps 4
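The usual reason to adjust gradient accumulation when halving the GPU count is to keep the global batch size per optimizer step unchanged. The sketch below shows that arithmetic only. The per-GPU batch size and the 8-GPU accumulation value are hypothetical placeholders, since this card states neither, nor how B_ep maps to a per-GPU batch.

```python
def global_batch(per_gpu_batch, num_gpus, accum_steps):
    """Samples contributing to one optimizer step under DDP."""
    return per_gpu_batch * num_gpus * accum_steps

def accum_steps_to_match(target_global_batch, per_gpu_batch, num_gpus):
    """Accumulation steps needed to reach the target global batch exactly."""
    per_micro_step = per_gpu_batch * num_gpus
    if target_global_batch % per_micro_step != 0:
        raise ValueError("target is not a multiple of per_gpu_batch * num_gpus")
    return target_global_batch // per_micro_step

# Hypothetical values, chosen only to illustrate the calculation.
PER_GPU_BATCH = 4
ACCUM_8GPU = 2

target = global_batch(PER_GPU_BATCH, num_gpus=8, accum_steps=ACCUM_8GPU)
needed_on_4 = accum_steps_to_match(target, PER_GPU_BATCH, num_gpus=4)
```

Halving the number of GPUs doubles the accumulation steps required for the same global batch, so before resuming on 4 GPUs it is worth checking the 8-GPU setting in the training config to confirm that `--accum-steps 4` preserves it.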

Code: https://github.com/guohetian/XQ-MemCodeVLA (branch: dev)

Papers: MemCode-VLA (memory + planning) + TokenAct (efficient execution)
