## Aetheris Student Model Configuration
## Target: ~500-800M parameters (HybridMambaMoE)
##
## Architecture: 24 layers alternating SSM (even) and MoE (odd)
## Vocab sized to match Aya tokenizer (256k)
##
## Wayy Research, 2024-2026
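# Rough parameter budget (back-of-envelope sketch, assuming tied
# input/output embeddings and standard Mamba block sizing; not a
# measured count):
#   embeddings:    256000 * 1024                     ~= 262M
#   12 MoE layers: 12 * 4 experts * 2 * 1024 * 3072  ~= 302M
#   12 SSM layers: ~6.7M each (in/out projections dominate) ~= 80M
#   total ~= 644M, inside the 500-800M target (an untied LM head
#   would add another ~262M)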
vocab_size: 256000
d_model: 1024
n_layer: 24
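# MoE settings. top_k: 1 is Switch-style routing: each token is sent to
# a single expert, so only one expert FFN (2 * 1024 * 3072 ~= 6.3M
# params) runs per token per MoE layer.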
num_experts: 4
top_k: 1
d_ff: 3072 # d_model * 3
# SSM parameters
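# ssm_d_state is the per-channel state size (N=16 is the Mamba paper's
# default); ssm_expand widens the block's inner channel dimension.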
ssm_d_state: 16
ssm_expand: 2
# d_inner: null # defaults to d_model * ssm_expand = 2048
# Training parameters
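# Both router losses keep expert load balanced: load_balancing_coef
# weights the auxiliary balance loss and router_z_loss_coef penalizes
# large router logits. The 0.01 / 0.001 values match the defaults
# reported in the Switch Transformer and ST-MoE papers, respectively.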
load_balancing_coef: 0.01
router_z_loss_coef: 0.001
max_seq_len: 2048
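# Note: float16 training generally requires loss scaling; bfloat16 is
# the usual drop-in alternative where hardware supports it.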
dtype: "float16"
# Optimization
use_cpu_offload: false
gradient_checkpointing: true
checkpoint_ssm_layers: true
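# Presumably a no-op here: the alternating SSM/MoE stack has no
# attention layers, so the flag stays false.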
use_flash_attention: false