Interesting architectural design
Hi, really amazing work.
The architecture design caught my attention, particularly the 3× upsampling for hidden features in the MLP layers (the general norm is 4×), the layer sharing, and the 832 embedding dimension. Curious to hear about the experimentation you did, the findings, and any failed attempts.
Also, is this purely empirical, or does it build on other foundational work?
Hoping you will cover all of this in the upcoming blog.
Hi, thanks a lot, really appreciate it!
Great question. The 4× intermediate size in MLPs is the standard for non-GLU variants, but for GLU-based variants we typically use ~3×. That’s because GLU involves three matrix multiplications, so the overall parameter count and compute end up being quite comparable to a 4× non-GLU setup.
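In case it helps visualize the trade-off, here's a minimal PyTorch sketch of a GLU-style MLP at ~3× expansion, with the rough parameter arithmetic in the comments. The SiLU gating, the absence of biases, and the exact widths are assumptions for illustration, not a claim about our actual implementation:

```python
import torch
import torch.nn as nn

class GLUMLP(nn.Module):
    """Gated MLP block: three projections (gate, up, down) instead of the
    usual two, so a ~3x intermediate width keeps the parameter budget close
    to a 4x non-GLU MLP. Dimensions here are illustrative only."""

    def __init__(self, d_model: int = 832, expansion: float = 3.0):
        super().__init__()
        d_hidden = int(d_model * expansion)
        self.gate_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.up_proj = nn.Linear(d_model, d_hidden, bias=False)
        self.down_proj = nn.Linear(d_hidden, d_model, bias=False)
        self.act = nn.SiLU()  # SwiGLU-style gating; the actual activation may differ

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # output = W_down( act(W_gate x) * (W_up x) )
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

# Rough parameter comparison per block (ignoring biases/norms):
#   GLU at 3x:      3 * (d * 3d) = 9 d^2
#   non-GLU at 4x:  2 * (d * 4d) = 8 d^2   -> comparable overall budget
```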
On layer sharing, yes, that was a deliberate choice. At ~150M scale, models tend to have limited capacity, so sharing layers effectively increases the depth (you can think of it as ~2× effective layers), which helps improve performance without a proportional increase in parameters.
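A rough sketch of the layer-sharing idea: each parameter block is applied twice per forward pass, so N physical layers behave like ~2N effective layers. The exact sharing pattern (per-block repetition vs. looping the whole stack) is an assumption here, shown only to illustrate the concept:

```python
import torch
import torch.nn as nn

class SharedLayerStack(nn.Module):
    """Illustrative layer sharing: each physical block is reused `repeats`
    times, increasing effective depth without adding parameters."""

    def __init__(self, block_fn, num_blocks: int, repeats: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([block_fn() for _ in range(num_blocks)])
        self.repeats = repeats

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            for _ in range(self.repeats):  # same weights applied again
                x = block(x)
        return x

# Hypothetical usage: 6 parameter blocks looped twice -> ~12 effective layers
stack = SharedLayerStack(
    lambda: nn.TransformerEncoderLayer(d_model=832, nhead=8, batch_first=True),
    num_blocks=6,
    repeats=2,
)
```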
We did run multiple ablations and iterations before converging on this design; it wasn't a one-shot decision. Some configurations didn't perform as well in terms of stability and efficiency, which guided us toward the current setup.
Right now, we’re pre-training larger models (500M and 1B). Once that’s done, we’re planning to publish a detailed technical report covering the architecture decisions, experiments, and lessons learned.
Great!
Thanks for the response. Hope you will cover failed experiments and ablations as well.
Rooting for you!