• Why do Learning Rates Transfer? Reconciling Optimization and Scaling Limits for Deep Learning (arXiv:2402.17457)
• Curvature-Informed SGD via General Purpose Lie-Group Preconditioners (arXiv:2402.04553)
• TextGrad: Automatic "Differentiation" via Text (arXiv:2406.07496)
• Surge Phenomenon in Optimal Learning Rate and Batch Size Scaling (arXiv:2405.14578)
• Fast Benchmarking of Accuracy vs. Training Time with Cyclic Learning Rates (arXiv:2206.00832)
• Large Language Models as Markov Chains (arXiv:2410.02724)
• Old Optimizer, New Norm: An Anthology (arXiv:2409.20325)
• Scaling Law with Learning Rate Annealing (arXiv:2408.11029)
• What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective (arXiv:2410.23743)
• ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models (arXiv:2410.09637)
• In-context learning and Occam's razor (arXiv:2410.14086)
• nGPT: Normalized Transformer with Representation Learning on the Hypersphere (arXiv:2410.01131)
• Cautious Optimizers: Improving Training with One Line of Code (arXiv:2411.16085)
• MARS: Unleashing the Power of Variance Reduction for Training Large Models (arXiv:2411.10438)
• Understanding Gradient Descent through the Training Jacobian (arXiv:2412.07003)
• Position: Don't use the CLT in LLM evals with fewer than a few hundred datapoints (arXiv:2503.01747)
• Small Batch Size Training for Language Models: When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful (arXiv:2507.07101)