Logarithmic-time Schedules for Scaling Language Models with Momentum

arXiv:2602.05298v2 Announce Type: replace-cross Abstract: In practice, the hyperparameters $(\beta_1, \beta_2)$ and weight-decay $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, th...

🔗 Read more: https://arxiv.org/abs/2602.05298
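The abstract (truncated above) asks whether AdamW's momentum parameters $(\beta_1, \beta_2)$ should stay fixed throughout training. Since the full schedule from the paper is not visible here, the snippet below is only a hypothetical sketch of what a logarithmic-time schedule might look like: interpolating a momentum value along a log-scaled training axis rather than a linear one. The function name, endpoints, and interpolation are all illustrative assumptions, not the paper's method.

```python
import math

def log_time_beta(step, beta_min=0.9, beta_max=0.999, total_steps=100_000):
    """Hypothetical momentum schedule on a log-time axis.

    Illustrative only -- the actual schedules proposed in the paper are
    not reproduced here. This simply interpolates between beta_min and
    beta_max using log(1 + step) instead of the raw step count, so the
    value changes quickly early in training and flattens out later.
    """
    frac = math.log(1 + step) / math.log(1 + total_steps)  # in [0, 1]
    return beta_min + (beta_max - beta_min) * frac

# The schedule starts at beta_min and reaches beta_max at total_steps.
print(log_time_beta(0))        # 0.9
print(log_time_beta(100_000))  # ~0.999
```

On a log-time axis, most of the change happens in the first few thousand steps, which is one plausible motivation for such schedules in large-scale training where early optimization dynamics differ sharply from late-stage dynamics.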

#News #Policy #Software #Energy #Academic
