Logarithmic-time Schedules for Scaling Language Models with Momentum
arXiv:2602.05298v2 (replace-cross)

Abstract: In practice, the hyperparameters $(\beta_1, \beta_2)$ and the weight decay $\lambda$ in AdamW are typically kept at fixed values. Is there any reason to do otherwise? We show that for large-scale language model training, th...
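For context, the hyperparameters in question enter the standard AdamW update, where $\beta_1$ controls the momentum average, $\beta_2$ the second-moment average, and $\lambda$ the decoupled weight decay. The sketch below shows a scalar AdamW step plus one *hypothetical* way a momentum hyperparameter could follow a schedule in step $t$ instead of staying fixed; the abstract is truncated, so `beta2_schedule` is an illustrative guess, not the paper's actual rule.

```python
import math

def adamw_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One standard AdamW update on a scalar parameter p with gradient g.

    t is the 1-indexed step count, used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * g          # first-moment (momentum) average
    v = beta2 * v + (1 - beta2) * g * g      # second-moment average
    m_hat = m / (1 - beta1 ** t)             # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: lambda * p is applied outside the adaptive step.
    p = p - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * p)
    return p, m, v

def beta2_schedule(t, beta2_0=0.95, tau=1000.0):
    # Hypothetical schedule (NOT from the paper): the effective averaging
    # horizon 1 / (1 - beta2) grows linearly with t, so beta2 -> 1 as
    # training proceeds rather than staying at a fixed value.
    return 1.0 - (1.0 - beta2_0) / (1.0 + t / tau)
```

A scheduled run would simply pass `beta2=beta2_schedule(t)` into `adamw_step` at each step; the question the paper raises is whether such time-dependent choices of $(\beta_1, \beta_2, \lambda)$ beat the usual fixed defaults at scale.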
🔗 Read more: https://arxiv.org/abs/2602.05298
#News #Policy #Software #Energy #Academic