Suppressing Final Layer Hidden State Jumps in Transformer Pretraining
arXiv:2601.18302v1 Abstract: This paper discusses the internal behavior of Transformer language models. Many recent pre-trained models have been reported to exhibit only...
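To make the notion of a final-layer "hidden state jump" concrete, below is a minimal sketch (not the paper's method) of how one might measure the layer-to-layer change in a Transformer's hidden states, so that the last layer can be compared against earlier ones. The model name ("gpt2"), the example sentence, and the mean-L2-norm metric are illustrative assumptions, not details taken from the paper.

```python
# Sketch: measure per-layer hidden-state "jumps" (norm of the change each
# Transformer block applies), assuming a HuggingFace model that can return
# all intermediate hidden states. Metric and model choice are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # assumption: any model exposing hidden states works similarly
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

text = "Transformer language models process tokens layer by layer."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq_len, d_model];
# index 0 is the embedding output, index -1 is the final layer's output.
hidden_states = outputs.hidden_states

# Per-layer jump: mean L2 norm of the difference between consecutive layers' states.
jumps = []
for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
    delta = (curr - prev).norm(dim=-1)  # [batch, seq_len]
    jumps.append(delta.mean().item())

for layer_idx, jump in enumerate(jumps, start=1):
    print(f"layer {layer_idx:2d}: mean hidden-state jump = {jump:.3f}")
```

Under this metric, a disproportionately large value at the last layer would correspond to the final-layer jump the title refers to; the paper's own definition and suppression technique are not reproduced here.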