Manifold of Failure: Behavioral Attraction Basins in Language Models
arXiv:2602.22291v2 Announce Type: replace
Abstract: While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions the...
🔗 Read more: https://arxiv.org/abs/2602.22291
#News #AI #Psychology #Software #WorldNews #Policy #Academic