[ICLR'26] Official code for "Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training"