Official code repository for "Self-Jailbreaking: Language Models Can Reason Themselves Out of Safety Alignment After Benign Reasoning Training"