Repo for NeurIPS 2024 paper "Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes" - View it on GitHub
Star
1
Rank
5327228