An LLM fine-tuning pipeline that includes a separate critic reward model, implements PPO and GRPO, and evaluates each of them individually.