Code for the paper 'Mediocrity is the key for LLM as a Judge Anchor Selection'. This project enables systematic pairwise evaluation of multiple models on the Arena-hard and AlpacaEval datasets and examines the effect of the chosen 'anchor', i.e., the model against which all other evaluated models are compared.
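
To make the anchor-based setup concrete, here is a minimal sketch of how pairwise judgments against a single anchor could be turned into a model ranking. It is an illustration only, not this project's code: the function names, the `(model, outcome)` input format, and the choice to count a tie as half a win are all assumptions.

```python
from collections import defaultdict

# Assumed scoring convention (not taken from the paper): a tie counts as half a win.
SCORES = {"win": 1.0, "tie": 0.5, "loss": 0.0}


def win_rates_vs_anchor(judgments):
    """Average each model's outcomes against one fixed anchor model.

    `judgments` is a hypothetical list of (model, outcome) pairs, where
    outcome is "win", "tie", or "loss" from the evaluated model's side.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for model, outcome in judgments:
        totals[model] += SCORES[outcome]
        counts[model] += 1
    return {model: totals[model] / counts[model] for model in totals}


def rank_models(judgments):
    """Order models from highest to lowest win rate against the anchor."""
    rates = win_rates_vs_anchor(judgments)
    return sorted(rates, key=rates.get, reverse=True)


# Toy judgments, invented purely for illustration.
example = [
    ("model_a", "win"),
    ("model_a", "loss"),
    ("model_b", "win"),
    ("model_b", "tie"),
]
```

Repeating the same computation with judgments collected against a different anchor, then comparing the resulting rankings, is one simple way to see how much the anchor choice matters.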