In this project, we develop an enterprise benchmark framework for large language model (LLM) evaluation. We extend HELM, an open-source benchmark framework developed by Stanford CRFM, to enable users to evaluate LLMs on domain-specific datasets in areas such as finance, legal, climate, and cybersecurity.