LogSage is an LLM-based system that analyzes AI training logs to extract errors, find root causes, and suggest recovery actions, reducing downtime and manual intervention in large GPU clusters. - View it on GitHub
Star
4
Rank
2639892