EgoToM is an egocentric theory-of-mind benchmark built on Ego4D videos. Its multiple-choice questions evaluate how well multimodal large language models can infer a camera wearer's goals, in-the-moment belief states, and future actions.