Recent advances in large language models (LLMs) have increased the demand for comprehensive benchmarks to evaluate their capabilities as human-like agents. Existing benchmarks, while useful, often focus on specific application scenarios, emphasizing task completion but failing to dissect the underlying skills that drive these outcomes. This lack of granularity makes it difficult to discern where failures stem from. Additionally, setting up these environments requires considerable effort, and issues of unreliability and reproducibility sometimes arise, especially in interactive tasks. To address these limitations, we introduce the Massive Multitask Agent Understanding (MMAU) benchmark, featuring comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming, and Mathematics, and covers five essential capabilities: Understanding, Reasoning, Planning, Problem-solving, and Self-correction. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents. By testing 18 representative models on MMAU, we provide deep and insightful analyses. Ultimately, MMAU not only sheds light on the capabilities and limitations of LLM agents but also enhances the interpretability of their performance.