CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents

Xu T, Chen L, Wu D, Chen Y, Zhang Z, Yao X, Xie Z, Chen Y, Liu S, Qian B, Yang A, Jin Z, et al.

The development of autonomous agents increasingly relies on Multimodal Language Models (MLMs) to perform tasks described in natural language within GUI
environments such as websites, desktop computers, and mobile phones. Existing
benchmarks for MLM agents in interactive environments are limited by their focus
on a single environment, lack of detailed and generalized evaluation methods,
and the complexities of constructing tasks and evaluators. To overcome these
limitations, we introduce CRAB, the first agent benchmark framework designed to
support cross-environment tasks, incorporating a graph-based fine-grained evaluation method and an efficient mechanism for task and evaluator construction. Our
framework supports multiple devices and can be easily extended to any environment with a Python interface. Leveraging CRAB, we developed a cross-platform
Crab Benchmark-v0 comprising 120 tasks in computer desktop and mobile phone
environments. We evaluated four advanced MLMs in both single-agent and
multi-agent configurations on this benchmark. The experimental results
show that a single agent backed by GPT-4o achieves the best completion ratio,
at 38.01%.
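The graph-based fine-grained evaluation described above can be illustrated with a minimal sketch. This is a hypothetical implementation, not the CRAB API: the `GraphEvaluator` class, node names, and checker functions below are all illustrative assumptions. The idea it demonstrates is that a task is decomposed into sub-goal checkpoints arranged in a DAG, and the agent receives partial credit as a completion ratio (the fraction of satisfied sub-goals, respecting prerequisites) rather than a binary pass/fail.

```python
# Hypothetical sketch of graph-based fine-grained task evaluation
# (illustrative only; not the actual CRAB framework interface).
from typing import Callable, Dict, List

class GraphEvaluator:
    """Evaluates task progress over a DAG of sub-goal checkpoints."""

    def __init__(self) -> None:
        self.checkers: Dict[str, Callable[[dict], bool]] = {}
        self.requires: Dict[str, List[str]] = {}  # node -> prerequisite nodes

    def add_node(self, name: str, checker: Callable[[dict], bool],
                 requires: List[str] = ()) -> None:
        """Register a sub-goal with a checker over the environment state."""
        self.checkers[name] = checker
        self.requires[name] = list(requires)

    def completion_ratio(self, env_state: dict) -> float:
        """Fraction of sub-goals completed; a node counts only if all
        of its prerequisites are also completed."""
        memo: Dict[str, bool] = {}

        def done(node: str) -> bool:
            if node not in memo:
                memo[node] = (all(done(p) for p in self.requires[node])
                              and self.checkers[node](env_state))
            return memo[node]

        completed = sum(done(n) for n in self.checkers)
        return completed / len(self.checkers)

# Example: a two-step task "open the browser, then visit a page".
ev = GraphEvaluator()
ev.add_node("browser_open", lambda s: s.get("browser_open", False))
ev.add_node("page_visited", lambda s: s.get("url") == "https://example.com",
            requires=["browser_open"])

print(ev.completion_ratio({"browser_open": True}))  # 0.5: one of two sub-goals met
```

Under this scheme an agent that opens the browser but never reaches the page still scores 0.5 instead of 0, which is the kind of partial-credit signal a fine-grained evaluator provides.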