
VideoGameBench is a rigorous benchmark that evaluates VLMs’ real-time decision-making, perception, memory, and planning by challenging them to complete 1990s-era video games with only raw visual inputs and minimal control instructions.
Key Highlights
- Real-Time, Visually Rich Environments – Evaluates VLMs on 23 popular Game Boy and MS-DOS games, including 3 secret test games to assess generalization across unseen environments.
- Raw Pixels Only – No game-specific APIs, state overlays, or memory modules; models rely solely on frame-level screenshots and high-level objective/control text.
- VG-Agent Scaffold – Implements ReAct-style agents with a textual scratchpad memory, system/game prompts, and historical frame context for sequential action generation (see the agent-loop sketch after this list).
- Strict Evaluation Rules – No external tools, RAM access, overlays, or human guidance are allowed. This contrasts with past agents like “Gemini Plays Pokémon”, which relied on engineered pathfinding tools.
- Zero-Shot Evaluation in Two Latency Settings – Benchmarked in two modes: full real-time (VideoGameBench) and a latency-free mode (VideoGameBench Lite), where the emulator pauses while the agent reasons.
- Automated Checkpoint-Based Scoring – Progress is measured by perceptually hashing game frames and matching them against checkpoint frames drawn from YouTube walkthroughs, enabling milestone-level scoring (see the hashing sketch after this list).
- Cross-Genre Coverage – Games span platformer, FPS, strategy, RPG, puzzle, and racing genres, requiring spatial reasoning, object interaction, resource management, and fine motor control.
- Challenging Results – The best model (Gemini 2.5 Pro) completed just 0.48% of VideoGameBench in real time and 1.6% of VideoGameBench Lite. No model reached the first checkpoint in 9 of the 10 games.
- Diagnostic Game Failures – Frontier VLMs failed at basic tasks (e.g., dragging, 2D grid navigation, clicking) in custom practice games, revealing fundamental deficits in spatial grounding.
- Common Failure Modes – Models exhibit a “knowing–doing gap” (identifying the correct action but failing to execute it properly), memory overwriting, repeated action loops, hallucinated progress, and visual misperceptions.
- Open Tools & Code – Benchmark code, prompts, and interface are open-source to drive transparent evaluation and community improvements.
Paper Resources:
- Project: https://www.vgbench.com/
- Paper: https://arxiv.org/abs/2505.18134
- GitHub: https://github.com/alexzhang13/VideoGameBench
Related articles from LearnOpenCV
- NVIDIA COSMOS REASON1: https://learnopencv.com/cosmos-reason-vlm-video-vqa/
- GR00T N1.5: https://learnopencv.com/gr00t-n1_5-explained/