Overview. VAEX-Bench is a benchmark for evaluating abstractive spatiotemporal reasoning in Multimodal Large Language Models (MLLMs) using high-fidelity egocentric videos rendered from manually constructed 3D indoor environments. We design each episode as a controllable scenario in Trimble SketchUp and render photorealistic views with Chaos Enscape, which enables strict spatial consistency and unambiguous ground truth. VAEX-Bench decomposes beyond-the-visible reasoning into five tasks (Memory-Action, Map Direction, Map Scale, Simulation, and Global Counting) that jointly test long-horizon memory, allocentric map construction (direction and distance), viewpoint transformation, and global aggregation under partial observability. All QA pairs are human-authored and verified, and we provide task-specific error analyses that diagnose failure modes, revealing where current models break down when they must integrate dispersed cues across time and space rather than rely on frame-local evidence.
We visualize model performance across all 10 abstractive and extractive reasoning tasks to make cross-model strengths and weaknesses easy to compare at a glance.
Each model's abstractive and extractive averages are shown side by side.
Rows correspond to models, and columns correspond to the five abstractive and five extractive tasks.
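A heatmap of this kind can be drawn with standard plotting tools. The sketch below is only illustrative: the model names and scores are placeholders, not VAEX-BENCH results.

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder scores: rows are models, columns are the 10 tasks
# (five abstractive followed by five extractive). Replace with real results.
models = ["Model A", "Model B", "Model C"]
tasks = ["Memory-Action", "Map Direction", "Map Scale", "Simulation",
         "Global Counting", "Appearance Order", "Relative Direction",
         "Relative Distance", "Route Plan", "Object Counting"]
scores = np.random.uniform(0, 100, size=(len(models), len(tasks)))

fig, ax = plt.subplots(figsize=(10, 3))
im = ax.imshow(scores, cmap="viridis", vmin=0, vmax=100)
ax.set_xticks(range(len(tasks)))
ax.set_xticklabels(tasks, rotation=45, ha="right")
ax.set_yticks(range(len(models)))
ax.set_yticklabels(models)
fig.colorbar(im, ax=ax, label="Score")
fig.tight_layout()
plt.show()
```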
The growing interest in embodied agents increases the demand for spatiotemporal video understanding, yet existing benchmarks largely emphasize extractive reasoning, where answers are explicitly present in the observed spatiotemporal events.
It remains unclear whether multimodal large language models can instead perform abstractive spatiotemporal reasoning, which requires integrating observations over time, combining dispersed cues, and inferring implicit spatial and contextual structure.
To address this gap, we formalize abstractive spatiotemporal reasoning from videos by introducing a structured evaluation taxonomy that systematically targets its core dimensions, and we construct a controllable, scenario-driven synthetic egocentric video dataset, spanning object-, room-, and floor-plan-level scenarios, tailored to evaluating these capabilities.
Based on this framework, we present VAEX-BENCH, a benchmark comprising five abstractive reasoning tasks together with their extractive counterparts.
Our extensive experiments compare the performance of state-of-the-art MLLMs under extractive and abstractive settings, exposing their limitations on abstractive tasks and providing a fine-grained analysis of the underlying bottlenecks. The dataset will be released soon.
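The exact evaluation protocol is not spelled out in this section. The sketch below shows one plausible setup, assuming uniformly sampled frames and a multiple-choice answer format (suggested by the Random baseline in the tables below); `query_model` is a hypothetical wrapper around an arbitrary MLLM API, not part of VAEX-BENCH.

```python
import cv2  # pip install opencv-python


def sample_frames(video_path, num_frames=32):
    """Uniformly sample frames from an egocentric video (assumed frame-based input)."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames


def evaluate(episodes, query_model):
    """Score multiple-choice QA pairs.

    `query_model` is a hypothetical callable (frames, question, options) -> option letter.
    Each episode dict is assumed to hold 'video', 'question', 'options', 'answer'.
    """
    correct = 0
    for ep in episodes:
        frames = sample_frames(ep["video"])
        pred = query_model(frames, ep["question"], ep["options"])
        correct += (pred.strip().upper() == ep["answer"])
    return 100.0 * correct / len(episodes)
```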
**Results on the five abstractive tasks.**

| Model | Memory-Action | Map Direction | Map Scale | Simulation | Global Counting | Avg. |
|---|---|---|---|---|---|---|
| GPT-5.2 (Closed) | 38.0 | 26.0 | 34.0 | 29.3 | 23.3 | 30.1 |
| Gemini-3 Flash (Closed) | 60.7 | 34.0 | 24.0 | 51.3 | 31.3 | 40.3 |
| Gemini-3 Pro (Closed) | 52.0 | 22.7 | 22.0 | 31.3 | 20.7 | 29.7 |
| Claude 4.5 Haiku (Closed) | 19.3 | 21.3 | 11.3 | 22.0 | 2.7 | 15.3 |
| Claude 4.5 Sonnet (Closed) | 30.0 | 19.3 | 19.3 | 40.7 | 11.3 | 24.1 |
| Qwen3-VL-2B (Open) | 26.0 | 18.7 | 19.3 | 26.0 | 6.7 | 19.3 |
| Qwen3-VL-8B (Open) | 34.0 | 22.0 | 15.3 | 34.0 | 17.3 | 24.5 |
| Qwen3-VL-32B (Open) | 40.0 | 26.0 | 23.3 | 42.7 | 17.3 | 29.9 |
| Qwen3-VL-235B (Open) | 43.3 | 16.7 | 13.3 | 46.7 | 13.3 | 26.7 |
| InternVL3.5-2B (Open) | 28.0 | 22.7 | 20.0 | 21.3 | 14.0 | 21.2 |
| InternVL3.5-8B (Open) | 36.7 | 24.0 | 21.3 | 26.0 | 16.0 | 24.8 |
| Mistral-3.2-24B (Open) | 28.7 | 16.7 | 10.0 | 18.7 | 14.0 | 17.6 |
| Gemma-3-27B (Open) | 32.0 | 24.0 | 23.3 | 28.7 | 10.7 | 23.7 |
| GLM-4.6V-106B (Open) | 35.3 | 18.0 | 20.7 | 37.3 | 6.0 | 23.5 |
| Random | 30.7 | 22.0 | 18.0 | 35.3 | - | 26.5 |
| Human Performance | 89.3 | 83.3 | 60.0 | 93.3 | 82.7 | 81.7 |
**Results on the five extractive tasks.**

| Model | Appearance Order | Relative Direction | Relative Distance | Route Plan | Object Counting | Avg. |
|---|---|---|---|---|---|---|
| GPT-5.2 (Closed) | 73.3 | 26.0 | 56.0 | 18.7 | 48.7 | 44.5 |
| Gemini-3 Flash (Closed) | 93.3 | 40.0 | 62.7 | 16.0 | 38.0 | 50.0 |
| Gemini-3 Pro (Closed) | 96.0 | 44.7 | 66.7 | 26.0 | 44.7 | 55.6 |
| Claude 4.5 Haiku (Closed) | 34.0 | 30.0 | 26.7 | 30.0 | 40.0 | 32.1 |
| Claude 4.5 Sonnet (Closed) | 72.7 | 24.7 | 28.0 | 13.3 | 30.7 | 33.9 |
| Qwen3-VL-2B (Open) | 82.0 | 29.3 | 42.7 | 22.7 | 40.0 | 43.3 |
| Qwen3-VL-8B (Open) | 80.0 | 33.3 | 54.0 | 25.3 | 34.0 | 45.3 |
| Qwen3-VL-32B (Open) | 90.0 | 26.0 | 53.3 | 6.7 | 51.3 | 45.5 |
| Qwen3-VL-235B (Open) | 93.3 | 38.0 | 54.0 | 20.0 | 43.3 | 49.7 |
| InternVL3.5-2B (Open) | 64.0 | 30.7 | 36.7 | 20.0 | 30.7 | 36.4 |
| InternVL3.5-8B (Open) | 73.3 | 28.7 | 36.0 | 20.7 | 32.0 | 38.1 |
| Mistral-3.2-24B (Open) | 28.7 | 28.0 | 40.0 | 17.3 | 40.7 | 30.9 |
| Gemma-3-27B (Open) | 55.3 | 33.3 | 28.0 | 28.7 | 49.3 | 38.9 |
| GLM-4.6V-106B (Open) | 85.3 | 29.3 | 58.4 | 10.0 | 52.0 | 46.9 |
| Random | 15.3 | 22.0 | 42.0 | 20.0 | - | 24.8 |
| Human Performance | 93.3 | 88.0 | 77.3 | 88.7 | 92.7 | 88.0 |
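As a sanity check on the tables above, the Avg. column is consistent with an unweighted mean over the available task scores, with missing entries (the Random rows' counting columns) excluded. A minimal verification sketch:

```python
def row_average(scores):
    """Unweighted mean over available task scores, ignoring missing entries ('-')."""
    vals = [s for s in scores if isinstance(s, (int, float))]
    return round(sum(vals) / len(vals), 1)

# Rows taken from the tables above.
print(row_average([38.0, 26.0, 34.0, 29.3, 23.3]))  # GPT-5.2, abstractive -> 30.1
print(row_average([30.7, 22.0, 18.0, 35.3, "-"]))   # Random, abstractive  -> 26.5
print(row_average([15.3, 22.0, 42.0, 20.0, "-"]))   # Random, extractive   -> 24.8
```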
We analyze failures on Global Counting, which requires models to detect and track object instances before aggregating them into global counts. To obtain a granular view of model failures, we decompose errors into three mutually exclusive categories.
We analyze failures on Memory-Action, which requires models to recall the sequence of visited rooms along the trajectory and use that temporal memory to identify the correct action associated with a target location. To disentangle different sources of failure, we perform a fine-grained error analysis along two hierarchical components.
We analyze failures on Map Direction, Map Scale, and Simulation, which all require a coherent spatial representation of the environment. Models must reason about relative room orientation, maintain consistent metric proportions, and simulate viewpoint transformations within a unified layout. To diagnose these spatial bottlenecks, we evaluate whether the MLLMs can explicitly reconstruct a global cognitive map via grid-based floor-plan prediction.
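The exact cognitive-map protocol is not detailed in this section. As a hedged sketch, a grid-based floor-plan prediction could be scored against the ground-truth layout with a simple cell-wise accuracy, where both maps are small grids of room labels; the grid size and label set below are illustrative assumptions rather than the benchmark's specification.

```python
import numpy as np


def cell_accuracy(pred_grid, gt_grid):
    """Fraction of grid cells whose predicted room label matches the ground truth."""
    pred, gt = np.asarray(pred_grid), np.asarray(gt_grid)
    assert pred.shape == gt.shape, "predicted and ground-truth grids must align"
    return float((pred == gt).mean())


# Illustrative 4x4 layout with hypothetical room labels.
gt = [["kitchen", "kitchen", "hall", "bedroom"],
      ["kitchen", "living",  "hall", "bedroom"],
      ["living",  "living",  "hall", "bath"],
      ["living",  "living",  "hall", "bath"]]
pred = [row[:] for row in gt]
pred[0][3] = "bath"  # one mislabeled cell
print(cell_accuracy(pred, gt))  # 15/16 = 0.9375
```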
We present VAEX-BENCH, a scenario-driven benchmark for evaluating abstractive and extractive spatiotemporal reasoning in egocentric videos, moving beyond evidence-present extraction toward long-horizon memory, allocentric map reasoning, and global evidence aggregation.
Unlike prior benchmarks built on fixed captured footage, VAEX-BENCH is manually constructed with controlled scenario design and spatially consistent QA pairs, enabling precise diagnosis of failure modes.
Our experiments reveal persistent bottlenecks in global spatial memory and structured evidence integration across state-of-the-art MLLMs.
Future work will focus on automating the pipeline to scale benchmark generation while preserving its controllability.
@article{bang2026reasoning,
title={Reasoning over Video: Evaluating How MLLMs Extract, Integrate, and Reconstruct Spatiotemporal Evidence},
author={Bang, Seunghwan and Song, Hwanjun},
journal={arXiv preprint arXiv:2603.13091},
year={2026}
}