This leaderboard is sorted by F-score.
Overall results on the five metrics (CO, IN, NA, CGA, F-score) and F-scores on the four primary categories (ENG, NAT, SCI, SAC); all values are percentages.

| # | Model | Organization | LLM Params | Date | CO | IN | NA | CGA | F-score | ENG | NAT | SCI | SAC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4.5 | OpenAI | - | 2025-02 | 52.9 | 42.5 | 4.6 | 55.4 | 54.1 | 49.5 | 57.5 | 57.9 | 51.4 |
| 2 | InternVL3-2B | Shanghai AI Lab | 2B | 2025-04 | 15.5 | 80.5 | 4.0 | 16.1 | 15.8 | 10.9 | 23.4 | 16.4 | 14.8 |
| 3 | InternVL3-38B | Shanghai AI Lab | 38B | 2025-04 | 31.4 | 67.7 | 0.9 | 31.7 | 31.5 | 21.3 | 33.3 | 35.7 | 31.8 |
| 4 | InternVL3-1B | Shanghai AI Lab | 1B | 2025-04 | 11.3 | 82.5 | 6.1 | 12.1 | 11.7 | 6.1 | 19.2 | 12.1 | 11.2 |
| 5 | InternVL3-9B | Shanghai AI Lab | 9B | 2025-04 | 22.6 | 72.9 | 4.5 | 23.7 | 23.1 | 12.8 | 33.2 | 27.9 | 19.9 |
| 6 | InternVL3-8B | Shanghai AI Lab | 8B | 2025-04 | 23.3 | 75.2 | 1.5 | 23.7 | 23.5 | 16.2 | 30.7 | 25.6 | 22.4 |
| 7 | InternVL3-78B | Shanghai AI Lab | 78B | 2025-04 | 33.7 | 65.6 | 0.7 | 33.9 | 33.8 | 25.4 | 41.2 | 38.6 | 30.6 |
| 8 | InternVL3-14B | Shanghai AI Lab | 14B | 2025-04 | 24.9 | 73.3 | 1.8 | 25.4 | 25.2 | 14.6 | 32.3 | 28.4 | 24.6 |
| 9 | LLaVA-1.5-7B | Llava Hugging Face | 7B | 2023-09 | 16.1 | 78.5 | 5.4 | 17.1 | 16.6 | 8.9 | 19.2 | 19.0 | 17.1 |
| 10 | LLaVA-1.5-13B | Llava Hugging Face | 13B | 2023-09 | 19.3 | 76.7 | 4.1 | 20.1 | 19.7 | 11.6 | 21.2 | 21.9 | 20.8 |
| 11 | Qwen2.5-VL-72B | Alibaba | 72B | 2025-02 | 38.7 | 57.3 | 4.0 | 40.3 | 39.5 | 26.1 | 47.0 | 48.2 | 34.7 |
| 12 | Qwen2.5-VL-7B | Alibaba | 7B | 2025-02 | 24.7 | 71.2 | 4.1 | 25.8 | 25.3 | 13.8 | 25.6 | 30.8 | 25.1 |
| 13 | Qwen2.5-VL-3B | Alibaba | 3B | 2025-02 | 22.3 | 74.5 | 3.2 | 23.0 | 22.6 | 12.1 | 30.3 | 29.2 | 18.6 |
| 14 | Qwen2.5-VL-32B | Alibaba | 32B | 2025-02 | 30.3 | 67.1 | 2.7 | 31.1 | 30.7 | 18.1 | 39.3 | 37.4 | 27.0 |
| 15 | Qwen2-VL-2B | Alibaba | 2B | 2024-08 | 16.3 | 73.3 | 10.4 | 18.2 | 17.2 | 12.9 | 24.9 | 16.4 | 17.4 |
| 16 | Qwen2-VL-72B | Alibaba | 72B | 2024-08 | 32.7 | 59.0 | 8.3 | 35.7 | 34.2 | 20.2 | 39.0 | 40.0 | 33.2 |
| 17 | Qwen2-VL-7B | Alibaba | 7B | 2024-08 | 22.4 | 69.4 | 8.2 | 24.4 | 23.4 | 15.9 | 23.9 | 25.1 | 25.0 |
| 18 | LLaVA-NeXT-Video-34B | Llava Hugging Face | 34B | 2024-04 | 11.2 | 83.8 | 4.9 | 11.8 | 11.5 | 7.6 | 11.5 | 10.3 | 14.5 |
| 19 | LLaVA-NeXT-Video-7B | Llava Hugging Face | 7B | 2024-04 | 9.3 | 52.9 | 37.8 | 14.9 | 11.4 | 7.4 | 15.1 | 14.5 | 8.7 |
| 20 | LLaVA-OneVision-72B | Llava Hugging Face | 72B | 2024-08 | 25.4 | 73.6 | 1.0 | 25.7 | 25.5 | 15.9 | 25.3 | 28.5 | 27.3 |
| 21 | LLaVA-OneVision-7B | Llava Hugging Face | 7B | 2024-08 | 18.9 | 76.6 | 4.5 | 19.8 | 19.3 | 12.1 | 26.3 | 21.3 | 18.4 |
| 22 | LLaVA-OneVision-0.5B | Llava Hugging Face | 0.5B | 2024-08 | 7.8 | 85.7 | 6.5 | 8.3 | 8.0 | 6.1 | 11.6 | 5.8 | 10.0 |
| 23 | o3 | OpenAI | - | 2025-04 | 66.3 | 33.6 | 0.1 | 66.4 | 66.3 | 63.0 | 71.3 | 63.5 | 68.8 |
| 24 | o4-mini | OpenAI | - | 2025-04 | 53.7 | 45.3 | 0.9 | 54.2 | 54.0 | 44.3 | 59.4 | 56.8 | 54.0 |
| 25 | GPT-4o | OpenAI | - | 2024-05 | 47.7 | 45.9 | 6.4 | 51.0 | 49.3 | 45.1 | 57.1 | 52.7 | 45.4 |
| 26 | DeepSeek-VL2-Small | DeepSeek | 2.8B | 2024-12 | 5.9 | 52.1 | 42.1 | 10.1 | 7.4 | 3.9 | 10.1 | 9.5 | 6.2 |
| 27 | DeepSeek-VL2 | DeepSeek | 4.5B | 2024-12 | 3.2 | 49.1 | 47.7 | 6.1 | 4.2 | 3.0 | 4.4 | 3.0 | 5.9 |
| 28 | DeepSeek-VL2-Tiny | DeepSeek | 1B | 2024-12 | 16.1 | 75.6 | 8.3 | 17.6 | 16.8 | 9.6 | 27.1 | 17.5 | 16.0 |
| 29 | Claude Sonnet 4 | Anthropic | - | 2025-05 | 32.8 | 51.2 | 16.0 | 39.0 | 35.6 | 33.0 | 34.3 | 37.9 | 35.0 |
| 30 | Claude 3.7 Sonnet | Anthropic | - | 2025-02 | 32.6 | 47.3 | 20.1 | 40.8 | 36.2 | 24.2 | 40.3 | 41.9 | 34.5 |
| 31 | Claude 3.5 Sonnet v2 | Anthropic | - | 2024-10 | 33.7 | 58.2 | 8.1 | 36.7 | 35.2 | 26.1 | 37.2 | 38.5 | 35.4 |
| 32 | Claude 3.5 Sonnet | Anthropic | - | 2024-06 | 31.5 | 53.5 | 15.0 | 37.0 | 34.0 | 26.8 | 36.0 | 38.1 | 32.5 |
| 33 | Gemini 2.5 Pro | Google | - | 2025-03 | 61.2 | 34.3 | 4.5 | 64.1 | 62.6 | 53.5 | 65.8 | 67.1 | 61.5 |
| 34 | Gemini 2.5 Flash | Google | - | 2025-03 | 53.7 | 34.9 | 11.3 | 60.6 | 57.0 | 46.0 | 61.6 | 61.4 | 56.2 |
| 35 | Qwen-VL-Max | Alibaba | - | 2024-01 | 39.2 | 57.1 | 3.7 | 40.7 | 39.9 | 27.4 | 46.2 | 48.8 | 35.1 |
| 36 | Qwen-VL-Plus | Alibaba | - | 2023-11 | 21.9 | 63.3 | 14.7 | 25.7 | 23.7 | 10.5 | 25.6 | 30.5 | 21.8 |
| 37 | Keye-VL | Kwai-Keye | 8B | 2025-06 | 25.4 | 53.9 | 20.7 | 32.0 | 28.3 | 15.3 | 23.1 | 37.2 | 26.6 |
| 38 | Kimi-VL | MoonshotAI | 2.8B | 2025-04 | 18.3 | 44.4 | 37.3 | 29.1 | 22.4 | 14.9 | 19.8 | 25.2 | 24.5 |
Figure: The taxonomy of the Video SimpleQA benchmark. (a) Video distribution at the secondary level; (b) question type distribution; (c) key statistics of Video SimpleQA.

Figure: Comparisons with existing video benchmarks regarding video domain, knowledge-driven focus, emphasis on factuality, and provision of supporting evidence.

Figure: An overview of the Video SimpleQA construction pipeline, covering video & encyclopedia collection, QA annotation, and quality control.
Table: Evaluation results (%) of open-source and proprietary multi-modal LLMs on Video SimpleQA. For metrics, CO, NA, IN, and CGA denote āCorrectā, āNot attemptedā, āIncorrectā, and āCorrect given attemptedā, respectively. For subtopics, ENG, NAT, SCI, and SAC represent āEngineeringā, āNatureā, āScienceā, and āSociety and Cultureā.
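The five metrics are not independent: a plausible reading (following the SimpleQA metric convention, which this page does not spell out, so treat it as an assumption) is that CGA = CO / (CO + IN) and the F-score is the harmonic mean of CO and CGA. The hypothetical helper below sketches this derivation; it reproduces the GPT-4.5 row up to rounding of the published percentages.

```python
def derived_metrics(co: float, incorrect: float) -> tuple[float, float]:
    """Derive CGA and F-score from CO and IN percentages.

    Assumption (SimpleQA convention, not stated on the leaderboard):
      CGA     = CO / (CO + IN)          -- accuracy over attempted questions
      F-score = harmonic mean of CO and CGA
    NA does not enter either formula; it only lowers CO directly.
    """
    cga = 100 * co / (co + incorrect)
    f_score = 2 * co * cga / (co + cga)
    return cga, f_score

# Sanity check against the GPT-4.5 row (CO=52.9, IN=42.5):
# published values are CGA=55.4 and F-score=54.1; recomputing from the
# rounded percentages lands within ~0.1 of both.
cga, f = derived_metrics(52.9, 42.5)
print(f"CGA={cga:.1f}  F-score={f:.1f}")
```

Small discrepancies against the table are expected, since the published numbers are presumably computed from raw question counts rather than from already-rounded percentages.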