This leaderboard is sorted by F-score. To view other sorted results, please click on the corresponding cell.
**Proprietary multi-modal LLMs**

CO, IN↓, NA↓, CGA, and F-score give the overall results on 5 metrics (%); ENG, NAT, SCI, and SAC give the F-score on the 4 primary categories (%). ↓ marks metrics where lower is better.

| # | Model | Organization | LLM Params | Date | CO | IN↓ | NA↓ | CGA | F-score | ENG | NAT | SCI | SAC |
|---|-------|--------------|------------|------|------|------|------|------|---------|------|------|------|------|
| 1 | o1-preview | OpenAI | - | 2024-09 | 47.1 | 35.3 | 17.6 | 57.1 | 51.6 | 80.0 | 47.1 | 33.3 | 50.0 |
| 2 | Gemini-2.0-Flash | Google | - | 2024-12 | 41.5 | 28.7 | 29.8 | 59.1 | 48.8 | 56.8 | 41.8 | 30.6 | 55.8 |
| 3 | Gemini-2.0-Flash-Thinking | Google | - | 2024-12 | 45.9 | 41.0 | 13.1 | 52.8 | 49.1 | 41.7 | 42.4 | 34.8 | 70.6 |
| 4 | Doubao-1.5-vision-pro | ByteDance | - | 2025-01 | 29.6 | 18.5 | 51.9 | 61.6 | 40.0 | 39.4 | 35.5 | 29.8 | 50.1 |
| 5 | Doubao-vision-pro | ByteDance | - | 2025-01 | 37.0 | 38.1 | 24.9 | 49.2 | 42.2 | 46.8 | 35.5 | 27.3 | 52.3 |
| 6 | Doubao-vision-lite | ByteDance | - | 2025-01 | 17.3 | 15.2 | 67.5 | 53.3 | 26.2 | 28.4 | 24.7 | 14.0 | 29.0 |
| 7 | GPT-4o-mini | OpenAI | - | 2024-07 | 38.9 | 50.6 | 10.5 | 43.4 | 41.0 | 45.5 | 35.9 | 19.2 | 50.0 |
| 8 | GPT-4o | OpenAI | - | 2024-08 | 49.9 | 35.8 | 14.3 | 58.2 | 53.7 | 58.1 | 47.3 | 33.3 | 63.9 |
| 9 | GPT-4V | OpenAI | - | 2023-09 | 29.7 | 28.0 | 42.2 | 51.5 | 37.7 | 39.8 | 33.4 | 27.5 | 45.2 |
| 10 | Claude-3.5-Sonnet | Anthropic | - | 2024-06 | 36.9 | 40.2 | 22.9 | 47.8 | 41.7 | 49.2 | 35.2 | 29.7 | 47.9 |
| 11 | Claude-3.5-SonnetV2 | Anthropic | - | 2024-10 | 42.1 | 38.1 | 19.8 | 52.5 | 46.7 | 57.5 | 38.0 | 33.7 | 53.9 |
| 12 | Gemini-1.5-Pro | Google | - | 2024-09 | 50.1 | 34.2 | 15.7 | 59.4 | 54.4 | 57.0 | 50.3 | 44.1 | 60.6 |
| 13 | Gemini-1.5-Pro-Flash | Google | - | 2024-09 | 40.5 | 29.2 | 30.3 | 58.1 | 47.7 | 50.7 | 46.2 | 36.9 | 49.5 |
| 14 | Qwen-VL-MAX | Alibaba | - | 2024-01 | 36.7 | 43.3 | 19.9 | 45.9 | 40.8 | 47.7 | 33.0 | 23.5 | 51.1 |

**Open-source multi-modal LLMs**

| # | Model | Organization | LLM Params | Date | CO | IN↓ | NA↓ | CGA | F-score | ENG | NAT | SCI | SAC |
|---|-------|--------------|------------|------|------|------|------|------|---------|------|------|------|------|
| 1 | DeepSeek-VL2 | DeepSeek | - | 2024-12 | 23.6 | 57.7 | 18.8 | 29.0 | 26.0 | 30.0 | 21.6 | 19.0 | 31.0 |
| 2 | DeepSeek-VL2-Small | DeepSeek | - | 2024-12 | 24.0 | 57.8 | 18.2 | 29.3 | 26.4 | 31.0 | 22.0 | 14.7 | 31.4 |
| 3 | DeepSeek-VL2-Tiny | DeepSeek | - | 2024-12 | 20.0 | 57.5 | 22.5 | 25.8 | 22.5 | 27.1 | 18.7 | 14.8 | 27.5 |
| 4 | LLaVA-OneVision-72B | Llava Hugging Face | 72B | 2024-09 | 27.8 | 55.9 | 16.2 | 33.2 | 30.3 | 37.9 | 24.0 | 20.7 | 35.5 |
| 5 | LLaVA-OneVision-7B | Llava Hugging Face | 7B | 2024-09 | 19.6 | 33.1 | 47.3 | 37.2 | 25.7 | 29.7 | 20.0 | 20.0 | 32.3 |
| 6 | LLaVA-OneVision-0.5B | Llava Hugging Face | 0.5B | 2024-09 | 16.5 | 58.6 | 24.9 | 22.0 | 18.8 | 26.9 | 12.3 | 14.0 | 23.4 |
| 7 | Qwen2.5-VL-72B | Alibaba | 72B | 2025-02 | 36.1 | 31.6 | 32.3 | 53.3 | 43.1 | 46.4 | 37.9 | 27.0 | 51.8 |
| 8 | Qwen2.5-VL-7B | Alibaba | 7B | 2025-02 | 33.0 | 45.8 | 21.2 | 41.9 | 36.9 | 38.2 | 35.3 | 23.1 | 41.2 |
| 9 | Qwen2.5-VL-3B | Alibaba | 3B | 2025-02 | 28.0 | 52.2 | 19.9 | 34.9 | 31.0 | 32.3 | 29.3 | 18.6 | 35.1 |
| 10 | Qwen2-VL-72B | Alibaba | 72B | 2024-08 | 28.0 | 40.1 | 31.9 | 41.1 | 33.3 | 35.3 | 28.7 | 25.0 | 40.4 |
| 11 | Qwen2-VL-7B | Alibaba | 7B | 2024-08 | 26.3 | 40.6 | 33.1 | 39.3 | 31.5 | 36.8 | 28.3 | 23.0 | 33.5 |
| 12 | Qwen2-VL-2B | Alibaba | 2B | 2024-08 | 26.6 | 47.0 | 26.4 | 36.2 | 30.7 | 35.0 | 27.4 | 19.6 | 34.2 |
| 13 | InternVL2.5-78B | Shanghai AI Lab | 78B | 2024-11 | 31.2 | 53.7 | 15.1 | 36.8 | 33.8 | 38.7 | 28.2 | 24.6 | 40.3 |
| 14 | InternVL2.5-38B | Shanghai AI Lab | 38B | 2024-11 | 29.3 | 51.1 | 19.6 | 36.4 | 32.4 | 38.5 | 27.2 | 21.2 | 37.5 |
| 15 | InternVL2.5-26B | Shanghai AI Lab | 26B | 2024-11 | 28.0 | 50.5 | 21.5 | 35.7 | 31.4 | 35.9 | 27.6 | 19.9 | 35.2 |
| 16 | InternVL2.5-8B | Shanghai AI Lab | 8B | 2024-11 | 22.1 | 64.0 | 13.9 | 25.6 | 23.7 | 26.9 | 19.1 | 13.8 | 30.1 |
| 17 | InternVL2.5-4B | Shanghai AI Lab | 4B | 2024-11 | 21.2 | 64.0 | 14.8 | 24.9 | 22.9 | 28.1 | 17.5 | 17.2 | 28.2 |
| 18 | InternVL2.5-2B | Shanghai AI Lab | 2B | 2024-11 | 16.7 | 65.6 | 17.7 | 20.3 | 18.3 | 22.5 | 13.9 | 12.6 | 22.7 |
| 19 | InternVL2.5-1B | Shanghai AI Lab | 1B | 2024-11 | 15.7 | 60.1 | 24.2 | 20.7 | 17.8 | 24.3 | 13.2 | 10.6 | 21.1 |
| 20 | LLaVA-NeXT-Video-34B | Llava Hugging Face | 34B | 2024-06 | 16.1 | 61.3 | 22.6 | 20.8 | 18.1 | 25.3 | 11.3 | 14.5 | 23.4 |
| 21 | LLaVA-NeXT-Video-7B | Llava Hugging Face | 7B | 2024-06 | 10.9 | 43.0 | 46.1 | 20.3 | 14.2 | 19.6 | 9.2 | 9.1 | 19.7 |
| 22 | ST-LLM | Peking University | 7B | 2024-03 | 26.6 | 59.9 | 13.6 | 30.7 | 28.5 | 31.5 | 23.4 | 18.8 | 35.8 |
| 23 | Chat-UniVi | Peking University | 7B | 2023-11 | 8.5 | 58.5 | 33.0 | 12.6 | 10.1 | 11.1 | 8.4 | 5.6 | 13.1 |
| 24 | PPLLaVA-Qwen | Peking University | 7B | 2024-10 | 20.1 | 48.8 | 31.2 | 29.2 | 23.8 | 26.2 | 17.8 | 14.4 | 33.1 |
| 25 | PPLLaVA-Vicuna | Peking University | 7B | 2024-10 | 10.0 | 41.4 | 48.6 | 19.4 | 13.2 | 15.4 | 6.6 | 14.0 | 22.4 |
| 26 | VideoLLaMA3 | DAMO | 7B | 2025-01 | 25.3 | 60.8 | 13.9 | 29.4 | 27.2 | 36.9 | 18.4 | 20.0 | 34.5 |
| 27 | Video-LLaVA | Peking University | 7B | 2023-11 | 15.6 | 64.6 | 19.8 | 19.5 | 17.3 | 23.9 | 9.7 | 11.0 | 24.2 |
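The sorting note at the top refers to the interactive page; this static copy cannot be re-sorted by clicking. A minimal pandas sketch is shown below, assuming the table above has been exported to a hypothetical `leaderboard.csv` with one column per metric:

```python
import pandas as pd

# Hypothetical CSV export of the leaderboard above, with columns such as
# Model, CO, IN, NA, CGA, F-score, ENG, NAT, SCI, SAC.
df = pd.read_csv("leaderboard.csv")

# Default view: highest F-score first.
print(df.sort_values("F-score", ascending=False).head())

# IN and NA are lower-is-better metrics, so sort those ascending.
print(df.sort_values("IN", ascending=True).head())
```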
The taxonomy of the Video SimpleQA benchmark: (a) video distribution at the secondary level; (b) question type distribution; (c) key statistics of Video SimpleQA.
Comparison with existing video benchmarks in terms of video domain, knowledge-driven focus, emphasis on factuality, and provision of supporting evidence.
An overview of the construction pipeline of Video SimpleQA, including video & encyclopedia collection, QA annotation, and quality control.
Evaluation results (%) of open-source and proprietary multi-modal LLMs on Video SimpleQA. For metrics, CO, NA, IN, and CGA denote "Correct", "Not attempted", "Incorrect", and "Correct given attempted", respectively. For subtopics, ENG, NAT, SCI, and SAC represent "Engineering", "Nature", "Science", and "Society and Culture".
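The derived metrics can be recomputed from the three base rates. Below is a minimal sketch, assuming Video SimpleQA follows the original SimpleQA conventions (CGA is the correct rate over attempted questions, and the F-score is the harmonic mean of CO and CGA); the reported numbers are consistent with this, e.g. for o1-preview, 2 × 47.1 × 57.1 / (47.1 + 57.1) ≈ 51.6, which matches its F-score.

```python
def derived_metrics(co: float, inc: float, na: float) -> dict:
    """Recompute CGA and F-score from the three base rates (all in %).

    Assumed conventions (mirroring the original SimpleQA benchmark;
    not stated explicitly on this page):
      - co, inc, na are the Correct / Incorrect / Not-attempted
        percentages and should sum to roughly 100.
      - CGA = correct rate among attempted questions = co / (co + inc).
      - F-score = harmonic mean of co and CGA.
    """
    assert abs(co + inc + na - 100.0) < 1.0, "base rates should sum to ~100%"
    attempted = co + inc
    cga = 100.0 * co / attempted if attempted else 0.0
    f_score = 2.0 * co * cga / (co + cga) if co + cga else 0.0
    return {"CGA": round(cga, 1), "F-score": round(f_score, 1)}

# Check against the o1-preview row (CO=47.1, IN=35.3, NA=17.6); the table
# reports CGA=57.1 and F-score=51.6. The tiny CGA difference is rounding,
# since the published table is presumably computed from unrounded counts.
print(derived_metrics(47.1, 35.3, 17.6))  # {'CGA': 57.2, 'F-score': 51.6}
```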